Bad habits to kick: relying on undocumented behavior
In my last post in this series, I talked about the common habit of creating an IDENTITY column on every single table. Today I want to talk about a broader concept: relying on undocumented (and therefore probably undefined, and certainly far from guaranteed) behavior and objects.
ORDER BY in a view
This is probably the most infamous of all SQL Server undocumented behaviors. In SQL Server 2000, users learned to create views with built-in ordering, like the following:
CREATE VIEW dbo.MyView
AS
    SELECT TOP 100 PERCENT a, b, c
    FROM dbo.MyTable
    ORDER BY c;
The problem is, they have assumed that if they then issue…
SELECT * FROM dbo.MyView;
…the results are guaranteed to come back ordered by c. In reality, the order of the results of such a query is arbitrary, since the behavior is undefined. Without an ORDER BY on the outer query, SQL Server is free to return the results in any order it chooses (more about this later). To dissect the syntax in the view a bit: this circumstance is a symptom of a poor design choice in the T-SQL dialect. Because of the way TOP was implemented, ORDER BY can now serve two functions: to determine which rows are included in the result, and to indicate the order of the result. This is why you sometimes see the following logic to get the top 10 finishers of a race, with 10th place listed first (well, let's ignore ranking functions and the OVER() clause for now):
SELECT RunnerID, [time_in_seconds]
FROM
(
    SELECT TOP (10) RunnerID, [time_in_seconds]
    FROM dbo.RaceResults
    WHERE RaceID = 15
    ORDER BY [time_in_seconds]
) AS x
ORDER BY [time_in_seconds] DESC;
Going back to the view, the TOP 100 PERCENT is required because the ORDER BY is only supposed to determine the rows included in TOP, not to guarantee the order of the results. The syntax for a view allows an ORDER BY clause *only* if you also include a TOP clause. In the SQL Server 2000 optimizer, it just so happened that if you issued a SELECT against the view without an ORDER BY, you would in fact get the results in the order defined by the ORDER BY within the view. In SQL Server 2005, changes to the optimizer broke this behavior, and a large segment of the SQL Server community considered this a bug, even though the behavior people were relying on in previous versions was never guaranteed; it was more or less a coincidence. One rather large customer of Microsoft, with plenty of leverage against the SQL Server team, made such a fuss about this change in behavior that a trace flag was introduced to instruct the optimizer to obey the ORDER BY in the view when returning rows. (You can get a lot more information about this trace flag in KB #926292.)

Should you use this trace flag? In my opinion, no; I think the squeaky wheel was just too lazy to fix their application code so that it properly dictated the order in which it wanted its results. Should you be aware of what can happen when you rely on undocumented behavior, and that behavior changes? Absolutely.
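If you want ordered results from a view, the reliable pattern is to leave the ORDER BY out of the view entirely and apply it on the query that consumes the view. A minimal sketch (the view and column names follow the earlier example):

```sql
-- The documented, guaranteed way to get ordered results from a view:
-- the ORDER BY belongs on the outer query, not inside the view.
CREATE VIEW dbo.MyView
AS
    SELECT a, b, c
    FROM dbo.MyTable;
GO

SELECT a, b, c
FROM dbo.MyView
ORDER BY c;
```

This works on every version of SQL Server, needs no trace flags, and survives optimizer changes.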
Ordering without an ORDER BY
Aside from the view case above, there is a more general myth out there that SQL Server will always return rows in a specific order – the myth varies from person to person, but it is either the order of the clustered index, or the order of the IDENTITY column, or the chronological order of insertion. (Similarly, the assumption is that using TOP without an ORDER BY should have a predictable meaning; as with ordering results, it does not.) Often people come to this conclusion because in a limited sample size, that's the behavior they "always" see. I want to make it quite clear: ordering is arbitrary unless you use an ORDER BY clause. You should never, ever, ever rely on the ordering you observe in a query without an ORDER BY — and you should only issue a query without an ORDER BY clause if you truly do not care what order the results come back. In such a case, you may as well imagine that the rows are going to come back in a different, random order each time, even though that is not truly the case (random has a meaning completely separate from arbitrary, but like I said, just imagine).
As I mentioned earlier, in the case where you haven't told SQL Server how you want the results ordered, the optimizer is free to find the most efficient way to return the results to you, which can mean any ordering whatsoever. Typically you will see a predictable ordering, but it might not be the one you expect (for example, it may happen to scan a non-clustered index). And even if it is currently the one you expect, the next time you run the query, the choice the optimizer has made can change due to a variety of events, including but not limited to: statistics updates, data changes, statement or module recompiles, query/table hints, FORCEPLAN, plans dropping out of the plan cache, service packs, cumulative updates, engine upgrades, and even detach/attach or backup/restore to a different server at the same build level. So quite truthfully, you could run such a query right now, and 5 minutes later run it again, and get the results in a completely different order. This is the quintessential definition of relying on undocumented and undefined behavior, and I strongly recommend you stay away from it. If you want some proof, go and check out Alexander Kuznetsov's post in his Defensive Programming series, entitled "Without ORDER BY, there is no default sort order."
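To see how easily the "default order" can shift, consider a hypothetical table (the names below are mine, purely for illustration) where the optimizer can satisfy the same query from more than one index:

```sql
-- Two indexes, two possible "natural" orders for the same query.
CREATE TABLE dbo.OrderDemo
(
    ID       INT IDENTITY(1,1) PRIMARY KEY, -- clustered by default
    SomeDate DATETIME NOT NULL
);

CREATE NONCLUSTERED INDEX IX_OrderDemo_SomeDate
    ON dbo.OrderDemo(SomeDate);

-- The optimizer may scan the clustered index OR the narrower
-- nonclustered index, so the observed order is arbitrary:
SELECT SomeDate FROM dbo.OrderDemo;

-- The only way to actually guarantee an order:
SELECT SomeDate FROM dbo.OrderDemo ORDER BY SomeDate;
```

The first SELECT will often appear to come back "in order," right up until a statistics update or a plan change picks the other index.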
Short circuiting

I see a lot of cases where people expect SQL Server to short-circuit their clauses: read the conditions from left to right (or top to bottom), and when any of the criteria returns false, don't bother evaluating any subsequent conditions. An example is where you have chosen to store DATETIME data in a VARCHAR column, then try to evaluate some DATETIME aspect, but only after determining which values really are dates (this pattern of poor data type choices is another really common bad habit, by the way). So you see queries like this:
CREATE TABLE dbo.foo(bar VARCHAR(20));

INSERT dbo.foo(bar)
    SELECT '20090201'
    UNION ALL
    SELECT 'last week'
    UNION ALL
    SELECT 'next Tuesday';

SELECT bar
FROM dbo.foo
WHERE ISDATE(bar) = 1
AND DATEPART(MONTH, bar) = 2;

DROP TABLE dbo.foo;
Msg 241, Level 16, State 1, Line 6
Conversion failed when converting date and/or time from character string.
The truth is, SQL Server is free to evaluate the conditions in any order, so it doesn't necessarily check the result of ISDATE() before trying to treat the data like DATETIME values. And as with the ORDER BY cases above, just because you observe a specific behavior today does not mean you will see the same behavior tomorrow; many things can change the way SQL Server processes a query. So how do you get around it? There are a few ways. The most obvious one in this specific case is to use a DATETIME column to store DATETIME data. But this isn't always possible, and it is not always the issue in this scenario anyway; it was just the handiest example I could come up with. The next thing people try is a sub-query that first returns only the rows containing DATETIME values:
SELECT bar
FROM
(
    SELECT bar
    FROM dbo.foo
    WHERE ISDATE(bar) = 1
) AS x
WHERE DATEPART(MONTH, bar) = 2;
The result: the same error (and the same happens with a CTE). It's like the optimizer is reading your mind and knows you're trying to trick it: it still sees that there are rows in the table that would not pass the DATEPART check, and it is still free to evaluate the conditions in whatever order it chooses.
So how do we get this "short circuiting" to really work? The CASE expression is your friend (and thanks Alex for the improved syntax suggestion):
SELECT bar
FROM dbo.foo
WHERE CASE WHEN ISDATE(bar) = 1
           THEN DATEPART(MONTH, bar)
      END = 2;
I'm sure I've seen this solved in other clever ways before, and that someone will remind me how to really make this short circuit, but on a Monday night, this is the only workaround I know is guaranteed to work. In some cases, SQL Server *will* short-circuit (see the comments from Nigel Ellis about 2/3 of the way through this TechNet chat transcript), and you can see some working examples from Mark Cohen. My point, though, remains: feel free to take advantage of these optimizations, but do not rely on them.
sp_who2

You're probably thinking, wow, that's a strange one to mention. sp_who2 is undocumented and unsupported? Absolutely. Try to find sp_who2 in any current version of Books Online. I reminded them that there was a reference to sp_who2 in the 2005 version of Books Online in Connect #207997, back in September of 2006; shortly thereafter, the last remaining trace of this procedure was removed. I also asked for a supported and documented version in March of 2007, in Connect #264681. Don't hold your breath, because all sp_* procedure development has been abandoned (which is why a lot of the sp_help-style procedures haven't been updated to reflect new 2008 features). I am also not trying to scare you into believing that sp_who2 will suddenly stop working in the next release or service pack; I think it is pretty much here to stay. But with the advancements in performance tuning and troubleshooting, I question whether it is of much value these days at all. There are many alternatives. Adam Machanic has written a great replacement (sp_WhoIsActive) with more flexibility and much more useful information. I also documented a few approaches to querying the DMVs for certain information in Chapter 29 of SQL Server MVP Deep Dives, entitled "My Favorite DMVs and Why."
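As a starting point for replacing sp_who2, here is a minimal sketch against the documented DMVs; the column list is illustrative, not a full replacement, and both views ship with SQL Server 2005 and later:

```sql
-- Who is connected, and what are they running? (documented DMVs)
SELECT
    s.session_id,
    s.login_name,
    s.host_name,
    s.[status],
    r.command,
    r.wait_type,
    DB_NAME(r.database_id) AS [database]
FROM sys.dm_exec_sessions AS s
LEFT OUTER JOIN sys.dm_exec_requests AS r
    ON s.session_id = r.session_id
WHERE s.is_user_process = 1;
```

Unlike sp_who2, every column here is documented, and you can extend the query with whatever you actually need (waits, blocking, query text, and so on).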
sp_MSforeachdb

This system procedure creates a cursor, loops through all of your databases, and performs some action against each of them (for some background, see this Database Journal article). On one of my systems (2005 SP3), I was able to reproduce, many times, a scenario where a random number of rows would come back from sp_MSforeachdb. It seemed to occur more frequently during our peak load times, but nonetheless, it got me spooked. I looked at what the procedure is doing and, not surprisingly, it is actually pretty simple if you discard all of the security stuff (you should only be running this kind of code for admin / maintenance functionality anyway, IMHO). I was surprised, even in 2008 R2, to still see this code in the procedure:
EXEC(N'declare hCForEachDatabase cursor global for select name from master.dbo.sysdatabases ...');
…the use of dbo.sysdatabases, as opposed to sys.databases, serves as further proof that there is no investment whatsoever in updating sp_* procedures to take advantage of new features and catalog views as SQL Server matures. Anyway, I like to write my own versions of these procedures. Not only does this let me avoid a flaky and quite expensive global cursor, and future-proof me against changes in their behavior (or deprecation, which technically doesn't need to be announced), but it also allows me to add flexibility, such as only executing code against databases with a certain naming pattern or in a certain recovery model, and only attempting to run on databases that are online and aren't read-only. For example:
CREATE PROCEDURE dbo.ForEachDB_MyWay
    @cmd            NVARCHAR(MAX),
    @name_pattern   NVARCHAR(257) = '%',
    @recovery_model NVARCHAR(60)  = NULL
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE
        @sql NVARCHAR(MAX),
        @db  NVARCHAR(257);

    DECLARE @c CURSOR;
    SET @c = CURSOR LOCAL FORWARD_ONLY STATIC READ_ONLY FOR
        SELECT QUOTENAME([name])
        FROM sys.databases
        WHERE (@recovery_model IS NULL
               OR recovery_model_desc = @recovery_model)
        AND [name] LIKE @name_pattern
        AND [state] = 0
        AND [is_read_only] = 0
        ORDER BY [name];

    OPEN @c;
    FETCH NEXT FROM @c INTO @db;

    WHILE @@FETCH_STATUS <> -1
    BEGIN
        SET @sql = REPLACE(@cmd, '?', @db);
        BEGIN TRY
            EXEC(@sql);
        END TRY
        BEGIN CATCH
            -- I'll leave more advanced error handling as an exercise:
            PRINT ERROR_MESSAGE();
        END CATCH
        FETCH NEXT FROM @c INTO @db;
    END

    CLOSE @c;
    DEALLOCATE @c;
END
GO

-- simple example: print names of all DBs in FULL recovery:
EXEC dbo.ForEachDB_MyWay
    @cmd = N'PRINT ''?'';',
    @recovery_model = N'FULL';
You could also do other interesting things with this construct, such as filtering on databases where log_reuse_wait = 0, or is_broker_enabled = 1, or where it is not a system database (database_id > 4). You could also dictate an ORDER BY so that the databases you deem most important can appear at the top or the bottom of the queue, so to speak. But most importantly, you'll be able to keep augmenting your own version of the procedure, as SQL Server adds more features and as you find new ways to want to work on sets of databases.
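For example, using the procedure above (and recalling that its '?' token is replaced with the QUOTENAME'd database name), you could run a consistency check against every online, writable database matching a name pattern; the 'Sales%' pattern here is purely illustrative:

```sql
-- Hypothetical usage: DBCC CHECKDB against every online, writable
-- database whose name starts with 'Sales':
EXEC dbo.ForEachDB_MyWay
    @cmd          = N'DBCC CHECKDB(?) WITH NO_INFOMSGS;',
    @name_pattern = N'Sales%';
```

Because the filtering happens in the cursor's SELECT against sys.databases, offline or read-only databases are never even attempted, which is exactly the kind of control sp_MSforeachdb doesn't give you.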
I asked for a supported and documented version of this stored procedure in Connect #264677, so that hundreds of DBAs out there don't have to develop their own procedure as above. But it was promptly closed as "Won't Fix."
(Of course the same kind of logic can hold true for a similar undocumented stored procedure, sp_MSforeachtable.)
xp_fileexist / xp_getfiledetails
The writing is on the wall: the days of extended procedures are numbered, especially those that are undocumented. Writing code today that relies on these modules being available in future versions of SQL Server is a little like Russian roulette. The last thing you want to do during a SQL Server upgrade is worry about breaking code that accesses the file system. Thankfully, there are many tutorials out there that will help you bring CLR to the rescue, in the event that you really need to perform file handling from within SQL Server. Greg Larsen does a pretty good job of priming you to write your own xp_getfiledetails.
Sadly, I am just skimming the surface on features and behaviors that are undocumented and/or undefined. There are literally dozens and dozens of undocumented stored procedures and extended procedures, and many engine behaviors that we all take for granted. There are probably several behaviors I rely on every day that I don't realize are undocumented… it is definitely a difficult habit to avoid altogether. But the realization that you *might* be using undocumented methods, or making assumptions based on observed and not guaranteed behavior, is often a good chunk of the battle.