Bad habits to kick : mis-handling date / range queries
In my last post in this series, I talked about using user-defined data types (alias types). Today I wanted to discuss many of the ways in which people subject their date and time columns to very inappropriate query methodologies.
It's very easy to say, "Hey, don't do the wrong thing!" Not so easy to actually accomplish, right? In general, yes, I agree. But I see such frequent abuse of DATETIME columns in range queries that I felt it deserved some treatment.
The long, long, long laundry list of offenses (apologies in advance)
The most frequent faux pas I see is when someone uses regional date formats. For example, they want all the rows from a particular day. First they try:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn = '10/11/2009';
The first problem there is, what if the system has British regional settings or the language is set to French? Is that October 11th or November 10th? I wrote the query, and I don't even know! It would be a shame to pull data from the wrong month, and not even notice. Much better to use an unambiguous date format; in spite of what –CELKO– will try to force you to believe, the only truly safe formats for date/time literals in SQL Server, at least for DATETIME and SMALLDATETIME, are:
(If you are using the newer types introduced in SQL Server 2008, there is more precision allowed (.nnnnnnn) and also if you are using timezones you can say +/-hh:mm or Z.)
As an example, even if you try to use the seemingly unambiguous YYYY-MM-DD, this can break under certain scenarios — such as when the user's language settings are set to French:
SET LANGUAGE FRENCH; GO SELECT CONVERT(DATETIME, '2009-10-13');
Le paramètre de langue est passé à Français. Msg 242, Level 16, State 3, Line 1 La conversion d'un type de données varchar en type de données datetime a créé une valeur hors limites.
For those of you not fluent in Français, that essentially says (in my best Quebec accent), "There is no month 13, dummy!" This is because in French that date format is interpreted as YYYY-DD-MM. (For some background on the attempts we've made to deprecate this interpretation, see Connect #290971.)
As opposed to YYYY-MM-DD, YYYYMMDD will never break. If you decide to use any other format for your date string literals, at least for DATETIME and SMALLDATETIME types, you are leaving yourself open to errors or incorrect data should a user have different session settings, or should the application be moved to servers with different settings. In SQL Server 2008, the new types are a little more insulated from user or machine settings; still, I use YYYYMMDD for consistency and to be safe.
When the user fixes that and passes in a proper string literal format, there is a problem with this query in most situations, since DATETIME and SMALLDATETIME columns have a time component. Unless you always strip out the time when entering data (or use a computed column that does this for you), this query should yield few, if any, rows:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn = '20091011';
This is because the data looks like this:
DateColumn ----------------------- 2009-10-11 00:14:32.577 2009-10-11 04:31:16.465 2009-10-11 08:45:57.714
What the query above is actually asking is:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn = '2009-10-11T00:00:00.000';
So there should be no surprise that no results are returned, since none of the values match that criteria.
What does the user do next? The same thing I did the first time I came across this problem. Convert the left side of the equation to a string, stripping off the time component:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE CONVERT(CHAR(8), DateColumn, 112) = '20091011';
NOW I can get my data, right? Well, yes, you can get your data all right. But now you've effectively eliminated the possibility of SQL Server taking advantage of an index. Since you've forced it to build a nonsargable condition, this means it will have to convert every single value in the table to compare it to the string you've presented on the right hand side. Another approach users take is:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn BETWEEN '20091011' AND '20091012';
Well, this approach is okay, as long as you don't have any rows that fall on midnight at the upper bound – which can be much more common if parts of your application strip time from date/time values. In that case, this query will include data from the next day; not exactly what was intended. In some cases, that *is* what is intended: some people think the above query should return all the rows from October 11th, and also all the rows from October 12th. Remember that this query can be translated to one of the following, without changing the meaning:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn BETWEEN '2009-10-11T00:00:00.000' AND '2009-10-12T00:00:00.000'; -- or SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= '2009-10-11T00:00:00.000' AND DateColumn <= '2009-10-12T00:00:00.000';
(Note that in the second example, that is greater than or equal to the first variable and less than or equal to the second variable.) This means that you will return rows from October 12th at exactly midnight, but not at 1:00 AM, or 4:00 PM, or 11:59 PM.
Then the user tries this, so they can still use BETWEEN and save a few key strokes:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn BETWEEN '20091011' AND '2009-10-11T23:59:59.997'; -- or SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn BETWEEN '20091011' AND DATEADD(SECOND, -1, '20091012');
These are no good either. If the data type of the column is SMALLDATETIME, the comparison is going to round up, and you *still* might include data from the next day. For the second version, if the data type of the column is DATETIME, there is still the possibility that you are going to miss rows that have a time stamp between 11:59:59 PM and 11:59:59.997 PM. Probably not many, but if there is even one, your data is no longer accurate.
(Note that if you are using the DATE data type in SQL Server 2008, or can guarantee that you always remove the time component from the column, BETWEEN is okay. But for consistency, I still stay away from BETWEEN.)
Another thing I see a lot is when people want a range like a month or a year. Can you believe that people write code like this:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn LIKE '200910%';
The problem with this is that, even while SQL Server will implicitly convert DateColumn to a string for you, it does *not* convert it to CHAR(8) with style 112, which would be required for this wildcard search to work. (You can see what it will do "for you" when you try PRINT CURRENT_TIMESTAMP;.) So maybe they meant to do it this way:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE CONVERT(CHAR(8), DateColumn, 112) LIKE '200910%';
But this is still a bad idea because, like above, this creates a nonsargable condition, and prevents an index on DateColumn from being utilized. And finally, how about this one:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DATEPART(YEAR, DateColumn) = 2009 AND DATEPART(MONTH, DateColumn) = 10;
This looks more like something you would see in OLAP, where you actually have measures and dimensions that will allow you to query the data this way – efficiently. In the OLTP world, this type of nonsargable query is not going to perform any better than any of the others above, and it makes parameter passing and validation more complex as well (imagine the leap year validation you'd require for a date passed in as year, month, day when the date is February 29th).
I'm not making *ANY* of these up; I have seen them all out there in the wild, either in code I've reviewed, systems I've inherited, or questions I've seen on the newsgroups or forums.
The best approach, IMHO
In order to make best possible use of indexes, and to avoid capturing too few or too many rows, the best possible way to achieve the above query is:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= '20091011' AND DateColumn < '20091012';
Hopefully the queries are not being written this way, and the data is actually passed to the statement as a properly typed variable. When you can help SQL Server avoid implicit conversions, you should do so. If you are intending to allow just one day at a time in your query, you could write a stored procedure like this:
CREATE PROCEDURE dbo.GetLogCountByDay @date SMALLDATETIME AS BEGIN SET NOCOUNT ON; SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= @date AND DateColumn < DATEADD(DAY, 1, @date); END GO EXEC dbo.GetLogCountByDay @date = '20091011';
Why don't I use < (@date + 1) there? To enforce a best practice. I'll admit, I've used the lazy DATEADD shorthand for years. However, I now consider that a bad habit too, as it breaks with the new DATE data types in SQL Server 2008:
DECLARE @d DATETIME2(7) = SYSDATETIME(); SELECT @d + 1;
Msg 206, Level 16, State 2, Line 3 Operand type clash: datetime2 is incompatible with int
If you want to support a range of dates, then the change is minor:
CREATE PROCEDURE dbo.GetLogCountByDateRange @StartDate SMALLDATETIME, @EndDate SMALLDATETIME AS BEGIN SET NOCOUNT ON; SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= @StartDate AND DateColumn < DATEADD(DAY, 1, @EndDate); END GO EXEC dbo.GetLogCountByDateRange @StartDate = '20091011', @EndDate = '20091015';
And if you wanted to return the counts for each day, you could do this (assuming you have a Numbers table that starts at 1):
CREATE PROCEDURE dbo.GetDailyLogCountByMonth @Month SMALLDATETIME AS BEGIN SET NOCOUNT ON; WITH [days] AS ( SELECT [day] = DATEADD(DAY, [Number]-1, @Month) FROM dbo.Numbers WHERE [Number] <= DATEDIFF(DAY, @Month, DATEADD(MONTH, 1, @Month)) ) SELECT d.[day], COUNT(t.DateColumn) FROM [days] AS d INNER JOIN dbo.SomeLogTable AS t ON t.DateColumn >= d.[day] AND t.DateColumn < DATEADD(DAY, 1, d.[day]) GROUP BY d.[day] ORDER BY d.[day]; END GO EXEC dbo.GetDailyLogCountByMonth @Month = '20091001';
Let's compare a couple of these approaches. First, we need to build a table and some procedures (this looks like a LOT of code, but it took about 6 seconds to create on my VM):
CREATE DATABASE DateTesting; GO USE DateTesting; GO CREATE TABLE dbo.SomeLogTable ( DateColumn DATETIME ); GO CREATE CLUSTERED INDEX x ON dbo.SomeLogTable(DateColumn); GO SET NOCOUNT ON; -- populate a numbers table with 500K rows: DECLARE @UpperLimit INT; SET @UpperLimit = 500000; WITH n AS ( SELECT x = ROW_NUMBER() OVER (ORDER BY s1.[object_id]) FROM sys.objects AS s1 CROSS JOIN sys.objects AS s2 CROSS JOIN sys.objects AS s3 CROSS JOIN sys.objects AS s4 ) SELECT [Number] = x INTO dbo.Numbers FROM n WHERE x BETWEEN 1 AND @UpperLimit; GO CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]); GO -- get 500K pretty evenly distributed rows into the log table: INSERT dbo.SomeLogTable(DateColumn) SELECT DATEADD(SECOND, -[Number], DATEADD(MINUTE, ([Number]), '20090901')) FROM dbo.Numbers; GO -- create a procedure for getting a day's log count -- good way: GO CREATE PROCEDURE dbo.Good_LogCountByDay @date SMALLDATETIME AS BEGIN SET NOCOUNT ON; DECLARE @c INT; SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= @date AND DateColumn < DATEADD(DAY, 1, @date); END GO -- bad way #1: GO CREATE PROCEDURE dbo.Bad_LogCountByDay_1 @date SMALLDATETIME AS BEGIN SET NOCOUNT ON; DECLARE @c INT; SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE CONVERT(CHAR(8), DateColumn, 112) = @date; END GO -- bad way #2: GO CREATE PROCEDURE dbo.Bad_LogCountByDay_2 @year INT, @month INT, @day INT AS BEGIN SET NOCOUNT ON; DECLARE @c INT; SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE DATEPART(YEAR, DateColumn) = @year AND DATEPART(MONTH, DateColumn) = @month AND DATEPART(DAY, DateColumn) = @day; END GO -- create procedures for getting a month's log count -- good way: GO CREATE PROCEDURE dbo.Good_LogCountByMonth @Month SMALLDATETIME AS BEGIN SET NOCOUNT ON; DECLARE @c INT; SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= @Month AND DateColumn < DATEADD(MONTH, 1, @Month); END GO -- bad way #1: GO CREATE PROCEDURE dbo.Bad_LogCountByMonth_1 @Month SMALLDATETIME AS BEGIN SET NOCOUNT ON; DECLARE @c INT; SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE CONVERT(CHAR(8), DateColumn, 112) LIKE CONVERT(CHAR(6), @Month, 112) + '%'; END GO -- bad way #2: GO CREATE PROCEDURE dbo.Bad_LogCountByMonth_2 @year INT, @month INT AS BEGIN SET NOCOUNT ON; DECLARE @c INT; SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE DATEPART(YEAR, DateColumn) = @year AND DATEPART(MONTH, DateColumn) = @month; END GO
Just by looking at it, you probably have a good idea how this going to end. All the same, we can test each set of stored procedures in two different ways:
(a) Getting the data for a single day
- First, let's just do a one-to-one-to-one comparison of the execution plan, just to see what we get:
EXEC dbo.Good_LogCountByDay @date = '20091005'; EXEC dbo.Bad_LogCountByDay_1 @date = '20091005'; EXEC dbo.Bad_LogCountByDay_2 @year = 2009, @month = 10, @day = 5;
As expected, the "Good" version of the procedure has a far more favorable plan, using a clustered index seek as opposed to a clustered index scan. Here is how the plans compare (click to enlarge):
- In case the differences in the plan do not highlight the performance implications, let's run each procedure 1000 times, to see how long it takes. Remember to turn off the "Include Actual Execution Plan" option!
SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Good_LogCountByDay @date = '20091005'; GO 1000 SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Bad_LogCountByDay_1 @date = '20091005'; GO 1000 SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Bad_LogCountByDay_2 @year = 2009, @month = 10, @day = 5; GO 1000 SELECT CURRENT_TIMESTAMP; GO
2009-10-16 12:05:06.123 2009-10-16 12:05:07.063 (~1 second) 2009-10-16 12:11:09.650 (~6 minutes, 2 seconds) 2009-10-16 12:12:46.197 (~1 minute, 46 seconds)
So, clearly the date range query is far superior to the other two. And while the execution plans for the two "bad" versions of the procedure showed that their costs should be roughly equivalent, in reality it turns out that the procedure that handles the CONVERT() on the left-hand side is far more costly, at least in terms of duration, than the version that uses DATEPART() to extract the year, month and day.
(b) Getting the data for a month
- Let's turn "Include Actual Execution Plan" back on, and compare the plans for the next set of procedures:
EXEC dbo.Good_LogCountByMonth @month = '20091001'; EXEC dbo.Bad_LogCountByMonth_1 @month = '20091001'; EXEC dbo.Bad_LogCountByMonth_2 @year = 2009, @month = 10;
We see a very similar result to the above, where the "good" procedure uses a clustered index seek, and the "bad" procedures use a scan (click to enlarge):
- Now, let's try these procedures 1000 times each, and measure how long they take (again, you don't want to run these loops with execution plan enabled):
SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Good_LogCountByMonth @month = '20091001'; GO 1000 SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Bad_LogCountByMonth_1 @month = '20091001'; GO 1000 SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Bad_LogCountByMonth_2 @year = 2009, @month = 10; GO 1000 SELECT CURRENT_TIMESTAMP; GO
2009-10-16 12:16:59.727 2009-10-16 12:17:04.383 (~5 seconds) 2009-10-16 12:21:40.640 (~4 minutes, 36 seconds) 2009-10-16 12:23:13.950 (~1 minute, 33 seconds)
Again we see that the date range query performs quite well compared to the other two, and that the CONVERT() version takes far longer to complete than the DATEPART() version. I guess if you are going to continue to use a "bad" approach, you can at least easily determine which is the lesser of two evils. 🙂
It is not surprising that the performance aspect of the "good" approach shows significant improvement over the nonsargable versions. I could probably also demonstrate cases where you accidentally retrieve too few rows, or too many rows — but this article seems to be getting a little long already, so I'll leave the data correctness discussion for another day.
Don't forget to clean up:
USE [master]; GO DROP DATABASE DateTesting; GO
A few other tidbits
As an aside, if you only want whole dates, make sure your input validation is functional and that users know what format to enter. Nothing can go right if you let users enter freeform dates and some of them enter d/m/y and others enter m/d/y. Safest to use a calendar control / date picker, then you can dictate exactly what the format is. And to be safe, sanitize the input by converting it to midnight, e.g.:
SET @DateInput = DATEDIFF(DAY, 0, @DateInput);
If you want more control over beginning and end ranges (let's say for a month-based procedure, where you want to report on whole months), you can do this:
SELECT @StartDate = DATEADD(DAY, 1-DAY(@d), DATEDIFF(DAY, 0, @StartDate)), @EndDate = DATEADD(MONTH, 1, @StartDate);
Finally, I have seen stored procedures where DATETIME values are passed in as CHAR(8) or CHAR(10). Don't pass a date into a stored procedure using a string-based parameter: always use properly typed parameters. If your client-side validation is broken or being bypassed, this can cause problems you can stomp further up the chain by using the correct data type in the first place.
The main take-away points I was trying to get across in this post are:
- avoid ambiguous formats for date-only literals;
- avoid BETWEEN for range queries against DATETIME, SMALLDATETIME, DATETIME2, and DATETIMEOFFSET;
- avoid calculations on the left-hand side of the WHERE clause; and,
- avoid treating dates like strings.
For a lot more helpful information on date and time, see Tibor Karazsi's article, "The ultimate guide to the datetime datatypes."
Over the past couple of weeks, I have developed a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code. I have several more ideas in development, and I'll gladly take suggestions for future articles, but this should be the last post in the series for at least a few days. I hope the series has been interesting.