Bad habits to kick : choosing the wrong data type
In my last post in this series, I talked about the assumptions many people make about IDENTITY columns. In this post, I want to focus on choosing data types. There are several areas where I see frequent mistakes in data type choice, and I'll mention a few here.
Using CHAR / VARCHAR / INT for date / time values
Countless times I have seen character-based data types for columns meant to store date/time information, usually to store their preferred format (e.g. d/m/y) or to strip the date or time from the value. Even worse is when they use NCHAR or NVARCHAR when we know that dates are not going to include any Unicode characters. There are several negative consequences to this. Most importantly, you lose proper validation – with a string, anybody can put "12/42/9999" or "36/8/211" or "foo" into that column. Depending on what format you store and how you run conversions, you will lose the ability to perform range queries, and you will have definite losses in the area of index operations. You also lose the ability to reliably perform date math operations like DATEADD(), DATEDIFF() and DATEPART().
You lose less in terms of index performance and storage when you store the date (e.g. in YYYYMMDD format) as an INT, but again you will have to be very creative in how you query the table and what types of parameters you pass in to support range queries. A useful example to point out is how painful it is for us to get useful date/time information out of the sysjobhistory table in msdb.
It is much better to store these values as DATETIME or SMALLDATETIME and either strip off the parts you don't want before insert, or ignore them when you query. You can also use computed columns (or populate them manually) at insert time, and keep both the original value and the "rounded" value. In SQL Server 2008, you can also use the new DATE and TIME data types if you are only interested in one component or the other.
Using DATETIME instead of SMALLDATETIME
If you are only interested in the date portion of the value, or don't need the granularity of time below the minute, use SMALLDATETIME instead of DATETIME. You will save 4 bytes per row and your indexes and range queries will be much more efficient. In SQL Server 2008, you can use DATE in the "date only" case, and save an additional byte. Your users and your SAN administrator will thank you. When I have some time, I will post an article demonstrating some of the gains of using the smaller date/time data types where possible.
Using TIME to store duration
With SQL Server 2008, it may be tempting to store a duration in a column defined with the new TIME data type. This data type is not meant to store duration but actually a point in time. A problem with trying to store duration here is, what happens when your duration exceeds 24 hours? It would be much more useful to store the "duration" in two columns – StartTime and EndTime – which gives you even more information. You can always calculate the duration in whatever unit you want. You could add a third column which can be computed, calculating the duration in minutes or seconds or whatever makes sense as the smallest granularity you may need to calculate. Your choice for the StartTime and EndTime columns could be TIME if you are reasonably confident that durations will always be less than 24 hours and will not cross any midnight boundaries, but more likely they should be SMALLDATETIME, DATETIME or DATETIME2, depending on the granularity you need.
Using NVARCHAR to store data like zip or phone
Once in a while I see zip and phone columns defined as NVARCHAR. I am not sure what phone number can contain Unicode characters, so maybe it is a valid choice for you, but somehow I doubt it. I also often see these defined far wider than they need to be. NVARCHAR(50) for phone? Really, people? This is how you can tell someone created their table either through the SSMS UI (which offers 50 as the default) or from upsizing Access (which converts all text columns to NVARCHAR).
E-mail is an interesting one. We use VARCHAR(320), since the local part (username) is limited to 64 characters, and the domain name is limited to 255 characters. 64 + @ symbol + 255 = 320. In our systems, we do not need to worry about supporting far East e-mail addresses, which can require Unicode characters. We also have yet to be bitten by someone who has a very long domain name *and* insists on using a sub-domain to exceed 255 characters. They would also have to use a very large local part which, I guess, is possible, but very unlikely. When that user comes in and breaks one of our systems, I'll point them to this post. 🙂
Using VARCHAR to store data like name and address
Your applications should be prepared for internationalization, even if right now you are not supporting it at all. You might think that because you only have American customers or users that you don't need to worry about foreign characters, however there are potential embarrassments lurking if you store a user's name with VARCHAR and lose part of the data. In most cases you will eventually need to support Unicode, so you are likely just delaying the inevitable by using VARCHAR now. In SQL Server 2008 R2, you can use data compression to significantly reduce the storage and I/O implications of doing so, since it will essentially treat all non-Unicode characters as if you had used VARCHAR.
Using VARCHAR instead of CHAR
If you know your column is going to store a (relatively) fixed number of bytes, you should consider using CHAR instead of VARCHAR. You may sometimes lose a little bit due to the padded spaces at the end, but you will often gain this back by not having to worry about row overhead for storing the number of characters. The basic rule of thumb I have seen out in the wild is that if you are storing fixed size, or variable size less than 10 or 12 characters, use CHAR. Of course, you need to be sure you understand how the column will be used in the future. If there is any chance that you will later store more data here, you are probably better off using VARCHAR. If memory serves me right, you will have much less pain later if you have to increase the size of the column and rebuild the table, if you are going from VARCHAR(x) -> VARCHAR(x+y), since this doesn't impact existing rows the way it would if you were converting CHAR(x) -> CHAR(x+y) or VARCHAR(x+y).
An example of where using CHAR ended up being a bad choice is this KB article referenced by KBAlertz. This is reading the text from KB #200040, but it should actually be reading the text from KB #2000403. How embarrassing! I'm sure they are working on a fix, but as a commenter pointed out, they probably passed 2000403 into a parameter, and it silently truncated to 200040 (see how silent truncation can be a problem?). Depending on how many KB articles they've loaded in the new 7-digit format, this could take them a while to straighten out, and it will almost certainly be a manual, one-by-one affair.
Using BIT for properties like Status
It's always dangerous to use two-three-valued logic to store a status value. Initially you might always say that something is "on" or "off" – but later, you might need additional states such as "loading" or "archived" or "hidden." Why not use TINYINT to leave room for expansion? You don't lose anything since TINYINT still takes up only 1 byte. You can add a check constraint which will make it easy to define the domain of values without having to change the data type later.
Using INT instead of BIGINT
If you have a large table with an IDENTITY column that holds transaction data, and will be doing so for very high volumes or for a very long time, you might run into a case where you run out of IDENTITY values because you hit the ~2 billion upper bound on the INT data type. As I explained earlier in this series, a common reaction is to "design this better" by starting the IDENTITY seed at the lower bound of INT instead of 1. This doesn't really buy you anything except that it temporarily delays the need to deal with it … you'll still have to fix it eventually, because if you're hitting 2 billion rows now, how long will it be before you hit 4 billion? It is relatively impossible, in our lifetimes, to use up all the positive values that a BIGINT supports, unless you set the increment to some ungodly number. So in cases this like this it can make sense to take the hit now and use BIGINT. If you're using SQL Server 2008, this is another area where data compression can help reduce the impact, since it can compress many of the smaller values.
Using INT instead of SMALLINT or TINYINT
Let's say you have a lookup table for sex_code. Instead of storing M/F/U etc. (or the full words Male/Female/Unknown) in all of the tables, you just want to store a numeric representation (there are industry standards for this). However, there is no reason to use INT here; since there is no impending evolutionary change we can envision that will require the number of options even jumping to 10, never mind > 255. So why not use TINYINT? It won't make a big difference to the lookup table itself, since this is going to be tiny and will likely be in memory anyway. But on the real tables, you're going to save 3 bytes per row in storage, and your reads, joins, seeks and scans are going to be slightly more efficient.
Using FLOAT / REAL instead of DECIMAL / NUMERIC
Since FLOAT and REAL are approximate numbers, you can run into all kinds of presentation / storage issues if you use them inadvertently. They have their place, and if you are using scientific data that requires the properties of floating point numbers, th they are the right choice. But in most cases, I think this is another evolutionary thing that happens accidentally when you upsize from Access. You are much better off using DECIMAL or NUMERIC (which are functionally equivalent) unless you really know that you should be using FLOAT. For some more background information, you can look at this thread from StackOverflow.
Using MONEY / SMALLMONEY instead of DECIMAL / NUMERIC
I see no gain in using the MONEY or SMALLMONEY types. One benefit that a few people have related to me is that it describes *exactly* what is in the column. This isn't just a number, it's CASH! Like I would look at a column called "Salary" or "RetailPrice" and assume it was storing peanuts or cheese doodles. The problem with these types is that there are definite issues with certain arithmetic operations. You can review this thread on StackOverflow for more information.
Using SQL_VARIANT at all
A lot of people think they are clever and will use SQL_VARIANT to store various types of data in a single column. I can tell you from experience that this will be a major headache for you before long. Several external tools don't support it (e.g. entity framework), some providers like ODBC automatically convert the values to NVARCHAR(4000), and you can't use the column in the definition of a computed column. It also can't be part of a primary or foreign key, nor can you use wildcard operators such as LIKE. And even if you only store numbers, you cannot use operators like SUM(), MAX(), addition or subtraction. For a lot of these things you need to do an explicit CONVERT(), but to avoid conversion issues, you need to be careful about your WHERE clause also. Jonathan Kehayias talks about a few SQL_VARIANT issues in his blog post earlier this year. Recently Microsoft PSS also posted an article about potential memory leaks involving linked servers and SQL_VARIANT.
Choosing the right data type is all about knowing the right questions to ask about how the data is being used and how it may be used in the future. There is no one-size-fits-all answer, and we can't predict every single future change to our applications, but you should be aware of the potential pitfalls you may encounter every time you make a data type decision.
I am working on a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code. Up next: abusing triggers.