Bad Habits to Kick : Choosing the wrong data type
In my last post in this series, I talked about the assumptions many people make about IDENTITY columns. In this post, I want to focus on choosing data types. There are several areas where I see frequent mistakes in data type choice, and I'll mention a few here.
Using CHAR / VARCHAR / INT for date / time values
Countless times I have seen character-based data types for columns meant to store date/time information, usually to store their preferred format (e.g. d/m/y) or to strip the date or time from the value. Even worse is when they use NCHAR or NVARCHAR when we know that dates are not going to include any Unicode characters. There are several negative consequences to this. Most importantly, you lose proper validation – with a string, anybody can put "12/42/9999" or "36/8/211" or "foo" into that column. Depending on what format you store and how you run conversions, you will lose the ability to perform range queries, and you will have definite losses in the area of index operations. You also lose the ability to reliably perform date math operations like DATEADD(), DATEDIFF() and DATEPART().
You lose less in terms of index performance and storage when you store the date (e.g. in YYYYMMDD format) as an INT, but again you will have to be very creative in how you query the table and what types of parameters you pass in to support range queries. A useful example to point out is how painful it is for us to get useful date/time information out of the sysjobhistory table in msdb.
It is much better to store these values as DATETIME or SMALLDATETIME and either strip off the parts you don't want before insert, or ignore them when you query. You can also use computed columns (or populate them manually) at insert time, and keep both the original value and the "rounded" value. In SQL Server 2008, you can also use the new DATE and TIME data types if you are only interested in one component or the other.
Using DATETIME instead of SMALLDATETIME
If you are only interested in the date portion of the value, or don't need the granularity of time below the minute, use SMALLDATETIME instead of DATETIME. You will save 4 bytes per row and your indexes and range queries will be much more efficient. In SQL Server 2008, you can use DATE in the "date only" case, and save an additional byte. Your users and your SAN administrator will thank you. When I have some time, I will post an article demonstrating some of the gains of using the smaller date/time data types where possible.
Using TIME to store duration
With SQL Server 2008, it may be tempting to store a duration in a column defined with the new TIME data type. This data type is not meant to store duration but actually a point in time. A problem with trying to store duration here is, what happens when your duration exceeds 24 hours? It would be much more useful to store the "duration" in two columns – StartTime and EndTime – which gives you even more information. You can always calculate the duration in whatever unit you want. You could add a third column which can be computed, calculating the duration in minutes or seconds or whatever makes sense as the smallest granularity you may need to calculate. Your choice for the StartTime and EndTime columns could be TIME if you are reasonably confident that durations will always be less than 24 hours and will not cross any midnight boundaries, but more likely they should be SMALLDATETIME, DATETIME or DATETIME2, depending on the granularity you need.
Using NVARCHAR to store data like zip or phone
Once in a while I see zip and phone columns defined as NVARCHAR. I am not sure what phone number can contain Unicode characters, so maybe it is a valid choice for you, but somehow I doubt it. I also often see these defined far wider than they need to be. NVARCHAR(50) for phone? Really, people? This is how you can tell someone created their table either through the SSMS UI (which offers 50 as the default) or from upsizing Access (which converts all text columns to NVARCHAR).
E-mail is an interesting one. We use VARCHAR(320), since the local part (username) is limited to 64 characters, and the domain name is limited to 255 characters. 64 + @ symbol + 255 = 320. In our systems, we do not need to worry about supporting far East e-mail addresses, which can require Unicode characters. We also have yet to be bitten by someone who has a very long domain name *and* insists on using a sub-domain to exceed 255 characters. They would also have to use a very large local part which, I guess, is possible, but very unlikely. When that user comes in and breaks one of our systems, I'll point them to this post. 🙂
Using VARCHAR to store data like name and address
Your applications should be prepared for internationalization, even if right now you are not supporting it at all. You might think that because you only have American customers or users that you don't need to worry about foreign characters, however there are potential embarrassments lurking if you store a user's name with VARCHAR and lose part of the data. In most cases you will eventually need to support Unicode, so you are likely just delaying the inevitable by using VARCHAR now. In SQL Server 2008 R2, you can use data compression to significantly reduce the storage and I/O implications of doing so, since it will essentially treat all non-Unicode characters as if you had used VARCHAR.
Using VARCHAR instead of CHAR
If you know your column is going to store a (relatively) fixed number of bytes, you should consider using CHAR instead of VARCHAR. You may sometimes lose a little bit due to the padded spaces at the end, but you will often gain this back by not having to worry about row overhead for storing the number of characters. The basic rule of thumb I have seen out in the wild is that if you are storing fixed size, or variable size less than 10 or 12 characters, use CHAR. Of course, you need to be sure you understand how the column will be used in the future. If there is any chance that you will later store more data here, you are probably better off using VARCHAR. If memory serves me right, you will have much less pain later if you have to increase the size of the column and rebuild the table, if you are going from VARCHAR(x) -> VARCHAR(x+y), since this doesn't impact existing rows the way it would if you were converting CHAR(x) -> CHAR(x+y) or VARCHAR(x+y).
Using BIT for properties like Status
It's always dangerous to use two- or three-valued logic to store a status value unless you know it can always only ever be on or off. Initially you might always say that something is "on" or "off" – but later, you might need additional states such as "loading" or "archived" or "hidden" or "unknown." Why not use TINYINT to leave room for expansion? You don't lose anything space- or memory-wise, since TINYINT still takes up only 1 byte. You can add a check constraint which will make it easy to define the domain of values without having to change the data type later.
Using INT instead of BIGINT
If you have a large table with an IDENTITY column that holds transaction data, and will be doing so for very high volumes or for a very long time, you might run into a case where you run out of IDENTITY values because you hit the ~2 billion upper bound on the INT data type. As I explained earlier in this series, a common reaction is to "design this better" by starting the IDENTITY seed at the lower bound of INT instead of 1. This doesn't really buy you anything except that it temporarily delays the need to deal with it … you'll still have to fix it eventually, because if you're hitting 2 billion rows now, how long will it be before you hit 4 billion? It is relatively impossible, in our lifetimes, to use up all the positive values that a BIGINT supports, unless you set the increment to some ungodly number. So in cases this like this it can make sense to take the hit now and use BIGINT. If you're using SQL Server 2008, this is another area where data compression can help reduce the impact, since it can compress many of the smaller values.
Using INT instead of SMALLINT or TINYINT
Let's say you have a lookup table for sex_code. Instead of storing M/F/U etc. (or the full words Male/Female/Unknown) in all of the tables, you just want to store a numeric representation (there are industry standards for this). However, there is no reason to use INT here; since there is no impending evolutionary change we can envision that will require the number of options even jumping to 10, never mind > 255. So why not use TINYINT? It won't make a big difference to the lookup table itself, since this is going to be tiny and will likely be in memory anyway. But on the real tables, you're going to save 3 bytes per row in storage, and your reads, joins, seeks and scans are going to be slightly more efficient.
Using FLOAT / REAL instead of DECIMAL / NUMERIC
Since FLOAT and REAL are approximate numbers, you can run into all kinds of presentation / storage issues if you use them inadvertently. They have their place, and if you are using scientific data that requires the properties of floating point numbers, th they are the right choice. But in most cases, I think this is another evolutionary thing that happens accidentally when you upsize from Access. You are much better off using DECIMAL or NUMERIC (which are functionally equivalent) unless you really know that you should be using FLOAT. For some more background information, you can look at this thread from Stack Overflow.
Using MONEY / SMALLMONEY instead of DECIMAL / NUMERIC
I see no gain in using the MONEY or SMALLMONEY types. One benefit that a few people have related to me is that it describes *exactly* what is in the column. This isn't just a number, it's CASH! Like I would look at a column called "Salary" or "RetailPrice" and assume it was storing peanuts or cheese doodles. The problem with these types is that there are definite issues with certain arithmetic operations. You can review this thread on Stack Overflow for more information.
Using SQL_VARIANT at all
A lot of people think they are clever and will use SQL_VARIANT to store various types of data in a single column. I can tell you from experience that this will be a major headache for you before long. Several external tools don't support it (e.g. entity framework), some providers like ODBC automatically convert the values to NVARCHAR(4000), and you can't use the column in the definition of a computed column. It also can't be part of a primary or foreign key, nor can you use wildcard operators such as LIKE. And even if you only store numbers, you cannot use operators like SUM(), MAX(), addition or subtraction. For a lot of these things you need to do an explicit CONVERT(), but to avoid conversion issues, you need to be careful about your WHERE clause also. Jonathan Kehayias talks about a few SQL_VARIANT issues in his blog post earlier this year. Recently Microsoft PSS also posted an article about potential memory leaks involving linked servers and SQL_VARIANT.
Choosing the right data type is all about knowing the right questions to ask about how the data is being used and how it may be used in the future. There is no one-size-fits-all answer, and we can't predict every single future change to our applications, but you should be aware of the potential pitfalls you may encounter every time you make a data type decision.
I am working on a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code. Up next: abusing triggers.
I see a lot of databases where product codes/item numbers/upc codes are stored as a numeric data type, which strikes me as just wrong. In my mind, the entire point of having data types is to define which operations are valid, and if you cannot perform a single arithmetic operation on the data, it clearly is not numeric. I had someone argue that numbers stored as text don't sort "properly", but if it is a product code, it probably has a defined length, and zero padding to that length fixes the problem. Is philosophical objection valid, or am I arguing about angels on pins?
@Rafael Not so sure a rebuild will help. I've started a series about this:
Thanks for your quick reply. I've rebuilt all of them, except the Clustered (yes, it is the PK), because it is too big to do so and will be really disruptive to the server. I've done the REORG though. Do you think is worth ask for an outage to run the Rebuild on the PK?
Hi Rafael, I am not surprised disk I/O goes up, because a bigint takes up twice as much space (and if that's the clustering key then that occurs in all non-clustered indexes too), though I don't have enough information to understand why it would go from 8k to 60k. Did you rebuild the index(es) after the change?
Hi Aaron, thanks for the article, it's awesome. I had to change our main table id from INT to BIGINT but since then I've noticed a high increase of the number "Avg Disk Bytes/Transfer" (From 8KB to 60KB) and consequently decrease on my I/O performance on the disk I host this table. Have you seen that? Do you think the BIGINT might be causing SQL to transfer that much more data?
@JO, note that I said in 2008 or above you should use DATE if you only care about the date. So your panic is only valid if you plan to still be using 2005 at some time in the future when you'll need to be storing dates beyond 2079. Hopefully people making these decisions know enough about their data model to be making these decisions.
Using SMALLDATETIME instead of DATETIME is a HUGE mistake, if your only reason is to avoid storing the time. The massive negative implication of this poor recommendation is that you will be setting yourself up for another Y2K like event. The ceiling for maximum allowed dates in a SMALLDATETIME datatype is June 6, 2079. That means that if you store any data related to insurance, land titles, or any other subject that will likely make references to far future dates, you are SCREWED! SQL will refuse to store any date past 6/6/2079.
Who ever came up with that deserves to be….
Once again an excellent article.
I'm just missing one thing: Using varchar or other text datatypes is often a strong indication for using string concatenation instead of sql-parameters. So another subtle but important reason to use the correct type is to prevent sql-injection.
Hi Aaron, thanks for the great article.It is very timely for me as I am currently reviewing the data types being used in our reporting database. Are you aware of any T-SQL that will go through each column in a table and based on the current data in the column suggest a more appropriate data type where applicable? For instance a BIGINT column whose values only ranged from 0 to 1000 it would suggest SMALLINT.
Numeric types are precise, that is great. But per-item price is many times not representable exactly no matter the data type. So we use approximate types for prices as they are not representable exactly anyway.
I think storing numeric data as characters might have a slight performance benefit when the recorded data is to be displayed in reports as is without computation.
Other than that, i don't imagine many numeric fields are not involved on computation. Like aggregation or comparison. Why throw away the versatility when somewhere down the line the number may become computationally useful.
@Rich, for what benefit? I could see some situations, e.g. numbers that require string parsing for validation etc. (social security, credit cards). But otherwise if the numbers are > 4 digits you're going to use more space storing them as strings than as ints. As strings you're also much more likely to incur performance issues due to implicit conversion.
Hi Aaron, Good stuff! Thanks for putting all this together. A colleague told me the other day that numerical data should be stored in character fields if it wasn't being used for doing math. Can you speak to that?
JamieC it should have read "two- or three-valued logic" meaning BIT NOT NULL (true, false) vs. BIT NULL (true, false, unknown).
"It's always dangerous to use two-three-valued logic to store a status value" — not sure what you are trying to say here. Did you start with "two-valued" then intended to correct to "three-valued" but mistyped? Two distinct tristate data elements? 23-valued logic?!
csm, you're right of course. I was thinking of the simplest case, where there is only one such column in the table.
Aaron, when you're talking about BIT or TINYINT, I think you have a (little) mistake. Of course, both data types takes up only 1 byte, but when you have 8 or less BIT columns, you still needs 1 byte to store the 8 columns.
As BOL says:
"The Microsoft SQL Server 2005 Database Engine optimizes storage of bit columns. If there are 8 or less bit columns in a table, the columns are stored as 1 byte. If there are from 9 up to 16 bit columns, the columns are stored as 2 bytes, and so on."
I might add…go larger when in doubt. Disk space is cheap; having to change a bunch of code and downstream processes ad infinitum because someone read a book in literal fashion and thought they were doing a service by saving two bytes by using a smallint – wrong. Two years later and the client now has a problem and a new project on their hands because of the decision.
Good read, thanks.
Choosing the wrong data type:
Storing date/time values as a unix timestamp rather than any of the date/time related types. You lose all of the date functions and have to mentally convert everything before writing. I will take DATEDIFF over (< 86400 * something) any day.
Thanks for the great advice. I still have to train myself to use SMALLDATETIME and TINYINT types when they are more appropriate. I would also like to add one more annoyance I come across very often.
Inappropriate Use of TEXT and NTEXT:
In most cases, a large VARCHAR or NVARCHAR field will we sufficient for storing all of the characters required for these fields. Beginning with SQL Server 2005, you can use the VARCHAR(MAX) or NVARCHAR(MAX) data types to store a virtually unlimited number of characters. The TEXT and NTEXT types introduce many complications such as the inability to group on these fields or perform many of the common string operations.
Heh. As the saying goes, the only definitive answer is "It depends". ;->
Thomas, I tried to prevent the argument you're presenting about BIGINT by qualifying in my description: "column that holds transaction data, and will be doing so for very high volumes or for a very long time"… hopefully you didn't interpret that as hundreds of thousands of rows. Otherwise I'm not sure why there would be any danger in hitting the upper bound of INT.
RE: int to bigint
This comes back to reasonable expectations. If (a big "if") the middle-tier is designed to accomodate bigints, then changing from int to bigint, while time consuming, isn't the end of the world. If you reasonably expect 10s of millions of rows, then yes, using a bigint is probably a good choice. If you expect at most hundreds of thousands of rows, then the use of a bigint can be reasonably questioned.
The issue isn't whether you will be working with SQL in 2079. It's whether the data will need to capture something beyond D-Day of 2079. One system I build captured contract data. Who'd sign a contract for 80 years in the future? Welp, they did (or tried to). However, as you said, if you only use it to capture "today" dates, then it's fine.
Thomas, have you ever tried to change a column from INT to BIGINT on a table with 2 billion rows? Your point about the upper bound on SMALLDATETIME is a good one. Where I use SMALLDATETIME, it is for *current* data, that doesn't need sub-minute or sub-day accuracy and isn't going to be holding dates 60+ years in the future. I'll worry about the upper bound if I'm still working with SQL Server in 2079. 🙂
Bruce, I was planning on treating the comma-separated list as a separate topic on its own. In most cases these are used because SQL Server doesn't know what arrays are, and because it is quite convenient for applications to pass such a list in a single parameter (and in a lot of cases, such as a set of checkboxes or multi-select list items in a web form, the values are already organized in this very format).
John, I am not writing an English dissertation. I don't think making the correction you suggest will change the meaning of the sentence for any of my other readers.
I disagree about using smalldatetime. IMO, it should almost never be used. Sure, it saves on space, but its maximum value is the bizarrely chosen 2079-06-06. Aside from not being intuitive, I have run into systems that initially would "never" hit that date but in reality did. Storage is cheap. If storing only the date is important, then upgrade to SQL 2008 and use the DATE data type.
RE: Using Char vs Varchar
There is more at stake than just adding a few extra characters. Equating values is different in middle-tier code. When you insert "FOO" into a Char(10), it will come back as 10 characters. While SQL statements will ignore the spaces, .NET code for example will not which means the developers need to trim every value coming back from that field if they are going to do equality tests against it.
RE: Int instead of BigInt
Again, it depends on the design. I find it interesting that you advocate smalldatetime for space reasons but not int for the same reason. I have systems I built over 10 years ago that are still in use using ints and are no where close to needing a bigint. If the system is designed well, it shouldn't be an issue to change from int to bigint later. In fact, a better argument could be made that the middle tier should assume 64-bit ints rather than the database.
RE: Using Smallint
There is a reasonable argument against smallint and tinyint in that they are not standard SQL and that using check constraints and ints (or bigints) makes your intent clearer than relying on the arbitrary restrictions of smallint and tinyint.
Nice article. I like how you are keeping these very much to the topic at hand.
I'm surprised you didn't include the list of values encoded in a single VARCHAR.
One thought regarding the use of NVARCHAR for phone numbers: it's a common practice now to have words (like business names) encoded as a phone number. You know, like the "1800 CALLMENOW" kind of thing. Do you store those as text or as numbers? If you store them as text, what happens in non-english speaking countries?
Much food for thought. My comment is about the use of English in the INT and BIGINT section:
"It is relatively impossible, in our lifetimes, to use up all the positive values that a BIGINT supports, unless you set the increment to some ungodly number."
Impossible is like a BIT function: on or off. Use "very difficult" instead of "relatively impossible".
Great series! Here's a couple of other issues I've seen when dealing with databases upgraded from 2000 to 2005:
1 – Using Text instead of varchar(MAX).
2 – Using Image instead of varbinary(MAX).
I also hate looking at the table definition and seeing all varchar(50) or nvarchar(255) columns.
Interesting concept about making the fields larger than the user wanted to allow for future growth.
i have been part of a move from int to bigint. the best part is if you have a server with years of archive data, older hardware and no chance to buy new hardware then you spend a few months moving the data manually.
Another point with the MAX designator is that it turns a normal column into a LOB column, which prevents features like online reindexing from being performed.
Great series, Aaron.
I will add that before setting absolute hard rules for space efficiency, consider the impact.
The cluster key should be almost perfect on space efficiency, as this will be part of every non-clustered index as well.
The next priority is any column that will be in a non-clustered index key. If the key is fat, then the index will be deep, impairing index seek efficiency (even though there does not appear to be much difference between index depth 3 vs 4).
Slightly lower in priority is any column in included columns of a nonclustered index. Its not part of the key, but it still has to be duplicated between the table and nonclustered index.
The lowest priority are columns not part of any index. If you feel compelled to splurge on space usuage, do it here. This might let you add a bullet on your resume that you managed 10TB database, even though we know its because you were lax in your choice of data types
I would never use MAX for something like this. I use MAX when I know I'm going to need to support more than 8000 bytes (4000 characters for NVARCHAR, 8000 characters for VARCHAR). You need to have a data dictionary and enforce it. If the app can pass 60 characters when the database expects 56, then there is a disconnect somewhere that you need to correct, sorry. Possible suggestion: ask what the most they will *ever* need to support. Double it for the database column itself (e.g. if they say VARCHAR(64), then use VARCHAR(128)). Make the parameters to your stored procedures VARCHAR(64). So, they will lose their data if they don't listen to you and pass more than 64 characters (but ideally their data validation before they submit data matches the data types of the parameters they use). When you really DO need to extend to 72 characters, or 96 characters, or 100 characters, or 128 characters, you will initially only have to change the input parameters as opposed to changing the underlying schema.
Nice series of articles you've got. What's your opinion on using the "max" scale with varchars. I've frequently fallen back on it because in the past I've had to deal with input scale changes (names an now be 56 characters not 55! Yeah!). But still, it makes me feel like I'm being lazy. I tell myself that varchar(max) uses just as much storage as varchar(50) for the same value but I also realize that I am misrepresenting the business logic by not constraining the input: if name is supposed to be <= 56 bytes and I get 60 from the source system, I need to have a conversation with someone and not just hide it under the "max" rug.
Thanks Michael, you're right about that item; that heading should have said VARCHAR. I've corrected it.
As for CHAR vs. VARCHAR, I was talking more about the row overhead. You're probably right that CHAR is rarely appropriate, even if the data is allegedly "fixed"; ISBNs expanded to 13 characters, and I'm sure SSNs and VINs will do so as well.
I was going with conventional wisdom on this one; and again, only suggested "should consider" with perhaps some unstated emphasis on "test for yourself." As for my own personal use, I very rarely use CHAR. I am looking through my codebase for the primary system I've developed and managed over the past several years, and I have a grand total of 17 CHAR or NCHAR columns (out of 334 character-based columns). Some of these columns are things like gender and state, non-nullable, and then some are columns in views (such as datetime values converted to (CHAR(10), 120) for display purposes).
I think your "Using NVARCHAR instead of VARCHAR" is a "Do" in a list of "Do nots".
From the text of the item, you probably meant vice versa.
You also mention using "VARCHAR instead of CHAR" and talk about space savings. I've never seen any significant space savings gained by this rule of thumb and with the danger of underestimating the size needed (a danger you acknowledged) I'd say it's best to just choose VARCHAR and move on. See http://dbwhisperer.blogspot.com/2009/09/char-what-is-it-good-for.html
"Choosing the right data type is all about knowing the right questions to ask about how the data is being used and how it may be used in the future."
Aaron, keep it up.