Bad habits to kick : making assumptions about IDENTITY
In my last post in this series, I talked about inconsistent naming conventions. This time I want to talk about a few of the assumptions people make when using IDENTITY columns, and what kind of trouble they can cause.
IDENTITY will prevent gaps
A lot of people choose IDENTITY as their "primary key" and expect it to leave no gaps – ever.
I'll leave the choice of IDENTITY as a primary key (over natural keys or GUIDs) for other discussions. Personally, I do use IDENTITY as the PRIMARY KEY in some scenarios, for various reasons. Usually there are unique constraints as well, and the primary key is not always clustered, since that is not always an optimal scenario. Again, these are conversations I will leave for elsewhere. (Back in 2003, I compared IDENTITY and GUID; much more recently, Mladen Prajdic talked about why why he prefers surrogates).
The way people use IDENTITY, seem to stem from the belief that the value assigned by the database should mean something. In fact, the value of an IDENTITY column should basically be unknown and meaningless to you except in the case where you are debugging actual data. The value should never be exposed to the user, or at least not exposed in a meaningful way.
But let's get back to the basics. The purpose of an IDENTITY column is first and foremost to provide UNIQUE values, not CONSECUTIVE values. Unfortunately, you cannot rely on IDENTITY values to be consecutive with no gaps. There are several ways that gaps can be created in an IDENTITY column's sequence, including rolling back a transaction and manually changing the seed value:
CREATE TABLE dbo.foo ( bar INT IDENTITY(1,1) ); GO INSERT dbo.foo DEFAULT VALUES; BEGIN TRANSACTION; INSERT dbo.foo DEFAULT VALUES; ROLLBACK TRANSACTION; INSERT dbo.foo DEFAULT VALUES; SELECT bar FROM dbo.foo;
bar -------- 1 3
Then if you later do this:
DBCC CHECKIDENT('dbo.foo', RESEED, 5); INSERT dbo.foo DEFAULT VALUES; SELECT bar FROM dbo.foo;
bar -------- 1 3 6
So, two very simple demonstrations of how there can be gaps in your IDENTITY columns. What you can do to solve this "problem" depends on your goal. If you just need "row number" values that have no gaps but that can change (e.g. for a report), you can just add a ROW_NUMBER() to the query and not worry about the actual IDENTITY values:
SELECT bar, rownum = ROW_NUMBER() OVER (ORDER BY bar) FROM dbo.foo ORDER BY rownum;
bar rownum -------- -------- 1 1 3 2 6 3
There are some situations where you have no choice in the matter; gaps are simply not allowed. For example, business rules in some jurisdictions require invoice numbers to be sequential with no gaps. In this case, you shouldn't be using IDENTITY at all, but rather you can use a different technique that generates the next value, typically referred to as "rolling your own IDENTITY."
CREATE TABLE dbo.mort ( bar INT PRIMARY KEY ); GO CREATE PROCEDURE dbo.mort_create AS BEGIN SET NOCOUNT ON; DECLARE @bar INT; BEGIN TRANSACTION; SELECT @bar = COALESCE(MAX(bar), 0) + 1 FROM dbo.mort WITH (UPDLOCK); INSERT dbo.mort SELECT @bar; COMMIT TRANSACTION; END GO
Now you are able to sufficiently guarantee that there will be no gaps (assume for a moment that there is actual error handling in the procedure, which I omitted for readability). The downside of this solution is that concurrency can be drastically affected. Or at least, that is the myth. Let's see how many rows we can insert using this solution vs. an IDENTITY column. I created five query windows for each method, dropped and re-created the table dbo.foo from above (this time with a primary key constraint on the IDENTITY column, to make the comparison fair), and then tried to insert 5,000 from each connection:
Method 1 : IDENTITY:
SET NOCOUNT ON; GO INSERT dbo.foo DEFAULT VALUES; GO 5000
Method 2 : "Roll your own":
EXEC dbo.mort_create; GO 5000
Results: both sets of tests performed in exactly 75 seconds. I truncated the tables and ran the tests multiple times, to ensure it wasn't an anomaly, that one of the scripts wasn't inadvertently delayed by an autogrow event, etc. Nope, exactly 75 seconds, every time.
So in the past, those of you who have a business requirement for no gaps and a requirement for high concurrency certainly, have faced an interesting challenge. You often had to trade off potential performance to meet your business requirement, or the opposite. And often because people told you that the MAX+1 solution wouldn't scale as well. I will admit that I have been guilty of telling people that from time to time, because it seemed pretty logical to me. Now that I have performed tests, I will get out of the bad habit of telling people which solution will scale better.
IDENTITY will prevent duplicates
While the IDENTITY property attempts to provide you with unique values, it cannot guarantee them. The element that prevents duplicates in a column is not the IDENTITY property itself, but rather the constraint(s) on that column. Here is very simple proof, a table with an IDENTITY column but the primary key was "forgotten":
CREATE TABLE dbo.splunge ( bar INT IDENTITY(1,1) ); INSERT dbo.splunge DEFAULT VALUES; DBCC CHECKIDENT('dbo.splunge', RESEED, 0); INSERT dbo.splunge DEFAULT VALUES; SELECT bar FROM dbo.splunge; DROP TABLE dbo.splunge;
bar -------- 1 1
So clearly, if you try hard enough, or are particularly accident-prone, you can create duplicate values (unless you also explicitly define the column as unique).
INT will be big enough
Some people apply an IDENTITY column to a table that doesn't need one. For example, a log table that holds high volumes of transaction data; what purpose does an IDENTITY column serve here? What does it mean exactly? If it is to ensure rows are always added to the "end" of the table, and you are also recording date and time as part of the data, then it is just wasting room on the page. Presumably the data is coming into the log table in (roughly, or exactly) chronological order. In which case, you could just as easily cluster (and even partition) on the datetime column. This will keep the same behavior (inserting to the "end" of the table) and will allow much more accurate purging of the data or partition switching. If you need to identify a set of rows later, this is typically going to be by date range. Getting the IDENTITY values represented by that date range is just an extra and unnecessary step.
Anyway, I brought up this example because, if you choose an INT, and you start at 1, you only have ~2 billion values to use up before you will have to make a change to the table. A knee-jerk reaction by some people is to say, "well, we should have started the seed at -2<billion>, then we wouldn't have this problem!" Well, you would, it would just come a little later; if 2 billion isn't enough, it's unlikely that 4 billion will be enough, either.
Ideally, you could switch to BIGINT and not look back. However, changing data types can be very disruptive, since you could have many stored procedures, schema-bound views and other objects that are using the INT type to represent the data that is in this column. While the data in the column is still inside the INT upper bound, you could change the column without changing all of these other elements immediately. Implicit conversion won't kill you, but if you're approaching the upper limit, you could wake up tomorrow and suddenly have overflow problems.
For a temporary solution while you plan your conversion, you could set the identity value to -2,147,483,648. Note that this significantly changes your index usage if they include the IDENTITY column. If you still want the rows to sort in the order of insert when you select them, you could use a CreatedDate column with a default of CURRENT_TIMESTAMP, and order by that. Or, if you don't have such a column, you could use something like this, but it is pretty ugly:
ORDER BY CASE WHEN IdentityColumn > 0 THEN '1' ELSE '2' END, CASE WHEN IdentityColumn < 0 THEN -IdentityColumn END DESC;
Hopefully you can see that if you anticipate this problem and can plan ahead, and you really do want the table to have an IDENTITY column, it is better to go with BIGINT from the start, rather than just double your pool of values by starting at -2,147,483,648 instead of 1. Never mind the debugging problems on initial implementation of that table. Another issue that will happen is if you start at 1, whether you use INT or BIGINT, you can achieve fantastic compression in SQL Server 2008 for the first many thousands of rows. If you start at -2 billion, forget about it.
@@IDENTITY and IDENT_CURRENT() will be reliable
I strongly recommend staying away from both @@IDENTITY and IDENT_CURRENT().
Many people used @@IDENTITY before SCOPE_IDENTITY() arrived on the scene. One problem with @@IDENTITY is that if the table from the first insert had a trigger, and that trigger also inserted into a table with an IDENTITY column (e.g. an audit table), the @@IDENTITY value returned is the one from the audit table, not the original table.
I also see the occasional person thinking that using IDENT_CURRENT('tablename') is a reliable way to retrieve the IDENTITY value that you just generated. This is not true, because it is not scope-specific; another user may have also inserted into the same table after you did, and you could be getting *their* IDENTITY value.
SCOPE_IDENTITY() avoids both of these problems by returning the IDENTITY value you generated. A couple of issues with SCOPE_IDENTITY() that you should keep in mind, however: you cannot expect SCOPE_IDENTITY() to retrieve the value that some other session inserted, and there are a few outstanding issues with SCOPE_IDENTITY() in SQL Server 2005 and 2008 (see Connect items #328811 and #476577).
Hopefully this has shed some light on a few of the assumptions people make about IDENTITY columns, and it helps prevent you from falling into the same traps.
I am working on a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code. Up next: choosing the wrong data type.