Back in October, I started a series of blog posts called "Bad Habits to Kick," and thought I would revive the theme.
I've worked with developers that dabble in SQL, and they tend to have a few common traits. I'm not sure where they come from, but one that I find rather distracting is the tendency to place an IDENTITY column on every single table. Usually this is done because this is an "easy" way to add a column to the table that allows you to identify a single row.
Now don't get me wrong, I am a firm believer that IDENTITY columns have their place, and I use them quite liberally. When you are storing information about customers, products or companies, where there are going to be multiple references to those entities in related tables, it makes little sense to copy all of the "real" attributes that identify that entity into all of the related tables. In some cases this would make your storage requirements surge dramatically, never mind that the performance of JOINs will suffer. It also can make for a real mess when you are repeating information that is likely to change – such as surname, e-mail address, phone number, ISBN. Both of these issues are technical and not logical limitations – SQL Server doesn't support a "behind-the-scenes" foreign key construct that would make repeating the information unnecessary, and there are cases where cascading updates do not exactly work as advertised.
A simple example is where you have an Orders table and an OrderDetails table. (And as a heads up, whiteboarding your perception of Barnes & Noble's database schema is a typical part of my interviewing process – how detailed we get is up to you :-).)
CREATE TABLE dbo.Orders ( OrderID INT IDENTITY(1,1) PRIMARY KEY, ... ); CREATE TABLE dbo.OrderDetails ( OrderDetailID INT IDENTITY(1,1) PRIMARY KEY, OrderID INT NOT NULL FOREIGN KEY REFERENCES dbo.Orders(OrderID), ProductID INT NOT NULL FOREIGN KEY REFERENCES dbo.Products(ProductID), Quantity INT NOT NULL, ... );
Now tell me, what business purpose does the IDENTITY column on the OrderDetails table satisfy? When are you ever going to need to reference rows in this table by the OrderDetailID, and not by the OrderID and/or ProductID? And why is *that* the column that should uniquely identify the row? This seems to imply that you would allow multiple rows for the same order and product. Even if the PK is elsewhere and these questions disappear, the value of this column is questionable at best – for no discernible gain whatsoever, you seem to be taking up an extra 4 bytes (plus space in any indexes that include the column, and then the additional I/O involved in maintaining those indexes).
Another example is a logging table. Let's say you have a table that logs exceptions, or API calls, or other events over time. You insert rows monotonically either in real time or from log files. I often see this designed as:
CREATE TABLE dbo.ActivityLog ( LogID INT IDENTITY(1,1) PRIMARY KEY, EventDate DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, ... );
I think part of the knee-jerk reaction to add an IDENTITY column and make it the PK is that events can happen at the same time, so it makes it hard to define a real natural key. I'll concede this point, as I have several log tables that do not have a PK or a unique constraint – but they certainly have a clustered index, and they do not have an IDENTITY column. The purpose of this type of table is usually to maintain a history of activity and not to do a lot of searching for individual rows (and when you are searching for individual rows, you are not looking for a specific LogID, you are looking for details in other columns – say, part of an error message).
An interesting thing that came up in one of these cases was the need to purge older data. So, the developer intended to have a 30-day running window, and once a day a job would wake up and would delete any rows from the log that were more than 30 days old. Since the clustered index was on the IDENTITY column, their code basically did this:
– determined the highest IDENTITY value more than 30 days old
– assigned that to a variable
– deleted from the table where IDENTITY < variable
With the clustered index moved to the DATETIME column, and the IDENTITY column disposed of, his code became a lot easier (simply a DELETE with a WHERE clause), and more efficient too, since it didn't have to perform a lookup first.
Now let's talk about a very similar case of a logging table where I think an IDENTITY column *is* useful, such as a logging table that keeps track of backup or index activity.
CREATE TABLE dbo.ActivityLog ( LogID INT IDENTITY(1,1) UNIQUE NONCLUSTERED, EventType TINYINT, -- assume some kind of FK StartDate DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, EndDate DATETIME NULL, ... );
Now why is an IDENTITY column useful here? Well, rather than logging the start and end events in different rows, keeping the data on a single row makes it much easier to consolidate the start and end of a specific event. So the code would do something like:
DECLARE @LogID INT; INSERT dbo.ActivityLog(EventType) SELECT 1; SELECT @LogID = SCOPE_IDENTITY(); -- do other work UPDATE dbo.ActivityLog SET EndDate = CURRENT_TIMESTAMP WHERE LogID = @LogID;
The unique constraint makes it slightly easier for the engine to locate the row to be updated, but the clustered index could still be placed on the StartDate and/or EndDate columns to provide useful searches on date ranges (with the same monotonically increasing property as the IDENTITY column), and a primary key could still be defined on other columns. I am sure Celko is rolling his eyes right now, though he has probably not even read this far.*
So back to the basics, am I telling you to not use IDENTITY columns? Absolutely (and hopefully obviously) not. Just showing that before blindly applying an IDENTITY column to a table, think about what it means and why you are using it.
* P.S. I don't want to debate the merits of using an IDENTITY column or otherwise surrogate key (such as NEWID() or NEWSEQUENTIALID()), vs. using a natural primary key. This comes up with Celko a lot – he is completely against any kind of system-generated identifier that can't be physically verified. When asked a pointed question such as, "what do you think Barnes & Noble uses as the primary key to identify a customer or an order?" he disappears, never to appear in the thread again. So while he is passionate about his convictions, he has no desire to defend them.