See the full index.
There is a common tendency to place an IDENTITY column on every single table. Usually this is done because this is an "easy" way to identify a single row. A simple example is where you have a junction table that joins products and tags.
CREATE TABLE dbo.Products ( ProductID int IDENTITY(1,1) PRIMARY KEY, ... ); CREATE TABLE dbo.Tags ( TagID int IDENTITY(1,1) PRIMARY KEY, ... ); CREATE TABLE dbo.ProductTags ( ProductTagID int IDENTITY(1,1), ProductID int NOT NULL FOREIGN KEY REFERENCES dbo.Products(ProductID), TagID int NOT NULL FOREIGN KEY REFERENCES dbo.Tags(TagID), ... );
Now tell me, what business purpose does the IDENTITY column on the ProductTags table satisfy? When are you ever going to need to reference rows in this table by the ProductTagID, and not by the ProductID and/or TagID? And why is *that* the column that should uniquely identify the row? This seems to imply that you would allow multiple rows for the same product and tag. Even if the PK is elsewhere and these questions disappear, the value of this column is questionable at best – for no discernible gain whatsoever, you seem to be taking up an extra 4 bytes (plus space in any indexes that include the column, and then the additional I/O involved in maintaining those indexes).
Another example is a logging table. Let's say you have a table that logs exceptions, or API calls, or other events over time. You insert rows monotonically either in real time or from log files. I often see this designed as:
CREATE TABLE dbo.ActivityLog ( LogID int IDENTITY(1,1) PRIMARY KEY, EventDate datetime NOT NULL DEFAULT CURRENT_TIMESTAMP, ... );
I think part of the knee-jerk reaction to add an IDENTITY column here and make it the PK is that events can happen at the same time, so it makes it hard to define a real natural key. I'll concede this point, as I have several log tables that do not have a PK or a unique constraint – but they certainly have a clustered index, and they do not have an IDENTITY column. The purpose of this type of table is usually to maintain a history of activity and not to do a lot of searching for individual rows (and when you are searching for individual rows, you are not looking for a specific LogID, you are looking for details in other columns – say, part of an error message, or a specific time range).
An interesting thing that came up in one of these cases was the need to purge older data. So, the developer intended to have a 30-day running window, and once a day a job would wake up and would delete any rows from the log that were more than 30 days old. Since the clustered index was on the IDENTITY column, their code basically did this:
- determined the highest IDENTITY value more than 30 days old
- assigned that to a variable
- deleted from the table where IDENTITY < variable
With the clustered index moved to the datetime column, and the IDENTITY column disposed of, his code became a lot easier (simply a DELETE with a WHERE clause), and more efficient too, since it didn't have to perform a lookup first.
Now, let's talk about a very similar case of a logging table where I think an IDENTITY column *is* useful, such as a logging table that keeps track of backup or index activity.
CREATE TABLE dbo.ActivityLog ( LogID int IDENTITY(1,1) UNIQUE NONCLUSTERED, EventType tinyint, -- assume some kind of FK StartDate datetime NOT NULL DEFAULT CURRENT_TIMESTAMP, EndDate datetime NULL, ... );
Now why is an IDENTITY column useful here? Well, rather than logging the start and end events in different rows, keeping the data on a single row makes it much easier to consolidate the start and end of a specific event. So the code would do something like:
DECLARE @LogID int; INSERT dbo.ActivityLog(EventType) SELECT 1; SELECT @LogID = SCOPE_IDENTITY(); -- do other work UPDATE dbo.ActivityLog SET EndDate = CURRENT_TIMESTAMP WHERE LogID = @LogID;
The unique constraint makes it slightly easier for the engine to locate the row to be updated, but the clustered index could still be placed on the StartDate and/or EndDate columns to provide useful searches on date ranges (with the same monotonically increasing property as the IDENTITY column), and a primary key could still be defined on other columns. I am sure Celko is rolling his eyes right now, though he has probably not even read this far.*
So back to the basics, am I telling you to not use IDENTITY columns? Absolutely (and hopefully obviously) not. Just showing that before blindly applying an IDENTITY column to a table, think about what it means and why you are using it. Don't get me wrong; I am a firm believer that IDENTITY columns have their place, and I use them quite liberally. When you are storing information about customers, products, or companies, where there are going to be multiple references to those entities in related tables, it makes little sense to copy all of the "real" attributes that identify that entity into all of the related tables. In some cases this would make your storage requirements surge dramatically, never mind that the performance of JOINs will suffer. It also can make for a real mess when you are repeating information that is likely to change – such as surname, e-mail address, phone number, or ISBN. Both of these issues are technical and not logical limitations – SQL Server doesn't support a "behind-the-scenes" foreign key construct that would make repeating the information unnecessary, and there are cases where cascading updates do not exactly work as advertised.
See the full index.