What is so bad about EAV, anyway?
I see a lot of people trash-talking the EAV (entity-attribute-value) model. If you want the full story on EAV, I recommend you read the Wikipedia entry. Basically this is the situation where instead of storing the properties of an entity in a single row, you store it as a set of name-value pairs in multiple rows.
I understand the general objections to using the EAV model:
- it is difficult to control the attribute names, and to make them consistent;
- it is difficult to enforce data type integrity and referential integrity;
- it is difficult (and slow) to pivot and/or self-join the values to make a single row;
- it is difficult to make certain attributes mandatory; and,
- we've been trained to view this as the "wrong" way to solve the problem.
But what do you do when you have customers that demand real-time, on-demand addition of attributes that they want to store? In one of the systems I manage, our customers wanted to do exactly this. Since we run a SaaS (software as a service) application, we have many customers across several different industries, who in turn want to use our system to store different types of information about *their* customers. A salon chain might want to record facts such as 'hair color,' 'hair type,' and 'haircut frequency'; while an investment company might want to record facts such as 'portfolio name,' 'last portfolio adjustment date,' and 'current portfolio balance.'
Luckily for us, returning that data in a single row wasn't all that important. Our customers wanted to find all the people with red or auburn hair, to sell them some new redhead-only shampoo; or, target all customers with a portfolio balance of $1 million or more, to try to get them to invest in some new pyramid scheme.
The Non-EAV Solution
Without EAV, what would this design look like? Well, we'd have to add a column to the customers table for each of these attributes. So assuming we are opposed to EAV, and that we segment our data by our customers (which we call clients), then by their customers, we would have something like this:
CREATE TABLE dbo.Customers ( CustomerID INT PRIMARY KEY, ClientID INT NOT NULL REFERENCES dbo.Clients(ClientID), HairColor VARCHAR(10), HairType VARCHAR(10), HaircutFrequency VARCHAR(32), PortfolioName NVARCHAR(32), PortfolioBalance BIGINT, LastPortfolioAdjustmentDate SMALLDATETIME );
Then if we wanted to satisfy those two queries, it could be:
SELECT CustomerID, HairColor FROM dbo.Customers WHERE ClientID = 1 AND HairColor IN ('Red', 'Auburn'); SELECT CustomerID, PortfolioBalance FROM dbo.Customers WHERE ClientID = 2 AND PortfolioBalance >= 1000000;
Simple enough. But now what if the salon decides they also want to record their customers' head size? The table gets wider and wider as more attributes are required, and you have to build code that will do this (or, *shudder*, add the column manually):
ALTER TABLE dbo.Customers ADD HeadSize TINYINT;
(And then any code that needs to be used to retrieve head size for the customer needs to be modified to understand the existence of that column.)
So while this reduces the complexity of self-joining to get multiple attributes as you would need to do in the EAV model, it creates its own performance problem by requiring more and more space for any single row, and potentially page splits and fragmentation as that data gets updated. Never mind that HairColor, for example, is probably of very little interest to anyone outside of our salon customer (though maybe we should aim for some of the dating site business). Personally, I don't think dynamic ALTER TABLE ADD new_attribute_name is the way to go here, as you can see that with many customers requiring their own custom attributes, we will soon be up against the one-row-per-page problem. Can you say "maintenance nightmare"?
The EAV solution
My solution: EAV, baby! We talked to our customers who were driving these requirements, and we came up with a good compromise. We didn't need to support every single data type; just three basic ones: strings, numbers, and dates. To simplify things (and based on a LOT of discussion about use cases) we settled on NVARCHAR(1024), DECIMAL(16, 4), and SMALLDATETIME. These covered just about every scenario we were presented with on what our customers would want to store for each of their customers. Actually, the biggest string they wanted was 255, but I am a forward-thinker – tomorrow they would want 512, and I didn't want to be constantly incrementing this columns, so I stretched it out – if they want to increase them we can simply adjust the parameters and artificial constraints outside of the schema. Similarly, the date column only needed precision to one day, but I left it open to minutes (kind of by choice, since there is nothing between SMALLDATETIME and DATETIME in 2000/2005; in 2008 I would have used DATE most likely), and the decimal requirement was only two decimal places. You might think that such a big decimal is wasteful, but we are using VARDECIMAL storage, which means that only the big decimals will use that space (last year, I ran some tests that show some favorable results).
[And going back to the NVARCHAR column, with SQL Server 2008 R2, we plan to implement Unicode compression so that we can squeeze a lot more out of our space — we are I/O-bound, not CPU-bound: see post 1, post 2 and post 3 of my tests of this feature earlier this year.]
So the core table looked like this:
CREATE TABLE dbo.Attributes ( AttributeID INT PRIMARY KEY, ClientID INT NOT NULL REFERENCES dbo.Clients(ClientID), [Name] NVARCHAR(32) NOT NULL UNIQUE, DataTypeID TINYINT NOT NULL -- 1 = string, 2 = numeric, 3 = date );
Then the actual data table looked like this:
CREATE TABLE dbo.CustomerAttributes ( CustomerID INT NOT NULL REFERENCES dbo.Customers(CustomerID), AttributeID INT NOT NULL REFERENCES dbo.Attributes(AttributeID), StringValue NVARCHAR(1024), NumericValue DECIMAL(16,4), DateValue SMALLDATETIME, ModifiedDate SMALLDATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (CustomerID, AttributeID) );
The eventual goal of this was to provide better self-service reporting without having to make custom modifications to the system every time a client has a new requirement. This self-service reporting can now be driven by a very simple drop-down that lists all of the IDs/names from the Attributes table, and offers different choices based on the data type. (For example, if it is a string, they can use =, contains, starts with, ends with, etc.; if it is a date, they can use >=, >, BETWEEN, <, <=, etc.) When they add an attribute, nobody has to go change any of this code because it is already written using this flexibility; the list of attribute choices just gets longer or shorter as attributes are added or removed. (Sorry, I'm not at liberty to share the C# source for the code that builds the self-service reporting queries. It is complicated, but not rocket surgery. Yes, I am mixing metaphors.)
So, for the above queries, we might end up with something like this instead:
SELECT c.CustomerID, HairColor = a.StringValue FROM dbo.CustomerAttributes AS a INNER JOIN dbo.Customers AS c ON a.CustomerID = c.CustomerID WHERE c.ClientID = 1 AND a.AttributeID = 1 AND a.StringValue IN (N'Red', N'Auburn'); SELECT c.CustomerID, PortfolioBalance = a.NumericValue FROM dbo.CustomerAttributes AS a INNER JOIN dbo.Customers AS c ON a.CustomerID = c.CustomerID WHERE c.ClientID = 2 AND a.AttributeID = 2 AND a.NumericValue >= 1000000;
Yes, these queries are a little more verbose, but they are fairly trivial to generate programmatically, when you have a well-defined underlying structure instead of an endless stream of column names on a wide, ever-growing table.
The grass is always greener…
Does this approach have its downsides? Sure:
- Indexes on the value columns are only marginally useful, and only in some cases.
By logging the actual queries that customers run, and how long they take, it is easy to periodically review the heavy hitters, and consider index or statistics changes that would benefit them. So far this hasn't been necessary; maybe we've just been lucky. We have customers with 100s of millions of rows in this table, and while the queries are not sub-second, since they are running behind the scenes and not in real-time, they are certainly well within our SLAs. Anyway for the ones that take longer, these are not queries that would have been any faster in a wide table either (e.g. WHERE PortfolioName LIKE '%small cap%').
- It is not trivial to return all of the values for a customer's set of attributes in a single row.
This is true, but then again, these are easy enough to transpose at the presentation layer (and that is what we do). There is no reason for us to need to present the data for a customer as single-row structure (YMMV).
- Pulling multiple rows for a customer is more expensive than pulling a single row.
This is also true. Though, you could argue that getting to a wide row to access just one skinny attribute is just as wasteful, especially if you are typically after certain and a small number of attributes with much higher frequency than larger numbers. This really depends on the table, its index structure, the I/O subsystem, access patterns, buffer pool, etc. But more importantly, since we can't define what a "row" looks like for all customers right now, never mind permanently, this trade-off is worth it to us. Again, YMMV.
- As outlined above, we can't control the consistency or even the presence of certain attributes across customers.
Well, we could, with a lot of cumbersome code. But since attribute names are in complete and utter control of each customer, they can live and die by their own control over them. I didn't want to get into the business of determining which attributes customers wanted to make mandatory; if one customer wanted HairColor to be mandatory, and another didn't, then that is an interesting issue to solve declaratively. The most complicated problem has been when a customer has added two similar attributes, e.g. HairColor and Hair_Color, then applied both attributes to a single customer. The resolution is tedious but logically simple:
- pick one to keep (let's say we keep AttributeID = 1, and throw away AttributeID = 2);
- for any Customers where only AttributeID = 2 exists, update to AttributeID = 1;
- for any Customers where both 1 and 2 exist, keep whichever one has a greater ModifiedDate, delete the older one, and if the one remaining is 2, update to 1;
3a. in the case of a tie, keep 1 and throw away 2;
- finally, delete Attribute 2 from the Attributes table.
I just wanted to shine some light on a case where the EAV model might make sense (other than it being written off as a "rookie or object-oriented mistake"). The viability of this approach will depend heavily on how much of the requirements you can gather up front (and whether they are likely to change over time, as in our case), as well as how you want to balance your maintenance time down the road: writing code during every "enhancement" request vs. occasionally troubleshooting performance. At least this is how my experience has panned out thus far. And if I were to design this system from scratch tomorrow, the design would be quite similar, though I might have a better opportunity in that case to see if sparse columns and filtered indexes might make the non-EAV approaches more attractive. In the meantime, I am donning my flame-resistant suit, as I am sure some purists and/or EAV opponents will come out blazing…