Mixing OLTP and reporting using indexed views
I'm not a big fan of denormalizing, nor of repeating redundant
redundant information in a database when those facts can already be derived from other information. A classic example of the latter is when I see questions on newsgroups, forums or Stack Overflow that ask how they can update a table's rank column to reflect the current rank based on some other criteria in the table. The problem with this is that you have to run the update every time any DML operation touches any row in the table, and if you can calculate that rank in an UPDATE query, you can also calculate that rank in a SELECT query, so why not just figure it out in real time? An example of the former is when we want to maintain a second table with aggregates from another table. This is where I want to spend my time today.
In reality, many of us deal with OLTP systems that must also serve as the reporting source, so it's not always feasible to calculate aggregates in real time against a constantly moving target. In several of these scenarios I've used indexed views, where we have tables that are inserted often, read often for statistics, but rarely or never updated… sure, you pay a little hit up front on the insert, but the benefit achieved during real-time reporting was worth it. Let me give you an idea of what I was dealing with initially, and how I have fixed it over time to perform even better.
One of our systems is an SaaS e-mail platform, where we offer the ability for our customers to distribute messaging to their customers (mostly via e-mail, but we also support fax, SMS, widgets/gadgets, and social media). For brevity, let's call these people "recipients." If you stripped the schema down to the bare essentials for metadata, you'd see this (I'm going to leave out the foreign key and other constraints, as they are either obvious or irrelevant):
CREATE TABLE dbo.Domains ( DomainID INT PRIMARY KEY, DomainName VARCHAR(255) ); CREATE TABLE dbo.Recipients ( RecipientID INT PRIMARY KEY, DomainID INT, LocalPart VARCHAR(64) /* , ... other columns ... */ );
Obviously we store many other details about recipients, such as demographic information, preferences, opt-in status, subscriptions, etc. But for this discussion, the above is sufficient. Notice that for space savings we don't actually store the e-mail address of the recipient, though it is required and is enforced to be unique via a constraint on (DomainID, LocalPart); this information can easily be derived using a view (you can even make this an indexed view if you want quicker access to e-mail addresses, but I'll leave that for another day):
CREATE VIEW dbo.vRecipientDetails AS SELECT r.RecipientID, EmailAddress = r.LocalPart + '@' + d.DomainName FROM dbo.Recipients AS r INNER JOIN dbo.Domains AS d ON r.DomainID = d.DomainID;
RecipientID is the primary key because it is used in a lot of related tables, most importantly, stats tables. We record every single transaction for a recipient: when a message is attempted to be sent to them; when a successful delivery occurs; when a bounce occurs; when the recipient opens the message or clicks on a link in the message; when the recipient changes their preferences; or, when the user declines to receive further communications from our customer. The thought of storing LocalPart + DomainID (or the fully composed e-mail address) in all of these other tables that would grow and grow over time made a surrogate representation a very easy choice for the primary key. I didn't want to make the discussion revolve around this, but I've seen several raised eyebrows in the past about surrogate primary keys and wanted to assure you that, in this system, I think it is the best choice. (If you want to debate that, let's have a different discussion, as long as it doesn't turn religious.)
Okay, so I mentioned this is an OLTP system, and I mentioned that we record all of the individual transactions for a recipient in various stats tables. Let's take a quick look at how one of these stats tables looked for the first, oh, 5 years of their existence:
CREATE TABLE dbo.Deliveries ( MessageID INT, -- references a table dbo.Messages RecipientID INT, EventDate SMALLDATETIME /* , ... other columns... */ );
So a customer would send out a message targeted to, say, 50,000 recipients. During the next few hours we would fill up this table with successful deliveries (and other similar tables with bounces, opens, clicks, etc). Soon after inception of the system, we found that our customers wanted to pull real-time statistics on how the messages were going… mostly on the lookout for high bounce rates (well, low delivery rates too, I guess). Of course this was very contentious if they pulled reports while the tables were loading. So in order to give a much more satisfactory experience to the user pulling reports in our web UI, we created indexed views that would automatically maintain the number of deliveries, bounces, etc.:
CREATE VIEW dbo.vMessageDeliveries WITH SCHEMABINDING AS SELECT MessageID, c = COUNT_BIG(*) FROM dbo.Deliveries GROUP BY MessageID; GO CREATE UNIQUE CLUSTERED INDEX m ON dbo.vMessageDeliveries(MessageID);
So now the inserts were a little more expensive but, as mentioned above, this was much better for overall system performance (at least perceived performance) and, more importantly, customer happiness.
Then customers wanted more details on their statistics. They wanted to see how their messages were getting into, say, Hotmail compared to AOL. We have this information, but it was quite expensive to retrieve. Remember earlier I noted that we don't typically store redundant information, but this means that in cases like this, we need to do a lot of joins. So for example, to provide a responsive report that showed real-time message status per domain, we created indexed views like this:
CREATE VIEW dbo.vMessageDomainDeliveries WITH SCHEMABINDING AS SELECT d.MessageID, r.DomainID, c = COUNT_BIG(*) FROM dbo.Deliveries AS d INNER JOIN dbo.Recipients AS r ON r.RecipientID = d.RecipientID GROUP BY d.MessageID, r.DomainID; GO CREATE UNIQUE CLUSTERED INDEX md ON dbo.vMessageDomainDeliveries(MessageID, DomainID);
Of course this again added more strain to the insert process on the deliveries table, since it now had to maintain two indexed views. In isolation, this still yielded better overall performance than essentially expanding that view and trying to get at the domain aggregates in real time. But it introduced a new problem to the system: severe blocking. Picture the case where deliveries are being recorded while a recipient is trying to update their preferences or change their e-mail address. Since both processes need to lock and potentially update the index on the indexed view, one has to wait for the other. This can lead to a sad face on an end user.
What I ended up doing was to recant my disdain for storing redundant information, and store the domain information in the deliveries table. Yes, this was an extra 4 bytes, but as we were moving the whole system to a new 2008 cluster with a much faster I/O subsystem, we would get much more than that back with page compression. So starting over, we still have the same Domains and Recipients tables, but now domain-based reports are driven from objects that look like these:
CREATE TABLE dbo.Deliveries ( MessageID INT, RecipientID INT, DomainID INT, EventDate SMALLDATETIME, /* , ... other columns ... */ ); GO CREATE VIEW dbo.vMessageDomainDeliveries WITH SCHEMABINDING AS SELECT MessageID, DomainID, c = COUNT_BIG(*) FROM dbo.Deliveries GROUP BY MessageID, DomainID; GO CREATE UNIQUE CLUSTERED INDEX md ON dbo.vMessageDomainDeliveries(MessageID, DomainID);
The up-front cost of calculating the domain of each recipient during insert is likely very equivalent to the system doing the same lookup while maintaining the indexed view (except we can use snapshot isolation in our query, but can't really enforce the same during clustered index updates). And in addition to no longer causing blocking on updates to the recipients table, we also get a much more accurate representation of history: the delivery is marked with the domain the recipient had on that day, since their e-mail address could have changed 20 times since then. All in all it has turned out that storing the domain information multiple times has helped contribute to making a very busy system suddenly seem almost idle.
The primary lesson learned here: indexed views can be a very helpful tool in your arsenal, but they can bring you down if you try to over-use them. Another lesson learned: do not assume that redundancy is the devil. It can help out in many scenarios, even if it feels a little dirty.
Alex, a lot of the changes are already processed in a queue pretty much exactly how you describe, and service broker is employed for those. However there are a few methods that need to be relayed back to the user immediately – instant feedback kind of things, like change this or that preference.
I think that for better concurrency instead of the indexed view we could log the relevant changes in a queue or insert them into a log table. Another process could read messages from that queue and modify the summary table manually. Service Broker could shine in this scenario. What do you think?
Gotcha. I figured you'd already be on top of that sort of setup.
I too am curious to see what Wes meant about synonyms.
Jamie, yes, the inserts are performed by batch. An external process actually sends the mail, and logs deliveries to a flat file. These logs are bulk inserted into a work table, processed there, then inserted into the main deliveries table on a schedule. This makes it very easy to schedule stats loading around periods of other activity, though for the most part this process runs every 10 minutes throughout the day.
Wes, since indexed views are schema-bound, I'm curious what you expect to gain from utilizing synonyms (other than, perhaps, simpler object names in the view definition). Even if synonyms were supported, it's not like you could magically point the synonym at a different object and expect the view to automatically "catch up"…
Too bad SQL Server indexed views can't be bound to synonyms.
I can identify with the "feels a little dirty" sensation. I always get a pang of guilt when I introduce redundancy into a model no matter what Ralph Kimball may have to say on the subject (that's not a sleight against Kimball by the way).
Just a question about the processing of the deliveries. I assume (perhaps wrongly) that its a more-or-less synchronous process; by that I mean a record gets inserted into [dbo].[Deliveries] as soon as an email is sent. I wonder if there is any scope to decouple those 2 activities e.g. An email gets sent and the insertion into [dbo].[Deliveries] is done at a different time (perhaps by employing a messaging infrastructure such as Service Broker). That way maybe you could batch-up more of the insertions and incur less strain on [dbo].[Deliveries].
Of course you then get into what exactly do the users mean when they say they want "real-time"? and that's another can of worms in itself!
Just a thought.