Playing with page compression – for real
December 29th, 2009

During maintenance windows and against test databases, I have been applying page compression to some largish tables (~200+ million rows, ~50GB).  These tables are heavy write and medium read, but all writes are inserts (no updates).  Clustered indexes are on a DATETIME column, and then on INT columns.  Since there will be very repetitive data within each page, they seemed like decent candidates for compression. 

The actual rebuild tends to take about 3.5 hours on average, sometimes a little more, for tables of this size.  Before I run the rebuild, I grab an estimate of how much space savings I should expect, to make sure it will be worth it:

EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'table',
     @index_id         =  NULL,
     @partition_number =  NULL,
     @data_compression = N'PAGE';

(Quick observations: Why can't the procedure take schema + object name together?  Or object_id?  Seems so archaic to still be passing in schema name as a separate param.  Also, why can't the result estimate how long the actual index rebuild(s) will take?  I digress, but wanted to jot those questions down somewhere.)
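
In the meantime, here is a rough workaround sketch (my own, not anything built into the procedure): resolve the schema and object names from a single object_id using OBJECT_SCHEMA_NAME() and OBJECT_NAME(), then pass them in separately.  The dbo.table name is just a placeholder.

-- Rough workaround sketch: derive schema/object names from one object_id
-- (dbo.table is a placeholder), then pass them to the procedure separately.
DECLARE @object_id   INT = OBJECT_ID(N'dbo.table'),
        @schema_name SYSNAME,
        @object_name SYSNAME;

SELECT @schema_name = OBJECT_SCHEMA_NAME(@object_id),
       @object_name = OBJECT_NAME(@object_id);

EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = @schema_name,
     @object_name      = @object_name,
     @index_id         =  NULL,
     @partition_number =  NULL,
     @data_compression = N'PAGE';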

The estimates for space savings seem to be a little exaggerated in all cases so far.  Not huge differences, but maybe in the 20% range.  So when it says it is going to cut your 40GB table to 20GB, expect the end result to be somewhere around 25GB.  What I neglected to do, time after time, was save the current size and the estimate, and compare them after the fact.  When I tried to run the estimate while the rebuild was running, I came across an interesting problem.  Not that I expected accurate results while the table was being rebuilt, but don't run sp_estimate_data_compression_savings again while you are rebuilding that table's index(es) online.  You will get an error message like this:

Msg 1913, Level 16, State 1
The operation failed because an index or statistics with name
'sample_indexDBA05385A6FF40F888204D05C7D56D2B_clustered' already
exists on table '#sample_tableDBA05385A6FF40F888204D05C7D56D2B'.

This is quite odd, because I tried it in a new window, which shouldn't have any #temp objects associated with a different session.  My guess is that something in there creates a #temp table without appending the system-generated uniqueifiers we are used to seeing.  When I looked through the catalog views (sys.indexes, sys.all_objects, etc.), I couldn't tell whether this object was attached to the original table, to the new shadow copy being used for the online rebuild, or to something else.  And of course the sp_estimate_data_compression_savings procedure contains several elements I can't chase down, like the system function sys.generate_index_ddl.  In any case, I think the error message could be a little more intuitive, since there must be some way (e.g. checking for schema locks) to determine that a DDL operation against the table is currently in progress.
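
A rough sketch of that idea (not what the procedure actually does, and dbo.table is again a placeholder): look for schema locks held by other sessions on the target object before running the estimate.  Keep in mind an online rebuild typically holds its Sch-M lock only briefly at the end, so this wouldn't catch everything; it just illustrates the kind of check I mean.

-- Sketch only: check for schema locks held by other sessions on the table
-- before running the estimate. An online rebuild holds Sch-M only briefly
-- at the end, so this is not a complete check, just the general idea.
SELECT l.request_session_id,
       l.request_mode,
       l.request_status
  FROM sys.dm_tran_locks AS l
 WHERE l.resource_type                 = 'OBJECT'
   AND l.resource_database_id          = DB_ID()
   AND l.resource_associated_entity_id = OBJECT_ID(N'dbo.table')
   AND l.request_mode IN (N'Sch-S', N'Sch-M')
   AND l.request_session_id <> @@SPID;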

Anyway, it would be nice to be able to compare, after the fact, how much space the procedure originally predicted compression would save; instead, I would constantly run the estimate and then rebuild the index in the same window.  Which means my estimates would disappear and eventually be replaced with "The command(s) completed successfully."  So what I started doing was running the estimate in one window, then immediately launching a new window and running a before-and-after script like this:

EXEC sys.sp_spaceused N'dbo.table';
GO
 
SELECT * FROM sys.dm_db_partition_stats
    WHERE [object_id] = OBJECT_ID(N'dbo.table');
GO
 
ALTER INDEX index_a ON dbo.table
    REBUILD WITH
    (
        ONLINE           = ON,
        DATA_COMPRESSION = PAGE
    );
 
    -- ... other indexes ...
GO
 
EXEC sys.sp_spaceused N'dbo.table';
GO
 
SELECT * FROM sys.dm_db_partition_stats
    WHERE [object_id] = OBJECT_ID(N'dbo.table');
GO

Now, as long as I don't re-use or close those query windows after everything is done, I'll have the *actual* before and after statistics.  I just need to remember to hold on to the other window where I ran the original estimate (or run the estimate at the beginning of the above set of batches in a single run), so I can see how far off the estimates were.  So far my batting average on remembering to do that has been pretty low, so I don't have any screen shots or other proof of where I lined up in terms of estimate vs. actual.  But my observations were consistent in that the estimates were slightly inflated every single time.  This isn't a complaint, just something to keep in mind when you are running estimates for your own tables. 
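
One way around my own forgetfulness would be to dump everything into real tables instead of relying on query windows.  A sketch of that idea only: the table names below are hypothetical, dbo.table is a placeholder, and the column list assumes the documented eight-column result set of sp_estimate_data_compression_savings.

-- Sketch only: persist the estimate and the before/after page counts so they
-- survive closed or re-used query windows. dbo.compression_estimate and
-- dbo.compression_actuals are hypothetical names.
CREATE TABLE dbo.compression_estimate
(
    [object_name]            SYSNAME,
    [schema_name]            SYSNAME,
    index_id                 INT,
    partition_number         INT,
    size_current_kb          BIGINT,
    size_requested_kb        BIGINT,
    sample_size_current_kb   BIGINT,
    sample_size_requested_kb BIGINT
);

INSERT dbo.compression_estimate
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'table',
     @index_id         =  NULL,
     @partition_number =  NULL,
     @data_compression = N'PAGE';

-- snapshot actual page counts before the rebuild ...
SELECT 'before' AS stage, GETDATE() AS captured_at,
       [object_id], index_id, partition_number,
       reserved_page_count, used_page_count, row_count
  INTO dbo.compression_actuals
  FROM sys.dm_db_partition_stats
 WHERE [object_id] = OBJECT_ID(N'dbo.table');

-- ... run the rebuilds, then run the same SELECT again with 'after' as the
-- stage, using INSERT dbo.compression_actuals instead of SELECT ... INTO.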

Another thing I'd like to be able to do is monitor progress.  As I complained on Connect, online index rebuilds don't bother to update percent_complete in sys.dm_exec_requests.  So, it is very tough to judge whether the rebuild operation is 5% done, 50% done, or 95% done.  I would argue that this information is often useful, especially when the rebuild is still running and you are nearing the end of your maintenance window.  The Connect item is here:

#520072 : Online index operations should populate percent_complete in sys.dm_exec_requests

In this case I was trying to determine what was going on while a rebuild was occurring, and watched sys.dm_exec_requests.writes for the session running the rebuild.  I was seeing numbers up over 17 million, yet the grand total from sys.dm_db_partition_stats.reserved_page_count was < 5 million pages before the rebuild even started.  I would understand if it were double, since an online rebuild builds a shadow copy of the table in the meantime.  But since the new copy is compressed, I would expect its writes to come in well under 100% of the original size of the table, never mind over 300%.

It feels like I am missing something here, but I haven't yet had time to delve into it further.  So, it is still a mystery to me what exactly the writes column represents when performing an online index rebuild with compression.  Anyone with any information on this, I'm all ears… it would be great to be able to write a script that gives a rough estimate of how far along the compression is, based on the number of pages in the original table and the current number of writes for the session that is running the rebuild.
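
For what it's worth, this is the kind of rough gauge I have in mind, purely a sketch: compare the session's writes against the reserved page count captured before the rebuild started.  Given the 300%+ behavior described above, treat the ratio as a very loose signal rather than a percent-complete; the session_id and page count below are placeholders you would supply yourself.

-- Purely a sketch: compare writes for the rebuild session against the page
-- count captured before the rebuild started. @rebuild_session_id and
-- @pages_before are placeholders, not something this query can discover.
DECLARE @rebuild_session_id INT    = 53,        -- session running the ALTER INDEX
        @pages_before       BIGINT = 5000000;   -- reserved_page_count before the rebuild

SELECT r.session_id,
       r.command,
       r.percent_complete,   -- not populated for online rebuilds (see the Connect item above)
       r.writes,
       @pages_before AS pages_before,
       CONVERT(DECIMAL(6,1), 100.0 * r.writes / @pages_before) AS writes_vs_pages_pct
  FROM sys.dm_exec_requests AS r
 WHERE r.session_id = @rebuild_session_id;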

These little complications aside, I am experiencing some great results with data compression so far.  Along with some other optimizations and smarter data archival processes, I have brought a 430GB database to just under 180GB, with actual improvements in throughput and overall performance (yes Adam, really, the CPU overhead is not noticeable).  And I haven't even started playing with the new Unicode compression in SQL Server 2008 R2, at least on real data (all of my tests in previous blog posts were on completely made-up data). 

As always, your mileage may vary; depending on your data, data types and index structure, you may observe completely different estimates and actuals than I did, and if you are already CPU-bound, your performance may not improve at all.  If you have the luxury, I highly recommend running a full load test against the compressed versions to ensure you are going to get acceptable results.  You might also want to rebuild your indexes with compression on that test system, while your load tests are running, so you can get a feel for how long the rebuild will take, and also to observe how your system handles both regular load and an online rebuild.

I'd be happy to hear about your experiences with data compression, good or bad!

By: Aaron Bertrand

I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, write at Simple Talk, SQLPerformance, and MSSQLTips, and have had the honor of speaking at more conferences than I can remember. In non-tech life, I am a husband, a father of two, a huge hockey and football fan, and my pronouns are he/him.

5 Responses

  1. AaronBertrand says:

    Right Wes, space savings are nice, but I/O was our weakest link.  We've moved from Raid 5, not-properly-aligned storage with 8k alloc size on 8-CPU Server 2003 x86 with 16 GB, to Raid 10 on properly aligned storage with 64k alloc size on 16-CPU Server 2008 R2 x64 with 32 GB.  The combined impact has been phenomenal, and I plan to write a post about it soon – especially showing SQL Sentry Performance Advisor dashboard views, before and after.  The screen shots alone should blow some minds.

  2. Wes W. says:

    Also another user very happy using page compression on our mostly read-only data warehouse.  Have occasionally seen tables drop to less than half the original size, though my primary desire was to decrease I/O on the storage as that was our weakest link.

  3. AaronBertrand says:

    Uri, I went for page compression because of the nature of my data.  In addition to row compression, page also includes dictionary compression for multiple rows on the same page that contain the same data.  Since I have many, many rows per page with repeated data (the datetime column is derived from log files that are bulk inserted, so many rows will be timestamped identically), the results for page were about 40% better than for row alone on the primary tables.
    Also, since we were I/O bound and not CPU-bound for this specific customer, my hunch turned out to be right that maximum compression would turn out to be a win-win in terms of storage savings and performance.

  4. Denis Gobo says:

    Aaron, I am also getting good results with compression; page compression for my type of data is always a little better than row compression.  I did post some sample data and perf numbers here on sqlblog a while back:
    http://sqlblog.com/blogs/denis_gobo/archive/2009/04/07/13137.aspx
    Overall I am very happy with compression; combine this with partitioning and I am a really happy camper 🙂  And in the days of expensive SANs ($30 per GB), compression will also save some serious money.

  5. Uri Dimant says:

    Aaron
    Interesting. I compressed an 800GB database to just under 450GB, but mine is a read-only database and my decision was to go with PAGE level. On the other hand, you have mostly inserts and medium reads, so why did you go for PAGE and not for ROW level…? Just interested.
    http://msdn.microsoft.com/en-us/library/dd894051.aspx