August 11, 2009 | SQL Server

SQL Server 2008 R2: A quick experiment in Unicode compression

Fellow MVP Simon Sabin blogged today about one of the few engine enhancements we'll be seeing in SQL Server 2008 R2: Unicode compression. You can read more about the topic in Books Online, but in short, NCHAR / NVARCHAR columns (but not NVARCHAR(MAX)), in objects that are row- or page-compressed, can benefit from additional compression; realistically, you can cut your storage requirements in half, depending on the language / character sets in use.
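If you want to gauge the potential benefit on one of your own tables before rebuilding anything, the estimation procedure that shipped with SQL Server 2008 works here too. A quick sketch (the dbo.test table name just matches the experiment below; substitute your own):

```sql
-- Estimates current vs. compressed size by sampling the object into tempdb.
-- Run against an R2 instance, the ROW estimate should reflect the new
-- Unicode compression for NCHAR/NVARCHAR columns as well.
EXEC sys.sp_estimate_data_compression_savings
    @schema_name      = N'dbo',
    @object_name      = N'test',
    @index_id         = NULL,   -- NULL = all indexes (0 = heap, 1 = clustered)
    @partition_number = NULL,   -- NULL = all partitions
    @data_compression = N'ROW';
```

Note that the procedure only produces an estimate based on a sample, so treat the numbers as directional rather than exact.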

But I'm the kind of guy who has to see it to believe it!  So, I mocked up a quick test storing some Danish/Norwegian (?) characters, and ran it both on a SQL Server 2008 instance (SP1 + CU3, 10.0.2723) and a SQL Server 2008 R2 instance (10.50.1092).

USE tempdb;
GO
SET NOCOUNT ON;
GO
-- row compression enabled at create time
CREATE TABLE dbo.test(foo NVARCHAR(2048)) WITH (DATA_COMPRESSION = ROW);
GO
-- RAND() gives each row a slightly different prefix;
-- REPLICATE pads the rest with 1,500 repeated non-ASCII characters
INSERT dbo.test(foo) SELECT RTRIM(RAND()) + REPLICATE(N'øååøæ', 300);
GO 50000
EXEC sys.sp_spaceused N'dbo.test';
GO
DROP TABLE dbo.test;
GO

SQL Server 2008 results:

(screenshot: sp_spaceused output)

SQL Server 2008 R2 results:

(screenshot: sp_spaceused output)

The difference is astounding: a space savings of roughly 60%, FOR FREE.  That's right, this is an enhancement you get just by upgrading… I did nothing differently in the creation of these tables except continue to use compression.  (Note that I also performed this test with page compression, and the results were identical all around.)  Keep in mind that if you upgrade at some point (you can't upgrade to R2 in its present pre-release form), you will need to rebuild your indexes in order to apply this new compression method to the existing data in a table.
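The rebuild itself is just the usual compression syntax; a minimal sketch, using the test table from above as a stand-in for your own objects:

```sql
-- Rebuild the heap (or clustered index), keeping row compression;
-- on R2 this re-stores existing NCHAR/NVARCHAR data with Unicode compression.
ALTER TABLE dbo.test REBUILD WITH (DATA_COMPRESSION = ROW);

-- Nonclustered indexes must be rebuilt separately to pick it up as well.
ALTER INDEX ALL ON dbo.test REBUILD WITH (DATA_COMPRESSION = ROW);
```

On a large table this is an offline, size-of-data operation unless you also specify ONLINE = ON, so plan the rebuild accordingly.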

And of course, by "free," I am not talking about the licenses.  As with row and page compression, this feature is only available in Developer, Enterprise, and Enterprise Evaluation editions.

Now, I'll admit, the test is not super-realistic, and it is biased toward good compression (the same pattern repeats on every row and on every page), so it demonstrates something pretty close to a best-case scenario.  But even the worst case is not exactly "bad": you may not see any gain at all, but you can't lose anything either, because the compression algorithm in SQL Server 2008 is smart enough to know when it would actually *lose* space by compressing, and won't do it in that case.

At some point I will test the performance of writing, reading and seeking against a Unicode compressed table (and I will come up with more plausible test data at that point).  Because nothing is ever really free, is it?  Stay tuned to find out.

7 comments on this post

    • Denis Gobo - August 11, 2009, 9:46 PM

      That is a nice storage saving, unfortunately I almost use no Unicode at all at the moment

    • AaronBertrand - August 11, 2009, 9:58 PM

      I use a lot of Unicode, since we have to support several foreign languages and symbols like Euro and pound.  All of this data gets entered via a Web UI.  Unfortunately, while about 1/3 of these columns are NVARCHAR(64 or 255), the rest are NVARCHAR(MAX), which won't benefit from this compression at all… even when the data is stored in-row.

    • Denis Gobo - August 11, 2009, 10:03 PM

      <Sarcasm>
      Mmmm, that 255 number sounds familiar, did you upgrade this from Access with the upgrade wizard?</Sarcasm>
      I do have some unicode but it is in lookup tables, for example  道琼斯第一财经中国600指数

    • AaronBertrand - August 11, 2009, 10:08 PM

      255 came at us in a few cases as a "requirement"… wherever 64 was not enough, they pushed for 255.  Their magic number, not mine.  🙂  And the 64, well, that was legacy… it was there long before I ever got my hands on the schema.

    • Linchi Shea - August 11, 2009, 10:10 PM

      Denis;
      > unfortunately I almost use no Unicode at all at the moment
      But that means you are saving space already 🙂

    • cinahcaM madA - August 11, 2009, 10:22 PM

      One issue is that compression brings a lot of CPU overhead; in my tests I've seen up to 2x the amount of time required for retrieval of hot pages. I wonder if UTF8 support would be just as beneficial without having as much overhead?

    • AaronBertrand - August 11, 2009, 11:25 PM

      Agreed, I am definitely interested in seeing how this affects performance and if there is any noticeable impact over and above page/row compression.  I think it all depends on the nature of the data and the decisions the engine has to make as it compresses/decompresses.
      In the long run, we're I/O-bound, not CPU-bound, so we're probably better off getting any I/O gain we can, even if it costs us a little CPU.  YMMV.
