More testing of Unicode compression
October 4th, 2009

Background

Several weeks ago, I ran some tests on the new implementation of Unicode Compression in SQL Server 2008 R2 – both space-related and performance-related.  Of course, it was pointed out that my performance tests were somewhat flawed, because I was using data from a very wide and atypical variety of character sets.  I also didn't like that I was comparing too many different things in the same pass.  So I thought I would run one more battery of tests and right a couple of wrongs.

So this time I decided to do narrower testing: simply SQL Server 2008 RTM vs. SQL Server 2008 R2 (August CTP, build 10.50.1092), both on x64.

The Test

Instead of comparing page to row to none, and using different types of indexes for each test, I created two tables in each database: one with page compression (which includes row compression), and one with no compression.  The tables are otherwise identical, and I populated them with the exact same 500,000 rows, so update and read performance will be based on identical data (except for the compression, of course).  All four tables have one clustered index and one nonclustered index:

USE CompressionTesting;
GO
 
CREATE TABLE dbo.Customers_Compressed
(
   CustomerID      INT PRIMARY KEY,
   FirstName       NVARCHAR(64),
   LastName        NVARCHAR(64),
   Address1        NVARCHAR(255),
   Address2        NVARCHAR(255),
   City            NVARCHAR(64),
   [Region]        NVARCHAR(64),
   PostalCode      NVARCHAR(10),
   Country         NVARCHAR(64),
   Phone           VARCHAR(16),
   Email           VARCHAR(320),
   [Password]      VARBINARY(16),
   DateCreated     SMALLDATETIME,
   DateModified    SMALLDATETIME
)
   WITH (DATA_COMPRESSION = PAGE);

I ran the above script on both instances, and then again for a table called dbo.Customers_Uncompressed.  As you can guess, the only difference is that dbo.Customers_Uncompressed does not include the WITH (DATA_COMPRESSION = PAGE) option.
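
The post doesn't show the nonclustered index definition, so here is a minimal sketch of what it might look like, assuming it covers LastName and FirstName (the key columns are my guess, not from the original).  Note that DATA_COMPRESSION is set per index, so a nonclustered index stays uncompressed unless you ask for it explicitly:

-- Hypothetical nonclustered index; the actual key columns aren't given in the post.
CREATE NONCLUSTERED INDEX IX_Customers_Name
    ON dbo.Customers_Compressed (LastName, FirstName);
 
-- Same index on the uncompressed table (index names only need to be unique per table).
CREATE NONCLUSTERED INDEX IX_Customers_Name
    ON dbo.Customers_Uncompressed (LastName, FirstName);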

Instead of spending hours perfecting a script that would simulate random data, I decided to use an off-the-shelf tool.  The one I was most familiar with was Red Gate's SQL Data Generator, though there are several other ways to avoid re-inventing the wheel, including built-in support in Visual Studio Team System 2008 Database Edition (read more on GertD's blog).

I made a copy of one of the tables above (structure only) to prime a single source table on each instance.  This way I could populate all four real tables that I wanted to measure, with everything else being the same.  So on each instance I created this table in tempdb:

USE [tempdb];
GO
 
CREATE TABLE dbo.PrimeData
(
    CustomerID     INT IDENTITY(1,1) PRIMARY KEY,
    FirstName      NVARCHAR(64),
    LastName       NVARCHAR(64),
    Address1       NVARCHAR(255),
    Address2       NVARCHAR(255),
    City           NVARCHAR(64),
    [Region]       NVARCHAR(64),
    PostalCode     NVARCHAR(10),
    Country        NVARCHAR(64),
    Phone          VARCHAR(16),
    Email          VARCHAR(320),
    [Password]     VARBINARY(16),
    DateCreated    SMALLDATETIME,
    DateModified   SMALLDATETIME
);

Then I pointed SQL Data Generator at this table on my R2 instance.  I made some adjustments to the defaults; first, I changed the target rows from 1,000 to 500,000.  And instead of using the wide range of characters I used before, I adjusted the sample data to include mainly Finnish last names, from this page.  (I simply created a new NamesLast.txt file, and pointed the tool at that file instead of the default file.)  Since most last names didn't have any Unicode characters at all, I threw in a few more names from various character sets in this Wikipedia entry.  While not as widely random as my previous experiment, I still think there is enough variance here to represent a typical internationalized application.  I did a similar thing with the first names, but left addresses and all of the other columns alone.  Once I was happy with the way the sample data would be generated, I let 'er rip.  It populated the 500,000 rows in about 20 seconds.

To get the exact same data over to the RTM instance, I did not run the generator tool again, as that would create different data.  Instead, I ran the following script on the RTM instance to prime the local table with the exact same data:

USE [tempdb];
GO
 
SET IDENTITY_INSERT dbo.PrimeData ON;
 
INSERT dbo.PrimeData
(
    [CustomerID],
    [FirstName],
    [LastName],
    [Address1],
    [Address2],
    [City],
    [Region],
    [PostalCode],
    [Country],
    [Phone],
    [Email],
    [Password],
    [DateCreated],
    [DateModified]
)
SELECT *
    FROM [R2_LinkedServer].tempdb.dbo.PrimeData;
 
SET IDENTITY_INSERT dbo.PrimeData OFF;
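
As a quick sanity check that both instances really ended up with identical data (this step isn't in the original post), a row count and aggregate checksum comparison across the linked server is one option.  It's a rough check only, since CHECKSUM can theoretically collide; a minimal sketch, using the same placeholder linked server name as above:

-- Compare the local copy against the source table on the R2 instance.
SELECT
    LocalRows   = (SELECT COUNT(*) FROM tempdb.dbo.PrimeData),
    RemoteRows  = (SELECT COUNT(*) FROM [R2_LinkedServer].tempdb.dbo.PrimeData),
    LocalCheck  = (SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM tempdb.dbo.PrimeData),
    RemoteCheck = (SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM [R2_LinkedServer].tempdb.dbo.PrimeData);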

Now I was ready to populate the main tables.  But before I started, I wanted to set up a server-side trace on each instance, so I could measure various aspects of the operations.  This trace setup looks alarmingly like one you might have seen before in my previous Unicode Compression testing.  The only thing I've dropped is the filter on login; I am not running anything else on this server, so I didn't need to worry about that.  So I started this trace on both the 2008 RTM and the 2008 R2 instances:

USE [master];
GO
 
DECLARE
    @rc INT,
    @TraceID INT,
    @TraceName SYSNAME = N'C:\temp\CT_' +
        CONVERT(SYSNAME, SERVERPROPERTY('InstanceName')),
    @MaxFS BIGINT = 5;
 
EXEC 
    @rc = sp_trace_create
    @TraceID OUTPUT,
    0,
    @TraceName,
    @MaxFS,
    NULL;
 
-- Event 12 = SQL:BatchCompleted; capture the columns needed for the analysis.
EXEC sp_trace_setevent @TraceID, 12, 15, 1;  -- EndTime
EXEC sp_trace_setevent @TraceID, 12, 16, 1;  -- Reads
EXEC sp_trace_setevent @TraceID, 12, 1,  1;  -- TextData
EXEC sp_trace_setevent @TraceID, 12, 17, 1;  -- Writes
EXEC sp_trace_setevent @TraceID, 12, 14, 1;  -- StartTime
EXEC sp_trace_setevent @TraceID, 12, 18, 1;  -- CPU
EXEC sp_trace_setevent @TraceID, 12, 11, 1;  -- LoginName
EXEC sp_trace_setevent @TraceID, 12, 35, 1;  -- DatabaseName
EXEC sp_trace_setevent @TraceID, 12, 12, 1;  -- SPID
EXEC sp_trace_setevent @TraceID, 12, 13, 1;  -- Duration
 
-- Only capture batches running in the CompressionTesting database
-- (column 35 = DatabaseName, comparison operator 6 = LIKE).
EXEC sp_trace_setfilter @TraceID, 35, 0, 6, N'CompressionTesting';
 
EXEC sp_trace_setstatus @TraceID, 1;
 
SELECT TraceID = @TraceID, TraceFile = @TraceName + N'.trc';
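
The traces get turned off later, after all four test scripts have finished; that snippet isn't shown in the post, but it amounts to two status changes against the trace ID returned above.  A minimal sketch:

DECLARE @TraceID INT = 2;  -- substitute the trace ID reported by the setup script
 
EXEC sp_trace_setstatus @TraceID, 0;  -- stop collecting events
EXEC sp_trace_setstatus @TraceID, 2;  -- close the trace and delete its definition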

Now, for the tests themselves.  I am measuring writing to and reading from each copy of the table.  It starts with an initial population of the 500,000 rows, then some selects, then some updates.  Finally, I dump the index stats from sys.dm_db_partition_stats into a table.  When all four versions of the script are done, I turn off the trace on both systems.  Here is a sample script:

USE CompressionTesting;
GO
 
-- R2, compressed
INSERT dbo.Customers_Compressed
(
    [CustomerID],
    [FirstName],
    [LastName],
    [Address1],
    [Address2],
    [City],
    [Region],
    [PostalCode],
    [Country],
    [Phone],
    [Email],
    [Password],
    [DateCreated],
    [DateModified]
)
SELECT * FROM tempdb.dbo.PrimeData;
 
-- R2, compressed
SELECT TOP 1 * FROM dbo.Customers_Compressed ORDER BY NEWID();
GO 10
 
-- R2, compressed
SELECT TOP 1 email FROM dbo.Customers_Compressed ORDER BY LastName, FirstName;
GO 10
 
-- R2, compressed
SELECT TOP 1 * FROM dbo.Customers_Compressed WHERE FirstName LIKE 'a%';
GO 10
 
-- R2, compressed
SELECT TOP 1 * FROM dbo.Customers_Compressed WHERE LastName LIKE '%a';
GO 10
 
-- R2, compressed
UPDATE dbo.Customers_Compressed SET LastName = REPLACE(LastName, 'a', 'b') WHERE LastName LIKE '%a%';
UPDATE dbo.Customers_Compressed SET LastName = REPLACE(LastName, 'c', 'b') WHERE LastName LIKE '%c%';
UPDATE dbo.Customers_Compressed SET LastName = REPLACE(LastName, 'd', 'b') WHERE LastName LIKE '%d%';
UPDATE dbo.Customers_Compressed SET LastName = REPLACE(LastName, 'e', 'b') WHERE LastName LIKE '%e%';
UPDATE dbo.Customers_Compressed SET LastName = REPLACE(LastName, 'f', 'b') WHERE LastName LIKE '%f%';
UPDATE dbo.Customers_Compressed SET LastName = REPLACE(LastName, 'g', 'b') WHERE LastName LIKE '%g%';
UPDATE dbo.Customers_Compressed SET LastName = REPLACE(LastName, 'h', 'b') WHERE LastName LIKE '%h%';
UPDATE dbo.Customers_Compressed SET LastName = REPLACE(LastName, 'i', 'b') WHERE LastName LIKE '%i%';
UPDATE dbo.Customers_Compressed SET LastName = REPLACE(LastName, 'j', 'b') WHERE LastName LIKE '%j%';
UPDATE dbo.Customers_Compressed SET LastName = REPLACE(LastName, 'k', 'b') WHERE LastName LIKE '%k%';
 
-- R2, compressed
UPDATE dbo.Customers_Compressed SET FirstName = REPLACE(FirstName, 'a','c') WHERE FirstName LIKE '%a%';
UPDATE dbo.Customers_Compressed SET FirstName = REPLACE(FirstName, 'b','c') WHERE FirstName LIKE '%b%';
UPDATE dbo.Customers_Compressed SET FirstName = REPLACE(FirstName, 'd','c') WHERE FirstName LIKE '%d%';
UPDATE dbo.Customers_Compressed SET FirstName = REPLACE(FirstName, 'e','c') WHERE FirstName LIKE '%e%';
UPDATE dbo.Customers_Compressed SET FirstName = REPLACE(FirstName, 'f','c') WHERE FirstName LIKE '%f%';
UPDATE dbo.Customers_Compressed SET FirstName = REPLACE(FirstName, 'g','c') WHERE FirstName LIKE '%g%';
UPDATE dbo.Customers_Compressed SET FirstName = REPLACE(FirstName, 'h','c') WHERE FirstName LIKE '%h%';
UPDATE dbo.Customers_Compressed SET FirstName = REPLACE(FirstName, 'i','c') WHERE FirstName LIKE '%i%';
UPDATE dbo.Customers_Compressed SET FirstName = REPLACE(FirstName, 'j','c') WHERE FirstName LIKE '%j%';
UPDATE dbo.Customers_Compressed SET FirstName = REPLACE(FirstName, 'k','c') WHERE FirstName LIKE '%k%';
 
-- R2, compressed, drop
INSERT dbo.indexstats
SELECT
   'R2, compressed',
   index_id,
   in_row_data_page_count,
   used_page_count,
   reserved_page_count,
   row_count
FROM
   sys.dm_db_partition_stats
WHERE
   [object_id] = OBJECT_ID('dbo.Customers_Compressed');

The only differences, of course, are that on the RTM instance the comments say RTM (so I don't get my trace results confused), and the scripts that run against the uncompressed version of the table reference dbo.Customers_Uncompressed instead.
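
One thing the scripts reference but the post never defines is dbo.indexstats, the table the partition stats get dumped into.  A minimal sketch of a table that would accept those rows (the column names and types are my assumptions, based on the INSERT above):

-- Hypothetical definition; the post never shows dbo.indexstats.
CREATE TABLE dbo.indexstats
(
    scenario                 VARCHAR(32),   -- e.g. 'R2, compressed'
    index_id                 INT,
    in_row_data_page_count   BIGINT,
    used_page_count          BIGINT,
    reserved_page_count      BIGINT,
    [row_count]              BIGINT
);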

The Results

Given the results of my previous tests, the outcome was a little surprising to me this time.  First we'll deal with space savings on the clustered index.  It is no surprise at all that using data compression in R2 yielded significant space savings.  In this case, the uncompressed version of the table weighed in at 122 MB, while the compressed version came in at around 81 MB (~33% savings).  On RTM, however, compression didn't buy us much at all: the table went from 122 MB to 121 MB.  (The nonclustered index stayed constant at about 56 MB in all four cases.)  Here is the overall layout for all four scenarios:

 
<img src="https://sqlblog.org/wp-content/uploads/2018/01/17258_Screen20shot202009-10-0420at209.29.3520PM.png" width="554" height="338"> 
<i>Space used, in kilobytes</i>

And then for performance, I graphed out total CPU usage, total duration, and total I/O:

 
<img src="https://sqlblog.org/wp-content/uploads/2018/01/17258_Screen20shot202009-10-0420at209.29.5920PM.png" width="554" height="338"> 
<i>CPU time, in milliseconds</i>
 
 
<img src="https://sqlblog.org/wp-content/uploads/2018/01/17258_Screen20shot202009-10-0420at209.30.2720PM.png" width="554" height="337">
<i>Elapsed time, in seconds</i>
 
<img src="https://sqlblog.org/wp-content/uploads/2018/01/17258_Screen20shot202009-10-0420at209.30.4220PM.png" width="554" height="338"> 
<i>Number of 8KB read or write operations</i>
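
For reference, the totals in these charts were pulled from the server-side trace files.  The post doesn't show that step; here is a minimal sketch using fn_trace_gettable, where the file name follows the trace setup above (with the actual instance name filled in) and the filter on the embedded batch comments is one way to split the totals by scenario:

-- Load the trace file and total the interesting columns for one scenario.
-- Note: Duration is reported in microseconds, CPU in milliseconds.
SELECT
    TotalCPUms       = SUM(CPU),
    TotalDurationSec = SUM(Duration) / 1000000.0,
    TotalReads       = SUM(Reads),
    TotalWrites      = SUM(Writes)
FROM sys.fn_trace_gettable(N'C:\temp\CT_<InstanceName>.trc', DEFAULT)
WHERE CONVERT(NVARCHAR(MAX), TextData) LIKE N'%R2, compressed%';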

Conclusion

In my tests, Unicode Compression in 2008 R2 provides great I/O and space gains at a very marginal CPU and duration overhead (IMHO).  Duration itself wasn't so bad for R2 compressed vs. RTM compressed, especially given that RTM compressed didn't buy us any measurable space savings.  Again, whether or not data compression is right for you (in 2008 or 2008 R2) will depend on a lot of factors, including the characteristics of your current hardware, workloads, data distribution and access patterns.  I just hope I'm adding some insight into how you can test this for yourself in your own environment (it is not my intention to say, "See, you *should* use compression!").

By: Aaron Bertrand

I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, write at Simple Talk, SQLPerformance, and MSSQLTips, and have had the honor of speaking at more conferences than I can remember. In non-tech life, I am a husband, a father of two, a huge hockey and football fan, and my pronouns are he/him.

4 Responses

  1. AaronBertrand says:

    leoniv, remember that this R2 engine was an early CTP and not necessarily indicative of all of the optimizations that will work correctly in the release version.  What I was trying to show was that *perhaps*, for some people who are using compression and have systems bound by certain resources, R2 will improve their performance in some way (or that people may skip 2008 and go straight to R2, bypassing one round of licensing costs).  I think if you want to measure the speed of the engine for your own purposes, compression aside, you should grab the most recent CTP and test your workload against your data on your hardware.

  2. leoniv says:

    If we throw out compression for a second, according to your graphs it looks as if R2 Uncompressed is a tad slower and has a little more overhead (higher CPU and I/O) than RTM Uncompressed.  I'm not sure how I/O could be higher for one over the other given the tables are identical (query optimizer funkiness?).  That makes me think the R2 engine itself is a little slower vs the RTM version.  Is that what you saw in your testing?  If that's the case, between the hike in licensing $$ for R2 and being a step slower, I'll sit this one out.

  3. cinahcaM madA says:

    Great job, Aaron!

  4. Linchi Shea says:

    Aaron;
    Well done.
    It may be more immediately readable if the unit of every horizontal axis were identified.