How a columnstore index is different
February 28th, 201117
How a columnstore index is different
February 28th, 201117
 
 

At the end of my SQL Server 2012 presentation at SQLSaturday #65 in Vancouver, a member of the audience asked, "What makes a columnstore index different from a regular nonclustered index?" At the end of a busy day, I was at a loss for an answer, and I'll explain why.

First, I'll briefly explain the basic, core, high-level functionality of a columnstore index (you can read a lot more details in this white paper). Basically, instead of storing index data together on a page, it divvies up the data from each column into its own set of pages. If you are after the data from only one column, this doesn't provide a great advantage over a traditional index, since you're not going to be able to store that many more rows on a page.

The reason I was having a hard time articulating the benefit is because I was thinking about a simple non-clustered index with one column, and comparing it to a columnstore index with only one column defined. The benefit of the columnstore index is that you can basically declare it against most or all of the columns in a table, effectively similar to creating an index on each column – and this is where column store indexes provide the largest benefit. The differences between a column store index and a handful of indexes for each column are (1) size due to the structure of the index and the fact that compression will likely work much better when you can fit more like values on a page, and (2) you only pay for the index maintenance on a columnstore index once, because of what I'll describe next.

The columnstore index has limitations that definitely make you feel like the grass is greener on the other side. Most importantly, in Denali, due to the cost of maintenance, a columnstore index will be read only… essentially, you will need to rebuild the index when data changes. Some other limitations: only one columnstore index per table, the index must be partition-aligned, can't be filtered, and the base object must be a table (so, no indexed views).

This feature is definitely geared to data warehouse scenarios, or where you are free to rebuild indexes nightly or on some other interval. I can't give you a good idea about the cost of maintaining this index manually compared to multiple traditional indexes, since the feature did not make the cut for the publicly available CTP1. But I hope to publish some blog posts with real numbers on both the read and write sides when the next public CTP is made available. I am fairly certain I will be able to demonstrate cases where this type of index will drastically increase read performance (without losing substantial write performance), provided that you are in a scenario where you can update the data on an infrequent schedule, such as nightly.

By: Aaron Bertrand

I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, write at Simple Talk, SQLPerformance, and MSSQLTips, and have had the honor of speaking at more conferences than I can remember. In non-tech life, I am a husband, a father of two, a huge hockey and football fan, and my pronouns are he/him.

17 Responses

  1. Guru says:

    Playing around to check if Column Store index would be preferred over Row based index and for what type of queries. For a simple table Row based index is being picked up, may be because all columns, rows are on same page.. Planning to get realistic data and check it out..
    http://www.consultguru.me/post/2011/08/21/Columnar-Store-Indexes-in-SQL-Server-2011.aspx

  2. AaronBertrand says:

    It is a more complex type of index, but in the context of what you're asking, it is an index – e.g. a copy of the data, not the data. It is read-only for performance reasons, not because of any changes to the underlying data.

  3. mssqldude says:

    Thanks Aaron … in reading the WP on columnstore in Denali, since INSERT, UPDATE, DELETE cannot be performed against a table with a columnstore index and only 1 columnstore index is allowed per table, it seemed to me that the columnstore is more than just an index.

  4. AaronBertrand says:

    It's an index, so it's not *the table* – you can dictate the ordering for each column (just like a normal index) and compression is definitely in effect (it's one of the major advantages).

  5. mssqldude says:

    Do you know if the columnstore index is modifying the way that the table is being stored (ordering & compression) or is it creating a copy of the data?

  6. Slapo says:

    I thought so, too. It's a pity – small businesses could use this feature as well. This way, they might have to use something like the community editions of Greenplum, InfinityDB, Inforbright…
    Prices of Enterprise+ Editions are a bit too steep for most smaller companies.

  7. AaronBertrand says:

    This hasn't been revealed yet, but I would put money on this being Enterprise+.

  8. Slapo says:

    Do you happen to know which editions of the new SQL Server are going to have the new index type?

  9. will s says:

    Thanks for the feedback.

  10. AaronBertrand says:

    Will, the biggest areas where you will see benefits are:
    (1) aggregates. Think about having all of the OrderTotal tuples on a sequential set of pages… you can get SUM, MIN, MAX, AVG etc. without having to read pages that are full of other information that is irrelevant to the query at hand.
    (2) compression. With pages of all INTs, for example, compression ratios will be better and this will mean lower reads (which we are almost always willing to trade CPU for). How overall storage will be impacted (since there will need to be some kind of key/RID-type locators on each set of pages) could go either way depending on a variety of factors.
    Now, it is also possible that the performance of certain other queries will suffer.  It will be hard to come to any meaningful conclusions until there is a CTP available that actually exposes the column store to us. I should draw some diagrams in the meantime.

  11. will s says:

    Thanks for posting on this, I've had the same question, especially in the case of a single column index. I know they're borrowing some SSAS concepts here, but at my current level of understanding I just don't see where the performance gains are coming from.

  12. Geoff says:

    My thought process is this:
    I have worked with Vertica and SQL Server. For massive boat loads of data, I would go with a columnar database (esp. one like Vertica which is built columnar from the ground up). For transactional database I would go with SQL Server.
    For BI stuff in the middle, when is Denali a good idea and when is a true columnar database a good idea? I honestly don't know.
    Btw, Vertica was recently acquired by HP.

  13. AaronBertrand says:

    Geoff, an easy answer is, when you already have a SQL Server Denali license. 🙂
    Of course the question isn't that simple. Are you talking about:
    (1) you're on 2005, 2008 or 2008 R2, and are curious whether you should upgrade to Denali to take advantage of columnstore indexes?
    (2) you're on 2005, 2008 or 2008 R2, know you want to use columnstore indexes, just not sure if you should move to Denali or move to another solution?
    (3) you're on 2005, 2008 or 2008 R2, know you want to use columnstore indexes, just not sure if you want to move to Denali and add them to your existing platform, or add another platform to co-exist and integrate with your existing solution?
    A lot of the criteria you'll evaluate to come to a decision will be common, but in some cases there will be certain factors that will be more or less relevant depending on your over-arching goals.

  14. Geoff says:

    The real question will be when to go this route with SQL Server instead of using a columnar database like Vertica or Sybase IQ.

  15. Gorm Braarvig says:

    …the star-join optimization must also be seen over again, the combination will be explosive. This is potentially a game-changer!

  16. AaronBertrand says:

    True, but that wouldn't make much of a blog post. 🙂

  17. Robert L Davis says:

    Short answer: The difference is 95% execution time. 😉