BackToBasics: Common Table Expressions (CTEs)

At the end of last year, I accepted a challenge for 2016: to publish one post per month that makes fewer assumptions about the reader's knowledge. In the first post, which came up rather quickly I might add, I'm going to talk about a concept in SQL Server introduced in SQL Server 2005: the Common Table Expression (CTE).

What is it?

A CTE is probably best described as a temporary inline view - in spite of its official name, it is not a table, and it is not stored (like a #temp table or @table variable). It operates more like a derived table or subquery, and can only be used for the duration of a single SELECT, UPDATE, INSERT, or DELETE statement (though it can be referenced multiple times within that statement). It can also be used to drive a MERGE statement, but I'm not a big fan of those.

Paul White (@SQL_Kiwi) once said that a better name might have been Common View Expression (CVE); I think Temporary View or Inline View would have been more on point as well.

People don't generally know to call it a CTE because the label of CTE isn't in the syntax - it simply appears like this, using the keyword WITH:

;WITH some_alias_name AS (SELECT ...

This is not the only use for WITH in Transact-SQL - you may also have seen it used for table hints (e.g. WITH (NOLOCK)), database setting options (e.g. WITH ROLLBACK IMMEDIATE), index options (e.g. WITH (DROP_EXISTING = ON)), backup options (e.g. WITH RECOVERY) and others.

Why is it used?

A CTE can serve many purposes, some of which could be swapped with derived table or subquery variants.

Eliminate redundant expressions

Sometimes you have a complex expression that you reference multiple times, such as:

SELECT 
  h = MAX(DATEADD(DAY, 30, column_name)),
  l = MIN(DATEADD(DAY, 30, column_name)),
  a = AVG(DATEADD(DAY, 30, column_name))
FROM dbo.some_table;

A CTE can simplify this by only having to run the DATEADD expression once:

;WITH cte AS
(
  SELECT d = DATEADD(DAY, 30, column_name)
  FROM dbo.some_table
)
SELECT 
  h = MAX(d),
  l = MIN(d),
  a = AVG(d)
FROM cte;

Of course, as I suggested, this could also be written using a nested subquery:

SELECT 
  h = MAX(d),
  l = MIN(d),
  a = AVG(d)
FROM
(
  SELECT d = DATEADD(DAY, 30, column_name)
  FROM dbo.some_table
) AS nsq;

Which of these forms you use is subjective (and will perform the same), but I have found myself leaning toward CTEs because, unlike nested subqueries, they can be referenced multiple times without being repeated. For example:

;WITH cte AS
(
  SELECT d = DATEADD(DAY, 30, column_name)
  FROM dbo.some_table
)
SELECT MAX(d) FROM cte
UNION ALL
SELECT MIN(d) FROM cte;

This doesn't mean that dbo.some_table is only touched once, though. More on that in a moment.

A couple of other things I want to point out here:

  • You can stack / nest / cascade CTEs, by separating them with a comma:
    ;WITH cte1 AS (SELECT x = 5),
          cte2 AS (SELECT x = x + 1 FROM cte1),
          cte3 AS (SELECT y = x + 4 FROM cte2)
    SELECT y FROM cte3;
  • You can provide the column names alongside the CTE name, rather than relying on aliases assigned within the CTE:
    ;WITH cte(col1, col2) AS (SELECT 5, 'foo') 
      SELECT col1, col2 FROM cte;
  • You cannot use a CTE in multiple subsequent queries. As mentioned above, this is not like a #temp table; people often want to do something like this, but at the point of the second query, the CTE no longer exists:
    ;WITH cte AS (SELECT x = 5)
      SELECT x FROM cte;
     
    SELECT x FROM cte;

Self-join

Prior to SQL Server 2012, CTEs were often used to mimic the functionality that eventually arrived with LAG/LEAD, where you could reference the "previous" or "next" row. This would mean using some kind of windowing function, like ROW_NUMBER(), and joining the CTE to itself in order to find the previous or next row. As an example, let's say I have a simple table like this, with a sequence of events:

CREATE TABLE #log
(
  EventType nvarchar(32),
  EventTime datetime
);
 
INSERT #log(EventType, EventTime)
VALUES(N'start',       '20160106 12:00:00'),
      (N'middle',      '20160106 12:20:00'),
      (N'almost done', '20160106 12:30:00'),
      (N'end',         '20160106 12:45:00');

Now I need a query to show me how long each event took. I could write a CTE like the following, which joins to itself to find the previous row based on the ROW_NUMBER() value:

;WITH cte AS
(
  SELECT EventType, EventTime,
    rn = ROW_NUMBER() OVER (ORDER BY EventTime)
  FROM #log
)
SELECT x.EventType, x.EventTime, duration = DATEDIFF(MINUTE, x.EventTime, y.EventTime)
FROM cte AS x -- initial row
LEFT OUTER JOIN cte AS y -- "next" row
ON x.rn = y.rn - 1;

We use an outer join here so that we can still see the row that represents the end, even though that row doesn't technically have a duration:

EventType      EventTime                  duration
-----------    -----------------------    --------
start          2016-01-06 12:00:00.000    20
middle         2016-01-06 12:20:00.000    10
almost done    2016-01-06 12:30:00.000    15
end            2016-01-06 12:45:00.000    NULL

In SQL Server 2012 and up, we could express this query a little simpler and without a CTE, and get the exact same results:

SELECT EventType, EventTime, duration = DATEDIFF(MINUTE, EventTime, 
   LEAD(EventTime, 1) OVER (ORDER BY EventTime))
FROM #log;

In this case, there is a reason to use LEAD rather than a CTE, and I already alluded to it above: there is a performance difference because the LEAD variant only references the underlying table once. Here are the plans (click either to enlarge):

bb_cte_page1-blog
The CTE version, with an additional scan and sort

bb_cte_page2-blog
The LEAD version, with only one reference to the table

Paging

Paging is a long-established process of allowing a user to view a large result set, n rows at a time. The most prolific example is Google or Bing, which typically show 10 search results per page. Again, prior to SQL Server 2012, paging wasn't as simple as OFFSET/FETCH. But OFFSET/FETCH alone does not necessarily provide any benefits aside from syntactically. If we look at the 2005 variation (lifted directly from Robert Cary's SSC article, SQL Server 2005 Paging – The Holy Grail):

DECLARE @startRow INT ; SET @startrow = 50
;WITH cols
AS
(
    SELECT table_name, column_name, 
        ROW_NUMBER() OVER(ORDER BY table_name, column_name) AS seq, 
        ROW_NUMBER() OVER(ORDER BY table_name DESC, column_name desc) AS totrows
    FROM [INFORMATION_SCHEMA].columns
)
SELECT table_name, column_name, totrows + seq -1 as TotRows
FROM cols
WHERE seq BETWEEN @startRow AND @startRow + 49
ORDER BY seq

This uses a CTE, and two ROW_NUMBER() functions used to traverse the table in each direction. A translation of that query to use the new syntax, without a CTE, would be:

SELECT table_name, column_name, COUNT(*) OVER() AS totrows
  FROM INFORMATION_SCHEMA.COLUMNS
  ORDER BY table_name, column_name
  OFFSET @startRow-1 ROWS FETCH NEXT 50 ROWS ONLY;

This looks a lot simpler, but as it turns out (at least in this case), while the plans definitely have different shapes, they are roughly the same in terms of costs - the old style has fewer reads, but more sort operations, and these seem to offset in terms of duration (I'd like to validate that further on larger result sets, but you can download these plans here to examine them yourself in Plan Explorer).

If you are performing paging in your application, you should check out my article over on SQLPerformance.com, Pagination with OFFSET / FETCH : A better way, where I promote this hybrid form (which uses a CTE to retrieve just the key values you're after, than joins to the main table to retrieve the other columns):

;WITH pg AS 
(
  SELECT [key_column] 
  FROM dbo.[some_table]
  ORDER BY [some_column_or_columns] 
  OFFSET @PageSize * (@PageNumber - 1) ROWS
  FETCH NEXT @PageSize ROWS ONLY
)
SELECT t.[bunch_of_columns]
  FROM dbo.[some_table] AS t
  INNER JOIN pg ON t.[key_column] = pg.[key_column]
  ORDER BY [some_column_or_columns];

This form will work best when you are retrieving a lot of columns, and your table has both a narrow clustering key and an index that supports the desired ORDER BY. This isn't quite the case with the example Robert chose (the INFORMATION_SCHEMA views are, well, not anything like real user tables).

Removing duplicates

Another place where CTEs can help to simplify a query - mostly by eliminating a self-join or dumping values to a #temp or other table - is removing duplicates. Let's say I have a simple table where I forgot to define a primary key or unique constraint, and now I need to clean it up:

CREATE TABLE #people
(
  name nvarchar(32) NOT NULL
);
 
INSERT #people(name) VALUES(N'Aaron'),(N'Bob'),(N'Aaron');

A CTE can make this very easy, by simply applying a ROW_NUMBER() grouped by the name, and deleting any rows where that row number is greater than 1:

;WITH cte AS
(
  SELECT name, 
    rn = ROW_NUMBER() OVER (PARTITION BY name ORDER BY name)
  FROM #people
)
DELETE cte WHERE rn > 1;

Note that this will delete arbitrary duplicate rows - if you have other criteria that you want to use in order to determine which rows to keep, you can apply tie-breakers to the ORDER BY. Let's say we have a slightly different table:

CREATE TABLE #people2
(
  name      nvarchar(32) NOT NULL,
  StartDate DATE NOT NULL
);
 
INSERT #people2(name, StartDate) 
VALUES(N'Aaron', '20101025'),
      (N'Bob',   '20120101'),
      (N'Aaron', '20150601');

If I want to keep the oldest (or "first") row, I can use StartDate as the ordering column:

;WITH x(name, rn) AS
(
  SELECT name, 
    rn = ROW_NUMBER() OVER (PARTITION BY name ORDER BY StartDate)
  FROM #people2
)
DELETE x WHERE rn > 1;

If I want to keep the newest (or "last") row, I can simply reverse the ordering:

;WITH x(name, rn) AS
(
  SELECT name, 
    rn = ROW_NUMBER() OVER (PARTITION BY name ORDER BY StartDate DESCENDING)
  FROM #people2
)
DELETE x WHERE rn > 1;

And then there are situations where you might have two Aarons starting on the same day, in which case you can add another column to help break ties (or just accept that an arbitrary row might be deleted).

Updating with an aggregate

I am largely against storing redundant information in a table, such as a calculated value indicating the cheapest or most expensive order a customer has placed. But a lot of people seem to want to do this. Let's say we have an orders table:

CREATE TABLE #orders
(
  CustomerID    INT NOT NULL,
  OrderTotal    DECIMAL(10,2),
  BiggestOrder  DECIMAL(10,2),
  SmallestOrder DECIMAL(10,2)
);
 
INSERT #orders(CustomerID, OrderTotal) 
VALUES(1, 32.05),  (2, 12.04),
      (1, 55.72),  (2, 65.15),
      (1, 44.65),  (2, 11.42);

Now we want to apply the biggest and smallest order for a customer to all of that customer's rows. The first type of attempt I see many people try goes something like this:

UPDATE #orders
  SET BiggestOrder = MAX(OrderTotal),
      SmallestOrder = MIN(OrderTotal)
GROUP BY CustomerID;

Seems like a logical approach, but this doesn't work due to a syntax error on the GROUP BY. And even if you remove it, you get a more fundamental error:

Msg 157, Level 15, State 1
An aggregate may not appear in the set list of an UPDATE statement.

An aggregate may not appear in the set list of an UPDATE statement.

Next, they try this (using windowed aggregates, which don't require outer grouping):

UPDATE #orders
  SET BiggestOrder = MAX(OrderTotal) OVER (PARTITION BY CustomerID),
      SmallestOrder = MIN(OrderTotal) OVER (PARTITION BY CustomerID);

But alas, this also yields an error message:

Msg 4108, Level 15, State 1
Windowed functions can only appear in the SELECT or ORDER BY clauses.

Windowed functions can only appear in the SELECT or ORDER BY clauses.

One way around this is to run a correlated subquery for each aggregate:

UPDATE o
SET BiggestOrder = (SELECT MAX(OrderTotal) 
  FROM #orders AS bo WHERE bo.CustomerID = o.CustomerID),
    SmallestOrder = (SELECT MIN(OrderTotal)
  FROM #orders AS so WHERE so.CustomerID = o.CustomerID)
FROM #orders AS o;

Another way to express this (with a lower runtime cost, especially as the number of aggregates increases) would be to access the aggregates using a derived table:

UPDATE o
SET BiggestOrder = agg.MaxOrder,
    SmallestOrder = agg.MinOrder
FROM #orders AS o
INNER JOIN
(
  SELECT CustomerID, 
      MaxOrder = MAX(OrderTotal), 
      MinOrder = MIN(OrderTotal)
    FROM #orders
    GROUP BY CustomerID
) AS agg
ON o.CustomerID = agg.CustomerID

Or a CTE:

;WITH cte AS
(
  SELECT 
    CustomerID, 
    BiggestOrder, 
    MaxOrder = MAX(OrderTotal) OVER (PARTITION BY CustomerID),
    SmallestOrder,
    MinOrder = MIN(OrderTotal) OVER (PARTITION BY CustomerID)
  FROM #orders
)
UPDATE cte SET BiggestOrder = MaxOrder, SmallestOrder = MinOrder;

In this case, the CTE and derived table produce nearly identical plans, but the CTE is much more clearly written (though I agree this is subjective). I won't clog up the post with screen shots and other evidence, but again will make the .QueryAnalysis file available for you to download here.

Recursive queries

CTEs can also be useful for recursion, where a CTE references itself one or more times to produce output. The two most common use cases are:

Generating a set

people often use a recursive CTE to define a set of numbers or a date range. For example, here is a simple CTE that generates a set of numbers from 1 to 100:

;WITH x AS
(
  SELECT y = 1
  UNION ALL
  SELECT y + 1 FROM x WHERE y < 100
)
SELECT y FROM x;

If you try to go over 100, though, you will get this error:

Msg 530, Level 16, State 1
The statement terminated. The maximum recursion 100 has been exhausted before statement completion.


The statement terminated. The maximum recursion 100 has been exhausted before statement completion.

In order to exceed 100 iterations, you'll need to use OPTION (MAXRECURSION):

;WITH x AS
(
  SELECT y = 1
  UNION ALL
  SELECT y + 1 FROM x WHERE y < 200
)
SELECT y FROM x
OPTION (MAXRECURSION 200);

This will work up until 32,767, after which you will receive this error:

Msg 310, Level 15, State 1
The value 32768 specified for the MAXRECURSION option exceeds the allowed maximum of 32767.

(To exceed even that limit, you can use (MAXRECURSION 0), though whether that's a good idea or not...)

More importantly, it is my experience that recursive CTEs do not scale well when deriving large sets; often having a pre-calculated Numbers or Calendar table is much better. I've proven this in multiple previous articles:

  • Hierarchy - this I believe was the original driver behind CTEs; to be able to map out an assembly process (a bill of materials), to produce an org chart (find all of an employee's managers, their manager's managers, those managers' managers, etc), and other similar tasks where you would need to know how many levels could potentially exist in order to use traditional joins or unions (and these would perform poor at scale in any case). I will perhaps take a look at hierarchical queries in a more advanced post, outside of the scope of this series; in the meantime, TechNet has some examples for you.

Why the leading semi-colon?

This is probably the most common question I receive personally about CTEs (aside from "What's a CTE? I've never heard of it..."). In just about every CTE I've ever published in blog posts, tips, or answers out on Stack Exchange, you'll see me begin my code with a semi-colon, like this:

;WITH cte_name AS (SELECT ...

The reason for this is simple - because of the overloading of the WITH keyword, there are several scenarios where copying and pasting a CTE into existing code will break. For example:

SELECT columns FROM dbo.some_table -- existing code
 
WITH cte_name AS (SELECT ... -- pasted

This would fail with the following error:

Msg 336, Level 15, State 1
Incorrect syntax near 'cte_name'. If this is intended to be a common table expression, you need to explicitly terminate the previous statement with a semi-colon.

After getting this complaint at least a dozen times, I quickly made it a habit to always use a leading semi-colon on every CTE I post, much to the chagrin of a few of my colleagues, who think it is perverse to use a semi-colon to begin a statement, rather than to end one.

In most cases, this is okay; you can have as many semi-colons between two statements as you like. This is legal, for example, even if it's not something I would write (or want to read):

;;;;;;;; SELECT 1; ;;; ;;;;; ;; ; ;; ; ;;; SELECT 2;;;;;;

However, there is at least one case where it's not okay, and that is when your CTE is used at the beginning of an inline module, such as a view or inline table-valued function:

CREATE VIEW dbo.view_name 
AS
  ;WITH cte_name AS (SELECT ...

This yields the following error:

Msg 102, Level 15, State 1, Procedure view_name, Line 3
Incorrect syntax near ';'.

Beware the Sugar

In cases where you use a CTE to avoid redundancy, it is important to note that this is mostly just syntactic sugar. As I showed a couple of times above, just because you only typed the CTE once, does not mean it is going to be evaluated only once. Even in very simple cases, you can see that a CTE that accesses a table once, but is referenced multiple times, ends up evaluating the CTE (and hence accessing the underlying table) multiple times. Let's say we have a simple table like this:

CREATE TABLE dbo.floob(blat INT PRIMARY KEY);
 
INSERT dbo.floob(blat) SELECT object_id FROM sys.all_objects;

And then we have a simple CTE query, which only references the table once:

;WITH n AS 
(
  SELECT TOP (1) blat FROM dbo.floob ORDER BY blat
)
SELECT n.blat, n2.blat FROM n CROSS JOIN n AS n2;

If we look at the results, it might *seem* like there is no reason that the inner query could have been executed multiple times (your results may be slightly different depending on your build of SQL Server):

blat          blat
-----------   -----------
-1372343552   -1372343552

But if we look at the plan, it becomes more clear:

bb_cte_page3-blog
Execution plan showing multiple references to base table

If we change the ORDER BY to something that can change with each execution (making the result non-deterministic), such as NEWID():

;WITH n AS 
(
  SELECT TOP (1) blat FROM dbo.floob ORDER BY NEWID()
)
SELECT n.blat, n2.blat FROM n CROSS JOIN n AS n2;

It is much more obvious from the outset that the query was executed twice, since we end up with two completely different TOP (1) values:

blat          blat
-----------   -----------
-858063793    -321212601

And there is a much more expensive plan to go with it:

bb_cte_page4-blog
Execution plan again showing multiple references to base table

So please be careful, as every reference to a CTE will most likely represent an additional evaluation of the inner query.

My one wish

Personally, aside from the thing being named correctly and deterministic inner query results to only be evaluated once, I would have liked to have seen explicit, distinguishable CTE syntax, such as:

WITH_CTE cte_name AS (SELECT ...

However, while that would avoid some of the confusion about how a CTE works, and would eliminate the leading semi-colon nonsense, it would have violated the ANSI standard - which defines a CTE structure using WITH. Let's pretend to ignore the fact that all of SQL Server's other implementations of WITH are not from the standard, because it's safe to assume that they're not going to fix all of that syntax to make CTEs friendlier. Theoretically, they could add more specific syntax as a synonym that the parser could understand, giving us a better option while not destroying backward compatibility, but I wouldn't hold my breath.

Conclusion

This was an interesting exercise and was a lot harder than I thought it would be. But it was refreshing to walk through many examples of CTEs starting with a very early baseline of knowledge. I'm looking forward to publishing more entries in the series, and if you have any suggestions for topics, please comment below.