T-SQL Tuesday #168: Window functions

For this month's T-SQL Tuesday, Steve Jones asks us to talk about problems we've solved using window functions.

When I first used window functions back in SQL Server 2005, I was in awe. I had always used inefficient self-joins to calculate things like running totals, and these really didn't scale well with size-of-data. I quickly realized you could also use them for ranks and moving averages without those cumbersome self-joins, elaborate sub-queries, or #temp tables. Those all have their place, but window functions can make them feel old-school and dirty.

If you're brand new to window functions, I recommend spending some time with this tutorial from Koen Verbeeck. If you're already familiar, read on.

Using mostly Stack Exchange Data Explorer, I'll show a few examples of problems I've solved in the past using window functions.

Ranking rows

RANK() is a pretty simple one. In this example, let's say we want the top 5 Stack Overflow users by reputation:

SELECT TOP (5) Id, DisplayName, Reputation,
  rk = RANK() OVER (ORDER BY Reputation DESC)
  FROM dbo.Users
  ORDER BY Reputation DESC;

The results of this query:

Showing RANK()

The scores are all unique, but RANK() would have shown the same value if there happened to be ties. Which may or may be intended or expected.

Let's do something a little more interesting to show how RANK(), DENSE_RANK(), and ROW_NUMBER() work: we'll ask for the top 20 questions, this month, by score:

SELECT TOP (20) Title, CreationDate, Score,
    [row_number] = ROW_NUMBER() OVER (ORDER BY Score DESC),
    [rank]       = RANK()       OVER (ORDER BY Score DESC),
    [dense_rank] = DENSE_RANK() OVER (ORDER BY Score DESC)
  FROM Posts 
  WHERE PostTypeId = 1 -- Question
  AND CreationDate >= DATEFROMPARTS(YEAR(GETDATE()),MONTH(GETDATE()),1)
  ORDER BY Score DESC;

The results of this query show that ROW_NUMBER() produces contiguous results, while RANK() and DENSE_RANK() produce slightly different results: the former leaves gaps when there are ties, and the latter leaves no gaps:

Comparing ROW_NUMBER(), RANK(), and DENSE_RANK()

Top N per group

When grouping or joining it can be tedious to get just the first or just the newest row for anything. Let's take a simple example where I want to know the date and title of the last question posted to Stack Overflow by me or by Jon Skeet:

WITH src AS
(
  SELECT Id, Title, CreationDate, OwnerUserId,
    rn = ROW_NUMBER() OVER 
    (
      PARTITION BY OwnerUserId
      ORDER BY CreationDate DESC
    )
  FROM Posts
  WHERE PostTypeId = 1 -- Question
  AND OwnerUserId IN (61305, 22656) -- me, Jon Skeet
)
SELECT Id, Title, CreationDate, OwnerUserId
  FROM src
  WHERE rn = 1;

The window function lets me find the newest question grouped ("partitioned") by OwnerUserId. Here are the results:

TOP N per group example 1

You can put more complex expressions in the OVER() clause, too. For example, if I wanted to find the first user created in each month this year, I could write this query:

WITH UsersThisYear AS
(
  SELECT Id, DisplayName, CreationDate,
    rn = ROW_NUMBER() OVER 
    (
      PARTITION BY DATEFROMPARTS
                   (YEAR(CreationDate), 
                    MONTH(CreationDate), 1)
      ORDER BY CreationDate
    )
  FROM Users
  WHERE CreationDate >= DATEFROMPARTS(YEAR(GETDATE()),1,1)
)
SELECT Id, DisplayName, CreationDate
FROM UsersThisYear
WHERE rn = 1
ORDER BY CreationDate;

Results:

TOP N per group example 2

I've solved a lot of other people's problems with ROW_NUMBER(), too. Here are 197 answers by me that, for one reason or another, contain WHERE rn = 1.

Running totals

Another luxury of modern versions is the ability to use window frames in the OVER() clause. We can use this technique to calculate aggregates over a large number of rows while still only reading those rows once. For example, if I wanted to see how many answers I've posted for each month in 2023, with a running total, I can write this query:

WITH AaronPosts AS
(
  SELECT  [Month]   = MONTH(CreationDate),
          PostCount = COUNT(*)
    FROM  Posts
    WHERE CreationDate >= '20230101'
      AND CreationDate <  '20240101'
      AND OwnerUserId  = 61305
      AND PostTypeId   = 2 -- answer
    GROUP BY MONTH(CreationDate)
  )
  SELECT [Month], 
    CountForMonth = PostCount, 
    RunningCount  = SUM(PostCount) OVER
    (
      ORDER BY [Month] 
      ROWS UNBOUNDED PRECEDING
    )
  FROM AaronPosts
  ORDER BY [Month];

Results:

Running totals example

You can see a few of the cumbersome ways we used to have to do this in this answer (along with some links to posts about performing grouped running totals, and why you want to avoid the RANGE keyword).

SQL Server 2022 enhancements

There is one window function enhancement that I am really looking forward to using: named windows. Itzik gave this a much thorough treatment in T-SQL Windowing Improvements in SQL Server 2022, but here's a quick example of how it can simplify a query.

Let's change the second query above so that it has a slightly more complex OVER() clause:

SELECT TOP (20) Id, PostTypeId, Score,
    rn = ROW_NUMBER() OVER 
         (PARTITION BY PostTypeID ORDER BY Score DESC),
    rk = RANK() OVER 
         (PARTITION BY PostTypeID ORDER BY Score DESC),
    dr = DENSE_RANK() OVER 
         (PARTITION BY PostTypeID ORDER BY Score DESC)
  FROM Posts 
  WHERE CreationDate >= DATEADD(DAY, -14, GETDATE())
  ORDER BY Score DESC;

Note that the same over clause is repeated multiple times, which makes the query harder to maintain, particularly if there may be changes later. In SQL Server 2022, you can alias a defined window, e.g.:

SELECT TOP (20) Id, PostTypeId, Score,
    rn = ROW_NUMBER() OVER PScore,
    rk = RANK()       OVER PScore,
    dr = DENSE_RANK() OVER PScore
  FROM Posts 
  WHERE CreationDate >= DATEADD(DAY, -14, GETDATE())
  WINDOW PScore AS (PARTITION BY PostTypeID ORDER BY Score DESC)
  ORDER BY Score DESC;

You can also consolidate just part of the OVER() clause, for example a common use case is to take the row number ascending and descending over the same partition. You can simplify this:

SELECT ...cols...,
    rnA = ROW_NUMBER() OVER 
          (PARTITION BY PostTypeID ORDER BY Score),
    rnB = ROW_NUMBER() OVER 
          (PARTITION BY PostTypeID ORDER BY Score DESC)
  FROM Posts 
  WHERE ...

To this:

SELECT ...cols...,
    rnA = ROW_NUMBER() OVER (PScore ORDER BY Score),
    rnB = ROW_NUMBER() OVER (PScore ORDER BY Score DESC)
  FROM Posts 
  WHERE ...
  WINDOW PScore AS (PARTITION BY PostTypeID)
  ...

Which can be really helpful with more elaborate partition clauses.

I can't demonstrate these on Stack Exchange Data Explorer directly, since its underlying version is still SQL Server 2019. But you can try them out on your own systems or using db<>fiddle. These enhancements are also available in Azure SQL Database / Azure SQL Managed Instance.

Conclusion

Window functions can solve all kinds of problems in SQL Server, and this post just barely scratches the surface. They can also be used for gaps and islands, grouped medians, and even more elaborate things. I am interested in reading all of the other T-SQL Tuesday #168 contributions.

By: Aaron Bertrand

I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, write at Simple Talk, SQLPerformance, and MSSQLTips, and have had the honor of speaking at more conferences than I can remember. In non-tech life, I am a husband, a father of two, a huge hockey and football fan, and my pronouns are he/him.

Ranking rows

Top N per group

Running totals

SQL Server 2022 enhancements

Conclusion

1 Response