Take the high road to middle ground

A long, long, time ago, I wrote an article on calculating the median in a table. (I've removed the link, since aspfaq.com has been taken over by malware slimeballs.)

Of course, this was against SQL Server 2000; long before we had seen but glimpes of upcoming features in "Yukon" like Common Table Expressions (CTEs) and ROW_NUMBER().

Well, now SQL Server 2005 has been with us for a while, and I am going back to see how many of my "solutions" could be improved upon.

Last night, I went to a talk by Itzik Ben-Gan at the monthly meeting for the New England SQL User Group, and he reminded me how powerful ROW_NUMBER() could be in a situation like calculating the median. So, we have a good candidate here, right?

Let's take this trivial table:

USE tempdb;
GO
 
CREATE TABLE dbo.MedianTest
( 
  ClientID INT, 
  Requests INT 
); 
GO  
 
CREATE CLUSTERED INDEX c  
  ON dbo.MedianTest(ClientID, Requests); 
GO  
 
SET NOCOUNT ON;
GO  
 
INSERT dbo.MedianTest (ClientID, Requests)
  SELECT 1, 50 
  UNION ALL SELECT 1, 40 
  UNION ALL SELECT 1, 24 
  UNION ALL SELECT 2, 25 
  UNION ALL SELECT 2, 75 
  UNION ALL SELECT 3, 10 
  UNION ALL SELECT 3, 2 
  UNION ALL SELECT 3, 7 
  UNION ALL SELECT 3, 12 
  UNION ALL SELECT 4, 22 
  UNION ALL SELECT 4, 26 
  UNION ALL SELECT 4, 31;

Now, by following the example in the article above, I would get the median for each ClientID as follows:

SELECT DISTINCT 
  ClientID, 
  Median = 
  ( 
    (
      SELECT MAX(Requests) FROM 
      (
        SELECT TOP 50 PERCENT Requests FROM dbo.MedianTest tA 
        WHERE tA.ClientID = T.ClientID 
        ORDER BY Requests
      ) AS TopHalf
    ) 
    + 
    (
      SELECT MIN(Requests) FROM 
      (
        SELECT TOP 50 PERCENT Requests FROM dbo.MedianTest tB 
        WHERE tB.ClientID = T.ClientID 
        ORDER BY Requests DESC
      ) AS BottomHalf
    ) 
  ) / 2.0  
  FROM dbo.MedianTest T 
  ORDER BY ClientID;

To do this in SQL Server 2005, we can get a huge increase in performance by using a CTE and ROW_NUMBER():

;WITH cte AS 
( 
  SELECT 
    ClientID, 
    Requests, 
    rn = ROW_NUMBER() OVER 
    ( 
      PARTITION BY ClientID 
      ORDER BY Requests 
    ), 
    rc = COUNT(*) OVER 
    ( 
      PARTITION BY ClientID 
    ) 
    FROM dbo.MedianTest 
) 
SELECT 
  ClientID,  
  Median = AVG(Requests) 
FROM cte 
WHERE rn IN ((rc + 1) / 2, (rc + 2) / 2) 
GROUP BY ClientID 
ORDER BY ClientID;

With such a small table, you're not going to see a noticeable difference. But imagine a table much, much larger. You may be astonished if you run these together and turn on the actual execution plan. The "old" solution uses three independent sort operations, and on my hardware the cost ratio for the overall queries is 98(old):2(new).

Don't forget to clean up:

DROP TABLE dbo.MedianTest;

Aaron Bertrand

I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, speak frequently at conferences, and write at SQLPerformance and MSSQLTips. In real life I am a father of two, an architect at Wayfair, and my pronouns are he/him.

2 Responses

  1. CJan says:

    Given the data set below, Celko's method is not working properly.
    INSERT dbo.MedianTest
             SELECT 1, 50
    UNION ALL SELECT 1, 50 –Dup
    UNION ALL SELECT 1, 40
    UNION ALL SELECT 1, 24
    UNION ALL SELECT 4, 22
    UNION ALL SELECT 4, 22 –Dup
    UNION ALL SELECT 4, 26
    UNION ALL SELECT 4, 31;

  2. Aaron says:

    Adam Machanic posted me offline, suggesting that I investigate what he called "Celko's method."  He said that it gave better performance against the Sales.SalesOrderHeader table in AdventureWorks.  I don't happen to have the sample database installed here, so I can't test right now, and will have to take his word for it that Celko's method had a fraction of the logical reads compared to the above.  Here is the approach adapted to my sample here:
    SELECT ClientID, Median = AVG(Requests) FROM ( SELECT ClientID, Requests, RowAsc = ROW_NUMBER() OVER ( PARTITION BY ClientID ORDER BY Requests ), RowDesc = ROW_NUMBER() OVER ( PARTITION BY ClientID ORDER BY Requests DESC ) FROM dbo.MedianTest ) x WHERE RowAsc IN ( RowDesc, RowDesc – 1, RowDesc + 1 ) GROUP BY ClientID ORDER BY ClientID;