A long, long, time ago, I wrote an article on calculating the median in a table. (I've removed the link, since aspfaq.com has been taken over by malware slimeballs.)
Of course, this was against SQL Server 2000; long before we had seen but glimpes of upcoming features in "Yukon" like Common Table Expressions (CTEs) and ROW_NUMBER().
Well, now SQL Server 2005 has been with us for a while, and I am going back to see how many of my "solutions" could be improved upon.
Last night, I went to a talk by Itzik Ben-Gan at the monthly meeting for the New England SQL User Group, and he reminded me how powerful ROW_NUMBER() could be in a situation like calculating the median. So, we have a good candidate here, right?
Let's take this trivial table:
USE tempdb; GO CREATE TABLE dbo.MedianTest ( ClientID INT, Requests INT ); GO CREATE CLUSTERED INDEX c ON dbo.MedianTest(ClientID, Requests); GO SET NOCOUNT ON; GO INSERT dbo.MedianTest (ClientID, Requests) SELECT 1, 50 UNION ALL SELECT 1, 40 UNION ALL SELECT 1, 24 UNION ALL SELECT 2, 25 UNION ALL SELECT 2, 75 UNION ALL SELECT 3, 10 UNION ALL SELECT 3, 2 UNION ALL SELECT 3, 7 UNION ALL SELECT 3, 12 UNION ALL SELECT 4, 22 UNION ALL SELECT 4, 26 UNION ALL SELECT 4, 31;
Now, by following the example in the article above, I would get the median for each ClientID as follows:
SELECT DISTINCT ClientID, Median = ( ( SELECT MAX(Requests) FROM ( SELECT TOP 50 PERCENT Requests FROM dbo.MedianTest tA WHERE tA.ClientID = T.ClientID ORDER BY Requests ) AS TopHalf ) + ( SELECT MIN(Requests) FROM ( SELECT TOP 50 PERCENT Requests FROM dbo.MedianTest tB WHERE tB.ClientID = T.ClientID ORDER BY Requests DESC ) AS BottomHalf ) ) / 2.0 FROM dbo.MedianTest T ORDER BY ClientID;
To do this in SQL Server 2005, we can get a huge increase in performance by using a CTE and ROW_NUMBER():
;WITH cte AS ( SELECT ClientID, Requests, rn = ROW_NUMBER() OVER ( PARTITION BY ClientID ORDER BY Requests ), rc = COUNT(*) OVER ( PARTITION BY ClientID ) FROM dbo.MedianTest ) SELECT ClientID, Median = AVG(Requests) FROM cte WHERE rn IN ((rc + 1) / 2, (rc + 2) / 2) GROUP BY ClientID ORDER BY ClientID;
With such a small table, you're not going to see a noticeable difference. But imagine a table much, much larger. You may be astonished if you run these together and turn on the actual execution plan. The "old" solution uses three independent sort operations, and on my hardware the cost ratio for the overall queries is 98(old):2(new).
Don't forget to clean up:
DROP TABLE dbo.MedianTest;
Given the data set below, Celko's method is not working properly.
INSERT dbo.MedianTest
SELECT 1, 50
UNION ALL SELECT 1, 50 –Dup
UNION ALL SELECT 1, 40
UNION ALL SELECT 1, 24
UNION ALL SELECT 4, 22
UNION ALL SELECT 4, 22 –Dup
UNION ALL SELECT 4, 26
UNION ALL SELECT 4, 31;
Adam Machanic posted me offline, suggesting that I investigate what he called "Celko's method." He said that it gave better performance against the Sales.SalesOrderHeader table in AdventureWorks. I don't happen to have the sample database installed here, so I can't test right now, and will have to take his word for it that Celko's method had a fraction of the logical reads compared to the above. Here is the approach adapted to my sample here:
SELECT ClientID,
Median = AVG(Requests) FROM (
SELECT ClientID, Requests, RowAsc = ROW_NUMBER() OVER ( PARTITION BY ClientID ORDER BY Requests ), RowDesc = ROW_NUMBER() OVER ( PARTITION BY ClientID ORDER BY Requests DESC ) FROM dbo.MedianTest ) x WHERE RowAsc IN ( RowDesc, RowDesc – 1, RowDesc + 1 ) GROUP BY ClientID ORDER BY ClientID;