December 15, 2006 | SQL Server

Take the high road to middle ground

A long, long, time ago, I wrote an article on calculating the median in a table. (I've removed the link, since aspfaq.com has been taken over by malware slimeballs.)

Of course, this was against SQL Server 2000; long before we had seen but glimpes of upcoming features in "Yukon" like Common Table Expressions (CTEs) and ROW_NUMBER().

Well, now SQL Server 2005 has been with us for a while, and I am going back to see how many of my "solutions" could be improved upon.

Last night, I went to a talk by Itzik Ben-Gan at the monthly meeting for the New England SQL User Group, and he reminded me how powerful ROW_NUMBER() could be in a situation like calculating the median. So, we have a good candidate here, right?

Let's take this trivial table:

USE tempdb; GO  CREATE TABLE dbo.MedianTest ( 	ClientID INT, 	Requests INT ); GO  CREATE CLUSTERED INDEX c  	ON dbo.MedianTest(ClientID, Requests); GO  SET NOCOUNT ON; GO  INSERT dbo.MedianTest  	SELECT 1, 50 	UNION ALL SELECT 1, 40 	UNION ALL SELECT 1, 24 	UNION ALL SELECT 2, 25 	UNION ALL SELECT 2, 75 	UNION ALL SELECT 3, 10 	UNION ALL SELECT 3, 2 	UNION ALL SELECT 3, 7 	UNION ALL SELECT 3, 12 	UNION ALL SELECT 4, 22 	UNION ALL SELECT 4, 26 	UNION ALL SELECT 4, 31; GO

Now, by following the example in the article above, I would get the median for each ClientID as follows:

SELECT DISTINCT 	ClientID, 	Median = ( 		 (SELECT MAX(Requests) FROM 		   (SELECT TOP 50 PERCENT Requests FROM dbo.MedianTest tA 			WHERE tA.ClientID = T.ClientID 			ORDER BY Requests) TopHalf) 		 + 		 (SELECT MIN(Requests) FROM 		   (SELECT TOP 50 PERCENT Requests FROM dbo.MedianTest tB 			WHERE tB.ClientID = T.ClientID 			ORDER BY Requests DESC) BottomHalf) 		) / 2  	FROM dbo.MedianTest T 	ORDER BY ClientID; GO

To do this in SQL Server 2005, we can get a huge increase in performance by using a CTE and ROW_NUMBER():

WITH cte AS ( 	SELECT 		ClientID, 		Requests, 		rn = ROW_NUMBER() OVER 		( 			PARTITION BY ClientID 			ORDER BY Requests 		), 		rc = COUNT(*) OVER 		( 			PARTITION BY ClientID 		) 	FROM 		dbo.MedianTest ) 	SELECT 		ClientID,  		Median = AVG(Requests) 	FROM 		cte 	WHERE 		rn IN ((rc + 1) / 2, (rc + 2) / 2) 	GROUP BY 		ClientID 	ORDER BY 		ClientID; GO

On this size of data, you're not going to see a noticeable difference. But imagine a table much, much larger. You may be astonished if you run these together and turn on the actual execution plan. The "old" solution uses three independent sort operations, and on my hardware the cost ratio for the overall queries is 98(old):2(new).

Don't forget to clean up:

DROP TABLE dbo.MedianTest; GO

2 comments on this post

    • Aaron - December 18, 2006, 2:25 PM

      Adam Machanic posted me offline, suggesting that I investigate what he called "Celko's method."  He said that it gave better performance against the Sales.SalesOrderHeader table in AdventureWorks.  I don't happen to have the sample database installed here, so I can't test right now, and will have to take his word for it that Celko's method had a fraction of the logical reads compared to the above.  Here is the approach adapted to my sample here:
      SELECT ClientID, Median = AVG(Requests) FROM ( SELECT ClientID, Requests, RowAsc = ROW_NUMBER() OVER ( PARTITION BY ClientID ORDER BY Requests ), RowDesc = ROW_NUMBER() OVER ( PARTITION BY ClientID ORDER BY Requests DESC ) FROM dbo.MedianTest ) x WHERE RowAsc IN ( RowDesc, RowDesc – 1, RowDesc + 1 ) GROUP BY ClientID ORDER BY ClientID;

    • CJan - September 15, 2011, 8:13 PM

      Given the data set below, Celko's method is not working properly.
      INSERT dbo.MedianTest
               SELECT 1, 50
      UNION ALL SELECT 1, 50 –Dup
      UNION ALL SELECT 1, 40
      UNION ALL SELECT 1, 24
      UNION ALL SELECT 4, 22
      UNION ALL SELECT 4, 22 –Dup
      UNION ALL SELECT 4, 26
      UNION ALL SELECT 4, 31;

Comments are closed.