Bad habits to kick : using loops to populate large tables
In my last post in this series, I talked about using ORDER BY <ordinal position>. Today I'd like to talk about using primitive loops to generate a large number of rows.
Okay, I will admit, in most cases you are doing this for a demo or proof of concept, and so speed / performance is not all that important. But using a loop to populate a table with 1 million rows, one row at a time, is not fun at all, no matter how much thumb-twiddling you have to do. I often see people building up test cases like this:
SET NOCOUNT ON; CREATE TABLE dbo.foo(bar INT); DECLARE @i INT, @UpperLimit INT; SELECT @i = 1, @UpperLimit = 1000000; WHILE @i <= @UpperLimit BEGIN INSERT dbo.foo(bar) VALUES(@i); SET @i = @i + 1; END
For 500 or 50,000 rows, this probably wouldn't be a big deal. But when you get into the bigger numbers, you are going to be spending a long time waiting for the inserts to finish. So how do you get around it?
I have long been an advocate of an auxiliary numbers table. It makes quick work of generating dummy data, does not require a lot of storage space, and is easy to join against for various reporting-type functions. Let's say you create a numbers table with 1,000,000 rows, then you can change the above code to:
SET NOCOUNT ON; CREATE TABLE dbo.foo(bar INT); DECLARE @UpperLimit INT; SET @UpperLimit = 1000000; INSERT dbo.foo(bar) SELECT [Number] FROM dbo.Numbers WHERE [Number] BETWEEN 1 AND @UpperLimit;
However, there are many people who fundamentally object to adding a table to their schema, even if they put it in some utility database. [As an aside, I see the exact same objections from people who don't think it's worth it to store a calendar table in their systems, even though it extremely simplifies date calculations such as business days between two dates, taking into account things like Easter and company holidays.]
Well, thankfully, there are some clever workarounds to this that let you construct a numbers table on the fly (and in fact you need to use something like this to populate the numbers table in the first place, without having to use a loop). Here is one that I have used in previous posts:
SET NOCOUNT ON; DECLARE @UpperLimit INT; SET @UpperLimit = 1000000; WITH n AS ( SELECT x = ROW_NUMBER() OVER (ORDER BY s1.[object_id]) FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2 CROSS JOIN sys.all_objects AS s3 ) SELECT [Number] = x INTO dbo.Numbers FROM n WHERE x BETWEEN 1 AND @UpperLimit; GO CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]);
And here is one using a set of CTEs and no tables, based on code I stole from Itzik Ben-Gan:
SET NOCOUNT ON; DECLARE @UpperLimit INT; SET @UpperLimit = 1000000; WITH n5 (x) AS (SELECT 1 UNION SELECT 0), n4 (x) AS (SELECT 1 FROM n5 CROSS JOIN n5 AS x), n3 (x) AS (SELECT 1 FROM n4 CROSS JOIN n4 AS x), n2 (x) AS (SELECT 1 FROM n3 CROSS JOIN n3 AS x), n1 (x) AS (SELECT 1 FROM n2 CROSS JOIN n2 AS x), n0 (x) AS (SELECT 1 FROM n1 CROSS JOIN n1 AS x), Nbrs (x) AS ( SELECT ROW_NUMBER() OVER (ORDER BY x) FROM n0 ) SELECT [Number] = x INTO dbo.Numbers FROM Nbrs WHERE n BETWEEN 1 AND @UpperLimit; GO CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]);
So, you could use either of these examples either (a) to populate your own numbers table, which you can later use instead of loops; or (b) directly, every time you need to simulate a loop. I prefer (a) because, if you use it enough, the numbers table will reside in memory, have statistics, etc. For full disclosure (and not to say my method is better), I compared the actual execution plans of both methods when using 1,000,000 rows as my upper limit. The comparison is as follows:
CROSS JOIN CTE Query cost (relative) 10% 90% Cached plan size (bytes) 88 176 CompileCPU (ms) 28 50 CompileMemory (KB) 2336 3376 CompileTime (ms) 28 50 Estimated Subtree Cost 0.0206647 0.193489
And the most visible difference is the query plan (click to enlarge). In case it's unclear, I'll give you a hint: the cross join of sys.objects is at the top.
Again, these aren't things you need to be concerned about if you are only populating the numbers table once. But if you are using these methods on the fly, the difference in performance could be significant.
If you require <= 32,767 rows, you can use this alternative syntax, employing a recursive CTE. It is much more efficient, but has a cap of 32,767 due to the limit on the MAXRECURSION option:
SET NOCOUNT ON; DECLARE @UpperLimit INT; SET @UpperLimit = 32767; WITH n AS ( SELECT x = 1 UNION ALL SELECT x = x + 1 FROM n WHERE x < @UpperLimit ) SELECT [Number] = x INTO dbo.Numbers FROM n OPTION (MAXRECURSION 32767); GO CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]);
I'll admit, this isn't the easiest syntax to memorize, so you will likely want to keep a copy of it in your toolbox. If you are going to continue using a loop, at the very least, please get in the habit of adding SET NOCOUNT ON statements to your code. This way, at least, that big loop won't be wasting all that effort passing "1 row(s) affected" to the client n times, and the client won't be wasting all that effort printing it out to you n times.
Now don't get me wrong, I am not saying that loops are a bad thing. In fact for a lot of things they are the right answer, and for some things they are even the only answer. But in cases where things don't have to happen one at a time, they shouldn't happen one at a time.
I am working on a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code. Up next: using old-style joins.