Bad Habits to Kick: Looping to populate large tables
October 7th, 2009
This is part of more than a decade of posts involving Bad Habits and Best Practices.
See the full index.

Today I'd like to talk about using primitive loops to generate a large number of rows. Okay, I'll admit, in most cases you are doing this for a demo or proof of concept, and so speed / performance is not all that important. But using a loop to populate a table with 1 million rows, one row at a time, is not fun at all, no matter how much thumb-twiddling you are willing to do.

I often see people building up test cases like this:

CREATE TABLE dbo.foo(bar int);
 
DECLARE 
  @i          int,
  @UpperLimit int;
 
SELECT
  @i          = 1,
  @UpperLimit = 1000000;
 
WHILE @i <= @UpperLimit
BEGIN
  INSERT dbo.foo(bar) VALUES(@i);
  SET @i = @i + 1;
END

For 500 or 50,000 rows, this probably wouldn't be a big deal. But when you get into the bigger numbers, you are going to be spending a long time waiting for the inserts to finish. So how do you get around it?

I have long been an advocate of an auxiliary numbers table. It makes quick work of generating dummy data, does not require a lot of storage space, and is easy to join against for various reporting-type functions. Let's say you create a numbers table with 1,000,000 rows, then you can change the above code to:

SET NOCOUNT ON;
 
CREATE TABLE dbo.foo(bar int);
 
DECLARE @UpperLimit int = 1000000;
 
INSERT dbo.foo(bar)
  SELECT [Number]
  FROM dbo.Numbers
  WHERE [Number] BETWEEN 1 AND @UpperLimit;
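
If you do not already have a numbers table, below is a minimal sketch of one way to build one. It assumes the dbo.Numbers name and [Number] column used in the query above, uses an arbitrary index name, and borrows the sys.all_objects cross join technique shown a little further down:

SET NOCOUNT ON;
 
DECLARE @UpperLimit int = 1000000;
 
-- two cross-joined copies of sys.all_objects typically yield several
-- million rows; add a third copy if you need more than that
;WITH n AS
(
    SELECT x = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM       sys.all_objects AS s1
    CROSS JOIN sys.all_objects AS s2
)
SELECT [Number] = x
  INTO dbo.Numbers
  FROM n
  WHERE x BETWEEN 1 AND @UpperLimit;
 
-- a unique clustered index keeps the table compact and easy to join against
CREATE UNIQUE CLUSTERED INDEX IX_Numbers ON dbo.Numbers([Number]);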

However, there are some who fundamentally object to adding a table to their schema, even if they put it in some utility database.

Thankfully, there are some clever workarounds to this that let you construct a numbers table on the fly. Here is one that I have used in previous posts:

SET NOCOUNT ON;
 
DECLARE @UpperLimit int = 1000000;
 
;WITH n AS
(
    SELECT
        x = ROW_NUMBER() OVER
        (ORDER BY s1.[object_id])
    FROM       sys.all_objects AS s1
    CROSS JOIN sys.all_objects AS s2
    CROSS JOIN sys.all_objects AS s3
)
SELECT [Number] = x
  INTO #Numbers
  FROM n
  WHERE x BETWEEN 1 AND @UpperLimit;
 
GO
 
CREATE UNIQUE CLUSTERED INDEX n ON #Numbers([Number]);

And here is one using a set of CTEs and no system views, based on code I stole from Itzik Ben-Gan at some point:

SET NOCOUNT ON;
 
DECLARE @UpperLimit int = 1000000;
 
;WITH
   n5   (x) AS (SELECT 1 UNION SELECT 0),
   n4   (x) AS (SELECT 1 FROM n5 CROSS JOIN n5 AS x),
   n3   (x) AS (SELECT 1 FROM n4 CROSS JOIN n4 AS x),
   n2   (x) AS (SELECT 1 FROM n3 CROSS JOIN n3 AS x),
   n1   (x) AS (SELECT 1 FROM n2 CROSS JOIN n2 AS x),
   n0   (x) AS (SELECT 1 FROM n1 CROSS JOIN n1 AS x),
   Nbrs (x) AS 
   (
       SELECT ROW_NUMBER() OVER (ORDER BY x)
       FROM n0
   )
SELECT [Number] = x
  INTO #Numbers
  FROM Nbrs
  WHERE x BETWEEN 1 AND @UpperLimit;
GO
 
CREATE UNIQUE CLUSTERED INDEX n ON #Numbers([Number]);

Another alternative is the recursive CTE.

SET NOCOUNT ON;
 
DECLARE @UpperLimit int = 1000000;
 
;WITH n(x) AS 
(
   SELECT x = 1 
   UNION ALL
   SELECT x + 1
     FROM n WHERE x < @UpperLimit
)
SELECT [Number] = x 
  INTO #Numbers 
  FROM n
  OPTION (MAXRECURSION 0); -- or <= 32767 if you need fewer rows
GO
 
CREATE UNIQUE CLUSTERED INDEX n ON #Numbers([Number]);
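
Whichever of these on-the-fly variations you use to build #Numbers, the final population step mirrors the earlier INSERT...SELECT against the permanent table (this sketch assumes the dbo.foo table from the first example):

INSERT dbo.foo(bar)
  SELECT [Number]
  FROM #Numbers;
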
Since writing this post, I wrote another series on generating sets without loops: Part 1 | Part 2 | Part 3

If you are going to continue using a loop, at the very least, please get in the habit of adding a SET NOCOUNT ON; statement to your code. This way, at least, that big loop won't be wasting all that effort passing a "(1 row(s) affected)" message to the client n times, and the client won't be wasting all that effort printing it out for you n times.
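
For example, the very first loop above needs only that one extra line at the top; the rest is unchanged:

SET NOCOUNT ON; -- suppress the per-row "rows affected" messages
 
CREATE TABLE dbo.foo(bar int);
 
DECLARE 
  @i          int,
  @UpperLimit int;
 
SELECT
  @i          = 1,
  @UpperLimit = 1000000;
 
WHILE @i <= @UpperLimit
BEGIN
  INSERT dbo.foo(bar) VALUES(@i);
  SET @i = @i + 1;
END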

Now, don't get me wrong; I am not saying that loops are a bad thing. In fact, for a lot of things, they are the right answer, and for some things they are the only answer. But in cases where things don't have to happen one at a time, they shouldn't happen one at a time.

This is part of more than a decade of posts involving Bad Habits and Best Practices.
See the full index.
By: Aaron Bertrand

I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, write at Simple Talk, SQLPerformance, and MSSQLTips, and have had the honor of speaking at more conferences than I can remember. In non-tech life, I am a husband, a father of two, a huge hockey and football fan, and my pronouns are he/him.

25 Responses

  1. Balaji Ram says:

    Linchi/Aaron,
    Regarding using an optimal transaction size (100,000 in the example above): this seems to matter only for permanent tables in user databases. Tempdb's #tables do not suffer much of a performance hit even when no transaction size is used and commits are left to the default (auto-commit), with no explicit transaction.
    Thanks to Paul Randal, I know tempdb data pages are NOT flushed to disk unless there is memory pressure. What about the tempdb log blocks? Are they also NOT flushed to the tempdb log file unless there is memory pressure?
    Any insights/thoughts? Thanks.

  2. Dean says:

    Thanks Aaron, I understand cross joins, I just meant from a functional point of view, I was interested in why you chose that particular approach.  I think I can pretty much imply it now from your reply – you just want access to a table with sufficient rows that a cross join to itself will create a large number of results, without having to create and populate it yourself, right?  If so hopefully my experience will be useful to people that if they have a very particular and large number in mind, they may want to be careful about relying on any system tables.  (Even my sys.all_objects table only has 2,267 rows, so would only create 5 million results when cross joined.)
    The general point of your post has really helped me though, I had a requirement to fill a table with 50 million rows of random sales data, and the way I was previously doing it (with a 1 to 50,000,000 WHILE loop) took 50 minutes.  Following your approach, but using a smaller loop to populate a table variable with 7,072 rows, I now have it down to 5 minutes.  I've posted it here in case a) it helps anyone or b) anyone has any further suggestions for improvement!  (Once again, thanks for the post, it was a life saver!)
    SET NOCOUNT ON
    DECLARE @first_tab TABLE (col1 INT);
    DECLARE @i INT = 1
    WHILE @i <= 7072
    BEGIN
    INSERT INTO @first_tab VALUES(@i);
    SET @i = @i + 1;
    END
    -- Sales figures random between 315 and 1140
    -- Dates random number of minutes between 1 and 2629440 after 1/1/07
    -- (makes a range of approx 5 years)
    IF OBJECT_ID('sales_figures') IS NOT NULL DROP TABLE dbo.sales_figures;
    SELECT TOP 50000000
           CONVERT(VARCHAR(30), DATEADD(minute, ABS(CHECKSUM(NewId())) % 2629440, '01-JAN-2007'), 100) AS sales_date
    ,       315 + (ABS(CHECKSUM(NewId())) % 825)  AS sales_figure
    INTO dbo.sales_figures
    FROM @first_tab t1 CROSS JOIN @first_tab t2;

  3. AaronBertrand says:

    Hi Dean, sorry, the code should be using sys.all_objects instead of sys.objects. I ran my tests in AdventureWorks but in relatively empty databases I could understand that you might fall short of 1,000,000 rows as a result.
    I'll think about a separate blog post to explain the cross joins, but in the meantime maybe this series will help:
    http://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-1
    http://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-2
    http://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-3

  4. Dean says:

    Hi Aaron.  I like the look of your preferred option but would you be able to explain it a little?  What exactly does it do, and why?  I have just run it for the first time and I only get 970,299 rows, not 1,000,000.  I just think it may be useful to people to explain what the code is doing (especially to give people confidence to edit it to their own needs). Thanks!

  5. Abhijit says:

    Guys, fantastic. I kicked myself (better late than never).
    Having said that, is there any better solution for my problem below?
    I need to generate the following output, where R1..Rn are rows and C1..Cn are columns (max limit known).
    e.g.
    R1 => C1 | C2 | C3 | C4 …….Cn
    R2 => C1 | C2 | C3 | C4 …….Cn
    R3 => C1 | C2 | C3 | C4 …….Cn
    ……
    For this I generate a series of numbers for Rows and Columns.
    Is using a left join good (assuming there are 10,000 rows and 5,000 columns)?
    I have tried to depict my problem in a very simple mechanism but let me know if the description is adequate.

  6. Razvan Socol says:

    Hello Aaron,
    You wrote that the recursive CTE method is "much more efficient" than Itzik's method (at least that's what I understand from that sentence). However, in my tests, it takes about 70ms to generate 10000 rows using Itzik's method and 563ms to generate 20000 rows with the recursive CTE. The recursive CTE also seems slower than the sysobjects CROSS JOIN, which provides the 20000 rows in about 80ms.
    I guess that you meant to say that the recursive CTE method is faster than the iterative method, but the sentence seems to imply something else.
    Razvan

  7. Jeff Moden says:

    Aaron said: Thanks Brian, in all fairness, I didn't say loops were a bad habit.
    Heh… I will! 😉
    I know this is a relatively old blog but I just had to leave you a note about this.  I'm really glad to see someone else trying to keep people from using the addictive drug known as "While Loops".  I'm also pressing to keep them from using other forms of unnecessary RBAR such as recursive CTEs to do such simple things as counting, because they're sometimes worse than a While Loop.
    One of my favorite peeves is when someone wants to demonstrate some hot new set-based technique and they go to build a test table to prove their code… and they use a While Loop or recursive CTE to do it.  It just doesn't make sense to me and it tends to destroy the "set-based" credibility of the author in my eyes.
    A couple of years ago and again recently, SSC had an article on the ol' "FizzBuzz" problem that's sometimes used during interviews.  It's amazing to me that so many people submit solutions that contain a While Loop and try to justify its use on "short time allowed" and "it's only for 100 rows".  They just don't understand that when I'm doing interviews, I'm looking for people who…
    1.  Don't have the bad habit of using "While Loops" for something so simple.
    2.  Don't have the bad habit of justifying a total lack of future performance and scalability because they don't understand that scalability requirements can change quickly and the code should be able to handle it without rework.
    I took a lot of the same heat on those two threads for the same reason you took some heat on this thread.  I'm thrilled that someone else is standing their ground in a fight against RBAR and non scalable code.  Very well done and my hat's off to you!
    –Jeff Moden  
    BTW… sys.All_Columns in the Master database will always have more than 4k rows for a full install.

  8. Jeff Moden says:

    Ron Krauter said:  If you had a million customers in a table and you wanted to email each one of them, wouldn't you use the loop example you provided?
    Absolutely (although as of SQL SERVER 2005, there are even ways around that).  In the case you cite, the While Loop won't be the performance problem and is necessary because the proc can only do 1 email at a time (IIRC).
    The problem is that a lot of people write procs that can only handle 1 row or record at a time instead of writing a set based solution.  Heh…I guess they do it out of "habit".
    –Jeff Moden

  9. Ron Krauter says:

    Aaron,
    If you had a million customers in a table and you wanted to email each one of them, wouldn't you use the loop example you provided?
    WHILE …
     BEGIN
       -- send dbmail
     END
    Thanks.

  10. Brian Tkatch says:

    >I didn't say loops were a bad habit.  Just that they were a bad habit
    >when the goal is to generate millions of rows.
    Oh, OK. Heh. I was just focusing on the title as a subject identifier and ran with it. Regardless, it is a good point to read, and I'll read it whatever you call it.

  11. AaronBertrand says:

    Thanks Brian, in all fairness, I didn't say loops were a bad habit.  Just that they were a bad habit when the goal is to generate millions of rows.

  12. Brian Tkatch says:

    Aaron, this is a great series. Or rather, these are great posts. The series, however, should be bifurcated (finally, got to use it in a "real" sentence 😛 ). Loops aren't a bad habit, as has been pointed out. There is just a better way. Perhaps you ought to separate the bad habits from the best practices.

  13. OllyA says:

    Actually, you can use the recursive CTE for large numbers of rows.  Don't think it's as fast as a numbers table, but a lot quicker than the loop.
    Just use OPTION (MAXRECURSION 0).
    Oh, and make sure your CTE has the WHERE clause… or you'll be trying to go for an infinite numbers table 😉

  14. AaronBertrand says:

    Cool, so at least one person has kicked a bad habit based on my complaints. 🙂

  15. Uri Dimant says:

    :-) Hey, I posted this comment before your last one.
    You know, I don't write the old JOIN style anymore :-))

  16. AaronBertrand says:

    Uri, that's still a loop, even though you don't actually use BEGIN / END or WHILE — while it is simple to code, it still suffers the same performance problem I pointed out in my post, and isn't subject to the enhancement Linchi mentioned about reducing the number of commits.  Also, without a GO between CREATE TABLE and INSERT INTO, the script you posted will return 9,999 error messages that the table Num already exists.  So, even the simplest types of loops are still subject to memorization / transposition issues.  And where is the schema prefix?  For shame!  🙂

  17. Uri Dimant says:

    Aaron , great posts
    I often use to make numbers table as
    CREATE TABLE Num(number INT NOT NULL IDENTITY(1,1))
    INSERT INTO Num DEFAULT VALUES
    GO 10000

  18. AaronBertrand says:

    Linchi, I think a lot of people don't realize the implications of these things, such as the costs of commit overhead.  So, saying they "deserve" something that isn't really explained well anywhere is kind of harsh, IMHO.  I still think the inclination for most people is to use a set-based approach rather than a one-by-one loop, even if there are cumbersome ways to defer or limit the costs associated with the latter.

  19. Linchi Shea says:

    Aaron;
    I agree that if they literally use the first approach, and inserting and committing one row at a time, they deserve the poor performance. And it would be a bad habit.

  20. AaronBertrand says:

    Fair enough Linchi, but do you know anybody who is going to code their "simple loops" that way?

  21. Linchi Shea says:

    Aaron;
    Your (a) and (c) are not strictly an apples-to-apples comparison. If you can retrieve data from some other table, sure, go ahead. It would be silly to use a loop in that case.
    When you use a loop, you most likely are generating new data, and the bottleneck is almost always on the INSERTs, not the loop itself. By 'minor change', I primarily meant to change the transaction commit behavior of the loop. Instead of committing each and every single row insert, which is very expensive, you may want to commit every 100,000 rows or some other batch size (big enough to make transaction commits efficient). That would make it dramatically faster than 8 minutes.
    Change it to something like the following and it should finish in seconds:
    SET NOCOUNT ON;
    CREATE TABLE dbo.foo(bar INT);
    DECLARE   @i INT,  @UpperLimit INT;
    SELECT  @i = 1,  @UpperLimit = 1000000;
    begin tran
    WHILE @i <= @UpperLimit
    BEGIN  
       if @i % 100000 = 0
       begin
          commit tran
          begin tran
       end
       INSERT dbo.foo(bar) VALUES(@i);  
       SET @i = @i + 1;
    END
    if @@trancount > 0
      commit tran
    Here are some more data points:
    http://sqlblog.com/blogs/linchi_shea/archive/2007/08/23/performance-impact-finding-the-most-optimal-batch-size.aspx

  22. AaronBertrand says:

    Scott, good comments.  I'll address a few.
    1) I realize that I am not going to change a whole lot of habits.  But if I made one person realize a better way to accomplish a common task, mission accomplished, IMHO.
    2) I agree that if you already have a numbers table, then SELECT INTO is best, without any of the CTE complications.  However, you still have to get a numbers table populated in the first place; surely you should do this in the most efficient manner possible, no?  And again, I explained that some people are opposed to such a table on principle – so hopefully this post gives them some alternatives as well.
    3) As for SET NOCOUNT ON, I use it on all my queries, regardless of their complexity.  And like anyone trying to demonstrate and reinforce good habits, repetition is key.  When you are performing one operation with one rowcount output, I agree, it is not necessary and does not buy you anything measurable.  But what harm does it do to add it?  You can even make it the first line in your template for all new queries, then the agonizing keystrokes can't even be a good excuse anymore.  🙂

  23. AaronBertrand says:

    Linchi, I agree with you about simplicity.  One of the problems with bad habits is that often they are borne out of simplicity.  In a lot of cases the wait time is irrelevant, so the simple option is best.  But perhaps this gives people a better way when they're really in a rush.
    As for performance, I disagree completely (though I don't know what "simple change" you imply).  I ran an insert of 1,000,000 rows into a new table using three methods: (a) a loop with a counter, using SET NOCOUNT ON, (b) a loop with a counter, using SET NOCOUNT OFF, and (c) the sys.objects cross join from above.  I only observed elapsed time, and the results were:
    (a) 8 minutes, 18 seconds
    (b) 8 minutes, 32 seconds
    (c) 3 seconds

  24. Scott Whigham says:

    I like the breakdown – nice post. I don't think that, even if you showed people that performance of a loop was 500 times worse than using a recursive CTE, that it would change more than 2% of the population's technique of doing this. Why? As you pointed out, it's not a technique that people will generally be able to memorize. My "benchmark" on stuff like this is to imagine that I've just learned this technique on December 1 and then I take the whole month off and go to the Bahamas. When I come back to work in mid-January, will I remember this technique? If the answer is, "No", then I do one of two things: I either (a) do as you said and add a script to my "toolkit", or (b) I just forget about it. Your final example has been in my toolkit for several years but when it comes time to write a loop on the fly in a demo, I reach for the declarative technique (that you showed initially).
    If you have a numbers table that covers your min/max values, the simplest technique, for me, is to just use SELECT INTO – no need for more fancy options IMO. I don't even need the SET NOCOUNT ON with such a query:
     SELECT [Number]
     INTO Foo
     FROM dbo.Numbers
     WHERE [Number] BETWEEN 1 AND 1000000
     GO
     ALTER TABLE foo ADD CONSTRAINT PK_Foo PRIMARY KEY (Number)
     GO
    This is simple, easy to remember, and it's fast. Of course, it depends on your having a numbers table that covers your min/max values.

  25. Linchi Shea says:

    Aaron;
    If we are talking about performance and simplicity, I'd actually argue that your very first method with a loop is actually the best (with some minor change).