October 7, 2009 | SQL Server

Bad habits to kick: using loops to populate large tables

In my last post in this series, I talked about using ORDER BY <ordinal position>.  Today I'd like to talk about using primitive loops to generate a large number of rows.

Okay, I will admit, in most cases you are doing this for a demo or proof of concept, and so speed / performance is not all that important.  But using a loop to populate a table with 1 million rows, one row at a time, is no fun at all, no matter how much you enjoy twiddling your thumbs.  I often see people building up test cases like this:

CREATE TABLE dbo.foo(bar INT);

DECLARE @i INT, @UpperLimit INT;

SELECT @i = 1, @UpperLimit = 1000000;

WHILE @i <= @UpperLimit
BEGIN
  INSERT dbo.foo(bar) VALUES(@i);
  SET @i = @i + 1;
END

For 500 or 50,000 rows, this probably wouldn't be a big deal.  But when you get into the bigger numbers, you are going to be spending a long time waiting for the inserts to finish. So how do you get around it?

I have long been an advocate of an auxiliary numbers table.  It makes quick work of generating dummy data, does not require a lot of storage space, and is easy to join against for various reporting-type functions.  Let's say you create a numbers table with 1,000,000 rows, then you can change the above code to:

CREATE TABLE dbo.foo(bar INT);
DECLARE @UpperLimit INT;
SET @UpperLimit = 1000000;
INSERT dbo.foo(bar)
  SELECT [Number]
  FROM dbo.Numbers
  WHERE [Number] BETWEEN 1 AND @UpperLimit;

However, there are many people who fundamentally object to adding a table to their schema, even if they put it in some utility database.  [As an aside, I see the exact same objections from people who don't think it's worth it to store a calendar table in their systems, even though it greatly simplifies date calculations such as business days between two dates, taking into account things like Easter and company holidays.]
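
To make that aside concrete, here is a rough sketch of the kind of query a calendar table enables.  The dbo.Calendar table and its TheDate / IsWorkDay columns are placeholders I'm assuming for illustration, not something defined anywhere in this post:

-- counting business days becomes a filtered COUNT against the calendar;
-- dbo.Calendar, TheDate and IsWorkDay are hypothetical names, and IsWorkDay
-- is assumed to already account for weekends, Easter, and company holidays
DECLARE @start DATETIME, @end DATETIME;
SELECT @start = '20091001', @end = '20091101';

SELECT BusinessDays = COUNT(*)
  FROM dbo.Calendar
  WHERE TheDate >= @start
    AND TheDate <  @end
    AND IsWorkDay = 1;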

Well, thankfully, there are some clever workarounds to this that let you construct a numbers table on the fly (and in fact you need to use something like this to populate the numbers table in the first place, without having to use a loop).  Here is one that I have used in previous posts:

DECLARE @UpperLimit INT;
SET @UpperLimit = 1000000;

;WITH n AS
(
    SELECT
        x = ROW_NUMBER() OVER
        (ORDER BY s1.[object_id])
    FROM       sys.all_objects AS s1
    CROSS JOIN sys.all_objects AS s2
    CROSS JOIN sys.all_objects AS s3
)
SELECT [Number] = x
  INTO dbo.Numbers
  FROM n
  WHERE x BETWEEN 1 AND @UpperLimit;
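
If you do keep dbo.Numbers around, one follow-up is worth doing (it is not part of the code above, just a suggestion): SELECT ... INTO creates the table as a heap, so give it a clustered primary key to keep joins and range filters against it cheap.  Something along these lines:

-- not in the original example; ROW_NUMBER() yields a BIGINT, so that is the
-- type SELECT ... INTO will have created (use INT instead if you prefer,
-- since the values easily fit)
ALTER TABLE dbo.Numbers ALTER COLUMN [Number] BIGINT NOT NULL;

ALTER TABLE dbo.Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED ([Number]);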

And here is one using a set of CTEs and no tables, based on code I stole from Itzik Ben-Gan:

DECLARE @UpperLimit INT;
SET @UpperLimit = 1000000;

;WITH
   n5   (x) AS (SELECT 1 UNION SELECT 0),
   n4   (x) AS (SELECT 1 FROM n5 CROSS JOIN n5 AS x),
   n3   (x) AS (SELECT 1 FROM n4 CROSS JOIN n4 AS x),
   n2   (x) AS (SELECT 1 FROM n3 CROSS JOIN n3 AS x),
   n1   (x) AS (SELECT 1 FROM n2 CROSS JOIN n2 AS x),
   n0   (x) AS (SELECT 1 FROM n1 CROSS JOIN n1 AS x),
   Nbrs (x) AS
   (
       SELECT
           ROW_NUMBER() OVER
           (ORDER BY x)
       FROM n0
   )
SELECT [Number] = x
  INTO dbo.Numbers
  FROM Nbrs
  WHERE x BETWEEN 1 AND @UpperLimit;

So, you could use either of these examples (a) to populate your own numbers table, which you can later use instead of loops; or (b) directly, every time you need to simulate a loop.  I prefer (a) because, if you use it enough, the numbers table will reside in memory, have statistics, etc.  For full disclosure (and not to say my method is better), I compared the actual execution plans of both methods when using 1,000,000 rows as my upper limit.  The comparison is as follows:

Metric                      sys.all_objects cross join    Stacked CTEs
Query cost (relative)       10%                           90%
Cached plan size (bytes)    88                            176
CompileCPU (ms)             28                            50
CompileMemory (KB)          2336                          3376
CompileTime (ms)            28                            50
Estimated Subtree Cost      0.0206647                     0.193489

And the most visible difference is in the graphical query plan.  In case it's unclear, I'll give you a hint: the cross join of sys.all_objects is at the top.


Again, these aren't things you need to be concerned about if you are only populating the numbers table once.  But if you are using these methods on the fly, the difference in performance could be significant.

If you require <= 32,767 rows, you can use this alternative syntax, employing a recursive CTE.  It is much more efficient than the loop, but has a cap of 32,767 rows due to the limit on the MAXRECURSION option:

DECLARE @UpperLimit INT;
SET @UpperLimit = 32767;

;WITH n(x) AS
(
   SELECT x = 1
   UNION ALL
   SELECT x = x + 1
     FROM n
     WHERE x < @UpperLimit
)
SELECT [Number] = x
  INTO dbo.Numbers
  FROM n
  OPTION (MAXRECURSION 32767);

I'll admit, this isn't the easiest syntax to memorize, so you will likely want to keep a copy of it in your toolbox.  If you are going to continue using a loop, at the very least, please get in the habit of adding SET NOCOUNT ON statements to your code.  This way, at least, that big loop won't be wasting all that effort passing "1 row(s) affected" to the client n times, and the client won't be wasting all that effort printing it out to you n times.
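
In practice that just means one extra statement at the top of the batch; sketching it with the same loop from earlier:

SET NOCOUNT ON;  -- suppress the per-row "1 row(s) affected" chatter

DECLARE @i INT, @UpperLimit INT;
SELECT  @i = 1, @UpperLimit = 1000000;

WHILE @i <= @UpperLimit
BEGIN
  INSERT dbo.foo(bar) VALUES(@i);
  SET @i = @i + 1;
END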

Now don't get me wrong, I am not saying that loops are a bad thing.  In fact for a lot of things they are the right answer, and for some things they are even the only answer.  But in cases where things don't have to happen one at a time, they shouldn't happen one at a time.

I am working on a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code.  Up next: using old-style joins.

25 comments on this post

    • Linchi Shea - October 8, 2009, 4:06 AM

      If we are talking about performance and simplicity, I'd actually argue that your very first method, with a loop, is the best (with some minor change).

    • Scott Whigham - October 8, 2009, 2:57 PM

      I like the breakdown – nice post. I don't think that, even if you showed people that performance of a loop was 500 times worse than using a recursive CTE, that it would change more than 2% of the population's technique of doing this. Why? As you pointed out, it's not a technique that people will generally be able to memorize. My "benchmark" on stuff like this is to imagine that I've just learned this technique on December 1 and then I take the whole month off and go to the Bahamas. When I come back to work in mid-January, will I remember this technique? If the answer is, "No", then I do one of two things: I either (a) do as you said and add a script to my "toolkit", or (b) I just forget about it. Your final example has been in my toolkit for several years but when it comes time to write a loop on the fly in a demo, I reach for the declarative technique (that you showed initially).
      If you have a numbers table that covers your min/max values, the simplest technique, for me, is to just use SELECT INTO – no need for more fancy options IMO. I don't even need the SET NOCOUNT ON with such a query:
       SELECT [Number]
       INTO Foo
       FROM dbo.Numbers
       WHERE [Number] BETWEEN 1 AND 1000000
      This is simple, easy to remember, and it's fast. Of course, it depends on your having a numbers table that covers your min/max values.

    • AaronBertrand - October 8, 2009, 5:07 PM

      Linchi, I agree with you about simplicity.  One of the problems with bad habits is that often they are borne out of simplicity.  In a lot of cases the wait time is irrelevant, so the simple option is best.  But perhaps this gives people a better way when they're really in a rush.
      As for performance, I disagree completely (though I don't know what "simple change" you imply).  I ran an insert of 1,000,000 rows into a new table using three methods: (a) a loop with a counter, using SET NOCOUNT ON, (b) a loop with a counter, using SET NOCOUNT OFF, and (c) the sys.objects cross join from above.  I only observed elapsed time, and the results were:
      (a) 8 minutes, 18 seconds
      (b) 8 minutes, 32 seconds
      (c) 3 seconds

    • AaronBertrand - October 8, 2009, 5:07 PM

      Scott, good comments.  I'll address a few.
      1) I realize that I am not going to change a whole lot of habits.  But if I made one person realize a better way to accomplish a common task, mission accomplished, IMHO.
      2) I agree that if you already have a numbers table, then SELECT INTO is best, without any of the CTE complications.  However, you still have to get a numbers table populated in the first place; surely you should do this in the most efficient manner possible, no?  And again, I explained that some people are opposed to such a table on principle – so hopefully this post gives them some alternatives as well.
      3) As for SET NOCOUNT ON, I use it on all my queries, regardless of their complexity.  And like anyone trying to demonstrate and reinforce good habits, repetition is key.  When you are performing one operation with one rowcount output, I agree, it is not necessary and does not buy you anything measurable.  But what harm does it do to add it?  You can even make it the first line in your template for all new queries, then the agonizing keystrokes can't even be a good excuse anymore.  🙂

    • Linchi Shea - October 8, 2009, 6:31 PM

      Your (a) and (c) are not strictly an apples-to-apples comparison. If you can retrieve data from some other table, sure, go ahead. It would be silly to use a loop in that case.
      When you use a loop, you most likely are generating new data, and the bottleneck is almost always on the INSERTs, not the loop itself. By 'minor change', I primarily meant to change the transaction commit behavior of the loop. Instead of committing each and every single row insert, which is very expensive, you may want to commit every 100,000 rows or some other batch size (big enough to make transaction commits efficient). That would make it dramatically faster than 8 minutes.
      Change it to something like the following and it should finish in seconds:
      CREATE TABLE dbo.foo(bar INT);
      DECLARE @i INT, @UpperLimit INT;
      SELECT  @i = 1, @UpperLimit = 1000000;

      begin tran

      WHILE @i <= @UpperLimit
      begin
         if @i % 100000 = 0
         begin
            commit tran
            begin tran
         end

         INSERT dbo.foo(bar) VALUES(@i);
         SET @i = @i + 1;
      end

      if @@trancount > 0
         commit tran
      Here are some more data points:

    • AaronBertrand - October 8, 2009, 6:36 PM

      Fair enough Linchi, but do you know anybody who is going to code their "simple loops" that way?

    • Linchi Shea - October 8, 2009, 8:02 PM

      I agree that if they literally use the first approach, inserting and committing one row at a time, they deserve the poor performance. And it would be a bad habit.

    • AaronBertrand - October 9, 2009, 6:40 PM

      Linchi, I think a lot of people don't realize the implications of these things, such as the costs of commit overhead.  So, saying they "deserve" something that isn't really explained well anywhere is kind of harsh, IMHO.  I still think the inclination for most people is to use a set-based approach rather than a one-by-one loop, even if there are cumbersome ways to defer or limit the costs associated with the latter.

    • Uri Dimant - October 11, 2009, 9:04 AM

      Aaron, great posts.
      I often make a numbers table by repeating an INSERT batch with:
      GO 10000

    • AaronBertrand - October 11, 2009, 7:01 PM

      Uri, that's still a loop, even though you don't actually use BEGIN / END or WHILE — while it is simple to code, it still suffers the same performance problem I pointed out in my post, and isn't subject to the enhancement Linchi mentioned about reducing the number of commits.  Also, without a GO between CREATE TABLE and INSERT INTO, the script you posted will return 9,999 error messages that the table Num already exists.  So, even the simplest types of loops are still subject to memorization / transposition issues.  And where is the schema prefix?  For shame!  🙂
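      (For anyone following along, the script Uri is describing isn't reproduced above; a rough reconstruction — just a guess, with the batch separator and schema prefix added — would look something like this, where GO <n> tells Management Studio / sqlcmd to repeat the preceding batch n times:)
      -- hypothetical reconstruction of the pattern, not Uri's actual script
      CREATE TABLE dbo.Num(n INT IDENTITY(1,1) PRIMARY KEY);
      GO
      INSERT dbo.Num DEFAULT VALUES;
      GO 10000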

    • Uri Dimant - October 11, 2009, 9:57 PM

      :-) Hey, I posted this comment before your last one.
      You know, I don't write old-style JOINs anymore :-))

    • AaronBertrand - October 11, 2009, 10:28 PM

      Cool, so at least one person has kicked a bad habit based on my complaints.  🙂

    • OllyA - October 12, 2009, 11:51 AM

      Actually, you can use the recursive CTE for large numbers of rows.  Don't think it's as fast as a numbers table, but a lot quicker than the loop.
      Just use OPTION (MAXRECURSION 0).
      Oh, and make sure your CTE has the WHERE clause… or you'll be trying to go for an infinite numbers table 😉
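      Something like this, in other words (the same recursive CTE as in the post, just with the hint changed — a sketch, so test it at your own row counts):
      DECLARE @UpperLimit INT;
      SET @UpperLimit = 1000000;

      ;WITH n(x) AS
      (
         SELECT x = 1
         UNION ALL
         SELECT x = x + 1
           FROM n
           WHERE x < @UpperLimit  -- forget this and the recursion never ends
      )
      SELECT [Number] = x
        INTO dbo.Numbers
        FROM n
        OPTION (MAXRECURSION 0);  -- 0 removes the 32,767 cap entirely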

    • Brian Tkatch - October 12, 2009, 5:48 PM

      Aaron, this is a great series. Or rather, these are great posts. The series, however, should be bifurcated (finally, got to use it in a "real" sentence 😛 ). Loops aren't a bad habit, as has been pointed out. There is just a better way. Perhaps you ought to separate bad habits from best practices.

    • AaronBertrand - October 12, 2009, 5:52 PM

      Thanks Brian, in all fairness, I didn't say loops were a bad habit.  Just that they were a bad habit when the goal is to generate millions of rows.

    • Brian Tkatch - October 12, 2009, 7:18 PM

      >I didn't say loops were a bad habit.  Just that they were a bad habit
      >when the goal is to generate millions of rows.
      Oh, OK. Heh. I was just focusing on the title as a subject identifier and ran with it. Regardless, it is a good point to read, and I'll read it whatever you call it.

    • Ron Krauter - November 5, 2009, 9:00 AM

      If you had a million customers in a table and you wanted to email each one of them, wouldn't you use the loop example you provided?
      WHILE …
         -- send dbmail
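      (Fleshed out, that loop might look something like the sketch below — dbo.Customers, its EmailAddress column, and the 'DefaultProfile' Database Mail profile are just placeholders for illustration:)
      SET NOCOUNT ON;

      DECLARE @Email NVARCHAR(320);

      DECLARE c CURSOR LOCAL FAST_FORWARD FOR
        SELECT EmailAddress FROM dbo.Customers;

      OPEN c;
      FETCH NEXT FROM c INTO @Email;

      WHILE @@FETCH_STATUS = 0
      BEGIN
        EXEC msdb.dbo.sp_send_dbmail
             @profile_name = 'DefaultProfile',
             @recipients   = @Email,
             @subject      = 'Hello',
             @body         = 'One e-mail per customer, so a loop is unavoidable here.';

        FETCH NEXT FROM c INTO @Email;
      END

      CLOSE c;
      DEALLOCATE c;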

    • Jeff Moden - March 7, 2010, 12:01 AM

      Ron Krauter said:  If you had a million customers in a table and you wanted to email each one of them, wouldn't you use the loop example you provided?
      Absolutely (although as of SQL Server 2005, there are even ways around that).  In the case you cite, the While Loop won't be the performance problem and is necessary because the proc can only do 1 email at a time (IIRC).
      The problem is that a lot of people write procs that can only handle 1 row or record at a time instead of writing a set based solution.  Heh…I guess they do it out of "habit".
      –Jeff Moden

    • Jeff Moden - March 7, 2010, 12:16 AM

      Aaron said: Thanks Brian, in all fairness, I didn't say loops were a bad habit.
      Heh… I will!  😉
      I know this is a relatively old blog but I just had to leave you a note about this.  I'm really glad to see someone else trying to keep people from using the addictive drug known as "While Loops".  I'm also pressing to keep them from using other forms of unnecessary RBAR such as recursive CTEs to do such simple things as counting, because they're sometimes worse than a While Loop.
      One of my favorite peeves is when someone wants to demonstrate some hot new setbased technique and they go to build a test table to prove their code… and they use a While Loop or recursive CTE to do it.  It just doesn't make sense to me and it tends to destroy the "set based" credibility of the author in my eyes.
      A couple of years ago and again recently, SSC had an article on the ol' "FizzBuzz" problem that's sometimes used during interviews.  It's amazing to me that so many people submit solutions that contain a While Loop and try to justify its use on "short time allowed" and "it's only for 100 rows".  They just don't understand that when I'm doing interviews, I'm looking for people who…
      1.  Don't have the bad habit of using "While Loops" for something so simple.
      2.  Don't have the bad habit of justifying a total lack of future performance and scalability because they don't understand that scalability requirements can change quickly and the code should be able to handle it without rework.
      I took a lot of the same heat on those two threads for the same reason you took some heat on this thread.  I'm thrilled that someone else is standing their ground in a fight against RBAR and non scalable code.  Very well done and my hat's off to you!
      –Jeff Moden  
      BTW… sys.All_Columns in the Master database will always have more than 4k rows for a full install.
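      (A quick sketch of what that buys you — a single cross join of that view is already more than enough rows to count to a million; just an illustration, not from the post:)
      SELECT TOP (1000000)
             Number = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM master.sys.all_columns AS c1
        CROSS JOIN master.sys.all_columns AS c2;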

    • Razvan Socol - May 11, 2010, 4:48 PM

      Hello Aaron,
      You wrote that the recursive CTE method is "much more efficient" than Itzik's method (at least that's what I understand from that sentence). However, in my tests, it takes about 70ms to generate 10000 rows using Itzik's method and 563ms to generate 20000 rows with the recursive CTE. The recursive CTE also seems slower than the sysobjects CROSS JOIN, which provides the 20000 rows in about 80ms.
      I guess that you meant to say that the recursive CTE method is faster than the iterative method, but the sentence seems to imply something else.

    • Abhijit - April 4, 2014, 5:22 PM

      Guys, fantastic. I kicked myself (better late than never).
      Having said that, is there any better solution for my problem below?
      I need to generate following output where R1..Rn are rows and C1..Cn are columns (max limit known)
      R1 => C1 | C2 | C3 | C4 …….Cn
      R2 => C1 | C2 | C3 | C4 …….Cn
      R3 => C1 | C2 | C3 | C4 …….Cn
      For this I generate a series of numbers for Rows and Columns.
      Is using a left join good (assuming there are 10,000 rows and 5,000 columns)?
      I have tried to depict my problem in a very simple mechanism but let me know if the description is adequate.

    • Dean - June 8, 2016, 4:12 PM

      Hi Aaron.  I like the look of your preferred option but would you be able to explain it a little?  What exactly does it do, and why?  I have just run it for the first time and I only get 970,299 rows, not 1,000,000.  I just think it may be useful to people to explain what the code is doing (especially to give people confidence to edit it to their own needs). Thanks!

    • AaronBertrand - June 8, 2016, 7:58 PM

      Hi Dean, sorry, the code should be using sys.all_objects instead of sys.objects. I ran my tests in AdventureWorks but in relatively empty databases I could understand that you might fall short of 1,000,000 rows as a result.
      I'll think about a separate blog post to explain the cross joins, but in the meantime maybe this series will help:

    • Dean - June 9, 2016, 4:04 PM

      Thanks Aaron, I understand cross joins, I just meant from a functional point of view, I was interested in why you chose that particular approach.  I think I can pretty much imply it now from your reply – you just want access to a table with sufficient rows that a cross join to itself will create a large number of results, without having to create and populate it yourself, right?  If so hopefully my experience will be useful to people that if they have a very particular and large number in mind, they may want to be careful about relying on any system tables.  (Even my sys.all_objects table only has 2,267 rows, so would only create 5 million results when cross joined.)
      The general point of your post has really helped me though, I had a requirement to fill a table with 50 million rows of random sales data, and the way I was previously doing it (with a 1 to 50,000,000 WHILE loop) took 50 minutes.  Following your approach, but using a smaller loop to populate a table variable with 7,072 rows, I now have it down to 5 minutes.  I've posted it here in case a) it helps anyone or b) anyone has any further suggestions for improvement!  (Once again, thanks for the post, it was a life saver!)
      DECLARE @first_tab TABLE (col1 INT);
      DECLARE @i INT = 1;

      WHILE @i <= 7072
      BEGIN
        INSERT INTO @first_tab VALUES(@i);
        SET @i = @i + 1;
      END

      -- Sales figures random between 315 and 1140
      -- Dates random number of minutes between 1 and 2629440 after 1/1/07
      -- (makes a range of approx 5 years)
      IF OBJECT_ID('sales_figures') IS NOT NULL DROP TABLE dbo.sales_figures;

      SELECT TOP 50000000
             CONVERT(VARCHAR(30), DATEADD(minute, ABS(CHECKSUM(NewId())) % 2629440, '01-JAN-2007'), 100) AS sales_date
      ,      315 + (ABS(CHECKSUM(NewId())) % 825)  AS sales_figure
      INTO dbo.sales_figures
      FROM @first_tab t1 CROSS JOIN @first_tab t2;

    • Balaji Ram - August 15, 2016, 7:56 PM

      Regarding using an optimal transaction size (100,000 in the example above): this seems to matter only for permanent tables in user databases. Tempdb's #tables do not suffer much of a performance hit even when no transaction size is used and everything is left to the default (auto-commit) with no explicit transaction.
      Thanks to Paul Randal, I know tempdb data pages are NOT flushed to disk unless there is memory pressure. What about tempdb LOG BLOCKs? Are they also NOT flushed to the tempdb log file unless there is memory pressure?
      Any Insights/thoughts? Thanks.
