Determining distribution of NULL values
November 17th, 20117
Determining distribution of NULL values
November 17th, 20117
 
 

Today someone asked:

How can I figure out the percentage of fields that don't have data?

After further clarification, it turns out what they are after is the proportion of column values that are NULL.

Some folks suggested using a data profiling task in SSIS. There may be some validity to that, but I'm still a fan of sticking to T-SQL when I can, so here is how I would approach it:

  1. Create a #temp table or @table variable to store the results.
  2. Create a cursor that loops through all of the columns for all tables, including total row count for the table, and whether or not the column is in fact nullable.
  3. Within the cursor, for columns that are nullable, get a count where columns are NULL, and insert the data for each column into the #temp table or @table variable.
        – I think it's ok here to use WITH (NOLOCK) to obtain the count, since you are likely only looking for estimates
  4. Run queries against the results, depending on exactly what you are after – percentage for certain columns, percentage across the database, what have you.

Some sample code, though depending on your tables and data distribution I am not going to vouch for how long this will take – may be a good idea to restore a backup somewhere and run this kind of thing there, as opposed to your production system:

DECLARE @allcolumns TABLE
(
   [table]    NVARCHAR(512), 
   [column]   SYSNAME,
   nullable   INT,
   row_count  BIGINT,
   null_count BIGINT
);
 
DECLARE 
   @sql        NVARCHAR(MAX),
   @table      NVARCHAR(512),
   @column     SYSNAME,
   @nullable   BIT,
   @row_count  INT,
   @null_count INT;
 
DECLARE c CURSOR 
   LOCAL STATIC FORWARD_ONLY READ_ONLY
   FOR 
   SELECT 
       QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id]))
        + '.' + QUOTENAME(OBJECT_NAME(c.[object_id])),
       c.name, c.is_nullable, p.row_count
   FROM sys.columns AS c
   INNER JOIN
   (
       SELECT [object_id], row_count = SUM(row_count)
       FROM sys.dm_db_partition_stats
       WHERE index_id IN (0,1)
       GROUP BY [object_id]
   ) AS p
   ON c.[object_id] = p.[object_id]
   WHERE OBJECTPROPERTY(c.[object_id], 'IsMsShipped') = 0;
 
OPEN c;
 
FETCH NEXT FROM c INTO @table, @column, @nullable, @row_count;
 
WHILE @@FETCH_STATUS = 0
BEGIN
   SET @null_count = 0;
 
   IF @nullable = 1 AND @row_count > 0
   BEGIN
       SET @sql = N'SELECT @null_count = COUNT(*) 
			FROM ' + @table + ' WITH (NOLOCK)
			WHERE ' + QUOTENAME(@column) + ' IS NULL;';
 
       EXEC sp_executesql 
           @query      = @sql, 
           @params     = N'@null_count INT OUTPUT', 
           @null_count = @null_count OUTPUT;
   END
 
   INSERT @allcolumns([table], [column], nullable, row_count, null_count)
     SELECT @table, @column, @nullable, @row_count, @null_count;
 
   FETCH NEXT FROM c INTO @table, @column, @nullable, @row_count;
END
 
CLOSE c;
DEALLOCATE c;
 
SELECT 
   [table],
   [column],
   nullable,
   null_count,
   row_count,
   null_perc = CONVERT(DECIMAL(5,2), 100.0*null_count/NULLIF(row_count,0))
FROM @allcolumns
ORDER BY [table], [column];

Hope that helps. Even though it uses NOLOCK, dynamic SQL, and a cursor. I'm such a naughty database guy.

By: Aaron Bertrand

I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, write at Simple Talk, SQLPerformance, and MSSQLTips, and have had the honor of speaking at more conferences than I can remember. In non-tech life, I am a husband, a father of two, a huge hockey and football fan, and my pronouns are he/him.

7 Responses

  1. AaronBertrand says:

    Uri, I left those columns in because I suspect most people would still want to see the full report. I don't think performance is a primary concern here because this is a one-off task that you're not running 50 times a day…
    Shy, yes you can look at statistics where they're available, but as I told Uri performance isn't the number one priority here, and the code is much simpler if I just do a count instead of checking if there are stats, checking if they're updated, updating them if they're not, and still having to code the count method for columns that don't have stats.

  2. Uri Dimant says:

    Hi Aaron
    Great script, I have my two cents to be added 🙂
    Is that  possible to filter out for is_nullable and row_count before starting a cursor with CTE –WHERE row_count>0 and etc. in terms of performance? What do you think?
    WITH cte
    AS
    (
    SELECT
          QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id]))
           + '.' + QUOTENAME(OBJECT_NAME(c.[object_id])) tblname,
          c.name, c.is_nullable, p.row_count
      FROM sys.columns AS c
      INNER JOIN
      (
          SELECT [object_id], row_count = SUM(row_count)
          FROM sys.dm_db_partition_stats
          WHERE index_id IN (0,1)
          GROUP BY [object_id]
      ) AS p
      ON c.[object_id] = p.[object_id]
      WHERE OBJECTPROPERTY(c.[object_id], 'IsMsShipped') = 0
    ) SELECT * FROM cte WHERE is_nullable>0 AND row_count>0

  3. Shy Engelberg says:

    If the column has an index on it, or column statistics defined, then using the DBCC SHOW_STATISTICS statement can be usefull.
    you can itterate through the nullable columns using a Cursor and run DBCC SHOW_STATISTICS using dynamic SQL.
    The downside is that it requires statistics to be defined on the column (by using an index or explicitly creating the statistics) and that the statistics will be updated. (WITH FULLSCAN)

  4. AaronBertrand says:

    Thanks Greg, I'm usually quite meticulous about QUOTENAME() but I guess you get what you pay for. I put the script together pretty quickly last night and perhaps should have held off until today. I've incorporated your fixes.
    Cheers,
    Aaron

  5. Greg Faulk says:

    The script failed with two different errors on my development system. Our programmers sometimes choose reserved words for column names (I have no control over this) so I added quotename() for the column name dynamic SQL:
    SET @sql = N'SELECT @null_count = COUNT(*)
    FROM ' + @table + ' WITH (NOLOCK)
    WHERE ' + QUOTENAME(@column) + ' IS NULL;';
    Also the final select statement produces a divide by zero error when reporting a table with zero rows so I added a CASE statement:
    SELECT
      [table],
      [column],
      nullable,
      null_count,
      row_count,
      null_perc = case when row_count = 0 then NULL else CONVERT(DECIMAL(5,2), 100.0*null_count/row_count) end
    FROM @all_columns
    where row_count != 0
    ORDER BY [table], [column];

  6. AaronBertrand says:

    Sorry, swapped out names last minute. Missed one. Fixed.

  7. Stephen Mandeville says:

    SELECT @table, @column, @nullable, @total_rows, @null_count;
    @total_rows is declared where?