November 17, 2011 | SQL Server

Determining distribution of NULL values

Today on the twitter hash tag #sqlhelp, @leenux_tux asked:

How can I figure out the percentage of fields that don't have data ?

After further clarification, it turns out he is after what proportion of columns are NULL.

Some folks suggested using a data profiling task in SSIS. There may be some validity to that, but I'm still a fan of sticking to T-SQL when I can, so here is how I would approach it:

  1. Create a #temp table or @table variable to store the results.
  2. Create a cursor that loops through all of the columns for all tables, including total row count for the table, and whether or not the column is in fact nullable.
  3. Within the cursor, for columns that are nullable, get a count where columns are NULL, and insert the data for each column into the #temp table or @table variable.
        – I think it's ok here to use WITH (NOLOCK) to obtain the count, since you are likely only looking for estimates
  4. Run queries against the results, depending on exactly what you are after – percentage for certain columns, percentage across the database, what have you.

Some sample code, though depending on your tables and data distribution I am not going to vouch for how long this will take – may be a good idea to restore a backup somewhere and run this kind of thing there, as opposed to your production system:

 DECLARE @|/4/>all_columns TABLE
(
   [table]    NVARCHAR(512), 
   [column]   SYSNAME,
   nullable   INT,
   row_count  BIGINT,
   null_count BIGINT
);
 
DECLARE 
   @sql        NVARCHAR(MAX),
   @table      NVARCHAR(512),
   @column     SYSNAME,
   @nullable   BIT,
   @row_count  INT,
   @null_count INT;
 
DECLARE c CURSOR 
   LOCAL STATIC FORWARD_ONLY READ_ONLY
   FOR 
   SELECT 
       QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id]))
        + '.' + QUOTENAME(OBJECT_NAME(c.[object_id])),
       c.name, c.is_nullable, p.row_count
   FROM sys.columns AS c
   INNER JOIN
   (
       SELECT [object_id], row_count = SUM(row_count)
       FROM sys.dm_db_partition_stats
       WHERE index_id IN (0,1)
       GROUP BY [object_id]
   ) AS p
   ON c.[object_id] = p.[object_id]
   WHERE OBJECTPROPERTY(c.[object_id], 'IsMsShipped') = 0;
 
OPEN c;
 
FETCH NEXT FROM c INTO @table, @column, @nullable, @row_count;
 
WHILE @@FETCH_STATUS = 0
BEGIN
   SET @null_count = 0;
 
   IF @nullable = 1 AND @row_count > 0
   BEGIN
       SET @sql = N'SELECT @null_count = COUNT(*) 
			FROM ' + @table + ' WITH (NOLOCK)
			WHERE ' + QUOTENAME(@column) + ' IS NULL;';
 
       EXEC sp_executesql 
           @query      = @sql, 
           @params     = N'@null_count INT OUTPUT', 
           @null_count = @null_count OUTPUT;
   END
 
   INSERT @|/4/>all_columns([table], [column], nullable, row_count, null_count)
     SELECT @table, @column, @nullable, @row_count, @null_count;
 
   FETCH NEXT FROM c INTO @table, @column, @nullable, @row_count;
END
 
CLOSE c;
DEALLOCATE c;
 
SELECT 
   [table],
   [column],
   nullable,
   null_count,
   row_count,
   null_perc = CONVERT(DECIMAL(5,2), 100.0*null_count/NULLIF(row_count,0))
FROM @|/4/>all_columns
ORDER BY [table], [column];

Hope that helps. Even though it uses NOLOCK, dynamic SQL, and a cursor. I'm such a naughty database guy.
 

7 comments on this post

    • Stephen Mandeville - November 18, 2011, 4:07 PM

      SELECT @table, @column, @nullable, @total_rows, @null_count;
      @total_rows is declared where?

    • AaronBertrand - November 18, 2011, 4:14 PM

      Sorry, swapped out names last minute. Missed one. Fixed.

    • Greg Faulk - November 18, 2011, 7:39 PM

      The script failed with two different errors on my development system. Our programmers sometimes choose reserved words for column names (I have no control over this) so I added quotename() for the column name dynamic SQL:
      SET @sql = N'SELECT @null_count = COUNT(*)
      FROM ' + @table + ' WITH (NOLOCK)
      WHERE ' + QUOTENAME(@column) + ' IS NULL;';
      Also the final select statement produces a divide by zero error when reporting a table with zero rows so I added a CASE statement:
      SELECT
        [table],
        [column],
        nullable,
        null_count,
        row_count,
        null_perc = case when row_count = 0 then NULL else CONVERT(DECIMAL(5,2), 100.0*null_count/row_count) end
      FROM @all_columns
      where row_count != 0
      ORDER BY [table], [column];

    • AaronBertrand - November 18, 2011, 10:39 PM

      Thanks Greg, I'm usually quite meticulous about QUOTENAME() but I guess you get what you pay for. I put the script together pretty quickly last night and perhaps should have held off until today. I've incorporated your fixes.
      Cheers,
      Aaron

    • Shy Engelberg - November 20, 2011, 4:35 PM

      If the column has an index on it, or column statistics defined, then using the DBCC SHOW_STATISTICS statement can be usefull.
      you can itterate through the nullable columns using a Cursor and run DBCC SHOW_STATISTICS using dynamic SQL.
      The downside is that it requires statistics to be defined on the column (by using an index or explicitly creating the statistics) and that the statistics will be updated. (WITH FULLSCAN)

    • Uri Dimant - November 20, 2011, 5:22 PM

      Hi Aaron
      Great script, I have my two cents to be added 🙂
      Is that  possible to filter out for is_nullable and row_count before starting a cursor with CTE –WHERE row_count>0 and etc. in terms of performance? What do you think?
      WITH cte
      AS
      (
      SELECT
            QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id]))
             + '.' + QUOTENAME(OBJECT_NAME(c.[object_id])) tblname,
            c.name, c.is_nullable, p.row_count
        FROM sys.columns AS c
        INNER JOIN
        (
            SELECT [object_id], row_count = SUM(row_count)
            FROM sys.dm_db_partition_stats
            WHERE index_id IN (0,1)
            GROUP BY [object_id]
        ) AS p
        ON c.[object_id] = p.[object_id]
        WHERE OBJECTPROPERTY(c.[object_id], 'IsMsShipped') = 0
      ) SELECT * FROM cte WHERE is_nullable>0 AND row_count>0

    • AaronBertrand - November 20, 2011, 6:14 PM

      Uri, I left those columns in because I suspect most people would still want to see the full report. I don't think performance is a primary concern here because this is a one-off task that you're not running 50 times a day…
      Shy, yes you can look at statistics where they're available, but as I told Uri performance isn't the number one priority here, and the code is much simpler if I just do a count instead of checking if there are stats, checking if they're updated, updating them if they're not, and still having to code the count method for columns that don't have stats.

Comments are closed.