Determining distribution of NULL values
Today on the twitter hash tag #sqlhelp, @leenux_tux asked:
How can I figure out the percentage of fields that don't have data ?
After further clarification, it turns out he is after what proportion of columns are NULL.
Some folks suggested using a data profiling task in SSIS. There may be some validity to that, but I'm still a fan of sticking to T-SQL when I can, so here is how I would approach it:
- Create a #temp table or @table variable to store the results.
- Create a cursor that loops through all of the columns for all tables, including total row count for the table, and whether or not the column is in fact nullable.
- Within the cursor, for columns that are nullable, get a count where columns are NULL, and insert the data for each column into the #temp table or @table variable.
– I think it's ok here to use WITH (NOLOCK) to obtain the count, since you are likely only looking for estimates
- Run queries against the results, depending on exactly what you are after – percentage for certain columns, percentage across the database, what have you.
Some sample code, though depending on your tables and data distribution I am not going to vouch for how long this will take – may be a good idea to restore a backup somewhere and run this kind of thing there, as opposed to your production system:
DECLARE @|/4/>all_columns TABLE ( [table] NVARCHAR(512), [column] SYSNAME, nullable INT, row_count BIGINT, null_count BIGINT ); DECLARE @sql NVARCHAR(MAX), @table NVARCHAR(512), @column SYSNAME, @nullable BIT, @row_count INT, @null_count INT; DECLARE c CURSOR LOCAL STATIC FORWARD_ONLY READ_ONLY FOR SELECT QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id])) + '.' + QUOTENAME(OBJECT_NAME(c.[object_id])), c.name, c.is_nullable, p.row_count FROM sys.columns AS c INNER JOIN ( SELECT [object_id], row_count = SUM(row_count) FROM sys.dm_db_partition_stats WHERE index_id IN (0,1) GROUP BY [object_id] ) AS p ON c.[object_id] = p.[object_id] WHERE OBJECTPROPERTY(c.[object_id], 'IsMsShipped') = 0; OPEN c; FETCH NEXT FROM c INTO @table, @column, @nullable, @row_count; WHILE @@FETCH_STATUS = 0 BEGIN SET @null_count = 0; IF @nullable = 1 AND @row_count > 0 BEGIN SET @sql = N'SELECT @null_count = COUNT(*) FROM ' + @table + ' WITH (NOLOCK) WHERE ' + QUOTENAME(@column) + ' IS NULL;'; EXEC sp_executesql @query = @sql, @params = N'@null_count INT OUTPUT', @null_count = @null_count OUTPUT; END INSERT @|/4/>all_columns([table], [column], nullable, row_count, null_count) SELECT @table, @column, @nullable, @row_count, @null_count; FETCH NEXT FROM c INTO @table, @column, @nullable, @row_count; END CLOSE c; DEALLOCATE c; SELECT [table], [column], nullable, null_count, row_count, null_perc = CONVERT(DECIMAL(5,2), 100.0*null_count/NULLIF(row_count,0)) FROM @|/4/>all_columns ORDER BY [table], [column];
Hope that helps. Even though it uses NOLOCK, dynamic SQL, and a cursor. I'm such a naughty database guy.