Today someone asked:
How can I figure out the percentage of fields that don't have data?
After further clarification, it turns out what they are after is the proportion of column values that are NULL.
Some folks suggested using a data profiling task in SSIS. There may be some validity to that, but I'm still a fan of sticking to T-SQL when I can, so here is how I would approach it:
- Create a #temp table or @table variable to store the results.
- Create a cursor that loops through all of the columns for all tables, including total row count for the table, and whether or not the column is in fact nullable.
- Within the cursor, for columns that are nullable, get a count where columns are NULL, and insert the data for each column into the #temp table or @table variable.
– I think it's ok here to use WITH (NOLOCK) to obtain the count, since you are likely only looking for estimates - Run queries against the results, depending on exactly what you are after – percentage for certain columns, percentage across the database, what have you.
Some sample code, though depending on your tables and data distribution I am not going to vouch for how long this will take – may be a good idea to restore a backup somewhere and run this kind of thing there, as opposed to your production system:
DECLARE @allcolumns TABLE ( [table] NVARCHAR(512), [column] SYSNAME, nullable INT, row_count BIGINT, null_count BIGINT ); DECLARE @sql NVARCHAR(MAX), @table NVARCHAR(512), @column SYSNAME, @nullable BIT, @row_count INT, @null_count INT; DECLARE c CURSOR LOCAL STATIC FORWARD_ONLY READ_ONLY FOR SELECT QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id])) + '.' + QUOTENAME(OBJECT_NAME(c.[object_id])), c.name, c.is_nullable, p.row_count FROM sys.columns AS c INNER JOIN ( SELECT [object_id], row_count = SUM(row_count) FROM sys.dm_db_partition_stats WHERE index_id IN (0,1) GROUP BY [object_id] ) AS p ON c.[object_id] = p.[object_id] WHERE OBJECTPROPERTY(c.[object_id], 'IsMsShipped') = 0; OPEN c; FETCH NEXT FROM c INTO @table, @column, @nullable, @row_count; WHILE @@FETCH_STATUS = 0 BEGIN SET @null_count = 0; IF @nullable = 1 AND @row_count > 0 BEGIN SET @sql = N'SELECT @null_count = COUNT(*) FROM ' + @table + ' WITH (NOLOCK) WHERE ' + QUOTENAME(@column) + ' IS NULL;'; EXEC sp_executesql @query = @sql, @params = N'@null_count INT OUTPUT', @null_count = @null_count OUTPUT; END INSERT @allcolumns([table], [column], nullable, row_count, null_count) SELECT @table, @column, @nullable, @row_count, @null_count; FETCH NEXT FROM c INTO @table, @column, @nullable, @row_count; END CLOSE c; DEALLOCATE c; SELECT [table], [column], nullable, null_count, row_count, null_perc = CONVERT(DECIMAL(5,2), 100.0*null_count/NULLIF(row_count,0)) FROM @allcolumns ORDER BY [table], [column];
Hope that helps. Even though it uses NOLOCK, dynamic SQL, and a cursor. I'm such a naughty database guy.
Uri, I left those columns in because I suspect most people would still want to see the full report. I don't think performance is a primary concern here because this is a one-off task that you're not running 50 times a day…
Shy, yes you can look at statistics where they're available, but as I told Uri performance isn't the number one priority here, and the code is much simpler if I just do a count instead of checking if there are stats, checking if they're updated, updating them if they're not, and still having to code the count method for columns that don't have stats.
Hi Aaron
Great script, I have my two cents to be added 🙂
Is that possible to filter out for is_nullable and row_count before starting a cursor with CTE –WHERE row_count>0 and etc. in terms of performance? What do you think?
WITH cte
AS
(
SELECT
QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id]))
+ '.' + QUOTENAME(OBJECT_NAME(c.[object_id])) tblname,
c.name, c.is_nullable, p.row_count
FROM sys.columns AS c
INNER JOIN
(
SELECT [object_id], row_count = SUM(row_count)
FROM sys.dm_db_partition_stats
WHERE index_id IN (0,1)
GROUP BY [object_id]
) AS p
ON c.[object_id] = p.[object_id]
WHERE OBJECTPROPERTY(c.[object_id], 'IsMsShipped') = 0
) SELECT * FROM cte WHERE is_nullable>0 AND row_count>0
If the column has an index on it, or column statistics defined, then using the DBCC SHOW_STATISTICS statement can be usefull.
you can itterate through the nullable columns using a Cursor and run DBCC SHOW_STATISTICS using dynamic SQL.
The downside is that it requires statistics to be defined on the column (by using an index or explicitly creating the statistics) and that the statistics will be updated. (WITH FULLSCAN)
Thanks Greg, I'm usually quite meticulous about QUOTENAME() but I guess you get what you pay for. I put the script together pretty quickly last night and perhaps should have held off until today. I've incorporated your fixes.
Cheers,
Aaron
The script failed with two different errors on my development system. Our programmers sometimes choose reserved words for column names (I have no control over this) so I added quotename() for the column name dynamic SQL:
SET @sql = N'SELECT @null_count = COUNT(*)
FROM ' + @table + ' WITH (NOLOCK)
WHERE ' + QUOTENAME(@column) + ' IS NULL;';
Also the final select statement produces a divide by zero error when reporting a table with zero rows so I added a CASE statement:
SELECT
[table],
[column],
nullable,
null_count,
row_count,
null_perc = case when row_count = 0 then NULL else CONVERT(DECIMAL(5,2), 100.0*null_count/row_count) end
FROM @all_columns
where row_count != 0
ORDER BY [table], [column];
Sorry, swapped out names last minute. Missed one. Fixed.
SELECT @table, @column, @nullable, @total_rows, @null_count;
@total_rows is declared where?