Data cleanup in SQL Server becomes more important with GDPR

By: Aaron Bertrand | Comments (2) | Related: > Security

Problem

With the General Data Protection Regulation (GDPR) in full effect, it is more important than ever to make sure that you comply with requests from end users whose personal data you have stored. These include requests to permanently delete all of someone’s data from your system, and I bet you think SQL Server’s DELETE command is sufficient. It’s not, and I’ll tell you why.

Solution

First of all, saying “permanently delete me” implies more than a simple delete. If you ask me to delete all your data today, and I comply, what prevents me from scraping your e-mail address from some source tomorrow and re-introducing you to my system? If I have removed all traces of you from my system, how can the system possibly recognize you as someone who wanted to be permanently forgotten? If you find that you are still in my system, I am most likely liable, since you can’t tell the difference between (a) non-compliance, and (b) compliance followed by accidental reintroduction of your data.

I’ll provide some suggestions for this scenario in a future tip, including both full documentation along with some automated possibilities as well. Today, I’m merely demonstrating when a delete isn’t really a delete, and how careful you must be with not only your primary system, but also secondaries and even backups over time.

Let’s look at a simple example. I’m going to create a new database and a very simple table with an e-mail address column.

CREATE DATABASE GDPRTest;
GO
USE GDPRTest;
GO
CREATE TABLE dbo.Customers
(
  CustomerID    int IDENTITY(1,1) NOT NULL,
  EmailAddress  nvarchar(320)     NOT NULL,
  CONSTRAINT    PK_Customers      PRIMARY KEY (CustomerID),
  CONSTRAINT    UQ_CustomerEmail  UNIQUE      (EmailAddress)
);
GO
INSERT dbo.Customers(EmailAddress) VALUES
  (N'[email protected]'),(N'[email protected]'),(N'[email protected]');
GO

Now, [email protected] calls you up and demands that you remove them from your database. You agree to their request and use this stored procedure for the first time:

CREATE PROCEDURE dbo.Customer_Delete
  @EmailAddress nvarchar(320)
AS
BEGIN
  SET NOCOUNT ON;
  DELETE dbo.Customers WHERE EmailAddress = @EmailAddress;
END
GO
EXEC dbo.Customer_Delete @EmailAddress = N'[email protected]';

Then you go home for the day, satisfied that you have complied with the request and that there is no problem.

Astute readers will note that I explicitly said that the stored procedure was executed for the first time, and I did that for a reason: while the plan for this procedure remains in the plan cache, it will be associated with compiled parameter values from the first execution. In this case, [email protected]. We can demonstrate this by borrowing from Bhavesh Patel’s code in his tip, “How to find compiled parameter values for SQL Server cached plans”:

;WITH cvalue AS
(  SELECT est.text,
    eqs.statement_start_offset/2 +1 as statement_start_offset,
    ( CASE WHEN eqs.statement_end_offset = -1 
      THEN LEN(CONVERT(NVARCHAR(MAX), est.text)) * 2 
      ELSE eqs.statement_end_offset END - eqs.statement_start_offset
    ) / 2 as statement_end_offset,
    TRY_CONVERT(XML,SUBSTRING(etqp.query_plan,CHARINDEX('<ParameterList>',etqp.query_plan), 
      CHARINDEX('</ParameterList>',etqp.query_plan) + LEN('</ParameterList>') 
      - CHARINDEX('<ParameterList>',etqp.query_plan) )) AS Parameters
  FROM sys.dm_exec_query_stats eqs
  CROSS APPLY sys.dm_exec_sql_text(eqs.sql_handle) est
  CROSS APPLY sys.dm_exec_text_query_plan
    (eqs.plan_handle, eqs.statement_start_offset, eqs.statement_end_offset) etqp
  WHERE est.objectid = OBJECT_ID(N'dbo.Customer_Delete')
  AND est.dbid = db_id(N'GDPRTest')
)
SELECT pc.compiled.value('@Column', 'nvarchar(128)') AS Parameterlist,
  pc.compiled.value('@ParameterCompiledValue', 'nvarchar(128)') AS [compiled Value],
  SUBSTRING(cvalue.text,cvalue.statement_start_offset,cvalue.statement_end_offset+1) AS sql_text
FROM cvalue
OUTER APPLY cvalue.parameters.nodes('//ParameterList/ColumnReference') AS pc(compiled);

Sure enough, that e-mail address is still in the database in the form of a compiled parameter:

An end user can’t necessarily find that, but an auditor might, and of course there is always the potential scenario where an unwitting developer troubleshoots the execution plan and shares it somewhere. Note that this type of data leakage is worth worrying about in the general case, not just involving GDPR. Hugo Kornelis addresses other ways execution plans expose data in his recent article, “Execution Plans and Data Protection.”

But wait, there’s more!

It’s likely an edge case that an execution plan may reveals personal data behind “forget me forever” request, simply because it happened to be the parameter value that was used the first time the plan was compiled. But where else might we find this data?

Well, we have undocumented DBCC commands that give us visibility into the contents of data and index pages. Let’s ensure that while a query against the table no longer reveals our forgotten user, that the data is really gone. Our job is relatively easy here because the table only contains three rows, so this won’t take a lot of digging to prove or disprove. First, use DBCC IND to determine the data and index pages (PageType IN (1,2)) of interest:

DBCC TRACEON (3604, -1);
DBCC IND(N'GDPRTest', N'dbo.Customers', 1); -- clustered index
DBCC IND(N'GDPRTest', N'dbo.Customers', 2); -- unique index

On my system, this yielded pages 312 and 320:

So, let’s look closer at those pages, using DBCC PAGE:

DBCC PAGE(N'GDPRTest', 1, 312, 2);
DBCC PAGE(N'GDPRTest', 1, 320, 2);

Ugh, sure enough, if we scroll down to the memory dump section, that data is still visible on the physical pages:

Which means it exists on any secondaries, any backups you’ve already taken, any systems where those backups have been restored, and so on. And if it exists anywhere, an auditor can find it – including an auditor working for the GDPR people or a lawyer representing the requestor. Which means you could be in trouble.

You might be thinking, well, a CHECKPOINT should take care of that, right? Nope. Issuing a CHECKPOINT has no impact. How about reorganizing the indexes? Nada. Rebuild? Well, sort of. The data appears on the new pages (in my case, pages 384 and 400), but the old pages (312 and 320) still exist, and still show the old copy of the data until they have been truly cleaned, removed, or replaced.

To be certain you remove deleted data immediately, you can use DBCC CLEANPAGE, which zeroes out the parts of the page that have been deallocated. Let’s do this for page 312:

DECLARE @db int = DB_ID();
DBCC CLEANPAGE(@db, 1, 312);

Now, when we inspect the page, it has changed:

DBCC CLEANPAGE has zeroed out the deleted data.

If you don’t want to use undocumented DBCC commands, you can use sys.sp_clean_db_file_free_space – but keep in mind that this cleans up all the pages in an entire data file, and on big, busy systems this can be quite intrusive. It also doesn’t avoid the undocumented DBCC command since the stored procedure just executes that for you in a loop. And, of course, you would have to know about any orphaned pages left behind by recent rebuilds (so in my case I should run this against pages 312 and 320 and also against pages 384 and 400).

Michael J. Swart – long before the GDPR came about – demonstrated that data from dropped columns is incredibly persistent too. So, while you might think that just dropping columns that contain personal data is an easy out, you need to follow similar techniques. You can read about the details in his article, “It's Hard to Destroy Data.”

Summary

Cleaning up data has much higher stakes today than it did a year ago and, as I have shown, it is harder than it might seem on first glance. Not only do you need to make sure you fully document personal data requests, you also need to make sure your delete processes include all the various ways the data might stick around, even if only long enough to get you caught in a trap. Here I looked at the plan cache and data and index pages, but there are other things I haven’t touched on, such as the transaction log. Be careful out there, and always think like an auditor who has sysadmin rights on your system!

Next Steps

Read on for related tips and other resources:

About the author

Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog, SQLPerformance.com, and also blogs at sqlblog.org.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips