So my Saturday was basically ruined by the following:
- sys admin makes a minor adjustment to the system clock on the data center's domain controller, Friday @ 6 PM.
- while correcting the time, said sys admin inadvertently changes April to July, then leaves for the weekend.
- over the next 20 minutes, every server in the data center adopts the new date/time from the DC.
- the entire data center is in this state for about six hours, until the sys admin answers the phone.
Seems like a simple problem to solve… just change the date back on the DC, and let it propagate out to all of the servers again, right? Not so fast, Charlie. That is certainly the first step, yes. In our case, we rely on the system clock of our application and database servers to indicate when an event happened (and also to schedule events in the future). So, during our six hour period of incorrect data, many users of our application *may* have entered incorrect data, unknowingly, in any of about 600 different databases across dozens of servers. Even if our applications were completely idle, the system itself was also going to be responsible for generating some new rows with these off-by-three-months datetime values. So now I had to assess the damage, and there were two primary areas that I needed to deal with: SQL Server Agent, and the data itself.
SQL Server Agent
Shortly after the system clocks were corrected, I received complaints from support that certain rollup activities had not occurred in several hours. I went to our main database server and checked SQL Server Agent's Job Activity Monitor, and sure enough, all of the jobs were idle and hadn't run in some time. Their schedules all read:
I ran a couple of jobs manually, to see if this would correct the issue. It did not. A typical schedule now read:
Head-scratcher, right? So I go to msdb and update the next_run_date for a specific job:
UPDATE msdb.dbo.sysjobschedules SET next_run_date = 20090411 WHERE job_id = <job_id>;
Nothing. Job Activity Monitor still reflects July 10th and running the job again manually simply resets the next_run_date value back to 2009-07-10.
What ended up working was right-clicking each job individually, opening up each schedule, changing some arbitrary property, and saving the job. (I chose moving the start date for the schedule to today's date, FWIW.) Now SQL Server Agent was fixed, and all my jobs were running on the right schedule, though my clicking finger was a little sore. And I still had the biggest task in front of me.
Correcting the Data
Some of you may be aware that I manage some very unique systems; the one in particular that was perplexing in this case has 500+ nearly identical databases on a single instance. Since all of these databases could have possibly been affected by data changes within the "bad window" from last night, I basically had to correct the values in every single DATETIME or SMALLDATETIME column across the entire system (after discussions with other engineers, we were extremely confident that no data found to be in the July timeframe was actually legitimate). I wrote a script that would handle this, that looked as follows in its raw form:
SET NOCOUNT ON; SELECT 'UPDATE ' + QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id], DB_ID())) + '.' + QUOTENAME(OBJECT_NAME(c.[object_id], DB_ID())) + ' SET ' + QUOTENAME(c.name) + ' = DATEADD(MONTH, -3, ' + QUOTENAME(c.name) + ')' + ' WHERE ' + QUOTENAME(c.name) + ' >= ''20090710'';' FROM sys.columns AS c INNER JOIN sys.types AS t ON c.system_type_id = t.system_type_id INNER JOIN sys.tables AS tb ON c.[object_id] = tb.[object_id] WHERE t.name IN ('DATETIME', 'SMALLDATETIME');
I then executed that script against one of the "identical" databases, using results to text mode, and it produced output like this:
UPDATE [dbo].[table_1] SET [col_1] = DATEADD(MONTH, -3, [col_1]) WHERE [col_1] >= '20090710'; UPDATE [dbo].[table_1] SET [col_2] = DATEADD(MONTH, -3, [col_2]) WHERE [col_2] >= '20090710'; UPDATE [dbo].[table_1] SET [col_3] = DATEADD(MONTH, -3, [col_3]) WHERE [col_3] >= '20090710'; UPDATE [dbo].[table_2] SET [col_1] = DATEADD(MONTH, -3, [col_1]) WHERE [col_1] >= '20090710'; ...
(Now, this assumes that you don't have any alias or other custom types that store DATETIME/SMALLDATETIME data, and that you don't store such data in INTs like msdb or in CHAR/VARCHAR like I see a lot of people do.)
Anyway, I took that output, and executed it with a multi-DB deployment tool, SQL Farm Combine, which allows me to execute the same query against multiple databases simultaneously. I don't want this to come off as an advertisement, but if you find yourself doing a lot of sp_MSforeachdb wrangling, you should definitely look at this tool. I haven't used others, but Red-Gate has something similar called SQL Multi Script, Apex has multiple DB support in their SQL Script tool, and I'm sure there are some from other vendors as well.
Now, you might remember that I said *nearly* identical above. To ensure that I captured every single column that might not be in every database (some databases have custom tables for a specific customer), I needed to add an additional step to the process. Rather than run the exact same script against every database, I needed to capture tables and columns that were custom, but did not really feel like going through every single database to find them. Instead I created the following script, falling back to sp_MSforeachdb after all:
SET NOCOUNT ON; EXEC master.dbo.sp_MSforeachdb 'IF DB_ID(''?'') > 4 BEGIN SELECT ''/* ? */''; SELECT ''UPDATE '' + QUOTENAME(''?'') + ''.'' + QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id], DB_ID(''?''))) + ''.'' + QUOTENAME(OBJECT_NAME(c.[object_id], DB_ID(''?''))) + '' SET '' + QUOTENAME(c.name) + '' = DATEADD(MONTH, -3, '' + QUOTENAME(c.name) + '')'' + '' WHERE '' + QUOTENAME(c.name) + '' >= ''''20090710'''';'' FROM [?].sys.columns AS c INNER JOIN [?].sys.types AS t ON c.system_type_id = t.system_type_id INNER JOIN [?].sys.tables AS tb ON c.[object_id] = tb.[object_id] WHERE t.name IN (''DATETIME'', ''SMALLDATETIME''); END';
Now I ran this in results to text mode – which again is important, mostly due to the number of grids – and it created output like this:
/* DB_1 */ --------------------- UPDATE [DB_1].[dbo].[table_1] SET [col_1] = DATEADD(MONTH, -3, [col_1]) WHERE [col_1] >= '20090710'; UPDATE [DB_1].[dbo].[table_1] SET [col_2] = DATEADD(MONTH, -3, [col_2]) WHERE [col_2] >= '20090710'; ... /* DB_2 */ --------------------- UPDATE [DB_2].[dbo].[table_1] SET [col_1] = DATEADD(MONTH, -3, [col_1]) WHERE [col_1] >= '20090710'; UPDATE [DB_2].[dbo].[table_1] SET [col_2] = DATEADD(MONTH, -3, [col_2]) WHERE [col_2] >= '20090710'; ...
If I had been thinking more cleverly at the time, I would have added logic to filter out the tables I had already updated. But alas, I just took various segments of that output and ran them manually in multiple Management Studio query windows. I think pruning the list may have taken more time than just waiting for no-op UPDATE statements against the tables I had already updated.
Yes, this took a long time, as not all of the tables are indexed on their DATETIME/SMALLDATETIME columns. But I learned a few quick lessons, and hopefully my experience here will help you if you ever find yourself in a similar fate.