A cautionary tale for this month's T-SQL Tuesday, where Kenneth Fisher (@sqlstudent144) asks us to write about Non SQL Tips and tricks. I'm going to talk about SQL Server but the overarching concept here can apply to anything technical (and non-technical).
Over the weekend, I intended to do some maintenance on a database… at over 60 TB, it was getting cumbersome to restore / CHECKDB. There were several very large tables that had recently been pruned of extraneous LOB data, and one in particular – a currently static table, alone on its own 27 TB filegroup – that I had my eyes on.
My plan was to simply rebuild the table onto a new filegroup (expected to be about 5 TB), run an EMPTYFILE
against each of the old files, and then remove the old filegroup. This would free up valuable space on their underlying LUNs, and bring some sanity back to our restore validation and logical check processes.
I'm trying to stay away from the technical parts of this, but it's hard. A little technicality here was that I couldn't move the remaining LOB data to a new filegroup (using TEXTIMAGE_ON
) via a rebuild, so I would actually have to create a new table there, then migrate the data.
I planned out my strategy, not really any different than any other substantial data migration… move the data in batches, drop the old table, rename the new one.
I highlighted a manual insert that represented the first chunk of rows I would move and, before realizing I'd also highlighted the drop and rename commands, hit F5.
Yeah, oh poop is right.
My first thought was to dust off my resume. Luckily, this table is not customer-facing, and my blunder did not cause any major disruptions. I'm also surrounded by great peers – storage and infra folks who can help allocate a new LUN almost immediately, and fellow architects like Andy who are just sane voices of reason:
You know that was f*cking stupid so I won't tell you & beat you up for it.
He was absolutely right, and reminded me that there's nothing much to do but fix it. It took a few hours, but I used piecemeal restore to revive that one table, moved the data over, and started again.
The lessons I learned include being much more vigilant about creating defensive (and separate!) scripts for one-time maintenance, and not being so overconfident in my ability to highlight and execute the right portions of a script at the right time. I did manage to free up ~32 TB of disk space once all was said and done, but it took a lot more time and sweat than I originally planned.
There's nothing magical or incredibly insightful here; just a reminder that, no matter how long we've been doing something, we're all human, and we're going to eff up. This won't be my last screw-up, I promise. But you have to learn from these experiences and use them to your advantage, and make sure that you don't make the same mistake a second time. Take your time, triple-check your scripts, and don't try to be a surgeon at midnight.