Navigating DBCC CHECKDB for VLDB
We have a lot of data. Some of that data is stored in large databases (dozens of terabytes each). In some shops, this is an excuse to not run integrity checks. We are not one of those shops.
But we don't run full
CHECKDB operations in production; we have a set of servers dedicated to testing our restores and running checks. We follow a lot of the guidance in these articles:
- CHECKDB From Every Angle: Consistency Checking Options for a VLDB
- Minimizing the impact of DBCC CHECKDB : DOs and DON'Ts
- Minimize performance impact of SQL Server DBCC CHECKDB
And our process was pretty simple:
- if successful, drop
- on failure, notify
But we have a lot of data.
For some of our larger databases, this process could take a day and a half. Occasionally it would fail, but not because of anything
CHECKDB found. Some of the errors were:
In some cases, these failures would also cause
CHECKDB to report troubling errors in the metadata:
Table error: page (67:159147577), whose header indicates that it is allocated to object ID 1560872915, index ID 1, partition ID 72058803011125248, alloc unit ID 72059344479649792 (type LOB data), is allocated by another object.
[SQLSTATE 42000] (Error 2534)
Table error: page (67:138078340) allocated to object ID 1560872915, index ID 1, partition ID 72058803011125248, alloc unit ID 72059344479649792 (type LOB data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
These were not always reproducible in any of the following scenarios:
- Restoring the same backup again on the same server (sometimes failed again)
- Restoring the same backup on a different server (never failed)
- Restoring a newer or older backup on any server (sometimes failed)
So, was the problem isolated to the backup? Did this condition exist in production? We certainly didn't want to run a full
CHECKDB there to check, but because the results on our test servers were so inconsistent, we were already pretty convinced the problem didn't exist at the source.
It was the snapshot.
CHECKDB by default creates an internal database snapshot, and a lot of the errors we were seeing involved this snapshot. Since these are servers dedicated to validation, and since the snapshot is only required so that the database can stay online for other users, we decided to rule out the snapshot as a cause by using
TABLOCK. Here's what the docs say about that:
DBCC CHECKDBto obtain locks instead of using an internal database snapshot. This includes a short-term exclusive (X) lock on the database.
DBCC CHECKDBto run faster on a database under heavy load, but decreases the concurrency available on the database while
DBCC CHECKDBis running.
But there's an issue with that.
If you try to use
TABLOCK, you may see this error:
The database could not be exclusively locked to perform the operation.
Msg 7926, Level 16, State 1, Line 1
Check statement aborted. The database could not be checked as a database snapshot could not be created and the database or table could not be locked.
So, the easy workaround in our case (since nobody else is using this database): put the database in
single_user mode first. This is great, except the documentation warns us that, if we use
CHECKCATALOG won't happen:
TABLOCKlimits the checks that are performed;
DBCC CHECKCATALOGis not run on the database, and Service Broker data is not validated.
We don't need Service Broker checking on the restored copy, but we do want
CHECKCATALOG to happen. So we added an extra step at the end to perform a separate
CHECKCATALOG (which is fast, even on multi-TB databases).
Thankfully, that was redundant.
When we tested our new process against various known corrupted databases (thanks to Paul Randal), we found that
CHECKDB was correctly surfacing the
CHECKCATALOG errors, in spite of what the documentation says. I submitted a pull request to make the docs more clear.
Long story short: switching to
TABLOCK eliminated various snapshot issues, and did not require us to perform any extra checks. As a side note, eliminating the snapshot did not make the process any faster. We didn't expect it to, but – and I haven't tested this – it may help in cases where you don't use instant file initialization.
Will TABLOCK work for you?
This is of course not something you can choose to do on production servers; only on servers where you're restoring a copy. You will still want to perform
PHYSICAL_ONLY checks against production, too, because that will bring up any underlying hardware issues that won't be visible on a restored copy.