We have a lot of data. Some of that data is stored in large databases (dozens of terabytes each). In some shops, this is an excuse to not run integrity checks. We are not one of those shops.
But we don't run full `CHECKDB` operations in production; we have a set of servers dedicated to testing our restores and running checks. We follow a lot of the guidance in these articles:
- CHECKDB From Every Angle: Consistency Checking Options for a VLDB
- Minimizing the impact of DBCC CHECKDB : DOs and DON'Ts
- Minimize performance impact of SQL Server DBCC CHECKDB
And our process was pretty simple (a simplified sketch follows the list):
- restore the backup
- run a full `CHECKDB`
- if successful, drop
- on failure, notify
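In T-SQL terms, that loop looks roughly like this; every name here (database, paths, mail profile, recipients) is a stand-in, not our actual configuration:

```sql
-- All names are hypothetical; the real automation has more moving parts.
RESTORE DATABASE CheckDB_Target
  FROM DISK = N'\\backups\BigDatabase_FULL.bak'
  WITH MOVE N'BigDatabase'     TO N'D:\Data\CheckDB_Target.mdf',
       MOVE N'BigDatabase_log' TO N'E:\Logs\CheckDB_Target.ldf',
       REPLACE;

BEGIN TRY
  DBCC CHECKDB (CheckDB_Target) WITH NO_INFOMSGS, ALL_ERRORMSGS;

  -- If successful, drop:
  DROP DATABASE CheckDB_Target;
END TRY
BEGIN CATCH
  -- On failure, notify (Database Mail profile and recipient are assumptions):
  DECLARE @msg nvarchar(max) = ERROR_MESSAGE();
  EXEC msdb.dbo.sp_send_dbmail
       @profile_name = N'DBA',
       @recipients   = N'dba@example.com',
       @subject      = N'CHECKDB failure: CheckDB_Target',
       @body         = @msg;
END CATCH;
```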
But we have a lot of data.
For some of our larger databases, this process could take a day and a half. Occasionally it would fail, but not because of anything `CHECKDB` found. In some cases, these failures would also cause `CHECKDB` to report troubling errors in the metadata:
```
Table error: page (67:159147577), whose header indicates that it is allocated to object ID 1560872915, index ID 1, partition ID 72058803011125248, alloc unit ID 72059344479649792 (type LOB data), is allocated by another object. [SQLSTATE 42000] (Error 2534)

Table error: page (67:138078340) allocated to object ID 1560872915, index ID 1, partition ID 72058803011125248, alloc unit ID 72059344479649792 (type LOB data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
```
These errors were not reliably reproducible in any of the following scenarios:
- Restoring the same backup again on the same server (sometimes failed again)
- Restoring the same backup on a different server (never failed)
- Restoring a newer or older backup on any server (sometimes failed)
So, was the problem isolated to the backup? Did this condition exist in production? We certainly didn't want to run a full `CHECKDB` there to check, but because the results on our test servers were so inconsistent, we were already pretty convinced the problem didn't exist at the source.
It was the snapshot.
`CHECKDB` by default creates an internal database snapshot, and a lot of the errors we were seeing involved this snapshot. Since these are servers dedicated to validation, and since the snapshot is only required so that the database can stay online for other users, we decided to rule out the snapshot as a cause by using `TABLOCK`. Here's what the docs say about that:
> Causes `DBCC CHECKDB` to obtain locks instead of using an internal database snapshot. This includes a short-term exclusive (X) lock on the database. `TABLOCK` will cause `DBCC CHECKDB` to run faster on a database under heavy load, but decreases the concurrency available on the database while `DBCC CHECKDB` is running.

But there's an issue with that.
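The change itself is just one extra option on the command; a minimal sketch, with a hypothetical database name:

```sql
-- TABLOCK takes locks instead of creating the internal snapshot.
DBCC CHECKDB (CheckDB_Target) WITH NO_INFOMSGS, ALL_ERRORMSGS, TABLOCK;
```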
If you try to use `TABLOCK`, you may see this error:
```
The database could not be exclusively locked to perform the operation.
Msg 7926, Level 16, State 1, Line 1
Check statement aborted. The database could not be checked as a database snapshot could not be created and the database or table could not be locked.
```
So, the easy workaround in our case (since nobody else is using this database): put the database in `single_user` mode first. This is great, except the documentation warns us that, if we use `TABLOCK`, `CHECKCATALOG` won't happen:
> `TABLOCK` limits the checks that are performed; `DBCC CHECKCATALOG` is not run on the database, and Service Broker data is not validated.

We don't need Service Broker checking on the restored copy, but we do want `CHECKCATALOG` to happen. So we added an extra step at the end to perform a separate `CHECKCATALOG` (which is fast, even on multi-TB databases).
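Put together, the check step on a validation server looks something like this sketch (the database name is hypothetical; remember, nobody else is connected):

```sql
-- Nobody else uses this restored copy, so taking it single-user is safe here.
ALTER DATABASE CheckDB_Target SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- TABLOCK avoids creating the internal snapshot.
DBCC CHECKDB (CheckDB_Target) WITH NO_INFOMSGS, ALL_ERRORMSGS, TABLOCK;

-- Separate catalog check, since the docs say TABLOCK skips it:
DBCC CHECKCATALOG (CheckDB_Target);

ALTER DATABASE CheckDB_Target SET MULTI_USER;
```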
Thankfully, that was redundant.
When we tested our new process against various known corrupted databases (thanks to Paul Randal), we found that `CHECKDB` was correctly surfacing the `CHECKCATALOG` errors, in spite of what the documentation says. I submitted a pull request to make the docs more clear.
Long story short: switching to `TABLOCK` eliminated various snapshot issues, and did not require us to perform any extra checks. As a side note, eliminating the snapshot did not make the process any faster. We didn't expect it to, but (and I haven't tested this) it may help in cases where you don't use instant file initialization.
Will TABLOCK work for you?
This is of course not something you can choose to do on production servers, only on servers where you're restoring a copy. You will still want to perform `PHYSICAL_ONLY` checks against production, too, because those will surface any underlying hardware issues that won't be visible on a restored copy.
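That lighter production check might look like this (database name is, again, a stand-in):

```sql
-- Physical structures only; much cheaper than a full CHECKDB on a busy system,
-- and still catches page-level damage caused by the I/O subsystem.
DBCC CHECKDB (ProductionDB) WITH NO_INFOMSGS, ALL_ERRORMSGS, PHYSICAL_ONLY;
```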
I also have to deal with very large databases and the need to check them. Instead of using `TABLOCK` (which won't work in my 24×7 environment), I split the original `CHECKDB` into `CHECKTABLE`/`CHECKCATALOG`/`CHECKALLOC`, depending on the size of the database (small databases still use `CHECKDB`). This logic is all captured in a table that is updated weekly. Every night, a job runs for two hours and picks the oldest object from that table. The only drawback is the false alarm about (large) databases never being checked with `CHECKDB`.
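A rough sketch of that rotation, with table and column names that are assumptions rather than the actual implementation:

```sql
-- All names are assumptions; a minimal sketch of a check-rotation table.
CREATE TABLE dbo.CheckRotation
(
    DatabaseName sysname       NOT NULL,
    ObjectName   nvarchar(512) NOT NULL, -- table name, or a placeholder for db-level checks
    CheckCommand nvarchar(max) NOT NULL, -- full DBCC CHECKTABLE/CHECKALLOC/CHECKCATALOG/CHECKDB statement
    LastChecked  datetime2     NULL,
    CONSTRAINT PK_CheckRotation PRIMARY KEY (DatabaseName, ObjectName)
);

-- Nightly job: keep picking the least recently checked item for up to two hours.
DECLARE @stop datetime2 = DATEADD(HOUR, 2, SYSDATETIME()),
        @db   sysname, @obj nvarchar(512), @cmd nvarchar(max);

WHILE SYSDATETIME() < @stop
BEGIN
    SELECT TOP (1) @db = DatabaseName, @obj = ObjectName, @cmd = CheckCommand
    FROM dbo.CheckRotation
    ORDER BY LastChecked;      -- never-checked rows (NULL) sort first

    IF @@ROWCOUNT = 0 BREAK;   -- nothing queued

    EXEC sys.sp_executesql @cmd;

    UPDATE dbo.CheckRotation
    SET LastChecked = SYSDATETIME()
    WHERE DatabaseName = @db AND ObjectName = @obj;
END;
```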