Navigating DBCC CHECKDB for VLDB

We have a lot of data. Some of that data is stored in large databases (dozens of terabytes each). In some shops, this is an excuse to not run integrity checks. We are not one of those shops.

But we don't run full CHECKDB operations in production; we have a set of servers dedicated to testing our restores and running checks. We follow a lot of the guidance in these articles:

And our process was pretty simple:

  • restore
  • run checkdb
    • if successful, drop
    • on failure, notify
  • But we have a lot of data.

    For some of our larger databases, this process could take a day and a half. Occasionally it would fail, but not because of anything CHECKDB found. Some of the errors were:

    DBCC CHECKDB (<dbname>) executed by <login> terminated abnormally due to error state 6. Elapsed time: 30 hours 47 minutes 30 seconds.
    "<path>\<dbname>_<file>.mdf_MSSQL_DBCC95: Operating system error 112(There is not enough space on the disk.) encountered.
    The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset <0x…> in file '<path>\<dbname>_<file>.mdf_MSSQL_DBCC115'.
    The operating system returned error 665(The requested operation could not be completed due to a file system limitation) to SQL Server during a write at offset <0x…> in file '<path>\<dbname>_<file>.mdf_MSSQL_DBCC51'.

    In some cases, these failures would also cause CHECKDB to report troubling errors in the metadata:

    [SQLSTATE 42000] (Error 2533)
    Table error: page (67:159147577), whose header indicates that it is allocated to object ID 1560872915, index ID 1, partition ID 72058803011125248, alloc unit ID 72059344479649792 (type LOB data), is allocated by another object.

    [SQLSTATE 42000] (Error 2534)
    Table error: page (67:138078340) allocated to object ID 1560872915, index ID 1, partition ID 72058803011125248, alloc unit ID 72059344479649792 (type LOB data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.

    These were not always reproducible in any of the following scenarios:

    • Restoring the same backup again on the same server (sometimes failed again)
    • Restoring the same backup on a different server (never failed)
    • Restoring a newer or older backup on any server (sometimes failed)

    So, was the problem isolated to the backup? Did this condition exist in production? We certainly didn't want to run a full CHECKDB there to check, but because the results on our test servers were so inconsistent, we were already pretty convinced the problem didn't exist at the source.

    It was the snapshot.

    CHECKDB by default creates an internal database snapshot, and a lot of the errors we were seeing involved this snapshot. Since these are servers dedicated to validation, and since the snapshot is only required so that the database can stay online for other users, we decided to rule out the snapshot as a cause by using TABLOCK. Here's what the docs say about that:

    TABLOCK
    Causes DBCC CHECKDB to obtain locks instead of using an internal database snapshot. This includes a short-term exclusive (X) lock on the database. TABLOCK will cause DBCC CHECKDB to run faster on a database under heavy load, but decreases the concurrency available on the database while DBCC CHECKDB is running.

    But there's an issue with that.

    If you try to use TABLOCK, you may see this error:

    Msg 5030, Level 16, State 12, Line 1
    The database could not be exclusively locked to perform the operation.
    Msg 7926, Level 16, State 1, Line 1
    Check statement aborted. The database could not be checked as a database snapshot could not be created and the database or table could not be locked.

    So, the easy workaround in our case (since nobody else is using this database): put the database in single_user mode first. This is great, except the documentation warns us that, if we use TABLOCK, CHECKCATALOG won't happen:

    Important
    TABLOCK limits the checks that are performed; DBCC CHECKCATALOG is not run on the database, and Service Broker data is not validated.

    We don't need Service Broker checking on the restored copy, but we do want CHECKCATALOG to happen. So we added an extra step at the end to perform a separate CHECKCATALOG (which is fast, even on multi-TB databases).

    Thankfully, that was redundant.

    When we tested our new process against various known corrupted databases (thanks to Paul Randal), we found that CHECKDB was correctly surfacing the CHECKCATALOG errors, in spite of what the documentation says. I submitted a pull request to make the docs more clear.

    Long story short: switching to TABLOCK eliminated various snapshot issues, and did not require us to perform any extra checks. As a side note, eliminating the snapshot did not make the process any faster. We didn't expect it to, but – and I haven't tested this – it may help in cases where you don't use instant file initialization.

    Will TABLOCK work for you?

    This is of course not something you can choose to do on production servers; only on servers where you're restoring a copy. You will still want to perform PHYSICAL_ONLY checks against production, too, because that will bring up any underlying hardware issues that won't be visible on a restored copy.

    Aaron Bertrand

    I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, speak frequently at conferences, and write at SQLPerformance and MSSQLTips. In real life I am a father of two, an architect at Wayfair, and my pronouns are he/him.

    2 Responses

    1. Wilfred van Dijk says:

      I also have to deal with verly large databases and the need to check these. Instead of using TABLOCK (which won't work in my 24×7 environment) I splitted the original CHECKDB into CHECKTABLE/CHECKCATALOG/CHECKALLOC depending on the size of the database (small databases still use CHECKDB). This logica is all captured in a table and updated weekly. Every night a job runs for 2 hour and picks the oldest object from that table. The only drawback is the false alarm about not checking (large) databases with CheckDB.

    1. November 6th, 2020

      […] Aaron Bertrand shares some thoughts on CHECKDB: […]