High LCK_M_SCH_M Waits

We've recently come across multiple customers who have had excessively high LCK_M_SCH_M waits. These surface in sys.dm_os_wait_stats, but much more visibly in the SQL Sentry Performance Advisor dashboard:

LCK_M_SCH_M waits

Upon digging deeper, we observed the following resource_description in sys.dm_os_waiting_tasks:

metadatalock subresource=LOGIN_STATS classid=$login_stats_id = 65536 dbid=1 id=lock54db84b00 mode=Sch-M

Other than seeing high waits in the DMVs (something will always be your top wait), the symptom in general of these waits will likely be significantly long processing of logins, potentially leading to timeouts. This will be especially true if your applications perform a high number of rapid login/logout operations and when different concurrent clients are using the same login; this was the case with each customer we worked with.

So, we started enumerating through all of the things that might affect the amount of time a logon operation could take, even though some of these are not fully plausible:

  • Logon triggers
  • Resource Governor - rogue classifier function
  • Extended Events sessions tracking logon events
  • Kerberos / AD issues - slow or busy domain controller
  • Default database set to auto-close
  • Non-default number of maximum concurrent connections
  • Connection pool exhaustion
  • Apps affected by publisher revocation checking
  • Logging all successful logins
  • More invasive auditing features like C2, Common Criteria, and SQL Server Audit

Many of these candidates were quickly and easily ruled out. We also looked at the connection methods and we were seeing the same thing for Windows auth (including Kerberos) and SQL auth, and also the issue persisted whether the connections were using Named Pipes, TCP/IP or even local connections using shared memory.

I'll save you the suspense or the details involved with all of our testing; the feature that turned out to be causing the problem was an auditing feature, Common Criteria compliance. In a nutshell, the reason is that, with this feature enabled, SQL Server maintains statistics in a system table about each login - things like last success, last failure, and number of retries. With a high number of login events by the same login, there is contention on that row in the system table, since a lock must be taken to update the information in a transaction, and a log flush is forced after each commit (so anything that can slow down log flushes could also contribute to and exacerbate the problem).

You can check if this is enabled in server properties / security:

Enable Common Criteria Compliance

Or by looking at the configuration settings for the server:

SELECT [value]
     , value_in_use
FROM sys.configurations
WHERE name = N'common criteria compliance enabled';

You can also use sp_configure, but I find that more tedious because you have to turn advanced options on first.

None of these customers had this setting enabled intentionally; they didn't remember turning it on, and could offer no explanation as to why it was enabled.

Initially, we turned the setting off, and didn't observe any changes in the behavior. We then restarted the SQL Server service, and still didn't see any change - the waits went down slightly, but they gradually crept back up. We didn't see any true relief from the symptom until we completely rebooted the server. This is explained in the documentation (emphasis mine):

The setting takes effect after the server is restarted.

Through more testing in the reverse direction (turning the setting *on*), we observed that the symptoms would not necessarily reveal themselves immediately. So it's important to note that you may start seeing these symptoms weeks or months after turning the feature on - this made the correlation difficult to track down for our customers, and may prevent it from being immediately obvious to you, too.

Long story short, if you see high wait times for LCK_M_SCH_M, and you have Common Criteria compliance enabled, you can consider one of two paths:

  1. Turning the feature off (please check with stakeholders - make sure you don't need it!).
  2. If you need the feature on, then consider changing the way your application works - use different logins and/or log in and out less frequently.
Thwack - Symbolize TM, R, and C