Optimized locking in Azure SQL Database

I don’t think I’ve ever had a great impression of Azure SQL Database as a place for production applications. In its early days, it was missing several important features (like data compression). While that hole has been plugged, there are still other limitations and differences you should be aware of, like T-SQL differences, lack of SQL Server Agent, and some observability gaps. There are several other differences, and while this platform comparison puts Azure SQL Database and Azure SQL Managed Instance side-by-side, the list can help you understand what’s different between PaaS offerings and the on-premises product. Many of the limitations make sense, but you still need to understand them if you’re moving an existing application (even with migration tools to help).

Whether or not it’s a cloud offering that makes sense for you, Azure SQL Database is often the staging ground for new features that will eventually make it to all flavors of SQL Server. For example, it was where I could first test GENERATE_SERIES, GREATEST and LEAST, and changes to STRING_SPLIT. Following that trend, earlier this year, Microsoft announced this magical-sounding feature, “optimized locking.”

What is it?

In a sentence: Instead of locking individual rows and pages for the life of the transaction, a single lock is held at the transaction level, and row and page locks are taken and released as needed.

This is made possible by previous investments in Accelerated Database Recovery and its persistent version store. A modification can evaluate the predicate against the latest committed version, bypassing the need for a lock until it is ready to update (this is called lock after qualification, or LAQ). There’s a lot more to it than that, and I’m not going to dive deep today, but the result is simple: long-running transactions will lead to fewer lock escalations and will spend far less time standing in the way of the rest of your workload. Locks held for shorter periods of time will naturally help reduce blocking, update conflicts, and deadlocks. And with fewer locks being held at any given time, concurrency improves and overall lock memory goes down.

Prerequisites

The prerequisites are pretty easy:

  • You need to be in Azure SQL Database.
  • Accelerated Database Recovery needs to be enabled; this is on by default.
  • Read Committed Snapshot Isolation (RCSI) is not required, but it’s recommended, and also on by default. While you can turn it off, you shouldn’t.

Other than that, you’ll just get this benefit by default. To verify that your Azure SQL Database supports optimized locking, you should see 1 in all three columns of the output for this query:

Configuration validation
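The query itself appears in the screenshot above; a rough equivalent, assuming the IsOptimizedLockingOn property and the usual sys.databases columns for the other two settings, is:

    -- Verify optimized locking and its two supporting features for the current database
    SELECT OptimizedLocking    = DATABASEPROPERTYEX(DB_NAME(), 'IsOptimizedLockingOn'),
           AcceleratedRecovery = d.is_accelerated_database_recovery_on,
           RCSI                = d.is_read_committed_snapshot_on
      FROM sys.databases AS d
     WHERE d.name = DB_NAME();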

As others have noted, to turn the behavior off, you need to submit a support request. But why would you want to do that? Well…

A behavior change

Before demonstrating the benefits, I wanted to highlight a potentially unexpected but important change in behavior that you’ll want to take into account. From the documentation:

Even without LAQ, applications should not assume that SQL Server (under versioning isolation levels) will guarantee strict ordering, without using locking hints. Our general recommendation for customers on concurrent systems under RCSI with workloads that rely on strict execution order of transactions (as shown in the previous exercise), is to use stricter isolation levels.

Put another way, locks on individual rows that would normally be held until the end of the transaction might now be released a lot earlier. Which means other transactions that started later may succeed in changing the row, without being blocked, but the first transaction that eventually commits will “win.” Until now, it’s always been the case that last writer wins, and now it’s possible for an earlier writer to have priority.

It’s also important to note that this feature will work with read committed but not with repeatable read or serializable:

When using stricter isolation levels like repeatable read or serializable, the Database Engine is forced to hold row and page locks until the end of the transaction.

Translation: if you want to maintain the older behavior for any reason (including application compatibility), you don’t necessarily need to submit a support ticket to disable optimized locking; you can just run all of your transactions under serializable.

That was a joke.

An example

I created this database in regular old on-prem SQL Server 2022, making it as close as possible to Azure SQL Database defaults:
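Something along these lines, with a placeholder database name, covers the relevant settings:

    -- Placeholder database name; the point is matching the Azure SQL Database
    -- defaults for versioning and Accelerated Database Recovery
    CREATE DATABASE OptimizedLockingDemo;
    GO
    ALTER DATABASE OptimizedLockingDemo SET READ_COMMITTED_SNAPSHOT ON;
    ALTER DATABASE OptimizedLockingDemo SET ALLOW_SNAPSHOT_ISOLATION ON;
    ALTER DATABASE OptimizedLockingDemo SET ACCELERATED_DATABASE_RECOVERY = ON;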

And created a table with a million relatively wide rows:
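Something like this, give or take the exact shape (the table name and filler column are placeholders):

    -- Placeholder table; a fixed-width filler column pads each row out to roughly 1 KB
    CREATE TABLE dbo.LockingDemo
    (
        id       int          NOT NULL PRIMARY KEY,
        filler   char(1000)   NOT NULL DEFAULT ('x'),
        modified datetime2(0) NOT NULL DEFAULT (SYSDATETIME())
    );

    INSERT dbo.LockingDemo (id)
      SELECT value FROM GENERATE_SERIES(1, 1000000);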

New databases may still get created at compatibility level 150; if so, you will need to bump it:
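Using the placeholder name from above:

    ALTER DATABASE OptimizedLockingDemo SET COMPATIBILITY_LEVEL = 160;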

If your on-premises database is on a version earlier than SQL Server 2022, the function won’t be an option anyway, and you’ll need to use another way to generate a million rows. I demonstrate a few alternatives in the aforementioned post on GENERATE_SERIES.
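For example, one common approach is to cross join a couple of catalog views and number the rows:

    -- Pre-2022 alternative: manufacture a sequence with ROW_NUMBER() over a cross join
    ;WITH seq AS
    (
        SELECT TOP (1000000)
               value = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
          FROM sys.all_objects AS o1
          CROSS JOIN sys.all_objects AS o2
    )
    INSERT dbo.LockingDemo (id)
      SELECT value FROM seq;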

Then I set up two sessions to update 500 rows at a time in a loop; maybe they will collide on some sets of rows, maybe they won’t. But they probably will. Making note of @@SPID in each case, I set them both to start at the same time:
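Roughly, each session ran something like this (the start time and iteration count here are placeholders):

    -- Run one copy in each of two sessions, noting the SPID from each
    SELECT CurrentSPID = @@SPID;

    -- Both sessions kick off at the same (placeholder) time
    WAITFOR TIME '14:30:00';

    DECLARE @i int = 1, @start int;

    WHILE @i <= 200   -- placeholder iteration count
    BEGIN
        -- Pick a random starting id, then update a contiguous batch of 500 rows
        SET @start = (ABS(CHECKSUM(NEWID())) % 999500) + 1;

        BEGIN TRANSACTION;

        UPDATE dbo.LockingDemo
           SET modified = SYSDATETIME()
         WHERE id >= @start
           AND id <  @start + 500;

        COMMIT TRANSACTION;

        SET @i += 1;
    END;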

While those were running, I polled for locking and waits experienced by either session. First by creating a few temp tables:
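Roughly like these (the SPID values are placeholders; substitute the ones noted above):

    -- Placeholder SPIDs from the two update sessions
    CREATE TABLE #spids (session_id int PRIMARY KEY);
    INSERT #spids (session_id) VALUES (61), (64);

    -- Snapshots of locks held by either session
    CREATE TABLE #locks
    (
        capture_time  datetime2(3) NOT NULL DEFAULT (SYSDATETIME()),
        session_id    int          NOT NULL,
        resource_type nvarchar(60) NOT NULL,
        request_mode  nvarchar(60) NOT NULL,
        lock_count    int          NOT NULL
    );

    -- Snapshots of cumulative waits for either session
    CREATE TABLE #waits
    (
        capture_time  datetime2(3) NOT NULL DEFAULT (SYSDATETIME()),
        session_id    int          NOT NULL,
        wait_type     nvarchar(60) NOT NULL,
        wait_time_ms  bigint       NOT NULL,
        waiting_tasks bigint       NOT NULL
    );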

Then running this loop referencing a #temp table with the SPIDs, which made it easier to run multiple tests starting from zero waits:
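Something along these lines (the poll interval is arbitrary; cancel the loop once both update sessions finish):

    WHILE 1 = 1
    BEGIN
        -- Current locks held or requested by either session, grouped by type and mode
        INSERT #locks (session_id, resource_type, request_mode, lock_count)
        SELECT l.request_session_id, l.resource_type, l.request_mode, COUNT(*)
          FROM sys.dm_tran_locks AS l
         WHERE l.request_session_id IN (SELECT session_id FROM #spids)
         GROUP BY l.request_session_id, l.resource_type, l.request_mode;

        -- Cumulative wait stats for either session
        INSERT #waits (session_id, wait_type, wait_time_ms, waiting_tasks)
        SELECT w.session_id, w.wait_type, w.wait_time_ms, w.waiting_tasks_count
          FROM sys.dm_exec_session_wait_stats AS w
         WHERE w.session_id IN (SELECT session_id FROM #spids);

        WAITFOR DELAY '00:00:01';
    END;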

I didn’t bother timing the duration of the entire process, since so many other performance-influencing factors are different between an Azure SQL Database and an on-premises deployment. All four scripts took less than a minute, but how much less isn’t relevant. There is other data I collected that may be worth looking into if you follow my lead and try this out for yourself. Much more interesting (and perhaps predictable, given the intent of the feature) were the number of locks, the amount of lock memory, and the wait types. I used the following queries to summarize what had been captured:
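Roughly, peak lock counts per session by type and mode, and the final cumulative waits per session by wait type (the lock memory number would come from somewhere else, such as the Lock Memory (KB) performance counter, and isn’t sketched here):

    -- Peak lock counts per session, by resource type and lock mode
    SELECT l.session_id, l.resource_type, l.request_mode,
           peak_lock_count = MAX(l.lock_count)
      FROM #locks AS l
     GROUP BY l.session_id, l.resource_type, l.request_mode
     ORDER BY l.session_id, peak_lock_count DESC;

    -- Wait stats are cumulative per session, so the MAX across snapshots is the total
    SELECT w.session_id, w.wait_type,
           total_wait_ms = MAX(w.wait_time_ms),
           waiting_tasks = MAX(w.waiting_tasks)
      FROM #waits AS w
     GROUP BY w.session_id, w.wait_type
     ORDER BY w.session_id, total_wait_ms DESC;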

The results

In my local instance, the results were:

In Azure SQL Database, the results were:

Let’s put these together and make it pretty (ignoring the log throttling):

Rather impressive results

Now, that’s not totally fair, as some of the locks were counted multiple times if they spanned polling intervals. It’s also possible, though, that some locks were taken and released without ever being observed. We could average instead, or de-dupe, or use Extended Events to track locks (but that would potentially impact the workload). I don’t think a different analysis method would really change the outcome.

In Azure, there are obviously some different performance constraints in play, particularly that writes to the transaction log are throttled pretty badly at lower tiers, never mind vastly different underlying hardware. So it’s not a fair apples-to-apples comparison, and there’s no way to test like-for-like: locally, you’d need a version that supports the feature, which doesn’t exist; in Azure, you’d need to downgrade to earlier behavior through support.

You could potentially spend more time than I did trying to find the “batch size sweet spot” to escape the clutches of LOG_RATE_GOVERNOR, but my limited testing has already told me enough without that equivalence.

You like it so far?

This change is, in a word, fantastic, and I am looking forward to it creeping its way into the full product. Though I hope it is not on by default, or tied solely to compatibility level, as I think there are still plenty of questions to answer here about the underlying behavior and how applications might need to adjust. It should also be a more discoverable option than one buried in DATABASEPROPERTYEX().