Pin the blame on the query

Last week, the online version of Visual Studio had a significant outage (about five hours). In this apology post, Microsoft's Brian Harry puts a significant portion of the blame on the SQL Server team, saying:

Clearly the SQL server team has some work to do to further tune the new cardinality estimator and they are all over that.

(And he goes into more specific detail in A bit more on the Feb 3 and 4 incidents.)

Yes, a new cardinality estimator is generally going to improve a lot of queries, but there are inevitable regressions, too. This query, in particular, is not exactly "switch the cardinality estimator from underneath me"-friendly:

SELECT TOP 1
   @collidingAccountName = ids.AccountName,
   @preExistingDomain = i.Domain,
   @preExistingTfidGuid = i.Id
FROM #identities ids
INNER LOOP JOIN tbl_Identity i
WITH (INDEX=IX_tbl_Identity_AccountName_TypeId, FORCESEEK, UPDLOCK, HOLDLOCK)
ON i.PartitionId = @partitionId
   AND i.AccountName = ids.AccountName
   AND i.Domain = ids.Domain
   AND i.Sid <> ids.Sid
WHERE i.PartitionId = @partitionId AND ids.TypeId in (5, 6) AND i.TypeId in (5, 6) 
OPTION (OPTIMIZE FOR (@partitionId UNKNOWN))

Let's count the things in this single query that absolutely must be regression tested after an upgrade, service pack, cumulative update, or compatibility level change:

TOP without ORDER BY (against a #temp table, no less), if you're expecting predictable sorting
LOOP JOIN hint
explicit INDEX hint
FORCESEEK hint
UPDLOCK and HOLDLOCK hints
OPTIMIZE FOR UNKNOWN

These are all things that may have been necessary under the old estimator, but are likely just tying the optimizer's hands under the new one. This is a query that could have, and should have, been tested in their dev / staging / QA environments under the new cardinality estimator long before they flipped the switch in production, and probably could have gone through series of tests where different combinations of those hints and options could have been removed. This is something for which that team can only blame themselves. Michael Campbell tweeted:

Every time I use a hint I think of ways NOT to – simply because you never know of long-term impacts like this case.

Hear, hear. Query hints should be used as a last resort (you'll hear this repeatedly from experts like Paul White). They used six hints in this single query, if you include OPTIMIZE FOR.

And for mitigation to have taken 30 engineers a total of five hours is also something that could have, and should have, been handled better – especially since their fix was to add a SEVENTH hint. Since they knew the problem happened as a result of switching to the new cardinality estimator, and they identified the problematic query immediately, they could have changed the query to use the old estimator using QUERYTRACEON, which I suspect would have been quicker than whatever train of thought went into adding the MAX_MEMORY_GRANT hint that "solved" the issue. (A joke about how many VSTS engineers it takes to change compatibility level is surely in the works.) And I suspect this query will stay like this, with all those volatile items above in place, until the next cardinality estimator bump.

I hope they have learned from this, and I do applaud their transparency. But several have suggested that the team take a forced break to read Joe Sack's white paper, "Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator." I don't think that's such a bad idea.

And I do sincerely hope they fix those object names – why a table needs a tbl prefix is beyond me.

By: Aaron Bertrand

I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, write at Simple Talk, SQLPerformance, and MSSQLTips, and have had the honor of speaking at more conferences than I can remember. In non-tech life, I am a husband, a father of two, a huge hockey and football fan, and my pronouns are he/him.