BackToBasics: The "Runaway" Query

Continuing with the 2016 challenge from Tim Ford (T) that I accepted at the end of last year, this month I wanted to talk about the runaway query - or, more specifically, what you can do before you call your query a runaway query.

Make Sure It Is A Runaway

Sometimes people will start a query, wait five seconds, and then declare that this must be a runaway query. Sometimes a query's runtime can vary based on other things going on in the system, so four seconds on one run and six seconds on another is not necessarily "running forever." Be sure that you've given a query adequate time to start returning results before giving up on it and remember that Management Studio might seem "stuck" before it starts to render any grid results, especially if the result set is large. If you feel you've waited a reasonable amount of time, and you've tried both Results to Grid and Results to Text, then...

Check For Simple Things

Once I have determined that I'm not just being impatient, the first thing I do is check the query logic for obvious issues that I am prone to miss - especially if the query text has been changed since the last time it ran well. A few of the main things I check for are:

  • An infinite loop - it's very easy to forget the SET @i += 1; in a counter-based loop, or the FETCH from a cursor.
  • A ginormous cartesian product - sometimes you shoot yourself in the foot by adding a table to a join and forgetting some of the join criteria (or all of them, if you like living in the 1970s).
  • Missing or less restrictive filter - it's possible you highlighted most of the query, but left off an important WHERE clause, and now the query returns far more rows than you expected.
  • Uncommitted transactions - maybe you highlighted BEGIN TRANSACTION and some other statement in one window or opened the table designer or Edit Top n Rows and are now trying to affect that same table in another window. The query will look like it's taking forever, but really, it's just blocked by your initial transaction that you forgot to commit or roll back.
  • A statistics update - a lot of events on an instance can trigger a stats update on a table or index, and that won't happen until the next time a query runs that needs those statistics. That query could be yours, and the delay could simply be due to an auto-stats update (if your settings are not ASYNC). In SQL Sentry Plan Explorer, for example, we capture this for you by showing the [StatMan] call as part of the full query call stack:

    statman-1-blog

    (You can monitor for these events in other ways, of course. See Erin Stellato's two posts on tracking these updates, Tracking Automatic Updates to Statistics and Another Way to View Automatic Updates to Statistics.)

Blocking And Other Waits

Blocking chain

If it isn't the query logic, next thing I check is sys.dm_exec_requests, which can quickly tell me if the query is being blocked (look at blocking_session_id) and, if not, what it's waiting on (wait_type and last_wait_type). Another thing I take into account is how many rows are being returned to the client, and the possibility that the delay is actually due to client rendering or network transfer - these are often indicated by a constant wait type of ASYNC_NETWORK_IO. Plan Explorer can help here as well, since we intentionally discard results instead of pulling them across the wire, making sure that you're not measuring client rendering or data transfer. We now also allow you to see data movement at the operator level, using Live Query Profile, so you can see exactly where your query is spending its effort in real time - as long as you're running your query against SQL Server 2014 or better.

I can also look at status to see if it is runnable, running, suspended, or sleeping, and double-check that command is reflecting the statement that I *think* should currently be executing. SQL Sentry software can also help identify blocking proactively and help visualize even the most complex blocking chains (see an example at right). For more insight, see Understanding and Resolving SQL Server Blocking with SQL Sentry. Confession: I stole this image from that post.

Sometimes you will find that you are being blocked by another session, and that session doesn't have an entry in sys.dm_exec_requests. This doesn't mean you are being blocked by nothing or a ghost session; in a lot of cases, it just means that they have locks on your objects but aren't actively running anything. This is easy to reproduce; just run BEGIN TRANSACTION; UPDATE dbo.table SET column += 0;, then go to lunch. You can try to get that session's last command (note that it may not be related to the current locks) and which objects they are locking:

DECLARE @SPID INT = < blocking_session_id from your runaway query >;

-- see the command they last run - might be related, might not
DBCC INPUTBUFFER (@SPID);

-- see what that session has locked
SELECT s.name
     , o.name
FROM sys.dm_tran_session_transactions AS tst
INNER JOIN sys.dm_tran_locks AS tl
     ON tst.transaction_id = tl.request_owner_id
INNER JOIN sys.objects AS o
     ON tl.resource_associated_entity_id = o.[object_id]
INNER JOIN sys.schemas AS s
     ON o.[schema_id] = s.[schema_id]
WHERE tst.session_id = @SPID;

Uncommitted Transactions

I mentioned uncommitted transactions above, but one way to be sure is to stop the query and try to close the window. If you are dealing with a transaction in this session, you will get:

There are uncommitted transactions. Do you wish to commit these transactions before closing the window?

Check other windows, and also check to make sure you haven't changed Management Studio's default for SET IMPLICIT_TRANSACTIONS from no to yes:

bb_runaway_implicit-1-blog

Is it the Same Plan? Is It Optimal?

If you've ruled out all of the above, then maybe it's time to start looking into the mechanics of the query itself and, more importantly, its execution plan.

Before you do that, though, you need to make sure that you're actually comparing apples to apples - when the query was running fast (and I assume it is the exact same query), were you getting the same execution plan, or at least the same plan shape? This is where it pays off to be keeping historical performance metrics, like you can with SQL Sentry Performance Advisor for all supported versions of SQL Server, or the Query Store in SQL Server 2016 and Azure SQL Database. Being able to compare query plans that were captured yesterday, last week, or last month can be invaluable in determining the actual reasons behind performance changes in the plan you're looking at today.

If the plan has changed, it could be for a variety of reasons. I go over many of these in two earlier blog posts:

Those posts explain a variety of things that can cause performance differences between different versions of the query, or different execution contexts, or different environments. There are a whole host of other things that could have changed and caused a query to suddenly become slow, even the exact same query, executed by the same user, on the same server. Some off-the-cuff ideas:

  • Someone applied a cumulative update, service pack, version upgrade, or compatibility level change (if an upgrade or compatibility level change, test under the old cardinality estimator using TF 2312)
  • Data for your query was evicted from the buffer pool due to concurrent activity
  • All data was evicted from the buffer pool, either directly using DBCC DROPCLEANBUFFERS, or as the result of a server configuration change, service restart, or server reboot
  • Your plan is recompiling - this can be caused by a variety of things, including schema changes, index changes, or explicit DBCC FREEPROCCACHE or similar
  • Substantial statistics updates (which can also be caused by a variety of things, such as major data changes)
  • Your resources have been constricted due to concurrency, a server or database configuration change, or Resource Governor

There are many more, of course, those are just a few low-hanging fruits.

What you do next will vary. Sometimes the fix is as simple as updating statistics or creating or changing an index. Sometimes you need to re-work the query itself - fix the logic, remove joins, add filters, or eliminate scalar functions. Sometimes you may need to evict a plan that was generated from an atypical set of parameters. And sometimes your query is better off running under a specific trace flag, with MAXDOP, with a specific index or join hint, or with OPTION (RECOMPILE) to thwart parameter sniffing. How you actually *fix* a slow query is a very large topic and probably better scoped to a different blog series.

Conclusion

Not every runaway query is really running away. There are a few things you can check first before tearing your hair out trying to fix the statement, eliminate execution plan operators, muck with indexes, or try the old reboot trick that magically fixes everything. Those things may ultimately be necessary, but in my experience, it's often one of the other things I mention above.