An in-depth analysis of SQL Sentry's overhead
One of my first major projects at SQL Sentry was framed shortly after I joined the company: to publish an analysis of the overhead that Performance Advisor and Event Manager place on both the server(s) being monitored and on the server(s) doing the monitoring. Every vendor wants to sell you on the "zero impact" and "no footprint" lines, but we all know that it is not possible for you to accurately measure performance on a server without also causing at least some performance degradation by virtue of performing said monitoring. So what does "zero impact" really mean in those cases? Most vendors won't tell you, and I can understand why: there are so many variables and "what ifs" involved that they couldn't possibly tell you – at least not with a straight face.
As I would tend to advise you even from an impartial, outside perspective (regardless of the solution being researched), the only way you can get the absolute truth about impact is to test it for yourself, in your own environment, using your hardware, data, network, usage patterns, etc. On the other hand, we realize that you may not have an environment yet, or you may not have a suitable replica of production (we all know how most folks feel about testing in production). We know this because, even with trial versions available and frequently downloaded, two of the most common questions we get about our software are "how much overhead will this cause on the monitored instance?" and "what kind of server(s) will we need for the monitoring components?"
To help answer these questions without all of the effort that might be required on your part, the paper sets out to show, in our simulated environment, roughly how much additional load is placed on a single monitored server, and the total load placed on the monitoring servers when they are watching 1, 10, 25, 50 and 100 SQL Server instances. We tried to make the test environment as generic as possible, with separate servers for each monitoring component (to isolate and measure the resource impact of each), and a virtual environment of over 100 servers (we went virtual here to ease creation and re-creation of the environment, and also for the obvious budgetary benefits). Now, your environment may be quite different, so the numbers shown in the paper may not necessarily reflect your exact scenario. But they should be close enough to give you a general idea of what to expect from our software.
This is the first of many impact studies we will be conducting and making public. We are not afraid to show you our numbers, even in cases where they might not be the most flattering, because we can often learn as much from this process as you can. We are quite happy with the results of these tests, but we look forward to further tests where we measure the impact on a single monitored server with more variables and subject to heavier and more realistic workloads.
Download the paper (.doc, 2.1 MB)
Excellent post, Aaron. This is useful for any enterprise DBA who wants to know what he gets out of the box with the native Microsoft tools.
Hundreds of connections from a single machine are certainly difficult. The loads generated by PowerShell, though, were run across all 100 machines using remote jobs, as described in this post:
/blogs/aaron_bertrand/archive/2011/02/08/t-sql-tuesday-15-running-t-sql-workloads-remotely-on-multiple-servers.aspx
I would run 4 threads/jobs on each machine, for up to 100 machines, and the majority of the work would be performed on each individual machine. So other than the synchronization of the jobs, we could simulate 400 concurrent users quite easily without exhausting the resources on a single machine. We found several side effects when we went much higher than 4 threads, especially since each VM only had a single virtual processor.
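Roughly, the pattern looked something like the sketch below. This is only illustrative of the fan-out approach, not the exact harness from the linked post; the server names, the workload script path, and the use of Invoke-Sqlcmd (which assumes the SQLPS/SqlServer module is present on the targets) are all placeholders.

    # Illustrative sketch only: fan a T-SQL workload script out to many
    # target instances, 4 concurrent jobs per instance, via PS remoting.
    $targets    = 1..100 | ForEach-Object { "VMSQL{0:D3}" -f $_ }   # hypothetical VM names
    $jobsPerBox = 4

    $jobs = foreach ($server in $targets) {
        1..$jobsPerBox | ForEach-Object {
            # Each remote job runs the workload locally on the target,
            # so only job synchronization crosses the network.
            Invoke-Command -ComputerName $server -AsJob -ScriptBlock {
                Invoke-Sqlcmd -ServerInstance $env:COMPUTERNAME `
                              -InputFile 'C:\workload\workload.sql'   # hypothetical script path
            }
        }
    }

    # Wait for all ~400 simulated users to finish, then collect and clean up.
    $jobs | Wait-Job | Receive-Job
    $jobs | Remove-Job

Because the work itself runs on each target, the controlling machine only has to track the job objects, which is why 400 simulated users was easy to sustain from one coordinator.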
I'll repeat just for clarity that we were not trying to simulate a high number of concurrent queries or even come close to what you might do in a TPC-type test. We were merely trying to generate enough load to produce observable activity, with a rate of top SQL "hits" that matched what we observed in real production systems, while at the same time not burdening the targeted VMs in such a way that their performance variance would make the test results suspect.
Thanks, Aaron!
Again, nice work. I'm absolutely not questioning the validity of the simulated prod workloads at all. I'm just curious about whether you were running 'highly' concurrent workloads and how that may be generated via PowerShell scripts. The reason I'm curious is that, as I understand it, PowerShell does not do multithreading, so it's kind of difficult to manage concurrency in PowerShell scripts (that's my understanding anyway; not being a PowerShell expert by any stretch, I could be totally wrong). But then for your purposes, you may not need the simulated prod workloads to be highly concurrent. By highly concurrent, I mean submitting small queries over hundreds or even thousands of connections.
Linchi, I would certainly categorize the simulated workloads as "small queries" – SELECTs against single-row tables, using multiple self-joins constructed dynamically based on the input parameters, but the SQL text itself for any single statement would not exceed ~2K. In a future study we'll investigate the impact of more expensive queries (both in terms of resources and in terms of text size) against a single target, which we couldn't feasibly do in this study if we wanted a consistent workload against 1 or 100 targets.
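To give a rough idea of the shape (this is not the actual workload script; the table name, join count parameter, and target instance below are made up for illustration, and running it assumes the SqlServer module for Invoke-Sqlcmd), the generated statements looked something along these lines:

    # Illustrative only: build a small SELECT with N self-joins against a
    # single-row table, similar in shape (not in detail) to the workload queries.
    param([int]$JoinCount = 5)                     # hypothetical input parameter

    $table = 'dbo.ConfigRow'                       # hypothetical single-row table
    $query = "SELECT t0.*`nFROM $table AS t0"
    for ($i = 1; $i -le $JoinCount; $i++) {
        $query += "`nINNER JOIN $table AS t$i ON t$i.ID = t0.ID"
    }

    # For small join counts the generated text stays well under ~2K characters.
    Invoke-Sqlcmd -ServerInstance 'VMSQL001' -Query $query   # hypothetical target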
Interesting read Aaron. Thanks to you and SQL Sentry for being open and honest enough to invest the time in this and put it out there for the public to read.
Nice work! A quick question about the nature of the simulated prod workloads. Since you are using SQL script files to generate them, would you characterize the simulated workloads as more 'large query' in nature than typical OLTP workloads, which consist primarily of small but highly concurrent SQL statements?