About two months ago, I began planning a cluster migration for one of our primary SQL Server 2005 clusters to newer hardware, and a simultaneous upgrade to SQL Server 2008 SP1. The old system runs Windows Server 2003 R2 Enterprise, 32-bit, with 4 dual-core CPUs and 16GB of RAM. The new system runs Windows Server 2008 R2 Enterprise, 64-bit, with 4 quad-core CPUs and 32GB of RAM. We were also making substantial upgrades to our disk structure (more on this below).
There were a lot of little details to iron out, including validating that the upgrade would work, testing copy speeds, and creating a checklist of a barrage of connection string updates that would need to be synchronized (too bad Linchi hadn't written his recent bad practices post 6 months ago, as I probably would have spent more time focusing on that aspect of our environment).
In part to minimize overall downtime, in part to ensure there would be a fallback plan, and in part because I wanted to utilize this new server hardware, I opted to perform a side-by-side upgrade. There were several aspects of the move that would introduce new variables to the server's setup, and unfortunately due to maintenance windows and looming vacations, we didn't have the luxury of making a hardware upgrade independent of upgrading SQL Server itself.
It sounds like a risky endeavor, but it actually went very smoothly. Here are the general steps I took:
- Upgrade Advisor
This pointed out a few things that needed to be updated. Well, actually, that didn't need to be updated. We had an outdated stored procedure that was no longer being used but had BACKUP LOG WITH TRUNCATE_ONLY to shrink log files periodically for less important databases. And the advisor sent me on a few wild goose chases; for example, it complained that "ORDER BY aliases cannot use prefixes" for a few larger stored procedures, though combing through them quite thoroughly, one by one, I could not find an offending statement, and have not observed any kind of failure yet (knock on wood). Of course we had been running our QA/test environment on SQL Server 2008 for several months, so I wasn't really expecting any surprises here. But I still think it is a good idea to run the Upgrade Advisor, just in case.
- Preparing the disks
The underlying disk subsystem is an EMC
^2CLARiiON, and the LUNs in use by the system were initially set up by an IT person in a bubble about 5 years ago. All of the LUNs were set up with 10K drives, RAID 5, without proper partition alignment, and with either 4K or 8K allocation unit sizes. (I know, you can almost smell the I/O overhead, right?) For the new system, we added shelves that had 15K drives, RAID 10, partition aligned, and with 64K allocation unit sizes. The difference in I/O overhead has been huge (again, more on this below). Most importantly, on the old system, we only had three LUNs of any substance; on the new system, there are seven. This has allowed us to lay out our data and log files more appropriately and to get tempdb data files onto dedicated spindles.
- Inventory of connection strings
We had to go through all of our web applications, as well as 3rd party / vendor applications and one-off console applications, and identify all of the locations where we would have to point code at the new cluster. Since we knew exactly when the migration would start and that these apps would not run again until we turned them on, in cases where it made sense (e.g. classic ASP and VBScript code, where connection strings were not provided by config files), it was very trivial to add IF/ELSE and/or CASE logic to check the current timestamp and connect to the new or old cluster accordingly. This significantly reduced the number of lines of code and/or config files that had to be touched on the day of the migration, where we would certainly be under more pressure. As I mentioned earlier, given more time, it probably would have made sense to use this opportunity to take Linchi's approach to connection management (though in some cases we don't want to rely on DNS, and having to change the hosts file on each machine doesn't really gain us much over changing a web.config file). But at the very least, we were able to document all of these locations so we could revisit our connection strategy in the future.
- Preparing the install
Slipstreaming is the way to go in SQL Server 2008; Kendra Little has a great blog post about it. Thanks to Peter Saddow and the rest of the setup team; this has absolutely been the best enhancement to SQL Server's setup module. Ever. You wouldn't think that saving the hassle of installing a service pack and then a cumulative update would be that big of a deal, but when dealing with a cluster install, the fewer variables and restarts, the better.
- Setting up jobs
One of the nice side effects of preparing for a move like this is that you get to do a little housekeeping. Over the years, several SQL Server Agent jobs in our system had become multi-purpose: "Let's add a step to this job, instead of creating a new job." The intention was to reduce clutter in the Job Activity Monitor and in 3rd party tools like SQL Sentry Event Manager. In some cases, you do want jobs to have multiple steps, so that you can prevent different tasks from running over one another. But quite often the reduction in clutter actually makes these tasks harder to manage and, in some cases, to even find. This migration gave me an
excuseopportunity to reorganize our jobs, create additional categories for easier management, and to clean up some obsolete stuff.
- Preparing to move the databases
With a side-by-side upgrade, you can choose to either (a) replicate/mirror then switch, (b) detach/attach, or (c) backup/restore. Well there are probably other ways we could have considered, but those were the primary three options we had on the table. Since we had a maintenance window where we could afford downtime, and because there were literally hundreds of databases to transfer, I chose the detach/attach route; I couldn't possibly expect to synchronize that much replication (and it's far too demanding for mirroring), and I didn't want to wait for the window to start to kick off backups, and have to wait for them to finish. So well in advance of the actual migration day, I built a script that would detach all of the databases, and then a script for each destination LUN that would move the detached database and log files to those new drive letters on the new server. These scripts would be kicked off independently, to avoid the copies from being serial, but only spread out enough to avoid saturating the controllers and getting in each others' way.
With all that preparation in place, the day of the actual migration was very smooth. SQL Server was already set up on the new hardware, and the new disks were in place; all we really had to do was kick off the detach / copy scripts, then hurry up and wait. A few of the attach operations took a little longer than I anticipated, but only by a little and well within the fudge factor I had applied. We used the full maintenance window to perform unit testing on all of the different applications and modules, and to clean up the minor but inevitable things that we missed (the biggest problem was a database that was left out of the manually tweaked attach scripts, and that was very easy to solve). All told, we had shut down the system, moved the entire application and nearly 2TB of data, tested all of the apps, and turned everything back on, all in just under 8 hours.
So what did we get for all of our troubles?
In a word : plenty!
The first and most obvious improvement was in our regular backups. Previously, we were using RedGate's SQL Backup, which is a fantastic product if you're on SQL Server 2005. But with 2008's native backup compression, we are getting better sizing, faster. For example, for a 300GB database, RedGate had been compressing it to 77GB, and natively it is being compressed to 48GB. And our backup jobs are completing about four times faster: our nightly full backups used to start at 7 PM and finish around 11:45 PM; now they are finished by about 8:15 PM (actually tonight they were finished by 8:24 PM). (Edit: I want to emphasize that the comparison here is about RedGate on 2005 and old, crappy hardware against native on 2008 and new, shiny hardware – no true apples to apples comparison was performed.) Restores (which we test nightly, and also use for scheduled historic reporting queries) are about four times faster as well, though that is on another server that did NOT receive hardware upgrades (still on x86, 10K RAID5, etc.), so that is all about compression alone.
My next observation was improved I/O. Due to all of the differences in our hardware, as well as introducing page compression to the largest tables in our busiest database, our I/O numbers are orders of magnitudes better. And this has led to less blocking, since we now have far lower waits on I/O (all of our blocking on the old system was due to PAGELATCH_IO_SH). In our busiest database, we would often see I/O waits exceeding 5 minutes (I eventually became desensitized to these alert conditions because the blocking, while annoying, was mostly affecting background, non-user processes). Now we see virtually no I/O blocking at all, and PAGELATCH waits have disappeared from our top waits list.
And finally, we are simply observing far better overall system stability. In addition to the improved I/O, CPU usage is much lower, cache hit ratios are better, and the previously sluggish supporting applications have returned to the snappiness they enjoyed in their infancy. Just to give an idea of how similar workloads affect the system, I'll leave you with a couple of screen shots from SQL Sentry Performance Advisor. In both cases, the top image is before the migration, and the bottom image is after the migration.
CPU, before and after (click to embiggen):
Memory & Cache Hit Ratios, before and after (click to embiggen):
I obviously still have some page life expectancy issues to deal with, involving ad hoc plan cache (unfortunately the nature of the system lends itself to a LOT of ad hoc queries). I also plan to add more data compression, add filtered indexes, and implement partitioning on this system. And I hope that I will be able to present a strong enough case to convince those above me to gradually migrate all of our systems to x64, 15K RAID10, and SQL Server 2008. The sooner the better!