Your laptop may be ready for SSDs, but are your SQL Servers?
Brent Ozar (blog | twitter) recently made some comments about the FusionIO SSD drives. Basically, he was able to break three drives in a row – simply by doing load testing against them (using SQLIO). The symptom is simple: the drives go offline and disappear from the O/S, and need to be physically pulled from the machines. Kind of scary.
I am not trying to be Debbie Downer here … SSDs sound great, and my next MacBook will definitely come with one. I am sure they are not far from production excellence, and the I/O performance they offer — especially in shops like ours, where we are I/O-bound — will be world-shattering. But right now, if you are looking at expanding or upgrading your I/O under SQL Server, I'd give the vendors some time to shake off these early jitters. Hopefully they will come back with a quick fix or even an explanation that there was a misconfiguration of some kind … since I know others are already using these drives in production, and haven't come across similar issues.
Greg Linwood has asked that I publicly clarify that I wasn't thinking about the MVP newsgroups. Microsoft maintains a private newsgroup server where MVPs can communicate, and because it's private, I don't usually think about it. A few minutes after my comment here, I realized that he meant those newsgroups, so I went in to check the messages there. We continued the discussion in the newsgroup.
Linchi – yeah, I didn't have to sign an NDA or anything like that. We set the whole thing up pretty casually over email.
I have been watching this thread with great interest. Among many things, I must say that I was hugely surprised that FusionIO did not cover Brent's tests with an NDA. I used to do a lot of tests on various vendor products and often ran into interesting results. But in almost all cases, I could not talk about them publicly even though I was dying to share the results.
Greg – what newsgroup are you talking about?
Brent, I'm not referring to this blog – I questioned your post to the private NG, to which you still haven't responded.
Greg – I don't keep coming back to this blog post to look for questions. I didn't even know there were comments about what was going on. Aaron's blog doesn't have a "subscribe to comments" feature, so I have no way of knowing that people are asking me questions here unless I come visit.
If you have questions for someone, leaving comments for them on somebody else's blog is not the way to get them answered. You will be disappointed with the lag time in results, just as you've been here.
As Steven pointed out, I was taking a wait-and-see attitude. I waited days for FusionIO to get back to me, and they haven't been able to clarify the problem for me. I've got a blog post coming out tomorrow to summarize my findings.
Steve, I'm not trying to win a popularity contest here, so you might do well to look elsewhere for social media tips.
You said I "insisted that there are no problems" but I actually described a device fault that we encountered. Maybe you didn't read the whole thread?
Brent & Aaron's posts go a lot further than saying that problems are "possible" (of course they are). The blog post actually questions whether FusionIO & SSDs are ready for SQL Server generally (refer to the title of this blog post) and I feel it is reasonable to respond to that question. Others have also agreed with me on this in the thread.
Anyway, the information posted by FusionIO rep here seems to confirm that the devices didn't fail after all.
Not a SQL/DB admin, and no association with Fusion-IO whatsoever. I was directed to this page as an example of what to do and what not to do when using social media.
From reading this entire comment thread, I'd have to say that Greg Linwood's comments are doing the most to damage credibility, because of his insistence that there are no problems.
Here's how I read it.
Brent posts that he had some problems with the drives. Aaron indicates that this is kind of scary, and bears watching. Greg cries "scare-mongering" and "half-baked tests" without even knowing what the exact tests were, and cites only anecdotal evidence that there are no problems because he hasn't seen "a single failure in a production environment".
Seems to me that Brent and Aaron were taking a wait-and-see attitude, but noting that there were problems, while Greg is denying problems altogether. That damages credibility.
I'd much rather hear about problems, so that I can follow the process of the resolution and make a final decision.
But hey, that's just me.
Greg, if you properly test and configure SSDs before deploying them to a production environment, then I don't see how a post like this makes your life harder. The issue: it is a fact that a lot of people are not as thorough, and if they expected to just plug it in and go, they could be sadly mistaken when they have the same experience as Brent… especially given that the machines were set up for him by the vendor.
Brent – I asked you for clarification on the devices you were using in the forum thread but you failed to respond.
I'm mainly disappointed at Brent Ozar's original post & my responses are intended to focus on his misinformation. Not only did he slap FusionIO around amongst an influential group, he also failed to answer questions (not good).
You simply referred to his post in a public forum & I felt it was important to respond because I routinely recommend this product and this kind of misinformation can make my life harder when dealing with my own customers and staff (some of whom read this blog).
I have a lot of respect for you & highly regard your blog, which is one of only a handful that I read routinely and actually benefit from. You usually provide great information which is useful for real world DBAs, unlike some of the SQL Server MVPs who rant on about useless stuff such as splitting IP addresses with single SQL statements.
Your posts are nearly always useful (e.g. insights into CUs & SPs as well as many other things). On this occasion, I think you missed the mark but hopefully we can move on now.
I definitely apologise if I've come across in any negative way toward you personally – that certainly wasn't intended.
Greg – if you had questions or complaints about the way I tested the gear with FusionIO, I wish you would have talked to me. I would have been more than happy to elaborate on it.
In my testing, I had three successive failures. They were able to reproduce the failures, and the responses I got via email do not line up with Jodi's explanation of too many cards in the server. I'm waiting to hear back from FusionIO before I go live with a post about the incident. Frankly, if the cards were really working as designed, I would have expected that the engineers involved would have recognized that rather than having me go through three successive drives, each time telling me the problem was fixed.
Jodi, ditto on the thanks for clarifying.
Greg, take it down a notch. If FusionIO can configure a system and have it blow away three drives in a row, average users are going to be able to make the same kind of mistake.
Sorry for having an opinion, and I'm sorry you feel the need to point out what a horrible person I am for having the audacity to warn users that there is more to moving to SSDs than just plugging them in.
Thanks for clarifying this Jodi. It seems your sales engineer wasn't the only person who got overzealous in this exercise.
Perhaps you might want to consider using NDAs when you provide testing environments to stop incidents like this in future.
By way of disclosure, I'm one of the team members at Fusion-io.
Let me start by thanking Brent for his interest in Fusion-io. I think the readers of this blog may want to know that Brent was logged into a test machine in Fusion's very own labs. Unfortunately, an overzealous sales engineer was so excited to showcase our performance that he put more ioDrives in the server than the power supply was rated to
Let me explain, Fusion's ioDrives are designed to go offline in the event of a power cut rather than losing the in-flight data. In Brent's case, the ioDrives were detecting signs of what they thought might be an abrupt power cut, and so the cards made sure that all the in-flight data got saved ahead of the potential power loss before safely
Thanks again for the feedback. Brent, we understand you may have more questions, so please feel free to reach out to me and I'll make sure we have the right people on it. That said, we're surprised that this is getting so much attention as the cards were working exactly how they were designed to work.
Just to add my 2 pence. We've just bought 8 of these drives off the back of testing and speaking to a current high end user of these drives. We've seen no such problems and we've heard back from our contact at Fusion confirming what Greg mentioned about memory allocation as recommended when you purchase the drives: 2GB of memory for every 80GB of FusionIO.
Pixy, degradation in I/O performance on FusionIO devices is known to occur if memory is not correctly configured. Unlike HDDs, FusionIO devices require a significant allocation of RAM for normal operation & if both SQL Server & the FusionIO device are competing for RAM, they end up using the page file heavily, with serious performance degradation. Did you check this out?
We generally leave at least 4GB RAM for the smaller devices (80GB / 160GB devices) and more for the larger devices. It's also something that you need to keep monitoring, but this is generally part of wider memory monitoring requirements for busy mission critical servers.
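The two rules of thumb quoted in the comments above (2GB of RAM per 80GB of card, and at least 4GB for the smaller cards) can be turned into a rough sizing check. This is only a sketch of that arithmetic, not a vendor formula: the function names, the way the two rules are combined with `max`, and the `os_reserve_gb` default are all assumptions made for illustration.

```python
def device_ram_reserve_gb(capacity_gb, gb_per_80gb=2.0, floor_gb=4.0):
    """RAM (GB) to set aside for one card.

    Combines the two figures quoted in the thread: 2 GB per 80 GB of
    capacity, with a 4 GB minimum. Combining them with max() is an
    assumption, not documented vendor guidance.
    """
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return max(floor_gb, capacity_gb / 80.0 * gb_per_80gb)


def sql_server_memory_cap_gb(total_ram_gb, card_capacities_gb, os_reserve_gb=4.0):
    """RAM (GB) left for SQL Server once the OS and every card have their share.

    os_reserve_gb is a hypothetical allowance for the operating system.
    """
    card_reserve = sum(device_ram_reserve_gb(c) for c in card_capacities_gb)
    remaining = total_ram_gb - os_reserve_gb - card_reserve
    if remaining <= 0:
        raise ValueError("not enough RAM for this many cards")
    return remaining


if __name__ == "__main__":
    # Hypothetical server: 64 GB of RAM, one 160 GB card and one 320 GB card.
    print(sql_server_memory_cap_gb(64, [160, 320]))
```

A number like this would be a starting point for SQL Server's max server memory setting, to be checked against real memory monitoring rather than trusted on its own.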
Intel SSDs aren't in the same performance universe as FusionIO yet.
There are a lot of machines that have never been tested with devices as fast as Fusion-io drives… and it's them that's breaking, not the drives. Even some of the high-end OEM machines can't handle the speed.
Oh, and we're currently also using large numbers of the Intel X25-E SSDs (up to 16 per array). While not as fast as the ioDrive, they deliver more consistent performance.
We have three ioDrives in production, and I *have* run into an issue that sounds similar. While doing a reload of a 60GB MySQL database, performance slowed to a crawl. The drive didn't fail, but IOPS dropped to about 5% of what we had been achieving, and didn't recover. We swapped the card for a spare and restored from backup.
It wasn't until some months later that we discovered the reason behind this. The ioDrive keeps a pool of spare blocks free that it can write to rapidly without having to do an erase cycle first. If you hit it with random I/Os fast enough and for long enough, the pool will run out and performance will nosedive.
You can reformat the card to reserve more blocks for this pool, and we've done that with good results, but it's not something that's exactly obvious.
On Linux the drives didn't fail out, though performance was terrible. Could be different on Windows servers.
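The spare-block behaviour described in this comment can be illustrated with a toy model: writes consume pre-erased blocks, background erasing replenishes them more slowly, and once the pool is empty the achievable write rate falls to the reclaim rate. This is a simplified sketch, not how the ioDrive firmware actually works; every name and number below is hypothetical.

```python
def seconds_until_pool_exhausted(pool_blocks, write_rate, reclaim_rate):
    """Time until the pre-erased pool empties, or None if it never does.

    Rates are in blocks per second. The pool only drains when writes
    outpace background reclaim.
    """
    net_drain = write_rate - reclaim_rate
    if net_drain <= 0:
        return None
    return pool_blocks / net_drain


def simulate(pool_blocks, write_rate, reclaim_rate, duration_s):
    """One-second steps; returns a list of (second, free_blocks, achieved_rate)."""
    free = pool_blocks
    history = []
    for second in range(duration_s):
        # Background erase refills the pool, but never beyond its size.
        free = min(pool_blocks, free + reclaim_rate)
        # Writes can only land on blocks that are already erased.
        achieved = min(write_rate, free)
        free -= achieved
        history.append((second, free, achieved))
    return history


if __name__ == "__main__":
    # Hypothetical figures: 1000 spare blocks, 300 writes/s, 100 reclaimed/s.
    for second, free, achieved in simulate(1000, 300, 100, 8):
        print(second, free, achieved)
```

In this model, reserving more blocks for the pool (a larger `pool_blocks`) delays the point where throughput collapses, which matches the commenter's experience with reformatting the card. It does not raise the sustained rate once the pool is gone.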
Obviously the value of such a testimonial is quite subjective. You clearly hold it in little regard while I'd like to hope that some of my readers don't ignore it – since only listening to the good stuff could lead to similar failures in their own environments… possibly only once they've gone live, because they didn't find enough value in proving that the stuff worked.
Fair enough, but I don't think Brent's lab test was very valuable, even as a negative testimonial as there are too many unanswered questions.
It's not at all clear that three devices actually failed, as it could easily have been the same software driver failing each time, with no problems in the hardware (who knows if the drivers were re-installed between tests?). I also asked Brent about which specific devices were used (as FusionIO have a few products) but haven't received an answer.
It would be good to know more about what actually happened before reading too much into his experience.
We have conducted our own testing into FusionIO devices, by actually obtaining devices & installing them into our own lab. We haven't had any issues with any of the devices used in our lab (we've had 4 so far).
Sure, you are right, I haven't touched and felt a FusionIO device. I haven't touched or felt my own SAN either, aside from looking at it in the data center. Does that mean I should trust everything that EMC tells us? That because they do so many things right, they couldn't have possibly overlooked anything that would cause a less than optimal symptom in our particular environment?
As for the tests, according to Brent, it was three different devices that failed successively, but I don't have any more details, as I wasn't the one running the tests or invited into FusionIO's lab environment in the first place. I was just repeating what he's already said about the situation. You are more than welcome to disagree with me if you like, but I think it is more than warranted to mention this test rather than sweep it under the rug and pretend it didn't happen. If anything it will ensure that FusionIO gets to the bottom of the problem, just like we call Microsoft to the carpet on every little snafu they push out. I don't think it's fair to expect everyone to blindly trust good testimonials but to bury their head in the sand like an ostrich as soon as there is one bad one.
I would definitely agree with your Ford example, but neither you nor Brent seem to have actually "owned" or used FusionIO in any real way & one failed lab test is no reason to "raise the alarm".
As for FusionIO acknowledgements – did they actually say that three devices failed in a row, or just that the same bad driver was used in all three tests? If the problem was simply that someone installed a wrong or out-of-date driver, it wouldn't matter how many devices were tested.
Greg, nobody said to forget about FusionIO or even SSDs in general. Again, just pointing out that FusionIO has *acknowledged* the problem that Brent discovered in their own lab.
I suppose you are suggesting that what happened to Brent -three times- couldn't possibly happen to anyone else. Maybe there is some I/O characteristic of his test that all of your million tx/sec customers haven't hit yet?
Let's switch the argument to tires. A few years ago Firestone had a real problem with their tires failing on certain Ford SUVs. The failure rates were relatively low, but in those few cases the consequences were disastrous. Now let's say you owned a Ford at that time, and had a friend who was considering buying one, but hadn't heard about the failures. Do you think it would be fair of you to keep that information to yourself, and let your friend buy the Ford unaware, just because you haven't (yet) experienced the problem?
I just want to clarify that I wasn't referring to our own environment (as this would be a limited perspective). As you're aware, I've run a remote DBA business since 2002 & we've had a growing number of customers running FusionIO on extremely high volume (millions of Tx/min) production systems since 2008.
Despite the huge transaction volumes that have been processed on these systems, we've not seen a single live failure on any of them other than one faulty device that was returned to the vendor prior to production release.
It should also be pointed out that these customers have saved HUGE amounts of money by using SSD b/c achieving the same throughput on HDDs would have been enormously expensive in comparison.
Whilst I agree that all devices need to be tested before they are released, this is as true of any hardware & there's no reason to single FusionIO out just because somebody runs a single lab experiment that didn't work out.
Thanks Pete, two points, one is that FusionIO has certainly been around for some time and that they certainly do have the ability to support Windows and SQL Server (that's precisely what Brent was testing, and I had a conversation with a PM about that very topic over a year ago); the other is that my intention was purely to make sure people are aware that they should perform adequate tests before making the general assumption that any specific SSD *is* ready for prime time. I'll repeat what I told Greg, that I am not scare-mongering or trying to get people to stay away from SSDs… I just want to point out that there are cases where they have failed and that, like with all hardware changes, testing is required and that you shouldn't just trust any configuration that happens to say SSD on it.
I think your point about due diligence is a good one, but I’m not sure that the premise is fair. Making a general assumption (SSDs may not be ready for primetime) based on a single test of one particular brand is a bit of a stretch, I think. Fusion IO is a fairly new player in the SSD market and last I knew the only offering they had was specific to Sun and Oracle. There are other players in the market with more mature products that I’ve stress tested with SQLIO and never once had an issue. The key differentiator of any SSD product is really the low level IO logic. It requires quite a bit of intellectual capital to produce and therefore product maturity (in an immature market) does make a difference.
PS Brent was testing in the vendor's own environment. We don't know yet if it's a "simple software bug" and it may be something that will require a fix from Microsoft, not from the vendor. So I don't think this is a case where we should all just shut up, pretend it didn't happen, and continue telling everyone the sun is shining and that SSDs are perfect and ready for prime time.
Greg, I hope you read my whole post. I was certainly not telling anyone to run away from SSDs; just to make sure that they perform realistic tests first.
On the flip side, just because they work for you in your environment, this does not mean everybody should rush out and replace their SANs without doing due diligence.
It's a two-way street… if you expect people to believe in the good reviews of SSDs, you can't spit on any bad ones.
Me too – rolling to prod with some SSDs next weekend…
This is sad scare-mongering, based on half-baked tests.
Even if he was testing in an environment completely provisioned by the vendor, a simple software bug doesn't mean the whole technology isn't ready.
If we ran away from Microsoft software every time a driver problem got in the way, nobody would be using Windows or SQL Server.
FusionIO drives have been used on a widespread basis for the past couple of years & in my experience (real world vs Brent's Lab tests) they work very well & I've not seen a single failure in a production environment yet.
I hope so too! Thanks for the update.
I'm hearing back that it's a driver problem, and they're testing a new build of the drivers as we speak. Fingers crossed.