Experienced a drive failure in one of our servers last week; however, I'm rather disappointed in the resiliency side of things. I wonder if there's something I'm doing wrong - the only thing I can think of is that I should try some different "drivers" for ESXi.
Single Intel server with an Adaptec 6805 8-port SAS/SATA3 RAID controller with two arrays - a 2-drive RAID1 and a 4-drive RAID10.
The RAID1 is home to the first domain controller, the RAID10 is home to the first file server. The drive that failed was in the RAID10. This shouldn't be an issue: it's a single failure, so the array should just run degraded until the drive is replaced. I whipped the drive out so I know which one it is, and replacements are on order for Wednesday. However, the VM guest only lasted a day before it just died a death. The datastore is "live", albeit extremely slow, and the guest is inaccessible - not even pinging.
This will be resolved on Wednesday; however, I will have this nagging doubt in my mind should another drive fail. I don't want to interrupt my holiday time off again with things that should take care of themselves, or at least tick over until we can resolve them! Any ideas on how I can resolve this permanently?
(For reference, the failing drives are Seagate. Never, ever again.)
Last edited by synaesthesia; 24th August 2014 at 12:22 PM.
Seagate has gone downhill and are now rubbish. I have never found an Adaptec RAID controller I am happy with; the HP rebrands and the LSI rebrands in IBM kit seem to do an all right job, but I have never had good luck with Adaptec. Perhaps try a different, beefier RAID controller with more memory - RAID 10 should not slow down that much with a single dropped drive. I've had RAID 5 sets lose one drive and keep going at almost full speed with the HP Smart Array stuff.
You should always have a drive available to pop into a RAID in case of drive failure. Or you could move everything to a RAID 6 configuration, where you could survive TWO drive failures. I personally don't see much of a point in doing RAID10. I'd rather do RAID5 with a hot spare or a RAID6 configuration.
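For anyone weighing up those options, here is a rough sketch of how the three layouts compare on a 4-drive array like the one in this thread. It only counts usable capacity and the number of simultaneous drive failures each layout is guaranteed to survive in the worst case; the `Layout` class and `layouts_for` function are made-up names for illustration, and it assumes identical drives and ignores rebuild time and performance entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    name: str
    usable_drives: int        # usable capacity, in whole drives' worth
    guaranteed_failures: int  # simultaneous failures survived in the worst case


def layouts_for(drives: int) -> list[Layout]:
    """Compare three layouts for an even number of identical drives (>= 4)."""
    if drives < 4 or drives % 2:
        raise ValueError("this sketch assumes an even drive count of at least 4")
    return [
        # RAID10: striped mirrors. A second failure is only survivable if it
        # lands in a different mirror pair, so the guaranteed figure is 1.
        Layout("RAID10", drives // 2, 1),
        # RAID5 across (drives - 1) disks with the last disk idle as a hot
        # spare: one disk of parity plus one spare are not usable capacity.
        Layout("RAID5 + hot spare", drives - 2, 1),
        # RAID6: two parity blocks per stripe, so any two drives can fail.
        Layout("RAID6", drives - 2, 2),
    ]


for layout in layouts_for(4):
    print(f"{layout.name:<18} usable={layout.usable_drives} drives, "
          f"survives {layout.guaranteed_failures} simultaneous failure(s)")
```

On four drives all three give the same usable space, so the choice comes down to fault tolerance versus write performance rather than capacity.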
Yeah they're SATA drives - a server built on rather a tight budget. And yes, write caching with a failed array apparently.
A little research shows this could be an ESX issue; there have been problems relating to this sort of thing since 5.1. I will need to do some more digging, but not until start of term!
Secondarily, @ericdano we don't need lectures that are not relevant to the problem in hand thanks! We should have a spare, but personally I'm glad we didn't. A spare would have meant putting in another Seafail drive. We do keep spares for all arrays, just luck of the draw we didn't for this one. I would not use a RAID6 on a VM host, the write speed drop is far too harsh.
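To put a rough number on that write speed drop, here is a back-of-envelope sketch using the commonly quoted "write penalty" rule of thumb (back-end disk I/Os per small random write: 2 for RAID10, 4 for RAID5, 6 for RAID6). The per-drive IOPS figure is purely an assumption for a 7200 rpm SATA drive, not a measurement from this server, and controller write cache can change the picture considerably.

```python
# Rule-of-thumb back-end I/Os needed for each small random write.
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

# Assumption for illustration only: rough random IOPS of one 7200 rpm SATA drive.
ASSUMED_IOPS_PER_DRIVE = 75


def random_write_iops(level: str, drives: int, iops_per_drive: float) -> float:
    """Estimate array-level random write IOPS from the write penalty."""
    return drives * iops_per_drive / WRITE_PENALTY[level]


for level in WRITE_PENALTY:
    estimate = random_write_iops(level, 4, ASSUMED_IOPS_PER_DRIVE)
    print(f"{level:<7} ~{estimate:.0f} random write IOPS on 4 drives")
```

With those assumed figures, RAID6 comes out at roughly a third of the RAID10 estimate for random writes on the same four drives.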
Not on SATA drives I'm not! I would prefer to keep the DC as physically separate from the rest of it as possible hence the separate array. As said re the RAID10 setup, we have a very limited budget and the setup is built around it, so bearing in mind the lower end controller and drive combination we need to keep performance up as much as possible. Not fussed about write performance on a DC but the file server concerns me greatly - that's home drives and shared areas. Maybe in a few years when those shared areas are done away with, but not yet!
It has been discussed here a number of times, but it is probably worth repeating: RAID5 should never be provisioned with a hot spare. A failure of a drive on a RAID5 should trigger a copy of the data, and only then should you attempt to put in a new drive and rebuild the array. RAID5 is really best avoided in any kind of production environment.
(Apologies for the slight hijack, we now return you to your normal viewing).
Funnily enough, I might not really have any other option but to accept the performance hit. It won't take up the slack of the failed drive on the hot spare >:| Currently trying to force it online to recover it.
To be honest, once you're virtualised, keeping resources physically separate is kind of defeating the point. I would suggest it is much better to have one pool of more reliable storage complete with a hot spare, rather than multiple pools with less resiliency and no hot spares.
Either way, it is a bit worrying that the guest has become inaccessible; that shouldn't happen (and hasn't happened when I've had degraded arrays). I could understand it happening during the rebuild process, as most low-end RAID cards can't cope with rebuilding an array whilst serving data off it, but it shouldn't happen during the initial degraded stage. It may be highlighting a further problem with other drive(s).
Hate to be the one to ask this - but do you have backups?
Yeah, backups are good. I'm halfway through recovering from the array as a "live" backup recovery; if that fails I'm not too worried (other than losing my holiday to resolve this!) as I can recover from those. Waiting on Parcelfarce to deliver the replacement drive (plus extra spare) that they should have delivered yesterday (GRR!), then it looks like I'll be flattening the arrays and starting from scratch. Performance copying from the drives whilst it's "rebuilding" is oddly fine. How or why it impacted on the running of the server does indeed concern me, especially on the other array (which is what convinced me that it's probably best to do as you suggested, as we gain absolutely nothing by keeping them separate).
It would be so nice to even believe we'd be listened to if we said we needed a SAN - a couple of years ago yes, but I really don't believe *anyone* in schools should be doing that currently, not with the way things are going. Hence I'm not worrying too much, get this back up and running and it'll tide us over nicely until we basically have machines in school that are only a physical gateway into servers hosted elsewhere!