Faulty Server HDDs
Wasn't sure where to put this, so hopefully it's in the right place.
We've had an ongoing issue with our server room. Four years ago we put in a Viglen system (before my time) and for the first year or so everything ran fine. After this initial "honeymoon period" a couple of the servers started to develop faulty hard drives. In total, over the next couple of years we went through 25-30 server hard drives in one server alone.
12 months ago, we replaced Viglen with a vanilla Windows environment and for 12 months everything has been running fine. However, as with Viglen, after this honeymoon period ended, we have had one server start to experience faulty hard drives, and we've just had Hitachi replace both of our SANs as they too have developed faulty drives.
This, to me at least, is too much of a coincidence and has to be caused by something other than faulty hardware.
Does anyone know of a company who would come out and do a "Server Room Evaluation" or something along those lines? Essentially we are looking for someone to come out to us and see if there is something we've missed which might be causing HDDs to fail on a regular basis (low-level static maybe? I don't know).
The room already has its own air conditioning and its own power source, but we still have the same problems time and time again.
Any advice anyone can provide would be gratefully received.
Are all of the servers on a UPS?
What version of RAID are you using?
All servers are on a UPS.
The RAID levels we're using are RAID 0 and RAID 5.
There's been no error with the RAID though; it's the drives themselves that have developed the fault.
How do you know they are faulty? What symptoms or errors are there? What is the humidity level like?
We haven't checked humidity levels as yet; this is something I'm going to look at shortly, though.
With regard to how we know they are faulty, the same way most would, I guess. Servers start to beep, showing a faulty drive. The SANs display a warning light, and when you go into the console it tells us that the drive has developed a fault. Of the two new SANs that Hitachi have sent us, one has developed an issue after just 2 weeks!
As an addition to this, our FROG box has also chucked out a few hard drives over the 2 years we've had it in.
I used to get similar problems. Admittedly it normally happened when a member of staff turned the aircon off in the server "cupboard", but the heat issues relating to the HD failures only affected the Hitachi and Seagate drives. That said, I found the Seagate drives to be the least reliable.
I have now changed all the server drives to Western Digital Caviar Black drives and haven't had any fall over since. That was well over a year ago, with no errors, no RAID issues, no strange noises, nothing...
Assuming you can rule out power (do you have sufficient UPSs?) and heat (do you have decent air con?) issues, you may be looking at something far more strange such as humidity (this should be solved by air con) or, even worse, some kind of magnetic interference.
Also, what make and model of hard drives do you use?
The sheer quantity of hard drive failures is definitely cause for concern. I can count the number of server HDD failures we've had here over the last 8 years on one hand *touches wood*, and I have normal hands by the way, not freakish hands.
Try SSD drives?
Not had one fail yet.
With regard to brands, it depends on the batch I guess. We have had so many WD drives die, but the Seagates have kept on going for years.
As to the server room, I would check the humidity and the temperature stability; if the temperature fluctuates a large amount regularly it will kill the drives quicker. I'd also look at vibration from the surroundings and from equipment, transferred through the mounting. Lots of mechanical vibration will also end in toasted drives.
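If you can get a cheap temperature/humidity logger into the room, a rough way to see how stable things really are is to look at the range and the biggest jump between consecutive readings, rather than a single spot check. The sketch below is only an illustration: the readings are made up, and the `summarise` helper is a hypothetical name, not part of any logger's software.

```python
from statistics import mean

# Hypothetical readings: (hour of day, temperature in degrees C, relative humidity in %).
# These numbers are made up for illustration; substitute your own logger's data.
readings = [
    (0, 20.1, 45.0),
    (4, 19.8, 44.0),
    (8, 20.3, 46.0),
    (12, 24.9, 38.0),
    (16, 21.0, 43.0),
    (20, 20.2, 45.0),
]

def summarise(values):
    """Return min, max, mean, and the largest jump between consecutive values."""
    jumps = [abs(b - a) for a, b in zip(values, values[1:])]
    return {
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 2),
        "largest_jump": round(max(jumps), 2) if jumps else 0.0,
    }

temps = summarise([t for _, t, _ in readings])
humidity = summarise([h for _, _, h in readings])
print("temperature:", temps)
print("humidity:", humidity)
```

A large jump that shows up at the same time each day would point towards something like the air con cycling or being switched off, which a one-off reading would miss.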
We do have the room air conditioned, and as far as I can tell, it is on 24 hours a day keeping the room at a constant 20 degrees C (that's what it is currently).
I am certainly going to check humidity in there, but I'm not really sure how I can check for magnetic interference (which at the moment is where our thoughts are as to what is causing the problem).
With regard to makes of hard drives, I don't actually know. All our servers are under warranty with their respective manufacturers. The Hitachi SANs, I would guess, use Hitachi drives. Our backup server has been using Seagate Barracuda drives by the looks of it. The FROG box, I've no idea, as FROG know when the drive goes before we do and they just turn up and replace it.
The air con should automatically dehumidify the air so unless you've got something seriously wrong with the building fabric, humidity shouldn't be an issue for you.
As for magnetic interference - I have no idea how you would check for that. I would assume it would be a specialised job, and therefore very expensive to find out.
Hmm, the less I say about Hitachi the better. Hitachi = IBM. When I was a wee boy we used to refer to IBM Deskstar hard drives as IBM Deathstars because it seemed as if every one ever made failed within a year. I had 2 in RAID 0 at that time. Big mistake. Spent more time reinstalling Windows than actually doing anything productive :/
I would have thought you would need something fairly big and fairly nearby to cause problems with magnetic interference. Are there any large transformers/substations nearby? Or any other large industrial-type stuff? Lift machinery, large electric motors? TV/radio transmitters? The disk itself will be inside a metal box, inside another metal box, so it would take a lot for interference to get through. If it were to affect the electronics on the drive, I would have thought you would be getting other failures/random crashes as well. Does anything else go wrong in that room?
Seeing how there are problems with several servers, it would point to something in the room rather than specific hardware.
How well does the air circulate around the room? Are there any hot spots near the servers/HDs? Does the air flow in the room work with or against the servers' own cooling mechanism?
Some other thoughts:
* Does anyone else have access to the room (cleaners/caretakers etc)?
* How are the servers situated within the room (in a rack, on the floor, on tables?)
* Is there any problem with dust building up in the room/inside the servers?
* Is the room carpeted? What is the floor made out of?
* What is the power supply like in that room? Do you get any spikes/brownouts?
* Have you tried wiping the drives and reusing them? Or is it a permanent failure? Can you access the data on another computer/server?
* How many UPSs and servers do you have? What capacity are the UPSs?
* Does the room have any problems with damp? Does it feel excessively dry or humid?
* Do the failures happen at regular intervals or after any other events?
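On that last question, one rough way to check for a pattern is to write down the date of each drive failure and look at the gaps between them. The sketch below assumes you have such a record; the dates shown are made up for illustration, and `gaps_in_days` is a hypothetical helper name.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical failure dates, made up for illustration; replace with your own records.
failure_dates = [
    date(2011, 1, 10),
    date(2011, 2, 9),
    date(2011, 3, 11),
    date(2011, 4, 10),
]

def gaps_in_days(dates):
    """Days between consecutive failures, after sorting the dates."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]

gaps = gaps_in_days(failure_dates)
print("gaps (days):", gaps)
print("mean gap:", mean(gaps))
# A spread that is small relative to the mean suggests a regular cycle;
# a large spread suggests the failures are irregular.
print("spread (population std dev):", round(pstdev(gaps), 2))
```

If the gaps cluster around a fixed number of days, it is worth asking what else happens on that cycle (maintenance visits, air con servicing, scheduled jobs, building works).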
Can you describe the room a bit more, maybe show some pictures? Perhaps the area around the room as well.
Thanks again to everyone who's responded. We've spoken to a couple of companies now who offer a service whereby they'll come out and do some analysis of our server room. Unfortunately the school is baulking at the price (you genuinely can't win: they don't want downtime, and they don't want to spend any money to work out what's wrong!).
Just to follow on from Chris_Cook's response, please see the attached images of our server room.
With regard to the questions asked in Chris's response:
- Nobody else has access to the server room. We have a locked door, with a metal roller shutter over the front which is also locked at night. As far as we know, we are the only people with keys.
- Servers (as can be seen in the photos) are rack mounted. However before the rack was in, the servers were on a desk and we had exactly the same problems.
- Don't think there is a problem with dust. When we removed one of the faulty SANs I checked the back of it and there was no dust build-up at all. The room itself is fine; it's pretty clean as server rooms go.
- The room isn't carpeted, the floor is a vinyl type covering.
- The room has its own dedicated power supply running into it. When we were a Viglen school this is something they said might be causing the problem, so we had the dedicated power source put in. As far as I know we don't suffer from spikes/brownouts, but we need to do some research on this.
- We have tried wiping the drives and reusing them; they always show as faulty and won't work. It is a permanent failure and we have been unable to access the data on another computer/server.
- All our servers run through the UPS that we have. We had a full network rebuild last year and the company who did the majority of the work was responsible for the UPS, and therefore it should be more than sufficient for our needs.
- No problems with damp in the room (although it does have a plastic sewage pipe running through the corner which burst over the summer!).
- The failures aren't very predictable; there are no events that we know of that could trigger them.
Those servers look remarkably like my old servers, one of which is retired now and one still going as a print server (Intel chassis?). We had disk problems for a while that indicated faulty disks; it turned out to be a circuit board behind the HD housing. I forget what it's called, but it's some sort of drive/error reporting circuit board. It could be that, but that would not explain why it was happening on the previous server, unless they both had the same fault. Did you test all those faulty disks on another machine?
25-30 server hard drives, that's a lot of drives to go through!