I'm in desperate need of some ideas on how to fix a dying server. Story so far...
We have 3 x HP ProLiant DL380 G4 rackmount servers.
One of the servers is randomly restarting every 20 minutes or so with an unknown hardware fault. We've caught it once on a blue screen saying just that, "Hardware fault" (no other info), before ASR kicked in and rebooted the server.
When it started happening I used an Ultimate Boot CD to run a CPU stress test and MemTest86+. Both came back fine, so I started looking at hard drives. The controller reports everything A-OK and no probs with any of the drives.
The server is under warranty, so I've been in touch with HP, and after a week of getting nowhere they agreed to replace the entire motherboard (the RAID controller is integrated).
A week later and it's started doing it again, rebooting every 20 minutes. HP error logs show the event - unknown hardware fault caused ASR to reboot unexpectedly.
I have brand new RAM which I'm putting in today, as the servers were due for a RAM upgrade anyway.
So I can be pretty sure the problem is not RAM (if it happens again) or motherboard/RAID controller related.
(1) Does anyone know of a really good CPU stress test utility that tests the CPU only (no activity on RAM/HDD)? Preferably one that can test 4 cores to destruction?
(2) How can I work out which of the 5 hard drives in an array is faulty when none of the usual diagnostic tools or warning lights show that any fault has even occurred?
And maybe (3) is there anywhere else I can try looking that I haven't considered yet?
Random reboots have been attributed to power supplies in the past. Do you have redundant PSUs in the server? If so, try running on one and see if it still does it; if not, try again with the other one on its own.
As long as you have one connected you can disconnect the other one (you'll get an amber warning light but that's about all).
It's not the mobo or RAID controller as they were replaced by HP.
It's not the RAM as I've just put in 4 new sticks and got the same result.
Off to do a 'Prime95' CPU stress test. If it's not that then it's either a randomly faulty NIC or one of the HDDs.
edit: also it's not overheating - I've got half the school fighting me for access to the server room, it's so cold in there (and so very hot everywhere else!). I find it hard to blame the UPS as 3 other servers are connected to that beast. Surely if that was causing the prob, other servers would have been affected?
Last edited by tmcd35; 29th June 2009 at 11:27 AM.
Try and find a copy of virtualPC online. This is a full hardware test program - boot from the CD and run a full system test. A colleague of mine uses it to test refurbished PCs and it often finds faults that don't always show up otherwise.
What temp are the processors running at, and do they have thermal cutout settings in the BIOS?
As you say, the problem seems to occur when the server is fully loaded at logon or logoff time. If the processors are heating up quickly without proper airflow around them, then this could be the cause.
@Bossman, thanks for the suggestion. I really don't think the problem is heat/processor related, mainly because it should have rebooted during the Prime95 tests. The load was at 100% across all 8 cores/threads (1xdual core with HT) for around half an hour in total - no issues.
Of course neither the HDDs nor the NICs were being accessed during the Prime95 tests.
My gut instinct (as it has all along) still says a faulty HDD - I just can't work out how to determine which of the 5 buggers is causing the problem.
I also have 4 NICs in total, so it could be one of the three add-in cards.
@AdamGent - looked up Virtual PC Check/PC Check - $300? Good suggestion but not something I can get any time soon. Anyone know of any good FOSS that'll do the same/similar job?
@Plexar/@MGSTech - Currently running on one PSU - so far so good. Next mass logon is around 1:10pm. I'll switch PSUs at around 1:40pm ready for the last logon of the day (2:10ish). We'll see if the problem reoccurs during either period. If it does, I'll look at recreating the problem tomorrow.
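For anyone keeping notes on swap tests like this, here is a rough Python sketch of the elimination logic. It assumes a single faulty part that only causes a reboot while it is in use; the part names and trial outcomes are hypothetical, not results from this server. A clean run is deliberately not counted as clearing a part, since the fault is intermittent.

```python
def narrow_suspects(all_parts, trials):
    """Narrow down suspects from (active_parts, rebooted) trials.

    Assumption: exactly one faulty part, which can only trigger a reboot
    while it is in use. A trial without a reboot proves nothing, because
    the fault is intermittent, so only trials that rebooted are used.
    """
    suspects = set(all_parts)
    for active, rebooted in trials:
        if rebooted:
            # The fault showed up, so the culprit must be among the parts in use.
            suspects &= set(active)
    return suspects


# Hypothetical part names and outcomes, for illustration only.
parts = {"PSU-1", "PSU-2"}
trials = [
    ({"PSU-1"}, False),  # first logon on PSU-1 only: no reboot
    ({"PSU-2"}, True),   # second logon on PSU-2 only: reboot
]
print(narrow_suspects(parts, trials))
```

If the server reboots on each PSU by itself, the result is an empty set, which under the single-fault assumption means neither PSU is to blame and it is time to look elsewhere.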
If I turn ASR off then the server just freezes up without rebooting, and I end up having to reboot it manually. Only once have I seen a blue screen for an error, which I think may have been thrown up by ASR before it restarted.
The Windows event logs are not showing anything obvious at around the same time as the reboots.
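To line the event log up against the reboots a bit more systematically, something like the following Python sketch could help. It assumes the log entries have already been exported and parsed into (timestamp, source, message) tuples; the sample entries and the 5-minute window are made up for illustration.

```python
from datetime import datetime, timedelta


def events_before_reboots(events, reboots, window_minutes=5):
    """Map each reboot time to the events logged in the window before it."""
    window = timedelta(minutes=window_minutes)
    result = {}
    for reboot in reboots:
        result[reboot] = [
            (ts, source, msg)
            for ts, source, msg in events
            if reboot - window <= ts <= reboot
        ]
    return result


# Made-up sample data; real entries would come from an exported event log.
events = [
    (datetime(2009, 6, 29, 9, 1), "disk", "example warning"),
    (datetime(2009, 6, 29, 9, 18), "nic", "example link reset"),
    (datetime(2009, 6, 29, 9, 19), "service", "example info"),
]
reboots = [datetime(2009, 6, 29, 9, 20)]

for reboot, hits in events_before_reboots(events, reboots).items():
    print(reboot, "->", len(hits), "event(s) in the preceding window")
    for ts, source, msg in hits:
        print("   ", ts.time(), source, msg)
```

If the same source keeps turning up just before every reboot, that is a hint worth chasing; if the window is always empty, the fault is probably hitting before Windows gets a chance to log anything.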
I had a similar problem, and it did turn out to be one of the HDDs. I used an HDD Regenerator program which took hours, but the server seems OK now. (Though I'd be happier if SMT would pony up for new disks or preferably a whole new server.)
@LeMarchand - how did you work out which HDD was at fault? I *could* spend £1000 on 5 new HDDs and swap them out one at a time, letting the array rebuild to the hot spare each time before pulling the next. But that could take days and I'm worried about what happens if the faulty drive gives out during the rebuild process. I'd rather find a way of determining which drive is at fault and just replace that.