Please help with dying server
I'm in desperate need of some ideas on how to fix a dying server. Story so far...
We have 3 x HP ProLiant DL380 G4 rackmount servers.
One of the servers is randomly restarting every 20 minutes or so with an unknown hardware fault. We've caught it once on a blue screen saying just that, "Hardware fault" (no other info), before ASR kicked in and rebooted the server.
When it started happening I used an Ultimate Boot CD to run a CPU stress test and MemTest86+. Both came back fine, so I started looking at the hard drives. The controller reports everything A-OK, with no problems on any of the drives.
The server is under warranty, so I've been in touch with HP, and after a week of getting nowhere they agreed to replace the entire motherboard (the RAID controller is integrated).
A week later it has started doing it again, rebooting every 20 minutes. The HP error logs show the event: an unknown hardware fault caused ASR to reboot the server unexpectedly.
I have brand new RAM which I'm putting in today, as the servers were due for a RAM upgrade anyway.
So if it happens again, I can be pretty sure the problem is not related to the RAM or the motherboard/RAID controller.
- Does anyone know of a really good CPU stress test utility that tests the CPU only (no activity on RAM/HDD)? Preferably one that can test 4 cores to destruction?
- How can I work out which of the 5 hard drives in an array is faulty when none of the usual diagnostic tools or warning lights show that a fault has even occurred?
And maybe (3): is there anywhere else I can try looking that I haven't considered yet?
Thanks in advance