I have been experiencing some strange issues with one of our servers, an HP ProLiant ML350 running Windows Server 2008: it randomly switches off and starts up again. This isn't a clean shutdown and it happens mainly at night. In fact it has only happened twice or thereabouts during the day in the last 6 months, but has happened 19 times during the night in the same period.
Checking Event Viewer / the system logs, all I'm left with is "The previous system shutdown at 05:06:51 on 24/10/2011 was unexpected." Looking at all the other logs, there are none that are even close to the shutdown times.
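To keep track of the day/night split, the timestamp can be pulled out of each "unexpected shutdown" message. This is only a rough sketch: it assumes the message text and the day/month/year date order are exactly as quoted above (both depend on the server's locale), and the 18:00 to 08:00 "night" window is my own assumption, not something from the logs.

```python
import re
from datetime import datetime, time
from typing import Optional

# Matches the message text as quoted in the post (date is day/month/year).
PATTERN = re.compile(
    r"The previous system shutdown at (\d{2}:\d{2}:\d{2}) "
    r"on (\d{2}/\d{2}/\d{4}) was unexpected"
)

# Assumed out-of-hours window; adjust to the site's real working hours.
NIGHT_START = time(18, 0)
NIGHT_END = time(8, 0)


def parse_unexpected_shutdown(message: str) -> Optional[datetime]:
    """Return the shutdown time from one event message, or None if it does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    clock, date = match.groups()
    return datetime.strptime(f"{date} {clock}", "%d/%m/%Y %H:%M:%S")


def is_night(moment: datetime) -> bool:
    """True when the time falls in the assumed night window, which wraps past midnight."""
    t = moment.time()
    return t >= NIGHT_START or t < NIGHT_END


message = "The previous system shutdown at 05:06:51 on 24/10/2011 was unexpected."
when = parse_unexpected_shutdown(message)
print(when, "night" if is_night(when) else "day")
```

Feeding every such message from the System log through `parse_unexpected_shutdown` gives a list of times that can be counted by day and night, or lined up against UPS and iLO events.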
After this, last Thursday during the night, the server switched off, but this time it was stuck on a blank screen with the "internal health problem" and "external (power supply) health problem" indicators showing. I was unaware of the internal health LEDs so didn't check them, but after switching the UPS off and back on the server booted fine. This is the first time it's done this so, as you can understand, I'm quite worried! I checked the temperature with both software and a temp probe. It seems to be fine, nothing over 40 degrees. All fans are running. Everything is working OK even when the server is under load.
There are no scheduled tasks, and as I said before the times are completely random.
I was thinking it could be a UPS problem, but there is another server attached to the same UPS which doesn't reboot, and the UPS has just had a new battery.
Also, I have had a few problems recently with HP servers and drivers, so it's worth updating all the drivers and, if they are already the latest version, maybe rolling back to the previous release and then seeing if it carries on.
The reboots are really intermittent; some have been a week apart. So I'm making changes slowly to try to find out the problem. Still nothing showing in the logs. Did a Windows Update on Friday and so far so good, but it could just be waiting!
Dave - it does come with iLO. I wasn't aware it could do that and never got round to setting it up, so I will have a look at that tomorrow! Thanks
Ben - I haven't, no. The first time it appeared to be a hardware issue was the other week, and I haven't been able to power down the server since then. I did upgrade the RAM last Christmas, and ran a memtest on the new and old RAM and it was all fine. So I'm hoping that it hasn't only lasted 6 months! I'll make sure that these are my next things to check though; hopefully iLO will indicate if there are problems. Cheers
Damn - another restart last night. The server literally just sits there. I haven't made any changes to it. I've had Remote Desktop open to it and have checked it's still there every 15 minutes like a madman.
I've checked iLO - I'm assuming you mean the iLO 2 log on the System Status page?
Informational iLO 2 11/17/2011 22:51 11/17/2011 22:51 1 Server power restored.
Informational iLO 2 11/17/2011 22:51 11/17/2011 22:51 1 Server power removed.
So this refers to the reboot yesterday. The only entries before this are:
Informational iLO 2 11/11/2011 19:01 11/11/2011 19:01 1 Server power restored.
Caution iLO 2 11/11/2011 19:01 11/11/2011 19:01 1 Server reset.
Which is when I restarted the server for Windows Updates.
In the IML the last entry is on the 6th:
Caution POST Message 11/06/2011 05:56 11/06/2011 05:56 1 POST Error: 1778-Drive Array Resuming Automatic Data Recovery Process
Which coincides with another crash.
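For anyone wanting to line these entries up against the Windows shutdown times, here is a rough sketch that parses the lines exactly as pasted above. The field names are my own labels, the column layout is inferred from these five lines only, and the dates are read as month/day/year; a real iLO or IML export may well be laid out differently.

```python
import re
from datetime import datetime
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    severity: str
    source: str          # e.g. "iLO 2" or "POST Message"
    first_stamp: datetime
    second_stamp: datetime
    count: int
    description: str


STAMP = r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}"
LINE = re.compile(rf"^(\S+)\s+(.+?)\s+({STAMP})\s+({STAMP})\s+(\d+)\s+(.+)$")
STAMP_FORMAT = "%m/%d/%Y %H:%M"  # month/day/year, as in the pasted lines

LOG = """\
Informational iLO 2 11/17/2011 22:51 11/17/2011 22:51 1 Server power restored.
Informational iLO 2 11/17/2011 22:51 11/17/2011 22:51 1 Server power removed.
Informational iLO 2 11/11/2011 19:01 11/11/2011 19:01 1 Server power restored.
Caution iLO 2 11/11/2011 19:01 11/11/2011 19:01 1 Server reset.
Caution POST Message 11/06/2011 05:56 11/06/2011 05:56 1 POST Error: 1778-Drive Array Resuming Automatic Data Recovery Process
"""


def parse_log(text: str) -> List[LogEntry]:
    """Parse each non-empty line that fits the pasted layout; skip the rest."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue
        severity, source, first, second, count, description = match.groups()
        entries.append(LogEntry(
            severity,
            source,
            datetime.strptime(first, STAMP_FORMAT),
            datetime.strptime(second, STAMP_FORMAT),
            int(count),
            description,
        ))
    return entries


def power_removed(entries: List[LogEntry]) -> List[LogEntry]:
    """Entries whose description says power was removed, as opposed to a reset."""
    return [e for e in entries if "power removed" in e.description.lower()]


for entry in power_removed(parse_log(LOG)):
    print(entry.first_stamp, entry.description)
```

Separating "Server power removed." from "Server reset." makes it easier to see which restarts were logged as a power loss and which were ordinary resets like the one for Windows Updates.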
Everything in System information is OK.
I've upgraded the iLO firmware to the latest version.
So does this mean I don't have hardware issues? Or could I still have them, but they're not registering?
Not sure how to check Plexer. According to iLO it's OK; surely if it was on its way out, it would cut out when the server is under load?
However, the hardware lights (which have only come on once) did indicate an internal problem and an external problem. Apparently an external problem means the PSU.
It is connected to a UPS, which has just got a new battery, and according to the software the UPS is fine. Running the tests, it can keep the servers powered up. Connected to the same UPS is another server which isn't rebooting, so I scrapped the idea of it being the UPS??
Install the latest drivers.
Install the latest firmware for all components (NICs, BIOS, power management controller, RAID controller firmware). You can use the firmware update CD from HP or do it manually in Windows with the HP downloads.
Use Insight Diagnostics from the latest SmartStart CD to run a memory test that allows for ECC RAM, along with the other available tests.
Ramp the CPU up to 100% and leave it there for a few hours (Folding@home SMP is a good one for this) to check for CPU overheating or hot spots (areas of the CPU away from the temperature sensor overheating before the sensor registers it).
Swap the PSU to the other PSU bay, or swap it with one from another identical server.
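If installing Folding@home on the server isn't an option, the "ramp the CPU up to 100%" step can be roughed out with a short script. This is only a stand-in under my own assumptions: the function names are made up for illustration, and a pure-Python busy loop will keep every core busy but is unlikely to generate as much heat as a proper stress tool.

```python
import multiprocessing
import time
from typing import List


def burn(seconds: float) -> int:
    """Spin in a tight arithmetic loop for about `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    x = 1.0001
    while time.monotonic() < deadline:
        # Floating-point busywork so the core never goes idle.
        x = (x * x) % 1000003.0 + 1.0001
        iterations += 1
    return iterations


def load_all_cores(seconds: float, workers: int = 0) -> List[int]:
    """Run burn() in one process per core (or `workers` processes) and collect the counts."""
    workers = workers or multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(burn, [seconds] * workers)
```

On Windows the call has to sit under an `if __name__ == "__main__":` guard, for example `load_all_cores(3 * 3600)` for roughly three hours. Watch the temperatures and the iLO log while it runs, and stop it if anything looks wrong.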