Unhappy SIMS Server
6th October 2010 at 03:22 PM
We've had fun with the SIMS Server over the past couple of weeks. It had been perfectly happy until a hard drive failure occurred a couple of weeks ago. Not a problem: RAID5 is in play, so we got in touch with HP, a new hard drive was dispatched next day and installed, and all was well.
Unfortunately not. Somehow a couple of directories had become corrupted, which we can only assume happened when the hard drive failed. Still, we managed to recover them with no issues but couldn't delete the corrupted folders, with the system complaining that chkdsk was required, and of course this has to be done offline. So we arranged a day to stay a bit late, take the server down and run chkdsk.
A couple of days later, before our planned downtime, we noticed that the new hard drive in the RAID was now flashing orange, indicating an imminent failure! So, a newly supplied hard drive is about to fail! Well, it happens: new equipment does sometimes fail or arrive DOA. Phoned up HP again and this time provided various diagnostic reports from the server, which resulted in another hard drive being dispatched. The HP agent did, though, request that we update the firmware on the RAID controller, the other hard drives, etc. to ensure that this second failure had not been caused by out-of-date firmware or drivers. The agent pointed us to a downloadable DVD ISO which contains all the driver and firmware updates for our server, an HP ML350 G5. Nice; I like it.
So, our planned day arrives, staff go home and we stay late. Our second new hard drive has already been installed and we have confirmed that it has finished rebuilding. We run chkdsk and this clears the errors on the drive, so we can finally clear things up. We run the HP Update DVD and this goes off and updates all the firmware during a reboot (we hold our breath), and all seems well. We go home happy.
Monday morning, on my way into work, I receive an e-mail from the Insight Management Service on the SIMS Server saying that an ASR (Automatic Server Recovery) reboot completed at 07:20am - WTF? Get in, look at the Event Logs - no errors, just the event showing an unexpected reboot had occurred. No dump file present either, so no indication of a Blue Screen. Head scratching now occurs. Give it the benefit of the doubt, other work gets in the way - and the problem occurs again at 01:10 the next morning. Again, no errors in the event log leading up to the time it rebooted.
After some digging around, I found this:
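For anyone chasing a similar mystery, here is a rough sketch of how the reboot markers could be picked out of a System log once it has been exported to CSV. It is not what we actually ran. The column names `TimeCreated`, `Id` and `Message` are an assumption about how the export was made, and the sample rows are invented. The event IDs are the standard Windows ones from the EventLog source: 6005 (log service started, i.e. the box booted), 6006 (clean shutdown) and 6008 (previous shutdown was unexpected).

```python
import csv
import io

# Standard Windows System-log event IDs written by the EventLog source.
REBOOT_EVENT_IDS = {
    6005: "event log service started (system booted)",
    6006: "event log service stopped (clean shutdown)",
    6008: "previous shutdown was unexpected",
}

def find_reboot_events(csv_text):
    """Return (time, event id, description) for each reboot-related row.

    Assumes the export has 'TimeCreated' and 'Id' columns (hypothetical names).
    """
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        event_id = int(row["Id"])
        if event_id in REBOOT_EVENT_IDS:
            hits.append((row["TimeCreated"], event_id, REBOOT_EVENT_IDS[event_id]))
    return hits

# Made-up sample rows purely for illustration; not taken from the real server.
SAMPLE = """TimeCreated,Id,Message
2010-01-01 07:20:00,6008,The previous system shutdown was unexpected.
2010-01-01 07:21:00,6005,The Event log service was started.
2010-01-01 08:00:00,7036,Some service entered the running state.
"""

if __name__ == "__main__":
    for when, event_id, meaning in find_reboot_events(SAMPLE):
        print(when, event_id, meaning)
```

A 6005 with no 6006 just before it is the pattern to look for: the server came back up without ever logging a clean shutdown.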
HP ProLiant ML350 G5 Storage Server - Advisory: (Revision) Integrated Lights-Out 2 (iLO 2) And iLO 2 Management Controller Driver - FIRMWARE/DRIVER UPGRADE REQUIRED: ProLiant Server May Unexpectedly Reboot And Display Event ID 57 Error Messages - c
It would appear that certain combinations of drivers and firmware cause the iLO management module to randomly reboot the server for no reason! The HP Update DVD that I used installed a later version of the iLO firmware, but I didn't update the drivers, thus causing this issue.
Updated both the firmware (to 2.01) and the drivers (to the latest version) and so far so good; fingers crossed, no reboots yet, but I will wait for at least a week without a reboot before I'm happy that this has fixed the problem.
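The "week without a reboot" test is simple enough to write down. This is only a minimal sketch: the function name `quiet_since_fix` and all of the timestamps are made up for illustration, and the reboot times would have to come from wherever you record them.

```python
from datetime import datetime, timedelta

def quiet_since_fix(fix_time, reboot_times, now, required=timedelta(days=7)):
    """Return True if `required` has passed since the fix with no reboot after it."""
    rebooted_after_fix = any(t > fix_time for t in reboot_times)
    return not rebooted_after_fix and (now - fix_time) >= required

# Made-up timestamps purely for illustration.
fix = datetime(2010, 1, 4, 17, 0)
earlier_reboots = [datetime(2010, 1, 3, 7, 20), datetime(2010, 1, 4, 1, 10)]

# Four days after the fix: too early to call it.
print(quiet_since_fix(fix, earlier_reboots, now=datetime(2010, 1, 8, 9, 0)))
# Eight days after the fix with nothing new logged: happy.
print(quiet_since_fix(fix, earlier_reboots, now=datetime(2010, 1, 12, 9, 0)))
```

Reboots from before the fix are ignored on purpose, since they are the problem being fixed rather than evidence that the fix failed.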
Keeps us on our toes I suppose!