6th May 2011 at 03:30 PM
Had one of those days today that we all dread as technicians/network managers. Walked into the office this morning and was pounced upon by various members of staff shouting "No one can log on!"
Into the server room, and all the servers apart from one are switched on and running. The main DC, which also houses the CSE Database, is dead as a dodo.
Initial indication would be a power failure, but no, all the other UPSs are running and not reporting an issue. Powered the UPS back up, then the server, to start checking logs, etc. to see what the hell had happened. Looking at the logs, the UPS PowerChute log is not showing a power outage, but a "Maximum Internal Temperature Exceeded" event followed by a shutdown being initiated. Thing is, the shutdown procedure runs a script that powers down the nine virtual servers running on two separate XenServer hosts and then shuts down the hosts themselves. Yet these two hosts were still running, with all VMs except one on each shut down. User home drives, print server and other vital services run on these Xens.
OK, need to get the show back on the road, so I reboot the Xen Servers and get the main DC up and running, but still no home drives. Would appear that our main Xen Server has had a right hissy fit and just refused to boot: just a blank screen on the console after all the initial loading, and not responding to anything except a ping. After 10-15 mins, I take the brave step of forcing a power down of this Xen and rebooting, and thankfully all appears to boot and everything is running again.
I'm investigating the issue with the UPS at the mo and have a call open with APC to find out what went on, but I've moved the main server onto another UPS for now as I don't trust the original one.
As for the XenServer, bit concerned, but I might be partly to blame, as the script wasn't quite right. The script uses the XenCentre command line to gracefully shut down each VM one by one, then disable the XenServer and then power down the XenServer itself. A few months ago, I rebuilt one of the VMs and renamed it, but the script still had the old VM name! So it would appear that it succeeded in shutting down all but one of the VMs and then refused to shut the host down as it had a VM still running. Power went off as the UPS said enough was enough, and this would appear to have upset the XenServer enough to cause it to hang on reboot. Thankfully it managed to sort itself out, but I was at this point reaching for the XenServer Configuration Backups I burnt to DVD a while back.
The other XenServer also refused to switch off as it had a VM still running because this was an additional server I added to it but forgot to include in the shutdown script! Double
Lesson learnt, scripts amended but it could have been a lot worse. Just hope I can sort out the UPS now. Everything back up and running by about 9:15am, so didn't disrupt the day too much.
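For anyone with a similar setup, here is a rough sketch of one way to avoid both mistakes: ask the host which VMs are actually running at shutdown time instead of keeping a list of names in the script. This is not my actual script, just an illustration. It assumes the `xe` command line tool is on the PATH of a machine with Python 3 and is already able to talk to the host; the function names (`xe`, `running_guest_uuids`, `shutdown_host`) and the host UUID you pass in are made up for the example.

```python
import subprocess


def xe(*args):
    """Run one xe command and return its output (assumes xe is on the PATH)."""
    result = subprocess.run(["xe", *args], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def running_guest_uuids(host_uuid, run=xe):
    """UUIDs of guest VMs currently running on the host, excluding dom0."""
    out = run(
        "vm-list",
        "resident-on=" + host_uuid,
        "power-state=running",
        "is-control-domain=false",
        "params=uuid",
        "--minimal",
    )
    # --minimal returns a comma-separated list, or an empty string if none.
    return [uuid for uuid in out.split(",") if uuid]


def shutdown_host(host_uuid, run=xe):
    """Shut down whatever is running right now, then the host itself."""
    for uuid in running_guest_uuids(host_uuid, run):
        try:
            run("vm-shutdown", "uuid=" + uuid)  # graceful first
        except subprocess.CalledProcessError:
            run("vm-shutdown", "uuid=" + uuid, "force=true")  # then forced

    # Check again rather than assume: a VM left running is what stopped
    # the host from shutting down in the first place.
    leftover = running_guest_uuids(host_uuid, run)
    if leftover:
        raise RuntimeError("VMs still running: " + ", ".join(leftover))

    run("host-disable", "uuid=" + host_uuid)
    run("host-shutdown", "uuid=" + host_uuid)
```

Calling `shutdown_host("<host uuid>")` for each host would then pick up renamed or newly added VMs without anyone having to remember to edit the script. The `run` parameter is only there so the logic can be tried out with a fake command runner before trusting it with a real UPS event.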