Cheers for that Synack.
So far I've updated all the firmware and the drivers.
I'm going to run diagnostics like the Insight ones or memtest ASAP. But as no one goes home I'm finding it difficult!
I only have one PSU for this server, but I will try it in the redundant bay and see if I can purchase a redundant one next week.
As for the CPU, I will try it, but as I said previously the server is pretty heavily loaded at the moment, and even at peak times it doesn't reboot or get too hot. The times it has rebooted are times when I'm assuming the server is not doing anything. I don't have any scheduled tasks during the night, so I really don't think much will be going on.
Currently letting it make another backup to a removable drive so I am sure I've got everything!
Don't trust memtest on a server. It is not designed to handle ECC RAM, so although it may pass, the RAM may be throwing lots of ECC errors behind the scenes that only very rarely cause a glitch despite considerable damage.
Cheers Synack. So on the Insight DVD there should be memory diagnostics? I'll try that ASAP.
You need to disable ASR using the HP tools or within the BIOS!
ASR is a nightmare if you don't manage it properly. I have seen cases where a NIC loses its network connection for a while due to excessive traffic, and ASR (HP's Automatic Server Recovery) decides that the best course of action is to perform a reset!
Don't get me wrong, this isn't a fix; it's just to stop the server from randomly restarting whilst you find out why it does it!
My first experience of this was actually caused by a Backup Exec job on another server on the same switch. The BE job would run each night for around 2 hours, during which time the Slam Dunking Server would decide that, because there was too much latency on that NIC, it would do an ASR.
Another was due to a known issue with HP power supplies and APC UPSs. When the APC UPS goes into brownout or blackout mode because of an over- or under-voltage state, it kicks out a nasty sawtooth or square wave rather than a nice smooth sine wave. The HP power supplies complain bitterly about this, and the good old ASR kicks in and reboots the server.
The OS doesn't have a clue what's happening, and all you will find in the Windows logs is that the restart was unscheduled!
Google HP ASR Reboot - there is your homework for the evening....
Cheers for that m25man, I will have a look at that ASAP on the server.
Surely HP could make it so that the server shuts down gracefully, or at least attempts to before forcing a reset. Sounds like a good feature, just implemented in a strange way.
Hopefully this will be the reason.
Looking at some posts, though, they imply that an entry about ASR gets created in the iLO logs? Is this not always the case?
The ASR is kind of like an advanced watchdog timer. If the OS is not responding to it, the OS has almost certainly crashed, hence no need to try a shutdown. Do you have all the HP services installed on it to report stuff correctly back to the ASR system? You are right though, it should log it in the IML.
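To make the watchdog idea concrete, here is a minimal sketch in Python. It is only an illustration of the general technique, not how HP actually implements ASR: the class names, the `kick` method and the 600 second timeout are all made up for the example. The point is that the timer only knows whether it has been "kicked" recently; if the OS-side agent goes quiet for any reason, the timer expires and the hardware resets the box without asking the OS.

```python
import time


class FakeClock:
    """Hypothetical controllable clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class Watchdog:
    """Toy watchdog timer: expires if not kicked within timeout_s seconds."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        # A healthy OS-side agent would call this periodically.
        self.last_kick = self.clock()

    def expired(self):
        # True once the agent has been silent for longer than the timeout.
        return self.clock() - self.last_kick > self.timeout_s


clock = FakeClock()
wd = Watchdog(timeout_s=600, clock=clock)  # 600 s is an illustrative value

clock.advance(300)
wd.kick()                 # agent is alive, timer restarts
clock.advance(601)        # agent goes silent (crash, hang, starved service)
if wd.expired():
    print("watchdog expired: hardware would reset the server")
```

Note that the watchdog cannot tell a crashed OS from a reporting agent that is merely stuck or starved, which is why the HP services need to be installed and working properly.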
The only time I've seen symptoms like that before was from a memory fault. We put some new memory in a server and it ran quite happily for about a month. For some reason after this, the server would shut down and restart itself randomly, usually at night. It turns out the memory had a fault which, according to Crucial, was probably caused by an ESD event resulting from proper ESD precautions not being taken when the memory was installed! This can take any amount of time to start causing problems, from straight away to a few years down the road!
OK so an update...
Using the BIOS utility MEMBIST, the status of the four sticks of RAM is OK. I then ran HP Insight Diagnostics. The initial summary page shows everything as fine, but a custom test on Total Memory runs eight tests, and all of them pass apart from the ECC Test.
In the Error Log, under the ECC Test description, is 'Correctable ECC Events logging limit reached in SEL log Device. Ran on CPU 0'. The recommended repair is 'Please refer IPMI Sensor Event Log for ECC events'. Error code 021279.
I read that this error can occur because the RAM isn't seated properly, so I removed the sticks and reseated them. I re-ran the test and it still failed. I then tried running the test with just the original RAM in, and then with just the newer upgrade RAM; the ECC test failed every time.
I disabled ASR by entering the BIOS, then going to Server Availability followed by ASR Status. I exited the BIOS and booted the server, but it rebooted about six times and I couldn't even log in before it reset, so I've had to enable ASR again.
I tried the PSU in the redundant bay but that didn't make any difference.
So is this all being caused by the RAM? Both the two original HP sticks and the new Crucial / Corsair ones (can't remember which) appear to have ECC problems. I assumed this type of problem could only occur from power shorts, but the server is going through an APC UPS, which apparently regulates voltage?
As for ESD, I don't see how I could have caused it. I always make sure I ground myself and I don't touch the contacts etc. Besides, the HP sticks of RAM had not been touched and they failed the ECC test too.
Still no error lights or any other indication of a problem.
You may need to try and clear the ECC log if there is an option for it. If there are too many ECC errors, the system decides that the RAM is not trustworthy and reboots, sometimes disabling the faulty stick. As to the RAM itself, some RAM is just junk from the day it leaves the factory; you may have been unlucky in the brand or the batch that you purchased.
Do you reckon that if I clear the ECC log the results of the test will be different? I really don't see how I could be so unlucky that both the original RAM and the upgraded RAM from Crucial/Corsair could be faulty.
There is a matching pair of HP RAM, factory installed, which hadn't been moved until yesterday when I reseated them, as well as a matching pair from Crucial/Corsair that I added just before last Christmas.
How would I know which stick is faulty? I have read about ECC problems on other websites, but there is no indication, from what I could see, of which stick is faulty. The error lights on the main board for the RAM are all clear, and there is no indication within Insight of which one is causing the problem.
The ECC log should tell you, if you can access it. Yes, clearing the log could make the results different, as the test could be failing based on historical errors. If you don't look at it beforehand, though, you lose all the debugging goodness of the log.
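If you can get the log entries out before clearing them, tallying the correctable errors per slot is usually enough to see whether one stick stands out. A rough sketch in Python follows. It assumes you have already turned the log into simple records by hand or with some export tool; the `type` and `dimm` field names and the sample entries are made up for illustration and are not the real SEL/IML format or real data from this server.

```python
from collections import Counter


def count_correctable_by_dimm(events):
    """Return (dimm, count) pairs for correctable ECC events, worst first."""
    counts = Counter()
    for event in events:
        # Ignore anything that is not a correctable ECC event.
        if event["type"] == "correctable_ecc":
            counts[event["dimm"]] += 1
    return counts.most_common()


# Hypothetical records, only to show the shape of the input.
sample_events = [
    {"type": "correctable_ecc", "dimm": "DIMM 3"},
    {"type": "correctable_ecc", "dimm": "DIMM 3"},
    {"type": "power_event", "dimm": None},
    {"type": "correctable_ecc", "dimm": "DIMM 1"},
    {"type": "correctable_ecc", "dimm": "DIMM 3"},
]

for dimm, count in count_correctable_by_dimm(sample_events):
    print(f"{dimm}: {count} correctable ECC events")
```

If one slot accounts for nearly all the events, that stick (or its slot) is the first thing to swap. If the errors are spread evenly across all four sticks, that points more towards the board, the CPU or the power side than towards the RAM itself.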
Apparently to clear it:
"To solve the Problem with the full SEL Log we did the following actions:
1.) Remove the CMOS Battery on the Systemboard off the Server
2.) Set the System Maintenance Dipp Switch Number 6 to "On"
3.) Start the Server and let it Run for 3 minutes
4.) Set the System Maintenance Dipp Switch Number 6 to off
5.) Leave the Server without Power for 30 seconds
6.) Put the CMOS Battery back to the Server on the System board.
This Action clears everything, including the SEL Log.
Attention .... the ILO Board is cleared too with this process"
This was the response on the HP forums for clearing the ECC log. Hopefully this is the correct procedure?
Never done it before myself, but it sounds convincing. Had similar issues but got around them by replacing the RAM with some of a better pedigree than we tried initially.
Originally Posted by beany1
If the ECC tests fail after clearing this log, do you think my next step would be to replace all the RAM?
Yank the new RAM, do the reset, then retest. It is probably the new RAM, at which point you can return it under warranty or replace it with something else.