Hardware Thread: Please help with dying server (in Technical)
29th June 2009, 04:34 PM #31
There is an outside chance that one of the cores has a problem with a given set of instructions that neither of the CPU stress testing programs used. And it just so happens that the light from my torch bounced off Mars at the correct angle to make those instructions run through that core at that time and cause the problems.
I just don't think it's very likely - personally. I'd hate to be wrong, as getting hold of replacement processors now (three-year-old dual-core, Hyper-Threaded Xeons) would probably be difficult and expensive.
29th June 2009, 04:34 PM #32
Just need the van with the spare parts in
Originally Posted by tmcd35
Umm... I guess you could remove one CPU; check the manual, as you'll need to ensure you keep one in the 1st slot. Make sure it's coated in thermal paste. Personally I'd be phoning HP and moaning my arse off because their engineer didn't fix it.
Might be worth disabling/removing the NIC if you think it could be that, and also updating the firmware on the BIOS and RAID controller (careful you don't lose any data!!)
29th June 2009, 04:45 PM #33
Updating all the various firmware was one of the things HP got me to do before they agreed to send an engineer out. It didn't fix the problem on the old board, and although, yes, it's likely HP installed a new board with older firmware than that which I used to update the last board, I don't think the mobo was ever the problem, so a firmware upgrade won't solve this.
Still if the problem keeps reoccurring desperation will inevitably lead me to try this again!
To be fair to HP - they sent a bloke out (after a week of over the phone/internet diagnostics) to replace a working Mobo with another working mobo!
Originally Posted by matt40k
STATUS UPDATE: After switching round PSUs everything appears to be fine on all three servers. But then, of course, none of the three servers have come under any kind of load at all over the past hour.
So, the question is - what's going to happen tomorrow?
Either it's a PSU problem that's now fixed (potentially waiting to kill another server), or one (or both) of the other servers is going to reboot under load, or (most likely) this server will reboot itself at 9am tomorrow morning.
Place your bets now...
Last edited by tmcd35; 29th June 2009 at 04:48 PM.
29th June 2009, 04:47 PM #34
I'm putting my money on it rebooting tomorrow morning, and eventually being traced to a memory fault.
29th June 2009, 04:49 PM #35
If it's a memory fault I've got plenty of sticks to swap it out with!
It's so, so, so very unlikely to be a memory fault that I'm more than happy to offer up very, very good odds and be pleased to take your money when it's proven to be something else.
29th June 2009, 04:54 PM #36
After all the posts going back and forth, I would go with either a hard drive or (one of) the NICs being under load when everyone tries to log in.
What make/model of NICs are they? Also, am guessing they are on gigabit?
29th June 2009, 04:59 PM #37
I've been avoiding the NIC's for a reason
These servers were built and installed by the last guy and I'm not overly happy with some of his set-up choices - personal opinion and all.
There are two onboard NICs and three PCI NICs. I believe (but have yet to check) they are all HP. I know 100% they are all Gigabit. I know each of the three servers is set up the same way. I know each server has 5 IPs and there's some NIC teaming going on.
I think I'd sooner rule out the HDDs before getting my hands dirty and working out which NIC is which IP and what services rely on which IPs and how exactly the teaming is configured.
29th June 2009, 05:01 PM #38
I think it'll recur. It could be one of the following:
- PSU issue
- Mobo could have been fitted incorrectly
- RAM could be incompatible
- RAID controller issue
- BIOS issue
- NIC issue
I would try disabling all the extra stuff in the BIOS, such as HT etc. Check the UPS for current load, try with no extras and the NIC disconnected.
30th June 2009, 09:29 AM #39
- Swapped PSU's with other servers - problem server still rebooting
- installed brand new RAM - problem server still rebooting
- Prime95 8 thread stress test - no restarts during test
- Prime95 tests RAM and all CPU cores - server still reboots, but not during tests
- Brand new Mobo installed - server still reboots
- Onboard/integrated SCSI RAID controller, replaced along with the mobo - server still reboots
- Limited info in error logs show same non-descript error code despite above tests/changes
As you can see I really am left with just NIC's and HDD's to test. While I'm not going to totally discount any other possibility -
- new mobo with same fault as last
- incorrectly fitted cpus
- problem cpu core
- cpu overheating
- incompatible ram
- firmware/driver issues
The tests done so far, and the state of the machine when this first started happening, suggest that these are all extremely remote and unlikely. Ho hum, another day of digging...
30th June 2009, 09:35 AM #40
Here is a novel idea: HP RAID sets are portable between Smart Array adapters. You could simply shut down this server and a good one, then swap all of the hard drives between them. This would rule out the drives and the OS from the list of causes. If possible, do it with two machines that are not DCs, as I am not 100% sure how the machine SID change would affect them.
You need to move all of the disks at once while the servers are off; then, when you boot them, they will just read the RAID config off the transposed drives. Be sure to put them in the right order though.
Edit: Oh and also upgrade all the firmware if you have not already done so.
Last edited by SYNACK; 30th June 2009 at 09:39 AM.
30th June 2009, 09:53 AM #41
If the server stays up long enough for me to work out the NIC config I think I'm going to start by pulling the three additional NICs and re-introducing them one at a time.
Sleeping on it overnight I think I agree with the consensus here. The next most likely place is one of the NICs. Thinking about it in all honesty a randomly dodgy drive is the least likely of causes.
I like the idea @Synack, but it does mean downing another server - even temporarily - to do it. Also, all three servers are DCs. And I'd have to do the firmware updates on all servers first. Don't want a firmware mismatch causing problems if I go down this route.
It'd be a quicker test than pulling the drives one at a time - but potentially riskier, as doing any firmware updates on a machine that is rebooting as often as this one is not exactly a wise move.
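For anyone wanting to keep track of the "pull them all, then re-introduce one at a time" approach, here is a rough sketch of the logic in Python. It is only an illustration: the card names and the `reboots_under_load` callback are made up, and in practice that "callback" is a person watching the server through a busy period, not a function call.

```python
from typing import Callable, Iterable, List, Optional


def find_faulty_card(
    cards: Iterable[str],
    reboots_under_load: Callable[[List[str]], bool],
) -> Optional[str]:
    """Re-introduce add-in cards one at a time.

    `reboots_under_load` is a hypothetical observation step: given the
    list of cards currently fitted, it reports whether the server
    rebooted while under load. Returns the first card whose addition
    brings the reboots back, or None if no card does.
    """
    installed: List[str] = []

    # Baseline: with every add-in card removed the server must be stable,
    # otherwise the cards cannot be the (only) cause.
    if reboots_under_load(list(installed)):
        raise RuntimeError("server reboots with no add-in cards fitted")

    for card in cards:
        installed.append(card)
        if reboots_under_load(list(installed)):
            return card  # reboots returned right after fitting this card

    return None
```

The baseline check matters most: if the server still reboots with all the extra cards out, the cards are off the hook and the search moves on to the next suspect.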
30th June 2009, 10:44 AM #42
Does the RAID controller have a memory module attached to it, if so was this changed with the motherboard?
30th June 2009, 10:50 AM #43
Ooo, very good question. TBH I don't rightly know
There is a 256mb SoDIMM on the motherboard. When I first saw it I thought it may have something to do with onboard graphics (although why a server may need 256mb dedicated graphics ram is beyond me). Thinking about it, it's more likely this is the RAID RAM.
I would have thought it's the same RAM from the previous mobo. I'm pretty sure HP only replaced the actual mobo itself.
I'm currently sitting here waiting for the server to reboot (or not). Next period starts in about half hour. I've taken out all three additional NIC's - which on investigation appear to be totally redundant.
Since taking the NICs out I've not had a reboot - but then the server has hardly been under any load. So I'm sitting here playing the waiting game...
30th June 2009, 11:37 AM #44
Could this be of help http://www.bishopbarrington.net/other/helpfultool.exe
Let me know. I think you can run stress tests on various hardware elements with it, so it may help you in your search.
30th June 2009, 11:48 AM #45
I would thank you Bossman but thankfully our LEA's virus checker stopped your evil plan to infect my already poorly server with a bad case of swine flu ...