Hardware Thread, Please help with dying server in Technical; There is an outside chance that one of the cores has a problem with a given set of instructions that ...
  1. #31

    tmcd35
    There is an outside chance that one of the cores has a problem with a given set of instructions that neither of the CPU stress testing programs used. And it just so happens that the light from my torch bounced off Mars at the correct angle to make those instructions run through that core at that time and cause the problems.

    I just don't think it's very likely - personally. I'd hate to be wrong, as getting hold of replacement processors now (three-year-old dual-core, Hyper-Threaded Xeons) would probably be difficult and expensive.

  2. #32

    matt40k
    Quote Originally Posted by tmcd35 View Post
    I'm actually (in a perverse way) enjoying trying to hunt down the problem. I'm lucky enough to have a relatively free jobs list at the moment. This has top number 1 high priority. I'm here till it's fixed
    Just need the van with the spare parts in


    Umm... I guess you could remove one CPU. Check the manual, as you'll need to ensure you keep one in the 1st slot. Make sure it's coated in thermal paste. Personally I'd be phoning HP and moaning my arse off 'cause their engineer didn't fix it.

    Might be worth disabling/removing the NIC if you think it could be that. Also update the firmware on the BIOS and RAID controller (careful you don't lose any data!!)

  3. #33

    tmcd35
    Updating all the various firmware was one of the things HP got me to do before they agreed to send an engineer out. It didn't fix the problem on the old board, and although yes, it's likely HP installed a new board with older firmware than that which I used to update the last board, I don't think the mobo was ever the problem, so a firmware upgrade won't solve this.

    Still, if the problem keeps recurring, desperation will inevitably lead me to try this again!

    Quote Originally Posted by matt40k View Post
    Make sure it's coated in thermal paste. Personally I'd be phoning HP and moaning my arse off 'cause their engineer didn't fix it.
    To be fair to HP - they sent a bloke out (after a week of over the phone/internet diagnostics) to replace a working mobo with another working mobo!

    STATUS UPDATE: After switching round PSUs everything appears to be fine on all three servers. But then, of course, none of the three servers have come under any kind of load at all over the past hour.

    So, the question is - what's going to happen tomorrow?

    Either it's a PSU problem that's now fixed (potentially waiting to kill another server), or one (or both) of the other servers are going to reboot under load, or (most likely) this server will reboot itself at 9am tomorrow morning.

    Place your bets now...
    Last edited by tmcd35; 29th June 2009 at 03:48 PM.

  4. #34

    I'm putting my money on it rebooting tomorrow morning, and eventually being traced to a memory fault.

  5. #35

    tmcd35
    If it's a memory fault I've got plenty of sticks to swap it out with!

    It's so, so, so very unlikely to be a memory fault that I'm more than happy to offer up very, very good odds and be pleased to take your money when it's proved to be something else.

  6. #36

    mac_shinobi
    After all the posts going back and forth, I would go with either a hard drive or (one of) the NICs struggling under load when everyone tries to log in.

    What make/model of NICs are they? Also, I'm guessing they are on gigabit?

  7. #37

    tmcd35
    I've been avoiding the NICs for a reason.

    These servers were built and installed by the last guy and I'm not overly happy with some of his set up choices - personal opinion and all.

    There are two onboard NICs and three PCI NICs. I believe (but have yet to check) they are all HP. I know 100% they are all Gigabit. I know each of the three servers is set up the same way. I know each server has 5 IPs and there's some NIC teaming going on.

    I think I'd sooner rule out the HDDs before getting my hands dirty and working out which NIC is which IP and what services rely on which IPs and how exactly the teaming is configured.

  8. #38

    matt40k
    I think it'll recur. It could be one of the following:

    - PSU issue
    - Mobo could have been fitted incorrectly
    - RAM could be incompatible
    - RAID controller issue
    - BIOS issue
    - NIC issue

    I would try disabling all the extra stuff in the BIOS, such as HT etc. Check the UPS for current load, try with no extras and the NIC disconnected.

  9. Thanks to matt40k from:

    tmcd35 (29th June 2009)

  10. #39

    tmcd35
    STATUS UPDATE:

    • Swapped PSUs with other servers - problem server still rebooting
    • Installed brand new RAM - problem server still rebooting
    • Prime95 8 thread stress test - no restarts during test
    • Prime95 tests RAM and all CPU cores - server still reboots, but not during tests
    • Brand new mobo installed - server still reboots
    • Onboard/integrated SCSI RAID controller, replaced along with the mobo - server still reboots
    • Limited info in error logs shows the same nondescript error code despite above tests/changes


    As you can see I really am left with just NICs and HDDs to test. While I'm not going to totally discount any other possibility -
    • new mobo with same fault as last
    • incorrectly fitted cpus
    • problem cpu core
    • cpu overheating
    • incompatible ram
    • firmware/driver issues


    The tests done so far, and the state of the machine when this first started happening, suggest that these are all extremely remote and unlikely. Ho hum, another day of digging...
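
    The elimination above amounts to striking every component that has already been swapped or stress tested off the list of suspects. A minimal Python sketch of that bookkeeping follows; the component names come from the status update, while the variable and function names are purely illustrative and not part of any HP tool.

```python
# Illustrative bookkeeping only: component names are from the status update,
# the data structures and function name are hypothetical.
SUSPECTS = ["PSU", "RAM", "CPU", "motherboard", "RAID controller", "NICs", "HDDs"]

# Component -> what was done to it (the server kept rebooting after each step).
TESTED = {
    "PSU": "swapped with other servers",
    "RAM": "replaced with brand new RAM",
    "CPU": "Prime95 8 thread stress test, no restarts during test",
    "motherboard": "brand new board installed",
    "RAID controller": "onboard, replaced along with the motherboard",
}


def remaining_suspects(suspects, tested):
    """Return the components that have not been swapped or tested yet."""
    return [component for component in suspects if component not in tested]


print(remaining_suspects(SUSPECTS, TESTED))  # ['NICs', 'HDDs']
```

    This only records what has been tried; as noted above, a tested component is unlikely rather than impossible.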

  11. #40

    SYNACK
    Here is a novel idea: HP RAID sets are portable between Smart Array adapters. You could simply shut down the problem server and a good one, then swap all of the hard drives between them. This would rule out the drives and the OS from the list of causes. If possible do it with two machines that are not DCs, as I am not 100% sure how the machine SID change would affect them.

    You need to move all of the disks at once while the servers are off. Then when you boot them they will just read the RAID config off the transposed drives. Be sure to put them in the right order though.

    Edit: Oh and also upgrade all the firmware if you have not already done so.
    Last edited by SYNACK; 30th June 2009 at 08:39 AM.

  12. Thanks to SYNACK from:

    tmcd35 (30th June 2009)

  13. #41

    tmcd35
    If the server stays up long enough for me to work out the NIC config I think I'm going to start by pulling the three additional NICs and re-introducing them one at a time.

    Having slept on it overnight, I think I agree with the consensus here. The next most likely place is one of the NICs. Thinking about it, in all honesty, a randomly dodgy drive is the least likely of causes.

    I like the idea @Synack, but it does mean downing another server - even temporarily - to do it. Also all three servers are DCs. And I'd have to do the firmware updates on all servers first. Don't want a firmware mismatch causing problems if I go down this route.

    It'd be a quicker test than pulling the drives one at a time - but potentially riskier, as doing any firmware updates on a PC rebooting as often as this one is now is not exactly a wise move.
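
    The plan of pulling the add-in NICs and re-introducing them one at a time can be sketched as a simple loop. This is a hedged Python illustration, not a real tool: `find_faulty_card`, the card labels and the `stable_under_load` callback are all hypothetical stand-ins for "fit these cards, wait for the server to come under load, and see whether it reboots".

```python
from typing import Callable, List, Optional


def find_faulty_card(
    cards: List[str],
    stable_under_load: Callable[[List[str]], bool],
) -> Optional[str]:
    """Re-introduce cards one at a time and return the first card whose
    addition makes the server unstable.

    Returns None if no single card is implicated: either the server is
    unstable even with every add-in card removed, or it stays stable with
    all of them fitted.
    """
    installed: List[str] = []
    # Baseline: all add-in cards pulled. If the server still reboots here,
    # the cards are not the cause and there is nothing to narrow down.
    if not stable_under_load(installed):
        return None
    for card in cards:
        installed.append(card)
        if not stable_under_load(installed):
            return card
    return None


# Usage with a stand-in check that pretends the second card is the bad one.
suspect = find_faulty_card(
    ["pci-nic-1", "pci-nic-2", "pci-nic-3"],
    lambda fitted: "pci-nic-2" not in fitted,
)
print(suspect)  # pci-nic-2
```

    In practice each call to the check means waiting for a period of real load, so the loop is slow but only ever changes one thing at a time.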

  14. #42


    Does the RAID controller have a memory module attached to it? If so, was this changed with the motherboard?

  15. Thanks to K.C.Leblanc from:

    tmcd35 (30th June 2009)

  16. #43

    tmcd35
    Ooo, very good question. TBH I don't rightly know.

    There is a 256MB SO-DIMM on the motherboard. When I first saw it I thought it may have something to do with onboard graphics (although why a server may need 256MB of dedicated graphics RAM is beyond me). Thinking about it, it's more likely this is the RAID RAM.

    I would have thought it's the same RAM from the previous mobo. I'm pretty sure HP only replaced the actual mobo itself.

    I'm currently sitting here waiting for the server to reboot (or not). Next period starts in about half an hour. I've taken out all three additional NICs - which on investigation appear to be totally redundant.

    Since taking the NICs out I've not had a reboot - but then the server has hardly been under any load. So I'm sitting here playing the waiting game...

  17. #44

    bossman
    @tmcd35:

    Could this be of help? http://www.bishopbarrington.net/other/helpfultool.exe

    Let me know, as I think you can run stress tests on various hardware elements with it. It may help you in your search.

  18. #45

    tmcd35
    I would thank you Bossman but thankfully our LEA's virus checker stopped your evil plan to infect my already poorly server with a bad case of swine flu ...