  1. #1

    tmcd35's Avatar
    Join Date
    Jul 2005
    Location
    Norfolk
    Posts
    5,713
    Thank Post
    858
    Thanked 904 Times in 749 Posts
    Blog Entries
    9
    Rep Power
    330

    Please help with dying server

    Hi guys,

    I'm in desperate need of some ideas on how to fix a dying server. Story so far...

    We have 3 x HP ProLiant DL380 G4 rackmount servers.

    One of the servers is randomly restarting every 20 minutes or so with an unknown hardware fault. We've caught it once on a blue screen saying just that, "Hardware fault" (no other info), before ASR kicked in and rebooted the server.

    When it started happening I used an Ultimate Boot CD to run a CPU stress test and MemTest86+. Both came back fine, so I started looking at the hard drives. The controller reports everything A-OK and no problems with any of the drives.

    The server is under warranty, so I've been in touch with HP, and after a week of getting nowhere they agreed to replace the entire motherboard (the RAID controller is integrated).

    A week later and it's started doing it again, rebooting every 20 minutes. The HP error logs show the event: an unknown hardware fault caused ASR to reboot the server unexpectedly.

    I have brand new RAM which I'm putting in today, as the servers were due for a RAM upgrade anyway.

    So I can be pretty sure the problem is not RAM (if it happens again) or motherboard/RAID controller related.

    Two questions:

    1. Does anyone know of a really good CPU stress test utility that tests the CPU only (no activity on RAM/HDD)? Preferably one that can test 4 cores to destruction?
    2. How do I work out which hard drive of 5 in an array is faulty when none of the usual diagnostic tools or warning lights show that any fault has even occurred?


    And maybe (3): is there anywhere else I can try looking that I haven't considered yet?

    Thanks in advance

    Terry.
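
On question 1, a crude CPU-only consistency check can be sketched in Python. This is a minimal sketch and not a substitute for a dedicated tool: the names `burn`, `worker` and `stress` are made up for illustration, and the reference value is computed on the same CPU being tested, so it only catches intermittent miscalculation. The working set is a 32-byte digest, so the disks are not touched and RAM traffic should stay minimal.

```python
import hashlib
import multiprocessing as mp
import time


def burn(seed: int, rounds: int) -> str:
    """Chain SHA-256 over a tiny buffer: pure CPU work, no disk access."""
    digest = seed.to_bytes(8, "big")
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def worker(seed: int, rounds: int, seconds: float) -> int:
    """Repeat the same computation until the deadline and count mismatches."""
    expected = burn(seed, rounds)  # reference computed on the CPU under test
    mismatches = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        if burn(seed, rounds) != expected:
            mismatches += 1
    return mismatches


def stress(workers: int, rounds: int = 200_000, seconds: float = 60.0) -> int:
    """Run one worker process per requested core and total the mismatches."""
    jobs = [(seed, rounds, seconds) for seed in range(workers)]
    with mp.Pool(workers) as pool:
        return sum(pool.starmap(worker, jobs))
```

Calling `stress(workers=mp.cpu_count())` from under an `if __name__ == "__main__":` guard (needed on Windows) should load every logical core. A non-zero result, or a reboot mid-run, would point towards the CPU or its power delivery.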

  2. #2
    MGSTech's Avatar
    Join Date
    Jul 2007
    Posts
    362
    Thank Post
    13
    Thanked 95 Times in 54 Posts
    Rep Power
    39
    Random reboots have been attributed to power supplies in the past. Do you have redundant PSUs in the server? If so, try running on one and see if it still does it; if not, try again with the other one out.

    As long as you have one connected you can disconnect the other one (you'll get an amber warning light, but that's about all).

    Steve

  3. Thanks to MGSTech from:

    tmcd35 (29th June 2009)

  4. #3
    ricki's Avatar
    Join Date
    Jul 2005
    Location
    uk
    Posts
    1,475
    Thank Post
    20
    Thanked 164 Times in 157 Posts
    Rep Power
    52
    Hi

    Has the server got a UPS on it? I have had problems with old UPSs, and also with ones that are overheating.

    Richard

  5. #4
    ricki's Avatar

  6. Thanks to ricki from:

    tmcd35 (29th June 2009)

  7. #5

    tmcd35's Avatar
    Update:

    It's not the mobo or raid controller as they were replaced by HP

    and

    It's not the RAM as I've just put in 4 new sticks and got the same result.

    Off to do a Prime95 CPU stress test. If it's not that, then it's either a randomly faulty NIC or one of the HDDs.

    Edit: also, it's not overheating. I've got half the school fighting me for access to the server room, it's so cold in there (and so very hot everywhere else!). I find it hard to blame the UPS, as 3 other servers are connected to that beast. Surely if that were causing the problem, the other servers would have been affected?
    Last edited by tmcd35; 29th June 2009 at 11:27 AM.

  8. #6

    Join Date
    Jun 2009
    Location
    Brighton
    Posts
    6
    Thank Post
    0
    Thanked 1 Time in 1 Post
    Rep Power
    0
    Try to find a copy of virtualPC online. It is a full hardware test program: boot from the CD and run a full system test. A colleague of mine uses it to test refurbished PCs, and it often finds faults that don't show up otherwise.

  9. Thanks to AdamGent from:

    tmcd35 (29th June 2009)

  10. #7

    plexer's Avatar
    Join Date
    Dec 2005
    Location
    Norfolk
    Posts
    13,232
    Thank Post
    667
    Thanked 1,638 Times in 1,463 Posts
    Rep Power
    423
    Have you ruled out the server's own PSUs though, as previously mentioned?

    Ben

  11. #8

    tmcd35's Avatar
    Actually, no, I haven't! I'll get on to those tests straight away. Didn't think of the PSU.

    I've run Prime95 for around 10 min with an 8-thread CPU-only stress test, followed by 20+ min with an 8-thread CPU and RAM stress test - nearly 2GB of RAM was committed. The server stayed solid through both tests.

    I'm more convinced the problem is not mobo/raid/ram/cpu related.

    I'll try the PSU check next.

    I still find it suspicious that it mostly does it at the start of lessons - when everyone is logging on.

    I need a good NIC/HDD stress tester.

    I'll try virtualPC after the PSU tests.

    Cheers guys. Please keep the ideas coming...
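
The suspicion that the reboots cluster around lesson starts could be checked against the logged reboot times. The following is a minimal sketch under stated assumptions: `near_logon` is a made-up helper, the 10-minute window is an arbitrary choice, and the timestamps would have to be copied by hand from the HP error log and the school timetable.

```python
from datetime import datetime, timedelta


def near_logon(reboots, logons, window_minutes=10):
    """Return the reboots that happened within window_minutes after a logon time."""
    window = timedelta(minutes=window_minutes)
    return [
        reboot
        for reboot in reboots
        if any(timedelta(0) <= reboot - logon <= window for logon in logons)
    ]
```

If most logged reboots land inside the window, that supports a load-related cause (disk, NIC or power draw during mass logon) over a purely random fault.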

  12. #9

    bossman's Avatar
    Join Date
    Nov 2005
    Location
    England
    Posts
    3,942
    Thank Post
    1,199
    Thanked 1,069 Times in 760 Posts
    Rep Power
    330
    @tmcd35:

    What temperature are the processors running at, and do they have thermal cutout settings in the BIOS?
    As you say, the problem seems to occur when the server is fully loaded at log-in or log-off time. If the processors are heating up quickly without proper airflow around them, that could be the cause.

  13. #10

    tmcd35's Avatar
    @bossman, thanks for the suggestion. I really don't think the problem is heat/processor related, mainly because if it were, it should have rebooted during the Prime95 tests. The load was at 100% across all 8 cores/threads (1xdual core with HT) for around half an hour in total - no issues.

    Of course, neither the HDDs nor the NICs were being accessed during the Prime95 tests.
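
Since Prime95 leaves the disks idle, one rough way to load the array is to write a large scratch file and read it back. This is a minimal sketch, assuming Python is available on the server; `write_and_verify` is a made-up name. Behind a hardware RAID controller it exercises the array as a whole and cannot say which physical drive is at fault, and the read-back may be served from the OS cache unless the file is larger than RAM.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory: str, blocks: int = 256, block_size: int = 1 << 20) -> bool:
    """Write random blocks to a scratch file, read them back, and compare hashes."""
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                chunk = os.urandom(block_size)
                written.update(chunk)
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(block_size):
                read_back.update(chunk)
        return written.digest() == read_back.digest()
    finally:
        os.remove(path)  # always clean up the scratch file
```

Running it in a loop against a folder on the suspect array and watching whether the server falls over would show whether disk load alone can trigger the fault.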

    My gut instinct (as it has all along) still says a faulty HDD - I just can't work out how to determine which of the 5 buggers is causing the problem.

    I also have 4 NICs in total, so it could be one of the three add-in cards.

    @AdamGent - looked up Virtual PC Check/PC Check - $300? Good suggestion, but not something I can get any time soon. Anyone know of any good FOSS that'll do the same or a similar job?

    @plexer/@MGSTech - Currently running on one PSU - so far so good. Next mass logon is around 1:10pm. I'll switch PSUs at around 1:40pm, ready for the last logon of the day (2:10ish). We'll see if the problem recurs during either period. If it does I'll look at recreating the problem tomorrow.

  14. #11
    DMcCoy's Avatar
    Join Date
    Oct 2005
    Location
    Isle of Wight
    Posts
    3,464
    Thank Post
    10
    Thanked 496 Times in 436 Posts
    Rep Power
    113
    What happens if you turn ASR off? I've always found it to be more trouble than any potential use it could be.

  15. #12

    tmcd35's Avatar
    If I turn ASR off then the server just freezes up without rebooting, and I end up having to reboot it manually. Only once have I seen a blue screen for an error, which I think may have been thrown up by ASR before it restarted.

    The Windows event logs are not showing anything obvious at around the same time as the reboots.

  16. #13

    LeMarchand's Avatar
    Join Date
    Jan 2008
    Location
    The deepest pits of hell
    Posts
    2,197
    Thank Post
    303
    Thanked 339 Times in 241 Posts
    Rep Power
    143
    I had a similar problem, and it did turn out to be one of the HDDs. I used an HDD Regenerator program, which took hours, but the server seems OK now. (Though I'd be happier if SMT would pony up for new disks, or preferably a whole new server.)

  17. #14

    tmcd35's Avatar
    @LeMarchand - how did you work out which HDD was at fault? I *could* spend £1000 on 5 new HDDs and swap them out one at a time, letting the array rebuild to the hot spare each time before pulling the next. But that could take days, and I'm worried about what happens if the faulty drive gives out during the rebuild process. I'd rather find a way of determining which drive is at fault and just replace that.

  18. #15
    swgeek's Avatar
    Join Date
    Jun 2009
    Location
    cornwall
    Posts
    6
    Thank Post
    0
    Thanked 1 Time in 1 Post
    Rep Power
    0
    Afternoon all..
    If it is RAID anyway, can you not just remove the drives one by one? If the server stops restarting, you will know which drive it is. If it is a drive, that is.
    Cheers Mark

  19. Thanks to swgeek from:

    tmcd35 (29th June 2009)
