  1. #1

    m25man's Avatar
    Join Date
    Oct 2005
    Location
    Romford, Essex
    Posts
    1,631
    Thank Post
    49
    Thanked 462 Times in 337 Posts
    Rep Power
    140

    H1N1 for Servers - A 2003 Nightmare!

    This is a nightmare that I want you all to share...

    One of the schools that we assist had 4 new DL380's last summer.
    They have been running sweetly for about 9 months.
    One, a dual quad-core SIMS server, had been sitting virtually idle for the last 9 months whilst the DM played with the various upgrades etc. ready for a full roll-out at Easter.

    They installed BackupExec on it and restarted; it hung on restart and displayed "Attempt" in the top left corner of the screen.
    I won't bore you with the technicals. HP spent hours with us, remoted in from India using iLO2, and determined that there was nothing wrong with the hardware.
    The word "Attempt" is actually just the hardware hung at "Attempting to Boot from ...."
    The OS can't recognise the boot sector correctly as this has been trashed.
    So we decided that, as the database was safe on the second RAID volume, we would just reinstall the OS and refresh the SQL install. Good practice for the DM anyway.

    All good: the server was recovered and they continued testing SIMS and SQL on it.

    Last Weds the school called again, this time another DL380 is showing the same error! This time it's a DC.
    2 New HP DL380's turned to toast in 3 weeks!
    Wait...
    When we arrive on site (within 30 mins I might add) we are now told another server is stuck on a boot screen with just a flashing cursor!
    Another DC has gone down...
    But wait.... this is an Intel 2400 SATA Server!

    So suddenly the Hardware suspicions are no longer valid.

    So the DR plan kicks into play.... "What's a DR plan????" says the NM.

    Where are the backups? On a single 800GB DLT which takes about 9 hrs to run!
    This is connected to the old Intel 2400 that's now dead!
    Fortunately I have a spare HP PCI-X SCSI card in the car, so we can connect the tape unit directly to the target DC. Great!

    An authoritative DS restore looks like the only way out of this mess, so we spend about 2 hours cataloguing the tape (as the Veritas DB and catalogs are on the other dead server).
    Lo and behold, there are no files to restore in the system state! Inconsistent data error.

    S**T, now we are in trouble....

    Whilst battling with the tapes and recovery we are building the OSes again on the other failed server. "Hello," we said, "where have all the desktop shortcuts gone on the other HP?" The one we were hoping to transfer the remaining DC roles from...
    The services are all running but now the 3rd DC is about to die!
    WTF!!!
    In 48 Hrs all 8 Servers have failed, all with identical symptoms!

    The first thing you notice is the shortcuts; when these go you have just lost all of the file association section of HKEY_CLASSES_ROOT.
    To accompany this, everything under HKLM\Hardware disappears, along with a huge chunk of HKLM\System\CCS\System.

    On the next boot your server, and if you are unlucky your AD, is history.
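
    One way to spot this kind of damage before the next reboot is to compare a list of registry key paths captured while the server was healthy against a current list. The sketch below is only an illustration in Python, assuming both lists already exist as plain text paths (for example saved from `reg query` output); the function names and the sample keys are hypothetical, not taken from the affected servers.

```python
def missing_keys(baseline, current):
    """Return baseline key paths that are absent from the current snapshot.

    Registry key names are case-insensitive, so compare case-folded.
    """
    current_folded = {key.lower() for key in current}
    return sorted(key for key in baseline if key.lower() not in current_folded)


def group_by_hive(paths):
    """Group key paths by their leading hive name (the part before the first backslash)."""
    groups = {}
    for path in paths:
        hive = path.split("\\", 1)[0]
        groups.setdefault(hive, []).append(path)
    return groups


# Hypothetical snapshots: a healthy baseline and a damaged "current" state.
baseline = [
    r"HKCR\.lnk",
    r"HKCR\lnkfile",
    r"HKLM\HARDWARE\DESCRIPTION\System",
    r"HKLM\SYSTEM\CurrentControlSet\Services",
]
current = [r"HKLM\SYSTEM\CurrentControlSet\Services"]

lost = missing_keys(baseline, current)
lost_by_hive = group_by_hive(lost)
for hive, keys in lost_by_hive.items():
    print(hive, len(keys), "key(s) missing")
```

    Run on a schedule, a check like this would flag the missing file associations while the services are still up, rather than at the next boot.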

    I am writing this because so far neither HP, Microsoft nor Sophos has a clue what's going on.

    The site has lost everything apart from its data; all of that remains untouched.
    This is virus-like activity that in 20 years of networking I have never seen.

    The entire domain has been rebuilt from the ground up some servers time!

    We made it through the weekend, and things were looking good until about 11:00 am today when the newly built DC died with exactly the same symptoms.
    We yanked the plug on the core switch and isolated all of the servers from each other.

    So where do we go from here? I have lost the plot entirely.

    This is like an episode of "HOUSE".
    Sophos has not detected anything disastrous, just some lightweight email viruses in the users' home folders, that's all.
    So what is the common denominator.....

    The OS Server 2003 R2 SP2
    Sophos

    These are the ONLY two components common to all 8 failures
    What on earth has the ability to blow identical holes in 8 system registries in the same places on 8 different servers?
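
    The elimination above is just a set intersection over what each failed box was running. As a rough sketch in Python: the host names and the extra components listed per server below are made up for illustration, and only the two common items come from this post.

```python
def common_components(inventories):
    """Return the components present on every server in the inventory."""
    sets = [set(components) for components in inventories.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical inventory: names and per-server extras are assumptions.
inventories = {
    "hp-sims": ["Windows Server 2003 R2 SP2", "Sophos", "SQL Server", "BackupExec"],
    "hp-dc1": ["Windows Server 2003 R2 SP2", "Sophos", "Active Directory"],
    "intel-dc2": ["Windows Server 2003 R2 SP2", "Sophos", "Active Directory", "BackupExec"],
}

common = common_components(inventories)
print(sorted(common))
```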

    I have no concrete evidence yet, but after such a traumatic experience the school techies have been scanning workstations with the SAV standalone Linux tools and have cleaned up a whole rake of viruses, including some Conficker strains.

    Sophos has the following disinfection options:
    "Do Nothing"
    "Delete"
    "Move to quarantine"
    "Move to a central location"

    Well, it appears that the correct option for your servers is to "Do Nothing".

    When pushed as to why we should not use the "Delete" option, I have been told that "I don't want to do that".
    "It's not recommended."
    Why, I ask?

    Because I believe that a false positive might result in the AV application trying to delete critical system files or registry keys, that's why!
    I can't think of any other possible reason. This is H1N1 for servers!

    I have had to make changes to the AV policies so that a system area scan is a light tickle and the user areas are aggressively purged.

    I know that Sophos watch this board so I am being very careful what I say, but I am now most concerned that if you are a Sophos user and you have the "wrong" but "logically obvious settings" you may be in danger of following in my footsteps.

    I will update this thread as we make headway (or not) as the case may be.
    But any input, however bizarre, regarding the identical registry damage across 8 servers over 2 builds I would be keen to hear.

    In the meantime I think a review of your AV aggressiveness on your system areas such as \Windows\NTDS and \Windows\system32\config is in order.
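
    To make the "light tickle on system areas, aggressive on user areas" idea concrete, here is a small Python sketch. It is not how any AV product is configured; the C: drive letter, the action labels and the function names are all assumptions for illustration.

```python
from pathlib import PureWindowsPath

# System areas named in this post; the C: drive letter is an assumption.
SENSITIVE_DIRS = [
    PureWindowsPath(r"C:\Windows\NTDS"),
    PureWindowsPath(r"C:\Windows\System32\config"),
]


def in_sensitive_area(path):
    """True if the path is one of the sensitive directories or lies beneath one.

    PureWindowsPath comparisons are case-insensitive, like Windows itself.
    """
    p = PureWindowsPath(path)
    return any(p == d or d in p.parents for d in SENSITIVE_DIRS)


def recommended_action(path):
    """Hypothetical policy: only report on system areas, quarantine elsewhere."""
    return "report only" if in_sensitive_area(path) else "quarantine"


print(recommended_action(r"c:\windows\system32\config\SYSTEM"))
print(recommended_action(r"D:\Home\pupil\keygen.exe"))
```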

    I have seen similar reports of Server Boot issues after updates recently but nothing on this scale.

    I have not been beaten by M$ in 20 years, but now I am worried.....

  2. Thanks to m25man from:

    Netman (29th April 2009)

  3. #2
    DMcCoy's Avatar
    Join Date
    Oct 2005
    Location
    Isle of Wight
    Posts
    3,462
    Thank Post
    10
    Thanked 496 Times in 436 Posts
    Rep Power
    113
    I guess different server types rules out faulty RAID controllers in the HPs (same age etc). What's the humidity like? Not too dry I hope (static). It could be a rootkit trying to write to/damage the BIOS but I don't think that's very likely. Are the servers being built then patched offline from the main network? Something else to try is to use a different administrator password on the servers, in case something's replicating around by simply running with enough privileges to install on everything else.

    Everything on a UPS and power spike free too? Any applications common to all servers?

    Sounds like a tough one!

    Edit: I see Sophos is a common factor; any chance of trialling another AV for a few days to rule it out?
    Last edited by DMcCoy; 27th April 2009 at 10:06 PM.

  4. #3

    m25man's Avatar
    Join Date
    Oct 2005
    Location
    Romford, Essex
    Posts
    1,631
    Thank Post
    49
    Thanked 462 Times in 337 Posts
    Rep Power
    140
    Quote Originally Posted by DMcCoy View Post
    I guess different server types rules out faulty raid controllers?
    Certainly does.

    Quote Originally Posted by DMcCoy View Post
    It could be a rootkit trying to write to/damage the BIOS but I don't think that's very likely.
    Something is triggering a reaction that removes registry keys with the precision of a brain surgeon.

    Quote Originally Posted by DMcCoy View Post
    Something else to try is to use a different administrator password on the servers, in case something's replicating around by simply running with enough privileges to install on everything else.
    This has been changed on the new domain build; previously we found 10 members of the SLT were in the Server Operators group (WTF).

    We also quarantined 146GB of illegally downloaded copyright material: DVD rips, music, warez, key generators, all sorts!
    Oh, and I forgot to say we also found another 39GB of stuff in the kids' home files!!!!

    Yes, the 146GB was staff, mostly the SMT!!!

    Obviously the NM was too busy watching these videos instead of managing the network.

    The place is the pits..... if there is one place on the planet that I would be happy to see BSF it's here!

  5. #4

    SYNACK's Avatar
    Join Date
    Oct 2007
    Posts
    11,223
    Thank Post
    874
    Thanked 2,717 Times in 2,302 Posts
    Blog Entries
    11
    Rep Power
    780
    It does sound like a rootkit or virus issue. My thoughts would be to try and get hold of the data in a raw format off the drives. Do you have any other machines with similar RAID controllers that you could swap the affected disks over to in order to access them? The HP RAID volumes are portable if you move all of the disks (removing the good ones for safe keeping of course). If you could get access to the disks you could have a look at the raw data and possibly the event logs, depending on how damaged the partitions are.

    Also, if you got an image of an offending system, a computer forensics/security place may be able to shed some light. The AV company may also be interested in getting a sample if it is a new and as yet undetectable virus. Given the fact that it does appear to be spreading, you may need to get these companies involved to turn the tide without a complete system-wide wipe of the network and documents (impractical).

    I'm assuming that the iLO did not shed any light, as these are like a black box for servers, including halt event logging. The fact that it is halting so early in the proceedings could indicate that it is the low-level chunk of the drive that it is having difficulty with, which would indicate the boot sector. This should be protected by the AV software but a determined rootkit will easily bypass that. There may be master sector protection options in the BIOS but I am uncertain. Also, do the systems have the latest BIOS and RAID controller firmware? This may help the system cope a little better and give you a more sensible error.

    It may be that the virus/rootkit is badly written and designed to affect much simpler computers, and cannot carry out its intended behaviour on the more complicated system.

    Another way to look for the culprit and stabilise things in the meantime could be to use virtual servers configured the same way as the originals and then see if they get affected too. If so, these may be easier to disassemble to find both the culprit and the infection vector.

  6. #5
    Azhibberd's Avatar
    Join Date
    May 2008
    Location
    Newbury,Berkshire
    Posts
    169
    Thank Post
    20
    Thanked 21 Times in 20 Posts
    Rep Power
    16
    Wow... rough time there, so it would seem. Also, it might be worth turning off the Remote Registry service maybe?

  7. #6

    m25man's Avatar
    Join Date
    Oct 2005
    Location
    Romford, Essex
    Posts
    1,631
    Thank Post
    49
    Thanked 462 Times in 337 Posts
    Rep Power
    140
    I will replace one of the HP's arrays with a 2k8 Hyper-V build and get the 2003 DCs migrated across, if only to eliminate the rebuilding times.
    We have run full deep rootkit scans under the guidance of Sophos to no avail.
    The servers are scheduled to lose their registries between 10:30 and 11:00 tomorrow.
    Time is the problem here. Exam deadlines and EOY finance needed.
    We have not even thought about getting the Exchange system online yet!
    The EDBs are fine, as I have these already mounted on our recovery server and have exported all the mailboxes into 248 PST files.

  8. #7

    SYNACK's Avatar
    Join Date
    Oct 2007
    Posts
    11,223
    Thank Post
    874
    Thanked 2,717 Times in 2,302 Posts
    Blog Entries
    11
    Rep Power
    780
    If you know roughly when the issues are going to happen you could possibly use FileMon on the box during that time to see the activity, so long as it keeps running (FileMon for Windows). There may be some way to redirect the output to a remote server so that you could pick up the PID of the offender and the activity leading up to it.
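
    Once a capture exists, picking out the offender is mostly a matter of counting registry deletes per process. A rough Python sketch follows, assuming the capture has been saved as CSV with "Process Name", "PID" and "Operation" columns and operation names such as RegDeleteKey (as in a Process Monitor style export). The column names, process names and sample rows here are assumptions for illustration, not real output from the affected servers.

```python
import csv
import io
from collections import Counter

# Operation names assumed to mark registry deletions in the capture.
DELETE_OPS = {"RegDeleteKey", "RegDeleteValue"}


def registry_deletes_by_process(csv_text):
    """Count registry delete operations per (process name, PID)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Operation"] in DELETE_OPS:
            counts[(row["Process Name"], row["PID"])] += 1
    return counts


# Hypothetical capture: every value below is invented for the example.
sample = "\n".join([
    '"Time of Day","Process Name","PID","Operation","Path","Result"',
    r'"10:41:02","scanner.exe","1234","RegDeleteKey","HKCR\.lnk","SUCCESS"',
    r'"10:41:02","scanner.exe","1234","RegDeleteKey","HKCR\lnkfile","SUCCESS"',
    r'"10:41:03","explorer.exe","456","RegOpenKey","HKCR\.lnk","NAME NOT FOUND"',
])

counts = registry_deletes_by_process(sample)
for (name, pid), n in counts.most_common():
    print(name, pid, n)
```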

  9. Thanks to SYNACK from:

    stariq (28th April 2009)

  10. #8
    zag
    zag's Avatar
    Join Date
    Mar 2007
    Posts
    3,808
    Thank Post
    906
    Thanked 420 Times in 353 Posts
    Blog Entries
    12
    Rep Power
    87
    Might not fix your current problems but there are a couple of things that could be done differently:

    - Never buy the same server hardware. I build my networks over time, getting each server stable before fully rolling it out.
    - Never install BackupExec on a server that performs a vital role. I always have a workstation with BackupExec installed in a different location to the main server room.
    - I don't use RAID on any server other than the file server. This may just be my experience, but RAID has caused more problems than it has fixed, especially from power cuts and corruption.
    - Backing up to tape is unreliable and slow. A 1TB hard disk costs 50 quid these days and is far more reliable for recovery. I keep 1 hard disk always online, and another to take home.

    But yeah, it sounds like a virus, hardware issue or deliberate damage to me.
    Last edited by zag; 28th April 2009 at 09:04 AM.

  11. #9
    Richie1972's Avatar
    Join Date
    Apr 2006
    Location
    Blackburn
    Posts
    239
    Thank Post
    2
    Thanked 6 Times in 6 Posts
    Rep Power
    19
    RAID (proper RAID, not the pseudo-RAID of cheaper motherboards) should be essential on all your major servers.
    Split roles and ensure plenty of redundancy (DFS, load balancing, etc).
    Try and get virtualisation up and running - much faster to get a backup... back up again.
    Get rid of tape backups and have disk backups in a remote building (unless you only have the one building).
    Have a server (or servers) dedicated just to backups.

  12. #10

    localzuk's Avatar
    Join Date
    Dec 2006
    Location
    Minehead
    Posts
    17,807
    Thank Post
    517
    Thanked 2,469 Times in 1,913 Posts
    Blog Entries
    24
    Rep Power
    835
    With all the junk the techies have found, could it be that someone knows an administrator password and is installing rootkits or similar?

  13. #11

    mattx's Avatar
    Join Date
    Jan 2007
    Posts
    9,240
    Thank Post
    1,058
    Thanked 1,068 Times in 625 Posts
    Rep Power
    740
    raid has caused more problems than it has fixed, especially from power cuts and corruption.
    Errrrrr, heard of RAID Write Cache ? Obviously not. Never mind eh...

  14. #12

    m25man's Avatar
    Join Date
    Oct 2005
    Location
    Romford, Essex
    Posts
    1,631
    Thank Post
    49
    Thanked 462 Times in 337 Posts
    Rep Power
    140
    OK Guys,

    Survived for 24 hrs without another loss, "phew"...

    Currently working with Sophos on the cause as it would appear that the only software installed after the last rebuild was Sophos.

    The "Sledgehammer and nut" syndrome looks most likely.

    An IDE update for the Sality virus may be responsible for actions being taken against a couple of executables that are part of the SIMS .NET setup package.

    This is looking more likely that it is an overreaction to a false positive caused by this update!

    What has come out of several hours on the telephone to Sophos in the UK and US is that the option to "Delete" a virus, especially on a server, is most definitely NOT an option.
    If you use this product the correct option of course is "Do Nothing"......
    but of course as Techies you would all know that wouldn't you!

    We suspect that during an AV scan, a false positive may trigger the product to react violently, attempting to remove registry keys that are not there and generally blowing holes in your OS with a blunderbuss!

    It's simple Sophos, if it is not safe to use the delete option on a server then simply disable it with an OS check!
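
    The sort of guard being asked for is tiny. A sketch in Python, purely to show the logic: the action names are the four options listed earlier in this thread, while the function and its `is_server_os` flag are hypothetical and not part of any Sophos product or API.

```python
# The four cleanup options as listed in this thread.
ACTIONS = {"Do Nothing", "Delete", "Move to quarantine", "Move to a central location"}


def effective_cleanup_action(requested, is_server_os):
    """Downgrade "Delete" to "Do Nothing" on a server OS; pass other actions through.

    Hypothetical guard: how a real product detects a server OS is out of scope here.
    """
    if requested not in ACTIONS:
        raise ValueError("unknown cleanup action: " + requested)
    if is_server_os and requested == "Delete":
        return "Do Nothing"
    return requested


print(effective_cleanup_action("Delete", is_server_os=True))
print(effective_cleanup_action("Delete", is_server_os=False))
```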

    As far as I am concerned we are not out of the woods yet; only after several weeks without incident will I be happy.

  15. #13

    mattx's Avatar
    Join Date
    Jan 2007
    Posts
    9,240
    Thank Post
    1,058
    Thanked 1,068 Times in 625 Posts
    Rep Power
    740
    This is looking more likely that it is an overreaction to a false positive caused by this update!
    >>Cough Cough<< Never mind Sophos - I'll leave it at that. Glad you got it sorted - a VERY steep learning curve.

  16. #14
    gshaw's Avatar
    Join Date
    Sep 2007
    Location
    Essex
    Posts
    2,662
    Thank Post
    166
    Thanked 220 Times in 203 Posts
    Rep Power
    67
    Just changed the Sophos cleanup policy to a new one for servers set to "Do Nothing" after having read this! I wasn't keen on the wording of the whole "Do Nothing" section as it was... whatever way you look at it the solution doesn't seem ideal... leave the file infected or trash the server... ouch!

    Looks like a stressful few days
    Last edited by gshaw; 29th April 2009 at 04:18 PM.

  17. #15

    Join Date
    Feb 2008
    Location
    Wiltshire
    Posts
    885
    Thank Post
    277
    Thanked 139 Times in 112 Posts
    Blog Entries
    27
    Rep Power
    42
    I too have now set mine to 'Do Nothing'.

    It will just be a case of checking the Enterprise Console to see if anything has been flagged and deal with it then for now.

    Pete
