H1N1 for Servers - A 2003 Nightmare!
This is a nightmare that I want you all to share...
One of the schools that we assist had 4 new DL380s last summer.
They have been running sweetly for about 9 months.
One, a dual quad SIMS server, had been sitting virtually idle for the last 9 months whilst the DM played with the various upgrades etc., ready for a full roll-out at Easter.
They installed BackupExec on it and restarted. It hung on restart and displayed "Attempt" in the top left corner of the screen.
I won't bore you with the technicals. HP spent hours with us, remoted in from India using iLO2, and determined that there was nothing wrong with the hardware.
The word "Attempt" is actually just the hardware hung at "Attempting to Boot from ...."
The OS can't recognise the boot sector correctly as this has been trashed.
So we decided, as the database was safe on the second RAID volume, we would just reinstall the OS and refresh the SQL install - good practice for the DM anyway.
All good, the server recovered and they continued testing SIMS and SQL on it.
Last Weds the school called again. This time another DL380 is showing the same error! This time it's a DC.
2 new HP DL380s turned to toast in 3 weeks!
When we arrive on site (within 30 mins I might add) we are now told another server is stuck on a boot screen with just a flashing cursor!
Another DC has gone down...
But wait.... this is an Intel 2400 SATA Server!
So suddenly the Hardware suspicions are no longer valid.
So the DR plan kicks into play.... What's a DR plan???? says the NM.
Where are the backups? On a single 800GB DLT which takes about 9 hrs to run!
This is connected to the old Intel 2400 that's now dead!
Fortunately I have a spare HP PCIx SCSI card in the car, so we can connect the tape unit directly to the target DC. Great!
An authoritative DS restore looks like the only way out of this mess, so we spend about 2 hours cataloguing the tape (as the Veritas DB and catalogs are on the other dead server).
Lo and behold, there are no files to restore in the system state! Inconsistent data error.
S**T, now we are in trouble....
Whilst battling with the tapes and recovery we are building the OSes again on the other failed server. Hello, we said, where have all the desktop shortcuts gone on the other HP? The one we were hoping to transfer the remaining DC roles from...
The services are all running but now the 3rd DC is about to die!
In 48 Hrs all 8 Servers have failed, all with identical symptoms!
The first thing you notice is the shortcuts. When these go you have just lost all of the file association section of HKEY_CLASSES_ROOT.
To accompany this, everything under HKLM\HARDWARE disappears, along with a huge chunk of HKLM\System\CCS\System.
On the next boot your server, and if you are unlucky your AD, is history.
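For anyone who wants an early warning before that fatal reboot, here is a rough sketch of a canary check for the symptoms above. It is only an illustration: the specific subkeys in CANARY_KEYS are my own guesses at sensible canaries, not keys confirmed as damaged, and the names winreg_key_exists and missing_canaries are made up for this example. The Windows-only part uses Python's standard winreg module.

```python
from typing import Callable, Iterable, List

# Canary keys loosely based on the symptoms described above.
# ASSUMPTION: these exact subkeys are illustrative picks, not a confirmed list.
CANARY_KEYS = [
    r"HKEY_CLASSES_ROOT\.lnk",  # shortcut file association
    r"HKEY_CLASSES_ROOT\.exe",
    r"HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System",
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control",
]


def winreg_key_exists(full_path: str) -> bool:
    """Return True if the registry key can be opened (Windows only)."""
    import winreg  # imported lazily so the rest of the sketch runs anywhere

    hive_name, _, subkey = full_path.partition("\\")
    hive = getattr(winreg, hive_name)  # e.g. winreg.HKEY_LOCAL_MACHINE
    try:
        winreg.CloseKey(winreg.OpenKey(hive, subkey))
        return True
    except FileNotFoundError:
        # Key is gone. Permission errors are left to propagate on purpose.
        return False


def missing_canaries(
    key_exists: Callable[[str], bool],
    keys: Iterable[str] = CANARY_KEYS,
) -> List[str]:
    """Return the canary keys that the supplied check reports as missing."""
    return [key for key in keys if not key_exists(key)]
```

On a server you would call missing_canaries(winreg_key_exists) from a scheduled task and alert if the list is not empty. The check function is passed in so the logic can be tried out away from a live box.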
I am writing this as so far none of HP, Microsoft or Sophos has a clue what's going on.
The site has lost everything apart from its data, all of that remains untouched.
This is virus-like activity that in 20 years of networking I have never seen.
The entire domain has been rebuilt from the ground up some servers time!
We made it through the weekend, and things were looking good until about 11:00 am today when the newly built DC died with exactly the same symptoms.
We yanked the plug on the core switch and isolated all of the servers from each other.
So where do we go from here? I have lost the plot entirely.
This is like an episode of "HOUSE".
Sophos has not detected anything disastrous, just some lightweight email viruses in the users' home folders, that's all.
So what is the common denominator.....
The OS: Server 2003 R2 SP2
The AV: Sophos
These are the ONLY two components common to all 8 failures
What on earth has the ability to blow identical holes in 8 system registries in the same places on 8 different servers?
I have no concrete evidence yet, but after such a traumatic experience the school techies have been scanning workstations with the SAV standalone Linux tools and have cleaned up a whole rake of viruses, including some Conficker strains.
Sophos has the following disinfection options:
"Move to quarantine"
"Move to a central location"
Well, it appears that the correct option for your servers is to "Do Nothing".
When I pushed as to why we should not use the "Delete" option, I was told "I don't want to do that".
"It's not recommended"
Why, I ask?
Because I believe that a false positive might result in the AV application trying to delete critical system files or registry keys, that's why!
I can't think of any other possible reason. This is H1N1 for servers!
I have had to make changes to the AV policies so that a system area scan is a light tickle and the user areas are aggressively purged.
I know that Sophos watch this board so I am being very careful what I say, but I am now most concerned that if you are a Sophos user and you have the "wrong" but "logically obvious settings" you may be in danger of following in my footsteps.
I will update this thread as we make headway (or not) as the case may be.
But I would be keen to hear any input, however bizarre, regarding the identical registry damage across 8 servers over 2 builds.
In the meantime I think a review of your AV aggressiveness on your system areas such as \Windows\NTDS, \Windows\system32\config is in order.
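If you want to sanity-check your own policy, something along these lines might help. It is only a sketch under assumptions: the policy is modelled as a plain dict of scan root to cleanup action (this is NOT how Sophos stores or exports its settings), the C: drive letter is assumed, and risky_rules is a made-up name. It simply flags any destructive action whose scan root covers one of the system areas mentioned above.

```python
from pathlib import PureWindowsPath
from typing import Dict, List, Sequence, Tuple

# System areas named in the post. ASSUMPTION: the C: drive letter.
SYSTEM_AREAS = [r"C:\Windows\NTDS", r"C:\Windows\system32\config"]


def risky_rules(
    policy: Dict[str, str],
    system_areas: Sequence[str] = SYSTEM_AREAS,
    destructive: Sequence[str] = ("delete",),
) -> List[Tuple[str, str]]:
    """Return (scan_root, system_area) pairs where a destructive cleanup
    action applies to a scan root that contains a system area.

    `policy` is a hypothetical mapping of scan root -> cleanup action.
    """
    hits = []
    for root, action in policy.items():
        if action.lower() not in destructive:
            continue
        root_path = PureWindowsPath(root)
        for area in system_areas:
            area_path = PureWindowsPath(area)
            # PureWindowsPath comparisons are case-insensitive, like NTFS.
            if area_path == root_path or root_path in area_path.parents:
                hits.append((root, area))
    return hits


# Example policy: whole-drive delete is flagged, the user area is not.
example_policy = {"C:\\": "delete", r"D:\Users": "delete", r"C:\Windows": "none"}
flagged = risky_rules(example_policy)
```

Anything that comes back in flagged is a rule I would want changed to a non-destructive action before the next scheduled scan.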
I have seen similar reports of Server Boot issues after updates recently but nothing on this scale.
I have not been beaten by M$ in 20 years, but now I am worried.....