Netman (29th April 2009)
This is a nightmare that I want you all to share...
One of the schools that we assist had 4 new DL380s last summer.
They have been running sweetly for about 9 months.
One, a dual quad SIMS server, had been sitting virtually idle for the last 9 months whilst the DM played with the various upgrades etc. ready for a full roll-out at Easter.
They installed BackupExec on it and restarted; it hung on restart and displayed "Attempt" in the top left corner of the screen.
I won't bore you with the technicals: HP spent hours with us, remoted in from India using iLO2, and determined that there was nothing wrong with the hardware.
The word "Attempt" is actually just the hardware hung at "Attempting to Boot from ...."
The OS can't recognise the boot sector correctly as this has been trashed.
So we decided that, as the database was safe on the second RAID volume, we would just reinstall the OS and refresh the SQL install. Good practice for the DM anyway.
All good: the server recovered and they continued testing SIMS and SQL on it.
Last Weds the school called again, this time another DL380 is showing the same error! This time it's a DC.
2 new HP DL380s turned to toast in 3 weeks!
When we arrive on site (within 30 mins I might add) we are now told another server is stuck on a boot screen with just a flashing cursor!
Another DC has gone down...
But wait.... this is an Intel 2400 SATA Server!
So suddenly the Hardware suspicions are no longer valid.
So the DR plan kicks into play.... "What's a DR plan????" says the NM.
Where are the backups? On a single 800GB DLT, which takes about 9 hrs to run!
This is connected to the old Intel 2400 that's now dead!
Fortunately I have a spare HP PCI-X SCSI card in the car, so we can connect the tape unit directly to the target DC. Great!
An authoritative DS restore looks like the only way out of this mess, so we spend about 2 hours cataloguing the tape (as the Veritas DB and catalogs are on the other dead server).
Lo and behold, there are no files to restore in the system state! Inconsistent data error.
S**T, now we are in trouble....
Whilst battling with the tapes and recovery we are building the OSs again on the other failed server. "Hello," we said, "where have all the desktop shortcuts gone on the other HP?" The one we were hoping to transfer the remaining DC roles from...
The services are all running but now the 3rd DC is about to die!
In 48 Hrs all 8 Servers have failed, all with identical symptoms!
The first thing you notice is the shortcuts; when these go you have just lost the entire file association section of HKEY_CLASSES_ROOT.
To accompany this, everything under HKLM\Hardware disappears, along with a huge chunk of HKLM\System\CCS\System.
On the next boot your server, and if you are unlucky your AD, is history.
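For anyone wanting to spot this before the reboot, here is a minimal sketch (not something that was run on this site) of comparing a list of registry key paths taken while the server is healthy against a later list, and counting what has vanished under each root. The function names and the idea of feeding it plain lists of key paths are assumptions for illustration; how you export the key lists is up to you.

```python
from collections import Counter

def missing_keys(baseline, current):
    """Return key paths that were in the baseline but are absent now.

    Registry key names are case-insensitive, so compare in lower case.
    """
    now = {key.lower() for key in current}
    return sorted(key for key in baseline if key.lower() not in now)

def missing_by_root(baseline, current, depth=1):
    """Count missing keys grouped by their first `depth` path components."""
    counts = Counter()
    for key in missing_keys(baseline, current):
        root = "\\".join(key.split("\\")[:depth])
        counts[root] += 1
    return dict(counts)
```

A sudden jump in missing keys under HKCR or HKLM\Hardware would match the symptoms described above and would be the cue to stop and investigate before restarting the box.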
I am writing this as so far none of HP, Microsoft or Sophos has a clue what's going on.
The site has lost everything apart from its data; all of that remains untouched.
This is virus-like activity that in 20 years of networking I have never seen.
The entire domain has been rebuilt from the ground up some servers time!
We made it through the weekend, and things were looking good until about 11:00 am today when the newly built DC died with exactly the same symptoms.
We yanked the plug on the core switch and isolated all of the servers from each other.
So where do we go from here? I have lost the plot entirely.
This is like an episode of "HOUSE".
Sophos has not detected anything disastrous, just some lightweight email viruses in the users' home folder, that's all.
So what is the common denominator.....
The OS: Server 2003 R2 SP2
The AV: Sophos
These are the ONLY two components common to all 8 failures
What on earth has the ability to blow identical holes in 8 system registries in the same places on 8 different servers?
I have no concrete evidence yet, but after such a traumatic experience the school techies have been scanning workstations with the SAV standalone Linux tools and have cleaned up a whole rake of viruses, including some Conficker strains.
Sophos has the following disinfection options:
"Move to quarantine"
"Move to a central location"
Well, it appears that the correct option for your servers is to "Do Nothing".
When pushed, as to why we should not use the "Delete" option I have been told that "I don't want to do that".
"It's not recommended"
Why I ask?
Because I believe that a false positive might result in the AV application trying to delete critical system files or registry keys, that's why!
I can't think of any other possible reason. This is H1N1 for servers!
I have had to make changes to the AV policies so that a system area scan is a light tickle and the user areas are aggressively purged.
I know that Sophos watch this board so I am being very careful what I say, but I am now most concerned that if you are a Sophos user and you have the "wrong" but "logically obvious settings" you may be in danger of following in my footsteps.
I will update this thread as we make headway (or not) as the case may be.
But any input, however bizarre, regarding the identical registry damage across 8 servers over 2 builds I would be keen to hear.
In the meantime I think a review of your AV aggressiveness on your system areas such as \Windows\NTDS and \Windows\system32\config is in order.
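To make the "light tickle on system areas, aggressive on user areas" idea concrete, here is a rough sketch of the decision as a lookup. This is not Sophos's API or policy format; the action names, the function name and the C: drive letter are all assumptions made purely for illustration.

```python
import ntpath

# Assumed system areas (drive letter is a guess); extend to suit your servers.
SYSTEM_AREAS = (r"C:\Windows\NTDS", r"C:\Windows\system32\config")

def _norm(path):
    """Normalise a Windows path for case-insensitive comparison."""
    return ntpath.normcase(ntpath.normpath(path))

def cleanup_action(path, system_areas=SYSTEM_AREAS):
    """Pick a hypothetical cleanup action for a detection at `path`.

    Detections inside a system area are only reported, so a false positive
    cannot remove OS files; everything else is quarantined.
    """
    target = _norm(path)
    for area in system_areas:
        root = _norm(area)
        if target == root or target.startswith(root + "\\"):
            return "report_only"
    return "quarantine"
```

The point of the design is simply that the destructive action is never the default anywhere near the directory database or the registry hives; a human looks at those detections first.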
I have seen similar reports of Server Boot issues after updates recently but nothing on this scale.
I have not been beaten by M$ in 20 years, but now I am worried.....
I guess different server types rules out faulty RAID controllers in the HPs (same age etc.). What's the humidity like? Not too dry I hope (static). It could be a rootkit trying to write to/damage the BIOS but I don't think that's very likely. Are the servers being built then patched offline from the main network? Something else to try is to use a different administrator password on the servers in case something's replicating around by simply running with enough privileges to install on everything else.
Everything on a UPS and power-spike free too? Any applications common to all servers?
Sounds like a tough one!
Edit: I see Sophos is a common factor, any chance of trialling another AV for a few days to rule it out?
Last edited by DMcCoy; 27th April 2009 at 11:06 PM.
We also quarantined 146GB of illegally downloaded copyright material, DVD Rips, Music, Warez Key generators all sorts!
Oh and I forgot to say we also found another 39GB of stuff in the Kids Home Files!!!!
Yes, the 146GB was staff, mostly the SMT!!!
Obviously the NM was too busy watching these videos instead of managing the network.
The place is the pits..... if there is one place on the planet that I would be happy to see BSF it's here!
It does sound like a rootkit or virus issue. My thought would be to try to get hold of the data in a raw format off the drives. Do you have any other machines with similar RAID controllers that you could swap the affected disks over to in order to access them? The HP RAID volumes are portable if you move all of the disks (removing the good ones for safekeeping of course). If you could get access to the disks you could have a look at the raw data and possibly the event logs, depending on how damaged the partitions are.
Also, if you got an image of an offending system, a computer forensics/security place may be able to shed some light. The AV company may also be interested in getting a sample if it is a new and as yet undetectable virus. Given the fact that it does appear to be spreading, you may need to get these companies involved to turn the tide without a complete system-wide wipe of the network and documents (impractical).
I'm assuming that the iLO did not shed any light, as these are like a black box for servers, including halt event logging. The fact that it is halting so early in the proceedings could indicate that it is the low-level chunk of the drive that it is having difficulty with, which would indicate the boot sector. This should be protected by the AV software but a determined rootkit will easily bypass that. There may be master sector protection options in the BIOS but I am uncertain. Also, do the systems have the latest BIOS and RAID controller firmware? This may help the system cope a little better and give you a more sensible error.
It may be that the virus/rootkit is badly written and designed to affect much simpler computers, and cannot carry out its intended behaviour on the more complicated system.
Another way to look for the culprit/stabilise things in the meantime could be to use virtual servers configured the same way as the originals and then see if they get affected too. If so, these may be easier to disassemble to find both the culprit and the infection vector.
Wow... Rough time there, so it would seem. Also might be worth turning off the Remote Registry service, maybe?
I will replace one of the HP's arrays with a 2k8 HyperV build and get the 2003 DC's migrated across if only to eliminate the rebuilding times.
We have run full deep rootkit scans under the guidance of Sophos to no avail.
The servers are scheduled to lose the registries between 10.30 and 11.00 tomorrow.
Time is the problem here. Exam deadlines and EOY Finance needed.
We have not even thought about getting the Exchange system online yet!
The EDBs are fine as I have these already mounted on our recovery server and exported all the mailboxes into 248 PST files.
If you know roughly when the issues are going to happen you could possibly use Filemon on the box during that time to see the activity, so long as it keeps running (FileMon for Windows). There may be some way to redirect the output to a remote server so that you could pick up the PID of the offender and the activity leading up to it.
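Once a log has been captured, something along these lines could narrow down the offender. It is only a sketch and assumes the log has been saved as tab-separated text with the columns sequence, time, process:PID, request, path, result, other; that layout is an assumption, so check a real log from your version first. The function name and watched path are also just illustrative.

```python
import csv
from collections import Counter

# Directory holding the registry hive files; matched case-insensitively.
WATCHED = ("\\windows\\system32\\config",)

def suspects(lines, watched=WATCHED):
    """Count log rows per process whose path falls under a watched directory.

    `lines` is any iterable of text lines in the assumed tab-separated layout.
    Returns (process, count) pairs, busiest process first.
    """
    hits = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or truncated rows
        process, path = row[2], row[4]
        if any(area in path.lower() for area in watched):
            hits[process] += 1
    return hits.most_common()
```

Passing an open file object works as well as a list of strings, so the same function can be pointed at a log copied off to another machine.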
stariq (28th April 2009)
Might not fix your current problems but there are a couple of things that could be done differently:
- Never buy the same server hardware. I build my networks over time, getting each server stable before fully rolling it out.
- Never install BackupExec on a server that performs a vital role. I always have a workstation with BackupExec installed in a different location to the main server room.
- I don't use RAID on any server other than the file server. This may just be my experience, but RAID has caused more problems than it has fixed, especially from power cuts and corruption.
- Backing up to tape is unreliable and slow. A 1TB hard disk costs 50 quid these days and is far more reliable for recovery. I keep 1 hard disk always online, and another to take home.
But yeh it sounds like a virus, hardware issue or deliberate damage to me.
Last edited by zag; 28th April 2009 at 10:04 AM.
RAID (proper raid, not the pseudo RAID of cheaper motherboards) should be an essential to all your major servers.
Split roles and ensure plenty of redundancy (DFS, load balancing, etc).
Try and get virtualisation up and running - much faster to get a backup... back up again.
Get rid of tape backups and have disk backups in a remote building (unless you only have the one building).
Have a server(s) dedicated just to backups
With all the junk the techies have found, could it be that someone knows an administrator password and is installing rootkits or similar?
"raid has caused more problems than it has fixed, especially from power cuts and corruption."
Errrrrr, heard of RAID write cache? Obviously not. Never mind eh...
Survived for 24 hrs without another loss, "phew"...
Currently working with Sophos on the cause as it would appear that the only software installed after the last rebuild was Sophos.
The "Sledgehammer and nut" syndrome looks most likely.
An IDE update for the Sality virus may be responsible for actions being taken against a couple of executables that belong to the SIMS .NET setup package.
It is looking more and more likely that this is an overreaction to a false positive caused by this update!
What has come out of several hours on the telephone to Sophos in the UK and US is that the option to "Delete" a virus, especially on a server, is most definitely NOT an option.
If you use this product the correct option of course is "Do Nothing"......
but of course as Techies you would all know that wouldn't you!
We suspect that during an AV scan a false positive may trigger the product to react violently, attempting to remove registry keys that are not there and generally blowing holes in your OS with a blunderbuss!
It's simple, Sophos: if it is not safe to use the delete option on a server then simply disable it with an OS check!
As far as I am concerned we are not out of the woods yet; only after several weeks without incident will I be happy.
"This is looking more likely that it is an over reaction to a false positive caused by this update!"
>>Cough Cough<< Never mind Sophos - I'll leave it at that. Glad you got it sorted - a VERY steep learning curve.
Just changed the Sophos cleanup policy to a new one for servers set to "Do Nothing" after having read this! I wasn't keen on the wording of the whole "Do Nothing" section as it was... whatever way you look at it the solution doesn't seem ideal... leave the file infected or trash the server... ouch!
Looks like a stressful few days
Last edited by gshaw; 29th April 2009 at 05:18 PM.
I too have now set mine to 'Do Nothing'.
It will just be a case of checking the Enterprise Console to see if anything has been flagged and deal with it then for now.