There are three questions you need to ask your deputy head.
1. How much data can the school afford to lose? I.e. all of it, some of it, or none of it?
2. How long can each system be down during an emergency? 2 weeks? 5 days? 1 day?
3. Based on the above answers - will a suitable budget be provided, not just initially but yearly?
E.g. a 2-week recovery time will cost relatively little: just a backup drive, tapes, and server warranty, or the ability to order one or more new servers should the old ones fail. So say roughly £10k.
A 1-day recovery time will require virtualisation, storage area networks, multiple separate server rooms, etc. Now you can easily be talking £100k+ for an average secondary, or anything up to £500k for a very large, ICT-heavy school/college.
So you can either say to your deputy - give me your requirements and I'll give you a cost. Or you can say give me your cost and I'll give you your requirements. Both will work.
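As a rough illustration of that two-way conversation, here is a minimal Python sketch. Only the two price points (about £10k for a 2-week recovery, about £100k for a 1-day recovery) come from the figures above; the table layout and the function names are made up for the example, and a real school would fill in its own quotes.

```python
# Rough tiers taken from the ballpark figures above; everything else is illustrative.
TIERS = [
    # (max downtime in days, rough cost in GBP, what it buys)
    (14, 10_000, "backup drive, tapes, server warranty"),
    (1, 100_000, "virtualisation, SAN, multiple server rooms"),
]


def cost_for(max_downtime_days):
    """Requirement -> cost: the cheapest tier that recovers within the limit."""
    ok = [t for t in TIERS if t[0] <= max_downtime_days]
    return min(ok, key=lambda t: t[1]) if ok else None


def downtime_for(budget_gbp):
    """Cost -> requirement: the fastest recovery the budget can buy."""
    ok = [t for t in TIERS if t[1] <= budget_gbp]
    return min(ok, key=lambda t: t[0]) if ok else None


if __name__ == "__main__":
    print(cost_for(5))           # a 5-day limit already needs the expensive tier
    print(downtime_for(20_000))  # £20k only buys the 2-week tier
```

Either function answers the deputy's question; which one you call just depends on whether they start from a requirement or from a number.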
Years ago when I worked in the mainframe industry, we used to carry out a regular exercise called 'Component Failure - Impact Analysis' or CFIA for clients.
This exercise pulled together key skills from all areas of the centre and we went through each 'component' (hardware & software) and reviewed the type of failure they may suffer, the impact, and the recovery/repair times. You also ranked the failures in terms of how likely they were to happen.
This enabled us to identify those things that needed additional redundancy, those things that could be reconfigured in an emergency, and those things that were not cost effective to plan for.
So for example, you may decide to switch to RAID storage to provide redundancy from disk failure as it is likely and relatively cheap, but you may decide not to seriously consider the loss of the server room & the building that houses it because it is unlikely & very expensive!
You need to consider the wider failure scenarios too, including environmental damage due to power cuts, flood, building destruction, criminal damage, theft, etc. You end up with a list of things that can be done to mitigate a disaster, those things that need money spending on them, and those things not worth worrying about. These then need to be costed and presented to management, who will hopefully agree to fund the high-risk ones and accept that you cannot (at least on a school budget) plan for every eventuality.
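For anyone who wants to try the exercise in a spreadsheet or script, here is a minimal Python sketch of the ranking step. It is not the original CFIA procedure: the 1-5 scales, the "risk = likelihood x impact" score, the risk-per-£1k threshold and every figure in the table are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    component: str
    likelihood: int       # 1 (very unlikely) .. 5 (expect it) - illustrative scale
    impact: int           # 1 (minor) .. 5 (school cannot operate) - illustrative scale
    mitigation: str
    mitigation_cost: int  # GBP, made-up figures

    @property
    def risk(self) -> int:
        # Simple likelihood x impact score (an assumed scoring rule).
        return self.likelihood * self.impact


def rank(failures):
    """Highest-risk failures first."""
    return sorted(failures, key=lambda f: f.risk, reverse=True)


def worth_funding(failures, min_risk_per_k):
    """Keep mitigations whose risk score per £1k spent meets the threshold."""
    return [f for f in rank(failures)
            if f.risk / (f.mitigation_cost / 1000) >= min_risk_per_k]


FAILURES = [
    Failure("Disk in file server", 5, 3, "RAID array", 1_000),
    Failure("Server power supply", 3, 3, "Dual PSUs", 500),
    Failure("Power cut", 3, 4, "UPS on servers and switches", 3_000),
    Failure("Server room and building lost", 1, 5, "Second site with replicated SAN", 100_000),
]

if __name__ == "__main__":
    for f in rank(FAILURES):
        print(f"{f.risk:>3}  {f.component}: {f.mitigation} (~£{f.mitigation_cost})")
    print([f.component for f in worth_funding(FAILURES, 1.0)])
```

With these invented numbers the disk, power supply and power cut mitigations pass the threshold, and losing the whole building drops off the list, which matches the RAID versus server room example above.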
If you are paranoid, you may want to consider things like 'are you under the flightpath of a major international airport or have you a major petro-chemical plant next door' ........
No, DRBD doesn't act as a bottleneck, it will simply get a little behind when very busy and catch up when it gets a chance. This does mean you want a decent UPS on both machines so they have a chance to shut down properly if possible.
Don't know if anyone else has mentioned this, but you can get servers with dual power supplies, which after hard drives are probably the most common component to fail.
If you're mirroring instances of Windows in real time you'll be needing Windows licenses on both machines, even if you're only running one at a time. You'd probably want to get Windows Server Enterprise or Datacenter edition.
I'm sure you could spend £100,000+ on doing this all very well, but this is all perfectly manageable for way less. I doubt we've spent more than £20,000 on all our servers, switches and wiring in the two years I've been here.
Our servers went tits up a few weeks ago - system down for 2 and a half days (multiple problems occurring at the same time).
The question of how long until it's up again is a big one - it depends on the problem. I could in theory have a new set of servers built from scratch within a day, but it all depends on backups and so on.
Do you want to see what setup we have here now, with the servers virtualised, having recently been through the process of recovering from a disaster?
I have been toying with the idea of a Large NAS drive and this imaging software.
Online data backup solution for Windows servers
It has the ability to recover an entire server to a new one in a few hours, even with different hardware, and could even transfer to a virtual server. It can do full and incremental backups of the images.
I know a technician in a school down the road who swears by it. Might be worth a look into.
There are many things to consider with a mirrored setup, and it really comes down to how much you are willing to spend as to how quick and easy the recovery is.
There are many options, primarily though you would be after a large shared storage array (SAN) with constant replication to a SAN in your backup location http://www.cisco.com/web/partners/pr...utio_Brief.pdf
On top of this you want a virtualisation environment that can provide automatic failover. Again, how gracefully this occurs is up to your budget. You can go for Windows 2008 Hyper-V with failover clustering. This will boot up a backup copy of the VM at your remote site when the primary fails, but users may lose some data and you must wait while it boots.
The next option is a fully specced-out VMware solution, which can offer live spares that run the VMs synced up at each site, so users can fail over without even noticing, as the system is effectively duplicated on the secondary server, including memory state.
The next thing that you must look at is the network infrastructure that supports this. Work to avoid any single point of failure by running switches in parallel with multiple links, so that the failure of any one piece of equipment cannot bring the system down.
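One way to check a design for single points of failure on paper is to remove each switch in turn and see whether the clients can still reach the server. Below is a minimal Python sketch of that check. The topologies and device names (`core1`, `edge1`, `lab` and so on) are hypothetical, and it only models links being up or down, not spanning tree, routing or link speed.

```python
from collections import deque


def reachable(links, start, removed):
    """Nodes reachable from `start` when the device `removed` is dead."""
    graph = {}
    for a, b in links:
        if removed in (a, b):
            continue  # every link through the dead device is gone
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_points_of_failure(links, clients, server):
    """Devices whose failure cuts at least one client off from the server."""
    nodes = {n for link in links for n in link}
    spofs = []
    for candidate in sorted(nodes - set(clients) - {server}):
        seen = reachable(links, server, candidate)
        if any(c not in seen for c in clients):
            spofs.append(candidate)
    return spofs


# Hypothetical topologies: one core switch versus two cores in parallel.
SINGLE = [("server", "core1"), ("core1", "edge1"), ("edge1", "lab")]
DUAL = [("server", "core1"), ("server", "core2"),
        ("core1", "edge1"), ("core2", "edge1"), ("edge1", "lab")]

if __name__ == "__main__":
    print(single_points_of_failure(SINGLE, ["lab"], "server"))
    print(single_points_of_failure(DUAL, ["lab"], "server"))
```

In the second layout the core is no longer a single point of failure, but the lone edge switch still is, which is usually where a school budget draws the line.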
As you can really spend as much money as you have on adding redundancy, it is important to have clear goals of what you want to achieve before you start, then pick the best solution for that level of disaster proofing.
Here's HP's marketing department's vision of disaster proofing:
YouTube - HP Disaster Proof Data Center: http://www.youtube.com/watch?v=WFp-V_WRHxQ
The best rule to work to when designing a system is: why buy 1 when you can get 2 for double the price?
Doubling up on everything is fair enough as long as you have the infrastructure to work with that. I would love to have a few locations with servers in, linked up with multi-10GbE networking, but (firstly, we don't have that sort of money) the fibre is low-quality, gigabit-max stuff.
Acronis also do something where you put a client on the server, make an image to a remote (other building) file server and then leave the client resident.
It will update the image on the fly, so if your server dies all you need to do is replace the box and pull the image back down.
They also include a tool for restoring to dissimilar hardware.
Your backup SAN can be virtualised; LeftHand and DataCore have solutions. The big vendors are not best suited for schools.
Different subject really: my school had a power cut during the night. I came in and found part of the network down. All servers were fine as they were on the UPS, but my core switches went down. Need to get them on the UPS.
Make sure you get a nice powerful UPS installed.
I think it's simply a case of buying two lots of servers, putting them in two different locations and getting as fast a network connection between them as possible. There's no need to spend any money on software, that's all available for free, just spend your cash on hardware.
Yeah, but high-speed links and hardware, especially when duplicated for redundancy, do end up costing a fair amount, especially if you are looking at fast and reliable SAN storage. Also, I have not seen some of the features available on the free systems, like hot spares with less than a second failover and the automated hardware provisioning stuff. For the moment at least, some of the features are only available with software that you have to pay for.
Simple solution - just Ghost the servers (may take a while to restore), couple this with a regular backup and hey presto - cheap!
May take a little longer than some people to get back up and running, but there is such a thing as overkill!
I used to work for the government and we had nothing like what some are suggesting!
I don't know if offering the UK government, with its world-renowned IT misspending, failed projects and data losses, as an upstanding example is the best idea :)
Seriously tho any disaster can be planned for but really there is a limit to what is realistic within a school budget and the realities of school life.
I agree entirely that a VM system with SAN and offsite backup (my personal choice has been to use DAS drives for that) is the way to go to allow operations to continue after significant hardware failure.
If it's a disc error, it should be minutes to recover, if users even notice a disc has failed at all.
For a server failure, it should be a matter of hours, as VMs can be shifted and run on the remaining functional hardware. Redundant PSUs in servers I consider a minimum spec requirement, which avoids most cases where this could happen.
Beyond this, I think good offsite backups that allow a restore a day or 2 later on alternate hardware are totally acceptable. If your server room is vandalised or burns to the ground, a couple of days is acceptable. I have tested a plan in house to use the desktop machines in 1 computer room to run the server VMs, which would allow 90% of our network to function even if the server room was destroyed like this. It wouldn't be fast, and there would be some missing functionality, like the extended storage system we have for media pupils editing video, but admin and basic academic use could continue a day later.
If you go beyond this scenario to flood or major fire consuming multiple computer suites etc., 1-day recovery is not even logical. The school will not be open the day after an event like that, maybe not for weeks afterward, and there will be time to order new equipment and restore from backup.
You should be able to protect against disc and power failure very cheaply, and I would consider this a minimum spec on new server purchases: redundant power supplies, UPS, RAID 1 for the system drive and RAID 5 for user space. Upon failure of 1 server the network should function fully. A loss of performance is acceptable until new hardware is purchased. Beyond that, a few days to a fortnight is unavoidable and should not cause that much disruption relative to whatever event the school has fallen victim to.
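When pricing up that minimum spec, it helps to know how much space you actually get from each array. The Python sketch below uses the standard capacity rules (RAID 1 gives you one disk's worth, RAID 5 gives you all but one disk's worth); the function name and the disk counts and sizes in the example are assumptions, not anyone's real server.

```python
def raid_usable_tb(level, disks, disk_tb):
    """Usable capacity in TB for a RAID 1 or RAID 5 array of equal-sized disks."""
    if level == 1:
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return disk_tb                # every disk holds the same copy
    if level == 5:
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_tb  # one disk's worth is used for parity
    raise ValueError("only RAID 1 and RAID 5 are covered here")


if __name__ == "__main__":
    # Hypothetical server: 2 x 0.5 TB mirrored system drive, 4 x 1 TB RAID 5 for user space.
    print(raid_usable_tb(1, 2, 0.5))  # usable TB on the system mirror
    print(raid_usable_tb(5, 4, 1.0))  # usable TB for user space
```

A two-disk RAID 1 mirror and a RAID 5 array each survive the loss of one disk, so neither replaces the backups discussed above.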
There's been no mention of power redundancy so far. You might want to consider the fact that if the lights aren't on, nothing's going to work with any of the above solutions.
To get power redundancy, failover and backup you need to do a few things.
- Have dual UPSs in each of your racks.
- Have servers with dual power supplies, with one power supply plugged into each UPS.
- Have your switches on UPS.
- Have a large diesel generator on site.
So the theory is, if one of the UPSs dies in one of your racks, you can still power the servers. Yes, the switches are on single UPSs (larger modular switches do come with dual power supplies). Losing a switch to a UPS failure shouldn't be an issue. Your network should route round the failure (see above on how to do that). All the UPSs on site need to be tuned and load tested so they have sufficient capacity to keep things going until your diesel generator kicks in (this should be automatic on power failure).
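As a first sanity check before the real load test, you can estimate whether a UPS will bridge the gap until the generator takes over. The Python sketch below uses a simple linear estimate (battery energy times inverter efficiency, divided by load); the 90% efficiency, the 2x safety factor and all the example figures are assumptions. Real batteries deliver less at high loads and as they age, so treat the manufacturer's runtime charts and an actual load test as the authority.

```python
def runtime_seconds(battery_wh, load_w, inverter_efficiency=0.9):
    """Rough runtime: usable battery energy divided by the load."""
    if load_w <= 0:
        raise ValueError("load must be positive")
    return battery_wh * inverter_efficiency / load_w * 3600


def bridges_generator_start(battery_wh, load_w, generator_start_s, safety_factor=2.0):
    """True if the UPS should outlast the generator start-up with some margin."""
    return runtime_seconds(battery_wh, load_w) >= generator_start_s * safety_factor


if __name__ == "__main__":
    # Hypothetical rack: 600 Wh of battery, 900 W of servers, generator online in 60 s.
    print(runtime_seconds(600, 900) / 60, "minutes")
    print(bridges_generator_start(600, 900, 60))
```

Remember to add the switches to the load figure, otherwise the servers stay up with nothing to talk to.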
Be aware this is massively expensive and is usually the lowest level of any redundancy plan. So unless you're planning to run a stock exchange, or start a war with some other country and need to co-ordinate your military assets, I wouldn't bother.