Several schools have reported issues with access to their websites through our reverse proxy service, which you will no doubt be aware of. We of course offer our apologies for the service not being optimum at this crucial time for schools but we have found several contributing factors.
As you know, earlier in the year we had an issue with RP. At the time, the server went down and was unavailable. Earlier this year, we made the service resilient in terms of hardware, i.e. we added another box so that if the server went down, the secondary box would serve the traffic.
The issues we are experiencing at the moment are partly down to traffic volumes. As you would expect in the inclement weather, a much larger amount of traffic has been trying to reach school websites. This has caused the service to run slowly, as it has had to serve many requests at the same time. Added to this has been the steady increase in sites now using the RP service, particularly for OWA traffic. We have found an issue with Microsoft's IIS service using HTTP/1.1 and the amount of keep-alive traffic that it generates. It is a broken implementation. This basically clogs up the RP when it's trying to serve normal traffic, especially when devices are polling for mail. Contrary to popular belief, ActiveSync doesn't actually push; it uses keep-alive. In this instance, the RP service has actually prevented school IIS servers from failing.
We have seen a large number of requests from the RP servers get no response from school servers. This means that the RP continues to try to connect to the server until the request times out. This adds to the number of sockets, and to the processing and traffic within the box, which adds to the clog. It is also compounded by DNS lookups for each request.
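To illustrate the point about unresponsive back ends holding sockets open, here is a minimal Python sketch of a bounded connection check. It is not how the RP servers are configured; the function name `backend_responds` and the 2-second default are assumptions made purely for illustration. The idea is only that a short, explicit timeout releases the socket quickly when a school server does not answer.

```python
import socket


def backend_responds(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s.

    Hypothetical helper: a short timeout means a dead back end costs one
    socket for at most timeout_s seconds instead of until a long default
    timeout expires.
    """
    try:
        # create_connection raises OSError (including timeouts and refusals)
        # if the connection cannot be established in time.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

A monitoring script could call `backend_responds(host, 443)` for each school server and flag the ones that return False before users start seeing 503 pages.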
To reduce the impact of all of these issues we are taking the following measures:
1. Instead of having a resilient pair of RP servers, to get us through this difficult time, we are moving all SSL based reverse proxies to one of the servers and keeping the non SSL on the other server. This will reduce the amount of traffic going to each server and will reduce the amount of clogging.
2. We are going to force all servers to use HTTP/1.0. This will eradicate the issue with the faulty MS implementation of HTTP/1.1 keep-alive traffic.
3. We are implementing a wider scope of DNS caching actually on the RP servers themselves. This will reduce the amount of traffic going between the RP servers and the DNS servers and also reduce the amount of latency waiting for those requests.
4. As a temporary measure, we have also increased the maximum number of connections on the RP servers. While this does leave us more at risk of a Denial of Service attack, we are taking the calculated risk while we’re experiencing higher than normal volumes of legitimate traffic.
We are very confident that these changes will alleviate the issues that we have been experiencing, but there is little we can do if a back end server will not respond to our requests. We ask that you monitor your school sites over the next 24 hours, as we will be, and let us know immediately by e-mailing your LA service desk if you experience any difficulties.
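As a rough illustration of measure 3, the sketch below shows the general idea of caching DNS answers locally so that repeated requests for the same host do not each trigger a lookup. It is a minimal Python sketch, not the actual RP configuration: the class name `TtlDnsCache`, the 300-second lifetime, and the injectable `resolver` and `clock` parameters are all assumptions for illustration.

```python
import socket
import time
from typing import Callable, Dict, Tuple


class TtlDnsCache:
    """Cache hostname lookups for a fixed time (hypothetical helper)."""

    def __init__(
        self,
        ttl_s: float = 300.0,
        resolver: Callable[[str], str] = socket.gethostbyname,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl_s = ttl_s
        self._resolver = resolver
        self._clock = clock
        # hostname -> (address, expiry time on the clock)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        hit = self._entries.get(hostname)
        if hit is not None and hit[1] > now:
            # Still fresh: answer locally, no trip to the DNS server.
            return hit[0]
        # Missing or expired: look it up once and remember the answer.
        address = self._resolver(hostname)
        self._entries[hostname] = (address, now + self._ttl_s)
        return address
```

With a cache like this in front of the resolver, a burst of requests for one school site costs one lookup per cache lifetime rather than one per request. The trade-off is that a changed DNS record is not noticed until the cached entry expires.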
Head of Support Services
still getting 503 errors here at 8 oclock at night what gives?
Last edited by round2it; 1st December 2010 at 07:57 PM.
Our RP services have been up and down for the last 48+ hours now. It's reassuring that something's being done about it, but is it making enough of a difference? All I can see is that it's now "hit and miss" instead of "miss and miss". A few more questions:
Is there a way we can prioritise some services over others? For example, our VLE and website need to be accessible more critically than our MIS.
Would we still have these issues if we hosted our website elsewhere? Obviously it depends on the host's service level, but I'm asking with regard to DNS still coming through YHGfL. We can only do this with our public website, as our other services (VLE, email and MIS) are all LDAP integrated.
I'm looking around at various schools' services/websites. It appears to me that it's the Wakefield ones that are down. Calderdale and Hull RPs are all up as far as I can tell. Is this right?
I'm aware of a data centre move at some point in the near future (Calderdale?). Will things improve with the new data centre, or is the architecture going to be the same?
We've now finished all the config changes we have been making tonight. We hope you will see a vast improvement.
Please make sure your LAs are aware if you are experiencing difficulties. If it isn't reported, we can't do anything about it in individual cases. If you want certain sites prioritised over others, please log that with us and we will endeavour to do as much as we can. Obviously, from your posts and location details in Edugeek, I have no idea which schools you represent and which websites you're talking about specifically.
We do understand the gravity this has at this particular time, with the weather being as it is, and we're working day and, as you can see, night as well to try to sort the problems out.
We are currently undertaking a review of all of our services as we prepare for the new grid to make sure they are optimal for moving forward. RP is no exception and we plan to ensure that it is meeting the needs moving forward from April.
The RP service is there principally to protect schools and the grid from outside attack by not exposing each school web server to the wider internet. Be assured we're doing what we can to take the pain away that we have all been suffering over the past days.
Now, where's my bed? I have to be up early to monitor websites.
Thanks Andrew, very appreciative that you've worked around the clock getting this sorted for us all. Fingers crossed, this morning I've seen no 503 pages.
The result on my phone battery is immense, not 1 phone call!
You can always tell when Wakefield RP has failed... just check https://eportal.horbury.wakefield.sc...rtal/index.jsp and (c) Ossett School 2010 - Welcome To The Ossett School Website
Have to say, the improvement today vs yesterday and the day before is tenfold for RP'd sites thus far, so well done on sorting the configs.
I've been watching all the Wakefield YHGfL RP'd sites fail for 3 years on the trot now whenever it's been a "snow day". It's why we've flat out refused to use it for our webserver other than ePortal services. Our PlusNet 2Mb upstream ADSL backup connection has proved far more reliable thus far - it may get a tad sluggish from 7am to 9am on proper snowdays when we have several hundred hits to the site simultaneously - but it stays up and continues to serve either way. Am watching the Wakey sites with interest to see the impact of the new config changes, as I would LOVE to get all our webservices running over the YHGfL connection, especially as we're now on 100Mb, but whilever the 4Mb ADSL keeps our site up on high-traffic days and I see sites such as Ossett's being down, I won't be pursuing it.
If failover DNS could be provided so that if RP fails it diverts to any other available connection, that would be the best solution all round, as that way there's a degree of redundancy. Unfortunately, the LEA helpdesk doesn't provide a 24hr support service, whereas the web IS a 24hr service, and many Web Admins within schools find they end up having to be on call 24hrs when it comes to the school's site being accessible or not. For situations such as this there needs to be a direct contact for YHGfL, bypassing LEA. Add in to the equation the current conditions, where schools are closed, as well as many local council offices, for health and safety reasons. Potentially you could find yourself with an unmanned LEA helpdesk, which means any chance of a speedy response or action from YHGfL is slim to minimal without direct lines of communication. The support infrastructure for YHGfL has always been the weakest point in the whole system. Whether that's a local issue or whether that's an issue that all YHGfL users suffer, I don't know. I'm speaking for myself here.
Are there procedures accessible to us for such circumstances?
Stevee - with regard to external hosting - if you can find an ISP that will offer a webserver with 2 connections, 1 to public web, 1 secure VPN, you can then latch it onto your internal network that way and keep LDAP integration without needing to expose a barrel of ports and services to the net. Could also schedule a cronjob or similar to retrieve and update LDAP data into MySQL every 15mins if you can't find someone accommodating on the VPN front, or do it over an SSH tunnel. There are options out there...
Last edited by Marci; 2nd December 2010 at 11:42 AM.
The reverse proxy for our school website had disappeared overnight.
A quick call to the LEA got it sorted, but
we had to close the school and were unable to tell parents via the website (not good).