Major Incident Report
Internet Access Issues on 9th April 2010
Version   Date      Changes / Comments
1.0       13/4/10   First version issued to LGfL
Management Summary
Synetrix apologises to LGfL members for the inconvenience caused by the Internet access issues on 9th April 2010 and for the high impact this had on users within London.
The incident was caused by a combination of issues, starting with what we believe to be a large-scale Denial of Service attack. These issues are described later in this report.
Issues that affect service are always disappointing. Whilst we pride ourselves on the quality of our services, we accept that problems will occasionally occur and we strive to learn from them and continually improve our performance; hence the purpose of this report is to:
1. Explain the issue and the cause.
2. Review the management of the issue.
3. Review lessons learned and describe any corrective actions that have been or will be put in place.
Synetrix welcomes your feedback, as this provides the opportunity to further improve and provide the highest possible standard of service to all of our customers and users.
Issue summary and resolution
There were two distinct parts to the issue:
Outage part 1 – firewall only: 10:30 – 13:30
The impact of this issue was limited to inbound access to on-site services from the Internet and outbound Internet access from sites not using the Netsweeper filtering system.
The cause was high firewall CPU usage due to unidentified external traffic hitting the firewall. The volume of traffic, and its impact on the firewall CPU, reduced the firewall throughput to the point of it being effectively “down”. It is believed that this traffic was either a large-scale Denial of Service attack or the effects of a virus outbreak.
Outage part 2 – total Internet outage: 13:30 – 14:50
All internet users were impacted by this issue, with both inbound and outbound traffic affected.
This was caused by a combination of events:
- While investigating the firewall issue and trying to isolate the traffic, the link on the Earls Court Juniper MX960 connecting to the untrusted interface of the firewall was disabled and later re-enabled.
- Due to an issue with the Juniper MX960, the change to the configuration to re-enable the interface was not correctly synchronised between the master and backup routing engines. This out-of-sync state caused the interface to show as activated in the configuration while remaining operationally down.
- The BGP routing configuration on the MX960 was configured with the untrusted interface of the firewall as the next hop. Therefore, when the firewall became unavailable due to the interface being down, advertisement of the routes to external providers ceased, causing the total outage.
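The dependency described in the last bullet can be illustrated with a small model. This is a hypothetical Python sketch, not the router's configuration or Juniper's implementation: the names `Route` and `advertised_routes`, the interface label `fw-untrust` and the prefixes (taken from documentation address ranges) are all invented for illustration. It assumes only what is stated above, namely that routes are advertised to upstream providers only while the interface towards their next hop is operationally up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str          # illustrative prefix, not a real LGfL range
    next_hop_iface: str  # interface through which the next hop is reached


def advertised_routes(routes, iface_up):
    """Return the prefixes whose next-hop interface is operationally up."""
    return [r.prefix for r in routes if iface_up.get(r.next_hop_iface, False)]


# Assumption: every route uses the firewall's untrusted interface as next hop.
routes = [
    Route("192.0.2.0/24", "fw-untrust"),
    Route("198.51.100.0/24", "fw-untrust"),
]

# Interface operationally up: both prefixes are advertised.
before = advertised_routes(routes, {"fw-untrust": True})

# Interface enabled in configuration but operationally down:
# no prefixes are advertised, which corresponds to the total outage.
during = advertised_routes(routes, {"fw-untrust": False})
```

In this model a single interface state decides whether any route is advertised, which is why an interface that looked enabled in the configuration but stayed operationally down was enough to stop all advertisements.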
Resolution – 14:50
The Juniper MX960 interface was disabled and re-enabled, and the BGP routing process was restarted. This resolved the issue with the BGP routes and traffic started flowing again.
At this point the attack on the firewall had ceased and there were no further issues.
Brief timeline of events
10:30 – Synetrix monitoring systems picked up problems with the firewall. Investigations showed high CPU usage on the firewall.
10:40 – A small number of calls were received into the Service Desk regarding Internet access from sites not using URL filtering (sites using Netsweeper were OK).
11:00 – Investigations discovered that the high CPU was related to external traffic, but this proved difficult to isolate. Various troubleshooting measures were started, culminating in temporarily isolating the firewall by disabling the interface on the Juniper MX960 router at approximately 13:25.
13:30 – Complete loss of Internet access, including sites using Netsweeper.
13:50 – Initial investigations pointed to external peering with upstream providers. The theory was that upstream providers were blocking our routes.
14:00 – Discussions were started with all three upstream Internet providers.
14:20 - 14:30 – Discussions with upstream providers concluded that our routes were not being correctly advertised to them. Investigations refocused to Juniper MX960 routing configuration.
14:50 – After an interface on the Juniper MX960 was reset and the BGP peering was restarted, service resumed.
Issue management and communication
A lead ticket was raised and updated with progress throughout the issue, and Talk2Synetrix was also updated.
Unfortunately, sending SMS alerts relied on Internet access and sending email updates relied on access to the email filtering platform. Both were affected by the issue, so these updates did not go out to subscribed customers in a timely manner.
Lessons learned and corrective actions
Juniper MX960 configuration issue
A review of the Juniper knowledge base shows a known issue: under certain circumstances, when a change is made before the previous change has completed synchronisation to the backup routing engine, the router can become out of sync.
Now that this issue is known and understood, it has been communicated to the engineers who work on these routers and will be added to the Synetrix knowledge base for future reference.
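The working practice this implies, waiting for one change to finish synchronising before making the next, can be sketched as a simple guard. This is a hypothetical Python model for illustration only: `SyncGuard` and its methods are invented names, the revision counters are an assumption, and nothing here represents Junos behaviour or a Juniper API.

```python
class SyncGuard:
    """Refuse a new change until the previous one has reached the backup engine."""

    def __init__(self):
        self.master_rev = 0  # revision held by the master routing engine
        self.backup_rev = 0  # revision held by the backup routing engine

    def in_sync(self):
        return self.master_rev == self.backup_rev

    def commit(self):
        # Block the change if the previous one is still synchronising.
        if not self.in_sync():
            raise RuntimeError("previous change not yet synchronised to backup")
        self.master_rev += 1

    def sync_complete(self):
        # Called once the backup engine has caught up with the master.
        self.backup_rev = self.master_rev


guard = SyncGuard()
guard.commit()              # first change is accepted
try:
    guard.commit()          # second change arrives before synchronisation completes
    blocked = False
except RuntimeError:
    blocked = True          # the guard rejects it
guard.sync_complete()
guard.commit()              # accepted once the engines agree
```

The point of the sketch is the ordering rule rather than the mechanism: a second change is only attempted after the first is confirmed on both routing engines.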
The BGP routing configuration has been changed so that the non-availability of the firewall interface will no longer affect the advertisement of Internet routes for traffic flowing through Netsweeper, or traffic for customers not using the core firewalls at all.
It is clear that the automated SMS and email alerts failed to perform their function, due to their reliance on parts of the network that were affected by the failure. We have started a review process to ascertain where these dependencies lie and how they can be engineered out of the communication solution. This is likely to take several weeks. In the meantime, manual SMS updates will be sent to customers who have registered for them on Talk2Synetrix, should such an event recur.
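One way to picture the dependency review is as a check of each alert channel against the components that have failed. The following is a minimal Python sketch under stated assumptions: the function `channels_at_risk` and the component labels are hypothetical, the two automated dependencies are those named in this report, and the empty dependency set for manual SMS is an assumption made for illustration, not a finding of the review.

```python
def channels_at_risk(channel_deps, failed):
    """Return the alert channels that depend on at least one failed component."""
    return sorted(ch for ch, deps in channel_deps.items() if deps & failed)


# Dependencies per channel. The first two come from this report;
# the third is an assumption (manual sending outside the affected network).
channel_deps = {
    "automated_sms": {"internet_access"},
    "email_updates": {"email_filtering_platform"},
    "manual_sms": set(),
}

# Components affected during the incident on 9th April 2010.
failed = {"internet_access", "email_filtering_platform"}

at_risk = channels_at_risk(channel_deps, failed)
```

Any channel that appears in `at_risk` shares a dependency with the failed service and cannot be relied on to report that failure.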
The core firewalls are to be upgraded within the next two months, adding significant extra capacity. This will enable them to withstand a higher level of virus or DoS traffic before service is affected.