This post is about two severity 1 incidents caused by problems with our third-party suppliers, and how they were resolved.
The incidents affected the availability of GOV.UK.
We’ve blogged in the past about what happens when things go wrong on GOV.UK, and also how we classify and prioritise incidents.
Incident 1: Fastly outage
What happened
From around 2:55pm until 3:20pm on Wednesday 28 June, GOV.UK was slow or intermittently unavailable for some users.
What users saw
Upon visiting the site, some users were served the following error message: “Error 503 Maximum threads for service reached”.
How we responded
Our technical support team tracked the issue down to a global outage at our content delivery network (CDN) provider, who were aware of the issue. It was quickly resolved by their support staff.
What we’re doing to prevent this from happening again
We’re currently evaluating how we can switch to an alternative CDN provider at short notice for situations such as this.
We’ve also considered the feasibility of using multiple providers, but this is complicated slightly by our current reliance on vendor-specific features for functionality such as A/B testing.
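As an illustration of the kind of rule that could trigger a switch to an alternative provider, here is a minimal sketch in Python. It assumes a separate probe records the HTTP status of requests made through the CDN. The names `ProbeResult` and `should_fail_over` and the threshold of 3 are hypothetical, chosen for this example only, and do not describe how GOV.UK or any CDN provider works in practice.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeResult:
    # HTTP status code seen through the CDN, or 0 if no response was received.
    status: int


def should_fail_over(results: Sequence[ProbeResult], threshold: int = 3) -> bool:
    """Return True when the most recent `threshold` probes all failed.

    A probe counts as failed if there was no response (status 0) or the
    status was a 5xx error, such as the 503 users saw in this incident.
    """
    if len(results) < threshold:
        return False
    recent = results[-threshold:]
    return all(r.status == 0 or r.status >= 500 for r in recent)
```

Requiring several consecutive failures, rather than reacting to a single one, is a design choice that avoids switching providers because of one slow or dropped request.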
Incident 2: Carrenza outage
What happened
From 8:18am to 8:55am on Thursday 3 August, GOV.UK content was served from static mirrors.
What users saw
Most static content was available to front-end users as normal. Dynamic front-end features of GOV.UK (such as search, finders, and some assets) were unavailable. Government publishers were unable to access the publishing applications.
How we responded
Our technical support team tracked the issue down to a problem at our hosting provider. There had been some planned maintenance on the supplier’s external firewalls.
During this maintenance, those devices lost connectivity to the internal firewalls that protect GOV.UK’s systems. The supplier’s existing monitoring did not alert them to this issue. We contacted the supplier, who restarted the affected devices, restoring connectivity.
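The gap described above, where a lost connection raised no alert, is the kind of thing a staleness check can catch. The following is a minimal sketch only, assuming some other process records the time of the last successful connectivity check for each link. The function name `stale_links`, the link names and the 60-second limit are all hypothetical. It is not the supplier’s monitoring.

```python
from typing import Dict, List


def stale_links(last_ok: Dict[str, float], now: float,
                max_silence: float = 60.0) -> List[str]:
    """Return the names of links with no successful check in `max_silence` seconds.

    `last_ok` maps a link name to the timestamp (in seconds) of its most
    recent successful connectivity check. `now` is the current timestamp.
    """
    return sorted(
        name for name, seen in last_ok.items()
        if now - seen > max_silence
    )
```

For example, calling `stale_links({"external-fw to internal-fw": 100.0}, now=180.0)` reports that link as stale, because 80 seconds have passed since its last successful check.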
What we’re doing to prevent this from happening again
The supplier has updated their maintenance process to ensure that future maintenance will not be affected by the same issue. A support case was raised with the vendor of the affected appliance to discuss the issue encountered during the maintenance.
Graham Pengelly is a developer on GOV.UK. Tim Blair is GOV.UK's lead technical architect.