https://insidegovuk.blog.gov.uk/2016/05/12/severity-1-incident-errors-across-publishing-applications/

Severity 1 incident: errors across publishing applications

This post is about a recent incident on the GOV.UK website. For more information on incidents on GOV.UK, you can check the status of the GOV.UK site or read more about what happens when things go wrong on GOV.UK. You can also take a look at how we classify and prioritise incidents on GOV.UK.

What publishers saw

From around 12:30pm on Monday 9 May, publishers using GOV.UK’s publishing applications experienced intermittent errors when trying to publish content. There were 3 bursts of errors around 2 hours apart. The GOV.UK technical support team saw a spike in timeout errors on GOV.UK. These quickly resolved themselves but they did still cause some disruption to publishing. As the load on the publishing applications dropped off, so did the bursts of errors and we believed the situation to be stable.

The next morning, as the publishing load picked up again, we saw even higher spikes in errors and we declared a severity 1 incident. Whitehall Publisher was particularly badly affected and didn’t recover after the spate of errors. We had to shed the load from the publisher by turning off access entirely. As a result, all of our publishing tools were inaccessible between 10am and 11am and again from 1pm to 1:45pm. Whitehall Publisher was inaccessible for most of the morning, up until 1:45pm.

What users browsing GOV.UK saw

Most people using www.gov.uk saw content as usual. We served content from our static mirrors and most wouldn’t have noticed a problem. However, up to 8% of requests returned an error between 10am and 11am, and between 1pm and 1:45pm. People who accessed Atom feeds or 'latest' feeds saw content that was out of date. This led to confusion and, in some cases, email alerts being sent out unnecessarily.

What caused the problem

The root cause was a performance problem with our hosting provider’s network storage. To mitigate the impact, our provider limited how many disk operations could take place per second. This limit was too low for GOV.UK, which resulted in extremely high disk wait times and errors across our publishing applications.
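
As a rough illustration of why a limit on disk operations per second leads to high wait times, here is a minimal sketch in Python. The function names and all of the numbers are hypothetical and are not taken from GOV.UK's systems or our provider's configuration. It only shows the general effect: when operations arrive faster than the limit allows them to complete, a backlog builds up and each new operation waits longer.

```python
def backlog_after(seconds, demand_per_sec, cap_per_sec):
    """Return the number of queued disk operations at the end of each second.

    demand_per_sec and cap_per_sec are illustrative values, not real figures.
    """
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += demand_per_sec              # new operations arrive
        backlog -= min(backlog, cap_per_sec)   # at most cap_per_sec complete
        history.append(backlog)
    return history


def wait_seconds(backlog, cap_per_sec):
    """Approximate time a newly queued operation waits behind the backlog."""
    return backlog / cap_per_sec


if __name__ == "__main__":
    # Demand above the limit: the backlog grows every second.
    print(backlog_after(3, demand_per_sec=500, cap_per_sec=300))
    # Demand below the limit: no backlog builds up.
    print(backlog_after(3, demand_per_sec=200, cap_per_sec=300))
```

In this simplified model the backlog keeps growing for as long as demand stays above the limit, which is consistent with errors appearing as publishing load picked up and easing as it dropped off.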

Preventing recurrence

To resolve the disk performance issues, our hosting provider will be upgrading their network storage.

We’re already rebuilding our publishing platform to make it more resilient to these failures.