https://insidegovuk.blog.gov.uk/2017/04/21/incident-roundup-part-2/

Incident roundup - part 2

We’ve posted before about what happens when things go wrong on GOV.UK, and how we classify and prioritise incidents.

This post is a roundup of 5 incidents that GOV.UK encountered between August and October 2016. It follows on from yesterday’s post about 3 incidents that happened between May and June 2016.

18 August 2016 - users unable to apply for licences on GOV.UK

What users saw

For just over 9 hours on 18 August, users were unable to apply for licences on GOV.UK. We had alerts in place which notified us of the problem almost immediately, but it still took time to resolve. This was a severity 2 incident.

Cause of the problem

The hosting supplier for the licensing application (which enables users to apply for licences on GOV.UK) failed to reapply required configuration after planned maintenance. This prevented the application from working properly and users experienced errors as a result.

Steps taken to prevent this from happening again

The hosting supplier has taken steps at their end to avoid a reoccurrence of this kind of problem.

In future, we may implement functionality to inform users if the licensing application is not working correctly, rather than serving a generic error message.

18 August 2016 - dynamic content and publishing applications unavailable

What users saw

Between 11am and 5:30pm, some of GOV.UK’s servers were unavailable for 5 separate periods of about 3 to 4 minutes. During those periods, dynamically generated content (such as calculators) couldn’t be viewed. However, static content could still be viewed. This was a severity 2 incident.

All backend and publishing applications were unavailable during those periods, although this didn’t appear to have a significant impact on publishers.

Cause of the problem

Load testing by other customers of our cloud host led to a denial of service. Our host’s routers were unable to handle the amount of traffic (more than 1 million packets a second).

Steps taken to prevent this from happening again

Our cloud host put measures in place to prevent this kind of issue from recurring. We’re also planning to add more monitoring to our publishing tools.

13 September 2016 - unable to edit or save content in Mainstream Publisher

What users saw

From about 2pm to 3:30pm, GDS content designers were unable to edit or save content in Mainstream Publisher - the application they use to publish content to GOV.UK. Users of the GOV.UK website were not directly affected. This was a severity 3 incident.

Cause of the problem

A browse page had been redirected and given an empty title. This caused unexpected data to be sent to Mainstream Publisher and the content tagger application, and GDS content designers saw server errors.

Steps taken to prevent this from happening again

A fix was implemented to filter out any items with empty titles, solving the underlying problem. During the incident review meeting we restated the design principle that applications which accept data should validate the data on input.
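The principle of validating data on input, and the fix of filtering out items with empty titles, might be sketched as follows. This is an illustration only, not the code used in Mainstream Publisher or the content tagger. The function names, the `title` and `base_path` fields and the example records are all assumptions made for the sketch.

```python
from __future__ import annotations

from typing import Any


def has_usable_title(item: dict[str, Any]) -> bool:
    """Return True if the item has a non-empty, non-whitespace string title."""
    title = item.get("title")
    return isinstance(title, str) and title.strip() != ""


def filter_valid_items(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only items with a usable title, dropping bad records at the boundary."""
    return [item for item in items if has_usable_title(item)]


# Hypothetical incoming records, including pages with an empty or missing title.
incoming = [
    {"base_path": "/browse/example-a", "title": "Example A"},
    {"base_path": "/browse/example-b", "title": ""},
    {"base_path": "/browse/example-c", "title": None},
]

valid = filter_valid_items(incoming)
print([item["base_path"] for item in valid])
```

Checking the data where it enters the application means that one malformed record is dropped (or could be logged) rather than causing a server error for everyone using the tool.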

13 September 2016 - delay in publication of travel advice

What users saw

An update was made to travel advice about Peru, but it wasn’t published on GOV.UK and subscribers did not receive a notification email. In the end, publication was delayed by about 1 hour 25 minutes. This was a severity 2 incident.

Cause of the problem

We released an update to our publishing API earlier in the day. The Travel Advice Publisher application should have been updated at the same time, because it depends on the API, but this was overlooked. As a result, the application and the API got out of sync, so updates to travel advice content were no longer being published as expected.

Steps taken to prevent this from happening again

Improvements were made to our end-to-end testing process. The GOV.UK Operations Manual was also updated to provide additional information about travel advice email alerts and how to re-trigger them when required.

10 October 2016 - Trade Tariff tool unavailable

What users saw

Between 2pm and 5:35pm on 10 October, links from GOV.UK to the Trade Tariff tool didn’t work. (The problem reoccurred 2 days later, and the day after that.) During these periods, some users may have seen an error page when trying to view certain search results and other dynamically generated content. This was a severity 2 incident.

Cause of the problem

Several months previously, we changed the URL settings for the Trade Tariff tool in our router database when we moved hosting of the tool to GOV.UK PaaS. Just before the incident occurred, these settings were accidentally overwritten by out-of-date information with an incorrect URL. This meant that traffic for the Trade Tariff tool was sent to a URL that no longer existed, and users saw server errors.
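One general way to guard against out-of-date data overwriting newer settings is to reject any write that is older than the record already stored. The sketch below shows that idea with an in-memory dictionary. It is not how the GOV.UK router database works, and it is not the fix applied after this incident. The `Route` type, the `updated_at` field, the `apply_route_update` function and the example path, URLs and dates are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Route:
    """A hypothetical route record: where traffic for a path should be sent."""
    path: str
    backend_url: str
    updated_at: datetime


def apply_route_update(routes: dict[str, Route], incoming: Route) -> bool:
    """Store the incoming route unless an equally new or newer record exists.

    Returns True if the update was applied, False if it was rejected as stale.
    """
    current = routes.get(incoming.path)
    if current is not None and incoming.updated_at <= current.updated_at:
        return False
    routes[incoming.path] = incoming
    return True


routes: dict[str, Route] = {}

# The newer setting, made when hosting moved (example values only).
new = Route("/trade-tariff", "https://new-host.example", datetime(2016, 6, 1))
# An out-of-date record pointing at a URL that no longer exists.
old = Route("/trade-tariff", "https://old-host.example", datetime(2016, 1, 1))

applied_new = apply_route_update(routes, new)
applied_old = apply_route_update(routes, old)
print(applied_new, applied_old, routes["/trade-tariff"].backend_url)
```

In this sketch the stale record is rejected, so the route keeps pointing at the newer URL.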

Steps taken to prevent this from happening again

We updated the URL for the Trade Tariff tool in the content store. We also removed redundant calls from the old content repository application, Panopticon. (Panopticon is being replaced as part of work to consolidate the GOV.UK publishing platform.)

To help us respond to incidents more effectively, we also expanded the GOV.UK Operations Manual with specific information about diagnosing and resolving issues related to the Trade Tariff tool.

Paul is a delivery manager on GOV.UK. You can follow him on Twitter.
