https://insidegovuk.blog.gov.uk/2018/04/26/incident-roundup-part-3/

Incident roundup - part 3

This post is a roundup of 5 incidents that GOV.UK encountered between August 2017 and February 2018.

It follows on from our previous incident roundup posts.

We’ve also posted before about what happens when things go wrong on GOV.UK and how we classify and prioritise incidents.

24 April 2017 - incident with the Trade Tariff app

You can read about this incident at the Platform as a Service (PaaS) statuspage channel.

14 August 2017 - issue with users seeing empty pages

What users saw

Blank pages were being served to users on several URLs.

Cause of the problem

The site was being crawled externally. One of the requests that the crawler was making caused an error in our frontend application.

This was happening in the code known as Slimmer that adds the GOV.UK styling to each page. A bug in that code caused it to fail in such a way that it returned a blank page to the crawler, rather than an error message.

Because the page didn’t have an error code, it was cached by our content delivery network, Fastly. This meant that users who requested those pages less than 30 minutes after the crawler, and hit the same node at Fastly, saw a blank page.

Steps taken to prevent this from happening again

We removed the code that would generate the blank page. We also ran some automated tests against the other applications to see if they had the same bug. We added notes on how to fix the issue to our internal guidance.
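The failure mode described above (an empty body returned with a success status, which a cache will happily store) can be guarded against with a check along these lines. This is a minimal sketch in Python for illustration only, not the code that runs on GOV.UK: the function name `guard_blank_response`, the plain dictionary of headers and the choice of a 500 response with `Cache-Control: no-store` are all assumptions.

```python
def guard_blank_response(status, headers, body):
    """Replace a 'successful' but empty HTML response with an explicit error.

    Hypothetical helper: takes a status code, a dict of headers and the body
    as a string, and returns a (status, headers, body) tuple.
    """
    # Header names are case-insensitive, so look up Content-Type accordingly.
    content_type = ""
    for name, value in headers.items():
        if name.lower() == "content-type":
            content_type = value.lower()

    is_html = content_type.startswith("text/html")
    if status == 200 and is_html and not body.strip():
        # An empty HTML page with a 200 status looks cacheable, so return
        # an error status and ask caches not to store the response.
        error_headers = {
            "Content-Type": "text/plain; charset=utf-8",
            "Cache-Control": "no-store",
        }
        return 500, error_headers, "Internal Server Error"

    # Anything else passes through unchanged.
    return status, headers, body
```

With a guard like this, a blank page becomes an error response, so the error is visible to monitoring instead of being served as a blank page with a success status.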

31 October 2017 - 500 errors on GOV.UK

What users saw

Pages on GOV.UK that required search were not working from 4:20am to 10:57am. Searches would have returned very few results and some pages, such as finders (the Competition and Markets Authority case finder, for example), were showing an error.

Cause of the problem

We made a code change to an automated task that ran overnight and copied all documents from one search index to another. But because of a fault in the code change, it didn’t copy all of the documents.

Steps taken to prevent this from happening again

We investigated the user experience during this incident and tried to work out whether we could use an error message that’d better describe what had gone wrong.

We now perform diffs after any reindexing and don’t switch to the new index if there are any unexpected errors or differences between the old and new indices. We also added a critical alert (red in Icinga) to report a significant growth or reduction in the size of the search index.
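As an illustration of the two checks described above, here is a minimal sketch in Python. It assumes each index can be summarised as a mapping of document ID to a content hash; the function names and the 10% threshold are hypothetical and are not taken from the GOV.UK search code or its Icinga configuration.

```python
def diff_indices(old_docs, new_docs):
    """Compare two indices given as {document_id: content_hash} mappings."""
    old_ids, new_ids = set(old_docs), set(new_docs)
    return {
        # Documents the copy failed to carry over.
        "missing": sorted(old_ids - new_ids),
        # Documents that appeared from nowhere.
        "unexpected": sorted(new_ids - old_ids),
        # Documents present in both but with different content.
        "changed": sorted(
            doc_id for doc_id in old_ids & new_ids
            if old_docs[doc_id] != new_docs[doc_id]
        ),
    }


def safe_to_switch(old_docs, new_docs):
    """Only switch to the new index if the diff is completely empty."""
    return not any(diff_indices(old_docs, new_docs).values())


def index_size_status(previous_count, current_count, critical_ratio=0.1):
    """Return 'CRITICAL' if the index grew or shrank by more than the ratio.

    The 10% default is an assumed threshold for the example.
    """
    if previous_count == 0:
        return "CRITICAL" if current_count else "OK"
    change = abs(current_count - previous_count) / previous_count
    return "CRITICAL" if change > critical_ratio else "OK"
```

In this sketch any difference blocks the switch. A real job would probably allow for documents that were legitimately published or changed while the reindex was running.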

We’d like to investigate automatically running the full reindexing job periodically on staging and integration to ensure it’s kept in a working state.

14 December 2017 - Licensing DNS issues

What users saw

From 10:20am to approximately 11:35am, the GOV.UK licensing tool wasn't working. Users were unable to download or upload licence applications and may have experienced timeouts.

Cause of the problem

We updated our machines to use provider-specific DNS servers instead of Google’s public DNS.

The licensing service in production runs in an environment hosted by a different provider than the main GOV.UK stack. Machines in this environment couldn’t access the new DNS servers, so the application started failing.

Steps taken to prevent this from happening again

We reverted the DNS changes so that we rely on Google’s public DNS again. We investigated whether the error had appeared on staging before we deployed to production, to decide whether we need to increase the test time on staging to 30 minutes before deploying. We also investigated why our smoke tests didn’t fail for this.

In future, we will look into adding an Icinga check that verifies the machine can use DNS.
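A check of this kind could look something like the following. This is a minimal sketch in Python, assuming the usual monitoring plugin convention that Icinga understands (exit code 0 for OK, 2 for CRITICAL). The function names and the default hostname are illustrative, not an existing GOV.UK check.

```python
import socket

OK, CRITICAL = 0, 2  # exit codes used by Nagios-compatible plugins


def check_dns(hostname, resolver=socket.getaddrinfo):
    """Try to resolve a hostname using the machine's configured resolver.

    The resolver argument exists so the check can be tested without
    touching the network.
    """
    try:
        results = resolver(hostname, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return CRITICAL, f"DNS CRITICAL - cannot resolve {hostname}: {exc}"
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    addresses = sorted({result[4][0] for result in results})
    return OK, f"DNS OK - {hostname} resolves to {', '.join(addresses)}"


def main(argv):
    """Print the status line and return the exit code for the plugin."""
    hostname = argv[1] if len(argv) > 1 else "www.gov.uk"
    code, message = check_dns(hostname)
    print(message)
    return code
```

A wrapper script would finish with `sys.exit(main(sys.argv))` so that Icinga sees the exit code. Because `socket.getaddrinfo` goes through the system resolver, the result can also come from a hosts file or a local cache, so a stricter check might query the configured DNS servers directly.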

26 January 2018 - Clamdscan reporting false positives

What users saw

Publishers trying to upload attachments to their editions saw reports of viruses being present on their uploaded files. They were unable to publish these editions because of this. This affected our publishing applications so only publishers saw the issue.

Cause of the problem

The service is set to update its virus definitions daily. One of the virus definitions caused a bug in ClamAV (our virus scanning software) which had the end result of false positives being reported.

Steps taken to prevent this from happening again

We’ve added an item to our backlog to set up Icinga alerting if we see an increase in positive virus reports. We’re also considering other virus scan tools and have improved our developer documentation to help spread knowledge about ClamAV and improve response times if we get another occurrence.
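The alerting idea could be sketched as a simple rate check. The Python below is an illustration under stated assumptions: the function name, the 5% and 20% thresholds and the minimum sample size are all hypothetical, and the counts of scanned and flagged files would need to come from wherever scan results are logged.

```python
def virus_report_status(scanned, positives, warn_rate=0.05,
                        crit_rate=0.20, min_scans=10):
    """Classify the share of uploads flagged as infected in a time window.

    A sudden jump in this rate is more likely to mean a bad virus
    definition than a real outbreak, so it is worth alerting on.
    """
    if scanned < min_scans:
        # Too few uploads in the window to say anything meaningful.
        return "OK"
    rate = positives / scanned
    if rate >= crit_rate:
        return "CRITICAL"
    if rate >= warn_rate:
        return "WARNING"
    return "OK"
```

For example, `virus_report_status(100, 50)` is classified as critical under these assumed thresholds, while `virus_report_status(100, 1)` is not.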

2 February 2018 - Foreign Travel Advice feed errors

What users saw

Users of the Foreign Travel Advice atom feeds may have received updates slightly later than normal. Travel advice pages and email alerts were unaffected.

Cause of the problem

GOV.UK server resources couldn’t handle the increased volume of requests from The National Archives.

Steps taken to prevent this from happening again

We’re planning to have more resource for the frontend group of machines, to make atom feeds more resilient and to make it easier for developers on support to block external organisations who crawl GOV.UK too aggressively.

We’ve also measured Foreign Travel Advice atom feed errors for the duration of the incident and compared them to the number of successful requests. As well as this, we included a set of contacts for the Foreign and Commonwealth Office for travel advice related incidents in our incident management guidance.


Hong is a delivery manager on GOV.UK.
