https://insidegovuk.blog.gov.uk/2018/11/02/incident-roundup-march-to-may-2018/

Incident roundup - March to May 2018

GOV.UK Incident Report

We’ve posted before about what happens when things go wrong on GOV.UK, and how we classify and prioritise incidents.

Every incident teaches us something new about our technology or the way we communicate with each other. It also gives us the opportunity to improve our incident management process so we can minimise disruption to users in the future.

In May 2016 we committed to blogging about every severity 1 or severity 2 incident, as well as severity 3 incidents if they’re particularly interesting.

This post is a roundup of 5 incidents that GOV.UK encountered between March and May 2018.

28 March 2018 - content attachments not found

What happened

Between 10am and 3pm on 28 March 2018 a number of attachments and thumbnail images were unavailable, resulting in a 404 Not Found response.

In total we served 927,000 Not Found responses during that time. We believe that around 3,000 assets were affected, many of them thumbnail images for attachment downloads on GOV.UK content.

What the users saw

We believe some of the affected files were thumbnail images for attachment downloads - these images would not be displayed. We think that in most cases the underlying file would still have been available.

How we responded

We began to receive support requests about Not Found errors at around 1:45pm. An incident was declared and investigations began at 2:20pm.

We discovered that attachments (such as PDFs and images) that contained multiple full stops in their file name were not being served correctly by one of our applications. Thumbnails of attachments are particularly likely to have multiple full stops in their file names (such as name_of_file.pdf.png), which is why they were affected.
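The bug class can be sketched as follows. This is a minimal illustration, not the actual asset manager code: deriving a file's extension by splitting the name at the first full stop goes wrong for names that contain more than one.

```ruby
# Illustrative sketch only (not GOV.UK's actual code): splitting at the
# first full stop treats everything after it as the extension, which
# mishandles names like "name_of_file.pdf.png".
def extension_naive(filename)
  filename.split(".", 2).last              # -> "pdf.png" (wrong)
end

# File.extname takes only the final ".png" segment, which is correct
# even when the name contains several full stops.
def extension_correct(filename)
  File.extname(filename).delete_prefix(".")  # -> "png"
end
```

Thumbnails are generated by appending a new extension to the original file name, which is why they were disproportionately affected.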

A previous version of the asset manager application (the application that manages things like images and PDFs that are attached to GOV.UK content) was deployed - this resolved the problem and the incident was closed at 2:55pm.

Steps taken to prevent this from happening again

We fixed the code that allowed the incident to occur, and added some extra tests to ensure it does not happen again.

2 May 2018 - email signups temporarily unavailable

What users saw

On 2 May between 4:30pm and 5:30pm, users attempting to use GOV.UK’s email sign up and subscription management interfaces were shown an error.

Cause of the problem

Buttons (such as the 'Next' button) rendered on GOV.UK use a custom component that we developed. The button component used to be stored in an application called Static. As part of our ongoing work to standardise frontend components across GOV.UK, we moved the button component into a gem.

This required referencing the gem in all applications that needed the button component. We missed the application that serves our email subscription management interface, so when it tried to render a button, the component no longer existed.
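For illustration, the fix amounts to a one-line declaration in each consuming application's Gemfile. The gem name below is the current GOV.UK components gem and is shown as an assumption; the name in use in 2018 may have differed.

```ruby
# Gemfile of each application that renders the shared button component.
# Gem name assumed for illustration.
gem "govuk_publishing_components"
```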

How we responded

We rolled back our release that removed the button component from Static.

Steps taken to prevent this from happening again

We have reduced the cache time to live (TTL) on our Staging environment to surface issues more quickly. Although we released to Staging first, we did not notice the issue because the Staging caches had not yet cleared before we released to Production.
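The idea can be expressed as an environment-dependent TTL. This is a hedged sketch with illustrative names and values, not GOV.UK's actual cache configuration.

```ruby
# Illustrative sketch only: give Staging a much shorter cache TTL than
# Production so a bad release surfaces before caches can mask it.
CACHE_TTL_SECONDS = {
  "production" => 1800, # 30 minutes
  "staging"    => 60    # 1 minute: stale pages expire quickly
}.freeze

def cache_control_for(environment)
  ttl = CACHE_TTL_SECONDS.fetch(environment, 60)
  "public, max-age=#{ttl}"
end
```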

We have tightened our deploy process so developers now post screenshots of our log files proving that the components are no longer being requested (for example https://github.com/alphagov/static/pull/1441).

18 May 2018 - email alert API outage

What happened

On Friday 18 May at around 9pm the GOV.UK support team received an alert that the email monitoring for foreign travel advice had failed.

After determining that a courtesy copy email (a copy of an email update sent to the GOV.UK support team to indicate the update has worked) had been sent out, they mistakenly concluded that the monitoring system was not configured correctly and that the emails were sending as expected.

On Monday 21 May, while investigating the root cause of the problem, the support team received 2 tickets about email alerts not being sent. We discovered that a batch processing utility was inadvertently processing a batch of 100,000 emails, exceeding its capacity of around 65,000. Typically, it would process 1,000 per batch.

What users saw

Users who subscribe to immediate GOV.UK emails had a substantial delay in receiving updates. Notifications about any content published on GOV.UK between 3pm on Friday 18 May and 2:40pm on Monday 21 May were eventually sent between 2:40pm and 6pm on 21 May.

Any departments that published content during the outage will have been affected - 319 documents were updated during this time.

How we responded

On the evening the problem occurred, recent changes to the monitoring system meant that GOV.UK’s on-call team was misinformed about the situation, concluding that email updates were being sent correctly. In addition, error warnings were not coming through to Sentry (an error tracking platform).

On Monday, after further investigation, the team was able to identify the cause and fix the issue. This involved making the email updates system process emails in fixed-size batches - so instead of trying to process a single batch of 100,000 emails, it would process batches of 1,000 at a time.
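The fix described above can be sketched like this. Method and constant names are illustrative, not the email system's actual code.

```ruby
# Illustrative sketch: deliver pending emails in fixed-size batches so a
# backlog of 100,000 never reaches the processor as a single unit.
BATCH_SIZE = 1_000

def deliver_in_batches(pending_ids)
  batches = pending_ids.each_slice(BATCH_SIZE).to_a
  batches.each { |batch| yield batch } # hand each batch to the sender
  batches.size
end
```

Processing in bounded slices means a large backlog degrades into more iterations rather than one oversized job that exceeds the processor's capacity.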

Once the fix was deployed we closely monitored the system to confirm emails were sending correctly. All delayed emails were sent by 5:45pm.

What we're doing to prevent this from happening again

The incident has highlighted that our monitoring and alerts need further improvements.

We’ll evaluate the use of courtesy copies as confirmation that email updates have been sent to subscribers, which proved unreliable in this case. We are also adding health check monitoring to ensure we are notified if immediate email generation fails.

18 May 2018 - delay in publishing content

What happened

From around 12:30pm until 3pm on Friday 18 May, publishers using one of GOV.UK's publishing applications experienced significant delays when trying to publish content.

This was a severity 1 incident, since statistics, which must be published at a specific time, were among the affected publications.

The root cause of this problem was our queueing service (Redis) running out of memory. This caused problems throughout GOV.UK, including the publishing queue in one application backing up and delaying some publications by up to 3 hours.

Even after the memory issue was solved, because we had not anticipated processing such a large number of publication tasks at once, other inefficiencies led to a delay in recovery.

What users saw

Publishers using the affected publishing application saw publications succeed, but not go live on the website.

Most people using GOV.UK saw content as usual. However, new pages were not accessible.

How we responded

We informed publishers that there was a significant delay in publication, so that they could contact us if they needed to publish anything urgently.

Once we determined the root cause, we added a new database index to make the queue faster to process, and cleared some unnecessary data from Redis.

Steps taken to prevent this from happening again

We're changing how we handle alerts, to make Redis running out of memory a more critical problem which alerts on-call engineers.

We're also increasing the amount of memory our Redis machines get.
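As a hedged sketch (the values are illustrative, not GOV.UK's actual settings), the relevant redis.conf directives are the memory ceiling and the eviction policy. For queue data you typically want `noeviction`, so that when the limit is reached Redis refuses writes loudly rather than silently dropping queued jobs.

```
# Illustrative redis.conf fragment - values assumed, not GOV.UK's config
maxmemory 4gb
maxmemory-policy noeviction
```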

Finally, we're investigating how we use Redis, to see if everything that we store in there is needed.

21 May 2018 - content temporarily unavailable

What happened

From 6pm on 21 May a number of GOV.UK users were shown error pages rather than the content they requested. This was because the underlying data store for GOV.UK content - a MongoDB database - was unavailable.

From 4:30pm disk space usage had escalated rapidly on the cluster of machines storing the database. When the primary machine ran out of space, the datastore became unavailable. The system then failed to promote a secondary database in place of the unavailable primary one.

What users saw

During this incident 3% of the pages served on GOV.UK returned an error response and informed users that the site was experiencing technical issues.

The publication of new content was delayed by up to 2 hours.

How we responded

Because GOV.UK’s infrastructure is configured to use cached versions of pages, users saw only a 3% error rate despite 50% of requests actually failing.

Our engineering response was to increase the disk space of the affected machines. The error rates then reduced and GOV.UK stopped serving the error pages.

What we're doing to prevent this from happening again

This incident highlighted a number of areas for improvement in our monitoring and configuration.

The first warning that the MongoDB machines were low on disk space was only shortly before the incident occurred. In order to give an earlier indication of a problem in future, we’ve adjusted these alerts to trigger on lower thresholds. We’ve also increased disk space on all of our MongoDB machines to be safely within these thresholds.
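For illustration only (thresholds and path are assumptions, not our actual values), a standard Nagios-style disk check that warns well before the disk is actually full looks like this:

```
# Warn when free space drops below 30%, go critical below 15%
check_disk -w 30% -c 15% -p /var/lib/mongodb
```

The point of the change is that the warning threshold fires early enough for an engineer to act before the critical state is reached.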

We also updated our configuration so that losing the primary database in our cluster would only hinder write access, not read access. If a similar event were to occur, publishing would still be delayed, but the site should not serve errors to users.
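In MongoDB terms this maps to the read preference. As a hedged sketch (hostnames and database name are illustrative), a connection string using `primaryPreferred` lets reads fall back to a secondary when no primary is available, while writes still require a primary:

```
mongodb://db1.internal,db2.internal,db3.internal/content_store?readPreference=primaryPreferred
```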

Finally, our infrastructure team will prioritise upgrading MongoDB - using an older version likely impacted our ability to respond to the problem.
