https://insidegovuk.blog.gov.uk/2019/04/03/incident-roundup-august-2018/

Incident roundup - August 2018

We’ve posted before about what happens when things go wrong on GOV.UK, and how we classify and prioritise incidents.

Every incident teaches us something new about our technology or the way we communicate with each other. It also gives us the opportunity to improve our incident management process so we can minimise disruption to users in the future.

In May 2016 we committed to blogging about every severity 1 or severity 2 incident, as well as severity 3 incidents if they’re particularly interesting.

This post is a roundup of 4 incidents that GOV.UK encountered during August 2018.

2 August 2018 - data.gov.uk sync failure

What happened

Datasets that were added or edited on data.gov.uk by departmental publishers between 25 July and 2 August 2018 were not always presented to users immediately, sometimes taking several hours or days to update. This was a severity 2 incident.

What users saw

For datasets that were modified, new links were not showing and the ‘last updated’ date remained static. New datasets did not become available through search and thus could not be retrieved by users.

Users could still access the data through the publisher’s own website, but they may not have realised this because data.gov.uk showed incorrect ‘last updated’ dates and had no links to the new datasets. The incident did not affect the availability of data through the data.gov.uk API.

Cause of the problem

The Find and Publish parts of data.gov.uk act as 2 separate applications with a synchronisation process copying data from Publish to Find. At the time of this incident, a large number of sync jobs had been queued but had failed.

New jobs for the same datasets were queued up behind the failed jobs every 10 minutes, which resulted in the queue length increasing rapidly. This meant that new jobs were not processed for some time and the data flow into Find was slowed down.

How we responded

After determining that the queue had stalled, we manually terminated all running jobs and deleted all pending jobs from the queue. A new sync process then started automatically, as designed, and all new and updated datasets became available on Find.

Steps taken to prevent this happening again

We have increased our monitoring of the queue lengths and added alerting to ensure a stalled queue can be resolved quickly by our second line support team.

Additionally, we have modified the sync code so that failed jobs do not remain in the queue for re-processing. Instead, a new job is created on the next sync run. This prevents the queue from growing rapidly if a problem occurs during synchronisation.
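The change to failed jobs can be sketched roughly as follows. This is a minimal illustration in Python, not the data.gov.uk sync code: the `SyncQueue` class, its `enqueue` and `run_once` methods, and the in-memory queue are all hypothetical. Keeping at most one pending job per dataset is an extra assumption added for the example, not something described in this post.

```python
from collections import OrderedDict
from typing import Callable, List


class SyncQueue:
    """Hypothetical in-memory queue holding at most one pending job per dataset."""

    def __init__(self) -> None:
        self._pending: "OrderedDict[str, None]" = OrderedDict()

    def enqueue(self, dataset_id: str) -> None:
        # Assumption: queueing a dataset that is already pending is a no-op,
        # so repeated sync runs cannot pile up duplicate jobs for it.
        self._pending.setdefault(dataset_id, None)

    def __len__(self) -> int:
        return len(self._pending)

    def run_once(self, sync: Callable[[str], None]) -> List[str]:
        """Process every pending job once and return the IDs that failed."""
        failed = []
        while self._pending:
            dataset_id, _ = self._pending.popitem(last=False)
            try:
                sync(dataset_id)
            except Exception:
                # A failed job is dropped, not re-queued. The next scheduled
                # sync run is expected to create a fresh job for it.
                failed.append(dataset_id)
        return failed
```

With this shape, a failing dataset costs one attempt per sync run and cannot block the jobs behind it, so the queue is empty again after every run.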

6 August 2018 - duplicate emails sent to some subscribers

Between 6 and 7 August 2018, some email subscribers received multiple emails whenever something changed on GOV.UK.

What happened

On Monday 6 August, we made a change to Email Alert API, the application that compiles and sends email alerts to subscribers every time content changes on GOV.UK. This changed the way that Email Alert API connected to its database in order to make it more efficient, using an external application called PgBouncer.

We configured PgBouncer to be as efficient as possible. However, this introduced an issue with Email Alert API, which relies on database locking to make sure that emails are only sent once. A task runs every few seconds that looks for unsent emails and sends them.

Normally, if this task runs more than once at the same time, the database lock prevents the second run from picking up the same emails as the first. In this case, the lock was being released too soon, so a second run could pick up and send an email while the first run was still sending it.
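The effect of a lock being released too soon can be shown with a toy model. This post does not say which kind of database lock or which PgBouncer setting was involved, so the sketch below makes no attempt to reproduce either. The `Outbox` class, the `send_task` generator and the task names are hypothetical, and the two runs are interleaved in a fixed order so the result is repeatable.

```python
from typing import Iterator, List, Optional, Set, Tuple


class Outbox:
    """Hypothetical stand-in for the table of unsent emails plus its lock."""

    def __init__(self, email_ids: List[int]) -> None:
        self.unsent: Set[int] = set(email_ids)
        self.lock_holder: Optional[str] = None
        self.deliveries: List[Tuple[str, int]] = []


def send_task(name: str, outbox: Outbox, release_early: bool) -> Iterator[None]:
    """One run of the sending task, written as a generator so that two runs
    can be interleaved step by step in a fixed order."""
    # Step 1: wait for the lock, then read the unsent emails.
    while outbox.lock_holder is not None:
        yield
    outbox.lock_holder = name
    batch = sorted(outbox.unsent)
    if release_early:
        # The failure being illustrated: the lock goes away before the
        # batch has been marked as sent.
        outbox.lock_holder = None
    yield
    # Step 2: send the batch and mark it as sent.
    for email_id in batch:
        outbox.deliveries.append((name, email_id))
        outbox.unsent.discard(email_id)
    if not release_early:
        outbox.lock_holder = None


def run_two_tasks(release_early: bool) -> List[Tuple[str, int]]:
    """Run two overlapping tasks and return every (task, email) delivery."""
    outbox = Outbox([1, 2])
    tasks = [
        send_task("task-a", outbox, release_early),
        send_task("task-b", outbox, release_early),
    ]
    # Round-robin scheduler: advance each task one step at a time.
    while tasks:
        for task in list(tasks):
            try:
                next(task)
            except StopIteration:
                tasks.remove(task)
    return outbox.deliveries
```

Calling `run_two_tasks(False)` delivers each email once, because the second task waits until the first has marked its batch as sent. Calling `run_two_tasks(True)` delivers each email twice, because both tasks read the same unsent batch before either has finished sending.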

What users saw

Some people who were subscribed to receive emails every time something changes on GOV.UK would have received more than one email for the same change.

How we responded

Since PgBouncer is a new tool for GOV.UK, we had to spend some time understanding how it worked in relation to Email Alert API and where the issue was. Once we had understood this, we disabled PgBouncer for Email Alert API to fix the immediate issue.

We then investigated PgBouncer in more depth to understand the problem and how it could be fixed permanently. We made a small configuration change that makes PgBouncer slightly less efficient but does not cause problems with database locks.

Once we were sure that the immediate problem had been fixed, and that there were no problems with our other apps, we moved Email Alert API to use PgBouncer again and monitored it to make sure it was working normally.

What we’re doing to prevent this from happening again

We’ve documented the problem for the benefit of developers who work on our apps, and we are also keeping a list of apps that use database locking. This will allow us to understand which apps to check if we see similar problems in the future.

14 August 2018 - data.gov.uk dataset withdrawal

What happened

On 2 occasions, on the mornings of 14 August and 20 August 2018, it became apparent that a large number of datasets were not visible or searchable from data.gov.uk Find. They were still available in the Publish tool and through the data.gov.uk API. This was a severity 2 incident.

What users saw

Users would not have seen these datasets when browsing the site or when using the search function. Those with a URL directly to a non-visible dataset would have been redirected to the Publish interface. Approximately 13,000 datasets were not visible during the first incident and 4,000 during the second incident.

Cause of the problem

The datasets that were not visible had been marked with draft (non-visible) status. The sync process from Publish to Find assumes that any dataset not available through the Publish API has been withdrawn, and so unpublishes it in Find by marking it as draft.

We believe that these datasets were inadvertently marked as draft in Find due to an error or invalid response by the Publish API endpoint during the sync process. The synchronisation code also incorrectly made the assumption that a withdrawn dataset could never be published again, therefore preventing these datasets from being published correctly on the next sync run.

How we responded

In both cases, after identifying the datasets which had been unpublished, we manually marked them as published. As part of the recovery, we noted that the re-indexing of datasets into Elasticsearch (the search engine used to power the frontend of Find) was slow: estimates suggested it would take around 18 hours to index all 46,545 datasets.

We took this as an opportunity to improve the indexing process more generally by adding indexes where needed to the source PostgreSQL database (which acts as a temporary data store prior to indexing). This reduced indexing time to around one hour.
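To put those figures in per-dataset terms, the short calculation below uses only the numbers quoted above. Both durations are the rounded estimates from this post ("around 18 hours" and "around one hour"), so the results are approximate.

```python
DATASETS = 46_545
HOURS_BEFORE = 18  # estimated time for a full re-index before the change
HOURS_AFTER = 1    # approximate time for a full re-index after the change

seconds_per_dataset_before = HOURS_BEFORE * 3600 / DATASETS
seconds_per_dataset_after = HOURS_AFTER * 3600 / DATASETS
speedup = HOURS_BEFORE / HOURS_AFTER

print(f"before: {seconds_per_dataset_before:.2f} s per dataset")
print(f"after:  {seconds_per_dataset_after:.3f} s per dataset")
print(f"speed-up: about {speedup:.0f}x")
```

That is roughly 1.39 seconds per dataset before the change and roughly 0.077 seconds afterwards.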

Steps taken to prevent this happening again

We have implemented a number of checkpoints within the sync process to prevent a recurrence of this incident.

These include: validating that the number of results returned after pagination matches the number of results expected; aborting if an unusually large number of datasets are being unpublished in a single sync run; and ensuring that a withdrawn dataset can always be marked as published again.

In any of these cases, the reason for aborting will be reported to Sentry (software used internally to monitor and track server-side errors), allowing for further investigation.
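The three checkpoints could look something like the sketch below. It is an illustration under stated assumptions, not the actual sync code: `plan_sync`, `SyncAborted` and the 5% threshold are hypothetical (this post only says "an unusually large number"), and the `report` callback stands in for reporting to Sentry without using its real client library.

```python
from typing import Callable, Dict, List, Set, Tuple

# Assumed threshold: the real value is not given in this post.
MAX_UNPUBLISH_FRACTION = 0.05


class SyncAborted(Exception):
    """Raised when a checkpoint fails and the sync run must not continue."""


def plan_sync(
    expected_total: int,
    fetched_ids: Set[str],
    find_status: Dict[str, str],
    report: Callable[[str], None] = print,
) -> Tuple[List[str], List[str]]:
    """Return (IDs to publish, IDs to unpublish) or raise SyncAborted.

    `find_status` maps dataset ID to "published" or "draft" in Find.
    `report` is a placeholder for an error tracker such as Sentry.
    """

    def abort(reason: str) -> None:
        report(reason)
        raise SyncAborted(reason)

    # Checkpoint 1: the paginated fetch must return as many results as
    # the Publish API said to expect.
    if len(fetched_ids) != expected_total:
        abort(f"expected {expected_total} datasets, fetched {len(fetched_ids)}")

    to_unpublish = sorted(
        dataset_id
        for dataset_id, status in find_status.items()
        if status == "published" and dataset_id not in fetched_ids
    )

    # Checkpoint 2: refuse to unpublish an unusually large share in one run.
    published = sum(1 for status in find_status.values() if status == "published")
    if published and len(to_unpublish) / published > MAX_UNPUBLISH_FRACTION:
        abort(f"refusing to unpublish {len(to_unpublish)} of {published} datasets")

    # Checkpoint 3: a dataset that is draft in Find but present in Publish
    # is published again, as is any dataset Find has not seen before.
    to_publish = sorted(
        dataset_id
        for dataset_id in fetched_ids
        if find_status.get(dataset_id) != "published"
    )
    return to_publish, to_unpublish
```

Separating the planning step from the step that applies changes means a failed checkpoint stops the run before anything in Find has been modified.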

15 August 2018 - delays in published pages appearing

For around 3 hours on 15 August 2018, publishers in GDS and other government departments experienced delays in published pages appearing on GOV.UK.

What happened

On Tuesday 14 August, we deployed a number of small updates and fixes for Content Performance Manager, an app used by publishers in some government departments. One of the updates was for Ruby on Rails, the code framework used to build the app.

After the deployment, the number of threads used by the app started to increase slowly but steadily. Each thread represents a discrete set of work done by the app, and the number of threads running at any time for this app is normally 6.

Since we do not usually monitor the number of threads an app is creating, we did not notice that they were increasing overnight and into the next day.

By 11:50am on Wednesday 15 August, the limit for the maximum number of threads across all apps had been reached on some of our servers. This meant other apps on the same servers could not create new threads to carry out work such as publishing documents and pages.

What users saw

Publishers who tried to publish or update documents or pages on GOV.UK did not see their changes appear on GOV.UK until the problem was fixed.

How we responded

Since we wanted to fix the immediate problem as soon as possible, we started by restarting each of the affected servers one at a time, allowing the apps to restart and process all the documents and pages that had been published but were not appearing on GOV.UK.

We then informed the team responsible for Content Performance Manager of the problem, and they deployed a temporary fix that reversed the upgrades until they could work out the original problem.

Once they had tracked the problem down to the Ruby on Rails upgrade, they deployed the rest of the upgrades again, leaving out the problematic one pending further investigation.

What we’re doing to prevent this from happening again

We have added a new alert to our alerting app that will let developers on our second line support team know if a single app creates more than 100 threads. This will allow us to find out about the problem earlier and fix it before it affects other apps.
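A check of this kind might look like the sketch below. It is not the GOV.UK alerting configuration. The function names and the sample app names are hypothetical, and reading the count from the `Threads:` line of `/proc/<pid>/status` assumes a Linux server. Only the threshold of 100 comes from this post.

```python
from typing import Dict, List

THREAD_ALERT_THRESHOLD = 100  # from this post: alert above 100 threads


def thread_count_from_status(status_text: str) -> int:
    """Extract the thread count from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no Threads: line found")


def apps_over_thread_limit(
    thread_counts: Dict[str, int],
    threshold: int = THREAD_ALERT_THRESHOLD,
) -> List[str]:
    """Return the names of apps whose thread count exceeds the threshold."""
    return sorted(app for app, count in thread_counts.items() if count > threshold)


# Example with made-up numbers: one app at its normal level, one leaking.
sample_status = "Name:\truby\nThreads:\t6\n"
counts = {
    "app-a": thread_count_from_status(sample_status),
    "app-b": 250,
}
print(apps_over_thread_limit(counts))
```

A per-app threshold catches a slow leak well before the server-wide thread limit is reached, which is the point at which other apps on the same server start to fail.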
