We’ve posted before about what happens when things go wrong on GOV.UK, and how we classify and prioritise incidents.
Every incident teaches us something new about our technology or the way we communicate with each other. It also gives us the opportunity to improve our incident management process so we can minimise disruption to users in the future.
In May 2016 we committed to blogging about every severity 1 or severity 2 incident, as well as severity 3 incidents if they’re particularly interesting.
This post is a roundup of 4 incidents that GOV.UK encountered between June and July 2018.
18 June 2018 - connectivity issue with licences
What happened
From 4:49pm to 7:34pm on 18 June 2018, users were unable to apply for or download licences on GOV.UK (such as for a temporary events notice), and the administration tools for licences were unavailable to government staff.
What users saw
Users who tried to access licence pages saw a generic 500 error page telling them that something had gone wrong and to try again later.
How we responded
We started looking into the problem as soon as the errors began, but struggled to identify the cause quickly because we could not see any code changes or other problems on our side that would explain the outage.
We eventually raised a support ticket with our cloud provider, who hosts the licensing infrastructure. They responded after about half an hour, identifying the cause of the outage as an upgrade they had made.
The change had altered the way health checks (which check that applications are working) were performed. The health check was then unable to connect to our licensing application. This caused our load balancers to stop routing traffic to the application because they thought it was down.
The cloud provider assisted our team in altering the health check’s configuration to fix this, and within 10 minutes our licensing pages were working normally again.
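We have not reproduced our provider’s configuration here, but as a rough illustration of the mechanism, a load balancer health check polls an HTTP endpoint on each backend and stops routing traffic after a run of failures. The endpoint, port and threshold in this Python sketch are made up for the example:

```python
import urllib.error
import urllib.request

# Illustrative values only - not our provider's actual configuration.
HEALTHCHECK_URL = "http://licensing-app.internal:8080/healthcheck"
TIMEOUT_SECONDS = 5
UNHEALTHY_THRESHOLD = 3  # consecutive failures before traffic is stopped


def check_once() -> bool:
    """Return True if the application answers its health check with a 200."""
    try:
        with urllib.request.urlopen(HEALTHCHECK_URL, timeout=TIMEOUT_SECONDS) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def should_route_traffic(consecutive_failures: int) -> bool:
    """A load balancer stops sending requests to a backend once it has failed
    this many checks in a row, and resumes when checks start passing again."""
    return consecutive_failures < UNHEALTHY_THRESHOLD
```

If the upgrade changes how or where the check connects, a perfectly healthy application can fail every check and be taken out of rotation, which is what happened here.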
Steps taken to prevent this from happening again
We added further notes to our documentation to help staff debug similar incidents, and looked at how we could get more warning of upgrades and changes from our cloud provider in future. We are also looking into improving our monitoring of the licensing system.
25 June 2018 - licence payment error
What users saw
From 25 to 29 June, users from local authorities saw errors when submitting a payment for licences through the licensing system. The payments were processed as expected, but some users would have been incorrectly told there had been a problem.
Cause of the problem
The issue was caused by 2 factors. The first related to the way we replay visits to the site on our staging and integration environments.
We replay all visits to GOV.UK on these environments so that they accurately reflect the live site. URLs are automatically changed to their staging or integration versions during the replays. However, this process was not applied to the licensing application, so all its traffic was being replayed on the live site, creating duplicate payment tracking entries.
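As a simplified sketch of the kind of rewriting involved (the hostnames and the function below are made up for illustration and are not our actual replay tooling), each recorded URL needs its host swapped for the target environment’s before the request is replayed, and requests with no mapping should not be sent at all:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical hostname mapping - the staging hostnames here are invented.
HOST_MAP = {
    "www.gov.uk": "www.example-staging.internal",
    "licensing.example.gov.uk": "licensing.example-staging.internal",
}


def rewrite_for_replay(url: str) -> str:
    """Point a recorded URL at the staging environment before replaying it.

    If a hostname is missing from the map, the replayed request would be
    sent to the live site - the class of problem described above - so we
    refuse to replay it instead.
    """
    parts = urlsplit(url)
    staging_host = HOST_MAP.get(parts.netloc)
    if staging_host is None:
        raise ValueError(f"no staging mapping for {parts.netloc}, refusing to replay")
    return urlunsplit((parts.scheme, staging_host, parts.path, parts.query, parts.fragment))
```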
The second cause of the problem involved how we tracked payments. When a licence type requires a payment, the user sees a page telling them this. While the replay problem was present, replayed visits to that page reached the live licensing application, creating a second payment tracking record in its database.
The URL for the page was called via an HTTP GET request. Calls to a page via a GET request should not change data, even if repeated - they should only return information to the user. This property is known as idempotency. In this case the GET request was not idempotent, as the code for creating the payment tracking record did not handle repeated requests correctly.
This meant that when user interactions with the licensing application (including payments) were replayed, users saw errors indicating that payments had been made twice.
When a payment was completed, only the first tracking record was marked as paid, while the duplicate record was not. We took steps to reassure users that only one payment had been taken per transaction.
Steps taken to prevent this from happening again
We’ve changed our licensing application so it only creates a payment tracking record if one does not already exist. We also fixed our scripts so that replayed traffic is not sent to the live version of the application.
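As a rough sketch of that first fix (our licensing application is not written in Python, and the table and column names here are invented for the example), creating the tracking record becomes a ‘find or create’ operation, so a repeated request returns the existing record rather than adding a duplicate:

```python
import sqlite3


def get_or_create_payment_tracking(db: sqlite3.Connection, licence_application_id: str) -> int:
    """Return the payment tracking record for a licence application,
    creating one only if none exists yet.

    Repeating the request is now harmless: a second call returns the
    existing record instead of adding a duplicate.
    """
    row = db.execute(
        "SELECT id FROM payment_tracking WHERE licence_application_id = ?",
        (licence_application_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = db.execute(
        "INSERT INTO payment_tracking (licence_application_id, paid) VALUES (?, 0)",
        (licence_application_id,),
    )
    db.commit()
    return cursor.lastrowid
```

In practice a unique database constraint on the licence application ID is also worth having, because two simultaneous requests could otherwise both pass the ‘does a record exist?’ check.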
5 July 2018 - attachment publication error
What users saw
From 4:30pm on 5 July until 2:00pm on 6 July, users attempting to view a newly published attachment could not see it, and were instead redirected to the draft host, which requires logging in.
Cause of the problem
A code change intended to optimise attachment uploads caused newly published attachments to become unavailable to users. When the second line team was told about the issue, the affected code was reverted and the affected attachments were manually updated so they were accessible again.
This revealed a second, more obscure bug: some of the affected attachments had not been properly uploaded at all. Most could be repaired by updating their draft state, but for some we needed to process the entire upload a second time. This second problem was not readily apparent and required some investigation before we found a solution.
Steps taken to prevent this from happening again
We decided to add an additional product review step into our workflow so that we can be sure new features do not break existing functionality.
We also planned to add an extra automated test to our end-to-end suite so that a similar problem will be caught before code is deployed.
The second bug was recorded a month earlier as an issue on GitHub, rather than being captured on our Trello board. This meant it was overlooked. We are exploring better ways of handling GitHub issues to ensure they are fixed.
21 July 2018 - issues with some publishing applications
What happened
From around 8:25am on 21 July 2018, publishers saw 500 error messages when trying to log in and publish new content using some of our publishing applications. This included those trying to publish new travel advice content using Travel Advice Publisher.
There was no impact on users visiting GOV.UK.
This was a severity 1 incident as it affected publishing of content that could be time-critical.
Cause of the problem
The root cause of the problem was a code deployment the previous day that unexpectedly led to high logging rates on all of our backend machines. This caused the log files to rapidly increase in size until all machines ran out of disk space, resulting in errors on the publishing applications hosted on these machines.
How we responded
As soon as we established that the errors were due to lack of disk space, we were able to manually delete the large log files from each machine in order to free up space. This restored access to the publishing applications, and we set about identifying the root cause.
The large log files we deleted belonged to one specific application, so we alerted the relevant development team, who were able to investigate the problem with more context. Once they identified the code change which had caused the problem, they were able to revert their changes and put a fix in place to stop it happening again.
Steps taken to prevent this from happening again
Although the root cause of this incident was a code bug, we’re still able to make improvements to help prevent similar problems happening again. We have made some immediate changes to improve our logging and we are investigating further possible changes to our log rotation system and storage of logs in the long term.
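As an illustration of what size-based rotation looks like (the path and limits below are placeholders rather than our production settings), a rotating handler caps both the size of the current log file and the number of old copies kept:

```python
import logging
from logging.handlers import RotatingFileHandler

# Placeholder path and limits - real values depend on the disk space available.
handler = RotatingFileHandler(
    "/var/log/example-app/app.log",  # hypothetical path
    maxBytes=50 * 1024 * 1024,       # rotate once the file reaches 50 MB
    backupCount=5,                    # keep at most 5 rotated files
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("example-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Even if a bug suddenly makes the application very chatty, the log files
# can never grow past maxBytes * (backupCount + 1) on disk.
```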
We also decided to expand our alerting system so that errors relating to disk space are flagged up to our on-call team if they become critical.
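A minimal sketch of that kind of check, with an illustrative threshold - in reality this sits inside our monitoring and alerting stack rather than a standalone script:

```python
import shutil

CRITICAL_FREE_PERCENT = 10  # illustrative threshold for paging the on-call team


def disk_space_status(path: str = "/") -> str:
    """Return 'critical' when free space falls below the threshold."""
    usage = shutil.disk_usage(path)
    free_percent = usage.free / usage.total * 100
    return "critical" if free_percent < CRITICAL_FREE_PERCENT else "ok"


if __name__ == "__main__":
    print(disk_space_status("/"))
```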