This post is about a severity 2 incident affecting the GOV.UK website. We routinely publish incident reports because we believe we should be open about our mistakes and share our learning. We’ve posted before about what happens when things go wrong on GOV.UK, and how we classify and prioritise incidents.
What happened
Between 4:00am on Saturday 21 January and 2:30pm on Monday 23 January, users were unable to download application forms for certain types of licences from GOV.UK - for example, when applying for a temporary event notice. Users were also unable to upload completed forms during that period.
What users saw
Users were presented with a ‘We're sorry, but something went wrong’ error message.
Usually, a user is directed to the relevant form by the GOV.UK licence finder tool and downloads it. They then fill in the form and upload it back to GOV.UK for processing. Neither the download nor the upload worked during the outage.
How we responded
We weren’t alerted to the problem over the weekend through our out-of-hours support process, but early on Monday morning we saw errors in our logs that led us to investigate.
Licence application forms on GOV.UK are hosted by a third-party provider. We raised the issue with them immediately.
By 2:30pm, the supplier had resolved the issue and explained what had happened. They had replaced an SSL certificate but had forgotten to refresh it on the application server hosting the document processing application, causing it to stop working. Once they redeployed the certificate and restarted the application, things started working as normal.
Steps taken to prevent this from happening again
The supplier added extra steps to their application deployment process to prevent a recurrence when replacing certificates in future.
At GOV.UK we clarified the escalation path for raising issues with our supplier and updated our operations playbook accordingly. We also discussed adding proactive monitoring to check the status of the supplier’s SSL certificate, but on reflection decided that wasn’t something we wanted to do.
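For readers wondering what a certificate status check of the kind we discussed might involve, here is a minimal sketch in Python. It is an illustration only and not something we run: the function names, the 14-day warning threshold and the idea of reporting OK/WARNING/CRITICAL strings are all assumptions made for this example. A check like this catches a failed TLS handshake or an approaching expiry date, but it would not necessarily catch every case where a backend application is still using an old certificate.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(cert: dict, now: datetime) -> float:
    """Days left before the certificate's notAfter date (negative if expired).

    `cert` is a dict in the shape returned by SSLSocket.getpeercert().
    """
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - now.timestamp()) / 86400


def fetch_certificate(host: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Open a verified TLS connection and return the server's certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def check_host(host: str, warn_days: int = 14) -> str:
    """Return a one-line status for a hypothetical supplier hostname."""
    try:
        cert = fetch_certificate(host)
    except (ssl.SSLError, OSError) as exc:
        # Covers invalid or expired certificates and unreachable hosts.
        return f"CRITICAL: TLS connection to {host} failed: {exc}"
    remaining = days_until_expiry(cert, datetime.now(timezone.utc))
    if remaining < warn_days:
        return f"WARNING: certificate for {host} expires in {remaining:.1f} days"
    return f"OK: certificate for {host} valid for {remaining:.1f} more days"
```

Calling check_host with the supplier's hostname from a scheduled job would give a simple status line that could feed an existing alerting process.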
Paul Heron is a delivery manager at GOV.UK. You can follow Paul on Twitter.
2 comments
Comment by Jamie Smith posted on
Have you considered simply monitoring outgoing links aren't broken?
Comment by Paul Heron posted on
Hi Jamie,
Thank you for your question. We've implemented proactive link checking across GOV.UK (https://github.com/alphagov/link-checker-api), although the licence application PDFs aren't covered by this solution. That said, we're looking at adding specific monitoring for the third party provider that hosts and processes these application PDFs.