This post is about a severity 2 incident that affected some of our Smart Answers, and how it was resolved. We believe in being open about our mistakes and sharing what we learn from them, which is why we publish these incident write-ups on the blog.
We’ve blogged in the past about what happens when things go wrong on GOV.UK, and also how we classify and prioritise incidents.
What happened
On Monday 15 May we updated Whitehall publisher - one of our apps that allows editors across Government to publish content to GOV.UK.
The update removed some code that was providing information to other applications - this was all ‘worldwide’ information, like the addresses of embassies. We mistakenly thought the information wasn't being used anywhere and was safe to remove.
This had an effect on Smart Answers, an app that provides a framework for creating question-and-answer pages. We use these pages to help users understand how specific Government policies may affect them - for example, Check your state pension age.
What users saw
The following day at 11:30am, a problem with some of the smart answers was reported to us. For example, the smart answer for getting married abroad in Bulgaria was showing a ‘service unavailable’ error.
The page was trying to get addresses for embassies around the world. It was not able to query the API from Whitehall publisher that provided this information.
We estimate a total of 319 pages were affected, possibly more. However, the number of users affected was relatively low. At most, 0.275 errors were served from the Smart Answers app per minute.
A total of 103 hits encountered the error in the period between midday on 15 May and 2pm on 16 May.
How we responded
We reinstated the worldwide services provided by Whitehall publisher, and Smart Answers was able to serve those pages again.
What we're doing to prevent this from happening again
We realised that our high-level ‘smoke’ tests (our most basic tests) failed us because they weren't looking at the affected parts of the Smart Answers app. We’d been testing pages earlier in the series of questions, but not the pages that included services from the Whitehall app. We now have tests that cover those areas.
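To illustrate the gap, here is a minimal sketch in Python of a smoke check that covers both an early page and a page deeper in the flow. It is not our actual test suite: the paths, the `run_smoke_checks` helper and the `fake_fetch_status` stand-in are all hypothetical, and the fake simply returns a 503 (‘service unavailable’) for the deeper page to mimic the incident.

```python
# Hypothetical paths, for illustration only.
EARLY_PAGE = "/example-smart-answer/y"
DEEP_PAGE = "/example-smart-answer/y/bulgaria"


def run_smoke_checks(paths, fetch_status):
    """Return the (path, status) pairs that did not respond with HTTP 200."""
    failures = []
    for path in paths:
        status = fetch_status(path)
        if status != 200:
            failures.append((path, status))
    return failures


def fake_fetch_status(path):
    """Stand-in for a real HTTP request: only the deeper page,
    which needs data from another app, fails with a 503."""
    return 503 if path == DEEP_PAGE else 200


# Checking only the early page passes, even though users further
# along the flow would see an error.
print(run_smoke_checks([EARLY_PAGE], fake_fetch_status))

# Including the deeper page exposes the failure.
print(run_smoke_checks([EARLY_PAGE, DEEP_PAGE], fake_fetch_status))
```

In a real smoke test, `fetch_status` would make an HTTP request against the live site rather than return canned values.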
We have also lowered our alerting thresholds. The number of errors was well below the previous setting, so no alert was triggered. With lower thresholds, similar incidents in the future should be detected.
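The effect of a threshold on detection can be sketched as follows. This is an illustration only, not our monitoring configuration: the threshold values, the window length and the function names are assumptions. The only figure taken from the incident is the peak rate of 0.275 errors per minute.

```python
from datetime import datetime, timedelta


def errors_per_minute(error_times, window_end, window_minutes):
    """Average number of errors per minute in the window ending at window_end."""
    window_start = window_end - timedelta(minutes=window_minutes)
    count = sum(1 for t in error_times if window_start < t <= window_end)
    return count / window_minutes


def should_alert(rate, threshold):
    """Alert when the error rate reaches the threshold."""
    return rate >= threshold


# Hypothetical thresholds (errors per minute); the real values are not given here.
OLD_THRESHOLD = 1.0
NEW_THRESHOLD = 0.2

# 11 errors spread over a 40-minute window gives 0.275 errors per minute,
# matching the peak rate seen during the incident.
window_end = datetime(2017, 5, 16, 11, 30)
error_times = [window_end - timedelta(minutes=3 * i) for i in range(11)]
rate = errors_per_minute(error_times, window_end, window_minutes=40)

print(should_alert(rate, OLD_THRESHOLD))  # the higher threshold stays silent
print(should_alert(rate, NEW_THRESHOLD))  # the lower threshold raises an alert
```

The trade-off with a lower threshold is more sensitivity to low-traffic failures at the cost of a greater chance of noisy alerts.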
Finally, we have tried to handle more gracefully the cases where services outside Smart Answers are unavailable. For example, in this case we could use the Countries register instead.
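A fallback of this kind might look like the sketch below. It is a simplified illustration in Python, not the Smart Answers code: `ServiceUnavailable`, `lookup_with_fallback` and both lookup functions are hypothetical stand-ins, and no real API is called.

```python
class ServiceUnavailable(Exception):
    """Raised by a lookup when its backing service cannot be reached."""


def lookup_with_fallback(primary, fallback, key):
    """Try the primary source first; use the fallback only if it is unavailable."""
    try:
        return primary(key)
    except ServiceUnavailable:
        return fallback(key)


def worldwide_api_lookup(slug):
    """Stand-in for the primary source; here it always fails, as in the incident."""
    raise ServiceUnavailable("worldwide API is unavailable")


def countries_register_lookup(slug):
    """Stand-in for a secondary source holding a subset of the same data."""
    names = {"bulgaria": "Bulgaria"}
    return names[slug]


print(lookup_with_fallback(worldwide_api_lookup, countries_register_lookup, "bulgaria"))
```

Only the specific ‘unavailable’ error triggers the fallback, so other failures, such as an unknown country, still surface rather than being silently hidden.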
We also spotted that we rely on a service called Imminence, which provides structured geographical data, such as postcodes. We have added a more informative error message for problems with fetching information about postcodes. Users will now see a helpful error message instead of a mostly blank screen.
Kelvin is a developer on GOV.UK. You can follow him on Twitter.