In February, GOV.UK had 2 consecutive days of errors affecting applications that utilise our publishing platform. This was a severity 2 incident.
This post explains the cause of the issue, the effects and the steps we are taking to resolve it. We’ve blogged in the past about what happens when things go wrong on GOV.UK, and also how we classify and prioritise incidents.
Background
For the past 2 years we’ve been rebuilding the publishing platform of GOV.UK. One of the important distinctions of this changed architecture is that we have a publishing API, which publishing applications communicate with, and a content store, which the applications that render content (frontend applications) communicate with. Frontend applications do not communicate with the publishing API, and publishing applications do not communicate with the content store.
One of our largest publishing applications, Whitehall, is also a frontend application, which therefore crosses the boundaries of these responsibilities. We are currently working to migrate this to be separate publishing and frontend applications, but while this process is underway it has been using the publishing API to render some types of frontend content.
What happened
On the morning of 28 February 2017 we were alerted by our monitoring system that a high number of pages were being served with error codes. The errors were due to the publishing API responding slowly, as it was overwhelmed by a large spike in requests, and the responses were timing out. We restarted the publishing API application, which improved the speed of the responses and resolved the problem.
We investigated the root cause of the error and how we could mitigate a similar problem occurring. We deployed an optimisation to the publishing API, targeting the part that had responded slowly under heavy traffic.
The following morning we received a support ticket at 7.10am indicating there was a problem publishing content using Whitehall publisher. The problem had started at 5.00am and had a very similar pattern to the previous day. It was determined that our optimisation was not helping sufficiently.
On this occasion restarting the application was not sufficient to resume normal operations. The technical team instead reduced load on the publishing API by switching Whitehall frontend traffic to be served from a cache, while a further optimisation was deployed to the publishing API. When Whitehall took over from the cache the requests to the publishing API no longer timed out and operations resumed. As this had caused major functionality problems with publishing, this incident was categorised as severity 2.
On 1 March we considered a variety of options to ensure the error did not occur on the subsequent day and to have people actively available if it did. We monitored the logs from before 5.00am on 2 March and there was not a subsequent occurrence of the incident.
What users saw
On GOV.UK
Approximately 5,000 requests to GOV.UK resulted in errors, although most users were unlikely to have seen a problem because requests for HTML pages were served by the cache. Some users who were using feed readers to access GOV.UK might have seen an error.
Government publishing
Users who were trying to publish content during this incident suffered a variety of problems, ranging from being unable to access Specialist Publisher to not seeing expected changes on the GOV.UK site while publishing with Whitehall. This was particularly significant as there were a number of time-critical items to be published on 1 March.
How this occurred
One of the peculiarities of this incident was that it started at approximately the same time - 5.00am - 2 days in a row with a pattern of a spike in requests. This is one of the quietest times in the day for GOV.UK so this seemed quite unusual.
Further investigation revealed that GOV.UK was not actually seeing any different pattern in requests - instead the spike was only to the publishing API. Our logs revealed that every request to affected Whitehall frontend pages was generating multiple requests to the publishing API.
The Whitehall frontend application is supposed to query the publishing API once for a single request, then cache the results for 5 minutes. But because the API was taking too long to respond, the results weren’t cached, causing more requests to be sent to the publishing API. This caused a snowball effect with each request to Whitehall resulting in hundreds or thousands of requests being sent to the publishing API per minute.
A scheduled job was completing at 5.00am which was causing sufficient load on the publishing API to result in the initial slow requests leading to the snowball effect.
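The snowball effect can be illustrated with a toy model. The sketch below is not Whitehall's code: the class name, the one-request-per-second traffic and the 2 second timeout are all assumptions made for illustration. Only the 5 minute cache lifetime comes from the description above. It shows that when the upstream call times out, nothing is stored, so every later request goes upstream again.

```python
from dataclasses import dataclass
from typing import Optional

CACHE_TTL_SECONDS = 300  # results are meant to be cached for 5 minutes
TIMEOUT_SECONDS = 2.0    # hypothetical client timeout, chosen for illustration


@dataclass
class CachingClient:
    """Toy model of a frontend that caches a single upstream response."""

    cached_at: Optional[float] = None
    upstream_calls: int = 0

    def handle(self, now: float, upstream_latency: float) -> bool:
        """Serve one request at time `now`; return True if it succeeded."""
        if self.cached_at is not None and now - self.cached_at < CACHE_TTL_SECONDS:
            return True  # cache hit: no upstream call needed
        self.upstream_calls += 1
        if upstream_latency > TIMEOUT_SECONDS:
            # The call timed out, so nothing is cached and the
            # next request will go upstream again.
            return False
        self.cached_at = now
        return True


def simulate(upstream_latency: float, requests: int = 600) -> int:
    """Send one request per second and count the upstream calls made."""
    client = CachingClient()
    for second in range(requests):
        client.handle(float(second), upstream_latency)
    return client.upstream_calls


healthy = simulate(upstream_latency=0.1)  # upstream responds within the timeout
slow = simulate(upstream_latency=5.0)     # upstream responds after the timeout
print(healthy, slow)
```

In this model a healthy upstream is called only when the cache expires, while a slow upstream is called on every request, which in turn keeps it slow.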
What we’re doing to prevent this from happening again
The initial step of optimising the cache lookup on the publishing API has been sufficient to prevent this error from occurring. However, we have used this as an opportunity to re-architect Whitehall frontend so that it no longer uses the publishing API. This separation will mean that problems on GOV.UK can no longer break the publishing API and, similarly, that problems with the publishing API can no longer break GOV.UK.
We have learnt from this experience that we have opportunities to improve our monitoring of the publishing platform. We discovered these problems from the effects they were causing to GOV.UK, where the effects were mostly unnoticeable to GOV.UK users, while we weren’t notified automatically about the severe problems publishers were having.
Finally, we have moved our publishing API to a hosting architecture with increased capacity.
Kevin Dew is a tech lead at GOV.UK.