https://insidegovuk.blog.gov.uk/2016/01/28/incident-report-504-gateway-timeouts-in-publishing-applications/

Incident report: 504 Gateway timeouts in publishing applications

This post is about a recent incident relating to the GOV.UK website. For more information on incidents on GOV.UK, you can check the status of the GOV.UK site or read more about what happens when things go wrong on GOV.UK.

What users saw

Between 1 and 10 December, content designers in GDS and in departments experienced intermittent ‘504 Gateway timeout’ errors when using some publishing applications. Users of the GOV.UK website were not directly affected.

Cause of the problem

As part of work to improve our disaster recovery (DR) capability, we added a 4th node to a database cluster (MongoDB) used by several of our publishing applications. This node was accessed through a virtual private network (VPN).

Publishing applications are configured to use only the 3 MongoDB nodes that are not accessed via the VPN. However, MongoDB drivers have an auto-discovery feature, so applications could still find and make requests against all 4 nodes.
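To illustrate why a configured list of nodes does not limit what an application talks to, here is a deliberately simplified Python model of driver auto-discovery. It is a sketch and not the code of any real driver: the function name and hostnames are hypothetical, and the only assumption taken from MongoDB itself is that a node's reply to the driver's handshake lists the other non-hidden members in a "hosts" field.

```python
def discover_members(seed_list, handshake_reply):
    """Return every host a driver would know about after discovery.

    Simplified model: start from the hosts the application was
    configured with, then add whatever the cluster itself reports.
    """
    known = set(seed_list)
    # The cluster advertises its non-hidden members, including ones
    # the application was never configured to use.
    known.update(handshake_reply.get("hosts", []))
    return known


# Hypothetical hostnames, for illustration only.
configured = ["mongo-1:27017", "mongo-2:27017", "mongo-3:27017"]
reply = {"hosts": configured + ["mongo-dr-1:27017"]}

for host in sorted(discover_members(configured, reply)):
    print(host)
```

In this model the fourth node appears in the result even though it was never in the configured list, which is the situation described above.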

At the same time, an unrelated problem with the VPN caused the connection to drop occasionally. When the connection dropped, the new MongoDB node couldn’t be accessed and publishing applications timed out while trying to write to it. This resulted in ‘504 Gateway timeout’ errors in the publishing applications.

Graph showing VPN availability and slow response times for contentapi, an application which uses MongoDB.

How we are preventing this from happening again

We've made the new node a hidden member of the cluster so it's no longer discoverable by applications.
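A minimal Python sketch of what hiding a member involves, assuming the cluster configuration is available as a dictionary in the shape MongoDB uses for replica set configuration ("version" plus a "members" list). The helper name and hostnames are hypothetical, and this is not necessarily how the change was applied on GOV.UK. MongoDB requires a hidden member to have priority 0, so the sketch sets both fields.

```python
import copy


def hide_member(config, host):
    """Return a copy of a replica set config with `host` marked hidden."""
    new_config = copy.deepcopy(config)
    for member in new_config["members"]:
        if member["host"] == host:
            member["hidden"] = True
            # A hidden member must have priority 0, so it can never
            # be elected primary.
            member["priority"] = 0
            break
    else:
        raise ValueError(f"no member with host {host!r}")
    # A reconfiguration is submitted with a higher version number.
    new_config["version"] = config["version"] + 1
    return new_config


# Hypothetical configuration, for illustration only.
current = {
    "_id": "rs0",
    "version": 4,
    "members": [
        {"_id": 0, "host": "mongo-1:27017"},
        {"_id": 1, "host": "mongo-2:27017"},
        {"_id": 2, "host": "mongo-3:27017"},
        {"_id": 3, "host": "mongo-dr-1:27017"},
    ],
}
updated = hide_member(current, "mongo-dr-1:27017")
print(updated["members"][3])
```

With a driver such as PyMongo, a configuration like this could be fetched with the replSetGetConfig command and applied with replSetReconfig against the primary. Once a member is hidden, it is left out of the member list that the cluster advertises to drivers, so applications stop discovering it.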

Other improvements we're going to make

We've improved the stability of the VPN.

We're improving our alerts on individual backend applications which have a high number of 5XX errors.