This post is about a recent incident relating to the GOV.UK website. For more information on incidents on GOV.UK, you can check the status of the GOV.UK site or read more about what happens when things go wrong on GOV.UK.
What users saw
Between 1 and 10 December, content designers in GDS and in departments experienced intermittent ‘504 Gateway timeout’ errors when using some publishing applications. Users of the GOV.UK website were not directly affected.
Cause of the problem
As part of work to improve our disaster recovery (DR) capability, we added a 4th node to a database cluster (MongoDB) used by several of our publishing applications. This node was accessed through a virtual private network (VPN).
Publishing applications are configured to use only the 3 MongoDB nodes that are not accessed through the VPN. However, MongoDB drivers automatically discover the other members of a cluster, so applications could find and make requests against all 4 MongoDB nodes.
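To illustrate how this can happen, here is a deliberately simplified Python model of driver auto-discovery. It is a sketch, not the behaviour of any particular driver: the host names, the `discover` function and the reported host list are all hypothetical, and real drivers work from the membership information returned by the cluster itself.

```python
# Hypothetical host names: the real addresses are not given in this post.
# These are the 3 nodes an application is configured with (its "seed list").
SEED_LIST = ["mongo-1:27017", "mongo-2:27017", "mongo-3:27017"]

# What the cluster might report about its own membership once the
# 4th node (reached through the VPN) has been added as a normal member.
REPORTED_HOSTS = [
    "mongo-1:27017",
    "mongo-2:27017",
    "mongo-3:27017",
    "mongo-dr-1:27017",  # hypothetical name for the new DR node
]


def discover(seed_list, reported_hosts):
    """Return the hosts a client ends up knowing about after discovery.

    Simplified model: start from the configured seeds, then add every
    member that the cluster reports, whether or not it was configured.
    """
    known = list(seed_list)
    for host in reported_hosts:
        if host not in known:
            known.append(host)
    return known


hosts_in_use = discover(SEED_LIST, REPORTED_HOSTS)
print(hosts_in_use)
```

In this model the application ends up with 4 hosts even though only 3 were configured, which is the situation described above.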
At the same time, the VPN had an unrelated problem that caused the connection to drop occasionally. When the connection dropped, the new MongoDB node couldn’t be accessed and publishing applications timed out while trying to write to it. This resulted in ‘504 Gateway timeout’ errors in the publishing applications.
How we are preventing this from happening again
We've made the new node a hidden member of the cluster so it's no longer discoverable by applications.
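As a rough sketch of what this change involves, the Python below edits a replica set configuration document so that one member is hidden. The host names and the `hide_member` helper are hypothetical and the configuration shown is a cut-down example, not our real one. In MongoDB, a hidden member must also have a priority of 0, so it can never be elected primary.

```python
import copy


def hide_member(config, host):
    """Return a copy of a replica set config with `host` marked as hidden."""
    new_config = copy.deepcopy(config)
    for member in new_config["members"]:
        if member["host"] == host:
            member["hidden"] = True
            member["priority"] = 0  # hidden members must have priority 0
            break
    else:
        raise ValueError(f"no member with host {host!r}")
    # A changed config is normally submitted with a higher version number.
    new_config["version"] += 1
    return new_config


# Cut-down, hypothetical configuration document.
current = {
    "_id": "rs0",
    "version": 7,
    "members": [
        {"_id": 0, "host": "mongo-1:27017", "priority": 1},
        {"_id": 1, "host": "mongo-2:27017", "priority": 1},
        {"_id": 2, "host": "mongo-3:27017", "priority": 1},
        {"_id": 3, "host": "mongo-dr-1:27017", "priority": 1},
    ],
}

updated = hide_member(current, "mongo-dr-1:27017")
print(updated["members"][3])
```

The updated document would then be applied to the cluster with a reconfiguration command, for example `rs.reconfig()` in the MongoDB shell. The hidden member continues to replicate data, but it is no longer advertised to client applications.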
Other improvements we're making
We've improved the stability of the VPN.
We're improving our alerts on individual backend applications which have a high number of 5XX errors.
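A minimal sketch of the idea behind a per-application alert is shown below. It is illustrative only: the application names, the sample data, the threshold and the `apps_over_threshold` function are assumptions, and it does not describe our actual monitoring setup.

```python
from collections import Counter


def apps_over_threshold(requests, threshold):
    """Return the apps whose count of 5XX responses exceeds `threshold`.

    `requests` is an iterable of (app_name, status_code) pairs.
    """
    errors = Counter(
        app for app, status in requests if 500 <= status <= 599
    )
    return sorted(app for app, count in errors.items() if count > threshold)


# Hypothetical sample of recent responses from two backend applications.
sample = [
    ("publisher-a", 200),
    ("publisher-a", 504),
    ("publisher-a", 504),
    ("publisher-a", 504),
    ("publisher-b", 200),
    ("publisher-b", 500),
]

alerts = apps_over_threshold(sample, threshold=2)
print(alerts)
```

Counting errors for each application separately means a problem affecting one publishing application is not hidden by healthy traffic to the others.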