On 5 January 2022, some of the servers that host GOV.UK experienced problems connecting to each other. This meant that a small number of users (around 0.5%) saw error pages when visiting GOV.UK. The errors were more common on pages with dynamic content, such as features that allow users to enter a postcode in order to find location-based information.
Servers are computer hardware or software that provide a service: for instance, they can produce web pages or hold data. GOV.UK’s servers are ‘virtual’, rather than physical computers sitting in a room somewhere. This means we can add and remove servers easily depending on demand, and some maintenance can be handled by our cloud provider.
What users saw
Affected visitors to GOV.UK saw an error page with a “Sorry, we’re experiencing technical difficulties” heading.
How we dealt with the problem
Our technical support team received alerts that there was a higher-than-usual error rate. We declared it an incident and began to follow our established processes, adding an update to GOV.UK’s status page. The priority then was to reduce the impact on users, knowing that the underlying cause could be determined and managed later.
We discovered that the servers that present the user-facing GOV.UK website were having difficulty connecting to some of the servers that store the content. There are several of these servers in a group, to spread the load. All of the servers hold copies of the content, and our system is configured to try multiple times across the group. This built-in resilience is why only a very small percentage of users saw an error.
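To illustrate the idea of trying multiple times across a group, here is a minimal Python sketch. It is not GOV.UK's actual configuration: the function name `fetch_with_retries`, the `BackendError` exception, the `fetch` callback and the limit of three attempts are all assumptions made for the example.

```python
import random
from typing import Callable, Optional, Sequence


class BackendError(Exception):
    """Raised when a content server cannot be reached (hypothetical)."""


def fetch_with_retries(
    path: str,
    servers: Sequence[str],
    fetch: Callable[[str, str], str],
    max_attempts: int = 3,  # assumed limit, not taken from the incident report
) -> str:
    """Ask up to max_attempts different servers for the same content."""
    # Shuffle a copy so the load is spread across the group.
    candidates = list(servers)
    random.shuffle(candidates)

    last_error: Optional[BackendError] = None
    for server in candidates[:max_attempts]:
        try:
            # Every server holds a copy of the content, so any one will do.
            return fetch(server, path)
        except BackendError as error:
            # This server is unreachable; move on to the next one.
            last_error = error

    # Only when every attempt fails does the user see an error page.
    raise BackendError(f"all attempts failed for {path}") from last_error
```

In this sketch a request only fails if every server it tries is unreachable, which is consistent with a few faulty servers in a larger group producing errors for only a small share of users.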
We confirmed that there weren’t any known availability issues with our cloud provider, and so determined that the fault must be at the server level.
After identifying the faulty servers, we removed them from the group and replaced them with new ones. We then continued to monitor the situation carefully and saw the errors drop to zero, and so the incident was marked as resolved after one hour.
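The remove-and-replace step can be sketched in a few lines of Python. This is only an illustration under assumptions: `replace_faulty`, `is_healthy` and `launch_replacement` are hypothetical names, and a real cloud platform would do this through its own tooling.

```python
from typing import Callable, List


def replace_faulty(
    group: List[str],
    is_healthy: Callable[[str], bool],
    launch_replacement: Callable[[], str],
) -> List[str]:
    """Return a group of the same size with faulty servers swapped out."""
    # Keep only the servers that pass the health check.
    healthy = [server for server in group if is_healthy(server)]

    # Start one new server for each faulty one that was removed.
    missing = len(group) - len(healthy)
    replacements = [launch_replacement() for _ in range(missing)]

    return healthy + replacements
```

Keeping the group the same size means the remaining servers do not have to absorb extra load while the replacements start up.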
As is usual after any incident like this, we held an ‘Incident Review’ meeting. This included a detailed look at what happened, a discussion of possible root causes, follow-up actions and any possible improvements to the incident process itself.
We also asked our cloud provider for their thoughts on the cause of the errors.
We concluded the incident was caused by a bug deep in the servers’ operating system, affecting how they store information about which other servers they should connect to.
We stop and start servers regularly to cope with changes in demand and to fix issues. Each time a new one starts up, it gets its own address on the network. This is called the ‘IP address’ (10.13.4.67 in the example shown). Once a server has been stopped, the other servers should forget its old address.
The bug means that servers sometimes store out-of-date addresses that no longer work. It’s a bit like not updating your address book and continuing to send birthday cards to a friend who moved years ago.
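The address book idea can be sketched in Python. This is a simplified model rather than the operating system's real behaviour: the `AddressCache` class, its method names and the 60-second maximum age are assumptions for illustration. A correctly working cache forgets an entry once it is too old, and the `forget` method stands in for removing an out-of-date address manually.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    ip_address: str
    stored_at: float  # time the address was written down, in seconds


class AddressCache:
    """A toy address book mapping server names to IP addresses."""

    def __init__(self, max_age_seconds: float) -> None:
        self.max_age_seconds = max_age_seconds
        self._entries: Dict[str, Entry] = {}

    def remember(self, server: str, ip_address: str, now: float) -> None:
        """Write down (or overwrite) the address for a server."""
        self._entries[server] = Entry(ip_address, now)

    def lookup(self, server: str, now: float) -> Optional[str]:
        """Return the stored address, or None if unknown or too old."""
        entry = self._entries.get(server)
        if entry is None:
            return None
        if now - entry.stored_at > self.max_age_seconds:
            # The entry is out of date, so forget it rather than use it.
            del self._entries[server]
            return None
        return entry.ip_address

    def forget(self, server: str) -> None:
        """Manually remove an address, whether or not it has expired."""
        self._entries.pop(server, None)


cache = AddressCache(max_age_seconds=60)  # assumed maximum age
cache.remember("content-1", "10.13.4.67", now=0)
fresh = cache.lookup("content-1", now=30)    # still within 60 seconds
expired = cache.lookup("content-1", now=61)  # too old, so it is forgotten
```

In terms of this model, the bug behaves as if the age check in `lookup` were sometimes skipped, so a server keeps using an address that no longer belongs to anything.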
We’ve improved our technical documentation on removing and replacing faulty servers, and how to remove out-of-date server addresses manually. This will make it even easier and quicker to deal with a similar problem in the future. We’re also planning to update the operating system on the servers to a version that doesn’t have the same bug. This is part of our wider work to improve the platform that GOV.UK runs on.
We'll shortly be publishing a more detailed look at what went wrong and how we fixed it, on Technology in Government.