GOV.UK is an essential part of living, studying and working in the UK. For its millions of users, GOV.UK appears as a single website, where people can move seamlessly from one page or service to another.
However, behind the scenes, GOV.UK is made up of many applications that work together to produce the public-facing website. Between these applications and the computer hardware that runs them, there is system software that acts as a go-between. We sometimes call this system software the "platform" on which GOV.UK runs.
With a system software end-of-life deadline approaching, we took the opportunity to evaluate what platform was best for GOV.UK and its needs.
This was GOV.UK's largest software infrastructure project since its launch. This blog post describes how we decided what to do, how we did it and ultimately how we made GOV.UK more secure, cheaper to run and easier to scale. By doing this, we directly contributed to GDS’s mission “to make digital government simpler, clearer and faster for everyone”.
Why we decided to modernise
The software that runs the GOV.UK website makes use of other software "underneath" it, such as an operating system (OS) and some configuration management software to manage and keep track of changes. From 2014 to 2023, our OS was Ubuntu, and we managed our configuration with Puppet.
By 2021, this infrastructure software was approaching the end of its supportable lifespan. Though we still had commercial support in place, running an old operating system meant we were spending ever more engineering effort working around compatibility issues whenever we needed to update other software that GOV.UK depends on, such as Ruby on Rails.
We considered whether to upgrade to recent versions of Ubuntu and Puppet (running on virtual machines), or to take the opportunity to modernise by running GOV.UK in containers.
For GOV.UK, the main advantages of containers are scalability and lower maintenance. We were using complex automation to deploy many different applications onto the same set of virtual machines, which made it difficult to add more machines ("scaling up") in response to increased website traffic. We were also spending a lot of engineering time updating the software on these virtual machines.
Moving to containers means that when we need to respond to surges of traffic, we can add capacity easily. During our work on COVID-19, we saw how important it was to be able to withstand these traffic spikes.
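The blog doesn't name the container platform, but container orchestrators such as Kubernetes typically add capacity with a simple proportional rule: scale the replica count by the ratio of observed load to target load, clamped to configured bounds. A minimal sketch of that rule (the function, its parameters and all numbers here are illustrative, not GOV.UK's actual configuration):

```python
import math

def desired_replicas(current_replicas: int, current_load: float,
                     target_load: float,
                     min_replicas: int = 2, max_replicas: int = 30) -> int:
    """Proportional scaling rule in the style of a Kubernetes autoscaler:
    grow or shrink the replica count by the ratio of observed load to
    target load, clamped to the configured minimum and maximum."""
    raw = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, raw))

# A traffic spike doubles per-replica load, so capacity roughly doubles:
print(desired_replicas(current_replicas=5, current_load=200, target_load=100))  # → 10
```

With containers, applying the new replica count is a single scheduling operation rather than provisioning and configuring new virtual machines.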
Upgrading the existing infrastructure would have solved our short-term problem, but would have represented a much smaller return on investment and taken at least as much effort as containerisation.
How the team worked
The team of engineers had varying degrees of experience working on projects like this, and had to get up to speed with a new technology to progress with the project. To establish healthy ways of working we, as team leads:
- assigned workstream leads, who were responsible for prioritising their workstream backlog and running kick-off sessions with the entire team to help spread knowledge - this gave everyone the chance to ask questions about the approach and challenge any assumptions
- implemented a pairing-by-default way of working to make sure engineers could learn from each other
- protected the team from any time-intensive activities that weren’t directly related to the delivery of the work
- adapted our regular updates as needed, by holding them less frequently and making better use of "asynchronous" communication such as Slack, which meant there was more flexibility in managing workloads
Testing ahead of go-live
By giving engineers responsibility for managing and prioritising their own stream of work and protecting them from other distractions, we gave them the time and space to try things out. This also helped build their confidence to work autonomously.
One of our engineers suggested a bold idea that would prove a catalyst for the project's success: what if we performed a trial run on the real, live website? At first this sounds like an unnecessary risk: why not wait until the whole system was fully working in a test environment?
By limiting the scope of our trial run to just those components of GOV.UK that serve web pages to the public (as distinct from the other parts that help people update the content of the site), we were able to:
- focus on a subset of the problem and a nearer-term goal, providing an important milestone ahead of go-live
- build confidence amongst our management stakeholders
- discover some minor issues with the website that might otherwise have interfered with our eventual launch
During our trial runs, we solved problems as a team by mobbing on them. Mob programming further helped build trust and confidence among individuals. Working through minor issues together in a safe environment made people more willing to experiment with different solutions.
Once the team was fully confident everything was in place, we selected a week to switch over when there weren't any big planned government announcements. We reviewed our runbook and rollback plans to ensure they were accurate and up to date. We made a list of useful contacts, such as technical support and escalation contacts. We informed publishers and other stakeholders, allowing them time to ask questions.
On the day, we mobbed on the go-live activities and kept a log of status updates. We kept our stakeholders updated throughout.
The launch went smoothly and the public-facing website continued working without any visible change, just as we had hoped.
We've seen a positive impact on the organisation and our users, for example:
- we can now scale GOV.UK more quickly in response to traffic surges
- the new platform is cheaper to run
- it’s much quicker and easier to roll out security patches
- it's easier for us to make changes to applications and configuration
- we've eliminated a lot of repetitive maintenance work for GOV.UK's developers, so that they have more time to spend on valuable features and enhancements
- running modern software means we can recruit engineers from a larger pool of talent
In the GOV.UK Platform Engineering team, we're working hard to get rid of a few behind-the-scenes odds and ends that are still running on the old Ubuntu/Puppet infrastructure so that we can finally switch it off. These are things like:
- process automation that helps developers perform occasional maintenance tasks
- the daily processes that automatically test that we can successfully restore from backups
Once we've done that, we'll realise the rest of the value and savings by switching off the old infrastructure and no longer having to maintain it.
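The daily restore checks mentioned above follow a common pattern: restore the most recent backup into a throwaway location, then verify the restored data matches the original, all without touching live state. A minimal sketch of that pattern (the JSON file format and function names are hypothetical illustrations, not GOV.UK's actual backup tooling, which works with database snapshots):

```python
import json
import tempfile
from pathlib import Path

def back_up(data: dict, backup_path: Path) -> None:
    """Write a snapshot of the data to the backup location."""
    backup_path.write_text(json.dumps(data, sort_keys=True))

def verify_restore(original: dict, backup_path: Path) -> bool:
    """Restore the backup into a throwaway directory and check that it
    round-trips to the original data, without touching live state."""
    with tempfile.TemporaryDirectory() as scratch:
        restored_path = Path(scratch) / "restored.json"
        restored_path.write_text(backup_path.read_text())  # the "restore" step
        restored = json.loads(restored_path.read_text())
        return restored == original

with tempfile.TemporaryDirectory() as d:
    backup = Path(d) / "backup.json"
    content = {"page": "/bank-holidays", "version": 3}
    back_up(content, backup)
    print(verify_restore(content, backup))  # → True
```

Running a check like this on a schedule turns "we have backups" into "we know our backups restore", which is the property that matters when something goes wrong.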
Now that GOV.UK is hosted on a modern, container-based infrastructure, there are a lot of exciting improvements that we can make to further increase developer productivity, reduce running costs and add even more resilience. Many of these things would have been difficult or expensive to achieve with the previous system. For example, we could:
- enable developers to run a complete and accurate replica of GOV.UK "locally" on their workstations so that they can find and fix bugs more quickly and easily
- improve GOV.UK's suite of automated tests so that developers receive near-instant feedback if, for example, their change would break an important end-to-end user journey such as publishing a change to an article and seeing the change appear on the website
- further enhance GOV.UK's resilience by hosting in more geographic regions or cloud providers