The Platform Health team maintains and operates the GOV.UK platform, made up of about 47 applications in 2 environments, serving around 3 million users every day. The team’s job is to know what’s healthy, what’s not, and how to fix things.
The team makes changes to applications, fixes bugs, tweaks machines, monitors the performance of everything and – for all intents and purposes – acts a bit like a hospital for the government’s home on the web.
Adam joined Platform Health at the start of the year as an Associate Product Manager, but GDS colleagues Steve and Martin were the team’s original Product Managers – managing the operational process of running multiple applications throughout the platform’s entire life-cycle. This meant making priority calls on when to do proactive work to improve the health of the platform versus reactively fixing bugs as they arose.
Why we needed a framework
Not long after the team formed, Platform Health had a huge list of bugs building up. On the one hand we needed to make the publishing pipeline more efficient and get a handle on our error alerting, but we also needed to ensure that the platform continued working for users. It really wasn’t clear what we should fix first – we just trusted our instincts.
That’s never a great idea, as you don’t want to waste time on work you needn’t do. Since the list of bugs was growing, our strategy for paying it down was to fix 1 to 3 bugs per sprint, alongside all our other work on the infrastructure, iterating products and improving security. At that rate, we reckoned we’d clear the list within a few weeks.
So we wanted to fix up to 3 bugs per sprint – but which ones should we fix first? Like any good product manager, we needed a framework.
How we designed the framework
In 2018, Steve and Martin sat down to work out some factors for scoring bugs to determine how quickly they should be fixed. There were a few different triggers that would cause us to work on a bug sooner:
- User impact: if the bug had implications for a sensitive part of the platform – for example, if it was providing incorrect rates for holiday pay that could negatively impact a user’s life – we would want to fix the bug sooner, as its effect was severe
- Platform operation: similarly, if the product was critical to the functioning of the platform, we’d consider it a severe bug that needed to be fixed sooner
- Reach: and if a large number of people would be affected by the bug – the ‘reach’ – we’d certainly have to jump on it quickly
Urgency and severity are good ways to measure priority – but a known workaround can buy some leeway.
Improving the health of the platform
Thinking about our team’s objective of improving the health of the platform, Steve and Martin also wanted to consider whether fixing a bug would advance our goals or pay down tech debt. This would help the platform continue functioning without degrading in quality.
If it was a recurring problem, we should aim to fix it properly. ‘Cost of delay’ puts a figure on how much time and effort it costs to not solve the bug. For example, if a bug adds an extra 10 minutes on to someone’s normal process, that time builds up over a month and can slow a team down.
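As a rough worked example of that build-up (the 21 working days a month is our assumption, not a figure from the team):

```python
# Rough cost-of-delay arithmetic from the example above: an extra
# 10 minutes on someone's normal daily process, assuming about
# 21 working days in a month (an assumption for illustration).
extra_minutes_per_day = 10
working_days_per_month = 21
hours_lost_per_month = extra_minutes_per_day * working_days_per_month / 60
print(f"{hours_lost_per_month:.1f} hours lost per month")  # 3.5 hours lost per month
```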
And if something could be solved quickly, it’s worth considering acting on it sooner.
Putting our ideas into practice
Now that we had all the factors, we needed a formula to combine them into a priority score. The product managers took inspiration from Intercom’s RICE prioritisation framework, weighed up the factors and came up with the following calculations:
(Reach x Severity) x Workarounds = Problem score
If the bug affects many users and is also severe, it gets a high problem score. But if there’s a workaround, that reduces the score. We set the reach from 1 to 3, the severity from 1 to 3, and if a workaround existed we multiplied by 0.5. This gave us a problem score.
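To make that concrete, here’s a minimal sketch of the problem score in Python – the function name and example values are ours for illustration, not the team’s actual tooling:

```python
def problem_score(reach: int, severity: int, has_workaround: bool) -> float:
    """(Reach x Severity) x Workarounds: reach and severity are each
    scored 1 to 3, and a known workaround halves the result."""
    score = reach * severity
    if has_workaround:
        score *= 0.5  # a workaround buys some leeway
    return score

# A severe bug affecting most users, with no workaround, scores the maximum:
print(problem_score(reach=3, severity=3, has_workaround=False))  # 9.0
# The same bug with a documented workaround drops to 4.5:
print(problem_score(reach=3, severity=3, has_workaround=True))   # 4.5
```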
Problem score + cost of delay + long-term value = Priority score
Having an objectively rated problem was great, but we still needed to know how soon to work on the bug. We figured that adding the cost of delay (measured in hours, scored from 0 to 9) and the long-term value factor (0 to 2, where 0 means there’s no long-term value – which we defined as being beneficial to our systems for at least 12 to 18 months) to the problem score would give us a priority score. Any bug with a high priority score should be worked on sooner.
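Continuing the sketch, the priority score is then a simple sum, and sorting by it ranks a backlog – the bugs below are hypothetical, with made-up scores that loosely echo the examples in the next paragraph:

```python
def priority_score(problem: float, cost_of_delay: int, long_term_value: int) -> float:
    """Problem score + cost of delay (0-9, in hours) + long-term value (0-2)."""
    return problem + cost_of_delay + long_term_value

# A hypothetical backlog, ranked highest priority first:
backlog = [
    ("Security flaw", priority_score(problem=9.0, cost_of_delay=9, long_term_value=2)),
    ("Odd PDF attachment glitch", priority_score(problem=1.0, cost_of_delay=0, long_term_value=0)),
    ("Slow publishing pipeline", priority_score(problem=4.5, cost_of_delay=6, long_term_value=2)),
]
for name, score in sorted(backlog, key=lambda bug: bug[1], reverse=True):
    print(f"{score:>4.1f}  {name}")
```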
We ran the calculations on our backlog of unprioritised bugs and immediately saw a security flaw zoom to the top, while a unique problem with a specific PDF attachment went straight to the bottom of our prioritisation ranking.
The framework was a success: with so much going on, it helped us cut through the noise and see clearly what to work on next.
How we iterated our framework
We knew, however, that this wasn’t the end of the journey. Bugs kept piling up, and that brought new problems. Firstly, colleagues writing up the bug cards were unclear what the scores represented – one person could score the reach as 3, another as 1. There was also the issue that the scoring range was quite narrow, which meant that one small change in a score could move a bug from last place… to first. This didn’t make any sense to us.
But the main issue with the scoring range was that, as the bugs piled up, multiple cards would end up with exactly the same score – and therefore the same prioritised position. And when we say multiple, we mean about 10 or 11 bugs, all apparently at the same priority level. By this point Steve and Martin had left the team, but we knew this wasn’t sustainable, so we set about making changes.
Fine-tuning the scoring system
As Platform Health’s new product managers, the first thing that Adam and fellow product manager Jonathan did was make the scoring system, and what it represented, much clearer. Instead of choosing a number from 1–3 with no shared definition, colleagues chose from a set ‘level’ which matched the situation – for example, ‘Affects a minimal set of users’ or ‘Affects most or all users’.
This made it easier for more people to agree on the same score. We might still disagree about a bug’s reach or severity – but not because we disagreed on what the numbers represented.
Secondly, we bumped up the scoring range for reach and severity from 1–3 to 1–5. This greatly increased the number of possible scores, which limited how many bugs could share the same one. We still get bugs with the same score, but only 2 or 3 of them rather than 10 or 11.
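As a sketch of how the levels work, each score now maps to an agreed description – levels 1 and 5 are quoted from above, while the wording of the in-between levels is our invention:

```python
# Reach levels on the widened 1-5 scale. Levels 1 and 5 are quoted from
# the post; the intermediate descriptions are invented for this sketch.
REACH_LEVELS = {
    1: "Affects a minimal set of users",
    2: "Affects a small group of users",
    3: "Affects a moderate share of users",
    4: "Affects a large share of users",
    5: "Affects most or all users",
}

def describe_reach(score: int) -> str:
    """Return the agreed description for a reach score, so two colleagues
    scoring the same bug are choosing from the same set of levels."""
    return REACH_LEVELS[score]
```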
How it changed our team (and how you can use it)
For Platform Health, the bug prioritisation scoring had a huge impact on our effectiveness and our ability to do the right work at the right time. Instead of making finger-in-the-air decisions on which bug was most important, we had a tried and tested formula to point us in the right direction.
Point to note – we don’t follow the scoring blindly. After placing bugs in priority order, we’re empowered to move them around when our current product strategy makes certain bugs more important to work on than their score suggests.
If you want to introduce a similar prioritisation system to your team, whether focused on bugs or not:
- don’t be afraid to start small (with a few factors, or fields) and adapt if you need more differentiation (for example, if you’re getting a lot of bugs with the same score)
- use this template scoring system, but feel free to make a copy and customise it to your specific needs
- get in touch
Most importantly – be bold!
You can follow Steve on Twitter.