Over the last year, we’ve been improving how we manage technical support on GOV.UK. Towards the end of 2015, we blogged about how we respond when things go wrong on GOV.UK, and not long after that we published our very first public incident report. Since then we’ve made some more improvements to how we manage incidents so that we have a clear and documented way of categorising incidents.
GOV.UK is an important part of national infrastructure. Citizens rely on it for information, and government publishers use it every day to publish content. We take any problem on GOV.UK very seriously, and so we’ve made sure that we have a framework to refer to when we make decisions about how to handle incidents.
One of the hardest things to do when responding to an incident is working out just how serious the incident is. We need to triage and categorise incidents for very good reasons.
Firstly, we need to know how to respond. Should we all stop what we’re doing and focus on fixing the problem? Should we be working all hours to restore service? How do we know what ‘restore service’ even means?
Secondly, we need to be able to prioritise our responses to incidents. Sometimes, several things go wrong at the same time, and our incident responders need to know which problem is the highest priority to fix.
Finally, we need to be able to manage expectations with our users, whether they’re citizens looking for information, or publishers looking to make changes.
We’ve changed our process slightly and now the most important thing about responding to an incident is triaging the incident. We do that by asking ourselves 3 questions when we’re responding to an issue:
- What’s the urgency and why?
- What’s the impact to our users and systems?
- What’s the extent of the issue and how many systems and users are affected?
The answers to those questions allow us to use a scale to categorise the incident and work out the appropriate response.
Once an incident has been categorised, the engineers working on it know what’s expected of them, including whether they should keep working on the incident until it’s stable, regardless of what time of day it is.
Severity 1 - this is our highest and most serious level of incident. We generally declare a severity 1 when there’s an outage of one or more critical applications, like Whitehall or Publisher, and there’s no workaround possible. This is the highest priority in GOV.UK. Work will continue out of hours until the incident can be downgraded to the next level, severity 2.
Severity 2 - this is the next highest level. We declare a severity 2 incident when major functionality is broken, such as site search. This will be dealt with during working hours above other non-emergency work.
Severity 3 - this is the lowest level of incident, but it still takes priority over regular work. Generally, we’ll declare a severity 3 if minor functionality or internal tools aren’t working as intended. We’ll work on this as a priority within working hours, but if a severity 1 or 2 develops, we’ll stop dealing with the severity 3 in order to respond to the higher priority issue.
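As a rough illustration, the severity levels above could be modelled along these lines in Python. This is only a sketch and not a description of any tooling GOV.UK actually uses: the names `Severity`, `ResponsePolicy`, `Incident`, `next_to_work_on` and `continue_out_of_hours` are all hypothetical, and the only rules encoded are the ones stated above (severity 1 continues out of hours, and a higher severity always takes priority over a lower one).

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more serious, matching the levels described above."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class ResponsePolicy:
    # Hypothetical structure: a typical example plus the out-of-hours rule.
    typical_example: str
    out_of_hours: bool


POLICIES = {
    Severity.SEV1: ResponsePolicy(
        "outage of a critical application with no workaround", True
    ),
    Severity.SEV2: ResponsePolicy(
        "major functionality broken, such as site search", False
    ),
    Severity.SEV3: ResponsePolicy(
        "minor functionality or internal tools not working as intended", False
    ),
}


@dataclass(frozen=True)
class Incident:
    summary: str
    severity: Severity


def next_to_work_on(open_incidents):
    """Return the most serious open incident, or None if there are none."""
    return min(open_incidents, key=lambda incident: incident.severity, default=None)


def continue_out_of_hours(incident):
    """Only severity 1 incidents are worked on outside working hours."""
    return POLICIES[incident.severity].out_of_hours


if __name__ == "__main__":
    incidents = [
        Incident("Internal dashboard is slow", Severity.SEV3),
        Incident("Site search returns errors", Severity.SEV2),
    ]
    top = next_to_work_on(incidents)
    print(top.summary, continue_out_of_hours(top))
```

With the two example incidents, the severity 2 search problem is picked ahead of the severity 3 dashboard problem, and neither would be worked on out of hours.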
Regardless of the incident severity, we will always do 2 things after an incident:
- Hold an incident review meeting (previously known as ‘post mortem’) to agree how to address the root cause of the issue
- Produce an internal incident report and list of actions
For severity 1 and 2 incidents, we will also publish a blog post in the Incident Report category on this blog. We’ll do the same for severity 3 incidents if they’re particularly interesting.