At GOV.UK, we’re constantly iterating our processes to give our users the best experience. Over the last 6 months we’ve been focussing on what we do when things go wrong.
What is an incident?
An incident is an event that affects how users experience GOV.UK. It may affect people using the site, or editors using our publishing tools, or both. It could result in downtime for parts of GOV.UK, poor performance, or errors.
What happens when there’s an incident?
The first responders for an incident are usually the people who are working on our Technical Support team for the week. They’re there specifically to provide support and to deal with problems on the site.
When something has gone wrong, the Technical Support team will assign roles to themselves. An incident lead coordinates activities and is the person who everyone else needs to check with before getting involved. We also assign someone to take care of incident communications. That person sends out emails internally, updates the Basecamp community and updates our status page so that people are aware that we have a problem and are fixing it.
Our primary concerns during an incident are stabilising the site so that it can be used and keeping people informed through Basecamp and the status page.
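As a rough illustration only, the roles and channels described above could be modelled along these lines. This is a sketch and not our actual tooling: the names `Role`, `Incident`, `assign` and `post_update` are hypothetical, the people are made up, and the channels just collect messages in memory rather than talking to email, Basecamp or the status page. The rule that only the communications person may post updates is also an assumption added for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    INCIDENT_LEAD = "incident lead"
    COMMUNICATIONS = "incident communications"


# Channel names taken from the description above. Nothing here contacts a
# real service; updates are only recorded in memory.
CHANNELS = ("internal email", "Basecamp", "status page")


@dataclass
class Incident:
    summary: str
    roles: dict = field(default_factory=dict)    # Role -> person
    updates: list = field(default_factory=list)  # (channel, message) pairs

    def assign(self, role: Role, person: str) -> None:
        self.roles[role] = person

    def post_update(self, author: str, message: str) -> None:
        # Assumption for this sketch: only the person handling
        # communications posts updates, so every channel says the same thing.
        if self.roles.get(Role.COMMUNICATIONS) != author:
            raise PermissionError(f"{author} is not handling communications")
        for channel in CHANNELS:
            self.updates.append((channel, message))


# Made-up example data.
incident = Incident("Publishing tools returning errors")
incident.assign(Role.INCIDENT_LEAD, "Alex")
incident.assign(Role.COMMUNICATIONS, "Sam")
incident.post_update("Sam", "We are aware of a problem and are fixing it.")
```

Keeping the roles explicit in one place mirrors the point of assigning them in the first place: anyone joining part-way through can see who is leading and who is communicating.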
What happens after the incident?
Once the incident has been resolved and the site is stable, we work on analysing the root cause of the incident. Sometimes it’s not easy to see what the actual cause of the incident was because there has been a chain of events.
The root cause is the initial problem that caused the incident, or caused the chain of events that resulted in the incident. It’s important that we identify exactly what caused the incident so that we can stop it from happening again.
When we have a suspected root cause or some evidence that helps us suggest a root cause, we get together as a group to hold a post mortem. The meeting is usually open to anyone in GDS. We don’t assign blame, which is really important because it encourages people to contribute and to learn from what’s gone wrong.
During the meeting the team reviews the incident timeline and suggests what could have been done better. We also discuss and agree the root cause and what the fix should be.
How do we follow up?
We make sure that actions from the post mortem meeting are assigned to team backlogs or individuals for completion, and those actions are regularly followed up. We then finish writing up the incident report, making sure that it’s easily accessible from our wiki and that the actions are clear.
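To make the follow-up step concrete, here is a minimal sketch of how post mortem actions could be tracked and chased. It is an illustration under assumptions, not a description of how we do it: `Action` and `needs_follow_up` are hypothetical names, the due dates are an invented field (the process above only says actions are followed up regularly), and the sample actions and dates are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Action:
    description: str
    owner: str          # a team backlog or an individual
    due: date           # assumption: a due date is recorded for each action
    done: bool = False


def needs_follow_up(actions: List[Action], today: date) -> List[Action]:
    """Return open actions whose due date has passed, oldest first."""
    overdue = [a for a in actions if not a.done and a.due < today]
    return sorted(overdue, key=lambda a: a.due)


# Made-up example data.
actions = [
    Action("Add alert for disk usage", "Platform backlog", date(2024, 1, 10)),
    Action("Update status page template", "Sam", date(2024, 1, 20), done=True),
    Action("Finish incident report", "Alex", date(2024, 1, 5)),
]

for action in needs_follow_up(actions, today=date(2024, 1, 15)):
    print(f"{action.owner}: {action.description} (due {action.due})")
```

Sorting by due date puts the longest-outstanding actions first, which is one simple way to decide what to chase in a regular review.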
Every incident we have on GOV.UK teaches us something new about how we can improve what we do, whether it’s technology or communication. We’re constantly iterating our incident management processes so that we can manage problems on the site as smoothly as possible and concentrate on restoring service.