https://insidegovuk.blog.gov.uk/2018/01/19/travel-advice-publisher-incident-roundup/

Travel advice publisher incident roundup

This post is about 2 incidents that affected our travel advice pages, and how they were resolved.

We’ve posted before about what happens when things go wrong on GOV.UK, and about how we classify and prioritise incidents.

21 March 2017 - Travel advice for USA not published

What happened

At around 4:40pm on Tuesday 21 March, our monitoring system alerted us that the email to tell users about an update to travel advice for the USA hadn’t been sent. The content change that should have triggered the email also didn’t appear on GOV.UK.

Travel advice is published by the Foreign and Commonwealth Office (FCO). Users who subscribe to alerts should get an email when an update is published. This was a severity 1 incident.
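
To illustrate the kind of check that can spot a missing email alert, here is a minimal sketch in Python. It is not GOV.UK's actual monitoring code. The function name, the record fields (`id`, `update_id`, `published_at`), the 15-minute grace period and the timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def find_missing_alerts(expected_updates, sent_alerts, now,
                        grace=timedelta(minutes=15)):
    """Return expected updates that have no matching email alert.

    expected_updates: dicts with hypothetical keys "id" and "published_at"
    sent_alerts: dicts with a hypothetical key "update_id"
    grace: how long to wait before treating a missing email as a problem
    """
    sent_ids = {alert["update_id"] for alert in sent_alerts}
    return [
        update
        for update in expected_updates
        if update["id"] not in sent_ids
        and now - update["published_at"] > grace
    ]


# Illustrative data only: one update has no email and is past the grace period.
now = datetime(2017, 3, 21, 16, 40)
expected = [
    {"id": "usa-1", "published_at": datetime(2017, 3, 21, 16, 0)},
    {"id": "france-1", "published_at": datetime(2017, 3, 21, 16, 0)},
    {"id": "spain-1", "published_at": datetime(2017, 3, 21, 16, 35)},
]
sent = [{"update_id": "france-1"}]

for update in find_missing_alerts(expected, sent, now):
    print(f"No email alert sent for update {update['id']}")
```

The grace period stops the check from flagging updates whose emails are simply still on their way.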

What users saw

Because the email alert wasn’t sent and the content change didn’t appear on GOV.UK, users were unaware of the update. The FCO publisher who made the change noticed it wasn’t appearing and sent us a helpdesk ticket asking us to investigate.

How we responded

We published the content change and sent the email alert manually as soon as our monitoring system alerted us. The travel advice publisher application had been updated earlier that day, and we suspected this had caused the issue. The update upgraded the Rails framework, which changed the behaviour of autoloading and caused the problem. The issue wasn’t apparent when we tested the update in our integration and staging environments, which is how it reached the live site. To solve it, we redeployed the application with a fix.

We liaised with FCO and monitored further updates to make sure they were successful before closing the incident.

What we’re doing to prevent this from happening again

We added information about this issue to our upgrade guide and shared it with the product teams.

27 July 2017 - Travel advice publisher down

What happened

On the morning of 27 July, publishers using GOV.UK’s travel advice publishing application were unable to publish content.

This issue was caused by a bug in a code deployment to the application’s publishing pipeline. The bug wasn’t caught before deployment because some tests that normally run automatically had failed to run.

This was a severity 1 incident.

What users saw

Publishers using the travel advice publishing application were able to draft new content, but saw error messages when trying to publish that content on GOV.UK.

There was no impact on users visiting GOV.UK.

How we responded

When we found out that publishers were unable to publish content, we immediately rolled back the code deployment that had happened earlier that morning. This fixed the problem, so we set about investigating why it happened.

We discovered that our code deployment included a big update to another bit of code which had changed a lot since we last used it. During development, our manual testing had missed that if you publish a travel advice page with a new attachment, you get an error.

We also noticed that our end-to-end tests throughout the publishing pipeline were not configured properly, so they didn’t report the problem to us before we deployed the code.

What we’re doing to prevent this from happening again

Although the root cause was a code bug, we would have caught the problem earlier if our end-to-end tests had been working correctly. So we’re looking into how we can ensure the tests always run and alert us to problems before we deploy code.

At the same time, we’re going to make sure that more of our apps use the end-to-end tests, along with localised tests, so incidents like this don’t happen again.
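
One way to make sure a test suite that silently fails to run can't be mistaken for a passing one is to have the deployment check "fail closed". The Python sketch below shows the idea. It is not our actual deployment tooling. The `SuiteReport` class, the `deploy_allowed` function and the messages are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SuiteReport:
    """Hypothetical summary of one end-to-end test run."""
    executed: int  # number of tests that actually ran
    failed: int    # number of those tests that failed


def deploy_allowed(report: Optional[SuiteReport]) -> Tuple[bool, str]:
    """Allow a deployment only if tests ran and none of them failed."""
    if report is None:
        # A missing report means the suite never ran: block the deployment.
        return False, "no end-to-end test report found"
    if report.executed == 0:
        # A suite that ran zero tests is treated as a failure, not a pass.
        return False, "end-to-end suite ran no tests"
    if report.failed > 0:
        return False, f"{report.failed} end-to-end test(s) failed"
    return True, "end-to-end tests passed"


# Usage: a suite that failed to run blocks the deployment.
allowed, reason = deploy_allowed(None)
print(allowed, reason)
```

The important design choice is that "no result" and "zero tests executed" are handled as explicit failures, so a misconfigured test run stops a deployment rather than letting it through.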

Graham and Thomas are developers on GOV.UK.