https://insidegovuk.blog.gov.uk/2017/02/22/incident-report-broken-tabs-on-gov-uk/

Incident report: broken tabs on GOV.UK

This post outlines a recent production issue on GOV.UK and how it was resolved. We’ve blogged in the past about what happens when things go wrong on GOV.UK, and also how we classify and prioritise incidents.

What happened

On the afternoon of 5 January 2017, we upgraded to a newer version of jQuery, which briefly caused some GOV.UK pages to fail to display content.

jQuery is a JavaScript library that simplifies writing JavaScript code. Using a code library in a project gives you access to pre-built functionality, so you don’t need to write it yourself.

When a library is updated, its maintainers assess how likely their changes are to break code that depends on it. If the change is very minor, it is considered ‘non-breaking’ and therefore unlikely to break anything on sites which include it. Although this jQuery update was documented as non-breaking, it caused a breaking change in the jQuery Tabs plugin that we use on calendars and transaction start pages.

What users saw

Pages using the jQuery Tabs plugin did not display content between 3:13pm and 5:23pm. Content still displayed for users who did not have JavaScript enabled.

The most affected pages were:

/bank-holidays
/apply-renew-passport
/apply-online-to-replace-a-driving-licence
/apply-first-provisional-driving-licence
/student-finance-register-login
/change-address-driving-licence
/view-driving-licence
/renew-driving-licence-at-70
/renew-driving-licence

The resulting JavaScript errors were recorded in Google Analytics, which told us that /bank-holidays (a very highly trafficked page on GOV.UK) recorded 59,089 errors during the course of the incident. The second most affected page was /apply-renew-passport, which had 18,255 errors.

The /bank-holidays page, for instance, failed to show the upcoming bank holidays for the different parts of the UK.

[Screenshot: broken bank holidays page]

Users should have been presented with the dates of upcoming bank holidays:

[Screenshot: working bank holidays page]

How we responded

We rolled back the jQuery upgrade in the Static application to the previous jQuery version. Static is where global templates (headers, footers and components that are common to a lot of pages) are defined for GOV.UK pages.

We use caching on GOV.UK: once a user has visited a page, its content is stored in memory so that the browser does not have to re-request it from the server. The benefit is that the page should load faster.

After deploying the rollback of Static, we purged the most affected pages from our caches so that users would see the fixed version straight away.
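
As a rough illustration of the purge step, the sketch below builds one purge request per affected page without sending anything. It assumes the cache in front of the site accepts the HTTP PURGE method, which this report does not confirm. The function name build_purge_requests is hypothetical, and the two paths are taken from the list of most affected pages above.

```python
from urllib.request import Request

# Two of the most affected paths listed in this report.
AFFECTED_PATHS = [
    "/bank-holidays",
    "/apply-renew-passport",
]


def build_purge_requests(origin, paths):
    """Return one un-sent PURGE request per path.

    Assumption: the cache accepts the HTTP PURGE method for a page URL.
    """
    return [Request(origin + path, method="PURGE") for path in paths]


purge_requests = build_purge_requests("https://www.gov.uk", AFFECTED_PATHS)
for request in purge_requests:
    # Print instead of sending, so the sketch has no side effects.
    print(request.get_method(), request.full_url)
```

Each request could then be sent with urllib.request.urlopen, provided the cache is configured to allow purging from the sender.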

What we’re doing to prevent this from happening again

We’re implementing better alerting for JavaScript errors so we notice problems with page rendering before they affect users. We will begin by investigating the volume of JavaScript errors that we currently encounter. We will store these counts in Graphite (a tool that draws graphs of the metrics we send to it) in order to determine a threshold beyond which alerts should be triggered.
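
One way such a threshold might work is sketched below. It is not the alerting we have built: the metric name, the baseline counts and the rule of "mean plus three standard deviations" are all assumptions for illustration. The only figures taken from this report are the incident totals: 59,089 errors on /bank-holidays over the 130 minutes between 3:13pm and 5:23pm is roughly 455 errors per minute on average. The line format is the one used by Graphite's plaintext protocol (metric path, value, Unix timestamp).

```python
import statistics
import time


def graphite_line(metric_path, value, timestamp=None):
    """Format one data point for Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric_path} {value} {timestamp}\n"


def alert_threshold(baseline_counts, sigmas=3.0):
    """Mean of the baseline plus `sigmas` population standard deviations."""
    mean = statistics.mean(baseline_counts)
    spread = statistics.pstdev(baseline_counts)
    return mean + sigmas * spread


def should_alert(current_count, baseline_counts, sigmas=3.0):
    """True when the current count is above the computed threshold."""
    return current_count > alert_threshold(baseline_counts, sigmas)


# Made-up per-minute error counts for a normal period (not real data).
baseline = [4, 6, 5, 7, 3, 5]

# Approximate per-minute average during the incident: 59,089 / 130.
incident_rate = round(59089 / 130)

# Hypothetical metric name and an example Unix timestamp.
line = graphite_line("govuk.js_errors.bank_holidays", incident_rate, 1483629180)
print(line, end="")
print(should_alert(incident_rate, baseline))
```

A fixed threshold is simpler to reason about, but deriving it from an observed baseline means a busy page and a quiet page can each alert at a level that is unusual for them.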

We are investigating how to integrate visual regression testing into the deployment process for the Static application. Visual regression testing will generate screenshots of webpages on different environments (in this case staging and production), with the differences between them highlighted. This provides a mostly automated way of telling if something doesn’t look quite as expected.
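
The core of a visual regression check is a pixel-by-pixel comparison of two screenshots. The sketch below shows that idea only, and is not the tool we are evaluating. It assumes screenshots have already been captured and decoded into equally sized grids of pixel values; the function name diff_screenshots and the tiny sample grids are made up for illustration.

```python
def diff_screenshots(staging, production):
    """Compare two equally sized pixel grids.

    Returns (changed_fraction, bounding_box). The bounding box is
    (top, left, bottom, right) of the changed region, or None when the
    two grids are identical.
    """
    if len(staging) != len(production) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(staging, production)
    ):
        raise ValueError("screenshots must have the same dimensions")

    changed = [
        (y, x)
        for y, (row_a, row_b) in enumerate(zip(staging, production))
        for x, (pixel_a, pixel_b) in enumerate(zip(row_a, row_b))
        if pixel_a != pixel_b
    ]
    if not changed:
        return 0.0, None

    total = sum(len(row) for row in staging)
    ys = [y for y, _ in changed]
    xs = [x for _, x in changed]
    return len(changed) / total, (min(ys), min(xs), max(ys), max(xs))


# Made-up 3x4 "screenshots": 1 is a content pixel, 0 is blank background.
working_page = [[1, 1, 1, 1], [0, 1, 1, 0], [0, 1, 1, 0]]
broken_page = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]

fraction, box = diff_screenshots(working_page, broken_page)
print(fraction, box)
```

In practice a small tolerance is usually needed, because fonts and images can render slightly differently between environments even when nothing is broken.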

We’re also making sure Smokey, an application which runs automated tests that describe high-level user journeys, can catch and handle JavaScript errors.
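
The idea of failing a smoke test when a page logs JavaScript errors can be sketched as follows. This is not Smokey's code. It assumes the test browser exposes its console log as a list of entries with a level and a message; that entry shape, the function name and the sample messages are all hypothetical.

```python
class JavaScriptErrorsFound(AssertionError):
    """Raised when a page logged JavaScript errors during a smoke test."""


def assert_no_javascript_errors(path, console_entries):
    """Fail if any console entry for the page is at error level."""
    errors = [
        entry["message"] for entry in console_entries if entry["level"] == "error"
    ]
    if errors:
        raise JavaScriptErrorsFound(
            f"{path}: {len(errors)} JavaScript error(s): " + "; ".join(errors)
        )


# Made-up console entries for illustration only.
healthy_log = [{"level": "warning", "message": "example warning"}]
broken_log = [{"level": "error", "message": "example error"}]

assert_no_javascript_errors("/bank-holidays", healthy_log)  # passes silently

try:
    assert_no_javascript_errors("/bank-holidays", broken_log)
except JavaScriptErrorsFound as failure:
    print(failure)
```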

Rosa is a developer on GOV.UK. You can follow her on Twitter.
