This post is about a recent incident on the GOV.UK website. For more information on incidents on GOV.UK, you can check the status of the GOV.UK site or read more about what happens when things go wrong on GOV.UK.
What users saw
From 17 to 21 December 2015, up to 24 worldwide organisation pages were inaccessible. Users visiting one of these pages were redirected endlessly between 2 URLs.
Cause of the problem
The GOV.UK development team has been rebuilding the systems that store and publish content on the website. As part of this work, two unrelated changes were made.
The first was cleaning up some duplicate data in our publishing database. Because of the legacy way that corporate information pages (which power the worldwide organisation pages) were modelled, the page itself and the "about" content were stored separately in the database, but shared the same URL path. To tidy this up and allow the data to be migrated to a new system, we gave the "about" content an imaginary URL, "<organisation>/about", and set up a redirect so that anyone visiting that path would be sent back to the actual organisation page.
The second was to make it easier to change the path (URL) of a piece of content. Previously, when editors wanted to change the URL of some content, they had to ask our second-line support team to manually change the path and set up redirects from the old path to the new one, so that existing links didn't fail. In our new system, we added functionality that automatically sets up a redirect when it detects that an edit has changed the content's path.
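The automatic redirect behaviour can be sketched roughly as follows. This is an illustration only, not the actual GOV.UK publishing code: the function name `save_edit`, the dictionaries and the example paths are all hypothetical.

```python
def save_edit(content_paths, redirects, content_id, new_path):
    """Record new_path for content_id and redirect the old path if it changed.

    content_paths maps a content id to its current path.
    redirects maps an old path to the path it should redirect to.
    Both are plain dictionaries standing in for a real data store.
    """
    old_path = content_paths.get(content_id)
    if old_path is not None and old_path != new_path:
        # The path changed during this edit, so keep existing links working.
        redirects[old_path] = new_path
    content_paths[content_id] = new_path


content_paths = {"doc-1": "/old-path"}
redirects = {}
save_edit(content_paths, redirects, "doc-1", "/new-path")
print(redirects)
```

Note that a sketch like this trusts whatever path the edit carries, which is why an out-of-date edit can create an unwanted redirect.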
On their own, neither change caused a problem. However, one side effect of the duplicate path issue was that some edit notifications had been continually failing and retrying in the publishing queue, Sidekiq.
When the duplicate path issue was fixed, the notifications processed successfully. However, since they still referred to the old path, the publishing system saw them as a change of path. This triggered a redirect from the 'old' fixed path to the 'new' duplicate one. Since there was already a redirect from the duplicate path to the fixed one, a loop now existed between the two paths.
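To make the loop concrete, here is a minimal sketch that models redirects as a dictionary and follows them until a path repeats. The organisation path and the function name `find_redirect_loop` are assumptions made for illustration; they are not taken from the real system.

```python
def find_redirect_loop(redirects, start):
    """Follow redirects from start. Return the list of paths that form a
    loop, or None if the chain ends at a path with no redirect."""
    seen = []
    path = start
    while path in redirects:
        if path in seen:
            return seen[seen.index(path):]
        seen.append(path)
        path = redirects[path]
    return None


# Hypothetical organisation path, for illustration only.
ORG = "/government/world/organisations/example-org"

# First change: the "about" path redirects to the real organisation page.
redirects = {ORG + "/about": ORG}
print(find_redirect_loop(redirects, ORG + "/about"))

# A stale notification is then treated as a path change, adding the reverse redirect.
redirects[ORG] = ORG + "/about"
print(find_redirect_loop(redirects, ORG))
```

With only the first redirect in place the chain ends at the organisation page. Once the reverse redirect is added, each path points at the other and the chain never ends.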
How we are preventing this from happening again
We should have kept a closer watch on the failing jobs in the Sidekiq queue, and cleared them out before implementing the fix for the duplicate path issue. Since there was no actual content in those jobs, clearing them would have had no negative impact on users.
We will be adding monitoring to show how long jobs have been stuck in the retrying state in the queue, so that we get a more visible signal that things are not being processed properly.
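One way such a check might look is sketched below. It assumes each retrying job is available as a record with the time it first failed; the record fields (`id`, `first_failed_at`), the function name and the one-hour threshold are hypothetical, and the sketch does not use Sidekiq's own API.

```python
from datetime import datetime, timedelta, timezone


def stuck_retries(jobs, now, max_age=timedelta(hours=1)):
    """Return (job, age) pairs for jobs that have been retrying for longer
    than max_age, with the longest-stuck job first."""
    stuck = []
    for job in jobs:
        age = now - job["first_failed_at"]
        if age > max_age:
            stuck.append((job, age))
    return sorted(stuck, key=lambda pair: pair[1], reverse=True)


# Made-up example data.
now = datetime(2015, 12, 17, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": "a", "first_failed_at": now - timedelta(minutes=30)},
    {"id": "b", "first_failed_at": now - timedelta(days=1)},
]
for job, age in stuck_retries(jobs, now):
    print(job["id"], "has been retrying for", age)
```

Reporting the age of the oldest retrying job, rather than just the queue size, is what would make long-running failures like these visible.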
In addition, we are looking at adding anomaly detection on GOV.UK. This means that if we suddenly start serving many more redirects than usual, we will be notified sooner so that we can investigate and fix the problem quickly.
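As a very simple illustration of the idea, the sketch below flags a redirect count that sits far above the recent baseline. The function name, the threshold of three standard deviations and the sample counts are all assumptions; a production anomaly detector would likely be more sophisticated.

```python
from statistics import mean, pstdev


def is_redirect_spike(history, latest, threshold=3.0):
    """Return True if latest is more than threshold standard deviations
    above the mean of the historical redirect counts."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return latest > baseline
    return (latest - baseline) / spread > threshold


# Made-up redirect counts per interval.
history = [100, 104, 98, 101, 97]
print(is_redirect_spike(history, 103))
print(is_redirect_spike(history, 400))
```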