This post is about a recent incident on the GOV.UK website. For more information on incidents on GOV.UK, you can check the status of the GOV.UK site or read more about what happens when things go wrong on GOV.UK.
What users saw
From early on 3 November to 11am on 4 November, the /announcements page on GOV.UK didn’t show any new government announcements. Users visiting the page saw that the last announcement was on 2 November, despite new announcements having been published.
Cause of the problem
Every night we run an automated task to improve our search results based on the most popular pages on GOV.UK. That task requires us to rebuild our search index, so while it is running we lock the index to prevent further changes from being made to it. Any changes that need to be applied while the index is locked are queued. When the task completes, it unlocks the index and we process any items waiting in the queue.
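As an illustration only, the lock-and-queue behaviour described above can be sketched in a few lines of Python. This is not the actual GOV.UK search code: the `LockableIndex` class, the `nightly_task` function and the `rebuild` callback are hypothetical names, and the index is a plain in-memory dictionary.

```python
from collections import deque


class LockableIndex:
    """Toy in-memory index that queues writes while it is locked (hypothetical)."""

    def __init__(self):
        self.documents = {}
        self.locked = False
        self.pending = deque()

    def lock(self):
        self.locked = True

    def add(self, doc_id, document):
        # While the index is locked, hold the change in the queue
        # instead of applying it.
        if self.locked:
            self.pending.append((doc_id, document))
        else:
            self.documents[doc_id] = document

    def unlock(self):
        # Unlock, then apply queued changes in the order they arrived.
        self.locked = False
        while self.pending:
            doc_id, document = self.pending.popleft()
            self.documents[doc_id] = document


def nightly_task(index, rebuild):
    """Lock the index, run the rebuild step, then unlock (hypothetical)."""
    index.lock()
    rebuild(index)
    # Only reached if the rebuild step returns normally.
    index.unlock()
```

In a design like this one, a rebuild step that never finishes leaves the index locked, and every later change sits in the queue until something unlocks it.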
At 4:06am on 3 November, the nightly task ran but it didn’t complete, and the search indexes remained locked. When new announcements were published, they should have been added into the search index. However, as the index was still locked, they stayed in the queue waiting for the index to be unlocked.
GOV.UK didn’t have monitoring on whether the index was locked or on whether the task had completed, so we didn’t know that there was a problem until we received reports from our content community.
How we are preventing this from happening again
Usually we try to establish exactly why a problem occurred and how we can stop it from happening again. We haven’t been able to tell exactly what’s causing the task to fail, so we’ve added some extra logging so that we get more information the next time it happens.
In the meantime, we’ve made sure that we can catch this early. We have now set up some monitoring that checks whether the task has completed within the time we expect it to finish, and alerts us if it hasn’t. This means that if the task is taking longer to complete than it should, we’ll know about it and be able to fix it.
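A check of this kind can be very small. The sketch below is a minimal Python illustration, not the monitoring GOV.UK actually runs: the function name `task_is_overdue`, its parameters and the idea of recording start and completion timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def task_is_overdue(started_at, completed_at, now, expected_duration):
    """Return True if the latest run started but has not completed in time.

    All arguments are hypothetical: `started_at` and `completed_at` are the
    timestamps recorded for the most recent run (`completed_at` is None if
    the run has not finished), and `expected_duration` is how long the task
    is allowed to take before someone should be alerted.
    """
    # A completion recorded at or after the latest start means the run finished.
    if completed_at is not None and completed_at >= started_at:
        return False
    # Otherwise the run is still outstanding; alert once it exceeds its allowance.
    return now - started_at > expected_duration


if __name__ == "__main__":
    # The year and the one-hour allowance are arbitrary example values.
    started = datetime(2000, 11, 3, 4, 6)
    if task_is_overdue(started, None, started + timedelta(hours=2), timedelta(hours=1)):
        print("ALERT: nightly search task has not completed in the expected time")
```

Comparing the completion time against the latest start time, rather than only checking that a completion time exists, stops a successful run from the previous night hiding a run that is currently stuck.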