Service Disruption
Incident Report for blubolt
Postmortem

Summary

An automated security update to an operating-system-level package caused a significant portion of our infrastructure to fail, and prevented the self-healing systems we have in place from restoring service. Our Elasticsearch cluster was the primary customer-facing service affected.

We rolled the update back, restoring service to website visitors shortly after midnight. The other affected internal services were restored over the following few hours.

Root cause

On the 14th of May, an update to json-c fixing CVE-2020-12762 was released and, at 20:03, published to the security update repositories for the version of Ubuntu we are using. Starting at 20:40, our infrastructure checked for updates, found the security update and proceeded to install it.

We use upstart as the init system on the affected servers, and upstart depends on libjson-c2, which is provided by json-c. As part of the update process, services that rely on the updated package were restarted, including the init system. The init system runs as the first process on a Linux system (PID 1), and the kernel panics if that process dies, so a restart of it that goes wrong takes the whole server down. That is what happened here: our servers tried to restart upstart during the upgrade and crashed with a kernel panic, taking them offline.
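As an illustration only, the dependency relationship described above can be sketched as a check that flags pending updates touching anything a critical service depends on, so they can be held for manual review. This is not the tooling used during the incident: the function names and the dependency map are hypothetical, and only the upstart to libjson-c2 edge comes from this report.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical dependency data. Only the upstart -> libjson-c2 edge is taken
# from the report; the other entries are placeholders for illustration.
DEPENDS_ON: Dict[str, Set[str]] = {
    "upstart": {"libjson-c2"},
    "example-web-app": {"libexample1"},
}

# Packages whose restart can take the whole server down (assumed list).
CRITICAL: Set[str] = {"upstart"}


def transitive_deps(package: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return every package reachable from `package` via dependency edges."""
    seen: Set[str] = set()
    stack: List[str] = [package]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def updates_to_hold(
    pending: Iterable[str],
    critical: Set[str],
    depends_on: Dict[str, Set[str]],
) -> List[str]:
    """Return the pending updates that touch a critical package or its dependencies."""
    risky: Set[str] = set(critical)
    for package in critical:
        risky |= transitive_deps(package, depends_on)
    return sorted(p for p in pending if p in risky)
```

With this sample data, an update to libjson-c2 would be held back because upstart depends on it, while an update to libexample1 would be allowed through.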

As these servers are set to be automatically replaced if they fail, the ones that had gone offline were terminated and replacements created. Unfortunately, those replacements also tried to install the update to libjson-c2 and crashed in the same way, so the self-healing process could not restore service.
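The replacement loop described above can be detected with a simple guard that counts recent replacement failures and signals when automatic replacement should pause for human investigation. The following is a minimal sketch under assumptions, not a description of our infrastructure: the class name, thresholds and timestamps are all hypothetical.

```python
from collections import deque
from typing import Deque


class ReplacementGuard:
    """Tracks failed replacements within a sliding time window (hypothetical helper)."""

    def __init__(self, max_failures: int = 3, window_seconds: float = 600.0) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures: Deque[float] = deque()

    def _drop_old(self, now: float) -> None:
        # Forget failures that fall outside the sliding window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()

    def record_failure(self, timestamp: float) -> None:
        """Record that a replacement instance failed at `timestamp` (seconds)."""
        self._failures.append(timestamp)
        self._drop_old(timestamp)

    def should_pause(self, now: float) -> bool:
        """True when enough recent replacements have failed to suggest a crash loop."""
        self._drop_old(now)
        return len(self._failures) >= self.max_failures


guard = ReplacementGuard(max_failures=3, window_seconds=600.0)
for failure_time in (0.0, 100.0, 200.0):
    guard.record_failure(failure_time)
if guard.should_pause(now=200.0):
    print("Replacements keep failing: pause auto-replacement and alert the on-call engineer")
```

A guard like this does not fix the underlying fault, but it turns a silent loop of failing replacements into an explicit signal that the replacements themselves are broken.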

Initially we could only see that instances were crashing, so we brought in our hosting provider to assist. As we gathered data and got closer to the root cause, we attempted a number of potential fixes, including replacing the Amazon Machine Image that appeared to be central to the issue. None of these restored service, but they did give us valuable extra insight into the issue.

At 22:50, we discovered that the json-c update was the underlying cause of the outage, and we proceeded to roll back the security update that had been applied. This took approximately an hour, and servers began successfully completing the startup process just before midnight.

A few additional manual steps were required to restore the data into Elasticsearch. These were completed by 00:18, restoring service to website visitors.

The outage affected a large number of servers on our infrastructure, including Elasticsearch, FTP, and numerous internal tools. Restoring all of them took additional time, with services steadily coming back until everything was at full strength by 03:30.

We have reported the issue to the package maintainers, and the update has since been removed from the upstream repositories.

Posted May 15, 2020 - 16:32 BST

Resolved
The fix implemented has been successful at restoring service and will prevent the issue recurring.

A small number of non-customer-facing services were also affected, including FTP, and these have now also been restored.
Posted May 15, 2020 - 03:38 BST
Monitoring
A fix has been implemented which has restored the website service and we are currently monitoring the results.

We are currently investigating some non-customer-facing services, and any additional updates will be posted to the status page.
Posted May 15, 2020 - 00:33 BST
Update
We are continuing to work on and test a potential fix for this issue, and there will be an additional update in the next 30 minutes.
Posted May 14, 2020 - 23:53 BST
Update
We are continuing to work on a fix for this issue, and there will be an additional update in the next 30 minutes.
Posted May 14, 2020 - 23:22 BST
Update
We are continuing to work on a fix for this issue, and there will be an additional update in the next 30 minutes.
Posted May 14, 2020 - 22:53 BST
Identified
The issue has been identified and our systems team is currently working on resolving this.
Posted May 14, 2020 - 21:56 BST
Update
We are continuing to investigate this issue.
Posted May 14, 2020 - 21:33 BST
Update
We are still investigating this issue and we will send an additional update in the next 30 minutes.
Posted May 14, 2020 - 21:29 BST
Update
We have temporarily blocked deploys while we continue to investigate this service disruption.

We will send an additional update in the next 15 minutes.
Posted May 14, 2020 - 21:08 BST
Investigating
We are currently experiencing a service disruption.

Our team is working to identify the root cause and implement a solution. Website and Admin users may be affected.

We will send an additional update in the next 15 minutes.
Posted May 14, 2020 - 20:57 BST
This incident affected: Admin (Admin application) and Websites (Websites, Checkout & Cart).