An automated security update to an operating-system-level package caused a significant portion of our infrastructure to fail, and prevented the self-healing systems we have in place from restoring service. Our Elasticsearch cluster was the primary customer-facing service affected.
The update was rolled back to restore service to website visitors shortly after midnight, with the remaining affected internal services restored over the following few hours.
On 14 May at 20:03, an update to json-c fixing CVE-2020-12762 was published to the security update repositories for the version of Ubuntu we use. Starting at 20:40, our infrastructure checked for updates, found the security update, and proceeded to install it.
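On a typical Debian/Ubuntu setup, this kind of automatic installation is handled by unattended-upgrades running on a periodic timer. As an illustrative sketch (not our exact configuration), a setup restricted to the security pocket would look something like this, and would have picked up the json-c update on its next run:

```
// /etc/apt/apt.conf.d/20auto-upgrades (illustrative)
// Refresh package lists and run unattended-upgrades daily.
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";

// /etc/apt/apt.conf.d/50unattended-upgrades (illustrative)
// Only install packages published to the security pocket automatically.
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};
```

With this in place, any package version landing in the `-security` origin, such as the json-c update, is installed without operator involvement.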
The affected servers use upstart as their init system, and upstart depends on libjson-c2, which is provided by json-c. As part of the update process, services that rely on the updated package were restarted, including the init system itself. The init system runs as PID 1, which the kernel cannot allow to exit, so it is not possible to restart it without crashing the system. As a result, our servers tried to restart upstart during the upgrade and crashed with a kernel panic, taking them offline.
As these servers are configured to be automatically replaced if they fail, the ones that had gone offline were terminated and replacements created. Unfortunately, those replacements also tried to install the update to libjson-c2 and crashed in the same way, preventing the infrastructure from self-healing.
Initially, all we could see was that instances were crashing, so our hosting provider was brought in to assist. As we gathered data and narrowed in on the root cause, we attempted a number of potential fixes, including replacing the Amazon Machine Image that appeared to be central to the issue. These did not restore service, but they gave us valuable additional insight into the problem.
At 22:50, we identified the json-c update as the underlying cause of the outage and began downgrading the security update that had been applied. This took approximately an hour, and servers began successfully completing the startup process just before midnight.
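On apt-based systems, a downgrade like this can be made to stick by pinning the package below the problematic version, so automatic updates do not immediately reinstall it. A sketch of such a pin, assuming an apt preferences entry is used (the version string below is a placeholder, not the actual version involved):

```
Explanation: Illustrative pin; the version string is a placeholder.
Package: libjson-c2
Pin: version 0.11-4ubuntu2
Pin-Priority: 1001
```

A pin priority above 1000 tells apt to prefer the pinned version even when a higher version is available, which is what makes the downgrade hold until the pin is removed.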
A few additional manual steps were required to restore the data into Elasticsearch; these were completed by 00:18, restoring service to website visitors.
The outage affected a large number of servers across our infrastructure, including those running Elasticsearch, FTP, and numerous internal tools. Restoring all of them took additional time, and everything was back to full strength by 03:30.
We have reported the issue to the package maintainers, and the update has since been removed from the upstream repositories.