Service Disruption
Incident Report for blubolt
Postmortem

Summary

On Sunday 29th September 2019, we experienced an issue with a server in our database cluster that resulted in a severe degradation in database query performance. This had a cascading effect across our infrastructure, leaving websites unavailable for a period of time.

Root cause and resolution

At approximately 13:00, a table on one of our read-only database servers became locked with an InnoDB table metadata lock. There was no corresponding database thread owning the lock, which left the table deadlocked. Over the next 25 minutes, requests to update that table failed to complete: they entered a wait state and remained in that state indefinitely.
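By way of illustration, the sketch below shows one way a granted metadata lock with no owning thread can be surfaced from MySQL's performance_schema. It is a simplified example rather than our production tooling; the host name, credentials and the assumption that metadata lock instrumentation (MySQL 5.7+) is enabled are placeholders.

```python
# Minimal sketch (not our production tooling): list granted table metadata
# locks whose owning thread no longer exists in performance_schema.threads.
# Assumes MySQL 5.7+ with the wait/lock/metadata/sql/mdl instrument enabled;
# host and credentials below are placeholders.
import mysql.connector

ORPHANED_MDL_QUERY = """
SELECT ml.OBJECT_SCHEMA, ml.OBJECT_NAME, ml.LOCK_TYPE, ml.LOCK_STATUS
FROM performance_schema.metadata_locks AS ml
LEFT JOIN performance_schema.threads AS t
       ON t.THREAD_ID = ml.OWNER_THREAD_ID
WHERE ml.OBJECT_TYPE = 'TABLE'
  AND ml.LOCK_STATUS = 'GRANTED'
  AND t.THREAD_ID IS NULL
"""

def find_orphaned_metadata_locks(conn):
    """Return granted table metadata locks with no corresponding owner thread."""
    cur = conn.cursor(dictionary=True)
    cur.execute(ORPHANED_MDL_QUERY)
    return cur.fetchall()

if __name__ == "__main__":
    conn = mysql.connector.connect(host="db-replica.example.internal",
                                   user="monitor", password="***")
    for lock in find_orphaned_metadata_locks(conn):
        print(lock)
```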

At 13:26, the number of web requests waiting for a response from the database reached a critical point at which all available servers were fully saturated. Our internal monitoring alerted us to this, and our incident response process was triggered.

Multiple steps were taken to mitigate the impact of the issue, including adding additional capacity to the infrastructure, but each time we did this the same symptoms returned within minutes.

Once we had determined that the root cause was linked to an issue with a server in the database cluster, that server was rebooted and the dangling connections on the web servers were terminated. A few minutes later, at 14:01, the web servers started processing requests again, restoring service to the majority of clients; full service was restored for all clients at 14:09.
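For illustration, the sketch below shows one way dangling database sessions like these can be cleared from the database side once the underlying fault has been addressed. It is not the exact procedure used during the incident; the ten-minute cut-off, host and credentials are assumptions made for the example.

```python
# Illustrative only: terminate client sessions that have been stuck in a wait
# state for longer than a threshold, releasing the requests queued behind them.
# The threshold, host and credentials are assumptions for this sketch.
import mysql.connector

STUCK_AFTER_SECONDS = 600  # assumed cut-off: anything waiting >10 minutes

def kill_stuck_sessions(conn):
    cur = conn.cursor(dictionary=True)
    cur.execute(
        "SELECT ID, TIME, STATE FROM information_schema.PROCESSLIST "
        "WHERE COMMAND = 'Query' AND TIME > %s",
        (STUCK_AFTER_SECONDS,),
    )
    for row in cur.fetchall():
        # KILL closes the connection, which releases any locks it is waiting on.
        cur.execute(f"KILL {int(row['ID'])}")

if __name__ == "__main__":
    conn = mysql.connector.connect(host="web-db.example.internal",
                                   user="admin", password="***")
    kill_stuck_sessions(conn)
```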

During this incident, we were also experiencing a suspected DDoS attack, with a significant increase in traffic from a large number of IP addresses in Hong Kong. Our data at the time suggested this might be the root cause of the issue, so the requests were blocked. While this ultimately proved to be unrelated, the additional traffic added an extra layer of complexity to the incident.

Long term mitigation

Following discussions with our hosting provider, a number of database configuration changes have been implemented, giving us better insight into the underlying processes running on the database when similar events occur. These changes also allow us to alert the on-call team earlier, providing more opportunity to resolve such issues before they have an impact on website visitors.
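As an illustration of the kind of earlier alerting this enables, the sketch below counts sessions stuck waiting on a table metadata lock and pages the on-call team when any are found. The wait threshold, connection details and paging hook are assumptions rather than our actual monitoring configuration.

```python
# Illustrative only: alert the on-call team when sessions have been waiting on
# a table metadata lock for too long. The threshold, connection details and
# paging hook are assumptions, not our actual monitoring configuration.
import mysql.connector

WAIT_THRESHOLD_SECONDS = 60  # assumed threshold before paging

def count_metadata_lock_waiters(conn):
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM information_schema.PROCESSLIST "
        "WHERE STATE = 'Waiting for table metadata lock' AND TIME > %s",
        (WAIT_THRESHOLD_SECONDS,),
    )
    return cur.fetchone()[0]

def check_and_page(conn, page_on_call):
    """page_on_call is a placeholder for whatever paging integration is in use."""
    waiters = count_metadata_lock_waiters(conn)
    if waiters:
        page_on_call(f"{waiters} sessions waiting on table metadata locks")
```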

While we don’t want issues like these to occur, we know that communication with our clients during them is critical. As such, we are continuing to improve our internal tooling to reduce the time it takes to get updates onto our status page.

Posted Oct 04, 2019 - 16:19 BST

Resolved
Site availability has remained stable, and our monitoring confirms normal performance levels. Our systems team will continue to investigate the full root cause of this issue, but there is no indication at present of any further impact on availability.
Posted Sep 29, 2019 - 14:49 BST
Monitoring
Site availability has remained stable since the last update with customer orders being taken as normal.
Posted Sep 29, 2019 - 14:32 BST
Identified
Our systems team have taken measures to mitigate the impact of the ongoing issue; this has restored availability for most sites. We are continuing to investigate the full nature of the cause and will update you again shortly.
Posted Sep 29, 2019 - 14:13 BST
Update
Our systems team is continuing to investigate this ongoing issue with the utmost urgency. We will update you again shortly.
Posted Sep 29, 2019 - 14:04 BST
Investigating
Please accept our apologies for any downtime that you may be currently experiencing.

We are aware of the issue and are working to implement a solution. It is our utmost priority to restore service.

We will update you again shortly.
Posted Sep 29, 2019 - 13:42 BST
This incident affected: Websites (Websites, Checkout & Cart, Third party systems).