On Sunday 29th September 2019, we experienced an issue with a server in our database cluster that resulted in severe degradation of database query performance. This had a cascading effect across our infrastructure, leading to websites being unavailable for a period of time.
At approximately 13:00, a table on one of our read-only database servers became blocked by an InnoDB table metadata lock. No database thread owned the lock, so it could never be released, leaving the table effectively deadlocked. Over the next 25 minutes, requests to update that table entered a wait state and remained in that state indefinitely.
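For context, this kind of stuck metadata lock can be diagnosed on MySQL 5.7 and later via the performance_schema, assuming the mdl instrument is enabled; the queries below are a general sketch, not the exact commands we ran. Blocked sessions appear in SHOW PROCESSLIST with the state "Waiting for table metadata lock", and performance_schema.metadata_locks reveals which thread, if any, owns the lock:

```sql
-- Blocked sessions show up here with
-- State: "Waiting for table metadata lock"
SHOW PROCESSLIST;

-- Enable metadata lock instrumentation (off by default in 5.7)
UPDATE performance_schema.setup_instruments
   SET ENABLED = 'YES'
 WHERE NAME = 'wait/lock/metadata/sql/mdl';

-- List pending metadata locks and the threads that hold or wait
-- for them; a GRANTED lock with no live owning session is the
-- failure mode described above
SELECT object_schema, object_name, lock_type,
       lock_status, owner_thread_id
  FROM performance_schema.metadata_locks
 WHERE object_type = 'TABLE';
```

On MySQL 5.7+, the sys schema view sys.schema_table_lock_waits offers a similar pre-joined summary of which sessions are blocking which.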
At 13:26, the number of web requests waiting for a response from the database reached the point where all available web servers were fully saturated. Our internal monitoring alerted us to this, and our incident response process was triggered.
Multiple steps were taken to mitigate the impact, including adding additional capacity to the infrastructure, but each time we did so the new capacity saturated again within minutes.
Once we had determined that the root cause was an issue with a server in the database cluster, that server was rebooted and the dangling connections on the web servers were terminated. A few minutes later, at 14:01, the web servers began processing requests again, restoring service to the majority of clients; full service was restored for all clients at 14:09.
During this incident, we were also experiencing a suspected DDoS attack, with a significant increase in traffic from a large number of IP addresses in Hong Kong. Our data at the time suggested this might be the root cause of the issue, so the requests were blocked. Although the traffic turned out to be unrelated, it added an extra layer of complexity to the incident.
Following discussions with our hosting provider, a number of database configuration changes have been implemented, giving us better insight into the underlying processes running on the database when similar events occur. These changes also allow us to alert the on-call team earlier, giving us more opportunity to resolve such issues before they impact website visitors.
While we don’t want issues like these to occur, we know that communication with our clients during them is critical. As such, we are continuing to improve our internal tooling to reduce the time it takes to get updates onto our status page.