Between 09:00 and 09:14 on Friday 23rd November, we experienced issues connecting to our primary database server, resulting in approximately 5% of page requests failing. This was caused by an issue with the underlying database, which has now been resolved.
Our primary database is configured to allow orders of magnitude more connections than we would expect to use on a typical day, giving us significant leeway to handle traffic spikes such as those we’d expect on Black Friday.
However, a bug in the specific version of the database software we use caused the number of available connection slots to decrease over time, constraining our ability to handle traffic spikes. As this is a bug rather than intended behaviour, we had no visibility into the reduced connection pool.
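One way to surface this kind of silent shrinkage is to periodically compare the number of connections in use against the connection limit and alert when headroom runs low. Below is a minimal sketch of such a check; the function and parameter names are hypothetical, and in practice the two counts would come from the database's own session statistics rather than being passed in directly.

```python
def headroom_alert(used_connections: int, max_connections: int,
                   threshold: float = 0.8) -> bool:
    """Return True if connection usage has crossed the alert threshold.

    In a real deployment, `used_connections` and `max_connections` would be
    read from the database's statistics (e.g. a count of active sessions
    versus the configured connection limit). Names here are illustrative.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return used_connections / max_connections >= threshold
```

A check like this, run on a schedule, would flag a leaking pool well before connection attempts start failing; the threshold is a tuning choice that trades early warning against alert noise.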
From 08:00 on Friday, traffic across the platform started increasing significantly, causing the number of open connections to the database to climb. At 08:45 our internal monitoring reported that a small number of connections had failed but self-recovered, and more error reports followed from 09:00, when database connection traffic peaked.
As per our incident response process, we attempted to mitigate the impact by blocking deploys to prevent the caches being cleared. Had the caches been cleared, there would have been a brief period of rebuilding the data, which would have required opening more database connections than were available at the time. Deploys were unblocked at 09:23, once we were confident they would not cause adverse effects on the platform.
The number of connections started falling at 09:10 and the errors stopped at 09:14. By 09:35, database connections had returned to typical levels.
Throughout the incident we investigated the cause with our hosting provider, who manages the database software and hardware. They identified the cause and applied a temporary fix later in the day; unfortunately, they are not able to share the details of the underlying cause with us.
By this point database traffic was sufficiently far below the connection limit that we saw no further connection failures, and the database continued to operate as expected for the remainder of the day.
We are planning a number of further improvements to the database, including moving to a newer version that is not affected by this bug, to ensure this issue does not recur. There is also a longer-term plan to investigate migrating to alternative database software.
We are also reviewing our incident response process to ensure that key events, such as deploys being blocked, are communicated to all affected parties in a timely manner.