Intermittent connection issues
Incident Report for blubolt
Postmortem

Summary

Between 09:00 and 09:14 on Friday 23rd November we experienced issues connecting to our primary database server, resulting in approximately 5% of page requests failing. This was caused by an issue with the underlying database, which has now been resolved.

Root cause and resolution

Our primary database is configured to allow orders of magnitude more connections than we would expect to use on a typical day, giving us significant headroom to handle traffic spikes such as those we’d expect on Black Friday.

However, a bug in the specific version of the database software we use caused the number of available connection slots to decrease over time, constraining our ability to handle traffic spikes. As this is a bug rather than intended behaviour, we had no visibility of the reduced connection pool.
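For context on why the shrinking pool went unnoticed, the sketch below shows the kind of routine check that compares in-use connections against the configured limit. It is a minimal, hypothetical example assuming a PostgreSQL-compatible database and the psycopg2 driver (this report does not name the specific software we run); because the bug reduced the effective number of slots without changing the configured limit, a check of this kind would continue to report healthy headroom.

```python
# Minimal sketch, assuming PostgreSQL and the psycopg2 driver; the database
# software, driver, and DSN below are illustrative assumptions, not our setup.
import psycopg2

def connection_headroom(dsn: str) -> int:
    """Return how many connection slots remain below the configured limit."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SHOW max_connections;")
            configured_limit = int(cur.fetchone()[0])
            cur.execute("SELECT count(*) FROM pg_stat_activity;")
            in_use = cur.fetchone()[0]
    return configured_limit - in_use

if __name__ == "__main__":
    # Placeholder DSN; point this at the primary database being monitored.
    print("Connection slots remaining:", connection_headroom("dbname=app_primary"))
```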

From 08:00 on Friday, traffic across the platform began increasing significantly, causing the number of open connections to the database to climb. At 08:45 our internal monitoring reported that a small number of connections had failed but self-recovered, and further errors were reported from 09:00, when database connection traffic peaked.

As per our incident response process, we attempted to mitigate the impact by blocking deploys to prevent the caches being cleared. Had the caches been cleared, there would have been a brief period of rebuilding the data, which would have required opening more database connections than were available at the time. Deploys were unblocked at 09:23, once we were confident that they would not have adverse effects on the platform.

The number of connections started falling at 09:10 and the errors stopped at 09:14. By 09:35, database connections had returned to the levels of a typical day.

Throughout the incident we investigated the cause with our hosting provider, who manages the database software and hardware. They were able to identify the cause and applied a temporary fix later in the day. Unfortunately, they are not able to share the details of the underlying issue with us.

By this point database traffic was sufficiently far below the limits that we saw no further connection failures, and the database continued to operate as expected for the remainder of the day.

Long-term mitigation

We are planning a number of further improvements to the database, including moving to a newer version that is not affected by this bug, to ensure this issue does not recur. There is also a longer-term plan to investigate migrating to alternative database software.

We are also reviewing our incident response process to ensure that key events, such as deploys being blocked, are communicated to all affected parties in a timely manner.

Posted Nov 29, 2018 - 12:00 GMT

Resolved
After monitoring throughout the day, we are fully satisfied that this issue is now resolved, as it has not recurred since our database provider applied a patch this morning.
Posted Nov 23, 2018 - 16:02 GMT
Monitoring
We have had confirmation from our upstream provider that a fix has been put in place. We will now be monitoring site requests to ensure the issue is fully resolved.
Posted Nov 23, 2018 - 10:45 GMT
Identified
We have identified an issue this morning which is resulting in page requests timing out. We have traced this to an issue with the upstream database provider and are working with them to resolve it. In the meantime, we will be blocking deploys intermittently to mitigate the impact during demanding periods.
Posted Nov 23, 2018 - 09:54 GMT
This incident affected: Websites.