Database connectivity issues
Incident Report for blubolt
Postmortem

Summary

Between 7:11 am and 7:29 am on Tuesday 18th June we experienced a high volume of connections to our database cluster resulting in a single read replica refusing additional connections. This caused a small number of requests to receive an error response from the platform.

Root cause and resolution

At 7:11 am we experienced a sudden and significant increase in the number of connections to the live database cluster. Because of how our hosting provider distributes traffic between the database nodes, a single node received the majority of the spike in connections, causing it to become unresponsive and reject new connections.

Our monitoring system alerted us to the high number of connections to the read replica at 7:14 am. An additional replica was added to the cluster and was in service by 7:26 am. This provided additional stable capacity that helped keep the number of 500 errors to a minimum.

The replica that was refusing connections was then restarted to bring it back to a normal working state by 7:28 am. During this period, the vast majority of requests were served successfully from our caching layer, with a small number receiving a fatal error.

At 2:16 pm on Wednesday 19th June the additional servers were removed from the database cluster. An issue with the scale-down policy meant that web servers did not automatically reconnect to the remaining available servers, which caused a very short period of error responses. The errors were resolved by 2:17 pm, after which requests were handled normally.
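
As an illustration of the behaviour that was missing during the scale-down, the sketch below shows a client trying each remaining replica in turn when one endpoint refuses connections. It is a minimal example under stated assumptions, not our production code: the function name, the endpoint strings and the `connect` callable are all hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(endpoints: Sequence[str], connect: Callable[[str], T]) -> T:
    """Return the first successful connection, trying endpoints in order.

    `endpoints` and `connect` are hypothetical stand-ins for a list of
    read replica addresses and a database driver's connect function.
    """
    last_error: Optional[ConnectionError] = None
    for endpoint in endpoints:
        try:
            return connect(endpoint)
        except ConnectionError as exc:
            # This replica has been removed or is refusing connections;
            # remember the error and move on to the next one.
            last_error = exc
    raise ConnectionError(
        f"no replica reachable out of {len(endpoints)}"
    ) from last_error
```

With a fallback of this kind, a web server whose replica has just been removed moves on to another available replica rather than returning an error for the request.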

Long term mitigation

We will implement a more efficient autoscaling policy to detect and mitigate this specific issue in future. We will also improve our communication around this scenario so that clients are informed of known issues sooner.
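
To illustrate the kind of check such a policy might make, the sketch below flags individual replicas that are close to their connection limit instead of relying on a cluster-wide average, which can hide a single overloaded node. It is a sketch only: the `ReplicaStats` type, the field names and the 80% threshold are assumptions made for the example and do not describe the actual policy.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ReplicaStats:
    """Hypothetical per-replica metrics, as a monitoring system might report them."""
    name: str
    connections: int
    max_connections: int


def replicas_over_threshold(stats: Sequence[ReplicaStats], threshold: float = 0.8) -> List[str]:
    """Names of replicas whose connection usage is at or above `threshold`."""
    return [s.name for s in stats if s.connections / s.max_connections >= threshold]


def should_scale_out(stats: Sequence[ReplicaStats], threshold: float = 0.8) -> bool:
    """Scale out as soon as any single replica is near its limit."""
    return len(replicas_over_threshold(stats, threshold)) > 0
```

For example, three replicas at 950, 200 and 150 connections out of 1,000 each average about 43% usage across the cluster, yet the first replica is at 95% and would trigger a scale-out under this check.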

Posted Jun 21, 2019 - 14:53 BST

Resolved
This incident has been resolved.
Posted Jun 18, 2019 - 11:07 BST
Update
We are continuing to monitor for any further issues.
Posted Jun 18, 2019 - 07:39 BST
Monitoring
Connectivity issues between the bluCommerce application tier and the database services were detected at 7:17 am this morning. Our systems team adjusted the configuration of the database read replicas at 7:20 am, with this change coming fully into effect a few minutes later. Our monitoring now shows that database connectivity has returned to normal, with all requests being served in a timely manner.
Posted Jun 18, 2019 - 07:38 BST
This incident affected: Admin (Admin application, Order and backend task processing, Scheduled tasks) and Websites (Websites, Checkout & Cart).