Between 7:11 am and 7:29 am on Tuesday 18th June we experienced a high volume of connections to our database cluster, resulting in a single read replica refusing additional connections. This caused a small number of requests to receive an error response from the platform.
At 7:11 am we experienced a sudden and significant increase in the number of connections to the live database cluster. As a result of how our hosting provider distributes traffic between the database nodes, a single node received the majority of the spike in connections, causing it to become unresponsive and reject new connections.
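How badly a connection spike skews toward one node depends on the balancing strategy in use. As an illustration only (this is not our hosting provider's actual algorithm), a least-connections picker spreads a sudden burst far more evenly than sticky or weighted routing, because each new connection goes to whichever replica currently holds the fewest:

```python
def pick_replica(conn_counts):
    """Return the index of the replica with the fewest active connections.

    `conn_counts` is a hypothetical list of current connection counts,
    one entry per replica. A least-connections strategy like this sends
    each new connection to the least-loaded node, so a burst of traffic
    is absorbed by the whole cluster rather than a single replica.
    """
    return min(range(len(conn_counts)), key=lambda i: conn_counts[i])
```

Under a strategy like this, the spike at 7:11 am would have been shared across the cluster; under sticky routing, the same burst lands on one node.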
Our monitoring system alerted us to the high number of connections to the read replica at 7:14 am, and an additional replica was added to the cluster and in circulation by 7:26 am. This provided additional stable capacity that helped keep the number of 500 errors to a minimum.
The replica that was refusing connections was then restarted, returning it to a normal working state by 7:28 am. During this period, the vast majority of requests were served successfully from our caching layer, with a small number receiving a fatal error.
At 2:16 pm on Wednesday 19th June the additional servers were removed from the database cluster. This caused a brief period of error responses: an issue with the scale-down policy meant that web servers did not automatically reconnect to the remaining available database servers. The errors were resolved by 2:17 pm, and requests were handled normally thereafter.
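The failure mode here was that web servers kept pointing at a replica that had just been removed instead of falling back to the ones still in service. A minimal sketch of the kind of client-side failover that avoids this (the `replicas` list and `connect` callable are hypothetical, not our actual stack):

```python
def connect_with_failover(replicas, connect):
    """Try each replica in turn and return the first successful connection.

    `replicas` is a list of host names and `connect` is a callable that
    either returns a connection object or raises ConnectionError. If every
    replica fails, the last error is re-raised so the caller can surface it.
    """
    last_error = None
    for host in replicas:
        try:
            return connect(host)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next host
    raise last_error or ConnectionError("no replicas configured")
```

With logic like this, removing a replica during scale-down produces at most one failed attempt per request before the client moves on to a live node.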
We will implement a more efficient autoscaling policy to detect and mitigate this specific issue in the future. We will also improve our communication around incidents like this so that clients are informed of known issues more quickly.
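The core of such a policy is deciding how many replicas the current connection load requires before any single node is saturated. A simplified sketch of that sizing calculation (the threshold and minimum are illustrative placeholders, not our production values):

```python
def desired_replicas(conn_counts, max_conns_per_node, min_replicas=2):
    """Return how many replicas are needed for the current load.

    `conn_counts` is a list of current connection counts per replica and
    `max_conns_per_node` is the per-node connection budget. We size the
    cluster so that, spread evenly, no node would exceed its budget, and
    never scale below `min_replicas` so one node failing cannot take out
    all read capacity.
    """
    total = sum(conn_counts)
    needed = -(-total // max_conns_per_node)  # ceiling division
    return max(needed, min_replicas)
```

An autoscaler evaluating this on each monitoring interval would have added capacity at 7:11 am without waiting for a human response to the 7:14 am alert.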