Sites unavailable
Incident Report for blubolt
Postmortem

Summary

Between 10:14 and 10:50 am on Monday 11th March, we experienced issues with the configuration on our live web worker cluster. View-cached pages (65% of served traffic across the platform) remained available; however, pages with dynamic content returned an error when being generated.

Root cause and resolution

As part of our weekly maintenance, we had identified an update to a core software package that required a restart of several services. To minimise disruption to website operations, this was deployed using our standard rolling replacement process.
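As an illustration only, a rolling replacement can be sketched as a loop that replaces one server at a time and halts at the first replacement that fails a health check. This is not our actual tooling: the function name, the `replace` callback and the `is_healthy` callback are hypothetical stand-ins.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rolling_replace(
    servers: Iterable[str],
    replace: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Replace servers one at a time, halting at the first unhealthy one.

    Returns (servers replaced successfully, the server that failed or None).
    Both callbacks are hypothetical hooks supplied by the caller.
    """
    done: List[str] = []
    for server in servers:
        replace(server)  # bring up the replacement for this server
        if not is_healthy(server):
            # Stop here so the remaining servers keep their old, working state.
            return done, server
        done.append(server)
    return done, None
```

Halting on the first failed check limits the impact of a bad configuration to a single server, provided the health check exercises a dynamically generated page rather than a cached one.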

During the rollout of the new servers, our monitoring system alerted us to a configuration issue that caused sites to display an error message when processing a request.

While investigating during the incident, we found that the configuration deployed to the servers was out of date compared with the version stored in our configuration management software.
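This kind of mismatch can be detected by comparing a checksum of the expected configuration with a checksum of what each server is actually running. The following is a minimal sketch under assumed inputs, not a description of our configuration management software: `find_drift`, the host names and the configuration strings are hypothetical.

```python
import hashlib
from typing import Dict, List


def sha256_of(text: str) -> str:
    """Return the SHA-256 hex digest of a configuration file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_drift(expected_config: str, deployed: Dict[str, str]) -> List[str]:
    """Return the hosts whose deployed configuration differs from the expected one.

    `deployed` maps a host name to the configuration text read from that host.
    """
    expected_digest = sha256_of(expected_config)
    return sorted(
        host
        for host, config in deployed.items()
        if sha256_of(config) != expected_digest
    )
```

For example, `find_drift(expected, {"web1": expected, "web2": stale})` would return `["web2"]` when `stale` differs from `expected`. Running such a check before new servers take traffic is one way to catch an out-of-date configuration early.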

During the incident, we employed various mitigation techniques, including blocking deploys across the platform to ensure that we continued to serve view-cached pages. We resolved the incident by manually installing the configuration file across all live servers and restarting the affected service. This brought the servers back to normal working operation.

Long term mitigation

We will be looking into additional steps and improvements that we can make to the testing and rollout of service-level configuration. We will also be improving our communication around this specific scenario so that we can inform our clients of known issues more quickly.

Posted 4 months ago. Mar 13, 2019 - 14:44 GMT

Resolved
We have monitored our platform and can confirm that the actions we took this morning have resolved the issues with site availability.

We will publish a detailed update describing the cause and resolution of this issue on the status page.
Posted 4 months ago. Mar 11, 2019 - 17:15 GMT
Update
The platform remains stable and no issues with site availability have been reported in our logs.

The systems team is continuing to monitor this.
Posted 4 months ago. Mar 11, 2019 - 11:42 GMT
Update
The platform is stable and there haven't been any additional errors or issues with site availability. We are continuing to monitor the platform and will update you again shortly.
Posted 4 months ago. Mar 11, 2019 - 11:10 GMT
Update
We are currently still in the same situation; however, we can see that there have been no errors in the last 5 minutes and the platform is stable.

We will update you in the next 5 minutes.
Posted 4 months ago. Mar 11, 2019 - 10:58 GMT
Monitoring
We have now resolved the immediate issues and are working to ensure that the platform remains stable. We will continue to monitor this for the next 5 minutes and update you shortly.

We will also unblock deploys going forward.
Posted 4 months ago. Mar 11, 2019 - 10:51 GMT
Update
We are still seeing issues with site availability. Our systems team is working with full resources to resolve these issues as quickly as possible.

We will update you shortly with a status update.
Posted 4 months ago. Mar 11, 2019 - 10:40 GMT
Identified
We are currently recovering from a systems issue. We will update you with further information as soon as possible.
Posted 4 months ago. Mar 11, 2019 - 10:27 GMT
This incident affected: Admin (Admin application, Order and backend task processing, Asset generation, Scheduled tasks) and Websites (Websites, Checkout & Cart, Third party systems).