Between 10:14 and 10:50 am on Monday 11th March, we experienced issues with the configuration on our live web worker cluster. View cached pages (65% of served traffic across the platform) remained available; however, pages with dynamic content encountered errors when being generated.
As part of our weekly maintenance, we had identified an update to a core software package that required a restart of several services. To minimise disruption to website operations, we deployed the update using our standard rolling replacement process.
During the rollout of the new servers, our monitoring system alerted us to a configuration issue that caused sites to display an error message when processing a request.
While investigating, we found that the configuration deployed to the servers was out of date compared with the version stored in our configuration management software.
During the incident, we employed several mitigation techniques, including blocking deploys across the platform to ensure that we continued to serve view cached pages. We resolved the incident by manually installing the configuration file across all live servers and restarting the affected service. This returned the servers to normal operation.
We will be reviewing additional steps and improvements that we can make to the testing and rollout of service-level configuration. We will also be improving our communication around this specific scenario so that we can inform our clients of known issues sooner.