Website search index availability
Incident Report for blubolt
Postmortem

Summary

Our search system, Elasticsearch, has recently been handling a significant increase in data update traffic. This has increased memory use and churn during product updates, to the point where several individual servers were spending nearly all of their available time managing memory rather than processing the updates.

The outage occurred between 17:23 and 17:25 on 17 December, causing a number of requests to fail until the Elasticsearch cluster self-healed. Our incident team began working on the issue immediately, but the impact had already been mitigated by our use of caching, which meant that 90% of requests were still processed successfully. Further actions were then carried out to prevent future outages.

Root cause

We recently released significant upgrades to our Visual Merchandising system, which allows products to be re-ordered on product listing pages through a drag-and-drop interface or through an import from a spreadsheet.

Under the hood, this adds extra sort data to each product for every category it appears in, and stores that data in Elasticsearch.
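
For illustration only, here is a minimal sketch of what this kind of per-category sort data might look like when indexed; the index name, field names and client usage are assumptions for explanation, not our actual schema.

```python
# Illustrative sketch only: index name, field names and client usage are
# assumptions, not blubolt's actual schema or code.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Each product carries a sort position for every category it appears in,
# so re-ordering a category means re-indexing every product within it.
product_doc = {
    "name": "Example product",
    "price": 19.99,
    "category_sort_positions": {
        "summer-sale": 3,
        "new-arrivals": 12,
        "womens-footwear": 47,
    },
}

es.index(index="products", id="example-product-id", body=product_doc)
```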

Uptake of the updated Visual Merchandising system has been phenomenal, but the impact has been that Elasticsearch has begun to struggle with the increased size of the updates, particularly when large numbers of changes are applied at once, such as during a Visual Merchandising import.

We have started seeing significant increases in Java heap usage in Elasticsearch while products are being updated, and a corresponding increase in the amount of garbage collection Java has needed to do to clean up old data in the heap.

This increase in garbage collection has had a knock-on effect: the more time Elasticsearch spends reclaiming memory, the fewer resources remain available for normal query and update processing.
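
As a rough illustration, this is the kind of check that surfaces the symptom: Elasticsearch exposes heap and garbage-collection statistics through its standard nodes-stats API. The host below is a placeholder.

```python
# Illustrative sketch only: reading heap and GC figures from the standard
# _nodes/stats/jvm endpoint; the host is a placeholder.
import requests

resp = requests.get("http://localhost:9200/_nodes/stats/jvm")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    jvm = node["jvm"]
    heap_pct = jvm["mem"]["heap_used_percent"]
    # Old-generation collections are the expensive ones that stall indexing
    # and querying while memory is reclaimed.
    old_gc = jvm["gc"]["collectors"]["old"]
    print(
        f"{node['name']}: heap {heap_pct}%, "
        f"old GC count {old_gc['collection_count']}, "
        f"old GC time {old_gc['collection_time_in_millis']} ms"
    )
```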

Recovery and long term mitigation

We have taken a number of steps to resolve this issue, including adjusting the size of our Elasticsearch cluster and tuning settings to spread the load across it more evenly. We have also changed how Elasticsearch document indexing works during bulk updates, reducing the amount of processing needed while the updates run. We are continuing to look into improving the performance of Visual Merchandising imports, with a number of optimisations expected in the new year.
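
As one example of the kind of indexing change involved (a sketch of a common approach, not necessarily the exact change we made), the index refresh interval can be relaxed while a bulk import runs and restored afterwards, which reduces the work Elasticsearch does per document during the import:

```python
# Illustrative sketch only: one common way to cut indexing overhead during a
# bulk import is to pause near-real-time refreshes while it runs and restore
# them afterwards. Index name and client usage are assumptions.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
INDEX = "products"

def bulk_import(actions):
    # Pause refreshes so segments are not rebuilt constantly while large
    # numbers of documents are being rewritten.
    es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "-1"}})
    try:
        helpers.bulk(es, actions, index=INDEX)
    finally:
        # Restore the default refresh behaviour once the import completes.
        es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "1s"}})
        es.indices.refresh(index=INDEX)
```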

Posted Dec 20, 2019 - 16:45 GMT

Resolved
We have completed maintenance on the affected search indexing system components, and everything is operating as normal.
Posted Dec 17, 2019 - 20:20 GMT
Identified
We have identified the cause of the issue and are performing maintenance on the affected search indexing system components.

No customer-facing errors have been reported by our monitoring solution for the past 30 minutes and website availability is stable at this time.
Posted Dec 17, 2019 - 18:19 GMT
Investigating
We are currently investigating alerts issued by our monitoring solution relating to our internal website search index system availability.

Website and administration application performance and availability are currently being impacted.

We will post updates as we have more information.
Posted Dec 17, 2019 - 17:41 GMT
This incident affected: Admin (Admin application, Order and backend task processing, Scheduled tasks) and Websites (Websites, Checkout & Cart).