Our search system, Elasticsearch, has recently been experiencing a significant increase in data update traffic. This increased memory use and churn during product updates, to the point where several individual servers were spending nearly all of their available time managing memory rather than processing updates.
The worst of this occurred between 17:23 and 17:25, causing a number of requests to fail until the Elasticsearch cluster self-healed. Our incident team responded immediately, but the impact had already been mitigated by our caching layer, which meant that 90% of requests were still processed successfully. We then carried out further actions to prevent future outages.
We recently released significant upgrades to our Visual Merchandising system, which allows products to be re-ordered on product listing pages through a drag-and-drop interface or by importing a spreadsheet.
Under the hood, this adds extra sort data to each product for every category it is ordered in, and stores that data in Elasticsearch.
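To make the effect on document size concrete, here is a hypothetical sketch of what per-category sort data on a product document might look like; the field names (`category_sort`, `category_id`, `position`) are illustrative, not our actual schema.

```python
# Illustrative sketch: attach a sort position for every category a product
# appears in. Field names here are hypothetical, not our real mapping.

def add_category_sort_positions(product, positions):
    """Return a copy of `product` with per-category sort positions attached.

    `positions` maps a category ID to the position chosen for the product
    in that category via Visual Merchandising.
    """
    doc = dict(product)
    # One entry per category: the more categories a product belongs to,
    # the larger every update to that product's document becomes.
    doc["category_sort"] = [
        {"category_id": category, "position": position}
        for category, position in sorted(positions.items())
    ]
    return doc

product = {"id": "sku-123", "name": "Blue T-Shirt"}
doc = add_category_sort_positions(product, {"shirts": 4, "sale": 17})
```

Because a single spreadsheet import can touch every product in a category, each of these enriched documents is re-indexed in one burst, which is exactly the workload shape described below.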
Uptake of the updated Visual Merchandising system has been phenomenal, but as a result Elasticsearch has begun to struggle with the increased size of these updates, particularly when large numbers of changes are applied at once, such as during a Visual Merchandising import.
We have started seeing significant increases in Java heap usage in Elasticsearch while products are being updated, and a corresponding increase in the amount of garbage collection Java has needed to do to clean up old data in the heap.
This increase in garbage collection has had knock-on effects on query performance: the more time Elasticsearch's JVM spends collecting garbage, the fewer resources are available for normal query processing.
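This is the kind of pressure that shows up directly in Elasticsearch's `GET _nodes/stats/jvm` API, which reports per-node heap usage and cumulative garbage-collection time. A minimal sketch of such a check is below; the payload is a hand-written illustrative sample rather than real cluster output, and the 85% threshold is an assumed alerting level, not an Elasticsearch default.

```python
# Hand-written sample shaped like a `GET _nodes/stats/jvm` response;
# the numbers are illustrative only.
sample_stats = {
    "nodes": {
        "node-1": {
            "name": "es-data-1",
            "jvm": {
                "mem": {"heap_used_percent": 92},
                "gc": {
                    "collectors": {
                        "young": {"collection_count": 4120,
                                  "collection_time_in_millis": 318_000},
                        "old": {"collection_count": 55,
                                "collection_time_in_millis": 97_000},
                    }
                },
            },
        }
    }
}

def nodes_over_heap_threshold(stats, threshold=85):
    """Return the names of nodes whose JVM heap usage exceeds `threshold` percent."""
    return [
        node["name"]
        for node in stats["nodes"].values()
        if node["jvm"]["mem"]["heap_used_percent"] > threshold
    ]

print(nodes_over_heap_threshold(sample_stats))  # nodes likely spending excessive time in GC
```

A node sitting persistently above a threshold like this, with GC time climbing between polls, matches the symptom described above: the JVM is busy managing memory instead of serving queries.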
We have taken a number of steps to resolve this issue, including resizing our Elasticsearch cluster and adjusting settings to spread the load across it more evenly. We have also changed how Elasticsearch indexes documents during bulk updates to reduce the amount of processing they require, and we are continuing to investigate the performance of Visual Merchandising imports, with a number of optimisations expected in the new year.
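One common technique for reducing indexing work during bulk updates, sketched below, is to disable the index's `refresh_interval` while an import runs and restore it afterwards, so Elasticsearch builds fewer search-visible segments mid-import. This is a sketch of the general approach, not necessarily the exact change we shipped; the `client` is assumed to expose `indices.put_settings` in the style of the official `elasticsearch` Python client, and a minimal recording stand-in is included so the example runs without a cluster.

```python
from contextlib import contextmanager

@contextmanager
def bulk_friendly_refresh(client, index, restore_interval="1s"):
    """Disable index refresh for the duration of a bulk import, then restore it."""
    client.indices.put_settings(
        index=index, settings={"index": {"refresh_interval": "-1"}}
    )
    try:
        yield
    finally:
        # Always restore, even if the import fails partway through.
        client.indices.put_settings(
            index=index, settings={"index": {"refresh_interval": restore_interval}}
        )

# Minimal stand-in for a real client so this sketch runs without a cluster.
class _RecordingIndices:
    def __init__(self):
        self.calls = []

    def put_settings(self, index, settings):
        self.calls.append((index, settings))

class _RecordingClient:
    def __init__(self):
        self.indices = _RecordingIndices()

client = _RecordingClient()
with bulk_friendly_refresh(client, "products"):
    pass  # run the bulk import here
```

The trade-off is that newly indexed documents are not searchable until the refresh interval is restored, which is usually acceptable for a merchandising import that is only "done" once the whole batch has been applied.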