Delayed processing of documents
Incident Report for Tradeshift
Postmortem

Summary

Users experienced slow or unavailable service on the Tradeshift platform throughout the day on March 5th. The incident was triggered during routine maintenance that our storage team was performing to upgrade servers in our document storage system.

At the beginning of the day, an underlying hardware failure on AWS resulted in an additional server being automatically removed from the cluster during the planned maintenance. Removing a server from the cluster generates rebalancing traffic while the cluster resynchronizes, which increased network usage. Concurrently, another server was scheduled for automatic removal. This is normal, and our systems should have been able to handle it. For reasons still under investigation, read latency jumped dramatically during this process.

Because of this slowdown, our customers observed three things: slow UI response, intermittent API and UI availability, and long delays before documents were available. Additionally, the slowdown caused a chain of events that exacerbated the situation.

The processing queues that index new documents to make them searchable grew due to the increased latency in the document storage cluster. This resulted in long delays before documents were visible on the receiver side.

Our load balancers then began to run out of memory because they had to hold more of these concurrent slow requests. This resulted in the UI and APIs being intermittently available.

Timeline (All times in UTC)

March 5th 2019

08:00 - Higher latency was observed on the storage system; investigation started; indexing queues started growing (#1)

12:00 - Latency increased; the UI became intermittently available (#2) and slow in certain parts (#3)

14:00 - Load balancers were resized to add capacity, reducing the user-visible impact of #2 and #3

18:00 - All performance problems were resolved; #2 and #3 were no longer observable, but #1 remained a problem, especially for documents dispatched during that period.

March 6th 2019

Delays in processing continued through March 6th while our indexing subsystem caught up with the backlog.

Throughout the day, teams continued tuning access to the backup storage systems. During this period, our customers saw slightly delayed updates in the UI.

By 19:00, all backlogged processing was complete and all operations were back to normal.

Major avenues of repair

Three teams worked together to recover from the incident, each focusing on a different aspect:

Storage team - Investigate and fix the root cause of the slow responses from the underlying storage system.

DevOps team - Mitigate the effects of the slow storage system by adding capacity to our load balancer and service pools.

Site Reliability Engineering team - Facilitate a failover to our backup storage system. As part of our disaster recovery capabilities, we have a complete backup storage system in place. This team made a code change to use the backup system in place of the primary to improve the user experience during the incident.

The three teams coordinated the above changes in production so that we could monitor the effect of each change before moving on to the next.

Remediations

This is the first large-scale cascading failure we have seen on the platform where slowness in one subsystem caused upstream systems to be partly or completely unavailable. Below are improvements we are implementing in response:

Response time

Ability to rapidly add capacity. Load balancers and services took longer than necessary to add capacity on demand. We will improve and optimize the automation required to enable rapid scaling.
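For illustration only, the kind of scaling automation we are optimizing could look like the sketch below. It assumes an AWS Auto Scaling group managed through boto3; the group name and instance counts are placeholders, not our production configuration.

    import boto3

    # Hypothetical example: raise the desired capacity of a load balancer pool
    # when more headroom is needed. Group name and counts are placeholders.
    autoscaling = boto3.client("autoscaling")

    def add_capacity(group_name: str, extra_instances: int) -> None:
        groups = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name]
        )["AutoScalingGroups"]
        current = groups[0]["DesiredCapacity"]
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=current + extra_instances,
            HonorCooldown=False,
        )

    add_capacity("lb-pool-example", extra_instances=2)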

System failover latency improvements. We had to test and release a small code change to enable failover to the backup system, as it was not fully configurable in real time. We will be adding this capability in the immediate future.
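As a sketch of what real-time configurability could look like, the storage endpoint could be selected by a runtime flag instead of a code change. The file path, flag name, and endpoints below are hypothetical.

    import json
    from pathlib import Path

    # Hypothetical runtime switch between primary and backup document storage.
    # The file path, flag name, and endpoints are placeholders for illustration.
    PRIMARY_URL = "https://storage-primary.internal.example"
    BACKUP_URL = "https://storage-backup.internal.example"
    FLAGS_FILE = Path("/etc/app/runtime-flags.json")

    def storage_endpoint() -> str:
        # Re-reading the flag on each call lets operators fail over by editing
        # configuration, with no build or deploy required.
        try:
            flags = json.loads(FLAGS_FILE.read_text())
        except (OSError, ValueError):
            flags = {}
        return BACKUP_URL if flags.get("use_backup_storage") else PRIMARY_URL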

Impact on user experience did not trigger alerts in our monitoring systems after the main storage issue was resolved. The degraded service responsiveness was due to the backlog of work in the indexing subsystem, and the lack of alerting on this queue masked the severity of the incident. Relevant metrics have been added to our monitoring and alerting systems.
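A minimal sketch of the kind of backlog check that now feeds alerting is shown below; get_indexing_queue_depth and page_on_call stand in for internal hooks, and the threshold is illustrative only.

    # Hypothetical backlog check: get_indexing_queue_depth and page_on_call
    # stand in for the internal queue API and paging integration; the
    # threshold is illustrative only.
    BACKLOG_ALERT_THRESHOLD = 10_000

    def check_indexing_backlog(get_indexing_queue_depth, page_on_call) -> int:
        depth = get_indexing_queue_depth()
        if depth > BACKLOG_ALERT_THRESHOLD:
            # Alert on the backlog itself, not only on storage latency, so a
            # growing queue stays visible even after the storage issue is fixed.
            page_on_call(f"Indexing backlog at {depth} documents")
        return depth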

System resilience

Client behavior for the document storage system was not optimal: failover was not fast enough, which caused additional slowness. Failover optimizations have already been put in place to improve this.
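For illustration, a client-side read with a tight timeout and quick failover to another storage node might look like the sketch below; the hosts, paths, and timeout values are hypothetical.

    import requests

    # Hypothetical read path: try each storage node with a short timeout and
    # fail over to the next instead of waiting on a slow one. Hosts, paths,
    # and the timeout value are placeholders.
    HOSTS = [
        "https://storage-a.internal.example",
        "https://storage-b.internal.example",
    ]

    def read_document(doc_id: str, timeout_seconds: float = 0.5) -> bytes:
        last_error = None
        for host in HOSTS:
            try:
                response = requests.get(f"{host}/documents/{doc_id}",
                                        timeout=timeout_seconds)
                response.raise_for_status()
                return response.content
            except requests.RequestException as error:
                last_error = error  # move on to the next node quickly
        raise last_error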

There has been an ongoing effort to evolve the design of the storage layer architecture to reduce complexity and the likelihood of operational failures. The incident has accelerated these efforts, as the teams are determined to ensure that we do not experience this failure mode again.

Service level objectives

Because of the cascading effect, we will set stricter service level objectives for the lower-level storage systems and the teams running them.
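As an illustration, a stricter storage-layer objective could be expressed and checked along these lines; the percentile and latency target below are placeholders, not the objectives we will publish.

    import statistics

    # Hypothetical SLO check: require 99th-percentile read latency over a
    # window of samples to stay under a target. The 150 ms figure is a
    # placeholder, not a published objective.
    READ_LATENCY_SLO_MS = 150.0

    def meets_read_latency_slo(latencies_ms):
        p99 = statistics.quantiles(latencies_ms, n=100)[98]
        return p99 <= READ_LATENCY_SLO_MS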

Conclusion

First, we apologize for the disturbance this caused our customers. We know that Tradeshift operates in many mission-critical flows and that service availability and responsiveness are of paramount importance to our users. We strive to provide an always-available, highly responsive user experience, and in this instance we did not live up to those expectations.

We know that actions speak louder than words, and as such we are redoubling our efforts to evaluate possible failures and to put preventative measures in place. We’re also actively building a more formal program to better educate and train our engineers on the types of failures that can occur in complex distributed systems, in order to build quality and reliability into the earliest stages of product development.

We thank you for your patience during this incident and again apologize deeply for the interruption and inconvenience this caused our customers.

Posted Mar 07, 2019 - 14:17 PST

Resolved
We’re back! We’re sorry for the disruption to your day.
Posted Mar 06, 2019 - 11:05 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 06, 2019 - 09:53 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 06, 2019 - 07:34 PST
Update
We are continuing to investigate this issue.
Posted Mar 06, 2019 - 06:06 PST
Investigating
We are investigating delays on document processing.

We apologize for the inconvenience.
Posted Mar 06, 2019 - 06:06 PST
This incident affected: Tradeshift Go (getgo.tradeshift.com), WebUI (go.tradeshift.com), API (api.tradeshift.com), and Integration (FTP/FTPS/SFTP) (si.tradeshift.com).