Degraded performance
Incident Report for Tradeshift
Postmortem

Issue Summary

Starting at 23:00 UTC on April 22, 2014, most of the platform's functionality became unresponsive because communication between platform components was not working as expected. This affected all users, blocking the ability to log in to the platform, dispatch documents, and make API calls.

Root Cause

The root cause of the downtime was a network failure in our hosting provider's datacenter. This triggered the problems that followed, leaving the queue cluster in a bad state and disrupting communication between the different platform components.

Resolution and recovery

The queue cluster was recovered by stopping the master so that the backup server could take its place.

The affected application servers (those connecting directly to the queue) were restarted to ensure proper connectivity to the queue cluster.
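
The report does not name the queue technology, hostnames, or ports involved, so the following is only a minimal sketch, under assumed placeholder names, of the kind of connectivity check an operator might run after restarting the application servers to confirm they can reach the queue cluster again.

#!/usr/bin/env python3
"""Hypothetical post-restart check: attempt a TCP connection to each queue
cluster node from an application server. All hostnames and the port number
are placeholders, not values from the incident report."""

import socket

# Placeholder queue cluster nodes (assumption; not taken from the report).
QUEUE_NODES = [
    ("queue-master.internal", 5672),
    ("queue-backup.internal", 5672),
]

TIMEOUT_SECONDS = 5


def node_reachable(host: str, port: int) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT_SECONDS):
            return True
    except OSError:
        return False


def main() -> int:
    failures = 0
    for host, port in QUEUE_NODES:
        ok = node_reachable(host, port)
        print(f"{host}:{port} -> {'reachable' if ok else 'UNREACHABLE'}")
        if not ok:
            failures += 1
    # A non-zero exit code lets a restart script stop if the queue is still unreachable.
    return 1 if failures else 0


if __name__ == "__main__":
    raise SystemExit(main())

A check like this only verifies network reachability; confirming that the cluster itself is healthy would additionally require the queue system's own management tooling.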

Other instances affected by the network issue were also restarted, but no significant improvement was seen until the network was back in a working state.

Tradeshift is committed to continually improving our technology and operational processes to prevent outages. We appreciate your patience and apologize for the impact this outage had on you and your organization. We thank you for your continued support.

Posted Apr 23, 2014 - 04:14 PDT

Resolved
We have been investigating performance problems with the platform that affected some users' ability to log in and dispatch documents, as well as API access. The problem has been resolved, and all systems are back in a working state.
Posted Apr 22, 2014 - 18:11 PDT