Starting at 23:00 UTC on April 22, 2014, most of the platform's functionality became unresponsive, as communication between the platform components was not working as expected. All users were affected: they were unable to log in to the platform, dispatch documents, or make API calls.
The root cause of the downtime was a network failure in our hosting provider's datacenter. This failure triggered all of the unwanted behaviour, from leaving the queue cluster in a bad state to causing communication problems between the different components.
The queue cluster was recovered by stopping the master so that the backup server could take its place.
The affected application servers (those connecting directly to the queue) were restarted to ensure proper connectivity to the queue cluster.
Other instances affected by the network issue were also restarted, but no significant improvement was observed until the network was working again.
Tradeshift is committed to continually improving our technology and operational processes to prevent outages. We appreciate your patience, and we apologize for the impact this incident had on you and your organization. We thank you for your continued support.