If you use MultiTracks.com, ChartBuilder, or Playback, you might have noticed that our API had some critical performance issues this past Sunday, beginning around 7:30 AM CST. We understand that you trust our products to always work without disruption, especially on a Sunday morning. This is our standard, and we apologize that we did not meet it last Sunday. The stability of each of our products is, and has always been, our top priority, and we know that we let many of you down this weekend. Please be assured that our team takes this very seriously, and we are passionately committed to being a platform and a partner you can trust.
To that end, we want to disclose what went wrong and give you confidence that we are addressing the issue. Simply put, a bug in our system triggered a cascade of problems that, while not all related, together placed significant load on our servers and resulted in downtime on Sunday morning. We have identified the issue and put measures in place to ensure it does not happen again.
As a result of what happened, here are the changes we are implementing to prevent a recurrence:
We have already begun significant infrastructure investments so that we can scale accordingly should an issue like this arise again. On Sunday there were several problems which, on their own, could have been addressed without affecting customers, but together they cascaded into a severe degradation of our service. As a result, we are adding additional servers and improving our monitoring.
We are implementing further automated testing for elevated-load scenarios, including test datasets that more closely mirror production and simulations of peak traffic.
We are adding better offline support to our apps to handle transient network errors. Many of you reported being logged out of ChartBuilder unexpectedly yesterday. That should not happen, and it will be addressed. You should be able to use both Playback and ChartBuilder freely, even without a network connection, for any content you have previously downloaded.
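The offline behavior described above can be sketched as an "offline-first" read: prefer fresh data from the network, but on a network error fall back to previously downloaded content rather than failing (or logging the user out). This is a minimal illustrative sketch; the function and parameter names are hypothetical, not MultiTracks code.

```python
def load_chart(chart_id, fetch_remote, cache):
    """Offline-first read sketch (hypothetical names, not MultiTracks code).

    Tries the network first and refreshes the local cache on success;
    on a connection error, serves previously downloaded content if any
    exists, and only raises if there is nothing cached to fall back on.
    """
    try:
        data = fetch_remote(chart_id)
        cache[chart_id] = data  # refresh the offline copy
        return data
    except ConnectionError:
        if chart_id in cache:
            return cache[chart_id]  # previously downloaded content
        raise
```

The key design point is that a transient network failure degrades to stale-but-usable content instead of an error state, which is what lets previously downloaded charts keep working with no connection at all.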
If you are interested in the technical reasons why we had issues, please read on.
What exactly went wrong?
We use SQL Server to service requests for many parts of our applications, from authentication details to downloading MultiTracks metadata into Playback. Last week we deployed an update to a database function that updates content in your library. At MultiTracks we have several layers of review before code is deployed to production: an internal developer code review that checks for performance, security, and functional acceptance, followed by a dedicated QA team that tests each issue for acceptance as well. The code in question passed both of those without issue. However, we did not realize that, under the strain of significant server load, it could cause cascading SQL performance issues.
Due to the spike in activity on Sunday morning, our SQL Server reached peak capacity at 7:30 AM. This caused a significant increase in the load on our servers: requests that typically complete in milliseconds were now taking tens of seconds or longer, and some requests began failing outright. Our apps use automated retries for network requests to mitigate the transient errors typically seen on a network, so as requests failed, the apps sent additional traffic to fetch the data they needed to complete the actions customers were taking. In this scenario, a single user action could, in some cases, result in several retries occurring behind the scenes.
As requests rapidly multiplied, they compounded one another, and traffic grew exponentially beyond what our servers could handle. At that point, our web cluster exhausted its available connection ports and could no longer send requests to our caching tier. With the caching tier unreachable, requests fell back to SQL Server, which exacerbated an already demanding situation.
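One standard guard against this kind of cascade is a circuit breaker: after enough consecutive failures against a backend, callers "fail fast" for a while instead of piling still more requests onto it, giving the backend room to recover. The sketch below is a deliberately minimal, hypothetical illustration of the pattern; it is not the MultiTracks implementation, and production breakers typically also add a timed "half-open" probe state.

```python
class CircuitBreaker:
    """Minimal circuit-breaker sketch (hypothetical, illustrative only).

    After `threshold` consecutive failures the breaker "opens" and
    allow_request() returns False, so callers fail fast rather than
    adding load to a struggling backend. A success resets the count.
    """

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.failures = 0

    def allow_request(self):
        # Closed (healthy) while consecutive failures stay below threshold.
        return self.failures < self.threshold

    def record_failure(self):
        self.failures += 1

    def record_success(self):
        self.failures = 0
```

Applied to a fallback chain like cache-then-database, a breaker on the overloaded tier would shed load early instead of letting every timed-out request land on SQL Server.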
Our team began working diligently to solve the issue, but in the confusion of diagnosing multiple problems simultaneously, it was 11:30 AM CST before everything had been resolved across our platform.
We are dramatically increasing the scale of our SQL Server infrastructure to cope with significantly more load than it typically sees today. This provides a substantial buffer should this issue ever occur again, though we are also taking additional steps to prevent that possibility.
We are adding additional capacity and scale to our web farm.
We discovered a bug in ChartBuilder that caused customers to be logged out inadvertently; this should not happen and will be addressed.
We are developing additional resiliency with our apps to cope with unexpected network responses or network performance degradation.
We are adding additional alerts to ensure the highest performance standards are met.
It is important to us that we keep our products running smoothly, and that you know that you can trust them to work for you every Sunday. We feel confident that these steps will mitigate the risk of any future incident of this nature.
MultiTracks.com | Founder