11/29/2021 | News release | Distributed by Public on 11/29/2021 09:04
We all know that the move to cloud infrastructure and cloud-hosted services has been increasing rapidly for a long time, and the pandemic has only accelerated that growth. Datto's SaaS Protection service protects SaaS offerings such as Microsoft 365 and Google Workspace (formerly G Suite). Rapid growth often brings significant change and, occasionally, instability. Even for companies like Microsoft and Google that operate at hyper-scale, this accelerated growth can create issues.
Datto specializes in data protection, which means quality, reliability, and resilience are at the forefront of our engineering priorities. The past year has tested all of us in these key areas. We're an agile organization that is continually adapting and making incremental improvements. In this blog post, I'll highlight some of the recent challenges we've encountered, the actions we've taken, and our plans for the future to provide best-in-class service to protect SaaS workloads.
While we support both Microsoft 365 and Google Workspace, I'm going to focus on Microsoft 365 (M365) specifically because it represents an increasing majority of our customer base. Many of these same topics also apply to Google Workspace.
The Microsoft APIs are going through a transition from service-specific legacy APIs (e.g., Exchange Online, SharePoint, OneDrive) to the Graph API. We've been rapidly migrating to Graph, and for some time now all of our new development has been done in Graph. We are aware of all service-specific API retirement dates and are staying well ahead of every one of them. We're accelerating the transition on an as-needed basis since some service-specific legacy APIs have hard end-of-life (EOL) dates. There are two API situations that I'd like to highlight:
Like many cloud service providers, Microsoft implements throttling on API calls in order to ensure a high quality of service. During peak times, Microsoft will prioritize certain API calls over others. For example, a user request (e.g., fetching a message via an end-user client) will be prioritized over a third-party application request, like one from Datto SaaS Protection. When our API calls are throttled, we receive a specific error code, typically "429 Too Many Requests". An advantage that we have as both the developer and the operator of the service, as opposed to an Independent Software Vendor (ISV) that licenses its software to a third-party provider, is that we have access to a mountain of telemetry data. We've made significant investments in analyzing this data with the express purpose of making our service more performant and reliable. A tangible example of this is a change that we made earlier this year to strategically schedule backups at the times of day when we see the fewest throttling errors. This has measurably increased our overall backup success rate.
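To make the two ideas in this section concrete, here is a minimal sketch in Python. It is illustrative only and is not our production code. The function names, the `(hour, status)` telemetry shape, and the backoff defaults are all assumptions made for this example. The first helper decides how long to wait after a 429 response, honoring a `Retry-After` header when one is present. The second ranks hours of the day by observed throttle rate (429 responses divided by total calls), a slight variation on counting raw errors that avoids favoring hours that simply had little traffic.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Optional, Tuple


def retry_delay_seconds(status: int, headers: Mapping[str, str], attempt: int,
                        base: float = 1.0, cap: float = 300.0) -> Optional[float]:
    """Return seconds to wait before retrying, or None if the call was not throttled.

    Hypothetical helper. `headers` is assumed to use the exact key "Retry-After";
    a real HTTP client may expose headers case-insensitively.
    """
    if status != 429:
        return None
    retry_after = headers.get("Retry-After", "").strip()
    if retry_after.isdigit():
        # The server told us how long to wait, so use that value as-is.
        return float(retry_after)
    # No usable hint: fall back to capped exponential backoff (assumed policy).
    return min(base * (2 ** attempt), cap)


def quietest_hours(events: Iterable[Tuple[int, int]], k: int = 3) -> List[int]:
    """Rank observed hours by throttle rate and return the k lowest.

    `events` is assumed to be (hour_of_day, http_status) pairs from telemetry.
    Hours with no observations are excluded rather than treated as quiet.
    """
    total: Counter = Counter()
    throttled: Counter = Counter()
    for hour, status in events:
        total[hour] += 1
        if status == 429:
            throttled[hour] += 1
    ranked = sorted(total, key=lambda h: (throttled[h] / total[h], h))
    return ranked[:k]
```

A scheduler could call `quietest_hours` on recent telemetry to choose backup windows, while each API call uses `retry_delay_seconds` to back off when throttling does occur.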
Many cloud services that are going through hyper-growth similar to what M365 has been experiencing, especially over the past 18 months, must make changes to keep pace with demand. That involves not only adding new features, but also adding infrastructure to support the demand. There have been two events in the past year that have had a significant impact on our service.
The first such event happened on March 15th, when there was a global authentication outage. This was a very public event and impacted end users and service providers alike. For us, it caused a major influx of Datto SaaS Protection support requests, which more than doubled our expected support volume for the month of March. Attending to all of the support requests took time and created a significant backlog. We've since made process and staffing changes to be able to handle such an event, should it happen again.
At virtually the same time the authentication issue occurred, we also saw errors on one of our peering links in the US region. Microsoft offers a peering service that provides lower latency and higher reliability for traffic to and from Microsoft services such as M365 and Azure. We invest in these links when we reach a certain level of scale in a given region. The use of peering links is mutually beneficial to Datto and our partners. In the US region, we have multiple peering links to our data centers, and only one of them was exhibiting errors, which made it harder to diagnose the exact problem. The biggest challenge in identifying the root cause of the peering link errors was that we were simply getting Microsoft API errors returned to us, and they looked just like the errors we received during the authentication outage. Due to the timing, these problems blended together to create a perfect storm and prompted a surge in support tickets.
The last Microsoft event of note happened on successive days in May. Our monitoring and alerting systems quickly identified a problem because our KPIs began to drop rapidly. After investigation, we identified the root cause as a TLS negotiation failure. Microsoft was applying rolling updates to its accepted TLS versions and ciphers. We were already using TLS v1.2. However, Microsoft had hand-selected a specific set of ciphers within TLS v1.2 and would accept only those. Because this change was not well communicated, we opened a production-down ticket with Microsoft. I can only assume other vendors did the same, because very soon thereafter Microsoft halted the rolling updates and reverted the changes (a very rare thing indeed). Unfortunately, Microsoft began rolling out these updates again just 24 hours later. We were in the process of testing the cipher changes, but we had to quickly pivot to deploying the TLS changes to our fleet. This provided an opportunity to partner with Microsoft on an appropriate communication scheme to properly warn us of these changes in the future. Believe it or not, even our premium support contacts were unaware of this maintenance.
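For readers less familiar with how a client ends up on the wrong side of a cipher change, the sketch below shows one way a Python client can pin its minimum TLS version and restrict the TLS v1.2 cipher suites it offers, using the standard library `ssl` module. The cipher string here is a placeholder chosen for illustration and is an assumption, not the list Microsoft enforced, and the helper names are hypothetical. If the server accepts only suites outside the client's offered set, the handshake fails in much the way described above.

```python
import ssl
from typing import List

# Placeholder OpenSSL cipher string for illustration only.
# This is an assumption, NOT the set of ciphers Microsoft accepts.
ASSUMED_TLS12_CIPHERS = "ECDHE+AESGCM"


def build_client_context(ciphers: str = ASSUMED_TLS12_CIPHERS) -> ssl.SSLContext:
    """Build a client-side context that refuses anything older than TLS v1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # set_ciphers() governs TLS v1.2 and earlier suites; TLS v1.3 suites
    # are configured separately by OpenSSL and are not changed by this call.
    ctx.set_ciphers(ciphers)
    return ctx


def offered_cipher_names(ctx: ssl.SSLContext) -> List[str]:
    """List the cipher suite names this context is prepared to offer."""
    return [cipher["name"] for cipher in ctx.get_ciphers()]
```

Printing `offered_cipher_names(build_client_context())` in a test environment is a quick way to compare what a client offers against whatever list a provider publishes, before the provider's change reaches production.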
Given how important Microsoft is to us and our customers, we are making significant investments in that relationship. There are a few things to highlight in particular:
We have just scratched the surface of the stories and technical challenges that we have encountered. Check back for future blog posts focusing on different aspects of Datto SaaS Protection.