On September 23, 2020, at 1:45 pm UTC, UiPath Automation Cloud was unavailable for 40 minutes. During this time, all user requests received errors. Though the problem has been completely resolved, we want to give our users a better understanding of what happened and how UiPath handles these sorts of incidents.
UiPath Automation Cloud uses Microsoft Azure’s API Management. It serves as the gateway for all traffic to http://cloud.uipath.com and determines which server each request should be sent to. The possible destinations are the UiPath Portal, a Community Orchestrator, or one of the Enterprise Orchestrators, which are deployed in Europe, the United States (U.S.), and Japan.
Azure offers a service level agreement (SLA) of 99.95% when API Management is deployed to a single region and 99.99% when it is deployed across two or more regions. At the time of the outage, our API Management service was deployed to only a single region, creating a single point of failure.
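To put those two SLA figures in perspective, the short sketch below converts an availability percentage into the downtime it permits. The function name and the 30-day period are assumptions chosen for illustration; the actual SLA terms define their own measurement period.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime an availability target permits over `days` days.

    The 30-day default is an illustrative assumption, not the SLA's
    official measurement period.
    """
    return (1 - sla_percent / 100) * days * MINUTES_PER_DAY


for sla in (99.95, 99.99):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% availability allows about {minutes:.1f} minutes per 30 days")
```

Under these assumptions, 99.95% works out to roughly 21.6 minutes per 30 days and 99.99% to roughly 4.3 minutes, so a 40-minute outage exceeds either budget.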
Azure upgraded our API Management service to a new version. As part of that upgrade they took two 15-minute windows of downtime in quick succession. This was the root cause of our outage.
Automation Cloud has three different types of monitoring:
- Synthetic tests
- Metrics provided by Azure
- Metrics and logs emitted by our own servers
In this case, the synthetic tests caught the problem first, creating an alert less than one minute after the maintenance window started. These tests run continuously and simulate common user activity to ensure that UiPath is available from the public internet. The monitoring based on Azure metrics also caught the issue a few minutes later because of the large increase in HTTP errors being returned.
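As a rough illustration of how a synthetic test can work, the sketch below probes a URL and reports that an alert should fire after consecutive failures. It is a minimal example and not UiPath's actual tooling: the function names, the failure threshold, and the number of attempts are all assumptions.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar network errors.
        return 0


def run_probe(
    url: str,
    fetch: Callable[[str], int] = http_status,
    failures_before_alert: int = 2,  # hypothetical threshold
    attempts: int = 3,               # hypothetical number of checks per run
) -> bool:
    """Return True if the probe saw enough consecutive failures to alert."""
    consecutive = 0
    for _ in range(attempts):
        status = fetch(url)
        if 200 <= status < 400:
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= failures_before_alert:
                return True
    return False


# Demonstration with a stub fetcher that simulates a gateway returning 503,
# so the example runs without touching the network.
should_alert = run_probe("http://cloud.uipath.com", fetch=lambda _url: 503)
print("alert" if should_alert else "healthy")
```

Requiring more than one consecutive failure is a common way to avoid paging on a single dropped request, at the cost of a slightly slower alert.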
At any given time, UiPath has a team of on-call engineers, known as directly responsible individuals (DRIs), ready to respond to any issues in our cloud products. We have a global Site Reliability Engineering (SRE) team located in both Bellevue, Washington, U.S. and Bangalore, India. This gives us follow-the-sun SRE on-call coverage, so we can respond quickly without needing to wake somebody up. We also have a rotation within each development team so that there is always someone available to provide deeper expertise on their product. Finally, we have a global support organization with engineers all around the world to help troubleshoot our customers’ issues and answer their questions.
In this case, it was a major Severity 0 issue that affected all the services that make up Automation Cloud. As a result, both the SRE on call and the DRIs from the relevant development teams were paged. Within four minutes of the start of the incident we were collaborating on Slack, and within eight minutes we had created a Zoom bridge for better communication.
We quickly updated http://status.uipath.com and notified our support organization so they could answer any calls they received. Shortly after, we recognized that the outage was due to Azure’s maintenance and that a manual restore into a new API Management resource would take longer than simply waiting for the maintenance window to end. We sent out updates with this information and then monitored to ensure that the maintenance ended on time and that our service was completely healthy after the window was over.
To reduce the chance and impact of a similar outage, we are taking the following actions:
- Immediate – Deploy our API Management to a second region so we have an SLA of 99.99% rather than 99.95%.
- Immediate – File a ticket with Microsoft to understand whether a second region would have protected us from this maintenance and to ask about any other recommended best practices.
- Next Sprint – Build a tool to consume Microsoft notifications about planned maintenance. Currently this is done ad hoc, and we missed the notification for this window. The goal is to learn about maintenance early enough that we can either take action to avoid an outage or at least announce it to our customers in advance.
- Next Sprint – Complete a review of all our critical components and double-check that each is deployed to be as resilient to outages as possible.
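The core of the planned-maintenance tool mentioned above could look something like the following sketch. It assumes notifications have already been fetched and parsed into simple records; the `MaintenanceNotice` type, its fields, the seven-day lookahead, and the sample data are all hypothetical and do not reflect any real Microsoft notification format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MaintenanceNotice:
    """Hypothetical, already-parsed planned-maintenance notification."""
    service: str
    starts_at: datetime
    duration: timedelta


def upcoming_notices(notices, critical_services, now, lookahead=timedelta(days=7)):
    """Return notices for critical services that start within the lookahead window."""
    horizon = now + lookahead
    relevant = [
        n for n in notices
        if n.service in critical_services and now <= n.starts_at <= horizon
    ]
    # Soonest maintenance first, so the most urgent item is at the top.
    return sorted(relevant, key=lambda n: n.starts_at)


# Illustrative sample data only.
now = datetime(2020, 9, 20, tzinfo=timezone.utc)
notices = [
    MaintenanceNotice("API Management", datetime(2020, 9, 23, 13, 45, tzinfo=timezone.utc), timedelta(minutes=15)),
    MaintenanceNotice("Some Other Service", datetime(2020, 9, 22, tzinfo=timezone.utc), timedelta(hours=1)),
    MaintenanceNotice("API Management", datetime(2020, 11, 1, tzinfo=timezone.utc), timedelta(minutes=15)),
]

for notice in upcoming_notices(notices, {"API Management"}, now):
    print(f"{notice.service}: maintenance at {notice.starts_at.isoformat()} for {notice.duration}")
```

In a real tool, the output of a filter like this would feed an alert to the on-call team and a draft announcement for the status page.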
We recognize the significant impact this outage had on customers that rely on UiPath Automation Cloud and sincerely apologize for it.
Kevin Schmidt is the Director of Site Reliability at UiPath.