On April 1, 2021, from 21:20 to 22:40 UTC, users saw HTTP 500 errors while using Automation Cloud™, UiPath Automation Hub, and all other UiPath cloud services. We served 235,000 errors, which represented about 4% of our total traffic during this time. These errors were served to both Enterprise and Community customers. Users in the Americas were more significantly impacted, with error rates as high as 10%, than users in other regions.
Starting at 21:20 UTC, Microsoft Azure DNS had a major outage. For details, see Microsoft's incident report for April 1, 2021.
UiPath cloud services make extensive use of Azure, so when Azure DNS was unavailable, our traffic could not be routed into Azure or within it.
This problem was quickly detected by both synthetic testing and log-based alerts. It was also reported by a number of internal and external users.
Our on-call engineers immediately responded to the alerts and began to investigate. Once we identified the Azure outage as the root cause, we realized that a full mitigation would not be possible until Azure had fixed its problem. However, we also noticed that the problem was significantly worse in the United States region, so we decided to route all traffic through our European servers. This reduced the error rate while we waited for a Microsoft fix.
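The rerouting decision described above can be illustrated with a minimal sketch: compare per-region error rates over a monitoring window and, if any region is degraded, send traffic to the healthiest one. This is not UiPath's actual tooling. The `RegionStats` type, the `choose_failover_region` function, the 5% threshold, and the request counts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionStats:
    """Request counts for one region over a monitoring window (hypothetical shape)."""
    name: str
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat a region with no traffic as healthy rather than dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def choose_failover_region(stats, threshold=0.05):
    """Return (degraded_regions, target_region).

    degraded_regions: names of regions whose error rate exceeds `threshold`.
    target_region: name of the region with the lowest error rate, or None
    when no region is degraded and no rerouting is needed.
    """
    degraded = [s.name for s in stats if s.error_rate > threshold]
    if not degraded:
        return [], None
    healthiest = min(stats, key=lambda s: s.error_rate)
    return degraded, healthiest.name


# Illustrative numbers only; these are not figures from the incident.
window = [
    RegionStats("us", requests=100_000, errors=10_000),  # 10% errors
    RegionStats("eu", requests=100_000, errors=2_000),   # 2% errors
]
degraded, target = choose_failover_region(window)
print(degraded, target)
```

A rule like this only gives a partial mitigation when the underlying fault affects every region, which matches what we saw: the target region still served some errors, just fewer of them.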
Even after our partial mitigation was in place, we continued to monitor the situation until Azure declared the root cause fixed. At that time we updated status.uipath.com to mark the incident as complete.
UiPath runs primarily on Azure, which limits our ability to respond to major infrastructure-level outages of this kind. That said, we are investigating options to reduce our exposure.
Kevin Schmidt is the Director of Site Reliability at UiPath.