On October 5, 2020, at 12:17 pm UTC, UiPath Automation Cloud began serving a very high rate of errors. Users in the United States (U.S.) and Japan were the first to see these errors, but over time the problem spread to Europe as well. The incident lasted 3 hours and 48 minutes, and at peak 33% of requests returned errors.
The information used by our Azure API Management (APIM) layer is heavily cached. Any time there is a cache miss, the APIM makes a call to the routing service. This call is authenticated using a token, which is also stored in the cache. If the token itself is missing from the cache, it is generated from a secret in Azure Key Vault.
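The lookup path described above can be sketched roughly as follows. This is a minimal illustration, not our actual APIM policy: the function name `get_route`, the cache keys, and the three callables (`fetch_secret`, `mint_token`, `call_routing_service`) are all hypothetical stand-ins.

```python
from typing import Callable, Dict

TOKEN_KEY = "routing-token"  # hypothetical cache key for the auth token


def get_route(
    tenant: str,
    cache: Dict[str, str],
    fetch_secret: Callable[[], str],               # stands in for a Key Vault read
    mint_token: Callable[[str], str],              # builds a token from the secret
    call_routing_service: Callable[[str, str], str],
) -> str:
    """Return routing information for a tenant, consulting the cache first."""
    route_key = f"route:{tenant}"
    if route_key in cache:
        return cache[route_key]  # cache hit: no outbound calls at all

    # Cache miss: the routing service call needs a token.
    token = cache.get(TOKEN_KEY)
    if token is None:
        # Token miss: this is the only path that touches Key Vault.
        token = mint_token(fetch_secret())
        cache[TOKEN_KEY] = token

    route = call_routing_service(tenant, token)
    cache[route_key] = route
    return route
```

With a healthy cache, Key Vault is read once and then not again until the token expires. If every request sees an empty cache, every request takes the Key Vault path.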
Immediately before the outage, Azure placed our eastern U.S. APIM into maintenance. We believe this caused the cache to be briefly unavailable. We have reached out to Microsoft to confirm this.
While the cache was unavailable, every request served by the APIM had to make a call to the Key Vault to fetch the secret and then a call to our backend server to get the routing information. This quickly exceeded Key Vault’s throttling limits, so all calls to Key Vault returned an HTTP 429.
Key Vault’s throttling works such that once it has received more than 3,000 requests for a secret in the last 10 seconds, it returns a 429 for every request. When the cache became available again it was empty, which meant the APIM was still making calls to Key Vault. Unfortunately, Key Vault continued to reject all these requests because it was still receiving too much traffic. As a result, the problem did not auto-heal even after the cache was available again.
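To illustrate why this did not self-heal, here is a toy model of the throttling behavior as described above. It is a sketch under stated assumptions, not Key Vault's actual implementation: the class `SlidingWindowThrottle` and the helper `simulate` are hypothetical, and the key assumption is that rejected requests still count as received.

```python
from collections import deque


class SlidingWindowThrottle:
    """Toy throttle: reject when more than `limit` requests were *received*
    in the last `window` seconds. Rejected requests still count."""

    def __init__(self, limit: int = 3000, window: float = 10.0):
        self.limit = limit
        self.window = window
        self.arrivals = deque()

    def request(self, now: float) -> int:
        # Forget arrivals that have aged out of the window.
        while self.arrivals and self.arrivals[0] <= now - self.window:
            self.arrivals.popleft()
        self.arrivals.append(now)
        return 429 if len(self.arrivals) > self.limit else 200


def simulate(rate_per_sec: int, seconds: int, throttle: SlidingWindowThrottle) -> list:
    """Send evenly spaced requests and return the success count per second."""
    successes = []
    for s in range(seconds):
        ok = 0
        for i in range(rate_per_sec):
            if throttle.request(s + i / rate_per_sec) == 200:
                ok += 1
        successes.append(ok)
    return successes


if __name__ == "__main__":
    # 500 requests/second is 5,000 per 10-second window, above the limit.
    print(simulate(500, 20, SlidingWindowThrottle()))
```

In this model, at 500 requests per second the first 3,000 requests succeed and every later one is rejected, because the window never drains below the limit while traffic keeps arriving. At 200 requests per second, nothing is ever rejected.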
Eventually, the token in the northern Europe cache expired, and the northern Europe APIM’s calls to Key Vault were also blocked. Fortunately, the rest of the routing information in the European cache was never evicted, so many European customers continued to be served without error. But any European customer whose data wasn’t in the cache would have received an error.
This problem was quickly detected by both synthetic testing and log-based alerts. It was also reported by a number of internal and external users.
Our first attempt was to roll back the most recent change: the APIM that was deployed to the eastern U.S. because of the previous outage. This was insufficient because the token had expired in the European region, and the APIM there was unable to fetch a new one from the Key Vault due to throttling.
Our second attempt was to manually add the token to the cache, which would bypass the Key Vault. We were unable to recreate the token correctly.
Our third attempt was to add the secret to a second Key Vault and switch the APIM to use it instead. This was successful: a new token was written into the cache, at which point everything became healthy again.
Unfortunately, these steps took much longer than we would have liked, because the outage was complex and took us some time to fully understand. In retrospect, we realize that we probably could have stopped the outage sooner by blocking all inbound traffic at a higher level for approximately one minute. This would have allowed the Key Vault’s rate limit to reset, and we would expect the next request to have successfully retrieved the secret.
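The effect of such a pause can be sketched with a small model of the 3,000-requests-per-10-seconds rule. This is an illustration only: the function `statuses` is hypothetical, the traffic rate of 500 requests per second is an assumed figure, and the model assumes rejected requests still count toward the window.

```python
def statuses(arrival_times, limit=3000, window=10.0):
    """For each arrival (sorted ascending), return 429 if more than `limit`
    requests, including this one, arrived within the last `window` seconds."""
    out = []
    start = 0  # index of the oldest arrival still inside the window
    for i, t in enumerate(arrival_times):
        while arrival_times[start] <= t - window:
            start += 1
        out.append(429 if (i - start + 1) > limit else 200)
    return out


if __name__ == "__main__":
    rate = 500  # assumed requests per second reaching Key Vault
    before = [s + i / rate for s in range(20) for i in range(rate)]
    # Block all traffic from t=20 to t=80 (about one minute), then resume.
    after = [80.0 + i / rate for i in range(rate)]
    codes = statuses(before + after)
    print("last request before pause:", codes[len(before) - 1])
    print("first request after pause:", codes[len(before)])
```

In this model the last request before the pause is rejected and the first request after it succeeds, since the window is empty by then. In the real system, that first success would repopulate the cached token, so subsequent requests would no longer need Key Vault.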
We are taking the following actions:
- Redeploy our multi-region APIM
- Remove our runtime APIM dependency on Key Vault
- Add more APIM-specific and Key Vault-specific alerts so we can find the root cause of similar problems more quickly
- Update our incident response process to ensure that we post more frequent and more detailed updates to https://status.uipath.com
Back-to-back outages in APIM have been very painful. We are doing everything we can to add a resiliency mechanism to this critical layer as quickly as possible.
Kevin Schmidt is the Director of Site Reliability at UiPath.