27 April 2021

Postmortem - Automation Cloud Outage on April 21, 2021

Customer impact

On April 21, 2021, from 1:26 am UTC until 2:56 am UTC, customers accessing Automation Cloud™, UiPath Automation Hub, and all other UiPath cloud services saw a very high rate of errors. At peak, our authentication service returned errors for up to 33% of requests. Because some scenarios require multiple calls to the authentication service, the impact on customers was even higher than that figure suggests. Robot authentications also failed during this window.
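
To illustrate why multiple calls amplify the impact, here is a minimal sketch of the arithmetic. It assumes each call fails independently at the peak rate and that nothing is retried; neither assumption is stated in this postmortem, and the function name is hypothetical.

```python
def scenario_failure_rate(per_call_error_rate: float, calls: int) -> float:
    """Chance that a scenario fails when it needs `calls` successful calls.

    Assumption (not from the postmortem): calls fail independently and
    there are no retries, so the scenario succeeds only if every call does.
    """
    if not 0.0 <= per_call_error_rate <= 1.0:
        raise ValueError("per_call_error_rate must be between 0 and 1")
    if calls < 1:
        raise ValueError("calls must be at least 1")
    return 1.0 - (1.0 - per_call_error_rate) ** calls


if __name__ == "__main__":
    # 0.33 is the peak error rate quoted above; the call counts are illustrative.
    for calls in (1, 2, 3):
        print(calls, round(scenario_failure_rate(0.33, calls), 3))
```

Under these assumptions, a scenario that needs two authentication calls would fail about 55% of the time at peak, and one that needs three calls about 70% of the time.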

 

Root cause

We run multiple copies of each of our services. Due to a rare condition, one instance of the service that handles all authentication requests, for both humans and robots, exhausted its SQL connection pool and could no longer communicate with its database. All the other copies continued to work correctly. Typically in this circumstance, we would expect the broken instance to restart automatically while the other copies of the service handle all the requests. Unfortunately, the restart did not happen as expected, so the broken instance remained broken.
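
One common way to get the automatic restart described above is a liveness check that exercises the database itself, so that an instance which cannot reach its database reports itself unhealthy. The sketch below is illustrative only and is not UiPath's implementation: the class name, the `ping` callable, and the failure threshold are all hypothetical.

```python
from typing import Callable


class DatabaseLivenessCheck:
    """Reports unhealthy after several consecutive failed database pings."""

    def __init__(self, ping: Callable[[], None], failure_threshold: int = 3) -> None:
        # `ping` is a hypothetical callable that runs a trivial query
        # (for example "SELECT 1") and raises if the database is unreachable.
        self._ping = ping
        self._failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def probe(self) -> int:
        """Return an HTTP-style status: 200 if healthy, 503 if not."""
        try:
            self._ping()
        except Exception:
            self._consecutive_failures += 1
        else:
            # Any successful ping resets the failure streak.
            self._consecutive_failures = 0
        if self._consecutive_failures >= self._failure_threshold:
            return 503
        return 200
```

An orchestrator that polls `probe()` through a health endpoint could restart the instance once it starts returning 503. Requiring several consecutive failures is a design choice that avoids restarting on a single transient error.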

 

Detection

Within 10 minutes of the first failures, multiple alerts fired across the stack, reporting failures in multiple UiPath products.

 

Response

While engineers across multiple services engaged quickly to investigate, diagnosing the root issue took longer than we normally expect. The delay came from having to investigate errors at multiple layers of our architecture before we could determine that the root cause was in the authentication service. Once this was confirmed, we quickly identified the pods on which SQL connection failures were happening and mitigated the issue by restarting those pods to re-establish the SQL connections.
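
Identifying the affected pods amounts to grouping failures by where they occurred. As a rough sketch, assuming error records are available as dictionaries with hypothetical `pod` and `error` fields (the actual log format is not described here):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pods_with_sql_failures(
    records: Iterable[dict], min_failures: int = 1
) -> List[Tuple[str, int]]:
    """Count SQL connection failures per pod, most affected pod first.

    Assumption: each record has a "pod" name and an "error" kind, and
    SQL connection failures are tagged "sql_connection". Both field
    names and the tag are hypothetical.
    """
    counts = Counter(
        record["pod"]
        for record in records
        if record.get("error") == "sql_connection"
    )
    return [(pod, n) for pod, n in counts.most_common() if n >= min_failures]
```

Pods that appear in the result are candidates for a restart; pods absent from it logged no SQL connection failures in the sampled records.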

 

Follow up

We understand how impactful this outage was and apologize for the inconvenience caused. We are continuously taking steps to improve the UiPath cloud platform and our processes to help ensure such incidents do not occur in the future. In this case, these steps include (but are not limited to):

  • Add auto-healing that handles SQL connection pool exhaustion at a per-pod level.
  • Prevent resource exhaustion by raising early signals as resources approach unhealthy thresholds.
  • Improve targeted detection at the source of failures.
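
As an illustration of the early-signal item above, a per-pod check could classify connection pool usage before it reaches exhaustion. This is a minimal sketch under stated assumptions: the function name, the status labels, and the 80% warning ratio are hypothetical and not taken from UiPath's systems.

```python
def pool_status(in_use: int, max_size: int, warn_ratio: float = 0.8) -> str:
    """Classify connection pool usage as "ok", "warning", or "exhausted".

    Assumption: `in_use` and `max_size` come from whatever pool metrics
    the service exposes; the 0.8 default threshold is illustrative.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    if in_use >= max_size:
        return "exhausted"  # no free connections: candidate for auto-heal
    if in_use / max_size >= warn_ratio:
        return "warning"    # approaching the limit: raise an early signal
    return "ok"
```

A "warning" result could feed an alert, while "exhausted" could trigger the per-pod auto-heal described in the first item.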

 

Kevin Schmidt is the Director of Site Reliability at UiPath.


by Kevin Schmidt

