On April 7, 2022, from 17:46 UTC until 18:29 UTC, some customers of UiPath Automation Cloud™ experienced intermittent access issues across multiple products, as well as failed Robot authentication or connections.
In late March, we observed several instances in which one of our infrastructure microservices was incorrectly restarted due to memory pressure. This problem was not user-visible; however, we decided to push a fix on April 4 to prevent further disruption. The fix was intended to guarantee that the critical service would always receive enough memory.
Unfortunately, a bug in the configuration caused a problem when the product was under high load. This bug was not visible in our test or dogfooding environments and did not occur in production until April 7. At that time, all copies of the microservice ran out of memory and could not be restarted automatically. Because this component is critical, everything that depended on it began to fail.
Within 10 minutes of the first failures, multiple alerts fired across the stack, reporting failures in multiple UiPath products.
Engineers responded to the alerts immediately and quickly identified a high-severity issue. We needed to investigate errors at multiple layers of our architecture, and that investigation determined that the issue was at the cluster level. As a result, we performed a manual failover to the secondary region to mitigate the issue. After the mitigation was in place, we continued to investigate until we fully understood the root cause, and we then reverted the misconfiguration so that the issue could not recur.
We understand how impactful this outage was, and we sincerely apologize. We are continuously taking steps to improve the Automation Cloud™ and our processes to help ensure that such incidents do not occur in the future. In this case, these steps include (but are not limited to):
1. Adding more regions for the Automation Cloud™.
2. Better monitoring and alerting to identify this and similar issues before they become user-visible.
3. Improved debugging tools and techniques to reduce the time to mitigation.
4. Enhanced testing to catch these sorts of load-dependent issues in our test environment.
5. Rolling out the correct fix to the original issue with increased testing and scrutiny.
Kevin Schmidt is the Senior Director, Site Reliability Engineering at UiPath.