If you’re building AI agents, whether coded or low-code, you know guesswork doesn't scale. You need a reliable, repeatable way to measure performance, catch regressions, and improve quality. That's where evaluation-driven development comes in.
When you design your agent with evaluation built in from the start, you move from “this sometimes works” to “this works consistently.” Evaluation-driven development gives you clarity, control, and confidence, without slowing you down. Here's what that means, how it works under the hood, and why it matters.
At the core of our evaluation framework are evaluators: modular, reusable scoring units that assess how well your agent performs on a task based on defined inputs and expected outputs. You decide what “good” looks like. Evaluators do the grading. Evaluators can be deterministic or LLM-based:
1. Use deterministic evaluators when the output is specific and predictable.
Think booleans (true or false), exact strings, alphanumeric values, or arrays of primitives, checked with exact matches.
Or JSON similarity checks, which compare structure and content between two JSON objects.
2. Use LLM evaluators when the output is fluid or open-ended.
These use an LLM to judge how close the agent’s response is to an expected, acceptable answer.
Not sure which to use? Check out our best practices guide or ask the community.
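To make the distinction concrete, here's a minimal sketch of both evaluator styles in Python. The function names and the `llm` callable are illustrative assumptions, not our product API; in practice, the built-in evaluators handle this grading for you.

```python
def exact_match_evaluator(actual: str, expected: str) -> float:
    """Deterministic: pass/fail on an exact string match."""
    return 1.0 if actual.strip() == expected.strip() else 0.0

def json_similarity_evaluator(actual: dict, expected: dict) -> float:
    """Deterministic: fraction of expected keys whose values match."""
    if not expected:
        return 1.0
    matches = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matches / len(expected)

JUDGE_PROMPT = """Score from 0 to 1 how closely the response matches the expected answer.
Expected: {expected}
Response: {actual}
Return only the number."""

def llm_evaluator(actual: str, expected: str, llm) -> float:
    """LLM-based: ask a judge model to grade an open-ended response.
    `llm` is any callable that takes a prompt and returns text (hypothetical)."""
    reply = llm(JUDGE_PROMPT.format(expected=expected, actual=actual))
    return float(reply.strip())
```

The deterministic functions return the same score every time; the LLM judge trades some consistency for the ability to grade fluid, open-ended answers.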
“Evals are important… Without structured evals, development becomes guesswork. Mastering AI evals means building smarter, faster, and more reliable agents.” —Andrew Ng, founder of DeepLearning.AI
You can't evaluate what you can't observe. That's why agent observability is built into our systems. It tracks every step the agent takes (how it plans, which tools it uses, what outputs it generates) so you can trace where things go right or wrong.
This is how you move from “the agent didn’t work” to “the tool it picked was wrong” or “the summarization step lost key details.”
Here’s an example:
Say you’ve built a research agent. It finds sources, collects content, summarizes results, and iterates if needed. You can:
Evaluate the final output
Evaluate each decision along the way (like whether it picked the right tool at each step)
Measure how often it loops, repeats steps, or escalates unnecessarily
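To illustrate what step-level checks could look like, here's a rough Python sketch for that research agent. It assumes each run produces a trace as a list of steps recording which tool the agent chose; the trace shape, field names, and tool names are all hypothetical.

```python
# Hypothetical trace: one entry per agent step, recording the tool it picked.
trace = [
    {"step": "find_sources", "tool": "web_search"},
    {"step": "collect_content", "tool": "web_search"},  # expected: page_fetch
    {"step": "summarize", "tool": "summarizer"},
]

expected_tools = {
    "find_sources": "web_search",
    "collect_content": "page_fetch",
    "summarize": "summarizer",
}

def tool_choice_score(trace, expected_tools) -> float:
    """Fraction of steps where the agent picked the expected tool."""
    correct = sum(1 for step in trace if expected_tools.get(step["step"]) == step["tool"])
    return correct / len(trace)

def repeated_steps(trace) -> int:
    """How many times the agent repeated a step it had already taken."""
    seen, repeats = set(), 0
    for step in trace:
        if step["step"] in seen:
            repeats += 1
        seen.add(step["step"])
    return repeats

print(tool_choice_score(trace, expected_tools))  # ~0.67: the collect step used the wrong tool
print(repeated_steps(trace))                     # 0: no loops in this trace
```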
You can start by running evaluations with simulated data and mocked tools—it’s fast, cheap, and good for iterating. As you gain confidence, you can switch to real data and get more granular. You can evaluate the full trajectory, individual tool outputs, or both—whatever gives you the best signal.
Evaluations for low-code agents are now available in the cloud. Build your first agent and evaluations in the cloud. Subscribe to our release notifications to be the first to know when simulations and evaluations for coded agents become generally available.
One way to build evaluations is from actual test runs at design time. Take a few runs, curate them, and start tracking how future changes affect performance. But building every eval like this is time-consuming, and running every eval with live data and real tool calls can get expensive, especially if you're testing a lot of variations. That's where simulations come in, and they're often a good place to start.
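As a rough picture of that curation workflow, you might export a handful of accepted runs into an evaluation set you re-run whenever the agent changes. The file layout and field names below are assumptions for illustration, not a prescribed format.

```python
import json

# Curated design-time runs: the input, the output you reviewed and accepted,
# and which evaluator should grade future runs against it.
curated_runs = [
    {
        "input": "Summarize recent work on agent evaluation",
        "expected_output": "A short summary covering evaluators, observability, and simulations.",
        "evaluator": "llm_judge",
    },
    {
        "input": "Return the source list as JSON",
        "expected_output": {"sources": []},
        "evaluator": "json_similarity",
    },
]

with open("eval_set.json", "w") as f:
    json.dump(curated_runs, f, indent=2)
```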
We're adding simulation capabilities that let you run your agent against synthetic data and mocked tools, saving time and cost. You'll be able to test edge cases, rare errors, or long workflows without triggering actual actions or using real inputs. Simulations are especially useful for building larger evaluation datasets, for testing at scale, and early on, when you're setting up your agent before running evaluations with real data.
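To show the core idea (not the product's simulation API), here's a sketch of swapping a real tool for a mock that returns canned data, so a full trajectory can run without side effects or cost. All names here are hypothetical.

```python
def real_web_search(query: str) -> list[str]:
    raise NotImplementedError("Calls a paid search API in production")

def mocked_web_search(query: str) -> list[str]:
    """Mock tool: returns canned results so the agent can run end to end offline."""
    return [f"https://example.com/result-for-{query.replace(' ', '-')}"]

def run_agent(task: str, web_search) -> str:
    """Toy agent loop: call the injected search tool, then summarize the results."""
    sources = web_search(task)
    return f"Summary of {len(sources)} source(s) on: {task}"

# During simulation, inject the mock; swap in the real tool when you move to live data.
print(run_agent("agent evaluation best practices", web_search=mocked_web_search))
```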
You can manage your evaluations at design time and run time, and create a variety of trajectory scenarios to test your agent without calling tools. Because you can test these scenarios more rapidly, you can build more diverse evaluation sets faster.

Structured evaluations make the difference between agents that "sort of" work and agents you can trust. They bring observability, repeatability, and speed to the development process. And they're foundational if you're building anything agentic at scale. Whether you're just getting started or already scaling agent-based apps, our goal is to make this easier, cheaper, and more robust, without slowing you down.
Try it now in Automation Cloud. Join the community and share what you're learning. Let’s build smarter agents together.
Product Manager, UiPath