Introduction
This is the first post in a blog series that aims to share practical knowledge about evaluating ReAct agents, with a focus on how to measure their performance and compare different experimental setups. Our focus in this article is on defining the concepts surrounding agentic systems for computer use, i.e. systems in which language models interact with software through APIs or user interfaces to accomplish tasks.
Types of agentic systems
These systems are often augmented with retrieval, tools, and memory, enabling them to perceive environmental feedback, execute actions, and iterate based on results. Within this scope, the following architectural patterns can be distinguished, ordered by their level of autonomy:
LLM workflow
- Language models and tools orchestrated through predefined code paths.
- Execution follows a predetermined sequence with fixed transition logic (incl. branching).
- The agent developer imposes constraints rather than allowing autonomous path selection.
- Examples: Standard RAG system - the prompt is always augmented with examples for each query, and the LLM has a single shot to provide an answer
- LLM with a context tool providing information to another LLM - the first LLM decides whether the context is to be augmented with examples, but has a single shot to formulate the retrieval query, while the final LLM takes the context tool output and formulates the final answer in a single shot
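The fixed-path nature of an LLM workflow can be sketched in a few lines. The snippet below is a minimal illustration of the standard RAG case; `retrieve_examples` and `call_llm` are hypothetical stand-ins for a real retriever and model client, not any particular library's API.

```python
def retrieve_examples(query: str) -> list[str]:
    # Placeholder retriever: in practice this would query a vector store.
    return [f"example relevant to {query!r}"]

def call_llm(prompt: str) -> str:
    # Placeholder model call: in practice this would call an LLM API.
    return f"answer based on: {prompt}"

def rag_workflow(query: str) -> str:
    # Predefined code path: always augment the prompt, then answer in a
    # single shot. The model never chooses whether or how to retrieve.
    examples = retrieve_examples(query)
    prompt = "Examples:\n" + "\n".join(examples) + f"\n\nQuestion: {query}"
    return call_llm(prompt)
```

The key property is that the transition logic lives entirely in the developer's code: the model is invoked at fixed points and has no say in the sequence of steps.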

Figure 1: LLM workflow example
Agent
- The language model dynamically directs its own processes and tool usage.
- Operates in a ReAct-style loop, where reasoning and action steps are tightly interleaved.
- The agent reflects on the current state, inspects tool outputs or feedback, and then decides the next action to take.
- The agent developer sets out capabilities, but the agent has autonomy for path selection.
- Example: Agent with a context tool - the agent decides whether its context is to be augmented with examples, and it can refine retrieval queries if it deems the results incomplete.
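In contrast to the workflow above, a ReAct-style loop lets the model choose the next step itself. The sketch below illustrates the loop structure; `choose_action` stands in for an LLM call that returns either a tool invocation or a final answer, and all names are illustrative.

```python
def choose_action(state: list[str]) -> dict:
    # Placeholder policy: a real agent would prompt an LLM with the state.
    # Here it retrieves once, then answers.
    if not any(s.startswith("observation:") for s in state):
        return {"type": "tool", "name": "context_tool", "args": "refined query"}
    return {"type": "final", "answer": "done"}

def context_tool(args: str) -> str:
    # Placeholder context tool providing access to a knowledge source.
    return f"results for {args!r}"

def react_loop(task: str, max_steps: int = 5) -> str:
    state = [f"task: {task}"]
    for _ in range(max_steps):
        action = choose_action(state)                # reasoning step
        if action["type"] == "final":
            return action["answer"]
        observation = context_tool(action["args"])   # acting step
        state.append(f"observation: {observation}")  # feedback for next step
    return "max steps reached"
```

The developer supplies the tools and the loop scaffolding, but the path through the tools (which to call, how often, with what arguments) is decided at runtime.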

Figure 2: Agent example
Multi-agent workflow
- Multiple agents with predetermined interaction patterns (predefined outer workflow).
- Each agent retains local reasoning capabilities (non-deterministic inner loops).
- The agent developer specifies which agents communicate and when handoffs occur.
- Example: Several domain-specific agents receive a query and they independently analyze it in their own loop according to their expertise. A final agent aggregates all agent outputs and makes decisions as to what to show the user.
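The fan-out-then-aggregate pattern from the example can be sketched as follows. The outer orchestration is hard-coded while each agent's inner loop may be non-deterministic; the agent functions here are hypothetical stubs standing in for full ReAct agents.

```python
def legal_agent(query: str) -> str:
    # Stub for a domain-specific agent with its own inner loop.
    return f"legal view on {query!r}"

def finance_agent(query: str) -> str:
    return f"finance view on {query!r}"

def aggregator_agent(findings: list[str]) -> str:
    # A final agent decides what to show the user.
    return " | ".join(findings)

def multi_agent_workflow(query: str) -> str:
    # Predefined interaction pattern: parallel analysis, single handoff
    # to the aggregator. The agents never talk to each other directly.
    findings = [agent(query) for agent in (legal_agent, finance_agent)]
    return aggregator_agent(findings)
```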

Figure 3: Multi-agent workflow example
Decentralized multi-agent system
- Multiple agents whose interactions emerge dynamically at runtime (non-deterministic).
- Agents coordinate through negotiation or shared memory rather than fixed handoffs.
- The designer establishes capabilities and protocols but cannot predict exact sequences.
- Example: Several domain-specific agents receive a query and begin analyzing it, but retain the ability to communicate with each other and negotiate which aspects each will handle, share intermediate findings, and dynamically decide which one needs to contribute based on the emerging understanding of the problem.
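One possible shared-memory coordination scheme is a blackboard: agents inspect a shared board, claim aspects of the problem no one has taken, and post findings for others to build on. The sketch below is illustrative only; all names and the claiming protocol are assumptions, not a reference design.

```python
def blackboard_run(query: str, agents: dict) -> dict:
    # Shared memory visible to all agents.
    board = {"query": query, "claims": {}, "findings": {}}
    # Each agent inspects the board and claims an unclaimed aspect, so the
    # division of labor emerges at runtime rather than being hard-coded.
    for name, (aspect, analyze) in agents.items():
        if aspect not in board["claims"]:
            board["claims"][aspect] = name
            board["findings"][aspect] = analyze(query, board)
    return board

# Illustrative domain-specific agents as (aspect, analysis function) pairs.
agents = {
    "risk_agent": ("risk", lambda q, board: f"risk analysis of {q!r}"),
    "cost_agent": ("cost", lambda q, board: f"cost analysis of {q!r}"),
}
```

Because each agent reads the board state left by the others, the exact sequence of contributions depends on what has already been claimed and found, which the designer cannot fully predict in advance.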

Figure 4: Decentralized multi-agent system example
A closer look at ReAct agents
Agent configuration
Agents are LLM-powered systems that alternate between reasoning and acting, perceiving their environment through tool outputs and taking actions to achieve predefined objectives. To construct a robust framework for evaluating agents, it is essential to first understand their main components. Let us, therefore, dissect the fundamental building blocks that form the core of an agent's configuration.

Figure 5: General Agents components
- System Prompt (General Instructions): Lays out the rules and goals that govern the agent
- User Prompt (Contextual Instructions): Analogous to a user’s query;
- for conversational agents, the user prompt changes form at every turn
- for unattended agents, since there are no multiple turns, it is convenient to represent this as a prompt template, featuring placeholders that get replaced at runtime with task-specific input
- Models: The LLM(s) that power the autonomous decision making of the agent
- Tools: A set of utilities the agent has at its disposal to interact with the environment and achieve its goal. Each tool is associated with a description, which makes the agent aware of what the tool can accomplish, and an input schema, which provides context on how the tool can be invoked. Based on the interface they provide, tools can be of the following types:
- General tools enable direct interaction with external systems and APIs
- Context tools provide access to information and knowledge sources
- Escalation tools create pathways to human decision making and oversight
- Input Schema: Defines the structure of the input provided to the agent; this component mainly applies to unattended agents, where input fields align with the prompt’s placeholder variables; the input schema of an agent enables us to use other agents as tools
- Output Schema: Defines the structure of the agent’s output; while this can be valuable for ensuring consistent responses, it is not always necessary – conversational agents, for instance, typically return natural language answers without fixed schemas
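The components above can be captured in a simple data structure. This is an illustrative sketch, not a specific framework's API; all class and field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Tool:
    name: str
    description: str        # tells the agent what the tool accomplishes
    input_schema: dict      # tells the agent how the tool can be invoked
    fn: Callable[..., str]
    kind: str = "general"   # "general" | "context" | "escalation"

@dataclass
class AgentConfig:
    system_prompt: str                    # general instructions
    user_prompt_template: str             # contextual instructions with placeholders
    model: str                            # identifier of the LLM powering the agent
    tools: list = field(default_factory=list)
    input_schema: dict = field(default_factory=dict)  # aligns with placeholders
    output_schema: Optional[dict] = None  # optional, e.g. for conversational agents

    def render_user_prompt(self, **inputs) -> str:
        # For unattended agents: fill template placeholders at runtime
        # with task-specific input matching the input schema.
        return self.user_prompt_template.format(**inputs)
```

Making the input schema explicit is what allows an agent to be exposed as a tool to another agent: the schema doubles as the tool's input schema.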
Evaluation setup
Before deploying an agent to production, developers must rigorously evaluate its performance and runtime consistency. To facilitate this assessment of an agent’s reliability, we need to consider two additional building blocks:
- Dataset: A collection of datapoints primarily used to evaluate the performance of the agent, but which can also facilitate agent training; a datapoint is defined by multiple properties:
- agent input – the specific input on which the agent needs to execute
- for unattended agents with user prompt templates, the input can be conceptualized as a dictionary with keys matching the known input schema
- for conversational agents, the input is inherently multi-turn, each turn representing a different user query in natural language
- trajectory annotations – the evaluation criteria against which we compare the agent’s actual final output and the trajectory it followed
- Evaluators: A set of functions used to measure the performance of the agent; an evaluator:
- returns a score describing how the agent performed on a specific input
- targets a particular dimension of performance (e.g. tool order, tool count, tool argument validity)
- operates on either the agent’s final output (e.g. substring checks) or the trajectory of actions it followed to reach that result (e.g. tool order)
- can be either deterministic (e.g. equality checks) or non-deterministic (e.g. LLM judge - evaluators that use LLMs to compute the score)
- has a weight associated with it, specifying how much the evaluator contributes to the agent’s overall score (we will revisit this in the following section)
An assignment relation specifies how datapoints are distributed across evaluators, making explicit both sides of the mapping: for each datapoint, which evaluators are responsible for assessing it, and for each evaluator, which subset of datapoints they should review. This structure allows one to encode patterns such as overlapping coverage (multiple evaluators per datapoint) and specialization (certain evaluators only seeing particular categories).
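These building blocks can be sketched together: datapoints with trajectory annotations, weighted evaluators over the final output or the trajectory, an assignment relation, and a weighted overall score. All names and structures below are illustrative assumptions, not a specific evaluation framework.

```python
def substring_evaluator(output, trajectory, expected):
    # Deterministic evaluator on the agent's final output.
    return 1.0 if expected in output else 0.0

def tool_order_evaluator(output, trajectory, expected):
    # Deterministic evaluator on the trajectory of tool calls.
    return 1.0 if trajectory == expected else 0.0

datapoints = {
    "dp1": {
        "input": {"task": "lookup"},  # keys match the agent's input schema
        "annotations": {"expected_output": "answer",
                        "expected_tools": ["context_tool"]},
    },
}

# Each evaluator: (function, weight, annotation key it compares against).
evaluators = {
    "substring": (substring_evaluator, 0.4, "expected_output"),
    "tool_order": (tool_order_evaluator, 0.6, "expected_tools"),
}

# Assignment relation: which evaluators assess which datapoints.
assignment = {"dp1": ["substring", "tool_order"]}

def overall_score(dp_id, output, trajectory):
    # Weighted average over the evaluators assigned to this datapoint.
    total, weight_sum = 0.0, 0.0
    for name in assignment[dp_id]:
        fn, weight, key = evaluators[name]
        expected = datapoints[dp_id]["annotations"][key]
        total += weight * fn(output, trajectory, expected)
        weight_sum += weight
    return total / weight_sum
```

Overlapping coverage corresponds to a datapoint mapping to several evaluators, while specialization corresponds to an evaluator appearing only for certain categories of datapoints in the assignment relation.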
Concluding remarks
This initial blog post establishes the foundational vocabulary for understanding the landscape of agentic systems and designing ReAct agents. We categorized these agentic systems based on their level of autonomy, ranging from the more rigid, predefined workflows to the more dynamic, decentralized multi-agent systems. Crucially, we unpacked the components of a classic ReAct agent and introduced the key building blocks for rigorously evaluating them: datasets and evaluators.
In the following two blog posts, we will explore how to apply evaluators to measure agent performance and how to establish confidence in our measurements when making design choices for experimental setups.