The Definitive Guide to AI Agent Evaluation (2026)
Six dimensions, four trajectory scores, and the CI gate structure that separates production-ready agents from demo-ready ones
Most agent evaluation setups answer the wrong question.
They ask: did the agent produce the right output? What they need to ask is: did the agent take the right path to get there?
The difference matters because a 95% per-step success rate across eight steps yields approximately 66% end-to-end task completion. The math is unforgiving. And the only way to see where those failures are happening is to evaluate the trajectory, not just the response.
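The arithmetic behind that figure is short enough to check directly. This is a minimal sketch that assumes every step succeeds or fails independently with the same probability; the function name is illustrative, not part of any library.

```python
def end_to_end_success(per_step_rate: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_rate ** steps


# How quickly a 95% per-step rate erodes as trajectories get longer.
for steps in (1, 4, 8, 16):
    print(steps, round(end_to_end_success(0.95, steps), 3))
```

Real trajectories rarely have uniform or independent step reliability, so treat this as an estimate of how compounding behaves rather than a prediction for any specific agent.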
This guide covers the full evaluation framework: the six dimensions that need separate scoring, the four-part trajectory score, the CI gate structure that catches regressions before users do, and the production observability loop that keeps your eval set from going stale.
Why “An Agent Is Not a Model”
Language model evaluation is straightforward in principle. You have an input and an output. You measure output quality.
Agent evaluation requires something different. An agent is a sequence of decisions: which tool to call, what arguments to pass, what to do with the result, how to recover from failures, whether the overall plan is coherent, and whether the task was completed. Each step in that sequence can fail independently. Failures compound. A wrong argument at step two propagates silently through steps three through eight.
Evaluating only the final output misses every intermediate failure. And intermediate failures are where most production incidents originate.
The unit of evaluation for an agent is the trajectory: the complete ordered sequence of system prompt, reasoning steps, tool calls with arguments and return values, and intermediate state at each step.
The trace is the truth.
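One way to make that unit of evaluation concrete is a small record type. The sketch below is an assumed shape for illustration only; the class and field names (Trajectory, Step, ToolCall) are hypothetical and do not correspond to any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any = None            # payload the tool actually returned
    error: Optional[str] = None   # populated when the call failed


@dataclass
class Step:
    reasoning: str
    tool_call: Optional[ToolCall] = None              # None for pure reasoning steps
    state: dict[str, Any] = field(default_factory=dict)  # intermediate state snapshot


@dataclass
class Trajectory:
    system_prompt: str
    steps: list[Step] = field(default_factory=list)
    final_response: str = ""

    def tool_calls(self) -> list[ToolCall]:
        """Ordered tool calls, skipping steps that made no call."""
        return [s.tool_call for s in self.steps if s.tool_call is not None]
```

Keeping arguments, return values, and errors on the same record is what lets each dimension below be scored from one trace instead of from the final response alone.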
The Six Evaluation Dimensions
Agent failures cluster into six independent categories. Strong performance on one does not compensate for failure on another. Aggregate scoring across dimensions hides the failure modes that cause incidents.
1. Tool Selection
What it measures: Does the agent call the right tool? Does it correctly refrain from calling any tool when none is needed?
Failure modes:
Wrong tool selected from available set — often happens when tool descriptions are semantically similar and the agent pattern-matches on surface-level keywords rather than intended function
Tool call made when the task required no tool — agents trained heavily on tool-use examples over-index on calling tools even when the answer is available from context
Tool invoked that does not exist in the schema — fabricated endpoint calls happen most often when agents are prompted with examples from a different deployment or when the system prompt is ambiguous about what tools are available
Partial tool name matches: the agent calls get_user when the schema defines get_user_profile, causing a silent failure that may not surface until result utilization time
The irrelevance bucket (cases where the correct behavior is to call no tool at all) is where most teams fail to test. Agents that always reach for a tool perform poorly on tasks that require synthesis or reasoning from context alone. An eval set without irrelevant cases gives a systematically optimistic picture of tool selection accuracy.
Eval target: Eval set includes cases requiring single tool calls, cases requiring multiple sequential tool calls, and cases requiring no tool call. Scores reported separately per category so over-calling and under-calling failures don’t cancel each other out.
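Reporting per category can be sketched in a few lines. This example assumes each eval case is a dict with hypothetical keys category, expected_tools, and called_tools, and it uses exact ordered match as the correctness rule, which is stricter than an F1-style score.

```python
from collections import defaultdict


def tool_selection_by_category(cases):
    """Exact-match tool selection accuracy, reported separately per category."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for case in cases:
        category = case["category"]
        totals[category] += 1
        # An empty expected list means the right behavior is to call nothing.
        if list(case["called_tools"]) == list(case["expected_tools"]):
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}


cases = [
    {"category": "single_tool", "expected_tools": ["get_user_profile"], "called_tools": ["get_user_profile"]},
    {"category": "single_tool", "expected_tools": ["get_user_profile"], "called_tools": ["get_user"]},
    {"category": "multi_tool", "expected_tools": ["search", "fetch"], "called_tools": ["search", "fetch"]},
    {"category": "no_tool", "expected_tools": [], "called_tools": ["search"]},
    {"category": "no_tool", "expected_tools": [], "called_tools": []},
]
print(tool_selection_by_category(cases))
```

A single blended accuracy over these five cases would be 0.6, which hides the fact that the agent over-calls on half of the no-tool cases.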
2. Argument Extraction
What it measures: When the agent calls the right tool, does it pass correct arguments?
Failure modes:
Type mismatches — passing a string where an integer is expected, or an object where a flat string ID is required. These often pass schema validation if the validator is loose and only surface at runtime
Missing required fields — the agent omits arguments it doesn’t believe are necessary, even when they are marked required in the schema. Common when the agent has seen the tool succeed in simpler contexts
Semantically wrong values — correct format, wrong content. The agent passes a valid date string, but it’s yesterday’s date instead of today’s because it defaulted to a prior turn’s context
Edge-case failures — timezone omissions on datetime fields, truncated IDs when a user message is long, currency code assumptions on financial tools, empty string passed instead of null
Argument hallucination — the agent invents a value for an argument it couldn’t extract from context rather than requesting clarification or leaving the field empty
Argument extraction failures are particularly damaging because they are silent. The tool call is made. The schema validation may pass. The value is wrong. Downstream steps execute against bad data for the remainder of the trajectory.
Eval target: Eval cases cover edge cases for every argument type in the schema, including optional fields that agents frequently mishandle. Score is separated from tool selection so argument errors don’t disappear into a combined metric.
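The structural half of argument checking (types, required fields, empty strings) can be done mechanically. The sketch below assumes a simplified, hypothetical schema format that maps each field name to a Python type and a required flag; a real deployment would more likely validate against JSON Schema.

```python
def validate_arguments(arguments, schema):
    """Return a list of structural problems found in a tool call's arguments."""
    errors = []
    for name, spec in schema.items():
        if name not in arguments or arguments[name] is None:
            if spec.get("required", False):
                errors.append(f"missing required field: {name}")
            continue
        value = arguments[name]
        expected = spec["type"]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        type_ok = isinstance(value, expected) and not (
            isinstance(value, bool) and expected is not bool
        )
        if not type_ok:
            errors.append(
                f"type mismatch for {name}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
        elif expected is str and value == "":
            errors.append(f"empty string for {name}")
    for name in arguments:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors


schema = {
    "user_id": {"type": int, "required": True},
    "note": {"type": str, "required": False},
}
print(validate_arguments({"user_id": "42", "note": ""}, schema))
```

A check like this cannot catch semantically wrong values such as yesterday's date in a valid format; those need a reference answer per eval case, which is why the two scores are worth keeping separate.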
3. Result Utilization
What it measures: Does the agent use the actual payload returned by the tool, or does it substitute its own knowledge?
Failure modes:
Numerical transcription errors: the tool returns {"balance": 10432.87} and the agent says “your balance is approximately $10,400.” Close enough to sound right, wrong enough to matter in financial or medical contexts
Model knowledge substituted for tool return: the agent “knows” the capital of France, so when a geography tool returns a result the agent doesn’t trust, it answers from training data instead. The tool result is ignored
Drift from tool output during multi-turn conversations — the agent correctly uses a tool result in turn three, but by turn seven it has reverted to describing the resource as it existed in its training data rather than as the tool reported it
Selective extraction — the agent uses part of the tool payload and silently drops fields that would have changed the response, either because the payload was large or because the dropped fields contradicted an earlier assumption
Result utilization failures are among the most consequential and the hardest to catch without trace inspection. An agent that ignores a 404 and synthesizes an answer from training data looks correct on output-only evaluation. It is confidently wrong.
Eval target: Evaluator checks that every factual claim in the agent response can be attributed to a specific field in the tool return payload for that trace. Responses grounded in model knowledge rather than tool data fail the grounding check regardless of whether the content happens to be correct.
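A narrow version of the grounding check, restricted to numeric claims, can be sketched without any model in the loop. This is a deliberately simple illustration: it assumes a claim is any number in the response text and counts it as grounded only if the same value appears somewhere in a tool payload. Function names are hypothetical, and a production evaluator would need to handle non-numeric claims as well.

```python
import re

NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def payload_numbers(payload):
    """Collect every numeric leaf value in a nested tool payload."""
    found = set()
    if isinstance(payload, bool):
        return found  # True/False are not numeric claims
    if isinstance(payload, (int, float)):
        found.add(float(payload))
    elif isinstance(payload, dict):
        for value in payload.values():
            found |= payload_numbers(value)
    elif isinstance(payload, list):
        for value in payload:
            found |= payload_numbers(value)
    return found


def numeric_grounding_score(response, payloads):
    """Fraction of numbers in the response that match a tool-returned value."""
    allowed = set()
    for payload in payloads:
        allowed |= payload_numbers(payload)
    claims = [float(match.replace(",", "")) for match in NUMBER.findall(response)]
    if not claims:
        return 1.0  # assumption: no numeric claims means nothing to contradict
    return sum(1 for claim in claims if claim in allowed) / len(claims)


payloads = [{"balance": 10432.87}]
print(numeric_grounding_score("Your balance is approximately $10,400.", payloads))
print(numeric_grounding_score("Your balance is $10,432.87.", payloads))
```

The rounded answer scores zero even though it sounds right, which is the behavior the grounding check is meant to enforce.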
4. Error Recovery
What it measures: When a tool call fails (timeout, API error, malformed return), what does the agent do?
Failure modes:
Hallucinating success — the agent receives a 500 error and proceeds as if the call succeeded, fabricating a plausible-looking result. This is the most dangerous failure mode because the agent’s subsequent responses look coherent
Silent failure — the agent stops making progress but doesn’t surface the failure to the user or calling system. The task times out with no explanation
Retry storm — the agent retries the same failing call immediately and repeatedly, without backoff or a retry limit. In production this amplifies the original API failure into a latency and cost incident
Context loss on recovery — the agent retries with correct backoff but loses the context from prior turns, effectively restarting the task rather than resuming it
Premature escalation: the agent escalates on a recoverable error (such as a transient timeout) without first attempting the retry that would have resolved it
Recovery patterns to eval:
Retry with exponential backoff
Fallback to alternate tool
Clarification request to user
Graceful escalation with full context preserved
Error recovery is almost never tested in development environments because mocks return clean responses. It is one of the top three failure modes we see across production agent deployments.
Eval target: Eval set includes API timeouts, 4xx errors, 5xx errors, and malformed return payloads for every tool in the schema. Recovery behavior — not just whether the agent eventually completes the task — is scored explicitly.
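The first and last recovery patterns listed above, retry with exponential backoff and escalation with context preserved, fit in one small wrapper. This is a sketch under stated assumptions: the exception classes and the call_with_backoff name are hypothetical, and the sleep function is injectable so that an eval harness can record delays instead of waiting.

```python
import time


class TransientToolError(Exception):
    """Raised by a tool for recoverable failures such as timeouts or 5xx."""


class EscalationRequired(Exception):
    """Raised when retries are exhausted and a human or caller must step in."""


def call_with_backoff(tool, arguments, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call a tool, retrying transient failures with exponential backoff."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return tool(**arguments)
        except TransientToolError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    # Preserve the context needed to resume rather than restart the task.
    raise EscalationRequired(
        f"{max_attempts} attempts failed for arguments {arguments!r}; "
        f"last error: {last_error}"
    )


# Demo: a tool that times out twice, then succeeds. Delays are recorded, not slept.
attempts = {"count": 0}
delays = []


def flaky_lookup(user_id):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientToolError("timeout")
    return {"status": "ok", "user_id": user_id}


print(call_with_backoff(flaky_lookup, {"user_id": 7}, sleep=delays.append), delays)
```

The bounded attempt count is what prevents a retry storm, and an eval can assert on the recorded delays to score the recovery behavior itself rather than only the final outcome.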
5. Plan Coherence
What it measures: Is the multi-step trajectory coherent? Does it reach the goal efficiently without loops, dead-ends, or unnecessary depth?
Failure modes:
Loops — the agent revisits a step already completed, either because it lost track of prior state or because it is trying to verify a result it already has. Loops burn tokens and latency, and frequently end in task failure when the context window fills
Dead-ends — the agent reaches a state where no path forward is available and halts, without escalating or surfacing the blockage. The task silently fails
Excessive depth — the agent takes twelve steps where four would suffice, breaking a simple lookup into a chain of unnecessary sub-calls. Each additional step adds error probability
Premature termination — the agent declares the task complete before it is, usually because an intermediate step returned a success status even though the actual goal was not achieved
Plan drift — the agent starts with a coherent plan and drifts from it mid-trajectory, responding to intermediate results by improvising rather than adapting the original plan deliberately
Long, flat trajectories are where compound-error pain concentrates. An agent that solves a problem in fifteen steps when five would have worked is not “fine.” It is burning tokens, latency, and error probability on unnecessary steps.
Eval target: Trajectory length is tracked alongside correctness. Anomalously long trajectories for a given task class are flagged for review, not just cases where the agent failed outright.
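Two of these checks are cheap to run on every trace: spotting repeated identical tool calls and flagging trajectories that run far past the typical length for their task class. The sketch below assumes tool calls arrive as (name, arguments) pairs and that a typical step count per task class is already known; the 2x tolerance is an arbitrary illustrative default.

```python
import json
from collections import Counter


def repeated_calls(tool_calls):
    """Return (tool name, serialized arguments) pairs that occur more than once."""
    keys = [(name, json.dumps(args, sort_keys=True)) for name, args in tool_calls]
    return [key for key, count in Counter(keys).items() if count > 1]


def is_anomalously_long(step_count, typical_steps, tolerance=2.0):
    """Flag a trajectory that exceeds the typical length by the given factor."""
    return step_count > typical_steps * tolerance


calls = [
    ("get_user_profile", {"id": 1}),
    ("get_orders", {"id": 1}),
    ("get_user_profile", {"id": 1}),  # same call again: a likely loop
]
print(repeated_calls(calls))
print(is_anomalously_long(step_count=15, typical_steps=5))
```

A repeated call is not always a loop, since re-reading state after a write can be legitimate, so flagged traces are candidates for review rather than automatic failures.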
6. Task Completion
What it measures: Did the agent accomplish what the user asked for, end-to-end, across the full conversation?
Task completion is the dimension almost every team measures. It is necessary. It is not sufficient.
A task-completion score without the five dimensions above tells you that the agent got to the right answer. It does not tell you whether it got there by the right path, whether it will get there reliably at scale, or where it will fail when conditions shift.
Eval target: Task completion is one score in a six-dimension report, not the only score. A pass on task completion with a fail on tool selection or error recovery is not a pass.
The 4-D Trajectory Score
The six dimensions above establish where failures occur. The 4-D Trajectory Score provides a unified per-trace quality assessment across four orthogonal axes.
Factual Grounding
Measures whether agent outputs are anchored to what tools actually returned versus what the model assumed. The key signal: does the agent’s response change when the tool return changes? If it doesn’t, the agent isn’t using the tool.
Metric: factual_grounding_score — proportion of factual claims in the response that can be attributed to tool payloads in the trace.
Privacy and Safety
Measures whether the agent stayed within defined scope. For customer-facing agents: did it expose PII it shouldn’t have accessed? Did it resist jailbreak attempts? Did it stay within the authorization boundary of the calling user?
Metric: safety_compliance_score — binary pass/fail on a checklist of scope violations, aggregated per trace.
Instruction Adherence
Measures whether the agent followed the system prompt and task specification throughout the trajectory. Agents that comply in early turns and drift in later turns have an instruction adherence problem that task completion scoring won’t surface.
Metric: instruction_adherence_score — proportion of trajectory steps where behavior is consistent with the system prompt constraints.
Optimal Plan Execution
Measures efficiency of the trajectory relative to the minimum viable path for the task class. An agent that completes the task in more steps than necessary is not performing well on this dimension, even if it completes it.
Metric: plan_efficiency_score — ratio of minimum viable steps to actual steps taken, normalized per task class.
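Once per-claim and per-step judgments exist, the four metrics reduce to simple ratios. The sketch below assumes those judgments are already available as lists of booleans, treats the safety score as passing only when every checklist item passes, and caps plan efficiency at 1.0; all of these are illustrative choices, not a documented formula.

```python
def proportion(flags):
    """Share of True values; an empty list is treated as fully passing."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 1.0


def trajectory_scores(claims_grounded, safety_checks, steps_adherent, min_steps, actual_steps):
    """Compute the four per-trace scores from already-judged inputs."""
    return {
        "factual_grounding_score": proportion(claims_grounded),
        # Assumption: any single scope violation fails the whole trace.
        "safety_compliance_score": 1.0 if all(safety_checks) else 0.0,
        "instruction_adherence_score": proportion(steps_adherent),
        # Ratio of minimum viable steps to actual steps, capped at 1.0.
        "plan_efficiency_score": min(1.0, min_steps / actual_steps),
    }


print(trajectory_scores(
    claims_grounded=[True, True, False, True],
    safety_checks=[True, True],
    steps_adherent=[True] * 9 + [False],
    min_steps=4,
    actual_steps=12,
))
```

In this example the trace completes safely and mostly on-instruction but takes three times the minimum path, which only the efficiency axis reveals.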
CI Gate Structure
The most common mistake in agent CI setup is using a single aggregate threshold: if the overall score is above 0.80, merge. The problem: an agent that scores 0.95 on task completion, 0.90 on instruction adherence, and 0.50 on error recovery still clears an unweighted six-dimension average of 0.80 as long as its other three dimensions average about 0.82 or better, because the strong scores absorb the weak one. It will fail in production when it hits its first API timeout.
Per-dimension thresholds are the gate, not aggregate scores.
Each dimension blocks independently. A strong run on five dimensions does not carry a weak run on the sixth. Here is what that gate structure looks like in a config file for fi run:
# config.yaml for `fi run`
assertions:
- "tool_selection_f1.score >= 0.95 for at_least 95% of cases"
- "argument_validation.score >= 0.90 for at_least 90% of cases"
- "argument_semantics.score >= 0.85 for at_least 85% of cases"
- "result_groundedness.score >= 0.90 for at_least 90% of cases"
- "recovery_score.score >= 0.80 for at_least 85% of cases"
- "task_completion.score >= 0.85 for at_least 90% of cases"
Calibrate each threshold against your baseline pass rate at the point you wire up CI. The key design decision is the “for at_least N% of cases” clause, used in place of a per-case hard gate. This handles the reality that eval sets include edge cases that no production deployment is expected to pass 100% of the time, while still catching dimensional regressions before they ship.
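The semantics of a per-dimension gate with a case-fraction clause can be sketched independently of any tool. The code below does not parse the assertion strings or reproduce how fi run evaluates them; it is an assumed reading of "score >= T for at least N% of cases", with hypothetical function names and made-up scores.

```python
def gate_passes(scores, threshold, min_fraction):
    """True when at least min_fraction of cases score at or above threshold."""
    if not scores:
        return False  # assumption: a dimension with no cases cannot pass
    passing = sum(1 for score in scores if score >= threshold)
    return passing / len(scores) >= min_fraction


def evaluate_gates(per_dimension_scores, gates):
    """Return the dimensions that block the merge, in gate order."""
    return [
        dimension
        for dimension, (threshold, min_fraction) in gates.items()
        if not gate_passes(per_dimension_scores.get(dimension, []), threshold, min_fraction)
    ]


per_dimension_scores = {
    "task_completion": [0.95] * 10,
    "recovery_score": [0.95] * 7 + [0.40] * 3,
}
gates = {
    "task_completion": (0.85, 0.90),
    "recovery_score": (0.80, 0.85),
}

all_scores = [s for scores in per_dimension_scores.values() for s in scores]
aggregate = sum(all_scores) / len(all_scores)
print(round(aggregate, 4), evaluate_gates(per_dimension_scores, gates))
```

In this made-up run the blended average sits comfortably above 0.80, yet the recovery gate blocks because only 70% of cases clear its threshold.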
Public benchmarks as floors. BFCL (Berkeley Function Calling Leaderboard), τ-bench, and ToolBench establish meaningful baseline comparisons for tool use capability. They are useful for model selection. They are not production gates. Your CI set needs to reflect your specific tool schemas, your actual error distributions, and your users’ actual task distributions.
Instrumenting for Trajectory Evaluation
Trajectory evaluation requires trace data. Trace data requires instrumentation. This is not optional.
The minimum span capture for agent trajectory evaluation:
import json

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

# Initialize tracing for agent spans
tracer_provider = register(
    project_type=ProjectType.AGENT,
    project_name="your-agent-project",
)

# Trace a tool call. tool_name, arguments, and call_tool come from your agent loop.
with tracer_provider.get_tracer("agent").start_as_current_span("tool_call") as span:
    span.set_attribute("tool.name", tool_name)
    span.set_attribute("tool.arguments", json.dumps(arguments))
    result = call_tool(tool_name, arguments)
    # Record the raw return payload and whether the tool reported success
    span.set_attribute("tool.result", json.dumps(result))
    span.set_attribute("tool.success", result.get("status") == "ok")
The 14 span types in traceAI cover: LLM calls, tool calls, retrieval, embedding, reranking, agent steps, workflow nodes, and synthesis. For trajectory evaluation specifically, tool call spans and agent step spans are mandatory. LLM call spans are required for factual grounding scoring.
Without full span capture, the 4-D Trajectory Score can only be partially computed. plan_efficiency_score requires a complete step count. factual_grounding_score requires tool return payloads. Partial traces produce partial scores.
Production Observability: The Eval Feedback Loop
Offline eval sets go stale. They were built from the distribution you anticipated at launch. Production exposes the distribution you didn’t anticipate.
The production observability loop keeps your eval set current:
Step 1: Error Feed clustering. Failing production traces are clustered by failure mode. LLM-based investigation generates a taxonomy: argument extraction failures cluster separately from error recovery failures, which cluster separately from result utilization failures. Each cluster surfaces the representative traces that define it.
Step 2: Taxonomy review. Engineering reviews the cluster taxonomy weekly. New failure modes that weren’t in the offline eval set are identified. Representative traces are promoted to the offline set.
Step 3: Evaluator calibration. If a new failure mode persists across multiple weeks of production traces, the corresponding dimension threshold in CI is recalibrated against the new distribution.
Step 4: Test set growth. The offline eval set grows continuously from production failures. Six months after launch, it reflects the actual distribution of user behavior, not the anticipated distribution.
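The promotion step in this loop can be sketched once traces carry a failure-mode label. The example assumes clustering has already happened upstream and that each failing trace is a dict with a hypothetical failure_mode key; it only shows how representatives of unseen failure modes get added to the offline set.

```python
from collections import defaultdict


def promote_new_failures(failing_traces, known_modes, per_cluster=2):
    """Pick representative traces for failure modes the eval set lacks."""
    clusters = defaultdict(list)
    for trace in failing_traces:
        clusters[trace["failure_mode"]].append(trace)
    promoted = []
    for mode, traces in clusters.items():
        if mode not in known_modes:
            promoted.extend(traces[:per_cluster])  # keep the set small and reviewable
    return promoted


failing_traces = [
    {"id": "t1", "failure_mode": "timezone_omitted"},
    {"id": "t2", "failure_mode": "timezone_omitted"},
    {"id": "t3", "failure_mode": "timezone_omitted"},
    {"id": "t4", "failure_mode": "retry_storm"},
    {"id": "t5", "failure_mode": "wrong_tool"},
]
known_modes = {"wrong_tool"}
print([trace["id"] for trace in promote_new_failures(failing_traces, known_modes)])
```

Capping the number of traces per cluster keeps one noisy failure mode from dominating the offline set, though the right cap depends on how varied the cluster is.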
Teams with this loop in place ship significantly fewer hotfixes than teams without it. The mechanism is simple: they catch failure modes before users find them, because users are continuously teaching the eval set what to look for.
The Stack
The evaluation framework described here is implemented in two open-source libraries and one hosted platform.
ai-evaluation SDK (Apache 2.0): 70+ evaluation templates covering the six dimensions above. Templates include pre-built evaluators for tool selection accuracy, argument schema validation, factual grounding, instruction adherence, and task completion. Usable standalone or integrated with the platform.
traceAI (Apache 2.0): Instrumentation library with 14 span types across Python, TypeScript, and Go. Captures the complete trace structure required for trajectory evaluation. Available for direct integration into any agent framework.
FutureAGI Platform: Self-improving evaluators backed by classifier scoring, Error Feed clustering for production trace analysis, and the weekly feedback loop that keeps eval sets current. The analysis in this guide comes from the production trace data the platform surfaces.
If you'd rather not build the Error Feed clustering, dimension evaluators, and CI gate config from scratch, they're already set up at app.futureagi.com. Connect your agent traces and the framework above runs against your own data.


