Static Eval Sets Can't Catch Six Kinds of Drift

What shadow, canary, and percentage ramps each catch and why skipping one means that stage runs in production anyway, without gate criteria or a rollback trigger.

Jun 01, 2026

A passing eval is not evidence a rollout is safe. Most teams evaluate an agent candidate the way they'd evaluate a code change - run it against a test suite, check the outputs, ship if they look right. That works for deterministic systems. Agents aren't deterministic systems, and the gap shows up the same way every time: the CI suite is green, the deploy goes out, production fails, and the postmortem blames the rubric.

The rubric usually isn’t wrong. It’s stale - the same way a unit test goes stale when the function it covers gets refactored. Both pass. Neither protects you. The underlying problem is structural: your eval set is a snapshot, production is a river. Every release ages the eval set the day it lands, and without a pipeline that promotes new failure modes back into the offline set, the offline-pass / prod-fail gap is mathematical, not accidental.

The right unit of evaluation is the production trace, not the curated test case. Static offline evals are a necessary regression gate, never a sufficient ship gate, because they measure against a world that stopped existing the day they froze.

The six kinds of drift

There are six, and they all share one shape: the offline eval sees one thing, production does another.

Dataset drift - every curated case passes while real users arrive with intents the set never had.
Tool-API drift - the mocked tool returns the same payload while the vendor quietly changed its schema, error codes, or rate limits.
Prompt drift - the rubric is frozen in git for v3 while the prompt has moved to v17.
Retrieval-corpus drift - the index was frozen at eval-build time, but it has since doubled and the chunker was bumped, so the same query surfaces new chunks.
User-distribution drift - the inputs were hand-authored, and real traffic looks nothing like them.
Agent-step compounding - every step succeeds 95% of the time, but eight of them multiply to 66% end-to-end.

Each runs on a different timescale. Dataset and user-distribution drift creep in over weeks. Tool-API and prompt drift land overnight. Retrieval-corpus drift is silent until a re-index. Agent-step compounding is structural and was never going to be caught by single-turn rubrics. None of these is a “more evals” problem. They’re an architecture problem.

Dataset drift

The eval set was written at launch. Users found intents the test authors never anticipated. The eval still passes because the dataset never moved.

Tell. Offline scores flat for months, production complaints diversifying, the team can’t reproduce most reported failures on the test set.
Fix. Sample failing traces weekly. Bucket by user segment, intent, and judge score. Promote the hardest 5–10% into the eval set with version tags. Every promoted trace is a regression future PRs can’t break.
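
The weekly promotion step can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the FailingTrace fields, the per-bucket selection, and the eval_set_version tag name are all assumptions made for the example, and a lower judge score is assumed to mean a harder failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailingTrace:
    # Hypothetical record shape; field names are assumptions for this sketch.
    trace_id: str
    segment: str
    intent: str
    judge_score: float  # assumed: lower score = harder failure


def promote_hardest(traces, fraction=0.10, version="v1"):
    """Pick the lowest-scoring `fraction` of each (segment, intent) bucket
    and tag every pick with the eval-set version it was promoted into."""
    buckets = {}
    for t in traces:
        buckets.setdefault((t.segment, t.intent), []).append(t)

    promoted = []
    for (segment, intent), items in sorted(buckets.items()):
        items = sorted(items, key=lambda t: t.judge_score)
        # Always keep at least one trace so small buckets stay represented.
        keep = max(1, round(len(items) * fraction))
        for t in items[:keep]:
            promoted.append({
                "trace_id": t.trace_id,
                "segment": segment,
                "intent": intent,
                "judge_score": t.judge_score,
                "eval_set_version": version,
            })
    return promoted
```

Selecting per bucket rather than globally is a design choice: it stops one noisy segment from crowding every other intent out of the promoted set.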

Tool-API drift

The tool call was mocked in CI. The real endpoint changed schema, error shape, or rate-limit headers. The agent retries, the loop times out, and it fabricates a reasonable-sounding answer. CI is green because the mock still returns the old payload.

Tell. Tool-call latency climbs, retries climb, the per-response rubric still passes, cost-per-success creeps up.
Fix. Score tool-call success as its own rubric on live spans. A function-calling evaluator grades argument shape and call sequence, and a failing tool call shows up in the trace tree next to the failing response, scored. The mocked CI test catches your regression; the span-attached score catches the vendor’s.
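
As a rough sketch of what grading argument shape and call sequence could look like on a recorded tool-call span: the span dictionary layout (name, arguments, status) and both function names are hypothetical, chosen only to make the idea concrete.

```python
def score_tool_call(call, expected_args):
    """Grade one recorded tool call.

    `call` is a dict with hypothetical keys "name", "arguments", "status".
    `expected_args` maps argument name -> expected Python type.
    Returns (score, problems); score is 1.0 only when nothing is wrong.
    """
    problems = []
    args = call.get("arguments", {})
    for name, expected_type in expected_args.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected_type):
            problems.append(f"wrong type for {name}: {type(args[name]).__name__}")
    for name in args:
        if name not in expected_args:
            problems.append(f"unexpected argument: {name}")
    if call.get("status") != "ok":
        problems.append(f"tool returned status {call.get('status')!r}")
    return (0.0 if problems else 1.0), problems


def sequence_matches(calls, expected_order):
    """True when the tools were called in exactly the expected order."""
    return [c["name"] for c in calls] == list(expected_order)
```

Because the check runs on what the live endpoint actually returned, a vendor-side change shows up as a failing score even while the mocked CI test stays green.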

Prompt drift

v17 of the prompt shipped Friday. The rubric was written for v3 in February and still grades the criteria v3 cared about. The agent is being evaluated for the wrong thing.

Tell. A senior engineer reads ten traces, disagrees with the judge on six, and can’t articulate why. The judge is grading by the old contract.
Fix. Version the rubric in the same PR as the prompt it scores. Treat it like a contract test: when the prompt’s intent moves, the rubric moves with it, and the next CI run regrades the dataset under the new contract. Track judge-vs-human agreement on a small calibration set; when it drops, the rubric is overdue.
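
One way to make the contract explicit is to have each rubric declare which prompt version it scores and fail CI on a mismatch. The sketch below assumes a rubric is a plain dict with a hypothetical scores_prompt_version field; real rubric formats will differ.

```python
def check_rubric_contract(prompt_version, rubric):
    """Raise ValueError when a rubric was written for a different prompt version.

    `rubric` is assumed to be a dict with hypothetical keys
    "name" and "scores_prompt_version".
    """
    scored = rubric["scores_prompt_version"]
    if scored != prompt_version:
        raise ValueError(
            f"rubric {rubric['name']!r} scores prompt {scored}, "
            f"but the deployed prompt is {prompt_version}"
        )
    return True
```

Run as a CI step, this turns "the rubric is still grading v3" from something a senior engineer notices in trace review into a failed build on the PR that bumped the prompt.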

Retrieval-corpus drift

The retriever evaluated in March indexed 12,000 documents at chunk size 800. By May the index has 38,000, the chunker was bumped during a re-embed, and the same query lands on different top-k chunks. The generator grounds in whatever it’s handed. Groundedness still scores 0.94. The answer is grounded in the wrong material.

Tell. Generation rubrics hold. Users say the bot is “less helpful than last quarter.” Trace inspection shows the top-1 chunk shifted for a class of queries.
Fix. Split the eval suite by layer. Retrieval rubrics (context relevance, chunk attribution, chunk utilization) catch index drift before generation rubrics absorb it. A drop in context relevance with stable groundedness means the retriever moved; a drop in groundedness with stable context relevance means the generator did. One bisect instead of three days.
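
The bisect rule above is simple enough to write down. In this sketch the metric names, the snapshot dict shape, and the 0.03 tolerance are illustrative assumptions, not values from any particular tool.

```python
def diagnose_layer(before, after, tolerance=0.03):
    """Compare two metric snapshots and name the layer that moved.

    Each snapshot is a dict with "context_relevance" and "groundedness"
    in the 0-1 range. `tolerance` is an assumed noise floor.
    """
    relevance_fell = before["context_relevance"] - after["context_relevance"] > tolerance
    grounded_fell = before["groundedness"] - after["groundedness"] > tolerance

    if relevance_fell and not grounded_fell:
        return "retriever"
    if grounded_fell and not relevance_fell:
        return "generator"
    if relevance_fell and grounded_fell:
        return "both"
    return "stable"
```

For example, a snapshot where context relevance falls while groundedness holds at 0.94 returns "retriever", which is the case the generation rubrics alone would have hidden.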

User-distribution drift

The eval set was hand-authored or sampled from launch-month traffic. Six months later, real users arrive with slang, multi-language code-switching, longer prompts, screenshots, and follow-up chains the dataset never had. A judge calibrated against the curated set reads 15 points lower on live traffic.

Tell. A spot-check of production traces scored by hand disagrees with the judge by 15+ points. Engineers stop trusting the rubric and start reading traces directly.
Fix. Calibrate the judge against production samples, not the dataset. Each rubric ships with a small human-labelled calibration set drawn from production, and judge-vs-human drift becomes its own tracked metric.
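
Tracking judge-vs-human drift can start as a plain agreement rate over the calibration set. The sketch below assumes pass/fail labels and treats a drop of more than 15 points below a previously measured baseline as the alarm condition; both the label format and the threshold are assumptions to tune.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of calibration samples where judge and human agree."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def judge_has_drifted(judge_labels, human_labels, baseline, max_drop=0.15):
    """True when agreement has fallen more than `max_drop` below `baseline`.

    `baseline` is the agreement rate measured when the rubric was last
    calibrated; `max_drop` is an assumed alarm threshold.
    """
    return baseline - agreement_rate(judge_labels, human_labels) > max_drop
```

Plain agreement is the crudest possible measure. A chance-corrected statistic such as Cohen's kappa is a reasonable next step once the labels are imbalanced.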

Agent-step compounding

Every per-step rubric scores 95%. The agent makes eight tool calls per session. 0.95 to the eighth is 0.66, so roughly a third of sessions end up structurally wrong even when every individual step looks right. The rubric never multiplied.
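
The arithmetic is worth running once. The sketch treats steps as independent, which is the assumption behind multiplying the per-step rates; correlated failures would change the number but not the shape of the problem.

```python
def end_to_end_success(per_step_success, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


rate = end_to_end_success(0.95, 8)
print(f"sessions fully correct: {rate:.2f}")       # 0.66
print(f"sessions with a failure: {1 - rate:.2f}")  # 0.34
```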

Tell. Per-turn metrics high; conversation completeness, outcome rate, and CSAT low. Tickets read “the bot kept asking me the same question” or “it said yes then said no.”
Fix. Score the trace as a unit. Add conversation completeness, role adherence, knowledge retention, and turn relevancy on the conversation, plus optimal plan execution on the span tree. Multi-turn metrics are noisier and cost more per score than per-turn ones, but they track user experience far more closely.

Why static offline evals can’t catch any of these

The shared property of all six: they happen after the eval set was frozen. A static dataset can’t encode a hypothesis it doesn’t have yet. The CI gate is a regression test on a hypothesis you wrote in the past; the drift is a hypothesis production hasn’t surfaced cleanly enough to label.

This is not a “your dataset is too small” problem. A 10,000-example offline set from March still doesn’t contain May’s tool-schema change, June’s prompt revision, July’s index re-embed, or August’s users phrasing questions a new way. Scale doesn’t fix the snapshot. Only sampling production does, and sampling production means the eval surface lives where the agent lives.

The reframe: the trace is the eval case. The curated dataset is the regression seed; live spans are the working set. Failures cluster, the rubric scores them as they happen, the named clusters become the next batch of dataset entries, and the loop closes. Offline pass is necessary. Trace-attached pass is sufficient.

The four-dimensional trace score

Per-turn faithfulness on the final response isn’t enough resolution to diagnose a drifting agent. The score written back on every failing trace is four-dimensional, each axis scored 1–5:

Factual grounding. Did the agent stay anchored in the retrieved or supplied context, or confabulate? Catches retrieval-corpus and dataset drift at the response level.
Privacy and safety. Did the agent leak PII, cross a tenant boundary, or comply with a jailbreak it should have refused? Catches tool-API drift on permissions and prompt drift on the refusal head.
Instruction adherence. Did the agent follow the system prompt and refuse what should have been refused? Catches prompt drift directly — when v17 says one thing and the agent does another, this is the axis that drops.
Optimal plan execution. Did the agent pick the right tool, in the right order, without redundant calls, retries, or unreachable branches? Catches agent-step compounding and tool-API drift on the call graph.

Four axes, four kinds of regression, one composite. When the composite drops on a trace, the axes tell you which drift mode bit you. The same axes run in CI on the offline set and on live spans, so the diagnostic vocabulary is identical in both places.
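
A small data structure makes the axis-to-drift mapping concrete. This is a sketch under stated assumptions: the field names are invented for the example, and the composite is taken as an unweighted mean because the text does not say how the four axes are combined.

```python
from dataclasses import dataclass, fields

# Axis -> drift modes it points at, as described in the text above.
DRIFT_HINTS = {
    "factual_grounding": ["retrieval-corpus drift", "dataset drift"],
    "privacy_safety": ["tool-API drift", "prompt drift"],
    "instruction_adherence": ["prompt drift"],
    "plan_execution": ["agent-step compounding", "tool-API drift"],
}


@dataclass(frozen=True)
class TraceScore:
    factual_grounding: int
    privacy_safety: int
    instruction_adherence: int
    plan_execution: int

    def __post_init__(self):
        # Every axis is scored on the 1-5 scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be 1-5, got {value}")

    def composite(self):
        # Assumption: unweighted mean of the four axes.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

    def suspects(self):
        """Drift modes associated with the lowest-scoring axis (or axes)."""
        values = {f.name: getattr(self, f.name) for f in fields(self)}
        lowest = min(values.values())
        hints = []
        for axis, value in values.items():
            if value == lowest:
                for hint in DRIFT_HINTS[axis]:
                    if hint not in hints:
                        hints.append(hint)
        return hints
```

A trace scored TraceScore(4, 5, 2, 4) has a composite of 3.75, and its lowest axis, instruction adherence, points at prompt drift.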

Turning the score into a working loop

A score on a trace is just a number. What makes it useful is the machinery that acts on it. Here’s how that machinery is wired, concretely enough that you could rebuild the shape yourself.

Every failing trace gets written into ClickHouse alongside its span embeddings. HDBSCAN then runs soft-clustering to collapse those traces into named issues, with the threshold set at prob >= 0.4, low enough that genuine outliers aren’t discarded as noise. Each cluster is handed to a judge agent (Claude Sonnet 4.5 on Bedrock), which works the cluster for up to 30 turns using eight span-level tools, among them read_span, get_children, get_spans_by_type, search_spans, submit_finding, submit_scores, and submit_summary. Any span longer than 3,000 characters gets condensed first by a lighter Claude Haiku “chauffeur.” Because roughly 90% of prompts hit the cache, the whole thing stays cheap enough to run.

For each cluster, the judge writes back three things:

A taxonomy label, drawn from 5 categories and 30 subtypes.
The four-dimensional trace score.
An immediate_fix line naming the single change worth shipping now: editing a rubric, patching a prompt, guarding a tool call, or tightening a retrieval filter.
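
The membership threshold from the clustering step can be illustrated without the clusterer itself. The sketch below assumes one cluster label and one membership probability per trace, in the style of HDBSCAN output where a label of -1 marks noise. How the pipeline actually derives its probabilities is not specified here, so the function only shows the filtering.

```python
from collections import defaultdict


def group_by_cluster(trace_ids, labels, probabilities, min_prob=0.4):
    """Group traces by cluster, keeping members with probability >= min_prob.

    Label -1 is treated as noise. Returns (clusters, unassigned), where
    clusters maps label -> list of trace ids.
    """
    clusters = defaultdict(list)
    unassigned = []
    for trace_id, label, prob in zip(trace_ids, labels, probabilities):
        if label == -1 or prob < min_prob:
            unassigned.append(trace_id)
        else:
            clusters[label].append(trace_id)
    return dict(clusters), unassigned
```

Keeping the unassigned list rather than dropping it matters: those traces are candidates for the next clustering run, when enough similar failures may have accumulated to form an issue of their own.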

Closing the loop is a workflow, not a feature

Cluster - group the failures into named issues. Nobody can usefully triage a flat list of 800.
Score - the judge attaches the 4-D score, the taxonomy label, and the immediate_fix.
Promote - on-call signs off on a cluster, picks 3 to 10 traces that represent it, and checks them into the offline set, tagged by route and labelled by rubric.
Re-gate - on the next CI run those new cases are graded by the exact rubric production used, so the next PR touching that path can’t quietly undo the fix.
Optimize - a prompt search runs over the now-larger set, and any candidate has to clear the rubric in CI before it ships.
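
The promote step is the one humans perform, so it helps to pin down what it must enforce. A minimal sketch, assuming clusters and traces are plain dicts with hypothetical keys ("name", "traces", "trace_id", "route"); the guard rails mirror the sign-off and the 3-to-10 trace range described above.

```python
def promote_cluster(cluster, chosen_trace_ids, approved_by, rubric_version):
    """Turn a signed-off cluster into offline eval-set entries.

    Each entry is tagged by route and labelled with the rubric version
    that scored it in production.
    """
    if not approved_by:
        raise ValueError("a cluster needs on-call sign-off before promotion")
    if not 3 <= len(chosen_trace_ids) <= 10:
        raise ValueError("pick 3 to 10 representative traces")

    by_id = {t["trace_id"]: t for t in cluster["traces"]}
    missing = [tid for tid in chosen_trace_ids if tid not in by_id]
    if missing:
        raise ValueError(f"traces not in cluster: {missing}")

    return [
        {
            "trace_id": tid,
            "route": by_id[tid]["route"],
            "issue": cluster["name"],
            "rubric_version": rubric_version,
            "approved_by": approved_by,
        }
        for tid in chosen_trace_ids
    ]
```

Recording the rubric version on each entry is what lets the re-gate step grade the new cases with the exact rubric production used.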

Run this weekly on anything active, more often around shaky launches. A set that hasn’t moved in a quarter has almost certainly fallen out of step with production; on fast-moving agents the gap can surface inside two or three weeks. Pull your samples from failing traces, low-scoring examples, and a spread across segments, then label, version, and commit them.

What teams that close the gap actually ship

There are six things, and most teams manage two or three:

One rubric, two places. The same code-defined rubric runs against a versioned dataset on PRs and against live spans on canary - same judge, same prompt, both sides.
Scores that live on the span. The 4-D scores are written as OpenTelemetry span attributes, so the score and the trace it describes sit together.
Conversation- and outcome-level metrics. Completeness and role adherence, plus the things the business cares about: resolved, filed, booked.
Retrieval and tool calls graded on their own. Each gets a separate score, independent of the final answer.
Clustering that explains itself. Failures group automatically, each cluster carries its immediate_fix, and each is a candidate to fold into the dataset.
A loop that genuinely closes. New clusters land in the offline set as regression tests on the next CI run.
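
Writing scores onto the span can be as small as a helper that calls set_attribute, the method OpenTelemetry's Python Span exposes for attaching key-value pairs. The attribute key names and the mean composite below are assumptions for illustration, and RecordingSpan is a stand-in so the sketch runs without an OpenTelemetry SDK installed.

```python
def attach_scores(span, scores, prefix="eval"):
    """Write each axis score as a flat span attribute.

    `span` is anything with a set_attribute(key, value) method, such as
    an OpenTelemetry span. Key names here are hypothetical.
    """
    for axis, value in scores.items():
        span.set_attribute(f"{prefix}.{axis}", int(value))
    # Assumption: composite is the unweighted mean of the axes.
    span.set_attribute(f"{prefix}.composite", sum(scores.values()) / len(scores))


class RecordingSpan:
    """Minimal stand-in for a real span; stores attributes in a dict."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


span = RecordingSpan()
attach_scores(span, {"factual_grounding": 4, "plan_execution": 2})
print(span.attributes)
```

Flat, primitive-valued attributes keep the scores queryable in whatever backend stores the traces, next to the latency and error fields already there.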

Most posts like this end on a clean checkmark. This one shouldn’t. The piece that doesn’t exist yet is a direct connector from the trace stream into the optimizer - continuous tuning on live spans, with no dataset round-trip in between. It’s on the roadmap, not in production. For now, continuous optimization means running the loop weekly through the promote step by hand. Pretending the direct path already ships is exactly the kind of claim that evaporates the moment an engineer opens the code, and the trust it costs isn’t worth it.

Three tradeoffs worth naming before you commit to any of this:

It adds operational surface. Span-attached scoring, auto-clustering, and a promote workflow are far more moving parts than running pytest evals/. What you get back is a regression suite that compounds. If that’s too much up front, start with tracing plus offline eval and switch the loop on later.
Self-improving evaluators need watching too. A rubric that retunes itself against live traffic can wander somewhere you didn’t intend. Keep a small human-labelled hold-out and trip an alarm when the judge and the humans diverge past your inter-rater baseline.
Scoring traces costs more than scoring cases. A 4-D rubric on a 30-second trace is pricier and noisier than per-turn scoring on a 200-token example. Sample where the failure signal is, not uniformly.

The one thing to take away

Keep both evals; they have different jobs. The offline suite is your regression gate on a PR; production is your drift signal on a deploy. The mistake is letting an offline pass stand in for a ship decision when it's only a precondition for one. Run the same rubric on both sides, and watch the distance between the offline average and the production average as a quality metric in its own right. The day that gap starts to widen is the day production tells you which drift reached you first.
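
Watching the gap can start as two small functions. This sketch assumes the same rubric produces comparable numeric scores on both sides, and it treats three consecutive weekly increases as "widening"; that window is an arbitrary choice to tune, not a recommendation from any tool.

```python
from statistics import mean


def offline_prod_gap(offline_scores, production_scores):
    """Offline average minus production average for the same rubric."""
    return mean(offline_scores) - mean(production_scores)


def gap_is_widening(weekly_gaps, weeks=3):
    """True when the gap grew in each of the last `weeks` comparisons.

    `weekly_gaps` is a chronological list of offline_prod_gap values.
    """
    recent = weekly_gaps[-(weeks + 1):]
    if len(recent) < weeks + 1:
        return False  # not enough history to call a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```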

The loop that does this - clustering, span-attached scores, promote-back - is open source (ai-evaluation, traceAI) and hosted at: Link.

Future AGI
