Why AGNT5 models evaluation as just another kind of run — same journal, same replay, same observability.
Most eval frameworks live next to the application they evaluate. They have their own runner, their own storage, their own result format, and — when something goes wrong — their own debugging workflow. The eval that failed in CI yesterday is a different sort of artifact from the production run that succeeded this morning, even though they executed mostly the same code.
We treat evals as runs. Same journal, same replay, same ctx.step boundaries, same Parquet archive, same DuckDB query layer. An eval harness is just a workflow that wraps the workflow under test.
Here is one, as plain as we can write it:
from agnt5 import workflow, WorkflowContext, function, FunctionContext

@function
async def grade_answer(ctx: FunctionContext, expected: str, actual: str) -> dict:
    # An LLM judge, scoring a single answer.
    verdict = await ctx.step(
        "judge",
        lambda: ctx.llm.chat(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": (
                    f"Expected: {expected}\nActual: {actual}\n"
                    "Score from 0 to 1 and explain in one sentence."
                ),
            }],
        ),
    )
    return parse_score(verdict.content)

@workflow
async def eval_summarization(ctx: WorkflowContext, dataset_id: str) -> dict:
    cases = await ctx.step("load_dataset", lambda: load_dataset(dataset_id))
    scores = []
    for case in cases:
        # Replay-safe: same case id → same child run id.
        result = await ctx.invoke(
            "summarize_workflow",
            input={"document": case["document"]},
            idempotency_key=case["id"],
        )
        grade = await ctx.step(
            f"grade_{case['id']}",
            lambda: grade_answer(case["expected"], result),
        )
        scores.append({"id": case["id"], "score": grade["score"]})
    return {
        "dataset": dataset_id,
        "n": len(scores),
        "mean_score": sum(s["score"] for s in scores) / len(scores),
        "per_case": scores,
    }

This is an eval. It is also a run.
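Two helpers are assumed rather than shown: load_dataset, which returns the list of eval cases, and parse_score, which turns the judge's free-text verdict into a structured score. A minimal parse_score sketch, purely illustrative and assuming the judge follows the "score, then one sentence" instruction:

import re

def parse_score(content: str) -> dict:
    # Pull the first 0-to-1 number out of the judge's reply and keep the
    # raw text as the explanation. Illustrative only -- a real grader would
    # ask the judge for structured output rather than regex over prose.
    match = re.search(r"\b(0(?:\.\d+)?|1(?:\.0+)?)\b", content)
    score = float(match.group(1)) if match else 0.0
    return {"score": score, "explanation": content.strip()}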
What “first-class” buys you
The eval workflow is durable. If the LLM judge times out halfway through case 47, the harness resumes on retry and picks up from case 47 — not case 1. Previously graded cases are memoized. The expensive parts do not re-execute.
The eval is replayable. Six weeks from now, when someone asks “how did v1.2 of the summarization prompt score on this dataset,” you can replay the eval run with the new prompt, diff the per-case scores, and see exactly which cases regressed. The run IDs, step names, and inputs are all durable; the outputs change because the prompt changed; the delta is the eval result.
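Concretely, the diff is just a comparison of two workflow outputs. A small sketch, assuming you have fetched the output dicts of the old and new eval runs (the shape returned by eval_summarization above):

def diff_eval_runs(old: dict, new: dict) -> list[dict]:
    # Pair per-case scores from two eval outputs and keep the cases whose
    # score dropped, biggest regression first.
    old_scores = {c["id"]: c["score"] for c in old["per_case"]}
    regressions = [
        {"id": c["id"], "before": old_scores[c["id"]], "after": c["score"]}
        for c in new["per_case"]
        if c["id"] in old_scores and c["score"] < old_scores[c["id"]]
    ]
    return sorted(regressions, key=lambda r: r["after"] - r["before"])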
The eval is queryable. Because the eval run’s journal lands in the same Parquet files as every other completed run, a DuckDB query can pull “all eval runs for dataset internal_support_v3 in the last quarter, by mean_score” in a few hundred milliseconds. You do not maintain a separate eval results database.
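The exact Parquet schema is not shown here, so treat the column names below as assumptions, but the shape of the query is plain DuckDB over Parquet:

import duckdb

# Hypothetical layout: one row per completed run, with workflow_name,
# started_at, and the run's output serialized as a JSON column.
duckdb.sql("""
    SELECT
        started_at,
        CAST(json_extract(output, '$.mean_score') AS DOUBLE) AS mean_score
    FROM read_parquet('runs/*.parquet')
    WHERE workflow_name = 'eval_summarization'
      AND json_extract_string(output, '$.dataset') = 'internal_support_v3'
      AND started_at >= now() - INTERVAL 90 DAY
    ORDER BY mean_score DESC
""").show()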
The eval is observable. The same OTel spans, the same SSE timeline, the same Studio run view. A failed eval is just a run whose final output does not match expectations — the mechanics of debugging it are identical to debugging a production workflow.
Invoking the workflow under test
The ctx.invoke call is the piece that makes this work. It dispatches another workflow as a child run, records the child’s result in the parent’s journal, and returns the child’s output. On replay of the parent, ctx.invoke with the same idempotency key returns the cached child output instead of dispatching again.
That idempotency key is how the eval harness guarantees that a retry of case 47 does not re-execute the summarization workflow — it pulls the existing child run’s result from the journal. The case ID becomes the stable identifier that ties the eval run to the workflow run it is evaluating.
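The mechanics are easiest to see as a journal lookup. This is not the AGNT5 implementation, just a sketch of the semantics it guarantees:

from typing import Any, Awaitable, Callable

async def invoke_once(journal: dict[str, Any], key: str,
                      dispatch: Callable[[], Awaitable[Any]]) -> Any:
    # First execution: dispatch the child workflow and journal its result.
    # Retry or replay: return the journaled result without dispatching again.
    if key in journal:
        return journal[key]
    result = await dispatch()
    journal[key] = result
    return result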
The child run is a real run. It has its own journal, its own step-level cost attribution, its own trace. When the eval reports “case 47 scored 0.82,” you can click into case 47 and see every step of the summarization workflow that produced that answer. The grader’s reasoning, the summarizer’s tool calls, the prompt that was sent — all durable, all linkable.
Evals in continuous integration
Because the eval is a workflow, CI can trigger it the same way it would trigger any other. The CI job is a single agnt5 run eval_summarization invocation with the dataset as input. The run’s exit status is the workflow’s success or failure. The run’s output contains the mean score, the per-case breakdown, and a link back to the run in Studio.
A regression gate in CI is now an assertion against the run’s output:
agnt5 run eval_summarization \
  --input '{"dataset_id": "internal_support_v3"}' \
  --wait \
  --assert-jsonpath '$.mean_score > 0.85'

If the gate fails, the CI job fails. If it fails because case 47 specifically regressed, the engineer looking at the failure clicks through to case 47 in Studio and sees the full summarization run next to the grader’s verdict. No eval-specific tooling required.
The tradeoff
Treating evals as runs means eval runs show up in your run counts, your run costs, and your run metrics. A team that runs a five-thousand-case eval every commit will see a corresponding bump in their Parquet archive and their LLM spend line. Evals cost money. The platform makes that explicit.
The alternative — hiding eval runs in a separate system — hides the cost too. We would rather the cost be visible and manageable than invisible and creeping. If eval spend dominates production spend, the team can see that directly in a DuckDB query and adjust.
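Seeing it is the same kind of query as before. A sketch, again with hypothetical column names for the workflow name and per-run cost:

import duckdb

duckdb.sql("""
    SELECT
        CASE WHEN workflow_name LIKE 'eval_%' THEN 'eval' ELSE 'production' END AS kind,
        SUM(total_cost_usd) AS spend_usd
    FROM read_parquet('runs/*.parquet')
    GROUP BY 1
    ORDER BY spend_usd DESC
""").show()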
Why this matters
Evals and production runs solve the same problem from two sides: how do we know the system is doing the right thing? If the two live in different worlds, you end up maintaining parallel understandings — “it passes eval in CI but behaves differently in prod” becomes a common phrase. Putting them in the same world, with the same durability and observability guarantees, removes that class of mismatch.
Every eval is a run. Every run is auditable. The symmetry is what makes the feedback loop trustworthy.