The improvement loop

The trace → eval → edit → deploy cycle — the loop that turns durable execution into a product, not only a runtime feature.

The improvement loop is the cycle every production agent system needs: every run produces a trace, traces feed evals, evals expose regressions, edits ship as new deployments. Every other concept in AGNT5 exists to make this loop fast.

┌─────────┐       ┌────────┐       ┌─────────┐       ┌──────────┐
│   run   │ ────► │ trace  │ ────► │  eval   │ ────► │   edit   │
└─────────┘       └────────┘       └─────────┘       └──────────┘
     ▲                                                    │
     │                                                    │
     │           ┌──────────┐       ┌─────────────┐       │
     └───────────│   run    │ ◄──── │ deployment  │ ◄─────┘
                 └──────────┘       └─────────────┘

The trace is the system of record. Without traces, every other step in the loop is impossible — you cannot eval what you cannot inspect, and you cannot tell whether an edit improved or regressed behavior.

The mental model

Treat AGNT5 as a loop accelerator, not only a workflow runtime. The runtime captures every step’s input and output to the journal. The trace UI reads from the journal. Eval frameworks read from the trace. Edits land as new deployments. New runs produce new traces, which feed the next round of evals. The faster you can complete one rotation, the faster your agent system improves.
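
As a concrete mental model (names here are illustrative, not the actual AGNT5 schema), you can picture the trace a run leaves behind as a list of journaled step records:

    from dataclasses import dataclass, field
    from typing import Any

    @dataclass
    class StepRecord:
        """One journaled step: everything needed to inspect or replay it."""
        name: str                  # step identifier within the workflow
        input: Any                 # arguments the step was called with
        output: Any                # result the step returned, if it succeeded
        error: str | None = None   # error message, if the step failed
        duration_ms: float = 0.0   # timing captured by the runtime
        llm: dict | None = None    # for LLM steps: prompt, response, tokens

    @dataclass
    class Trace:
        """The system of record for one run."""
        trace_id: str              # stable per run; reused across replays
        run_args: Any              # the run's own input arguments
        steps: list[StepRecord] = field(default_factory=list)

Every stage of the loop below consumes or produces some slice of this structure.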

Each stage has a clear input and output:

  • Run produces a trace. Inputs are the run’s arguments; outputs are every step’s input/output, error, timing, and (for LLM steps) prompts/responses/token counts.
  • Trace is browsed, exported, or piped into an eval. Inputs are the trace IDs you select; outputs are the trace data structures with everything the runtime captured.
  • Eval scores traces against rubrics, references, or LLM judges. Inputs are a dataset of traces; outputs are scores and per-row diffs (see the harness sketch after this list).
  • Edit changes a workflow, prompt, model, or tool. Inputs are eval signals; the output is a new deployment.
  • Deploy ships the edit. Input is the new code; outputs are a new deployment artifact and, once the environment pointer advances, routing of new runs to it.
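
A sketch of how those stages compose, under assumptions: trace_store, judge, and build_and_deploy are hypothetical stand-ins, not AGNT5 APIs. The point is the data flow, with each stage's output feeding the next stage's input:

    def improvement_cycle(trace_store, judge, build_and_deploy):
        # Trace -> eval: select and snapshot the traces to score.
        trace_ids = trace_store.select(last_days=7)
        dataset = [trace_store.get(tid) for tid in trace_ids]

        # Eval: score each trace; a judge can be a rubric, a reference
        # comparison, or an LLM grader.
        scores = {t.trace_id: judge(t) for t in dataset}
        # 0.8 is an arbitrary threshold, for illustration only.
        regressions = [tid for tid, s in scores.items() if s < 0.8]

        # Edit + deploy: only ship when the eval signal warrants it.
        if regressions:
            deployment = build_and_deploy()   # a new immutable version
            return deployment, regressions
        return None, []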

The loop is the product. Durable execution is a means to that end; the trace-as-system-of-record is the bridge that lets evals replay old runs against new code; and deployments-as-immutable-versions are what make A/B comparison meaningful.
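
One way to picture deployments-as-immutable-versions: an environment is a mutable pointer over an append-only list of versions. A minimal sketch, with illustrative class and method names:

    class Environment:
        """Mutable pointer over immutable deployment versions."""

        def __init__(self) -> None:
            self.deployments: list[str] = []   # append-only artifact IDs
            self.active: str | None = None     # where new runs are routed

        def publish(self, artifact_id: str) -> None:
            self.deployments.append(artifact_id)   # never edited in place

        def advance(self, artifact_id: str) -> None:
            assert artifact_id in self.deployments
            # New runs route to the new version; old traces still name
            # the exact version they ran on, which keeps A/B honest.
            self.active = artifact_id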

Why it works this way

Agent systems are not deterministic enough to ship-and-forget. The same prompt produces different outputs across model versions; the same workflow produces different tool calls across runs; the same eval rubric scores differently as the dataset drifts. The only sustainable strategy is to measure continuously and edit deliberately — and to do that, you need every run to be inspectable, every edit to be comparable, and every comparison to be auditable.

AGNT5 picks one mechanism (event sourcing) that gives you all three at once. The journal makes runs inspectable (it’s the trace). The journal makes edits comparable (replay an old run against new prompts; the inputs are still on disk). The journal makes comparisons auditable (you can show exactly which calls fired, in what order, with what arguments).
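
A minimal sketch of the replay half of that claim, assuming journal entries shaped like the Trace above (replay_against is a hypothetical helper, not an AGNT5 API):

    def replay_against(trace, new_step_impls):
        """Re-run a recorded trace against edited step implementations.

        The journaled inputs are still on disk, so each new implementation
        is called with exactly the arguments the original run used.
        """
        diffs = []
        for step in trace.steps:
            new_output = new_step_impls[step.name](step.input)
            if new_output != step.output:
                diffs.append((step.name, step.output, new_output))
        return diffs   # empty list: the edit changed nothing for this run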

A separate sidecar logging path could give you traces. A separate eval database could give you comparisons. A separate audit log could give you accountability. Picking one mechanism that gives you all three is the simplification — and it is what makes the loop fast.

Edge cases and gotchas

  • Replaying old traces against new prompts requires deterministic workflow code. If the workflow body’s call sequence depends on a clock or RNG, replay drifts and the comparison is meaningless. Determinism (see Determinism) is a precondition for the eval half of the loop.
  • Eval datasets drift unless versioned. The set of traces you eval against today may include traces you would not pick tomorrow. Snapshot the dataset (trace IDs + timestamps) before each eval run, as sketched after this list; otherwise comparing scores across time is comparing different denominators.
  • Comparison across deployments needs stable trace IDs. The runtime generates trace IDs that are stable per run; reusing the same ID across replays is what lets eval frameworks pair “before edit” and “after edit” results.
  • The loop is per-component, not per-system. A team improving one workflow’s prompt should not be blocked on a system-wide eval pipeline. Treat each workflow’s loop as independent and run them on their own cadences.
  • Skipping traces breaks the loop. It is tempting to log only “interesting” runs, but every run produces a trace anyway in AGNT5, and that complete capture is what keeps the loop sustainable. Filtering happens at eval time, not capture time.
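
A sketch of the snapshot discipline from the dataset-drift gotcha, so that eval scores across time share a denominator. snapshot_dataset and load_dataset are hypothetical helpers, not AGNT5 APIs:

    import json
    import time

    def snapshot_dataset(trace_ids: list[str], path: str) -> None:
        """Pin the exact traces an eval run scored, so it is repeatable."""
        with open(path, "w") as f:
            json.dump(
                {"created_at": time.time(), "trace_ids": sorted(trace_ids)},
                f,
                indent=2,
            )

    def load_dataset(path: str) -> list[str]:
        """Later runs score the same traces, not whatever matches today."""
        with open(path) as f:
            return json.load(f)["trace_ids"]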