Event sourcing and replay

How AGNT5 records every step's input and output to a journal, and how replay reads the journal to skip work that already ran.


AGNT5 records every step’s input, output, error, and timing to an append-only journal. Recovery, like idempotent re-execution, works by replaying that journal: at each step the runtime checks whether a record already exists, returns it if it does, and runs the step for real if it does not.

Run start
  ├─ ctx.step(fetch_article, url)   ──► journal: { step: 1, in: url, out: <html> }
  ├─ ctx.step(summarize, html)      ──► journal: { step: 2, in: html, out: <text> }
  └─ return <text>

Crash + restart, same run

Run resume
  ├─ ctx.step(fetch_article, url)   ──► journal HIT  → returns <html>, no fetch
  ├─ ctx.step(summarize, html)      ──► journal HIT  → returns <text>, no LLM call
  └─ return <text>

The crashed worker did not lose state. The journal is the state. The new worker walks the same recipe and reads each step’s outcome from the journal instead of recomputing it.

The mental model

Think of the journal as a logbook the runtime keeps next to your workflow. Every time the workflow body crosses a ctx.step(...) call, the runtime opens the logbook to that page and asks: have I already written down what happened here? If yes, replay returns the recorded value and moves on. If no, the runtime executes the step, records what happened, and only then returns control to the workflow.
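The logbook lookup can be sketched in a few lines of Python. This is a toy model under stated assumptions, not AGNT5's implementation: the `Context` class, the list-based journal, and the two step functions are hypothetical stand-ins, and the real runtime persists the journal in the engine's storage rather than in memory.

```python
# Toy model of a journal-backed step runner. `Context`, the list journal,
# and the step functions are hypothetical; they are not AGNT5's code.
from typing import Any, Callable

calls = {"fetch_article": 0, "summarize": 0}  # counts real executions


def fetch_article(url: str) -> str:
    calls["fetch_article"] += 1
    return f"<html>article at {url}</html>"


def summarize(html: str) -> str:
    calls["summarize"] += 1
    return f"summary of {len(html)} chars"


class Context:
    """Tracks the position of each step call and consults the journal."""

    def __init__(self, journal: list) -> None:
        self.journal = journal  # outlives the Context, like a stored journal
        self.position = 0       # index of the next ctx.step call

    def step(self, fn: Callable[..., Any], *args: Any) -> Any:
        index = self.position
        self.position += 1
        if index < len(self.journal):
            # Journal hit: return the recorded output without running fn.
            return self.journal[index]["out"]
        # Journal miss: run the step, record it, and only then return.
        out = fn(*args)
        self.journal.append(
            {"step": index + 1, "name": fn.__name__, "in": args, "out": out}
        )
        return out


def workflow(ctx: Context, url: str) -> str:
    html = ctx.step(fetch_article, url)
    return ctx.step(summarize, html)


journal: list = []
first = workflow(Context(journal), "https://example.com/post")
# Simulate a restart: a fresh Context (new worker) over the same journal.
second = workflow(Context(journal), "https://example.com/post")
```

After the second call, `calls` still shows one execution per step: the resumed run was served entirely from the journal.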

The unit recorded is the step, not the line. Code between two ctx.step(...) calls — branches, variable assignments, deterministic helpers — re-executes on every replay; that’s why the workflow body must be deterministic. Anything inside a step is opaque to replay; the runtime sees only the input it received and the output it returned.
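To illustrate why non-determinism belongs inside a step, here is a small sketch that reuses the same toy idea of a position-indexed journal. The `Context` class and the step functions are assumptions made for illustration only. The random choice is made inside a step, so its result is journaled; the branch between steps re-executes on every replay but depends only on that journaled value.

```python
# Sketch: keep the workflow body deterministic by journaling the random pick.
# `Context` and the step functions are hypothetical, not AGNT5's API.
import random


class Context:
    def __init__(self, journal: list) -> None:
        self.journal = journal
        self.position = 0

    def step(self, fn, *args):
        index = self.position
        self.position += 1
        if index < len(self.journal):
            return self.journal[index]  # replay: recorded output
        out = fn(*args)
        self.journal.append(out)        # first run: record output
        return out


def pick_variant() -> str:
    # Non-deterministic, so it lives inside a step and gets journaled.
    return random.choice(["short", "long"])


def write_short(topic: str) -> str:
    return f"short note on {topic}"


def write_long(topic: str) -> str:
    return f"long essay on {topic}"


def workflow(ctx: Context, topic: str) -> str:
    variant = ctx.step(pick_variant)
    # This branch re-runs on every replay, but it only reads a journaled
    # value, so replay always reaches the same ctx.step call.
    if variant == "short":
        return ctx.step(write_short, topic)
    return ctx.step(write_long, topic)


journal: list = []
original = workflow(Context(journal), "replay")
replays = [workflow(Context(journal), "replay") for _ in range(20)]
```

Had `random.choice` been called directly in the workflow body, a replay could take the other branch and reach a call site that the journal never recorded.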

The journal is append-only. Steps record success and failure outcomes; a failed step that retried until it succeeded leaves a trail of attempts plus the final success. The journal is also the source of every other observability artifact AGNT5 produces — traces, eval datasets, debug snapshots all read from it.
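As a rough picture of what a trail of attempts might look like, the sketch below appends one entry per attempt and never edits earlier ones. The entry fields, the `run_step_with_retries` helper, and the fixed three-attempt policy are assumptions for illustration; the source does not specify AGNT5's journal schema or retry policy.

```python
# Sketch of an append-only journal that keeps failed attempts.
# Field names and the retry helper are hypothetical.
import time

journal: list = []  # entries are only ever appended


def run_step_with_retries(name, fn, arg, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        started = time.monotonic()
        try:
            out = fn(arg)
        except Exception as exc:
            # Record the failure, then try again.
            journal.append({
                "step": name, "attempt": attempt, "in": arg,
                "error": repr(exc),
                "seconds": time.monotonic() - started,
            })
            continue
        # Record the success and stop retrying.
        journal.append({
            "step": name, "attempt": attempt, "in": arg,
            "out": out,
            "seconds": time.monotonic() - started,
        })
        return out
    raise RuntimeError(f"{name} failed after {max_attempts} attempts")


remaining = {"failures": 2}  # the fake fetch fails twice, then succeeds


def flaky_fetch(url: str) -> str:
    if remaining["failures"] > 0:
        remaining["failures"] -= 1
        raise ConnectionError("temporary network error")
    return "<html>"


result = run_step_with_retries("fetch_article", flaky_fetch, "https://example.com")
outcomes = ["error" if "error" in entry else "ok" for entry in journal]
```

Here `outcomes` ends up as two errors followed by one success, which is the trail a reader of the journal would see for this step.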

Why it works this way

Event sourcing is the cheapest mechanism that gives you exactly the recovery you need: the runtime can resume a crashed run without your code knowing it crashed, and without re-running side effects you have already paid for. The alternative, persisting full process memory at every step, is orders of magnitude more expensive and fragile across deploys.

It also makes observability free. Because the journal already records every step’s inputs and outputs, the trace UI, eval comparisons, and agnt5 inspect are all readers of the same data structure. No separate logging path is needed to power them. The trade-off is that your workflow body must stay deterministic so that replay reaches the same call sites; see Determinism for the constraint that buys this.
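The "many readers, one data structure" idea can be sketched as two small functions over the same list of entries. The entry fields (`name`, `in`, `out`, `error`, `ms`) and both reader functions are hypothetical; they only mirror the kinds of data the journal is described as holding, not AGNT5's actual schema or tooling.

```python
# Sketch: a trace view and an eval dataset derived from the same journal.
# The entry layout and both readers are assumptions for illustration.
journal = [
    {"step": 1, "name": "fetch_article", "in": "https://example.com/post",
     "out": "<html>", "error": None, "ms": 120},
    {"step": 2, "name": "summarize", "in": "<html>",
     "out": "<text>", "error": None, "ms": 900},
]


def trace_lines(entries):
    """Render one human-readable line per step, as a trace view might."""
    lines = []
    for entry in entries:
        status = "ok" if entry["error"] is None else "error"
        lines.append(f"{entry['name']}: {entry['ms']} ms, {status}")
    return lines


def eval_rows(entries, step_name):
    """Collect recorded input/output pairs for one step as eval examples."""
    return [
        {"input": entry["in"], "expected": entry["out"]}
        for entry in entries
        if entry["name"] == step_name and entry["error"] is None
    ]


trace = trace_lines(journal)
dataset = eval_rows(journal, "summarize")
```

Neither reader needs its own logging path; each is a different projection of the entries that replay already depends on.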

Edge cases and gotchas

Replay reads the journal first. If the journal entry for a step is missing — the run was started fresh, the step is new code, or the journal was trimmed — the runtime executes the step for real. There is no “fail closed” mode for missing entries: missing means run-fresh.
The journal grows unbounded per run. A workflow with thousands of steps produces a long journal. Long-running workflows that loop should periodically checkpoint a summarized state and resume from it rather than relying on millions of journal entries.
Non-deterministic workflow bodies break replay. If ctx.step(...) calls happen in a different order on replay than on the original run, the runtime cannot match journal entries to call sites. The error surfaces as a replay-drift exception. Move the offending non-determinism inside a step so its result is journaled.
Side effects can partially succeed. The journal records the runtime’s view of a step (what it sent in, what came back). It cannot tell you whether an HTTP POST committed at the destination before the network failed. Design side-effecting steps to be idempotent at the external boundary.
Replay is not a debug feature; it is the recovery mechanism. Every restart triggers replay. The cost of replay is paid on the happy path too: the runtime walks the journal even when nothing crashed.
The journal outlives the worker. Workers can come and go; the journal lives in the engine’s storage. A new worker picking up a paused run reads the same journal the original worker was writing to.
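For the partial-success case above, one common approach is to send an idempotency key with each side-effecting request so that the destination can deduplicate retries. The sketch below assumes an external service that honors such a key; `PaymentService`, `idempotency_key`, and the run identifier are hypothetical and stand in for whatever the real destination and runtime provide.

```python
# Sketch: making a side-effecting step idempotent at the external boundary.
# `PaymentService` is a stand-in for a remote API that deduplicates on a key.
import hashlib


class PaymentService:
    def __init__(self) -> None:
        self.charges = {}

    def charge(self, key: str, amount_cents: int) -> dict:
        # A repeated key returns the original result instead of charging again.
        if key not in self.charges:
            self.charges[key] = {
                "id": f"ch_{len(self.charges) + 1}",
                "amount_cents": amount_cents,
            }
        return self.charges[key]


def idempotency_key(run_id: str, step_name: str) -> str:
    # Built only from stable identifiers, so a retry of the same step in the
    # same run produces the same key.
    return hashlib.sha256(f"{run_id}:{step_name}".encode()).hexdigest()


service = PaymentService()


def charge_customer(run_id: str, amount_cents: int) -> dict:
    key = idempotency_key(run_id, "charge_customer")
    return service.charge(key, amount_cents)


# First attempt: the charge commits remotely, but imagine the response is
# lost before the runtime can journal it.
lost = charge_customer("run-42", 1999)
# With no journal entry, the step runs for real again; the key prevents a
# second charge.
retried = charge_customer("run-42", 1999)
```

The journal still cannot say whether the first POST committed, but with a stable key the question stops mattering: running the step again converges on the same external state.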

Related concepts

Durable execution: the guarantee event sourcing implements.
Determinism (why workflows have rules): the constraint that makes replay tractable.
What the runtime owns vs. your code: where the journal sits in the responsibility split.
