Durable execution
The runtime guarantee that a workflow's progress survives crashes — completed steps are not re-run.
Durable execution is a runtime guarantee that a workflow’s progress survives process crashes, network failures, and restarts: completed steps are replayed from the journal, not re-run.

    from agnt5 import FunctionContext, WorkflowContext, function, workflow

    # payments and email are application-level clients, assumed to be defined elsewhere.

    @function
    async def charge_card(ctx: FunctionContext, order_id: str) -> str:
        # Real side effect: a charge happens at most once per order_id.
        return await payments.charge(order_id)

    @function
    async def send_receipt(ctx: FunctionContext, order_id: str, txn: str) -> None:
        await email.send(order_id, txn)

    @workflow
    async def checkout(ctx: WorkflowContext, order_id: str) -> str:
        txn = await ctx.step(charge_card, order_id)
        # If the worker dies here, the next attempt skips charge_card
        # (its result is in the journal) and runs send_receipt.
        await ctx.step(send_receipt, order_id, txn)
        return txn

If the worker crashes between charge_card returning and send_receipt starting, the next attempt does not charge the card again. The runtime reads the recorded txn from the journal, advances past charge_card, and runs send_receipt against that value.
The mental model
Think of the workflow body as a recipe and the journal as a log of what has already been cooked. Replay walks the recipe step by step. At each step, the runtime asks one question: do I have a recorded result for this call in this run? If yes, replay returns the recorded value and moves on. If no, the runtime executes the step for real, writes the input and output to the journal, then returns the value.
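The lookup described above can be sketched with a toy, synchronous stand-in. This is not the AGNT5 runtime: ToyRun, its positional journal, and the crash_after_charge flag are hypothetical names invented for illustration, and a real journal would live in durable storage rather than in a Python list.

```python
from typing import Any, Callable, List, Optional, Tuple

# One journal entry: (step name, step input, step output).
Entry = Tuple[str, Tuple[Any, ...], Any]


class ToyRun:
    """Hypothetical stand-in for one attempt at a run: a journal plus a cursor."""

    def __init__(self, journal: Optional[List[Entry]] = None) -> None:
        # The journal outlives any single attempt; here it is a shared list.
        self.journal: List[Entry] = journal if journal is not None else []
        self.cursor = 0  # position of the next step call in this attempt

    def step(self, fn: Callable[..., Any], *args: Any) -> Any:
        index = self.cursor
        self.cursor += 1
        if index < len(self.journal):
            # A recorded result exists: return it without calling fn.
            return self.journal[index][2]
        # No record: run the step for real, then journal input and output.
        result = fn(*args)
        self.journal.append((fn.__name__, args, result))
        return result


calls: List[str] = []  # tracks which steps really executed


def charge_card(order_id: str) -> str:
    calls.append("charge_card")
    return f"txn-{order_id}"


def send_receipt(order_id: str, txn: str) -> None:
    calls.append("send_receipt")


def checkout(run: ToyRun, order_id: str, crash_after_charge: bool = False) -> str:
    txn = run.step(charge_card, order_id)
    if crash_after_charge:
        raise RuntimeError("worker died")  # simulated crash between steps
    run.step(send_receipt, order_id, txn)
    return txn


journal: List[Entry] = []
try:
    checkout(ToyRun(journal), "o1", crash_after_charge=True)
except RuntimeError:
    pass  # the first attempt crashed after charge_card was journaled

result = checkout(ToyRun(journal), "o1")  # the second attempt replays
print(calls)   # ['charge_card', 'send_receipt']
print(result)  # txn-o1
```

In this sketch charge_card executes once across both attempts: the second attempt finds its entry in the journal and only send_receipt runs for real.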
This means your code stays the shape of ordinary async Python. There is no try/except for transient infrastructure errors at the workflow level, no resumption flags, no manual checkpoint tables. The recovery contract lives in the runtime; you write business logic.
The unit of durability is the step, not the line. Anything that happens between two ctx.step(...) calls is workflow body code (branches, variable assignments, calls to deterministic helpers) and is re-executed on replay. Anything inside a step is a side effect that runs at most once per run, subject to the partial-failure caveat under Edge cases and gotchas.
Why it works this way
The alternative is to make every line a checkpoint. That has been tried; it produces unreadable code and unbounded journals. The opposite extreme is to checkpoint only at workflow boundaries, which makes any non-trivial multi-step process unrecoverable without manual cleanup. The step boundary is the compromise: explicit enough that you can see where the durability bargain is being made, coarse enough that the journal stays bounded, fine enough that recovery is automatic.
The cost is a constraint on workflow code: the body must be deterministic. Replay must arrive at the same ctx.step(...) calls in the same order, every time. AGNT5 accepts this constraint in exchange for automatic recovery: if the body could take a different path on replay, the runtime would have no way to tell which journaled result belongs to which call site.
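To see why call order matters, here is a minimal sketch of a positional journal that records the step name at each position. Everything in it is an assumption made for illustration: ToyRun, ReplayMismatch, and the world dictionary are hypothetical, and how AGNT5 itself matches or rejects mismatched entries is not described here.

```python
from typing import Any, Callable, Dict, List, Tuple


class ReplayMismatch(Exception):
    """Raised by this toy when replay reaches a different step than recorded."""


class ToyRun:
    """Hypothetical run whose journal is keyed by call position."""

    def __init__(self, journal: List[Tuple[str, Any]]) -> None:
        self.journal = journal
        self.cursor = 0

    def step(self, fn: Callable[..., Any], *args: Any) -> Any:
        index = self.cursor
        self.cursor += 1
        if index < len(self.journal):
            name, result = self.journal[index]
            if name != fn.__name__:
                raise ReplayMismatch(f"position {index}: {name} != {fn.__name__}")
            return result
        result = fn(*args)
        self.journal.append((fn.__name__, result))
        return result


# External state that changes between attempts (stands in for a clock,
# a random source, a cache, or a network read).
world: Dict[str, bool] = {"in_stock": True}


def check_stock() -> bool:
    return world["in_stock"]


def reserve_stock(order_id: str) -> str:
    return f"reserved:{order_id}"


def place_backorder(order_id: str) -> str:
    return f"backordered:{order_id}"


def fragile(run: ToyRun, order_id: str) -> str:
    if check_stock():  # unrecorded read in the workflow body
        return run.step(reserve_stock, order_id)
    return run.step(place_backorder, order_id)


def robust(run: ToyRun, order_id: str) -> str:
    if run.step(check_stock):  # the read is a step, so its result is recorded
        return run.step(reserve_stock, order_id)
    return run.step(place_backorder, order_id)


fragile_journal: List[Tuple[str, Any]] = []
robust_journal: List[Tuple[str, Any]] = []
fragile(ToyRun(fragile_journal), "o1")
robust_first = robust(ToyRun(robust_journal), "o1")

world["in_stock"] = False  # the outside world changes before replay

try:
    fragile(ToyRun(fragile_journal), "o1")
    mismatch = False
except ReplayMismatch:
    mismatch = True

robust_replay = robust(ToyRun(robust_journal), "o1")
print(mismatch)                       # True
print(robust_replay == robust_first)  # True
```

The fragile body takes a different branch on replay and arrives at a step the journal never recorded at that position. The robust body reads the same value from the journal and follows the original path.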
Edge cases and gotchas
- Durability is not idempotency at the side effect. If charge_card partially succeeded (the network call left your process, the bank charged the card, but the response never came back), the runtime cannot tell. On retry it will run charge_card again. Design side-effecting steps to be idempotent at the external boundary (idempotency keys, conditional inserts, INSERT ... ON CONFLICT).
- Long-running steps hold a lease. A step that takes hours blocks the run from progressing past it. Set a step_timeout and surface partial progress through smaller steps rather than waiting indefinitely inside one call.
- The workflow body must stay deterministic. Wall-clock reads, random numbers, network calls, and in-process caches in the workflow body are replay hazards. Move them inside a step, where their result is recorded. See Determinism for the full list.
- Replay reads the journal first. If the journal entry for a step is missing — the run was started fresh, the step is new code, or the journal was trimmed — the runtime executes the step for real. There is no “fail closed” mode for missing entries: missing means run-fresh.
- Durability is per-run, not per-input. Re-invoking the same workflow with the same input creates a new run with a new ID and a new journal. The runtime does not deduplicate on input. If you need at-most-once semantics across submissions, dedupe at the caller.
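As an illustration of the first gotcha, the sketch below makes a charge idempotent at the external boundary with a conditional insert. It uses an in-memory SQLite table as a stand-in for the external system; the charges table, the charge function, and the choice of order_id as the idempotency key are assumptions for this example, not part of AGNT5 or of any payment provider's API. The ON CONFLICT clause requires SQLite 3.24 or newer.

```python
import sqlite3
import uuid

# In-memory database standing in for the external system of record.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE charges (idempotency_key TEXT PRIMARY KEY, txn TEXT NOT NULL)"
)


def charge(order_id: str) -> str:
    """Record a charge for order_id at most once and return the stored txn."""
    candidate = f"txn-{uuid.uuid4().hex[:8]}"
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO charges (idempotency_key, txn) VALUES (?, ?) "
            "ON CONFLICT(idempotency_key) DO NOTHING",
            (order_id, candidate),
        )
    # Whether or not this call inserted the row, return the txn that won.
    row = db.execute(
        "SELECT txn FROM charges WHERE idempotency_key = ?", (order_id,)
    ).fetchone()
    return row[0]


first = charge("order-42")
retry = charge("order-42")  # e.g. the step re-ran after a lost response
count = db.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
print(first == retry)  # True
print(count)           # 1
```

Because the key is derived from the order rather than from the attempt, a step that re-runs after a lost response observes the original transaction instead of creating a second one.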
Related concepts
- Event sourcing and replay — the journal mechanics that make durable execution work.
- Determinism — why workflows have rules — the constraint replay imposes on workflow code.
- What the runtime owns vs. your code — the responsibility split this concept creates.