Get started Workflows

Workflows

Durable orchestrators — async functions whose progress survives crashes through journaled step boundaries.

A workflow is a @workflow-decorated async function whose body orchestrates steps and whose progress survives crashes. Each ctx.step(...) call is the unit of replay.

from agnt5 import WorkflowContext, function, workflow


@function
async def validate_order(ctx, order_id: str, items: list) -> dict:
    return {"valid": len(items) > 0, "item_count": len(items)}


@function
async def charge_card(ctx, order_id: str) -> str:
    return await payments.charge(order_id)


@function
async def create_shipment(ctx, order_id: str, txn: str) -> str:
    return await shipping.create(order_id, txn)


@workflow
async def order_fulfillment(ctx: WorkflowContext, order_id: str, items: list) -> dict:
    validation = await ctx.step(validate_order, order_id, items)
    if not validation["valid"]:
        return {"order_id": order_id, "status": "rejected"}

    txn = await ctx.step(charge_card, order_id)
    tracking = await ctx.step(create_shipment, order_id, txn)
    return {"order_id": order_id, "status": "fulfilled", "txn": txn, "tracking": tracking}

If the worker crashes between charge_card and create_shipment, the next attempt skips validate_order and charge_card (their results are journaled) and runs create_shipment against the recorded txn.

The mental model

A workflow body looks like ordinary async Python: variable assignments, branches, loops, exception handlers. The runtime treats the body as a deterministic recipe and the journal as the cooked-pot history. On every replay, the runtime walks the recipe and asks one question at each ctx.step(...): do I have a recorded result for this call in this run? If yes, replay returns the journaled value and continues. If no, the runtime executes the step, writes the input and output to the journal, then returns.

The unit of durability is the step, not the line. Code between two ctx.step(...) calls — branches, variable assignments, calls to deterministic helpers — re-executes on every replay. Code inside a step is a side effect that runs at most once per run, modulo the durable-execution gotcha about partial side effects.

WorkflowContext is richer than FunctionContext. It carries the workflow’s run identifier, session and user identifiers for memory scoping, an entity for state changes, and the step counter the runtime uses for journaling. The context is your handle on the durability machinery; the body is the recipe.

Why it works this way

Step boundaries are explicit so you can see where the durability bargain is being made. Implicit checkpointing — at every await, every line, every function call — produces unreadable code and unbounded journals. Boundary-only checkpointing makes the journal proportional to your business logic, not your control flow.

The cost is a constraint on workflow code: the body must be deterministic. Replay must arrive at the same ctx.step(...) call sites in the same order, every time. AGNT5 trades this constraint for an automatic recovery model. Without it, the system would have no way to tell which journaled result belongs to which call site.

Edge cases and gotchas

  • The body must be deterministic. Wall-clock reads, random numbers, network calls, and in-process caches in the workflow body are replay hazards. Move them inside a step. See Determinism.
  • Three forms of ctx.step. ctx.step(handler, *args) calls a @function (the recommended form). ctx.step("name", awaitable) checkpoints arbitrary async work. ctx.step("name", lambda: ...) checkpoints a synchronous callable. Pick one form per workflow and stay with it.
  • Long-running steps hold a lease. A step that takes hours blocks the run from progressing past it. Surface progress through smaller steps instead of waiting indefinitely inside one call.
  • Runs are not deduplicated by input. Re-invoking the same workflow with the same input creates a new run with a new ID and a new journal. Dedupe at the caller if you need at-most-once semantics across submissions.
  • ctx.task(...) still works. Older code uses ctx.task for the same shape. New code uses ctx.step everywhere; both currently coexist.
  • In-flight runs stay on their version. When a new deployment ships, runs that started on the previous version keep running on it. New runs use the new version. See Versioning and deployment model.