What the runtime owns vs. your code

The responsibility boundary — what AGNT5 takes care of for you, and what stays in your code.

The runtime owns scheduling, journaling, retries, replay, and lease management. Your code owns business logic, side-effect implementation, step boundaries, and idempotency at external systems.

| Concern | AGNT5 runtime owns | Your code owns |
| --- | --- | --- |
| Scheduling | Picking a worker, dispatching a run | Registering the workflow / function |
| Journal | Recording every step’s input and output | Choosing where step boundaries go (ctx.step) |
| Retries on failure | Retrying steps per the configured policy | Marking which exceptions are retryable |
| Replay on restart | Reading journal entries and skipping completed steps | Keeping the workflow body deterministic |
| Lease management | Tracking which worker holds which run | Returning from steps inside the timeout |
| Tracing | Capturing inputs/outputs/errors of every step | Adding domain context (tenant id, user id, span attrs) |
| Idempotency at the runtime level | Replay returns recorded results, not duplicate calls | Idempotency at the external system (HTTP, DB) |
| Worker lifecycle | Health checks, reconnection, graceful drain | The handler implementation inside the worker |

Your job ends at the step boundary; AGNT5’s begins.

The mental model

Picture two columns. The left column is the runtime — a single process (Gateway + Engine + Coordinator) that knows how to schedule work, write to a journal, and dispatch to a worker. The right column is your code — @workflow, @function, @tool, Agent instances. The two columns touch in exactly two places: the runtime calls into your code at a step boundary, and your code calls into the runtime when it invokes ctx.step(...).
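The two touch points can be pictured with a toy, in-memory model. Nothing here is the AGNT5 SDK: `ToyContext`, `run`, and `order_workflow` are hypothetical stand-ins, and the journal is a plain dict where a real runtime would use durable storage. The sketch only illustrates the direction of each call and why a recorded step is not executed twice.

```python
from typing import Any, Callable, Dict, List


class ToyContext:
    """Hypothetical stand-in for the ctx object; not the AGNT5 API."""

    def __init__(self, journal: Dict[str, Any]) -> None:
        self.journal = journal          # step name -> recorded result
        self.executed: List[str] = []   # steps actually run in this attempt

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # Touch point 2: user code calls into the runtime.
        if name in self.journal:
            return self.journal[name]   # replay: return the recorded result
        result = fn()                   # run the side effect
        self.journal[name] = result     # record the outcome
        self.executed.append(name)
        return result


def order_workflow(ctx: ToyContext, order_id: str) -> str:
    # User-owned: which steps exist and in what order they fire.
    total = ctx.step("price", lambda: 42)
    return ctx.step("charge", lambda: f"charged {total} for {order_id}")


def run(workflow, journal, *args):
    # Touch point 1: the runtime calls into user code.
    ctx = ToyContext(journal)
    return workflow(ctx, *args), ctx.executed


journal: Dict[str, Any] = {}
first_result, first_executed = run(order_workflow, journal, "o-1")
# Simulate a restart: same journal, fresh context.
second_result, second_executed = run(order_workflow, journal, "o-1")
print(first_executed)   # both steps ran on the first attempt
print(second_executed)  # nothing ran again; results came from the journal
```

On the second call the workflow body runs again from the top, but every `step` returns its journaled value, which is why the body itself has to stay deterministic.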

Everything that depends on the shape of the run — what the workflow looks like, when steps fire, what side effects they have — is your code. Everything that depends on the run surviving over time — recovery, retry, observability, scaling — is the runtime. The split exists so you can write business logic without weaving infrastructure concerns into every function.

The contract goes both ways. Because the runtime owns retries, your code does not need try/except around every transient network error in a step. Because your code owns step boundaries, the runtime cannot tell on its own which side effects are safe to retry — that is what @function and ctx.step exist to communicate.
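A minimal sketch of that contract, under stated assumptions: `run_step_with_retries` is a toy stand-in for the runtime's retry loop, and `TransientError` / `PermanentError` are hypothetical exception classes that user code would define to mark what is retryable. The actual AGNT5 retry policy and its configuration are not shown here.

```python
from typing import Callable, Tuple, Type


class TransientError(Exception):
    """Hypothetical user-defined marker for errors worth retrying."""


class PermanentError(Exception):
    """Hypothetical user-defined marker for errors that should fail the run."""


def run_step_with_retries(
    fn: Callable[[], str],
    retryable: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Toy runtime side: retry only the exceptions user code marked retryable."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return fn(), attempt
        except retryable:
            if attempt >= max_attempts:
                raise  # policy exhausted: surface the last error
            # A real runtime would back off here; omitted to keep the sketch short.


calls = {"n": 0}


def flaky_fetch() -> str:
    # User side: no try/except, just raise and let the caller decide.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("connection reset")
    return "payload"


result, attempts = run_step_with_retries(flaky_fetch, retryable=(TransientError,))
print(result, attempts)
```

The step body contains no error handling at all. Anything not listed in `retryable`, such as `PermanentError`, propagates on the first attempt.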

Why it works this way

The split mirrors the split between Kubernetes and your container, or between Postgres and your SQL. Infrastructure that is reused across applications goes in the runtime; logic that varies per application stays in user code. This split scales where the alternative does not: bundling retry logic into every workflow turns each one into a fragile reimplementation of the same retry strategy, while pushing retries into the runtime keeps them consistent and audit-friendly.

It also keeps the SDK surface small. The Python SDK has roughly five user-facing primitives (@workflow, @function, @tool, Agent, ctx.step). Everything else — the journal, the lease manager, the reconnect logic — is handled below the surface. The reader of your workflow code does not need to understand any of it to follow the business logic.

Edge cases and gotchas

  • Retries are runtime-driven. Do not add try/except around transient errors at the workflow level; the runtime will retry the step. Your code raises; the runtime decides whether to retry.
  • Database connection management lives in your step code. The runtime does not pool connections for you. A step that opens a connection without closing it leaks resources.
  • Observability is provided; enrichment is yours. The runtime captures inputs, outputs, errors, and timings. Adding tenant ids, user ids, or domain-specific tags happens through your code calling into the trace context.
  • The runtime does not enforce idempotency at the external system. Journaling keeps a completed step from running again on replay; it cannot stop an HTTP POST that committed before the step result was recorded from committing a second time on the next attempt. Use idempotency keys, conditional updates, or safe-by-design operations.
  • Lease timeouts are a runtime concern; staying inside them is yours. A long-running step that overruns its lease loses ownership of the run. Either tune step_timeout upward or break the work into smaller steps that surface progress.
  • Crashes on either side are recoverable. The runtime is durable across its own restarts (journal in storage, leases reissued). When worker code crashes, replay picks up where the journal left off. Each half recovers independently of the other.
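The external-idempotency point above can be made concrete with a small simulation. This is a sketch under assumptions, not AGNT5 code: `FakePaymentService` is an in-memory stand-in for an external API that honors idempotency keys, and `charge_step`, `run_id`, and `step_name` are hypothetical names. The key is derived from values that stay the same across attempts, so a retry after a crash is recognized as a duplicate.

```python
from typing import Dict


class FakePaymentService:
    """In-memory stand-in for an external API that honors idempotency keys."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}   # idempotency key -> amount charged

    def charge(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self.charges:
            return f"duplicate ignored: {idempotency_key}"
        self.charges[idempotency_key] = amount
        return f"charged: {idempotency_key}"


def charge_step(service, run_id, step_name, amount, crash_after_commit=False):
    # Derive the key from values that are stable across attempts.
    key = f"{run_id}:{step_name}"
    response = service.charge(key, amount)
    if crash_after_commit:
        # The external commit happened, but the result never reaches the journal.
        raise RuntimeError("worker died before the step result was recorded")
    return response


service = FakePaymentService()
try:
    charge_step(service, "run-7", "charge", 500, crash_after_commit=True)
except RuntimeError:
    pass  # a runtime would schedule another attempt here

retry_response = charge_step(service, "run-7", "charge", 500)
total_charged = sum(service.charges.values())
print(retry_response, total_charged)
```

Without the key, the second attempt would have charged the customer again, because the journal never saw the first attempt complete.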