What the runtime owns vs. your code

The responsibility boundary — what AGNT5 takes care of for you, and what stays in your code.

The runtime owns scheduling, journaling, retries, replay, and lease management. Your code owns business logic, side-effect implementation, step boundaries, and idempotency at external systems.

Concern	AGNT5 runtime owns	Your code owns
Scheduling	Picking a worker, dispatching a run	Registering the workflow / function
Journal	Recording every step’s input and output	Choosing where step boundaries go (`ctx.step`)
Retries on failure	Retrying steps per the configured policy	Marking which exceptions are retryable
Replay on restart	Reading journal entries and skipping completed steps	Keeping the workflow body deterministic
Lease management	Tracking which worker holds which run	Returning from steps inside the timeout
Tracing	Capturing inputs/outputs/errors of every step	Adding domain context (tenant id, user id, span attrs)
Idempotency at the runtime level	Replay returns recorded results, not duplicate calls	Idempotency at the external system (HTTP, DB)
Worker lifecycle	Health checks, reconnection, graceful drain	The handler implementation inside the worker

Your job ends at the step boundary; AGNT5’s begins there.

The mental model

Picture two columns. The left column is the runtime — a single process (Gateway + Engine + Coordinator) that knows how to schedule work, write to a journal, and dispatch to a worker. The right column is your code — @workflow, @function, @tool, Agent instances. The two columns touch in exactly two places: the runtime calls into your code at a step boundary, and your code calls into the runtime when it invokes ctx.step(...).
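The two touch points can be sketched with a toy model. This is not the AGNT5 SDK: `ToyContext`, its `step` method, and the in-memory `journal` dict are hypothetical stand-ins, assumed here only to show how recording step outputs lets a restarted run skip completed steps.

```python
from typing import Any, Callable


class ToyContext:
    """Hypothetical stand-in for a workflow context; not the AGNT5 API."""

    def __init__(self, journal: dict[str, Any]) -> None:
        self.journal = journal  # outlives a single attempt, like durable storage

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # Touch point: user code calls into the runtime.
        if name in self.journal:
            return self.journal[name]  # replay: reuse the recorded result
        result = fn()  # touch point: the runtime calls into user code
        self.journal[name] = result  # record the output before moving on
        return result


calls: list[str] = []  # tracks which step bodies actually executed


def reserve_stock() -> str:
    calls.append("reserve")
    return "reservation-1"


def charge_card() -> str:
    calls.append("charge")
    return "charge-1"


def order_workflow(ctx: ToyContext) -> tuple[str, str]:
    # Your side: business logic and the choice of step boundaries.
    reservation = ctx.step("reserve", reserve_stock)
    charge = ctx.step("charge", charge_card)
    return reservation, charge


journal: dict[str, Any] = {}
first = order_workflow(ToyContext(journal))   # first attempt runs both steps
second = order_workflow(ToyContext(journal))  # simulated restart: served from the journal
```

In this sketch the second call returns the same values without running either step body again, which is the property the real journal is described as providing.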

Everything that depends on the shape of the run — what the workflow looks like, when steps fire, what side effects they have — is your code. Everything that depends on the run surviving over time — recovery, retry, observability, scaling — is the runtime. The split exists so you can write business logic without weaving infrastructure concerns into every function.

The contract goes both ways. Because the runtime owns retries, your code does not need try/except around every transient network error in a step. Because your code owns step boundaries, the runtime cannot tell on its own which side effects are safe to retry — that is what @function and ctx.step exist to communicate.
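As a rough illustration of that contract, the sketch below keeps the retry loop on the runtime side and leaves the step body free of try/except. The names `TransientError`, `PermanentError`, and `run_step_with_retries` are hypothetical and chosen for this example; how AGNT5 actually marks retryable exceptions is not shown here.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def run_step_with_retries(fn: Callable[[], T], max_attempts: int = 3) -> T:
    # Runtime side (toy version): one retry policy applied to every step.
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # policy exhausted: surface the failure
    raise RuntimeError("unreachable")


attempts = {"fetch": 0}


def fetch_quote() -> int:
    # User side: no try/except; raise and let the error type carry the intent.
    attempts["fetch"] += 1
    if attempts["fetch"] < 3:
        raise TransientError("upstream timed out")
    return 42


def validate_order() -> None:
    # Not marked retryable, so the toy runtime lets it propagate immediately.
    raise PermanentError("order is malformed")


quote = run_step_with_retries(fetch_quote)
```

The step body only classifies failures by raising the right exception type; the decision to run it again lives in one place.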

Why it works this way

The split mirrors the split between Kubernetes and your container, or between Postgres and your SQL. Infrastructure that is reused across applications goes in the runtime; logic that varies per application stays in user code. This is the only split that scales — bundling retry logic into every workflow makes every workflow a fragile reimplementation of the same retry strategy; pushing retry into the runtime makes it consistent and audit-friendly.

It also keeps the SDK surface small. The Python SDK has roughly five user-facing primitives (@workflow, @function, @tool, Agent, ctx.step). Everything else — the journal, the lease manager, the reconnect logic — is handled below the surface. The reader of your workflow code does not need to understand any of it to follow the business logic.

Edge cases and gotchas

Retries are runtime-driven. Do not wrap transient errors in try/except at the workflow level; the runtime will retry the step. Your code raises; the runtime decides whether to retry.
Database connection management lives in your step code. The runtime does not pool connections for you. A step that opens a connection without closing it leaks resources.
Observability is provided; enrichment is yours. The runtime captures inputs, outputs, errors, and timings. Adding tenant ids, user ids, or domain-specific tags happens through your code calling into the trace context.
The runtime does not enforce idempotency at the external system. Journaling keeps a completed step from running again on replay; it cannot stop an HTTP POST that committed remotely before the step was journaled from committing a second time on the next attempt. Use idempotency keys, conditional updates, or safe-by-design operations.
Lease timeouts are a runtime concern; staying inside them is yours. A long-running step that overruns its lease loses ownership of the run. Either tune step_timeout upward or break the work into smaller steps that surface progress.
Worker crashes and runtime crashes are both recoverable. The runtime is durable across its own restarts (journal in storage, leases reissued), and after a worker crash, replay picks up where the journal left off. Both halves survive independently.
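The idempotency-key advice above can be sketched as follows. `ToyPaymentAPI` is a hypothetical in-memory stand-in for an external service that deduplicates on a key, and `charge_step` and the `run_id` string are likewise assumptions for illustration. The point is that the key is derived from stable run data, so a retried step sends the same key.

```python
import uuid


class ToyPaymentAPI:
    """Hypothetical external system that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.charges: dict[str, int] = {}  # idempotency key -> amount charged

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the original charge instead of creating a new one.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount_cents
        return f"charge:{idempotency_key}"


def charge_step(api: ToyPaymentAPI, run_id: str, amount_cents: int) -> str:
    # Derive the key from stable run data, not from a fresh random value,
    # so every attempt of this step sends the same key.
    key = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{run_id}/charge"))
    return api.charge(key, amount_cents)


api = ToyPaymentAPI()
# Simulate a crash after the request committed but before the result was
# journaled: the step body runs a second time with the same inputs.
first = charge_step(api, "run-123", 1999)
second = charge_step(api, "run-123", 1999)
```

Both attempts resolve to the same charge and the toy service records it once. A key generated with a fresh random value inside the step would differ between attempts and defeat the deduplication.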

Related concepts

Durable execution: the guarantee this split enables.
Architecture overview: what the runtime side actually looks like.
Event sourcing and replay: the mechanism the runtime uses to fulfill its half.
Determinism (why workflows have rules): the constraint your half must respect.


