Every journal entry for an LLM step carries tokens_in, tokens_out, and cost_usd. Here is how that rolls up into a bill you can reason about.
Anyone who has operated an AI workload has opened a monthly bill and wondered, with some urgency, which workflow spent the twelve thousand dollars. The vendor dashboard groups by API key, not by business logic. The internal dashboard — if there is one — groups by service, not by run. Getting from the bill to the decision that caused it usually means a half-day of SQL.
AGNT5 attacks this by pushing token and cost metadata into the journal alongside the step data. Every step entry carries tokens_in, tokens_out, cost_usd, and the model string. Those columns exist in every backend — SQLite, Postgres, Redpanda, and the Parquet files that sealed runs get flushed to.
Here is the schema for the SQLite backend, which makes the shape clearest:
```sql
CREATE TABLE journal_entries (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id        TEXT NOT NULL,
    sequence_num  INTEGER NOT NULL,
    entry_type    TEXT NOT NULL,
    step_name     TEXT,
    data          BLOB NOT NULL,
    timestamp_ms  INTEGER NOT NULL,
    tokens_in     INTEGER,
    tokens_out    INTEGER,
    cost_usd      REAL,
    model         TEXT,
    UNIQUE(run_id, sequence_num)
);
```

Four columns of cost metadata, first-class in the schema. Not hidden in a metrics sink, not shipped through OpenTelemetry as a counter, not left to the application to report. The journal entry itself is the source of truth.
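Because those are real columns, per-step cost is one SELECT away, with no payload parsing. A minimal example against that schema; the run_id literal is a placeholder:

```sql
-- Per-step cost for one run; the run_id value is a placeholder.
SELECT step_name, model, tokens_in, tokens_out, cost_usd
FROM journal_entries
WHERE run_id = 'run_abc123'
  AND cost_usd IS NOT NULL
ORDER BY sequence_num;
```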
How the numbers get in there
The SDK’s LLM integration — wrapping OpenAI, Anthropic, or any provider — captures the provider’s usage response on every call. When the step completes, the SDK attaches the provider’s prompt_tokens and completion_tokens (stored as tokens_in and tokens_out) along with a computed cost_usd to the journal entry it is about to append. The coordinator writes those fields into the dedicated columns, not into the opaque data blob, so they are queryable without parsing the payload.
```python
@function
async def summarize(ctx: FunctionContext, document: str) -> str:
    # ctx.llm.chat(...) wraps the provider and records usage automatically.
    response = await ctx.step(
        "summarize_call",
        lambda: ctx.llm.chat(
            model="claude-sonnet-4",
            messages=[
                {"role": "system", "content": "Summarize concisely."},
                {"role": "user", "content": document},
            ],
        ),
    )
    return response.content
```

Under the hood, the ctx.llm.chat wrapper inspects the provider’s response, computes cost from a pricing table keyed on model, and passes the numbers to the step completion. The user does not write any of this. They also cannot forget to write any of it — the attribution is a consequence of using the SDK, not an opt-in.
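For intuition, here is roughly the shape such a wrapper could take. Everything below is illustrative: the PRICING table, its rates, the LLMResponse type, and the provider_client parameter are assumptions for the example, not the SDK's actual internals.

```python
from dataclasses import dataclass

# Illustrative per-million-token rates; not the SDK's actual pricing table.
PRICING = {
    "claude-sonnet-4": {"in": 3.00, "out": 15.00},  # USD per 1M tokens
}

@dataclass
class LLMResponse:
    content: str
    tokens_in: int
    tokens_out: int
    cost_usd: float
    model: str

async def chat(provider_client, model: str, messages: list[dict]) -> LLMResponse:
    raw = await provider_client.chat(model=model, messages=messages)
    usage = raw.usage  # provider-reported token counts
    rates = PRICING[model]
    cost = (usage.prompt_tokens * rates["in"] +
            usage.completion_tokens * rates["out"]) / 1_000_000
    # The real wrapper would hand these numbers to the step completion,
    # which writes them into the journal entry's dedicated columns.
    return LLMResponse(
        content=raw.content,
        tokens_in=usage.prompt_tokens,
        tokens_out=usage.completion_tokens,
        cost_usd=cost,
        model=model,
    )
```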
Rolling it up
Because the numbers live in their own columns, aggregations are straightforward. The runs listing endpoint in the query crate uses DuckDB to roll up over archived Parquet:
```sql
SELECT
    project_id,
    model,
    COUNT(*)        AS runs,
    SUM(tokens_in)  AS prompt_tokens,
    SUM(tokens_out) AS completion_tokens,
    SUM(cost_usd)   AS spend_usd
FROM read_parquet('s3://agnt5-engine/engine/runs/tenant=proj_*/**/*.parquet',
                  hive_partitioning = true)
WHERE completed_at_ms >= $1
GROUP BY project_id, model
ORDER BY spend_usd DESC;
```

That query runs in a few hundred milliseconds against a month of data because DuckDB only reads four columns out of dozens. A dashboard backed by this query answers “where did the money go this month” in the time it takes to scroll to it.
The step-level granularity matters. Rolling up by step name tells you which part of a workflow is expensive. A “research_report” run might spend $0.04 on source gathering, $0.18 on per-document summarization, and $1.80 on the final long-context synthesis. That shape is not visible from a provider dashboard — it is only visible if the step boundary and the cost boundary are the same boundary.
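Against the journal schema shown earlier, that per-step shape falls out of a short GROUP BY. This is a sketch of the rollup, not a shipped endpoint:

```sql
-- Spend by step name across all runs, straight off the journal columns.
SELECT
    step_name,
    COUNT(*)        AS calls,
    SUM(tokens_in)  AS prompt_tokens,
    SUM(tokens_out) AS completion_tokens,
    SUM(cost_usd)   AS spend_usd
FROM journal_entries
WHERE cost_usd IS NOT NULL
GROUP BY step_name
ORDER BY spend_usd DESC;
```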
Why not OTel?
OpenTelemetry is already in the platform — the runtime emits spans for every step, and those spans get wired into customer tracing pipelines through the OTLP collector. Token counts could live on those spans as attributes, and in fact they do.
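Putting token counts on a span is one set_attribute call per number in the OpenTelemetry SDK. A minimal sketch, with illustrative attribute keys and instrumentation name rather than the platform's actual naming:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agnt5.worker")  # illustrative instrumentation name

def record_llm_attributes(tokens_in: int, tokens_out: int, cost_usd: float) -> None:
    # Attribute keys are illustrative; the platform's real keys are not shown here.
    with tracer.start_as_current_span("summarize_call") as span:
        span.set_attribute("llm.tokens_in", tokens_in)
        span.set_attribute("llm.tokens_out", tokens_out)
        span.set_attribute("llm.cost_usd", cost_usd)
```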
The reason we also put them in the journal is that spans are lossy by design. OTel pipelines sample, drop late-arriving data, and use storage backends tuned for traces, not aggregates. If a billing reconciliation comes back asking why yesterday’s number is $14.23 off, sampling is not an acceptable answer.
The journal is not sampled. Every step’s numbers are in there. The OTel side is for latency and flow debugging; the journal side is for accounting. Both exist because they serve different questions.
Attribution to what
The step-level numbers are the atom. From there, we roll up to:
- Run. Sum across a run’s steps. Visible on the run detail page.
- Project. Sum across a project’s runs. The default billing boundary.
- Tenant. Sum across a tenant’s projects. The account-level boundary.
- Model. Sum across calls to a given model string. Useful for comparing GPT-4o cost against Claude Sonnet cost on the same workload.
- Step name. Sum across all invocations of summarize_each, everywhere. Useful when a shared subroutine is secretly expensive.
The S3 Parquet layout partitions by tenant and day, which makes tenant-scoped queries fast. Cross-tenant queries — “which tenant is driving the spike this hour” — are slower because they fan out across prefixes, but we accept that because the alternative is leaking one tenant’s data into another’s query plan.
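Concretely, a tenant-scoped rollup names its tenant partition in the glob and lets DuckDB prune every other prefix; the day partitions sit under it and are covered by the wildcard. The tenant value below is a placeholder:

```sql
-- Spend by model for a single tenant; the tenant value is a placeholder.
SELECT model, SUM(cost_usd) AS spend_usd
FROM read_parquet('s3://agnt5-engine/engine/runs/tenant=proj_1234/**/*.parquet',
                  hive_partitioning = true)
WHERE completed_at_ms >= $1
GROUP BY model
ORDER BY spend_usd DESC;
```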
Why this matters
You cannot improve what you cannot measure, and you cannot measure LLM spend per unit of business logic unless the observability boundary matches the execution boundary. AGNT5 steps are both. Every step is a journal entry, every journal entry has cost columns, every cost column is queryable. The trace from “a user clicked refresh on their dashboard” to “we spent thirty cents” is one query away.
This is not the glamorous part of a durable execution platform. It is the part that lets someone justify the platform in a budget meeting.