Why the AGNT5 runtime pushes completed runs to S3 as Parquet, and how that keeps the hot tier small and the query tier fast.
A durable execution log has a dirty secret: most of what sits in it is historical. The live run at offset 12,483 cares about offsets 12,400 through 12,482. It does not need the half-million entries that completed last week, and neither does anyone else who is still executing. Meanwhile those old entries are in the same RocksDB segment as the hot stuff, competing for block cache and bloom-filter check time.
AGNT5 solves this with storage tiering. RocksDB holds active runs. S3 holds everything that has reached a terminal state. DuckDB reads the S3 side. The processor is the component that moves runs from one side to the other.
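The boundary between the two sides is a run's status. A minimal sketch of the terminal check, using an illustrative enum rather than the runtime's actual types:

pub enum RunStatus {
    Running,
    Completed,
    Failed,
    Cancelled,
}

impl RunStatus {
    // Only runs in a terminal state are eligible to leave RocksDB for S3.
    pub fn is_terminal(&self) -> bool {
        matches!(
            self,
            RunStatus::Completed | RunStatus::Failed | RunStatus::Cancelled
        )
    }
}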
The s3_writer inside the processor crate buffers completed runs and flushes them as Parquet:
pub struct S3WriterConfig {
    pub flush_interval: Duration,   // e.g. 10s
    pub flush_threshold: usize,     // e.g. 1000 runs
    pub prefix: String,             // "engine/runs"
}

// Path layout:
// s3://{bucket}/engine/runs/tenant={tenant_id}/day={YYYY-MM-DD}/runs-{batch_id}.parquet

When a run finishes — Completed, Failed, or Cancelled — the processor’s run state machine drops the serialized Run proto into the writer’s buffer. Every ten seconds, or once the buffer crosses a thousand runs, the writer builds an Arrow RecordBatch, encodes it as Parquet, and writes one object per tenant-day batch.
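Here is a sketch of that buffer-and-flush policy. The Arrow encoding and the S3 upload are collapsed into a stub, and the type and function names are illustrative rather than the crate's actual API:

pub struct SerializedRun {
    pub tenant_id: String,
    pub bytes: Vec<u8>, // the serialized Run proto
}

// Stub standing in for: group runs by (tenant_id, day), build one Arrow
// RecordBatch per group, encode it as Parquet, and upload one object per
// tenant-day under the configured prefix.
fn encode_and_upload(_prefix: &str, _runs: Vec<SerializedRun>) {}

pub struct S3Writer {
    config: S3WriterConfig,
    buffer: Vec<SerializedRun>,
    last_flush: std::time::Instant,
}

impl S3Writer {
    // Called by the run state machine when a run reaches a terminal state.
    pub fn push(&mut self, run: SerializedRun) {
        self.buffer.push(run);
        if self.buffer.len() >= self.config.flush_threshold {
            self.flush();
        }
    }

    // Called on a timer tick.
    pub fn tick(&mut self) {
        if !self.buffer.is_empty() && self.last_flush.elapsed() >= self.config.flush_interval {
            self.flush();
        }
    }

    fn flush(&mut self) {
        let runs = std::mem::take(&mut self.buffer);
        encode_and_upload(&self.config.prefix, runs);
        self.last_flush = std::time::Instant::now();
    }
}

The push path handles the size gate, the tick path handles the time gate, and both funnel into the same flush.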
Why Parquet, and why partitioned this way
Parquet is columnar. A query that asks “what was the p95 latency for failed runs in project X on Tuesday” touches the status, tenant_id, and duration_ms columns — not the fat input and output blobs. Parquet lets the reader skip those columns on disk. For archived runs, where the payloads dwarf the metadata, that is the difference between a sub-second query and one that drags.
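As a concrete sketch, that p95 question needs only two columns from disk. A hedged example of the SQL shape, written as the Rust string the query layer would issue; the bucket, day value, and 'Failed' status literal are placeholders, and the column names follow the ones mentioned above (approx_quantile is DuckDB's approximate percentile aggregate):

// Placeholder values throughout; only duration_ms and status are read
// from disk, and the input/output payload columns are never touched.
const P95_FAILED_RUNS_SQL: &str = "
    SELECT approx_quantile(duration_ms, 0.95) AS p95_duration_ms
    FROM read_parquet(
        's3://my-bucket/engine/runs/tenant=X/day=2025-06-10/*.parquet',
        hive_partitioning = true
    )
    WHERE status = 'Failed'
";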
The Hive-style partitioning (tenant=.../day=.../) is a DuckDB-friendly layout. A query scoped to one tenant-day reads exactly one set of files. No fan-out, no filter pushdown gymnastics. A query scoped to a month reads thirty. A query scoped to everything reads everything — which is the honest behavior, and usually the one you want behind a pagination UI anyway.
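The layout is also cheap to construct on the write side. An illustrative helper, assuming the day partition is derived from the run's completion timestamp:

// Illustrative only: builds the object key for one tenant-day batch, matching
// s3://{bucket}/engine/runs/tenant={tenant_id}/day={YYYY-MM-DD}/runs-{batch_id}.parquet.
fn batch_object_key(prefix: &str, tenant_id: &str, day: &str, batch_id: &str) -> String {
    format!("{prefix}/tenant={tenant_id}/day={day}/runs-{batch_id}.parquet")
}

A single-tenant, single-day query maps to exactly one such prefix, which is what keeps the common read narrow.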
How DuckDB reads it
The engine-query crate embeds DuckDB with its httpfs extension configured to talk to S3 or MinIO. A pool of connections holds warm caches of the Parquet metadata so plan time stays low.
pub struct S3QueryConfig {
    pub region: String,
    pub endpoint: String,      // "s3.amazonaws.com" or "minio:9000"
    pub access_key: String,
    pub secret_key: String,
    pub use_ssl: bool,
    pub bucket: String,
    pub runs_prefix: String,   // "engine/runs"
    pub pool_size: u32,
}

A request like “list the last 50 failed runs for project Y” turns into a DuckDB SQL query against a view defined over read_parquet('s3://.../engine/runs/tenant=Y/**/*.parquet', hive_partitioning=true). The listing is ordered by completed_at_ms DESC, limit 50, and DuckDB’s predicate pushdown skips day partitions outside the window.
This is the part of the system we do not have to invent. DuckDB was built for this shape of workload. Using it as an embedded query engine — not a separate service — means the runtime can answer listing and analytics queries without spinning up a data warehouse.
The hot-to-cold handoff
The interesting operational question is: when is a run safe to drop from RocksDB? The answer involves two gates. First, the S3 writer must have flushed and acknowledged the batch containing the run. Second, a retention pass must have observed that acknowledgment and marked the corresponding RocksDB entries deletable. Only then does the next RocksDB compaction drop them.
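In sketch form, with hypothetical names and the bookkeeping shown per run even though the real processor tracks acknowledgments per flushed batch:

// Per-run archival bookkeeping for the hot-to-cold handoff.
struct ArchivalState {
    flush_acked: bool,      // gate 1: the S3 writer acknowledged the batch containing this run
    retention_marked: bool, // gate 2: a retention pass observed the ack and marked the entries deletable
}

impl ArchivalState {
    // Only runs that have cleared both gates may be dropped by the next compaction.
    fn safe_to_drop(&self) -> bool {
        self.flush_acked && self.retention_marked
    }
}

// The retention pass promotes acknowledged runs to the deletable state.
fn retention_pass(runs: &mut [ArchivalState]) {
    for state in runs.iter_mut() {
        if state.flush_acked {
            state.retention_marked = true;
        }
    }
}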
We chose the two-gate approach because a lost write to S3 is far cheaper to diagnose than a lost run. If the flush fails, the run stays in RocksDB and the processor retries on the next tick. If the retention pass lags, RocksDB grows slightly. If everything works, the hot tier drains cleanly.
Tradeoffs worth naming
Parquet archival is not instant. A run that just completed might sit in RocksDB for up to ten seconds before it shows up in a DuckDB query. The gateway’s SSE stream and direct run lookups hit the hot path and are unaffected, but an analytics dashboard pulling from the query crate is always a few seconds behind the freshest completion.
Cross-tenant queries are expensive by design. Partitioning by tenant keeps single-tenant reads fast but means an admin query that spans the fleet fans out across many prefixes. We accept that cost because the common case is a user inside one project watching their own runs, and that case is the one we optimized for.
Why it matters
Separating the hot path from the query path is not about saving disk — it is about keeping the hot path predictable. RocksDB stays small, the WAL stays tight, and reopen after a crash is fast because there are fewer live segments. DuckDB over S3 takes the messy, long-tail “show me stuff from last month” queries and runs them against immutable files that were written once and will never need recompaction.
The runtime’s job is to execute. The query layer’s job is to explain what executed. Keeping those separate — with Parquet as the contract between them — is how both stay fast at the same time.