Why RocksDB for the Runtime WAL

AGNT5 Team
4 min read

The tradeoff space for a durable execution log, and why we landed on an embedded LSM over Kafka, Postgres, or a custom file format.

Every durable execution engine is, at heart, a disciplined log. The interesting design choice is not whether to keep a log but which log to keep. We write a lot to ours — every run start, every step input, every step output, every timer, every signal. Miss a write and a replay diverges. Double a write and a replay does something worse.

When we started the AGNT5 runtime, we had to pick a write-ahead log that was (1) durable under crash, (2) fast on the hot append path, and (3) cheap to embed inside a single binary alongside the gateway and the coordinator. RocksDB is an unfashionable answer in 2026, but it is the right one for the shape of our workload.

Here is what the segment crate looks like opening a per-run log:

use rocksdb::{
    ColumnFamilyDescriptor, DB, DBCompactionStyle, DBCompressionType, Options, WriteOptions,
};

// top-level options: create the DB and both CFs on first open
let mut db_opts = Options::default();
db_opts.create_if_missing(true);
db_opts.create_missing_column_families(true);

let mut records_opts = Options::default();
records_opts.set_compaction_style(DBCompactionStyle::Fifo);
records_opts.set_compression_type(DBCompressionType::Lz4);

// meta stays uncompressed so the reopen path stays tight
let mut meta_opts = Options::default();
meta_opts.set_compression_type(DBCompressionType::None);

let db = DB::open_cf_descriptors(
    &db_opts,
    path,
    vec![
        ColumnFamilyDescriptor::new("records", records_opts),
        ColumnFamilyDescriptor::new("meta", meta_opts),
    ],
)?;

let mut write_options = WriteOptions::default();
write_options.set_sync(true); // fsync on every append

Two column families, FIFO compaction on the records CF, LZ4 compression, and set_sync(true) on the write path. Every append becomes a WAL write plus an fsync. The meta CF tracks next_offset and a sealed flag so we can reopen a segment after a crash and keep going from the last durable offset.
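One way to keep the record and the counter consistent is to commit them in a single WriteBatch, so a crash can never leave next_offset pointing past a record that did not land. A minimal sketch under the CF names above; the append helper and the batching of the counter are our illustration here, not necessarily the segment crate's exact shape:

use rocksdb::{WriteBatch, WriteOptions, DB};

fn append(db: &DB, opts: &WriteOptions, offset: u64, payload: &[u8]) -> Result<(), rocksdb::Error> {
    let records = db.cf_handle("records").expect("records CF exists");
    let meta = db.cf_handle("meta").expect("meta CF exists");

    // record and counter commit in one batch: both durable or neither
    let mut batch = WriteBatch::default();
    batch.put_cf(records, offset.to_be_bytes(), payload);
    batch.put_cf(meta, b"next_offset", (offset + 1).to_be_bytes());

    // with sync=true this returns only after the WAL write is fsynced
    db.write_opt(batch, opts)
}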

What we weighed it against

A Kafka-style log (Redpanda, Kafka). Good fit if you need a pub/sub topology and horizontal fan-out. Wrong fit for us: we already dispatch directly to workers over gRPC. Asking a run to traverse a broker on the hot path adds a network round trip and a second durability domain we would then need to reason about. It also pulls us away from the single-binary story we wanted for self-hosted and local dev.

Postgres as the log. Comfortable, well-understood, ops teams love it. It is also slow for per-entry appends: with synchronous_commit=on you pay 5–20ms per write, and advisory locks start to matter as soon as two tasks in the same run commit concurrently. Our budget for journal append is ~3ms.

A custom append-only file format. Tempting, but the append path is the easy part of a log store. The hard parts are segment rotation, compaction, recovery after a torn write, and the block cache. RocksDB gives us those for free and exposes tuning knobs we actually use.

What RocksDB buys us

Appends land in the WAL before they hit the memtable, and with sync=true each append is durable against a process crash or a power cut. On reopen, RocksDB replays the WAL into the memtable and our meta CF tells us where the offset counter was. We did not write that recovery code. That matters more than it sounds.
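Concretely, the reopen path is a handful of reads. A sketch, with key names following the meta CF described above and error handling elided; treating the sealed flag as key presence is our assumption here:

let meta = db.cf_handle("meta").expect("meta CF exists");

// by the time open_cf_descriptors returns, RocksDB has already replayed
// its WAL into the memtable; we only read our counters back out
let next_offset = db
    .get_cf(meta, b"next_offset")?
    .map(|b| u64::from_be_bytes(b.try_into().expect("8-byte offset")))
    .unwrap_or(0);

// assumption for this sketch: sealed is stored as key presence
let sealed = db.get_cf(meta, b"sealed")?.is_some();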

Tuning for our workload looks like this: FIFO compaction on records means we never rewrite old entries, LZ4 on the records CF keeps the on-disk footprint reasonable for JSON-ish step payloads, and no compression on meta keeps the reopen path tight. The record key is the offset as big-endian bytes, which sorts correctly under the default bytewise comparator and gives us cheap ordered scans for ReadFrom(run_id, from_offset) during replay.
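Replay, then, is a seek to the requested offset followed by a forward walk. A sketch of that read path; the free function and its return shape are illustrative, and the run_id has already selected the per-run segment by the time we are here:

use rocksdb::{Direction, IteratorMode, DB};

fn read_from(db: &DB, from_offset: u64) -> Result<Vec<(u64, Vec<u8>)>, rocksdb::Error> {
    let records = db.cf_handle("records").expect("records CF exists");
    let start = from_offset.to_be_bytes();

    let mut entries = Vec::new();
    // seek to from_offset, then walk forward in key order
    for item in db.iterator_cf(records, IteratorMode::From(&start, Direction::Forward)) {
        let (key, value) = item?;
        let offset = u64::from_be_bytes(key[..8].try_into().expect("8-byte key"));
        entries.push((offset, value.to_vec()));
    }
    Ok(entries)
}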

Latency sits where we want it. A synced append lands around 1–3ms on a commodity NVMe drive. A ReadAll of a hundred-entry run comes in under 5ms because the records are contiguous and hot. The segment notifier wakes blocked readers with a tokio::sync::Notify, so streaming consumers do not poll.
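The shape of that notifier is small enough to show whole. A sketch with illustrative names; the one subtlety is that Notify carries no payload, so a woken reader re-checks next_offset rather than trusting the wakeup:

use tokio::sync::Notify;

struct Tail {
    notify: Notify,
}

impl Tail {
    // appender calls this after a synced write lands
    fn wake_readers(&self) {
        self.notify.notify_waiters();
    }

    // a streaming reader that has caught up parks here instead of polling
    async fn wait_for_append(&self) {
        self.notify.notified().await;
    }
}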

The tradeoffs we accepted

RocksDB is not distributed. Our per-run segments live on one node. We handle durability across nodes separately via the replication crate, which writes the same logical record to peers before acknowledging the append. That is a deliberate split — the segment crate stays boring and local, the replication crate does the hard consensus work, and we can swap one without touching the other.

RocksDB is also not a query engine. You cannot ask it “show me all failed runs for tenant X yesterday.” For that we seal completed runs, flush them to S3 as Parquet, and point DuckDB at the result. RocksDB is the hot tier. It does not try to be anything else.

Why this matters

A runtime is only as trustworthy as its log. We picked RocksDB because the append path is short enough to reason about, the recovery path is well-tested enough to rely on, and the embedded footprint lets the whole runtime — gateway, engine, coordinator — ship as a single binary. When a run resumes cleanly at offset 47 after a pod restart, the boring choice is the one that paid off.