Coding Agent

Autonomous test-driven development agent with E2B sandbox


The coding-agent template implements a test-driven agent loop. Given a task description and a test suite, it writes code, executes it inside an E2B sandbox, reads the failures, and iterates until the tests pass or the step budget is exhausted. It uses Groq for fast inference, E2B for isolated execution, and the AGNT5 runtime to make every iteration durable.

What you’ll build

  • An agent workflow that alternates between LLM reasoning and sandboxed code execution
  • An E2B-backed run_tests tool that executes the candidate code against a pytest suite
  • A bounded iteration loop with a step budget and final-answer termination
  • A durable journal of every attempt — every model call, every sandbox run, every test output

Requirements

Install

curl -LsSf https://agnt5.com/cli.sh | bash

Setup

Scaffold the project

agnt5 create coding_agent tdd-agent
cd tdd-agent

Set environment variables

export GROQ_API_KEY=gsk_...
export E2B_API_KEY=e2b_...

Install dependencies

uv sync
pip install -e .

Run the agent

agnt5 dev up
agnt5 invoke coding_agent --input '{
  "task": "Write a function `fizzbuzz(n)` that returns the classic list.",
  "tests": "def test_fizzbuzz():\n    assert fizzbuzz(5) == [1, 2, \"Fizz\", 4, \"Buzz\"]"
}'

How it works

Each iteration of the loop follows the same shape. The LLM, served by Groq for sub-second turn latency, receives the task, the current candidate code, and the last test output. It returns either an updated source file or a final-answer signal. If the model returns code, run_tests spins up an E2B sandbox, writes the candidate plus the test file, runs pytest, and returns the exit code and captured stdout/stderr. The workflow feeds that output into the next model turn.
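The shape of that loop can be sketched in plain Python. This is an illustration, not the template's actual code: the names `agent_loop`, `model_turn`, `run_tests`, and `TestResult` are assumptions standing in for what lives in agent.py, and the model and sandbox are injected as plain callables so the control flow is visible on its own.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    exit_code: int   # pytest convention: 0 means all tests passed
    output: str      # captured stdout/stderr from the test run

def agent_loop(task, tests, model_turn, run_tests, max_steps=8):
    """Alternate model turns and sandboxed test runs until the suite
    passes or the step budget is exhausted."""
    code, last_output = None, ""
    for step in range(max_steps):
        # The model sees the task, the current candidate, and the last failure.
        code = model_turn(task, code, last_output)
        result = run_tests(code, tests)
        if result.exit_code == 0:
            return {"status": "passed", "code": code, "steps": step + 1}
        last_output = result.output  # feed the failure into the next turn
    return {"status": "budget_exhausted", "code": code, "output": last_output}
```

Because `model_turn` and `run_tests` are parameters, the same loop runs unchanged against a stubbed model in unit tests or a real Groq client plus E2B sandbox in production.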

Every LLM call and every sandbox execution is a durable step. If the worker dies while pytest is running, replay reconstructs the loop exactly — prior iterations return their journaled outputs, and only the in-flight step re-executes. The step budget is enforced by the workflow, not the model, so there’s a hard ceiling on cost and latency independent of what the agent decides.

E2B sandboxes cost money per minute and have startup latency. For fast iteration, the template reuses a single sandbox across iterations when possible; see tools/e2b.py.
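The reuse pattern is a lazy singleton: create the sandbox on first use, hand back the cached instance afterwards. A minimal sketch, with the sandbox constructor injected as a factory so the pattern is independent of the E2B SDK (the `kill()` call mirrors E2B's sandbox API, but treat it as an assumption and check tools/e2b.py for the template's real implementation):

```python
_sandbox = None

def get_sandbox(factory):
    """Return the shared sandbox, creating it on first call only."""
    global _sandbox
    if _sandbox is None:
        _sandbox = factory()  # e.g. lambda: Sandbox() with the E2B SDK
    return _sandbox

def close_sandbox():
    """Tear down the shared sandbox so the next call starts fresh."""
    global _sandbox
    if _sandbox is not None:
        _sandbox.kill()  # assumed E2B teardown call; adapt to your SDK version
        _sandbox = None
```

This trades isolation between iterations for startup latency: state left behind by one test run is visible to the next, which is acceptable here because each run rewrites the candidate and test files.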

Key files

  • worker.py — Registers the agent workflow and its tools.
  • agent.py — The iteration loop: model turn, tool dispatch, budget check.
  • tools/e2b.py — The run_tests tool wrapping E2B’s Python SDK.
  • prompts/system.txt — Instructs the model to return code diffs and call run_tests after every change.
  • agnt5.toml — Project config, including the step budget for the workflow.
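For orientation, the step-budget setting in agnt5.toml might look like the fragment below. The section and key names here are hypothetical; check the scaffolded agnt5.toml for the real schema.

```toml
# Hypothetical shape — consult the generated agnt5.toml for actual key names.
[workflow.coding_agent]
max_steps = 8   # hard ceiling on model-turn/test-run iterations
```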

Customize

Swap Groq for another model. Groq is the default for its latency, but the loop works with any chat model that supports tool use. Change the client in agent.py and set the corresponding API key; nothing else in the loop depends on the provider.
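Groq's Python SDK follows the OpenAI chat-completions shape, so the swap can be confined to a single thin wrapper that takes the client as a parameter. A sketch under that assumption (the wrapper name `chat_turn` is illustrative, not the template's actual function):

```python
def chat_turn(client, model, messages):
    """One chat-completion turn against any OpenAI-compatible client
    (Groq, OpenAI, or a local server exposing the same API)."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```

With the client injected like this, swapping providers means constructing a different client object at startup; the loop body never changes.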

Replace E2B with a local runner. For air-gapped environments, swap tools/e2b.py for a Docker-based runner. Keep the function signature stable — (code: str, tests: str) -> TestResult — and the loop is unchanged.

Tighten the step budget. Reduce the max iterations in agent.py to cap cost for simple tasks. The workflow returns the passing candidate if one was found, or a failure that includes the last test output.

Next steps