---
title: Batch eval
description: Run a component against multiple inputs in parallel, score every output, and inspect aggregate results using the Python or TypeScript SDK.
last_verified: 2026-06-23
---

**Batch eval** lets you evaluate a component against a list of inputs in a single call. The SDK runs the inputs in parallel, scores every output, and returns one result object with per-item scores and aggregate statistics. Use batch eval for local regression testing, CI pipelines that don't need the full [Experiments](/docs/improve/experiments.md) workflow, or quick quality checks during development.

You need a running AGNT5 worker and at least one scorer. Point the client at your worker by setting `AGNT5_GATEWAY_URL` in your environment — `Client()` and `AsyncClient()` pick it up automatically and fall back to `https://gw.agnt5.com` when the variable is not set.

```bash
# local dev
export AGNT5_GATEWAY_URL=http://localhost:34181
```
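
The fallback described above behaves like an environment lookup with a default. The sketch below illustrates that behaviour so you can print which gateway a run would target; `resolve_gateway_url` and `DEFAULT_GATEWAY_URL` are hypothetical names for illustration, not SDK functions.

```python
import os

DEFAULT_GATEWAY_URL = "https://gw.agnt5.com"


def resolve_gateway_url(env=None):
    """Return the gateway URL a client would target (illustrative only)."""
    env = os.environ if env is None else env
    # An unset variable falls back to the hosted gateway.
    return env.get("AGNT5_GATEWAY_URL", DEFAULT_GATEWAY_URL)


print(resolve_gateway_url({"AGNT5_GATEWAY_URL": "http://localhost:34181"}))
print(resolve_gateway_url({}))
```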

## Run a single eval

`client.eval()` evaluates one input and returns a scored result:

```python
from agnt5 import Client
from agnt5.eval import Correctness

client = Client()

result = client.eval(
    component="support_agent",
    component_type="agent",
    input_data={"message": "Where is my order #1234?"},
    expected="Your order #1234 is in transit",
    scorers=[Correctness()],
)

print(result.passed)           # True / False
print(result.output)           # Component output
for score in result.scores:
    print(score.scorer, score.score, score.passed, score.explanation)
```

## Run a batch

`client.batch_eval()` accepts a list of inputs and runs them in parallel:

```python
from agnt5 import Client, BatchEvalItem

client = Client()

result = client.batch_eval(
    component="support_agent",
    component_type="agent",
    items=[
        BatchEvalItem(
            input={"message": "Where is my order #1234?"},
            expected="Your order #1234 is in transit",
            item_id="order-status",
        ),
        BatchEvalItem(
            input={"message": "I want to cancel order #5678"},
            expected="Order #5678 has been cancelled",
            item_id="order-cancel",
        ),
    ],
    scorers=["exact_match"],
)

print(f"Pass rate: {result.pass_rate:.0%}")
for item in result.results:
    status = "PASS" if item.passed else "FAIL"
    print(f"{item.item_id}: {status} ({item.duration_ms}ms)")
```
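
For CI use, the pass rate can be turned into a process exit code. The sketch below is a minimal illustration that runs without a worker: `ItemResult` and `BatchResult` are hypothetical stand-ins that copy only the `results`, `passed`, `item_id`, and `pass_rate` fields used above, and the 0.9 threshold and the pass-rate formula (passing items divided by total items) are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ItemResult:  # stand-in for one per-item result
    item_id: Optional[str]
    passed: bool


@dataclass
class BatchResult:  # stand-in for the batch result object
    results: List[ItemResult]

    @property
    def pass_rate(self) -> float:
        # Assumption: pass rate = passing items / total items.
        if not self.results:
            return 0.0
        return sum(r.passed for r in self.results) / len(self.results)


def gate(result, threshold: float = 0.9) -> int:
    """Return 0 if the batch meets the threshold, else 1, listing failures."""
    for item in result.results:
        if not item.passed:
            print(f"FAIL {item.item_id}")
    return 0 if result.pass_rate >= threshold else 1


demo = BatchResult([ItemResult("order-status", True), ItemResult("order-cancel", False)])
exit_code = gate(demo, threshold=0.9)
print(f"pass rate {demo.pass_rate:.0%}, exit code {exit_code}")
```

In a CI script you would pass the real result from `client.batch_eval()` to `gate()` and hand the return value to `sys.exit()`.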

## Input formats

`batch_eval()` accepts items in several forms — mix them freely:

**Plain dicts with a separate `expected` list:**

```python
result = client.batch_eval(
    component="greet",
    items=[{"name": "Alice"}, {"name": "Bob"}],
    expected=["Hello, Alice!", "Hello, Bob!"],
    scorers=["exact_match"],
)
```

**Dicts with `input` and `expected` keys:**

```python
result = client.batch_eval(
    component="add",
    items=[
        {"input": {"a": 1, "b": 2}, "expected": 3},
        {"input": {"a": 3, "b": 4}, "expected": 7, "item_id": "add-2"},
    ],
    scorers=["exact_match"],
)
```

**`BatchEvalItem` objects for full control:**

```python
from agnt5 import BatchEvalItem

result = client.batch_eval(
    component="add",
    items=[
        BatchEvalItem(input={"a": 1, "b": 2}, expected=3, item_id="add-1"),
        BatchEvalItem(input={"a": 3, "b": 4}, expected=7, item_id="add-2"),
    ],
    scorers=["exact_match"],
)
```
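
One way to picture how the three forms relate is as a normalization step that coerces each item into a single shape. The sketch below is an illustration under assumptions, not the SDK's implementation: `Item` is a hypothetical stand-in for `BatchEvalItem`, and `normalize` is a made-up helper name.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Item:  # hypothetical stand-in for BatchEvalItem
    input: Any
    expected: Any = None
    item_id: Optional[str] = None


def normalize(items: List[Any], expected: Optional[List[Any]] = None) -> List[Item]:
    """Coerce the three accepted item shapes into one uniform list."""
    normalized = []
    for index, raw in enumerate(items):
        if isinstance(raw, Item):
            normalized.append(raw)  # already in full form
        elif isinstance(raw, dict) and "input" in raw:
            # Embedded form: expected and item_id travel with the item.
            normalized.append(Item(raw["input"], raw.get("expected"), raw.get("item_id")))
        else:
            # Plain form: the whole value is the input; expected is matched by position.
            has_expected = expected is not None and index < len(expected)
            normalized.append(Item(raw, expected[index] if has_expected else None))
    return normalized


mixed = normalize(
    [{"name": "Alice"}, {"input": {"a": 3, "b": 4}, "expected": 7, "item_id": "add-2"}],
    expected=["Hello, Alice!", None],
)
for item in mixed:
    print(item)
```

Under a rule like this, a component whose own payload has a top-level `input` key would be read as the embedded form, so wrapping such payloads in `BatchEvalItem` is the unambiguous choice.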

## Scorers

Pass scorer names as strings, SDK preset instances, or `LLMJudge` instances. You can combine them in one list:

```python
from agnt5.eval import Correctness, Helpfulness, LLMJudge

scorers = [
    "json_valid",          # Fast structure check
    "contains",            # Required substring
    Correctness(),         # Managed correctness preset
    Helpfulness(model="openai/gpt-4o"),
    LLMJudge(
        criteria="Is the response under 50 words?",
        model="openai/gpt-4o-mini",
    ),
]
```

All [built-in deterministic scorers](/docs/improve/scorers.md#built-in-deterministic-scorers) work as strings. All [SDK evaluator presets](/docs/improve/scorers.md#sdk-evaluator-presets) work as class instances.
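
To make the distinction concrete, a deterministic scorer is a pure function of the output and the expected value, which is why it is cheap enough to run on every item. The sketch below is a rough illustration only: `Score` is a hypothetical stand-in shaped like the fields printed in the single-eval example, and the two functions approximate what scorers named `exact_match` and `json_valid` might check. The built-in scorers may normalize or compare values differently.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class Score:  # hypothetical stand-in for one scorer result
    scorer: str
    score: float
    passed: bool
    explanation: str


def exact_match(output: Any, expected: Any) -> Score:
    """Pass only when the output equals the expected value."""
    ok = output == expected
    detail = "output equals expected" if ok else f"expected {expected!r}, got {output!r}"
    return Score("exact_match", 1.0 if ok else 0.0, ok, detail)


def json_valid(output: Any) -> Score:
    """Pass only when the output is a string that parses as JSON."""
    try:
        json.loads(output)
        ok = True
    except (TypeError, ValueError):  # non-string input or malformed JSON
        ok = False
    detail = "output parses as JSON" if ok else "output is not valid JSON"
    return Score("json_valid", 1.0 if ok else 0.0, ok, detail)


print(exact_match(3, 3))
print(json_valid('{"a": 1}'))
print(json_valid("not json"))
```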

## Concurrency and timeouts

```python
from agnt5.eval import Correctness

# test_items: a list in any of the forms shown under "Input formats"
result = client.batch_eval(
    component="slow_agent",
    component_type="agent",
    items=test_items,
    scorers=[Correctness()],
    max_concurrency=5,   # Parallel evaluations (default 10)
    timeout=60.0,        # Per-item timeout in seconds
)
```

For large batches, start with `max_concurrency=3` and `timeout=30.0` during development, then increase for production runs.
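
The interaction between the two settings can be sketched with plain `asyncio`: a semaphore caps how many items are in flight, and each item gets its own deadline. This is an illustration under assumptions, not the SDK's scheduler. `run_batch` and `fake_component` are hypothetical names, and the sketch starts each item's clock when it begins running, which may differ from how the SDK measures the timeout.

```python
import asyncio


async def run_batch(inputs, worker, max_concurrency=10, timeout=None):
    """Run worker(x) per input with bounded parallelism and a per-item timeout."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(index, value):
        async with semaphore:  # at most max_concurrency items in flight
            try:
                output = await asyncio.wait_for(worker(value), timeout)
                return {"index": index, "output": output, "error": None}
            except asyncio.TimeoutError:
                # A slow item is recorded as an error instead of failing the batch.
                return {"index": index, "output": None, "error": "timeout"}

    # gather() returns results in input order, regardless of completion order.
    return await asyncio.gather(*(run_one(i, v) for i, v in enumerate(inputs)))


async def fake_component(delay):
    # Stand-in for a component call; sleeps for `delay` seconds.
    await asyncio.sleep(delay)
    return f"done after {delay}s"


results = asyncio.run(
    run_batch([0.01, 0.5, 0.02], fake_component, max_concurrency=2, timeout=0.1)
)
for r in results:
    print(r)
```

Here the second item exceeds the 0.1 second deadline and is reported as a timeout while the other two complete, which is why a low `max_concurrency` combined with a tight `timeout` surfaces slow items quickly during development.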

## TypeScript

```typescript
// Assumes `Client` and `Correctness` are already imported from the AGNT5 TypeScript SDK.
const client = new Client();

const result = await client.batchEval(
  'support_agent',
  [
    { input: { message: 'Where is my order #1234?' }, expected: 'Your order #1234 is in transit', itemId: 'order-status' },
    { input: { message: 'I want to cancel order #5678' }, expected: 'Order #5678 has been cancelled', itemId: 'order-cancel' },
  ],
  {
    scorers: [new Correctness()],
    componentType: 'agent',
    maxConcurrency: 5,
  },
);

console.log(`Pass rate: ${(result.passRate * 100).toFixed(0)}%`);
for (const item of result.results) {
  const status = item.passed ? 'PASS' : 'FAIL';
  console.log(`${item.itemId}: ${status} (${item.durationMs}ms)`);
}
```

`Client` reads `AGNT5_GATEWAY_URL` from `process.env` automatically. All input formats from the Python section work the same way in TypeScript.


* **Python API**: `client.eval(component, input, expected?, scorers?, component_type?)` -> `EvalResponse`; `client.batch_eval(component, items, scorers?, expected?, component_type?, max_concurrency=10, timeout?)` -> `BatchEvalResult`.
* **Input normalization**: plain dicts use positional `expected` list; dicts with `input` key use embedded `expected`; `BatchEvalItem(input, expected?, item_id?, index?)` for full control.
* **Scorers**: strings (`"exact_match"`), SDK preset instances (`Correctness()`, `Helpfulness(model=...)`), or `LLMJudge(criteria, model, include_input, temperature)`.
* **BatchEvalResult**: `batch_id`, `status` (`"completed"`/`"partial_failure"`/`"failed"`), `results: BatchEvalItemResult[]`, `stats: BatchEvalStats`, `pass_rate`, `passing_items()`, `failing_items()`, `failed_items()`.
* **BatchEvalItemResult**: `index`, `run_id`, `output`, `scores: ScorerResultSummary[]`, `passed`, `duration_ms`, `item_id?`, `trace_id?`, `error?`, `get_score(name)`.


## Next steps

* [Scorers](/docs/improve/scorers.md): full reference for built-in scorers, SDK presets, and custom scorer code.
* [Experiments](/docs/improve/experiments.md): run a component against a versioned dataset, compare runs, and gate CI.
* [Datasets](/docs/improve/datasets.md): curate test cases from production runs and publish immutable versions.
* [Agents](/docs/build/agents.md): structure agent tool use so scorers have predictable events to check.
