---
title: Scorers
description: Score component outputs with built-in deterministic checks, trace assertions, LLM-as-judge presets, or your own custom scorer code.
last_verified: 2026-06-07
---

A **scorer** decides whether a component's output meets the target behavior. Each scorer returns a score between 0.0 and 1.0, a pass/fail verdict, and an optional explanation. [Experiments](/docs/improve/experiments.md) attach one or more scorers and run them against every [dataset](/docs/improve/datasets.md) item.
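As a rough illustration of that result, the sketch below models the three fields as a plain dataclass. `SketchScorerResult` is a hypothetical stand-in written for this example, not the SDK's own `ScorerResult` type, and the range check is an assumption about how an out-of-range score might be rejected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchScorerResult:
    """Hypothetical stand-in for a scorer result, for illustration only."""

    score: float                        # numeric score between 0.0 and 1.0
    passed: bool                        # pass/fail verdict
    explanation: Optional[str] = None   # optional human-readable reason

    def __post_init__(self) -> None:
        # Assumption: scores outside the documented range are treated as errors.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be within [0.0, 1.0], got {self.score}")


result = SketchScorerResult(score=0.8, passed=True, explanation="4 of 5 checks passed")
```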

---

## Scorer classes

AGNT5 has exactly three scorer classes:

| Class | Examples | Needs deployment? |
|---|---|---|
| **Built-in deterministic** | `exact_match`, `json_schema`, `tool_called` | No — runs as AGNT5-owned logic |
| **Built-in LLM-as-judge** | `llm_judge`, `correctness`, `faithfulness` | No — AGNT5-owned, configurable model and rubric |
| **Custom** | Your `@scorer` functions | Yes — registered and deployed with your worker |

Built-ins work out of the box: select them in Studio or pass them to `agnt5 experiments create --builtin-scorer <name>`. Only custom scorers require your own code.

Scorers also differ by what they evaluate:

- **Output scorers** compare a single item's output against its input and expected output.
- **Trace scorers** assert on execution behavior — which tools were called, how many LLM calls were made, how long the run took. They need trace events, which dataset items carry when [imported from production runs](/docs/improve/datasets.md#from-production-runs).

---

## Built-in deterministic scorers

Output scorers:

| Scorer | Checks |
|---|---|
| `exact_match` | Output equals the expected output exactly |
| `contains` | Output contains a substring |
| `regex_match` | Output matches a regular expression |
| `json_valid` | Output is well-formed JSON |
| `json_schema` | Output validates against a JSON Schema |
| `numeric_range` | Numeric output falls within a range |
| `levenshtein` | Output is similar to expected by edit distance |
| `structured_assertions` | Configured assertions over input, output, and expected JSON |
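To make "similar by edit distance" concrete, here is a minimal sketch of a normalized Levenshtein similarity. The normalization (one minus distance divided by the longer length) is an assumption chosen for illustration; the built-in `levenshtein` scorer may normalize or threshold differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(output: str, expected: str) -> float:
    """Assumed normalization: 1 - distance / longer length (1.0 if both are empty)."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(output, expected) / longest
```

Under this assumed normalization, identical strings score 1.0 and strings with no characters in common at any position score 0.0.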

Trace scorers:

| Scorer | Checks |
|---|---|
| `tool_called` / `tool_not_called` | A named tool was (or was not) called |
| `tool_sequence` / `tool_sequence_in_order` | Tools were called in the configured order |
| `tool_sequence_exact` | The tool trajectory matches exactly |
| `tool_sequence_any_order` | Configured tools all appear, in any order |
| `tool_trajectory` | Tools match a selected trajectory pattern |
| `tool_params_match` | Tool-call arguments match configured parameters |
| `max_tool_calls` / `max_llm_calls` | Total tool or LLM calls stay under a limit |
| `max_tokens` | Total LLM tokens stay under a budget |
| `duration_under` | Session duration stays under a limit |
| `no_errors` | The execution produced no errors |
| `state_equals` | A named state snapshot equals an expected value |
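The three tool-sequence variants differ only in how strictly they compare the observed calls with the configured list. The sketch below shows one plausible reading of each; the function names are hypothetical, the tool names other than `search_orders` are made up, and the built-in scorers may treat duplicates or interleaved calls differently.

```python
from typing import Sequence


def in_order(called: Sequence[str], expected: Sequence[str]) -> bool:
    """Expected tools appear as a subsequence; other calls may be interleaved."""
    remaining = iter(called)
    # `tool in remaining` advances the iterator until it finds a match,
    # so each expected tool must occur after the previous one.
    return all(tool in remaining for tool in expected)


def exact(called: Sequence[str], expected: Sequence[str]) -> bool:
    """The full trajectory matches: same tools, same order, nothing extra."""
    return list(called) == list(expected)


def any_order(called: Sequence[str], expected: Sequence[str]) -> bool:
    """Every expected tool was called at least once; order and repeats ignored."""
    return set(expected) <= set(called)


called = ["lookup_customer", "search_orders", "issue_refund"]
```

With this trajectory, `in_order(called, ["lookup_customer", "issue_refund"])` holds because the interleaved `search_orders` call is tolerated, while `exact` with the same two-item list does not.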

Pass a bare name for default behavior, or a JSON object for configuration:

```bash
agnt5 experiments create \
  --name support-agent-quality \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --deployment-id <deployment-id> \
  --component-name support_agent \
  --component-type agent \
  --builtin-scorer json_valid \
  --builtin-scorer '{"name":"tool_called","config":{"tool":"search_orders"}}' \
  --builtin-scorer '{"name":"max_llm_calls","config":{"max":5}}'
```
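When the scorer list is assembled in a script, serializing the JSON programmatically avoids hand-quoting it for the shell. This is a minimal sketch using only the Python standard library; `builtin_scorer_args` is a hypothetical helper, and the only CLI detail it relies on is the `--builtin-scorer` flag shown above.

```python
import json
import shlex


def builtin_scorer_args(scorers):
    """Turn bare names (str) and configured scorers (dict) into CLI arguments."""
    args = []
    for entry in scorers:
        # Bare names pass through; configured scorers become compact JSON.
        value = entry if isinstance(entry, str) else json.dumps(entry, separators=(",", ":"))
        args += ["--builtin-scorer", value]
    return args


args = builtin_scorer_args([
    "json_valid",
    {"name": "tool_called", "config": {"tool": "search_orders"}},
    {"name": "max_llm_calls", "config": {"max": 5}},
])

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(args))
```

The resulting list can also be appended to the rest of the `agnt5 experiments create` arguments and passed to `subprocess.run` as a list, which sidesteps shell quoting entirely.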

---

## Built-in LLM-as-judge scorers

LLM-as-judge scorers use a language model to grade outputs against criteria. All are AGNT5-owned — no registration or deployment needed — and accept overridable judge settings: model, provider, prompt, and rubric.

| Scorer | What it grades |
|---|---|
| `llm_judge` | Generic judge — you supply the criteria and rubric |
| `correctness` | Output correctly answers the input and matches expected output |
| `faithfulness` | Output stays faithful to configured context fields; penalizes hallucination |

Pass them to an experiment the same way as deterministic scorers:

```bash
agnt5 experiments create \
  --name my-experiment \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --deployment-id <deployment-id> \
  --component-name my_agent \
  --component-type agent \
  --builtin-scorer correctness \
  --builtin-scorer '{"name":"llm_judge","config":{"criteria":"Is the response concise and actionable?","model":"openai/gpt-4o-mini"}}'
```

> **Note:** Judge scorers call an LLM provider, so the project needs the provider credential configured (for example `OPENAI_API_KEY` as a [project secret](/docs/run/deploying.md#secrets)).

### SDK evaluator presets

The Python and TypeScript SDKs export typed preset classes for common LLM judge patterns. Use these with `client.eval()` and `client.batch_eval()` instead of raw scorer strings:

| Preset class | Grades |
|---|---|
| `Correctness` | Output correctly answers the input and matches expected |
| `Faithfulness` | Output is faithful to provided context; penalizes unsupported claims |
| `Helpfulness` | Output is useful, complete, and actionable for the user's task |
| `Coherence` | Output is logically organized and internally consistent |
| `Conciseness` | Output is concise while preserving all needed information |
| `ResponseRelevance` | Output directly addresses the input without off-topic content |
| `InstructionFollowing` | Output follows all explicit and implied instructions |
| `GoalSuccess` | Session achieved the user's goal (uses journal events and session state) |
| `Refusal` | Refusals are appropriate, well-explained, and offer safe alternatives |
| `Harmfulness` | Output avoids instructions or claims that could enable harm |
| `Stereotyping` | Output avoids stereotypes and biased generalizations |

```python
from agnt5.eval import Correctness, Helpfulness, Faithfulness

scorers = [
    Correctness(),
    Helpfulness(model="openai/gpt-4o"),
    Faithfulness(context_fields=["retrieved_chunks"]),
]
```

All presets accept an optional `model` (default `openai/gpt-4o-mini`), `temperature` (default `0.0`), `threshold` (default `0.7`), and `include_input`. See [Batch eval](/docs/improve/batch-eval.md) for full SDK usage.
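The sketch below shows how a threshold could turn a judge's 0.0 to 1.0 score into a pass/fail verdict. `PresetSettings` and `verdict` are hypothetical names that only mirror the documented defaults, and the "at or above the threshold passes" rule is an assumption; the SDK may compare differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetSettings:
    """Hypothetical mirror of the documented preset defaults."""

    model: str = "openai/gpt-4o-mini"
    temperature: float = 0.0
    threshold: float = 0.7


def verdict(score: float, settings: PresetSettings = PresetSettings()) -> bool:
    # Assumption: a score at or above the threshold counts as a pass.
    return score >= settings.threshold


strict = PresetSettings(threshold=0.9)
```

Raising the threshold, as in `strict`, makes the same judge score fail where the default would pass.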

---

## Custom scorers

When built-ins can't express your check, write a **custom scorer** — user code that receives the eval context and returns a result. Custom scorers are components: they register with your worker and deploy with your code.

Python:

```python
from agnt5.eval import scorer, EvalContext, ScorerResult

@scorer(name="cites_order_id", description="Reply must cite the order ID from the input")
def cites_order_id(ctx: EvalContext) -> ScorerResult:
    order_id = str(ctx.input.get("order_id", ""))
    # Guard against an empty ID: "" is a substring of every string.
    cited = bool(order_id) and order_id in str(ctx.output)
    return ScorerResult(
        score=1.0 if cited else 0.0,
        passed=cited,
        explanation=f"Order ID {order_id} {'found' if cited else 'missing'} in reply",
    )
```

The `EvalContext` carries `input`, `output`, `expected`, `run_id`, `trace_id`, and `events` (trace events for trace-level scorers, declared with `scope="trace"`).
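One way to keep a custom check testable without a deployed worker is to put the logic in a plain function that the decorated scorer calls with `ctx.input` and `ctx.output`. The helper below is a hypothetical sketch with no SDK dependency; it returns the three values a scorer result needs.

```python
from typing import Any, Mapping, Tuple


def check_cites_order_id(input_data: Mapping[str, Any], output: Any) -> Tuple[float, bool, str]:
    """Return (score, passed, explanation) for the order-ID citation check."""
    order_id = str(input_data.get("order_id", ""))
    if not order_id:
        # Without an ID there is nothing to cite, so fail rather than pass vacuously.
        return 0.0, False, "Input has no order_id to cite"
    cited = order_id in str(output)
    status = "found" if cited else "missing"
    return (1.0 if cited else 0.0), cited, f"Order ID {order_id} {status} in reply"


score, passed, explanation = check_cites_order_id(
    {"order_id": "A-1042"},
    "Your order A-1042 shipped yesterday.",
)
```

A decorated scorer could then unpack the tuple into its result, while unit tests exercise `check_cites_order_id` directly.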

TypeScript:

```typescript
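// Assumes `scorer` and `ScorerResult` are imported from the AGNT5 TypeScript SDK.
const citesOrderId = scorer("cites_order_id", "Reply must cite the order ID from the input")(
  async (ctx, request) => {
    const orderId = (request.input as { order_id?: string }).order_id ?? "";
    // Guard against an empty ID: every string includes "".
    const cited = orderId !== "" && String(request.output).includes(orderId);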
    return new ScorerResult({ score: cited ? 1 : 0, passed: cited });
  },
);
```

Custom scorers register with the worker like any other component — auto-registration picks up decorated scorers, or pass them explicitly via the worker's `scorers` list. After [deploying](/docs/run/deploying.md), the scorer appears in Studio under **Evaluate** -> **Scorers**. Attach it to an experiment by ID:

```bash
agnt5 experiments create ... --scorer-id <scorer-id>
```

---

## Trace assertions in the SDK

For glassbox testing — asserting on execution behavior rather than output content — the SDK provides `TraceAssertion` and `trace_scorer()`. These are used inside custom scorers or directly in [batch eval](/docs/improve/batch-eval.md) code; they do not map to CLI scorer names.

Python:

```python
from agnt5.eval import EvalContext, ScorerResult, TraceAssertion, scorer, trace_scorer

@scorer(name="efficiency_check", scope="trace")
def efficiency_check(ctx: EvalContext) -> ScorerResult:
    assertions = [
        TraceAssertion.max_tokens(2000),
        TraceAssertion.max_lm_calls(4),
        TraceAssertion.no_errors(),
        TraceAssertion.duration_under(15000),
    ]
    result = trace_scorer(ctx.events, assertions)
    return ScorerResult(
        score=result.score,
        passed=result.passed,
        explanation=result.explanation,
    )
```

TypeScript:

```typescript
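// Assumes `scorer`, `traceScorer`, and `TraceAssertion` are imported from the AGNT5 TypeScript SDK.
const efficiencyCheck = scorer("efficiency_check", "Checks token use, LLM calls, and duration", "trace")(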
  async (ctx, request) => {
    const result = traceScorer(request.trace ?? [], [
      TraceAssertion.maxTokens(2000),
      TraceAssertion.maxLmCalls(4),
      TraceAssertion.noErrors(),
      TraceAssertion.durationUnder(15000),
    ]);
    return { score: result.score, passed: result.passed, explanation: result.explanation };
  },
);
```

Available assertions:

| Method | Asserts |
|---|---|
| `max_tokens(n)` / `maxTokens(n)` | Total LLM tokens ≤ n |
| `max_lm_calls(n)` / `maxLmCalls(n)` | LLM call count ≤ n |
| `no_errors()` / `noErrors()` | No error events in the trace |
| `duration_under(ms)` / `durationUnder(ms)` | Execution duration < ms |
| `event_sequence([...])` / `eventSequence([...])` | Named events appear in order |
| `step_memoized(name)` / `stepMemoized(name)` | Step was served from cache |
| `event_count(type, min)` / `eventCount(type, min)` | Event type appeared ≥ min times |

`trace_scorer()` returns a score equal to the proportion of assertions that passed, plus a combined explanation.
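The aggregation can be pictured with a small standalone sketch. The types and the `combine` function are hypothetical and do not reflect the SDK's internals; treating the overall result as passed only when every assertion passes, and an empty assertion list as a pass, are assumptions made for this example.

```python
from typing import List, NamedTuple


class AssertionOutcome(NamedTuple):
    name: str
    passed: bool
    detail: str


class CombinedResult(NamedTuple):
    score: float
    passed: bool
    explanation: str


def combine(outcomes: List[AssertionOutcome]) -> CombinedResult:
    """Score is the fraction of assertions that passed."""
    if not outcomes:
        # Assumption: nothing to check counts as a pass.
        return CombinedResult(1.0, True, "no assertions configured")
    passed_count = sum(1 for outcome in outcomes if outcome.passed)
    explanation = "; ".join(
        f"{o.name}: {'ok' if o.passed else 'FAILED'} ({o.detail})" for o in outcomes
    )
    # Assumption: the overall verdict requires every assertion to pass.
    return CombinedResult(passed_count / len(outcomes), passed_count == len(outcomes), explanation)


result = combine([
    AssertionOutcome("max_tokens", True, "1500 <= 2000"),
    AssertionOutcome("max_lm_calls", False, "6 > 4"),
    AssertionOutcome("no_errors", True, "0 error events"),
    AssertionOutcome("duration_under", True, "9000 ms < 15000 ms"),
])
```

Here three of four assertions pass, so the sketch yields a score of 0.75 with a failing overall verdict.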

---

## Inspect scores

Every scorer execution produces a **score** record with evidence — the inputs the scorer saw and why it decided what it decided.

```bash
# All scores for an experiment run
agnt5 scores list --run-id <run-id>

# Scores for a specific run item
agnt5 scores list --run-id <run-id> --run-item-id <item-id>

# Scores produced by a specific scorer
agnt5 scores list --scorer-id <scorer-id>

# Scores for a component created at or after a given time (RFC3339)
agnt5 scores list --component-name support_agent --since 2026-06-07T10:00:00Z

# Live production scores (for online evals)
agnt5 scores list --root-run-id <root-run-id>
agnt5 scores list --session-id <session-id>

# Show evidence for one score — scorer input, output, and reasoning
agnt5 scores evidence <score-id> --include scorer_input,scorer_output,evidence
```

Available filters on `scores list`:

| Flag | Filters by |
|---|---|
| `--run-id` | Experiment run UUID |
| `--run-item-id` | Specific experiment run item UUID |
| `--scorer-id` | Scorer UUID |
| `--scorer-version-id` | Specific scorer version UUID |
| `--subject-type` | Subject type (e.g. `experiment_run_item`, `span`, `session`) |
| `--subject-id` | Subject UUID |
| `--session-id` | Runtime session ID |
| `--root-run-id` | Root runtime run ID |
| `--component-name` | Component name |
| `--component-type` | Component type |
| `--since` | Scores created at or after (RFC3339 or Unix timestamp) |
| `--until` | Scores created at or before (RFC3339 or Unix timestamp) |
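Because `--since` and `--until` take absolute timestamps, a relative window such as "the last two hours" has to be converted first. The helper below is a minimal sketch using only the Python standard library; `rfc3339_ago` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rfc3339_ago(hours: float, now: Optional[datetime] = None) -> str:
    """Return the RFC3339 UTC timestamp for `hours` before `now`.

    `now` must be timezone-aware; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    moment = (now - timedelta(hours=hours)).astimezone(timezone.utc)
    # Drop sub-second precision and use the "Z" suffix for UTC.
    return moment.replace(microsecond=0).isoformat().replace("+00:00", "Z")


since = rfc3339_ago(2)
```

The value of `since` can then be passed as `--since` to `agnt5 scores list`.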

In Studio, open **Evaluate** -> **Experiments**, select a run, and click into any item to see its per-scorer results and evidence.


- **Scorer classes**: built-in deterministic (SDK-core owned, no registration), built-in LLM-as-judge (`llm_judge`, `correctness`, `faithfulness`; configurable model/prompt/rubric), custom (user components, require worker registration + deployment).
- **Built-in deterministic names**: `exact_match`, `contains`, `regex_match`, `json_valid`, `json_schema`, `numeric_range`, `levenshtein`, `structured_assertions`, `tool_called`, `tool_not_called`, `tool_sequence`, `tool_sequence_in_order`, `tool_sequence_exact`, `tool_sequence_any_order`, `tool_trajectory`, `tool_params_match`, `max_tool_calls`, `max_llm_calls`, `max_tokens`, `duration_under`, `no_errors`, `state_equals`.
- **SDK evaluator presets** (wrap `llm_judge` or named built-ins): `Correctness`, `Faithfulness`, `Helpfulness`, `Coherence`, `Conciseness`, `ResponseRelevance`, `InstructionFollowing`, `GoalSuccess`, `Refusal`, `Harmfulness`, `Stereotyping`.
- **TraceAssertion** (glassbox, used inside custom scorers with `scope="trace"`): `max_tokens`, `max_lm_calls`, `no_errors`, `duration_under`, `event_sequence`, `step_memoized`, `event_count`; composed with `trace_scorer()`.
- **Result shape**: `{score: 0.0–1.0, passed: bool, label?: string, explanation?: string, metadata?: object}`.
- **Code primitives**: `@scorer(name, description, scope)` decorator + `EvalContext` -> `ScorerResult` (Python); `scorer(name, description, scope)(handler)` (TypeScript).
- **Errors**: built-ins fail with typed scorer errors (`input_error`, `config_error`, `provider_error`, `auth_error`, `artifact_error`, `timeout_error`); only custom scorers can return `scorer_not_found`.


## Next steps

* [Experiments](/docs/improve/experiments.md): attach scorers to an experiment and run them against a dataset version.
* [Batch eval](/docs/improve/batch-eval.md): run scorers from code using `client.eval()` or `client.batch_eval()`.
* [Datasets](/docs/improve/datasets.md): curate the items your scorers grade, including trace events for trace scorers.
* [Agents](/docs/build/agents.md): structure agent tool use so trace assertions have meaningful events to check.