---
title: Experiments
description: Run a deployed component against a dataset version, score every item, compare runs, and gate CI on the results.
last_verified: 2026-06-07
---

An **experiment** binds a target (a deployed component or a prompt) to a [dataset](/docs/improve/datasets.md) version and a set of [scorers](/docs/improve/scorers.md). Each **experiment run** executes the target against every dataset item, scores the outputs, and produces a pass/fail summary you can compare across runs or enforce in CI.

---

## Experiment versions

- An experiment's definition (target, scorers, config) is versioned. Starting a run snapshots it as an immutable **experiment version**, so older runs always show the exact configuration that produced them.
- A run pairs one experiment version with one dataset version. Because both sides are immutable, two runs over the same dataset version are directly comparable.
- Each dataset item becomes a **run item** with its own output and per-scorer scores.

---

## List experiments

```bash
# All experiments in the current project
agnt5 experiments list

# Filter by dataset, status, or type
agnt5 experiments list --dataset-id <dataset-id> --status active
```

Available filters: `--dataset-id`, `--status`, `--type`. Use `--page` and `--page-size` for pagination.

---

## Create an experiment

The target component must already be [deployed](/docs/run/deploying.md). Once it is, create the experiment:

```bash
agnt5 experiments create \
  --name support-agent-quality \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --target-type component \
  --deployment-id <deployment-id> \
  --component-name support_agent \
  --component-type agent \
  --builtin-scorer json_valid \
  --builtin-scorer '{"name":"tool_called","config":{"tool":"search_orders"}}' \
  --config '{"passed_threshold":1}'
```

- `--builtin-scorer` attaches a [built-in scorer](/docs/improve/scorers.md#built-in-deterministic-scorers) by name, or by JSON object when it needs config. Repeat for each scorer.
- `--scorer-id` attaches a deployed [custom scorer](/docs/improve/scorers.md#custom-scorers) by UUID.
- `--target-type prompt` with `--prompt-id` targets a [Prompt](/docs/build/prompts.md) instead of a component:

```bash
agnt5 experiments create \
  --name support-prompt-comparison \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --target-type prompt \
  --deployment-id <deployment-id> \
  --prompt-id <prompt-id> \
  --prompt-version-id <prompt-version-id> \
  --builtin-scorer correctness
```

Via Studio: open your project, go to **Evaluate** -> **Experiments**, and create the experiment with the same choices — dataset version, target, and scorers.

---

## Run an experiment

```bash
# Fire and forget
agnt5 experiments run <experiment-id>

# Block until the run finishes; exit non-zero if the gate fails
agnt5 experiments run <experiment-id> --wait --timeout 15m
```

Useful overrides:

| Flag | Purpose |
|---|---|
| `--name` | Label the run (e.g. `pr-1234`) |
| `--deployment-id` | Compare a candidate deployment against the baseline for the same component |
| `--experiment-version-id` | Re-run an older immutable experiment version |
| `--fail-on-gate` | Return non-zero when the CI gate fails (default true with `--wait`) |
| `--config` | Override run-level config JSON (e.g. `'{"passed_threshold":0.9}'`) |

---

## Inspect results

```bash
# Runs for an experiment
agnt5 experiments runs list <experiment-id>

# One run's summary: status, pass rate, per-scorer aggregates
agnt5 experiments runs show <run-id>
# Also available as a standalone command:
agnt5 reports summary <run-id>

# Failed items only
agnt5 reports failures <run-id>

# Scores for a run — filter by scorer, item, component, or time window
agnt5 scores list --run-id <run-id>
agnt5 scores list --run-id <run-id> --scorer-id <scorer-id>
agnt5 scores list --run-id <run-id> --run-item-id <item-id>
agnt5 scores list --since 2h --component-name support_agent

# Hydrate scorer input, output, and evidence for one score
agnt5 scores evidence <score-id> --include scorer_input,scorer_output,evidence
```

In Studio, the experiment run page shows the run timeline, per-item results, and per-scorer scores; click any item to see its output and score evidence side by side.

Cancel a stuck run with `agnt5 experiments runs cancel <experiment-id> <run-id>`.

---

## Compare runs

Compare a candidate against a baseline over the same dataset version:

```bash
agnt5 experiments runs compare <base-run-id> <compare-run-id>
```

The comparison shows aggregate score movement and which items flipped between pass and fail. Studio renders the same comparison under **Evaluate** -> **Experiments** when you select two runs.
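
To make "flipped" concrete, the sketch below lists items whose verdict differs between two runs. It assumes each run's per-item verdicts have already been written to a two-column, tab-separated file (`item_id`, then `pass` or `fail`). That file layout and the `flipped_items` helper are illustrative assumptions, not the CLI's export format.

```bash
# Hypothetical helper: print items whose verdict changed between two runs.
# Each input file has lines of the form: <item_id><TAB><pass|fail>
flipped_items() {
  base_file=$1
  compare_file=$2
  awk -F '\t' '
    # First file: remember the baseline verdict for each item.
    NR == FNR { base[$1] = $2; next }
    # Second file: report items present in both runs with a different verdict.
    ($1 in base) && base[$1] != $2 {
      printf "%s\t%s -> %s\n", $1, base[$1], $2
    }
  ' "$base_file" "$compare_file"
}
```

Items that appear in only one of the two files are ignored, which mirrors the requirement that both runs cover the same dataset version.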

---

## Gate CI on eval results

Use the reports commands in a CI job to block a merge or deploy on eval regressions:

```bash
# Start a run for the freshly deployed candidate, wait, and gate
agnt5 experiments run <experiment-id> \
  --deployment-id "$CANDIDATE_DEPLOYMENT_ID" \
  --name "ci-$GIT_SHA" \
  --wait --fail-on-gate

# Or wait on an already-started run
agnt5 reports wait <run-id> --timeout 15m

# Print the gate verdict for a finished run
agnt5 reports ci <run-id>

# Export full artifacts for the build log
agnt5 reports export <run-id> --artifact-format csv --out-file eval-results.csv
```

Exit codes from `experiments run --wait`, `reports wait`, and `reports ci`:

| Code | Meaning |
|---|---|
| `0` | Gate passed |
| `2` | CI gate failed (run completed but pass rate below threshold) |
| `3` | Run failed or was cancelled |
| `4` | Wait timed out |

Use `--fail-on-gate=false` to suppress the non-zero exit on a gate failure and inspect the result yourself.

The gate verdict comes from the experiment's config (for example `{"passed_threshold": 1}`), so the threshold lives with the experiment rather than the pipeline. A single run can still override it with `--config`.
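
In a CI script it can help to turn the exit codes in the table above into readable log lines. A minimal sketch, assuming a POSIX shell: `gate_verdict` is a hypothetical helper name, not part of the CLI, and any code not listed in the table is reported as unexpected.

```bash
# Hypothetical helper: translate an exit code from `experiments run --wait`,
# `reports wait`, or `reports ci` into a short message.
gate_verdict() {
  case "$1" in
    0) echo "gate passed" ;;
    2) echo "gate failed: pass rate below threshold" ;;
    3) echo "run failed or was cancelled" ;;
    4) echo "wait timed out" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Usage in a CI step (commented out because it needs a real run ID):
#   code=0
#   agnt5 reports ci "$RUN_ID" || code=$?
#   gate_verdict "$code"
#   exit "$code"
```

Capturing the code with `|| code=$?` keeps the step alive under `set -e` long enough to log the verdict before exiting with the original code.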

---

## Turn failures into a regression dataset

After a run with failures, capture the failing items as a new dataset so fixes stay fixed:

```bash
agnt5 experiments runs regression-dataset <run-id> \
  --name support-agent-regressions \
  --start-run --wait

# Rerun against a specific deployment (e.g. a fix candidate)
agnt5 experiments runs regression-dataset <run-id> \
  --name support-agent-regressions \
  --start-run --deployment-id <candidate-deployment-id> --wait
```

This creates a dataset from the failed items (all of them, or specific ones via repeatable `--run-item-id`) and a regression experiment over it. With `--start-run`, it also kicks off the first rerun immediately. Pass `--deployment-id` to run the regression against a specific deployment rather than the experiment's default.
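
When only some failed items belong in the regression set, `--run-item-id` has to be repeated once per item. The sketch below shows one way to build that flag list in a POSIX shell. `regression_from_items` is a hypothetical wrapper, and it uses only the flags described above.

```bash
# Hypothetical wrapper: build a regression dataset from selected failed items.
# Usage: regression_from_items <run-id> <dataset-name> <run-item-id>...
regression_from_items() {
  run_id=$1
  name=$2
  shift 2
  # Turn each remaining argument into a repeated --run-item-id flag.
  # The loop list is expanded once up front, so rewriting "$@" inside is safe.
  for item in "$@"; do
    set -- "$@" --run-item-id "$item"
    shift
  done
  agnt5 experiments runs regression-dataset "$run_id" --name "$name" "$@"
}
```

Called with no item IDs, the wrapper passes no `--run-item-id` flags, so the command falls back to using all failed items.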

---

## Rescore a completed run

After updating a scorer's logic or thresholds, rescore an existing completed run without re-executing the component. The original outputs stay intact; only the score records are replaced.

```bash
# Rescore all scorers on a run
curl -X POST "https://api.agnt5.com/api/v1/projects/<project-id>/experiments/<experiment-id>/runs/<run-id>/rescore" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "reason": "updated_rubric",
    "replace_latest": true
  }'
```

Optional request fields:

| Field | Purpose |
|---|---|
| `scorer_version_ids` | Restrict rescore to specific scorer version UUIDs (default: all) |
| `run_item_ids` | Restrict rescore to specific item UUIDs (default: all) |
| `reason` | Label for the rescore in audit history (default: `manual_rescore`) |
| `replace_latest` | Replace the current aggregate scores and gate result (default: `true`) |
| `idempotency_key` | Deduplicate concurrent rescore requests |

The response reports how many scoring jobs were enqueued, already queued, processed, completed, and skipped. Run the gate command afterwards to see the updated verdict:

```bash
agnt5 reports ci <run-id>
```

Only **completed** runs can be rescored; running or pending runs return an error.
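
To rescore only a few items, the request body needs a `run_item_ids` array. A minimal sketch of building that body in a POSIX shell follows. `rescore_body` is a hypothetical helper, and it assumes the reason and IDs contain no characters that need JSON escaping (UUIDs and simple labels do not).

```bash
# Hypothetical helper: print a rescore request body, optionally limited to
# specific run items.
# Usage: rescore_body <reason> [<run-item-id>...]
rescore_body() {
  reason=$1
  shift
  ids=""
  # Join the IDs into a comma-separated list of JSON strings.
  for id in "$@"; do
    ids="${ids:+$ids,}\"$id\""
  done
  if [ -n "$ids" ]; then
    printf '{"reason":"%s","run_item_ids":[%s],"replace_latest":true}\n' "$reason" "$ids"
  else
    # No IDs given: omit the field so the rescore covers all items (the default).
    printf '{"reason":"%s","replace_latest":true}\n' "$reason"
  fi
}
```

The output can be passed to the rescore request with `-d "$(rescore_body updated_rubric <item-id>)"`.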

---

## Annotate run items

Add a human label to any run item to capture reviewer decisions alongside scorer verdicts. Annotations are stored separately from scores and used for meta-evaluation (measuring scorer accuracy against human judgement).

```bash
curl -X POST "https://api.agnt5.com/api/v1/projects/<project-id>/experiments/<experiment-id>/runs/<run-id>/items/<item-id>/annotations" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "human_label",
    "label": "pass",
    "metadata": {"reviewer": "alice", "note": "Correct order ID cited"}
  }'
```

Fields:

| Field | Required | Description |
|---|---|---|
| `label` | Yes | Reviewer verdict (any string, e.g. `pass`, `fail`, `partial`) |
| `name` | No | Annotation name (default: `human_label`) |
| `metadata` | No | JSON object with reviewer notes (max 64 KB) |

List annotations for an item:

```bash
curl "https://api.agnt5.com/api/v1/projects/<project-id>/experiments/<experiment-id>/runs/<run-id>/items/<item-id>/annotations" \
  -H "Authorization: Bearer <token>"
```

In Studio, annotations appear inline next to scorer results in the run item detail view.
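
Meta-evaluation comes down to comparing scorer verdicts with human labels. The sketch below computes a simple agreement rate. It assumes you have already collected both into a tab-separated file with three columns (`item_id`, scorer verdict, human label). That layout and the `scorer_agreement` helper are illustrative assumptions, not something the CLI or API produces directly.

```bash
# Hypothetical helper: share of items where the scorer verdict matches the human label.
# Input lines: <item_id><TAB><scorer_verdict><TAB><human_label>
scorer_agreement() {
  awk -F '\t' '
    { total++; if ($2 == $3) agree++ }
    END {
      if (total == 0) { print "no annotated items"; exit }
      printf "%d/%d agree (%.2f)\n", agree, total, agree / total
    }
  ' "$1"
}
```

A low agreement rate suggests the scorer, rather than the component, is what needs attention.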

---

## Output formats and project targeting

All eval commands accept two persistent flags:

| Flag | Description |
|---|---|
| `--format table\|json\|jsonl` | Output format. Defaults to `table` in the terminal and to `json` when `--output json` is set globally. Use `jsonl` to stream one record per line for piping to `jq`. |
| `--project-id <uuid>` | Target a specific project UUID instead of the current project context. |

```bash
# Machine-readable output for scripting
agnt5 experiments list --format json | jq '.[].id'

# Pipe failures to jq for filtering
agnt5 reports failures <run-id> --format jsonl | jq 'select(.item.passed == false)'

# Target a specific project
agnt5 datasets list --project-id <project-id>
```

---


- **Commands**: `agnt5 experiments {list,create,run,runs list|show|cancel|compare|regression-dataset}`, `agnt5 reports {summary,failures,ci,wait,export}`, `agnt5 scores {list,evidence}`
- **Create requires**: `--name`, `--dataset-id`, `--dataset-version-id`, target (`--target-type component --deployment-id --component-name --component-type`, or `--target-type prompt --prompt-id`), at least one `--builtin-scorer <name|json>` or `--scorer-id <uuid>`
- **CI gating**: `experiments run --wait --fail-on-gate` and `reports ci|wait` exit non-zero on gate failure; threshold set via experiment `--config '{"passed_threshold": <0..1>}'`; exit codes: 2=gate failed, 3=run failed/cancelled, 4=timeout
- **Rescore**: `POST /api/v1/projects/{projectId}/experiments/{experimentId}/runs/{runId}/rescore`; only completed runs; optional `scorer_version_ids` and `run_item_ids` to narrow scope; `replace_latest: true` refreshes aggregates.
- **Annotations**: `POST /api/v1/projects/{projectId}/experiments/{experimentId}/runs/{runId}/items/{itemId}/annotations` with `label` (required), `name`, `metadata`; used for meta-evaluation against human judgement.
- **Model**: experiment versions and dataset versions are immutable; a run = one experiment version × one dataset version; per-item scores queryable via `scores list --run-id`.


## Next steps

* [Datasets](/docs/improve/datasets.md): grow the dataset behind the experiment and publish a new version.
* [Scorers](/docs/improve/scorers.md): tighten what pass/fail means, or write a custom scorer.
* [Quality cases](/docs/improve/quality-cases.md): track regressions and production failures through a structured lifecycle.
* [Deploying](/docs/run/deploying.md): ship the candidate deployment your experiment runs against.
* [Prompts](/docs/build/prompts.md): version the prompt artifacts that prompt-target experiments compare.
