Benchmark Contract Specification (v0.1)

The contract covers task manifests, task interface modules, runner budgets, run artifacts, and CLI behavior.

Scope

Task manifests (task.toml) and registry alignment.
Task interface modules (setup.py, actions.py, validate.py).
Runner budgets and determinism expectations.
Run artifacts and baseline exports.
CLI behavior that reads/writes these artifacts.

Task Contract

Required files: a manifest (task.toml preferred, or legacy task.yaml), plus setup.py, actions.py, and validate.py.
Required manifest fields: id, suite, version, description, deterministic, seed_behavior, budgets.steps, budgets.tool_calls, action_surface.source, validator.entrypoint.
Optional manifest fields: setup.entrypoint, action_surface.schema.
Registry entries must match the manifest id, suite, and version.

Determinism Contract

For a fixed seed, a task must produce identical outcomes on every run.
Any change that alters task behavior requires a new task version and an update to SPEC_FREEZE.md.
Regression checks should include deterministic replays and baseline comparisons.
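
A deterministic replay check could look roughly like the following sketch: run the same task several times with one seed and compare each outcome to the first. The run_task callable and both example tasks are hypothetical stand-ins; the real runner interface is not described in this specification.

```python
import itertools
import random
from typing import Callable


def replay_is_deterministic(run_task: Callable[[int], dict], seed: int, repeats: int = 3) -> bool:
    """Run the task `repeats` times with one seed; all outcomes must equal the first."""
    first = run_task(seed)
    return all(run_task(seed) == first for _ in range(repeats - 1))


def seeded_task(seed: int) -> dict:
    """Hypothetical task: all randomness comes from a generator built from the seed."""
    rng = random.Random(seed)
    return {"success": True, "steps_used": rng.randint(1, 10)}


_hidden_state = itertools.count()


def leaky_task(seed: int) -> dict:
    """Hypothetical counterexample: hidden state outlives the run, so outcomes drift."""
    return {"success": True, "steps_used": next(_hidden_state)}


print(replay_is_deterministic(seeded_task, seed=7))
print(replay_is_deterministic(leaky_task, seed=7))
```

The second task ignores its seed and depends on module-level state, which is the kind of behavior a replay check is meant to catch.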

Budget Contract

budgets.steps and budgets.tool_calls are enforced by the runner.
Tasks must be solvable by the reference agents within the declared budgets.
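
To make the enforcement rule concrete, here is a minimal sketch of a budget tracker. It assumes the runner counts one step per agent action plus the tool calls made in that step, and stops the run once either limit would be exceeded; the BudgetTracker and BudgetExceeded names are hypothetical, and the actual runner may account for usage differently.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a run would use more than its declared budget."""


@dataclass
class BudgetTracker:
    max_steps: int        # from budgets.steps
    max_tool_calls: int   # from budgets.tool_calls
    steps_used: int = 0
    tool_calls_used: int = 0

    def record_step(self, tool_calls: int = 0) -> None:
        """Count one step and its tool calls; refuse if either budget would be exceeded."""
        if self.steps_used + 1 > self.max_steps:
            raise BudgetExceeded("steps")
        if self.tool_calls_used + tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool_calls")
        self.steps_used += 1
        self.tool_calls_used += tool_calls


tracker = BudgetTracker(max_steps=2, max_tool_calls=1)
tracker.record_step(tool_calls=1)
tracker.record_step()
try:
    tracker.record_step()  # a third step is over the step budget
except BudgetExceeded as exc:
    print(f"budget exhausted: {exc}")
```

Checking both limits before updating the counters keeps steps_used and tool_calls_used within the declared budgets even when a step is rejected.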

Artifact Contract

Run artifacts (.agent_bench/runs/*.json) must include:

run_id, task_ref, seed, success, failure_type, failure_reason
steps_used, tool_calls_used
action_trace (step-by-step entries)
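
The run artifact fields listed above can be checked with a few lines of code. The sketch below only verifies that the keys are present; the example artifact, including the task_ref format and the shape of the action_trace entries, is hypothetical, since this specification does not define value formats.

```python
# Required top-level keys of a run artifact, as listed above.
REQUIRED_RUN_KEYS = {
    "run_id", "task_ref", "seed", "success", "failure_type", "failure_reason",
    "steps_used", "tool_calls_used", "action_trace",
}


def missing_run_keys(artifact: dict) -> list[str]:
    """Return the required keys absent from a run artifact, sorted by name."""
    return sorted(REQUIRED_RUN_KEYS - artifact.keys())


# Hypothetical artifact; all values are illustrative only.
artifact = {
    "run_id": "run-0001",
    "task_ref": "example_task@1",
    "seed": 7,
    "success": True,
    "failure_type": None,
    "failure_reason": None,
    "steps_used": 3,
    "tool_calls_used": 2,
    "action_trace": [{"step": 1, "action": "noop"}],
}

print(missing_run_keys(artifact))
```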

Baseline exports (.agent_bench/baselines/*.json) must include:

Aggregated metrics (success rate, average steps/tool calls)
Metadata describing filters and generation time
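
As an illustration of the aggregated metrics, the following sketch derives a success rate and average step and tool-call counts from a list of run artifacts. The output key names (metrics, success_rate, avg_steps, avg_tool_calls, metadata, filters, generated_at) are assumptions made for this example; the specification names the contents of a baseline export but not its exact layout.

```python
from datetime import datetime, timezone


def build_baseline(runs: list[dict], filters: dict) -> dict:
    """Aggregate run artifacts into a baseline-style summary (hypothetical layout)."""
    if not runs:
        raise ValueError("cannot build a baseline from zero runs")
    n = len(runs)
    return {
        "metrics": {
            "success_rate": sum(1 for r in runs if r["success"]) / n,
            "avg_steps": sum(r["steps_used"] for r in runs) / n,
            "avg_tool_calls": sum(r["tool_calls_used"] for r in runs) / n,
        },
        "metadata": {
            "filters": filters,
            "generated_at": datetime.now(timezone.utc).isoformat(),
        },
    }


# Two hypothetical runs, reduced to the fields the aggregation needs.
runs = [
    {"success": True, "steps_used": 4, "tool_calls_used": 2},
    {"success": False, "steps_used": 6, "tool_calls_used": 4},
]
baseline = build_baseline(runs, filters={"suite": "examples"})
print(baseline["metrics"])
```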

Compatibility Rules

Additive fields are allowed.
Breaking changes require a new contract version and a release note.
The CLI must remain backward compatible with published artifacts.
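
One way a reader can stay compatible while additive fields appear is to keep only the keys it knows and ignore the rest. The sketch below shows that pattern under the assumption that unknown keys are safe to drop; RunSummary and load_summary are hypothetical names, not part of the CLI.

```python
from dataclasses import dataclass, fields


@dataclass
class RunSummary:
    """The subset of run artifact fields this hypothetical reader cares about."""
    run_id: str
    success: bool
    steps_used: int


def load_summary(artifact: dict) -> RunSummary:
    """Build a summary from known keys only, so newly added fields do not break it."""
    known = {f.name for f in fields(RunSummary)}
    return RunSummary(**{k: v for k, v in artifact.items() if k in known})


# An artifact carrying a field added after this reader was written (hypothetical).
newer_artifact = {"run_id": "run-0002", "success": True, "steps_used": 3, "new_field": "x"}
print(load_summary(newer_artifact))
```

A removed or renamed required key would still raise a TypeError here, which matches the rule that breaking changes need a new contract version.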

Validation Tooling

Run the validator from a terminal:

agent-bench tasks validate --path path/to/task
agent-bench tasks validate --registry