Benchmark Contract Specification (v0.1)
The contract covers task manifests, task interface modules, runner budgets, run artifacts, and CLI behavior.
Scope
- Task manifests (task.toml) and registry alignment.
- Task interface modules (setup.py, actions.py, validate.py).
- Runner budgets and determinism expectations.
- Run artifacts and baseline exports.
- CLI behavior that reads/writes these artifacts.
Task Contract
- Required files: a manifest (task.toml preferred, or legacy task.yaml), plus setup.py, actions.py, and validate.py.
- Required manifest fields: id, suite, version, description, deterministic, seed_behavior, budgets.steps, budgets.tool_calls, action_surface.source, validator.entrypoint.
- Optional fields: setup.entrypoint, action_surface.schema.
- Registry entries must match the manifest id, suite, and version.
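The manifest and registry rules above can be sketched as a small check. This is a minimal illustration, not the project's validator: the manifest values, the helper names (lookup, missing_fields, registry_mismatches), and the shape of a registry entry are all assumptions. The dict mirrors the nested tables a TOML parser would return for dotted keys such as budgets.steps.

```python
# Required manifest fields from the Task Contract, written as dotted paths.
REQUIRED_FIELDS = [
    "id", "suite", "version", "description", "deterministic",
    "seed_behavior", "budgets.steps", "budgets.tool_calls",
    "action_surface.source", "validator.entrypoint",
]

# Hypothetical parsed manifest: every value below is a placeholder.
manifest = {
    "id": "example_task",
    "suite": "example_suite",
    "version": "1",
    "description": "Placeholder task used to illustrate the layout.",
    "deterministic": True,
    "seed_behavior": "fixed",
    "budgets": {"steps": 50, "tool_calls": 20},
    "action_surface": {"source": "actions.py"},
    "validator": {"entrypoint": "validate.py"},
}


def lookup(data, dotted):
    """Return the value at a dotted path, or None if a segment is missing."""
    node = data
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def missing_fields(data):
    """List required fields that are absent from the manifest."""
    return [name for name in REQUIRED_FIELDS if lookup(data, name) is None]


def registry_mismatches(data, entry):
    """List the keys (id, suite, version) on which a registry entry disagrees."""
    return [k for k in ("id", "suite", "version") if data.get(k) != entry.get(k)]


print(missing_fields(manifest))
```

A check like this would flag a manifest whose budgets table is missing, and would report "version" when a registry entry lags behind the manifest.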
Determinism Contract
- For a fixed seed, a task must produce identical outcomes on every run.
- Any change that alters behavior requires a new task version and updates to SPEC_FREEZE.md.
- Regression checks should include deterministic replays and baseline comparisons.
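A deterministic replay check might look like the sketch below. The run_task function is a hypothetical stand-in for a real task run, and the outcome fields it returns are chosen only for illustration. The point it shows is that all randomness flows from one seeded generator, so repeated runs with the same seed can be compared for equality.

```python
import random


def run_task(seed, steps=5):
    """Hypothetical task run: all randomness comes from one seeded RNG."""
    rng = random.Random(seed)
    trace = [rng.randint(0, 9) for _ in range(steps)]
    return {"seed": seed, "success": sum(trace) % 2 == 0, "action_trace": trace}


def replay_is_deterministic(seed, replays=3):
    """Re-run the task several times and compare against the first outcome."""
    first = run_task(seed)
    return all(run_task(seed) == first for _ in range(replays))


print(replay_is_deterministic(seed=0))
```

Using a private random.Random instance rather than the module-level functions keeps the run independent of any other code that touches the global random state.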
Budget Contract
- budgets.steps and budgets.tool_calls are enforced by the runner.
- Tasks must be solvable by the reference agents within the declared budgets.
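One way a runner could enforce the two budgets is with a counter that refuses further work once a limit is reached. This is a sketch under assumptions: BudgetTracker, BudgetExceeded, and the charge method are hypothetical names, and the actual runner may account for steps and tool calls differently.

```python
class BudgetExceeded(Exception):
    """Raised when a run tries to spend past a declared budget."""


class BudgetTracker:
    """Hypothetical counter for budgets.steps and budgets.tool_calls."""

    def __init__(self, steps, tool_calls):
        self.limits = {"steps": steps, "tool_calls": tool_calls}
        self.used = {"steps": 0, "tool_calls": 0}

    def charge(self, kind):
        # Refuse the charge if the budget for this kind is already spent.
        if self.used[kind] >= self.limits[kind]:
            raise BudgetExceeded(f"{kind} budget of {self.limits[kind]} exhausted")
        self.used[kind] += 1


tracker = BudgetTracker(steps=2, tool_calls=1)
tracker.charge("steps")
tracker.charge("tool_calls")
tracker.charge("steps")
try:
    tracker.charge("steps")  # third step exceeds the budget of 2
    exhausted = False
except BudgetExceeded:
    exhausted = True
print(exhausted, tracker.used)
```

The final counts in tracker.used correspond to the steps_used and tool_calls_used values a run artifact would record.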
Artifact Contract
Run artifacts (.agent_bench/runs/*.json) must include:
- run_id, task_ref, seed, success, failure_type, failure_reason
- steps_used, tool_calls_used
- action_trace (step-by-step entries)
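The required run-artifact fields can be checked with a few lines of code. In the sketch below the key names come from the list above, while the example values, the shape of an action_trace entry, and the helper name missing_run_keys are assumptions made for illustration.

```python
import json

# Keys that every run artifact must include, per the list above.
REQUIRED_RUN_KEYS = {
    "run_id", "task_ref", "seed", "success", "failure_type",
    "failure_reason", "steps_used", "tool_calls_used", "action_trace",
}

# Hypothetical artifact: all values are placeholders.
example_run = json.loads("""
{
  "run_id": "example-run",
  "task_ref": "example_task",
  "seed": 0,
  "success": true,
  "failure_type": null,
  "failure_reason": null,
  "steps_used": 3,
  "tool_calls_used": 1,
  "action_trace": [{"step": 1}, {"step": 2}, {"step": 3}]
}
""")


def missing_run_keys(artifact):
    """Return the required keys that the artifact does not contain."""
    return sorted(REQUIRED_RUN_KEYS - set(artifact))


print(missing_run_keys(example_run))
```

The check looks at key presence only, so a successful run can carry null for failure_type and failure_reason and still pass.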
Baseline exports (.agent_bench/baselines/*.json) must include:
- Aggregated metrics (success rate, average steps/tool calls)
- Metadata describing filters and generation time
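A baseline export could be derived from a set of run artifacts roughly as follows. This is a sketch only: the output key names (metrics, metadata, success_rate, avg_steps, avg_tool_calls, filters, generated_at) and the build_baseline helper are hypothetical, since the contract names the required content but not the exact layout.

```python
from datetime import datetime, timezone
from statistics import mean


def build_baseline(runs, filters):
    """Aggregate run artifacts into a baseline-style summary (hypothetical layout)."""
    if not runs:
        raise ValueError("cannot build a baseline from zero runs")
    return {
        "metrics": {
            "success_rate": mean(1.0 if r["success"] else 0.0 for r in runs),
            "avg_steps": mean(r["steps_used"] for r in runs),
            "avg_tool_calls": mean(r["tool_calls_used"] for r in runs),
        },
        "metadata": {
            "filters": filters,
            "generated_at": datetime.now(timezone.utc).isoformat(),
        },
    }


runs = [
    {"success": True, "steps_used": 4, "tool_calls_used": 2},
    {"success": False, "steps_used": 6, "tool_calls_used": 4},
]
baseline = build_baseline(runs, filters={"suite": "example_suite"})
print(baseline["metrics"])
```

Recording the filters alongside the metrics lets a later comparison confirm that two baselines were computed over the same selection of runs.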
Compatibility Rules
- Additive fields are allowed.
- Breaking changes require a new contract version and a release note.
- The CLI must remain backward compatible with published artifacts.
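Because additive fields are allowed, a reader of published artifacts should tolerate keys it does not recognize. The sketch below shows one way to do that. RunSummary and parse_run_summary are hypothetical names, and the three fields chosen are a subset of the run-artifact keys picked only to keep the example short.

```python
from dataclasses import dataclass, fields


@dataclass
class RunSummary:
    """Hypothetical view over a few run-artifact fields."""
    run_id: str
    seed: int
    success: bool


def parse_run_summary(raw):
    """Build a RunSummary from a dict, ignoring fields added by newer versions."""
    known = {f.name for f in fields(RunSummary)}
    return RunSummary(**{k: v for k, v in raw.items() if k in known})


# An artifact carrying an extra field that this reader does not know about.
newer_artifact = {
    "run_id": "example-run",
    "seed": 0,
    "success": True,
    "extra_field_from_newer_version": "ignored",
}
summary = parse_run_summary(newer_artifact)
print(summary)
```

A reader written this way keeps working when fields are added, while a missing required field still fails loudly with a TypeError from the dataclass constructor.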
Validation Tooling
    agent-bench tasks validate --path path/to/task
    agent-bench tasks validate --registry