Benchmark Contract Specification (v0.1)

The contract covers task manifests, task interface modules, runner budgets, run artifacts, and CLI behavior.

Scope

  • Task manifests (task.toml) and registry alignment.
  • Task interface modules (setup.py, actions.py, validate.py).
  • Runner budgets and determinism expectations.
  • Run artifacts and baseline exports.
  • CLI behavior that reads/writes these artifacts.

Task Contract

  • Required files: task.toml (preferred) or legacy task.yaml, setup.py, actions.py, validate.py.
  • Required manifest fields: id, suite, version, description, deterministic, seed_behavior, budgets.steps, budgets.tool_calls, action_surface.source, validator.entrypoint.
  • Optional fields: setup.entrypoint, action_surface.schema.
  • Registry entries must match the manifest id, suite, and version.
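
The field requirements above can be checked mechanically. The following is a minimal sketch, not the validator shipped with the CLI. It assumes the manifest has already been parsed into a dict (for example with Python's `tomllib`) and that dotted names such as budgets.steps map onto nested TOML tables. The function names and the shape of a registry entry are hypothetical.

```python
# Required manifest fields as listed in the Task Contract.
# Assumption: dotted names correspond to nested tables in task.toml.
REQUIRED_FIELDS = (
    "id", "suite", "version", "description", "deterministic", "seed_behavior",
    "budgets.steps", "budgets.tool_calls",
    "action_surface.source", "validator.entrypoint",
)


def missing_fields(manifest: dict) -> list[str]:
    """Return the required dotted fields that are absent from a parsed manifest."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node = manifest
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing


def registry_mismatches(manifest: dict, entry: dict) -> list[str]:
    """Return which of id, suite, version differ between manifest and registry entry.

    The registry entry is assumed (hypothetically) to be a flat dict.
    """
    return [k for k in ("id", "suite", "version") if manifest.get(k) != entry.get(k)]


# Placeholder manifest; every value is illustrative, not taken from a real task.
EXAMPLE_MANIFEST = {
    "id": "example_task",
    "suite": "example_suite",
    "version": 1,
    "description": "Placeholder task",
    "deterministic": True,
    "seed_behavior": "fixed",
    "budgets": {"steps": 10, "tool_calls": 5},
    "action_surface": {"source": "actions.py"},
    "validator": {"entrypoint": "validate.py"},
}

if __name__ == "__main__":
    print(missing_fields(EXAMPLE_MANIFEST))
```

Returning a list of missing names, rather than raising on the first one, lets a caller report every problem in a manifest in a single pass.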

Determinism Contract

  • Given a fixed seed, a task must produce identical outcomes on every run.
  • Any change that alters behavior requires a new task version and updates to SPEC_FREEZE.md.
  • Regression checks should include deterministic replays and baseline comparisons.
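
A deterministic replay check can be as simple as running the same seed several times and comparing outcomes. The sketch below is illustrative only: `replay_is_deterministic`, `seeded_run`, and `leaky_run` are hypothetical names, and a real check would call the runner rather than these stand-in functions.

```python
import random
from typing import Callable


def replay_is_deterministic(run: Callable[[int], dict], seed: int, repeats: int = 3) -> bool:
    """Run the same seed `repeats` times and report whether all outcomes match."""
    first = run(seed)
    return all(run(seed) == first for _ in range(repeats - 1))


def seeded_run(seed: int) -> dict:
    # Hypothetical well-behaved task: all randomness comes from a local,
    # seeded generator, so the outcome depends only on the seed.
    rng = random.Random(seed)
    return {"success": True, "steps_used": rng.randint(1, 10)}


_hidden_state = {"calls": 0}


def leaky_run(seed: int) -> dict:
    # Hypothetical broken task: module-level state leaks between runs,
    # so the outcome changes even though the seed does not.
    _hidden_state["calls"] += 1
    return {"success": True, "steps_used": _hidden_state["calls"]}


if __name__ == "__main__":
    print(replay_is_deterministic(seeded_run, seed=42))
    print(replay_is_deterministic(leaky_run, seed=42))
```

The second stand-in shows the usual failure mode: state that outlives a single run, such as globals, caches, or wall-clock reads, makes a fixed seed insufficient.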

Budget Contract

  • budgets.steps and budgets.tool_calls are enforced by the runner.
  • Tasks must be solvable by the reference agents within the declared budgets.
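
To make the enforcement rule concrete, here is one way a runner might track the two budgets. This is a sketch under assumptions, not the runner's actual code: `BudgetTracker` and `BudgetExceeded` are hypothetical names, and it assumes that every tool call also consumes one step.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Hypothetical error raised when a declared budget would be exceeded."""


@dataclass
class BudgetTracker:
    steps: int            # budgets.steps from the manifest
    tool_calls: int       # budgets.tool_calls from the manifest
    steps_used: int = 0
    tool_calls_used: int = 0

    def charge(self, *, tool_call: bool = False) -> None:
        """Charge one step, and optionally one tool call, against the budgets.

        Both limits are checked before either counter changes, so a rejected
        charge leaves the tracker untouched.
        """
        if self.steps_used + 1 > self.steps:
            raise BudgetExceeded("steps")
        if tool_call and self.tool_calls_used + 1 > self.tool_calls:
            raise BudgetExceeded("tool_calls")
        self.steps_used += 1
        if tool_call:
            self.tool_calls_used += 1


if __name__ == "__main__":
    tracker = BudgetTracker(steps=2, tool_calls=1)
    tracker.charge(tool_call=True)
    tracker.charge()
    print(tracker.steps_used, tracker.tool_calls_used)
```

The counters deliberately share names with the steps_used and tool_calls_used fields of a run artifact, so the final values can be written out directly.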

Artifact Contract

Run artifacts (.agent_bench/runs/*.json) must include:

  • run_id, task_ref, seed, success, failure_type, failure_reason
  • steps_used, tool_calls_used
  • action_trace (step-by-step entries)
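
The required keys listed above lend themselves to a simple presence check. The sketch below is illustrative: `missing_run_keys` is a hypothetical helper, the example values are placeholders, and the check ignores unknown keys on the assumption that additive fields are allowed.

```python
import json

# Required run-artifact keys as listed in the Artifact Contract.
REQUIRED_RUN_KEYS = frozenset({
    "run_id", "task_ref", "seed", "success", "failure_type", "failure_reason",
    "steps_used", "tool_calls_used", "action_trace",
})


def missing_run_keys(artifact_text: str) -> list[str]:
    """Return the required keys absent from a run artifact, sorted by name.

    Extra keys are ignored, so additive fields do not cause a failure.
    """
    artifact = json.loads(artifact_text)
    return sorted(REQUIRED_RUN_KEYS - set(artifact))


# Placeholder artifact; values and the trace entry shape are illustrative only.
EXAMPLE_RUN = json.dumps({
    "run_id": "example-run",
    "task_ref": "example_task@1",
    "seed": 0,
    "success": True,
    "failure_type": None,
    "failure_reason": None,
    "steps_used": 3,
    "tool_calls_used": 1,
    "action_trace": [{"step": 1}],
})

if __name__ == "__main__":
    print(missing_run_keys(EXAMPLE_RUN))
```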

Baseline exports (.agent_bench/baselines/*.json) must include:

  • Aggregated metrics (success rate, average steps/tool calls)
  • Metadata describing filters and generation time
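
As an illustration of how a baseline export relates to run artifacts, the sketch below aggregates a list of parsed runs into the metrics and metadata named above. The output key names (`metrics`, `success_rate`, `avg_steps`, `avg_tool_calls`, `metadata`, `generated_at`, `run_count`) and the `build_baseline` function are assumptions made for this example; the contract only states which kinds of information must be present.

```python
from datetime import datetime, timezone
from statistics import mean


def build_baseline(runs: list[dict], filters: dict) -> dict:
    """Aggregate parsed run artifacts into a baseline-style summary."""
    if not runs:
        raise ValueError("cannot build a baseline from zero runs")
    return {
        "metrics": {
            # Fraction of runs whose `success` field is truthy.
            "success_rate": sum(1 for r in runs if r["success"]) / len(runs),
            "avg_steps": mean(r["steps_used"] for r in runs),
            "avg_tool_calls": mean(r["tool_calls_used"] for r in runs),
        },
        "metadata": {
            "filters": filters,
            # Generation time in UTC, ISO 8601 format.
            "generated_at": datetime.now(timezone.utc).isoformat(),
            "run_count": len(runs),
        },
    }


if __name__ == "__main__":
    runs = [
        {"success": True, "steps_used": 4, "tool_calls_used": 2},
        {"success": False, "steps_used": 6, "tool_calls_used": 4},
    ]
    print(build_baseline(runs, filters={"suite": "example_suite"}))
```

Rejecting an empty run list avoids a division by zero and keeps a baseline from silently reporting metrics that no run supports.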

Compatibility Rules

  • Additive fields are allowed.
  • Breaking changes require a new contract version and a release note.
  • The CLI must remain backward compatible with published artifacts.

Validation Tooling

Validate a single task directory or the full registry from a terminal:

    agent-bench tasks validate --path path/to/task
    agent-bench tasks validate --registry