Manual Verification Checklist

Work through this checklist before publishing results or tagging a release.

Prerequisites

terminal
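python -m venv .venv && .venv/Scripts/activate   # macOS/Linux: source .venv/bin/activate
pip install -e ".[dev]"   # quoted so shells such as zsh do not expand the brackets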

CLI Flow

terminal
# Run deterministic tasks
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
agent-bench run --agent agents/rate_limit_agent.py --task rate_limited_api@1 --seed 11
agent-bench run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed 7
agent-bench run --agent agents/ops_triage_agent.py --task log_alert_triage@1 --seed 21

# List recent artifacts
agent-bench runs list --limit 5

# Generate baseline snapshots
agent-bench baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1

# Compare runs
agent-bench baseline --compare .agent_bench/runs/<run_a>.json .agent_bench/runs/<run_b>.json

# Export frozen baseline
agent-bench baseline --export latest

# Replay a prior run
agent-bench run --replay <run_id>

# Validate registry
agent-bench tasks validate --registry
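
The four deterministic runs at the top of this flow can also be driven from a small script that reports which ones failed. The sketch below is only an illustration: it assumes `agent-bench` is on the PATH and that a non-zero exit code means a failed run, neither of which this checklist states. `build_command`, `run_all`, and the `runner` parameter are hypothetical names.

```python
import subprocess
from typing import Callable, Sequence

# Agent/task/seed triples copied from the "Run deterministic tasks" commands.
DETERMINISTIC_RUNS = [
    ("agents/toy_agent.py", "filesystem_hidden_config@1", 42),
    ("agents/rate_limit_agent.py", "rate_limited_api@1", 11),
    ("agents/chain_agent.py", "rate_limited_chain@1", 7),
    ("agents/ops_triage_agent.py", "log_alert_triage@1", 21),
]


def build_command(agent: str, task: str, seed: int) -> list[str]:
    """Build the argv list for one `agent-bench run` invocation."""
    return ["agent-bench", "run", "--agent", agent, "--task", task, "--seed", str(seed)]


def _default_runner(argv: Sequence[str]) -> int:
    # Assumption: a non-zero exit code signals a failed run.
    return subprocess.run(argv, check=False).returncode


def run_all(runner: Callable[[Sequence[str]], int] = _default_runner) -> list[str]:
    """Run every deterministic task and return the tasks whose command failed."""
    failed = []
    for agent, task, seed in DETERMINISTIC_RUNS:
        if runner(build_command(agent, task, seed)) != 0:
            failed.append(task)
    return failed
```

Calling `run_all()` with no arguments invokes the real CLI; passing a custom `runner` allows a dry run that only prints or records the commands.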

Web UI Flow

terminal
python -m uvicorn agent_bench.webui.app:app --reload
# Visit http://localhost:8000
  • Run the same agent/task combinations as in the CLI flow from the form.
  • Check that the result JSON matches the CLI output for the same run.
  • Click the trace links and verify the step entries.
  • Confirm the Baselines panel shows the correct rates.
  • Confirm the Guide page loads.
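
Comparing the Web UI result JSON with the CLI output by eye is error-prone, so a small helper can do it. The sketch below is a minimal example, assuming both results are available as JSON text and that some fields legitimately differ between runs. The ignored key names (`run_id`, `timestamp`) are guesses at the result schema, not confirmed by this checklist, and `strip_volatile` and `results_match` are hypothetical helper names.

```python
import json
from typing import Any

# Assumption: these keys differ between runs by design; adjust to the real schema.
VOLATILE_KEYS = frozenset({"run_id", "timestamp"})


def strip_volatile(value: Any, ignore: frozenset = VOLATILE_KEYS) -> Any:
    """Recursively drop ignored keys so two results can be compared."""
    if isinstance(value, dict):
        return {k: strip_volatile(v, ignore) for k, v in value.items() if k not in ignore}
    if isinstance(value, list):
        return [strip_volatile(v, ignore) for v in value]
    return value


def results_match(cli_json: str, web_json: str) -> bool:
    """Return True when both JSON documents agree outside the ignored keys."""
    return strip_volatile(json.loads(cli_json)) == strip_volatile(json.loads(web_json))
```

Key order does not affect the comparison, because parsed JSON objects are compared as dictionaries.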

Release Gating

  • Complete this checklist against the commit being released.
  • Archive run_id values referenced in reports.
  • Run full test suite: python -m pytest.
  • Verify harness version matches release tag.
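
The version check in the last item can be sketched in a few lines. This example assumes the harness version is the installed package version, that the distribution is named `agent-bench`, and that release tags may carry a leading `v` (for example `v1.2.0`). None of these is stated in this checklist, and the function names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def normalize_tag(tag: str) -> str:
    """Strip whitespace and a leading 'v' so 'v1.2.0' compares equal to '1.2.0'."""
    tag = tag.strip()
    return tag[1:] if tag.startswith("v") else tag


def version_matches_tag(installed: str, tag: str) -> bool:
    """Return True when the installed version equals the normalized tag."""
    return installed == normalize_tag(tag)


def installed_version(dist_name: str = "agent-bench") -> Optional[str]:
    # Assumption: the distribution name is "agent-bench"; pass the real name if it differs.
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None
```

The tag string itself would come from the release process, for instance the output of `git describe --tags --exact-match` on the release commit.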