Manual Verification Checklist
Work through this checklist before publishing results or tagging a release.
Prerequisites
terminal
python -m venv .venv && .venv/Scripts/activate
pip install -e .[dev]
CLI Flow
terminal
# Run deterministic tasks
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
agent-bench run --agent agents/rate_limit_agent.py --task rate_limited_api@1 --seed 11
agent-bench run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed 7
agent-bench run --agent agents/ops_triage_agent.py --task log_alert_triage@1 --seed 21

# List recent artifacts
agent-bench runs list --limit 5

# Generate baseline snapshots
agent-bench baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1

# Compare runs
agent-bench baseline --compare .agent_bench/runs/<run_a>.json .agent_bench/runs/<run_b>.json

# Export frozen baseline
agent-bench baseline --export latest

# Replay a prior run
agent-bench run --replay <run_id>

# Validate registry
agent-bench tasks validate --registry
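The four deterministic runs above can also be driven from a small helper, which makes it harder to skip one by accident. The following is a minimal sketch, not part of the harness: the agent paths, task ids, and seeds are copied from the commands above, while the helper names (`build_run_command`, `run_all`) are hypothetical, and it assumes that `agent-bench run` exits with a non-zero status on failure.

```python
import subprocess

# (agent script, task id, seed) triples copied from the CLI flow above.
DETERMINISTIC_RUNS = [
    ("agents/toy_agent.py", "filesystem_hidden_config@1", 42),
    ("agents/rate_limit_agent.py", "rate_limited_api@1", 11),
    ("agents/chain_agent.py", "rate_limited_chain@1", 7),
    ("agents/ops_triage_agent.py", "log_alert_triage@1", 21),
]


def build_run_command(agent, task, seed):
    """Return the argv list for one `agent-bench run` invocation."""
    return ["agent-bench", "run", "--agent", agent, "--task", task, "--seed", str(seed)]


def run_all(runs=DETERMINISTIC_RUNS, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit.

    Assumption: a non-zero exit status means the run failed.
    Returns a list of (command, returncode) pairs for the commands executed.
    """
    results = []
    for agent, task, seed in runs:
        cmd = build_run_command(agent, task, seed)
        completed = runner(cmd)
        results.append((cmd, completed.returncode))
        if completed.returncode != 0:
            break
    return results
```

Calling `run_all()` from the repository root, with the virtual environment active, executes the four runs in the listed order. The `runner` parameter exists only so the helper can be exercised without invoking the real CLI.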
Web UI Flow
terminal
python -m uvicorn agent_bench.webui.app:app --reload
# Visit http://localhost:8000
- Run the same agent/task combinations from the form.
- Check that the result JSON matches the CLI output.
- Click the trace links and verify the step entries.
- Confirm that the Baselines panel shows the correct rates.
- Confirm that the Guide page loads.
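Comparing the Web UI result JSON with the CLI output by eye is error-prone, so a small comparison helper may be useful. This is a sketch under stated assumptions: it assumes both results have been saved to files, and the keys in `VOLATILE_KEYS` are hypothetical placeholders for fields that legitimately differ between runs. Adjust them to the actual artifact schema.

```python
import json
from pathlib import Path

# Hypothetical field names; replace with the keys that really vary per run.
VOLATILE_KEYS = frozenset({"run_id", "started_at", "finished_at"})


def strip_volatile(value, volatile=VOLATILE_KEYS):
    """Recursively drop keys that are expected to differ between runs."""
    if isinstance(value, dict):
        return {k: strip_volatile(v, volatile) for k, v in value.items() if k not in volatile}
    if isinstance(value, list):
        return [strip_volatile(v, volatile) for v in value]
    return value


def results_match(cli_result, web_result, volatile=VOLATILE_KEYS):
    """Return True when two parsed results agree once volatile keys are removed."""
    return strip_volatile(cli_result, volatile) == strip_volatile(web_result, volatile)


def load_result(path):
    """Parse one saved result file as JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

For example, `results_match(load_result("cli.json"), load_result("web.json"))` compares two saved results. Both file names are placeholders.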
Release Gating
- Complete this checklist against the current commit.
- Archive the run_id values referenced in reports.
- Run the full test suite: python -m pytest.
- Verify that the harness version matches the release tag.
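The version check in the last item could be automated along these lines. This is only a sketch: it assumes that release tags use the harness version with an optional leading `v` (for example `v1.2.3`) and that the current commit carries the tag. How the harness exposes its own version is not specified here, so the version string is passed in by the caller.

```python
import subprocess


def normalize_tag(tag):
    """Strip whitespace and an optional leading 'v' from a tag name."""
    tag = tag.strip()
    return tag[1:] if tag.startswith("v") else tag


def tag_matches_version(tag, harness_version):
    """Return True when the tag, minus any leading 'v', equals the version string."""
    return normalize_tag(tag) == harness_version.strip()


def current_tag():
    """Return the tag pointing at HEAD.

    `git describe --tags --exact-match` exits non-zero when HEAD is untagged,
    in which case subprocess raises CalledProcessError.
    """
    result = subprocess.run(
        ["git", "describe", "--tags", "--exact-match"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

A release script might call `tag_matches_version(current_tag(), harness_version)` and abort when it returns `False`, with `harness_version` read from wherever the project records it.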