Manual Verification Checklist

Work through this checklist before publishing results or tagging a release.

Prerequisites

terminal

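python -m venv .venv
.venv/Scripts/activate    # Windows; on macOS/Linux use: source .venv/bin/activate
pip install -e ".[dev]"   # quoted so shells such as zsh do not expand the brackets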

CLI Flow

terminal

# Run deterministic tasks
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
agent-bench run --agent agents/rate_limit_agent.py --task rate_limited_api@1 --seed 11
agent-bench run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed 7
agent-bench run --agent agents/ops_triage_agent.py --task log_alert_triage@1 --seed 21

# List recent artifacts
agent-bench runs list --limit 5

# Generate baseline snapshots
agent-bench baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1

# Compare runs
agent-bench baseline --compare .agent_bench/runs/<run_a>.json .agent_bench/runs/<run_b>.json

# Export frozen baseline
agent-bench baseline --export latest

# Replay a prior run
agent-bench run --replay <run_id>

# Validate registry
agent-bench tasks validate --registry
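
The four deterministic runs at the top of this flow can also be driven from a short Python script. This is a minimal sketch, not part of the harness: it reuses only the `agent-bench run` flags shown above, and it assumes that the CLI returns a non-zero exit code when a run fails. The names `DETERMINISTIC_RUNS`, `build_run_command`, and `run_all` are hypothetical.

```python
import subprocess

# (agent path, task id, seed) triples copied from the CLI flow commands.
DETERMINISTIC_RUNS = [
    ("agents/toy_agent.py", "filesystem_hidden_config@1", 42),
    ("agents/rate_limit_agent.py", "rate_limited_api@1", 11),
    ("agents/chain_agent.py", "rate_limited_chain@1", 7),
    ("agents/ops_triage_agent.py", "log_alert_triage@1", 21),
]


def build_run_command(agent: str, task: str, seed: int) -> list[str]:
    """Return the argv list for one `agent-bench run` invocation."""
    return ["agent-bench", "run", "--agent", agent, "--task", task, "--seed", str(seed)]


def run_all() -> None:
    """Run each deterministic task in order.

    Assumption: agent-bench signals failure through its exit code, so
    check=True raises CalledProcessError at the first failing run.
    """
    for agent, task, seed in DETERMINISTIC_RUNS:
        subprocess.run(build_run_command(agent, task, seed), check=True)
```

Call `run_all()` from the repository root with the virtual environment active, so that the relative agent paths resolve.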

Web UI Flow

terminal

python -m uvicorn agent_bench.webui.app:app --reload
# Visit http://localhost:8000

Run the same agent/task combinations from the form.
Check that each result JSON matches the corresponding CLI output.
Click the trace links and verify the step entries.
Confirm that the Baselines panel shows the correct rates.
Confirm that the Guide page loads.
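
Comparing a Web UI result with its CLI counterpart by eye is error-prone, so the check can be scripted. The sketch below assumes both results have been saved as JSON files and that some fields legitimately differ between two runs. The key names in `VOLATILE_KEYS` are hypothetical placeholders, not fields confirmed by the harness; replace them with whatever your artifacts actually contain.

```python
import json
from pathlib import Path

# Hypothetical field names: adjust to the keys in your run artifacts that
# are expected to differ between two runs (identifiers, timestamps).
VOLATILE_KEYS = frozenset({"run_id", "started_at", "finished_at"})


def strip_volatile(value, ignore=VOLATILE_KEYS):
    """Recursively drop ignored keys so only comparable content remains."""
    if isinstance(value, dict):
        return {k: strip_volatile(v, ignore) for k, v in value.items() if k not in ignore}
    if isinstance(value, list):
        return [strip_volatile(item, ignore) for item in value]
    return value


def results_match(cli_path, webui_path, ignore=VOLATILE_KEYS) -> bool:
    """Return True when two result files agree outside the ignored keys."""
    cli = json.loads(Path(cli_path).read_text(encoding="utf-8"))
    webui = json.loads(Path(webui_path).read_text(encoding="utf-8"))
    return strip_volatile(cli, ignore) == strip_volatile(webui, ignore)
```

For example, `results_match("cli_result.json", "webui_result.json")` compares two saved files; both file names are placeholders.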

Release Gating

Complete this checklist against the current commit.
Archive the run_id values referenced in reports.
Run the full test suite: python -m pytest.
Verify that the harness version matches the release tag.
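
The version check in the last item can be sketched in Python. This sketch rests on two assumptions that the checklist does not confirm: that the harness version is the version of the installed Python distribution, and that release tags look like `v1.2.0` or `1.2.0`. The distribution name passed to `version_matches_tag` is up to you; the helper names are hypothetical.

```python
import subprocess
from importlib.metadata import version


def normalize_tag(tag: str) -> str:
    """Strip whitespace and one leading 'v' so 'v1.2.0' compares equal to '1.2.0'."""
    tag = tag.strip()
    return tag[1:] if tag.startswith("v") else tag


def current_tag() -> str:
    """Return the tag pointing at HEAD.

    Raises CalledProcessError when HEAD carries no tag, because
    --exact-match refuses to fall back to an earlier tag.
    """
    result = subprocess.run(
        ["git", "describe", "--tags", "--exact-match"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def version_matches_tag(dist_name: str, tag: str) -> bool:
    """Compare the installed distribution's version with a release tag.

    Assumption: the harness version equals the installed package version.
    """
    return version(dist_name) == normalize_tag(tag)
```

A typical call would be `version_matches_tag("<distribution-name>", current_tag())`, run from the tagged checkout with the package installed.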