Troubleshooting Guide

A quick reference for the most common TraceCore/agent-bench issues across installation, CLI runs, tasks, and the optional web UI.

When in doubt, inspect the latest artifact in .agent_bench/runs/.

Tip: Load .agent_bench/runs/<run_id>.json directly, or use agent-bench runs list --limit 5 to find recent run IDs. The dashboard trace viewer at /?trace_id=<run_id> surfaces the same validator and harness messages.
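As a minimal sketch of the tip above, the following Python snippet picks the most recently modified artifact in .agent_bench/runs/ and loads it. The helper name latest_run_artifact is hypothetical, and the artifact schema is not assumed here, so the snippet only prints the top-level keys.

```python
import json
from pathlib import Path
from typing import Optional


def latest_run_artifact(runs_dir: Path = Path(".agent_bench/runs")) -> Optional[Path]:
    """Return the most recently modified *.json file in runs_dir, or None."""
    if not runs_dir.is_dir():
        return None
    candidates = sorted(runs_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    path = latest_run_artifact()
    if path is None:
        print("No run artifacts found; run agent-bench from the repo root first.")
    else:
        data = json.loads(path.read_text(encoding="utf-8"))
        # The schema is not documented here, so only list the top-level keys.
        print(path.name, sorted(data) if isinstance(data, dict) else type(data).__name__)
```

Run it from the repo root so the relative path resolves.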

Installation & Environment

agent-bench: command not found

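  • Ensure you ran pip install -e ".[dev]" (an editable install keeps the CLI and registry in sync; the quotes stop shells such as zsh from treating the brackets as a glob pattern).
  • Activate the virtualenv before running commands (.venv\Scripts\activate on Windows, source .venv/bin/activate on macOS/Linux).
  • Verify you are invoking the same interpreter that owns the editable install (e.g., which python on macOS/Linux, where python on Windows).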
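The checks above can be sketched in Python. This is an illustrative helper, not part of agent-bench; the name describe_environment is hypothetical. It reports which interpreter is running, whether a virtualenv is active, and where (if anywhere) the CLI is found on PATH.

```python
import shutil
import sys


def describe_environment(cli_name: str = "agent-bench") -> dict:
    """Collect the interpreter path, virtualenv status, and the CLI found on PATH."""
    return {
        "interpreter": sys.executable,
        # Inside a virtualenv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # None here corresponds to the shell's "command not found".
        "cli_on_path": shutil.which(cli_name),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If cli_on_path is None, or it points outside the virtualenv that interpreter belongs to, the shell is not using the environment that holds the editable install.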

Windows-specific

Add %APPDATA%\Python\Python3x\Scripts (or the pipx shim dir) to PATH. See the "Windows PATH tip" in README.md for step-by-step instructions.

After editing PATH, open a new terminal so the shell picks up the change.

Common pitfalls

  • Launching agent-bench from PowerShell after activating the virtualenv in Command Prompt (or vice versa). Activate the env in the same shell you use to run the CLI so PATH and PYTHONPATH match.
  • Running commands from inside the .venv/ folder. Always run agent-bench from the repo root so relative imports (tasks, agents) resolve correctly.

ModuleNotFoundError: No module named 'agent_bench'

The active interpreter cannot import the package because it is not on the module search path. Activate the same virtualenv used for installation, or temporarily export PYTHONPATH="$(pwd)" from the repo root. Reinstall with pip install -e . if the editable link was removed.
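To confirm whether the current interpreter can see the package, and from where, a short check with the standard library's importlib may help. The helper name locate_package is hypothetical.

```python
import importlib.util
from typing import Optional


def locate_package(name: str = "agent_bench") -> Optional[str]:
    """Return where the package would be imported from, or None if it is not importable."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no origin; fall back to their search locations.
    return spec.origin or str(list(spec.submodule_search_locations or []))


if __name__ == "__main__":
    location = locate_package()
    print(location or "agent_bench is not importable from this interpreter")
```

For an editable install, the reported path should point into your checkout rather than into a copy under site-packages.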

Mixed Python versions between install and runtime

If python points at a different interpreter than the one that ran pip, the package and its console scripts are installed into a different environment than the one you run from. Pin a single interpreter via py -3.12 -m venv .venv && .venv\Scripts\activate (Windows) or python3.12 -m venv .venv && source .venv/bin/activate (macOS/Linux).

CLI Invocation Errors

Scaffold a new agent: new-agent

If the file already exists and --force is not set, the command exits non-zero with a clear error rather than silently overwriting.
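The refuse-to-overwrite behavior described above can be illustrated with Python's exclusive-creation file mode. This is only a sketch of the general pattern, not the new-agent implementation; write_scaffold and its force parameter are hypothetical names.

```python
from pathlib import Path


def write_scaffold(path: Path, content: str, force: bool = False) -> None:
    """Create path with content; raise FileExistsError if it exists and force is False."""
    # Mode "x" fails if the file already exists; "w" truncates and overwrites.
    mode = "w" if force else "x"
    with open(path, mode, encoding="utf-8") as handle:
        handle.write(content)
```

A caller would catch FileExistsError, print an error, and exit non-zero, which matches the behavior described for the command when --force is not set.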

Quick-start: run pairing

The fastest way to fire a known-good run without memorizing flags:

agent-bench run pairing log_stream_monitor          # run by name, seed 0
agent-bench run pairing log_stream_monitor --seed 7 # custom seed
agent-bench run pairing --list                      # show all available pairings

If the current directory contains exactly one paired agent file, you can omit the name and the CLI selects that pairing automatically. If the name is unknown or ambiguous, the CLI prints the pairing list and exits with a non-zero code.

Smoke-test every pairing in sequence (CI-friendly — exits non-zero if any fail):

agent-bench run pairing --all
agent-bench run pairing --all --seed 7 --timeout 120   # 120 s wall-clock limit per run
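Because the smoke test reports failure through its exit code, a CI script can gate on it. The sketch below runs the --all smoke test once per seed and collects the seeds that failed. The helper name failing_seeds is hypothetical, and it uses only the flags shown above.

```python
import subprocess
from typing import Iterable, List, Sequence

# Base command taken from the examples above.
PAIRING_SMOKE = ("agent-bench", "run", "pairing", "--all")


def failing_seeds(
    seeds: Iterable[int],
    base_command: Sequence[str] = PAIRING_SMOKE,
) -> List[int]:
    """Run base_command once per seed and return the seeds that exited non-zero."""
    failed = []
    for seed in seeds:
        result = subprocess.run([*base_command, "--seed", str(seed)], check=False)
        if result.returncode != 0:
            failed.append(seed)
    return failed
```

In a CI entry point, something like sys.exit(1 if failing_seeds([0, 7]) else 0) would propagate any failure to the job status.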

Wall-clock timeout: --timeout

Prevent a hung agent from blocking CI indefinitely:

agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 0 --timeout 60
agent-bench run pairing log_stream_monitor --timeout 90

If the run exceeds the limit, the CLI exits immediately with a non-zero code and a clear message. The timeout is enforced via a daemon thread so the process terminates cleanly.
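For intuition, here is one common way to build a wall-clock limit around a daemon thread. It is a generic sketch, and the actual harness may be structured differently; run_with_timeout is a hypothetical name.

```python
import threading
import time
from typing import Callable


def run_with_timeout(target: Callable[[], None], timeout_s: float) -> bool:
    """Run target in a daemon thread; return True if it finished within timeout_s."""
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # A daemon thread that is still alive does not block interpreter exit.
    return not worker.is_alive()


if __name__ == "__main__":
    finished = run_with_timeout(lambda: time.sleep(0.05), timeout_s=1.0)
    print("finished in time:", finished)
```

Since the worker is a daemon thread, a caller can report the timeout and exit with a non-zero code without waiting for the hung work to return.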

Inspect recent runs: runs summary

Print a compact table of recent runs without opening the dashboard:

agent-bench runs summary                                  # last 20 runs
agent-bench runs summary --task log_stream_monitor@1      # filter by task
agent-bench runs summary --failure-type budget_exhausted  # filter by outcome
agent-bench runs summary --limit 5                        # fewer rows

For raw JSON (e.g., for scripting) use agent-bench runs list with the same filters.
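A scripting sketch for the raw JSON path: it shells out to agent-bench runs list and parses stdout. It assumes the command writes a single JSON document to stdout; the shape of that document is not specified here, so the result is returned as-is. The helper name list_runs is hypothetical.

```python
import json
import subprocess
from typing import Any, Sequence


def list_runs(
    extra_args: Sequence[str] = (),
    command: Sequence[str] = ("agent-bench", "runs", "list"),
) -> Any:
    """Run the CLI with extra_args and parse its stdout as JSON (assumed format)."""
    completed = subprocess.run(
        [*command, *extra_args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return json.loads(completed.stdout)
```

For example, list_runs(["--failure-type", "budget_exhausted", "--limit", "5"]) would apply the same filters shown for runs summary.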

Task & Harness Issues

failure_type: budget_exhausted

The agent hit the step or tool-call ceiling before success or validator termination.

  • Check the trace for repeated actions: loops, redundant reads, or recovery loops such as repeating read_file on the same path.
  • Budgets are set in the task's task.toml manifest. To debug, temporarily increase them there and re-run.
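Spotting loops by eye gets tedious on long traces. The sketch below counts repeated (action, target) pairs. How you extract those pairs depends on the trace schema, which is not shown in this guide, so the input format and the name repeated_actions are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Step = Tuple[str, str]  # (action name, target), an assumed simplification of a trace entry


def repeated_actions(steps: Iterable[Step], threshold: int = 3) -> Dict[Step, int]:
    """Return steps that occur at least threshold times, with their counts."""
    counts = Counter(steps)
    return {step: n for step, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    demo = [("read_file", "config.yaml")] * 4 + [("list_dir", ".")]
    print(repeated_actions(demo))
```

A step that dominates the counts is a good first place to look when the budget runs out.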

failure_type: timeout or non_termination

Timeouts occur only if you pass --timeout or the task enforces one.

non_termination is reserved; if you see it, file a bug with the trace and harness version.