Record Mode Implemented
Record-mode capture, deterministic replays, and dashboard diagnostics now form a single workflow. TraceCore seals agent behavior end-to-end — from CLI compare to web-based inspection.
Why this release matters
Record mode has long been the promised end state: launch an agent, capture its every action, and seal that execution into a contract that CI can enforce indefinitely. The missing pieces were surfacing budget drift, highlighting taxonomy changes, and giving teams a way to inspect divergences without parsing raw JSON.
With v0.8.0, Record → Diff → Visualize is a single flow.
The new workflow
1. Record with CLI
agent-bench baseline --compare gained a --format pretty view. It shows status, run metadata, budget deltas, and the first divergent steps using Rich tables.
agent-bench baseline --compare run_a.json run_b.json --format pretty --show-taxonomy

Compare: DIFFERENT
Agent A: agents/runbook_verifier_agent.py
Budget Usage: steps 10 → 10 (Δ 0)
Failure Taxonomy: budget_exhausted → success
Per-Step Differences: first 5 shown
A new --show-taxonomy flag highlights failure-type drift so budget exhaustion, sandbox violations, and logic failures are obvious in CI logs.
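The comparison the pretty view surfaces can be sketched roughly as follows. This is an illustrative sketch, not agent-bench's implementation; the run-bundle field names (`budget`, `taxonomy`) are assumptions made up for this example.

```python
# Hypothetical run-bundle shape: "budget" maps budget names to used counts,
# "taxonomy" holds the run's failure classification. Both names are
# assumptions for illustration, not agent-bench's actual schema.
def budget_and_taxonomy_diff(baseline: dict, current: dict) -> dict:
    """Compute the budget deltas and failure-taxonomy drift that a
    pretty compare view would surface in CI logs."""
    deltas = {
        key: current["budget"][key] - baseline["budget"][key]
        for key in baseline["budget"]
    }
    return {
        "budget_deltas": deltas,
        "taxonomy_drift": (baseline["taxonomy"], current["taxonomy"]),
        "drifted": baseline["taxonomy"] != current["taxonomy"],
    }

run_a = {"budget": {"steps": 10, "tool_calls": 4}, "taxonomy": "budget_exhausted"}
run_b = {"budget": {"steps": 10, "tool_calls": 6}, "taxonomy": "success"}
print(budget_and_taxonomy_diff(run_a, run_b))
```

The same delta and drift pair is what makes a `budget_exhausted → success` flip stand out at a glance in a CI log.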
2. Analyze in the dashboard
The TraceCore web UI now mirrors the CLI's insights:
- Budget burn chart plotting remaining steps/tool_calls per trace step.
- Outcome taxonomy badges (green/amber/red) in Trace Viewer + Recent Runs.
- Delta tables showing baseline vs current actions with result-change markers.
- Color-coded budget drift numbers that match CLI pretty output.
Choosing any run surfaces the budget chart and taxonomy badge immediately. Comparing runs inside the Baselines tab renders the delta table plus the divergent step details.
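The data behind a budget burn chart is simple to derive from a trace. Here is a minimal sketch, assuming each trace step is a dict with an optional `tool_calls` count; the trace shape is an assumption, not TraceCore's actual format.

```python
# Compute remaining budget after each trace step, assuming one "step" of
# budget per trace entry plus an optional per-step "tool_calls" count.
# The trace shape is an assumption for illustration.
def burn_series(budgets: dict[str, int], trace: list[dict]) -> list[dict[str, int]]:
    remaining = dict(budgets)
    series = []
    for step in trace:
        remaining["steps"] -= 1
        remaining["tool_calls"] -= step.get("tool_calls", 0)
        series.append(dict(remaining))  # snapshot per step for plotting
    return series

trace = [{"tool_calls": 2}, {}, {"tool_calls": 1}]
print(burn_series({"steps": 3, "tool_calls": 5}, trace))
```

Plotting each snapshot against its step index gives the remaining-budget curve the dashboard renders.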
3. Seal bundles with confidence
Record mode still writes .agent_bench/baselines/<run_id> bundles. What’s new is the trust that comes afterward: every run can be diffed, visualized, and re-recorded only when it truly changed.
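"Re-record only when it truly changed" can be approximated by hashing the recorded action sequence and re-sealing only on mismatch. This is a hedged sketch under assumed bundle fields (`steps`, `action`), not the bundle's real layout.

```python
import hashlib
import json

# Assumed bundle shape: {"steps": [{"action": ...}, ...]}. These field
# names are illustrative, not the actual .agent_bench bundle schema.
def actions_digest(bundle: dict) -> str:
    """Stable digest of a run's action sequence."""
    canonical = json.dumps([s["action"] for s in bundle["steps"]])
    return hashlib.sha256(canonical.encode()).hexdigest()

def needs_rerecord(baseline: dict, current: dict) -> bool:
    """Re-seal the baseline only when the action sequence diverged."""
    return actions_digest(baseline) != actions_digest(current)

old = {"steps": [{"action": "read"}, {"action": "verify"}]}
new = {"steps": [{"action": "read"}, {"action": "verify"}]}
print(needs_rerecord(old, new))  # False: identical action sequence
```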
Highlights
- Budget + taxonomy surfacing in both CLI and Web UI.
- runbook_verifier baseline tightened (40/40 budgets) and sealed with --record.
- Docs updated: trace_artifacts.md now covers CLI diff + Web UI analysis patterns.
- New pytest coverage: baseline pretty output regression tests.
What’s next
Audited sandbox declarations are the final piece: each task will ship a network/filesystem allowlist, and record mode will enforce it. The roadmap in docs/record_mode.md now reflects this.
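An allowlist check of that kind could look like the sketch below. The declaration shape, host set, and path prefixes are all hypothetical placeholders, not the format record mode will ship.

```python
from urllib.parse import urlparse

# Hypothetical per-task sandbox declaration: a set of allowed network
# hosts and a tuple of allowed filesystem path prefixes. The structure
# and values are assumptions for illustration only.
ALLOWLIST = {
    "network": {"api.internal.example"},
    "filesystem": ("/workspace/", "/tmp/"),
}

def check_network(url: str, allowlist: dict) -> bool:
    """Allow a request only when its host is explicitly declared."""
    return urlparse(url).hostname in allowlist["network"]

def check_filesystem(path: str, allowlist: dict) -> bool:
    """Allow a file access only under a declared path prefix."""
    return path.startswith(allowlist["filesystem"])

print(check_network("https://api.internal.example/v1", ALLOWLIST))  # True
print(check_filesystem("/etc/passwd", ALLOWLIST))  # False
```

In an enforcing record mode, a failed check would become a sandbox-violation entry in the run's failure taxonomy rather than a silent allow.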
Until then, the deterministic episode runtime already keeps every action, delta, and failure story visible.