Record Mode Implemented
Record-mode capture, deterministic replays, and dashboard diagnostics now form a single workflow. TraceCore seals agent behavior end-to-end — from CLI compare to web-based inspection.
Why this release matters
Record mode has long been the promised end state: launch an agent, capture its every action, and seal that execution into a contract that CI can enforce indefinitely. The missing pieces were surfacing budget drift, highlighting taxonomy changes, and giving teams a way to inspect divergences without parsing raw JSON.
With v0.8.0, Record → Diff → Visualize is a single flow.
The new workflow
1. Record with CLI
agent-bench baseline --compare gained a --format pretty view. It shows status, run metadata, budget deltas, and the first divergent steps using Rich tables.
agent-bench baseline --compare run_a.json run_b.json --format pretty --show-taxonomy

Compare: DIFFERENT
Agent A: agents/runbook_verifier_agent.py
Budget Usage: steps 10 → 10 (Δ 0)
Failure Taxonomy: budget_exhausted → success
Per-Step Differences: first 5 shown
A new --show-taxonomy flag highlights failure-type drift so budget exhaustion, sandbox violations, and logic failures are obvious in CI logs.
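The comparison the pretty view surfaces can be sketched roughly as follows. This is an illustrative sketch, not agent-bench's implementation; the run-bundle field names (`budget`, `taxonomy`) are assumptions made up for this example.

```python
# Hypothetical run-bundle shape: "budget" maps budget names to used counts,
# "taxonomy" holds the run's failure classification. Both names are
# assumptions for illustration, not agent-bench's actual schema.
def budget_and_taxonomy_diff(baseline: dict, current: dict) -> dict:
    """Compute the budget deltas and failure-taxonomy drift that a
    pretty compare view would surface in CI logs."""
    deltas = {
        key: current["budget"][key] - baseline["budget"][key]
        for key in baseline["budget"]
    }
    return {
        "budget_deltas": deltas,
        "taxonomy_drift": (baseline["taxonomy"], current["taxonomy"]),
        "drifted": baseline["taxonomy"] != current["taxonomy"],
    }

run_a = {"budget": {"steps": 10, "tool_calls": 4}, "taxonomy": "budget_exhausted"}
run_b = {"budget": {"steps": 10, "tool_calls": 6}, "taxonomy": "success"}
print(budget_and_taxonomy_diff(run_a, run_b))
```

The same delta and drift pair is what makes a `budget_exhausted → success` flip stand out at a glance in a CI log.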
2. Analyze in the dashboard
The TraceCore web UI now mirrors the CLI's insights:
- Budget burn chart plotting remaining steps/tool_calls per trace step.
- Outcome taxonomy badges (green/amber/red) in Trace Viewer + Recent Runs.
- Delta tables showing baseline vs current actions with result-change markers.
- Color-coded budget drift numbers that match CLI pretty output.
Choosing any run surfaces the budget chart and taxonomy badge immediately. Comparing runs inside the Baselines tab renders the delta table plus the divergent step details.
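The data behind a budget burn chart is simple to derive from a trace. Here is a minimal sketch, assuming each trace step is a dict with an optional `tool_calls` count; the trace shape is an assumption, not TraceCore's actual format.

```python
# Compute remaining budget after each trace step, assuming one "step" of
# budget per trace entry plus an optional per-step "tool_calls" count.
# The trace shape is an assumption for illustration.
def burn_series(budgets: dict[str, int], trace: list[dict]) -> list[dict[str, int]]:
    remaining = dict(budgets)
    series = []
    for step in trace:
        remaining["steps"] -= 1
        remaining["tool_calls"] -= step.get("tool_calls", 0)
        series.append(dict(remaining))  # snapshot per step for plotting
    return series

trace = [{"tool_calls": 2}, {}, {"tool_calls": 1}]
print(burn_series({"steps": 3, "tool_calls": 5}, trace))
```

Plotting each snapshot against its step index gives the remaining-budget curve the dashboard renders.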
3. Seal bundles with confidence
Record mode still writes .agent_bench/baselines/<run_id> bundles. What’s new is the trust that comes afterward: every run can be diffed, visualized, and re-recorded only when it truly changed.
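"Re-record only when it truly changed" can be approximated by hashing the recorded action sequence and re-sealing only on mismatch. This is a hedged sketch under assumed bundle fields (`steps`, `action`), not the bundle's real layout.

```python
import hashlib
import json

# Assumed bundle shape: {"steps": [{"action": ...}, ...]}. These field
# names are illustrative, not the actual .agent_bench bundle schema.
def actions_digest(bundle: dict) -> str:
    """Stable digest of a run's action sequence."""
    canonical = json.dumps([s["action"] for s in bundle["steps"]])
    return hashlib.sha256(canonical.encode()).hexdigest()

def needs_rerecord(baseline: dict, current: dict) -> bool:
    """Re-seal the baseline only when the action sequence diverged."""
    return actions_digest(baseline) != actions_digest(current)

old = {"steps": [{"action": "read"}, {"action": "verify"}]}
new = {"steps": [{"action": "read"}, {"action": "verify"}]}
print(needs_rerecord(old, new))  # False: identical action sequence
```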
Highlights
- Budget + taxonomy surfacing in both CLI and Web UI.
- runbook_verifier baseline tightened (40/40 budgets) and sealed with --record.
- Docs updated: trace_artifacts.md now covers CLI diff + Web UI analysis patterns.
- New pytest coverage: baseline pretty output regression tests.
What’s next
Audited sandbox declarations are the final piece: each task will ship a network/filesystem allowlist, and record mode will enforce it. The roadmap in docs/record_mode.md now reflects this.
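An allowlist check of that kind could look like the sketch below. The declaration shape, host set, and path prefixes are all hypothetical placeholders, not the format record mode will ship.

```python
from urllib.parse import urlparse

# Hypothetical per-task sandbox declaration: a set of allowed network
# hosts and a tuple of allowed filesystem path prefixes. The structure
# and values are assumptions for illustration only.
ALLOWLIST = {
    "network": {"api.internal.example"},
    "filesystem": ("/workspace/", "/tmp/"),
}

def check_network(url: str, allowlist: dict) -> bool:
    """Allow a request only when its host is explicitly declared."""
    return urlparse(url).hostname in allowlist["network"]

def check_filesystem(path: str, allowlist: dict) -> bool:
    """Allow a file access only under a declared path prefix."""
    return path.startswith(allowlist["filesystem"])

print(check_network("https://api.internal.example/v1", ALLOWLIST))  # True
print(check_filesystem("/etc/passwd", ALLOWLIST))  # False
```

In an enforcing record mode, a failed check would become a sandbox-violation entry in the run's failure taxonomy rather than a silent allow.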
Until then, the deterministic episode runtime already keeps every action, delta, and failure story visible.