Ledger & Record Mode Foundations
A static agent registry, content-addressed baseline bundles, replay enforcement, and strict CI gating. The building blocks for sealed execution contracts.
The problem this solves
TraceCore has always produced deterministic run artifacts. What it lacked was a way to seal a run — to say “this is the certified behavior, and any future run that deviates from it is a regression.”
Without that, CI can only check whether a run succeeds. It cannot check whether it succeeds in the same way it did before — same steps, same tool calls, same results at each step. That distinction matters for operational agents where the path to success is part of the contract.
v0.7.0 adds the full foundation: a registry of known agents, a bundle format for sealing runs, and CLI enforcement for replay and strict mode.
The Ledger
The Ledger is a static registry of certified agents and their baseline performance metrics. It lives in agent_bench/ledger/registry.json and is validated against a formal JSON Schema.
agent-bench ledger
# agents/chain_agent.py [api] rate_limited_chain@1, deterministic_rate_service@1
# agents/ops_triage_agent.py [operations] log_alert_triage@1, config_drift_remediation@1
# agents/toy_agent.py [core] filesystem_hidden_config@1, rate_limited_api@1

agent-bench ledger --show toy_agent
Each entry records the agent path, suite, description, and per-task baseline rows — success rate, average steps, and optionally the canonical run artifact ID. The web UI exposes this at /ledger with search and success-rate bar charts. The machine-readable endpoint is GET /api/ledger.
The Ledger is a static snapshot — it is not updated automatically by runs. For live aggregate stats computed from persisted artifacts, use agent-bench baseline. The Ledger is the stable reference point for CI comparisons and documentation.
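These notes don't reproduce the registry schema itself, so the sketch below shows one plausible shape for an entry, inferred from the fields listed above (agent path, suite, description, per-task baseline rows), plus a small helper for querying it. All field names and numbers are assumptions for illustration, not the actual schema.

```python
# Hypothetical registry shape, inferred from the fields described above.
# The real schema in agent_bench/ledger/registry.json may differ.
registry = {
    "agents": [
        {
            "path": "agents/toy_agent.py",
            "suite": "core",
            "description": "Minimal reference agent",
            "baselines": [
                {"task": "filesystem_hidden_config@1", "success_rate": 1.0, "avg_steps": 6.0},
                {"task": "rate_limited_api@1", "success_rate": 0.9, "avg_steps": 11.5},
            ],
        },
    ],
}

def tasks_for_suite(registry: dict, suite: str) -> list:
    """Collect (agent path, task) pairs for every certified agent in a suite."""
    return [
        (agent["path"], row["task"])
        for agent in registry["agents"]
        if agent["suite"] == suite
        for row in agent["baselines"]
    ]
```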
Baseline bundles
A baseline bundle is a content-addressed directory that seals a single certified run into four files:
.agent_bench/baselines/<run_id>/
manifest.json # agent, task, seed, outcome, budget usage
tool_calls.jsonl # one line per trace step: action + result
validator.json # final validation snapshot
integrity.sha256 # SHA-256 hashes of the three files above

Writing a bundle takes one flag on the existing baseline command. Verifying one is a single command:
# Seal the most recent matching run
agent-bench baseline \
--agent agents/toy_agent.py \
--task filesystem_hidden_config@1 \
--bundle
# {"bundle_dir": ".agent_bench/baselines/3dbba411...", "run_id": "3dbba411..."}
# Verify integrity
agent-bench bundle verify .agent_bench/baselines/3dbba411...
# OK .agent_bench/baselines/3dbba411...
agent-bench bundle verify .agent_bench/baselines/3dbba411... --format json
# {"ok": true, "errors": []}

Bundles are designed to be committed to version control. SHA-256 hashes cover every file in the bundle, so tampering is detectable.
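Under the hood, verification only needs to recompute hashes and compare. A minimal sketch, assuming integrity.sha256 uses the common `<hex>  <filename>` line format (the actual on-disk format is not documented here):

```python
import hashlib
from pathlib import Path

def verify_bundle(bundle_dir: str) -> dict:
    """Recompute SHA-256 hashes of the sealed files and compare them
    against integrity.sha256. Line format '<hex>  <filename>' is an
    assumption; the real bundle layout may differ."""
    bundle = Path(bundle_dir)
    errors = []
    for line in (bundle / "integrity.sha256").read_text().splitlines():
        expected, name = line.split()
        actual = hashlib.sha256((bundle / name).read_bytes()).hexdigest()
        if actual != expected:
            errors.append(f"{name}: hash mismatch")
    return {"ok": not errors, "errors": errors}
```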
Replay and strict enforcement
Replay mode
Replay re-runs the agent and diffs the trace against the bundle step-by-step. Action type, action args, and result must all match. Success and termination reason must match. Step count must match.
agent-bench run \
--agent agents/toy_agent.py \
--task filesystem_hidden_config@1 \
--replay-bundle .agent_bench/baselines/3dbba411...
# [REPLAY OK]
If the trace diverges, the run exits 1 and prints exactly what changed:
[REPLAY FAILED]
step 3: result mismatch — baseline={'ok': True, 'value': 'correct_value'} fresh={'ok': False, ...}

Strict mode
Strict mode is a superset of replay. It additionally enforces that steps_used and tool_calls_used do not exceed the baseline. This is the intended CI gate — it answers “did anything operationally meaningful change?”
agent-bench run \
--agent agents/toy_agent.py \
--task filesystem_hidden_config@1 \
--replay-bundle .agent_bench/baselines/3dbba411... \
--strict
# [STRICT OK]
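The two checks can be sketched as a pure function over trace steps and budget counters: replay demands an exact step-by-step match, and strict mode adds budget ceilings on top. The step and budget field names below are assumptions about the artifact schema, not the real one:

```python
def check_replay(baseline_steps, fresh_steps, strict=False,
                 baseline_budget=None, fresh_budget=None):
    """Compare a fresh trace against a sealed baseline, step by step.
    Each step is assumed to be a dict with 'action', 'args', 'result';
    budgets are assumed to carry 'steps_used' and 'tool_calls_used'."""
    if len(fresh_steps) != len(baseline_steps):
        return False, (f"step count mismatch: baseline={len(baseline_steps)} "
                       f"fresh={len(fresh_steps)}")
    for i, (base, fresh) in enumerate(zip(baseline_steps, fresh_steps), start=1):
        for field in ("action", "args", "result"):
            if base[field] != fresh[field]:
                return False, f"step {i}: {field} mismatch"
    if strict:  # strict mode: budgets must not exceed the baseline
        for key in ("steps_used", "tool_calls_used"):
            if fresh_budget[key] > baseline_budget[key]:
                return False, f"{key} exceeded baseline"
    return True, "OK"
```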
Both modes are also available as a Python API via agent_bench.runner.replay for programmatic use in test suites.
Recommended workflow
The full seal-to-CI workflow in four steps:
# 1. Run the agent
agent-bench run --agent agents/my_agent.py --task my_task@1 --seed 0

# 2. Seal the bundle
agent-bench baseline --agent agents/my_agent.py --task my_task@1 --bundle

# 3. Commit the bundle
git add .agent_bench/baselines/<run_id>
git commit -m "seal: my_agent baseline for my_task@1"

# 4. Gate every PR
agent-bench run \
--agent agents/my_agent.py \
--task my_task@1 \
--replay-bundle .agent_bench/baselines/<run_id> \
--strict
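Teams gating from a Python test suite rather than a shell step can assemble the same step-4 command programmatically and rely on the CLI's nonzero exit on divergence. `strict_gate` is a hypothetical helper, and the paths are placeholders:

```python
def strict_gate(agent: str, task: str, bundle_dir: str) -> list:
    """Build the strict-mode gate command from step 4 above. Pass the
    result to subprocess.run(cmd, check=True) in a test: agent-bench
    exits nonzero on divergence, which raises CalledProcessError."""
    return [
        "agent-bench", "run",
        "--agent", agent,
        "--task", task,
        "--replay-bundle", bundle_dir,
        "--strict",
    ]

cmd = strict_gate("agents/my_agent.py", "my_task@1",
                  ".agent_bench/baselines/<run_id>")
# e.g. subprocess.run(cmd, check=True)
```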
What is still ahead
The --record flag — first-time capture with an audited sandbox that enforces declared network and filesystem permissions — is not yet implemented. The bundle format and replay/strict enforcement are the foundation it will build on.
The execution mode table in docs/record-mode has been updated to reflect what is and is not implemented.
Other changes in v0.7.0
- Ruff linter integrated — E/F rules, scoped to agent_bench/, CI step runs before pytest.
- Trace entries now include action_ts (UTC ISO 8601) and budget_delta — additive, no breaking changes.
- diff_runs normalization — strips action_ts and budget_delta before comparing, so pre-v0.7.0 baseline artifacts don't produce false divergences in --compare or CI workflows.
- 15 new tests — 160 total. Determinism tests updated to exclude action_ts.
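The diff_runs normalization can be illustrated with a small sketch; the trace-step dict shape here is an assumption:

```python
V070_FIELDS = ("action_ts", "budget_delta")

def normalize_step(step: dict) -> dict:
    """Drop the additive v0.7.0 fields before comparing, so traces
    recorded before v0.7.0 diff cleanly against newer ones (mirrors
    the diff_runs normalization described above)."""
    return {k: v for k, v in step.items() if k not in V070_FIELDS}

pre_070 = {"action": "read_file", "result": {"ok": True}}
post_070 = {"action": "read_file", "result": {"ok": True},
            "action_ts": "2025-01-01T00:00:00+00:00", "budget_delta": {"steps": 1}}
assert normalize_step(pre_070) == normalize_step(post_070)  # no false divergence
```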