TraceCore Core: Deterministic Episode Runtime
TraceCore's invariant core is a Deterministic Episode Runtime: a bounded runtime that executes agent-environment interaction with fixed inputs and emits replayable traces plus a structured verdict.
This is the stable nucleus that can power multiple futures (test framework, runtime platform, protocol/standard) without changing the primitive.
Canonical Definition
A Deterministic Episode Runtime executes:
Agent + Environment + Seed + Budgets (+ Harness version + Task version)
and produces:
Deterministic interaction trace + Structured termination outcome + Replayable artifact
What This Is
- A controlled interaction container for agent behavior under constraints.
- A deterministic execution model with reproducible outcomes.
- An artifact-first diagnostic layer for CI and regressions.
What This Is Not
- A leaderboard.
- An LLM-as-judge scoring framework.
- A broad intelligence benchmark.
- A hosted product requirement.
Those can be built on top. They are not the core.
The Episode Spec (v0)
Required Inputs
- Agent implementation — Must satisfy the reset/observe/act interface.
- Task/environment version — Closed-world, deterministic setup and validator.
- Seed — Explicit seed used for deterministic setup and execution.
- Budgets — steps, tool_calls, optional wall-clock timeout.
- Runtime identity — Harness version and task version included in artifacts.
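As a minimal sketch, the required inputs could be bundled into one immutable value. Only the reset/observe/act method names and the steps and tool_calls budget keys come from the spec above; the class names (Agent, Budgets, EpisodeSpec), the timeout_s field, the method signatures, and the example values are illustrative assumptions rather than TraceCore's actual API.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


class Agent(Protocol):
    """Hypothetical shape of the reset/observe/act interface."""

    def reset(self, task_spec: dict) -> None: ...

    def observe(self, observation: Any) -> None: ...

    def act(self) -> Any: ...


@dataclass(frozen=True)
class Budgets:
    steps: int
    tool_calls: int
    timeout_s: Optional[float] = None  # optional wall-clock timeout (assumed name)


@dataclass(frozen=True)
class EpisodeSpec:
    """Everything besides the agent that pins down one episode."""

    task_id: str
    task_version: str
    harness_version: str
    seed: int
    budgets: Budgets


# Example values are placeholders, not real task or harness identifiers.
spec = EpisodeSpec(
    task_id="example_task",
    task_version="1",
    harness_version="0.0.0",
    seed=42,
    budgets=Budgets(steps=50, tool_calls=10),
)
```

Freezing the dataclasses is a design choice of this sketch: an episode's control inputs cannot be mutated after the run starts, which keeps the recorded spec and the executed spec identical.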
Execution Model
The runtime loop is discrete and bounded:
- Set up the environment from task + seed.
- Reset the agent with the task spec.
- Repeat observe → act → execute → validate while budgets remain.
- Terminate with a structured reason.
- Persist the run artifact for replay/comparison.
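The loop above can be sketched with a toy environment. Everything here is hypothetical: CountdownEnv, DecrementAgent, run_episode, the "dec" action, and the "success" label for a passing run are assumptions made for illustration, and only the steps budget is enforced (tool_calls and timeout are omitted for brevity).

```python
import random


class CountdownEnv:
    """Toy closed-world task (hypothetical): drive a counter to exactly zero."""

    def __init__(self, seed: int) -> None:
        # All randomness comes from the explicit seed, so setup is reproducible.
        self.value = random.Random(seed).randint(3, 6)

    def observation(self) -> int:
        return self.value

    def execute(self, action: str) -> None:
        if action != "dec":
            raise ValueError(f"unknown action: {action!r}")
        self.value -= 1

    def validate(self) -> dict:
        done = self.value == 0
        return {"ok": done, "terminal": done}


class DecrementAgent:
    """Trivial agent satisfying a reset/observe/act interface."""

    def reset(self, task_spec: dict) -> None:
        self.last_observation = None

    def observe(self, observation: int) -> None:
        self.last_observation = observation

    def act(self) -> str:
        return "dec"


def run_episode(agent, seed: int, max_steps: int) -> dict:
    env = CountdownEnv(seed)             # set up the environment from task + seed
    agent.reset({"task": "countdown"})   # reset the agent with the task spec
    trace = []
    reason = "steps_exhausted"           # outcome if the step budget runs out
    for step in range(max_steps):        # bounded observe/act/execute/validate
        observation = env.observation()
        agent.observe(observation)
        action = agent.act()
        trace.append({"step": step, "observation": observation, "action": action})
        try:
            env.execute(action)
        except ValueError:
            reason = "invalid_action"
            break
        verdict = env.validate()
        if verdict["ok"]:
            reason = "success"           # assumed label for a passing run
            break
        if verdict["terminal"]:
            reason = "logic_failure"
            break
    # Terminate with a structured reason and return the artifact for persistence.
    return {
        "seed": seed,
        "success": reason == "success",
        "termination_reason": reason,
        "steps_used": len(trace),
        "action_trace": trace,
    }
```

Calling run_episode(DecrementAgent(), seed=7, max_steps=10) twice returns equal dictionaries, because the only source of randomness is the seeded generator created during setup.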
Outcome Model: Termination vs. Failure Taxonomy
TraceCore separates the exact stop condition from the analysis bucket:
- termination_reason: the precise termination event reported by the runtime.
- failure_type: a normalized category for filtering, dashboards, and CI policy gates.
Canonical Failure Types
- budget_exhausted
- invalid_action
- sandbox_violation
- logic_failure
- timeout
- non_termination (reserved for future use; not emitted by the current runner)
Mapping Guidance
Typical runtime termination reasons map as follows:
- steps_exhausted → budget_exhausted
- tool_calls_exhausted → budget_exhausted
- invalid_action → invalid_action
- action_exception → invalid_action
- sandbox_violation → sandbox_violation
- timeout → timeout
- logic_failure → logic_failure
- non_termination → non_termination
Terminal validator failures ({"ok": false, "terminal": true}) emit termination_reason=logic_failure unless an explicit override is provided.
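The mapping above is small enough to express as a lookup table. In this sketch the table contents mirror the list above, while the names FAILURE_TYPE_BY_REASON, failure_type_for, and reason_from_validator, and the choice to raise on unknown reasons, are assumptions rather than the runner's real interface.

```python
from typing import Optional

FAILURE_TYPE_BY_REASON = {
    "steps_exhausted": "budget_exhausted",
    "tool_calls_exhausted": "budget_exhausted",
    "invalid_action": "invalid_action",
    "action_exception": "invalid_action",
    "sandbox_violation": "sandbox_violation",
    "timeout": "timeout",
    "logic_failure": "logic_failure",
    "non_termination": "non_termination",  # reserved; not emitted by the current runner
}


def failure_type_for(termination_reason: str) -> str:
    """Normalize a runtime termination reason into its analysis bucket."""
    try:
        return FAILURE_TYPE_BY_REASON[termination_reason]
    except KeyError:
        # Failing loudly on unmapped reasons is a design choice of this sketch.
        raise ValueError(
            f"unmapped termination_reason: {termination_reason!r}"
        ) from None


def reason_from_validator(result: dict, override: Optional[str] = None) -> Optional[str]:
    """Return a termination reason for a terminal validator failure, else None."""
    if not result.get("ok") and result.get("terminal"):
        return override or "logic_failure"
    return None
```

For example, failure_type_for("action_exception") returns "invalid_action", and reason_from_validator({"ok": False, "terminal": True}) returns "logic_failure" when no override is given.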
Deterministic Replay Contract
Replay is a first-class property, not a convenience feature.
Given the same:
- task id/version,
- agent implementation,
- seed,
- budgets,
- and compatible harness/task contracts,
the runtime must either reproduce the outcome with a stable trace envelope or report any divergence as an explicit, inspectable diff.
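One way to check this contract, sketched under assumptions: hash a canonical JSON encoding of the trace to compare runs cheaply, and locate the first diverging step when the hashes differ. The helper names trace_fingerprint and first_divergence and the shape of the trace entries are hypothetical; the spec does not prescribe a hashing scheme.

```python
import hashlib
import json
from typing import Optional


def trace_fingerprint(trace: list) -> str:
    """Hash a canonical JSON encoding so key order cannot affect the result."""
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def first_divergence(baseline: list, replay: list) -> Optional[int]:
    """Return the index of the first differing step, or None if the traces match."""
    for index, (expected, actual) in enumerate(zip(baseline, replay)):
        if expected != actual:
            return index
    if len(baseline) != len(replay):
        # One trace is a strict prefix of the other.
        return min(len(baseline), len(replay))
    return None


baseline = [{"step": 0, "action": "dec"}, {"step": 1, "action": "dec"}]
same = [{"action": "dec", "step": 0}, {"action": "dec", "step": 1}]
drifted = [{"step": 0, "action": "dec"}, {"step": 1, "action": "inc"}]

matches = trace_fingerprint(baseline) == trace_fingerprint(same)
divergence = first_divergence(baseline, drifted)
```

Here matches is True even though the dictionaries were written with different key order, and divergence is 1, the index of the step where the replay took a different action.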
Why This Matters
If an episode cannot be replayed deterministically, it is not a reliable infrastructure primitive; it is only a demo.
Artifact Contract (Core Surface)
Every episode must emit a machine-readable artifact suitable for automation and audit. Core fields include:
- identity (run_id, trace_id, task_ref, agent, harness_version)
- control inputs (seed)
- outcome (success, termination_reason, failure_type, failure_reason)
- bounded usage (steps_used, tool_calls_used)
- full action_trace
Additive schema evolution is acceptable; breaking schema changes require versioning and release notes.
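A minimal sketch of an artifact carrying the core fields listed above. The field names come from that list; the RunArtifact class, the field types, the use of None for failure fields on a passing run, and every example value are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunArtifact:
    # identity
    run_id: str
    trace_id: str
    task_ref: str
    agent: str
    harness_version: str
    # control inputs
    seed: int
    # outcome
    success: bool
    termination_reason: str
    failure_type: Optional[str]    # assumed to be None when the run succeeds
    failure_reason: Optional[str]
    # bounded usage
    steps_used: int
    tool_calls_used: int
    # full trace of actions taken during the episode
    action_trace: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so repeated runs produce identical text."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Placeholder values only; real identifiers would come from the runner.
artifact = RunArtifact(
    run_id="run-0001",
    trace_id="trace-0001",
    task_ref="example_task@1",
    agent="example_agent",
    harness_version="0.0.0",
    seed=42,
    success=False,
    termination_reason="steps_exhausted",
    failure_type="budget_exhausted",
    failure_reason="step budget of 2 reached",
    steps_used=2,
    tool_calls_used=0,
    action_trace=[{"step": 0, "action": "dec"}, {"step": 1, "action": "dec"}],
)
payload = artifact.to_json()
```

Adding a new optional field with a default value to such a dataclass is an additive change: older artifacts still load, which is the kind of evolution the contract permits without a version bump.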
Why This Is the Right Strategic Focus Now
Defining this primitive cleanly avoids early identity lock-in and preserves optionality:
- Want pytest-for-agents? Wrap episodes in test runners.
- Want runtime packaging? Package environments around episode contracts.
- Want a standard/protocol? Publish this spec as the interoperable core.
All three paths depend on the same deterministic episode runtime.
Practical Operator Value
This core gives teams:
- Regression detection with stable seeds and baseline compare workflows.
- Actionable failures via structured taxonomy and full trace context.
- CI-native gating using deterministic pass/fail and policy thresholds.
- Auditable evidence through persisted run artifacts and replayability.
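To make the CI gating point concrete, here is a hedged sketch of a policy gate over a batch of run artifacts. The function name evaluate_gate, its parameters, and the default policy (treating sandbox_violation as always blocking) are assumptions, not a documented TraceCore feature; the artifacts are plain dictionaries using the success, failure_type, and run_id fields.

```python
def evaluate_gate(artifacts, min_pass_rate=1.0,
                  blocking_failure_types=("sandbox_violation",)):
    """Return (passed, problems) for a batch of run artifacts."""
    artifacts = list(artifacts)
    if not artifacts:
        # An empty batch is treated as a failure so a broken pipeline cannot pass.
        return False, ["no artifacts to evaluate"]

    problems = []
    pass_rate = sum(1 for a in artifacts if a["success"]) / len(artifacts)
    if pass_rate < min_pass_rate:
        problems.append(
            f"pass rate {pass_rate:.2f} below threshold {min_pass_rate:.2f}"
        )
    for a in artifacts:
        if a.get("failure_type") in blocking_failure_types:
            problems.append(
                f"run {a['run_id']} hit blocking failure_type {a['failure_type']}"
            )
    return not problems, problems


runs = [
    {"run_id": "r1", "success": True, "failure_type": None},
    {"run_id": "r2", "success": False, "failure_type": "budget_exhausted"},
]
lenient = evaluate_gate(runs, min_pass_rate=0.5)
strict = evaluate_gate(runs, min_pass_rate=1.0)
```

With these two runs, lenient is (True, []) and strict is (False, ["pass rate 0.50 below threshold 1.00"]). Because episodes are deterministic, a change in the gate result between commits points at a behavior change rather than at run-to-run noise.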
One-Line Mental Model
If pytest tests functions, TraceCore executes deterministic episodes.
If Docker packages containers, TraceCore packages bounded agent-environment interactions.