TraceCore Core: Deterministic Episode Runtime

TraceCore's invariant core is a Deterministic Episode Runtime: a bounded execution engine that runs agent-environment interaction with fixed inputs and emits replayable traces plus a structured verdict.

This is the stable nucleus that can power multiple futures (test framework, runtime platform, protocol/standard) without changing the primitive.

Canonical Definition

A Deterministic Episode Runtime executes:

Agent + Environment + Seed + Budgets (+ Harness version + Task version)

and produces:

Deterministic interaction trace + Structured termination outcome + Replayable artifact
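The canonical contract can be sketched as a pair of typed records. The field names follow the definition above, but this is an illustrative shape, not TraceCore's published schema.

```python
# Hedged sketch of the episode contract as dataclasses; names are illustrative.
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class EpisodeInputs:
    agent: Any                      # must satisfy the reset/observe/act interface
    task_ref: str                   # task/environment version reference
    seed: int                       # explicit seed for deterministic setup
    budgets: dict                   # e.g. {"steps": 50, "tool_calls": 20}
    harness_version: str = "0.0.0"  # runtime identity, recorded in artifacts
    task_version: str = "0.0.0"

@dataclass
class EpisodeOutputs:
    action_trace: list = field(default_factory=list)  # deterministic interaction trace
    termination_reason: str = ""                      # structured termination outcome
    artifact: dict = field(default_factory=dict)      # replayable, machine-readable record
```

Freezing the inputs record reflects the spec's intent: the inputs are fixed for the life of the episode.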

What This Is

  • A controlled interaction container for agent behavior under constraints.
  • A deterministic execution model with reproducible outcomes.
  • An artifact-first diagnostic layer for CI and regressions.

What This Is Not

  • A leaderboard.
  • An LLM-as-judge scoring framework.
  • A broad intelligence benchmark.
  • A hosted product requirement.

Those can be built on top. They are not the core.

The Episode Spec (v0)

Required Inputs

  • Agent implementation — Must satisfy the reset/observe/act interface.
  • Task/environment version — Closed-world, deterministic setup and validator.
  • Seed — Explicit seed used for deterministic setup and execution.
  • Budgets — steps, tool_calls, optional wall-clock timeout.
  • Runtime identity — Harness version and task version included in artifacts.

Execution Model

The runtime loop is discrete and bounded:

  • Setup environment from task + seed.
  • Reset agent with task spec.
  • Repeat observe → act → execute → validate while budgets remain.
  • Terminate with structured reason.
  • Persist run artifact for replay/comparison.
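The loop above can be sketched end to end with a toy environment. Every name here (run_episode, CountdownEnv, the verdict and outcome shapes) is illustrative rather than TraceCore's actual API, and "task_complete" is an assumed name for the success case; persisting the run artifact (the final step) is omitted for brevity.

```python
# Minimal runnable sketch of the discrete, bounded episode loop. Names and the
# verdict/outcome shapes are assumptions, not TraceCore's real interfaces.
def run_episode(env, agent, task_spec, seed, budgets):
    env.setup(task_spec, seed)                 # 1. setup environment from task + seed
    agent.reset(task_spec)                     # 2. reset agent with task spec
    trace, steps = [], 0
    while steps < budgets["steps"]:            # 3. repeat while budgets remain
        agent.observe(env.observe())           #    observe
        action = agent.act()                   #    act
        result = env.execute(action)           #    execute
        verdict = env.validate(result)         #    validate
        trace.append({"step": steps, "action": action, "verdict": verdict})
        steps += 1
        if verdict.get("terminal"):            # 4. terminate with structured reason
            ok = verdict.get("ok", False)
            return {"success": ok,
                    # terminal validator failures map to logic_failure;
                    # "task_complete" is an assumed name for the success case
                    "termination_reason": "task_complete" if ok else "logic_failure",
                    "steps_used": steps, "action_trace": trace}
    return {"success": False, "termination_reason": "steps_exhausted",
            "steps_used": steps, "action_trace": trace}

class CountdownEnv:
    """Toy closed-world environment: count up to a target value."""
    def setup(self, task_spec, seed):
        self.target, self.value = task_spec["target"], 0
    def observe(self):
        return self.value
    def execute(self, action):
        self.value += action["increment"]
        return self.value
    def validate(self, result):
        return {"ok": result == self.target, "terminal": result >= self.target}

class IncrementAgent:
    """Toy agent that always increments by one."""
    def reset(self, task_spec): pass
    def observe(self, observation): self.last = observation
    def act(self): return {"increment": 1}
```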

Outcome Model: Termination vs. Failure Taxonomy

TraceCore separates the exact stop condition from the analysis bucket:

  • termination_reason — the precise termination event reported by the runtime.
  • failure_type — a normalized category for filtering, dashboards, and CI policy gates.

Canonical Failure Types

  • budget_exhausted
  • invalid_action
  • sandbox_violation
  • logic_failure
  • timeout
  • non_termination (reserved for future use; not emitted by the current runner)

Mapping Guidance

Typical runtime termination reasons map as follows:

  • steps_exhausted → budget_exhausted
  • tool_calls_exhausted → budget_exhausted
  • invalid_action → invalid_action
  • action_exception → invalid_action
  • sandbox_violation → sandbox_violation
  • timeout → timeout
  • logic_failure → logic_failure
  • non_termination → non_termination
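The table above is small enough to encode as data; this is one assumed way to implement it, not the runner's actual code.

```python
# The termination_reason → failure_type table from the mapping guidance above,
# encoded as a plain dict (an assumed implementation, not TraceCore's source).
FAILURE_TYPE_BY_TERMINATION = {
    "steps_exhausted": "budget_exhausted",
    "tool_calls_exhausted": "budget_exhausted",
    "invalid_action": "invalid_action",
    "action_exception": "invalid_action",
    "sandbox_violation": "sandbox_violation",
    "timeout": "timeout",
    "logic_failure": "logic_failure",
    "non_termination": "non_termination",  # reserved; not emitted by the current runner
}

def failure_type(termination_reason: str) -> str:
    """Normalize a precise termination event into its analysis bucket."""
    return FAILURE_TYPE_BY_TERMINATION[termination_reason]
```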

Terminal validator failures ({"ok": false, "terminal": true}) emit termination_reason=logic_failure unless an explicit override is provided.

Deterministic Replay Contract

Replay is a first-class property, not a convenience feature.

Given the same:

  • task id/version,
  • agent implementation,
  • seed,
  • budgets,
  • and compatible harness/task contracts,

the runtime must produce reproducible outcomes with a stable trace envelope, or a diff that is explicit and inspectable.
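A replay check under this contract can be sketched as a field-by-field comparison of two run artifacts; the helper name and the default envelope fields are assumptions.

```python
# Assumed replay check: same inputs should yield an identical trace envelope,
# otherwise the diff must be explicit and inspectable.
def replay_diff(run_a: dict, run_b: dict,
                envelope=("seed", "termination_reason", "steps_used", "action_trace")):
    """Return the envelope fields on which two run artifacts disagree."""
    return [f for f in envelope if run_a.get(f) != run_b.get(f)]
```

An empty diff means the replay reproduced the original envelope; a non-empty diff names exactly where determinism broke.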

Why This Matters

If an episode cannot be replayed deterministically, it is not a reliable infrastructure primitive; it is only a demo.

Artifact Contract (Core Surface)

Every episode must emit a machine-readable artifact suitable for automation and audit. Core fields include:

  • identity (run_id, trace_id, task_ref, agent, harness_version)
  • control inputs (seed)
  • outcome (success, termination_reason, failure_type, failure_reason)
  • bounded usage (steps_used, tool_calls_used)
  • full action_trace
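Put together, a minimal artifact covering those core fields might look like this; all values, and any key names beyond the fields listed above, are invented for illustration.

```python
# Illustrative run artifact with the core fields above; values are invented.
artifact = {
    # identity
    "run_id": "run-0001",
    "trace_id": "trace-0001",
    "task_ref": "grid-nav@v2",
    "agent": "baseline-agent",
    "harness_version": "0.3.0",
    # control inputs
    "seed": 1234,
    # outcome
    "success": False,
    "termination_reason": "steps_exhausted",
    "failure_type": "budget_exhausted",
    "failure_reason": "step budget reached before task completion",
    # bounded usage
    "steps_used": 50,
    "tool_calls_used": 12,
    # full action_trace (one entry per step in real runs)
    "action_trace": [],
}
```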

Additive schema evolution is acceptable; breaking schema changes require versioning and release notes.

Why This Is the Right Strategic Focus Now

Defining this primitive cleanly avoids early identity lock-in and preserves optionality:

  • Want pytest-for-agents? Wrap episodes in test runners.
  • Want runtime packaging? Package environments around episode contracts.
  • Want a standard/protocol? Publish this spec as the interoperable core.

All three paths depend on the same deterministic episode runtime.

Practical Operator Value

This core gives teams:

  • Regression detection with stable seeds and baseline compare workflows.
  • Actionable failures via structured taxonomy and full trace context.
  • CI-native gating using deterministic pass/fail and policy thresholds.
  • Auditable evidence through persisted run artifacts and replayability.
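The CI gating point can be sketched as a small policy check over persisted artifacts; the function name and threshold semantics are assumptions, not a shipped feature.

```python
# Assumed CI policy gate: fail the build when the suite's failure rate exceeds
# a configured threshold. Artifact shape follows the artifact contract above.
def gate(artifacts: list, max_failure_rate: float = 0.0) -> bool:
    """Return True when the suite passes the policy threshold."""
    if not artifacts:
        return True
    failures = sum(1 for a in artifacts if not a["success"])
    return failures / len(artifacts) <= max_failure_rate
```

Because episodes are deterministic, a gate like this yields stable pass/fail signals across CI runs rather than flaky thresholds.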

One-Line Mental Model

If pytest tests functions, TraceCore executes deterministic episodes.

If Docker packages applications into containers, TraceCore packages bounded agent-environment interactions into replayable episodes.