TraceCore Core: Deterministic Episode Runtime

TraceCore's invariant core is a Deterministic Episode Runtime: a bounded execution engine that runs agent-environment interaction with fixed inputs and emits replayable traces plus a structured verdict.

This is the stable nucleus that can power multiple futures (test framework, runtime platform, protocol/standard) without changing the primitive.

Canonical Definition

A Deterministic Episode Runtime executes:

Agent + Environment + Seed + Budgets (+ Harness version + Task version)

and produces:

Deterministic interaction trace + Structured termination outcome + Replayable artifact
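The canonical contract can be sketched as a pair of typed records. The field names follow the definition above, but this is an illustrative shape, not TraceCore's published schema.

```python
# Hedged sketch of the episode contract as dataclasses; names are illustrative.
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class EpisodeInputs:
    agent: Any                      # must satisfy the reset/observe/act interface
    task_ref: str                   # task/environment version reference
    seed: int                       # explicit seed for deterministic setup
    budgets: dict                   # e.g. {"steps": 50, "tool_calls": 20}
    harness_version: str = "0.0.0"  # runtime identity, recorded in artifacts
    task_version: str = "0.0.0"

@dataclass
class EpisodeOutputs:
    action_trace: list = field(default_factory=list)  # deterministic interaction trace
    termination_reason: str = ""                      # structured termination outcome
    artifact: dict = field(default_factory=dict)      # replayable, machine-readable record
```

Freezing the inputs record reflects the spec's intent: the inputs are fixed for the life of the episode.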

What This Is

  • A controlled interaction container for agent behavior under constraints.
  • A deterministic execution model with reproducible outcomes.
  • An artifact-first diagnostic layer for CI and regressions.

What This Is Not

  • A leaderboard.
  • An LLM-as-judge scoring framework.
  • A broad intelligence benchmark.
  • A hosted product requirement.

Those can be built on top. They are not the core.

The Episode Spec (v0)

Required Inputs

  • Agent implementation — Must satisfy the reset/observe/act interface.
  • Task/environment version — Closed-world, deterministic setup and validator.
  • Seed — Explicit seed used for deterministic setup and execution.
  • Budgets — steps, tool_calls, optional wall-clock timeout.
  • Runtime identity — Harness version and task version included in artifacts.

Execution Model

The runtime loop is discrete and bounded:

  • Setup environment from task + seed.
  • Reset agent with task spec.
  • Repeat observe → act → execute → validate while budgets remain.
  • Terminate with structured reason.
  • Persist run artifact for replay/comparison.
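The loop above can be sketched end to end with a toy environment. Every name here (run_episode, CountdownEnv, the verdict and outcome shapes) is illustrative rather than TraceCore's actual API, and "task_complete" is an assumed name for the success case; persisting the run artifact (the final step) is omitted for brevity.

```python
# Minimal runnable sketch of the discrete, bounded episode loop. Names and the
# verdict/outcome shapes are assumptions, not TraceCore's real interfaces.
def run_episode(env, agent, task_spec, seed, budgets):
    env.setup(task_spec, seed)                 # 1. setup environment from task + seed
    agent.reset(task_spec)                     # 2. reset agent with task spec
    trace, steps = [], 0
    while steps < budgets["steps"]:            # 3. repeat while budgets remain
        agent.observe(env.observe())           #    observe
        action = agent.act()                   #    act
        result = env.execute(action)           #    execute
        verdict = env.validate(result)         #    validate
        trace.append({"step": steps, "action": action, "verdict": verdict})
        steps += 1
        if verdict.get("terminal"):            # 4. terminate with structured reason
            ok = verdict.get("ok", False)
            return {"success": ok,
                    # terminal validator failures map to logic_failure;
                    # "task_complete" is an assumed name for the success case
                    "termination_reason": "task_complete" if ok else "logic_failure",
                    "steps_used": steps, "action_trace": trace}
    return {"success": False, "termination_reason": "steps_exhausted",
            "steps_used": steps, "action_trace": trace}

class CountdownEnv:
    """Toy closed-world environment: count up to a target value."""
    def setup(self, task_spec, seed):
        self.target, self.value = task_spec["target"], 0
    def observe(self):
        return self.value
    def execute(self, action):
        self.value += action["increment"]
        return self.value
    def validate(self, result):
        return {"ok": result == self.target, "terminal": result >= self.target}

class IncrementAgent:
    """Toy agent that always increments by one."""
    def reset(self, task_spec): pass
    def observe(self, observation): self.last = observation
    def act(self): return {"increment": 1}
```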

Outcome Model: Termination vs. Failure Taxonomy

TraceCore separates the exact stop condition from the analysis bucket:

  • termination_reason — the precise termination event reported by the runtime.
  • failure_type — a normalized category for filtering, dashboards, and CI policy gates.

Canonical Failure Types

  • budget_exhausted
  • invalid_action
  • sandbox_violation
  • logic_failure
  • timeout
  • non_termination (reserved for future use; not emitted by the current runner)

Mapping Guidance

Typical runtime termination reasons map as follows:

  • steps_exhausted → budget_exhausted
  • tool_calls_exhausted → budget_exhausted
  • invalid_action → invalid_action
  • action_exception → invalid_action
  • sandbox_violation → sandbox_violation
  • timeout → timeout
  • logic_failure → logic_failure
  • non_termination → non_termination
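The table above is small enough to encode as data; this is one assumed way to implement it, not the runner's actual code.

```python
# The termination_reason → failure_type table from the mapping guidance above,
# encoded as a plain dict (an assumed implementation, not TraceCore's source).
FAILURE_TYPE_BY_TERMINATION = {
    "steps_exhausted": "budget_exhausted",
    "tool_calls_exhausted": "budget_exhausted",
    "invalid_action": "invalid_action",
    "action_exception": "invalid_action",
    "sandbox_violation": "sandbox_violation",
    "timeout": "timeout",
    "logic_failure": "logic_failure",
    "non_termination": "non_termination",  # reserved; not emitted by the current runner
}

def failure_type(termination_reason: str) -> str:
    """Normalize a precise termination event into its analysis bucket."""
    return FAILURE_TYPE_BY_TERMINATION[termination_reason]
```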

Terminal validator failures ({"ok": false, "terminal": true}) emit termination_reason=logic_failure unless an explicit override is provided.

Deterministic Replay Contract

Replay is a first-class property, not a convenience feature.

Given the same:

  • task id/version,
  • agent implementation,
  • seed,
  • budgets,
  • and compatible harness/task contracts,

the runtime must produce reproducible outcomes with a stable trace envelope, or a diff that is explicit and inspectable.
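A replay check under this contract can be sketched as a field-by-field comparison of two run artifacts; the helper name and the default envelope fields are assumptions.

```python
# Assumed replay check: same inputs should yield an identical trace envelope,
# otherwise the diff must be explicit and inspectable.
def replay_diff(run_a: dict, run_b: dict,
                envelope=("seed", "termination_reason", "steps_used", "action_trace")):
    """Return the envelope fields on which two run artifacts disagree."""
    return [f for f in envelope if run_a.get(f) != run_b.get(f)]
```

An empty diff means the replay reproduced the original envelope; a non-empty diff names exactly where determinism broke.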

Why This Matters

If an episode cannot be replayed deterministically, it is not a reliable infrastructure primitive; it is only a demo.

Artifact Contract (Core Surface)

Every episode must emit a machine-readable artifact suitable for automation and audit. Core fields include:

  • identity (run_id, trace_id, task_ref, agent, harness_version)
  • control inputs (seed)
  • outcome (success, termination_reason, failure_type, failure_reason)
  • bounded usage (steps_used, tool_calls_used)
  • full action_trace
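Put together, a minimal artifact covering those core fields might look like this; all values, and any key names beyond the fields listed above, are invented for illustration.

```python
# Illustrative run artifact with the core fields above; values are invented.
artifact = {
    # identity
    "run_id": "run-0001",
    "trace_id": "trace-0001",
    "task_ref": "grid-nav@v2",
    "agent": "baseline-agent",
    "harness_version": "0.3.0",
    # control inputs
    "seed": 1234,
    # outcome
    "success": False,
    "termination_reason": "steps_exhausted",
    "failure_type": "budget_exhausted",
    "failure_reason": "step budget reached before task completion",
    # bounded usage
    "steps_used": 50,
    "tool_calls_used": 12,
    # full action_trace (one entry per step in real runs)
    "action_trace": [],
}
```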

Additive schema evolution is acceptable; breaking schema changes require versioning and release notes.

Why This Is the Right Strategic Focus Now

Defining this primitive cleanly avoids early identity lock-in and preserves optionality:

  • Want pytest-for-agents? Wrap episodes in test runners.
  • Want runtime packaging? Package environments around episode contracts.
  • Want a standard/protocol? Publish this spec as the interoperable core.

All three paths depend on the same deterministic episode runtime.

Practical Operator Value

This core gives teams:

  • Regression detection with stable seeds and baseline compare workflows.
  • Actionable failures via structured taxonomy and full trace context.
  • CI-native gating using deterministic pass/fail and policy thresholds.
  • Auditable evidence through persisted run artifacts and replayability.
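The CI gating point can be sketched as a small policy check over persisted artifacts; the function name and threshold semantics are assumptions, not a shipped feature.

```python
# Assumed CI policy gate: fail the build when the suite's failure rate exceeds
# a configured threshold. Artifact shape follows the artifact contract above.
def gate(artifacts: list, max_failure_rate: float = 0.0) -> bool:
    """Return True when the suite passes the policy threshold."""
    if not artifacts:
        return True
    failures = sum(1 for a in artifacts if not a["success"])
    return failures / len(artifacts) <= max_failure_rate
```

Because episodes are deterministic, a gate like this yields stable pass/fail signals across CI runs rather than flaky thresholds.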

One-Line Mental Model

If pytest tests functions, TraceCore executes deterministic episodes.

If Docker packages applications into containers, TraceCore packages bounded agent-environment interactions into replayable episodes.