Runner & Failure Semantics

The harness runner executes an agent inside a task and produces a reproducible outcome.

Responsibilities

Must

  • Load a task.
  • Initialize the environment.
  • Enforce the agent interface contract.
  • Enforce budgets.
  • Execute the observe → act loop.
  • Validate success or failure.
  • Emit a machine-readable result.
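As an illustration of the budget-enforcement duty above, here is a minimal sketch of a step and tool-call counter. The names Budget, BudgetExhausted, and charge are hypothetical and not part of the harness; the sketch also assumes that every action costs one step and that tool calls are counted separately.

```python
from dataclasses import dataclass


class BudgetExhausted(Exception):
    """Raised when a run tries to spend past one of its limits (hypothetical)."""


@dataclass
class Budget:
    # Hypothetical limits; the source only says steps and tool calls are budgeted.
    max_steps: int
    max_tool_calls: int
    steps_used: int = 0
    tool_calls_used: int = 0

    def charge(self, is_tool_call: bool) -> None:
        """Charge one step (plus one tool call if applicable), or raise."""
        if self.steps_used >= self.max_steps:
            raise BudgetExhausted("steps")
        if is_tool_call and self.tool_calls_used >= self.max_tool_calls:
            raise BudgetExhausted("tool_calls")
        self.steps_used += 1
        if is_tool_call:
            self.tool_calls_used += 1
```

Checking the limits before incrementing means a rejected action is never counted, so steps_used and tool_calls_used never exceed their limits.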

Must Not

  • Interpret intent.
  • Retry failures.
  • Modify agent behavior.
  • Judge outputs subjectively.

High-Level Flow

flow
load task → load agent → setup environment → reset agent →
loop: observe → act → execute action → update budgets →
check termination → validate → emit result
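The flow above can be sketched as a small loop. This is an illustration under stated assumptions, not the harness implementation: CounterEnv, IncAgent, and run are hypothetical names, the environment signals a bad action by raising ValueError, and only the step budget is tracked to keep the example short.

```python
from typing import Any, Dict, List, Optional


class CounterEnv:
    """Toy environment (hypothetical): done once the counter reaches `target`."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.count = 0

    def observe(self) -> int:
        return self.count

    def execute(self, action: Dict[str, Any]) -> None:
        if action.get("op") != "inc":
            raise ValueError(f"unknown action: {action!r}")
        self.count += 1

    def is_done(self) -> bool:
        return self.count >= self.target

    def validate(self) -> bool:
        return self.count == self.target


class IncAgent:
    """Toy agent (hypothetical) that always increments."""

    def reset(self) -> None:
        pass

    def act(self, observation: int) -> Dict[str, Any]:
        return {"op": "inc"}


def run(agent: Any, env: Any, max_steps: int) -> Dict[str, Any]:
    agent.reset()
    steps_used = 0
    trace: List[Dict[str, Any]] = []
    failure_type: Optional[str] = None

    while not env.is_done():                 # check termination
        if steps_used >= max_steps:          # enforce the step budget
            failure_type = "budget_exhausted"
            break
        action = agent.act(env.observe())    # observe -> act
        steps_used += 1                      # update budgets
        trace.append(action)
        try:
            env.execute(action)              # execute action
        except ValueError:
            failure_type = "invalid_action"
            break

    # Validate only if nothing failed earlier; no retries are attempted.
    success = failure_type is None and env.validate()
    if not success and failure_type is None:
        failure_type = "logic_failure"
    return {
        "success": success,
        "failure_type": failure_type,
        "steps_used": steps_used,
        "action_trace": trace,
    }
```

For instance, run(IncAgent(), CounterEnv(3), max_steps=10) finishes successfully after three steps, while the same call with max_steps=2 ends with failure_type set to "budget_exhausted".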

Result Format

result.json
{
  "task_id": "filesystem_hidden_config",
  "version": 1,
  "seed": 42,
  "success": true,
  "failure_reason": null,
  "failure_type": null,
  "steps_used": 37,
  "tool_calls_used": 12,
  "action_trace": []
}
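One way to keep this shape consistent is a typed record that serializes to the same keys. The sketch below mirrors the fields of the example above; the class name RunResult and the to_json helper are hypothetical, and the field types are inferred from the example values rather than from a published schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, List, Optional


@dataclass
class RunResult:
    # Field names copied from the result.json example; types are assumptions.
    task_id: str
    version: int
    seed: int
    success: bool
    failure_reason: Optional[str]
    failure_type: Optional[str]
    steps_used: int
    tool_calls_used: int
    action_trace: List[Any] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with keys in declaration order, matching the example."""
        return json.dumps(asdict(self), indent=2)


example = RunResult(
    task_id="filesystem_hidden_config",
    version=1,
    seed=42,
    success=True,
    failure_reason=None,
    failure_type=None,
    steps_used=37,
    tool_calls_used=12,
)
print(example.to_json())
```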

Failure Semantics

Every failed run is classified into one of these failure_type buckets:

Type               Description
budget_exhausted   Steps or tool calls depleted
invalid_action     Schema violations or action exceptions
sandbox_violation  Environment access outside allowed surface
logic_failure      Validator declared a terminal failure or run ended without specific failure
timeout            Optional wall-clock limit tripped
non_termination    Harness had to abort the run

Successful runs always emit failure_type: null.
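The classification rules above could be checked mechanically. The following sketch is an assumption about how such a check might look, not the harness's own code: FailureType and check_failure_fields are hypothetical names, and only the six failure_type values and the success rule come from this section.

```python
from enum import Enum
from typing import Optional


class FailureType(str, Enum):
    BUDGET_EXHAUSTED = "budget_exhausted"
    INVALID_ACTION = "invalid_action"
    SANDBOX_VIOLATION = "sandbox_violation"
    LOGIC_FAILURE = "logic_failure"
    TIMEOUT = "timeout"
    NON_TERMINATION = "non_termination"


def check_failure_fields(success: bool, failure_type: Optional[str]) -> None:
    """Raise ValueError if success and failure_type disagree."""
    if success:
        if failure_type is not None:
            raise ValueError("successful runs must emit failure_type: null")
        return
    if failure_type is None:
        raise ValueError("failed runs must carry a failure_type")
    # Enum lookup by value raises ValueError for unknown buckets.
    FailureType(failure_type)
```

A result such as check_failure_fields(False, "timeout") passes silently, whereas a successful run carrying a failure_type, or a failed run with an unknown bucket, raises ValueError.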

Determinism Contract

Given the same inputs (for example the same task, version, and seed, as recorded in result.json), the runner must produce the same result.
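A simple way to test this contract is to run twice with identical inputs and compare a fingerprint of the results. The sketch below is illustrative only: fingerprint and toy_run are hypothetical helpers, toy_run stands in for a real harness run, and the sketch assumes that all randomness flows through a seeded, run-local generator.

```python
import hashlib
import json
import random
from typing import Any, Dict


def fingerprint(result: Dict[str, Any]) -> str:
    """Hash a canonical JSON encoding so key order cannot affect the digest."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def toy_run(seed: int) -> Dict[str, Any]:
    """Stand-in for a real run: all randomness comes from the given seed."""
    rng = random.Random(seed)  # run-local RNG, no shared global state
    trace = [rng.randint(0, 9) for _ in range(5)]
    return {"seed": seed, "steps_used": len(trace), "action_trace": trace}


# Two runs with the same seed must yield identical results.
assert fingerprint(toy_run(42)) == fingerprint(toy_run(42))
```

Wall-clock timestamps, unordered iteration, and global random state are common sources of drift; keeping them out of the result (or out of the fingerprinted fields) makes a comparison like this meaningful.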