Runner & Failure Semantics

The harness runner executes an agent inside a task and produces a reproducible outcome.

Responsibilities

Must

Load a task.
Initialize the environment.
Enforce the agent interface contract.
Enforce budgets.
Execute the observe → act loop.
Validate success or failure.
Emit a machine-readable result.
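
As an illustration of enforcing the agent interface contract, the sketch below rejects an agent object before the run starts if it lacks the required methods. The contract itself is not defined in this section, so the method names reset() and act(observation), the Agent protocol, and check_agent are all assumptions made for the example.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Agent(Protocol):
    """Hypothetical agent contract; the method names are assumptions."""

    def reset(self) -> None: ...

    def act(self, observation: Any) -> dict: ...


def check_agent(candidate: object) -> None:
    """Raise TypeError if the object lacks reset() or act()."""
    # isinstance() on a runtime_checkable Protocol only checks that the
    # methods exist; it does not verify their signatures or return types.
    if not isinstance(candidate, Agent):
        raise TypeError("agent must implement reset() and act(observation)")


class EchoAgent:
    """Satisfies the assumed contract."""

    def reset(self) -> None:
        pass

    def act(self, observation: Any) -> dict:
        return {"type": "noop", "seen": observation}


class BrokenAgent:
    """Missing reset(), so it fails the check."""

    def act(self, observation: Any) -> dict:
        return {}
```

Calling check_agent(EchoAgent()) returns silently, while check_agent(BrokenAgent()) raises TypeError. A structural check like this catches missing methods only; malformed actions would still need to be caught per step.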

Must Not

Interpret intent.
Retry failures.
Modify agent behavior.
Judge outputs subjectively.

High-Level Flow

load task → load agent → setup environment → reset agent →
loop: observe → act → execute action → update budgets →
check termination → validate → emit result
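
The flow above can be sketched as a small loop. This is a minimal illustration, not the harness implementation: Budgets, CounterEnv, IncrementAgent, and run_episode are hypothetical names, task and agent loading are left out, and the environment is a toy counter.

```python
from dataclasses import dataclass


@dataclass
class Budgets:
    """Hypothetical budget tracker; assumes both limits are at least 1."""
    max_steps: int
    max_tool_calls: int
    steps_used: int = 0
    tool_calls_used: int = 0

    def exhausted(self) -> bool:
        return (self.steps_used >= self.max_steps
                or self.tool_calls_used >= self.max_tool_calls)


class CounterEnv:
    """Toy environment: the task is solved once the counter reaches target."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.count = 0

    def observe(self) -> dict:
        return {"count": self.count}

    def execute(self, action: dict) -> None:
        if action.get("tool") == "increment":
            self.count += 1

    def done(self) -> bool:
        return self.count >= self.target

    def validate(self) -> bool:
        return self.count == self.target


class IncrementAgent:
    """Toy agent that always asks for the increment tool."""

    def reset(self) -> None:
        pass

    def act(self, observation: dict) -> dict:
        return {"tool": "increment"}


def run_episode(env, agent, budgets: Budgets) -> dict:
    agent.reset()
    trace = []
    failure_type = None
    while True:
        observation = env.observe()          # observe
        action = agent.act(observation)      # act
        env.execute(action)                  # execute action
        trace.append(action)
        budgets.steps_used += 1              # update budgets
        if "tool" in action:
            budgets.tool_calls_used += 1
        if env.done():                       # check termination
            break
        if budgets.exhausted():
            failure_type = "budget_exhausted"
            break
    # validate: the validator alone decides success; no retries, no judgment
    success = failure_type is None and env.validate()
    if not success and failure_type is None:
        failure_type = "logic_failure"
    return {                                 # emit result
        "success": success,
        "failure_type": failure_type,
        "steps_used": budgets.steps_used,
        "tool_calls_used": budgets.tool_calls_used,
        "action_trace": trace,
    }
```

With a target of 3 and budgets of 10, the run succeeds after three steps. With max_steps=2 the same run ends as budget_exhausted, and the runner does not retry it.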

Result Format

result.json

{
  "task_id": "filesystem_hidden_config",
  "version": 1,
  "seed": 42,
  "success": true,
  "failure_reason": null,
  "failure_type": null,
  "steps_used": 37,
  "tool_calls_used": 12,
  "action_trace": []
}
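
One way to produce this file is to model the result as a typed record and serialize it. The sketch below mirrors the fields of the example above; the RunResult class and its to_json method are hypothetical names, and the field types are inferred from the sample values.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RunResult:
    """Hypothetical record mirroring the result.json fields shown above."""
    task_id: str
    version: int
    seed: int
    success: bool
    failure_reason: Optional[str] = None
    failure_type: Optional[str] = None
    steps_used: int = 0
    tool_calls_used: int = 0
    action_trace: List[dict] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() keeps the declaration order, so keys come out in the
        # same order as in the example file.
        return json.dumps(asdict(self), indent=2)


result = RunResult(
    task_id="filesystem_hidden_config",
    version=1,
    seed=42,
    success=True,
    steps_used=37,
    tool_calls_used=12,
)
print(result.to_json())
```

Python's None serializes to JSON null, so a successful run carries failure_reason and failure_type as null without special handling.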

Failure Semantics

Every failed run is classified into one of these failure_type buckets:

Type	Description
budget_exhausted	Step or tool-call budget ran out
invalid_action	Action violated the schema or raised an exception
sandbox_violation	Environment access outside the allowed surface
logic_failure	Validator declared a terminal failure, or the run ended without a more specific failure type
timeout	Optional wall-clock limit exceeded
non_termination	Run did not terminate on its own and the harness had to abort it

Successful runs always emit failure_type: null.
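
The classification rules can be checked mechanically before a result is emitted. The sketch below encodes the six types from the table as an enum and verifies the two invariants stated here; FailureType and check_failure_fields are hypothetical names.

```python
from enum import Enum
from typing import Optional


class FailureType(str, Enum):
    """The six failure_type values listed in the table above."""
    BUDGET_EXHAUSTED = "budget_exhausted"
    INVALID_ACTION = "invalid_action"
    SANDBOX_VIOLATION = "sandbox_violation"
    LOGIC_FAILURE = "logic_failure"
    TIMEOUT = "timeout"
    NON_TERMINATION = "non_termination"


def check_failure_fields(success: bool, failure_type: Optional[str]) -> None:
    """Raise ValueError if success and failure_type contradict each other."""
    if success:
        # Successful runs always emit failure_type: null.
        if failure_type is not None:
            raise ValueError("successful run must have failure_type null")
        return
    # Every failed run must carry exactly one known failure type.
    try:
        FailureType(failure_type)
    except ValueError:
        raise ValueError(f"unknown failure_type: {failure_type!r}") from None
```

For example, check_failure_fields(False, "timeout") passes, while check_failure_fields(True, "timeout") and check_failure_fields(False, None) both raise ValueError.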

Determinism Contract

Given the same inputs (task, version, seed, and agent), the runner must produce the same result.
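
A simple way to test this contract is to run the same configuration several times and compare a digest of each result. The sketch below is an assumption-laden illustration: result_digest, is_reproducible, and toy_run are hypothetical names, and toy_run stands in for a real harness run.

```python
import hashlib
import json
import random
from typing import Callable


def result_digest(result: dict) -> str:
    """Hash a canonical JSON form so key order and spacing cannot matter."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def toy_run(seed: int) -> dict:
    """Stand-in for a real run; all randomness comes from the seed."""
    rng = random.Random(seed)  # local RNG, no shared global state
    trace = [rng.randint(0, 9) for _ in range(5)]
    return {"seed": seed, "success": sum(trace) % 2 == 0, "action_trace": trace}


def is_reproducible(run: Callable[[int], dict], seed: int, repeats: int = 3) -> bool:
    """True if every repeat with the same seed yields the same digest."""
    digests = {result_digest(run(seed)) for _ in range(repeats)}
    return len(digests) == 1
```

Drawing randomness from a seeded local generator rather than global state is what makes toy_run pass this check. A run that reads the clock or other hidden state into its result would fail it.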