Runner & Failure Semantics
The harness runner executes an agent inside a task and produces a reproducible outcome.
Responsibilities
Must
- Load a task.
- Initialize the environment.
- Enforce the agent interface contract.
- Enforce budgets.
- Execute the observe → act loop.
- Validate success or failure.
- Emit a machine-readable result.
Must Not
- Interpret intent.
- Retry failures.
- Modify agent behavior.
- Judge outputs subjectively.
High-Level Flow
load task → load agent → setup environment → reset agent → loop: observe → act → execute action → update budgets → check termination → validate → emit result
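The flow above can be sketched as a small loop. This is a minimal illustration, not the harness's actual implementation: `Budgets`, `CounterEnv`, `IncrementAgent`, and `run` are hypothetical names, and the toy environment simply counts up to a target so the loop has something to validate.

```python
from dataclasses import dataclass


@dataclass
class Budgets:
    # Hypothetical limits; the counters mirror steps_used / tool_calls_used.
    max_steps: int
    max_tool_calls: int
    steps_used: int = 0
    tool_calls_used: int = 0

    def exhausted(self) -> bool:
        return (self.steps_used >= self.max_steps
                or self.tool_calls_used >= self.max_tool_calls)


class CounterEnv:
    """Toy environment (assumption): done once the counter reaches target."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.value = 0

    def observe(self) -> dict:
        return {"value": self.value, "target": self.target}

    def execute(self, action: dict) -> bool:
        """Apply an action; return True if it counted as a tool call."""
        if action.get("type") != "increment":
            raise ValueError(f"unknown action: {action!r}")
        self.value += 1
        return True

    def done(self) -> bool:
        return self.value >= self.target


class IncrementAgent:
    """Toy agent (assumption): always asks to increment."""

    def reset(self) -> None:
        pass

    def act(self, observation: dict) -> dict:
        return {"type": "increment"}


def run(env: CounterEnv, agent: IncrementAgent, budgets: Budgets) -> dict:
    agent.reset()
    trace = []
    failure_type = None
    while not env.done():                      # check termination
        if budgets.exhausted():                # enforce budgets, no retries
            failure_type = "budget_exhausted"
            break
        action = agent.act(env.observe())      # observe -> act
        try:
            used_tool = env.execute(action)    # execute action
        except ValueError:
            failure_type = "invalid_action"
            break
        trace.append(action)
        budgets.steps_used += 1                # update budgets
        budgets.tool_calls_used += int(used_tool)
    success = failure_type is None and env.done()  # validate
    return {                                   # emit result
        "success": success,
        "failure_type": failure_type,
        "steps_used": budgets.steps_used,
        "tool_calls_used": budgets.tool_calls_used,
        "action_trace": trace,
    }
```

Note that the loop never retries a failed action and never rewrites what the agent returns; it only records the outcome, which is in keeping with the "Must Not" list.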
Result Format
result.json
{
  "task_id": "filesystem_hidden_config",
  "version": 1,
  "seed": 42,
  "success": true,
  "failure_reason": null,
  "failure_type": null,
  "steps_used": 37,
  "tool_calls_used": 12,
  "action_trace": []
}
Failure Semantics
Every failed run is classified into one of these failure_type buckets:
| Type | Description |
|---|---|
| budget_exhausted | Steps or tool calls depleted |
| invalid_action | Schema violations or action exceptions |
| sandbox_violation | Environment access outside allowed surface |
| logic_failure | Validator declared a terminal failure, or the run ended without a more specific failure type |
| timeout | Optional wall-clock limit tripped |
| non_termination | Harness had to abort the run |
Successful runs always emit failure_type: null.
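One way to make these rules hard to violate is to encode them in the result type itself. The sketch below is an assumption about how that might look, not the harness's real code: `FailureType` and `RunResult` are hypothetical names, with fields copied from the result.json example and values copied from the table above.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Optional


class FailureType(Enum):
    BUDGET_EXHAUSTED = "budget_exhausted"
    INVALID_ACTION = "invalid_action"
    SANDBOX_VIOLATION = "sandbox_violation"
    LOGIC_FAILURE = "logic_failure"
    TIMEOUT = "timeout"
    NON_TERMINATION = "non_termination"


@dataclass
class RunResult:
    task_id: str
    version: int
    seed: int
    success: bool
    failure_reason: Optional[str]
    failure_type: Optional[FailureType]
    steps_used: int
    tool_calls_used: int
    action_trace: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Successful runs always emit failure_type: null.
        if self.success and self.failure_type is not None:
            raise ValueError("successful run must have failure_type None")
        # Every failed run is classified into exactly one bucket.
        if not self.success and self.failure_type is None:
            raise ValueError("failed run must have a failure_type")

    def to_json(self) -> str:
        payload = asdict(self)
        # Serialize the enum as its string value, or null on success.
        if self.failure_type is not None:
            payload["failure_type"] = self.failure_type.value
        return json.dumps(payload, indent=2)
```

Constructing `RunResult(..., success=True, failure_type=FailureType.TIMEOUT, ...)` raises `ValueError`, so an inconsistent result cannot be emitted by accident.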
Determinism Contract
Given the same inputs, such as the same task version, seed, and agent, the runner must reproduce the same result.
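A reproducibility check along these lines could run the same configuration twice and compare fingerprints of the results. The sketch below is illustrative only: `toy_run` stands in for a real run and `fingerprint` is a hypothetical helper. It assumes all randomness flows from the seed through a local `random.Random` instance, and that the result holds no wall-clock values, which would have to be excluded before hashing.

```python
import hashlib
import json
import random


def toy_run(seed: int, steps: int = 5) -> dict:
    """Stand-in for a real run (assumption): randomness comes only from seed."""
    rng = random.Random(seed)  # local RNG, so no shared global state
    trace = [rng.randint(0, 9) for _ in range(steps)]
    return {"seed": seed, "steps_used": steps, "action_trace": trace}


def fingerprint(result: dict) -> str:
    """Hash a canonical JSON encoding so key order cannot change the digest."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_reproducible(seed: int) -> bool:
    """Run twice with identical inputs and compare fingerprints."""
    return fingerprint(toy_run(seed)) == fingerprint(toy_run(seed))
```

Comparing digests rather than raw dictionaries keeps the check cheap to store and log alongside each result.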