Task Harness Specification (v0)

A task is a closed-world environment with a deterministic initial state, a constrained action surface, and a single success condition.

Task Directory Layout

tasks/
  <task_id>/
    task.toml   (preferred; task.yaml legacy)
    setup.py
    actions.py
    validate.py
    README.md   (optional)

Nothing outside this directory may influence task behavior.

task.toml

id = "filesystem_hidden_config"
suite = "filesystem"
version = 1
description = "Extract the correct configuration value from the filesystem."
deterministic = true
seed_behavior = "fixed"

[budgets]
steps = 200
tool_calls = 50

[action_surface]
source = "actions.py"
schema = "introspected"

[validator]
entrypoint = "validate.py:validate"

task.toml is declarative data only: no logic, no conditionals, no imports.
Once released, this file is immutable.
Changing task behavior requires a new task version.

Module Responsibilities

setup.py

Creates the world. Runs before the agent starts. Must be deterministic given the seed. No network access, no wall-clock dependence.

actions.py

Defines everything the agent can do. All actions are synchronous, logged, and budgeted. No shell access, no filesystem escape, no reflection.
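
As an illustration of "no filesystem escape", the sketch below confines two read-only actions to the task root. The action names, the ActionError type, and the convention of passing root as the first argument are assumptions, not part of the specification; logging and budgeting are left to the harness.

```python
from pathlib import Path


class ActionError(Exception):
    """Raised when an action is rejected."""


def _resolve(root: Path, relative: str) -> Path:
    """Resolve a path (following symlinks and '..') and require it to stay under root."""
    root = root.resolve()
    target = (root / relative).resolve()
    if not target.is_relative_to(root):  # Python 3.9+
        raise ActionError(f"path escapes task root: {relative}")
    return target


def list_dir(root: Path, path: str = ".") -> list[str]:
    """Return the sorted entry names of a directory inside the task root."""
    return sorted(p.name for p in _resolve(root, path).iterdir())


def read_file(root: Path, path: str) -> str:
    """Return the text content of a file inside the task root."""
    return _resolve(root, path).read_text()
```

Resolving before checking means "../" sequences, absolute paths, and symlinks pointing outside the root are all rejected by the same test.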

validate.py

Defines success. Deterministic, final-state only. No LLMs, no partial credit, no time-based logic.
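
A validator in this style reduces to a pure function of the final state that returns a boolean. In the sketch below, only the function name validate comes from the entrypoint in task.toml; the root argument, the answer.txt file, and the EXPECTED value are hypothetical.

```python
from pathlib import Path

# Hypothetical expected answer; with a fixed seed it is known when the task is authored.
EXPECTED = "value=42"


def validate(root: Path) -> bool:
    """Return True only if the final state contains exactly the expected answer."""
    answer = root / "answer.txt"  # hypothetical location where the agent writes its answer
    if not answer.is_file():
        return False
    return answer.read_text().strip() == EXPECTED
```

The function inspects only what exists after the run ends, so the same final state always yields the same verdict.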

Step Model

1. Agent receives an observation.
2. Agent emits exactly one action.
3. Harness executes the action.
4. Result is recorded.
5. Budgets are decremented.

No action batching. No background execution.
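
The execute, record, decrement sequence above can be sketched as a single function. Everything here is illustrative: run_step is a hypothetical name, the log entry shape is invented, and charging one unit against every budget per action is an assumption, since the specification does not say how steps and tool_calls differ.

```python
from typing import Any, Callable


def run_step(actions: dict[str, Callable[..., Any]],
             budgets: dict[str, int],
             log: list[dict],
             name: str, /, **kwargs: Any) -> Any:
    """Execute exactly one action synchronously, record it, then charge budgets."""
    if min(budgets.values()) <= 0:
        raise RuntimeError("budget exhausted")
    try:
        result = actions[name](**kwargs)  # blocks until the action returns
    except Exception as exc:
        # Assumption: a failed or unknown action is still recorded and still costs budget.
        result = {"error": f"{type(exc).__name__}: {exc}"}
    log.append({"action": name, "args": kwargs, "result": result})
    for key in budgets:
        budgets[key] -= 1  # assumption: every action costs one unit of each budget
    return result
```

Because the function returns only after the action finishes and the log entry is written, there is no point at which two actions can be in flight together.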

Anti-Cheating Guarantees

Process isolation.
Read-only task metadata.
No filesystem escape.
No environment introspection.
No dynamic imports outside the task.