Task Harness Specification (v0)
A task is a closed-world environment with a deterministic initial state, a constrained action surface, and a single success condition.
Task Directory Layout
    tasks/
      <task_id>/
        task.toml    (preferred; task.yaml legacy)
        setup.py
        actions.py
        validate.py
        README.md    (optional)

Nothing outside this directory may influence task behavior.
task.toml
    id = "filesystem_hidden_config"
    suite = "filesystem"
    version = 1
    description = "Extract the correct configuration value from the filesystem."
    deterministic = true
    seed_behavior = "fixed"

    [budgets]
    steps = 200
    tool_calls = 50

    [action_surface]
    source = "actions.py"
    schema = "introspected"

    [validator]
    entrypoint = "validate.py:validate"

- No logic, no conditionals, no imports.
- Once released, this file is immutable.
- Changing behavior requires a new version.
Module Responsibilities
setup.py
Creates the world. Runs before the agent starts. Must be deterministic given the seed. No network access, no wall-clock dependence.
actions.py
Defines everything the agent can do. All actions are synchronous, logged, and budgeted. No shell access, no filesystem escape, no reflection.
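
One way the "synchronous, logged, and budgeted" requirement could be met is a small registry that wraps every call. This is only a sketch: `ActionSurface`, `BudgetExceeded`, and the `read_file` action are hypothetical names, and the in-memory `FILES` dict stands in for the task world.

```python
from typing import Any, Callable


class BudgetExceeded(RuntimeError):
    """Raised when the tool-call budget is exhausted (hypothetical exception)."""


class ActionSurface:
    """Registry that runs actions synchronously, logs them, and charges a budget."""

    def __init__(self, tool_calls: int) -> None:
        self._actions: dict[str, Callable[..., Any]] = {}
        self.remaining = tool_calls
        self.log: list[dict[str, Any]] = []

    def action(self, func: Callable[..., Any]) -> Callable[..., Any]:
        """Decorator: expose func to the agent under its own name."""
        self._actions[func.__name__] = func
        return func

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._actions:
            raise KeyError(f"unknown action: {name}")
        if self.remaining <= 0:
            raise BudgetExceeded("tool_calls budget exhausted")
        self.remaining -= 1
        result = self._actions[name](**kwargs)  # blocking call, no background work
        self.log.append({"action": name, "args": kwargs, "result": result})
        return result


surface = ActionSurface(tool_calls=2)
FILES = {"notes.txt": "hello"}  # stand-in for the task world


@surface.action
def read_file(path: str) -> str:
    return FILES[path]


print(surface.call("read_file", path="notes.txt"))
```

Only functions registered through the decorator are reachable by name, which keeps the action surface explicit and easy to audit.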
validate.py
Defines success. Deterministic, final-state only. No LLMs, no partial credit, no time-based logic.
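
A validator in this spirit might look like the sketch below. The spec only fixes the entrypoint name `validate.py:validate`; the `root` argument, the `answer.txt` file, and the expected value are assumptions for illustration. The function inspects the final state only and returns a plain boolean.

```python
import tempfile
from pathlib import Path

EXPECTED_VALUE = "4242"  # hypothetical ground truth fixed by setup.py


def validate(root: Path) -> bool:
    """Return True only if the final state holds the exact expected answer."""
    answer = root / "answer.txt"  # hypothetical location where the agent writes its result
    if not answer.is_file():
        return False
    # Pass or fail, nothing in between: no partial credit, no timing, no model calls.
    return answer.read_text().strip() == EXPECTED_VALUE


with tempfile.TemporaryDirectory() as d:
    world = Path(d)
    before = validate(world)  # no answer written yet
    (world / "answer.txt").write_text("4242\n")
    after = validate(world)
print(before, after)
```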
Step Model
- Agent receives an observation.
- Agent emits exactly one action.
- Harness executes the action.
- Result is recorded.
- Budgets are decremented.
No action batching. No background execution.
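
The step model above can be sketched as a simple loop. Everything here is an assumed shape rather than the harness's actual interface: `run_episode`, the `Episode` record, and the convention that an agent returns `None` to stop are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Episode:
    steps_left: int
    transcript: list = field(default_factory=list)


def run_episode(
    agent: Callable[[str], Optional[str]],
    execute: Callable[[str], str],
    first_observation: str,
    steps: int,
) -> Episode:
    """Run one observe/act/execute/record cycle per step until the budget is spent."""
    episode = Episode(steps_left=steps)
    observation = first_observation
    while episode.steps_left > 0:
        action = agent(observation)  # exactly one action per observation
        if action is None:  # assumption: None means the agent is finished
            break
        result = execute(action)  # runs to completion before the next step
        episode.transcript.append((action, result))  # result is recorded
        episode.steps_left -= 1  # budget is decremented
        observation = result
    return episode


episode = run_episode(
    agent=lambda obs: "noop",
    execute=lambda action: f"ran {action}",
    first_observation="start",
    steps=3,
)
print(len(episode.transcript), episode.steps_left)
```

Because each iteration handles a single action and blocks until it finishes, batching and background execution are ruled out by construction.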
Anti-Cheating Guarantees
- Process isolation.
- Read-only task metadata.
- No filesystem escape.
- No environment introspection.
- No dynamic imports outside the task.
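
As one concrete illustration, the "no filesystem escape" guarantee is often enforced by resolving every agent-supplied path and rejecting anything that lands outside the task root. The helper name `resolve_inside` is hypothetical, and this sketch assumes Python 3.9+ for `Path.is_relative_to`. It covers path traversal only; process isolation and import restrictions need separate mechanisms.

```python
import tempfile
from pathlib import Path


def resolve_inside(root: Path, user_path: str) -> Path:
    """Resolve user_path under root, refusing anything that escapes it."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"path escapes task root: {user_path}")
    return candidate


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    inside = resolve_inside(root, "etc/app/settings.conf")
    ok = inside.is_relative_to(root.resolve())
    try:
        resolve_inside(root, "../outside.txt")
        escaped = True
    except PermissionError:
        escaped = False
print(ok, escaped)
```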
What Makes a Good Task?
A good task:
- Fails brittle agents quickly.
- Rewards conservative behavior.
- Has exactly one right outcome.
- Surfaces why the agent failed.
A bad task:
- Requires guessing.
- Encourages hacks.
- Depends on timing.
- Takes minutes to run.