Task Harness Specification (v0)

A task is a closed-world environment with a deterministic initial state, a constrained action surface, and a single success condition.

Task Directory Layout

tasks/
  <task_id>/
    task.toml   (preferred; task.yaml legacy)
    setup.py
    actions.py
    validate.py
    README.md   (optional)

Nothing outside this directory may influence task behavior.

task.toml

id = "filesystem_hidden_config"
suite = "filesystem"
version = 1
description = "Extract the correct configuration value from the filesystem."
deterministic = true
seed_behavior = "fixed"

[budgets]
steps = 200
tool_calls = 50

[action_surface]
source = "actions.py"
schema = "introspected"

[validator]
entrypoint = "validate.py:validate"

Rules for task.toml:

  • Declarative data only: no logic, no conditionals, no imports.
  • Once released, this file is immutable.
  • Changing behavior requires a new version.
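
As a rough illustration, a harness could sanity-check a parsed manifest before running the task. This is a sketch under stated assumptions, not the harness's actual loader: the name `check_manifest`, the set of required keys, and the use of `ValueError` are all hypothetical, and the manifest is assumed to have been parsed into a plain dict already (for example with `tomllib` on Python 3.11+).

```python
# Hypothetical manifest check. The required keys are inferred from the
# example task.toml above; they are not a confirmed schema.
REQUIRED_KEYS = {"id", "suite", "version", "description",
                 "budgets", "action_surface", "validator"}


def _is_positive_int(value) -> bool:
    # bool is a subclass of int, so reject it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def check_manifest(manifest: dict) -> None:
    """Raise ValueError if the parsed task.toml looks malformed."""
    missing = REQUIRED_KEYS - manifest.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    if not _is_positive_int(manifest["version"]):
        raise ValueError("version must be a positive integer")

    for name, value in manifest["budgets"].items():
        if not _is_positive_int(value):
            raise ValueError(f"budget {name!r} must be a positive integer")

    # The entrypoint is written as "<file>.py:<function>".
    entrypoint = manifest["validator"]["entrypoint"]
    module_file, sep, func = entrypoint.partition(":")
    if not sep or not module_file.endswith(".py") or not func.isidentifier():
        raise ValueError(f"bad validator entrypoint: {entrypoint!r}")
```

Because the manifest is pure data, a check like this never needs to execute anything from the task directory.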

Module Responsibilities

setup.py

Creates the world. Runs before the agent starts. Must be deterministic given the seed. No network access, no wall-clock dependence.

actions.py

Defines everything the agent can do. All actions are synchronous, logged, and budgeted. No shell access, no filesystem escape, no reflection.
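
One way an action could enforce "no filesystem escape" is to resolve every agent-supplied path against the task root and reject anything that lands outside it. The sketch below assumes hypothetical names (`ActionError`, `resolve_inside`, `read_file`) and a simple list as the log; the real action surface may differ.

```python
from pathlib import Path


class ActionError(Exception):
    """Raised when an action is rejected (hypothetical error type)."""


def resolve_inside(root: Path, user_path: str) -> Path:
    """Resolve user_path under root, rejecting anything that escapes it."""
    root = root.resolve()
    # resolve() collapses ".." and follows symlinks before the check.
    candidate = (root / user_path).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ActionError(f"path escapes task root: {user_path!r}") from None
    return candidate


def read_file(root: Path, path: str, log: list) -> str:
    """Synchronous read of a file inside the task root."""
    log.append(("read_file", path))  # log the attempt, even if it is rejected
    target = resolve_inside(root, path)
    return target.read_text()
```

Absolute paths are handled by the same check: joining an absolute path onto `root` discards `root`, so the resolved result falls outside the task root and is rejected.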

validate.py

Defines success. Deterministic, final-state only. No LLMs, no partial credit, no time-based logic.
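
A validator in this style can be very small. The sketch below assumes the entrypoint receives the task root as a `Path` and returns a bool, and that the agent is expected to leave its answer in a file named `answer.txt`; the signature, the file name, and the expected value are all hypothetical.

```python
from pathlib import Path

EXPECTED = "4242"  # hypothetical answer, fixed by the task author


def validate(root: Path) -> bool:
    """Return True only if the final state holds the expected answer."""
    answer = root / "answer.txt"
    if not answer.is_file():
        return False
    # Exact match on the final state: pass or fail, nothing in between.
    return answer.read_text().strip() == EXPECTED
```

The function looks only at what exists when the episode ends. It does not consult the action log, the clock, or any external service.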

Step Model

  • Agent receives an observation.
  • Agent emits exactly one action.
  • Harness executes the action.
  • Result is recorded.
  • Budgets are decremented.

No action batching. No background execution.
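
The step model above can be sketched as a single loop. This is an illustration under assumptions, not the harness's implementation: the names `Budgets`, `Episode`, and `run_episode` are hypothetical, the agent is modeled as a callable returning `(action_name, args)`, a `"stop"` action is invented to end the episode early, and every action is assumed to cost one step and one tool call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Budgets:
    steps: int
    tool_calls: int


@dataclass
class Episode:
    budgets: Budgets
    log: list = field(default_factory=list)


def run_episode(agent: Callable[[str], tuple], actions: dict,
                episode: Episode, first_observation: str) -> Episode:
    """Run observe/act/execute/record until a budget is spent or the agent stops."""
    observation = first_observation
    while episode.budgets.steps > 0 and episode.budgets.tool_calls > 0:
        name, args = agent(observation)        # agent emits exactly one action
        if name == "stop":                     # hypothetical stop signal
            break
        if name in actions:
            result = actions[name](*args)      # executed synchronously
        else:
            result = f"error: unknown action {name!r}"
        episode.log.append((name, args, result))  # result is recorded
        episode.budgets.steps -= 1                # budgets are decremented
        episode.budgets.tool_calls -= 1
        observation = str(result)
    return episode
```

Because each iteration handles one action and waits for its result, there is no point in the loop where batched or background work could run.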

Anti-Cheating Guarantees

  • Process isolation.
  • Read-only task metadata.
  • No filesystem escape.
  • No environment introspection.
  • No dynamic imports outside the task.
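
As one small illustration of "read-only task metadata", harness code could hand out a frozen view of the parsed manifest instead of the mutable dict. The helper name `freeze` is hypothetical. This only guards against accidental mutation inside a single Python process; it is not a substitute for process isolation.

```python
from types import MappingProxyType


def freeze(value):
    """Recursively wrap dicts and lists in read-only equivalents."""
    if isinstance(value, dict):
        # MappingProxyType exposes a mapping without item assignment.
        return MappingProxyType({k: freeze(v) for k, v in value.items()})
    if isinstance(value, list):
        return tuple(freeze(v) for v in value)
    return value
```

For example, with `meta = freeze({"id": "filesystem_hidden_config", "budgets": {"steps": 200}})`, reading `meta["budgets"]["steps"]` works as usual, while assigning to it raises `TypeError`.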

What Makes a Good Task?

A good task

  • Fails brittle agents quickly.
  • Rewards conservative behavior.
  • Has exactly one right outcome.
  • Surfaces why the agent failed.

A bad task

  • Requires guessing.
  • Encourages hacks.
  • Depends on timing.
  • Takes minutes to run.