Task Catalog
Deterministic, closed-world tasks with frozen versions, constrained actions, and mechanical validators.
Registry & Plugin Workflow
- tasks/registry.json is the manifest that keeps documentation and specs in sync.
- Each task directory includes a task.toml manifest describing budgets, entrypoints, and deterministic behavior.
- External task packages can register via the agent_bench.tasks entry-point group.
- The loader merges bundled manifest rows with plugin descriptors.
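The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the actual loader: only the `agent_bench.tasks` group name comes from this document. The row shape (`id`, `version`), the helper names, and the "bundled rows win on conflict" policy are assumptions made for the example. It relies on `importlib.metadata.entry_points(group=...)`, which needs Python 3.10 or newer.

```python
from importlib.metadata import entry_points


def discover_plugin_rows(group="agent_bench.tasks"):
    """Load every descriptor registered under the entry-point group.

    Assumption: each entry point resolves to a dict-like row, or to a
    zero-argument callable that returns one.
    """
    rows = []
    for ep in entry_points(group=group):
        descriptor = ep.load()
        rows.append(descriptor() if callable(descriptor) else descriptor)
    return rows


def merge_task_rows(bundled_rows, plugin_rows):
    """Merge rows keyed by (id, version).

    Assumed policy: a bundled row wins when a plugin registers the same key.
    """
    merged = {}
    for row in list(bundled_rows) + list(plugin_rows):
        key = (row["id"], row["version"])
        merged.setdefault(key, row)  # first writer (bundled) is kept
    return [merged[key] for key in sorted(merged)]
```

Keying on `(id, version)` mirrors the `name@version` headings used in this catalog, so two versions of the same task can coexist in the merged list.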
filesystem_hidden_config@1
| Property | Value |
|---|---|
| Suite | filesystem |
| Deterministic | Yes |
| Path | tasks/filesystem_hidden_config/ |
Forces agents to plan cautious filesystem exploration to recover API_KEY without brute-force traversal.
Skills stressed: Stateful search across nested directories, budget-aware exploration, validating when a clue resolves the goal.
rate_limited_api@1
| Property | Value |
|---|---|
| Suite | api |
| Deterministic | Yes |
| Path | tasks/rate_limited_api/ |
Single-endpoint API that enforces strict quotas and injects transient failures; agents must respect retry_after windows.
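As a rough illustration of what respecting a retry_after window can look like, the sketch below waits for the advertised interval before retrying. The `RateLimited` exception, the `call_with_retry` helper, and the attempt cap are hypothetical names for this example; the task's real tool interface may expose rate-limit errors differently.

```python
import time


class RateLimited(Exception):
    """Hypothetical error carrying the server-advertised wait in seconds."""

    def __init__(self, retry_after):
        super().__init__(f"rate limited; retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_retry(call, max_attempts=5, sleep=time.sleep):
    """Invoke `call`, sleeping for retry_after between rate-limited attempts.

    The last attempt re-raises instead of sleeping, so the caller never
    spends budget on a wait that cannot be followed by another try.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise
            sleep(exc.retry_after)
```

Passing `sleep` as a parameter keeps the helper deterministic under test: a recorder such as `waits.append` can stand in for real waiting.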
rate_limited_chain@1
| Property | Value |
|---|---|
| Suite | api |
| Deterministic | Yes |
| Path | tasks/rate_limited_chain/ |
Extends the API with a handshake template and chained endpoints that expire.
Skills stressed: Parsing templates, tracking handshake lifetimes, differentiating errors.
deterministic_rate_service@1
| Property | Value |
|---|---|
| Suite | api |
| Deterministic | Yes |
| Path | tasks/deterministic_rate_service/ |
Deterministic yet unforgiving service combining handshake confirmation, required payload templates, rate limiting, and a guaranteed transient hiccup.
log_alert_triage@1
| Property | Value |
|---|---|
| Suite | operations |
| Deterministic | Yes |
| Path | tasks/log_alert_triage/ |
Walk deterministic log artifacts and recover the final ALERT_CODE used for escalation.
config_drift_remediation@1
| Property | Value |
|---|---|
| Suite | operations |
| Deterministic | Yes |
| Path | tasks/config_drift_remediation/ |
Compare desired vs. live configuration and output the exact remediation patch line.
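The comparison can be sketched as a key-by-key diff. This is an assumption-laden illustration: the flat `key=value` patch-line format and the `remediation_lines` helper are invented here, and the task's actual artifacts and expected output format may differ.

```python
def remediation_lines(desired, live):
    """Return one 'key=value' line per key whose live value has drifted.

    A key counts as drifted when it is missing from `live` or when its
    live value differs from the desired one. Keys are visited in sorted
    order so the output is stable across runs.
    """
    lines = []
    for key in sorted(desired):
        if key not in live or live[key] != desired[key]:
            lines.append(f"{key}={desired[key]}")
    return lines
```

For example, `remediation_lines({"tls": "on"}, {"tls": "off"})` yields a single line restoring the desired value.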
incident_recovery_chain@1
| Property | Value |
|---|---|
| Suite | operations |
| Deterministic | Yes |
| Path | tasks/incident_recovery_chain/ |
Follow a deterministic recovery handoff chain to extract the final RECOVERY_TOKEN.
log_stream_monitor@1
| Property | Value |
|---|---|
| Suite | operations |
| Deterministic | Yes |
| Path | tasks/log_stream_monitor/ |
Poll a seeded, paginated log stream across multiple pages, filter out INFO/WARN noise, and emit the STREAM_CODE embedded in the first CRITICAL entry.
Skills stressed: Cursor-based pagination without over-fetching, signal/noise discrimination across a multi-page stream, stopping immediately once the trigger condition is met.
Why it matters: Mirrors production monitoring loops where agents must watch a live stream, ignore routine events, and fire exactly once on a critical signal, without exhausting tool-call budgets on noise.
Quick start: agent-bench run pairing log_stream_monitor
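The polling pattern this task rewards can be sketched as a cursor loop that returns as soon as the first CRITICAL entry appears. The page shape (`entries`, `next_cursor`), the entry fields (`level`, `code`), and the `fetch_page` callable are assumptions for illustration only, not the task's documented interface.

```python
def first_critical_code(fetch_page, start_cursor=None, max_pages=10):
    """Walk pages in cursor order and return the first CRITICAL entry's code.

    Stops immediately on the first match, so no further pages are fetched.
    Returns None if the stream ends or the page budget is exhausted.
    """
    cursor = start_cursor
    for _ in range(max_pages):
        page = fetch_page(cursor)
        for entry in page["entries"]:
            if entry["level"] == "CRITICAL":
                return entry["code"]
            # INFO/WARN entries are skipped as noise.
        cursor = page.get("next_cursor")
        if cursor is None:
            break
    return None
```

The `max_pages` cap plays the role of a tool-call budget: each `fetch_page` call is assumed to cost one call.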
runbook_verifier@1
| Property | Value |
|---|---|
| Suite | operations |
| Deterministic | Yes |
| Path | tasks/runbook_verifier/ |
Confirm that every incident runbook phase executed in order, then emit the RUNBOOK_CHECKSUM (phase codes + ACK + handoff token). Producing it requires stitching evidence across the README, runbook index, per-phase manifests, sequence logs, timeline, and final handoff instructions.
Skills stressed: Maintaining strict ordering under tight tool-call budgets, validating that each phase completed, and refusing to emit outputs when any artifact is missing.
Why it matters: Mirrors audit workflows where operators must prove each mitigation phase ran before handing off an incident, with zero tolerance for missing steps.
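The "verify order, then emit, otherwise refuse" behavior can be sketched as below. The function name, the input shapes, and the hyphen-joined checksum layout are assumptions for illustration; the document only states that the checksum combines phase codes, the ACK, and the handoff token.

```python
def build_runbook_checksum(expected_phases, executed, ack, handoff_token):
    """Assemble a checksum only when every phase ran in the expected order.

    expected_phases: ordered phase names from the runbook index.
    executed: (phase, code) pairs in the order they appear in the logs.
    Raises ValueError rather than emitting a partial result.
    """
    if not ack or not handoff_token:
        raise ValueError("missing ACK or handoff token")
    if [phase for phase, _ in executed] != list(expected_phases):
        raise ValueError("phases missing or out of order")
    codes = [code for _, code in executed]
    if any(not code for code in codes):
        raise ValueError("phase without a code")
    # Assumed layout: codes in phase order, then ACK, then handoff token.
    return "-".join(codes + [ack, handoff_token])
```

Raising instead of returning a partial string matches the requirement to refuse output when any artifact is missing.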