CI Workflow

Use the reusable GitHub Actions workflow to run a task and compare results against a baseline.

GitHub Actions

.github/workflows/tracecore-ci.yml

name: tracecore-ci

on:
  pull_request:
  workflow_dispatch:

jobs:
  tracecore-compare:
    uses: ./.github/workflows/baseline-compare.yml
    with:
      agent_path: agents/chain_agent.py
      task_ref: rate_limited_chain@1
      seed: "0"
      baseline: .agent_bench/baselines/rate_limited_chain_chain_agent.json
      require_success: "true"
      max_steps: "180"
      max_tool_calls: "60"
      max_step_delta: "10"
      max_tool_call_delta: "5"

The comparison reports its result through the exit code: 0 = identical, 1 = different, 2 = incompatible task/agent.
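A wrapper script can turn these exit codes into readable messages for CI logs. The following is a minimal sketch, not part of the tool: the helper name `describe_exit_code` and the fallback message are assumptions, and only the three code meanings come from the list above.

```python
# Hypothetical helper: maps the documented compare exit codes to labels.
EXIT_MEANINGS = {
    0: "identical",
    1: "different",
    2: "incompatible task/agent",
}


def describe_exit_code(code: int) -> str:
    """Return a readable label for a compare exit code."""
    # Codes outside the documented set are reported as unknown, not guessed.
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


if __name__ == "__main__":
    for code in (0, 1, 2, 3):
        print(code, "->", describe_exit_code(code))
```

Any non-zero code still fails the CI job by default. The mapping only makes the reason easier to read in the log.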

GitLab CI

.gitlab-ci.yml

stages:
  - run
  - compare
  - gate

run_agent:
  stage: run
  script:
    - pip install -e .[dev]
    - agent-bench run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed 0 > run.json
  artifacts:
    paths:
      - run.json
      - .agent_bench/runs/

compare_baseline:
  stage: compare
  needs: [run_agent]
  script:
    - pip install -e .[dev]
    - agent-bench baseline --compare .agent_bench/baselines/rate_limited_chain_chain_agent.json $(python -c "import json;print(json.load(open('run.json'))['run_id'])")

policy_gates:
  stage: gate
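  # run_agent is listed so its run.json artifact is downloaded;
  # with needs, only artifacts from the listed jobs are fetched.
  needs: [run_agent, compare_baseline]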
  script:
    - pip install -e .[dev]
    - python scripts/policy_gate.py --run-json run.json --baseline .agent_bench/baselines/rate_limited_chain_chain_agent.json --max-steps 180 --max-step-delta 10
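The inline `python -c` command in the compare_baseline job reads `run_id` from run.json, but it fails with a bare traceback if the field is missing. A small script can do the same with a clearer error. This is a sketch under assumptions: the script name `scripts/read_run_id.py` and the function `read_run_id` are hypothetical, and the only thing taken from the configuration above is that run.json contains a top-level `run_id` key.

```python
import json
import sys
from pathlib import Path


def read_run_id(path: str = "run.json") -> str:
    """Load the run artifact and return its top-level run_id field."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    try:
        return str(data["run_id"])
    except (KeyError, TypeError):
        # Exit with a short message instead of a traceback in CI logs.
        raise SystemExit(f"{path} has no 'run_id' field") from None


if __name__ == "__main__":
    # Optional first argument overrides the default artifact path.
    print(read_run_id(sys.argv[1] if len(sys.argv) > 1 else "run.json"))
```

With such a script in place, the command substitution in compare_baseline could become `$(python scripts/read_run_id.py run.json)`.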

Reusable Workflow Parameters

agent_path: Path to the agent file
task_ref: Task reference (e.g., filesystem_hidden_config@1)
seed: Random seed for a deterministic run
baseline: Path to the baseline JSON file
require_success: Whether the baseline must show success
max_steps / max_tool_calls: Upper limits (budgets) on steps and tool calls
max_step_delta / max_tool_call_delta: Allowed deviation in steps and tool calls from the baseline
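To make the budget and delta parameters concrete, the sketch below shows one way such gates could be evaluated. It is illustrative only and does not reproduce the actual workflow or scripts/policy_gate.py. The JSON field names `success`, `steps`, and `tool_calls`, the `Limits` class, and the choice to measure deviation as an absolute difference are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Gate settings, named after the workflow parameters."""
    max_steps: int
    max_tool_calls: int
    max_step_delta: int
    max_tool_call_delta: int
    require_success: bool = True


def check_gates(run: dict, baseline: dict, limits: Limits) -> list:
    """Return a list of violations; an empty list means the gate passes."""
    violations = []
    # Assumed field name "success"; checked on the baseline, as documented.
    if limits.require_success and not baseline.get("success", False):
        violations.append("baseline does not show success")
    checks = (
        ("steps", limits.max_steps, limits.max_step_delta),
        ("tool_calls", limits.max_tool_calls, limits.max_tool_call_delta),
    )
    for field, budget, max_delta in checks:
        used = run[field]
        if used > budget:
            violations.append(f"{field}: {used} exceeds budget {budget}")
        # Assumption: deviation is the absolute difference from the baseline.
        delta = abs(used - baseline[field])
        if delta > max_delta:
            violations.append(
                f"{field}: deviates from baseline by {delta}, allowed {max_delta}"
            )
    return violations


if __name__ == "__main__":
    limits = Limits(max_steps=180, max_tool_calls=60,
                    max_step_delta=10, max_tool_call_delta=5)
    baseline = {"success": True, "steps": 120, "tool_calls": 40}
    run = {"steps": 135, "tool_calls": 42}
    for problem in check_gates(run, baseline, limits):
        print(problem)
```

In this example the run stays within both budgets, but its step count differs from the baseline by 15, which exceeds max_step_delta of 10, so the gate reports one violation.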