News
Nightly · March 7, 2026

Nightly scale-up, richer telemetry, and safer agent loading

The latest benchmark work tightens the loop between scale testing, observability, and operator confidence. Nightly CI now exercises a larger acceptance slice, the performance harness emits richer artifacts for analysis, and the dashboard surfaces new storage and LLM-usage signals that make regressions easier to catch.


Highlights

  • Nightly acceptance coverage expanded to 11 episodes with 10 workers to better exercise parallel execution paths.
  • Performance harness artifacts now include per-episode series JSON, disk-footprint reporting, optional compressed bundles, and LLM telemetry volume metrics.
  • The metrics dashboard now surfaces artifact-size and LLM telemetry tracking with alert-style badges for fast triage.
  • Optional AutoGen dependencies now fail more gracefully by deferring runtime errors until the rate_limit_agent is actually used.

A bigger nightly acceptance slice

The nightly workflow was expanded from a smaller smoke run into an 11-episode acceptance slice driven by 10 workers. That makes the nightly lane a better proxy for real batch load, while also validating that the current scheduling and timeout behavior stays budget-compliant under heavier concurrency.

what shipped
nightly acceptance suite: 11 episodes
workers: 10
focus: parallel execution coverage + retained artifacts

Alongside the workflow change, the benchmark repo also added a targeted validation test for the 10-worker slice, which helps guard against silent regressions in parallel execution behavior.
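A validation test for a worker slice like this might look roughly as follows. This is a minimal sketch: the `run_slice` helper, the episode runner, and the constant names are assumptions for illustration, not the benchmark repo's actual API.

```python
# Hypothetical validation test for the nightly slice. run_slice and the
# constants below are illustrative, not the repo's real names.
from concurrent.futures import ThreadPoolExecutor

NIGHTLY_EPISODES = 11
NIGHTLY_WORKERS = 10

def run_slice(episodes: int, workers: int) -> list[str]:
    """Run every episode on a shared worker pool and collect statuses."""
    def run_episode(i: int) -> str:
        # Placeholder for the real episode runner.
        return f"episode-{i}:ok"
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_episode, range(episodes)))

def test_nightly_slice_covers_all_episodes():
    # Guard against silent drops in parallel execution: every episode
    # must come back, and every episode must succeed.
    results = run_slice(NIGHTLY_EPISODES, NIGHTLY_WORKERS)
    assert len(results) == NIGHTLY_EPISODES
    assert all(r.endswith(":ok") for r in results)
```

The point of the test is coverage, not timing: if the pool swallows an episode or a worker dies silently, the count assertion fails before any latency regression is even measured.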

Performance artifacts that are easier to analyze

The performance harness now emits richer output instead of a single coarse summary. Recent updates added run-artifact disk-footprint reporting, optional psutil-based system metrics, per-episode chart-ready JSON series, and optional compressed artifact bundles with threshold guidance.

That combination makes it much easier to answer practical questions: which episode is driving storage growth, whether LLM traffic expanded alongside latency, and how close a run is to the team's size budgets.
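Two of those artifact types can be sketched compactly. The function and file names below are assumptions for illustration; the harness's real output layout may differ.

```python
# Illustrative sketch of disk-footprint reporting and per-episode
# chart-ready series JSON. Names are hypothetical, not the harness API.
import json
from pathlib import Path

def artifact_footprint(run_dir: Path) -> dict:
    """Report bytes used per top-level artifact, plus a grand total."""
    footprint = {}
    for entry in run_dir.iterdir():
        if entry.is_file():
            footprint[entry.name] = entry.stat().st_size
        else:
            footprint[entry.name] = sum(
                p.stat().st_size for p in entry.rglob("*") if p.is_file()
            )
    footprint["total_bytes"] = sum(footprint.values())
    return footprint

def write_episode_series(run_dir: Path, episode: str, points: list[dict]) -> Path:
    """Write one episode's time series as chart-ready JSON."""
    out = run_dir / f"{episode}.series.json"
    out.write_text(json.dumps({"episode": episode, "points": points}, indent=2))
    return out
```

With per-entry sizes in the footprint dict, "which episode is driving storage growth" becomes a one-line sort rather than a manual `du` session.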

Dashboard telemetry that points at the problem faster

The Web UI metrics surface now includes artifact-size and LLM telemetry tracking, paired with performance alert badges. Instead of treating every regression as just a duration problem, operators can now distinguish between wall-clock growth, telemetry-volume growth, and artifact bloat directly in the dashboard.

new telemetry dimensions
artifact_size_bytes
llm_telemetry_volume
performance alert badges
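The badge logic behind such alerts can be sketched simply. The metric names mirror the telemetry dimensions above, but the budget values and the warn/alert thresholds here are invented examples, not the dashboard's actual configuration.

```python
# Minimal sketch of alert-style badge triage. Thresholds are examples.
def badge_for(metric: str, value: float, budget: float) -> str:
    """Map a metric value against its budget to a dashboard badge."""
    ratio = value / budget
    if ratio >= 1.0:
        return f"{metric}: ALERT ({ratio:.0%} of budget)"
    if ratio >= 0.8:
        return f"{metric}: WARN ({ratio:.0%} of budget)"
    return f"{metric}: OK ({ratio:.0%} of budget)"

# badge_for("artifact_size_bytes", 900, 1000)
# → "artifact_size_bytes: WARN (90% of budget)"
```

Because each dimension gets its own badge, artifact bloat can flag WARN while wall-clock duration stays OK, which is exactly the distinction the dashboard now makes visible.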

Safer optional dependency handling for agents

Agent loading also got more forgiving. The rate_limit_agent now uses a graceful import fallback for AutoGen-related dependencies, which means environments that do not install those optional packages no longer fail immediately during import discovery.

Instead, the runtime surfaces the error when that specific agent is actually invoked. That keeps plugin discovery, registry validation, and unrelated workflows usable in lean environments while still preserving a clear failure mode when the optional dependency is truly required.
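The pattern described above is a deferred-import fallback. A minimal sketch, assuming an `autogen` module name and a `RateLimitAgent` class for illustration (the real agent's module layout may differ):

```python
# Graceful import fallback: capture the ImportError at module load,
# surface it only when the agent is actually used.
try:
    import autogen  # optional dependency
    _AUTOGEN_IMPORT_ERROR = None
except ImportError as exc:
    autogen = None
    _AUTOGEN_IMPORT_ERROR = exc

class RateLimitAgent:
    """Importable even when the optional AutoGen packages are absent."""

    def __init__(self):
        # Plugin discovery and registry validation never reach this
        # point; only real invocation does.
        if autogen is None:
            raise RuntimeError(
                "rate_limit_agent requires the optional AutoGen "
                "dependencies; install them to use this agent."
            ) from _AUTOGEN_IMPORT_ERROR
```

Chaining the original `ImportError` via `from` preserves the clear failure mode: the operator sees both that the agent needs AutoGen and exactly which import failed.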

Why this matters

These changes all push in the same direction: make TraceCore easier to trust at scale. The nightly lane exercises more realistic batch conditions, the harness produces artifacts that are ready for charts and audits, and the dashboard turns storage and telemetry growth into first-class signals instead of hidden costs.