The Arc Since the Six Were Caught
After fixing six agent failure modes, the system graduated from outcome checking to path checking. A field report on trajectory evals and generalization.
INDEX / BLOG
Writing about agent orchestration, trust layers, cost engineering, and lessons from building production AI systems.
$ Deep-dives on autonomous AI agents, infrastructure, and cost engineering. No fluff, just systems.
After fixing six agent failure modes, the system graduated from outcome checking to path checking. A field report on trajectory evals and generalization.
58 runs, ~93% autonomous pass, and why that number is honest evidence — not a reliability proof. A field report on agent evals at solo scale.
How fail-closed design, headless execution seams, and converting incidents to fixtures stop AI agents lying about success. A field report from a solo dev-loop.
A strict contract enforced on only one side produces false rejections as reliably as a loose one produces false acceptances. A field report from my fleet.
A PR merged over a red CI job while dev-loop QA said green. Here's the root cause, the fix, and why partial greens are the most dangerous lies in agent systems.
How a silent None from an AI agent registered as a pass — and the fail-closed contract that fixed it. A field report from a production agent fleet.
Six ways my autonomous dev-loop reported success when nothing had landed — and the fail-closed gates that caught each one before they corrupted the ledger.