ContractBench: How LLM Agents Fail by Design

Agent failures aren't random. ContractBench exposes two distinct failure modes across 38 models. Here's what the taxonomy means for how you build.

Dark abstract neural network visualization -- LLM agent reliability -- Øbliq.
ContractBench and ANNEAL reveal that LLM agent failures are structurally predictable—and fixable at the knowledge layer before the model ever acts.

Summary

Two new research artifacts, ContractBench and ANNEAL, are quietly converging on the same architectural problem: LLM agents break in predictable, structurally diagnosable ways, and the field is starting to build the tooling to catch and fix those breaks automatically. The reader takes away a concrete shift in how agent reliability should be engineered, not monitored after the fact, but governed at the knowledge layer before the model acts.

The failures are not random. That is the sentence practitioners need to internalize before reading anything else here.

When ContractBench evaluated 38 models on observation contract compliance, the headline number was that no model crossed 80%. Claude-Opus-4.6 led at 77.8%. GPT-5 showed non-monotonic scaling, meaning a larger model in the family was not consistently more compliant than a smaller one. Qwen 3.5 showed a hard capability cliff between the 4B and 9B parameter variants: 0% compliance at 4B, 56.6% at 9B. These are not noise. These are structural signatures.

The Benchmark Is Not the Point

Compliance Scores Are a Map, Not a Destination

Practitioners tend to fixate on leaderboard positions. That is the wrong read here. The ContractBench result worth holding onto is the failure taxonomy itself. The benchmark categorizes failures along two axes: validity failures, where agents produce outputs that violate the formal constraints of an observation contract, and integrity failures, where agents corrupt the artifact they were supposed to preserve. That dual-axis structure tells you something the leaderboard does not: these failure modes require different interventions.

Validity failures are mostly recoverable at inference time. You can catch them with output validation, schema enforcement, or structured generation constraints. Integrity failures are worse because they can propagate silently through a pipeline. An agent that modifies an artifact it was only supposed to observe does not always produce an error. It produces a corrupted downstream state that surfaces three steps later as an inexplicable wrong answer.

Taxonomy Becomes the Training Signal, Not Just Measurement

The ContractBench paper also proposes using its failure taxonomy as an in-context reward signal. That is the practical unlock. You are not just measuring failure; you are labeling it in a way that can be fed back into the agent's reasoning loop during execution.

No model in a 38-model evaluation crossed 80% compliance on observation contracts. The best result, Claude-Opus-4.6 at 77.8%, means roughly one in five constrained-use operations fails under controlled benchmark conditions.

Why ANNEAL Changes the Repair Calculus

Fixing Recurring Failures Without Touching Weights

ANNEAL takes a different angle on the same structural problem. Instead of benchmarking failure, it automates repair. The core mechanism is Failure-Driven Knowledge Acquisition (FDKA), which localizes recurring failures in process knowledge graphs and generates typed symbolic patches to fix them. The weights of the foundation model are never modified. The repair happens in the knowledge layer that sits above the model.

The reported results are aggressive: 0% holdout failure rates on recurring faults across four domains. For comparison, ReAct and Reflexion retained 72-100% failure rates on the same recurring faults. Removing FDKA from ANNEAL dropped success rate by up to 26.7 percentage points, which isolates the mechanism cleanly.

Symbolic Repair Beats Fine-Tuning Every Time

The architecture pattern here deserves attention. ANNEAL is not doing prompt engineering or fine-tuning. It is doing something closer to symbolic program repair on a graph structure that encodes process knowledge. FDKA synthesizes patch candidates through constrained LLM generation, scores them across multiple dimensions, validates them with canary testing before committing, and maintains full provenance so every accepted edit can be rolled back deterministically.

That rollback capability is load-bearing in a way that the paper undersells. Production agent systems fail in ways that are difficult to attribute. When a patch is validated, committed, and later turns out to be wrong, you need to undo it without corrupting the broader knowledge graph. Deterministic rollback is not a nice-to-have; it is the feature that makes governed symbolic repair practical rather than experimental.

ANNEAL's Failure-Driven Knowledge Acquisition reduces recurring fault rates to 0% on holdout sets. ReAct and Reflexion, two of the most widely deployed agent reasoning patterns, retain 72-100% failure rates on the same faults.

The Canary Testing Step Is Underrated

Most agent reliability work skips validation before deployment. You generate a fix, you apply it, you see what happens. ANNEAL inserts a canary testing gate between generation and acceptance. A proposed patch must pass multi-dimensional scoring and survive canary execution before it enters the knowledge graph. This is standard practice in software deployment and almost nonexistent in agent systems. The fact that ANNEAL formalizes it is a signal about where the field is heading.

The Convergence Nobody Has Named Yet

Both Systems Are Building Governance, Not Just Detection

ContractBench and ANNEAL were not built to solve the same problem. ContractBench is a benchmark for measuring compliance. ANNEAL is a repair architecture for recurring failures. But read together, they are pointing at the same architectural gap that current agent frameworks do not address: agents need a governed layer between the model and the environment that can observe, classify, and repair failures without human intervention and without retraining.

The field is not converging on better models for agents. It is converging on governed layers that sit between the model and the world and make the model's failures structurally addressable.

ReAct gives agents a reasoning loop. Reflexion gives agents a self-critique mechanism. Neither of them provides deterministic repair of a diagnosed, recurring, structurally localized failure. That gap is exactly what ANNEAL fills. ContractBench's failure taxonomy is exactly the classification schema you would need to route failures to a repair system like ANNEAL.

Both Papers Point at the Same Missing Layer

This is not coincidence. This is the field responding to a real production problem. Agents that run long-horizon tasks accumulate failure patterns. Those patterns are diagnosable. The question the community is now answering is: who does the diagnosing, and who does the repairing, and how do you govern that process so it does not introduce new failures while fixing old ones?

What This Means for Your Stack Today

If you are running agents in production and you are relying on retry logic and human escalation to handle recurring failures, you are building on a pattern that two independent research threads are now demonstrating is insufficient. Retry logic does not reduce failure rate on recurring faults; it just runs the same broken path again. Human escalation does not scale.

The practical path forward is not to wait for ANNEAL to ship as a library. It is to start doing two things now. First, build failure taxonomies for your specific agent workflows. ContractBench's dual-axis validity-integrity split is a reasonable starting template. Second, separate recurring failures from one-off failures in your logging. One-off failures are model problems. Recurring failures are structural problems in your process knowledge, and they are fixable without touching your model.

Scaling Alone Will Not Fix Your Agent

The capability cliff in Qwen 3.5, zero to 56.6% between 4B and 9B, is a reminder that scaling alone does not solve compliance. Something qualitative changes at 9B that is not just a smooth extrapolation of 4B behavior. That is relevant for anyone making infrastructure decisions about which model tier to deploy for agent tasks with hard compliance requirements.

Three things to operationalize now

Start logging failure mode, not just failure occurrence. Validity failure and integrity failure require different mitigations.

Separate recurring from one-off failures

Recurring failures in your agent pipeline are structural. They belong in a knowledge graph, not a retry queue.

Treat observation contracts as first-class artifacts

If your agent reads a constrained artifact, enforce read-only access at the tool layer, not in the prompt.

The Bottom Line

  • No model in the current generation handles constrained-use artifacts reliably at production scale, and 77.8% is the ceiling today
  • Failure taxonomy is more operationally valuable than compliance scores because it tells you which failure class requires which fix
  • Governed symbolic repair above the model layer, without weight modification, is the architectural direction for recurring agent failures
  • Canary testing before patch acceptance is the practice that separates experimental repair from production-safe repair
  • If you are using ReAct or Reflexion for recurring-fault-prone tasks, you are accepting a 72-100% recurrence rate by design

Sources: ArXiv cs.SE (Software Engineering & Coding Agents) (May 19, 2026), ArXiv CS.AI (May 19, 2026)