Coding Agents

Coding Agents: Why Quality Has Hit a Ceiling

Memory condensation and code cleanliness barely affect agent output quality. Here's what two new studies reveal about where coding agent gains actually come from.

Philip

20 May 2026 — 5 min read

Two studies on coding agents reveal that memory condensation and code cleanliness don't move pass rates—but they do reshape token costs significantly.

Summary

Two new studies on coding agents surface a quiet pattern: the quality of agent outputs is becoming increasingly decoupled from the quality of the environment around them. Memory condensation strategy barely moves the needle on hypothesis quality. Code cleanliness doesn't change pass rate. What both findings share is more important than what they measure individually.

The field has spent two years optimizing agents for capability. The next two years will be spent optimizing them for cost shape. These studies are early evidence of that shift.

The Metric That Keeps Not Moving

Both papers arrive at a structurally identical finding, from different directions.

In the memory condensation study, GPT-4o was run across sixty DiscoveryBench tasks using eight different strategies: sliding windows, LLM-generated summaries, tool-call masking, and more. The hypothesis quality scores across strategies were statistically indistinguishable. You could spend the engineering effort to implement a sophisticated LLM-based summarization condenser, absorb a 24 to 94 percent increase in token costs, and produce the same scientific hypotheses as a simple sliding window.

Cleaner Code Buys You Absolutely Nothing Either

In the code cleanliness study, Claude Code was run across 33 tasks on six repository pairs designed as minimal-pair controls. Clean code versus messy code. Pass rate on hidden tests: unchanged.

The metric that practitioners most care about (did the agent complete the task correctly?) does not respond to either intervention. This is not a failure of the interventions. It is a signal about where the optimization ceiling is on task success for current models, and a redirect toward what actually differs between configurations.

Quality Is Plateauing, Cost Is Not

What does shift is the resource consumption profile. LLM-based condensers cost 24 to 94 percent more in tokens while delivering no quality gain. Clean codebases produce 7 to 8 percent token reduction and 34 percent fewer file revisitations. Masking tool-call outputs saves a net 8.6 percent in tokens.

These numbers are modest individually. Combined, they point at something structural: the primary variable under practitioner control right now is not "will the agent succeed" but "what does it cost the agent to try, and how many times does it need to try."

Optimization Teams Are Solving The Wrong Problem

This is a different optimization target than the one most teams are currently pursuing.

LLM-based memory condensers increased token costs by 24 to 94 percent across sixty DiscoveryBench tasks, with no measurable improvement in hypothesis quality.

What "Operational Footprint" Actually Means

The code cleanliness paper introduces language worth keeping: "operational footprint." It refers to the token consumption and file revisitation patterns generated by an agent working through a task, distinct from whether the task was ultimately completed.

A 34 percent reduction in file revisitations is not a minor UX improvement. It is a window into how much redundant context traversal an agent does when navigating a poorly structured codebase. The agent succeeds either way, but in the messy codebase it is effectively thrashing. It returns to files it has already read, rebuilds context it should have retained, and spends tokens reconstructing spatial understanding of the repository.

Redundant Traversal Bleeds Into Every Production Metric

This has downstream consequences that don't show up in pass rate but absolutely show up in production cost, latency, and the probability of mid-task context window exhaustion on longer tasks.

The Invisible Tax on Agent Loops

ReAct-style agents and plan-and-execute architectures both accumulate context across steps. Every unnecessary file read is a context injection. Every redundant tool call is a token consumed. In a short 33-task benchmark, this is quantifiable but tolerable. In a production agent running thousands of tasks per day against a large monorepo, a 34 percent reduction in file revisitations translates directly to infrastructure cost.

The cleanliness study isolates what bad code does to an agent that has no prior knowledge of the repository structure. The agent has to explore. The exploration cost is determined by how navigable the repository is. This is, structurally, the same problem as a poorly indexed knowledge base in a RAG pipeline: the retrieval burden increases because the information architecture is bad, not because the retrieval model is bad.

The agent succeeds either way. But in the messy codebase it is thrashing, and thrashing at scale is a billing problem, not a capability problem.

Domain Specificity as the Underrated Insight

The memory condensation study contains a finding that deserves more attention than it gets in the abstract: the optimal condenser varies by scientific domain and task length.

This is not a null result dressed up as nuance. It is a concrete architectural implication. If you are building a coding agent for bioinformatics workflows, the optimal memory strategy is different from the one you would use for social science analysis tasks. The DiscoveryBench evaluation spans six scientific domains, and no single condensation strategy dominates across all of them.

Domain Matters More Than Architecture Evangelists Admit

The implication for teams building domain-specific research agents is direct: do not adopt a universal memory architecture and assume it generalizes. The sliding window that works adequately for short physics tasks may be the wrong choice for long-horizon chemistry tasks where intermediate tool outputs need to be preserved, not truncated.

The Configuration Space Nobody Is Mapping

Current agent frameworks treat memory condensation as a single dial. Most implementations default to something simple: a fixed context window, maybe a basic summarization step before overflow. The condensation study's finding that masking tool-call outputs saves tokens without hurting quality suggests that the information content of tool responses is unevenly distributed, and that structured elision (not summarization) is often the right move.

The practical recommendation is unglamorous: profile your agent's token consumption by task type before committing to a condensation strategy. The 8.6 percent savings from tool-call masking is not large enough to matter on a single task. Across a production workload, it is.

No single memory condensation strategy dominated across all six scientific domains in the DiscoveryBench evaluation. Domain-specific configuration is not optional, it is the correct default.

The Direction This Points

Taken together, these two papers sketch the outline of a problem that does not yet have a clean name.

Agent capability, measured by task completion, is becoming relatively stable within a benchmark regime. The models are good enough to solve the tasks. What varies is the resource cost of solving them, and that cost is shaped by factors the agent does not control: the structure of the codebase it is navigating, the memory architecture it is constrained to, the condensation strategy selected by whoever built the pipeline.

Environment Shapes Performance More Than Model Choice

This means the next layer of optimization is environmental, not model-level. You are not going to improve your agent's scientific discovery quality by switching condensation strategies. You might reduce your token bill by 8 to 94 percent depending on which one you pick.

You are not going to improve your coding agent's pass rate by cleaning up your codebase. You will reduce the number of times it revisits the same file by a third.

The agent quality ceiling is, for now, largely a function of the model. The agent cost floor is a function of the environment. Practitioners who recognize that distinction will be spending their optimization budget in the right place.

Three Signals Pointing the Same Direction

Memory condensation strategy doesn't move quality, it moves cost. Domain-specific configuration is the correct default, not a tuning afterthought.

Code cleanliness doesn't change pass rate, it changes traversal cost. The 34% reduction in file revisitations is an infrastructure metric wearing a software quality label.

Both findings share a structure: agent outputs are decoupled from the quality of surrounding infrastructure, but agent costs are tightly coupled to it. Optimize accordingly.

The Bottom Line

Task success rate is not the metric that responds to memory condensation or codebase cleanliness. Token cost and traversal efficiency are.
LLM-based condensers cost 24 to 94 percent more in tokens with no quality gain. Masking tool-call outputs is the only condensation strategy that saves tokens.
A 34 percent reduction in file revisitations from clean code is an infrastructure cost argument, not a code aesthetics argument.
Domain-specific memory configuration is not a tuning option, it is the baseline requirement for production research agents.
The optimization frontier for agents has shifted from capability to cost shape. Teams still optimizing for pass rate are solving last year's problem.

Sources: ArXiv CS.LG (May 20, 2026), ArXiv cs.SE (Software Engineering & Coding Agents) (May 20, 2026)