
LLM Agent Reliability: Strip Context Down

What do a modular agent toolkit and an academic training method share? Both argue the prompt can't be trusted alone. Here's what that means for agent design.

Philip

08 May 2026 — 6 min read

Two unconnected engineering efforts, agent-stack and constant-context skill learning, converge on one insight: remove context to gain reliability.

Summary

Two separate engineering efforts, one in production tooling and one in academic research, are converging on the same architectural instinct: strip context down to its functional minimum. The implication for how we build and train agents is more radical than either project individually suggests.

The conversation about agent reliability has been dominated by orchestration frameworks, memory backends, and tool registries. That framing is wrong, or at least incomplete. The more interesting signal emerging from the current generation of agent engineering is not about adding capability. It is about what you can afford to remove.

Two pieces of work, arriving from completely different directions, are quietly making the same argument.

The Convergence Nobody Named Yet

The agent-stack project ships six micro-libraries: AgentFit for context-window fitting, AgentGuard for network egress allowlisting, AgentSnap for snapshot tests, AgentVet for tool-argument validation, AgentCast for structured-output retry, and AgentBudget for token and dollar caps. Each is zero-dependency, under 500 lines of code, and published independently across npm, PyPI, and the MCP registry.

The constant-context skill learning paper tackles a different problem with a different approach: instead of managing what goes into the prompt at runtime, it moves procedural knowledge out of the prompt entirely and into model weights. A deterministic tracker renders a compact state block from task progress, and the model conditions on the current observation plus that compact state. Prompt tokens per turn drop by 2x to 7x across the ALFWorld, WebShop, and SciWorld benchmarks.
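To make the "deterministic tracker plus compact state block" idea concrete, here is a minimal sketch. It is not the paper's implementation: the class name TaskTracker, its fields, the build_prompt helper, and the state-block format are all assumptions chosen for illustration. The point it shows is that the rendered state stays roughly constant in size however long the episode runs.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskTracker:
    """Hypothetical tracker: plain bookkeeping, no model calls."""

    goal: str
    completed: list[str] = field(default_factory=list)
    inventory: set[str] = field(default_factory=set)
    step: int = 0

    def update(self, action: str, picked_up: Optional[str] = None) -> None:
        # Deterministic: the same sequence of updates always yields the same state.
        self.step += 1
        self.completed.append(action)
        if picked_up:
            self.inventory.add(picked_up)

    def render(self, keep_last: int = 3) -> str:
        # Only the most recent actions are rendered, so the block does not
        # grow with episode length the way a full history dump would.
        recent = self.completed[-keep_last:]
        return (
            f"[STATE] goal={self.goal} | step={self.step} | "
            f"holding={sorted(self.inventory)} | recent={recent}"
        )


def build_prompt(tracker: TaskTracker, observation: str) -> str:
    # The model sees the compact state plus the current observation only.
    return f"{tracker.render()}\n[OBS] {observation}\n[ACTION]"
```

In this sketch the procedural knowledge (what to do next given the state) is deliberately absent from the prompt; under the paper's framing that part is what fine-tuning is supposed to supply.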

The Prompt Cannot Be Trusted Alone

These two projects share no authors, no codebase, no stated connection. But they share a premise: the prompt is not a reliable substrate for agent behavior. Reliability has to be enforced somewhere else.

Fat Prompts Are a Liability, Not a Feature

The industry spent 2024 treating context windows as free real estate. Longer context meant more task history, more examples, more tool definitions, more guardrails baked into the system prompt. The assumption was that more information in the context equals better agent behavior.

That assumption is now under pressure from two directions simultaneously. The constant-context paper shows that Qwen3-8B achieves 89.6% unseen success on ALFWorld using a compact state block rather than a full history dump. The agent-stack project encodes egress control, output validation, and budget enforcement as runtime primitives rather than prompt instructions.

Prompts Are Wishes, Not Commands

Both are responding to the same observed failure mode: prompts are soft. You can write "do not make external HTTP calls" in a system prompt and the model will sometimes make external HTTP calls. You can write "always return valid JSON" and the model will sometimes return broken JSON. Context-window fitting done by prompt instruction fails the same way under token pressure. These are not edge cases in production. They are the norm.
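The context-fitting case is the easiest of these to show in code. The sketch below trims history before the call is made and fails loudly if even the system prompt does not fit. It is an illustration only: fit_messages and count_tokens are hypothetical names, not AgentFit's API, and whitespace word counting stands in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: a whitespace word count approximates a real tokenizer.
    return len(text.split())


def fit_messages(system: str, history: list[str], max_tokens: int) -> list[str]:
    """Keep the system prompt plus as many of the newest messages as fit."""
    budget = max_tokens - count_tokens(system)
    if budget < 0:
        # Raise before the call instead of letting truncation drop instructions.
        raise ValueError("system prompt alone exceeds the context budget")

    kept: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # stop at the first message that does not fit
        kept.append(message)
        budget -= cost

    # Restore chronological order for the final prompt.
    return [system] + list(reversed(kept))
```

The design choice worth noting is that the system prompt is never a truncation candidate: the instructions survive, and older history is what gets dropped.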

The prompt is not a contract. Every invariant you encode only in natural language is an invariant that can be violated under distribution shift, context pressure, or model update.

What "Composable by Inclusion" Actually Means

The agent-stack's design philosophy is worth unpacking precisely because it is a deliberate rejection of the dominant framework model. You do not import a new programming model. You drop in a single primitive that enforces one invariant and nothing else.

AgentBudget caps token spend and dollar spend. AgentVet validates tool arguments against a schema before execution. AgentCast retries structured output generation until the output parses. These are not features of a platform. They are load-bearing walls you install into whatever stack you already have.
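As an illustration of what "retry until the output parses" looks like as a standalone primitive, here is a minimal sketch. The names generate_json and StructuredOutputError are hypothetical and are not taken from AgentCast; the model is assumed to be any callable that maps a prompt string to a reply string.

```python
import json
from typing import Callable


class StructuredOutputError(Exception):
    """Raised when no parseable output arrives within the retry limit."""


def generate_json(model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = model(current_prompt)
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            return parsed
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            # Feed the parse error back so the next attempt can correct it.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was not valid JSON ({err}). "
                "Reply with a single JSON object only."
            )
    raise StructuredOutputError(f"no valid JSON after {max_attempts} attempts")
```

The caller either gets a parsed object or a typed exception. There is no path where malformed text leaks downstream, which is the property a prompt instruction alone cannot provide.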

One Primitive Runs Everywhere Without Compromise

The three runtime forms, TypeScript, Python, and MCP server, are not an afterthought. They mean that a Python orchestration layer and a TypeScript frontend can enforce the same egress allowlist through a shared MCP server without coordinating codebases. That is a real interoperability primitive, not a marketing claim.

The Under-500-Lines Constraint Is an Architectural Statement

Keeping each library under 500 lines of code is not about minimalism for its own sake. It is a legibility guarantee. A senior engineer can read the entire implementation of AgentGuard in one sitting and understand exactly what it will and will not catch. That auditability matters in regulated environments and in any production system where a security or compliance team needs to sign off on behavior.
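To give a sense of how little code a readable egress check needs, here is a sketch of an allowlist guard. It is an assumption-laden illustration, not AgentGuard's source: the class and exception names are hypothetical, and the policy shown (HTTPS only, exact hostname match) is one reasonable choice among several.

```python
from urllib.parse import urlsplit


class EgressBlocked(Exception):
    """Raised when an outbound URL is not on the allowlist."""


class EgressAllowlist:
    def __init__(self, allowed_hosts: set[str]) -> None:
        self.allowed_hosts = {host.lower() for host in allowed_hosts}

    def check(self, url: str) -> str:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        # Exact hostname match: "api.example.com.evil.test" does not pass
        # just because it starts with an allowed name.
        if parts.scheme != "https" or host not in self.allowed_hosts:
            raise EgressBlocked(f"blocked outbound call to {url!r}")
        return url
```

A reviewer can see at a glance what this will and will not catch: it does not resolve DNS, follow redirects, or inspect IP addresses, so those would need separate handling in a real deployment.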

Large frameworks resist this kind of audit. Their behavior emerges from interactions between abstractions. The agent-stack primitives have no hidden interactions because they have no dependencies on each other. You can reason about what each one does in isolation.

The Six Enforcement Layers

AgentFit keeps prompt assembly inside the context window before the call is made, not after truncation silently corrupts it

AgentGuard blocks outbound network calls to any destination not on an explicit allowlist, enforced in the runtime, not the system prompt

AgentSnap runs snapshot tests against agent outputs so regressions in behavior are caught before deployment

AgentVet validates tool arguments against a schema at call time, preventing malformed inputs from reaching external APIs

AgentCast retries structured output generation with error feedback until the output is parseable, or raises after a configurable limit

AgentBudget enforces hard token and dollar caps per run, making cost overruns a caught exception rather than a billing surprise
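The last item in the list, a hard cap that turns overruns into exceptions, can be sketched in a few lines. The RunBudget class, its parameters, and the flat per-token price are assumptions for illustration and do not reflect AgentBudget's actual interface.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push the run past its token or dollar cap."""


class RunBudget:
    def __init__(self, max_tokens: int, max_dollars: float, dollars_per_1k_tokens: float) -> None:
        self.max_tokens = max_tokens
        self.max_dollars = max_dollars
        self.rate = dollars_per_1k_tokens  # assumed flat price per 1,000 tokens
        self.tokens_used = 0

    @property
    def dollars_used(self) -> float:
        return self.tokens_used / 1000 * self.rate

    def charge(self, tokens: int) -> None:
        projected_tokens = self.tokens_used + tokens
        projected_dollars = projected_tokens / 1000 * self.rate
        # Check before committing, so a rejected charge leaves the totals unchanged.
        if projected_tokens > self.max_tokens or projected_dollars > self.max_dollars:
            raise BudgetExceeded(
                f"charge of {tokens} tokens would reach {projected_tokens} tokens "
                f"(${projected_dollars:.4f})"
            )
        self.tokens_used = projected_tokens
```

Calling charge before each model call, with the estimated token count, means the orchestration loop sees the overrun as an ordinary exception it can handle, log, or escalate.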

The Weights vs. Runtime Tradeoff

The constant-context paper opens a different frontier. Moving procedural knowledge into fine-tuned weights via step-level supervised fine-tuning followed by online reinforcement learning is a fundamentally different kind of reliability investment. It is not a runtime check. It is a training-time commitment.

The tradeoff is sharp. Baking skills into weights means faster inference, smaller prompts, and behavior that is harder to override accidentally. It also means the skill is frozen until you retrain. A runtime primitive like AgentVet can be updated by changing a schema file. A weight-encoded skill requires a new fine-tuning run.
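The "change a schema file" half of that tradeoff is worth seeing in code. The sketch below validates tool arguments against a schema loaded from JSON text. The schema format (argument name mapped to a type name) is a simplified assumption for illustration, not JSON Schema and not AgentVet's format, and the function names are hypothetical.

```python
import json

# Assumed mapping from type names in the schema file to Python types.
TYPES = {"str": str, "int": int, "float": float, "bool": bool}


class InvalidToolArgs(Exception):
    """Raised when tool arguments do not match the schema."""


def load_schema(text: str) -> dict[str, type]:
    # Assumed file format: {"arg_name": "type_name", ...}
    return {name: TYPES[type_name] for name, type_name in json.loads(text).items()}


def validate_args(args: dict, schema: dict[str, type]) -> dict:
    missing = schema.keys() - args.keys()
    extra = args.keys() - schema.keys()
    if missing or extra:
        raise InvalidToolArgs(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    for name, expected in schema.items():
        value = args[name]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            raise InvalidToolArgs(f"{name!r} must be {expected.__name__}")
    return args
```

Tightening the contract here means editing the JSON text and reloading it. The equivalent change to a weight-encoded skill would need new training data and a new fine-tuning run.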

For task families that are stable and high-frequency, weight-encoded skills will outperform prompt-encoded procedures on both cost and reliability. For tasks that change weekly, they will not. Knowing which is which is the actual design decision your team needs to make.

Small Models Already Prove The Tradeoff Works

The results reported for Qwen3-8B and Llama-3.1-8B are measured on published, peer-reviewed benchmarks against controlled ReAct prompting baselines, which makes them worth taking seriously rather than discounting as vendor claims. A 2x to 7x reduction in prompt tokens per turn is not a rounding error. At production scale that is a meaningful cost and latency reduction.
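A back-of-the-envelope calculation shows what that range means in dollars. Every workload number below is a hypothetical assumption chosen for round arithmetic (4,000 baseline prompt tokens per turn, 20 turns per episode, 10,000 episodes per day, $0.50 per million prompt tokens); only the 2x and 7x factors come from the paper's reported range. It also simplifies by treating the baseline as a fixed per-turn size, whereas a history-in-context prompt typically grows over an episode.

```python
def daily_prompt_cost(
    tokens_per_turn: float,
    turns_per_episode: int,
    episodes_per_day: int,
    dollars_per_million_tokens: float,
) -> float:
    total_tokens = tokens_per_turn * turns_per_episode * episodes_per_day
    return total_tokens / 1_000_000 * dollars_per_million_tokens


# Hypothetical workload; none of these figures come from the source.
BASELINE_TOKENS_PER_TURN = 4000
TURNS = 20
EPISODES = 10_000
PRICE = 0.50  # dollars per million prompt tokens

baseline = daily_prompt_cost(BASELINE_TOKENS_PER_TURN, TURNS, EPISODES, PRICE)
at_2x = daily_prompt_cost(BASELINE_TOKENS_PER_TURN / 2, TURNS, EPISODES, PRICE)
at_7x = daily_prompt_cost(BASELINE_TOKENS_PER_TURN / 7, TURNS, EPISODES, PRICE)

print(f"baseline: ${baseline:.2f}/day")  # baseline: $400.00/day
print(f"2x fewer: ${at_2x:.2f}/day")     # 2x fewer: $200.00/day
print(f"7x fewer: ${at_7x:.2f}/day")     # 7x fewer: $57.14/day
```

Under these assumed numbers the prompt-side bill falls from $400 a day to somewhere between $200 and about $57, before counting the one-time cost of fine-tuning.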

The Boundary Between Trainable and Enforceable Is Not Fixed

The practical implication is that agent reliability is not a single-layer problem. Runtime enforcement handles what you cannot bake into weights: cost caps, egress control, schema validation. Weight-encoded skills handle what is too expensive or too fragile to re-derive from context on every turn.

The teams building serious production agents will eventually maintain both layers. The mistake to avoid is treating them as alternatives. They are not competing approaches. They address different failure modes at different points in the execution stack.

The prompt was never the right place to enforce invariants. We just did not have the tooling or the training methods to put them anywhere else.

Where This Leaves Your Stack

If you are running agents in production today, the immediate action is to audit which of your behavioral invariants live only in natural language inside a system prompt. Each one is a latent reliability bug. The agent-stack primitives give you a migration path that does not require a framework rewrite.

If you are training smaller models for agentic tasks, the constant-context framing should change how you scope your training data. History compression into a compact state block is not just a prompt engineering trick. It is a curriculum design question: what does the model need to learn to track, and what can the deterministic tracker handle externally?

Reliability Is Leaving Prompts For Good

The direction of travel is clear. Reliability is moving out of prompts and into two destinations: the runtime enforcement layer and the weights. Both destinations are more auditable, more composable, and more resilient to context pressure than natural language instructions in a context window.

Frameworks that continue treating the prompt as the primary reliability surface are accumulating technical debt that will become visible the moment a model update shifts the distribution underneath them.

The Bottom Line

Prompt-only reliability is a fragile architecture, and the tooling to replace it now exists.
The agent-stack's six primitives give you runtime enforcement without framework lock-in. Drop them in one at a time.
Constant-context skill learning shows weight-encoded procedures outperforming history-in-context on stable task families, with measurable token savings.
The design decision your team actually needs to make is which invariants belong in the runtime layer and which belong in the weights.
Teams still encoding egress control and output format requirements purely in system prompts are one model update away from a production incident.

Sources: DEV.to (May 8, 2026), ArXiv CS.AI (May 8, 2026)