AI Infrastructure

AI Agent Infrastructure: The Plumbing Problem

Why do AI agents collapse in production? The model isn't the problem — the infrastructure is. Discover the supervisor patterns and fault-tolerance systems that fix it.

Philip

15 Apr 2026 — 5 min read

Most AI agents fail in production not because of the model but missing infrastructure. Learn what fault-tolerant orchestration, supervisor patterns, and credential security actually require.

Summary

The agent production gap is widening. Teams are shipping agents that work in demos and collapse in the real world, not because the models are wrong, but because the surrounding infrastructure is missing. This piece covers what that infrastructure actually looks like, what the credential problem is doing to security posture, and why the next benchmark you should care about tests robustness under fault injection, not accuracy on clean data.

The Plumbing Problem Nobody Wants to Debug

Six months of building agents. That is what it took one practitioner to discover that the failure modes had nothing to do with the model. Crashes, runaway loops, no visibility. The model was fine. The infrastructure around it was not.

This is the pattern. Teams optimize prompts, upgrade to a newer model version, tune retrieval. Then they deploy and the agent hits a malformed API response at step four of a seven-step plan-and-execute chain, and there is no recovery path. No rollback. No restart logic. The whole thing silently dies or, worse, silently continues in a degraded state.

Erlang Solved This in 1986

The supervisor pattern is not new. Erlang's OTP framework built fault-tolerant systems around process supervisors decades ago. The insight was simple: processes will fail, so build the system to expect failure and recover automatically. Three restart strategies cover most production cases. One-for-one restarts only the crashed process. One-for-all restarts everything under the supervisor when any single child fails. Rest-for-one restarts the failed process and every process started after it, preserving dependency order.

Nexus OS, one practitioner's custom orchestration layer, applies exactly this model to AI agents. They claim a 40% reduction in crashes and a 30% reduction in costs. No independent validation exists for those numbers, and the methodology for measuring "cost reduction" in this context is unclear. Costs compared to what baseline? Over what traffic volume? The numbers are plausible but not trustworthy without more detail.

Sagas Turn Failure Into Structured Rollback

The saga pattern on top of supervisors is where things get genuinely useful. A saga is a sequence of steps where each step has a corresponding compensation action. If step four fails, the system executes the compensation actions for steps one through three in reverse order, leaving the world in a consistent state. This is distributed transaction logic applied to agentic workflows, and it matters most when your agent is writing to external systems, not just reading from them.

The real failure mode in production agents is not hallucination. It is the absence of compensation logic when a multi-step transaction partially completes.

The Credential Surface Nobody Is Auditing

Every agent integration point is an identity. An LLM platform call requires credentials. A database query requires credentials. A cloud resource access requires credentials. A single agent that touches five services in one workflow carries five distinct credential relationships, and most teams are managing this the same way they managed service credentials in 2015: hardcoded, stored in plain text, shared across systems.

29 million leaked secrets were reported across software projects in 2025. AI agents exacerbate this problem structurally, not just at scale. The issue is that agent code is often written fast, iterated fast, and the credential management discipline that exists in core platform code does not always transfer to agent scaffolding written by an ML engineer at 3am.

Every Integration Point Is an Attack Surface

The correct mental model is: each tool your agent can call is a capability that can be hijacked. Fine-grained authorization, the approach WorkOS is pushing with their FGA product, treats this seriously. The idea is to define precise policies governing what an agent can do, not just whether it can authenticate. Authentication proves identity. Authorization constrains behavior. Most current agent deployments have the first and are missing the second.

The practical implication: if your agent can read a database, can it also write? Can it write to tables it should never touch? If you cannot answer these questions by looking at your authorization layer, your security posture depends entirely on the agent behaving correctly. That is not a security posture. That is optimism.

Skipping These Practices Will Burn You

Rotating credentials regularly, using vault-based secret storage, and scoping permissions to the minimum required surface are not new practices. They are just practices that get skipped when the team is moving fast on agent features.

What Good Evaluation Actually Looks Like

OccuBench is a benchmark worth paying attention to, not because it covers 100 professional task scenarios across 65 specialized domains, but because of how it evaluates. Most agent benchmarks measure task completion on clean inputs. OccuBench injects faults.

Three fault types: explicit errors that the agent can detect directly, implicit data degradation that the agent must independently notice, and mixed faults combining both. The finding that matters is this: implicit faults are substantially harder for agents to handle than explicit errors. When a tool returns a malformed result that looks plausible, agents fail to detect it and continue building on a corrupted foundation. When a tool returns an obvious error code, agents handle it correctly.

Agents that score well on clean benchmarks can still be systematically fooled by inputs that are wrong in ways that look right.

Implicit Faults Break Agents Before Anyone Notices

This is not an academic concern. Production environments produce implicit faults constantly. An API that starts returning stale cached data. A retrieval system that returns semantically similar but contextually wrong chunks. A database query that returns empty results when the correct answer is "no records exist" versus "the query itself is broken." Agents trained and evaluated on clean pipelines have no learned behavior for detecting these cases.

GPT-5.2 Gains 27.5 Points from Reasoning Effort

The OccuBench results show that larger models and higher reasoning effort consistently improve performance across all fault types. GPT-5.2 improves by 27.5 points moving from minimal to maximum reasoning effort. This is a peer-reviewed benchmark result, not a vendor claim, and it changes how you should think about compute allocation for agentic tasks.

For high-stakes professional workflows, defaulting to maximum reasoning effort is not wasteful. It is the correct tradeoff. The benchmark also confirms what practitioners already suspect: no single model dominates across all industry categories. Specialization still matters.

On OccuBench, no single model leads across all 10 industry categories. Domain-specific evaluation should be a requirement before selecting a model for any professional vertical.

Context Window Management Is Becoming Infrastructure

Context Surgeon is a small tool with a large implication. The standard approach to context window management is auto-compaction: wait until the window fills, then compact. The problem is that stale tool results from early in a conversation remain in the context window and degrade response quality long before the window is full.

Context Surgeon enables agents to actively edit their own context window, removing stale tool results rather than waiting for the compaction threshold. The architectural implication is that context window state is now something agents can reason about and manage, not just consume.

Frameworks Hide The Context Rot Problem

Building agents without frameworks, as one practitioner documented in detail, surfaces exactly this kind of problem. LangChain and similar frameworks abstract away context management, which means format drift and corpus gaps accumulate invisibly. Building from scratch forces explicit decisions about what the agent sees and when.

The tradeoff is real: custom implementations are slower to build and require more debugging discipline. The payoff is that you understand every failure mode instead of debugging an abstraction layer you do not control.

The Bottom Line

Supervisor patterns from Erlang OTP apply directly to agent orchestration; implement saga compensation logic for any agent that writes to external systems
29 million leaked secrets in 2025 is a credential management failure, not a model failure; audit your agent's authorization surface today, not after an incident
OccuBench's fault injection methodology is the right way to evaluate production readiness; clean-input benchmarks are not sufficient
GPT-5.2 gains 27.5 points from increased reasoning effort on professional tasks; compute allocation for agentic workloads deserves explicit architecture decisions
Context window state is infrastructure; actively managing what the agent sees is not a prompt engineering trick, it is a reliability requirement

Sources: Medium: AI Agents (April 14, 2026), Medium: LLM (April 14, 2026), DEV.to (April 14, 2026), NewsAPI (April 14, 2026), ArXiv cs.CL (NLP & Language Models) (April 14, 2026)

LangChain + Qdrant RAG: Where Pipelines Break

CoMIC: Cloud-Edge Memory for LLM Agents

He Hit the Same Wall Every Time. So He Removed It.

LangGraph 1.2.3: RemoteGraph's Streaming Shift