AI Agent Architecture: Production Patterns That Work
Picking the right LLM is the easy part. Discover the memory, routing, and failure-mode decisions that separate production AI agents from demos.
Summary
The agent stack is maturing fast, and the decisions that matter are no longer about which LLM to call. They are about memory architecture, tool governance, and the infrastructure underneath. This piece breaks down the production patterns that separate working agents from demo agents, and names the tradeoffs you will hit before you hit them.
The Four Decisions That Actually Define Your Agent
Most developers building with LangChain or n8n 2.0 spend their first week picking a model. That is the wrong place to spend it. The model is the easiest part to swap. The decisions that compound and hurt later are the ones you make about memory, tool surface, routing, and what happens when the agent is wrong.
Memory Type Is a Cost and Quality Decision Simultaneously
In n8n's native LangChain integration, memory configuration breaks into three practical options: Buffer (short conversations, full context window), Summary (long conversations collapsed by the LLM into compressed state), and Postgres-backed persistence (cross-session continuity). Each trades differently.
Buffer memory is cheap to reason about and expensive to run at scale because you are stuffing raw conversation history into every prompt. Summary memory collapses that cost but introduces a lossy compression step where the LLM summarizing the context can drop nuances that matter downstream. Postgres persistence solves the statelessness problem across user sessions but adds a retrieval hop and requires you to actually trust your schema design under concurrent load.
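To make the Buffer versus Summary cost difference concrete, here is a rough back-of-the-envelope sketch. It assumes every turn is the same size and that the summary stays at a fixed length; both are simplifications, and the function names are illustrative rather than part of n8n or LangChain.

```python
def buffer_prompt_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total prompt tokens over a conversation when full history is resent."""
    # On turn k the prompt carries all k turns so far, so the total grows quadratically.
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def summary_prompt_tokens(turns: int, tokens_per_turn: int, summary_tokens: int) -> int:
    """Total prompt tokens when history is replaced by a fixed-size summary."""
    # Each prompt carries the summary plus the current turn only, so the total grows linearly.
    # The extra LLM calls that produce the summary are NOT counted here.
    return turns * (summary_tokens + tokens_per_turn)


if __name__ == "__main__":
    print(buffer_prompt_tokens(50, 200))        # 255000
    print(summary_prompt_tokens(50, 200, 300))  # 25000
```

Under these assumptions a 50-turn conversation costs roughly ten times more prompt tokens with Buffer memory. The gap narrows once you add the summarization calls themselves, which this sketch deliberately leaves out.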
Pick Your Failure Mode, Not Your Feature
The decision is not which one is best. It is which failure mode you can afford. A customer support agent with short, bounded interactions should run Buffer. A research assistant that users return to across days needs Postgres-backed memory or it is cosplaying as stateful while being amnesiac.
Model Routing Is Where the Economics Live
One production pattern worth taking seriously: route simple queries to gpt-4o-mini and complex ones to Claude 3.5 Sonnet. One source, covering 40+ production systems, claims a reduction in agent costs of 60% or more. No independently verifiable methodology is provided for that number, so treat it as directionally interesting rather than reproducible as stated. The underlying principle, however, is structurally sound.
The question is who classifies "simple" versus "complex." If the classifier itself is an LLM call, you have added latency and cost to save latency and cost. If it is a heuristic (token count, keyword match, intent pattern), it will misclassify at the edges. The routing layer is not free, and the savings depend entirely on how accurately you can segment your query distribution before the expensive model sees it.
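A heuristic router of the kind described above might look like the following minimal sketch. The word-count threshold, the keyword list, and the model labels are all assumptions chosen for illustration; the labels are plain strings, not verified API model identifiers.

```python
import re

# Labels only; map these to real model identifiers in your own client code.
CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "claude-3.5-sonnet"

# Hypothetical keyword hints that a query needs multi-step reasoning.
COMPLEX_HINTS = re.compile(
    r"\b(compare|analy[sz]e|explain why|step by step|refactor|trade-?offs?)\b",
    re.IGNORECASE,
)


def route(query: str, max_simple_words: int = 40) -> str:
    """Pick a model label using length and keyword heuristics only."""
    if len(query.split()) > max_simple_words:
        return STRONG_MODEL  # long queries go to the stronger model
    if COMPLEX_HINTS.search(query):
        return STRONG_MODEL  # reasoning-style keywords go to the stronger model
    return CHEAP_MODEL       # everything else is treated as simple


if __name__ == "__main__":
    print(route("What are your opening hours?"))
    print(route("Compare these two contracts and explain why clause 4 conflicts"))
```

The edge cases show up quickly: a short question such as "Is this contract enforceable in Germany?" contains no hint keyword and is routed to the cheap model even though it may deserve the stronger one. Measuring that misclassification rate on your own query distribution is what determines whether the savings are real.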
Tool Governance Is the Problem Nobody Talks About Until Production
LangChain's ReAct architecture pattern hands the LLM a list of tools and lets it decide which one to call. This is elegant in demos. In production, it means your LLM is making authorization decisions.
An agent with access to web search, an email sender, a file reader, and a webhook has a significant blast radius. The LLM deciding to send an email based on a retrieved file is not a hypothetical edge case. It is the architecture. And the monitoring and control mechanisms sitting on top of that architecture in most current deployments are thin.
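One way to take the authorization decision away from the LLM is to put a deny-by-default gate between the agent loop and the tools. The sketch below is plain Python and does not use LangChain's API; ToolGate, ApprovalDenied, and the approver callback are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict, Set


class ApprovalDenied(Exception):
    """Raised when a side-effecting tool call is not approved."""


class ToolGate:
    """Sits between the agent loop and the tools; unknown tools are rejected."""

    def __init__(self, approver: Callable[[str, Dict[str, Any]], bool]) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._side_effecting: Set[str] = set()
        self._approver = approver  # e.g. a human-in-the-loop prompt or a policy check

    def register(self, name: str, fn: Callable[..., Any], side_effects: bool = False) -> None:
        self._tools[name] = fn
        if side_effects:
            self._side_effecting.add(name)

    def call(self, name: str, /, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise PermissionError(f"tool not provisioned: {name}")
        # Read-only tools pass through; side-effecting tools need explicit approval.
        if name in self._side_effecting and not self._approver(name, kwargs):
            raise ApprovalDenied(f"call to {name} was not approved")
        return self._tools[name](**kwargs)


if __name__ == "__main__":
    gate = ToolGate(approver=lambda name, args: False)  # deny every side effect
    gate.register("read_file", lambda path: f"contents of {path}")
    gate.register("send_email", lambda to, body: "sent", side_effects=True)
    print(gate.call("read_file", path="notes.txt"))
    try:
        gate.call("send_email", to="a@example.com", body="hi")
    except ApprovalDenied as exc:
        print(exc)
```

The design choice here is that the model can still propose any call, but whether a side-effecting call executes is decided by code you control.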
Immutable Audit Trails Are Not Optional at This Tool Surface
The langchain-agentlair library's approach of writing tamper-proof records of agent actions via an immutable audit trail points at a real gap. Most LangChain deployments log to stdout or to a database with no integrity guarantees. When an agent sends an email to the wrong recipient or hits an API it should not have touched, the forensic question is: what decision path led here? Without an append-only record of tool calls, inputs, and outputs, that question is unanswerable.
This is not a compliance concern first. It is a debugging concern. The audit trail is how you find the prompt that caused the bad tool call.
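As an illustration of what an append-only record with integrity guarantees can look like, here is a minimal hash-chained log. This is not the langchain-agentlair API, which is not shown in this piece; AuditLog and its methods are hypothetical. It is also only tamper-evident, not tamper-proof: it detects edits after the fact, and it lives in memory rather than in durable storage.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the one before it."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def append(self, tool: str, inputs: Dict[str, Any], output: Any) -> str:
        # inputs and output must be JSON-serializable in this sketch.
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        payload = {"seq": len(self._entries), "tool": tool, "inputs": inputs, "output": output}
        entry = {**payload, "prev": prev_hash, "hash": _digest(prev_hash, payload)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS
        for entry in self._entries:
            payload = {k: entry[k] for k in ("seq", "tool", "inputs", "output")}
            if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, payload):
                return False
            prev_hash = entry["hash"]
        return True


if __name__ == "__main__":
    log = AuditLog()
    log.append("read_file", {"path": "notes.txt"}, "contents")
    log.append("send_email", {"to": "a@example.com"}, "sent")
    print(log.verify())
```

Because each hash covers the previous one, changing an early entry invalidates every later entry, which is what makes the decision path reconstructable with some confidence.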
Behavior Design Beats Rule Lists, But Not for the Reason You Think
The argument that rules-based systems fail in dynamic environments is correct but often stated for the wrong reasons. The real problem with long rule lists in system prompts is not that agents ignore them. It is that rules interact in ways you did not anticipate, and the LLM has to resolve those conflicts at inference time with no visibility into how it resolved them.
The FORGE agent case illustrates the alternative: system prompt defining values and goals, a constrained tool set, feedback loops, and a memory architecture that records decisions and allows the agent to learn from mistakes. This is behavior-based design, and it is structurally closer to how you would design a reliable human employee's operating context than how you would write a firewall rule.
The Constraint Set Matters More Than the Goal Statement
Behavior-based design still requires discipline on the tool side. An agent with a clear goal but an unconstrained tool surface will find creative paths to that goal that violate your unstated assumptions. The constraint is not the rule list. The constraint is the tool set you provision and the feedback signal you give when outputs are wrong.
This is why the memory architecture that records decisions and errors is not a nice-to-have. It is the feedback loop. Without it, the agent has no signal to update behavior except the system prompt you edit manually.
You are not writing rules for your agent. You are designing the environment in which it decides. Get the tool surface and the feedback loop wrong, and the system prompt is irrelevant.
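The feedback loop described above can be sketched as a small store of decisions and outcomes whose failures are fed back into the agent's context. This is an assumed design for illustration, not the FORGE agent's actual implementation; Decision, DecisionMemory, and lessons are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Decision:
    tool: str        # which tool the agent chose
    rationale: str   # why it chose it, as stated by the agent
    ok: bool         # whether the outcome was judged correct
    note: str = ""   # reviewer or evaluator feedback when it was not


@dataclass
class DecisionMemory:
    records: List[Decision] = field(default_factory=list)

    def record(self, decision: Decision) -> None:
        self.records.append(decision)

    def lessons(self, tool: str, limit: int = 3) -> List[str]:
        """Most recent failures for a tool, phrased for inclusion in a prompt."""
        failures = [d for d in self.records if d.tool == tool and not d.ok]
        return [f"Previous failure with {d.tool}: {d.note}" for d in failures[-limit:]]


if __name__ == "__main__":
    memory = DecisionMemory()
    memory.record(Decision("send_email", "user asked for a follow-up", False,
                           "recipient was taken from an untrusted file"))
    print(memory.lessons("send_email"))
```

The point of the sketch is the direction of data flow: outcomes are written back, and the next decision about the same tool sees them without anyone editing the system prompt by hand.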
The Infrastructure Layer Is Becoming Its Own Stack
The Agent2Agent protocol, backed by over 150 organizations, and Anthropic's Model Context Protocol TypeScript SDK, which now has over 34,700 dependent projects, are converging on a picture of agents as networked services that discover, call, and pay for each other's capabilities.
This has immediate practical implications. If you are building an agent today that uses MCP to expose tools, the question of authentication and authorization on that tool surface is not a future problem. Other agents, not just your own UI, will be calling those tools. The attack surface for prompt injection through a compromised upstream agent is real and underexplored.
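When other agents can call your tools, authorization has to key on who is calling, not only on which tool is requested. The sketch below shows a deny-by-default scope check. It does not use the MCP SDK; the caller identifiers and scope table are hypothetical, and it assumes the caller identity has already been verified by your transport or auth layer.

```python
from typing import Dict, FrozenSet

# Hypothetical scope table. Caller ids must come from verified credentials,
# never from anything the calling agent says about itself in a prompt.
CALLER_SCOPES: Dict[str, FrozenSet[str]] = {
    "own-ui": frozenset({"search", "read_file", "send_email"}),
    "partner-agent": frozenset({"search"}),
}


def is_allowed(caller_id: str, tool: str) -> bool:
    """Deny by default: unknown callers and unlisted tools are rejected."""
    return tool in CALLER_SCOPES.get(caller_id, frozenset())


if __name__ == "__main__":
    print(is_allowed("own-ui", "send_email"))
    print(is_allowed("partner-agent", "send_email"))
    print(is_allowed("unknown-agent", "search"))
```

A check like this does not stop prompt injection from an upstream agent, but it bounds what a compromised caller can reach.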
Local Hardware Now Defines Your Agent's Ceiling
The Mac Mini and Ollama pattern for running Gemma 4 26B locally (requiring at least 32GB unified memory and manual GPU memory tuning via the iogpu.wired_limit_mb setting) points at a parallel trend: inference is moving to the edge for privacy and cost reasons, but the operational complexity of tuning local inference is non-trivial and currently poorly documented.
Local Inference Buys Privacy, Not Simplicity
Ollama's default settings are not optimized for Apple Silicon's unified memory architecture. Getting meaningful throughput from a 26B parameter model on a Mac Mini requires aggressive quantization at Q4 or below, which degrades output quality in ways that are workload-dependent. The tradeoff is real: you own the data, but you own the performance problem too.
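The pressure toward Q4 follows from simple arithmetic on weight size. The sketch below estimates the footprint of the weights alone as parameters times bits per weight. It ignores the KV cache, runtime overhead, the per-block scale data that real quantized formats add, and the memory the OS needs, so treat the results as lower bounds.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of model weights alone, in GB (10^9 bytes)."""
    # billions of parameters * bits per weight / 8 bits per byte = billions of bytes
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
        print(label, weight_footprint_gb(26, bits), "GB")
```

For a 26B parameter model this gives 52 GB at 16-bit, 26 GB at 8-bit, and 13 GB at 4-bit. On a 32GB machine the 16-bit weights do not fit at all, 8-bit leaves little room for anything else, and 4-bit is the first option with real headroom.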
Production Agent Decisions That Compound
1. Memory architecture determines your cost floor and your statefulness ceiling. Choose before you build, not after.
2. Tool surface determines your blast radius. Every tool you provision is an action the agent can take without asking.
3. Routing logic determines your economics. The classifier that sends queries to cheaper models has its own error rate.
4. Audit logging determines your forensic capability. Without it, you cannot explain what happened when something goes wrong.
The Bottom Line
- Memory type is a cost and quality decision made once but paid forever. Choose based on your failure tolerance, not your demo needs.
- Tool governance is the production problem most LangChain tutorials skip entirely.
- Behavior-based agent design works, but only if the tool set is constrained and the feedback loop is real.
- MCP and A2A are creating an agent-to-agent call graph that your current auth model was not designed for.
- Local inference on Apple Silicon is viable but requires manual tuning that Ollama does not handle by default.
Sources: DEV.to (April 4, 2026), Medium: LangChain (April 4, 2026), Dev.to: AI tag (April 4, 2026), Dev.to: LLM tag (April 4, 2026)