LLM Tool Validation: What Zod Can't Fix

Zod and Pydantic catch shape errors in LLM tool calls, not semantic ones. What does runtime validation actually solve in agentic systems?

Runtime schema validation catches malformed outputs, but semantic errors slip through unchallenged. Here's what Zod actually buys you in agentic systems.

Summary

Tool schemas and runtime validation are being sold as the solution to unreliable LLM tool calls. They help, but they displace the actual problem rather than solve it. This piece interrogates what Zod-based validation actually buys you, where it fails silently, and what the research on agentic reliability reveals that the developer tooling community is not saying out loud.

The Validation Layer Is Not Your Safety Net

There is a pattern in how practitioners respond to flaky LLM tool calls: add more structure. Define tighter schemas. Validate at the boundary. Use Zod, use Pydantic, use JSON Schema with strict mode. The logic is sound on its face. The model outputs something almost valid, your system breaks, so you add a parser that catches the breakage earlier.

This is a real improvement. Treating model output as untrusted input is the correct mental model. The safeParse approach, where you check validity and handle failure explicitly rather than letting malformed data propagate downstream, is strictly better than the alternative. If you are shipping tool-using agents and you are not doing this, stop and fix that first.

Validation Catches Shape, Not Meaning

But the framing around this pattern has quietly overpromised. Validation at the schema boundary catches shape errors. It does not catch semantic errors. When a model passes a syntactically valid SearchDocsInput with a logically incoherent query, or sets includeDrafts: false in a context where the user explicitly asked for drafts, Zod passes that call through without complaint. The schema has no access to intent.
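To make the distinction concrete, here is a minimal sketch of a call that is shape-valid and still wrong. The field names follow the SearchDocsInput example above; the type guard is a hand-rolled stand-in for a schema library, and contradictsUserIntent is a hypothetical keyword heuristic, not something Zod or Pydantic provides.

```typescript
// Field names follow the SearchDocsInput example in the text; everything else is hypothetical.
interface SearchDocsInput {
  query: string;
  includeDrafts: boolean;
}

// Shape check only: the right keys holding the right primitive types.
function isSearchDocsInput(value: unknown): value is SearchDocsInput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v["query"] === "string" && typeof v["includeDrafts"] === "boolean";
}

// Hypothetical intent check: a crude keyword heuristic standing in for
// whatever intent signal a real system has available.
function contradictsUserIntent(userMessage: string, call: SearchDocsInput): boolean {
  const askedForDrafts = /\bdrafts?\b/i.test(userMessage);
  return askedForDrafts && !call.includeDrafts;
}

const userMessage = "Find the onboarding guide, including drafts.";
const modelCall: unknown = { query: "onboarding guide", includeDrafts: false };

// The shape check passes even though the call contradicts the request.
const shapeOk = isSearchDocsInput(modelCall);
const semanticProblem =
  isSearchDocsInput(modelCall) && contradictsUserIntent(userMessage, modelCall);

console.log({ shapeOk, semanticProblem });
```

The point of the sketch is that the second check needs the user message as an input. A schema only ever sees the call.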

What Validation Actually Solves

The concrete case is narrow: a model hallucinates a field name, passes an integer where a string was expected, or omits a required parameter. Runtime schema validation catches this at the tool boundary, prevents the downstream function from receiving garbage, and gives the agent a structured error it can potentially recover from. That is the full scope of the win.
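The three cases above can be sketched without any library. The function below is a dependency-free stand-in that loosely mirrors the success/failure result style of a safeParse call; it is not Zod's implementation, and the ParseResult and Issue types are assumptions made for illustration.

```typescript
type Issue = { path: string; message: string };
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; issues: Issue[] };

interface SearchDocsInput {
  query: string;
  includeDrafts: boolean;
}

// Hand-rolled boundary check returning a structured result instead of throwing.
function safeParseSearchDocs(raw: unknown): ParseResult<SearchDocsInput> {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return { success: false, issues: [{ path: "", message: "expected an object" }] };
  }
  const input = raw as Record<string, unknown>;
  const issues: Issue[] = [];

  // Hallucinated field names: reject keys the tool does not define.
  for (const key of Object.keys(input)) {
    if (key !== "query" && key !== "includeDrafts") {
      issues.push({ path: key, message: "unrecognized field" });
    }
  }

  // Omitted or wrongly typed parameters.
  const query = input["query"];
  if (query === undefined) {
    issues.push({ path: "query", message: "required" });
  } else if (typeof query !== "string") {
    issues.push({ path: "query", message: "expected string, got " + typeof query });
  }

  const includeDrafts = input["includeDrafts"];
  if (includeDrafts === undefined) {
    issues.push({ path: "includeDrafts", message: "required" });
  } else if (typeof includeDrafts !== "boolean") {
    issues.push({ path: "includeDrafts", message: "expected boolean, got " + typeof includeDrafts });
  }

  if (issues.length > 0 || typeof query !== "string" || typeof includeDrafts !== "boolean") {
    return { success: false, issues };
  }
  return { success: true, data: { query, includeDrafts } };
}

const examples: unknown[] = [
  { q: "onboarding", includeDrafts: true }, // hallucinated field name
  { query: 42, includeDrafts: true },       // integer where a string was expected
  { query: "onboarding" },                  // omitted required parameter
  { query: "onboarding", includeDrafts: true },
];
for (const example of examples) {
  console.log(JSON.stringify(safeParseSearchDocs(example)));
}
```

The issues array is what makes recovery possible: it can be passed back to the model as a tool error rather than surfacing as an exception somewhere downstream.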

What it does not address is the broader reliability problem in agentic systems, and the gap between the two is where practitioners get burned. They add Zod, their integration tests pass, and they ship. Then production surfaces three things the schema never flagged: tool call sequences that are individually valid but collectively incoherent, agents that retry indefinitely with the same flawed reasoning, and confident-looking outputs that are simply wrong.
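One cheap mitigation for the retry case is to stop when the agent proposes a call it has already tried. The sketch below is one possible guard, not a pattern taken from any of the cited work; ToolCall, Outcome, and runWithRetryGuard are hypothetical names, and the model and tool are stubbed as plain functions.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> };
type Outcome = { ok: boolean; note: string };
type RunStatus = "ok" | "repeated_call" | "exhausted";

// Serialize with sorted top-level keys so argument order does not matter.
function fingerprint(call: ToolCall): string {
  const keys = Object.keys(call.args).sort();
  return call.name + ":" + JSON.stringify(keys.map((k) => [k, call.args[k]]));
}

// Runs up to maxAttempts calls and bails out early if the same call reappears,
// since a schema-valid call that already failed will pass validation again.
function runWithRetryGuard(
  nextCall: (attempt: number, lastNote: string | null) => ToolCall,
  execute: (call: ToolCall) => Outcome,
  maxAttempts: number,
): { status: RunStatus; attempts: number } {
  const seen: Record<string, boolean> = {};
  let lastNote: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const call = nextCall(attempt, lastNote);
    const key = fingerprint(call);
    if (seen[key] === true) return { status: "repeated_call", attempts: attempt };
    seen[key] = true;
    const outcome = execute(call);
    if (outcome.ok) return { status: "ok", attempts: attempt };
    lastNote = outcome.note;
  }
  return { status: "exhausted", attempts: maxAttempts };
}

// A stub "model" that keeps proposing the same shape-valid call.
const stuckModel = (): ToolCall => ({
  name: "searchDocs",
  args: { query: "onbording guide", includeDrafts: false },
});
// A stub tool that never finds anything for that query.
const emptySearch = (): Outcome => ({ ok: false, note: "0 results" });

const stuckRun = runWithRetryGuard(stuckModel, emptySearch, 5);
console.log(stuckRun);
```

This only detects exact repeats. An agent that varies its arguments while repeating the same mistaken plan would still run until the attempt cap.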

Vendors Profit When Problems Look Already Solved

The developer tooling community has a financial incentive to make reliability look like a solved problem. Schema validation is marketable. It fits in a tutorial. It generates library adoption. The harder parts do not.

Runtime schema validation catches shape errors at the tool boundary. It has no mechanism for catching semantic errors. These are different failure modes and they require different solutions.

What the Research Is Actually Measuring

The SmellBench evaluation on architectural code smell repair puts numbers on a problem that schema validation cannot touch. The best-performing agent configuration across GPT, Claude, Gemini, and Mistral achieves a 47.7% resolution rate on hard-severity architectural smells in scikit-learn. That sounds usable until you read the other number: the same agent introduces 140 new smells while making those repairs.

This is not a schema problem. Every tool call that agent made was presumably valid. The inputs were well-formed. The outputs compiled. The issue is that the agent lacks cross-module architectural understanding. It repairs locally and breaks globally. Zod would have passed every single one of those calls.

Every Fix Breeds New Problems

The SmellBench scoring methodology is worth understanding because it models something that developer tooling benchmarks rarely do: net codebase impact. It separately evaluates repair effectiveness, false positive identification, and total codebase change. An agent that resolves 30 smells but introduces 50 new ones scores worse than an agent that resolves 15 and introduces none. This is the right framing for production systems, and almost no tool schema library asks the question.
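The comparison above is easy to work through in code. This is a deliberately simplified sketch: it scores a run as smells resolved minus smells introduced, which is an assumption made for illustration and not SmellBench's actual formula (the benchmark also weighs false positive identification and total codebase change).

```typescript
interface RepairRun {
  label: string;
  resolved: number;   // smells the agent fixed
  introduced: number; // new smells the agent created while fixing
}

// Simplified net impact: positive means the codebase ended up better off.
function netImpact(run: RepairRun): number {
  return run.resolved - run.introduced;
}

// Rank runs from best to worst under a given scoring function.
function rankBy(runs: RepairRun[], score: (run: RepairRun) => number): string[] {
  return runs
    .slice()
    .sort((a, b) => score(b) - score(a))
    .map((run) => run.label);
}

const runs: RepairRun[] = [
  { label: "aggressive", resolved: 30, introduced: 50 },
  { label: "conservative", resolved: 15, introduced: 0 },
];

// Counting only completed repairs favors the aggressive agent.
const byCompletion = rankBy(runs, (run) => run.resolved);
// Counting net impact reverses the ranking: -20 versus +15.
const byNetImpact = rankBy(runs, netImpact);

console.log({ byCompletion, byNetImpact });
```

The two rankings disagree on the same data, which is the whole argument for measuring net impact rather than task completion.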

The 47.7% Number Needs a Denominator

When the best available agent configuration resolves fewer than half of hard-severity architectural smells while degrading the codebase in other dimensions, the honest interpretation is that current LLMs do not have adequate architectural reasoning for cross-module refactoring at production scale. This is not a tuning problem. It is not a prompt engineering problem. It is a capability gap.

Practitioners using agents for code tasks should treat this number as a prior. If you are deploying coding agents on anything other than well-scoped, localized changes, you are accepting a meaningful probability that the agent will ship something that passes tests but degrades long-term maintainability. Validation at the tool boundary does not change that probability.

The agent that resolves 47.7% of hard architectural smells while introducing 140 new ones is not a reliability win. It is a demonstration that valid tool calls are a necessary but nowhere near sufficient condition for system correctness.

Compute Allocation Is the Problem No One Is Validating

DIAL, the Direction-Informed Adaptive Learning approach for test-time compute in LLM agents, surfaces a different class of problem, one that schema validation does not touch. The core finding is that fixed-direction gates for adaptive computation are unstable because the same signal can mean opposite things depending on the environment and the backbone model. A state feature that indicates "needs more compute" in one setting can indicate "compute here will hurt" in another.

This matters for practitioners building agents because test-time compute allocation is already a real decision in systems using chain-of-thought, self-consistency sampling, or iterative tool use. The naive approach is to apply extra computation uniformly or based on a fixed heuristic. DIAL's result, that per-environment, per-backbone direction learning outperforms fixed baselines across six environments and three backbones, suggests the naive approach is leaving significant performance on the table.
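The idea of a direction that flips between settings can be illustrated with a toy gate. The sketch below is not DIAL's algorithm: it simply takes the sign of the covariance between a state feature and the observed gain from extra compute, separately for each environment and backbone pair. The environment names, the backbone name, and the logged numbers are all made up.

```typescript
interface Sample {
  env: string;
  backbone: string;
  feature: number; // hypothetical state feature, e.g. an uncertainty score
  gain: number;    // observed benefit of spending extra compute on this step
}

interface Gate {
  direction: 1 | -1; // +1: high feature means "spend more"; -1: the opposite
  threshold: number;
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Learn one gate per (environment, backbone) pair from logged outcomes.
function learnGates(samples: Sample[]): Record<string, Gate> {
  const groups: Record<string, Sample[]> = {};
  for (const s of samples) {
    const key = s.env + "|" + s.backbone;
    const group = groups[key] || [];
    group.push(s);
    groups[key] = group;
  }
  const gates: Record<string, Gate> = {};
  for (const key of Object.keys(groups)) {
    const group = groups[key] || [];
    const meanFeature = mean(group.map((s) => s.feature));
    const meanGain = mean(group.map((s) => s.gain));
    let covariance = 0;
    for (const s of group) covariance += (s.feature - meanFeature) * (s.gain - meanGain);
    gates[key] = { direction: covariance >= 0 ? 1 : -1, threshold: meanFeature };
  }
  return gates;
}

function shouldSpendExtraCompute(gate: Gate, feature: number): boolean {
  return gate.direction * (feature - gate.threshold) > 0;
}

// Made-up logs in which the same feature points in opposite directions.
const logged: Sample[] = [
  { env: "web-nav", backbone: "model-a", feature: 0.2, gain: -0.1 },
  { env: "web-nav", backbone: "model-a", feature: 0.8, gain: 0.3 },
  { env: "text-game", backbone: "model-a", feature: 0.2, gain: 0.3 },
  { env: "text-game", backbone: "model-a", feature: 0.8, gain: -0.1 },
];
const gates = learnGates(logged);
console.log(gates);
```

A fixed gate that always treats a high feature value as "spend more" would be right in the first toy environment and wrong in the second, which is the instability the paragraph above describes.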

Catching Bad Outputs Misses Half the Problem

The validation-first worldview tends to treat agent reliability as a problem of catching bad outputs. DIAL suggests a complementary framing: reliability is also a problem of allocating inference correctly so that the outputs are better before they reach the validation boundary. These are not competing approaches, but the developer community conversation is heavily weighted toward the former.

MedExAgent Shows What Structured Decision-Making Actually Requires

MedExAgent formalizes clinical diagnosis as a Partially Observable Markov Decision Process with three action types: questioning, ordering exams, and diagnosing. It is instructive less for the medical domain than for what it shows agents need in order to stay reliable under noisy conditions. Its two-stage training pipeline, supervised fine-tuning on synthetic conversations followed by DAPO optimization of a composite reward function, does work that no schema library can replicate.

The composite reward function is the key mechanism. It is not optimizing for valid outputs. It is optimizing for diagnostic accuracy, examination cost, and patient discomfort simultaneously. The agent learns to be parsimonious with tool calls because the reward structure penalizes unnecessary exams, not because a schema catches when it orders too many.
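A composite reward of this kind can be sketched as a weighted sum. The three terms (diagnostic accuracy, examination cost, patient discomfort) come from the description above; the linear form, the weights, and the per-exam numbers are assumptions for illustration and are not taken from MedExAgent.

```typescript
interface Episode {
  correctDiagnosis: boolean;
  examCosts: number[];        // cost of each exam the agent ordered
  discomfortScores: number[]; // patient discomfort caused by each exam
}

interface RewardWeights {
  accuracy: number;
  cost: number;
  discomfort: number;
}

function sum(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0);
}

// Assumed linear form: reward a correct diagnosis, penalize cost and discomfort.
function compositeReward(episode: Episode, weights: RewardWeights): number {
  const accuracyTerm = episode.correctDiagnosis ? 1 : 0;
  return (
    weights.accuracy * accuracyTerm -
    weights.cost * sum(episode.examCosts) -
    weights.discomfort * sum(episode.discomfortScores)
  );
}

// Hypothetical weights and episodes, chosen as integers to keep the arithmetic exact.
const weights: RewardWeights = { accuracy: 10, cost: 1, discomfort: 2 };

const parsimonious: Episode = { correctDiagnosis: true, examCosts: [1], discomfortScores: [1] };
const shotgun: Episode = { correctDiagnosis: true, examCosts: [1, 2, 3], discomfortScores: [1, 1, 2] };

// Both episodes reach the right diagnosis, but ordering every exam scores lower:
// 10 - 1 - 2 = 7 versus 10 - 6 - 8 = -4.
console.log(compositeReward(parsimonious, weights), compositeReward(shotgun, weights));
```

Every exam order in the second episode would pass a schema check. Only the reward distinguishes the two.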

Reward Shaping Builds Decisions, Not Just Outputs

This is the architecture that actually produces reliable behavior in complex environments. Schema validation is a guard rail on the output side. Reward-shaped training is an intervention on the decision-making side. If you are building agents for high-stakes domains and have implemented only the former, you have covered a small part of the reliability surface.

Warning: treating schema validation as your reliability strategy means you have defended against shape errors while leaving semantic errors, compute misallocation, and reward-misaligned decisions entirely unaddressed. All three will surface in production, and your schema will not flag any of them.

What Actually Needs to Change

Zod and Pydantic belong in every agent codebase. Calling safeParse at tool boundaries is not optional hygiene; it is table stakes. But the conversation that frames this as a reliability solution is doing practitioners a disservice.

The research points to a more honest taxonomy of failure modes. Shape errors are caught by schemas. Semantic errors, including locally valid but globally destructive actions of the kind SmellBench documents, require evaluation frameworks that measure net impact rather than task completion. Compute misallocation, the problem DIAL addresses, requires adaptive test-time strategies that are sensitive to the specific environment and backbone, not global heuristics. Decision quality under uncertainty, what MedExAgent demonstrates, requires reward-shaped training that internalizes cost functions, not just output filtering.
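The taxonomy can be written down as a small lookup table, which makes the coverage gap easy to query. The type and field names here are hypothetical; the mitigation text paraphrases the paragraph above.

```typescript
type FailureMode = "shape" | "semantic" | "computeMisallocation" | "decisionQuality";

interface Coverage {
  caughtBySchema: boolean;
  needs: string;
}

const failureModes: Record<FailureMode, Coverage> = {
  shape: {
    caughtBySchema: true,
    needs: "runtime schema validation at the tool boundary",
  },
  semantic: {
    caughtBySchema: false,
    needs: "evaluation that measures net impact rather than task completion",
  },
  computeMisallocation: {
    caughtBySchema: false,
    needs: "adaptive test-time strategies tuned per environment and backbone",
  },
  decisionQuality: {
    caughtBySchema: false,
    needs: "reward-shaped training that internalizes cost functions",
  },
};

// List the failure modes a validation-only strategy leaves open.
function uncoveredBySchema(): FailureMode[] {
  const all: FailureMode[] = ["shape", "semantic", "computeMisallocation", "decisionQuality"];
  return all.filter((mode) => !failureModes[mode].caughtBySchema);
}

console.log(uncoveredBySchema());
```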

Schema Libraries Solve a Real Problem, Not the Whole One

The developer tooling industry will keep shipping schema libraries because they solve a real problem, are easy to explain, and integrate cleanly into existing workflows. None of that makes them adequate.

The Bottom Line

  • Zod and runtime schema validation solve shape errors at tool boundaries. They do not address semantic errors, architectural reasoning failures, or decision quality under uncertainty.
  • The SmellBench result of 47.7% resolution with 140 new smells introduced is a ceiling number, not a baseline. Treat it as your prior for production coding agents on complex tasks.
  • DIAL's finding that compute direction is environment-specific and backbone-specific means your adaptive inference heuristics are likely miscalibrated. This is not a schema problem.
  • MedExAgent's composite reward structure shows that parsimonious, cost-aware tool use requires training-time incentives, not runtime validation.
  • If your agent reliability strategy is primarily a validation strategy, you have addressed the easiest failure mode and left the rest to production.

Sources: DEV.to (May 11, 2026), arXiv cs.SE (Software Engineering & Coding Agents) (May 11, 2026), arXiv cs.CL (NLP & Language Models) (May 11, 2026), arXiv cs.LG (May 11, 2026)