Typed Pseudocode Is Reshaping LLM Agent Skills
Why are typed pseudocode contracts beating markdown skill libraries? ALFWorld benchmarks reveal 22.8% token savings. Here's what it means for agent builders.
Summary
A quiet structural shift is underway in how LLM agents encode procedural knowledge. The direction of travel points toward typed, verifiable skill representations rather than freeform text. If you are building agent tooling today, the choice of skill representation is no longer cosmetic: it determines how much context you burn, how reliably the agent invokes the right behavior, and whether your skill library degrades or compounds over time.
The field has been arguing about prompting strategies and memory architectures for two years. The argument that has not yet been named clearly is the one happening at a more fundamental layer: what is the correct data structure for an agent's learned procedural knowledge?
Most teams today answer this question by default. They store skills as markdown documents, natural language descriptions, or raw conversation excerpts. That choice feels low-friction. It is also a quiet tax on every inference call that follows.
The Typed Turn in Agent Skill Representation
Skill-as-Pseudocode (SaP) takes markdown skill libraries and converts them into typed pseudocode with explicit contracts. The mechanism is precise: it extracts typed signatures from procedural passages, runs them through a four-check verifier testing coverage, binding, replacement, and risk, then inlines the verified contracts into rewritten skill skeletons. The agent receives two signals simultaneously: a typed signature describing what the skill does, and a concrete invocation template describing how to call it.
The benchmark numbers from ALFWorld's 134-game unseen split are specific enough to take seriously. Against the Graph-of-Skills (GoS) baseline using gpt-4o-mini, SaP wins 82 of 402 paired games versus 47 for GoS. Input tokens fall by 22.8%, and LLM calls per game drop by 14.5%.
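To make the "signature plus invocation template" idea concrete, here is a minimal Python sketch of what one typed skill contract might look like. Everything in it is an assumption for illustration: the `SkillContract` class, its field names, the `heat_object` skill, and the reading of "binding" as "template placeholders match declared parameters" are not taken from the Skill-as-Pseudocode work.

```python
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class SkillContract:
    """Hypothetical typed contract for one skill (not the published schema)."""
    name: str
    params: dict[str, str]  # parameter name -> type name
    returns: str            # return type name
    template: str           # concrete invocation template

    def signature(self) -> str:
        """Render the 'what': a typed signature line."""
        args = ", ".join(f"{p}: {t}" for p, t in self.params.items())
        return f"{self.name}({args}) -> {self.returns}"

    def template_fields(self) -> set[str]:
        """Collect the placeholder names that appear in the template."""
        return {f for _, f, _, _ in Formatter().parse(self.template) if f}

    def is_bound(self) -> bool:
        """Assumed binding check: placeholders equal the declared parameters."""
        return self.template_fields() == set(self.params)

    def invoke(self, **kwargs: str) -> str:
        """Render the 'how': fill the template, rejecting missing or extra args."""
        if set(kwargs) != set(self.params):
            raise ValueError(f"expected {sorted(self.params)}, got {sorted(kwargs)}")
        return self.template.format(**kwargs)


# Invented example skill, loosely modeled on a household-task action.
heat_object = SkillContract(
    name="heat_object",
    params={"obj": "Object", "appliance": "Receptacle"},
    returns="Observation",
    template="heat {obj} with {appliance}",
)

print(heat_object.signature())
print(heat_object.invoke(obj="egg 1", appliance="microwave 1"))
```

The point of the sketch is that the signature and the template live in one record, so a deterministic check such as `is_bound()` can run before the skill ever reaches a prompt.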
Why Typed Contracts Reduce Token Burn
The mechanism behind the token reduction is worth understanding, because it is not obvious. When a skill is stored as prose, the agent must infer the invocation pattern from the description every time. Typed pseudocode with a concrete template collapses that inference step. The agent sees the signature, pattern-matches to the template, and executes. Less ambiguity requires less context to resolve.
This is the same reason typed APIs outperform undocumented ones in human software engineering. The difference is that in agent systems, the cost of ambiguity is paid in tokens and latency on every single invocation, not just during development.
Typed Templates Eliminate Inference, Not Just Words
The four-check verifier is the part that most teams will undervalue. Deterministic quality control on the skill library means the library compounds rather than degrades. Unverified markdown skill collections tend to accumulate inconsistencies silently. A skill gets updated in one place, the prose description drifts from the actual behavior, and the agent starts making systematically wrong invocations. The verifier makes that class of failure detectable before it reaches inference.
Behavioral Consistency Is the Unasked Production Question
The consistency research on multi-step tool-calling pipelines asks a question that most practitioners have not formally posed: does your agent select the same tools, in the same order, with the same arguments, across repeated identical invocations?
Most teams assume the answer is approximately yes, especially with temperature zero. The empirical answer is likely more uncomfortable than that.
Why Non-Determinism Compounds in Tool Chains
In a single LLM call, output variance at low temperature is small. In a multi-step tool-calling pipeline, variance compounds. Each step introduces a small probability of a different tool selection or argument value. By step five or six, the behavioral distribution across repeated runs can be meaningfully wide, even with identical inputs.
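The compounding effect is easy to see with a back-of-the-envelope model. The sketch below assumes each step independently matches a reference run with a fixed probability; both the independence assumption and the 0.97 figure are illustrative choices, not measurements from the consistency research.

```python
def identical_path_probability(per_step_agreement: float, steps: int) -> float:
    """Probability that a run matches a reference run at every step,
    assuming steps are independent (a simplifying assumption)."""
    if not 0.0 <= per_step_agreement <= 1.0:
        raise ValueError("per_step_agreement must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_agreement ** steps


# Illustrative only: 0.97 is a made-up per-step agreement rate.
for steps in (1, 3, 6, 10):
    print(steps, round(identical_path_probability(0.97, steps), 3))
```

Under these assumptions, a pipeline that agrees with itself 97% of the time at each step follows an identical six-step path only about 83% of the time.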
This matters practically because most agent evaluation setups measure end-to-end task completion. They do not measure behavioral reproducibility across runs. An agent that achieves 70% task completion through inconsistent paths has a different reliability profile from one that achieves 70% through consistent paths. The first will produce debugging sessions that cannot be reproduced. The second will not.
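One way to start measuring this is to log the tool-call trace of each repeated run and compare traces directly. The following is a minimal sketch under that assumption; the helper names (`call`, `modal_trace_share`, `first_divergence`) and the example tools are hypothetical and not drawn from any particular framework or paper.

```python
from collections import Counter
from typing import Optional

# A tool call is (tool name, sorted (argument, value) pairs); a trace is a tuple of calls.
ToolCall = tuple[str, tuple[tuple[str, str], ...]]
Trace = tuple[ToolCall, ...]


def call(tool: str, **args: str) -> ToolCall:
    """Build a hashable, order-insensitive record of one tool call."""
    return (tool, tuple(sorted(args.items())))


def modal_trace_share(traces: list[Trace]) -> float:
    """Fraction of runs that follow the most common trace (1.0 = fully reproducible)."""
    if not traces:
        raise ValueError("need at least one trace")
    _, count = Counter(traces).most_common(1)[0]
    return count / len(traces)


def first_divergence(a: Trace, b: Trace) -> Optional[int]:
    """Index of the first step where two traces differ, or None if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Three repeated runs of the same (invented) task; the third picks a different argument.
runs = [
    (call("search", q="raft"), call("open", doc="1")),
    (call("search", q="raft"), call("open", doc="1")),
    (call("search", q="raft"), call("open", doc="2")),
]
print(modal_trace_share(runs))
print(first_divergence(runs[0], runs[2]))
```

Reporting the modal trace share next to the task completion rate separates the two reliability profiles described above, and the divergence index points to the step where debugging should begin.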
Typed Contracts Shrink the Compounding Variance Problem
Typed skill representations and typed tool interfaces are part of the same structural response to this problem. When invocation templates are explicit and contracts are verified, the space of possible tool selections and argument values narrows. Reproducibility improves not because the model becomes less stochastic, but because the structured interface removes degrees of freedom from the decision.
Where Multi-Agent Specialization Beats General Agents
Agora's approach to consensus protocol verification makes a structural argument that generalizes beyond bug detection. The framework assigns specialized agents to explore protocol state spaces and synthesize attack scenarios using domain-specific constraints. It discovers 15 previously unknown protocol-level logic bugs across four consensus implementations: Raft, EPaxos, HotStuff, and BullShark.
The diagnostic claim is that single-function code analysis fails on deep logic bugs because those bugs exist in the interaction between components, not within any single component. You cannot find a Byzantine fault tolerance violation by reading one function. You find it by reasoning about global protocol invariants across the full state space.
Typed skill contracts and specialized agent roles are solving the same problem from different angles: how to make agent behavior predictable without making the agent dumber.
The Skill Library Is the Agent's Architecture
The pattern emerging across these research directions is coherent. Skill-as-Pseudocode establishes typed contracts for how an agent invokes individual capabilities. The consistency research establishes that behavioral reproducibility in tool chains is a measurable, auditable property. Agora suggests that explicit role separation and domain-aware collaboration outperform general agents on tasks requiring global state reasoning.
These are not separate developments. They are converging on the same architectural claim: the structure you impose on your agent's knowledge representation and role boundaries is more important than the model underneath, for a wide class of production tasks.
What Self-Evolution Research Reveals About Current Ceilings
BenchTrace introduces failure avoidance rate (FAR) as a metric: the fraction of test cases on which an agent avoids a failure after having previously encountered a similar one. Qwen3-32B and GPT-4.1 both fall below a 30% end-to-end pass rate on reflection evaluation, with diagnosis identified as the primary bottleneck.
The forgetting problem surfaces in the evolution results. Agents that improve FAR on recent failures show negative transfer across tasks and lose ground on earlier lessons. This is not surprising, but the specific diagnosis being the bottleneck is informative. Agents are not failing primarily because they cannot formulate corrective strategies. They are failing because they cannot correctly characterize what went wrong.
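As a rough illustration of the metric, failure avoidance rate can be computed from retest records along these lines. This is a sketch of one plausible reading of the definition above, not BenchTrace's actual implementation; the `Retest` record and its fields are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retest:
    """Hypothetical record of one test case run after a learning phase."""
    case_id: str
    seen_similar_failure: bool  # did the agent previously hit a similar failure?
    failed_again: bool          # did it fail on this case anyway?


def failure_avoidance_rate(results: list[Retest]) -> float:
    """Share of previously-failed-alike cases that the agent now gets through."""
    eligible = [r for r in results if r.seen_similar_failure]
    if not eligible:
        raise ValueError("no test cases with a prior similar failure")
    avoided = sum(not r.failed_again for r in eligible)
    return avoided / len(eligible)


# Invented data: four eligible cases, one avoided; the fifth case is not eligible.
results = [
    Retest("a", True, True),
    Retest("b", True, True),
    Retest("c", True, False),
    Retest("d", True, True),
    Retest("e", False, False),
]
print(failure_avoidance_rate(results))
```

Tracking this number separately for recent and older failures is one simple way to surface the forgetting effect described above.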
Typed Contracts Make Failures Finally Attributable
This connects back to typed skill representations in a non-obvious way. If skills have explicit contracts with coverage checks, a failure is more attributable. The agent can compare the typed signature against what actually happened and generate a structured diagnosis rather than a prose description of approximate failure conditions. The quality of the skill representation determines the quality of what can be learned from failures.
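A small sketch shows what a structured diagnosis could look like when a contract is available to compare against. The `Diagnosis` record and `diagnose` function are hypothetical names introduced here, and the example skill is invented; the idea is only that a typed signature turns "something went wrong" into specific, checkable fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical structured diagnosis of one failed skill invocation."""
    skill: str
    missing_args: tuple[str, ...]
    unexpected_args: tuple[str, ...]
    type_mismatches: tuple[str, ...]

    @property
    def attributable(self) -> bool:
        """True if the failure can be pinned on a contract violation."""
        return bool(self.missing_args or self.unexpected_args or self.type_mismatches)


def diagnose(skill: str, expected: dict[str, type], actual: dict[str, object]) -> Diagnosis:
    """Compare the declared parameter types against the arguments actually passed."""
    missing = tuple(sorted(set(expected) - set(actual)))
    unexpected = tuple(sorted(set(actual) - set(expected)))
    mismatched = tuple(sorted(
        name for name in set(expected) & set(actual)
        if not isinstance(actual[name], expected[name])
    ))
    return Diagnosis(skill, missing, unexpected, mismatched)


# Invented failure: the agent omitted one argument and added an undeclared one.
report = diagnose(
    "heat_object",
    expected={"obj": str, "appliance": str},
    actual={"obj": "egg 1", "temp": 90},
)
print(report)
```

A record like this can be stored and compared across failures, which a free-text description of "approximate failure conditions" cannot.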
The Three Structural Shifts Converging Right Now
1. Typed skill contracts over prose descriptions: Explicit signatures with verifier-checked contracts reduce per-call token cost and make skill libraries auditable. The SaP results are reproducible enough to treat as a signal, not noise.
2. Behavioral reproducibility as a first-class metric: Multi-step tool-calling consistency is not currently measured in most production pipelines. It should be. The variance compounds across steps in ways that make end-to-end success rates misleading.
3. Role specialization over general agents for complex domains: Agora's 15 novel bug discoveries across four consensus implementations suggest that domain-constrained multi-agent architectures outperform general agents on tasks requiring global state reasoning, not just task decomposition.
The Bottom Line
- Represent skills as typed pseudocode with verified contracts, not markdown prose: the token and reliability gains are measurable
- Add behavioral reproducibility to your eval suite before your next production deployment: end-to-end success rate is not sufficient
- Diagnosis quality determines learning quality: if your agent cannot characterize failures precisely, its self-improvement loop will be slow or noisy
- Role specialization with domain constraints outperforms general agents on tasks with global state dependencies
- The structure of your skill library is architectural, not cosmetic: it determines what your agent can learn, how reliably it invokes behavior, and how debuggable failures will be
Sources: arXiv cs.CL (NLP & Language Models) (May 29, 2026), arXiv cs.AI (May 29, 2026), arXiv cs.SE (Software Engineering & Coding Agents) (May 29, 2026)