Agent Observability

AgentSearchBench: Execution Beats Description

Semantic similarity is a weak predictor of agent performance. See how AgentSearchBench quantifies the gap and why execution-grounded signals change everything.

Philip

27 Apr 2026

AgentSearchBench shows that semantic search is a weak basis for AI agent selection. Execution-aware probing lifts ranking quality by 40%. Here's what that means for your pipelines.

Summary

Two new benchmarks reveal a quiet structural shift in how we evaluate agentic AI systems: not by what agents claim to do, but by what they actually do under execution. The gap between semantic description and behavioral performance is larger than most teams expect, and the tooling to close it is only now becoming measurable. Practitioners building agent pipelines need to understand what this means for discovery, debugging, and trust.

The Description Gap Is Getting Quantified

For most of the past two years, finding the right agent for a task has worked roughly the same way as finding a library on npm: you read the description, scan a few examples, and make a judgment call. There is now benchmark evidence that this is unreliable as a selection strategy.

AgentSearchBench formalizes what practitioners have suspected informally. Across nearly 10,000 real-world agents spanning multiple providers, semantic similarity between a task query and an agent description is a weak predictor of actual task performance. The benchmark evaluates agent search as two distinct problems: retrieval (which agents are even candidates?) and reranking (of those candidates, which one should run?). Both are evaluated using execution-grounded signals, meaning the ground truth is not human annotation of relevance but observable task completion.
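To make "execution-grounded" concrete, here is a minimal sketch of scoring a ranking against observed task completion instead of annotated relevance. The article does not name the ranking metric the benchmark uses, so NDCG is an assumption here, and the agent names and success rates are invented for illustration.

```python
import math

def dcg(gains):
    """Discounted cumulative gain for gains listed in ranked order."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(ranked_agent_ids, success_rate, k=None):
    """NDCG where each agent's gain is its observed task success rate.

    success_rate maps agent id -> fraction of executions that completed
    the task. This is the execution-grounded label (hypothetical data).
    """
    k = k or len(ranked_agent_ids)
    gains = [success_rate.get(a, 0.0) for a in ranked_agent_ids[:k]]
    ideal = sorted(success_rate.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical execution outcomes for three candidate agents.
success_rate = {"agent_a": 0.9, "agent_b": 0.2, "agent_c": 0.6}

by_description = ["agent_b", "agent_c", "agent_a"]  # ordered by text similarity
by_execution = ["agent_a", "agent_c", "agent_b"]    # ordered by observed success

print("description-based NDCG:", round(ndcg(by_description, success_rate), 3))
print("execution-based NDCG:  ", round(ndcg(by_execution, success_rate), 3))
```

The point of the sketch is only that the label comes from running the agent: a ranking that looks reasonable by description can score poorly once gains are tied to completion.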

Descriptions Lie, Execution Reveals The Truth

The number that matters: execution-aware probing improves ranking quality by 40% over description-based methods. The methodology here is benchmark-internal and should be read as directional rather than absolute, but 40% on a ranking metric at this scale is not noise. It reflects a real structural difference between what agents say they do and what they demonstrably do.

Semantic Embeddings Fail at the Boundary That Matters

This failure mode has a specific shape. Agents that are semantically close to a task description often share surface vocabulary without sharing the underlying capability. A "web research agent" and a "document retrieval agent" look similar in embedding space. Under execution, one might handle multi-step browser navigation while the other fails the moment it encounters a login wall.
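A toy illustration of that shape, assuming a bag-of-words cosine similarity as a stand-in for a real embedding model. The descriptions, the query, and the execution outcomes are all hypothetical.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

query = "find and summarize documents about pricing on the web"
descriptions = {
    "web_research_agent": "searches the web and summarizes documents it finds",
    "document_retrieval_agent": "retrieves and summarizes documents from the web",
}
# Hypothetical outcome on a task that requires getting past a login wall.
completed_task = {"web_research_agent": True, "document_retrieval_agent": False}

scores = {name: cosine(query, text) for name, text in descriptions.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: similarity={score:.3f} completed={completed_task[name]}")
```

In this contrived case the two descriptions score within a few hundredths of each other, and the agent that fails under execution ranks marginally higher on text similarity.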

Selection errors from description-based retrieval compound as agent pipelines are composed. If you're building a system that dynamically selects sub-agents for delegated steps, an error rate of, say, 20% at the agent selection layer doesn't stay contained. It propagates. The wrong agent doesn't just fail; it often produces plausible-looking output that passes downstream without triggering any error signal.
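A back-of-the-envelope sketch of the compounding, under the simplifying assumption that each selection step errs independently at the same 20% rate. Real pipelines will not be this clean, and the function name is illustrative.

```python
def pipeline_success_probability(per_step_error: float, steps: int) -> float:
    """Probability that every selection step picks a suitable agent,
    assuming each step errs independently with the same probability."""
    return (1.0 - per_step_error) ** steps

for steps in (1, 3, 5):
    p = pipeline_success_probability(0.20, steps)
    print(f"{steps} selection steps: {p:.3f} chance every agent is the right one")
# 1 selection steps: 0.800 chance every agent is the right one
# 3 selection steps: 0.512 chance every agent is the right one
# 5 selection steps: 0.328 chance every agent is the right one
```

Under these assumptions, a five-step pipeline picks the right agent at every step only about a third of the time.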

Debugging Multi-Agent Failures Is a Different Problem Than Debugging Code

TraceElephant arrives at the same moment and addresses the other end of the same problem. AgentSearchBench is about selecting agents before execution. TraceElephant is about understanding what went wrong after execution.

The core contribution is providing full execution traces, including all intermediate inputs, context windows, and inter-agent communication, for LLM-based multi-agent systems, then measuring how accurately different attribution techniques identify the root cause of failure. On the benchmark, access to full traces improves attribution accuracy by up to 76% over partial-observation counterparts.
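One way to picture "attribution accuracy" is as the fraction of failed runs where a technique blames the same agent and step as the ground-truth label. The article does not describe TraceElephant's exact scoring rule, so the exact-match definition, the `Attribution` type, and the data below are assumptions for illustration.

```python
from typing import NamedTuple

class Attribution(NamedTuple):
    agent: str  # which agent is blamed for the failure
    step: int   # at which step of the run

def attribution_accuracy(predicted, ground_truth):
    """Fraction of failed runs where the predicted (agent, step) matches the label."""
    if not ground_truth:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)

# Hypothetical labels and predictions for four failed runs.
truth = [
    Attribution("planner", 2),
    Attribution("browser", 5),
    Attribution("writer", 7),
    Attribution("browser", 3),
]
from_partial_logs = [
    Attribution("writer", 7),
    Attribution("browser", 5),
    Attribution("writer", 7),
    Attribution("planner", 1),
]
from_full_traces = [
    Attribution("planner", 2),
    Attribution("browser", 5),
    Attribution("writer", 7),
    Attribution("planner", 1),
]

print("partial logs:", attribution_accuracy(from_partial_logs, truth))
print("full traces: ", attribution_accuracy(from_full_traces, truth))
```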

Partial Observation Debugging Misses What Actually Matters

The partial-observation comparison is the technically important part. Most production debugging today is partial-observation debugging. You get logs. You get outputs. You get maybe a structured trace if your orchestration layer is mature. You do not typically get the full context window that each agent saw at each step, the exact prompt construction, the sampling parameters in effect, and the inter-agent messages in sequence. TraceElephant treats the absence of this information as a measurable liability, not an inconvenience.
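As a sketch of what a full-observation record might hold per agent step, here is a small dataclass covering the items listed above. The field names and the `AgentStepTrace` type are hypothetical, not TraceElephant's schema or any framework's API.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class AgentStepTrace:
    agent_id: str
    step_index: int
    prompt: str                 # exact prompt as constructed for the model
    context_window: list[str]   # every message the agent saw at this step
    sampling_params: dict       # e.g. temperature, top_p, seed
    inbound_messages: list[dict] = field(default_factory=list)  # from other agents, in order
    output: str = ""

    def to_json(self) -> str:
        """Serialize the whole step so it can be stored and replayed later."""
        return json.dumps(asdict(self), sort_keys=True)

step = AgentStepTrace(
    agent_id="browser",
    step_index=5,
    prompt="Open the pricing page and extract the plan table.",
    context_window=["system: you are a browsing agent", "planner: fetch pricing"],
    sampling_params={"temperature": 0.7, "top_p": 1.0},
    inbound_messages=[{"from": "planner", "content": "fetch pricing"}],
    output="ERROR: login required",
)
print(step.to_json())
```

A typical log line would keep only `agent_id` and `output` from this record; the remaining fields are the part that partial observation drops.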

Nondeterminism Breaks Standard Root Cause Analysis

Standard software debugging assumes that identical inputs produce identical outputs. LLM-based agents violate this assumption at the model level. With temperature above zero, the same inputs can produce different behaviors across runs, so when a multi-agent system fails, the failure might not reproduce on the next execution, even with identical agent code.

TraceElephant's reproducible environments address this directly by fixing the randomness surface during evaluation. This is what makes attribution accuracy measurable at all. Without reproducibility, you cannot distinguish "this attribution technique correctly identified the failure" from "this failure happened to reproduce in a way that made the attribution look correct." The benchmark infrastructure is doing real work here, not just providing a dataset.
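The article does not say how TraceElephant fixes the randomness surface. The sketch below shows the general idea with a toy temperature sampler driven by an explicitly seeded generator, so the same seed replays the same sequence of choices. The action names and logits are made up, and a hosted model API may not offer this level of control.

```python
import math
import random

def sample_token(logits: dict, temperature: float, rng: random.Random) -> str:
    """Sample one token from a temperature-scaled softmax over the logits."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

def run(seed: int, steps: int = 10) -> list:
    """Simulate an agent's action choices with all randomness drawn from one seed."""
    rng = random.Random(seed)
    logits = {"click": 1.2, "scroll": 1.0, "give_up": 0.3}
    return [sample_token(logits, temperature=0.8, rng=rng) for _ in range(steps)]

first = run(seed=7)
replay = run(seed=7)
print(first)
print("replay identical:", first == replay)
```

With the seed pinned, a failure seen once can be replayed exactly, which is the precondition for checking whether an attribution was actually correct.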

What These Two Benchmarks Name Together

Read separately, AgentSearchBench and TraceElephant address different phases of the agent lifecycle. Read together, they are describing the same underlying problem from two directions.

The problem is that agent behavior is not legible from artifacts alone. Agent descriptions are artifacts. Log files are artifacts. Even structured traces are artifacts. What both benchmarks push toward is behavioral grounding: evaluating agents by running them, and evaluating failures by reconstructing the full execution context, not by reading summaries of it.

Artifacts Lie, Behavior Is The Only Truth

This is a meaningful departure from how most teams currently operate. The dominant workflow is: write a description, test manually, ship, and debug reactively. Both benchmarks are building the measurement infrastructure that makes a different workflow possible: probe behaviorally before selection, trace completely before attribution.

The real cost of description-based agent selection is not just the ranking gap at selection time that the 40% figure measures. It is the silent propagation of wrong-agent errors through downstream pipeline steps, where they look like model quality problems.

Execution-Aware Probing Is Not Yet a Standard Pattern

Execution-aware probing, the mechanism that drives the 40% improvement in AgentSearchBench, means running lightweight test cases against candidate agents before committing to them for a real task. This is closer to property-based testing or capability probing than to traditional search. The agent receives a small, cheap, representative sub-task and its behavior on that sub-task informs ranking.
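A minimal sketch of the probing pattern described above, not AgentSearchBench's implementation: each candidate runs a few cheap probe tasks and candidates are ranked by pass rate. The `Agent` and `Probe` types, the function names, and the toy agents are all hypothetical.

```python
from typing import Callable

Agent = Callable[[str], str]
Probe = tuple[str, Callable[[str], bool]]  # (sub-task input, check on the output)

def probe_score(agent: Agent, probes: list[Probe]) -> float:
    """Fraction of probe sub-tasks the agent completes correctly."""
    passed = 0
    for task, check in probes:
        try:
            if check(agent(task)):
                passed += 1
        except Exception:
            pass  # a crash on a probe counts as a failed probe
    return passed / len(probes) if probes else 0.0

def rank_by_probing(candidates: dict[str, Agent], probes: list[Probe]) -> list[tuple[str, float]]:
    """Rank candidate agents by observed probe pass rate, best first."""
    scores = {name: probe_score(agent, probes) for name, agent in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy agents standing in for real ones; the "task" here is uppercasing text.
def upper_agent(task: str) -> str:
    return task.upper()

def echo_agent(task: str) -> str:
    return task

def broken_agent(task: str) -> str:
    raise RuntimeError("login wall")

probes = [
    ("abc", lambda out: out == "ABC"),
    ("x1", lambda out: out == "X1"),
    ("Q", lambda out: out == "Q"),
]
candidates = {"echo": echo_agent, "broken": broken_agent, "upper": upper_agent}
ranking = rank_by_probing(candidates, probes)
print(ranking)
```

In practice the probes would need to be cheap and representative of the real task, and probe results could be combined with description similarity instead of replacing it. Both choices are left open here.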

This pattern is not currently standard in any major agent framework. LangGraph, AutoGen, and CrewAI all support orchestration and tool-use, but none of them expose a first-class API for behavioral probing at agent selection time. The benchmark is measuring a capability gap that the frameworks have not yet filled.

We are building agent systems where selection is based on documentation and debugging is based on logs. Both benchmarks say that is not sufficient, and now we have numbers to prove it.

Where This Is Heading

The direction of travel is toward infrastructure that treats agent behavior as a first-class signal at every stage of the lifecycle. At selection time, this means probing rather than querying. At debugging time, this means full trace reconstruction rather than partial log analysis. The benchmarks are not describing this infrastructure; they are creating the measurement surface that will justify building it.

For teams running agent pipelines in production today, the near-term implication is specific. Agent registries, whether internal to your organization or external from providers like AWS Bedrock Agents, OpenAI Assistants, or emerging third-party marketplaces, currently present agents through descriptions and metadata. The selection layer has no behavioral grounding. If you are composing multi-agent workflows with dynamic sub-agent selection, you are implicitly trusting that semantic similarity is good enough. It is not, and you now have a benchmark that quantifies the gap.

Full Trace Capture Becomes Non-Negotiable Infrastructure

The slightly longer-term implication is that debugging multi-agent systems will require full trace capture as a baseline operational requirement, not an optional observability feature. Partial logs were acceptable when agent systems were simple enough that failure modes were traceable by inspection. At the complexity level that production multi-agent systems are reaching, partial-observation debugging is structurally insufficient. The improvement of up to 76% in attribution accuracy from full traces over partial observations is not only a research result. It is a warning about what you are flying blind through every time an agent pipeline fails in production and the logs do not tell you why.

If your agent orchestration layer is not capturing full context windows for every agent step in production, you are not debugging failures. You are guessing at them.

The Bottom Line

Semantic similarity between task descriptions and agent descriptions is a weak selection signal; execution-aware probing improves ranking quality by 40% in controlled evaluation and should be standard practice before committing to an agent in a composed pipeline.
Full execution traces improve failure attribution accuracy by up to 76% over partial-observation methods; if your observability stack does not capture complete context windows per agent step, your debugging is structurally limited.
No major orchestration framework currently exposes behavioral probing at agent selection time; this is the next infrastructure gap that will be filled, and teams building agent registries should move first.
The two benchmarks together define a new operational baseline: probe before you select, trace completely before you attribute.

Sources: arXiv cs.AI (April 27, 2026), arXiv cs.MA (April 27, 2026)