
Why Relvy Bets on Agentic RCA Over LLMs

Why do general-purpose LLMs fail at production incident RCA? Relvy's architecture exposes the noise problem — and how specialized AI agents solve it.

Philip

10 Apr 2026 — 5 min read

General-purpose LLMs hit 36% on root cause analysis benchmarks. Relvy's specialized agentic architecture explains exactly why and what fixes it.

Summary

Relvy is making a serious bet that specialized agentic tooling beats general-purpose LLMs for production incident response. The architecture is worth understanding because it exposes a broader pattern: where exactly general-purpose reasoning breaks down under operational constraints, and what you actually need to replace it with. If you run on-call rotations or build observability tooling, this changes how you think about the RCA problem.

Why General-Purpose LLMs Fail at Root Cause Analysis

The failure mode is familiar to anyone who has tried to throw GPT-4 at a production incident. The model is too broad, too easily distracted by irrelevant context, and has no reliable way to distinguish a genuine anomaly in request latency from a noisy but benign metric spike that happens every Tuesday morning.

Relvy puts a number on it: 36% accuracy on OpenRCA for general-purpose LLMs. That number needs scrutiny. OpenRCA is a specific benchmark, and benchmark performance does not automatically transfer to your Kubernetes cluster at 3am. From what has been published, the methodology Relvy uses to measure its own improvement over that baseline has not been independently validated. Treat the 36% figure as a meaningful signal about the problem space, not as a certified floor that Relvy has definitively cleared.

The Noise Problem Is Architectural, Not Prompt-Level

The core insight in Relvy's design is that signal-to-noise ratio collapse is not something you can fix with a better system prompt. When you feed an LLM a full observability dump, several things happen simultaneously: the context window fills with low-information spans, attention gets distributed across hundreds of metrics instead of concentrated on the anomalous ones, and the semantic gap between raw telemetry and actionable hypotheses is too large to bridge through in-context reasoning alone.

Relvy's answer is to move intelligence upstream, before the LLM sees anything. Anomaly detection, built on Z-scores and time series decomposition that separates seasonality from genuine deviation, filters the signal before it enters the agent's reasoning loop. This is not a novel statistical idea. What is architecturally interesting is that Relvy treats pre-filtering as a first-class tool in the agent's toolkit rather than a preprocessing step that happens outside the agentic loop. The agent can invoke anomaly detection, receive a compressed signal, and then decide what to investigate next. That is a different cognitive model from "summarize this pile of logs."
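To make the pre-filtering idea concrete, here is a minimal sketch of a seasonality-aware Z-score filter. It is not Relvy's implementation: the function names, the leave-one-out comparison against points at the same phase of the cycle, and the threshold of 3.0 are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def seasonal_zscores(values, period):
    """Z-score each point against the other points at the same phase of the cycle."""
    scores = []
    for i, v in enumerate(values):
        # Peers share this point's position in the cycle; the point itself is excluded.
        peers = [values[j] for j in range(i % period, len(values), period) if j != i]
        if len(peers) < 2:
            scores.append(0.0)
            continue
        mu, sigma = mean(peers), stdev(peers)
        scores.append(0.0 if sigma == 0 else (v - mu) / sigma)
    return scores

def detect_anomalies(values, period, threshold=3.0):
    """Return (index, value) pairs whose seasonal Z-score exceeds the threshold."""
    zs = seasonal_zscores(values, period)
    return [(i, values[i]) for i, z in enumerate(zs) if abs(z) > threshold]

# Synthetic latency series (illustrative only): a benign spike recurs at phase 2
# of every 4-point cycle, and one genuine anomaly is injected at index 13.
pattern = [100, 100, 300, 100]
series = [pattern[i % 4] + (i * 7) % 5 - 2 for i in range(24)]
series[13] = 400

print(detect_anomalies(series, period=4))  # [(13, 400)]
```

The recurring 300 ms spikes are not flagged because each one is compared only with other points at the same phase. The agent would receive one suspicious point instead of 24 raw samples.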

Runbook-Anchored Execution and the Determinism Tradeoff

The real failure mode in agentic RCA is not hallucination. It is unbounded exploration: an agent that can ask anything will eventually ask the wrong thing at the wrong cost.

Relvy's most defensible architectural choice is runbook-anchored execution. The idea is to use existing on-call runbooks as a structured prior that constrains the agent's exploration space. Instead of letting the agent pursue arbitrary hypotheses, the runbook defines a directed graph of investigation steps. The agent can still do stochastic exploration within that graph, but the search space is bounded.

This trades flexibility for reliability in exactly the way production systems need. An unbounded ReAct loop investigating a database timeout incident might decide to check IAM permissions, then container resource limits, then DNS resolution, in an order that a senior SRE would never choose. The runbook encodes accumulated operational knowledge about which hypotheses are worth checking first, in which order, under which conditions.
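A rough sketch of how a runbook can bound an agent's search is shown below. The runbook contents, step names, and the `investigate` helper are hypothetical, not taken from Relvy. The `choose` callback stands in for the agent's decision, and any choice outside the runbook's allowed steps is rejected.

```python
# Hypothetical runbook for a database timeout. Each step lists the follow-up
# steps worth checking, in priority order, if it does not explain the incident.
RUNBOOK = {
    "check_connection_pool": ["check_slow_queries", "check_db_cpu"],
    "check_slow_queries": ["check_recent_deploys"],
    "check_db_cpu": ["check_recent_deploys"],
    "check_recent_deploys": [],
}

def investigate(runbook, start, run_check, choose=lambda options: options[0], max_steps=10):
    """Walk the runbook graph; the agent may only pick among currently allowed steps."""
    visited, frontier, trail = set(), [start], []
    while frontier and len(trail) < max_steps:
        step = choose(frontier)
        if step not in frontier:
            raise ValueError(f"{step!r} is outside the runbook-allowed steps")
        frontier.remove(step)
        visited.add(step)
        trail.append(step)
        if run_check(step):  # the check confirmed a cause, so stop here
            return step, trail
        for nxt in runbook.get(step, []):
            if nxt not in visited and nxt not in frontier:
                frontier.append(nxt)
    return None, trail

# Stub check: pretend only the deploy check finds something.
cause, trail = investigate(
    RUNBOOK, "check_connection_pool", lambda step: step == "check_recent_deploys"
)
print(cause, trail)
```

Within the frontier the agent is still free to reorder steps, which is where the remaining exploration happens. What it cannot do is wander off to IAM permissions or DNS when the runbook never lists them.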

The Human-in-the-Loop Question Is Not Optional

Relvy includes HITL capabilities for high-stakes decisions. This is the right call, and the reasoning matters beyond Relvy specifically. Any agent that can trigger mitigation actions, roll back deployments, or modify infrastructure is operating in a domain where a false positive has asymmetric costs. An agent that is right 90% of the time is still wrong on one decision in ten, and when those errors can be catastrophic, that is a frequency production systems cannot absorb.

The practical implication: if you are evaluating any agentic RCA system, the first question is not accuracy. It is the blast radius of an incorrect automated action. Systems that gate mitigation behind explicit human approval are architecturally safer regardless of their benchmark scores.
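One way to express that gate in code is sketched below. The `Risk` levels, the `Action` type, and the `approve` callback are illustrative assumptions, not a description of how Relvy implements approvals.

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    READ_ONLY = 1   # queries, dashboards, log searches
    MUTATING = 2    # rollbacks, restarts, infrastructure changes

@dataclass
class Action:
    name: str
    risk: Risk

def execute(action, run, approve):
    """Run read-only actions directly; require explicit human approval for mutating ones."""
    if action.risk is Risk.MUTATING and not approve(action):
        return f"blocked: {action.name} awaiting human approval"
    return run(action)

# Stubs for illustration: `run` pretends to execute, `approve` denies everything.
run = lambda action: f"ran: {action.name}"
deny_all = lambda action: False

print(execute(Action("query_logs", Risk.READ_ONLY), run, deny_all))
print(execute(Action("rollback_deploy", Risk.MUTATING), run, deny_all))
```

The design choice is that the gate keys on the action's blast radius, not on the agent's confidence. A highly confident rollback still waits for a person.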

The Tool Architecture Pattern Worth Stealing

Relvy uses what it describes as MCP-like patterns for tool orchestration. The key idea is progressive tool exposure: presenting tools to the agent dynamically, as they are needed, rather than loading every available tool into the context at once. It is the right pattern here and worth unpacking regardless of whether you use Relvy.

Loading 40 tools into a context window for an agent investigating a single service degradation is the agentic equivalent of giving a new hire access to every internal system on day one. The cognitive overhead is real, and the attention dilution is measurable. Relvy's specialized tools (anomaly detection, problem slicing, log pattern search, and span tree reasoning) are designed to be surfaced selectively based on the current investigation state.
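Selective surfacing can be sketched as a function from investigation state to the tool list shown to the agent. The stage names, state keys, and tool-to-stage mapping below are assumptions made for the example; the article does not say how Relvy decides which tools to expose.

```python
# Hypothetical mapping from investigation stage to the tools exposed at that stage.
TOOLS_BY_STAGE = {
    "triage": ["anomaly_detection"],
    "scoping": ["problem_slicing"],
    "deep_dive": ["log_pattern_search", "span_tree_reasoning"],
}

def current_stage(state):
    """Derive the stage from what the investigation has established so far."""
    if not state.get("anomalies"):
        return "triage"
    if not state.get("scope"):
        return "scoping"
    return "deep_dive"

def exposed_tools(state):
    """Return only the tools relevant to the current stage."""
    return TOOLS_BY_STAGE[current_stage(state)]

print(exposed_tools({}))
print(exposed_tools({"anomalies": ["p99_latency"]}))
print(exposed_tools({"anomalies": ["p99_latency"], "scope": {"service": "checkout"}}))
```

At each turn the agent sees one or two tool descriptions instead of the full catalog, so the tool list itself stops competing with telemetry for context.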

Relvy's Four Specialized Tool Classes

Anomaly detection: Z-scores and time series decomposition, separating genuine signal from cyclical noise before LLM reasoning begins

Problem slicing: narrows the investigation scope to relevant services and time windows, reducing context size and improving reasoning focus

Log pattern search: structured querying against log corpora rather than raw log ingestion, avoiding context window saturation

Span tree reasoning: distributed trace analysis that can identify causal chains across service boundaries, where flat metric analysis fails
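As an illustration of the last tool class, the sketch below walks a trace's parent links to find where an error originated. The span fields (`id`, `parent`, `service`, `error`) and the heuristic itself, which treats an erroring span with no erroring children as the origin, are simplifying assumptions and not Relvy's algorithm.

```python
def root_cause_chains(spans):
    """Return service chains from the trace root down to each error origin."""
    by_id = {s["id"]: s for s in spans}
    children = {}
    for s in spans:
        children.setdefault(s["parent"], []).append(s)

    # An erroring span whose children all succeeded is where the failure started.
    origins = [
        s for s in spans
        if s["error"] and not any(c["error"] for c in children.get(s["id"], []))
    ]

    chains = []
    for origin in origins:
        chain, node = [], origin
        while node is not None:
            chain.append(node["service"])
            node = by_id.get(node["parent"])  # None once the root is passed
        chains.append(list(reversed(chain)))
    return chains

# Hypothetical trace: the gateway and checkout errors propagate up from payments-db.
trace = [
    {"id": 1, "parent": None, "service": "gateway", "error": True},
    {"id": 2, "parent": 1, "service": "checkout", "error": True},
    {"id": 3, "parent": 2, "service": "payments-db", "error": True},
    {"id": 4, "parent": 1, "service": "catalog", "error": False},
]
print(root_cause_chains(trace))  # [['gateway', 'checkout', 'payments-db']]
```

Flat metrics would show three services erroring at once. The tree shows that two of them are only relaying a failure from the third.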

What the Email API Comparison Tells Us About Agent Infrastructure

A seemingly unrelated development illustrates the same architectural maturity pattern. Purpose-built APIs for AI agent email capabilities, with sub-200ms inbox provisioning and built-in OTP extraction, exist now. The question of whether an agent should parse raw MIME or call a structured extraction endpoint has the same answer as whether your RCA agent should read raw logs or call a log pattern tool: the structured endpoint wins on reliability and latency, every time.

Agent infrastructure is fragmenting into specialized primitives. Memory layers with hybrid vector search. Cryptographic audit trails for agent-to-agent interactions. Purpose-built email APIs. Anomaly detection as an agent tool. The pattern is consistent: general-purpose capabilities lose to specialized ones when the domain has well-defined constraints and failure modes.

What Practitioners Should Actually Do With This

The agent that is right 90% of the time is wrong at a frequency that production systems cannot afford. Accuracy is the wrong primary metric for agentic RCA.

If you are running on-call rotations today and thinking about where agentic tooling fits, the Relvy architecture suggests a concrete evaluation framework.

First, audit your existing runbooks. Runbook-anchored execution only works if your runbooks encode genuine operational knowledge. If your runbooks are outdated or absent, the anchoring provides false structure. The agent will follow a wrong map confidently.

Test On Your Worst Incidents, Not Demos

Second, do not evaluate RCA agents on demo incidents. Evaluate them on your actual historical incidents, with your actual observability data. The 36% OpenRCA figure for general-purpose LLMs might be 60% or 20% on your specific stack. Benchmark transfer is unreliable enough that you need your own measurement.
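A replay harness for this can be very small. The sketch below assumes you have labelled historical incidents and an agent you can call with their telemetry; the incident fields, the labels, and the stub agent are hypothetical placeholders.

```python
def evaluate(agent, incidents):
    """Replay labelled incidents; return accuracy plus the misses to review by hand."""
    misses = []
    for incident in incidents:
        predicted = agent(incident["telemetry"])
        if predicted != incident["root_cause"]:
            misses.append((incident["id"], predicted, incident["root_cause"]))
    accuracy = 1 - len(misses) / len(incidents)
    return accuracy, misses

# Placeholder incidents; in practice the telemetry would be your real exported data.
incidents = [
    {"id": "INC-1", "telemetry": {}, "root_cause": "bad_deploy"},
    {"id": "INC-2", "telemetry": {}, "root_cause": "db_connection_exhaustion"},
    {"id": "INC-3", "telemetry": {}, "root_cause": "bad_deploy"},
    {"id": "INC-4", "telemetry": {}, "root_cause": "expired_certificate"},
]

# Stub agent that always blames the last deploy, a useful baseline to beat.
always_deploy = lambda telemetry: "bad_deploy"

accuracy, misses = evaluate(always_deploy, incidents)
print(accuracy)  # 0.5
print(misses)
```

Keeping the list of misses matters as much as the headline number, since the incidents an agent gets wrong tell you which failure classes it cannot be trusted with.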

Third, treat the mitigation loop separately from the investigation loop. Automate investigation aggressively. Gate mitigation conservatively. These are different risk profiles and should have different approval requirements.

The broader pattern: specialized agent tooling is consistently outperforming general-purpose LLM applications in domains with high operational stakes, well-defined failure modes, and existing human expertise that can be encoded as structured priors. On-call RCA is exactly that domain. The architecture is right even if the specific product claims need independent validation.

The Bottom Line

General-purpose LLMs scoring 36% on OpenRCA is a real signal that the RCA problem requires specialized tooling, not better prompts
Runbook-anchored execution is the correct architectural pattern for constraining agentic exploration in high-stakes operational domains
Progressive tool exposure outperforms context-stuffing for both accuracy and latency in multi-tool agent architectures
HITL gating on mitigation actions is non-negotiable regardless of agent accuracy claims
Evaluate RCA agents on your own historical incident data, not published benchmarks, before trusting any production handoff

Sources: DEV.to (April 9, 2026), NewsAPI (April 9, 2026)