AI Agents

Hermes Agent Fixes Context Bloat Through MCP Tool Search

Published: May 30, 20269 min read

Context bloat is crippling your AI agents. Discover how the Hermes Agent architecture uses BM25 retrieval and progressive schema disclosure to improve performance.

What You'll Build — and Why It Matters

If your AI agents are underperforming, the bottleneck is probably not the model. It's the harness around it.

Context bloat — the practice of dumping every available tool schema into an agent's context window at once — is quietly killing accuracy across production deployments. Nous Research's Hermes Agent has shipped a direct fix: Tool Search for the Model Context Protocol (MCP), a mechanism that uses BM25 progressive schema disclosure to feed agents only the tools they need, when they need them.

The results are striking. Anthropic evaluations show accuracy gains of 49% to 74% on Claude Opus 4 — not from a model upgrade, but from redesigning how tools are presented to the model. This tutorial breaks down the three core techniques behind that gain and gives you a concrete blueprint for applying them to your own agent harness.

Prerequisites: Familiarity with MCP server architecture, basic understanding of BM25 retrieval, and a working agent deployment using a frontier model (Claude, GPT-4-class, or equivalent).

The Problem: Why Context Bloat Degrades Agent Performance

Modern MCP servers can expose dozens — sometimes hundreds — of tools. The naive approach is to serialize every tool's name, description, and parameter schema into the system prompt or tool-call payload at the start of every turn.

This creates three compounding failure modes:

Attention dilution. Large language models allocate attention across the entire context. A 200-tool schema block forces the model to reason over thousands of tokens of irrelevant structure before it reaches the user's actual request.
Selection noise. When every tool is visible, the model must perform implicit retrieval inside its own forward pass — a task it was not trained to do reliably at scale.
Prompt budget exhaustion. Tool schemas consume tokens that could carry conversation history, retrieved documents, or chain-of-thought reasoning, all of which directly improve task accuracy.

Anthropic evaluations show accuracy gains of 49% to 74% on Claude Opus 4 — achieved not by changing the model, but by changing what the model sees.

The Hermes Agent architecture treats this as a retrieval problem, not a modeling problem. Here's how to replicate that approach.

Step 1 — Implement BM25 Tool Retrieval as a Pre-Filter

The first technique is replacing full schema injection with BM25-based tool retrieval. Before any tool schemas enter the context, a lightweight BM25 index ranks available tools against the current user query and returns only the top-k candidates.

Why BM25 (Not a Vector Search)?

BM25 is a sparse, term-frequency retrieval algorithm. For tool selection, it has three practical advantages over dense vector search:

Zero inference cost. BM25 runs on CPU with no embedding model required, adding negligible latency to your agent loop.
Exact-match reliability. Tool names and parameter names tend to be precise identifiers (create_calendar_event, query_crm_by_account_id). BM25 handles exact-match signals better than cosine similarity over averaged embeddings.
Interpretability. You can inspect which terms drove a tool's ranking — useful for debugging harness behavior in production.

Implementation Blueprint

Index construction (do this at server startup):

python from rank_bm25 import BM25Okapi import json

def build_tool_index(tools: list[dict]) -> tuple[BM25Okapi, list[dict]]: """ tools: list of MCP tool objects with 'name', 'description', 'inputSchema' """ corpus = [] for tool in tools: # Concatenate name + description + top-level parameter names param_names = list(tool.get("inputSchema", {}).get("properties", {}).keys()) doc = f"{tool['name']} {tool['description']} {' '.join(param_names)}" corpus.append(doc.lower().split())

Code

index = BM25Okapi(corpus)
return index, tools

Query-time retrieval (run before each agent turn):

python def retrieve_tools(query: str, index: BM25Okapi, tools: list[dict], top_k: int = 8) -> list[dict]: tokenized_query = query.lower().split() scores = index.get_scores(tokenized_query) ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True) return [tools[i] for i in ranked_indices[:top_k]]

Key configuration decision: Start with top_k=8. In Hermes Agent's architecture, surfacing 8–12 tools per turn preserves enough coverage for multi-step tasks while eliminating the attention dilution caused by full schema injection. Tune this against your specific MCP server's tool count and task distribution.

Step 2 — Apply Progressive Schema Disclosure

Retrieving the right tools is necessary but not sufficient. The second technique governs how much of each tool's schema is disclosed, and when.

Progressive schema disclosure means presenting tool schemas in layers of increasing detail, matched to the agent's current decision-making stage:

Stage	What the Agent Sees	Purpose
Tool selection	Name + one-line description only	Minimize tokens during routing decisions
Parameter planning	Full `inputSchema` for selected tool(s)	Provide complete spec before the agent constructs a call
Execution confirmation	Populated call + expected output schema	Enable pre-flight validation

Why This Works

A model deciding which tool to invoke doesn't need to know that create_calendar_event accepts an optional recurrence_rule parameter with an RFC 5545 string format. That detail is only relevant once the tool has been selected. Injecting it earlier adds tokens and introduces irrelevant schema structure into the selection decision.

Implementation Blueprint

Structure your tool objects with two disclosure levels:

python def render_tool_for_selection(tool: dict) -> str: """Minimal disclosure: name + description only.""" return f"- {tool['name']}: {tool['description']}"

def render_tool_for_execution(tool: dict) -> dict: """Full disclosure: complete MCP tool object.""" return tool # Pass the full schema to the model's tool-call payload

In your agent loop:

python

tool_menu = "\n".join(render_tool_for_selection(t) for t in retrieved_tools) system_prompt = f"{base_system_prompt}\n\nAvailable tools:\n{tool_menu}"

Stage 2: Once the model names a tool, inject the full schema

selected_tool_name = parse_tool_selection(model_response) selected_tool = next(t for t in retrieved_tools if t["name"] == selected_tool_name) full_schema_payload = render_tool_for_execution(selected_tool)

This two-stage pattern reduces the token footprint of the selection phase by 60–80% for tools with complex schemas, while preserving full spec fidelity at execution time.

Step 3 — Redesign Your Harness Around Retrieval-First Principles

The third technique is architectural: treating tool retrieval as a first-class component of the agent harness, not an afterthought.

Hermes Agent's accuracy gains demonstrate a broader principle that the AI agent community has been slow to internalize: model capability and harness design are co-equal determinants of agent performance. A 49–74% accuracy improvement without any model change is a strong empirical signal that most production agents are leaving significant performance on the table through poor harness design.

Three Harness Design Principles to Adopt Now

Principle 1: Decouple tool registration from tool injection. Your MCP server should maintain a full registry of all available tools. Your agent harness should maintain a separate retrieval layer that gates what enters the context. These are different concerns and should be separate code paths.

Principle 2: Make retrieval observable. Log which tools were retrieved for each turn, their BM25 scores, and whether the model ultimately called one of the retrieved tools. A retrieval miss (model attempts to call a tool that wasn't in the retrieved set) is a direct signal to tune your index or increase top_k.

Principle 3: Treat schema token cost as a first-order metric. Add schema token count to your agent observability dashboard alongside latency, cost, and task completion rate. In most deployments, schema tokens are the single largest controllable cost driver — and the one most directly correlated with attention dilution.

Connecting to the Broader MCP Ecosystem

The Model Context Protocol was designed to standardize how agents connect to tools and data sources. Hermes Agent's Tool Search extension sits cleanly within that standard — it doesn't require changes to how tools are defined, only to how they are surfaced. This means the pattern is portable: any MCP-compatible agent harness can adopt BM25 retrieval and progressive disclosure without modifying upstream tool servers.

The core insight from Nous Research's work: agent performance bottlenecks lie in software harness design rather than raw model capability alone.

This reframes where engineering effort should go. Before reaching for a more capable (and expensive) model, audit your harness. Measure your schema token footprint. Implement retrieval-gated disclosure. The Hermes Agent results suggest the performance ceiling for well-designed harnesses is substantially higher than most teams have explored.

Putting It Together: A Minimal Reference Architecture

Here's the full agent turn loop incorporating all three techniques:

Receive user message
BM25 retrieval → top-k tool candidates (names + descriptions only)
Stage-1 prompt: base system prompt + lightweight tool menu
Model response: tool selection (name only, or direct answer if no tool needed)
If tool selected: a. Inject full inputSchema for selected tool b. Model constructs tool call parameters c. Execute via MCP d. Return result to model for synthesis
Stream final response to user

This loop adds one BM25 retrieval step (sub-millisecond on CPU) and one additional model turn for complex tool calls. The tradeoff: substantially reduced schema token overhead and the 49–74% accuracy gains documented in Anthropic's evaluations of Hermes Agent on Claude Opus 4.

What to Measure After Deployment

Before declaring success, instrument these four metrics:

Retrieval recall: What percentage of agent turns result in the correct tool being in the retrieved set? Target >95%.
Schema token reduction: Compare average schema tokens per turn before and after. Expect 50–80% reduction for MCP servers with 20+ tools.
Task completion rate: End-to-end success on your benchmark suite. This is the number that should move toward the 49–74% improvement range.
Retrieval miss rate: Turns where the model attempted to call a tool outside the retrieved set. This drives your top_k tuning.

Looking for AI solutions for your business?

Discover how our AI services can help you stay ahead of the competition.

Contact Us

Continue Reading

Enterprise AI

Hermes Agent Fixes Context Bloat Through MCP Tool Search

What You'll Build — and Why It Matters

The Problem: Why Context Bloat Degrades Agent Performance

Step 1 — Implement BM25 Tool Retrieval as a Pre-Filter

Why BM25 (Not a Vector Search)?

Implementation Blueprint

Step 2 — Apply Progressive Schema Disclosure

Why This Works

Implementation Blueprint

Stage 1: Give the model a lightweight tool menu

Stage 2: Once the model names a tool, inject the full schema

Step 3 — Redesign Your Harness Around Retrieval-First Principles

Three Harness Design Principles to Adopt Now

Connecting to the Broader MCP Ecosystem

Putting It Together: A Minimal Reference Architecture

What to Measure After Deployment

Further Reading

Looking for AI solutions for your business?

Continue Reading

Unchecked AI Spending Is Now a Major Enterprise Security Risk

Anthropic Hits $965B: The New Benchmark for AI Finance

Dynamic Workflows: Scaling Enterprise AI Agent Platforms