Context bloat is crippling your AI agents. Discover how the Hermes Agent architecture uses BM25 retrieval and progressive schema disclosure to improve performance.
What You'll Build — and Why It Matters
If your AI agents are underperforming, the bottleneck is probably not the model. It's the harness around it.
Context bloat — the practice of dumping every available tool schema into an agent's context window at once — is quietly killing accuracy across production deployments. Nous Research's Hermes Agent has shipped a direct fix: Tool Search for the Model Context Protocol (MCP), a mechanism that uses BM25 progressive schema disclosure to feed agents only the tools they need, when they need them.
The results are striking. Anthropic evaluations show accuracy gains of 49% to 74% on Claude Opus 4 — not from a model upgrade, but from redesigning how tools are presented to the model. This tutorial breaks down the three core techniques behind that gain and gives you a concrete blueprint for applying them to your own agent harness.
Prerequisites: Familiarity with MCP server architecture, basic understanding of BM25 retrieval, and a working agent deployment using a frontier model (Claude, GPT-4-class, or equivalent).
The Problem: Why Context Bloat Degrades Agent Performance
Modern MCP servers can expose dozens — sometimes hundreds — of tools. The naive approach is to serialize every tool's name, description, and parameter schema into the system prompt or tool-call payload at the start of every turn.
This creates three compounding failure modes:
- Attention dilution. Large language models allocate attention across the entire context. A 200-tool schema block forces the model to reason over thousands of tokens of irrelevant structure before it reaches the user's actual request.
- Selection noise. When every tool is visible, the model must perform implicit retrieval inside its own forward pass — a task it was not trained to do reliably at scale.
- Prompt budget exhaustion. Tool schemas consume tokens that could carry conversation history, retrieved documents, or chain-of-thought reasoning, all of which directly improve task accuracy.
Anthropic evaluations show accuracy gains of 49% to 74% on Claude Opus 4 — achieved not by changing the model, but by changing what the model sees.
The Hermes Agent architecture treats this as a retrieval problem, not a modeling problem. Here's how to replicate that approach.
Step 1 — Implement BM25 Tool Retrieval as a Pre-Filter
The first technique is replacing full schema injection with BM25-based tool retrieval. Before any tool schemas enter the context, a lightweight BM25 index ranks available tools against the current user query and returns only the top-k candidates.
Why BM25 (Not a Vector Search)?
BM25 is a sparse, term-frequency retrieval algorithm. For tool selection, it has three practical advantages over dense vector search:
- Zero inference cost. BM25 runs on CPU with no embedding model required, adding negligible latency to your agent loop.
- Exact-match reliability. Tool names and parameter names tend to be precise identifiers (
create_calendar_event,query_crm_by_account_id). BM25 handles exact-match signals better than cosine similarity over averaged embeddings. - Interpretability. You can inspect which terms drove a tool's ranking — useful for debugging harness behavior in production.
Implementation Blueprint
Index construction (do this at server startup):
python from rank_bm25 import BM25Okapi import json
def build_tool_index(tools: list[dict]) -> tuple[BM25Okapi, list[dict]]: """ tools: list of MCP tool objects with 'name', 'description', 'inputSchema' """ corpus = [] for tool in tools: # Concatenate name + description + top-level parameter names param_names = list(tool.get("inputSchema", {}).get("properties", {}).keys()) doc = f"{tool['name']} {tool['description']} {' '.join(param_names)}" corpus.append(doc.lower().split())
index = BM25Okapi(corpus)
return index, tools
Query-time retrieval (run before each agent turn):
python def retrieve_tools(query: str, index: BM25Okapi, tools: list[dict], top_k: int = 8) -> list[dict]: tokenized_query = query.lower().split() scores = index.get_scores(tokenized_query) ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True) return [tools[i] for i in ranked_indices[:top_k]]
Key configuration decision: Start with top_k=8. In Hermes Agent's architecture, surfacing 8–12 tools per turn preserves enough coverage for multi-step tasks while eliminating the attention dilution caused by full schema injection. Tune this against your specific MCP server's tool count and task distribution.
Step 2 — Apply Progressive Schema Disclosure
Retrieving the right tools is necessary but not sufficient. The second technique governs how much of each tool's schema is disclosed, and when.
Progressive schema disclosure means presenting tool schemas in layers of increasing detail, matched to the agent's current decision-making stage:
| Stage | What the Agent Sees | Purpose |
|---|---|---|
| Tool selection | Name + one-line description only | Minimize tokens during routing decisions |
| Parameter planning | Full inputSchema for selected tool(s) | Provide complete spec before the agent constructs a call |
| Execution confirmation | Populated call + expected output schema | Enable pre-flight validation |
Why This Works
A model deciding which tool to invoke doesn't need to know that create_calendar_event accepts an optional recurrence_rule parameter with an RFC 5545 string format. That detail is only relevant once the tool has been selected. Injecting it earlier adds tokens and introduces irrelevant schema structure into the selection decision.
Implementation Blueprint
Structure your tool objects with two disclosure levels:
python def render_tool_for_selection(tool: dict) -> str: """Minimal disclosure: name + description only.""" return f"- {tool['name']}: {tool['description']}"
def render_tool_for_execution(tool: dict) -> dict: """Full disclosure: complete MCP tool object.""" return tool # Pass the full schema to the model's tool-call payload
In your agent loop:
python
Stage 1: Give the model a lightweight tool menu
tool_menu = "\n".join(render_tool_for_selection(t) for t in retrieved_tools) system_prompt = f"{base_system_prompt}\n\nAvailable tools:\n{tool_menu}"
Stage 2: Once the model names a tool, inject the full schema
selected_tool_name = parse_tool_selection(model_response) selected_tool = next(t for t in retrieved_tools if t["name"] == selected_tool_name) full_schema_payload = render_tool_for_execution(selected_tool)
This two-stage pattern reduces the token footprint of the selection phase by 60–80% for tools with complex schemas, while preserving full spec fidelity at execution time.
Step 3 — Redesign Your Harness Around Retrieval-First Principles
The third technique is architectural: treating tool retrieval as a first-class component of the agent harness, not an afterthought.
Hermes Agent's accuracy gains demonstrate a broader principle that the AI agent community has been slow to internalize: model capability and harness design are co-equal determinants of agent performance. A 49–74% accuracy improvement without any model change is a strong empirical signal that most production agents are leaving significant performance on the table through poor harness design.
Three Harness Design Principles to Adopt Now
Principle 1: Decouple tool registration from tool injection. Your MCP server should maintain a full registry of all available tools. Your agent harness should maintain a separate retrieval layer that gates what enters the context. These are different concerns and should be separate code paths.
Principle 2: Make retrieval observable.
Log which tools were retrieved for each turn, their BM25 scores, and whether the model ultimately called one of the retrieved tools. A retrieval miss (model attempts to call a tool that wasn't in the retrieved set) is a direct signal to tune your index or increase top_k.
Principle 3: Treat schema token cost as a first-order metric. Add schema token count to your agent observability dashboard alongside latency, cost, and task completion rate. In most deployments, schema tokens are the single largest controllable cost driver — and the one most directly correlated with attention dilution.
Connecting to the Broader MCP Ecosystem
The Model Context Protocol was designed to standardize how agents connect to tools and data sources. Hermes Agent's Tool Search extension sits cleanly within that standard — it doesn't require changes to how tools are defined, only to how they are surfaced. This means the pattern is portable: any MCP-compatible agent harness can adopt BM25 retrieval and progressive disclosure without modifying upstream tool servers.
The core insight from Nous Research's work: agent performance bottlenecks lie in software harness design rather than raw model capability alone.
This reframes where engineering effort should go. Before reaching for a more capable (and expensive) model, audit your harness. Measure your schema token footprint. Implement retrieval-gated disclosure. The Hermes Agent results suggest the performance ceiling for well-designed harnesses is substantially higher than most teams have explored.
Putting It Together: A Minimal Reference Architecture
Here's the full agent turn loop incorporating all three techniques:
- Receive user message
- BM25 retrieval → top-k tool candidates (names + descriptions only)
- Stage-1 prompt: base system prompt + lightweight tool menu
- Model response: tool selection (name only, or direct answer if no tool needed)
- If tool selected: a. Inject full inputSchema for selected tool b. Model constructs tool call parameters c. Execute via MCP d. Return result to model for synthesis
- Stream final response to user
This loop adds one BM25 retrieval step (sub-millisecond on CPU) and one additional model turn for complex tool calls. The tradeoff: substantially reduced schema token overhead and the 49–74% accuracy gains documented in Anthropic's evaluations of Hermes Agent on Claude Opus 4.
What to Measure After Deployment
Before declaring success, instrument these four metrics:
- Retrieval recall: What percentage of agent turns result in the correct tool being in the retrieved set? Target >95%.
- Schema token reduction: Compare average schema tokens per turn before and after. Expect 50–80% reduction for MCP servers with 20+ tools.
- Task completion rate: End-to-end success on your benchmark suite. This is the number that should move toward the 49–74% improvement range.
- Retrieval miss rate: Turns where the model attempted to call a tool outside the retrieved set. This drives your
top_ktuning.
Further Reading
- Hermes Agent Ships Tool Search for MCP — MarkTechPost
- Model Context Protocol Specification — Anthropic
- BM25 Algorithm Overview — rank_bm25 Python Library
- Nous Research — Hermes Model Family
Last reviewed: May 30, 2026



