Google's Open Knowledge Format (OKF) is transforming how AI agents access data. Learn how this structured Markdown standard improves agentic workflows.
What Is the Open Knowledge Format — and Why Does It Matter for Agents?
Google Cloud's Open Knowledge Format (OKF) is a new open standard that converts scattered organizational documentation into structured Markdown files with YAML frontmatter headers. The goal: give AI agents a consistent, portable, and machine-readable knowledge layer they can actually rely on during complex, multi-step tasks.
If you've been building on any AI agent workflow automation platform — whether that's Vertex AI, LangChain, CrewAI, or a custom stack — you've almost certainly hit the same wall: agents fail not because the model is wrong, but because the knowledge they're querying is fragmented, inconsistently formatted, or locked in proprietary systems. OKF is Google Cloud's answer to that infrastructure gap.
The format formalizes a pattern that AI researcher Andrej Karpathy popularized under the name "LLM Wiki" — the idea that organizations should maintain a living, structured wiki written specifically for language models to read, not just humans. OKF takes that concept and gives it a production-grade specification, making it enterprise-deployable at scale.
This tutorial walks through three concrete ways OKF changes how you structure data for agentic workflows, with practical guidance on adopting each pattern.
Prerequisites
Before diving in, you should be comfortable with:
- Basic YAML syntax (key-value pairs, lists, nested objects)
- Writing documentation in Markdown
- Foundational understanding of how AI agents retrieve and use context (RAG pipelines, tool-calling, memory systems)
- Familiarity with at least one agent orchestration framework (LangChain, LlamaIndex, Vertex AI Agent Builder, etc.)
You don't need to be a Google Cloud expert — OKF is an open format, not a proprietary lock-in.
Way 1: Replace Unstructured Docs With YAML-Fronted Markdown Files
The Problem With Raw Documentation
Most enterprise knowledge lives in a graveyard of formats: PDFs, Confluence pages, SharePoint wikis, Notion exports, internal Slack threads, and Google Docs with inconsistent heading structures. When an agent tries to retrieve relevant context from this soup, it's working against itself — chunking strategies break on inconsistent structure, metadata is absent, and there's no signal to help the model understand what kind of document it's reading.
How OKF Fixes This
OKF mandates that every knowledge artifact be stored as a .md file with a YAML frontmatter block at the top. This frontmatter carries structured metadata that agents can parse before they even read the body content.
Here's what a minimal OKF-compliant file looks like:
```yaml
---
title: "Customer Refund Policy — Enterprise Tier"
type: policy
owner: customer-success
last_updated: 2026-05-01
tags: [refunds, enterprise, billing]
audience: [support-agents, ai-agents]
---
```
Followed by the document body in standard Markdown.
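To make the "parse before reading" idea concrete, here is a minimal Python sketch that separates the frontmatter from the body. It assumes the frontmatter is delimited by `---` lines and contains only flat `key: value` pairs and inline lists, which is all the example above uses. The helper names `parse_scalar` and `split_frontmatter` are hypothetical, and a real pipeline would more likely rely on a full YAML parser.

```python
from datetime import date


def parse_scalar(text):
    """Convert one frontmatter value to a list, date, or string (simplified)."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        # Inline list such as [refunds, enterprise, billing]
        inner = text[1:-1].strip()
        return [item.strip().strip("\"'") for item in inner.split(",")] if inner else []
    try:
        return date.fromisoformat(text)  # e.g. 2026-05-01
    except ValueError:
        return text.strip("\"'")


def split_frontmatter(document):
    """Return (metadata, body) for a document that opens with a '---' block."""
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document  # no frontmatter: treat everything as body
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter block is never closed")
    metadata = {}
    for line in lines[1:end]:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and YAML comments
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        metadata[key.strip()] = parse_scalar(value)
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body


# Illustrative document, modeled on the example above.
SAMPLE = """---
title: "Customer Refund Policy - Enterprise Tier"
type: policy
owner: customer-success
last_updated: 2026-05-01
tags: [refunds, enterprise, billing]
audience: [support-agents, ai-agents]
---
# Refund policy

Enterprise tier refund window: 30 calendar days from purchase date.
"""

metadata, body = split_frontmatter(SAMPLE)
print(metadata["type"], metadata["tags"], metadata["last_updated"])
```

The point of the split is that an agent or retrieval layer can inspect `metadata` and decide whether `body` is worth loading at all.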
What This Unlocks for Agents
With YAML frontmatter in place, your retrieval layer gains pre-retrieval filtering. Instead of embedding every document and hoping cosine similarity surfaces the right one, an agent can first query metadata:
- "Find all documents of type `policy` tagged `billing` updated after 2026-01-01"
- "Retrieve only documents with `audience: ai-agents`"
This reduces the hallucination risk that comes from retrieving tangentially related documents. The agent gets fewer, more relevant chunks and spends less of its context window on noise.
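The two example queries can be sketched as a plain metadata filter that runs before any embedding lookup. This is an illustration only: the in-memory `DOCS` list, its file paths, and the `prefilter` function are hypothetical stand-ins for whatever metadata store your retrieval layer actually uses.

```python
from datetime import date

# Hypothetical in-memory index: one dict of frontmatter fields per document.
DOCS = [
    {"path": "policies/refunds-enterprise.md", "type": "policy",
     "tags": ["refunds", "enterprise", "billing"],
     "audience": ["support-agents", "ai-agents"],
     "last_updated": date(2026, 5, 1)},
    {"path": "policies/refunds-legacy.md", "type": "policy",
     "tags": ["refunds", "billing"],
     "audience": ["support-agents"],
     "last_updated": date(2025, 3, 12)},
    {"path": "runbooks/billing-outage.md", "type": "runbook",
     "tags": ["billing", "incident"],
     "audience": ["ai-agents"],
     "last_updated": date(2026, 2, 20)},
]


def prefilter(docs, doc_type=None, tag=None, audience=None, updated_after=None):
    """Keep only documents whose metadata satisfies every condition given."""
    selected = []
    for doc in docs:
        if doc_type is not None and doc.get("type") != doc_type:
            continue
        if tag is not None and tag not in doc.get("tags", []):
            continue
        if audience is not None and audience not in doc.get("audience", []):
            continue
        if updated_after is not None:
            updated = doc.get("last_updated")
            if updated is None or updated <= updated_after:
                continue  # undated documents are excluded from date queries
        selected.append(doc)
    return selected


# "Find all documents of type policy tagged billing updated after 2026-01-01"
recent_billing_policies = prefilter(
    DOCS, doc_type="policy", tag="billing", updated_after=date(2026, 1, 1)
)

# "Retrieve only documents with audience: ai-agents"
for_agents = prefilter(DOCS, audience="ai-agents")

print([doc["path"] for doc in recent_billing_policies])
print([doc["path"] for doc in for_agents])
```

Only the documents that survive this filter need to go through semantic search, which is where the context-window savings come from.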
Practical step: Audit your top 20 most-queried internal documents. Convert each to a .md file and add a YAML frontmatter block with at minimum: title, type, owner, last_updated, and tags. This alone should improve retrieval precision in your agent stack; compare results against a baseline from before the conversion to confirm it.
Way 2: Build an LLM Wiki — A Knowledge Layer Written for Machines
The Karpathy Insight
The conceptual backbone of OKF traces directly to Andrej Karpathy's "LLM Wiki" idea. The core observation is deceptively simple: humans write documentation for humans, and then we're surprised when language models struggle to use it reliably.
Human docs are written with assumed context, implicit jargon, pronoun-heavy prose, and structural inconsistencies that human readers resolve effortlessly. LLMs don't have that ambient context — they need documentation that's explicit, self-contained, and consistently structured.
Google Cloud's OKF formalizes this as an enterprise pattern: maintain a parallel (or primary) knowledge base written with LLM consumption as a first-class concern.
What an LLM Wiki Entry Looks Like
An OKF-compliant LLM Wiki entry follows specific writing conventions beyond just the frontmatter:
- No assumed context — every term is defined or linked on first use
- Declarative statements over narrative prose — "The refund window is 30 days for Enterprise tier" rather than "As we mentioned in our onboarding docs, Enterprise customers enjoy an extended window"
- Explicit scope boundaries — the document states clearly what it covers and what it doesn't
- Machine-readable lists over prose enumerations — use Markdown bullet lists and tables, not "first... second... finally..."
Here's a contrast:
Human-optimized (before):
"Our enterprise customers have historically enjoyed a more flexible approach to refunds, and the team typically works things out on a case-by-case basis, though the standard window we advertise is about a month."
LLM Wiki / OKF-optimized (after):
"Enterprise tier refund window: 30 calendar days from purchase date. Exceptions require VP Customer Success approval. Standard tier: 14 days, no exceptions."
Implementation Steps
- Identify your agent's top failure modes — where does it give wrong or hedged answers? Those topics are your first LLM Wiki targets.
- Rewrite those documents using the OKF conventions above — declarative, self-contained, list-heavy.
- Tag them with `audience: ai-agents` in the YAML frontmatter so your retrieval layer can prioritize them.
- Version-control the wiki in Git. OKF's Markdown-based format is natively diff-able, which makes it easy to track changes and audit what your agent knew at any point in time.
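For the rewriting step, a rough lint can flag prose that still reads as human-optimized. The sketch below is a heuristic and not part of OKF: the phrase list in `VAGUE_PATTERNS` and the function `find_vague_phrases` are hypothetical, and the list should be tuned to the hedges that actually appear in your own documents.

```python
import re

# Hypothetical phrase list; extend it with the hedges common in your docs.
VAGUE_PATTERNS = [
    r"\btypically\b",
    r"\bhistorically\b",
    r"\busually\b",
    r"\babout\b",
    r"\bcase-by-case\b",
    r"\bas (?:we )?mentioned\b",
    r"\bmore flexible\b",
]


def find_vague_phrases(text):
    """Return the hedging phrases found in text, in order of appearance."""
    hits = []
    for pattern in VAGUE_PATTERNS:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0).lower()))
    return [phrase for _, phrase in sorted(hits)]


BEFORE = (
    "Our enterprise customers have historically enjoyed a more flexible "
    "approach to refunds, and the team typically works things out on a "
    "case-by-case basis, though the standard window we advertise is about a month."
)
AFTER = (
    "Enterprise tier refund window: 30 calendar days from purchase date. "
    "Exceptions require VP Customer Success approval. "
    "Standard tier: 14 days, no exceptions."
)

print(find_vague_phrases(BEFORE))
print(find_vague_phrases(AFTER))
```

Run against the two refund examples from this section, the first version trips several patterns and the rewritten version trips none. A check like this catches wording only; whether a document is self-contained and correctly scoped still needs a human review.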
Key insight from Google Cloud's OKF announcement: The format is explicitly designed to be portable and system-agnostic, so your LLM Wiki isn't locked to Vertex AI. The same `.md` files can feed LangChain, LlamaIndex, or any RAG pipeline that can read from a file system or object store. (Source: The Decoder)
Way 3: Standardize Agent-Facing Metadata to Enable Cross-Agent Portability
The Multi-Agent Coordination Problem
As organizations mature their AI agent deployments, they inevitably move from single-agent systems to multi-agent architectures — orchestrator agents that delegate to specialist sub-agents, each with its own knowledge scope. This is where data standardization becomes a hard infrastructure problem, not just a nice-to-have.
Without a common format, Agent A's knowledge base is opaque to Agent B. Metadata schemas differ. Chunking strategies conflict. An orchestrator can't reliably route a query to the right sub-agent because it can't inspect what each agent actually knows.
OKF as a Shared Contract
OKF's YAML frontmatter schema acts as a shared contract across agents. When every knowledge artifact carries consistent metadata fields, orchestration logic becomes tractable:
```yaml
---
title: "Q2 2026 Pricing — APAC Region"
type: pricing
scope: regional
region: apac
valid_from: 2026-04-01
valid_until: 2026-06-30
confidentiality: internal
agent_permissions: [sales-agent, pricing-agent]
---
```
Now an orchestrator agent can make routing decisions based on structured metadata rather than semantic guesswork:
- Route pricing queries to agents with `type: pricing` documents in scope
- Enforce `confidentiality` and `agent_permissions` fields as access controls
- Use `valid_from` / `valid_until` to automatically surface or suppress time-sensitive knowledge
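These routing rules can be sketched in a few lines of Python. The functions `is_usable` and `route` are hypothetical, and the sketch makes two assumptions the article does not state: a document with no `agent_permissions` field is denied to everyone, and the validity window is inclusive on both ends. The `confidentiality` field is left uninterpreted here because its semantics would depend on your own schema.

```python
from datetime import date

# Frontmatter of the pricing document above, as already-parsed fields.
PRICING_DOC = {
    "title": "Q2 2026 Pricing - APAC Region",
    "type": "pricing",
    "scope": "regional",
    "region": "apac",
    "valid_from": date(2026, 4, 1),
    "valid_until": date(2026, 6, 30),
    "confidentiality": "internal",
    "agent_permissions": ["sales-agent", "pricing-agent"],
}


def is_usable(doc, agent, today):
    """True if agent may read doc and today falls inside its validity window."""
    # Assumption: deny by default when agent_permissions is absent.
    if agent not in doc.get("agent_permissions", []):
        return False
    start = doc.get("valid_from")
    end = doc.get("valid_until")
    if start is not None and today < start:
        return False  # not yet valid
    if end is not None and today > end:
        return False  # expired
    return True


def route(query_type, docs, agents, today):
    """Return the agents holding at least one usable document of query_type."""
    return [
        agent
        for agent in agents
        if any(
            doc.get("type") == query_type and is_usable(doc, agent, today)
            for doc in docs
        )
    ]


AGENTS = ["sales-agent", "support-agent", "pricing-agent"]
in_window = route("pricing", [PRICING_DOC], AGENTS, date(2026, 5, 15))
after_expiry = route("pricing", [PRICING_DOC], AGENTS, date(2026, 7, 1))
print(in_window, after_expiry)
```

Because the decision uses only structured fields, the orchestrator never has to load or embed the document body to decide where a query goes.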
Practical Adoption Path
Step 1 — Define your organization's OKF schema extension. OKF provides a base schema; you extend it with fields specific to your domain. Decide upfront which fields are mandatory vs. optional for your use case.
Step 2 — Build a validation step into your knowledge ingestion pipeline. Before any document enters your agent's knowledge base, run a YAML frontmatter validator. Reject documents that don't conform. This enforces quality at the source.
Step 3 — Expose metadata as a queryable index. Store frontmatter fields in a lightweight metadata store (a simple SQLite table, a vector DB's metadata filters, or a dedicated document store like Elasticsearch). Agents query metadata first, then retrieve full document content only when needed.
Step 4 — Establish a governance process for schema changes. Because multiple agents depend on the same schema, unilateral changes break things. Treat your OKF schema extension like an API — version it, communicate changes, and maintain backward compatibility.
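Steps 2 and 3 can be sketched together using only Python's standard library. This is one possible shape and not a reference implementation: `REQUIRED_FIELDS` reuses the minimum field set suggested earlier in this tutorial, the table layout and the function names `validate` and `build_index` are hypothetical, and dates are stored as ISO 8601 strings so that plain string comparison orders them correctly.

```python
import sqlite3

# Assumed mandatory fields; your own schema extension may require more.
REQUIRED_FIELDS = ("title", "type", "owner", "last_updated", "tags")


def validate(metadata):
    """Return a list of problems; an empty list means the document passes."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in metadata
    ]
    if "tags" in metadata and not isinstance(metadata["tags"], list):
        problems.append("tags must be a list")
    return problems


def build_index(documents):
    """Load valid documents into an in-memory SQLite index.

    Returns (connection, rejected), where rejected maps path -> problems.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs (path TEXT PRIMARY KEY, title TEXT, type TEXT, "
        "owner TEXT, last_updated TEXT)"
    )
    conn.execute("CREATE TABLE doc_tags (path TEXT, tag TEXT)")
    rejected = {}
    for path, metadata in documents.items():
        problems = validate(metadata)
        if problems:
            rejected[path] = problems  # reject at the source, do not index
            continue
        conn.execute(
            "INSERT INTO docs VALUES (?, ?, ?, ?, ?)",
            (path, metadata["title"], metadata["type"],
             metadata["owner"], metadata["last_updated"]),
        )
        conn.executemany(
            "INSERT INTO doc_tags VALUES (?, ?)",
            [(path, tag) for tag in metadata["tags"]],
        )
    conn.commit()
    return conn, rejected


# Illustrative input: parsed frontmatter keyed by file path.
DOCUMENTS = {
    "policies/refunds-enterprise.md": {
        "title": "Customer Refund Policy - Enterprise Tier",
        "type": "policy",
        "owner": "customer-success",
        "last_updated": "2026-05-01",
        "tags": ["refunds", "enterprise", "billing"],
    },
    "notes/untitled.md": {"type": "policy", "tags": "billing"},
}

conn, rejected = build_index(DOCUMENTS)

# Metadata-first query: find candidate paths before loading any document body.
rows = conn.execute(
    "SELECT d.path FROM docs d JOIN doc_tags t ON t.path = d.path "
    "WHERE d.type = ? AND t.tag = ? AND d.last_updated > ? ORDER BY d.path",
    ("policy", "billing", "2026-01-01"),
).fetchall()
print(rows, rejected)
```

In a CI job, a non-empty `rejected` mapping would fail the build, which is what keeps non-conforming documents out of the knowledge base.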
Putting It Together: A Minimal OKF-Ready Agent Stack
Here's what a minimal production-ready setup looks like after applying all three patterns:
| Layer | Implementation |
|---|---|
| Knowledge storage | Git repository of .md files with YAML frontmatter |
| Metadata index | Vector DB (e.g., Weaviate, Pinecone) with frontmatter fields as filterable properties |
| Ingestion pipeline | CI/CD job that validates OKF schema, chunks Markdown body, embeds, and upserts to vector DB |
| Retrieval strategy | Pre-filter by metadata → semantic search on filtered subset → return top-k chunks |
| Agent framework | Any (LangChain, LlamaIndex, Vertex AI Agent Builder); OKF is framework-agnostic |
The beauty of this architecture is its simplicity. You're not introducing exotic infrastructure — you're imposing structure on what you already have.
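The retrieval row of the table (pre-filter, then semantic search, then top-k) can be sketched end to end in plain Python. To stay self-contained, the `embed` function below is a bag-of-words stand-in and not a real embedding model; in practice that step would call your embedding service and the ranking would happen inside the vector DB. The names `embed`, `cosine`, `retrieve`, and the sample `CHUNKS` are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in for a real embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, filters, k=2):
    """Pre-filter chunks by exact metadata match, then rank by similarity."""
    candidates = [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in filters.items())
    ]
    query_vec = embed(query)
    ranked = sorted(
        candidates,
        key=lambda chunk: cosine(query_vec, embed(chunk["text"])),
        reverse=True,
    )
    return ranked[:k]


# Illustrative chunks; each carries the frontmatter of its source document.
CHUNKS = [
    {"text": "Enterprise tier refund window: 30 calendar days from purchase date.",
     "metadata": {"type": "policy", "owner": "customer-success"}},
    {"text": "Standard tier refund window: 14 days, no exceptions.",
     "metadata": {"type": "policy", "owner": "customer-success"}},
    {"text": "Refund requests spike after billing outages; page the on-call lead.",
     "metadata": {"type": "runbook", "owner": "sre"}},
]

top = retrieve("What is the enterprise refund window?", CHUNKS, {"type": "policy"}, k=1)
print(top[0]["text"])
```

The runbook chunk mentions refunds too, but it never reaches the ranking stage because the metadata filter removes it first.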
What to Watch Next
OKF is early-stage, and the ecosystem around it is still forming. A few developments worth tracking:
- Tooling for automated OKF migration — converting legacy Confluence/Notion exports to OKF-compliant Markdown at scale is still largely a manual or custom-scripted process. Expect this gap to close quickly.
- OKF schema registries — as adoption grows, community-maintained schema registries for common document types (policies, runbooks, pricing sheets) will reduce the overhead of defining extensions from scratch.
- Integration with agent memory systems — OKF currently addresses external knowledge. How it interacts with agent working memory and episodic memory systems is an open design question.
For teams building on any AI agent workflow automation platform today, OKF represents a pragmatic, low-overhead path to more reliable agents. The format is simple enough to adopt incrementally — start with your highest-value documents, validate the retrieval improvement, and expand from there.
Source: Google Cloud's Open Knowledge Format Turns Scattered Docs Into Markdown Files for AI Agents — The Decoder
Last reviewed: June 15, 2026



