PDF vs .kcp — Why Agentic AI Needs a New Knowledge Format
A seminal essay on why PDFs fail LLMs and what comes next: KCP, an AI-native format for executable, verifiable, agent-ready knowledge.
"The document is not the knowledge."
For three decades, the document has been the atomic unit of recorded thought. We wrote books, then PDFs, then HTML, then Markdown — and we trusted the rectangle of text to carry meaning across time, software, and minds. That trust was always implicit. The reader was assumed to be human, patient, and capable of reconstructing structure from prose.
In 2026, the reader is no longer human. It is a model: stochastic, tireless, expensive per token, and increasingly autonomous. When that reader looks at a PDF, it does not see a chapter. It sees a flat stream of characters interrupted by page numbers, ligatures, table fragments, and decorative noise. The format is a museum piece pretending to be infrastructure.
This essay is an argument that the Knowledge Context Protocol (KCP) — and the file format .kcp that implements it — is to AI knowledge what HTTP was to documents and what the Model Context Protocol (MCP) is to tools. It is the missing substrate for agentic systems: a portable, verifiable, executable representation of what an agent knows. If you only read one section, read the comparison: PDF vs .kcp — 45 dimensions, side by side.
Why this essay exists now
In late 2024, Anthropic published the Model Context Protocol, a small open specification for how language models call tools and discover external capabilities. Within twelve months, MCP went from a research artifact to a default integration surface adopted by OpenAI, Google DeepMind, and most major agent frameworks. The pattern is now obvious in hindsight: the moment agents needed to act, the community converged on a common verb grammar.
Acting is half of cognition. The other half is knowing. And here, the field has no consensus and no protocol. Every vendor invents its own retrieval pipeline, its own chunking heuristic, its own embedding store, its own metadata schema. The result is what Karpathy has called the "context engineering" problem: most production failures of LLM systems are not reasoning failures, they are context failures. The model is asked to think with the wrong material.
The substrate of that material — the file an agent ingests when it loads "the company handbook," "the regulation," "the textbook," "the case law" — is still, overwhelmingly, the PDF. A format finalized by Adobe in 1993 to make printed pages portable across operating systems. A format whose original specification dedicates more pages to font embedding and color management than to semantics, because semantics were never the point. The point was to preserve the picture of a page.
We have spent two years asking trillion-parameter models to read pictures of pages.
A short history of the document
Every recording medium is shaped by its reader.
- The codex (~1st century) optimized for the human eye scanning bound pages — random access within a finite volume.
- The printing press (~1450) optimized for mechanical reproduction at scale, fixing layout as a feature rather than a constraint.
- HTML (1991) optimized for hyperlinked browsing with separable structure and style.
- The PDF (1993) optimized for display fidelity across devices: it freezes a page so it looks the same on any screen or printer.
- Markdown (2004) optimized for writers who wanted plaintext that survived rendering.
- RAG chunks (2020–) optimized for vector similarity search over arbitrary text, accepting structural amnesia as the cost.
Each format made sense for its reader. None of them was designed for an autonomous reasoner that ingests thousands of documents per session, must cite every claim, and pays per token in latency, money, and hallucination risk.
A new reader has appeared. The format has not caught up.
The four hidden taxes of feeding PDFs to LLMs
Every PDF that enters an agent's context window incurs four costs that compound across a workflow. They are not marketing claims; they are operational facts that anyone who has shipped a RAG system in production has measured.
1. Structural noise. A typical scientific PDF dedicates 20–35% of its tokens to layout artifacts: running headers, footers, page numbers, figure captions repeated in two columns, table cells exploded into ungrammatical fragments by the extractor. Foundational work on layout-aware parsing — LayoutLM, Nougat, Marker — exists precisely because raw PDF text is hostile to language models. Even the best parsers leave residue.
2. Semantic dilution. A single fact ("the deductible is $2,500") is often distributed across a definition, an example, a footnote, and a summary. Vector retrieval returns the chunk with the highest cosine similarity to the query, not the chunk with the highest evidential value. Studies of long-context retrieval such as Lost in the Middle show that even when the right span is in context, models often ignore it.
3. Missing relations. Documents encode entities and propositions, but the relations between them — causal, procedural, conditional, temporal — live in the reader's head. An LLM asked "what happens if the policy lapses?" must reconstruct a graph that the document never wrote down. This is the error class that knowledge-graph–augmented retrieval (e.g. GraphRAG) was invented to address, by re-deriving the missing structure from the source at significant cost.
4. Retrieval collapse. As corpora grow, embedding-only retrieval degrades. The well-known BEIR benchmark showed that dense retrievers underperform classical BM25 on out-of-distribution domains; hybrid sparse/dense pipelines are now standard not because they are elegant but because each modality patches the other's blind spots. The PDF, by offering neither structure nor metadata, forces every downstream system to re-invent the index.
The compounding effect is real. In internal benchmarks of agents reading regulatory documents, replacing raw PDF extraction with structured packages produces 53–80% fewer tool calls per task, comparable or better answer accuracy, and 5–20× lower context cost depending on document length. The methodology and caveats are described below.
What knowledge looks like to an agent
Before defining .kcp, it is worth defining the target.
For an autonomous reasoner, knowledge is not prose. It is executable context: a representation that supports look-up by intent, traversal by relation, citation by provenance, and selective loading by budget. It has at least the following properties.
- Typed claims. Every assertion is tagged with what kind of statement it is — a definition, a procedure, a constraint, a quantity, an exception.
- Named entities. Concepts and objects have stable identifiers, not just surface strings, so two passages that talk about "the policyholder" can be aligned.
- Explicit relations. "X requires Y," "A precedes B," "C is a special case of D" are first-class edges, not paragraphs the model has to infer.
- Provenance. Every claim points back to its source span, with a hash, an offset, and a timestamp, so any answer can be audited.
- Confidence and scope. Claims carry the conditions under which they hold and the model or human that produced them.
- Operational hooks. Where appropriate, claims expose machine-callable actions — a calculator, a validator, a query — so the agent can execute rather than re-derive.
A paragraph of prose has none of these. A .kcp package has all of them.
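As a preview of the format introduced in the next section, here is what a single claim with all six properties might look like on disk. The field names are illustrative, not the normative schema; the sketch reuses the deductible example from above.

```yaml
# Hypothetical .kcp chunk. Field names are illustrative, not normative.
id: claim:deductible-amount-001          # stable, addressable identifier
type: quantity                           # definition | procedure | constraint | quantity | exception
statement: "The deductible is $2,500."
entities:
  - entity:policyholder                  # stable ids, not surface strings
  - entity:annual-deductible
relations:
  - { edge: defined_by, target: "claim:deductible-definition-004" }
  - { edge: has_exception, target: "claim:deductible-waiver-017" }
provenance:
  source: policy-handbook.pdf
  span: { offset: 48210, length: 29 }    # span pointer back to the source
  hash: "sha256:9f3a..."                 # content hash of the cited span (truncated)
  extracted_at: 2026-01-15T09:30:00Z
confidence: 0.97                         # conditions and producer travel with the claim
scope: "Individual plans issued on or after 2025-01-01."
produced_by: kcp-compiler-0.4
```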
Introducing .kcp
The .kcp file is an open, AI-native package format. In one sentence: it is a typed, addressable, signed knowledge graph compiled from a source document, plus a manifest and a retrieval index, distributed as a single deterministic artifact.
A .kcp package contains:
- a manifest with identity, version, source provenance, license, and compatibility flags;
- a set of modular chunks in YAML, each one a typed claim with a stable id;
- a knowledge graph of typed edges between chunks (defines, requires, exemplifies, contradicts, supersedes);
- a retrieval index — embeddings plus keyword and structural indices, precomputed so the consumer does not pay the cost;
- a provenance ledger — for every claim, a span pointer back to the original source plus a content hash;
- optional operational bindings — links to MCP tools or executable functions a reader may call to verify or extend a claim.
The format is described in detail in the Protocol specification and the Anatomy of a .kcp package. A reference compiler is at /compiler.
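As a rough picture of how those pieces hang together, a hypothetical manifest follows. The file layout and field names are illustrative; the specification is the normative source.

```yaml
# Hypothetical manifest for a .kcp package. Layout and field names are
# illustrative; the Protocol specification is normative.
kcp_version: "1.0"                 # compatibility flag
id: pkg:policy-handbook-2026
version: 2.3.0
source:
  uri: https://example.com/policy-handbook-2026.pdf
  hash: "sha256:1c4e..."           # hash of the document this was compiled from
license:
  spdx: CC-BY-4.0                  # machine-readable terms (placeholder)
  machine_reading: permitted
contents:
  chunks: chunks/*.yaml            # typed claims, one stable id each
  graph: graph.yaml                # typed edges: defines, requires, exemplifies, ...
  index: index/                    # precomputed embedding + keyword + structural indices
  ledger: ledger.yaml              # span pointer + content hash per claim
  bindings: bindings.yaml          # optional links to MCP tools
```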
The crucial property is that .kcp is not a replacement for the source document. The PDF, the textbook, the regulation — they remain. The .kcp is the compiled form of their knowledge, the way an .o is the compiled form of a .c file. You distribute the source for humans and the compiled artifact for machines.
PDF vs .kcp — the comparison, condensed
The full 45-dimension comparison lives at /pdf-vs-kcp. A condensed view:
| Dimension | PDF | .kcp |
|---|---|---|
| Primary reader | Human eye | Autonomous reasoner |
| Structure | Linear narrative + visual layout | Typed claims + graph |
| Retrieval | Re-derived per query (chunk + embed) | Precomputed, addressable |
| Provenance | Implicit (page numbers) | Explicit (span hash + ledger) |
| Updates | Reissue the file | Diff at the claim level |
| Token cost | Pays for layout and prose | Pays for evidence only |
Across purpose, organization, retrieval, agents, reasoning, and scale, the pattern is consistent: the PDF is optimized for a reader that no longer dominates the workload, and the .kcp is optimized for the one that does. If you want to argue with the comparison, argue with the forty-five rows, not the six.
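One row deserves unpacking. "Diff at the claim level" means that an update ships only the claims that changed, rather than reissuing the whole file. A hypothetical update record, with illustrative field names, might look like this:

```yaml
# Hypothetical claim-level update record. Field names are illustrative.
package: pkg:policy-handbook-2026
from_version: 2.3.0
to_version: 2.4.0
changes:
  - op: supersede                  # the old claim remains in history, linked by an edge
    claim: claim:deductible-amount-001
    replacement:
      statement: "The deductible is $3,000."
      provenance:
        span: { offset: 48210, length: 29 }
        hash: "sha256:ab12..."
  - op: add
    claim: claim:deductible-waiver-018
```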
Empirical signal
The headline number — 53–80% fewer tool calls, with equivalent or better answer quality — comes from controlled comparisons run inside the KCP Lab on three task families: regulatory question answering, technical-manual support, and case-law retrieval. The protocol holds the model, the question set, the judge, and the budget constant, and varies only the substrate: raw PDF text vs. compiled .kcp package.
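For concreteness, a single run under that protocol can be captured in a configuration of roughly this shape. The sketch is hypothetical and does not describe the Lab's actual harness.

```yaml
# Hypothetical run configuration. Illustrates the controlled design,
# not the KCP Lab's actual harness.
task_family: regulatory-qa
held_constant:
  generator_model: model-under-test   # identical on both arms
  question_set: regulatory-qa-v1      # identical questions
  judge_model: held-out-judge         # distinct from the generator
  retrieval_budget: 5                 # retriever calls allowed per question
varied:
  substrate: [raw-pdf-text, compiled-kcp-package]
metrics: [tool_calls_per_task, answer_accuracy, context_tokens]
```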
Important caveats:
- The numbers are per task family, not universal. Highly narrative source material (a novel) compresses less than highly structured source material (a tax code). KCP gains the most where structure was always implicit.
- The comparisons use the same retrieval budget on both sides. A PDF pipeline that is allowed to call its retriever five times will close part of the gap; one that is allowed to call it twenty times can close more, at proportional cost.
- The judge model is held out from the generator model to reduce evaluator bias, in line with the practice recommended by Zheng et al., "Judging LLM-as-a-Judge".
The point of publishing the methodology is not to win a benchmark; it is to make the comparison falsifiable. We expect — and welcome — independent replications that disagree.
MCP and KCP are complementary, not competing
A common first reaction is to ask whether .kcp overlaps with MCP. It does not. The two protocols address orthogonal halves of agent design.
```
            ┌───────────────────────────┐
            │           AGENT           │
            └─────────────┬─────────────┘
                          │
          ┌───────────────┴───────────────┐
          │                               │
     ┌────▼────┐                     ┌────▼────┐
     │   MCP   │                     │   KCP   │
     │ ACTION  │                     │COGNITION│
     │  layer  │                     │  layer  │
     └────┬────┘                     └────┬────┘
          │                               │
    tools, APIs,                 knowledge packages,
    side effects                 typed claims, graph
```
MCP standardizes the verbs an agent can invoke: fetch this URL, run this query, send this email. KCP standardizes the nouns an agent reasons over: this regulation, this textbook chapter, this case file, this product spec. An MCP-only agent can act on the world but must re-derive what it knows on every prompt. A KCP-only agent has perfect memory but no hands. A modern agentic system needs both, exactly the way a CPU needs both an ALU and a memory hierarchy.
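The optional operational bindings mentioned earlier are the seam where the two layers meet: a KCP claim can point at an MCP tool the agent may call to verify the claim rather than re-derive it. A hypothetical sketch, with illustrative field names:

```yaml
# Hypothetical operational binding. Field names are illustrative.
claim: claim:deductible-amount-001
binding:
  kind: mcp-tool                  # a KCP noun exposing an MCP verb
  server: benefits-calculator     # an MCP server the agent may already have
  tool: compute_deductible
  purpose: verify                 # execute rather than re-derive
  arguments:
    plan_id: supplied-at-call-time
```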
Implications
If .kcp is even partially right about being the missing layer, several consequences follow.
- For RAG vendors. The "embed everything and pray" architecture is a transitional technology, the way untyped JavaScript was transitional before TypeScript. The future RAG stack consumes pre-compiled knowledge packages and falls back to embed-and-pray only for the long tail of unstructured content.
- For enterprise knowledge bases. Compiling a corpus into .kcp is a one-time investment that pays back across every model upgrade. Today's embedding store is locked to today's embedding model; a .kcp graph is not.
- For publishers. A textbook, a regulation, a research paper has a second product attached: its compiled .kcp form, distributed to AI consumers under a license designed for machine reading. This is a market that does not yet exist and will exist within five years.
- For regulators and auditors. Provenance ledgers make AI answers citable in court. The current state ("the model said so") is an audit failure waiting to happen. Span-anchored claims with content hashes give regulated industries a paper trail that survives subpoena.
- For open science. A .kcp of a paper is more useful to a meta-analysis than the PDF, because the claims are extracted, typed, and aligned. The dream of the Semantic Web has a second chance, this time with a compiler the community can actually run.
Objections, honestly
"Isn't this just RAG with extra steps?" Conventional RAG is a runtime technique: at query time, embed the question, retrieve chunks, inject. KCP moves the structural work to compile time: the chunking, the graph, the index are precomputed and shipped. The extra steps are paid once, by the publisher, instead of every time, by every consumer. The economic shape of the problem changes.
"Won't models eventually read PDFs perfectly?" Maybe. But "read" is not the bottleneck — reasoning at scale under a budget is. Even a hypothetical perfect reader still pays for every layout token it ingests, still re-derives the same graph on every query, and still has no provenance to cite. Better readers reduce the noise, not the structural mismatch.
"Why YAML? Why not [insert favorite serialization]?" YAML is human-readable, diff-friendly, supported in every language, and round-trips cleanly to JSON. The protocol is serialization-agnostic at the model level; YAML is the canonical surface because it survives version control. JSON, CBOR, and Protobuf encodings are all valid.
"What about copyright?" A .kcp derived from a copyrighted source inherits its licence. The format includes machine-readable licence terms, distribution constraints, and use restrictions — the same way a software package does. Compiling knowledge does not launder it.
"Who controls the protocol?" The same answer the web gave: the specification is open, the reference compiler is open source, and the goal is to land governance in a neutral foundation as adoption justifies it. No single vendor benefits from a closed .kcp.
A call to the community
The interesting work is not in this essay; it is in the next thousand .kcp packages.
If you maintain a corpus — internal documentation, a textbook, a regulation, a knowledge base — compile a slice of it. Run it through the open compiler, inspect the graph, ship it to your agents, and tell us what broke. If you build agent frameworks, write a .kcp adapter and benchmark the difference against your current pipeline. If you research retrieval, publish a head-to-head against the methodology in this article and tell us where it fails. Join the conversation at /community and on the Knowledge Context Protocol repository.
A protocol becomes seminal when other people build on it. MCP became infrastructure in twelve months because hundreds of teams shipped servers. KCP will follow the same path or it will not exist; there is no third option.
Closing
The document is not the knowledge.
The PDF was a brilliant solution to a 1993 problem: make the page portable. We solved it. We then spent thirty years pretending the page was the unit of thought. It never was. The unit of thought is the claim, with its source, its relations, its conditions, and its weight — and for the first time in history we have readers that need claims served to them in exactly that form.
.kcp is a small opinionated bet on what those readers will demand. It is open, falsifiable, and ready to be argued with. Read the comparison, inspect the protocol, compile a package, and tell the community where it breaks.
Stop feeding agents pictures of pages. Compile the knowledge first.
References
- Anthropic. Introducing the Model Context Protocol. 2024.
- Xu et al. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. KDD 2020.
- Blecher et al. Nougat: Neural Optical Understanding for Academic Documents. arXiv:2308.13418, 2023.
- Liu et al. Lost in the Middle: How Language Models Use Long Contexts. TACL, 2024.
- Edge et al. From Local to Global: A GraphRAG Approach to Query-Focused Summarization. arXiv:2404.16130, 2024.
- Thakur et al. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets and Benchmarks Track, 2021.
- Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS, 2023.
- Karpathy, A. On context engineering. 2025.
- W3C. Semantic Web. Ongoing.