There's a recurring theme in AI engineering that never gets old: the "dumb" approach beating the sophisticated one. This time, Vercel has the receipts. The company published eval results on January 27, 2026, comparing two approaches for teaching AI coding agents about Next.js 16 APIs — framework features like use cache, connection(), and forbidden() that postdate many widely used models' training cutoffs.
The results were lopsided enough to be worth examining closely. And the implications — if they generalize — touch on a fundamental question in agent design: should an AI agent decide when to look something up, or should the information simply be there?
The headline numbers
In Vercel's tests, a compressed 8KB documentation index embedded directly in AGENTS.md (a markdown file injected into the agent's context as persistent instructions) achieved a 100% pass rate across build, lint, and test assertions. A skill-based retrieval system — where the agent has access to the same documentation but must decide to invoke it — topped out at 79% even with explicit instructions encouraging its use. Without those instructions, the skill system scored 53%, identical to the baseline with no documentation at all.
The result is striking. It's also narrow in scope, limited in disclosed methodology, and potentially non-generalizable. All of those things can be true simultaneously.
What Vercel actually tested
Some methodological context is important before interpreting the numbers. Vercel's open-source eval repository contains 20 fixture-based tasks. Each fixture consists of a Next.js project scaffold, a task prompt, and a set of vitest assertions that check build, lint, and test outcomes. The eval suite was "hardened" by removing test leakage, resolving contradictions, and shifting to behavior-based assertions, according to Vercel's blog post. Configurations were tested with retries to reduce model variance. (Retries can inflate apparent pass rates: a task that fails on first attempt but succeeds on retry appears as a pass, masking the underlying inconsistency that required the retry.)
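The inflation effect is easy to quantify. A minimal sketch, with hypothetical numbers (Vercel does not disclose its retry count): if each attempt succeeds independently with probability p, allowing k retries lifts the observed pass rate to 1 − (1 − p)^(k+1).

```python
# Sketch: how retries inflate observed pass rates. The numbers below are
# hypothetical, not Vercel's actual (undisclosed) configuration.

def observed_pass_rate(p: float, retries: int) -> float:
    """Probability a task passes at least once in (1 + retries) attempts,
    assuming attempts are independent with per-attempt success p."""
    attempts = 1 + retries
    return 1 - (1 - p) ** attempts

# A task that passes only 60% of the time looks far more reliable with retries:
for retries in (0, 1, 2, 3):
    rate = observed_pass_rate(0.6, retries)
    print(f"{retries} retries -> {rate:.1%} observed pass rate")
# 0 retries -> 60.0%, 3 retries -> 97.4%
```

The independence assumption is generous to retries; correlated failures (e.g. a systematic misunderstanding of an API) would not be rescued this way.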
What Vercel does not disclose in the blog post: the exact model or models used, the number of retries per configuration, the distribution of task difficulty, or whether prompt tuning effort was equivalent across all four configurations. The open-source eval repo recommends a setting of runs: 10 and notes that single runs are insufficient for measuring reliability, but it's unclear whether the published results reflect that recommendation.
The distinction is important: a 100% pass rate on 20 tasks with retries tells a different statistical story than, say, 100% on 200 tasks without retries. The result is directionally interesting — the gap between 100% and 79% is substantial, and the gap between skills-with-default-behavior and baseline (both 53%) is itself a meaningful finding. But treating the 100% figure as proof of a general architectural principle requires more evidence than a single framework's eval suite provides.
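The sample-size point can be made concrete with a standard binomial confidence interval. A rough sketch using the Wilson score interval (illustrative only; retry structure and task correlation would widen these bounds further):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# 20/20 passes vs 200/200 passes: same point estimate, very different certainty.
print(wilson_interval(20, 20))    # lower bound ~0.84
print(wilson_interval(200, 200))  # lower bound ~0.98
```

A perfect score on 20 tasks is consistent with a true pass rate as low as the mid-80s; the same score on 200 tasks would rule that out.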
Even with these caveats, the delta between "docs available but not used" and "docs always visible" is large enough to illuminate a broader reliability problem: tool use depends on self-diagnosis, and models are bad at self-diagnosis.
Two models of giving agents information they don't have
The Vercel eval tested a question that's increasingly central to agent engineering: how do you give a model knowledge that isn't in its training data?
Vercel compared two approaches that represent distinct points on an increasingly recognized spectrum in context engineering — the practice of curating what information goes into a model's context window at each step of an inference task.
Push context (AGENTS.md) works by embedding information directly in the agent's persistent instructions. Whatever sits in AGENTS.md is available to the agent throughout a session. The agent never decides whether to load it; it's simply present. Claude Code uses the equivalent file CLAUDE.md. The Open Agent spec says AGENTS.md/CLAUDE.md-style files appear in 60,000+ repositories, and similar conventions have been adopted by tools including Cursor, Devin, GitHub Copilot, and Gemini CLI.
Pull context (Skills) uses a progressive disclosure model. The agent sees a catalog of available skills — just names and descriptions — and decides at runtime which ones to activate. When activated, the full skill content is injected into the conversation context. As Anthropic's Claude Skills documentation describes, this selection happens through "pure LLM reasoning" — no algorithmic routing or intent classification, just the model's judgment about whether a skill is relevant.
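The two models can be sketched as context-assembly strategies. Every name below is hypothetical; this is an illustration of the pattern, not any vendor's actual implementation:

```python
# Illustrative contrast between push and pull context assembly.
# All names here are invented; this is a sketch of the two patterns,
# not any real agent framework's API.

AGENTS_MD = "[Next.js Docs Index]|root: ./.next-docs|..."  # persistent instructions

SKILL_CATALOG = {
    "nextjs-docs": "Reference for Next.js 16 APIs (use cache, connection, ...)",
    "db-migrate": "Procedural workflow for schema migrations",
}

def load_skill(name: str) -> str:
    return f"<full contents of skill '{name}'>"

def push_context(task: str) -> list[str]:
    # Push model: persistent instructions ride along with every task.
    # No decision is made; the information is simply present.
    return [AGENTS_MD, task]

def pull_context(task: str, model_decides) -> list[str]:
    # Pull model: only names and descriptions are visible up front; the
    # model must choose to activate a skill before its body is loaded.
    # That choice is the failure point Vercel measured.
    catalog = "\n".join(f"{name}: {desc}" for name, desc in SKILL_CATALOG.items())
    chosen = model_decides(task, catalog)
    body = load_skill(chosen) if chosen else ""
    return [catalog, body, task]
```

Everything downstream of `model_decides` is conditional on a judgment call that, per Vercel's data, went unmade 56% of the time under default configuration.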
Andrej Karpathy has described context windows as analogous to RAM in an operating system — a useful conceptual shorthand, though the analogy has limits. Context windows don't behave like memory hierarchies: attention mechanisms weight information differently than memory addressing, and models can't "seek" to arbitrary positions the way a CPU addresses RAM. Still, the metaphor captures the core engineering tension: persistent context (data loaded at boot time) versus retrieved context (data fetched from storage on demand). The tradeoff between availability and window budget is real, even if the underlying mechanisms differ.
What went wrong with skills — and what the research says about why
Vercel's data reveals a three-stage failure pattern that turns out to be well-supported by existing research on LLM calibration and tool use.
The agent didn't use the skill. In 56% of eval cases with default skill configuration, the agent never invoked the Next.js documentation skill. The agent had access to the information and chose not to use it. The result: 53% pass rate, identical to baseline.
This is consistent with a substantial body of research on LLM calibration — work that spans general confidence estimation, domain-specific assessment, and tool-use decision-making, all converging on the same core finding: models consistently overestimate what they know.
Xiong et al. (ICLR 2024) established the baseline finding that LLMs are systematically overconfident when verbalizing certainty, with confidence values predominantly clustering in the 80-100% range regardless of actual accuracy — a pattern they attribute to models imitating human patterns of expressing confidence. A 2025 biomedical calibration study across nine models extended this to domain-specific tasks, finding overconfidence in 84.3% of 351 experiment scenarios tested.
Most directly relevant to the skill invocation problem, Liu et al. (EMNLP 2024 Findings) studied uncertainty calibration specifically for tool-using language agents, identifying prompt design and execution trace selection as "two primary areas that suffer from miscalibration" — meaning agents misjudge both when to use tools and which execution paths to trust.
Taken together, these results point to a consistent failure mode: models are poorly calibrated not just on answers, but on when to seek help.
The implication for skill systems is straightforward: if a model is systematically overconfident about its existing knowledge, it will systematically underestimate how much it needs external documentation. A skill the model doesn't think it needs is a skill the model won't invoke.
Explicit instructions helped, but were fragile. When Vercel added AGENTS.md instructions telling the agent to use the skill, the trigger rate rose above 95% and the pass rate improved to 79%. But the result was sensitive to specific phrasing: "Explore the project first, then invoke the skill" worked substantially better than more directive formulations. Vercel's blog does not report pass rates for the less effective phrasings, so the 79% figure likely represents a near-best-case result.
This fragility is not unique to Vercel's setup. Prompt sensitivity — where small changes in instruction phrasing produce large changes in model behavior — is one of the least well-understood properties of LLM-based systems. It does, however, illustrate why reducing the number of decisions an agent must make (which phrasing to attend to, when to trigger, what to prioritize) has practical engineering value.
Static context eliminated the decision entirely. With the compressed docs index in AGENTS.md, the agent's pass rate hit 100%. No skill invocation decision, no sequencing choice, no phrasing sensitivity.
The compression approach: engineering, not magic
The mechanism Vercel used to avoid context window bloat is worth understanding in more detail, because it determines how transferable the approach is.
The initial documentation injection was approximately 40KB. Vercel compressed it to ~8KB while maintaining the 100% pass rate — an 80% reduction. The compressed format uses a pipe-delimited index that maps documentation sections to local file paths:
```
[Next.js Docs Index]|root: ./.next-docs|IMPORTANT: Prefer retrieval-led reasoning over pre-training-led reasoning|01-app/01-getting-started:{01-installation.mdx,...}|...
```
This is not the documentation itself — it's a lookup table. The agent sees the index, recognizes which topics have local documentation available, and reads specific files when it needs details. The approach shifts the retrieval decision from "should I look this up?" to "which specific file should I read?" — a much narrower and less error-prone decision.
Vercel's blog does not describe whether the compression was performed manually or algorithmically, or how sensitive the pass rate is to compression quality. The pipe-delimited format itself may or may not be significant; it's plausible that any sufficiently structured format pointing to retrievable files would perform similarly, but this hasn't been tested.
A key instruction embedded in the index — "Prefer retrieval-led reasoning over pre-training-led reasoning" — appears to play an important role by explicitly redirecting the agent away from its training data. Without ablation studies isolating this instruction's contribution, it's difficult to know whether the index structure or the meta-instruction does more of the work.
There's a plausible cognitive argument for why compressed lookup tables might be particularly effective for transformer-based models. Full documentation is high in token count but low in information density per token — prose, examples, and explanations that a human reader needs but that a model processes as additional attention targets competing with the actual signal. A pipe-delimited index inverts this ratio: it's structurally sparse and informationally dense, with each token carrying high relevance to the retrieval task.
One hypothesis is that this aligns better with how attention mechanisms allocate weight — a short, structured index presents fewer distractors than the equivalent information spread across pages of natural language. The tradeoff is that structural compression discards semantic context that might help with novel or ambiguous queries, which might help explain why the approach works well for well-defined API lookups but could perform differently for tasks requiring deeper conceptual understanding.
The open-source project olore has generalized this pattern to other frameworks, compressing 438 Prisma documentation files into approximately 4KB. The npx @next/codemod@canary agents-md command automates the Next.js setup by detecting the project's version, downloading matching docs into .next-docs/, and injecting the compressed index. Early adopters have reported bugs with index generation, which suggests the tooling is still maturing.
Where this approach doesn't work — and where skills still do
The most important limitation of Vercel's finding is its scope. The test evaluated reference knowledge retrieval: the agent needed to know API syntax, function signatures, and usage patterns for a specific framework version. This is factual, static, and small — precisely the kind of knowledge that compresses well and benefits from persistent availability.
Several conditions limit how far this generalizes.
Scale. Daniel Miessler's Personal AI Infrastructure project runs 43 skills totaling approximately 398KB (~99,000 tokens). Loading all of that passively would consume roughly half of a 200K context window. For systems with substantial skill libraries, on-demand loading isn't a preference — it's a constraint imposed by context window economics. Miessler's analysis articulates a useful dividing principle: "Constraints on ALL outputs → passive. Capabilities for SPECIFIC outputs → active."
Complexity. Skills that involve multi-step procedural workflows, code execution, or multi-agent orchestration can't be meaningfully compressed into an 8KB index. Reference knowledge and procedural knowledge have different information density properties.
Multi-framework conflicts. A project using Next.js, Prisma, TanStack Query, and a custom internal framework would need multiple compressed indexes in its AGENTS.md. At some point, the cumulative size creates the same context window pressure that skills were designed to avoid — along with potential interference between overlapping instructions.
Model capacity. HumanLayer's analysis of the Claude Code harness found that the system prompt already contains approximately 50 individual instructions, and estimated that frontier thinking models can reliably follow roughly 150–200 instructions total. Every instruction in AGENTS.md competes for model attention with everything else in the prompt. Smaller models have correspondingly lower instruction budgets.
Maintenance overhead. A compressed docs index must be regenerated whenever the framework's documentation changes. For actively developed projects, this creates a maintenance burden that skill-based systems avoid through on-demand retrieval of current documentation.
Cost and latency. Persistent context incurs token costs on every API call, whether or not the agent needs the information for a given turn. An 8KB index adds roughly a few thousand tokens per turn, depending on encoding and formatting. For a single development session, that's negligible. For an agentic pipeline processing hundreds of tasks daily — say, a CI system running code generation across pull requests — the overhead compounds. A couple thousand extra input tokens per call, at current frontier model pricing on the order of single-digit dollars to low teens per million input tokens (varying by model and tier), across 500 calls per day, starts to add up. Multiply by several framework indexes and the cost becomes a line item worth tracking.
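The scale and cost arithmetic above can be checked with rough numbers. A sketch assuming ~4 bytes per token and an illustrative $5 per million input tokens (actual tokenization and prices vary by model and vendor):

```python
BYTES_PER_TOKEN = 4  # rough heuristic for English prose; varies by tokenizer

def tokens_from_kb(kb: float) -> int:
    return int(kb * 1024 / BYTES_PER_TOKEN)

# Scale: Miessler's ~398KB skill library against a 200K-token context window.
library_tokens = tokens_from_kb(398)
print(library_tokens, "tokens ->", f"{library_tokens / 200_000:.0%} of a 200K window")

# Cost: an 8KB index on every call, 500 calls/day, at an assumed
# $5 per million input tokens (illustrative, not any vendor's actual price).
index_tokens = tokens_from_kb(8)
daily_cost = index_tokens * 500 * 5 / 1_000_000
print(f"~{index_tokens} extra tokens/call -> ${daily_cost:.2f}/day per index")
```

Single-digit dollars per day per index is trivial for one project and a real line item across a fleet of pipelines and several framework indexes.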
Vercel acknowledges the complementary nature of the two approaches. AGENTS.md provides "broad, horizontal improvements to how agents work with Next.js across all tasks," while skills are "better for vertical, action-specific workflows that users explicitly trigger." This framing — horizontal knowledge as persistent context, vertical capabilities as on-demand skills — appears sound, though it remains to be validated across domains beyond Next.js.
To illustrate where skills would clearly outperform: consider a "migrate to App Router" workflow that requires reading the project's current routing structure, generating a step-by-step migration plan, executing file moves and rewrites, and running verification tests. This kind of procedural, multi-step task involves conditional logic, code execution, and state management across steps — none of which can be meaningfully compressed into a static index. The skill format's ability to bundle scripts, reference files, and structured instructions is purpose-built for exactly this class of problem. Attempting to encode it as persistent context would bloat AGENTS.md without adding reliability, because the agent needs the information only when the user explicitly requests a migration.
What this suggests for agent design going forward
Several emerging patterns seem worth tracking, with the caveat that most of these are supported by limited evidence and may prove model-generation dependent.
First, compressed reference indexes appear to be an effective form of context engineering for framework-specific knowledge. The approach — a lightweight lookup table pointing to retrievable local files, rather than full documentation in the prompt — is a reasonable middle ground between context bloat and retrieval unreliability. Projects like olore suggest this pattern has legs beyond Next.js.
Second, the agent skill invocation problem may be a specific case of a broader LLM calibration failure. Models don't invoke tools they don't think they need, and current models are systematically overconfident about their existing knowledge. This suggests that any architecture relying on models to self-assess their knowledge gaps faces a fundamental headwind — one that may diminish as models improve at tool use, but that is substantial today.
Third, reducing agent decision points appears to improve reliability in cases where the cost of a wrong decision is high. This echoes a familiar principle in software reliability engineering: system failure probability tends to rise with the number of conditional branches, because each branch can introduce additional failure modes. In traditional software, the solution is defensive defaults and reduced branching. In agent systems, the analogous move is reducing the number of points where the model must assess its own knowledge state — a judgment that, as the calibration literature shows, it performs unreliably.
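The branching argument has a simple probabilistic form. A sketch assuming each decision point fails independently with the same probability (a simplification; real agent failures correlate):

```python
def system_success(p_fail_per_decision: float, decisions: int) -> float:
    """Probability that all independent decision points succeed."""
    return (1 - p_fail_per_decision) ** decisions

# Even modestly unreliable decisions compound quickly:
for n in (1, 3, 5, 10):
    print(f"{n} decision points -> {system_success(0.1, n):.1%} system success")
# 1 -> 90.0%, 10 -> 34.9%
```

At a 10% per-decision failure rate, ten chained decisions leave roughly one-in-three odds of an end-to-end success, which is why removing a decision entirely, as the static index does, can beat improving the decision.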
Finally, evals targeting knowledge outside model training data are essential for measuring documentation wiring effectiveness. If evals only test knowledge the model already has, every approach will appear roughly equivalent. Vercel's eval design, whatever its limitations, gets this right.
The broader question
The tension Vercel's results surface — passive availability versus active retrieval — is not specific to coding agents or Next.js. It's a version of a question that recurs wherever AI systems must decide what to know and what to look up: how much should we trust the model's judgment about its own knowledge boundaries?
Current evidence, from Vercel's evals and from the broader LLM calibration literature, suggests that the answer is "not as much as we might hope." That may change as models improve. In the meantime, reducing the number of knowledge-retrieval decisions an agent must make — through compressed indexes, persistent context, or other forms of proactive context engineering — appears to produce more reliable results than trusting the model to seek information on its own.
Whether that pattern holds at scale, across domains, and across model generations is the open question. Vercel's data is a useful data point, not a settled answer.