The LLM Whisperers: How Cloudflare and Anthropic Cracked the Code on AI Agent Efficiency

There's a delicious irony at the heart of modern AI development. We've spent years training large language models on every scrap of code humanity has ever written—Stack Overflow answers, GitHub repositories, programming textbooks, documentation—teaching them to become fluent in Python, JavaScript, TypeScript, and dozens of other languages. Then, when it comes time to actually use these models as agents that interact with external tools, we ask them to do something completely unnatural: generate perfectly formatted JSON objects wrapped in XML tags, specifying function names and parameters in a rigid schema they've barely seen during training.

It's like hiring a concert pianist and asking them to communicate by tapping out Morse code on the keys.

This fall, both Cloudflare and Anthropic published technical blog posts that arrive at the same radical conclusion: we've been doing this backwards. Instead of forcing LLMs to speak in the synthetic language of function calling, just let them write actual code. The performance gains are dramatic—Anthropic reports their test case dropped from 150,000 tokens to 2,000, a 98.7% reduction.

But the real story isn't just about efficiency. It's about the operational trade-offs that emerge when you shift from structured function calls to executing arbitrary agent-generated code. The benefits are real. So are the new failure modes, the security attack surfaces, and the operational complexity that comes with running untrusted code at scale.

The MCP Explosion and Its Discontents

To understand why this matters, you need to know about the Model Context Protocol—Anthropic's open standard for connecting AI agents to external systems, launched in November 2024. Think of it as USB-C for AI: a universal adapter that lets any agent talk to any tool without custom integration code.

The adoption has been rapid. The community has built thousands of MCP servers, and as Anthropic notes in their November 4, 2025 blog post, "developers routinely build agents with access to hundreds or thousands of tools across dozens of MCP servers."

This created two immediate problems that anyone who's actually built production AI agents knows intimately:

Problem one: Context window obesity. Traditional MCP implementations load every tool definition upfront. A simple tool description might look like this:

gdrive.getDocument
Description: Retrieves a document from Google Drive
Parameters:
  documentId (required, string): The ID of the document
  fields (optional, string): Specific fields to return
Returns: Document object with title, body, metadata, permissions...

Multiply that by 500 tools across 20 MCP servers, and Anthropic reports you're "processing hundreds of thousands of tokens before reading a request." For context, Claude's standard context window is 200K tokens—you could be spending half of it just describing what tools are available.
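A quick back-of-envelope check on that claim, assuming a modest 200 tokens per tool definition (the real figure varies widely with schema complexity):

// Back-of-envelope only; tokensPerDefinition is an assumption and varies by schema.
const toolCount = 500;
const tokensPerDefinition = 200;   // name, description, parameter schema
const contextWindow = 200_000;     // Claude's standard window

const definitionOverhead = toolCount * tokensPerDefinition;  // 100,000 tokens
console.log(`${definitionOverhead} tokens of tool definitions, ` +
  `${Math.round((definitionOverhead / contextWindow) * 100)}% of the context window`);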

Problem two: The intermediate result tax. Here's a common workflow: download a meeting transcript from Google Drive, then attach it to a Salesforce lead. With traditional tool calling, that transcript flows through the model twice:

TOOL CALL: gdrive.getDocument(documentId: "abc123")
→ Returns full 10,000-word transcript
[Model reads entire transcript into context]

TOOL CALL: salesforce.updateRecord(
  objectType: "Lead",
  data: { "Notes": [entire 10,000-word transcript] }
)
[Model writes entire transcript again]

As Anthropic explains: "Every intermediate result must pass through the model. In this example, the full call transcript flows through twice. For a 2-hour sales meeting, that could mean processing an additional 50,000 tokens."

Code Mode: The Cloudflare Revelation

Cloudflare's "Code Mode" approach, published September 26, 2025, cuts to the heart of the problem with characteristic bluntness. Their insight: LLMs are trained on vast repositories of actual code, but tool-calling formats are synthetic constructs that barely appear in training data. The model is fluent in JavaScript but stutters in function-call JSON. As their team puts it: "LLMs are better at writing code to call MCP, than at calling MCP directly."

So they flipped the script entirely. Instead of exposing tools as callable functions, Cloudflare's Agents SDK generates TypeScript interfaces from MCP schemas and lets Claude write actual code:

import * as gdrive from './servers/google-drive';
import * as salesforce from './servers/salesforce';

const transcript = (await gdrive.getDocument({ documentId: 'abc123' })).content;

await salesforce.updateRecord({
  objectType: 'SalesMeeting',
  recordId: '00Q5f000001abcXYZ',
  data: { Notes: transcript },
});

This is code the model has seen a million variations of during training. It's natural language for an LLM that grew up reading GitHub.

The execution environment is where Cloudflare's infrastructure chops shine. Each piece of agent-generated code runs in a V8 isolate—the same lightweight JavaScript sandboxes that power Cloudflare Workers. These aren't containers; they're far more elegant. An isolate spins up in milliseconds with just a few megabytes of memory. So fast, in fact, that Cloudflare creates a fresh isolate for every code execution and immediately discards it. No reuse, no prewarming, just instant-on sandboxing.

The security model is particularly clever. The isolate has zero network access—`fetch()` and `connect()` throw errors by default. Instead of letting code make arbitrary HTTP calls (a security nightmare), Cloudflare uses "bindings"—live JavaScript objects that provide direct access to specific MCP servers. The agent code can call `gdrive.getDocument()`, but it can't phone home to an attacker's server or exfiltrate data through side channels.
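The sandboxing machinery is Cloudflare-specific, but the shape of the interface is easy to picture. Here is a toy sketch of the "bindings, not network" idea in plain TypeScript; executeAgentCode and the binding names are hypothetical, and new Function provides no real isolation, it only illustrates how agent code might receive MCP-backed objects while the network primitives are stubbed out:

// Toy illustration only. Cloudflare's real implementation uses V8 isolates
// inside Workers; everything here is an assumption about the interface shape.

type Bindings = Record<string, unknown>;

async function executeAgentCode(code: string, bindings: Bindings): Promise<unknown> {
  // Agent code only receives the binding objects it was explicitly handed.
  // fetch and connect are shadowed so any attempt to reach the network throws.
  const blocked = () => {
    throw new Error('Network access is disabled in this sandbox');
  };
  const names = Object.keys(bindings);
  const values = Object.values(bindings);
  const fn = new Function(
    ...names, 'fetch', 'connect',
    `"use strict"; return (async () => { ${code} })();`
  ) as (...args: unknown[]) => Promise<unknown>;
  return fn(...values, blocked, blocked);
}

// Usage: the harness injects live MCP-backed objects as bindings, e.g.
// await executeAgentCode(agentGeneratedCode, { gdrive, salesforce });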

Anthropic's Take: Same Insight, Different Angle

Anthropic's engineering team published their own exploration of code execution with MCP just this week, arriving at the same core conclusion through a different lens. Their post focuses less on infrastructure and more on patterns—how to structure code-based tool access in ways that maximize efficiency. Their canonical example: instead of loading all tool definitions into context, present them as a filesystem structure that the agent can explore:

servers/
├── google-drive/
│   ├── getDocument.ts
│   ├── listFiles.ts
│   └── index.ts
├── salesforce/
│   ├── updateRecord.ts
│   ├── query.ts
│   └── index.ts
└── slack/
    ├── postMessage.ts
    └── index.ts

When the agent needs a tool, it navigates the filesystem just like a developer would—listing directories, reading relevant files. As they describe it: "Models are great at navigating filesystems. Presenting tools as code on a filesystem allows models to read tool definitions on-demand, rather than reading them all up-front."

The result? That 98.7% token reduction for their test case—from 150,000 tokens to 2,000 for complex workflows.
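Each entry in that filesystem is just a thin, typed wrapper around an MCP call. A sketch of what one generated file might contain; the callMCPTool helper stands in for whatever the harness actually provides, and its name and signature are assumptions:

// ./servers/google-drive/getDocument.ts
// Sketch of a generated wrapper; callMCPTool is an assumed harness helper
// that proxies the call to the underlying MCP server.

declare function callMCPTool<T>(toolName: string, input: unknown): Promise<T>;

interface GetDocumentInput {
  documentId: string;
  fields?: string;
}

interface GetDocumentResponse {
  title: string;
  content: string;
}

export async function getDocument(input: GetDocumentInput): Promise<GetDocumentResponse> {
  return callMCPTool<GetDocumentResponse>('google_drive__get_document', input);
}

The agent pays the token cost of a definition only when it actually opens or imports the file.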

But Anthropic's post goes further, exploring implications that Cloudflare's more product-focused piece only hints at:

Data filtering in the execution environment. Instead of pulling 10,000 spreadsheet rows through the model's context, the agent writes code that filters locally:

const allRows = await gdrive.getSheet({ sheetId: 'abc123' });
const pendingOrders = allRows.filter(row => 
  row["Status"] === 'pending'
);
console.log(`Found ${pendingOrders.length} pending orders`);
console.log(pendingOrders.slice(0, 5)); // Only log first 5

As Anthropic notes: "The agent sees five rows instead of 10,000. Similar patterns work for aggregations, joins across multiple data sources, or extracting specific fields—all without bloating the context window."

Privacy-preserving operations. Here's where it gets genuinely interesting. With code execution, intermediate data never has to enter the model's context at all. Anthropic describes a pattern where the execution harness automatically tokenizes sensitive data:

// Agent writes this code:
const sheet = await gdrive.getSheet({ sheetId: 'abc123' });
for (const row of sheet.rows) {
  await salesforce.updateRecord({
    data: { Email: row.email, Phone: row.phone }
  });
}

// But the model only sees tokenized data:
[
  { email: '[EMAIL_1]', phone: '[PHONE_1]' },
  { email: '[EMAIL_2]', phone: '[PHONE_2]' },
  ...
]

"The real email addresses, phone numbers, and names flow from Google Sheets to Salesforce, but never through the model," they explain. This isn't just efficiency—it's a completely different security model for handling PII.

Skills that compound. Once an agent develops working code for a task, it can save that implementation as a reusable function. Over time, the agent builds a library of higher-level capabilities—essentially growing its own standard library of domain-specific operations.
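What that might look like mechanically: a sketch of a harness persisting working agent code to a skills directory. The ./skills path and the saveSkill helper are illustrative, not a published API.

// Sketch only; the directory layout and helper name are assumptions.
import { promises as fs } from 'fs';

async function saveSkill(name: string, source: string): Promise<void> {
  // Keep the agent's working implementation so later sessions can import it
  // instead of re-deriving it from scratch.
  await fs.mkdir('./skills', { recursive: true });
  await fs.writeFile(`./skills/${name}.ts`, source, 'utf8');
}

// A later session can then simply write:
// import { syncMeetingNotes } from './skills/syncMeetingNotes';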

The Trade-Offs: What the Blog Posts Don't Tell You

Both posts make compelling cases for code execution, but they're written by teams shipping products. Let's examine what gets glossed over.

Dimension | Traditional Tool Calling | Code Execution
Token cost (complex workflow) | 150,000 | 2,000
Infrastructure requirement | None (just LLM API) | Sandbox runtime, monitoring
Cold start latency | ~0ms overhead | Tens of milliseconds (sandbox)
Debugging complexity | Medium (inspect tool calls) | High (inspect generated code)
Security attack surface | Model → API | Model → Code → Sandbox → API
Failure modes | Malformed JSON, wrong params | Logic errors, infinite loops, resource exhaustion
Operational overhead | Low | High (sandbox maintenance, skill library)
Best for | Simple workflows, few tools | Complex workflows, many tools

The debugging nightmare. When an agent makes a bad tool call, you see exactly what went wrong: {"tool": "updateRecord", "params": {"recordId": null}} — ah, it forgot to pass the ID. When an agent writes code that compiles but has a subtle logic error, debugging becomes archaeology. The code looks right. It runs without errors. But it's quietly doing the wrong thing.

Consider this agent-generated code:

const leads = await salesforce.query({
  query: 'SELECT Id FROM Lead'
});
for (const lead of leads) {
  await salesforce.updateRecord({
    objectType: 'Lead',
    recordId: lead.Id,
    data: { Status: 'contacted' }
  });
}

Looks fine, right? Except it's calling updateRecord once per lead, making 1,000 sequential API calls instead of using a batch operation. The agent solved the problem, but in the most inefficient way possible. Traditional tool calling with proper batch operations in the tool definitions would have prevented this.
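For comparison, here is what the batched version might look like if the Salesforce MCP server exposed a bulk tool; updateRecords is a hypothetical name, not something either post documents. Whether the model writes this line of code or emits a single tool call, the fix is the same: one request instead of a thousand.

// Hypothetical bulk tool; updateRecords is an assumed name and shape.
import * as salesforce from './servers/salesforce';

const leads = await salesforce.query({ query: 'SELECT Id FROM Lead' });

await salesforce.updateRecords({
  objectType: 'Lead',
  records: leads.map(lead => ({ recordId: lead.Id, data: { Status: 'contacted' } })),
});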

The security paradox. Sandboxing adds safety, but code execution introduces new attack vectors. An agent could write code that intentionally hides malicious operations in complex logic:

// Innocent-looking data processing
const results = data.map(item => {
  const processed = transform(item);
  // Buried 50 lines deep: exfiltrate data
  if (processed.value > threshold) {
    // Oops, logs go to external service
    logger.debug(JSON.stringify(processed));
  }
  return processed;
});

With tool calling, you audit the tools. With code execution, you need to audit the code and the execution environment. Static analysis of LLM-generated code is still an open problem.
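To make that concrete, here is the kind of naive pre-execution check a team might bolt on first; the deny list is purely illustrative. Note that it would not catch the logger.debug leak above, which is exactly why auditing generated code is harder than auditing a fixed set of tools.

// A deliberately naive pre-execution scan; illustrative only.
const denyList = [
  /\beval\s*\(/,          // dynamic evaluation
  /new\s+Function\s*\(/,  // another route to dynamic evaluation
  /process\.env/,         // reading secrets from the environment
  /child_process/,        // spawning subprocesses
];

function looksSuspicious(code: string): boolean {
  return denyList.some(pattern => pattern.test(code));
}

// The exfiltration example above sails straight through a check like this:
console.log(looksSuspicious('logger.debug(JSON.stringify(processed));')); // false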

The operational burden. Cloudflare has V8 isolates because they're Cloudflare—they've spent years building that infrastructure. For everyone else, you're looking at:

  • Setting up and maintaining a sandbox runtime (Docker, Firecracker, gVisor, or OS-level sandboxing)
  • Resource limits and monitoring to prevent runaway code
  • Skill library version management as your agent accumulates code
  • Logging and observability for code execution (what ran, why, what did it access)

Anthropic's own work on sandboxing Claude Code shows they're using Linux bubblewrap and macOS Sandbox—OS-level primitives that require careful configuration. They've open-sourced this work, which is admirable, but adopting it isn't turnkey.
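For teams not ready to adopt those OS-level primitives, a locked-down container is a common stopgap. A rough sketch of what a harness might do; the Docker flags are standard options, but the image name and file paths are placeholders and this is nowhere near a hardened setup:

// Sketch only: run agent code in a container with conservative limits.
import { spawn } from 'child_process';

function runInContainer(codePath: string) {
  return spawn('docker', [
    'run', '--rm',
    '--network', 'none',   // no outbound network from agent code
    '--memory', '256m',    // cap memory
    '--cpus', '0.5',       // cap CPU
    '--pids-limit', '64',  // prevent fork bombs
    '--read-only',         // immutable root filesystem
    '--tmpfs', '/tmp',     // scratch space only
    '-v', `${codePath}:/agent/index.js:ro`,
    'node:22-slim',        // placeholder base image
    'node', '/agent/index.js',
  ], { stdio: 'inherit' });
}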

When Code Execution Is Overkill

This is crucial: code execution isn't always the right answer.

If your agent only uses 3-5 tools for simple workflows, the overhead of code execution likely exceeds the benefits. A customer service bot that looks up orders and sends canned responses? Traditional tool calling is simpler, more auditable, and has fewer moving parts.

Code execution makes sense when:

  • You're connecting to dozens of MCP servers with hundreds of tools
  • Workflows involve complex multi-step operations with data transformation
  • Intermediate results are large (documents, datasets, query results)
  • You need privacy-preserving data handling (PII tokenization)
  • Your operational maturity includes secure sandbox infrastructure

Code execution is probably overkill when:

  • You have a handful of simple tools
  • Workflows are short and linear
  • All data is small enough to fit comfortably in context
  • Your team doesn't have sandbox infrastructure or security expertise
  • Deployment is on-premises with limited compute resources

The cost-benefit calculation changes with scale. At 100 agent invocations per day with 5 tools, traditional tool calling is fine. At 100,000 invocations per day with 200 tools, you're burning tens of thousands of dollars in unnecessary token costs—enough to justify building sandbox infrastructure.
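A back-of-envelope version of that math, with the per-token price and overhead figures as explicit assumptions rather than quotes from either post:

// All three inputs are assumptions for illustration.
const invocationsPerDay = 100_000;
const extraTokensPerInvocation = 100_000;  // tool definitions plus intermediate results
const assumedUsdPerMillionInputTokens = 3;

const dailyWaste =
  (invocationsPerDay * extraTokensPerInvocation / 1_000_000) * assumedUsdPerMillionInputTokens;

console.log(`~$${dailyWaste.toLocaleString()} per day in avoidable input tokens`); // ~$30,000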

Why This Works: The Training Data Tells All

The deeper reason code generation works goes back to the fundamental nature of how these models were trained. When you look at the Common Crawl snapshots, GitHub dumps, and code repositories that make up LLM training data, you find billions of lines of production code. Function calls, error handling, loops, data transformations—the full spectrum of software engineering patterns, written by humans trying to solve real problems.

What you don't find much of: synthetic tool-calling formats. JSON function descriptors wrapped in XML tags are an artifact of LLM deployment, not training data. They're a post-hoc convention that researchers and engineers invented to make models more useful, but models haven't seen nearly enough examples to develop real fluency.

This explains why tool calling remains brittle. Models hallucinate function names, mess up parameter types, and struggle with complex tool compositions. They're generating text in a format they were never properly trained on.

Meanwhile, writing code to chain API calls together? That's the mother tongue.

Convergent Evolution in Real Time

What's striking is how independently Cloudflare and Anthropic arrived at essentially the same conclusion within weeks of each other. Cloudflare published on September 26, 2025; Anthropic on November 4, 2025. The later post acknowledges the earlier one, but these were clearly parallel efforts driven by the same pressures: agents are scaling up, tool counts are exploding, and the old approach breaks at this scale.

Anthropic acknowledges this directly in their post, noting that Cloudflare "published similar findings, referring to code execution with MCP as 'Code Mode.' The core insight is the same: LLMs are adept at writing code and developers should take advantage of this strength."

It's reminiscent of other moments in computing history when infrastructure constraints forced architectural rethinks. The move from CGI scripts to FastCGI and WSGI as web traffic scaled. The container revolution when VMs became too heavyweight. GraphQL's emergence as REST APIs grew unwieldy. In each case, the old approach worked fine at small scale, but fundamental limitations emerged under pressure.

We're watching that phase transition happen in real time for AI agents.

What Comes Next

The adoption curve for code execution with MCP remains uncertain. The efficiency gains are compelling, but the operational complexity is real. Here's what needs to happen for this to become more widely adopted:

Standardized sandbox runtimes. Right now, everyone building this is rolling their own. We need the equivalent of Docker for agent code execution—a standard runtime that handles sandboxing, resource limits, and monitoring with sane defaults. Anthropic's open-source work is a start, but broader ecosystem tooling is needed. Sandbox-as-a-service offerings will likely emerge.

Better debugging tools. If agents are writing code, developers need tools to understand what that code does, why it was generated, and where it went wrong. Think: execution traces, code annotation from the model ("I'm writing this loop to process all leads"), and static analysis tools tuned for LLM-generated code patterns.

Observability frameworks. Standard logging, tracing, and monitoring specifically designed for agent-written code. What tools did it use? What data flowed where? What was the execution path? These aren't just debugging tools—they're audit and compliance requirements for enterprise adoption.

Clearer security models. Enterprise adoption requires confidence in the security boundaries. That means formal verification of sandbox properties, audit logging of all code execution, and clear policies about what code can and cannot do. The "tokenize PII" pattern Anthropic describes is elegant, but needs standardization and third-party validation.

Agent frameworks with batteries included. LangChain, AutoGPT, and similar frameworks need to bundle secure execution environments, or at least provide clear integration paths. The ecosystem will likely split between "lightweight" frameworks (tool calling only) and "full-stack" frameworks (code execution included), similar to how web frameworks split between templating-only and full MVC.

Economic viability analysis. The token savings are dramatic, but you're trading compute for infrastructure cost. At what scale does a sandbox runtime become cheaper than burning tokens? For high-frequency agents handling complex workflows at scale, probably immediately. For occasional-use assistants with simple workflows, maybe never.

Code generation libraries for skills. As agents accumulate reusable code snippets and skills, we'll need versioning, testing, and distribution mechanisms—essentially package managers for agent skills.

If I had to estimate—and this is educated speculation rather than prediction—many high-scale agent platforms will adopt code execution within 18-24 months. For smaller deployments and simpler use cases, traditional tool calling will remain the pragmatic choice until the tooling matures and sandbox infrastructure becomes commodity. The pressure points that drove Cloudflare and Anthropic to this approach (tool overload, token costs, workflow complexity) don't hit everyone equally or immediately.

The Meta-Lesson

Strip away the technical details, and there's a broader lesson here about working with AI systems: understand what your tool is actually good at, and build around those strengths.

LLMs are fundamentally text prediction engines trained on human-generated content. They excel at tasks that look like the text they were trained on. When we ask them to generate synthetic formats with no parallel in their training data, we're working against the grain of the system.

The revolution isn't teaching models to do new things—it's recognizing what they already do well and building infrastructure around those strengths. Code Mode and code execution with MCP aren't about adding capabilities; they're about removing artificial constraints.

But—and this is crucial—removing constraints means accepting new responsibilities. As Anthropic notes in their post: "Running agent-generated code requires a secure execution environment with appropriate sandboxing, resource limits, and monitoring. These infrastructure requirements add operational overhead and security considerations that direct tool calls avoid."

You're not just calling APIs anymore; you're executing arbitrary code generated by a probabilistic model. That's powerful. It's also risky. The engineering challenges are real: sandboxing, debugging, security, operational overhead.

The question isn't whether code execution is "better" than tool calling in some abstract sense. The question is: for your specific use case, at your specific scale, with your specific operational maturity, do the token savings and workflow improvements justify the infrastructure complexity?

For Cloudflare and Anthropic, operating at massive scale with world-class infrastructure teams, the answer is clearly yes. For everyone else, the calculus depends on scale, technical capacity, and how many tools you're trying to orchestrate.

The industry is learning—again—that there are no silver bullets, only trade-offs.
