
Code as Action: The Pattern Behind Programmatic Tool Calling

There's a design decision buried inside every AI agent that most developers make without realizing they've made it: what is the atomic unit of action? For most agent systems built in the last two years, the answer is a tool call — a single function invoked with a JSON payload, its result surfaced back to the model, the model reasons about what to do next, another tool call fires. It works. It scales poorly.

The Code as Action pattern — formalized as CodeAct in a 2024 ICML paper by Xingyao Wang and colleagues at UIUC, and now productized in Anthropic's Programmatic Tool Calling API — replaces the tool call as the atomic unit with a code block. Instead of selecting a tool and emitting a JSON payload, the model writes Python that calls multiple tools, processes their results, applies control flow logic, and lets you keep intermediate data inside the container rather than routing it back through the model's context. The difference is architectural, not incremental.

The Problem with Tool-Call-as-Atom

Standard agentic tool use works like this: Claude calls check_status(server_1), waits for the result, processes it, calls check_status(server_2), and so on. Fifty servers means fifty API round-trips, each consuming tokens for the request, the reasoning, and the response. Context fills with intermediate results that are only relevant as inputs to the next step. Latency compounds. Costs compound.

The structural issue isn't the number of tool calls. It's that the model is being used as an orchestrator for work that doesn't require language model reasoning at each step. Iterating over a list, filtering a dataset, retrying a failed call with adjusted parameters — these are programming problems. They have programming solutions. Routing them through a language model on every iteration is the wrong abstraction.

As Anthropic's own engineering team put it when releasing PTC: "When using natural language tool calling, each invocation requires a full inference pass, and intermediate results pile up in context whether they're useful or not. Code is a natural fit for orchestration logic, such as loops, conditionals, and data transformations."

Code is the right abstraction here precisely because it's how you express deterministic control flow over messy, real-world data. Loops, conditionals, exception handling, aggregation — these aren't things an LLM needs to reason about per-step. They're things a Python interpreter executes.

How It Works

The mechanism is simple. Claude writes the code; your service runs it in a managed container and resolves any tool-call intents the code emits. Execution output (including errors) is returned as the next observation. The model either emits more code or responds in natural language.

observe → plan → write code → execute → observe → ...

In Anthropic's PTC implementation, tools opt in via allowed_callers in their definition:

{
  "name": "query_database",
  "description": "Execute a SQL query. Returns rows as JSON objects.",
  "input_schema": { "...": "..." },
  "allowed_callers": ["code_execution_20250825"]
}
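For orientation, here is roughly how a definition like that might be wired into a request. This is a minimal sketch assuming Anthropic's Python SDK; the model string and beta identifier are the ones quoted elsewhere in this article, and the exact field names should be checked against the current docs before you rely on them.

# Minimal request sketch: the code execution tool plus one opt-in tool.
# Identifiers mirror the beta strings quoted in this article; verify them
# against the official documentation before shipping.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
    betas=["advanced-tool-use-2025-11-20"],
    tools=[
        # The managed code execution environment itself.
        {"type": "code_execution_20250825", "name": "code_execution"},
        # A regular tool that opts in to being called from code.
        {
            "name": "query_database",
            "description": "Execute a SQL query. Returns rows as JSON objects.",
            "input_schema": {"type": "object", "properties": {"sql": {"type": "string"}}},
            "allowed_callers": ["code_execution_20250825"],
        },
    ],
    messages=[{"role": "user", "content": "How many orders shipped late last week?"}],
)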

When the task benefits from batch execution or complex data processing, Claude writes a script that calls your tools as async Python functions. The code execution environment emits tool call intents; your application resolves them and resumes execution. You control what enters the model's context window — in the typical case, that's only the final processed output, not the fifty intermediate results from fifty individual round-trips.

Here's what that looks like for the fifty-server example:

# Claude writes this; your infrastructure executes it
import asyncio

results = await asyncio.gather(*[
    check_status(f"server_{i}") for i in range(50)
])
healthy = sum(1 for r in results if r["status"] == "ok")
print({"healthy": healthy, "total": 50, "degraded": 50 - healthy})

One execution block. In the typical pattern, only the aggregated result goes back to the model; the raw intermediate data stays inside the container instead of filling the context window.

Two operational details worth locking in before you ship a production integration. First, PTC does not guarantee parallel execution. The script can express parallelism via asyncio.gather(), but whether tool calls execute concurrently depends entirely on how your host application handles the emitted requests. PTC's primary win is context hygiene and token efficiency — orchestration logic moves from conversation turns into explicit code — even when execution is ultimately sequential. If you're fanning out many calls simultaneously, account for rate limits on receiving services.
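What that host-side handling could look like is sketched below: the tool-call intents emitted by the container are resolved concurrently, but behind a semaphore so a fifty-call fan-out doesn't trample downstream rate limits. TOOL_REGISTRY and the request objects are hypothetical stand-ins for however your application dispatches a single tool call.

import asyncio

# Hypothetical host-side resolver: TOOL_REGISTRY maps tool names to async
# callables; each request stands in for one tool-call intent (name + input)
# emitted by the code execution container.
MAX_CONCURRENCY = 10

async def resolve_all(pending_requests):
    limiter = asyncio.Semaphore(MAX_CONCURRENCY)

    async def resolve_one(request):
        # At most MAX_CONCURRENCY calls in flight at once, so concurrency
        # stays within the receiving service's rate limits.
        async with limiter:
            return await TOOL_REGISTRY[request.name](**request.input)

    # Concurrency here is the host's choice; resolving sequentially is
    # equally valid, just slower.
    return await asyncio.gather(*(resolve_one(r) for r in pending_requests))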

Second, and this is the real foot-gun: when the container is waiting for tool results, your API response must be tool_result blocks only — no extra text. Mix in a prose message while pending calls exist and the container will error. Build your response-routing logic around this before you go to production.
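Concretely, the follow-up turn while calls are pending should look something like this. The field names follow the standard Messages API tool_result shape, and the result payload is purely illustrative.

# Follow-up turn while the container is waiting: tool_result blocks only,
# one per pending tool_use id, with no free-text content blocks alongside them.
follow_up = {
    "role": "user",
    "content": [
        {
            "type": "tool_result",
            "tool_use_id": "toolu_xyz789",
            "content": '[{"order_id": 1, "shipped_late": true}]',
        },
        # ...one tool_result block per pending call, and nothing else.
    ],
}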

Why LLMs Are Good at This

The case for code as action isn't just about efficiency. It's about alignment between the model's training distribution and the task it's being asked to perform.

The reason JSON tool schemas create friction is structural: they're local, novel, and specific to each application. A model has never seen your particular tool schema during training. It has to generalize from first principles every time. Python syntax and idioms, by contrast, are globally ubiquitous in training data — GitHub repositories, Stack Overflow answers, documentation, textbooks. When you ask Claude to emit a JSON function call against an unfamiliar schema, you're asking it to handle novelty. When you ask it to write Python, you're operating in familiar territory.

That conceptual argument has empirical backing. The Wang et al. ICML 2024 paper evaluated 17 LLMs across two benchmarks — API-Bank and M³ToolEval — comparing code-based actions against text and JSON alternatives. Code-based actions outperform both by up to 20% in task success rate on M³ToolEval and require substantially fewer turns on average to complete equivalent tasks (§2.3, Table 3, Wang et al., ICML 2024). Fewer turns means fewer model calls, which maps directly to lower cost at production scale.

Self-Debugging as a First-Class Feature

One consequence of moving action into a code execution environment: the model gets structured error feedback rather than an opaque error message embedded in a tool result.

When execution fails, the model observes the failure as execution output from the container — it sees which part of the orchestration broke and can revise the code accordingly. Self-debugging becomes a natural part of the execution loop rather than a bolt-on:

write code → execute → observe failure → revise → execute → ...
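In the research-pattern form (not Anthropic's managed container), the executor side of that loop can be as small as: run the code, capture the traceback on failure, and hand it back as the next observation. A deliberately minimal, illustrative sketch:

import traceback

def execute_action(code: str, namespace: dict) -> str:
    """Run model-written code and return whatever the model observes next.

    A bare-bones CodeAct-style executor: a real system sandboxes this,
    captures stdout, and enforces timeouts.
    """
    try:
        exec(code, namespace)  # never do this outside a sandbox
        return str(namespace.get("result", "ok"))
    except Exception:
        # On failure, the traceback itself becomes the observation, so the
        # model can see which line broke and revise the code it writes next.
        return traceback.format_exc()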

The ICML paper's CodeActAgent — fine-tuned from Llama 2 and Mistral-7B on the CodeActInstruct dataset (7,000 multi-turn trajectories) — was specifically designed to exploit this loop. Its strong generalization to out-of-distribution agent tasks stems in part from error-driven iteration being baked into its training distribution.

There's also a production debugging benefit worth naming: code actions are inspectable in a way that JSON tool call sequences aren't. When something goes wrong, you have Python you can read, paste into a notebook, and run yourself. The observability compounds over time.

A Note on Versioning

Throughout this article you'll see identifiers like code_execution_20250825 and advanced-tool-use-2025-11-20. These are the current beta identifiers as of February 2026. Anthropic has historically changed or replaced similar identifiers without extended deprecation windows. Treat them as illustrative of the current implementation, not as stable public contracts. Always check the official documentation before deploying to production.

Pattern Selection: When Code as Action Earns Its Overhead

Code as action introduces real costs: a sandboxed execution environment to manage, container lifecycle to track (Anthropic's PTC containers expire after approximately 4.5 minutes of inactivity), and a broader security surface — a prompt injection that reaches a code executor is a different class of problem than one reaching a schema parser. These costs are not trivial.

The pattern earns them when the workflow is genuinely programmatic: iteration, conditional branching, data aggregation, retry logic, or composition that would otherwise require many sequential model invocations. It does not earn them when your agent orchestrates a small number of well-defined API calls in a fixed order.

The comparison is worth being explicit about:

| Pattern | Action Format | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- | --- |
| ReAct / Tool-Use Loop | JSON / text tool calls | Simple, easy to trace, no sandbox required | Verbose, no native loops, context bloat at scale | Fixed sequential workflows |
| PTC (Anthropic implementation) | Managed execution + allowed_callers | Context hygiene, token-efficient, opt-in per tool | Beta API surface, container lifecycle, no Claude Code support yet | Multi-tool, data-heavy, iterative production workloads |
| CodeAct (research pattern) | Python execution; environment varies | Maximum flexibility, self-correcting | Environment management, broadest security surface | Research agents, exploratory automation, custom runtimes |

A note on the table: "CodeAct (research pattern)" doesn't universally mean unrestricted PyPI access — it means code-as-action with a Python executor; the environment can be minimal or extended depending on the implementation. The key distinction between the research pattern and Anthropic's PTC is the managed environment and the explicit opt-in surface for individual tools.

The official docs recommend choosing either ["direct"] or ["code_execution_20250825"] per tool — not both. This gives the model unambiguous guidance on how each tool is intended to be used.
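In practice that means each definition carries exactly one caller type. The pair below is illustrative only; send_alert is a hypothetical tool, and query_database reuses the definition from earlier in this article.

# Illustrative: one tool reserved for direct, model-initiated calls,
# one reserved for calls from inside the code execution container.
send_alert_tool = {
    "name": "send_alert",
    "description": "Page the on-call engineer.",
    "input_schema": {"type": "object", "properties": {"message": {"type": "string"}}},
    "allowed_callers": ["direct"],  # the model calls this one itself
}

query_database_tool = {
    "name": "query_database",
    "description": "Execute a SQL query. Returns rows as JSON objects.",
    "input_schema": {"type": "object", "properties": {"sql": {"type": "string"}}},
    "allowed_callers": ["code_execution_20250825"],  # only callable from code
}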

The Connection to Programmatic Tool Calling

CodeAct is the research pattern. Anthropic's Programmatic Tool Calling is the production instantiation of the same core insight, with two important additions: the allowed_callers opt-in and explicit caller provenance in every tool_use block.

{
  "type": "tool_use",
  "id": "toolu_xyz789",
  "name": "query_database",
  "input": { "sql": "SELECT ..." },
  "caller": {
    "type": "code_execution_20250825",
    "tool_id": "srvtoolu_abc123"
  }
}

The caller field is not cosmetic. It's how your application routes tool results back into a running code execution container versus back to the model's direct context. Your tool result handler needs to inspect this field and respond accordingly — and remember: while pending tool calls exist, your responses must be tool_result blocks only. Tools called programmatically must return results before the container expires (approximately 4.5 minutes of inactivity — monitor the expires_at field).
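A routing sketch along those lines follows. The helper run_tool is a hypothetical stand-in for your own dispatch logic; the shape of the check is the point: inspect caller, resolve the call, and reply with tool_result blocks only.

import json

def handle_response(response):
    # Hypothetical handler: run_tool() stands in for however your application
    # executes a single tool call.
    tool_results = []
    for block in response.content:
        if block.type != "tool_use":
            continue
        result = run_tool(block.name, block.input)
        if getattr(block, "caller", None) is not None:
            # Emitted from inside the code execution container: the result
            # resumes the running script and stays out of the model's context.
            print(f"resuming container tool {block.caller.tool_id} -> {block.name}")
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(result),
        })
    # While pending calls exist, the next turn must be these blocks and nothing else.
    return {"role": "user", "content": tool_results}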

Current model compatibility (per official docs, February 2026):

| Model | Tool Version |
| --- | --- |
| Claude Opus 4.5 (claude-opus-4-5-20251101) | code_execution_20250825 |
| Claude Opus 4.6 (claude-opus-4-6) | code_execution_20250825 |
| Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) | code_execution_20250825 |
| Claude Sonnet 4.6 (claude-sonnet-4-6) | code_execution_20250825 |

Anthropic's docs currently list availability via the Claude API and Microsoft Foundry.

What PTC Doesn't Yet Cover

If you're building with Claude Code rather than direct API integration, the current state is this: PTC has no documented support in Claude Code as of February 2026, whether via the terminal or the Desktop app. There's an active community feature request (GitHub issue #12836) that's seen significant traction, but no committed timeline from Anthropic.

For most Claude Code workflows, the practical gap isn't acute — native bash execution and parallel sessions in the Desktop app cover a lot of the same ground. But if you specifically need the context hygiene properties of PTC for high-volume production workloads, direct API integration remains the only path today. Given that the infrastructure exists at the API level, it's reasonable to expect this to surface in Claude Code eventually. Watch the release notes.

The Underlying Principle

Strip away the API surface, the beta headers, and the SDK options, and the insight at the center of Code as Action is this: build agent interfaces that align with what the model actually knows how to do.

Tool schemas are local and novel — every application invents its own. Python syntax and idioms are global and heavily represented in training data. When you force a model that's fluent in code to communicate through synthetic JSON function signatures, you're working against the grain of the system. When you let it write Python, you're working with it.

That principle is increasingly validated in production. Anthropic's engineering team built Claude for Excel on PTC specifically because it needed to read and modify spreadsheets with thousands of rows without overloading the model's context window. The pattern that started as a research result — code is a better action primitive than JSON — is now load-bearing infrastructure.

The concert pianist analogy keeps holding. If you want the performance you hired for, hand them the instrument they trained on.
