I've been building tmuxllm — an agentic harness that runs in a tmux pane. Most of the recent design work has been around extending it: skills, agents, multi-agent coordination, conversational provisioning. Each extension started as an architectural question that felt new and ended in roughly the same place: the right answer was something Unix already provides, and the wrong answer was a layer that recreated it badly. After enough of these, the pattern stopped being a coincidence and became a thesis.
My thesis: the CLI is the right endpoint for LLM-consumed services because the shell environment provides — as substrate properties, not as features — exactly the capabilities LLM workflows need. Frameworks that wrap services in their own protocols are reinventing these capabilities, sometimes worse, while losing the LLM-friendliness of text-mediated tool use.
What LLM workflows actually need
If you watch an LLM-driven workflow play out, three properties show up over and over:
The first is flexibility of composition. An LLM-driven workflow is hard to predict in advance. The user's request might decompose into two steps or twelve. It might branch on intermediate results. It might need to backtrack and retry when an approach fails. It might need parallel exploration of alternatives. A rigid API doesn't accommodate this; a CLI invoked from a shell does, because the shell already has every composition primitive that's ever been needed — pipes, redirection, subshells, process substitution, parallel, xargs, make.
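To make that concrete, here's a sketch of what run-time decomposition looks like when the primitives are already there. The skill commands are hypothetical; everything around them is stock shell.

```bash
# Hypothetical skill commands; the glue is plain shell.
search-skill "$query" > hits.txt                        # persist an intermediate
grep -vi spam hits.txt | head -n 20 | summarize-skill - # branch on its contents
fetch-skill "$url" || fetch-skill --mirror "$url"       # backtrack and retry
fetch-skill "$url_a" & fetch-skill "$url_b" & wait      # explore in parallel
```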
The second is persistent state across invocations. LLM workflows often span multiple turns, sometimes across days. Intermediate results need to persist. Authentication tokens need to live somewhere. Cached responses save real money on real API bills. The CLI inherits all of this from the substrate it runs on — the filesystem, environment variables, the user's shell session, the conventional locations defined by the XDG Base Directory Specification.
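A sketch of what inheriting that substrate looks like in practice; the service name and paths are illustrative:

```bash
# Session state in environment variables, durable state in files,
# conventional locations from the XDG Base Directory Specification.
export SVC_TOKEN="$(cat "${XDG_CONFIG_HOME:-$HOME/.config}/svc/token")"
cache="${XDG_CACHE_HOME:-$HOME/.cache}/svc/last-response.json"
[[ -f "$cache" ]] && cat "$cache"   # yesterday's API response, free today
```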
The third is capturability of working sequences. Once an LLM figures out a working sequence — the right yt-dlp flags, the right awk to clean the output, the right pretext invocation — that sequence wants to be captured for reuse. With a CLI, capture is trivial: you write a shell script. With an API, capture means writing client code in some language with some library. The friction differential matters for whether captured workflows actually accumulate over time.
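Capture really is this cheap. A sketch of a captured sequence, with the awk cleanup illustrative rather than exact:

```bash
#!/usr/bin/env bash
# grab-audio.sh -- a working sequence the LLM found once, kept as a file.
set -euo pipefail
yt-dlp -x --audio-format mp3 -o '%(title)s.%(ext)s' "$1" \
  | awk -F': ' '/Destination/ {print $2}'   # keep only the output filename
```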
The CLI hits all three properties not because it's a particularly clever endpoint format, but because it's the endpoint format that's already embedded in an environment that solves these problems. The shell solves composition. The filesystem solves persistence. Shell scripts solve capture. The CLI inherits these by living in the shell. LLM-driven workflows have specific properties, those properties map exactly onto properties the shell environment provides natively, and so putting LLM-callable services in the shell is a structural fit.
The "LLM-friendly" piece is doing real work
LLMs are good at certain things and bad at others, and the asymmetry is sharper than people often acknowledge. They're good at generating text that follows conventional formats, reading and producing structured but flexible data like JSON or markdown, reasoning over text-based intermediate state, and composing operations they've seen documented patterns for. They're bad at maintaining complex stateful objects across turns, following protocols with strict invariants the model can't see, reasoning about systems whose behavior isn't representable as text, and handling operations with lots of opaque internal state.
The CLI is "LLM-friendly" specifically because every property of CLI-mediated work falls in the strength zone. CLI calls are text. Output is text. Composition is text — a shell script is just text. State is files, which are text or text-readable. Errors are text, with exit codes and stderr that can be reasoned about. Documentation is text, in man pages and --help output the model can request and read.
Compare this to a framework that exposes services through Python objects with state and methods. The model has to maintain a mental representation of which objects exist, what methods they have, what their state is, how method calls mutate it, how exceptions propagate. That's exactly the kind of thing models are bad at. Compositions break in subtle ways because the model lost track of state somewhere between method calls.
The CLI eliminates this whole category of error by making everything text and all state external. The model doesn't need to maintain a mental object graph; it generates the next command, reads the output, generates the next command. Each step is bounded and inspectable. The model's job becomes much easier — and, not coincidentally, the system's failure modes become much easier to debug.
This isn't an accident. The shell's design — text in, text out, state in files — happens to match the modality LLMs work best in. They're text engines; the shell is a text-mediated environment. The fit is structural.
Why protocols are the wrong abstraction for the local case
The current dominant approach in the LLM-tool space is protocol-based — most prominently the Model Context Protocol (MCP), but also various proprietary tool-call schemas. The argument for protocols is real: they give you typed schemas, structured arguments, validation, cross-language interop. For some cases, those are genuine wins.
But for the primary case of "LLM uses local tools to do work," the protocol approach is structurally worse than the CLI approach for three reasons.
The first is that protocols force composition through the model. When you have two protocol-based tools, A and B, and you want B to consume A's output, the LLM is the composition layer. It calls A, reads the result, formats arguments for B, calls B. Every intermediate step has to be reasoned about by the model. With CLIs, you write A | B and the shell handles it. The LLM doesn't need to be in the loop for the plumbing.
This isn't just inefficient — it's a token cost that compounds. A pipeline with five tool calls has five LLM invocations to handle the orchestration, each carrying full context. The same pipeline as a shell command has one LLM invocation that produced the command, and the rest is shell execution.
The second is that protocols recreate state mechanisms badly. The shell environment already has them: environment variables, files, the session itself. A protocol-based tool layer typically reinvents these as its own state model — sessions, contexts, cached tool results — living inside the framework. State is then fragmented across two systems (shell and framework), the LLM can only see one of them at a time, and debugging requires understanding both. A CLI-based tool layer, by contrast, just uses the shell's state, which the model can inspect with ls, cat, and env.
The third is that protocols don't compose with existing Unix tools. If your workflow is "fetch data from service X, filter with jq, transform with awk, send to service Y," the CLI-wrapped version of X and Y slots cleanly into the pipeline alongside jq and awk. The protocol version requires the LLM to mediate the filter and transform steps — either by calling more protocol tools (if they exist for these operations) or by reasoning over JSON in its own context (which burns tokens). The CLI version inherits decades of Unix tooling for free; the protocol version is starting from scratch on capabilities the shell has had for fifty years.
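Concretely, with x-cli and y-cli standing in as hypothetical wrappers for services X and Y:

```bash
# One LLM invocation produces this line; the shell runs the plumbing.
x-cli fetch --json \
  | jq -c '.records[] | select(.score > 0.5)' \
  | awk '{print tolower($0)}' \
  | y-cli import --stdin
```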
However, I want to be careful here. Protocols aren't wrong everywhere. For genuinely remote services where you'd be making HTTP calls anyway, the protocol layer is doing real work. For sandboxed environments where shell access isn't available, you need something, and a protocol is reasonable. For services that need bidirectional streaming, protocols handle this better than CLIs do. But these are exceptions. The default case — local services the LLM uses to do work — fits CLIs better than protocols.
The architectural answer that drops out of this is: local CLIs are the primary surface; remote services get reached through CLI wrappers (gcloud, aws, kubectl, mcp-cli) that turn them into local commands. The wrapper layer is where the protocol stuff lives, hidden behind --help text. Once it's wrapped, composition is shell-native again. Same thesis, with one extra step at the boundary between local and remote.
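A wrapper can be very thin. A sketch, with the endpoint and the response shape assumed:

```bash
#!/usr/bin/env bash
# weather -- turns a remote HTTP service into a local command.
# The protocol lives in here; the LLM only ever sees --help and text.
set -euo pipefail
if [[ "${1:-}" == "--help" ]]; then
  echo "usage: weather CITY [--json]"; exit 0
fi
resp="$(curl -fsS "https://api.example.com/weather?q=$1")"
if [[ "${2:-}" == "--json" ]]; then
  echo "$resp"                      # structured, for pipelines
else
  echo "$resp" | jq -r '.summary'   # readable, for humans and casual use
fi
```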
What this implies for architecture
Once you accept the thesis, several architectural decisions become forced moves rather than preferences. I'll work through five.
Skills should be CLI commands. A skill is a unit of procedural knowledge — the LLM equivalent of "how to do X." If skills are exposed as anything other than CLI commands, you've introduced a non-text interface between the LLM and the capability, which is exactly the friction we're trying to avoid. A skill is a directory with a SKILL.md, possibly some bundled scripts, and an entry that's invocable from the shell. Nothing more architecturally complex than that.
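An illustrative layout, nothing more:

```bash
$ tree skills/summarize
skills/summarize
├── SKILL.md     # procedural knowledge in prose: what, when, example calls
├── summarize    # executable entry point, invocable from the shell
└── clean.awk    # bundled helper the entry point calls
```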
State should be filesystem-resident. JSONL logs, skill libraries, agent definitions, task queues — all files. Anything in opaque framework state is invisible to the LLM and prone to drift. The model can ls a directory, cat a JSON file, tail -f a log. It can't introspect a Python object graph in the same way. Make state visible to the substrate the model is reasoning in.
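Under this rule, inspection costs nothing. Paths here are illustrative:

```bash
ls ~/.local/share/tmuxllm/skills/               # what capabilities exist
tail -n 5 ~/.local/share/tmuxllm/log.jsonl      # what just happened
jq .status ~/.local/share/tmuxllm/tasks/7.json  # one task's current state
```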
Composition should be shell-mediated. Pipes, scripts, make, xargs. Not a custom DSL. Whatever composition primitive you might build, the shell already has, and the LLM can already reason about it because shells are heavily represented in training data. Custom DSLs are training-data-poor; shell is training-data-rich. The model knows shell.
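For instance, fan-out needs no orchestrator, because xargs already is one. summarize-skill is hypothetical:

```bash
# Run a skill over many inputs, four at a time, one output file each.
ls notes/*.md | xargs -P 4 -I {} sh -c \
  'summarize-skill "$1" --json > "out/$(basename "$1").json"' _ {}
```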
Output should be text-first, JSON-on-demand. Default to human-readable output; offer --json for structured output. Same convention as git status versus git status --porcelain. The text version is for humans and casual LLM consumption; the JSON version is for skills piping into other skills. Both are text, both are inspectable, neither hides state from the model.
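The convention costs a few lines in a skill's entry point. A sketch, with names and paths illustrative:

```bash
#!/usr/bin/env bash
# tasks-skill -- text by default, structured output on request.
set -euo pipefail
count=$(ls "${XDG_DATA_HOME:-$HOME/.local/share}/tmuxllm/tasks" | wc -l)
if [[ "${1:-}" == "--json" ]]; then
  printf '{"pending_tasks": %d}\n' "$count"   # for skills piping into skills
else
  echo "Pending tasks: $count"                # for humans and casual reads
fi
```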
Errors should be loud and locatable. Exit codes, stderr messages, traceable to specific commands. The LLM should be able to read the failure and reason about it. Silent failures are the worst case — a tool returns "success" but produces wrong output, and the model proceeds confidently. Loud failures with locatable causes are debuggable. Build for debuggability.
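A sketch of the pattern, with fetch-skill hypothetical:

```bash
#!/usr/bin/env bash
set -euo pipefail   # unset variables and pipeline failures fail loudly
if ! out=$(fetch-skill "$1" 2>&1); then
  echo "fetch-skill failed for '$1': $out" >&2   # name the command, surface stderr
  exit 1                                         # locatable via exit code
fi
echo "$out"
```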
These aren't five separate choices — they're the same choice applied at five layers. Each one says "stay in the substrate the LLM is good at." Together they constitute the architectural commitment.
The convergent evidence
Boris Cherny, who leads Claude Code at Anthropic, has described it as "a Unix utility rather than a traditional product, built from the smallest building blocks that are useful, understandable, and extensible." The architecture invests in deterministic infrastructure (context management, tool routing, recovery) rather than decision scaffolding (explicit planners or state graphs). The core philosophy: "the product is the model" — expose the model as directly as possible, with a minimal set of tools and minimal scaffolding.
The Claude Code core loop is small. The state model is minimal — essentially a message array. The tools are CLI-shaped: read files, edit files, run shell commands, search. There is no workflow engine. There is no graph of state transitions. There is no DSL for composition. The model drives the loop, the shell executes the work, files persist the state.
This is the same architectural answer tmuxllm has been arriving at, applied to software engineering specifically. Two independent efforts converging on the same architecture is meaningful evidence that the architecture is right. It's not "two different bets"; it's "two implementations of the conclusion that careful thought about LLMs plus tools converges on."
Most "AI agent" frameworks in the field today are doing the opposite. They're building protocol-based tool layers, custom state models, workflow DSLs, framework-specific composition primitives. The argument they're implicitly making is that these abstractions are the right ones. The convergent evidence from tmuxllm and Claude Code suggests they aren't — that the substrate the agent is running on already provides better versions of all of these, and that frameworks reinventing them are losing more than they're gaining.
A boundary case worth holding onto
If LLMs improve in ways that change what they're good at — say, they become genuinely good at maintaining complex stateful object graphs, or they get fast enough that the orchestration cost of mediating every step becomes negligible — the thesis weakens. The argument for CLIs over protocols rests on current model strengths and weaknesses. Those could change.
As the field stands in 2026, the thesis holds. Models are still text engines. They still struggle with opaque state. They still benefit from inspectable, file-based intermediates. The shell is still a better fit than the protocol for local agentic work. But this is an empirical claim about current models, not a permanent architectural truth. If the underlying facts shift, the architecture should shift with them.
The thing I find more durable, though, is the meta-pattern: when designing systems for LLMs, the right move is to look at what the LLM is actually good at and pick the substrate whose properties match. Today that's text and shell. Tomorrow it might be something else. The discipline isn't "always use shells"; it's "match the substrate to the model's strength zone, and don't recreate substrate properties in framework code."
What I'd watch for next
The agent frameworks that survive will be the ones that lean into substrate inheritance rather than fighting it. The ones that build the most polished framework abstractions will accumulate maintenance burden faster than they accumulate capability, and will eventually be replaced by lighter alternatives that just call out to the shell.
The MCP ecosystem will bifurcate: well-maintained CLI wrappers around remote services are likely to thrive; pure-protocol local-tool implementations will probably lose ground to skills-and-shell-scripts approaches. The wrapping pattern (mcp-cli and similar) is the bridge between the two worlds, and it'll get more traffic over time.
Skill libraries will become the unit of accumulated competence in agentic systems. Not vector-stored conversation history, not learned behaviors baked into model weights, not framework-specific plugins — but versioned, filesystem-resident, shell-invocable skills authored over time and shared across users. The compounding asset is the library, not the agent.
The "AI agent operating system" framing will keep getting attempted, and the attempts will keep failing, because the operating system already exists. It's Unix. The job isn't to build a new OS for agents; it's to build a thin runtime that lets agents use the existing one well.
The synthesis
After enough architectural turns, my thesis is an opinionated bet on a single claim: LLMs work best when their capabilities are exposed as CLI commands in a shell environment, and the consequences of that claim consistently applied form a useful system.
Skills are CLIs. Composition is shell. State is filesystem. Coordination is process trees. Distribution is files. Each is the substrate-native answer to a problem most frameworks solve with their own machinery. Each is better-fitted to LLM consumption because text-and-process is the LLM's native medium.
The architecture is a position. The position is falsifiable. As of 2026, it holds. The work from here is to keep building so the implementation supports the claim — and to publish the comparisons and benchmarks that would let other people falsify it if they think it's wrong.
If you're designing a system for LLMs to use, the question worth asking before you reach for a protocol is: what does this protocol give me that the shell doesn't? If the answer is "remote access, sandboxing, or bidirectional streaming," fine, build the protocol. If the answer is "structured types, validated arguments, or cross-language calls," consider that the shell has all of these in different forms (--help text, exit codes, JSON output, command lookup) and that recreating them costs more than reusing them.
Unix solved most of these problems satisfactorily a long time ago. The work of LLM tool integration is mostly the work of recognizing which problems Unix already solved and not solving them again.