Unix

Unix solved this satisfactorily in 1971. We took two days to figure that out.

A workshop app needed to execute model-generated Python safely. We tried Pyodide, considered Firecracker, built a Docker-in-Docker sandbox, hit real friction — and then a tangent about macOS agent users reminded us that the Unix multi-user model has been solving this problem for over fifty years.

I spent two days working with Claude Code designing and building a sandboxed code executor for a small workshop PWA. This piece documents what we got wrong, what we got right, and what we only understood after the code was running. Every decision and reversal described here is reflected in the commits that shipped. The interesting parts are not the final architecture — which is straightforward — but the path we walked to get there, including the two-and-a-half wrong turns along the way.

I say "we" deliberately. The design session was a running conversation between me and Claude Code in the terminal. I'll quote the AI where its exact phrasing matters and narrate the rest.

The problem

We run a small Playground — a PWA that workshop participants use for multi-model chat, HTML generation, and an email-driven process runner. Vanilla-JS frontend, Node backend, deployed as a Docker Compose stack on a single VPS. No framework, no build step, one operator, a few dozen authenticated users at any given moment.

We wanted to add support for Agent Skills — the format for packaging reusable LLM capabilities as zips containing a SKILL.md plus optional bundled scripts. The existing Skills tab already let users browse and download skills. What we wanted next was a runtime: let users activate a skill in the Chat tab and have the model actually use it, including executing the Python scripts it ships.

The question: how do you execute model-generated Python safely, on a single VPS, for authenticated users, with real library support, without building serverless-grade infrastructure?

The competitive set is wider than you think

When you frame it as "what isolation primitive for code execution," the option space includes at least seven contenders: Pyodide (WASM in the browser), Docker subprocess on the host, Docker plus gVisor, Firecracker microVMs, external services like e2b.dev or Modal, V8 isolates or WASI on the server, and lightweight namespace tools like nsjail or bubblewrap. We ended up evaluating three of these seriously and dismissing the rest — and the one we should have considered from the start didn't make the list until day two.

Pyodide: the library-coverage problem

The initial instinct was Pyodide. Run Python in the browser via WASM, inside the sandboxed iframe we already use for the Code tab's HTML preview. Zero server cost, ships many of the common scientific Python packages pre-built. The isolation is whatever the browser gives you. For a workshop app, that's a strong default.

But Pyodide's real limitation isn't the lack of multi-language support, though that matters too. It's library coverage. We needed to run Anthropic's official pdf skill, which depends on pypdf, pdfplumber, pypdfium2, reportlab, tesseract (via pytesseract), and poppler (via pdf2image). Of those, pypdfium2 requires native bindings that don't compile to WASM. Tesseract is a system binary. Poppler is a system binary. Three of the six core dependencies are non-starters.

For pure-Python data skills, Pyodide is genuinely strong. For real-world skills that bundle whatever interpreter and toolchain the author needed, it's a dead end. We moved on.

Firecracker: right tool, wrong project

With multi-language and native dependencies on the table, Claude Code reached for Firecracker — AWS's microVM runtime, the one Lambda uses to isolate tenants. Sub-200ms boot times in AWS's published benchmarks. Hypervisor-level isolation. It looks perfect for "run untrusted code safely, in any language."

I pushed back, and the pushback isn't about capability. Firecracker can do this. The pushback is about right-sizing. The decisive question: what does Firecracker buy you over Docker? For multi-language support, nothing — any Docker image gives you any language. For isolation, it buys you hypervisor-level separation, which matters when you're running code from thousands of untrusted strangers on shared hardware. AWS trusts Firecracker because their tenant count makes kernel-escape CVEs a realistic attack vector.

We have at most a few dozen authenticated workshop users. They signed up, they got a magic link, they're in. The threat model isn't "defend against internet-scale adversarial input." It's "someone accidentally runs a wrong rm, or someone writes a script that tries to phone home." That's a different problem, and it doesn't require a hypervisor to solve.

The operational cost of Firecracker on a single VPS is also real: KVM availability check (many cheaper VPS providers don't even expose /dev/kvm), guest kernel build and ongoing patch cadence, rootfs image pipeline, jailer configuration per VM, per-VM tap devices with NAT and iptables, a lifecycle manager, and panic/crash recovery handling. For a single operator, that's structural overhead for isolation the threat model doesn't demand.

At this point the argument was abstract — "more moving parts, more ops." It became concrete later.

The scope trap

With Docker chosen as the runtime, we had to design the execution contract. The first cut was clean: skills ship pre-authored scripts inside their zip, the Playground exposes a tool to the model for each script, the model picks one and calls it, the user approves via a permission modal, Docker runs it, output comes back.

I was confident about this scope. Then I looked at a real skill.

Anthropic's official pdf skill is 315 lines of SKILL.md. It has eight bundled scripts in scripts/, yes. But the bulk of the file is a tutorial showing the model how to use pypdf, pdfplumber, and reportlab directly — with ad-hoc Python snippets the model is expected to write and execute on the fly. The bundled scripts cover only the tricky reusable operations. The rest is example code the model is supposed to adapt and run.

In Claude Code or Codex, that "running it" is a native bash tool call. In our Playground, if we shipped the narrow scope, there would be nowhere for this code to go. The model would produce correct Python and the user would see unrendered text.

Ten minutes of reading one unzipped skill corrected a design decision that would have cost a week to unwind. We switched to a unified run_code tool: one endpoint, any source, the skill's directory mounted read-only. If the model wants a bundled script, it calls it from within whatever code it writes. No artificial distinction between curated and ad-hoc execution.

The permission model had to change with it. The original per-script cache — "always allow this specific script" — rested on the assumption that the script was fixed. Under the new design, every call carries fresh source. You can't meaningfully cache consent for code that doesn't exist yet. We replaced it with a per-call approval modal and a per-chat opt-in checkbox ("Allow all code execution in this chat"), stored in sessionStorage so it dies on reload. Less clever, but honest. Caching consent for fresh code each invocation is security theater.
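The per-chat opt-in is simple enough to sketch as a pure function. This is an illustrative reconstruction, not the Playground's actual code: the key format and function names are invented, and `store` stands in for sessionStorage so the opt-in dies with the tab, exactly as described above.

```javascript
// Illustrative sketch of the per-call approval check. `store` is any
// sessionStorage-like object; there is no per-script consent cache to poison.
const allowKey = chatId => `allow-code-${chatId}`;

function needsApproval(store, chatId) {
  // Every run_code call gets a fresh modal unless the user opted in
  // for this specific chat session.
  return store.getItem(allowKey(chatId)) !== 'true';
}

function recordOptIn(store, chatId) {
  // Set when the "Allow all code execution in this chat" box is checked.
  store.setItem(allowKey(chatId), 'true');
}
```

The point of the shape: consent is scoped to a chat and a browser session, never to a piece of code, because the code is different every time.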

We built it. It didn't run.

We implemented the Docker subprocess executor over four commits. Per-turn docker run with --network=none --cap-drop=ALL --security-opt no-new-privileges --memory=512m --cpus=1.0 --pids-limit=128, read-only bind mounts, a curated playground-python image with Python 3.12 and all the PDF skill's dependencies baked in. Three API endpoints: upload, execute, download output. A tool-use loop in the frontend that switches from streaming to non-streaming when a skill is active, renders the permission modal inline, and records audit entries with source, stdout, stderr, and downloadable output files.
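The per-turn invocation can be sketched as an argument builder for spawn. The flags are the ones listed above; the mount targets, image name, and entry script are illustrative assumptions, not the shipped code.

```javascript
// Sketch of the per-turn sandbox invocation. Flag values match the text;
// paths and the entry script are illustrative.
function sandboxArgs({ execDir, skillDir }) {
  return [
    'run', '--rm',
    '--network=none',                      // no network inside the sandbox
    '--cap-drop=ALL',
    '--security-opt', 'no-new-privileges',
    '--memory=512m', '--cpus=1.0', '--pids-limit=128',
    '--mount', `type=bind,source=${skillDir},target=/skill,readonly`,
    '--mount', `type=bind,source=${execDir},target=/exec`,
    'playground-python',                   // curated image: Python 3.12 + PDF deps
    'python', '/exec/input/main.py',
  ];
}
// The executor would then call spawn('docker', sandboxArgs({ ... })).
```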

A code review caught four blocking bugs and two security issues — camelCase versus snake_case in the API contract, a missing query parameter, an infinite loop on the 11th tool call that would burn API credits silently, and a cross-user exec directory collision where two simultaneous users could share input/output directories because turn IDs weren't namespaced by user. All six fixed in one follow-up commit.

Then we went to test it.

Claude Code surfaced the problem before I opened a browser: "Before we go further — one real gap I need to surface: the api container can't spawn Docker."

The executor calls spawn('docker', ['run', ...]) from inside server/index.js. But the api container is node:20-alpine. It has no Docker CLI and no mount of the host's /var/run/docker.sock. Spawn would return ENOENT. This wasn't addressed in the design.

We wired Docker-in-Docker: added docker-cli to the api container at startup, mounted /var/run/docker.sock from the host, and immediately hit the next problem. When the api container builds a docker run --mount source=/app/data/exec/foo command, the host Docker daemon interprets that path on the host filesystem, not inside the calling container. /app/data/exec/foo doesn't exist on the host. The real path is something like /Users/martintreiber/.../server/data/exec/foo.

We solved it with a five-line toHostPath helper and an environment variable (HOST_SERVER_DIR=${PWD}/server) that maps container paths back to host paths. Then we fixed a port drift between the running stack (8082) and the compose file (8081) that cost a few minutes of pure infrastructure-state debugging.
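The helper is small enough to reconstruct from the description. This sketch assumes the server's code lives at /app inside the api container and that HOST_SERVER_DIR points at the corresponding host directory; the real helper may differ in details.

```javascript
// Sketch of toHostPath: a bind-mount source handed to the host Docker
// daemon must be a *host* path, so rewrite the container prefix.
function toHostPath(containerPath, hostServerDir) {
  if (!containerPath.startsWith('/app/')) {
    throw new Error(`not a translatable container path: ${containerPath}`);
  }
  // /app/data/exec/foo  ->  <HOST_SERVER_DIR>/data/exec/foo
  return hostServerDir + containerPath.slice('/app'.length);
}
```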

Four pieces of glue, each requiring thought:

  1. The socket mount — and understanding that it gives the api container root-equivalent access to the host Docker daemon.
  2. Installing docker-cli in the container at startup.
  3. The HOST_SERVER_DIR env var plus path translation so the daemon's bind mounts resolve correctly.
  4. The port drift, which is the kind of config-state bug that only surfaces when you actually run the thing.

This is where the Firecracker argument stopped being abstract. If even the Docker path required an afternoon of non-trivial wiring, multiply that against the Firecracker list — KVM verification, guest kernel build, rootfs image, jailer config, per-VM tap devices, NAT and iptables, lifecycle manager — and you're looking at five to ten times the effort, plus recurring maintenance every time a kernel CVE drops.

To be fair, some of this Docker wiring is local-dev-only. In production, the api process runs on the host and talks to the host Docker daemon directly — no socket mount, no path translation. But even in that best case, Firecracker's production equivalent doesn't simplify, because the KVM, kernel, and jailer requirements are production requirements, not dev-env artifacts.

We also introduced a new attack surface. Mounting /var/run/docker.sock into the api container means any compromised Node process can docker run --privileged -v /:/host ... and own the VPS. We accepted this for local dev because the api container's attack surface is bounded to code we wrote. For production, the plan was a docker-socket-proxy (the tecnativa image is the standard pattern) that whitelists only the Docker API calls the executor needs and rejects --privileged, arbitrary bind mounts, and host networking.

The feature worked. End-to-end test passed: nginx → api container → auth → profile whitelist → Docker-in-Docker via socket → sandbox container → output capture, 171ms round-trip, network correctly blocked, skill directory mounted read-only.

But something about the architecture felt wrong. Three services in the compose file. A separate Docker image to build and maintain. A socket proxy to deploy. Path translation. The whole thing worked, but the complexity didn't match the problem.

The Unix epiphany

Around the same time, I was thinking about a different problem: running AI agents on my Mac. The idea was to give each agent a dedicated macOS user account — not an admin, just a standard user with full rights within its own home directory and nothing else. It's a simple blast-radius reduction: if the agent process is compromised, it can manipulate files it can reach, but it can't alter system settings, manage other users, or install software system-wide.

Working through the details, I realized something obvious that I'd been overlooking: macOS is Unix. The user/group permission model, UID/GID, file ownership, process isolation by user — this is Unix technology from the early 1970s, and it's been a production multi-user isolation layer for over fifty years. On macOS, you get the same user model plus Apple-specific tooling like launchd. On Linux, you get the same model plus namespaces and cgroups.

And then the connection landed. On the Playground, I have users. They authenticate, they have email addresses, they have sessions. The api container runs Linux, which ships the complete Unix user-account machinery. I can create Linux users inside that container — one per authenticated Playground user — and run their code under their own UID. No sub-containers. No Docker daemon. No socket mount. No path translation. Just useradd, unshare, and gosu.

The Unix multi-user model is arguably the most battle-tested isolation primitive in computing. It's been hardened across billions of deployments. Every sysadmin on the planet understands it. And somehow, in two days of designing sandboxes, neither I nor Claude Code had considered it until a tangent about macOS made it impossible to ignore.

The UID-in-container model

The new design is simple enough to sketch in a paragraph. Every authenticated user gets a Linux UID inside the api container, provisioned lazily on first code execution. Each run_code call spawns the user's code via unshare -p -m --fork (per-turn PID and mount namespaces, so the child sees only its own process and its own mount view), then gosu drops to the user's UID/GID before exec-ing Python. Network policy is iptables owner-match rules applied once at container startup, keyed on the executor group's GID: all UIDs in the 10000–65000 range (an arbitrary reservation that avoids system UIDs) are blocked from reaching localhost, RFC 1918 addresses, and link-local addresses. The api process runs as root (UID 0, outside the executor range) and is unaffected by the rules.
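The two moving parts — the per-turn spawn and the startup egress rules — can be sketched in a few lines. Names and exact flags are illustrative; the text describes keying the rules on the executor group's GID, while this sketch uses iptables' equivalent UID-range owner match for the 10000–65000 reservation.

```javascript
// 1) Per-turn spawn argv: fresh PID + mount namespaces, then a *real*
//    credential drop via gosu (no user-namespace mapping) before Python runs.
function runAsUser(uid, gid, scriptPath) {
  return [
    'unshare', '-p', '-m', '--fork',   // per-turn PID and mount namespaces
    'gosu', `${uid}:${gid}`,           // genuine setuid/setgid
    'python3', scriptPath,
  ];
}

// 2) Startup egress rules: block the executor UID range from localhost,
//    RFC 1918, and link-local ranges. Generated once, applied at container
//    start; the api process runs as UID 0 and is outside the range.
function egressBlockRules(uidMin, uidMax) {
  const blocked = [
    '127.0.0.0/8',                     // localhost
    '10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16',  // RFC 1918
    '169.254.0.0/16',                  // link-local
  ];
  return blocked.map(cidr =>
    ['-A', 'OUTPUT', '-d', cidr,
     '-m', 'owner', '--uid-owner', `${uidMin}-${uidMax}`,
     '-j', 'REJECT'].join(' '));
}
```

Because gosu performs an actual credential change, the sockets the child opens carry the sandboxed UID, which is exactly what the owner match inspects.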

The dockerproxy service is gone. The playground-python image is gone. DOCKER_HOST is gone. toHostPath is gone. The socket mount is gone. The compose file goes from three services to two. Estimated cold-start drops from 500–1500ms (per-turn docker run with overlay setup, seccomp evaluation, and network initialization) to roughly 15ms (unshare plus gosu plus mount setup — estimated, not benchmarked yet, but the operations involved are syscalls, not daemon round-trips).

The container itself is the isolation boundary from the host. Inside, per-user UIDs plus per-turn namespaces give user-versus-user isolation. This is not a novel architecture. It's how multi-user Unix systems have worked since before most of us were born.

The iptables discovery that almost broke it

The design wasn't completely straightforward. An earlier iteration used unshare -U --map-user=UID — a user namespace that maps the sandboxed UID into the container's user table. This avoids needing CAP_SYS_ADMIN on the container, which felt like a cleaner security posture. The mount setup worked. The UID drop worked. Everything looked right.

Then we ran the empirical probe. Two HTTP requests from inside the sandbox — one to localhost:3000, one to the Docker bridge IP — both reached their targets. The iptables hit counters read zero. The rules existed but had never fired.

The reason is subtle and worth understanding. When you create a user namespace with --map-user, the kernel tags sockets created by processes inside that namespace with the outer UID and GID — typically root — not the mapped inner UID. The iptables -m owner --gid-owner match reads sk_gid from the socket structure, which is a kgid_t in the init user namespace. The mapped UID never appears there. The rules are syntactically valid and semantically inert.

The fix was to drop the user namespace entirely. unshare -p -m --fork (PID and mount namespaces only, no -U), with gosu performing a real setuid/setgid inside the container's init user namespace. Because it's a real credential change, not a namespace mapping, the resulting sk_uid/sk_gid on any socket the child opens matches what iptables expects. Re-ran the probe: both URLs blocked, rules 1 and 3 each recorded one packet hit.

The cost is CAP_SYS_ADMIN on the api container, because unshare -m (mount namespace) without a user namespace requires it. SYS_ADMIN is a broad capability — it lets the process mount things, join namespaces, and perform various administrative operations inside the container. It does not grant host-boundary escape; Docker's default seccomp profile and the container's mount namespace still enforce that boundary. But it widens what a compromised api process could do within its own container, and that's a real tradeoff we're accepting for the threat model we have.

The calibration ladder

The honest version of "which option is right" isn't a single answer. It's a ladder, with each rung triggered by a specific change in the threat model:

Unix UIDs inside the container (what we shipped) is correct while you have authenticated users, a small cohort, no tenant-sensitive data on the same host, and Linux already in the stack. This is where we are.

Docker plus gVisor (--runtime=runsc) becomes correct when you add skill upload from untrusted authors, expose execution to unauthenticated users, or start handling data where kernel-escape blast radius matters. It's a drop-in replacement — one package to install, one flag to add. Google Cloud Run uses it to isolate tenants. Claude Code's self-assessment was honest here: not naming gVisor in the original comparison was a genuine omission. It's the "strengthen the sandbox later without rearchitecting" lever.

Firecracker becomes correct when you're running skills for many mutually-untrusted tenants simultaneously, or when a compliance review strongly favors hypervisor-level isolation. The operational cost is justified by the threat model, not by defense-in-depth as an aesthetic preference.

External services (e2b.dev, Modal, Cloud Run) become correct when the ops burden of running your own execution infrastructure outweighs the per-call cost — typically a smaller team that doesn't want to run Docker at all, or a throughput profile that's very spiky and hard to size locally.

None of the triggers for the higher rungs are plausible for our project in the next six to twelve months. If one appears, gVisor is the first escalation, not Firecracker — and they should be compared head-to-head at that point, because gVisor may still be the right answer depending on which trigger fires.

What this cost and what it bought

The full implementation — both the Docker version we built and then replaced, and the UID-in-container version we shipped — took two days of working sessions with Claude Code. The Docker version taught us things the UID version needed (the run_code tool contract, the permission modal, the audit trail, the file upload/download flow). The wiring pain taught us that the Docker approach's complexity was disproportionate to the problem. And a completely unrelated thought experiment about macOS agent users pointed us to the solution we should have seen from the start.

If there's a general pattern, it's that honest design iteration depends on being willing to flip your own prior recommendation when new information shows up. The first runtime choice was workable but over-engineered. The first scope decision felt clean and was wrong. The first permission model felt clever and was wrong. The iptables-with-user-namespaces approach looked correct and was silently broken. Every correction made the design better. None of them would have happened under a "commit to the plan" mentality.

The Unix multi-user model is the most underused isolation primitive in the current wave of AI agent infrastructure. People are reaching for containers, microVMs, and cloud sandboxes — all legitimate tools for legitimate threat models — while the operating system they're running on has shipped a battle-tested, well-understood, perfectly adequate user isolation layer since before the first Ethernet cable was plugged in. It's not sufficient for every scenario. But for a workshop app on a single VPS, it's not just sufficient. It's the right answer. It took us two days and three wrong turns to see it.
