Talkable documents – your document is listening

People build software now by talking to their phones. Not in some near-future sense. In the actual present. The pattern started on desktop — Karpathy named it "vibe coding" in February 2025, talking into Cursor's Composer via SuperWhisper and barely touching the keyboard — and has since moved onto phones. Someone walks the dog, holds the mic button, says "add a debounce to the search input and put the loading spinner inside the field instead of next to it," and a working change lands in a repo. The ergonomics are wrong in obvious ways — a phone screen is a bad place to read code, voice transcription mangles symbol names, the model occasionally renames a function you didn't want renamed — and yet the loop closes. People ship.

This feels novel. It is novel. But it's also the harder case. The easier case is documents.

Documents are the easy version

Every reason building software by voice is hard goes away when you swap "software" for "document":

  • Software has state machines, build systems, deploy pipelines, and a thousand cross-file invariants. A document has structure and prose. The model already knows prose better than it knows any framework.
  • Software has to run somewhere. A document doesn't run anywhere — it just exists, and the act of opening it is the entire deployment.
  • Software fails in a hundred subtle ways. A document fails by looking wrong, which the user can see in the next second.
  • Software needs tests. A document needs to read right.

If you can voice-program an app on your phone, you can voice-author a document on your phone — and the second one is structurally simpler. The fact that the second one isn't a thing yet, while the first one already is, is a clue that something else has been in the way. What's been in the way is that documents live inside applications that were never designed to be talked to.

Dictation is not instruction

Word has a microphone button. So does Google Docs. They both transcribe. You speak, and what you said appears as text in the document, character for character.

This is not the same thing.

Dictation is "I type with my mouth." Instruction is "I describe what I want, and the document becomes it." Those are different verbs operating on different surfaces. Dictation runs on the same plane as the keyboard — it produces character input. Instruction runs one level above — it produces document changes. The first one needs a microphone. The second one needs a microphone and a model and an in-document execution surface that can take an instruction like "tighten the second paragraph and pull the bullet list into a callout" and apply it.
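The two planes can be made concrete in the document's own language. A minimal sketch in JavaScript, with the model call injected as a parameter — nothing here is a real API, and the names are illustrative: dictation edits the character stream, instruction transforms the whole document.

```javascript
// Dictation: the transcript is character input, inserted at the cursor
// exactly as if it had been typed.
function dictate(doc, cursor, transcript) {
  return doc.slice(0, cursor) + transcript + doc.slice(cursor);
}

// Instruction: the transcript is a request, and the unit being operated
// on is the whole document. `model` is a hypothetical instruction-
// following call, injected so the sketch isn't tied to any real API.
async function instruct(doc, transcript, model) {
  return model({ document: doc, instruction: transcript });
}
```

The signatures carry the argument: `dictate` takes a cursor position because it lives on the character plane; `instruct` takes no cursor at all, because its unit of change is the document.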

The execution surface is the missing piece. Word doesn't have one. Google Docs doesn't have one. The office suite never had one. It assumed a keyboard, because in 1985 a keyboard was the only thing a writer was going to bring to a document, and by the time the assumption became wrong nobody felt like rebuilding the suite.

The shape of the missing piece

This is where reWritable is shaped right.

A reWritable document is a single HTML file with an embedded LLM and a ⌘K rewrite loop. You hit ⌘K, type an instruction, the model receives the document, returns a modified version, and the file re-renders in place. The previous post on this made the case that this turns the document into its own application — that the file you own on disk is the same file you edit by talking to it. That argument was about the architecture.
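That loop is small enough to sketch. The version below is a hedged outline, not reWritable's actual internals: `getDocument`, `setDocument`, and `callModel` are illustrative stand-ins for reading the file's own markup, re-rendering it, and calling the embedded LLM.

```javascript
// A minimal sketch of the in-file rewrite loop. All names are
// illustrative stand-ins, not reWritable's real API.
async function rewriteLoop(getDocument, setDocument, callModel, instruction) {
  const current = getDocument(); // the file's own markup
  const revised = await callModel(
    "Apply this instruction and return the full document.\n" +
    "Instruction: " + instruction + "\n---\n" + current
  );
  setDocument(revised); // the document re-renders in place
  return revised;
}
```

In a single-file HTML document, `getDocument` and `setDocument` would read and write the page's own markup; they are injected here so the loop itself stays a few lines — which is the point: the execution surface is small.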

The argument here is about the input device. ⌘K is a keystroke. There is no reason it has to be. The same loop accepts speech the moment you hand it a transcript, and on a phone, the transcript is the thing the OS already produces when you tap the microphone. The pieces compose:

  • The phone: a microphone, ambient and ready
  • The OS: speech-to-text, already integrated
  • The model: instruction-following
  • The document: an in-file rewrite loop that turns instructions into changes

For the first time in the lifetime of the document-as-artifact, all four are present in one place. The keyboard era of authoring was a function of which of these layers existed. The voice era is a function of all four arriving at once.
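The composition of those four layers fits in a few lines. A sketch with both sides injected: `listen` stands in for the OS speech layer (in a browser, the Web Speech API would play this role), and `rewrite` stands in for the document's instruction loop — neither name is a real platform API.

```javascript
// Wire platform speech-to-text into the document's rewrite loop:
// one finished utterance becomes one instruction. `listen` and
// `rewrite` are injected stand-ins, not a real platform API.
function makeTalkable(listen, rewrite) {
  listen(async transcript => {
    if (transcript.trim()) {     // ignore empty utterances
      await rewrite(transcript); // one utterance, one document change
    }
  });
}
```

The glue is deliberately thin: the OS already produces the transcript and the document already accepts instructions, so "talkable" is a connection, not a feature.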

What this looks like in practice

You are on a train. You open proposal.html on your phone. You hold the mic button.

"Drop the sales projections section. Add a paragraph at the end about the rollout timeline — Q1 onboarding, Q2 expansion, Q3 review."

You let go. The document re-renders. The sales projections are gone. A new closing paragraph names the three quarters. You skim it, hit the mic again — "make the rollout paragraph more conservative, say 'targeted onboarding' instead of just onboarding, and don't promise anything in Q3, just say we'll evaluate" — let go. It rewrites. You read it. It's right.

You get to the office. You open the same file on your laptop. You edit a sentence by hand. You send it. The recipient gets the file on their phone, opens it, holds the mic button, and asks the document a question about itself. The document answers. They tap the mic again to ask for a change. It changes.

None of this is currently the way documents work. All of it is achievable today with software that already exists — file-as-document, embedded LLM, OS-level voice input. The reason it isn't already mainstream is that the document has had nowhere to put the microphone. A reWritable file does.

The objections that come up

"I can already dictate to Word." Dictation produces text at the cursor. Voice instruction produces document-level changes — restructuring, retoning, rewording, rendering. They are not the same loop. One is an input method; the other is an authoring method.

"Voice is too imprecise for serious documents." The imprecision is upstream of the model. It doesn't matter, because the document is on the screen and the result is visible in the next second. Imprecision in input is fine when feedback is immediate. The loop is short enough that you correct by re-saying, the same way you correct a sentence by re-typing it.

"I'm not going to write a tax return on my phone." You are not going to write it on your phone. You are going to finishit on your phone — the line you forgot, the field you want to clarify, the paragraph you want to soften before sending. The asymmetry that matters is that most editing moments don't happen at a desk. They happen in a taxi, in a kitchen, in a bed at 11pm. The desk-bound document is the one you can't reach in the moment you actually want to change it.

"This will produce a lot of bad documents." It will produce more documents, period, because authoring will be available in moments where authoring previously wasn't possible. Some of them will be bad. The same was true when keyboards replaced typewriters and when typewriters replaced quills. The relevant question isn't whether the new mode produces some bad work — it's whether it lets people who previously couldn't author at all start to.

What this changes

The author stops being someone who sits down to write and becomes someone who maintains a document the way you maintain a conversation. The act of editing decouples from the act of being at a desk. The skill that matters is no longer "operate the word processor" — it's "describe the change you want." That skill is much more evenly distributed across the population than the first one ever was.

It also reorders the geography of authoring. The desktop word processor was built around a single workstation: one machine, one keyboard, one screen, one user with both hands free. A talkable document doesn't care where you are. The same artifact responds to keyboard at the desk, to voice on the phone, to a colleague's voice on their phone the next day. The document is the constant. The input device is whatever you happen to be holding.

That is, finally, the shape of authoring that the read/write web was supposed to deliver more than thirty years ago and didn't, because the document-as-coordination-protocol got captured by an office suite that had no place to put a microphone.

The microphone has somewhere to go now.

The document is listening.
