Somewhere in your engineering org right now, there's a pull request that has been open for four days. It has 847 changed lines across 23 files. It was mostly produced by an agent. It has two approvals, both of which took under three minutes — you can tell because the timestamps are right there in the GitHub timeline. Nobody left a comment.
Pull requests were built for human-scale code changes and human-scale attention — and AI breaks both. This is not a failure of process. It's the process working exactly as designed, under conditions it was never designed for.
That's the reality sitting behind Ankit Jain's recent essay in Latent.Space, "How to Kill the Code Review." The intentionally provocative thesis: human-written code died in 2025, human code review will die in 2026, and the right response is to move the human checkpoint upstream — from line-by-line diff review to spec authorship and acceptance criteria. The essay landed with the expected mix of "finally, someone said it" and "this is dangerously naive." Both reactions have merit. But the more interesting question isn't whether code review is dead. It's whether it was ever reliably alive.
The Math Broke First
The empirical case is hard to argue with. According to Faros data cited by Jain — drawn from over 10,000 developers across 1,255 teams — high AI adoption organizations complete 21% more tasks (work items tracked in engineering systems, correlated against PR throughput) and merge 98% more pull requests, but PR review time increases 91%. Read that again: nearly double the review burden, for nearly double the PRs.
GitHub's 2025 Octoverse data adds context: merged pull requests hit 43 million per month in 2025, with GitHub reporting a large and growing share of that code as AI-assisted. The bottleneck moved. For the first thirty years of modern software engineering, the bottleneck was writing code. Now it's everything else — validation, review, comprehension, trust.
The human cognitive load problem is real and acute. An AI-augmented developer submits a 500-line PR that looks clean: sensible variable names, passes the linter, consistent style. But verifying it's actually correct requires a senior engineer to mentally reconstruct the entire logic flow. Consider something as mundane as timezone handling: AI-generated code will often produce results that parse correctly in UTC but silently fail at DST boundaries, or apply the wrong offset for ambiguous local times. The code looks right. The brain skims. The PR gets merged. Technical debt accretes invisibly, at velocity.
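The DST failure mode is concrete enough to demonstrate. Below is a minimal sketch in Python, using only the standard library: a hard-coded UTC offset for US Eastern (a pattern that often survives review because it looks plausible) agrees with the real timezone database in January and is silently an hour off in July. Function names are illustrative, not from any particular codebase.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Buggy pattern often seen in generated code: a fixed UTC offset
# hard-coded for "US Eastern". Correct in winter, wrong in summer.
def to_eastern_fixed(utc_dt: datetime) -> datetime:
    return utc_dt + timedelta(hours=-5)  # ignores DST entirely

# Correct: let the timezone database apply the right offset.
def to_eastern(utc_dt: datetime) -> datetime:
    return utc_dt.astimezone(ZoneInfo("America/New_York"))

winter = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
summer = datetime(2025, 7, 15, 12, 0, tzinfo=timezone.utc)

# In January both agree (EST = UTC-5)...
assert to_eastern_fixed(winter).hour == to_eastern(winter).hour == 7
# ...but in July the fixed offset is silently an hour off (EDT = UTC-4).
assert to_eastern_fixed(summer).hour == 7
assert to_eastern(summer).hour == 8
```

Every assertion here passes, which is exactly the problem: the buggy version is indistinguishable from the correct one for half the year.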
The rubber-stamp approval is not a new phenomenon. What's new is the scale, and the camouflage.
It Wasn't a Reliable Gate Before AI Either
Before reaching for a solution, it's worth being honest about the baseline. Code review as a near-universal practice is younger than most engineers realize. Jain notes it wasn't ubiquitous until around 2012–2014, and the veterans who remember shipping without it are increasingly scarce. And even after it became standard, review effectiveness fell off fast as diffs got larger and context got thinner. A widely cited Cisco peer review study found that once inspection rates exceed a few hundred lines of code per hour, defect detection drops sharply. Beyond a certain diff size you're no longer buying defect detection — you're buying a sanity check.
The social mechanics didn't help. PRs that lingered created pressure to approve. Large diffs created cognitive overload. Reviewers who raised too many issues got a reputation for slowing things down. The incentive gradient pointed toward "looks good" rather than "I found something."
There are contexts where review genuinely works: small diffs with high context, tight-ownership subsystems where the reviewer knows the domain as well as the author, security-critical paths where slowing down is the point. Pair those conditions against a 500-line AI-generated diff from an agent working in an unfamiliar service, and you're not doing code review. You're performing it.
AI didn't break code review — it exposed review's failure modes at scale.
What the Skeptics Get Right
The pushback on Jain's proposed alternative — spec-driven development as the new primary artifact — is not frivolous. From the Latent.Space comments: "I think people who believe in spec-driven development are naive about how hard it is to write a full spec, compared to just writing code. There's a reason we use symbols with elaborate underlying meanings in mathematics — writing the equivalent in words is tedious."
The history of formal specification is a graveyard of good intentions. A spec that says "return the user's balance" doesn't tell you what happens when the user doesn't exist, when two concurrent writes race on the same account, or when a third-party API times out mid-transaction. In many respects, the code is the spec — because the edge cases live there, not in the prose.
The right frame isn't "full spec" — it's acceptance-criteria-driven development: testable invariants and behavioral contracts written before generation. "Must be idempotent." "Must not widen OAuth scopes." "Must not change database schema." "Amount fields must use the Money type." That's a different ask from waterfall specification, and it maps more cleanly to what experienced engineers already know they need to check.
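To make that distinction concrete, here is a minimal sketch of one such criterion ("must be idempotent") as an executable check. The `apply_discount` function is hypothetical, standing in for whatever the agent generates; the invariant, not the implementation, is the artifact the human authors.

```python
# Hypothetical function under test: 10% off a balance, at most once.
# In practice this would be the agent's generated code.
def apply_discount(balance_cents: int, already_applied: bool) -> tuple[int, bool]:
    if already_applied:
        return balance_cents, True
    return balance_cents - balance_cents // 10, True

def check_idempotent() -> None:
    # Criterion: applying the operation twice equals applying it once.
    once, applied = apply_discount(10_000, False)
    twice, _ = apply_discount(once, applied)
    assert twice == once, "discount must not compound on re-application"

check_idempotent()
```

The check is short, readable, and written before generation — the agent's code either satisfies it or doesn't, and no amount of plausible-looking output changes that.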
There's a parallel problem with AI as reviewer. In one Hacker News thread, a practitioner claimed their AI reviewer caught "maybe 80%" of the bugs they actually cared about — but routinely buried the useful finding under twenty highly speculative, low-priority comments. The failure mode has a name: high recall, low precision — exhaustive linting masquerading as judgment. A good human reviewer has taste: they drop the nine things they noticed that don't matter and surface the one that does. AI reviewers tend toward exhaustiveness, which trains engineers to tune out the noise, which means the real bugs get missed anyway. The tool didn't fail. The incentive structure did.
Security Is the Wildcard Nobody Wants to Price In
The security dimension of AI-generated code deserves direct treatment, not just a caveat.
Veracode's 2025 GenAI Code Security Report, which analyzed over 100 LLMs across 80 curated coding tasks in Java, JavaScript, Python, and C#, found that AI-generated code introduced OWASP Top 10 vulnerabilities in 45% of test cases. The more concerning finding: security performance has remained flat even as models dramatically improved at producing syntactically correct code. Larger models don't write meaningfully more secure code than smaller ones. This suggests a systemic issue — models trained on public repositories inherit decades of vulnerable patterns — rather than a scaling problem that future model generations will solve.
The 45% figure needs a caveat: Veracode's tasks were designed to expose common weaknesses, not to mirror the full distribution of production code. The real-world rate is unknown. But the directional finding — that AI code can look impeccable while carrying known-class vulnerabilities like CWE-80 (cross-site scripting, failed in 86% of relevant tasks) — is consistent with what security teams are reporting in practice.
This is the key distinction for review architecture. Deterministic guardrails — linters, type checks, contract verification — catch known classes of issues. Attackers live in the gaps between classes, context, and configuration. A custom linter that enforces PreparedStatement usage doesn't catch the authentication bypass that emerges from how two services interact. That's not an argument against guardrails; it's an argument that guardrails alone are insufficient for security-critical paths, and that human escalation triggers for those paths aren't a hedge — they're table stakes.
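As a sketch of what such a deterministic guardrail looks like in practice, here is a minimal Python AST check that flags SQL assembled with f-strings instead of parameterized queries. It illustrates the point in both directions: it reliably catches one known class of issue, and it is structurally blind to everything outside that class.

```python
import ast

# Substrings that suggest a string literal is part of a SQL statement.
SQL_HINTS = ("select ", "insert ", "update ", "delete ")

def flag_string_built_sql(source: str) -> list[int]:
    """Return line numbers where SQL appears to be built via f-strings."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.JoinedStr):  # any f-string
            literal = "".join(
                part.value.lower()
                for part in node.values
                if isinstance(part, ast.Constant) and isinstance(part.value, str)
            )
            if any(hint in literal for hint in SQL_HINTS):
                findings.append(node.lineno)
    return findings

bad = 'query = f"SELECT * FROM users WHERE id = {user_id}"'
good = 'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))'
assert flag_string_built_sql(bad) == [1]
assert flag_string_built_sql(good) == []
```

A check like this runs in milliseconds on every agent-generated diff. It will never find the cross-service authentication bypass — which is precisely why it must be paired with human escalation on the paths where that bypass would live.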
The Architecture That Actually Scales
Strip away the provocative framing in Jain's essay and there's a structural insight worth keeping: the locus of human judgment needs to move. Not disappear — move. Here's what that looks like in practice.
Acceptance criteria first, not as documentation. The acceptance criteria come from the spec, not from the implementation. If the agent writes both the code and the tests, you've moved the problem — you're trusting the agent to test the right things. BDD-style contracts written before generation become the verification harness the agent can't negotiate with. For example:
```gherkin
Feature: Payment processing

  Scenario: Concurrent writes to the same account
    Given two simultaneous debit requests for $100 against a $150 balance
    When both requests are processed
    Then exactly one succeeds and one returns InsufficientFunds
    And the final balance is $50
    And no amount is double-debited

  Scenario: Third-party API timeout
    Given the payment gateway times out after 3 seconds
    When a debit request is in flight
    Then the transaction is rolled back
    And the account balance is unchanged
    And the operation is marked retriable
```
This turns intent into a verifiable artifact. The agent implements; the BDD framework enforces. You never read the implementation unless something fails — and when it fails, you know exactly what it violated.
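The concurrent-debit scenario above can be enforced in plain code as well. A sketch, assuming a minimal `Account` class with a lock — all names here are illustrative, not from any real codebase:

```python
import threading

class InsufficientFunds(Exception):
    pass

class Account:
    def __init__(self, balance_cents: int):
        self.balance_cents = balance_cents
        self._lock = threading.Lock()

    def debit(self, amount_cents: int) -> None:
        with self._lock:  # check-and-set must be atomic
            if self.balance_cents < amount_cents:
                raise InsufficientFunds
            self.balance_cents -= amount_cents

account = Account(15_000)  # $150
outcomes = []

def worker():
    try:
        account.debit(10_000)  # $100
        outcomes.append("ok")
    except InsufficientFunds:
        outcomes.append("insufficient")

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Then: exactly one succeeds, one fails, and the balance is $50.
assert sorted(outcomes) == ["insufficient", "ok"]
assert account.balance_cents == 5_000
```

An agent that implements `debit` without the lock fails this harness nondeterministically but detectably — which is the kind of failure a skim of the diff would never catch.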
But passing tests are not the end of the story. Contract tests miss integration edge cases. A scenario that passes in isolation can fail catastrophically when two services interact under load, or when a third dependency changes its behavior in production. This means observability isn't an operational afterthought — it's part of the verification architecture. The rollback plan should be written into the acceptance criteria before the first line is generated, not drafted in incident response. If you can't specify how to detect a silent failure and revert cleanly, the feature isn't ready to ship.
Evidence in the PR, not just diffs. Require the PR description to carry proof: links to passing test runs, load-test deltas for performance-sensitive paths, threat model notes for auth or payment changes, and an explicit rollout and revert plan. This turns the PR from an approval ritual into an evidence artifact — and makes "approval" mean something it hasn't meant in years.
Adversarial separation. The agent that writes the code should not be the agent that verifies it. This is an old principle — it's why QA shouldn't report to the engineering manager — and it's now cheap to enforce architecturally. A verification agent given only the spec and the output will find different failure modes than the one that generated it.
Narrow permissions. If the task is "fix the date parsing bug in utils/dates.py," the agent's filesystem access should be scoped to that file and its test. Not src/. Scope creep is where agents cause cascading problems, and it's fully preventable by design.
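Enforcing this can be as simple as checking an agent's proposed changes against a task-scoped allowlist before applying them. A sketch, with illustrative paths drawn from the example task:

```python
from pathlib import Path

# Task scope: the file named in the ticket plus its test.
# (Paths are illustrative, matching the date-parsing example.)
ALLOWED = {Path("utils/dates.py"), Path("tests/test_dates.py")}

def check_patch_scope(changed_files: list[str]) -> list[str]:
    """Return any files the patch touches outside the task's scope."""
    return [f for f in changed_files if Path(f) not in ALLOWED]

# An in-scope change passes; a drive-by edit to src/ is surfaced.
assert check_patch_scope(["utils/dates.py"]) == []
assert check_patch_scope(["utils/dates.py", "src/api/auth.py"]) == ["src/api/auth.py"]
```

In practice the same constraint is better enforced at the sandbox level (mount only the allowed paths), but even this post-hoc check turns scope creep from a silent merge into a hard stop.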
Escalation triggers for high-stakes paths, unconditionally. Auth logic. Database schema changes. New external dependencies. Payments. These get flagged for human review regardless of agent confidence score. The confidence score is not the relevant signal here.
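A sketch of what an unconditional trigger can look like: a path-pattern check that runs on every agent PR. The patterns are illustrative; the structural point is that no confidence score appears anywhere in the logic.

```python
import fnmatch

# High-stakes paths that always route to a human, regardless of
# agent confidence. Patterns here are illustrative examples.
ESCALATION_PATTERNS = [
    "*/auth/*",            # authentication logic
    "*migrations/*",       # database schema changes
    "*/payments/*",        # money movement
    "requirements*.txt",   # new external dependencies (Python)
    "package.json",        # new external dependencies (Node)
]

def needs_human_review(changed_files: list[str]) -> bool:
    return any(
        fnmatch.fnmatch(f, pattern)
        for f in changed_files
        for pattern in ESCALATION_PATTERNS
    )

assert needs_human_review(["src/payments/charge.py"]) is True
assert needs_human_review(["docs/README.md"]) is False
```

Ten lines of CI glue, and the policy is now architecture rather than discipline.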
Ship fast, observe everything, revert faster — and practice reverts like a fire drill. Feature flags, canary deployments, instant rollbacks — the system needs to assume things will escape the guardrails. Not because the guardrails are bad, but because they always are, at some margin. A rollback you've never rehearsed is not a rollback; it's a hope.
What Doesn't Scale
Pretending. The three-minute approval on the 847-line diff is not a code review. It's a signature on a document you haven't read. At the current velocity of AI-generated code, the gap between "we have a code review process" and "that process catches bugs" has become a chasm — and Addy Osmani's January observation that review is "becoming more strategic" is the polite way to describe it. The shift is from "Did you write this correctly?" to "Are we testing the right things?" That's a fundamentally different job.
The most valuable training for the engineers who will thrive in this environment isn't writing more code. It's designing verification systems — knowing what to test, knowing what questions to ask the review agent, knowing when the green checkmarks are lying, and knowing how to correct for errors.
The argument about whether code review is "dead" misses the operational reality: at AI speed, line-by-line review becomes ceremonial. The question isn't whether to keep pull requests. It's whether to replace performative approval with a verification system that can survive a world where a 500-line diff is a rounding error.
Move the judgment upstream. Automate what you can. Force humans onto the paths that can hurt you. And practice the rollback before you need it.
Sources:
- Ankit Jain, "How to Kill the Code Review", Latent.Space — Faros data: 10,000+ developers, 1,255 teams
- GitHub Octoverse 2025 — 43M PRs merged/month
- SmartBear / Cisco peer review study — inspection rate and LOC guidance
- Addy Osmani, "Code Review in the Age of AI", Elevate
- Veracode, 2025 GenAI Code Security Report — 45% OWASP Top 10 failure rate across 100+ LLMs, 80 curated tasks; methodology note: tasks designed to expose known CWE weaknesses, not a representative sample of production code
- Hacker News: "There Is an AI Code Review Bubble"