Thinking Machines

Why your “reproducible” AI keeps changing its mind

Ask ChatGPT the same question twice with temperature set to zero and a fixed seed—settings that should guarantee identical responses—and you'll likely get different answers. It's a phenomenon so common that most AI practitioners have learned to live with it, chalking it up to the inherent messiness of neural networks running on parallel hardware.
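The failure is easy to check for yourself. Here is a minimal sketch using the OpenAI Python client; the model name and prompt are placeholders, and the API key is assumed to be in the environment:

```python
# Send the same prompt twice with temperature 0 and a fixed seed,
# then check whether the two completions actually match.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=42,
    )
    return resp.choices[0].message.content

a = ask("Tell me about Richard Feynman")
b = ask("Tell me about Richard Feynman")
print("identical" if a == b else "different")  # frequently prints "different"
```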

Mira Murati isn't buying that explanation. The former OpenAI CTO, whose brief stint as interim CEO during the November 2023 Sam Altman ouster made headlines, has resurfaced with Thinking Machines Lab and a reported $2 billion in seed funding. Her team's first public research doesn't just document the non-determinism problem—it claims to have solved it entirely.

The technical breakthrough centers on a counter-intuitive finding: the randomness in AI responses isn't caused by the usual suspects of concurrent processing and floating-point arithmetic. Instead, it stems from how inference servers batch requests together, creating subtle mathematical differences that cascade into completely different outputs.

Debunking the conventional wisdom

For years, researchers blamed LLM non-determinism on what Thinking Machines calls the "concurrency + floating point" hypothesis. The theory seemed logical: GPUs run thousands of parallel threads, floating-point math breaks mathematical laws like associativity (where (a + b) + c ≠ a + (b + c)), and racing threads could theoretically finish in different orders.
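That premise, at least, is real and takes only a few lines of Python to verify:

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)                  # 0.0  -- 0.1 is lost when added to 1e20 first
print(a + (b + c))                  # 0.1  -- 1e20 cancels first, so 0.1 survives
print((a + b) + c == a + (b + c))   # False
```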

Recent academic papers have reinforced this view, stating that "parallel operations across multiple threads can yield different results based on execution order." The explanation felt satisfying enough that most practitioners accepted it as the cost of doing business with neural networks.

Horace He, the Thinking Machines researcher who authored the new paper, systematically dismantled this theory. His team ran identical matrix multiplications 1,000 times on the same GPU and achieved bitwise-identical results every time. "We're definitely using floating-point numbers. And our GPU definitely has a lot of concurrency," He noted. "Why don't we see nondeterminism in this test?"
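A sketch of that kind of determinism check, assuming a CUDA-capable PyTorch install; the shapes and dtype are illustrative, not necessarily the ones used in the paper:

```python
# Run the same matmul many times on one GPU and check the results
# are bitwise identical across runs.
import torch

A = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

ref = torch.mm(A, B)
mismatches = sum(not torch.equal(torch.mm(A, B), ref) for _ in range(1000))
print(f"runs that differed from the first: {mismatches}")  # prints 0
```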

The batch size smoking gun

The real culprit turned out to be batch invariance failure. When inference servers process requests, they group them into batches for efficiency. A request might be processed in a batch of 32 during peak hours, or alone during quiet periods. These different batch sizes trigger different computational pathways in GPU kernels, producing numerically different results for identical inputs.

Think of it as asking the same math question in a classroom: the correct answer shouldn't change depending on whether the teacher asks one student or a group of 30. But with current GPU kernels, that's exactly what happens.
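A rough illustration of the effect with plain PyTorch on a GPU: multiply one row by a weight matrix on its own, then multiply the same row as part of a full batch, and compare.

```python
import torch

A = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

alone    = torch.mm(A[:1], B)     # the row processed on its own ("quiet period")
in_batch = torch.mm(A, B)[:1]     # the same row inside a full batch ("peak hours")

print(torch.equal(alone, in_batch))       # typically False on current GPUs
print((alone - in_batch).abs().max())     # small but nonzero difference
```

The discrepancy is tiny, on the order of rounding error, but an autoregressive model only needs it to flip one token choice for everything after that token to diverge.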

To demonstrate this, Thinking Machines ran 1,000 identical prompts through a Qwen-3 model asking "Tell me about Richard Feynman" at temperature zero. The result: 80 unique responses. While 992 continued "Feynman was born on May 11, 1918, in Queens, New York," eight diverged to "New York City." The first variation occurred at the 103rd token; up to that point, every response was identical.

With their batch-invariant kernel implementations, all 1,000 responses became identical—the mathematical behavior users would expect from a deterministic system.
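Here is a sketch of how one might reproduce that count, assuming a local OpenAI-compatible endpoint (for example, a vLLM server) and an illustrative model id:

```python
# Count unique greedy completions of the same prompt.
from collections import Counter
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completions = Counter()
for _ in range(1000):
    resp = client.completions.create(
        model="Qwen/Qwen3-8B",               # assumed model id
        prompt="Tell me about Richard Feynman",
        temperature=0,
        max_tokens=1000,
    )
    completions[resp.choices[0].text] += 1

# Divergence shows up when the server batches these requests alongside other
# traffic; with batch-invariant kernels, the count collapses to 1.
print(f"unique completions: {len(completions)}")
```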

The RLHF revolution hiding in plain sight

The implications extend beyond reproducibility into reinforcement learning from human feedback (RLHF), the technique that powers modern AI assistants. Current RLHF implementations suffer from a fundamental flaw: numerical differences between training and inference phases create what researchers call "off-policy" learning, where the model being trained differs subtly from the one generating training data.

Thinking Machines demonstrated true "on-policy" RLHF for the first time, achieving zero KL-divergence between the training and sampling policies. In their experiments with a visual reasoning task, training runs that normally crashed without corrective techniques such as importance weighting suddenly succeeded once perfect training-inference alignment was maintained. The team describes this as potentially "more transformative than the GPU kernel story" itself.
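The mismatch being eliminated can be written down as a per-token KL divergence between the sampler's distribution and the trainer's distribution over the same tokens. A minimal sketch of that metric, with illustrative tensor names and shapes:

```python
import torch
import torch.nn.functional as F

def sampler_trainer_kl(sampler_logits: torch.Tensor, trainer_logits: torch.Tensor) -> torch.Tensor:
    """KL(sampler || trainer), averaged over tokens. Logits: [tokens, vocab]."""
    sampler_logp = F.log_softmax(sampler_logits.float(), dim=-1)
    trainer_logp = F.log_softmax(trainer_logits.float(), dim=-1)
    return (sampler_logp.exp() * (sampler_logp - trainer_logp)).sum(-1).mean()

# With bitwise-identical forward passes the two logit tensors match exactly:
logits = torch.randn(128, 32000)
print(sampler_trainer_kl(logits, logits.clone()))                              # tensor(0.)
print(sampler_trainer_kl(logits, logits + 1e-3 * torch.randn_like(logits)))    # small but nonzero
```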

For a lab like Anthropic or OpenAI, which reportedly spends tens of millions of dollars on a single major training run, eliminating off-policy drift could mean shorter RLHF runs and more reliable model updates; even a 10% efficiency gain would represent millions saved, assuming the industry adopts deterministic inference despite its performance costs.

Three operations, three levels of difficulty

The technical solution required rebuilding three core GPU operations that break batch invariance in transformer models. Listed in ascending order of complexity:

RMSNorm (a simplified cousin of layer normalization) proved easiest to fix. Standard implementations assign one batch element per GPU core until the cores outnumber the elements, then switch to more complex split-reduction strategies. Thinking Machines' approach: keep the simpler strategy even for small batches and accept the performance hit, rather than changing how the reduction is performed mid-stream.
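As a rough illustration of the principle rather than Thinking Machines' actual kernel, a batch-invariant RMSNorm keeps every row's reduction along the hidden dimension, in a fixed order, so one row's arithmetic never depends on how many other rows share the batch:

```python
import torch

def batch_invariant_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # x: [batch, hidden]. The mean-square reduction runs along the hidden
    # dimension only, so a row's result is the same whether it arrives in a
    # batch of 1 or a batch of 1,000.
    rms = torch.sqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x.float() / rms * weight.float()).to(x.dtype)
```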

Matrix multiplication presented steeper challenges due to "tensor core" optimizations that operate on data tiles. Different batch sizes can trigger different tile configurations, changing internal computation order. Their solution involves using fixed kernel configurations across all batch sizes, sacrificing peak performance for consistency.
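A conceptual sketch of that fixed-configuration idea, written as plain Python tiling rather than a real GPU kernel: the tile sizes, and therefore the order of the inner accumulations, stay constant no matter how many rows are being processed.

```python
import torch

def fixed_config_matmul(A: torch.Tensor, B: torch.Tensor,
                        tile_m: int = 128, tile_n: int = 128, tile_k: int = 64) -> torch.Tensor:
    assert A.shape[1] == B.shape[0]
    M, N = A.shape[0], B.shape[1]
    out = torch.zeros(M, N, dtype=torch.float32, device=A.device)
    for i in range(0, M, tile_m):
        for j in range(0, N, tile_n):
            # Accumulate over K in fixed tile_k steps, always in the same order,
            # instead of letting the kernel pick a batch-size-dependent schedule.
            for k in range(0, A.shape[1], tile_k):
                out[i:i+tile_m, j:j+tile_n] += (
                    A[i:i+tile_m, k:k+tile_k].float() @ B[k:k+tile_k, j:j+tile_n].float()
                )
    return out
```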

Attention mechanisms proved most complex, requiring reductions across both feature and sequence dimensions. The challenge intensifies because inference optimizations like chunked prefill affect sequence processing. Thinking Machines implemented "fixed split-size" strategies rather than the dynamic optimizations that maximize GPU utilization.
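Again as a conceptual sketch rather than their kernel: a fixed split-size attention reduction cuts the key/value sequence into constant-size chunks and folds the partial results together in a fixed order using a running log-sum-exp.

```python
import torch

def fixed_split_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          split_size: int = 256) -> torch.Tensor:
    # q: [d], k/v: [seq, d]. Single head, no masking; purely illustrative.
    scale = q.shape[-1] ** -0.5
    running_max = q.new_tensor(float("-inf"))
    running_denom = q.new_tensor(0.0)
    running_num = torch.zeros_like(v[0])
    for start in range(0, k.shape[0], split_size):     # same chunking every time
        scores = (k[start:start + split_size] @ q) * scale       # [chunk]
        new_max = torch.maximum(running_max, scores.max())
        correction = torch.exp(running_max - new_max)            # rescale old partials
        weights = torch.exp(scores - new_max)                    # [chunk]
        running_num = running_num * correction + weights @ v[start:start + split_size]
        running_denom = running_denom * correction + weights.sum()
        running_max = new_max
    return running_num / running_denom

# Sanity check against the straightforward softmax-attention formula:
q, k, v = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
ref = torch.softmax((k @ q) * (64 ** -0.5), dim=0) @ v
print(torch.allclose(fixed_split_attention(q, k, v), ref, atol=1e-5))  # True
```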

The 60% performance tax

Determinism comes with significant costs. In benchmarks using Qwen-3-8B, standard vLLM processed 1,000 sequences in 26 seconds. Thinking Machines' unoptimized deterministic version took 55 seconds, with optimization bringing this down to 42 seconds—still a 60% performance penalty. In practical terms, that's the difference between a chatbot feeling instant and feeling sluggish, or an inference cluster needing 1.6× as many GPUs to maintain the same throughput.
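The arithmetic behind that figure, for reference:

```python
baseline, deterministic = 26, 42            # seconds for the same 1,000 sequences
slowdown = deterministic / baseline         # ≈ 1.62x the wall-clock time
print(f"{slowdown:.2f}x slower, i.e. roughly {100 * (slowdown - 1):.0f}% more GPU capacity for the same throughput")
```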

The company has open-sourced its implementations, but adoption faces significant hurdles. Inference engines such as vLLM would need to integrate the batch-invariant kernels, and cloud providers would need to offer deterministic endpoints despite the performance costs.

For consumer applications where response variability enhances user experience, the trade-off makes little sense. But in regulated industries—finance, healthcare, legal—where identical inputs must produce identical outputs for audit trails and compliance, the calculus shifts entirely.

Industry pushback and practical skepticism

The broader AI industry has shown little appetite for performance-killing reliability features. Google, Meta, and NVIDIA have spent billions optimizing for raw speed, with benchmark leaderboards driving engineering priorities. Convincing these companies to voluntarily slow their systems by 60% for mathematical purity would require a significant cultural shift.

More realistic adoption scenarios involve regulated industries and specialized use cases where determinism justifies the performance penalty. Financial services firms facing algorithmic audits, medical AI companies seeking FDA approval, and government contractors requiring reproducible results could embrace the trade-off. Smaller AI labs might also adopt deterministic kernels as differentiation for enterprise clients prioritizing reliability over millisecond response times. Regulatory pressure could accelerate this trend, potentially making deterministic behavior a compliance requirement rather than an optional feature.

But even Thinking Machines acknowledges significant unsolved problems. Achieving batch invariance "in distributed multi-GPU settings remains an open problem," according to their paper. The techniques demonstrated so far work on single GPUs handling relatively simple workloads; scaling them to the massive, distributed systems powering production AI services presents additional challenges.

The startup's $2 billion seed round, led by Andreessen Horowitz, is one of the largest early-stage rounds in Silicon Valley history, dwarfing typical Series A funding by orders of magnitude. That provides the resources to push beyond academic research into actual products. Murati has indicated the company's first offering will target "researchers and startups developing custom models," suggesting Thinking Machines may ship a deterministic inference service itself rather than wait for ecosystem-wide adoption.

Beyond good enough

The research represents a broader philosophical divide in AI development. While most labs chase larger models and flashier capabilities, Thinking Machines focuses on fundamental reliability. As their paper concludes: "We reject this defeatism. With a little bit of work, we can understand the root causes of our nondeterminism and even solve them!"

Whether the industry shares that philosophy remains unclear. The tension between "fast enough" and "mathematically correct" will likely be resolved by external forces—regulatory requirements, liability concerns, or high-profile failures—rather than internal technical preferences.

For now, though, most of the AI industry seems content to keep shipping fast, fuzzy answers—so long as they arrive on time.
