
The KV Cache – The Hidden Bottleneck Holding Back Local AI

Everyone is talking about running large language models on laptops. Almost nobody is talking about why those models run out of memory halfway through a long conversation. The answer is not the model weights. It is the KV cache — and a paper being presented at ICLR 2026 this week may have just solved it.

The Mixture-of-Experts revolution has made local inference plausible in ways that were unthinkable a year ago. Qwen3.5-35B-A3B packs 35 billion parameters into a model that only activates 3 billion per token. Quantised to 4 bits, the weights fit in 20 GB. Community members report running it at over 100 tokens per second on an RTX 3090. A MacBook with 24 GB of unified memory handles it comfortably. For the first time, frontier-class reasoning is within reach of consumer hardware.

But try giving that model a 100,000-token context window and watch what happens. Your 24 GB machine does not run out of compute. It runs out of memory. Not because of the model — because of the cache.

What the KV cache actually is

Every transformer-based language model, when generating text, stores the key and value vectors from all previous tokens in what is called the KV cache. This is how the model "remembers" the conversation. Every time the model generates a new token, it needs to attend to every previous token, which means reading every stored key-value pair. The cache grows linearly with sequence length and scales with the number of attention layers, the number of KV heads, and the head dimension.

For a model like Qwen3.5-35B-A3B, the architecture is unusually efficient. It uses a 3:1 hybrid design — three Gated DeltaNet layers (linear attention with a fixed-size state that does not grow with sequence length) for every one full attention layer with GQA using only 2 KV heads. Because the DeltaNet layers do not store per-token KV pairs at all, only 25% of the model's layers actually contribute to the KV cache. This makes the cache footprint dramatically smaller than a standard transformer of comparable size. But at 256K tokens of context, even this reduced cache still consumes multiple gigabytes in FP16. For denser architectures — models that use full attention in every layer — the numbers are far worse. A 35-billion-parameter dense model with standard multi-head attention can burn through 7–8 GB of KV cache at 32K context, and 30+ GB at 128K.
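The scaling above can be sanity-checked with back-of-the-envelope arithmetic. The architectural numbers below (60 layers with 8 KV heads for the dense example, 48 layers of which a quarter are attention layers for the hybrid) are illustrative assumptions, not any model's official spec:

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_val=2):
    """KV cache size: 2 tensors (K and V) per KV-storing layer, FP16 by default."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

GiB = 1024 ** 3

# Dense 35B-class model: every one of ~60 layers stores KV pairs (GQA, 8 KV heads).
dense = kv_cache_bytes(seq_len=32_000, n_attn_layers=60, n_kv_heads=8, head_dim=128)

# Hybrid 3:1 model: only 12 of 48 layers are full attention, GQA with 2 KV heads.
hybrid = kv_cache_bytes(seq_len=256_000, n_attn_layers=12, n_kv_heads=2, head_dim=128)

print(f"dense @ 32K context:   {dense / GiB:.1f} GiB")   # in the 7-8 GB ballpark
print(f"hybrid @ 256K context: {hybrid / GiB:.1f} GiB")  # multiple GiB despite few KV layers
```

Even with three quarters of its layers storing no per-token state at all, the hybrid model's cache at 256K context lands in the same memory range as a dense model's cache at 32K.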

This is the bottleneck that determines whether your local model can read an entire codebase or only a single file. It is also the bottleneck that determines whether you can serve multiple concurrent users on a single GPU. And until now, the solutions have been unsatisfying. You either truncate context (and lose information), prune tokens heuristically (and hope you pruned the right ones), or apply crude per-channel quantisation with no theoretical guarantees about what you are losing.

TurboQuant: near-optimal compression with zero calibration

TurboQuant, accepted at ICLR 2026 and presented this week, was developed by a team led by Amir Zandieh and Vahab Mirrokni at Google Research, in collaboration with Majid Daliri (NYU), Majid Hadian (Google DeepMind), and researchers at KAIST. It proposes a vector quantisation scheme for the KV cache that compresses 16-bit values down to 3.5 or 2.5 bits per channel while preserving the geometric structure of the vectors. The key claim: at 3.5 bits, they achieve "absolute quality neutrality" with the full-precision model. At 2.5 bits, degradation is marginal. That translates to compression factors of 4.5× at 3.5 bits and over 6× at 2.5 bits.
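The compression factors follow directly from the bit widths, measured against a 16-bit baseline (any small per-vector metadata such as scales is ignored here):

```python
# Compression relative to FP16, ignoring small per-vector metadata (e.g. scales).
for bits in (3.5, 2.5):
    print(f"{bits} bits/channel -> {16 / bits:.2f}x compression")
# 3.5 bits gives ~4.57x (the paper's "4.5x"); 2.5 bits gives 6.40x (the "over 6x")
```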

What makes this more than an incremental improvement is the combination of three properties. First, the distortion-rate trade-off is provably near-optimal — within a factor of approximately 2.7 of the information-theoretic lower bound from Shannon's rate-distortion theory. For a bit-width of 1, the gap shrinks to a factor of 1.45. This is not a heuristic that happens to work on the benchmarks they tested. It is a method with formal guarantees that hold for any input vector, in the worst case.

Second, the algorithm is online and data-oblivious. It requires no calibration data, no preprocessing, no codebook learning. You can apply it to each KV vector as it arrives during autoregressive generation. This matters enormously for practical deployment. Offline quantisation methods that need a representative calibration dataset are useless for the KV cache, because the cache is populated dynamically during inference — its contents depend entirely on the user's prompt.

Third, the computational overhead is negligible. The core operation is a random rotation followed by scalar quantisation against precomputed codebooks. No iterative optimisation, no nearest-neighbour search against learned centroids. The indexing time, in their nearest-neighbour experiments, was described as "virtually zero."

How it works

The core insight is elegant. Take a d-dimensional vector on the unit sphere. Multiply it by a random rotation matrix. After rotation, each coordinate follows a Beta distribution (converging to Gaussian in high dimensions), and distinct coordinates become nearly independent. This near-independence is the crucial property — it means you can quantise each coordinate independently using an optimal scalar quantiser, without worrying about correlations between dimensions, and still achieve near-optimal distortion for the full vector.

The optimal scalar quantiser for each coordinate is found by solving a one-dimensional continuous k-means problem using the Lloyd-Max algorithm. The resulting codebooks are precomputed once and stored. Quantisation then reduces to: rotate, find nearest centroid per coordinate, store the index. Dequantisation is the reverse: look up centroids, rotate back.
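The rotate-quantise-dequantise pipeline can be sketched in a few lines. This is an illustration, not the paper's code: the rotation here is a dense orthogonal matrix from a QR decomposition (a structured fast rotation would be used in practice), and the codebook is fitted by plain Lloyd iterations on Gaussian samples as a stand-in for the exact Lloyd-Max solution:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_codebook(bits, n_samples=100_000, iters=30):
    """Approximate Lloyd-Max codebook for a standard Gaussian coordinate."""
    samples = rng.standard_normal(n_samples)
    # start centroids at evenly spaced quantiles, then run Lloyd iterations
    centroids = np.quantile(samples, (np.arange(2 ** bits) + 0.5) / 2 ** bits)
    for _ in range(iters):
        nearest = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(centroids.size):
            members = samples[nearest == k]
            if members.size:
                centroids[k] = members.mean()
    return np.sort(centroids)

def random_rotation(d):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix so the distribution is uniform

def quantize(x, R, codebook):
    """Rotate, normalise, snap each coordinate to its nearest centroid index."""
    z = R @ x
    scale = np.linalg.norm(z) / np.sqrt(z.size)  # one scalar stored with the codes
    codes = np.abs(z[:, None] / scale - codebook[None, :]).argmin(axis=1)
    return codes, scale

def dequantize(codes, scale, R, codebook):
    """Look up centroids, undo the scale, rotate back."""
    return R.T @ (codebook[codes] * scale)

d = 128
R = random_rotation(d)
cb = gaussian_codebook(bits=4)          # 16 centroids -> 4 bits per coordinate
x = rng.standard_normal(d)
codes, scale = quantize(x, R, cb)
x_hat = dequantize(codes, scale, R, cb)
print("relative L2 error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

The codebook is built once; per-vector work is one matrix-vector product and a nearest-centroid lookup per coordinate, which is why the online overhead is so small.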

That handles MSE. But for transformer attention, what you actually need to preserve is inner products — the dot product between query and key vectors is what drives the attention mechanism. The paper shows that MSE-optimal quantisers introduce bias in inner product estimation. At 1 bit, the multiplicative bias is 2/π, which is substantial. Their solution is a two-stage approach: apply the MSE quantiser at one bit less than the target budget, then apply a 1-bit Quantised Johnson-Lindenstrauss (QJL) transform on the residual. The result is an unbiased inner product estimator with provably low distortion.
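The residual correction can be illustrated with a minimal 1-bit QJL sketch. This shows QJL in isolation on a raw key rather than on the Stage-1 residual, with illustrative dimensions; the key fact is that signs of a shared random projection, plus one stored norm, yield an unbiased inner product estimate against a full-precision query:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 8192                       # m sign bits per key (illustrative sizes)
S = rng.standard_normal((m, d))       # shared random projection matrix

def qjl_encode(k):
    """Store only the sign bits of the projection, plus one scalar (the norm)."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner(q, signs, k_norm):
    """Unbiased estimator: E[(Sq) . sign(Sk)] = m * sqrt(2/pi) * <q,k> / ||k||,
    so rescaling by sqrt(pi/2) * ||k|| / m recovers <q,k> in expectation."""
    return np.sqrt(np.pi / 2) / m * k_norm * ((S @ q) @ signs)

q, k = rng.standard_normal(d), rng.standard_normal(d)
signs, k_norm = qjl_encode(k)
print("true <q,k>:", round(q @ k, 2),
      " QJL estimate:", round(qjl_inner(q, signs, k_norm), 2))
```

Because the estimator is unbiased, the residual stage removes the systematic shrinkage that an MSE-optimal quantiser would otherwise introduce into the attention logits.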

What this means for local inference

Let us make this concrete with the models people are actually running locally.

Qwen3.5-35B-A3B at 4-bit quantisation needs about 20 GB for weights. That leaves 4 GB on a 24 GB GPU for KV cache and overhead. Thanks to the hybrid architecture — only 25% of layers use full attention — the KV cache for this model is far smaller than for a dense transformer of similar size. But it is not negligible. At 256K context, even the reduced cache pushes the memory budget of consumer hardware, which is why the model card recommends reducing context length to 32K if you run into OOM errors.

Apply TurboQuant at 3.5 bits to the attention layers that do store KV pairs, and that cache shrinks by roughly 4.5×. A model that previously hit a wall at 32K tokens on your 24 GB GPU gains meaningful headroom — potentially enough to reach 64K or beyond, depending on the overhead budget. For dense architectures like Llama-3.1 where every layer stores KV pairs, the gains are even more dramatic: context windows could expand 3–4× on identical hardware.

For Qwen3-Coder-Next — the 80 billion parameter MoE designed for agentic coding with 256K context — the weight budget is even tighter. This model needs about 46 GB just for the 4-bit weights. Anyone running it on a 64 GB Mac is already memory-constrained, and, like Qwen3.5-35B-A3B, it uses the 3:1 hybrid DeltaNet/attention design that limits how many layers store KV pairs. Even so, every gigabyte saved on the KV cache is a gigabyte that can go toward longer context or more concurrent sessions.

The paper validates these claims across Llama-3.1-8B-Instruct, Ministral-7B-Instruct, Gemma, and Mistral models — though not on the Qwen MoE models specifically, which is worth noting. The benchmark coverage is extensive: LongBench, Needle-in-a-Haystack, ZeroSCROLLS, RULER, and L-Eval. On the Needle-in-a-Haystack benchmark — where the model must find a specific sentence buried in a document of up to 104K tokens — TurboQuant at 4× compression matched full-precision performance exactly. On LongBench, the 3.5-bit variant scored 50.06 versus the full cache's 50.06. Identical. The 2.5-bit variant scored 49.74 — a gap so small it is within noise.

The ICLR version of the paper also includes speedup benchmarks that the earlier arXiv preprint did not: at 4-bit precision, attention logit computation on NVIDIA H100 GPUs is up to 8× faster than the 32-bit unquantised baseline. That is specifically the attention computation, not end-to-end inference throughput — but attention is the dominant cost at long sequence lengths, which is exactly where KV cache compression matters most.

Why this is different from existing approaches

The KV cache compression landscape is cluttered. KIVI does per-channel scalar quantisation without theoretical guarantees and shows meaningful degradation at aggressive compression ratios. SnapKV and PyramidKV prune tokens rather than quantise them, which is a fundamentally different trade-off — you are discarding information rather than compressing it. Product Quantisation methods used in vector databases require codebook training, which is a non-starter for online KV cache application. TurboQuant itself builds on PolarQuant (to be presented at AISTATS 2026, from overlapping authors), which handles Stage 1 — the random rotation and coordinate-wise quantisation. TurboQuant's contribution is adding the QJL residual correction that eliminates inner product bias, making the full two-stage system suitable for attention computation.

TurboQuant's advantage is that it sits at the intersection of theoretical optimality and practical deployability. It does not require tuning. It does not require representative data. It does not require leaving any tokens uncompressed. And it comes with a formal proof that you cannot do much better — the information-theoretic lower bound says so.

It is worth noting NVIDIA's KVTC, also accepted at ICLR 2026, which claims up to 20× compression using a JPEG-style transform coding approach. That compression ratio dwarfs TurboQuant's 6×. But KVTC requires a calibration step and targets offline cache storage and reuse, making it unsuitable for the online, real-time quantisation during autoregressive generation that local inference demands. The two methods are complementary rather than competing — different tools for different deployment scenarios.

The nearest-neighbour search experiments make the practical implications equally stark. TurboQuant consistently outperformed Product Quantisation in recall across every dataset tested (GloVe at 200 dimensions, OpenAI embeddings at 1536 and 3072 dimensions), while reducing indexing time from hundreds of seconds to essentially nothing. PQ needed 239 seconds to index 100K vectors at dimension 1536. TurboQuant needed 0.0013 seconds.

The bigger picture

The convergence is hard to ignore. MoE architectures have solved the weight problem — you can fit a frontier-class model on a consumer GPU by only loading the experts you need. Flash Attention and its successors have solved the compute problem — attention is no longer quadratic in practice. What remains is the memory problem, and the memory problem is dominated by the KV cache at long contexts.

TurboQuant is not the only approach to solving this, but it is the first to offer near-optimal guarantees with zero overhead. If KV cache quantisation at 3.5 bits becomes standard — and the results suggest it should — the practical context lengths achievable on local hardware expand significantly. For dense architectures, that could mean a 3–4× increase. That is the difference between a local coding agent that can see one file and one that can see your entire project.

The ecosystem is already moving

Google published its official Research blog post on March 24 and TurboQuant's ICLR poster session is April 25 — but the community did not wait. Within days of the blog post, independent developers had built working implementations from the paper's mathematics alone. A PyTorch implementation with custom Triton kernels has been tested on Gemma 3 4B, reportedly producing character-identical output to the uncompressed baseline at 2-bit precision. A pip-installable turboquant package is already on PyPI, offering drop-in KV cache compression for HuggingFace models. There is a llama.cpp fork with TurboQuant KV cache types (--cache-type-k turbo3) running on Apple Silicon with Metal GPU kernels, claiming speed parity with q8_0 at 4.9× compression. And a vLLM integration feature request is open, with a proposed implementation design already sketched out.

None of this is from Google. Google has not released official code. These are people reading a paper and writing their own implementations based on the mathematics — a signal that the algorithm is as clean and implementable as the authors claim. Whether it lands as a flag in llama.cpp, a backend in vLLM, or a built-in option in Ollama is the question. The pace of the community response suggests it will not take long.

For anyone building agent infrastructure that runs locally — and that is an increasingly large number of developers — this is not a paper to watch. It is already happening.
