In the early days of adapter-based tuning, LoRA often felt like a charming hack—efficient, plausible, but with a nagging question: would performance always trail full fine-tuning? New research from Thinking Machines, led by John Schulman (co-founder of OpenAI and creator of the PPO algorithm), argues that the difference is not inevitable. Under the right regime, LoRA can track full fine-tuning so closely that we can treat it as a drop-in, dependable alternative rather than a risky shortcut.
What LoRA Does (And Why It Sometimes Disappoints)
LoRA (Low-Rank Adaptation) works by decomposing the weight update in fine-tuning into a low-rank additive form: W' = W + γBA. Instead of adjusting the full weight matrix W, you learn smaller matrices B and A (with rank r << original dimensions) and scale them with γ. This strategy dramatically reduces the number of trainable parameters and memory overhead, while leaving the base weights W frozen.
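To make the shape of the trick concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The class name, initialization constants, and dimensions are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base weight W plus a trainable low-rank update scaled by alpha / r."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # W stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init, so W' = W at step 0
        self.scaling = alpha / r                                        # plays the role of the γ factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W'x = Wx + (alpha / r) * B(Ax): only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Wrapping a single 4096×4096 projection this way with r=32 trains roughly 260K parameters instead of 16.8M.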
It has been appealing for many applications: serving multiple adapter versions from one base model, shrinking the training memory footprint, and making weights easy to transfer across models. However, its reputation was haunted by mixed results: sometimes LoRA worked well, and sometimes it underperformed full fine-tuning unless carefully tuned.
The Low-Regret Recipe: Three Simple Rules
What Thinking Machines brings to the table is a kind of "LoRA for practitioners" prescription: a hyperparameter and application regime that reliably places LoRA into what they call a low-regret regime—where LoRA performance aligns with that of full fine-tuning. The breakthrough is not incremental but foundational: LoRA need not come with a performance trade-off if you follow their rules.
Rule 1: Apply LoRA to All Layers
The first rule is simple but underappreciated: put LoRA adapters on all layers, not just attention. Many earlier works only applied adapters to attention projections (the Q, K, V, and output matrices), assuming that's where most of the "learning" happens.
But the Thinking Machines experiments show that limiting adapters to attention leads to stunted learning curves—even when compensating by raising rank. Instead, applying LoRA to both attention layers and MLP (feed-forward) layers delivers parity with full fine-tuning across models ranging from 7B to 70B parameters. In Mixture of Experts (MoE) architectures, adapters should also be applied to the expert layers.
The researchers tested this on Llama 2 models (7B, 13B, and 70B parameters) and Mixtral 8x7B, across supervised fine-tuning datasets including GSM8K (math reasoning) and instruction-following datasets. When LoRA was restricted to attention-only, models consistently underperformed by 5-15% on downstream metrics compared to full fine-tuning, even at high ranks (r=64). Adding MLP adapters closed this gap almost entirely.
They argue this parallels the "neural tangent kernel" view of deep learning: in this theoretical framework, training dynamics are determined by which parameters contribute most to gradient updates. Since MLP layers typically contain 2-3× more parameters than attention layers in transformer architectures, skipping them means you're missing the majority of the gradient signal that drives full fine-tuning.
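In Hugging Face peft terms (assuming Llama-style module names; other architectures name their projections differently), the difference between the two setups is just the target_modules list:

```python
from peft import LoraConfig

# Attention-only: the pattern that consistently underperformed
attention_only = LoraConfig(
    r=32, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# All layers: attention plus the MLP (feed-forward) projections
all_layers = LoraConfig(
    r=32, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```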
Rule 2: Use ~10× the Learning Rate of Full Fine-Tuning
Second, the learning rate for LoRA should be roughly 10× that of full fine-tuning. This is the research's most counterintuitive finding.
In typical full fine-tuning of a 7B-70B parameter model, learning rates are in the 1e-5 to 5e-5 range. The Thinking Machines team found that optimal LoRA learning rates consistently fell in the 1e-4 to 5e-4 range—about 10× higher.
This held across every configuration they tested:
- Supervised fine-tuning on mathematical reasoning (MATH dataset), instruction following, and coding tasks
- Reinforcement Learning (RL) with policy gradients on reasoning tasks
- Multiple model sizes (7B, 13B, 70B parameters)
- Various rank values (r=8, 16, 32, 64)
For example, when full fine-tuning a Llama 2 7B model used a learning rate of 2e-5, the optimal LoRA learning rate was 2e-4. When full fine-tuning used 5e-5, LoRA's sweet spot was 5e-4. The 10× factor remained remarkably consistent.
Even more surprising: this optimal learning rate is mostly invariant across rank values in the early phases of training. This follows from LoRA's standard parameterization: in the typical implementation, the adapter output is scaled by α/r, where α is a fixed hyperparameter and r is the rank. With α held fixed, the adapter's effective scale shrinks as rank grows, so each parameter's contribution is automatically adjusted and the optimal learning rate stays relatively stable across different rank choices. Thus, once you pick a base full fine-tuning learning rate, you can multiply by ten to set your LoRA learning rate, dramatically reducing your hyperparameter search space.
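In code the rule is a one-liner; the short sketch below also prints the α/r multiplier to show why, with α held fixed, changing rank barely moves the optimal learning rate (values are illustrative):

```python
# Rule 2: LoRA learning rate ≈ 10× the full fine-tuning learning rate
full_ft_lr = 2e-5
lora_lr = 10 * full_ft_lr   # 2e-4

# Standard LoRA scaling: the adapter output is multiplied by alpha / r,
# so the multiplier shrinks as rank grows.
alpha = 32
for r in (8, 16, 32, 64):
    print(f"r={r:3d}  adapter scale alpha/r = {alpha / r:.2f}")
```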
Rule 3: Keep Batch Sizes Moderate
Third, resist the urge to inflate batch size. LoRA is less tolerant of very large batches than full fine-tuning.
The researchers found that while full fine-tuning could handle batch sizes of 256-512 samples per step without performance degradation, LoRA's performance started to decline noticeably at batch sizes above 128. At a batch size of 512, the gap in final validation loss between LoRA and full fine-tuning widened by 10-20%, even when using optimal learning rates and applying adapters to all layers.
Crucially, this penalty is independent of rank, suggesting an inherent limitation of the low-rank BA product parameterization in large-batch regimes. The practical recommendation: keep batch sizes in the 32-128 range for most applications. This is often welcome news, since huge batches demand more memory and can mask training instability.
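Note that the figure to watch is the effective (global) batch size, which includes gradient accumulation and data parallelism, not just the per-device number. A small sanity-check sketch, with illustrative values:

```python
per_device_batch = 16
grad_accum_steps = 4
num_devices = 1

effective_batch = per_device_batch * grad_accum_steps * num_devices
assert 32 <= effective_batch <= 128, "keep LoRA's effective batch size moderate"
print(f"effective batch size = {effective_batch}")
```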
What This Looks Like in Practice
Before (Old LoRA Approach):
- Adapters: Attention layers only
- Learning rate: Same as full fine-tuning (1e-5 to 5e-5)
- Batch size: As large as memory allows (256-512)
- Rank: Aggressive tuning needed, often r=64+
- Result: Inconsistent performance, frequent underperformance vs full fine-tuning
After (Low-Regret LoRA):
- Adapters: All layers (attention + MLP + MoE if applicable)
- Learning rate: 10× full fine-tuning rate (1e-4 to 5e-4)
- Batch size: Moderate (32-128)
- Rank: r=16-32 often sufficient
- Result: Performance tracks full fine-tuning reliably
When You Follow the Rules
When you obey those rules (adapters on all layers, learning rate roughly 10×, moderate batch sizes), LoRA's training trajectories follow full fine-tuning almost step for step. Loss curves line up, capabilities emerge in tandem, and when the adapter starts to saturate it levels off gracefully rather than collapsing: it may stop improving where full fine-tuning would keep gaining, but it doesn't catastrophically diverge. LoRA slows down rather than breaking.
That's a large shift in reliability: tuning LoRA becomes a reproducible process rather than a gamble.
The Deeper Picture: Capacity and Compute
Behind the veneer of empirical rules are deeper observations worth unpacking. The Thinking Machines team shows that when LoRA is not capacity-constrained—when the adapter has enough degrees of freedom relative to the dataset—LoRA learns with the same sample efficiency as full fine-tuning.
Only when dataset size exceeds adapter capacity does LoRA begin falling behind. In their experiments, a rank-32 adapter on a 7B model could match full fine-tuning on datasets up to about 50,000 examples. Beyond that threshold, increasing rank to 64 or 128 restored parity. In other words, LoRA's limitations are not due to structural underperformance, but to hitting a capacity ceiling relative to what you're trying to teach it.
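There is no exact capacity formula yet (see "What's Still Unknown" below), but a first-order check is simply the adapter's trainable-parameter count at a given rank. The sketch below uses approximate Llama 2 7B dimensions as an assumed example; swap in your own model's shapes:

```python
def lora_param_count(r: int, layers: int = 32, hidden: int = 4096, mlp: int = 11008) -> int:
    """Rough trainable-parameter count with adapters on every attention and MLP projection.

    Each adapted matrix of shape (d_out, d_in) contributes r * (d_in + d_out) parameters.
    Dimensions approximate Llama 2 7B; adjust for your architecture.
    """
    attn = 4 * r * (hidden + hidden)                    # q, k, v, o projections
    ffn = 2 * r * (hidden + mlp) + r * (mlp + hidden)   # gate, up, down projections
    return layers * (attn + ffn)

for r in (8, 16, 32, 64):
    print(f"r={r:3d}  ~{lora_param_count(r) / 1e6:.0f}M trainable parameters")
```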
They also quantify the computational savings: LoRA requires about ⅔ the FLOPs of full fine-tuning per training step. This is a theoretical calculation based on eliminating gradient updates on the full weight matrix W and replacing them with operations on the much smaller A and B matrices. For a 7B parameter model with a typical LoRA rank (r=32), this translates to roughly 33% fewer floating-point operations per training step.
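A back-of-the-envelope version of that ⅔ figure, using the standard accounting of roughly 2 FLOPs per weight per token each for the forward pass, the activation-gradient pass, and the weight-gradient pass (a sketch, not the paper's exact derivation):

```python
forward = 2           # W @ x in the forward pass
grad_activations = 2  # backprop still flows through the frozen W to earlier layers
grad_weights = 2      # only needed when W itself is updated

full_ft_flops = forward + grad_activations + grad_weights   # 6 per weight per token
lora_flops = forward + grad_activations                     # 4, plus a tiny term for A and B

print(f"LoRA / full fine-tuning ≈ {lora_flops / full_ft_flops:.2f}")  # ≈ 0.67, i.e. about 2/3
```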
At scale, this matters. While attention mechanisms in transformers are compute-intensive, the bulk of operations in training large models still comes from matrix multiplications in both attention and MLP layers. A 33% FLOP reduction translates to real wall-clock time savings, especially when training on consumer hardware or running many experiments in parallel.
The RL Sweet Spot
In Reinforcement Learning settings—such as policy gradient fine-tuning on mathematical reasoning tasks—LoRA's advantage becomes even more compelling. Because RL from human feedback (RLHF) or policy gradient methods provide only a scalar reward signal per episode—essentially just a number indicating "good" or "bad"—the information content per training sample is extremely sparse (O(1) bits) regardless of how large the model is. Even a rank-1 adapter often suffices to absorb this sparse signal.
In their experiments on MATH and GSM8K using policy gradient methods, LoRA with r=8 easily matched full fine-tuning across 7B, 13B, and 70B parameter models. This suggests that in domains where post-training is more signal-limited than capacity-limited, LoRA is especially well-suited—even with minimal capacity.
What's Still Unknown
The research establishes empirical rules but leaves some questions open:
Theory of the 10× rule: The authors don't yet have a complete theoretical explanation for why the 10× learning rate factor holds so universally. They offer suggestive arguments about effective learning rate scaling in low-rank subspaces, but stop short of a full proof.
Variant techniques: The findings primarily apply to standard LoRA. It's unclear how well these rules transfer to variants like QLoRA (quantized LoRA for 4-bit or 8-bit base models), PiSSA (principal singular value and singular vector adaptation), or other adapter methods.
Capacity prediction: While the research identifies when adapters become capacity-constrained, there's not yet a precise formula for predicting the minimum rank needed for a given dataset size and model architecture.
Extreme scales: All experiments were on models up to 70B parameters. Whether these rules hold for 400B+ parameter models remains to be tested.
When NOT to Use LoRA
Despite these advances, full fine-tuning still has its place:
- Catastrophic distribution shifts: If you're fundamentally changing what a model does (e.g., teaching a language model to output structured code in a completely new format), full fine-tuning may be necessary
- Maximum absolute performance: In settings where every 0.1% of accuracy matters and compute is unlimited, full fine-tuning may still edge out LoRA
- Very large datasets: If you're training on millions of examples with high complexity, you may need extremely high rank (r=128+) or resort to full fine-tuning
Bottom Line: LoRA Comes of Age
From the vantage point of tool builders and model tinkerers, this "LoRA without regret" contribution is a paradigm shift. LoRA no longer has to be a contested heuristic; it can be a plug-and-play technique for post-training, with predictable guardrails.
This work arrives as the adapter landscape has grown crowded with alternatives like Adapters, Prefix Tuning, and (IA)³. What distinguishes this contribution is not a new technique, but a recipe that makes the existing, widely-adopted LoRA method predictable and reliable—exactly what practitioners need.
That means hobbyists and research engineers alike might reliably fine-tune large models with modest compute—on Colab, on a small cluster, even locally—without worrying whether the adapter will misbehave.
If you imagine the fine-tuning process as a delicate alchemy of hyperparameters and hidden state, what Thinking Machines brings is a recipe. They've transformed LoRA from "fingers-crossed science project" into "industrial tool." Their emphasis on stability, reproducibility, and predictable failure modes dovetails with their broader mission: making AI systems more reliable and trustworthy.
In the broader scheme, this also tightens the theoretical lens we have on how we allocate learning capacity relative to dataset complexity. LoRA becomes not just a utility, but a probe into where—and how—information is stored in a large model. When adapter learning parallels full matrix learning, we see that much of what full fine-tuning does is "nicely compressible" into low-rank updates—at least in post-training regimes.
If you're working in an environment where compute is constrained, or you run many experiments in parallel, these results invite you to ditch the old anxiety that LoRA might "break this time." With the rules in hand, you can tune confidently, expecting loss curves to align, performance to emerge, and your adapter to behave. LoRA has grown up—and with it, the art of post-training just got a lot more dependable.
Quick Start: How to Use These Rules
💡 Memory Savings in Practice: A rank-32 LoRA adapter on Llama 2 70B requires training only ~0.1% of the model's parameters, reducing VRAM requirements from ~280GB (full fine-tuning in bf16) to ~40GB—bringing 70B fine-tuning within reach of a single A100 GPU.
For a Llama 2 7B model on a math reasoning task:
- Apply adapters: Attention layers (Query, Key, Value, and Output/O projections) + MLP layers (up and down projections)
- Set rank: Start with r=32
- Learning rate: If full fine-tuning uses 2e-5, use 2e-4 for LoRA
- Batch size: 64 or 128
- Expect: Training loss should track full fine-tuning within 5-10% throughout training
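Put together, the checklist above looks roughly like the following with transformers and peft. Treat it as a sketch: the model name, dataset, tokenization, and hyperparameter values are illustrative rather than the paper's exact setup.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Rule 1: adapters on attention AND MLP projections
peft_config = LoraConfig(
    r=32, lora_alpha=32, lora_dropout=0.0, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)

# Math-reasoning data, as in the checklist (GSM8K used here as an example)
dataset = load_dataset("gsm8k", "main", split="train")

def tokenize(example):
    text = example["question"] + "\n" + example["answer"]
    return tokenizer(text, truncation=True, max_length=1024)

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="llama2-7b-gsm8k-lora",
    learning_rate=2e-4,                # Rule 2: ~10× a 2e-5 full fine-tuning rate
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,     # Rule 3: effective batch size of 64
    num_train_epochs=2,
    logging_steps=20,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```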
When to increase rank: If validation loss plateaus well above full fine-tuning after following all rules, try doubling rank (r=32 → r=64)
Troubleshooting: If you're seeing divergence or instability, check these in order: (1) Are adapters on ALL layers including MLPs? (2) Is LR truly ~10× your full fine-tuning rate? (3) Is batch size under 128? These three account for 95% of LoRA failures in practice.