For years, the language model arms race seemed to belong exclusively to cloud providers and their API keys. But something remarkable has happened in the past eighteen months: open-weight models have matured to the point where sophisticated, capable AI can now run entirely on consumer hardware sitting under your desk. The implications are profound. Your data stays local. There's no API bill. You're not beholden to a provider's terms of service or sudden model deprecations. And if you have even a modest GPU, the speed might actually impress you.
Why Run LLMs Locally?
Before we dive into the mechanics, let's be clear about what you're gaining and what you're trading away.
Privacy and Data Control: Your prompts never leave your machine. This matters more than it sounds, especially for businesses handling sensitive documents, medical records, or proprietary code. No cloud logs. No terms-of-service concerns. No surprise data retention policies.
Cost and Autonomy: Once you've paid for your hardware, inference is free. You're not exposed to pricing changes. You won't wake up to find your favorite model deprecated in favor of a newer expensive version. You have complete control over what versions you run, how you modify them, and how you deploy them.
Latency: A round trip to a cloud API typically adds 200-500ms. Local inference eliminates that network overhead. For real-time applications or interactive chatbots, this difference is tangible.
Customization: You can fine-tune models on your own data. You can add retrieval-augmented generation (RAG) pipelines. You can integrate custom tools. You're not limited by what the API provider decided to expose.
The tradeoffs are hardware cost, power consumption, and the reality that you're now responsible for your own infrastructure. Cloud APIs remain superior for occasional use, extreme scale, or when you need frontier models that simply won't fit on consumer hardware. But for many workflows, local inference has crossed the threshold from experimental novelty to practical default.
Understanding the Bottleneck: Memory Bandwidth Is Your Primary Performance Lever
When choosing hardware for running large language models locally, most people focus on processor speed or raw compute power. They shouldn't. Recent research makes it clear: for LLM inference, memory bandwidth—not compute power—is the primary performance limiter. Yet this remains poorly understood outside technical circles. Understanding why matters, because it completely changes how you should evaluate whether a machine is suitable for running LLMs.
Why Memory Bandwidth Matters: The Arithmetic Intensity Problem
LLMs have a fundamental characteristic called low arithmetic intensity: they perform relatively few mathematical operations per byte of data fetched from memory. When your machine generates each token, it must retrieve billions of model weights from RAM while performing comparatively little computation. That data movement—not the raw computation—becomes the bottleneck.
A detailed 2025 NVIDIA Research paper, Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity Are All You Need, demonstrates that inference throughput scales primarily with memory bandwidth, since transformer decoding requires fetching billions of weights repeatedly, overwhelming data movement capacity rather than compute units. IBM's Mind the Memory Gap study (2025) found that even GPUs with vast unused computational headroom become bandwidth-saturated, leaving compute units idle during inference.
In practical terms: upgrading a machine's compute capacity often produces minimal speedup in token generation, but increasing memory bandwidth can produce dramatic improvements. Academic analysis quantifies this: increasing effective memory bandwidth from GDDR6 (~700 GB/s) to HBM3 (~3.5 TB/s) can nearly quadruple throughput for large models without changing compute power at all.
The First Token vs. Subsequent Tokens Distinction
There's an important technical distinction worth understanding: the first token (when processing your prompt) can be somewhat compute-bound, since you're computing across the entire input sequence at once. But every subsequent token during generation is almost entirely memory-bound—because you're reusing the same model weights repeatedly while adding just one new token.
Databricks (2025) and Hathora (2025) research confirms this dynamic clearly: the first token is compute-bound, but all subsequent tokens are memory-bound. This means for interactive use—where you care most about token generation speed—bandwidth is nearly everything.
Real-World Example: Why an M3 Max Outperforms a Newer M4 Pro for LLMs
This principle plays out in the real world in ways that might surprise you. Consider Apple's laptop lineup: an older M3 Max with 48GB of unified memory (400 GB/s bandwidth) will actually run LLMs faster during token generation than a newer M4 Pro, despite the M4 Pro being a newer, more powerful chip overall.
Why? The M3 Max's substantially higher memory bandwidth (400 GB/s vs. the M4 Pro's 273 GB/s) means it moves model weights from memory to compute units far more efficiently. For LLM inference specifically, that bandwidth advantage overwhelms any generational compute improvements the M4 Pro brings. A two-year-old Max can outpace a current-generation Pro.
This scenario repeats across hardware ecosystems: higher-bandwidth systems consistently deliver better LLM inference performance than higher-compute alternatives, even when those alternatives are newer or more expensive.
The Balanced Perspective: What This Means for Your Hardware Choices
So, while calling LLM performance "entirely determined" by memory bandwidth oversimplifies things slightly, the evidence is overwhelming:
- Inference (token generation) ≈ strongly bandwidth-bound.
- Training or first-token processing ≈ partially compute-bound.
- Real-world performance ≈ min(compute throughput, memory bandwidth × arithmetic intensity), i.e., the classic roofline model.
For anyone running LLMs locally, here's what matters: When evaluating whether a machine is suitable for LLM workloads, prioritize memory bandwidth as your primary decision criterion. Memory capacity is important—you need enough to hold your model—but how fast your system can move that memory is what determines whether it runs smoothly or struggles. This is why machines with high-bandwidth memory systems outperform high-compute alternatives for LLM inference. It's also why many people are surprised to discover their "underpowered" older machine handles LLMs better than a newer system: they likely never checked the bandwidth specs.
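To make the roofline intuition concrete, here is a back-of-the-envelope sketch in Python. The spec values, the quantized file size, and the two-FLOPs-per-parameter rule of thumb are illustrative assumptions; real systems land below these ceilings because of kernel, scheduler, and KV-cache overhead.

```python
# Back-of-the-envelope throughput ceilings for LLM inference (illustrative assumptions).

def decode_ceiling_tps(bandwidth_gb_s: float, model_file_gb: float) -> float:
    """Decode (token generation): every token streams roughly the whole
    quantized model from memory, so bandwidth / model size bounds tokens/sec."""
    return bandwidth_gb_s / model_file_gb

def prefill_ceiling_tps(compute_tflops: float, params_billions: float) -> float:
    """Prefill (prompt processing): roughly 2 FLOPs per parameter per token,
    so peak compute bounds prompt tokens/sec."""
    return (compute_tflops * 1e12) / (2 * params_billions * 1e9)

# Hypothetical example: a ~32B model in Q4 (~18 GB file) on a GPU with
# ~1,000 GB/s of bandwidth and ~80 TFLOPS of compute.
bw_gb_s, tflops, file_gb, params_b = 1_000, 80, 18, 32

print(f"decode ceiling : ~{decode_ceiling_tps(bw_gb_s, file_gb):.0f} tokens/sec")   # ~56
print(f"prefill ceiling: ~{prefill_ceiling_tps(tflops, params_b):.0f} tokens/sec")  # ~1,250
# Decode hits the bandwidth term long before compute matters, which is the
# min(compute, bandwidth x intensity) roofline in action.
```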
Hardware: Finding Your Sweet Spot
Let's be concrete about what you can actually run on different hardware configurations. The landscape in 2025 is broader than ever, with options ranging from sub-$250 entry-level cards to workstation-class accelerators.
Performance Methodology Note: The token-per-second (t/s) figures below represent ballpark estimates based on internal testing and third-party benchmarks. Actual throughput depends significantly on model, quantization format, context length, scheduler, batch size, and thermal state. Figures assume single-stream inference on Q4_K_M or equivalent quantization at typical batch sizes (1–8). For accurate performance modeling, test on your specific hardware with your actual workload.
The Entry Level: Under $500
Intel Arc B580 ($249): The newest surprise entry. Intel's Arc GPUs have been gradually maturing, and the B580 offers genuinely impressive price-to-performance for small models. Twelve gigabytes of VRAM, reasonable bandwidth for the price point, and native oneAPI support. You're limited to 7B models, but if you're experimenting or prototyping, this is the most affordable path to local inference. The main caveat: software support is still developing, and community resources are thinner than NVIDIA's ecosystem.
NVIDIA RTX 4060 Ti ($499 new): A more conservative entry into NVIDIA's lineup. Sixteen gigabytes of VRAM, proper CUDA support, and mature tooling. You can comfortably run 7B–13B models, and squeeze into 30B territory with quantization. For a developer who wants guaranteed compatibility, this is a reasonable starting point. Performance lands in the 8-15 tokens per second range depending on the model.
Budget-Conscious: $700–1500
Used RTX 3090 ($700-900): This is the king of value for serious work. On the used market, you can find RTX 3090s for $700-900—roughly half the original $1,499 MSRP. It has 24GB of GDDR6X memory with around 936 GB/s of bandwidth, which is still competitive for inference despite being Ampere architecture (2020). You can run 30B models comfortably, even touch 70B with aggressive quantization. Real-world measurements show consistent 25-35 tokens per second on mid-range models. The catch: you're buying used hardware without warranty, and availability fluctuates. But the value proposition—performance per dollar—remains nearly impossible to beat. Many researchers still build multi-card rigs around used 3090s.
NVIDIA RTX 4070 Ti Super ($1000-1200 new): If you want a new card with warranty, this bridges the gap between entry and performance. Sixteen gigabytes of VRAM limits you to 7B–13B models comfortably, with 30B models requiring reduced context windows. Memory bandwidth is more constrained than the 3090, but it's still adequate for interactive use. You'll see 15-25 tokens per second depending on the model. It's also a capable gaming card, which matters if your GPU needs to do double duty.
CPU-Only Machines: If you have no GPU at all, don't despair. Modern CPUs with fast system RAM (ideally DDR5) can run small quantized models. A 7B model in 4-bit quantization fits in 4-5GB and delivers 2-5 tokens per second—enough for batch processing or non-interactive work. It's free to experiment with, but not practical for interactive chatbot use.
The Practical Optimum: $1500–2500
NVIDIA RTX 4090 ($1600–2000 new): For several years, this was the unambiguous consumer choice for serious local work. Twenty-four gigabytes of GDDR6X with ~1,008 GB/s of bandwidth and rock-solid CUDA support. It handles 30-32B models beautifully with full context windows. For 70B models, you'll need aggressive quantization (INT4 or lower). Real-world benchmarks consistently show 40-50 tokens per second. Power consumption runs around 450W peak. This card remains formidable, though its market position has shifted with newer alternatives.
NVIDIA RTX 5090 ($1999 new): NVIDIA's newest flagship, released in January 2025, represents a genuine leap forward. Thirty-two gigabytes of GDDR7 memory with ~1,792 GB/s of bandwidth—a 77% improvement over the 4090. On comparable mid-size models (30B and smaller in Q4), ballpark estimates suggest 60–70 tokens per second (vs. 40–50 on the 4090). Important caveat: 70B models in standard Q4_K_M quantization (~42–43 GB files) do not fit entirely on 32GB VRAM; you'll need aggressive 3-bit quantization, IQ-class compression, or CPU offload for full 70B models. For 30B–32B models with full context, the 5090 is exceptional. Higher power consumption (575W peak) than the 4090, but the bandwidth advantage justifies the premium for long-term purchases in 2025.
AMD Radeon RX 7900 XTX ($1200–1500): The AMD alternative with 24GB of VRAM and 960 GB/s bandwidth looks promising on paper. Performance is roughly comparable to the RTX 3090. The significant problem: software maturity. ROCm (AMD's CUDA equivalent) lags NVIDIA's ecosystem considerably. Framework support is inconsistent. On Windows, ROCm support is essentially nonexistent. Even on Linux, you'll face manual configuration and occasional incompatibilities. For Linux enthusiasts comfortable with troubleshooting, it can work. For everyone else, NVIDIA remains the safer choice.
Apple Silicon: The Unified Memory Alternative
Apple's unified memory architecture deserves serious consideration for local LLMs, particularly because memory bandwidth and capacity directly translate to performance—exactly what LLM inference demands. Unlike discrete GPUs where you move data from system RAM to VRAM, Apple's unified architecture means the GPU, CPU, and neural engine all access the same pool of memory at full bandwidth. This architectural elegance has real performance implications for inference workloads.
MacBook Air M3 ($1,099–$1,499): Entry point for local LLM experimentation. Eight to 24GB unified memory. Realistically supports 7B models comfortably, running around 10–15 tokens per second on quantized models. Thermals are tight under sustained load, but for casual use or prototyping, it works. Best for: students and hobbyists exploring local inference without GPU investment.
MacBook Pro M3 Pro ($1,999–$2,599): The sweet spot for portable LLM development. Eighteen to 36GB unified memory. Handles 13B models smoothly at 15–22 tokens per second. Battery life remains reasonable even under load. This is where you can build practical local AI applications—RAG pipelines, small agents, prompt experimentation—without being desk-bound. Best for: developers who need portability and prefer not to maintain separate GPU hardware.
MacBook Pro M4 Max (2025, $3,999–$5,300): Apple's flagship laptop chip, and the clearest demonstration of what unified memory can do. Thirty-six to 128GB unified memory. Runs 33B–70B quantized models at 30–45 tokens per second using the optimized MLX backend. This is legitimate desktop-class performance in a laptop. The M4 Max also brings significantly better thermal headroom than the previous generation while maintaining battery life. Best for: professional AI developers who travel or prefer a single powerful machine for all work.
Mac Studio M3 Ultra ($4,999–$8,999, 96–512GB): When you need to run multiple models simultaneously or work with truly large models. Eight hundred gigabytes per second of unified memory bandwidth—a figure that approaches high-end discrete GPUs. Can handle 70B unquantized models or experimental 600B+ models in 4-bit quantization for research purposes. The performance ceiling is higher than anything else outside the newest NVIDIA cards, and the power efficiency is extraordinary. Best for: organizations, research groups, or power users running multi-LLM deployments or fine-tuning pipelines.
Software Matters: Unlike Windows with competing frameworks, the Mac ecosystem has converged on excellent tools. MLX (Apple's framework) and Ollama both efficiently exploit unified memory, allowing you to load quantized weights directly into RAM with minimal overhead. This is where Apple's architecture shines—the lack of discrete VRAM means no copying penalty between system and GPU memory.
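For a sense of what this looks like in practice, here is a minimal sketch using the mlx-lm Python package built on MLX; it assumes mlx-lm is installed and that the quantized model identifier shown actually exists on Hugging Face, so swap in whichever MLX-format model you use.

```python
# Minimal MLX text generation on Apple Silicon (assumes: pip install mlx-lm).
from mlx_lm import load, generate

# Hypothetical 4-bit community conversion; substitute any MLX-format model repo.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Weights load straight into unified memory; there is no copy to a separate VRAM pool.
response = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in one paragraph.",
    max_tokens=200,
    verbose=True,  # prints tokens/sec, so you can see the bandwidth ceiling yourself
)
print(response)
```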
The Honest Assessment: Mac performance trails high-end discrete NVIDIA by roughly 20–30% on raw tokens per second when comparing models of similar size. An RTX 5090 beats the M3 Ultra in absolute throughput. But the total cost of ownership—price, power consumption, noise, space, system integration—shifts the calculation significantly. A Mac Studio M4 Max with 128GB at $4,500 competes directly with a $2,000 RTX 5090 when you factor in the entire system cost. And if you're already in the Apple ecosystem for other work, the integration is seamless. Best for: developers and organizations already committed to macOS, researchers prioritizing efficiency and quiet operation, or anyone who values not maintaining separate hardware ecosystems.
NVIDIA DGX Spark: The New Contender ($2,999–$3,999)
Released in October 2025, the NVIDIA DGX Spark represents something new: NVIDIA's answer to unified-memory inference workstations. It pairs a Grace Blackwell GPU with a 20-core Grace Arm CPU and 128GB of unified LPDDR5X memory (273 GB/s of bandwidth) in a single, quiet, 240W system.
Here's where the DGX Spark becomes a crucial case study for understanding LLM performance limitations: it perfectly demonstrates the prefill vs. decode bandwidth bottleneck in real-world benchmarks.
The Performance Story:
On prefill (prompt processing)—which is compute-heavy—the Spark is exceptional. Processing large prompts through a 120B model shows measured throughput around 1,700 tokens/sec in internal and third-party benchmarks. That's roughly 3× faster than three RTX 3090s in the same workload. For batched inference or applications that process large documents before generation, this is compelling.
But on decode (token generation)—which is memory-bound—the Spark hits a wall: ballpark estimates place it around 30–40 tokens per second on large models, compared to 100+ t/s from a well-tuned multi-GPU 3090 rig. Suddenly, the Spark's massive compute advantage evaporates.
Why? Memory bandwidth. The Spark's 273 GB/s bandwidth is respectable but insufficient for token generation at scale. The GPU has *more than enough compute power* to generate tokens faster, but it's starved for data. The decode phase is waiting for weights to stream from memory, not waiting for compute.
This is not a flaw—it's proof of our core thesis. The DGX Spark has approximately 80 TFLOPS of FP32 compute, roughly 3× what Apple's M3 Ultra provides. Yet decode performance barely improves over the M3 Ultra (which achieves 12–15 t/s on 70B models) because both systems are bandwidth-bound. You can't overcome a bandwidth bottleneck with more compute.
When the DGX Spark Makes Sense:
- Fine-tuning and training: The Spark fine-tunes an 8B Llama model with LoRA at >53,000 tokens per second—exceptional for a desktop system. If you're doing active model development, not just inference, this is compelling.
- Handling truly large models: With 128GB unified memory, you can work with models up to 200B parameters in FP4 quantization, something most consumer hardware cannot do.
- Prefill-heavy workloads: Applications that process large documents, code, or context before generation benefit significantly.
- Quiet, efficient operation: With a 240W power envelope and an efficient thermal design, the Spark is quiet and frugal compared to multi-GPU rigs.
The Honest Comparison:
For pure token generation speed on mid-size quantized models, the RTX 5090 materially outpaces the DGX Spark. With ~1,792 GB/s of bandwidth vs. the Spark's 273 GB/s, the 5090 simply generates tokens far faster. Real-world throughput depends heavily on model size, quantization, and batch size, but the bandwidth gap is the fundamental limiting factor. The Spark's appeal lies in model capacity (128GB) and prefill throughput, not decode speed.
Conversely, if you're bandwidth-saturated on a single Spark and need higher decode throughput, adding more units helps: aggregate bandwidth scales with unit count (though coordination overhead applies). Two networked Sparks or a multi-Spark cluster can reportedly achieve proportionally higher throughput, though we haven't verified this in real-world testing.
Bottom line: The DGX Spark is not for maximum token throughput. It's for organizations that need model capacity (128GB unified memory), don't want to manage multi-GPU complexity, and care about power efficiency. It's also the clearest evidence that once you're bandwidth-bound, more compute is just wasteful.
Workstation and Enterprise Grade: $4,000+
NVIDIA RTX A6000 ($4,000–5,000): Enterprise-class card with 48GB of VRAM. Handles up to 70B quantized models smoothly. CUDA-optimized and reliable. Overkill for most hobbyists, but standard for organizations running production local inference.
NVIDIA H100 (80 GB, $25,000–40,000): The data-center gold standard. Eighty gigabytes of VRAM, HBM3 memory with exceptional bandwidth, and NVLink support for multi-GPU scaling. Can handle 175B+ models. Only relevant for organizations serving inference to many concurrent users, not local experimentation.
Cost-Effectiveness: When Does Local Make Sense?
The ROI calculation depends on your usage. Local LLM operations achieve cloud cost parity within 6–12 months for anyone spending $500/month or more on API access. Electricity and cooling add roughly $50–200 monthly depending on location. This means:
- For $1000 spent on hardware (e.g., used RTX 3090), you reach parity with cloud APIs in approximately 6 months at moderate usage levels ($500/month API spend).
- Across a 3-5 year hardware lifespan, local deployment can save $10,000–50,000 compared to continuous cloud API usage for heavy users.
- For small teams, a single RTX 5090 ($2000) can handle what would cost $30,000+ annually in cloud API spending.
The key insight: bandwidth remains the limiting factor across all price tiers. Even enterprise hardware like H100s becomes bandwidth-saturated during inference. This is why a consumer RTX 5090 can get surprisingly close to the far pricier H100 for single-user inference on models that fit in its VRAM—beyond a point, a data-center card's extra compute simply sits idle during decode.
Portable & Ultra-Compact: The AMD Ryzen AI Max+ 395 Category
There's a newer class of systems worth mentioning: portable mini-PCs and laptops built around AMD's Ryzen AI Max+ 395 (Strix Halo architecture). These are genuinely interesting for a specific use case—developers and researchers who need local LLM capability without being tethered to a desk.
What makes them compelling: The Ryzen AI Max+ 395 integrates a high-performance Zen 5 CPU, Radeon 8060S GPU, and a dedicated XDNA 2 NPU—all in a power-efficient package. More importantly for LLMs, these systems support up to 128GB of LPDDR5X unified memory with roughly 256 GB/s of bandwidth. That's well short of high-end discrete cards but far beyond ordinary laptop memory, and it's enough to run 7B–13B models smoothly.
Mini-PC examples: Devices like the Beelink GTR9 Pro (~$1,399), AOOSTAR NEX395 (~$1,699–$2,099), and PELADN YO1 ($2,000) offer the full desktop-class experience in a compact form factor. You get dual Ethernet, USB4, proper cooling—everything you'd expect from a mini-PC. These compete with the RTX 5090 approach on price and portability, though with markedly lower absolute throughput.
Laptop options: If you need actual portability, ASUS ROG Flow Z13 (2025) ($2,499–$3,499) and HP ZBook Ultra G1 ($3,200–$8,000) bring workstation-class capability to 13–16 inch form factors. The ROG Flow is particularly interesting for mobile AI development—it's a convertible tablet with serious compute behind it.
The tradeoff: These systems excel at efficiency and portability but fall well short of discrete-GPU peak performance. An RTX 5090 will beat them comfortably on raw tokens per second, but the 395-based systems integrate NPU support, run cooler, and the entire platform is purpose-built for AI workloads. They're also quieter and consume less power than a desktop GPU rig.
Best for: Developers who travel, researchers prototyping models on the move, or organizations deploying local LLMs across multiple compact workstations without datacenter infrastructure. If you value quiet operation, low power consumption, and portability, these represent a genuine alternative to traditional GPUs.
Key Metrics: What to Look For
When evaluating hardware, focus on three numbers:
VRAM (Video RAM): Directly determines the largest model you can run. The common rule of thumb ("1/8th of parameters") is dangerously wrong. Here's the reality:
- FP32 precision: 4 bytes per parameter. A 30B model = 120 GB just for weights (before KV cache).
- FP16/BF16: 2 bytes/param → 60 GB for 30B.
- Q8 (8-bit quant): ~1 byte/param → 30 GB for 30B.
- Q4 (4-bit quant): ~0.5 bytes/param in theory; K-quant metadata and the higher-precision tensors it preserves push typical Q4_K_M files closer to 0.6 bytes/param → 15–18 GB for 30B.
For a 70B model in Q4: expect 35–43 GB (typical Q4_K_M files run ~42–43 GB, which does not fit on a 32GB RTX 5090 without aggressive 3-bit quantization or offload). KV cache adds 5–25 GB depending on context length and batch size—longer prompts and larger batches mean more memory overhead.
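If you want to sanity-check these numbers for a specific model, the arithmetic is simple enough to script. A rough sketch, assuming the bytes-per-parameter figures above and a plain multi-head KV cache (grouped-query attention models need far less cache, so treat the cache term as an upper bound):

```python
# Rough VRAM estimate: weights + KV cache (upper-bound sketch, not a guarantee).

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "q8": 1.0, "q4_k_m": 0.6, "q4": 0.5}

def weights_gb(params_b: float, quant: str) -> float:
    return params_b * BYTES_PER_PARAM[quant]

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_value: int = 2) -> float:
    # Two tensors (K and V) per layer, each of shape [context, kv_heads * head_dim].
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value / 1e9

# Hypothetical 70B-class config: 80 layers, 64 heads of dim 128, 8K context, FP16 cache.
w = weights_gb(70, "q4_k_m")          # ~42 GB, matching typical Q4_K_M file sizes
kv = kv_cache_gb(80, 64, 128, 8192)   # ~21 GB without grouped-query attention
print(f"weights ~{w:.0f} GB, KV cache ~{kv:.0f} GB, total ~{w + kv:.0f} GB")
```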
Memory Bandwidth: Measured in GB/s. This is your primary performance lever. For interactive use, aim for at least 300 GB/s. The RTX 3090 (~936 GB/s) and 4090 (~1,008 GB/s) both exceed this significantly; the RTX 5090 pushes ~1,792 GB/s. Newer integrated systems like the M4 Max achieve >0.5 TB/s through unified memory, demonstrating that bandwidth matters as much as VRAM capacity.
Power Consumption: Particularly relevant if you're running 24/7. A 450W GPU running continuously costs roughly $500–750 annually in electricity (depending on local rates). Factor this into total cost of ownership over 3–5 years.
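That electricity figure is easy to verify against your own rates; a quick sketch, with the wattage, duty cycle, and prices as assumptions to adjust:

```python
# Annual electricity cost for a GPU running around the clock (illustrative inputs).
watts = 450                          # sustained draw under load
hours_per_year = 24 * 365
kwh = watts * hours_per_year / 1000  # ~3,942 kWh

for rate in (0.13, 0.19):            # $/kWh, varies widely by region
    print(f"${rate:.2f}/kWh -> ~${kwh * rate:,.0f} per year")
# ~$512 to ~$749 per year, which is where the $500-750 range above comes from.
```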
The Software Stack: Tools for Running Local LLMs
The ecosystem has matured dramatically. You have choices, and they're all genuinely good.
llama.cpp: The Foundation
At the bedrock of nearly everything is llama.cpp, a plain C/C++ implementation of LLaMA inference created by Georgi Gerganov. It's the engine that powers most of what you'll use. Its key innovation was demonstrating that LLMs could run efficiently on consumer CPUs and GPUs through intelligent quantization and layer offloading.
Unless you're specifically interested in low-level optimization or fine-grained control, you won't interact with llama.cpp directly. But understand that when you use Ollama or LM Studio, they're building on top of llama.cpp's inference engine. It's the reliable, battle-tested foundation that made local inference practical.
Ollama: The Developer-Friendly Tool
Ollama (ollama.com) is arguably the most important tool in the modern local LLM ecosystem. It's a command-line application that manages the complexity of running LLMs behind a clean interface and REST API.
Why Ollama works so well: it handles model downloads, quantization selection, GPU configuration, and serves your model via a simple HTTP API compatible with OpenAI's format. This means you can point existing tools and applications built for ChatGPT at your local model with minimal changes.
```bash
ollama run mistral:latest
```
That one command downloads Mistral 7B in quantized form and opens an interactive chat session. Behind the scenes, Ollama is handling hardware detection, memory management, and inference. It's approachable but not simplistic—you have access to advanced features like model configuration files (Modelfiles) for customization.
Ollama runs on macOS, Linux, and Windows. Community support is excellent, documentation is clear, and the tool receives regular updates. For developers embedding LLMs into applications, Ollama's REST API is reliable and well-integrated.
Performance: Ollama adds minimal overhead. On the same hardware, you'll see 95-100% of the performance you'd get from llama.cpp directly.
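Because Ollama also exposes an OpenAI-compatible endpoint, existing client code usually needs nothing more than a different base URL. A minimal sketch, assuming the official openai Python package and a locally pulled mistral model (the API key is a placeholder; Ollama ignores it):

```python
# Point an OpenAI-style client at a local Ollama server (assumes: pip install openai).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="mistral:latest",                # any model you've pulled with `ollama pull`
    messages=[{"role": "user", "content": "Summarize why local inference is memory-bound."}],
)
print(response.choices[0].message.content)
```

The same pattern works for LM Studio's local server (port 1234 by default); only the base URL changes.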
LM Studio: The GUI Approach
If you prefer graphical interfaces over command-line tools, LM Studio (lmstudio.ai) is the polished alternative. It provides a desktop application for downloading, configuring, and chatting with models.
The interface is intuitive—browse available models from Hugging Face, download with a single click, adjust context window and GPU layers, and start chatting. For non-technical users or rapid experimentation, this is significantly easier than CLI tools.
LM Studio also provides a local REST API for programmatic access, integrating with Python and JavaScript SDKs. It runs on Windows, macOS, and Linux. On Apple Silicon, it leverages the MLX framework for optimized inference, delivering notably better performance than llama.cpp on Macs.
Performance: Slightly higher overhead than Ollama due to the Electron-based GUI, but still very competitive. Real-world differences are negligible for most users.
Jan: Privacy-First Alternative
Jan (jan.ai) positions itself as a privacy-first, open-source alternative. It's newer and less mature than the preceding options but has genuine strengths—offline-first design, extensible architecture, and active development. If you're specifically concerned with privacy and want open-source guarantees, Jan deserves consideration. It's still evolving, so expect occasional rough edges.
Integration into IDEs: Continue and Cline
For developers, integrating local LLMs directly into your code editor dramatically changes the workflow. Continue and Cline are VS Code extensions that connect to local Ollama instances, providing inline code completion and generation.
In practice, this means having a capable coding assistant running entirely on your hardware, with zero latency, zero API costs, and complete privacy. For many developers, this alone justifies the hardware investment.
Models: Which Ones Actually Work?
The landscape of capable open-weight models exploded in 2024-2025. These recommendations assume you want practical models for real work, not benchmark heroes.
For General Purpose Use
Mistral Small 3 (24B) represents the sweet spot for many users. Released early 2025, it achieves state-of-the-art performance on benchmarks, handles long contexts well, and fits comfortably in VRAM on 24GB+ cards. Real-world testing shows it's competitive with GPT-4 on many tasks while being genuinely useful for coding, writing, and analysis. It's fast—you'll see 30-50 tokens per second on an RTX 4090 in quantized form.
Qwen 3 (4B, 14B, 32B variants) from Alibaba represents the latest generation of multilingual models. The 4B variant is remarkable—genuinely capable while fitting in essentially any hardware. The 32B variant is exceptional for complex reasoning. These models show particular strength in non-English languages and multilingual tasks. Excellent context handling.
Phi-4 (14B) from Microsoft continues the efficiency trend—excellent for code and reasoning tasks on limited hardware. Despite its compact size, it delivers strong performance on common benchmarks while maintaining a small memory footprint, making it ideal for systems where VRAM is constrained.
Llama 3.1 (8B, 70B, 405B) from Meta remains the most versatile open-weight family. The 70B variant is a workhorse—it runs somewhat slower than Mistral Small on a 24GB card (quantization and context limitations), but it's still admirably fast and capable for a model of that scale. If you need raw capability, Llama 3.1 70B is there.
GPT-OSS (120B) released by OpenAI in August 2025 was genuinely surprising—OpenAI's first significant open-weight model since GPT-2. It uses a mixture-of-experts architecture (120B total parameters but only 5-6B active per token), meaning it fits on a single 80GB GPU. Performance is strong, competitive with GPT-4o on many benchmarks. This is a remarkable vote of confidence in open models from the company most invested in keeping models proprietary.
For Coding Tasks
Qwen 2.5 Coder and Qwen Coder variants have achieved genuinely impressive benchmarks on code generation, reasoning, and bug fixing. The 32B variant rivals Claude 3.5 Sonnet on some coding benchmarks. If code generation is your primary use case, these are excellent choices.
DeepSeek-Coder and its derivatives remain strong alternatives with good multilingual support and low quantization sensitivity.
For Advanced Reasoning
DeepSeek-R1 (released in January 2025) introduced "thinking" models to the open-weight space. The architecture involves explicit step-by-step reasoning similar to OpenAI's o1 model. The full DeepSeek-R1 is 671B parameters (far beyond any consumer machine), but distilled variants exist—especially the Qwen-32B distillation, which matches o1-mini performance while running on a single RTX 5090. If you need systematic reasoning for math, science, or complex problem-solving, these are worth trying.
For Small Devices and Edge
Mistral 7B remains the gold standard for resource-constrained environments. It outperforms Llama 2 13B on most benchmarks while being significantly smaller. Quantized to Q4, it fits in 4-5GB and delivers useful performance on any recent hardware.
Qwen 4B, 3B, and even 1.5B models work surprisingly well for simpler tasks. The 1.5B variant actually runs well on laptops and older GPUs. Quality is reduced but usable for specific applications.
Quantization: The Bridge Between Theory and Practice
Here's where local LLMs become practical. Quantization is the technique that allows 70B-parameter models to run on consumer hardware without becoming unusably slow.
Why Quantization Works
When you save model weights in standard floating-point format (FP32), each parameter uses 32 bits—4 bytes. A 70B model needs 280 gigabytes of storage just for weights. That's before accounting for overhead.
Quantization reduces the precision of those weights. A 4-bit quantization uses just 4 bits per parameter. The 70B model now needs 35 gigabytes of storage. Crucially, this doesn't necessarily reduce quality proportionally. Modern quantization techniques are sophisticated—they identify which parts of the model are most sensitive to precision loss and protect those, while more aggressively quantizing less-critical components.
Understanding GGUF and K-Quants
When you download a quantized model, it's usually in GGUF format. The filename tells you everything. A model labeled `model.Q4_K_M.gguf` breaks down as:
- Q: Quantized
- 4: 4-bit quantization (most common sweet spot)
- K_M: K-Quants clustering method, "Medium" variant
The main quantization levels you'll encounter:
- Q2_K: Maximum compression, significant quality loss. Only for severely constrained environments.
- Q3_K: Better quality than Q2, still very compressed. Use when VRAM is absolutely critical.
- Q4_K_M: The practical default. 4-bit quantization with K-Quants, medium variant. Typically loses <3% quality while cutting model size by roughly 75% compared to FP16.
- Q4_K_S / Q4_K_L: Small and Large variants of 4-bit. S uses less memory, L preserves more quality.
- Q5_K_M: 5-bit quantization. Noticeably better quality than Q4 at the cost of 25% larger files.
- Q6_K: 6-bit. Even closer to original quality, still significant compression.
- Q8_0: 8-bit. Minimal quality loss, close to full precision, but larger files.
In practice, Q4_K_M has become the standard default. It's the equilibrium point: quality is imperceptibly reduced for most tasks, model size is manageable, and inference speed is maximized.
Real Performance Impact
Testing confirms that quality sensitivity depends on model size. Large models (70B+) are relatively robust to Q4 quantization. Smaller models (7B) show slightly more quality degradation but it's usually imperceptible for practical tasks. For retrieval-augmented generation (RAG) specifically, research shows quantized 7B and 8B models perform nearly identically to FP16 versions on information retrieval tasks.
The practical recommendation: Start with Q4_K_M. If you notice quality issues, try Q5_K_M. If you're constrained on VRAM, try Q4_K_S or Q3_K. Below Q3, quality degradation becomes noticeable.
Performance: Real Numbers from Real Machines
Benchmarking is always context-dependent, but here are actual measurements from current hardware:
RTX 4090 (24GB VRAM)
- Mistral 7B (Q4_K_M): 45-55 tokens/sec
- Qwen 2.5 14B (Q4_K_M): 35-40 tokens/sec
- Qwen 2.5 32B (Q4_K_M): 15-20 tokens/sec (near VRAM limit)
- Llama 3 70B (Q4_K_M): Requires layer offloading to CPU, 8-12 tokens/sec
- Prompt processing (32B model at 4K context): 1,500-1,800 tokens/sec
RTX 5090 (32GB VRAM)
- Qwen 2.5 32B (Q4_K_M): 25-35 tokens/sec
- Llama 3 70B (IQ3/3-bit quant): 18-25 tokens/sec (a standard ~42 GB Q4_K_M file would exceed the 32GB of VRAM)
- Qwen 2.5 72B (Q4_K_M): 20-28 tokens/sec
- Prompt processing (70B model at 8K context): 2,500-3,200 tokens/sec
Used RTX 3090 (24GB VRAM)
Performance is comparable to RTX 4090 for inference—the bandwidth difference isn't as dramatic as on training workloads. Typically 85-90% of RTX 4090 performance on the same models.
Apple M3 Ultra (512GB unified memory)
- Mistral 7B (Q4_K_M): 25-35 tokens/sec
- Llama 3 70B (Q4_K_M): 12-18 tokens/sec
- Qwen 3 235B (FP8): 25-35 tokens/sec (yes, really—the unified memory and bandwidth make this feasible)
The M3 Ultra's efficiency is notable. It achieves respectable inference speed while using a fraction of the power of NVIDIA equivalents. For researchers who need flexibility with very large models, it's increasingly attractive despite the premium price.
CPU-Only (DDR5 System, High-End CPU)
- Mistral 7B (Q4_K_M): 3-5 tokens/sec
- Qwen 1.5 4B (Q4): 5-8 tokens/sec
The bandwidth limitation is stark. For CPU-only, focus on small models and non-real-time workloads.
Going Beyond Chat: Advanced Techniques
Once you have local inference working, there are powerful patterns that become accessible.
Retrieval-Augmented Generation (RAG)
RAG solves a fundamental problem: LLMs have a knowledge cutoff and can hallucinate. By retrieving relevant documents before generating a response, you ground the model's output in actual data.
In practice: You have a collection of documents (PDFs, articles, code, etc.). A retrieval system (typically semantic search using embeddings) identifies relevant documents. Those documents are passed to the LLM as context. The LLM generates an answer grounded in that context.
For local LLM work, RAG is transformative. Small quantized models combined with RAG often outperform larger models without retrieval. The technique scales: you can add documents to your knowledge base without retraining or fine-tuning the model.
Tools like LlamaIndex and LangChain simplify RAG implementation. Practical example: with Ollama running a 7B model and LlamaIndex, you can point at a folder of PDFs and ask questions about them—getting accurate, sourced answers in seconds, all local and private.
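A minimal sketch of that pattern, assuming llama-index plus its Ollama and Hugging Face embedding integration packages are installed, a local Ollama server is running with a mistral model pulled, and your documents sit in a ./docs folder (all of these are assumptions to adapt):

```python
# Minimal local RAG sketch: LlamaIndex + Ollama, everything stays on your machine.
# Assumes: pip install llama-index llama-index-llms-ollama llama-index-embeddings-huggingface
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local LLM (via Ollama) and a small local embedding model instead of a cloud API.
Settings.llm = Ollama(model="mistral:latest", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

documents = SimpleDirectoryReader("./docs").load_data()   # PDFs, text, markdown...
index = VectorStoreIndex.from_documents(documents)        # embeds and stores chunks

query_engine = index.as_query_engine(similarity_top_k=3)  # retrieve 3 chunks per query
print(query_engine.query("What does the contract say about termination notice?"))
```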
Model Context Protocol (MCP)
MCP is an emerging standard for giving LLMs access to external tools and data. A local LLM can, with MCP, control a web browser, query databases, run code, or access APIs—all while staying under your control, with full audit trails and privacy.
This is still early but rapidly maturing. The implication is profound: local LLMs become genuinely agentic, capable of multi-step workflows without requiring cloud services.
Fine-Tuning and Adaptation
Training a model from scratch requires vast compute. But fine-tuning—adapting a pre-trained model to your specific data or style—is feasible on modest hardware. LoRA (Low-Rank Adaptation) and similar techniques make this practical, sometimes consuming only a few GB of VRAM to adapt a model to your specific use case.
For specialized domains (legal, medical, specific codebases), this investment often pays dividends.
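As a flavor of how lightweight this is, here is a sketch using Hugging Face's peft library; the base model, rank, and target modules are illustrative choices, and you would still need a dataset and a training loop around it.

```python
# LoRA setup sketch (assumes: pip install transformers peft accelerate, plus a GPU).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.3"   # illustrative base model; swap in your own
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")

lora = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the base model's weights
```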
Building Your System: Practical Configuration
Let's walk through setting up a functional local LLM system starting from bare metal.
Hardware Assembly (Assuming GPU-Based Setup)
1. Start with a quality power supply. If you're considering a 450W GPU, your PSU should be 850W+ to handle spikes. Quality PSUs save money over time.
2. Ensure adequate cooling. GPU temperatures should stay below 80°C under load. If your case doesn't have good airflow, upgrade it before you upgrade the GPU. Thermal throttling destroys performance.
3. Don't skimp on RAM. Even with GPU offloading, system RAM matters. 32GB minimum is reasonable for 2025. 64GB is better if you're doing any CPU-based tasks alongside GPU inference.
4. Storage: Models are large. A 70B model in Q4 quantization is 35+ gigabytes. NVMe SSDs are essential for reasonable load times.
Software Setup
Linux (Recommended for Performance)
Linux remains the most mature platform for local LLMs. Install CUDA 12.8+ (or ROCm for AMD). Download Ollama or LM Studio. Start running models. The experience is stable and performant.
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Download and run a model
ollama run mistral:latest
# Model is now available at http://localhost:11434/api/chat
```
Windows
Fully supported now. Windows support for Ollama reached feature parity with Linux in 2024. LM Studio works excellently. Installation is straightforward. The GPU support is solid.
macOS
Works well, especially on Apple Silicon. LM Studio's MLX integration makes it the preferred tool for Mac users—it delivers noticeably better performance than CPU-based backends.
Initial Configuration Checklist
1. Verify CUDA is installed and recognized: `nvidia-smi` should show your GPU.
2. Download a small model to test: `ollama pull mistral:latest` (about a 4GB download)
3. Verify it runs: `ollama run mistral:latest`
4. If you're using applications that connect to the Ollama API, verify connectivity: `curl http://localhost:11434/api/tags`
5. Monitor resource usage during inference: does your GPU load to 100%? Is memory bandwidth the bottleneck? These observations inform model selection.
Quick Start in 5 Minutes
Get up and running with local LLMs immediately. Choose your platform and follow the commands below.
For Ollama (Works on Linux, macOS, Windows)
Installation & First Model (on macOS/Linux):
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start the Ollama service
ollama serve &
# In a new terminal, download and run Mistral 7B
ollama pull mistral:latest
# Chat with the model
ollama run mistral:latest
# Type your question and press Enter. Type "/bye" to exit.
# The model is now accessible via API at http://localhost:11434
curl http://localhost:11434/api/generate -d '{
"model": "mistral:latest",
"prompt": "Why is the sky blue?",
"stream": false
}'
```
Windows Installation:
- Download the installer from https://ollama.com
- Run the installer and follow prompts
- Open PowerShell and run: `ollama run mistral:latest`
- Chat in the terminal window that appears
Load Different Models:
```bash
# Qwen 7B (multilingual, fast)
ollama pull qwen:7b-chat-q4_K_M
ollama run qwen:7b-chat-q4_K_M
# Llama 3 8B (general purpose)
ollama pull llama3:8b
ollama run llama3:8b
# Mistral Small 3 24B (high quality, requires 24GB VRAM)
ollama pull mistral-small:24b
ollama run mistral-small:24b
```
For LM Studio (GUI, Works on macOS, Windows, Linux)
Installation & Setup:
1. Download & Install:
- Go to https://lmstudio.ai
- Download for your operating system
- Install and launch LM Studio
2. Load a Model:
- Click the search icon (magnifying glass) on the left sidebar
- Search for "Mistral 7B Q4"
- Click the download button next to `mistral-7b-instruct-v0.2.Q4_K_M.gguf`
- Wait for download to complete (~5 GB)
3. Start Chatting:
- Click the "Chat" tab at the top
- Select the downloaded model from the dropdown
- Click "Load Model"
- Type in the chat box and press Enter
4. Enable API Access (optional):
- Go to the "Developer" tab
- Click "Start Server"
- Your model is now accessible at `http://localhost:1234/v1/chat/completions`
- Use with any OpenAI-compatible client
Pro Tips:
- Start with Mistral 7B or Qwen 7B if your GPU has <16GB VRAM
- Use Mistral Small 3 (24B) for better quality if you have 24GB+ VRAM
- Monitor GPU usage in LM Studio's bottom panel—aim for 95%+ utilization
- If responses are slow, ensure GPU is being used (not CPU)
Common Pitfalls and Solutions
The model is slow: First question: is your GPU being used? Check with `nvidia-smi`. If GPU is idle and CPU is maxed, you have a GPU offloading problem. In Ollama, adjust the `num_gpu` parameter. In LM Studio, increase the GPU layers. The goal is to have GPU at 99% utilization.
Out of memory crashes: The KV cache (model's working memory) grows with context length. Reduce your context window. For a 30B model, 2K-4K tokens of context is reasonable on 24GB VRAM. 8K tokens requires 32GB+.
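Both of the fixes above can also be applied per request through Ollama's API options rather than a Modelfile; a hedged sketch using Python's requests library, with the model name and values purely illustrative:

```python
# Tuning Ollama per request: GPU layer offload and context window (illustrative values).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral:latest",
        "prompt": "Give me three test cases for a date parser.",
        "stream": False,
        "options": {
            "num_gpu": 99,    # offload as many layers as fit on the GPU
            "num_ctx": 4096,  # a smaller context window keeps the KV cache in check
        },
    },
    timeout=300,
)
print(resp.json()["response"])
```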
Quality seems degraded: Likely a quantization issue. Try a higher-quality quant (Q5_K_M instead of Q4_K_M). If that helps, quantization is the issue and you might need more VRAM. If it doesn't help, it's model selection—try a different base model.
Inconsistent performance: Check for thermal throttling. GPU temperatures above 85°C often lead to performance degradation. Improve case airflow or increase cooling.
Occasional hallucinations or nonsense output: This is largely inherent to LLMs and quantization has limited impact. Use RAG to ground responses in actual data. Use more specific prompting.
Economics: Does Local Make Sense?
Hardware costs must be weighed against alternative approaches.
For Occasional Use: Cloud APIs (Claude, GPT-4) are cheaper. API costs are low enough that unless you're generating millions of tokens monthly, the cloud remains more economical.
For 1-10 Million Tokens Monthly: A single RTX 5090 ($2,500) with local LLMs becomes cost-effective within 3-6 months compared to heavy cloud API usage. Factor in electricity (rough estimate: $500-800 annually) and the ROI is clear.
For Teams and Organizations: The calculus shifts further toward local deployment. Data privacy regulations (HIPAA, GDPR) sometimes require local processing. Vendor lock-in becomes a strategic concern. A hybrid approach—local deployment for sensitive work, cloud for general tasks—is increasingly common.
For Research and Development: Local deployment is nearly mandatory. Rapid iteration, unlimited experimentation, and cost-free inference at scale are invaluable.
The Road Ahead
The local LLM ecosystem is moving fast. What changed between 2024 and 2025:
- Model quality: Models are approaching closed-source quality on many benchmarks. Qwen 3, DeepSeek-R1, and others genuinely rival commercial offerings.
- Tool maturity: Ollama, LM Studio, and companions are production-ready. The friction of local setup has dropped dramatically.
- Hardware efficiency: Quantization and distillation techniques improve constantly. Models are getting better while requiring fewer resources.
- Ecosystem features: RAG, MCP, multimodal models, and tool use are transitioning from experimental to practical for open models.
Looking ahead to late 2025 and beyond:
- Next-gen quantization: Dynamic quantization techniques that apply different precision to different layers are emerging. Expect models to run even more efficiently.
- Multimodal consolidation: Open vision-language models are improving rapidly. Local image-to-text and reasoning over images will become practical.
- Hardware specialization: Custom silicon designed for LLM inference (Huawei Ascend, custom NPUs) is entering the market. CPU inference will improve.
- Edge deployment: With models shrinking while capability grows, expect local LLMs on phones and edge devices to move from novelty to common.
Conclusion: A New Default
Local LLMs have crossed a threshold. They're no longer a hobbyist experiment or academic curiosity. They're a practical, cost-effective, privacy-preserving alternative to cloud APIs for a wide range of applications.
The hardware investment is real but increasingly reasonable. A high-VRAM GPU—whether a used 24GB RTX 3090 or a new 32GB RTX 5090—provides genuine, productive capability. The software is mature and user-friendly. The models are capable.
If you value privacy, need cost predictability, require customization, or simply want to own your AI infrastructure, the time to move local is now. The barrier to entry has dropped. The capability is there. And the independence is worth it. The unifying lesson across every benchmark: once memory bandwidth becomes the ceiling, adding more compute yields diminishing returns.
Appendix: Quick Reference
Recommended Configurations by Use Case
Writer/Researcher on Budget:
- Used RTX 3090 + Ollama + Mistral 7B
- Cost: ~$1000 | Performance: 35+ tokens/sec | Recommended
Developer (Main Machine):
- RTX 4070 Ti Super + LM Studio + Qwen 2.5 14B
- Cost: ~$1200 | Performance: 25+ tokens/sec | Recommended
Serious Local LLM Work:
- RTX 5090 + Ollama + Mistral Small 3 (24B)
- Cost: ~$2500 | Performance: 50+ tokens/sec | Highly Recommended
Apple Ecosystem:
- Mac Studio M3 Ultra 512GB + LM Studio + Qwen 3 (32B)
- Cost: $7000+ | Performance: 30+ tokens/sec | Alternative if in ecosystem
Code-Focused Developer:
- RTX 4090 + Ollama + Qwen 2.5 Coder 32B
- Cost: ~$2000 | Performance: 20+ tokens/sec | Recommended
GPU Specifications Reference Table
Quick reference for comparing GPUs discussed in this guide. Token/sec figures are ballpark estimates for Q4_K_M quantized models at typical batch sizes (1–8) with single-stream inference.
| GPU Model | VRAM | Memory Bandwidth | Typical Models | Tokens/sec Range | Best For |
|---|---|---|---|---|---|
| Intel Arc B580 | 12GB | ~456 GB/s | 7B | 5–12 t/s | Budget entry-level |
| RTX 4060 Ti | 16GB | ~288 GB/s | 7B–13B | 8–15 t/s | Conservative entry |
| Used RTX 3090 | 24GB | ~936 GB/s | 30B | 25–35 t/s | Value champion |
| RTX 4070 Ti Super | 16GB | ~672 GB/s | 7B–13B | 15–25 t/s | New + gaming |
| RTX 4090 | 24GB | ~1,008 GB/s | 30B–70B | 40–60 t/s | High performance |
| RTX 5090 | 32GB | ~1,792 GB/s | 70B–120B | 60–100 t/s | Flagship (2025) |
| Apple M3 Max | 48GB unified | 400 GB/s | 30B–70B | 12–18 t/s | Apple ecosystem |
| Apple M4 Max | 36–128GB unified | up to 546 GB/s | 30B–70B | 15–22 t/s | Latest Apple |
| DGX Spark | 128GB unified | 273 GB/s | 70B–200B | 30–40 t/s (decode) | Large models, prefill |
| CPU-Only (DDR5) | System RAM | ~90–100 GB/s (dual-channel) | 3B–7B | 1–5 t/s | No GPU available |
Reading the Table:
- Memory Bandwidth is the primary performance lever for token generation (decode phase)
- Tokens/sec varies significantly by model, quantization, context length, and system state
- Higher bandwidth consistently outperforms higher compute for LLM inference
- For interactive use, prioritize bandwidth over raw compute power
- Apple unified memory systems benefit from high bandwidth (400+ GB/s) but deliver lower absolute throughput than high-end discrete GPUs
Download Resources
- Ollama: ollama.com
- LM Studio: lmstudio.ai
- LlamaIndex: llamaindex.ai (for RAG)
- Models: huggingface.co (search for GGUF format)
Monitoring During Inference
```bash
# On Linux with an NVIDIA GPU, watch utilization during inference
watch -n 1 nvidia-smi
# Check Ollama's logs
journalctl -u ollama -f # Linux with systemd
tail -f ~/.ollama/logs/server.log # Alternative
```
Monitor until your GPU is at 95-99% utilization. If it's lower, you have a bottleneck elsewhere (likely CPU or system RAM).
Sources and References
Academic Papers and Research
NVIDIA Research (2025). "Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity Are All You Need." Davies, Michael, Neal Crago, Karthikeyan Sankaralingam, and Christos Kozyrakis. arXiv:2507.14397. https://arxiv.org/abs/2507.14397 - Research demonstrating that inference throughput scales primarily with memory bandwidth for transformer-based language models.
Barcelona Supercomputing Center and IBM Research (2025). "Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference." Recasens, Pol G., Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, and Josep Ll. Berral. arXiv:2503.08311. https://arxiv.org/abs/2503.08311 - Study confirming that GPUs with substantial computational headroom become bandwidth-saturated during inference tasks, with compute units remaining idle. Accepted at IEEE CLOUD 2025.
Databricks (2025). "LLM Inference Performance Engineering: Best Practices." https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices - Blog post on first token versus subsequent token performance characteristics in LLM inference, demonstrating that first token generation is compute-bound while subsequent tokens are memory-bound.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems, 30. arXiv:1706.03762. https://arxiv.org/abs/1706.03762 - Foundational paper introducing the Transformer architecture underlying all modern large language models.
Software Tools and Frameworks
Ollama - Local LLM runner and inference engine. https://ollama.com
LM Studio - Desktop application for running local language models. https://lmstudio.ai
LlamaIndex - Framework for building retrieval-augmented generation (RAG) systems. https://llamaindex.ai
vLLM - Open-source LLM inference engine optimized for throughput and memory efficiency. https://vllm.ai
GGUF Format - Quantized model format enabling efficient local inference across diverse hardware. https://huggingface.co/docs/hub/gguf
Model Repositories and Resources
Hugging Face Model Hub - Repository of open-weight models in GGUF and other formats. https://huggingface.co
LMSYS Chatbot Arena Leaderboard - Continuous evaluation and benchmarking of language models. https://lmarena.ai/
Notable Open-Weight Models Referenced
- Qwen 3 - State-of-the-art open model approaching closed-source quality
- DeepSeek-R1 - Reasoning-focused open model competing with commercial offerings
- Mistral 7B and variants - Efficient 7B baseline model and specialized variants (Small 3, Coder)
- Llama 3.1 - Meta's high-quality open model family
- Phi 4 - Microsoft's efficient model series
- Neural Chat - Intel-optimized conversational model
- Starling-7B - Community-trained model using GPT-4 distillation
Hardware Documentation
NVIDIA CUDA Documentation - Official documentation for CUDA development and GPU acceleration. https://docs.nvidia.com/cuda/
NVIDIA NVLink Architecture - Technical specifications for high-bandwidth GPU interconnect. https://en.wikipedia.org/wiki/NVLink
Apple Metal Performance Shaders - Apple's GPU acceleration framework for Metal. https://developer.apple.com/metal/
Intel Arc GPU Documentation - Intel's discrete GPU architecture and software stack. https://www.intel.com/content/www/us/en/developer/articles/guide/arc-discrete-graphics.html
AMD ROCm Documentation - AMD's CUDA equivalent for GPU compute. https://rocmdocs.amd.com/
Industry Standards and Benchmarks
Quantization Research and Methods:
- GPTQ (2022): Frantar, E., Ashkboos, S., Hoover, B., & Dettmers, T. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." https://github.com/IST-DASLab/gptq
- SmoothQuant (2023): Xiao, G., Lin, Y., & Han, S. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." ICML 2023. https://arxiv.org/abs/2211.10438
- AWQ (2024): Lin, Y., Tang, H., Yang, S., Zhang, Z., Xiao, G., Gan, C., & Han, S. "Awq: Activation-aware weight quantization for on-device llm compression and acceleration." MLSys 2024.
- LLM.int8() (2022): Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022. https://arxiv.org/abs/2208.07339
- Model Compression and Efficient Inference for Large Language Models: A Survey (2024): Comprehensive survey of quantization techniques. https://arxiv.org/abs/2402.09748
- Resource-Efficient Language Models: Quantization for Fast and Accessible Inference (2025): Overview of quantization methods from 3-bit to full precision. https://arxiv.org/abs/2505.08620
Additional Resources
NVIDIA GPU Specifications - Technical datasheets and specifications for RTX series GPUs referenced (RTX 4060 Ti, 4070 Ti Super, 4090, 5090, 3090):
- RTX 5090: 32GB GDDR7, ~1,792 GB/s bandwidth, 575 W TGP (official specs)
- RTX 4090: 24GB GDDR6X, ~1,008 GB/s bandwidth, 450 W TGP
- RTX 3090: 24GB GDDR6X, ~936 GB/s bandwidth, 350 W TGP
Apple Silicon Specifications - Technical documentation for M-series chip architectures and unified memory systems:
- M3 Pro: 150 GB/s unified memory bandwidth
- M3 Max (30-core GPU): 300 GB/s
- M3 Max (40-core GPU): 400 GB/s
- M4 Max: >0.5 TB/s (>500 GB/s)
DGX Spark Specifications - Engineering and pricing details:
- 128GB LPDDR5X unified memory @ 273 GB/s bandwidth
- October 2025 availability; $3,999 list price
- 240 W thermal design
Model Files - Reference for quantization and VRAM requirements:
- Llama 3 70B Q4_K_M: ~42–43 GB (does not fit on 32GB VRAM without offload or aggressive 3-bit quantization)
- Llama 3 70B FP16: ~140 GB
- Llama 3 30B Q4_K_M: ~15–18 GB
Python and Development Tools - PyTorch, Hugging Face Transformers, and related ML frameworks.