A 70-billion parameter model in full precision requires 140GB of memory — more than any consumer GPU offers. Yet people run these models on laptops with 32GB of RAM. The secret is quantization: compressing model weights from 16-bit floats to 4-bit integers, reducing memory by 75% with surprisingly little quality loss.
Foundation: Why Models Are So Large
Each parameter in a neural network is a number — typically stored as a 16-bit floating point value (FP16). A 70B parameter model has 70 billion of these numbers: 70B × 2 bytes = 140GB. That is just the weights. During inference, you also need memory for activations, KV cache, and the runtime itself.
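The arithmetic is simple enough to sketch (a back-of-envelope helper; the function name is illustrative):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, ignoring activations and KV cache."""
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

print(weight_memory_gb(70e9, 16))  # FP16: 140 GB
print(weight_memory_gb(70e9, 4))   # 4-bit: 35 GB
```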
The insight behind quantization is that models are massively over-precise. Most weights cluster around small values, and the difference between storing them as 16-bit floats versus 4-bit integers is negligible for the model's actual computations.
How Quantization Works
The Basic Idea
Take a group of weights (typically 32 or 128), find their range, and map that range onto a smaller set of discrete values. A 4-bit integer can represent 16 distinct values (0-15). If your weights range from -1.5 to +1.5, you map that range onto 16 evenly spaced points. Each weight is rounded to the nearest point.
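Here is a minimal sketch of that round-trip in pure Python (blockwise affine quantization; the function names are illustrative, and real implementations pack the integers into bit-level arrays):

```python
def quantize_block(weights, bits=4):
    """Map a block of floats onto 2**bits evenly spaced levels (stored as ints)."""
    lo, hi = min(weights), max(weights)
    steps = 2 ** bits - 1                    # 15 steps between 16 points
    scale = (hi - lo) / steps or 1.0         # avoid zero scale for constant blocks
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo                      # ints plus the metadata needed to invert

def dequantize_block(q, scale, lo):
    """Recover approximate floats; error is at most scale / 2 per weight."""
    return [v * scale + lo for v in q]

block = [-1.5, -0.7, 0.0, 0.2, 1.5]          # toy weights spanning -1.5..+1.5
q, scale, lo = quantize_block(block)
approx = dequantize_block(q, scale, lo)      # each value within ~0.1 of the original
```

Note that the scale and minimum must be stored alongside the integers, which is why real 4-bit formats cost slightly more than 4 bits per weight.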
GGUF Format
The GGUF format (used by llama.cpp and Ollama) supports multiple quantization levels:
- Q8_0: 8-bit quantization. ~70GB for a 70B model. Virtually no quality loss.
- Q5_K_M: 5-bit mixed precision. ~48GB. Very slight quality reduction.
- Q4_K_M: 4-bit mixed. ~40GB. The sweet spot for most users.
- Q3_K_M: 3-bit mixed. ~30GB. Noticeable quality drop but still usable.
- Q2_K: 2-bit. ~25GB. Significant degradation. Research use only.
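Because each block also stores scale metadata, the effective bits per weight run slightly above the nominal bit width, which is where these file sizes come from. A rough estimate (the bits-per-weight figures below are approximations, not exact llama.cpp values):

```python
# Approximate effective bits per weight for common GGUF quant types; block
# scales add overhead above the nominal bit width.
BPW = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.6}

def file_size_gb(n_params: float, quant: str) -> float:
    return n_params * BPW[quant] / 8 / 1e9

for name in BPW:
    print(f"{name}: {file_size_gb(70e9, name):.0f} GB")
```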
Why "K" and "M" Matter
The K-quant variants organize weights into small blocks that share a quantization scale, then group those blocks into "super-blocks" whose scales are themselves quantized, cutting metadata overhead. (Despite the letter, the "K" does not refer to k-means clustering.) The M (medium) designation means critical layers (attention output, feed-forward gating) are quantized less aggressively than less sensitive layers. This mixed-precision approach preserves quality where it matters most.
The Trade-offs
Quantization is not free. Here is what you sacrifice:
- Reasoning quality: Complex multi-step reasoning degrades first. A Q4 model might solve 92% of what the FP16 version handles.
- Instruction following: Subtle nuance in instructions may be missed, because the model's numerical "resolution" is lower.
- Creative coherence: Long creative outputs may show more repetition or drift.
What you gain: 70-75% memory reduction, 2-3x faster inference (memory bandwidth is the bottleneck), and the ability to run models that would otherwise require datacenter hardware.
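The speedup follows directly from the bandwidth argument: generating one token streams every weight through the processor once, so decode throughput is roughly memory bandwidth divided by model size. A back-of-envelope sketch (the 400 GB/s figure is illustrative):

```python
# Decode speed is bandwidth-bound: each generated token reads all weights once,
# so throughput is approximately bandwidth / model size.
def tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

# Illustrative numbers: a machine with 400 GB/s of memory bandwidth.
fp16 = tokens_per_sec(400, 140)   # 70B at FP16
q4 = tokens_per_sec(400, 40)      # 70B at Q4_K_M
print(q4 / fp16)                  # the quantized model decodes ~3.5x faster
```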
Practical Guide
For most users running models locally with Ollama:
- 16GB RAM: Stick with 7B-13B models at Q4_K_M
- 32GB RAM: Run 30B-34B models at Q4_K_M, or 70B at Q2_K
- 64GB RAM: 70B at Q4_K_M — the sweet spot for maximum quality on consumer hardware
- GPU with 24GB VRAM: 7B-13B fully offloaded to GPU for maximum speed
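The guide above can be expressed as a tiny lookup helper (a hypothetical function whose thresholds mirror the bullets; real headroom depends on context length, KV cache size, and OS overhead):

```python
# Hypothetical helper encoding the RAM guide above; not a substitute for
# checking your machine's actual free memory.
def recommend(ram_gb: int) -> str:
    if ram_gb >= 64:
        return "70B at Q4_K_M"
    if ram_gb >= 32:
        return "30B-34B at Q4_K_M, or 70B at Q2_K"
    return "7B-13B at Q4_K_M"

print(recommend(32))
```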
Common Misconceptions
"Quantized models are significantly worse"
At Q4_K_M, most users cannot distinguish the output from the full-precision version in blind tests. The perplexity increase is typically under 1%.
"You need a GPU for local inference"
CPU inference with quantized models is viable. It is slower than GPU, but a modern CPU with 64GB RAM running a Q4 70B model produces ~5-10 tokens per second — usable for many applications.
Going Deeper
Explore AWQ (Activation-aware Weight Quantization) and GPTQ for GPU-optimized quantization. Look into speculative decoding for speed improvements. And watch the emerging field of 1-bit models — binary neural networks that push quantization to its theoretical limit.
Frequently Asked Questions
Which quantization level should I use?
Q4_K_M is the default recommendation. It provides the best balance of quality and memory savings for most use cases. Only go lower if you must due to hardware constraints.
Can I quantize a model myself?
Yes. The llama.cpp repository includes conversion and quantization tools. Convert the full-precision checkpoint to a GGUF file, then run the quantization tool to produce a GGUF at your chosen quality level.
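A sketch of that workflow with llama.cpp's tooling (paths and model names are illustrative; check the repository's README for the exact script and binary names in your version):

```shell
# 1. Convert the original checkpoint (e.g. Hugging Face format) to an FP16 GGUF.
python convert_hf_to_gguf.py ./my-model --outfile my-model-f16.gguf

# 2. Re-quantize the FP16 GGUF down to the desired level.
./llama-quantize my-model-f16.gguf my-model-q4_k_m.gguf Q4_K_M
```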
Does quantization affect fine-tuned models differently?
Fine-tuned models can be more sensitive to quantization if the fine-tuning pushed weights into unusual ranges. Test your specific model at different quantization levels to find your quality threshold.