A 70-billion-parameter model stored at 16-bit precision requires 140GB of memory, more than any consumer GPU offers. Yet people run these models on laptops with 32GB of RAM. The trick is quantization: compressing model weights from 16-bit floats to 4-bit integers, cutting memory by 75% with surprisingly little quality loss.

Foundation: Why Models Are So Large

Each parameter in a neural network is a number — typically stored as a 16-bit floating point value (FP16). A 70B parameter model has 70 billion of these numbers: 70B × 2 bytes = 140GB. That is just the weights. During inference, you also need memory for activations, KV cache, and the runtime itself.
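The arithmetic above can be sketched in a few lines. This is a minimal footprint calculator; the byte counts per precision are standard, but the helper function and its name are illustrative, not part of any library.

```python
# Rough weight-memory footprint for a model at different storage
# precisions (weights only; activations, KV cache, and runtime
# overhead come on top of this).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Gigabytes needed to store n_params weights at the given precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp16", "int4"):
    print(dtype, weight_memory_gb(70e9, dtype), "GB")
```

For 70B parameters this reproduces the 140GB figure at FP16, and 35GB at 4 bits, which is why a quantized 70B model can fit alongside the OS in 64GB of RAM.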

The insight behind quantization is that models are massively over-precise. Most weights cluster around small values, and the difference between storing them as 16-bit floats versus 4-bit integers is negligible for the model's actual computations.

How Quantization Works

The Basic Idea

Take a group of weights (typically 32 or 128), find their range, and map that range onto a smaller set of discrete values. A 4-bit integer can represent 16 distinct values (0-15). If your weights range from -1.5 to +1.5, you map that range onto 16 evenly spaced points. Each weight is rounded to the nearest point.
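The group-wise mapping just described can be sketched as a round trip: quantize one group of weights to 4-bit codes, then reconstruct them. This is a simplified uniform (affine) scheme for illustration; real formats store the per-group scale and offset alongside the integer codes.

```python
# Minimal sketch of uniform 4-bit quantization over one group of
# weights: map the observed range onto 16 levels, round each weight
# to the nearest level, then map back.
def quantize_group(weights, bits=4):
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1                 # 16 values -> codes 0..15
    scale = (hi - lo) / levels or 1.0      # guard against a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    return [lo + c * scale for c in codes]

w = [-1.5, -0.3, 0.02, 0.7, 1.5]
codes, scale, zero = quantize_group(w)     # codes: small ints, 0..15
w_hat = dequantize_group(codes, scale, zero)
# Every reconstructed weight lands within half a step (scale / 2)
# of its original value.
```

With the -1.5 to +1.5 range from the text, the step size is 0.2, so no weight moves by more than 0.1 during the round trip.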

GGUF Format

The GGUF format (used by llama.cpp and Ollama) supports multiple quantization levels, from near-lossless to highly aggressive:

- Q8_0: 8 bits per weight, near-lossless, roughly half the size of FP16
- Q6_K and Q5_K_M: minor quality loss, good choices when memory allows
- Q4_K_M: the common default, around 4.5 bits per weight
- Q3_K_M and Q2_K: aggressive compression with increasingly noticeable degradation

Why "K" and "M" Matter

The K-quant variants use a smarter block structure than the older uniform formats: weights are grouped into super-blocks with per-block scales and minimums, so the quantization grid adapts to the local weight distribution instead of using one global spacing. The M (medium) designation means critical layers (attention output, feed-forward gating) are quantized less aggressively than less sensitive layers. This mixed-precision approach preserves quality where it matters most.
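Why per-block scales help can be shown numerically. The sketch below is illustrative (it is not the actual GGUF K-quant code): it compares the worst-case reconstruction error when a tensor with one outlier is quantized with a single global scale versus small independent blocks.

```python
# Illustrative only: smaller blocks mean an outlier weight inflates
# the step size of its own block, not the whole tensor's.
def max_quant_error(weights, bits, block):
    levels = 2 ** bits - 1
    worst = 0.0
    for i in range(0, len(weights), block):
        blk = weights[i:i + block]
        lo, hi = min(blk), max(blk)
        scale = (hi - lo) / levels or 1.0   # guard for constant blocks
        for w in blk:
            w_hat = lo + round((w - lo) / scale) * scale
            worst = max(worst, abs(w - w_hat))
    return worst

# 64 small weights plus one large outlier at the end
ws = [0.01 * (i % 8 - 4) for i in range(64)] + [8.0]
blocked = max_quant_error(ws, bits=4, block=32)
global_ = max_quant_error(ws, bits=4, block=len(ws))
assert blocked < global_   # per-block scaling wins
```

With one global scale, the 8.0 outlier stretches the 16-level grid so far that the small weights all collapse onto a single level; per-block scaling keeps their error tiny.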

The Trade-offs

Quantization is not free. Here is what you sacrifice:

- A small accuracy loss, measurable as a perplexity increase that grows as the bit width shrinks
- Tasks that depend on precise reasoning, such as math and code generation, tend to degrade before casual conversation does
- At very low bit widths (Q2), the degradation becomes plainly visible in the output

What you gain: 70-75% memory reduction, 2-3x faster inference (memory bandwidth is the bottleneck), and the ability to run models that would otherwise require datacenter hardware.

Practical Guide

For most users running models locally with Ollama:

- Start with Q4_K_M; it is the default that most model pages recommend
- If you have RAM to spare, step up to Q5_K_M or Q6_K for slightly better quality
- Drop to Q3_K_M only when the model otherwise does not fit, and treat Q2_K as a last resort
- Budget a few extra gigabytes beyond the weight file for the KV cache and runtime

Common Misconceptions

"Quantized models are significantly worse"

At Q4_K_M, most users cannot distinguish the output from the full-precision version in blind tests. The perplexity increase is typically under 1%.

"You need a GPU for local inference"

CPU inference with quantized models is viable. It is slower than GPU, and throughput is bounded by memory bandwidth: a Q4 70B model (roughly 40GB of weights) yields around 1-2 tokens per second on a typical dual-channel desktop, and 5-10 on high-bandwidth systems such as Apple silicon with unified memory, which is usable for many applications.
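Because decoding streams essentially every weight through the memory system once per generated token, throughput can be estimated from bandwidth alone. The bandwidth figures below are illustrative assumptions, not measurements of any particular machine.

```python
# Back-of-the-envelope decode speed for a memory-bandwidth-bound model:
# tokens/s ~= memory bandwidth / bytes of weights read per token.
def tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

q4_70b_gb = 40  # ~70B params at ~4.5 bits per weight

print(tokens_per_second(80, q4_70b_gb))    # assumed dual-channel desktop DDR5
print(tokens_per_second(400, q4_70b_gb))   # assumed high-bandwidth unified memory
```

The same formula explains the 2-3x speedup from quantization itself: a 4-bit model moves a quarter of the bytes per token that an FP16 model does.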

Going Deeper

Explore AWQ (Activation-aware Weight Quantization) and GPTQ for GPU-optimized quantization. Look into speculative decoding for speed improvements. And watch the emerging field of 1-bit models — binary neural networks that push quantization to its theoretical limit.


Frequently Asked Questions

Which quantization level should I use?

Q4_K_M is the default recommendation. It provides the best balance of quality and memory savings for most use cases. Only go lower if you must due to hardware constraints.

Can I quantize a model myself?

Yes. The llama.cpp repository includes quantization tools. Download the full-precision model, run the quantize binary, and you get a GGUF file at your chosen quality level.

Does quantization affect fine-tuned models differently?

Fine-tuned models can be more sensitive to quantization if the fine-tuning pushed weights into unusual ranges. Test your specific model at different quantization levels to find your quality threshold.