Quantization
Learn what Quantization means in AI and machine learning, with examples and related concepts.
Definition
Quantization is the process of reducing the numerical precision of a model’s weights — typically from 16-bit floating point (FP16) to 8-bit (INT8) or 4-bit (INT4) integers. This makes models smaller, faster, and cheaper to run, with minimal loss in quality.
Think of it like reducing the resolution of an image. A 4K photo (full precision) has incredible detail, but a 1080p version (quantized) looks nearly identical to most viewers and takes up a quarter of the space. Similarly, a 4-bit quantized Llama 3 70B model runs on a single high-end GPU instead of requiring a $10K+ multi-GPU server — while producing nearly the same quality output.
Quantization is what makes local AI possible. Without it, running a 70B parameter model would require 140GB of GPU memory (at FP16). With 4-bit quantization, it fits in 35GB — within reach of a single high-end GPU.
How It Works
Full Precision (FP16) — 16 bits per weight:
- Weight value: 0.123456789...
- Storage: 2 bytes per parameter
- 70B model = 140 GB

INT8 Quantization — 8 bits per weight:
- Weight value: 0.12 (rounded)
- Storage: 1 byte per parameter
- 70B model = 70 GB (2x smaller)

INT4 Quantization — 4 bits per weight:
- Weight value: 0.1 (more rounding)
- Storage: 0.5 bytes per parameter
- 70B model = 35 GB (4x smaller)
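The arithmetic behind these numbers is just parameter count times bits per weight. A quick sketch (the `model_memory_gb` helper is illustrative; it counts weight storage only and ignores activations, KV cache, and runtime overhead):

```python
def model_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight-storage footprint in GB (weights only)."""
    return num_params * bits_per_weight / 8 / 1e9

# Reproduces the figures above for a 70B-parameter model
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {model_memory_gb(70e9, bits):.0f} GB")
# 140 GB, 70 GB, 35 GB
```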
Quantization Methods
| Method | Bits | Quality Loss | Speed | Popular Format |
|---|---|---|---|---|
| FP16 | 16 | None (baseline) | Baseline | Native PyTorch |
| INT8 | 8 | Minimal | ~1.5x faster | bitsandbytes |
| GPTQ | 4 | Small | ~2-3x faster | .safetensors |
| AWQ | 4 | Small | ~2-3x faster | .safetensors |
| GGUF | 2-8 | Varies | CPU-friendly | .gguf (llama.cpp) |
GGUF — The Local AI Standard
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and Ollama. It’s optimized for CPU inference and supports mixed quantization:
Quantization levels in GGUF:
- Q2_K — 2-bit (smallest, noticeable quality loss)
- Q4_K_M — 4-bit (best balance of size/quality) ← Most popular
- Q5_K_M — 5-bit (slightly better quality)
- Q6_K — 6-bit (near FP16 quality)
- Q8_0 — 8-bit (minimal quality loss)
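The quality differences between these levels come down to rounding error. Here is a minimal NumPy sketch of symmetric uniform quantization (a deliberate simplification of GGUF's block-wise K-quants, shown only to illustrate how error grows as bits shrink):

```python
import numpy as np

def quantize_roundtrip(weights: np.ndarray, bits: int) -> np.ndarray:
    """Snap weights to a signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(weights).max() / qmax          # one scale for the whole tensor
    q = np.round(weights / scale).clip(-qmax, qmax)
    return q * scale                              # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=10_000)              # toy weight tensor
for bits in (8, 4, 2):
    err = np.abs(w - quantize_roundtrip(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

Real schemes like Q4_K_M keep error low at 4 bits by using a separate scale per small block of weights instead of one scale per tensor.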
Why It Matters
- Local AI — Run LLMs on laptops and consumer GPUs instead of cloud data centers
- Cost reduction — Smaller models = less GPU memory = cheaper inference
- Speed — Lower precision = faster matrix math = faster response times
- Privacy — Run models locally without sending data to external APIs
- Democratization — Makes powerful AI accessible to individuals and small teams, not just companies with $100K GPU budgets
Example
```bash
# Run a quantized model with Ollama (simplest approach)
# Ollama automatically uses quantized GGUF models
ollama pull llama3:8b          # 4-bit quantized, ~4.7GB
ollama pull llama3:70b         # 4-bit quantized, ~40GB
ollama pull mistral:7b-q5_K_M  # Specific quantization level

# Compare sizes:
# llama3:8b  FP16 = 16GB  → Q4 = 4.7GB (3.4x smaller)
# llama3:70b FP16 = 140GB → Q4 = 40GB  (3.5x smaller)
```
```python
# Quantize a model using bitsandbytes (INT8/INT4)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a model in 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",  # NormalFloat4 — best for LLMs
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=quantization_config,
    device_map="auto",
)

# Original: ~16GB in FP16
# Quantized: ~5GB in 4-bit
# Quality: ~98% of original on most benchmarks
```
```python
# Compare quantization levels programmatically
import os

models = {
    "Q4_K_M": "llama-3-8b-Q4_K_M.gguf",  # 4.7 GB
    "Q5_K_M": "llama-3-8b-Q5_K_M.gguf",  # 5.5 GB
    "Q8_0":   "llama-3-8b-Q8_0.gguf",    # 8.5 GB
}

for quant, filename in models.items():
    size_gb = os.path.getsize(filename) / (1024**3)
    print(f"{quant}: {size_gb:.1f} GB")

# In practice:
# Q4_K_M — Best for most users (good quality, fits consumer GPUs)
# Q5_K_M — Slightly better quality, worth it if you have the RAM
# Q8_0   — Near-perfect quality, use if you have 16GB+ VRAM
```
Quantization Quality Impact
Real-world benchmark comparison (Llama 3 8B):
| Quantization | Size | MMLU Score | Perplexity | Quality vs FP16 |
|---|---|---|---|---|
| FP16 | 16.0 GB | 66.6 | 6.14 | 100% (baseline) |
| Q8_0 | 8.5 GB | 66.4 | 6.16 | ~99.7% |
| Q5_K_M | 5.5 GB | 65.8 | 6.25 | ~98.8% |
| Q4_K_M | 4.7 GB | 64.9 | 6.38 | ~97.4% |
| Q2_K | 3.2 GB | 58.1 | 7.89 | ~87.2% |
The sweet spot is Q4_K_M — 4x compression with less than 3% quality loss.
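The trade-off can be made explicit with a few lines of arithmetic (the numbers are copied from the table rows above):

```python
fp16_size, fp16_mmlu = 16.0, 66.6  # FP16 baseline: GB, MMLU score

levels = {
    "Q8_0":   (8.5, 66.4),
    "Q5_K_M": (5.5, 65.8),
    "Q4_K_M": (4.7, 64.9),
    "Q2_K":   (3.2, 58.1),
}

for name, (size_gb, mmlu) in levels.items():
    compression = fp16_size / size_gb
    retained = 100 * mmlu / fp16_mmlu
    print(f"{name}: {compression:.1f}x smaller, {retained:.1f}% of FP16 MMLU")
# Q4_K_M comes out at 3.4x smaller while retaining 97.4% of the baseline score
```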
Key Takeaways
- Quantization reduces model precision (16-bit → 4-bit) for 3-4x smaller size and faster inference
- The quality loss at 4-bit is surprisingly small — under 3% on most benchmarks
- GGUF + llama.cpp/Ollama is the standard for running quantized models locally
- Q4_K_M is the most popular quantization level — best balance of quality and size
- Quantization is what makes local AI practical on consumer hardware
Part of the DeepRaft Glossary — AI and ML terms explained for developers.