Quantization — model compression technique replacing FP16/FP32 weights with lower precision (INT8, INT4, INT2). 70B LLM: FP16 = 140 GB of weights → INT4 = 35 GB (fits on a single 80 GB H100). Accuracy loss is minimal for INT4 (typically 1-3% perplexity increase). Popular formats: GGUF (llama.cpp), GPTQ, AWQ, bitsandbytes. Enables inference on consumer GPUs (3090, 4090).
Below: details, example, related terms, FAQ.
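The memory arithmetic above is just parameter count × bits per weight. A minimal sketch (weights only; the KV cache and activations add more, and 1 GB is taken as 1e9 bytes):

```python
# Rough weight-memory footprint of a 70B-parameter model at different precisions.
PARAMS = 70e9

def weight_gb(bits_per_weight: float) -> float:
    """Bytes of weight storage in GB: params * bits / 8 bits-per-byte / 1e9."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {weight_gb(bits):.0f} GB")
# FP16 → 140 GB, INT4 → 35 GB: fits on one 80 GB H100
```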
# Ollama — run Llama 3 70B INT4 quantized
$ ollama pull llama3:70b # ~40 GB INT4 GGUF
$ ollama run llama3:70b "Explain TCP"
# Python with transformers + bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4')
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-70B', quantization_config=config)
Accuracy by precision: INT8 — <1% perplexity degradation. INT4 — 1-3% (acceptable). INT2 — 5-10% (noticeable).
llama.cpp packs weights into 4-bit blocks (typically 32 weights sharing one scale factor), dequantized on the fly inside the compute kernel. Overhead is minimal because LLM decoding is usually memory-bandwidth-bound: smaller weights mean less data to move.
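A toy sketch of that block-wise scheme, in the spirit of llama.cpp's Q4_0 (the real format packs two 4-bit codes per byte and stores an FP16 scale; here codes stay in int8 for clarity, and the block size of 32 matches Q4_0):

```python
import numpy as np

BLOCK = 32  # weights per block, each block shares one scale factor

def quantize_q4(w: np.ndarray):
    """Quantize to 4-bit codes in [-8, 7] with one scale per block of 32."""
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map absmax to code 7
    scale[scale == 0] = 1.0                             # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """On-the-fly reconstruction: w ≈ scale * q, done per block in the kernel."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
q, scale = quantize_q4(w)
w_hat = dequantize_q4(q, scale).reshape(-1)
err = np.abs(w - w_hat).max()  # rounding error is at most scale/2 per block
```

Storage drops from 32 bits to ~4.5 bits per weight (4-bit code plus the amortized per-block scale), which is where the 4x memory saving comes from.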
Fine-tuning a quantized model? Yes — QLoRA. Training updates small FP16/BF16 LoRA adapters while the frozen base model stays INT4 (NF4). Single-GPU setup, the cheapest way to fine-tune large models.