
LLM Quantization

Key idea:

Quantization — a model-compression technique that replaces FP16/FP32 weights with lower-precision integers (INT8, INT4, INT2). A 70B-parameter LLM needs ~140 GB for weights in FP16, but only ~35 GB in INT4 — small enough to fit on a single 80 GB H100. Accuracy loss at INT4 is typically minimal (roughly a 1-3% perplexity increase). Popular formats: GGUF (llama.cpp), GPTQ, AWQ, bitsandbytes. Quantization enables local inference on consumer GPUs such as the RTX 3090/4090.
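The memory figures above follow directly from parameter count × bits per weight. A minimal sketch of the arithmetic (weights only; KV cache and activations add overhead on top):

```python
# Weight memory = parameters * bits-per-weight / 8 bytes, in GB (10^9 bytes).
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70e9, 16))  # FP16: 140.0 GB
print(weight_memory_gb(70e9, 4))   # INT4: 35.0 GB
```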

Below: details, examples, and FAQ.


Details

  • Precision levels: FP16 (baseline) → INT8 (2x compression) → INT4 (4x) → INT2 (8x, experimental)
  • GGUF: the llama.cpp file format; runs on CPU and GPU
  • GPTQ: post-training quantization using a calibration dataset; strong compression-quality tradeoff
  • AWQ (Activation-aware Weight Quantization): protects the most activation-sensitive weights; often the best accuracy at INT4
  • Tools: llama.cpp, vLLM, TGI (Text Generation Inference), transformers with bitsandbytes
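The precision levels above all rest on the same basic operation: mapping floats to a small integer range with a scale factor. A minimal sketch of symmetric per-tensor INT8 quantization (real tools use finer per-channel or per-block scales):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Map the largest magnitude to 127, round everything to the nearest step.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by half a quantization step
```

The rounding error per weight is at most half a step (scale / 2), which is why INT8 is nearly lossless in practice.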

Example

# Ollama — run Llama 3 70B INT4 quantized
$ ollama pull llama3:70b  # ~40 GB INT4 GGUF
$ ollama run llama3:70b "Explain TCP"

# Python with transformers + bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4',
                            bnb_4bit_compute_dtype=torch.bfloat16)  # NF4 weights, bf16 matmuls
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-70B',
                                             quantization_config=config, device_map='auto')


Frequently Asked Questions

Accuracy loss?

INT8 — under 1% perplexity increase. INT4 — 1-3% increase (usually acceptable). INT2 — 5-10% increase (noticeable degradation).

How does INT4 GGUF work?

llama.cpp packs weights into blocks (e.g. 32 weights per block for Q4_0), storing each weight in 4 bits plus one scale factor per block. Weights are dequantized on the fly inside the compute kernel. Because token generation is usually memory-bandwidth-bound, the smaller weights often make inference faster, not slower.
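A hedged sketch of block-wise 4-bit quantization in the spirit of GGUF's Q4_0: weights grouped into blocks of 32, one scale per block. The actual GGUF bit layout (nibble packing, FP16 scales) differs; this shows only the idea:

```python
import numpy as np

BLOCK = 32  # weights per block, as in Q4_0

def quantize_q4(w: np.ndarray):
    w = w.reshape(-1, BLOCK)
    # 4-bit signed range is [-8, 7]; one scale per block of 32 weights.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_q4(w)
err = np.abs(dequantize_q4(q, s) - w).max()
```

Per-block scales keep the error local: one outlier weight only degrades its own block of 32, not the whole tensor.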

Fine-tune quantised model?

Yes, via QLoRA: training updates small FP16 LoRA adapters while the base model stays frozen in 4-bit. This is currently one of the cheapest ways to fine-tune a large model on a single GPU.
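A conceptual sketch of the QLoRA forward pass: the base weight stays quantized (4-bit values stored in an int8 array here, for simplicity) and frozen; only the small low-rank factors A and B are trained. Names `r` and `alpha` follow the usual LoRA convention; this is an illustration, not the bitsandbytes kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8

# Frozen base weight, quantized to the 4-bit range [-8, 7] with one scale.
w = rng.standard_normal((d, d)).astype(np.float32)
scale = np.abs(w).max() / 7.0
w_q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)

# Trainable FP32 LoRA factors; B starts at zero so the adapter is a no-op.
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01
B = np.zeros((d, r), dtype=np.float32)

def forward(x: np.ndarray) -> np.ndarray:
    base = (w_q.astype(np.float32) * scale) @ x   # dequantize on the fly
    return base + (alpha / r) * (B @ (A @ x))     # low-rank correction

x = rng.standard_normal(d).astype(np.float32)
y = forward(x)
```

Gradients flow only into A and B (d*r parameters each), so optimizer state stays tiny even for a 70B base model.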