Quantization — model compression by replacing FP16/FP32 weights with lower-precision formats (INT8, INT4, INT2). A 70B LLM: FP16 = 140 GB of weights → INT4 = 35 GB (fits on a single H100 80GB). Accuracy loss is minimal for INT4 (roughly 1-3% perplexity increase). Popular formats: GGUF (llama.cpp), GPTQ, AWQ, bitsandbytes. Enables inference on consumer GPUs (RTX 3090, 4090).
Below: details, an example, related terms, and an FAQ.
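The memory arithmetic in the definition is easy to verify (decimal GB, weights only; runtime overhead such as activations and the KV cache is extra):

```python
# Memory needed to hold just the weights of an n-parameter model
# at a given precision (activations, KV cache, overhead excluded).
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

n = 70e9                            # 70B parameters
print(weight_memory_gb(n, 16))      # FP16 → 140.0 GB
print(weight_memory_gb(n, 8))       # INT8 →  70.0 GB
print(weight_memory_gb(n, 4))       # INT4 →  35.0 GB
```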
```shell
# Ollama — run Llama 3 70B INT4 quantized
$ ollama pull llama3:70b                # ~40 GB INT4 GGUF
$ ollama run llama3:70b "Explain TCP"
```
```python
# Python with transformers + bitsandbytes — load the model in 4-bit NF4
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4')
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-3-70B', quantization_config=config)
```

Typical perplexity degradation by precision: INT8 — <1%. INT4 — 1-3% (acceptable). INT2 — 5-10% (noticeable).
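Why lower bit widths cost more accuracy can be illustrated with a toy symmetric round-to-nearest quantizer (a deliberate simplification: real formats like NF4 or GPTQ use blockwise scales and smarter rounding):

```python
import numpy as np

def quantize_roundtrip(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantization: scale to the signed integer
    range for `bits`, round to nearest, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4, 1 for INT2
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
for bits in (8, 4, 2):
    err = np.abs(w - quantize_roundtrip(w, bits)).mean()
    print(f"INT{bits}: mean abs error {err:.4f}")
```

The mean reconstruction error grows as the bit width shrinks, mirroring the perplexity pattern above.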
llama.cpp packs weights into 4-bit blocks, each sharing a scale factor, and dequantizes them on the fly inside the compute kernel — so the speed penalty is minimal when inference is compute-bound.
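A simplified sketch of blockwise 4-bit quantization (real llama.cpp formats such as Q4_0 also pack two 4-bit values per byte and store FP16 scales; the block size of 32 matches Q4_0, the rest is illustrative):

```python
import numpy as np

BLOCK = 32  # weights per block, each block shares one scale

def quantize_blocks(w: np.ndarray, bits: int = 4):
    """Blockwise quantization: one FP scale per block of 32 weights,
    so an outlier in one block cannot hurt the rest of the tensor."""
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_blocks(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # What the inference kernel does on the fly just before the matmul.
    return (q * scales).reshape(-1)

w = np.random.default_rng(1).normal(size=1024).astype(np.float32)
q, s = quantize_blocks(w)
w_hat = dequantize_blocks(q, s)
```

Per-block scales bound the rounding error of every weight to half its block's scale, which is why blockwise formats tolerate outliers much better than a single per-tensor scale.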
QLoRA — yes. Training fine-tunes FP16 LoRA adapters while the base model stays frozen in INT4: a one-stop setup and the cheapest form of fine-tuning.
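A minimal numpy sketch of the QLoRA idea, with toy dimensions and plain round-to-nearest in place of NF4 (all names and sizes here are illustrative, not the real API): the base weight is quantized and frozen, and only the small low-rank matrices A and B would receive gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8   # toy hidden size and LoRA rank (illustrative only)

# Frozen base weight, quantized to 4-bit with simple round-to-nearest
# (real QLoRA uses NF4; this just makes the structure concrete).
W = rng.normal(size=(d, d)).astype(np.float32)
scale = np.abs(W).max() / 7                     # 4-bit signed range: [-7, 7]
Wq = np.clip(np.round(W / scale), -7, 7) * scale

# Trainable FP adapters: B starts at zero, so at initialization the
# adapted layer behaves exactly like the quantized base model.
A = rng.normal(scale=0.01, size=(r, d)).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)

def forward(x: np.ndarray) -> np.ndarray:
    # y = x @ (Wq + B @ A).T — frozen 4-bit base plus low-rank FP update
    return x @ (Wq + B @ A).T

x = rng.normal(size=(2, d)).astype(np.float32)
y = forward(x)
```

Because only A and B (2·d·r values) are trained instead of the full d·d weight, optimizer state shrinks by orders of magnitude, which is where the cost savings come from.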