Quantization — model compression technique replacing FP16/FP32 weights with lower precision (INT8, INT4, INT2). 70B LLM: FP16 = 140 GB of weights → INT4 = 35 GB (fits on a single 80 GB H100). Accuracy loss is minimal for INT4 (typically 1-3% perplexity increase). Popular formats: GGUF (llama.cpp), GPTQ, AWQ, bitsandbytes. Enables inference on consumer GPUs (3090, 4090).
Below: details, example, related terms, FAQ.
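The memory arithmetic above is just parameter count × bits per weight. A minimal sketch (weights only; the KV cache and activations add more, and 1 GB is taken as 1e9 bytes):

```python
# Rough weight-memory footprint of a 70B-parameter model at different precisions.
PARAMS = 70e9

def weight_gb(bits_per_weight: float) -> float:
    """Bytes of weight storage in GB: params * bits / 8 bits-per-byte / 1e9."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {weight_gb(bits):.0f} GB")
# FP16 → 140 GB, INT4 → 35 GB: fits on one 80 GB H100
```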
# Ollama — run Llama 3 70B INT4 quantized
$ ollama pull llama3:70b # ~40 GB INT4 GGUF
$ ollama run llama3:70b "Explain TCP"
# Python with transformers + bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4')
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-70B', quantization_config=config)
Accuracy by precision: INT8 — <1% perplexity degradation. INT4 — 1-3% (acceptable). INT2 — 5-10% (noticeable).
llama.cpp packs weights into 4-bit blocks (typically 32 weights sharing one scale factor), dequantized on the fly inside the compute kernel. Overhead is minimal because LLM decoding is usually memory-bandwidth-bound: smaller weights mean less data to move.
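A toy sketch of that block-wise scheme, in the spirit of llama.cpp's Q4_0 (the real format packs two 4-bit codes per byte and stores an FP16 scale; here codes stay in int8 for clarity, and the block size of 32 matches Q4_0):

```python
import numpy as np

BLOCK = 32  # weights per block, each block shares one scale factor

def quantize_q4(w: np.ndarray):
    """Quantize to 4-bit codes in [-8, 7] with one scale per block of 32."""
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map absmax to code 7
    scale[scale == 0] = 1.0                             # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """On-the-fly reconstruction: w ≈ scale * q, done per block in the kernel."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
q, scale = quantize_q4(w)
w_hat = dequantize_q4(q, scale).reshape(-1)
err = np.abs(w - w_hat).max()  # rounding error is at most scale/2 per block
```

Storage drops from 32 bits to ~4.5 bits per weight (4-bit code plus the amortized per-block scale), which is where the 4x memory saving comes from.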
Fine-tuning a quantized model? Yes — QLoRA. Training updates small FP16/BF16 LoRA adapters while the frozen base model stays INT4 (NF4). Single-GPU setup, the cheapest way to fine-tune large models.