MoE (Mixture of Experts) — sparse transformer architecture: instead of a single monolithic FFN per layer, an MoE layer contains many expert networks plus a router that picks the top-k experts for each token. Total parameter count can be huge while the params active per token stay much smaller, so inference cost scales with active size, not total size. Public MoE models: Mixtral 8x7B (47B total, 13B active), DeepSeek R1 (671B total, ~37B active); GPT-4 is widely suspected to be MoE (rumored ~1.8T total, ~400B active).
Below: details, example, related terms, FAQ.
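The router mechanism above can be shown in a toy sketch — a single-token, top-2 router in plain numpy, with hypothetical names (`moe_layer`, `gate_W`); real implementations batch tokens and add load-balancing losses, which this omits:

```python
import numpy as np

def moe_layer(x, gate_W, experts, k=2):
    """Toy top-k MoE layer for one token vector x.

    gate_W: (d, n_experts) router weights; experts: list of callables.
    Only the k selected experts actually run -- that is the sparsity.
    """
    logits = x @ gate_W                     # one router score per expert
    topk = np.argsort(logits)[-k:]          # indices of the k best experts
    w = np.exp(logits[topk] - logits[topk].max())
    w /= w.sum()                            # softmax over the selected k only
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))

# Tiny demo: 4 "experts", each a different linear map
rng = np.random.default_rng(0)
d, n = 8, 4
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W for _ in range(n)]
gate_W = rng.standard_normal((d, n))
y = moe_layer(rng.standard_normal(d), gate_W, experts)
print(y.shape)  # (8,)
```

With n = 8 experts and k = 2, this is exactly the "8x7B, 13B active" pattern: 6 of the 8 expert FFNs never execute for a given token.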
# Run Mixtral 8x7B via Ollama (quantized)
$ ollama pull mixtral:8x7b
$ ollama run mixtral:8x7b "What is MoE?"

# Python with transformers (47B total params, only 13B active per token)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype="auto", device_map="auto",  # ~94 GB in FP16; quantize for less
)

MoE lets you scale parameter count cheaply: inference cost tracks active params, not total. Most frontier models since 2024 are MoE or widely believed to be (Gemini 1.5 and DeepSeek R1 confirmed; GPT-4 and Claude suspected, architectures not public).
Fine-tuning is harder than for dense models: the router adds instability (expert load balancing can collapse). LoRA can be applied to the router and the experts separately, and MoE fine-tuning typically needs more data than an equivalent dense model.
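The LoRA mechanics themselves are the same as for a dense layer — a frozen weight plus a trainable low-rank update W + (α/r)·BA; on MoE you simply attach separate adapters to the router weight and to each expert's projections. A minimal numpy sketch (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 16, 4, 8                  # hidden dim, LoRA rank, scaling

W = rng.standard_normal((d, d))         # frozen base weight (router or expert)
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # frozen base path + scaled low-rank update; only A and B are trained
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal(d)
# With B zero-initialized, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), x @ W.T)
```

In practice with peft this means listing both the router and the expert projections in `target_modules` (in HF's Mixtral implementation those modules are named `gate` and `w1`/`w2`/`w3`, respectively — verify against your model's module names).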
Memory must cover the total params, not the active ones: all experts stay loaded even though only a few run per token. Mixtral 8x7B: 47B × 2 bytes (FP16) = 94 GB; INT4 quantization → ~26 GB.
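The memory figures follow directly from weight count × bytes per weight; a trivial helper (hypothetical name) to redo the arithmetic for other models:

```python
def moe_memory_gb(total_params_b, bytes_per_param):
    """Weights-only memory: params (in billions) x bytes per param -> GB."""
    return total_params_b * bytes_per_param

# Mixtral 8x7B: all 47B params must be resident, not just the 13B active
print(moe_memory_gb(47, 2))    # FP16: 94.0 GB
print(moe_memory_gb(47, 0.5))  # INT4: 23.5 GB weights (+ overhead -> ~26 GB)
```

Note this counts weights only; KV cache and activation memory come on top.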