
MoE (Mixture of Experts)

Key idea:

MoE (Mixture of Experts): a sparse transformer architecture. Instead of one monolithic FFN per layer, the model contains many expert networks plus a router that picks the top-k experts for each token. The total parameter count can be huge while the active per-token count stays much smaller (GPT-4, for example, is rumored to have ~1.8T total and ~400B active parameters), so inference cost tracks active parameters rather than total size. Public MoE models: Mixtral 8x7B (47B total, ~13B active), DeepSeek R1 (671B total, ~37B active); GPT-4 is widely suspected to be MoE.

Below: details, example, related terms, FAQ.


Details

  • Router: a small gating network scores all experts and, for each token, selects the top-k (typically k=2 out of 8-128 experts)
  • Expert: usually an FFN block inside a transformer layer
  • Parameters: 10-100× more than a dense model with the same inference cost
  • Pros: huge capacity, manageable inference cost, experts can specialise
  • Cons: training complexity, routing collapse (a few experts get overused while the rest starve), serving overhead (all experts must stay loaded)
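The routing step above can be sketched in a few lines. This is a toy illustration with scalar "experts" and a fixed router, not any production implementation: score experts, keep the top-2, softmax-normalize their scores, and mix the selected experts' outputs.

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router):
    """Output = weighted sum of only the selected experts' outputs."""
    logits = router(x)
    return sum(weight * experts[i](x) for i, weight in top2_route(logits))

# Toy setup: 4 scalar "experts" and a router with fixed logits for illustration.
experts = [lambda x, k=k: k * x for k in (1.0, 2.0, 3.0, 4.0)]
router = lambda x: [0.0, 1.0, 2.0, 0.5]
y = moe_layer(10.0, experts, router)  # only experts 2 and 1 run
```

Note that experts 0 and 3 are never evaluated for this token: that skipped compute is exactly where the inference savings come from.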

Example

# Run Mixtral 8x7B via Ollama (quantized)
$ ollama pull mixtral:8x7b
$ ollama run mixtral:8x7b "What is MoE?"

# Python with transformers (downloads and loads weights for all 8 experts)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
# 47B total params, but only ~13B are 'active' per token at inference


Frequently Asked Questions

Why is MoE trending?

It lets you scale parameter count cheaply, since inference cost tracks active parameters rather than total size. Most recent frontier models are confirmed or believed to use MoE: DeepSeek R1 is openly MoE, Gemini 1.5 was announced as MoE, and GPT-4 is widely suspected to be; Claude's architecture is undisclosed.
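The "cost ~ active params" claim can be made concrete with a rough back-of-envelope calculation. Using the common approximation that a forward pass costs about 2 FLOPs per parameter per token, and Mixtral's approximate public figures:

```python
# Rough per-token compute: a transformer forward pass costs ~2 * N FLOPs,
# where N is the number of parameters actually used for that token.
ACTIVE, TOTAL = 13e9, 47e9  # Mixtral 8x7B, approximate public figures

moe_flops = 2 * ACTIVE    # MoE: only the routed experts run
dense_flops = 2 * TOTAL   # hypothetical dense model of the same total size
ratio = moe_flops / dense_flops
print(f"MoE per-token cost: ~{ratio:.0%} of an equally sized dense model")
```

So Mixtral pays roughly 28% of the per-token compute of a dense 47B model while retaining the full 47B parameters of capacity.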

Fine-tuning MoE?

Harder than for dense models: the discrete router is sensitive to load-balancing, so naive fine-tuning can destabilise routing. A common approach is LoRA on the attention and expert FFN projections while freezing (or separately training) the router, and it typically requires more data than fine-tuning a dense model of comparable active size.

Running MoE locally?

You need memory for the total parameter count, since all experts must be loaded even though only a few fire per token. Mixtral 8x7B: 47B params × 2 bytes (FP16) ≈ 94 GB; INT4 quantization brings the weights down to roughly 24-26 GB.
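The memory figures above follow directly from parameter count × bytes per parameter; a quick sketch (real INT4 files often land slightly above the raw figure because some layers stay in higher precision):

```python
def weight_memory_gb(params, bytes_per_param):
    """Raw weight storage in GB, ignoring activations and KV cache."""
    return params * bytes_per_param / 1e9

PARAMS = 47e9                                  # Mixtral 8x7B, all experts
fp16 = weight_memory_gb(PARAMS, 2)             # FP16: 2 bytes/param -> ~94 GB
int4 = weight_memory_gb(PARAMS, 0.5)           # INT4: 0.5 bytes/param -> ~23.5 GB
print(f"FP16: {fp16:.0f} GB, INT4: {int4:.1f} GB (plus quantization overhead)")
```

Contrast this with compute: memory scales with *total* parameters, compute with *active* parameters. That asymmetry is the main serving drawback of MoE.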