
MoE (Mixture of Experts)

Key idea:

MoE (Mixture of Experts): a sparse transformer architecture. Instead of one monolithic FFN per layer, the model contains many expert networks plus a router that picks the top-k experts for each token. The total parameter count can be huge while the active per-token count stays much smaller (GPT-4, for example, is rumored to have ~1.8T total and ~400B active parameters), so inference cost tracks active parameters rather than total size. Public MoE models: Mixtral 8x7B (47B total, ~13B active), DeepSeek R1 (671B total, ~37B active); GPT-4 is widely suspected to be MoE.

Below: details, example, related terms, FAQ.


Details

  • Router: a small gating network scores all experts and, for each token, selects the top-k (typically k=2 out of 8-128 experts)
  • Expert: usually an FFN block inside a transformer layer
  • Parameters: 10-100× more than a dense model with the same inference cost
  • Pros: huge capacity, manageable inference cost, experts can specialise
  • Cons: training complexity, routing collapse (a few experts get overused while the rest starve), serving overhead (all experts must stay loaded)
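The routing step above can be sketched in a few lines. This is a toy illustration with scalar "experts" and a fixed router, not any production implementation: score experts, keep the top-2, softmax-normalize their scores, and mix the selected experts' outputs.

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router):
    """Output = weighted sum of only the selected experts' outputs."""
    logits = router(x)
    return sum(weight * experts[i](x) for i, weight in top2_route(logits))

# Toy setup: 4 scalar "experts" and a router with fixed logits for illustration.
experts = [lambda x, k=k: k * x for k in (1.0, 2.0, 3.0, 4.0)]
router = lambda x: [0.0, 1.0, 2.0, 0.5]
y = moe_layer(10.0, experts, router)  # only experts 2 and 1 run
```

Note that experts 0 and 3 are never evaluated for this token: that skipped compute is exactly where the inference savings come from.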

Example

# Run Mixtral 8x7B via Ollama (quantized)
$ ollama pull mixtral:8x7b
$ ollama run mixtral:8x7b "What is MoE?"

# Python with transformers (downloads and loads weights for all 8 experts)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
# 47B total params, but only ~13B are 'active' per token at inference


Frequently Asked Questions

Why is MoE trending?

It lets you scale parameter count cheaply, since inference cost tracks active parameters rather than total size. Most recent frontier models are confirmed or believed to use MoE: DeepSeek R1 is openly MoE, Gemini 1.5 was announced as MoE, and GPT-4 is widely suspected to be; Claude's architecture is undisclosed.
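The "cost ~ active params" claim can be made concrete with a rough back-of-envelope calculation. Using the common approximation that a forward pass costs about 2 FLOPs per parameter per token, and Mixtral's approximate public figures:

```python
# Rough per-token compute: a transformer forward pass costs ~2 * N FLOPs,
# where N is the number of parameters actually used for that token.
ACTIVE, TOTAL = 13e9, 47e9  # Mixtral 8x7B, approximate public figures

moe_flops = 2 * ACTIVE    # MoE: only the routed experts run
dense_flops = 2 * TOTAL   # hypothetical dense model of the same total size
ratio = moe_flops / dense_flops
print(f"MoE per-token cost: ~{ratio:.0%} of an equally sized dense model")
```

So Mixtral pays roughly 28% of the per-token compute of a dense 47B model while retaining the full 47B parameters of capacity.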

Fine-tuning MoE?

Harder than for dense models: the discrete router is sensitive to load-balancing, so naive fine-tuning can destabilise routing. A common approach is LoRA on the attention and expert FFN projections while freezing (or separately training) the router, and it typically requires more data than fine-tuning a dense model of comparable active size.

Running MoE locally?

You need memory for the total parameter count, since all experts must be loaded even though only a few fire per token. Mixtral 8x7B: 47B params × 2 bytes (FP16) ≈ 94 GB; INT4 quantization brings the weights down to roughly 24-26 GB.
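The memory figures above follow directly from parameter count × bytes per parameter; a quick sketch (real INT4 files often land slightly above the raw figure because some layers stay in higher precision):

```python
def weight_memory_gb(params, bytes_per_param):
    """Raw weight storage in GB, ignoring activations and KV cache."""
    return params * bytes_per_param / 1e9

PARAMS = 47e9                                  # Mixtral 8x7B, all experts
fp16 = weight_memory_gb(PARAMS, 2)             # FP16: 2 bytes/param -> ~94 GB
int4 = weight_memory_gb(PARAMS, 0.5)           # INT4: 0.5 bytes/param -> ~23.5 GB
print(f"FP16: {fp16:.0f} GB, INT4: {int4:.1f} GB (plus quantization overhead)")
```

Contrast this with compute: memory scales with *total* parameters, compute with *active* parameters. That asymmetry is the main serving drawback of MoE.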