Transformer Architecture

Key idea:

Transformer — neural network architecture introduced by Google researchers in 2017 in the paper "Attention Is All You Need". The basis of all modern LLMs. Key innovation — the self-attention mechanism: each token attends to every other token in the sequence, with learned attention weights determining how much each one contributes. Plus: multi-head attention, positional encoding, layer normalisation, feed-forward network. Variants: decoder-only (GPT), encoder-only (BERT), encoder-decoder (T5).

Below: details, example, related terms, FAQ.

Details

  • Self-attention: softmax(QK^T / √d_k) · V. O(N²) complexity in context length
  • Multi-head: parallel attention heads (8-128), each captures different patterns
  • Positional encoding: RoPE, ALiBi — add position info without absolute indices
  • Layers: 24-120 (GPT-4 is rumoured to have ~100 layers)
  • Long context: Flash Attention, sparse attention — reduce O(N²)
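To illustrate the multi-head bullet above, a minimal sketch using PyTorch's built-in nn.MultiheadAttention (the sizes — batch 2, sequence 16, dim 64, 8 heads — are arbitrary, chosen for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 16, 64)  # batch 2, sequence length 16, model dim 64

# 8 parallel heads, each attending over the full sequence (O(N^2) in length)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

# Self-attention: queries, keys, and values all come from the same input
out, weights = mha(x, x, x)

print(out.shape)      # torch.Size([2, 16, 64]) — same shape as the input
print(weights.shape)  # torch.Size([2, 16, 16]) — one N×N map, averaged over heads
```

Each head gets a 64/8 = 8-dimensional slice of the model dimension, so adding heads does not add parameters — it partitions them.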

Example

# Minimal single-head self-attention in PyTorch
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Learned projections for queries, keys, and values
        self.Q = nn.Linear(dim, dim)
        self.K = nn.Linear(dim, dim)
        self.V = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.Q(x), self.K(x), self.V(x)
        # Scaled dot-product scores: an N×N matrix — every token vs. every token
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
        # Softmax turns each row into a probability distribution over the sequence
        weights = torch.softmax(scores, dim=-1)
        return weights @ v
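A quick numerical sanity check of the scoring step used in SelfAttention.forward (shapes are illustrative): after the softmax, each row of the attention matrix is a probability distribution, so the rows sum to 1 — this is where the O(N²) cost lives.

```python
import torch

# Illustrative shapes: batch 1, sequence length 5, dim 8
q = torch.randn(1, 5, 8)
k = torch.randn(1, 5, 8)

# Same scoring step as in the module above
scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
weights = torch.softmax(scores, dim=-1)

print(weights.shape)    # torch.Size([1, 5, 5]) — the N×N attention matrix
print(weights.sum(-1))  # each of the 5 rows sums to ~1.0
```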

Related Terms


Frequently Asked Questions

Why did transformer become dominant?

Unlike RNNs, transformers process all tokens in parallel; they scale well with parameters and data; and attention captures long-range dependencies. They also work on any sequence data, not just text.

What is Flash Attention?

An optimised implementation of self-attention that tiles the computation to fit in GPU SRAM, making memory usage linear in sequence length rather than quadratic and speeding up training 2-4×. FlashAttention-3 was released in 2024.
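In PyTorch 2.x, F.scaled_dot_product_attention dispatches to a fused FlashAttention-style kernel when the hardware and dtypes allow it, falling back to the standard implementation otherwise. A minimal sketch (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch 1, 8 heads, sequence 128, head dim 64
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Fused attention: picks a FlashAttention-style kernel when available,
# otherwise falls back to the standard O(N^2)-memory implementation
out = F.scaled_dot_product_attention(q, k, v)

# Numerically equivalent to the explicit softmax(QK^T / sqrt(d)) V formulation
manual = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(out, manual, atol=1e-4))
```

The fused path never materialises the full N×N score matrix, which is where the memory saving comes from.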

Alternatives to transformer?

Mamba and other State Space Models (SSMs) — linear complexity in sequence length. Not yet competitive with transformers on general language modelling, but promising for specific tasks.