Transformer — neural network architecture introduced by Google researchers in 2017 ("Attention Is All You Need", Vaswani et al.). The basis of all modern LLMs. Key innovation — the self-attention mechanism: each token computes attention weights over every other token in the sequence and aggregates their values accordingly. Plus: multi-head attention, positional encoding, layer normalisation, a feed-forward network in each block. Variants: decoder-only (GPT) vs encoder-only (BERT) vs encoder-decoder (T5).
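One of the components listed above, sinusoidal positional encoding (the variant from the original paper), can be sketched in a few lines; `sinusoidal_pe` is an illustrative helper name, not a library function:

```python
import torch

def sinusoidal_pe(seq_len: int, dim: int) -> torch.Tensor:
    # PE[pos, 2i]   = sin(pos / 10000**(2i/dim))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i/dim))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)               # even indices
    angles = pos / (10000 ** (i / dim))                            # (seq_len, dim/2)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = torch.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_pe(10, 16)
print(pe.shape)  # torch.Size([10, 16])
```

Each position gets a unique pattern of sines and cosines, added to the token embeddings so attention can tell positions apart.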
Below: details, example, related terms, FAQ.
```python
# PyTorch: a minimal single-head self-attention layer
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Learned projections for queries, keys and values
        self.Q = nn.Linear(dim, dim)
        self.K = nn.Linear(dim, dim)
        self.V = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        q, k, v = self.Q(x), self.K(x), self.V(x)
        # Scaled dot-product: every token scores every other token
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
        weights = torch.softmax(scores, dim=-1)  # rows sum to 1
        return weights @ v
```
Strengths: parallel compute across the whole sequence (unlike RNNs), scales well with params + data, attention captures long-range dependencies. Works on any sequence data, not just text.
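Shapes make the mechanism concrete. A functional walk-through of the same computation (the batch size, sequence length and `dim` below are arbitrary):

```python
import torch

torch.manual_seed(0)
dim = 16
x = torch.randn(2, 5, dim)                        # (batch, seq_len, dim)
Wq, Wk, Wv = (torch.nn.Linear(dim, dim) for _ in range(3))
q, k, v = Wq(x), Wk(x), Wv(x)
scores = q @ k.transpose(-2, -1) / (dim ** 0.5)   # (2, 5, 5): token-to-token scores
weights = torch.softmax(scores, dim=-1)           # each row is a distribution over tokens
out = weights @ v                                 # same shape as the input
print(out.shape)  # torch.Size([2, 5, 16])
```

Note the quadratic piece: `scores` is (seq_len × seq_len), which is exactly what Flash Attention and SSMs (below) attack.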
Flash Attention — optimised implementation of self-attention. Tiles the computation to use GPU SRAM efficiently; memory linear in sequence length (not quadratic). 2-4× faster training. v3 released in 2024.
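You rarely call Flash Attention directly: PyTorch 2.x exposes fused kernels through `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a FlashAttention backend on supported GPUs (on CPU it falls back to a plain implementation, so this runs anywhere):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) — the layout SDPA expects
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Backend (flash / memory-efficient / math) is chosen per hardware and dtype
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```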
Mamba / State Space Models (SSM) — linear complexity in sequence length. Not yet competitive with transformers on general language modelling, but promising for long sequences and specific tasks.
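The core idea can be sketched as a plain linear recurrence — an illustrative toy, not Mamba itself (Mamba makes the matrices input-dependent, "selective", and computes the recurrence with a parallel scan):

```python
import torch

# h_t = A h_{t-1} + B x_t ;  y_t = C h_t
d_state, d_in, T = 4, 1, 6
torch.manual_seed(0)
A = torch.eye(d_state) * 0.9        # state decays geometrically
B = torch.randn(d_state, d_in)
C = torch.randn(1, d_state)
x = torch.randn(T, d_in)

h = torch.zeros(d_state, 1)
ys = []
for t in range(T):                  # O(T): linear in sequence length,
    h = A @ h + B @ x[t].unsqueeze(1)   # constant state instead of an
    ys.append((C @ h).item())           # all-pairs attention matrix
print(len(ys))  # 6
```

Compare with self-attention above: no (seq_len × seq_len) score matrix ever materialises.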