Transformer — a neural network architecture introduced by Google in 2017 ("Attention Is All You Need"). The foundation of all modern LLMs. Key innovation — the self-attention mechanism: each token "looks at" every other token in the sequence and computes attention weights. Also includes: multi-head attention, positional encoding, layer normalization, and a feed-forward network. Variants: decoder-only (GPT) vs encoder-only (BERT) vs encoder-decoder (T5).
Below: details, an example, related terms, FAQ.
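Positional encoding deserves a quick illustration: since attention itself is order-agnostic, each position gets a fixed sinusoidal vector added to its embedding. A minimal NumPy sketch of the sinusoidal scheme from the original paper (function name and arguments are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, dim):
    """Sinusoidal positional encoding from "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(dim // 2)[None, :]         # (1, dim/2) frequency indices
    angles = pos / (10000 ** (2 * i / dim))  # (seq_len, dim/2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = positional_encoding(10, 16)
```

Each dimension oscillates at a different frequency, so every position gets a unique, bounded pattern the model can learn to use.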
```python
# Simple self-attention in PyTorch
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Learned projections for queries, keys, and values
        self.Q = nn.Linear(dim, dim)
        self.K = nn.Linear(dim, dim)
        self.V = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.Q(x), self.K(x), self.V(x)
        # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        return weights @ v
```

Advantages: parallel computation (unlike RNNs); scales well with parameters and data; attention captures long-range dependencies. Works on any sequence data.
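The same computation can be written in plain NumPy, which also makes it easy to show the causal mask used by decoder-only models (GPT): a sketch, with hypothetical function names, where token i is prevented from attending to tokens after it:

```python
import numpy as np

def attention(q, k, v, causal=False):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = k.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Mask future positions: token i attends only to tokens <= i
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # 4 tokens, dim 8
out, w = attention(x, x, x, causal=True)
```

With `causal=True` the weight matrix is lower-triangular; encoder-only models (BERT) skip the mask and attend in both directions.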
FlashAttention — an optimized implementation of self-attention. Uses GPU SRAM efficiently; memory footprint is linear (not quadratic) in sequence length. 2-4× faster training. Current version: v3 (2025).
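The core trick behind such kernels is an online (streaming) softmax: keys and values are processed block by block with a running max and normalizer, so the full score matrix is never materialized. A toy NumPy sketch of the idea for a single query row (this illustrates the math only, not the actual GPU kernel):

```python
import numpy as np

def streaming_attention_row(q, K, V, block=2):
    """One query row; key/value blocks processed one at a time (toy sketch)."""
    d = K.shape[-1]
    m, l = -np.inf, 0.0                      # running max and softmax normalizer
    acc = np.zeros(V.shape[-1])              # running weighted sum of values
    for s in range(0, K.shape[0], block):
        scores = q @ K[s:s + block].T / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()              # rescale old normalizer, add new
        acc = acc * scale + p @ V[s:s + block]
        m = m_new
    return acc / l                           # equals softmax(qK^T/sqrt(d)) @ V

rng = np.random.default_rng(1)
q, K, V = rng.normal(size=6), rng.normal(size=(5, 6)), rng.normal(size=(5, 6))
out = streaming_attention_row(q, K, V)
```

Only one block of scores exists at any time, which is why memory stays linear in sequence length.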
Mamba / State Space Models (SSMs) — linear complexity. Not yet competitive with transformers on language, but promising for specific tasks.
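The "linear complexity" claim comes from the recurrent form of an SSM: a fixed-size hidden state is updated once per token, with no pairwise attention scores. A minimal, non-selective toy sketch (function name and matrices are illustrative; Mamba's selective, input-dependent parameterization is far more elaborate):

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """h_t = A h_{t-1} + B x_t ; y_t = C h_t — one fixed-cost update per token."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:                  # O(T) in sequence length
        h = A @ h + B * x
        ys.append(C @ h)
    return np.array(ys)

# 1-dimensional state with decay 0.5: an impulse fades geometrically
ys = ssm_scan(np.array([[0.5]]), np.array([1.0]), np.array([1.0]), [1.0, 0.0, 0.0])
```

The state acts as a compressed summary of the past, which is where the trade-off against full attention comes from.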