Transformer Architecture

Key idea:

Transformer — neural network architecture introduced by Google researchers in 2017 in the paper "Attention Is All You Need". The basis of all modern LLMs. Key innovation — the self-attention mechanism: each token attends to every other token in the sequence, with learned attention weights determining how much each one contributes. Plus: multi-head attention, positional encoding, layer normalisation, feed-forward network. Variants: decoder-only (GPT), encoder-only (BERT), encoder-decoder (T5).

Below: details, example, related terms, FAQ.

Details

  • Self-attention: softmax(QK^T / √d_k) · V. O(N²) complexity in context length
  • Multi-head: parallel attention heads (8-128), each captures different patterns
  • Positional encoding: RoPE, ALiBi — add position info without absolute indices
  • Layers: 24-120 (GPT-4 is rumoured to have ~100 layers)
  • Long context: Flash Attention, sparse attention — reduce O(N²)
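To illustrate the multi-head bullet above, a minimal sketch using PyTorch's built-in nn.MultiheadAttention (the sizes — batch 2, sequence 16, dim 64, 8 heads — are arbitrary, chosen for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 16, 64)  # batch 2, sequence length 16, model dim 64

# 8 parallel heads, each attending over the full sequence (O(N^2) in length)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

# Self-attention: queries, keys, and values all come from the same input
out, weights = mha(x, x, x)

print(out.shape)      # torch.Size([2, 16, 64]) — same shape as the input
print(weights.shape)  # torch.Size([2, 16, 16]) — one N×N map, averaged over heads
```

Each head gets a 64/8 = 8-dimensional slice of the model dimension, so adding heads does not add parameters — it partitions them.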

Example

# Minimal single-head self-attention in PyTorch
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Learned projections for queries, keys, and values
        self.Q = nn.Linear(dim, dim)
        self.K = nn.Linear(dim, dim)
        self.V = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.Q(x), self.K(x), self.V(x)
        # Scaled dot-product scores: an N×N matrix — every token vs. every token
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
        # Softmax turns each row into a probability distribution over the sequence
        weights = torch.softmax(scores, dim=-1)
        return weights @ v
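A quick numerical sanity check of the scoring step used in SelfAttention.forward (shapes are illustrative): after the softmax, each row of the attention matrix is a probability distribution, so the rows sum to 1 — this is where the O(N²) cost lives.

```python
import torch

# Illustrative shapes: batch 1, sequence length 5, dim 8
q = torch.randn(1, 5, 8)
k = torch.randn(1, 5, 8)

# Same scoring step as in the module above
scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
weights = torch.softmax(scores, dim=-1)

print(weights.shape)    # torch.Size([1, 5, 5]) — the N×N attention matrix
print(weights.sum(-1))  # each of the 5 rows sums to ~1.0
```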

Related Terms


Frequently Asked Questions

Why did transformer become dominant?

Unlike RNNs, transformers process all tokens in parallel; they scale well with parameters and data; and attention captures long-range dependencies. They also work on any sequence data, not just text.

What is Flash Attention?

An optimised implementation of self-attention that tiles the computation to fit in GPU SRAM, making memory usage linear in sequence length rather than quadratic and speeding up training 2-4×. FlashAttention-3 was released in 2024.
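In PyTorch 2.x, F.scaled_dot_product_attention dispatches to a fused FlashAttention-style kernel when the hardware and dtypes allow it, falling back to the standard implementation otherwise. A minimal sketch (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch 1, 8 heads, sequence 128, head dim 64
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Fused attention: picks a FlashAttention-style kernel when available,
# otherwise falls back to the standard O(N^2)-memory implementation
out = F.scaled_dot_product_attention(q, k, v)

# Numerically equivalent to the explicit softmax(QK^T / sqrt(d)) V formulation
manual = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(out, manual, atol=1e-4))
```

The fused path never materialises the full N×N score matrix, which is where the memory saving comes from.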

Alternatives to transformer?

Mamba and other State Space Models (SSMs) — linear complexity in sequence length. Not yet competitive with transformers on general language modelling, but promising for specific tasks.