LLM Context Window

Key idea:

Context Window: the maximum number of tokens (input + output) an LLM can process in a single call. As of 2026: Claude Opus 4.7: 1M (200k stable), Gemini 2.5: 2M, GPT-5: 1M, Llama 3: 128k-1M. One token ≈ 0.75 words, so 1M tokens ≈ 750k words, roughly four Harry Potter books. Trade-off: more context means higher cost, slower responses, and a potential "lost in the middle" effect.

Below: details, example, related terms, FAQ.
Details

  • Tokens: Byte Pair Encoding (BPE); roughly 0.75 words per token in English, about 0.5 in Russian
  • Context budget: input + output ≤ window. With output = 4k, max input = (window - 4k)
  • Pricing: per token. 1M context × $3 per 1M = $3 per call
  • Lost in the middle: LLMs recall the middle of a long context worse than its start and end ("Lost in the Middle", Stanford, 2023)
  • Caching: Anthropic prompt cache, OpenAI automatic cache — reduce cost for repeat context
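
The budget and pricing arithmetic above can be sketched in a few lines (the 1M window, 4k output, and $3/1M price are the illustrative figures from the list, not any provider's live pricing):

```python
# Token budget: input + output must fit within the window.

def max_input_tokens(window: int, max_output: int) -> int:
    """Largest input that still leaves room for the response."""
    return window - max_output

def call_cost_usd(input_tokens: int, price_per_million: float) -> float:
    """Per-call input cost at a flat per-token price."""
    return input_tokens / 1_000_000 * price_per_million

# With a 1M window and output capped at 4k, max input is window - 4k:
budget = max_input_tokens(1_000_000, 4_096)
# A full 1M-token input at $3 per 1M tokens costs $3 per call:
cost = call_cost_usd(1_000_000, 3.0)
```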

Example

# Claude 1M context in Claude Agent SDK
from anthropic import Anthropic
client = Anthropic()

# Full codebase in context
with open('codebase.txt') as f:
  codebase = f.read()  # 500k tokens

response = client.messages.create(
  model='claude-opus-4-7[1m]',  # 1M context variant
  max_tokens=4096,
  system='You review code.',
  messages=[{'role': 'user', 'content': f'Review:\n{codebase}'}]
)

Frequently Asked Questions

Long context vs RAG?

RAG: cheaper and scales to effectively unlimited data, but can lose semantics at chunk boundaries. Long context: simpler code, but higher cost and latency. A hybrid is usually best.
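
A minimal sketch of the hybrid approach: retrieve only the relevant chunks RAG-style, then put just those into the context window. The keyword scorer here is a stand-in for a real embedding search; the chunks and query are hypothetical.

```python
# Hybrid long-context + RAG: select relevant chunks, then stuff only
# those into the prompt instead of the whole corpus.

def score(chunk: str, query: str) -> int:
    """Toy relevance score: count query-word occurrences in the chunk."""
    return sum(chunk.lower().count(w) for w in query.lower().split())

def retrieve(chunks: list[str], query: str, top_k: int = 2) -> list[str]:
    """Return the top_k highest-scoring chunks."""
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:top_k]

chunks = [
    "def parse_config(path): ...",
    "Billing module: charges per token.",
    "def tokenize(text): ...  # BPE tokenizer",
]
context = "\n".join(retrieve(chunks, "token pricing"))
prompt = f"Answer using only this context:\n{context}"
```

Swapping `score` for cosine similarity over embeddings gives the usual production setup; the context-window trade-off stays the same either way.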

Do I need 2M tokens?

For whole-repo code review, book summary, long document analysis — yes. For chat — 32k-200k is enough.

How to optimise?

Prompt caching: roughly 10× cheaper for repeated prefixes. Streaming for UX. Send only the necessary context, not the whole history.
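
A sketch of Anthropic prompt caching from the Details list: the large, repeated prefix (e.g. a codebase) is marked with `cache_control` so repeat calls reuse the cached prefix. The structure below follows the Anthropic Messages API; the context string is a placeholder, and the actual `client.messages.create(...)` call is shown only as a comment.

```python
# Mark the big, stable prefix as cacheable; keep the changing part
# (the user question) outside the cached blocks.

big_context = "<500k tokens of codebase>"  # placeholder

system_blocks = [
    {"type": "text", "text": "You review code."},
    {
        "type": "text",
        "text": big_context,
        "cache_control": {"type": "ephemeral"},  # cache this prefix
    },
]

# client.messages.create(model=..., max_tokens=4096,
#                        system=system_blocks,
#                        messages=[{"role": "user", "content": "Review file X."}])
```

Subsequent calls with an identical cached prefix are billed at the reduced cache-read rate, which is where the ~10× saving comes from.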