LLM Context Window

Key idea:

Context Window: the maximum number of tokens (input + output) an LLM can process in a single call. As of 2026: Claude Opus 4.7: 1M (200k stable), Gemini 2.5: 2M, GPT-5: 1M, Llama 3: 128k-1M. One token ≈ 0.75 words, so 1M tokens ≈ 750k words, roughly four Harry Potter books. Trade-off: more context means higher cost, slower responses, and a potential "lost in the middle" effect.

Below: details, example, related terms, FAQ.
Details

  • Tokens: Byte Pair Encoding (BPE); roughly 0.75 words per token in English, about 0.5 in Russian
  • Context budget: input + output ≤ window. With output = 4k, max input = (window - 4k)
  • Pricing: per token. 1M context × $3 per 1M = $3 per call
  • Lost in the middle: LLMs recall the middle of a long context worse than its start and end ("Lost in the Middle", Stanford, 2023)
  • Caching: Anthropic prompt cache, OpenAI automatic cache — reduce cost for repeat context
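
The budget and pricing arithmetic above can be sketched in a few lines (the 1M window, 4k output, and $3/1M price are the illustrative figures from the list, not any provider's live pricing):

```python
# Token budget: input + output must fit within the window.

def max_input_tokens(window: int, max_output: int) -> int:
    """Largest input that still leaves room for the response."""
    return window - max_output

def call_cost_usd(input_tokens: int, price_per_million: float) -> float:
    """Per-call input cost at a flat per-token price."""
    return input_tokens / 1_000_000 * price_per_million

# With a 1M window and output capped at 4k, max input is window - 4k:
budget = max_input_tokens(1_000_000, 4_096)
# A full 1M-token input at $3 per 1M tokens costs $3 per call:
cost = call_cost_usd(1_000_000, 3.0)
```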

Example

# Claude 1M context in Claude Agent SDK
from anthropic import Anthropic
client = Anthropic()

# Full codebase in context
with open('codebase.txt') as f:
  codebase = f.read()  # 500k tokens

response = client.messages.create(
  model='claude-opus-4-7[1m]',  # 1M context variant
  max_tokens=4096,
  system='You review code.',
  messages=[{'role': 'user', 'content': f'Review:\n{codebase}'}]
)

Frequently Asked Questions

Long context vs RAG?

RAG: cheaper and scales to effectively unlimited data, but can lose semantics at chunk boundaries. Long context: simpler code, but higher cost and latency. A hybrid is usually best.
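
A minimal sketch of the hybrid approach: retrieve only the relevant chunks RAG-style, then put just those into the context window. The keyword scorer here is a stand-in for a real embedding search; the chunks and query are hypothetical.

```python
# Hybrid long-context + RAG: select relevant chunks, then stuff only
# those into the prompt instead of the whole corpus.

def score(chunk: str, query: str) -> int:
    """Toy relevance score: count query-word occurrences in the chunk."""
    return sum(chunk.lower().count(w) for w in query.lower().split())

def retrieve(chunks: list[str], query: str, top_k: int = 2) -> list[str]:
    """Return the top_k highest-scoring chunks."""
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:top_k]

chunks = [
    "def parse_config(path): ...",
    "Billing module: charges per token.",
    "def tokenize(text): ...  # BPE tokenizer",
]
context = "\n".join(retrieve(chunks, "token pricing"))
prompt = f"Answer using only this context:\n{context}"
```

Swapping `score` for cosine similarity over embeddings gives the usual production setup; the context-window trade-off stays the same either way.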

Do I need 2M tokens?

For whole-repo code review, book summary, long document analysis — yes. For chat — 32k-200k is enough.

How to optimise?

Prompt caching: roughly 10× cheaper for repeated prefixes. Streaming for UX. Send only the necessary context, not the whole history.
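
A sketch of Anthropic prompt caching from the Details list: the large, repeated prefix (e.g. a codebase) is marked with `cache_control` so repeat calls reuse the cached prefix. The structure below follows the Anthropic Messages API; the context string is a placeholder, and the actual `client.messages.create(...)` call is shown only as a comment.

```python
# Mark the big, stable prefix as cacheable; keep the changing part
# (the user question) outside the cached blocks.

big_context = "<500k tokens of codebase>"  # placeholder

system_blocks = [
    {"type": "text", "text": "You review code."},
    {
        "type": "text",
        "text": big_context,
        "cache_control": {"type": "ephemeral"},  # cache this prefix
    },
]

# client.messages.create(model=..., max_tokens=4096,
#                        system=system_blocks,
#                        messages=[{"role": "user", "content": "Review file X."}])
```

Subsequent calls with an identical cached prefix are billed at the reduced cache-read rate, which is where the ~10× saving comes from.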