Context Window — the maximum number of tokens (input + output) an LLM can process in a single call. As of 2026: Claude Opus 4.7 — 1M (200k stable), Gemini 2.5 — 2M, GPT-5 — 1M, Llama 3 — 128k-1M. 1 token ≈ 0.75 words, so 1M tokens ≈ 750k words (roughly four Harry Potter books). Trade-off: more context means higher cost, slower responses, and potential "lost in the middle" degradation.
Below: details, an example, related terms, and FAQ.
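The ≈0.75 words-per-token rule above makes for a quick back-of-envelope estimator. A minimal sketch — real counts vary by tokenizer and language, so use the model's own tokenizer for exact numbers:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from word count, using the ~0.75 words/token rule of thumb."""
    words = len(text.split())
    return round(words / 0.75)  # ~1.33 tokens per word

# A 750k-word document lands right around the 1M-token mark:
print(estimate_tokens("word " * 750_000))  # → 1000000
```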
```python
# Claude 1M context via the Anthropic Python SDK
from anthropic import Anthropic

client = Anthropic()

# Load the full codebase into context
with open('codebase.txt') as f:
    codebase = f.read()  # ~500k tokens

response = client.messages.create(
    model='claude-opus-4-7[1m]',  # 1M-context variant
    max_tokens=4096,
    system='You review code.',
    messages=[{'role': 'user', 'content': f'Review:\n{codebase}'}],
)
```

RAG vs long context: RAG is cheaper and scales to effectively unbounded data, but loses semantics at chunk boundaries. Long context keeps the code simpler, but costs more and adds latency. A hybrid of the two is usually best.
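The chunk-boundary problem is usually softened with overlapping chunks, so a sentence cut at one boundary survives intact in its neighbor. A sketch (sizes in words for simplicity; real pipelines typically chunk by tokens):

```python
def chunk_with_overlap(words: list[str], size: int = 200, overlap: int = 50) -> list[list[str]]:
    """Sliding-window chunking: consecutive chunks share `overlap` words."""
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):  # last window already covers the tail
            break
    return chunks

doc = [f"w{i}" for i in range(500)]
parts = chunk_with_overlap(doc)
# The tail of each chunk repeats at the head of the next:
assert parts[0][-50:] == parts[1][:50]
```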
Do you need 1M tokens? For whole-repo code review, book summarization, or long-document analysis — yes. For ordinary chat, 32k-200k is enough.
Optimization: prompt caching makes repeated prefixes ~10× cheaper; streaming improves perceived latency; send only the necessary context, not the whole history.
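"Only necessary context" can be as simple as keeping the most recent turns under a token budget. A sketch — `estimate_tokens` here is the rough words/0.75 rule of thumb, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    return round(len(text.split()) / 0.75)  # rough rule of thumb

def trim_history(messages: list[dict], budget: int = 4000) -> list[dict]:
    """Keep the most recent messages whose combined estimate fits the budget.
    Walk backwards so the newest turns survive; the oldest are dropped first."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Walking from the newest message backwards (rather than oldest-first) is the design choice that matters: it guarantees the latest user turn is never the one that gets cut.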