Context Window
Learn what Context Window means in AI and machine learning, with examples and related concepts.
Definition
Context window is the maximum amount of text (measured in tokens) that an LLM can process in a single request — including both your input and the model’s output.
Think of it as the model’s working memory. Everything the model needs to “know” for a conversation — the system prompt, chat history, uploaded documents, and its own response — must fit inside this window. Once you exceed it, the oldest content gets dropped or the request fails.
Context windows have grown dramatically: GPT-3 had just 2,048 tokens (about 1,500 words). Today, Claude offers 200K tokens and Gemini 1.5 Pro supports up to 2M tokens. This expansion is one of the most impactful improvements in modern AI.
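You can roughly estimate how much of a window a piece of text will consume before sending it. A common rule of thumb for English prose is about four characters per token; this is a heuristic sketch, not a real tokenizer (use your provider's token-counting API for exact figures):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters/token heuristic.

    An approximation for English text only -- real tokenizers vary by
    model, so use the provider's token-counting API for exact counts.
    """
    return len(text) // 4

doc = "word " * 3000  # ~3,000 words, 15,000 characters
print(estimate_tokens(doc))  # ~3,750 tokens
```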
How It Works
```
┌────────────────────────────────────────────────────────┐
│              Context Window (200K tokens)              │
│                                                        │
│ ┌────────────────────────────────────┐                 │
│ │ System Prompt                      │    ~500 tokens  │
│ └────────────────────────────────────┘                 │
│ ┌────────────────────────────────────┐                 │
│ │ Conversation History               │  ~5,000 tokens  │
│ │ (previous messages)                │                 │
│ └────────────────────────────────────┘                 │
│ ┌────────────────────────────────────┐                 │
│ │ Retrieved Documents / Files        │ ~50,000 tokens  │
│ │ (RAG context, uploaded PDFs, code) │                 │
│ └────────────────────────────────────┘                 │
│ ┌────────────────────────────────────┐                 │
│ │ User's Query                       │    ~200 tokens  │
│ └────────────────────────────────────┘                 │
│ ┌────────────────────────────────────┐                 │
│ │ Model's Response                   │  ~4,000 tokens  │
│ │ (generated output)                 │                 │
│ └────────────────────────────────────┘                 │
│                                                        │
│          Remaining capacity: ~140,300 tokens           │
└────────────────────────────────────────────────────────┘
```
The model processes all tokens in the context window simultaneously using attention mechanisms. This means it can reference any part of the context when generating a response — but longer contexts use more compute and increase latency.
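The budget accounting in the diagram is simple arithmetic; a minimal sketch using the diagram's illustrative component sizes:

```python
CONTEXT_WINDOW = 200_000  # total token budget (e.g. Claude's 200K window)

# Illustrative component sizes from the diagram above
components = {
    "system_prompt": 500,
    "conversation_history": 5_000,
    "retrieved_documents": 50_000,
    "user_query": 200,
    "response_budget": 4_000,  # reserve room for the model's output too
}

used = sum(components.values())
remaining = CONTEXT_WINDOW - used
print(f"Used: {used:,} tokens, remaining: {remaining:,}")
# Used: 59,700 tokens, remaining: 140,300
```

Note that the response budget counts against the same window: if your input fills the window, there is no room left for the model to generate anything.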
Why It Matters
- Document analysis — A 200K context window can hold a 500-page book. You can ask questions about an entire codebase or legal contract in one request.
- Long conversations — Larger windows mean the model remembers more of your chat history without losing context.
- RAG trade-off — With large enough context windows, you can sometimes skip building a RAG pipeline and just pass documents directly. But retrieval is still more cost-effective for very large knowledge bases.
- Cost implications — More input tokens = higher cost. Claude charges per input token, so stuffing the full context window on every request gets expensive.
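The cost trade-off is easy to quantify. The per-token prices below are hypothetical placeholders for illustration only (check your provider's pricing page for real numbers):

```python
# Hypothetical prices for illustration -- NOT real provider pricing.
INPUT_PRICE_PER_MTOK = 3.00    # assumed $ per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # assumed $ per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request under the assumed prices."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# Stuffing the full 200K window vs. retrieving a relevant 10K slice,
# with the same 4K-token response in both cases:
print(f"${request_cost(200_000, 4_000):.2f}")  # full window
print(f"${request_cost(10_000, 4_000):.2f}")   # retrieved slice
```

Under these assumed prices the full-window request costs several times more than the retrieval-based one, which is why RAG stays attractive for high-volume workloads even when everything would technically fit in context.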
Current Context Window Sizes (as of 2026)
| Model | Context Window | Approx. Words |
|---|---|---|
| Claude (Anthropic) | 200K tokens | ~150,000 |
| Gemini 1.5 Pro (Google) | 2M tokens | ~1,500,000 |
| GPT-4o (OpenAI) | 128K tokens | ~96,000 |
| Llama 3 (Meta) | 128K tokens | ~96,000 |
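The word estimates in the table use the common rule of thumb of roughly 0.75 English words per token (an approximation; the real ratio varies by language and content):

```python
WORDS_PER_TOKEN = 0.75  # rough heuristic for English text

def approx_words(tokens: int) -> int:
    return int(tokens * WORDS_PER_TOKEN)

print(approx_words(200_000))    # 150000  (Claude)
print(approx_words(2_000_000))  # 1500000 (Gemini 1.5 Pro)
```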
Example
```python
from anthropic import Anthropic

client = Anthropic()

# Read an entire file and analyze it within the context window
with open("large_codebase.py", "r") as f:
    code = f.read()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": f"Review this code for security vulnerabilities:\n\n{code}"
    }]
)

# Check how much of the context window we used
print(f"Input tokens used: {response.usage.input_tokens:,}")
print(f"Remaining capacity: {200_000 - response.usage.input_tokens:,} tokens")
```
```python
# Managing context in a multi-turn conversation
conversation = []

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=conversation
    )
    assistant_msg = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_msg})

    # Warn if approaching the context limit
    total_tokens = response.usage.input_tokens + response.usage.output_tokens
    if total_tokens > 150_000:
        print(f"Warning: {total_tokens:,} tokens used — consider summarizing history")
    return assistant_msg
```
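One simple way to act on that warning is to trim old turns before the window fills up. A minimal sketch of a drop-oldest strategy, using the rough ~4 characters/token estimate (a heuristic; a production version would use the provider's token-counting API and likely summarize rather than discard):

```python
def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Drop the oldest turns until the estimated total fits the budget.

    Token counts use the ~4 chars/token heuristic, so treat the result
    as approximate. Always keeps at least the most recent message.
    """
    def est(msg: dict) -> int:
        return len(msg["content"]) // 4

    trimmed = list(messages)
    while len(trimmed) > 1 and sum(est(m) for m in trimmed) > budget_tokens:
        trimmed.pop(0)  # drop the oldest message first
    return trimmed

history = [{"role": "user", "content": "x" * 400}] * 10  # ~100 tokens each
print(len(trim_history(history, 350)))  # 3 -- only the newest turns survive
```

Dropping whole turns from the front is crude: it can discard facts the model still needs. Summarizing older turns into a single short message preserves more information at the same token cost.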
Key Takeaways
- The context window is the model’s working memory — everything (input + output) must fit inside it
- Modern models range from 128K to 2M tokens, enabling analysis of entire books or codebases
- Larger context windows enable new use cases but cost more per request
- For very large knowledge bases, RAG is still more practical than stuffing everything into context
- Context window size alone doesn’t guarantee quality — models may struggle with information buried in the middle of very long contexts (the “lost in the middle” problem)
Part of the DeepRaft Glossary — AI and ML terms explained for developers.