
Skills vs MCP Servers: The Hidden Token Cost of Claude Code Extensions
MCP servers consume up to 50x more context than skills. Here's how each loads into memory, what it costs, and when to …
Sources verified March 2026.
LLMs have no memory between API calls. Every turn is a fresh API call with the full conversation passed in.
Turn 1:
API call → [system prompt, user msg 1]
Response → assistant msg 1
Turn 2:
API call → [system prompt, user msg 1, assistant msg 1, user msg 2]
Response → assistant msg 2
Turn 3:
API call → [system prompt, user msg 1, assistant msg 1, user msg 2, assistant msg 2, user msg 3]
Response → assistant msg 3
The model re-reads the entire conversation every turn. This is true for ChatGPT, Claude, Claude Code, and every LLM-based chat product. The “memory” you experience is an illusion — the application layer replays the full history each time.
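This replay loop can be sketched in a few lines. `call_model` below is a stand-in for a real API call, not an actual SDK function:

```python
# Minimal sketch of the stateless replay loop every chat product runs.
# call_model is a placeholder for a real provider API call.
def call_model(messages):
    # A real implementation would POST `messages` to the provider here.
    return f"reply to: {messages[-1]['content']}"

history = [{"role": "system", "content": "You are helpful."}]

for user_text in ["Hi", "Tell me more"]:
    history.append({"role": "user", "content": user_text})
    # The FULL history is sent on every turn -- the model has no memory.
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})

print(len(history))  # system + 2 user + 2 assistant = 5 messages
```

The application layer, not the model, owns `history`; delete it and the "memory" is gone.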
Every model has a maximum context window (e.g., Claude Opus: 200K tokens). When the conversation approaches this limit, different products handle it differently:
| Product | Strategy |
|---|---|
| ChatGPT | Silently truncates older messages |
| Claude.ai | Warns you, starts new conversation |
| Claude Code | Auto-compacts — summarizes older messages, keeps recent ones |
Source: Claude Code auto-compaction docs: triggers at ~95% capacity by default, configurable via `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE`.
An LLM doesn’t just “read” tokens. It performs expensive matrix math on every token through dozens of transformer layers. Each token produces a Key-Value (KV) pair at every layer — the model’s internal representation of that token in context.
Input: "The cat sat on the mat"
Token 1 "The" → N layers of matrix math → KV pair
Token 2 "cat" → N layers of matrix math → KV pair
Token 3 "sat" → N layers of matrix math → KV pair
Token 4 "on" → N layers of matrix math → KV pair
Token 5 "the" → N layers of matrix math → KV pair
Token 6 "mat" → N layers of matrix math → KV pair
These KV pairs are what the model uses to generate the next token. Computing them is the most GPU-intensive part of processing input.
Note: The exact number of layers varies by model and is not publicly disclosed for Claude. The principle is the same regardless of layer count.
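A toy sketch of the bookkeeping (the layer count and the "math" are invented for illustration; real models use attention projection matrices):

```python
# Toy model: every token yields one (key, value) pair at every layer.
NUM_LAYERS = 4  # illustrative only; real layer counts are model-specific

def compute_kv(token, layer):
    # Stand-in for the real K/V projection math done on the GPU.
    return (hash((token, layer, "K")), hash((token, layer, "V")))

tokens = ["The", "cat", "sat", "on", "the", "mat"]
kv_cache = {
    layer: [compute_kv(tok, layer) for tok in tokens]
    for layer in range(NUM_LAYERS)
}

# Total stored KV pairs = layers x tokens
total = sum(len(pairs) for pairs in kv_cache.values())
print(total)  # 4 layers * 6 tokens = 24 pairs
```

The point of the sketch: KV storage grows linearly with both sequence length and depth, which is why recomputing it every turn is so wasteful.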
Without caching, every turn recomputes KV pairs for the entire conversation — even though most tokens are identical to the previous turn.
Turn 1: [system prompt (1000 tokens) + user msg (50 tokens)]
→ Compute KV pairs for all 1,050 tokens
Turn 2: [system prompt (1000) + user msg (50) + asst msg (200) + user msg 2 (50)]
→ Compute KV pairs for all 1,300 tokens AGAIN
→ The first 1,050 tokens produce the EXACT SAME KV pairs
→ Wasted GPU time
Turn 1: [system prompt (1000) + user msg (50)]
→ Compute KV pairs for 1,050 tokens
→ SAVE KV pairs to fast storage ← the cache
Turn 2: [system prompt (1000) + user msg (50) + asst msg (200) + user msg 2 (50)]
→ First 1,050 tokens? LOAD KV pairs from storage (no GPU math)
→ Only compute KV pairs for the 250 new tokens
→ GPU did 250 tokens of work instead of 1,300
Source: Anthropic docs confirm: “Prompt caching optimizes API usage by allowing you to resume from specific prefixes in your prompts. The system stores KV cache representations and cryptographic hashes, but not raw text of prompts or responses.”
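The arithmetic from the example above, as a quick sanity check:

```python
# Token counts from the two-turn example above.
system_prompt, user1, asst1, user2 = 1000, 50, 200, 50

turn2_total = system_prompt + user1 + asst1 + user2  # 1300 tokens sent
cached_prefix = system_prompt + user1                # 1050 tokens from turn 1

computed_without_cache = turn2_total                 # GPU recomputes everything
computed_with_cache = turn2_total - cached_prefix    # only the new tokens

print(computed_with_cache)                             # 250
print(computed_with_cache / computed_without_cache)    # ~0.19 of the work
```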
Cache matches from the start of the token sequence. One change in the middle breaks the cache for everything after it.
Cache works:
[A B C D E] ← Turn 1 (cached)
[A B C D E F G] ← Turn 2 (A-E cache hit, F-G new)
Cache BREAKS:
[A B C D E] ← Turn 1 (cached)
[A B X D E F G] ← Turn 2 (A-B cache hit, X breaks match, D-G all recomputed)
The cache follows a hierarchical order: tools → system → messages. This is why system prompts and early conversation turns almost always hit cache.
Source: Anthropic docs: “Cache keys are cumulative — each block’s hash depends on all previous blocks.”
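Prefix matching can be sketched as a longest-common-prefix check (a simplification: the real system compares cumulative block hashes, per the quote above, but the hit/break behavior is the same):

```python
def cache_hit_length(cached, incoming):
    """Number of leading tokens that match the cached sequence."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break  # one mismatch invalidates everything after it
        n += 1
    return n

turn1 = ["A", "B", "C", "D", "E"]
print(cache_hit_length(turn1, ["A", "B", "C", "D", "E", "F", "G"]))  # 5: full hit
print(cache_hit_length(turn1, ["A", "B", "X", "D", "E", "F", "G"]))  # 2: X breaks it
```

Note that in the second call, D and E match the cached sequence by value but are still recomputed: position in the prefix matters, not content.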
All tokens are still sent to the API (the server needs them to verify the prefix match). But the server skips the GPU computation for cached tokens and loads pre-computed results instead.
What you send over the network: ALL tokens (same data size)
What the GPU computes: ONLY new tokens (much less work)
Analogy: Re-taking a math exam where questions 1-10 are the same as last time. Without cache, you re-solve all questions. With cache, you load your saved answers for 1-10 and only solve question 11.
| TTL Option | Duration | Write Cost | When to Use |
|---|---|---|---|
| Default | 5 minutes | 1.25x base input | Active conversations (refreshed on each use) |
| Extended | 1 hour | 2x base input | Infrequent requests, long agentic tasks |
Source: Anthropic docs: “Refreshed at no additional cost when used within 5 minutes.”
Not all content can be cached — there’s a minimum size requirement:
| Model | Minimum Cached Tokens |
|---|---|
| Claude Opus 4.6, 4.5 / Haiku 4.5 | 4,096 |
| Claude Sonnet 4.6 | 2,048 |
| Claude Sonnet 4.5, 4.1, 4, 3.7 | 1,024 |
| Claude Haiku 3.5, 3 | 2,048 |
Source: Anthropic Prompt Caching docs
Both providers use the same core concept (prefix-based KV cache), but differ in implementation:
| | Anthropic (Claude) | OpenAI (GPT) |
|---|---|---|
| Activation | Opt-in via cache_control field, or automatic mode | Fully automatic, no code changes |
| Minimum tokens | 1,024 - 4,096 (varies by model) | 1,024 |
| Cache read discount | 90% off input price (0.1x) | Up to 90% off input + up to 80% latency reduction |
| Cache write cost | 25% surcharge (1.25x) | No surcharge |
| Default TTL | 5 minutes | 5-10 minutes (in-memory) |
| Extended TTL | 1 hour (at 2x input price) | 24 hours (GPU-local storage, same price) |
| Scope | Workspace-level isolation | Organization-level |
| Breakpoints | Up to 4 explicit breakpoints | Automatic + prompt_cache_key for routing hints |
| Monitoring | cache_creation_input_tokens + cache_read_input_tokens | cached_tokens in prompt_tokens_details |
| Cache routing | Prefix-based hash | Hash of first ~256 tokens + optional prompt_cache_key |
| Rate limit | Cache hits don’t count against limits | Cache hits still count against TPM limits |
Key takeaway: Both offer up to 90% input cost reduction. Anthropic charges a 25% surcharge on cache writes but gives explicit control via breakpoints. OpenAI is simpler (automatic, no write surcharge) and offers 24-hour extended retention on newer models (GPT-5 series). OpenAI also provides a prompt_cache_key parameter to influence cache routing when many requests share prefixes.
Sources:
- Anthropic Prompt Caching
- OpenAI Prompt Caching Guide — fetched via headless browser (Cloudflare-protected)
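On the Anthropic side, opting in means attaching a `cache_control` block to the content you want cached. A minimal request-body sketch (field names per the Prompt Caching docs; the model id is illustrative and no request is actually sent here):

```python
import json

# Messages API request body with one explicit cache breakpoint.
# Everything up to and including the marked block becomes the cached prefix.
body = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a code-review assistant. <long instructions...>",
            "cache_control": {"type": "ephemeral"},  # default 5-minute TTL
        }
    ],
    "messages": [{"role": "user", "content": "Review this diff."}],
}
print(json.dumps(body, indent=2))
```

The system block must clear the model's minimum cacheable size (1,024 to 4,096 tokens, per the table above) or the breakpoint is silently ignored.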
| Model | Base Input | 5m Cache Write | 1h Cache Write | Cache Read | Output |
|---|---|---|---|---|---|
| Opus 4.6 / 4.5 | $5 | $6.25 | $10 | $0.50 | $25 |
| Opus 4.1 / 4 | $15 | $18.75 | $30 | $1.50 | $75 |
| Sonnet 4.6 / 4.5 / 4 / 3.7 | $3 | $3.75 | $6 | $0.30 | $15 |
| Haiku 4.5 | $1 | $1.25 | $2 | $0.10 | $5 |
| Haiku 3.5 | $0.80 | $1 | $1.60 | $0.08 | $4 |
Pricing multipliers (same across all models): 5-minute cache write 1.25x base input, 1-hour cache write 2x, cache read 0.1x.
Source: Anthropic Prompt Caching docs — pricing table verified March 2026.
Using Claude Opus 4.6 ($5/1M input, $0.50/1M cache read):
[system prompt ~~~~~~~~] → cache READ ($0.50/1M)
[CLAUDE.md ~~~~~~~~~~~~] → cache READ ($0.50/1M)
[msg 1, response 1 ~~~~] → cache READ ($0.50/1M)
[msg 2, response 2 ~~~~] → cache READ ($0.50/1M)
...
[previous response ~~~~~] → cache WRITE ($6.25/1M) ← new since last turn
[your new message ~~~~~~] → full INPUT ($5/1M) ← brand new
[model's response ~~~~~~] → OUTPUT ($25/1M) ← always most expensive
90%+ of input hits cache. Output tokens are always full price and typically the biggest cost driver.
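Plugging the Opus 4.6 prices into one concrete turn (token counts are illustrative):

```python
# Opus 4.6 prices per million tokens, from the table above.
INPUT, CACHE_WRITE, CACHE_READ, OUTPUT = 5.00, 6.25, 0.50, 25.00

def cost(tokens, price_per_million):
    return tokens / 1_000_000 * price_per_million

turn = (
    cost(49_000, CACHE_READ)   # cached prefix: system prompt + history
    + cost(500, CACHE_WRITE)   # previous response, newly cached
    + cost(200, INPUT)         # your new message
    + cost(800, OUTPUT)        # model's response
)
print(f"${turn:.6f}")  # 800 output tokens are ~41% of the whole turn's cost
```

Even with 49,700 input tokens against 800 output tokens, output is the single largest line item.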
The API response includes cache metrics:
{
  "usage": {
    "cache_creation_input_tokens": 500,
    "cache_read_input_tokens": 49000,
    "input_tokens": 200,
    "output_tokens": 800
  }
}
Total input = cache_read + cache_creation + input (49,000 + 500 + 200 = 49,700)
Source: Anthropic docs: “input_tokens represents only tokens after the last cache breakpoint, not all input tokens sent.”
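Totaling the fields correctly, and computing a cache hit rate, looks like this:

```python
# Usage block as returned by the API (values from the example above).
usage = {
    "cache_creation_input_tokens": 500,
    "cache_read_input_tokens": 49000,
    "input_tokens": 200,
    "output_tokens": 800,
}

# input_tokens covers only tokens after the last cache breakpoint,
# so total input must sum all three input fields.
total_input = (
    usage["cache_read_input_tokens"]
    + usage["cache_creation_input_tokens"]
    + usage["input_tokens"]
)
cache_hit_rate = usage["cache_read_input_tokens"] / total_input
print(total_input, f"{cache_hit_rate:.1%}")  # 49700 98.6%
```

A common billing-dashboard mistake is treating `input_tokens` alone as "input", which undercounts by two orders of magnitude in a cached conversation like this one.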
In Claude Code, subagents run in isolated context windows with their own model choice.
Direct (main conversation):
┌────────────────────────────────────────┐
│ System prompt + history (50K tokens) │
│ + tool result (1K tokens) │
│ All processed by Opus ($$$) │
│ Result stays in context permanently │
└────────────────────────────────────────┘
Subagent:
┌──────────────────┐ ┌─────────────────────┐
│ Main (Opus) │ │ Agent (Sonnet) │
│ 50K history │───►│ 100 token prompt │
│ + 300 summary │◄───│ + 1K tool result │
│ │ │ Isolated, discarded │
└──────────────────┘ └─────────────────────┘
Source: Claude Code Subagents docs: “Each subagent runs in its own context window with a custom system prompt, specific tool access, and independent permissions.”
The main model processes the full conversation history even just to decide to delegate. There is no lightweight router that intercepts before the main model.
You: "use web-search agent to find X"
→ Opus reads 50K history to decide "delegate to agent" (~50 output tokens)
→ This delegation turn is unavoidable overhead
| Savings Source | Why |
|---|---|
| Cheaper model for work | Sonnet ($3/1M input, $15/1M output) vs Opus ($5/1M input, $25/1M output) |
| Smaller context addition | 300 token summary vs 1K full result added to main context |
| Context window preservation | Delays auto-compaction, which loses information when it summarizes |
| Output token savings | Sonnet output $15/1M vs Opus $25/1M |
Note: With caching, input token savings from agents are minimal (cached tokens are cheap anyway). The primary savings come from output tokens (always full price) and context window space.
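The output-token math behind that note, with illustrative numbers:

```python
# Output prices per million tokens, from the pricing table above.
OPUS_OUT, SONNET_OUT = 25.00, 15.00

# Suppose a verbose task produces 20K output tokens of working text,
# and the subagent reports back a 300-token summary.
work_tokens = 20_000
summary_tokens = 300

direct_cost = work_tokens / 1e6 * OPUS_OUT           # Opus writes everything
agent_cost = (work_tokens + summary_tokens) / 1e6 * SONNET_OUT  # Sonnet does it all

print(f"${direct_cost:.4f} vs ${agent_cost:.4f}")
```

On top of the dollar difference, only the 300-token summary lands in the main context, so the 20K tokens of working text never accelerate auto-compaction.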
| Scenario | Better Approach | Why |
|---|---|---|
| Single search, short conversation | Direct | Agent overhead not worth it |
| Multiple searches, long conversation | Agent | Compounding context savings |
| Heavy output generation (analysis, code) | Agent | Output tokens cheaper on Sonnet |
| Need to ask follow-ups about result | Direct | Agent discards its context |
| Verbose operations (test runs, log analysis) | Agent | Keeps noise out of main context |
Source: Anthropic docs recommend subagents for “isolating high-volume operations” — Claude Code Subagents: Common patterns.
With caching, per-turn input cost difference is small. The real savings compound from output tokens and context window space:
5 web searches in a 20-turn session:
Direct:
- 5K tokens added to context permanently
- All search work output generated by Opus ($25/1M output)
- Context fills up faster → earlier compaction → information loss
Agent:
- 1.5K tokens added to context (summaries only)
- Search work output generated by Sonnet ($15/1M output)
- 3.5K context space preserved → delays compaction
Changes cascade down the hierarchy. Modifying something invalidates that level and all subsequent levels:
| What Changes | tools | system | messages |
|---|---|---|---|
| Tool definitions | ✘ | ✘ | ✘ |
| System prompt | ✓ | ✘ | ✘ |
| Tool choice parameter | ✓ | ✓ | ✘ |
| Images added/removed | ✓ | ✓ | ✘ |
| Thinking parameters | ✓ | ✓ | ✘ |
(✓ = still cached, ✘ = invalidated)
Source: Anthropic Prompt Caching docs — “What Invalidates the Cache” section.
| Concept | Key Takeaway |
|---|---|
| Stateless | Full conversation re-sent every turn — no built-in memory |
| Prefix caching | 90% cheaper for repeated prefixes — GPU skips cached tokens |
| Cache lifetime | 5 min TTL (default), 1 hour (extended) — active chat stays cached |
| Minimum cache size | 1,024-4,096 tokens depending on model |
| Biggest cost | Output tokens (always full price, no caching possible) |
| Subagents | Save on output tokens + context space; input savings minimal due to caching |
| When to use agents | Long sessions, multiple tool calls, heavy output generation, noisy operations |
Last verified: March 2026. Pricing and features may change — always check the official Anthropic documentation for current information.
Questions or feedback? Reach out on LinkedIn