
LLM Token Economics

Eric Lam
March 12, 2026 · 10 min read

1. LLMs Are Stateless

LLMs have no memory between API calls. Every turn is a fresh API call with the full conversation passed in.

Turn 1:
  API call → [system prompt, user msg 1]
  Response → assistant msg 1

Turn 2:
  API call → [system prompt, user msg 1, assistant msg 1, user msg 2]
  Response → assistant msg 2

Turn 3:
  API call → [system prompt, user msg 1, assistant msg 1, user msg 2, assistant msg 2, user msg 3]
  Response → assistant msg 3

The model re-reads the entire conversation every turn. This is true for ChatGPT, Claude, Claude Code, and every LLM-based chat product. The “memory” you experience is an illusion — the application layer replays the full history each time.
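The replay loop can be sketched in a few lines of Python. `send` here is a hypothetical stand-in for a real API client, not an actual SDK call:

```python
# Illustrative sketch of the application layer replaying the full
# history on every turn. `send` is a made-up placeholder, not a real SDK.
def send(messages):
    """Pretend API call: returns a canned assistant reply."""
    return f"reply to: {messages[-1]['content']}"

history = [{"role": "system", "content": "You are a helpful assistant."}]

for turn, user_msg in enumerate(["hi", "and then?", "thanks"], start=1):
    history.append({"role": "user", "content": user_msg})
    sent = len(history)            # the FULL history goes over the wire
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    print(f"turn {turn}: sent {sent} messages")
```

Each turn sends 2, then 4, then 6 messages: the payload grows every turn even though only the last message is new.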

2. What Happens When Context Fills Up

Every model has a maximum context window (e.g., Claude Opus: 200K tokens). When the conversation approaches this limit, different products handle it differently:

| Product | Strategy |
|---|---|
| ChatGPT | Silently truncates older messages |
| Claude.ai | Warns you, starts new conversation |
| Claude Code | Auto-compacts: summarizes older messages, keeps recent ones |

Source: Claude Code auto-compaction — triggers at ~95% capacity by default, configurable via CLAUDE_AUTOCOMPACT_PCT_OVERRIDE.

3. How the LLM Processes Tokens Internally

An LLM doesn’t just “read” tokens. It performs expensive matrix math on every token through dozens of transformer layers. Each token produces a Key-Value (KV) pair at every layer — the model’s internal representation of that token in context.

Input: "The cat sat on the mat"

Token 1 "The"  → N layers of matrix math → KV pair
Token 2 "cat"  → N layers of matrix math → KV pair
Token 3 "sat"  → N layers of matrix math → KV pair
Token 4 "on"   → N layers of matrix math → KV pair
Token 5 "the"  → N layers of matrix math → KV pair
Token 6 "mat"  → N layers of matrix math → KV pair

These KV pairs are what the model uses to generate the next token. Computing them is the most GPU-intensive part of processing input.

Note: The exact number of layers varies by model and is not publicly disclosed for Claude. The principle is the same regardless of layer count.
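To get a feel for why this matters, here is a back-of-envelope KV cache size calculation. All dimensions are assumed values for illustration only; Claude's real layer count, head count, and head size are not public:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Every dimension below is ASSUMED, not a real Claude value.
layers, kv_heads, head_dim = 48, 8, 128
bytes_per_value = 2                      # fp16/bf16

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)                # 196608 bytes = 192 KiB per token

context_tokens = 200_000
gib = kv_bytes_per_token * context_tokens / 2**30
print(round(gib, 2))                     # ~36.62 GiB for a full context window
```

At these (hypothetical) dimensions, a full 200K-token context carries tens of gigabytes of KV state, which is why recomputing or storing it is the central cost question.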

4. Prompt Caching

The Problem: Redundant Computation

Without caching, every turn recomputes KV pairs for the entire conversation — even though most tokens are identical to the previous turn.

Turn 1: [system prompt (1000 tokens) + user msg (50 tokens)]
  → Compute KV pairs for all 1,050 tokens

Turn 2: [system prompt (1000) + user msg (50) + asst msg (200) + user msg 2 (50)]
  → Compute KV pairs for all 1,300 tokens AGAIN
  → The first 1,050 tokens produce the EXACT SAME KV pairs
  → Wasted GPU time

The Solution: Cache the KV Pairs

Turn 1: [system prompt (1000) + user msg (50)]
  → Compute KV pairs for 1,050 tokens
  → SAVE KV pairs to fast storage ← the cache

Turn 2: [system prompt (1000) + user msg (50) + asst msg (200) + user msg 2 (50)]
  → First 1,050 tokens? LOAD KV pairs from storage (no GPU math)
  → Only compute KV pairs for the 250 new tokens
  → GPU did 250 tokens of work instead of 1,300

Source: Anthropic docs confirm: “Prompt caching optimizes API usage by allowing you to resume from specific prefixes in your prompts. The system stores KV cache representations and cryptographic hashes, but not raw text of prompts or responses.”

Key Rule: Prefix Matching Only

Cache matches from the start of the token sequence. One change in the middle breaks the cache for everything after it.

Cache works:
  [A B C D E]       ← Turn 1 (cached)
  [A B C D E F G]   ← Turn 2 (A-E cache hit, F-G new)

Cache BREAKS:
  [A B C D E]       ← Turn 1 (cached)
  [A B X D E F G]   ← Turn 2 (A-B cache hit, X breaks match, D-G all recomputed)
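The prefix-match rule above can be sketched as a tiny function, treating tokens as list items:

```python
def cached_prefix_len(prev, new):
    """Number of leading tokens shared with the previous request;
    only these can be served from the KV cache."""
    n = 0
    for a, b in zip(prev, new):
        if a != b:
            break
        n += 1
    return n

turn1 = list("ABCDE")
print(cached_prefix_len(turn1, list("ABCDEFG")))  # 5: A-E all hit the cache
print(cached_prefix_len(turn1, list("ABXDEFG")))  # 2: X breaks the match at C
```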

The cache follows a hierarchical order: tools → system → messages. This is why system prompts and early conversation turns almost always hit cache.

Source: Anthropic docs: “Cache keys are cumulative — each block’s hash depends on all previous blocks.”

What Gets Sent vs What Gets Computed

All tokens are still sent to the API (the server needs them to verify the prefix match). But the server skips the GPU computation for cached tokens and loads pre-computed results instead.

What you send over the network:  ALL tokens (same data size)
What the GPU computes:           ONLY new tokens (much less work)

Analogy: Re-taking a math exam where questions 1-10 are the same as last time. Without cache, you re-solve all questions. With cache, you load your saved answers for 1-10 and only solve question 11.

Cache Lifetime (TTL)

| TTL Option | Duration | Write Cost | When to Use |
|---|---|---|---|
| Default | 5 minutes | 1.25x base input | Active conversations (refreshed on each use) |
| Extended | 1 hour | 2x base input | Infrequent requests, long agentic tasks |

Source: Anthropic docs: “Refreshed at no additional cost when used within 5 minutes.”

Minimum Token Requirements for Caching

Not all content can be cached — there’s a minimum size requirement:

| Model | Minimum Cached Tokens |
|---|---|
| Claude Opus 4.6, 4.5 / Haiku 4.5 | 4,096 |
| Claude Sonnet 4.6 | 2,048 |
| Claude Sonnet 4.5, 4.1, 4, 3.7 | 1,024 |
| Claude Haiku 3.5, 3 | 2,048 |

Source: Anthropic Prompt Caching docs

Anthropic vs OpenAI Caching Comparison

Both providers use the same core concept (prefix-based KV cache), but differ in implementation:

| Feature | Anthropic (Claude) | OpenAI (GPT) |
|---|---|---|
| Activation | Opt-in via cache_control field, or automatic mode | Fully automatic, no code changes |
| Minimum tokens | 1,024-4,096 (varies by model) | 1,024 |
| Cache read discount | 90% off input price (0.1x) | Up to 90% off input + up to 80% latency reduction |
| Cache write cost | 25% surcharge (1.25x) | No surcharge |
| Default TTL | 5 minutes | 5-10 minutes (in-memory) |
| Extended TTL | 1 hour (at 2x input price) | 24 hours (GPU-local storage, same price) |
| Scope | Workspace-level isolation | Organization-level |
| Breakpoints | Up to 4 explicit breakpoints | Automatic + prompt_cache_key for routing hints |
| Monitoring | cache_creation_input_tokens + cache_read_input_tokens | cached_tokens in prompt_tokens_details |
| Cache routing | Prefix-based hash | Hash of first ~256 tokens + optional prompt_cache_key |
| Rate limit | Cache hits don't count against limits | Cache hits still count against TPM limits |

Key takeaway: Both offer up to 90% input cost reduction. Anthropic charges a 25% surcharge on cache writes but gives explicit control via breakpoints. OpenAI is simpler (automatic, no write surcharge) and offers 24-hour extended retention on newer models (GPT-5 series). OpenAI also provides a prompt_cache_key parameter to influence cache routing when many requests share prefixes.
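As a concrete example of Anthropic's opt-in breakpoints, a Messages API request body can mark the system prompt as cacheable with `cache_control`. This is a sketch of the request shape per the prompt-caching docs; the model id and prompt text are placeholders:

```python
# Request body with an explicit cache breakpoint on the system prompt.
# Everything up to and including the marked block becomes the cached prefix.
payload = {
    "model": "claude-sonnet-4-5",          # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a helpful assistant. <long, stable instructions>",
            "cache_control": {"type": "ephemeral"},   # cache ends here
        }
    ],
    "messages": [{"role": "user", "content": "Hello"}],
}
print(payload["system"][0]["cache_control"])
```

Putting the breakpoint after the large, stable part of the prompt (system instructions, tool definitions) and keeping the volatile part (the latest user message) after it is what makes the prefix reusable across turns.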


5. Token Pricing

Full Pricing Table (per 1M tokens)

| Model | Base Input | 5m Cache Write | 1h Cache Write | Cache Read | Output |
|---|---|---|---|---|---|
| Opus 4.6 / 4.5 | $5 | $6.25 | $10 | $0.50 | $25 |
| Opus 4.1 / 4 | $15 | $18.75 | $30 | $1.50 | $75 |
| Sonnet 4.6 / 4.5 / 4 / 3.7 | $3 | $3.75 | $6 | $0.30 | $15 |
| Haiku 4.5 | $1 | $1.25 | $2 | $0.10 | $5 |
| Haiku 3.5 | $0.80 | $1 | $1.60 | $0.08 | $4 |

Pricing multipliers (same across all models):

- 5-minute cache write: 1.25x base input
- 1-hour cache write: 2x base input
- Cache read: 0.1x base input
- Output: 5x base input

Source: Anthropic Prompt Caching docs — pricing table verified March 2026.

Cost Example: 50K Token Conversation

Using Claude Opus 4.6 ($5/1M input, $0.50/1M cache read):

What a Typical Turn Looks Like

[system prompt ~~~~~~~~]  → cache READ   ($0.50/1M)
[CLAUDE.md ~~~~~~~~~~~~]  → cache READ   ($0.50/1M)
[msg 1, response 1 ~~~~]  → cache READ   ($0.50/1M)
[msg 2, response 2 ~~~~]  → cache READ   ($0.50/1M)
...
[previous response ~~~~~]  → cache WRITE  ($6.25/1M)   ← new since last turn
[your new message ~~~~~~]  → full INPUT   ($5/1M)      ← brand new
[model's response ~~~~~~]  → OUTPUT       ($25/1M)     ← always most expensive

90%+ of input hits cache. Output tokens are always full price and typically the biggest cost driver.

Tracking Cache Performance

The API response includes cache metrics:

{
  "usage": {
    "cache_creation_input_tokens": 500,
    "cache_read_input_tokens": 49000,
    "input_tokens": 200,
    "output_tokens": 800
  }
}

Total input = cache_read + cache_creation + input (49,000 + 500 + 200 = 49,700)

Source: Anthropic docs: “input_tokens represents only tokens after the last cache breakpoint, not all input tokens sent.”
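Plugging those usage numbers into the Opus 4.6 prices from the table above shows where the money goes. This is a rough sketch; real bills depend on exact token counts:

```python
# Per-1M-token prices for Opus 4.6, from the pricing table above.
CACHE_READ, CACHE_WRITE_5M, BASE_INPUT, OUTPUT = 0.50, 6.25, 5.00, 25.00

usage = {
    "cache_read_input_tokens": 49_000,
    "cache_creation_input_tokens": 500,
    "input_tokens": 200,
    "output_tokens": 800,
}

cost = (usage["cache_read_input_tokens"]     / 1e6 * CACHE_READ
      + usage["cache_creation_input_tokens"] / 1e6 * CACHE_WRITE_5M
      + usage["input_tokens"]                / 1e6 * BASE_INPUT
      + usage["output_tokens"]               / 1e6 * OUTPUT)
print(f"${cost:.6f}")        # $0.048625, of which $0.02 is output alone

# The same 49,700 input tokens at full price, with no cache:
no_cache = 49_700 / 1e6 * BASE_INPUT + usage["output_tokens"] / 1e6 * OUTPUT
print(f"${no_cache:.6f}")    # $0.268500
```

Caching cuts this turn's cost by roughly 80%, and the 800 output tokens, less than 2% of the tokens involved, account for over 40% of the cached-turn cost.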

6. Subagents: Token Optimization Strategy

The Architecture

In Claude Code, subagents run in isolated context windows with their own model choice.

Direct (main conversation):
┌────────────────────────────────────────┐
│ System prompt + history (50K tokens)   │
│ + tool result (1K tokens)              │
│ All processed by Opus ($$$)            │
│ Result stays in context permanently    │
└────────────────────────────────────────┘

Subagent:
┌──────────────────┐    ┌─────────────────────┐
│ Main (Opus)      │    │ Agent (Sonnet)       │
│ 50K history      │───►│ 100 token prompt     │
│ + 300 summary    │◄───│ + 1K tool result     │
│                  │    │ Isolated, discarded  │
└──────────────────┘    └─────────────────────┘

Source: Claude Code Subagents docs: “Each subagent runs in its own context window with a custom system prompt, specific tool access, and independent permissions.”

Limitation: Delegation Still Costs

The main model processes the full conversation history even just to decide to delegate. There is no lightweight router that intercepts before the main model.

You: "use web-search agent to find X"
  → Opus reads 50K history to decide "delegate to agent" (~50 output tokens)
  → This delegation turn is unavoidable overhead

Where Subagents Actually Save

| Savings Source | Why |
|---|---|
| Cheaper model for work | Sonnet ($3/1M input, $15/1M output) vs Opus ($5/1M input, $25/1M output) |
| Smaller context addition | 300 token summary vs 1K full result added to main context |
| Context window preservation | Delays auto-compaction, which loses information when it summarizes |
| Output token savings | Sonnet output $15/1M vs Opus $25/1M |

Note: With caching, input token savings from agents are minimal (cached tokens are cheap anyway). The primary savings come from output tokens (always full price) and context window space.

When to Use Each Approach

| Scenario | Better Approach | Why |
|---|---|---|
| Single search, short conversation | Direct | Agent overhead not worth it |
| Multiple searches, long conversation | Agent | Compounding context savings |
| Heavy output generation (analysis, code) | Agent | Output tokens cheaper on Sonnet |
| Need to ask follow-ups about result | Direct | Agent discards its context |
| Verbose operations (test runs, log analysis) | Agent | Keeps noise out of main context |

Source: Anthropic docs recommend subagents for “isolating high-volume operations” — Claude Code Subagents: Common patterns.

Compounding Effect Over Multiple Searches

With caching, per-turn input cost difference is small. The real savings compound from output tokens and context window space:

5 web searches in a 20-turn session:

Direct:
  - 5K tokens added to context permanently
  - All search work output generated by Opus ($25/1M output)
  - Context fills up faster → earlier compaction → information loss

Agent:
  - 1.5K tokens added to context (summaries only)
  - Search work output generated by Sonnet ($15/1M output)
  - 3.5K context space preserved → delays compaction
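The output-token side of that comparison can be put in rough numbers. The per-search output size below is an assumption for illustration, not a measured figure:

```python
OPUS_OUTPUT, SONNET_OUTPUT = 25.0, 15.0   # $ per 1M output tokens
searches = 5
output_tokens_per_search = 1_000          # assumed output size per search

direct = searches * output_tokens_per_search / 1e6 * OPUS_OUTPUT
agent  = searches * output_tokens_per_search / 1e6 * SONNET_OUTPUT
print(f"direct: ${direct:.4f}  agent: ${agent:.4f}")
```

The absolute dollar amounts are tiny at this scale ($0.125 vs $0.075 here); the real win is the 40% output discount compounding over long sessions, plus the 3.5K tokens of context space that never get burned.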

7. What Invalidates the Cache

Changes cascade down the hierarchy. Modifying something invalidates that level and all subsequent levels:

| What Changes | tools | system | messages |
|---|---|---|---|
| Tool definitions | ✘ | ✘ | ✘ |
| System prompt | ✓ | ✘ | ✘ |
| Tool choice parameter | ✓ | ✓ | ✘ |
| Images added/removed | ✓ | ✓ | ✘ |
| Thinking parameters | ✓ | ✓ | ✘ |

(✓ = still cached, ✘ = invalidated)

Source: Anthropic Prompt Caching docs — “What Invalidates the Cache” section.

8. Summary

| Concept | Key Takeaway |
|---|---|
| Stateless | Full conversation re-sent every turn; no built-in memory |
| Prefix caching | 90% cheaper for repeated prefixes; GPU skips cached tokens |
| Cache lifetime | 5 min TTL (default), 1 hour (extended); active chat stays cached |
| Minimum cache size | 1,024-4,096 tokens depending on model |
| Biggest cost | Output tokens (always full price, no caching possible) |
| Subagents | Save on output tokens + context space; input savings minimal due to caching |
| When to use agents | Long sessions, multiple tool calls, heavy output generation, noisy operations |

Last verified: March 2026. Pricing and features may change — always check the official Anthropic documentation for current information.


Questions or feedback? Reach out on LinkedIn