
LLM Token Economics

Eric Lam
March 12, 2026 · 10 min read

1. LLMs Are Stateless

LLMs have no memory between API calls. Every turn is a fresh API call with the full conversation passed in.

Turn 1:
  API call → [system prompt, user msg 1]
  Response → assistant msg 1

Turn 2:
  API call → [system prompt, user msg 1, assistant msg 1, user msg 2]
  Response → assistant msg 2

Turn 3:
  API call → [system prompt, user msg 1, assistant msg 1, user msg 2, assistant msg 2, user msg 3]
  Response → assistant msg 3

The model re-reads the entire conversation every turn. This is true for ChatGPT, Claude, Claude Code, and every LLM-based chat product. The “memory” you experience is an illusion — the application layer replays the full history each time.
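The replay loop can be sketched in a few lines of Python. `send` here is a hypothetical stand-in for a real API client, not an actual SDK call:

```python
# Illustrative sketch of the application layer replaying the full
# history on every turn. `send` is a made-up placeholder, not a real SDK.
def send(messages):
    """Pretend API call: returns a canned assistant reply."""
    return f"reply to: {messages[-1]['content']}"

history = [{"role": "system", "content": "You are a helpful assistant."}]

for turn, user_msg in enumerate(["hi", "and then?", "thanks"], start=1):
    history.append({"role": "user", "content": user_msg})
    sent = len(history)            # the FULL history goes over the wire
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    print(f"turn {turn}: sent {sent} messages")
```

Each turn sends 2, then 4, then 6 messages: the payload grows every turn even though only the last message is new.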

2. What Happens When Context Fills Up

Every model has a maximum context window (e.g., Claude Opus: 200K tokens). When the conversation approaches this limit, different products handle it differently:

| Product | Strategy |
|---|---|
| ChatGPT | Silently truncates older messages |
| Claude.ai | Warns you, starts new conversation |
| Claude Code | Auto-compacts: summarizes older messages, keeps recent ones |

Source: Claude Code auto-compaction — triggers at ~95% capacity by default, configurable via CLAUDE_AUTOCOMPACT_PCT_OVERRIDE.

3. How the LLM Processes Tokens Internally

An LLM doesn’t just “read” tokens. It performs expensive matrix math on every token through dozens of transformer layers. Each token produces a Key-Value (KV) pair at every layer — the model’s internal representation of that token in context.

Input: "The cat sat on the mat"

Token 1 "The"  → N layers of matrix math → KV pair
Token 2 "cat"  → N layers of matrix math → KV pair
Token 3 "sat"  → N layers of matrix math → KV pair
Token 4 "on"   → N layers of matrix math → KV pair
Token 5 "the"  → N layers of matrix math → KV pair
Token 6 "mat"  → N layers of matrix math → KV pair

These KV pairs are what the model uses to generate the next token. Computing them is the most GPU-intensive part of processing input.

Note: The exact number of layers varies by model and is not publicly disclosed for Claude. The principle is the same regardless of layer count.
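To get a feel for why this matters, here is a back-of-envelope KV cache size calculation. All dimensions are assumed values for illustration only; Claude's real layer count, head count, and head size are not public:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Every dimension below is ASSUMED, not a real Claude value.
layers, kv_heads, head_dim = 48, 8, 128
bytes_per_value = 2                      # fp16/bf16

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)                # 196608 bytes = 192 KiB per token

context_tokens = 200_000
gib = kv_bytes_per_token * context_tokens / 2**30
print(round(gib, 2))                     # ~36.62 GiB for a full context window
```

At these (hypothetical) dimensions, a full 200K-token context carries tens of gigabytes of KV state, which is why recomputing or storing it is the central cost question.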

4. Prompt Caching

The Problem: Redundant Computation

Without caching, every turn recomputes KV pairs for the entire conversation — even though most tokens are identical to the previous turn.

Turn 1: [system prompt (1000 tokens) + user msg (50 tokens)]
  → Compute KV pairs for all 1,050 tokens

Turn 2: [system prompt (1000) + user msg (50) + asst msg (200) + user msg 2 (50)]
  → Compute KV pairs for all 1,300 tokens AGAIN
  → The first 1,050 tokens produce the EXACT SAME KV pairs
  → Wasted GPU time

The Solution: Cache the KV Pairs

Turn 1: [system prompt (1000) + user msg (50)]
  → Compute KV pairs for 1,050 tokens
  → SAVE KV pairs to fast storage ← the cache

Turn 2: [system prompt (1000) + user msg (50) + asst msg (200) + user msg 2 (50)]
  → First 1,050 tokens? LOAD KV pairs from storage (no GPU math)
  → Only compute KV pairs for the 250 new tokens
  → GPU did 250 tokens of work instead of 1,300

Source: Anthropic docs confirm: “Prompt caching optimizes API usage by allowing you to resume from specific prefixes in your prompts. The system stores KV cache representations and cryptographic hashes, but not raw text of prompts or responses.”

Key Rule: Prefix Matching Only

Cache matches from the start of the token sequence. One change in the middle breaks the cache for everything after it.

Cache works:
  [A B C D E]       ← Turn 1 (cached)
  [A B C D E F G]   ← Turn 2 (A-E cache hit, F-G new)

Cache BREAKS:
  [A B C D E]       ← Turn 1 (cached)
  [A B X D E F G]   ← Turn 2 (A-B cache hit, X breaks match, D-G all recomputed)
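The prefix-match rule above can be sketched as a tiny function, treating tokens as list items:

```python
def cached_prefix_len(prev, new):
    """Number of leading tokens shared with the previous request;
    only these can be served from the KV cache."""
    n = 0
    for a, b in zip(prev, new):
        if a != b:
            break
        n += 1
    return n

turn1 = list("ABCDE")
print(cached_prefix_len(turn1, list("ABCDEFG")))  # 5: A-E all hit the cache
print(cached_prefix_len(turn1, list("ABXDEFG")))  # 2: X breaks the match at C
```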

The cache follows a hierarchical order: tools → system → messages. This is why system prompts and early conversation turns almost always hit cache.

Source: Anthropic docs: “Cache keys are cumulative — each block’s hash depends on all previous blocks.”

What Gets Sent vs What Gets Computed

All tokens are still sent to the API (the server needs them to verify the prefix match). But the server skips the GPU computation for cached tokens and loads pre-computed results instead.

What you send over the network:  ALL tokens (same data size)
What the GPU computes:           ONLY new tokens (much less work)

Analogy: Re-taking a math exam where questions 1-10 are the same as last time. Without cache, you re-solve all questions. With cache, you load your saved answers for 1-10 and only solve question 11.

Cache Lifetime (TTL)

| TTL Option | Duration | Write Cost | When to Use |
|---|---|---|---|
| Default | 5 minutes | 1.25x base input | Active conversations (refreshed on each use) |
| Extended | 1 hour | 2x base input | Infrequent requests, long agentic tasks |

Source: Anthropic docs: “Refreshed at no additional cost when used within 5 minutes.”

Minimum Token Requirements for Caching

Not all content can be cached — there’s a minimum size requirement:

| Model | Minimum Cached Tokens |
|---|---|
| Claude Opus 4.6, 4.5 / Haiku 4.5 | 4,096 |
| Claude Sonnet 4.6 | 2,048 |
| Claude Sonnet 4.5, 4.1, 4, 3.7 | 1,024 |
| Claude Haiku 3.5, 3 | 2,048 |

Source: Anthropic Prompt Caching docs

Anthropic vs OpenAI Caching Comparison

Both providers use the same core concept (prefix-based KV cache), but differ in implementation:

| Feature | Anthropic (Claude) | OpenAI (GPT) |
|---|---|---|
| Activation | Opt-in via cache_control field, or automatic mode | Fully automatic, no code changes |
| Minimum tokens | 1,024-4,096 (varies by model) | 1,024 |
| Cache read discount | 90% off input price (0.1x) | Up to 90% off input + up to 80% latency reduction |
| Cache write cost | 25% surcharge (1.25x) | No surcharge |
| Default TTL | 5 minutes | 5-10 minutes (in-memory) |
| Extended TTL | 1 hour (at 2x input price) | 24 hours (GPU-local storage, same price) |
| Scope | Workspace-level isolation | Organization-level |
| Breakpoints | Up to 4 explicit breakpoints | Automatic + prompt_cache_key for routing hints |
| Monitoring | cache_creation_input_tokens + cache_read_input_tokens | cached_tokens in prompt_tokens_details |
| Cache routing | Prefix-based hash | Hash of first ~256 tokens + optional prompt_cache_key |
| Rate limit | Cache hits don't count against limits | Cache hits still count against TPM limits |

Key takeaway: Both offer up to 90% input cost reduction. Anthropic charges a 25% surcharge on cache writes but gives explicit control via breakpoints. OpenAI is simpler (automatic, no write surcharge) and offers 24-hour extended retention on newer models (GPT-5 series). OpenAI also provides a prompt_cache_key parameter to influence cache routing when many requests share prefixes.
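As a concrete example of Anthropic's opt-in breakpoints, a Messages API request body can mark the system prompt as cacheable with `cache_control`. This is a sketch of the request shape per the prompt-caching docs; the model id and prompt text are placeholders:

```python
# Request body with an explicit cache breakpoint on the system prompt.
# Everything up to and including the marked block becomes the cached prefix.
payload = {
    "model": "claude-sonnet-4-5",          # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a helpful assistant. <long, stable instructions>",
            "cache_control": {"type": "ephemeral"},   # cache ends here
        }
    ],
    "messages": [{"role": "user", "content": "Hello"}],
}
print(payload["system"][0]["cache_control"])
```

Putting the breakpoint after the large, stable part of the prompt (system instructions, tool definitions) and keeping the volatile part (the latest user message) after it is what makes the prefix reusable across turns.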


5. Token Pricing

Full Pricing Table (per 1M tokens)

| Model | Base Input | 5m Cache Write | 1h Cache Write | Cache Read | Output |
|---|---|---|---|---|---|
| Opus 4.6 / 4.5 | $5 | $6.25 | $10 | $0.50 | $25 |
| Opus 4.1 / 4 | $15 | $18.75 | $30 | $1.50 | $75 |
| Sonnet 4.6 / 4.5 / 4 / 3.7 | $3 | $3.75 | $6 | $0.30 | $15 |
| Haiku 4.5 | $1 | $1.25 | $2 | $0.10 | $5 |
| Haiku 3.5 | $0.80 | $1 | $1.60 | $0.08 | $4 |

Pricing multipliers (same across all models):

- 5-minute cache write: 1.25x base input
- 1-hour cache write: 2x base input
- Cache read: 0.1x base input
- Output: 5x base input

Source: Anthropic Prompt Caching docs — pricing table verified March 2026.

Cost Example: 50K Token Conversation

Using Claude Opus 4.6 ($5/1M input, $0.50/1M cache read):

What a Typical Turn Looks Like

[system prompt ~~~~~~~~]  → cache READ   ($0.50/1M)
[CLAUDE.md ~~~~~~~~~~~~]  → cache READ   ($0.50/1M)
[msg 1, response 1 ~~~~]  → cache READ   ($0.50/1M)
[msg 2, response 2 ~~~~]  → cache READ   ($0.50/1M)
...
[previous response ~~~~~]  → cache WRITE  ($6.25/1M)   ← new since last turn
[your new message ~~~~~~]  → full INPUT   ($5/1M)      ← brand new
[model's response ~~~~~~]  → OUTPUT       ($25/1M)     ← always most expensive

90%+ of input hits cache. Output tokens are always full price and typically the biggest cost driver.

Tracking Cache Performance

The API response includes cache metrics:

{
  "usage": {
    "cache_creation_input_tokens": 500,
    "cache_read_input_tokens": 49000,
    "input_tokens": 200,
    "output_tokens": 800
  }
}

Total input = cache_read + cache_creation + input (49,000 + 500 + 200 = 49,700)

Source: Anthropic docs: “input_tokens represents only tokens after the last cache breakpoint, not all input tokens sent.”
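Plugging those usage numbers into the Opus 4.6 prices from the table above shows where the money goes. This is a rough sketch; real bills depend on exact token counts:

```python
# Per-1M-token prices for Opus 4.6, from the pricing table above.
CACHE_READ, CACHE_WRITE_5M, BASE_INPUT, OUTPUT = 0.50, 6.25, 5.00, 25.00

usage = {
    "cache_read_input_tokens": 49_000,
    "cache_creation_input_tokens": 500,
    "input_tokens": 200,
    "output_tokens": 800,
}

cost = (usage["cache_read_input_tokens"]     / 1e6 * CACHE_READ
      + usage["cache_creation_input_tokens"] / 1e6 * CACHE_WRITE_5M
      + usage["input_tokens"]                / 1e6 * BASE_INPUT
      + usage["output_tokens"]               / 1e6 * OUTPUT)
print(f"${cost:.6f}")        # $0.048625, of which $0.02 is output alone

# The same 49,700 input tokens at full price, with no cache:
no_cache = 49_700 / 1e6 * BASE_INPUT + usage["output_tokens"] / 1e6 * OUTPUT
print(f"${no_cache:.6f}")    # $0.268500
```

Caching cuts this turn's cost by roughly 80%, and the 800 output tokens, less than 2% of the tokens involved, account for over 40% of the cached-turn cost.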

6. Subagents: Token Optimization Strategy

The Architecture

In Claude Code, subagents run in isolated context windows with their own model choice.

Direct (main conversation):
┌────────────────────────────────────────┐
│ System prompt + history (50K tokens)   │
│ + tool result (1K tokens)              │
│ All processed by Opus ($$$)            │
│ Result stays in context permanently    │
└────────────────────────────────────────┘

Subagent:
┌──────────────────┐    ┌─────────────────────┐
│ Main (Opus)      │    │ Agent (Sonnet)       │
│ 50K history      │───►│ 100 token prompt     │
│ + 300 summary    │◄───│ + 1K tool result     │
│                  │    │ Isolated, discarded  │
└──────────────────┘    └─────────────────────┘

Source: Claude Code Subagents docs: “Each subagent runs in its own context window with a custom system prompt, specific tool access, and independent permissions.”

Limitation: Delegation Still Costs

The main model processes the full conversation history even just to decide to delegate. There is no lightweight router that intercepts before the main model.

You: "use web-search agent to find X"
  → Opus reads 50K history to decide "delegate to agent" (~50 output tokens)
  → This delegation turn is unavoidable overhead

Where Subagents Actually Save

| Savings Source | Why |
|---|---|
| Cheaper model for work | Sonnet ($3/1M input, $15/1M output) vs Opus ($5/1M input, $25/1M output) |
| Smaller context addition | 300 token summary vs 1K full result added to main context |
| Context window preservation | Delays auto-compaction, which loses information when it summarizes |
| Output token savings | Sonnet output $15/1M vs Opus $25/1M |

Note: With caching, input token savings from agents are minimal (cached tokens are cheap anyway). The primary savings come from output tokens (always full price) and context window space.

When to Use Each Approach

| Scenario | Better Approach | Why |
|---|---|---|
| Single search, short conversation | Direct | Agent overhead not worth it |
| Multiple searches, long conversation | Agent | Compounding context savings |
| Heavy output generation (analysis, code) | Agent | Output tokens cheaper on Sonnet |
| Need to ask follow-ups about result | Direct | Agent discards its context |
| Verbose operations (test runs, log analysis) | Agent | Keeps noise out of main context |

Source: Anthropic docs recommend subagents for “isolating high-volume operations” — Claude Code Subagents: Common patterns.

Compounding Effect Over Multiple Searches

With caching, per-turn input cost difference is small. The real savings compound from output tokens and context window space:

5 web searches in a 20-turn session:

Direct:
  - 5K tokens added to context permanently
  - All search work output generated by Opus ($25/1M output)
  - Context fills up faster → earlier compaction → information loss

Agent:
  - 1.5K tokens added to context (summaries only)
  - Search work output generated by Sonnet ($15/1M output)
  - 3.5K context space preserved → delays compaction
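The output-token side of that comparison can be put in rough numbers. The per-search output size below is an assumption for illustration, not a measured figure:

```python
OPUS_OUTPUT, SONNET_OUTPUT = 25.0, 15.0   # $ per 1M output tokens
searches = 5
output_tokens_per_search = 1_000          # assumed output size per search

direct = searches * output_tokens_per_search / 1e6 * OPUS_OUTPUT
agent  = searches * output_tokens_per_search / 1e6 * SONNET_OUTPUT
print(f"direct: ${direct:.4f}  agent: ${agent:.4f}")
```

The absolute dollar amounts are tiny at this scale ($0.125 vs $0.075 here); the real win is the 40% output discount compounding over long sessions, plus the 3.5K tokens of context space that never get burned.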

7. What Invalidates the Cache

Changes cascade down the hierarchy. Modifying something invalidates that level and all subsequent levels:

| What Changes | tools | system | messages |
|---|---|---|---|
| Tool definitions | ✘ | ✘ | ✘ |
| System prompt | ✓ | ✘ | ✘ |
| Tool choice parameter | ✓ | ✓ | ✘ |
| Images added/removed | ✓ | ✓ | ✘ |
| Thinking parameters | ✓ | ✓ | ✘ |

(✓ = still cached, ✘ = invalidated)

Source: Anthropic Prompt Caching docs — “What Invalidates the Cache” section.

8. Summary

| Concept | Key Takeaway |
|---|---|
| Stateless | Full conversation re-sent every turn; no built-in memory |
| Prefix caching | 90% cheaper for repeated prefixes; GPU skips cached tokens |
| Cache lifetime | 5 min TTL (default), 1 hour (extended); active chat stays cached |
| Minimum cache size | 1,024-4,096 tokens depending on model |
| Biggest cost | Output tokens (always full price, no caching possible) |
| Subagents | Save on output tokens + context space; input savings minimal due to caching |
| When to use agents | Long sessions, multiple tool calls, heavy output generation, noisy operations |

Last verified: March 2026. Pricing and features may change — always check the official Anthropic documentation for current information.


Questions or feedback? Reach out on LinkedIn