Personal blog. Opinions are my own. Always refer to official documentation.

Skills vs MCP Servers: The Hidden Token Cost of Claude Code Extensions

Eric Lam
March 11, 2026 · 9 min read

1. The Hidden Cost of Extensions

Claude Code’s context window is finite — 200K tokens for Opus. Every turn, the entire conversation is re-sent to the model (see LLM Token Economics for background). That means anything loaded into context costs tokens on every single request, not just the first one.

Skills and MCP servers are the two main ways to extend Claude Code. Both add capabilities, but they load into context very differently — and that difference has a real impact on your token bill and how quickly you hit the context limit.

2. How MCP Servers Load Into Context

MCP (Model Context Protocol) servers connect Claude Code to external tools — databases, GitHub, Slack, monitoring dashboards, filesystem utilities. Each MCP server exposes tool definitions: JSON schemas describing what each tool does, its parameters, and return types.

Default Behavior: Loaded at Session Start

When you start a Claude Code session, all tool definitions from all connected MCP servers are loaded into context. They stay there for every request in the session, whether you use them or not.

Session with 3 MCP servers (GitHub, Slack, Filesystem):

Request 1: [system prompt] + [MCP tools: ~3,000 tokens] + [your message]
Request 2: [system prompt] + [MCP tools: ~3,000 tokens] + [conversation history] + [your message]
Request 3: [system prompt] + [MCP tools: ~3,000 tokens] + [conversation history] + [your message]
...
Request 50: [system prompt] + [MCP tools: ~3,000 tokens] + [conversation history] + [your message]

Those ~3,000 tokens are present in EVERY request.
Over 50 requests, that's ~150,000 tokens of overhead — just for tool definitions.

You can see the actual cost with the /mcp command, which shows token overhead per connected server.

Think of it like carrying every tool in the toolbox to every room — even if you only need a screwdriver.

Claude Code has a built-in optimization called Tool Search. As of March 2026, Tool Search is enabled by default — tool definitions are deferred and discovered on-demand rather than loaded upfront. Claude uses a search mechanism to find and load only the tools it needs for each request.

Without Tool Search:
  All tool definitions loaded → every request pays the cost

With Tool Search (default):
  Tool definitions deferred → Claude searches for tools when needed
  Only discovered tools loaded into context
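The deferral idea can be sketched as a toy simulation. Everything here — the tool names, token counts, and the keyword matcher — is invented for illustration; Claude Code's actual search mechanism is internal and not exposed as an API:

```python
# Toy model of Tool Search: schemas are deferred, and only tools whose
# descriptions match the request get loaded into context. All names and
# the keyword matcher are illustrative, not Claude Code's implementation.

TOOLS = {
    "github_create_pr": {"description": "create a pull request on GitHub", "schema_tokens": 600},
    "slack_post":       {"description": "post a message to a Slack channel", "schema_tokens": 600},
    "fs_read":          {"description": "read a file from the filesystem", "schema_tokens": 600},
}

def context_cost(request: str, tool_search: bool) -> int:
    """Tokens of tool definitions present in context for this request."""
    if not tool_search:
        # Everything loaded upfront, used or not.
        return sum(t["schema_tokens"] for t in TOOLS.values())
    # Deferred: load only tools whose description shares a keyword
    # with the request (short stopwords ignored).
    keywords = {w for w in request.lower().split() if len(w) > 3}
    return sum(
        t["schema_tokens"]
        for t in TOOLS.values()
        if keywords & {w for w in t["description"].lower().split() if len(w) > 3}
    )

print(context_cost("please create a pull request", tool_search=False))  # 1800
print(context_cost("please create a pull request", tool_search=True))   # 600
```

The crude keyword match stands in for whatever discovery Claude actually performs; the point is only that deferred loading makes context cost a function of the request, not of how many servers are connected.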

You can configure this behavior:

# Auto mode: only defer when tools exceed 5% of context
ENABLE_TOOL_SEARCH=auto:5 claude

# Force tool search on — always defer (current default)
ENABLE_TOOL_SEARCH=true claude

# Force tool search off — always load all upfront
ENABLE_TOOL_SEARCH=false claude

Source: Claude Code MCP docs: “Claude Code automatically enables Tool Search when your MCP tool descriptions would consume more than 10% of the context window.” Note: the default is now true (always enabled), with auto:N available for threshold-based triggering.

3. How Skills Load Into Context

Skills are markdown-based extensions that teach Claude domain knowledge, provide reference material, or define reusable workflows. They live in .claude/skills/<skill-name>/SKILL.md files.

Default Behavior: Lazy Loading

Skills use a fundamentally different loading strategy. Only skill descriptions (a few sentences each) load at session start. The full content loads only when the skill is actually used.

Session with 3 skills (deploy, review, conventions):

Request 1: [system prompt] + [skill descriptions: ~300 tokens] + [your message]
   → Claude reads descriptions, decides none are needed
Request 2: [system prompt] + [skill descriptions: ~300 tokens] + [conversation] + [your message]
   → Claude decides to use "conventions" skill
   → Full skill content (~1,500 tokens) loads for THIS request only
Request 3: [system prompt] + [skill descriptions: ~300 tokens] + [conversation] + [your message]
   → Back to just descriptions

Idle cost: ~300 tokens per request (descriptions only)
Active cost: ~1,800 tokens (descriptions + one full skill)

Think of it like a table of contents — you see the chapter titles on every page, but you only open the chapter you need.

Two Modes: Model-Invocable vs Manual-Only

Skills have a disable-model-invocation setting that controls context behavior:

| Setting | Claude can invoke? | You can invoke? | Context behavior |
|---|---|---|---|
| false (default) | Yes | Yes (/&lt;name&gt;) | Description in context every request; full content loads when used |
| true | No | Yes (/&lt;name&gt;) | Nothing in context until you invoke manually |

Setting disable-model-invocation: true is powerful for workflows you only trigger yourself (like /deploy or /ai-pulse). The idle context cost drops to zero.

# .claude/skills/deploy/SKILL.md
---
name: deploy
description: Deploy the application to production
disable-model-invocation: true
---
# Deploy steps...

Source: Claude Code Skills docs: “Set to true to prevent Claude from automatically loading this skill. Use for workflows you only want to trigger manually.”
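The frontmatter is plain YAML between --- markers. A minimal parser sketch (not Claude Code's implementation — just "key: value" lines, no nesting) shows what gets extracted from a file like the one above:

```python
# Minimal SKILL.md frontmatter parser — a sketch, not Claude Code's
# implementation. Assumes flat "key: value" YAML between --- markers.

def parse_frontmatter(text: str) -> dict:
    lines = text.strip().splitlines()
    if not lines or lines[0] != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line == "---":
            break  # end of frontmatter
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

skill = """---
name: deploy
description: Deploy the application to production
disable-model-invocation: true
---
# Deploy steps...
"""

print(parse_frontmatter(skill)["name"])  # deploy
```

Note that values come back as strings here; a real YAML parser would give you a boolean for disable-model-invocation.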

Description Budget

Skill descriptions share a character budget that scales at 2% of the context window (fallback: 16,000 characters). If you have many skills, some descriptions may be excluded. Check with /context and override the limit if needed:

SLASH_COMMAND_TOOL_CHAR_BUDGET=32000 claude

Subagent Behavior

Skills behave differently in subagents. Instead of lazy loading, skills passed to a subagent are fully preloaded into its context at launch. They aren’t inherited from the parent session — you must list them explicitly.

Source: Claude Code Features Overview: “In subagents: Skills work differently in subagents. Instead of on-demand loading, skills passed to a subagent are fully preloaded into its context at launch.”

4. Side-by-Side Comparison

The official Claude Code features overview provides this comparison:

| Feature | When it loads | What loads | Context cost |
|---|---|---|---|
| CLAUDE.md | Session start | Full content | Every request |
| Skills | Session start + when used | Descriptions at start, full content when used | Low (descriptions every request) |
| MCP servers | Session start | All tool definitions and schemas | Every request |
| Subagents | When spawned | Fresh context with specified skills | Isolated from main session |
| Hooks | On trigger | Nothing (runs externally) | Zero |

Source: Claude Code Features Overview — context loading comparison table.

Scenario: 5 Extensions Over a 50-Turn Session

Let’s compare the approaches with 5 extensions, assuming each MCP server adds ~600 tokens of tool definitions and each skill has a ~100-token description and ~1,500 tokens of full content:

5 MCP servers (below Tool Search threshold):
  Idle overhead:  5 × 600 = 3,000 tokens per request
  Over 50 turns:  3,000 × 50 = 150,000 tokens of tool definitions
  Used or not:    Same cost either way

5 Skills (default, model-invocable):
  Idle overhead:  5 × 100 = 500 tokens per request (descriptions)
  Over 50 turns:  500 × 50 = 25,000 tokens of descriptions
  If 2 skills used once each: + 2 × 1,500 = 3,000 tokens (only on those turns)
  Total: ~28,000 tokens

5 Skills (manual-only, disable-model-invocation: true):
  Idle overhead:  0 tokens
  Over 50 turns:  0 tokens until invoked
  If 2 skills invoked once each: 2 × 1,500 = 3,000 tokens (only on those turns)
  Total: ~3,000 tokens

Context window usage over a session:

MCP servers:     [████████████████████░░░░░░░░░░] 150K tokens (tool defs alone)
Skills (auto):   [██░░░░░░░░░░░░░░░░░░░░░░░░░░░░]  28K tokens
Skills (manual): [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]   3K tokens

█ = token overhead from extensions
░ = available for conversation

The difference is dramatic. MCP servers consume 5-50x more context than equivalent skills, depending on usage patterns.
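The scenario arithmetic is easy to reproduce. All figures are the illustrative assumptions stated above, not measurements:

```python
# Reproduce the 50-turn scenario arithmetic. All numbers are the
# illustrative assumptions from the scenario above, not measurements.

TURNS = 50

# 5 MCP servers, ~600 tokens of tool definitions each, on every request
mcp_total = 5 * 600 * TURNS

# 5 model-invocable skills: ~100-token descriptions every request,
# plus ~1,500 tokens of content on the 2 turns where a skill is used
skills_auto_total = 5 * 100 * TURNS + 2 * 1500

# 5 manual-only skills: zero idle cost, content only when invoked
skills_manual_total = 2 * 1500

print(mcp_total)             # 150000
print(skills_auto_total)     # 28000
print(skills_manual_total)   # 3000
```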

5. When to Use Each

| Scenario | Use | Why |
|---|---|---|
| External API integration (GitHub, Jira, databases) | MCP server | Skills can’t make external API calls; MCP servers provide real tool execution |
| Coding conventions / style guide | Skill | Reference material, only needed when writing code in that style |
| Deployment workflow (/deploy) | Skill (manual-only) | Triggered explicitly, zero idle cost |
| File system operations beyond CWD | MCP server | Built-in tools are scoped to CWD; MCP can extend reach |
| Code review checklist | Skill | On-demand reference, not needed every turn |
| Database queries during debugging | MCP server | Needs live connection; frequent use justifies the overhead |
| Large prompt template / boilerplate | Skill (manual-only) | Only loaded when you need it |

Key insight: If the same capability exists as both a CLI tool and an MCP server, prefer the CLI tool. Claude can run gh, aws, gcloud, and sentry-cli directly via Bash without any persistent context overhead.

Source: Claude Code Costs docs: “Prefer CLI tools when available: Tools like gh, aws, gcloud, and sentry-cli are more context-efficient than MCP servers because they don’t add persistent tool definitions.”

6. Best Practices for Token-Efficient Extensions

Audit Your MCP Servers

Run /mcp in any session to see token costs per server. Disconnect servers you aren’t actively using. Each idle server silently consumes tokens on every request.

Prefer CLI Tools Over MCP

If a CLI exists for the service, use it. gh pr list costs zero idle tokens. A GitHub MCP server’s tool definitions cost tokens on every request, whether you create a PR or not.

Use Manual-Only Skills for Workflows

Set disable-model-invocation: true on skills you only trigger yourself. This eliminates all idle context cost — the skill loads only when you type /<name>.

Move CLAUDE.md Overflow to Skills

CLAUDE.md loads in full on every request. The official guidance recommends keeping it under ~500 lines (per the costs and features-overview docs). Files over 200 lines may start to reduce adherence to instructions. If yours is larger, move specialized sections (coding standards for a specific language, deployment procedures, review checklists) into skills. They’ll load only when relevant.

Source: Claude Code Costs docs: “Aim to keep CLAUDE.md under ~500 lines by including only essentials.” Memory docs: “Files over 200 lines consume more context and may reduce adherence.”
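A quick way to audit your own CLAUDE.md against these thresholds — the ~500 and ~200 line figures are from the docs; the heading-based section listing is just a convenience sketch, not an official tool:

```python
# Sketch: flag a CLAUDE.md that exceeds the documented guidance
# (~500 lines recommended max; adherence may degrade past ~200).
# Heading-based section splitting is a convenience, not an official tool.

def audit_claude_md(text: str) -> dict:
    lines = text.splitlines()
    sections = [l.lstrip("# ").strip() for l in lines if l.startswith("## ")]
    return {
        "lines": len(lines),
        "over_500": len(lines) > 500,
        "adherence_risk": len(lines) > 200,
        "candidate_skill_sections": sections,  # headings you might move to skills
    }

sample = "# Project\n" + "## Deployment\n" + "step\n" * 300
report = audit_claude_md(sample)
print(report["lines"], report["adherence_risk"])  # 302 True
```

Each heading the audit surfaces is a candidate for its own .claude/skills/&lt;name&gt;/SKILL.md.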

Tune Tool Search If Needed

Tool Search is enabled by default. If you want threshold-based triggering instead (deferring only when tool definitions are large), use the auto mode:

ENABLE_TOOL_SEARCH=auto:5 claude

This triggers deferred loading only when tool definitions exceed 5% of the context window. The default (true) always defers regardless of size.

Monitor Context Usage

Run /context to see a usage grid of what’s occupying the window, and /cost to see cumulative token statistics for the session. Check them periodically during long sessions, not just when responses slow down.

Manage MCP Output Size

Large MCP tool outputs can flood your context. Claude Code warns at 10,000 tokens per output and has a hard limit of 25,000 tokens (configurable):

# Increase if your MCP tools legitimately return large payloads
MAX_MCP_OUTPUT_TOKENS=50000 claude
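Claude Code enforces this limit internally; the idea can be sketched with a rough chars-per-token heuristic. The ~4 characters per token ratio is a common approximation, not a real tokenizer:

```python
# Sketch of capping a tool output at a token limit. Claude Code does
# this internally; this toy version uses a rough ~4 chars/token
# heuristic instead of a real tokenizer.

CHARS_PER_TOKEN = 4  # rough approximation

def cap_output(text: str, max_tokens: int = 25_000, warn_tokens: int = 10_000) -> str:
    est_tokens = len(text) // CHARS_PER_TOKEN
    if est_tokens > warn_tokens:
        print(f"warning: tool output is ~{est_tokens} tokens")
    if est_tokens > max_tokens:
        return text[: max_tokens * CHARS_PER_TOKEN] + "\n[output truncated]"
    return text

big = "x" * 200_000           # ~50,000 tokens of output
print(len(cap_output(big)))   # capped near 100,000 chars (~25,000 tokens)
```

The practical takeaway: before raising MAX_MCP_OUTPUT_TOKENS, ask whether the tool can filter or paginate server-side — a bigger cap just moves the flood into your context window.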

7. Summary

| Concept | Key Takeaway |
|---|---|
| MCP servers | Tool definitions load at session start, present every request — persistent overhead |
| Tool Search | Enabled by default — defers MCP tools and discovers on-demand (configurable via ENABLE_TOOL_SEARCH) |
| Skills (default) | Descriptions load at start (low cost), full content only when used |
| Skills (manual-only) | Zero context cost until you invoke with /&lt;name&gt; |
| Subagent skills | Fully preloaded at launch, not lazy-loaded — explicit opt-in only |
| CLI vs MCP | CLI tools have zero idle overhead — prefer when available |
| CLAUDE.md | Loads fully every request — keep under ~500 lines, overflow to skills |
| Monitoring | /mcp for server costs, /context for usage grid, /cost for token stats |

The core principle: load what you need, when you need it. Skills do this by default. MCP servers don’t — but Tool Search helps at scale. Choose the right extension type for the job, audit your context regularly, and your token budget (and context window) will go much further.


Last verified: March 2026. Claude Code features evolve rapidly — always check the official documentation for current behavior.
