
Why Caveman Prompts Don't Actually Save Tokens in Claude Code

A prompting trick has been making the rounds lately: tell Claude Code to "speak like a caveman," strip out greetings and hedges, and watch your token bill shrink. The screenshots look convincing. Short replies, fewer words, supposedly a fraction of the cost.

I want to push back on this one. Not because the idea is ridiculous (it's a clever observation about model verbosity), but because the savings story depends on a measurement most people never check. Once you look at where Claude Code tokens actually go, the caveman prompt is optimizing the smallest slice of the pie while introducing risks to the part of the workflow you care about most: output quality.

Here's the case against it, and what to do instead.

Where Claude Code tokens actually go

The implicit claim behind caveman prompts is that assistant output dominates your bill. It doesn't, not even close.

A user on the Claude Code repo posted a 30-day breakdown of their own session logs. The numbers are striking:

Source                            Tokens    Share
I/O tokens (raw input + output)     3.9M    0.07%
Cache creation                    176.5M     3.3%
Cache reads                        5.09B    96.6%

For every I/O token, roughly 1,310 tokens are read from cache. Your assistant's replies are a sliver of that I/O line, which itself is a sliver of the total.

Why so lopsided? Claude Code resends the system prompt, tool definitions, CLAUDE.md, and conversation history on every turn. Most of that hits the cache, but the cache is not free: cached reads bill at 10% of input rate, and they multiply with every message. A complete extraction of the Claude Code system prompt clocks in at roughly 2,300 to 3,600 tokens for the prompt itself, plus 14,000 to 17,000 for tool definitions. A well-loved CLAUDE.md can add another 5,000 to 50,000 on top. A single Read call adds about 70% overhead just from the line-number prefix format.
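To make the multiplication concrete, here is a back-of-the-envelope sketch. The per-component token counts are the rough figures quoted above, not measurements, and it only models the fixed prefix; growing conversation history makes the real number worse:

```python
# Rough per-turn static prefix, using the approximate figures above.
system_prompt = 3_000      # ~2,300 to 3,600 tokens
tool_definitions = 15_000  # ~14,000 to 17,000 tokens
claude_md = 5_000          # low end of the 5K-50K range

prefix = system_prompt + tool_definitions + claude_md

turns = 50
cache_read_tokens = prefix * turns  # the prefix is re-read on every turn

# Cached reads bill at roughly 10% of the normal input rate, so this is
# the full-rate input-token equivalent you actually pay for.
billed_equivalent = cache_read_tokens * 0.10

print(prefix)             # 23000 tokens resent per turn
print(cache_read_tokens)  # 1150000 tokens read from cache over the session
print(billed_equivalent)  # equivalent of 115000 full-rate input tokens
```

Even at the discounted rate, a 50-turn session pays for the fixed prefix more than a hundred times over. That is the line item the caveman prompt never touches.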

Do the arithmetic on a realistic session. Call it 100K input tokens per turn against 5K to 15K of output. Output is 5% to 15% of the total, and that's before you count cache reads. Compressing the 15% slice by 80% yields a 12% total reduction at best. The "dramatic" savings people report in screenshots are usually output-only ratios presented as if they were total-cost ratios.
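The same ceiling falls out of a few lines of arithmetic. This sketch uses the session figures above and divides by input plus output, which makes it slightly stricter than the 15%/12% framing in the text:

```python
input_per_turn = 100_000   # tokens resent and processed per turn
output_per_turn = 15_000   # generous upper end of assistant output

total = input_per_turn + output_per_turn
output_share = output_per_turn / total  # output's slice of per-turn I/O

compression = 0.80  # suppose the caveman prompt shrinks output by 80%
saved = output_per_turn * compression
total_reduction = saved / total         # best-case total savings

print(round(output_share, 3))     # 0.13
print(round(total_reduction, 3))  # 0.104
```

A 10-12% ceiling before cache reads are counted, and well under 1% after, since output sits inside the 0.07% I/O slice of the 30-day breakdown.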

Persona prompts aren't free

The caveman framing is a persona, and personas have a track record in the literature that people promoting the trick rarely mention.

Zheng et al. (EMNLP Findings 2024) ran 162 roles across 9 models and 2,410 MMLU questions. Adding a persona to the system prompt did not improve accuracy over the no-persona baseline, and picking a "good" persona was statistically indistinguishable from random. Gupta et al. (ICLR 2024) found statistically significant degradation in 80% of personas across 24 reasoning datasets, with some tasks dropping over 70 points of accuracy. Even GPT-4 Turbo showed significant drops under 42% of persona assignments. Kim et al. (2024) documented a "Jekyll and Hyde" effect where role-play reliably confuses models on reasoning problems; their proposed fix (ensemble with and without the persona) is itself a tell that personas aren't safe to use alone.

More recent work on agentic tasks (the mode Claude Code actually runs in) reports up to 26% performance degradation from persona steering on multi-step planning, which is exactly what you're doing when you ask the model to debug across a repo.

The caveman prompt isn't just "be concise." It's a character. And that character comes with the baggage the research describes.

"Answer briefly" makes hallucinations worse

The Phare benchmark (Giskard, Google DeepMind, EU partners) found that instructions emphasizing brevity reduced factual reliability across most tested models, with up to a 20% drop in hallucination resistance in the worst cases.

The mechanism is intuitive once you see it. Correcting a wrong premise requires a longer answer than accepting it. If the system prompt forbids long answers, the model picks the path that fits the constraint: it accepts the premise and produces something short and plausible. If you ask about a "CORS error" that's actually a SameSite cookie issue, a brevity-first model is more likely to tell you to add Access-Control-Allow-Origin and call it done.

Caveman prompts typically pair brevity with explicit "no enumeration, no alternatives" rules. That's structurally the same pattern the Phare team flagged.

Reasoning length tracks accuracy

Anthropic's own Visible Extended Thinking post puts it plainly: accuracy on math problems improves logarithmically with the number of thinking tokens the model is allowed to sample. Claude Opus 4.7 makes adaptive thinking the default precisely because the model is better at choosing how long to deliberate than a human is at guessing.

Several other results point the same direction. Yeo et al. (2025) on long chain-of-thought SFT, "Long Is More Important Than Difficult" (2025), the ICLR 2024 theoretical result that T-step chain-of-thought lets a transformer solve any problem a size-T circuit can solve, and DeepMind's test-time compute scaling work showing an inference-time budget can outperform a 14x larger model. Yes, there's a counter-literature (When More is Less, Don't Overthink It) showing pathological over-reasoning. But those papers are about finding the right length, not about shorter being better.

Style rules like "no enumeration" and "no hedging" don't land in the thinking block directly, but they bias token selection during generation, which in turn constrains how much internal deliberation the model bothers with. You end up capping adaptive thinking through the back door.

Claude Code is already concise

This is the part that keeps getting skipped. Claude Code's default system prompt already contains:

You should be concise, direct, and to the point. You MUST answer concisely with fewer than 4 lines (not including tool use or code generation), unless user asks for detail. IMPORTANT: You should minimize output tokens as much as possible while maintaining helpfulness, quality, and accuracy.

And:

You MUST avoid text before/after your response, such as "The answer is...", "Here is the content of the file..."

The four-line cap is built in. Output token minimization is explicit. Opus 4.7 additionally calibrates response length to task complexity automatically.

If your Claude Code sessions feel verbose, the first question isn't "how do I muzzle the model?" It's "what in my CLAUDE.md or custom instructions is overriding the default?" Most verbose output I see in the wild traces back to a user-added rule asking for explanations, summaries, or progress narration.

What to actually do

If the goal is a smaller bill, target the categories that dominate the breakdown. The official cost docs and best practices guide cover all of these, and every one is a bigger lever than output compression.

  1. Push tool output into subagents. The single biggest lever. A subagent absorbs the noisy Read, Grep, and Bash results in its own context and returns a summary. Your main thread never sees the 50K tokens of file contents.
  2. Trim CLAUDE.md aggressively. Every line rides along on every turn via cache reads. Go through yours line by line and ask whether each one is actually earning its keep.
  3. Use .claudeignore and progressive disclosure in skills. Stop the model from wandering into node_modules or generated directories.
  4. Run /compact and /clear more often. Conversation history accumulates and keeps billing you through cache reads.
  5. Switch models by task. Haiku runs at roughly one-fifth the Opus rate. /model haiku for mechanical edits, reserve Opus for design and debugging.
  6. Tune effort. effort: low or medium cuts thinking cost on tasks that genuinely don't need deep reasoning. Don't apply it globally; apply it where you know the answer is shallow.
  7. Lean on prompt caching. Static context wrapped in cache_control reads at 10% of input rate. Structure long-lived context so it sits in the cached prefix.
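As a sketch of item 7: this is roughly how the Anthropic Messages API marks a static prefix as cacheable via cache_control. The model name and context text are placeholders, and note that Claude Code manages caching for you, so this matters mainly if you build custom agents on the SDK:

```python
# Hypothetical request payload: static, long-lived context goes first and is
# marked with cache_control so subsequent calls read it at the cached rate.
static_context = "Project conventions, API docs, style guide..."  # placeholder

request = {
    "model": "claude-opus-4",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": static_context,
            # Everything up to and including this block becomes the cached
            # prefix; later calls with the same prefix read it at ~10% of
            # the input rate instead of reprocessing it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Refactor the auth module."}],
}

# With the real SDK you would pass these fields to
# anthropic.Anthropic().messages.create(**request).
cached_blocks = [b for b in request["system"] if "cache_control" in b]
print(len(cached_blocks))  # 1
```

The design point is ordering: anything that changes turn to turn (the latest user message, fresh tool output) must come after the cached prefix, or it invalidates the cache on every call.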

Every one of these reduces the tokens Claude Code actually processes, rather than trying to shave the smallest line item on the invoice.

The bottom line

Caveman prompts aren't harmless brevity hacks. They're persona assignments plus concision rules plus anti-enumeration rules, stacked on top of a system prompt that already mandates four-line answers. The research says each ingredient has a cost. The token accounting says the ceiling on savings is in the low double digits at best. The cache is where the money goes.

If you're paying real money for Claude Code and want to pay less, stop trying to shrink the model's replies. Shrink what you're sending it, and let subagents absorb the rest.
