Lessons from Building Claude Code: Prompt Caching Is Everything

It is often said in engineering that "Cache Rules Everything Around Me", and the same holds for agents: long-running agentic products like Claude Code are only feasible because prompt caching keeps their cost and latency in check.

A clear, practical thread on X by @trq212 (linked below) distills one big truth for agent builders: *prompt caching is everything*. It walks through how Claude Code was built around caching, and why that design choice makes long-running agent products feasible, faster, and cheaper.

At its core, prompt caching is a simple idea, with a tricky implementation detail. The cache works by *prefix matching*, so the order of things in your request matters a lot. The thread explains a reliable ordering that maximizes hits: **static system prompt and tools first, project-level context next, session context after that, and finally conversation messages**. Put dynamic bits up front and you’ll break the cache, which costs time and money.
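To make the prefix-matching point concrete, here is a small sketch (not the real Claude Code implementation; `build_prompt`, `cached_prefix_len`, and the segment labels are assumptions for illustration) that models a prompt as an ordered list of segments and shows why the most-static-first ordering preserves cache hits:

```python
def build_prompt(static_system, tools, project_ctx, session_ctx, messages):
    """Assemble prompt segments in a stable, most-static-first order."""
    segments = [static_system]   # static system prompt first
    segments += tools            # tool definitions next (also static)
    segments += project_ctx      # project-level context
    segments += session_ctx      # session context
    segments += messages         # conversation messages last
    return segments

def cached_prefix_len(prev, curr):
    """Prompt caches match by prefix: count leading segments that are identical."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

turn1 = build_prompt("You are an agent.", ["tool: bash"], ["README"],
                     ["cwd=/repo"], ["user: hi"])
# Appending a new message keeps the entire previous request as a cache hit:
turn2 = build_prompt("You are an agent.", ["tool: bash"], ["README"],
                     ["cwd=/repo"],
                     ["user: hi", "assistant: hello", "user: run tests"])
print(cached_prefix_len(turn1, turn2))  # 5 -> every old segment is reused

# Editing the system prompt (a dynamic bit up front) invalidates everything:
turn3 = build_prompt("You are an agent. [status: 3 files open]", ["tool: bash"],
                     ["README"], ["cwd=/repo"], ["user: hi"])
print(cached_prefix_len(turn1, turn3))  # 0 -> full cache miss
```

The asymmetry is the whole lesson: appending to the end is nearly free, while touching anything near the front re-bills the entire prefix.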

The author then walks through concrete lessons from real-world operations: send updates as system messages rather than editing the system prompt, don't swap models mid-session unless you handle the handoff explicitly, and never add or remove tools during a conversation. There's also a clever pattern for loading many tools without breaking the prefix: send lightweight stubs with defer_loading set to true, then fetch full schemas only when the model asks. (Small aside: that pattern felt like magic the first time I read it.)
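A minimal sketch of that deferred-loading idea, assuming a simple in-memory registry (the `FULL_SCHEMAS` dict, `tool_stubs`, and `resolve_tool` are all hypothetical names for this example): stubs stay byte-identical across turns so the prefix caches, while full schemas are served out-of-band on demand.

```python
# Full schemas live outside the prompt; only tiny stubs go into the cached prefix.
FULL_SCHEMAS = {
    "search_code": {"name": "search_code",
                    "input_schema": {"type": "object",
                                     "properties": {"query": {"type": "string"},
                                                    "path": {"type": "string"}}}},
    "run_tests": {"name": "run_tests",
                  "input_schema": {"type": "object",
                                   "properties": {"target": {"type": "string"}}}},
}

def tool_stubs():
    """Lightweight stubs sent with every request; sorted so they never reorder."""
    return [{"name": name, "defer_loading": True} for name in sorted(FULL_SCHEMAS)]

def resolve_tool(name):
    """Fetch the full schema only when the model actually asks for this tool."""
    return FULL_SCHEMAS[name]

print([s["name"] for s in tool_stubs()])  # identical on every turn
print(resolve_tool("run_tests")["input_schema"]["properties"])
```

The key property is that `tool_stubs()` is deterministic, so the tool section of the prompt never changes mid-conversation even as the set of *resolved* schemas grows.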

Compaction (summarizing old context when the window fills) gets special attention. If you run compaction with different parameters, you lose the parent cache, and costs spike. The fix is “cache-safe forking,” where the compaction call reuses the exact same prefix so only the summary tokens are new.
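Cache-safe forking can be sketched the same way, assuming the segment-list model of a prompt from above (the function names are illustrative, not from the thread): the compaction request copies the live session's prompt verbatim and only appends a summarize instruction at the end, so only those new tokens miss the cache.

```python
def compaction_request(session_prompt):
    """Fork the live prompt for compaction without disturbing its prefix."""
    # Appending at the end is cheap; changing anything earlier in the
    # request would orphan the parent cache entry and re-bill the prefix.
    return session_prompt + ["system: Summarize the conversation above."]

def cached_prefix_len(prev, curr):
    """Count leading segments shared between two requests (prefix matching)."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

session = ["system prompt", "tools", "project ctx", "user: fix bug",
           "assistant: done"]
fork = compaction_request(session)
# Every segment of the live session is reused; only the instruction is new.
print(cached_prefix_len(session, fork) == len(session))  # True
```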

Read the original thread here: https://x.com/trq212/status/2024574133011673516

If you’re building agents, design around the cache from day one, monitor cache hit rate like uptime, and treat small cache wins as big operational wins. The future looks promising, since these patterns let teams scale agents without exploding cost or latency, and APIs are already evolving to make this easier.
