How prompt caching works
Each turn of the interactive panel re-sends the same large prefix: the megaprompt instructions plus all three characters' profiles, journals, and briefs. Caching lets you pay for that once and read it cheaply on every turn after. Here's the mechanic, then the two ways to handle the deliberation.
A cache hit runs from the very first token until the first byte that differs — then it stops. It's a prefix, from the front. Always. Not scattered identical blocks.
Diagram legend:
- written once (~1.25× input)
- cached read (~0.1×, the win)
- new, full price this turn
- generated output (full price, once)
- cache miss (re-read)
1 · The model reads one long sequence
Everything you send is one stream, read front to back. Caching looks for the longest run that is identical to a previous request, starting at the very front.
system · megaprompt + Li / Doronin / Shane profiles · journals · briefs
conversation so far
new user line
The system block is the bulk of the tokens and never changes. That's the prize: re-reading it at roughly a tenth of the input price every turn.
2 · TTL — how long the cache lives
TTL = "time to live." The cached prefix survives 5 minutes, and the clock resets on every hit — so an active conversation stays warm. Five minutes of silence and it expires; the next call re-writes it. (A 1-hour TTL exists at a higher write price, for when the user is likely to step away.)
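The sliding expiry described above can be sketched as a toy model. This is an illustration only: `PrefixCacheEntry` and its `request` method are hypothetical names, time is passed in explicitly as seconds, and a real provider's cache may behave differently at the edges.

```python
from dataclasses import dataclass

FIVE_MINUTES = 5 * 60  # default TTL in seconds, per the text above


@dataclass
class PrefixCacheEntry:
    """Toy model of one cached prefix with a sliding expiry."""
    ttl_seconds: float = FIVE_MINUTES
    expires_at: float = float("-inf")  # nothing cached yet

    def request(self, now: float) -> str:
        """Return 'hit' or 'write' for a call made at time `now` (seconds)."""
        outcome = "hit" if now < self.expires_at else "write"
        # Both a hit and a fresh write leave the prefix cached for a full TTL.
        self.expires_at = now + self.ttl_seconds
        return outcome


entry = PrefixCacheEntry()
timeline = [0, 120, 360, 700]  # seconds since the session started
outcomes = [entry.request(t) for t in timeline]
print(outcomes)  # ['write', 'hit', 'hit', 'write']
```

The call at 360 s is more than five minutes after the first write but still hits, because the hit at 120 s reset the clock. The call at 700 s follows 340 s of silence, so it pays for a new write.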
3 · It's a prefix, not scattered blocks
Byte-identical isn't enough. A block also has to sit in the same place with everything before it unchanged. Change something in the middle and the cache stops there — even if later blocks are identical, they come after the change, so they're re-read.
call A: system · block B · block C
call B (B changed; C is byte-identical to before): system · block B′ (changed) · ✕ block C (re-read)
C is a miss even though it didn't change, because the hit ended at B. Order and position matter as much as content.
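A minimal sketch of the prefix rule, assuming block-level granularity for readability (real caches compare the serialized request, and `cached_block_count` is a hypothetical helper, not a provider API):

```python
def cached_block_count(previous: list[str], current: list[str]) -> int:
    """Count leading blocks of `current` that match `previous` exactly."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the hit ends at the first difference
        count += 1
    return count


call_a = ["system", "block B", "block C"]
call_b = ["system", "block B' (changed)", "block C"]

hits = cached_block_count(call_a, call_b)
print(hits)           # 1
print(call_b[hits:])  # ["block B' (changed)", 'block C']
```

Block C appears in the second list even though its content is unchanged, because the comparison never gets past B.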
4 · So: stable first, volatile last
Put the things that never change at the front, and the things that change every turn at the end. Anything volatile placed early moves the "first differing byte" toward the front and throws away the cache for everything after it.
system + character materials — never changes → stays cached
conversation history — append-only → stays cached (grows)
new user line — changes every turn → the only full-price input
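To see how much ordering matters, the sketch below serializes two consecutive turns both ways and measures the shared prefix. The `MATERIALS` string, the `[turn N]` tag, and both builder functions are made up for illustration, and conversation history is left out to keep the comparison small.

```python
from os.path import commonprefix


def shared_prefix_chars(previous: str, current: str) -> int:
    """Length of the identical leading run between two serialized requests."""
    return len(commonprefix([previous, current]))


# Placeholder standing in for the large, unchanging system block.
MATERIALS = "megaprompt + Li / Doronin / Shane profiles, journals, briefs. " * 20


def volatile_first(turn: int, user_line: str) -> str:
    # Anti-pattern: a per-turn counter sits in front of the stable materials.
    return f"[turn {turn}] " + MATERIALS + user_line


def stable_first(turn: int, user_line: str) -> str:
    # Stable materials lead; everything that changes per turn goes last.
    return MATERIALS + f"[turn {turn}] " + user_line


bad = shared_prefix_chars(volatile_first(1, "hello"), volatile_first(2, "and then?"))
good = shared_prefix_chars(stable_first(1, "hello"), stable_first(2, "and then?"))
print(bad, good, len(MATERIALS))
```

With the counter in front, the two requests share only the six characters of `[turn `. With the materials in front, the shared run covers all of `MATERIALS`.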
5 · Keep the confer, or drop it — both still cache
After the panel deliberates backstage, you choose what to store in the history for next turn: the whole confer, or only the public responses. Watch the same blocks turn gold (new) once, then green (cached) forever after. The two options differ only in how big the cached block is — never in hit-vs-miss.
KEEP confer — the back-and-forth stays in history
turn 1: system · u₁ → gen: confer₁ + resp₁
turn 2: system · u₁ · [confer₁ + resp₁] · u₂ → gen₂
turn 3: system · u₁ · [confer₁ + resp₁] · u₂ · [confer₂ + resp₂] · u₃ → gen₃
Bigger cached block — the panel "remembers" how it argued. All cheap reads.
DROP confer — confer goes to the drawer, only responses stored
turn 1: system · u₁ → gen: confer₁ + resp₁ (confer₁ goes to the side panel, not stored in history)
turn 2: system · u₁ · [resp₁ only] · u₂ → gen₂
turn 3: system · u₁ · [resp₁] · u₂ · [resp₂] · u₃ → gen₃
Smaller cached block — leaner, less memory of the argument. Still all cheap reads.
Both columns: system + every prior turn is a cache hit (green); only the new user line and the new generation are full price. Each option commits once to what a stored turn looks like and only ever appends, so the prefix stays identical going forward. The choice is memory vs. leanness, not cheap vs. expensive.
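The append-only property can be checked mechanically. The sketch below rebuilds the block sequence for each turn under both options and confirms that every request starts with the whole previous request. All function names and the `u1` / `confer1` / `resp1` labels are hypothetical stand-ins for real message content.

```python
def stored_turn(user: str, confer: str, response: str, keep_confer: bool) -> list[str]:
    """Blocks written to history for one finished turn."""
    reply = f"[{confer} + {response}]" if keep_confer else f"[{response}]"
    return [user, reply]


def requests_for_session(turns: int, keep_confer: bool) -> list[list[str]]:
    """The block sequence sent on each turn of a session."""
    history: list[str] = []
    sent = []
    for n in range(1, turns + 1):
        sent.append(["system", *history, f"u{n}"])
        # After generation, commit this turn to history in the chosen shape.
        history += stored_turn(f"u{n}", f"confer{n}", f"resp{n}", keep_confer)
    return sent


def extends(previous: list[str], current: list[str]) -> bool:
    """True when `current` starts with every block of `previous`, in order."""
    return current[: len(previous)] == previous


for keep in (True, False):
    sent = requests_for_session(3, keep_confer=keep)
    print("KEEP" if keep else "DROP", sent[-1])
    assert all(extends(a, b) for a, b in zip(sent, sent[1:]))
```

Both options pass the check. What would break it is switching shape partway through a session, since earlier stored turns would then no longer match what was sent before.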
6 · Why it pays off most with a user in the loop
Turn 1 pays a one-time write for the materials (~1.25× input). Every turn after reads them at ~0.1×, and only the new user line + new generation are full price. So per-turn cost stays roughly flat as the chat grows — and the longer the conversation, the more the one-time write is amortized. The batch one-shot scheme (a single call) gets almost none of this; the interactive multi-turn scheme gets the most.
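A rough worked example of the amortization, using the ~1.25× write and ~0.1× read multipliers from the text. The token counts (20,000 for the materials, 50 for each new user line) are invented for illustration, and the model deliberately ignores the growing conversation history and output tokens.

```python
WRITE_MULT = 1.25  # one-time cache write, relative to normal input price
READ_MULT = 0.10   # cached read, relative to normal input price


def session_input_cost(materials: int, new_per_turn: int, turns: int, cached: bool) -> float:
    """Input cost in token-equivalents for re-sending `materials` every turn.

    Simplification: only the fixed materials are modelled as cacheable; the
    growing conversation history is left out of the arithmetic.
    """
    total = 0.0
    for turn in range(1, turns + 1):
        if not cached:
            materials_cost = materials * 1.0
        elif turn == 1:
            materials_cost = materials * WRITE_MULT  # first turn pays the write
        else:
            materials_cost = materials * READ_MULT   # later turns pay the cheap read
        total += materials_cost + new_per_turn
    return total


for turns in (1, 2, 10):
    with_cache = session_input_cost(20_000, 50, turns, cached=True)
    without = session_input_cost(20_000, 50, turns, cached=False)
    print(turns, round(with_cache), round(without))
```

Under these assumptions a single call is slightly more expensive with caching (the write premium is never recovered), the second turn already puts caching ahead, and by ten turns the cached session costs roughly a fifth of the uncached one.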
7 · Four things that quietly break it
- 1. Volatile content placed early. A timestamp, session id, or turn counter at the top of the system block moves the "first differing byte" to the front and discards everything after it.
- 2. @-address / routing baked into the system block. Keep "answer as Shane only" in the appended user message at the end; never edit the system block mid-session.
- 3. Changing the cast or deliberation depth mid-session. Naïvely, adding a character or deepening the confer rewrites the system block from that point on. The cast part has a cache-aware way to do it (see §8 below). Deepening the confer mid-session has no shortcut: cheap between sessions, not free within one.
- 4. Idle longer than the TTL. If the user reads, walks away, and comes back after more than 5 minutes, the next call re-writes the cache. Use the 1-hour TTL if long gaps are expected.
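One way to catch these before they cost money is a small client-side guard that fingerprints the system block at session start and tracks idle time. This is a sketch under assumptions: `PrefixGuard` is a hypothetical helper, it only detects that the system block changed (not where), and it cannot know what the provider's cache actually holds.

```python
import hashlib
from typing import Optional


class PrefixGuard:
    """Flags likely cache-busting conditions before a call is sent."""

    def __init__(self, system_block: str, ttl_seconds: float = 300.0):
        self._fingerprint = self._digest(system_block)
        self._ttl = ttl_seconds
        self._last_call: Optional[float] = None

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def check(self, system_block: str, now: float) -> list[str]:
        """Return warnings for a call made at `now` (seconds)."""
        warnings = []
        if self._digest(system_block) != self._fingerprint:
            warnings.append("system block changed: prefix will be re-written")
        if self._last_call is not None and now - self._last_call > self._ttl:
            warnings.append("idle longer than TTL: cache has likely expired")
        self._last_call = now
        return warnings


system = "megaprompt + profiles"
guard = PrefixGuard(system)
print(guard.check(system, now=0))                # []
print(guard.check("[turn 2] " + system, now=30)) # system block changed
print(guard.check(system, now=400))              # idle longer than TTL
```

Routing instructions such as "answer as Shane only" never reach the guard as a problem, provided they travel in the user message rather than in `system_block`.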
8 · Adding / removing a voice without throwing the cache away
Changing the cast changes the system block — which is bust #3. But where and how you change it decides whether you re-write a little or everything. Both moves come straight out of the prefix rule (§3) and the stable-first ordering (§4).
Add → append, never insert. Put the new character's materials at the end of the materials block. Everything before them stays cached; the write starts at the new tail.
before: system · Li · Doronin · Shane
add Brian, appended at the end ✓: system · Li · Doronin · Shane · + Brian · new user line
insert / reorder in the middle ✗: system · Li · + Brian (inserted) · ✕ Doronin · Shane (re-read)
Inserting or reordering moves the first differing byte earlier in the request and re-reads everything after it: the §3 miss, self-inflicted. Appending keeps the write to one small tail.
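The difference between the two moves can be shown by locating the first block that changes. As before, this is a block-level illustration with a hypothetical helper, `first_changed_index`, and cast names taken from the diagrams above.

```python
def first_changed_index(previous: list[str], current: list[str]) -> int:
    """Index of the first block where the two requests diverge."""
    for i, (old, new) in enumerate(zip(previous, current)):
        if old != new:
            return i
    # No difference within the shared length: divergence starts where one ends.
    return min(len(previous), len(current))


before = ["system", "Li", "Doronin", "Shane"]
appended = before + ["Brian"]
inserted = ["system", "Li", "Brian", "Doronin", "Shane"]

for label, cast in (("append", appended), ("insert", inserted)):
    cut = first_changed_index(before, cast)
    print(label, "cached:", cast[:cut], "processed again:", cast[cut:])
```

Appending leaves all four original blocks cached and processes only Brian. Inserting keeps just `system` and `Li`, so Doronin and Shane are processed again despite being unchanged.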
Remove → retire in place, don't delete. Deleting a character from the middle shifts the prefix and busts the cache from the gap onward. Instead, leave their materials where they are — dormant, still cheap cached reads — and append a "stop voicing X" instruction.
delete Doronin from the middle ✗: system · Li · ✕ Shane (re-read; the gap shifted everything after)
retire Doronin in place ✓: system · Li · Doronin (dormant) · Shane · “stop voicing Doronin”
Their tokens stay cached but inert — you keep paying the cheap read for them, and you avoid a re-write entirely. A retire is essentially free.
The optimization: compact only when a write is already happening. Dormant voices pile up over a long session: cheap, but not zero. Clearing them means rebuilding the context, which is itself a full re-write, so doing it on its own costs more than it saves. But an add already pays for a write, so drop the dormant voices in that same rebuild and the garbage-collect rides along at little extra cost. Never compact for its own sake; fold it into a write you were going to pay anyway.
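The retire-in-place and compact-on-add policies can be sketched together as a small cast manager. The `Cast` class and its method names are hypothetical, and the sketch tracks only block order, not token counts or actual cache state.

```python
from dataclasses import dataclass, field


@dataclass
class Cast:
    """Ordered character materials plus instructions appended after them."""
    voices: list[str] = field(default_factory=list)
    dormant: set[str] = field(default_factory=set)
    appended: list[str] = field(default_factory=list)

    def retire(self, name: str) -> None:
        # Leave the materials where they are; only append an instruction.
        self.dormant.add(name)
        self.appended.append(f"stop voicing {name}")

    def add(self, name: str) -> None:
        # A write is happening anyway, so drop dormant voices in the same rebuild.
        self.voices = [v for v in self.voices if v not in self.dormant] + [name]
        self.appended = []  # the stop instructions referred to voices now removed
        self.dormant.clear()

    def blocks(self) -> list[str]:
        return ["system", *self.voices, *self.appended]


cast = Cast(voices=["Li", "Doronin", "Shane"])
cast.retire("Doronin")
print(cast.blocks())  # ['system', 'Li', 'Doronin', 'Shane', 'stop voicing Doronin']
cast.add("Brian")
print(cast.blocks())  # ['system', 'Li', 'Shane', 'Brian']
```

There is intentionally no standalone `compact` method: the only path that removes dormant materials is `add`, which is the moment a write is already being paid for.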