Dialogue · Prompt Caching

How prompt caching works

Each turn of the interactive panel re-sends the same large prefix — the megaprompt instructions plus all three characters' profiles, journals, and briefs. Caching lets you pay for that once and read it cheap on every turn after. Here's the mechanic, then the two ways to handle the deliberation.

A cache hit runs from the very first token until the first byte that differs — then it stops. It's a prefix, from the front. Always. Not scattered identical blocks.
Legend: written once (~1.25× input) · cached read (~0.1×, the win) · new, full price this turn · generated output (full price, once) · cache miss (re-read)

1 · The model reads one long sequence

Everything you send is one stream, read front to back. Caching looks for the longest run that is identical to a previous request, starting at the very front.

system · megaprompt + Li / Doronin / Shane profiles · journals · briefs | conversation so far | new user line

The system block is the bulk of the tokens and never changes. That's the prize: re-reading it at about a tenth of the input price every turn.

2 · TTL — how long the cache lives

TTL = "time to live." The cached prefix survives 5 minutes, and the clock resets on every hit — so an active conversation stays warm. Five minutes of silence and it expires; the next call re-writes it. (A 1-hour TTL exists at a higher write price, for when the user is likely to step away.)
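The reset-on-hit behaviour can be sketched as a tiny simulation. This is a toy model, not any provider's implementation: the class name `PrefixCacheEntry`, the 300-second default, and the request timeline are all illustrative assumptions taken from the 5-minute figure above.

```python
from typing import Optional


class PrefixCacheEntry:
    """Toy model of one cached prefix with a sliding TTL (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl_seconds = ttl_seconds
        self.expires_at: Optional[float] = None  # nothing written yet

    def request(self, now: float) -> str:
        """Return 'read' on a warm hit, 'write' otherwise. Either way the clock restarts."""
        warm = self.expires_at is not None and now < self.expires_at
        self.expires_at = now + self.ttl_seconds
        return "read" if warm else "write"


entry = PrefixCacheEntry()
timeline = [0, 120, 400, 750]  # request times in seconds (made-up values)
outcomes = [entry.request(t) for t in timeline]
print(outcomes)
```

The request at 400 s is still a hit because the hit at 120 s restarted the clock. The request at 750 s arrives 350 s after the previous one, so the prefix has expired and is written again.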

3 · It's a prefix, not scattered blocks

Byte-identical isn't enough. A block also has to sit in the same place with everything before it unchanged. Change something in the middle and the cache stops there — even if later blocks are identical, they come after the change, so they're re-read.

call A
system | block B | block C
call B (B changed; C is byte-identical to before)
system | block B′ (changed) | block C (re-read)

C is a miss even though it didn't change, because the hit ended at B. Order and position matter as much as content.
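The call A / call B comparison can be reproduced with a few lines of Python. This is a simplified sketch: it compares whole blocks, whereas a real cache matches at a finer granularity, and the function name `split_hit_and_miss` is made up for illustration.

```python
from typing import List, Tuple


def split_hit_and_miss(previous: List[str], current: List[str]) -> Tuple[List[str], List[str]]:
    """Split `current` into the leading blocks that match `previous` and the rest."""
    hit_count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the hit ends at the first difference
        hit_count += 1
    return current[:hit_count], current[hit_count:]


call_a = ["system", "block B", "block C"]
call_b = ["system", "block B (changed)", "block C"]

hit, miss = split_hit_and_miss(call_a, call_b)
print("hit: ", hit)
print("miss:", miss)
```

Block C lands in `miss` even though its content is unchanged, because the loop stops at the first differing block and never looks further.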

4 · So: stable first, volatile last

Put the things that never change at the front, and the things that change every turn at the end. Anything volatile placed early moves the "first differing byte" toward the front and throws away the cache for everything after it.

system + character materials: never changes → stays cached
conversation history: append-only → stays cached (grows)
new user line: changes every turn → the only full-price input
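To see how much an early volatile value costs, the sketch below builds the same two turns under two layouts and measures the shared prefix. Everything here is an assumption for illustration: the `SYSTEM` stand-in text, the clock strings, and both builder functions.

```python
from typing import List

# Stand-in for the large, unchanging system block (hypothetical content).
SYSTEM = "megaprompt + character materials\n" * 50


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the identical run at the front of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stable_first(history: List[str], user_line: str) -> str:
    # Stable material, then append-only history, then the new line.
    return SYSTEM + "".join(history) + user_line


def volatile_first(clock: str, history: List[str], user_line: str) -> str:
    # Same content, but a value that changes every turn sits at the very front.
    return f"[now {clock}]\n" + SYSTEM + "".join(history) + user_line


turn1_history: List[str] = []
turn2_history = ["u1: hello\n", "panel: reply\n"]

good_1 = stable_first(turn1_history, "u1: hello\n")
good_2 = stable_first(turn2_history, "u2: next\n")
bad_1 = volatile_first("10:01", turn1_history, "u1: hello\n")
bad_2 = volatile_first("10:02", turn2_history, "u2: next\n")

good_shared = shared_prefix_len(good_1, good_2)
bad_shared = shared_prefix_len(bad_1, bad_2)
print(good_shared, "of", len(good_1), "characters reusable with stable-first")
print(bad_shared, "characters reusable with the clock at the front")
```

With the stable-first layout the whole of turn 1 is a prefix of turn 2. With the clock at the front, the two prompts diverge inside the first line and the system block behind it can no longer be reused.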

5 · Keep the confer, or drop it — both still cache

After the panel deliberates backstage, you choose what to store in the history for next turn: the whole confer, or only the public responses. Watch the same blocks turn gold (new) once, then green (cached) forever after. The two options differ only in how big the cached block is — never in hit-vs-miss.

KEEP confer — the back-and-forth stays in history

turn 1
system | u₁ | gen: confer₁ + resp₁
turn 2
system | u₁ · [confer₁ + resp₁] | u₂ | gen₂
turn 3
system | u₁ · [confer₁ + resp₁] · u₂ · [confer₂ + resp₂] | u₃ | gen₃

Bigger cached block — the panel "remembers" how it argued. All cheap reads.

DROP confer — confer goes to the drawer, only responses stored

turn 1
system | u₁ | gen: confer₁ + resp₁

confer₁ → side panel, not stored in history

turn 2
system | u₁ · [resp₁ only] | u₂ | gen₂
turn 3
system | u₁ · [resp₁] · u₂ · [resp₂] | u₃ | gen₃

Smaller cached block — leaner, less memory of the argument. Still all cheap reads.

Both options: system plus every prior turn is a cache hit (green); only the new user line and the new generation are full price. Each option commits once to what a stored turn looks like and only ever appends, so the prefix stays identical going forward. The choice is memory vs. leanness, not cheap vs. expensive.
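A minimal sketch of the two storage policies, assuming each turn's output can be split into a confer part and a public response. The names `run_turns` and `is_append_only` and the placeholder strings are hypothetical.

```python
from typing import List

SYSTEM = "system"


def run_turns(keep_confer: bool, turns: int = 3) -> List[List[str]]:
    """Return the block sequence sent on each turn under one storage policy."""
    history: List[str] = []
    prompts: List[List[str]] = []
    for n in range(1, turns + 1):
        user = f"u{n}"
        prompts.append([SYSTEM] + history + [user])
        # Placeholders for what the model generated this turn.
        confer, resp = f"confer{n}", f"resp{n}"
        stored = [confer, resp] if keep_confer else [resp]
        history = history + [user] + stored  # append only, never rewrite
    return prompts


def is_append_only(prompts: List[List[str]]) -> bool:
    """True if every prompt begins with the complete previous prompt."""
    return all(
        later[: len(earlier)] == earlier
        for earlier, later in zip(prompts, prompts[1:])
    )


keep = run_turns(keep_confer=True)
drop = run_turns(keep_confer=False)
print(is_append_only(keep), len(keep[-1]))
print(is_append_only(drop), len(drop[-1]))
```

Both policies pass the append-only check, which is what keeps the prefix reusable. They differ only in how many blocks the final prompt carries.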

6 · Why it pays off most with a user in the loop

Turn 1 pays a one-time write for the materials (~1.25× input). Every turn after reads them at ~0.1×, and only the new user line + new generation are full price. So per-turn cost grows only slowly as the chat grows, because the accumulated history is also read at ~0.1×, and the longer the conversation, the more the one-time write is amortized. The batch one-shot scheme (a single call) gets almost none of this; the interactive multi-turn scheme gets the most.
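A rough worked example of that arithmetic, in units of "one input token at base price". The multipliers come from the text; the token counts are invented, output pricing is left out, and the model follows the simplified accounting used here (it does not charge a write for history appended after turn 1).

```python
WRITE, READ, FULL = 1.25, 0.10, 1.00  # price multipliers quoted in the text


def input_cost_cached(materials: int, growth: int, user_line: int, turns: int) -> float:
    """Input cost when the prefix is written on turn 1 and read afterwards."""
    total = 0.0
    history = 0  # tokens of stored conversation so far
    for turn in range(1, turns + 1):
        if turn == 1:
            total += materials * WRITE
        else:
            total += (materials + history) * READ
        total += user_line * FULL
        history += growth
    return total


def input_cost_uncached(materials: int, growth: int, user_line: int, turns: int) -> float:
    """Input cost when every turn re-sends everything at full price."""
    total = 0.0
    history = 0
    for _ in range(turns):
        total += (materials + history + user_line) * FULL
        history += growth
    return total


# Hypothetical sizes: 20,000-token materials, 500 tokens stored per turn, 50-token user lines.
for turns in (1, 10):
    cached = input_cost_cached(20_000, 500, 50, turns)
    plain = input_cost_uncached(20_000, 500, 50, turns)
    print(turns, "turn(s):", round(cached), "cached vs", round(plain), "uncached")
```

Under these assumptions a single call is slightly more expensive with caching, since it pays the write premium and never reads. Over ten turns the cached total is a small fraction of the uncached one.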

7 · Four things that quietly break it

8 · Adding / removing a voice without throwing the cache away

Changing the cast changes the system block — which is bust #3. But where and how you change it decides whether you re-write a little or everything. Both moves come straight out of the prefix rule (§3) and the stable-first ordering (§4).

Add → append, never insert. Put the new character's materials at the end of the materials block. Everything before stays cached; only their tail is a write.

before
system · Li · Doronin · Shane
add Brian, appended at the end ✓
system · Li · Doronin · Shane | + Brian | new user line
insert / reorder in the middle ✗
system · Li | + Brian (inserted) | Doronin · Shane (re-read)

Inserting or reordering moves the first differing byte earlier in the prompt and re-reads everything after it: the §3 miss, self-inflicted. Appending keeps it to one small tail.

Remove → retire in place, don't delete. Deleting a character from the middle shifts the prefix and busts the cache from the gap onward. Instead, leave their materials where they are — dormant, still cheap cached reads — and append a "stop voicing X" instruction.

delete Doronin from the middle ✗
system · Li | Shane (re-read; the gap shifted everything after)
retire Doronin in place ✓
system · Li · Doronin (dormant) · Shane | “stop voicing Doronin”

Their tokens stay cached but inert — you keep paying the cheap read for them, and you avoid a re-write entirely. A retire is essentially free.
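The four cast changes above can be compared side by side by counting how many leading blocks survive. As before, this is a block-level simplification, and `cached_blocks` is an illustrative helper rather than part of any API.

```python
from typing import Dict, List


def cached_blocks(previous: List[str], current: List[str]) -> int:
    """Number of leading blocks in `current` that match `previous` in order."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


before = ["system", "Li", "Doronin", "Shane"]

layouts: Dict[str, List[str]] = {
    "append Brian": before + ["Brian"],
    "insert Brian": ["system", "Li", "Brian", "Doronin", "Shane"],
    "delete Doronin": ["system", "Li", "Shane"],
    "retire Doronin": before + ["stop voicing Doronin"],
}

results = {name: cached_blocks(before, layout) for name, layout in layouts.items()}
for name, kept in results.items():
    print(f"{name}: {kept} of {len(before)} blocks still cached")
```

Appending and retiring keep all four original blocks reusable. Inserting and deleting both cut the reusable run off right after Li.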

The optimization — compact only when a write is already happening. Dormant voices pile up over a long session: cheap, but not zero. Clearing them means rebuilding the context — which is itself a full re-write, so doing it on its own costs more than it saves. But an add already pays for a tail write — so drop the dormant voices in that same rebuild and the garbage-collect rides along for free. Never compact for its own sake; fold it into a write you were going to pay anyway.