What each scheme actually cost to run.
Four runs of the same scene with the same characters. Two used the lean Python coordinator path against the API directly (006a on Anthropic Sonnet 4.6, 007 on OpenRouter / DeepSeek V4 Pro). Two were orchestrated inside a Claude Code session (006b = subagent scheme, 006c = monolithic) — both ran on Opus 4.7, not the intended Sonnet 4.6, because the CC session's default model was Opus at the time. Two confounders to disentangle: the model billed and the architecture doing the orchestration.
01 What we actually paid
Sources of truth: Anthropic console for 006a and the two CC runs; OpenRouter console for 007. The CC-run token counts are reconstructed from the local CC session JSONL telemetry (parent session + all spawned Task sub-agents) and priced at Anthropic Opus list rates.
| Exp | Scheme | Model billed | Turns / char | Tokens (total) | Cost |
|---|---|---|---|---|---|
| 006a | python | claude-sonnet-4-6 | 15 / 14 | ~2.50 M | $0.80 |
| 006b | subagent | claude-opus-4-7 * | 7 / 7 | 26.45 M | $134.81 |
| 006c | monolithic | claude-opus-4-7 * | 23 / 23 | 4.43 M | $22.97 |
| 007 | python | deepseek/deepseek-v4-pro | 9 / 8 | ~0.97 M | $0.79 |
* 006b and 006c were intended to run on claude-sonnet-4-6 but the CC session was configured for Opus when the runs executed. Each notes.md records the discrepancy.
02 Normalised to Opus 4.7: strips the model confounder, keeps architecture
Assume all four runs had been billed at Anthropic Opus 4.7 list rates. 006a scales up by 5.00× (Opus is uniformly 5.00× Sonnet across input, cache, and output buckets). 007 is re-costed against its OpenRouter token mix, assuming the same cache_control discipline lib/run_dialogue.py uses.
| Exp | Scheme | Actual bill | Cost at Opus 4.7 | × vs 006a |
|---|---|---|---|---|
| 006a | python | $0.80 | $4.00 | 1.00× |
| 006b | subagent | $134.81 | $134.81 | 33.70× |
| 006c | monolithic | $22.97 | $22.97 | 5.74× |
| 007 | python | $0.79 | ~$5.65 | ~1.41× |
At a constant model, the two python-coordinator runs sit at a $4.00–$5.65 floor; the two CC-orchestrated runs sit 5.74× to 33.70× higher. The model isn't the source of the gap. The architecture is.
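The normalisation above is plain arithmetic, and a minimal sketch makes the ratios easy to re-derive. The names `RUNS`, `cost_at_opus`, and `ratios_vs` are illustrative and not part of the study's code. The 007 figure is taken as the report's mid-point estimate rather than computed here.

```python
# Actual bills (USD) and the model each run was billed on, from the first table.
RUNS = {
    "006a": {"bill": 0.80, "billed_on": "sonnet"},
    "006b": {"bill": 134.81, "billed_on": "opus"},
    "006c": {"bill": 22.97, "billed_on": "opus"},
    "007": {"bill": 0.79, "billed_on": "deepseek"},
}
OPUS_OVER_SONNET = 5.00  # uniform multiplier stated in the text
RECOSTED_AT_OPUS = {"007": 5.65}  # report's mid-point estimate, not derived here


def cost_at_opus(exp: str) -> float:
    """Return the cost of one run as if it had been billed at Opus list rates."""
    run = RUNS[exp]
    if run["billed_on"] == "opus":
        return run["bill"]  # already an Opus bill
    if run["billed_on"] == "sonnet":
        return run["bill"] * OPUS_OVER_SONNET  # scale the Sonnet bill up
    return RECOSTED_AT_OPUS[exp]  # other providers need a re-costed estimate


def ratios_vs(baseline: str = "006a") -> dict[str, float]:
    """Express every run's Opus-normalised cost as a multiple of the baseline."""
    base = cost_at_opus(baseline)
    return {exp: round(cost_at_opus(exp) / base, 2) for exp in RUNS}


if __name__ == "__main__":
    for exp, ratio in ratios_vs().items():
        print(f"{exp}: ${cost_at_opus(exp):.2f} at Opus, {ratio:.2f}x vs 006a")
```

Keeping the billed model next to each bill makes it explicit which rows are scaled, which are passed through, and which depend on an outside estimate.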
03 Normalised to Sonnet 4.6: what the study was supposed to cost
Same exercise at Sonnet rates instead of Opus. This is the figure that would have appeared on the bill if the CC sessions hadn't been mis-configured to Opus. Sonnet list: $3.00 input / $3.75 cache-write-5m / $6.00 cache-write-1h / $0.30 cache-read / $15.00 output, per M tokens.
| Exp | Scheme | Actual bill | Cost at Sonnet 4.6 | × vs 006a |
|---|---|---|---|---|
| 006a | python | $0.80 | $0.80 | 1.00× |
| 006b | subagent | $134.81 | $26.96 | 33.70× |
| 006c | monolithic | $22.97 | $4.59 | 5.74× |
| 007 | python | $0.79 | ~$1.13 | ~1.41× |
The architecture ratios are unchanged from the Opus-normalised view, as they must be: Opus is a uniform 5.00× multiplier across every bucket. What's new is the absolute scale: at Sonnet rates, even the worst-case architecture (subagent via CC) is $26.96 per scene-run. That is expensive relative to the python path but not catastrophic if used selectively.
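The uniform-multiplier claim can be checked bucket by bucket. The sketch below uses the Sonnet rates quoted in this section and the Opus rates listed in the per-bucket table of section 05. The dictionary keys and function names are illustrative choices, not identifiers from the study.

```python
import math

# List rates in USD per million tokens.
SONNET = {
    "input": 3.00,
    "cache_write_5m": 3.75,
    "cache_write_1h": 6.00,
    "cache_read": 0.30,
    "output": 15.00,
}
OPUS = {
    "input": 15.00,
    "cache_write_5m": 18.75,
    "cache_write_1h": 30.00,
    "cache_read": 1.50,
    "output": 75.00,
}


def multipliers() -> dict[str, float]:
    """Opus rate divided by Sonnet rate, per bucket."""
    return {bucket: OPUS[bucket] / SONNET[bucket] for bucket in SONNET}


def opus_bill_at_sonnet(opus_bill: float, multiplier: float = 5.00) -> float:
    """Divide an Opus bill by the uniform multiplier to get the Sonnet figure."""
    return round(opus_bill / multiplier, 2)


if __name__ == "__main__":
    # Every bucket should come out at (approximately) 5.0.
    assert all(math.isclose(m, 5.0) for m in multipliers().values())
    print(opus_bill_at_sonnet(134.81), opus_bill_at_sonnet(22.97))
```

Because every bucket carries the same factor, any token mix is rescaled by that factor as a whole, which is why the ratios against 006a do not move between the two tables.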
04 The picture, at Sonnet rates
Log scale, because the four runs span from $0.80 to $26.96 — a 33.70× range.
The two architectural families are visually distinct on the log axis. The python coordinator runs cluster around $1.00 ($0.80 and ~$1.13); the CC-orchestrated runs land well above that cluster, at $4.59 and $26.96. Model choice (Sonnet vs DeepSeek vs Opus) moves cost by 1.41× to 5.00×. Architecture moves cost by 5.74× (monolithic) to 33.70× (subagent).
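For readers without the chart, a rough text rendering of the same log-scale view can be produced with the standard library. This is a sketch only: the axis limits (`lo`, `hi`) and the bar width are arbitrary assumptions, and the labels are illustrative.

```python
import math

# Sonnet-normalised cost per scene-run (USD), from the table in section 03.
COST_AT_SONNET = {
    "006a python": 0.80,
    "007 python": 1.13,
    "006c monolithic": 4.59,
    "006b subagent": 26.96,
}


def log_bar(cost: float, lo: float = 0.5, hi: float = 50.0, width: int = 40) -> str:
    """Return a bar whose length is linear in log10(cost) between lo and hi."""
    span = math.log10(hi) - math.log10(lo)
    frac = (math.log10(cost) - math.log10(lo)) / span
    return "#" * max(1, round(frac * width))


def decades_spanned(costs: dict[str, float]) -> float:
    """Number of powers of ten between the cheapest and the most expensive run."""
    return math.log10(max(costs.values()) / min(costs.values()))


if __name__ == "__main__":
    for label, cost in COST_AT_SONNET.items():
        print(f"{label:<16} ${cost:>6.2f} {log_bar(cost)}")
    print(f"span: {decades_spanned(COST_AT_SONNET):.2f} decades")
```

On a linear axis the two python runs would be nearly indistinguishable next to the subagent run; the log mapping keeps all four readable.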
05 Where the money went (CC runs)
Per-bucket decomposition at the actually-billed Opus rates. Both CC runs show the same shape: a small fraction is the actor work; most of the cost is the parent CC session doing tool-loop bookkeeping (reading docs, composing per-turn briefs, narrating routing decisions).
| Bucket | 006b tokens | 006b cost | 006c tokens | 006c cost | Rate ($/M) |
|---|---|---|---|---|---|
| Input (fresh) | <0.01 M | $0.00 | <0.01 M | $0.00 | $15.00 |
| Cache write — 5-min TTL | 0.53 M | $9.86 | 0.29 M | $5.35 | $18.75 |
| Cache write — 1-hour TTL | 1.27 M | $38.17 | 0.23 M | $6.78 | $30.00 |
| Cache read | 23.97 M | $35.96 | 3.85 M | $5.77 | $1.50 |
| Output | 0.68 M | $50.82 | 0.07 M | $5.07 | $75.00 |
| Total | 26.45 M | $134.81 | 4.43 M | $22.97 | |
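The table can be cross-checked in two ways: the per-bucket costs should sum to the billed totals, and the rounded token counts multiplied by the Opus rates should land close to the reported costs. The sketch below does both and also reports each bucket's share of the bill. All names are illustrative, and the fresh-input bucket is entered as zero because the table reports it as below 0.01 M tokens.

```python
# Opus list rates, USD per million tokens, from the last column of the table.
OPUS_RATE = {
    "input": 15.00,
    "cache_write_5m": 18.75,
    "cache_write_1h": 30.00,
    "cache_read": 1.50,
    "output": 75.00,
}

# Token counts in millions, as rounded in the table.
TOKENS_M = {
    "006b": {"input": 0.0, "cache_write_5m": 0.53, "cache_write_1h": 1.27,
             "cache_read": 23.97, "output": 0.68},
    "006c": {"input": 0.0, "cache_write_5m": 0.29, "cache_write_1h": 0.23,
             "cache_read": 3.85, "output": 0.07},
}

# Reported cost per bucket, USD.
REPORTED = {
    "006b": {"input": 0.00, "cache_write_5m": 9.86, "cache_write_1h": 38.17,
             "cache_read": 35.96, "output": 50.82},
    "006c": {"input": 0.00, "cache_write_5m": 5.35, "cache_write_1h": 6.78,
             "cache_read": 5.77, "output": 5.07},
}


def reported_total(exp: str) -> float:
    """Sum of the reported per-bucket costs."""
    return round(sum(REPORTED[exp].values()), 2)


def repriced_total(exp: str) -> float:
    """Rounded token counts times list rates; differs slightly due to rounding."""
    return sum(TOKENS_M[exp][b] * OPUS_RATE[b] for b in OPUS_RATE)


def cost_shares(exp: str) -> dict[str, float]:
    """Each bucket's percentage of the reported total."""
    total = sum(REPORTED[exp].values())
    return {b: round(100 * c / total, 1) for b, c in REPORTED[exp].items()}


if __name__ == "__main__":
    for exp in REPORTED:
        print(exp, reported_total(exp), round(repriced_total(exp), 2), cost_shares(exp))
```

The repriced totals will not match to the cent, since the token counts are rounded to 0.01 M. Note that this split is by billing bucket only; it does not by itself separate parent-session tokens from sub-agent tokens.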
06 Practical takeaways
- Scheme=python is the production architecture. At $0.80–$1.13 per scene-run it scales: ten scenes is on the order of $10.00; a hundred is on the order of $100.00. The Python coordinator's cost is mostly the actor work, because there is no orchestration layer to bill.
- Scheme=subagent via CC is a development architecture. $26.96 at Sonnet rates per scene-run is fine for occasional runs while tuning actor system prompts or scene briefs — when the adaptive CC tool-loop is genuinely useful. Don't use it for batch experiments.
- Scheme=monolithic via CC is in the middle. $4.59 at Sonnet per scene-run is plausible for ablation experiments where the cheap-and-quick "what if there were no isolation" comparison is what you want.
- The model is a smaller lever than the architecture. Opus → Sonnet saves 5.00×. Python → CC orchestration costs 5.74× (monolithic) or 33.70× (subagent). If cost matters, fix the architecture first.
- The cost difference is not a quality signal. A more expensive run is not automatically a better run. The audit verdicts (cross-character leakage, locked-fact compliance, voice fidelity) live separately in each experiment's audit.md.
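The batch figures in the takeaways follow from multiplying the per-scene cost by the number of scenes. A minimal sketch, assuming every scene costs about the same as the single measured scene (an assumption the study has not tested) and using illustrative names:

```python
# Sonnet-normalised cost per scene-run (USD) as a (low, high) range per scheme.
# The python range spans the two python-coordinator runs (006a and 007).
COST_PER_SCENE = {
    "python": (0.80, 1.13),
    "monolithic": (4.59, 4.59),
    "subagent": (26.96, 26.96),
}


def batch_cost(scheme: str, n_scenes: int) -> tuple[float, float]:
    """Project a (low, high) batch cost, assuming cost is linear in scene count."""
    low, high = COST_PER_SCENE[scheme]
    return round(low * n_scenes, 2), round(high * n_scenes, 2)


if __name__ == "__main__":
    for scheme in COST_PER_SCENE:
        for n in (10, 100):
            low, high = batch_cost(scheme, n)
            print(f"{scheme:<10} x{n:<3} ${low:.2f} to ${high:.2f}")
```

At a hundred scenes the python path stays near $100 while the subagent path reaches the thousands, which is the practical reason to keep CC orchestration out of batch experiments.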
006a and 007: token counts and costs are pulled from the API providers' own console logs (Anthropic console for 006a, OpenRouter console for 007). Bills reported by the user.
006b and 006c: the Anthropic console has no record of these runs — they were executed by a Claude Code session, which records every API call's usage object (input_tokens, cache_creation, cache_read, output_tokens) in the session's local JSONL transcript at ~/.claude/projects/<project>/<session>.jsonl for parent calls and ~/.claude/projects/<project>/<session>/subagents/agent-*.jsonl for spawned Task sub-agents. Totals are summed across the parent + all sub-agents for each experiment session. Costs apply Anthropic Opus 4.7 list rates.
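The reconstruction described above amounts to walking the transcript files and summing the usage counters. The sketch below shows one way to do that. It assumes each JSONL line is a JSON object carrying the API usage object at `message.usage`, with the Anthropic API field names `cache_creation_input_tokens` and `cache_read_input_tokens`; that record layout is an assumption, not something this report confirms, and the function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

# Usage fields to total. Assumed to follow the Anthropic API usage object.
USAGE_KEYS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "output_tokens",
)


def sum_usage(jsonl_path: Path) -> Counter:
    """Sum usage counters over one transcript file (assumed record layout)."""
    totals: Counter = Counter()
    with jsonl_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            message = record.get("message")
            usage = message.get("usage") if isinstance(message, dict) else None
            if not isinstance(usage, dict):
                continue  # records without a usage object contribute nothing
            for key in USAGE_KEYS:
                totals[key] += usage.get(key) or 0
    return totals


def sum_session(parent: Path) -> Counter:
    """Parent transcript plus every agent-*.jsonl under <session>/subagents/."""
    totals = sum_usage(parent)
    subagent_dir = parent.with_suffix("") / "subagents"
    if subagent_dir.is_dir():
        for path in sorted(subagent_dir.glob("agent-*.jsonl")):
            totals += sum_usage(path)
    return totals
```

This naive sum does not de-duplicate records, so if a transcript repeats the same API call the totals would be inflated. It also does not split cache writes by TTL, which the per-bucket table needs.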
Normalisation: Opus normalisation multiplies a Sonnet bill by 5.00 (Anthropic prices Opus at a uniform 5.00× Sonnet across input, cache, and output buckets). Sonnet normalisation divides an Opus bill by 5.00. 007's normalisation re-costs its OpenRouter token mix (~0.93 M input, ~0.04 M output) against Anthropic list rates, assuming lib/run_dialogue.py-style cache_control on ~88% of input. The mid-point estimates ($5.65 at Opus, $1.13 at Sonnet) sit inside a range that reflects how much of the input would actually have been cache-eligible.
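The sensitivity of the 007 re-costing to the cache-eligible share can be sketched as follows. This is a simplified model, not the calculation behind the reported mid-points: it assumes all cache-eligible input is billed at the cache-read rate and ignores cache-write charges, and it uses the rounded token counts quoted above. With those inputs it lands close to, but not exactly on, the reported figures. The function name is hypothetical.

```python
# Sonnet list rates, USD per million tokens (subset needed for this model).
SONNET = {"input": 3.00, "cache_read": 0.30, "output": 15.00}
OPUS_OVER_SONNET = 5.00  # uniform multiplier stated in the text


def recost_at_sonnet(input_m: float, output_m: float, cached_fraction: float) -> float:
    """Re-cost a token mix at Sonnet rates.

    Simplifying assumption: the cached share of input is billed entirely as
    cache reads; cache-write charges are ignored.
    """
    cached = input_m * cached_fraction
    fresh = input_m - cached
    return (
        fresh * SONNET["input"]
        + cached * SONNET["cache_read"]
        + output_m * SONNET["output"]
    )


if __name__ == "__main__":
    # Sweep the cache-eligible share around the ~88% assumed in the text.
    for fraction in (0.70, 0.80, 0.88, 0.95):
        sonnet = recost_at_sonnet(0.93, 0.04, fraction)
        print(f"{fraction:.0%} cached: ${sonnet:.2f} at Sonnet, "
              f"${sonnet * OPUS_OVER_SONNET:.2f} at Opus")
```

The sweep shows why the report gives a range rather than a single number: fresh input costs ten times as much as a cache read, so a few points of cache eligibility move the estimate noticeably.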