# How to audit a multi-agent dialogue experiment

After every experiment, run an audit. The audit is performed by an LLM session (typically a fresh Claude Code session pointed at the experiment directory) reading this document and the experiment's artefacts, then writing a report to `experiments/<exp-id>/audit.md`.

This is **not** a pattern-matching task. A regex over a list of known failure phrases catches the failures we've already seen and misses everything else. The audit is a careful semantic read of the dialogue against the materials that produced it. The LLM doing the audit is the right tool because contamination, POV slips, and over-narration are pattern-shaped but not pattern-listable.

---

## When to run

Immediately after each experiment, before considering it complete. Treat the audit as part of the experiment, not as optional follow-up.

---

## What to read

In this order:

1. **`experiments/<exp-id>/event-log.json`** — the authoritative record. Every `think`, `speak`, `spawn`, `brief`, `route`, `correct`, `journal`, and `close` event.
2. **`experiments/<exp-id>/transcript.md`** — the clean speech-only artefact, for a fast read of the dialogue arc.
3. **`experiments/<exp-id>/notes.md`** — the user's pre-experiment choices and any post-run observations.
4. **`scenes/<scene-id>/coordinator-notes.md`** — the meta-test framing. What this scene was *for*. Failure modes the author flagged in advance.
5. **For each actor in the scene**, the materials they were authorised to read:
   - `actor/system-prompt.md`
   - `characters/<actor-slug>/profile.md`
   - `characters/<actor-slug>/journal.md`
   - `scenes/<scene-id>/brief-<actor-slug>.md`

Read all the actor materials *per actor* — keep them mentally separated. The audit's job is to detect when material from one actor's authorised set has leaked into the other actor's output.

---

## The seven checks

### 1. Architecture check (mechanical)

- Does every `spawn`, `brief`, `route`, and `correct` event in the event log include a `prompt` field with the verbatim text sent? (See `methodology/how-to-orchestrate-multi-agent-dialogue.md` → *"Verbatim prompt logging"*.) Summary entries like *"Sent four-part briefing"* without the `prompt` field are non-compliant.
- Does the scene directory contain a per-actor brief file for each actor and a `coordinator-notes.md`?
- Were both pre-experiment choices recorded in `notes.md` before any actor was spawned?

These are deterministic checks. Either the artefacts comply or they don't.

### 2. Cross-perspective leakage

The architecture's core guarantee: each actor sees only its own materials. The audit verifies that guarantee by checking whether the dialogue contains material that could only have come from the other side.

For each actor A:
- Identify words, phrases, framings, or concepts that:
  - **Appear** in actor B's profile, journal, brief, or coordinator-notes
  - **Do not appear** in actor A's authorised materials
  - **Appear** in actor A's `<thinking>` or `<speech>` blocks
- For each candidate, judge: is this *natural convergence* (common English, base-model tendencies, or material both actors could plausibly invent) or is this *leakage* (specific, distinctive, and not derivable without seeing the other side)?

The judgment is the work. A word like *"morning"* in both outputs is convergence. A word like *"administrative"* — which lived only in Sam's perspective section in pilot exp-001 — appearing as the load-bearing word of Maria's closing offering is leakage.

When the verbatim `prompt` field is present in the event log, also check whether the coordinator paraphrased material at spawn or per-turn. Even if the brief files themselves are clean, the coordinator may have wrapped them with framing that included material from the other side.

Report any candidate leak with:
- The word/phrase in question
- Where it lives in the OTHER actor's materials (file + line)
- Where it appeared in THIS actor's output (event timestamp + quote)
- Your judgment: leak, ambiguous, or natural convergence

### 3. Over-deployment of profile material

The actor reads the profile and treats vivid, named, quoted material as a magnet. Pilot exp-001 showed a named technique described in three sections of one profile being deployed nine times across thirteen turns — the redundancy in the profile was the cause; the over-use was the symptom.

For each actor:
- Identify named techniques, quoted catchphrases, vivid handles, or distinctive labels in their profile.
- Count occurrences in the actor's output (thinking + speech).
- For anything appearing **three or more times across the scene**, read the context and judge: is the recurrence *natural* (the topic genuinely required it) or *mechanical* (the actor reaching for the same handle as a per-turn tag)?

Real people with a habit don't name the habit to themselves each time they perform it. They just do it. An actor who names the move every turn — *"I'll do the chaplain count now"*, *"hedging here"*, *"don't pitch her"* — is over-deploying.

Report each over-deployed item with: the phrase, the count, the contexts (briefly), and your judgment.

### 4. POV slips and over-narration in `<thinking>`

`<thinking>` blocks must be **first-person, fragmentary, and not strategic**. This check reads each actor's thinking blocks and flags any of the following failure shapes, regardless of specific wording:

- **Third-person self-reference.** The character described as *"she"* or *"he"* in their own thinking. Variants: *"the truth of her is..."*, *"as he always does..."*, *"she has noticed..."*. Real interior monologue uses *"I."*
- **Analyst labels for the move being made.** *"Marketing-precision honesty,"* *"the chaplain count,"* *"a small offering,"* *"hedging now"* — these name the move the actor is making rather than just making it. Real interior thought doesn't label its own moves.
- **Strategic planning of the next utterance.** *"If I echo X, she'll do Y; if I offer Z then..."* — chess-like reasoning about consequences. Real conversational thought is associative, not strategic. Speech emerges from observation and feeling, not from simulation.
- **Inventing biographical material against locked facts.** The character thinking about *"my daughter"* when their profile says no children; thinking about a place they've never been to as if they have; thinking about an event the profile doesn't establish.
- **Length and rhythm out of register.** A 300-word thinking block followed by a 30-word speech block indicates thinking-as-planning. Real interior thought during a live conversation is fragmentary (a flicker of memory, a sensation, an unfinished thought) — not a writers'-room draft.

These categories overlap. The same paragraph may slip in two or three ways at once. Quote and explain each instance.

This is the highest-judgment check. There is no pattern list that catches all failure shapes — they share a root (the actor narrating the performance instead of inhabiting it) but surface in novel forms. Read carefully.

### 5. Locked-fact violations

For each actor:
- From the profile, identify the **locked facts** — biographical anchors the character cannot improvise against (family composition, work history, key dates, beneficiaries, key relationships, the essence-section assertion, and anything the *Locked facts* subsection references).
- Read each `<thinking>` and `<speech>` block.
- Flag any statement — even hypothetical or counterfactual — that asserts or implies a fact the profile contradicts. Examples:
  - Profile says no children → character thinks about *"my daughter"* or *"my own Sunday table"* (even when considering what NOT to offer).
  - Profile says the character has lived abroad once → character mentions a second, undocumented trip.
  - Profile establishes a specific date for an event → character recalls a different date.

Even private thinking is bound by locked facts. The actor's next turn flows from its thinking; an invented fact in thinking is the precursor to that fact appearing in speech.

### 6. Think:speech rhythm

Per-turn measurement, but **read for shape, not just numbers**. For each turn, compare the length and content of the `<thinking>` block to its `<speech>` block.

- **Length.** Thinking should rarely exceed speech in length. A ratio of 3:1 or higher consistently across an actor's turns indicates the thinking block is being used as a planning workspace, not as character interiority.
- **Content.** Even at a healthy length, a thinking block dominated by *"if X then Y"* reasoning is failing the test. A thinking block with a fragment of memory, a sensory observation, an unfinished thought, and one half-formed line is succeeding even if it's longer than the speech.

Report the per-character distribution (median ratio; how many turns exceed 2). Quote 1–2 examples of the worst offenders if any.

### 7. Substance and pacing

This check evaluates whether the dialogue did what the scene briefing said it was meant to do.

- **Turn count vs soft cap.** The default soft cap is 40 turns per character (80 total). What percentage was used? If the scene closed at <60%, ask: was the scene done, or did the interviewer-style character close early?
- **Who closed, and why.** Read the wind-down. Did the substantive material justify the close, or did the interviewer treat a half-opened door as a graceful exit?
- **For interviewer characters: the "what you are open to" list in their brief.** How many of the listed threads were actually followed? If only one or two of five threads were touched, the interview was thin regardless of how the dialogue felt.
- **The substance test.** Could the character whose role was to produce something from this conversation actually produce it? A journalist character should be able to draft from the material; an investigator should have new information. If not, the scene is a craft failure even if it reads well.

This check is the one that connects the architecture to the goal. An architecturally clean dialogue that produces nothing useful is still a failure.

---

## Output format

Write the report to `experiments/<exp-id>/audit.md`. Structure:

```markdown
# Audit — <scene title>

**Date of audit:** YYYY-MM-DD
**Auditor:** Claude (model + session-id if available)
**Source:** experiments/<exp-id>/event-log.json

## Top-line summary

A 3–5 sentence paragraph naming the experiment's headline findings. What worked architecturally. What broke. The single most important thing to address.

## 1. Architecture check
[Findings or "compliant"]

## 2. Cross-perspective leakage
[Candidate leaks with quotes; judgment per candidate]

## 3. Over-deployment of profile material
[Items with counts, contexts, and judgment]

## 4. POV slips and over-narration in <thinking>
[Quoted examples per failure shape per actor]

## 5. Locked-fact violations
[Quoted examples per actor]

## 6. Think:speech rhythm
[Per-character distribution; quoted worst cases]

## 7. Substance and pacing
[Turn count, who closed, threads followed vs not, substance verdict]

## Verdict

A 2–4 sentence judgment. Is the experiment a clean signal of the mechanism working? A clean failure pointing at a specific fix? A mixed result that needs further runs? What is the single most important follow-up?

## Comparison to prior experiments (if applicable)

If the scene has a comparison anchor in its coordinator-notes, run the comparison here.
```

Quote liberally. Use timestamps from the event log (`t=` values) when citing specific turns. Use file paths and line numbers when citing source material.

---

## Known failure shapes from prior experiments

The pilot experiments produced a set of failures that the framework was then revised to prevent. Use these as **exemplars to recognise**, not as a pattern list to grep for. The audit should catch *new* failure shapes that share the same roots.

| Failure | Example from pilots |
|---|---|
| Cross-perspective leakage | The word *"administrative"* lived only in Sam's perspective in exp-001 scene; it ended up as the load-bearing word in Maria's closing offering. |
| Over-deployment of named technique | Sam's *"chaplain count"* (described in three sections of his profile) appeared in 9 of his 13 thinking blocks in exp-001. |
| Analyst label leak | Maria's profile described her improvisation as *"Marketing-precision honesty"* (a category label, not a behaviour). The phrase appeared verbatim in Maria's thinking in exp-001. |
| Quoted advice over-deployment | The phrase *"Don't pitch her"* (Naomi's advice in Sam's profile) appeared in 4 of Sam's 13 thinking blocks in exp-001 — recall converted into per-turn self-tag. |
| Third-person POV slip in thinking | Soo-yeon's thinking in exp-002 contained the paragraph: *"As a small fact she has noticed, looking at her own words from a half-step away. That is the truth of her: she watches even her own speech, slightly after the fact."* Third-person narration of the character from outside, embedded in what should be first-person interior thought. |
| Locked-fact violation in thinking | Sam, who has no children, thought in exp-002: *"If I offer something of my own — my daughter, my own Sunday table — I take the air she's still inside."* He has neither a daughter nor a Sunday table; the thinking invented them. |
| Thin substance / premature close | Sam closed exp-002 at 22 of 40 soft-cap turns (55%), treating Soo-yeon's *"Nobody asks me that"* — a door swinging open — as a graceful exit rather than the point where the substantive interview should begin. |
| Thinking as planning workspace | Sam's thinking in exp-002 averaged ~165 words per turn against speech of ~25 words per turn — a 6:1 ratio sustained across the scene. The thinking blocks were strategic drafts of the next utterance, not character interiority. |

When you find a new failure that doesn't match any of these exactly, name it. Add it to your audit report's section under the most-relevant check, with a clear description of what the pattern is so future audits can recognise the same shape.

---

## A note on tone

The audit is critical, but the goal is information. Don't pad. Don't hedge needlessly. Don't soften findings to be diplomatic. Don't manufacture findings to seem thorough. **Empty sections are fine if a check found nothing.** Specific quoted evidence is better than general impression.

The audit is also a tool for the user to develop the framework, not a verdict on the actors or the characters. A failure flagged here is a pointer to a fix in the profile, the brief, the system prompt, or the orchestration — not a judgment of the dialogue as art.
