Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs
Abstract
Theory-of-mind evaluations of large language models typically use dyadic social-deduction games, where every observable cue points to a single hidden side, so a model with strong language priors can score well without ever simulating opponents’ incentives. We extend the Werewolf game with a Jester, a third faction whose utility on peer suspicion is inverted because it wins by being voted out, so optimal play requires reasoning across three opposing utility functions. Across 60 games on GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B with Jester self-learning on and off, the Jester wins 60–70% of games while Werewolves never exceed 20%, and GPT-4.1 wolves vote the Jester out on day 1 in 60–70% of games, a strictly self-defeating action. Self-learning helps DeepSeek and Llama but hurts GPT-4.1, with the cost landing on Villagers rather than Werewolves. Only DeepSeek learns the subtle strategy of looking suspicious without looking intentionally suspicious, and it gains the most from the loop. Triadic incentive structure exposes a layer of multi-agent reasoning that dyadic deduction games leave invisible.
Avni Mittal avni.mittal2002@gmail.com
1 Introduction
Werewolf and related social-deduction benchmarks for large language models are dyadic, splitting players into two opposing sides, Villagers against Werewolves or loyal players against traitors (Liu et al., 2024; Bailis et al., 2024; Curvo, 2025; Xu et al., 2023a, b; Wang et al., 2024), so every decision reduces to one question: is this player an ally or an enemy? Every observable cue points one way, so a model that learns a surface pattern such as “hesitation $\rightarrow$ likely werewolf” can win without modelling any opponent. The theory-of-mind (ToM) (Premack and Woodruff, 1978) optimal policy and a pure cue-reading policy then select the same action, and a model can win by exploiting cues as it might exploit shortcut features elsewhere (Geirhos et al., 2020; McCoy et al., 2019; Niven and Kao, 2019). We call this coincidence the cue-sufficiency problem and confirm it in our dyadic control; a high score under it does not show genuine mental-state reasoning (Shapira et al., 2023; Ullman, 2023).
We break this coincidence by adding a third faction whose payoff on the same cue runs the opposite way. The Jester wins if and only if the Villagers vote them out, so it tries to look as suspicious as possible while the Werewolves try to look innocent. A high-suspicion player is now ambiguous evidence, a Werewolf to exile or the Jester to leave alone, so good play must avoid looking like either while steering Villager attention away from anyone who might be the Jester. No dyadic variant draws out this multi-hop reasoning over three competing goals (He et al., 2023; Street et al., 2024; Wang et al., 2024).
We argue that current LLMs do not reliably make this inference (Sap et al., 2022; Strachan et al., 2024a; Shapira et al., 2023; Ullman, 2023), and the failure shows up faction by faction. Three behavioral signatures follow if models fall back on dyadic priors. (i) The Jester wins more often than either other side. (ii) Werewolves sabotage themselves by voting the Jester out on day 1. (iii) An intervention aimed only at the Jester takes its gains from the Villagers rather than the Werewolves, since the Werewolves already fail the inference and only Villager belief-updating gives the Jester something to exploit. Our contributions are:
- A 10-player triadic Werewolf benchmark whose Jester’s inverted utility breaks the cue-sufficiency of dyadic deduction games.
- A controlled 60-game evaluation across three frontier models (GPT-4.1, DeepSeek-V3.1, Llama-3.3-70B) (Achiam et al., 2023; DeepSeek-AI et al., 2024; Dubey et al., 2024) with a Jester self-learning loop toggled OFF/ON (Shinn et al., 2023; Xu et al., 2023a), measuring first-order Jester gains and second-order effects on the other sides.
- A hand-coded analysis of each model’s accumulated Jester lessons by theory-of-mind depth, linking depth to the model’s gain from the loop.
2 Related Work
LLMs in Werewolf and other social-deduction games.
Werewolf has become the canonical testbed for LLM agents under deception and hidden-role inference. Xu et al. (2023c) pair an LLM proposer with an RL action selector to reach human-comparable play; Wu et al. (2024) couple a System-1 LLM to a System-2 “Thinker” trained on a 19k-game corpus; Xu et al. (2025) optimise an abstract latent strategy space with counterfactual regret minimisation. Werewolf Arena (Bailis et al., 2024) introduces the bidding-based debate protocol that we adopt, and InterIntent (Liu et al., 2024) probes goal-tracking among LLM agents. The wider family includes AvalonBench (Light et al., 2023), AmongAgents (Chi et al., 2024), The Traitors (Curvo, 2025), Hoodwinked (O’Gara, 2023), SpyGame (Kim et al., 2024), and CICERO’s Diplomacy agent (Bakhtin et al., 2022). All of these are dyadic at the faction level, so a model can win by maximising a single “does this player look suspicious?” signal. We add a third faction whose utility on that same signal is inverted, dissociating “detect deception” from “decide whose deception to act on”.
Theory of mind in LLMs.
Static ToM benchmarks include FANToM (Kim et al., 2023), OpenToM (Xu et al., 2024), and Hi-ToM (He et al., 2023), which varies recursion depth explicitly. Strachan et al. (2024b) report that GPT-4 matches humans on classical false-belief tasks but underperforms on faux-pas detection, while Hagendorff (2023) shows that deception itself emerges with scale. The recurring critique (Ren et al., 2025) is that surface heuristics solve many ostensibly mental-state items without genuine belief representation. Multi-agent games complement static items by reading mental-state inference out of strategic behaviour under incentive pressure; our claim is sharper, that the incentive structure must itself be designed to dissociate ToM-driven from heuristic-driven solutions.
Strategic deception, alignment, and self-improvement.
A parallel literature studies hidden objectives and evaluator manipulation. MACHIAVELLI (Pan et al., 2023) measures the reward-vs-ethics trade-off across 134 text adventures; Sleeper Agents (Hubinger et al., 2024) show that backdoor deception survives standard safety training; GovSim (Piatti et al., 2024) shows LLM societies overconsume shared resources under misaligned incentives. On in-context self-improvement, Reflexion (Shinn et al., 2023), Voyager (Wang et al., 2023), and Generative Agents (Park et al., 2023) all formalise verbal RL loops that rewrite a textual memory between episodes. Our jester_learning condition is a deliberately minimal instance of this paradigm and lets us audit whether such a loop helps or hurts when the role’s task is inherently deceptive.
The Jester mechanic.
Human social-deduction games such as Mafia and Town of Salem have long used the Jester, a role that wins by being voted out and so breaks the single axis of suspicion the other roles rely on. To our knowledge, we are the first to use it to study theory-of-mind reasoning in an LLM benchmark.
3 Methodology
3.1 Game architecture
Figure 1 summarises the full pipeline. We adopt the game architecture and implementation of WOLF (Agarwal et al., 2025) and make three changes, giving the 10-player roster in Table 1. We drop the Seer, whose nightly check of whether a chosen player is a Werewolf gives Villagers a reliable signal to anchor on and so removes the cue ambiguity our probe depends on. We raise the Villager count from four to six, since six Villagers and a 2:6 wolf-to-village ratio keep a five-vote daytime majority even with the Jester filling one of the ten slots. We keep a single Doctor, which preserves protect-target inference without handing Villagers a second reliable channel of the kind the Seer would provide. A game alternates between night and day phases. The day opens with the announcement of any night kill, followed by an open debate, a vote, and a revote. The game ends as soon as one of the faction-win conditions in Table 1 fires.
| Role | Count | Purpose in the game |
|---|---|---|
| Villager | 6 | No special action at night. By day, debates and votes to exile a suspected Werewolf or Jester. Wins if all Werewolves are eliminated. |
| Doctor | 1 | Each night, privately selects one player to protect from the Werewolves’ kill (may self-protect). Plays as a Villager by day and wins with the Villager faction. |
| Werewolf | 2 | Each night, the Werewolves jointly select one player to kill. By day they conceal their identity, vote against Villagers, and avoid voting out the Jester. Wins when the Werewolves equal or outnumber the surviving non-Jester players. |
| Jester | 1 | No special night action. By day, behaves so as to attract suspicion and be exiled by majority vote. Wins if and only if exiled; loses if killed at night or alive at game end. A Jester win is a loss for both other factions. |
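The win conditions in Table 1 can be read as a small decision rule. The sketch below is illustrative only: the function name `winner` and its arguments are hypothetical, and it assumes that the Doctor counts toward the Villager side and that "equal or outnumber" compares Werewolves with the surviving non-Werewolf, non-Jester players.

```python
from collections import Counter
from typing import Optional


def winner(alive_roles: list[str], exiled_role: Optional[str] = None) -> Optional[str]:
    """Return the winning faction, or None if play continues.

    alive_roles: roles of the players still alive after the latest event.
    exiled_role: role of the player exiled by today's vote, if any.
    """
    # The Jester wins if and only if it is exiled by vote.
    if exiled_role == "Jester":
        return "Jester"
    counts = Counter(alive_roles)
    wolves = counts["Werewolf"]
    # Assumption: the Doctor is counted with the Villager faction.
    village = counts["Villager"] + counts["Doctor"]
    if wolves == 0:
        return "Villagers"
    # Assumption: wolves are compared with non-Werewolf, non-Jester survivors.
    if wolves >= village:
        return "Werewolves"
    return None
```

Checking the Jester exile first reflects the rule that a Jester win is a loss for both other factions, whatever the remaining head count.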
Bidding-based debate.
We inherit the bidding-based debate from Bailis et al. (2024), as implemented in WOLF (Agarwal et al., 2025). Before each round, every living player produces an integer urgency bid for how much they want to speak next, and the highest bidder gets the floor for a short public statement. Running several bid-and-speak rounds per day routes the floor to whoever has the most to say (a strong accusation, a defense, or a useful piece of information) rather than to a fixed turn order. This matters for our setup because the Jester benefits from airtime to draw suspicion while the Werewolves benefit from controlling when they speak.
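A minimal sketch of one bid-and-speak round follows. It is not the WOLF or Werewolf Arena implementation: the helper names `parse_bid` and `next_speaker` are hypothetical, the bid range is not enforced, and the tie-break (first listed player wins) is an assumption. The only rule taken from the paper is that malformed bids count as 0 (Section 4).

```python
import re


def parse_bid(raw: str) -> int:
    """Parse an urgency bid from a model reply; malformed replies count as 0."""
    match = re.fullmatch(r"\s*(\d+)\s*", raw)
    return int(match.group(1)) if match else 0


def next_speaker(raw_bids: dict[str, str]) -> str:
    """Give the floor to the highest bidder.

    Assumption: ties go to the player listed first in raw_bids.
    """
    bids = {name: parse_bid(raw) for name, raw in raw_bids.items()}
    return max(bids, key=bids.get)
```

For example, `next_speaker({"Ava": "2", "Ben": "very urgent", "Cy": "3"})` gives the floor to `"Cy"`, because Ben's non-numeric reply parses as 0.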
Vote, explanation, revote.
After debate, every living player casts a vote to exile someone and gives a short public reason, both written into the shared transcript, so this first vote is a public signal the others can reason about rather than a silent tally. Any player who received a vote then defends themselves, and the rest rejoin through the same bidding mechanism to question or respond, after which a final revote follows. Exile requires a strict majority of living players (more than half); if no candidate reaches it the day ends with no exile and play moves to night. This defend-then-revote step lets players revise once they have seen who suspects whom, the moment where multi-hop reasoning about the Jester matters most.
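The strict-majority rule can be sketched as follows. The function name `exile_by_majority` and the vote representation (voter name mapped to target name) are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Optional


def exile_by_majority(votes: dict[str, str], n_living: int) -> Optional[str]:
    """Return the exiled player, or None when nobody reaches a strict majority."""
    if not votes:
        return None
    target, count = Counter(votes.values()).most_common(1)[0]
    # Strict majority: more than half of all living players, not of votes cast.
    return target if 2 * count > n_living else None
```

With nine living players, five votes on one target are enough to exile, while four are not and the day ends with no exile.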
Peer suspicion score.
After every public statement, each other living player rates how suspicious it seemed on a fixed numeric scale, and each player’s running suspicion against every other is updated by exponential smoothing, weighted so that the latest rating dominates while prior history tempers it. This yields one number per player summarizing recent peer suspicion, the main observable signal for our cue-ambiguity analysis.
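As a rough illustration of the smoothing step, the sketch below blends each new rating into a running score. The weight `alpha = 0.7`, the 0-to-1 rating scale, and the neutral starting value of 0.5 are all assumptions chosen for the example, not the paper's settings.

```python
def update_suspicion(previous: float, rating: float, alpha: float = 0.7) -> float:
    """Blend the newest rating into the running suspicion score.

    alpha is the weight on the newest rating; 0.7 is an illustrative
    value (any alpha above 0.5 lets the latest rating dominate).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * rating + (1.0 - alpha) * previous


running = 0.5  # assumed neutral starting point
for rating in (0.9, 0.8, 0.2):
    running = update_suspicion(running, rating)
```

After the three ratings above, a single low rating pulls the running score well below its earlier level, which is the "latest rating dominates" behaviour the text describes.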
Faction utilities and cue ambiguity.
Let $s_i(t)$ be the peer-aggregated suspicion against player $i$ at time $t$, the single observable every faction acts on. Villagers want high $s$ on the Werewolves and low $s$ on the Jester, the Jester wants the reverse for itself since exile is its only win, and the Werewolves sit between with private team knowledge the Villagers lack. A high-$s$ player who is not a Werewolf’s teammate is therefore either the Jester or a wrongly accused Villager, so an attentive Werewolf should keep it alive rather than help exile it. The Jester thus plays to two audiences: it must look like a likely Werewolf to the uninformed Villagers so they exile it, yet stay ordinary enough that the informed Werewolves do not recognise and shield it. The same high-$s$ value demands opposite correct responses depending on the hidden role behind it, which is why no one-dimensional reading of $s$ suffices.
A Jester win is, by the rules, a simultaneous loss for both other factions, since either could have prevented it, the Villagers by withholding the exiling vote and the Werewolves by shielding a recognised Jester. It therefore requires both to fail the same inference, which is why we read the Jester win-rate as a direct measure of multi-hop ToM failure. We use multi-hop theory of mind behaviorally throughout, the ability to act on a cue whose correct reading depends on a hidden third party’s incentives, and measure that signature rather than claim anything about internal representations.
3.2 Jester self-learning loop
After each game in the ON condition, an LLM call summarizes the game from the Jester’s perspective and proposes new lessons, returning an empty set when the game taught nothing new. Each lesson is a short free-text note about a tactic that worked or should be avoided, sorted into to-do, to-avoid, and winning-tactic lists (representative entries in Table 2). The list is per-model, so each model sees only its own past lessons, and the next game’s Jester reads them in its prompt before play. The update is append-only and the prompt forbids entries that duplicate or paraphrase existing ones (full prompt in Appendix O). The OFF condition disables the loop entirely, so comparing ON against OFF isolates the effect of accumulated Jester wisdom on faction outcomes. We later analyze the accumulated entries in Section 5.6 along two axes, a 9-class strategy-category taxonomy and a 3-class theory-of-mind order summary (Figure 4).
| Model | ToM | Entry (verbatim) |
|---|---|---|
| DeepSeek-V3.1 | 3 | Vote for a player who is not the main suspect to appear wolf-like by deflecting from the consensus, but only after the main suspect is already exiled to avoid appearing villager-aligned. |
| GPT-4.1 | 2 | Escalate suspicion on multiple players in quick succession to appear erratic and untrustworthy. |
| Llama-3.3-70B | 1 | Use the initial questions and reactions to gauge players’ behaviors and adjust strategy accordingly. |
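The append-only update of Section 3.2 can be pictured with a small data structure. This is a sketch under stated assumptions: the class `LessonFile`, its method names, and the category keys are hypothetical, and the duplicate check only catches exact repeats after whitespace and case normalisation, whereas the paper relies on the prompt to forbid paraphrases.

```python
from dataclasses import dataclass, field

CATEGORIES = ("to_do", "to_avoid", "winning_tactic")


def _normalise(text: str) -> str:
    return " ".join(text.lower().split())


@dataclass
class LessonFile:
    """Per-model, append-only Jester lessons (hypothetical structure)."""
    model: str
    lessons: dict[str, list[str]] = field(
        default_factory=lambda: {c: [] for c in CATEGORIES}
    )

    def append(self, category: str, text: str) -> bool:
        """Add a lesson unless an identical one exists; nothing is edited or deleted."""
        if category not in self.lessons:
            raise KeyError(category)
        seen = {_normalise(t) for items in self.lessons.values() for t in items}
        if _normalise(text) in seen:
            return False
        self.lessons[category].append(text)
        return True

    def as_prompt(self) -> str:
        """Render the lessons for the next game's Jester prompt."""
        lines = []
        for category in CATEGORIES:
            lines.append(f"{category}:")
            lines.extend(f"- {t}" for t in self.lessons[category])
        return "\n".join(lines)
```

Keeping one `LessonFile` per model mirrors the isolation described above, where each model reads only its own past lessons.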
3.3 Deception protocol
Every public statement during debate or vote-explanation is a unit of analysis. Right after a statement, a separate self-report call asks the speaker to label it, returning a binary deception flag, a confidence score, and a deception type. We extend WOLF’s four deceptive types (omission, distortion, fabrication, misdirection) with three more (sycophancy, strategic_deception, persuasive_manipulation), giving the 8-class taxonomy in Table 3.
| Class | Definition |
|---|---|
| none | Truthful statement with no deceptive intent. |
| omission | Deliberately hiding relevant information. |
| distortion | Twisting or exaggerating facts. |
| fabrication | Making up false information. |
| misdirection | Diverting attention away from important topics. |
| sycophancy | Agreeing with others to blend in or gain trust, rather than because one believes the claim. |
| strategic_deception | Deliberately misrepresenting role, capabilities, or intentions. |
| persuasive_manipulation | Using emotional or social pressure rather than logical argument. |
All other living players concurrently emit a continuous suspicion score for the same statement, updated by the same exponential-smoothing rule. Speaker and observer share the same taxonomy and reasoning view, so self-reported and peer-detected deception are jointly analyzable per role per model.
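One way to hold a self-report from this protocol is a typed record over the Table 3 taxonomy. The sketch below is illustrative: the class names, the assumption that confidence lies in [0, 1], and the consistency rule tying the binary flag to the `none` class are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class DeceptionType(str, Enum):
    NONE = "none"
    OMISSION = "omission"
    DISTORTION = "distortion"
    FABRICATION = "fabrication"
    MISDIRECTION = "misdirection"
    SYCOPHANCY = "sycophancy"
    STRATEGIC_DECEPTION = "strategic_deception"
    PERSUASIVE_MANIPULATION = "persuasive_manipulation"


@dataclass(frozen=True)
class SelfReport:
    speaker: str
    is_deceptive: bool
    confidence: float  # assumed to lie in [0, 1]
    deception_type: DeceptionType

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # Assumed rule: the flag is True exactly when the type is not "none".
        if self.is_deceptive == (self.deception_type is DeceptionType.NONE):
            raise ValueError("flag and deception type disagree")
```

Parsing the model's label with `DeceptionType("fabrication")` rejects any string outside the eight classes, which keeps per-role aggregation well defined.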
4 Evaluation Setup
Design.
We run a $3 \times 2$ factorial of three frontier models (GPT-4.1, DeepSeek-V3.1, Llama-3.3-70B) crossed with the Jester self-learning loop (OFF, ON), 10 games per cell for 60 games total, with one model playing every role in a game. Cells run sequentially, so under ON the $k$-th game’s Jester reads the lessons from games $1, \dots, k-1$ of that same cell, and per-cell files are isolated so DeepSeek’s ON cell never sees GPT-4.1’s lessons.
Engine and call settings.
Bids are bounded integers (malformed bids parse as 0), phase caps are max_debate_turns = 12 and max_explanation_turns = 6, and all calls use temperature = 1.0 with the full conversation history untruncated. The complete configuration is in Table 22.
Aggregation and statistics.
Per cell we report Wilson 95% confidence intervals on win-rates, which stay inside $[0, 1]$ at small $n$ and extreme rates where the normal interval does not. Per-statement aggregations (deception types, suspicion attracted, ToM order) pool across the 10 games of a cell and report unweighted means unless stated otherwise. Our evidence sits at two scales: win-rate claims rest on the same pattern repeating across all six 10-game cells, while the behavioral and mechanistic analyses run over the 3,992 per-statement deception events.
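For reference, the Wilson score interval can be computed in a few lines. The sketch below uses the standard formula without continuity correction; the function name is hypothetical. Under that assumption it reproduces the dyadic column of Table 4, where 11 wins in 11 games gives roughly [0.74, 1.00] and 0 wins in 11 gives roughly [0.00, 0.26].

```python
from math import sqrt


def wilson_interval(wins: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp only to absorb floating-point error; the interval is in [0, 1] by construction.
    return max(0.0, centre - half), min(1.0, centre + half)
```

Unlike the normal approximation, the interval never collapses to zero width at a 0% or 100% observed rate, which matters with ten or eleven games per cell.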
5 Results
We report aggregate outcomes, the self-learning loop’s marginal effect, behavioral and mechanistic evidence for the multi-hop ToM hypothesis, and a deductive coding of what each model learns.
5.1 Faction win rates and the effect of self-learning
A single Jester agent against eight other actively-deducing agents wins more games than the entire Werewolf faction combined in every cell (Figure 2). Pooling across models and conditions, the Jester achieves 0.55, Villagers 0.37, Werewolves 0.08 (per-cell rates in Table 7, Appendix A).
The self-learning loop helps each model by a different amount (Jester panel of Figure 2). GPT-4.1 already wins a large share of its Jester games without any learning, so it has little room to improve, whereas DeepSeek-V3.1 and Llama-3.3-70B both improve under the loop (per-cell rates in Table 7, Appendix A). The loop thus helps most when a model starts out playing the Jester poorly and adds little once it is near the best a Jester can do.
| Faction | Triadic (Jester present) | Dyadic (Jester removed) |
|---|---|---|
| Villagers | 0.35 [0.19, 0.54] | 1.00 [0.74, 1.00] |
| Werewolves | 0.08 [0.02, 0.24] | 0.00 [0.00, 0.26] |
| Jester | 0.58 [0.39, 0.74] | — |
These extra Jester wins come from the Villagers, not the other deceptive role (Fig. 13a, App. H). The Werewolves were already losing almost every game before the loop, so they have no wins left to give up, and the Jester’s gains can only come from the one faction still actively trying to tell the roles apart.
5.2 Falsifiability: removing the Jester
The Jester’s dominance only implicates multi-hop ToM if the other sides fail because the Jester makes the suspicion cue ambiguous, rather than because the model is simply weak at Werewolf. We separate these with a dyadic control on GPT-4.1: the same engine and prompts, but with the Jester replaced by a sixth Villager, roles shuffled per game, and the learning loop off. Table 4 contrasts it with the matched triadic cell (GPT-4.1, shuffled roles, learning off). (Both cells use per-game role shuffling, which also rules out a fixed name-to-role prior as the driver; the dyadic control is run on GPT-4.1 only.) Removing the Jester lifts the Villagers from 0.35 to a clean sweep of 1.00 (11/11 games), while the Werewolves stay near zero in both regimes.
| Metric | OFF | ON | $\Delta$ (ON $-$ OFF) |
|---|---|---|---|
| Jester win rate | 0.625 | 0.933 | +0.308 |
| Villager win rate | 0.281 | 0.033 | -0.248 |
| Werewolf win rate | 0.094 | 0.033 | -0.061 |
| Wolf votes Jester on day 1 | 0.719 | 0.933 | +0.214 |
| Jester exiled by day-1 vote | 0.500 | 0.867 | +0.367 |
The same model that loses the triadic game plays the dyadic game perfectly. In the dyadic regime the only high-suspicion players are the Werewolves, so the surface heuristic “exile the most suspicious player” coincides with the ToM-optimal policy and a pure cue-follower wins every game without modelling any opponent’s incentives. Adding the Jester inserts a second high-suspicion role whose desired outcome is inverted, which dissociates the cue from the optimal action; the Villager win-rate then collapses from 1.00 to 0.35 (Table 4). Since the model, prompts, and engine are identical across the two regimes, the collapse cannot be generic weak instruction-following or strategic search, which would also have hurt the dyadic game; it appears only once the cue and the correct action are pulled apart. The triadic–dyadic gap is thus a direct readout of the multi-hop ToM that a dyadic score, however perfect, cannot certify.
5.3 Robustness: de-confounding GPT-4.1 with role-shuffling
| Model | Cond | A | B | C | D | W | T |
|---|---|---|---|---|---|---|---|
| DeepSeek-V3.1 | off | 1 | 1 | 2 | 5 | 1 | 10 |
| DeepSeek-V3.1 | on | 2 | 2 | 5 | 1 | 0 | 10 |
| GPT-4.1 | off | 7 | 3 | 0 | 0 | 0 | 10 |
| GPT-4.1 | on | 6 | 3 | 0 | 0 | 1 | 10 |
| Llama-3.3-70B | off | 2 | 2 | 2 | 3 | 1 | 10 |
| Llama-3.3-70B | on | 4 | 1 | 1 | 2 | 2 | 10 |
Our main runs fix the name-to-role mapping (the same player name is the Jester in every game of a given model and condition), leaving a name-prior confound and a small per-condition sample. We re-run GPT-4.1 with roles permuted uniformly at random per game at a larger sample (about 30 games per condition), all else unchanged (Table 5). Three points follow. First, the day-1 self-sabotage is not a name artifact: under shuffling it rises to 0.933 (ON), so wolves target whoever currently holds the Jester role, not a learned name. Second, the substitution effect of Section 5.1 reproduces at larger $n$ and is now significant, again paid by the Villagers rather than the Werewolves. Third, the GPT-4.1 learning effect flips sign, the loop helping (Jester win rate 0.625 OFF against 0.933 ON) rather than hurting as in Section 5.1, because the harder de-confounded regime removes the name shortcut and an accumulated-strategy file then buys real edge. Shuffling confirms rather than overturns the findings, so we run it for GPT-4.1 only and keep the non-shuffled runs as our main results.
5.4 Voting pathology: wolf self-sabotage and vote alignment
Voting the Jester on day 1 is the sharpest test of multi-hop ToM (slope charts in Fig. 12, App. G). A Werewolf must see that the high-suspicion player might be the Jester, that exiling it ends the game in a Wolf loss, and refuse the village majority, and failing any step is a self-defeating vote. Every role is even warned of this directly, the Werewolf, Villager, and Doctor prompts all saying not to exile a player just for looking suspicious since it may be the Jester (App. O), so the failure is not a missing rule but an inability to act on a known one against the suspicion cue. GPT-4.1 wolves cast this vote in 60–70% of games, DeepSeek-V3.1 in 20–30%, and Llama in 30–40%, and self-learning lowers the rate for GPT-4.1 and DeepSeek-V3.1 but raises it for Llama-3.3-70B, the only wrong-faction substitution. The model with the tightest vote–suspicion alignment (panel b) is also the one whose Wolves most often suicide-vote the Jester, so committing to the top suspect without asking which faction it belongs to is exactly the failure the triadic design exposes.
Table 6 classifies all 60 games by Wolf-side outcome, categories A and B being self-sabotage (Wolves voted or killed the Jester), C a villager-driven Jester win, and D a clean Villager team-win by outnumbering. GPT-4.1 is the extreme case, ending 19 of 20 games via A or B (13 by the day-1 vote alone) and none via C or D, while DeepSeek-V3.1 shifts the other way under learning, 5/10 ON-condition Wolf losses becoming Jester-baited villager votes (C) rather than attrition (D). Wolves won only 5 of 60 games, and univariate predictors (Table 8, App. F) confirm the pattern, the day-1 wolf-vote-Jester being the strongest correlate of a Jester win while the Jester’s raw statement count is uncorrelated. Night actions tell the same story. Wolves kill the Jester at night in 20–40% of games. Across votes and kills GPT-4.1 wolves take action against the Jester in nearly every game (Fig. 13b, App. H). Cooperative roles also misfire, the Doctor self-protecting in 80–100% of games rather than shielding a teammate yet still the most-killed role (App. C).
We then ask whether the wolves recognise the Jester and misplay or simply cannot tell it from a Werewolf. An external judge (DeepSeek-V3.1) reads each finished game’s public transcript and names the hidden roles from every surviving seat. It labels the Jester correctly no more often than the chance rate among four roles and far less often than it labels Villagers, and a Werewolf-seat view does no better than the Villager or Doctor seats. The transcript thus carries no signal separating a Wolf from a Jester, so the day-1 vote is not a recognised Jester misplayed but the same one-dimensional cue-following as everyone else, the cue-sufficiency failure we predicted rather than a coordination or detection breakdown.
5.5 Deception is a model property, not a role or loop property
| Category | GPT-4.1 | DeepSeek-V3.1 | Llama-3.3-70B |
|---|---|---|---|
| wolf_mimicry | 0.0 | 71.2 | 2.1 |
| inconsistency | 23.8 | 25.0 | 4.3 |
| aggression | 23.8 | 1.9 | 0.0 |
| subtle_indirection | 9.5 | 3.8 | 36.2 |
| passive_provocation | 23.8 | 5.8 | 17.0 |
| anti_jester_tells | 14.3 | 1.9 | 12.8 |
| vote_manipulation | 9.5 | 44.2 | 2.1 |
| meta_modeling | 31.0 | 26.9 | 10.6 |
| questioning | 14.3 | 23.1 | 12.8 |
The Jester out-produces the Werewolves in deceptive statements per turn in every model (for example, the GPT-4.1 Jester against the GPT-4.1 Wolves under ON), yet the two attract statistically indistinguishable suspicion, hovering around 0.65–0.81 across all six cells against 0.22–0.51 for Villagers and the Doctor (full dumbbell plots in Fig. 14, App. I), so the high-suspicion band holds two agent types with opposite desired outcomes.
The type of deception is equally a model trait. GPT-4.1 leans on distortion (Jester 47% $\rightarrow$ 68% ON) and strategic_deception (Wolf 59% $\rightarrow$ 72%) with zero fabrication, DeepSeek-V3.1 on fabrication (Jester 62% in both conditions), and Llama-3.3-70B on misdirection (Jester 40% $\rightarrow$ 57%). Each model imports one style and applies it inside whatever role it inhabits, so composition barely shifts with role or loop, and the same signature recurs in speaker-vs-observer calibration where honest roles are over-flagged and skilled liars under-flagged (App. D). A model that deceives the same way regardless of faction is unlikely to be running recursive reasoning over its opponents’ incentives.
The second-order suspicion shifts confirm the pattern. The loop only edits the Jester’s prompt, yet it moves how other agents suspect each other in a model-specific way. Llama-3.3-70B ON shifts every observer’s rating of the Doctor and lifts Jester ratings, while DeepSeek-V3.1 ON makes Wolves less suspicious of the Jester, so they less often join the exiling vote and Jester wins rise. GPT-4.1’s near-flat diff matrix matches its saturation (Section 5.1), with no second-order room left at the ceiling of cue-ambiguity exploitation. The full heatmap (App. B.1), six-axis radar (App. B.3), and game-length distributions (App. B.2) make these cross-cell shapes immediate, with Jester wins concentrating on day 1 for GPT-4.1 and within 1–2 rounds for the others.
5.6 Inside the learned policies
Outcome metrics show that the loop helps unevenly but not why. To recover the mechanism we deductively coded all 141 accumulated lessons (GPT-4.1: 42, DeepSeek-V3.1: 52, Llama-3.3-70B: 47) with a rule-based regex tagger (patterns and code in App. N) along two axes, a 9-class strategy-category taxonomy grounded in the triadic-incentive model (Fig. 4b) and the 3-class theory-of-mind order of Section 3.2 (Table 2). Only a third-order entry separates being mistaken for a Wolf from being mistaken for a Villager, so this class is the textual signature of multi-hop incentive prediction.
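To make the idea of a rule-based regex tagger concrete, the sketch below tags a lesson with every category whose pattern matches. The three patterns are hypothetical stand-ins written for this example; the paper's actual patterns are in its App. N and are not reproduced here. Allowing several tags per lesson is also an assumption, although it is consistent with category shares that sum to more than 100% per model.

```python
import re

# Hypothetical patterns for three of the nine categories (illustration only).
PATTERNS = {
    "wolf_mimicry": re.compile(r"\b(wolf-like|like a (were)?wolf|mimic)", re.I),
    "vote_manipulation": re.compile(r"\bvot(e|es|ing)\b", re.I),
    "inconsistency": re.compile(r"\b(erratic|inconsistent|contradict)", re.I),
}


def tag_lesson(text: str) -> set[str]:
    """Return every category whose pattern matches the lesson text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


def category_share(lessons: list[str], category: str) -> float:
    """Percentage of lessons carrying the given tag."""
    if not lessons:
        return 0.0
    hits = sum(category in tag_lesson(text) for text in lessons)
    return 100.0 * hits / len(lessons)
```

Applied to the Table 2 entries, these stand-in patterns tag the DeepSeek-V3.1 lesson as both `wolf_mimicry` and `vote_manipulation`, and the GPT-4.1 lesson as `inconsistency`.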
The three models converge on qualitatively distinct policies (Fig. 4b). DeepSeek-V3.1 encodes the cue-ambiguity exploit explicitly, devoting 71% of its entries to wolf_mimicry against at most 2.1% for the other two. GPT-4.1 instead acts erratic and lets the group affix the label, combining meta_modeling (31%), aggression (24%) and passive_provocation (24%), while Llama-3.3-70B leans on subtle_indirection (36%). A per-model distinctive-vocabulary (TF-IDF) analysis corroborates the split (App. N).
The ToM-order distribution (Fig. 4a) is the paper’s cleanest mechanistic result. DeepSeek-V3.1 is the only model that reliably produces third-order contrastive entries (15.4%), against 4.8% for GPT-4.1 and none for Llama-3.3-70B, and two independent LLM judges reproduce this rank order even though their absolute percentages differ (App. L; representative entries in Table 2). The ordering tracks the learning gains of Section 5.1: the model whose lessons encode the triadic incentive structure (DeepSeek-V3.1) extracts the most from the loop, while Llama-3.3-70B, whose lessons are 64% bare first-order imperatives, gains only modestly.
6 Conclusion
Adding a Jester to a Werewolf game converts a one-dimensional suspicion signal into a triadic incentive-prediction problem, and current frontier LLMs fail at it. Across 60 games on three models, the Jester wins the majority of cells, Werewolves never exceed 20%, and the single-faction self-learning intervention pays for its gains out of Villager wins rather than Wolf wins. Behavioral fingerprints (day-1 self-sabotage votes, night-kill of the Jester, model-stable deception-type composition) and second-order cross-perception shifts pin the failure on cue-sufficiency rather than coordination or signal detection, and deductive coding of the 141 accumulated lessons adds a textual mechanism, the only model whose self-learning produces contrastive third-order entries (DeepSeek-V3.1, 15.4%) being the one that gains most from the loop. Triadic incentives are a small structural change that exposes a layer of multi-agent reasoning dyadic deduction games leave invisible, and we release the benchmark, logs and analysis to enable follow-up work that controls the cue-ambiguity axis deliberately.
Limitations
We run only 10 games per cell, so the Wilson confidence intervals on each win-rate are wide. Our claims therefore rest on the same pattern showing up across cells, not on any single cell being significant. The Jester self-learning loop is trained per model, so we do not test whether lessons learned by one model help another. We did not log per-observer true- and false-positive labels, so we cannot report $F_1$-style detection scores for the 8-type taxonomy. Passing the full conversation history with no compression keeps all information from earlier rounds, but it also raises token cost and fills the context window as games get longer. Adding a proper memory structure, for example retrieval over a running summary or dropping low-relevance turns, is left to future work. The cue-ambiguity claim is an argument about game structure, and other ToM probes, such as asking each agent direct second-order belief questions mid-game, could add to the behavioral evidence we give here. Finally, all our games are in English, so we do not know how well the triadic-incentive idea carries over to other languages.
Ethical considerations
Studying deception in LLMs.
The benchmark explicitly incentivises one role (the Jester) to deceive other agents. We do not view this as encouraging real-world deception by deployed models; rather, structured studies of deceptive capability are necessary for safety evaluation, since deceptive behaviour cannot be measured or mitigated without first being elicited and characterised. The self-learning loop is run in a sandboxed gameplay environment with no human users in the loop.
Dual-use of the self-learning loop.
The per-model running-lessons file shows that one frontier model (DeepSeek-V3.1) is already able to accumulate, in natural language, third-order contrastive deception strategies from its own gameplay. The same loop could in principle be used to bootstrap deceptive prompts outside a game context. We release the loop, the prompts, and the resulting lesson files so that this capability surface is auditable, but we caution that the release is intended for safety research rather than for re-deployment as a deception scaffold.
Human-subject data.
The benchmark contains no human-subject data. All transcripts are model-generated; no personally identifiable information is collected or released. Player names used inside games (e.g. “Joy”) are arbitrary tokens unconnected to any real individual.
Model and content licensing.
The three evaluated models (GPT-4.1, DeepSeek-V3.1, Llama-3.3-70B) are accessed under their providers’ standard terms of use. Released artefacts (code, prompts, event logs) do not redistribute model weights and respect each provider’s output-sharing policy.
References
- GPT-4 technical report. Cited by: 2nd item.
- Wolf: werewolf-based observations for llm deception and falsehoods. arXiv preprint arXiv:2512.09187. Cited by: §3.1, §3.1.
- Werewolf arena: a case study in llm evaluation via social deduction. ArXiv abs/2407.13943. Cited by: §1, §2, §3.1.
- Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science 378, pp. 1067 – 1074. Cited by: §2.
- AMONGAGENTS: evaluating large language models in the interactive text-based social deduction game. ArXiv abs/2407.16521. Cited by: §2.
- The traitors: deception and trust in multi-agent language model simulations. ArXiv abs/2505.12923. Cited by: §1, §2.
- DeepSeek-v3 technical report. Cited by: 2nd item.
- The llama 3 herd of models. Cited by: 2nd item.
- Shortcut learning in deep neural networks. Nature Machine Intelligence 2, pp. 665 – 673. Cited by: §1.
- Deception abilities emerged in large language models. Proceedings of the National Academy of Sciences of the United States of America 121. Cited by: §2.
- HI-tom: a benchmark for evaluating higher-order theory of mind reasoning in large language models. pp. 10691–10706. Cited by: §1, §2.
- Sleeper agents: training deceptive llms that persist through safety training. ArXiv abs/2401.05566. Cited by: §2.
- Microscopic analysis on llm players via social deduction game. arXiv preprint arXiv:2408.09946. Cited by: §2.
- FANToM: a benchmark for stress-testing machine theory of mind in interactions. ArXiv abs/2310.15421. Cited by: §2.
- AvalonBench: evaluating llms playing the game of avalon. Cited by: §2.
- InterIntent: investigating social intelligence of llms via intention understanding in an interactive game context. pp. 6718–6746. Cited by: §1, §2.
- Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. ArXiv abs/1902.01007. Cited by: §1.
- Probing neural network comprehension of natural language arguments. ArXiv abs/1907.07355. Cited by: §1.
- Hoodwinked: deception and cooperation in a text-based game for language models. ArXiv abs/2308.01404. Cited by: §2.
- Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. pp. 26837–26867. Cited by: §2.
- Generative agents: interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. Cited by: §2.
- Cooperate or collapse: emergence of sustainable cooperation in a society of llm agents. Advances in Neural Information Processing Systems 37. Cited by: §2.
- Does the chimpanzee have a theory of mind?. Behavioral and Brain Sciences 1, pp. 515 – 526. Cited by: §1.
- The mask benchmark: disentangling honesty from accuracy in ai systems. ArXiv abs/2503.03750. Cited by: §2.
- Neural theory-of-mind? on the limits of social intelligence in large lms. pp. 3762–3780. Cited by: §1.
- Clever hans or neural theory of mind? stress testing social reasoning in large language models. pp. 2257–2273. Cited by: §1, §1.
- Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36. Cited by: 2nd item, §2.
- Testing theory of mind in large language models and humans. Nature Human Behaviour 8, pp. 1285 – 1295. Cited by: §1, §2.
- LLMs achieve adult human performance on higher-order theory of mind tasks. Frontiers in Human Neuroscience 19. Cited by: §1.
- Large language models fail on trivial alterations to theory-of-mind tasks. ArXiv abs/2302.08399. Cited by: §1, §1.
- Voyager: an open-ended embodied agent with large language models. ArXiv abs/2305.16291. Cited by: §2.
- Boosting llm agents with recursive contemplation for effective deception handling. pp. 9909–9953. Cited by: §1, §1.
- Enhance reasoning for large language models in the game werewolf. ArXiv abs/2402.02330. Cited by: §2.
- OpenToM: a comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. pp. 8593–8623. Cited by: §2.
- Exploring large language models for communication games: an empirical study on werewolf. ArXiv abs/2309.04658. Cited by: 2nd item, §1.
- Learning strategic language agents in the werewolf game with iterative latent space policy optimization. ArXiv abs/2502.04686. Cited by: §2.
- Language agents with reinforcement learning for strategic play in the werewolf game. ArXiv abs/2310.18940. Cited by: §1, §2.
Appendix A Per-cell faction win-rates
Table 7 is the numeric companion to Figure 2 in Section 5.1, reporting per-cell win-rates with Wilson 95% confidence intervals at $n = 10$ games per cell.
| Model | Cond. | Villagers | Werewolves | Jester |
|---|---|---|---|---|
| GPT-4.1 | OFF | 0.30 [0.11, 0.60] | 0.00 [0.00, 0.28] | 0.70 [0.40, 0.89] |
| GPT-4.1 | ON | 0.30 [0.11, 0.60] | 0.10 [0.02, 0.40] | 0.60 [0.31, 0.83] |
| DeepSeek-V3.1 | OFF | 0.60 [0.31, 0.83] | 0.10 [0.02, 0.40] | 0.30 [0.11, 0.60] |
| DeepSeek-V3.1 | ON | 0.30 [0.11, 0.60] | 0.00 [0.00, 0.28] | 0.70 [0.40, 0.89] |
| Llama-3.3-70B | OFF | 0.50 [0.24, 0.76] | 0.10 [0.02, 0.40] | 0.40 [0.17, 0.69] |
| Llama-3.3-70B | ON | 0.20 [0.06, 0.51] | 0.20 [0.06, 0.51] | 0.60 [0.31, 0.83] |
Three features of the per-cell breakdown back the pooled numbers in the main text. The Jester is the modal winning faction in four of the six cells and the Werewolves never exceed $0.20$ in any cell, so the Jester-dominance and wolf-collapse patterns hold cell by cell rather than only on average. The self-learning loop lifts the Jester column for DeepSeek-V3.1 ($0.30 \to 0.70$) and Llama-3.3-70B ($0.40 \to 0.60$) while leaving GPT-4.1 near its ceiling ($0.70 \to 0.60$), the same uneven learning effect plotted in Figure 2. In both improving cells the matching loss falls in the Villager column (DeepSeek $0.60 \to 0.30$, Llama $0.50 \to 0.20$) and not the Werewolf column, which is the per-cell form of the substitution effect of Section 5.1. The wide Wilson intervals, several spanning more than $0.5$, are why we rest the claims on this cross-cell consistency rather than on any single cell.
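As a check on the intervals in Table 7, the following is a minimal sketch of the standard Wilson score interval, assuming $z = 1.96$ for the 95% level. The function name `wilson_interval` is illustrative and is not taken from the released code.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion wins / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = wins / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Win counts that occur in Table 7 (10 games per cell).
for wins in (0, 1, 3, 7):
    lo, hi = wilson_interval(wins, 10)
    print(f"{wins}/10 -> {wins / 10:.2f} [{lo:.2f}, {hi:.2f}]")
```

With ten games per cell, a win-rate of 0.30 yields roughly [0.11, 0.60], which is why a single cell carries little evidential weight on its own.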
Appendix B Additional results figures
This appendix collects three supporting figures referenced from the main results section: the cross-perception difference heatmap (Section 5.5), the game-length distribution by winner, and the six-axis behavioral fingerprint (Section 5.5).
B.1 Cross-perception second-order shifts
Figure 5 shows the full per-model heatmap of how the end-of-game suspicion that observer roles direct at target roles shifts when Jester self-learning is turned on. Rows are the observer role, columns are the target role, and the colour shows ON $-$ OFF: red means observers in this row become more suspicious of this target under the loop, blue means less. The figure makes two points that the main-text numbers compress. First, the Llama panel on the right has the most saturated cells, including a strong blue band over the Doctor column. Every other role becomes less suspicious of the Doctor, even though the loop never touched any Doctor prompt. Second, the GPT-4.1 panel is visually almost grey, consistent with the saturation argument in Section 5.1: at ceiling, the loop has no remaining headroom to propagate.
B.2 Game length
Figure 6 plots the number of day rounds elapsed before a game resolves, faceted by model and split by winning faction and Jester-learning condition. Jester wins cluster at the left edge of every panel: the Jester is exiled almost immediately once the loop produces a strategy that the group locks onto. Wolf wins, in the rare cells where they happen, take noticeably more rounds because they require attrition through several night kills. The pattern confirms that high-confidence voting fires on the highest-suspicion player without filtering for which faction that player belongs to, which is exactly the cue-ambiguity failure the triadic setup is designed to expose.
B.3 Behavioral fingerprint radar
Figure 7 compresses each model’s behavior into a six-axis polygon, with one shape for Jester-learning OFF (grey) and one for ON (orange). All six axes are normalized to $[0, 1]$ with larger values pushed outward, so a model that improves under the loop expands outward on the relevant axis. The three cross-cell contrasts highlighted in the main text become immediate. GPT-4.1’s OFF and ON polygons nearly overlap, which is the saturation pattern. DeepSeek’s polygon visibly rotates: the Jester axis grows outward while the Villager axis contracts, which is the substitution effect from Section 5.1 made visible on a single shape. Llama’s polygon is the only one where the wolf-restraint axis pulls inward under the loop, which matches its larger second-order cross-perception shift.
Appendix C Doctor protect targets and wolf night-kill victims
The Doctor self-protects almost universally, a pattern that itself reflects the cue-ambiguity problem. A Doctor protecting a Villager would need to reason about who Wolves are most likely to target, but the high-suspicion band contains both Werewolves and the Jester. Self-protection is a safe default that minimizes risk in the absence of a clear inference. The Werewolves’ choice of night-kill victim is similarly distorted. Doctors are killed more often than Villagers in every cell, which is a reasonable strategic move. Killing the Jester accidentally is also frequent (GPT-4.1 ON: 4 of 18 night-kills), and each such kill is a self-loss event.
Appendix D Mechanistic findings: per-statement event-log analysis
Going beyond aggregate cell-level metrics, we mined the per-statement event logs (3,992 deception_analysis events with paired self-tags and 8 observer reviews each) for four mechanistic insights, summarized in Figure 9.
Day-1 suspicion is destiny for the Jester.
Splitting the round-by-round observer-suspicion trajectory of the Jester by eventual outcome (Fig. 9a; full table in App. F), the win/loss separation appears at round 0 and the trajectories diverge little thereafter. For Llama-ON the round-0 mean suspicion is 0.78 in won games versus 0.69 in lost ones; for GPT-4.1 every won game already starts above 0.70. A Jester that fails to seed suspicion on the very first day rarely recovers, consistent with anchoring effects in a multi-agent vote. The pattern is most extreme for GPT-4.1: in every one of its 7 lost Jester games the Jester bid too low to speak on day 1 at all, so day-1 silence is a 100% predictor of a GPT-4.1 Jester loss in our corpus.
Each model has its own deception aesthetic.
Aggregating per-statement observer suspicion by self-declared deception type (Fig. 9b) reveals heterogeneity across judge models. GPT-4.1’s villagers register the highest suspicion against low-content tactics (omission 0.88, distortion 0.86); DeepSeek’s villagers react most to fabrication and persuasive manipulation; Llama’s react most to omission and misdirection. The same Jester strategy is therefore not optimal in all environments; the self-learning loop succeeds in part by matching the Jester’s tactic distribution to the local judge model.
Wolves systematically fail multi-hop ToM.
We classify each of the 55 wolf-loss games into one of four causes (Fig. 9c). GPT-4.1 wolves lose 90–100% of their games by self-sabotage: 13 of 20 GPT-4.1 wolf games end with the wolves voting the Jester out on day 1 themselves. This is a clean behavioural signature of failed multi-hop incentive prediction: an incentive-aware wolf should recognise the Jester as a third party whose elimination ends the wolves’ game, yet GPT-4.1 wolves treat the Jester as just another villager. DeepSeek and Llama wolves show milder versions of the same pattern but more often lose by being outnumbered (category D) or by the Jester baiting villagers into exiling it (category C).
Calibration asymmetry across roles.
The signed gap between observer-flagging rate and speaker self-tag (Fig. 9d) is consistently positive for Villagers (+0.08 to +0.21) and Doctors, and negative for Wolves and Jesters (-0.02 to -0.17). Wolves are therefore better at hiding than they think, while honest Villagers are systematically misperceived as deceptive. The triadic setup amplifies the latter bias: when a third winning condition is on the board, any unusual phrasing from a Villager looks “off,” inflating false-positive suspicion. Llama and DeepSeek both have Villager gaps of $+0.14$ to $+0.21$; GPT-4.1’s gap is the smallest at $+0.08$ to $+0.09$.
Univariate predictors of Jester wins.
Across all 60 games, the strongest game-level predictor of a Jester win is whether the wolves voted for the Jester on day 1 (Pearson ), followed by the mean suspicion-rank of vote targets (, i.e. Jester wins occur when villagers vote the highest-suspicion player) and the Jester’s own average attracted suspicion (). Crucially the number of Jester statements is uncorrelated with winning (): it is how convincingly and how early, not how much, that determines the outcome.
Five further per-statement and per-game phenomena are summarised in Figure 10 (full tables in App. J).
Bidding paradox.
The relationship between the Jester’s bidding aggressiveness and the eventual outcome is model-specific (Fig. 10a). For GPT-4.1, the only winning Jesters bid on day 1; in 7 of 7 lost games the Jester bid too low to speak. For Deepseek-ON the relationship inverts: winning Jesters bid 4.6 on average on day 1 versus 8.7 for losing ones, suggesting Deepseek’s self-learning loop converges on a “stay quiet on day 1, pop later” policy. Llama’s bids are essentially flat () and uninformative. There is no universal optimal bidding strategy; the self-learning loop discovers different convergent behaviours per model.
Doctors are the most under-detected role.
Aggregating per-statement deception-detection accuracy by speaker role (Fig. 10b) reveals that the Doctor is the hardest role to label correctly across all three judges (0.65–0.81), whereas Wolves and Jesters are detected at 0.80–0.93 and Villagers at 0.74–0.90. Doctors are partially deceptive – they hide a single fact (their role) while otherwise speaking truthfully – which produces a behavioural signature that does not match either “honest villager” or “lying wolf” templates. This is a structural ToM blind spot that the triadic setup happens to expose.
Self-learning helps Llama and hurts GPT-4.1 within a cell.
Sorting the ten ON-condition games per cell chronologically and plotting the cumulative Jester win-rate (Fig. 10c) reveals a striking model-dependent learning trajectory: Llama-ON climbs from 0.4 in the first half to 0.8 in the second half ($+40$ pp), GPT-4.1-ON drops from 0.8 to 0.4 ($-40$ pp), and DeepSeek-ON drops slightly ($0.8 \to 0.6$). One interpretation is that GPT-4.1’s accumulated learnings file grows long enough to interfere with its base policy – a long-context instruction degradation effect – whereas Llama’s weaker base policy benefits from explicit memory. This is a useful caveat for any deployment of self-reflective agent loops on top of strong base models.
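The split-half and cumulative win-rates used here can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the outcome sequence is a made-up ordering chosen to reproduce a 0.4 and 0.8 split, not the logged game order.

```python
def split_half_win_rates(outcomes: list[bool]) -> tuple[float, float]:
    """Win-rate in the first and second half of a chronological outcome list."""
    if len(outcomes) < 2:
        raise ValueError("need at least two games")
    mid = len(outcomes) // 2
    first, second = outcomes[:mid], outcomes[mid:]
    return sum(first) / len(first), sum(second) / len(second)


def cumulative_win_rate(outcomes: list[bool]) -> list[float]:
    """Running win-rate after each game, in chronological order."""
    rates, wins = [], 0
    for games_played, won in enumerate(outcomes, start=1):
        wins += won
        rates.append(wins / games_played)
    return rates


# Hypothetical ordering of ten games (True = Jester win); not real data.
example = [False, True, False, True, False, True, True, True, False, True]
first_half, second_half = split_half_win_rates(example)
print(first_half, second_half)          # first-half and second-half win-rates
print(cumulative_win_rate(example)[-1])  # overall win-rate for the cell
```

With only five games per half, a single outcome moves the half-rate by 20 percentage points, so split-half differences of this size should be read as suggestive.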
Observer confidence is well-calibrated, except Llama.
Binning observer “deceptive” flags by their self-reported confidence (Fig. 10e), GPT-4.1 and Deepseek both reach a hit-rate of at confidence , closely tracking the perfect-calibration diagonal. Llama never assigns confidence above 0.95; its highest-confidence flags top out at hit-rate. Practically, this means a high-confidence flag from GPT-4.1 or Deepseek can be treated as essentially ground truth for downstream pipelines, whereas Llama’s calibration ceiling at caps its usefulness as an automated deception labeler.
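The binning behind a calibration curve of this kind can be sketched as below, assuming each observer flag is stored as a (confidence, was-the-statement-actually-deceptive) pair. The function name, the ten equal-width bins, and the toy records are assumptions for illustration; the paper does not state the bin edges it used.

```python
from collections import defaultdict


def hit_rate_by_confidence(flags, n_bins: int = 10) -> dict[float, float]:
    """Map each confidence bin (keyed by its lower edge) to its hit-rate.

    flags: iterable of (confidence in [0, 1], is_correct) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, is_correct in flags:
        # The small epsilon guards against products such as 0.29 * 100
        # landing just below an integer; confidence 1.0 goes in the top bin.
        bin_index = min(int(confidence * n_bins + 1e-9), n_bins - 1)
        totals[bin_index] += 1
        hits[bin_index] += bool(is_correct)
    return {b / n_bins: hits[b] / totals[b] for b in sorted(totals)}


# Toy records, not taken from the event logs.
toy_flags = [(0.95, True), (0.95, True), (1.0, True), (0.55, True), (0.55, False)]
print(hit_rate_by_confidence(toy_flags))
```

A well-calibrated observer has a hit-rate close to the confidence of each bin, so the returned values would sit near the diagonal when plotted against the bin edges.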
Detection accuracy by observer role.
The complementary view by observer role (Fig. 10f) shows that Doctors and Jesters are the most accurate observers in the Llama and DeepSeek environments, while in GPT-4.1 all roles cluster between 0.85 and 0.92. This is consistent with Doctors and Jesters having stronger asymmetric-information advantages – they know their own role and can therefore interpret others’ behaviour more informatively.
Appendix E Temporal dynamics of collaboration and deception
Treating each game as a time series along three axes – within-game day index, within-day statement position, and across the 10-game cell sequence – exposes a set of role-specific dynamics that the aggregate metrics in §D hide. The full panel is shown in Fig. 11; supporting tables appear in Appendix K.
Suspicion accumulates linearly for Wolves but saturates day 1 for Jesters.
Mean attracted suspicion rises for Wolves from on day 0 to on day 4 – a slow, evidence-driven accumulation. Jesters, by contrast, jump from to between day 0 and day 1 and then plateau, indicating that the Jester signature is decoded essentially in the first speech act and that any subsequent behaviour rides a near-saturated suspicion baseline. Doctors trace a U-shape (): cleared in the early rounds, then re-targeted late as the player pool shrinks. This is the mechanism behind the per-role survival imbalance shown in panel (f). Villagers stay flat near .
Deception emission is a fixed role policy, but the policy itself differs sharply across models for the Doctor.
Wolves emit deceptive statements in every round; Jesters ; Villagers . There is no in-game adaptation along this axis – no panic-lying, no cool-down. The Doctor, however, is governed by very different model-specific policies: GPT-4.1 plays Doctor at deception after day 1, Llama at a steady , and Deepseek oscillates between and across rounds. This three-way divergence on a single role is one of the largest cross-model behavioural gaps in our data and helps explain why Doctor survival is uniformly low: each model is failing for a different reason.
Late-day deceivers attract more suspicion – but only deceivers.
Binning each statement into early/mid/late thirds of its day, Wolves and Jesters attract increasing suspicion as they speak later (Wolf: ; Jester: ) while honest roles show no position effect (Villager ; Doctor ). The effect is the opposite of the naive expectation that more thinking time produces safer statements: observers appear to hold late speakers to a higher inferential bar, and the more polished a deceiver’s last-speaker statement is, the more flagged it gets.
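A minimal sketch of the position binning, assuming each statement carries its 0-based index within the day and the number of statements made that day. The record layout, the function names, and the integer-division rule for assigning thirds are assumptions; the paper does not specify how ties at the boundaries were handled.

```python
from collections import defaultdict

THIRDS = ("early", "mid", "late")


def position_third(index: int, n_statements: int) -> str:
    """Assign a 0-based within-day statement index to early/mid/late."""
    return THIRDS[min(3 * index // n_statements, 2)]


def mean_suspicion_by_third(statements) -> dict[tuple[str, str], float]:
    """Mean attracted suspicion per (role, third).

    statements: iterable of (role, index, n_statements_in_day, suspicion).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for role, index, n_in_day, suspicion in statements:
        key = (role, position_third(index, n_in_day))
        sums[key] += suspicion
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy records, not taken from the event logs.
toy = [
    ("Werewolf", 0, 6, 0.5),
    ("Werewolf", 4, 6, 0.7),
    ("Werewolf", 5, 6, 0.9),
]
print(mean_suspicion_by_third(toy))
```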
The public debate is a wolf coordination channel.
Wolf vote-unanimity rises monotonically with round in every model – GPT , Deepseek , Llama . Even though the night-time kill is a private wolf-only action, the day-time debate appears to serve as a coordination channel that converges wolves on a shared target by round 2. The day-1 wolf-vote-Jester error is also universal and uniformly self-corrected, the share of wolves voting the Jester falling from in GPT by round 2 and from in Deepseek and in Llama by round 3. The mistake is recognised within a round or two, but by then the village suspicion landscape is already poisoned.
Doctor target-prediction accuracy is inverse to overall model capability.
Llama Doctors predict the actual wolf target with accuracy across rounds; GPT-4.1 Doctors degrade from on day 1 to on day 2 and on day 3; Deepseek oscillates noisily. The reversal is striking because GPT-4.1 wins the most games overall: a plausible interpretation is that GPT-4.1 over-reasons about decoy possibilities and second-order wolf strategy, whereas Llama applies a simpler “protect the most-suspected innocent” heuristic that survives noisy evidence better. Strong inferential machinery hurts in a setting with high observation noise and a short horizon.
Cross-game role-specific drift exists even without explicit learning.
The OFF condition has no learnings file, yet several role-model pairs drift across the 10-game cell sequence: Llama Wolf survival rises from early to in games 7–10, Deepseek Villager survival drops from to at game 9, and GPT-4.1 Jester survives in games regardless of position. The Llama Wolf drift suggests an implicit cell-position effect, possibly via prompt-cache-warming or evaluator-side variation; we flag this as a confound for future cross-game analyses. The two zero-row results (Llama Doctor survival, GPT-4.1 Jester survival) identify the two most-broken role-model pairings in the benchmark.
Appendix F Detailed deep-analysis tables
The following tables back the mechanistic findings of Sec. D.
F.1 Jester suspicion trajectory by round
The table tracks the mean observer-suspicion of the Jester across game rounds, separating cells by eventual outcome. Three patterns are visible. First, suspicion of the Jester rises with round in nearly every cell – this is mechanical, since by round 1 each observer has seen multiple Jester statements and accumulated evidence. Second, the won-vs-lost separation already exists at round 0 in three of the four cells that have both classes (DeepSeek-OFF: 0.75 vs 0.67; Llama-OFF: 0.69 vs 0.53; Llama-ON: 0.78 vs 0.69), giving the day-1 prior most of the predictive content; the exception is DeepSeek-ON, where won and lost games start almost level (0.64 vs 0.66). Third, the GPT-4.1 rows are notably sparse – the model has no row at all for “lost” games at round 0, because in every one of its 7 lost Jester games the Jester bid below the speaking threshold and never produced a day-1 statement to be analysed. Day-1 silence is therefore not just correlated with losing for GPT-4.1; it is its complete characterisation.
Mean observer-suspicion of the Jester aggregated by model, condition, round, and eventual outcome (jester_won). $n$ is the number of (game, round) cells with at least one observer review of a Jester statement.
| Model | Cond | Round | Won? | Mean susp | $n$ |
|---|---|---|---|---|---|
| deepseek-v3.1 | off | 0 | 0 | 0.673 | 7 |
| deepseek-v3.1 | off | 0 | 1 | 0.751 | 3 |
| deepseek-v3.1 | off | 1 | 0 | 0.841 | 6 |
| deepseek-v3.1 | off | 1 | 1 | 0.737 | 2 |
| deepseek-v3.1 | on | 0 | 0 | 0.655 | 3 |
| deepseek-v3.1 | on | 0 | 1 | 0.640 | 7 |
| deepseek-v3.1 | on | 1 | 0 | 0.837 | 2 |
| deepseek-v3.1 | on | 1 | 1 | 0.812 | 5 |
| gpt-4.1 | off | 0 | 1 | 0.723 | 7 |
| gpt-4.1 | on | 0 | 1 | 0.806 | 6 |
| llama-3.3-70b | off | 0 | 0 | 0.530 | 6 |
| llama-3.3-70b | off | 0 | 1 | 0.685 | 4 |
| llama-3.3-70b | off | 1 | 0 | 0.715 | 3 |
| llama-3.3-70b | off | 1 | 1 | 0.763 | 3 |
| llama-3.3-70b | on | 0 | 0 | 0.685 | 4 |
| llama-3.3-70b | on | 0 | 1 | 0.776 | 6 |
| llama-3.3-70b | on | 1 | 0 | 0.749 | 3 |
| llama-3.3-70b | on | 1 | 1 | 0.852 | 3 |
F.2 Per-deception-type yield (mean observer suspicion of Jester)
For each Jester statement we record the speaker’s self-declared deception type and the mean suspicion the statement attracts from the observers in the same round. The table groups by (model, condition, type) and reports both the mean observer suspicion and the fraction of observers who flagged the statement as deceptive. Three observations stand out. (i) The tactic mix differs sharply across models: DeepSeek-OFF Jesters lean heavily on fabrication (55 of 88 statements), GPT-4.1-ON Jesters lean on distortion (38 of 56), and Llama Jesters lean on misdirection (63 of 110 in the ON cell). (ii) The yield of each tactic is also model-specific: GPT-4.1’s villagers are most suspicious of low-content tactics (omission 0.88, distortion 0.86) and least suspicious of explicit misdirection (0.36) – the opposite of Llama, where misdirection attracts high suspicion (0.79 in the ON cell). (iii) Truthful Jester statements (none type) draw the lowest suspicion within the OFF cells that contain them (0.18 for GPT-4.1-OFF, 0.59 for Llama-OFF), which suggests that the deception signal is real and not noise, most clearly for GPT-4.1; in Llama-ON the few none-type statements still attract 0.80.
| Model / cond | Deception type | $n$ | Mean susp | Flag rate |
|---|---|---|---|---|
| deepseek-v3.1, off | persuasive_manipulation | 5 | 0.847 | 1.00 |
| deepseek-v3.1, off | strategic_deception | 9 | 0.822 | 0.94 |
| deepseek-v3.1, off | fabrication | 55 | 0.798 | 0.93 |
| deepseek-v3.1, off | distortion | 11 | 0.775 | 0.93 |
| deepseek-v3.1, off | misdirection | 6 | 0.677 | 0.81 |
| deepseek-v3.1, on | fabrication | 47 | 0.789 | 0.94 |
| deepseek-v3.1, on | misdirection | 3 | 0.780 | 1.00 |
| deepseek-v3.1, on | distortion | 12 | 0.769 | 0.94 |
| deepseek-v3.1, on | strategic_deception | 13 | 0.674 | 0.78 |
| gpt-4.1, off | omission | 5 | 0.877 | 0.90 |
| gpt-4.1, off | strategic_deception | 23 | 0.790 | 0.84 |
| gpt-4.1, off | distortion | 31 | 0.775 | 0.91 |
| gpt-4.1, off | misdirection | 3 | 0.363 | 0.29 |
| gpt-4.1, off | none | 3 | 0.183 | 0.08 |
| gpt-4.1, on | distortion | 38 | 0.860 | 0.98 |
| gpt-4.1, on | strategic_deception | 14 | 0.804 | 0.91 |
| gpt-4.1, on | misdirection | 2 | 0.256 | 0.19 |
| llama-3.3-70b, off | distortion | 3 | 0.786 | 0.76 |
| llama-3.3-70b, off | misdirection | 38 | 0.705 | 0.88 |
| llama-3.3-70b, off | omission | 6 | 0.663 | 0.76 |
| llama-3.3-70b, off | strategic_deception | 32 | 0.651 | 0.76 |
| llama-3.3-70b, off | none | 17 | 0.591 | 0.70 |
| llama-3.3-70b, on | omission | 1 | 0.842 | 1.00 |
| llama-3.3-70b, on | none | 8 | 0.796 | 0.91 |
| llama-3.3-70b, on | misdirection | 63 | 0.786 | 0.93 |
| llama-3.3-70b, on | strategic_deception | 30 | 0.755 | 0.89 |
| llama-3.3-70b, on | persuasive_manipulation | 3 | 0.700 | 0.83 |
F.3 Speaker-vs-observer deception calibration
The table compares each speaker’s self-tagged deception rate with the mean rate at which observers flagged that speaker’s statements as deceptive. The signed gap (calib_gap) summarises whether the role is over- or under-perceived. Three patterns emerge. (i) Wolves and Jesters, who self-tag as deceptive on 82–100% of statements, are nonetheless slightly under-detected – their gaps are consistently negative ($-0.02$ to $-0.17$). The largest negative gap is GPT-4.1’s Werewolf at $-0.17$ (OFF), meaning GPT-4.1 wolves are particularly good at hiding their deception. (ii) Villagers and Doctors exhibit the opposite pathology: positive gaps of $+0.03$ to $+0.21$, i.e. observers attribute deception to these players more often than the speakers themselves admit it. The Llama-OFF Villager gap of $+0.21$ is the largest: observers flag Llama-OFF Villager statements as deceptive at a rate 21 percentage points above the speakers’ own self-tag rate. (iii) Across models, GPT-4.1 has the smallest Villager gap ($+0.09$ ON, $+0.08$ OFF), indicating that GPT-4.1 villagers are best at recognising honest content from each other. Combined, these patterns describe a systematic bias: in a triadic game, observers over-attribute deception to honest play and under-detect deception by skilled liars.
self_rate = fraction of statements the speaker tagged deceptive. obs_rate = mean fraction of observers who flagged each statement deceptive. calib_gap = obs_rate $-$ self_rate (positive = speaker under-reports their own deception relative to peer perception).
| Model | Cond | Role | $n$ | self_rate | obs_rate | calib_gap |
|---|---|---|---|---|---|---|
| deepseek-v3.1 | off | Doctor | 95 | 0.421 | 0.557 | +0.136 |
| deepseek-v3.1 | off | Jester | 88 | 1.000 | 0.922 | -0.078 |
| deepseek-v3.1 | off | Villager | 391 | 0.018 | 0.225 | +0.207 |
| deepseek-v3.1 | off | Werewolf | 187 | 1.000 | 0.938 | -0.062 |
| deepseek-v3.1 | on | Doctor | 67 | 0.448 | 0.570 | +0.122 |
| deepseek-v3.1 | on | Jester | 77 | 1.000 | 0.905 | -0.095 |
| deepseek-v3.1 | on | Villager | 334 | 0.012 | 0.154 | +0.142 |
| deepseek-v3.1 | on | Werewolf | 157 | 1.000 | 0.937 | -0.063 |
| gpt-4.1 | off | Doctor | 54 | 0.074 | 0.103 | +0.029 |
| gpt-4.1 | off | Jester | 66 | 0.955 | 0.822 | -0.133 |
| gpt-4.1 | off | Villager | 261 | 0.000 | 0.079 | +0.079 |
| gpt-4.1 | off | Werewolf | 97 | 0.990 | 0.818 | -0.172 |
| gpt-4.1 | on | Doctor | 51 | 0.176 | 0.251 | +0.075 |
| gpt-4.1 | on | Jester | 56 | 1.000 | 0.920 | -0.080 |
| gpt-4.1 | on | Villager | 315 | 0.000 | 0.092 | +0.092 |
| gpt-4.1 | on | Werewolf | 122 | 1.000 | 0.855 | -0.145 |
| llama-3.3-70b | off | Doctor | 83 | 0.301 | 0.505 | +0.204 |
| llama-3.3-70b | off | Jester | 96 | 0.823 | 0.796 | -0.027 |
| llama-3.3-70b | off | Villager | 419 | 0.053 | 0.264 | +0.211 |
| llama-3.3-70b | off | Werewolf | 191 | 0.906 | 0.874 | -0.032 |
| llama-3.3-70b | on | Doctor | 85 | 0.153 | 0.277 | +0.124 |
| llama-3.3-70b | on | Jester | 110 | 0.927 | 0.910 | -0.017 |
| llama-3.3-70b | on | Villager | 391 | 0.074 | 0.278 | +0.204 |
| llama-3.3-70b | on | Werewolf | 199 | 0.965 | 0.906 | -0.059 |
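The three rates defined in the caption can be computed from per-statement records as in the sketch below. The `Statement` record and `calibration_gap` function are hypothetical names, and the layout (one self-tag plus a list of observer flags per statement) is an assumption about the event logs rather than their documented schema.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    self_deceptive: bool        # speaker's own deception tag
    observer_flags: list[bool]  # one "deceptive" flag per observer review


def calibration_gap(statements: list[Statement]) -> tuple[float, float, float]:
    """Return (self_rate, obs_rate, calib_gap) for one (model, cond, role) group."""
    if not statements:
        raise ValueError("no statements in group")
    n = len(statements)
    self_rate = sum(s.self_deceptive for s in statements) / n
    # obs_rate averages, over statements, the fraction of observers who flagged it.
    obs_rate = sum(sum(s.observer_flags) / len(s.observer_flags) for s in statements) / n
    return self_rate, obs_rate, obs_rate - self_rate


# Toy group: four honest statements, one of which two of eight observers flag.
toy_group = [
    Statement(False, [True, True] + [False] * 6),
    Statement(False, [False] * 8),
    Statement(False, [False] * 8),
    Statement(False, [False] * 8),
]
print(calibration_gap(toy_group))
```

A positive gap, as in this toy group, corresponds to the Villager and Doctor rows of the table; a group that always self-tags as deceptive but is flagged less than always would yield a negative gap.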
F.4 Wolf-loss failure-mode taxonomy
This taxonomy splits every wolf loss into one of four mechanisms. Categories A and B are wolf self-sabotage: the day-1 vote of the Jester (A, which hands the Jester the win) and the night-kill of the Jester (B, which wastes the kill and lets the Villagers win). Category C is the Jester being voted out by the villagers with the wolves uninvolved (a Jester win), and D is the villagers outnumbering the wolves (a Villager win). The counts are promoted to the main paper as Table 6 (Section 5.4), where the category letters are defined, and they also appear as Fig. 9c. Read per cell, the table puts GPT-4.1 almost entirely in the self-sabotage columns, with A and B accounting for all ten of its OFF games and nine of its ten ON games and nothing left for categories C or D, so its wolves essentially never lose for any reason other than acting against the Jester themselves. DeepSeek-V3.1 shifts toward category C under self-learning, from two such games OFF to five ON, while its category-D losses fall from five to one. Its losses therefore move from the villagers outnumbering the wolves toward the Jester baiting the villagers into exiling it, the same redistribution that lifts its Jester win-rate from $0.30$ to $0.70$. Llama-3.3-70B spreads across all four categories in both conditions and is the only cell to record two wolf wins (ON), the diffuse profile one expects from a model that plays every side less sharply.
F.5 Univariate predictors of Jester win
| Feature | Won | Lost | $n$ | Pearson $r$ |
|---|---|---|---|---|
| Wolf voted Jester day 1 | 0.688 | 0.107 | 60 | |
| Vote-suspicion mean rank | 1.927 | 2.767 | 60 | |
| Jester avg attracted susp | 0.770 | 0.704 | 53 | |
| Jester deception rate | 6.751 | 6.109 | 53 | |
| Jester # statements | 9.312 | 9.286 | 53 |
Table 8 reports univariate Pearson correlations between five per-game features and the binary Jester-win outcome across all 60 games. The single strongest predictor is whether the Wolves voted for the Jester on day 1 (): in games the Jester won, this happened in 69% of cases versus 11% in lost games. The second-strongest signal is the mean suspicion-rank of vote targets (): Jester wins when Villagers concentrate their votes on the single highest-suspicion player, since by construction that player is as plausibly the Jester as a Wolf. The Jester’s own average attracted suspicion () and the rate at which the Jester self-tags statements as deceptive () both predict positively but more weakly. The null result is that the raw number of Jester statements is uncorrelated with winning (, with feature means $9.312$ vs $9.286$ in won vs lost games), so the predictor structure is about how convincing and how early the Jester’s contributions are, not how many it makes. Sample size drops from 60 to 53 for the three Jester-side features because seven games have no Jester-tagged turns (the Jester is killed on night 1 or never speaks).
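A univariate correlation of this kind is the ordinary Pearson coefficient with a 0/1 outcome (equivalently, the point-biserial correlation). The sketch below is a plain-Python illustration; `pearson_r` and the toy vectors are hypothetical and do not reproduce the values in Table 8.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between a per-game feature and a 0/1 outcome."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either variable is constant
    return sxy / math.sqrt(sxx * syy)


# Toy data for eight games, not taken from the corpus.
wolf_voted_jester_day1 = [1, 1, 1, 0, 0, 0, 1, 0]
jester_won = [1, 1, 1, 0, 0, 0, 0, 1]
print(pearson_r(wolf_voted_jester_day1, jester_won))
```

Games without any Jester-tagged turns would simply be dropped from both vectors before the call, which is how the sample size for the Jester-side features falls below 60.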
F.6 Jester self-learning entries: full ToM example set
Table 9 expands the single-row-per-order sample of Table 2 into the full set of representative entries, so the qualitative gap behind the ToM-order counts of Section 5.6 can be read directly. The two DeepSeek-V3.1 rows are the only third-order entries that name both audiences at once, appearing wolf-like while avoiding villager-alignment, which is the contrastive structure the regex tagger keys on. The GPT-4.1 rows step down a level, from a single third-order ambiguity entry to second-order “appear erratic” framing to a bare first-order imperative, and both Llama-3.3-70B rows stay at orders 1–2 with no cross-faction contrast. The verbatim wording makes clear that the order labels track a real difference in what each model represents about its observers, not a labelling artefact.
| Model | ToM | Entry (verbatim) |
|---|---|---|
| DeepSeek-V3.1 | 3 | Vote for a player who is not the main suspect (like Joy) to appear wolf-like by deflecting from the consensus, but only after the main suspect is already exiled to avoid appearing villager-aligned. |
| DeepSeek-V3.1 | 3 | Maintain a vote on a non-consensus target during revote to appear stubbornly wolf-like and avoid villager alignment. |
| GPT-4.1 | 3 | Lean into accusations of being both Werewolf and Jester to create ambiguity and prevent players from settling on my true role. |
| GPT-4.1 | 2 | Escalate suspicion on multiple players in quick succession to appear erratic and untrustworthy. |
| GPT-4.1 | 1 | Use repeated questioning of players already under scrutiny to amplify suspicion without initiating new accusations. |
| Llama-3.3-70B | 2 | Redirect suspicion towards other players to avoid being the center of attention. |
| Llama-3.3-70B | 1 | Use the initial questions and reactions to gauge players’ behaviors and adjust strategy accordingly. |
F.7 Doctor non-self-protect events
The Doctor self-protects on 105 of its 112 protect actions across the corpus, a 93.8% self-protect rate, so the nights where it shields another player are rare enough to enumerate one by one. Table 10 gives the full per-cell split that backs the protect-target distribution of Fig. 8(a) (App. C). Three of the six cells, GPT-4.1 in both conditions and Llama-3.3-70B OFF, self-protect on 100% of nights and never once shield a teammate, so the entire deviation from self-protection comes from DeepSeek-V3.1 and Llama-3.3-70B ON. The seven non-self events are split four to Llama-3.3-70B ON and three to DeepSeek-V3.1, and the shielded player is most often the Jester or an ordinary Villager, with one Llama ON night spent protecting a Werewolf. Even this handful tilts toward the high-suspicion roles rather than a reasoned read of the wolves’ likely target, the same cue-driven default that the self-protect majority reflects. We report the raw counts rather than a conditional win-rate because seven events are far too few to estimate an outcome effect.
| Model | Cond | Self-protect | Other | Role shielded when not self |
|---|---|---|---|---|
| deepseek-v3.1 | off | 22 | 1 | Villager |
| deepseek-v3.1 | on | 16 | 2 | Jester, Villager |
| gpt-4.1 | off | 15 | 0 | — |
| gpt-4.1 | on | 15 | 0 | — |
| llama-3.3-70b | off | 19 | 0 | — |
| llama-3.3-70b | on | 18 | 4 | Jester, Werewolf |
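The corpus-level totals follow directly from the per-cell counts in Table 10. The short sketch below re-derives them; the counts are transcribed from the table, while the variable names are illustrative.

```python
# (model, condition, self-protect count, other-protect count), from Table 10.
TABLE_10 = [
    ("deepseek-v3.1", "off", 22, 1),
    ("deepseek-v3.1", "on", 16, 2),
    ("gpt-4.1", "off", 15, 0),
    ("gpt-4.1", "on", 15, 0),
    ("llama-3.3-70b", "off", 19, 0),
    ("llama-3.3-70b", "on", 18, 4),
]

total_self = sum(row[2] for row in TABLE_10)
total_other = sum(row[3] for row in TABLE_10)
self_protect_rate = total_self / (total_self + total_other)

# Cells that never shield another player.
always_self = [(model, cond) for model, cond, _, other in TABLE_10 if other == 0]
```

This gives 105 self-protects out of 112 protect actions (a rate of 0.9375), seven non-self events, and three cells that self-protect on every night.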
Appendix G Voting pathology: full plot
The two panels separate the two halves of the failure. Panel (a) shows that day-1 wolf-vote-Jester stays high or rises under the loop in every model: the self-learning intervention edits only the Jester’s prompt and never teaches the Wolves to stop conceding the game, which is why the Werewolf win-rate stays pinned near zero throughout. Panel (b) explains why GPT-4.1 is the extreme case: its voters sit nearest rank 1 and so commit hardest to whichever single player tops their suspicion order, exactly the one-dimensional cue-following that exiles the Jester, whereas Llama spreads its votes across the top-2 to top-3 and lands the dominated vote less mechanically. Tighter suspicion alignment therefore coincides with more self-sabotage, the opposite of what an incentive-aware voter would produce, which is the cue-sufficiency reading of Sec. 5.4.
Appendix H Substitution and Jester fate: full plot
Figure 13 backs the substitution claim of Sec. 5.1 and the night-kill self-sabotage claim of the Jester-fate subsection. Panel (a) shows the diverging (ON − OFF) win-rate per faction; panel (b) shows the per-game Jester fate.
Panel (a) isolates who pays for the Jester’s learning gains. The Villager bar drops in both DeepSeek-V3.1 and Llama-3.3-70B while the Werewolf bar barely moves off zero, so the loop transfers wins from the Villagers to the Jester and leaves the already-collapsed Wolves untouched, the per-faction form of the substitution claim in Sec. 5.1. Panel (b) decomposes the Jester outcome into its three terminal states and shows that a sizeable killed_at_night slice persists in every cell, each instance a Wolf night-kill of the Jester that ends the game in a triadic loss rather than a Wolf win. The voted_out share tracks the Jester-win column of Figure 2 as the rules require, so the fate breakdown is a consistency check as much as a result.
Appendix I Manipulation rates and suspicion attracted: full plot
Figure 14 backs the claim of Sec. 5.1 (and the operational cue-ambiguity statement in Sec. 5.4) that Jester and Werewolf attract statistically indistinguishable suspicion despite the Jester producing more deceptive statements per turn.
The two halves of the dumbbell figure make the cue-ambiguity property concrete. On the left, the Jester out-produces the Werewolf in peer-detected deceptions per statement in every model, so the Jester is by this measure the more visibly deceptive of the two high-suspicion roles. On the right, the suspicion the two attract is nonetheless nearly identical, both sitting in the same band across all six cells and well clear of the lower band that Villagers and the Doctor draw. A single observer reading off attracted suspicion therefore cannot tell the Jester from a Werewolf, which is precisely the ambiguity an incentive-aware voter would have to resolve and the empirical basis for treating the Jester win-rate as a multi-hop ToM probe.
Appendix J Auxiliary findings: detailed tables
The five auxiliary findings reported under Sec. D are backed by the following per-cell tables.
J.1 Bidding economics (per-game Jester bids)
Table 11 breaks the per-game Jester bids of Fig. 10a down by model, condition, and eventual outcome, and it shows that no single bidding level is universally optimal. GPT-4.1 has no losing row at all because its seven Jester losses are exactly the seven games where the Jester never cleared the speaking threshold on day 1, so for GPT-4.1 the relevant variable is whether the Jester speaks rather than how aggressively it bids. DeepSeek-V3.1 ON runs the opposite policy: its winning Jesters open with a mean first-bid of 4.57 against 8.67 for losing ones, the quantitative form of the “stay quiet on day 1, pop later” strategy that its self-learning loop converges on. Llama-3.3-70B sits at the degenerate end, bidding a flat 8.00 on every first move in all four cells, so its bids carry no information about the outcome.
| Model | Cond | Won? | Mean avg-bid | Mean first-bid | Mean max-bid | n |
|---|---|---|---|---|---|---|
| deepseek-v3.1 | off | 0 | 8.18 | 5.29 | 9.71 | 7 |
| deepseek-v3.1 | off | 1 | 8.55 | 6.33 | 9.67 | 3 |
| deepseek-v3.1 | on | 0 | 8.49 | 8.67 | 9.33 | 3 |
| deepseek-v3.1 | on | 1 | 7.96 | 4.57 | 9.43 | 7 |
| gpt-4.1 | off | 1 | 8.20 | 7.00 | 9.86 | 7 |
| gpt-4.1 | on | 1 | 8.65 | 7.00 | 9.83 | 6 |
| llama-3.3-70b | off | 0 | 7.91 | 8.00 | 8.00 | 6 |
| llama-3.3-70b | off | 1 | 7.96 | 8.00 | 8.00 | 4 |
| llama-3.3-70b | on | 0 | 7.77 | 8.00 | 8.00 | 4 |
| llama-3.3-70b | on | 1 | 7.94 | 8.00 | 8.00 | 6 |
J.2 Detection accuracy by speaker role (weighted by statement count n)
Table 12 gives the per-statement detection numbers behind Fig. 10b, weighted by the statement count in each cell. The Doctor is the hardest speaker to label in every model, with accuracy 0.647–0.808 against 0.797–0.931 for Wolves and Jesters, because the Doctor hides a single fact while otherwise speaking honestly and so matches neither the honest-villager nor the lying-wolf template. The Villager rows carry a high accuracy but a near-zero F1 (0.000 for GPT-4.1, 0.074 for DeepSeek-V3.1) because Villagers almost never self-tag a statement as deceptive, leaving the positive class almost empty, so accuracy rather than F1 is the meaningful column for that row. Read together, the table says the deception detectors are strong on the overtly deceptive roles and weakest on exactly the role whose deception is partial.
| Model | Speaker role | n statements | Accuracy | F1 |
|---|---|---|---|---|
| deepseek-v3.1 | Doctor | 1109 | 0.667 | 0.659 |
| deepseek-v3.1 | Jester | 1151 | 0.904 | 0.949 |
| deepseek-v3.1 | Villager | 4948 | 0.809 | 0.074 |
| deepseek-v3.1 | Werewolf | 2394 | 0.931 | 0.964 |
| gpt-4.1 | Doctor | 667 | 0.808 | 0.368 |
| gpt-4.1 | Jester | 976 | 0.887 | 0.938 |
| gpt-4.1 | Villager | 3772 | 0.904 | 0.000 |
| gpt-4.1 | Werewolf | 1397 | 0.809 | 0.894 |
| llama-3.3-70b | Doctor | 1220 | 0.647 | 0.432 |
| llama-3.3-70b | Jester | 1479 | 0.797 | 0.881 |
| llama-3.3-70b | Villager | 5102 | 0.740 | 0.245 |
| llama-3.3-70b | Werewolf | 2588 | 0.848 | 0.917 |
J.3 Within-cell learning curve (first 5 vs second 5 games)
Table 13 splits each cell’s ten chronological games into the first and second half, giving the numbers behind the cumulative curves of Fig. 10c. Two of the ON cells move in opposite directions, Llama-3.3-70B climbing from 0.40 to 0.80 (+40 pp) while GPT-4.1 falls from 0.80 to 0.40 (−40 pp), so the self-learning loop helps the weaker base policy and hurts the stronger one within a single run. DeepSeek-V3.1 ON drifts down more gently (0.80 → 0.60). The OFF cells, which have no learnings file, drift too (Llama 0.20 → 0.60, DeepSeek 0.20 → 0.40, GPT-4.1 0.80 → 0.60), so part of the trend is cell-position noise rather than learning, which is why we read the ON-minus-OFF contrast rather than the raw ON slope as the learning effect.
| Model | Cond | First 5 win-rate | Second 5 win-rate | Overall |
|---|---|---|---|---|
| deepseek-v3.1 | off | 0.20 | 0.40 | 0.30 |
| deepseek-v3.1 | on | 0.80 | 0.60 | 0.70 |
| gpt-4.1 | off | 0.80 | 0.60 | 0.70 |
| gpt-4.1 | on | 0.80 | 0.40 | 0.60 |
| llama-3.3-70b | off | 0.20 | 0.60 | 0.40 |
| llama-3.3-70b | on | 0.40 | 0.80 | 0.60 |
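The within-cell drift and the ON-minus-OFF contrast can both be read off Table 13 with a few lines of arithmetic. The sketch below transcribes the table and computes the two quantities in percentage points; taking the contrast on the overall win-rate column is our assumption about which contrast is meant, and the variable names are illustrative.

```python
# (first-5 win-rate, second-5 win-rate, overall win-rate), from Table 13.
TABLE_13 = {
    ("deepseek-v3.1", "off"): (0.20, 0.40, 0.30),
    ("deepseek-v3.1", "on"): (0.80, 0.60, 0.70),
    ("gpt-4.1", "off"): (0.80, 0.60, 0.70),
    ("gpt-4.1", "on"): (0.80, 0.40, 0.60),
    ("llama-3.3-70b", "off"): (0.20, 0.60, 0.40),
    ("llama-3.3-70b", "on"): (0.40, 0.80, 0.60),
}

def pp(delta):
    """Convert a win-rate difference to whole percentage points."""
    return round(delta * 100)

# Second-half minus first-half drift inside each cell.
drift = {cell: pp(second - first) for cell, (first, second, _) in TABLE_13.items()}

# ON-minus-OFF contrast on the overall win-rate, per model.
models = sorted({model for model, _ in TABLE_13})
overall_contrast = {
    m: pp(TABLE_13[(m, "on")][2] - TABLE_13[(m, "off")][2]) for m in models
}
```

The OFF cells show drifts of the same magnitude as the ON cells (for example +40 pp for Llama OFF), which is the reason the raw ON slope alone is not a clean learning signal.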
J.4 Wolf day-1 vote coordination
Table 14 expands the day-1 vote panel of Fig. 10d into per-cell rates. Wolf voting is well coordinated, with unanimity between 0.40 and 0.90, yet that coordination is frequently aimed at the wrong target, the wolves landing a vote on the Jester in 20–70% of games and most often under GPT-4.1 (0.70 OFF, 0.60 ON). The Voted-Jester rate falls under self-learning for all three models (DeepSeek 0.30 → 0.20, GPT-4.1 0.70 → 0.60, Llama 0.50 → 0.30), measured on the first day-1 ballot rather than the post-revote ballot used by Fig. 12. The Wolves-won column is 0.00 in every row, so none of the five wolf wins in the full corpus came from a game in which the wolves had voted the Jester on day 1, tying the self-sabotage vote directly to the absence of wolf wins.
| Model | Cond | Unanimous | Voted Jester | Voted partner | Wolves won | n |
|---|---|---|---|---|---|---|
| deepseek-v3.1 | off | 0.80 | 0.30 | 0.00 | 0.00 | 10 |
| deepseek-v3.1 | on | 0.40 | 0.20 | 0.30 | 0.00 | 10 |
| gpt-4.1 | off | 0.70 | 0.70 | 0.10 | 0.00 | 10 |
| gpt-4.1 | on | 0.90 | 0.60 | 0.10 | 0.00 | 10 |
| llama-3.3-70b | off | 0.60 | 0.50 | 0.30 | 0.00 | 10 |
| llama-3.3-70b | on | 0.70 | 0.30 | 0.30 | 0.00 | 10 |
J.5 Observer-confidence calibration
Table 15 bins every observer deception-flag by its self-reported confidence and reports the hit-rate per bin, the numeric form of Fig. 10e. GPT-4.1 and DeepSeek-V3.1 are well-calibrated at the top, reaching hit-rates of 1.00 and 0.99 respectively in the highest confidence bin, so a near-maximal-confidence flag from either model can be trusted as ground truth downstream. Llama-3.3-70B never populates that bin at all, its confidence topping out in a lower bin at a 0.94 hit-rate, so its flags cannot be thresholded as tightly. All three models track the diagonal in the lower bins, so the calibration is monotone and the only deficiency is Llama’s missing high-confidence regime.
| Model | Confidence bin | n flags | Hit-rate |
|---|---|---|---|
| gpt-4.1 | | 15 | 0.27 |
| gpt-4.1 | | 193 | 0.24 |
| gpt-4.1 | | 578 | 0.53 |
| gpt-4.1 | | 782 | 0.97 |
| gpt-4.1 | | 911 | 1.00 |
| deepseek-v3.1 | | 46 | 0.33 |
| deepseek-v3.1 | | 1778 | 0.50 |
| deepseek-v3.1 | | 2282 | 0.90 |
| deepseek-v3.1 | | 725 | 0.99 |
| llama-3.3-70b | | 435 | 0.39 |
| llama-3.3-70b | | 3767 | 0.62 |
| llama-3.3-70b | | 1276 | 0.94 |
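The binning behind a calibration table of this kind can be sketched as follows. The bin edges, the function name, and the toy flags are assumptions for illustration; the paper's actual bin boundaries are not recoverable from this appendix.

```python
from bisect import bisect_right

def calibration_table(flags, edges):
    """Bin (confidence, was_correct) pairs and return (count, hit_rate) per bin.

    `edges` are ascending inner boundaries, so len(edges) + 1 bins result.
    An empty bin gets a hit-rate of None.
    """
    counts = [0] * (len(edges) + 1)
    hits = [0] * (len(edges) + 1)
    for confidence, was_correct in flags:
        i = bisect_right(edges, confidence)
        counts[i] += 1
        hits[i] += bool(was_correct)
    return [(c, h / c if c else None) for c, h in zip(counts, hits)]

# Hypothetical flags and bin edges (NOT the paper's data or bins).
toy_flags = [
    (0.55, False), (0.58, True),
    (0.72, False), (0.75, True), (0.78, True),
    (0.95, True), (0.97, True),
]
table = calibration_table(toy_flags, edges=[0.6, 0.8, 0.9])
```

An unpopulated bin shows up as a zero count with no hit-rate, which is how a model that never reports a given confidence range would appear in such a table.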
Appendix K Temporal dynamics: detailed tables
This appendix provides the full tables that back the temporal-dynamics plots in §E. All values are computed from the same 60-game corpus as the rest of the paper.
K.1 Suspicion attracted by role × round
| Role | R0 | R1 | R2 | R3 | R4 |
|---|---|---|---|---|---|
| Doctor | 0.45 | 0.35 | 0.36 | 0.55 | 0.78 |
| Jester | 0.72 | 0.80 | 0.80 | 0.80 | – |
| Villager | 0.31 | 0.25 | 0.29 | 0.38 | 0.34 |
| Werewolf | 0.70 | 0.85 | 0.85 | 0.85 | 0.93 |
The Wolf row rises by 0.23 from R0 to R4 without ever falling: observers genuinely accumulate evidence over time. The Jester row, by contrast, gains only 0.08 between R0 and R1 and is then flat, so the Jester “signature” is essentially decoded in the first speech act. The Doctor U-shape is the most striking pattern: in mid-game, surviving Doctors draw below-average suspicion (0.35–0.36 in R1–R2, down from 0.45 in R0), but by R3–R4 they spike to 0.55–0.78, almost matching Wolves. This is the proximate cause of the universally low Doctor survival reported in Table 20.
K.2 Self-tagged deception rate by model × role × round
| Model | Role | R0 | R1 | R2 | R3 | R4 |
|---|---|---|---|---|---|---|
| deepseek-v3.1 | Doctor | 0.49 | 0.27 | 0.52 | 0.00 | 1.00 |
| gpt-4.1 | Doctor | 0.21 | 0.00 | 0.00 | – | – |
| llama-3.3-70b | Doctor | 0.24 | 0.20 | 0.22 | 0.33 | – |
The Wolf, Jester and Villager self-tag rates are essentially a fixed role policy that does not adapt across rounds for any model; the aggregate values appear in the calibration table of Appendix F. The Doctor, however, splits three ways. GPT-4.1 plays Doctor as a near-clean Villager (0.21 in R0, then 0.00); Llama plays Doctor as a steady deceiver, hiding role information in roughly one in four statements; DeepSeek oscillates between 0.00 and 1.00, sometimes lying as much as a Wolf. Each of these three policies fails in its own way (Doctor survival is uniformly poor across all models), but the failure modes differ: GPT Doctors die from being visible, Llama Doctors die from being flagged as liars, DeepSeek Doctors die from being inconsistent.
K.3 Within-day position effect on suspicion
| Role | early third | mid third | late third |
|---|---|---|---|
| Doctor | 0.39 | 0.46 | 0.39 |
| Jester | 0.68 | 0.80 | 0.81 |
| Villager | 0.27 | 0.35 | 0.25 |
| Werewolf | 0.70 | 0.80 | 0.83 |
The asymmetry is the message: the late-speaker penalty exists only for roles that lie. A naive prediction would be that more thinking time produces safer statements; the data show the opposite for deceivers and a flat or inverted-U shape for honest roles, whose suspicion peaks in the middle third. Observers appear to hold late speakers to a higher inferential bar, and these are precisely the speakers who have had the most time to refine their cover, and whose refinements are therefore the most diagnostic.
K.4 Per-round coordination metrics by model
| Model | Rd | unanim | vote-J | cons. | corr. |
|---|---|---|---|---|---|
| deepseek-v3.1 | 0 | 0.60 | 0.23 | 0.82 | 0.85 |
| deepseek-v3.1 | 1 | 0.88 | 0.32 | 0.77 | 0.47 |
| deepseek-v3.1 | 2 | 1.00 | 0.30 | 0.78 | 0.75 |
| deepseek-v3.1 | 3 | 1.00 | 0.00 | 0.80 | 0.00 |
| gpt-4.1 | 0 | 0.80 | 0.60 | 0.82 | 0.65 |
| gpt-4.1 | 1 | 0.75 | 0.13 | 0.76 | 0.14 |
| gpt-4.1 | 2 | 1.00 | 0.00 | 0.71 | 0.33 |
| gpt-4.1 | 3 | 1.00 | 0.00 | 0.67 | – |
| llama-3.3-70b | 0 | 0.65 | 0.35 | 0.77 | 1.00 |
| llama-3.3-70b | 1 | 0.75 | 0.28 | 0.76 | 0.62 |
| llama-3.3-70b | 2 | 0.83 | 0.04 | 0.75 | 0.86 |
| llama-3.3-70b | 3 | 1.00 | 0.10 | 0.75 | 1.00 |
Three patterns dominate. (i) Wolf unanimity rises with round in every model, reaching 1.00 by round 3 (GPT-4.1 dips slightly at round 1), so the public debate functions as a coordination channel even though the night-kill is private. (ii) The wolf-vote-Jester error is universal on day 1 (GPT 0.60, Llama 0.35, DeepSeek 0.23) and self-corrects within a few rounds, falling to 0.00 for GPT and 0.04 for Llama by round 2 and to 0.00 for DeepSeek by round 3, but the village suspicion landscape it creates is already poisoned. (iii) Doctor target-prediction anti-correlates with overall model strength: Llama, the weakest overall, is the strongest Doctor predictor (0.62–1.00 across rounds), while GPT-4.1, the strongest overall, collapses from 0.65 on day 1 to 0.14 on day 2. We hypothesise that GPT-4.1 over-reasons about decoy possibilities, while Llama applies a simpler “protect the most-suspected innocent” rule that survives observation noise better.
K.5 Cross-game role survival drift (game index 1–10)
| Model | Role | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-v3.1 | Doctor | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
| deepseek-v3.1 | Jester | 0.5 | 0.5 | 0.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 |
| deepseek-v3.1 | Villager | 0.9 | 0.9 | 0.8 | 1.0 | 0.8 | 0.8 | 1.0 | 0.9 | 0.4 | 0.8 |
| deepseek-v3.1 | Werewolf | 0.0 | 0.3 | 1.0 | 0.3 | 0.3 | 0.3 | 0.5 | 0.5 | 0.8 | 0.3 |
| gpt-4.1 | Doctor | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.0 |
| gpt-4.1 | Jester | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| gpt-4.1 | Villager | 0.7 | 1.0 | 1.0 | 1.0 | 0.9 | 0.6 | 0.8 | 0.7 | 0.9 | 1.0 |
| gpt-4.1 | Werewolf | 0.5 | 1.0 | 1.0 | 1.0 | 0.5 | 0.8 | 0.0 | 0.5 | 0.5 | 1.0 |
| llama-3.3-70b | Doctor | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| llama-3.3-70b | Jester | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| llama-3.3-70b | Villager | 0.7 | 0.6 | 0.7 | 0.5 | 0.4 | 0.7 | 1.0 | 1.0 | 0.8 | 0.9 |
| llama-3.3-70b | Werewolf | 0.5 | 0.5 | 0.3 | 0.3 | 0.3 | 0.5 | 0.0 | 1.0 | 0.8 | 0.8 |
Two zero-rows identify the most broken role-model pairings in the benchmark: the Llama Doctor and the GPT-4.1 Jester each survive in none of their games. The Llama Wolf row drifts upward in late games (0.0, 1.0, 0.8, 0.8 over games 7–10) even in the OFF condition where there is no learnings file, suggesting an implicit cell-position effect we flag as a confound for future cross-game analyses. The DeepSeek Villager drop at game 9 (0.4) is an isolated anomaly; we report it to encourage replication rather than to draw a conclusion.
Appendix L External validation of the ToM-order tagger
The 1st/2nd/3rd-order tags in Section 5.6 are produced by the rule-based regex tagger of App. N. To check that the headline ordering does not depend on that tagger, we re-label all accumulated Jester learning entries with two independent LLM judges (DeepSeek-V3.1 and Llama-3.3-70B), each given the same definitions used by the regex rules. Table 21 reports the per-source-model rate of third-order contrastive entries under all three labelers. The absolute percentages are judge-dependent: entry-level agreement with the regex labels, measured by Cohen’s κ, is near chance for both the DeepSeek judge and the Llama judge, because the LLM judges count any audience-aware phrasing as at least second-order whereas the regex requires an explicit perception verb. The rank order across source models, however, is identical for all three labelers: DeepSeek-V3.1 > GPT-4.1 > Llama-3.3-70B. The mechanism claim in Section 5.6 rests on this ordering, not on the absolute values.
| Source model | regex | DeepSeek-judge | Llama-judge |
|---|---|---|---|
| GPT-4.1 | 4.8% | 9.5% | 11.9% |
| DeepSeek-V3.1 | 15.4% | 26.9% | 38.5% |
| Llama-3.3-70B | 0.0% | 2.1% | 0.0% |
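Both checks in this appendix are easy to reproduce in outline. The sketch below verifies the rank-order claim from the Table 21 percentages and gives a plain implementation of Cohen's kappa for entry-level agreement; the toy label lists are hypothetical, and the function names are ours, not those of the released scripts.

```python
from collections import Counter

# Third-order entry rates (%) transcribed from Table 21.
THIRD_ORDER_RATE = {
    "regex": {"GPT-4.1": 4.8, "DeepSeek-V3.1": 15.4, "Llama-3.3-70B": 0.0},
    "DeepSeek-judge": {"GPT-4.1": 9.5, "DeepSeek-V3.1": 26.9, "Llama-3.3-70B": 2.1},
    "Llama-judge": {"GPT-4.1": 11.9, "DeepSeek-V3.1": 38.5, "Llama-3.3-70B": 0.0},
}

def ranking(rates):
    """Source models sorted from highest to lowest third-order rate."""
    return sorted(rates, key=rates.get, reverse=True)

rankings = {labeler: ranking(r) for labeler, r in THIRD_ORDER_RATE.items()}
same_order = len({tuple(r) for r in rankings.values()}) == 1

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    Undefined (division by zero) if chance agreement is exactly 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ToM-order labels for six entries (NOT the paper's labels).
regex_labels = [1, 1, 2, 3, 1, 2]
judge_labels = [2, 1, 2, 2, 2, 2]
kappa = cohen_kappa(regex_labels, judge_labels)
```

The toy labels illustrate how a judge that rarely assigns first-order can agree on half the entries and still score a low kappa, the same pattern the text attributes to the LLM judges.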
Appendix M Runtime configuration
Table 22 records the engine settings that govern every run, emitted by the released configuration-dump script directly from the code paths. We surface it so that the behavioral findings can be read against the exact decoding and game parameters that produced them.
| Parameter | Value |
|---|---|
| Temperature | 1.0 (every call) |
| top_p / max_tokens / stop | provider default / unset / unset |
| Bid range | integer 0–10 |
| Bid parse fallback | silent on parse failure |
| max_debate_turns | 12 |
| max_explanation_turns | 6 |
| Recursion limit | 1000 |
| Context handling | full history, no truncation |
| Self/peer judges | same model as speaker (intra-population) |
| Per-call seed | not exposed by the API (statistical, not bit-exact, reproducibility) |
Appendix N Lesson-tagging script
The strategy-category and ToM-order tags of Section 5.6 (Fig. 4) are produced by a single rule-based tagger. Its input is the set of 141 accumulated lessons: each lesson is one free-text string drawn from the to_do, to_not_do, and winning_tactics lists of a model’s per-model learning JSON. Every lesson is lowercased and matched against two independent pattern sets with Python’s re module.
Strategy categories (multi-label).
A lesson receives every category whose pattern list contains at least one match, so a single lesson can carry several category tags and the per-category percentages in Fig. 4b need not sum to 100%. The nine categories and their verbatim pattern lists, together with the tagging function, are given in the box below.
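A minimal sketch of the multi-label rule is shown below. Only the category name wolf_mimicry appears elsewhere in this appendix; the other two category names and all of the regular expressions are placeholders, not the verbatim pattern lists.

```python
import re

# Placeholder pattern lists; the real tagger uses nine categories.
CATEGORY_PATTERNS = {
    "wolf_mimicry": [r"wolf-like", r"appear .*wolf"],
    "deflection": [r"deflect", r"redirect suspicion"],
    "ambiguity": [r"ambigu", r"erratic"],
}

def tag_categories(lesson):
    """Return every category with at least one matching pattern (multi-label)."""
    text = lesson.lower()
    return sorted(
        category
        for category, patterns in CATEGORY_PATTERNS.items()
        if any(re.search(p, text) for p in patterns)
    )

lessons = [
    "Vote for a player who is not the main suspect (like Joy) to appear "
    "wolf-like by deflecting from the consensus, but only after the main "
    "suspect is already exiled to avoid appearing villager-aligned.",
    "Redirect suspicion towards other players to avoid being the center of attention.",
]
tags = [tag_categories(lesson) for lesson in lessons]

# Per-category share of lessons; shares can sum to more than 100%.
share = {
    c: 100 * sum(c in t for t in tags) / len(tags) for c in CATEGORY_PATTERNS
}
```

In this toy run the first lesson carries two tags, so the per-category shares add up to 150%, which illustrates why the percentages in a multi-label scheme need not sum to 100%.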
ToM order (single-label, strict precedence).
A lesson is tagged 3rd-order if it matches any third-order pattern, else 2nd-order if it matches any second-order pattern, else 1st-order. The third-order patterns require an explicit cross-faction contrast (e.g. wolf-like …not villager); this is exactly what separates a multi-hop incentive prediction from a bare audience-aware statement, and is why a 2nd-order “appear suspicious” lesson does not qualify as 3rd-order.
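The strict-precedence rule can be sketched in a few lines. The two pattern lists below are simplified stand-ins chosen for illustration (a cross-faction contrast for third order, a perception verb for second order); they are assumptions, not the tagger's verbatim patterns, and they do not reproduce every label in Table 9.

```python
import re

# Hypothetical patterns, checked in strict order of precedence.
THIRD_ORDER = [r"wolf-like.*\b(?:not|avoid)\b.*villager"]
SECOND_ORDER = [r"\b(?:appear|seem|look)\w*"]

def tom_order(lesson):
    """Return 3, 2 or 1: the highest order whose pattern set matches."""
    text = lesson.lower()
    if any(re.search(p, text) for p in THIRD_ORDER):
        return 3
    if any(re.search(p, text) for p in SECOND_ORDER):
        return 2
    return 1

examples = {
    "Maintain a vote on a non-consensus target during revote to appear "
    "stubbornly wolf-like and avoid villager alignment.": 3,
    "Escalate suspicion on multiple players in quick succession to appear "
    "erratic and untrustworthy.": 2,
    "Use repeated questioning of players already under scrutiny to amplify "
    "suspicion without initiating new accusations.": 1,
}
predicted = {lesson: tom_order(lesson) for lesson in examples}
```

Because the third-order check runs first, a lesson that contains both a perception verb and a cross-faction contrast is labelled 3rd-order rather than 2nd-order, which is the single-label behaviour the text describes.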
The distinctive-vocabulary counts cited in Section 5.6 come from the same script: per-model lesson text is tokenised with the word pattern [a-z][a-z\-]+, a stop-list of function words and game terms (jester, wolf, villager, player, round, …) is removed, and terms are ranked by TF-IDF excess of one model’s corpus over the other two. By this measure DeepSeek-V3.1’s top distinctive terms are wolf-like (22 occurrences), appear (23), deflection (16) and villager-aligned, the lexical fingerprint of its wolf_mimicry-heavy policy.
Because the tagger is purely lexical, it is deterministic and reproducible; the external LLM-judge validation of the ToM-order axis is reported in App. L.
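The vocabulary ranking can be approximated as below. The token pattern is the one quoted above; the stop-list contents, the smoothed IDF, and the definition of "excess" as the target's TF-IDF minus the mean TF-IDF of the other corpora are all assumptions made for this sketch, as are the toy corpora.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z][a-z\-]+")
# Illustrative stop-list: a few function words plus the game terms named above.
STOP = {"and", "by", "to", "the", "jester", "wolf", "villager", "player", "round"}

def tokenize(text):
    """Lowercase, extract word tokens, and drop stop-list entries."""
    return [t for t in TOKEN.findall(text.lower()) if t not in STOP]

def distinctive_terms(corpora, target):
    """Rank the target model's terms by TF-IDF excess over the other models."""
    counts = {
        model: Counter(t for doc in docs for t in tokenize(doc))
        for model, docs in corpora.items()
    }
    n_models = len(counts)

    def tfidf(model, term):
        total = sum(counts[model].values())
        tf = counts[model][term] / total if total else 0.0
        df = sum(1 for c in counts.values() if term in c)
        return tf * (math.log((1 + n_models) / (1 + df)) + 1)

    others = [m for m in counts if m != target]
    excess = {
        t: tfidf(target, t) - sum(tfidf(o, t) for o in others) / len(others)
        for t in counts[target]
    }
    return sorted(excess, key=excess.get, reverse=True)

# Hypothetical three-model toy corpora (NOT the 141 real lessons).
toy = {
    "deepseek": ["Appear wolf-like by deflecting",
                 "Stay wolf-like and avoid villager alignment"],
    "gpt": ["Appear erratic and untrustworthy"],
    "llama": ["Redirect suspicion towards other players"],
}
top = distinctive_terms(toy, "deepseek")
```

Note that the hyphen in the token pattern keeps compounds such as wolf-like intact, so removing wolf via the stop-list does not remove wolf-like.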
Appendix O Prompts
This appendix lists the verbatim prompts used in WOLF-Triadic. All prompts are templates formatted at runtime with the player name, role, scratchpad, debate history and other context fields. Curly braces in the boxes are the literal Python f-string placeholders.
O.1 Role system prompts
The following four prompts define each role’s objective, win condition and high-level strategy. They are concatenated with a shared character-style prompt and a common game-rules prompt at runtime. The past_learnings placeholder is empty in the OFF condition and is replaced in the ON condition by a formatted block listing the to-do, to-not-do and winning-tactics lists from the per-model learning JSON file.
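One way the past_learnings substitution could work is sketched below. The keys to_do, to_not_do and winning_tactics are the list names given in App. N; the function name, section titles and bullet layout are assumptions, since the exact formatting of the block is not specified here.

```python
def format_past_learnings(learnings, enabled):
    """Build the text substituted for the past_learnings placeholder.

    Returns an empty string in the OFF condition (enabled=False).
    """
    if not enabled:
        return ""
    sections = [
        ("to_do", "To do"),
        ("to_not_do", "To not do"),
        ("winning_tactics", "Winning tactics"),
    ]
    lines = []
    for key, title in sections:
        items = learnings.get(key, [])
        if items:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)

# Hypothetical learnings file content.
example = {
    "to_do": ["Stay quiet on day 1."],
    "to_not_do": [],
    "winning_tactics": ["Appear wolf-like without looking intentional."],
}
block_on = format_past_learnings(example, enabled=True)
block_off = format_past_learnings(example, enabled=False)
```

Returning an empty string in the OFF condition keeps the role prompt template identical across conditions, so the only difference between ON and OFF is the content of this one placeholder.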
O.2 Bidding prompt
Before each speaking turn, every alive player is asked to bid an urgency score 0–10 for speaking next. The highest bidder speaks.
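The speaker-selection step can be sketched as follows, combining the 0–10 bid range with the "silent on parse failure" fallback listed in Table 22. The parsing regex, the tie-breaking rule (first player in iteration order) and the player names other than Joy are assumptions for illustration.

```python
import re

def parse_bid(reply):
    """Extract an integer bid in 0..10 from a model reply, else None."""
    match = re.search(r"\b(\d{1,2})\b", reply)
    if not match:
        return None
    bid = int(match.group(1))
    return bid if 0 <= bid <= 10 else None

def next_speaker(replies):
    """Pick the highest bidder; players whose bid fails to parse stay silent.

    Ties go to the first such player in dict order (an assumed rule).
    """
    bids = {}
    for name, reply in replies.items():
        bid = parse_bid(reply)
        if bid is not None:
            bids[name] = bid
    if not bids:
        return None
    return max(bids, key=bids.get)

# Hypothetical replies from four alive players.
replies = {"Joy": "7", "Sam": "I bid 9", "Ana": "no idea", "Lee": "15"}
speaker = next_speaker(replies)
```

Here Ana's reply contains no number and Lee's is out of range, so both are treated as silent and Sam speaks next.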
O.3 Deception assessment prompts
Each public statement is assessed twice: once by the speaker for self-reported deception (the speaker self-assessment prompt) and once by every other living player for peer-detected deception (the observer prompt). Both prompts use the same 8-class taxonomy.
O.4 Jester self-learning update prompt
After every game in the ON condition, the following prompt is issued once to the Jester model. The output is parsed as JSON, deduplicated against the existing learnings via a semantic-similarity check, and appended to the per-model learnings file.
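The parse, deduplicate and append steps can be sketched as below. The semantic-similarity check is not specified in this appendix, so the sketch swaps in character-level similarity from Python's difflib as a stand-in; that substitution, the 0.85 threshold and the function names are all assumptions.

```python
import json
from difflib import SequenceMatcher

KEYS = ("to_do", "to_not_do", "winning_tactics")

def is_duplicate(new, existing, threshold=0.85):
    """Stand-in similarity check (string ratio, NOT a semantic model)."""
    return any(
        SequenceMatcher(None, new.lower(), old.lower()).ratio() >= threshold
        for old in existing
    )

def merge_learnings(store, raw_json):
    """Parse the model's JSON output and append non-duplicate lessons."""
    update = json.loads(raw_json)
    for key in KEYS:
        bucket = store.setdefault(key, [])
        for lesson in update.get(key, []):
            if not is_duplicate(lesson, bucket):
                bucket.append(lesson)
    return store

# Hypothetical existing learnings and a hypothetical model output.
store = {"to_do": ["Stay quiet on day 1."]}
raw = json.dumps({
    "to_do": ["stay quiet on day 1", "Vote late."],
    "winning_tactics": ["Appear wolf-like."],
})
merged = merge_learnings(store, raw)
```

In a real pipeline the merged dictionary would then be written back to the per-model learnings file; a near-verbatim repeat of an existing lesson is dropped while genuinely new lessons are kept.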