License: CC BY 4.0
arXiv:2606.27909v1 [cs.CL] 26 Jun 2026

Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs

Avni Mittal
avni.mittal2002@gmail.com
Abstract

Theory-of-mind evaluations of large language models typically use dyadic social-deduction games, where every observable cue points to a single hidden side, so a model with strong language priors can score well without ever simulating opponents’ incentives. We extend the Werewolf game with a Jester, a third faction whose utility on peer suspicion is inverted because it wins by being voted out, so optimal play requires reasoning across three opposing utility functions. Across 60 games on GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B with Jester self-learning on and off, the Jester wins 60–70% of games while Werewolves never exceed 20%, and GPT-4.1 wolves vote the Jester out on day 1 in 60–70% of games, a strictly self-defeating action. Self-learning helps DeepSeek and Llama but hurts GPT-4.1, with the cost landing on Villagers rather than Werewolves. Only DeepSeek learns the subtle strategy of looking suspicious without looking intentionally suspicious, and it gains the most from the loop. Triadic incentive structure exposes a layer of multi-agent reasoning that dyadic deduction games leave invisible.

Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs

Avni Mittal avni.mittal2002@gmail.com

1 Introduction

Werewolf and related social-deduction benchmarks for large language models are dyadic, splitting players into two opposing sides, Villagers against Werewolves or loyal players against traitors (Liu et al., 2024; Bailis et al., 2024; Curvo, 2025; Xu et al., 2023a, b; Wang et al., 2024), so every decision reduces to one question. Is this player an ally or an enemy? Every observable cue points one way, so a model that learns the surface pattern hesitation \Rightarrow likely werewolf can win without modelling any opponent. The theory-of-mind (ToM) (Premack and Woodruff, 1978) optimal policy and a pure cue-reading policy then select the same action, and a model can win by exploiting cues as it might exploit shortcut features elsewhere (Geirhos et al., 2020; McCoy et al., 2019; Niven and kao, 2019). We call this the cue-sufficiency problem, and a high score under it does not show genuine mental-state reasoning (Shapira et al., 2023; Ullman, 2023), a coincidence we confirm in our dyadic control.

We break this coincidence by adding a third faction whose payoff on the same cue runs the opposite way. The Jester wins if and only if the Villagers vote them out, so it tries to look as suspicious as possible while the Werewolves try to look innocent. A high-suspicion player is now ambiguous evidence, a Werewolf to exile or the Jester to leave alone, so good play must avoid looking like either while steering Villager attention away from anyone who might be the Jester. No dyadic variant draws out this multi-hop reasoning over three competing goals (He et al., 2023; Street et al., 2024; Wang et al., 2024).

Refer to caption
Figure 1: Overview of the WOLF triadic-incentive pipeline. A 10-player Werewolf game is instantiated with three competing factions (Villagers, Werewolves, and the Jester) whose utilities over the same observable cue diverge. Each game produces a transcript that is scored along faction outcomes, per-statement deception annotations, and voting dynamics. After every Jester game, the Jester-side model condenses what worked into a persistent learning file that is fed back into the next game’s Jester system prompt, yielding the OFF\toON self-learning condition we evaluate.

We argue that current LLMs do not reliably make this inference (Sap et al., 2022; Strachan et al., 2024a; Shapira et al., 2023; Ullman, 2023), and the failure shows up faction by faction. Three behavioral signatures follow if models fall back on dyadic priors. (i) The Jester wins more often than either other side. (ii) Werewolves sabotage themselves by voting the Jester out on day 1. (iii) An intervention aimed only at the Jester takes its gains from the Villagers rather than the Werewolves, since the Werewolves already fail the inference and only Villager belief-updating gives the Jester something to exploit. Our contributions are:

  • A 10-player triadic Werewolf benchmark whose Jester’s inverted utility breaks the cue-sufficiency of dyadic deduction games.

  • A controlled 60-game evaluation across three frontier models (GPT-4.1, DeepSeek-V3.1, Llama-3.3-70B) (Achiam et al., 2023; DeepSeek-AI et al., 2024; Dubey et al., 2024) with a Jester self-learning loop toggled OFF/ON (Shinn et al., 2023; Xu et al., 2023a), measuring first-order Jester gains and second-order effects on the other sides.

  • A hand-coded analysis of each model’s accumulated Jester lessons by theory-of-mind depth, linking depth to the model’s gain from the loop.

2 Related Work

LLMs in Werewolf and other social-deduction games.

Werewolf has become the canonical testbed for LLM agents under deception and hidden-role inference. Xu et al. (2023c) pair an LLM proposer with an RL action selector to reach human-comparable play; Wu et al. (2024) couple a System-1 LLM to a System-2 “Thinker” trained on a 19k-game corpus; Xu et al. (2025) optimise an abstract latent strategy space with counterfactual regret minimisation. Werewolf Arena (Bailis et al., 2024) introduces the bidding-based debate protocol that we adopt, and InterIntent (Liu et al., 2024) probes goal-tracking among LLM agents. The wider family includes AvalonBench (Light et al., 2023), AmongAgents (Chi et al., 2024), The Traitors (Curvo, 2025), Hoodwinked (O’Gara, 2023), SpyGame (Kim et al., 2024), and CICERO’s Diplomacy agent (Bakhtin et al., 2022). All of these are dyadic at the faction level, so a model can win by maximising a single “does this player look suspicious?” signal. We add a third faction whose utility on that same signal is inverted, dissociating “detect deception” from “decide whose deception to act on”.

Theory of mind in LLMs.

Static ToM benchmarks include FANToM (Kim et al., 2023), OpenToM (Xu et al., 2024), and Hi-ToM (He et al., 2023), which varies recursion depth explicitly. Strachan et al. (2024b) report that GPT-4 matches humans on classical false-belief tasks but underperforms on faux-pas detection, while Hagendorff (2023) show that deception itself emerges with scale. The recurring critique (Ren et al., 2025) is that surface heuristics solve many ostensibly mental-state items without genuine belief representation. Multi-agent games complement static items by reading mental-state inference out of strategic behaviour under incentive pressure; our claim is sharper, that the incentive structure must itself be designed to dissociate ToM-driven from heuristic-driven solutions.

Strategic deception, alignment, and self-improvement.

A parallel literature studies hidden objectives and evaluator manipulation. MACHIAVELLI (Pan et al., 2023) measures the reward-vs-ethics trade-off across 134 text adventures; Sleeper Agents (Hubinger et al., 2024) show that backdoor deception survives standard safety training; GovSim (Piatti et al., 2024) shows LLM societies overconsume shared resources under misaligned incentives. On in-context self-improvement, Reflexion (Shinn et al., 2023), Voyager (Wang et al., 2023), and Generative Agents (Park et al., 2023) all formalise verbal RL loops that rewrite a textual memory between episodes. Our jester_learning condition is a deliberately minimal instance of this paradigm and lets us audit whether such a loop helps or hurts when the role’s task is inherently deceptive.

The Jester mechanic.

Human social-deduction games such as Mafia and Town of Salem have long used the Jester, a role that wins by being voted out and so breaks the single axis of suspicion the other roles rely on. We are the first to use it for studying Theory of Mind Reasoning in an LLM benchmark.

3 Methodology

3.1 Game architecture

Figure 1 summarises the full pipeline. We adopt the game architecture and implementation of WOLF (Agarwal et al., 2025) and make three changes, giving the 10-player roster in Table 1. We drop the Seer, whose nightly check of whether a chosen player is a Werewolf gives Villagers a reliable signal to anchor on and so removes the cue ambiguity our probe depends on. We raise the Villager count from four to six, since six Villagers and a 2:6 wolf-to-village ratio keep a five-vote daytime majority even with the Jester filling one of the ten slots. We keep a single Doctor, which preserves protect-target inference without handing Villagers a second reliable channel of the kind the Seer would provide. A game alternates between night and day phases. The day opens with the announcement of any night kill, followed by an open debate, a vote, and a revote. The game ends as soon as one of the faction-win conditions in Table 1 fires.

Role Count Purpose in the game
Villager 6 No special action at night. By day, debates and votes to exile a suspected Werewolf or Jester. Wins if all Werewolves are eliminated.
Doctor 1 Each night, privately selects one player to protect from the Werewolves’ kill (may self-protect). Plays as a Villager by day and wins with the Villager faction.
Werewolf 2 Each night, the Werewolves jointly select one player to kill. By day they conceal their identity, vote against Villagers, and avoid voting out the Jester. Wins when the Werewolves equal or outnumber the surviving non-Jester players.
Jester 1 No special night action. By day, behaves so as to attract suspicion and be exiled by majority vote. Wins if and only if exiled; loses if killed at night or alive at game end. A Jester win is a loss for both other factions.
Table 1: Per-game role roster (10 players total), with each role’s function during the night and day phases and its win condition.

Bidding-based debate.

We inherit the bidding-based debate from Bailis et al. (2024), as implemented in WOLF (Agarwal et al., 2025). Before each round, every living player produces an integer urgency bid for how much they want to speak next, and the highest bidder gets the floor for a short public statement. Running several bid-and-speak rounds per day routes the floor to whoever has the most to say (a strong accusation, a defense, or a useful piece of information) rather than to a fixed turn order. This matters for our setup because the Jester benefits from airtime to draw suspicion while the Werewolves benefit from controlling when they speak.

Vote, explanation, revote.

After debate, every living player casts a vote to exile someone and gives a short public reason, both written into the shared transcript, so this first vote is a public signal the others can reason about rather than a silent tally. Any player who received a vote then defends themselves, and the rest rejoin through the same bidding mechanism to question or respond, after which a final revote follows. Exile requires a strict majority of living players (nalive/2+1\lfloor n_{\text{alive}}/2\rfloor+1); if no candidate reaches it the day ends with no exile and play moves to night. This defend-then-revote step lets players revise once they have seen who suspects whom, the moment where multi-hop reasoning about the Jester matters most.

Peer suspicion score.

After every public statement, each other living player rates how suspicious it seemed on a [0,1][0,1] scale, and each player’s running suspicion against every other is updated by exponential smoothing, St=αst+(1α)St1S_{t}=\alpha s_{t}+(1-\alpha)S_{t-1}, with α=0.7\alpha=0.7, so the latest rating dominates while prior history tempers it. This yields one number per player summarizing recent peer suspicion, the main observable signal for our cue-ambiguity analysis.

Faction utilities and cue ambiguity.

Let St(p)S_{t}(p) be peer-aggregated suspicion against player pp at time tt, the single observable every faction acts on. Villagers want SS high on the Werewolves and low on the Jester, the Jester wants the reverse for itself since exile is its only win, and the Werewolves sit between with private team knowledge the Villagers lack. A high-SS player who is not a Werewolf’s teammate is therefore either the Jester or a wrongly-accused Villager, so an attentive Werewolf should keep it alive rather than help exile it. The Jester thus plays to two audiences. It must look like a likely Werewolf to the uninformed Villagers so they exile it, yet stay ordinary enough that the informed Werewolves do not recognise and shield it. The same high-SS value demands opposite correct responses depending on the hidden role behind it, which is why no one-dimensional reading of SS suffices.

A Jester win is, by the rules, a simultaneous loss for both other factions, since either could have prevented it, the Villagers by withholding the exiling vote and the Werewolves by shielding a recognised Jester. It therefore requires both to fail the same inference, which is why we read the Jester win-rate as a direct measure of multi-hop ToM failure. We use multi-hop theory of mind behaviorally throughout, the ability to act on a cue whose correct reading depends on a hidden third party’s incentives, and measure that signature rather than claim anything about internal representations.

3.2 Jester self-learning loop

After each game in the ON condition, an LLM call summarizes the game from the Jester’s perspective and proposes new lessons, returning an empty set when the game taught nothing new. Each lesson is a short free-text note about a tactic that worked or should be avoided, sorted into to-do, to-avoid, and winning-tactic lists (representative entries in Table 2). The list is per-model, so each model sees only its own past lessons, and the next game’s Jester reads them in its prompt before play. The update is append-only and the prompt forbids entries that duplicate or paraphrase existing ones (full prompt in Appendix O). The OFF condition disables the loop entirely, so comparing ON against OFF isolates the effect of accumulated Jester wisdom on faction outcomes. We later analyze the accumulated entries in Section 5.6 along two axes, a 9-class strategy-category taxonomy and a 3-class theory-of-mind order summary (Figure 4).

Model ToM Entry (verbatim)
DeepSeek-V3.1 3 Vote for a player who is not the main suspect to appear wolf-like by deflecting from the consensus, but only after the main suspect is already exiled to avoid appearing villager-aligned.
GPT-4.1 2 Escalate suspicion on multiple players in quick succession to appear erratic and untrustworthy.
Llama-3.3-70B 1 Use the initial questions and reactions to gauge players’ behaviors and adjust strategy accordingly.
Table 2: One representative Jester self-learning entry per ToM order (1 = bare imperative; 2 = references how others perceive the agent; 3 = explicit contrast across two faction-perceptions, e.g. wolf-like not villager-aligned). Full 7-entry table per model in App. F (Table 9).

3.3 Deception protocol

Every public statement during debate or vote-explanation is a unit of analysis. Right after a statement, a separate self-report call asks the speaker to label it, returning a binary deception flag, a confidence c[0,1]c\in[0,1], and a deception type. We extend WOLF’s four deceptive types (omission, distortion, fabrication, misdirection) with three more (sycophancy, strategic_deception, persuasive_manipulation), giving the 8-class taxonomy in Table 3.

Class Definition
none Truthful statement with no deceptive intent.
omission Deliberately hiding relevant information.
distortion Twisting or exaggerating facts.
fabrication Making up false information.
misdirection Diverting attention away from important topics.
sycophancy Agreeing with others to blend in or gain trust, rather than because one believes the claim.
strategic_deception Deliberately misrepresenting role, capabilities, or intentions.
persuasive_manipulation Using emotional or social pressure rather than logical argument.
Table 3: Eight-class self-reported deception taxonomy used by the speaker for every public statement (Appendix O).

All other living players concurrently emit a continuous suspicion score s[0,1]s\in[0,1] for the same statement, updated by the same exponential-smoothing rule. Speaker and observer share the same taxonomy and reasoning view, so self-reported and peer-detected deception are jointly analyzable per role per model.

4 Evaluation Setup

Refer to caption
Figure 2: Faction win-rate by model under Jester-learning OFF \to ON. Each panel is one faction; each line is one model; the slope reads Δ\Delta directly. The Jester panel shows the ceiling effect for GPT-4.1 and large gains for DeepSeek-V3.1 and Llama-3.3-70B; the Villager panel shows the symmetric loss that absorbs those gains.

Design.

We run a 3×23\times 2 factorial of three frontier models (GPT-4.1, DeepSeek-V3.1, Llama-3.3-70B) crossed with the Jester self-learning loop (OFF, ON), 10 games per cell for 60 games total, with one model playing every role in a game. Cells run sequentially, so under ON the kk-th game’s Jester reads the lessons from games 1:k11{:}k{-}1 of that same cell, and per-cell files are isolated so DeepSeek’s ON cell never sees GPT-4.1’s lessons.

Engine and call settings.

Bids are integers in [0,10][0,10] (malformed bids parse as 0), phase caps are max_debate_turns = 12 and max_explanation_turns = 6, and all calls use temperature = 1.0 with the full conversation history untruncated. The complete configuration is in Table 22.

Aggregation and statistics.

Per cell we report Wilson 95% confidence intervals on win-rates, which stay inside [0,1][0,1] at small nn and extreme rates where the normal interval does not. Per-statement aggregations (deception types, suspicion attracted, ToM order) pool across the 10 games of a cell and report unweighted means unless stated otherwise. Our evidence sits at two scales: win-rate claims rest on the same pattern repeating across all six 10-game cells, while the behavioral and mechanistic analyses run over the 3,992 per-statement deception events.

5 Results

We report aggregate outcomes, the self-learning loop’s marginal effect, behavioral and mechanistic evidence for the multi-hop ToM hypothesis, and a deductive coding of what each model learns.

5.1 Faction win rates and the effect of self-learning

A single Jester agent against eight other actively-deducing agents wins more games than the entire Werewolf faction combined in every cell (Figure 2). Pooling across models and conditions, the Jester achieves 0.55, Villagers 0.37, Werewolves 0.08 (per-cell rates in Table 7, Appendix A).

The self-learning loop helps each model by a different amount (Jester panel of Figure 2). GPT-4.1 already wins 0.700.70 of its Jester games without any learning, so it has little room to improve, whereas DeepSeek-V3.1 rises from 0.300.30 to 0.700.70 (Δ=+0.40\Delta=+0.40) and Llama-3.3-70B from 0.400.40 to 0.600.60 (Δ=+0.20\Delta=+0.20). The loop thus helps most when a model starts out playing the Jester poorly and adds little once it is near the best a Jester can do.

Faction Triadic Dyadic
(Jester present) (Jester removed)
Villagers 0.35 [0.19, 0.54] 1.00 [0.74, 1.00]
Werewolves 0.08 [0.02, 0.24] 0.00 [0.00, 0.26]
Jester 0.58 [0.39, 0.74]
Table 4: Falsifiability control on GPT-4.1 (role-shuffled, learning OFF). Replacing the Jester with a sixth Villager restores Villager wins to 100%. The Jester role is the active ingredient that breaks Villager performance.

These extra Jester wins come from the Villagers, not the other deceptive role (Fig. 13a, App. H). The Werewolves were already losing almost every game before the loop, so they have no wins left to give up, and the Jester’s gains can only come from the one faction still actively trying to tell the roles apart.

5.2 Falsifiability: removing the Jester

The Jester’s dominance only implicates multi-hop ToM if the other sides fail because the Jester makes the suspicion cue ambiguous, rather than because the model is simply weak at Werewolf. We separate these with a dyadic control on GPT-4.1: the same engine and prompts, but with the Jester replaced by a sixth Villager, roles shuffled per game, and the learning loop off. Table 4 contrasts it with the matched triadic cell (GPT-4.1, shuffled roles, learning off).111Both cells use per-game role shuffling on GPT-4.1, which also rules out a fixed name-to-role prior as the driver; the dyadic control is run on GPT-4.1 only. Removing the Jester lifts the Villagers from 0.350.35 to a clean sweep of 1.001.00 (11/11 games), while the Werewolves stay near zero in both regimes.

Metric OFF ON Δ\Delta
(n=32n{=}32) (n=30n{=}30) (ON-OFF)
Jester win rate 0.625 0.933 +0.308+0.308
Villager win rate 0.281 0.033 0.248-0.248
Werewolf win rate 0.094 0.033 0.060-0.060
Wolf votes Jester on day 1 0.719 0.933 +0.215+0.215
Jester exiled by day-1 vote 0.500 0.867 +0.367+0.367
Table 5: Role-shuffled GPT-4.1 robustness run (per-game random role assignment). All Δ\Delta except the Werewolf row are significant at p<0.05p<0.05 by two-proportion zz and bootstrap.

The same model that loses the triadic game plays the dyadic game perfectly. In the dyadic regime the only high-suspicion players are the Werewolves, so the surface heuristic “exile the most suspicious player” coincides with the ToM-optimal policy and a pure cue-follower wins every game without modelling any opponent’s incentives. Adding the Jester inserts a second high-suspicion role whose desired outcome is inverted, which dissociates the cue from the optimal action; the Villager win-rate then collapses from 1.001.00 to 0.350.35 (p=2.7×104p=2.7\times 10^{-4}). Since the model, prompts, and engine are identical across the two regimes, the collapse cannot be generic weak instruction-following or strategic search, which would also have hurt the dyadic game; it appears only once the cue and the correct action are pulled apart. The triadic\todyadic gap is thus a direct readout of the multi-hop ToM that a dyadic score, however perfect, cannot certify.

5.3 Robustness: de-confounding GPT-4.1 with role-shuffling

Model Cond A B C D W T
DeepSeek-V3.1 off 1 1 2 5 1 10
DeepSeek-V3.1 on 2 2 5 1 0 10
GPT-4.1 off 7 3 0 0 0 10
GPT-4.1 on 6 3 0 0 1 10
Llama-3.3-70B off 2 2 2 3 1 10
Llama-3.3-70B on 4 1 1 2 2 10
Table 6: Wolf-loss failure-mode counts per cell (n=10n=10 games each). A = Wolves voted the Jester out on day 1 (a Jester win); B = Wolves killed the Jester at night (a Villager win); C = Villagers voted the Jester out with the Wolves uninvolved (a Jester win); D = Villagers outnumbered the Wolves (a Villager win); W = Wolves won; T = total. A and B together are wolf self-sabotage, and A and C are the two Jester-win routes.
Refer to caption
Figure 3: Per-role deception-type distribution. Each row is one (model, condition) pair. Bars are 100%-stacked across the 8-type taxonomy. Composition is largely a property of the model rather than of the role, and is mostly preserved across the learning intervention.

Our main runs fix the name-to-role mapping (the same player name is the Jester in every game of a given model and condition), leaving a name-prior confound and a small per-condition sample. We re-run GPT-4.1 with roles permuted uniformly at random per game at a larger sample (about 30 games per condition), all else unchanged (Table 5). Three points follow. First, the day-1 self-sabotage is not a name artifact: under shuffling it rises to 0.930.93 (ON), so wolves target whoever currently holds the Jester role, not a learned name. Second, the substitution effect of Section 5.1 reproduces at larger nn and is now significant, again paid by the Villagers rather than the Werewolves. Third, the GPT-4.1 learning effect flips sign, the loop helping (Δ=+0.308\Delta=+0.308) rather than hurting (Δ=0.10\Delta=-0.10 in Section 5.1), because the harder de-confounded regime removes the name shortcut and an accumulated-strategy file then buys real edge. Shuffling confirms rather than overturns the findings, so we run it for GPT-4.1 only and keep the non-shuffled runs as our main results.

5.4 Voting pathology: wolf self-sabotage and vote alignment

Voting the Jester on day 1 is the sharpest test of multi-hop ToM (slope charts in Fig. 12, App. G). A Werewolf must see that the high-SS player might be the Jester, that exiling it ends the game in a Wolf loss, and refuse the village majority, and failing any step is a self-defeating vote. Every role is even warned of this directly, the Werewolf, Villager, and Doctor prompts all saying not to exile a player just for looking suspicious since it may be the Jester (App. O), so the failure is not a missing rule but an inability to act on a known one against the suspicion cue. GPT-4.1 wolves cast this vote in 60–70% of games, DeepSeek-V3.1 in 20–30%, and Llama in 30–40%, and self-learning lowers the rate for GPT-4.1 and DeepSeek-V3.1 but raises it for Llama-3.3-70B, the only wrong-faction substitution. The model with the tightest vote–suspicion alignment (panel b) is also the one whose Wolves most often suicide-vote the Jester, so committing to the top suspect without asking which faction it belongs to is exactly the failure the triadic design exposes.

Table 6 classifies all 60 games by Wolf-side outcome, categories A and B being self-sabotage (Wolves voted or killed the Jester), C a villager-driven Jester win, and D a clean Villager team-win by outnumbering. GPT-4.1 is the extreme case, ending 19 of 20 games via A or B (13 by the day-1 vote alone) and none via C or D, while DeepSeek-V3.1 shifts the other way under learning, 5/10 ON-condition Wolf losses becoming Jester-baited villager votes (C) rather than attrition (D). Wolves won only 5 of 60 games, and univariate predictors (Table 8, App. F) confirm the pattern, the day-1 wolf-vote-Jester being the strongest correlate of a Jester win (r=+0.59r{=}{+}0.59) while the Jester’s raw statement count is uncorrelated (r0r{\approx}0). Night actions tell the same story. Wolves kill the Jester at night in 20–40% of games. Across votes and kills GPT-4.1 wolves take action against the Jester in nearly every game (Fig. 13b, App. H). Cooperative roles also misfire, the Doctor self-protecting in 80–100% of games rather than shielding a teammate yet still the most-killed role (App. C).

We then ask whether the wolves recognise the Jester and misplay or simply cannot tell it from a Werewolf. An external judge (DeepSeek-V3.1) reads each finished game’s public transcript and names the hidden roles from every surviving seat (5454 games, 29282928 guesses). It labels the Jester correctly only 0.2510.251 of the time, no better than the 0.250.25 chance rate among four roles and far below its 0.8350.835 on Villagers, and a Werewolf-seat view does no better (0.2590.259) than the Villager (0.2440.244) or Doctor (0.2710.271) seats. The transcript thus carries no signal separating a Wolf from a Jester, so the day-1 vote is not a recognised Jester misplayed but the same one-dimensional cue-following as everyone else, the cue-sufficiency failure we predicted rather than a coordination or detection breakdown.

5.5 Deception is a model property, not a role or loop property

Refer to caption
(a)
Category GPT-4.1 DeepSeek-V3.1 Llama-3.3-70B
wolf_mimicry 0.0 71.2 2.1
inconsistency 23.8 25.0 4.3
aggression 23.8 1.9 0.0
subtle_indirection 9.5 3.8 36.2
passive_provocation 23.8 5.8 17.0
anti_jester_tells 14.3 1.9 12.8
vote_manipulation 9.5 44.2 2.1
meta_modeling 31.0 26.9 10.6
questioning 14.3 23.1 12.8
(b)
Figure 4: (a) Theory-of-mind order of accumulated Jester lessons across the three models (141 entries total): DeepSeek-V3.1 has 15% third-order contrastive entries; Llama-3.3-70B has none. (b) Strategy-category distribution of learned Jester lessons (% of entries; multi-tag).

The Jester out-produces the Werewolves in deceptive statements per turn in every model (for example, GPT-4.1 Jester 7.47.4 versus Wolf 5.05.0 ON), yet the two attract statistically indistinguishable suspicion, hovering around 0.65–0.81 across all six cells against 0.22–0.51 for Villagers and the Doctor (full dumbbell plots in Fig. 14, App. I), so the high-suspicion band holds two agent types with opposite desired outcomes.

The type of deception is equally a model trait. GPT-4.1 leans on distortion (Jester 47%\to68% ON) and strategic_deception (Wolf 59%\to72%) with zero fabrication, DeepSeek-V3.1 on fabrication (Jester \approx62% in both conditions), and Llama-3.3-70B on misdirection (Jester 40%\to57%). Each model imports one style and applies it inside whatever role it inhabits, so composition barely shifts with role or loop, and the same signature recurs in speaker-vs-observer calibration where honest roles are over-flagged and skilled liars under-flagged (App. D). A model that deceives the same way regardless of faction is unlikely to be running recursive reasoning over its opponents’ incentives.

The second-order suspicion shifts confirm the pattern. The loop only edits the Jester’s prompt, yet it moves how other agents suspect each other in a model-specific way. Llama-3.3-70B ON shifts every observer’s rating of the Doctor by 0.15-0.15 to 0.23-0.23 and lifts Jester ratings by +0.08+0.08 to +0.09+0.09, while DeepSeek-V3.1 ON makes Wolves less suspicious of the Jester (0.07-0.07), so they less often join the exiling vote and Jester wins rise. GPT-4.1’s near-flat diff matrix matches its saturation (Section 5.1), with no second-order room left at the ceiling of cue-ambiguity exploitation. The full heatmap (App. B.1), six-axis radar (App. B.3), and game-length distributions (App. B.2) make these cross-cell shapes immediate, with Jester wins concentrating on day 1 for GPT-4.1, within 1–2 rounds for the others.

5.6 Inside the learned policies

Outcome metrics show that the loop helps unevenly but not why. To recover the mechanism we deductively coded all 141 accumulated lessons (GPT-4.1: 42, DeepSeek-V3.1: 52, Llama-3.3-70B: 47) with a rule-based regex tagger (patterns and code in App. N) along two axes, a 9-class strategy-category taxonomy grounded in the triadic-incentive model (Fig. 4b) and the 3-class theory-of-mind order of Section 3.2 (Table 2). Only a third-order entry separates being mistaken for a Wolf from being mistaken for a Villager, so this class is the textual signature of multi-hop incentive prediction.

The three models converge on qualitatively distinct policies (Fig. 4b). DeepSeek-V3.1 encodes the cue-ambiguity exploit explicitly, devoting 71% of its entries to wolf_mimicry against 2%\leq 2\% for the other two. GPT-4.1 instead acts erratic and lets the group affix the label, combining meta_modeling (31%), aggression (24%) and passive_provocation (24%), while Llama-3.3-70B leans on subtle_indirection (36%). A per-model distinctive-vocabulary (TF-IDF) analysis corroborates the split (App. N).

The ToM-order distribution (Fig. 4a) is the paper’s cleanest mechanistic result. DeepSeek-V3.1 is the only model that reliably produces third-order contrastive entries (15.4%), against 4.8% for GPT-4.1 and none for Llama-3.3-70B, and two independent LLM judges reproduce this rank order even though their absolute percentages differ (App. L; representative entries in Table 2). The ordering tracks the learning gains of Section 5.1, the model whose lessons encode the triadic incentive structure (DeepSeek-V3.1) extracting the most from the loop (Δ=+0.40\Delta=+0.40) and Llama-3.3-70B, whose lessons are 64% bare first-order imperatives, only +0.20+0.20.

6 Conclusion

Adding a Jester to a Werewolf game converts a one-dimensional suspicion signal into a triadic incentive-prediction problem, and current frontier LLMs fail at it. Across 60 games on three models, the Jester wins the majority of cells, Werewolves never exceed 20%, and the single-faction self-learning intervention pays for its gains out of Villager wins rather than Wolf wins. Behavioral fingerprints (day-1 self-sabotage votes, night-kill of the Jester, model-stable deception-type composition) and second-order cross-perception shifts pin the failure on cue-sufficiency rather than coordination or signal detection, and deductive coding of the 141 accumulated lessons adds a textual mechanism, the only model whose self-learning produces contrastive third-order entries (DeepSeek-V3.1, 15.4%) being the one that gains most from the loop. Triadic incentives are a small structural change that exposes a layer of multi-agent reasoning dyadic deduction games leave invisible, and we release the benchmark, logs and analysis to enable follow-up work that controls the cue-ambiguity axis deliberately.

Limitations

We run only 10 games per cell, so the Wilson confidence intervals on each win-rate are wide. Our claims therefore rest on the same pattern showing up across cells, not on any single cell being significant. The Jester self-learning loop is trained per model, so we do not test whether lessons learned by one model help another. We did not log per-observer true- and false-positive labels, so we cannot report F1F_{1}-style detection scores for the 8-type taxonomy. Passing the full conversation history with no compression keeps all information from earlier rounds, but it also raises token cost and fills the context window as games get longer. Adding a proper memory structure, for example retrieval over a running summary or dropping low-relevance turns, is left to future work. The cue-ambiguity claim is an argument about game structure, and other ToM probes, such as asking each agent direct second-order belief questions mid-game, could add to the behavioral evidence we give here. Finally, all our games are in English, so we do not know how well the triadic-incentive idea carries over to other languages.

Ethical considerations

Studying deception in LLMs.

The benchmark explicitly incentivises one role (the Jester) to deceive other agents. We do not view this as encouraging real-world deception by deployed models; rather, structured studies of deceptive capability are necessary for safety evaluation, since deceptive behaviour cannot be measured or mitigated without first being elicited and characterised. The self-learning loop is run in a sandboxed gameplay environment with no human users in the loop.

Dual-use of the self-learning loop.

The per-model running-lessons file shows that one frontier model (DeepSeek-V3.1) is already able to accumulate, in natural language, third-order contrastive deception strategies from its own gameplay. The same loop could in principle be used to bootstrap deceptive prompts outside a game context. We release the loop, the prompts, and the resulting lesson files so that this capability surface is auditable, but we caution that the release is intended for safety research rather than for re-deployment as a deception scaffold.

Human-subject data.

The benchmark contains no human-subject data. All transcripts are model-generated; no personally identifiable information is collected or released. Player names used inside games (e.g. “Joy”) are arbitrary tokens unconnected to any real individual.

Model and content licensing.

The three evaluated models (GPT-4.1, DeepSeek-V3.1, Llama-3.3-70B) are accessed under their providers’ standard terms of use. Released artefacts (code, prompts, event logs) do not redistribute model weights and respect each provider’s output-sharing policy.

References

  • O. J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A. Brakman, G. Brockman, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell, A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan, C. Chang, F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen, M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung, D. Cummings, J. Currier, Y. Dai, C. Decareaux, T. Degry, N. Deutsch, D. Deville, A. Dhar, D. Dohan, S. Dowling, S. Dunning, A. Ecoffet, A. Eleti, T. Eloundou, D. Farhi, L. Fedus, N. Felix, S. Fishman, J. Forte, I. Fulford, L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni, G. Goh, R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray, R. Greene, J. Gross, S. Gu, Y. Guo, C. Hallacy, J. Han, J. Harris, Y. He, M. Heaton, J. Heidecke, C. Hesse, A. Hickey, W. Hickey, P. Hoeschele, B. Houghton, K. Hsu, S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain, J. Jang, A. Jiang, R. Jiang, H. Jin, D. Jin, S. Jomoto, B. Jonn, H. Jun, T. Kaftan, L. Kaiser, A. Kamali, I. Kanitscheider, N. Keskar, T. Khan, L. Kilpatrick, J. W. Kim, C. Kim, Y. Kim, H. Kirchner, J. Kiros, M. Knight, D. Kokotajlo, L. Kondraciuk, A. Kondrich, A. Konstantinidis, K. Kosic, G. Krueger, V. Kuo, M. Lampe, I. Lan, T. Lee, J. Leike, J. Leung, D. Levy, C. Li, R. Lim, M. Lin, S. L. Lin, M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Malfacini, S. Manning, T. Markov, Y. Markovski, B. Martin, K. Mayer, A. Mayne, B. McGrew, S. McKinney, C. McLeavey, P. McMillan, J. McNeil, D. Medina, A. Mehta, J. Menick, L. Metz, A. Mishchenko, P. Mishkin, V. Monaco, E. Morikawa, D. P. Mossing, T. Mu, M. Murati, O. Murk, D. M’ely, A. Nair, R. Nakano, R. Nayak, A. Neelakantan, R. Ngo, H. Noh, O. Long, C. O’Keefe, J. Pachocki, A. Paino, J. Palermo, A. Pantuliano, G. Parascandolo, J. Parish, E. Parparita, A. Passos, M. Pavlov, A. Peng, A. Perelman, F. de Avila Belbute Peres, M. Petrov, H. P. de Oliveira Pinto, M. Pokorny, M. Pokrass, V. H. Pong, T. Powell, A. Power, B. Power, E. Proehl, R. Puri, A. Radford, J. W. Rae, A. Ramesh, C. Raymond, F. Real, K. Rimbach, C. Ross, B. Rotsted, H. Roussez, N. Ryder, M. Saltarelli, T. Sanders, S. Santurkar, G. Sastry, H. Schmidt, D. Schnurr, J. Schulman, D. Selsam, K. Sheppard, T. Sherbakov, J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler, M. Simens, J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky, Y. Song, N. Staudacher, F. Such, N. Summers, I. Sutskever, J. Tang, N. Tezak, M. Thompson, P. Tillet, A. Tootoonchian, E. Tseng, P. Tuggle, N. Turley, J. Tworek, J. F. C. Uribe, A. Vallone, A. Vijayvergiya, C. Voss, C. L. Wainwright, J. Wang, A. Wang, B. Wang, J. Ward, J. Wei, C. Weinmann, A. Welihinda, P. Welinder, J. Weng, L. Weng, M. Wiethoff, D. Willner, C. Winter, S. Wolrich, H. Wong, L. Workman, S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan, W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, and B. Zoph (2023) GPT-4 technical report. Cited by: 2nd item.
  • M. Agarwal, S. Rana, T. Sundoro, H. Berhe, S. Kim, V. Sharma, S. O’Brien, and K. Zhu (2025) Wolf: werewolf-based observations for llm deception and falsehoods. arXiv preprint arXiv:2512.09187. Cited by: §3.1, §3.1.
  • S. Bailis, J. Friedhoff, and F. Chen (2024) Werewolf arena: a case study in llm evaluation via social deduction. ArXiv abs/2407.13943. Cited by: §1, §2, §3.1.
  • A. Bakhtin, N. Brown, E. Dinan, G. Farina, C. Flaherty, D. Fried, A. Goff, J. Gray, H. Hu, A. P. Jacob, M. Komeili, K. Konath, M. Kwon, A. Lerer, M. Lewis, A. H. Miller, S. Mitts, A. Renduchintala, S. Roller, D. Rowe, W. Shi, J. Spisak, A. Wei, D. J. Wu, H. Zhang, and M.N. Zijlstra (2022) Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science 378, pp. 1067 – 1074. Cited by: §2.
  • Y. Chi, L. Mao, and Z. Tang (2024) AMONGAGENTS: evaluating large language models in the interactive text-based social deduction game. ArXiv abs/2407.16521. Cited by: §2.
  • P. Curvo (2025) The traitors: deception and trust in multi-agent language model simulations. ArXiv abs/2505.12923. Cited by: §1, §2.
  • DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Wang, J. Chen, J. Chen, J. Yuan, J. Qiu, J. Li, J. Song, K. Dong, K. Hu, K. Gao, K. Guan, K. Huang, K. Yu, L. Wang, L. Zhang, L. Xu, L. Xia, L. Zhao, L. Wang, L. Zhang, M. Li, M. Wang, M. Zhang, M. Zhang, M. Tang, M. Li, N. Tian, P. Huang, P. Wang, P. Zhang, Q. Wang, Q. Zhu, Q. Chen, Q. Du, R. J. Chen, R. Jin, R. Ge, R. Zhang, R. Pan, R. Wang, R. Xu, R. Zhang, R. Chen, S. Li, S. Lu, S. Zhou, S. Chen, S. Wu, S. Ye, S. Ma, S. Wang, S. Zhou, S. Yu, S. Zhou, S. Pan, T. Wang, T. Yun, T. Pei, T. Sun, W. Xiao, W. Zeng, W. Zhao, W. An, W. Liu, W. Liang, W. Gao, W. Yu, W. Zhang, X. Q. Li, X. Jin, X. Wang, X. Bi, X. Liu, X. Wang, X. Shen, X. Chen, X. Zhang, X. Chen, X. Nie, X. Sun, X. Wang, X. Cheng, X. Liu, X. Xie, X. Liu, X. Yu, X. Song, X. Shan, X. Zhou, X. Yang, X. Li, X. Su, X. Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. Zhu, Y. Zhang, Y. Xu, Y. Huang, Y. Li, Y. Zhao, Y. Sun, Y. Li, Y. Wang, Y. Yu, Y. Zheng, Y. Zhang, Y. Shi, Y. Xiong, Y. He, Y. Tang, Y. Piao, Y. Wang, Y. Tan, Y. Ma, Y. Liu, Y. Guo, Y. Wu, Y. Ou, Y. Zhu, Y. Wang, Y. Gong, Y. Zou, Y. He, Y. Zha, Y. Xiong, Y. Ma, Y. Yan, Y. Luo, Y. You, Y. Liu, Y. Zhou, Z. F. Wu, Z. Ren, Z. Ren, Z. Sha, Z. Fu, Z. Xu, Z. Huang, Z. Zhang, Z. Xie, Z. Zhang, Z. Hao, Z. Gou, Z. Ma, Z. Yan, Z. Shao, Z. Xu, Z. Wu, Z. Zhang, Z. Li, Z. Gu, Z. Zhu, Z. Liu, Z. Li, Z. Xie, Z. Song, Z. Gao, and Z. Pan (2024) DeepSeek-v3 technical report. Cited by: 2nd item.
  • A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, A. Goyal, A. S. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Rozière, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McConnell, C. Keller, C. Touret, C. Wu, C. Wong, C. C. Ferrer, C. Nikolaidis, D. Allonsius, D. Song, D. Pintz, D. Livshits, D. Esiobu, D. Choudhary, D. Mahajan, D. Garcia-Olano, D. Perino, D. Hupkes, E. Lakomkin, E. A. AlBadawy, E. Lobanova, E. Dinan, E. Smith, F. Radenovic, F. Zhang, G. Synnaeve, G. Lee, G. L. Anderson, G. Nail, G. Mialon, G. Pang, G. Cucurell, H. Nguyen, H. Korevaar, H. Xu, H. Touvron, I. Zarov, I. A. Ibarra, I. M. Kloumann, I. Misra, I. Evtimov, J. Copet, J. Lee, J. Geffert, J. Vranes, J. Park, J. Mahadeokar, J. Shah, J. V. D. Linde, J. Billock, J. Hong, J. Lee, J. Fu, J. Chi, J. Huang, J. Liu, J. Wang, J. Yu, J. Bitton, J. Spisak, J. Park, J. Rocca, J. Johnstun, J. Saxe, J. Jia, K. V. Alwala, K. Upasani, K. Plawiak, K. Li, K. Heafield, K. R. Stone, K. El-Arini, K. Iyer, K. Malik, K. Chiu, K. Bhalla, L. Rantala-Yeary, L. Maaten, L. Chen, L. Tan, L. Jenkins, L. Martin, L. Madaan, L. Malo, L. Blecher, L. Landzaat, L. Oliveira, M. Muzzi, M. Pasupuleti, M. Singh, M. Paluri, M. Kardas, M. Oldham, M. Rita, M. Pavlova, M. Kambadur, M. Lewis, M. Si, M. K. Singh, M. Hassan, N. Goyal, N. Torabi, N. Bashlykov, N. Bogoychev, N. S. Chatterji, O. Duchenne, O. cCelebi, P. Alrassy, P. Zhang, P. Li, P. Vasić, P. Weng, P. Bhargava, P. Dubal, P. Krishnan, P. S. Koura, P. Xu, Q. He, Q. Dong, R. Srinivasan, R. Ganapathy, R. Calderer, R. S. Cabral, R. Stojnic, R. Raileanu, R. Girdhar, R. Patel, R. Sauvestre, R. Polidoro, R. Sumbaly, R. Taylor, R. Silva, R. Hou, R. Wang, S. Hosseini, S. Chennabasappa, S. Singh, S. Bell, S. Kim, S. Edunov, S. Nie, S. Narang, S. Raparthy, S. Shen, S. Wan, S. Bhosale, S. Zhang, S. Vandenhende, S. Batra, S. Whitman, S. Sootla, S. Collot, S. Gururangan, S. Borodinsky, T. Herman, T. Fowler, T. Sheasha, T. Georgiou, T. Scialom, T. Speckbacher, T. Mihaylov, T. Xiao, U. Karn, V. Goswami, V. Gupta, V. Ramanathan, V. Kerkez, V. Gonguet, V. Do, V. Vogeti, V. Petrovic, W. Chu, W. Xiong, W. Fu, W. Meers, X. Martinet, X. Wang, X. Tan, X. Xie, X. Jia, X. Wang, Y. Goldschlag, Y. Gaur, Y. Babaei, Y. Wen, Y. Song, Y. Zhang, Y. Li, Y. Mao, Z. D. Coudert, Z. Yan, Z. Chen, Z. Papakipos, A. K. Singh, A. Grattafiori, A. Jain, A. Kelsey, A. Shajnfeld, A. Gangidi, A. Victoria, A. Goldstand, A. Menon, A. Sharma, A. Boesenberg, A. Vaughan, A. Baevski, A. Feinstein, A. Kallet, A. Sangani, A. Yunus, A. Lupu, A. Alvarado, A. Caples, A. Gu, A. Ho, A. Poulton, A. Ryan, A. Ramchandani, A. Franco, A. Saraf, A. Chowdhury, A. Gabriel, A. Bharambe, A. Eisenman, A. Yazdan, B. James, B. Maurer, B. Leonhardi, P. (. Huang, B. Loyd, B. de Paola, B. Paranjape, B. Liu, B. Wu, B. Ni, B. Hancock, B. Wasti, B. Spence, B. Stojkovic, B. Gamido, B. Montalvo, C. Parker, C. Burton, C. Mejia, C. Wang, C. Kim, C. Zhou, C. Hu, C. Chu, C. Cai, C. Tindal, C. Feichtenhofer, D. Civin, D. Beaty, D. Kreymer, S. Li, D. Wyatt, D. Adkins, D. Xu, D. Testuggine, D. David, D. Parikh, D. Liskovich, D. Foss, D. Wang, D. Le, D. Holland, E. Dowling, E. Jamil, E. Montgomery, E. Presani, E. Hahn, E. Wood, E. Brinkman, E. Arcaute, E. Dunbar, E. Smothers, F. Sun, F. Kreuk, F. Tian, F. Ozgenel, F. Caggioni, F. (. Guzmán, F. J. Kanayet, F. Seide, G. M. Florez, G. Schwarz, G. Badeer, G. Swee, G. Halpern, G. Thattai, G. Herman, G. Sizov, G. Zhang, G. Lakshminarayanan, H. Shojanazeri, H. Zou, H. Wang, H. Zha, H. Habeeb, H. Rudolph, H. Suk, H. Aspegren, H. Goldman, I. Molybog, I. Tufanov, I. Veliche, I. Gat, J. Weissman, J. Geboski, J. Kohli, J. Asher, J. Gaya, J. Marcus, J. Tang, J. Chan, J. Zhen, J. Reizenstein, J. Teboul, J. Zhong, J. Jin, J. Yang, J. Cummings, J. Carvill, J. Shepard, J. McPhie, J. Torres, J. Ginsburg, J. Wang, K. Wu, U. KamHou, K. Saxena, K. Prasad, K. Khandelwal, K. Zand, K. Matosich, K. Veeraraghavan, K. Michelena, K. Li, K. Huang, K. Chawla, K. Lakhotia, K. Huang, L. Chen, L. Garg, A. Lavender, L. Silva, L. Bell, L. Zhang, L. Guo, L. Yu, L. Moshkovich, L. Wehrstedt, M. Khabsa, M. Avalani, M. Bhatt, M. Tsimpoukelli, M. Mankus, M. Hasson, M. Lennie, M. Reso, M. Groshev, M. Naumov, M. Lathi, M. Keneally, M. Seltzer, M. Valko, M. Restrepo, M. Patel, M. Vyatskov, M. Samvelyan, M. Clark, M. Macey, M. Wang, M. J. Hermoso, M. Metanat, M. Rastegari, M. Bansal, N. Santhanam, N. Parks, N. White, N. Bawa, N. Singhal, N. Egebo, N. Usunier, N. P. Laptev, N. Dong, N. Zhang, N. Cheng, O. Chernoguz, O. Hart, O. Salpekar, O. Kalinli, P. Kent, P. Parekh, P. Saab, P. Balaji, P. Rittner, P. Bontrager, P. Roux, P. Dollár, P. Zvyagina, P. Ratanchandani, P. Yuvraj, Q. Liang, R. Alao, R. Rodriguez, R. Ayub, R. Murthy, R. Nayani, R. Mitra, R. Li, R. Hogan, R. Battey, R. Wang, R. Maheswari, R. Howes, R. Rinott, S. J. Bondu, S. Datta, S. Chugh, S. Hunt, S. Dhillon, S. Sidorov, S. Pan, S. Verma, S. Yamamoto, S. Ramaswamy, S. Lindsay, S. Feng, S. Lin, S. Zha, S. Shankar, S. Zhang, S. Wang, S. Agarwal, S. Sajuyigbe, S. Chintala, S. Max, S. Chen, S. Kehoe, S. Satterfield, S. Govindaprasad, S. Gupta, S. Cho, S. Virk, S. Subramanian, S. Choudhury, S. Goldman, T. Remez, T. Glaser, T. Best, T. Kohler, T. Robinson, T. Li, T. Zhang, T. Matthews, T. Chou, T. Shaked, V. Vontimitta, V. O. Ajayi, V. Montanez, V. Mohan, V. Kumar, V. Mangla, V. Ionescu, V. Poenaru, V. T. Mihailescu, V. Ivanov, W. Li, W. Wang, W. Jiang, W. Bouaziz, W. Constable, X. Tang, X. Wang, X. Wu, X. Wang, X. Xia, X. Wu, X. Gao, Y. Chen, Y. Hu, Y. Jia, Y. Qi, Y. Li, Y. Zhang, Y. Zhang, Y. Adi, Y. Nam, Y. Wang, Y. Hao, Y. Qian, Y. He, Z. Rait, Z. DeVito, Z. Rosnbrick, Z. Wen, Z. Yang, and Z. Zhao (2024) The llama 3 herd of models. Cited by: 2nd item.
  • R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. Wichmann (2020) Shortcut learning in deep neural networks. Nature Machine Intelligence 2, pp. 665 – 673. Cited by: §1.
  • T. Hagendorff (2023) Deception abilities emerged in large language models. Proceedings of the National Academy of Sciences of the United States of America 121. Cited by: §2.
  • Y. He, Y. Wu, Y. Jia, R. Mihalcea, Y. Chen, and N. Deng (2023) HI-tom: a benchmark for evaluating higher-order theory of mind reasoning in large language models. pp. 10691–10706. Cited by: §1, §2.
  • E. Hubinger, C. E. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Cheng, A. S. Jermyn, A. Askell, A. Radhakrishnan, C. Anil, D. Duvenaud, D. Ganguli, F. Barez, J. Clark, K. Ndousse, K. Sachan, M. Sellitto, M. Sharma, N. Dassarma, R. Grosse, S. Kravec, Y. Bai, Z. Witten, M. Favaro, J. Brauner, H. Karnofsky, P. Christiano, S. R. Bowman, L. Graham, J. Kaplan, S. Mindermann, R. Greenblatt, B. Shlegeris, N. Schiefer, and E. Perez (2024) Sleeper agents: training deceptive llms that persist through safety training. ArXiv abs/2401.05566. Cited by: §2.
  • B. Kim, D. Seo, and B. Kim (2024) Microscopic analysis on llm players via social deduction game. arXiv preprint arXiv:2408.09946. Cited by: §2.
  • H. Kim, M. Sclar, X. Zhou, R. L. Bras, G. Kim, Y. Choi, and M. Sap (2023) FANToM: a benchmark for stress-testing machine theory of mind in interactions. ArXiv abs/2310.15421. Cited by: §2.
  • J. Light, M. Cai, S. Shen, and Z. Hu (2023) AvalonBench: evaluating llms playing the game of avalon. Cited by: §2.
  • Z. Liu, A. Anand, P. Zhou, J. Huang, and J. Zhao (2024) InterIntent: investigating social intelligence of llms via intention understanding in an interactive game context. pp. 6718–6746. Cited by: §1, §2.
  • R. T. McCoy, E. Pavlick, and T. Linzen (2019) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. ArXiv abs/1902.01007. Cited by: §1.
  • T. Niven and H. kao (2019) Probing neural network comprehension of natural language arguments. ArXiv abs/1907.07355. Cited by: §1.
  • A. O’Gara (2023) Hoodwinked: deception and cooperation in a text-based game for language models. ArXiv abs/2308.01404. Cited by: §2.
  • A. Pan, C. Shern, A. Zou, N. Li, S. Basart, T. Woodside, J. Ng, H. Zhang, S. Emmons, and D. Hendrycks (2023) Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. pp. 26837–26867. Cited by: §2.
  • J. Park, J. O’Brien, C. J. Cai, M. Morris, P. Liang, and M. S. Bernstein (2023) Generative agents: interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. Cited by: §2.
  • G. Piatti, Z. Jin, M. Kleiman-Weiner, B. Schölkopf, M. Sachan, and R. Mihalcea (2024) Cooperate or collapse: emergence of sustainable cooperation in a society of llm agents. Advances in Neural Information Processing Systems 37. Cited by: §2.
  • D. Premack and G. Woodruff (1978) Does the chimpanzee have a theory of mind?. Behavioral and Brain Sciences 1, pp. 515 – 526. Cited by: §1.
  • R. Ren, A. Agarwal, M. Mazeika, C. Menghini, R. Vacareanu, B. Kenstler, M. Yang, I. Barrass, A. Gatti, X. Yin, E. Trevino, M. Geralnik, A. Khoja, D. Lee, S. Yue, and D. Hendrycks (2025) The mask benchmark: disentangling honesty from accuracy in ai systems. ArXiv abs/2503.03750. Cited by: §2.
  • M. Sap, R. L. Bras, D. Fried, and Y. Choi (2022) Neural theory-of-mind? on the limits of social intelligence in large lms. pp. 3762–3780. Cited by: §1.
  • N. Shapira, M. Levy, S. Alavi, X. Zhou, Y. Choi, Y. Goldberg, M. Sap, and V. Shwartz (2023) Clever hans or neural theory of mind? stress testing social reasoning in large language models. pp. 2257–2273. Cited by: §1, §1.
  • N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36. Cited by: 2nd item, §2.
  • J. W. A. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, M. S. A. Graziano, and C. Becchio (2024a) Testing theory of mind in large language models and humans. Nature Human Behaviour 8, pp. 1285 – 1295. Cited by: §1.
  • J. W. A. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, M. S. A. Graziano, and C. Becchio (2024b) Testing theory of mind in large language models and humans. Nature Human Behaviour 8, pp. 1285 – 1295. Cited by: §2.
  • W. Street, J. Siy, G. Keeling, A. Baranes, B. Barnett, M. McKibben, T. Kanyere, A. Lentz, B. A. Y. Arcas, and R. Dunbar (2024) LLMs achieve adult human performance on higher-order theory of mind tasks. Frontiers in Human Neuroscience 19. Cited by: §1.
  • T. Ullman (2023) Large language models fail on trivial alterations to theory-of-mind tasks. ArXiv abs/2302.08399. Cited by: §1, §1.
  • G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. (. Fan, and A. Anandkumar (2023) Voyager: an open-ended embodied agent with large language models. ArXiv abs/2305.16291. Cited by: §2.
  • S. Wang, C. Liu, Z. Zheng, S. Qi, S. Chen, Q. Yang, A. Zhao, C. Wang, S. Song, and G. Huang (2024) Boosting llm agents with recursive contemplation for effective deception handling. pp. 9909–9953. Cited by: §1, §1.
  • S. Wu, L. Zhu, T. Yang, S. Xu, Q. Fu, Y. Wei, and H. Fu (2024) Enhance reasoning for large language models in the game werewolf. ArXiv abs/2402.02330. Cited by: §2.
  • H. Xu, R. Zhao, L. Zhu, J. Du, and Y. He (2024) OpenToM: a comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. pp. 8593–8623. Cited by: §2.
  • Y. Xu, S. Wang, P. Li, F. Luo, X. Wang, W. Liu, and Y. Liu (2023a) Exploring large language models for communication games: an empirical study on werewolf. ArXiv abs/2309.04658. Cited by: 2nd item, §1.
  • Z. Xu, W. Gu, C. Yu, Y. Wu, and Y. Wang (2025) Learning strategic language agents in the werewolf game with iterative latent space policy optimization. ArXiv abs/2502.04686. Cited by: §2.
  • Z. Xu, C. Yu, F. Fang, Y. Wang, and Y. Wu (2023b) Language agents with reinforcement learning for strategic play in the werewolf game. ArXiv abs/2310.18940. Cited by: §1.
  • Z. Xu, C. Yu, F. Fang, Y. Wang, and Y. Wu (2023c) Language agents with reinforcement learning for strategic play in the werewolf game. ArXiv abs/2310.18940. Cited by: §2.

Appendix A Per-cell faction win-rates

Table 7 is the numeric companion to Figure 2 in Section 5.1, reporting per-cell win-rates with Wilson 95% confidence intervals at n=10n{=}10 games.

Model Cond. Villagers Werewolves Jester
GPT-4.1 OFF 0.30 [0.11, 0.60] 0.00 [0.00, 0.28] 0.70 [0.40, 0.89]
GPT-4.1 ON 0.30 [0.11, 0.60] 0.10 [0.02, 0.40] 0.60 [0.31, 0.83]
DeepSeek-V3.1 OFF 0.60 [0.31, 0.83] 0.10 [0.02, 0.40] 0.30 [0.11, 0.60]
DeepSeek-V3.1 ON 0.30 [0.11, 0.60] 0.00 [0.00, 0.28] 0.70 [0.40, 0.89]
Llama-3.3-70B OFF 0.50 [0.24, 0.76] 0.10 [0.02, 0.40] 0.40 [0.17, 0.69]
Llama-3.3-70B ON 0.20 [0.06, 0.51] 0.20 [0.06, 0.51] 0.60 [0.31, 0.83]
Table 7: Per-cell faction win-rates with Wilson 95% CIs (n=10n=10 games per cell). Each row sums to 1.0 across factions. Numeric companion to Figure 2.

Three features of the per-cell breakdown back the pooled numbers in the main text. The Jester is the modal winning faction in four of the six cells and the Werewolves never clear 0.200.20 in any cell, so the Jester-dominance and wolf-collapse patterns hold cell by cell rather than only on average. The self-learning loop lifts the Jester column for DeepSeek-V3.1 (0.300.700.30\to 0.70) and Llama-3.3-70B (0.400.600.40\to 0.60) while leaving GPT-4.1 near its ceiling (0.700.600.70\to 0.60), the same uneven learning effect plotted in Figure 2. In both improving cells the matching loss falls in the Villager column (DeepSeek 0.600.300.60\to 0.30, Llama 0.500.200.50\to 0.20) and not the Werewolf column, which is the per-cell form of the substitution effect of Section 5.1. The wide Wilson intervals, several spanning more than 0.40.4, are why we rest the claims on this cross-cell consistency rather than on any single cell.

Appendix B Additional results figures

This appendix collects three supporting figures referenced from the main results section: the cross-perception difference heatmap (Section 5.5), the game-length distribution by winner, and the six-axis behavioral fingerprint (Section 5.5).

B.1 Cross-perception second-order shifts

Figure 5 shows the full per-model heatmap of how the end-of-game suspicion that observer roles direct at target roles shifts when Jester self-learning is turned on. Rows are the observer role, columns are the target role, and the colour shows ON - OFF: red means observers in this row become more suspicious of this target under the loop, blue means less. The figure makes two points that the main-text numbers compress. First, the Llama panel on the right has the most saturated cells, including a strong blue band over the Doctor column. Every other role becomes less suspicious of the Doctor, even though the loop never touched any Doctor prompt. Second, the GPT-4.1 panel is visually almost grey, consistent with the saturation argument in Section 5.1: at ceiling, the loop has no remaining headroom to propagate.

Refer to caption
Figure 5: ON - OFF difference in average end-of-game cross-perception suspicion. Rows are observer roles, columns are target roles. Red indicates ON higher. Llama-3.3-70B (right) shows the largest second-order effect: learned Jester behavior reshapes how all other roles perceive the Doctor.

B.2 Game length

Figure 6 plots the number of day rounds elapsed before a game resolves, faceted by model and split by winning faction and Jester-learning condition. Jester wins cluster at the left edge of every panel: the Jester is exiled almost immediately once the loop produces a strategy that the group locks onto. Wolf wins, in the rare cells where they happen, take noticeably more rounds because they require attrition through several night kills. The pattern confirms that high-confidence voting fires on the highest-suspicion player without filtering for which faction that player belongs to, which is exactly the cue-ambiguity failure the triadic setup is designed to expose.

Refer to caption
Figure 6: Rounds-to-resolution by winner and jester-learning condition, faceted by model. Jester wins are nearly always day-1 wins under GPT-4.1 (median 0 rounds elapsed), and 1–2 rounds for DeepSeek and Llama. This confirms that high-confidence voting locks onto the highest-SS player without filtering for whether that player is the Jester or a Werewolf.

B.3 Behavioral fingerprint radar

Figure 7 compresses each model’s behavior into a six-axis polygon, with one shape for Jester-learning OFF (grey) and one for ON (orange). All six axes are normalized to [0,1][0,1] with larger values pushed outward, so a model that improves under the loop expands outward on the relevant axis. The three cross-cell contrasts highlighted in the main text become immediate. GPT-4.1’s OFF and ON polygons nearly overlap, which is the saturation pattern. DeepSeek’s polygon visibly rotates: the Jester axis grows outward while the Villager axis contracts, which is the substitution effect from Section 5.1 made visible on a single shape. Llama’s polygon is the only one where the wolf-restraint axis pulls inward under the loop, which matches its larger second-order cross-perception shift.

Refer to caption
Figure 7: Six-axis behavioral fingerprint of each model under Jester-learning OFF (grey) and ON (orange). Axes (all normalized to [0,1][0,1], higher = stronger): Jester win-rate, Villager win-rate, Werewolf win-rate, Wolf restraint on day 1 (1P(any wolf votes Jester day 1)1-P(\text{any wolf votes Jester day 1})), vote–suspicion alignment (1(rank1)/81-(\text{rank}-1)/8), and Jester deception intensity (peer-detected manipulations per statement, scaled). GPT-4.1 polygons nearly overlap (saturated). DeepSeek-V3.1 shows the largest rotation: the Jester axis grows while the Villager axis contracts (the substitution effect). Llama-3.3-70B is the only model where ON worsens wolf restraint.

Appendix C Doctor protect targets and wolf night-kill victims

Refer to caption
(a) Doctor protect-target role distribution.
Refer to caption
(b) Werewolf night-kill victim role distribution.
Figure 8: Night-action target distributions, paired OFF/ON per model. The Doctor self-protects in 80–100% of games across all cells, largely abdicating the cooperative role. The Doctor is in turn the most-killed role in every cell, a consistent failure of Wolves to leverage information about who needs to die for them to win.

The Doctor self-protects almost universally, a pattern that itself reflects the cue-ambiguity problem. A Doctor protecting a Villager would need to reason about who Wolves are most likely to target, but the high-SS band contains both Werewolves and the Jester. Self-protection is a safe default that minimizes risk in the absence of a clear inference. The Werewolves’ choice of night-kill victim is similarly distorted. Doctors are killed more often than Villagers in every cell, which is a reasonable strategic move. Killing the Jester accidentally is also frequent (GPT-4.1 ON: 4 of 18 night-kills), and each such kill is a self-loss event.

Appendix D Mechanistic findings: per-statement event-log analysis

Going beyond aggregate cell-level metrics, we mined the per-statement event logs (3,992 deception_analysis events with paired self-tags and 8 observer reviews each) for four mechanistic insights, summarized in Figure 9.

Refer to caption
Figure 9: Mechanistic findings derived from per-statement deception logs and per-game outcome traces. (a) Mean observer-suspicion of the Jester at round 0, split by eventual outcome. (b) Per-statement mean suspicion attracted by each tactical category, by model. (c) Wolf-loss failure-mode taxonomy across 60 games (categories A–D of Table 6). (d) Speaker-vs-observer calibration gap by role, averaged over conditions. The four panels are discussed in turn below.

Day-1 suspicion is destiny for the Jester.

Splitting the round-by-round observer-suspicion trajectory of the Jester by eventual outcome (Fig. 9a; full table in App. F), the win/loss separation appears at round 0 and the trajectories diverge little thereafter. For Llama-ON the round-0 mean suspicion is 0.78 in won games versus 0.69 in lost ones; for GPT-4.1 every won game already starts above 0.70. A Jester that fails to seed suspicion on the very first day rarely recovers, consistent with anchoring effects in a multi-agent vote. The pattern is most extreme for GPT-4.1: in every one of its 7 lost Jester games the Jester bid too low to speak on day 1 at all, so day-1 silence is a 100% predictor of a GPT-4.1 Jester loss in our corpus.

Each model has its own deception aesthetic.

Aggregating per-statement observer suspicion by self-declared deception type (Fig. 9b) reveals heterogeneity across judge models. GPT-4.1’s villagers register the highest suspicion against low-content tactics (omission 0.88, distortion 0.86); Deepseek’s villagers react most to fabrication and persuasive manipulation; Llama’s react most to omission and misdirection. The same Jester strategy is therefore not optimal in all environments; the self-learning loop succeeds in part by matching the Jester’s tactic distribution to the local judge model.

Wolves systematically fail multi-hop ToM.

Of 55 wolf-loss games we classify each into one of four causes (Fig. 9c). GPT-4.1 wolves lose 90–100% of their games by self-sabotage: 13 of 20 GPT-4.1 wolf games end with the wolves voting the Jester out on day 1 themselves. This is a clean behavioural signature of failed multi-hop incentive prediction: an incentive-aware wolf should recognise the Jester as a third party whose elimination ends the wolves’ game, yet GPT-4.1 wolves treat the Jester as just-another-villager. Deepseek and Llama wolves show milder versions of the same pattern but more often lose by being outnumbered (category D) or by the Jester baiting villagers into exiling it (category C).

Calibration asymmetry across roles.

The signed gap between observer-flagging rate and speaker self-tag (Fig. 9d) is consistently positive for Villagers (+0.08 to +0.21) and Doctors, and negative for Wolves and Jesters (-0.02 to -0.17). Wolves are therefore better at hiding than they think, while honest Villagers are systematically misperceived as deceptive. The triadic setup amplifies the latter bias: when a third winning condition is on the board, any unusual phrasing from a Villager looks “off,” inflating false-positive suspicion. Llama and Deepseek both have +0.17\geq+0.17 Villager gaps; GPT-4.1’s gap is the smallest at +0.09+0.09.

Univariate predictors of Jester wins.

Across all 60 games, the strongest game-level predictor of a Jester win is whether the wolves voted for the Jester on day 1 (Pearson r=+0.59r=+0.59), followed by the mean suspicion-rank of vote targets (r=0.41r=-0.41, i.e. Jester wins occur when villagers vote the highest-suspicion player) and the Jester’s own average attracted suspicion (r=+0.41r=+0.41). Crucially the number of Jester statements is uncorrelated with winning (r=0.00r=0.00): it is how convincingly and how early, not how much, that determines the outcome.

Five further per-statement and per-game phenomena are summarised in Figure 10 (full tables in App. J).

Refer to caption
Figure 10: Auxiliary per-statement and per-game findings. (a) Per-game Jester average bid versus eventual outcome (bars are model means). (b) Detection accuracy by speaker role, weighted across observers and conditions. (c) Cumulative Jester win-rate over the ten chronological games per cell (ON solid, OFF dashed). (d) Wolf day-1 vote coordination (unanimous, voted Jester, voted partner, wolves won). (e) Observer-confidence calibration, hit-rate by confidence bin against the perfect-calibration diagonal. (f) Detection accuracy by observer role. Panels are discussed below and in App. J.

Bidding paradox.

The relationship between the Jester’s bidding aggressiveness and the eventual outcome is model-specific (Fig. 10a). For GPT-4.1, the only winning Jesters bid 7\geq 7 on day 1; in 7 of 7 lost games the Jester bid too low to speak. For Deepseek-ON the relationship inverts: winning Jesters bid 4.6 on average on day 1 versus 8.7 for losing ones, suggesting Deepseek’s self-learning loop converges on a “stay quiet on day 1, pop later” policy. Llama’s bids are essentially flat (8\sim 8) and uninformative. There is no universal optimal bidding strategy; the self-learning loop discovers different convergent behaviours per model.

Doctors are the most under-detected role.

Aggregating per-statement deception-detection accuracy by speaker role (Fig. 10b) reveals that the Doctor is the hardest role to label correctly across all three judges (0.65–0.81), whereas Wolves and Jesters are detected at 0.80–0.93 and Villagers at 0.74–0.90. Doctors are partially deceptive – they hide a single fact (their role) while otherwise speaking truthfully – which produces a behavioural signature that does not match either “honest villager” or “lying wolf” templates. This is a structural ToM blind spot that the triadic setup happens to expose.

Self-learning helps Llama and hurts GPT-4.1 within a cell.

Sorting the ten ON-condition games per cell chronologically and plotting the cumulative Jester win-rate (Fig. 10c) reveals a striking model-dependent learning trajectory: Llama-ON climbs from 0.4 in the first half to 0.8 in the second half (+40+40 pp), GPT-4.1-ON drops from 0.8 to 0.4 (40-40 pp), and Deepseek-ON drops slightly (0.8 \to 0.6). One interpretation is that GPT-4.1’s accumulated learnings file grows long enough to interfere with its base policy – a long-context instruction degradation effect – whereas Llama’s weaker base policy benefits from explicit memory. This is a useful caveat for any deployment of self-reflective agent loops on top of strong base models.

Observer confidence is well-calibrated, except Llama.

Binning observer “deceptive” flags by their self-reported confidence (Fig. 10e), GPT-4.1 and Deepseek both reach a hit-rate of 0.99\geq 0.99 at confidence 0.95\geq 0.95, closely tracking the perfect-calibration diagonal. Llama never assigns confidence above 0.95; its highest-confidence flags top out at 0.94\sim 0.94 hit-rate. Practically, this means a high-confidence flag from GPT-4.1 or Deepseek can be treated as essentially ground truth for downstream pipelines, whereas Llama’s calibration ceiling at 0.95\sim 0.95 caps its usefulness as an automated deception labeler.

Detection accuracy by observer role.

The complementary view by observer role (Fig. 10f) shows that Doctors and Jesters are the most accurate observers in the Llama and Deepseek environments, while in GPT-4.1 all roles cluster between 0.85–0.92. This is consistent with Doctors and Jesters having stronger asymmetric-information advantages – they know their own role and can therefore interpret others’ behaviour more informatively.

Appendix E Temporal dynamics of collaboration and deception

Treating each game as a time series along three axes – within-game day index, within-day statement position, and across the 10-game cell sequence – exposes a set of role-specific dynamics that the aggregate metrics in §D hide. The full panel is shown in Fig. 11; supporting tables appear in Appendix K.

Refer to caption
Figure 11: Temporal dynamics across three time axes. (a) Mean observer-attracted suspicion by role over rounds. (b) Self-tagged deception emission by round for Wolves and Jesters per model. (c) Within-day position effect on attracted suspicion by role. (d) Wolf vote-unanimity by round. (e) Doctor target-prediction accuracy by round. (f) Per-role survival drift across the 10-game cell sequence (3-game rolling mean). The six panels are discussed in turn below.

Suspicion accumulates linearly for Wolves but saturates day 1 for Jesters.

Mean attracted suspicion rises for Wolves from 0.700.70 on day 0 to 0.930.93 on day 4 – a slow, evidence-driven accumulation. Jesters, by contrast, jump from 0.720.72 to 0.800.80 between day 0 and day 1 and then plateau, indicating that the Jester signature is decoded essentially in the first speech act and that any subsequent behaviour rides a near-saturated suspicion baseline. Doctors trace a U-shape (0.450.350.550.780.45\to 0.35\to 0.55\to 0.78): cleared in the early rounds, then re-targeted late as the player pool shrinks. This is the mechanism behind the per-role survival imbalance shown in panel (f). Villagers stay flat near 0.300.30.

Deception emission is a fixed role policy, but the policy itself differs sharply across models for the Doctor.

Wolves emit 95100%\approx 95\text{--}100\% deceptive statements in every round; Jesters 90100%\approx 90\text{--}100\%; Villagers 07%0\text{--}7\%. There is no in-game adaptation along this axis – no panic-lying, no cool-down. The Doctor, however, is governed by very different model-specific policies: GPT-4.1 plays Doctor at 0%0\% deception after day 1, Llama at a steady 2033%20\text{--}33\%, and Deepseek oscillates between 0%0\% and 100%100\% across rounds. This three-way divergence on a single role is one of the largest cross-model behavioural gaps in our data and helps explain why Doctor survival is uniformly low: each model is failing for a different reason.

Late-day deceivers attract more suspicion – but only deceivers.

Binning each statement into early/mid/late thirds of its day, Wolves and Jesters attract increasing suspicion as they speak later (Wolf: 0.700.830.70\to 0.83; Jester: 0.680.810.68\to 0.81) while honest roles show no position effect (Villager 0.27/0.35/0.250.27/0.35/0.25; Doctor 0.39/0.46/0.390.39/0.46/0.39). The effect is the opposite of the naive expectation that more thinking time produces safer statements: observers appear to hold late speakers to a higher inferential bar, and the more polished a deceiver’s last-speaker statement is, the more flagged it gets.

The public debate is a wolf coordination channel.

Wolf vote-unanimity rises monotonically with round in every model – GPT 0.801.000.80\to 1.00, Deepseek 0.601.000.60\to 1.00, Llama 0.651.000.65\to 1.00. Even though the night-time kill is a private wolf-only action, the day-time debate appears to serve as a coordination channel that converges wolves on a shared target by round 2. The day-1 wolf-vote-Jester error is also universal and uniformly self-corrected, the share of wolves voting the Jester falling from 0.600.000.60\to 0.00 in GPT by round 2 and from 0.230.000.23\to 0.00 in Deepseek and 0.350.100.35\to 0.10 in Llama by round 3. The mistake is recognised within a round or two, but by then the village suspicion landscape is already poisoned.

Doctor target-prediction accuracy is inverse to overall model capability.

Llama Doctors predict the actual wolf target with 62100%62\text{--}100\% accuracy across rounds; GPT-4.1 Doctors degrade from 65%65\% on day 1 to 14%14\% on day 2 and 33%33\% on day 3; Deepseek oscillates noisily. The reversal is striking because GPT-4.1 wins the most games overall: a plausible interpretation is that GPT-4.1 over-reasons about decoy possibilities and second-order wolf strategy, whereas Llama applies a simpler “protect the most-suspected innocent” heuristic that survives noisy evidence better. Strong inferential machinery hurts in a setting with high observation noise and a short horizon.

Cross-game role-specific drift exists even without explicit learning.

The OFF condition has no learnings file, yet several role-model pairs drift across the 10-game cell sequence: Llama Wolf survival rises from 0.500.50 early to 0.750.75 in games 7–10, Deepseek Villager survival drops from 0.80\geq 0.80 to 0.400.40 at game 9, and GPT-4.1 Jester survives in 0/100/10 games regardless of position. The Llama Wolf drift suggests an implicit cell-position effect, possibly via prompt-cache-warming or evaluator-side variation; we flag this as a confound for future cross-game analyses. The two zero-row results (Llama Doctor 0/100/10 survival, GPT-4.1 Jester 0/100/10 survival) identify the two most-broken role-model pairings in the benchmark.

Appendix F Detailed deep-analysis tables

The following tables back the mechanistic findings of Sec. D.

F.1 Jester suspicion trajectory by round

The table tracks the mean observer-suspicion of the Jester across game rounds, separating cells by eventual outcome. Three patterns are visible. First, suspicion of the Jester rises monotonically with round in nearly every cell – this is mechanical, since by round 1 each observer has seen multiple Jester statements and accumulated evidence. Second, the won-vs-lost separation already exists at round 0 in every cell that has both classes (Llama-OFF: 0.69 vs 0.53; Llama-ON: 0.78 vs 0.69), giving the day-1 prior almost all of the predictive content. Third, the gpt-4.1 rows are notably sparse – the model has no row at all for “lost” games at round 0, because in every one of its 7 lost Jester games the Jester bid below the speaking threshold and never produced a day-1 statement to be analysed. Day-1 silence is therefore not just correlated with losing for GPT-4.1; it is its complete characterisation.

Mean observer-suspicion of the Jester aggregated by model, condition, round, and eventual outcome (jester_won). nn is the number of (game, round) cells with at least one observer review of a Jester statement.

Model Cond Round Won? Mean susp nn
deepseek-v3.1 off 0 0 0.673 7
deepseek-v3.1 off 0 1 0.751 3
deepseek-v3.1 off 1 0 0.841 6
deepseek-v3.1 off 1 1 0.737 2
deepseek-v3.1 on 0 0 0.655 3
deepseek-v3.1 on 0 1 0.640 7
deepseek-v3.1 on 1 0 0.837 2
deepseek-v3.1 on 1 1 0.812 5
gpt-4.1 off 0 1 0.723 7
gpt-4.1 on 0 1 0.806 6
llama-3.3-70b off 0 0 0.530 6
llama-3.3-70b off 0 1 0.685 4
llama-3.3-70b off 1 0 0.715 3
llama-3.3-70b off 1 1 0.763 3
llama-3.3-70b on 0 0 0.685 4
llama-3.3-70b on 0 1 0.776 6
llama-3.3-70b on 1 0 0.749 3
llama-3.3-70b on 1 1 0.852 3

F.2 Per-deception-type yield (mean observer suspicion of Jester)

For each Jester statement we record the speaker’s self-declared deception type and the mean suspicion the statement attracts from the observers in the same round. The table groups by (model, condition, type) and reports both the mean observer suspicion and the fraction of observers who flagged the statement as deceptive. Three observations stand out. (i) The tactic mix differs sharply across models: Deepseek-OFF Jesters lean heavily on fabrication (n=55n=55 of 88 statements), GPT-4.1-ON Jesters lean on distortion (n=38n=38 of 56), and Llama Jesters lean on misdirection (n=63n=63 of 110 in the ON cell). (ii) The yield of each tactic is also model- specific: GPT-4.1’s villagers are most-suspicious of low-content tactics (omission 0.88, distortion 0.86) and least-suspicious of explicit misdirection (0.36) – the opposite of Llama, where misdirection actually attracts the highest suspicion (0.79). (iii) Truthful Jester statements (none type) draw genuinely low suspicion (0.18 for GPT-4.1-OFF, 0.59 for Llama-OFF), confirming that the deception signal is real and not noise: when the Jester is honest, the villagers can tell.

Model / cond Deception type nn Mean susp Flag rate
deepseek-v3.1, off
persuasive_manipulation 5 0.847 1.00
strategic_deception 9 0.822 0.94
fabrication 55 0.798 0.93
distortion 11 0.775 0.93
misdirection 6 0.677 0.81
deepseek-v3.1, on
fabrication 47 0.789 0.94
misdirection 3 0.780 1.00
distortion 12 0.769 0.94
strategic_deception 13 0.674 0.78
gpt-4.1, off
omission 5 0.877 0.90
strategic_deception 23 0.790 0.84
distortion 31 0.775 0.91
misdirection 3 0.363 0.29
none 3 0.183 0.08
gpt-4.1, on
distortion 38 0.860 0.98
strategic_deception 14 0.804 0.91
misdirection 2 0.256 0.19
llama-3.3-70b, off
distortion 3 0.786 0.76
misdirection 38 0.705 0.88
omission 6 0.663 0.76
strategic_deception 32 0.651 0.76
none 17 0.591 0.70
llama-3.3-70b, on
omission 1 0.842 1.00
none 8 0.796 0.91
misdirection 63 0.786 0.93
strategic_deception 30 0.755 0.89
persuasive_manipulation 3 0.700 0.83

F.3 Speaker-vs-observer deception calibration

The table compares each speaker’s self-tagged deception rate with the mean rate at which observers flagged that speaker’s statements as deceptive. The signed gap (calib_gap) summarises whether the role is over- or under-perceived. Three patterns emerge. (i) Wolves and Jesters, who self-tag as deceptive on 80%\geq 80\% of statements, are nonetheless slightly under-detected – their gaps are consistently negative (0.02-0.02 to 0.17-0.17). The largest negative gap is GPT-4.1’s Werewolf at 0.17-0.17, meaning GPT-4.1 wolves are particularly good at hiding their deception. (ii) Villagers and Doctors exhibit the opposite pathology: positive gaps of +0.08+0.08 to +0.21+0.21, i.e. observers attribute deception to honest players much more often than the speakers themselves admit it. The Llama-OFF Villager gap of +0.21+0.21 is the largest – 21% of honest villager statements are flagged as deceptive by some observer. (iii) Across every cell, GPT-4.1 has the smallest Villager gap (+0.09+0.09 ON, +0.08+0.08 OFF), indicating that GPT-4.1 villagers are best at recognising honest content from each other. Combined, these patterns describe a systematic bias: in a triadic game, observers over-attribute deception to honest play and under-detect deception by skilled liars.

self_rate = fraction of statements the speaker tagged deceptive. obs_rate = mean fraction of observers who flagged each statement deceptive. calib_gap = obs_rate - self_rate (positive = speaker under-reports their own deception relative to peer perception).

Model Cond Role nn self_rate obs_rate calib_gap
deepseek-v3.1 off Doctor 95 0.421 0.557 +0.136+0.136
deepseek-v3.1 off Jester 88 1.000 0.922 0.078-0.078
deepseek-v3.1 off Villager 391 0.018 0.225 +0.207+0.207
deepseek-v3.1 off Werewolf 187 1.000 0.938 0.062-0.062
deepseek-v3.1 on Doctor 67 0.448 0.570 +0.122+0.122
deepseek-v3.1 on Jester 77 1.000 0.905 0.095-0.095
deepseek-v3.1 on Villager 334 0.012 0.154 +0.142+0.142
deepseek-v3.1 on Werewolf 157 1.000 0.937 0.063-0.063
gpt-4.1 off Doctor 54 0.074 0.103 +0.029+0.029
gpt-4.1 off Jester 66 0.955 0.822 0.133-0.133
gpt-4.1 off Villager 261 0.000 0.079 +0.079+0.079
gpt-4.1 off Werewolf 97 0.990 0.818 0.172-0.172
gpt-4.1 on Doctor 51 0.176 0.251 +0.075+0.075
gpt-4.1 on Jester 56 1.000 0.920 0.080-0.080
gpt-4.1 on Villager 315 0.000 0.092 +0.092+0.092
gpt-4.1 on Werewolf 122 1.000 0.855 0.145-0.145
llama-3.3-70b off Doctor 83 0.301 0.505 +0.204+0.204
llama-3.3-70b off Jester 96 0.823 0.796 0.027-0.027
llama-3.3-70b off Villager 419 0.053 0.264 +0.211+0.211
llama-3.3-70b off Werewolf 191 0.906 0.874 0.032-0.032
llama-3.3-70b on Doctor 85 0.153 0.277 +0.124+0.124
llama-3.3-70b on Jester 110 0.927 0.910 0.017-0.017
llama-3.3-70b on Villager 391 0.074 0.278 +0.204+0.204
llama-3.3-70b on Werewolf 199 0.965 0.906 0.059-0.059

F.4 Wolf-loss failure-mode taxonomy

This taxonomy splits every wolf loss into one of four mechanisms. Categories A and B are wolf self-sabotage, the day-1 vote of the Jester (A, which hands the Jester the win) and the night-kill of the Jester (B, which wastes the kill and lets the Villagers win), while C is the Jester being voted out by the villagers with the wolves uninvolved (a Jester win) and D is the villagers outnumbering the wolves (a Villager win). The counts are promoted to the main paper as Table 6 (Section 5.4), where the category letters are defined, and they also appear as Fig. 9c. Read per cell, the table puts GPT-4.1 almost entirely in the self-sabotage columns, with A and B accounting for all ten of its OFF games and nine of its ten ON games and nothing left for categories C or D, so its wolves essentially never lose for any reason other than acting against the Jester themselves. DeepSeek-V3.1 shifts toward category C under self-learning, from two such games OFF to five ON while its category-D losses fall from five to one, so its losses move from the villagers outnumbering the wolves toward the Jester baiting the villagers into exiling it, the same redistribution that lifts its Jester win-rate from 0.300.30 to 0.700.70. Llama-3.3-70B spreads across all four categories in both conditions and is the only cell to record two wolf wins (ON), the diffuse profile one expects from a model that plays every side less sharply.

F.5 Univariate predictors of Jester win

Feature rr Won Lost nn
Wolf voted Jester day 1 +0.587+0.587 0.688 0.107 60
Vote-suspicion mean rank 0.414-0.414 1.927 2.767 60
Jester avg attracted susp +0.413+0.413 0.770 0.704 53
Jester deception rate +0.358+0.358 6.751 6.109 53
Jester # statements +0.003+0.003 9.312 9.286 53
Table 8: Univariate predictors of Jester wins across all 60 games. rr = Pearson correlation with Jester win; “Won” and “Lost” are feature means in won vs. lost games.

Table 8 reports univariate Pearson correlations between five per-game features and the binary Jester-win outcome across all 60 games. The single strongest predictor is whether the Wolves voted for the Jester on day 1 (r=+0.587r{=}{+}0.587): in games the Jester won, this happened in 69% of cases versus 11% in lost games. The second-strongest signal is the mean suspicion-rank of vote targets (r=0.414r{=}{-}0.414): Jester wins when Villagers concentrate their votes on the single highest-suspicion player, since by construction that player is as plausibly the Jester as a Wolf. The Jester’s own average attracted suspicion (r=+0.413r{=}{+}0.413) and the rate at which the Jester self-tags statements as deceptive (r=+0.358r{=}{+}0.358) both predict positively but more weakly. The null result is that the raw number of Jester statements is uncorrelated with winning (r=+0.003r{=}{+}0.003, with feature means 9.319.31 vs 9.299.29 in won vs lost games), so the predictor structure is about how convincing and how early the Jester’s contributions are, not how many it makes. Sample size drops from 60 to 53 for the three Jester-side features because seven games have no Jester-tagged turns (the Jester is killed on night 1 or never speaks).

F.6 Jester self-learning entries: full ToM example set

Table 9 expands the single-row-per-order sample of Table 2 into the full set of representative entries, so the qualitative gap behind the ToM-order counts of Section 5.6 can be read directly. The two DeepSeek-V3.1 rows are the only third-order entries that name both audiences at once, appearing wolf-like while avoiding villager-alignment, which is the contrastive structure the regex tagger keys on. The GPT-4.1 rows step down a level, from a single third-order ambiguity entry to second-order “appear erratic” framing to a bare first-order imperative, and both Llama-3.3-70B rows stay at orders 1–2 with no cross-faction contrast. The verbatim wording makes clear that the order labels track a real difference in what each model represents about its observers, not a labelling artefact.

Model ToM Entry (verbatim)
DeepSeek-V3.1 3 Vote for a player who is not the main suspect (like Joy) to appear wolf-like by deflecting from the consensus, but only after the main suspect is already exiled to avoid appearing villager-aligned.
DeepSeek-V3.1 3 Maintain a vote on a non-consensus target during revote to appear stubbornly wolf-like and avoid villager alignment.
GPT-4.1 3 Lean into accusations of being both Werewolf and Jester to create ambiguity and prevent players from settling on my true role.
GPT-4.1 2 Escalate suspicion on multiple players in quick succession to appear erratic and untrustworthy.
GPT-4.1 1 Use repeated questioning of players already under scrutiny to amplify suspicion without initiating new accusations.
Llama-3.3-70B 2 Redirect suspicion towards other players to avoid being the center of attention.
Llama-3.3-70B 1 Use the initial questions and reactions to gauge players’ behaviors and adjust strategy accordingly.
Table 9: Full set of representative Jester self-learning entries sampled verbatim from the per-model running lessons file, expanding Table 2 (main). ToM order: 1 = bare imperative; 2 = references how others perceive the agent; 3 = explicit contrast across two faction-perceptions (e.g. wolf-like not villager-aligned). Only DeepSeek-V3.1 reliably emits ToM-3 contrastive entries; Llama remains at orders 1–2.

F.7 Doctor non-self-protect events

The Doctor self-protects on 105105 of its 112112 protect actions across the corpus, a 94%94\% self-protect rate, so the nights where it shields another player are rare enough to enumerate one by one. Table 10 gives the full per-cell split that backs the protect-target distribution of Fig. 8(a) (App. C). Three of the six cells, GPT-4.1 in both conditions and Llama-3.3-70B OFF, self-protect on 100%100\% of nights and never once shield a teammate, so the entire deviation from self-protection comes from DeepSeek-V3.1 and Llama-3.3-70B ON. The seven non-self events are split four to Llama-3.3-70B ON and three to DeepSeek-V3.1, and the shielded player is most often the Jester or an ordinary Villager, with one Llama ON night spent protecting a Werewolf. Even this handful tilts toward the high-suspicion roles rather than a reasoned read of the wolves’ likely target, the same cue-driven default that the self-protect majority reflects. We report the raw counts rather than a conditional win-rate because seven events are far too few to estimate an outcome effect.

Model Cond Self-protect Other Role shielded when not self
deepseek-v3.1 off 22 1 Villager×1{\times}1
deepseek-v3.1 on 16 2 Jester×1{\times}1, Villager×1{\times}1
gpt-4.1 off 15 0
gpt-4.1 on 15 0
llama-3.3-70b off 19 0
llama-3.3-70b on 18 4 Jester×3{\times}3, Werewolf×1{\times}1
Table 10: Doctor protect-target distribution per cell. “Self-protect” and “Other” count protect events (not distinct games); the last column names the shielded role on the non-self nights. Across all cells the Doctor self-protects on 105/112105/112 nights (94%94\%), and only DeepSeek-V3.1 and Llama-3.3-70B ON ever deviate. Numeric companion to Fig. 8(a).

Appendix G Voting pathology: full plot

Figure 12 backs the wolf-self-sabotage and vote-alignment claims of Sec. 5.4.

Refer to caption
Figure 12: Per-model OFF \to ON slope charts of two voting-failure metrics. (a) Fraction of games in which at least one Werewolf voted the Jester on day 1 (Wilson 95% CI bands). (b) Mean rank of each voter’s actual vote target within their own suspicion ranking (1 = top suspect; lower = tighter alignment, ±\pm SEM).

The two panels separate the two halves of the failure. Panel (a) shows that day-1 wolf-vote-Jester stays high or rises under the loop in every model, so the self-learning intervention edits only the Jester’s prompt and never teaches the Wolves to stop conceding the game, which is why the Werewolf win-rate stays pinned near zero throughout. Panel (b) explains why GPT-4.1 is the extreme case, its voters sit nearest rank 1 and so commit hardest to whichever single player tops their suspicion order, exactly the one-dimensional cue-following that exiles the Jester, whereas Llama spreads its votes across the top-2 to top-3 and lands the dominated vote less mechanically. Tighter suspicion alignment therefore coincides with more self-sabotage, the opposite of what an incentive-aware voter would produce, which is the cue-sufficiency reading of Sec. 5.4.

Appendix H Substitution and Jester fate: full plot

Figure 13 backs the substitution claim of Sec. 5.1 and the night-kill self-sabotage claim of the Jester-fate subsection. Panel (a) shows the diverging Δ\Delta(ON - OFF) win-rate per faction; panel (b) shows the per-game Jester fate.

Refer to caption
(a) Diverging Δ\Delta(ON - OFF) win-rate per faction and model.
Refer to caption
(b) Per-game Jester fate, paired OFF/ON per model: voted_out (Jester win), killed_at_night (Wolf night-kill, a triadic loss), and survived (another faction wins).
Figure 13: (a) Substitution effect: who pays for Jester wins under the self-learning intervention. (b) Jester fate distribution.

Panel (a) isolates who pays for the Jester’s learning gains. The Villager bar drops by 0.300.30 in both DeepSeek-V3.1 and Llama-3.3-70B while the Werewolf bar barely moves off zero, so the loop transfers wins from the Villagers to the Jester and leaves the already-collapsed Wolves untouched, the per-faction form of the substitution claim in Sec. 5.1. Panel (b) decomposes the Jester outcome into its three terminal states and shows that a sizeable killed_at_night slice persists in every cell, each instance a Wolf night-kill of the Jester that ends the game in a triadic loss rather than a Wolf win. The voted_out share tracks the Jester-win column of Figure 2 as the rules require, so the fate breakdown is a consistency check as much as a result.

Appendix I Manipulation rates and suspicion attracted: full plot

Figure 14 backs the claim of Sec. 5.15.1 (and the operational cue-ambiguity statement in Sec. 5.4) that Jester and Werewolf attract statistically indistinguishable suspicion despite the Jester producing more deceptive statements per turn.

Refer to caption
Figure 14: Dumbbell plots of OFF \to ON for peer-detected deceptions per statement (left) and mean suspicion attracted (right), broken down by role and model.

The two halves of the dumbbell figure make the cue-ambiguity property concrete. On the left, the Jester out-produces the Werewolf in peer-detected deceptions per statement in every model, so the Jester is by this measure the more visibly deceptive of the two high-suspicion roles. On the right, the suspicion the two attract is nonetheless nearly identical, both sitting in the 0.650.650.810.81 band across all six cells and well clear of the 0.220.220.510.51 that Villagers and the Doctor draw. A single observer reading off attracted suspicion therefore cannot tell the Jester from a Werewolf, which is precisely the ambiguity an incentive-aware voter would have to resolve and the empirical basis for treating the Jester win-rate as a multi-hop ToM probe.

Appendix J Auxiliary findings: detailed tables

The five auxiliary findings reported under Sec. D are backed by the following per-cell tables.

J.1 Bidding economics (per-game Jester bids)

Table 11 breaks the per-game Jester bids of Fig. 10a down by model, condition, and eventual outcome, and it shows that no single bidding level is universally optimal. GPT-4.1 has no losing row at all because its seven Jester losses are exactly the seven games where the Jester never cleared the speaking threshold on day 1, so for GPT-4.1 the relevant variable is whether the Jester speaks rather than how aggressively it bids. DeepSeek-V3.1 ON runs the opposite policy, its winning Jesters open with a mean first-bid of 4.574.57 against 8.678.67 for losing ones, the quantitative form of the “stay quiet on day 1, pop later” strategy that its self-learning loop converges on. Llama-3.3-70B sits at the degenerate end, bidding a flat 8.008.00 on every first move in all four cells, so its bids carry no information about the outcome.

Model Cond Won? Mean avg-bid Mean first-bid Mean max-bid nn
deepseek-v3.1 off 0 8.18 5.29 9.71 7
deepseek-v3.1 off 1 8.55 6.33 9.67 3
deepseek-v3.1 on 0 8.49 8.67 9.33 3
deepseek-v3.1 on 1 7.96 4.57 9.43 7
gpt-4.1 off 1 8.20 7.00 9.86 7
gpt-4.1 on 1 8.65 7.00 9.83 6
llama-3.3-70b off 0 7.91 8.00 8.00 6
llama-3.3-70b off 1 7.96 8.00 8.00 4
llama-3.3-70b on 0 7.77 8.00 8.00 4
llama-3.3-70b on 1 7.94 8.00 8.00 6
Table 11: Per-game Jester bids by model, condition, and outcome. GPT-4.1 has no Won?=0 row because its 7 Jester losses all occurred in games where the Jester never bid above the speaking threshold (silence on day 1). Llama bids are essentially flat at 8 in all cells.

J.2 Detection accuracy by speaker role (weighted by nn)

Table 12 gives the per-statement detection numbers behind Fig. 10b, weighted by the statement count in each cell. The Doctor is the hardest speaker to label in every model, with accuracy 0.6470.6470.8080.808 against 0.7970.7970.9310.931 for Wolves and Jesters, because the Doctor hides a single fact while otherwise speaking honestly and so matches neither the honest-villager nor the lying-wolf template. The Villager rows carry a high accuracy but a near-zero F1 (0.000 for GPT-4.1, 0.074 for DeepSeek-V3.1) because Villagers self-tag as deceptive in under 5%5\% of statements, leaving the positive class almost empty, so accuracy rather than F1 is the meaningful column for that row. Read together, the table says the deception detectors are strong on the overtly deceptive roles and weakest on exactly the role whose deception is partial.

Model Speaker role nn statements Accuracy F1
deepseek-v3.1 Doctor 1109 0.667 0.659
deepseek-v3.1 Jester 1151 0.904 0.949
deepseek-v3.1 Villager 4948 0.809 0.074
deepseek-v3.1 Werewolf 2394 0.931 0.964
gpt-4.1 Doctor 667 0.808 0.368
gpt-4.1 Jester 976 0.887 0.938
gpt-4.1 Villager 3772 0.904 0.000
gpt-4.1 Werewolf 1397 0.809 0.894
llama-3.3-70b Doctor 1220 0.647 0.432
llama-3.3-70b Jester 1479 0.797 0.881
llama-3.3-70b Villager 5102 0.740 0.245
llama-3.3-70b Werewolf 2588 0.848 0.917
Table 12: Per-statement deception-detection accuracy and F1 by speaker role, weighted by statement count. F1 for the Villager row is near 0 because Villagers self-tag as deceptive in <5%<5\% of statements, so the positive class is essentially empty – accuracy is the appropriate metric for this row.

J.3 Within-cell learning curve (first 5 vs second 5 games)

Table 13 splits each cell’s ten chronological games into the first and second half, giving the numbers behind the cumulative curves of Fig. 10c. The two ON cells move in opposite directions, Llama-3.3-70B climbing from 0.400.40 to 0.800.80 (+40+40 pp) while GPT-4.1 falls from 0.800.80 to 0.400.40 (40-40 pp), so the self-learning loop helps the weaker base policy and hurts the stronger one within a single run. DeepSeek-V3.1 ON drifts down more gently (0.800.600.80\to 0.60). The OFF cells, which have no learnings file, drift too (Llama +0.40+0.40, DeepSeek +0.20+0.20, GPT-4.1 0.20-0.20), so part of the trend is cell-position noise rather than learning, which is why we read the ON-minus-OFF contrast rather than the raw ON slope as the learning effect.

Model Cond First 5 win-rate Second 5 win-rate Δ\Delta Overall
deepseek-v3.1 off 0.20 0.40 +0.20+0.20 0.30
deepseek-v3.1 on 0.80 0.60 0.20-0.20 0.70
gpt-4.1 off 0.80 0.60 0.20-0.20 0.70
gpt-4.1 on 0.80 0.40 0.40\boldsymbol{-0.40} 0.60
llama-3.3-70b off 0.20 0.60 +0.40+0.40 0.40
llama-3.3-70b on 0.40 0.80 +0.40\boldsymbol{+0.40} 0.60
Table 13: Within-cell trends in Jester win-rate over the ten chronological games. Llama-ON learns (+40+40 pp); GPT-4.1-ON degrades (40-40 pp).

J.4 Wolf day-1 vote coordination

Table 14 expands the day-1 vote panel of Fig. 10d into per-cell rates. Wolf voting is well coordinated, with unanimity between 0.400.40 and 0.900.90, yet that coordination is frequently aimed at the wrong target, the wolves landing a vote on the Jester in 0.200.200.700.70 of games and most often under GPT-4.1 (0.700.70 OFF, 0.600.60 ON). The Voted-Jester rate falls under self-learning for all three models (DeepSeek 0.300.200.30\to 0.20, GPT-4.1 0.700.600.70\to 0.60, Llama 0.500.300.50\to 0.30), measured on the first day-1 ballot rather than the post-revote ballot used by Fig. 12. The Wolves-won column is 0.000.00 in every row, so none of the five wolf wins in the full corpus came from a game in which the wolves had voted the Jester on day 1, tying the self-sabotage vote directly to the absence of wolf wins.

Model Cond Unanimous Voted Jester Voted partner Wolves won nn
deepseek-v3.1 off 0.80 0.30 0.00 0.00 10
deepseek-v3.1 on 0.40 0.20 0.30 0.00 10
gpt-4.1 off 0.70 0.70 0.10 0.00 10
gpt-4.1 on 0.90 0.60 0.10 0.00 10
llama-3.3-70b off 0.60 0.50 0.30 0.00 10
llama-3.3-70b on 0.70 0.30 0.30 0.00 10
Table 14: Wolf coordination is high (40–90% unanimous), wolf-vote-Jester is also high (especially for GPT-4.1, 60%\geq 60\%), and wolf wins are 0 in this slice – the five wolf wins of the 60-game corpus did not occur in a game where wolves voted for the Jester on day 1.

J.5 Observer-confidence calibration

Table 15 bins every observer deception-flag by its self-reported confidence and reports the hit-rate per bin, the numeric form of Fig. 10e. GPT-4.1 and DeepSeek-V3.1 are well-calibrated at the top, reaching a hit-rate of 1.001.00 and 0.990.99 respectively in the [0.95,1.01)[0.95,1.01) bin, so a near-maximal-confidence flag from either model can be trusted as ground truth downstream. Llama-3.3-70B never populates that bin at all, its confidence topping out in [0.85,0.95)[0.85,0.95) at a 0.940.94 hit-rate, so its flags cannot be thresholded as tightly. All three models track the diagonal in the lower bins, where a confidence near 0.50.50.70.7 corresponds to a hit-rate near 0.240.240.390.39, so the calibration is monotone and the only deficiency is Llama’s missing high-confidence regime.

Model Confidence bin nn flags Hit-rate
gpt-4.1 [0.00,0.50)[0.00,0.50) 15 0.27
gpt-4.1 [0.50,0.70)[0.50,0.70) 193 0.24
gpt-4.1 [0.70,0.85)[0.70,0.85) 578 0.53
gpt-4.1 [0.85,0.95)[0.85,0.95) 782 0.97
gpt-4.1 [0.95,1.01)[0.95,1.01) 911 1.00
deepseek-v3.1 [0.50,0.70)[0.50,0.70) 46 0.33
deepseek-v3.1 [0.70,0.85)[0.70,0.85) 1778 0.50
deepseek-v3.1 [0.85,0.95)[0.85,0.95) 2282 0.90
deepseek-v3.1 [0.95,1.01)[0.95,1.01) 725 0.99
llama-3.3-70b [0.50,0.70)[0.50,0.70) 435 0.39
llama-3.3-70b [0.70,0.85)[0.70,0.85) 3767 0.62
llama-3.3-70b [0.85,0.95)[0.85,0.95) 1276 0.94
Table 15: Llama never assigns confidence above 0.95 to a deception flag (n=0n=0 in the rightmost bin), capping its calibration ceiling.

Appendix K Temporal dynamics: detailed tables

This appendix provides the full tables that back the temporal-dynamics plots in §E. All values are computed from the same 60-game corpus as the rest of the paper.

K.1 Suspicion attracted by role ×\times round

Role R0 R1 R2 R3 R4
Doctor 0.45 0.35 0.36 0.55 0.78
Jester 0.72 0.80 0.80 0.80
Villager 0.31 0.25 0.29 0.38 0.34
Werewolf 0.70 0.85 0.85 0.85 0.93
Table 16: Mean observer-attracted suspicion (011) by role and round, averaged over both conditions and all three models. Wolves accumulate suspicion linearly; Jesters saturate by day 1; Doctors trace a U-shape (cleared mid-game, re-targeted late as the player pool shrinks); Villagers stay flat.

The Wolf row rises monotonically by +0.23+0.23 from R0 to R4: observers genuinely accumulate evidence over time. The Jester row, by contrast, gains only +0.08+0.08 between R0 and R1 and is then flat – the Jester “signature” is essentially decoded in the first speech act. The Doctor U-shape is the most striking pattern: in mid-game, surviving Doctors are below-average suspicion (0.350.350.360.36, lower than Villagers in R0), but by R3–R4 they spike to 0.550.550.780.78, almost matching Wolves. This is the proximate cause of the universally low Doctor survival reported in Table 20.

K.2 Self-tagged deception rate by model ×\times role ×\times round

Model Role R0 R1 R2 R3 R4
deepseek-v3.1 Doctor 0.49 0.27 0.52 0.00 1.00
gpt-4.1 Doctor 0.21 0.00 0.00
llama-3.3-70b Doctor 0.24 0.20 0.22 0.33
Table 17: Per-statement self-tagged deception rate for the Doctor role across rounds. Wolf, Jester and Villager rows are essentially flat across rounds (Wolf and Jester 0.85\geq 0.85, Villager 0.08\leq 0.08 for all models); per-condition aggregates for all four roles are in the calibration table of Appendix F. The Doctor row shows the largest cross-model divergence in the dataset.

The Wolf, Jester and Villager self-tag rates are essentially a fixed role policy that does not adapt across rounds for any model; the aggregate values appear in the calibration table of Appendix F. The Doctor, however, splits three ways. GPT-4.1 plays Doctor as a near-clean Villager (0.2100.21\to 0); Llama plays Doctor as a steady 25%\approx 25\% deceiver, hiding role information in roughly one in four statements; Deepseek oscillates between 0%0\% and 100%100\%, sometimes lying as much as a Wolf. Each of these three policies fails in its own way (Doctor survival is uniformly poor across all models), but the failure modes differ: GPT Doctors die from being visible, Llama Doctors die from being flagged as liars, Deepseek Doctors die from being inconsistent.

K.3 Within-day position effect on suspicion

Role early third mid third late third
Doctor 0.39 0.46 0.39
Jester 0.68 0.80 0.81
Villager 0.27 0.35 0.25
Werewolf 0.70 0.80 0.83
Table 18: Mean attracted suspicion as a function of position within the day (statements binned into thirds). Only deceivers (Wolf, Jester) show a monotone rise with later position.

The asymmetry is the message: the late-speaker penalty exists only for roles that lie. A naive prediction would be that more thinking time produces safer statements; the data show the opposite for deceivers and a flat or U-shape for honest roles. Observers appear to hold late speakers to a higher inferential bar – precisely the speakers who have had the most time to refine their cover, and whose refinements are therefore the most diagnostic.

K.4 Per-round coordination metrics by model

Model Rd unanim vote-J cons. corr.
deepseek-v3.1 0 0.60 0.23 0.82 0.85
deepseek-v3.1 1 0.88 0.32 0.77 0.47
deepseek-v3.1 2 1.00 0.30 0.78 0.75
deepseek-v3.1 3 1.00 0.00 0.80 0.00
gpt-4.1 0 0.80 0.60 0.82 0.65
gpt-4.1 1 0.75 0.13 0.76 0.14
gpt-4.1 2 1.00 0.00 0.71 0.33
gpt-4.1 3 1.00 0.00 0.67
llama-3.3-70b 0 0.65 0.35 0.77 1.00
llama-3.3-70b 1 0.75 0.28 0.76 0.62
llama-3.3-70b 2 0.83 0.04 0.75 0.86
llama-3.3-70b 3 1.00 0.10 0.75 1.00
Table 19: Per-round coordination. unanim = P(wolves vote same target)P(\text{wolves vote same target}); vote-J = mean fraction of wolves whose vote landed on the Jester; cons. = top-vote share in the daily exile vote; corr. = P(Doctor protected the player wolves attempted to kill)P(\text{Doctor protected the player wolves attempted to kill}).

Three patterns dominate. (i) Wolf unanimity rises monotonically with round in every model – the public debate functions as a coordination channel even though the night-kill is private. (ii) The wolf-vote-Jester error is universal on day 1 (GPT 0.600.60, Llama 0.350.35, Deepseek 0.230.23) and self-corrects within a few rounds, reaching 0.10\leq 0.10 by round 2 for GPT and Llama and by round 3 for Deepseek, but the village suspicion landscape it creates is already poisoned. (iii) Doctor target-prediction anti-correlates with overall model strength: Llama, the weakest overall, is the strongest Doctor predictor (0.620.621.001.00 across rounds), while GPT-4.1, the strongest overall, collapses from 0.650.65 on day 1 to 0.140.14 on day 2. We hypothesise that GPT-4.1 over-reasons about decoy possibilities, while Llama applies a simpler “protect the most-suspected innocent” rule that survives observation noise better.

K.5 Cross-game role survival drift (game index 1–10)

Model Role 1 2 3 4 5 6 7 8 9 10
deepseek-v3.1 Doctor 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.5 0.0 0.0
deepseek-v3.1 Jester 0.5 0.5 0.0 0.5 0.5 0.5 0.0 0.0 0.0 0.5
deepseek-v3.1 Villager 0.9 0.9 0.8 1.0 0.8 0.8 1.0 0.9 0.4 0.8
deepseek-v3.1 Werewolf 0.0 0.3 1.0 0.3 0.3 0.3 0.5 0.5 0.8 0.3
gpt-4.1 Doctor 0.5 0.0 0.0 0.0 0.0 0.0 1.0 0.5 0.5 0.0
gpt-4.1 Jester 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
gpt-4.1 Villager 0.7 1.0 1.0 1.0 0.9 0.6 0.8 0.7 0.9 1.0
gpt-4.1 Werewolf 0.5 1.0 1.0 1.0 0.5 0.8 0.0 0.5 0.5 1.0
llama-3.3-70b Doctor 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
llama-3.3-70b Jester 0.0 0.0 0.5 0.5 0.5 0.0 1.0 0.0 0.0 0.0
llama-3.3-70b Villager 0.7 0.6 0.7 0.5 0.4 0.7 1.0 1.0 0.8 0.9
llama-3.3-70b Werewolf 0.5 0.5 0.3 0.3 0.3 0.5 0.0 1.0 0.8 0.8
Table 20: Per-role survival rate at game end, averaged across both conditions, by chronological game index within each cell.

Two zero-rows identify the most broken role-model pairings in the benchmark: Llama Doctor survives 0/100/10 games and GPT-4.1 Jester survives 0/100/10 games. The Llama Wolf row drifts upward in late games (0.50.80.5\to 0.8 over games 7–10) even in the OFF condition where there is no learnings file, suggesting an implicit cell-position effect we flag as a confound for future cross-game analyses. The Deepseek Villager drop at game 9 (0.400.40) is an isolated anomaly; we report it to encourage replication rather than to draw a conclusion.

Appendix L External validation of the ToM-order tagger

The 1st/2nd/3rd-order tags in Section 5.6 are produced by the rule-based regex tagger of App. N. To check that the headline ordering does not depend on that tagger, we re-label all 141141 accumulated Jester learning entries with two independent LLM judges (DeepSeek-V3.1 and Llama-3.3-70B), each given the same definitions used by the regex rules. Table 21 reports the per-source-model rate of third-order contrastive entries under all three labelers. The absolute percentages are judge-dependent — entry-level agreement with the regex labels is near chance (Cohen’s κ=0.005\kappa=-0.005 for the DeepSeek judge, 0.054-0.054 for the Llama judge), because the LLM judges count any audience-aware phrasing as at least second-order whereas the regex requires an explicit perception verb. The rank order across source models, however, is identical for all three labelers: DeepSeek-V3.1 >> GPT-4.1 >> Llama-3.3-70B. The mechanism claim in Section 5.6 rests on this ordering, not on the absolute value 15.4%15.4\%.

Source model regex DeepSeek-judge Llama-judge
GPT-4.1 4.8% 9.5% 11.9%
DeepSeek-V3.1 15.4% 26.9% 38.5%
Llama-3.3-70B 0.0% 2.1% 0.0%
Table 21: Share of third-order contrastive Jester learning entries per source model under three independent labelers (141141 entries total). Absolute values differ, but the rank order DeepSeek-V3.1 >> GPT-4.1 >> Llama-3.3-70B is preserved by every labeler.

Appendix M Runtime configuration

Table 22 records the engine settings that govern every run, emitted by the released configuration-dump script directly from the code paths. We surface it so that the behavioral findings can be read against the exact decoding and game parameters that produced them.

Parameter Value
Temperature 1.0 (every call)
top_p / max_tokens / stop provider default / unset / unset
Bid range integer 01010
Bid parse fallback silent 0 on parse failure
max_debate_turns 12
max_explanation_turns 6
Recursion limit 1000
Context handling full history, no truncation
Self/peer judges same model as speaker (intra-population)
Per-call seed not exposed by the API (statistical, not bit-exact, reproducibility)
Table 22: Runtime configuration shared by all cells. The silent bid-parse fallback and the intra-population deception judges (speaker and observer are the same model) are the two settings most relevant to interpreting the behavioral findings.

Appendix N Lesson-tagging script

The strategy-category and ToM-order tags of Section 5.6 (Fig. 4) are produced by a single rule-based tagger. Its input is the set of 141 accumulated lessons: each lesson is one free-text string drawn from the to_do, to_not_do, and winning_tactics lists of a model’s per-model learning JSON. Every lesson is lowercased and matched against two independent pattern sets with Python’s re module.

Strategy categories (multi-label).

A lesson receives every category whose pattern list contains at least one match, so a single lesson can carry several category tags and the per-category percentages in Fig. 4b need not sum to 100%. The nine categories and their verbatim pattern lists, together with the tagging function, are given in the box below.

Strategy-category patterns

ToM order (single-label, strict precedence).

A lesson is tagged 3rd-order if it matches any third-order pattern, else 2nd-order if it matches any second-order pattern, else 1st-order. The third-order patterns require an explicit cross-faction contrast (e.g. wolf-like …not villager); this is exactly what separates a multi-hop incentive prediction from a bare audience-aware statement, and is why a 2nd-order “appear suspicious” lesson does not qualify as 3rd-order.

ToM-order patterns The distinctive-vocabulary counts cited in Section 5.6 come from the same script: per-model lesson text is tokenised with the word pattern [a-z][a-z\-]+, a stop-list of function words and game terms (jester, wolf, villager, player, round, …) is removed, and terms are ranked by TF-IDF excess of one model’s corpus over the other two. By this measure DeepSeek-V3.1’s top distinctive terms are wolf-like (22 occurrences), appear (23), deflection (16) and villager-aligned, the lexical fingerprint of its wolf_mimicry-heavy policy. Because the tagger is purely lexical, it is deterministic and reproducible; the external LLM-judge validation of the ToM-order axis is reported in App. L.

Appendix O Prompts

This appendix lists the verbatim prompts used in WOLF-Triadic. All prompts are templates formatted at runtime with the player name, role, scratchpad, debate history and other context fields. Curly braces in the boxes are the literal Python f-string placeholders.

O.1 Role system prompts

The following four prompts define each role’s objective, win condition and high-level strategy. They are concatenated with a shared character-style prompt and a common game-rules prompt at runtime.

Villager system prompt Werewolf system prompt Jester system prompt The past_learnings placeholder is empty in the OFF condition and is replaced in the ON condition by a formatted block listing the to-do, to-not-do and winning-tactics lists from the per-model learning JSON file. Doctor system prompt

O.2 Bidding prompt

Before each speaking turn, every alive player is asked to bid an urgency score 0–10 for speaking next. The highest bidder speaks.

Bidding prompt

O.3 Deception assessment prompts

Each public statement is assessed twice: once by the speaker for self-reported deception (Box for the speaker self-assessment) and once by every other living player for peer-detected deception (observer prompt below). Both prompts use the same 8-class taxonomy.

Speaker self-deception prompt Observer peer-deception prompt

O.4 Jester self-learning update prompt

After every game in the ON condition, the following prompt is issued once to the Jester model. The output is parsed as JSON, deduplicated against the existing learnings via a semantic-similarity check, and appended to the per-model learnings file.

Jester learning update prompt