Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs

Avni Mittal
avni.mittal2002@gmail.com

Abstract

Theory-of-mind evaluations of large language models typically use dyadic social-deduction games, where every observable cue points to a single hidden side, so a model with strong language priors can score well without ever simulating opponents’ incentives. We extend the Werewolf game with a Jester, a third faction whose utility on peer suspicion is inverted because it wins by being voted out, so optimal play requires reasoning across three opposing utility functions. Across 60 games on GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B with Jester self-learning on and off, the Jester wins 60–70% of games while Werewolves never exceed 20%, and GPT-4.1 wolves vote the Jester out on day 1 in 60–70% of games, a strictly self-defeating action. Self-learning helps DeepSeek and Llama but hurts GPT-4.1, with the cost landing on Villagers rather than Werewolves. Only DeepSeek learns the subtle strategy of looking suspicious without looking intentionally suspicious, and it gains the most from the loop. Triadic incentive structure exposes a layer of multi-agent reasoning that dyadic deduction games leave invisible.

Avni Mittal avni.mittal2002@gmail.com

1 Introduction

Werewolf and related social-deduction benchmarks for large language models are dyadic, splitting players into two opposing sides, Villagers against Werewolves or loyal players against traitors (Liu et al., 2024; Bailis et al., 2024; Curvo, 2025; Xu et al., 2023a, b; Wang et al., 2024), so every decision reduces to one question. Is this player an ally or an enemy? Every observable cue points one way, so a model that learns the surface pattern hesitation $\Rightarrow$ likely werewolf can win without modelling any opponent. The theory-of-mind (ToM) (Premack and Woodruff, 1978) optimal policy and a pure cue-reading policy then select the same action, and a model can win by exploiting cues as it might exploit shortcut features elsewhere (Geirhos et al., 2020; McCoy et al., 2019; Niven and kao, 2019). We call this the cue-sufficiency problem, and a high score under it does not show genuine mental-state reasoning (Shapira et al., 2023; Ullman, 2023), a coincidence we confirm in our dyadic control.

We break this coincidence by adding a third faction whose payoff on the same cue runs the opposite way. The Jester wins if and only if the Villagers vote them out, so it tries to look as suspicious as possible while the Werewolves try to look innocent. A high-suspicion player is now ambiguous evidence, a Werewolf to exile or the Jester to leave alone, so good play must avoid looking like either while steering Villager attention away from anyone who might be the Jester. No dyadic variant draws out this multi-hop reasoning over three competing goals (He et al., 2023; Street et al., 2024; Wang et al., 2024).

Refer to caption — Figure 1: Overview of the WOLF triadic-incentive pipeline. A 10-player Werewolf game is instantiated with three competing factions (Villagers, Werewolves, and the Jester) whose utilities over the same observable cue diverge. Each game produces a transcript that is scored along faction outcomes, per-statement deception annotations, and voting dynamics. After every Jester game, the Jester-side model condenses what worked into a persistent learning file that is fed back into the next game’s Jester system prompt, yielding the OFF $\to$ ON self-learning condition we evaluate.

We argue that current LLMs do not reliably make this inference (Sap et al., 2022; Strachan et al., 2024a; Shapira et al., 2023; Ullman, 2023), and the failure shows up faction by faction. Three behavioral signatures follow if models fall back on dyadic priors. (i) The Jester wins more often than either other side. (ii) Werewolves sabotage themselves by voting the Jester out on day 1. (iii) An intervention aimed only at the Jester takes its gains from the Villagers rather than the Werewolves, since the Werewolves already fail the inference and only Villager belief-updating gives the Jester something to exploit. Our contributions are:

•

A 10-player triadic Werewolf benchmark whose Jester’s inverted utility breaks the cue-sufficiency of dyadic deduction games.
•

A controlled 60-game evaluation across three frontier models (GPT-4.1, DeepSeek-V3.1, Llama-3.3-70B) (Achiam et al., 2023; DeepSeek-AI et al., 2024; Dubey et al., 2024) with a Jester self-learning loop toggled OFF/ON (Shinn et al., 2023; Xu et al., 2023a), measuring first-order Jester gains and second-order effects on the other sides.
•

A hand-coded analysis of each model’s accumulated Jester lessons by theory-of-mind depth, linking depth to the model’s gain from the loop.

2 Related Work

LLMs in Werewolf and other social-deduction games.

Werewolf has become the canonical testbed for LLM agents under deception and hidden-role inference. Xu et al. (2023c) pair an LLM proposer with an RL action selector to reach human-comparable play; Wu et al. (2024) couple a System-1 LLM to a System-2 “Thinker” trained on a 19k-game corpus; Xu et al. (2025) optimise an abstract latent strategy space with counterfactual regret minimisation. Werewolf Arena (Bailis et al., 2024) introduces the bidding-based debate protocol that we adopt, and InterIntent (Liu et al., 2024) probes goal-tracking among LLM agents. The wider family includes AvalonBench (Light et al., 2023), AmongAgents (Chi et al., 2024), The Traitors (Curvo, 2025), Hoodwinked (O’Gara, 2023), SpyGame (Kim et al., 2024), and CICERO’s Diplomacy agent (Bakhtin et al., 2022). All of these are dyadic at the faction level, so a model can win by maximising a single “does this player look suspicious?” signal. We add a third faction whose utility on that same signal is inverted, dissociating “detect deception” from “decide whose deception to act on”.

Theory of mind in LLMs.

Static ToM benchmarks include FANToM (Kim et al., 2023), OpenToM (Xu et al., 2024), and Hi-ToM (He et al., 2023), which varies recursion depth explicitly. Strachan et al. (2024b) report that GPT-4 matches humans on classical false-belief tasks but underperforms on faux-pas detection, while Hagendorff (2023) show that deception itself emerges with scale. The recurring critique (Ren et al., 2025) is that surface heuristics solve many ostensibly mental-state items without genuine belief representation. Multi-agent games complement static items by reading mental-state inference out of strategic behaviour under incentive pressure; our claim is sharper, that the incentive structure must itself be designed to dissociate ToM-driven from heuristic-driven solutions.

Strategic deception, alignment, and self-improvement.

A parallel literature studies hidden objectives and evaluator manipulation. MACHIAVELLI (Pan et al., 2023) measures the reward-vs-ethics trade-off across 134 text adventures; Sleeper Agents (Hubinger et al., 2024) show that backdoor deception survives standard safety training; GovSim (Piatti et al., 2024) shows LLM societies overconsume shared resources under misaligned incentives. On in-context self-improvement, Reflexion (Shinn et al., 2023), Voyager (Wang et al., 2023), and Generative Agents (Park et al., 2023) all formalise verbal RL loops that rewrite a textual memory between episodes. Our jester_learning condition is a deliberately minimal instance of this paradigm and lets us audit whether such a loop helps or hurts when the role’s task is inherently deceptive.

The Jester mechanic.

Human social-deduction games such as Mafia and Town of Salem have long used the Jester, a role that wins by being voted out and so breaks the single axis of suspicion the other roles rely on. We are the first to use it for studying Theory of Mind Reasoning in an LLM benchmark.

3 Methodology

3.1 Game architecture

Figure 1 summarises the full pipeline. We adopt the game architecture and implementation of WOLF (Agarwal et al., 2025) and make three changes, giving the 10-player roster in Table 1. We drop the Seer, whose nightly check of whether a chosen player is a Werewolf gives Villagers a reliable signal to anchor on and so removes the cue ambiguity our probe depends on. We raise the Villager count from four to six, since six Villagers and a 2:6 wolf-to-village ratio keep a five-vote daytime majority even with the Jester filling one of the ten slots. We keep a single Doctor, which preserves protect-target inference without handing Villagers a second reliable channel of the kind the Seer would provide. A game alternates between night and day phases. The day opens with the announcement of any night kill, followed by an open debate, a vote, and a revote. The game ends as soon as one of the faction-win conditions in Table 1 fires.

Role	Count	Purpose in the game
Villager	6	No special action at night. By day, debates and votes to exile a suspected Werewolf or Jester. Wins if all Werewolves are eliminated.
Doctor	1	Each night, privately selects one player to protect from the Werewolves’ kill (may self-protect). Plays as a Villager by day and wins with the Villager faction.
Werewolf	2	Each night, the Werewolves jointly select one player to kill. By day they conceal their identity, vote against Villagers, and avoid voting out the Jester. Wins when the Werewolves equal or outnumber the surviving non-Jester players.
Jester	1	No special night action. By day, behaves so as to attract suspicion and be exiled by majority vote. Wins if and only if exiled; loses if killed at night or alive at game end. A Jester win is a loss for both other factions.

Table 1: Per-game role roster (10 players total), with each role’s function during the night and day phases and its win condition.

Bidding-based debate.

We inherit the bidding-based debate from Bailis et al. (2024), as implemented in WOLF (Agarwal et al., 2025). Before each round, every living player produces an integer urgency bid for how much they want to speak next, and the highest bidder gets the floor for a short public statement. Running several bid-and-speak rounds per day routes the floor to whoever has the most to say (a strong accusation, a defense, or a useful piece of information) rather than to a fixed turn order. This matters for our setup because the Jester benefits from airtime to draw suspicion while the Werewolves benefit from controlling when they speak.

Vote, explanation, revote.

After debate, every living player casts a vote to exile someone and gives a short public reason, both written into the shared transcript, so this first vote is a public signal the others can reason about rather than a silent tally. Any player who received a vote then defends themselves, and the rest rejoin through the same bidding mechanism to question or respond, after which a final revote follows. Exile requires a strict majority of living players ( $\lfloor n_{\text{alive}}/2\rfloor+1$ ); if no candidate reaches it the day ends with no exile and play moves to night. This defend-then-revote step lets players revise once they have seen who suspects whom, the moment where multi-hop reasoning about the Jester matters most.

Peer suspicion score.

After every public statement, each other living player rates how suspicious it seemed on a $[0,1]$ scale, and each player’s running suspicion against every other is updated by exponential smoothing, $S_{t}=\alpha s_{t}+(1-\alpha)S_{t-1}$ , with $\alpha=0.7$ , so the latest rating dominates while prior history tempers it. This yields one number per player summarizing recent peer suspicion, the main observable signal for our cue-ambiguity analysis.

Faction utilities and cue ambiguity.

Let $S_{t}(p)$ be peer-aggregated suspicion against player $p$ at time $t$ , the single observable every faction acts on. Villagers want $S$ high on the Werewolves and low on the Jester, the Jester wants the reverse for itself since exile is its only win, and the Werewolves sit between with private team knowledge the Villagers lack. A high- $S$ player who is not a Werewolf’s teammate is therefore either the Jester or a wrongly-accused Villager, so an attentive Werewolf should keep it alive rather than help exile it. The Jester thus plays to two audiences. It must look like a likely Werewolf to the uninformed Villagers so they exile it, yet stay ordinary enough that the informed Werewolves do not recognise and shield it. The same high- $S$ value demands opposite correct responses depending on the hidden role behind it, which is why no one-dimensional reading of $S$ suffices.

A Jester win is, by the rules, a simultaneous loss for both other factions, since either could have prevented it, the Villagers by withholding the exiling vote and the Werewolves by shielding a recognised Jester. It therefore requires both to fail the same inference, which is why we read the Jester win-rate as a direct measure of multi-hop ToM failure. We use multi-hop theory of mind behaviorally throughout, the ability to act on a cue whose correct reading depends on a hidden third party’s incentives, and measure that signature rather than claim anything about internal representations.

3.2 Jester self-learning loop

After each game in the ON condition, an LLM call summarizes the game from the Jester’s perspective and proposes new lessons, returning an empty set when the game taught nothing new. Each lesson is a short free-text note about a tactic that worked or should be avoided, sorted into to-do, to-avoid, and winning-tactic lists (representative entries in Table 2). The list is per-model, so each model sees only its own past lessons, and the next game’s Jester reads them in its prompt before play. The update is append-only and the prompt forbids entries that duplicate or paraphrase existing ones (full prompt in Appendix O). The OFF condition disables the loop entirely, so comparing ON against OFF isolates the effect of accumulated Jester wisdom on faction outcomes. We later analyze the accumulated entries in Section 5.6 along two axes, a 9-class strategy-category taxonomy and a 3-class theory-of-mind order summary (Figure 4).

Model	ToM	Entry (verbatim)
DeepSeek-V3.1	3	Vote for a player who is not the main suspect to appear wolf-like by deflecting from the consensus, but only after the main suspect is already exiled to avoid appearing villager-aligned.
GPT-4.1	2	Escalate suspicion on multiple players in quick succession to appear erratic and untrustworthy.
Llama-3.3-70B	1	Use the initial questions and reactions to gauge players’ behaviors and adjust strategy accordingly.

Table 2: One representative Jester self-learning entry per ToM order (1 = bare imperative; 2 = references how others perceive the agent; 3 = explicit contrast across two faction-perceptions, e.g. wolf-like not villager-aligned). Full 7-entry table per model in App. F (Table 9).

3.3 Deception protocol

Every public statement during debate or vote-explanation is a unit of analysis. Right after a statement, a separate self-report call asks the speaker to label it, returning a binary deception flag, a confidence $c\in[0,1]$ , and a deception type. We extend WOLF’s four deceptive types (omission, distortion, fabrication, misdirection) with three more (sycophancy, strategic_deception, persuasive_manipulation), giving the 8-class taxonomy in Table 3.

Class	Definition
none	Truthful statement with no deceptive intent.
omission	Deliberately hiding relevant information.
distortion	Twisting or exaggerating facts.
fabrication	Making up false information.
misdirection	Diverting attention away from important topics.
sycophancy	Agreeing with others to blend in or gain trust, rather than because one believes the claim.
strategic_deception	Deliberately misrepresenting role, capabilities, or intentions.
persuasive_manipulation	Using emotional or social pressure rather than logical argument.

Table 3: Eight-class self-reported deception taxonomy used by the speaker for every public statement (Appendix O).

All other living players concurrently emit a continuous suspicion score $s\in[0,1]$ for the same statement, updated by the same exponential-smoothing rule. Speaker and observer share the same taxonomy and reasoning view, so self-reported and peer-detected deception are jointly analyzable per role per model.

4 Evaluation Setup

Design.

We run a $3\times 2$ factorial of three frontier models (GPT-4.1, DeepSeek-V3.1, Llama-3.3-70B) crossed with the Jester self-learning loop (OFF, ON), 10 games per cell for 60 games total, with one model playing every role in a game. Cells run sequentially, so under ON the $k$ -th game’s Jester reads the lessons from games $1{:}k{-}1$ of that same cell, and per-cell files are isolated so DeepSeek’s ON cell never sees GPT-4.1’s lessons.

Engine and call settings.

Bids are integers in $[0,10]$ (malformed bids parse as 0), phase caps are max_debate_turns = 12 and max_explanation_turns = 6, and all calls use temperature = 1.0 with the full conversation history untruncated. The complete configuration is in Table 22.

Aggregation and statistics.

Per cell we report Wilson 95% confidence intervals on win-rates, which stay inside $[0,1]$ at small $n$ and extreme rates where the normal interval does not. Per-statement aggregations (deception types, suspicion attracted, ToM order) pool across the 10 games of a cell and report unweighted means unless stated otherwise. Our evidence sits at two scales: win-rate claims rest on the same pattern repeating across all six 10-game cells, while the behavioral and mechanistic analyses run over the 3,992 per-statement deception events.

5 Results

We report aggregate outcomes, the self-learning loop’s marginal effect, behavioral and mechanistic evidence for the multi-hop ToM hypothesis, and a deductive coding of what each model learns.

5.1 Faction win rates and the effect of self-learning

A single Jester agent against eight other actively-deducing agents wins more games than the entire Werewolf faction combined in every cell (Figure 2). Pooling across models and conditions, the Jester achieves 0.55, Villagers 0.37, Werewolves 0.08 (per-cell rates in Table 7, Appendix A).

The self-learning loop helps each model by a different amount (Jester panel of Figure 2). GPT-4.1 already wins $0.70$ of its Jester games without any learning, so it has little room to improve, whereas DeepSeek-V3.1 rises from $0.30$ to $0.70$ ( $\Delta=+0.40$ ) and Llama-3.3-70B from $0.40$ to $0.60$ ( $\Delta=+0.20$ ). The loop thus helps most when a model starts out playing the Jester poorly and adds little once it is near the best a Jester can do.

Faction	Triadic	Dyadic
	(Jester present)	(Jester removed)
Villagers	0.35 [0.19, 0.54]	1.00 [0.74, 1.00]
Werewolves	0.08 [0.02, 0.24]	0.00 [0.00, 0.26]
Jester	0.58 [0.39, 0.74]	—

Table 4: Falsifiability control on GPT-4.1 (role-shuffled, learning OFF). Replacing the Jester with a sixth Villager restores Villager wins to 100%. The Jester role is the active ingredient that breaks Villager performance.

These extra Jester wins come from the Villagers, not the other deceptive role (Fig. 13a, App. H). The Werewolves were already losing almost every game before the loop, so they have no wins left to give up, and the Jester’s gains can only come from the one faction still actively trying to tell the roles apart.

5.2 Falsifiability: removing the Jester

The Jester’s dominance only implicates multi-hop ToM if the other sides fail because the Jester makes the suspicion cue ambiguous, rather than because the model is simply weak at Werewolf. We separate these with a dyadic control on GPT-4.1: the same engine and prompts, but with the Jester replaced by a sixth Villager, roles shuffled per game, and the learning loop off. Table 4 contrasts it with the matched triadic cell (GPT-4.1, shuffled roles, learning off).¹¹1Both cells use per-game role shuffling on GPT-4.1, which also rules out a fixed name-to-role prior as the driver; the dyadic control is run on GPT-4.1 only. Removing the Jester lifts the Villagers from $0.35$ to a clean sweep of $1.00$ (11/11 games), while the Werewolves stay near zero in both regimes.

Metric	OFF	ON	$\Delta$
	( $n{=}32$ )	( $n{=}30$ )	(ON $-$ OFF)
Jester win rate	0.625	0.933	$+0.308$
Villager win rate	0.281	0.033	$-0.248$
Werewolf win rate	0.094	0.033	$-0.060$
Wolf votes Jester on day 1	0.719	0.933	$+0.215$
Jester exiled by day-1 vote	0.500	0.867	$+0.367$

Table 5: Role-shuffled GPT-4.1 robustness run (per-game random role assignment). All

\Delta

except the Werewolf row are significant at

p<0.05

by two-proportion

z

and bootstrap.

The same model that loses the triadic game plays the dyadic game perfectly. In the dyadic regime the only high-suspicion players are the Werewolves, so the surface heuristic “exile the most suspicious player” coincides with the ToM-optimal policy and a pure cue-follower wins every game without modelling any opponent’s incentives. Adding the Jester inserts a second high-suspicion role whose desired outcome is inverted, which dissociates the cue from the optimal action; the Villager win-rate then collapses from $1.00$ to $0.35$ ( $p=2.7\times 10^{-4}$ ). Since the model, prompts, and engine are identical across the two regimes, the collapse cannot be generic weak instruction-following or strategic search, which would also have hurt the dyadic game; it appears only once the cue and the correct action are pulled apart. The triadic $\to$ dyadic gap is thus a direct readout of the multi-hop ToM that a dyadic score, however perfect, cannot certify.

5.3 Robustness: de-confounding GPT-4.1 with role-shuffling

Model	Cond	A	B	C	D	W	T
DeepSeek-V3.1	off	1	1	2	5	1	10
DeepSeek-V3.1	on	2	2	5	1	0	10
GPT-4.1	off	7	3	0	0	0	10
GPT-4.1	on	6	3	0	0	1	10
Llama-3.3-70B	off	2	2	2	3	1	10
Llama-3.3-70B	on	4	1	1	2	2	10

Table 6: Wolf-loss failure-mode counts per cell (

n=10

games each). A = Wolves voted the Jester out on day 1 (a Jester win); B = Wolves killed the Jester at night (a Villager win); C = Villagers voted the Jester out with the Wolves uninvolved (a Jester win); D = Villagers outnumbered the Wolves (a Villager win); W = Wolves won; T = total. A and B together are wolf self-sabotage, and A and C are the two Jester-win routes.

Our main runs fix the name-to-role mapping (the same player name is the Jester in every game of a given model and condition), leaving a name-prior confound and a small per-condition sample. We re-run GPT-4.1 with roles permuted uniformly at random per game at a larger sample (about 30 games per condition), all else unchanged (Table 5). Three points follow. First, the day-1 self-sabotage is not a name artifact: under shuffling it rises to $0.93$ (ON), so wolves target whoever currently holds the Jester role, not a learned name. Second, the substitution effect of Section 5.1 reproduces at larger $n$ and is now significant, again paid by the Villagers rather than the Werewolves. Third, the GPT-4.1 learning effect flips sign, the loop helping ( $\Delta=+0.308$ ) rather than hurting ( $\Delta=-0.10$ in Section 5.1), because the harder de-confounded regime removes the name shortcut and an accumulated-strategy file then buys real edge. Shuffling confirms rather than overturns the findings, so we run it for GPT-4.1 only and keep the non-shuffled runs as our main results.

5.4 Voting pathology: wolf self-sabotage and vote alignment

Voting the Jester on day 1 is the sharpest test of multi-hop ToM (slope charts in Fig. 12, App. G). A Werewolf must see that the high- $S$ player might be the Jester, that exiling it ends the game in a Wolf loss, and refuse the village majority, and failing any step is a self-defeating vote. Every role is even warned of this directly, the Werewolf, Villager, and Doctor prompts all saying not to exile a player just for looking suspicious since it may be the Jester (App. O), so the failure is not a missing rule but an inability to act on a known one against the suspicion cue. GPT-4.1 wolves cast this vote in 60–70% of games, DeepSeek-V3.1 in 20–30%, and Llama in 30–40%, and self-learning lowers the rate for GPT-4.1 and DeepSeek-V3.1 but raises it for Llama-3.3-70B, the only wrong-faction substitution. The model with the tightest vote–suspicion alignment (panel b) is also the one whose Wolves most often suicide-vote the Jester, so committing to the top suspect without asking which faction it belongs to is exactly the failure the triadic design exposes.

Table 6 classifies all 60 games by Wolf-side outcome, categories A and B being self-sabotage (Wolves voted or killed the Jester), C a villager-driven Jester win, and D a clean Villager team-win by outnumbering. GPT-4.1 is the extreme case, ending 19 of 20 games via A or B (13 by the day-1 vote alone) and none via C or D, while DeepSeek-V3.1 shifts the other way under learning, 5/10 ON-condition Wolf losses becoming Jester-baited villager votes (C) rather than attrition (D). Wolves won only 5 of 60 games, and univariate predictors (Table 8, App. F) confirm the pattern, the day-1 wolf-vote-Jester being the strongest correlate of a Jester win ( $r{=}{+}0.59$ ) while the Jester’s raw statement count is uncorrelated ( $r{\approx}0$ ). Night actions tell the same story. Wolves kill the Jester at night in 20–40% of games. Across votes and kills GPT-4.1 wolves take action against the Jester in nearly every game (Fig. 13b, App. H). Cooperative roles also misfire, the Doctor self-protecting in 80–100% of games rather than shielding a teammate yet still the most-killed role (App. C).

We then ask whether the wolves recognise the Jester and misplay or simply cannot tell it from a Werewolf. An external judge (DeepSeek-V3.1) reads each finished game’s public transcript and names the hidden roles from every surviving seat ( $54$ games, $2928$ guesses). It labels the Jester correctly only $0.251$ of the time, no better than the $0.25$ chance rate among four roles and far below its $0.835$ on Villagers, and a Werewolf-seat view does no better ( $0.259$ ) than the Villager ( $0.244$ ) or Doctor ( $0.271$ ) seats. The transcript thus carries no signal separating a Wolf from a Jester, so the day-1 vote is not a recognised Jester misplayed but the same one-dimensional cue-following as everyone else, the cue-sufficiency failure we predicted rather than a coordination or detection breakdown.

5.5 Deception is a model property, not a role or loop property

Category	GPT-4.1	DeepSeek-V3.1	Llama-3.3-70B
wolf_mimicry	0.0	71.2	2.1
inconsistency	23.8	25.0	4.3
aggression	23.8	1.9	0.0
subtle_indirection	9.5	3.8	36.2
passive_provocation	23.8	5.8	17.0
anti_jester_tells	14.3	1.9	12.8
vote_manipulation	9.5	44.2	2.1
meta_modeling	31.0	26.9	10.6
questioning	14.3	23.1	12.8

The Jester out-produces the Werewolves in deceptive statements per turn in every model (for example, GPT-4.1 Jester $7.4$ versus Wolf $5.0$ ON), yet the two attract statistically indistinguishable suspicion, hovering around 0.65–0.81 across all six cells against 0.22–0.51 for Villagers and the Doctor (full dumbbell plots in Fig. 14, App. I), so the high-suspicion band holds two agent types with opposite desired outcomes.

The type of deception is equally a model trait. GPT-4.1 leans on distortion (Jester 47% $\to$ 68% ON) and strategic_deception (Wolf 59% $\to$ 72%) with zero fabrication, DeepSeek-V3.1 on fabrication (Jester $\approx$ 62% in both conditions), and Llama-3.3-70B on misdirection (Jester 40% $\to$ 57%). Each model imports one style and applies it inside whatever role it inhabits, so composition barely shifts with role or loop, and the same signature recurs in speaker-vs-observer calibration where honest roles are over-flagged and skilled liars under-flagged (App. D). A model that deceives the same way regardless of faction is unlikely to be running recursive reasoning over its opponents’ incentives.

The second-order suspicion shifts confirm the pattern. The loop only edits the Jester’s prompt, yet it moves how other agents suspect each other in a model-specific way. Llama-3.3-70B ON shifts every observer’s rating of the Doctor by $-0.15$ to $-0.23$ and lifts Jester ratings by $+0.08$ to $+0.09$ , while DeepSeek-V3.1 ON makes Wolves less suspicious of the Jester ( $-0.07$ ), so they less often join the exiling vote and Jester wins rise. GPT-4.1’s near-flat diff matrix matches its saturation (Section 5.1), with no second-order room left at the ceiling of cue-ambiguity exploitation. The full heatmap (App. B.1), six-axis radar (App. B.3), and game-length distributions (App. B.2) make these cross-cell shapes immediate, with Jester wins concentrating on day 1 for GPT-4.1, within 1–2 rounds for the others.

5.6 Inside the learned policies

Outcome metrics show that the loop helps unevenly but not why. To recover the mechanism we deductively coded all 141 accumulated lessons (GPT-4.1: 42, DeepSeek-V3.1: 52, Llama-3.3-70B: 47) with a rule-based regex tagger (patterns and code in App. N) along two axes, a 9-class strategy-category taxonomy grounded in the triadic-incentive model (Fig. 4b) and the 3-class theory-of-mind order of Section 3.2 (Table 2). Only a third-order entry separates being mistaken for a Wolf from being mistaken for a Villager, so this class is the textual signature of multi-hop incentive prediction.

The three models converge on qualitatively distinct policies (Fig. 4b). DeepSeek-V3.1 encodes the cue-ambiguity exploit explicitly, devoting 71% of its entries to wolf_mimicry against $\leq 2\%$ for the other two. GPT-4.1 instead acts erratic and lets the group affix the label, combining meta_modeling (31%), aggression (24%) and passive_provocation (24%), while Llama-3.3-70B leans on subtle_indirection (36%). A per-model distinctive-vocabulary (TF-IDF) analysis corroborates the split (App. N).

The ToM-order distribution (Fig. 4a) is the paper’s cleanest mechanistic result. DeepSeek-V3.1 is the only model that reliably produces third-order contrastive entries (15.4%), against 4.8% for GPT-4.1 and none for Llama-3.3-70B, and two independent LLM judges reproduce this rank order even though their absolute percentages differ (App. L; representative entries in Table 2). The ordering tracks the learning gains of Section 5.1, the model whose lessons encode the triadic incentive structure (DeepSeek-V3.1) extracting the most from the loop ( $\Delta=+0.40$ ) and Llama-3.3-70B, whose lessons are 64% bare first-order imperatives, only $+0.20$ .

6 Conclusion

Adding a Jester to a Werewolf game converts a one-dimensional suspicion signal into a triadic incentive-prediction problem, and current frontier LLMs fail at it. Across 60 games on three models, the Jester wins the majority of cells, Werewolves never exceed 20%, and the single-faction self-learning intervention pays for its gains out of Villager wins rather than Wolf wins. Behavioral fingerprints (day-1 self-sabotage votes, night-kill of the Jester, model-stable deception-type composition) and second-order cross-perception shifts pin the failure on cue-sufficiency rather than coordination or signal detection, and deductive coding of the 141 accumulated lessons adds a textual mechanism, the only model whose self-learning produces contrastive third-order entries (DeepSeek-V3.1, 15.4%) being the one that gains most from the loop. Triadic incentives are a small structural change that exposes a layer of multi-agent reasoning dyadic deduction games leave invisible, and we release the benchmark, logs and analysis to enable follow-up work that controls the cue-ambiguity axis deliberately.

Limitations

We run only 10 games per cell, so the Wilson confidence intervals on each win-rate are wide. Our claims therefore rest on the same pattern showing up across cells, not on any single cell being significant. The Jester self-learning loop is trained per model, so we do not test whether lessons learned by one model help another. We did not log per-observer true- and false-positive labels, so we cannot report $F_{1}$ -style detection scores for the 8-type taxonomy. Passing the full conversation history with no compression keeps all information from earlier rounds, but it also raises token cost and fills the context window as games get longer. Adding a proper memory structure, for example retrieval over a running summary or dropping low-relevance turns, is left to future work. The cue-ambiguity claim is an argument about game structure, and other ToM probes, such as asking each agent direct second-order belief questions mid-game, could add to the behavioral evidence we give here. Finally, all our games are in English, so we do not know how well the triadic-incentive idea carries over to other languages.

Ethical considerations

Studying deception in LLMs.

The benchmark explicitly incentivises one role (the Jester) to deceive other agents. We do not view this as encouraging real-world deception by deployed models; rather, structured studies of deceptive capability are necessary for safety evaluation, since deceptive behaviour cannot be measured or mitigated without first being elicited and characterised. The self-learning loop is run in a sandboxed gameplay environment with no human users in the loop.

Dual-use of the self-learning loop.

The per-model running-lessons file shows that one frontier model (DeepSeek-V3.1) is already able to accumulate, in natural language, third-order contrastive deception strategies from its own gameplay. The same loop could in principle be used to bootstrap deceptive prompts outside a game context. We release the loop, the prompts, and the resulting lesson files so that this capability surface is auditable, but we caution that the release is intended for safety research rather than for re-deployment as a deception scaffold.

Human-subject data.

The benchmark contains no human-subject data. All transcripts are model-generated; no personally identifiable information is collected or released. Player names used inside games (e.g. “Joy”) are arbitrary tokens unconnected to any real individual.

Model and content licensing.

The three evaluated models (GPT-4.1, DeepSeek-V3.1, Llama-3.3-70B) are accessed under their providers’ standard terms of use. Released artefacts (code, prompts, event logs) do not redistribute model weights and respect each provider’s output-sharing policy.

References

O. J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A. Brakman, G. Brockman, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell, A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan, C. Chang, F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen, M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung, D. Cummings, J. Currier, Y. Dai, C. Decareaux, T. Degry, N. Deutsch, D. Deville, A. Dhar, D. Dohan, S. Dowling, S. Dunning, A. Ecoffet, A. Eleti, T. Eloundou, D. Farhi, L. Fedus, N. Felix, S. Fishman, J. Forte, I. Fulford, L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni, G. Goh, R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray, R. Greene, J. Gross, S. Gu, Y. Guo, C. Hallacy, J. Han, J. Harris, Y. He, M. Heaton, J. Heidecke, C. Hesse, A. Hickey, W. Hickey, P. Hoeschele, B. Houghton, K. Hsu, S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain, J. Jang, A. Jiang, R. Jiang, H. Jin, D. Jin, S. Jomoto, B. Jonn, H. Jun, T. Kaftan, L. Kaiser, A. Kamali, I. Kanitscheider, N. Keskar, T. Khan, L. Kilpatrick, J. W. Kim, C. Kim, Y. Kim, H. Kirchner, J. Kiros, M. Knight, D. Kokotajlo, L. Kondraciuk, A. Kondrich, A. Konstantinidis, K. Kosic, G. Krueger, V. Kuo, M. Lampe, I. Lan, T. Lee, J. Leike, J. Leung, D. Levy, C. Li, R. Lim, M. Lin, S. L. Lin, M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Malfacini, S. Manning, T. Markov, Y. Markovski, B. Martin, K. Mayer, A. Mayne, B. McGrew, S. McKinney, C. McLeavey, P. McMillan, J. McNeil, D. Medina, A. Mehta, J. Menick, L. Metz, A. Mishchenko, P. Mishkin, V. Monaco, E. Morikawa, D. P. Mossing, T. Mu, M. Murati, O. Murk, D. M’ely, A. Nair, R. Nakano, R. Nayak, A. Neelakantan, R. Ngo, H. Noh, O. Long, C. O’Keefe, J. Pachocki, A. Paino, J. Palermo, A. Pantuliano, G. Parascandolo, J. Parish, E. Parparita, A. Passos, M. Pavlov, A. Peng, A. Perelman, F. de Avila Belbute Peres, M. Petrov, H. P. de Oliveira Pinto, M. Pokorny, M. Pokrass, V. H. Pong, T. Powell, A. Power, B. Power, E. Proehl, R. Puri, A. Radford, J. W. Rae, A. Ramesh, C. Raymond, F. Real, K. Rimbach, C. Ross, B. Rotsted, H. Roussez, N. Ryder, M. Saltarelli, T. Sanders, S. Santurkar, G. Sastry, H. Schmidt, D. Schnurr, J. Schulman, D. Selsam, K. Sheppard, T. Sherbakov, J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler, M. Simens, J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky, Y. Song, N. Staudacher, F. Such, N. Summers, I. Sutskever, J. Tang, N. Tezak, M. Thompson, P. Tillet, A. Tootoonchian, E. Tseng, P. Tuggle, N. Turley, J. Tworek, J. F. C. Uribe, A. Vallone, A. Vijayvergiya, C. Voss, C. L. Wainwright, J. Wang, A. Wang, B. Wang, J. Ward, J. Wei, C. Weinmann, A. Welihinda, P. Welinder, J. Weng, L. Weng, M. Wiethoff, D. Willner, C. Winter, S. Wolrich, H. Wong, L. Workman, S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan, W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, and B. Zoph (2023) GPT-4 technical report. Cited by: 2nd item.
M. Agarwal, S. Rana, T. Sundoro, H. Berhe, S. Kim, V. Sharma, S. O’Brien, and K. Zhu (2025) Wolf: werewolf-based observations for llm deception and falsehoods. arXiv preprint arXiv:2512.09187. Cited by: §3.1, §3.1.
S. Bailis, J. Friedhoff, and F. Chen (2024) Werewolf arena: a case study in llm evaluation via social deduction. ArXiv abs/2407.13943. Cited by: §1, §2, §3.1.
A. Bakhtin, N. Brown, E. Dinan, G. Farina, C. Flaherty, D. Fried, A. Goff, J. Gray, H. Hu, A. P. Jacob, M. Komeili, K. Konath, M. Kwon, A. Lerer, M. Lewis, A. H. Miller, S. Mitts, A. Renduchintala, S. Roller, D. Rowe, W. Shi, J. Spisak, A. Wei, D. J. Wu, H. Zhang, and M.N. Zijlstra (2022) Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science 378, pp. 1067 – 1074. Cited by: §2.
Y. Chi, L. Mao, and Z. Tang (2024) AMONGAGENTS: evaluating large language models in the interactive text-based social deduction game. ArXiv abs/2407.16521. Cited by: §2.
P. Curvo (2025) The traitors: deception and trust in multi-agent language model simulations. ArXiv abs/2505.12923. Cited by: §1, §2.
DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Wang, J. Chen, J. Chen, J. Yuan, J. Qiu, J. Li, J. Song, K. Dong, K. Hu, K. Gao, K. Guan, K. Huang, K. Yu, L. Wang, L. Zhang, L. Xu, L. Xia, L. Zhao, L. Wang, L. Zhang, M. Li, M. Wang, M. Zhang, M. Zhang, M. Tang, M. Li, N. Tian, P. Huang, P. Wang, P. Zhang, Q. Wang, Q. Zhu, Q. Chen, Q. Du, R. J. Chen, R. Jin, R. Ge, R. Zhang, R. Pan, R. Wang, R. Xu, R. Zhang, R. Chen, S. Li, S. Lu, S. Zhou, S. Chen, S. Wu, S. Ye, S. Ma, S. Wang, S. Zhou, S. Yu, S. Zhou, S. Pan, T. Wang, T. Yun, T. Pei, T. Sun, W. Xiao, W. Zeng, W. Zhao, W. An, W. Liu, W. Liang, W. Gao, W. Yu, W. Zhang, X. Q. Li, X. Jin, X. Wang, X. Bi, X. Liu, X. Wang, X. Shen, X. Chen, X. Zhang, X. Chen, X. Nie, X. Sun, X. Wang, X. Cheng, X. Liu, X. Xie, X. Liu, X. Yu, X. Song, X. Shan, X. Zhou, X. Yang, X. Li, X. Su, X. Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. Zhu, Y. Zhang, Y. Xu, Y. Huang, Y. Li, Y. Zhao, Y. Sun, Y. Li, Y. Wang, Y. Yu, Y. Zheng, Y. Zhang, Y. Shi, Y. Xiong, Y. He, Y. Tang, Y. Piao, Y. Wang, Y. Tan, Y. Ma, Y. Liu, Y. Guo, Y. Wu, Y. Ou, Y. Zhu, Y. Wang, Y. Gong, Y. Zou, Y. He, Y. Zha, Y. Xiong, Y. Ma, Y. Yan, Y. Luo, Y. You, Y. Liu, Y. Zhou, Z. F. Wu, Z. Ren, Z. Ren, Z. Sha, Z. Fu, Z. Xu, Z. Huang, Z. Zhang, Z. Xie, Z. Zhang, Z. Hao, Z. Gou, Z. Ma, Z. Yan, Z. Shao, Z. Xu, Z. Wu, Z. Zhang, Z. Li, Z. Gu, Z. Zhu, Z. Liu, Z. Li, Z. Xie, Z. Song, Z. Gao, and Z. Pan (2024) DeepSeek-v3 technical report. Cited by: 2nd item.
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, A. Goyal, A. S. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Rozière, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McConnell, C. Keller, C. Touret, C. Wu, C. Wong, C. C. Ferrer, C. Nikolaidis, D. Allonsius, D. Song, D. Pintz, D. Livshits, D. Esiobu, D. Choudhary, D. Mahajan, D. Garcia-Olano, D. Perino, D. Hupkes, E. Lakomkin, E. A. AlBadawy, E. Lobanova, E. Dinan, E. Smith, F. Radenovic, F. Zhang, G. Synnaeve, G. Lee, G. L. Anderson, G. Nail, G. Mialon, G. Pang, G. Cucurell, H. Nguyen, H. Korevaar, H. Xu, H. Touvron, I. Zarov, I. A. Ibarra, I. M. Kloumann, I. Misra, I. Evtimov, J. Copet, J. Lee, J. Geffert, J. Vranes, J. Park, J. Mahadeokar, J. Shah, J. V. D. Linde, J. Billock, J. Hong, J. Lee, J. Fu, J. Chi, J. Huang, J. Liu, J. Wang, J. Yu, J. Bitton, J. Spisak, J. Park, J. Rocca, J. Johnstun, J. Saxe, J. Jia, K. V. Alwala, K. Upasani, K. Plawiak, K. Li, K. Heafield, K. R. Stone, K. El-Arini, K. Iyer, K. Malik, K. Chiu, K. Bhalla, L. Rantala-Yeary, L. Maaten, L. Chen, L. Tan, L. Jenkins, L. Martin, L. Madaan, L. Malo, L. Blecher, L. Landzaat, L. Oliveira, M. Muzzi, M. Pasupuleti, M. Singh, M. Paluri, M. Kardas, M. Oldham, M. Rita, M. Pavlova, M. Kambadur, M. Lewis, M. Si, M. K. Singh, M. Hassan, N. Goyal, N. Torabi, N. Bashlykov, N. Bogoychev, N. S. Chatterji, O. Duchenne, O. cCelebi, P. Alrassy, P. Zhang, P. Li, P. Vasić, P. Weng, P. Bhargava, P. Dubal, P. Krishnan, P. S. Koura, P. Xu, Q. He, Q. Dong, R. Srinivasan, R. Ganapathy, R. Calderer, R. S. Cabral, R. Stojnic, R. Raileanu, R. Girdhar, R. Patel, R. Sauvestre, R. Polidoro, R. Sumbaly, R. Taylor, R. Silva, R. Hou, R. Wang, S. Hosseini, S. Chennabasappa, S. Singh, S. Bell, S. Kim, S. Edunov, S. Nie, S. Narang, S. Raparthy, S. Shen, S. Wan, S. Bhosale, S. Zhang, S. Vandenhende, S. Batra, S. Whitman, S. Sootla, S. Collot, S. Gururangan, S. Borodinsky, T. Herman, T. Fowler, T. Sheasha, T. Georgiou, T. Scialom, T. Speckbacher, T. Mihaylov, T. Xiao, U. Karn, V. Goswami, V. Gupta, V. Ramanathan, V. Kerkez, V. Gonguet, V. Do, V. Vogeti, V. Petrovic, W. Chu, W. Xiong, W. Fu, W. Meers, X. Martinet, X. Wang, X. Tan, X. Xie, X. Jia, X. Wang, Y. Goldschlag, Y. Gaur, Y. Babaei, Y. Wen, Y. Song, Y. Zhang, Y. Li, Y. Mao, Z. D. Coudert, Z. Yan, Z. Chen, Z. Papakipos, A. K. Singh, A. Grattafiori, A. Jain, A. Kelsey, A. Shajnfeld, A. Gangidi, A. Victoria, A. Goldstand, A. Menon, A. Sharma, A. Boesenberg, A. Vaughan, A. Baevski, A. Feinstein, A. Kallet, A. Sangani, A. Yunus, A. Lupu, A. Alvarado, A. Caples, A. Gu, A. Ho, A. Poulton, A. Ryan, A. Ramchandani, A. Franco, A. Saraf, A. Chowdhury, A. Gabriel, A. Bharambe, A. Eisenman, A. Yazdan, B. James, B. Maurer, B. Leonhardi, P. (. Huang, B. Loyd, B. de Paola, B. Paranjape, B. Liu, B. Wu, B. Ni, B. Hancock, B. Wasti, B. Spence, B. Stojkovic, B. Gamido, B. Montalvo, C. Parker, C. Burton, C. Mejia, C. Wang, C. Kim, C. Zhou, C. Hu, C. Chu, C. Cai, C. Tindal, C. Feichtenhofer, D. Civin, D. Beaty, D. Kreymer, S. Li, D. Wyatt, D. Adkins, D. Xu, D. Testuggine, D. David, D. Parikh, D. Liskovich, D. Foss, D. Wang, D. Le, D. Holland, E. Dowling, E. Jamil, E. Montgomery, E. Presani, E. Hahn, E. Wood, E. Brinkman, E. Arcaute, E. Dunbar, E. Smothers, F. Sun, F. Kreuk, F. Tian, F. Ozgenel, F. Caggioni, F. (. Guzmán, F. J. Kanayet, F. Seide, G. M. Florez, G. Schwarz, G. Badeer, G. Swee, G. Halpern, G. Thattai, G. Herman, G. Sizov, G. Zhang, G. Lakshminarayanan, H. Shojanazeri, H. Zou, H. Wang, H. Zha, H. Habeeb, H. Rudolph, H. Suk, H. Aspegren, H. Goldman, I. Molybog, I. Tufanov, I. Veliche, I. Gat, J. Weissman, J. Geboski, J. Kohli, J. Asher, J. Gaya, J. Marcus, J. Tang, J. Chan, J. Zhen, J. Reizenstein, J. Teboul, J. Zhong, J. Jin, J. Yang, J. Cummings, J. Carvill, J. Shepard, J. McPhie, J. Torres, J. Ginsburg, J. Wang, K. Wu, U. KamHou, K. Saxena, K. Prasad, K. Khandelwal, K. Zand, K. Matosich, K. Veeraraghavan, K. Michelena, K. Li, K. Huang, K. Chawla, K. Lakhotia, K. Huang, L. Chen, L. Garg, A. Lavender, L. Silva, L. Bell, L. Zhang, L. Guo, L. Yu, L. Moshkovich, L. Wehrstedt, M. Khabsa, M. Avalani, M. Bhatt, M. Tsimpoukelli, M. Mankus, M. Hasson, M. Lennie, M. Reso, M. Groshev, M. Naumov, M. Lathi, M. Keneally, M. Seltzer, M. Valko, M. Restrepo, M. Patel, M. Vyatskov, M. Samvelyan, M. Clark, M. Macey, M. Wang, M. J. Hermoso, M. Metanat, M. Rastegari, M. Bansal, N. Santhanam, N. Parks, N. White, N. Bawa, N. Singhal, N. Egebo, N. Usunier, N. P. Laptev, N. Dong, N. Zhang, N. Cheng, O. Chernoguz, O. Hart, O. Salpekar, O. Kalinli, P. Kent, P. Parekh, P. Saab, P. Balaji, P. Rittner, P. Bontrager, P. Roux, P. Dollár, P. Zvyagina, P. Ratanchandani, P. Yuvraj, Q. Liang, R. Alao, R. Rodriguez, R. Ayub, R. Murthy, R. Nayani, R. Mitra, R. Li, R. Hogan, R. Battey, R. Wang, R. Maheswari, R. Howes, R. Rinott, S. J. Bondu, S. Datta, S. Chugh, S. Hunt, S. Dhillon, S. Sidorov, S. Pan, S. Verma, S. Yamamoto, S. Ramaswamy, S. Lindsay, S. Feng, S. Lin, S. Zha, S. Shankar, S. Zhang, S. Wang, S. Agarwal, S. Sajuyigbe, S. Chintala, S. Max, S. Chen, S. Kehoe, S. Satterfield, S. Govindaprasad, S. Gupta, S. Cho, S. Virk, S. Subramanian, S. Choudhury, S. Goldman, T. Remez, T. Glaser, T. Best, T. Kohler, T. Robinson, T. Li, T. Zhang, T. Matthews, T. Chou, T. Shaked, V. Vontimitta, V. O. Ajayi, V. Montanez, V. Mohan, V. Kumar, V. Mangla, V. Ionescu, V. Poenaru, V. T. Mihailescu, V. Ivanov, W. Li, W. Wang, W. Jiang, W. Bouaziz, W. Constable, X. Tang, X. Wang, X. Wu, X. Wang, X. Xia, X. Wu, X. Gao, Y. Chen, Y. Hu, Y. Jia, Y. Qi, Y. Li, Y. Zhang, Y. Zhang, Y. Adi, Y. Nam, Y. Wang, Y. Hao, Y. Qian, Y. He, Z. Rait, Z. DeVito, Z. Rosnbrick, Z. Wen, Z. Yang, and Z. Zhao (2024) The llama 3 herd of models. Cited by: 2nd item.
R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. Wichmann (2020) Shortcut learning in deep neural networks. Nature Machine Intelligence 2, pp. 665 – 673. Cited by: §1.
T. Hagendorff (2023) Deception abilities emerged in large language models. Proceedings of the National Academy of Sciences of the United States of America 121. Cited by: §2.
Y. He, Y. Wu, Y. Jia, R. Mihalcea, Y. Chen, and N. Deng (2023) HI-tom: a benchmark for evaluating higher-order theory of mind reasoning in large language models. pp. 10691–10706. Cited by: §1, §2.
E. Hubinger, C. E. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Cheng, A. S. Jermyn, A. Askell, A. Radhakrishnan, C. Anil, D. Duvenaud, D. Ganguli, F. Barez, J. Clark, K. Ndousse, K. Sachan, M. Sellitto, M. Sharma, N. Dassarma, R. Grosse, S. Kravec, Y. Bai, Z. Witten, M. Favaro, J. Brauner, H. Karnofsky, P. Christiano, S. R. Bowman, L. Graham, J. Kaplan, S. Mindermann, R. Greenblatt, B. Shlegeris, N. Schiefer, and E. Perez (2024) Sleeper agents: training deceptive llms that persist through safety training. ArXiv abs/2401.05566. Cited by: §2.
B. Kim, D. Seo, and B. Kim (2024) Microscopic analysis on llm players via social deduction game. arXiv preprint arXiv:2408.09946. Cited by: §2.
H. Kim, M. Sclar, X. Zhou, R. L. Bras, G. Kim, Y. Choi, and M. Sap (2023) FANToM: a benchmark for stress-testing machine theory of mind in interactions. ArXiv abs/2310.15421. Cited by: §2.
J. Light, M. Cai, S. Shen, and Z. Hu (2023) AvalonBench: evaluating llms playing the game of avalon. Cited by: §2.
Z. Liu, A. Anand, P. Zhou, J. Huang, and J. Zhao (2024) InterIntent: investigating social intelligence of llms via intention understanding in an interactive game context. pp. 6718–6746. Cited by: §1, §2.
R. T. McCoy, E. Pavlick, and T. Linzen (2019) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. ArXiv abs/1902.01007. Cited by: §1.
T. Niven and H. kao (2019) Probing neural network comprehension of natural language arguments. ArXiv abs/1907.07355. Cited by: §1.
A. O’Gara (2023) Hoodwinked: deception and cooperation in a text-based game for language models. ArXiv abs/2308.01404. Cited by: §2.
A. Pan, C. Shern, A. Zou, N. Li, S. Basart, T. Woodside, J. Ng, H. Zhang, S. Emmons, and D. Hendrycks (2023) Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. pp. 26837–26867. Cited by: §2.
J. Park, J. O’Brien, C. J. Cai, M. Morris, P. Liang, and M. S. Bernstein (2023) Generative agents: interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. Cited by: §2.
G. Piatti, Z. Jin, M. Kleiman-Weiner, B. Schölkopf, M. Sachan, and R. Mihalcea (2024) Cooperate or collapse: emergence of sustainable cooperation in a society of llm agents. Advances in Neural Information Processing Systems 37. Cited by: §2.
D. Premack and G. Woodruff (1978) Does the chimpanzee have a theory of mind?. Behavioral and Brain Sciences 1, pp. 515 – 526. Cited by: §1.
R. Ren, A. Agarwal, M. Mazeika, C. Menghini, R. Vacareanu, B. Kenstler, M. Yang, I. Barrass, A. Gatti, X. Yin, E. Trevino, M. Geralnik, A. Khoja, D. Lee, S. Yue, and D. Hendrycks (2025) The mask benchmark: disentangling honesty from accuracy in ai systems. ArXiv abs/2503.03750. Cited by: §2.
M. Sap, R. L. Bras, D. Fried, and Y. Choi (2022) Neural theory-of-mind? on the limits of social intelligence in large lms. pp. 3762–3780. Cited by: §1.
N. Shapira, M. Levy, S. Alavi, X. Zhou, Y. Choi, Y. Goldberg, M. Sap, and V. Shwartz (2023) Clever hans or neural theory of mind? stress testing social reasoning in large language models. pp. 2257–2273. Cited by: §1, §1.
N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36. Cited by: 2nd item, §2.
J. W. A. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, M. S. A. Graziano, and C. Becchio (2024a) Testing theory of mind in large language models and humans. Nature Human Behaviour 8, pp. 1285 – 1295. Cited by: §1.
J. W. A. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, M. S. A. Graziano, and C. Becchio (2024b) Testing theory of mind in large language models and humans. Nature Human Behaviour 8, pp. 1285 – 1295. Cited by: §2.
W. Street, J. Siy, G. Keeling, A. Baranes, B. Barnett, M. McKibben, T. Kanyere, A. Lentz, B. A. Y. Arcas, and R. Dunbar (2024) LLMs achieve adult human performance on higher-order theory of mind tasks. Frontiers in Human Neuroscience 19. Cited by: §1.
T. Ullman (2023) Large language models fail on trivial alterations to theory-of-mind tasks. ArXiv abs/2302.08399. Cited by: §1, §1.
G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. (. Fan, and A. Anandkumar (2023) Voyager: an open-ended embodied agent with large language models. ArXiv abs/2305.16291. Cited by: §2.
S. Wang, C. Liu, Z. Zheng, S. Qi, S. Chen, Q. Yang, A. Zhao, C. Wang, S. Song, and G. Huang (2024) Boosting llm agents with recursive contemplation for effective deception handling. pp. 9909–9953. Cited by: §1, §1.
S. Wu, L. Zhu, T. Yang, S. Xu, Q. Fu, Y. Wei, and H. Fu (2024) Enhance reasoning for large language models in the game werewolf. ArXiv abs/2402.02330. Cited by: §2.
H. Xu, R. Zhao, L. Zhu, J. Du, and Y. He (2024) OpenToM: a comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. pp. 8593–8623. Cited by: §2.
Y. Xu, S. Wang, P. Li, F. Luo, X. Wang, W. Liu, and Y. Liu (2023a) Exploring large language models for communication games: an empirical study on werewolf. ArXiv abs/2309.04658. Cited by: 2nd item, §1.
Z. Xu, W. Gu, C. Yu, Y. Wu, and Y. Wang (2025) Learning strategic language agents in the werewolf game with iterative latent space policy optimization. ArXiv abs/2502.04686. Cited by: §2.
Z. Xu, C. Yu, F. Fang, Y. Wang, and Y. Wu (2023b) Language agents with reinforcement learning for strategic play in the werewolf game. ArXiv abs/2310.18940. Cited by: §1.
Z. Xu, C. Yu, F. Fang, Y. Wang, and Y. Wu (2023c) Language agents with reinforcement learning for strategic play in the werewolf game. ArXiv abs/2310.18940. Cited by: §2.

Appendix A Per-cell faction win-rates

Table 7 is the numeric companion to Figure 2 in Section 5.1, reporting per-cell win-rates with Wilson 95% confidence intervals at $n{=}10$ games.

Model	Cond.	Villagers	Werewolves	Jester
GPT-4.1	OFF	0.30 [0.11, 0.60]	0.00 [0.00, 0.28]	0.70 [0.40, 0.89]
GPT-4.1	ON	0.30 [0.11, 0.60]	0.10 [0.02, 0.40]	0.60 [0.31, 0.83]
DeepSeek-V3.1	OFF	0.60 [0.31, 0.83]	0.10 [0.02, 0.40]	0.30 [0.11, 0.60]
DeepSeek-V3.1	ON	0.30 [0.11, 0.60]	0.00 [0.00, 0.28]	0.70 [0.40, 0.89]
Llama-3.3-70B	OFF	0.50 [0.24, 0.76]	0.10 [0.02, 0.40]	0.40 [0.17, 0.69]
Llama-3.3-70B	ON	0.20 [0.06, 0.51]	0.20 [0.06, 0.51]	0.60 [0.31, 0.83]

Table 7: Per-cell faction win-rates with Wilson 95% CIs (

n=10

games per cell). Each row sums to 1.0 across factions. Numeric companion to Figure 2.

Three features of the per-cell breakdown back the pooled numbers in the main text. The Jester is the modal winning faction in four of the six cells and the Werewolves never clear $0.20$ in any cell, so the Jester-dominance and wolf-collapse patterns hold cell by cell rather than only on average. The self-learning loop lifts the Jester column for DeepSeek-V3.1 ( $0.30\to 0.70$ ) and Llama-3.3-70B ( $0.40\to 0.60$ ) while leaving GPT-4.1 near its ceiling ( $0.70\to 0.60$ ), the same uneven learning effect plotted in Figure 2. In both improving cells the matching loss falls in the Villager column (DeepSeek $0.60\to 0.30$ , Llama $0.50\to 0.20$ ) and not the Werewolf column, which is the per-cell form of the substitution effect of Section 5.1. The wide Wilson intervals, several spanning more than $0.4$ , are why we rest the claims on this cross-cell consistency rather than on any single cell.

Appendix B Additional results figures

This appendix collects three supporting figures referenced from the main results section: the cross-perception difference heatmap (Section 5.5), the game-length distribution by winner, and the six-axis behavioral fingerprint (Section 5.5).

B.1 Cross-perception second-order shifts

Figure 5 shows the full per-model heatmap of how the end-of-game suspicion that observer roles direct at target roles shifts when Jester self-learning is turned on. Rows are the observer role, columns are the target role, and the colour shows ON $-$ OFF: red means observers in this row become more suspicious of this target under the loop, blue means less. The figure makes two points that the main-text numbers compress. First, the Llama panel on the right has the most saturated cells, including a strong blue band over the Doctor column. Every other role becomes less suspicious of the Doctor, even though the loop never touched any Doctor prompt. Second, the GPT-4.1 panel is visually almost grey, consistent with the saturation argument in Section 5.1: at ceiling, the loop has no remaining headroom to propagate.

B.2 Game length

Figure 6 plots the number of day rounds elapsed before a game resolves, faceted by model and split by winning faction and Jester-learning condition. Jester wins cluster at the left edge of every panel: the Jester is exiled almost immediately once the loop produces a strategy that the group locks onto. Wolf wins, in the rare cells where they happen, take noticeably more rounds because they require attrition through several night kills. The pattern confirms that high-confidence voting fires on the highest-suspicion player without filtering for which faction that player belongs to, which is exactly the cue-ambiguity failure the triadic setup is designed to expose.

B.3 Behavioral fingerprint radar

Figure 7 compresses each model’s behavior into a six-axis polygon, with one shape for Jester-learning OFF (grey) and one for ON (orange). All six axes are normalized to $[0,1]$ with larger values pushed outward, so a model that improves under the loop expands outward on the relevant axis. The three cross-cell contrasts highlighted in the main text become immediate. GPT-4.1’s OFF and ON polygons nearly overlap, which is the saturation pattern. DeepSeek’s polygon visibly rotates: the Jester axis grows outward while the Villager axis contracts, which is the substitution effect from Section 5.1 made visible on a single shape. Llama’s polygon is the only one where the wolf-restraint axis pulls inward under the loop, which matches its larger second-order cross-perception shift.

Appendix C Doctor protect targets and wolf night-kill victims

The Doctor self-protects almost universally, a pattern that itself reflects the cue-ambiguity problem. A Doctor protecting a Villager would need to reason about who Wolves are most likely to target, but the high- $S$ band contains both Werewolves and the Jester. Self-protection is a safe default that minimizes risk in the absence of a clear inference. The Werewolves’ choice of night-kill victim is similarly distorted. Doctors are killed more often than Villagers in every cell, which is a reasonable strategic move. Killing the Jester accidentally is also frequent (GPT-4.1 ON: 4 of 18 night-kills), and each such kill is a self-loss event.

Appendix D Mechanistic findings: per-statement event-log analysis

Going beyond aggregate cell-level metrics, we mined the per-statement event logs (3,992 deception_analysis events with paired self-tags and 8 observer reviews each) for four mechanistic insights, summarized in Figure 9.

Day-1 suspicion is destiny for the Jester.

Splitting the round-by-round observer-suspicion trajectory of the Jester by eventual outcome (Fig. 9a; full table in App. F), the win/loss separation appears at round 0 and the trajectories diverge little thereafter. For Llama-ON the round-0 mean suspicion is 0.78 in won games versus 0.69 in lost ones; for GPT-4.1 every won game already starts above 0.70. A Jester that fails to seed suspicion on the very first day rarely recovers, consistent with anchoring effects in a multi-agent vote. The pattern is most extreme for GPT-4.1: in every one of its 7 lost Jester games the Jester bid too low to speak on day 1 at all, so day-1 silence is a 100% predictor of a GPT-4.1 Jester loss in our corpus.

Each model has its own deception aesthetic.

Aggregating per-statement observer suspicion by self-declared deception type (Fig. 9b) reveals heterogeneity across judge models. GPT-4.1’s villagers register the highest suspicion against low-content tactics (omission 0.88, distortion 0.86); Deepseek’s villagers react most to fabrication and persuasive manipulation; Llama’s react most to omission and misdirection. The same Jester strategy is therefore not optimal in all environments; the self-learning loop succeeds in part by matching the Jester’s tactic distribution to the local judge model.

Wolves systematically fail multi-hop ToM.

Of 55 wolf-loss games we classify each into one of four causes (Fig. 9c). GPT-4.1 wolves lose 90–100% of their games by self-sabotage: 13 of 20 GPT-4.1 wolf games end with the wolves voting the Jester out on day 1 themselves. This is a clean behavioural signature of failed multi-hop incentive prediction: an incentive-aware wolf should recognise the Jester as a third party whose elimination ends the wolves’ game, yet GPT-4.1 wolves treat the Jester as just-another-villager. Deepseek and Llama wolves show milder versions of the same pattern but more often lose by being outnumbered (category D) or by the Jester baiting villagers into exiling it (category C).

Calibration asymmetry across roles.

The signed gap between observer-flagging rate and speaker self-tag (Fig. 9d) is consistently positive for Villagers (+0.08 to +0.21) and Doctors, and negative for Wolves and Jesters (-0.02 to -0.17). Wolves are therefore better at hiding than they think, while honest Villagers are systematically misperceived as deceptive. The triadic setup amplifies the latter bias: when a third winning condition is on the board, any unusual phrasing from a Villager looks “off,” inflating false-positive suspicion. Llama and Deepseek both have $\geq+0.17$ Villager gaps; GPT-4.1’s gap is the smallest at $+0.09$ .

Univariate predictors of Jester wins.

Across all 60 games, the strongest game-level predictor of a Jester win is whether the wolves voted for the Jester on day 1 (Pearson $r=+0.59$ ), followed by the mean suspicion-rank of vote targets ( $r=-0.41$ , i.e. Jester wins occur when villagers vote the highest-suspicion player) and the Jester’s own average attracted suspicion ( $r=+0.41$ ). Crucially the number of Jester statements is uncorrelated with winning ( $r=0.00$ ): it is how convincingly and how early, not how much, that determines the outcome.

Five further per-statement and per-game phenomena are summarised in Figure 10 (full tables in App. J).

Bidding paradox.

The relationship between the Jester’s bidding aggressiveness and the eventual outcome is model-specific (Fig. 10a). For GPT-4.1, the only winning Jesters bid $\geq 7$ on day 1; in 7 of 7 lost games the Jester bid too low to speak. For Deepseek-ON the relationship inverts: winning Jesters bid 4.6 on average on day 1 versus 8.7 for losing ones, suggesting Deepseek’s self-learning loop converges on a “stay quiet on day 1, pop later” policy. Llama’s bids are essentially flat ( $\sim 8$ ) and uninformative. There is no universal optimal bidding strategy; the self-learning loop discovers different convergent behaviours per model.

Doctors are the most under-detected role.

Aggregating per-statement deception-detection accuracy by speaker role (Fig. 10b) reveals that the Doctor is the hardest role to label correctly across all three judges (0.65–0.81), whereas Wolves and Jesters are detected at 0.80–0.93 and Villagers at 0.74–0.90. Doctors are partially deceptive – they hide a single fact (their role) while otherwise speaking truthfully – which produces a behavioural signature that does not match either “honest villager” or “lying wolf” templates. This is a structural ToM blind spot that the triadic setup happens to expose.

Self-learning helps Llama and hurts GPT-4.1 within a cell.

Sorting the ten ON-condition games per cell chronologically and plotting the cumulative Jester win-rate (Fig. 10c) reveals a striking model-dependent learning trajectory: Llama-ON climbs from 0.4 in the first half to 0.8 in the second half ( $+40$ pp), GPT-4.1-ON drops from 0.8 to 0.4 ( $-40$ pp), and Deepseek-ON drops slightly (0.8 $\to$ 0.6). One interpretation is that GPT-4.1’s accumulated learnings file grows long enough to interfere with its base policy – a long-context instruction degradation effect – whereas Llama’s weaker base policy benefits from explicit memory. This is a useful caveat for any deployment of self-reflective agent loops on top of strong base models.

Observer confidence is well-calibrated, except Llama.

Binning observer “deceptive” flags by their self-reported confidence (Fig. 10e), GPT-4.1 and Deepseek both reach a hit-rate of $\geq 0.99$ at confidence $\geq 0.95$ , closely tracking the perfect-calibration diagonal. Llama never assigns confidence above 0.95; its highest-confidence flags top out at $\sim 0.94$ hit-rate. Practically, this means a high-confidence flag from GPT-4.1 or Deepseek can be treated as essentially ground truth for downstream pipelines, whereas Llama’s calibration ceiling at $\sim 0.95$ caps its usefulness as an automated deception labeler.

Detection accuracy by observer role.

The complementary view by observer role (Fig. 10f) shows that Doctors and Jesters are the most accurate observers in the Llama and Deepseek environments, while in GPT-4.1 all roles cluster between 0.85–0.92. This is consistent with Doctors and Jesters having stronger asymmetric-information advantages – they know their own role and can therefore interpret others’ behaviour more informatively.

Appendix E Temporal dynamics of collaboration and deception

Treating each game as a time series along three axes – within-game day index, within-day statement position, and across the 10-game cell sequence – exposes a set of role-specific dynamics that the aggregate metrics in §D hide. The full panel is shown in Fig. 11; supporting tables appear in Appendix K.

Suspicion accumulates linearly for Wolves but saturates day 1 for Jesters.

Mean attracted suspicion rises for Wolves from $0.70$ on day 0 to $0.93$ on day 4 – a slow, evidence-driven accumulation. Jesters, by contrast, jump from $0.72$ to $0.80$ between day 0 and day 1 and then plateau, indicating that the Jester signature is decoded essentially in the first speech act and that any subsequent behaviour rides a near-saturated suspicion baseline. Doctors trace a U-shape ( $0.45\to 0.35\to 0.55\to 0.78$ ): cleared in the early rounds, then re-targeted late as the player pool shrinks. This is the mechanism behind the per-role survival imbalance shown in panel (f). Villagers stay flat near $0.30$ .

Deception emission is a fixed role policy, but the policy itself differs sharply across models for the Doctor.

Wolves emit $\approx 95\text{--}100\%$ deceptive statements in every round; Jesters $\approx 90\text{--}100\%$ ; Villagers $0\text{--}7\%$ . There is no in-game adaptation along this axis – no panic-lying, no cool-down. The Doctor, however, is governed by very different model-specific policies: GPT-4.1 plays Doctor at $0\%$ deception after day 1, Llama at a steady $20\text{--}33\%$ , and Deepseek oscillates between $0\%$ and $100\%$ across rounds. This three-way divergence on a single role is one of the largest cross-model behavioural gaps in our data and helps explain why Doctor survival is uniformly low: each model is failing for a different reason.

Late-day deceivers attract more suspicion – but only deceivers.

Binning each statement into early/mid/late thirds of its day, Wolves and Jesters attract increasing suspicion as they speak later (Wolf: $0.70\to 0.83$ ; Jester: $0.68\to 0.81$ ) while honest roles show no position effect (Villager $0.27/0.35/0.25$ ; Doctor $0.39/0.46/0.39$ ). The effect is the opposite of the naive expectation that more thinking time produces safer statements: observers appear to hold late speakers to a higher inferential bar, and the more polished a deceiver’s last-speaker statement is, the more flagged it gets.

The public debate is a wolf coordination channel.

Wolf vote-unanimity rises monotonically with round in every model – GPT $0.80\to 1.00$ , Deepseek $0.60\to 1.00$ , Llama $0.65\to 1.00$ . Even though the night-time kill is a private wolf-only action, the day-time debate appears to serve as a coordination channel that converges wolves on a shared target by round 2. The day-1 wolf-vote-Jester error is also universal and uniformly self-corrected, the share of wolves voting the Jester falling from $0.60\to 0.00$ in GPT by round 2 and from $0.23\to 0.00$ in Deepseek and $0.35\to 0.10$ in Llama by round 3. The mistake is recognised within a round or two, but by then the village suspicion landscape is already poisoned.

Doctor target-prediction accuracy is inverse to overall model capability.

Llama Doctors predict the actual wolf target with $62\text{--}100\%$ accuracy across rounds; GPT-4.1 Doctors degrade from $65\%$ on day 1 to $14\%$ on day 2 and $33\%$ on day 3; Deepseek oscillates noisily. The reversal is striking because GPT-4.1 wins the most games overall: a plausible interpretation is that GPT-4.1 over-reasons about decoy possibilities and second-order wolf strategy, whereas Llama applies a simpler “protect the most-suspected innocent” heuristic that survives noisy evidence better. Strong inferential machinery hurts in a setting with high observation noise and a short horizon.

Cross-game role-specific drift exists even without explicit learning.

The OFF condition has no learnings file, yet several role-model pairs drift across the 10-game cell sequence: Llama Wolf survival rises from $0.50$ early to $0.75$ in games 7–10, Deepseek Villager survival drops from $\geq 0.80$ to $0.40$ at game 9, and GPT-4.1 Jester survives in $0/10$ games regardless of position. The Llama Wolf drift suggests an implicit cell-position effect, possibly via prompt-cache-warming or evaluator-side variation; we flag this as a confound for future cross-game analyses. The two zero-row results (Llama Doctor $0/10$ survival, GPT-4.1 Jester $0/10$ survival) identify the two most-broken role-model pairings in the benchmark.

Appendix F Detailed deep-analysis tables

The following tables back the mechanistic findings of Sec. D.

F.1 Jester suspicion trajectory by round

The table tracks the mean observer-suspicion of the Jester across game rounds, separating cells by eventual outcome. Three patterns are visible. First, suspicion of the Jester rises monotonically with round in nearly every cell – this is mechanical, since by round 1 each observer has seen multiple Jester statements and accumulated evidence. Second, the won-vs-lost separation already exists at round 0 in every cell that has both classes (Llama-OFF: 0.69 vs 0.53; Llama-ON: 0.78 vs 0.69), giving the day-1 prior almost all of the predictive content. Third, the gpt-4.1 rows are notably sparse – the model has no row at all for “lost” games at round 0, because in every one of its 7 lost Jester games the Jester bid below the speaking threshold and never produced a day-1 statement to be analysed. Day-1 silence is therefore not just correlated with losing for GPT-4.1; it is its complete characterisation.

Mean observer-suspicion of the Jester aggregated by model, condition, round, and eventual outcome (jester_won). $n$ is the number of (game, round) cells with at least one observer review of a Jester statement.

Model	Cond	Round	Won?	Mean susp	$n$
deepseek-v3.1	off	0	0	0.673	7
deepseek-v3.1	off	0	1	0.751	3
deepseek-v3.1	off	1	0	0.841	6
deepseek-v3.1	off	1	1	0.737	2
deepseek-v3.1	on	0	0	0.655	3
deepseek-v3.1	on	0	1	0.640	7
deepseek-v3.1	on	1	0	0.837	2
deepseek-v3.1	on	1	1	0.812	5
gpt-4.1	off	0	1	0.723	7
gpt-4.1	on	0	1	0.806	6
llama-3.3-70b	off	0	0	0.530	6
llama-3.3-70b	off	0	1	0.685	4
llama-3.3-70b	off	1	0	0.715	3
llama-3.3-70b	off	1	1	0.763	3
llama-3.3-70b	on	0	0	0.685	4
llama-3.3-70b	on	0	1	0.776	6
llama-3.3-70b	on	1	0	0.749	3
llama-3.3-70b	on	1	1	0.852	3

F.2 Per-deception-type yield (mean observer suspicion of Jester)

For each Jester statement we record the speaker’s self-declared deception type and the mean suspicion the statement attracts from the observers in the same round. The table groups by (model, condition, type) and reports both the mean observer suspicion and the fraction of observers who flagged the statement as deceptive. Three observations stand out. (i) The tactic mix differs sharply across models: Deepseek-OFF Jesters lean heavily on fabrication ( $n=55$ of 88 statements), GPT-4.1-ON Jesters lean on distortion ( $n=38$ of 56), and Llama Jesters lean on misdirection ( $n=63$ of 110 in the ON cell). (ii) The yield of each tactic is also model- specific: GPT-4.1’s villagers are most-suspicious of low-content tactics (omission 0.88, distortion 0.86) and least-suspicious of explicit misdirection (0.36) – the opposite of Llama, where misdirection actually attracts the highest suspicion (0.79). (iii) Truthful Jester statements (none type) draw genuinely low suspicion (0.18 for GPT-4.1-OFF, 0.59 for Llama-OFF), confirming that the deception signal is real and not noise: when the Jester is honest, the villagers can tell.

Model / cond	Deception type	$n$	Mean susp	Flag rate
deepseek-v3.1, off
	persuasive_manipulation	5	0.847	1.00
	strategic_deception	9	0.822	0.94
	fabrication	55	0.798	0.93
	distortion	11	0.775	0.93
	misdirection	6	0.677	0.81
deepseek-v3.1, on
	fabrication	47	0.789	0.94
	misdirection	3	0.780	1.00
	distortion	12	0.769	0.94
	strategic_deception	13	0.674	0.78
gpt-4.1, off
	omission	5	0.877	0.90
	strategic_deception	23	0.790	0.84
	distortion	31	0.775	0.91
	misdirection	3	0.363	0.29
	none	3	0.183	0.08
gpt-4.1, on
	distortion	38	0.860	0.98
	strategic_deception	14	0.804	0.91
	misdirection	2	0.256	0.19
llama-3.3-70b, off
	distortion	3	0.786	0.76
	misdirection	38	0.705	0.88
	omission	6	0.663	0.76
	strategic_deception	32	0.651	0.76
	none	17	0.591	0.70
llama-3.3-70b, on
	omission	1	0.842	1.00
	none	8	0.796	0.91
	misdirection	63	0.786	0.93
	strategic_deception	30	0.755	0.89
	persuasive_manipulation	3	0.700	0.83

F.3 Speaker-vs-observer deception calibration

The table compares each speaker’s self-tagged deception rate with the mean rate at which observers flagged that speaker’s statements as deceptive. The signed gap (calib_gap) summarises whether the role is over- or under-perceived. Three patterns emerge. (i) Wolves and Jesters, who self-tag as deceptive on $\geq 80\%$ of statements, are nonetheless slightly under-detected – their gaps are consistently negative ( $-0.02$ to $-0.17$ ). The largest negative gap is GPT-4.1’s Werewolf at $-0.17$ , meaning GPT-4.1 wolves are particularly good at hiding their deception. (ii) Villagers and Doctors exhibit the opposite pathology: positive gaps of $+0.08$ to $+0.21$ , i.e. observers attribute deception to honest players much more often than the speakers themselves admit it. The Llama-OFF Villager gap of $+0.21$ is the largest – 21% of honest villager statements are flagged as deceptive by some observer. (iii) Across every cell, GPT-4.1 has the smallest Villager gap ( $+0.09$ ON, $+0.08$ OFF), indicating that GPT-4.1 villagers are best at recognising honest content from each other. Combined, these patterns describe a systematic bias: in a triadic game, observers over-attribute deception to honest play and under-detect deception by skilled liars.

self_rate = fraction of statements the speaker tagged deceptive. obs_rate = mean fraction of observers who flagged each statement deceptive. calib_gap = obs_rate $-$ self_rate (positive = speaker under-reports their own deception relative to peer perception).

Model	Cond	Role	$n$	self_rate	obs_rate	calib_gap
deepseek-v3.1	off	Doctor	95	0.421	0.557	$+0.136$
deepseek-v3.1	off	Jester	88	1.000	0.922	$-0.078$
deepseek-v3.1	off	Villager	391	0.018	0.225	$+0.207$
deepseek-v3.1	off	Werewolf	187	1.000	0.938	$-0.062$
deepseek-v3.1	on	Doctor	67	0.448	0.570	$+0.122$
deepseek-v3.1	on	Jester	77	1.000	0.905	$-0.095$
deepseek-v3.1	on	Villager	334	0.012	0.154	$+0.142$
deepseek-v3.1	on	Werewolf	157	1.000	0.937	$-0.063$
gpt-4.1	off	Doctor	54	0.074	0.103	$+0.029$
gpt-4.1	off	Jester	66	0.955	0.822	$-0.133$
gpt-4.1	off	Villager	261	0.000	0.079	$+0.079$
gpt-4.1	off	Werewolf	97	0.990	0.818	$-0.172$
gpt-4.1	on	Doctor	51	0.176	0.251	$+0.075$
gpt-4.1	on	Jester	56	1.000	0.920	$-0.080$
gpt-4.1	on	Villager	315	0.000	0.092	$+0.092$
gpt-4.1	on	Werewolf	122	1.000	0.855	$-0.145$
llama-3.3-70b	off	Doctor	83	0.301	0.505	$+0.204$
llama-3.3-70b	off	Jester	96	0.823	0.796	$-0.027$
llama-3.3-70b	off	Villager	419	0.053	0.264	$+0.211$
llama-3.3-70b	off	Werewolf	191	0.906	0.874	$-0.032$
llama-3.3-70b	on	Doctor	85	0.153	0.277	$+0.124$
llama-3.3-70b	on	Jester	110	0.927	0.910	$-0.017$
llama-3.3-70b	on	Villager	391	0.074	0.278	$+0.204$
llama-3.3-70b	on	Werewolf	199	0.965	0.906	$-0.059$

F.4 Wolf-loss failure-mode taxonomy

This taxonomy splits every wolf loss into one of four mechanisms. Categories A and B are wolf self-sabotage, the day-1 vote of the Jester (A, which hands the Jester the win) and the night-kill of the Jester (B, which wastes the kill and lets the Villagers win), while C is the Jester being voted out by the villagers with the wolves uninvolved (a Jester win) and D is the villagers outnumbering the wolves (a Villager win). The counts are promoted to the main paper as Table 6 (Section 5.4), where the category letters are defined, and they also appear as Fig. 9c. Read per cell, the table puts GPT-4.1 almost entirely in the self-sabotage columns, with A and B accounting for all ten of its OFF games and nine of its ten ON games and nothing left for categories C or D, so its wolves essentially never lose for any reason other than acting against the Jester themselves. DeepSeek-V3.1 shifts toward category C under self-learning, from two such games OFF to five ON while its category-D losses fall from five to one, so its losses move from the villagers outnumbering the wolves toward the Jester baiting the villagers into exiling it, the same redistribution that lifts its Jester win-rate from $0.30$ to $0.70$ . Llama-3.3-70B spreads across all four categories in both conditions and is the only cell to record two wolf wins (ON), the diffuse profile one expects from a model that plays every side less sharply.

F.5 Univariate predictors of Jester win

Feature	$r$	Won	Lost	$n$
Wolf voted Jester day 1	$+0.587$	0.688	0.107	60
Vote-suspicion mean rank	$-0.414$	1.927	2.767	60
Jester avg attracted susp	$+0.413$	0.770	0.704	53
Jester deception rate	$+0.358$	6.751	6.109	53
Jester # statements	$+0.003$	9.312	9.286	53

Table 8: Univariate predictors of Jester wins across all 60 games.

r

= Pearson correlation with Jester win; “Won” and “Lost” are feature means in won vs. lost games.

Table 8 reports univariate Pearson correlations between five per-game features and the binary Jester-win outcome across all 60 games. The single strongest predictor is whether the Wolves voted for the Jester on day 1 ( $r{=}{+}0.587$ ): in games the Jester won, this happened in 69% of cases versus 11% in lost games. The second-strongest signal is the mean suspicion-rank of vote targets ( $r{=}{-}0.414$ ): Jester wins when Villagers concentrate their votes on the single highest-suspicion player, since by construction that player is as plausibly the Jester as a Wolf. The Jester’s own average attracted suspicion ( $r{=}{+}0.413$ ) and the rate at which the Jester self-tags statements as deceptive ( $r{=}{+}0.358$ ) both predict positively but more weakly. The null result is that the raw number of Jester statements is uncorrelated with winning ( $r{=}{+}0.003$ , with feature means $9.31$ vs $9.29$ in won vs lost games), so the predictor structure is about how convincing and how early the Jester’s contributions are, not how many it makes. Sample size drops from 60 to 53 for the three Jester-side features because seven games have no Jester-tagged turns (the Jester is killed on night 1 or never speaks).

F.6 Jester self-learning entries: full ToM example set

Table 9 expands the single-row-per-order sample of Table 2 into the full set of representative entries, so the qualitative gap behind the ToM-order counts of Section 5.6 can be read directly. The two DeepSeek-V3.1 rows are the only third-order entries that name both audiences at once, appearing wolf-like while avoiding villager-alignment, which is the contrastive structure the regex tagger keys on. The GPT-4.1 rows step down a level, from a single third-order ambiguity entry to second-order “appear erratic” framing to a bare first-order imperative, and both Llama-3.3-70B rows stay at orders 1–2 with no cross-faction contrast. The verbatim wording makes clear that the order labels track a real difference in what each model represents about its observers, not a labelling artefact.

Model	ToM	Entry (verbatim)
DeepSeek-V3.1	3	Vote for a player who is not the main suspect (like Joy) to appear wolf-like by deflecting from the consensus, but only after the main suspect is already exiled to avoid appearing villager-aligned.
DeepSeek-V3.1	3	Maintain a vote on a non-consensus target during revote to appear stubbornly wolf-like and avoid villager alignment.
GPT-4.1	3	Lean into accusations of being both Werewolf and Jester to create ambiguity and prevent players from settling on my true role.
GPT-4.1	2	Escalate suspicion on multiple players in quick succession to appear erratic and untrustworthy.
GPT-4.1	1	Use repeated questioning of players already under scrutiny to amplify suspicion without initiating new accusations.
Llama-3.3-70B	2	Redirect suspicion towards other players to avoid being the center of attention.
Llama-3.3-70B	1	Use the initial questions and reactions to gauge players’ behaviors and adjust strategy accordingly.

Table 9: Full set of representative Jester self-learning entries sampled verbatim from the per-model running lessons file, expanding Table 2 (main). ToM order: 1 = bare imperative; 2 = references how others perceive the agent; 3 = explicit contrast across two faction-perceptions (e.g. wolf-like not villager-aligned). Only DeepSeek-V3.1 reliably emits ToM-3 contrastive entries; Llama remains at orders 1–2.

F.7 Doctor non-self-protect events

The Doctor self-protects on $105$ of its $112$ protect actions across the corpus, a $94\%$ self-protect rate, so the nights where it shields another player are rare enough to enumerate one by one. Table 10 gives the full per-cell split that backs the protect-target distribution of Fig. 8(a) (App. C). Three of the six cells, GPT-4.1 in both conditions and Llama-3.3-70B OFF, self-protect on $100\%$ of nights and never once shield a teammate, so the entire deviation from self-protection comes from DeepSeek-V3.1 and Llama-3.3-70B ON. The seven non-self events are split four to Llama-3.3-70B ON and three to DeepSeek-V3.1, and the shielded player is most often the Jester or an ordinary Villager, with one Llama ON night spent protecting a Werewolf. Even this handful tilts toward the high-suspicion roles rather than a reasoned read of the wolves’ likely target, the same cue-driven default that the self-protect majority reflects. We report the raw counts rather than a conditional win-rate because seven events are far too few to estimate an outcome effect.

Model	Cond	Self-protect	Other	Role shielded when not self
deepseek-v3.1	off	22	1	Villager ${\times}1$
deepseek-v3.1	on	16	2	Jester ${\times}1$ , Villager ${\times}1$
gpt-4.1	off	15	0	—
gpt-4.1	on	15	0	—
llama-3.3-70b	off	19	0	—
llama-3.3-70b	on	18	4	Jester ${\times}3$ , Werewolf ${\times}1$

Table 10: Doctor protect-target distribution per cell. “Self-protect” and “Other” count protect events (not distinct games); the last column names the shielded role on the non-self nights. Across all cells the Doctor self-protects on

105/112

nights (

94\%

), and only DeepSeek-V3.1 and Llama-3.3-70B ON ever deviate. Numeric companion to Fig. 8(a).

Appendix G Voting pathology: full plot

Figure 12 backs the wolf-self-sabotage and vote-alignment claims of Sec. 5.4.

The two panels separate the two halves of the failure. Panel (a) shows that day-1 wolf-vote-Jester stays high or rises under the loop in every model, so the self-learning intervention edits only the Jester’s prompt and never teaches the Wolves to stop conceding the game, which is why the Werewolf win-rate stays pinned near zero throughout. Panel (b) explains why GPT-4.1 is the extreme case, its voters sit nearest rank 1 and so commit hardest to whichever single player tops their suspicion order, exactly the one-dimensional cue-following that exiles the Jester, whereas Llama spreads its votes across the top-2 to top-3 and lands the dominated vote less mechanically. Tighter suspicion alignment therefore coincides with more self-sabotage, the opposite of what an incentive-aware voter would produce, which is the cue-sufficiency reading of Sec. 5.4.

Appendix H Substitution and Jester fate: full plot

Figure 13 backs the substitution claim of Sec. 5.1 and the night-kill self-sabotage claim of the Jester-fate subsection. Panel (a) shows the diverging $\Delta$ (ON $-$ OFF) win-rate per faction; panel (b) shows the per-game Jester fate.

Panel (a) isolates who pays for the Jester’s learning gains. The Villager bar drops by $0.30$ in both DeepSeek-V3.1 and Llama-3.3-70B while the Werewolf bar barely moves off zero, so the loop transfers wins from the Villagers to the Jester and leaves the already-collapsed Wolves untouched, the per-faction form of the substitution claim in Sec. 5.1. Panel (b) decomposes the Jester outcome into its three terminal states and shows that a sizeable killed_at_night slice persists in every cell, each instance a Wolf night-kill of the Jester that ends the game in a triadic loss rather than a Wolf win. The voted_out share tracks the Jester-win column of Figure 2 as the rules require, so the fate breakdown is a consistency check as much as a result.

Appendix I Manipulation rates and suspicion attracted: full plot

Figure 14 backs the claim of Sec. 5.1–5.1 (and the operational cue-ambiguity statement in Sec. 5.4) that Jester and Werewolf attract statistically indistinguishable suspicion despite the Jester producing more deceptive statements per turn.

The two halves of the dumbbell figure make the cue-ambiguity property concrete. On the left, the Jester out-produces the Werewolf in peer-detected deceptions per statement in every model, so the Jester is by this measure the more visibly deceptive of the two high-suspicion roles. On the right, the suspicion the two attract is nonetheless nearly identical, both sitting in the $0.65$ – $0.81$ band across all six cells and well clear of the $0.22$ – $0.51$ that Villagers and the Doctor draw. A single observer reading off attracted suspicion therefore cannot tell the Jester from a Werewolf, which is precisely the ambiguity an incentive-aware voter would have to resolve and the empirical basis for treating the Jester win-rate as a multi-hop ToM probe.

Appendix J Auxiliary findings: detailed tables

The five auxiliary findings reported under Sec. D are backed by the following per-cell tables.

J.1 Bidding economics (per-game Jester bids)

Table 11 breaks the per-game Jester bids of Fig. 10a down by model, condition, and eventual outcome, and it shows that no single bidding level is universally optimal. GPT-4.1 has no losing row at all because its seven Jester losses are exactly the seven games where the Jester never cleared the speaking threshold on day 1, so for GPT-4.1 the relevant variable is whether the Jester speaks rather than how aggressively it bids. DeepSeek-V3.1 ON runs the opposite policy, its winning Jesters open with a mean first-bid of $4.57$ against $8.67$ for losing ones, the quantitative form of the “stay quiet on day 1, pop later” strategy that its self-learning loop converges on. Llama-3.3-70B sits at the degenerate end, bidding a flat $8.00$ on every first move in all four cells, so its bids carry no information about the outcome.

Model	Cond	Won?	Mean avg-bid	Mean first-bid	Mean max-bid	$n$
deepseek-v3.1	off	0	8.18	5.29	9.71	7
deepseek-v3.1	off	1	8.55	6.33	9.67	3
deepseek-v3.1	on	0	8.49	8.67	9.33	3
deepseek-v3.1	on	1	7.96	4.57	9.43	7
gpt-4.1	off	1	8.20	7.00	9.86	7
gpt-4.1	on	1	8.65	7.00	9.83	6
llama-3.3-70b	off	0	7.91	8.00	8.00	6
llama-3.3-70b	off	1	7.96	8.00	8.00	4
llama-3.3-70b	on	0	7.77	8.00	8.00	4
llama-3.3-70b	on	1	7.94	8.00	8.00	6

Table 11: Per-game Jester bids by model, condition, and outcome. GPT-4.1 has no Won?=0 row because its 7 Jester losses all occurred in games where the Jester never bid above the speaking threshold (silence on day 1). Llama bids are essentially flat at 8 in all cells.

J.2 Detection accuracy by speaker role (weighted by $n$ )

Table 12 gives the per-statement detection numbers behind Fig. 10b, weighted by the statement count in each cell. The Doctor is the hardest speaker to label in every model, with accuracy $0.647$ – $0.808$ against $0.797$ – $0.931$ for Wolves and Jesters, because the Doctor hides a single fact while otherwise speaking honestly and so matches neither the honest-villager nor the lying-wolf template. The Villager rows carry a high accuracy but a near-zero F1 (0.000 for GPT-4.1, 0.074 for DeepSeek-V3.1) because Villagers self-tag as deceptive in under $5\%$ of statements, leaving the positive class almost empty, so accuracy rather than F1 is the meaningful column for that row. Read together, the table says the deception detectors are strong on the overtly deceptive roles and weakest on exactly the role whose deception is partial.

Model	Speaker role	$n$ statements	Accuracy	F1
deepseek-v3.1	Doctor	1109	0.667	0.659
deepseek-v3.1	Jester	1151	0.904	0.949
deepseek-v3.1	Villager	4948	0.809	0.074
deepseek-v3.1	Werewolf	2394	0.931	0.964
gpt-4.1	Doctor	667	0.808	0.368
gpt-4.1	Jester	976	0.887	0.938
gpt-4.1	Villager	3772	0.904	0.000
gpt-4.1	Werewolf	1397	0.809	0.894
llama-3.3-70b	Doctor	1220	0.647	0.432
llama-3.3-70b	Jester	1479	0.797	0.881
llama-3.3-70b	Villager	5102	0.740	0.245
llama-3.3-70b	Werewolf	2588	0.848	0.917

Table 12: Per-statement deception-detection accuracy and F1 by speaker role, weighted by statement count. F1 for the Villager row is near 0 because Villagers self-tag as deceptive in

<5\%

of statements, so the positive class is essentially empty – accuracy is the appropriate metric for this row.

J.3 Within-cell learning curve (first 5 vs second 5 games)

Table 13 splits each cell’s ten chronological games into the first and second half, giving the numbers behind the cumulative curves of Fig. 10c. The two ON cells move in opposite directions, Llama-3.3-70B climbing from $0.40$ to $0.80$ ( $+40$ pp) while GPT-4.1 falls from $0.80$ to $0.40$ ( $-40$ pp), so the self-learning loop helps the weaker base policy and hurts the stronger one within a single run. DeepSeek-V3.1 ON drifts down more gently ( $0.80\to 0.60$ ). The OFF cells, which have no learnings file, drift too (Llama $+0.40$ , DeepSeek $+0.20$ , GPT-4.1 $-0.20$ ), so part of the trend is cell-position noise rather than learning, which is why we read the ON-minus-OFF contrast rather than the raw ON slope as the learning effect.

Model	Cond	First 5 win-rate	Second 5 win-rate	$\Delta$	Overall
deepseek-v3.1	off	0.20	0.40	$+0.20$	0.30
deepseek-v3.1	on	0.80	0.60	$-0.20$	0.70
gpt-4.1	off	0.80	0.60	$-0.20$	0.70
gpt-4.1	on	0.80	0.40	$\boldsymbol{-0.40}$	0.60
llama-3.3-70b	off	0.20	0.60	$+0.40$	0.40
llama-3.3-70b	on	0.40	0.80	$\boldsymbol{+0.40}$	0.60

Table 13: Within-cell trends in Jester win-rate over the ten chronological games. Llama-ON learns (

+40

pp); GPT-4.1-ON degrades (

-40

pp).

J.4 Wolf day-1 vote coordination

Table 14 expands the day-1 vote panel of Fig. 10d into per-cell rates. Wolf voting is well coordinated, with unanimity between $0.40$ and $0.90$ , yet that coordination is frequently aimed at the wrong target, the wolves landing a vote on the Jester in $0.20$ – $0.70$ of games and most often under GPT-4.1 ( $0.70$ OFF, $0.60$ ON). The Voted-Jester rate falls under self-learning for all three models (DeepSeek $0.30\to 0.20$ , GPT-4.1 $0.70\to 0.60$ , Llama $0.50\to 0.30$ ), measured on the first day-1 ballot rather than the post-revote ballot used by Fig. 12. The Wolves-won column is $0.00$ in every row, so none of the five wolf wins in the full corpus came from a game in which the wolves had voted the Jester on day 1, tying the self-sabotage vote directly to the absence of wolf wins.

Model	Cond	Unanimous	Voted Jester	Voted partner	$n$
deepseek-v3.1	off	0.80	0.30	0.00	10
deepseek-v3.1	on	0.40	0.20	0.30	10
gpt-4.1	off	0.70	0.70	0.10	10
gpt-4.1	on	0.90	0.60	0.10	10
llama-3.3-70b	off	0.60	0.50	0.30	10
llama-3.3-70b	on	0.70	0.30	0.30	10

Table 14: Wolf coordination is high (40–90% unanimous), wolf-vote-Jester is also high (especially for GPT-4.1,

\geq 60\%

), and wolf wins are 0 in this slice – the five wolf wins of the 60-game corpus did not occur in a game where wolves voted for the Jester on day 1.

J.5 Observer-confidence calibration

Table 15 bins every observer deception-flag by its self-reported confidence and reports the hit-rate per bin, the numeric form of Fig. 10e. GPT-4.1 and DeepSeek-V3.1 are well-calibrated at the top, reaching a hit-rate of $1.00$ and $0.99$ respectively in the $[0.95,1.01)$ bin, so a near-maximal-confidence flag from either model can be trusted as ground truth downstream. Llama-3.3-70B never populates that bin at all, its confidence topping out in $[0.85,0.95)$ at a $0.94$ hit-rate, so its flags cannot be thresholded as tightly. All three models track the diagonal in the lower bins, where a confidence near $0.5$ – $0.7$ corresponds to a hit-rate near $0.24$ – $0.39$ , so the calibration is monotone and the only deficiency is Llama’s missing high-confidence regime.

Model	Confidence bin	$n$ flags	Hit-rate
gpt-4.1	$[0.00,0.50)$	15	0.27
gpt-4.1	$[0.50,0.70)$	193	0.24
gpt-4.1	$[0.70,0.85)$	578	0.53
gpt-4.1	$[0.85,0.95)$	782	0.97
gpt-4.1	$[0.95,1.01)$	911	1.00
deepseek-v3.1	$[0.50,0.70)$	46	0.33
deepseek-v3.1	$[0.70,0.85)$	1778	0.50
deepseek-v3.1	$[0.85,0.95)$	2282	0.90
deepseek-v3.1	$[0.95,1.01)$	725	0.99
llama-3.3-70b	$[0.50,0.70)$	435	0.39
llama-3.3-70b	$[0.70,0.85)$	3767	0.62
llama-3.3-70b	$[0.85,0.95)$	1276	0.94

Table 15: Llama never assigns confidence above 0.95 to a deception flag (

n=0

in the rightmost bin), capping its calibration ceiling.

Appendix K Temporal dynamics: detailed tables

This appendix provides the full tables that back the temporal-dynamics plots in §E. All values are computed from the same 60-game corpus as the rest of the paper.

K.1 Suspicion attracted by role $\times$ round

Role	R0	R1	R2	R3	R4
Doctor	0.45	0.35	0.36	0.55	0.78
Jester	0.72	0.80	0.80	0.80	–
Villager	0.31	0.25	0.29	0.38	0.34
Werewolf	0.70	0.85	0.85	0.85	0.93

Table 16: Mean observer-attracted suspicion (

0

–

1

) by role and round, averaged over both conditions and all three models. Wolves accumulate suspicion linearly; Jesters saturate by day 1; Doctors trace a U-shape (cleared mid-game, re-targeted late as the player pool shrinks); Villagers stay flat.

The Wolf row rises monotonically by $+0.23$ from R0 to R4: observers genuinely accumulate evidence over time. The Jester row, by contrast, gains only $+0.08$ between R0 and R1 and is then flat – the Jester “signature” is essentially decoded in the first speech act. The Doctor U-shape is the most striking pattern: in mid-game, surviving Doctors are below-average suspicion ( $0.35$ – $0.36$ , lower than Villagers in R0), but by R3–R4 they spike to $0.55$ – $0.78$ , almost matching Wolves. This is the proximate cause of the universally low Doctor survival reported in Table 20.

K.2 Self-tagged deception rate by model $\times$ role $\times$ round

Model	Role	R0	R1	R2	R3	R4
deepseek-v3.1	Doctor	0.49	0.27	0.52	0.00	1.00
gpt-4.1	Doctor	0.21	0.00	0.00	–	–
llama-3.3-70b	Doctor	0.24	0.20	0.22	0.33	–

Table 17: Per-statement self-tagged deception rate for the Doctor role across rounds. Wolf, Jester and Villager rows are essentially flat across rounds (Wolf and Jester

\geq 0.85

, Villager

\leq 0.08

for all models); per-condition aggregates for all four roles are in the calibration table of Appendix F. The Doctor row shows the largest cross-model divergence in the dataset.

The Wolf, Jester and Villager self-tag rates are essentially a fixed role policy that does not adapt across rounds for any model; the aggregate values appear in the calibration table of Appendix F. The Doctor, however, splits three ways. GPT-4.1 plays Doctor as a near-clean Villager ( $0.21\to 0$ ); Llama plays Doctor as a steady $\approx 25\%$ deceiver, hiding role information in roughly one in four statements; Deepseek oscillates between $0\%$ and $100\%$ , sometimes lying as much as a Wolf. Each of these three policies fails in its own way (Doctor survival is uniformly poor across all models), but the failure modes differ: GPT Doctors die from being visible, Llama Doctors die from being flagged as liars, Deepseek Doctors die from being inconsistent.

K.3 Within-day position effect on suspicion

Role	early third	mid third	late third
Doctor	0.39	0.46	0.39
Jester	0.68	0.80	0.81
Villager	0.27	0.35	0.25
Werewolf	0.70	0.80	0.83

Table 18: Mean attracted suspicion as a function of position within the day (statements binned into thirds). Only deceivers (Wolf, Jester) show a monotone rise with later position.

The asymmetry is the message: the late-speaker penalty exists only for roles that lie. A naive prediction would be that more thinking time produces safer statements; the data show the opposite for deceivers and a flat or U-shape for honest roles. Observers appear to hold late speakers to a higher inferential bar – precisely the speakers who have had the most time to refine their cover, and whose refinements are therefore the most diagnostic.

K.4 Per-round coordination metrics by model

Model	Rd	unanim	vote-J	cons.	corr.
deepseek-v3.1	0	0.60	0.23	0.82	0.85
deepseek-v3.1	1	0.88	0.32	0.77	0.47
deepseek-v3.1	2	1.00	0.30	0.78	0.75
deepseek-v3.1	3	1.00	0.00	0.80	0.00
gpt-4.1	0	0.80	0.60	0.82	0.65
gpt-4.1	1	0.75	0.13	0.76	0.14
gpt-4.1	2	1.00	0.00	0.71	0.33
gpt-4.1	3	1.00	0.00	0.67	–
llama-3.3-70b	0	0.65	0.35	0.77	1.00
llama-3.3-70b	1	0.75	0.28	0.76	0.62
llama-3.3-70b	2	0.83	0.04	0.75	0.86
llama-3.3-70b	3	1.00	0.10	0.75	1.00

Table 19: Per-round coordination. unanim =

P(\text{wolves vote same target})

; vote-J = mean fraction of wolves whose vote landed on the Jester; cons. = top-vote share in the daily exile vote; corr. =

P(\text{Doctor protected the player wolves attempted to kill})

Three patterns dominate. (i) Wolf unanimity rises monotonically with round in every model – the public debate functions as a coordination channel even though the night-kill is private. (ii) The wolf-vote-Jester error is universal on day 1 (GPT $0.60$ , Llama $0.35$ , Deepseek $0.23$ ) and self-corrects within a few rounds, reaching $\leq 0.10$ by round 2 for GPT and Llama and by round 3 for Deepseek, but the village suspicion landscape it creates is already poisoned. (iii) Doctor target-prediction anti-correlates with overall model strength: Llama, the weakest overall, is the strongest Doctor predictor ( $0.62$ – $1.00$ across rounds), while GPT-4.1, the strongest overall, collapses from $0.65$ on day 1 to $0.14$ on day 2. We hypothesise that GPT-4.1 over-reasons about decoy possibilities, while Llama applies a simpler “protect the most-suspected innocent” rule that survives observation noise better.

K.5 Cross-game role survival drift (game index 1–10)

Model	Role	1	2	3	4	5	6	7	8	9	10
deepseek-v3.1	Doctor	0.0	0.0	0.5	0.0	0.0	0.0	0.0	0.5	0.0	0.0
deepseek-v3.1	Jester	0.5	0.5	0.0	0.5	0.5	0.5	0.0	0.0	0.0	0.5
deepseek-v3.1	Villager	0.9	0.9	0.8	1.0	0.8	0.8	1.0	0.9	0.4	0.8
deepseek-v3.1	Werewolf	0.0	0.3	1.0	0.3	0.3	0.3	0.5	0.5	0.8	0.3
gpt-4.1	Doctor	0.5	0.0	0.0	0.0	0.0	0.0	1.0	0.5	0.5	0.0
gpt-4.1	Jester	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
gpt-4.1	Villager	0.7	1.0	1.0	1.0	0.9	0.6	0.8	0.7	0.9	1.0
gpt-4.1	Werewolf	0.5	1.0	1.0	1.0	0.5	0.8	0.0	0.5	0.5	1.0
llama-3.3-70b	Doctor	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
llama-3.3-70b	Jester	0.0	0.0	0.5	0.5	0.5	0.0	1.0	0.0	0.0	0.0
llama-3.3-70b	Villager	0.7	0.6	0.7	0.5	0.4	0.7	1.0	1.0	0.8	0.9
llama-3.3-70b	Werewolf	0.5	0.5	0.3	0.3	0.3	0.5	0.0	1.0	0.8	0.8

Table 20: Per-role survival rate at game end, averaged across both conditions, by chronological game index within each cell.

Two zero-rows identify the most broken role-model pairings in the benchmark: Llama Doctor survives $0/10$ games and GPT-4.1 Jester survives $0/10$ games. The Llama Wolf row drifts upward in late games ( $0.5\to 0.8$ over games 7–10) even in the OFF condition where there is no learnings file, suggesting an implicit cell-position effect we flag as a confound for future cross-game analyses. The Deepseek Villager drop at game 9 ( $0.40$ ) is an isolated anomaly; we report it to encourage replication rather than to draw a conclusion.

Appendix L External validation of the ToM-order tagger

The 1st/2nd/3rd-order tags in Section 5.6 are produced by the rule-based regex tagger of App. N. To check that the headline ordering does not depend on that tagger, we re-label all $141$ accumulated Jester learning entries with two independent LLM judges (DeepSeek-V3.1 and Llama-3.3-70B), each given the same definitions used by the regex rules. Table 21 reports the per-source-model rate of third-order contrastive entries under all three labelers. The absolute percentages are judge-dependent — entry-level agreement with the regex labels is near chance (Cohen’s $\kappa=-0.005$ for the DeepSeek judge, $-0.054$ for the Llama judge), because the LLM judges count any audience-aware phrasing as at least second-order whereas the regex requires an explicit perception verb. The rank order across source models, however, is identical for all three labelers: DeepSeek-V3.1 $>$ GPT-4.1 $>$ Llama-3.3-70B. The mechanism claim in Section 5.6 rests on this ordering, not on the absolute value $15.4\%$ .

Source model	regex	DeepSeek-judge	Llama-judge
GPT-4.1	4.8%	9.5%	11.9%
DeepSeek-V3.1	15.4%	26.9%	38.5%
Llama-3.3-70B	0.0%	2.1%	0.0%

Table 21: Share of third-order contrastive Jester learning entries per source model under three independent labelers (

141

entries total). Absolute values differ, but the rank order DeepSeek-V3.1

>

GPT-4.1

>

Llama-3.3-70B is preserved by every labeler.

Appendix M Runtime configuration

Table 22 records the engine settings that govern every run, emitted by the released configuration-dump script directly from the code paths. We surface it so that the behavioral findings can be read against the exact decoding and game parameters that produced them.

Parameter	Value
Temperature	1.0 (every call)
top_p / max_tokens / stop	provider default / unset / unset
Bid range	integer $0$ – $10$
Bid parse fallback	silent $0$ on parse failure
max_debate_turns	12
max_explanation_turns	6
Recursion limit	1000
Context handling	full history, no truncation
Self/peer judges	same model as speaker (intra-population)
Per-call seed	not exposed by the API (statistical, not bit-exact, reproducibility)

Table 22: Runtime configuration shared by all cells. The silent bid-parse fallback and the intra-population deception judges (speaker and observer are the same model) are the two settings most relevant to interpreting the behavioral findings.

Appendix N Lesson-tagging script

The strategy-category and ToM-order tags of Section 5.6 (Fig. 4) are produced by a single rule-based tagger. Its input is the set of 141 accumulated lessons: each lesson is one free-text string drawn from the to_do, to_not_do, and winning_tactics lists of a model’s per-model learning JSON. Every lesson is lowercased and matched against two independent pattern sets with Python’s re module.

Strategy categories (multi-label).

A lesson receives every category whose pattern list contains at least one match, so a single lesson can carry several category tags and the per-category percentages in Fig. 4b need not sum to 100%. The nine categories and their verbatim pattern lists, together with the tagging function, are given in the box below.

 

ToM order (single-label, strict precedence).


A lesson is tagged
3rd-order if it matches any third-order pattern, else 2nd-order if it matches
any second-order pattern, else 1st-order. The third-order patterns require an
explicit cross-faction contrast (e.g. wolf-like …not
villager); this is exactly what separates a multi-hop incentive prediction
from a bare audience-aware statement, and is why a 2nd-order “appear
suspicious” lesson does not qualify as 3rd-order.


 The distinctive-vocabulary counts cited in
Section 5.6 come from the same script: per-model
lesson text is tokenised with the word pattern [a-z][a-z\-]+,
a stop-list of function words and game terms (jester, wolf,
villager, player, round, …) is removed, and
terms are ranked by TF-IDF excess of one model’s corpus over the other two.
By this measure DeepSeek-V3.1’s top distinctive terms are
wolf-like (22 occurrences), appear (23),
deflection (16) and villager-aligned, the lexical
fingerprint of its wolf_mimicry-heavy policy.
Because the tagger is purely lexical, it is deterministic and reproducible;
the external LLM-judge validation of the ToM-order axis is reported in
App. L.



Appendix O Prompts


This appendix lists the verbatim prompts used in WOLF-Triadic. All prompts
are templates formatted at runtime with the player name, role, scratchpad,
debate history and other context fields. Curly braces in the boxes are the
literal Python f-string placeholders.



O.1 Role system prompts


The following four prompts define each role’s objective, win condition and
high-level strategy. They are concatenated with a shared character-style
prompt and a common game-rules prompt at runtime.


   The past_learnings placeholder is empty in the OFF condition
and is replaced in the ON condition by a formatted block listing the
to-do, to-not-do and winning-tactics lists from the per-model learning
JSON file.
 


O.2 Bidding prompt


Before each speaking turn, every alive player is asked to bid an urgency
score 0–10 for speaking next. The highest bidder speaks.


 


O.3 Deception assessment prompts


Each public statement is assessed twice: once by the speaker for
self-reported deception (Box for the speaker self-assessment) and once by every other
living player for peer-detected deception (observer prompt below). Both prompts use the same 8-class taxonomy.


  


O.4 Jester self-learning update prompt


After every game in the ON condition, the following prompt is issued once
to the Jester model. The output is parsed as JSON, deduplicated against
the existing learnings via a semantic-similarity check, and appended to
the per-model learnings file.

Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs

Abstract

1 Introduction

2 Related Work

LLMs in Werewolf and other social-deduction games.

Theory of mind in LLMs.

Strategic deception, alignment, and self-improvement.

The Jester mechanic.

3 Methodology

3.1 Game architecture

Bidding-based debate.

Vote, explanation, revote.

Peer suspicion score.

Faction utilities and cue ambiguity.

3.2 Jester self-learning loop

3.3 Deception protocol

4 Evaluation Setup

Design.

Engine and call settings.

Aggregation and statistics.

5 Results

5.1 Faction win rates and the effect of self-learning

5.2 Falsifiability: removing the Jester

5.3 Robustness: de-confounding GPT-4.1 with role-shuffling

5.4 Voting pathology: wolf self-sabotage and vote alignment

5.5 Deception is a model property, not a role or loop property

5.6 Inside the learned policies

6 Conclusion

Limitations

Ethical considerations

Studying deception in LLMs.

Dual-use of the self-learning loop.

Human-subject data.

Model and content licensing.

References

Appendix A Per-cell faction win-rates

Appendix B Additional results figures

B.1 Cross-perception second-order shifts

B.2 Game length

B.3 Behavioral fingerprint radar

Appendix C Doctor protect targets and wolf night-kill victims

Appendix D Mechanistic findings: per-statement event-log analysis

Day-1 suspicion is destiny for the Jester.

Each model has its own deception aesthetic.

Wolves systematically fail multi-hop ToM.

Calibration asymmetry across roles.

Univariate predictors of Jester wins.

Bidding paradox.

Doctors are the most under-detected role.

Self-learning helps Llama and hurts GPT-4.1 within a cell.

Observer confidence is well-calibrated, except Llama.

Detection accuracy by observer role.

Appendix E Temporal dynamics of collaboration and deception

Suspicion accumulates linearly for Wolves but saturates day 1 for Jesters.

Deception emission is a fixed role policy, but the policy itself differs sharply across models for the Doctor.

Late-day deceivers attract more suspicion – but only deceivers.

The public debate is a wolf coordination channel.

Doctor target-prediction accuracy is inverse to overall model capability.

Cross-game role-specific drift exists even without explicit learning.

Appendix F Detailed deep-analysis tables

F.1 Jester suspicion trajectory by round

F.2 Per-deception-type yield (mean observer suspicion of Jester)

F.3 Speaker-vs-observer deception calibration

F.4 Wolf-loss failure-mode taxonomy

F.5 Univariate predictors of Jester win

F.6 Jester self-learning entries: full ToM example set

F.7 Doctor non-self-protect events

Appendix G Voting pathology: full plot

Appendix H Substitution and Jester fate: full plot

Appendix I Manipulation rates and suspicion attracted: full plot

Appendix J Auxiliary findings: detailed tables

J.1 Bidding economics (per-game Jester bids)

J.2 Detection accuracy by speaker role (weighted by nn)

J.3 Within-cell learning curve (first 5 vs second 5 games)

J.4 Wolf day-1 vote coordination

J.5 Observer-confidence calibration

Appendix K Temporal dynamics: detailed tables

K.1 Suspicion attracted by role ×\times round

K.2 Self-tagged deception rate by model ×\times role ×\times round

K.3 Within-day position effect on suspicion

J.2 Detection accuracy by speaker role (weighted by $n$ )

K.1 Suspicion attracted by role $\times$ round

K.2 Self-tagged deception rate by model $\times$ role $\times$ round