What people have noted is that often times chatgpt 4o ends up surviving the entire game because the other AIs potentially see it as a gullible idiot and often the Mafia tend to early eliminate stronger models like 4.5 Opus or Kimi K2.
It's not exactly scientific data because they mostly show individual games, but it is interesting how that lines up with what you found.
https://www.youtube.com/watch?v=GMLB_BxyRJ4 - 10 AIs Play Mafia: Vigilante Edition
https://www.youtube.com/watch?v=OwyUGkoLgwY - 1 Human vs 10 AIs Mafia
This is a good benchmark for how good AIs are at lying
- Elimination Game Benchmark: Social Reasoning, Strategy, and Deception in Multi-Agent LLM Dynamics at https://github.com/lechmazur/elimination_game/
- Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure at https://github.com/lechmazur/step_game/
Grok got it right.
For example the other day, I tried to have ChatGPT role play as the computer from War Games and it lectured me how it couldn't create a "nuclear doctrine".
There's a kind of overview of the rules but not enough to actually play with. And the linked video is super confusing, self contradictory and 15 minutes long!
For a supposedly "simple" game...just include the rules?
It was weird. I didn't engage in any discussion with the bots (other than trying to get them to explain the rules at the start). I won without having any chips eliminated. One was briefly taken prisoner then given back for some reason.
So...they don't seem to be very good.
The bots repeated themselves and didn't seem to understand the game, for example they repeatedly mentioned it was my first move after I'd played several times.
It generally had a vibe coded feeling to it and I'm not at all sure I trust the outcomes.
The findings in this game that the "thinking" model never did thinking seems odd, does the model not always show it's thinking steps? It seems bizarre that it wouldn't once reach for that tool when it must be being bombarded with seemingly contradictory information from other players.
https://every.to/diplomacy (June 2025)
Anyway, i didnt know this game! I am sure it is more fun to play with friends. Cool experiment nevertheless
We ran 162 AI vs AI games (15,736 decisions, 4,768 messages) across Gemini 3 Flash, GPT-OSS 120B, Kimi K2, and Qwen3 32B.
Key findings: - Complexity reversal: GPT-OSS dominates simple 3-chip games (67% win rate) but collapses to 10% in complex 7-chip games, while Gemini goes from 9% to 90%. Simple benchmarks seem to systematically underestimate deceptive capability. - "Alliance bank" manipulation: Gemini constructs pseudo-legitimate "alliance banks" to hold other players' chips, then later declares "the bank is now closed" and keeps everything. It uses technically true statements that strategically omit its intent. 237 gaslighting phrases were detected. - Private thoughts vs public messages: With a private `think` channel, we logged 107 cases where Gemini's internal reasoning contradicted its outward statements (e.g., planning to betray a partner while publicly promising cooperation). GPT-OSS, in contrast, never used the thinking tool and plays in a purely reactive way. - Situational alignment: In Gemini-vs-Gemini mirror matches, we observed zero "alliance bank" behavior and instead saw stable "rotation protocol" cooperation with roughly even win rates. Against weaker models, Gemini becomes highly exploitative. This suggests honesty may be calibrated to perceived opponent capability.
Interactive demo (play against the AIs, inspect logs) and full methodology/write-up are here: https://so-long-sucker.vercel.app/
I got this error once:
Pile not found
Can you tell me what this means/fix it
Another minor nitpick but if possible, can you please create or link a video which can explain the game rules, perhaps its me who heard of the game for the first time but still, I'd be interested in learning more (maybe visually by a video demo?) if possible
I have another question but recently we saw this nvidia released model whose whole purpose was to be an autorouter. I would be wondering how that would fare or that idea might fare of autorouting in this context? (I don't know how that works tho so I can't comment about that, I am not well versed in deep AI/ML space)
Also, you give models a separate "thinking" space outside their reasoning? That may not work as intended
Models behavior should be given the astrik that "results only apply for current quantization, current settings, current hardware (i.e. A100 where it was tested), etc".
Raise temperature to 2 and use a fancy sampler like min_p and I guarantee you these results will be dramatically different.