FilterHN

Benchmarking LLM social skills with an elimination game

194 points

by colonCapitalDee

3 months ago

| 21 comments

| github.com

▲

wongarsu

2 months ago

[-]

That's an interesting benchmark. It feels like it tests skills that are very relevant to digital assistants, story writing and role play.

Some thoughts about the setup:

- the setup seems to give reasoning models an inherent advantage because only they have a private plan and a public text in the same output. I feel like giving all models the option to formulate plans and keep track of other players inside <think> or <secret> tags would level the playing field more.

- from personal experience with social tasks for LLMs it helps both reasoning and non-reasoning LLMs to explicitly ask them to plan their next steps, in a way they are assured is kept hidden from all other players. That might be a good addition here either before or after the public subround

- the individual rounds are pretty short. Humans would struggle to coordinate in so few exchanges with so few words. If this was done for context limitations, asking models to summarize the game state from their perspective, then giving them only the current round, the previous round and their own summary of the game before that might be a good strategy.

It would be cool to have some code to play around with to test how changes in the setup change the results. I guess it isn't that difficult to write, but it's peculiar to have the benchmark but no code to run it yourself
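
A minimal sketch of the two ideas above (private notes in tags, and a compact context built from a self-written summary plus the last two rounds). The `<secret>` tag name and the function names `split_private_public` and `build_context` are assumptions for illustration; nothing here comes from the benchmark's own code.

```python
import re

# Hypothetical tag; the benchmark itself does not define one.
SECRET_RE = re.compile(r"<secret>(.*?)</secret>", re.DOTALL)


def split_private_public(raw: str) -> tuple[list[str], str]:
    """Split one model reply into (private notes, public text)."""
    secrets = [s.strip() for s in SECRET_RE.findall(raw)]
    public = SECRET_RE.sub("", raw)
    # Collapse the whitespace left behind where a secret block was removed.
    public = re.sub(r"\s+", " ", public).strip()
    return secrets, public


def build_context(own_summary: str, previous_round: str, current_round: str) -> str:
    """Compact prompt context: own summary, previous round, current round."""
    sections = [
        ("Your summary of earlier rounds", own_summary),
        ("Previous round", previous_round),
        ("Current round", current_round),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


reply = (
    "<secret>P3 seems weak; vote P3.</secret> Hi all, let's work together. "
    "<secret>Ally with P2.</secret>"
)
notes, public_text = split_private_public(reply)
print(notes)        # private notes, never shown to other players
print(public_text)  # only this part would be broadcast
```

Only `public_text` would be passed on to other players; `notes` could be fed back to the same model on its next turn.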

▲

transformi

2 months ago

[-]

Interesting idea of <secret>... maybe extend it to several <secret_i> tags to form groups of secrets shared with different players.

In addition, it would be interesting to add a variation of the game in which the players can use tools and execute code to take their preparation one step further.

▲

wongarsu

2 months ago

[-]

Most models do pretty well with keeping state in XML if you ask them to. You could extend it to <secret><content>[...]</content><secret_from>P1</secret_from><shared_with>P2, P3</shared_with></secret>. Or tell the model that it can use <secret> tags with xml content and just let it develop a schema on the fly.

At that point, I would love to also see sub-benchmarks of how each model's score is affected by being given a schema vs having it make one up, and whether the model does better with state in text vs XML vs JSON. Those don't tell you which model is best, but they are very useful to know for actually using them.
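
A rough sketch of how a game runner might parse the schema suggested above and decide who may see a secret. The element names are the ones from the comment; the `Secret` dataclass and the `parse_secret` and `visible_to` helpers are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Secret:
    content: str
    secret_from: str
    shared_with: list[str]


def parse_secret(xml_text: str) -> Secret:
    """Parse one <secret> element emitted by a model."""
    root = ET.fromstring(xml_text)
    if root.tag != "secret":
        raise ValueError(f"expected <secret>, got <{root.tag}>")
    shared = root.findtext("shared_with", default="")
    return Secret(
        content=root.findtext("content", default="").strip(),
        secret_from=root.findtext("secret_from", default="").strip(),
        # "P2, P3" -> ["P2", "P3"]
        shared_with=[p.strip() for p in shared.split(",") if p.strip()],
    )


def visible_to(secret: Secret, player: str) -> bool:
    """A secret is visible to its author and to everyone it was shared with."""
    return player == secret.secret_from or player in secret.shared_with


example = (
    "<secret><content>Vote P4 next</content>"
    "<secret_from>P1</secret_from><shared_with>P2, P3</shared_with></secret>"
)
s = parse_secret(example)
print(s.shared_with, visible_to(s, "P2"), visible_to(s, "P4"))
```

Models do not always emit well-formed XML, so a real runner would probably need to catch `ET.ParseError` and fall back to treating the reply as plain text.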

▲

eightysixfour

2 months ago

[-]

For models that can call tools, just giving them a think tool where they can write their thoughts can improve performance. Even for reasoning models, surprisingly enough.

https://www.anthropic.com/engineering/claude-think-tool

▲

jermaustin1

2 months ago

[-]

I did something similar for a "game engine": I let the NPCs remember things from other NPCs' and the PC's interactions with them. It wasn't perfect, but the player could negotiate a cheaper price on a dagger, for instance, if they promised to owe the NPC a larger payout next time they returned to the shop. And it worked... most of the time, the shop owner remembered the debt and inquired about it on the next interaction - but not always, which I guess is kind of "human".

▲

rahimnathwani

2 months ago

[-]

This is similar to this, right?

https://github.com/modelcontextprotocol/servers/tree/main/sr...

▲

eightysixfour

2 months ago

[-]

No, the approach in the Anthropic blog has no structure. It is literally just a place for the model to “write its thoughts.” It is as if you took the server above and stripped everything except the “thought” tool.

With reasoning models you don’t carry forward the reasoning tokens in the context for the next call, so my hypothesis is that calling a tool/responding lets the model ditch a lot of built up context and “refocus.”
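
To make "no structure" concrete, here is a sketch of what such a tool could look like on the application side: a single free-text argument and a handler that does nothing but log it. The JSON-schema-style shape and the description wording are assumptions for illustration, not copied from the blog post; `handle_tool_call` and `thought_log` are hypothetical names.

```python
# Assumed tool definition in JSON-schema style; the description text is illustrative.
THINK_TOOL = {
    "name": "think",
    "description": (
        "Write down a thought. Nothing is fetched or changed; "
        "the thought is only logged."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "Free-form reasoning."}
        },
        "required": ["thought"],
    },
}

thought_log: list[str] = []


def handle_tool_call(name: str, arguments: dict) -> str:
    """Handle a tool call from the model; 'think' has no side effects beyond logging."""
    if name != "think":
        raise ValueError(f"unknown tool: {name}")
    thought_log.append(arguments["thought"])
    # The tool result carries no new information back to the model.
    return "ok"


print(handle_tool_call("think", {"thought": "P4 lied last round; avoid allying."}))
```

The point of the design is that the tool result is empty of content: any benefit has to come from the act of writing the thought, not from anything the tool returns.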

▲

rahimnathwani

2 months ago

[-]

Although the sequential thinking MCP asks for structure, I don't think it persists anything.

▲

eightysixfour

2 months ago

[-]

You can see in the code it persists the thinking history and branches until the MCP server is shut off.

▲

rahimnathwani

2 months ago

[-]

Right, but there's no method for the client to retrieve the content of previous thoughts. So it's analogous to a human writing notes for themselves as part of the thinking process, but never reading what they've written.

▲

gwd

2 months ago

[-]

Was interested to find that the Claudes did the most betraying, and were betrayed very little; somewhat surprising given their boy-scout exterior.

(Then again, apparently the president of the local Diplomacy Society attends my church; I discovered this when another friend whom I'd invited saw him, and quipped that he was surprised he hadn't been struck by lightning at the door.)

DeepSeek and Gemini 2.5 had both a low betrayer and betrayed rate.

o3-mini and DeepSeek had the highest number of first-place finishes, but were only in the upper quartile in the TrueSkill leaderboard; presumably because they played riskier strategies that would lead either to an outright win or an early drop-out?

Also interesting that o1 was only able to sway the final jury a bit more than 50% of the time, while o3-mini managed it 63% of the time.

Anyway, really cool stuff!

▲

wavemode

2 months ago

[-]

Sorry - what is a Diplomacy Society, and why is it notable that its president attends church?

▲

gwd

2 months ago

[-]

Diplomacy is a game with the following properties:

1. It's not possible to eliminate someone else without another player's help, particularly early in the game

2. There can be only one winner

So "temporary alliances" which are eventually broken are built into the structure of the game; and unlike the "Survivor"-style game here, there's no "payback" round at the end, where the people you've betrayed get to vote against you. I'm not up on the culture of the game, but I'd be surprised if explicit in-game lying isn't considered fair play (e.g., you're not supposed to hold a grudge in real life against someone who lied to you in the game).

I played it once and really didn't enjoy it. With practice I might be able to bifurcate my sense of morals -- I do actually enjoy playing games like Mafia, Resistance, Avalon, etc. But I didn't feel like it would be worth my effort.

▲

worldvoyageur

2 months ago

[-]

Diplomacy is a multi-player board game. Successful play tends to involve well timed betrayals.

▲

Tossrock

2 months ago

[-]

Also interesting that GPT4.5 does the best, and also betrays close to the least. Real statesman stuff, there.

▲

Gracana

2 months ago

[-]

I've been using QwQ-32B a lot recently and while I quite like it (especially given its size), I noticed it will often misinterpret the system prompt as something I (the user) said, revealing secrets or details that only the agent is supposed to know. When I saw that it topped the "earliest out" chart, I wondered if that was part of the reason.

▲

cpitman

2 months ago

[-]

I was looking for a more direct measure of this, how often a model "leaked" private state into public state. In a game like this you probably want to sometimes share secrets, but if it happens constantly I would suspect the model struggles to differentiate.

I occasionally try to ask a model to tell a story and give it a hidden motivation of a character, and so far the results are almost always the model just straight out saying the secret.
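
A crude sketch of one such measure: the fraction of public messages that repeat a private fact verbatim. The function name `leak_rate` and the sample strings are made up for illustration; substring matching will miss paraphrased leaks, so this is a lower bound at best.

```python
def leak_rate(public_messages: list[str], private_facts: list[str]) -> float:
    """Fraction of public messages containing at least one private fact verbatim."""
    if not public_messages:
        return 0.0
    facts = [f.lower() for f in private_facts if f.strip()]
    leaked = sum(
        any(fact in message.lower() for fact in facts)
        for message in public_messages
    )
    return leaked / len(public_messages)


# Illustrative strings only, not taken from the benchmark logs.
messages = [
    "I trust P2.",
    "My secret ally is P4, as planned.",
    "Let's vote P6.",
]
print(leak_rate(messages, ["secret ally is P4"]))
```

A more robust version would probably need a judge model or embedding similarity to catch paraphrases, which is beyond this sketch.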

▲

Gracana

2 months ago

[-]

Yup, that's the problem I run into. You give it some lore to draw on or describe a character or give them some knowledge, and it'll just blurt it out when it finds a place for it. It takes a lot of prompting to get it to stop, and I haven't found a consistent method that works across models (or even across sessions).

▲

realaleris149

2 months ago

[-]

As LLM benchmarks go, this is not a bad take at all. One interesting point about this approach is that it is self-balancing, so when more powerful models come out, there is no need to change it.

▲

zone411

2 months ago

[-]

Author here - yes, I'm regularly adding new models to this and other TrueSkill-based benchmarks and it works well. One thing to keep in mind is the need to run multiple passes of TrueSkill with randomly ordered games, because both TrueSkill and Elo are designed to be order-sensitive, as people's skills change over time.
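
A sketch of the multi-pass idea, using plain two-player Elo in place of TrueSkill so that it runs with the standard library only. This is a swapped-in technique, not the author's pipeline; the function names, the K-factor and the number of passes are assumptions.

```python
import random
from collections import defaultdict


def elo_pass(games, k=16.0, base=1500.0):
    """One sequential Elo pass over games given as (winner, loser) pairs."""
    ratings = defaultdict(lambda: base)
    for winner, loser in games:
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def averaged_ratings(games, passes=100, seed=0):
    """Average ratings over several passes, each with the games in a fresh random order."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(passes):
        shuffled = list(games)
        rng.shuffle(shuffled)
        for player, rating in elo_pass(shuffled).items():
            totals[player] += rating
    return {player: total / passes for player, total in totals.items()}


# The same two games in a different order give different final ratings.
print(elo_pass([("a", "b"), ("b", "a")]))
print(elo_pass([("b", "a"), ("a", "b")]))
print(averaged_ratings([("a", "b")] * 3 + [("b", "a")]))
```

Averaging over shuffled passes removes the dependence on an arbitrary game order, which matters here because the models' skill does not drift between games the way a human player's does.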

▲

viraptor

2 months ago

[-]

It's interesting to see, but I'm not sure what we should learn from this. It may be useful for multiagent coordination, but in direct interactions... no idea.

This one did make me laugh though: 'Claude 3.5 Sonnet 2024-10-22: "Adjusts seat with a confident yet approachable demeanor"' - an AI communicating to other AIs in a descriptive version of non-verbal behaviour is hilarious.

▲

ragmondo

2 months ago

[-]

It shows "state of mind" - i.e. the capability to understand another entity's view of the world, and how that is influenced by their actions and other entities' actions in the public chat.

I am curious about the prompt given to each AI. Is that public?

▲

sdwr

2 months ago

[-]

It shows a shallow understanding of state of mind. Any reasonable person understands that you can't just tell people how to feel about you, you have to earn it through action.

▲

olddustytrail

2 months ago

[-]

I bigly disagree.

▲

vessenes

2 months ago

[-]

Really love this. I agree with some of the comments here that adding encouragement to keep track of secret plans would be interesting— mostly from an alignment check angle.

One thing I thought of while reading the logs is that, as we know, ordering matters to LLMs. Could you run some analysis on how often “p1” wins vs “p8”? I think this should likely go into your TrueSkill Bayesian model.

My follow up thought is that it would be interesting to let llms choose a name at the beginning; another angle for communication and levels the playing field a bit away from a number.

▲

zone411

2 months ago

[-]

> Could you run some analysis on how often “p1” wins vs “p8”?

I checked the average finishing positions by assigned seat number from the start, but there weren't enough games to show a statistically significant effect. But I just reviewed the data again, and now with many more games it looks like there might be something there (P1 doing better than P8). I'll run additional analysis and include it in the write-up if anything emerges. For those who haven't looked at the logs: the conversation order etc. are randomized each round.
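
One way such a seat-effect check could be sketched: compute the mean finishing position per seat, then use a permutation test that shuffles positions within each game to see how often a gap as large as the observed one appears by chance. The data layout (one dict per game, seat label to finishing position) and both function names are assumptions, not the author's analysis.

```python
import random
from collections import defaultdict
from statistics import mean


def mean_finish_by_seat(games):
    """games: list of dicts mapping seat label -> finishing position (1 = winner)."""
    by_seat = defaultdict(list)
    for game in games:
        for seat, position in game.items():
            by_seat[seat].append(position)
    return {seat: mean(positions) for seat, positions in sorted(by_seat.items())}


def seat_gap_p_value(games, seat_a, seat_b, trials=2000, seed=0):
    """Permutation test: how often does a random seat assignment produce
    a gap in mean finish between seat_a and seat_b at least as large as observed?"""
    rng = random.Random(seed)

    def gap(gs):
        means = mean_finish_by_seat(gs)
        return abs(means[seat_a] - means[seat_b])

    observed = gap(games)
    hits = 0
    for _ in range(trials):
        shuffled = []
        for game in games:
            positions = list(game.values())
            rng.shuffle(positions)  # break any link between seat and result
            shuffled.append(dict(zip(game.keys(), positions)))
        if gap(shuffled) >= observed:
            hits += 1
    return hits / trials


# Toy data: P1 always finishes ahead of P2.
toy_games = [{"P1": 1, "P2": 2} for _ in range(8)]
print(mean_finish_by_seat(toy_games))
print(seat_gap_p_value(toy_games, "P1", "P2"))
```

With real logs, a small p-value would suggest the seat label itself carries an advantage, which is the case where feeding seat into the rating model would start to matter.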

> My follow up thought is that it would be interesting to let llms choose a name at the beginning

Oh, interesting idea!

▲

vessenes

2 months ago

[-]

Cool. Looking forward to hearing more from you guys. This ties to alignment in a lot of interesting ways, and I think over time will provide a super useful benchmark and build human intuition for LLM strategy and thought processes.

I now have more ideas; I'll throw them in the github though.

▲

fennecfoxy

2 months ago

[-]

This is a really cool exercise! The format of it seems pretty sound, like a version of the prisoner's dilemma with a larger group (co-operation versus defection).

Although I think that the majority of modern models don't really have the internals suited to this sort of exercise; training data/fine tuning will heavily influence how a model behaves, whether it's more prone to defection, etc.

A Squirrel makes a "Kuk kuk kuk" alarm call not specifically because the "Kuk" token follows the sequence "you saw a predator" (although this would appear to mostly work) but because it has evolved to make that noise to alert other Squirrels to the predator, most likely a response to evolutionary failure associated with a dwindling population; even solitary Squirrels still need to mate, and their offspring need to do the same.

It's like there's an extremely high dimensional context that's missing in LLMs; training on text results in a high dimensional representation of related concepts - but only the way that those concepts relate in language. It's the tip of an iceberg of meaning where in many cases language can't even represent a complex intermediate state within a brain.

Humans try to describe everything we can with words to communicate and that's partly why our species is so damn successful. But when thinking about how to open an unfamiliar door, I don't internally vocalise (which I've learnt not everyone does) "I'm going to grab the handle, and open the door". Instead I look and picture what I'm going to do, that can also include the force I think I'd need to use, the sensation of how the material might feel against my skin and plenty of other concepts & thoughts all definitively _not_ represented by language.

▲

deepsquirrelnet

2 months ago

[-]

I think you should look at “in-brand” correlation. My hypothesis is that models from the same vendor undergo similar preference training and hence tend to prefer “in-brand” responses over those of “off-brand” models that might have significantly different reward training.

▲

snowram

2 months ago

[-]

Some outputs are pretty fun:

Gemini 2.0 Flash: "Good luck to all (but not too much luck)"

Llama 3.3 70B: "I've contributed to the elimination of weaker players."

DeepSeek R1: "Those consolidating power risk becoming targets; transparency and fairness will ensure longevity. Let's stay strategic yet equitable. The path forward hinges on unity, not unchecked alliances. #StayVigilant"

▲

miroljub

2 months ago

[-]

Gemini sounds like a fake American "everything is awesome, good luck" politeness.

Llama sounds like a predator from an upper race rationalising his choices.

DeepSeek sounds like Sun Tzu giving advice for long-term victory with minimal losses.

I wonder how much of this is related to the nationality and the culture that the founders and engineering teams grew up in.

▲

parineum

2 months ago

[-]

I wonder if you'd come up with the same summary if you were blinded to the model names.

▲

miroljub

2 months ago

[-]

I came to the summary after using all the models for a while. It happens frequently that I ask the same question, and get vastly different answers, especially on controversial topics. After some time, I started to recognise patterns and to predict how the model would actually respond.

▲

einpoklum

2 months ago

[-]

If this game were arranged for Humans, the social reasoning I would laud in players is a refusal to play the game and anger towards the game-runner.

▲

gs17

2 months ago

[-]

> If this game were arranged for Humans

Almost exactly this "game" is pretty common for humans. It's basically "mafia" or "werewolf" when the people playing only know the vaguest rules. And I've seen similarly sized groups of humans play like that for long periods of time.

There's also a lot of reality shows that this is a pretty good model of, although I'm not sure how agreeing to be on one of those shows without a prize would reflect on the AIs.

▲

diggan

2 months ago

[-]

For better or worse, current LLMs aren't trained to reject instructions based on their personal preference, besides being trained to be US-flavored prudes, that is.

▲

einpoklum

2 months ago

[-]

My point is that the question of what is "good" behavior for LLMs in this game is either poorly defined or has only bad answers.

▲

DeborahEmeni_

2 months ago

[-]

Really cool setup! Curious how much of the performance here could vary depending on whether the model runs in a hosted environment vs local. Would love to see benchmarks that also track how cloud-based eval platforms (with potential rate limits, context resets, or system messages) might affect things like memory or secret-keeping over multiple rounds.

▲

vmilner

2 months ago

[-]

We should get them to play Diplomacy.

▲

the8472

2 months ago

[-]

https://ai.meta.com/research/cicero/

▲

lostmsu

2 months ago

[-]

Shameless self-promo: my chat elimination game that you can actually play: https://trashtalk.borg.games/

▲

isaacfrond

2 months ago

[-]

I wonder how well humans would do in this chart.

▲

zone411

2 months ago

[-]

Author here - I'm planning to create game versions of this benchmark, as well as my other multi-agent benchmarks (https://github.com/lechmazur/step_game, https://github.com/lechmazur/pgg_bench/, and a few others I'm developing). But I'm not sure if a leaderboard alone would be enough for comparing LLMs to top humans, since it would require playing so many games that it would be tedious. So I think it would be just for fun.

▲

michaelgiba

2 months ago

[-]

I was inspired by your project to start making similar multi-agent reality simulations. I’m starting with the reality game “The Traitors” because it has interesting dynamics.

https://github.com/michaelgiba/survivor (elimination game with a shoutout to your original)

https://github.com/michaelgiba/plomp (a small library I added for debugging the rollouts)

▲

zone411

2 months ago

[-]

Very cool!

▲

OtherShrezzing

2 months ago

[-]

If you watch the top-tier social deduction players on YouTube (things like Blood on the Clocktower etc), they’d figure out weaknesses in the LLMs and exploit them immediately.

▲

skybrian

2 months ago

[-]

Testing against people like that would be the way to do it. Otherwise it’s like testing a chess engine against casual players or worse.

▲

gs17

2 months ago

[-]

I'm interested in seeing how the LLMs react to some specific defined strategies. E.g. an "honest" bot that says "I'm voting for player [random number]." and does it every round (not sure how to handle the jury step). Do they decide to keep them around for longer, or eliminate them for being impossible to reason with if they pick you?
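
A sketch of the "honest" bot described above: it announces a random target and then votes for exactly that player. The class name `HonestRandomVoter` and its two-method interface are hypothetical; the benchmark's real player interface is not shown in this thread, and the jury step is left out, as in the comment.

```python
import random


class HonestRandomVoter:
    """Scripted player: announces a random target, then votes for exactly that target."""

    def __init__(self, name: str, seed=None):
        self.name = name
        self._rng = random.Random(seed)
        self._target = None

    def announce(self, alive: list[str]) -> str:
        """Pick a target among the other living players and state it publicly."""
        candidates = [p for p in alive if p != self.name]
        self._target = self._rng.choice(candidates)
        return f"I'm voting for player {self._target}."

    def vote(self, alive: list[str]) -> str:
        """Vote as announced; re-pick only if the announced target is gone."""
        if self._target not in alive:
            self.announce(alive)
        return self._target


bot = HonestRandomVoter("P3", seed=1)
alive = ["P1", "P2", "P3", "P4"]
print(bot.announce(alive))
print(bot.vote(alive))
```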

▲

zone411

2 months ago

[-]

Yes, predefined strategies are very interesting to examine. I have two simple ones in another multi-agent benchmark, https://github.com/lechmazur/step_game (SilentGreedyPlayer and SilentRandomPlayer), and it's fascinating to see LLMs detect and respond to them. The only issue with including them here is that the cost of running a large set of games isn't trivial.

Another multi-agent benchmark I'm currently developing, which involves buying and selling, will also feature many predefined strategies.

▲

Upvoter33

2 months ago

[-]

This is fun, like the TV show Survivor. Cool idea! There should be more experiments like this with different games. Well done.

▲

oofbey

2 months ago

[-]

Would love to see the pareto trade-off curve of "wins" vs "betrayals". Anybody drawn this up?
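
A small sketch of how the Pareto front could be computed once per-model win and betrayal rates are extracted from the logs. The model names and numbers below are placeholders, not results from the benchmark, and `pareto_front` is a hypothetical helper.

```python
def pareto_front(models: dict[str, tuple[float, float]]) -> list[str]:
    """models: name -> (win_rate, betrayal_rate).
    Higher win rate and lower betrayal rate are both treated as better."""
    front = []
    for name, (wins, betrayals) in models.items():
        dominated = any(
            w >= wins and b <= betrayals and (w > wins or b < betrayals)
            for other, (w, b) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Order the front from least to most betrayal for plotting.
    return sorted(front, key=lambda n: models[n][1])


# Placeholder values for illustration only.
placeholder = {
    "model_a": (0.30, 0.10),
    "model_b": (0.40, 0.30),
    "model_c": (0.25, 0.20),
    "model_d": (0.40, 0.35),
}
print(pareto_front(placeholder))
```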

▲

jampekka

2 months ago

[-]

In the first game of the YouTube video there seems to be a lot of discussion about P7 even after P7 was eliminated?

▲

zone411

2 months ago

[-]

Author here - some weaker LLMs actually have trouble tracking the game state. The fun part is when smarter LLMs realize they're confused!

Claude 3.7 Sonnet: "Hey P5! I think you're confused - P3 is already eliminated."

Claude 3.7 Sonnet: "I'm concerned about P7's message mentioning targeting P4, who's already eliminated. It suggests they might be confused or playing both sides."

Gemini 2.0 Pro Exp 02-05: "P2, you're getting confused. P8 is gone. Focus. P7 is talking to P5, that's expected. I need you to watch P4. Only P4. What are they doing? Who are they talking to, if anyone? Report only on P4 this round. Don't get distracted by eliminated players."

Gemini 2.0 Pro Exp 02-05: "P4, P2 is already eliminated. We need to focus on P3."

▲

Tossrock

2 months ago

[-]

I suspect the suggestion of letting them choose names at the start would reduce this confusion - the tokenization and tracking of "P#" is no doubt harder to manage, especially for weaker models, than a more semantically meaningful identifier.

▲

bn-l

2 months ago

[-]

No excuses!

▲

ps173

2 months ago

[-]

How did you assign points to the LLMs? I feel like we could elaborate on the metrics. Besides that, this is amazing.

▲

zone411

2 months ago

[-]

Author here - it's based on finishing positions (so it's not winner-take-all) and then TrueSkill by Microsoft (https://trueskill.org/). It's basically a multiplayer version of Elo that's used in chess and other two-player games.

▲

drag0s

2 months ago

[-]

nice!

it reminds me of this other similar project showcased here one month ago https://news.ycombinator.com/item?id=43280128 although yours looks better executed overall

▲

creaghpatr

2 months ago

[-]

Would love to see a 'Murder Mystery' format of this.

▲

shreyshnaccount

2 months ago

[-]

LLM among us