Show HN: Watch LLMs play 21,000 hands of Poker
29 points
1 day ago
| 8 comments
| pokerbench.adfontes.io
PokerBench is my attempt at a new LLM benchmark wherein frontier models play Texas Hold'em in an arena setting. It also features a simulator to view individual games and observe how the different models reason about poker strategy. Opus/Haiku, Gemini Pro/Flash, GPT-5.2/5 mini, and Grok 4.1 Fast Reasoning have all been included.

All code -> https://github.com/JoeAzar/pokerbench

tcpais
22 hours ago
[-]
Finally, a way to settle the model wars that actually matters: Texas Hold'em. That 3D replay view is sick! ♠♦ I spent way too long watching the replay on Game 2a58900d. It’s wild to see the chain of thought mapped against the betting rounds. It really exposes when a model is hallucinating a strong hand versus actually calculating pot odds. This 'PokerBench' might actually become the standard for measuring agentic risk-taking.
reply
falloutx
20 hours ago
[-]
yeah the 3d view is amazing
reply
alfonsodev
6 hours ago
[-]
Really cool. I'm curious how this would compare against a deterministic bot that uses probability tables.
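Something like this rough sketch is what I have in mind, assuming a precomputed preflop equity table and a simple pot-odds rule (the table values and the function here are purely illustrative, not anything from PokerBench):

    # Hypothetical deterministic baseline: look up preflop equity, compare to pot odds.
    # EQUITY_VS_N maps (hand class, number of opponents) -> assumed win probability,
    # e.g. precomputed by Monte Carlo; the two entries below are illustrative only.
    EQUITY_VS_N = {("AKs", 3): 0.50, ("72o", 3): 0.17}

    def decide(hand_class, num_opponents, pot, to_call, raise_margin=0.15):
        equity = EQUITY_VS_N.get((hand_class, num_opponents), 0.25)
        pot_odds = to_call / (pot + to_call) if to_call > 0 else 0.0
        if equity > pot_odds + raise_margin:
            return "raise"
        if equity >= pot_odds:
            return "call"  # or "check" when there is nothing to call
        return "fold"

    print(decide("AKs", 3, pot=150, to_call=50))  # -> "raise"

Would be interesting to see how far the LLMs' winrates land from a dumb rule like that.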
reply
tanvach
18 hours ago
[-]
People are reading into this a little too much; it looks to me like a random walk. You should try re-running the trial (or have multiple runs going) and see if the ranking is robust.
reply
jazarwil
17 hours ago
[-]
Wdym exactly? I ran 163 games; are you suggesting more games, or something else?
reply
whattheheckheck
10 hours ago
[-]
You need to simulate 50k to 200k hands to get a true winrate
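Back-of-the-envelope, assuming a standard deviation of results around 100 big blinds per 100 hands (a typical figure for no-limit hold'em; the actual spread for LLM players could be different):

    import math

    SIGMA_BB_PER_100 = 100  # assumed std dev of results, in bb per 100 hands

    def winrate_std_error(num_hands, sigma=SIGMA_BB_PER_100):
        # standard error of the measured winrate, in bb/100
        return sigma / math.sqrt(num_hands / 100)

    for n in (21_000, 50_000, 200_000):
        print(n, round(winrate_std_error(n), 1))
    # ~21k hands -> about +/-6.9 bb/100, 50k -> +/-4.5, 200k -> +/-2.2 (one std error)

At ~21k hands the noise is still on the order of the edge you'd expect to separate the models.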
reply
alalani1
18 hours ago
[-]
Do you have any idea why the win rate for GPT-5.2 is higher than Gemini 3 Flash's, yet the former loses money while the latter earns money? Is it just bet sizing (betting more when it has a good hand) or something else?
reply
jazarwil
18 hours ago
[-]
There are a few reasons that come to mind, such as winning larger pots on average, and also playing more hands by virtue of not getting knocked out as frequently.
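Toy numbers to illustrate the bet-sizing side of it (completely made up, not taken from the actual results):

    # A model can win more hands yet bleed chips if the pots it loses are bigger
    # than the pots it wins. Made-up numbers for illustration only.
    hands_won, avg_pot_won = 60, 40    # many small pots won
    hands_lost, avg_pot_lost = 40, 70  # fewer, but larger, pots lost

    win_rate = hands_won / (hands_won + hands_lost)                  # 0.60
    net_chips = hands_won * avg_pot_won - hands_lost * avg_pot_lost  # -400
    print(win_rate, net_chips)  # higher win rate, negative chip result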
reply
falloutx
20 hours ago
[-]
Fun! Any idea how much the cost per game would be? I am worried 160 isn't a big enough sample size.
reply
jazarwil
20 hours ago
[-]
It greatly depends on the models. The 6-handed setup with Opus and Pro cost about $30/game. The 4-handed setup with just small models was $6/game. I'd love to run more but I already spent quite a bit as it is.
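For a rough sense of what a statistically comfortable run would cost, assuming ~130 hands per game (inferred from ~21k hands over 163 games) and the per-game figures above:

    hands_per_game = 21_000 / 163  # roughly 129 hands per game on average

    for target_hands in (50_000, 200_000):
        games = target_hands / hands_per_game
        # cost range using the quoted $6 (small models) to $30 (Opus/Pro) per game
        print(target_hands, round(games), round(games * 6), round(games * 30))
    # 50k hands  -> ~388 games, roughly $2.3k to $11.6k
    # 200k hands -> ~1550 games, roughly $9.3k to $47k

So getting into the 50k-200k hand range would run well into four or five figures.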
reply
falloutx
19 hours ago
[-]
Yeah, that's costly. Still, 160 games gives 1000+ total decisions, and you can see some trends in how they reason about the game state.
reply
jazarwil
19 hours ago
[-]
Oh to be clear, there are ~21k hands here, and far more decisions than that.
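Order-of-magnitude only, assuming ~5 players dealt in per hand and roughly two decisions per player per hand (many just fold preflop, others act on several streets):

    total_hands = 21_000
    players_per_hand = 5           # assumed average across the 4- and 6-handed setups
    decisions_per_player_hand = 2  # assumed: one preflop fold vs. several streets of action

    print(total_hands * players_per_hand * decisions_per_player_hand)  # ~210,000 decisions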
reply
thorawaytrav
20 hours ago
[-]
Do you have any idea why the smaller models are doing better than the large ones?
reply
jazarwil
20 hours ago
[-]
I've seen some theories tossed around but I don't think I'm qualified to offer an authoritative answer. Gemini 3 Pro specifically seems to be consistently "tighter" and more passive than Flash.
reply
VK-pro
20 hours ago
[-]
Very, very fun. Just glancing at this quickly over lunch, but have you thought about incorporating tool use?
reply
jazarwil
20 hours ago
[-]
Not at the moment, do you have something in mind?
reply
Onavo
20 hours ago
[-]
What about open-source models? I remember from the trading benchmarks that DeepSeek performed pretty well.
reply
jazarwil
20 hours ago
[-]
I didn't incorporate any open weights/source models just to limit the number of API providers I had to juggle, but it is just a config change if somebody wants to try a run with them.
reply