FilterHN

Show HN: Watch LLMs play 21,000 hands of Poker

29 points | 1 day ago | 8 comments

PokerBench is my attempt at a new LLM benchmark wherein frontier models play Texas Hold'em in an arena setting. It also features a simulator to view individual games and observe how the different models reason about poker strategy. Opus/Haiku, Gemini Pro/Flash, GPT-5.2/5 mini, and Grok 4.1 Fast Reasoning have all been included.

All code -> https://github.com/JoeAzar/pokerbench

tcpais

22 hours ago

Finally, a way to settle the model wars that actually matters: Texas Hold'em. That 3D replay view is sick! ♠♦ I spent way too long watching the replay on Game 2a58900d. It’s wild to see the chain of thought mapped against the betting rounds. It really exposes when a model is hallucinating a strong hand versus actually calculating pot odds. This 'PokerBench' might actually become the standard for measuring agentic risk-taking.

falloutx

20 hours ago

Yeah, the 3D view is amazing.

alfonsodev

6 hours ago

Really cool. I'm curious how these models would compare against a deterministic bot that uses probability tables.
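For readers wondering what such a baseline might look like, here is a minimal sketch of a deterministic pot-odds bot. It is not part of PokerBench. The function names are hypothetical, and the equity estimate uses the common "rule of 4 and 2" heuristic in place of a real probability table.

```python
def equity_from_outs(outs: int, street: str) -> float:
    """Rough equity estimate via the 'rule of 4 and 2' heuristic (an assumption,
    standing in for a real lookup table): about 4% per out on the flop with two
    cards to come, about 2% per out on the turn with one card to come."""
    per_out = {"flop": 0.04, "turn": 0.02}[street]
    return min(1.0, outs * per_out)


def pot_odds(pot: float, to_call: float) -> float:
    """Break-even equity for a call.

    `pot` is the pot before our call, including the opponent's bet.
    """
    return to_call / (pot + to_call)


def decide(equity: float, pot: float, to_call: float) -> str:
    """Deterministic rule: call only when estimated equity covers the pot odds."""
    if to_call == 0:
        return "check"
    return "call" if equity >= pot_odds(pot, to_call) else "fold"


# Flush draw on the flop (9 outs) facing a 50 bet into a pot that is now 100.
print(decide(equity_from_outs(9, "flop"), pot=100, to_call=50))
# Gutshot on the turn (4 outs) facing the same bet.
print(decide(equity_from_outs(4, "turn"), pot=100, to_call=50))
```

A bot like this never bluffs or raises, so it would only be a floor to compare against, not a strong opponent.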

tanvach

18 hours ago

People are reading into this a little too much; it looks like a random walk to me. You should try rerunning the trial (or running several in parallel) and see whether the ranking is robust.
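One cheap way to probe ranking robustness without paying for new games is to bootstrap over the games already played. The sketch below assumes per-game results are available as a list of `{model: profit}` dicts; that layout and the function name are hypothetical, not PokerBench's actual output format.

```python
import random
from collections import defaultdict


def bootstrap_top_rank(results, n_resamples=1000, seed=0):
    """Resample games with replacement and report how often each model
    ends up with the highest total profit.

    `results` is a list with one dict per game, mapping model name -> profit
    (a hypothetical layout, assumed for illustration).
    """
    rng = random.Random(seed)
    models = sorted({m for game in results for m in game})
    top_counts = defaultdict(int)
    for _ in range(n_resamples):
        # Draw as many games as we have, with replacement.
        sample = [rng.choice(results) for _ in results]
        totals = defaultdict(float)
        for game in sample:
            for model, profit in game.items():
                totals[model] += profit
        best = max(models, key=lambda m: totals[m])
        top_counts[best] += 1
    return {m: top_counts[m] / n_resamples for m in models}


# Made-up data: model "A" beats model "B" in every game.
games = [{"A": 10.0, "B": -10.0}] * 20
print(bootstrap_top_rank(games))
```

If the top spot changes hands across a large share of resamples, the observed ranking is probably within noise.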

jazarwil

17 hours ago

What do you mean exactly? I ran 163 games. Are you suggesting more games, or something else?

whattheheckheck

10 hours ago

You need to simulate 50k to 200k hands to get a true win rate.
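A back-of-the-envelope check on that range, as a sketch under stated assumptions: win rate is measured in big blinds per 100 hands (bb/100), hands are treated as independent, and the standard deviation is taken to be about 100 bb/100. That figure is an assumed ballpark for no-limit Hold'em, not something measured from this benchmark.

```python
import math


def ci_half_width(std_bb_per_100: float, hands: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval on win rate (bb/100)."""
    return z * std_bb_per_100 / math.sqrt(hands / 100)


def hands_needed(std_bb_per_100: float, half_width: float, z: float = 1.96) -> int:
    """Hands required to shrink the confidence interval to +/- half_width bb/100."""
    return math.ceil(100 * (z * std_bb_per_100 / half_width) ** 2)


ASSUMED_STD = 100.0  # bb/100, an assumption for illustration only

# Uncertainty for a player who has seen 21,000 hands.
print(round(ci_half_width(ASSUMED_STD, 21_000), 1))
# Hands needed to pin the win rate down to +/- 5 bb/100.
print(hands_needed(ASSUMED_STD, 5.0))
```

Under these assumptions, 21,000 hands leaves an interval of roughly ±13.5 bb/100, and tightening it to ±5 bb/100 takes on the order of 150k hands, which is consistent with the 50k to 200k range quoted above.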

alalani1

18 hours ago

Do you have any idea why the win rate for GPT-5.2 is higher than Gemini 3 Flash's, yet the former loses money while the latter earns money? Is it just bet sizing (betting more when it has a good hand) or something else?

jazarwil

18 hours ago

There are a few reasons that come to mind, such as winning larger pots on average, and also playing more hands by virtue of not getting knocked out as frequently.
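The pot-size effect is easy to see with toy numbers. The sketch below uses made-up per-hand results and assumes "win rate" means the fraction of hands won; it is not data from the benchmark.

```python
def summarize(hand_results):
    """Return (fraction of hands won, net chips) for a list of per-hand chip deltas."""
    wins = sum(1 for r in hand_results if r > 0)
    return wins / len(hand_results), sum(hand_results)


# Made-up numbers: one player wins often but small and loses big,
# the other wins rarely but takes down large pots.
frequent_small_winner = [20, 20, 20, -50, -50]
selective_big_winner = [-10, -10, -10, 80, 80]

print(summarize(frequent_small_winner))  # higher win rate, negative net
print(summarize(selective_big_winner))   # lower win rate, positive net
```

Here the first player wins 60% of hands and still loses 40 chips, while the second wins 40% of hands and nets 130.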

falloutx

20 hours ago

Fun. Any idea what the cost per game is? I am worried 160 isn't a big enough sample size.

jazarwil

20 hours ago

It greatly depends on the models. The 6-handed setup with Opus and Pro cost about $30/game. The 4-handed setup with just small models was $6/game. I'd love to run more but I already spent quite a bit as it is.

falloutx

19 hours ago

Yeah, that's costly. 160 games still gives 1000+ total decisions, and you can see some trends in how they think about the game state.

jazarwil

19 hours ago

Oh to be clear, there are ~21k hands here, and far more decisions than that.

thorawaytrav

20 hours ago

Do you have any idea why smaller models do better than large ones?

jazarwil

20 hours ago

I've seen some theories tossed around but I don't think I'm qualified to offer an authoritative answer. Gemini 3 Pro specifically seems to be consistently "tighter" and more passive than Flash.

VK-pro

20 hours ago

Very, very fun. I'm just glancing at this quickly at lunch, but are there any plans to incorporate tool use?

jazarwil

20 hours ago

Not at the moment, do you have something in mind?

Onavo

20 hours ago

What about the open-source models? I remember DeepSeek performing pretty well in the trading benchmarks.

jazarwil

20 hours ago

I didn't incorporate any open weights/source models just to limit the number of API providers I had to juggle, but it is just a config change if somebody wants to try a run with them.