On the other hand, if you try to play chess with any of these reasoning models (including Gemini 2.5), it basically doesn't work at all. They keep forgetting where pieces are. Even with RL and sequential thinking on max, they consistently move pieces in impossible ways and mutate the board position.
In a recent test with Gemini 2.5, it burned roughly 1,700 thinking tokens to conclude it was in checkmate... but it wasn't. It's going to be very hard to trust these models to do new science, or to operate in domains humans can't verify, while this kind of behavior continues.
The vast majority of human chess players need to look at the board to know where the pieces are. Only a few people can know where all the pieces are if you just give them a list of moves. Have you tried evaluations where you give the LLM a representation of the board state at every move, as most human players would have, and which all chess engines track?
  a b c d e f g h
8 | r n b q k b n r | 8
7 | p p p p . . p p | 7
6 | . . . . . . . . | 6
5 | . . . . . . . . | 5
4 | . . . . P p . . | 4
3 | . . . . . . . . | 3
2 | P P P P . . P P | 2
1 | R N B Q K B N R | 1
  a b c d e f g h
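A minimal sketch of that setup, assuming the harness (not the model) owns the position: a plain-stdlib tracker applies each move to an 8x8 array and re-serializes the board in the diagram style above, so the LLM never has to reconstruct the position from the move list alone. The moves here are in coordinate form ("e2e4"), and the function names are illustrative, not from any particular library.

```python
# Authoritative board state kept outside the model; re-rendered every move.
START = [
    list("rnbqkbnr"),
    list("pppppppp"),
    list("........"),
    list("........"),
    list("........"),
    list("........"),
    list("PPPPPPPP"),
    list("RNBQKBNR"),
]

def apply_move(board, move):
    """Apply a coordinate move like 'e2e4' to an 8x8 board (rank 8 = row 0)."""
    fc, fr = ord(move[0]) - ord("a"), 8 - int(move[1])
    tc, tr = ord(move[2]) - ord("a"), 8 - int(move[3])
    board[tr][tc] = board[fr][fc]
    board[fr][fc] = "."

def render(board):
    """Serialize the board in the same diagram style shown above."""
    lines = ["  a b c d e f g h"]
    for i, row in enumerate(board):
        rank = 8 - i
        lines.append(f"{rank} | " + " ".join(row) + f" | {rank}")
    lines.append("  a b c d e f g h")
    return "\n".join(lines)

# King's Gambit Accepted: 1. e4 e5 2. f4 exf4 -- the rendered diagram
# would be pasted into the prompt before asking for the next move.
board = [row[:] for row in START]
for mv in ["e2e4", "e7e5", "f2f4", "e5f4"]:
    apply_move(board, mv)
print(render(board))
```

A real harness would also need legality checking, castling, promotion, and en passant; the point is only that the board state is tracked externally and shown to the model at every turn.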
I suppose I could use an external representation and paste that in, but I could also have it write a Python script to use Stockfish. Answers do seem to take longer to generate, but they're well worth the cost.
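A sketch of that "script it against Stockfish" route, using only the stdlib and the engine's UCI text protocol. The helper names are illustrative, and actually calling `best_move` assumes a `stockfish` binary is on the PATH; only the command-building is exercised here.

```python
import subprocess

def uci_position(moves):
    """Build the UCI command that tells the engine the full game so far."""
    cmd = "position startpos"
    if moves:
        cmd += " moves " + " ".join(moves)
    return cmd

def best_move(moves, depth=12):
    """Ask a local Stockfish for the best move in the position after `moves`.

    Not invoked here: requires a `stockfish` binary installed on PATH.
    """
    proc = subprocess.Popen(
        ["stockfish"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    proc.stdin.write(uci_position(moves) + "\n")
    proc.stdin.write(f"go depth {depth}\n")
    proc.stdin.flush()
    for line in proc.stdout:
        if line.startswith("bestmove"):
            proc.terminate()
            return line.split()[1]

# The exact commands such a script would send to the engine:
print(uci_position(["e2e4", "e7e5", "f2f4"]))
```

Since the engine is re-fed the entire move list in a single `position startpos moves ...` command, it never "forgets" where the pieces are, which is precisely the failure mode being discussed.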
"PROOF OR BLUFF? EVALUATING LLMS ON 2025 USA MATH OLYMPIAD"