I built TetrisBench, a benchmark that tests LLMs on real-time code generation and reasoning through Tetris.
Live: https://tetrisbench.com/
*How it works:*
Each model starts with an initial optimization function for evaluating Tetris moves.
As the game progresses, the model sees the current board state and updates its algorithm—adapting its strategy based on how the game is evolving.
The model continuously refines its optimizer:
- Board getting too high? Prioritize clearing lines.
- Hole forming? Adjust penalties.
- Safe stack? Build for a Tetris.
The model generates updated code, executes it to score all placements, and picks the best move.
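To make the scoring step concrete, here is a minimal sketch of what such an evaluation function could look like. This is not the code TetrisBench or any of the models actually produces: the board representation, the three features (completed lines, holes, maximum height), the weight names, and every function name are assumptions chosen for illustration.

```python
from typing import Dict, List

# Assumed representation: rows listed top to bottom, 1 = filled, 0 = empty.
Board = List[List[int]]


def column_heights(board: Board) -> List[int]:
    """Height of each column, measured from the floor to its topmost block."""
    rows = len(board)
    heights = []
    for col in range(len(board[0])):
        height = 0
        for row in range(rows):
            if board[row][col]:
                height = rows - row
                break
        heights.append(height)
    return heights


def count_holes(board: Board) -> int:
    """Empty cells that have at least one filled cell above them."""
    holes = 0
    for col in range(len(board[0])):
        seen_block = False
        for row in range(len(board)):
            if board[row][col]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes


def complete_lines(board: Board) -> int:
    """Rows that are completely filled and would be cleared."""
    return sum(1 for row in board if all(row))


def evaluate(board: Board, weights: Dict[str, float]) -> float:
    """Hypothetical scoring: reward cleared lines, penalize holes and height."""
    return (
        weights["lines"] * complete_lines(board)
        - weights["holes"] * count_holes(board)
        - weights["height"] * max(column_heights(board))
    )


def pick_best(candidates: List[Board], weights: Dict[str, float]) -> int:
    """Index of the candidate board (one per legal placement) with the top score."""
    scores = [evaluate(board, weights) for board in candidates]
    return scores.index(max(scores))
```

In this sketch, "updating the algorithm" mid-game would amount to changing the weights (for example, raising `weights["height"]` when the stack gets tall) or rewriting `evaluate` entirely, and `pick_best` is then called on the boards that result from each legal placement of the current piece.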
*Current standings:*
| Model | Win Rate |
|-------|----------|
| Opus 4.5 | 68% |
| GPT-5.2 | 63% |
| Grok 4.1 | 22% |
(181 games so far, running more)
*Try it yourself:*
You can also play against any model directly. See if you can beat Opus at Tetris; only one human has so far.
*All trajectories are logged.* Every game saves the board states, the code each model generated, and its placement decisions. Happy to share the dataset.