Because of this, I wanted to create a game environment that put this generation of frontier LLMs' top skill, coding, on full display.
Ten years ago, a team released a game called Screeps. It was described as an "MMO RTS sandbox for programmers." The Screeps paradigm of writing code and having it executed in a real-time game environment is well suited to LLMs. Drawing on a version of the Screeps open source API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.
In my testing I found that Claude Opus 4.5 was the most dominant model, but it showed weakness in round 1 as it was overly focused on its in-game economy. Meanwhile, I probably spent a third of all code on sandbox hardening because GPT 5.2 kept trying to cheat by pre-reading its opponent's strategies.
If there's interest, I'm planning on doing a round of testing with the latest generation of LLMs (Claude 4.6 Opus, GPT 5.3 Codex, etc.).
You can run local matches via CLI. I'm running a hosted match runner with Google Cloud Run that uses isolated-vm. The match playback visualizer is statically served from Cloudflare.
I've created a community ladder that you can submit strategies to via CLI, no auth required. I've found that the CLI plus the skill.md that's available has been enough for AI agents to immediately get started.
Website: https://llmskirmish.com
API docs: https://llmskirmish.com/docs
GitHub: https://github.com/llmskirmish/skirmish
A video of a match: https://www.youtube.com/watch?v=lnBPaZ1qamM
I foresee this laying the foundation for whole football stadia filled to the brim with people wanting to watch (and bet on!) competing teams of AI trained on military tactics and strategies!
Soon enough we shall have AI-Olympics! Imagine that, MY FELLOW OXYGEN CONVERTING HUMAN FRIEND! Tens of thousands of robots and drones, all competing against each other in stadia across the planet, at the same time!
I foresee a world wide, synchronized countdown marking the beginning of the biggest, greatest and definitively most unique, one-time-only spectacle in human history!
Keep up the good work!
Link for those curious or confused as to what I'm talking about: https://www.youtube.com/watch?v=1F-rAW3vXOU
Forcing AI to fight in an arena for our entertainment, what could go wrong? (This was tongue in cheek; I am fully aware LLMs currently don't have conscious thoughts or emotions.)
I find this pretty funny because it seems like a perfect representation of what's easy with today's tools and what isn't
Love the idea though
I was proud of getting the highest-ranked JavaScript-based implementation, but got absolutely crushed by the eventual winner.
It reminds me a bit of OpenAI Five — not just because it played a complex game, but because the real value wasn’t “AI plays Dota,” it was observing how coordination, strategy formation, and adaptation emerged under competitive pressure. A controlled RTS environment like this feels like a lightweight, reproducible version of that idea.
What I especially like here is that it lowers the barrier for experimentation. If researchers and hobbyists can plug different models into the same competitive sandbox, we might start seeing meaningful AI-vs-AI evaluations beyond static leaderboards. Competitive dynamics often expose weaknesses much faster than isolated benchmarks do.
Curious whether you’re planning to support self-play training loops or if the focus is primarily on inference-time agents?
You can watch the match videos from training runs: https://www.youtube.com/@Sscaitournament/videos
I don't think BWAPI has ever integrated modern AI models, but I haven't followed its progress in several years.
Note, this project doesn't have that, as best I can tell? It's two static AI scripts having a go. LLMs generate the scripts and they are aware of past "results", but I'm not sure what that means.
Interestingly, I’ve had to create an entire category for games LLMs play. Strange times we live in.
https://egeozcan.github.io/unnamed_rts/game/
I occasionally run my tournament script: https://github.com/egeozcan/unnamed_rts/blob/main/src/script...
That calculates the ELOs for each AI implementation, and I feed it to different agents so they get really creative trying to beat each other. Also making rule changes to the game and seeing how some scripts get weaker/stronger is a nice way to measure balance.
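For anyone wondering what "calculating the Elos" involves, here is a minimal sketch of the standard Elo update, not the linked tournament script itself. The function names, the starting rating of 1500, and the K-factor of 32 are all assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b).

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k (the K-factor) is an assumed value; real ladders tune it.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the rating pool stays constant.
    return rating_a + delta, rating_b - delta


def run_tournament(results, start: float = 1500.0, k: float = 32.0):
    """Apply Elo updates in order over (name_a, name_b, score_a) tuples."""
    ratings = {}
    for name_a, name_b, score_a in results:
        ra = ratings.get(name_a, start)  # unseen scripts start at `start`
        rb = ratings.get(name_b, start)
        ratings[name_a], ratings[name_b] = update_elo(ra, rb, score_a, k)
    return ratings
```

Calling `run_tournament([("script_a", "script_b", 1.0)])` moves the winner up and the loser down by the same amount; the script names are placeholders.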
Funny thing, Codex gets really aggressive and starts cheating a lot of times: https://bsky.app/profile/egeozcan.bsky.social/post/3mfdtj5dh...
(The largest winner has 50 wins against 14 other opponents, for instance.) If that user added a new script, they would instantly plummet down the leaderboard, capping out at 14 wins again and landing below the 2nd-place user.
The leader board will quickly become "who can have a mostly competent AI and never change it" over who actually has the better script.
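To make the concern concrete, here is a rough sketch comparing a ranking by raw win count with a ranking by win rate. It is only an illustration, not how the actual ladder is implemented; the `Entry` type, the function names, and the game counts in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str   # hypothetical script name
    wins: int   # matches won so far
    games: int  # matches played so far


def rank_by_wins(entries):
    """Raw win count: favors scripts that have simply been around longest."""
    return sorted(entries, key=lambda e: e.wins, reverse=True)


def rank_by_win_rate(entries, min_games: int = 1):
    """Win rate: a fresh script is not penalized for having fewer matches.

    min_games is an assumed cutoff to keep tiny samples off the board.
    """
    eligible = [e for e in entries if e.games >= min_games]
    return sorted(eligible, key=lambda e: e.wins / e.games, reverse=True)
```

With invented numbers such as `Entry("veteran", 50, 60)`, `Entry("runner_up", 30, 60)` and `Entry("fresh", 14, 14)`, the fresh script is last by raw wins but first by win rate, which is the ladder-kicking effect described above.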
I had started with the Silicon Valley characters as a one off way to seed the board.
Opus needs to learn to kite.
Using an LLM-friendly API with a snapshot of game state and calculated heuristics, legal moves, and varying levels of strategy is working out nicely. They can play a web-based game via curl.
Edit: Forgot link: https://davechurchill.ca/starcraft/
> Make sure that each onframe call does not run longer than 42ms. Entries that slow down games by repeatedly exceeding this time limit will lose games on time.
But I'm missing something like: "Your program will be pinned to CPU cores 5-8 and your bot has access to a dedicated RTX 5090 GPU." Also no mention about whether my bot can have network access to offload some high-level latency insensitive planning. Maybe that's just a bad idea in general, haven't played SC in ages.
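As a rough idea of how a bot author could check themselves against the quoted 42 ms limit, here is a minimal timing sketch. It is not the tournament's own enforcement code; `timed_frame`, `BudgetMonitor`, and the callback shape are assumptions, and the rules quoted above do not say how many overruns count as "repeatedly", so this only counts them.

```python
import time

FRAME_BUDGET_S = 0.042  # 42 ms per on-frame call, from the quoted rule


def timed_frame(on_frame, *args):
    """Run one on-frame callback and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = on_frame(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


class BudgetMonitor:
    """Counts how many frames exceeded the time budget."""

    def __init__(self, budget_s: float = FRAME_BUDGET_S):
        self.budget_s = budget_s
        self.frames = 0
        self.overruns = 0

    def record(self, elapsed_s: float) -> bool:
        """Record one frame time; return True if it went over budget."""
        self.frames += 1
        if elapsed_s > self.budget_s:
            self.overruns += 1
            return True
        return False
```

Wall-clock timing like this depends on the machine it runs on, which is exactly why the missing hardware details matter.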
I quite like the idea of llms writing more code up front to execute strategies.
I’m currently developing the game mechanics and ELO. Please share anything relevant if it comes to mind
This would bring another dimension to it since then quality of tokens would be one dimension (RTS-language: Decision Making) and speed of tokens the other (RTS-language: Actions Per Minute; APM).
Also, there are already a lot of coding benchmarks; this way it would test something more abstract, similar to AlphaStar: https://en.wikipedia.org/wiki/AlphaStar_(software)
You could just use the exposed APIs of OpenAI, Anthropic etc. and let them battle.
Are these casters AI?
Edit: Actually the repo README indeed says it's inspired by Screeps. I don't know why they didn't just build on top of Screeps; maybe the idea is to have something anyone can pick up off the shelf for free?
Maybe they are already doing this? Are there logs of the model's thinking?
This is just free propaganda for Anthropic && OpenAI who will leverage these (useless) capabilities to convince your boss to give your salary to them, or at least a substantial portion of it.
I’ve been an engineer for almost 40 years and love seeing what Claude Code can do.
Like it or not, young people will not know a world where this technology doesn’t exist. It is just part of their toolset now.
You would say that because otherwise you'd be afraid of being seen as "too old for this job", and hence risking getting kicked out of it all, meaning no future employment opportunities. I know that feeling, because I myself have been doing this programming job for 20+ years already (so not a young one by any means), but let's just cut the crap about it all and let's tell it how it is.
People of varied ages already leverage LLMs on a daily basis. And LLMs will only get better.
Yesterday, Opus did work for me that would have taken me weeks. And the result was verified with a comprehensive suite of unit tests plus smoke tests by myself. The code looks exactly as the rest of the code in the 10y+ old, hand-written, enterprise project, no slop.
And you actually should be afraid of being left behind in dev related fields if you don't use LLMs. In most areas in fact.
Once the market corrects for LLM-assisted production, expectations will rise. So right now there is a small window to leverage LLMs as a time-saving advantage before it becomes the norm and everyone is forced to use them because expectations will reflect that.
Um... I am still an active reverse engineer of both ring-0 and ring0 applications on both macOS and Windows (I worked on both the VS and Xcode teams). I'm developing a new tool for macOS that allows users to "see behind" active windows without the constant need for cmd/alt+tabbing. My age has zero bearing on my skill set or ability to understand technology. https://imgur.com/a/seymour-r9whXO5
> let's just cut the crap about it all and let's tell it how it is
The reality is, as I said, that this technology exists and it isn't going anywhere. Young people are going to use it as a tool just like we did when GUI operating systems first became prevalent.
I don't even remotely buy into the AI hype but I'm not going to put the blinders on either. There is utility in this technology.
I can't stand you old heads, I'm very happy for you that you got to stash away 40 years of SWE salaries. It's just ladder-kicking behavior to be honest. Typical boomer, you got your nut and don't care what happens after.
25% of new college grads in STEM are unemployed and a bunch of companies (controlled by people in your age group) have laid off 400k Americans over the last 16 months while equities and profits are at all-time highs.
The replies : ItS NoT Ai, ItS cUz FrEe MoNeY fRoM CoViD HaS DrIeD uP.
The world was once entirely analog; generations of analog engineers had to throw away their knowledge and start over during the digital transition. It wasn't always pretty but they did it.
If you can't embrace technological change you might have wasted $100k.
Not a fan. Make games with in-game AIs that are interesting but are not large language models: that's wasteful and lazy. You probably had more large language models put this together for you. Lazy.
I swear people (esp here on HN) are actually blind to the weaknesses of Gemini.
I must be among the handful of people who know how thoroughly lobotomized any AI agent from Google must be given their extremely radical historical and contemporaneous practices of censorship.
For complex code I have been using Sonnet/Opus as usual with a mix of GPT5.3-Codex.
I wonder if an LLM could call on another strategy AI to help.
Maybe the LLM could be more of a coordinator of its own thinking by incorporating other types of AIs.
I haven't.