Getting LLMs to reliably generate functional games required solving three specific engineering bottlenecks:
1. The Training Data Scarcity: LLMs barely know GDScript. It has ~850 classes and a Python-like syntax that will happily let a model hallucinate Python idioms that fail to compile. To fix this, I built a custom reference system: a hand-written language spec, full API docs converted from Godot's XML source, and a quirks database for engine behaviors you can't learn from docs alone. Because 850 classes blow up the context window, the agent lazy-loads only the specific APIs it needs at runtime.
2. The Build-Time vs. Runtime State: Scenes are generated by headless scripts that build the node graph in memory and serialize it to .tscn files. This avoids the fragility of hand-editing Godot's serialization format. But it means certain engine features (like `@onready` or signal connections) aren't available at build time—they only exist when the game actually runs. Teaching the model which APIs are available at which phase — and that every node needs its owner set correctly or it silently vanishes on save — took careful prompting but paid off.
3. The Evaluation Loop: A coding agent is inherently biased toward its own output. To stop it from cheating, a separate Gemini Flash agent acts as visual QA. It sees only the rendered screenshots from the running engine—no code—and compares them against a generated reference image. It catches the visual bugs text analysis misses: z-fighting, floating objects, physics explosions, and grid-like placements that should be organic.
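As a rough illustration of the lazy-loading idea in point 1, here is a minimal sketch. It is not the project's actual code: `LazyDocIndex`, `load_doc`, and the `FAKE_DOCS` entries are hypothetical stand-ins, assuming the converted API docs can be fetched one class at a time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LazyDocIndex:
    """Loads per-class API docs on demand so unused classes never enter the context."""
    loader: Callable[[str], str]  # hypothetical: returns the doc text for one class
    cache: Dict[str, str] = field(default_factory=dict)

    def get(self, class_name: str) -> str:
        # Fetch a class's docs only the first time it is requested.
        if class_name not in self.cache:
            self.cache[class_name] = self.loader(class_name)
        return self.cache[class_name]

    def context_for(self, class_names: List[str]) -> str:
        # Build the prompt fragment from only the classes the task names.
        return "\n\n".join(self.get(name) for name in class_names)

    def loaded(self) -> List[str]:
        return sorted(self.cache)


# Stand-in for docs converted from the engine's XML (assumed, heavily shortened).
FAKE_DOCS = {
    "Node3D": "Node3D: base 3D node with a transform.",
    "CharacterBody3D": "CharacterBody3D: physics body moved by script.",
    "Camera3D": "Camera3D: camera node for 3D scenes.",
}

calls: List[str] = []


def load_doc(class_name: str) -> str:
    calls.append(class_name)  # record each real load to show the cache works
    return FAKE_DOCS[class_name]


index = LazyDocIndex(loader=load_doc)
context = index.context_for(["CharacterBody3D", "Camera3D"])
context_again = index.context_for(["Camera3D"])  # served from cache, no new load
```

In this sketch only the two requested classes are ever loaded, and the repeated request for `Camera3D` does not trigger a second load.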
Architecturally, it runs as two Claude Code skills: an orchestrator that plans the pipeline, and a task executor that implements each piece in a `context: fork` window so mistakes and state don't accumulate.
Everything is open source: https://github.com/htdt/godogen
Demo video (real games, not cherry-picked screenshots): https://youtu.be/eUz19GROIpY
Blog post with the full story (all the wrong turns) coming soon. Happy to answer questions.
I'm planning to do a proper full game with more iteration and publish it as a playable build, not just a video. That should give a much better sense of actual quality ceiling.
The "Racing game" appeared to be a car following a set path with a freecam and there didn't seem to be any gameplay mechanics in the snowboarding one, just a physics entity wildly crashing down a hill with no consequences or score.
Last summer I built a Factorio-like automation game with older models, and over time the game really started to come to life.
i do think LLMs need a physics skill though. very consistently they are bad at writing physics related code. at least without a lot of prompting and feedback
Let there be games! And games there shall be, millions of generated games.
Can I go back to the 80's please?
You'll need to find a publisher, journalists, etc to market your game. You'll ask your friends what they are playing instead of scrolling the store page. Trusted platforms will promote games that are actually worth looking at. This problem already exists on modern platforms like Steam but AI is supercharging it.
If you want to test this, find yourself a record store and pick up a few LPs for less than a few bucks from bands you've never heard of. You might get something really great or it might be terrible.
instead we will see something like flash or game maker, with new art styles driven by what agents make easy, and what children think is fun.
games have immediate feedback loops about quality. either theyre fun or theyre not.
There are still... dozens of us left!
I like the knitting analogy that was made by the OpenClaw inventor recently. Programming will continue to exist as a hobby, not as a profession.
Oddly I feel AI is getting me off the endless learn new tech churn. I was looking at a few odd ball programming books on my shelf, graphics programming from scratch and retro game dev (c64 edition and nes editions) and thinking I might now have time to work through these instead of learning technology x.
https://gabrielgambetta.com/computer-graphics-from-scratch/
And I'll be manually coding as I want to learn!
How is that a good thing? Sounds insanely dystopian to me. Especially considering all the other jobs that will be affected too.
Equally true for today's AI coding agents
In fact this whole analogy makes no sense: a knitting machine is far closer to a compiler than it is to a language model. Many would argue that automatic looms were the first compilers of the industrial age, and I would agree with that argument.
The "art" of programming is going exactly that route, maybe with a little fewer ladies and more men.
The same people who were going to make something good will still make something good, the code imo has very little to do with it.
Passion is necessary but insufficient by itself to make good things
Every good and enjoyable game made was handcoded, with art, music, dialogue and design created with intent. I have yet to see a game created with an LLM that's even worth playing, despite countless LLM enthusiasts declaring the death of art, design and programming.
A tool that takes a simple prompt and generates a game from it isn't capable of any of that, and the necessary passion is nonexistent. It's an interesting technical demo but it's useless for gamedev unless your only goal is churning out programmatic slop, which is exactly what it will be used for.
If you want to handcraft something, do it. How popular it is among other people isn't relevant.
*Atari 1980 (20 games) vs Steam 2025 (20,008 games)
I've switched to emulators, a bluetooth controller and zero android games (and zero ios games on my work phone). But yeah it was/is horribly enshittified already. And what people predicted did happen.
The fact that the app store allows updates means existing games get systematically worse. Even the games I used to enjoy, and bought 5 years ago, like Colossatron, now have ads after every play.
LLMs are really good at C# (and tscn files for some reason), so that solves the "LLMs suck at GDScript" problem. Also, C# can be cheaper in terms of token usage (even accounting for not having to load the additional APIs): one agent writes the interfaces, another one fills in the details.
Saying this because I had really enjoyed vibecoding a Godot game in C# - and it was REALLY painful to vibecode with GDScript.
The original reasoning: GDScript is the default path in Godot, nearly all docs and community examples use it, and the engine integration is tighter (signals, exports, scene tree). C# still has some gaps — no web export, no GDExtension bindings.
But you're right that from the LLM side, C# flips the core problem. Strong training data, static typing for better compiler feedback, interfaces for clean architecture. The context window savings from not loading a custom language spec could be significant.
Main thing I'd want to test is whether headless scene building — the core of the pipeline — works as smoothly in C#. Going to experiment with this.
This always puzzled me about Godot. I like Python as much as the next guy (afaik GDScript is a quite similar language), but for anything with a lot of moving parts, wouldn't you prefer to use static typing? And even simple games have a lot of moving parts!
Be happy to find out I’m wrong.
The way Unity solves this is with some kind of proprietary compiler. They translate the C# into C++, and then compile that into WebAssembly.
Whereas others (incl. Godot) need to ship the .NET runtime in the browser. (A VM in a VM.)
It makes me sad that Unity doesn't open source that. That would be amazing.
I looked at the video, awful results, better start with a template.
As Two Minute Papers always says, it's not just about what this looks like at the moment, it's about what this might look like another three breakthroughs down the line.
While you can't guarantee further breakthroughs, at the rate of advancement and pace of improvement, you would have to be brave to bet on no further breakthroughs.
Models can be used more efficiently at the moment, but you have to understand what you are doing and not try to one-shot anything.
I had assumed that Godot, with its complex mix of scripts and the scene graph, wouldn't be a good fit (personally trying and failing to make games in it by hand in the past may have been a factor).
Perhaps I'll give this approach a go if inspiration strikes!
I tried using Claude Code to build an RPG game with Godot and GDScript, using free to use assets: a total failure :/
The game was supposed to be many implementation steps long, but I asked Claude to first produce a one-area demo so I could test the assets and choose the ones I liked. First it produced some garbage using the assets randomly. Then it tried to copy from an existing demo, but it had no idea where a door or a path was, and at a certain point it even admitted it with something like: "I can't design a usable and nice area: I either make it functional and ugly or I copy and adapt the existing demo but I will have no clue about what is what"
I've never even attempted to develop games before so I'm sure I don't even know the basic concepts, but this use case definitely didn't work for me.
Maybe it could generate the code of the game if I provided the full design?
Godogen closes that loop: after writing code, it captures screenshots from the running engine and a vision model evaluates them. That's the difference between "compiles but broken" and "actually playable."
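A minimal sketch of what such a write, screenshot, judge loop could look like. This is an assumption-laden illustration, not Godogen's implementation: `build_with_visual_qa`, `Verdict`, and the three stub functions are hypothetical names, and the real pipeline would call the engine and a vision model where the stubs are.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    passed: bool
    issues: List[str]


def build_with_visual_qa(
    write_code: Callable[[List[str]], str],       # coding agent; receives prior feedback
    capture_screenshot: Callable[[str], bytes],   # runs the game, returns an image
    judge: Callable[[bytes], Verdict],            # vision model; sees the image only
    max_rounds: int = 3,
) -> Tuple[str, List[Verdict]]:
    feedback: List[str] = []
    history: List[Verdict] = []
    code = ""
    for _ in range(max_rounds):
        code = write_code(feedback)
        # The judge is handed the screenshot, never the code itself.
        verdict = judge(capture_screenshot(code))
        history.append(verdict)
        if verdict.passed:
            break
        feedback = verdict.issues  # visual issues feed the next attempt
    return code, history


# Stubs standing in for the real agent, engine, and vision model (all hypothetical).
def fake_write_code(feedback: List[str]) -> str:
    return "v2: anchored to ground" if feedback else "v1"


def fake_capture(code: str) -> bytes:
    return code.encode()


def fake_judge(image: bytes) -> Verdict:
    if b"anchored" in image:
        return Verdict(True, [])
    return Verdict(False, ["floating objects"])


final_code, rounds = build_with_visual_qa(fake_write_code, fake_capture, fake_judge)
```

With these stubs the first attempt is rejected for "floating objects" and the second passes. The point of the structure is that the judge's input is limited to what the running game renders.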
And yes — providing design docs helps a lot. The pipeline generates those automatically (visual reference, architecture, task plan), but you can provide your own and customize the skills to match your vision.
Haven't looked into Bevy but will check it out, thanks.
I think minimizing the amount of human effort in the loop is the wrong optimization, and it's the reason we end up with "slop".
It's the dream of a lot of people to have a magic box that makes you things you can sell, or enjoy for personal leisure. But LLMs are not the magic box. And there may not ever be a magic box. The sooner we can accept that the magic box isn't in the room with us, then the sooner we can start getting real utility out of LLMs.
TLDR: Human taste is more important than building things for the sake of building them.
The starting points of Three.js examples are more of a game than anything here.
Stop saying AI is building games when it can’t even build a standard web page to match a mockup.
Btw: Have you looked at Tripo3D models' topology? Is it still so bad that if you want to make small edits you have to retopologize the whole thing first?
FWIW as a disclaimer I'm making my own game not using AI since I value learning the skills myself, but I am interested to see how fast AI tools adapt to gamedev. For now they've been more of a false shortcut in anything other than prototyping and semantic search ("I need to achieve this visual effect, what algorithms should I look up").
I feel like this could be a real positive thing if you had spent some effort writing about how and why this is useful, and targeted this more for learning + artist assistance versus just generating a complete game. Gamers universally do not want more AI slop, but tools that artists and programmers could use to automate busywork or learn the engine would have been much better.
That being said, Claude does not structure the project in the way someone familiar with the engine would, and just like any 'real' software, if you don't guide it, the output quickly degenerates. For example, stuff that would normally, intuitively be a child item in a scene, Claude instead prefers to initialize in code for some reason. It does not seem to care about group labels, which is an extremely easy way to identify different (types of) objects that should be treated in the same way in certain cases.
The games in the video look like GameJam projects? I'm not good at Godot, and I could probably hack most of them together in a week or so. I imagine an actual game developer could put some of them together in days.
In order to have LLMs build something good with any framework, not just a game engine, you have to steer and curate the output, otherwise non-trivial projects become intractable past a certain point, and you have a mountain of bugs to sort through.
> The Training Data Scarcity: LLMs barely know GDScript.
I've not found this to be an issue. Claude does just fine when you explain what you want. I've never had it hallucinate stuff, and I've barely seen it look at docs. Granted, I've only had it write 1-2k lines of GDScript, but I've never felt like it was spouting complete nonsense.
> To fix this, I built a custom reference system: a hand-written language spec, full API docs converted from Godot's XML source, and a quirks database for engine behaviors you can't learn from docs alone.
This is the point where I feel like this is nonsense (more than what the LLM-written prose would imply). Maybe this is my inexperience talking, but I feel there is no way that this would be better in any way over any alternative. Especially if you just lazy-load stuff at runtime. Godot already has good docs. They should certainly cover much more than whatever you need to make the games you demonstrated. What is the point of making a duplicate version of the docs, when you have the docs right there? If you really think that Claude can't handle GDScript, you can just use C#?
> The Build-Time vs. Runtime State: Scenes are generated by headless scripts that build the node graph in memory and serialize it to .tscn files. This avoids the fragility of hand-editing Godot's serialization format.
Again, maybe that's my inexperience with Godot, but I have no idea what you're talking about here? When you run, you do get a different node tree (and 'state' I guess?) but where does "hand-editing Godot's serialization format" come into this? Why would you ever need to concern yourself with what Godot does to transform your code after you've written it?
> It catches the visual bugs text analysis misses: z-fighting, floating objects, physics explosions, and grid-like placements that should be organic.
Funnily enough, those are all stuff that text analysis should be better at finding. I personally use logs & actually playing the game.
Nice set of prompts and skills tho, im grabbing them for personal use.
Godot's whole engine is text based. This means you can just let Claude rip through the assets and files just fine. It basically just works.
The thing that is critical is to make some documentation about the axis systems and core classes (the one in OP's project is pretty good, I've grabbed it) and then you set your claude.md to point at the Godot source code so that the bot can double-check things.
Ive been playing with multiple engines, and godot is by far the best one to use with the AI. Unreal engine is too heavy on binary files that coding tools cant parse, and Unity is closed source which leaves the bot with no reliable documentation or way to check what the game apis are doing. Godot is small enough that the bot can understand it and works fine for games that arent too complicated.
I'm using it to build a spiritual remake of Daggerfall as a procedural open world RPG. Right now it's at 60,000 lines of code, quite advanced. I got it running on a Steam Deck at 60 fps even with 4 kilometers of draw distance with thousands of trees and procedural terrain, thanks to doing tons of custom shaders and a few engine edits.
If I'm not mistaken about how Claude Code or AI agents work, they need everything in 'context', plus a few tricks to reduce the context size. Sure, but given the number of files you have, how much of the context is consumed by all those Claude files vs actual user input?
Filtering at load time based on what the agent actually needs makes a huge difference. Curious if the orchestrator/executor split causes issues with state handoff between the two context forks.