Yep, a constantly updated spec is the key. Wrote about this here:
https://lukebechtel.com/blog/vibe-speccing
I've also found it's helpful to have it keep an "experiment log" at the bottom of the original spec, or in another document, which it must update whenever things take "a surprising turn".
Some things I've been doing:
- Move as much actual data into YML as possible.
- Use CEL?
- Ask Claude to rewrite pseudocode in specs into RFC-style constrained language?
How do you sync your spec and code in both directions? I have some slash commands that do this, but I'm not thrilled with them.
I tend to have to use Gemini for actually juggling the whole spec. Of course it's nice and chunked as much as it can be, but still. There's gonna need to be a whole new way of doing this.
If programming languages can have spooky action at a distance, wait until we get into "but paragraph 7, subsection 5 of section G clearly defines asshole as..."
What does a structured language look like when it doesn't need mechanical sympathy? YML + CEL is really powerful and underexplored but it's still just ... not what I'm actually wanting.
Sharding: Make well-named sub-documents for parts of the work. The LLM will be happy to create these and maintain cross-references for you.
Compaction: Ask the LLM to compact parts of the spec, or changelog, which are over-specified or redundant.
"Make sub-documents with cross-references" is just... recreating the problem of programming languages but worse. Now we have implicit dependencies between prose documents with no tooling to track them, no way to know if a change in document A invalidates assumptions in document B, no refactoring support, no tests for the spec.
To make things specific:
At some level you have to do semantic compression... To your point on non-explicitness -- the dependencies between the specs and sub-specs can be explicit (e.g. file:// links).
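To sketch what "explicit dependencies" could buy you: once sub-specs link to each other by path, a small script can at least catch dangling references. This is only a rough illustration, assuming the specs are markdown files using `[text](path)` links (possibly prefixed with `file://`). The names `find_broken_links` and `demo` are made up for the example, and it does nothing about the harder problem of invalidated assumptions.

```python
import re
import tempfile
from pathlib import Path

# Matches markdown-style links such as [text](sub/spec.md#anchor).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(spec_dir: Path) -> list[tuple[str, str]]:
    """Return (source file, link target) pairs whose target file is missing."""
    broken = []
    for doc in sorted(spec_dir.rglob("*.md")):
        for target in LINK_RE.findall(doc.read_text(encoding="utf-8")):
            # Skip web links and in-page anchors; only local files are checked.
            if target.startswith(("http://", "https://", "#")):
                continue
            # Assumption: file:// links are written relative to the linking doc.
            path_part = target.removeprefix("file://").split("#", 1)[0]
            if not (doc.parent / path_part).exists():
                broken.append((doc.relative_to(spec_dir).as_posix(), target))
    return broken


def demo() -> list[tuple[str, str]]:
    """Build a throwaway spec tree with one dangling link and check it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "SPEC.md").write_text(
            "See [auth](auth.md#tokens) and [billing](file://billing.md).\n",
            encoding="utf-8",
        )
        (root / "auth.md").write_text(
            "Back to [the spec](SPEC.md).\n", encoding="utf-8"
        )
        return find_broken_links(root)


print(demo())
```

Something like this could run in CI next to the linter, so a renamed or deleted sub-spec fails loudly instead of silently rotting.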
But your overall point on assumption invalidation remains... Reminds me of a startup some time ago that was doing "Automated UX Testing" where user personas (e.g. prosumer, avg joe, etc) were created, and goals / implicit UX flows through the UI were described (e.g. "I want to see my dashboard", etc). Then, an LLM could pretend to be each persona, and test each day whether that user type could achieve the goals behind their user flow.
This doesn't fully solve your problem, but it hints at a solution perhaps.
Some of what you're looking for is found by adding strict linters / tests. But your repo looks like something in an entirely different paradigm and I'm curious to dig into it more.
1. The post was written before this was common :)
2. If using Cursor (as I usually am), this isn't what it always does by default, though you can invoke something like it using "plan" mode. Its default is to keep todo items in a nice little todo list, but that isn't the same thing as a spec.
3. I've found that Claude Code doesn't always do this, for reasons unknown to me.
4. The prompt is completely fungible! It's really just an example of the idea.
Did you run any benchmarking? I'm curious if Python's stack is faster or slower than a pure C vibe-coded inference tool.
People can say what they want about LLMs reducing intelligence/ability; the trend has clearly been that people are beginning to get more organized, document things better, enforce constraints, and think in higher-level patterns. And there's renewed interest in formal verification.
LLMs will force the skilled, employable engineer to chase both maintainability and productivity from the start, in order to maintain a competitive edge with these tools. At least until robots replace us completely.
[0] https://www.atlassian.com/work-management/knowledge-sharing/...
One suggestion, which I have been trying to do myself, is to include a PROMPTS.md file. Since your purpose is sharing and educating, it helps others see what approaches an experienced developer is using, even if you are just figuring it out.
One can use a Claude hook to maintain this deterministically. I instruct in AGENTS.md that they can read but not write it. It’s also been helpful for jumping between LLMs, to give them some background on what you’ve been doing.
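For anyone wondering what the hook side might look like, here's a minimal sketch. It assumes the hook runner pipes a JSON object to the script's stdin and that the user's text sits under a `"prompt"` key; check your tool's hook docs for the actual payload shape. `append_prompt` and `run_hook` are names invented for this example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import TextIO


def append_prompt(payload: dict, log_path: Path) -> bool:
    """Append the submitted prompt to the log; return False if there is none."""
    # Assumption: the hook payload carries the user's text under "prompt".
    prompt = str(payload.get("prompt", "")).strip()
    if not prompt:
        return False
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    # Append-only, so earlier entries are never rewritten.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"## {stamp}\n\n{prompt}\n\n")
    return True


def run_hook(stream: TextIO, log_path: Path) -> bool:
    """Read one JSON payload from the stream and log its prompt."""
    return append_prompt(json.load(stream), log_path)
```

The script registered as the hook would end with `run_hook(sys.stdin, Path("PROMPTS.md"))`. Because the file is written by the script rather than by the model, the log stays deterministic even when the agent is told it may only read it.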
If the spec and/or tests are sufficiently detailed maybe you can step back and let it churn until it satisfies the spec.
I only say this as it seems one of your motivations is education. I'm also noting it for others to consider. Much appreciation either way, thanks for sharing what you did.
I don't think it counts as recreating a project "from scratch" if the model that you're using was trained against it. Claude Opus 4.5 is aware of the stable-diffusion.cpp project and can answer some questions about it and its code-base (with mixed accuracy) with web search turned off.
I've had some moments recently on my own projects, while working through some bottlenecks, where I took a whole section of a project, said "rewrite in Rust" to Claude, and got massive speedups from a zero-shot rewrite (most recently on some video recovery programs). But I then had an output product I wouldn't feel comfortable vouching for outside of my homelab setup.
It’s surprising how much even Opus 4.5 still trips itself up with things like off-by-one or logic boundaries, so another model (preferably with a fresh session) can be a very effective peer reviewer.
So my checks are typically lint->test->other model->me, and relatively few things get to me in simple code. Contrived logic or maths, though, it needs to be all me.
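Roughly, that ordering is a chain of gates where a change only reaches the human if every cheaper check passes first. A toy sketch, with the caveat that the three checks below are stand-in lambdas invented for illustration; the real ones would shell out to a linter, a test runner, and a second model's review.

```python
from typing import Callable, NamedTuple, Optional


class Gate(NamedTuple):
    name: str
    check: Callable[[str], bool]  # returns True when the change passes


def first_failing_gate(change: str, gates: list[Gate]) -> Optional[str]:
    """Run gates in order; return the first failing gate's name, or None."""
    for gate in gates:
        if not gate.check(change):
            return gate.name
    return None


# Hypothetical stand-ins, ordered cheapest first.
GATES = [
    Gate("lint", lambda change: "\t" not in change),
    Gate("test", lambda change: "assert False" not in change),
    # Pretend the second model flags an off-by-one loop bound.
    Gate("other model", lambda change: "range(len(xs) + 1)" not in change),
]

clean = "for x in xs:\n    print(x)"
buggy = "for i in range(len(xs) + 1):\n    print(xs[i])"

print(first_failing_gate(clean, GATES))  # None: goes on to human review
print(first_failing_gate(buggy, GATES))  # stopped at "other model"
```

The point of the ordering is just cost: each gate is more expensive than the one before it, with the human last.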
Every one (IIRC) was breaking copyrights by sharing 3rd-party works in data sets without permission. Some were trained on patent filings, which makes patent infringement highly likely. Many were breaking EULAs (contract law) by scraping them. Some outputs were verbatim reproductions of copyrighted works, too, which could get someone sued if they published them.
So, I warned people to stay away from AI until (a) training on copyrighted/patented works was legal in all those circumstances, (b) the outputs had no liability, and (c) users of a model could know this by looking at the pretraining data. There are no GPT3- or Claude-level models produced that way.
On a personal level, I follow Jesus Christ who paid for my sins with His life. We're to be obedient to God's law. One is to submit to authority (aka don't break man's law). I don't know that I can use AI outputs if the models were illegally trained, or if using them is like fencing stolen goods. That's another reason I want the pretraining to be legal, either by mandate or by using only permissible works.
Note: If your country is in the Berne Convention, it might apply to you, too.
Now that the Redis author supports broad copyright violations and has turned into an LLM influencer, I regret having ever supported Redis. I have watched many open source authors, who have positioned themselves as rebels and open source populists, go fully corporate. This is the latest instance.
That said, I'm mixed on agentic performance for data science work, but it does a good job if you clearly give it the information it needs to solve the problem (e.g. for SQL, table schema and example data).
What you're saying here is that you do not appreciate systems not using the Python stack, which I think is the opposite of what you wanted to say.
It's almost as if this is the first time many have seen something built in C with zero dependencies, which makes this easily possible.
They are used to languages with package managers adding 30 packages and pulling in 50-100+ other dependencies before the project is even able to build.
FLUX.2 [Klein]: Towards Interactive Visual Intelligence