And it confuses Claude.
This way of running tests is also what Rails does, and AFAIK Django too. Tests are isolated and can be run in random order. Actually, Rails randomizes the order, so if there are tests that for any reason depend on the order of execution, they will eventually fail. To help debug those cases, it prints the seed, which can be used to rerun those tests deterministically, including the calls to methods returning random values.
I thought that was how all test frameworks worked in 2026.
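For what it's worth, ExUnit (the Elixir test runner this thread is mostly about) works the same way: it shuffles test order, reports the seed it used, and lets you pin that seed to reproduce an order-dependent failure. A minimal sketch, assuming a standard mix project:

```elixir
# test/test_helper.exs
#
# ExUnit shuffles test order by default and reports the seed it used,
# e.g. "Randomized with seed 846103". Pinning that seed reproduces the
# same order (and, in recent Elixir versions, the same :rand state per
# test), so an order-dependent failure becomes deterministic.
ExUnit.start(seed: 846103)

# Equivalent from the command line, without touching the code:
#   mix test --seed 846103
#
# A seed of 0 disables shuffling and runs tests in definition order,
# which helps when bisecting which test leaks state into which.
# ExUnit.start(seed: 0)
```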
I've been tweaking my skills to avoid nested cases, better use of with/do to control flow, good contexts, etc.
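Concretely, this is the nested-case shape I keep steering it away from versus the `with` version it should be writing (hypothetical helpers, assumed to return `{:ok, value}` / `{:error, reason}` tuples):

```elixir
defmodule Orders do
  # What the model tends to produce when left to its own devices:
  # every step wrapped in its own case, errors re-wrapped at each level.
  def create_order_nested(params) do
    case validate(params) do
      {:ok, attrs} ->
        case reserve_stock(attrs) do
          {:ok, reservation} ->
            case charge_card(attrs, reservation) do
              {:ok, charge} -> {:ok, charge}
              {:error, reason} -> {:error, reason}
            end

          {:error, reason} ->
            {:error, reason}
        end

      {:error, reason} ->
        {:error, reason}
    end
  end

  # The same flow with `with`: any non-{:ok, _} result short-circuits
  # and is returned unchanged, with no manual re-wrapping.
  def create_order(params) do
    with {:ok, attrs} <- validate(params),
         {:ok, reservation} <- reserve_stock(attrs),
         {:ok, charge} <- charge_card(attrs, reservation) do
      {:ok, charge}
    end
  end

  # Toy stubs so the module compiles on its own.
  defp validate(%{amount: amount} = params) when amount > 0, do: {:ok, params}
  defp validate(_params), do: {:error, :invalid_params}
  defp reserve_stock(attrs), do: {:ok, Map.put(attrs, :reservation_id, 1)}
  defp charge_card(attrs, _reservation), do: {:ok, Map.put(attrs, :charged, true)}
end
```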
They could've been sorted with precise context injection of claude.md files and/or dedicated subagents, no?
My experience using Claude suggests you should spend a good amount of time scaffolding its instructions in documents it can follow and refer to, if you don't want it to end up in the same loops over and over.
Author hasn't written on whether this was tried.
We don't 100% AI it but this very much matches our experience, especially the bits about defensiveness.
Going to do some testing this week to see if a better agents file can't improve some of the author's testing struggles.
The new generation of code assistants are great. But when I dogmatically try to let only the AI work on a project, it usually fails and shoots itself in the proverbial foot.
If this is indeed 100% vibe coded, then there is some magic I would love to learn!
- https://github.com/agoodway/.claude/blob/main/skills/elixir-...
- https://github.com/agoodway/.claude/blob/main/agents/elixir-...
- https://github.com/agoodway/.claude/blob/main/agents/elixir-...
Getting pretty good results so far.
What I'd really like to see though is experiments on whether you can few-shot prompt an AI to in-context-learn a new language with any level of success.
It's certainly helpful, but it has a tendency to go for very non-idiomatic patterns (like using exceptions for control flow).
Plus, it has issues which I assume are the effect of reinforcement learning - it struggles with letting things crash and tends to silence failures that should never fail silently.
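A typical example of both habits, with a toy in-memory lookup standing in for whatever the real code does:

```elixir
defmodule Accounts do
  # Toy in-memory store for illustration; the real thing would hit a DB.
  @users %{1 => %{id: 1, name: "Ada"}}

  # What the model tends to write: an exception used as control flow,
  # caught by a rescue-everything clause that also swallows real bugs.
  def fetch_user_defensive(id) do
    try do
      {:ok, Map.fetch!(@users, id)}
    rescue
      _ -> {:error, :not_found}
    end
  end

  # Idiomatic: a tagged tuple for the expected "missing" case...
  def fetch_user(id) do
    case Map.get(@users, id) do
      nil -> {:error, :not_found}
      user -> {:ok, user}
    end
  end

  # ...and a bang variant for callers that require the user to exist.
  # If it's missing there, that's a bug: let it crash and let the
  # supervisor restart things, rather than failing silently.
  def fetch_user!(id), do: Map.fetch!(@users, id)
end
```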
The SOTA models do a great job with all of them, but if I had to rank their capabilities per language it would look like this:
JavaScript, Julia > Elixir > Python > C++
That's just a sample size of one, but I suspect that for all but the most esoteric programming languages there is more than enough code in the training data.
What if it doesn't? What if LLMs just stay at mostly the same level of usefulness they have now, while the costs continue to rise as subsidization wears off?
Is it still worth it? Maybe, but not worth abandoning actual knowledge of what you're doing.
- Silently closes the tab, and makes a note to avoid the given software at any cost.