I think the secret sauce is to talk to the model about what you want first and make the plan; then, when you feel good about the spec, have it work on it, regardless of tooling (you can even just use a simple markdown file!). Since it always has a file to go back to, it can never 'forget'; it just needs to remember to review the file. The more detail in the file, the more powerful the output.
Tell your coding model how you want it, what you want, and why you want it. It also helps to ask it to poke holes and raise concerns (bypassing its overly agreeable nature so you don't waste time on things that are too complex).
I love using Claude to prototype ideas that have been in my brain for years, and they wind up coming out better than I ever envisioned.
And that was AFTER literally burning a week's worth of Codex and Claude $20 plans and $50 in API credits and getting completely bumfucked: the AI was faking out tests, etc.
I had better experiences just guiding the thing myself. It definitely was not a set-and-forget experience (6 hours of constant monitoring), but I was able to get a full research MVP that informed the next iteration using only 75% of a Codex weekly plan.
It's working, and I'm enjoying how productive it is, but it feels like a step on a journey rather than the actual destination. I'm looking forward to seeing where this journey ends up.
I would love to do something more sophisticated, but it's ironic: when I played both agents in this loop myself over the past few decades, the loop got faster and faster as computers got faster and faster. Now I'm back to waiting on agentic loops just like I used to wait for compilations on large codebases.
It is perhaps confirmation bias on my part, but I've found it does a better job with similar problems than I was getting with base plan mode. I've been attributing this to its multiple layers of cross-checks and self-reviews. Yes, I could do that by hand, of course, but I find Superpowers automates what I was already trying to accomplish in this regard.
Is that an issue? GitHub charges per request, not per token, so a verbose output and a short output cost the same.
What model are you using?
I’ve been really enjoying Codex CLI recently though. It seems to do just as well as Opus 4.6, but using the standard GPT 5.4
I reviewed the code from both and the GSD code was definitely written with the rest of the project and possibilities in mind, while the Claude Plan was just enough for the MVP.
I can see both having their pros and cons depending on your workflow and size of the task.
You never want the LLM to do anything that deterministic software does better, because it inflates the context and is not guaranteed to be done accurately. This includes things like tracking progress, figuring out dependency ordering, etc.
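Dependency ordering is a good concrete case: a topological sort is deterministic, cheap, and always correct, so there is no reason to spend context asking a model to work it out. A minimal sketch (the task names are made up for illustration):

```python
# Deterministic dependency ordering with the standard library, the kind
# of bookkeeping that should never be delegated to an LLM.
from graphlib import TopologicalSorter

# task -> set of prerequisite tasks (illustrative names)
deps = {"db": set(), "api": {"db"}, "ui": {"api"}}

order = list(TopologicalSorter(deps).static_order())
print(order)  # ['db', 'api', 'ui']
```

Feed the resulting order to the agent as plain text; it costs a handful of tokens and is guaranteed correct.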
Most projects I do take 20 minutes or less for an agent to complete and those don't need a wrapper. But for longer tasks, like hours or days, it gets distracted.
edit: GSD is a CLI wrapper; Superpowers, not so much. Both are over-engineered for an easy problem, IMHO.
Otherwise, if you can own your own thinking, orchestrating, and steering of agents, you're in a more mature place.
A mess. I still enjoy superpowers brainstorming but will pull the chute towards the end and then deliver myself.
Plan mode is great, but to me that's just prompting your LLM agent of choice to generate an ad-hoc, imprecise, and incomplete spec.
The downside of specs is that they can consume a lot of context window with things that are not needed for the task. When that is a concern, passing the spec to plan mode tends to mitigate the issue.
1. The LLM is operating on more what you'd call "guidelines" than actual rules: it will mostly make a PR after fixing a bug, but sometimes not. It will mostly run tests after completing a fix, but sometimes not. So there's a sentiment of "heck, let's write a prompt that tells it to always run tests after fixing code", etc.
2. You end up running the LLM tool against state that lives in GitHub (or the VCS du jour). E.g., I open a bug (issue) and type up what I found that's wrong, or whatever new feature I want. Then I tell Claude to go look at issue #xx. It runs in the terminal, asks me a bunch of unnecessary permission questions, fixes the bug, then perhaps makes a PR (perhaps I have to ask for that). Then I go watch CI status on the PR, come back to the terminal, and tell it that CI passed so please merge (or I can ask it to watch CI, review the status, and merge when ready). After a while you realize that all of that process could just be driven from the GitHub UI, if there were a "have Claude work on this issue" button. No need for the terminal.
These meta-frameworks are useful for the one who set them up but for another person they seem like complete garbage.
This has been solved already: automated testing. Tests encode the behaviour of the system into executables, which actually tell you whether your system aligns or not.
Better to encode the behaviour of your system into real, executable, scalable specs (aka automated tests), otherwise your app's behaviour is going to spiral out of control after the Nth AI generated feature.
The way to ensure this actually scales with the firepower LLMs have for writing implementations is to make the agent follow a workflow where it knows how to test, writes the tests first, and verifies that the tests actually reflect the behaviour of the system via mutation testing.
I've scoped this out here [1] and here [2].
[1] https://www.joegaebel.com/articles/principled-agentic-softwa... [2] https://github.com/JoeGaebel/outside-in-tdd-starter
1. Specs are subject to bit-rot; there's no impetus to update them as behaviour changes, unless your agent workflow explicitly enforces a thorough review and update of the specs, and unless your agent is diligent about following it. Lots of trust required in your LLM here.
2. There's no way to systematically determine whether the behaviour of your system matches the specs. Imagine a reasonably sized codebase: if there's a spec document for every feature, you're looking at quite a collection of specs. How many tokens must be burnt to keep them all up to date as new features land and behaviour changes?
3. Specs are written in English. They're ambiguous - they can absolutely serve the planning and design phases, but this ambiguity prevents meaningful behaviour assertions about the system as it grows.
Contrast that with tests:
1. They are executable and have the precision of code. They don't just describe behaviour of the system, they validate that the system follows that behaviour, without ambiguity.
2. They scale - it's completely reasonable to have extensive codebases have all (if not most) of their behaviour covered by tests.
3. Updating is enforceable - assuming you're using a CI pipeline, when tests break, they must be updated in order to continue.
4. You can systematically check whether the tests describe all of the system's behaviour via mutation testing: mutants that survive point directly at untested behaviour.
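Mutation testing is worth making concrete. A toy sketch follows; all names here are made up, and real tools (e.g. mutmut for Python) automate the operator flipping across a whole codebase:

```python
# Toy mutation test: flip an operator in the implementation and rerun
# the suite. If the suite still passes, that behaviour was never
# actually asserted by any test.

def apply_discount(price, rate):
    return price * (1 - rate)

def mutant(price, rate):
    return price * (1 + rate)   # mutation: '-' flipped to '+'

def weak_suite(fn):
    # only checks the zero-rate case, so it can't kill the mutant
    return fn(100, 0) == 100

def strong_suite(fn):
    return fn(100, 0) == 100 and fn(100, 0.25) == 75

print(weak_suite(mutant))     # True: mutant survives, discount logic untested
print(strong_suite(mutant))   # False: mutant killed
```

A surviving mutant is a precise, mechanical signal that a spec claim has no corresponding assertion.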
That being said, I think it's very valuable to start with a planning stage, even to provide a spec, such that the correct behaviour gets encoded into tests, and then instantiated by the implementation. But in my view, specs are best used within the design stage, and if left in the codebase, treated only as historical info for what went into the development of the feature. Attempting to use them as the source of truth for the behaviour of the system is fraught.
And I guess finally, I think that insofar as any framework uses the specs as the source of truth for behaviour, they're going to run into alignment problems since maintaining specs doesn't scale.
This is specious reasoning. Automated tests are already the output of these specs, and specs cover way more than what you cover with code.
Framing tests as the feedback that drives design is also a baffling opinion. Without specialized prompts such as specs, your LLM agent of choice ends up either ignoring tests altogether or even changing them to fit its own baseless assumptions.
I mean, who hasn't stumbled upon the infamous "the rest of your tests go here" output in automated tests?
OK, but how are you sure that the AI is correctly turning the spec into tests? If it makes a mistake there, and then builds the code in accordance with the mistaken test, you only get the illusion of a correct implementation.
You use the specs to generate the tests, and you review the changes.
This is specious reasoning
It's an insulting phrase, and from now on I'm immediately downvoting it when I see it.
I'm sorry you feel like that. How would you phrase an observation that the rationale for an assertion is unsubstantiated and unsupported beyond the surface level?
I want a system that enforces planning, tests, and adversarial review (preferably by a different company's model). This is more for features, less for overall planning, but a similar workflow could be built for planning.
1. Prompt
2. Research
3. Plan (including the tests that will be written to verify the feature)
4. Adversarial review of the plan
5. Implementation of tests; CI must fail on the tests
6. Adversarial review verifying that the tests match the plan
7. Implementation to make the tests pass
8. Adversarial PR review of the implementation
I want to be able to check on the status of PRs based on how far along they are, read the plans, suggest changes, read the tests, suggest changes. I want a web UI for that, I don't want to be doing all of this in multiple terminal windows.
A key feature that I want: if a step fails, especially because of adversarial review, the whole PR branch is force-pushed back to the previous state. So say #6 fails: #5 is re-invoked with the review information. Or if I come to the system while a PR is at #8 and I don't like the plan, I make some edits to the plan (#3), the PR is reset to the git commit just after the original plan, and the LLM is re-invoked with either my new plan or, more likely, my edits to the plan; then everything flows through again.
I want to be able to sit down, tend to a bunch of issues, then come back in a couple of hours and see progress.
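Those rollback semantics can be sketched as a small state machine. This is a hypothetical sketch of my own (stage names from the list above; the per-stage snapshot plays the role of a git commit you force-push back to), not anyone's released tool:

```python
# Staged pipeline with rollback: each stage snapshots the branch state
# before running; a failed (adversarial) review resets to the snapshot
# of the stage it sends the work back to. `run_stage` stands in for
# invoking an agent.

STAGES = ["prompt", "research", "plan", "plan_review",
          "write_tests", "test_review", "implement", "pr_review"]

def run_pipeline(run_stage, max_retries=3):
    snapshots, state = {}, []
    i, retries = 0, 0
    while i < len(STAGES):
        stage = STAGES[i]
        snapshots[stage] = list(state)           # like a git tag per stage
        ok, state = run_stage(stage, state)
        if ok:
            i += 1
        else:
            retries += 1
            if retries > max_retries:
                raise RuntimeError(f"stuck at {stage}")
            i -= 1                               # back to the previous stage
            state = list(snapshots[STAGES[i]])   # the "force push"
    return state

# Stub agent: fails test_review once, then succeeds everywhere.
failures = {"test_review": 1}
def stub_agent(stage, state):
    if failures.get(stage, 0):
        failures[stage] -= 1
        return False, state
    return True, state + [stage]

print(run_pipeline(stub_agent))  # all eight stage names, in order
```

In a real system `run_stage` would dispatch to different models per stage (adversarial reviews going to a different company's model) and the snapshots would be actual commits.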
I have a design for this of course. I haven't implemented it yet.
If I fork out a version for others that is public, then I have to maintain that variation as well.
Is anyone in a similar situation? I think most of the ones I see released are not particularly complex compared to my system, but at the same time I don't know how to convey how to use my system, as someone who just uses it alone.
It feels like I don't want anyone to run my system; I just want people to point their AI system at mine and ask it what might be valuable to add to their own.
I don't want to maintain one for people. I don't want to market it as some magic cure. Just show patterns that others can use.
There are a lot of patterns I think are helpful for me.
I have checks for which type of repo I'm in, with branching dev flows for each one.
It's going to be hard to communicate all of this generically, but I am trying.
That was my impression of superpowers as well. Maybe not highly overengineered but definitely somewhat. I ended up stripping it back to get something useful. Kept maybe 30%.
There's a kernel of a good idea in there but I feel it's something that we're all gradually aligning on independently, these shared systems are just fancy versions of a "standard agentic workflow".
Another great technique is to use one of these structures in a repo, then task your AI with overhauling the framework using best practices for whatever your target project is. It works great for creative writing, humanizing, songwriting, technical/scientific domains, and so on. In conjunction with agents, these are excellent to have.
I think they're going to be a temporary thing - a hack that boosts utility for a few model releases until there's sufficient successful use cases in the training data that models can just do this sort of thing really well without all the extra prompting.
These are fun to use.
I doubt any hot off the press features are *that* important, but am curious if the customizations of the fork are a net positive considering this.
My findings:
1. The spec created by Superpowers was very detailed (described the specific fonts, color palette), included the exact content of config files, commit messages etc. But it missed a lot of things like analytics, RSS feed etc.
2. Superpowers wrote the spec and plan as two separate documents which was better than the collaborative method, which put both into one document.
3. Superpowers recommended an in-place migration of the blog whereas the collaborative spec suggested a parallel branch so that Hugo and Astro can co-exist until everything is stable.
And a few more differences are written up in [0].
In general, I liked the aspect of developing the spec through discussion rather than one-shotting it, it let me add things to the spec as I remember them. It felt like a more iterative discovery process vs. you need to get everything right the first time. That might just be a personal preference though.
At the end of this exercise, I asked Claude to review both specs in detail, it found a few things that both specs missed (SEO, rollback plan etc.) and made a final spec that consolidates everything.
Superpowers and GSD are Claude Code plugins (providing skills).
Get Shit Done is best when you're an influencer and need to create a Potemkin SaaS overnight for tomorrow's TikTok posts.
It's hard to say why GSD worked so much better for us than other similar frameworks, because the underlying models also improved considerably during the same period. What is clear is that it's a huge productivity boost over vanilla Claude Code.
Yes this is how much paying FreshBooks annoyed me. Plus I hated they forced an emailed 2FA if you didn’t connect with Google.
Also, how much does it cost to run? Like, API costs?
Is it pure Swift? Or Electron app?
Is this supposed to run in a VM?
The best way I have today is to start with a project requirements document and then ask for a step-by-step implementation plan, and then go do the thing at each step but only after I greenlight the strategy of the current step. I also specify minimal, modular, and functional stateless code.
Sometimes annoying - you can't really fire and forget (I tend to regret skipping discussion on any complex tasks). It asks a lot of questions. But I think that's partly why the results are pretty good.
The new /gsd:list-phase-assumptions command added recently has been a big help there to avoid needing a Q&A discussion on every phase - you can review and clear up any misapprehensions in one go and then tell it to plan -> execute without intervention.
It burns quite a lot of tokens reading and re-reading its own planning files at various times, but it manages context effectively.
Been using the Claude version mostly. Tried it in OpenCode too, but it's a bit buggy there.
They are working on a standalone version built on pi.dev: https://github.com/gsd-build/gsd-2 ...The rationale is good, I guess, but it's unfortunate that you can't then use your Claude Max credits with it, as it has to use the API.
https://zarar.dev/spec-driven-development-from-vibe-coding-t...
I find the added structure of YAML + requirement IDs helps tremendously compared to plain markdown.
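I don't know the parent commenter's exact schema, but a hypothetical shape of such a spec, with stable requirement IDs, might look like:

```yaml
# Hypothetical spec fragment (illustrative, not the commenter's format):
# the ids give agents and commits something unambiguous to cite.
requirements:
  - id: REQ-001
    summary: Users can reset their password via an emailed link
    acceptance:
      - reset link expires after 15 minutes
      - old sessions are invalidated on reset
  - id: REQ-002
    summary: Failed logins are rate limited per account
    acceptance:
      - 5 failures lock the account for 10 minutes
```

The point of the IDs is that a test, a commit message, or an agent's plan can reference REQ-001 unambiguously, which plain markdown prose can't give you.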
I am still a few days away from open sourcing the stack (CLI / API & Server), plan is to gather as much feedback as I can and decide if this is worth maintaining.
>hundreds of customers
I started with all the standard spec flow and as I got more confident and opinionated I simplified it to my liking.
I think the point of any spec-driven framework is that you eventually want to own the workflow yourself, so that you can constrain code generation on your own terms.
I think these types of systems (GSD/Superpowers) are way too opinionated.
It's not that they can't or don't work. I just think that the best way to truly stay on top of the crazy pace of changes is to not attach yourself to super opinionated workflows like these.
I'm building an orchestrator library on top of openspec for that reason.
For this reason I don’t think it’s actually a good name. It should be called planning-shit instead, since that’s seemingly 80%+ of what I did while interacting with this tool. And when it came to getting things done, I didn’t need this at all, and the plans were just alright.
But what makes a difference is running a plan-review agent and a work-review agent; they fix issues before and after the work. Both pull their weight, but the most surprising is the plan-review one. The work-review judge reliably finds bugs to fix, though its insights are less surprising. They should run as separate subagents, not from the main one, because they need a fresh perspective.
Other things that matter: 1. testing enforcement, 2. cross-task project memory. My implementation of memory combines capturing user messages with a hook, an append-only log, and a compressed memory state of the project, which gets read before work and updated after each task.
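A minimal sketch of that memory scheme, with file names of my own choosing: a hook appends every user message to an append-only log, and a compressed state file is rebuilt after each task. In practice `summarize` would be an LLM call; here it's injected so the sketch is self-contained:

```python
# Cross-task memory: append-only message log + compressed state file.
import json, os

LOG_PATH = "memory/log.jsonl"
STATE_PATH = "memory/state.md"

def hook_capture(message):
    # append-only: the hook only ever adds, never rewrites history
    os.makedirs("memory", exist_ok=True)
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"msg": message}) + "\n")

def compress_state(summarize):
    # read the whole log, write a fresh compressed state for the next task
    with open(LOG_PATH) as f:
        entries = [json.loads(line)["msg"] for line in f]
    state = summarize(entries)
    with open(STATE_PATH, "w") as f:
        f.write(state)
    return state

hook_capture("use tabs, not spaces")
hook_capture("the staging DB is read-only")
print(compress_state(lambda msgs: "; ".join(msgs)))
```

The agent then reads `state.md` before each task, so corrections you made three tasks ago survive context resets.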
One pattern that's worked well for me: instead of writing specs manually, I extract structured architecture docs from existing systems (database schemas, API endpoints, workflow logic) and use those as the spec. The AI gets concrete field names, actual data relationships, and real business logic — not abstractions. The output quality jumps significantly compared to hand-written descriptions.
The tricky part is getting that structured context in the first place. For greenfield projects it's straightforward. For migrations or rewrites of existing systems, it's the bottleneck that determines whether AI-assisted development actually saves time or just shifts the effort from coding to prompt engineering.
[1] https://www.riaanzoetmulder.com/articles/ai-assisted-program...
If multiple people work with different AI tools on the same project, they will all add their own stuff in the project and it will become messy real quick.
I'll keep superpowers, claude-mem, context7 for the moment. This combination produces good results for me.
This is the real challenge. The people I know that jump around to new tools have a tough time explaining what they want, and thus how new tool is better than last tool.
There's some VC money interest but I'd classify more than 9 / 10ths of it as good old fashioned wildcat open source interest. Because it's fascinating and amazing, because it helps us direct our attention & steer our works.
And also it's so much more approachable and interesting, now that it's all tmux terminal stuff. It's so much more direct & hackable than, say, wading into vscode extension building, deep in someone else's brambly thicket of APIs, and where the skeleton is already in place anyhow, where you are only grafting little panes onto the experience rather than recasting the experience. The devs suddenly don't need or care for or want that monolithic big UI, and have new soaring freedom to explore something much nearer to them, much more direct, and much more malleable: the terminal.
There's so many different forms of this happening all at once. Totally different topic, but still in the same broad area, submitted just now too: Horizon, an infinite canvas for terminals/AI work. https://github.com/peters/horizon https://news.ycombinator.com/item?id=47416227
[1] https://github.com/ChristopherKahler/paul
[2] https://github.com/ChristopherKahler/paul/blob/main/PAUL-VS-...
If it was game engine or new web framework for example there would be demos or example projects linked somewhere.
I'm facing increasing pressure from senior executives who think we can avoid the $$$ B2B SaaS by using AI to vibe code a custom solution. I love the idea of experimenting with this but am horrified by the first-ever-case being a production system that is critical to the annual strategic plan. :-/
I've been poking at security issues in AI-generated repos and it's the same thing: more generation means less review. Not just logic — checking what's in your .env, whether API routes have auth middleware, whether debug endpoints made it to prod.
You can move that fast. But "review" means something different now. Humans make human mistakes. AI writes clean-looking code that ships with hardcoded credentials because some template had them and nobody caught it.
All these frameworks are racing to generate faster. Nobody's solving the verification side at that speed.
Saying "I generated 250k lines" is like saying "I used 2500 gallons of gas". Cool, nice expense, but where did it get you? Because if it's three miles, you're just burning money.
250k lines is roughly SQLite or Redis in project size. Do you have SQLite-maintaining money? Did you get as far as Redis did in outcomes?
There’s probably a world where you could do that, sure, if the spec were written in a formal language with no ambiguity and there were a rigorous system for translating from spec to code.
My rant about this: https://sibylline.dev/articles/2026-01-27-stop-orchestrating...
Rather than having agents decide to manage their own code lifecycle, define a state machine where code moves from agent to agent and isolated agents critique each others code until the code produced is excellent quality.
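The critique loop at the heart of that state machine can be sketched as a fixed-point iteration; writer and reviewer are stubbed here as plain functions, standing in for isolated agents:

```python
# Agent-to-agent critique loop: a draft only leaves the loop once the
# reviewer raises no issues, with a round budget so it can't spin forever.

def refine(write, review, max_rounds=5):
    draft = write(feedback=None)
    for _ in range(max_rounds):
        issues = review(draft)
        if not issues:
            return draft            # quality bar met
        draft = write(feedback=issues)
    raise RuntimeError("quality bar not met within budget")

# Stub agents: the reviewer demands error handling once.
def writer(feedback):
    return "code with error handling" if feedback else "naive code"

def reviewer(draft):
    return [] if "error handling" in draft else ["missing error handling"]

print(refine(writer, reviewer))  # code with error handling
```

The isolation matters: because the reviewer never shares context with the writer, it can't inherit the writer's blind spots.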
This is still a bit of an token hungry solution, but it seems to be working reasonably well so far and I'm actively refining it as I build.
Not going to give you formal verification, but might be worth looking into strategies like this.
We built AI code generation tools, and suddenly the bottleneck became code review. People built AI code reviewers, but none of the ones I've tried are all that useful - usually, by the time the code hits a PR, the issues are so large that an AI reviewer is too late.
I think the solution is to push review closer to the point of code generation, catch any issues early, and course-correct appropriately, rather than waiting until an entire change has been vibe-coded.
Things have changed quite a bit. I hope you give GSD a try yourself.
It absolutely tore through tokens though. I don't normally hit my session limits, but hit the 5-hour limits in ~30 minutes and my weekly limits by Tuesday with GSD.
Makes sense for consistency, but also shifts the problem:
how do you keep those artifacts in sync with the actual codebase over time?
I've been down the "don't read the code" path and I can say it leads nowhere good.
I am perhaps talking my own book here, but I'd like to see more tools that brag about "shipped N real features to production" or "solved Y problem in large-10-year-old-codebase"
I'm not saying that coding agents can't do these things and such tools don't exist, I'm just afraid that counting 100k+ LOC that the author didn't read kind of fuels the "this is all hype-slop" argument rather than helping people discover the ways that coding agents can solve real and valuable problems.
#1 rejection reason: missing context. 80% needed human fixes. Agents can write code fine. They just don't know what "done" looks like in your codebase.
Count successful merges into repos with real history instead of LOC and the hard part is specification, not execution.
Wrote about this topic @ https://www.augmentcode.com/blog/the-end-of-linear-work
Like most spec driven development tools, GSD works well for greenfield or first few rounds of “compound engineering.” However, like all others, the project gets too big and GSD can’t manage to deliver working code reliably.
Agents working GSD plans will start leaving orphans all over, it won’t wire them up properly because verification stages use simple lexical tools to search code for implementation facts. I tried giving GSD some ast aware tools but good luck getting Claude to reliably use them.
Ultimately I put GSD back on the shelf and developed my own “property graph” based planner that is closer to Claude “plan mode” but the design SOT is structured properties and not markdown. My system will generate docs from the graph as user docs. Agents only get tasked as my “graph” closes nodes and re-sorts around invariants, then agents are tasked directly.
I think I have to get my head around a lot more than I think
Now all you really have to do is chat with claude about what you're thinking about building.
In plan mode, Claude can't edit anything, and some extra "you're an expert at planning!" prompts are prepended to your initial message.
Then, either when you're ready or when Claude thinks the plan is gelling, it'll suggest stopping and writing up a detailed plan. CC will dispatch some "planning agents" with prompts that your 'main' CC has crafted, telling each agent what to plan for within the context of your conversation and which parts of the codebase to explore and integrate with.
Once all that is done, it will display the plan to you and offer to "clear context and implement", where it will just get to work. Or it will offer to go back to chatting, to resolve whatever was misunderstood or to mix in a new idea you had.
These plans are saved as markdown in your .claude/plans directory.
Plan mode is handy for one-offs. But if you enter another plan-mode session thinking Claude will learn from, or build off, a previous plan spec, it won't, unless you explicitly say something like "read the previous plan <path to plan file> and re-use the scaffolding directives for this new project".
Looked at the profile; they haven't done or published anything interesting other than promoting products to "get stuff done".
This is like the TODO list book gurus writing about productivity
But I guess if I go by what you’re saying I suppose it makes sense for it not to do a bunch of things you didn’t ask it to do.
It's already quite debatable whether software developers should be called software engineers, but this is just ridiculous.
I got a promotion once for deleting 250K lines of code in less than a month. Now that sounds better.
Faster than using AI. Cheaper. Code is better tested/more secure. I can learn/build with other humans.
1. Backend unit tests — fast in-memory tests that run the full suite in ~5 seconds on every save.
2. Full end-to-end tests — automated UI tests that spin up a real cloud server, run through the entire user journey (provision → connect → manage → teardown), and verify the app behaves correctly on all supported platforms (phone, tablet, desktop).
3. Screenshot regression tests — every E2E run captures named screenshots and diffs them against saved baselines. Any unintended UI change gets caught automatically.

LOL, screenshot regression. You're still not a dev, buddy. Read some books.
Can we pls stop this.
Between my own apps and consulting work, I had a pretty good side business. Like everything else though, those days didn't last forever. But there was a lot of easy money early on.
I will open source it in a few weeks, as I still have to complete a few more features.
There is a gsd-plan-checker that runs before execution, but it only verifies logical completeness — requirement coverage, dependency graphs, context budget. It never looks at what commands will actually run. So if the planner generates something destructive, the plan-checker won't catch it because that's not what it checks for. The gsd-verifier runs after execution, checking whether the goal was achieved, not whether anything bad happened along the way. In /gsd:autonomous this chains across all remaining phases unattended.
The granular permissions fallback in the README only covers safe reads and git ops — but the executor needs way more than that to actually function. Feels like there should be a permission profile scoped to what GSD actually needs without going full skip.
That's not a reason to stop trying. This is the iterative process of figuring out what works.
I've been using a Claude Pro plan just as a code analyzer / autocomplete for a year or so. But I recently decided to try to rewrite a very large older code base I own, and set up an AI management system for it.
I started this last week, after reading about paperclip.ing. But my strategy was to layer the system in a way I felt comfortable with, so I set up something that now feels a bit like a Rube Goldberg machine. What I did was set up a clean box and give my Claude Pro plan root access to it. Then I set up openclaw on that box, but not with root, so that if it ran wild, I could intervene. Then I had openclaw set up paperclip.ing.
openclaw is on a separate Claude API account and is already costing what seems like way too many tokens, but it does have a lot of memory of the project now, and in fairness, for the $150 I've spent, it has rewritten an enormous chunk of the code in a satisfactory way (with a lot of oversight). I do like being able to WhatsApp with it; that's a huge bonus.
But I feel like maybe this a pretty wasteful way of doing things. I've heard maybe I could just run openclaw through my Claude Pro plan, without paying for API usage. But I've heard that Anthropic might be shutting down that OAuth pathway. I've also heard people saying openclaw just thoroughly sucks, although I've been pretty impressed with its results.
The general strategy I'm taking on this is to have Claude read the old codebase side by side with me in VSCode, then prepare documents for openclaw to act on as editor, then re-evaluate; then have openclaw produce documents for agent roles in Paperclip and evaluate them.
Am I just wasting my money on all these API calls? $150 so far doesn't seem bad for the amount of refactoring I've gotten, across a database and back and front end at the same time, which I'm pretty sure Claude Pro would not have been able to handle without much more file-by-file supervision. I'm slightly afraid now to abandon the memory I've built up with openclaw and switch to a different tool. But hey, maybe I should just be doing this all on the Claude Pro CLI at this point...?
Looking for some advice before I try to switch this project to a different paradigm. But I'm still testing this as a structure, and trying to figure out the costs.
[Edit: I see so many people talking about these lighter-weight frameworks meant for driving an agent through a large, long-running code building task... like superpowers, GSD, etc... which to me as a solo coder sound very appealing if I were building a new project. But for taking 500k LOC and a complicated database and refactoring the whole thing into a headless version that can be run by agents, which is what I'm doing now, I'm not sure those are the right tools; but at the same time, I never heard anyone say openclaw was a great coding assistant -- all I hear about it being used for is, like, spamming Twitter or reading your email or ordering lunch for you. But I've only used it as a code-manager, not for any daily tasks, and I'm pretty impressed with its usefulness at that...]
I developed my own task tracker (github.com/kfcafe/beans); I'm not sure how portable it is, as it's been a while since I've used it in Claude Code. I've been using pi-coding-agent the past few months, highly recommend; it's what openclaw is built on top of. Anthropic hasn't shut down OAuth; they just say it's banned outside of Claude Code. I'd recommend installing pi, telling it what you were doing with openclaw, and having it port all of the information over to the pi installation.
You could also check out Ralph Wiggum loops; they could be a good way to rewrite the codebase. Just write a prompt describing what you want done, then write a bash loop calling Claude's CLI pointed at the prompt file. The agent runs in a loop until you decide to stop it. Also not the most efficient use of tokens, but at least you'd be using Claude Pro and not spending money on API calls.
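In case it helps, here's the control flow of that loop factored out (in Python rather than bash so it's easy to test with a stub; the subprocess call mentioned in the comment is illustrative, so check your CLI's flags before relying on it):

```python
# "Ralph Wiggum" loop: keep re-invoking the agent against the same
# prompt file until told to stop. The runner is injected so the loop
# itself is testable; in practice run_once would be something like
#   subprocess.run(["claude", "-p", open("PROMPT.md").read()])
# (flags illustrative).

def ralph_loop(run_once, should_stop):
    iterations = 0
    while not should_stop(iterations):
        run_once()                  # one full agent pass over the prompt
        iterations += 1
    return iterations

# e.g. cap at 3 passes instead of running until manually killed
print(ralph_loop(lambda: None, lambda i: i >= 3))  # 3
```

A stop condition (iteration cap, "done" marker file, or a failing-test count reaching zero) is worth adding so the loop doesn't burn your whole weekly quota unattended.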
I sort of do need something with persistent memory and personality... or a way to persist it without spending a lot of time trying to bring it back up to speed... it's not exactly specific tasks being tracked, I need it to have a fairly good grasp on the entire ecosystem.
The codebase is small enough that I can basically go and find all the changes the LLM executed with each request, and read them with a very skeptical eye to verify that they look sane, and ask it why it did something or whether it made a mistake if anything smells wrong. That said, the code I'm rewriting is a genetic algorithm / evaluation engine I wrote years ago, which itself writes code that it then evaluates; so the challenge is having the LLM make changes to the control structure, with the aim of having an agent be able to run the system at high speed and read the result stream through a headless API, without breaking either the writing or evaluation of the code that the codebase itself is writing and running. Openclaw has a surprisingly good handle on this now, after a very very very long running session, but most of the problems I'm hitting still have to do with it not understanding that modifying certain parameters or names could cause downstream effects in the output (eval code) or input (load files) of the system as it's evolving.
If I remember correctly, it created a lot of changes and spent a lot of time doing something, and in the end it was all smoke and mirrors. If I were ever to use something like this, I would maybe use BMad, which suffers from the same issues as Speckit and the others.
I don't know if they have some sponsorship deal with a bunch of YouTubers who are raving about how awesome this is... without any supporting evidence.
Anyhow, this is my experience. Superpowers, on the other hand, has been quite useful so far, but I haven't used it enough to claim anything.