It's not a surprise to me that this approach also helps AI coding agents work more effectively, since in-depth planning essentially moves the thinking upfront.
(I wrote more about this here: https://liampulles.com/jira-tickets.html)
Nobody who delivers any system professionally thinks it’s a bad thing to plan out and codify every piece of the problem you’re trying to solve.
That’s part of what waterfall advocates for. Write a spec, and decompose to tasks until you can implement each piece in code.
Where the model breaks - and what software developers rightly hate - is unnecessarily rigid specifications.
If your project’s acceptance criteria are bound by a spec that has tasked you with the impossible, while simultaneously being impossible to change, then you, the dev, are screwed. This is doubly true in cases where you might not get to implementing the spec until months after the spec has been written - in which case, the spec has calcified into something immutable in stakeholders’ minds.
Agile is frequently used by weak product people and lousy project managers as an excuse to “figure it out when we get there”. It puts off any kind of strategic planning or decision making until the last possible second.
I’ve lost track of the number of times that this has caused rework in projects I’ve worked on.
Most of the people you describe here will try to start changes at the last possible second, and since our estimates are always wrong and preemptions always happen, they start every change too late to avoid the consequences of waiting too long. It is the worst of all worlds: both the solution and the remediation end up rushed, so tech debt piles up instead of being paid down.
No battle plan survives contact with the enemy. But waterfall is not just a battle plan, it’s an entire campaign. And the problem comes both from trying to define problems we have little in house experience with, and then the sunk cost fallacy of having to redo all that “work” of project definition when reality and the customers end up not working the way we planned.
And BTW, trying to maintain the illusion of that plan results in many abstractions leaking. It creates impedance mismatches in the code and those always end up multiplying the difficulty of implementing new features. This is a major source of Business and Product not understanding why implementing a feature is so hard. It seems like it should just fit in with the existing features, but those features are all a house of cards built on an abstraction that is an outright fabrication.
That's what agile advocates for too. The difference is purely in how much spec you write before you start implementing.
Waterfall says specify the whole milestone up front before developing. Agile says create the minimum viable spec before implementing, then get back to iterating on the spec straight after putting it into a customer's hands.
Waterfall doesn't get a bad rap it doesn't deserve. The longer those feedback loops are, the more scope you have for fucking up and not dealing with it quickly enough.
In the end, there is project management that can keep a project on track while also being able to adapt to change, and project management that can't and chooses to hide behind some bureaucratic process. That has always existed and will keep existing no matter what you call it.
Ah, and therein lies the problem.
I’ve seen companies frequently elect “none at all” as the right amount of spec to write.
I’d rather have far too many specs than none.
I've found that it's a balancing act, like so many things in software development. We can't rush in willy-nilly, but it's also possible to kill a project by spending too much time preparing (think "The Anal-Retentive Chef" skits from Saturday Night Live).
Also, I have found that "It Depends" is an excellent mantra for life in general, and software development in particular.
I think having LLM-managed specs might be a good idea, as it reduces the overhead required to maintain them.
Yeah, agree! Also something like "moderation is key": many things can be enjoyed, but enjoy them too much, or do them too much, and they kind of stop being so effective/good. The Swedes even have a specific word for something that isn't too much and isn't too little: "Lagom".
It can be applied to almost anything in life, from TDD, to extreme programming, and "waterfall vs agile", or engaging in kinks, or consumption of drugs, or...
I feel like things go bad when people either do none of something, or way too much. Finding the balance, the sweet spot: that's where the magic happens.
I think it’s a great conversational tool for evaluating and shaking out weak points in a design at an early phase.
All of these guesses will be wrong of course, but you want to end up with a diagram of how things were, how we would like them to be if time and existing code were no object, and what we decided to do given the limitations of the existing system, and resources (skill and time).
That second diagram informs what you do any time the third fails to match reality. If you fall back to what is, you will likely paint yourself into an architectural corner that will last for years. If you move toward the ideal or at least perpendicular, then there is no backpedaling.
You're describing Agile.
This doesn't say anything about what is appropriate for larger project planning. I don't have much experience doing project planning, so I'd look to others for opinions on that.
It helps me not only break the complexity into more manageable chunks but also go back to the business team to smooth out the rough edges that would otherwise require rework after review.
If we have a differing interpretation of what the article is motivating for, then please take the opportunity to contemplate an additional perspective and let it enrich your own.
The power of agile is supposed to be "I don't need to figure this out now, I'll figure it out based on experimentation" which doesn't mean nothing at all is planned.
If you're not planning a mission to Jupiter, you don't need every step planned out before you start. But in broad strokes it's also good to have a plan.
The optimum is to have some recorded shape of the work to come but to give yourself space to change your mind based on your experiences and to plan the work so you can change the plan.
The backlash against waterfall is the result of coming up with very detailed plans before you start, having those plans change constantly during the whole project requiring you to throw away large amounts of completed work, and when you find things that need to change, not being able to because management has decided on The Plan (which they will decide something new on later, but you can't change a thing).
For some decisions, the best time to plan is up front, for other decisions the best time to design is while you're implementing. There's a balance and these things need to be understood by everybody, but they are generally not.
The best time to plan is dependent on how stable/unstable the environment is.
That is, without spending 10x to 100x the time up front to get it right the first time. But if you're not building space ships or nuclear reactors, it's so much faster and better to just do it and figure out things along the way. So much time spent planning and guessing about the future is time wasted and that's why early stage startups can do something in a week that would take an old enterprise 5 years.
"no plan survives first contact with the enemy"
And that's the source of agile and why too much planning is just wasting time for management to have something to do. No I don't know exactly what I'm going to do or when it'll be done, and if you leave it like that I'll get it done faster.
I'm also old enough to have started my career learning the rational unified process and then progressed through XP, agile, scrum etc
My process is I spend 2-3 hours writing a "spec" focusing on acceptance criteria and then by the end of the day I have a working, tested next version of a feature that I push to production.
I don't see how using a spec has made me less agile. My iteration takes 8 hours.
However, I see tons of useless specs. A spec is not a prompt. It's an actual definition of how to tell if something is behaving as intended or not.
People are notoriously bad at thinking about correctness in each scenario, which is why vibe coding is so big.
People defer thinking about what correct and incorrect actually looks like for a whole wide scope of scenarios and instead choose to discover through trial and error.
I get 20x ROI on well defined, comprehensive, end to end acceptance tests that the AI can run. They fix everything from big picture functionality to minor logic errors.
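To make the idea concrete, here is a sketch of what machine-runnable acceptance criteria can look like, in pytest style. `apply_discount` is a made-up feature standing in for whatever is being specified; the scenarios, not the prompt, are the spec.

```python
# Hypothetical acceptance criteria captured as runnable tests. Each
# scenario pins down what "correct" looks like before any code is
# generated, so an agent can run them and self-correct.

def apply_discount(total_cents: int, code: str) -> int:
    """Toy implementation so the sketch is self-contained."""
    rates = {"SAVE10": 0.10, "SAVE25": 0.25}
    return round(total_cents * (1 - rates.get(code, 0.0)))

def test_known_code_reduces_total():
    assert apply_discount(1000, "SAVE10") == 900

def test_unknown_code_is_a_no_op():
    assert apply_discount(1000, "BOGUS") == 1000

def test_discount_never_goes_negative():
    assert apply_discount(0, "SAVE25") == 0
```

Each test names a scenario explicitly, which is exactly the "what does correct look like" thinking that people tend to defer.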
I haven't jumped in headfirst to the "AI revolution", but I have been systematically evaluating the tooling against various use cases.
The approach that tends to have the best result for me combines a collection of `RFI` (request for implementation) markdown documents to describe the work to be done, as well as "guide" documents.
The guide documents need to keep getting updated as the code changes. I do this manually but probably the more enthusiastic AI workflow users would make this an automated part of their AI workflow.
It's important to keep the guides brief. If they get too long they eat context for no good reason. When LLMs write for humans, they tend to be very descriptive. When generating the guide documents, I always add an instruction to tell the LLM to "be succinct and terse", followed by "don't be verbose". This makes the guides into valuable high-density context documents.
The RFIs are then used in a process. For complex problems, I first get the LLM to generate a design doc, then an implementation plan from that design doc, then finally I ask it to implement it while referencing the RFI, design doc, impl doc, and relevant guide docs as context.
If you're altering the spec, you wouldn't ask it to regen from scratch, but use the guide documents to compute the changes needed to implement the alteration.
I'm using claude code primarily.
That said, I think the non-determinism when rerunning a coding task is actually pretty useful when you're trying to brainstorm solutions. I quite often rerun the same prompt multiple times (with slight modifications or using different models) and write down the implementation details that I like before writing the final prompt. When I'm not happy with the throwaway solutions at all I reconsider the overall specification.
However, the same non-determinism has also made me "lose" a solution that I threw out and where the real prompt actually performed worse. So nowadays I try to make it a habit to stash the throwaway solutions just in case. There's probably something in Cursor where you can dig out things you backtracked on but I'm not a power user.
You can provide the existing spec, the new spec, and the existing codebase all as context, then have the LLM modify the codebase according to the updates to the spec.
If I get a nonsensical requirement, I push back. If I see some risky code, I will think of some way to make it less risky.
Have you ever written the EXACT same code twice?
> it introduces an unreliable compiler.
So then by that definition, so are humans. If compiling is "taking text and converting it to code", that's literally us.
> it's up to the developer to review the changes. Which just seems like a laborious error prone task.
There are trade-offs to everything. Have you ever worked with an off-shore team? They tend to produce worse code and have 1% of the context the LLM does. I'd much rather review LLM-written code than "I'm not even the person you hired because we're scamming the system" developers.
A spec came from a customer and would detail every feature. Specs would be huge, but usually lack enough detail or be ambiguous. They would be signed off by the customer and then you'd deliver to the spec.
It would contain months, if not years, worth of work. Then after all this work the end product would not meet the actual customer needs.
A day's work is not a spec. It's a ticket's worth of work, which is agile.
Agile is an iterative process where you deliver small chunks of work and the customer course-corrects at regular intervals. Commonly 3-4 week sprints, made up of many tickets that take hours or days, per course correction.
Generally each sprint had a spec, and each ticket had a spec. But it sounds like until now you've just been winging it, with vague definitions per feature. It's very common, especially where the PO or PM are bad at their job. Or the developer is informally acting as PO.
Now that you're making specs per ticket, you're just doing what many development teams already do. You're just bizarrely calling it a new process.
It's like watching someone point at a bicycle and insist it's a rocketship.
The approach we take is that the specs are developed from the tests, and each test exercises its spec point in its entirety. That is, a test and a spec are semantically synonymous within the code base. An interesting thing we're playing with is using the specs alongside the signatures to have an LLM determine when the spec is incomplete.
The problem I see a lot with Agile is that people over-focus on functional requirements in the form of user stories. Which in your case would be statements like “X should do…”
Take things like "capacity". When building a system, you may have a functional requirement like "User can retrieve imagery data if authorized" (that is the function of the system). A non-functional requirement might be how many concurrent users the system can handle at a time. This will influence your design because different system architectures/designs will support different levels of usage, even though the usage (the task of getting imagery to analyze or whatever) is the same whether it handles one user at a time or one million.
> People defer thinking about what correct and incorrect actually looks like for a whole wide scope of scenarios and instead choose to discover through trial and error.
LLMs are _still_ terrible at deriving even the simplest logical entailments. I've had the latest and greatest Claude and GPT derive 'B instead of '(not B) from '(and A (not B)) when 'A and 'B are anything but the simplest of English sentences. I shudder to think what they decide the correct interpretation of a spec written in prose is.
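For reference, the entailment in question is easy to verify mechanically. A brute-force truth-table check in Python (formulas encoded as boolean functions of A and B, an encoding chosen just for this sketch):

```python
from itertools import product

def entails(premise, conclusion):
    """True iff the conclusion holds in every assignment where the premise holds."""
    return all(conclusion(a, b)
               for a, b in product([True, False], repeat=2)
               if premise(a, b))

premise = lambda a, b: a and not b   # (and A (not B))

assert entails(premise, lambda a, b: not b)   # (not B) follows
assert not entails(premise, lambda a, b: b)   # B does not
```

The premise is only true at A=true, B=false, so (not B) is entailed and B is refuted; any sound reasoner gets this in one step.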
Kick that over to some agents to bash on, check in and review here and there, maybe a little mix of vibe and careful corrections by me, and it's done!
Usually in less time, but: any time an agent is working on work shit, I'm working on my race car... so it's a win win win to me. I'm still using my brain, no longer slogging through awful "human centered" programming languages, and have more time for my hobbies.
Isn't that the dream?
Now, to crack this research around generative gibber-lang programming... 90% of our generative code problems are related to the programming languages themselves. Intended for humans, optimized for human interaction, speed, and parsing. Let the AIs design, speak, write, and run the code. All I care about is that the program passes my tests and does what I intended. I do not care if it has indents, or other stupid dogmatic aspects of what makes one language equally usable to any other, but no "my programming language is better!", who cares. Loving this era.
I believe (and practice) that spec-based development is one of the future methodologies for developing projects with LLMs. At least it will be one of the niches.
The author thinks about specs as waterfall. I think about them as a context entrypoint for LLMs. Given enough info about the project (including user stories, tech design requirements, filesystem structure and meaning, core interfaces/models, functions, etc.), the LLM will be able to build sufficient initial context for the solution and expand it by reading files and grepping text. And the most interesting part is that you can have the LLM keep the context/spec/project file updated each time it updates the project. Voilà: now you are in agile again: just keep iterating on the context/spec/project.
You provide basic specs and can work with LLMs to create thorough test suites that cover the specs. Once specs are captured as tests, the LLM can no longer hallucinate.
I model this as "grounding". Just like you need to ground an electrical system, you need to ground the LLM to reality. The tests do this, so they are REQUIRED for all LLM coding.
Once a framework is established, you require tests for everything. No code is written without tests. These can also be perf tests. They need solid metrics in order to output quality.
The tests provide context and documentation for future LLM runs.
This is also the same way I'd handle foreign teams, that at no fault of their own, would often output subpar code. It was mainly because of a lack of cultural context, communication misunderstandings, and no solid metrics to measure against.
Our main job with LLMs now as software engineers is a strange sort of manager, with a mix of solutions architect, QA director, and patterns expertise. It is actually a lot of work and requires a lot of human people to manage, but the results are real.
I have been experimenting with how meta I can get with this, and the results have been exciting. At one point, I had well over 10 agents working on the same project in parallel, following several design patterns, and they worked so fast I could no longer follow the code. But with layers of tests, layers of agents auditing each other, and isolated domains with well defined interfaces (just as I would expect in a large scale project with multiple human teams), the results speak for themselves.
I write all this to encourage people to take a different approach. Treat the LLMs like they are junior devs or a foreign team speaking a different language. Remember all the design patterns used to get effective use out of people regardless of these barriers. Use them with the LLMs. It works.
Except when it decides to remove all the tests, change their meaning to make them pass, or write something not in the spec. Hallucinations are not a problem of the input given; they're in the foundations of LLMs, and so far nobody has solved them. Thinking it won't happen can and will have really bad outcomes.
I like to keep domains with their own isolated workspaces and git repos. I am not there yet, but I plan on making a sort of local-first gitflow where agents have to pull the codebase, make a new branch, make changes, and submit pull requests to the main codebase.
I would ultimately like to make this a oneliner for agents, where new agents are sandboxed with specific tools and permissions cloning the main codebase.
Fresh-context agents then can function as code reviewers, with escalation to higher tier agents (higher tier = higher token count = more expensive to run) as needed.
In my experience, with correct prompting, LLMs will self-correct when exposed to auditors.
If mistakes do make it through, it is all version controlled, so rolling back isn't hard.
Tests are not a correctness proof. I can’t trust LLMs to correctly reason about their code, and tests are merely a sanity check, they can’t verify that the code was correctly reasoned.
I also actually do not care if it reasons properly. I care about results that eventually stabilize on a valid solution. These results do not need to be based on "thinking"; they can be experimentally derived. Agents can own whatever domain they work in and acquire results with whatever methods they choose, given the constraints they are subject to. I measure results by validating via e2e tests, penetration testing, and human testing.
I also measure via architecture agents and code review agents that validate adherence to standards. If standards are violated a deeper audit is conducted, if it becomes a pattern, the agent is modified until it stabilizes again.
This is more like numerical relaxation methods. You set the edge conditions/constraints, then iterate the system until it stabilizes on a solution. The solution in this case, however, is meta: you are stabilizing on a set of agents that can stabilize on a solution.
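The relaxation pattern being alluded to has a concrete classical form. A minimal Jacobi relaxation on a 1-D grid, purely illustrative and nothing agent-specific:

```python
# Fix the boundary values (the constraints) and iterate until the interior
# stops changing: the numerical-relaxation pattern described above.
def relax(grid, tol=1e-9):
    grid = list(grid)
    while True:
        new = grid[:1] + [(grid[i - 1] + grid[i + 1]) / 2
                          for i in range(1, len(grid) - 1)] + grid[-1:]
        if max(abs(a - b) for a, b in zip(new, grid)) < tol:
            return new
        grid = new

# Boundaries 0 and 1 stay fixed; the interior settles on the straight
# line through them (1/3 and 2/3 here).
solution = relax([0.0, 0.0, 0.0, 1.0])
```

The loop never touches the boundaries; the interior is repeatedly replaced by local averages until nothing moves, which is the "iterate under fixed constraints until stable" idea in its simplest form.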
Agents don't "reason" or "think", and I don't need to trust them. I trust only results.
The value of a good developer is that they generalize over all possible inputs and states. That’s something current LLMs can’t be trusted to do (yet?).
Hallucinations don't matter if the mechanics of the pipeline mitigate them. In other words, at a systems level, you can mitigate hallucinations. The agent level noise is not a concern.
This is no different from CPU design or any other noisy system. Transistors are not perfect and there is always error, so you need error correction. At a transistor level, CPUs are unreliable. At a systems level, they are clean and reliable.
This is no different. The stochastic noisiness of individual agents can be mitigated with redundancy, constraints, and error correction at a systems level.
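A toy version of that redundancy-plus-voting idea, with made-up run results: treat each agent run as a noisy sample and take a majority vote across redundant runs.

```python
from collections import Counter

def majority_vote(runs):
    """Most common answer across redundant runs, plus the agreement ratio."""
    answer, count = Counter(runs).most_common(1)[0]
    return answer, count / len(runs)

# Five redundant runs of the same task; two are corrupted by "noise".
runs = ["42", "42", "wrong-a", "42", "wrong-b"]
answer, agreement = majority_vote(runs)
assert (answer, agreement) == ("42", 0.6)
```

A real pipeline would more likely vote on verified properties (tests passed, review verdicts) than on raw outputs, but the systems-level shape is the same: per-run noise, corrected in aggregate.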
Take Prolog and logic programming. It's all about describing the problem and its context and letting the solver find the solution. Try writing your specs in pseudo-Prolog code and you will be surprised by all the missing information you're leaving up to chance.
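The same exercise can be approximated without Prolog: state the spec as constraints and enumerate the admissible solutions. The constraint below is invented for the sketch; the point is that an underdetermined spec yields many solutions rather than one, which makes the gap visible immediately.

```python
# Hypothetical spec fragment: "the discount is an integer between 0 and
# 50, and the final price of a 100-unit order must be under 80."
# Enumerating everything the spec admits shows how much it leaves open.
solutions = [d for d in range(0, 51) if 100 - d < 80]
assert len(solutions) == 30   # 30 admissible discounts, not one
```

A solver-style reading of a prose spec forces you to notice that 30 different implementations all "satisfy the requirements", which is exactly the missing information an LLM will otherwise fill in by chance.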
My objective is to write prompts for LLMs that can write prompts for LLMs that can write code.
When there is a problem downstream the descendant hierarchy, it is a failure of parent LLM's prompts, so I correct it at the highest level and allow it to trickle down.
This eventually resolves into a stable configuration with domain expertise towards whatever function I require, in whatever language is best suited for the task.
If I have to write tests manually, I have already failed. It doesn't matter how skilled I am at coding or capable I am at testing. It is irrelevant. Everything that can be automated should be automated, because it is a force amplifier.
What's not waterfall about this is lost on me.
Sounds to me like you're arguing waterfall is fine if each full run is fast/cheap enough, which could happen with LLMs and simple enough projects. [0]
Agile was offering incremental spec production, which had the tremendous advantage of accumulating knowledge incrementally as well. It might not be a good fit for LLMs, but revising the definition to make it fit doesn't help IMHO.
[0] Reminds me that reducing the project scopes to smaller runs was also a well established way to make waterfall bearable.
You might as well say agile is still waterfall; what are sprints if not waterfall with a 2-week iteration time? And Kanban is just a collection of independent waterfalls... It's not a useful definition of waterfall.
That being said, when for instance you had a project that should take 2 years and involve a dozen teams, you'd try to cut it into 3 or 4 phases, even if it would only be "released" and fully tested at the end of it all. At least if your goal was to have it see the light of day in a reasonable time frame.
Where I worked we also did integration runs at given checkpoints to be able to iron out issues earlier in the process.
PS: on agile, the main specificity I'm seeing is the ability to infinitely extend a project, as the scope and specs are typically set on the go. Which is a feature if you're a contractor on a project. You can't do that with waterfall.
Most shops have a mix of pre-planning and on-the go specing to get a realistic process.
What definition would that be?
Regardless, at this point it's all semantics. What I care about is how you do stuff, not the label you assign and in my book writing specs to ground the LLM is a good idea. And I don't even like specs, but in this instance, it works.
Exactly. There is a spec, but there is no waterfall required to work on and maintain it. The author of the article dismissed spec-based development precisely because they saw a resemblance to waterfall. But waterfall isn't required for spec-centric development.
The problem with waterfall is not that you have to maintain the spec, but that a spec is the wrong way to build a solution. So, it doesn't matter if the spec is written by humans or by LLMs.
I don't see the point of maintaining a spec for LLMs to use as context. They should be able to grep and understand the code itself. A simple readme or a design document, which already should exist for humans, should be enough.
“I don’t see the point of maintaining a documentation for developers. They should be able to grep and understand the code itself”
“I don’t see the point of maintaining tests for developers. They should be able to grep and understand the code itself”
“I don’t see the point of compilers/linters for developers. They should be able to grep and find issues themselves”
Going from spec to code requires a lot of decisions (each introducing technical debt). Automating the process removes control over those decisions and over the ultimate truth that is the code. Why can't the LLM retain a trace of the decisions, so that it presents control points to alter the results? Instead, it's always a rewrite from scratch.
I cannot think that this comment is done in good faith, when I clearly wrote above that documentation should already exist for humans:
> A simple readme or a design document, which already should exist for humans, should be enough.
The downfall of Waterfall is that there are too many unproven assumptions in too long of a design cycle. You don't get to find out where you were wrong until testing.
If you break a waterfall project into multiple, smaller, iterative Waterfall processes (a sprint-like iteration), and limit the scope of each, you start to realize some of the benefits of Agile while providing a rich context for directing LLM use during development.
Comparing this to agile is missing the point a bit. The goal isn't to replace agile, it's to find a way that brings context and structure to vibe coding to keep the LLM focused.
Then again, Waterfall was never a real methodology; it was a straw man description of early software development. A hyperbole created only to highlight why we should iterate.
If only this were accurate. Royce's chart (at the beginning of the paper, what became Waterfall, but not what he recommended by the end of the paper) has been adopted by the DOD. They're slowly moving away from it, but it's used on many real-world projects and fails about as spectacularly as you'd expect. If projects deliver on-time, it's because they blow up their budget and have people work long days and weekends for months or years at a time. If it delivers on budget, it's because they deliver late or cut out features. Either way, the pretty plan put into the presentations is not met.
People really do (and did) think that the chart Royce started with was a good idea, they're not competent, but somehow they got into positions in management to force this stupidity.
The word "spec" is a bit overloaded and I think we're all using it to define many things. There's a high-level spec and there are detailed component-level specs all of which kind of co-exist.
The sweet spot will be a moving target. LLMs' built-in assumptions and ways of expanding concepts will change as LLMs develop, so best practices will change along with LLM capabilities. The same set of instructions, not too detailed, was handled so much better by Sonnet 4 than Sonnet 3 in my experience. Sonnet 3.5 was the breaking point that showed me context-based LLM development is a feasible strategy.
The frustration thomascountz describes (tweaking, refining, reshaping) isn't a failure of methodology (SDD vs. Iteration). It's 'cognitive overload' from applying a deterministic mental model to a probabilistic system.
With traditional code, the 'spec' is a blueprint for logic. With an LLM, the 'spec' is a protocol for alignment.
The 'bug' is no longer a logical flaw. It's a statistical deviation. We are no longer debugging the code; we are debugging the spec itself. The LLM is the system executing that spec.
This requires a fundamental shift in our own 'mental OS': from 'software engineer' to 'cognitive systems architect'.
As software engineers, we often find it easy to specify what the system should do. But ensuring that it doesn't do what it shouldn't do is the tiresome part of the job. Most of the tools we've created exist to ensure the latter.
I would add that, in my opinion, if code production/management was previously a limiting factor in software development, today it is not. The conceptualisation (ontology, methodology) of the framework (spec-centric development) for producing and maintaining the system (code, artifacts, running system) becomes the new limiting factor. But it's a matter of time before we figure out 2-3 methodologies (as happened with agile's scrum/kanban) which will become the new "baseline". We're at the early stage where the "laws of LLM development" (as in "laws of physics") are still being figured out.
I thought about the concept of this sort of methodology before "agent" (which I would define as "side effects with LLM integration") was marketed into the community vocabulary. And I'm still rigidly sticking to what I consider the "basics". Hope that does not impede understanding.
After Claude finally produced a significant amount of code, and after realizing it hadn't built the right thing, I was back to the drawing board to find out what language in the spec had led it astray. Never mind digging through the code at this point; it would be just as good to start again than to try to onboard myself to the 1000s of lines of code it had built... and I suppose the point is to ignore the code as "implementation detail" anyway.
Just to make clear: I love writing code with an LLM, be it for brainstorming, research, or implementation. I often write—and have it output—small markdown notes and plans for it to ground itself. I think I just found this experience with SDD quite heavy-handed and the workflow unwieldy.
System specs are non-trivial for current AI agents. Hand-prompting every step is time-consuming.
I think (and I am still learning!) SDD sits as a fix for that. I can give it two fairly simple prompts & get a reasonably complex result. It's not a full system but it's more than I could get with two prompts previously.
The verbose "spec" stuff is just feeding the LLMs love of context, and more importantly what I think we all know is you have to tell an agent over and over how to get the right answer or it will deviate.
Early on with speckit I found I was clarifying a lot but I've discovered that was just me being not so good at writing specs!
Example prompts for speckit;
(Specify) I want to build a simple admin interface. First I want to be able to access the interface, and I want to be able to log in with my Google Workspaces account (and you should restrict logins to my workspaces domain). I will be the global superadmin, but I also want a simple RBAC where I can apply a set of roles to any user account. For simplicity let's make a record user accounts when they first log in. The first roles I want are Admin, Editor and Viewer.
(Plan) I want to implement this as a NextJS app using the latest version of Next. Please also use Mantine for styling instead of Tailwind. I want to use DynamoDB as my database for this project, so you'll also need to use Auth.js over Better Auth. It's critical that when we implement you write tests first before writing code; forget UI tests, focus on unit and integration tests. All API endpoints should have a documented contract which is tested. I also need to be able to run the dev environment locally so make sure to localise things like the database.
The plan step is overly focused on the accidental complexity of the project. While the `Specify` part does a good job of defining the scope, the `Plan` part just complicates it. Why? The choice of technology is usually the first step in introducing accidental complexity into a project, which is why it's often recommended to go with boring technology (so the cost of this technical debt is known). Otherwise, go with something already used by the company (if it's a side project, do whatever). If you go that route, there's a good chance you already have good knowledge of those tools and have code samples (and libraries) lying around.
The whole point of code is to be reliable and to help do something that we'd rather not do. Not to exist on its own. Every decision (even little) needs to be connected to a specific need that is tied to the project and the team. It should not be just a receptacle for wishes.
Your last point feels a bit idealistic. The point of code is to achieve a goal; there are ways to achieve it with optimal efficiency in construction, but a lot of people call that gold plating.
The setup these prompts leave you with is boring, standard, and something I could surely do myself in a couple of hours. You might even skeleton it, right? The thing is, the AI is not only faster in elapsed time, it also reduces my own time to writing two prompts (<2 minutes) and some review (10-15 minutes, perhaps).
Also remember this was a simple example; once we get to real business logic, the efficiencies grow.
Something boring and standard is something that keeps going with minimal intervention while getting better each time.
I'm struggling to see what you'd choose to do differently here?
Edit: actually I'll go further and say I'm guarding against accidental complexity. For example Auth.js is really boring technology, but I am annoyed they've deprecated it in favour of Better Auth - it's not better and it is definitely not boring technology!
If you change your preferences, the team refactors.
What LLMs bring to the picture is that "spec" is high-level coding. In normal coding you start by writing small functions then verify that they work. Similarly LLMs should perhaps be given small specs to start with, then add more functions/features to the spec incrementally. Would that work?
Were I to try again, I'd do a lot more manual spec writing or even template rewrites. I expected it to work more-or-less out-of-the-box. Maybe it would've for a standard web app using a popular framework.
It was also difficult to know where one "spec" ended and the next began; should I iterate on the existing one or create a new spec? This might be a solved problem in other SDD frameworks besides Spec-Kit, or else I'm just over thinking it!
a) the multi-year lead time from starting the spec to getting a finished product
b) no (cheap) way to iterate or deliver outside the spec
Neither of these are a problem with SDD.
It's a bit funny to see people describe a spec written in days (hours) and iterations lasting multiple weeks as "waterfall".
But these days I've already had people argue that barely stopping to think about a problem before starting to prompt a solution is "too tedious of a process".
They both have issues but they are very different. A waterfall project would have inscrutable structure and a large amount of "open doors" just in case a need of an extension at some place would materialize. Paradoxically this makes the code difficult to extend and debug because of overdone abstractions.
Hasty agile code has too many TODOs with "put this hardcoded value in a parameter". It is usually easier to add small features but when coming to a major design flaw it can be easier to throw everything out.
For UI code, AI seems to heavily tend towards the latter.
Documentation gets out of date quickly!!!
The problems with waterfall come when much of the project is done and then you discover that your spec doesn't quite work, but the changes to your spec require half the requirements to subtly change, so that it can work at all. But then these subtle changes need to be reflected in code everywhere. Do this a couple of times (with LLM and without) and now your code and spec only superficially look like one another.
The detailed spec is exactly the problem with the waterfall development. The spec presumes that it is the solution, whereas Agile says “Heck, we don't even understand our problem well, let alone understanding a solution to it.”
Beginning with a detailed spec fast with an LLM already puts you into a complex solution space, which is difficult to navigate compared to a simpler solution space. Regardless of the iteration speed, waterfall is the method that puts you into a complex space. Agile is the one you begin with smaller spaces to arrive at a solution.
How can you even develop something if you don’t have a clear idea what you’re building?
But, the statement "we don't even understand our problem well" is typically correct. In most cases where new software is started, the problem isn't well-defined or amenable to off-the-shelf solutions. And you will never know as little about the problem as you do on day one. Your knowledge will only grow.
It is more useful to acknowledge this reality and develop coping strategies than to persist in denial of it. At the time that the agile manifesto was written, the failure of "big up-front design" was becoming plainly evident. You think that you know the whole spec, and then it meets reality much as the Titanic met an iceberg.
Agile does not say "no design, no idea"; it points out things that are more valuable than doomed attempts at "100% complete design and all the ideas before implementation", e.g. "while there is value in (comprehensive documentation, following a plan), we value (working software, responding to change) more" (see https://agilemanifesto.org/).
In other words, start by doing enough design, and then some working software to flush out the flawed thinking in the design. And then iterate with feedback.
That's the key benefit of starting small and of iterating: it allows you to learn and to improve. You don't learn anything about your problem and solution by writing a comprehensive design spec upfront.
It's the ability to _change_ quickly (or be agile) in response to feedback that marks the difference.
The delay is just irrelevant. It has nothing to do with it working or not.
>b) no (cheap) way to iterate or deliver outside the spec
You could always do this in a waterfall project. Just make whatever changes to the code and ship. The problem is the same for SDD, as soon as you want quick changes you have to abandon the spec. Iterating the spec and the code quickly is impossible for any kind of significant complex project.
Either the spec contains sufficient detail to make implementation feasible, in which case iteration times become long and any change becomes tedious and complex, or the spec is insufficient to describe the complexity of the project, which makes it insufficient to guide an LLM adequately.
There is a fundamental contradiction here, which LLMs can not resolve. People like SDD for exactly the reason managers like waterfall.
"Heavy documentation before coding" (article) is essentially a bad practice that Agile identified and proposed a remedy to.
Now the article is really about AI-driven development in which the AI agent is a "code monkey" that must be told precisely what to do. I think the interesting thing here will be to find the right balance... IMHO this works best when using LLMs only for small bits at a time instead of trying to specify the whole feature or product.
The key to Agile isn't documentation - it's in the ability to change at speed (perhaps as markets change). Literally "agile".
This approach allows for that comprehensive documentation without sacrificing agility.
In addition, the big issue is when the comprehensive documentation is written first (as in waterfall) because it delays working software and feedback on how well the design works. Bluntly, this does not work.
That's why I think it is best to feed LLMs small chunks of work at a time and to keep the human dev in the driving seat to quickly iterate and experiment, and to be able to easily reason about the AI-generated code (who will do maintenance?)
The article seems to miss many of those points.
IMHO a good start is to have the LLM prompt be a few lines at most and generate about 100 lines of code so you can read it and understand it quickly, tweak it, use it, repeat. Not even convinced you need to keep a record of the prompt at all.
REPL development and Live programming is similar to that. But when something works, it stays working. Even with the Edit-Compile-Run cycle, you can be very fast if the cycle is short enough (seconds). I see people going all in with LLMs (and wishing for very powerful machines) while ignoring other tools that could give better return on a 5 year old laptop.
I’m letting the agent help me draft the specs anyway and I found that the agent is a lot more focused when it can traverse a task tree using beads.
It’s the one spec or planning tool that I find really helps get things done without a bunch of intervention.
Another technique I employ is I require each task to be TDD. So every feature has two tasks: write tests that fail, implement feature and don’t notify me until tests complete. Then I ask the agent to tell me how to review the task and require I review every task before moving to the next one. I love this process because the agent tells me exactly what commands to run to review the task. Then I do a code review and ask it questions. Reading agent code is exhausting so I try to make the tasks as discrete and minimal as possible.
These are simple techniques that humans employ during development, and I find they work very well.
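The two-task TDD pattern described above can be sketched minimally. This is only an illustration of the workflow, not code from any real project; the `slugify` feature and its behaviour are hypothetical stand-ins:

```python
import unittest

# Task 1: write a failing test first. The agent runs this, confirms it
# fails (slugify doesn't exist yet), and only then moves to task 2.
class TestSlugify(unittest.TestCase):
    def test_lowercases_and_hyphenates(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

# Task 2: implement the feature, re-run the suite, and report back only
# once the tests pass.
def slugify(title):
    return "-".join(title.lower().split())
```

The human review step then amounts to running `python -m unittest` (the command the agent would tell you to run) and reading the resulting diff.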
There are also times when I need to write some docs to help me better understand the problem and I usually just dump those in a specs folder.
I think spec-kit is an interesting idea but too heavy handed. Just use beads and you’ll see what I mean.
Another technique I employed for a fully vibed tool (https://github.com/neurosnap/zmx) is to have the agent get as far as possible in a project and then I completely rewrite it using the agent code purely as a reference.
it didn't really kill it - it just made the spec massively disjoint, split across hundreds to thousands of randomly filled Jira tickets.
All those small micro decisions, discussions, and dead ends can be recorded and captured by the AI. If you do something that doesn’t make sense given past choices, it can ask you.
Gradually, over time, it can gather more and more data that only lives in your brain at the time you’re building. It’s only partially captured by git commits but mostly lost to time.
Now, when you change code, the system can say, “Jim wrote that 5 years ago for this reason. Is the reason not valid anymore?”. You might get this on a good code review, but probably not. And definitely not if Jim left 2 years ago.
Don’t forget the 80% of project knowledge that was in Jim’s head, and that nobody knows how it all connects ever since he left 5 years ago.
[1] (pretty sure this is the right one): https://youtu.be/CmIGPGPdxTI
This is exactly the same thing but for AIs. The user might think that the AI got it wrong, except the spec was under-specified and it had to make choices to fill in the gaps, just like a human would.
It’s all well and good if you don’t actually know what you want and you’re using the AI to explore possibilities, but if you already have a firm idea of what you want, just tell it in detail.
Maybe the article is actually about bad specs? It does seem to venture into that territory, but that isn’t the main thrust.
Overall I think this is just a part of the cottage industry that’s sprung up around agile, and an argument for that industry to stay relevant in the age of AI coding, without being well supported by anything.
The agent here is:
Look on HN for AI skeptical posts. Then write a comment that highlights how the human got it wrong. And command your other AI agents to up vote that reply.
Of course, this is all very situational and based on the problem being solved at the time. The risk with "practices" is they are generally not concerned with problem being solved and insist on applying the same template regardless.
Devs get married to their first implementation; Stakeholders don’t tolerate rework
If companies and individuals could throw more away, then we wouldn’t need to obsess over planning. The “spec” and “design” would get discovered through doing. I’ve never worked anywhere where a long up front design addressed the important design issues. Those get discovered after you’ve tried to implement a solution a few times and failed.
If we say throwing away as a feature rather than a bug, we’d probably work more efficiently.
>How can we ensure that the code is correct with so little guidance?
Easy, diff and test the code yourself after each run.
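That diff-and-test loop can be scripted. The sketch below is one possible shape, not anything the comment prescribes; it assumes a git checkout and a pytest suite, both of which are my assumptions:

```python
import subprocess

def accept(test_returncode):
    # pytest exits 0 only when every collected test passed.
    return test_returncode == 0

def review_agent_run(repo_dir="."):
    """Show the agent's changes for human review, then gate on the tests."""
    # Eyeball what changed since the last commit.
    subprocess.run(["git", "-C", repo_dir, "diff", "--stat"], check=False)
    # Keep the run only if the whole suite still passes.
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    return accept(tests.returncode)
```

Calling `review_agent_run()` after each agent run makes the "diff and test yourself" step a habit rather than an afterthought.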
For me currently this sweet spot is TINY. It's so small that my usage of Claude Code has dropped to almost none. It's simply more practical to let myself have the agency and drive the development, while letting AI jump in and assist when needed.
However short and targeted specifications at the right level of detail and fidelity, can be extremely useful during coding with agents
Compared to what an architect does when they create a blueprint for a building, creating blueprints for software source code is not a thing.
What in waterfall is considered the design phase is the equivalent of an architect doing sketches, prototypes, and other stuff very early in the project. It's not creating the actual blueprint. The building blueprint is the equivalent of source code here. It's a complete plan for actually constructing the building down to every nut and bolt.
The big difference here is that building construction is not automated; it is costly and risky. So architects try to get their blueprint to a level where they can minimize all of that cost and risk. And you only build the bridge once. So iterating is not really a thing either.
Software is very different; compiling and deploying is relatively cheap and risk free. And typically fully automated. All the effort and risk is contained in the specification process itself. Which is why iteration works.
Architects abandon their sketches and drafts after they've served their purpose. The same is true in waterfall development. The early designs (whiteboard, napkin, UML, brainfart on a wiki, etc.) don't matter once the development kicks off. As iterations happen, they fall behind and they just don't matter. Many projects don't have a design phase at all.
The fallacy that software is imperfect as an engineering discipline because we are sloppy with our designs doesn't hold up once you realize that essentially all the effort goes into creating hyper detailed specifications, i.e. the source code.
Having design specifications for your specifications just isn't a thing. Not for buildings, not for software.
Real software engineering does exist. It does so precisely in places where you can't risk trying it and seeing it fail, like control systems for things which could kill someone if they failed.
People get offended when you claim most software engineering isn't engineering. I am pretty certain I would quickly get bored if I was actually an engineer. Most real world non-software engineers don't even really get to build anything, they're just there to check designs/implementations for potential future problems.
Maybe there are also people in the software world who _do_ want to do real engineering and they are offended because of that. Who knows.
> it's really just a spec that gets turned into the thing we actually run. It's just that the building process is fully automated. What we do when we create software is creating a specification in source code form.
Agree. My favourite description of software development is specification and translation - done iteratively.
Today, there are two primary phases:
1. Specification by a non-developer and the translation of that into code. The former is led by BAs/PMs etc. and the output is feature specs/user stories/acceptance tests etc. The latter is done by developers: they translate the specs into code.
2. The resulting code is also, as you say, a spec. It gets translated into something the machine can run. This is automated by a compiler/interpreter (perhaps in multiple steps, e.g. when a VM is involved).
There have been several attempts over the years to automate the first step. COBOL was probably the first; since then we've had 4GLs, CASE tools, UML among others. They were all trying to close the gap: to take phase 1 specification closer to what non-developers can write - with the result automatically translated to working code.
Spec-driven development is another attempt at this. The translator (LLM) is quite different to previous efforts because it's non-deterministic. That brings some challenges but also offers opportunities to use input language that isn't constrained to be interpretable by conventional means (parsers implementing formal grammars).
We're in the early days of spec-driven. It may fail like its predecessors or it may not. But first order, there's nothing sacrosanct about the use of 3rd generation languages as the means to represent the specification. The pivotal challenge is whether translation from the starting specification can be reliably translated to working software.
If it can (big if) then economics will win out.
That said, there is a bit of redundancy between software design and source code. We tend to rather get rid of the development of the latter than the former though, i.e. by having the source code be generated by some modelling tool.
The part of the process that actually needs improving, in my experience in larger codebases, is the research phase, not the implementation. With good, even quite terse research, it’s easy to iterate on a good implementation and then probably take over to finish it off.
I really think LLMs and their agent systems should be kept in their place as tools, first and foremost. We’re still quite early in their development, and they’re still fundamentally unreliable enough, that I don’t think we should be re-working over-arching work practices around them.
> For large existing codebases, SDD is mostly unusable.
I don't really agree with the overall blog post (my view is that all of these approaches have value, and we are still too early on to find the One True Way) but that point is very true.
Opinions about "what works" being pushed as fact. No evidence, no attempt to create evidence (because it's hard). Endless commentary and opinion pieces, naive people being coached by believers into doing things that seem to work on specific examples.
If you have an example that worked for you it doesn't mean that it's a useful way to work for everyone else in every other situation.
So they're more like 3rd party innovations to lobby LLM providers to integrate functionalities.
X prompting method/coding behaviors? Integrated. Media? Integrated. RAG? Integrated. Coding environment? Integrated. Agents? Integrated. Spec-driven development? It's definitely present, perhaps not as formal yet.
The difference is when this is done: in an upfront discussion, while developing, or after user feedback.
For LLMs we know it needs to be written down (at least if we want human traceability).
And agile of course is a shortened waterfall, to get user feedback early on.
Giving enough context is important in every case (humans and LLMs alike).
What I really want is to be able to do the things I'm good at. Usually that is not what gets assigned to me or is next in line.
SDD as it's presented is a bit heavyweight; if you experiment with it a bit, there is a lighter version that can work.
For some mini modules, we keep a single page spec as 'source of truth' instead of the code.
It's nice but has its caveats, though they become less of a concern over time.
such a rare (but valued!) occurrence in these posts. Thanks for sharing
Those requirements exist regardless of whether you write them down in Markdown files or not. Spec-driven Development is just making what needs to be built explicit rather than assuming the whole team know what the code should do.
Practically every company already uses 'spec-driven development', just with incredibly vague specs in the form of poorly written Jira tickets. Developers like it because it gives them freedom to be creative in how they interpret what needs to be done, plus they don't need to plan things and their estimates can be total nonsense if they want, and Product Owners and BAs like it because it means they can blame shift to the dev team if something is missed by saying "We thought that was obvious!"
Every team should be capturing requirements at a level of detail that means they know how the code should work. That doesn't need to be done up front. You can iterate. Requirements are a thing that grow with a project. All that spec-driven development is doing is pushing teams to actually write them down.
But crucially, the details here are coming from the issue authors. Do you really think that issue authors are going to be reviewing LLM-generated specs? I don't think so. And so engineers will be the intermediary. If that's going to be me, I would rather mediate between the issue author, some kind of high-level plan, and code. Not the issue author, a high-level plan, code-like specs, and code. There is one extra layer in the latter that I don't see the value of.
> Developers like it because it gives them freedom to be creative in how they interpret what needs to be done, plus they don't need to plan things and their estimates can be total nonsense if they want
I like it because it moves me closer to the product, the thing actually being built. You seem to be asking to move the clock back to where there was a stricter division of labour. Maybe that's necessary in some industries, but none that I've worked in.
Do they?
Personally, I tried SDD, consciously trying to like it, but gave up. I find writing specs much harder than writing code, especially when trying to express the finer points of a project. And of course, there is also that personal preference: I like writing code, much more than text. Yes, there are times where I shout "Do What I Mean, not what I say!", but these are mostly learning opportunities.
If you are working with constrained hardware or users... it isn't.
When that is not the case, working without a spec won't help either.
Hardware needs to be procured or implemented in the cloud - there's a lot of work on the architectures and costs early in projects so as to ensure that things will cost in. Changing that can invalidate business cases, and also can be very difficult due to architectural and security controls.
In terms of users, in corporates the user communities must be identified, trained, sometimes made redundant, sometimes given extra responsibilities. Once you have got this all lined up any changes become very hard because suddenly, like a ripple over a lake when a pebble is dropped in, everyone who's touched has a reason why they are going to miss targets (you are that reason) and therefore want 100% bonus (there is no money for 100% bonus for all).
In previous jobs I would have delighted in pointing out that if there are no users the system can't be funded!
I agree that working without a spec is madness, it's just not realistic in the real world either. People expect you to stand behind a commitment to deliver, they also want to know what they are paying for. However, things do change, both really (as in something happens and the system must now accommodate it) and also due to discovery (we didn't know, we couldn't have known, but now we know and must accommodate this knowledge). It's really important to factor this in, although perfect flexibility is infinitely expensive and completely unrealistic...
A bit of flex can be cheap, easy and a lifesaver though.
Isn't this what Kiro IDE is about? Spec-driven dev?
That's why in my workflow I don't write single monster specs. Rather, I work with the LLM to iterate on small, individual, highly constrained specs that provide useful context for what/why/how -- stories, if you will -- that include a small set of critical requirements and related context -- the criteria by which you might "accept" the work -- and then I build up a queue of those "stories" that form a, you might say, backlog of work that I then iterate with the LLM to implement.
I then organize that backlog so that I can front-load uncovering unknowns while delivering high-value features first.
This isn't rocket science.
By far the biggest challenge I experience is compounding error during those iterative cycles creating brittleness, code duplication, and generally bad architecture/design. Finding ways to incorporate key context or other hints in those individual work items is something I'm still sorting out.
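The backlog of small, constrained stories described above could be represented as simply as this. It is a hypothetical sketch only: the field names and the riskiest-first ordering are my own assumptions, not part of any SDD tool:

```python
from dataclasses import dataclass, field

@dataclass
class Story:
    """One small, highly constrained spec: the what/why plus acceptance criteria."""
    title: str
    context: str                          # the why/how context handed to the LLM
    acceptance: list = field(default_factory=list)  # criteria to "accept" the work
    risk: int = 0                         # higher = more unknowns to uncover

def order_backlog(stories):
    # Front-load uncovering unknowns: tackle the riskiest stories first.
    # (A real prioritisation would also weigh delivered value.)
    return sorted(stories, key=lambda s: s.risk, reverse=True)
```

Each `Story` then becomes one iteration with the LLM, with its `acceptance` list doubling as the review checklist.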
(and yes, I use en-dashes, and no I'm not an AI)
Of course SDD/Waterfall helps the LLM/Outsourced labor to implement software in a predictable way. Waterfall was always a method to please Managers and in the case of SDD the manager is the user promoting the coding agent.
The problem with SDD/Waterfall is not the first part of the project. The problems come when you are deep into the project, your spec is a total mess and the tiniest feature you want to add requires extremely complex manipulation of the spec.
The success people are experiencing is the success managers have experienced at the beginning of their software projects. SDD will fail for the same reason Waterfall has failed. The constant increasing of complexity in the project, required to keep code and spec consistent can not be managed by LLM or human.
And while at it, I found out that using TDD also helps.
Amazon's Kiro is incredibly spec driven. Haven't tried it but interested. Amplifier has a strong document-driven-development loop also built-in. https://github.com/microsoft/amplifier?tab=readme-ov-file#-d...
Not at FAANG. Or at least not at Google where I was for 10 years. They were obsessed with big upfront PRDs and design docs, and they were key to getting promotion and recognition.
These days those kinds of documents -- which were laborious to produce, mostly boilerplate, and a pain to maintain, and often not really read by anybody other than promo committee -- could be produced easily by prompting an LLM.
Having drunk from the wellspring of XP and agile early in my career, I found it continually frustrating. Actual development followed iterative practices, but not officially.
Same is true for UX and DevOps: just create a bunch of positions based on some blog post and congratulate yourself on a job well done. Screwing over the developers (engineers) as usual, even though they actually might be interested in those jobs.
This is the main problem with big tech informing industry decisions, they win because they make sure they understand what all of this means. For all other companies this just creates a mess and your mentioned frustration.
Open office is the densest and cheapest office layout. That is the reason it exists and the reason it will persist. All other reasons are inferior.
The immediate pooh-poohing of Waterfall is the big tell here. If they don't give you an example of an actual Waterfall project they've worked on, or can't elucidate why it wasn't just that one project or organization that made Waterfall bad, they're likely parroting myths or a single anecdotal experience. And that bad experience was likely based on not understanding it to begin with (Waterfall in particular is the subject of many myths and lies). I've had terrible Agile experiences. Does that make Agile terrible?
In my experience, Agile has a tendency to succeed despite itself. Since you don't do planning, you just write one bit at a time. But of course eventually this doesn't work, so you spend more time rearchitecting, rewriting, and fixing things. But hey, look, you made something useful! ....it still isn't making the company any money yet, but it's a thing you can see, so everyone feels better. You can largely do this work by yourself, so you can move fast; until you need something controlled by someone else, at which point you show up at the 11th hour, and dump a demand on their desk that must be finished immediately. Often these recipients have no choice, because the organization needs the thing you're slapping together, and they're "being a blocker". And those recipients then can't accomplish what they need to, because they haven't been given any documentation to know what to do. Bugs, rushed deadlines (or worse, no deadlines), dead-cats-over-the-wall, wasted effort, dysfunction. Is this the only way to do Agile? Of course not. But it's easy for me to paint the entire model this way, based on my experience.
There does not exist a project management method which is inherently bad. I repeat: No formal project management method is bad. Methods are simply frameworks by which you organize and execute work. You will not be successful just because you used a framework. You still have to do the organizing and execute the work in a not-terrible-way. You have a lot of wiggle room about how things are done, and that is what determines the outcome. You can do things quickly or slow, skillfully or shoddily, half-assed or competently, individually or collaboratively. It's how you take each step, not which step you take.
As long as humans are at the reins, it doesn't matter what method you use. The same project, with the same method, can either go to shit, or turn out great. The difference between the two is how you use the methods. You have to do the work, and do it well. In organizations, with other humans, that's often very difficult, because it means you depend on something outside of your control. So leadership, skill, and a collaborative, positive culture, are critical to getting things done.
OMG really? You think?
(Except people will get AI to write it.)
The problem with what people call "Waterfall" is that there is an assumption that at some point you have a complete and correct spec and you code off of that.
A spec is never complete. Any methodology applied in a way that does not allow you to go back to revise and/or clarify specs will cause trouble. This was possible with waterfall and is more explicitly encouraged with various agile processes. How much it actually happens in practice differs regardless of how you name the methodology that you use.
In contrast they're still the standard in the hardware design world.
If you don't have explicit specifications (which don't have to be complete before starting to develop code), you still have specs, but they're unarticulated. They exist in the minds of the developers, managers, customers, and what you end up with is a confused mess. You have specs, but you don't necessarily know what they are, or you know something about them but you've failed to communicate them to others and they've failed to communicate with you.
Most people call this "not having specs".
And most software projects are a complete mess that wastes unfathomable amounts of resources. But yeah, you “can” develop like that.