I'm very interested in what this will look like for outputs from other job functions. And if we'll end up with a similar framework that makes non-deterministic, often-wrong LLMs easier to work with.
The problem is the remaining 30% - the next 10-20% starts to require things like multi-agent judge setups, external memory, and context management, and that gets you to something that’s probably working but that you sure shouldn’t ship to production. As to the last 10% - I’ve seen agentic workflows with hundreds of different agents, multiple models, and fantastically complex evaluation frameworks to try to push the error rate below the ~10% mark. By that point, the infrastructure and LLM calls are running to several hundred dollars per run, and you’re still not getting guaranteed reliable output.
If you know what you’re doing and you know where to fit the LLMs (they’re genuinely the best system we’ve ever devised for interpreting and categorizing unstructured human input), they can be immensely useful, but they sing a siren song of simplicity that will lure you to your doom if you believe it.
I imagine using their embeddings and training a classifier on top of that is probably a lot more effective?
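A minimal sketch of what "classifier on top of embeddings" could look like, for anyone who wants something concrete. Everything here is an assumption for illustration: the 3-dimensional vectors are toy stand-ins for real embeddings, the labels are made up, and nearest-centroid is just the simplest classifier that fits in a comment, not a claim about what works best.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train(examples):
    """Build one centroid per label from (embedding, label) pairs."""
    by_label = defaultdict(list)
    for vec, label in examples:
        by_label[label].append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(model, vec):
    """Return the label whose centroid is most similar to vec."""
    return max(model, key=lambda label: cosine(model[label], vec))


# Toy vectors standing in for embeddings of labeled user messages (hypothetical data).
TOY_TRAINING = [
    ([0.9, 0.1, 0.0], "billing"),
    ([0.8, 0.2, 0.1], "billing"),
    ([0.1, 0.9, 0.2], "bug_report"),
    ([0.0, 0.8, 0.3], "bug_report"),
]

model = train(TOY_TRAINING)
print(classify(model, [0.85, 0.15, 0.05]))  # prints: billing
```

The appeal is that the only LLM-adjacent step is producing the vectors; the classification itself is deterministic and cheap to evaluate against a labeled set.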
I've personally found agentic LLM workflows most effective as extremely sophisticated autocomplete. Instead of autocompleting the next few tokens, I tell it precisely how to edit my code at a high level. You can't tell it stuff at a feature level, but telling it how to implement the feature saves me a ton of time.
Anecdotally, I have found that even if you type out paragraph after paragraph describing everything you need the agent to take care of, it eventually feels like you could have written a lot of the code yourself with the help of a good IDE by the time you can finally send your prompt off.
The AI people sure don't want that; it's too telling about its limitations and value.
It feels like they’re a few versions behind what I’m doing, which is… odd.
Self-hosting a plane.io instance. Added a plane MCP tool to my codex. Added workflow instructions into Agents.md which cover standards, documentation, related work, labels, branch names, adding comments before the plan, after the plan, and at varying steps of implementation, and a summary before moving the ticket to done. Creating new tickets and being able to relate them to the current one or others, etc…
It ain’t that hard. Just do inception (high to mid level details), then create epics and tasks. Add personas, details, notes, acceptance criteria and more. Can add comments yourself to update. Whatever.
Slice tickets thin and then go wild. Add tickets as you’re working through things. Make modifications.
Why so difficult?
They’ve made an issue tracker out of JSON files and a text file.
Why not hook an MCP up to an actual issue tracker?
——
Something I’ve been going over in my head:
I used to work in a pretty strict Pivotal XP shop. PM ran the team like a conductor. We had analysts, QA, leads, seniors. Inceptions for new features were long, sometimes heated sessions with PM + Analyst + QA + Lead + a couple of seniors. Out of that you’d get:
- Thinly sliced epics and tasks
- Clear ownership
- Everyone aligned on data flows and boundaries
- Specs, requirements, and acceptance criteria nailed at both high- and mid-level
At the end, everyone knew what was talking to what, what “done” meant, and where the edges were.
What I’m thinking about now is basically that process, but agentized and wired into the tooling:
- Any ticket is an entry point into a graph, not just a blob of text.
  - Epics ↔ tasks ↔ subtasks
  - Linked specs / decisions / notes
  - Files and PRs that touched the same areas
- Standards live as versioned docs, not just a random Agents.md:
  - Markdown (with diagrams) that declares where it applies: tags, ticket types, modules.
  - Tickets can pin those docs via labels/tags/links.
- From the agent’s perspective, the UI is just a viewer/editor.
  - The real surface is an API: “given this ticket, type, module, and tags, give me all applicable standards, related work, and code history.”
- The agent then plays something like the analyst + senior engineer role:
  - Pulls in the right standards automatically
  - Proposes acceptance criteria and subtasks
  - Explains why a file looks the way it does by walking past tickets / PRs / decisions
So it’s less “LLM stapled to an issue tracker” and more “that old XP inception + thin-slice discipline, encoded as a graph the agent can actually reason over.”
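To make the "given this ticket, give me everything applicable" lookup concrete, here is a rough sketch of how I'd model it. All of it is hypothetical: the `Ticket` and `Standard` shapes, the field names, the matching rule (a standard applies if it is pinned or if any of its declared tags, ticket types, or modules match), and the sample IDs are my assumptions, not any real tracker's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Standard:
    """A versioned standards doc that declares where it applies."""
    name: str
    version: str
    tags: set = field(default_factory=set)
    ticket_kinds: set = field(default_factory=set)
    modules: set = field(default_factory=set)

    def applies_to(self, ticket):
        # Assumed rule: any overlap on tags, ticket kind, or module is enough.
        return bool(
            self.tags & ticket.tags
            or ticket.kind in self.ticket_kinds
            or ticket.module in self.modules
        )


@dataclass
class Ticket:
    """A node in the ticket graph."""
    key: str
    kind: str                                   # e.g. "epic", "task", "subtask"
    module: str
    tags: set = field(default_factory=set)
    parent: Optional[str] = None                # epic <-> task <-> subtask edge
    links: set = field(default_factory=set)     # explicitly related tickets
    prs: list = field(default_factory=list)     # PRs that closed this ticket
    pinned: set = field(default_factory=set)    # names of pinned standards


def context_for(key, tickets, standards):
    """Collect applicable standards, related work, and code history for one ticket."""
    t = tickets[key]
    applicable = sorted(
        s.name for s in standards if s.name in t.pinned or s.applies_to(t)
    )
    related = set(t.links)
    if t.parent:
        related.add(t.parent)
    related |= {o.key for o in tickets.values() if o.parent == t.key}
    related.discard(t.key)
    # Code history here means PRs from other tickets in the same module.
    history = sorted(
        pr
        for o in tickets.values()
        if o.module == t.module and o.key != t.key
        for pr in o.prs
    )
    return {"standards": applicable, "related": sorted(related), "code_history": history}


tickets = {
    "EPIC-1": Ticket("EPIC-1", "epic", "billing"),
    "TASK-2": Ticket("TASK-2", "task", "billing", tags={"api"}, parent="EPIC-1", prs=["PR-10"]),
    "TASK-3": Ticket("TASK-3", "task", "billing", tags={"api"}, parent="EPIC-1",
                     links={"TASK-2"}, pinned={"error-handling"}),
}
standards = [
    Standard("api-style", "v2", tags={"api"}),
    Standard("error-handling", "v1"),
    Standard("frontend-a11y", "v1", modules={"web"}),
]

print(context_for("TASK-3", tickets, standards))
```

For `TASK-3` this returns the tag-matched and pinned standards, the parent epic plus the linked task, and `PR-10` as prior work in the same module. The point of the shape is that the agent asks one question and gets the graph neighborhood back, instead of being handed a blob of text.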
So adding a QA agent, while it sounds logical, just ends up being even more of this. Rather than converging on a solution, they just get all out of whack. Until that is solved, far better just to have your dev agent be smart about doing its own QA.
The only way I could see the QA agent idea working now is if it had the power to roll back the entire change, reset the dev agent, update the task with some hints of things not to overlook, and trigger the dev process from scratch. But that seems pretty inefficient, and IDK if it would work any better.
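Roughly what I have in mind for that loop, as a sketch only. The `dev_agent`, `qa_agent`, `snapshot`, and `restore` callables are hypothetical placeholders (in practice they'd wrap agent calls and something like a VCS checkpoint); the fake versions below exist just so the control flow can be run.

```python
def run_with_qa(task, dev_agent, qa_agent, snapshot, restore, max_attempts=3):
    """Retry the dev agent from scratch until QA passes or attempts run out.

    Only the task and the accumulated hints carry over between attempts;
    the workspace is rolled back to the checkpoint after every rejection.
    """
    hints = []
    for attempt in range(1, max_attempts + 1):
        checkpoint = snapshot()
        change = dev_agent(task, hints)       # fresh run, no memory of the last try
        problems = qa_agent(task, change)     # empty list means QA accepted it
        if not problems:
            return change, attempt
        restore(checkpoint)                   # roll back the entire change
        hints.extend(p for p in problems if p not in hints)
    return None, max_attempts


# Fake workspace and agents, purely to exercise the loop.
workspace = {"files": []}


def snapshot():
    return list(workspace["files"])


def restore(checkpoint):
    workspace["files"] = list(checkpoint)


def fake_dev(task, hints):
    change = "impl" + ("+tests" if "missing tests" in hints else "")
    workspace["files"].append(change)
    return change


def fake_qa(task, change):
    return [] if "tests" in change else ["missing tests"]


result, attempts = run_with_qa("add login", fake_dev, fake_qa, snapshot, restore)
print(result, attempts)  # prints: impl+tests 2
```

Even in the toy version you can see the cost: every rejection throws away a full dev run, so this only pays off if the hints actually make the next attempt converge.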
If you're using the agent to produce any kind of code that has access to manipulate the filesystem, may as well have it understand its own abilities as having the entirety of CRUD, not just updates. I could easily see the agent talking itself into working around "only be able to edit" with its other knowledge that it can just write a script to do whatever it wants. This also reinforces to devs that they basically shouldn't trust the agent when it comes to the filesystem.
As for pwd for existing projects, I start each session running tree local to the part of the project filesystem I want to have worked on.
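If `tree` isn't installed, something like this gets a similar listing to paste in at the start of a session. It's a sketch, not a replacement for the real tool: the `render_tree` name, the depth limit, and the skip list are my own choices.

```python
from pathlib import Path


def render_tree(root, max_depth=3, skip=(".git", "node_modules", "__pycache__")):
    """Return a tree-style listing of root, directories first, as one string."""
    root = Path(root).resolve()
    lines = [root.name + "/"]

    def walk(directory, prefix, depth):
        if depth > max_depth:
            return
        entries = sorted(
            (p for p in directory.iterdir() if p.name not in skip),
            key=lambda p: (p.is_file(), p.name),   # directories sort before files
        )
        for i, entry in enumerate(entries):
            last = i == len(entries) - 1
            suffix = "/" if entry.is_dir() else ""
            lines.append(prefix + ("└── " if last else "├── ") + entry.name + suffix)
            if entry.is_dir():
                walk(entry, prefix + ("    " if last else "│   "), depth + 1)

    walk(root, "", 1)
    return "\n".join(lines)
```

Point it at just the subdirectory you want worked on, e.g. `print(render_tree("src/billing"))` (a made-up path), so the agent starts with the layout of that area rather than the whole repo.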
Very interesting.