1. I have no real understanding of what is actually happening under the hood. The ease of just accepting a prompt to run some script the agent has assembled is too enticing. But, I've already wiped a DB or two just because the agent thought it was the right thing to do. I've also caught it sending my AWS credentials to deployment targets when it should never do that.
2. I've learned nothing. So the cognitive load of doing it myself, even assembling a simple docker command, is just too high. Thus, I repeatedly fallback to the "crutch" of using AI.
One the one hand, reviewing and micromaning everything it does is tedious and unrewarding. Unlike reviewing a colleague's code, you're never going to teach it anything; maybe you'll get some skills out of it if you finds something that comes up often enough it's worth writing a skill for. And this only gets you, at best, a slight speedup over writing it yourself, as you have to stay engaged and think about everything that's going on.
Or you can just let it grind away agentically and only test the final output. This allows you to get those huge gains at first, but it can easily just start accumulating more and more cruft and bad design decisions and hacks on top of hacks. And you increasingly don't know what it's doing or why, you're losing the skill of even being able to because you're not exercising it.
You're just building yourself a huge pile of technical debt. You might delete your prod database without realizing it. You might end up with an auth system that doesn't actually check the auth and so someone can just set a username of an admin in a cookie to log in. Or whatever; you have no idea, and even if the model gets it right 95% of the time, do you want to be periodically rolling a d20 and if you get a 1 you lose everything?
Of course this requires being fortunate enough that you have one of those AI positive employers where you can spend lots of money on clankers.
I don't review every move it makes, I rather have a workflow where I first ask it questions about the code, and it looks around and explores various design choices. then i nudge it towards the design choice I think is best, etc. That asking around about the code also loads up the context in the appropriate manner so that the AI knows how to do the change well.
It's a me in the loop workflow but that prevents a lot of bugs, makes me aware of the design choices, and thanks to fast mode, it is more pleasant and much faster than me manually doing it.
The agent only has access to exactly what it needs, be it an implementation agent, analysis agent, or review agent.
Makes it very easy to stay in command without having to sit and approve tons of random things the agent wants to do.
I do not allow bash or any kind of shell. I don't want to have to figure out what some random python script it's made up is supposed to do all the time.
anonu has explicitly said that they've wiped a database twice as a result of agents doing stuff. What sort of diff would help against an agent running commands, without your approval?
The diff: +8000 -4000
It has about doubled my development pace. An absolutely incredible gain in a vacuum, though tiny compared to what people seem to manage without these self-constraints. But in exchange, my understanding of the code is as comprehensive as if I had paired on it, or merged a direct report's branch into a project I was responsible for. A reasonable enough tradeoff, for me.
Day 1: Carefully handles the creds, gives me a lecture (without asking) about why .env should be in .gitignore and why I should edit .env and not hand over the creds to it.
Day 2: I ask for a repeat, has lost track of that skill or setting, frantically searches my entire disk, reads .env including many other files, understands that it is holding a token, manually creates curl commands to test the token and then comes back with some result.
It is like it is a security expert on Day 1 and absolute mediocre intern on Day 2
( This was low-stakes test creds anyway which I was testing with thankfully. )
I never pass creds via env or anything else it can access now.
My approach now is to get it to write me linqpad scripts, which has a utility function to get creds out of a user-encrypted share, or prompts if it's not in the store.
This works well, but requires me to run the scripts and guide it.
Ultimately, fully autotonous isn't compatible with secrets. Otherwise, if it really wanted to inspect it, then it could just redirect the request to an echo service.
The only real way is to deal with it the same way we deal with insider threat.
A proxy layer / secondary auth, which injects the real credentials. Then give claude it's own user within that auth system, so it owns those creds. Now responsibilty can be delegated to it without exposing the original credentials.
That's a lot of work when you're just exploring an API or DB or similar.
1. Everything is specified, written and tested by me, then cleaned up by AI. This is for the core of the application.
2. AI writes the functions, then sets up stub tests for me to write. Here I’ll often rewrite the functions as they often don’t do what I want, or do too much. I just find it gets rid of a lot of boilerplate to do things this way.
3. AI does everything. This is for experiments or parts of an application that I am perfectly willing to delete. About 70% of the time I do end up deleting these parts. I don’t allow it to touch 1 or 2.
Of course this requires that the architecture is setup in a way where this is possible. But I find it pretty nice.
[1] except perhaps read-only credentials to help diagnose problems, but even then I would only issue it an extremely short-lived token in case it leaks it somehow
Only helps if we listen to it :) which is fun b/c it means staying sharp which is inherently rewarding
Don’t give your agent access to content it should not edit, don’t give keys it shouldn’t use.
> python <<'EOF'
> ${code the agent wrote on the spot}
> EOF
I mean, yeah, in theory it's just as dangerous as running arbitrary shell commands, which the agent is already doing anyway, but still...
By default these shell commands don't have network access or write access outside the project directory which is good, but nowhere near customizable enough. Once you approve a command because it needs network access, its other restrictions are lifted too. It's all or nothing.
I'm not trying to be offense, so with all due respect... this sounds like a "you" problem. (And I've been there, too)
You can ask the LLMs: how do I run this, how do I know this is working, etc etc.
Sure... if you really know nothing or you put close to zero effort into critically thinking about what they give you, you can be fooled by their answers and mistake complete irrelevance or bullshit for evidence that something works is suitably tested to prove that it works, etc.
You can ask 2 or 3 other LLMs: check their work, is this conclusive, can you find any bugs, etc etc.
But you don't sound like you know nothing. You sound like you're rushing to get things done, cutting corners, and you're getting rushed results.
What do you expect?
Their work is cheap. They can pump out $50k+ worth of features in a $200/mo subscription with minimal baby-sitting. Be EAGER to reject their work. Send it back to them over and over again to do it right, for architectural reviews, to check for correctness, performance, etc.
They are not expensive people with feelings you need to consider in review, that might quit and be hard to replace. Don't let them cut corners. For whatever reason, they are EAGER to cut corners no matter how much you tell them not to.
https://vivekhaldar.com/articles/when-compilers-were-the--ai...
We are completely comfortable now letting the compilers do their thing, and never seem to worry that we "don't know what is actually happening under the hood".
I am not saying these situations are exactly analogous, but I am saying that I don't think we can know yet if this will be one of those things that we stop worrying about or it will be a serious concern for a while.
> Many assembly programmers were accustomed to having intimate control over memory and CPU instructions. Surrendering this control to a compiler felt risky. There was a sentiment of, if I don’t code it down to the metal, how can I trust what’s happening? In some cases, this was about efficiency. In other cases, it was about debuggability and understanding programming behavior. However, as compilers matured, they began providing diagnostic output and listings that actually improved understanding.
I would 100% use LLMs more and more aggressively if they were more transparent. All my reservations come from times when I prompt “change this one thing” and it rewrites my db schema for some reason, or adds a comment that is actively wrong in several ways. I also think I have a decent working understanding of the assembly my code compiles to, and do occasionally use https://godbolt.org/. Of course, I didn’t start out that way, but I also don’t really have any objections to teenagers vibe-coding games, I just think at some point you have to look under the hood if you’re serious.
Isn't that what git is for, though? Just have your LLM work in a branch, and then you will have a clear record of all the changes it made when you review the pull request.
LLMs are nothing like that
It is just the scope that makes it appear non-deterministic to a human looking at it, and it is large enough to be impossible for a human to follow the entire deterministic chain, but that doesn't mean it isn't in the end a function that translates input data into output data in a deterministic way.
There is a world of difference between translation and generation. It's even in the name: generative AI. I didn't say anything about magic.
Care to point to any that are set up to be deterministic?
Did you ever stop to think about why no one can get any use out of a model with temp set to zero?
I get why that is in practice different then the manner in which compilers are deterministic, but my point is the difference isnt because of determinism.
A non deterministic compiler is probably defective and in any case much less useful
Although, while the compiler devs might know what was going on in the compiler, they wouldn't know what the compiler was doing with that particular bit of code that the FORTRAN developer was writing. They couldn't possibly foresee every possible code path that a developer might traverse with the code they wrote. In some ways, you could say LLMs are like that, too; the LLM developers know how the LLM code works, but they don't know the end result with all the training data and what it will do based on that.
In addition, to the end developer writing FORTRAN it was a black box either way. Sure, someone else knows how the compiler works, but not the developer.
There's plenty of resources online to rectify that, though.
Demonstrably incorrect. This is because the model selection, among other data, is not fixed for (I would say most) LLMs. They are constantly changing. I think you meant something more like an LLM with a fixed configuration. Maybe additional constraints, depending on the specific implementation.
I can't help but read complaints about the capabilities of AI – and I'm certainly not accusing you of complaining about AI, just a general thought – and think "Yet" to myself every time.
I've spent far more time pitting one AI context against another (reviewing each other's work) than I have using AI to build stuff these days.
The benefit is that since it mostly happens asynchronously, I'm free to do other stuff.
I guess it comes down to how ossified you want your existing code to be.
If it's a big production application that's been running for decades then you probably want the minimum possible change.
If you're just experimenting with stuff and the project didn't exist at all 3 days ago then you want the agent to make it better rather than leave it alone.
Probably they just need to learn to calibrate themselves better to the project context.
Even within the same project, for a given PR, there are some parts of the codebase I want to modify freely and some that I want fixed to reduce the diff and testing scope.
I try to explain up-front to the agent how aggressively they can modify the existing code and which parts, but I've had mixed success; usually they bias towards a minimal diff even if that means duplication or abusing some abstractions. If anyone has had better success, I'd love to hear your approach.
Cross entropy loss steers towards garden path sentences. Using a paragraph to say something any person could say with a sentence, or even a few precise words. Long sentences are the low perplexity (low statistical “surprise”) path.
I think they're in here, last edited 8 months ago: https://github.com/nreHieW/fyp/blob/5a4023e4d1f287ac73a616b5...
Over-editing is definitely not some long gone problem. This was on xhigh thinking, because I forgot to set it to lower.
The idea being that if you're working in an area, you should refactor and tidy it up and clean up "tech debt" while there.
In practice, it was seldom done, and here we have LLMs actually doing it, and we're realising the drawbacks.
At times even when a function is right there doing exactly what's needed.
Worse, when it modifies a function that exists, supposedly maintaining its behavior, but breaks for other use cases. Good try I guess.
Worst. Changing state across classes not realising the side effect. Deadlock, or plain bugs.
I spent some time dealing with this today. The real issue for me, though, was that the refactors the agent did were bad. I only wanted it to stop making those changes so I could give it more explicit changes on what to fix and how.
"Refactor-as-you-go" means to refactor right after you add features / fix bugs, not like what the agent does in this article.
This is horrible practice, and very typical junior behavior that needs to be corrected against. Unless you wrote it, Chesterton's Fence applies; you need to think deeply for a long time about why that code exists as it does, and that's not part of your current task. Nothing worse than dealing with a 1000 line PR opened for a small UI fix because the code needed to be "cleaned up".
Tech debt needs to be dealt with when it makes sense. Many times it will be right there and then as you're approaching the code to do something else. Other times it should be tackled later with more thought. The latter case is frequently a symptom of the absence of the former.
In Extreme Programming, that's called the Boy Scouting Rule.
https://furqanramzan.github.io/clean-code-guidelines/princip...
If LLMs are doing sensible and necessary refactors as they go then great
I have basically zero confidence that is actually the case though
I suspect AI's learned to do this in order to game the system. Bailing out with an exception is an obvious failure and will be penalized, but hiding a potential issue can sometimes be regarded as a success.
I wonder how this extrapolates to general Q&A. Do models find ways to sound convincing enough to make the user feels satisfied and the go away? I've noticed models often use "it's not X, it's Y", which is a binary choice designed to keep the user away from thinking about other possibilities. Also they often come up with a plan of action at the end of their answer, a sales technique known as the "assumptive close", which tries to get the user to think about the result after agreeing with the AI, rather than the answer itself.
Codex also has a tendency to apply unwanted styles everywhere.
I see similar tendencies in backend and data work, but I somehow find it easier to control there.
I'm pretty much all in on AI coding, but I still don't know how to give these things large units of work, and I still feel like I have to read everything but throwaway code.
But yeah, I saw a suggestion about adding a long-lived agent that would keep track of salient points (so kinda memory) but also monitor current progress by main agent in relation to the "memory" and give the main agent commands when it detects that the current code clashes with previous instructions or commands. Would be interesting to see if it would help.
Purely anecdotal.
The cynic in me thinks it's done on purpose to burn more tokens. The pragmatist however just wants full control over the harness and system prompts. I'm sure this could be done away with if we had access to all the knobs and levers.
We do, just tell it what you want in your AGENTS.md file.
Agents also often respond well to user frustration signs, like threatening to not continue your subscription.
In this case I would ask for smaller changes and justify every change. Have it look back upon these changes and have it ask itself are they truly justified or can it be simplified.
"Do not modify any code; only describe potential changes."
I often add it to the end when prompting to e.g. review code for potential optimizations or refactor changes.
With LLMs, you glimpse a distant mountain. In the next instant, you're standing on its summit. Blink, and you are halfway down a ridge you never climbed. A moment later, you're flung onto another peak with no trail behind you, no sense of direction, no memory of the ascent. The landscape keeps shifting beneath your feet, but you never quite see the panorama. Before you know it, you're back near the base, disoriented, as if the journey never happened. But confident, you say you were on the top of the mountain.
Manual coding feels entirely different. You spot the mountain, you study its slopes, trace a route, pack your gear. You begin the climb. Each step is earned steadily and deliberately. You feel the strain, adjust your path, learn the terrain. And when you finally reach the summit, the view unfolds with meaning. You know exactly where you are, because you've crossed every meter to get there. The satisfaction isn't just in arriving, nor in saying you were there: it is in having truly climbed.
With LLM-assisted coding, you skip the trek and you instantly know that’s not it.
I am surprised Gemini 3.1 Pro is so high up there. I have never managed to make it work reliably so maybe there's some metric not being covered here.
I think we should move to semi-autonomous steerable agents, with manual and powerful context management. Our tools should graduate from simple chat threads to something more akin to the way we approach our work naturally. And a big benefit of this is that we won't need expensive locked down SOTA models to do this, the open models are more than powerful enough for pennies on the dollar.
How do you emulate that with llm’s? I suppose the objective is to get variance down to the point it’s barely noticeable. But not sure it’ll get to that place based on accumulating more data and re-training models.
You can get pretty close to reproducible output by narrowing the scope and using certain prompts/harnesses. As in, you get roughly the same output each time with identical prompts, assuming you're using a model which doesn't change every few hours to deal with load, and you aren't using a coding harness that changes how it works every update. It's not deterministic, but if you ask it for a scoped implementation you essentially get the same implementation every time, with some minor and usually irrelevant differences.
So you can imagine with a stable model and harness, with steering you can basically get what you ask it for each time. Tooling that exploits this fact can be much more akin to using an autocomplete, but instead of a line of code it's blocks of code, functions, etc.
A harness that makes it easy to steer means you can basically write the same code you would have otherwise written, just faster. Which I think is a genuine win, not only from a productivity standpoint but also you maintain control over the codebase and you aren't alienated or disenfranchised from the output, and it's much easier to make corrections or write your own implementations where you feel it's necessary. It becomes more of an augmentation and less of a replacement.
I re-did an entire UI recently, and when one of the elements failed to render I noticed the old UI peeking out from underneath. It had tried just covering up old elements instead of adjusting or replacing them. Like telling your son to clean their room, so they push all the clothes under the bed and hope you don't notice LOL
It saves 2 hours of manual syntax wrangling but introduces 1 .5 hours of clean up and sanity checking. Still a net productivity increase, but not sure if its worth how lazy it seems to be making me (this is an easy error to correct, im sure, but meh Claude can fix it in 2 seconds so...)
The version it puts down into documents is not the thing it was actually doing. It's a little anxiety-inducing. I go back to review the code with big microscopes.
"Reproducibility" is still pretty important for those trapped in the basements of aerospace and defense companies. No one wants the Lying Machine to jump into the cockpit quite yet. Soon, though.
We have managed to convince the Overlords that some teensy non-agentic local models - sourced in good old America and running local - aren't going to All Your Base their Internets. So, baby steps.
The solution to this is to use quality gates that loop back and check the work.
I'm currently building a tool with gates and a diff regression check. I haven't seen these problems for a while now.
...and that led me to believe that AI might be very capable to develop over-engineered audio equipment. Think of all the bells and whistles that could be added, that could be expressed in ridiculous ways with ridiculous price tags.
Too many people are treating the tools as a complete replacement for a developer. When you are typing a text to someone and Google changes a word you misspelled to a completely different word and changes the whole meaning of the text message do you shrug and send it anyway? If so, maybe LLMs aren't for you.
Counterpoint: no it isn't
> makes this job dramatically harder
No it doesn't