Take sending a packet over a noisy, low-SNR cell network. A large fraction of packets may be lost. This doesn't prevent me, as a software developer, from building an abstraction on top of a "mostly-reliable" TCP connection to deliver my website.
There are times when the service doesn't work, particularly when the packet loss rate is too high. I can still incorporate these failures into my mental model of the abstraction (e.g. through TIMEOUTs, CONN_ERRs…).
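As a minimal sketch of that mental model (a toy fetch over a raw socket; the host, port, timeout, and retry count are all made up), failures get caught, retried, and finally surfaced as a typed error the caller can handle:

```python
import socket

def fetch(host: str, port: int, payload: bytes, retries: int = 3) -> bytes:
    last_err = None
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=2.0) as conn:
                conn.sendall(payload)
                return conn.recv(4096)
        except (socket.timeout, ConnectionError) as err:
            last_err = err  # TIMEOUTs / CONN_ERRs folded into the model
    raise TimeoutError(f"gave up after {retries} attempts") from last_err
```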
Much of engineering and reliability history revolves around building mathematical models on top of an unpredictable world. We are far from solving this problem with LLMs, but this doesn't prevent me from thinking of LLMs as a new level of abstraction that can edit and transform code.
It makes even less sense when people compare an LLM to a compiler. Imagine making a pull request that's just adding a binary because you threw the source code away.
If I assign a bug fix ticket to a human developer on my team, I won't be able to precisely replicate how they go about solving the bug, but for many bugs I can at least be assured that the bug will get solved, and that I understand the basic approach the assigned dev would use to troubleshoot and resolve the ticket.
This is an organizational abstraction but it's an abstraction just the same, leaky as it is.
No, this is not comparable. The reason reproducible builds are tricky is not because compilers are inherently prone to randomness; it's because binaries often bake in things like timestamps and the exact pathnames of the system used to produce the build. People need to stop comparing LLMs to compilers, it's an embarrassingly poor analogy.
inb4: "Don't worry, just use GPT to make the docs"
A reasoning error has an infinite, unpredictable blast radius. When an LLM hallucinates, it doesn't fail safely; it writes perfectly compiling code that does the wrong thing. That "wrong thing" might just render a button incorrectly, or it might silently delete your production database, or open a security backdoor.
You can build reliable abstractions over failures that are predictable and contained. You cannot abstract away unpredictable destruction.
Says who? It’s quite easy to limit the blast radius of a reasoning error.
Sure, you can patch that specific case with guardrails, but how many unpredictable edge cases are you going to cover? It only takes a user with a bit of ingenuity to circumvent them. There are already several examples of AI agents getting stuck in infinite loops, burning through massive API bills while achieving absolutely nothing.
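To be concrete about the guardrails in question, here's a hedged sketch: `agent_step` is a hypothetical callable standing in for one LLM round-trip, and the caps are arbitrary. It bounds a runaway loop, but it says nothing about whether the reasoning inside each step is right, which is the point.

```python
MAX_STEPS = 20
MAX_TOKENS = 50_000

def run_agent(task: str, agent_step) -> str:
    spent = 0
    state = task
    for step in range(MAX_STEPS):
        # agent_step is assumed to return (new_state, tokens_used, done)
        state, tokens_used, done = agent_step(state)
        spent += tokens_used
        if spent > MAX_TOKENS:
            raise RuntimeError(f"token budget exceeded at step {step}")
        if done:
            return state
    raise RuntimeError("agent hit the step cap without finishing")
```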
You can contain a system failure, but you cannot contain a logic failure if the system doesn't know the logic is wrong.
Suppose you had:
Math()
  Add()
  Subtract()

Program()
  Math("calculate rate")
This is intentionally written vaguely. How do you constrain these implementations so that Program() runs and does the right thing when there is no guarantee Math() or its components are correct?
Normally you could use a typed programming language, unit tests, etc., but if the LLM is the ultimate abstraction, programs will be written like the lines above. At some point traditional software engineering principles will need to apply.
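As a sketch of applying those principles anyway: `llm_generated_add` below is a placeholder for whatever the model emits for Add(), and the property checks are ordinary tests, nothing LLM-specific.

```python
import random

def llm_generated_add(a: float, b: float) -> float:
    return a + b  # placeholder for model-written code

# Ordinary property tests pin the behavior down regardless of who wrote it.
for _ in range(1000):
    a, b = random.uniform(-1e6, 1e6), random.uniform(-1e6, 1e6)
    assert llm_generated_add(a, b) == llm_generated_add(b, a)  # commutativity
    assert llm_generated_add(a, 0.0) == a                      # identity
```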
Abstractions often embrace nondeterministic translation because lower-level details are unknown at the time of expression -- which is the motivation for many LLM queries.
I've gone through hand-coding HTML, CGI, CMSes, web frameworks, and CMSes built with web frameworks. Each is (roughly) a layer of abstraction on top of lower layers.
People talk about LLMs as an extension of this layering, but they're not. With the layers of abstraction I've listed, you can go down to the layers underneath and understand them if you take the time.
LLMs are something different. They're a replacement for or a simulation of the thinking process involved in programming at various layers.
I have a feeling that if LLMs were built on a deterministic technology, a lot of the current AI-is-not-intelligent crowd would be saying "These LLMs can only generate one answer given a question, which means they lack human creativity and they'll never be intelligent!"
And LLMs can handle very abstract concepts that could not possibly be encoded in C++, like the user's goal in using software.
Is the probability much higher with GCC? Sure. But it's still a probability.
But apparently, not so much any more.
They were invented to reduce cost of computation, not to eliminate the probability of error per se. Ask a Windows 11 user, they'll tell you computers still make errors.
We have a bunch of engineers paying money to open loot boxes and they get visibly upset when they run out of tokens.
LLM companies have done an absolutely brilliant job of figuring out how to burn more tokens quickly, couch it as “more advanced” and people throw money at them.
I realize this wasn’t the thrust of your point, but tangentially, we fucked it up so badly because people desperately want to ignore this bit, and instead of looking at these tools analytically, there are the ardent defenders and the staunchly opposed… much like every other topic under the sun these days.
I use the free stuff work pays for, and I’ve never hit any token limit or anything like that. But I’m also trying extremely hard to ensure my skillsets don’t atrophy. I just use the web interface and ask questions. I have no interest in tying my development experience directly into an LLM, not after what I’ve seen at work over the last few weeks.
C and Python each have a bunch of different compilers, so if you take the same code, the final output can be different. There's determinism within the same compiler. Add in different architectures, and the machine code output is definitely more varied than presented.
But that's still manageable; then if you add in all the dependencies, well, you get a more florid complexity.
So really, it's a shitty abstraction rather than an inaccurate analogy. If you lined them up in levels, there could be some universe where they are a valid abstraction. But it's not the current universe, because we know the models function on non-determinism.
I'd posit that if there were a 'turtles all the way down' abstraction for the LLM, it's simply coming from the other end, the one where the human mind might start entering the picture.
- contributing individually
- contributing as a tech lead
- contributing as a technical manager
- leaving the occupation to open a vanity business, such as a gastropub or horseshoeing service

https://en.wikipedia.org/wiki/Abstraction_(computer_science)
For many applications, this is just as troublesome as true non-determinism.
They are definitely not interpretable. I was reading some stuff from mechanistic interpretability researchers saying they've given up trying to build a bottom-up model of how they work.
Compare "You are a helpful assistant. Your task is to <100 lines of task description> <example problem>"
with
"you are a helpless assistant. Your task is to <100 lines of task description> <example problem>"
I've changed 3 or 4 CHARACTERS ("ful" to "less") out of a (by construction) 1000+ character prompt.
and the outputs are not at all similar.
Just realized I've never tried the "you are a helpless ass" prompt. Again a very minor change in wording, just dropping a few letters. The helpless assistant at least output text apologizing for being so bad at the task.
edit: I'm not talking about an LLM as accessed through a provider. I'm just talking about using a model directly. Why wouldn't that be deterministic?
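If you want to reproduce the helpful/helpless sensitivity yourself, here's a rough sketch using Hugging Face transformers; the model choice and task text are placeholders, not the original setup. Greedy decoding (`do_sample=False`) maps each prompt to exactly one output, so any difference comes from the persona word alone:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model
task = "Your task is to summarize this bug report: app crashes on login."  # stand-in

for persona in ("helpful", "helpless"):
    prompt = f"You are a {persona} assistant. {task}"
    out = generator(prompt, max_new_tokens=40, do_sample=False)
    print(persona, "->", out[0]["generated_text"][len(prompt):])
```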
After the model has produced a probability for every possible next token, a piece of software that is NOT the LLM chooses the next token. This is called the sampler. There are different sampling parameters and strategies available, but if you want repeatable* outputs, just take the token with the highest probability.
* Perfect determinism in this sense is difficult to achieve because GPU calculations naturally have a minor bit of nondeterminism. But you can get very close.
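A sketch of that sampler, with `model` standing in for any next-token predictor that returns one probability per vocabulary entry:

```python
import numpy as np

def greedy_decode(model, prompt_ids: list[int], max_new: int = 32) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_new):
        probs = model(ids)                 # one probability per vocab entry
        ids.append(int(np.argmax(probs)))  # always the top token, no RNG
    return ids
```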
Deciding how to pick a particular output given that likelihood function is left as an exercise for the user, which we call inference.
One obvious choice is to keep picking the highest-likelihood token, feed it into the model, and get another -- on repeat. This is what most algorithms call "temperature=0". But doing this for token after token can lead to boring output, or steer you into pathological low-probability sequences like a set of endless repeats.
So, the current SOTA is to intentionally introduce a random factor (temperature>0) to the sampling process -- along with other hacks, like explicit suppression of repeats.
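Roughly, in code (the penalty scheme and constants here are illustrative; real stacks differ in the details):

```python
import numpy as np

def sample_token(logits: np.ndarray, recent: list[int],
                 temperature: float = 0.8, penalty: float = 1.3) -> int:
    logits = logits.copy()
    for t in set(recent):  # explicit suppression of repeats
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    if temperature == 0:
        return int(np.argmax(logits))  # degenerate case: greedy decoding
    probs = np.exp((logits - logits.max()) / temperature)  # softmax at temp T
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```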
Technically, even when the temperature is 0 it's not deterministic, but it's more likely to be... You can have ties in the probabilities for the next token. And floating-point noise is real.
All these models are doing is guesstimating the next token to say.
And in any case, setting the temperature to zero will not produce a useful result, unless you don't mind your LLM constantly running into infinite loops.