That's when A.I. starts advancing itself and no longer needs humans in the loop.
You have to put the environment back in the loop, though; it needs a source of discovery and validity feedback for its ideas. For math and code that's easy, for self-driving cars doable but not easy, and for business ideas - how would we test them without wasting money? It varies field by field: some fields allow automated testing, others are slow, expensive and rate-limited to test.
Now, depending on how good your simulation is, it may or may not be useful, but still, that's how you do it. Something like https://en.wikipedia.org/wiki/MuZero
> doable but not easy, for business ideas
That requires a lot of human psychology and advanced, hard economic theory (not the fluffy academic kind). With a human-controlled monetary supply, and most high-level business requiring illegal and immoral exploitation of law and of humans in general, it's not a path machines can realistically go down, or one we'd even want machines treading. Think scams and pure resource extraction. They won't consider many impacts outside of the bottom line.
But the larger point stands: you don't need an environment to explore the abstraction landscape prescribed by systems thinking. You only need the environment at the human interface.
Eg randomised quicksort works really well.
Sorting a finite number of elements in a sequence is a very narrow application of AI, akin to playing chess. Very simple approaches like RL usually work fine for problems like these, but auto-regression/diffusion models have to take steps that are not well defined at all, where the next step towards solving the problem is not obvious.
As an example, imagine a robot trying to grab a tomato from a table. Its arm extends 1 meter at maximum, and the tomato is placed 0.98 meters away. Is it able to grab the tomato from where it stands, or does it need to move closer first and only then try to grab it?
That computation is better done deterministically. Deterministic computation is faster, cheaper and more secure. It has to check that $tomato_distance + $tomato_size < $arm_length. If this constraint is not satisfied, then move_closer() and check again: $tomato_distance + $tomato_size < $arm_length.
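In code, that deterministic check might look something like this (a toy sketch with made-up numbers and function names, not any particular robot's API):

    # Illustrative constants from the example above (hypothetical values).
    ARM_LENGTH = 1.00       # metres the arm can extend
    TOMATO_DISTANCE = 0.98  # metres from the robot to the tomato
    TOMATO_SIZE = 0.05      # assumed tomato diameter in metres

    def can_reach(distance, size, arm_length):
        # Deterministic reachability check: the whole tomato must lie within reach.
        return distance + size < arm_length

    def move_closer(distance, step=0.1):
        # Stand-in for the robot's move_closer(): reduce the distance by one step.
        return max(0.0, distance - step)

    distance = TOMATO_DISTANCE
    while not can_reach(distance, TOMATO_SIZE, ARM_LENGTH):
        distance = move_closer(distance)  # move, then re-check the constraint
    print(f"attempt grasp at {distance:.2f} m")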
From the paper:
> Our system employs a custom interpreter that parses "LLM-Thoughts" (represented as DSL code snippets) to generate First Order Logic programs, which are then verified by a Z3 theorem prover.
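The last step of that pipeline is ordinary Z3 use. A minimal sketch with the Python bindings (my own toy facts and rule, not the paper's DSL or interpreter):

    from z3 import Bool, Implies, Not, Solver, unsat

    # Hypothetical facts/rule an LLM might have extracted; names are illustrative.
    socrates_is_human = Bool("socrates_is_human")
    socrates_is_mortal = Bool("socrates_is_mortal")

    knowledge_base = [
        socrates_is_human,                               # fact
        Implies(socrates_is_human, socrates_is_mortal),  # rule
    ]

    # To verify a conclusion, assert its negation and check for unsatisfiability.
    s = Solver()
    s.add(knowledge_base)
    s.add(Not(socrates_is_mortal))
    print("entailed" if s.check() == unsat else "not entailed (counterexample exists)")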
Sorry, I did not suggest you should use AI to sort numbers. I was solely replying to this:
> Small steps of nondeterministic computation, checked thoroughly with deterministic computation every so often, and the sky is the limit.
You don't necessarily need your checks to be deterministic.
In fact, it's often better for them not to be deterministic.
See also https://fsharpforfunandprofit.com/series/property-based-test...
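Here's the same idea in Python with the Hypothesis library (my own toy example; the linked series covers it in F#):

    from hypothesis import given, strategies as st

    def my_sort(xs):
        # Stand-in for whatever nondeterministically-generated code is under test.
        return sorted(xs)

    @given(st.lists(st.integers()))
    def test_sort_properties(xs):
        result = my_sort(xs)
        # Property 1: the output is ordered.
        assert all(a <= b for a, b in zip(result, result[1:]))
        # Property 2: the output is a permutation of the input.
        assert sorted(xs) == sorted(result)

The inputs are random on every run, yet the properties being checked are exact, which is often a stronger net than a handful of hand-picked deterministic test cases.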
I don't understand your claim about 'Deterministic computation is faster, cheaper and more secure.' That's not true at all.
In fact, for many problems the fastest and simplest known solutions are non-deterministic. And in, e.g., cryptography you _need_ non-determinism to get any security at all.
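Randomised quicksort, mentioned upthread, is the textbook example: choosing the pivot at random is exactly what gives O(n log n) expected time on every input, with no adversarially bad case. A toy sketch:

    import random

    def randomized_quicksort(xs):
        # Expected O(n log n) on every input, because the pivot is chosen at random.
        if len(xs) <= 1:
            return xs
        pivot = random.choice(xs)
        less = [x for x in xs if x < pivot]
        equal = [x for x in xs if x == pivot]
        greater = [x for x in xs if x > pivot]
        return randomized_quicksort(less) + equal + randomized_quicksort(greater)

    print(randomized_quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]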
Beyond its relevancy to the parent comment, would you consider it a good movie yourself? (for a random/average HN commenter to watch)
It's a fine movie though.
I definitely enjoyed it many years ago as a younger person.
Like the inverse kinematics required for your arm and fingers to move.
Similarly, transistor-based logic is based on such thresholds: when the current or voltage reaches a certain level, a state transition happens.
The code is also a useful artifact that can be iteratively edited and improved by both the human and LLM, with git history, etc. Running and passing tests/assertions helps to build and maintain confidence that the math remains correct.
I use helper functions to easily render from the sympy code to latex, etc.
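Roughly like this (an illustrative sketch, not the actual helpers from the repo linked below):

    import sympy as sp

    def show(expr, label=None):
        # Render a (simplified) sympy expression as LaTeX, ready to paste into a doc.
        tex = sp.latex(sp.simplify(expr))
        return f"{label} = {tex}" if label else tex

    theta = sp.symbols("theta", real=True)
    # Toy expression standing in for the real derivation; the assertion makes the
    # script fail loudly if a later edit breaks a known identity.
    expr = sp.cos(theta)**2 + sp.sin(theta)**2
    assert sp.simplify(expr - 1) == 0
    print(show(expr, "I"))  # I = 1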
A lot of the math behind this quantum eraser experiment was done this way.
https://github.com/paul-gauthier/entangled-pair-quantum-eras...
Probably the main deficiencies are confusion as the context grows (therefore confusion as task complexity grows).
[1]: https://quantumprolog.sgml.net/llm-demo/part1.html
[2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax
[1] https://arxiv.org/abs/2505.20047 [2] https://github.com/antlr/grammars-v4/blob/master/datalog/dat...
Calculators are good AI; they rarely lie (only when floating-point rounding bites). And yes, Wikipedia says calculators are AI tech, since a "computer" was once a person, and now it is a tool that shows the intelligent trait of doing math with numbers or even functions/variables/equations.
Querying a calculator or a Wolfram Alpha-like symbolic AI system with LLMs seems like the only use for LLMs, besides text refactoring, that should be feasible.
Thinking LLMs know anything on their own is a huge fallacy.
My team has been prototyping something very similar, encoding business operations policies in LEAN. We have some internal knowledge bases (Google Docs / wiki pages) that we first convert to LEAN using LLMs.
Then we run the solver to verify consistency.
When a wiki page is changed, the process is run again and it's essentially a linter for process.
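To give a flavour (a toy Lean sketch with made-up predicate names, not our real encoding), two contradictory wiki rules surface as a derivable False:

    -- Hypothetical policy predicates; names are illustrative only.
    axiom Refund : Type
    axiom over100 : Refund → Prop
    axiom managerApproved : Refund → Prop

    -- Page A: refunds over $100 must be manager-approved.
    axiom pageA : ∀ r, over100 r → managerApproved r
    -- Page B (a later edit): refunds over $100 are never manager-approved.
    axiom pageB : ∀ r, over100 r → ¬ managerApproved r

    -- The "linter" step: from an inconsistent policy set, False is derivable
    -- for any refund that is actually over $100.
    theorem policies_inconsistent (r : Refund) (h : over100 r) : False :=
      pageB r h (pageA r h)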
Can't say it moved beyond the prototyping stage though, since the LEAN conversion does require some engineers to look through it at least.
But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.
If you ever feel like chatting and discussing more details, happy to chat!
I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.
I think introducing some degree of symbolism and agents-as-a-judge is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934
Although that work is not public, you can play with the generally available product here!
[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
Some LLMs are more consistent between their text reasoning and their SMT, while others are not (Table 1, Figures 14 and 15).
You can do uncertainty quantification with selective verification to reduce the "risk", e.g. shown as the Area Under the Risk-Coverage Curve in Table 4.
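The selective-verification idea itself is simple; a toy sketch (my own illustration, not the paper's code): keep only the answers whose confidence clears a threshold, and measure the error rate (risk) among what you kept at each coverage level.

    # Each prediction carries a confidence score and whether it turned out correct.
    predictions = [
        (0.95, True), (0.90, True), (0.80, False), (0.70, True),
        (0.60, False), (0.50, True), (0.40, False), (0.30, False),
    ]

    predictions.sort(key=lambda p: -p[0])  # most confident first
    for k in range(1, len(predictions) + 1):
        kept = predictions[:k]
        coverage = k / len(predictions)
        risk = sum(1 for _, correct in kept if not correct) / k
        print(f"coverage={coverage:.2f}  risk={risk:.2f}")
    # The area under this risk-coverage curve summarises how well the confidence
    # scores rank the errors: lower is better.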
And let me be clear that this is a major limitation that fundamentally breaks whatever you are trying to achieve. You start with some LLM-generated text that is, by construction, unrelated to any notion of truth or factuality, and you push it through a verifier. Now you are verifying hot air.
It's like research into the efficacy of homeopathic medicine, and there's a lot of that indeed, very carefully performed and with great attention to detail. Except all of that research is trying to prove whether doing nothing at all (i.e. homeopathy) has some kind of measurable effect or not. Obviously the answer is no. So what could change that? Only making homeopathy do something instead of nothing. But that's impossible, because homeopathy is, by construction, doing nothing.
It's the same thing with LLMs. Unless you find a way to make an LLM that can generate text that is conditioned on some measure of factuality, then you can verify the output all you like, the whole thing will remain meaningless.
"Alice has 60 brothers and she also has 212 sisters. How many sisters does Alice's brother have?"
But the generated program is not very useful:
{ "sorts": [], "functions": [], "constants": {}, "variables": [ {"name": "num_brothers_of_alice", "sort": "IntSort"}, {"name": "num_sisters_of_alice", "sort": "IntSort"}, {"name": "sisters_of_alice_brother", "sort": "IntSort"} ], "knowledge_base": [ "num_brothers_of_alice == 60", "num_sisters_of_alice == 212", "sisters_of_alice_brother == num_sisters_of_alice + 1" ], "rules": [], "verifications": [ { "name": "Alice\'s brother has 213 sisters", "constraint": "sisters_of_alice_brother == 213" } ], "actions": ["verify_conditions"] }
https://www.reddit.com/r/healthIT/comments/1n81e8g/comment/n...
Interesting that the final answer is provably entailed (or you get a counterexample), instead of being merely persuasive chain-of-thought.
Unless I’m wrong, this is mainly an API for trying to get an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is the ability to literally read the logical statement being evaluated to an answer, and to run the solver to see if it holds or not.
The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?
The paper reports a *51%* false positive rate on a logic benchmark! That’s shockingly high, and suggests the LLM is either bad at building logical models or keeps introducing unsoundness. Sadly, the evaluation is a bit thin on how this stacks up and what causes it to fall short.
E.g. https://arxiv.org/pdf/2505.20047 Table 1, where we compare performance on text-only vs SMT-only reasoning. o3-mini does pretty well at mirroring its text reasoning in its SMT, versus Gemini Flash 2.0.
An illustration of this can be seen in Figures 14 and 15 on page 29.
In commercially available products like AWS Automated Reasoning Checks, you build a model from your domain (e.g. from a PDF policy document), cross verify it for correctness, and during answer generation, you only cross check whether your Q/A pairs from the LLM comply with the policy using a solver with guarantees.
This means that they can give you a 99%+ soundness guarantee, which basically means that if the service says the Q/A pair is valid or guaranteed w.r.t the policy, it is right more than 99% of the time.
https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047), we are able to do this with open-source LLMs.
Please edit out swipes like this from your HN comments—this is in the site guidelines: https://news.ycombinator.com/newsguidelines.html. It comes across as aggressive, and we want curious conversation here.
Your comment would be fine without that bit.
Arguably your question reduces to: why does HN have moderators at all? The answer to that is that unfortunately, the system of community + software doesn't function well on its own over time—it falls into failure modes and humans (i.e. mods) are needed to jig it out of those [3]. I say "unfortunately" because, of course, it would be so much better if this weren't needed.
You can't assess this at the level of an individual interaction, though, because it's scoped at the whole-system level. That is, we can (and do) make bad individual calls, but what's important is how the overall system functions. If you see the mods making a mistake, you're welcome to point it out (and HN users are not shy about doing so!), and we're happy to correct it. But it doesn't follow that you don't need moderators for the system to work, or even survive.
[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
[2] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
[3] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Ultimately there are still going to be bugs. For this reason and several others you'll still need it wrapped in a retry.
But what makes computer hardware fundamentally incompatible with thinking, compared to a brain?
My point is, the question of whether an LLM reasons the same way a human does is about as useful as "does a submarine swim" or "can a telephone talk". The results speak for themselves.
That sounds like a false "both sides"-ing.
It's not symmetrical: there's a lot more money (and potential for grift) in hyping things up as miracle machines.
In contrast, most of the pessimists don't have a discernible profit motive.
You have artists who've lost work due to diffusion models, teachers who can't assign homework essays anymore, people who hate Microsoft Copilot, just anyone not wanting to be replaced by a bot or being forced to use the tech to avoid being outcompeted, people set in their ways who don't want change or imagine it being destructive, etc. It's a large crowd that one can appeal to for personal gain, politics 101. Anyone with half believable credentials can go on a talk show and say the things people want to hear, maybe sell a book or two afterwards.
Are today's models on the brink of some exponential self perpetuating shot towards superintelligence? Obviously not. Are they overhyped glorified lookup tables? Also no. Are there problems? Definitely. But I don't think it's entirely fair to dismiss a tech based on someone misappropriating it in monopolistic endeavours instead of directing dismissal towards those people themselves.
Like, similar to how Elon's douchebaggery has tainted EVs for lots of people for no practical reason, Altman's has done the same for LLMs.
There have been enough cases of models providing novel results that it's clear that whatever human trait they supposedly lack they don't really need. A car does not need legs, it does things differently. Having legs would even be a major detriment and would hold it back from achieving its top performance.
That's what those brain-simulation projects are conceptually, btw: cars with legs, or planes with flapping wings. That's why they all fail; the approach makes no sense.
Those are some mighty parrots there, if they managed to get gold at IMO, IoI, and so on...
>They are a parrot
Is it really much different from most people? The average Joe doesn't produce novel theories every day - he just rehashes what he's heard. Now the new goalpost seems to be that we can only say an LLM can "reason" if it matches Fields Medalists.
You've presented a false choice.
However, the average Joe does indeed produce unique and novel thoughts every day. If that were not the case he would be brain-dead. Each decision - wearing blue or red today - and every tiny thought, action, feeling, indecision, crisis, or change of heart: these are just as important.
The jury may be out on how to judge what 'thought' actually is. However, what it is not is perhaps easier to perceive. My digital thermometer does not think when it tells me the temperature.
My paper and pen version of the latest LLM (quite a large bit of paper and certainly a lot of ink I might add) also does not think.
I am surprised so many in the HN community have so quickly taken to assuming as fact that LLM's think or reason. Even anthropomorphising LLM's to this end.
For a group inclined to quickly calling out 'God of the gaps' they have quite quickly invented their very own 'emergence'.
Even if we're to humor the "novel" part, have they actually come up with anything truly novel? New physics? New proofs of hard math problems that didn't exist before?
[0] https://research.google/blog/ai-as-a-research-partner-advanc...
Imagine somebody in 2007: "It's so funny to me that people are still adamant about mortgage default risk after it's become a completely moot point because nobody cares in this housing market."
It’s pretty clear to me there is a collective desire to ignore the problems in order to sell more GPUs, close the next round, and get that high-paying AI job.
Part of me wishes humans would show the same dedication to fight climate change…
Diving into how well/badly anybody predicted a certain economic future is a whole different can of worms.
That said: "The market can stay irrational longer than I can stay solvent." :p
But this is an incredibly interesting problem!
You'll also see why their applications are limited compared to what you probably hoped for.
Imagine having a bunch of 2D matrices with a combined 1.8 trillion numbers, from which you pick out blocks of numbers in a loop and finally merge and combine them to form a token.
Good luck figuring out what number represents what.
> My paper and pen version of the latest LLM (quite a large bit of paper and certainly a lot of ink I might add) also does not think.
> I am surprised so many in the HN community have so quickly taken to assuming as fact that LLM's think or reason. Even anthropomorphising LLM's to this end.
> For a group inclined to quickly calling out 'God of the gaps' they have quite quickly invented their very own 'emergence'.
Sure, but if you assume that physical reality can be simulated by a Turing machine, then (computational practicality aside) one could do the same thing with a human brain.
Unless you buy into some notion of magical thinking as pertains to human consciousness.
That isn’t a definition or even a coherent attempt.
For starters, what kind of cognition or computation can’t be implemented with either logic or arithmetic?
What is or is not “cognition” is going to be a higher-level property than which basic, universally capable substrate is used, given that such substrates can easily simulate each other and be substituted for each other.
Even digital and analog systems can be used to implement each other to arbitrary accuracy.
Cognition is a higher level concern.
It's just shorthand for "that's an extraordinary claim and nobody has provided any remotely extraordinary evidence to support it."
Complex phenomena emerge all the time from interactions of things that don’t exhibit those phenomena themselves.
Atoms can’t think. In no sense can you find any thinking in an atom.
They are no different from dominos in that respect.
You can pile atoms to the moon without seeing any thinking.
Yet they can still be arranged so they do think.
We're already at the point where LLMs can beat the Turing test. If we define thinking as something only humans can do, then we can't decide if anyone is thinking at all just by talking to them through text, because we can't tell if they're human any more.
What are you saying?
Are you saying you have a clear definition for thinking, and you can demonstrate that animals pass that definition?
Then share the definition.
Or are you simply defining thinking as a common property of humans and animals, using animals and human behavior as exemplars?
A useful definition for focusing inquiry. But it does not clarify or constrain what else might or might not be enabled to think.
Or are you defining thinking as an inherent property of animals and humans that other things cannot have because they are not animals or humans?
Fine, but then that’s an exercise in naming. Something we are all free to do however we want. It has no explanatory power.
Not manufactured stop gaps or generic cynicism.
There is no reason more GPUs can’t contribute to further understanding, as one of many tools that have already assisted with relevant questions and problems.
Opt out of serious inquiry, no excuse needed, if you wish. Reframing others efforts is not necessary to do that.
Sure thing buddy, I'm the confused one in this entire millenarian frenzy.