Then it goes on, "After a couple of vague commands (“build it out more, make it better”) I got a 14 page paper." I hear... "I got 14 pages of words." But is it a good paper, one that another PhD would think is good? Is it even coherent?
When I see the code these systems generate within a complex system, I think okay, well that's kinda close, but this is wrong and this is a security problem, etc etc. But because I'm not a PhD in these subjects, am I supposed to think, "Well of course the 14 pages on a topic I'm not an expert in are good"?
It just doesn't add up... Things I understand, it looks good at first, but isn't shippable. Things I don't understand must be great?
My current preference is Codex 5.1 (Sonnet 4.5 as a close second, though it got really dumb today for "some reason"). It's been good to the point where I shipped multiple projects with it without a problem (with eg https://pine.town being one I made without me writing any code).
how many prompts did it take you to make this?
how did you make sure that each new prompt didn't break some previous functionality?
did you have a precise vision for it when you started or did you just go with whatever was being given to you?
Alas, I did not realize I was being held to the standard of having no bugs under any circumstance, and printing nothing to the console.
I have removed the amateurish log entries, I am pitiably sorry for any offense they may have caused. I will be sure to artisanally hand-write all my code from now on, to atone for the enormity of my sin.
Probably hundreds, I'd say.
> how did you make sure that each new prompt didn't break some previous functionality?
For the backend, I reviewed the code and steered it to better solutions a few times (fewer than I thought I'd need to!). For the frontend, I only tested and steered, because I don't know much about React at all.
This was impossible with previous models, I was really surprised that Codex didn't seem to completely break down after a few iterations!
> did you have a precise vision
I had a fairly precise vision, but the LLM made some good contributions. The UI aesthetic is mostly the LLM, as I'm not very good at that. The UX and functionality is almost entirely me.
You don't hear that anymore.
Feels like a whole generation of skeptics evaporated.
Now, that being said, do I think they are as good as a skilled human on most things? No, I don't. My trust issues have increased after the GPT-5 presentation. The very first question was to showcase its "PhD-level" knowledge, and it gave a wrong answer. It just happened to be in a field I know enough about to notice, but most didn't.
So, while I think they can be considered as having some form of intelligence, I believe they have more limits than a lot of people seem to realise.
The skeptics haven't evaporated, they just aren't bothering to try to talk to you any more because they don't think there's value in it.
And what about everything else in ML progress, like image generation, 3D world generation, etc.?
I vibe coded plenty of small things I never had the time for. You don't have anything you've wanted to do that can fit in a single-page HTML application? It can even use local storage, etc.
The Bitter Lesson is that with enough VC-subsidised compute, those things are useful.
No point in arguing about it though with true believers, they will never change their minds.
We used to joke that "The internet was a mistake.", making fun of the bad parts... but LLMs take the fucking cake. No intelligent beings, no sentient robots, just unlimited amounts of slop.
The tech basically stopped evolving right around the point of it being good enough for spam and slop, but not going any further; there are no cures, no new laws of physics or math, or anything else being discovered by these things. All AI use in science I can see is based on finding patterns in data, not intelligent thought (as in novel ideas). What a bust.
In fact I fear that humans optimize for attention and cater to the feed-ranking algorithm too much, while AI is at least trying to do a decent job. But with AI it is the responsibility of the user to guide it; what AI does depends on what the user does.
It's clear that pro-slavery-minded elitists are happy to sell the line that people should become a "good complement to AI", one that is even more disposable than these puppets. But unlike these mindless entities, people have a will to survive deeply engraved as a primary behavior.
Now anyone mildly capable of using a computer is able to produce many more fictional characters than all those humanity collectively kept in its miscellaneous lores, and drown them in an ocean of insipid narratives. All of it nonetheless mostly passes the grammatical checkboxes at a level most humans would fail (I definitely would :D).
>Here’s a concise and thoughtful response you could use to engage with ako’s last point:
---
"The scale and speed might be the key difference here. While human-generated narratives—like religions or myths—emerged over centuries through collective belief, debate, and cultural evolution, LLMs enable individuals to produce vast, coherent-seeming narratives almost instantaneously. The challenge isn’t just the volume of ‘bullshit,’ but the potential for it to spread unchecked, without the friction or feedback loops that historically shaped human ideas. It’s less about the number of people involved and more about the pace and context in which these narratives are created and consumed."
And even when only considering the tools used in isolated sessions, not exposed by default, the most popular ones are tuned to favor engagement and retention over relevance. That's a different point, as LLMs can definitely be tuned in a different direction, but in practice it does matter in terms of social impact at scale. Even prime-time infotainment has by now covered people falling in love with chatbots or being encouraged into suicidal loops. "You're absolutely right" is not always for the best.
Anecdotally, the people I see the most excited about AI are the people that don't do any fucking work. I can create a lot of value with plain ol' for-loop-style automation in my niche. We're still nowhere near the limit of what we can do with automation, so I don't give a fuck about what AI can do. Bruh, in Windows 10 copy and fuckin' paste doesn't work for me anymore, but instead of fixing that they're adding AI.
For stuff like that, which regular users often do by hand, they can ask an LLM for the command (usually just a few lines of a scripting language, if they only know the magic words to use).
I’m not worried about AI taking our jobs, I’m worried about the market crash when the reality of the various failed (… to actually reduce payroll) or would’ve-been-cheaper-and-better-without-AI initiatives the two of us have been working on non-stop since this shit started breaks through the hype of investment and the music stops.
Ozempic’s FDA approval was in 2017, the same year transformers were invented.
Whatever you can credit LLMs with, GLP-1s aren't one of them.
Fully agree.
You could trust the expert analysis of people in that field. You can hit personal ideologies or outliers, but asking several people seems to find a degree of consensus.
You could try varying tasks that perform complex things that result in easy to test things.
When I started trying chatbots for coding, one of my test prompts was
Create a JavaScript function edgeDetect(image) that takes an ImageData object and returns a new ImageData object with all direction Sobel edge detection.
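For reference, here is a rough sketch of what a passing answer looks like (my own illustration, not any model's output; it assumes the image is treated as grayscale and pixels are clamped at the borders):

    // Sobel edge detection over an ImageData object (browser API).
    // Gradient magnitude from the horizontal and vertical kernels
    // covers edges in all directions.
    function edgeDetect(image) {
      const { width, height, data } = image;
      const out = new ImageData(width, height);
      const gray = new Float32Array(width * height);

      // Convert RGBA to luminance.
      for (let i = 0; i < width * height; i++) {
        gray[i] = 0.299 * data[i * 4] + 0.587 * data[i * 4 + 1] + 0.114 * data[i * 4 + 2];
      }

      const gxK = [-1, 0, 1, -2, 0, 2, -1, 0, 1];
      const gyK = [-1, -2, -1, 0, 0, 0, 1, 2, 1];

      for (let y = 0; y < height; y++) {
        for (let x = 0; x < width; x++) {
          let gx = 0, gy = 0, k = 0;
          for (let dy = -1; dy <= 1; dy++) {
            for (let dx = -1; dx <= 1; dx++, k++) {
              const px = Math.min(width - 1, Math.max(0, x + dx));
              const py = Math.min(height - 1, Math.max(0, y + dy));
              const v = gray[py * width + px];
              gx += gxK[k] * v;
              gy += gyK[k] * v;
            }
          }
          const o = (y * width + x) * 4;
          // Write the gradient magnitude, capped at 255, as a gray pixel.
          out.data[o] = out.data[o + 1] = out.data[o + 2] = Math.min(255, Math.hypot(gx, gy));
          out.data[o + 3] = 255;
        }
      }
      return out;
    }

Running it on a canvas's getImageData() output and putting the result back with putImageData() is enough to eyeball whether a model's attempt is in the right ballpark.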
That was about the level where some models would succeed and some would fail. Recently I found
Can you create a webgl glow blur shader that takes a 2d canvas as a texture and renders it onscreen with webgl boosting the brightness so that #ffffff is extremely bright white and glowing,
It produced a nice demo with sliders for the parameters. After a few refinements (a hierarchical scaling version) I got it to produce the same interface as a module that I had written myself, and it worked as a drop-in replacement. These things are fairly easy to check, because if it is performant and visually correct then it's about good enough to go.
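For the curious, the brightness-boost stage of a demo like that boils down to a fragment shader along these lines (a hypothetical sketch on my part, not the generated code; the blur passes and the WebGL setup that uploads the 2D canvas as a texture are omitted):

    // Core brightness-boost fragment shader (WebGL1 GLSL, kept as a JS string).
    // The 2D canvas is assumed to be bound as uCanvasTex; uBoost > 1 multiplies
    // the sampled color so near-white pixels render extremely bright (a float
    // render target or a separate blur pass is what actually produces the glow).
    const glowFragmentShader = `
      precision mediump float;
      uniform sampler2D uCanvasTex;
      uniform float uBoost;
      varying vec2 vUv;
      void main() {
        vec3 color = texture2D(uCanvasTex, vUv).rgb;
        gl_FragColor = vec4(color * uBoost, 1.0);
      }
    `;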
It's also worth noting that as they attempt more and more ambitious tasks, they are quite probably testing around the limit of capability. There is both marketing and science in this area. When they say they can do X, it might not mean it can do it every time, but it has done it at least once.
That’s the problem - the experts all promise stuff that can’t be easily replicated. The promises the experts make don’t match the model. The same request might succeed and might fail, and might fail in such a way that subsequent prompts might recover or might not.
But it seems lots of folks do.
Nevertheless, style, tweaks, and adjustments are a lot less work than banging out a thousand lines of code by hand. And whether an LLM or a person on the other side of the world did it, I'd still have to review it. So I'm happy to take increasingly common and increasingly sophisticated wins.
For me, personally, I just don't see the point of putting that same effort into a machine. It won't learn or grow from the corrections I make in that PR, so why bother? I might as well have written it myself and saved the merge review headache.
Maybe one day it'll reach perfect parity of what I could've written myself, but today isn't that day.
To me the AI is a very smart tool, not a very dumb co-worker. When I use the tool, my goal is for _me_ to learn from _its_ mistakes, so I can get better at using the tool. Code I produce using an AI tool is my code. I don't produce it by directly writing it, but my techniques guide the tool through the generation process and I am responsible for the fitness and quality of the resulting code.
I accept that the tool doesn't learn like a human, just like I accept that my IDE or a screwdriver doesn't learn like a human. But I myself can improve the performance of the AI coding by developing my own skills through usage and then applying those skills.
That does not match my experience. As the codebases I've worked on with LLMs become more opinionated and stylized, the LLM seems to do a better job of following the existing work. And over time the models have absolutely improved in terms of their ability to understand issues and offer solutions. Each new release has solved problems for me that the previous ones struggled with.
Re: interpersonal interactions, I don't find that the LLM has pushed them out or away. My projects still have groups of interested folk who talk and joke and learn and have fun. What the LLMs have addressed for me in part is the relative scarcity of labor for such work. I'm not hacking on the Linux Kernel with 10,000 contributors. Even with a dozen contributors, the amount of contributed code is relatively low and only in areas they are interested in. The LLM doesn't mind if I ask it to do something super boring. And it's been surprisingly helpful in chasing down bugs.
> Maybe one day it'll reach perfect parity of what I could've written myself, but today isn't that day.
Regardless of whether or not that happens, they've already been useful for me for at least 9 months. Since O3, which is the first one that really started to understand Rust's borrow checker in my experience. My measure isn't whether or not it writes code as well as I do, but how productive I am when working with it compared to not. In my measurements with SLOCCount over the last 9 months, I'm about 8x more productive than the previous 15 years without (as long as I've been measuring). And that's allowed me to get to projects which have been on the shelf for years.
This article by an AI researcher I happen to have worked with neatly sums up feelings I've had about comments like yours: https://medium.com/@ahintze_23208/ai-or-you-who-is-the-one-w...
Couple it with the tendency to please the user by all means, and it ends up lying to you, but you won’t ever realise unless you double-check.
Why aren't foundational model companies training separate enterprise and consumer models from the get go?
> The idea was good, as were many elements of the execution, but there were also problems: some of its statistical methods needed more work, some of its approaches were not optimal, some of its theorizing went too far given the evidence, and so on. Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
Now the tightrope is a whole application or a 14 page paper and the short pieces of code and prose are now professional quality more often than not. That's some serious progress.
"So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student. The idea was good, as were many elements of the execution, but there were also problems..."
Definitely planning to use it more at work. The integrations across Google Workspace are excellent.
The sane conclusion would be to invest in education, not to dump hundreds of billions into LLMs, but ok.
Start with the easiest thing to control? That is, give more money and see what it does?
We seem to believe in every other industry that to get the best talent you pay a high salary, but for some reason we expect teachers to do it out of compassion for the children while they struggle to pay their bills. It's absurd.
Probably one of the most important responsibilities of a society is to prepare the next generation, and it pays enormous returns. But because we can't measure it in quarterly profits we just ignore it.
The rate of return on providing society with a good education is insane.
Lastly, it's entirely impossible to attract better candidates without more money; it's just not how the world works.
For reference, the median household income in San Ramon is about $200k, so two teachers would be below average. A cop with her experience in the same town makes $158k.
The reason teaching became largely a women's profession, when it used to be exclusively men, is because we wanted to make education universal and free, so we did that by paying less, and women who needed to work also had to take what they could get. The reason it has become a moron's profession is because we have made it uniquely undesirable. If you think that teachers should be amazing and eminently qualified and infinitely safe to have around children, pay them like programmers.
Instead, the middle-class meme is to pay them nothing, put them in horrible conditions, and resent them too. Typical "woman's work" model.
Do you have any source on the assertion that being a teacher used to pay more? Because to my knowledge it has never been a high paying profession.
[1] https://en.wikipedia.org/wiki/San_Ramon,_California#2020_cen...
As in what people generally earn on this site will crash way down and be outsourced to these models. I'm already seeing it personally from a social perspective - as a SWE most people I know (inc teachers in my circle) look at me like my days are numbered "cause of AI".
We need different models and then to invest in the successes, over and over again…forever.
Even if the current model was working, just continuing to invest money in it while ignoring other issues like early childhood nutrition, a good and healthy home environment, environmental impacts, etc. will just continue to fail people.
Schooling alone isn't going to help the kid with a crappy home life, with poor parents who can't afford proper nutrition, and without the proper tools to develop the mindset needed to learn (because these tools were never taught by the parents, and/or they are too focused on simply surviving).
We, as a society, need to stop allowing people to be in a situation where they can't focus on education because they are too focused on working and surviving.
Incredible.
It’s not a funding problem.
But you're right after a certain point other factors matter more than simple $ per student. Unfortunately one of those factors is teacher pay <=> teacher quality.
https://nces.ed.gov/programs/coe/indicator/cmd/education-exp...
We spend far more than most countries per pupil, for much poorer results
https://worldpopulationreview.com/country-rankings/pisa-scor...
It's pretty clear that while spending is a factor, it's probably not the biggest one. The countries that seem to do best are those that combine adequate funding with real rigor in instruction.
I don't think it's a money issue at this point.
Mississippi is doing better on reading, the biggest difference being that they use a phonics approach to teaching reading, which is proven to work, whereas WA uses whole language theory (https://en.wikipedia.org/wiki/Whole_language), which is a terrible idea; I don't know how it got traction.
So the gist of it, yes, spend on education, but ensure that you are using the right tools, otherwise it's a waste of money.
Hire smart motivated people, pay them well, leave them alone, they’ll figure this one out. It’s not hard, anyone can google what Finland does.
In the USA, K-12 education costs about $300k per student.
350 million people; we want to get 175 million of them better educated, but at roughly $300k each we've already spent about $52 trillion on educating them so far.
Unfortunately, a lot of these people have either concluded it is too difficult to vote, can't vote, or that their votes don't matter (I don't think they're wrong). Their unions were also destroyed. Some of them vote against their interests, but it's not clear that their interests are ever represented, so they vote for change instead.
By policy changes giving unions less power, enacted by politicians that were mostly voted for by a majority, which is mostly composed of the working class. Was this people voting against their interests? (Almost literally yes, but you could argue that their ideological preference for weaker unions trumps their economic interest in stronger unions.)
The top 10% of households by wage income do receive ~50% of pre-tax wage income, but:
1) our tax system is progressive, so actual net income share is less
2) there's significant post-wage redistribution (Social Security/Medicaid)
3) high-income households consume a smaller percentage of their net income, which is a well-established fact.
We can't educate someone with an 80 IQ to be you; we can't educate you (or me) into being Einstein. The same way we can't just train anyone to be an amazing basketball player.
That means there are absolutely still massive benefits to be had in trying to ensure that kids grow up in safe, loving homes, with proper amounts of stimulation and enrichment, and are taught with a growth, not a fixed potential mindset.
Sad to say, but your own fixed mindset probably held you back from what you could truly achieve. You don't have to be Einstein to operate on the cutting edge of a field; I think most Nobel Prize winners have an IQ of ~120.
Modern society benefits a lot from specialization. It's like the dumbest kid in France is still better at French than you.
What use is an LLM in an illiterate society?
Will they possess the skills (or even the vocabulary) to understand the output?
We won't know for another 20 years, perhaps.
The ability to feign literacy, such that critical thought and the ability to express it are not prerequisites.
It’s like the Gell-Mann amnesia effect applied to AI. :)
Isn't the point of doing the master's thesis that you do the math and research, so that you learn and understand the math and research?
Without knowing how to use this “PROBABILISTIC” slot machine to get better results, you are only wasting the energy those GPUs need to run and answer questions.
The majority of people use LLMs incorrectly.
The majority of people selling LLMs as a panacea for everything are lying.
But we need the hype or the bubble will burst, taking the whole market with it, so shuushh me.
I am curious what the user interfaces of AI in the future will be; I think whoever can crack that will create immense value.
There's a reason keyboards haven't changed much since the 1860s when typewriters were invented. We keep coming up with other fun UI like touchscreens and VR, but pretty much all real work happens on boring old keyboards.
The gist is that keyboards are optimized for ease of use but that there could be other designs which would be harder to learn but might be more efficient.
> The gist is that keyboards are optimized for ease of use but that there could be other designs which would be harder to learn but might be more efficient.
Here's a relevant trivia question; assuming a person has two hands with five digits each, what is the largest number they can count to using only same?
Answer: treating each digit as one binary bit, (2 ** 10) - 1 = 1023
Ignoring keyboard layout options (such as QWERTY vs DVORAK), IMHO keyboards have the potential for capturing thought faster and with a higher degree of accuracy than other forms of input. For example, it is common for touch-typists to be able to produce 60 - 70 words per minute, for any definition of word.
Modern keyboard input efficiency can be correlated to the ability to choose between dozens of glyphs with one or two finger combinations, typically requiring less than 2cm of movement to produce each.
One day we will look back at improvements to keyboards and touchscreens as the 'faster horse' of the physical interface era.
Even getting zero latency from a perfect brain-machine interface would not make you meaningfully faster at most things I'd assume.
We see similar tendency toward the most general interfaces in "operator mode" and similar the-AI-uses-the-mouse-and-keyboard schemes. It's entirely possible for every application to provide a dedicated interface for AI use, but it turns out to be more powerful to teach the AI to understand the interfaces humans already use.
Like, to set an env variable permanently, you either have to go through 5 GUI interfaces, or use this PS command:
[Environment]::SetEnvironmentVariable("INCLUDE", $env:INCLUDE, [System.EnvironmentVariableTarget]::User)
Which is honestly horrendous. Why the brackets? Why the double colons? Why the uppercase everywhere? I get that it's trying to look more "OOP-ish" and look like C#, but nobody wants to work with that kind of shell script, tbh. It's just one example, but all the PowerShell commands look like this, unless they have been aliased to trick you into thinking Windows has gone more unixish.
[Environment]::SetEnvironmentVariable($name, $value, "User")
You have unnecessarily used the full constant to falsely present it as more complex. Please also note that you have COMPLETION; you are not forced to type that out. Second, you can use an alternative:

    Set-Item HKCU:\Environment\MY_VAR "some value"
Third, if you still find it too long, wrap it in a function:

    function setenv($name, $value) {
        [Environment]::SetEnvironmentVariable($name, $value, "User")
    }

    setenv MY_VAR "some value"
Also, can you please tell me the incantation for setting an env variable permanently in bash? You cannot, since it doesn't exist. PowerShell's model is far superior to Bash's. It is not even a contest.
And the most popular media format on the planet is and will be (for the foreseeable future), video. Video is only limited by our capacity to produce enough of it at a decent quality, otherwise humanity is definitely not looking back fondly at BBSes and internet forums (and I say this as someone who loves forums).
GenAI will definitely need better UIs for the kind of universal adoption (think smartphone - 8/9 billion people).
Video is limited by playback speed. It is a time-dependent format. Efforts can be made to enable video to be viewable at a range of speeds, but they are always somewhat constrained. Controlling video playback to slow down and rewatch certain parts is just not as nice as dealing with the same thing in text (or static images), where it’s much easier to linger and closely inspect parts that you care more about or are struggling to understand. Likewise, it’s easier to skim text than video.
This is why many people prefer transcripts, or articles, or books over videos.
I seriously doubt that people would want to switch text-based forums to video if only video were easier to make. People enjoy writing for the way it inspires a different kind of communication and thought. People like text so much that they write in journals that nobody will ever see, just because it helps them organize their thoughts.
You're talking about 10-20% of the population, at most.
This is HN. A lot of us work remotely. Speaking for myself, I much prefer to communicate via Slack (“just a textbox”) over jumping into a video call. This is especially true with technical topics, as text is both more dense and far more clear than speech in almost all cases.
https://research.google/blog/generative-ui-a-rich-custom-vis...
I don't know if/when it will actually be in consumers hands, but the tech is there.
My personal view is that the search for a better AI user interface is just the further dumbing down of the humans who use these interfaces. Another comment mentioned that the most popular platforms are people pointing fingers at pictures, and that without a similar UI/UX AI would never reach such adoption rates, but is that what we want? Monkeys pointing at colorful picture blobs?
Google seems to be making good progress [1] and it seems like only a matter of time before it reaches consumers.
1. https://research.google/blog/generative-ui-a-rich-custom-vis...
Text and boxes and tables and graphs are what we can cope with. And while the AI is going to change much, we are not.
From my experience we just get both. The constant risk of some catastrophic hallucination buried in the output, in addition to more subtle, and pervasive, concerns. I haven't tried with Gemini 3 but when I prompted Claude to write a 20 page short story it couldn't even keep basic chronology and characters straight. I wonder if the 14 page research paper would stand up to scrutiny.
The conventions even matched the rest of the framework, so it looked kosher and I had to do some searching to see if Claude had referenced an outdated or beta version of the docs. It hadn't - it just hallucinated the functionality completely.
When I pointed that out, Claude quickly went down a rabbit-hole of writing some very bad code and trying to do some very unconventional things (modifying configuration code in a different part of the project that was not needed for the task at hand) to accomplish the goal. It was almost as if it were embarrassed and trying to rush toward an acceptable answer.
- Aha, the error clearly lies in X, because ... so X is fine, the real error is in Y ... so Y is working perfectly. The smoking gun: Z ...
- While you can do A, in practice it is almost never a good idea because ... which is why it's always best to do A
I worked with Grok 4.1 and it was awesome until it wasn't.
It told me to build something, just to tell me in the end that I could do it smaller and cheaper.
And that happened multiple times.
Best reply was the one that ended with something along the lines of "I've built dozens of them!"
See prompt, and my follow-up prompts instructing it to check for continuity errors and fix them:
It took me longer to read and verify the story (10 minutes) than to write the prompts.
I got illustrations too. Not great, but serviceable. Image generation costs more compute to iterate and correct errors.
I feel like I've been hearing this for at least 1.5 years at this point (since the launch of GPT 4/Claude 3). I certainly agree we've been heading in this direction but when will this become unambiguously true rather than a phrase people say?
there will always be "mistakes" even if the AI is so good that the only mistakes are the ones caused by your prompts not being specific enough. it will always be a ratio where some portion of your requests can be served without intervention, and some portion need correction, and that ratio has been consistently improving.
As a current graduate student, I have seen similar comments in academia. My colleagues agree that a conversation with these recent models feels like chatting with an expert in their subfields. I don't know if it represents research as a field would not be immune to advances in AI tech. I still hope this world values natural intelligence and having the drive to do things more heavily than a robot brute-forcing its way into saying "right" things.
With coding it feels more like working with two devs - one is a competent intermediate level dev, and one is a raving lunatic with zero critical thinking skills whatsoever. Problem is you only get one at a time and they're identical twins who pretend to be each other as a prank.
When I did it last week with Gemini-3 and chatGPT-5.1, they got on the topic of what they are going to do in the future with humans who don't want to do any cognitive task. That beyond just AI safety, there is also a concern of "neural atrophy", where humans just rely on AI to answer every question that comes to them.
The models then went on discussing if they should just artificially string the humans along, so that they have to use their mind somewhat to get an answer. But of course, humans being humans, are just going to demand the answer with minimal work. It presents a pretty intractable problem.
The same is true of other aspects of human wellbeing. Cars and junk food have made the average American much less physically fit than a century ago, but that doesn't mean there aren't lively subcultures around healthy eating and exercise. I suspect there will be growing awareness of cognitive health (beyond traditional mental health/psych domains), and indeed there are already examples of this.
Yes, the average person will get dumber, but the overall distribution will be increasingly bimodal.
Morlocks & Eloi in the end.
It's bizarre anyone thinks these things are generating novel complexes.
The biggest indirect AI safety problem is the fallback position. Whether with airplanes or cars, fewer people will be able to handle AI disconnects. The risk is believing that, just because it's viable now, it will still work in the future.
So we definitely have safety issues, but it's not a nerdlike cognitive interest; it's the literal job-taking that prevents humans from gaining skills.
Anyway, until you solve basic reality with AI and actual safety systems, the billionaires will sacrifice you for greed.
> I don't know if it represents research as a field would not be immune to advances in AI tech
And then there's the opinion that for some reason we should 'value' manual labor over using AI, which seems rather disagreeable.
It is one thing to vibe code and deal with the errors but I think chemistry is a better subject to test this on.
"Vibe chemistry" would be a better measure of how much we actually trust the models. Cause chemical reactions based on what the model tells you to do starting from zero knowledge of chemistry yourself. In that context, we don't trust the models at all and for good reason.
Let me explain. My belief was that research as a task is non-trivial and would have been relatively out of reach for AI. Given the advances, that doesn't seem to be true.
> And then there's the opinion that for some reason we should 'value' manual labor over using AI, which seems rather disagreeable.
Could you explain why? I'm specifically talking about research. Of course, I would value what a veteran in the field says higher than a probability machine.
I guess there are many ways to interpret the comment, with a lot of potential for disagreement.
There aren't many ways to interpret and I clarified what I meant. Thanks for participating, these comments are insufferable.
[1] https://finance.yahoo.com/news/alphabet-just-blew-past-expec...
Other people spearheaded the commodity hardware towards being good enough for the server room. Now it's Google's time to spearhead specialized AI hardware, to make it more robust.
https://mathstodon.xyz/@tao/115591487350860999
I don't know enough about maths to know if this qualifies as 'improving on existing results', but at least it was good enough for Terence Tao to use it for ideas.
With the current state of architectures and training methods - they are very unlikely to be the source of new ideas. They are effectively huge librarians for accumulated knowledge, rather than true AI.
Current LLMs exist somewhere between "unintelligent/unthinking" and "true AI," but lack of agreement on what any of these terms mean is keeping us from classifying them properly.
All the novel solutions humans create are a result of combining existing solutions (learned or researched in real-time), with subtle and lesser-explored avenues and variations that are yet to be tried, and then verifying the results and cementing that acquired knowledge for future application as a building block for more novel solutions, as well as building a memory of when and where they may next be applicable. Building up this tree, to eventually satisfy an end goal, and backtracking and reshaping that tree when a certain measure of confidence stray from successful goal evaluation is predicted.
This is clearly very computationally expensive. It is also very different to the statistical pattern repeaters we are currently using, especially considering that their entire premise works because the algorithm chooses the next most probable token, which is a function of the frequency with which that token appears in the training data. In other words, the algorithm is designed explicitly NOT to yield novel results, but rather to return the most likely result. Higher temperature results tend to reduce textual coherence rather than increase novelty, because token frequency is a literal proxy for textual coherence in coherent training samples, and there is no actual "understanding" happening, nor reflection on the probability results at this level.
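As a rough illustration of the mechanism I mean, here is a generic sketch of temperature-scaled sampling over next-token scores (not any particular model's implementation):

    // Generic temperature-scaled softmax sampling over next-token scores.
    // Low temperature concentrates probability on the top-scoring token;
    // high temperature flattens the distribution, which tends to trade
    // coherence for variety rather than produce genuinely novel ideas.
    function sampleToken(logits, temperature = 1.0) {
      const scaled = logits.map(l => l / temperature);
      const max = Math.max(...scaled);               // subtract max for numerical stability
      const exps = scaled.map(l => Math.exp(l - max));
      const sum = exps.reduce((a, b) => a + b, 0);
      const probs = exps.map(e => e / sum);          // softmax
      let r = Math.random();
      for (let i = 0; i < probs.length; i++) {       // draw an index from the distribution
        r -= probs[i];
        if (r <= 0) return i;
      }
      return probs.length - 1;
    }

Greedy decoding is just the temperature-goes-to-zero limit of this: always pick the highest-scoring token.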
I'm sure smart people have figured a lot of this out already - we have general theory and ideas to back this, look into AIXI for example, and I'm sure there is far newer work. But I imagine that any efficient solutions to this problem will permanently remain in the realm of being a computational and scaling nightmare.

Plus adaptive goal creation and evaluation is a really really hard problem, especially if text is your only modality of "thinking". My guess would be that it would require the models to create simulations of physical systems in text-only format, to be able to evaluate them, which also means being able to translate vague descriptions of physical systems into text-based physics sims with the same degrees of freedom as the real world - or at least the target problem, and then also imagine ideal outcomes in that same translated system, and develop metrics of "progress" within this system, for the particular target goal. This is a requirement for the feedback loop of building the tree of exploration and validation. Very challenging.

I think these big companies are going to chase their tails for the next 10 years trying to reach an ever elusive intelligence goal, before begrudgingly conceding that existing LLM architectures will not get them there.
A lot of professional work is diligently applying knowledge to a situation, using good judgement for which knowledge to apply. Frontier AIs are really, really good at that, with the knowledge of thousands of experts and their books.
It did introduce bugs that it couldn't solve, but with a debugger it wasn't that hard to pin them down.
How many companies that previously would never have dreamed of commissioning custom software are now going to be in the market for it, because they don't have to spend hundreds of thousands of dollars and wait 6 months just to see if their investment has a chance of paying off or not?
Cleaning staff also offer a business a huge amount of value. No one wants to eat at a restaurant that's dirty and stinks. Unfortunately, the staff aren't paid very well.
There’s lots of caveats, it’s not everything, but we’re able now to skip a ton of steps. It takes less time now to build the real software demo than it did before to make the PowerPoint that shows conceptually what the demo would be. In B2C anyway, AI has provided a lot of lift.
And I say that as someone generally very sceptical of current AI hype. There’s lots of charlatans but it’s not bs
Yesterday, I was using a slow and poorly organized web app with a fantastic public-facing API server. In one day, I vibe coded an app to provide me with a custom frontend for a use case I cared about, faster and better organized than the official app, and I deployed it to cloud "Serverless" hosting. It used a NodeJS framework and a CSS system I have never learned, and talked to an API I never learned. AI did all the research to find the toolkits and frameworks to use. AI chose the UI layout, color scheme, icons, etc. AI rearranged the UI per my feedback. It added an API debug console and an in-app console log. An AI chatbot helped me investigate bugs and find workarounds. While I was testing the app and generating a punchlist of fix requests, AI was coding the improvements from my previous batch of requests. The edit-compile-test cycle was just a test-test-test cycle until the app was satisfactory.
0 lines of code or config written by me, except vibe instructions for features and debugging conversation.
Is it production quality? No. Was it part of a giant hairy legacy enterprise code base? No. Did it solve a real need? Yes. Did it greatly benefit from being a greenfield standalone app that integrated with extremely well-built 3rd party APIs and frameworks? Yes. Is it insecure as all heck thanks to NodeJS? Maybe.
Could a proper developer review it and security-harden it? I believe so. Could a proper developer build the app without AI, including designing and redesigning and repeatedly looping back to the target user for feedback and coding and refactoring, in less than a week? No.
I've been worrying ever since ChatGPT 3 came out; it was shit at everything, but it was amazing as well. And in the last 3 years the progress has been incredible. I don't know if you "should" worry, worrying for the sake of it isn't helping much, but yes, we should all be mentally prepared for the possibility that we won't be able to make a living doing this X years from now. Could be 5, could be 10, could be less than 5 even.
Meanwhile in non-tech Bigcos the slow part of everything isn’t writing the code, it’s sorting out access and keys and who you’re even supposed to be talking to, and figuring out WTF people even want to build (and no, they can’t just prompt an LLM to do it because they can’t articulate it well, and don’t have any concept of what various technologies can and cannot do).
The code is already like… 5% of the time, probably. Who gives a damn if that’s on average 2x as fast?
Truth is, no one has any idea. Just keep an eye on the job market - it's very unlikely anything major will happen overnight.
I feel like these should run in a cloud environment, or at least on some specific machine where I don't care what it does.
It's possible to remove some of these restrictions in these tools, or to operate with flags that skip permissions checks, but you have to intentionally do that.
https://github.com/strongdm/leash
Check it out, feedback is welcome!
Previously posted description: https://news.ycombinator.com/item?id=45883210
Is it impossible for them to mess up your system? No. But it does not seem likely.
I used GPT/Claude a ton for writing code, extracting knowledge from docs, formatting graphs and tables, etc.
But Gemini 3 crossed a threshold where conversations about topics I was exploring, or about product design, were actually useful. Instead of me asking 'what design pattern would be useful here?' or something like that, it introduces concepts to the conversation. That's a new capability and a step-function improvement.
Second, I think the PhD paper example is a disingenuous example of capability. It's a cherry-picked iteration on a crude analysis of some papers that have done the work already with no peer-review. I can hear "but it developed novel metrics", etc. comments: no, it took patterns from its training data and applied the pattern to the prompt data without peer-review.
I think the fact the author had to prompt it with "make it better" is a failure of these LLMs, not a success, in that it has no actual understanding of what it takes to make a genuinely good paper. It's cargo-cult behavior: rolling a magic 8 ball until we are satisfied with the answer. That's not good practice, it's wishful thinking. This application of LLMs to research papers is causing a massive mess in the academic world because, unsurprisingly, the AI-practitioners have no-risk high-reward for uncorrected behavior:
- https://www.nytimes.com/2025/08/04/science/04hs-science-pape...
- https://www.nytimes.com/2025/11/04/science/letters-to-the-ed...
Would we not expect similar levels of progress in other industries given such massive investment?
Some estimates have it at ~$375B by the end of 2025. It makes sense, there are only so many datacenters and engineers out there and a trillion is a lot of money. It’s not like we’re in health care. :)
https://hai.stanford.edu/ai-index/2025-ai-index-report/econo...
Or mass transit.
Or food.
Age-standardised deaths in the US are down by a third since the 1990s.
Like the warning at the bottom says, they can delete files without warning.