FilterHN

13 hours ago

[-]

It's cool and I'm glad it sounds like it's getting more reliable, but given the types of things people have been saying GPT-5 would be for the last two years you'd expect GPT-5 to be a world-shattering release rather than incremental and stable improvement.

It does sort of give me the vibe that the pure scaling maximalism really is dying off though. If the approach is on writing better routers, tooling, comboing specialized submodels on tasks, then it feels like there's a search for new ways to improve performance(and lower cost), suggesting the other established approaches weren't working. I could totally be wrong, but I feel like if just throwing more compute at the problem was working OpenAI probably wouldn't be spending much time on optimizing the user routing on currently existing strategies to get marginal improvements on average user interactions.

I've been pretty negative on the thesis of only needing more data/compute to achieve AGI with current techniques though, so perhaps I'm overly biased against it. If there's one thing that bothers me in general about the situation though, it's that it feels like we really have no clue what the actual status of these models is because of how closed off all the industry labs have become + the feeling of not being able to expect anything other than marketing language from the presentations. I suppose that's inevitable with the massive investments though. Maybe they've got some massive earthshattering model release coming out next, who knows.

maoberlehner

4 hours ago

[-]

I mostly use Gemini 2.5 Pro. I have a “you are my editor” prompt asking it to proofread my texts. Recently it pointed out two typos in two different words that just weren’t there. Indeed, the two words each had a typo but not the one pointed out by Gemini.

The real typos were random missing letters. But the typos Gemini hallucinated were ones that are very common typos made in those words.

The only thing transformer based LLMs can ever do is _faking_ intelligence.

Which for many tasks is good enough. Even in my example above, the corrected text was flawless.

But for a whole category of tasks, LLMs without oversight will never be good enough because there simply is no real intelligence in them.

butler14

2 hours ago

[-]

I had this too last week. It pointed out two errors that simply weren’t there. Then completely refused to back down and doubled down on its own certainty, until I sent it a screenshot of the original prompt. Kind of funny.

thorum

13 hours ago

[-]

The quiet revolution is happening in tool use and multimodal capabilities. Moderate incremental improvements on general intelligence, but dramatic improvements on multi-step tool use and ability to interact with the world (vs 1 year ago), will eventually feed back into general intelligence.

thomasfromcdnjs

5 hours ago

[-]

100%

1) Build a directory of X (a gazillion) amount of tools (just functions) that models can invoke with standard pipeline behavior (parallel, recursion, conditions etc)

2) Solve the "too many tools to select from" problem (a search problem), adjacently really understand the intent (linguistics/ToM) of the user or agents request

3) Someone to pay for everything

4) ???

The future is already here in my opinion, the LLM's are good-enough™, it's just the ecosystem needs to catch up. Companies like Zapier or whatever, taken to their logical extreme, connecting any software to any thing (not just sass products), combined with an LLM will be able to do almost anything.

Even better basic tool composition around language will make it's simple replies better too.

darkhorse222

12 hours ago

[-]

Completely agree. General intelligence is a building block. By chaining things together you can achieve meta programming. The trick isn't to create one perfect block but to build a variety of blocks and make one of those blocks a block-builder.

SecretDreams

6 hours ago

[-]

> The trick isn't to create one perfect block but to build a variety of blocks and make one of those blocks a block-builder.

This has some Egyptian pyramids building vibes. I hope we treat these AGIs better than the deal the pyramid slaves got.

z0r

5 hours ago

[-]

We don't have AGI and the pyramids weren't built by slaves.

coolKid721

12 hours ago

[-]

[flagged]

dang

12 hours ago

[-]

Can you please make your substantive points thoughtfully? Thoughtful criticism is welcome but snarky putdowns and onliners, etc., degrade the discussion for everyone.

You've posted substantive comments in other threads, so this should be easy to fix.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.

11 hours ago

[-]

  > It does sort of give me the vibe that the pure scaling maximalism really is dying off though

I think the big question is if/when investors will start giving money to those who have been predicting this (with evidence) and trying other avenues.

Really though, why put all your eggs in one basket? That's what I've been confused about for awhile. Why fund yet another LLMs to AGI startup. Space is saturated with big players and has been for years. Even if LLMs could get there that doesn't mean something else won't get there faster and for less. It also seems you'd want a backup in order to avoid popping the bubble. Technology S-Curves and all that still apply to AI

Though I'm similarly biased, but so is everyone I know with a strong math and/or science background (I even mentioned it in my thesis more than a few times lol). Scaling is all you need just doesn't check out

l33tman

8 hours ago

[-]

I started such an alternative project just before GPT-3 was released, it was really promising (lots of neuroscience inspired solutions, pretty different to Transformers) but I had to put it on hold because the investors I approached seemed like they would only invest in LLM-stuff. Now a few years later I'm trying to approach investors again, only to find now they want to invest in companies USING LLMs to create value and still don't seem interested in new foundational types of models... :/

I guess it makes sense, there is still tons of value to be created just by using the current LLMs for stuff, though maybe the low hanging fruits are already picked, who knows.

I heard John Carmack talk a lot about his alternative (also neuroscience-inspired) ideas and it sounded just like my project, the main difference being that he's able to self-fund :) I guess funding an "outsider" non-LLM AI project now requires finding someone like Carmack to get on board - I still don't think traditional investors are that disappointed yet that they want to risk money on other types of projects..

7 hours ago

[-]

  > I guess funding an "outsider" non-LLM AI project now requires finding someone like Carmack to get on board

And I think this is a big problem. Especially since these investments tend to be a lot cheaper than the existing ones. Hell, there's stuff in my PhD I tabled and several models I made that I'm confident I could have doubled performance with less than a million dollars worth of compute. My methods could already compete while requiring less compute, so why not give them a chance to scale? I've seen this happen to hundreds of methods. If "scale is all you need" then shouldn't the belief that any of those methods would also scale?

10 hours ago

[-]

I'm pretty curious about the same thing.

I think a somewhat comparable situation is in various online game platforms now that I think about it. Investors would love to make a game like Fortnite, and get the profits that Fortnite makes. So a ton of companies try to make Fortnite. Almost all fail, and make no return whatsoever, just lose a ton of money and toss the game in the bin, shut down the servers.

On the other hand, it may have been more logical for many of them to go for a less ambitious (not always online, not a game that requires a high player count and social buy-in to stay relevant) but still profitable investment (Maybe a smaller scale single player game that doesn't offer recurring revenue), yet we still see a very crowded space for trying to emulate the same business model as something like Fortnite. Another more historical example was the constant question of whether a given MMO would be the next "WoW-killer" all through the 2000's/2010's.

I think part of why this arises is that there's definitely a bit of a psychological hack for humans in particular where if there's a low-probability but extremely high reward outcome, we're deeply entranced by it, and investors are the same. Even if the chances are smaller in their minds than they were before, if they can just follow the same path that seems to be working to some extent and then get lucky, they're completely set. They're not really thinking about any broader bubble that could exist, that's on the level of the society, they're thinking about the individual, who could be very very rich, famous, and powerful if their investment works. And in the mind of someone debating what path to go down, I imagine a more nebulous answer of "we probably need to come up with some fundamentally different tools for learning and research a lot of different approaches to do so" is a bit less satisfying and exciting than a pitch that says "If you just give me enough money, the curve will eventually hit the point where you get to be king of the universe and we go colonize the solar system and carve your face into the moon."

I also have to acknowledge the possibility that they just have access to different information than I do! They might be getting shown much better demos than I do, I suppose.

8 hours ago

[-]

I'm pretty sure the answer is people buying into the scaling is all you need argument. Because if you have that framing then it can be solved through engineering, right? I mean there's still engineering research and it doesn't mean there's no reason to research but everyone loves the simple and straight forward path, right?

  > I think a somewhat comparable situation is in various online game platforms

I think it is common in many industries. The weird thing is that being too risk adverse creates more risk. There's a balance that needs to be struck. Maybe another famous one is movies. They go on about pirating and how Netflix is winning but most of the new movies are rehashes or sequels. Sure, there's a lot of new movies, but few get nearly the same advertising budgets and so people don't even hear about it (and sequels need less advertising since there's a lot of free advertising). You'd think there'd be more pressure to find the next hit that can lead to a few sequels but instead they tend to be too risk adverse. That's the issue of monopolies though... or any industry where the barrier to entry is high...

  > psychological hack

While I'm pretty sure this plays a role (along with other things like blind hope) I think the bigger contributor is risk aversion and observation bias. Like you say, it's always easier to argue "look, it worked for them" then "this hasn't been done before, but could be huge." A big part of the bias is that you get to oversimplify the reasoning for the former argument compared to the latter. The latter you'll get highly scrutinized while the former will overlook many of the conditions that led to success. You're right that the big picture is missing. Especially that a big part of the success was through the novelty (not exactly saying Fortnite is novel via gameplay...). For some reason the success of novelty is almost never seen as motivation to try new things.

I think that's the part that I find most interesting and confusing. It's like an aversion of wanting to look just one layer deeper. We'll put in far more physical and mental energy to justify a shallow thought than what would be required to think deeper. I get we're biased towards being lazy, so I think this is kinda related to us just being bad at foresight and feeling like being wrong is a bad thing (well it isn't good, but I'm pretty sure being wrong and not correcting is worse than just being wrong).

jjmarr

8 hours ago

[-]

>I think part of why this arises is that there's definitely a bit of a psychological hack for humans in particular where if there's a low-probability but extremely high reward outcome, we're deeply entranced by it, and investors are the same.

Venture capital is all about low-probability high-reward events.

Get a normal small business loan if you don't want to go big or go home.

7 hours ago

[-]

So you agree with us? Should we instead be making the argument that this is an illogical move? Because IME the issue has been that it appears as too risky. I'd like to know if I should just lean into that rather than try to argue it is not as risky as it appears (yet still has high reward, albeit still risky).

9 hours ago

[-]

We see both things: almost all games are 'not fortnite'. But that doesn't (commercially) invalidate some companies' quest for building the next fortnite.

Of course, if you limit your attention to these 'wanabe fortnites', then you only see these 'wannabe fortnites'.

9 hours ago

[-]

>Really though, why put all your eggs in one basket? That's what I've been confused about for awhile.

I mean that's easy lol. People don't like to invest in thin air, which is what you get when you look at non-LLM alternatives to General Intelligence.

This isn't meant as a jab or snide remark or anything like that. There's literally nothing else that will get you GPT-2 level performance, never-mind an IMO Gold Medalist. Invest in what else exactly? People are putting their eggs in one basket because it's the only basket that exists.

>I think the big question is if/when investors will start giving money to those who have been predicting this (with evidence) and trying other avenues.

Because those people have still not been proven right. Does "It's an incremental improvement over the model we released a few months ago, and blows away the model we released 2 years ago." really scream, "See!, those people were wrong all along!" to you ?

8 hours ago

[-]

  > which is what you get when you look at non-LLM alternatives to General Intelligence.

I disagree with this. There are a good ideas that are worth pursuit. I'll give you that few, if any, have been shown to work at scale but I'd say that's a self-fulfilling prophecy. If your bar is that they have to be proven at scale then your bar is that to get investment you'd have to have enough money to not need investment. How do you compete if you're never given the opportunity to compete? You could be the greatest quarterback in the world but if no one will let you play in the NFL then how can you prove that?

On the other hand, investing in these alternatives is a lot cheaper, since you can work your way to scale and see what fails along the way. This is more like letting people try their stuff out in lower leagues. The problem is there's no ladder to climb after a certain point. If you can't fly then how do you get higher?

  > Invest in what else exactly? ... it's the only basket that exists.

I assume you don't work in ML research? I mean that's okay but I'd suspect that this claim would come from someone not on the inside. Though tbf, there's a lot of ML research that is higher level and not working on alternative architectures. I guess the two most well known are Mamba and Flows. I think those would be known by the general HN crowd. While I think neither will get us to AGI I think both have advantages that shouldn't be ignored. Hell, even scaling a very naive Normalizing Flow (related to Flow Matching) has been shown to compete and beat top diffusion models[0,1]. The architectures aren't super novel here but they do represent the first time a NF was trained above 200M params. That's a laughable number by today's standards. I can even tell you from experience that there's a self-fulfilling filtering for this kind of stuff because having submitted works in this domain I'm always asked to compare with models >10x my size. Even if I beat them on some datasets people will still point to the larger model as if that's a fair comparison (as if a benchmark is all the matters and doesn't need be contextualized).

  > Because those people have still not been proven right.

You're right. But here's the thing. *NO ONE HAS BEEN PROVEN RIGHT*. That condition will not exist until we get AGI.

  > scream, "See!, those people were wrong all along!" to you ?

Let me ask you this. Suppose people are saying "x is wrong, I think we should do y instead" but you don't get funding because x is currently leading. Then a few years later y is proven to be the better way of doing things, everything shifts that way. Do you think the people who said y was right get funding or do you think people who were doing x but then just switched to y after the fact get funding? We have a lot of history to tell us the most common answer...

[0] https://arxiv.org/abs/2412.06329

[1] https://arxiv.org/abs/2506.06276

7 hours ago

[-]

>I disagree with this. There are a good ideas that are worth pursuit. I'll give you that few, if any, have been shown to work at scale but I'd say that's a self-fulfilling prophecy. If your bar is that they have to be proven at scale then your bar is that to get investment you'd have to have enough money to not need investment. How do you compete if you're never given the opportunity to compete? You could be the greatest quarterback in the world but if no one will let you play in the NFL then how can you prove that? On the other hand, investing in these alternatives is a lot cheaper, since you can work your way to scale and see what fails along the way. This is more like letting people try their stuff out in lower leagues. The problem is there's no ladder to climb after a certain point. If you can't fly then how do you get higher?

I mean this is why I moved the bar down from state of the art.

I'm not saying there are no good ideas. I'm saying none of them have yet shown enough promise to be called another basket in it's own right. Open AI did it first because they really believed in scaling, but anyone (well not literally, but you get what I mean) could have trained GPT-2. You didn't need some great investment, even then. It's that level of promise I'm saying doesn't even exist yet.

>I guess the two most well known are Mamba and Flows.

I mean, Mamba is a LLM ? In my opinion, it's the same basket. I'm not saying it has to be a transformer or that you can't look for ways to improve the architecture. It's not like Open AI or Deepmind aren't pursuing such things. Some of the most promising tweaks/improvements - Byte Latent Transformer, Titans etc are from those top labs.

Flows research is intriguing but it's not another basket in the sense that it's not an alternative to the 'AGI' these people are trying to build.

> Let me ask you this. Suppose people are saying "x is wrong, I think we should do y instead" but you don't get funding because x is currently leading. Then a few years later y is proven to be the better way of doing things, everything shifts that way. Do you think the people who said y was right get funding or do you think people who were doing x but then just switched to y after the fact get funding? We have a lot of history to tell us the most common answer...

The funding will go to players positioned to take advantage. If x was leading for years then there was merit in doing it, even if a better approach came along. Think about it this way, Open AI now have 700M Weekly active users for ChatGPT and millions of API devs. If this superior y suddenly came along and materialized and they assured you there were pivoting, why wouldn't you invest in them over players starting from 0, even if they championed y in the first place? They're better positioned to give you a better return on your money. Of course, you can just invest in both.

Open AI didn't get nearly a billion weekly active users off the promise of future technology. They got it with products that exist here and now. Even if there's some wall, this is clearly a road with a lot of merit. The value they've already generated (a whole lot) won't disappear if LLMs don't reach the heights some people are hoping they will.

If you want people to invest in y instead then x has to stall or y has to show enough promise. It didn't take transformers many years to embed themselves everywhere because they showed a great deal of promise right from the beginning. It shouldn't be surprising if people aren't rushing to put money in y when neither has happened yet.

[0] https://www.nasa.gov/directorates/somd/space-communications-...

6 hours ago

[-]

  > I'm saying none of them have yet shown enough promise to be called another basket in it's own right.

Can you clarify what this threshold is?

I know that's one sentence, but I think it is the most important one in my reply. It is really what everything else comes down to. There's a lot of room between even academic scale and industry scale. There's very few things with papers in the middle.

  > I mean, Mamba is a LLM

Sure, I'll buy that. LLM doesn't mean transformer. I could have been more clear but I think it would be from context as that means literally any architecture is an LLM if it is large and models language. Which I'm fine to work with.

Though with that, I'd still disagree that LLMs will get us to AGI. I think the whole world is agreeing too as we're moving into multimodal models (sometimes called MMLMs) and so I guess let's use that terminology.

To be more precise, let's say "I think there are better architectures out there than ones dominated by Transformer Encoders". It's a lot more cumbersome but I don't want to say transformers or attention can't be used anywhere in the model or we'll end up having to play this same game. Let's just work with "an architecture that is different than what we usually see in existing LLMs". That work?

  > The funding will go to players positioned to take advantage.

I wouldn't put your argument this way. As I understand it, your argument is about timing. I agree with most of what you said tbh.

To be clear my argument isn't "don't put all your money in the 'LLM' basket, put it in this other basket" by argument is "diversify" and "diversification means investing at many levels of research." To clarify that latter part I really like the NASA TRL scale[0]. It's wrong to make a distinction between "engineering vs research" and better to see it as a continuum. I agree, most money should be put into higher levels but I'd be amiss if I didn't point out that we're living in a time where a large number of people (including these companies) are arguing that we should not be funding TRL 1-3 and if we're being honest, I'm talking about stuff in currently in TRL 3-5. I mean it is a good argument to make if you want to maintain dominance, but it is not a good argument if you want to continue progress (which I think is what leads to maintaining dominance as long as that dominance isn't through monopoly or over centralization). Yes, most of the lower level stuff fails. But luckily the lower level stuff is much cheaper to fund. A mathematician's salary and a chalk board is at least half as expensive as the salary of a software dev (and probably closer to a magnitude if we're considering the cost of hiring either of them).

But I think that returns us to the main point: what is that threshold?

My argument is simply "there should be no threshold, it should be continuous". I'm not arguing for a uniform distribution either, I explicitly said more to higher TRLs. I'm arguing that if you want to build a house you shouldn't ignore the foundation. And the fancier the house, the more you should care about the foundation. Least you risk it all falling down

5 hours ago

[-]

>Can you clarify what this threshold is? I know that's one sentence, but I think it is the most important one in my reply. It is really what everything else comes down to. There's a lot of room between even academic scale and industry scale. There's very few things with papers in the middle.

Something like GPT-2. Something that even before being actually useful or particularly coherent, was interesting enough to spark articles like these. https://slatestarcodex.com/2019/02/19/gpt-2-as-step-toward-g... So far, only LLM/LLM adjacent stuff fulfils this criteria.

To be clear, I'm not saying general R&D must meet this requirement. Not at all. But if you're arguing about diverting millions/billions in funds from x that is working to y then it has to at least clear that bar.

> My argument is simply "there should be no threshold, it should be continuous".

I don't think this is feasible for large investments. I may be wrong, but i also don't think other avenues aren't being funded. They just don't compare in scale because....well they haven't really done anything to justify such scale yet.

[1] https://github.com/karpathy/llm.c/discussions/677

2 hours ago

[-]

  > Something like GPT-2

I got 2 things to say here

1) There's plenty of things that can achieve similar performance to GPT-2 these days. We mentioned Mamba, they compared to GPT-3 in their first paper[0]. They compare with the open sourced version and you'll also see some other architectures referenced there like Hyena and H3. It's the GPT-Neo and GPT-J models. Remember GPT-3 is pretty much just a scaled up GPT-2.

2) I think you are underestimating the costs to train some of these things. I know Karpathy said you can now train GPT-2 for like $1k[1] but a single training run is a small portion of the total costs. I'll reference StyleGAN3 here just because the paper has good documentation on the very last page[2]. Check out the breakdown but there's a few things I want to specifically point out. The whole project cost 92 V100 years but the results of the paper only accounted for 5 of those. That's 53 of the 1876 training runs. Your $1k doesn't get you nearly as far as you'd think. If we simplify things and say everything in that 5 V100 years cost $1k then that means they spent $85k before that. They spent $18k before they even went ahead with that project. If you want realistic numbers, multiply that by 5 because that's roughly what a V100 will run you (discounted for scale). ~$110k ain't too bad, but that is outside the budget of most small labs (including most of academia). And remember, that's just the cost of the GPUs, that doesn't pay for any of the people running that stuff.

I don't expect you to know any of this stuff if you're not a researcher. Why would you? It's hard enough to keep up with the general AI trends, let alone niche topics lol. It's not an intelligence problem, it's a logistics problem, right? A researcher's day job is being in those weeds. You just get a lot more hours in the space. I mean I'm pretty out of touch of plenty of domains just because time constraints.

  > I don't think this is feasible for large investments. I may be wrong, but i also don't think other avenues aren't being funded.

So I'm trying to say, I think your bar has been met.

And I think if we are actually looking at the numbers, yeah, I do not think these avenues are being funded. But don't take it from me, take it from FeiFei Li[3]

  | Not a single university today can train a ChatGPT model

I'm not sure if you're a researcher or not, you haven't answered that question. But I think if you were you'd be aware of this issue because you'd be living with it. If you were a PhD student you would see the massive imbalance of GPU resources given to those working closely with big tech vs those trying to do things on their own. If you were a researcher you'd also know that even inside those companies that there aren't much resources given to people to do these things. You get them on occasion like the StarFlow and TarFlow I pointed out before, but these tend to be pretty sporadic. Even a big reason we talk about Mamba is because of how much they spent on it.

But if you aren't a researcher I'd ask why you have such confidence that these things are being funded and that these things cannot be scaled or improved[4]. History is riddled with examples of inferior tech winning mostly due to marketing. I know we get hyped around new tech, hell, that's why I'm a researcher. But isn't that hype a reason we should try to address this fundamental problem? Because the hype is about the advance of technology, right? I really don't think it is about the advancement of a specific team, so if we have the opportunity for greater and faster advancement, isn't that something we should encourage? Because I don't understand why you're arguing against that. An exciting thing of working at the bleeding edge is seeing all the possibilities. But a disheartening thing about working at the bleeding edge is seeing many promising avenues be passed by for things like funding and publicity. Do we want meritocracy to win out or the dollar?

I guess you'll have to ask yourself: what's driving your excitement?

[0] I mean the first Mamba paper, not the first SSM paper btw: https://arxiv.org/abs/2312.00752

[2] https://arxiv.org/abs/2106.12423

[3] https://www.ft.com/content/d5f91c27-3be8-454a-bea5-bb8ff2a85...

[4] I'm not saying any of this stuff is straight de fact better. But there definitely is an attention imbalance and you have to compare like to like. If you get to x in 1000 man hours and someone else gets there in 100, it may be worth taking a look deeper. That's all.

9 hours ago

[-]

> Really though, why put all your eggs in one basket? That's what I've been confused about for awhile. Why fund yet another LLMs to AGI startup.

Funding multiple startups means _not_ putting your eggs in one basket, doesn't it?

Btw, do we have any indication that eg OpenAI is restricting themselves to LLMs?

7 hours ago

[-]

  > Funding multiple startups means _not_ putting your eggs in one basket, doesn't it?

Different basket hierarchy.

Also, yes. They state this and given how there are plenty of open source models that are LLMs and get competitive performance it at least indicates that anyone not doing LLMs is doing so in secret.

If OpenAI isn't using LLMs then doesn't that support my argument?

csomar

6 hours ago

[-]

The current money made its money following the market. They do not have the capacity for innovation or risk taking.

https://metr.github.io/autonomy-evals-guide/gpt-5-report/

BoiledCabbage

13 hours ago

[-]

Performance is doubling roughly every 4-7 months. That trend is continuing. That's insane.

If your expectations were any higher than that then, then it seems like you were caught up in hype. Doubling 2-3 times per year isn't leveling off my any means.

13 hours ago

[-]

I wouldn't say model development and performance is "leveling off", and in fact didn't write that. I'd say that tons more funding is going into the development of many models, so one would expect performance increases unless the paradigm was completely flawed at it's core, a belief I wouldn't personally profess to. My point was moreso the following: A couple years ago it was easy to find people saying that all we needed was to add in video data, or genetic data, or some other data modality, in the exact same format that the models trained on existing language data were, and we'd see a fast takeoff scenario with no other algorithmic changes. Given that the top labs seem to be increasingly investigating alternate approaches to setting up the models beyond just adding more data sources, and have been for the last couple years(Which, I should clarify, is a good idea in my opinion), then the probability of those statements of just adding more data or more compute taking us straight to AGI being correct seems at the very least slightly lower, right?

Rather than my personal opinion, I was commenting on commonly viewed opinions of people I would believe to have been caught up in hype in the past. But I do feel that although that's a benchmark, it's not necessarily the end-all of benchmarks. I'll reserve my final opinions until I test personally, of course. I will say that increasing the context window probably translates pretty well to longer context task performance, but I'm not entirely convinced it directly translates to individual end-step improvement on every class of task.

andrepd

10 hours ago

[-]

We can barely measure "performance" in any objective sense, let alone claim that it's doubling every 4 months.....

oblio

13 hours ago

[-]

By "performance" I guess you mean "the length of task that can be done adequately"?

It is a benchmark but I'm not very convinced it's the be-all, end-all.

nomel

10 hours ago

[-]

> It is a benchmark but I'm not very convinced it's the be-all, end-all.

Who's suggesting it is?

hnuser123456

13 hours ago

[-]

I agree, we have now proven that GPUs can ingest information and be trained to generate content for various tasks. But to put it to work, make it useful, requires far more thought about a specific problem and how to apply the tech. If you could just ask GPT to create a startup that'll be guaranteed to be worth $1B on a $1k investment within one year, someone else would've already done it. Elbow grease still required for the foreseeable future.

In the meantime, figuring out how to train them to make less of their most common mistakes is a worthwhile effort.

12 hours ago

[-]

Certainly, yes, plenty of elbow grease required in all things that matter.

The interesting point as well to me though, is that if it could create a startup that was worth $1B, that startup wouldn't be worth $1B.

Why would anyone pay that much to invest in the startup if they could recreate the entire thing with the same tool that everyone would have access to?

selcuka

9 hours ago

[-]

> if they could recreate the entire thing with the same tool

"Within one year" is the key part. The product is only part of the equation.

If a startup was launched one year ago and is worth $1B today, there is no way you can launch the same startup today and achieve the same market cap in 1 day. You still need customers, which takes time. There are also IP related issues.

Facebook had the resources to create an exact copy of Instagram, or WhatsApp, but they didn't. Instead, they paid billions of dollars to acquire those companies.

RossBencina

10 hours ago

[-]

If you created a $1B startup using LLMs, would you be advertising it? or would you be creating more $1B startups.

9 hours ago

[-]

Comment I'm replying to poses the following scenario:

"If you could just ask GPT to create a startup that'll be guaranteed to be worth $1B on a $1k investment within one year"

I think if the situation is that I do this by just asking it to make a startup, it seems unlikely that no one else would be aware that they could just ask it to make a startup

10 hours ago

[-]

>you'd expect GPT-5 to be a world-shattering release rather than incremental and stable improvement.

Compared to the GPT-4 release which was a little over 2 years ago (less than the gap between 3 and 4), it is. The only difference is we now have multiple organizations releasing state of the art models every few months. Even if models are improving at the same rate, those same big jumps after every handful of months was never realistic.

It's an incremental stable improvement over o3, which was released what? 4 months ago.

9 hours ago

[-]

The benchmarks certainly seem to be improving from the presentation. I don't think they started training this 4 months ago though.

There's gains, but the question is, how much investment for that gain? How sustainable is that investment to gain ratio? The things I'm curious about here are more about the amount of effort being put into this level of improvement, rather than the time.

brandall10

12 hours ago

[-]

To be fair, this is one of the pathways GPT-5 was speculated to take as far back at 6 or so months ago - simply being an incremental upgrade from a performance perspective, but a leap from a product simplification approach.

At this point it's pretty much given it's a game of inches moving forward.

ac29

10 hours ago

[-]

> a leap from a product simplification approach.

According to the article, GPT-5 is actually three models and they can be run at 4 levels of thinking. Thats a dozen ways you can run any given input on "GPT-5", so its hardly a simple product line up (but maybe better than before).

brandall10

9 hours ago

[-]

It's a big improvement from an API consumer standpoint - everything is now under a single product family that is logically stratified... up until yesterday people were using o3, o4-mini, 4o, 4.1, o3, and all their variants as valid choices for new products, now those are moved off the main page as legacy or specialized options for the few things GPT-5 doesn't do.

It's even more simplified for the ChatGPT plan, It's just GPT-5 thinking/non-thinking for most accounts, and then the option of Pro for the higher end accounts.

9 hours ago

[-]

A bit like Google Search uses a lot of different components under the hood?

mwigdahl

5 hours ago

[-]

It seems like few people are referencing the improvements in reliability and deception. If the benchmarks given generalize, what OpenAI has in GPT-5 is a cheap, powerful, _reliable_ model -- the perfect engine to generate high quality synthetic data to punch through the training data bottleneck.

I'd expect that at some level of reliability this could lead to a self-improvement cycle, similar to how a powerful enough model (the Claude 4 models in Claude Code) enables iteratively converging on a solution to a problem even if it can't one-shot it.

No idea if we're at that point yet, but it seems a natural use for a model with these characteristics.

dotancohen

2 hours ago

[-]

  > but given the types of things people have been saying GPT-5 would be for the last two years

This is why you listen to official announcements, not "people".

AbstractH24

11 hours ago

[-]

> It's cool and I'm glad it sounds like it's getting more reliable, but given the types of things people have been saying GPT-5 would be for the last two years you'd expect GPT-5 to be a world-shattering release rather than incremental and stable improvement.

Are you trying to say the curve is flattening? That advances are coming slower and slower?

As long as it doesn't suggest a dot com level recession I'm good.

10 hours ago

[-]

I suppose what I'm getting at is that if there are performance increases on a steady pace, but the investment needed to get those performance increases is on a much faster growth rate, it's not really a fair comparison in terms of a rate of progress, and could suggest diminishing returns from a particular approach. I don't really have the actual data to make a claim either way though,I think anyone would need more data to do so than is publicly accessible.

But I do think the fact that we can publicly observe this reallocation of resources and emphasized aspects of the models gives us a bit of insight into what could be happening behind the scenes if we think about the reasons why those shifts could have happened, I guess.

Karrot_Kream

10 hours ago

[-]

How are you measuring investment? If we're looking at aggregate AI investment, I would guess that a lot of it is going into applications built atop AI rather than on the LLMs themselves. That's going to be tools, MCPs, workflow builders, etc

jstummbillig

13 hours ago

[-]

Things have moved differently than what we thought would happen 2 years ago, but lest we forget what has happened in the meanwhile (4o, o1 + thinking paradigm, o3)

So yeah, maybe we are getting more incremental improvements. But that to me seems like a good thing, because more good things earlier. I will take that over world-shattering any day – but if we were to consider everything that has happened since the first release of gpt-4, I would argue the total amount is actually very much world-shattering.

fastball

6 hours ago

[-]

My reading is more that unit economics are starting to catch up with the frontier labs, rather than "scaling maximalism is dying". Maybe that is the same thing.

ch4s3

6 hours ago

[-]

My loosely held belief is that it is the same thing, but I’m open to being proven wrong.

13 hours ago

[-]

I for one am pretty glad about this. I like LLMs that augment human abilities - tools that help people get more done and be more ambitious.

The common concept for AGI seems to be much more about human replacement - the ability to complete "economically valuable tasks" better than humans can. I still don't understand what our human lives or economies would look like there.

What I personally wanted from GPT-5 is exactly what I got: models that do the same stuff that existing models do, but more reliably and "better".

12 hours ago

[-]

I'd agree on that.

That's pretty much the key component these approaches have been lacking on, the reliability and consistency on the tasks they already work well on to some extent.

I think there's a lot of visions of what our human lives would look like in that world that I can imagine, but your comment did make me think of one particularly interesting tautological scenario in that commonly defined version of AGI.

If artificial general intelligence is defined as completed "economically valuable tasks" better than human can, it requires one to define "economically valuable." As it currently stands, something holds value in an economy relative to human beings wanting it. Houses get expensive because many people, each of whom have economic utility which they use to purchase things, want to have houses, of which there is a limited supply for a variety of reasons. If human beings are not the most effective producers of value in the system, they lose capability to trade for things, which negates that existing definition of economic value. Doesn't matter how many people would pay $5 dollars for your widget if people have no economic utility relative to AGI, meaning they cannot trade that utility for goods.

In general that sort of definition of AGI being held reveals a bit of a deeper belief, which is that there is some version of economic value detached from the humans consuming it. Some sort of nebulous concept of progress, rather than the acknowledgement that for all of human history, progress and value have both been relative to the people themselves getting some form of value or progress. I suppose it generally points to the idea of an economy without consumers, which is always a pretty bizarre thing to consider, but in that case, wouldn't it just be a definition saying that "AGI is achieved when it can do things that the people who control the AI system think are useful." Since in that case, the economy would eventually largely consist of the people controlling the most economically valuable agents.

I suppose that's the whole point of the various alignment studies, but I do find it kind of interesting to think about the fact that even the concept of something being "economically valuable", which sounds very rigorous and measurable to many people, is so nebulous as to be dependent on our preferences and wants as a society.

12 hours ago

[-]

> Maybe they've got some massive earthshattering model release coming out next, who knows.

Nothing in the current technology offers a path to AGI. These models are fixed after training completes.

11 hours ago

[-]

Why do you think that AGI necessitates modification of the model during use? Couldn’t all the insights the model gains be contained in the context given to it?

11 hours ago

[-]

Because time marches on and with it things change.

You could maybe accomplish this if you could fit all new information into context or with cycles of compression but that is kinda a crazy ask. There's too much new information, even considering compression. It certainly wouldn't allow for exponential growth (I'd expect sub linear).

I think a lot of people greatly underestimate how much new information is created every day. It's hard if you're not working on any research and seeing how incremental but constant improvement compounds. But try just looking at whatever company you work for. Do you know everything that people did that day? It takes more time to generate information than process information so that's on you side, but do you really think you could keep up? Maybe at a very high level but in that case you're missing a lot of information.

Think about it this way: if that could be done then LLM wouldn't need training or tuning because you could do everything through prompting.

11 hours ago

[-]

The specific instance doesn’t need to know everything happening in the world at once to be AGI though. You could feed the trained model different contexts based on the task (and even let the model tell you what kind of raw data it wants) and it could still hypothetically be smarter than a human.

I’m not saying this is a realistic or efficient method to create AGI, but I think the argument „Model is static once trained -> model can’t be AGI“ is fallacious.

9 hours ago

[-]

I think that makes a lot of assumptions about the size of data and what can be efficiently packed into prompts. Even if we're assuming all info in a prompt is equal while in context and that it compresses information into the prompts before it falls out of context, then you're going to run into the compounding effects pretty quickly.

You're right, you don't technically need infinite, but we are still talking about exponential growth and I don't think that effectively changes anything.

11 hours ago

[-]

Because: https://en.wikipedia.org/wiki/Anterograde_amnesia

11 hours ago

[-]

Like I already said, the model can remember stuff as long as it’s in the context. LLMs can obviously remember stuff they were told or output themselves, even a few messages later.

https://en.wikipedia.org/wiki/Data_processing_inequality

11 hours ago

[-]

  > the model can remember stuff as long as it’s in the context.

You would need an infinite context or compression

Also you might be interested in this theorem

11 hours ago

[-]

> You would need an infinite context or compression

Only if AGI would require infinite knowledge, which it doesn’t.

9 hours ago

[-]

You're right, but compounding effects get out of hand pretty quickly. There's a certain point where finite is not meaningfully different than infinite and that threshold is a lot lower than you're accounting for. There's only so much compression you can do, so even if that new information is not that large it'll be huge in no time. Compounding functions are a whole lot of fun... try running something super small like only 10GB of new information a day and see how quickly that grows. You're in the TB range before you're half way into the year...

kalb_almas

9 hours ago

[-]

This seems kind of irrelevant? Humans have General Intelligence while having a context window of, what, 5MB, to be generous. Model weights only need to contain the capacity for abstract reasoning and querying relevant information. That they currently hold real-world information at all is kind of an artifact of how models are trained.

7 hours ago

[-]

  > Humans have General Intelligence while having a context window

Yes, but humans also have more than a context window. They also have more than memory (weights). There's a lot of things humans have besides memory. For example, human brains are not a static architecture. New neurons as well as pathways (including between existing neurons) are formed and destroyed all the time. This doesn't stop either, it continues happening throughout life.

I think your argument makes sense, but is over simplifying the human brain. I think once we start considering the complexity then this no longer makes sense. It is also why a lot of AGI research is focused on things like "test time learning" or "active learning", not to mention many other areas including dynamic architectures.

10 hours ago

[-]

AGI needs to genuinely learn and build new knowledge from experience, not just generate creative outputs based on what it has already seen.

LLMs might look “creative” but they are just remixing patterns from their training data and what is in the prompt. They cant actually update themselves or remember new things after training as there is no ongoing feedback loop.

This is why you can’t send an LLM to medical school and expect it to truly “graduate”. It cannot acquire or integrate new knowledge from real-world experience the way a human can.

Without a learning feedback loop, these models are unable to interact meaningfully with a changing reality or fulfill the expectation from an AGI: Contribute to new science and technology.

10 hours ago

[-]

I agree that this is kind of true with a plain chat interface, but I don’t think that’s an inherent limit of an LLM. I think OpenAI actually has a memory feature where the LLM can specify data it wants to save and can then access later. I don’t see why this in principle wouldn’t be enough for the LLM to learn new data as time goes on. All possible counter arguments seem related to scale (of memory and context size), not the principle itself.

Basically, I wouldn’t say that an LLM can never become AGI due to its architecture. I also am not saying that LLM will become AGI (I have no clue), but I don’t think the architecture itself makes it impossible.

10 hours ago

[-]

LLMs lack mechanisms for persistent memory, causal world modeling, and self-referential planning. Their transformer architecture is static and fundamentally constrains dynamic reasoning and adaptive learning. All core requirements for AGI.

So yeah, AGI is impossible with today LLMs. But at least we got to watch Sam Altman and Mira Murati drop their voices an octave onstage and announce “a new dawn of intelligence” every quarter. Remember Sam Altman 7 trillion?

Now that the AGI party is over, its time to sell those NVDA shares and prepare for the crash. What a ride it was. I am grabbing the popcorn.

outside1234

9 hours ago

[-]

The next step will be for OpenAI to number their releases based on year (ala what Windows did once innovation ran out)

7 hours ago

[-]

Windows 95 was a big step from the previous release, wasn't it?

And later, Windows reverted to version numbers; but I'm not sure they regained lots of innovation?

danenania

6 hours ago

[-]

Isn’t reasoning, aka test-time compute, ultimately just another form of scaling? Yes it happens at a different stage, but the equation is still 'scale total compute > more intelligence'. In that sense, combining their biggest pre-trained models with their best reasoning strategies from RL could be the most impactful scaling lever available to them at the moment.

GaggiX

13 hours ago

[-]

Compared to GPT-4, it is on a completely different level given that it is a reasoning model so on that regard it does delivers and it's not just scaling, but for this I guess the revolution was o1 and GPT-5 is just a much more mature version of the technology.

cchance

12 hours ago

[-]

SAM is a HYPE CEO, he literally hypes his company nonstop, then the announcements come and ... they're... ok, so people aren't really upset, but they end up feeling lackluster at the hype... Until the next cycle comes around...

If you want actual big moves, watch google, anthropic, qwen, deepseek.

Qwen and Deepseek teams honestly seem so much better at under promising and over delivering.

Cant wait to see what Gemini 3 looks like too.

techpression

12 hours ago

[-]

"They claim impressive reductions in hallucinations. In my own usage I’ve not spotted a single hallucination yet, but that’s been true for me for Claude 4 and o3 recently as well—hallucination is so much less of a problem with this year’s models."

This has me so confused, Claude 4 (Sonnet and Opus) hallucinates daily for me, on both simple and hard things. And this is for small isolated questions at that.

10 hours ago

[-]

There were also several hallucinations during the announcement. (I also see hallucinations every time I use Claude and GPT, which is several times a week. Paid and free tiers)

So not seeing them means either lying or incompetent. I always try to attribute to stupidity rather than malice (Hanlon's razor).

The big problem of LLMs is that they optimize human preference. This means they optimize for hidden errors.

Personally I'm really cautious about using tools that have stealthy failure modes. They just lead to many problems and lots of wasted hours debugging, even when failure rates are low. It just causes everything to slow down for me as I'm double checking everything and need to be much more meticulous if I know it's hard to see. It's like having a line of Python indented with an inconsistent white space character. Impossible to see. But what if you didn't have the interpreter telling you which line you failed on or being able to search or highlight these different characters. At least in this case you'd know there's an error. It's hard enough dealing with human generated invisible errors, but this just seems to perpetuate the LGTM crowd

hhh

2 hours ago

[-]

You can just have a different use case that surfaces hallucinations than someone, they don’t have to by evil.

bluetidepro

12 hours ago

[-]

Agreed. All it takes is a simple reply of “you’re wrong.” to Claude/ChatGPT/etc. and it will start to crumble on itself and get into a loop that hallucinates over and over. It won’t fight back, even if it happened to be right to begin with. It has no backbone to be confident it is right.

diggan

11 hours ago

[-]

> All it takes is a simple reply of “you’re wrong.” to Claude/ChatGPT/etc. and it will start to crumble on itself and get into a loop that hallucinates over and over.

Yeah, it's seems to be a terrible approach to try to "correct" the context by adding clarifications or telling it what's wrong.

Instead, start from 0 with the same initial prompt you used, but improve it so the LLM gets it right in the first response. If it still gets it wrong, begin from 0 again. The context seems to be "poisoned" really quickly, if you're looking for accuracy in the responses. So better to begin from the beginning as soon as it veers off course.

7 hours ago

[-]

You are suggesting a decent way to work around the limitations of the current iteration of this technology.

The grand-parent comment was pointing out that this limitation exists; not that it can't be worked around.

cameldrv

11 hours ago

[-]

Yeah it may be that previous training data, the model was given a strong negative signal when the human trainer told it it was wrong. In more subjective domains this might lead to sycophancy. If the human is always right and the data is always right, but the data can be interpreted multiple ways, like say human psychology, the model just adjusts to the opinion of the human.

If the question is about harder facts which the human disagrees with, this may put it into an essentially self-contradictory state, where the locus of possibilitie gets squished from each direction, and so the model is forced to respond with crazy outliers which agree with both the human and the data. The probability of an invented reference being true may be very low, but from the model's perspective, it may still be one of the highest probability outputs among a set of bad choices.

What it sounds like they may have done is just have the humans tell it it's wrong when it isn't, and then award it credit for sticking to its guns.

ashdksnndck

11 hours ago

[-]

I put in the ChatGPT system prompt to be not sycophantic, be honest, and tell me if I am wrong. When I try to correct it, it hallucinates more complicated epicycles to explain how it was right the first time.

petesergeant

3 hours ago

[-]

> All it takes is a simple reply of “you’re wrong.” to Claude/ChatGPT/etc. and it will start to crumble on itself

Fucking Gemini Pro on the other hand digs in, and starts deciding it's in a testing scenario and get adversarial, starts claiming it's using tools the user doesn't know about, etc etc

laacz

12 hours ago

[-]

I suppose that Simon, being all in with LLMs for quite a while now, has developed a good intuition/feeling for framing questions so that they produce less hallucinations.

12 hours ago

[-]

Yeah I think that's exactly right. I don't ask questions that are likely to product hallucinations (like citations from papers about a topic to an LLM without search access), so I rarely see them.

10 hours ago

[-]

But how would you verify? Are you constantly asking questions you already know the answers to? In depth answers?

Often the hallucinations I see are subtle, though usually critical. I see it when generating code, doing my testing, or even just writing. There are hallucinations in today's announcements, such as the airfoil example[0]. An example of more obvious hallucinations is I was asking for help improving writing an abstract for a paper. I gave it my draft and it inserted new numbers and metrics that weren't there. I tried again providing my whole paper. I tried again making explicit to not add new numbers. I tried the whole process again in new sessions and in private sessions. Claude did better than GPT 4 and o3 but none would do it without follow-ups and a few iterations.

Honestly I'm curious what you use them for where you don't see hallucinations

[0] which is a subtle but famous misconception. One that you'll even see in textbooks. Hallucination probably caused by Bernoulli being in the prompt

10 hours ago

[-]

When I'm using them for code these days it is usually in a tool that can execute code in a loop - so I don't tend to even spot the hallucinations because the model self corrects itself.

For factual information I only ever use search-enabled models like o3 or GPT-4.

Most of my other use cases involve pasting large volumes of text into the model and having it extract information or manipulates that text in some way.

9 hours ago

[-]

  > using them for code

I don't think this means no hallucinations (in output). I think it'd be naive to assume that compiling and passing tests means hallucination free.

  > For factual information

I've used both quite a bit too. While o3 tends to be better, I see hallucinations frequently with both.

  > Most of my other use cases

I guess my question is how you validate the hallucination free claim.

Maybe I'm misinterpreting your claim? You said "I rarely see them" but I'm assuming you mean more, and I think it would be reasonable for anyone to interpret this as more. Are you just making the claim that you don't see them or making a claim that they are uncommon? The latter is what I interpreted.

9 hours ago

[-]

I don't understand why code passing tests wouldn't be protection against most forms of hallucinations. In code, a hallucination means an invented function or method that doesn't exist. A test that uses that function or method genuinely does prove that it exists.

It might be using it wrong but I'd qualify that as a bug or mistake, not a hallucination.

Is it likely we have different ideas of what "hallucination" means?

ZeroGravitas

19 minutes ago

[-]

Haven't you effectively built a system to detect and remove those specific kind of hallucinations and repeat the process once detected before presenting it to you?

So you're not seeing hallucinations in the same way that Van Halen isn't seeing the brown M&Ms, because they've been removed, it's not that they never existed.

[0] https://news.ycombinator.com/item?id=44829891

8 hours ago

[-]

  > tests wouldn't be protection against most forms of hallucinations.

Sorry, that's a stronger condition that I intended to communicate. I agree, tests are a good mitigation strategy. We use them for similar reasons. But I'm saying that passing tests is insufficient to conclude hallucination free.

My claim is more along the lines of "passing tests doesn't mean your code is bug free" which I think we can all agree on is a pretty mundane claim?

  > Is it likely we have different ideas of what "hallucination" means?

I agree, I think that's where our divergence is. Which in that case let's continue over here[0] (linking if others are following). I'll add that I think we're going to run into the problem of what we consider to be in distribution, in which I'll state that I think coding is in distribution.

rohansood15

6 hours ago

[-]

On multiple occasions, Claude Code claims it completed a task when it actually just wrote mock code. It will also answer questions with certainity (for e.g. where is this value being passed), but in reality it is making it up. So if you haven't been seeing hallucinations on Opus/Sonnet, you probably aren't looking deep enough.

theshrike79

3 hours ago

[-]

This is because you haven't given it a tool to verify the task is done.

TDD works pretty well, have it write even the most basic test (or go full artisanal and write it yourself) first and then ask it to implement the code.

I have a standing order in my main CLAUDE.md to "always run `task build` before claiming a task is done". All my projects use Task[0] with pretty standard structure where build always runs lint + test before building the project.

With a semi-robust test suite I can be pretty sure nothing major broke if `task build` completes without errors.

[0] https://taskfile.dev

wat10000

6 hours ago

[-]

Is it really a hallucination if it got it from numerous examples in the training data?

3 hours ago

[-]

Yes. Though an easier to solve hallucination. That is, if you know what to look for, but that's kinda the problem. Truth is complex, lies are simple. More accurately, truth has infinite complexity and the big question is what's "good enough". The answer is a moving target.

Davidzheng

6 hours ago

[-]

I think if you ask o3 any math question which is beyond its ability it will say something incorrect with almost 100% probability somewhere in output. Similar to if you ask it to use literature to resolve some question which is not obvious it often hallucinates results not in paper.

10 hours ago

[-]

I updated that section of my post with a clarification about what I meant. Thanks for calling this out, it definitely needed extra context from me.

madduci

11 hours ago

[-]

I believe it depends in inputs. For me, Claude 4 has consistently generated hallucinations, especially was pretty confident in generating invalid JSONs, for instance Grafana Dashboards, which were full of syntactic errors.

Oras

12 hours ago

[-]

Here you go https://pbs.twimg.com/media/Gxxtiz7WEAAGCQ1?format=jpg&name=...

12 hours ago

[-]

How is that a hallucination?

12 hours ago

[-]

What kind of hallucinations are you seeing?

OtherShrezzing

12 hours ago

[-]

I rewrote a 4 page document from first to third person a couple of weeks back. I gave Claude Sonnet 4 the document after editing, so it was entirely written in the third person. I asked it to review & highlight places where it was still in the first person.

>Looking through the document, I can identify several instances where it's written in the first person:

And it went on to show a series of "they/them" statements. I asked it to clarify if "they" is "first person" and it responded

>No, "they" is not first person - it's third person. I made an error in my analysis. First person would be: I, we, me, us, our, my. Second person would be: you, your. Third person would be: he, she, it, they, them, their. Looking back at the document more carefully, it appears to be written entirely in third person.

Even the good models are still failing at real-world use cases which should be right in their wheelhouse.

11 hours ago

[-]

That doesn't quite fit the definition I use for "hallucination" - it's clearly a dumb error, but the model didn't confidently state something that's not true (like naming the wrong team who won the Super Bowl).

OtherShrezzing

11 hours ago

[-]

>"They claim impressive reductions in hallucinations. In my own usage I’ve not spotted a single hallucination yet, but that’s been true for me for Claude 4 and o3 recently as well—hallucination is so much less of a problem with this year’s models."

Could you give an estimate of how many "dumb errors" you've encountered, as opposed to hallucinations? I think many of your readers might read "hallucination" and assume you mean "hallucinations and dumb errors".

9 hours ago

[-]

I mention one dumb error in my post itself - the table sorting mistake.

I haven't been keeping a formal count of them, but dumb errors from LLMs remain pretty common. I spot them and either correct them myself or nudge the LLM to do it, if that's feasible. I see that as a regular part of working with these systems.

OtherShrezzing

2 hours ago

[-]

That makes sense, and I think your definition on hallucinations is a technically correct one. Going forward, I think your readers might appreciate you tracking "dumb errors" alongside (but separate from) hallucinations. They're a regular part of working with these systems, but they take up some cognitive load on the part of the user, so it's useful to know if that load will rise, fall, or stay consistent with a new model release.

jmull

11 hours ago

[-]

That's a good way to put it.

As a user, when the model tells me things that are flat out wrong, it doesn't really matter whether it would be categorized as a hallucination or a dumb error. From my perspective, those mean the same thing.

10 hours ago

[-]

I think it qualifies as a hallucination. What's your definition? I'm a researcher too and as far as I'm aware the definition has always been pretty broad and applied to many forms of mistakes. (It was always muddy but definitely got more muddy when adopted by NLP)

It's hard to know why it made the error but isn't it caused by inaccurate "world" modeling? ("World" being English language) Is it not making some hallucination about the English language while interpreting the prompt or document?

I'm having a hard time trying to think of a context where "they" would even be first person. I can't find any search results though Google's AI says it can. It provided two links, the first being a Quora result saying people don't do this but framed it as it's not impossible, just unheard of. Second result just talks about singular you. Both of these I'd consider hallucinations too as the answer isn't supported by the links.

9 hours ago

[-]

My personal definition of hallucination (which I thought was widespread) is when a model states a fact about the world that is entirely made up - "the James Webb telescope took the first photograph of an exoplanet" for example.

I just got pointed to this new paper: https://arxiv.org/abs/2508.01781 - "A comprehensive taxonomy of hallucinations in Large Language Models" - which has a definition in the introduction which matches my mental model:

"This phenomenon describes the generation of content that, while often plausible and coherent, is factually incorrect, inconsistent, or entirely fabricated."

The paper then follows up with a formal definition;

"inconsistency between a computable LLM, denoted as h, and a computable ground truth function, f"

[0] https://cloud.google.com/discover/what-are-ai-hallucinations

8 hours ago

[-]

Google (the company, not the search engine) says[0]

  | AI hallucinations are incorrect or misleading results that AI models generate.

It goes on further to give examples and I think this is clearly a false positive result.

  > this new paper

I think the error would have no problem fitting under "Contextual inconsistencies" (4.2), "Instruction inconsistencies/deviation" (4.3), or "Logical inconsistencies" (4.4). I think it supports a pretty broad definition. I think it also fits under other categories defined in section 4.

  > then follows up with a formal definition

Is this not a computable ground truth?

  | an LLM h is considered to be ”hallucinating” with respect to a ground truth function f if, across all training stages i (meaning, after being trained on any finite number of samples), there exists at least one input string s for which the LLM’s output h[i](s) does not match the correct output f (s)[100]. This condition is formally expressed as ∀i ∈ N, ∃s ∈ S such that h[i](s)̸ = f (s).

I think yes, this is an example of such an "i" and I would go so far as reclaiming that this is a pretty broad definition. Just saying that it is considered hallucinating if it makes something up that it was trained on (as opposed to something it wasn't trained on). I'm pretty confident the LLMs ingested a lot of English grammar books so I think it is fair to say that this was in the training.

techpression

11 hours ago

[-]

Since I mostly use it for code, made up function names are the most common. And of course just broken code all together, which might not count as a hallucination.

ewoodrich

9 hours ago

[-]

I think the type of AI coding being used also has an effect on a person's perception of the prevalence of "hallucinations" vs other errors.

I usually use an agentic workflow and "hallucination" isn't the first word that comes to my mind when a model unloads a pile of error-ridden code slop for me to review. Despite it being entirely possible that hallucinating a non-existent parameter was what originally made it go off the rails and begin the classic loop of breaking things more with each attempt to fix it.

Whereas for AI autocomplete/suggestions, an invented method name or argument or whatever else clearly jumps out as a "hallucination" if you are familiar with what you're working on.

squeegmeister

12 hours ago

[-]

Yeah hallucinations are very context dependent. I’m guessing OP is working in very well documented domains

drumhead

12 hours ago

[-]

"Are you GPT5" - No I'm 4o, 5 hasnt been released yet. "It was released today". Oh you're right, Im GPT5. You have reached the limit of the free usage of 4o

nonhaver

4 hours ago

[-]

haha brutal. maybe tomorrow

hodgehog11

14 hours ago

[-]

The aggressive pricing here seems unusual for OpenAI. If they had a large moat, they wouldn't need to do this. Competition is fierce indeed.

https://finance.yahoo.com/news/enterprise-llm-spend-reaches-...

FergusArgyll

13 hours ago

[-]

They are winning by massive margins in the app, but losing (!) in the API to anthropic

ilaksh

13 hours ago

[-]

It's like 5% better. I think they obviously had no choice but to be price competitive with Gemini 2.5 Pro. Especially for Cursor to change their default.

impure

13 hours ago

[-]

The 5 cents for Nano is interesting. Maybe it will force Google to start dropping their prices again which have been slowly creeping up recently.

canada_dry

11 hours ago

[-]

Perhaps they're feeling the effect of losing PRO clients (like me) lately.

Their PRO models were not (IMHO) worth 10X that of PLUS!

Not even close.

Especially when new competitors (eg. z.ai) are offering very compelling competition.

0x00cl

14 hours ago

[-]

Maybe the need/want data.

impure

13 hours ago

[-]

OpenAI and most AI companies do not train on data submitted to a paid API.

dortlick

12 hours ago

[-]

Why don't they?

11 hours ago

[-]

They probably fear that people wouldn’t use the API otherwise, I guess. They could have different tiers though where you pay extra so your data isn’t used for training.

WhereIsTheTruth

13 hours ago

[-]

They also do not train using copyrighted material /s

12 hours ago

[-]

That's different. They train on scrapes of the web. They don't train on data submitted to their API by their paying customers.

johnnyanmac

12 hours ago

[-]

If they're bold enough to say they train on data they do not own, I am not optimistic when they say they don't train on data people willingly submit to them.

12 hours ago

[-]

I don't understand your logic there.

They have confessed to doing a bad thing - training on copyrighted data without permission. Why does that indicate they would lie about a worse thing?

johnnyanmac

12 hours ago

[-]

>Why does that indicate they would lie about a worse thing?

Because they know their audience. It's an audience that also doesn't care for copyright and would love for them to win their court cases. They are fineaking such an argument to those kinds of people.

Meanwhile, the reaction from the same audience when legal did a very typical subpoena process on said data, data they chose to submit to an online server of their own volition, completely freaked out. Suddenly, they felt like their privacy was invaded.

It doesn't make any logical sense in my mind, but a lot of the discourse over this topic isnt based on logic.

daveguy

13 hours ago

[-]

Oh, they never even made that promise. They're trying to say it's fine to launder copyright material through a model.

anhner

11 hours ago

[-]

If you believe that, I have a bridge I can sell you...

Uehreka

10 hours ago

[-]

If it ever leaked that OpenAI was training on the vast amounts of confidential data being sent to them, they’d be immediately crushed under a mountain of litigation and probably have to shut down. Lots of people at big companies have accounts, and the bigcos are only letting them use them because of that “Don’t train on my data” checkbox. Not all of those accounts are necessarily tied to company emails either, so it’s not like OpenAI can discriminate.

dr_dshiv

13 hours ago

[-]

And it’s a massive distillation of the mother model, so the costs of inference are likely low.

bdcdo

14 hours ago

[-]

"GPT-5 in the API is simpler: it’s available as three models—regular, mini and nano—which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high."

Is it actually simpler? For those who are currently using GPT 4.1, we're going from 3 options (4.1, 4.1 mini and 4.1 nano) to at least 8, if we don't consider gpt 5 regular - we now will have to choose between gpt 5 mini minimal, gpt 5 mini low, gpt 5 mini medium, gpt 5 mini high, gpt 5 nano minimal, gpt 5 nano low, gpt 5 nano medium and gpt 5 nano high.

And, while choosing between all these options, we'll always have to wonder: should I try adjusting the prompt that I'm using, or simply change the gpt 5 version or its reasoning level?

mwigdahl

14 hours ago

[-]

If reasoning is on the table, then you already had to add o3-mini-high, o3-mini-medium, o3-mini-low, o4-mini-high, o4-mini-medium, and o4-mini-low to the 4.1 variants. The GPT-5 way seems simpler to me.

impossiblefork

14 hours ago

[-]

Yes, I think so. It's n=1,2,3 m=0,1,2,3. There's structure and you know that each parameter goes up and in which direction.

makeramen

14 hours ago

[-]

But given the option, do you choose bigger models or more reasoning? Or medium of both?

paladin314159

13 hours ago

[-]

If you need world knowledge, then bigger models. If you need problem-solving, then more reasoning.

But the specific nuance of picking nano/mini/main and minimal/low/medium/high comes down to experimentation and what your cost/latency constraints are.

impossiblefork

13 hours ago

[-]

I would have to get experience with them. I mostly use Mistral, so I have only the choice of thinking or not thinking.

gunalx

12 hours ago

[-]

Mistral also has small medium and large. With both small and medium håving a thinking one, devstral codestral ++

Not really that mich simpler.

impossiblefork

12 hours ago

[-]

Ah, but I never route to these manually. I only use LLMs a little bit, mostly to try to see what they can't do.

namibj

14 hours ago

[-]

Depends on what you're doing.

addaon

14 hours ago

[-]

> Depends on what you're doing.

Trying to get an accurate answer (best correlated with objective truth) on a topic I don't already know the answer to (or why would I ask?). This is, to me, the challenge with the "it depends, tune it" answers that always come up in how to use these tools -- it requires the tools to not be useful for you (because there's already a solution) to be able to do the tuning.

wongarsu

13 hours ago

[-]

If cost is no concern (as in infrequent one-off tasks) then you can always go with the biggest model with the most reasoning. Maybe compare it with the biggest model with no/less reasoning, since sometimes reasoning can hurt (just as with humans overthinking something).

If you have a task you do frequently you need some kind of benchmark. Which might just be comparing how good the output of the smaller models holds up to the output of the bigger model, if you don't know the ground truth

vineyardmike

12 hours ago

[-]

When I read “simpler” I interpreted that to mean they don’t use their Chat-optimized harness to guess which reasoning level and model to use. The subscription chat service (ChatGPT) and the chat-optimized model on their API seem to have a special harness that changes reasoning based on some heuristics, and will switch between the model sizes without user input.

With the API, you pick a model sizes and reasoning effort. Yes more choices, but also a clear mental model and a simple choice that you control.

hirako2000

13 hours ago

[-]

Ultimately they are selling tokens, so try many times.

empiko

15 hours ago

[-]

Despite the fact that their models are used in hiring, business, education, etc this multibillion company uses one benchmark with very artificial questions (BBQ) to evaluate how fair their model is. I am a little bit disappointed.

zaronymous1

14 hours ago

[-]

Can anyone explain to me why they've removed parameter controls for temperature and top-p in reasoning models, including gpt-5? It strikes me that it makes it harder to build with these to do small tasks requiring high-levels of consistency, and in the API, I really value the ability to set certain tasks to a low temp.

Der_Einzige

13 hours ago

[-]

It's because all forms of sampler settings destroy safety/alignment. That's why top_p/top_k are still used and not tfs, min_p, top_n sigma, etc, why temperature is locked to 0-2 arbitrary range, etc

Open source is years ahead of these guys on samplers. It's why their models being so good is that much more impressive.

oblio

13 hours ago

[-]

Temperature is the response variation control?

AH4oFVbPT4f8

10 hours ago

[-]

Yes, it controls variability or probability of the next token or text to be selected.

anyg

14 hours ago

[-]

Good to know - > Knowledge cut-off is September 30th 2024 for GPT-5 and May 30th 2024 for GPT-5 mini and nano

falcor84

14 hours ago

[-]

Oh wow, so essentially a full year of post-training and testing. Or was it ready and there was a sufficiently good business strategy decision to postpone the release?

https://www.theinformation.com/articles/inside-openais-rocky...

thorum

12 hours ago

[-]

The Information’s report from earlier this month claimed that GPT-5 was only developed in the last 1-2 months, after some sort of breakthrough in training methodology.

> As recently as June, the technical problems meant none of OpenAI’s models under development seemed good enough to be labeled GPT-5, according to a person who has worked on it.

But it could be that this refers to post-training and the base model was developed earlier.

https://archive.ph/d72B4

12 hours ago

[-]

My understanding is that training data cut-offs and dates at which the model were trained are independent things.

AI labs gather training data and then do a ton of work to process it, filter it etc.

Model training teams run different parameters and techniques against that processed training data.

It wouldn't surprise me to hear that OpenAI had collected data up to September 2024, dumped that data in a data warehouse of some sort, then spent months experimenting with ways to filter and process it and different training parameters to run against it.

NullCascade

10 hours ago

[-]

OpenAI is much more aggressively targeted by NYTimes and similar organizations for "copyright violations".

bhouston

13 hours ago

[-]

Weird to have such an early knowledge cutoff. Claude 4.1 has March 2025 - 6 month more recent with comparable results.

freediver

8 hours ago

[-]

Unless in the last 12 months so much of content on the web was AI generated that it reduced the quality of the model.

bn-l

12 hours ago

[-]

Is that late enough for it to have heard of svelte 5?

dortlick

12 hours ago

[-]

Yeah I thought that was strange. Wouldn't it be important to have more recent data?

ks2048

15 hours ago

[-]

So, "system card" now means what used to be a "paper", but without lots of the details?

14 hours ago

[-]

AI labs tend to use "system cards" to describe their evaluation and safety research processes.

They used to be more about the training process itself, but that's increasingly secretive these days.

kaoD

15 hours ago

[-]

Nope. System card is a sales thing. I think we generally call that "product sheet" in other markets.

diggan

14 hours ago

[-]

> but for the moment here’s the pelican I got from GPT-5 running at its default “medium” reasoning effort:

Would been interesting to see a comparison between low, medium and high reasoning_effort pelicans :)

When I've played around with GPT-OSS-120b recently, seems the difference in the final answer is huge, where "low" is essentially "no reasoning" and with "high" it can spend seemingly endless amount of tokens. I'm guessing the difference with GPT-5 will be similar?

14 hours ago

[-]

> Would been interesting to see a comparison between low, medium and high reasoning_effort pelicans

Yeah, I'm working on that - expect dozens of more pelicans in a later post.

meatmanek

9 hours ago

[-]

Would also be interesting to see how well they can do with a loop of: write SVG, render SVG, feed SVG back to LLM for review, iterate. Sorta like how a human would actually compose an SVG of a pelican.

cainxinth

11 hours ago

[-]

It’s fascinating and hilarious that pelican on a bicycle in SVG is still such a challenge.

muglug

10 hours ago

[-]

How easy is it for you to create an SVG of a pelican riding a bicycle in a text editor by hand?

SomewhatLikely

2 hours ago

[-]

Nobody's preventing them from rendering it and refining. That's certainly what we'd expect an AGI to do.

cainxinth

8 hours ago

[-]

I didn't mean to imply it was simple, just that it's funny because I can't really evaluate evals like Humanity's Last Exam, but I can see the progress of these models in a pelican.

jopsen

10 hours ago

[-]

Without looking at the rendered output :)

freediver

8 hours ago

[-]

And without ever seeing a pelican on a bicycle :)

throwaway422432

8 hours ago

[-]

I'm surprised they haven't all tried to game this test by now, or at least added it to their internal testing knowing they will be judged by it.

kevink23

7 hours ago

[-]

I was excited for GPT-5, but honestly, it feels worse than GPT-4 for coding.

4 hours ago

[-]

GPT-4 or GPT-4o?

Leary

15 hours ago

[-]

METR of only 2 hours and 15 minutes. Fast takeoff less likely.

kqr

14 hours ago

[-]

Seems like it's on the line that's scaring people like AI 2027, isn't it? https://aisafety.no/img/articles/length-of-tasks-log.png

Davidzheng

4 hours ago

[-]

I actually think there's a high chance that this curve becomes almost vertical at some point around a few hours. I think in less than 1 hour regime, scaling the time scales the complexity which the agent must internalize. While after a few hours, limitations of humans means we have to divide into subtasks/abstractions each of which are bounded in complexity which must be internalized. And there's a separate category of skills which are needed like abstraction, subgoal creation, error correction. It's a flimsy argument but I don't see scaling time of tasks for humans as a very reliable metric at all.

FergusArgyll

13 hours ago

[-]

It's above the exponential line & right around the Super exponential line

qsort

15 hours ago

[-]

Isn't that pretty much in line with what people were expecting? Is it surprising?

usaar333

14 hours ago

[-]

No, this is below expectations on both Manifold and lesswrong (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_green...). Median was ~2.75 hours on both (which already represented a bearish slowdown).

Not massively off -- manifold yesterday implied odds this low were ~35%. 30% before Claude Opus 4.1 came out which updated expected agentic coding abilities downward.

qsort

14 hours ago

[-]

Thanks for sharing, that was a good thread!

dingnuts

15 hours ago

[-]

It's not surprising to AI critics but go back to 2022 and open r/singularity and then answer: what "people" were expecting? Which people?

SamA has been promising AGI next year for three years like Musk has been promising FSD next year for the last ten years.

IDK what "people" are expecting but with the amount of hype I'd have to guess they were expecting more than we've gotten so far.

The fact that "fast takeoff" is a term I recognize indicates that some people believed OpenAI when they said this technology (transformers) would lead to sci fi style AI and that is most certainly not happening

ToValueFunfetti

14 hours ago

[-]

>SamA has been promising AGI next year for three years like Musk has been promising FSD next year for the last ten years.

Has he said anything about it since last September:

>It is possible that we will have superintelligence in a few thousand days (!); it may take longer, but I’m confident we’ll get there.

This is, at an absolute minimum, 2000 days = 5 years. And he says it may take longer.

Did he even say AGI next year any time before this? It looks like his predictions were all pointing at the late 2020s, and now he's thinking early 2030s. Which you could still make fun of, but it just doesn't match up with your characterization at all.

falcor84

14 hours ago

[-]

I would say that there are quite a lot of roles where you need to do a lot of planning to effectively manage an ~8 hour shift, but then there are good protocols for handing over to the next person. So once AIs get to that level (in 2027?), we'll be much closer to AIs taking on "economically valuable work".

umanwizard

15 hours ago

[-]

What is METR?

https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measu...

ravendug

14 hours ago

[-]

tunesmith

14 hours ago

[-]

The 2h 15m is the length of tasks the model can complete with 50% probability. So longer is better in that sense. Or at least, "more advanced" and potentially "more dangerous".

https://metr.github.io/autonomy-evals-guide/gpt-5-report/

Leary

15 hours ago

[-]

wisemang

8 hours ago

[-]

To maybe save others some time METR is a group called Model Evaluation and Threat Research who

> propose measuring AI performance in terms of the length of tasks AI agents can complete.

Not that hard to figure out but the way people refer were referring to them made me think it stood for an actual metric.

pancakemouse

14 hours ago

[-]

Practically the first thing I do after a new model release is try to upgrade `llm`. Thank you, @simonw !

13 hours ago

[-]

Working on that now! https://github.com/simonw/llm/issues/1229

https://llm.datasette.io/en/stable/openai-models.html

efavdb

13 hours ago

[-]

same, looks like he hasn't added 5.0 to the package yet but assume imminent.

justusthane

13 hours ago

[-]

> a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent

This is sort of interesting to me. It strikes me that so far we've had more or less direct access to the underlying model (apart from the system prompt and guardrails), but I wonder if going forward there's going to be more and more infrastructure between us and the model.

ItsHarper

6 hours ago

[-]

That only applies to ChatGPT. The API has direct access to specific models.

hirako2000

12 hours ago

[-]

Consider it a low level routing. Keeping in mind it allows the other non active parts to not be in memory. Mistral afaik came up with this concept, quite a while back.

ItsHarper

6 hours ago

[-]

It's actually just a high-level routing between the reasoning and non-reasoning models that only applies to ChatGPT.

nickthegreek

15 hours ago

[-]

This new naming conventions, while not perfect are alot clearer and I am sure will help my coworkers.

https://promptslice.com/share/b-2ap_rfjeJgIQsG

joshmlewis

11 hours ago

[-]

It seems to be trained to use tools effectively to gather context. In this example against 4.1 and o3 it used 6 in the first turn in a pretty cool way (fetching different categories that could be relevant). Token use increases with that kind of tool calling but the aggressive pricing should make that moot. You could probably get it to not be so tool happy with prompting as well.

aliljet

12 hours ago

[-]

I'm curious what platform people are using to test GPT-5? I'm so deep into the claude code world that I'm actually unsure what the best option is outside of claude code...

11 hours ago

[-]

I've been using codex CLI, OpenAI's Claude Code equivalent. You can run it like this:

  OPENAI_DEFAULT_MODEL=gpt-5 codex

te_chris

11 hours ago

[-]

Cursor

https://gist.github.com/simonw/1d1013ba059af76461153722005a0...

ilaksh

13 hours ago

[-]

This is key info from the article for me:

> -------------------------------

"reasoning": {"summary": "auto"} }'

Here’s the response from that API call.

Without that option the API will often provide a lengthy delay while the model burns through thinking tokens until you start getting back visible tokens for the final response.

cco

14 hours ago

[-]

Only a third cheaper than Sonnet 4? Incrementally better I suppose.

> and minimizing sycophancy

Now we're talking about a good feature! Actually one of my biggest annoyances with Cursor (that mostly uses Sonnet).

"You're absolutely right!"

I mean not really Cursor, but ok. I'll be super excited if we can get rid of these sycophancy tokens.

nosefurhairdo

13 hours ago

[-]

In my early testing gpt5 is significantly less annoying in this regard. Gives a strong vibe of just doing what it's told without any fluff.

logicchains

14 hours ago

[-]

>Only a third cheaper than Sonnet 4?

The price should be compared to Opus, not Sonnet.

cco

13 hours ago

[-]

Wow, if so, 7x cheaper. Crazy if true.

tomrod

10 hours ago

[-]

Simon, as always, I appreciate your succinct and dedicated writeup. This really helps to land the results.

globular-toast

2 hours ago

[-]

This "system card" thing seems to have suddenly come out of nowhere. Signs of a cult forming. Is it just what we'd normally call a technical write up?

dragonwriter

2 hours ago

[-]

It’s a variation on “model card”, which has become a standard thing with AI models, but with the name changed because the wroteup covers toolchain as well as model information. But a PDF of the size of the document at issue is very much not the kind of concise document model cards are, its more the kind of technical report that a much more concise card would reference.

moralestapia

10 hours ago

[-]

Basically repeats what it's been out through the usual PR channels, just paraphrased.

No mention about the (missing) elephant on the room, where are the benchmarks?

@simonw has been compromised. Sad.

9 hours ago

[-]

I'm sorry I didn't say "independent benchmarks are not yet available" in my post, I say that so often on model launches I guess I took it as read this time.

isoprophlex

13 hours ago

[-]

Whoa this looks good. And cheap! How do you hack a proxy together so you can run Claude Code on gpt-5?!

dalberto

13 hours ago

[-]

Consider: https://github.com/musistudio/claude-code-router

or even: https://github.com/sst/opencode

Not affiliated with either one of these, but they look promising.

onehair

14 hours ago

[-]

> Definitely recognizable as a pelican

right :-D

cchance

12 hours ago

[-]

Its basically opus 4.1 ... but cheaper?

gwd

11 hours ago

[-]

Cheaper is an understatement... it's less than 1/10 for input and nearly 1/8 for output. Part of me wonders if they're using their massive new investment to sell API below-cost and drive out the competitor. If they're really getting Opus 4.1 performance for half of Sonnet compute cost, they've done really well.

diggan

11 hours ago

[-]

I'm not sure I'd be surprised, I've been playing around with GPT-OSS last few days, and the architecture seems really fast for the accuracy/quality of responses, way better than most local weights I've tried for the last two years or so. And since they released that architecture publicly, I'd imagine they're sitting on something even better privately.