FilterHN

https://x.com/polynoamial/status/1946478258968531288

og_kalu

11 days ago

[-]

From Noam Brown

"When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is."

and

"This was a small team effort led by @alexwei_ . He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at @OpenAI and the wider AI community."

gsf_emergency_2

11 days ago

[-]

"frontier" seems to be "zapad" for OpenAI

They have a parallel effort to corner Ramanujan called https://epoch.ai/frontiermath/tier-4

(& Problem 6, combinatorics, the one class of problems not yet fallen to AI?)

The hope for humanity is that of the big names associated to FrontierMath (starkly opposite to oAI proper) Daniel is the one youngish nonexsoviet guy :)

yahoozoo

11 days ago

[-]

That brand new technique? Training on the test data. /s

dinfinity

8 days ago

[-]

Do you have any proof to support this claim?

otabdeveloper4

10 days ago

[-]

Hey, cheating is also an artform.

rafael859

11 days ago

[-]

Interesting that the proofs seem to use a limited vocabulary: https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro...

Why waste time say lot word when few word do trick :)

Also worth pointing out that Alex Wei is himself a gold medalist at IOI.

throw310822

11 days ago

[-]

Interesting observation. One one hand, these resemble more the notes that an actual participant would write while solving the problem. Also, less words = less noise, more focus. But also, specifically for LLMs that output one token at a time and have a limited token context, I wonder if limiting itself to semantically meaningful tokens can be create longer stretches of semantically coherent thought?

btown

11 days ago

[-]

The original thread mentions “test-time compute scaling” so they had some architecture generating a lot of candidate ideas to evaluate. Minimizing tokens can be very meaningful from a scalability perspective alone!

11 days ago

[-]

This is just speculation but I wouldn't be surprised if there were some symbolic AI 'tricks'/tools (and/or modern AI trained to imitiate symbolic AI) under the hood.

beyonddream

11 days ago

[-]

He is talking about IMO (math olympiad) while he got gold at IOI (informatics olympiad) :)

11 days ago

[-]

> Also worth pointing out that Alex Wei is himself a gold medalist at IOI.

Terence Tao also called it, that the top LLMs would get gold this year in a recent podcast.

11 days ago

[-]

In transformers generating each token takes the same amount of time, regardless of how much meaning it carries. By cutting out the filler from the text, you get a huge speedup.

krackers

11 days ago

[-]

Except generating more tokens also effectively extends the computational power beyond the depth of the circuit, which is why chain of thought works in the first place. Even sampling only dummy tokens that don't convey anything still provides more computational power.

11 days ago

[-]

I mean, generating more tokens means you use more computing power, and there's som e evidence that not all of these filler words go to waste (esp since they are not really words, but vectors that can carry latent meaning), as models tend to become smarter when allowed to generate a lot of heeming and hawing.

It's been proven that this accidental computation is actually helping CoT models, but they're not supposed to work like that - they're supposed to generate logical observations and use said observations to work further towards the goal (and they primarily do do that).

Considering filler tokens occupy context space and are less useful than meaningful tokens, a model that tries to maximize useful results per amount of compute, you'd want a terse context window without any fluff.

taneq

11 days ago

[-]

Dummy tokens work for humans too. “Shhh I need to think!”

justinhj

11 days ago

[-]

"Let me be clear..."

johnpaulkiser

11 days ago

[-]

Are you saying "see the world?" or "seaworld"?

lukebechtel

11 days ago

[-]

whoah, very very interesting / telling.

tlb

11 days ago

[-]

I encourage anyone who thinks these are easy high-school problems to try to solve some. They're published (including this year's) at https://www.imo-official.org/problems.aspx. They make my head spin.

mauriziocalo

11 days ago

[-]

Related — these videos give a sense of how someone might actually go about thinking through and solving these kinds of problems:

- A 3Blue1Brown video on a particularly nice and unexpectedly difficult IMO problem (2011 IMO, Q2): https://www.youtube.com/watch?v=M64HUIJFTZM

-- And another similar one (though technically Putnam, not IMO): https://www.youtube.com/watch?v=OkmNXy7er84

- Timothy Gowers (Fields Medalist and IMO perfect scorer) solving this year’s IMO problems in “real time”:

-- Q1: https://www.youtube.com/watch?v=1G1nySyVs2w

-- Q4: https://www.youtube.com/watch?v=O-vp4zGzwIs

nmca

11 days ago

[-]

It takes Tim Gowers more than hour and a half to go through q4! (Sure, he could go faster without video. But Tim Gowers! An hour and a half!!)

https://secondthoughts.ai/p/solving-math-olympiad-problems

snewman

10 days ago

[-]

For people who prefer reading to watching videos, I wrote a detailed account of my process for solving one of last year's IMO problems, along with thoughts on how this relates to AI:

https://www.youtube.com/watch?v=csS4BjQuhCc

bko

11 days ago

[-]

I like watching youtube videos solving these problems. They're deceptively simple. I remember reading one:

x+y=1

xy=1

The incredible thing is the explanation uses almost all reasoning steps that I am familiar with from basic algebra, like factoring, quadratic formula, etc. But it just comes together so beautifully. It gives you the impression that if you thought about it long enough, surely you would have come up with the answer, which is obviously wrong, at least in my case.

11 days ago

[-]

This is slightly tedious to do by hand but there isn't really anything interesting going on in that problem - it's just solving a quadratic equation over the complex numbers.

roenxi

11 days ago

[-]

That isn't much of an argument; nothing in math is truly interesting if you take that approach. exp(i\pi)+1=0 could be said to be dis-interesting because it is just rotation on the complex plane. But it is the opposite - it is interesting because it turned out to be rotation on the complex plane but approached from summing infinite series.

Similarly you can say that solving a quadratic over complex numbers is dis-interesting, but it is actually an interesting puzzle because it is trying its best to pretend it isn't a quadratic. In many ways succeeding, it isn't a quadratic - there is no "2" in it.

11 days ago

[-]

It's "not interesting" because no novel insight has to be used in order to solve this. It's immediately obvious how to solve it, just follow the textbook procedure.

This is distinct both from other typical IMO problems that I've seen and from research mathematics which usually do require some amount of creativity.

> exp(i\pi)+1=0

If your definition of "exp(i*theta)" is literally "rotation of the number 1 by theta degrees counterclockwise", then indeed what you quoted is a triviality and contains no nugget of insight (how could it?).

It becomes nontrivial when your definition of "exp" is any of the following:

- The everywhere absolutely convergent power series sum_{i=0}^\infty z^n/n!

- The unique function solving the IVP y'=y, y(0)=1

- The unique holomorphic extension of the real-valued exponential function to the complex numbers

Going from any of these definitions to "exp(i*\pi)+1=0" from scratch requires quite a bit of clever mathematics (such as proving convergence of the various series, comparing terms, deriving the values of sin and cos at pi from their power series representation, etc.). That's definitely not something that a motivated high schooler would be able to derive from scratch.

roenxi

10 days ago

[-]

Those definitions of exp are all immediately obvious and nearly the definition of textbook, every university calculus course covers them. That is the issue with defining interesting as novel - nothing generally known is novel any more. And they don't require any special maths - sum_{i=0}^\infty z^n/n! is literally just multiplication and addition.

The long and short of it is it just isn't possible to tell someone that their problem isn't interesting. Interest isn't an inherent property of an equation, it is the state of mind of the person looking at the equation. And in this case the x+y/xy is a classic interesting puzzle despite (really because of) how well known the solution is.

10 days ago

[-]

> those definitions of exp are all immediately obvious

csomar

10 days ago

[-]

It is not interesting because there is no “real” solution (pun intended).

If you go to the complex plane, you are re-defining the plane. If you redefine the plane, then you can do anything. The puzzle is about confusing the observer who is expecting a solution in a certain dimension.

10 days ago

[-]

It's true that it might be unexpected that there is no real solution. I also wouldn't have intuited that from the problem statement itself.

However, it's not like you have to go out of your way to look for the complex numbers in some creative way. At some point while solving the quadratic equation you'll have to take the root of a negative number. So the only choice is to reach for the complex numbers, your hand is kinda forced.

xpressvideoz

11 days ago

[-]

I didn't know there were localized versions of the IMO problems. But now that I think of it, having versions of multiple languages is a must to remove the language barrier from the competitors. I guess having that many language versions (I see ~50 languages?) may make keeping the security of the problems considerably harder?

dmurray

11 days ago

[-]

The problems are chosen by representatives from all the countries. So every country has someone who knows the full exam before the participants get it. Security is on the honour system, but it seems to mostly work.

peteyreplies

10 days ago

[-]

iirc, the IMO system automatically translates the questions into 50 languages, after they are entered in English.

koakuma-chan

11 days ago

[-]

How do those compare to leetcode hard problems?

kenjackson

11 days ago

[-]

Depends on how hard, but the “average hard” leetcode problem is much easier. These will be more like the ACM ICPC level questions, which I’d put at the “hard hard” leetcode level (also this is a collegiate competition rather than high school, but with broader participation).

kappi

11 days ago

[-]

[flagged]

yongjik

11 days ago

[-]

All the past IMO problems are known to public and contestants practice on them. If solving an IMO problem is the simple matter of "looking at all the past problems and apply the same pattern," you'd expect human contestants to do a lot better.

kappi

11 days ago

[-]

I think you haven't gone thru AMC8, AMC10, AIME competitions. If you are so confident, try giving an unsolved math problem outside the high school math competitions.

11 days ago

[-]

It's still tough, though practice obviously helps.

throw310822

11 days ago

[-]

I think you're joking, but you never know :)

hislaziness

11 days ago

[-]

Terence Tao on the matter - https://imgur.com/a/terence-tao-on-supposed-gold-imo-sMKP0bm

mananaysiempre

11 days ago

[-]

Actual post instead of ad-decorated screnshot: https://mathstodon.xyz/@tao/114881418225852441 (thread continued in https://mathstodon.xyz/@tao/114881419368778558 and https://mathstodon.xyz/@tao/114881420636881657).

jiggawatts

10 days ago

[-]

Fair points, but the reason everyone is amazed is that five years ago this was entirely impossible for computers irrespective of the competition format or rules.

It’s as-if we had learned whale song, and then within two years a whale had won a Nobel prize for their research in high pressure aquatic environments. You’d similarly get naysayers debating the finer points of what special advantage whales may have in that particular field, neglecting the stunned shock of the general population — “Whales are publishing research papers now!? Award winning papers at that!?”

vineyardmike

10 days ago

[-]

And it’s very impressive that whales can write papers.

A computer system that can perform these tasks that were unthinkably complex a few years ago is quite impressive. That is a big win, and it can be celebrated. They don’t need to be celebrated as a “gold medalist” if they didn’t perform according to the same criteria as a gold-medalist.

jiggawatts

10 days ago

[-]

The same argument works in the reverse: the computer was not given 18 years of calendar time to study. It doesn’t have the benefit of a meat brain with a billion years of evolution optimising it for efficient thought. Etc…

That the architecture of the machine mind is different to ours is the point.

If it was identical then nobody would be excited! That’s a high school student equipped with a biological brain.

That the computer used silicon, that it used parallel agents, that it used whatever it has in its programming is irrelevant except in the sense that these very differences make the achievement more amazing — not less.

catigula

9 days ago

[-]

Win for whom?

dlubarov

10 days ago

[-]

It's a good point - IMO is about performance under some specific resource constraints, and those constraints don't make sense for AIs. But I wonder how far we are from an AI solving a well-studied unsolved math problem. That would be more of a decisive "quantum supremacy" type milestone.

[1] https://lexfridman.com/terence-tao-transcript/

xpressvideoz

10 days ago

[-]

> there will be a proposal at some point to actually have an AI math Olympiad where at the same time as the human contestants get the actual Olympiad problems, AI’s will also be given the same problems, the same time period and the outputs will have to be graded by the same judges, which means that it’ll have be written in natural language rather than formal language.[1]

Last month, Tao himself said that we can compare humans and AIs at IMO. He even said such AI didn't exist yet and AIs won't beat IMO in 2025. And now that AIs can compete with humans at IMO under the same conditions that Tao mentioned, suddenly it becomes an apples-to-oranges comparison?

peteyreplies

10 days ago

[-]

To be clear, AIs didn't beat IMO, or the best human competitors at IMO, on these terms.

darkoob12

10 days ago

[-]

He is basically asking OpenAI to publish their methodology so we can understand the real state of AI in solving math problems.

dylanbyte

11 days ago

[-]

These are high school level only in the sense of assumed background knowledge, they are extremely difficult.

Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.

This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.

The answers are not in the training data.

This is not a model specialized to IMO problems.

11 days ago

[-]

Are you sure this is not specialized to IMO? I do see the twitter thread saying it's "general reasoning" but I'd imagine they RL'd on olympiad math questions? If not I really hope someone from OpenAI says that bc it would be pretty astounding.

stingraycharles

11 days ago

[-]

They also said this is not part of GPT-5, and “will be released later”. It’s very, very likely a model specifically fine-tuned for this benchmark, where afterwards they’ll evaluate what actual real-world problems it’s good at (eg like “use o4-mini-high for coding”).

11 days ago

[-]

Humans who excel at IMO questions are also "fine tuned" on them in the sense that they practice them for hundreds of hours

SiempreViernes

11 days ago

[-]

Sure, but nobody is using their IMO score to prove they are superintelligent and pulling it off in wider groups.

sebmellen

11 days ago

[-]

I’ve seen IMO rank used to justify more than one $100m+ seed round.

htrp

9 days ago

[-]

Someone was burned by Cognition?

wanderlust123

11 days ago

[-]

I’m pretty sure that a high score in imo is a sign of high intelligence.

10 days ago

[-]

And the opposite of ADHD.

Jensson

10 days ago

[-]

Their hardware isn't fine tuned to it though, it uses the same general intelligence hardware that all other humans use.

So its a big difference if you use a general intelligence system and makes it do well in math, or when you create a specialized system that is only good at math and can't be used to get good in other areas.

10 days ago

[-]

This IMO LLM isn't using fine tuned hardware either.

11 days ago

[-]

From my vague rememberance of doing data science years ago, it's very hard not to leak the training set.

Basically how you do RL is that you make a set of training examples of input-output pairs, and set aside a smaller validation set, which you never train on, to check if your model's doing well.

What you do is you tweak the architecture and the training set until it does well on the validation set. By doing so, you inadvertedly leak info about the training set. Perhaps you choose an architecture which does well on the validation set. Perhaps you train more on examples more like ones being validated.

Even without the explicit intent to cheat, it's very hard to avoid this contamination, if you chose a different validation set, you'd end up with a different model.

SonOfLilit

11 days ago

[-]

The questions were published a few days ago. The 2025 IMO just ended.

KoolKat23

11 days ago

[-]

And the model was in lockdown to avoid this.

11 days ago

[-]

>> This is not a model specialized to IMO problems.

How do you know?

joe_the_user

11 days ago

[-]

Yeah, looking at the GP ... say a sequence of things that are true and plausible. That add your strong, unsupported claim at the end. I remember the approach from when I studied persuasion techniques...

11 days ago

[-]

> The answers are not in the training data.

> This is not a model specialized to IMO problems.

Any proof?

11 days ago

[-]

There's no proof that this is not made up, let alone any shred of transparency or reproducibility.

There are trillions of dollars at stake in hyping up these products; I take everything these companies write with a cartload of salt.

BoorishBears

11 days ago

[-]

No, and they're lying on the most important claim: that this is not a model specialized to IMO problems.

From the thread:

> just to be clear: the IMO gold LLM is an experimental research model.

The thread tried to muddy the narrative by saying the methodology can generalize, but no one is claiming the actual model is a generalized model.

There'd be a massively different conversation needed if a generalized model that could become the next iteration of ChatGPT had achieved this level of performance.

AIPedant

11 days ago

[-]

It almost certainly is specialized to IMO problems, look at the way it is answering the questions: https://xcancel.com/alexwei_/status/1946477742855532918

E.g here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig

Frankly it looks to me like it's using an AlphaProof style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.

https://x.com/alexwei_/status/1946477745627934979?s=46&t=Hov...

11 days ago

[-]

OpenAI explicitly stated that it is natural language only, with no tools such as Lean.

https://x.com/polynoamial/status/1946478249187377206?s=46&t=...

redlock

11 days ago

[-]

Nope

AIPedant

11 days ago

[-]

If you don't have a Twitter account then x.com links are useless, use a mirror: https://xcancel.com/polynoamial/status/1946478249187377206

Anyway, that doesn't refute my point, it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific" but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.

11 days ago

[-]

We can only go off their word unfortunately and they say no formal math. so I assume it's being eval'd by a verifier model instead of a formal system. There's actually some hints of this b/c geometry in Lean is not that well developed so unless they also built their own system it's hard to do it formally (though their P2 proof is by coordinate bash (computation by algebra instead of geometric construction) so it's hard to tell.

skdixhxbsb

11 days ago

[-]

> We can only go off their word

We’re talking about Sam Altman’s company here. The same company that started out as a non profit claiming they wanted to better the world.

Suggesting they should be given the benefit of the doubt is dishonest at this point.

11 days ago

[-]

“they must be lying because I personally dislike them”

This is why HN threads about AI have become exhausting to read

nosianu

11 days ago

[-]

In general I agree with you, but I see the point of requiring proof for statements made by them, instead of accepting them at face value. In those cases, given previous experiences and considering that they benefit from making them, if they are believed, the burden of proof should be on those making these statements, not on those questioning them, no?

Those models seem to be special and not part of their normal product line, as is pointed out in the comments here. I would assume that in that case they indeed had the purpose of passing these tests in mind when creating them. Or was it created for something different, and completely by chance they discovered they could be used for the challenge, unintentionally?

otabdeveloper4

11 days ago

[-]

Yeah, that's how the concept of "reputation" works.

11 days ago

[-]

No, they are likely lying, because they have huge incentives to lie

dandanua

11 days ago

[-]

You don't need specialized tooling like Lean if you have enough training data with statements written in the natural language, I suppose. But the use of AlphaProof/AlphaGeometry type of learning is almost certain. And I'm sure they have spent a lot of compute to produce solutions, $10k is not a problem for them.

The bigger question is - why should everyone be excited by this? If they don't plan to share anything related to this AI model back to humanity.

11 days ago

[-]

I actually think this “cheating” is fine. In fact it’s preferable. I don’t need an AI that can act as a really expensive calculator or solver. We’ve already built really good calculators and solvers that are near optimal. What has been missing is the abductive ability to successfully use those tools in an unconstrained space with agency. I find really no value in avoiding the optimal or near optimal techniques we’ve devised rather than focusing on the harder reasoning tasks of choosing tools, instrumenting them properly, interpreting their results, and iterating. This is the missing piece in automated reasoning after all. A NN that can approximate at great cost those tools is a parlor trick and while interesting not useful or practical. Even if they have some agent system here, it doesn’t make the achievement any less that a machine can zero shot do as well as top humans at incredibly difficult reasoning problems posed in natural language.

SJC_Hacker

11 days ago

[-]

> I actually think this “cheating” is fine. In fact it’s preferable.

The thing with IMO, is the solutions are already known by someone.

So suppose the model got the solutions beforehand, and fed them into the training model. Would that be an acceptable level of "cheating" in your view?

11 days ago

[-]

Surely you jest. The cheating would be the same cheating as any other situation - someone inside the IMO skipping the questions and answers to people outside then that being used to compete. Fine - but why? If this were discovered then it would be disastrous for everyone involved, and for what? A noteworthy HN link? The downside would be international scandal and careers destroyed. The upside is imperceptible.

Finally, even if you aligned the model with the answers its weight shift of such an enormous model would be inconsequential. You would need to prime the context or boost the weights. All this seems like absurd lengths to go to to cheat on this one thing rather than focusing your energies on actually improving model performance. The payout for OpenAI isn’t a gold medal in the IMO it’s having a model that can get a gold medal at IMO then selling it. But it has to actually be capable of doing what’s on the tin otherwise their customers will easily and rapidly discover this.

Sorry, I like tin foil as much as anyone else, but this doesn’t seem credibly likely given the incentive structure.

Jensson

10 days ago

[-]

Yet that level of cheating happens all the time because its very unlikely to be discovered. Sometimes its just done by people lower down to increase their own career, since they don't have as much to lose, but cheating does happen and its not that unlikely especially when salaries are this high.

signatoremo

11 days ago

[-]

Why is "almost certainly"? The link you provided has this to say:

> 5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.

BoorishBears

11 days ago

[-]

Also from the thread:

> 8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model.

And from Sam Altman:

> we are releasing GPT-5 soon but want to set accurate expectations: this is an experimental model that incorporates new research techniques we will use in future models.

The wording you quoted is very tricky: the method used to create the model is generalizable, but the model is not a general-use model.

If I have a post-training method that allows a model excel at a narrow task, it's still a generalizable method if there's a wide range of narrow tasks that it works on.

11 days ago

[-]

Since this looks like geometric proof, I wonder if the AI operates only on logical/mathematical statements or it actually somehow 'visualizes' the proof like a human would while solving.

gniv

11 days ago

[-]

From that thread: "The model solved P1 through P5; it did not produce a solution for P6."

It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team got only 21/42 points on it. In most other teams nobody solved it.

gus_massa

11 days ago

[-]

In the IMO, the idea is that the first day you get P1, P2 and P3, and the second day you get P4, P5 and P6. Usually, ordered by difficulty, they are P1, P4, P2, P5, P3, P6. So, usually P1 is "easy" and P6 is very hard. At least that is the intended order, but sometime reality disagree.

Edit: Fixed P4 -> P3. Thanks.

masterjack

11 days ago

[-]

In this case P6 was unusually hard and P3 was unusually easy https://sugaku.net/content/imo-2025-problems/

qingcharles

10 days ago

[-]

Yikes. 30 years ago I would eat this stuff up and I was the lead dev on 3D engines.

Now I can't even make heads-or-tails of what P6 is even asking (＾▽＾)

thundergolfer

11 days ago

[-]

You have P4 twice in there, latter should be 3

tantalor

11 days ago

[-]

That's very silly. They should do the order like this:

Day 1: P1 P3 P5 (odds)

Day 2: P2 P4 P6 (evens)

Then the problem # is the difficulty.

gus_massa

10 days ago

[-]

On one hand, it's very difficult to break traditions.

On the other hand, the order P1 P4 P2 P5 P3 P6 is not always true.

Usually there is only one problem of geometry per day.

Some problems involve a brilliant trick and another analyzing many cases. You don't want too "long" problems the same day. (Sometimes there is solution that the Jury didn't see and the problem changes it of made-up-category.)

Some problems are difficult but have a nice easy/medium intermediate step that assigns some points.

There are a lot of implicit restrictions that can affect the order of the problem.

Also, sometimes the Jury miscalculate how difficult is a problem and it's easier or more difficult than expected. Or the Jury completely miss an alternative easier solution.

The only sure part is the order that they are printed in the paper.

11 days ago

[-]

I think from Canada team someone solved it but among all, its very few

11 days ago

[-]

To me, this is a tell of human-involvement in the model solution.

There is no reason why machines would do badly on exactly the problem which humans do badly as well - without humans prodding the machine towards a solution.

Also, there is no reason why machines could not produce a partial or wrong answer to problem 6 which seems like survivor bias to me. ie, that only correct solutions were cherrypicked.

rjtobin

11 days ago

[-]

There is at least one reason - it was a harder problem. Agreed that which IMO problems are hard for a human IMO participant and which are hard for an LLM are different things, but seems like they should be positively correlated at least?

cyberax

11 days ago

[-]

IMO problems are not hard. They are merely tricky. They test primarily pattern recognition capabilities, requiring that flash of insight to find the hidden clue.

So it's no wonder that AI can solve them so well. Neural networks are great at pattern recognition.

A better test is to ask the AI to come up with good Olympiad problems. I went ahead and tried, and the results are average.

cornholio

10 days ago

[-]

While it's zero proof, since the data used for training is human generated, you raise an interesting point: the financial stakes are so high in LLM research that we should be skeptical of all frontier results.

An internet connected machine that reasons like humans was by default considered a fraud 5 years ago; it's not unthinkable some researchers would fake it till they made it, but of course you need proof of it before making such an accusation.

senko

11 days ago

[-]

> There is no reason why machines would do badly on exactly the problem which humans do badly as well

Unless the machine is trained to mimic human thought process.

OtherShrezzing

11 days ago

[-]

Maybe it’s a hint that our current training techniques can create models comparable to the best humans in a given subject, but that’s the limit.

11 days ago

[-]

We've hit the limit of 'our current training techniques'? This result literally used newly developed techniques that surprised researchers at OpenAI.

Noam Brown: 'This result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI.'

So your thesis is that these new techniques - which just produced unexpected breakthroughs - represent some kind of ceiling? That's an impressive level of confidence about the limits of methods we apparently just invented in a field which seems to, if anything, be accelerating.

Jensson

10 days ago

[-]

> Maybe it’s a hint that our current training techniques can create models comparable to the best humans in a given subject, but that’s the limit.

IMO is not the best humans in a given subject, college competitions are at a much higher level than high school competitions, and you have even higher level above that since college competitions are still limited to students.

raincole

11 days ago

[-]

Lmao.

You know IMO questions are not all equally difficult, right? They're specifically designed to vary in difficulty. The reason that problem 6 is hard for both humans and LLM is... it's hard! What a surprise.

11 days ago

[-]

Lol the OpenAI naysayers on this site are such conspiracy theorists.

There are many things that are hard for AI’s for the same reason they’re hard for humans. There are subtleties in complexity that make challenging things universal.

Obviously the model was trained on human data so its competencies lie in what other humans have provided input for over the years in mathematics, but that isn’t data contamination, that’s how all humans learn. This model, like the contestants, never saw the questions before.

https://x.com/natolambert/status/1946569475396120653

11 days ago

[-]

Google also joined IMO, and got gold prize.

OAI announced early, probably we will hear announcement from Google soon.

futureshock

11 days ago

[-]

Google’s AlphaProof, which got a silver last year, has been using a neural symbolic approach. This gold from OpenAI was pure LLM. We’ll have to see what Google announces, but the LLM approach is interesting because it will likely generalize to all kinds of reasoning problems, not just mathematical proofs.

11 days ago

[-]

OpenAI’s systems haven’t been pure language models since the o models though, right? Their RL approach may very well still generalize, but it’s not just a big pre-trained model that is one-shotting these problems.

The key difference is that they claim to have not used any verifiers.

beering

11 days ago

[-]

What do you mean by “pure language model”? The reasoning step is still just the LLM spitting out tokens and this was confirmed by Deepseek replicating the o models. There’s not also a proof verifier or something similar running alongside it according to the openai researchers.

If you mean pure as in there’s not additional training beyond the pretraining, I don’t think any model has been pure since gpt-3.5.

gallerdude

10 days ago

[-]

Local models you can get just the pretrained versions of, no RLHF. IIRC both Llama and Gemma make them available.

alach11

11 days ago

[-]

> it will likely generalize to all kinds of reasoning problems, not just mathematical proofs

Big if true. Setting up an RL loop for training on math problems seems significantly easier than many other reasoning domains. Much easier to verify correctness of a proof than to verify correctness (what would this even mean?) for a short story.

kevinventullo

11 days ago

[-]

I’m much more excited about the formalized approach, as LLM’s are susceptible to making things up. With formalization, we can be mathematically certain that a proof is correct. This could plausibly lead to machines surpassing humans in all areas of math. With a “pure English” approach, you still need a human to verify correctness.

csomar

10 days ago

[-]

Neither Gemini or OpenAI have open models. We don’t know for sure what’s happening underneath.

bjackman

11 days ago

[-]

Given the Noam Brown comment ("It was a surprise even to many researchers at OpenAI") it seems extra surprising if multiple labs achieved this result at once.

There's a comment on this twitter thread saying the Google model was using Lean, while IIUC the OpenAI one was pure LLM reasoning (no tools). Anyone have any corroboration?

In a sense it's kinda irrelevant, I care much more about the concrete things AI can achieve, than the how. But at the same time it's very informative to see the limits of specific techniques expand.

egillie

11 days ago

[-]

Explains why they’d announce on a Saturday

https://x.com/polynoamial/status/1946478249187377206

11 days ago

[-]

Noam Brown:

> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.

> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.

I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.

1. https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro...

lossolo

11 days ago

[-]

> it’s also more efficient [than o1 or o3] with its thinking.

"So under his saturate response, he never loses. For her to win, must make him unable at some even -> would need Q_{even-1}>even, i.e. some a_j> sqrt2. but we just showed always a_j<=c< sqrt2. So she can never cause his loss. So against this fixed response of his, she never wins (outcomes: may be infinite or she may lose by sum if she picks badly; but no win). So she does NOT have winning strategy at λ=c. So at equality, neither player has winning strategy."[1]

Why use lot word when few word do trick?

nightski

11 days ago

[-]

That's a big leap from "answering test questions" to "contributing to scientific discovery".

transcriptase

11 days ago

[-]

Having spent tens of thousands of hours contributing to scientific discovery by reading dense papers for a single piece of information, reverse engineering code written by biologists, and tweaking graphics to meet journal requirements… I can say with certainty it’s already contributing by allowing scientists to spend time on science versus spending an afternoon figuring out which undocumented argument in a R package from 2008 changes chart labels.

highwaylights

11 days ago

[-]

This. Even if LLM’s ultimately hit some hard ceiling as substantially-better-Googling-automatons they would already accelerate all thought-based work across the board, and that’s the level they’re already at now (arguably they’re beyond that).

We’re already at the point where these tools are removing repetitive/predictable tasks from researchers (and everyone else), so clearly they’re already accelerating research.

13years

11 days ago

[-]

Not sure how you get around the contamination problems. I use these everyday and they are extremely problematic about making errors that are hard to perceive.

They are not reliable tools for any tasks that require accurate data.

j2kun

11 days ago

[-]

That is not what they mean by contributing to scientific discovery.

transcriptase

11 days ago

[-]

Perhaps not, but my point stands from personal experience and knowing what’s going on in labs right now that AI is greatly contributing to research even if it’s not doing the parts that most people think of when they think science. A sufficiently advanced AI in the near term isn’t going to start churning out novel hypotheses and being able to collect non-online data without first being able to secure funding to hire grad students or whatever robots can replace those.

wouldbecouldbe

11 days ago

[-]

Yeah that’s the dream, but same as with the bar exams, they are fine tuning the models for specific tests. Which probably the model even has been trained on previous version of those tests

11 days ago

[-]

What's the clear path to improved efficiency now that we've reached peak data?

11 days ago

[-]

> now that we've reached peak data?

A) that's not clear

B) now we have "reasoning" models that can be used to analyse the data, create n rollouts for each data piece, and "argue" for / against / neutral on every piece of data going into the model. Imagine having every page of a "short story book" + 10 best "how to write" books, and do n x n on them. Huge compute, but basically infinite data as well.

We went from "a bunch of data" to "even more data" to "basically everything we got" to "ok, maybe use a previous model to sort through everything we got and only keep quality data" to "ok, maybe we can augment some data with synthetic datasets from tools etc" to "RL goes brrr" to (point B from above) "let's mix the data with quality sources on best practices".

9 days ago

[-]

Are you basically saying synthetic data and having a bunch of models argue with each other to distill the most agreeable of their various outputs solves the issue of peak data?

Because from my vantage point, those have not given step changes in AI utility the way crunching tons of data did. They have only incrementally improved things

11 days ago

[-]

A) We are out of the Internet-scale-for-free data. Of course the companies deploying LLM based systems at massive scale are of course ingesting a lot of human data from their users, that they are seeking to use to further improve their models.

B) Has learning though "self-play" (like with AlphaZero etc) been demonstrated working for improving LLMs? What is the latest key research on this?

11 days ago

[-]

Certainly the models have orders of magnitude more data available to them than the smartest human being who ever lived does/did. So we can assume that if the goal is "merely" superhuman intelligence, data is not a problem.

It might be a constraint on the evolution of godlike intelligence, or AGI. But at that point we're so far out in bong-hit territory that it will be impossible to say who's right or wrong about what's coming.

Has learning though "self-play" (like with AlphaZero etc) been demonstrated working for improving LLMs?

My understanding (which might be incorrect) is that this amounts to RLHF without the HF part, and is basically how DeepSeek-R1 was trained. I recall reading about OpenAI being butthurt^H^H^H^H^H^H^H^H concerned that their API might have been abused by the Chinese to train their own model.

11 days ago

[-]

Superhuman capability within the tasks that are well represented in the dataset, yes. If one takes the view that intelligence is the ability to solve novel problems (ref F. Chollet), then the amount of data alone might not take us to superhuman intelligence. At least without new breakthroughs in the construction of models or systems.

R1 managed to replicate a model on the level of one one they had access to. But as far as I know they did not improve on its predictive performance? They did improve in inference time, but that is another thing. The ability to replicate a model is well demonstrated and quite common practice for some years already, see teacher-student distillation.

jsnell

11 days ago

[-]

The thing is, people claimed already a year or two ago that we'd reached peak data and progress would stall since there was no more high-quality human-written text available. Turns out they were wrong, and if anything progress accelerated.

The progress has come from all kinds of things. Better distillation of huge models to small ones. Tool use. Synthetic data (which is not leading to model collapse like theorized). Reinforcement learning.

I don't know exactly where the progress over the next year will be coming from, but it seems hard to believe that we'll just suddenly hit a wall on all of these methods at the same time and discover no new techniques. If progress had slowed down over the last year the wall being near would be a reasonable hypothesis, but it hasn't.

Aperocky

11 days ago

[-]

I'm loving it, can't wait to deploy this stuff locally. The mainframe will be replaced by commodity hardware, OpenAI will stare down the path of IBM unless they reinvent themselves.

10 days ago

[-]

> people claimed already a year or two ago that we'd reached peak data and progress would stall

The claim was we've reached peak data (which, yes we did) and that progress would have to come from some new models or changes. Everything you described has made incremental changes, not step changes. Incremental changes are effectively stalled progress. Even this model has no proof and no release behind it

11 days ago

[-]

there is also huge realm of private/commercial data which is not absorbed by LLMs yet. I think there are way more private/commercial data than public data.

qingcharles

10 days ago

[-]

We're so far from peak data that we've barely even scratched the surface, IMO.

10 days ago

[-]

What changed from this announcement?

> “We’ve achieved peak data and there’ll be no more,” OpenAI’s former chief scientist told a crowd of AI researchers.

strangeloops85

11 days ago

[-]

I assume there was tool use in the fine tuning?

nmca

11 days ago

[-]

There wasn’t in the CoT for these problems.

wds

11 days ago

[-]

> I think we’re close to AI substantially contributing to scientific discovery.

The new "Full Self-Driving next year"?

Velorivox

11 days ago

[-]

"AI" already contributes "substantially" to "scientific discovery". It's a very safe statement to make, whereas "full self-driving" has some concrete implications.

mort96

11 days ago

[-]

"AI" here means language models. Machine learning has been contributing to scientific discovery for ages, but this new wave of hype that marketing departments are calling "AI" are language models.

Aperocky

11 days ago

[-]

Well I also think full self-driving contribute substantially to navigating the car on the street..

oceanplexian

11 days ago

[-]

I know it’s a meme but there actually are fully self driving cars, they make thousands of trips every day in a couple US cities.

elefanten

11 days ago

[-]

The capitalization makes it a Tesla reference, which has notoriously been promising that as an un-managed consumer capability for years, while it is not yet launched even now.

bigyabai

11 days ago

[-]

> in a couple US cities

FWIW, when you get this reductive with your criterion there were technically self-driving cars in 2008 too.

skybrian

11 days ago

[-]

To be a bit more specific: no, they were not routinely making thousands of taxi rides with paying customers every day in 2008.

tomrod

11 days ago

[-]

We can go further. Automated trains have cars. Streetcars are automatable since the track is fixed.

And both of these reduce traffic

manmal

11 days ago

[-]

I thought FSD has to be at least level 4 to be called that.

kakapo5672

11 days ago

[-]

As an aside, that is happening in China right now in commercial vehicles. I rode a robotaxi last month in Beijing, and those services are expanding throughout China. Really impressive.

tim333

11 days ago

[-]

We have Waymo and AlphaFold.

11 days ago

[-]

How is a claim, "clear evidence" to anything?

mitthrowaway2

11 days ago

[-]

I read the GP's comment as "but [assuming this claim is correct], this is clear evidence to the contrary."

11 days ago

[-]

Most evidence you have about the world is claims from other people, not direct experiment. There seems to be a thought-terminating cliche here on HN, dismissing any claim from employees of large tech companies.

Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to. Noam Brown is a well known researcher in the field and I see no reason to doubt these claims other than a vague distrust of OpenAI or big tech employees generally which I reject.

11 days ago

[-]

> I judge people's trustworthiness individually and not solely by the organization they belong to

This is certainly a courageous viewpoint – I imagine this makes it very hard for you to engage in the modern world? Most of us are very bound by institutions we operate in!

11 days ago

[-]

Over the years my viewpoint has led to great success in predicting the direction and speed of development of many technologies, among other things. As a result, by objective metrics of professional and financial success, I have done very well. I think your imagination is misleading you.

zer00eyz

11 days ago

[-]

> dismissing any claim from employees of large tech companies

Me: I have a way to turn lead into gold.

You: Show me!!!

Me: NO (and then spends the rest of my life in poverty).

Cold Fusion (physics not the programing language) is the best example of why you "Show your work". This is the Valley we're talking about. It's the thudnderdome of technology and companies. If you have a meaningful breakthrough you don't talk about it you drop it on the public and flex.

mlyle

11 days ago

[-]

I don't think this is a reasonable take. Some people/organizations send signals about things that we're not ready to fully drop it on the world. Others consider those signals in context (reputation of sender, prior probability of being true, reasons for sender to be honest vs. deceptive, etc).

When my wife tells me there's a pie in the oven and it's smelling particularly good, I don't demand evidence or disbelieve the existence of the pie. And I start to believe that it'll probably be a particularly good pie.

This is from OpenAI. Here they've not been so great with public communications in the past, and they have a big incentive in a crowded marketplace to exaggerate claims. On the other hand, it seems like a dumb thing to say unless they're really going to deliver that soon.

zer00eyz

11 days ago

[-]

> Some people/organizations send signals about things that we're not ready to fully drop it on the world.

This is called marketing.

> When my wife tells me there's a pie in the oven and it's smelling particularly good, I don't demand evidence

Because you have evidence, it smells.

And if later your ask your wife "where is the pie" and she says "I sprayed pie scent in the air, I was just singling" how are you going to feel?

Open AI spent its "fool us once" card already. Doing things this way does not earn back trust, failure to deliver (and they have done that more than once) ... See staff non disparagement, see the math fiasco, see open weights.

mlyle

11 days ago

[-]

> This is called marketing.

Many signals are marketing, but the purpose of signals is not purely to develop markets. We all have to determine what we think will happen next and how others will act.

> Because you have evidence, it smells.

I think you read that differently than what I intended to write -- she claims it smells good.

> Open AI spent its "fool us once" card already.

> > This is from OpenAI. Here they've not been so great with public communications in the past, and they have a big incentive in a crowded marketplace to exaggerate claims.

11 days ago

[-]

A thought-terminating cliché? Not at all, certainly not when it comes to claims of technological or scientific breakthroughs. After all, that's partly why we have peer review and an emphasis on reproducibility. Until such a claim has been scrutinised by experts or reproduced by the community at large, it remains an unverified claim.

>> Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to.

That has nothing to do with anything I said. A claim can be false without it being fraudulent, in fact most false claims are probably not fraudulent; though, still, false.

Claims are also very often contested. See e.g. the various claims of Quantum Superiority and the debate they have generated.

Science is a debate. If we believe everything anyone says automatically, then there is no debate.

og_kalu

10 days ago

[-]

They don't give a lot of details but they give enough for it to be pretty hard to say the claim is false but unfraudulent.

Some researchers got a breakthrough and decided to share right then rather than the months later it would take for a viable product. It happens, researchers are humans after all and i'm generally glad to take a peek at the actual frontier rather than what's behind by many months.

You can and it's fair to ignore such claims until that part but i think anything more than that is fairly uncharitable for the situation.

8 days ago

[-]

I'm not ignoring it. I'm waiting to see evidence of it. Is that uncharitable?

11 days ago

[-]

OpenAI have already shown us they aren’t trustworthy. Remember the FrontierMath debacle?

11 days ago

[-]

It's only a "debacle" if you already assume OpenAI isn't trustworthy, because they said they don't train on the test set. I hope you can see that presenting your belief that they lied about training on the test set as evidence of them being untrustworthy is a circular argument. You're assuming the thing you're trying to prove.

The one OpenAI "scandal" that I did agree with was the thing where they threatened to cancel people's vested equity if they didn't sign a non-disparagement agreement. They did apologize for that one and make changes. But it doesn't have a lot to do with their research claims.

I'm open to actual evidence that OpenAI's research claims are untrustworthy, but again, I also judge people individually, not just by the organization they belong to.

11 days ago

[-]

They funded the entire benchmark and didn’t disclose their involvement. They then proceeded to make use of the benchmark while pretending like they weren’t affiliated with EpochAI. That’s a huge omission and more than enough reason to distrust their claims.

11 days ago

[-]

IMO their involvement is only an issue if they gained an advantage on the benchmark by it. If they didn't train on the test set then their gained advantage is minimal and I don't see a big problem with it nor do I see an obligation to disclose. Especially since there is a hold-out set that OpenAI doesn't have access to, which can detect any malfeasance.

cycomanic

11 days ago

[-]

It's typically difficult to find direct evidence for bias. That is why rules for conflict of interest and disclosure are strict in research and academia. Crucially, something is a conflict of interest if it could be perceived as a conflict of interest by someone external, so it doesn't matter if you think you could judge fairly, it's important if someone else might doubt you could.

Not disclosing a conflict of interest is generally considered a significant ethics violation, because it reduces trust in the general scientific/research system. Thus OpenAI has become untrustworthy in many people's view irrespective if their involvement with the benchmarks creation affected their results or not.

11 days ago

[-]

There’s no way to figure out whether they gained an advantage. We have to trust their claims, which again, is an issue for me after finding out they already lied.

11 days ago

[-]

Lied about what? Your only claim so far is that they failed to disclose something that in my opinion didn't need to be disclosed.

11 days ago

[-]

I’d expect any serious company to disclose conflicts of interest.

11 days ago

[-]

Thing is, for example, all of classical physics can be derived from Newton's laws, Maxwell's equations and the laws of Thermodynamics, all of which can be written on a slip of paper.

A sufficiently brilliant and determined human can invent or explain everything armed only with this knowledge.

There's no need to train him on a huge corpus of text, like they do with ChatGPT.

Not sure what this model's like, but I'm quite certain it's not trained on terabytes of Internet and book dumps, but rather is trained for abstract problem solving in some way, and is likely much smaller than these trillion parameter SOTA transformers, hence is much faster.

tim333

11 days ago

[-]

If you look at the history of physics I don't think it really worked like that. It took about three centuries from Newton to Maxwell because it's hard to just deduce everything from basic principles.

11 days ago

[-]

I think you misundertand me, I'm making some pie in the sky statement about AI being able to discover the laws of nature in an afternoon. I'm just making the observation that if you know the basic equiations, and enough math (which is about multivariate calc), you can derive every single formula in your Physics textbook (and most undergrads do as part of their education).

Since smart people can derive a lot of knowledge from a tiny set of axioms, smart AIs should be able to as well, which means you don't need to rely on a huge volume of curated information. Which means that needing to invest the internet and training on a terabyte of text might not be how these newer models are trained, and since they don't need to learn that much raw information, they might be smaller and faster.

11 days ago

[-]

There's no evidence this model works like that. The "axioms" for counting the number of r's in a word are magnitudes simpler than classical physic's, and yet it took a few years to get that right. It's always been context, not derivation of logic.

10 days ago

[-]

First, false equivalence. The 'strawberry' problem was because LLMs operate not on text directly, but on embedding vectors, which made it hard for it to manipulate the syntax of language directly. This does not prevent it from properly doing math proofs.

Second, we know nothing about these models or how they work and trained, and indeed, if they can do these things or not. But a smart human could (by smart I mean someone who gets good grades at engineering school effortlessly, not Albert Einstein)

lostmsu

10 days ago

[-]

Right, humans are pretrained on terabytes of sensory data instead.

11 days ago

[-]

And the billions of years of evolution and the language that you use to explain the task to him and and the schooling he needs to understand what you're saying it and... and and and?

johnecheck

11 days ago

[-]

Wow. That's an impressive result, but how did they do it?

Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.

11 days ago

[-]

Why is that less exciting? A machine competing in an unconstrained natural language difficult math contest and coming out on top by any means is breath taking science fiction a few years ago - now it’s not exciting? Regardless of the tools for verification or even solvers - why is the goal post moving so fast? There is no bonus for “purity of essence” and using only neural networks. We live in an era where it’s hard to tell if machines are thinking or not, which for since the first computing machines was seen as the ultimate achievement. Now we Pooh Pooh the results of each iteration - which unfold month over month not decade over decade now.

You don’t have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques used. In fact - you should be excited if we’ve started to break out of the limitations of forcing NN to be load bearing in literally everything. That’s a sign of maturing technology not of limitations.

11 days ago

[-]

>> Why is that less exciting? A machine competing in an unconstrained natural language difficult math contest and coming out on top by any means is breath taking science fiction a few years ago - now it’s not exciting?

Half the internet is convinced that LLMs are a big data cheating machine and if they're right then, yes, boldly cheating where nobody has cheated before is not that exciting.

falcor84

11 days ago

[-]

I don't get it, how do you "big data cheat" an AI into solving previously unencountered problems? Wouldn't that just be engineering?

QuadmasterXLII

11 days ago

[-]

I don’t know how you’d cheat at it either, but if you could, it would manifest as the model getting gold on the test and then in six months when its released to the public, exhibiting wild hallucinations and basic algebra errors. I don’t kbow if that’s how it’ll play out this time but I know how it played out the last ten.

11 days ago

[-]

It depends on what you mean by "engineering". For example, "engineering" can mean that you train and fine-tune a machine learning system to beat a particular benchmark. That's fun times but not really interesting or informative.

otabdeveloper4

11 days ago

[-]

> previously unencountered problems

I haven't read the IMO problems, but knowing how math Olympiad problems work, they're probably not really "unencountered".

People aren't inventing these problems ex nihilo, there's a rulebook somewhere out there to make life easier for contest organizers.

People aren't doing these contests for money, they are doing them for honor, so there is little incentive to cheat. With big business LLM vendors it's a different situation entirely.

11 days ago

[-]

I mean, solutions for the 2025 IMO problems are already available on the internet. How can we be sure these are “unencountered” problems?

hyghjiyhu

11 days ago

[-]

They probably have an archived data set from before then that they trained on.

samrus

11 days ago

[-]

They would if theyre honest. Which we just dont know for sure these days

11 days ago

[-]

Without sharing their methodology, how can we trust the claim ? questions like:

1) did humans formalize the input 2) did humans prompt the llm towards the solution etc..

I am excited to hear about it, but I remain skeptical.

Dilettante_

11 days ago

[-]

>Why is that less exciting?

Because if I have to throw 10000 rocks to get one in the bucket, I am not as good/useful of a rock-into-bucket-thrower as someone who gets it in one shot.

People would probably not be as excited about the prospect of employing me to throw rocks for them.

machiaweliczny

9 days ago

[-]

If you don't have automatic way to verify solution then picking correct answer from 10 000 is more impressive than coming with some answer in the first place. If AI tech will be able to effectively prune tree search without eval that would be super big leap but I doubt they achieved this.

beering

11 days ago

[-]

It’s exciting because nearly all humans have 0% chance of throwing the rock into the bucket, and most people believed a rock-into-bucket-thrower machine is impossible. So even an inefficient rock-into-bucket-thrower is impressive.

But the bar has been getting raised very rapidly. What was impressive six months ago is awful and unexciting today.

Dilettante_

10 days ago

[-]

You're putting words in my mouth. It's not "awful and unexciting", it is certainly an important step, but the hype being invited with the headline is the immensely greater one of an accurate rock-thrower. And if they have the inefficient one and trying to pretend to have the real deal, that's flim-flam-man levels of overstatement.

parasubvert

11 days ago

[-]

I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result.

Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.

This means we are far more trusting with software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron that requires very precise inputs and parameters and testing to act correctly. System 2 thinking.

Now with NN it's inverted: it's a brilliant know-it-all but it bullshits a lot, and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking with questionable but evolving System 2 skills where we don't know the limits.

If you're not familiar with System 1 / System 2, it's googlable .

lordnacho

11 days ago

[-]

> These models cannot reason

Not trying to be a smarty pants here, but what do we mean by "reason"?

Just to make the point, I'm using Claude to help me code right now. In between prompts, I read HN.

It does things for me such as coding up new features, looking at the compile and runtime responses, and then correcting the code. All while I sit here and write with you on HN.

It gives me feedback like "lock free message passing is going to work better here" and then replaces the locks with the exact kind of thing I actually want. If it runs into a problem, it does what I did a few weeks ago, it will see that some flag is set wrong, or that some architectural decision needs to be changed, and then implements the changes.

What is not reasoning about this? Last year at this time, if I looked at my code with a two hour delta, and someone had pushed edits that were able to compile, with real improvements, I would not have any doubt that there was a reasoning, intelligent person who had spent years learning how this worked.

It is pattern matching? Of course. But why is that not reasoning? Is there some sort of emergent behavior? Also yes. But what is not reasoning about that?

I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.

11 days ago

[-]

I think the biggest hint that the models aren't reasoning is that they can't explain their reasoning. Researchers have shown for explained that how a model solves a simple math problem and how it claims to have solved it after the fact have no real correlation. In other words there was only the appearance of reasoning.

hyghjiyhu

11 days ago

[-]

People can't explain their reasoning either. People do a parallel construction of logical arguments for a conclusion they already reached intuitively in a way they have no clue how it happened. "The idea just popped into my head while showering" to our credit, if this post-hoc rationalization fails we are able to change our opinion to some degree.

11 days ago

[-]

Interestingly people have to be trained in logic and identifying fallacies because logic is not a native capability of our mind. We aren’t even that good at it once trained and many humans (don’t forget a 100 IQ is median) can not be trained.

Reasoning appears to actually be more accurately described as “awareness,” or some process that exists along side thought where agency and subconscious processes occur. It’s by construction unobservable by our conscious mind, which is why we have so much trouble explaining it. It’s not intuition - it’s awareness.

10 days ago

[-]

Yeah, surprisingly I think the differences are less in the mechanism used for thought and more in the experience of being a person alive in a body. A person can become an idea. An LLM always forgets everything. It cannot "care"

lordnacho

11 days ago

[-]

Is this true though? I've suggested things that it pushed back on. Feels very much like a dev. It doesn't just dumbly do what I tell it.

10 days ago

[-]

Sure but it isn't reasoning that it should push back. It isn't even "pushing" which would require an intent to change you which it lacks

bamboozled

11 days ago

[-]

I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.

I’m using Opus 4 for coding and there is no way that model demonstrates any reasoning or demonstrates any “intelligence” in my opinion. I’ve been through the having conversations phase etc but doesn’t get you very far, better to read a book.

I use these models to help me type less now, that’s it. My prompts basically tell it to not do anything fancy and that works well.

fragmede

11 days ago

[-]

> It will do something brilliant and another 5 dumb things in the same prompt.

it me

11 days ago

[-]

YOU are reasoning.

facefactsdamnit

11 days ago

[-]

You raise a far point. These criticisms based on "it's merely X" or "it's not really Y" don't hold water when X and Y are poorly defined.

The only thing that should matter is the results they get. And I have a hard time understanding why the thing that is supposed to behave in an intelligent way but often just spew nonsense gets 10x budget increases over and over again.

This is bad software. It does not do the thing it promises to do. Software that sometimes works and very often produces wrong or nonsensical output is garbage software. Sink 10x, 100x, 1000x more resources into it is irrational.

Nothing else matters. Maybe it reasons, maybe it's intelligent. If it produces garbled nonsense often, giving the teams behind it 10x the compute is insane.

falcor84

11 days ago

[-]

"Software that sometimes works and very often produces wrong or nonsensical output" can be extremely valuable when coupled with a way to test whether the result is correct.

hannofcart

11 days ago

[-]

> It does not do the thing it promises to do. Software that sometimes works and very often produces wrong or nonsensical output...

Is that very unlike humans?

You seem to be comparing LLMs to much less sophisticated deterministic programs. And claiming LLMs are garbage because they are stochastic.

Which entirely misses the point because I don't want an LLM to render a spreadsheet for me in a fully reproducible fashion.

No, I expect an LLM to understand my intent, reason about it, wield those smaller deterministic tools on my behalf and sometimes even be creative when coming up with a solution, and if that doesn't work, dream up some other method and try again.

If _that_ is the goal, then some amount of randomness in the output is not a bug it's a necessary feature!

hnfong

11 days ago

[-]

You're right, they should never have given more resources and compute to the OpenAI team after the disaster called GPT-2, which only knew how to spew nonsense.

11 days ago

[-]

We already have highly advanced deterministic software. The value lies in the abductive “reasoning” and natural language processing.

We deal with non determinism any time our code interacts with the natural world. We build guard rails, detection, classification of false/true positive and negatives, and all that all the time. This isn’t a flaw, it’s just the way things are for certain classes of problems and solutions.

It’s not bad software - it’s software that does things we’ve been trying to do for nearly a hundred years beyond any reasonable expectation. The fact I can tell a machine in human language to do some relative abstract and complex task and it pretty reliably “understands” me and my intent, “understands” it’s tools and capabilities, and “reasons” how to bridge my words to a real world action is not bad software. It’s science fiction.

The fact “reliably” shows up is the non determinism. Not perfectly, although on a retry with a new seed it often succeeds. This feels like most software that interacts with natural processes in any way or form.

It’s remarkable that anyone who has ever implemented exponential back off and retry, has ever implemented edge cases, and sir and say “nothing else matters,” when they make their living dealing with non determinism. Because the algorithmic kernel of logic is 1% of programming and systems engineering, and 99% is coping with the non determinism in computing systems.

The technology is immature and the toolchains are almost farcically basic - money is dumping into model training because we have not yet hit a wall with brute force. And it takes longer to build a new way of programming and designing highly reliable systems in the face of non determinism, but it’s getting better faster than almost any technology change in my 35 years in the industry.

Your statement that it “very often produces wrong or nonsensical output” also tells me you’re holding onto a bias from prior experiences. The rate of improvement is astonishing. At this point in my professional use of frontier LLMs and techniques they are exceeding the precision and recall of humans and there’s a lot of rich ground untouched. At this point we largely can offload massive amounts of work that humans would do in decision making (classification) and use humans as a last line to exercise executive judgement often with the assistance of LLMs. I expect within two years humans will only be needed in the most exceptional of situations, and we will do a better job on more tasks than we ever could have dreamed of with humans. For the company I’m at this is a huge bottom line improvement far and beyond the cost of our AI infrastructure and development, and we do quite a lot of that too.

If you’re not seeing it yet, I wouldn’t use that to extrapolate to the world at large and especially not to the future.

logicchains

11 days ago

[-]

>I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result

This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.

cguess

11 days ago

[-]

> This is rampant human chauvinism

What in the accelerationist hell?

logicchains

11 days ago

[-]

There's nothing accelerationist about recognising that making unfalsifiable statements about LLMs lacking intelligence or reasoning ability serves zero purpose except stroking the speaker's ego. Such people are never willing to give a clear criteria for what would constitute proof of machine reasoning for them, which shows their belief isn't based on science or reason.

bamboozled

11 days ago

[-]

I’ve used these AI tools for multiple hours a day for months. Not seeing the reasoning party honestly. I see the heuristics part.

logicchains

11 days ago

[-]

I guess your work doesn't involve any maths then, because then you'd see they're capable of solving maths problems that require a non-trivial amount of reasoning steps.

cycomanic

11 days ago

[-]

Just the other day I needed to code some interlocked indices. It wasn't particularly hard but I didn't want to context switch and think so instead I asked gpt 4o. After a back and worth for 4 or 5 times, where it gave wrong answers I finally decided to just take a pen and paper and do it by hand. I have a hard time believing that these models are reasoning, because if they are they are very poor at it.

simonw

11 days ago

[-]

GPT-4o isn't classified as a "reasoning" model (by common 2025 terminology at least) - I suggest trying again with o3 or Claude 4 or Gemini 2.5.

lanza

11 days ago

[-]

Because the usefulness of an AI model is reliably solving a problem, not being able to solve a problem given 10,000 tries.

Claude Code is still only a mildly useful tool because it's horrific beyond a certain breadth of scope. If I asked it to solve the same problem 10,000 times I'm sure I'd get a great answer to significantly more difficult problems, but that doesn't help me as I'm not capable of scaling myself to checking 10,000 answers.

constantcrying

11 days ago

[-]

>if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

That entirely depends on who did the cherry picking. If the LLM had 10000 attempts and each time a human had to falsify it, this story means absolutely nothing. If the LLM itself did the cherry picking, then this is just akin to a human solving a hard problem. Attempting solutions and falsifying them until the desired result is achieved. Just that the LLM scales with compute, while humans operate only sequentially.

johnecheck

11 days ago

[-]

The key bit here is whether the LLM doing the cherry picking had knowledge of the solution. If it didn't, this is a meaningful result. That's why I'd like more info, but I fear OpenAI is going to try to keep things under wraps.

diggan

11 days ago

[-]

> If it didn't

We kind of have to assume it didn't right? Otherwise bragging about the results makes zero sense and would be outright misleading.

11 days ago

[-]

> would be outright misleading

why would not they? what are the incentives not to?

lucianbr

11 days ago

[-]

Corporations mislead to make money all the damn time.

[https://youtube.com/watch?v=YWdD206eSv0]

Dilettante_

11 days ago

[-]

"You really think someone would do that, just go on the internet and tell lies?"

blibble

11 days ago

[-]

openai have been caught doing exactly this before

11 days ago

[-]

Why do people keep making up controversial claims like this? There is no evidence at all to this effect

blibble

11 days ago

[-]

it was widely covered in the press earlier in the year

helloplanets

11 days ago

[-]

Source?

[1] https://x.com/markchen90/status/1946573740986257614?s=46&t=H...

11 days ago

[-]

Mark Chen posted that the system was locked before the contest. [1] It would obviously be crazy cheating to give verifiers a solution to the problem!

11 days ago

[-]

I don't think it's much less exciting if they ran it 10000 parallel? It implies an ability to discern when the proof is correct and rigorous (which o3 can't do consistently) and also means that outputting the full proof is within capabilities even if rare.

FeepingCreature

11 days ago

[-]

The whole point of RL is if you can get it to work 0.01% of the time you can get it to work 100% of the time.

lcnPylGDnU4H9OF

11 days ago

[-]

> what tools were used and how the model used them

According to the twitter thread, the model was not given access to tools.

karmasimida

11 days ago

[-]

> if OpenAI ran this 10000 times in parallel and cherry-picked the best one

This is almost certainly the case, remember the initial o3 ARC benchmark? I could add this is probably multi-agent system as well, so the context length restriction can be bypassed.

Overall, AI good at math problems doesn't make news to me. It is already better than 99.99% of humans, now it is better than 99.999% of us. So ... ?

https://matharena.ai/imo/

11 days ago

[-]

Progress is astounding. Recently report published about evaluation of LLMs on IMO 2025. o3 high didn't even get bronze.

Waiting for Terry Tao's thoughts, but these kind of things are good use of AI. We need to make science progress faster rather than disrupting our economy without being ready.

davis

11 days ago

[-]

Here they are: https://mathstodon.xyz/@tao/114881419368778558

11 days ago

[-]

Appreciated

> I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition.

11 days ago

[-]

[flagged]

11 days ago

[-]

Please see https://news.ycombinator.com/item?id=44617609.

You degraded this thread badly by posting so many comments like this.

11 days ago

[-]

I did competitive math in high school and I can confidently say that they are anything but "basic". I definitely can't solve them now (as an adult) and it's likely I never will. The same is true for most people, including people who actually pursued math in college (I didn't). I'm not going to be the next guy who unknowingly challenges a Putnam winner to do these but I will just say that it is unlikely that someone who actually understands the difficulty of these problems would say that they are not hard.

For those following along but without math specific experience: consider whether your average CS professor could solve a top competitive programming question. Not Leetcode hard, Codeforces hard.

11 days ago

[-]

Thanks for speaking sense. I think 99% of people saying IMO problems are not hard would not be able to solve basic district-level competition problems and are just not equipped to judge the problems.

And 1% here are those IMO/IOI winners who think everyone is just like them. I grew up with them and to you, my friends, I say: this is the reason why AI would not take over the world (and might even not be that useful for real world tasks), even if it wins every damn contest out there.

11 days ago

[-]

I feel like people see the question (or even the solution), they can actually understand what it says because it’s only using basic algebraic notation, then assume it must be easy to solve. Obviously it must be easier than that funny math with weird symbols…

11 days ago

[-]

> I assume you are aware of the standard of Olympiad problems and that they are not particularly high.

Every time an LLM reaches a new benchmark there’s a scramble to downplay it and move the goalposts for what should be considered impressive.

The International Math Olympiad was used by many people as an example of something that would be too difficult for LLMs. It has been a topic of discussion for some time. The fact that an LLM has achieved this level of performance is very impressive.

You’re downplaying the difficulty of these problems. It’s called international because the best in the entire world are challenged by it.

11 days ago

[-]

sorry but I don't think it's accurate to say "they are just challenging for the age range"

11 days ago

[-]

I'm aware you believe they are impossible tasks unless you have specific training, I happen to disagree with that.

11 days ago

[-]

you meaning specific IMO training or general math training? Latter is certainly needed, former being needed in my opinion is a general observation for example about the people who make it on the teams.

11 days ago

[-]

I mean IMO training, as yes I agree you wouldn't be able to do this without a complete Math knowledge.

11 days ago

[-]

I mean progress speed, few months ago they released o3 it has 16 pt in imo 2025

11 days ago

[-]

In that regards I would agree but that to me suggests that prior hype was unbased though.

zug_zug

11 days ago

[-]

I feel like I've noticed you you making the same comment 12 places in this thread -- incorrectly misrepresenting the difficulty of this tournament and ultimately it comes across as a bitter ex.

Here's an example problem 5:

Let a1,a2,…,an be distinct positive integers and let M=max⁡1≤i<j≤n.

Find the maximum number of pairs (i,j) with 1≤i<j≤n for which (ai +aj )(aj −ai )=M.

11 days ago

[-]

What does max⁡1≤i<j≤n mean? Wouldn't M always be j?

kelipso

11 days ago

[-]

Guessing it should be M = max_{⁡1≤i<j≤n} ai+aj or some other function M = max_{⁡1≤i<j≤n} f(ai,aj).

11 days ago

[-]

Where did you get this? Don't see it on the 2025 problem set and now I wanna see if I have the right answer

zug_zug

11 days ago

[-]

I asked chatGPT. However it's saying that's 2022 problem 5, however that seems to be clearly wrong... Moreover I can't find that problem anywhere so I don't know if it's a hallucination or something from it's training set that isn't on the internet....

9 days ago

[-]

Okay, please don't post ChatGPT output as fact without verification or at least stating where you got it.

kelipso

11 days ago

[-]

Butlerian jihad can’t come soon enough lol

11 days ago

[-]

Hence proofs as I've stated.

11 days ago

[-]

Go up to Andrew Wiles and say, "Meh, NBD, it was just a proof."

8 days ago

[-]

IMO questions and Andrew Wiles solving Fermat's last theorem are two vastly different things. One is far harder than the other and the effort he put in and thinking needed is something very few can do. He also did some other fascinating work that I couldn't hope to understand fully. There is a gulf between FLT and IMO types of proofs.

nmca

11 days ago

[-]

It’s interesting that this is a competition elite enough that several posters on a programming website don’t seem to understand what it is.

My very rough napkin math suggests that against the US reference class, imo gold is literally a one in a million talent (very roughly 20 people who make camp could get gold out of very roughly twenty million relevant high schoolers).

11 days ago

[-]

I’m not trying to take away from the difficulty of the competition. But I went to a relatively well regarded high school and never even heard of IMO until I met competitors during undergrad.

I think that the number of students who are even aware of the competition is way lower than the total number of students.

I mean, I don’t think I’d have been a great competitor even if I tried. But I’m pretty sure there are a lot of students that could do well if given the opportunity.

SJC_Hacker

11 days ago

[-]

Are you in the US? Have you heard of the AMC (used to be AHMSE) and the AIME? Those are the feeders to the IMO.

If your school had a math team and you were on it, would be surprised if you didn't hear of it

You may not have heard of the IMO because no one in school district, possibly even state got in. It is extremely selective (like 20 students in the entire country)

11 days ago

[-]

I’m in the US but this was a while back, in the south. It was a highly ranked school and ended up producing lots of PhDs, but many of the families were blue collar and so there just wasn’t any awareness of things like this.

ajennings

11 days ago

[-]

I'm curious on your response to GP's question. Have you heard of AHSME, AMC, or AIME?

Nobody mentioned them in high school (1997) until I heard of them online and got my school to participate. 30 kids took the AHSME. Only one qualified for the AIME. And nobody qualified for IMO (though I tell myself I was close).

I believe the 1 in a million number.

11 days ago

[-]

Never heard of those either. And I took the highest level math courses offered. The only competition I can remember is participating in is Academic Decathlon.

11 days ago

[-]

In the RLHF sphere you could tell some AI company/companies were targeting this because of how many IMO RLHF’ers they were hiring specifically. I don’t think it’s really easy to say how much “progress” this is given that.

ainch

11 days ago

[-]

I doubt this is coming from RLHF - tweets from the lead researcher state that this result flows from a research breakthrough which enables RLVR on less verifiable domains.

11 days ago

[-]

Math RLHF already has verifiable ground truth/right vs wrong, so I don't what this distinction really shows.

And AI changes so quickly that there is a breakthrough every week.

Call my cynical, but I think this is an RLHF/RLVR push in a narrow area--IMO was chosen as a target and they hired specifically to beat this "artificial" target.

11 days ago

[-]

RLHF means Reinforcement Learning from Human Feedback. The right/wrong ones are either called RL or RLVR (Verfiable Rewards)

11 days ago

[-]

They were hiring IMO winners because IMO winners tend to be good at working on AI, not because they had the people specifically to make the AI better at math.

9 days ago

[-]

Uh no. I’m a math RLHF’er. When I get hired, I work on math/logic up to masters level because that’s my qualifications. Masters and PHD work on masters and PHD level. And IMO work on IMO math.

Every skill and skill level is specifically assigned and hired in the RLHF world.

Sometime the skill levels are fuzzier, but that’s usually very temporary.

And as been said already, IMO is a specific skill that even PHD math holders aren’t universally trained for.

https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...

11 days ago

[-]

Some previous predictions:

In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.

He thought there was an 8% chance of this happening.

Eliezer Yudkowsky said "at least 16%".

Source:

11 days ago

[-]

While I usually enjoy seeing these discussions, I think they are really pushing the usefulness of bayesian statistics. If one dude says the chance for an outcome is 8% and another says it's 16% and the outcome does occur, they were both pretty wrong, even though it might seem like the one who guessed a few % higher might have had a better belief system. Now if one of them had said 90% while the other said 8% or 16%, then we should pay close attention to what they are saying.

AlphaAndOmega0

11 days ago

[-]

The person who guessed 16% would have a lower Brier score (lower is better) and someone who estimated 100%, beyond being correct, would have the lowest possible value.

10 days ago

[-]

I'm not saying there aren't ways to measure this (bayesian statistics does exist after all), I'm saying the difference is not worth arguing about who was right. Or even who had a better guess.

zeroonetwothree

11 days ago

[-]

A 16% or even 8% event happening is quite common so really it tells us nothing and doesn’t mean either one was pretty wrong.

grillitoazul

11 days ago

[-]

From a mathematical point of view there are two factors: (1) Initial prior capability of prediction from the human agents and (2) Acceleration in the predicted event. Now we examine the result under such a model and conclude that:

The more prior predictive power of human agents imply the more a posterior acceleration of progress in LLMs (math capability).

Here we are supposing that the increase in training data is not the main explanatory factor.

This example is the gem of a general framework for assessing acceleration in LLM progress, and I think its application to many data points could give us valuable information.

grillitoazul

11 days ago

[-]

Another take at a sound interpretation:

(1) Bad prior prediction capability of humans imply that result does not provide any information

(2) Good prior prediction capability of humans imply that there is acceleration in math capabilities of LLMs.

tunesmith

11 days ago

[-]

The whole point is to make many such predictions and experience many outcomes. The goal is for your 70% predictions to be correct 70% of the time. We all have a gap between how confident we are and how often we're correct. Calibration, which can be measured by making many predictions, is about reducing that gap.

fxwin

11 days ago

[-]

If i predict that my next dice roll will be a 5 with 16% certainty and i do indeed roll a 5, was my prediction wrong?

davidclark

11 days ago

[-]

The correctness of 8%, 16%, and 90% are all equally unknown since we only have one timeline, no?

navane

11 days ago

[-]

That's why you have to let these people make predictions about many things. Than you can weigh the 8, 16, and 90 pct and see who is talking out of their ass.

11 days ago

[-]

That's just the frequentist approach. But we're talking about bayesian statistics here.

navane

10 days ago

[-]

I admit I dont know Bayesian, but isn't the only way to check if the future teller is lucky or not to have them predict many things? If he predicts 10 to happen with a 10% chance, and one of them happens, he's good. If he predicts 10 to happen with a 90% chance and 9 happen, same. How is this different with Bayesian?

10 days ago

[-]

It is the only way if you're a frequentist. But there is a whole other subfield of statistics that deals with assigning probabilities to single events.

lblume

11 days ago

[-]

If one is calibrated to report proper percentages and assigns 8% to 25 distinct events, you should expect 2 of the events to occur; 4 in case of 16% and 22.5 in case of 90%. Assuming independence (as is sadly too often done) standard math of binomial distributions can be applied and used to distinguish the prediction's accuracy probabilistically despite no actual branching or experimental repetition taking place.

cellis

11 days ago

[-]

This is probably the best thing I’ve ever read about predictions of the future. If we could run 80 parallel universes then sure it would make sense. But we only have the one [1]. If you’re right and we get fast takeoff it won’t matter because we’re all dead. In any case the number is meaningless, there is only ONE future.

hyghjiyhu

11 days ago

[-]

You can make predictions of many different things though. Building a quantifiable track record. If one person is consistently confidently wrong then that says something about their ability and methodology

exegeist

11 days ago

[-]

Impressive prediction, especially pre-ChatGPT. Compare to Gary Marcus 3 months ago: https://garymarcus.substack.com/p/reports-of-llms-mastering-...

We may certainly hope Eliezer's other predictions don't prove so well-calibrated.

rafaelero

11 days ago

[-]

Gary Marcus is so systematically and overconfidently wrong that I wonder why we keep talking about this clown.

qoez

11 days ago

[-]

People just give attention to people making surprising bold counter narrative predictions but don't give them any attention when they're wrong.

keeda

11 days ago

[-]

People like him and Zitron do serve a useful purpose in balancing the hype from the other side, which, while justified to a great extent, is often a bit too overwhelming.

Philpax

11 days ago

[-]

Being wrong in the other direction doesn't mean you've found a great balance, it just means you've found a new way to be wrong.

11 days ago

[-]

These numbers feel kind of meaningless without any work showing how he got to 16%

dcre

11 days ago

[-]

I do think Gary Marcus says a lot of wrong stuff about LLMs but I don’t see anything too egregious in that post. He’s just describing the results they got a few months ago.

m3kw9

11 days ago

[-]

He definitely cannot use the original arguments from then ChatGPT arrived, he's a perennial goal post shifter.

shuckles

11 days ago

[-]

My understanding is that Eliezer more or less thinks it's over for humans.

0xDEAFBEAD

11 days ago

[-]

He hasn't given up though: https://xcancel.com/ESYudkowsky/status/1922710969785917691#m

11 days ago

[-]

Context? Who are these people and what are these numbers and why shouldn't I assume they're pulled from thin air?

11 days ago

[-]

> why shouldn't I assume they're pulled from thin air?

You definitely should assume they are. They are rationalists, the modus operandi is to pull stuff out of thin air and slap a single digit precision percentage prediction in front to make it seems grounded in science and well thought out.

[1] https://en.wikipedia.org/wiki/Cross-entropy

11 days ago

[-]

You should basically assume they are pulled from thin air. (Or more precisely, from the brain and world model of the people making the prediction.)

The point of giving such estimates is mostly an exercise in getting better at understanding the world, and a way to keep yourself honest by making predictions in advance. If someone else consistently gives higher probabilities to events that ended up happening than you did, then that's an indication that there's space for you to improve your prediction ability. (The quantitative way to compare these things is to see who has lower log loss [1].)

jancsika

11 days ago

[-]

> If someone else consistently gives higher probabilities to events that ended up happening than you did, then that's an indication that there's space for you to improve your prediction ability.

Your inference seems ripe for scams.

For example-- if I find out that a critical mass of participants aren't measuring how many participants are expected to outrank them by random chance, I can organize a simplistic service to charge losers for access to the ostensible "mentors."

I think this happened with the stock market-- you predict how many mutual fund managers would beat the market by random chance for a given period. Then you find that same (small) number of mutual fund managers who beat the market and switched to a more lucrative career of giving speeches about how to beat the market. :)

lucianbr

11 days ago

[-]

Is there some database where you can see predictions of different people and the results? Or are we supposed to rely on them keeping track and keeping themselves honest? Because that is not something humans do generally, and I have no reason to trust any of these 'rationalists'.

This sounds like a circular argument. You started explaining why them giving percentage predictions should make them more trustworthy, but when looking into the details, I seem to come back to 'just trust them'.

11 days ago

[-]

Yes, there is: https://manifold.markets/

People's bets are publicly viewable. The website is very popular with these "rationality-ists" you refer to.

I wasn't in fact arguing that giving a prediction should make people more trustworthy, please explain how you got that from my comment? I said that the main benefit to making such predictions is as practice for the predictor themselves. If there's a benefit for readers, it is just that they could come along and say "eh, I think the chance is higher than that". Then they also get practice and can compare how they did when the outcome is known.

ohdeargodno

11 days ago

[-]

>Who are these people

Clowns, mostly. Yudkowski in particular, whose only job today seems to be making awful predictions and letting lesswrong eat it up when one out of a hundred ends up coming true, solidifying his position as AI-will-destroy-the-world messiah. They make money from these outlandish takes, and more money when you keep talking about them.

It's kind of like listening to the local drunkard at the bar that once in a while ends up predicting which team is going to win in football inbetween drunken and nonsensical rants, except that for some reason posting the predictions on the internet makes him a celebrity, instead of just a drunk curiosity.

meindnoch

11 days ago

[-]

>Who are these people

Be glad you don't know anything about them. Seriously.

Maxious

11 days ago

[-]

ask chatgpt

empiricus

11 days ago

[-]

16% is just a way of saying one in six chances

Xenoamorphous

11 days ago

[-]

Or just “twice as likely as the guy who said 8%”.

Workaccount2

11 days ago

[-]

One of the most worrying trends in AI has been how wrong the experts have been with overestimating timelines.

On the other hand, I think human hubris naturally makes us dramatically overestimate how special brains are.

11 days ago

[-]

Those percentages are completely meaningless. No better than astrology.

11 days ago

[-]

Off topic, but am I the only one getting triggered every time I see a rationalist quantify their prediction of the future with single digit accuracy? It's like their magic way of trying to get everyone to forget that they reached their conclusion in completely hand-wavy way, just like every other human being. But instead of saying "low confidence" or "high confidence" like the rest of us normies, they will tell you they think there is 16.27% chance because they really really want you to be aware that they know bayes theorem.

tedsanders

11 days ago

[-]

Interestingly, this is actually a question that's been looked at empirically!

Take a look at this paper: https://scholar.harvard.edu/files/rzeckhauser/files/value_of...

They took high-precision forecasts from a forecasting tournament and rounded them to coarser buckets (nearest 5%, nearest 10%, nearest 33%), to see if the precision was actually conveying any real information. What they found is that if you rounded the forecasts of expert forecasters, Brier scores got consistently worse, suggesting that expert forecast precision at the 5% level is still conveying useful, if noisy, information. They also found that less expert forecasters took less of a hit from rounding their forecasts, which makes sense.

It's a really interesting paper, and they recommend that foreign policy analysts try to increase precision rather than retreating to lumpy buckets like "likely" or "unlikely".

Based on this, it seems totally reasonable for a rationalist to make guesses with single digit precision, and I don't think it's really worth criticizing.

11 days ago

[-]

Likely vs. unlikely is rounding to 50%. Single digit is rounding to 1%. I don't think the parent was suggesting the former is better than the latter. Even before I read your comment I thought that 5% precision is useful but 1% precision is a silly turn-off, unless that 1% is near the 0% or 100% boundary.

btilly

11 days ago

[-]

The book Superforecasting documented that for their best forecasters, rounding off that last percent would reliably reduce Brier scores.

Whether rationalists who are publicly commenting actually achieve that level of reliability is an open question. But that humans can be reliable enough in the real world that the last percentage matters, has been demonstrated.

11 days ago

[-]

Your comment is incredibly confusing (possibly misleading) because of the key details you've omitted.

> The book Superforecasting documented that for their best forecasters, rounding off that last percent would reliably reduce Brier scores.

Rounding off that last percent... to what, exactly? Are you excluding the exceptions I mentioned (i.e. when you're already close to 0% or 100%?)

Nobody is arguing that 3% -> 4% is insignificant. The argument is over whether 16% -> 15% is significant.

btilly

10 days ago

[-]

To the nearest 5%, for percentages in that middle range. It is not just 16% -> 15%. But also 46% -> 45%.

10 days ago

[-]

Yes so this confirms my point rather than refuting it...

btilly

10 days ago

[-]

It seems that you reversed your point then. You said before:

Even before I read your comment I thought that 5% precision is useful but 1% precision is a silly turn-off, unless that 1% is near the 0% or 100% boundary.

However what I am saying is that there is real data, involving real predictions, by real people, that demonstrates that there is a measurable statistical loss of accuracy in their predictions if you round off those percentages.

This doesn't mean that any individual prediction is accurate to that percent. But it happens often enough that the last percent really does contain real value.

ToValueFunfetti

11 days ago

[-]

The most useful frame here is looking at log odds. Going from 15% -> 16% means

-log_2(.15/(1-.15)) -> -log_2(.16/1-.16))

2.5 -> 2.39

So saying 16% instead of 15% implies an additional tenth of a bit of evidence in favor (alternatively, 16/15 ~= 1.07 ~= 2^.1).

I don't know if I can weigh in on whether humans should drop a tenth of a bit of evidence to make their conclusion seem less confident. In software (eg. spam detector), dropping that much information to make the conclusion more presentable would probably be a mistake.

NooneAtAll3

11 days ago

[-]

I thought single digit means single significant digit, aka rounding to 10%?

11 days ago

[-]

I did mean 1%, not sure if I used the right term though, english not being my first language.

11 days ago

[-]

Wasn't 16% the example they were talking about? Isn't that two significant digits?

And 16% very much feels ridiculous to a reader when they could've just said 15%.

gjm11

11 days ago

[-]

In context, the "at least 16%" is responding to someone who said 8%, and 16 just happens to be exactly twice 8. I suspect (though I don't know) that Yudkowsky would not have claimed to have a robust way to pick whether 16% or 17% was the better figure.

For what it's worth, I don't think there's anything even slightly wrong with using whatever estimate feels good to you, even if it happens not to fit someone else's criterion for being a nice round number, even if your way of getting the estimate was sticking a finger in the air and saying the first number you thought of. You never make anything more accurate by rounding it[1], and while it's important to keep track of how precise your estimates are I think it's a mistake to try to do that by modifying the numbers. If you have two pieces of information (your best estimate, and how fuzzy it is), you should represent it as two pieces of information[2].

[1] This isn't strictly true, but it's near enough.

[2] Cf. "Pitman's two-bit rule".

11 days ago

[-]

> In context, the "at least 16%" is responding to someone who said 8%, and 16 just happens to be exactly twice 8. I suspect (though I don't know) that Yudkowsky would not have claimed to have a robust way to pick whether 16% or 17% was the better figure.

If this was just a way to say "at least double that", that's... fair enough, I guess.

Regarding your other point:

> For what it's worth, I don't think there's anything even slightly wrong with using whatever estimate feels good to you, even if it happens not to fit someone else's criterion for being a nice round number

This is completely missing the point. There absolutely is something wrong with doing this (barring cases like the above where it was just a confusing phrasing of something with less precision like "double that"). The issue has nothing to do with being "nice", it has to do with the significant figures and the error bars.

If you say 20% then it is understood that your error margin is 5%. Even those that don't understand sigfigs still understand that your error margin is < 10%.

If you say 19% then suddenly the understanding becomes that your error margin < 1%. Nobody is going to see that and assume your error bars on it are 5% -- nobody. Which is what makes it a ridiculous estimate. This has nothing to do with being "nice and round" and everything with conveying appropriate confidence.

gjm11

10 days ago

[-]

I'm not missing the point, I'm disagareeing with it. I am saying that the convention that if you say 20% then you are assumed to have an error margin of 5%, while if you say 19% you are assumed to have an error margin of 1%, is a bad convention. It gives you no way to say that the number is 20% with a margin of 1%. It gives you only a very small set of possible degrees-of-uncertainty. It gives you no way to express that actually your best estimate is somewhat below 20% even though you aren't sure it isn't 5% out.

It's true, of course, that if you are talking to people who are going to interpret "20%" as "anywhere between 17.5% and 22.5%" and "19%" as "anywhere between 18.5% and 19.5%", then you should try to avoid giving not-round numbers when your uncertainty is high. And that many people do interpret things that way, because although I think the convention is a bad one it's certainly a common one.

But: that isn't what happened in the case you're complaining about. It was a discussion on Less Wrong, where all the internet-rationalists hang out, and where there is not a convention that giving a not-round number implies high confidence and high precision. Also, I looked up what Yudkowsky actually wrote, and it makes it perfectly clear (explicitly, rather than via convention) that his level of uncertainty was high:

"Ha! Okay then. My probability is at least 16%, though I'd have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more."

(Incidentally, in case anyone's similarly salty about the 8% figure that gives context to this one: it wasn't any individual's estimate, it was a Metaculus prediction, and it seems pretty obvious to me that it is not an improvement to report a Metaculus prediction of 8% as "a little under 10%" or whatever.)

fredoliveira

11 days ago

[-]

My interpretation was that Yudkowski simply doubled Christiano's guess of 8% (as one might say in conversation "oh it's at least double that", but using the actual number)

ghjnut

11 days ago

[-]

Aim small, miss small?

11 days ago

[-]

Would you also get triggered if you saw people make a bet at, say, $24 : $87 odds? Would you shout: "No! That's too precise, you should bet $20 : $90!"? For that matter, should all prices in the stock market be multiples of $1, (since, after all, fluctuations of greater than $1 are very common)?

If the variance (uncertainty) in a number is large, correct thing to do is to just also report the variance, not to round the mean to a whole number.

Also, in log odds, the difference between 5% and 10% is about the same as the difference between 40% and 60%. So using an intermediate value like 8% is less crazy than you'd think.

People writing comments in their own little forum where they happen not to use sig-figs to communicate uncertainty is probably not a sinister attempt to convince "everyone" that their predictions are somehow scientific. For one thing, I doubt most people are dumb enough to be convinced by that, even if it were the goal. For another, the expected audience for these comments was not "everyone", it was specifically people who are likely to interpret those probabilities in a Bayesian way (i.e. as subjective probabilities).

11 days ago

[-]

> Would you also get triggered if you saw people make a bet at, say, $24 : $87 odds? Would you shout: "No! That's too precise, you should bet $20 : $90!"? For that matter, should all prices in the stock market be multiples of $1, (since, after all, fluctuations of greater than $1 are very common)?

No.

I responded to the same point here: https://news.ycombinator.com/item?id=44618142

> correct thing to do is to just also report the variance

And do we also pull this one out of thin air?

Using precise number to convey extremely unprecise and ungrounded opinions is imho wrong and to me unsettling. I'm pulling this purely out of my ass, and maybe I am making too much out of it, but I feel this is in part what is causing the many cases of very weird, and borderline associal/dangerous behaviours of some associated with the rationalists movement. When you try to precisely quantify what cannot be, and start trusting those numbers too much, you can easily be led to trust your conclusions way too much. I am 56% confident this is a real effect.

11 days ago

[-]

I mean, sure people can use this to fool themselves. I think usually the cause of someone fooling themselves is "the will to be fooled", and not so much that fact that they used precise numbers in the their internal monologue as opposed to verbal buckets like "pretty likely", "very unlikely". But if you estimate 56% it sometimes actually makes a difference, then who am I to argue? Sounds super accurate to me. :)

In all seriousness, I do agree it's a bit harmful for people to use this kind of reasoning, but only practice it on things like AGI that will not be resolved for years and years (and maybe we'll all be dead when it does get resolved). Like ideally you'd be doing hand-wavy reasoning with precise probabilities about whether you should bring an umbrella on a trip, or applying for that job, etc. Then you get to practice with actual feedback and learn how not to make dumb mistakes while reasoning in that style.

> And do we also pull this one out of thin air?

That's what we do when training ML models sometimes. We'll have the model make a Gaussian distribution by supplying both a mean and a variance. (Pulled out of thin air, so to speak.) It has to give its best guess of the mean, and if the variance it reports is too small, it gets penalized accordingly. Having the model somehow supply an entire probability distribution is even more flexible (and even less communicable by mere rounding). Of course, as mentioned by commenter danlitt, this isn't relevant to binary outcomes anyways, since the whole distribution is described by a single number.

11 days ago

[-]

> and not so much that fact that they used precise numbers in the their internal monologue as opposed to verbal buckets like "pretty likely", "very unlikely"

I am obviously only talking from my personal anecdotal experience, but having been on a bunch of coffee chat in the last few months with people in the AI safety field in SF, and a lot of them being Lesswrong-ers, I experienced a lot of those discussions with random % being thrown in succession to estimate the final probability of some event, and even though I have worked in ML for 10+ years (so I would guess more constantly aware of what a bayesian probability is than the average person), I do find myself often swayed by whatever numbers comes out at the end and having to consciously take a step back and pull myself from instinctively trusting this random number more than I should. I would not need to pull myself back, I think, if we were using words instead of precise numbers.

It could be just a personal mental weakness with numbers with me that is not general, but looking at my interlocutors emotional reactions to their own numerical predictions I do feel quite strongly that this is a general human trait.

ben_w

11 days ago

[-]

> It could be just a personal mental weakness with numbers with me that is not general, but looking at my interlocutors emotional reactions to their own numerical predictions I do feel quite strongly that this is a general human trait.

Your feeling is correct; anchoring is a thing, and good LessWrongers (I hope to be in that category) know this and keep track of where their prior and not just posterior probabilities come from: https://en.wikipedia.org/wiki/Anchoring_effect

Probably don't in practice, but should. That "should" is what puts the "less" into "less wrong".

11 days ago

[-]

Ah thanks for the link, yes this is precisely the bias I am feeling falling victim to if not making an effort to counter it.

danlitt

11 days ago

[-]

> If the variance (uncertainty) in a number is large, correct thing to do is to just also report the variance

I really wonder what you mean by this. If I put my finger in the air and estimate the emergence of AGI as 13%, how do I get at the variance of that estimate? At face value, it is a number, not a random variable, and does not have a variance. If you instead view it as a "random sample" from the population of possible estimates I might have made, it does not seem well defined at all.

11 days ago

[-]

I meant in a general sense that it's better when reporting measurements/estimates of real numbers to report the uncertainty of the estimate alongside the estimate, instead of using some kind of janky rounding procedure to try and communicate that information.

You're absolutely right that if you have a binary random variable like "IMO gold by 2026", then the only thing you can report about its distribution is the probability of each outcome. This only makes it even more unreasonable to try and communicate some kind of "uncertainty" with sig-figs, as the person I was replying to suggested doing!

(To be fair, in many cases you could introduce a latent variable that takes on continuous values and is closely linked to the outcome of the binary variable. Eg: "Chance of solving a random IMO problem for the very best model in 2025". Then that distribution would have both a mean and a variance (and skew, etc), and it could map to a "distribution over probabilities".)

danlitt

11 days ago

[-]

No, you are right, this hyper-numericalism is just astrology for nerds.

OldfieldFund

11 days ago

[-]

The whole community is very questionable, at best. (AI 2027, etc.)

mewpmewp2

11 days ago

[-]

In military they estimate distances this way if they don't have proper tools. Each says a min max range and then where there's most overlap, that will be taken. It's a reasonable way to make quick intuition based decisions when no other way is available.

danlitt

11 days ago

[-]

If you actually try to flesh out the reasoning behind the distance estimation strategy it will turn out 100x more convincing than the analogous argument for bayesian probability estimates. (and for any bayesians reading this, please don't multiply the probability by 100)

ben_w

11 days ago

[-]

> But instead of saying "low confidence" or "high confidence" like the rest of us normies

To add to what tedsanders wrote: there's also research that shows verbal descriptions, like those, mean wildly different things from one person to the next: https://lettersremain.com/perceptions-of-probability-and-num...

https://en.wikipedia.org/wiki/Brier_score

jdmoreira

11 days ago

[-]

Obviously you know nothing about a brier score.

also:

https://en.m.wikipedia.org/wiki/Superforecaster

mewpmewp2

11 days ago

[-]

If you take it with a grain of salt it's better than nothing. In life to express your opinion sometimes the best way is to quantify that based on intuition. To make decisions you could compile multiple experts intuitive quantities and use median or similar. There are some cases where it's more straight forward and rote, e.g. in military if you have to make distance based decisions, you might ask 8 of your soldiers to each name a number they think the distance is and take the median.

baxtr

11 days ago

[-]

No you’re definitely not the only one… 10% is ok, 5% maybe, 1% is useless.

And since we’re at it: why not give confidence intervals too?

meindnoch

11 days ago

[-]

>Off topic, but am I the only one getting triggered every time I see a rationalist

The rest of the sentence is not necessary. No, you're not the only one.

jere

11 days ago

[-]

You could look at 16% as roughly equivalent to a dice roll (1 in 6) or, you know, the odds you lose a round of Russian roulette. That's my charitable interpretation at least. Otherwise it does sound silly.

Veedrac

11 days ago

[-]

There is no honor in hiding behind euphemisms. Rationalists say ‘low confidence’ and ‘high confidence’ all the time, just not when they're making an actual bet and need to directly compare credences. And the 16.27% mockery is completely dishonest. They used less than a single significant figure.

11 days ago

[-]

> just not when they're making an actual bet

That is not my experience talking with rationalists irl at all. And that is precisely my issue, it is pervasive in every day discussion about any topic, at least with the subset of rationalists I happen to cross paths with. If it was just for comparing ability to forecast or for bets, then sure it would make total sense.

Just the other day I had a conversation with someone about working in AI safety, it when something like "well I think there is 10 to 15% chance of AGI going wrong, and if I join I have maybe 1% chance of being able to make an impact and if.. and if... and if, so if we compare with what I'm missing by not going to <biglab> instead I have 35% confidence it's the right decision"

What makes me uncomfortable with this, is that by using this kind of reasoning and coming out with a precise figure at the end, it cognitively bias you into being more confident in your reasoning than you should be. Because we are all used to treat numbers as the output of a deterministic, precise, scientific process.

There is no reason to say 10% or 15% and not 8% or 20% for rogue AGI, there is no reason to think one individual can change the direction by 1% and not by 0.3% or 3%, it's all just random numbers, and so when you multiply a gut feeling number by a gut feeling number 5 times in a row, you end up with something absolutely meaningless, where the margin of error is basically 100%.

But it somehow feels more scientific and reliable because it's a precise number, and I think this is dishonest and misleading both to the speaker themselves and to listeners. "Low confidence", or "im really not sure but I think..." have the merit of not hiding a gut feeling process behind a scientific veil.

To be clear I'm not saying you should never use numerics to try to quantify gut feeling, it's ok to say I think there is maybe 10% chance of rogue AGI and thus I want to do this or that. What I really don't like is the stacking of multiple random predictions and trying to reason about this in good faith.

> And the 16.27% mockery is completely dishonest.

Obviously satire

baxtr

11 days ago

[-]

I wonder if what you observe is a direct effect of the rationalist movement worshipping the god of Bayes.

drexlspivey

11 days ago

[-]

Yes

quirino

11 days ago

[-]

I think equally impressive is the performance of the OpenAI team at the "AtCoder World Tour Finals 2025" a couple of days ago. There were 12 human participants and only one did better than OpenAI.

Not sure there is a good writeup about it yet but here is the livestream: https://www.youtube.com/live/TG3ChQH61vE.

zeroonetwothree

11 days ago

[-]

And yet when working on production code current LLMs are about as good as a poor intern. Not sure why the disconnect.

kenjackson

11 days ago

[-]

Depends. I’ve been using it for some of my workflows and I’d say it is more like a solid junior developer with weird quirks where it makes stupid mistakes and other times behaves as a 30 year SME vet.

Rioghasarig

11 days ago

[-]

I really doubt it's like a "solid junior developer". If it could do the work of a solid junior developer it would be making programming projects 10-100x faster because it can do things several times faster than a person can. Maybe it can write solid code for certain tasks but that's not the same thing as being a junior developer.

peripitea

11 days ago

[-]

It can be 10-100x faster for some tasks already. I've had it build prototypes in minutes that would have taken me a few hours to cobble together, especially in domains and using libraries I don't have experience with.

roxolotl

11 days ago

[-]

It’s the same reason leet code is a bad interview question. Being good at these sorts of problems doesn’t translate directly to being good at writing production code.

11 days ago

[-]

because competitive coding is narrow well described domain(limited number of concepts: lists, trees, etc) with high volume of data available for training, and easy way to setup RL feeback loop, so models can improve well in this domain, which is not true about typical enterprise overbloated software.

quirino

11 days ago

[-]

All you said is true. Keep in mind this is the "Heuristics" competition instead of the "Algorithms" one.

Instead of the more traditional Leetcode-like problems, it's things like optimizing scheduling/clustering according to some loss function. Think simulated annealing or pruned searches.

sigbottle

11 days ago

[-]

Dude thank you for stating this.

OpenAI's o3 model can solve very standard even up to 2700 rated codeforces problems it's been trained on, but is unable to think from first principles to solve problems I've set that are ~1600 rated. Those 2700 algorithms problems are obscure pages on the competitive programming wiki, so it's able to solve it with knowledge alone.

I am still not very impressed with its ability to reason both in codeforces and in software engineering. It's a very good database of information and a great searcher, but not a truly good first-principles reasoner.

I also wish o3 was a bit nicer - it's "reasoning" seems to have made it more arrogant at times too even when it's wildly off ,and it kind of annoys me.

Ironically, this workflow has really separated for me what is the core logic I should care about and what I should google, which is always a skill to learn when traversing new territory.

quirino

11 days ago

[-]

Not completely sure how your reply relates to my comment. I was just mentioning the competition is on Heuristics which is different from what you find on CF or most coding competitions.

About the performance of AI on competitions, I agree what's difficult for it is different from what's difficult for us.

Problems that are just applying a couple of obscure techniques may be easier for them. But some problems I've solved required a special kind of visualization/intuition which I can see being hard for AI. But I'd also say that of many Math Olympiad problems and they seem to be doing fine there.

I've almost accepted it's a matter of time before they become better than most/all of the best competitors.

For context, I'm a CF Grandmaster but haven't played much with newer models so maybe I'm underestimating their weaknesses.

ksec

11 days ago

[-]

I am neither an optimist nor a pessimist for AI. I would likely be called both by the opposite parties. But the fact that AI / LLM is still rapidly improving is impressive in itself and worth celebrating for. Is it perfect, AGI, ASI? No. Is it useless? Absolutely not.

I am just happy the prize is so big for AI that there are enough money involve to push for all the hardware advancement. Foundry, Packaging, Interconnect, Network etc, all the hardware research and tech improvements previously thought were too expensive are now in the "Shut up and take my money" scenario.

asadotzler

11 days ago

[-]

But unlike the trillion dollars invested in the broadband internet build out between 1998 and 2008, when this 10 year trillion dollar bubble pops, we won't be left with an enduring and useful piece of infrastructure adding a trillion dollars to the global economy annually.

11 days ago

[-]

It would leave a lots of general purpose GPU-based compute. That is useful and enduring infrastructure? These things are used for many scientific and engineering problems - including medicine, climate modeling, material science, neuroscience, etc

machiaweliczny

9 days ago

[-]

I think that "Query Engine" you can later distill is quite useful artefact. If I were to TP back in time I would take current LLM with me over wikipedia as it's more accessible

j_timberlake

11 days ago

[-]

Holy shit, AIs just got a gold medal on the math olympiad and you guys are STILL spamming this shit in every thread. I don't even know how you can reach this level of inertia on a topic, did you short Nvidia stock or something?

ksec

11 days ago

[-]

>we won't be left with an enduring and useful piece of infrastructure adding a trillion dollars to the global economy annually.

Nearly all colleagues I know working inside a very large non-tech organisation are using Copilot for part of their work in the past 12 months. I have never seen tech adoption this quick for normal every day consumer. Not PC, Not Internet, Not Smartphone.

I actually had discussions with parents about our kids using ChartGPT. Every single one of them at school are using it. Honestly I didn't like it but they were actually the one who got used to it first and I quote "Who still uses Google?". That was when I learn there will be a tectonic shift in tech.

Does it actually add productivity? may be. Is it worth the trillion dollar investment? I have no idea. But are we going back? As someone who knows a lot about consumer behaviour I will say that is a definite no.

Note to myself. This feels another iPhone moment again. Except this time around lots of the tech people are skeptic of it, but consumer are adopting faster. When iPhone launch a lot of tech people knew it will be the future. But consumer took some time. Even MKBHD acknowledge his first Smartphone was in the iPhone 4s era.

TacticalCoder

11 days ago

[-]

> ... we won't be left with an enduring and useful piece of infrastructure adding a trillion dollars to the global economy annually.

I'm not drinking the AGI kool-aid but I use LLMs daily. We pay not one but two AI subscriptions at home (including Claude).

It's extremely useful. From translation to proof-reading to synthetizing to expanding on something to writing little dumb functions to helping with spreadsheet formulas to documenting code to writing commit messages to helping find movie names (when I only remember very partially the plot) etc.

How is this not already adding a trillion dollars to the economy?

It's not about the infrastructure: all that counts are the models. They're here to stay. They're not going away.

It's the single biggest time-saver I've ever seen for mundane tasks (and, no, it doesn't write good code: it write shitty pathetic underperforming insecure code... And yet it's still useful for proofs of concept / one-offs / throwaway).

j_timberlake

11 days ago

[-]

"worth celebrating for"

The correlation between "companies make smarter AI" and "our lives get better" is still a rounding error.

Many people will say "don't worry, tech always makes our lives better eventually", they'll probably stop saying this once autonomous killer drone-swarms are a thing.

https://xcancel.com/OpenAI/status/1946594928945148246

Philpax

11 days ago

[-]

Official OpenAI announcements:

https://xcancel.com/OpenAI/status/1946594933470900631

mehulashah

11 days ago

[-]

The AI scaling that went on for the last five years is going to be very different from the scaling that will happen in the next ten years. These models have latent capabilities that we are racing to unearth. IMO is but one example.

There’s so much to do at inference time. This result could not have been achieved without the substrate of general models. Its not like Go or protein folding. You need the collective public global knowledge of society to build on. And yes, there’s enough left for ten years of exploration.

More importantly, the stakes are high. There may be zero day attacks, biological weapons, and more that could be discovered. The race is on.

11 days ago

[-]

Latent??

If you looked at RLHF hiring over the last year, there was a huge hiring of IMO competitors to RLHF. This was a new, highly targeted, highly funded RLHF’ing.

curiousguy7374

11 days ago

[-]

Can you provide any kind of source? Very curious about this!

https://work.mercor.com/jobs/list_AAABljpKHPMmFMXrg2VM0qz4

11 days ago

[-]

https://benture.io/job/international-math-olympiad-participa...

https://job-boards.greenhouse.io/xai/jobs/4538773007

And Outlier/Scale, which was bought by Meta (via Scale), had many IMO-required Math AI trainer jobs on LinkedIn. I can't find those historical ones though.

I'm just one piece in the cog and this is an anecdote, but there was a huge upswing in IMO or similar RLHF job postings over the past 6mo-year.

Workaccount2

11 days ago

[-]

I would fully expect every IMO participant grinds IMO problems for months before the competition.

I don't know why people hold training a model on like material as a negation of it's ability.

10 days ago

[-]

It shows models need RL for any new domain/level of expertise, which is contrary to what the marketers claim about LLMs and potential for AGI.

11 days ago

[-]

Yup, we have bootstrapped to enough intelligence in the models that we can introduce higher levels of ai

gitfan86

11 days ago

[-]

This is such an interesting time because the percentage of people who are making predictions about AGI happening on the future are going to drop off and the number of people completely ignoring the term AGI will increase.

SiempreViernes

11 days ago

[-]

That doesn't seem likely because the LLMs haven't really delivered any great products that can cover the money spent and so AGI hype is essentially to keep the money flowing.

reactordev

11 days ago

[-]

The Final boss was:

   Which is greater, 9.11 or 9.9?

I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics so this aligns.

11 days ago

[-]

If someone told me this say, 10 or 20 years ago, I would have assumed this was worthy of a Nobel/Turing prize ...

raincole

11 days ago

[-]

Early machine learning researchers literally got Nobel Prize last year. Clearly not every incremental step of progress merits a Nobel.

11 days ago

[-]

Yes! But there is also a delay as 10 or 20 years ago we already had neural nets and I'm curious if people back then thought the concept was Nobel worthy.

JohnKemeny

10 days ago

[-]

We have been doing automated theorem proving since 1954.

11 days ago

[-]

Has anyone independently reviewed these solutions?

My proving skills are extremely rusty so I can’t look at these and validate them. They certainly are not traditional proofs though.

shiandow

11 days ago

[-]

I read through P1, and it seemed to be correct. Though you could explain the central idea of the proof into about 3 sentences and a few drawings.

It reads like someone who found the correct answer but seemingly had no understanding of what they did and just handed in the draft paper.

Which seems odd, shouldn't an LLM be better at prose?

matt123456789

11 days ago

[-]

One would think. I suppose OpenAI threw the majority of their compute budget at producing and verifying solutions. It would certainly be interesting to see whether or not this new model can distill its responses to just those steps necessary to convey its result to a given audience.

esjeon

11 days ago

[-]

I get the feeling that modern computer systems are so powerful that they can solve almost all well-explored closed problems with a properly tuned model. The problem lies in efficiency, reliability, and cost. Increasing efficiency and reliability would require an exponential increase in cost. QC might solve that cost part, and symbolic reasoning model will significantly boost both efficiency and reliability.

strangescript

11 days ago

[-]

We are the frog in the warm water...

orespo

11 days ago

[-]

Definitely interesting. Two thoughts. First, are the IMO questions somewhat related to other openly available questions online, making it easier for LLMs that are more efficient and better at reasoning to deduce the results from the available content?

Second, happy to test it on open math conjectures or by attempting to reprove recent math results.

evrimoztamur

11 days ago

[-]

From what I've seen, IMO question sets are very diverse. Moreover, humans also train on all available set of math olympiad questions and similar sets too. It seems fair game to have the AI train on them as well.

For 2, there's an army of independent mathematicians right now using automated theorem provers to formalise more or less all mathematics as we know it. It seems like open conjectures are chiefly bounded by a genuine lack of new tools/mathematics.

11 days ago

[-]

You mean as in the previous years questions will have been used to train it? Yes, they are the same questions and due to them limited format on math questions, there are repeats so LLMs should fundamentally be able to recognise a structure and similarities and use that.

11 days ago

[-]

you either completely misinformed on the topic or a troll

laurent_du

11 days ago

[-]

They are not the same question, why are you spreading so much misinformed takes in this thread? I know a guy who had one of the best scores in history at IMO and he's incredibly intelligent. Stop repeating that getting a gold medal at IMO is a piece of cake - it's not.

another_twist

11 days ago

[-]

I am quite surprised that Deepmind with MCTS wasnt able to figure out math performance itself.

tootyskooty

11 days ago

[-]

Google will also have good results to report for this year's IMO, OpenAI just beat them to the announcement

11 days ago

[-]

I think google did some official collaboration with IMO, and will announce later. Or at least that's what I read from the IMO official saying "AI companies should wait 1 week before announcing so that we can celebrate the human winners" and "to my knowledge oai was not officially collaborating with IMO" ...

[1] https://gpt-unicorn.adamkdean.co.uk/

11 days ago

[-]

Makes sense. Mathematicians use intuiton a lot to drive their solution seeking, and I suppose an AI such as an LLM could develop intuition too. Of course where AI really wins is search speed and the fact that an LLM really doesn't get tired when exploring different strategies and steps within each strategy.

However, I expect that geometric intuition may still be lacking mostly because of the difficulty of encoding it in a form which an LLM can easily work with. After all, Chatgpt still can't draw a unicorn [1] although it seems to be getting closer.

11 days ago

[-]

Pre-registering a prediction:

When (not if) AI does make a major scientific discovery, we'll hear "well it's not really thinking, it just processed all human knowledge and found patterns we missed - that's basically cheating!"

Chance-Device

11 days ago

[-]

Turns out goalposts are the world’s most easily moved objects. We should start building spacecraft out of them.

11 days ago

[-]

I saw the phrase "goalposts aren't just moving, they're doing parkour" recently and I do love that image. It does seem to capture the state of things quite well.

ysavir

11 days ago

[-]

Less that AI is cheating and more that we basically found a way to take the thousand monkeys with infinite time scenario and condense that into a reasonable(?) amount of time and with some decent starting instructions. The AI wouldn't have done any of the heavy lifting of the discovery, it just iterated on the work of past researchers at speeds beyond human.

colinplamondon

11 days ago

[-]

Honest question - how is that not true of those past researchers?

IE, they...

- Start with the context window of prior researchers.

- Set a goal or research direction.

- Engage in chain of thought with occasional reality-testing.

- Generate an output artifact, reviewable by those with appropriate expertise, to allow consensus reality to accept or reject their work.

gitfan86

11 days ago

[-]

AI sceptics are on a trajectory towards philosophy.

ysavir

11 days ago

[-]

The difference is that human researchers have agency--for better or for worse.

peripitea

11 days ago

[-]

It sounds like you're saying AI is just doing brute force with a lot of force, but I can't imagine that's actually what you think, so would you mind clarifying?

j_timberlake

11 days ago

[-]

If you want credit for getting predictions right, you have to predict something that has less than 100% probability to happen.

11 days ago

[-]

I think both can be true - I'm pretty sure a lot of what it is viewed as genius insight by the public, is actually researchers being really familiar with the state of the art in their field and putting in the legwork of trying new ideas.

OldfieldFund

11 days ago

[-]

People get very fragile when AI is better at something than them (excluding speed/scale of operations, where computers have an obvious edge)

https://ai.vixra.org/pdf/2506.0065v1.pdf

Workaccount2

11 days ago

[-]

It's stochastic parrots all the ways down:

A satirical paper, but it's hilariously brilliant.

https://mathstodon.xyz/@tao/114881418225852441

robinhouston

11 days ago

[-]

There is some relevant context from Terence Tao on Mathstodon:

> It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance gives the tool, and how one reports their results.

> One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wording of the problems. The team leader advocates for the students in front of the IMO jury during the grading process, but is not involved in the IMO examination directly.

> The IMO is widely regarded as a highly selective measure of mathematical achievement for a high school student to be able to score well enough to receive a medal, particularly a gold medal or a perfect score; this year the threshold for the gold was 35/42, which corresponds to answering five of the six questions perfectly. Even answering one question perfectly merits an "honorable mention".

> But consider what happens to the difficulty level of the Olympiad if we alter the format in various ways:

> * One gives the students several days to complete each question, rather than four and half hours for three questions. (To stretch the metaphor somewhat, consider a sci-fi scenario in the student is still only given four and a half hours, but the team leader places the students in some sort of expensive and energy-intensive time acceleration machine in which months or even years of time pass for the students during this period.)

> * Before the exam starts, the team leader rewrites the questions in a format that the students find easier to work with.

> * The team leader gives the students unlimited access to calculators, computer algebra packages, textbooks, or the ability to search the internet.

> * The team leader has the six student team work on the same problem simultaneously, communicating with each other on their partial progress and reported dead ends.

> * The team leader gives the students prompts in the direction of favorable approaches, and intervenes if one of the students is spending too much time on a direction that they know to be unlikely to succeed.

> * Each of the six students on the team submit solutions, but the team leader selects only the "best" solution to submit to the competition, discarding the rest.

> * If none of the students on the team obtains a satisfactory solution, the team leader does not submit any solution at all, and silently withdraws from the competition without their participation ever being noted.

> In each of these formats, the submitted solutions are still technically generated by the high school contestants, rather than the team leader. However, the reported success rate of the students on the competition can be dramatically affected by such changes of format; a student or team of students who might not even reach bronze medal performance if taking the competition under standard test conditions might instead reach gold medal performance under some of the modified formats indicated above.

> So, in the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants.

Source:

https://mathstodon.xyz/@tao/114881419368778558

https://mathstodon.xyz/@tao/114881420636881657

chairhairair

11 days ago

[-]

OpenAI simply can’t be trusted on any benchmarks: https://news.ycombinator.com/item?id=42761648

qoez

11 days ago

[-]

Remember that they've fired all whistleblowers that would admit to breaking the verbal agreement that they wouldn't train on the test data.

11 days ago

[-]

Could not find it on the open web. Do you have clues to search for?

Bjorkbat

11 days ago

[-]

Somewhat related, but I’ve been feeling as of late what can best be described as “benchmark fatigue”.

The latest models can score something like 70% on SWE-bench verified and yet it’s difficult to say what tangible impact this has on actual software development. Likewise, they absolutely crush humans at sport programming but are unreliable software engineers on their own.

What does it really mean that an LLM got gold on this year’s IMO? What if it means pretty much nothing at all besides the simple fact that this LLM is very, very good at IMO style problems?

11 days ago

[-]

Far as i can tell here, the actual advancement is in the methodology used to create a model tuned for this problem domain, and how efficient that method is. Theoretically then, making it easier to build other problem-domain-specific models.

That a highly tuned model designed to solve IMO problems can solve IMO problems is impressive, maybe, but yeah it doesn't really signal any specific utility otherwise.

[1] https://imo2025.au/wp-content/uploads/2025/07/IMO-2025_Closi...

lexandstuff

11 days ago

[-]

I don't fault you for maintaining a healthy scepticism, but per the President of the IMO: "It is very exciting to see progress in the mathematical capabilities of AI models, but we would like to be clear that the IMO cannot validate the methods, including the amount of compute used or whether there was any human involvement, or whether the results can be reproduced. What we can say is that correct mathematical proofs, whether produced by the brightest students or AI models, are valid." [1]

The proofs are correct, and it's very unlikely that IMO problems were leaked ahead of time. So the options for cheating in this circumstance are that a) IMO are colluding with a few researchers at OpenAI for some reason, or b) @alexwei_ solved the problems himself - both seem pretty unlikely to me.

suddenlybananas

11 days ago

[-]

Is b) really that unlikely?

algorithms432

10 days ago

[-]

Not really. This whole thing looks like a deliberately planned PR campaign, similar to the o3 demo. OpenAI has enough talented mathematicians. They had enough time to just solve the problems themselves. Alternatively, some participants leaking the questions for a reward isn't very unlikely either, and I definitely wouldn't put it past OpenAI to try something like that. Afterwards, they could secretly give hints or tool access to the model, or simply forge the answers, or keep rerunning the model until it gave out the correct answer. We know from FrontierMath and ARC-AGI that OpenAI can't be trusted when it comes to benchmarks.

11 days ago

[-]

Yes

suddenlybananas

10 days ago

[-]

Why?

11 days ago

[-]

This is not a benchmark, really. It's an official test.

PokemonNoGo

11 days ago

[-]

What is an _official_ test?

11 days ago

[-]

And what were the methods? How was the evaluation? They could be making it all up for all we know!

11 days ago

[-]

On OpenAI's own released papers they show Anthropic's models performing better than their own. They tend to be pretty transparent and reliable in honesty in their benchmarks.

The thing is, only leading AI companies and big tech have the money to fund these big benchmarks and run inference on them. As long as the benchmarks are somewhat publicly available and vetted by reputable scientists/mathematicians it seems reasonable to believe they're trustworthy.

tedsanders

11 days ago

[-]

Not to beat a dead horse or get into a debate, but to hopefully clarify the record:

- OpenAI denied training on FrontierMath, FrontierMath-derived data, or data targeting FrontierMath specifically

- The training data for o3 was frozen before OpenAI even downloaded FrontierMath

- The final o3 model was selected before OpenAI looked at o3's FrontierMath results

Primary source: https://x.com/__nmca__/status/1882563755806281986

You can of course accuse OpenAI of lying or being fraudulent, and if that's how you feel there's probably not much I can say to change your mind. One piece of evidence against this is that the primary source linked above no longer works at OpenAI, and hasn't chosen to blow the whistle on the supposed fraud. I work at OpenAI myself, training reasoning models and running evals, and I can vouch that I have no knowledge or hint of any cheating; if I did, I'd probably quit on the spot and absolutely wouldn't be writing this comment.

Totally fine not to take every company's word at face value, but imo this would be a weird conspiracy for OpenAI, with very high costs on reputation and morale.

11 days ago

[-]

The International Math Olympiad isn’t an AI benchmark.

It’s an annual human competition.

11 days ago

[-]

They didn’t actually compete.

11 days ago

[-]

Correct, they took the problems from the competition.

It’s not an AI benchmark generated for AI. It was targeted at humans

bgwalter

11 days ago

[-]

fxj

10 days ago

[-]

just tried this: take the graph of the functions x^n and exp(x) how many points of intersection do they have?

chatgpt gave me the wrong answer, it claimed 2 points of intersection, but for n=4 there are 3 as one can easily derive. one for negative x and 2 points for positive x because exp(x) is growing faster than x^4.

then i corrected it and said 3 points of intersection. it said yes and gaev me the 3 points. then i said no there are 4 points of intersection and it again explained to me that there are 2 points of intersection. which is wrong.

then i asked it how many points of intersection for n=e and it said: zero

well, exp(x)=x^e for x=e, isnt it?

https://chatgpt.com/s/t_687be6c1c1b88191b10bfa7eb1f37c07

mohsen1

11 days ago

[-]

O3 Pro could solve and prove the first problem when I tried:

https://threadreaderapp.com/thread/1946477742855532918.html

cowpig

11 days ago

[-]

ghm2180

10 days ago

[-]

While this is nice for splashy headlines, I like the headlines which would read some real life usecase of math grads using AI as a companion tool for solving novel problems.

pradn

11 days ago

[-]

> Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.

GPT-5 finally on the horizon!

nextworddev

11 days ago

[-]

Yep, we are about to get fast takeoff

eboynyc32

11 days ago

[-]

The world is changing and it’s exciting. Either you’re on or you’re off. The world doesn’t wait.

jacquesm

10 days ago

[-]

There's that pesky Fermi Paradox. Who knows, we might solve that one too!

neets

11 days ago

[-]

Solve aging and disease when?

11 days ago

[-]

Guys, that's nothing. My new AI system is not LLM-based but neuro-symbolic and yet it just scored 100% on the IMO 2026 problems that haven't even been written yet, it is that good.

What? This is a claim with all the trust-worthiness of OpenAI's claim. I mean I can claim anything I want at this point and it would still be just as trust-worthy as OpenAI's claim, with exactly zero details about anything else than "we did it, promise".

TestTime_9000

10 days ago

[-]

"The LLM system's core mechanism is probably a "propose-verify" loop that operates on a vocabulary of special tokens representing formal logic expressions. At inference time, it first proposes a new logical step by generating a sequence of these tokens into its context window, which serves as a computational workspace. It then performs a subsequent computational pass to verify if this new expression is a sound deduction from the preceding steps. This iterative cycle, learned from a vast corpus of synthetic proof traces, allows the model to construct a complete, validated formal argument. This process results in a system with abstract reasoning capabilities and functional soundness across domains that depend on reasoning, achieved at the cost of computation required for its extended inference time."

charlieyu1

11 days ago

[-]

I tried P1 on chatgpt-o4-high, it tells me the solution is k=0 or 1. It doesn’t even know that k=3 is a solution for n=3. Such a solution would get 0/7 in the actual IMO.

11 days ago

[-]

What about o3-pro? Remember that their model names make no sense.

Edit due to rate-limiting:

o3-pro returned an answer after 24 minutes: https://chatgpt.com/share/687bf8bf-c1b0-800b-b316-ca7dd9b009... Whether the CoT amounts to valid mathematical reasoning, I couldn't say, especially because OpenAI models tend to be very cagey with their CoT.

Gemini 2.5 Pro seems to have used more sophisticated reasoning ( https://g.co/gemini/share/c325915b5583 ) but it got a slightly different answer. Its chain of thought was unimpressive to say the least, so I'm not sure how it got its act together for the final explanation.

Claude Opus 4 appears to have missed the main solution the others found: https://claude.ai/share/3ba55811-8347-4637-a5f0-fd8790aa820b

Be interesting if someone could try Grok 4.

charlieyu1

11 days ago

[-]

Only have basic o3 to try. Spent like 10 minutes but did not return any response due to a network error. Checking the thoughts, the model was doing a lot of brute forcing up to n=8, and found k=0,1,3, but no mathematical reasoning was seen.

11 days ago

[-]

See how this compares to what you got from o3: https://chatgpt.com/share/687bf8bf-c1b0-800b-b316-ca7dd9b009...

It convincingly argues that Gemini's answer was wrong, and Gemini agrees ( https://g.co/gemini/share/aa26fb1a4344 ).

So that's pretty cool, IMO. Pitting these two models against each other in a cage match is an underused hack in my experience.

Another observation worth making is that (looking at the Github link) OpenAI didn't just paste an image of the question into the prompt, hit the button and walk away, like I did. They rewrote the prompts carefully to get the best results, and I'm a little surprised people aren't crying foul about that. So I'm pretty impressed with o3-pro's unassisted performance.

charlieyu1

11 days ago

[-]

This answer from o3 looks better. There are still some holes, Lemma 1 works on the four right-most columns, then the model tries to apply it to any four columns. "Hence at least four columns lack a vertical line; take the last four columns above." is a slip in logic, and Lemma 1 doesn't work for any two columns. For example, if we choose columns 1, 3, 5, 7 as columns lacking a vertical line, then we can take a sunny line with slope -1/2 and this will meets at all these four columns. Still, this looks more promising than what I got.

Interesting result from Gemini, I don't know its thought process but it seemed like Gemini tried to improve from its own previous answer and then got there.

strangeloops85

11 days ago

[-]

It’s interesting how hard and widespread a push they’re making in advertising this - at this particular moment, when there are rumors of more high level recruitment attempts / successes by Zuck. OpenAI is certainly a master at trying to drive narratives. (Independent of the actual significance / advance here). Sorry, there are too many hundreds of billions of dollars involved to not be a bit cautious and wary of claims being pushed this hard.

davidguetta

11 days ago

[-]

Wait for the Chinese version

procgen

11 days ago

[-]

riding coattails

11 days ago

[-]

On Meta's all-star superintelligence team, all but 8 of the 30 research scientists are ethnic Chinese

suddenlybananas

11 days ago

[-]

I don't believe this.

PokemonNoGo

11 days ago

[-]

> level performance on the world’s most prestigious math competition

I don't know which one i would consider the most prestigious math competition but it wouldn't be The IMO. The Putnam ranks higher to me and I'm not even an American. But I've come to realise one thing and that is that high-school is very important to Americans...

11 days ago

[-]

The Putnam and IMO are quite different. I would suggest the IMO is probably harder...

kevinventullo

11 days ago

[-]

I would disagree; the IMO depends only on late middle school/early high school level mathematics (geometry, gcd, functions) while Putnam typically depends on late high school/early college-level mathematics (integrals, limits, matrices).

sigbottle

11 days ago

[-]

Well I come from the competitive programming sphere, and I would say that IOI is harder than ICPC.

When you don't know that many things, that's when creativity shines, and there are some truly genuinely shocking IOI problems.

ICPC (well, in recent years they've gotten slightly better) is pretty well known as a knowledge-heavy implementation contest. Many teams get the experience that they mind-solved a lot more problems but couldn't implement them in time. Typing up the maxflow template for the 25th time for a series of collegiate-level but ultimately standard reductions isn't that inspiring.

My favorite problems are those you can derive from basic techniques but come up with scaffolding that is truly elegant. I've set some of them myself, which have stumped some famous people you may know :)

---

I guess my point is that I can see people feeling about Putnam the same way.

kurtis_reed

11 days ago

[-]

Thank you for this invaluable contribution

stingraycharles

11 days ago

[-]

My issue with all these citations is that it’s all OpenAI employees that make these claims.

I’ll wait to see third party verification and/or use it myself before judging. There’s a lot of incentives right now to hype things up for OpenAI.

https://matharena.ai/imo/

do_not_redeem

11 days ago

[-]

A third party tried this experiment with publicly available models. OpenAI did half as well as Gemini, and none of the models even got bronze.

jsnell

11 days ago

[-]

I feel you're misunderstanding something. That's not "this exact experiment". Matharena is testing publicly available models against the IMO problem set. OpenAI was announcing the results of a new, unpublished model, on that problems set.

It is totally fair to discount OpenAI's statement until we have way more details about their setup, and maybe even until there is some level of public access to the model. But you're doing something very different: implying that their results are fraudulent and (incorrectly) using the Matharena results as your proof.

11 days ago

[-]

If OpenAI would publish the models before the competition, then one could verify that they were not tinkered with. Assuming that there exists a way for them to prove that a model is the same, at least. Since the weights are not open, the most basic approach is void.

do_not_redeem

11 days ago

[-]

Fair enough, edited.

gettingoverit

11 days ago

[-]

Implying results are fraudulent is completely fair when it is a fraud.

The previous time they had claims about solving all of the math right there and right then, they were caught owning the company that makes that independent test, and could neither admit nor deny training on closed test set.

tedsanders

11 days ago

[-]

Just to quickly clarify:

- OpenAI doesn't own Epoch AI (though they did commission Epoch to make the eval)

- OpenAI denied training on the test set (and further denied training on FrontierMath-derived data, training on data targeting FrontierMath specifically, or using the eval to pick a model checkpoint; in fact, they only downloaded the FrontierMath data after their o3 training set was frozen and they didn't look at o3's FrontierMath results until after the final o3 model was already selected. primary source: https://x.com/__nmca__/status/1882563755806281986)

You can of course accuse OpenAI of lying or being fraudulent, and if that's how you feel there's probably not much I can say to change your mind. One piece of evidence against this is that the primary source linked above no longer works at OpenAI, and hasn't chosen to blow the whistle on the supposed fraud. I work at OpenAI myself, training reasoning models and running evals, and I can vouch that I have no knowledge or hints of any cheating; if I did, I'd probably quit on the spot and absolutely wouldn't be writing this comment.

Totally fine not to take every company's word at face value, but imo this would be a weird conspiracy for OpenAI, with very high costs on reputation and morale.

gettingoverit

11 days ago

[-]

That said, I missed the slight semantic difference between "being funded by" and "owning", even though I don't see how that would be different in practice.

Regarding the second point, I don't see how "hav[ing] a verbal agreement that these materials will not be used in model training" would actually discourage someone from not doing it, because breaking that kind of verbal agreement wouldn't cause any harm.

I have not been aware of those other claims on Twitter, but IMO they do not create sufficient basis for an investor fraud case either, because Twitter is not an official way of communcating to investors, which means they can claim whatever they want there. IANAL though.

I'm really looking for FrontierMath-level problems to be solvable by OpenAI models, and being able to validate it myself, yet I don't have much hope it will happen during my lifetime.

MegaButts

11 days ago

[-]

> One piece of evidence against this is that the primary source linked above no longer works at OpenAI, and hasn't chosen to blow the whistle on the supposed fraud.

Everywhere I worked offered me a significant amount of money to sign a non-disparagement agreement after I left. I have never met someone who didn't willingly sign these agreements. The companies always make it clear if you refuse to sign they will give you a bad recommendation in the future.

gettingoverit

11 days ago

[-]

One piece of evidence I have is that most powerful reasoning models can answer at most 10% of my questions on PhD-student level computer science, and are unable to produce correct implementations for basic algorithms, being provided direct references to their implementations and materials that describe them. Damn, o3 can't draw an SVG arrow. Recent advancement in counting "r" in "strawberry" is basically as far as it goes.

I don't know what exactly is at play here, and how exactly OpenAI's models can produce those "exceptionally good" results in benchmarks and at the same time be utterly unable to do even a quarter of that in private evaluation of pretty much everyone I knew. I'd expect them to use some kind of RAG techniques that make the question "what was in the training set at model checkpoint" irrelevant.

If you consider that several billion dollars of investment and national security are at stake, "weird conspiracy" becomes a regular Tuesday.

Unfortunately I can't see beyond the first message of that primary source.

11 days ago

[-]

> if that's you feel there's probably not much I can say to change your mind

you just brought several corp statements which are not grounded into any evidence, and could be not true, so you didn't say that much so far.

> Totally fine not to take every company's word at face value, but imo this would be a weird conspiracy for OpenAI, with very high costs on reputation and morale.

prize is XXB of investments and XXXB of valuation, so nothing weird in such conspiracy.

11 days ago

[-]

They didn't try o3-pro, which (while slow) is far in front of the competition right now.

Jackson__

11 days ago

[-]

Also interesting takeaways from that tweet chain:

>GPT5 soon

>it will not be as good as this secret(?) model

11 days ago

[-]

My view is that it's less impressive than previous go and chess results. Humans are worse at competitive math than at those games, it's still very limited space and well defined problems. They may hype "general purpose" as much as they want but for now it's still the case that AI is super human at well defined limited space tasks and can't achieve performance of a mediocre below average human at simple tasks without those limitations like driving a car.

Nice result but it's just another game humans got beaten at. This time a game which isn't even taken very seriously (in comparison to ones that have professional scene).

azan_

11 days ago

[-]

The scope and creativity required for IMO is much bigger than chess/GO. Also IMO is taken VERY seriously. It's a huge deal, much bigger than any chess or go tournaments.

11 days ago

[-]

Imo competitive math (or programming) is about knowing some tricks and then trying to find a combination of them that works for a given task. The number of tricks and depth required is much less than in go or chess.

I don't think it's very creative endeavor in comparison to chess/go. The searching required is less as well. There is a challenge processing natural language and producing solutions in it though.

Creativity required is not even a small fraction of what is required for scientific breakthroughs. After all no task that you can solve in 30 minutes or so can possibly require that much creativity - just knowledge and a fast mind - things computers are amazing at.

I am AI enthusiast. I just think a lot of things that were done so far are more impressive than being good at competitive math. It's a nice result blown out of proportion by OpenAI employees.

mathluke

11 days ago

[-]

I'd disagree with this take. Math olympiads are some of the most intellectually creative activities I've ever done that fit within a one day time limit. Chess and go don't even come close--I am not a strong player, but I've studied both games for hundreds of hours. (My hot take is that chess is not even very creative at all, that's why classical AI techniques produced super human results many years ago.)

There is no list of tricks that will get a silver much less a gold medal at the IMO. The problem setters try very hard to choose problems that are not just variations of other contests or solvable by routine calculation (indeed some types of problems, like polynomial inequalities, fell out of favor as near-universal techniques made them too routine to well prepared students). Of course there are common themes and patterns that recur--no way around it given the limited curriculum they draw on--but overall I think the IMO does a commendable job at encouraging out-of-the-box thinking within a limited domain. (I've heard a contestant say that IMO prep was memorizing a lot of template solutions, but he was such a genius among geniuses that I think his opinion is irrelevant to the rest of humanity!)

Of course there is always a debate whether competition math reflects skill in research math and other research domains. There's obvious areas of overlap and obvious areas of differences, so it's hard to extrapolate from AI math benchmarks to other domains. But i think it's fair to say the skills needed for the IMO include quite general quantitative reasoning ability, which is very exciting to see LLMs develop.

10 days ago

[-]

What you are missing about chess and go is that those games are not about finding one true solution. They are very psychological games (at human level) and are about finding moves that are difficult to handle for the opponent. You try to understand how your opponent thinks and what is going to be unpleasant for them. This gives a lot of scope for creative and psychological warfare.

In competitive math (or programming) there is one correct solution and no opponent. It's just not possible for it to be very creative endeavor if those solutions can be found in very limited time.

>>(I've heard a contestant say that IMO prep was memorizing a lot of template solutions, but he was such a genius among geniuses that I think his opinion is irrelevant to the rest of humanity!)

So you have not only chosen to ignore the view of someone who is very good at it but also assumed that even though the best preparation for them is to memorize a lot of solutions it must be about creativity for people who are not geniuses like this guy? How does it make sense at all?

azan_

10 days ago

[-]

> They are very psychological games (at human level) and are about finding moves that are difficult to handle for the opponent. You try to understand how your opponent thinks and what is going to be unpleasant for them. This gives a lot of scope for creative and psychological warfare.

And yet even basic models which can run on my phone win this psychological warfare with best players in the world. The scope of problems on IMO is unlimited. Please note that IMO is won by literally best high-school students in the world and most of them are unable to solve all problems (even gold medal winners). Do you think that they are dumb and unable to learn "few tricks"?

>In competitive math (or programming) there is one correct solution and no opponent. It's just not possible for it to be very creative endeavor if those solutions can be found in very limited time.

That's absurd. You could say same things about math research (and "one correct solution" would be wrong as it is for IMO), do you consider it something that's not creative?

9 days ago

[-]

>>Do you think that they are dumb and unable to learn "few tricks"?

They are just slow because they are humans. It's like in chess: if you calculate million times faster than a human you will win even if you're pretty dumb (old school chess programs). Soon enough Chat GPT will be able to solve IMO problems at international level. It still can't play chess.

>>That's absurd. You could say same things about math research (and "one correct solution" would be wrong as it is for IMO), do you consider it something that's not creative?

Have you missed the other condition? No meaningful math research can be done in 30-60 minutes (time you have for IME problem). Nothing of value that require creativity can be done in short time. Creativity requires forming a mental model, exploration, trying various paths, making connections. This requires time.

My point about math competitions not being taken as seriously also stands. People train chess or go for 10-12 years before they start peaking and then often improve after that as well. This is a lot of hours every day. Math competitions aren't done for so many hours and years and almost no one does them anymore once in college.

This means level at those must be low in comparison to endeavours people pursue professionally.

bmau5

11 days ago

[-]

The significance, though, is that the "very limited space and well defined problems" continue to expand. Moving from a purpose built system for playing a single game, vs. having a system that can address a broader set of problems would still be a significant step - as more high value tasks will fall into it's competency range. It seems the next big step will be on us to improve eval/feedback systems in less defined problems.

another_twist

11 days ago

[-]

Its a level playing field IMO. But theres another thread which claims not even bronze and I really don't want to go to X for anything.

11 days ago

[-]

I can save you the click. Public models (gemini/o3) are less than bronze. this is a specially trained model which is not publicly available.

aerhardt

11 days ago

[-]

Is the conclusion that elite models are being withheld from the public, or that general models are not that general?

tootyskooty

11 days ago

[-]

The conclusion is that research takes time to productize, and this is cutting-edge research. OAI employees stated that there isn't anything math-specific (think AlphaGeometry) about this model. It's a general system.

og_kalu

11 days ago

[-]

elite(er) models are always being withheld from the public. Open AI sat on GPT-4 for 8 months before actually releasing it.

csomar

11 days ago

[-]

The issue is that trust is very hard to build and very easy to lose. Even in today's age where regular humans have a memory span shorter than that of an LLM, OpenAI keeps abusing the public's trust. As a result, I take their word on AI/LLMs about as seriously as I'd take my grocery store clerk's opinion on quantum physics.

11 days ago

[-]

I still haven’t forgotten OpenAI’s FrontierMath debacle from December. If they really have some amazing math-solving model, give us more info than a vague twitter hype-post.

11 days ago

[-]

> The issue is that trust is very hard to build and very easy to lose

I think its opposite: general public blindly trusts all kind of hyped stuff, its a very few hyper-skeptical who are some fraction of percent of population.

gniv

11 days ago

[-]

Especially since they are saying they don't plan to release this kind of model anytime soon.

mrdependable

11 days ago

[-]

I like how they always say AI will advance science when they want to sell it to the public, but pump how it will replace workers when selling it to businesses. It’s like dangling a carrot while slowly putting a knife to our throats.

Edit: why was my comment moved from the one I was replying to? It makes no sense here on its own.

darkoob12

11 days ago

[-]

I don't know how much novelty should you expect from IMO every year but i expect many of them be variation of the same problem.

These models are trained on all old problem and their various solutions.For LLM models, solving thses problems are as impressive as writing code.

There is no high generalization.

mathluke

11 days ago

[-]

You should expect quite a bit of novelty from the IMO, given the constraint of high school level curriculum. The problem setters work very hard to avoid problems that are variations of other contests or solvable by routine methods. That's why this is a very exciting result--you can't just regurgitate homework problem solutions to get a high score at the IMO.

OtherShrezzing

11 days ago

[-]

The 4 hours time limit is a bit of an odd metric, given that OpenAI have effectively an unlimited amount of compute at their disposal. If they’re running the model on 100,000 GPUs for 4hrs, that’s obviously going to have better outcomes than running it on 5.

zkmon

11 days ago

[-]

This is an awesome progress in human achievement to get these machines intelligent. And this is also a fast regress and decline on the human wisdom!

We are simply greasing the grooves and letting things slide faster and faster and calling it progress. How does this help to make the human and nature integration better?

Does this improve climate or make humans adapt better to changing climate? Are the intelligent machines a burning need for the humanity today? Or is it all about business and political dominance? At what cost? What's the fall out of all this?

jebarker

11 days ago

[-]

Nobody knows the answers to these questions. Relying on AGI solving problems like climate change seems like a risky strategy but on the other hand it’s very plausible that these tools can help in some capacity. So we have to build, study and find out but also consider any opportunity cost of building these tools versus others.

jfengel

11 days ago

[-]

Solving climate change isn't a technical problem, but a human one. We know the steps we have to take, and have for many years. The hard part is getting people to actually do them.

No human has any idea how to accomplish that. If a machine could, we would all have much to learn from it.

jebarker

11 days ago

[-]

I disagree with this assessment. We don’t know the steps we have to take. We know a set of steps we could take but they’re societally unpalatable. Technology can potentially offer alternative steps or introduce societal changes that make the first set of steps more palatable.

jebarker

11 days ago

[-]

I feel I should clarify as clearly this is an unpopular opinion: I’m not saying climate change can be solved by technology alone, but I do believe that enabling the societal changes needed to deal with climate change requires using every tool we have at our disposal and that includes technology. I don’t really see why this is controversial and would love to hear that perspective.

jfengel

11 days ago

[-]

My perspective is that no technology can do much while the social problem is so intractable.

If somehow we could get past the social problem, technology will happen. Probably quickly, once we had some agreement that it was a thing worth doing. But until then the technology is largely moot.

11 days ago

[-]

>AI model performs astounding feat everyone claimed was impossible or won’t be achieved for a while

>Commenters on HN claim it must not be that hard, or OpenAI is lying, or cheated. Anything but admit that it is impressive

Every time on this site lol. A lot of people here have an emotional aversion to accepting AI progress. They’re deep in the bargaining/anger/denial phase.

efitz

11 days ago

[-]

I’ve been thinking a lot about what AI means about being human. Not about apocalypses or sentience or taking our jobs, but about “what is a human” and “what is the value of a human”.

All my life I’ve taken for granted that your value is related to your positive impact, and that the unique value of humans is that we can create and express things like no other species we’ve encountered.

Now, we have created this thing that has ripped away many of my preconceptions.

If an AI can adequately do whatever a particular person does, then is there still a purpose for that person? What can they contribute? (No I am not proposing or even considering doing anything about it).

It just makes me sad, like something special is going away from the world.

11 days ago

[-]

The fact that you're honestly grappling with this reality puts you far ahead of most people.

It seems a common recent neurosis (albeit protective one) to proclaim a permanent human preeminence over the world of value, moral status and such for reasons extremely coupled with our intelligence, and then claim that certain kinds of intelligence have nothing to do with it when our primacy in those specific realms of intelligence is threatened. This will continue until there's nothing humans have left to bargain with.

The world isn't what we want it to be, the world is what it is. The closest thing we have to the world turning out the way we want it making it that way. Which is why I think many of those who hate AI would give their desires for how the world ought to be a better fighting chance by putting in the work to making it so, rather than sitting in denial at what is happening in the world of artificial intelligence.

efitz

11 days ago

[-]

That’s very insightful, thank you.

I agree that denial is not an approach that’s likely to be productive.

tomrod

11 days ago

[-]

Sort of a naive response, considering many of the folks calling out the issues have significant experience building with LLMs or building LLMs.

11 days ago

[-]

Denying the rapid improvement in AI is the only naivety that really matters in the long run at this point. I haven’t seen much substantive criticism of this achievement that boils down to anything more than “well it’s a secret model so they must not be telling us something”

OldfieldFund

11 days ago

[-]

I'm building with LLMs, and they're solving problems that weren't possible to solve before due to how many resources they would consume. Resources, as in human-hours.

Finance, chemistry, biology, medicine.

11 days ago

[-]

A problem as old as the field. Quote from the 80s:

> There is a related “Theorem” about progress in AI: once some mental function is programmed, people soon cease to consider it as an essential ingredient of “real thinking”. The ineluctable core of intelligence is always in that next thing which hasn’t yet been programmed. This “Theorem” was first proposed to me by Larry Tesler, so I call it Tesler’s Theorem: “AI is whatever hasn’t been done yet.”

nick238

11 days ago

[-]

I guess my major question would be: does the training data include anything from 2025 which may have included information about the IMO 2025?

Given that AI companies are constantly trying to slurp up any and all data online, if the model was derived from existing work, it's maybe less impressive than at first glance. If present-day model does well at IMO 2026, that would be nice.

11 days ago

[-]

Are the human participants in the IMO held to the same standard?

bgwalter

11 days ago

[-]

Thank you for this pop psychology evaluation. It could have been written by an an "AI".

11 days ago

[-]

Simply stating what is going on in the comments, as happens any time AI hits a previously thought to be impossible or far-off milestone.

bgwalter

11 days ago

[-]

I'm very much not in denial about my open source being being stolen without attribution or the layoffs that are rationalized by fake "AI" productivity.

You do have a point though that we should be writing to Sen. Marsha Blackburn instead of complaining here.

demaga

11 days ago

[-]

It's a good camp to be in. If we're wrong, that will be awesome!

Oras

11 days ago

[-]

Get ready to downvoted, the wave of single minded is coming

11 days ago

[-]

There is a diversity of opinions on this site. I do hope that soon more of the intelligent commenters who have spent a while denying AI progress will realize what’s actually happening and contribute their brainpower to a meaningful cause in the lead up to AI-human parity. If we want a good future in a world where AI is smarter than humans, we need to do alignment work soon.

11 days ago

[-]

The cynicism/denial on HN about AI is exhausting. Half the comments are some weird form of explaining away the ever increasing performance of these models

I've been reading this website for probably 15 years, its never been this bad. many threads are completely unreadable, all the actual educated takes are on X, its almost like there was a talent drain

uludag

11 days ago

[-]

Cynicism and denial are two very different things, and have very different causes and justifications. I personally don't deny that LLMs are very powerful and are capable and capable of eliminating many jobs. At the same time I'm very cynical about the rollout and push for AI. I don't see in any way as a push for a "better" society or towards some notion of progress, but rather an enthusiastic effort to disempower employees, centralize power, expand surveillance, increase profits, etc.

kilna

11 days ago

[-]

AI is kerosene. A useful resource when applied with reason and compassion. Late stage capitalism is a dumpster full of garbage. AI in combination with late stage capitalism is a dumpster fire. Many, perhaps most people conflate the dumpster fire with "kerosine evil!"

literatepeople

11 days ago

[-]

Can be, but why not take them at their word? The people building these systems are directly stating the goal is to replace people. Should anyone blame people for being mad at not just the people building the systems, but the systems themselves?

nurettin

11 days ago

[-]

> AI in combination with late stage capitalism

What's the alternative here?

gellybeans

11 days ago

[-]

Making an account just to point out how these comments are far more exhausting, because they don't engage with the subject matter. They are just agreeing with a headline and saying, "See?"

You say, "explaining away the increasing performance" as though that was a good faith representation of arguments made against LLMs, or even this specific article. Questionong the self-congragulatory nature of these businesses is perfectly reasonable.

uh_uh

11 days ago

[-]

But don't you think this might be a case where there is both self-congragulation and actual progress?

overfeed

11 days ago

[-]

The level of proof for the latter is much higher, and IMO, OpenAI hasn't met the bar yet.

Something really funky is going on with newer AI models and benchmarks, versus how they perform subjectively when I use them for my use-cases. I say this across the board[1], not just regarding IpenAI. I don't know if frontier labs have run into Goodheart's law viz benchmarks, or if my use-cases that are atypical.

1. I first noticed this with Claud 3.5 vs Claud 3.7

gellybeans

11 days ago

[-]

That's a fair question, and I agree. I just find it odd how we shout across the aisle, whether in favor or against. It's a case of thinking the tech is neat, while cringing at all the money-people and their ideations.

softwaredoug

11 days ago

[-]

Probably because both sides have strong vested interests and it’s next to impossible to find a dispassionate point of view.

The Pro AI crowd, VC, tech CEOs etc have strong incentive to claim humans are obsolete. Many tech employees see threats to their jobs and want to poopoo any way AI could be useful or competitive.

orbital-decay

11 days ago

[-]

That's a huge hyperbole. I can assure you many people find the entire thing genuinely fascinating, without having any vested interest and without buying the hype.

spacemadness

11 days ago

[-]

Sure but it’s still a gold rush with a lot of exaggeration pushed by tech executives to acquire investors. There’s a lot of greed and fear to go around. I think LLMs are fascinating and cool myself having grown up with Eliza and crappy expert systems, but am more interested in deep learning outcomes like Alphafold than general purpose LLMs. You don’t hear enough about non-LLM AI because of all the money riding on LLM based tech. It’s hard not to see the bad behavior that has arisen due to all the money being thrown about. So that is to say it makes sense there is some skepticism as you can’t take what these companies say at face value. It’d be nice to have a toned down discussion about what LLMs can and can’t do but there is a lot of obfuscation and hype. Also there is the conversation about what they should or shouldn’t be doing which is completely fair to talk about.

chii

11 days ago

[-]

That's just another way to state that everybody is almost always self-serving when it comes to anything.

rvz

11 days ago

[-]

Or some can spot a euphoric bubble when they see it with lots of participants who have over-invested in 90% of these so called AI startups that are not frontier labs.

yunwal

11 days ago

[-]

What does this have to do with the math Olympiad? Why would it frame your view of the accomplishment?

11 days ago

[-]

Why don’t they release some info beyond a vague twitter hype post? I’m beginning to hate OpenAI for releasing statements like this that invariably end up being less impressive than they make it sound initially.

https://github.com/aw31/openai-imo-2025-proofs/tree/main

11 days ago

[-]

The proofs were published on GitHub for inspection, along with some details (generated within the time limit, by a system locked before the problems were released, with no external tools).

11 days ago

[-]

dude we have computers reasoning in english to solve math problems, what are you even talking about

simion314

11 days ago

[-]

I tested all available models on a harder highschool limit and they failed even with thinking mode, so I doubt OpenAI is telling the truth.

motoboi

11 days ago

[-]

Accepting openai at face value is just the lazy stance.

Finding a critic perspective and try to understand why it can be wrong is more fun. You just say "I was wrong" when proved wrong.

chvid

11 days ago

[-]

Two things can happen at the same time: Genuine technological progress and the “hype machine” going into absolute overdrive.

The problem with the hype machine is that it provokes an opposite reaction and the noise from it buries any reasonable / technical discussion.

https://news.ycombinator.com/item?id=4611830

Ologn

11 days ago

[-]

> I've been reading this website for probably 15 years, its never been this bad.

People here were pretty skeptical about AlexNet, when it won the ImageNet challenge 13 years ago.

Karrot_Kream

11 days ago

[-]

Ouch that thread makes me quite sad at the state of discourse on HN today. It's a lot better than this thread.

scarmig

11 days ago

[-]

That thread was skeptical, but it's still far more substantive than what you find here today.

11 days ago

[-]

I think that's because the announcement there actually told you something technically interesting. This just presents a result (which is cool), but the actual method is what is really cool!

rafaelero

11 days ago

[-]

Indeed it's a very unsophisticated and obnoxious audience around here. They are so conservative and unadventurous that anything possibly transformative is labeled as hype. I feel bad for them.

mpalmer

11 days ago

[-]

Enthusiastically denouncing or promoting something is much, much easier and more rewarding in the short term for people who want to appear hip to their chosen in-group - or profit center.

And then, it's likewise easy to be a reactionary to the extremes of the other side.

The middle is a harder, more interesting place to be, and people who end up there aren't usually chasing money or power, but some approximation of the truth.

impossiblefork

11 days ago

[-]

I agree that there's both cynicism and denial, but when I've explained my views I have usually been able to get through to the complainers.

Usually my go-to example for LLMs doing more than mass memorization is Charton's and Lample's LLM trained on function expressions and their derivatives and which is able to go from the derivatives to the original functions and thus perform integration, but at the same time I know that LLMs are essentially completely crazy with no understanding of reality-- just ask them to write some fiction and you'll have the model outputting discussions where characters who have never met before are addressing each other by name, or getting other similarly basic things wrong, and when something genuinely is not in the model you will end up in hallucination land. So the people saying that the models are bad are not completely crazy.

With the wrong codebase I wouldn't be surprised if you need a finetune.

scarmig

11 days ago

[-]

It's caught in a kind of feedback loop. There are only so many times you can see "stochastic parrot" or "fancy autocomplete" or "can't draw hands" or "just a bunch of matmuls, it can't replicate the human soul" lines before you decide to just not engage. This leads to more of the content being exactly that, driving more people away.

At this point, there are much better places to find technical discussion of AI, pros and cons. Even Reddit.

11 days ago

[-]

yeah alot of the time now, i will draft a comment, and then not even publish it. like whats the point

thisisit

11 days ago

[-]

This sounds like a version of "HN hates X and I am tired of it". In last 10 years or so I have been reading HN, X has been crypto, Musk/Tesla and many more.

So, as much I get the frustration comments like these don't really add much. Its complaining about others complaining. Instead this should be taken as a signal that maybe HN is not the right forum to read about these topics.

qsort

11 days ago

[-]

GP is exaggerating, but this thread in particular is really bad.

It's healthy to be skeptical, and it's even healthier to be skeptical of openai, but there are commenters who clearly have no idea of what IMO problems are saying that this means nothing somehow?

ninetyninenine

11 days ago

[-]

Makes sense. Everyone here has their pride and identity tied to their ability to code. HN likes to upvote articles related to IQ because coding correlates with IQ and HNers like to think they are smart.

AI is of course a direct attack on the average HNers identity. The response you see is like attacking a Christian on his religion.

The pattern of defense is typical. When someone’s identity gets attacked they need to defend their identity. But their defense also needs to seem rational to themselves. So they begin scaffolding a construct of arguments that in the end support their identity. They take the worst aspects of AI and form a thesis around it. And that becomes the basis of sort of building a moat around their old identity as an elite programmer genius.

Tell tale sign you or someone else is doing this is when you are talking about AI and someone just comments about how they aren’t afraid of AI taking over their own job when it wasn’t even directly the topic.

If you say like ai is going to lessen the demand for software engineering jobs the typical thing you here is “I’m not afraid of losing my job” and I’m like bro, I’m not talking about your job specifically, I’m not talking about you or your fear of losing a job I’m just talking about the economics of the job market. This is how you know it’s an identity thing more than a technical topic.

imiric

11 days ago

[-]

[flagged]

11 days ago

[-]

Please don't cross into personal attack. We ban accounts that do that.

Also, please don't fulminate. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.

imiric

11 days ago

[-]

Noted.

General attacks are fine, but we draw the line at personal.

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

11 days ago

[-]

No, that's certainly not the case.

If you see a post that ought to have been moderated but hasn't been, the likeliest explanation is that we didn't see it. You can help by flagging it or emailing us at hn@ycombinator.com.

The other thing, though, is that views differ about how such comments should be classified. What seems like an outrageous "general attack" to one reader (especially if you feel passionately about a topic) may not at all land that way with the rest of the community. For this reason, it's hard to generalize; I'd need to see specific links.

LEDThereBeLight

11 days ago

[-]

Your anger about his comment suggests that it actually is about pride and identity. I simply don’t buy that most people here argue against AI because they’re worried about software quality and lowering the user experience. It’s the same argument the American Medical Association made in order to allow them to gatekeep physician jobs and limit openings. We’ve had developers working on adtech directly intended to reduce the quality of the user experience for decades now.

Karrot_Kream

11 days ago

[-]

> The fact we're now promoting and discussing fucking Twitter threads is absurd.

The ML community is really big on Twitter. I'm honestly quite surprised that you're angry or surprised at this. That means either you're very disconnected from the actual ML community, which is fine of course but then maybe you should hold your opinions a bit less tightly. Alternatively you're ideologically against Twitter which brings me to:

> It's not about pride and identity, you dingus.

Maybe it is? There's a very-online-tech-person identity that I'm familiar with that hates Twitter because they think that Twitter's short post length and other cultural factors on the site contributed to bad discourse quality. I used to buy it, but I've stopped because HN and Reddit are equally filled with terrible comments that generate more heat than light.

FWIW a bunch of ML researchers tried to switch to Bluesky but got so much hate, including death threats, sent at them that they all noped back to Twitter. That's the other identity portion of it that, post Musk there's a set of folks who hate Twitter ideologically and have built an identity around it. Unfortunately this identity also is anti-AI enough that it's willing to act with toxicity toward ML researchers. Tech cynicism and anti-capitalism has some tie-ins with this also.

So IMO there is an identity aspect to this. It might not be the "true hacker" identity that the GP talks about but I do very much think that this pro vs anti AI fight has turned into another culture war axis on HN that has more to do with your identity or tribe than any reasoned arguments.

[2] https://x.com/DimitrisPapail/status/1888325914603516214

11 days ago

[-]

Based on the past history with frontier-math & AIME 2025 [1],[2] I would not trust announcements which cant be independently verified. I am excited to try it out though.

Also, the performance of LLMs on imo 2025 was not even bronze [3].

Finally, this article shows that LLMs were just mostly bluffing [4] on usamo 2025.

[1] https://www.reddit.com/r/slatestarcodex/comments/1i53ih7/fro...

[3] https://matharena.ai/imo/

[4] https://arxiv.org/pdf/2503.21934

11 days ago

[-]

The solutions were publicly posted to GitHub: https://github.com/aw31/openai-imo-2025-proofs/tree/main

11 days ago

[-]

Did humans formalize the inputs ? or was the exact natural language input provided to the llm. A lot of detail is missing on the methodology used. Not to mention of any independent validation.

My skepticism stems from the past frontier math announcement which turned out to be a bluff.

11 days ago

[-]

People are reading a lot into the FrontierMath articles from a couple months ago, but tbh I don’t really understand what the controversy is supposed to be there. failing to clearly disclose sponsoring Epoch to make the benchmark clearly doesn’t affect performance of a model on it

miguelacevedo

11 days ago

[-]

Basically this. Not sure why people here love to doubt AI progress as it clearly makes strides

11 days ago

[-]

because per corps statements, AI are now top 0.1% of PhD in math, coding, physics, law, medicine etc, yet, when I try it myself for my work it makes stupid mistakes, so I have suspicion that corp very pushy on manipulating metrics/benchmarks.

11 days ago

[-]

I don't doubt the genuine progress in the field (from like, a research perspective) but my experience with commercial LLM products comes absolutely nowhere close to the hype.

It's reasonable to be suspicious of self aggrandizing claims from giant companies hyping a product, and it's hard not to be cynical when every forced AI interaction (be it Google search or my corporate managers or whatever) makes my day worse.

cynicalpeace

11 days ago

[-]

HN feels very low signal, since it's populated by people who barely interact with the real world

X is higher signal, but very group thinky. It's great if you want to know the trends, but gotta be careful not to jump off the cliff with the lemmings.

Highest signal is obviously non digital. Going to meetups, coffee/beers with friends, working with your hands, etc.

11 days ago

[-]

it used to be high signal though. you have to wonder if the type of people posting on here is different than it used to be

throw1235435

9 days ago

[-]

Its obvious why though. The typical "tech" culture values human ingenuity, creativity, intelligence and agency due to its history. Someone coming up with a new algorithm in their garage can build a billion dollar business - it is a indie hacker culture that historically valued "human intelligence".

i.e. it is a culture of meritocracy; where no matter your social connections, political or financial capital if you are smart and driven you can make it.

AI flips that around. It devalues human intelligence and moves the moats to the ol' school things of money, influence and power. The big winners are no longer the most hard working, or above average intelligence. Intelligence is devalued; as a wealthy person I now have intelligence at my fingertips making it a commodity rather than a virtue - but money, power and connections - that's now the moat.

If all you have is your talent the future could look quite scary in an AI world long term. Money buys the best models, connections, wealth and power become the remaining moats. This doesn't gel typically in a "indie hacker" like culture in most tech forums.

isaacremuant

11 days ago

[-]

Meh. Some over hype, some under hype. People like you whine and then don't want to listen to any technical concerns.

Some of us are implementing things in relation to AI so we know it's not about "increasing performance of models" but actual about the right solution for the right problem.

If you think Twitter has "educated takes" then maybe go there and stop being pretentious schmuck over here.

Talent drain, lol. I'd much rather have skeptics and good tips than usernames, follows and social media engagement.

rafaelero

11 days ago

[-]

Both sides are not equally wrong, clearly. Until yesterday prediction markets were saying the probability of an AI getting a gold medal in IMO in 2025 was <20%. So clearly we should be more hyped, not less.

isaacremuant

11 days ago

[-]

Prediction markets are companies and people trying to make money on volatility. Who cares? Why do people treat them as some prescient being?

wyuyang377

11 days ago

[-]

cynacism -> cynicism

scellus

11 days ago

[-]

It may be a talent drain too, but at least it's a selection bias. People just get enough and go away, or don't comment. At the extreme, that leads to a downward spiral in the epistemology of the site. Look at how AI fares in Bluesky.

As a partially separate issue, there are people trying to punish comments quoting AI by downvotes. You don't need to have a non-informative reply, just sourcing it to AI is enough. A random internet dude telling the same thing with less justification or detail is fine to them.

bit1993

11 days ago

[-]

Its because hackers are fed up of being conned by corporations that steal our code, ideas, data. They start out "open" only to rug pull. "Pissing in the pool of opensource".

As hackers we have more responsibility than the general public because we understand the tech and its side effects, we are the first line of defense so it is important to speak out not only to be on the right side of history but also to protect society.

adastra22

11 days ago

[-]

It’s the same with anything related to cryptocurrency. HN has a hate boner for certain topics.

11 days ago

[-]

The overconfidence/short sightedness on HN about AI is exhausting. Half the comments are some weird form of explaining how developers will be obsolete in five years and how close we are to AGI.

11 days ago

[-]

> Half the comments are some weird form of explaining how developers will be obsolete in five years and how close we are to AGI.

I do not see that at all in this comment section.

There is a lot of denial and cynicism like the parent comment suggested. The comments trying to dismiss this as just “some high school math problem” are the funniest example.

11 days ago

[-]

[flagged]

ionwake

11 days ago

[-]

congrats ur autistic, like me.

11 days ago

[-]

Your bio made my day

kenjackson

11 days ago

[-]

I went through the thread and saw nothing that looked like this.

I don’t think developers will be obsolete in five years. I don’t think AGI is around the corner. But I do think this is the biggest breakthrough in computer science history.

I worked on accelerating DNNs a little less than a decade ago and had you shown me what we’re seeing now with LLMs I’d say it was closer to 50 years out than 20 years out.

11 days ago

[-]

its very clearly a major breakthrough for humanity

NoOn3

11 days ago

[-]

Perhaps only for a very small part of humanity...

AtlasBarfed

11 days ago

[-]

Greatest breakthru in compsci.

You mean the one that paves the way for ancient Egyptian slave worker economies?

Or totalitarian rule that 1984 couldn't imagine?

Or...... Worse?

The intermediate classes of society always relied on intelligence and competence to extract money from the powerful.

AI means those classes no longer have power.

robitsT6

11 days ago

[-]

Right, if people want to talk about how they are worried about a future with super intelligence AI, that's I think something almost everyone can agree on is a worthy topic, maybe to different degrees but not the issue in my mind.

I think what it feels like I see a lot, are people who - because of their fear of a future with super intelligent AI - try to like... Deny the significance of the event, if only because they don't _want_ to wrestle with the implications.

I think it's very important we don't do that. Let's take this future seriously, so we can align ourselves on a better path forward... I fear a future where we have years of bickering in the public forums on the veracity or significance of claims, if only because this subset of the public who are incapable of mentally wrestling with the wild fucking shit we are walking into.

If not this, what is your personal line in the sand? I'm not specifically talking to any person when I say this. I just can't help but to feel like I'm going crazy, seeing people deny what is right in front of their eyes.

AtlasBarfed

10 days ago

[-]

It basically comes down to the fact that the AI that exists now will enrich and empower corporations and the government, but won't do much for anybody else.

The pro-AI astroturfers are building the popular consensus of acceptance for what those in power will use AI for: disenfranchisement and oppression. And they are correct because the capabilities of AI right now will enable that, as stated above.

The AI denialists are correct as well: current AI isn't what it is popularly billed. The CEOs are falling over themselves claiming they can cut headcount to zero because of their visionary implementation of AI. It's the old prototype/demo but not the real system snowjob in software sales, alllll over again.

Any actual benefit of AI to the common man comes with the current state of how tech companies "benefit" a consumer: with absurd degrees of privacy invasion, weaponized psychological algorithms, attention destruction, etc.

Here's a fun startup that in invariably in the works: the omnipresent employee monitoring AI. Every click you make, every shit you take, every coffee you sip, and every meeting you tune out. Your facial expressions analyzed, your actual "passion" measured, etc. Amazon is already doing 80% of this without AI in the warehouses.

The only saving grace to that is Covid and WFH, where they don't have the right to intrude on your workspace. So next time you hear about the return to office, remember what is coming....

infecto

11 days ago

[-]

I don’t typically find this to be true. There is a definite cynicism on HN especially when it comes to OpenAI. You already know what you will see. Low quality garbage of “I remember when OpenAI was open”, “remember when they used to publish research”, “sama cannot be trusted”, it’s an endless barrage of garbage.

11 days ago

[-]

its honestly ruining this website, you cant even read the comments sections anymore

NoOn3

11 days ago

[-]

But in the case of OpenAI, this is fully justified. Isn't that so?

infecto

11 days ago

[-]

No. Even if it’s true, it adds nothing to the conversation and especially considering the article.

blamestross

11 days ago

[-]

Nobody likes the idea that this is only "economical superior AI". Not as good as humans, but a LOT cheaper.

The "It will just get better" is bubble baiting the investors. The tech companies learned from the past and they are riding and managing the bubble to extract maximum ROI before it pops.

The reality is a lot of work done by humans can be replaced by an LLM with lower quality and nuance. The loss in sales/satisfaction/ect is more than offset by the reduced cost.

The current model of LLMs are enshitification accelerators and that will have real effects.

11 days ago

[-]

Incredible how many HNers cannot see this comment for what it is.

11 days ago

[-]

> I've been reading this website for probably 15 years, its never been this bad... all the actual educated takes are on X

Almost every technical comment on HN is wrong (see for example essentially all the discussion of Rust async, in which people keep making up silly claims that Rust maintainers then attempt to patiently explain are wrong).

The idea that the "educated" takes are on X though... that's crazy talk.

wrsh07

11 days ago

[-]

With regard to AI & LLMs Twitter/x is actually the only place with all of the industry people discussing.

There are a bunch of great accounts to follow that are only really posting content to x.

Karpathy, nearcyan, kalomaze, all of the OpenAI researchers including the link this discussion is on, many anthropic researchers. It's such a meme that you see people discuss reading Twitter thread + paper because the thread gives useful additional context.

Hn still has great comment sections on maker style posts, on network stuff, but I no longer enjoy the discussions wrt AI here. It's too hyperbolic.

11 days ago

[-]

that people on here dont know alot of the leading researchers only post on X is a tell in itself

ptero

11 days ago

[-]

I see the same effect regarding macroeconomic discussions. Great content on X that is head and shoulders (says me) above discussions on other platforms.

11 days ago

[-]

I'm unconvinced Twitter is a very good medium for serious technical discussion. I imagine a lot of this happens on the sidelines at conferences, on mailing threads and actually in organisations doing work on AI (e.g. Universities, Anthropic). The people who are doing the work are also often not the people who have time to Twitter.

Karrot_Kream

11 days ago

[-]

Have you published in ML conferences? I'm curious because I have ML researcher friends who have and they talk about Twitter a lot but I'm not an ML researcher myself.

aerhardt

11 days ago

[-]

Too hyperbolic for, against, or either way?

wrsh07

9 days ago

[-]

It honestly depends on the headline.

I think hn probably has a disproportionate number of haters while Twitter has a disproportionate number of blind believers / hype types.

But both have both.

Not sure how this compares to YouTube (although my guess is the thumbnails + titles are most egregious there for algorithm reasons)

breadsniffer

11 days ago

[-]

Yup! People here are great hackers but it’s almost like they have their head up their own ass when it comes to AI/ML.

Most of HN was very wrong about LLMs.

adastra22

11 days ago

[-]

This is true of every forum and every topic. When you actually know something about the topic you realize 90% of the takes about it are garbage.

But in most other sites the statistic is 99%, so HN is still doing much better than average.

scellus

11 days ago

[-]

No on AI, this is really a fringe environment of relatively uninformed commenters, compared to X. X has its craziness but you can curate your feeds by using lists. Here I can't choose who to follow.

And like said, the researchers themselves are on X, even Gary Marcus is there. ;)

facefactsdamnit

11 days ago

[-]

Software that mangles data on the regular should be thrown away.

How is it rational to 10x the budget over and over again when it mangles data every time?

The mind blowing thing is not being skeptical of that approach, it's defending it. It has become an article of faith.

It would be great to have AI chatbots. But chatbots that mangle data getting their budgets increased by orders of magnitude over and over again is just doubling down on the same mistake over and over again.

userabchn

11 days ago

[-]

HN doesn't have a strong enough protection against bots, so foreign influence campaign bots with the goal of spreading negative sentiment about American technology companies are, I believe, very common here.

https://news.ycombinator.com/newsguidelines.html

11 days ago

[-]

"Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data."

https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...

11 days ago

[-]

And of course it's available even in Icelandic, spoken by ~300k people, but not a single Indian language, spoken by hundreds of millions.

भारत दुर्दशा न देखी जाई...

11 days ago

[-]

Please don't take HN threads into nationalistic flamewar. It leads nowhere interesting or good.

We detached this subthread from https://news.ycombinator.com/item?id=44615783.

sebzim4500

11 days ago

[-]

Presumably almost all competitors from India would be fluent in English (given it is the second most spoken language there)? I guess the same is true of Icelandic though.

11 days ago

[-]

Yes, and there's also languages of ex-USSR countries, whose competitors presumably all understand Russian, and so on.

The real reason might be that there's an enormous class of self-loathing elites in India who actively despise the possibility of any Indian language being represented in higher education. This obviously stunts the possibility of them being used in international competitions.

tmule

11 days ago

[-]

Discussions about Indian politics or the Indian psyche—especially when laced with Indic supremacist undertones—are off-topic and an annoyance here. Please consider sharing these views in a forum focused on Indian affairs, where they’re more likely to find the traction they deserve.

11 days ago

[-]

It is not "supremacist" to believe that depriving hundreds of millions of people from higher education in their native language is deeply unjust. This reflection was prompted by a comment on why Indian languages are not represented in international competitions, which was prompted by a comment on the competition being available in many languages.

Discussions online have a tendency to go off into tangents like this. It's regrettable that this is such a contentious topic.

tmule

11 days ago

[-]

> self-loathing elites in India

Your disdain for English-speaking Indian elites (pejoratively referred to as ‘Macaulayites’ by Modi’s supporters) is quite telling. That said, as I mentioned earlier, this kind of discourse doesn’t belong here.

[1]https://www.mdpi.com/2071-1050/14/4/2168

11 days ago

[-]

My disdain is for the fact that hundreds of millions of Indians cannot access higher education in their native language, and instead of simply learning a foreign language as a subject like the rest of world, they have the bear the burden[1] of learning things in a foreign language which they have to simultaneously learn. I have disdain for the people responsible for this mess. I do not have any disdain for any language-speaking class, specially not one which I might be part of.

11 days ago

[-]

Much more efficient for us to all speak the same language. Trying to create fragmentation is inefficient.

11 days ago

[-]

You should take that up with the IMO then, or all of European Union. They provide services in ~two dozen languages.

11 days ago

[-]

Sure, but why worsen the situation by using more languages?

11 days ago

[-]

Human culture should not be particularly concerned with efficiency

chvid

11 days ago

[-]

I believe this company used to present its results and approach in academic papers with enough details so that it could be reproduced by third parties.

Now it is just doing a bunch of tweets?

11 days ago

[-]

This company used to be non profit

And many other things

samrus

11 days ago

[-]

thats when they were a real research company. the last proper research they did was instructGPT, everything since has been product development and following others. the reputation hit hasnt caught up with them because sam altman has built a whole career out of outrunning the reputation lag

do_not_redeem

11 days ago

[-]

[flagged]

falcor84

11 days ago

[-]

The tweets say that it's an unreleased model

ipsum2

11 days ago

[-]

Different model.

11 days ago

[-]

Am I missing something or is this completely meaningless? It's 100% opaque, no details whatsoever and no transparency or reproducibility.

I wouldn't trust these results as it is. Considering that there are trillions of dollars on the line as a reward for hyping up LLMs, I trust it even less.

flappyeagle

11 days ago

[-]

Yes you are missing the entire boat

aerhardt

11 days ago

[-]

The entire boat is hidden, we see nothing but a projected shadow, but we are to be blamed for missing it?

up2isomorphism

11 days ago

[-]

In fact no car company claims “gold medal” performance in Olympic running even they can do that 100 yeas ago. Obviously since IMO does not generate much money so it is an easy target.

BTW; “Gold medal performance “ looks a promotional term for me.

ddtaylor

11 days ago

[-]

Glock should show up to the UFC and win the whole tournament handily.

up2isomorphism

11 days ago

[-]

Exactly!

flappyeagle

11 days ago

[-]

LMAO

tester756

11 days ago

[-]

huh?

any details?

11 days ago

[-]

[flagged]

Jcampuzano2

11 days ago

[-]

Calling these high school/early bsc maths questions is an understatement lol.

littlestymaar

11 days ago

[-]

Which would be impressive if we knew those problems weren't in the training data already.

I mean it is quite impressive how language models are able to mobilize the knowledge they have been trained on, especially since they are able to retrieve information from sources that may be formatted very differently, with completely different problem statement sentences, different variable names and so on, and really operate at the conceptual level.

But we must wary of mixing up smart information retrieval with reasoning.

colinmorelli

11 days ago

[-]

Even if we accept as a premise that these models are doing "smart retrieval" and not "reasoning" (neither of which are being defined here, nor do I think we can tell from this tweet even if they were), it doesn't really change the impact.

There are many industries for which the vast majority of work done is closer to what I think you mean by "smart retrieval" than what I think you mean by "reasoning." Adult primary care and pediatrics, finance, law, veterinary medicine, software engineering, etc. At least half, if not upwards of 80% of the work in each of these fields is effectively pattern matching to a known set of protocols. They absolutely deal in novel problems as well, but it's not the majority of their work.

Philosophically it might be interesting to ask what "reasoning" means, and how we can assess if the LLMs are doing it. But, practically, the impacts to society will be felt even if all they are doing is retrieval.

littlestymaar

11 days ago

[-]

> There are many industries for which the vast majority of work done is closer to what I think you mean by "smart retrieval" than what I think you mean by "reasoning." Adult primary care and pediatrics, finance, law, veterinary medicine, software engineering, etc. At least half, if not upwards of 80% of the work in each of these fields is effectively pattern matching to a known set of protocols. They absolutely deal in novel problems as well, but it's not the majority of their work.

I wholeheartedly agree with that.

I'm in fact pretty bullish on LLMs, as tools with near infinite industrial use cases, but I really dislike the “AGI soon” narrative (which sets expectations way too high).

IMHO the biggest issue with LLMs isn't that they aren't good enough at solving math problem, but that there's no easy way to add information to a model after its training, which is a significant problem for a “smart information retrieval” system. RAG is used as a hack around this issue, but its performance can vary a ton with tasks. LORAs are another options, but they require significant work to make a dataset, and you can only cross your fingers the model keeps its abilities.

11 days ago

[-]

Considering these LLM utilise the entirety of the internet, there will be no unique problems that come up in the oLympiad. Even across the course of a degree, you will have likely been exposed to 95% of the various ways to write problems. As you say, retrieval is really the only skill here. There is likely no reasoning.

gcanyon

11 days ago

[-]

99.99+% of all problems humans face do not require particularly original solutions. Determining whether LLMs can solve truly original (or at least obscure) problems is interesting, and a problem worth solving, but ignores the vast majority of the (near-term at least) impact they will have.

mhh__

11 days ago

[-]

Just a stochastic parrot bro

matt3210

11 days ago

[-]

If stealing content to train models is ok, then is stealing models to merge ok?

https://news.ycombinator.com/item?id=44615695

sorokod

11 days ago

[-]

Not even bronze.

11 days ago

[-]

From the article:

Philpax

11 days ago

[-]

This is a new model. Those tests were with what's currently publicly available.

sorokod

11 days ago

[-]

Ah true.

Although it's the benchmark that is publicly available. The model is not.

Lionga

11 days ago

[-]

[flagged]

timbaboon

11 days ago

[-]

Haha no - then it wouldn't have got a gold medal ;)

11 days ago

[-]

[flagged]

11 days ago

[-]

Please see https://news.ycombinator.com/item?id=44617609.

You degraded this thread badly by posting so many comments like this.

baq

11 days ago

[-]

Velocity of AI progress in recent years is exceeded only by velocity of goalposts.

11 days ago

[-]

The goalposts should focus on being able to make a coherent statement using papers on a subject with sources. At this point it can't do that for any remotely cutting edge topic. This is just a distraction.

mindwok

11 days ago

[-]

The idea of a computer being able to solve IMO problems it has not seen before in natural language even just 3 years ago would be completely science fiction. This is astounding progress.

11 days ago

[-]

I fed the problem 1 solution into gemini and asked if it was generated by a human or llm. It said:

Conclusion: It is overwhelmingly likely that this document was generated by a human.

----

Self-Correction/Refinement and Explicit Goals:

"Exactly forbidden directions. Good." - This self-affirmation is very human.

"Need contradiction for n>=4." - Clearly stating the goal of a sub-proof.

"So far." - A common human colloquialism in working through a problem.

"Exactly lemma. Good." - Another self-affirmation.

"So main task now: compute K_3. And also show 0,1,3 achievable all n. Then done." - This is a meta-level summary of the remaining work, typical of human problem-solving "

----

bigyabai

11 days ago

[-]

LLMs are not an accurate test of whether something was written by an LLM or not.