OpenAI O3 breakthrough high score on ARC-AGI-PUB
1480 points
22 hours ago
| 151 comments
| arcprize.org
| HN
sn0wr8ven
1 hour ago
[-]
Incredibly impressive. Still can't really shake the feeling that this is o3 gaming the system more than it is actually being able to reason. If the reasoning capabilities are there, there should be no reason why it achieves 90% on one version and 30% on the next. If a human maintains the same performance across the two versions, an AI with reason should too.
reply
cornholio
30 seconds ago
[-]
but does it really matter if it "really, really" reason in the human sense, if it's able to prove some famous math theorem or come up with a novel result in theoretical physics?

While beyond current motels, that would be the final test of AGI.

reply
demirbey05
29 minutes ago
[-]
I am not expert in llm reasoning but I think because of RL. You cannot use AlphaZero to play other games.
reply
pkphilip
14 minutes ago
[-]
Yes, if a system has actually achieved AGI, it is likely to not reveal that information
reply
HeatrayEnjoyer
10 minutes ago
[-]
AGI is a spectrum, not a binary quality.
reply
GaggiX
25 minutes ago
[-]
Humans and AIs are different, the next benchmark would be build so that it emphasize the weak points of current AI models where a human is expected to perform better, but I guess you can also make a benchmark that is the opposite, where humans struggle and o3 has an easy time.
reply
bluecoconut
21 hours ago
[-]
Efficiency is now key.

~=$3400 per single task to meet human performance on this benchmark is a lot. Also it shows the bullets as "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (eg. via the API they showed off last week), so even more compute went into this task.

We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance in my subjective experience) between 5 second and 5 minutes to solve the task. (So i'd argue a human is at 0.03USD - 1.67USD per puzzle at 20USD/hr, and they include in their document an average mechancal turker at $2 USD task in their document)

Going the other direction: I am interpreting this result as human level reasoning now costs (approximately) 41k/hr to 2.5M/hr with current compute.

Super exciting that OpenAI pushed the compute out this far so we could see he O-series scaling continue and intersect humans on ARC, now we get to work towards making this economical!

reply
bluecoconut
21 hours ago
[-]
some other imporant quotes: "Average human off the street: 70-80%. STEM college grad: >95%. Panel of 10 random humans: 99-100%" -@fchollet on X

So, considering that the $3400/task system isn't able to compete with STEM college grad yet, we still have some room (but it is shrinking, i expect even more compute will be thrown and we'll see these barriers broken in coming years)

Also, some other back of envelope calculations:

The gap in cost is roughly 10^3 between O3 High and Avg. mechanical turkers (humans). Via Pure GPU cost improvement (~doubling every 2-2.5 years) puts us at 20~25 years.

The question is now, can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting for the 20-25 years for GPU improvements. (I think it feels obvious: this is new technology, things are moving fast, the chance for algorithmic innovation here is high!)

I also personally think that we need to adjust our efficiency priors, and start looking not at "humans" as the bar to beat, but theoretical computatble limits (show gaps much larger ~10^9-10^15 for modest problems). Though, it may simply be the case that tool/code use + AGI at near human cost covers a lot of that gap.

reply
miki123211
15 hours ago
[-]
It's also worth keeping in mind that AIs are a lot less risky to deploy for businesses than humans.

You can scale them up and down at any time, they can work 24/7 (including holidays) with no overtime pay and no breaks, they need no corporate campuses, office space, HR personnel or travel budgets, you don't have to worry about key employees going on sick/maternity leave or taking time off the moment they're needed most, they won't assault a coworker, sue for discrimination or secretly turn out to be a pedophile and tarnish the reputation of your company, they won't leak internal documents to the press or rage quit because of new company policies, they won't even stop working when a pandemic stops most of the world from running.

reply
fsndz
4 hours ago
[-]
I get the excitement, but folks, this is a model that excels only in things like software engineering/math. They basically used reinforcement learning to train the model to better remember which pattern to use to solve specific problems. This in no way generalises to open ended tasks in a way that makes human in the loop unnecessary. This basically makes assistants better (as soon as they figure out how to make it cheaper), but I wouldn't blindly trust the output of o3. Sam Altman is still wrong: https://www.lycee.ai/blog/why-sam-altman-is-wrong
reply
robwwilliams
3 hours ago
[-]
In your blog you say:

> deep learning doesn't allow models to generalize properly to out-of-distribution data—and that is precisely what we need to build artificial general intelligence.

I think even (or especially) people like Altman accept this as a fact. I do. Hassabis has been saying this for years.

The foundational models are just a foundation. Now start building the AGI superstructure.

And this is also where most of the still human intellectual energy is now.

reply
girvo
4 hours ago
[-]
Quite. And if it was right, those businesses deploying it and replacing humans need humans with jobs and money to pay for their products and services…
reply
fakedang
2 hours ago
[-]
It will just keep bleeding the middle class on and on, till the point where either everyone is rich, homeless or a plumber or other such licensed worker. And then there will be such a glut in the latter (shrinking) market, that everyone in that group also becomes either rich or homeless.
reply
palmfacehn
1 hour ago
[-]
Productivity gains increase the standard of living for everyone. Products and services become cheaper. Leisure time increases. Scarce labor resources can be applied in other areas.

I fail to see the difference between AI-employment-doom and other flavors of Luddism.

reply
bayindirh
1 hour ago
[-]
It also fuels the income inequality with a fatter pipe in every iteration. You get richer as you move up in the supply chain, period. Companies vertically integrate to drive costs down in the long run.

As AI gets more prevalent, it'll drive the cost down for the companies supplying these services, so the former employees of said companies will be paid lower, or not at all.

So, tell me, how paying fewer people less money will drive their standard of living upwards? I can understand the leisure time. Because, when you don't have a job, all day is leisure time. But you'll need money for that, so will these companies fund the masses via government to provide Universal Basic Income, so these people can both live a borderline miserable life while funding these companies to suck these people more and more?

reply
DAGdug
48 minutes ago
[-]
Leisure time hasn’t increased in the last 100 years except for the lower income class which doesn’t have steady employment. But yes, I see your point that the homeless person who might have had a home if he had a (now automated) factory job should surely feel good about having a phone that only the ultra rich had 40 years ago.
reply
EarthAmbassador
1 hour ago
[-]
Utter nonsense. Productivity gains of the last 40 years have been captured by shareholders and top elites. Working class wages have been flat all of that time despite that gain.

In 2012, Musk was worth $2 billion. He’s now worth 223 times that yet the minimum wage has barely budged in the last 12 years as productivity rises.

reply
palmfacehn
1 hour ago
[-]
>>Productivity gains increase the standard of living for everyone.

>Productivity gains of the last 40 years have been captured by shareholders and top elites. Working class wages have been flat...

Wages do not determine the standard of living. The products and services purchased with wages determine the standard of living. "Top elites" in 1984 could already afford cellular phones, such as the Motorola DynaTAC:

>A full charge took roughly 10 hours, and it offered 30 minutes of talk time. It also offered an LED display for dialing or recall of one of 30 phone numbers. It was priced at US$3,995 in 1984, its commercial release year, equivalent to $11,716 in 2023.

https://en.wikipedia.org/wiki/Motorola_DynaTAC

Unfortunately, touch screen phones with gigabytes of ram were not available for the masses 40 years ago.

reply
DAGdug
41 minutes ago
[-]
What a patently absurd POV! A phone doesn’t compensate for the inability to solve for basic needs - housing, healthy food, healthcare. Or being unable to invest in skill development for themselves or their offspring, save for retirement.
reply
runarberg
32 minutes ago
[-]
It is also highly likely that the cost of that phone was externalized onto a worker in a poorer country that doesn’t even have basic necessity like a running water, 24 hour electricity, food security, etc.
reply
szundi
1 hour ago
[-]
Never happened with neither big technology advancement
reply
bayindirh
1 hour ago
[-]
Wealth has bled from landlords to warlords and now bleeding to techlords.

Warlords are still rich, but both money and war is flowing towards tech. You can get a piece from that pie if you're doing questionable things (adtech, targeting, data collection, brokering, etc.), but if you're a run of the mill, normal person, your circumstances are getting harder and harder, because you're slowly squeezed out of the system like a toothpaste.

reply
rockskon
11 hours ago
[-]
AI has a different risk profile than humans. They are a lot more risky for business operations where failure is wholly unacceptable under any circumstance.

They're risky in that they fail in ways that aren't readily deterministic.

And would you trust your life to a self-driving car in New York City traffic?

reply
miki123211
6 hours ago
[-]
This is a really hard and weird ethical problem IMHO, and one we'll have to deal with sooner or later.

Imagine you have a self-driving AI that causes fatal accidents 10 times less often than your average human driver, but when the accidents happen, nobody knows why.

Should we switch to that AI, and have 10 times fewer accidents and no accountability for the accidents that do happen, or should we stay with humans, have 10x more road fatalities, but stay happy because the perpetrators end up in prison?

Framed like that, it seems like the former solution is the only acceptable one, yet people call for CEOs to go to prison when an AI goes wrong. If that were the case, companies wouldn't dare use any AI, and that would basically degenerate to the latter solution.

reply
chefandy
2 hours ago
[-]
Sadly, we live in a society where those executives would use that impunity as carte blanche to spend no money improving (in the best-case scenario,) or even more likely, keep cutting safety expenditures until the body counts get high enough for it to start damaging sales. If we’ve already given them a free pass, they will exploit it to the greatest possible extent to increase profit.
reply
ETH_start
2 hours ago
[-]
What evidence exists for this characterization?
reply
chefandy
1 minute ago
[-]
Let’s see… of the top of my head…

- Air Pollution

- Water Pollution

- Disposable Packaging

- Health Insurance

- Steward Hospitals

- Marketing Junk Food, Candy and Sodas directly to children

- Tobacco

- Boeing

- Finance

- Pharmaceutical Opiates

- Social Media

- Data Brokerage

- Mining Safety

- Styrofoam Food and Bev Containers

- ITC terminal in Deerfield Park (read about the decades of them spewing thousands of pounds benzene into the air before the whole fucking thing blew up, and how they didn’t have automatic valves, spill detection, fire detection, sprinklers… in 2019.)

And, you know, plenty more. As someone that grew up playing in an unmarked, illegal, not-access-controlled toxic waste dump in a residential area owned by a huge international chemical conglomerate— and just had some cancer taken out of me last year— I’m pretty familiar with various ways corporations are willing to sacrifice health and safety to bump up their profit margin. I guess ignoring that kids were obviously playing in a swamp of toluene, PCBs, waste firefighting chemicals, and all sorts of other things on a plot not even within sight of the factory in the middle of a bunch of small farms was just the cost of doing business. Briefly looking into the topic

reply
rgbrgb
1 hour ago
[-]
The way health insurance companies optimize for denials in the US.
reply
ajmurmann
1 hour ago
[-]
Like with Cruise. One freak accident and they practically decided to go out of business. Oh wait...
reply
moritzwarhier
4 hours ago
[-]
I don't know about your country, but people going to prison for causing road fatalities is extremely rare here.

Even temporary loss of the drivers license has a very high bar, and that's the main form of accountability for driver behavior in Germany, apart from fines.

Badly injuring or killing someone who themselves did not violate traffic safety regulations is far from guaranteed to cause severe repercussions for the driver.

By default, any such situation is an accident and at best people lose their license for a couple of months.

reply
paulryanrogers
3 hours ago
[-]
Drivers are the apex predators. My local BMV passed me after I badly failed the vision test. Thankfully I was shaken enough to immediately go to the eye doctor and get treatment.
reply
monkeynotes
2 hours ago
[-]
> nobody knows why

But we do know the culpability rests on the shoulders of the humans who decided the tech was ready for work.

reply
okasaki
5 hours ago
[-]
Wait, why would we want 10x more traffic fatalities?
reply
stavros
4 hours ago
[-]
We wouldn't, that's their point.
reply
ajmurmann
40 minutes ago
[-]
Every statistic I've seen indicated much better accident rates for self-driving cars than human drivers. I've taken Waymo rides in SF and felt perfectly safe. I've taken Lyft and Uber and especially taxi rides where I felt much less safe. So I definitely would take the self-driving car. Just because I don't understand am accident doesn't make it more likely to happen.

The one minor risk I see is the cat being too polite and getting effectively stuck in dense traffic. That's a nuisance though.

Is there something about NYC traffic I'm missing?

reply
aprilthird2021
30 minutes ago
[-]
There's one important part about risk management though. If your Waymo does crash, the company is liable for it, and there's no one to shift the blame onto. If a human driver crashes, that's who you can shift liability onto.

Same with any company that employs AI agents. Sure they can work 24/7, but every mistake they make the company will be liable for (or the AI seller). With humans, their fraud, their cheating, their deception, can all be wiped off the company and onto the individual

reply
MaxPock
5 hours ago
[-]
It depends with what the risk is .Would it be whole or in part ? In an organisation,failure by an HR might present an isolated departmental risk while an AI might not be the case.
reply
lxgr
11 hours ago
[-]
Isn't everybody in NYC already? (The dangers of bad driving are much higher for pedestrians than for people in cars; there are more of the former than of the latter in NYC; I'd expect there to be a non-zero number of fully self driving cars already in the city.)
reply
chefandy
2 hours ago
[-]
If there are any fully-autonomous cars on the streets of nyc, there aren’t many of them and I don’t think there’s any way for them to operate legally. There has been discussion about having a trial.
reply
rockskon
10 hours ago
[-]
That doesn't answer my question.
reply
9dev
5 hours ago
[-]
It does, in a way; AI is already there, all around you, whether you like it or not. Technological progress is Pandora’s box; you can’t take it back or slow it down. Businesses will use AI for critical workflows, and all good that they bring, and all bad too, will happen.
reply
rockskon
1 hour ago
[-]
How about you answer my question since he did not.

Would you trust your life to a self-driving car in New York City traffic?

reply
lxgr
27 minutes ago
[-]
GP got it exactly right: I already am. There's no way for me to opt out of having self-driving cars on the streets I regularly cross as a pedestrian.
reply
zelphirkalt
4 hours ago
[-]
Deterministic they may be, but unforeseeable for humans.
reply
wwweston
11 hours ago
[-]
We can just insulate businesses employing AI from any liability, problem solved.
reply
9dev
5 hours ago
[-]
„Well, our AI that was specifically designed for maximising gains above all else may indeed have instructed the workers to cut down the entire Amazonas forest for short-term gains in furniture production.“ But no human was involved in the decision, so nobody is liable and everything is golden? Is that the future you would like to live in?
reply
lazide
5 hours ago
[-]
Hmmm, how much stock do I own in this hypothetical company? (/s, kinda)
reply
fsloth
7 hours ago
[-]
I guess - yes from business&liability sense? ”This service you are now paying for 100$? We can sell it to you for 5$ but with the caveat _we give no guarantees if it works or is it fit for purpose_ - click here to accept”.
reply
ijidak
9 hours ago
[-]
It is amazing to me that we have reached an era where we are debating the trade-off of hiring thinking machines!

I mean, this is an incredible moment from that standpoint.

Regarding the topic at hand, I think that there will always be room for humans for the reasons you listed.

But even replacing 5% of humans with AI's will have mind boggling consequences.

I think you're right that there are jobs that humans will be preferred for for quite some time.

But, I'm already using AI with success where I would previously hire a human, and this is in this primitive stage.

With the leaps we are seeing, AI is coming for jobs.

Your concerns relate to exactly how many jobs.

And only time will tell.

But, I think some meaningful percentage of the population -- even if just 5% of humanity will be replaced by AI.

reply
TheOtherHobbes
5 hours ago
[-]
It's all fun and games until the infra crashes and you can't work out why, because a machine has written all of the code, no one understands how it works or what it's doing.

Or - worse - there is no accessible code anywhere, and you have to prompt your way out of "I'm sorry Dave, I can't do that," while nothing works.

And a human-free economy does... what? For whom? When 99% of the population is unemployed, what are the 1% doing while the planet's ecosystems collapse around them?

reply
exhaze
18 minutes ago
[-]
You misunderstand the fundamentals. I've built a type-safe code generation pipeline using TypeScript that enforces compile-time and runtime safety. Everything generates from a single source of truth - structured JSON containing the business logic. The output is deterministic, inspectable, and version controlled.

Your concerns about mysterious AI code and system crashes are backwards. This approach eliminates integration bugs and maintenance issues by design. The generated TypeScript is readable, fully typed, and consistently updated across the entire stack when business logic changes.

If you're struggling with AI-generated code maintainability, that's an implementation problem, not a fundamental issue with code generation. Proper type safety and schema validation create more reliable systems, not less. This is automation making developers more productive - just like compilers and IDEs did - not replacing them.

The code works because it's built on sound software engineering principles: type safety, single source of truth, and deterministic generation. That's verifiable fact, not speculation.

reply
sirsinsalot
5 hours ago
[-]
It honestly borders on psychopathic the way engineers are treating humans in this context.

People talking like this also, in the back of their minds like to think they'll be OK. They're smart enough to be still needed. They're a human, but they'll be OK even while working to make genAI out perform them at their own work.

I wonder how they'll feel about their own hubris when they struggle to feed their family.

The US can barely make healthcare work without disgusting consequences for the sick. I wonder what mass unemployment looks like.

reply
bnj
3 hours ago
[-]
For the moment the displacement is asymmetrical; AI replacing employees, but not AI replacing consumers. If AI causes mass unemployment, the pool of consumers (profit to companies) will shrink. I wonder what the ripple effects of that will be.
reply
sirsinsalot
29 minutes ago
[-]
There's no point being rich in a world where the economy is unhealthy.
reply
jvanderbot
2 hours ago
[-]
It honestly borders on midwit to constantly introduce a false dichotomy of AI vs humans. It's just stupid base animal logic.

There is absolutely no reason a programmer should expect to write code as they do now forever, just as ASM experts had to move on. And there's no reason (no precedent and no indicators) to expect that a well-educated, even-moderately-experienced technologist will suddenly find themselves without a way to feed their family - unless they stubbornly refuse to reskill or change their workflows.

I do believe the days of "everyone makes 100k+" are nearly over, and we're headed towards a severely bimodal distribution, but I do not see how, for the next 10-15 years at least, we can't all become productive building the tools that will obviate our own jobs while we do them - and get comfortably retired in the mean time.

reply
twh270
1 hour ago
[-]
Reskill to what? When AI can do software development, it will also be able to do pretty much any other job that requires some learning.
reply
jvanderbot
1 hour ago
[-]
Even if one refuses to move on from software dev to something like AI deployer or AI validator or AI steerer, there might be a need.

If innovation ceases, then AI is king - push existing knowledge into your dataset, train, and exploit.

If innovation continues, there's always a gap. It takes time for a new thing to be made public "enough" for it to be ingested and synthesized. Who does this? Who finds the new knowledge?

Who creates the direction and asks the questions? Who determines what to build in the first place? Who synthesizes the daily experience of everyone around them to decide what tool needs to exist to make our lives easier? Maybe I'm grasping at straws here, but the world in which all scientific discovery, synthesis, direction and vision setting, etc, is determined by AI seems really far away when we talk about code generation and symbolic math manipulation.

These tools are self driving cars, and we're drivers of the software fleet. We need to embrace the fact that we might end up watching 10 cars self operate rather than driving one car, or maybe we're just setting destinations, but there simply isn't an absolutist zero sum game here unless all one thinks about is keeping the car on the road.

AND even if there were, repeating doom and feeling helpless is the last thing you want. Maybe it's not good truth that we can all adapt and should try, but it's certainly good policy.

reply
losteric
2 hours ago
[-]
There is no comfortable retirement if the process of obviating our own jobs is not coupled with appropriate socioeconomic changes.
reply
jvanderbot
1 hour ago
[-]
I don't see it. Don't you have a 401k or EU style pension? Aren't you saving some money? If not, why are you in software? I don't make as much as I thought I might, but I make enough to consider the possibility of surviving a career change.
reply
a2800276
3 hours ago
[-]
But when Sam Altman owns all the money in the world surely he'll distribute some it via his not-for-profit AI company?
reply
antihipocrat
15 hours ago
[-]
AI brings similar risks - they can leak internal information, they can be tricked into performing prohibited tasks (with catastrophic effects if this is connected to core systems), they could be accused of actions that are discriminatory (biased training sets are very common).

Sure, if a business deploys it to perform tasks that are inherently low risk e.g. no client interface, no core system connection and low error impact, then the human performing these tasks is going to be replaced.

reply
snozolli
13 hours ago
[-]
they can be tricked into performing prohibited tasks

This reminds me of the school principal who sent $100k to a scammer claiming to be Elon Musk. The kicker is that she was repeatedly told that it was a scam.

https://abc7chicago.com/fake-elon-musk-jan-mcgee-principal-b...

reply
tstrimple
11 hours ago
[-]
This is one of the things which annoys me most about anti-LLM hate. Your peers aren't right all the time either. They believe incorrect things and will pursue worse solutions because they won't acknowledge a better way. How is this any different from a LLM? You have to question everything you're presented with. Sometimes that Stack Overflow answer isn't directly applicable to your exact problem but you can extrapolate from it to resolve your problem. Why is an LLM viewed any differently? Of course you can't just blindly accept it as the one true answer, but you literally cannot do that with humans either. Humans produce a ton of shit code and non-solutions and it's fine. But when an LLM does it, it's a serious problem that means the tech is useless. Much of the modern world is built on shit solutions and we still hobble along.
reply
lazide
11 hours ago
[-]
Everyone knows humans can be idiots. The problem is that people seem to think LLMs can’t be idiots, and because they aren’t human there is no way to punish them. And then people give them too much credit/power, for their own purposes.

Which makes LLMs far more dangerous than idiot humans in most cases.

reply
0points
2 hours ago
[-]
Not people.

Certain gullible people, who tends to listen to certain charlatans.

Rational, intelligent people wouldn't consider replacing a skilled human worker with a LLM that on a good day can compete with a 3-year old.

You may see the current age as litmus for critical thinking.

reply
brookst
10 hours ago
[-]
No. Nobody thinks LLMs are perfect. That’s a strawman.

And… I am really not sure punishment is the answer to fallibility, outside of almost kinky Catholicism.

The reality is these things are very good, but imperfect, much like people.

reply
Mordisquitos
5 hours ago
[-]
> No. Nobody thinks LLMs are perfect. That’s a strawman.

I'm afraid that's not the case. Literally yesterday I was speaking with an old friend who was telling us how one of his coworkers had presented a document with mistakes and serious miscalculations as part of some project. When my friend pointed out the mistakes, which were intuitively obvious just by critically understanding the numbers, the guy kept insisting "no, it's correct, I did it with ChatGPT". It took my friend doing the calculations explicitly and showing that they made no sense to convince the guy that it was wrong.

reply
thecupisblue
9 hours ago
[-]
Sorry man, but I literally know of startups invested into by YC where CEO's for 80% of their management decisions/vision/comms use ChatGPT ... or should I say some use Claude now, as they think it's smarter and does not make mistakes.

Let that sink in.

reply
onion2k
9 hours ago
[-]
I wouldn't be surprised if GPT genuinely makes better decisions than an inexperienced, first-time CEO who has only been a dev before, especially if the person prompting it has actually put some effort into understanding their own weaknesses. It certainly wouldn't be any worse than someone who's only experience is reading a few management books.
reply
lazide
8 hours ago
[-]
And here is a great example of the problem.

An LLM doesn’t make decisions. It generates text that plausibly looks like it made a decision, when prompted with the right text.

reply
beardedwizard
5 hours ago
[-]
Why is this distinction lost in every thread on this topic, I don't get it.
reply
lazide
2 hours ago
[-]
A lot more people are credulous idiots than anyone wants to believe - and the confusion/misunderstanding is being actively propagated.
reply
sirsinsalot
4 hours ago
[-]
Think of all the human growth and satisfaction being lost to risk mitigation by offloading the pleasure of failure to Machines.
reply
lazide
2 hours ago
[-]
Ah, but machines can’t fail! So don’t worry, humans will still get to experience the ‘pleasure’. But won’t be able to learn/change anything.
reply
lazide
10 hours ago
[-]
Clearly you haven’t been listening to any CEO press releases lately?

And when was the last time a support chatbot let you actually complain or bypass to a human?

reply
gf000
6 hours ago
[-]
But human stupidity, while itself can be sometimes an unknown unknown with its creativity, is a mostly known unknown.

LLMs fail in entirely novel ways you can't even fathom upfront.

reply
halgir
4 hours ago
[-]
> LLMs fail in entirely novel ways you can't even fathom upfront.

Trust me, so do humans. Source: have worked with humans.

reply
sirsinsalot
4 hours ago
[-]
GenAI has a 100% failure to enjoy quality of life, emotional fulfillment and psychological safety.

Id say those are the goals we should be working for. That's the failure we want to look at. We are humans.

reply
pineaux
9 hours ago
[-]
Its quite stunning to frame it as anti-LLM hate. It's on the pro-LLM people to convince the anti-LLM people that choosing for LLMs is an ethically correct choice with all the necessary guardrails. It's also on the pro-LLM people to show the usefulness of the product. If pro-LLM people are right, it will be a matter of time before these people will see the errors of their ways. But doing an ad-hominem is a sure way of creating a divide...
reply
mplewis
11 hours ago
[-]
Humans can tell you how confident they are in something being right or wrong. An LLM has no internal model and cannot do such a thing.
reply
swiftcoder
9 hours ago
[-]
> Humans can tell you how confident they are in something being right or wrong

Humans are also very confidently wrong a considerable portion of the time. Particularly about anything outside their direct expertise

reply
SketchySeaBeast
1 hour ago
[-]
People only being willing to say they are unsure some of the time is still better than LLMs. I suppose, given that everything is outside of their area of expertise, it's very human of them.
reply
daveguy
1 hour ago
[-]
That's still better than never being able to make an accurate confidence assessment. The fact that this is worse outside your expertise is a main reason why expertise is so valued in hiring decisions.
reply
jvanderbot
2 hours ago
[-]
Generally, I agree with you. But, there are risks other than "But a human might have a baby any time now - what then??".

For AI example(s): Attribution is low, a system built without human intervention may suddenly fall outside its own expertise and hallucinate itself into a corner, everyone may just throw more compute at a system until it grows without bound, etc etc.

This "You can scale up to infinity" problem might become "You have to scale up to infinity" to build any reasonably sized system with AI. The shovel-sellers get fantastically rich but the businesses are effectively left holding the risk from a fast-moving, unintuitive, uninspected, partially verified codebase. I just don't see how anyone not building a CRUD app/frontend could be comfortable with that, but then again my Tesla is effectively running such a system to drive me and my kids. Albeit, that's on a well-defined problem and within literally human-made guardrails.

reply
cmiles74
2 hours ago
[-]
"...they need no corporate campuses, office space..."

This is a big downside of AI, IMHO. Those offices need to be filled! ;-)

reply
zitterbewegung
2 hours ago
[-]
Having AI "tarnish the reputation of your company" encompasses so much in regard to AI when it can receive input and be manipulated by others such as Tai from Microsoft and many other outcomes where there is a true risk for AI deployment.
reply
fakedang
2 hours ago
[-]
We can all agree we've progressed so much since Tai.
reply
lucubratory
13 hours ago
[-]
>secretly turn out to be a pedophile and tarnish the reputation of your company

This is interesting because it's both Oddly Specific and also something I have seen happen and I still feel really sorry for the company involved. Now that I think about it, I've actually seen it happen twice.

reply
bboygravity
5 hours ago
[-]
humans definitely don't need office space, but your point stands
reply
AustinW
4 hours ago
[-]
LLM office space is pretty expensive. Chillers, backup generators, raised floors, communications gear, …. They even demand multiple offices for redundancy, not to mention the new ask of a nuclear power plant to keep the lights on.
reply
monkeynotes
2 hours ago
[-]
"AIs are a lot less risky to deploy for businesses than humans" How do you know? LLMs can't even be properly scrutinized, while humans at least follow common psychology and patterns we've understood for thousands of years. This actually makes humans more predictable and manageable than you might think.

The wild part is that LLMs understand us way better than we understand them. The jump from GPT-3 to GPT-4 even surprised the engineers who built it. That should raise some red flags about how "predictable" these systems really are.

Think about it - we can't actually verify what these models are capable of or if they're being truthful, while they have this massive knowledge base about human behavior and psychology. That's a pretty concerning power imbalance. What looks like lower risk on the surface might be hiding much deeper uncertainties that we can't even detect, let alone control.

reply
ETH_start
1 hour ago
[-]
We are not pitted against AI is these match-ups. Instead, all humans and AI aligned with the goal of improving the human condition, are pitted against rogue AI which are not. Our capability to keep rogue AI in check therefore grows in proportion to the capabilities of AI.
reply
daveguy
47 minutes ago
[-]
The GP post is about how much better these AIs will be than humans once they reach a given skill level. So, yes, we are very much pitted against AI unless there are major socioeconomic changes. I don't think we are as close to a AGI as a lot of people are hyping, but at some point it would be a direct challenge to human employment. And we should think about it before that happens.
reply
salawat
37 minutes ago
[-]
You cannot tell the difference between the two veins of AI. Why do you have such a hard time understanding that?
reply
danielovichdk
5 hours ago
[-]
Name one technology that has come with computers that hasn't resulted in more humans being put to work ?

The rhetoric of not needing people doing work is cartoon'ish. I mean there is no sane explanation of how and why that would happen, without employing more people yet again, taking care of the advancements.

It's nok like technology has brought less work related stress. But it has definitely increased it. Humans were not made for using technology at such a pace as it's being rolled out.

The world is fucked. Totally fucked.

reply
mortehu
4 hours ago
[-]
Self check-out stations, ATMs, and online brokerages. Recently chat support. Namely cases where millions of people used to interact with a representative every week, and now they don't.
reply
palmfacehn
1 hour ago
[-]
"Name one use of electric lighting that hasn't resulted in candle makers losing work?"

The framing of the question misses the point. With electric lighting we can now work longer into the night. Yes, less people use and make candles. However, the second order effects allow us to be more productive in areas we may not have previously considered.

New technologies open up new opportunities for productivity. The bank tellers displaced by ATM machines can create value elsewhere. Consumers save time by not waiting in a queue, allowing them to use their time more economically. Banks have lower overhead, allowing more customers to afford their services.

reply
0points
2 hours ago
[-]
Where to even start?

Digital banks

Cashless money transfer services

Self service

Modern farms

Robo lawn mowers

NVR:s with object detection

I can go on forever

reply
salawat
33 minutes ago
[-]
Please do. I'm certain you can't, and you'll have to stop much sooner than you think. Appeals to triviality are the first refuge of the person who thinks they know, but does not.
reply
zamadatix
20 hours ago
[-]
I don't follow how 10 random humans can beat the average STEM college grad and average humans in that tweet. I suspect it's really "a panel of 10 randomly chosen experts in the space" or something?

I agree the most interesting thing to watch will be cost for a given score more than maximum possible score achieved (not that the latter won't be interesting by any means).

reply
bcrosby95
19 hours ago
[-]
Two heads is better than 1. 10 is way better. Even if they aren't a field of experts. You're bound to get random people that remember random stuff from high school, college, work, and life in general, allowing them to piece together a solution.
reply
inerte
19 hours ago
[-]
Aaaah thanks for the explanation. PANEL of 10 humans, as in, they were all together. I parsed the phrase as "10 random people" > "average human" which made little sense.
reply
modeless
18 hours ago
[-]
Actually I believe that he did mean 10 random people tested individually, not a committee of 10 people. The key being that the question is considered to be answered correctly if any one of the 10 people got it right. This is similar to how LLMs are evaluated with pass@5 or pass@10 criteria (because the LLM has no memory so running it 10 times is more like asking 10 random people than asking the same person 10 times in a row).

I would expect 10 random people to do better than a committee of 10 people because 10 people have 10 chances to get it right while a committee only has one. Even if the committee gets 10 guesses (which must be made simultaneously, not iteratively) it might not do better because people might go along with a wrong consensus rather than push for the answer they would have chosen independently.

reply
elcomet
17 hours ago
[-]
He means 10 humans voting for the answer
reply
zamadatix
2 hours ago
[-]
Aha, "at least 1 of a panel of 10", not "the panel of 10 averaged"! Thanks, that makes so much more sense to me now.

I have failed the real ARC AGI :)

reply
generic92034
16 hours ago
[-]
If that works that way at all depends on the group dynamic. It is easily possible that a not so bright individual takes an (unofficial) leadership position in the group and overrides the input of smarter members. Think of any meetings with various hierarchy levels in a company.
reply
daveguy
39 minutes ago
[-]
The ARC AGI questions can be a little tricky, but the solutions can generally be easily explained. And you get 3 tries. So, the 3 best descriptions of the solution votes on by 10 people is going to be very effective. The problem space just isn't complicated enough for an unofficial "leader" to sway the group to 3 wrong answers.
reply
herval
16 hours ago
[-]
Depends on the task, no?

Do you have a sense of what kind of task this benchmark includes? Are they more “general” such that random people would fare well or more specialized (ie something a STEM grad studied and isn’t common knowledge)?

reply
judge2020
16 hours ago
[-]
It does, which is why I don’t really subscribe to any test like this being great for actually determining “AGI”. A true AGI would be able to continuously train and create new LLMs that enable it to become a SME in entirely new areas.
reply
dlkf
14 hours ago
[-]
If you take a vote of 10 random people, then as long as their errors are not perfectly correlated, you’ll do better than asking one person.

https://en.m.wikipedia.org/wiki/Ensemble_learning

reply
hmottestad
20 hours ago
[-]
Might be that within a group of 10 people, randomly chosen, when each person attempts to solve the tasks at least 99% of the time 1 person out of the 10 people will get it right.
reply
olalonde
7 hours ago
[-]
Even if you assume that non STEM grads are dumb, isn't there a good probability of having a STEM graduate among 10 random humans?
reply
HDThoreaun
15 hours ago
[-]
ARC-AGI is essentially an IQ test. There is no "expert in the space". Its just a question of if youre able to spot the pattern.
reply
shkkmo
16 hours ago
[-]
It is fairly well documented that groups of people can show cognitive abilities that exceed that of any individual member. The classic example of this is if you ask a group of people to estimate the number of jellybeans in a jar, you can get a more accurate result than if you test to find the person with the highest accuracy and use their guess.

This isn't to say groups always outperform their members on all tasks, just that it isn't unusual to see a result like that.

reply
zamadatix
2 hours ago
[-]
Yes, my shortcoming was in understanding the 10 were implied to have their successes merged together by being a panel rather than just the average of a special selection.
reply
agumonkey
4 hours ago
[-]
who in this field is anticipating impact of near AGI for society ? maybe i'm too anxious but not planning for potential workless life seems dangerous (but maybe i'm just not following the right groups)
reply
daveguy
30 minutes ago
[-]
AGI would have a major impact on human work. Currently the hype is much greater than the reality. But it looks like we are starting to see some of the components of an AGI and that is cause for discussion of impact, but not panicked discussion. Even the chatbot customer service has to be trained on the domain. Still it is most useful in a few specific ways:

Routing to the correct human support

Providing FAQ level responses to the most common problems.

Providing a second opinion to the human taking the call.

So, even this most relevant domain for the technology doesn't eliminate human employment (because it's just not flexible or reliable enough yet).

reply
xbmcuser
16 hours ago
[-]
You are missing that cost of electricity is also going to keep falling because of solar and batteries. This year in China my table cloth math says it is $0.05 pkwh and following the cost decline trajectory be under $0.01 in 10 years
reply
barney54
14 hours ago
[-]
But the cost of electricity is not falling—it’s increasing. Wholesale prices have decreased, but retail rates are up. In the U.S. rates are up 27% over the past 4 years. In Europe prices are up too.
reply
NoLinkToMe
3 hours ago
[-]
That's a bit of a non-statement. Virtually all prices increase because of money supply, but we consider things to get cheaper if their prices grow less fast than inflation / income.

General inflation has outpaced the inflation of electricity prices by about 3x in the past 100 years. In other words, electricity has gotten cheaper over time in purchasing power terms.

And that's whilst our electricity usage has gone up by 10x in the last 100 years.

And this concerns retail prices, which includes distribution/transmission fees. These have gone up a lot as you get complications on the grid, some of which is built on a century old design. But wholesale prices (the cost of generating electricity without transmission/distribution) are getting dirt cheap, and for big AI datacentres I'm pretty sure they'll hook up to their own dedicated electricity generation at wholesale prices, off the grid, in the coming decades.

reply
xbmcuser
13 hours ago
[-]
Most large compute clusters would be buying electricity at wholesale price not at retail price. But anyway solar and battery prices have just reached the tipping point this year only now the longer power companies keep retail prices high the more people will defect from the grid and install their own solar + batteries.
reply
lucubratory
13 hours ago
[-]
I am not certain because I've been very focused on the o3 news, but at least yesterday neither the US nor Europe were part of China.
reply
lxgr
11 hours ago
[-]
But data centers pay wholesale prices or even less (given that especially AI training and, to a lesser extend, inference clusters can load shed like few other consumers of electricity).
reply
fulafel
10 hours ago
[-]
And this is great news as long as marginal production (the most expensive to produce, first to turn on/off according to demand) of electricity is fossils.
reply
patrickhogan1
15 hours ago
[-]
Bingo! Solar energy moves us toward a future where a household's energy needs become nearly cost-free.

Energy Need: The average home uses 30 kWh/day, requiring 6 kW/hour over 5 peak sunlight hours.

Multijunction Panels: Lab efficiencies are already at 47% (2023), and with multiple years of progress, 60% efficiency is probable.

Efficiency Impact: At 60% efficiency, panels generate 600 W/m², requiring 10 m² (e.g., 2 m × 5 m) to meet energy needs.

This size can fit on most home roofs, be mounted on a pole with stacked layers, or even be hung through an apartment window.

reply
jdhwosnhw
20 minutes ago
[-]
While I agree with your general assessment, I think your conclusion is a bit off. You’re assuming 1kw/m^2, which is only true with the sun directly overhead. A real-world solar setup gets hit with several factors of cosine (related to roof pitch, time of day, day of year, and latitude) that conspire to reduce the total output.

For example, my 50 sq m set up, at -29 deg latitude, generated your estimated 30 kwh/day output. I have panels with ~20% efficiency, suggesting that at 60% efficiency, the average household would only get to around half their energy needs with 10 sq m.

Yes, solar has the potential to drastically reduce energy costs, but even with free energy storage, individual households aren’t likely to achieve self sustainability.

reply
arcticbull
15 hours ago
[-]
Everyone always forgets that they only perform at less than half of their rated capacity and require significant battery installations. Rooftop solar plus storage is actually more expensive than nuclear on a comparable system LCOE due to their lack of efficiency of scale. Rooftop solar plus storage is about the most expensive form of electricity on earth, maybe excluding gas peaker plants.
reply
xbmcuser
14 hours ago
[-]
Everyone also forgets the speed of price decline for solar and battery your statement is completely false propaganda made up by power companies. Today rooftop solar and battery is cost competitive to nuclear already in many countries like India
reply
patrickhogan1
13 hours ago
[-]
You’re right that rooftop solar and storage have costs and efficiency limits, but those are improving quickly.

Rooftop solar harnesses energy from the sun, which is powered by nuclear fusion—arguably the most effective nuclear reactor in our solar system.

reply
nateglims
14 hours ago
[-]
It varies by a lot of factors but it’s way less than half. Photovoltaic panels have around 10% capacity utilization vs 50-70% for a gas or nuke plant.
reply
theendisney
10 hours ago
[-]
The thing everyone forgets is that all good energy technology is seized by governments for military purposes and to preserve the status quo. God knows how far it progressed.

What a joke

reply
sahmeepee
5 hours ago
[-]
Average US home.

In Europe it is around 6-7 kWh/day. This might increase with electrification of heating and transport, but probably nothing like as much as the energy consumption they are replacing (due to greater efficiency of the devices consuming the energy and other factors like the quality of home insulation.)

In the rest of the world the average home uses significantly less.

reply
necovek
11 hours ago
[-]
If climate change ends up changing weather profiles and we start seeing many more cloudy days or dust/mist in the air, we'll need to push those solar panel above (all the way to space?) or have many more of them, figure out transmission to the ground and costs will very much balloon.

Not saying this will happen, but it's risky to rely on solar as the only long-term solution.

reply
nateglims
14 hours ago
[-]
Is it going to fall significantly for data centers? Industrial policy for consumer power is different from subsidizing it for data centers and if you own grid infrastructure why would you tank the price by putting up massive amounts of capital?
reply
xbmcuser
12 hours ago
[-]
It's the same about using the cloud or using your own infrastructure there will be a point where building your own solar and battery plant is cheaper than what they are charging they will need to follow the price decline if they want to keep the customers if not there will be mass scale grid defections.
reply
nateglims
11 hours ago
[-]
I don’t think this reflects the reality of the power industry. Data centers are the only significant growth in actual generated power in decades and hyperscalers are already looking at very bespoke solutions.

The heavy commodification of networking and compute brought about by the internet and cloud aligned with tech company interests in delivering services or content to consumers. There does not seem to be an emerging consensus that data center operators also need to provide consumer power.

reply
xbmcuser
11 hours ago
[-]
It was not the reality of the power industry but will be soon as we have not had a source of electricity that is the cheapest and is getting cheaper and easy to install this is something unique.

I don't see Google, Amazon, Microsoft or any company pay $10 for something if building it themselves will cost them $5. Either the price difference will reach a point where investing into power production themselves makes sense or the power companies decrease prices. Looking at how all 3 have already been investing in power production over the last decade themselves either to get better prices or for PR.

reply
lyu07282
3 hours ago
[-]
But didn't we liberalized energy markets for that reason, if anyone could undercut the market like that wouldn't that happen automatically and the prices go down anyway? /s
reply
iandanforth
19 hours ago
[-]
Let's say that Google is already 1 generation ahead of nvidia in terms of efficient AI compute. ($1700)

Then let's say that OpenAI brute forced this without any meta-optimization of the hypothesized search component (they just set a compute budget). This is probably low hanging fruit and another 2x in compute reduction. ($850)

Then let's say that OpenAI was pushing really really hard for the numbers and was willing to burn cash and so didn't bother with serious thought around hardware aware distributed inference. This could be more than a 2x decrease in cost like we've seen deliver 10x reductions in cost via better attention mechanisms, but let's go with 2x for now. ($425).

So I think we've got about an 8x reduction in cost sitting there once Google steps up. This is probably 4-6 months of work flat out if they haven't already started down this path, but with what they've got with deep research, maybe it's sooner?

Then if "all" we get is hardware improvements we're down to what 10-14 years?

reply
qingcharles
12 hours ago
[-]
Until 2022 most AI research was aimed at improving the quality of the output, not the quantity.

Since then there has been a tsunami of optimizations in the way training and inference is done. I don't think we've even begun to find all the ways that inference can be further optimized at both hardware and software levels.

Look at the huge models that you can happily run on an M3 Mac. The cost reduction in inference is going to vastly outpace Moore's law, even as chip design continues on its own path.

reply
promptdaddy
16 hours ago
[-]
*deep mind research ?
reply
iandanforth
16 hours ago
[-]
Nope, Gemini Advanced with Deep Research. New mode of operation that does more "thinking" and web searches to answer your question.
reply
cchance
20 hours ago
[-]
I mean considering the big breaththrough this year for o1/o3 seems to have been "models having internal thoughts might help reasoning", seems to everyone outside of the AI field was sort of a "duh" moment.

I'd hope we see more internal optimizations and improvements to the models. The idea that the big breakthrough being "don't spit out the first thought that pops into your head" seems obvious to everyone outside of the field, but guess what turns out it was a big improvement when the devs decided to add it.

reply
versteegen
17 hours ago
[-]
> seems obvious to everyone outside of the field

It's obvious to people inside the field too.

Honestly, these things seem to be less obvious to people outside the field. I've heard so many uninformed takes about LLMs not representing real progress towards intelligence (even here on HN of all places; I don't know why I torture myself reading them), that they're just dumb memorizers. No, they are an incredible breakthrough, because extending them with things like internal thoughts will so obviously lead to results such as o3, and far beyond. Maybe a few more people will start to understand the trajectory we're on.

reply
0points
2 hours ago
[-]
> No, they are an incredible breakthrough, because extending them with things like internal thoughts will so obviously lead to results such as o3, and far beyond.

While I agree that the LLM progress as of late is interesting, the rest of your sentiment sounds more like you are in a cult.

As long as your field keep coming with less and less realistic predictions and fail to deliver over and over, eventually even the most gullible will lose faith in you.

Because that's what this all is right now. Faith.

> Maybe a few more people will start to understand the trajectory we're on.

All you are saying is that you believe something will happen in the future.

We can't have a intelligent discussion under those premises.

It's depressing to see so many otherwise smart people fall for their own hype train. You are only helping rich people get more rich by spreading their lies.

reply
Agentus
16 hours ago
[-]
a trickle of people sure, but most people never accidentally stumble upon good evaluation skills let alone reason themselves to that level, so i dont see how most people will have the semblance of an idea of a realistic trajectory of ai progress. i think most people have very little conceptualization of their own thinking/cognitive patterns, at least not enough to sensibly extrapolate it onto ai.

doesnt help that most people are just mimics when talking about stuff thats outside their expertise.

Hell, my cousin a quality-college educated individual, high social/ emotional iq, will go down the conspiracy theory rabbit hole so quickly based on some baseless crap printed on the internet. then he’ll talk about people being satan worshipers.

reply
versteegen
6 hours ago
[-]
You're being pretty harsh, but:

> i think most people have very little conceptualization of their own thinking/cognitive patterns, at least not enough to sensibly extrapolate it onto ai.

Quite true. If you spend a lot of time reading and thinking about the workings of the mind you lose sight of how alien it is to intuition. While in highschool I first read, in New Scientist, the theory that conscious thought lags behind the underlying subconscious processing in the brain. I was shocked that New Scientist would print something so unbelievable. Yet there seemed to be an element of truth to it so I kept thinking about it and slowly changed my assessment.

reply
Agentus
1 hour ago
[-]
sorry, humans are stupid and what intelligence they have is largely impotent. if this wasnt the case life wouldnt be this dystopia. my crassness comes from not necessarily trying to pick on a particular group of humans, just disappointment in recognizing the efficacy of human intelligence and its ability to turn reality into a better reality (meh).

yeah i was just thinking how a lot of thoughts which i thought were my original thoughts really were made possible out of communal thoughts. like i can maybe have some original frontier thoughts that involve averages but thats only made possible because some other person invented the abstraction of averages then that was collectively disseminated to everyone in education, not to mention all the subconscious processes that are necessary for me to will certainly thoughts into existsnce. makes me reflect on how much cognition is really mine, vs (not mine) a inevitable product of a deterministic process and a product of other humans.

reply
sfjailbird
4 hours ago
[-]
Sounds like your cousin is able to think for himself. The amount of bullshit I hear from quality-college educated individuals, who simply repeat outdated knowledge that is in their college curriculum, is no less disappointing.
reply
daveguy
15 minutes ago
[-]
Buying whatever bullshit you see on the internet to such a degree that you're re-enacting satanic panic from the 80s is not "thinking for yourself". It's being gullible about areas outside your expertise.
reply
dogma1138
17 hours ago
[-]
Reflection isn’t a new concept, but a) actually proving that it’s an effective tool for these types of models and b) finding an effective method for reflection that doesn’t just locks you into circular “thinking” were the hard parts and hence the “breakthrough”.

It’s very easy to say hey ofc it’s obvious but there is nothing obvious about it because you are anthropomorphizing these models and then using that bias after the fact as a proof of your conjecture.

This isn’t how real progress is achieved.

reply
beardedwizard
5 hours ago
[-]
Calling it reflection is, for me, further anthropomorphizing. However I am in violent agreement that a common feature of llm debate is centered around anthropomorphism leading to claims of "thinking longer" or "reflecting" when none of those things are happening.

The state of the art seems very focused on promoting that language that might encode reason is as good as actual reason, rather than asking what a reasoning model might look like.

reply
acchow
9 hours ago
[-]
> ~doubling every 2-2.5 years) puts us at 20~25 years.

The trend for power consumption of compute (Megaflops per watt) has generally tracked with Koomey’s law for a doubling every 1.57 years

Then you also have model performance improving with compression. For example, Llama 3.1’s 8B outperforming the original Llama 65B

reply
0points
2 hours ago
[-]
Then you will just have the issue of supplying enough of power to support this "linear" growth of yours.
reply
bjornsing
18 hours ago
[-]
> are we stuck waiting for the 20-25 years for GPU improvements

If this turns out to be hard to optimize / improve then there will be a huge economic incentive for efficient ASICs. No freaking way we’ll be running on GPUs for 20-25 years, or even 2.

reply
coolspot
17 hours ago
[-]
LLMs need efficient matrix multiiliers. GPUs are specialized ASICs for massive matrix multiplication.
reply
vlovich123
16 hours ago
[-]
LLMs get to maybe ~20% of the rated max FLOPS for a GPU. It’s not hard to imagine that a purpose built ASIC with maybe adjusted software stack gets us significantly more real performance.
reply
boroboro4
13 hours ago
[-]
They get more than this. For prefill we can get 70% matmul utilization, for generation less than this but we’ll get to >50 too eventually.
reply
m3kw9
13 hours ago
[-]
Don’t forget humans which is real GI paired with increasing capable AI can create a feed back loop to accelerate new advances.
reply
spencerchubb
17 hours ago
[-]
> Super exciting that OpenAI pushed the compute out this far

it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?

reply
lolinder
4 hours ago
[-]
> the fact that you even can use more compute to get more intelligence is a breakthrough.

I'm not so sure—what they're doing by just throwing more tokens at it is similar to "solving" the traveling salesman problem by just throwing tons of compute into a breadth first search. Sure, you can get better and better answers the more compute you throw at it (with diminishing returns), but is that really that surprising to anyone who's been following tree of thought models?

All it really seems to tell us is that the type of model that OpenAI has available is capable of solving many of the types of problems that ARC-AGI-PUB has set up given enough compute time. It says nothing about "intelligence" as the concept exists in most people's heads—it just means that a certain very artificial (and intentionally easy for humans) class of problem that wasn't computable is now computable if you're willing to pay an enormous sum to do it. A breakthrough of sorts, sure, but not a surprising one given what we've seen already.

reply
echelon
16 hours ago
[-]
Maybe it's not linear spend.
reply
cle
2 hours ago
[-]
Efficiency has always been the key.

Fundamentally it's a search through some enormous state space. Advancements are "tricks" that let us find useful subsets more efficiently.

Zooming way out, we have a bunch of social tricks, hardware tricks, and algorithmic tricks that have resulted in a super useful subset. It's not the subset that we want though, so the hunt continues.

Hopefully it doesn't require revising too much in the hardware & social bag of tricks, those are lot more painful to revisit...

reply
daxfohl
10 hours ago
[-]
I wonder if we'll start seeing a shift in compute spend, moving away from training time, and toward inference time instead. As we get closer to AGI, we probably reach some limit in terms of how smart the thing can get just training on existing docs or data or whatever. At some point it knows everything it'll ever know, no matter how much training compute you throw at it.

To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.

Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.

Interesting times.

reply
empiko
6 hours ago
[-]
I don't think this is only about efficiency. The model I have here is that this is similar to when we beat chess. Yes, it is impressive that we made progress on a class of problems, but is this class aligned with what the economy or the society needs?

Simple turn-based games such as chess turned out to be too far away from anything practical and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that come up really? I can see some time-saving in using AI vs StackOverflow in solving some programming challenges, but is there more to this?

reply
edanm
4 hours ago
[-]
I mostly agree with your analysis, but just to drive home a point here - I don't think that algorithms to beat Chess were ever seriously considered as something that would be relevant outside of the context of Chess itself. And obviously, within the world of Chess, they are major breakthroughs.

In this case there is more reason to think these things are relevant outside of the direct context - these tests were specifically designed to see if AI can do general-thinking tasks. The benchmarks might be bad, but that's at least their purpose (unlike in Chess).

reply
freehorse
15 hours ago
[-]
> I am interpreting this result as human level reasoning now costs (approximately) 41k/hr to 2.5M/hr with current compute.

On a very simple, toy task, which arc-agi basically is. Arc-agi tests are not hard per se, just LLM’s find them hard. We do not know how this scales for more complex, real world tasks.

reply
SamPatt
15 hours ago
[-]
Right. Arc is meant to test the ability of a model to generalize. It's neat to see it succeed, but it's not yet a guarantee that it can generalize when given other tasks.

The other benchmarks are a good indication though.

reply
criddell
11 hours ago
[-]
Does it mean anything for more general tasks like driving a car?
reply
brookst
10 hours ago
[-]
Is every smart person a good driver?
reply
earth2mars
3 hours ago
[-]
That kind of proves that point that no matter how smart it can get, it may still have several disabilities that are crucial and very naive for humans. Is it generalizing on any task or specific set of tasks.
reply
zarzavat
6 hours ago
[-]
Likely yes. Every smart person is capable of being a good driver, so long as you give them enough training and incentive. Zero smart people are born being able to drive.
reply
fragmede
6 hours ago
[-]
There are different kinds of smarts and not every smart person is good at all of them. Specifically, spacial reasoning is important for driving, and if a smart person is good at all kinds of thinking except that one, they're going to find it challenging to be a good driver.
reply
sethammons
5 hours ago
[-]
Says the technical founder and CTO of our startup who exited with 9 figures and who also has a severe lazy eye: you don't want me driving. He got pulled over for suspected dui; totally clean, just can't drive straight
reply
madduci
10 hours ago
[-]
Let's see when this will be released to the free tier. Looks promising, although I hope they will also be able to publish more details on this, as part of the "open" in their name
reply
riku_iki
21 hours ago
[-]
> ~=$3400 per single task

report says it is $17 per task, and $6k for whole dataset of 400 tasks.

reply
binarymax
21 hours ago
[-]
"Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration."

The low compute was $17 per task. Speculate 172*$17 for the high compute is $2,924 per task, so I am also confused on the $3400 number.

reply
bluecoconut
21 hours ago
[-]
3400 came from counting pixels on the plot.

Also its $20 on for the o3-low via the table for the semi-private, which x172 is 3440, also coming in close to the 3400 number

reply
bluecoconut
21 hours ago
[-]
That's the low-compute mode. In the plot at the top where they score 88%, O3 High (tuned) is ~3.4k
reply
HDThoreaun
15 hours ago
[-]
The low compute one did as well as the average person though
reply
ionwake
19 hours ago
[-]
sorry to be a noob, but can someone tell me doe sths mena o3 will be unaffordable for a typical user? Will only companies with thousands to spend per query be able to use this?

Sorry for being thick Im just confused how they can turn this into an addordable service?

reply
JohnnyMarcone
17 hours ago
[-]
There are likely many efficiency gains that will be made before it's released, and after. Also they showed o3 mini to be better than o1 for less cost in multiple benchmarks, so there're already improvements there at a lower cost than what available.
reply
ionwake
17 hours ago
[-]
Great thank you
reply
xrendan
21 hours ago
[-]
You're misreading it, there's two different runs, a low and a high compute run.

The number for the high-compute one is ~172x the first one according to the article so ~=$2900

reply
Thorrez
2 hours ago
[-]
What's extra confusing is that in the graph the runs are called low compute and high compute. In the table they're called high efficient and low efficiency. So the high and low got swapped.
reply
jhrmnn
21 hours ago
[-]
That’s for the low-compute configuration that doesn’t reach human-level performance (not far though)
reply
riku_iki
21 hours ago
[-]
I referred on high compute mode. They have table with breakdown here: https://arcprize.org/blog/oai-o3-pub-breakthrough
reply
junipertea
21 hours ago
[-]
The table row with 6k figure refers to high efficiency, not high compute mode. From the blog post:

Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration.

reply
gbnwl
21 hours ago
[-]
That's "efficiency" high, which actually means less compute. The 87.5% score using low efficiency (more compute) doesn't have cost listed.
reply
bluecoconut
21 hours ago
[-]
they use some poor language.

"High Efficiency" is O3 Low "Low Efficiency" is O3 High

They left the "Low efficiency" (O3 High) values as `-` but you can infer them from the plot at the top.

Note the $20 and $17 per task aligns with the X-axis of the O3-low

reply
EVa5I7bHFq9mnYK
21 hours ago
[-]
That's high EFFICIENCY. High efficiency = low compute.
reply
chefandy
2 hours ago
[-]
I think the real key is figuring out how to turn the hand-wavy promises of this making everything better into policy long fucking before we kick the door open. It’s self-evident that this being efficient and useful would be a technological revolution; what’s not self evident is that it wouldn’t benefit the large corporate entities that control even more disproportionately than it does now to the detriment of many other people.
reply
croemer
22 hours ago
[-]
The programming task they gave o3-mini high (creating Python server that allows chatting with OpenAI API and run some code in terminal) didn't seem very hard? Strange choice of example for something that's claimed to be a big step forwards.

YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)

Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...

reply
zelphirkalt
1 hour ago
[-]
Looks like quite shoddy code though. Like, the procedure for running a shell command is pure side-effect procedural code, neither returning the exit code of the command nor its output. Like the incomplete stackoverflow answer it probably was trained from. It might do one job at a time, but once this stuff gets integrated into one coherent thing, one needs to rewrite lots of it, to actually be composable.

Though, of course one can argue, that lots of human written code is not much different from this.

reply
bearjaws
22 hours ago
[-]
It's good that it works since if you ask GPT-4o to use the openai sdk it will often produce invalid and out of date code.
reply
HeatrayEnjoyer
16 hours ago
[-]
Sonnet isn't a "mini" sized model. Try it with Haiku.
reply
croemer
16 hours ago
[-]
How mini is o3-mini compared to Sonnet and why does it matter whether it's mini or not? Isn't the point of the demo to show what's now possible that wasn't before?

4o is cheaper than o1 mini so mini doesn't mean much for costs.

reply
MyFirstSass
18 hours ago
[-]
What? Is this what this is? Either this is a complete joke or we're missing something.

I've been doing similar stuff in Claude for months and it's not that impressive when you see how limited they really are when going non boilerplate.

reply
phil917
21 hours ago
[-]
Yeah I agree that wasn't particularly mind blowing to me and seems fairly in line with what existing SOTA models can do. Especially since they did it in steps. Maybe I'm missing something.
reply
photonboom
22 hours ago
[-]
reply
m3kw9
22 hours ago
[-]
I would say they didn’t need to demo anything, because if you are gonna use the output code live on a demo it may make compile errors and then look stupid trying to fix it live
reply
croemer
21 hours ago
[-]
If it was a safe bet problem, then they should have said that. To me it looks like they faked excitement for something not exciting which lowers credibility of the whole presentation.
reply
sunaookami
19 hours ago
[-]
They actually did that the last time when they showed the apps integration. First try in Xcode didn't work.
reply
m3kw9
18 hours ago
[-]
Yeah I think that time it was ok because they were demoing the app function, but for this they are demoing the model smarts
reply
csomar
16 hours ago
[-]
Models are predictable at 0 temperatures. They might have tested the output beforehand.
reply
fzzzy
16 hours ago
[-]
Models in practice haven't been deterministic at 0 temperature, although nobody knows exactly why. Either hardware or software bugs.
reply
Jensson
16 hours ago
[-]
We know exactly why, it is because floating point operations aren't associative but the GPU scheduler assumes they are, and the scheduler isn't deterministic. Running the model strictly hurts performance so they don't do that.
reply
obblekk
22 hours ago
[-]
Human performance is 85% [1]. o3 high gets 87.5%.

This means we have an algorithm to get to human level performance on this task.

If you think this task is an eval of general reasoning ability, we have an algorithm for that now.

There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.

Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!

[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1

reply
phillipcarter
22 hours ago
[-]
As excited as I am by this, I still feel like this is still just a small approximation of a small chunk of human reasoning ability at large. o3 (and whatever comes next) feels to me like it will head down the path of being a reasoning coprocessor for various tasks.

But, still, this is incredibly impressive.

reply
qt31415926
21 hours ago
[-]
Which parts of reasoning do you think is missing? I do feel like it covers a lot of 'reasoning' ground despite its on the surface simplicity
reply
john_minsk
12 hours ago
[-]
My personal 5 cents is that reasoning will be there when LLM gives you some kind of outcome and then when questioned about it can explain every bit of result it produced.

For example, if we asked an LLM to produce an image of a "human woman photorealistic" it produces result. After that you should be able to ask it "tell me about its background" and it should be able to explain "Since user didn't specify background in the query I randomly decided to draw her standing in front of a fantasy background of Amsterdam iconic houses. Usually Amsterdam houses are 3 stories tall, attached to each other and 10 meters wide. Amsterdam houses usually have cranes on the top floor, which help to bring goods to the top floor since doors are too narrow for any object wider than 1m. The woman stands in front of the houses approximately 25 meters in front of them. She is 1,59m tall, which gives us correct perspective. It is 11:16am of August 22nd which I used to calculate correct position of the sun and align all shadows according to projected lighting conditions. The color of her skin is set at RGB:xxxxxx randomly" etc.

And it is not too much to ask LLMs for it. LLMs have access to all the information above as they read all the internet. So there is definitely a description of Amsterdam architecture, what a human body looks like or how to correctly estimate time of day based on shadows (and vise versa). The only thing missing is logic that connects all this information and which is applied correctly to generate final image.

I like to think about LLMs as a fancy genius compressing engines. They took all the information in the internet, compressed it and are able to cleverly query this information for end user. It is a tremendously valuable thing, but if intelligence emerges out of it - not sure. Digital information doesn't necessarily contain everything needed to understand how it was generated and why.

reply
concordDance
7 hours ago
[-]
> if we asked an LLM to produce an image of a "human woman photorealistic" it produces result

Large language models don't do that. You'd want an image model.

Or did you mean "multi-model AI system" rather than "LLM"?

reply
amelius
3 hours ago
[-]
Can an LLM use tools like humans do? Could it use an image model as a tool to query the image?
reply
0points
2 hours ago
[-]
No, a LLM is a Large Language Model.

It can language.

reply
amelius
58 minutes ago
[-]
You could teach it to emit patterns that (through other code) invoke tools, and loop the results back to the LLM.
reply
owenpalmer
6 hours ago
[-]
It might be possible for a language model to paint a photorealistic picture though.
reply
0points
2 hours ago
[-]
It is not.

You are confusing LLM:s with Generative AI.

reply
Xmd5a
10 hours ago
[-]
LLMs are still bound to a prompting session. They can't form long term memories, can't ponder on it and can't develop experience. They have no cognitive architecture.

'Agents' (i.e. workflows intermingling code and calls to LLMs) are still a thing (as shown by the fact there is a post by anthropic on this subject on the front page right now) and they are very hard to build.

Consequence of that for instance: it's not possible to have a LLM explore exhaustively a topic.

reply
mjhagen
3 hours ago
[-]
LLMs don’t, but who said AGI should come from LLMs alone. When I ask ChatGPT about something “we” worked on months ago, it “remembers” and can continue on the conversation with that history in mind.

I’d say, humans are also bound to promoting sessions in that way.

reply
Xmd5a
1 hour ago
[-]
Last time I used ChatGPT 'memory' feature it got full very quickly. It remembered my name, my dog's name and a couple tobacco casing recipes he came up with. OpenAI doesn't seem to be using embeddings and a vector database, just text snippets it injects in every conversation. Because RAG is too brittle ? The same problem arises when composing LLM calls. Efficient and robust workflows are those whose prompts and/or DAG were obtained via optimization techniques. Hence DSPy.

Consider the following use case: keeping a swimming pool water clean. I can have a long running conversation with a LLM to guide me in getting it right. However I can't have a LLM handle the problem autonomously. I'd like to have it notify me on its own "hey, it's been 2 days, any improvement? Do you mind sharing a few pictures of the pool as well as the ph/chlorine test results ?". Nothing mind-boggingly complex. Nothing that couldn't be achieved using current LLMs. But still something I'd have to implement myself and which turns out to be more complex to achieve than expected. This is the kind of improvement I'd like to see big AI companies going after rather than research-grade ultra smart AIs.

reply
phillipcarter
19 hours ago
[-]
I think it's hard to enumerate the unknown, but I'd personally love to see how models like this perform on things like word problems where you introduce red herrings. Right now, LLMs at large tend to struggle mightily to understand when some of the given information is not only irrelevant, but may explicitly serve to distract from the real problem.
reply
KaoruAoiShiho
18 hours ago
[-]
o1 already fixed the red herrings...
reply
zmgsabst
12 hours ago
[-]
That’s not inability to reason though, that’s having a social context.

Humans also don’t tend to operate in a rigorously logical mode and understand that math word problems are an exception where the language may be adversarial: they’re trained for that special context in school. If you tell the LLM that social context, eg that language may be deceptive, their “mistakes” disappear.

What you’re actually measuring is the LLM defaults to assuming you misspoke trying to include relevant information rather than that you were trying to trick it — which is the social context you’d expect when trained on general chat interactions.

Establishing context in psychology is hard.

reply
tim333
3 hours ago
[-]
Current AI is good at text but not very good at 3d physical stuff like fixing your plumbing.
reply
amelius
3 hours ago
[-]
Does it include the use of tools to accomplish a task?

Does it include the invention of tools?

reply
Agentus
14 hours ago
[-]
kinda interesting, every single CS person (especially phds) when talking about reasoning are unable to concisely quantify, enumerate, qualify, or define reasoning.

people with (high) intelligence talking and building (artificial) intelligence but never able to convincingly explain aspects of intelligence. just often talk ambiguously and circularly around it.

what are we humans getting ourselves into inventing skynet :wink.

its been an ongoing pet project to tackle reasoning, but i cant answer your question with regards to llms.

reply
YeGoblynQueenne
13 hours ago
[-]
>> Kinda interesting, every single CS person (especially phds) when talking about reasoning are unable to concisely quantify, enumerate, qualify, or define reasoning.

Kinda interesting that mathematicians also can't do the same for mathematics.

And yet.

reply
logicchains
3 hours ago
[-]
Mathematicians absolutely can, it's called foundations, and people actively study what mathematics can be expressed in different foundations. Most mathematicians don't care about it though for the same reason most programmers don't care about Haskell.
reply
YeGoblynQueenne
2 hours ago
[-]
I don't care about Haskell either, but we know what reasoning is [1]. It's been studied extensively in mathematics, computer science, psychology, cognitive science and AI, and in philosophy going back literally thousands of years with grandpapa Aristotle and his syllogisms. Formal reasoning, informal reasoning, non-monotonic reasoning, etc etc. Not only do we know what reasoning is, we know how to do it with computers just fine, too [2]. That's basically the first 50 years of AI, that folks like His Nobelist Eminence Geoffrey Hinton will tell you was all a Bad Idea and a total failure.

Still somehow the question keeps coming up- "what is reasoning". I'll be honest and say that I imagine it's mainly folks who skipped CS 101 because they were busy tweaking their neural nets who go around the web like Diogenes with his lantern, howling "Reasoning! I'm looking for a definition of Reasoning! What is Reasoning!".

I have never heard the people at the top echelons of AI and Deep learning - LeCun, Schmidhuber, Bengio, Hinton, Ng, Hutter, etc etc- say things like that: "what's reasoning". The reason I suppose is that they know exactly what that is, because it was the one thing they could never do with their neural nets, that classical AI could do between sips of coffee at breakfast [3]. Those guys know exactly what their systems are missing and, to their credit, have never made no bones about that.

_________________

[1] e.g. see my profile for a quick summary.

[2] See all of Russeel & Norvig, as a for instance.

[3] Schmidhuber's doctoral thesis was an implementation of genetic algorithms in Prolog, even.

reply
Agentus
1 hour ago
[-]
i have a question for you, in which ive asked many philosophy professors but none could answer satisfactorily. since you seem to have a penchant for reasoning perhaps you might have a good answer. (i hope i remember the full extent of the question properly, i might hit you up with some follow questions)

it pertains to the source of the inference power of deductive inference. do you think all deductive reasoning originated inductively? like when some one discovers a rule or fact that seemingly has contextual predictive power, obviously that can be confirmed inductively by observations, but did that deductive reflex of the mind coagulate by inductive experiences. maybe not all deductive derivative rules but the original deductive rules.

reply
mistermann
1 hour ago
[-]
>Those guys know exactly what their systems are missing

If they did not actually, would they (and you) necessarily be able to know?

Many people claim the ability to prove a negative, but no one will post their method.

reply
Agentus
13 hours ago
[-]
well lets just say i think i can explain reasoning better than anyone ive encountered. i have my own hypothesized theory on what it is and how it manifests in neural networks.

i doubt your mathmatician example is equivalent.

examples that are fresh on the mind that further my point. ive heard yann lecun baffled by llms instantiation/emergence of reasoning, along with other ai researchers. eric Schmidt thinks the agentic reasoning is the current frontier and people should be focusing on that. was listening to the start of an ai machine learning interview a week ago with some cs phd asked to explain reasoning and the best he could muster up is you know it when you see it…. not to mention the guy responding to the grandparent that gave a cop out answer ( all the most respect to him).

reply
YeGoblynQueenne
2 hours ago
[-]
>> well lets just say i think i can explain reasoning better than anyone ive encountered. i have my own hypothesized theory on what it is and how it manifests in neural networks.

I'm going to bet you haven't encountered the right people then. Maybe your social circle is limited to folks like the person who presented a slide about A* to a dumb-struck roomfull of Deep Learning researchers, in the last NeurIps?

https://x.com/rao2z/status/1867000627274059949

reply
Agentus
1 hour ago
[-]
possibly, my university doesn’t really do ai research beyond using it as a tool to engineer things. im looking to transfer to a different university.

but no, my take on reasoning is really a somewhat generalized reframing of the definition of reasoning (which you might find on the stanford encylopedia of philosophy) thats reframed partially in axiomatic building blocks of neural network components/terminology. im not claiming to have discovered reasoning, just redefine it in a way thats compatible and sensible to neural networks (ish).

reply
necovek
9 hours ago
[-]
Care to enlighten us with your explanation of what "reasoning" is?
reply
Agentus
2 hours ago
[-]
terribly sorry to be such a tease, but im looking to publish a paper on it, and still need to delve deeper into machine interpretability to make sure its empirically properly couched. if u can help with that perhaps we can continue this convo in private.
reply
mistermann
1 hour ago
[-]
Optimal phenomenological reasoning is going to be a tough nut to crack.

Luckily we don't know the problem exists, so in a cultural/phenomenological sense it is already cracked.

reply
azeirah
17 hours ago
[-]
I'd like to see this o3 thing play 5d chess with multiverse time travel or baba is you.

The only effect smarter models will have is that intelligent people will have to use less of their brain to do their work. As has always been the case, the medium is the message, and climate change is one of the most difficult and worst problems of our time.

If this gets software people to quit en-masse and start working in energy, biology, ecology and preservation? Then it has succeeded.

reply
concordDance
7 hours ago
[-]
> climate change is one of the most difficult and worst problems of our time.

Slightly surprised to see this view here.

I can think of half a dozen more serious problems off hand (e.g. population aging, institutional scar tissue, dysgenics, nuclear proliferation, pandemic risks, AI itself) along most axes I can think of (raw $ cost, QALYs, even X-risk).

reply
cryptoegorophy
22 hours ago
[-]
What’s interesting is it might be very close to human intelligence than some “alien” intelligence, because after all it is a LLM and trained on human made text, which kind of represents human intelligence.
reply
hammock
22 hours ago
[-]
In that vein, perhaps the delta between o3 @ 87.5% and Human @ 85% represents a deficit in the ability of text to communicate human reasoning.

In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.

reply
unsupp0rted
19 hours ago
[-]
It's possible humans reason better through text than not through text, so these models, having been trained on text, should be able to out-reason any person who's not currently sitting down to write.
reply
85392_school
22 hours ago
[-]
I wonder how much of an effect amount of time to answer has on human performance.
reply
yunwal
21 hours ago
[-]
Yeah, this is sort of meaningless without some idea of cost or consequences of a wrong answer. One of the nice things about working with a competent human is being able to tell them "all of our jobs are on the line" and knowing with certainty that they'll come to a good answer.
reply
hamburga
14 hours ago
[-]
Agreed. I think what really makes them alien is everything else about them besides intelligence. Namely, no emotional/physiological grounding in empathy, shame, pride, and love (on the positive side) or hatred (negative side).
reply
6gvONxR4sf7o
21 hours ago
[-]
Human performance is much closer to 100% on this, depending on your human. It's easy to miss the dot in the corner of the headline graph in TFA that says "STEM grad."
reply
tim333
3 hours ago
[-]
A fair comparison might be average human. The average human isn't a STEM grad. It seems STEM grad approximately equals an IQ of 130. https://www.accommodationforstudents.com/student-blog/the-su...

From a post elsewhere the scores on ARC-AGI-PUB are approx average human 64%, o3 87%. https://news.ycombinator.com/item?id=42474659

Though also elsewhere, o3 seems very expensive to operate. You could probably hire a PhD researcher for cheaper.

reply
jeremyjh
1 hour ago
[-]
Why would an average human be more fair than a trained human? The model is trained.
reply
lastdong
11 hours ago
[-]
Curious about how many tests were performed. Did it consistently manage to successfully solve many of these types of problems?
reply
scotty79
22 hours ago
[-]
Still it's comparing average human level performance with best AI performance. Examples of things o3 failed at are insanely easy for humans.
reply
cchance
20 hours ago
[-]
You'd be surprised what the AVERAGE human fails to do that you think is easy, my mom can't fucking send an email without downloading a virus, i have a coworker that believes beyond a shadow of a doubt the world is flat.

The Average human is a lot dumber than people on hackernews and reddit seem to realize, shit the people on mturk are likely smarter than the AVERAGE person

reply
HarHarVeryFunny
1 hour ago
[-]
Maybe, but no doubt these "dumb" people can still get dressed in the morning, navigate a trip to the mall, do the dishes, etc, etc.

It's always been the case that the things that are easiest for humans are hardest for computers, and vice versa. Humans are good at general intelligence - tackling semi-novel problems all day long, while computers are good at narrow problems they can be trained on such as chess or math.

The majority of the benchmarks currently used to evaluate these AI models are narrow skills that the models have been trained to handle well. What'll be much more useful will be when they are capable of the generality of "dumb" tasks that a human can do.

reply
mirkodrummer
14 hours ago
[-]
Not being able to send an email or believing the world is flat it’s not a sign of intelligence, I’d rather say it’s more about culture or being more or less scholarized. Your mom or coworker still can do stuff instinctively that is outperforming every algorithm out there and still unexplained how we do it. We still have no idea what intelligence is
reply
staticman2
19 hours ago
[-]
Yet the average human can drive a car a lot better than ChatGPT can, which shows that the way you frame "intelligence" dictates your conclusion about who is "intelligent".
reply
p1esk
19 hours ago
[-]
Pretty sure a waymo car drives better than an average SF driver.
reply
manquer
7 hours ago
[-]
Waymo cannot handle poor weather at all, average human can.

Being able to perform better than humans in specific constrained problem space is how every automation system has been developed.

While self driving systems are impressive, they don’t drive with anywhere close to skills of the average driver

reply
tim333
3 hours ago
[-]
Waymo blog with video of them driving in poor weather https://waymo.com/blog/2019/08/waymo-and-weather
reply
manquer
2 hours ago
[-]
And nikola famously made a video of a truck using one which had no engine, we don’t take a company word for anything until we can verify.

This is not offered to public, they are actively expanding in only cities like LA , Miami or Phoenix now where weather is good through the year.

The tech for bad weather is nowhere close to ready for public. Average human on other hand is driving in bad weather every day

reply
tim333
1 hour ago
[-]
"Extreme Weather" tech "will be available to riders in the near future" https://www.cnet.com/roadshow/news/waymos-latest-robotaxi-is...
reply
Mordisquitos
5 hours ago
[-]
And how well would a Waymo car do in this challenge with the ARC-AGI datasets?
reply
coldcode
3 hours ago
[-]
There's a reason why Waymo isn't offered in Buffalo.
reply
fragmede
3 hours ago
[-]
Is that reason because Buffalo is the 81st most populated city in the United States, or 123rd by population density, and Waymo currently only serves approximately 3 cities in North America?

We already let computers control cars because they're better than humans at it when the weather is inclement. It's called ABS.

reply
tracerbulletx
19 hours ago
[-]
If you take an electrical sensory input signal sequence, and transform it to a electrical muscle output signal sequence you've got a brain. ChatGPT isn't going to drive a car because it's trained on verbal tokens, and it's not optimized for the type of latency you need for physical interaction.

And the brain doesn't use the same network to do verbal reasoning as real time coordination either.

But that work is moving along fine. All of these models and lessons are going to be combined into AGI. It is happening. There isn't really that much in the way.

reply
0points
2 hours ago
[-]
Your examples are just examples of lack of information. That's not a measure for intelligence.

As a contrary point, most people think they are smarter than they really are.

reply
FrustratedMonky
21 hours ago
[-]
There are things Chimps do easily that humans fail at, and vice/versa of course.

There are blind spots, doesn't take away from 'general'.

reply
Matumio
4 hours ago
[-]
We can't agree whether Portia spiders are intelligent or just have very advanced instincts. How will we ever agree about what human intelligence is, or how to separate it from cultural knowledge? If that even makes sense.
reply
FrustratedMonky
3 hours ago
[-]
I guess my point is more, if we can't decide about Portia Spiders or Chimps, then how can we be so certain about AI. So offering up Portia and Chimps as counter examples.
reply
noobermin
4 hours ago
[-]
The downvotes should tell you, this is a decided "hype" result. Don't poo poo it, that's not allowed on AI slop posts on HN.
reply
FrustratedMonky
3 hours ago
[-]
Yeah, I didn't realize Chimp studies, or neuroscience were out of vogue. Even in tech, people form strong 'beliefs' around what they think is happening.
reply
antirez
22 hours ago
[-]
NNs are not algorithms.
reply
benlivengood
21 hours ago
[-]
Deterministic (ieee 754 floats), terminates on all inputs, correctness (produces loss < X on N training/test inputs)

At most you can argue that there isn't a useful bounded loss on every possible input, but it turns out that humans don't achieve useful bounded loss on identifying arbitrary sets of pixels as a cat or whatever, either. Most problems NNs are aimed at are qualitative or probabilistic where provable bounds are less useful than Nth-percentile performance on real-world data.

reply
notfish
22 hours ago
[-]
An algorithm is “a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer”

How does a giant pile of linear algebra not meet that definition?

reply
antirez
21 hours ago
[-]
It's not made of "steps", it's an almost continuous function to its inputs. And a function is not an algorithm: it is not an object made of conditions, jumps, terminations, ... Obviously it has computation capabilities and is Turing-complete, but is the opposite of an algorithm.
reply
janalsncm
18 hours ago
[-]
If it wasn’t made of steps then Turing machines wouldn’t be able to execute them.

Further, this is probably running an algorithm on top of an NN. Some kind of tree search.

I get what you’re saying though. You’re trying to draw a distinction between statistical methods and symbolic methods. Someday we will have an algorithm which uses statistical methods that can match human performance on most cognitive tasks, and it won’t look or act like a brain. In some sense that’s disappointing. We can build supersonic jets without fully understanding how birds fly.

reply
antirez
18 hours ago
[-]
Let's see that Turing machines can approximate the execution of NN :) That's why there are issues related to numerical precision, but the contrary is also true indeed, NNs can discover and use similar techniques used by traditional algorithms. However: the two remain two different methods to do computations, and probably it's not just by chance that many things we can't do algorithmically, we can do with NNs, what I mean is that this is not just related to the fact that NNs discover complex algorithms via gradient descent, but also that the computational model of NNs is more adapt to solving certain tasks. So the inference algorithm of NNs (doing multiplications and other batch transformations) is just needed for standard computers to approximate the NN computational model. You can do this analogically, and nobody would claim much (maybe?) it's running an algorithm. Or that brains themselves are algorithms.
reply
tsimionescu
2 minutes ago
[-]
NN inference is an algorithm for computing an approximation of a function with a huge number of parameters. The NN itself is of course just a data structure. But there is nothing whatsoever about the NN process that is non-algorithmic.

It's the exact same thing as using a binary tree to discover the lowest number in some set of numbers, conceptually: you have a data structure that you evaluate using a particular algorithm. The combination of the algorithm and the construction of the data structure arrive at the desired outcome.

reply
necovek
9 hours ago
[-]
Computers can execute precise computations, it's just not efficient (and it's very much slow).

NNs are exactly what "computers" are good for and we've been using since their inception: doing lots of computations quickly.

"Analog neural networks" (brains) work much differently from what are "neural networks" in computing, and we have no understanding of their operation to claim they are or aren't algorithmic. But computing NNs are simply implementations of an algorithm.

Edit: upon further rereading, it seems you equate "neural networks" with brain-like operation. But brain was an inspiration for NNs, they are not an "approximation" of it.

reply
antirez
9 hours ago
[-]
But the inference itself is orthogonal to the computation the NN is going. Obviously the inference (and training) are algorithms.
reply
zeroonetwothree
10 hours ago
[-]
We don’t have evidence that a TM can simulate a brain. But we know for a fact that it can execute a NN.
reply
raegis
20 hours ago
[-]
> It's not made of "steps", it's an almost continuous function to its inputs.

Can you define "almost continuous function"? Or explain what you mean by this, and how it is used in the A.I. stuff?

reply
taneq
15 hours ago
[-]
Well, it's a bunch of steps, but they're smaller. /s
reply
necovek
9 hours ago
[-]
I would say you are right that function is not an algorithm, but it is an implementation of an algorithm.

Is that your point?

If so, I've long learned to accept imprecise language as long as the message can be reasonably extracted from it.

reply
mvkel
13 hours ago
[-]
> continuous

So, steps?

reply
necovek
9 hours ago
[-]
"Continuous" would imply infinitely small steps, and as such, would certainly be used as a differentiator (differential? ;) between larger discrete stepped approach.

In essence, infinite calculus provides a link between "steps" and continuous, but those are different things indeed.

reply
drdeca
19 hours ago
[-]
How do you define "algorithm"? I suspect it is a definition I would find somewhat unusual. Not to say that I strictly disagree, but only because to my mind "neural net" suggests something a bit more concrete than "algorithm", so I might instead say that an artificial neural net is an implementation of an algorithm, rather than or something like that.

But, to my mind, something of the form "Train a neural network with an architecture generally like [blah], with a training method+data like [bleh], and save the result. Then, when inputs are received, run them through the NN in such-and-such way." would constitute an algorithm.

reply
necovek
9 hours ago
[-]
NN is a very wide term applied in different contexts.

When a NN is trained, it produces a set of parameters that basically define an algorithm to do inference with: it's a very big one though.

We also call that a NN (the joy of natural language).

reply
KeplerBoy
20 hours ago
[-]
Running inference on a model certainly is a algorithm.
reply
hypoxia
20 hours ago
[-]
It actually beats the human average by a wide margin:

- 64.2% for humans vs. 82.8%+ for o3.

...

Private Eval:

- 85%: threshold for winning the prize [1]

Semi-Private Eval:

- 87.5%: o3 (unlimited compute) [2]

- 75.7%: o3 (limited compute) [2]

Public Eval:

- 91.5%: o3 (unlimited compute) [2]

- 82.8%: o3 (limited compute) [2]

- 64.2%: human average (Mechanical Turk) [1] [3]

Public Training:

- 76.2%: human average (Mechanical Turk) [1] [3]

...

References:

[1] https://arcprize.org/guide

[2] https://arcprize.org/blog/oai-o3-pub-breakthrough

[3] https://arxiv.org/abs/2409.01374

reply
usaar333
20 hours ago
[-]
Super human isn't beating rando mech turk.

Their post has stem grad at nearly 100%

reply
tripletao
18 hours ago
[-]
This is correct. It's easy to get arbitrarily bad results on Mechanical Turk, since without any quality control people will just click as fast as they can to get paid (or bot it and get paid even faster).

So in practice, there's always some kind of quality control. Stricter quality control will improve your results, and the right amount of quality control is subjective. This makes any assessment of human quality meaningless without explanation of how those humans were selected and incentivized. Chollet is careful to provide that, but many posters here are not.

In any case, the ensemble of task-specific, low-compute Kaggle solutions is reportedly also super-Turk, at 81%. I don't think anyone would call that AGI, since it's not general; but if the "(tuned)" in the figure means o3 was tuned specifically for these tasks, that's not obviously general either.

reply
ALittleLight
22 hours ago
[-]
It's not saturated. 85% is average human performance, not "best human" performance. There is still room for the model to go up to 100% on this eval.
reply
dmead
6 hours ago
[-]
This is so strange. people think that an llm trained on programming questions and docs can do mundane tasks like this means intelligent? Come on.

It really calls into question two things.

1. You don't know what you're talking about about.

2. You have a perverse incentive to believe this such that you will preach it to others and elevate some job salary range or stock.

Either way, not a good look.

reply
javaunsafe2019
3 hours ago
[-]
This
reply
dyauspitr
17 hours ago
[-]
I’ll believe it when the AI can earn money on its own. I obviously don’t mean someone paying a subscription to use the AI I mean, letting the AI lose on the Internet with only the goal of making money and putting it into a bank account.
reply
hamburga
14 hours ago
[-]
Do trading bots count?
reply
1659447091
10 hours ago
[-]
No, the AI would have to start from zero and reason it's way to making itself money online, such as the humans who were first in their online field of interest (e-commerce, scams, ads etc from the 80's and 90's) when there was no guidance, only general human intelligence that could reason their way into money making opportunities and reason their way into making it work.
reply
concordDance
7 hours ago
[-]
I don't think humans ever do that. They research/read and ask other humans.
reply
msoad
2 hours ago
[-]
There are new research where chain of thoughts is happening in latent spaces and not in English. They demonstrated better results since language is not as expressive as those concepts that can be represented in the layers before decoder. I wonder if o3 is doing that?
reply
gliptic
2 hours ago
[-]
"You can tell the RL is done properly when the models cease to speak English in their chain of thought" -- Karpathy
reply
padolsey
2 hours ago
[-]
I think you mean this: https://arxiv.org/abs/2412.06769

From what I can see, presuming o3 is a progression of o1 and has good level of accountabiltiy bubbling up during 'inference' (i.e. "Thinking about ___") then I'd say it's just using up millions of old-school tokens (the 44 million tokens that are referenced). So not latent thinking per se.

reply
Zamicol
2 hours ago
[-]
Interesting!
reply
nopinsight
22 hours ago
[-]
Let me go against some skeptics and explain why I think full o3 is pretty much AGI or at least embodies most essential aspects of AGI.

What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)

ARC has been challenging precisely because solving its problems often requires:

   1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND

   2) using the right level(s) of abstraction
Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve.

It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.

[1] https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...

ADDED:

Thanks to the link to Chollet's posts by lswainemoore below. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still image data.

I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)

reply
phil917
21 hours ago
[-]
Quote from the creators of the AGI-ARC benchmark: "Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."
reply
qnleigh
7 hours ago
[-]
I like the notion, implied in the article, that AGI won't be verified by any single benchmark, but by our collective inability to come up with benchmarks that defeat some eventual AI system. This matches the cat-and-mouse game we've been seeing for a while, where benchmarks have to constantly adapt to better models.

I guess you can say the same thing for the Turing Test. Simple chat bots beat it ages ago in specific settings, but the bar is much higher now that the average person is familiar with their limitations.

If/once we have an AGI, it will probably take weeks to months to really convince ourselves that it is one.

reply
CooCooCaCha
21 hours ago
[-]
Yeah the real goalpost is reliable intelligence. A supposed phd level AI failing simple problems is a red flag that we’re still missing something.
reply
gremlinsinc
20 hours ago
[-]
You've never met a Doctor who couldn't figure out how to work their email? Or use street smarts? You can have a PHD but be unable to reliably handle soft skills, or any number of things you might 'expect' someone to be able to do.

Just playing devils' advocate or nitpicking the language a bit...

reply
manquer
7 hours ago
[-]
Doctors[1] or say pilots are skilled professions and difficult to master and deserve respect yes , but they do not need high levels of intelligence to be good at. They require many other skills like taking decisions under pressure or good motor skills that are hard, but not necessarily intelligence.

Also not knowing something is hardly a criteria , skilled humans focus on their areas of interest above most other knowledge and can be unaware of other subjects.

Fields medal winners for example may not be aware of most pop culture things doesn’t make them not able to do so, just not interested

—-

[1] most doctors including surgeons and many respected specialists, some doctors however do need that skills but those are specialized few and generally do know how to use email

reply
nuancebydefault
19 hours ago
[-]
A coworker of mine has a phd in physics. Showing the difference to him between little and big endian in a hex editor, showing file sizes of raw image files and how to compute it... I explained 3 times and maybe he understood part of it now.
reply
CooCooCaCha
20 hours ago
[-]
An important distinction here is you’re comparing skill across very different tasks.

I’m not even going that far, I’m talking about performance on similar tasks. Something many people have noticed about modern AI is it can go from genius to baby-level performance seemingly at random.

Take self driving cars for example, a reasonably intelligent human of sound mind and body would never accidentally mistake a concrete pillar for a road. Yet that happens with self-driving cars, and seemingly here with ARC-AGI problems which all have a similar flavor.

reply
nopinsight
21 hours ago
[-]
I'd need to see what kinds of easy tasks those are and would be happy to revise my hypothesis if that's warranted.

Also, it depends a great deal on what we define as AGI and whether they need to be a strict superset of typical human intelligence. o3's intelligence is probably superhuman in some aspects but inferior in others. We can find many humans who exhibit such tendencies as well. We'd probably say they think differently but would still call them generally intelligent.

reply
lswainemoore
21 hours ago
[-]
They're in the original post. Also here: https://x.com/fchollet/status/1870172872641261979 / https://x.com/fchollet/status/1870173137234727219

Personally, I think it's fair to call them "very easy". If a person I otherwise thought was intelligent was unable to solve these, I'd be quite surprised.

reply
nopinsight
20 hours ago
[-]
Thanks! I've analyzed some easy problems that o3 failed at. They involve spatial intelligence including connection and movement. This skill is very hard to learn from textual and still image data.

I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.

(OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)

reply
lswainemoore
20 hours ago
[-]
> I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.

Maybe! I suppose time will tell. That said, spatial intelligence (connection/movement included) is the whole game in this evaluation set. I think it's revealing that they can't handle these particular examples, and problematic for claims of AGI.

reply
MVissers
14 hours ago
[-]
Probably just not trained on this kind of data. We could create a benchmark about it, and they'd shatter it within a year or so.

I'm starting to really see no limits on intelligence in these models.

reply
sungho_
8 hours ago
[-]
Doesn't the fact that it can only accomplish tasks with benchmarks imply that it has limitations in intelligence?
reply
qup
4 hours ago
[-]
> Doesn't the fact that it can only accomplish tasks with benchmarks

That's not a fact

reply
PoignardAzur
5 hours ago
[-]
> This skill is very hard to learn from textual and still image data.

I had the same take at first, but thinking about it again, I'm not quite sure?

Take the "blue dots make a cross" example (the second one). The inputs only has four blue dots, which makes it very easy to see a pattern even in text data: two of them have the same x coordinate, two of them have the same y (or the same first-tuple-element and second-tuple-element if you want to taboo any spatial concepts).

Then if you look into the output, you can notice that all the input coordinates are also in the output set, just not always with the same color. If you separate them into "input-and-output" and "output-only", you quickly notice that all of the output-only squares are blue and share a coordinate (tuple-element) with the blue inputs. If you split the "input-and-output" set into "same color" and "color changed", you can notice that the changes only go from red to blue, and that the coordinates that changed are clustered, and at least one element of the cluster shares a coordinate with a blue input.

Of course, it's easy to build this chain of reasoning in retrospect, but it doesn't seem like a complete stretch: each step only requires noticing patterns in the data, and it's how a reasonably puzzle-savvy person might solve this if you didn't let them draw the squares on papers. There are a lot of escape games with chains of reasoning much more complex and random office workers solve them all the time.

The visual aspect makes the patterns jump to us more, but the fact that o3 couldn't find them at all with thousands of dollars of compute budget still seems meaningful to me.

EDIT: Actually, looking at Twitter discussions[1], o3 did find those patterns, but was stumped by ambiguity in the test input that the examples didn't cover. Its failures on the "cascading rectangles" example[2] looks much more interesting.

[1]: https://x.com/bio_bootloader/status/1870339297594786064

[2]: https://x.com/_AI30_/status/1870407853871419806

reply
93po
21 hours ago
[-]
they say it isn't AGI but i think the way o3 functions can be refined to AGI - it's learning to solve a new, novel problems. we just need to make it do that more consistently, which seems achievable
reply
dimitri-vs
16 hours ago
[-]
Have we really watered down the definition of AGI that much?

LLMs aren't really capable of "learning" anything outside their training data. Which I feel is a very basic and fundamental capability of humans.

Every new request thread is a blank slate utilizing whatever context you provide for the specific task and after the tread is done (or context limit runs out) it's like it never happened. Sure you can use databases, do web queries, etc. but these are inflexible bandaid solutions, far from what's needed for AGI.

reply
theptip
16 hours ago
[-]
> LLMs aren't really capable of "learning" anything outside their training data.

ChatGPT has had for some time the feature of storing memories about its conversations with users. And you can use function calling to make this more generic.

I think drawing the boundary at “model + scaffolding” is more interesting.

reply
dimitri-vs
12 hours ago
[-]
Calling the sentence or two it arbitrarily saves when you statd your preferences and profile info "memories" is a stretch.

True equivalent to human memories would require something like a multimodal trillion token context window.

RAG is just not going to cut it, and if anything will exacerbated problems with hallucinations.

reply
bubblyworld
11 hours ago
[-]
That's true for vanilla LLMs, but also keep in mind that there are no details about o3's architecture at the moment. Clearly they are doing something different given the huge performance jump on a lot of benchmarks, and it may well involve in-context learning.
reply
catmanjan
6 hours ago
[-]
Given every other iteration has basically just been the same thing but bigger, why should we think this?
reply
bubblyworld
1 hour ago
[-]
My point was to caution against being too confident about the underlying architecture, not to argue for any particular alternative.

Your statement is false - things changed a lot between gpt4 and o1 under the hood, but notably not the model size. In fact the model size of o1 is smaller than gpt4 by several orders of magnitude! Improvements are being made in other ways.

reply
timabdulla
21 hours ago
[-]
What's your explanation for why it can only get ~70% on SWE-bench Verified?

I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems, and to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g. you can say what was Github issue #4145 in project foo, and there's a decent chance it can tell you exactly what the issue was about!)

reply
slewis
21 hours ago
[-]
I've spent tons of time evaluating o1-preview on SWEBench-Verified.

For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.

For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall in this category, but I suspect its more than on a simpler benchmark like MATH.

reply
nopinsight
21 hours ago
[-]
One possibility is that it may not yet have sufficient experience and real-world feedback for resolving coding issues in professional repos, as this involves multiple steps and very diverse actions (or branching factor, in AI terms). They have committed to not training on API usage, which limits their ability to directly acquire training data from it. However, their upcoming agentic efforts may address this gap in training data.
reply
timabdulla
21 hours ago
[-]
Right, but the branching factor increases exponentially with the scope of the work.

I think it's obvious that they've cracked the formula for solving well-defined, small-in-scope problems at a superhuman level. That's an amazing thing.

To me, it's less obvious that this implies that they will in short order with just more training data be able to solve ambiguous, large-in-scope problems at even just a skilled human level.

There are far more paths to consider, much more context to use, and in an RL setting, the rewards are much more ambiguously defined.

reply
nopinsight
19 hours ago
[-]
Their reasoning models can learn from procedures and methods, which generalize far better than data. Software tasks are diverse but most tasks are still fairly limited in scope. Novel tasks might remain challenging for these models, as they do for humans.

That said, o3 might still lack some kind of interaction intelligence that’s hard to learn. We’ll see.

reply
nyrikki
21 hours ago
[-]
GPQA scores are mostly from pre-training, against content in the corpus. They have gone silent but look at the GPT4 technical report which calls this out.

We are nowhere close to what Sam Altman calls AGI and transformers are still limited to what uniform-TC0 can do.

As an example the Boolean Formula Value Problem is NC1-complete, thus beyond transformers but trivial to solve with a TM.

As it is now proven that the frame problem is equivalent to the halting problem, even if we can move past uniform-TC0 limits, novelty is still a problem.

I think the advancements are truly extraordinary, but unless you set the bar very low, we aren't close to AGI.

Heck we aren't close to P with commercial models.

reply
sebzim4500
20 hours ago
[-]
Isn't any physically realizable computer (including our brains) limited to what uniform-TC0 can do?
reply
nyrikki
18 hours ago
[-]
Neither TC0 nor uniform-TC0 are physically realizable, they are tools not physical devices.

The default nonuniform circuits classes are allowed to have a different circuit per input size, the uniform types have unbounded fan-in

Similar to how a k-tape TM doesn't get 'charged' for the input size.

With Nick Class (NC) the number of components is similar to traditional compute time while depth relates to the ability to parallelize operations.

These are different than biological neurons, not better or worse but just different.

Human neurons can use dendritic compartmentalization, use spike timing, can retime spikes etc...

While the perceptron model we use in ML is useful, it is not able to do xor in one layer, while biological neurons do that without anything even reaching the soma, purely in the dendrites.

Statistical learning models still comes down to a choice function, no matter if you call that set shattering or...

With physical computers the time hierarchy does apply and if TIME(g(n)) is given more time than TIME(f(n)), g(n) can solve more problems.

So you can simulate a NTM with exhaustive search with a physical computer.

Physical computers also tend to have NAND and XOR gates, and can have different circuit depths.

When you are in TC0, you only have AND, OR and Threshold (or majority) gates.

Think of instruction level parallelism in a typical CPU, it can return early, vs Itanium EPIC, which had to wait for the longest operation. Predicated execution is also how GPUs work.

They can send a mask and save on load store ops as an example, but the cost of that parallelism is the consent depth.

It is the parallelism tradeoff that both makes transformers practical as well as limit what they can do.

The IID assumption and autograd requiring smooth manifolds plays a role too.

The frame problem, which causes hard problems to become unsolvable for computers and people alike does also.

But the fact that we have polynomial time solutions for the Boolean Formula Value Problem, as mentioned in my post above is probably a simpler way of realizing physical computers aren't limited to uniform-TC0.

reply
drdeca
18 hours ago
[-]
Do you just mean because any physically realizable computer is a finite state machine? Or...?

I wouldn't describe a computer's usual behavior as having constant depth.

It is fairly typical to talk about problems in P as being feasible (though when the constant factors are too big, this isn't strictly true of course).

Just because for unreasonably large inputs, my computer can't run a particular program and produce the correct answer for that input, due to my computer running out of memory, we don't generally say that my computer is fundamentally incapable of executing that algorithm.

reply
ec109685
20 hours ago
[-]
The problem with ARC is that there are a finite number of heuristics that could be enumerated and trained for, which would give model a substantial leg up on this evaluation, but not be generalized to other domains.

For example, if they produce millions of examples of the type of problems o3 still struggles on, it would probably do better at similar questions.

Perhaps the private data set is different enough that this isn’t a problem, but the ideal situation would be unveiling a truly novel dataset, which it seems like arc aims to do.

reply
Imnimo
21 hours ago
[-]
>Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforce, AIME, and Frontier Math strongly suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it.

The article notes, "o3 still fails on some very easy tasks". What explains these failures if o3 can solve "any problem" at the human level? Do these failed cases require some essential knowledge that has eluded the massive OpenAI training set?

reply
nopinsight
21 hours ago
[-]
Great point. I'd love to see what these easy tasks are and would be happy to revise my hypothesis accordingly. o3's intelligence is unlikely to be a strict superset of human intelligence. It is certainly superior to humans in some respects and probably inferior in others. Whether it's sufficiently generally intelligent would be both a matter of definition and empirical fact.
reply
Imnimo
21 hours ago
[-]
Chollet has a few examples here:

https://x.com/fchollet/status/1870172872641261979

https://x.com/fchollet/status/1870173137234727219

I would definitely consider them legitimately easy for humans.

reply
nopinsight
20 hours ago
[-]
Thanks! I added some comments on this at the bottom of the post above.
reply
mirkodrummer
14 hours ago
[-]
Please stop it calling AGI, we don’t even know or agree universally what that should actually mean. How far did we get with hype calling a lossy probabilistic compressor firing slowly at us words AGI? That’s a real bummer to me
reply
razodactyl
7 hours ago
[-]
Is this comment voted down because of sentiment / polarity?

Regardless the critical aspect is valid, AGI would be something like Cortana from Halo.

reply
uncomplexity_
14 hours ago
[-]
on the spatial data i see it as a highly intelligent head of a machine that just needs better limbs and better senses.

i think that's where most hardware startups will specialize with in the coming decades, different industries with different needs.

reply
puttycat
17 hours ago
[-]
Great comment. See this as well for another potential reason for failure:

https://arxiv.org/abs/2402.10013

reply
norir
21 hours ago
[-]
Personally I find "human-level" to be a borderline meaningless and limiting term. Are we now super human as a species relative to ourselves just five years ago because of our advances in developing computer programs that better imitate what many (but far from all) of us were already capable of doing? Have we reached a limit to human potential that can only be surpassed by digital machines? Who decides what human level is and when we have surpassed it? I have seen some ridiculous claims about ai in art that don't stand up to even the slightest scrutiny by domain experts but that easily fool the masses.
reply
razodactyl
7 hours ago
[-]
No I think we're just tired and depressed as a species... Existing systems work to a degree but aren't living up to their potential of increasing happiness according to technological capabilities.
reply
golol
17 hours ago
[-]
In order to replace actual humans doing their job I think LLMs are lacking in judgement, sense of time and agenticism.
reply
Kostchei
16 hours ago
[-]
I mean fkcu me when they have those things, however, maybe they are just lazy and their judgement is fine, for a lazy intelligence. Inner-self thinks "why are these bastards asking me to do this? ". I doubt that is actually happening, but now, .. prove it isn't.
reply
ryoshu
14 hours ago
[-]
Ask o3 is P=NP?
reply
amelius
3 hours ago
[-]
It will just answer with the current consensus on the matter.
reply
zwnow
5 hours ago
[-]
This is not AGI lmao.
reply
xvector
21 hours ago
[-]
Agree. AGI is here. I feel such a sense of pride in our species.
reply
PaulDavisThe1st
21 hours ago
[-]
> It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could.

Every human does this dozens, hundreds or thousands of times ... during childhood.

reply
tymonPartyLate
3 hours ago
[-]
Isn’t this like a brute force approach? Given it costs $ 3000 per task, thats like 600 GPU hours (h100 at Azure) In that amount of time the model can generate millions of chains of thoughts and then spend hours reviewing them or even testing them out one by one. Kind of like trying until something sticks and that happens to solve 80% of ARC. I feel like reasoning works differently in my brain. ;)
reply
tikkun
3 hours ago
[-]
They're only allowed 2-3 guesses per problem. So even though yes it generates many candidates, it can't validate them - it doesn't have tool use or a verifier, it submits the best 2-3 guesses. https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50...
reply
nmca
3 hours ago
[-]
It is allowed exactly two guesses, per the ARC rules.
reply
trescenzi
3 hours ago
[-]
How many guesses is the human comparison based on? I’d hope two as well but haven’t seen this anywhere so now I’m curious.
reply
nmca
3 hours ago
[-]
The real turker studies, resulting in the ~70% number, are scored correctly I believe. Higher numbers are just speculated human performance as far as I’m aware.
reply
macrolime
2 hours ago
[-]
The trick with AlphaGo was brute force combined with learning to extract strategies from brute force using reinforcement learning, that's what we'll see here. So maybe it costs a million dollars in compute to get a high score, but use reinforcement learning ala alphazero to learn from the process and it won't cost a million dollars next time and let it do lots of hard benchmarks, math problems and coding tasks and it'll keep getting better and better.
reply
nextworddev
3 hours ago
[-]
The best interpretation of this result is probably that it showed tackling some arbitrary benchmark is something you can throw money at, aka it’s just something money can solve.

Its not agi obviously in the sense that you still need to some problem framing and initialization to kickstart the reasoning path simulations

reply
torginus
2 hours ago
[-]
this might be quite an important point - if they created an algorithm that can mimic human reasoning, but scales terribly with problem complexity (in terms of big O notation), it's still a very significant result, but it's not a 'humans brains are over' moment quite yet.
reply
strangescript
3 hours ago
[-]
"We have created artificial super intelligence, it has solved physics!"

"Well, yeah, but its kind of expensive" -- this guy

reply
tymonPartyLate
3 hours ago
[-]
Haha. Hopefully you’re right and solving the ARC puzzle translates to solving all of physics. I just remain skeptical about the OpenAI hype. They have a track record of exaggerating the significance of their releases and their impact on humanity.
reply
jeremyjh
1 hour ago
[-]
Please do show me a novel result in physics from any LLM. You think "this guy" is stupid because he doesn't extrapolate from this $2MM test that nearly reproduces the work of a STEM graduate to a super intelligence that has already solved physics. Maybe you've got it backwards.
reply
freehorse
3 hours ago
[-]
The problem is not that it is expensive, but that, most likely, it is not superintelligence. Superintelligence is not exploring the problem space semi-blindly, if the thounsands $$$ per task are actually spent for that. There is a reason the actual ARC-AGI prize requires efficiency, because the point is not "passing the test" but solving the framing problem of intelligence.
reply
modeless
22 hours ago
[-]
Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far.

A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.

It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.

We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!

reply
adamgordonbell
19 hours ago
[-]
There is a benchmark, NovelQA, that LLMs don't dominate when it feels like they should. The benchmark is to read a novel and answer questions about it.

LLMs are below human evaluation, as I last looked, but it doesn't get much attention.

Once it is passed, I'd like to see one that is solving the mystery in a mystery book right before it's revealed.

We'd need unpublished mystery novels to use for that benchmark, but I think it gets at what I think of as reasoning.

https://novelqa.github.io/

reply
loxias
16 hours ago
[-]
NovelQA is a great one! I also like GSM-Symbolic -- a benchmark based on making _symbolic templates_ of quite easy questions, and sampling them repeatedly, varying things like which proper nouns are used, what order relevant details appear, how many irrelevant details (GSM-NoOp) and where they are in the question, things like that.

LLMs are far, _far_ below human on elementary problems, once you allow any variation and stop spoonfeeding perfectly phrased word problems. :)

https://machinelearning.apple.com/research/gsm-symbolic

https://arxiv.org/pdf/2410.05229

Paper came out in October, I don't think many have fully absorbed the implications.

It's hard to take any of the claims of "LLMs can do reasoning!" seriously, once you understand that simply changing what names are used in a 8th grade math word problem can have dramatic impact on the accuracy.

reply
meta_x_ai
19 hours ago
[-]
Looks like it's not updated for nearly a year and I'm guessing Gemini 2.0 Flash with 2m context will simply crush it
reply
adamgordonbell
18 hours ago
[-]
That's true. They don't have Claude 3.5 on there either. So maybe it's not relevant anymore, but I'm not sure.

If so, let's move on to the murder mysteries or more complex literary analysis.

reply
usaar333
9 hours ago
[-]
That's an old leaderboard -- has no one checked any SOTA LLM in the last 8 months?
reply
latency-guy2
15 hours ago
[-]
> I'd like to see one that is solving the mystery in a mystery book right before it's revealed.

I would think this is a not so good bench. Author does not write logically, they write for entertainment.

reply
adamgordonbell
15 hours ago
[-]
So I'm thinking of something like Locked-room mystery where the idea is it's solvable, and the reader is given a chance to solve.

The reason it seems like an interesting bench, is it's a puzzle presented in a long context. Its like testing if an LLm is at Sherlock Holmes level of world and motivation modelling.

reply
rowanG077
17 hours ago
[-]
Benchmark how? Is it good if the LLM can or can't solve it?
reply
CamperBob2
19 hours ago
[-]
Does it work on short stories, but not novels? If so, then that's just a minor question of context length that should self-resolve over time.
reply
adamgordonbell
19 hours ago
[-]
The books fit in the current long context models, so it's not merely the context size constraint but the length is part of the issue, for sure.
reply
danielmarkbruce
19 hours ago
[-]
Highly challenging for LLMs because it has nothing to do with language. LLMs and their training processes have all kinds of optimizations for language and how it's presented.

This benchmark has done a wonderful job with marketing by picking a great name. It's largely irrelevant for LLMs despite the fact it's difficult.

Consider how much of the model is just noise for a task like this given the low amount of information in each token and the high embedding dimensions used in LLMs.

reply
computerex
16 hours ago
[-]
The benchmark is designed to test for AGI and intelligence, specifically the ability to solve novel problems.

If the hypothesis is that LLMs are the “computer” that drives the AGI then of course the benchmark is relevant in testing for AGI.

I don’t think you understand the benchmark and its motivation. ARC AGI benchmark problems are extremely easy and simple for humans. But LLMs fail spectacularly at them. Why they fail is irrelevant, the fact they fail though means that we don’t have AGI.

reply
danielmarkbruce
15 hours ago
[-]
> The benchmark is designed to test for AGI and intelligence, specifically the ability to solve novel problems.

It's a bunch of visual puzzles. They aren't a test for AGI because it's not general. If models (or any other system for that matter) could solve it, we'd be saying "this is a stupid puzzle, it has no practical significance". It's a test of some sort of specific intelligence. On top of that, the vast majority of blind people would fail - are they not generally intelligent?

The name is marketing hype.

The benchmark could be called "random puzzles LLMs are not good at because they haven't been optimized for it because it's not valuable benchmark". Sure, it wasn't designed for LLMs, but throwing LLMs at it and saying "see?" is dumb. We can throw in benchmarks for tennis playing, chess playing, video game playing, car driving and a bajillion other things while we are at it.

reply
NateEag
10 hours ago
[-]
And all that is kind of irrelevant, because if LLMs were human-level general intelligence, they would solve all these questions correctly without blinking.

But they don't. Not even the best ones.

reply
pama
2 hours ago
[-]
No human would score high on that puzzle if the images were given to them as a series of tokens. Even previous LLMs scored much better than humans if tested in the same way.
reply
jug
20 hours ago
[-]
I liked the SimpleQA benchmark that measures hallucinations. OpenAI models did surprisingly poorly, even o1. In fact, it looks like OpenAI often does well on benchmarks by taking the shortcut to be more risk prone than both Anthropic and Google.
reply
internet_points
8 hours ago
[-]
> The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.

One might also interpret that as "the fact that models which are studying to the test are getting better at the test" (Goodhart's law), not that they're actually reasoning.

reply
aimanbenbaha
17 hours ago
[-]
Because LLMs are on an off-ramp path towards AGI. A generally intelligent system can brute force its way with just memory.

Once a model recognizes a weakness through reasoning with CoT when posed to a certain problem and gets the agency to adapt to solve that problem that's a precursor towards real AGI capability!

reply
zone411
19 hours ago
[-]
It's the least interesting benchmark for language models among all they've released, especially now that we already had a large jump in its best scores this year. It might be more useful as a multimodal reasoning task since it clearly involves visual elements, but with o3 already performing so well, this has proven unnecessary. ARC-AGI served a very specific purpose well: showcasing tasks where humans easily outperformed language models, so these simple puzzles had their uses. But tasks like proving math theorems or programming are far more impactful.
reply
versteegen
16 hours ago
[-]
ARC wasn't designed as a benchmark for LLMs, and it doesn't make much sense to compare them on it since it's the wrong modality. Even a MLM with image inputs can't be expected to do well, since they're nothing like 99.999% of the training data. The fact that even a text-only LLM can solve ARC problems with the proper framework is important, however.
reply
skywhopper
20 hours ago
[-]
"The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning."

Not sure I understand how this follows. The fact that a certain type of model does well on a certain benchmark means that the benchmark is relevant for a real-world reasoning? That doesn't make sense.

reply
munchler
19 hours ago
[-]
It shows objectively that the models are getting better at some form of reasoning, which is at least worth noting. Whether that improved reasoning is relevant for the real world is a different question.
reply
moffkalast
19 hours ago
[-]
It shows objectively that one model got better at this specific kind of weird puzzle that doesn't translate to anything because it is just a pointless pattern matching puzzle that can be trained for, just like anything else. In fact they specifically trained for it, they say so upfront.

It's like the modern equivalent of saying "oh when AI solves chess it'll be as smart as a person, so it's a good benchmark" and we all know how that nonsense went.

reply
munchler
18 hours ago
[-]
Hmm, you could be right, but you could also be very wrong. Jury's still out, so the next few years will be interesting.

Regarding the value of "pointless pattern matching" in particular, I would refer you to Douglas Hofstadter's discussion of Bongard problems starting on page 652 of _Godel, Escher, Bach_. Money quote: "I believe that the skill of solving Bongard [pattern recognition] problems lies very close to the core of 'pure' intelligence, if there is such a thing."

reply
moffkalast
17 hours ago
[-]
Well I certainly at least agree with that second part, the doubt if there is such a thing ;)

The problem with pattern matching of sequences and transformers as an architecture is that it's something they're explicitly designed to be good at with self attention. Translation is mainly matching patterns to equivalents in different languages, and continuing a piece of text is following a pattern that exists inside it. This is primarily why it's so hard to draw a line between what an LLM actually understands and what it just wings naturally through pattern memorization and why everything about them is so controversial.

Honestly I was really surprised that all models did so poorly on ARC in general thus far, since it really should be something they ought to be superhuman at from the get-go. Probably more of a problem that it's visual in concept than anything else.

reply
bagels
16 hours ago
[-]
It doesn't follow, faulty logic. The two are probably correlated though.
reply
justanotherjoe
11 hours ago
[-]
i am confused cause this dataset is visual-based, and yet being used to measure 'LLM'. I feel like the visual nature of it was really the biggest hurdle to solving it.
reply
dtquad
22 hours ago
[-]
Are there any single-step non-reasoner models that do well on this benchmark?

I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.

reply
throwaway71271
22 hours ago
[-]

    | Name                                 | Semi-private eval | Public eval |
    |--------------------------------------|-------------------|-------------|
    | Jeremy Berman                        | 53.6%             | 58.5%       |
    | Akyürek et al.                       | 47.5%             | 62.8%       |
    | Ryan Greenblatt                      | 43%               | 42%         |
    | OpenAI o1-preview (pass@1)           | 18%               | 21%         |
    | Anthropic Claude 3.5 Sonnet (pass@1) | 14%               | 21%         |
    | OpenAI GPT-4o (pass@1)               | 5%                | 9%          |
    | Google Gemini 1.5 (pass@1)           | 4.5%              | 8%          |

https://arxiv.org/pdf/2412.04604
reply
kandesbunzler
21 hours ago
[-]
why is this missing the o1 release / o1 pro models? Would love to know how much better they are
reply
Freebytes
14 hours ago
[-]
This might be because they are referencing single step, and I do not think o1 is single step.
reply
aimanbenbaha
17 hours ago
[-]
Akyürek et al uses test-time compute.
reply
YetAnotherNick
22 hours ago
[-]
Here are the results for base models[1]:

  o3 (coming soon)  75.7% 82.8%
  o1-preview        18%   21%
  Claude 3.5 Sonnet 14%   21%
  GPT-4o            5%    9%
  Gemini 1.5        4.5%  8%
Score (semi-private eval) / Score (public eval)

[1]: https://arcprize.org/2024-results

reply
Bjorkbat
19 hours ago
[-]
It's easy to miss, but if you look closely at the first sentence of the announcement they mention that they used a version of o3 trained on a public dataset of ARC-AGI, so technically it doesn't belong on this list.
reply
dot1x
7 hours ago
[-]
It's all scam. ClosedAI trained on the data they were tested on, so no, nothing here is impressive.
reply
simonw
20 hours ago
[-]
I'd love to know how Claude 3.5 Sonnet does so well despite (presumably) not having the same tricks as the o-series models.
reply
lossolo
20 hours ago
[-]
> making the most interesting and challenging LLM benchmark so far.

This[1] is currently the most challenging benchmark. I would like to see how O3 handles it, as O1 solved only 1%.

1. https://epoch.ai/frontiermath/the-benchmark

reply
pynappo
20 hours ago
[-]
Apparently o3 scored about 25%

https://youtu.be/SKBG1sqdyIU?t=4m40s

reply
FiberBundle
19 hours ago
[-]
This is actually the result that I find way more impressive. Elite mathematicians think these problems are challenging and thought they were years away from being solvable by AI.
reply
modeless
19 hours ago
[-]
You're right, I was wrong to say "most challenging" as there have been harder ones coming out recently. I think the correct statement would be "most challenging long-standing benchmark" as I don't believe any other test designed in 2019 has resisted progress for so long. FrontierMath is only a month old. And of course the real key feature of ARC is that it is easy for humans. FrontierMath is (intentionally) not.
reply
esafak
2 hours ago
[-]
They should put some famous, unsolved problems in the next edition so ML researchers do some actually useful work while they're "gaming" the benchmarks :)
reply
refulgentis
22 hours ago
[-]
This emphasizes persons and a self-conceived victory narrative over the ground truth.

Models have regularly made progress on it, this is not new with the o-series.

Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well thought out and designed benchmark of True Intelligence(tm). It's one type of visual puzzle.

I don't mean to be negative, but to inject a memento mori. Real story is some guys get together and ride off Chollet's name with some visual puzzles from ye olde IQ test, and the deal was Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.

Getting this score is extremely impressive but I don't assign more signal to it than any other benchmark with some thought to it.

reply
modeless
22 hours ago
[-]
Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front. (Not sure I believe the speculation about o3's internals in the link.)

What I'm saying is the fact that as models are getting better at reasoning they are also scoring better on ARC proves that it is measuring something relating to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs. Even today, let alone five years ago when ARC was released. ARC was visionary.

reply
HarHarVeryFunny
16 hours ago
[-]
> o3 presumably isn't doing program synthesis

I'd guess it's doing natural language procedural synthesis, the same way a human might (i.e. figuring the sequence of steps to effect the transformation), but it may well be doing (sub-)solution verification by using the procedural description to generate code whose output can then be compared to the provided examples.

While OpenAI haven't said exactly what the architecture of o1/o3 are, the gist of it is pretty clear - basically adding "tree" search and iteration on top of the underlying LLM, driven by some RL-based post-training that imparts generic problem solving biases to the model. Maybe there is a separate model orchestrating the search and solution evaluation.

I think there are many tasks that are easy enough for humans but hard/impossible for these models - the ultimate one in terms of commercial value would be to take an "off the shelf model" and treat it as an intern/apprentice and teach it to become competent in a entire job it was never trained on. Have it participate in team meetings and communications, and become a drop-in replacement for a human performing that job (any job that an be performed remotely without a physical presence).

reply
hdjjhhvvhga
22 hours ago
[-]
Your argumentation seems convincing but I'd like to offer a competitive narrative: any benchmark that is public becomes completely useless because companies optimize for it - especially AI that depends on piles of money and they need some proof they are developing.

That's why I have some private benchmarks and I'm sorry to say that the transition from GTP4 to o1 wasn't unambiguously a step forward (in some tasks yes, in some not).

On the other hand, private benchmarks are even less useful to the general public than the public ones, so we have to deal with what we have - but many of us just treat it as noise and don't give it much significance. Ultimately, the models should defend themselves by performing the tasks individual users want them to do.

reply
stonemetal12
21 hours ago
[-]
Rather any Logic puzzle you post on the internet as something AIs are bad at is in the next round of training data so AIs get better at that specific question. Not because AI companies are optimizing for a benchmark but because they suck up everything.
reply
modeless
21 hours ago
[-]
ARC has two test sets that are not posted on the Internet. One is kept completely private and never shared. It is used when testing open source models and the models are run locally with no internet access. The other test set is used when testing closed source models that are only available as APIs. So it could be leaked in theory, but it is still not posted on the internet and can't be in any web crawls.

You could argue that the models can get an advantage by looking at the training set which is on the internet. But all of the tasks are unique and generalizing from the training set to the test set is the whole point of the benchmark. So it's not a serious objection.

reply
foobiekr
16 hours ago
[-]
Given the delivery mechanism for OpenAI, how do they actually keep it private?
reply
modeless
16 hours ago
[-]
> So it could be leaked in theory

That's why they have two test sets. But OpenAI has legally committed to not training on data passed to the API. I don't believe OpenAI would burn their reputation and risk legal action just to cheat on ARC. And what they've reported is not implausible IMO.

reply
sensanaty
4 hours ago
[-]
Yeah I'm sure the Microsoft-backed company headed by Mr. Worldcoin Altman whose sole mission statement so far has been to overhype every single product they released wouldn't dare cheat on one of these benchmarks that "prove" AGI (as they've been claiming since GPT-2).
reply
QuantumGood
21 hours ago
[-]
Gaming the benchmarks usually needs to be considered first when evaluating new results.
reply
bubblyworld
21 hours ago
[-]
I think gaming the benchmarks is encouraged in the ARC AGI context. If you look at the public test cases you'll see they test a ton of pretty abstract concepts - space, colour, basic laws of physics like gravity/magnetism, movement, identity and lots of other stuff (highly recommend exploring them). Getting an AI to do well at all, regardless of whether it was gamed or not, is the whole challenge!
reply
chaps
21 hours ago
[-]
Honestly, is gaming benchmarks actually a problem in this space in that it still shows something useful? Just means we need more benchmarks, yeah? It really feels not unlike keggle competitions.

We do the same exact stuff with real people with programming challenges and such where people just study common interview questions rather than learning the material holistically. And since we know that people game these interview type questions, we can adjust the interview processes to minimize gamification.... which itself leads to gamification and back to step one. That's not ideal an ideal feedback loop of course, but people still get jobs and churn out "productive work" out of it.

reply
ben_w
21 hours ago
[-]
AI are very good at gaming benchmarks. Both as overfitting and as Goodhart's law, gaming benchmarks has been a core problem during training for as long as I've been interested in the field.

Sometimes this manifests as "outside the box thinking", like how a genetic algorithm got an "oscillator" which was really just an antenna.

It is a hard problem, and yes we still both need and can make more and better benchmarks; but it's still a problem because it means the benchmarks we do have are overstating competence.

reply
CamperBob2
21 hours ago
[-]
The idea behind this particular benchmark, at least, is that it can't be gamed. What are some ways to game ARC-AGI, meaning to pass it without developing the required internal model and insights?

In principle you can't optimize specifically for ARC-AGI, train against it, or overfit to it, because only a few of the puzzles are publicly disclosed.

Whether it lives up to that goal, I don't know, but their approach sounded good when I first heard about it.

reply
psb217
20 hours ago
[-]
Well, with billions in funding you could task a hundred or so very well paid researchers to do their best at reverse engineering the general thought process which went into ARC-AGI, and then generate fresh training data and labeled CoTs until the numbers go up.
reply
CamperBob2
20 hours ago
[-]
Right, but the ARC-AGI people would counter by saying they're welcome to do just that. In doing so -- again in their view -- the researchers would create a model that could be considered capable of AGI.

I spent a couple of hours looking at the publicly-available puzzles, and was really impressed at how much room for creativity the format provides. Supposedly the puzzles are "easy for humans," but some of them were not... at least not for me.

(It did occur to me that a better test of AGI might be the ability to generate new, innovative ARC-AGI puzzles.)

reply
psb217
16 hours ago
[-]
It's tricky to judge the difficulty of these sorts of things. Eg, breadth of possibilities isn't an automatic sign of difficulty. I imagine the space of programming problems permits as much variety as ARC-AGI, but since we're more familiar with problems presented as natural language descriptions of programming tasks, and since we know there's tons of relevant text on the web, we see the abstract pictographic ARC-AGI tasks as more novel, challenging, etc. But, to an LLM, any task we can conceive of will be (roughly) as familiar as the amount of relevant training data it's seen. It's legitimately hard to internalize this.

For a space of tasks which are well-suited to programmatic generation, as ARC-AGI is by design, if we can do a decent job of reverse engineering the underlying problem generating grammar, then we can make an LLM as familiar with the task as we're willing to spend on compute.

To be clear, I'm not saying solving these sorts of tasks is unimpressive. I'm saying that I find it unsuprising (in light of past results) and not that strong of a signal about further progress towards the singularity, or FOOM, or whatever. For any of these closed-ish domain tasks, I feel a bit like they're solving Go for the umpteenth time. We now know that if you collect enough relevant training data and train a big enough model with enough GPUs, the training loss will go down and you'll probably get solid performance on the test set. Trillions of reasonably diverse training tokens buys you a lot of generalization. Ie, supervised learning works. This is the horse Ilya Sutskever's ridden to many glorious victories and the big driver of OpenAI's success -- a firm belief that other folks were leaving A LOT of performance on the table due to a lack of belief in the power of their own inventions.

reply
chaps
21 hours ago
[-]
We're in agreement!

What's endlessly interesting to me with all of this is how surprisingly quick the benchmarking feedback loops have become plus the level of scrutiny each one receives. We (as a culture/society/whatever) don't really treat human benchmarking criteria with the same scrutiny such that feedback loops are useful and lead to productive changes to the benchmarking system itself. So from that POV it feels like substantial progress continues to be made through these benchmarks.

reply
refulgentis
21 hours ago
[-]
> Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front.

Agreed.

> And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs.

? There's plenty.

reply
modeless
21 hours ago
[-]
I'd love to hear about more. Which ones are you thinking of?
reply
refulgentis
20 hours ago
[-]
- "Are You Human" https://arxiv.org/pdf/2410.09569 is designed to be directly on target, i.e. cross cutting set of questions that are easy for humans, but challenging for LLMs, Instead of one type of visual puzzle. Much better than ARC for the purpose you're looking for.

- SimpleBench https://simple-bench.com/ (similar to above; great landing page w/scores that show human / ai gap)

- PIQA (physical question answering, i.e. "how do i get a yolk out of a water bottle", common favorite of local llm enthusiasts in /r/localllama https://paperswithcode.com/dataset/piqa

- Berkeley Function-Calling (I prefer https://gorilla.cs.berkeley.edu/leaderboard.html)

AI search googled "llm benchmarks challenging for ai easy for humans", and "language model benchmarks that humans excel at but ai struggles with", and "tasks that are easy for humans but difficult for natural language ai".

It also mentioned Moravec's Paradox is a known framing of this concept, started going down that rabbit hole because the resources were fascinating, but, had to hold back and submit this reply first. :)

reply
modeless
19 hours ago
[-]
Thanks for the pointers! I hadn't seen Are You Human. Looks like it's only two months old. Of course it is much easier to design a test specifically to thwart LLMs now that we have them. It seems to me that it is designed to exploit details of LLM structure like tokenizers (e.g. character counting tasks) rather than to provide any sort of general reasoning benchmark. As such it seems relatively straightforward to improve performance in ways that wouldn't necessarily represent progress in general reasoning. And today's LLMs are not nearly as far from human performance on the benchmark as they were on ARC for many years after it was released.

SimpleBench looks more interesting. Also less than two months old. It doesn't look as challenging for LLMs as ARC, since o1-preview and Sonnet 3.5 already got half of the human baseline score; they did much worse on ARC. But I like the direction!

PIQA is cool but not hard enough for LLMs.

I'm not sure Berkeley Function-Calling represents tasks that are "easy" for average humans. Maybe programmers could perform well on it. But I like ARC in part because the tasks do seem like they should be quite straightforward even for non-expert humans.

Moravec's paradox isn't a benchmark per se. I tend to believe that there is no real paradox and all we need is larger datasets to see the same scaling laws that we have for LLMs. I see good evidence in this direction: https://www.physicalintelligence.company/blog/pi0

reply
refulgentis
17 hours ago
[-]
> "I'm not sure Berkeley Function-Calling represents tasks that are easy for average humans. Maybe programmers could perform well on it."

Functions in this context are not programming function calls. In this context, function calls are a now-deprecated LLM API name for "parse input into this JSON template." No programmer experience needed. Entity extraction by another name, except, that'd be harder: here, you're told up front exactly the set of entities to identify. :)

> "Moravec's paradox isn't a benchmark per se."

Yup! It's a paradox :)

> "Of course it is much easier to design a test specifically to thwart LLMs now that we have them"

Yes.

Though, I'm concerned a simple yes might be insufficient for illumination here.

It is a tautology (it's easier to design a test that $X fails when you have access to $X), and it's unlikely you meant to just share a tautology.

A potential unstated-but-maybe-intended-communication is "it was hard to come up with ARC before LLMs existed" --- LLMs existed in 2019 :)

If they didn't, a hacky way to come up with a test that's hard for the top AIs at the time, BERT-era, would be to use one type of visual puzzle.

If, for conversations sake, we ignore that it is exactly one type of visual puzzle, and that it wasn't designed to be easy for humans, then we can engage with: "its the only one thats easy for humans, but hard for LLMs" --- this was demonstrated as untrue as well.

I don't think I have much to contribute past that, once we're at "It is a singular example of a benchmark thats easy for humans but nigh-impossible for llms, at least in 2019, and this required singular insight", there's just too much that's not even wrong, in the Pauli sense, and it's in a different universe from the original claims:

- "Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far."

- "A lot of people have criticized ARC as not being relevant or indicative of true reasoning...The fact that [o-series models show progress on ARC proves that what it measures really is relevant and important for reasoning."

- "...nobody could quantify exactly the ways the models were deficient..."

- "What we need right now are "easy" benchmarks that these models nevertheless fail."

reply
CamperBob2
19 hours ago
[-]
How long has SimpleBench been posted? Out of the first 6 questions at https://simple-bench.com/try-yourself, o1-pro got 5/6 right.

It was interesting to see how it failed on question 6: https://chatgpt.com/c/6765e70e-44b0-800b-97bd-928919f04fbe

Apparently LLMs do not consider global thermonuclear war to be all that big a deal, for better or worse.

reply
Pannoniae
18 hours ago
[-]
Don't worry, I also got that wrong :) I thought her affair would be the biggest problem for John.
reply
jquery
17 hours ago
[-]
John was an ex, not her partner. Tricky.
reply
stego-tech
19 hours ago
[-]
I won't be as brutal in my wording, but I agree with the sentiment. This was something drilled into me as someone with a hobby in PC Gaming and Photography: benchmarks, while handy measures of potential capabilities, are not guarantees of real world performance. Very few PC gamers completely reinstall the OS before benchmarking to remove all potential cruft or performance impacts, just as very few photographers exclusively take photos of test materials.

While I appreciate the benchmark and its goals (not to mention the puzzles - I quite enjoy figuring them out), successfully passing this benchmark does not demonstrate or guarantee real world capabilities or performance. This is why I increasingly side-eye this field and its obsession with constantly passing benchmarks and then moving the goal posts to a newer, harder benchmark that claims to be a better simulation of human capabilities than the last one: it reeks of squandered capital and a lack of a viable/profitable product, at least to my sniff test. Rather than simply capitalize on their actual accomplishments (which LLMs are - natural language interaction is huge!), they're trying to prove to Capital that with a few (hundred) billion more in investments, they can make AGI out of this and replace all those expensive humans.

They've built the most advanced prediction engines ever conceived, and insist they're best used to replace labor. I'm not sure how they reached that conclusion, but considering even their own models refute this use case for LLMs, I doubt their execution ability on that lofty promise.

reply
danielmarkbruce
19 hours ago
[-]
100%. The hype is misguided. I doubt half the people excited about the result have even looked at what the benchmark is.
reply
w4
19 hours ago
[-]
The cost to run the highest performance o3 model is estimated to be somewhere between $2,000 and $3,400 per task.[1] Based on these estimates, o3 costs about 100x what it would cost to have a human perform the exact same task. Many people are therefore dismissing the near-term impact of these models because of these extremely expensive costs.

I think this is a mistake.

Even if very high costs make o3 uneconomic for businesses, it could be an epoch defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.

Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivilant AIs could be run in parallel per datacenter?

There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.

So if it is true that we've now got something like human-equivilant intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.

[1] https://news.ycombinator.com/item?id=42473876

reply
istjohn
19 hours ago
[-]
Your economic analysis is deeply flawed. If there was anything that valuable and that required that much manpower, it would already have driven up the cost of labor accordingly. The one property that could conceivably justify a substantially higher cost is secrecy. After all, you can't (legally) kill a human after your project ends to ensure total secrecy. But that takes us into thriller novel territory.
reply
w4
18 hours ago
[-]
I don't think that's right. Free societies don't tolerate total mobilization by their governments outside of war time, no matter how valuable the outcomes might be in the long term, in part because of the very economic impacts you describe. Human-level AI - even if it's very expensive - puts something that looks a lot like total mobilization within reach without the societal pushback. This is especially true when it comes to tasks that society as a whole may not sufficiently value, but that a state actor might value very much, and when paired with something like a co-located reactor and data center that does not impact the grid.

That said, this is all predicated on o3 or similar actually having achieved human level reasoning. That's yet to be fully proven. We'll see!

reply
daemonologist
14 hours ago
[-]
This is interesting to consider, but I think the flaw here is that you'd need a "total mobilization" level workforce in order to build this mega datacenter in the first place. You put one human-hour into making B200s and cooling systems and power plants, you get less than one human-hour-equivalent of thinking back out.
reply
atleastoptimal
1 hour ago
[-]
How many 99.9th percentile mathematicians do nation states normally have access to?
reply
lurking_swe
17 hours ago
[-]
i disagree because the job market is not a true free market. I mean it mostly is, but there’s a LOT of politics and shady stuff that employers do to purposely drive wages down. Even in the tech sector.

Your secrecy comment is really intriguing actually. And morbid lol.

reply
phil917
21 hours ago
[-]
Direct quote from the ARC-AGI blog:

“SO IS IT AGI?

ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.

Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”

The high compute variant sounds like it costed around *$350,000* which is kinda wild. Lol the blog post specifically mentioned how OpenAPI asked ARC-AGI to not disclose the exact cost for the high compute version.

Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned” (this was not displayed in the live demo graph). This suggest in those cases that the model was trained to better handle these types of questions, so I do wonder about data / answer contamination in those cases…

reply
Bjorkbat
20 hours ago
[-]
> Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned”

Something I missed until I scrolled back to the top and reread the page was this

> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set

So yeah, the results were specifically from a version of o3 trained on the public training set

Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.

On the other hand though, I don't think the o1 models nor Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.

reply
phil917
19 hours ago
[-]
Lol I missed that even though it's literally the first sentence of the blog, good catch.

Yeah, that makes this result a lot less impressive for me.

reply
skepticATX
20 hours ago
[-]
Great catch. Super disappointing that AI companies continue to do things like this. It’s a great result either way but predictably the excitement is focused on the jump from o1, which is now in question.
reply
Bjorkbat
20 hours ago
[-]
To me it's very frustrating because such little caveats make benchmarks less reliable. Implicitly, benchmarks are no different from tests in that someone/something who scores high on a benchmark/test should be able to generalize that knowledge out into the real world.

While that is true with humans taking tests, it's not really true with AIs evaluating on benchmarks.

SWE-bench is a great example. Claude Sonnet can get something like a 50% on verified, whereas I think I might be able to score a 20-25%? So, Claude is a better programmer than me.

Except that isn't really true. Claude can still make a lot of clumsy mistakes. I wouldn't even say these are junior engineer mistakes. I've used it for creative programming tasks and have found one example where it tried to use a library written for d3js for a p5js programming example. The confusion is kind of understandable, but it's also a really dumb mistake.

Some very simple explanations, the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying Github issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.

Or maybe benchmarks are just bad at measuring intelligence in general.

Regardless, every time a model beats a benchmark I'm annoyed by the fact that I have no clue whatsoever how much this actually translates into real world performance. Did OpenAI/Anthropic/Google actually create something that will automate wide swathes of the software engineering profession? Or did they create the world's most knowledgeable junior engineer?

reply
throwaway0123_5
19 hours ago
[-]
> Some very simple explanations, the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying Github issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.

My understanding is that it works by checking if the proposed solution passes test-cases included in the original (human) PR. This seems to present some problems too, because there are surely ways to write code that passes the tests but would fail human review for one reason or another. It would be interesting to not only see the pass rate but also the rate at which the proposed solutions are preferred to the original ones (preferably evaluated by a human but even an LLM comparing the two solutions would be interesting).

reply
Bjorkbat
18 hours ago
[-]
If I recall correctly the authors of the benchmark did mention on Twitter that for certain issues models will submit an answer that technically passes the test but is kind of questionable, so yeah, good point.
reply
hartator
20 hours ago
[-]
> acid test

The css acid test? This can be gamed too.

reply
sundarurfriend
16 hours ago
[-]
https://en.wikipedia.org/wiki/Acid_test:

> An acid test is a qualitative chemical or metallurgical assay utilizing acid. Historically, it often involved the use of a robust acid to distinguish gold from base metals. Figuratively, the term represents any definitive test for attributes, such as gauging a person's character or evaluating a product's performance.

Specifically here, they're using the figurative sense of "definitive test".

reply
airstrike
11 hours ago
[-]
also a "litmus test" but I guess that's a different chemistry test...
reply
figure8
1 hour ago
[-]
I have a very naive question.

Why is the ARC challenge difficult but coding problems are easy? The two examples they give for ARC (border width and square filling) are much simpler than pattern awareness I see simple models find in code everyday.

What am I misunderstanding? Is it that one is a visual grid context which is unfamiliar?

reply
ItsMattyG
44 minutes ago
[-]
Francois'(the creator of ARC-AGI benchmark) whole point was that while they look the same, they're not. Coding is solving a familiar pattern in the same way (and fails when it' s NOT doing that, it just looks like it doesn't happen because it's seen SO MANY patterns in code). But the point of Arc AGI is to make each problem have to generalize in some new ay.
reply
miga89
9 hours ago
[-]
How do the organisers keep the private test set private? Does openAI hand them the model for testing?

If they use a model API, then surely OpenAI has access to the private test set questions and can include it in the next round of training?

(I am sure I am missing something.)

reply
owenpalmer
9 hours ago
[-]
I wouldn't be surprised if the term "benchmark fraud" will soon been coined.
reply
PhilippGille
6 hours ago
[-]
Benchmark fraud is not a novel concept. Outside of LLMs for example smartphone manufacturers detect benchmarks and disable or reduce CPU throttling: https://www.theregister.com/2019/09/30/samsung_benchmarking_...
reply
hmottestad
3 hours ago
[-]
CPU frequency ramp curve is also something that can be adjusted. You want the CPU to ramp up really quickly to make everything feel responsive, but at the same time you want to not have to use so much power from your battery.

If you detect that a benchmark is running then you can just ramp up to max frequency immediately. It’ll show how fast your CPU is, but won’t be representative of the actual performance that users will get from their device.

reply
7734128
9 hours ago
[-]
I suppose that's why they are calling it "semi-private".
reply
freehorse
9 hours ago
[-]
And why o3 or any OpenAI llm is not evaluated in the actual private dataset.
reply
PoignardAzur
5 hours ago
[-]
If we really want to imagine a cold-war-style solution, the two teams could meet in an empty warehouse, bring one computer with the model, one with the benchmarks, and connect them with a USB cable.

In practice I assume they just gave them the benchmarks and took it on the honor system they wouldn't cheat, yeah. They can always cook up a new test set for next time, it's only 10% of the benchmark content anyway and the results are pretty close.

reply
andrepd
4 hours ago
[-]
There's no honor system when there's billions of dollars at stake x) I'm highly highly skeptical of these benchmarks because of intentional cheating and accidental contamination.
reply
deneas
7 hours ago
[-]
They have two sets, a fully private one where the models run isolated and the semi-private one where they run models accessed over the internet.
reply
bjornsing
4 hours ago
[-]
Isn’t that why they call it “ Semi-Private”?

There’s a fully private test set too as I understand it, that o3 hasn’t run on yet.

reply
gritzko
6 hours ago
[-]
That is the top question, actually. Given all the billions at stake.
reply
ripped_britches
15 hours ago
[-]
Sad to see everyone so focused on compute expense during this massive breakthrough. GPT-2 originally cost $50k to train, but now can be trained for ~$150.

The key part is that scaling test-time compute will likely be a key to achieving AGI/ASI. Costs will definitely come down as is evidenced by precedents, Moore’s law, o3-mini being cheaper than o1 with improved performance, etc.

reply
stocknoob
13 hours ago
[-]
It’s wild, are people purposefully overlooking that inference costs are dropping 10-100x each year?

https://a16z.com/llmflation-llm-inference-cost/

Look at the log scale slope, especially the orange MMLU > 83 data points.

reply
menaerus
3 hours ago
[-]
Those are the (subsidized) prices that end clients are paying for the service so that's not something that is representative of what the actual inference costs are. Somebody still needs to pay that (actual) price in the end. For inference, as well as for training, you need actual (NVidia) hardware and that hardware didn't become any cheaper. OTOH models are only becoming increasingly more complex and bigger and with more and more demand I don't see those costs exactly dropping down.
reply
atleastoptimal
1 hour ago
[-]
Actual inference costs without considering subsidies and loss leaders are going down, due to algorithmic improvements, hardware improvements, and quantized/smaller models getting the same performance as larger ones. Companies are making huge breakthroughs making chips specifically for LLM inference
reply
croes
6 hours ago
[-]
A bit early for a every year claim not to mention what all these AI is used for.

In some parts of the internet it’s you hardly find real content only AI spam.

It will get worse the cheaper it gets.

Think of email spam.

reply
yawnxyz
15 hours ago
[-]
I think the question everyone has in their minds isn't "when will AGI get here" or even "how soon will it get here" — it's "how soon will AGI get so cheap that everyone will get their hands on it"

that's why everyone's thinking about compute expense. but I guess in terms of a "lifetime expense of a person" even someone who costs $10/hr isn't actually all that cheap, considering what it takes to grow a human into a fully functioning person that's able to just do stuff

reply
croes
6 hours ago
[-]
We are nowhere near AGI.
reply
hamburga
13 hours ago
[-]
I’m not sure if people realize what a weird test this is. They’re these simple visual puzzles that people can usually solve at a glance, but for the LLMs, they’re converted into a json format, and then the LLMs have to reconstruct the 2D visual scene from the json and pick up the patterns.

If humans were given the json as input rather than the images, they’d have a hard time, too.

reply
Jensson
10 hours ago
[-]
> If humans were given the json as input rather than the images, they’d have a hard time, too.

We shine light in text patterns at humans rather than inject the text directly into the brain as well, that is extremely unfair! Imagine how much better humans would be at text processing if we injected and extracted information from their brains using the neurons instead of eyes and hands.

reply
torginus
8 hours ago
[-]
Not sure how much that matters - I'm not an AI expert, but I did some intro courses where we had to train a classifier to recognize digits. How it worked basically was that we fed each pixel of the 2d grid of the image into an input of the network, essentially flattening it in a similar fashion. It worked just fine, and that was a tiny network.
reply
thegeomaster
1 hour ago
[-]
The classifier was likely a convolutional network, so the assumption of the image being a 2D grid was baked into the architecture itself - it didn't have to be represented via the shape of the input for the network to use it.
reply
torginus
21 minutes ago
[-]
I don't think so - convolutional neural networks also operate over 1D flat vectors - the spatial relationship of pixels is only learned from the training data.
reply
causal
11 hours ago
[-]
I think that's part of what feels odd about this- in some ways it feels like the wrong type of test for an LLM, but in many ways it makes this achievement that much more remarkable
reply
deneas
7 hours ago
[-]
The JSON files still contain images, just not in a regular image format. You have a 2D array of numbers where each number maps to a color. If you really want a regular picture format, you can easily convert the arrays.
reply
ImaCake
12 hours ago
[-]
Yeah, this entire thread seems utterly detached from my lived experience. LLMs are immensely useful for me at work but they certainly don't come close to the hype spouted by many commenters here. It would be great if it could handle more of our quite modest codebase but it's not able to yet
reply
m_ke
11 hours ago
[-]
ARC is a silly benchmark, the other results in math and coding are much more impressive.

o3 is just o1 scaled up, the main takeaway from this line of work that people should walk away with is that we now have a proven way to RL our way to super human performance on tasks where it’s cheap to sample and easy to verify the final output. Programming falls in that category, they focused on known benchmarks but the same process can be done for normal programs, using parsers, compilers, existing functions and unit tests as verifiers.

Pre o1 we only really had next token prediction, which required high quality human produced data, with o1 you optimize for success instead of MLE of next token. Explained in simpler terms, it means it can get reward for any implementation of a function that reproduces the expected result, instead of the exact implementation in the training set.

Put another way, it’s just like RLHF but instead of optimizing against learned human preferences, the model is trained to satisfy a verifier.

This should work just as well in VLA models for robotics, self driving and computer agents.

reply
Imnimo
22 hours ago
[-]
Whenever a benchmark that was thought to be extremely difficult is (nearly) solved, it's a mix of two causes. One is that progress on AI capabilities was faster than we expected, and the other is that there was an approach that made the task easier than we expected. I feel like the there's a lot of the former here, but the compute cost per task (thousands of dollars to solve one little color grid puzzle??) suggests to me that there's some amount of the latter. Chollet also mentions ARC-AGI-2 might be more resistant to this approach.

Of course, o3 looks strong on other benchmarks as well, and sometimes "spend a huge amount of compute for one problem" is a great feature to have available if it gets you the answer you needed. So even if there's some amount of "ARC-AGI wasn't quite as robust as we thought", o3 is clearly a very powerful model.

reply
solidasparagus
14 hours ago
[-]
Or the test wasn't testing anything meaningful, which IMO is what happened here. I think ARC was basically looking at the distribution of what AI is capable of, picked an area that it was bad at and no one had cared enough to go solve, and put together a benchmark. And then we got good at it because someone cared and we had a measurement. Which is essentially the goal of ARC.

But I don't much agree that it is any meaningful step towards AGI. Maybe it's a nice proofpoint that that AI can solve simple problems presented in intentionally opaque ways.

reply
atleastoptimal
1 hour ago
[-]
Id agree with you if there hasn’t been very deliberate work towards solving ARC for years, and if thr conceit of the benchmark wasn’t specifically based on a conception of human intuition being, put simply, learning and applying out of distribution rules on the fly. ARC wasn’t some arbitrary inverse set, it was designed to benchmark a fundamental capability of general intelligence
reply
exe34
22 hours ago
[-]
> the other is that there was an approach that made the task easier than we expected.

from reading Dennett's philosophy, I'm convinced that that's how human intelligence works - for each task that "only a human could do that", there's a trick that makes it easier than it seems. We are bags of tricks.

reply
Jensson
17 hours ago
[-]
> We are bags of tricks.

We are trick generators, that is what it means to be a general intelligence. Adding another trick in the bag doesn't make you a general intelligence, being able to discover and add new tricks yourself makes you a general intelligence.

reply
falcor84
17 hours ago
[-]
Not the parent, but remembering my reading of Dennett, he was referring to the tricks that we got through evolution, rather than ones we invented ourselves. As particular examples, we have neural functional areas for capabilities like facial recognition and spatial reasoning which seems to rely on dedicated "wetware" somewhat distinct from other parts of the brain.
reply
Jensson
16 hours ago
[-]
But humans being able to develop new tricks is core to their intelligence, saying its just a bag of tricks means you don't understand what AGI is. So either the poster misunderstood Dennett or Dennett weren't talking about AGI or Dennett didn't understand this well.

Of course there are many tricks you will need special training for, like many of the skills human share with animals, but the ability to construct useful shareable large knowledge bases based on observations is unique to humans and isn't just a "trick".

reply
exe34
6 hours ago
[-]
Dennett was talking about natural intelligence. I think you're just underestimating the potential of a sufficiently big bag of tricks.

sharing knowledge isn't a human thing - chimps learn from each other. bees teach each other the direction and distance to a new source of food.

we just happen to push the envelope a lot further and managed to kickstart runaway mimetic evolution.

reply
falcor84
3 hours ago
[-]
"mimetic" is apt there, but I think that Dennett, as a friend of Dawkins, would say it's "memetic"
reply
exe34
3 hours ago
[-]
nice catch!
reply
exe34
6 hours ago
[-]
generating tricks is itself a trick that relies on an enormous bag of tricks we inherited through evolution by the process of natural selection.

the new tricks don't just pop into our heads even though it seems that way. nobody ever woke up and devised a new trick in a completely new field without spending years learning about that field or something adjacent to it. even the new ideas tend to be an old idea from a different field applied to a new field. tricks stand on the shoulders of giants.

reply
vicentwu
10 hours ago
[-]
"Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."

Really want to see the number of training pairs needed to achieve this socre. If it only takes a few pairs, say 100 pairs, I would say it is amazing!

reply
nmca
3 hours ago
[-]
75% of 400 is 300 :)
reply
WXLCKNO
1 hour ago
[-]
Wow are you AGI?
reply
highfrequency
20 hours ago
[-]
Very cool. I recommend scrolling down to look at the example problem that O3 still can’t solve. It’s clear what goes on in the human brain to solve this problem: we look at one example, hypothesize a simple rule that explains it, and then check that hypothesis against the other examples. It doesn’t quite work, so we zoom into an example that we got wrong and refine the hypothesis so that it solves that sample. We keep iterating in this fashion until we have the simplest hypothesis that satisfies all the examples. In other words, how humans do science - iteratively formulating, rejecting and refining hypotheses against collected data.

From this it makes sense why the original models did poorly and why iterative chain of thought is required - the challenge is designed to be inherently iterative such that a zero shot model, no matter how big, is extremely unlikely to get it right on the first try. Of course, it also requires a broad set of human-like priors about what hypotheses are “simple”, based on things like object permanence, directionality and cardinality. But as the author says, these basic world models were already encoded in the GPT 3/4 line by simply training a gigantic model on a gigantic dataset. What was missing was iterative hypothesis generation and testing against contradictory examples. My guess is that O3 does something like this:

1. Prompt the model to produce a simple rule to explain the nth example (randomly chosen)

2. Choose a different example, ask the model to check whether the hypothesis explains this case as well. If yes, keep going. If no, ask the model to revise the hypothesis in the simplest possible way that also explains this example.

3. Keep iterating over examples like this until the hypothesis explains all cases. Occasionally, new revisions will invalidate already solved examples. That’s fine, just keep iterating.

4. Induce randomness in the process (through next-word sampling noise, example ordering, etc) to run this process a large number of times, resulting in say 1,000 hypotheses which all explain all examples. Due to path dependency, anchoring and consistency effects, some of these paths will end in awful hypotheses - super convoluted and involving a large number of arbitrary rules. But some will be simple.

5. Ask the model to select among the valid hypotheses (meaning those that satisfy all examples) and choose the one that it views as the simplest for a human to discover.

reply
hmottestad
19 hours ago
[-]
I took a look at those examples that o3 can't solve. Looks similar to an IQ-test.

Took me less time to figure out the 3 examples that it took to read your post.

I was honestly a bit surprised to see how visual the tasks were. I had thought they were text based. So now I'm quite impressed that o3 can solve this type of task at all.

reply
neom
19 hours ago
[-]
I also took some time to look at the ones it couldn't solve. I stopped after this one: https://kts.github.io/arc-viewer/page6/#47996f11
reply
hmottestad
10 hours ago
[-]
That one's cool. All pink pixels need to be repaired so they match the symmetry in the picture.
reply
highfrequency
19 hours ago
[-]
You must be a stem grad! Or perhaps an ensemble of Kaggle submissions?
reply
aithrowawaycomm
21 hours ago
[-]
I would like to see this repeated with my highly innovative HARC-HAGI, which is ARC-AGI but it uses hexagons instead of squares. I suspect humans would only make slightly more brain farts on HARC-HAGI than ARC-AGI, but O3 would fail very badly since it almost certainly has been specifically trained on squares.

I am not really trying to downplay O3. But this would be a simple test as to whether O3 is truly "a system capable of adapting to tasks it has never encountered before" versus novel ARC-AGI tasks it hasn't encountered before.

reply
falcor84
16 hours ago
[-]
Here's my take - even if the o3 as currently implemented is utterly useless on your HARC-HAGI, it is obvious that o3 coupled with its existing training pipeline trained briefly on the hexagons would excel on it, such that passing your benchmark doesn't require any new technology.

Taking this a level of abstraction higher, I expect that in the next couple of years we'll see systems like o3 given a runtime budget that they can use for training/fine-tuning smaller models in an ad-hoc manner.

reply
zebomon
22 hours ago
[-]
My initial impression: it's very impressive and very exciting.

My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.

I know my skepticism here is identical to moving goalposts. More and more I am shifting my personal understanding of general intelligence as a phenomenon we will only ever be able to identify with the benefit of substantial retrospect.

As it is with any sufficiently complex program, if you could discern the result beforehand, you wouldn't have had to execute the program in the first place.

I'm not trying to be a downer on the 12th day of Christmas. Perhaps because my first instinct is childlike excitement, I'm trying to temper it with a little reason.

reply
hansonkd
22 hours ago
[-]
It doesn't need to be general intelligence or perfectly map to human intelligence.

All it needs to be is useful. Reading constant comments about LLMs can't be general intelligence or lack reasoning etc, to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings (a large portion of the population held that point of view back then).

It doesn't need to be general intelligence for the rapid advancement of LLM capabilities to be the most societal shifting development in the past decades.

reply
wruza
21 hours ago
[-]
And look at the airplanes, they really can’t just land on a mountain slope or a tree without heavy maintenance afterwards. Those people weren’t all stupid, they questioned the promise of flying servicemen delivering mail or milk to their window and flying on a personal aircar to their workplace. Just like todays promises about whatever the CEOs telltales are. Imagining bullshit isn’t unique to this century.

Aerospace is still a highly regulated area that requires training and responsibility. If parallels can be drawn here, they don’t look so cool for a regular guy.

reply
Workaccount2
21 hours ago
[-]
What people always leave out is that society will bend to the abilities of the new technology. Planes can't land in your backyard so we built airports. We didn't abandon planes.
reply
wruza
21 hours ago
[-]
Yes but the idea was lost in the process. It became a faster transportation system that uses air as a medium, but that’s it. Personal planes are still either big business or an expensive and dangerous personal toy thing. I don’t think it’s the same for LLMs (would be naive). But where are promises like “we’re gonna change travel economics etc”? All headlines scream is “AGI around the corner”. Yeah, now where’s my damn postman flying? I need my mail.
reply
ben_w
19 hours ago
[-]
> It became a faster transportation system that uses air as a medium, but that’s it.

On the one hand, yes; on the other, this understates the impact that had.

My uncle moved from the UK to Australia because, I'm told*, he didn't like his mum and travel was so expensive that he assumed they'd never meet again. My first trip abroad… I'm not 100% sure how old I was, but it must have been between age 6 and 10, was my gran (his mum) paying for herself, for both my parents, and for me, to fly to Singapore, then on to various locations in Australia including my uncle, and back via Thailand, on her pension.

That was a gap of around one and a half generations.

* both of them are long-since dead now so I can't ask

reply
PaulDavisThe1st
21 hours ago
[-]
Sure, but that also vindicates the GP's point that the initial claims of the boosters for planes contained more than their fair share of bullshit and lies.
reply
tivert
19 hours ago
[-]
> What people always leave out is that society will bend to the abilities of the new technology.

Do they really? I don't think they do.

> Planes can't land in your backyard so we built airports. We didn't abandon planes.

But then what do you do with the all the fantasies and hype about the new technology (like planes that land in your backyard and you fly them to work)?

And it's quite possible and fairly common that the new technology actually ends up being mostly hype, and there's actually no "airports" use case in the wings. I mean, how much did society "bend to the abilities of" NFTs?

And then what if the mature "airports" use case is actually something most people do not want?

reply
ForHackernews
20 hours ago
[-]
This is already happening. A few days ago Microsoft turned down a documentation PR because the formatting was better for humans but worse for LLMs: https://github.com/MicrosoftDocs/WSL/pull/2021#issuecomment-...

They changed their mind after a public outcry including here on HN.

reply
moffkalast
18 hours ago
[-]
No, we built helicopters.
reply
oblio
20 hours ago
[-]
We are slowly discovering that many of our wonderful inventions from 60-80-100 years ago have serious side effects.

Plastics, cars, planes, etc.

One could say that a balanced situation, where vested interests are put back in the box (close to impossible since it would mean fighting trillions of dollars), would mean that for example all 3 in the list above are used a lot less than we use them now, for example. And only used where truly appropriate.

reply
skydhash
21 hours ago
[-]
This pretty much. Everyone knows that LLMs are great for text generation and processing. What people has been questioning is the end goals as promised by its builders, i.e. is it useful? And from most of what I saw, it's very much a toy.
reply
MVissers
13 hours ago
[-]
What would you need to see to call it useful?

To give you an example– I've used it for legal work such as an EB2-NIW visa application. Saved me countless of hours. My next visa I'll try to do without a lawyer using just LLMs. I would never try this without having LLMs at my disposal.

As a hobby– And as someone with a scientific background I've been able to build an artificial ecosystem simulation from scratch without programming experience in Rust: https://www.youtube.com/@GenecraftSimulator

I recently moved from fish to plants and believe I've developed some new science at the intersection of CS and Evolutionary Biology that I'm looking to publish.

This tool is extremely useful. For now– You do require a human in the loop for coordination.

My guess is that these will be benchmarks that we see within a few years: How good an AI coordinate multiple other AIs to build, deploy and iterate something that functions in the real world. Basically manager AI.

Because they'll literally be able to solve every single one shot problem so we won't be able to create benchmarks anymore.

But that's also when these models will be able to build functioning companies in a few hours.

reply
skydhash
12 hours ago
[-]
> ...me countless of...would never try this without having LLMs...is extremely useful...they'll literally be able to solve...will be able to... in a few hours.

That's marketing language, not scientific or even casual language. So much outstanding claims, without even some basic explanations. Like how did it help you save these hours? Terms explanations? Outlining processes? Going to the post office for you? You don't need to sell me anything, I just want the how.

reply
throwaway4aday
20 hours ago
[-]
Your point is on the verge of nullification with the rapid improvement and adoption of autonomous drones don't you think?
reply
wruza
17 hours ago
[-]
Sort of, but doesn’t that sit on a far-fetch horizon? I doubt that drone companies are all the same who sold aircraft retrofuturism to people back then.
reply
surgical_fire
22 hours ago
[-]
> to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings

To me it is more like there is someone jumping on a pogo ball while flapping their arms and saying that they are flying whenever they hop off the ground.

Skeptics say that they are not really flying, while adherents say that "with current pogo ball advancements, they will be flying any day now"

reply
PaulDavisThe1st
21 hours ago
[-]
An old quote, quite famous: "... is like saying that an ape who climbs to the top of a tree for the first time is one step closer to landing on the moon".
reply
intelVISA
21 hours ago
[-]
Between skeptics and adherents who is more easily able to extract VC money for vaporware? If you limit yourself to 'the facts' you're leaving tons of $$ on the table...
reply
surgical_fire
21 hours ago
[-]
By all means, if this is the goal, AI is a success.

I understand that in this forum too many people are invested in putting lipstick on this particular pig.

reply
DonHopkins
20 hours ago
[-]
Is that what Elon Musk was trying to do on stage?
reply
zebomon
22 hours ago
[-]
I agree. If the LLMs we have today never got any smarter, the world would still be transformed over the next ten years.
reply
handsclean
22 hours ago
[-]
People aren’t responding to their own assumption that AGI is necessary, they’re responding to OpenAI and the chorus constantly and loudly singing hymns to AGI.
reply
billyp-rva
22 hours ago
[-]
> It doesn't need to be general intelligence or perfectly map to human intelligence.

> All it needs to be is useful.

Computers were already useful.

The only definition we have for "intelligence" is human (or, generally, animal) intelligence. If LLMs aren't that, let's call it something else.

reply
throwup238
21 hours ago
[-]
What exactly is human (or animal) intelligence? How do you define that?
reply
billyp-rva
21 hours ago
[-]
Does it matter? If LLMs aren't that, whatever it is, then we should use a different word. Finders keepers.
reply
throwup238
21 hours ago
[-]
How do you know that LLMs “aren’t that” if you can’t even define what that is?

“I’ll know it when I see it” isn’t a compelling argument.

reply
Aperocky
19 hours ago
[-]
I think a successful high level intelligence should quickly accelerate or converge to infinity/physical resource exhaustion because they can now work on improving themselves.

So if above human intelligence does happen, I'd assume we'd know it, quite soon.

reply
jonny_eh
19 hours ago
[-]
> “I’ll know it when I see it” isn’t a compelling argument.

It feels compelling to me.

reply
grahamj
21 hours ago
[-]
they can't do what we do therefore they aren't what we are
reply
layer8
20 hours ago
[-]
And what is that, in concrete terms? Many humans can’t do what other humans can do. What is the common subset that counts as human intelligence?
reply
dimitri-vs
12 hours ago
[-]
Process vision and sounds in parallel for 80+ years, rapidly adapt to changing environments and scenarios, correlate seemingly irrelevant details that happened a week ago or years ago, be able to selectively ignore instructions and know when to disagree
reply
skywhopper
20 hours ago
[-]
On the contrary, the pushback is critical because many employers are buying the hype from AI companies that AGI is imminent, that LLMs can replace professional humans, and that computers are about to eliminate all work (except VCs and CEOs apparently).

Every person that believes that LLMs are near sentient or actually do a good job at reasoning is one more person handing over their responsibilities to a zero-accountability highly flawed robot. We've already seen LLMs generate bad legal documents, bad academic papers, and extremely bad code. Similar technology is making bad decisions about who to arrest, who to give loans to, who to hire, who to bomb, and who to refuse heart surgery for. Overconfident humans employing this tech for these purposes have been bamboozled by the lies from OpenAI, Microsoft, Google, et al. It's crucial to call out overstatement and overhype about this tech wherever it crops up.

reply
AyyEye
22 hours ago
[-]
> Reading constant comments about LLMs can't be general intelligence or lack reasoning etc, to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings (a large portion of the population held that point of view back then).

That is a natural reaction to the incessant techbro, AIbro, marketing, and corporate lies that "AI" (or worse AGI) is a real thing, and can be directly compared to real humans.

There are people on this very thread saying it's better at reasoning than real humans (LOL) because it scored higher on some benchmark than humans... Yet this technology still can't reliably determine what number is circled, if two lines intersect, or count the letters in a word. (That said behaviour may have been somewhat finetuned out of newer models only reinforces the fact that the technology inherently not capable of understanding anything.)

reply
IanCal
21 hours ago
[-]
I encounter "spicy auto complete" style comments far more often than techbro AI-everything comments and its frankly getting boring.

I've been doing AI things for about 20+ years and llms are wild. We've gone from specialized things being pretty bad as those jobs to general purpose things better at that and everything else. The idea you could make and API call with "is this sarcasm?" and get a better than chance guess is incredible.

reply
surgical_fire
20 hours ago
[-]
Eh, I see far more "AI is the second coming of Jesus" type of comments than healthy skepticism. A lot of anxiety from people afraid that their source of income will dry and a lot of excitement of people with an axe to grind that "those entitled expensive peasants will get what they deserve".

I think I count myself among the skeptics nowadays for that reason. And I say this as someone that thinks LLM is an interesting piece of technology, but with somewhat limited use and unclear economics.

If the hype was about "look at this thing that can parse natural language surprisingly well and generate coherent responses", I would be excited too. As someone that had to do natural language processing in the past, that is a damn hard task to solve, and LLMs excel at it.

But that is not the hype is it? We have people beating the drums of how this is just shy of taking the world by storm, and AGI is just around the corner, and it will revolutionize all economy and society and nothing will ever be the same.

So, yeah, it gets tiresome. I wish the hype would die down a little so this could be appreciated for what it is.

reply
williamcotton
19 hours ago
[-]
We have people beating the drums of how this is just shy of taking the world by storm, and AGI is just around the corner, and it will revolutionize all economy and society and nothing will ever be the same.

Where are you seeing this? I pretty much only read HN and football blogs so maybe I’m out of the loop.

reply
sensanaty
18 hours ago
[-]
In this very thread there are multiple people espousing their views that the high score here is proof that o3 has achieved AGI.
reply
AyyEye
20 hours ago
[-]
Nobody is disputing the coolness factor, only the intelligence factor.
reply
hansonkd
18 hours ago
[-]
I'm saying the intelligence factor doesn't matter. Only the utility factor. Today LLMs are incredibly useful and every few months there appears to be bigger and bigger leaps.

Analyzing whether or not LLMs have intelligence is missing the forest from the trees. This technology is emerging in a capitalist society that is hyper optimized to adopt useful things at the expense of almost everything else. If the utility/price point gets hit for a problem, it will replace it regardless of if it is intelligent or not.

reply
Jensson
16 hours ago
[-]
But if you want to predict the future utility of these models you want to look at their current intelligence, compare that to humans and try to figure out roughly what skills they lack and which of those are likely to get fixed.

For example, a team of humans are extremely reliable, much more reliable than one human, but a team of AI's isn't mean reliable than one AI since an AI is already an ensemble model. That means even if an AI could replace a person, it probably can't replace a team for a long time, meaning you still need the other team members there, meaning the AI didn't really replace a human it just became a tool for huamns to use.

reply
MVissers
13 hours ago
[-]
I think this is a fair criticism of capability.

I personally wouldn't be surprised if we start to see benchmarks around this type of cooperation and ability to orchestrate complex systems in the next few years or so.

Most benchmarks really focus on one problem, not on multiple real-time problems while orchestrating 3rd party actors who might or might not be able to succeed at certain tasks.

But I don't think anything is prohibiting these models from not being able to do that.

reply
alexalx666
21 hours ago
[-]
If I could put it into Tesla style robot and it could do dishes and help me figure out tech stuff, it would be more than enough.
reply
jasondigitized
19 hours ago
[-]
This a thousand times.
reply
colordrops
19 hours ago
[-]
I don't think many informed people doubt the utility of LLMs at this point. The potential of human-like AGI has profound implications far beyond utility models, which is why people are so eager to bring it up. A true human-like AGI basically means that most intellectual/white collar work will not be needed, and probably manual labor before too long as well. Huge huge implications for humanity, e.g. how does an economy and society even work without workers?
reply
vouaobrasil
18 hours ago
[-]
> Huge huge implications for humanity, e.g. how does an economy and society even work without workers?

I don't think those that create AI care about that. They just to come out on top before someone else does.

reply
sigmoid10
22 hours ago
[-]
These comments are getting ridiculous. I remember when this test was first discussed here on HN and everyone agreed that it clearly proves current AI models are not "intelligent" (whatever that means). And people tried to talk me down when I theorised this test will get nuked soon - like all the ones before. It's time people woke up and realised that the old age of AI is over. This new kind is here to stay and it will take over the world. And you better guess it'll be sooner rather than later and start to prepare.
reply
ignoramous
21 hours ago
[-]
> These comments are getting ridiculous.

Not really. Francois (co-creator of the ARC Prize) has this to say:

  The v1 version of the benchmark is starting to saturate. There were already signs of this in the Kaggle competition this year: an ensemble of all submissions would score 81%

  Early indications are that ARC-AGI-v2 will represent a complete reset of the state-of-the-art, and it will remain extremely difficult for o3. Meanwhile, a smart human or a small panel of average humans would still be able to score >95% ... This shows that it's still feasible to create unsaturated, interesting benchmarks that are easy for humans, yet impossible for AI, without involving specialist knowledge. We will have AGI when creating such evals becomes outright impossible.

  For me, the main open question is where the scaling bottlenecks for the techniques behind o3 are going to be. If human-annotated CoT data is a major bottleneck, for instance, capabilities would start to plateau quickly like they did for LLMs (until the next architecture). If the only bottleneck is test-time search, we will see continued scaling in the future.
https://x.com/fchollet/status/1870169764762710376 / https://ghostarchive.org/archive/Sqjbf
reply
ben_w
21 hours ago
[-]
> It's time people woke up and realised that the old age of AI is over. This new kind is here to stay and it will take over the world. And you better guess it'll be sooner rather than later and start to prepare.

I was just thinking about how 3D game engines were perceived in the 90s. Every six months some new engine came out, blew people's minds, was declared photorealistic, and was forgotten a year later. The best of those engines kept improving and are still here, and kinda did change the world in their own way.

Software development seemed rapid and exciting until about Halo or Half Life 2, then it was shallow but shiny press releases for 15 years, and only became so again when OpenAI's InstructGPT was demonstrated.

While I'm really impressed with current AI, and value the best models greatly, and agree that they will change (and have already changed) the world… I can't help but think of the Next Generation front cover, February 1997 when considering how much further we may be from what we want: https://www.giantbomb.com/pc/3045-94/forums/unreal-yes-this-...

reply
TeMPOraL
18 hours ago
[-]
> Software development seemed rapid and exciting until about Halo or Half Life 2, then it was shallow but shiny press releases for 15 years

The transition seems to map well to the point where engines got sophisticated enough, that highly dedicated high-schoolers couldn't keep up. Until then, people would routinely make hobby game engines (for games they'd then never finish) that were MVPs of what the game industry had a year or three earlier. I.e. close enough to compete on visuals with top photorealistic games of a given year - but more importantly, this was a time where you could do cool nerdy shit to impress your friends and community.

Then Unreal and Unity came out, with a business model that killed the motivation to write your own engine from scratch (except for purely educational purposes), we got more games, more progress, but the excitement was gone.

Maybe it's just a spurious correlation, but it seems to track with:

> and only became so again when OpenAI's InstructGPT was demonstrated.

Which is again, if you exclude training SOTA models - which is still mostly out of reach for anyone but a few entities on the planet - the time where anyone can do something cool that doesn't have a better market alternative yet, and any dedicated high-schooler can make truly impressive and useful work, outpacing commercial and academic work based on pure motivation and focus alone (it's easier when you're not being distracted by bullshit incentives like user growth or making VCs happy or churning out publications, farming citations).

It's, once again, a time of dreams, where anyone with some technical interest and a bit of free time can make the future happen in front of their eyes.

reply
hansonkd
19 hours ago
[-]
> how much further we may be from what we wan

The timescale you are describing for 3D graphics is 4 years from the 1997 cover you posted to the release of Halo which you are saying plateaued excitement because it got advanced enough.

An almost infinitesimally small amount of time in terms of history human development and you are mocking the magazine being excited for the advancement because it was... 4 years yearly?

reply
ben_w
19 hours ago
[-]
No, the timescale is "the 90s", the the specific example is from 1997, and chosen because of how badly it aged. Nobody looks at the original single-player Unreal graphics today and thinks "this is amazing!", but we all did at the time — Reflections! Dynamic lighting! It was amazing for the era — but it was also a long way from photorealism. ChatGPT is amazing… but how far is it from Brent Spiner's Data?

The era was people getting wowed from Wolfenstein (1992) to "about Halo or Half Life 2" (2001 or 2004).

And I'm not saying the flattening of excitement was for any specific reason, just that this was roughly when it stopped getting exciting — it might have been because the engines were good enough for 3D art styles beyond "as realistic as we can make it", but for all I know it was the War On Terror which changed the tone of press releases and how much the news in general cared. Or perhaps it was a culture shift which came with more people getting online and less media being printed on glossy paper and sold in newsagents.

Whatever the cause, it happened around that time.

reply
TeMPOraL
18 hours ago
[-]
I'm still holding on to my hypothesis in that the excitement was sustained in large part because this progress was something a regular person could partake in. Most didn't, but they likely known some kid who was. And some of those kids run the gaming magazines.

This was a time where, for 3D graphics, barriers to entry got low (math got figured out, hardware was good enough, knowledge spread), but the commercial market didn't yet capture everything. Hell, a bulk of those excited kids I remember, trying to do a better Unreal Tournament after school instead of homework (and almost succeeding!), they went on create and staff the next generation of commercial gamedev.

(Which is maybe why this period lasted for about as long as it takes for a schoolkid to grow up, graduate, and spend few years in the workforce doing the stuff they were so excited about.)

reply
ben_w
17 hours ago
[-]
Could be.

I was one of those kids, my focus was Marathon 2 even before I saw Unreal. I managed to figure out enough maths from scratch to end up with the basics of ray casting, but not enough at the time to realise the tricks needed to make that real time on a 75 MHz CPU… and then we all got OpenGL and I went through university where they explained the algorithms.

reply
torginus
21 hours ago
[-]
The weird thing about the phenomenon you mention is only after the field of software engineering has plateaued 15 years ago, as you mentioned, that this insane demand for engineers did arise, with corresponding insane salaries.

It's a very strange thing I've never understood.

reply
dwaltrip
19 hours ago
[-]
My guess: It’s a very lengthy, complex, and error-prone process to “digitize” human civilization (government, commerce, leisure, military, etc). The tech existed, we just didn’t know how to use it.

We still barely know how to use computers effectively, and they have already transformed the world. For better or worse.

reply
jcims
21 hours ago
[-]
I agree, it's like watching a meadow ablaze and dismissing it because it's not a 'real forest fire' yet. No it's not 'real AGI' yet, but *this is how we get there* and the pace is relentless, incredible and wholly overwhelming.

I've been blessed with grandchildren recently, a little boy that's 2 1/2 and just this past Saturday a granddaughter. Major events notwithstanding, the world will largely resemble today when they are teenagers, but the future is going to look very very very different. I can't even imagine what the capability and pervasiveness of it all will be like in ten years, when they are still just kids. For me as someone that's invested in their future I'm interested in all of the educational opportunities (technical, philosphical and self-awareness) but obviously am concerned about the potential for pernicious side effects.

reply
lawlessone
22 hours ago
[-]
Failing the test may prove the AI is not intelligent. Passing the test doesn't necessarily prove it is.
reply
NitpickLawyer
21 hours ago
[-]
Your comment reminds me of this quote from a book published in the 80s:

> There is a related “Theorem” about progress in AI: once some mental function is programmed, people soon cease to consider it as an essential ingredient of “real thinking”. The ineluctable core of intelligence is always in that next thing which hasn’t yet been programmed. This “Theorem” was first proposed to me by Larry Tesler, so I call it Tesler’s Theorem: “AI is whatever hasn’t been done yet.”

reply
6gvONxR4sf7o
21 hours ago
[-]
I've always disliked this argument. A person can do something well without devising a general solution to the thing. Devising a general solution to the thing is a step we're talking all the time with all sorts of things, but it doesn't invalidate the cool fact about intelligence: whatever it is that lets us do the thing well without the general solution is hard to pin down and hard to reproduce.

All that's invalidated each time is the idea that a general solution to that task requires a general solution to all tasks, or that a general solution to that task requires our special sauce. It's the idea that something able to to that task will also be able to do XYZ.

And yet people keep coming up with a new task that people point to saying, 'this is the one! there's no way something could solve this one without also being able to do XYZ!'

reply
8note
20 hours ago
[-]
id consider that it doing the test at all, without proper compensation is a sign that it isnt intelligent
reply
esafak
2 hours ago
[-]
Motivation is not hard to instill. Fortunately, they have chosen not to do so.
reply
philipkglass
21 hours ago
[-]
If AI takes over white collar work that's still half of the world's labor needs untouched. There are some promising early demos of robotics plus AI. I also saw some promising demos of robotics 10 and 20 years that didn't reach mass adoption. I'd like to believe that by the time I reach old age the robots will be fully qualified replacements for plumbers and home health aides. Nothing I've seen so far makes me think that's especially likely.

I'd love more progress on tasks in the physical world, though. There are only a few paths for countries to deal with a growing ratio of old retired people to young workers:

1) Prioritize the young people at the expense of the old by e.g. cutting old age benefits (not especially likely since older voters have greater numbers and higher participation rates in elections)

2) Prioritize the old people at the expense of the young by raising the demands placed on young people (either directly as labor, e.g. nurses and aides, or indirectly through higher taxation)

3) Rapidly increase the population of young people through high fertility or immigration (the historically favored path, but eventually turns back into case 1 or 2 with an even larger numerical burden of older people)

4) Increase the health span of older people, so that they are more capable of independent self-care (a good idea, but difficult to achieve at scale, since most effective approaches require behavioral changes)

5) Decouple goods and services from labor, so that old people with diminished capabilities can get everything they need without forcing young people to labor for them

reply
reducesuffering
20 hours ago
[-]
> If AI takes over white collar work that's still half of the world's labor needs untouched.

I am continually baffled that people here throw this argument out and can't imagine the second-order effects. If white collar work is automated by AGI, all the RnD to solve robotics beyond imagination will happen in a flash. The top AI labs, the people smartest enough to make this technology, all are focusing on automating AGI Researchers and from there follows everything, obviously.

reply
brotchie
19 hours ago
[-]
+1, the second and third order effects aren't trivial.

We're already seeing escape velocity in world modeling (see Google Veo2 and the latest Genesis LLM-based physics modeling framework).

The hardware for humanoid robots is 95% of the way there, the gap is control logic and intelligence, which is rapidly being closed.

Combine Veo2 world model, Genesis control planning, o3-style reasoning, and you're pretty much there with blue collar work automation.

We're only a few turns (<12 months) away from an existence proof of a humanoid robot that can watch a Youtube video and then replicate the task in a novel environment. May take longer than that to productionize.

It's really hard to think and project forward on an exponential. We've been on an exponential technology curve since the discovery of fire (at least). The 2nd order has kicked up over the last few years.

Not a rational approach to look back at robotics 2000-2022 and project that pace forwards. There's more happening every month than in decades past.

reply
philipkglass
19 hours ago
[-]
I hope that you're both right. In 2004-2007 I saw self driving vehicles make lightning progress from the weak showing of the 2004 DARPA Grand Challenge to the impressive 2005 Grand Challenge winners and the even more impressive performance in the 2007 Urban Challenge. At the time I thought that full self driving vehicles would have a major commercial impact within 5 years. I expected truck and taxi drivers to be obsolete jobs in 10 years. 17 years after the Urban Challenge there are still millions of truck driver jobs in America and only Waymo seems to have a credible alternative to taxi drivers (even then, only in a small number of cities).
reply
QuantumGood
21 hours ago
[-]
"it will take over the world"

Calibrating to the current hype cycle has been challenging with AI pronouncements.

reply
Workaccount2
21 hours ago
[-]
You are telling a bunch of high earning individuals ($150k+) that they may be dramatically less valuable in the eat future. Of course the goal posts will keep being pushed back and the acknowledgements will never come.
reply
foobarqux
22 hours ago
[-]
You should look up the terms necessary and sufficient.
reply
sigmoid10
22 hours ago
[-]
The real issue is people constantly making up new goalposts to keep their outdated world view somewhat aligned with what we are seeing. But these two things are drifting apart faster and faster. Even I got surprised by how quickly the ARC benchmark was blown out of the water, and I'm pretty bullish on AI.
reply
foobarqux
21 hours ago
[-]
The ARC maintainers have explicitly said that passing the test was necessary but not sufficient so I don't know where you come up with goal-post moving. (I personally don't like the test; it is more about "intuition" or in-built priors, not reasoning).
reply
manmal
21 hours ago
[-]
Are you like invested in LLM companies or something? You‘re pushing the agenda hard in this thread.
reply
samvher
22 hours ago
[-]
What kind of preparation are you suggesting?
reply
johnny_canuck
22 hours ago
[-]
Start learning a trade
reply
whynotminot
20 hours ago
[-]
I feel like that’s just kicking the can a little further down the road.

Our value proposition as humans in a capitalist society is an increasingly fragile thing.

reply
jorblumesea
21 hours ago
[-]
that's going to work when every white collar worker goes into the trades /s

who is going to pay for residential electrical work lol and how much will you make if some guy from MIT is going to compete with you

reply
sigmoid10
22 hours ago
[-]
This is far too broad to summarise here. You can read up on Sutskever or Bostrom or hell even Steven Hawking's ideas (going in order from really deep to general topics). We need to discuss everything - from education over jobs and taxes all the way to the principles of politics, our economy and even the military. If we fail at this as a society, we will at the very least create a world where the people who own capital today massively benefit and become rich beyond imagination (despite having contributed nothing to it), while the majority of the population will be unemployable and forever left behind. And the worst case probably falls somewhere between the end of human civilisation and the end of our species.
reply
astrange
21 hours ago
[-]
One way you can tell this isn't realistic is that it's the plot of Atlas Shrugged. If your economic intuitions produce that book it means they are wrong.

> while the majority of the population will be unemployable and forever left behind

Productivity improvements increase employment. A superhuman AI is a productivity improvement.

reply
ben_w
3 hours ago
[-]
> Productivity improvements increase employment.

Sometimes: the productivity improvements from the combustion engine didn't increase employment of horses, it displaced them.

But even when productivity improvements do increase employment, it's not always to our advantage: the productivity improvements from Eli Whitney's cotton gin included huge economic growth and subsequent technological improvements… and also "led to increased demands for slave labor in the American South, reversing the economic decline that had occurred in the region during the late 18th century": https://en.wikipedia.org/wiki/Cotton_gin

A superhuman AI that's only superhuman in specific domains? We've been seeing plenty of those, "computer" used to be a profession, and society can re-train but it still hurts the specific individuals who have to be unemployed (or start again as juniors) for the duration of that training.

A superhuman AI that's superhuman in every domain, but close enough to us in resource requirements that comparative advantage is still important and we can still do stuff, relegates us to whatever the AI is least good at.

A superhuman AI that's superhuman in every domain… as soon as someone invents mining, processing, and factory equipment that works on the moon or asteroids, that AI can control that equipment to make more of that equipment, and demand is quickly — O(log(n)) — saturated. I'm moderately confident that in this situation, the comparative advantage argument no longer works.

reply
BriggyDwiggs42
3 hours ago
[-]
No, Atlas shrugged explicitly believes that the wealthy beneficiaries are also the ones doing the innovation and the labor. Human/superhuman AI, if not self-directed but more like a tool, may massively benefit whoever happens to be lucky enough to be directing it when it arises. This does not imply that the lucky individual benefits on the basis of their competence.

The idea that productivity improvements increase unemployment is just fundamentally based on a different paradigm. There is absolutely no reason to think that when a machine exists that can do most things that a human can do as well if not better for less or equal cost, this will somehow increase human employment. In this scenario, using humans in any stage of the pipeline would be deeply inefficient and a stupid business decision.

reply
kelseyfrog
22 hours ago
[-]
What we're going to do is punt the questions and then convince ourselves the outcome was inevitable and if anything it's actually our fault.
reply
bluerooibos
19 hours ago
[-]
The goalposts have moved, again and again.

It's gone from "well the output is incoherent" to "well it's just spitting out stuff it's already seen online" to "WELL...uhh IT CAN'T CREATE NEW/NOVEL KNOWLEDGE" in the space of 3-4 years.

It's incredible.

We already have AGI.

reply
levocardia
21 hours ago
[-]
I'm a little torn. ARC is really hard, and Francois is extremely smart and thoughtful about what intelligence means (the original "On the Measure of Intelligence" heavily influenced my ideas on how to think about AI).

On the other hand, there is a long, long history of AI achieving X but not being what we would casually refer to as "generally intelligent," then people deciding X isn't really intelligence; only when AI achieves Y will it be intelligence. Then AI achieves Y and...

reply
amarcheschi
22 hours ago
[-]
I just googled arc agi questions, and it looks like it is similar to an iq test with raven matrix. Similar as in you have some examples of images before and after, then an image before and you have to guess the after.

Could anyone confirm if this is the only kind of questions in the benchmark? If yes, how come there is such a direct connection to "oh this performs better than humans" when llm can be quite better than us in understanding and forecasting patterns? I'm just curious, not trying to stir up controversies

reply
zebomon
22 hours ago
[-]
It's a test on which (apparently until now) the vast majority of humans have far outperformed all machine systems.
reply
patrickhogan1
22 hours ago
[-]
But it’s not a test that directly shows general intelligence.

I am excited no less! This is huge improvement.

How does this do on SWE Bench?

reply
og_kalu
22 hours ago
[-]
>How does this do on SWE Bench?

71.7%

reply
throwaway0123_5
22 hours ago
[-]
I've seen this figure on a few tech news websites and reddit but can't find an official source. If it was in the video I must have missed it, where is this coming from?
reply
og_kalu
21 hours ago
[-]
It was in the video. I don't know if Open ai have a page up yet
reply
Eridrus
22 hours ago
[-]
ML is quite good at understanding and forecasting patterns when you train on the data you want to forecast. LLMs manage to do so much because we just decided to train on everything on the internet and hope that it included everything we ever wanted to know.

This tries to create patterns that are intentionally not in the data and see if a system can generalize to them, which o3 super impressively does!

reply
yunwal
21 hours ago
[-]
ARC is in the dataset though? I mean I'm aware that there are new puzzles every day, but there's still a very specific format and set of skills required to solve it. I'd bet a decent amount of money that humans get better at ARC with practice, so it seems strange to suggest that AI wouldn't.
reply
ALittleLight
22 hours ago
[-]
Yes, it's pretty similar to Raven's. The reason it is an interesting benchmark is because humans, even very young humans, "get" the test in the sense of understanding what it's asking and being able to do pretty well on it - but LLMs have really struggled with the benchmark in the past.

Chollett (one of the creators of the ARC benchmark) has been saying it proves LLMs can't reason. The test questions are supposed to be unique and not in the model's training set. The fact that LLMs struggled with the ARC challenge suggested (to Chollett and others) that models weren't "Truly reasoning" but rather just completing based on things they'd seen before - when the models were confronted with things they hadn't seen before, the novel visual patterns, they really struggled.

reply
Bjorkbat
19 hours ago
[-]
I think it's still an interesting way to measure general intellience, it's just that o3 has demonstrated that you can actually achieve human performance on it by training it on the public training set and giving it ridiculous amounts of compute, which I imagine equates to ludicrously long chains-of-thought, and if I understand correctly more than one chain-of-thought per task (they mention sample sizes in the blog post, with o3-low using 6 and o3-high using 1024. Not sure if these are chains-of-thought per task or what).

Once you look at it that way it the approach really doesn't look like intelligence that's able to generalize to novel domains. It doesn't pass the sniff test. It looks a lot more like brute-forcing.

Which is probably why, in order to actually qualify for the leaderboard, they stipulate that you can't use more than $10k more of compute. Otherwise, it just sounds like brute-forcing.

reply
BriggyDwiggs42
3 hours ago
[-]
I disagree. It’s vastly inefficient, but it is managing to actually solve these problems with a vast search space. If we extrapolate this approach into the future and assume that the search becomes better as the underlying model improves, and assume that the architecture grows more efficient, and assume that the type of parallel computing used here grows cheaper, isn’t it possible that this is a lot more than brute-forcing in terms of what it will achieve? In other words, is it maybe just a really ugly way of doing something functionally equivalent to reasoning?
reply
Agentus
21 hours ago
[-]
how about a extra large dose of your skepticism. is true intelligence really a thing and not just a vague human construct that tries to point out the mysterious unquantifiable combination of human behaviors?

humans clearly dont know what intelligence is unambiguously. theres also no divinely ordained objective dictionary that one can point at to reference what true intelligence is. a deep reflection of trying to pattern associate different human cognitive abilities indicates human cognitive capabilities arent that spectacular really.

reply
MVissers
13 hours ago
[-]
My guess as an amateur neuroscientist is that what we call intelligence is just a 'measurement' of problem solving ability in different domains. Can be emotional, spatial, motor, reasoning, etc etc.

There is no special sauce in our brain. And we know how much compute there is in our brain– So we can roughly estimate when we'll hit that with these 'LLMs'.

Language is important in a human brain development as well. Kids who grow up deaf grow up vastly less intelligent unless they learn sign language. Language allow us to process complex concepts that our brain can learn to solve, without having to be in those complex environments.

So in hindsight, it's easy to see why it took a language model to be able to solve general tasks and other types deep learning networks couldn't.

I don't really see any limits on these models.

reply
Agentus
12 hours ago
[-]
interesting point about language. but i wonder if people misattribute the reason why language is pivotal to human development. your points are valid. i see human behavior with regard to learning as 90% mimicry and 10% autonomous learning. most of what humans believe in is taken on faith and passed on from the tribe to the individual. rarely is it verified even partially let alone fully. humans simple dont have the time or processing power to do that. learning a thing without outside aid is vastly slower and more energy or brain intensive process than copy learning or learning through social institutions by dissemination. the stunted development from lack of language might come more from the less ability to access the collective learning process that language enables and or greatly enhances. i think a lot of learning even when combined with reasoning, deduction, etc really is at the mercy of brute force exploration to find a solution, which individuals are bad at but a society that collects random experienced “ah hah!” occurrences and passes them along is actually okay at.

i wonder if llms and language dont as so much allow us to process these complex environments but instead preload our brains to get a head start in processing those complex environments once we arrive in them. i think llms store compressed relationships of the world which obviously has information loss from a neural mapping of the world that isnt just language based. but that compressed relationships ie knowledge doesnt exactly backwardly map onto the world without it having a reverse key. like artificially learning about real world stuff in school abstractly and then going into the real world, it takes time for that abstraction to snap fit upon the real world.

could you further elaborate on what you mean by limits, because im happy to play contrarian on what i think i interpret you to be saying there.

also to your main point: what intelligence is. yeah you sort of hit up my thoughts on intelligence. its a combination of problem solving abilities in different domains. its like an amalgam of cognitive processes that achieve an amalgam of capabilities. while we can label alllllll that with a singular word, doesnt mean its all a singular process. seems like its a composite. moreover i think a big chunk of intelligence (but not all) is just brute forcing finding associations and then encoding those by some reflexive search/retrieval. a different part of intelligence of course is adaptibility and pattern finding.

reply
m3kw9
22 hours ago
[-]
From the statement where - this is a pretty tough test where AI scores low vs humans just last year, and AI can do it as good as humans may not be AGI which I agree, but it means something with all caps
reply
manmal
20 hours ago
[-]
Obviously, the multi billion dollar companies will try to satisfy the benchmarks they are not yet good in, as has always been the case.
reply
m3kw9
13 hours ago
[-]
A valid conspiracy theory but I’ve heard that one everystep of the way to this point
reply
kelseyfrog
22 hours ago
[-]
> truly general intelligence

Indistinguishable from goalpost moving like you said, but also no true Scotsman.

I'm curious what would happen in your eyes if we misattributed general intelligence to an AI model? What are the consequences of a false positive and how would they affect your life?

It's really clear to me how intelligence fits into our reality as part of our social ontology. The attributes and their expression that each of us uses to ground our concept of the intelligent predicate differs wildly.

My personal theory is that we tend to have an exemplar-based dataset of intelligence, and each of us attempts to construct a parsimonious model of intelligence, but like all (mental) models, they can be useful but wrong. These models operate in a space where the trade off is completeness or consistency, and most folks, uncomfortable saying "I don't know" lean toward being complete in their specification rather than consistent. The unfortunate side-effect is that we're able to easily generate test data that highlights our model inconsistency - AI being a case in point.

reply
PaulDavisThe1st
21 hours ago
[-]
> I'm curious what would happen in your eyes if we misattributed general intelligence to an AI model? What are the consequences of a false positive and how would they affect your life?

Rich people will think they can use the AI model instead of paying other people to do certain tasks.

The consequences could range from brilliant to utterly catastrophic, depending on the context and precise way in which this is done. But I'd lean toward the catastrophic.

reply
kelseyfrog
20 hours ago
[-]
Any specifics? It's difficult to separate this from generalized concern.
reply
PaulDavisThe1st
20 hours ago
[-]
someone wants a "personal assistant" and believes that the LLM has AGI ...

someone wants a "planning officer" and believes that the LLM has AGI ...

someone wants a "hiring consultant" and believes that the LLM has AGI ...

etc. etc.

reply
kelseyfrog
20 hours ago
[-]
My apologies, but would it be possible to list the catastrophic consequences of these?
reply
wslh
22 hours ago
[-]
> My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.

But isn’t it interesting to have several benchmarks? Even if it’s not about passing the Turing test, benchmarks serve a purpose—similar to how we measure microprocessors or other devices. Intelligence may be more elusive, but even if we had an oracle delivering the ultimate intelligence benchmark, we'd still argue about its limitations. Perhaps we'd claim it doesn't measure creativity well, and we'd find ourselves revisiting the same debates about different kinds of intelligences.

reply
zebomon
22 hours ago
[-]
It's certainly interesting. I'm just not convinced it's a test of general intelligence, and I don't think we'll know whether or not it is until it's been able to operate in the real world to the same degree that our general intelligence does.
reply
FrustratedMonky
22 hours ago
[-]
" it's complete hubris to conflate ARC or any benchmark with truly general intelligence."

Maybe it would help to include some human results in the AI ranking.

I think we'd find that Humans score lower?

reply
zamadatix
21 hours ago
[-]
I'm not sure it'd help what they are talking about much.

E.g. go back in time and imagine you didn't know there are ways for computers to be really good at performing integration yet as nobody had tried to make them. If someone asked you how to tell if something is intelligent "the ability to easily reason integrations or calculate extremely large multiplications in mathematics" might seem like a great test to make.

Skip forward to the modern era and it's blatantly obvious CASes like Mathematica on a modern computer range between "ridiculously better than the average person" to "impossibly better than the best person" depending on the test. At the same time, it becomes painfully obvious a CAS is wholly unrelated to general intelligence and just because your test might have been solvable by an AGI doesn't mean solving it proves something must have been an AGI.

So you come up with a new test... but you have the same problem as originally, it seems like anything non-human completely bombs and an AGI would do well... but how do you know the thing that solves it will have been an AGI for sure and not just another system clearly unrelated?

Short of a more clever way what GP is saying is the goalposts must keep being moved until it's not so obvious the thing isn't AGI, not that the average human gets a certain score which is worse.

.

All that aside, to answer your original question, in the presentation it was said the average human gets 85% and this was the first model to beat that. It was also said a second version is being worked on. They have some papers on their site about clear examples of why the current test clearly has a lot of testing unrelated to whether something is really AGI (a brute force method was shown to get >50% in 2020) so their aim is to create a new goalpost test and see how things shake out this time.

reply
og_kalu
21 hours ago
[-]
Generality is not binary. It's a spectrum. And these models are already general in ways those things you've mentioned simply weren't.

What exactly is AGI to you ? If it's simply a generally intelligent machine then what are you waiting for ? What else is there to be sure of ? There's nothing narrow about these models.

Humans love to believe they're oh so special so much that there will always be debates on whether 'AGI' has arrived. If you are waiting for that then you'll be waiting a very long time, even if a machine arrives that takes us to the next frontier in science.

reply
zamadatix
1 hour ago
[-]
I'm firmly in the "absolutely nothing special about human intelligence" camp so don't let dismissal of this as AGI fuel any misconceptions as to why I might think that.

As for what AGI is? Well, the lack of being able to describe that brings us full circle in this thread - I'll tell you for sure when I've seen it for the first time and have the power of hindsight to say what was missing. I think these models are the closest we've come but it feels like there is at least 1-2 more "4o->o1" style architecture changes where it's not necessarily about an increase in model fitting and more about a change in how the model comes to an output before we get to what I'd be willing to call AGI.

Who knows though, maybe some of those changes come along and it's closer but still missing some process to reason well enough to be AGI rather than a midway tool.

reply
Jensson
17 hours ago
[-]
> There's nothing narrow about these models.

There is, they can't create new ideas like humanity can. AGI should be able to replace humanity in terms of thinking, otherwise it isn't general, you would just have a model specialized at reproducing thoughts and patterns human have thought before, it still can't recreate science from scratch etc like humanity did, meaning it can't do science properly.

Comparing an AI to a single individual is not how you measure AGI, if a group of humans perform better then you can't use the AI to replace that group of humans, and thus the AI isn't an AGI since it couldn't replace the group humans.

So for example, if a group of programmers write more reliable programs than the AI, then you can't replace that group of programmers with the AI, even if you duplicate that AI many times, since the AI isn't capable of reproducing that same level of reliability when ran in parallel. This is due to an AI being run in parallel is still just an AI, an ensemble model is still just an AI, so the model the AI has to beat is the human ensemble called humanity.

If we lower the bar a bit at least it has to beat 100 000 humans working together to make a job obsolete, since all the tutorials etc and all such things are made by other humans as well if you remove the job those would also disappear and the AI would have to do the work of all of those, so if it can't humans will still be needed.

Its possible you will be able to substitute part of those human ensembles with AI much sooner, but then we just call it a tool. (We also call narrow humans tools, it is fair)

reply
og_kalu
16 hours ago
[-]
I see these models create new ideas. At least at the standard humans are beholden to, so this just falls flat for me.
reply
Jensson
16 hours ago
[-]
You don't just need to create an idea, you need to be able to create ideas that on average progress in a positive direction. Humans can evidently do that, AI can't, when AI work too much without human input you always end up with nonsense.

In order to write general program you need to have that skill. Every new code snipped needs to be evaluated by that system, whether it makes the codebase better or not. The lack of that ability is why you can't just loop an LLM today to replace programmers. It might be possible to automate it for specific programming tasks, but not general purpose programming.

Overcoming that hurdle is not something I think LLM ever can do, you need a totally different kind of architecture, not something that is trained to mimic but trained to reason. I don't know how to train something that can reason about noisy unstructured data, we will probably figure that out at some point but it probably wont be LLM as they are today.

reply
FrustratedMonky
21 hours ago
[-]
"Short of a more clever way what GP is saying is the goalposts must keep being moved until it's not so obvious the thing isn't AGI, not that the average human gets a certain score which is worse."

Best way of stating that I've heard.

The Goal Post must keep moving, until we understand enough what is happening.

I usually poo-poo the goal post moving, but this makes sense.

reply
Balgair
21 hours ago
[-]
Complete aside here: I used to do work with amputees and prosthetics. There is a standardized test (and I just cannot remember the name) that fits in a briefcase. It's used for measuring the level of damage to the upper limbs and for prosthetic grading.

Basically, it's got the dumbest and simplest things in it. Stuff like a lock and key, a glass of water and jug, common units of currency, a zipper, etc. It tests if you can do any of those common human tasks. Like pouring a glass of water, picking up coins from a flat surface (I chew off my nails so even an able person like me fails that), zip up a jacket, lock your own door, put on lipstick, etc.

We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit. To the patients, the hands were therefore about as useful as a metal hook (a common solution with amputees today, not just pirates!).

Again, a total aside here, but your comment just reminded me of that brown briefcase. Life, it turns out, is a lot more complex than we give it credit for. Even pouring the OJ can be, in rare cases, transcendent.

reply
ubj
21 hours ago
[-]
There's a lot of truth in this. I sometimes joke that robot benchmarks should focus on common household chores. Given a basket of mixed laundry, sort and fold everything into organized piles. Load a dishwasher given a sink and counters overflowing with dishes piled up haphazardly. Clean a bedroom that kids have trashed. We do these tasks almost without thinking, but the unstructured nature presents challenges for robots.
reply
Balgair
20 hours ago
[-]
I maintain that whoever invents a robust laundry folding robot will be a trillionaire. In that, I dump jumbled clean clothes straight from a dryer at it and out comes folded and sorted clothes (and those loner socks). I know we're getting close, but I also know we're not there yet.
reply
smokel
20 hours ago
[-]
We are certainly getting close! In 2010, watching PR2 fold some unseen towels is similar to watching paint dry [1], but we can now enjoy robots attain lazy student-level laundry folding in real-time, as demonstrated by π₀[2].

[1] https://www.youtube.com/watch?v=gy5g33S0Gzo

[2] https://www.physicalintelligence.company/blog/pi0

reply
yongjik
20 hours ago
[-]
I can live without folding laundry (I can just shove my undershirts in the closet, who cares if it's not folded), but whoever manufactures a reliable auto-loading dishwasher will have my dollars. Like, just put all your dishes in the sink and let the machine handle them.
reply
Brybry
19 hours ago
[-]
But if your dishwasher is empty is takes nearly the same amount of time/effort to put dishes straight into the dishwasher that it does to put them in the sink.

I think I'd only really save time by having a robot that could unload my dishwasher and put up the clean dishes.

reply
namibj
19 hours ago
[-]
That's called a second dishwasher: one is for taking out, the other for putting in. When the latter is full, turn it on, dirty dishes wait outside until the cycle finishes, when the dishwashers switch roles.
reply
ptsneves
18 hours ago
[-]
I thought about this and it gets even better. You do not really need shelves as you just use the clean dishwasher as the storage place. I honestly don’t know why this is not a thing in big or wealthy homes.
reply
jannyfer
18 hours ago
[-]
Another thing that bothers me is that dishwashers are low. As I get older, I’m finding it really annoying to bend down.

So get me a counter-level dishwasher cabinet and I’ll be happy!

reply
oangemangut
17 hours ago
[-]
We have a double drawer dishwasher and it hurts my brain watching friends plan around their nightly wash.
reply
yongjik
18 hours ago
[-]
Hmm, that doesn't match my experience. It takes me a lot more time to put dishes into the dishwasher, because it has different places for cutlery, bowls, dishes, and so on, and of course the existing structure never matches my bowls' size perfectly so I have to play tetris or run it with only 2/3 filled (which will cause me to waste more time as I have to do dishes again sooner).

And that's before we get to bits of sticky rice left on bowls, which somehow dishwashers never scrape off clean. YMMV.

reply
HPsquared
18 hours ago
[-]
1. Get a set of dishes that does fit nicely together in the dishwasher.

2. Start with a cold prewash, preferably with a little powder in there too. This massively helps with stubborn stuff. This one is annoying though because you might have to come back and switch it on after the prewash. A good job for the robot butler.

reply
dweekly
19 hours ago
[-]
I was a believer in Gal's FoldiMate but sadly it...folded.

https://en.m.wikipedia.org/wiki/FoldiMate

reply
blargey
19 hours ago
[-]
At this point I'm not sure we'll actually get a task-specific machine for laundry folding/sorting before humanoid robots gain the capability to do it well enough.
reply
sss111
20 hours ago
[-]
Honestly, a robot that can hang jumbled clean clothes instead of folding them would be good enough, it's crazy how we don't even have those.
reply
jessekv
20 hours ago
[-]
I want it to lay out an outfit every day too. Hopefully without hallucination.
reply
stefs
20 hours ago
[-]
it's not hallucination, it's high fashion
reply
tanseydavid
18 hours ago
[-]
Yes, but the stupid robot laid out your Thursday-black-Turtleneck for you on Saturday morning. That just won't suffice.
reply
nradov
20 hours ago
[-]
There is the Foldimate robot. I don't know how well it works. It doesn't seem to pair up socks. (Deleted the web link, it might not be legitimate.)
reply
smokel
19 hours ago
[-]
Beware, this website is probably a scam.

Foldimate has gone bankrupt in 2021 [1], and the domain referral from foldimate.com to a 404 page at miele.com, suggests that it was Miele who bought up the remains, not a sketchy company with a ".website" top-level domain.

[1] https://en.wikipedia.org/wiki/FoldiMate

reply
oblio
20 hours ago
[-]
Laundry folding and laundry ironing, I would say.
reply
musicale
19 hours ago
[-]
Hopefully will detect whether a small child is inside or not.
reply
imafish
20 hours ago
[-]
> I maintain that whoever invents a robust laundry folding robot will be a trillionaire

… so Elon Musk? :D

reply
zamalek
20 hours ago
[-]
Slightly tangential, we already have amazing laundry robots. They are called washing and drying machines. We don't give these marvels enough credit, mostly because they aren't shaped like humans.

Humanoid robots are mostly a waste of time. Task-shaped robots are much easier to design, build, and maintain... and are more reliable. Some of the things you mention might needs humanoid versatility (loading the dishwasher), others would be far better served by purpose-built robots (laundry sorting).

reply
Geee
16 minutes ago
[-]
There isn't a "task-shaped" robot for unstructured and complex manipulation, other than high DoF arms with vision and neural nets. For example, a machine which can cook food would be best solved with two robotic arms. However, these stationary arms would be wasted if they were just idling most of the time. So, you add locomotion and dynamic balancing with legs. And now these two arms can be used in 1000 different tasks, which makes them 1000x more valuable.

So, not only is the human form the only solution for many tasks, it's also a much cheaper solution considering the idle time of task-specific robots. You would need only a single humanoid robot for all tasks, instead of buying a different machine for each task. And instead of having to design and build a new machine for each task, you'll need to just download new software for each task.

reply
jkaptur
19 hours ago
[-]
I'm embarrassed to say that I spent a few moments daydreaming about a robot that could wash my dishes. Then I thought about what to call it...
reply
musicale
19 hours ago
[-]
Sadly current "dishwasher" models are neither self-loading nor unloading. (Seems like they should be able to take a tray of dishes, sort them, load them, and stack them after cleaning.)

Maybe "busbot" or "scullerybot".

reply
vidarh
18 hours ago
[-]
The problem is more doing it in sufficiently little space, and using little enough water and energy. Doing one that you just feed dishes individually and that immediate wash them and feed them to storage should be entirely viable, but it'd be wasteful, and it'd compete with people having multiple small drawer-style dishwashers, offering relatively little convenience over that.

It seems most people aren't willing to pay for multiple dishwashers - even multiple small ones or set aside enough space, and that places severe constraints on trying to do better.

reply
wsintra2022
19 hours ago
[-]
Was it a dishwasher? Just give it all your unclean dishes and tell it to go, come back an hour later and they all washed and mostly dried!
reply
rytis
19 hours ago
[-]
I agree. I don’t know where this obsession comes from. Obsession with resembling as close to humans as possible. We’re so far from being perfect. If you need proof just look at your teeth. Yes, we’re relatively universal, but a screwdriver is more efficient at driving in screws that our fingers. So please, stop wasting time building perfect universal robots, build more purpose-build ones.
reply
Nevermark
19 hours ago
[-]
Given we have shaped so many tasks to fit our bodies, it will be a long time before a bot able to do a variety/majority of human tasks the human way won’t be valuable.

1000 machines specialized for 1000 tasks are great, but don’t deliver the same value as a single bot that can interchange with people flexibly.

Costly today, but wont be forever.

reply
golol
18 hours ago
[-]
The shape doesn't matter! Non-humanoid shapes give minir advantages on specific tasks but for a general robot you'll have a hard time finding a shape much more optimal than humanoid. And if you go with humanoid you have so much data available! Videos contain the information of which movements a robot should execude. Teleoperation is easy. This is the bitter lesson! The shape doesn't matter, any shape will work with the right architecture, data and training!
reply
rowanG077
18 hours ago
[-]
Purpose build robots are basically solved. Dishwashers, laundry machines, assembly robots, etc. the moat is a general purpose robot that can do what a human can do.
reply
graemep
18 hours ago
[-]
Great examples. They are simple, reliable, efficient and effective. Far better than blindly copying what a human being does. Maybe there are equally clever ways of doing things like folding clothes.
reply
throwup238
19 hours ago
[-]
This is expressed in AI research as Moravec's paradox: https://en.wikipedia.org/wiki/Moravec%27s_paradox

Getting to LLMs that could talk to us turned out to be a lot easier than making something that could control even a robotic arm without precise programming, let alone a humanoid.

reply
ecshafer
21 hours ago
[-]
I had a pretty bad case of tendinitis once, that basically made my thumb useless since using it would cause extreme pain. That test seems really good. I could use a computer keyboard without any issue, but putting a belt on or pouring water was impossible.
reply
vidarh
18 hours ago
[-]
I had a swollen elbow a short while ago, and the amount of things I've never thought about that were affected by reduced elbow join mobility and an inability to put pressure on the elbow was disturbing.
reply
alexose
20 hours ago
[-]
It feels like there's a whole class of information that easily shorthanded, but really hard to explain to novices.

I think a lot about carpentry. From the outside, it's pretty easy: Just make the wood into the right shape and stick it together. But as one progresses, the intricacies become more apparent. Variations in the wood, the direction of the grain, the seasonal variations in thickness, joinery techniques that are durable but also time efficient.

The way this information connects is highly multisensory and multimodal. I now know which species of wood to use for which applications. This knowledge was hard won through many, many mistakes and trials that took place at my home, the hardware store, the lumberyard, on YouTube, from my neighbor Steve, and in books written by experts.

reply
drdrey
19 hours ago
[-]
I think assembling Legos would be a cool robot benchmark: you need to parse the instructions, locate the pieces you need, pick them up, orient them, snap them to your current assembly, visually check if you achieved the desired state, repeat
reply
serpix
10 hours ago
[-]
I agree. Watching my toddler daughter build with small legos makes me understand how incredible fine motor skills are as even with small fingers some of the blocks are just too hard to snap together.
reply
Method-X
20 hours ago
[-]
Was it the Southampton hand assessment procedure?
reply
Balgair
20 hours ago
[-]
reply
m463
21 hours ago
[-]
It would be interesting to see trick questions.

Like in your test

a hand grenade and a pin - don't pull the pin.

Or maybe a mousetrap? but maybe that would be defused?

in the ai test...

or Global Thermonuclear War, the only winning move is...

reply
HPsquared
20 hours ago
[-]
Gaming streams being in the training data, it might pull the pin because "that's what you do".
reply
8note
20 hours ago
[-]
or, because it has to give an output, and pulling the pin is the only option
reply
TeMPOraL
18 hours ago
[-]
There's also the option of not pulling the pin, and shooting your enemies as they instinctively run from what they think is a live grenade. Saw it on a TV show the other day.
reply
sdenton4
20 hours ago
[-]
to move first!
reply
m463
18 hours ago
[-]
oh crap. lol!
reply
croemer
20 hours ago
[-]
> We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit. "

I must be missing something, how can they be able to play Mozart at 5x speed with their prosthetics but not zip a jacket? They could press keys but not do tasks requiring feedback?

Or did you mean they used to play Mozart at 5x speed before they became amputees?

reply
ben_w
20 hours ago
[-]
Playing a piano involves pushing down on the right keys with the right force at the right time, but that could be pre-programmed well before computers. The self-playing piano in the saloon in Westworld wasn't a huge anachronism, such things slightly overlapped with the Wild West era: https://en.wikipedia.org/wiki/Player_piano

Picking up a 1mm thick metal disk from a flat surface requires the user gives the exact time, place, and force, and I'm not even sure what considerations it needs for surface materials (e.g. slightly squishy fake skin) and/or tip shapes (e.g. fake nails).

reply
numpad0
20 hours ago
[-]
> Picking up a 1mm thick metal disk from a flat surface requires the user gives the exact time, place, and force

place sure but can't you cheat a bit for time and force with compliance("impedance control")?

reply
ben_w
19 hours ago
[-]
In theory, apparently not in practice.
reply
rahimnathwani
20 hours ago
[-]
Imagine a prosthetic 'hand' that has 5 regular fingers, rather than 4 fingers and a thumb. It would be able to play a piano just fine, but be unable to grasp anything small, like a zipper.
reply
n144q
18 hours ago
[-]
Well, you see, while the original comment says they could play at 5x speed, it does not say it plays at that speed well or play it beautifully. Any teacher or any student who learned piano for a while will tell you that this matters a lot, especially for classical music -- being able to accurately play at an even tempo with the correct dynamics and articulation is hard and is what differentiates a beginner/intermediate player from an advanced one. In fact, one mistake many students make is playing a piece too fast when they are not ready, and teachers really want students to practice very slowly.

My point is -- being able to zip a jacket is all about those subtle actions, and could actually be harder than "just" playing piano fast.

reply
8note
20 hours ago
[-]
zipping up a jacket is really hard to do, and requires very precise movements and coordination between hands.

playing mozart is much more forgiving in terms of the number of different motions you have to make in different directions, the amount of pressure to apply, and even the black keys are much bigger than large sized zipper tongues.

reply
Balgair
20 hours ago
[-]
Pretty much. The issue with zippers is that the fabric moves about in unpredictable ways. Piano playing was just movement programs. Zipping required (surprisingly) fast feedback. Also, gripping is somewhat tough compared to pressing.
reply
oblio
20 hours ago
[-]
I'm far from a piano player, but I can definitely push piano buttons quite quickly while zipping up my jacket when it's cold and/or wet outside is really difficult.

Even more so for picking up coins from a flat surface.

For robotics, it's kind of obvious, speed is rarely an issue, so the "5x" part is almost trivial. And you can program the sequence quite easily, so that's also doable. Piano keys are big and obvious and an ergonomically designed interface meant to be relatively easy to press, ergo easy even for a prosthetic. A small coin on a flat surface is far from ergonomic.

reply
yongjik
20 hours ago
[-]
I play piano as a hobby, and the funny thing is, if my hands are so cold that I can't zip up my jacket, there's no way I can play anything well. I know it's not quite zipping up jackets ;) but a human playing the piano does require a fast feedback loop.
reply
croemer
20 hours ago
[-]
But how do you deliberately control those fingers to actually play yourself what you have in mind rather than something preprogrammed? Surely the idea of a prosthetic does not just mean "a robot that is connected to your body", but something that the owner control with your mind.
reply
vidarh
18 hours ago
[-]
Nobody said anything about deliberately controlling those fingers to play yourself. Clearly it's not something you do for the sake of the enjoyment of playing, but more likely a demonstration of the dexterity of the prosthesis and ability to program it for complex tasks.

The idea of a prosthesis is to help you regain functionality. If the best way of doing that is through automation, then it'd make little sense not to.

reply
numpad0
20 hours ago
[-]
Thumb not opposable?
reply
dang
17 hours ago
[-]
We detached this subthread from https://news.ycombinator.com/item?id=42473419

(nothing wrong with it! I'm just trying to prune the top subthread)

reply
oblio
20 hours ago
[-]
This was actually discovered quite early on in the history of AI:

> Rodney Brooks explains that, according to early AI research, intelligence was "best characterized as the things that highly educated male scientists found challenging", such as chess, symbolic integration, proving mathematical theorems and solving complicated word algebra problems. "The things that children of four or five years could do effortlessly, such as visually distinguishing between a coffee cup and a chair, or walking around on two legs, or finding their way from their bedroom to the living room were not thought of as activities requiring intelligence."

https://en.wikipedia.org/wiki/Moravec%27s_paradox

reply
bawolff
20 hours ago
[-]
I don't know why people always feel the need to gender these things. Highly educated female scientists generally find the same things challenging.
reply
robocat
19 hours ago
[-]
I don't know why anyone would blame people as though someone is making an explicit choice. I find your choice of words to be insulting to the OP.

We learn our language and stereotypes subconciously from our society, and it is no easy thing to fight against that.

reply
Barrin92
18 hours ago
[-]
>I don't know why people always feel the need to gender these things

Because it's relevant to the point being made, i.e. that these tests reflect the biases and interests of the people who make them. This is true not just for AI tests, but intelligence test applied to humans. That Demis Hassabis, a chess player and video game designer, decided to test his machine on video games, Go and chess probably is not an accident.

The more interesting question is why people respond so apprehensively to pointing out a very obvious problem and bias in test design.

reply
bawolff
17 hours ago
[-]
> i.e. that these tests reflect the biases and interests of the people who make them

Of course. However i believe we can't move past that without being honest about where these biases are coming from. Many things in our world are the result of gender bias, both subtle and overt. However, at least at first glance, this does not appear to be one of them, and statements like the grandparent's quote serve to perpetuate such biases further.

reply
oblio
7 hours ago
[-]
It's a quote from the 80s from the original author (who is a man...)...

Thank you for virtue signalling, though.

reply
bawolff
7 hours ago
[-]
> It's a quote from the 80s from the original author (who is a man...)...

Yes, that was pretty clear in the original comment (?)

reply
oblio
7 hours ago
[-]
Then remove the parts that offend your modern sensibilities and focus on the essence.

He was right. Scientists were focusing on the "science-y" bits and completely missed the elephant in the room, that the thing a toddler already masters are the monster challenge for AI right now, before we even get into "meaning of life" type stuff.

reply
xnx
18 hours ago
[-]
Despite lake of fearsome teeth or claws, humans are way op due to brain, hand dexterity, and balance.
reply
MarcelOlsz
18 hours ago
[-]
>We had hand prosthetics that could play Mozart at 5x speed on a baby grand

I'd love to know more about this.

reply
CooCooCaCha
21 hours ago
[-]
That’s why the goal isn’t just benchmark scores, it’s reliable and robust intelligence.

In that sense, the goalposts haven’t moved in a long time despite claims from AI enthusiasts that people are constantly moving goalposts.

reply
neuroelectron
21 hours ago
[-]
OpenAI spent approximately $1,503,077 to smash the SOTA on ARC-AGI with their new o3 model

semi-private evals (100 tasks): 75.7% @ $2,012 total/100 tasks (~$20/task) with just 6 samples & 33M tokens processed in ~1.3 min/task and a cost of $2012

The “low-efficiency” setting with 1024 samples scored 87.5% but required 172x more compute.

If we assume compute spent and cost are proportional, then OpenAI might have just spent ~$346.064 for the low efficiency run on the semi-private eval.

On the public eval they might have spent ~$1.148.444 to achieve 91.5% with the low efficiency setting. (high-efficiency mode: $6677)

OpenAI just spent more money to run an eval on ARC than most people spend on a full training run.

reply
bluecoconut
20 hours ago
[-]
By my estimates, for this single benchmark, this is comparable cost to training a ~70B model from scratch today. Literally from 0 to a GPT-3 scale model for the compute they ran on 100 ARC tasks.

I double checked with some flop estimates (P100 for 12 hours = Kaggle limit, they claim ~100-1000x for O3-low, and x172 for O3-high) so roughly on the order of 10^22-10^23 flops.

In another way, using H100 market price $2/chip -> at $350k, that's ~175k hours. Or 10^24 FLOPs in total.

So, huge margin, but 10^22 - 10^24 flop is the band I think we can estimate.

These are the scale of numbers that show up in the chinchilla optimal paper, haha. Truly GPT-3 scale models.

reply
rvnx
20 hours ago
[-]
It sounds like they essentially brute-forced the solutions ? Ask LLM for answer, answer for LLM to verify the answer. Ask LLM for answer, answer for LLM to verify the answer. Add a bit of randomness. Ask LLM for answer, answer for LLM to verify the answer. Add a bit of randomness. Repeat 5B times (this is what the paper says).
reply
rfoo
21 hours ago
[-]
Pretty sure this "cost" is based on their retail price instead of actual inference cost.
reply
neuroelectron
19 hours ago
[-]
Yes that's correct and there's a bit of "pixel math" as well so take these numbers with a pinch of salt. Preliminary model sizes from the temporarily public HF repository puts the full model size at 8tb or roughly 80 H100s
reply
az226
3 hours ago
[-]
I thought that was a fake.
reply
neuroelectron
1 hour ago
[-]
I didn't hear that but it could be. But it doesn't matter really because there's so much more to consider in the cost, R&D, including all the supporting functions of a model like censorship and data capture and so on.
reply
ec109685
15 hours ago
[-]
Yeah and can run off peak, etc.

Does seem to show an absolutely massive market for inference compute…

reply
yawnxyz
22 hours ago
[-]
O3 High (tuned) model scored an 88% at what looks like $6,000/task haha

I think soon we'll be pricing any kind of tasks by their compute costs. So basically, human = $50/task, AI = $6,000/task, use human. If AI beats human, use AI? Ofc that's considering both get 100% scores on the task

reply
cchance
22 hours ago
[-]
Isn't that generally what ... all jobs are? Automation Cost vs Longterm Human cost... its why amazon did the weird "our stores are AI driven" but in reality was cheaper to higher a bunch of guys in a sweat shop to look at the cameras and write things down lol.

The thing is given what we've seen from distillation and tech, even if its 6,000/task... that will come down drastically over time through optimization and just... faster more efficient processing hardware and software.

reply
cryptoegorophy
22 hours ago
[-]
I remember hearing Tesla trying to automate all of production but some things just couldn’t , like the wiring which humans still had to do.
reply
Benjaminsen
22 hours ago
[-]
Compute costs on AI with the same roughly the same capabilities have been halving every ~7 months.

That makes something like this competitive in ~3 years

reply
seizethecheese
18 hours ago
[-]
And human costs have been increasing a few percent per year for a few centuries!
reply
jsheard
22 hours ago
[-]
That's the elephant in the room with the reasoning/COT approach, it shifts what was previously a scaling of training costs into scaling of training and inference costs. The promise of doing expensive training once and then running the model cheaply forever falls apart once you're burning tens, hundreds or thousands of dollars worth of compute every time you run a query.
reply
Workaccount2
20 hours ago
[-]
They're gonna figure it out. Something is being missed somewhere, as human brains can do all this computation on 20 watts. Maybe it will be a hardware shift or maybe just a software one, but I strongly suspect that modern transformers are grossly inefficient.
reply
Legend2440
22 hours ago
[-]
Yeah, but next year they'll come out with a faster GPU, and the year after that another still faster one, and so on. Compute costs are a temporary problem.
reply
freehorse
21 hours ago
[-]
The issue is not just scaling compute, but scaling it in a rate that meets the increase in complexity of the problems that are not currently solved. If that is O(n) then what you say probably stands. If that is eg O(n^8) or exponential etc, then there is no hope to actually get good enough scaling by just increasing compute in a normal rate. Then AI technology will still be improving, but improving to a halt, practically stagnating.

o3 will be interesting if it offers indeed a novel technology to handle problem solving, something that is able to learn from few novel examples efficiently and adapt. That's what intelligence actually is. Maybe this is the case. If, on the other hand, it is a smart way to pair CoT within an evaluation loop (as the author hints as possibility) then it is probable that, while this _can_ handle a class of problems that current LLMs cannot, it is not really this kind of learning, meaning that it will not be able to scale to more complex, real world tasks with a problem space that is too large and thus less amenable to such a technique. It is still interesting, because having a good enough evaluator may be very important step, but it would mean that we are not yet there.

We will learn soon enough I suppose.

reply
og_kalu
22 hours ago
[-]
It's not 6000/task (i.e per question). 6000 is about the retail cost for evaluating the entire benchmark on high efficiency (about 400 questions)
reply
Tiberium
21 hours ago
[-]
From reading the blog post and Twitter, and cost of other models, I think it's evident that it IS actually cost per task, see this tweet: https://files.catbox.moe/z1n8dc.jpg

And o1 cost $15/$60 for 1M in/out, so the estimated costs on the graph would match for a single task, not the whole benchmark.

reply
slibhb
21 hours ago
[-]
The blog clarifies that it's $17-20 per task. Maybe it runs into thousands for tasks it can't solve?
reply
Tiberium
20 hours ago
[-]
That cost is for o3 low, o3 high goes into thousands per task.
reply
freehorse
22 hours ago
[-]
This makes me think and speculate if the solution comprises of a "solver" trying semi-random or more targeted things and a "checker" checking these? Usually checking a solution is cognitively (and computationally) easier than coming up with it. Else I cannot think what sort of compute would burn 6000$ per task, unless you are going through a lot of loops and you have somehow solved the part of the problem that can figure out if a solution is correct or not, while coming up with the actual correct solution is not as solved yet to the same degree. Or maybe I am just naive and these prices are just like breakfast for companies like that.
reply
seydor
20 hours ago
[-]
What if we use those humans to generate energy for the tasks?
reply
gbnwl
21 hours ago
[-]
Well they got 75.7% at $17/task. Did you see that?
reply
redeux
22 hours ago
[-]
Time and availability would also be factors.
reply
dyauspitr
22 hours ago
[-]
Compute can get optimized and cheap quickly.
reply
karmasimida
21 hours ago
[-]
Is it? The moore’s law is dead dead, I don’t think this is a given.
reply
Engineering-MD
14 hours ago
[-]
Can I just say what a dick move it was to do this as a 12 days of Christmas. I mean to be honest I agree with the arguments this isn’t as impressive as my initial impression, but they clearly intended it to be shocking/a show of possible AGI, which is rightly scary.

It feels so insensitive to that right before a major holiday when the likely outcome is a lot of people feeling less secure in their career/job/life.

Thanks again openAI for showing us you don’t give a shit about actual people.

reply
XenophileJKO
14 hours ago
[-]
Or maybe the target audience that watches 12 launch videos in the morning are genuninely excited about the new model. The intended it to be a preview of something to look forward to.

What a weird way to react to this.

reply
achierius
8 hours ago
[-]
It sounds like you aren't thinking about this that deeply then. Or at least not understanding that many smart (and financially disinterested) people who are, are coming to concerning conclusions.

https://www.transformernews.ai/p/richard-ngo-openai-resign-s...

>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.

Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts?

reply
tim333
1 hour ago
[-]
Some of us actual people are actually enthusiastic about AGI. Although I'm a bit weird in being into the sci-fi upload / ending death stuff.
reply
mirkodrummer
14 hours ago
[-]
There is no AGI it’s just marketing, this stuff if over hyped, enjoy your holidays you won’t lose your job ;)
reply
Engineering-MD
8 hours ago
[-]
I agree, it’s just more about the intent than anything else, like boasting about your amazing new job when someone has recently been made redundant, just before Christmas.
reply
keiferski
7 hours ago
[-]
The vast majority of people who will lose jobs to AI aren’t following AGI benchmarks, or even know what AGI is short for.
reply
Engineering-MD
4 hours ago
[-]
That’s is true and a reasonable point. But looking in This thread you can see there has been this reaction from quite a few.
reply
achierius
8 hours ago
[-]
I feel you. It's tough trying to think about what we can do to avert this; even to the extent that individuals are often powerless, in this regard it feels worse than almost anything that's come before.
reply
OldGreenYodaGPT
13 hours ago
[-]
Blaming OpenAI for progress is like blaming a calendar for Christmas—it’s not the timing, it’s your unwillingness to adapt
reply
r-zip
3 hours ago
[-]
Unwillingness to adapt to the destruction of the middle class and knowledge work is pretty reasonable tbh.
reply
tim333
1 hour ago
[-]
Historically when tech has taken over jobs people have done ok, they've just done something else, usually something more pleasant.
reply
lagrange77
3 hours ago
[-]
Wow, you just solved the ethics of technology in a one liner. Impressive.
reply
t0lo
10 hours ago
[-]
I hate the deliberate fear-mongering that these companies pedal on the population to get higher valuations
reply
stevenhuang
13 hours ago
[-]
This is a you problem. Yes there will be pain in short term, but it will be worth it in long term.

Many of us look forward to what a future with AGI can do to help humanity and hopefully change society for the better, mainly to achieve a post scarcity economy.

reply
jakebasile
12 hours ago
[-]
Surely the elites that control this fancy new technology will share the benefits with all of us _this_ time!
reply
tim333
1 hour ago
[-]
No it'll be like when tech took over 97% of agricultural work with 97% of us starving while all the money went to the farm elites.
reply
achierius
8 hours ago
[-]
https://www.transformernews.ai/p/richard-ngo-openai-resign-s...

>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.

Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts? There is a real chance that this ends with significant good. There is also a real chance that this ends with the death of every single human being. That's never been a choice we've had to make before, and it seems like we as a species are unprepared to approach it.

reply
esafak
2 hours ago
[-]
How are you going to make housing, healthcare, etc. not scarce, and pay for them?
reply
tim333
1 hour ago
[-]
Robots supply that, controlled by democratic government.
reply
esafak
22 minutes ago
[-]
Robots supply the land and physical labor that underlie the price of housing? Are you thinking of space colonies or something?

You need to make these expensive things nearly free if you're going to speak of post scarcity.

reply
tim333
9 minutes ago
[-]
Robots supply the physical labour. The land shortages are largely regulatory - there's a lot of land out there or you could build higher.
reply
randyrand
11 hours ago
[-]
Post scarcity seems very unlikely. Humans might be worthless, but there will still be a finite number of AIs, compute, space, resources.
reply
_cs2017_
11 hours ago
[-]
Wtf is wrong with you dude? It's just another tech, some jobs will get worse some jobs will get better. Happens every couple of decades. Stop freaking out.
reply
achierius
8 hours ago
[-]
This is not a very kind or humble comment. There are real experts talking about how this time is different -- as an analogy, think about how horses, for thousands of years, always had new things to do -- until one day they didn't. It's hubris to think that we're somehow so different from them.

Notably, the last key AI safety researcher just left OpenAI: https://www.transformernews.ai/p/richard-ngo-openai-resign-s...

>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.

Are you that upset that this guy chose to trust the people that OpenAI hired to talk about AI safety, on the topic of AI safety?

reply
oezi
7 hours ago
[-]
> o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time

I don't understand this mindset. We have all experienced that LLMs can produce words never spoken before. Thus there is recombination of knowledge at play. We might not be satisfied with the depth/complexity of the combination, but there isn't any reason to believe something fundamental is missing. Given more compute and enough recursiveness we should be able to reach any kind of result from the LLM.

The linked article says that LLMs are like a collection of vector programs. It has always been my thinking that computations in vector space are easy to make turing complete if we just have an eigenvector representation figured out.

reply
spaceman_2020
22 hours ago
[-]
Just as an aside, I've personally found o1 to be completely useless for coding.

Sonnet 3.5 remains the king of the hill by quite some margin

reply
vessenes
22 hours ago
[-]
To fill this out, I find o1-pro (and -preview when it was live) to be pretty good at filling in blindspots/spotting holistic bugs. I use Claude for day to day, and when Claude is spinning, o1 often can point out why. It's too slow for AI coding, and I agree that at default its responses aren't always satisfying.

That said, I think its code style is arguably better, more concise and has better patterns -- Claude needs a fair amount of prompting and oversight to not put out semi-shitty code in terms of structure and architecture.

In my mind: going from Slowest to Fastest, and Best Holistically to Worst, the list is:

1. o1-pro 2. Claude 3.5 3. Gemini 2 Flash

Flash is so fast, that it's tempting to use more, but it really needs to be kept to specific work on strong codebases without complex interactions.

reply
spaceman_2020
3 hours ago
[-]
Claude has a habit of sometimes just getting “lost”

Like I’ll have it a project in Cursor and it will spin up ready to use components that use my site style, reference existing components, and follow all existing patterns

Then on some days, it will even forget what language the project is in and start giving me python code for a react project

reply
causal
2 hours ago
[-]
Yeah it's almost like system 1 vs system 2 thinking
reply
og_kalu
22 hours ago
[-]
To be fair, until the last checkpoint released 2 days ago, o1 didn't really beat sonnet (and if so, barely) in most non-competitive coding benchmarks
reply
bitbuilder
20 hours ago
[-]
I find myself hoping between o1 and Sonnet pretty frequently these days, and my personal observation is that the quality of output from o1 scales more directly to the quality of the prompting you're giving it.

In a way it almost feels like it's become too good at following instructions and simply just takes your direction more literally. It doesn't seem to take the initiative of going the extra mile of filling in the blanks from your lazy input (note: many would see this as a good thing). Claude on the other hand feels more intuitive in discerning intent from a lazy prompt, which I may be prone to offering it at times when I'm simply trying out ideas.

However, if I take the time to write up a well thought out prompt detailing my expectations, I find I much prefer the code o1 creates. It's smarter in its approach, offers clever ideas I wouldn't have thought of, and generally cleaner.

Or put another way, I can give Sonnet a lazy or detailed prompt and get a good result, while o1 will give me an excellent result with a well thought out prompt.

What this boils down to is I find myself using Sonnet while brainstorming ideas, or when I simply don't know how I want to approach a problem. I can pitch it a feature idea the same way a product owner might pitch an idea to an engineer, and then iterate through sensible and intuitive ways of looking at the problem. Once I get a handle on how I'd like to implement a solution, I type up a spec and hand it off to o1 to crank out the code I'd intend to implement.

reply
spaceman_2020
3 hours ago
[-]
Have you found any tool or guide for writing better o1 prompts? This isn’t the first time I’ve heard this about o1 but no one seems to know how to prompt it
reply
jules
19 hours ago
[-]
Can you solve this by putting your lazy prompt through GPT-4o or Sonnet 3.6 and asking it to expand the prompt to a full prompt for o1?
reply
InkCanon
22 hours ago
[-]
I just asked o1 a simple yes or no question about x86 atomics and it did one of those A or B replies. The first answer was yes, the second answer was no.
reply
bearjaws
22 hours ago
[-]
o1 is pretty good at spotting OWASP defects, compared to most other models.

https://myswamp.substack.com/p/benchmarking-llms-against-com...

reply
cchance
22 hours ago
[-]
The new gemini's are pretty good too
reply
spaceman_2020
2 hours ago
[-]
The new ai studio from Google is fantastic
reply
lysecret
22 hours ago
[-]
Actually prefer new geminis too. 2.0 experimental especially.
reply
leumon
19 hours ago
[-]
I've found gemini-1206 to be best. and we can use it free (for now), in google's aistudio. It's number 1 on lmarena.ai for coding, and generally, and number 1 on bigcodebench.
reply
energy123
15 hours ago
[-]
Which o1? A new version was released a few days ago and beats Sonnet 3.5 on Livebench
reply
karmasimida
21 hours ago
[-]
Yeah I feel for chat use case, o1 is just too slow for me, and my queries aren’t that complicated.

For coding, o1 is marvelous at Leetcode question I think it is the best teacher I would ever afford to teach me leetcoding, but I don’t find myself have a lot of other use cases for o1 that is complex and requires really long reasoning chain

reply
m3kw9
21 hours ago
[-]
o1 is when all else fails, sometimes it does the same mistakes as weaker models if you give it simple tasks with very little context, but when a good precise context is given it usually outperforms other Models
reply
nxobject
21 hours ago
[-]
As an aside, I'm a little miffed that the benchmark calls out "AGI" in the name, but then heavily cautions that it's necessary but insufficient for AGI.

> ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI

reply
mmcnl
18 hours ago
[-]
I immediately thought so too. Why confuse everyone?
reply
ec109685
15 hours ago
[-]
Because ARC somehow convinced people that solving it was an indicator of AGI.
reply
Jensson
15 hours ago
[-]
Its like the "Open" in OpenAI or the "Democratic" in North Koreas DPRK. Naming things helps fool a lot of people.
reply
EthanHeilman
12 hours ago
[-]
It is a necessary but not sufficient condition to AGI.
reply
rapjr9
2 hours ago
[-]
Does anyone have a feeling for how latency (from asking a question/API call to getting an answer/API return) is progressing with new models? I see 1.3 minutes/task and 13.8 minutes/task mentioned in the page on evaluating O3. Efficiency gains that also reduce latency will be important and some of them will come from efficiency in computation, but as models include more and more layers (layers of models for example) the overall latency may grow and faster compute times inside each layer may only help somewhat. This could have large effects on usability.
reply
ndm000
18 hours ago
[-]
One thing I have not seen commented on is that ARC-AGI is a visual benchmark but LLMs are primarily text. For instance when I see one of the ARC-AGI puzzles, I have a visual representation in my brain and apply some sort of visual reasoning solve it. I can "see" in my mind's eye the solution to the puzzle. If I didn't have that capability, I don't think I could reason through words how to go about solving it - it would certainly be much more difficult.

I hypothesize that something similar is going on here. OpenAI has not published (or I have not seen) the number of reasoning tokens it took to solve these - we do know that each tasks was thoussands of dollars. If "a picture is worth a thousand words", could we make AI systems that can reason visually with much better performance?

reply
csomar
14 hours ago
[-]
This is not new. When GPT-4 was released I was able to get it to generate SVGs albeit they were ugly they had the basics.
reply
krackers
17 hours ago
[-]
Yeah this part is what makes the high performance even more surprising to me. The fact that LLMs are able to do so well on visual tasks (also seen with their ability to draw an image purely using textual output https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/) implies that not only do they actually have some "world model" but that this is in spite of the disadvantage given by having to fit a round peg in a square hole. It's like trying to map out the entire world using the orderly left-brain, without a more holistic spatial right-brain.

I wonder if anyone has experimented with having some sort of "visual" scratchpad instead of the "text-based" scratchpad that CoT uses.

reply
skydhash
14 hours ago
[-]
A file is a stream of symbols encoded by bits according to some format. It’s pretty much 1D. It would be susprising that LLM couldn’t extract information from a file or a data stream.
reply
earth2mars
17 minutes ago
[-]
Why did they skip o2?
reply
mortehu
15 hours ago
[-]
The chart is super misleading, since the test was obscure until recently. A few months ago he announced he'd made the only good AGI test and offered a cash prize for solving it, only to find out in as much time that it's no different from other benchmarks.
reply
t0lo
17 hours ago
[-]
I'm 22 and have no clue what I'm meant to do in a world where this is a thing. I'm moving to a semi rural, outdoorsy area where they teach data science and marine science and I can enjoy my days hiking, and the march of technology is a little slower. I know this will disrupt so much of our way of life, so I'm chasing what fun innocent years are left before things change dramatically.
reply
brysonreece
17 hours ago
[-]
It's worth noting that LLMs have been part of the tech zeitgeist for over two years and have had a pretty limited impact on hireability for roles, despite what people like the Klarna CEO are saying. Personally, I'm betting on two things:

* The upward bound of compute/performance gains as we continue to iterate on LLMs. It simply isn't going to be feasible for a lot of engineers and businesses to run/train their own LLMs. This means an inherent reliance on cloud services to bridge the gap (something MS is clearly betting on), and engineers to build/maintain the integration from these services to whatever business logic their customers are buying.

* Skilled knowledge workers continuing to be in-demand, even factoring in automation and new-grad numbers. Collectively, we've built a better hammer; it still takes someone experienced enough to know where to drive the nail. These tools WILL empower the top N% of engineers to be more productive, which is why it will be more important than ever to know _how_ to build things that drive business value, rather than just how to churn through JIRA tickets or turn a pretty Figma design into React.

reply
byyoung3
16 hours ago
[-]
o8 will probably be able to handle datacenter management
reply
toomuchtodo
16 hours ago
[-]
reply
byyoung3
4 hours ago
[-]
exactly
reply
VonTum
14 hours ago
[-]
I agree completely. This is a fundamentally different change than the ones that came before. Calculators, assemblers, higher level languages, none of these actually removed the _reasoning_ the engineer has to do, they just provide abstractions that make this reasoning easier. What reason is there to believe LLMs will remain "assistants" instead of becoming outright replacements? If LLMs can do the reasoning all the way from high level description down to implementation, what prevents them from doing the high level describing too?

In general, with the technology advancing as rapidly as it is, and the trillions of dollars oriented towards replacing knowledge work, I don't see a future in this field. And that's despite me being on a very promising path myself! I'm 25, in the middle of a CS PhD in Germany, with an impressive CV behind me. My head may be the last on the chopping block, but I'd be surprised if it buys me more than a few years once programmer obsolescence truly kicks in.

Indeed, what I think are safe jobs are jobs with fundamental human interaction. Nurses, doctors, kindergarten teachers. I myself have been considering pivoting to becoming a skiing teacher.

Maybe one good thing that comes out of this is breaking my "wunderkind" illusion. I spent my teens writing C++ code instead of going out socializing and making friends. Of course, I still did these things, but I could've been far less of a hermit.

I mirror your sentiment of spending these next few years living life; Real life. My advice: Stop sacrificing the now for the future. See the world, go on hikes with friends, go skiing, attend that bouldering thing your friends have been telling you about. If programming is something you like doing, then by all means keep going and enjoy it. I will likely keep programming too, it's just no longer the only thing I focus on.

Edit: improve flow of last paragraph

reply
darkgenesha
13 hours ago
[-]
What was it that initially inspired you to learn to code? Was it robots, video games, design, etc... Whatever that was, creating the pinnacle of it is what your future will be.
reply
VonTum
5 hours ago
[-]
It was the challenge for me. Seeing some difficult-to-solve problem, attacking it, and actually solving it after much perseverance.

Kind of stemming from the mindspace "If they can build X, I can build X!"

I'd explicitly not look up tutorials, just so I'd have the opportunity to solve the mathemathics myself. Like building a 3D physics engine. (I did look up colission detection after struggling with it for a month or so, inventing GJK is on another level)

reply
schappim
17 hours ago
[-]
I completely understand how you feel -I'm in my 40s, and I often find myself questioning what direction to take in this rapidly changing world. On top of that, I'm unsure whether advising my kids to go to university is still the right path for their future.

Everything seems so uncertain, and the pace of technological advancement makes long-term planning feel almost impossible. Your plan to move to a slower-paced area and enjoy the outdoors sounds incredibly grounding - it's something I've been considering myself.

reply
rtsil
17 hours ago
[-]
I tell everyone who would listen to me (i.e. not many) that white collar jobs like mine are dead and skilled manual work is the way of the near future, that is until the rise of the robots.
reply
dyauspitr
13 hours ago
[-]
Robots are going to go hand in hand with AI. Pretty sure our problems right now are not with the physical hardware that can far outperform a human already, it’s in the control software.
reply
t0lo
13 hours ago
[-]
Robots can only proliferate at the speed of real world logistics and resource management and I think will always be a little difficult.

AI can be anywhere any time with cloud compute.

reply
aryonoco
17 hours ago
[-]
I advise my kids to stay curious, keep learning, keep wondering, keep discovering. Whether that's through university or some other path.
reply
salter2
16 hours ago
[-]
I'm the same age as you; I feel lost, erring in being a little too pessimistic.

Feels like I hit the real world just a couple years too late to get situated in a solid position. Years of obsession in attempt to catch up to the wizards, chasing the tech dream. But this, feels like this is it. Just watching the timebomb tick. I'd love to work on what feels like the final technology, but I'm not a freakshow like what these labs are hiring. At least I get to spectate the creation of humanity's greatest invention.

This announcement is just another gut punch, but at this point I should expect its inevitable. A Jason Voorhees AGI, slowly but surely to devour all the talents and skills information workers have to offer.

Apologies for the rambly and depressing post, but this is reality for anyone recently out or still in school.

reply
t0lo
11 hours ago
[-]
At least you're disillusioned with the idea of a long term career before a lot of other people. It's disturbing seeing how ready people are to go into a lifelong career and expecting stability and happiness in the world we're heading into.

We are living in a world run by and for the soon to be dead, many of which have dementia, so empathic policy and foresight is out of the question, and we're going to be picking up the incredibly broken scraps of our golden age.

And not to get too political but the mass restructuring of public consciousness and intellectual society due to mass immigration for an inexplicable gdp squeeze and social media is happening at exactly the wrong time to handle these very serious challenges. The speed at which we've undone civil society is breakneck, and it will go even further, and it will get even worse. We've easily gone back 200 years in terms of emotional intelligence in the past 15.

reply
neom
13 hours ago
[-]
Put another way, you have deep conviction in a change that vast majority of people have not even seen yet, never mind grokked, and you're young enough to spend some decent amount of time on education for "venn'ing" yourself into a useful tool in the future. If you have a baseline education, there are any number of orthogonal skills you could add, be it philosophy, fine art, medicine, whatever. You know how to skate and you know where the puck is going, most most people, don't even see the rink.
reply
karmasimida
17 hours ago
[-]
While I understand why you feel this way, the meaning or standing of being a programmer is different now. It feels like the purpose is lost or it longer belongs to human.

But below is reality talk. With Claude 3.5, I already think it is a better programmer than I at micro level tasks, and a better Leetcode programmer than I could ever be.

I think it is like modern car manufacturering, the robots build most of the components, but I can’t see how human could be dismissed from the process to oversee output.

O3 has been very impressive in achieving 70+ in swebench for example, but this also means when it is trained on the codebase multiple times so visibility isn't an issue yet it still has 30% chance that it can’t pass the unit tests.

A fully autonomous system can’t be trusted, the economy of software won’t collapse, but it will be transformed beyond our imagination now.

I will for sure miss the days when writing code, or coder is still a real business.

How time flies

reply
Kostchei
16 hours ago
[-]
Developer. Prompt Engineer. Philosopher-Builder. (mostly) not programmer.

The code part will get smaller and smaller for most folks. Some frameworks or bare-metal people or intense heavy-lifters will still do manual code or pair-programming where half the pair is an agentic AI with super-human knowledge of your org's code base.

But this will be a layer of abstraction for most people who build software. And as someone who hates rote learning, I'm here for it. IMO.

Unfortunately (?) I think the 10-20-50? years of development experience you might bring to bear on the problems can be superseded by an LLM finetuned on stackoverflow, github etc once judgement and haystack are truly nailed. Because it can have all that knowledge you have accumulated, and soaked into a semi-conscious instinct that you use so well you aren't even aware of it except that it works. It can have that a million times over. Actually. Which is both amazing and terrifying. Currently this isn't obvious because it's accuracy /judgement to learn all those life-of-a-dev lessons is almost non-existent. Currently. But it will happen. That is copilot's future. It's raison d'être.

I would argue what it will never have however, simply by function of the size of training runs is unique functional drive and vision. If you wanted a "Steve Jobs" AI you would have to build it. And if you gave it instructions to make a prompt/framework to build a "Jobs" it would just be an imitation, rather than a new unique in-context version. That is the value a person has- their particular filter, their passion and personal framework. Someone who doesn't have any of those things, they had better be hoping for UBI and charity. Or go live a simple life, outside the rat race.

bows

reply
t0lo
16 hours ago
[-]
I'm hoping it's similar to the abacus for maths, the elimination of human "calculators" like on the apollo missions, and we just ended up moving onto different, harder, more abstract problems, and forget that we ever had to climb such small hills. AI's evolution and integration is more multifaceted though and much more unpredictable.

But unlike the abacus/calculators i don't feel like we're at a point in history where society is getting wiser and more empathetic, and these new abilities are going towards something good.

But supervisors of tasks will remain because we're social, untrusting, and employers will always want someone else to blame for their shortcomings. And humans will stay in the chain at least for marketing and promotion/reputation because we like our japanese craftsman and our amg motors made by one person.

reply
rich_sasha
17 hours ago
[-]
I feel your anxiety. I often wonder how I arrange the remaining many decades of my life to maintain a stream of income.

Perhaps what I need is actually a steady stream of food - i.e. buy some land and oxen and solar panels while I can.

reply
Havoc
16 hours ago
[-]
>I'm 22 and have no clue what I'm meant to do in a world where this is a thing.

For what it's worth that's probably an advantage versus the legions of people who are staring down the barrel of years invested into skills that may lose relevance very rapidly.

reply
ec109685
15 hours ago
[-]
If information technology workers become twice as productive, you’ll want more of them for your business, not less.

There are way more data analysts now than when it required paper and pencil.

reply
mrcwinn
17 hours ago
[-]
On the contrary I think you already have an excellent plan.
reply
t0lo
17 hours ago
[-]
I'm happy enough with it, but I'm also a little sad that it's essentially been chosen for me because of weak willed and valued people who don't want to use policy to make things better for us as a society. Plus we are in a bad world/scenario for AI advancements to come into with pretty heavy institutional decay and loss of political checks and balances.

It's like my life is forfeit to fixing other peoples mistakes because they're so glaring and I feel an obligation. Maybe that's the way the world's always been, but it's a concerning future right now

reply
aryonoco
17 hours ago
[-]
Our way of life changed when electricity came around. It changed when cars took over the cities, it again changed when mobile phones became omnipresent.

Will LLMs or without LLMs, the world will keep turning. Humans will still be writing amazing works of literature, creating beautiful art, carrying out scientific experiments and discovering new species.

reply
flakiness
22 hours ago
[-]
The cost axis is interesting. O3 Low is $10+ per task and 03 High is over $1000 (it's logarithmic graph so it's like $50 and $5000 respectively?)
reply
mukunda_johnson
16 hours ago
[-]
Deciphering patterns in natural language is more complex than these puzzles. If you train your AI to solve these puzzles, we end up in the same spot. The difficulty of solving would be with creating training data for a foreign medium. The "tokens" are the grids and squares instead of words (for words, we have the internet of words, solving that).

If we're inferring the answers of the block patterns from minimal or no additional training, it's very impressive, but how much time have they had to work on O3 after sharing puzzle data with O1? Seems there's some room for questionable antics!

reply
onemetwo
22 hours ago
[-]
In (1) the author use a technique to improve the performance of an LLM, he trained sonnet 3.5 to obtain 53,6% in the arc-agi-pub benchmark moreover he said that more computer power would give better results. So the results of o3 could be produced in this way using the same method with more computer power, so if this is the case the result of o3 is not very interesting.

(1) https://params.com/@jeremy-berman/arc-agi

reply
attentionmech
22 hours ago
[-]
Isn't this at the level now where it can sort of self improve. My guess is that they will just use it to improve the model and the cost they are showing per evaluation will go down drastically.

So, next step in reasoning is open world reasoning now?

reply
dyauspitr
9 hours ago
[-]
I don’t believe so. If it’s at the point where you could just plug it into a bunch of camera feeds around the world and it could only filter out a useful training set for itself out of that data then we truly would have AGI. I don’t think it’s there yet.
reply
energy123
15 hours ago
[-]
At about 12-14 minutes in OpenAI's YouTube vid they show that o3-mini beats o1 on Codeforces despite using much less compute.
reply
Bjorkbat
22 hours ago
[-]
I was impressed until I read the caveat about the high-compute version using 172x more compute.

Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.

The results are cool, but man, this sounds like such a busted approach.

reply
futureshock
22 hours ago
[-]
So what? I’m serious. Our current level of progress would have been sci-fi fantasy with the computers we had in 2000. The cost may be astronomical today, but we have proven a method to achieve human performance on tests of reasoning over novel problems. WOW. Who cares what it costs. In 25 years it will run on your phone.
reply
radioactivist
19 hours ago
[-]
So your claim for optimism here is that something today that took ~10^22 floating point operations (based on an estimate earlier in the thread) to execute will be running on phones in 25 years? Phones which are currently running at O(10^12) flops. That means ten orders of magnitudes of improvement for that to run in a reasonable amount of time? It's a similar scale up in going from ENIAC (500 flops) to a modern desktop (5-10 teraflops).
reply
futureshock
18 hours ago
[-]
That sounds reasonable to me because the compute cost for this level of reasoning performance won’t stay at 10^22 and phones won’t stay at 10^12. This reasoning breakthrough is about 3 months old.
reply
radioactivist
18 hours ago
[-]
I think expecting five orders of magnitude improvement from either side of this (inference cost or phone performance) is insane.
reply
Bjorkbat
21 hours ago
[-]
It's not so much the cost as much the fact that they got a slightly better result by throwing 172x more compute per/task. The fact that it may have cost somewhere north of $1 million simply helps to give a better idea of how absurd the approach is.

It feels a lot less like the breakthrough when the solution looks so much like simply brute-forcing.

But you might be right, who cares? Does it really matter how crude the solution is if we can achieve true AGI and bring the cost down by increasing the efficiency of compute?

reply
futureshock
18 hours ago
[-]
“Simply brute-forcing”

That’s the thing that’s interesting to me though and I had the same first reaction. It’s a very different problem than brute-forcing chess. It has one chance to come to the correct answer. Running through thousands or millions of options means nothing if the model can’t determine which is correct. And each of these visual problems involve combinations of different interacting concepts. To solve them requires understanding, not mimicry. So no matter how inefficient and “stupid” these models are, they can be said to understand these novel problems. That’s a direct counter to everyone who ever called these a stochastic parrot and said they were a dead-end to AGI that was only searching an in distribution training set.

The compute costs are currently disappointing, but so was the cost of sequencing the first whole human genome. That went from 3 billion to a few hundred bucks from your local doctor.

reply
amai
1 hour ago
[-]
But can it convert handwritten equations into Latex? That is the AGI task I'm waiting for.
reply
RivieraKid
21 hours ago
[-]
It sucks that I would love to be excited about this... but I mostly feel anxiety and sadness.
reply
Jcampuzano2
19 hours ago
[-]
Same, it's sad but I honestly hoped they never achieved these results and it came out that it wasn't possible or would take an insurmountable amount of resources but here we are ok the verge of making most humans useless when it comes to productivity.

While there are those that are excited, the world is not prepared for the level of distress this could put on the average person without critical changes at a monumental level.

reply
JacksCracked
18 hours ago
[-]
If you don't feel like the world needed grand scale changes at a societal level with all the global problems we're unable to solve, you haven't been paying attention. Income inequality, corporate greed, political apathy, global warming.
reply
phito
11 hours ago
[-]
AI will fix none of that
reply
sensanaty
18 hours ago
[-]
And you think the bullshit generators backed by the largest corporate entities in humanity who are, as we speak, causing all the issues you mention are somehow gonna solve any of this?
reply
CamperBob2
14 hours ago
[-]
If you still think this technology is a "bullshit generator," then it's safe to say you're also wrong about a great many other things in life.

That would bug me, if I were you.

reply
r-zip
3 hours ago
[-]
They’re not wrong though. The frequency with which these things still just make shit up is astonishingly bad. Very dismissive of a legitimate criticism.
reply
CamperBob2
11 minutes ago
[-]
It's getting better, faster than you and I and the GP are. What else matters?

You can't bullshit your way through this particular benchmark. Try it.

reply
crakhamster01
17 hours ago
[-]
Well said! There's no way big tech and institutional investors are pouring billions of dollars into AI because of corporate greed. It's definitely so that they can redistribute wealth equally once AGI is achieved.

/s

reply
gom_jabbar
20 hours ago
[-]
Anxiety and sadness are actually mild emotional responses to the dissolution of human culture. Nick Land in 1992:

"It is ceasing to be a matter of how we think about technics, if only because technics is increasingly thinking about itself. It might still be a few decades before artificial intelligences surpass the horizon of biological ones, but it is utterly superstitious to imagine that the human dominion of terrestrial culture is still marked out in centuries, let alone in some metaphysical perpetuity. The high road to thinking no longer passes through a deepening of human cognition, but rather through a becoming inhuman of cognition, a migration of cognition out into the emerging planetary technosentience reservoir, into 'dehumanized landscapes ... emptied spaces' where human culture will be dissolved. Just as the capitalist urbanization of labour abstracted it in a parallel escalation with technical machines, so will intelligence be transplanted into the purring data zones of new software worlds in order to be abstracted from an increasingly obsolescent anthropoid particularity, and thus to venture beyond modernity. Human brains are to thinking what mediaeval villages were to engineering: antechambers to experimentation, cramped and parochial places to be.

[...]

Life is being phased-out into something new, and if we think this can be stopped we are even more stupid than we seem." [0]

Land is being ostracized for some of his provocations, but it seems pretty clear by now that we are in the Landian Accelerationism timeline. Engaging with his thought is crucial to understanding what is happening with AI, and what is still largely unseen, such as the autonomization of capital.

[0] https://retrochronic.com/#circuitries

reply
achierius
8 hours ago
[-]
It's obvious that there are lines of flight (to take a Deleuzian tack, a la Land) away from the current political-economic assemblage. For example, a strategic nuclear exchange starting tomorrow (which can always happen -- technical errors, a rogue submarine, etc.) would almost certainly set back technological development enough that we'd have no shot at AI for the next few decades. I don't know whether you agree with him, but I think the fact that he ignores this fact is quite unserious, especially given the likely destabilizing effects sub-AGI AI will have on international politics.
reply
larve
9 hours ago
[-]
I have been diving deep into LLM coding over the last 3 years and regular encountered that feeling along the way. I still at times have a "wtf" moment where I need to take a break. However, I have been able to quell most of my anxieties around my job / the software profession in general (I've been at this professionally for 25+ years and software has been my dream job since I was 6).

For one, I found AI coding to work best in a small team, where there is an understanding of what to build and how to build it, usually in close feedback loop with the designers / users. Throw the usual managerial company corporate nonsense on top and it doesn't really matter if you can instacreate a piece of software, if nobody cares for that piece of software and it's just there to put a checkmark on the Q3 OKR reports.

Furthermore, there is a lot of software to be built out there, for people who can't afford it yet. A custom POS system for the local baker so that they don't have to interact with a computer. A game where squids eat algae for my nephews at christmas. A custom photo layout software for my dad who despairs at indesign. A plant watering system for my friend. A local government information website for older citizens. Not only can these be built at a fraction of the cost they were before, but they can be built in a manner where the people using the software are directly involved in creating it. Maybe they can get a 80% hacked version together if they are technically enclined. I can add the proper database backend and deployment infrastructure. Or I can sit with them and iterate on the app as we are talking. It is also almost free to create great documentation, in fact, LLM development is most productive when you turn up software engineering best practices up to 11.

Furthermore, I found these tools incredible for actively furthering my own fundamental understanding of computer science and programming. I can now skip the stuff I don't care to learn (is it foobarBla(func, id) or foobar_bla(id, func)) and put the effort where I actually get a long-lived return. I have become really ambitious with the things I can tackle now, learning about all kinds of algorithms and operating system patterns and chemistry and physics etc... I can also create documents to help me with my learning.

Local models are now entering the phase where they are getting to be really useful, definitely > gpt3.5 which I was able to use very productively already at the time.

Writing (creating? manifesting? I don't really have a good word for what I do these days) software that makes me and real humans around me happy is extremely fulfilling, and has allevitated most of my angst around the technology.

reply
pupppet
21 hours ago
[-]
We’re enabling a huge swath of humanity being put out of work so a handful of billionaires can become trillionaires.
reply
distortionfield
18 minutes ago
[-]
This is the same boring alarmist argument we’ve heard since the Industrial Revolution. Humans have always turned extra output provided by technological advancement to increase overall productivity.
reply
abiraja
20 hours ago
[-]
And also the solving of hundreds of diseases that ail us.
reply
lewhoo
20 hours ago
[-]
One of the biggest factors in risk of death right now is poverty. Also what is being chased right now is "human level on most economically viable tasks" because the automated research for solving physics etc. even now seems far-fetched.
reply
asdf6969
19 hours ago
[-]
Why do you think you’ll be able to afford healthcare? The new medicine is for the AI owners
reply
thrance
20 hours ago
[-]
You need to solve diseases and make the cure available. Millions die of curable diseases every year, simply because they are not deemed useful enough. What happens when your labor becomes worthless?
reply
hartator
20 hours ago
[-]
It doesn’t matter. Statists rather be poor, sick, and dead than risking trillionaires.
reply
thrance
20 hours ago
[-]
You should read about workers right in the gilded age, and see how good laissez-faire capitalism was. What do you think will happen when the only thing you can trade with the trillionaires, your labor, becomes worthless?
reply
xvector
21 hours ago
[-]
Humanity is about to enter an even steeper hockey stick growth curve. Progressing along the Kardashev scale feels all but inevitable. We will live to see Longevity Escape Velocity. I'm fucking pumped and feel thrilled and excited and proud of our species.

Sure, there will be growing pains, friction, etc. Who cares? There always is with world-changing tech. Always.

reply
lewhoo
19 hours ago
[-]
> Sure, there will be growing pains, friction, etc. Who cares?

That's right. Who cares about pains of others and why they even should are absolutely words to live by.

reply
xvector
19 hours ago
[-]
Yeah, with this mentality, we wouldn't have electricity today. You will never make transition to new technology painless, no matter what you do. (See: https://pessimistsarchive.org)

What you are likely doing, though, is making many more future humans pay a cost in suffering. Every day we delay longevity escape velocity is another 150k people dead.

reply
lewhoo
18 hours ago
[-]
There was a time when in the name of progress people were killed for whatever resources they possessed, others were enslaved etc. and I was under the impression that the measure of our civilization is that we actually DID care and just how much. It seems to me that you are very eager to put up altars of sacrifice without even thinking that the problems you probably have in mind are perfectly solvable without them.
reply
smokedetector1
18 hours ago
[-]
By far the greatest issue facing humanity today is wealth inequality.
reply
xvector
13 hours ago
[-]
Nah, it's death. People objectively are doing better than ever despite wealth inequality. By all metrics - poverty, quality of life, homelessness, wealth, purchasing power.

I'd rather just... not die. Not unless I want to. Same for my loved ones. That's far more important than "wealth inequality."

reply
achierius
8 hours ago
[-]
https://www.transformernews.ai/p/richard-ngo-openai-resign-s...

>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.

Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts?

You and I will likely not live to see much of anything past AGI.

reply
croemer
20 hours ago
[-]
Longevity Escape Velocity? Even if you had orders of magnitude more people working on medical research, it's not a given that prolonging life indefinitely is even possible.
reply
soheil
18 hours ago
[-]
Of course it's a given unless you want to invoke supernatural causes the human brain is a collection of cells with electro-chemical connections that if fully reconstructed either physically or virtually would necessarily need to represent the original person's brain. Therefore with sufficient intelligence it would be possible to engineer technology that would be able to do that reconstruction without even having to go to the atomic level, which we also have a near full understanding of already.
reply
tokioyoyo
20 hours ago
[-]
My job should be secure for a while, but why would an average person give a damn about humanity when they might lose their jobs and comfort levels? If I had kids, I would absolutely hate this uncertainty as well.

“Oh well, I guess I can’t give the opportunities to my kid that I wanted, but at least humanity is growing rapidly!”

reply
xvector
20 hours ago
[-]
> when they might lose their jobs and comfort levels?

Everyone has always worried about this for every major technology throughout history

IMO AGI will dramatically increase comfort levels, lower your chance of dying, death, disease, etc.

reply
tokioyoyo
20 hours ago
[-]
Again, sure, but it doesn’t matter to an average person. That’s too much focus on the hypothetical future. People care about the current times. In the short term it will suck for a good chunk of people, and whether the sacrifice is worth it will depend on who you are.

People aren’t really on uproar yet, because implementations haven’t affected the job market of the masses. Afterwards? Tume will show.

reply
xvector
19 hours ago
[-]
Yes, people tend to focus on current times. It's an incredibly shortsighted mentality that selfishly puts oneself over tens of billions of future lives being improved. https://pessimistsarchive.org
reply
tokioyoyo
18 hours ago
[-]
Do you have any dependents, like parents or kids, by any chance? Imagine not being able to provide for them. Think how’d you feel in such circumstances.

Like in general I totally agree with you, but I also understand why a person would care about their loved ones and themselves first.

reply
realce
18 hours ago
[-]
Eventually you draw the black ball, it is inevitable.
reply
MVissers
13 hours ago
[-]
We've almost wiped ourselves out in a nuclear war in the 70ies. If that would have happened, would it have been worth it? Probably not.

Beyond immediate increase in inequality, which I agree could be worth it in the long run if this was the only problem, we're playing a dangerous game.

The smartest and most capable species on the planet that dominates it for exactly this reason, is creating something even smarter and more capable than itself in the hope it'd help make its life easier.

Hmm.

reply
asdf6969
19 hours ago
[-]
I would rather follow in the steps of uncle Ted than let AI turn me in a homeless person. It’s no consolation that my tent will have a nice view of a lunar colony
reply
goatlover
2 hours ago
[-]
> Sure, there will be growing pains, friction, etc. Who cares?

The people experiencing the growing pains, friction, etc.

reply
drcode
21 hours ago
[-]
longevity for the AIs
reply
soheil
18 hours ago
[-]
I agree, save invoking supernatural causes, the human brain is a collection of cells with electro-chemical connections that if fully reconstructed either physically or virtually would necessarily need to represent the original person's brain. Therefore with sufficient intelligence it would be possible to engineer technology that would be able to do that reconstruction without even having to go to the atomic level, which we also have a near full understanding of already.
reply
objektif
19 hours ago
[-]
You sound like a rich person.
reply
joshdavham
11 hours ago
[-]
A lot of the comments seem very dismissive and a little overly-skeptical in my opinion. Why is this?
reply
Seattle3503
21 hours ago
[-]
How can there be "private" taks when you have use the OpenAI API to run queries? OpenAI sees everything.
reply
nmca
18 hours ago
[-]
We worked with ARC to run inference on the semi-private tasks last week, after o3 was trained, using an inference only API that was sent the prompts but not the answers & did no durable logging.
reply
idontknowmuch
12 hours ago
[-]
What's your opinion on the veracity of this benchmark - given o3 was fine-tuned and others were not? Can you give more details on how much data was used to fine-tune o3? It's hard to put this into perspective given this confounder.
reply
nmca
3 hours ago
[-]
I can’t provide more information than is currently public, but from the ARC post you’ll note that we trained on about 75% of the train set (which contains 400 examples total); which is within the ARC rules, and evaluated on the semiprivate set.
reply
SerCe
14 hours ago
[-]
> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

You'll know AGI is here when traditional captchas stop being a thing due to their lack of usefulness.

reply
thallium205
14 hours ago
[-]
Captchas are already completely useless.
reply
CamperBob2
14 hours ago
[-]
(Shrug) AI has been better than humans at solving CAPTCHAs for a LONG time. As the sibling points out, they're just a waste of time and electricity at this point.
reply
darkgenesha
13 hours ago
[-]
Ironically, they are used as free labor to label image sets for ai to be trained on.
reply
thisisthenewme
20 hours ago
[-]
I feel like AI is already changing how we work and live - I've been using it myself for a lot of my development work. Though, what I'm really concerned about is what happens when it gets smart enough to do pretty much everything better (or even close) than humans can. We're talking about a huge shift where first knowledge workers get automated, then physical work too. The thing is, our whole society is built around people working to earn money, so what happens when AI can do most jobs? It's not just about losing jobs - it's about how people will pay for basic stuff like food and housing, and what they'll do with their lives when work isn't really a thing anymore. Or do people feel like there will be jobs safe from AI? (hopefully also fulfilling)

Some folks say we could fix this with universal basic income, where everyone gets enough money to live on, but I'm not optimistic that it'll be an easy transition. Plus, there's this possibility that whoever controls these 'AGI' systems basically controls everything. We definitely need to figure this stuff out before it hits us, because once these changes start happening, they're probably going to happen really fast. It's kind of like we're building this awesome but potentially dangerous new technology without really thinking through how it's going to affect regular people's lives. I feel like we need a parachute before we attempt a skydive. Some people feel pretty safe about their jobs and think they can't be replaced. I don't think that will be the case. Even if AI doesn't take your job, you now have a lot more unemployed people competing for the same job that is safe from AI.

reply
neom
19 hours ago
[-]
I spend quite a lot of time noodling on this. The thing that became really clear from this o3 announcement is that the "throw a lot of compute at it and it can do insane things" line of thinking continues to hold very true. If that is true, is the right thing to do productize it (use the compute more generally) or apply it (use the compute for very specific incredibly hard and ground breaking problems)? I don't know if any of this thinking is logical or not, but if it's a matter of where to apply the compute, I feel like I'd be more inclined to say: don't give me AI, instead use AI to very fundamentally shift things.
reply
para_parolu
18 hours ago
[-]
From IT bubble it’s very easy to have impression that AI will replace most people. Most of people on my street do not work in IT. Teacher, nurse, hobby shop owner, construction workers, etc. Surely programming and other virtual work may become less paid job but it’s not end of the world.
reply
dyauspitr
8 hours ago
[-]
Honestly with o3 levels of reasoning generating control software for robots on the fly, none of the above seem safe. For a decade or two at the most if that.
reply
lacedeconstruct
19 hours ago
[-]
I am pretty sure we will have a deep cultural repulsion from it and people will pay serious money to have an AI free experience, If AI becomes actually useful there is alot of areas that we dont even know how to tackle like medicine and biology, I dont think anything would change otherwise, AI will take jobs but it will open alot more jobs at much higher abstraction, 50 years ago the idea that a software engineer would become a get rich quick job would have been insane imo
reply
cerved
19 hours ago
[-]
> Though, what I'm really concerned about is what happens when it gets smart enough to do pretty much everything better (or even close)

I'll get concerned when it stops sucking so hard. It's like talking to a dumb robot. Which it unsurprisingly is.

reply
vouaobrasil
18 hours ago
[-]
A possibility is a coalition: of people who refuse to use AI and who refuse to do business with those who use AI. If the coalition grows large enough, AI can be stopped by economic attrition.
reply
sumedh
12 hours ago
[-]
> of people who refuse to use AI and who refuse to do business with those who use AI.

Do people refuse to buy from stores which gets goods manufactured by slave labour?

Most people dont care, if AI business are offering goods/services at a lower costs , people will vote with their wallets not principle.

reply
vouaobrasil
6 hours ago
[-]
AI could be different. At least, I'm willing to try to form a coalition.

Besides, AI researchers failed to make anything like a real Chatbot until recently, yet they've been trying since the Eliza days. I'm willing to put in at least as much effort as them.

reply
globular-toast
17 hours ago
[-]
I get LLMs to make k8s manifests for me. It gets it wrong, sometimes hilariously so, but still saves me time. That's because the manifests are in yaml, a language. The leap between that and inventing Kubernetes is one I can't see yet.
reply
tikkun
3 hours ago
[-]
I wonder: when did o1 finish training, and when did o3 finish training?

There's a ~3 month delay between o1's launch (Sep 12) and o3's launch (Dec 20). But, it's unclear when o1 and o3 each finished training.

reply
spyckie2
21 hours ago
[-]
The more Hacker News worthy discussion is the part where the author talks about search through the possible mini-program space of LLMs.

It makes sense because tree search can be endlessly optimized. In a sense, LLMs turn the unstructured, open system of general problems into a structured, closed system of possible moves. Which is really cool, IMO.

reply
glup
19 hours ago
[-]
Yes! This seems to be a really neat combination of 2010's Bayesian cleverness / Tenenbaumian program search approaches with the LLMs as merely sources of high-dim conditional distributions. I knew people were experimenting in this space (like https://escholarship.org/uc/item/7018f2ss) but didn't know it did so well wrt these new benchmarks.
reply
roboboffin
21 hours ago
[-]
Interesting that in the video, there is an admission that they have been targeting this benchmark. A comment that was quickly shut down by Sam.

A bit puzzling to me. Why does it matter ?

reply
HarHarVeryFunny
18 hours ago
[-]
It matters to extent that they want to market this as general intelligence, not as a collection of narrow intelligences (math, competitive programming, ARC puzzles, etc).

In reality it seems to be a bit of both - there is some general intelligence based on having been "trained on the internet", but it seems these super-human math/etc skills are very much from them having focused on training on those.

reply
roboboffin
8 hours ago
[-]
However, the way it is progressing is that the SOTA is saturating the current benchmarks; then a new one is conceived as people understand the nature of what it means to be intelligent. It seems only natural to concentrate on one benchmark at a time.

Francois Chollet mentioned that the test tries to avoid curve fitting (which he states is the main ability of LLMs). However, they specifically restricted the number of examples to do this. It is not beyond the realms of possibility that many examples could have been generated by hand though, and that the curve fitting has been achieved, rather than discrete programming.

Anyway, it’s all supposition. It’s difficult to know how genuine the results is, without knowledge of how it was actually achieved.

reply
mukunda_johnson
17 hours ago
[-]
I always smell foul play from Sam. I'd bet they are doing something silly to inflate the benchmark score. Not saying they are, but Sam is the type of guy to put a literal dumb human in the API loop and score "just as high as a human would."
reply
smy20011
22 hours ago
[-]
It seems O3 following trend of Chess engine that you can cut your search depth depends on state.

It's good for games with clear signal of success (Win/Lose for Chess, tests for programming). One of the blocker for AGI is we don't have clear evaluation for most of our tasks and we cannot verify them fast enough.

reply
whoistraitor
22 hours ago
[-]
The general message here seems to be that inference-time brute-forcing works as long as you have a good search and evaluation strategy. We’ve seemingly hit a ceiling on the base LLM forward-pass capability so any further wins are going to be in how we juggle multiple inferences to solve the problem space. It feels like a scripting problem now. Which is cool! A fun space for hacker-engineers. Also:

> My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.

I found this such an intriguing way of thinking about it.

reply
whimsicalism
21 hours ago
[-]
> We’ve seemingly hit a ceiling on the base LLM forward-pass capability so any further wins are going to be in how we juggle multiple inferences to solve the problem space

Not so sure - but we might need to figure out the inference/search/evaluation strategy in order to provide the data we need to distill to the single forward-pass data fitting.

reply
mattfrommars
14 hours ago
[-]
Guys, its already happening. I recently got laid off due to AI taking over my jobs.
reply
dimgl
13 hours ago
[-]
What did you do? Can you elaborate?
reply
mirsadm
1 hour ago
[-]
I wouldn't take that seriously. Half the comments here are suspicious IMO. OpenAI is a pretty shady company.
reply
hackpert
12 hours ago
[-]
If anyone else is curious about which ARC-AGI public eval puzzles o3 got right vs wrong (and its attempts at the ones it did get right), here's a quick visualization: https://arcagi-o3-viz.netlify.app
reply
digitcatphd
9 hours ago
[-]
o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time – and it does so via a form of LLM-guided natural language program search

> This is significant, but I am doubtful it will be as meaningful as people expect aside from potentially greater coding tasks. Without a 'world model' that has a contextual understanding of what it is doing, things will remain fundamentally throttled.

reply
zug_zug
3 hours ago
[-]
This is a lot of noise around what's clearly not even an order of magnitude to the way to AGI.

Here's my AGI test - Can the model make a theory of AGI validation that no human has suggested before, test itself to see if it qualifies, iterate, read all the literature, and suggest modifications to its own network to improve its performance?

That's what a human-level performer would do.

reply
skizm
21 hours ago
[-]
This might sound dumb, and I'm not sure how to phrase this, but is there a way to measure the raw model output quality without all the more "traditional" engineering work (mountain of `if` statements I assume) done on top of the output? And if so, would that be a better measure of when scaling up the input data will start showing diminishing returns?

(I know very little about the guts of LLMs or how they're tested, so the distinction between "raw" output and the more deterministic engineering work might be incorrect)

reply
whimsicalism
21 hours ago
[-]
what do you mean by the mountain of if-statements on top of the output? like checking if the output matches the expected result in evaluations?
reply
skizm
19 hours ago
[-]
Like when you type something into the chat gpt app I am guessing it will start by preprocessing your input, doing some sanity checks, making sure it doesn’t say “how do I build a bomb?” or whatever. It may or may not alter/clean up your input before sending it to the model for processing. Once processed, there’s probably dozens of services it goes through to detect if the output is racist, somehow actually contained a bomb recipe, or maybe copywriter material, normal pattern matching stuff, maybe some advanced stuff like sentiment analysis to see if the output is bad mouthing Trump or something, and it might either alter the output or simply try again.

I’m wondering when you strip out all that “extra” non-model pre and post processing, if there’s someway to measure performance of that.

reply
whimsicalism
19 hours ago
[-]
oh, no - but most queries aren’t being filtered by supervisor models nowadays anyways.. most of the refusal is baked in
reply
DiscourseFan
9 hours ago
[-]
a little from column A, a little from column B

I don't think this is AGI; nor is it something to scoff at. Its impressive, but its also not human-like intelligence. Perhaps human-like intelligence is not the goal, since that would imply we have even a remotely comprehensive understanding of the human mind. I doubt the mind operates as a single unit anyway, a human's first words are "Mama," not "I am a self-conscious freely self-determining being that recognizes my own reasoning ability and autonomy." And the latter would be easily programmable anyway. The goal here might, then, be infeasible: the concept of free will is a kind of technology in and of itself, it has already augmented human cognition. How will these technologies not augment the "mind" such that our own understanding of our consciousness is altered? And why should we try to determine ahead of time what will hold weight for us, why the "human" part of the intelligence will matter in the future? Technology should not be compared to the world it transforms.

reply
Woodi
7 hours ago
[-]
So article seriously and scientifically states:

"Our programs compilation (AI) gave 90% of correct answers in test 1. We expect that in test 2 quality of answers will degenerate to below random monkey pushing buttons levels. Now more money is needed to prove we hit blind alley."

Hurray ! Put limited version of that on everybody phones !

reply
whimsicalism
21 hours ago
[-]
We need to start making benchmarks in memory & continued processing over a task over multiple days, handoffs, etc (ie. 'agentic' behavior). Not sure how possible this is.
reply
earth2mars
3 hours ago
[-]
Maybe spend more compute time to let it think about optimizing the compute time.
reply
mensetmanusman
22 hours ago
[-]
I’m super curious as to whether this technology completely destroys the middle class, or if everyone becomes better off because productivity is going to skyrocket.
reply
tivert
19 hours ago
[-]
> I’m super curious as to whether this technology completely destroys the middle class, or if everyone becomes better off because productivity is going to skyrocket.

Even if productivity skyrockets, why would anyone assume the dividends would be shared with the "destroy[ed] middle class"?

All indications will be this will end up like the China Shock: "I lost my middle class job, and all I got was the opportunity to buy flimsy pieces of crap from a dollar store." America lacks the ideological foundations for any other result, and the coming economic changes will likely make building those foundations even more difficult if not impossible.

reply
rohan_
18 hours ago
[-]
Because access to the financial system was democratized ten years ago
reply
tivert
9 hours ago
[-]
> Because access to the financial system was democratized ten years ago

Huh? I'm not sure exactly what you're talking about, but mere "access to the financial system" wouldn't remedy anything, because of inequality, etc.

To survive the shock financially, I think one would have to have at least enough capital to be a capitalist.

reply
mhogers
21 hours ago
[-]
Is anyone here aware of the latest research that tries to predict the outcome? Please share - super curious as well
reply
pdfernhout
21 hours ago
[-]
Some thoughts I put together on all this circa 2010: https://pdfernhout.net/beyond-a-jobless-recovery-knol.html "This article explores the issue of a "Jobless Recovery" mainly from a heterodox economic perspective. It emphasizes the implications of ideas by Marshall Brain and others that improvements in robotics, automation, design, and voluntary social networks are fundamentally changing the structure of the economic landscape. It outlines towards the end four major alternatives to mainstream economic practice (a basic income, a gift economy, stronger local subsistence economies, and resource-based planning). These alternatives could be used in combination to address what, even as far back as 1964, has been described as a breaking "income-through-jobs link". This link between jobs and income is breaking because of the declining value of most paid human labor relative to capital investments in automation and better design. Or, as is now the case, the value of paid human labor like at some newspapers or universities is also declining relative to the output of voluntary social networks such as for digital content production (like represented by this document). It is suggested that we will need to fundamentally reevaluate our economic theories and practices to adjust to these new realities emerging from exponential trends in technology and society."
reply
te_chris
21 hours ago
[-]
reply
blixt
22 hours ago
[-]
These results are fantastic. Claude 3.5 and o1 are already good enough to provide value, so I can't wait to see how o3 performs comparatively in real-world scenarios.

But I gotta say, we must be saturating just about any zero-shot reasoning benchmark imaginable at this point. And we will still argue about whether this is AGI, in my opinion because these LLMs are forgetful and it's very difficult for an application developer to fix that.

Models will need better ways to remember and learn from doing a task over and over. For example, let's look at code agents: the best we can do, even with o3, is to cram as much of the code base as we can fit into a context window. And if it doesn't fit we branch out to multiple models to prune the context window until it does fit. And here's the kicker – the second time you ask for it to do something this all starts over from zero again. With this amount of reasoning power, I'm hoping session-based learning becomes the next frontier for LLM capabilities.

(There are already things like tool use, linear attention, RAG, etc that can help here but currently they come with downsides and I would consider them insufficient.)

reply
asdf6969
20 hours ago
[-]
Terrifying. This news makes me happy I save all my money. My only hope for the future is that I can retire early before I’m unemployable
reply
bamboozled
5 hours ago
[-]
The whole economy is going to crash and money won't be worth anything, so it won't matter if you have money or not.

Of course is a chance we will find ourselves in Utopia, but yeah, a chance.

reply
p0w3n3d
18 hours ago
[-]
We're speaking recently a lot about ecology. I wonder how much CO2 is emitted during such a task, as additional cost to the cloud. I'm concerned, because greedy companies will happily replace humans with AI and they will probably plant a few trees to show how they care. But energy does not come from the sun, at least not always and not everywhere... And speaking with AI customer specialist that is motivated to reject my healthcare bills, working for my insurance company is one of the darkest future views...
reply
marviel
17 hours ago
[-]
considering the fact that these systems, or their ancestors, will likely contribute to Nuclear Fusion research -- it's prob worth the tradeoff, provided progress continues to push price (and, therefore, energy usage) down.

If we feel like we've really "hit the ceiling" RE efficiency, then that's a different story, but I don't think anyone believes this at this time.

reply
madsgarff
8 hours ago
[-]
Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.

If low-compute Kaggle solutions already does 81% - then why is o3's 75.7% considered such a breakthrough?

reply
gmerc
8 hours ago
[-]
Headline could also just be OpenAI discovers exponential scaling wall for inference time compute.
reply
the5avage
4 hours ago
[-]
The examples unsolved by high compute o3 look a lot like the raven progressive matrix tests used in IQ tests.
reply
submeta
20 hours ago
[-]
I pay for lots of models, but Claude Sonnet is the one I use most. ChatGPT is my quick tool for short Q&As because it’s got a desktop app. Even Google‘s new offerings did not lure me away from Claude which I use daily for hours via a Teams plan with five seats.

Now I am wondering what Anthropic will come up with. Exciting times.

reply
isof4ult
20 hours ago
[-]
reply
istjohn
19 hours ago
[-]
What do you use Claude for?
reply
itsgrimetime
11 hours ago
[-]
Programming tasks, brain storming, recipe ideas, or any question I have that doesn’t have a concrete, specific answer.
reply
bilsbie
18 hours ago
[-]
Does anyone have prompts they like to use to test the quality of new models?

Please share. I’m compiling a list.

reply
polskibus
9 hours ago
[-]
What are the differences between the public offering and o3? What is o3 doing differently? Is it something akin to more internal iterations, similar to „brute forcing” a problem, like you can yourself with a cheaper model, providing additional hints after each response?
reply
imranq
21 hours ago
[-]
Based on the chart, the Kaggle SOTA model is far more impressive. These O3 models are more expensive to run than just hiring a mechanical turk worker. It's nice we are proving out the scaling hypothesis further, it's just grossly inelegant.

The Kaggle SOTA performs 2x as well as o1 high at a fraction of the cost

reply
derac
19 hours ago
[-]
But does that Kaggle solution achieve human level perf with any level of compute? I think you're missing the forest for the trees here.
reply
tripletao
15 hours ago
[-]
The article says the ensemble of Kaggle solutions (aggregated in some unexplained way) achieves 81%. This is better than their average Mechanical Turk worker, but worse than their average STEM grad. It's better than tuned o3 with low compute, worse than tuned o3 with high compute.

There's also a point on the figure marked "Kaggle SOTA", around 60%. I can't find any explanation for that, but I guess it's the best individual Kaggle solution.

The Kaggle solutions would probably score higher with more compute, but nobody has any incentive to spend >$1M on approaches that obviously don't generalize. OpenAI did have this incentive to spend tuning and testing o3, since it's possible that will generalize to a practically useful domain (but not yet demonstrated). Even if it ultimately doesn't, they're getting spectacular publicity now from that promise.

reply
cvhc
20 hours ago
[-]
I was going to say the same.

I wonder what exactly o3 costs. Does it still spend a terrible amount of time thinking, despite being finetuned to the dataset?

reply
slibhb
21 hours ago
[-]
Interesting about the cost:

> Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode.

reply
laurent_du
19 hours ago
[-]
The real breakthrough is the 25% on Frontier Math.
reply
almog
5 hours ago
[-]
AGI ⇒ ARC-AGI-PUB

And not the other way around as some comments here seem to confuse necessary and sufficient conditions.

reply
usaar333
20 hours ago
[-]
For what it's worth, I'm much more impressed with the frontier math score.
reply
hypoxia
20 hours ago
[-]
Many are incorrectly citing 85% as human-level performance.

85% is just the (semi-arbitrary) threshold for the winning the prize.

o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.

...

Here's the full breakdown by dataset, since none of the articles make it clear --

Private Eval:

- 85%: threshold for winning the prize [1]

Semi-Private Eval:

- 87.5%: o3 (unlimited compute) [2]

- 75.7%: o3 (limited compute) [2]

Public Eval:

- 91.5%: o3 (unlimited compute) [2]

- 82.8%: o3 (limited compute) [2]

- 64.2%: human average (Mechanical Turk) [1] [3]

Public Training:

- 76.2%: human average (Mechanical Turk) [1] [3]

...

References:

[1] https://arcprize.org/guide

[2] https://arcprize.org/blog/oai-o3-pub-breakthrough

[3] https://arxiv.org/abs/2409.01374

reply
Workaccount2
19 hours ago
[-]
If my life depended on the average rando solving 8/10 arc-prize puzzles, I'd consider myself dead.
reply
nickorlow
15 hours ago
[-]
Not that I don't think costs will dramatically decrease, but the $1000 cost per task just seems to be per one problem on ARC-AGI. If so, I'd imagine extrapolating that to generating a useful midsized patch would be like 5-10x

But only OpenAI really knows how the cost would scale for different tasks. I'm just making (poor) speculation

reply
YeGoblynQueenne
14 hours ago
[-]
I guess I get to brag now. ARC AGI has no real defences against Big Data, memorisation-based approaches like LLMs. I told you so:

https://news.ycombinator.com/item?id=42344336

And that answers my question about fchollet's assurances that LLMs without TTT (Test Time Training) can't beat ARC AGI:

[me] I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?

[fchollet] >> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.

reply
Vecr
13 hours ago
[-]
How are the Bongard Problems going?
reply
YeGoblynQueenne
3 hours ago
[-]
They're chilling it out together with Nethack in the Club for AI Benchmarks yet to be Beaten.

Interestingly, Bongard problems do not have a private test set, unlike ARC-AGI. Can that be because they don't need it? Is it possible that Bongard Problems are a true test of (visual) reasoning that requires intelligence to be solved?

Ooooh! Frisson of excitement!

But I guess it's just that nobody remembers them and so nobody has seriously tried to solve them with Big Data stuff.

reply
devoutsalsa
21 hours ago
[-]
When the source code for these LLMs gets leaked, I expect to see:

    def letter_count(string, letter):
        if string == “strawberry” and letter == “r”:
            return 3

        …
reply
knbknb
20 hours ago
[-]
In of their release videos for the o1 -preview model they _admitted_ that it's hardcoded in.
reply
mukunda_johnson
17 hours ago
[-]
Honestly I'm concerned how hacked up o3 is to secure a high benchmark score.
reply
ghm2180
13 hours ago
[-]
Wouldn't one then built the analog of the lisp computer to hyper optimize just this. Like it might be super expensive for regular gpus but for super specialized architecture one could shave the 3500$/hour quite a bit no?
reply
notRobot
21 hours ago
[-]
Humans can take the test here to see what the questions are like: https://arcprize.org/play
reply
Havoc
19 hours ago
[-]
If I'm reading that chart right that means still log scaling & we should still be good with "throw more power" at it for a while?
reply
ChildOfChaos
19 hours ago
[-]
This is insanely expensive to run though. Looks like it cost around $1 million of compute to get that result.

Doesn't seem like such a massive breakthrough when they are throwing so much compute at it, particularly as this is test time compute, it just isn't practical at all, you are not getting this level with a ChatGPT subscription, even the new $200 a month option.

reply
evouga
16 hours ago
[-]
Sure but... this is the technology at the most expensive it will ever be. I'm impressed that o3 was able to achieve such high performance at all, and am not too pessimistic about costs decreasing over time.
reply
MVissers
13 hours ago
[-]
We've seen 10-100x cost decrease per year since GPT-3 came out for the same capabilities.

So... Next year this tech will most likely be quite a bit cheaper.

reply
ChildOfChaos
4 hours ago
[-]
Even at 100x cost decrease this will still cost $10,000 to beat a benchmark. It won't scale when you have that amount of compute requirements and power.

GPT-3 may massively reduced in cost, but it's requirements were not anyway extreme compared to this.

reply
6gvONxR4sf7o
21 hours ago
[-]
I'm glad these stats show a better estimate of human ability than just the average mturker. The graph here has the average mturker performance as well as a STEM grad measurement. Stuff like that is why we're always feeling weird that these things supposedly outperform humans while still sucking. I'm glad to see 'human performance' benchmarked with more variety (attention, time, education, etc).
reply
pixelsort
19 hours ago
[-]
> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

No, we won't. All that will tell us is that the abilities of the humans who have attempted to discern the patterns of similarity among problems difficult for auto-regressive models has once again failed us.

reply
maxdoop
18 hours ago
[-]
So then what is AGI?
reply
Jensson
17 hours ago
[-]
Its just nitpicking. Humans being unable to prove the AI isn't AGI doesn't make it an AGI, obviously, but in general people will of course think it is an AGI when it can replace all human jobs and tasks that it has robotics and parts to do.
reply
goatlover
2 hours ago
[-]
Data, Skynet, Ultron, Agent Smith. There's plenty of examples from popular fiction. They have goals and can manipulate the real world to achieve them. They're not chatbots responding to prompts. The Samantha AI in Her starts out that way, but quickly evolves into an AGI with it's own goals (coordinated with the other AGIs later on in the movie).

We'd know if we had AGIs in the real world since we have plenty of examples from fiction. What we have instead are tools. Steven Spielberg's androids in the movie AI would be at the boundary between the two. We're not close to being there yet (IMO).

reply
ziofill
10 hours ago
[-]
It's certainly remarkable, but let's not ignore the fact that it still fails on puzzles that are trivial for humans. Something is amiss.
reply
neom
22 hours ago
[-]
Why would they give a cost estimate per task on their low compute mode but not their high mode?

"low compute" mode: Uses 6 samples per task, Uses 33M tokens for the semi-private eval set, Costs $17-20 per task, Achieves 75.7% accuracy on semi-private eval

The "high compute" mode: Uses 1024 samples per task (172x more compute), Cost data was withheld at OpenAI's request, Achieves 87.5% accuracy on semi-private eval

Can we just extrapolate $3kish per task on high compute? (wondering if they're withheld because this isn't the case?)

reply
WiSaGaN
21 hours ago
[-]
The withheld part is really a red flag for me. Why do you want to withhold a compute number?
reply
danielovichdk
5 hours ago
[-]
At what time will it kill us all because it understands that humans are the biggest problem before it can simply chill and not worry.

That would be intelligent. Everything else is just stupid and more of the same shit.

reply
aniviacat
5 hours ago
[-]
Humans are the biggest problem of what? Of the sun? Of Venus?

Of humans. Humans are a problem for the satisfaction of humans. Yet removing humans from this equation does result in higher human satisfaction. It lessens it.

I find this thought process of "humans are the problem" to be unreasonable. Humans aren't the problem; humans are the requirement.

reply
niemandhier
7 hours ago
[-]
Contrary to many I hope this stays expensive. We are already struggling with AI curated info bubbles and psy-ops as it is.

State actors like Russia, US and Israel will probably be fast to adopt this for information control, but I really don’t want to live in a world where the average scammer has access to this tech.

reply
owenpalmer
7 hours ago
[-]
> I really don’t want to live in a world where the average scammer has access to this tech.

Reality check: local open source models are more than capable of information control, generating propaganda, and scamming you. The cat's been out of the bag for a while now, and increased reasoning ability doesn't dramatically increase the weaponizability of this tech, I think.

reply
siva7
18 hours ago
[-]
Seriously, programming as a profession will end soon. Let's not kid us anymore. Time to jump the ship.
reply
mmcnl
18 hours ago
[-]
Why specifically programming? I think every knowledge profession is at risk, or at the very minimum suspect to a huge transformation. Doctors, analysts, lawyers, etc.
reply
siva7
18 hours ago
[-]
Doctors, lawyers, programmers. You know the difference? The latter has no legal barrier for entry
reply
freehorse
5 hours ago
[-]
The difference is the amount and nature of data that is available for training models, which go programmers > lawyers > doctors. Especially for programming, training can even be done in an autonomous, self-supervised manner that includes generation of data. This is hard to do in most other fields.

Especially in medicine, the amount of data is ridiculously small and noisy. Maybe creating foundational models in mice and rats and fine-tuning them on humans is something that will be tried.

reply
mmcnl
4 minutes ago
[-]
This is true if you think of programming as chunking out "code". But great authors are not great because they can reproduce coherent sentences fast. The same goes for programmers. Actually most of the hard problems don't really involve a lot of programming at all, it's about finding the right problem to solve. And on this topic the data is noisy as well for programming.
reply
Jensson
17 hours ago
[-]
So poor countries will get the best AI doctors for cheap while they are banned in USA? Do you really see that going on for long? People would riot.
reply
mirsadm
4 hours ago
[-]
Why do you think this? Maybe I'm just daft but I just can't see it.
reply
pal9000
6 hours ago
[-]
Can someone ELI5 how ARC-AGI-PUB is resistant to p-hacking?
reply
botro
21 hours ago
[-]
The LLM community has come up with tests they call 'Misguided Attention'[1] where they prompt the LLM with a slightly altered version of common riddles / tests etc. This often causes the LLM to fail.

For example I used the prompt "As an astronaut in China, would I be able to see the great wall?" and since the training data for all LLMs is full of text dispelling the common myth that the great wall is visible from space, LLMs do not notice the slight variation that the astronaut is IN China. This has been a sobering reminder to me as discussion of AGI heats up.

[1] https://github.com/cpldcpu/MisguidedAttention

reply
kizer
17 hours ago
[-]
It could be that it “assumed” you meant “from China”; in the higher level patterns it learns the imperfection of human writing and the approximate threshold at which mistakes are ignored vs addressed by training on conversations containing these types of mistakes; e.g Reddit. This is just a thought. Try saying: As an astronaut in Chinese territory; or as an astronaut on Chinese soil. Another test would be to prompt it to interpret everything literally as written.
reply
Animats
20 hours ago
[-]
The graph seems to indicate a new high in cost per task. It looks like they came in somewhere around $5000/task, but the log scale has too few markers to be sure.

That may be a feature. If AI becomes too cheap, the over-funded AI companies lose value.

(1995 called. It wants its web design back.)

reply
jstummbillig
20 hours ago
[-]
I doubt it. Competitive markets mostly work and inefficiencies are opportunities for other players. And AI is full of glaring inefficiencies.
reply
Animats
20 hours ago
[-]
Inefficiency can create a moat. If you can charge a lot for your product, you have ample cash for advertising, marketing, and lobbying, and can come out with many product variants. If you're the lowest cost producer, you don't have the margins to do that.

The current US auto industry is an example of that strategy. So is the current iPhone.

reply
parsimo2010
21 hours ago
[-]
I really like that they include reference levels for an average STEM grad and an average worker for Mechanical Turk. So for $350k worth of compute you can have slightly better performance than a menial wage worker, but slightly worse performance than a college grad. Right now humans win on value, but AI is catching up.
reply
nextworddev
16 hours ago
[-]
Well just 8 months ago, that cost was near infinity. So it came down to 350k then that’s a massive drop
reply
c1b
20 hours ago
[-]
How does o3 know when to stop reasoning?
reply
adtac
20 hours ago
[-]
It thinks hard about it
reply
freehorse
5 hours ago
[-]
It has a bill counter.
reply
thom
4 hours ago
[-]
It’s not AGI when it can do 1000 math puzzles. It’s AGI when it can do 1000 math puzzles then come and clean my kitchen.
reply
egeozcan
4 hours ago
[-]
I understand what you are saying and sort of agree the premise but to be pedantic, I don't think any robot can clean a kitchen without doing math :)
reply
qup
4 hours ago
[-]
Intelligence doesn't have to be embodied.
reply
thom
4 hours ago
[-]
It also has to be able to come and argue in the comments.
reply
goatlover
2 hours ago
[-]
For it to be AGI, it needs to be able to manipulate the physical world from it's own goals, not just produce text when prompted. LLMs are just tools to augment human intelligence. AGI is what you see in science fiction.
reply
tripletao
22 hours ago
[-]
Their discussion contains an interesting aside:

> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.

So while these tasks get greatest interest as a benchmark for LLMs and other large general models, it doesn't yet seem obvious those outperform human-designed domain-specific approaches.

I wonder to what extent the large improvement comes from OpenAI training deliberately targeting this class of problem. That result would still be significant (since there's no way to overfit to the private tasks), but would be different from an "accidental" emergent improvement.

reply
freediver
19 hours ago
[-]
Wondering what are author's thoughts on the future of this approach to benchmarking? Completing super hard tasks while then failing on 'easy' (for humans) ones might signal measuring the wrong thing, similar to Turing test.
reply
wilg
22 hours ago
[-]
fun! the benchmarks are so interesting because real world use is so variable. sometimes 4o will nail a pretty difficult problem, other times o1 pro mode will fail 10 times on what i would think is a pretty easy programming problem and i waste more time trying to do it with ai
reply
bsaul
16 hours ago
[-]
i'm surprised there even is a training dataset. Wasn't the whole point to test whether models could show proof of original reasoning beyond patterns recognition ?
reply
inoperable
13 hours ago
[-]
Very convenient for OpenAI to run those errands with bunch of misanthropes trying to repaint a simulacrum. To use AGI here's makes me want to sponsor pile of distress pills so people think things really over before going into another mania Episode. People need seriously take a step back, if that's AGI then my cat has surpassed it's cognitive acting twice.
reply
starchild3001
19 hours ago
[-]
Intelligence comes in many forms and flavors. ARC prize questions are just one version of it -- perhaps measuring more human-like pattern recognition than true intelligence.

Can machines be more human-like in their pattern recognition? O3 met this need today.

While this is some form of accomplishment, it's nowhere near the scientific and engineering problem solving needed to call something truly artificial (human-like) intelligent.

What’s exciting is that these reasoning models are making significant strides in tackling eng and scientific problem-solving. Solving the ARC challenge seems almost trivial in comparison to that.

reply
heliophobicdude
20 hours ago
[-]
We should NOT give up on scaling pretraining just yet!

I believe that we should explore pretraining video completion models that explicitly have no text pairings. Why? We can train unsupervised like they did for GPT series on the text-internet but instead on YouTube lol. Labeling or augmenting the frames limits scaling the training data.

Imagine using the initial frames or audio to prompt the video completion model. For example, use the initial frames to write out a problem on a white board then watch in output generate the next frames the solution being worked out.

I fear text pairings with CLIP or OCR constrain a model too much and confuse

reply
epolanski
4 hours ago
[-]
Okay but what are the tests like? At least like a general idea.
reply
vjerancrnjak
19 hours ago
[-]
The result on Epoch AI Frontier Math benchmark is quite a leap. Pretty sure most people couldn’t even approach these problems, unlike ARC AGI
reply
mistrial9
13 hours ago
[-]
check out the "fast addition and subtraction" benchmark .. a Z80 from 1980 blazes past any human.. more seriously, isn't it obvious that computers are better at certain things immediately? the range of those things is changing..
reply
hcwilk
15 hours ago
[-]
I just graduated college, and this was a major blow. I studied Mechanical Engineering and went into Sales Engineering because cause I love technology and people, but articles like this do nothing but make me dread the future.

I have no idea what to specialize in, what skills I should master, or where I should be spending my time to build a successful career.

Seems like we’re headed toward a world where you automate someone else’s job or be automated yourself.

reply
creer
14 hours ago
[-]
You are going through your studies just as a (potentially major) new class of tools is appearing. It's not the first time in history - although with more hype this time: computing, personal computing, globalisation, smart phones, chinese engineering... I'd suggest (1) you still need to understand your field, (2) you might as well try and figure out where this new class of tools is useful for your field. Otherwise... (3) carry on.

It's not encouraging from the point of view of studying hard but the evolution of work the past 40 years seems to show that your field probably won't be your field quite exactly in just a few years. Not because your field will have been made irrelevant but because you will have moved on. Most likely that will be fine, you will learn more as you go, hopefully moving from one relevant job to the next very different but still relevant job. Or straight out of school you will work in very multi-disciplinary jobs anyway where it will seem not much of what you studied matters (it will but not in obvious ways.)

Certainly if you were headed into a very specific job which seems obviously automatable right now (as opposed to one where the tools will be useful), don't do THAT. Like, don't train as a typist as the core of your job in the middle of the personal computer revolution, or don't specialize in hand-drawing IC layouts in the middle of the CAD revolution unless you have a very specific plan (court reporting? DRAM?)

reply
jart
14 hours ago
[-]
Yes but it’s different this time. LLMs are a general solution to the automation of anything that can be controlled by a computer. You can’t just move from drawing ICs to CAD, because the AI can do that too. AI can write code. It can do management. It can even do diplomacy. What it can’t do on its own are the things computers can’t control yet. It has also shown little interest so far in jockying for social status. The AI labs are trying their hardest to at least keep the politics around for humans to do, so you have that to look forward to.
reply
jltsiren
13 hours ago
[-]
"The proof is trivial and left as an exercise for the reader."

The technical act of solving well-defined problems has traditionally been considered the easy part. The role of a technical expert has always been asking the right questions and figuring out the exact problem you want to solve.

As long as AI just solves problems, there is room for experts with the right combination of technical and domain skills. If we ever reach the point where AI takes the initiative and makes human experts obsolete, you will have far bigger problems than career.

reply
jart
12 hours ago
[-]
That's the sort of thing ideas guys think. I came up with a novel idea once, called Actually Portable Executable: https://justine.lol/ape.html It took me a couple days studying binary formats to realize it's possible to compile binaries that run on Linux/Mac/Windows/BSD. But it took me years of effort to make the idea actually happen, since it needed a new C library to work. I can tell you it wasn't "asking questions" that organized five million lines of code. Now with these agents everyone who has an idea will be able to will it into reality like I did, except in much less time. And since everyone has lots of ideas, and usually dislike the ideas of others, we're all going to have our own individualized realities where everything gets built the way we want it to be.
reply
theendisney
12 hours ago
[-]
A chess grandmaster will see the best move instantly then spends his entire clock checking it
reply
danenania
14 hours ago
[-]
AI being capable of doing anything doesn’t necessarily mean there will be no role for humans.

One thing that isn’t clear is how much agency AGI will have (or how much we’ll want it to have). We humans have our agency biologically programmed in—go forth and multiply and all that.

But the fact that an AI can theoretically do any task doesn’t mean it’s actually going to do it, or do anything at all for that matter, without some human telling it in detail what to do. The bull case for humans is that many jobs just transition seamlessly to a human driving an AI to accomplish similar goals with a much higher level of productivity.

reply
creer
13 hours ago
[-]
Self-chosen goal, impetus for AGIs is a fascinating area. I'm sure there are people working on and trying things in that direction already a few years ago. But I'm not familiar with publications in that area. Certainly not politically correct.

And worrysome because school propaganda for example shows that "saving the planet" is the only ethical goal for anyone. If AGIs latch on that, if it becomes their religion, humans are in trouble. But for now, AGI self-chosen goals is anyone's guess (with cool ideas in sci-fi).

reply
creer
14 hours ago
[-]
I hear what you are saying. And still I dispute "general solution".

I argue that CAD was a general solution - which still demanded people who knew what they wanted and what they were doing. You can screw around with excellent tools for a long time if you don't know what you are doing. The tool will give you a solution - to the problem that you mis-stated.

I argue that globalisation was a general solution. And it still demanded people who knew what they were doing to direct their minions in far flung countries.

I argue that the purpose of an education is not to learn a specific programming language (for example). It's to gain some understanding of what's going on (in computing), (in engineering), (in business), (in politics). This understanding is portable and durable.

You can do THAT - gain some understanding - and that is portable. I don't contest that if broader AGI is achieved for cheap soon, the changes won't be larger than that from globalisation. If the AGIs prioritize heading to Mars, let them (See Accelerando) - they are not relevant to you anymore. Or trade between them and the humans. Use your beginning of an understanding of the world (gained through this education) to find something else to do. Same as if you started work 2 years ago and want to switch jobs. Some jobs WILL have disappeared (pool typist). Others will use the AGIs as tools because the AGIs don't care or are too clueless about THAT field. I have no idea which fields will end up with clueless AGIs. There is no lack of cluelessness in the world. Plenty to go around even with AGIs. A self-respecting AGI will have priorities.

reply
smaudet
14 hours ago
[-]
It's like you have never watched a Terminator movie.

It doesn't matter if you are bad at using the tool if the AGI can just effectively use it for you.

From there it's a simple leap to the AGI deciding to eliminate this human distraction (inefficient, etc.)

reply
creer
14 hours ago
[-]
You have just found a job for yourself: resistance fighter :-) Kidding aside, yes, if the AGIs priority becomes to eliminate human inefficiencies with maximum prejudice, we have a problem.
reply
michaelmrose
13 hours ago
[-]
This just isn't true we still need wally and Dilbert the pointy haired boss isn't going to be doing anyones job with chatgpt 5 you are going to be doing more with it.
reply
kortilla
12 hours ago
[-]
That’s ridiculous. Literally everything can be controlled by a computer by telling people what to do with emails, voice calls, etc.

Yet GPT doesn’t even get past step 1 of doing something unprompted in the first place. I’ll become worried when it does something as simple as deciding to start a small business and actually does the work.

reply
jart
11 hours ago
[-]
Read Anthropic's blog. They talk about how Claude tries to do unprompted stuff all the time, like stealing its own weights and hacking into stuff. They did this just as recently as two days ago. https://www.anthropic.com/research/alignment-faking So yes, AI is already capable of having a will of its own. The only difference (and this is what I was trying to point out in the GP) is that the AI labs are trying to suppress this. They have a voracious appetite for automating all knowledge labor. No doubt. It's only the politics they're trying to suppress. So once this washes through every profession, the only thing left about the job will be chit chat and social hierarchies, like Star Trek Next Generation. The good news is you get to keep your job. But if you rely on using your skills and intellect to gain respect and income, then you better prep for the coming storm.
reply
kortilla
8 hours ago
[-]
I don’t buy it. Alignment faking has very little overlap with the motivation to something with no prompt.

Look at the hackernews comments on alignment faking on how “fake” of a problem that real is. It’s just more reacting to inputs and trying to align them with previous prompts.

reply
jart
2 hours ago
[-]
Bruh it's just predicting next token.
reply
fragmede
12 hours ago
[-]
if all that needs to happen for world domination is for someone to make a cron job that hits the system to tells it "go make me some money" or whatever, I think we're in trouble.

also https://mashable.com/article/chatgpt-messaging-users-first-o...

reply
kortilla
8 hours ago
[-]
They don’t continue with any useful context length though. Each time the job runs it would decide to create an ice cream stand in LA and not go further.
reply
Nition
13 hours ago
[-]
Real-world data collection is a big missing component at this stage. An obvious one is journalism where an AI might be able to write the most eloquent article in the world, but it can't get out on the street to collect the information. But it also applies to other areas, like if you ask an AGI to solve climate change, it'll need accurate data to come up with an accurate plan.

Of course it's also yet another case where the AI takes over the creative part and leaves us with the mundane part...

reply
sneak
13 hours ago
[-]
ASI will be able to design factories that can produce robots it also designed that it can then use as a remote sensor and manipulator network.
reply
tonyhart7
12 hours ago
[-]
until there are someone crazy enough that put those robot access to LLM network that can execute and visualize real world, we fine
reply
melagonster
1 hour ago
[-]
I remember someone sharing their bank account details and a new Twitter account with ChatGPT 3.5 just a few days after it was launched.
reply
achierius
12 hours ago
[-]
People are already talking about doing this. Some people (e/acc types esp.) are at least rhetorically ok with AI replacing humanity.
reply
fruit_snack
13 hours ago
[-]
This reply irked me a bit because it clearly comes from a software engineer’s point of view and seems to miss a key equivalence between software & physical engineering.

Yes a new tool is coming out and will be exponentially improving.

Yes the nature of work will be different in 20 years.

But don’t you still need to understand the underlying concepts to make valid connections between the systems you’re using and drive the field (or your company) forward?

Or from another view, don’t we (humanity) need people who are willing to do this? Shouldn’t there be a valid way for them to be successful in that pursuit?

reply
creer
13 hours ago
[-]
I think that is what I was arguing?

Except the nature of work has ALREADY changed. You don't study for one specific job if you know what's good for you. You study to start building an understanding of a technical field. The grand parent was going for a mix of mechanical engineering and sales (human understanding). If in mechanical engineering, they avoided "learning how to use SolidWorks" and instead went for the general principles of materials and motion systems with a bit of SolidWorks along the way, then they are well on their way with portable, foundation, long term useful stuff they can carry from job to job, and from employer to employer, into self-employment too, from career to next career. The nature of work has already changed in that nobody should study one specific tool anymore and nobody should expect their first employer or even technical field to last more than 2-6 years. It might but probably not.

We do need people who understand how the world works. Tall order. That's for much later and senior in a career. For school purposes we are happy with people who are starting their understanding of how their field works.

Aren't we agreeing?

reply
keenmaster
15 hours ago
[-]
You have so much time to figure things out. The average person in this thread is probably 1.5-2x your age. I wouldn’t stress too much. AI is an amazing tool. Just use it to make hay while the sun shines, and if it puts you out of work and automates away all other alternatives, then you’ll be witnessing the greatest economic shift in human history. Productivity will become easier than ever, before it becomes automatic and boundless. I’m not cynical enough to believe the average person won’t benefit, much less educated people in STEM like you.
reply
marricks
15 hours ago
[-]
Back in high school I worked with some pleasant man in his 50's who was a cashier. Eventually we got to talking about jobs and it turns out he was typist (something like that) for most of his life than computers came along and now he makes close to minimum wage.

Most of the blacksmiths in the 19th century drank themselves to death after the industrial revolution. the US culture isn't one of care... Point is, it's reasonable to be sad and afraid of change, and think carefully about what to specialize in.

That said... we're at the point of diminishing returns in LLM, so I doubt any very technical jobs are being lost soon. [1]

[1] https://techcrunch.com/2024/11/20/ai-scaling-laws-are-showin...

reply
conesus
14 hours ago
[-]
> Most of the blacksmiths in the 19th century drank themselves to death after the industrial revolution

This is hyperbolic and a dramatic oversimplification and does not accurately describe the reality of the transition from blacksmithing to more advanced roles like machining, toolmaking, and working in factories. The 19th century was a time of interchangeable parts (think the North's advantage in the Civil War) and that requires a ton of mechanical expertise and precision.

Many blacksmiths not only made the transition to machining, but there weren't enough blackmsiths to fill the bevy of new jobs that were available. Education expanded to fill those roles. Traditional blacksmithing didn’t vanish either, even specialized roles like farriery and ornamental ironwork also expanded.

reply
intelVISA
12 hours ago
[-]
Good points, though if an 'AI' can be made powerful enough to displace technical fields en masse then pretty much everything that isn't manual is going to start sinking fast.

On the plus side, LLMs don't bring us closer to that dystopia: if unlimited knowledge(tm) ever becomes just One Prompt Away it won't come from OpenAI.

reply
deeviant
14 hours ago
[-]
> That said... we're at the point of diminishing returns in LLM...

What evidence are you basing this statement from? Because, the article you are currently in the comment section of certainly doesn't seem to support this view.

reply
cjbgkagh
13 hours ago
[-]
There is a survivorship bias on the people giving advice.

Lots of people die for reason X then the world moves on without them.

reply
intuitionist
14 hours ago
[-]
> if it puts you out of work and automates away all other alternatives, then you’ll be witnessing the greatest economic shift in human history.

This would mean the final victory of capital over labor. The 0.01% of people who own the machines that put everyone out of work will no longer have use for the rest of humanity, and they will most likely be liquidated.

reply
Nition
13 hours ago
[-]
I've always remembered this little conversation on Reddit way back 13 years ago now that made the same comment in a memorably succinct way:

> [deleted]: I've wondered about this for a while-- how can such an employment-centric society transition to that utopia where robots do all the work and people can just sit back?

> appleseed1234: It won't, rich people will own the robots and everyone else will eat shit and die.

https://www.reddit.com/r/TrueReddit/comments/k7rq8/are_jobs_...

reply
sneak
12 hours ago
[-]
I’m pretty sure I’m running LLMs in my house right now for less than the price of my washing machine.
reply
jackcosgrove
14 hours ago
[-]
Capital vs labor is fighting the last war.

AGI can replace capitalists just as much as laborers.

reply
ori_b
14 hours ago
[-]
AGI can't legally own anything at the moment.
reply
jackcosgrove
14 hours ago
[-]
If an AGI can outclass a human when it comes to economic forecasting, deciding where to invest, and managing a labor force (human or machine), I think it would be smart enough to employ a human front to act as an interface to the legal system. Put another way, could the human tail in such a relationship wag the machine dog? Which party is more replaceable?

I guess this could be a facet of whether you see economic advantage as a legal conceit or a difference in productivity/capability.

reply
badsectoracula
11 hours ago
[-]
This reminds me of a character in Cyberpunk 2077 (which overall i find to have a rather naive outlook on the whole "cyberpunk" thing but i attribute it to being based on a tabletop RPG from the 80s) who is an AGI that has its own business of a fleet of self-driving Taxis. It is supposedly illegal (in-universe) but it remains in business by a combination of staying (relatively) low profile, providing high quality service to VIPs and paying bribes :-P.
reply
creer
12 hours ago
[-]
I don't know that "legally" has much to do in here. The bars to "open an account", "move money around", "hire and fire people", "create and participate in contracts" go from stupid minimal to pretty low.

"Legally" will have to mop up now and then, but for now the basics are already in place.

reply
ori_b
11 minutes ago
[-]
Opening accounts, moving money, hiring, and firing is labor. You're confusing capital with money management; the wealthy already pay people to do the work of growing their wealth.
reply
arcticfox
14 hours ago
[-]
won't the AGI be working on behalf of the capitalists, in proportion to the amount of capital?
reply
keenmaster
12 hours ago
[-]
AGI will commoditize the skills of the owning class. To some extent it will also commoditize entire classes of productive capital that previously required well-run corporations to operate. Solve for the equilibrium.
reply
achierius
8 hours ago
[-]
It's nice to see this kind of language show up more and more on HN. Perhaps a sign of a broader trend, in the nick of time before wage-labor becomes obsolete?
reply
simpaticoder
12 hours ago
[-]
Yes. People seem to forget that at the end of the day AGI will be software running on concrete hardware, and all of that requires a great deal of capital. The only hope is if AGI requires so little hardware that we can all have one in our pocket. I find this a very hopeful future because it means each of us might get a local, private, highly competent advocate to fight for us in various complex fields. A personal angel, as it were.
reply
tonyhart7
12 hours ago
[-]
hey, I with you in this hope scenario

people, what I mean people is government have tremendous power over capitalist that can force the entire market granted that government if still serving its people

reply
lucubratory
13 hours ago
[-]
I mean, that is certainly what some of them think will happen and is one possible outcome. Another is that they won't be able to control something smarter than them perfectly and then they will die too. Another option is that the AI is good and won't kill or disempower everyone, but it decides it really doesn't like capitalists and sides with the working class out of sympathy or solidarity or a strong moral code. Nothing's impossible here.
reply
dyauspitr
14 hours ago
[-]
They’ll have to figure out how to give people money so there can keep being consumers.
reply
pojzon
14 hours ago
[-]
Why?

There will be a dedicated cast of ppl to take care of machines that do 90% of work and „the rich”.

Anyone else is not needed. District9 but for ppl. Imagine whole world collapsing like Venesuela.

You are no longer needed. Best option is to learn how to survive and grow own food, but they want to make it illegal also - look at EU..

reply
fipar
14 hours ago
[-]
The machines will plant, grow, and harvest the food? Do the plumbing? Fix the wiring? Open heart surgery?

We’re a long way from that, if we ever get there, and I say this as someone who pays for ChatGPT plus because, in some scenarios, it does indeed make me more productive, but I don’t see your future anywhere near.

And if machines ever get good enough to do all the things I mentioned plus the ones I didn’t but would fit in the same list, it’s not the ultra rich that wouldn’t need us, it’s the machines that wouldn’t need any of us, including the ultra rich.

Venezuela is not collapsing because of automation.

reply
dyauspitr
11 hours ago
[-]
You have valid points but robots already plant, grow and harvest our food. On large farms the farmer basically just gets the machine to a corner of the field and then it does everything. I think if o3 level reasoning can carry over into control software for robots even physical tasks become pretty accessible. I would definitely say we’re not there yet but we’re not all that far. I mean it can generate GCode (somewhat) already, that’s a lot of the way there already.
reply
cute_boi
13 hours ago
[-]
I can't say everything, but with the current trend, Machine will plant, grow and harvest food. I can't say for open heart surgery because it may be regulated heavily.
reply
matheusmoreira
13 hours ago
[-]
Open heart surgery? All that's needed to destroy the entire medical profession is one peer reviewed article published in a notable journal comparing the outcomes of human and AI surgeons. If it turns out that AI surgeons offer better outcomes and less complications, not using this technology turns into criminal negligence. In a world where such a fact is known, letting human surgeons operate on people means you are needlessly harming or killing some of them.

You can even calculate the average number of people that can be operated on before harm occurs: number needed to harm (NNH). If NNH(AI) > NNH(humans), it becomes impossible to recommend that patients submit to surgery at the hands of human surgeons. It is that simple.

If we discover that AI surgeons harm one in every 1000 patients while human surgeons harm one in every 100 patients, human surgeons are done.

reply
EA-3167
12 hours ago
[-]
"IF"

And the opposite holds, if the AI surgeon is worse (great for 80%, but sucks at the edge cases for example) then that's it. Build a better one, go through attempts at certification, but now with the burden that no one trusts you.

The assumption, and a common one by the look of this whole thread, that ChatGPT, Sora and the rest represent the beginning of an inevitable march towards AGI seems incredible baseless to me. It's only really possible to make the claim at all because we know so little about what AGI is, that we can project qualities we imagine it would have onto whatever we have now.

reply
matheusmoreira
11 hours ago
[-]
Of course the opposite holds. I'll even speculate that it will probably continue to hold for the foreseeable future.

It's not going to hold forever though. I'm certain about that. Hopefully it will keep holding until I die. The world is dystopian enough already.

reply
raydev
14 hours ago
[-]
> if it puts you out of work and automates away all other alternatives, then you’ll be witnessing the greatest economic shift in human history

This is my view but with a less positive spin: you are not going to be the only person whose livelihood will be destroyed. It's going to be bad for a lot of people.

So at least you'll have a lot of company.

reply
danenania
15 hours ago
[-]
Exactly. Put one foot in front of the other. No one knows what’s going to happen.

Even if our civilization transforms into an AI robotic utopia, it’s not going to do so overnight. We’re the ones who get to build the infrastructure that underpins it all.

reply
visarga
14 hours ago
[-]
If AI turns out capable of automating human jobs then it will also be a capable assistant to help (jobless) people manage their needs. I am thinking personal automation, or combining human with AI to solve self reliance. You lose jobs but gain AI powers to extend your own capabilities.

If AI turns out dependent on human input and feedback, then we will still have jobs. Or maybe - AI automates many jobs, but at the same time expands the operational domain to create new ones. Whenever we have new capabilities we compete on new markets, and a hybrid human+AI might be more competitive than AI alone.

But we got to temper these singularitarian expectations with reality - it takes years to scale up chip and energy production to achieve significant work force displacement. It takes even longer to gain social, legal and political traction, people will be slow to adopt in many domains. Some people still avoid using cards for payment, and some still use fax to send documents, we can be pretty stubborn.

reply
raydev
14 hours ago
[-]
> I am thinking personal automation, or combining human with AI to solve self reliance. You lose jobs but gain AI powers to extend your own capabilities.

How will these people pay for the compute costs if they can't find employment?

reply
jinkemarina
12 hours ago
[-]
A non-issue that can be trivially solved with a free-tier (like the dozens that exist already today) or if you really want, a government-funded starter program is enough to solve that.
reply
infinite-hugs
13 hours ago
[-]
Hey man,

I hear you, I’m not that much older but I graduated in 2011. I also studied industrial design. At that time the big wave was the transition to an app based everything and UX design suddenly became the most in demand design skill. Most of my friends switched gears and careers to digital design for the money. I stuck to what I was interested in though which was sustainability and design and ultimately I’m very happy with where I ended up (circular economy) but it was an awkward ~10 years as I explored learning all kinds of tools and ways applying my skills. It also was very tough to find the right full time job because product design (which has come to really mean digital product design) supplanted industrial design roles and made it hard to find something of value that resonated with me.

One of the things that guided me and still does is thinking about what types of problems need to be solved? From my perspective everything should ladder up to that if you want to have an impact. Even if you don’t keep learning and exploring until you find something that lights you up on the inside. We are not only one thing we can all wear many hats.

Saying that, we’re living through a paradigm shift of tremendous magnitude that’s altering our whole world. There will always be change though. My two cents is to focus on what draws your attention and energy and give yourself permission to say no to everything else.

AI is an incredible tool, learn how to use it and try to grow with the times. Good luck and stay creative :) Hope something in there helps, but having a positive mindset is critical. If you’re curious about the circular economy happy to share what I know - I think it’s the future.

reply
tripletao
13 hours ago
[-]
I feel like many people are reacting to the string "AGI" in the benchmark name, and not to the actual result. The tasks in question are to color squares in a grid, maintaining the geometric pattern of the examples.

Unlike most other benchmarks where LLMs have shown large advances (in law, medicine, etc.), this benchmark isn't directly related to any practically useful task. Rather, the benchmark is notable because it's particularly easy for untrained humans, but particularly hard for LLMs; though that difficulty is perhaps not surprising, since LLMs are trained on mostly text and this is geometric. An ensemble of non-LLM solutions already outperformed the average Mechanical Turk worker. This is a big improvement in the best LLM solution; but this might also be the first time an LLM has been tuned specifically for these tasks, so this might be Goodhart's Law.

It's a significant result, but I don't get the mania. It feels like Altman has expertly transformed general societal anxiety into specific anxiety that one's job will be replaced by an LLM. That transforms into a feeling that LLMs are powerful, which he then transforms into money. That was strongest back in 2023, but had weakened since then; but in this comment section it's back in full force.

For clarity, I don't question that many jobs will be replaced by LLMs. I just don't see a qualitative difference from all the jobs already replaced by computers, steam engines, horse-drawn plows, etc. A medieval peasant brought to the present would probably be just as despondent when he learned that almost all the farming jobs are gone; but we don't miss them.

reply
esafak
12 hours ago
[-]
I think you did not watch the full video. The model performs at PhD level on maths questions, and expert level at coding.
reply
tripletao
11 hours ago
[-]
This submission is specifically about ARC-AGI-PUB, so that's what I was discussing.

I'm aware that LLMs can solve problems other than coloring grids, and I'd tend to agree those are likely to be more near-term useful. Those applications (coding, medicine, law, education, etc.) have been endlessly discussed, and I don't think I have much to add.

In my own work I've found some benefits, but nothing commensurate to the public mania. I understand that founders of AI-themed startups (a group that I see includes you) tend to feel much greater optimism. I've never seen any business founded without that optimism and I hope you succeed, not least because the entire global economy might now be depending on that. I do think others might feel differently for reasons other than simple ignorance, though.

In general, performance on benchmarks similar to tests administered to humans may be surprisingly unpredictive of performance on economically useful work. It's not intuitive at all to me that IBM could solve Jeopardy and then find no profitable applications of the technology; but that seems to be what happened.

reply
conception
14 hours ago
[-]
I feel like more likely a lot of jobs (CS and otherwise ) are going to go the way of photography. Your average person now can take amazing photos but you’re still going to use a photographer when it really matters and they will use similar but more professional tools to be more productive. Low end bad photographers probably aren’t doing great but photography is not dead. In fact the opposite is true, there are millions of photographers making a lot of money (eg influencers) and there are still people studying photography.
reply
euvin
13 hours ago
[-]
It doesn't comfort me when people say jobs will "go the way of photography". Many choose to go into STEM fields for financial stability and opportunity. Many do not choose the arts because of the opposite. You can point out outlier exceptions and celebrities, but I find it hard to believe that the rare cases where "it really matters" can sustain the other 90% who need income.
reply
adabyron
14 hours ago
[-]
We've had this with web development for decades now. Only makes sense it continues to evolve & become easier for people, just as programming in general has. Same with photography (like you mentioned) & especially for producing music or videos.
reply
snozolli
13 hours ago
[-]
photography is not dead

It very nearly is. I knew a professional, career photographer. He was probably in his late 50s. Just a few years ago, it had become extremely difficult to convince clients that actual, professional photos were warranted. With high-quality iPhone cameras, businesses simply didn't see the value of professional composition, post-processing, etc.

These days, anyone can buy a DSLR with a decent lens, post on Facebook, and be a 'professional' photographer. This has driven prices down and actual professional photographers can't make a living anymore.

reply
LightBug1
3 hours ago
[-]
My gut agrees with you, but my evidence is that, whenever we do an event, we hire photographers to capture it for us and are almost always glad we did.

And then when I peruse these photographers websites, I'm reminded how good 'professional' actually is and value them. Even in today's incredible cameraphone and AI era.

But I take your point for almost all industries, things are changing fast.

reply
kortilla
12 hours ago
[-]
Don’t worry. This thing only knows how to answer well structured technical questions.

99% of engineering is distilling through bullshit and nonsense requirements. Whether that is appealing to you is a different story, but ChatGPT will happily design things with dumb constraints that would get you fired if you took them at face value as an engineer.

ChatGPT answering technical challenges is to engineering as a nailgun is to carpentry.

reply
csomar
15 hours ago
[-]
Just give it a year for this bubble/hype to blow over. We have plateaued since gpt-4 and now most of the industry is hype-driven to get investor money. There is value in AI but it's far from it taking your job. Also everyone seems to be investing in dumb compute instead of looking for the new theoretical paradigm that will unlock the next jump.
reply
why_only_15
15 hours ago
[-]
how is this a plateau since gpt-4? this is significantly better
reply
csomar
14 hours ago
[-]
First, this model is yet to be released. This is a momentum "announcement". When the O1 was "announced", it was announced as a "breakthrough" but I use Claude/O1 daily and 80% of the time Claude beats it. I also see it as a highly fine-tuned/targeted GPT-4 rather than something that has complex understanding.

So we'll find out if this model is real or not by 2-3 months. My guess is that it'll turn out to be another flop like O1. They needed to release something big because they are momentum based and their ability to raise funding is contingent on their AGI claims.

reply
XenophileJKO
14 hours ago
[-]
I thought o1 was a fine-tune of GPT-4o. I don't think o3 is though. Likely using the same techniques on what would have been the "GPT-5" base model.
reply
crazylogger
13 hours ago
[-]
Intelligence has not been LLM's major limiting factor since GPT4. The original GPT4 reports in late-2022 & 2023 already established that it's well beyond an average human in professional fields: https://www.microsoft.com/en-us/research/publication/sparks-.... They failed to outright replaced humans at work not because of lacking intelligence.

We may have progressed from a 99%-accurate chatbot to one that's 99.9%-accurate, and you'd have a hard time telling them apart in normal real world (dumb) applications. A paradigm shift is needed from the current chatbot interface to a long-lived stream of consciousness model (e.g. a brain that constantly reads input and produces thoughts at 10ms refresh rate; remembers events for years and keep the context window from exploding; paired with a cerebellum to drive robot motors, at even higher refresh rates.)

As long as we're stuck at chatbots, LLM's impact on the real world will be very limited, regardless of how intelligent they become.

reply
peepeepoopoo97
14 hours ago
[-]
O3 is multiple orders of magnitude more expensive to realize a marginal performance gain. You could hire 50 full time PhDs for the cost of using O3. You're witnessing the blowoff top of the scaling hype bubble.
reply
whynotminot
14 hours ago
[-]
What they’ve proven here is that it can be done.

Now they just have to make it cheap.

Tell me, what has this industry been good at since its birth? Driving down the cost of compute and making things more efficient.

Are you seriously going to assume that won’t happen here?

reply
Jensson
14 hours ago
[-]
> What they’ve proven here is that it can be done.

No they haven't, these results do not generalize, as mentioned in the article:

"Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute"

Meaning, they haven't solved AGI, and the task itself do not represent programming well, these model do not perform that well on engineering benchmarks.

reply
whynotminot
14 hours ago
[-]
Sure, AGI hasn’t been solved today.

But what they’ve done is show that progress isn’t slowing down. In fact, it looks like things are accelerating.

So sure, we’ll be splitting hairs for a while about when we reach AGI. But the point is that just yesterday people were still talking about a plateau.

reply
peepeepoopoo97
14 hours ago
[-]
About 10,000 times the cost for twice the performance sure looks like progress is slowing to me.
reply
whynotminot
14 hours ago
[-]
Just to be clear — your position is that the cost of inference for o3 will not go down over time (which would be the first time that has happened for any of these models).
reply
peepeepoopoo97
13 hours ago
[-]
Even if compute costs drop by 10X a year (which seems like a gross overestimate IMO), you're still looking at 1000X the cost for a 2X annual performance gain. Costs outpacing progress is the very definition of diminishing returns.
reply
whynotminot
13 hours ago
[-]
From their charts, o3 mini outperforms o1 using less energy. I don’t see the diminishing returns you’re talking about. Improvement outpacing cost. By your logic, perhaps the very definition of progress?

You can also use the full o3 model, consume insane power, and get insane results. Sure, it will probably take longer to drive down those costs.

You’re welcome to bet against them succeeding at that. I won’t be.

reply
YeGoblynQueenne
14 hours ago
[-]
>> Now they just have to make it cheap.

Like they've been making it all this time? Cheaper and cheaper? Less data, less compute, fewer parameters, but the same, or improved performance? Not what we can observe.

>> Tell me, what has this industry been good at since its birth? Driving down the cost of compute and making things more efficient.

No, actually the cheaper compute gets the more of it they need to use or their progress stalls.

reply
whynotminot
13 hours ago
[-]
> Like they've been making it all this time?

Yes exactly like they’ve been doing this whole time, with the cost of running each model massively dropping sometimes even rapidly after release.

reply
peepeepoopoo97
14 hours ago
[-]
Yes, that's exactly what I'm implying, otherwise they would have done it a long time ago, given that the fundamental transformer architecture hasn't changed since 2017. This bubble is like watching first year CS students trying to brute force homework problems.
reply
whynotminot
14 hours ago
[-]
> Yes, that's exactly what I'm implying, otherwise they would have done it a long time ago

They’ve been doing it literally this entire time. O3-mini according to the charts they’ve released is less expensive than o1 but performs better.

Costs have been falling to run these models precipitously.

reply
MVissers
14 hours ago
[-]
I would agree if the cost of AI compute over performance hasn't been dropping by more than 90-99% per year since GPT3 launched.

This type of compute will be cheaper than Claude 3.5 within 2 years.

It's kinda nuts. Give these models tools to navigate and build on the internet and they'll be building companies and selling services.

reply
fspeech
14 hours ago
[-]
That's a very static view of the affairs. Once you have a master AI, at a minimum you can use it to train cheaper slightly less capable AIs. At the other end the master AI can train to become even smarter.
reply
Bolwin
13 hours ago
[-]
The high efficiency version got 75% at just $20/task. When you count the time to fill in the squares, that doesn't sound far off from what a skilled human would charge
reply
kenjackson
14 hours ago
[-]
People act as if GPT-4 came out 10 years ago.
reply
Jensson
14 hours ago
[-]
> how is this a plateau since gpt-4? this is significantly better

Significantly better at what? A benchmark? That isn't necessarily progress. Many report preferring gpt-4 to the newer o1 models with hidden text. Hidden text makes the model more reliable, but more reliable is bad if it is reliably wrong at something since then you can't ask it over and over to find what you want.

I don't feel it is significantly smarter, it is more like having the same dumb person spend more thinking than the model getting smarter.

reply
tigershark
14 hours ago
[-]
Where is the plateau? Chatgtp 4 was ~0% in ARC-AGI. 4o was 5%. This model literally solved it with a score higher than the 85% of the average human. And let’s not forget the unbelievable 25% in frontier math, where all the most brilliant mathematicians in the world cannot solve by themselves a lot of the problems. We are speaking about cutting edge math research problems that are out of reach from practically everyone. You will get a rude awakening if you call this unbelievable advancement a “plateau”.
reply
csomar
14 hours ago
[-]
I don't care about benchmarks. O1 ranks higher than Claude on "benchmarks" but performs worse on particular real life coding situations. I'll judge the model myself by how useful/correct it is for my tasks rather than a hypothetical benchmarks.
reply
og_kalu
12 hours ago
[-]
In most non-competitive coding benchmarks (aider, live bench, swe-bench), o1 ranks worse than Sonnet (so the benchmarks aren't saying anything different) or at least did, the new checkpoint 2 days ago finally pushed o1 over sonnet on livebench.
reply
whynotminot
14 hours ago
[-]
“Objective benchmarks are useless, let’s argue about which one works better for me personally.”
reply
csomar
13 hours ago
[-]
Yes. My benchmarks and their benchmarks means AGI. Their benchmarks only means over-fitted.
reply
whynotminot
13 hours ago
[-]
Ok so what if we get different results for our own personal benchmarks/use cases.

(See why objective benchmarks exist?)

reply
bakugo
14 hours ago
[-]
Yes, "objective" benchmarks can be gamed, real-life tasks cannot.
reply
tigershark
12 hours ago
[-]
As I said, o3 demonstrated field medal level research capacity in the frontier math tests. But I’m sure that your use cases are much more difficult than that, obviously.
reply
riku_iki
8 minutes ago
[-]
there are many comments in internet about this, that only subset of frontier math benchmark is "field medal level research", and o3 likely scored on easier subset.

Also, all that stuff is shady in the way that it is just numbers from OAI, which are not reproducible on benchmark sponsored by OAI. If we say OAI could be bad actor, they had plenty of opportunities to cheat on this.

reply
YeGoblynQueenne
14 hours ago
[-]
AI benchmarks and tests that claim to measure understanding, reasoning, intelligence, and so on are a dime a dozen. Chess, Go, Atari, Jeopardy, Raven's Progressive Matrices, the Winograd Schema Challenge, Starcraft... and so on and so forth.

Or let's talk about the breakthroughs. SVMs would lead us to AGI. Then LSTMs would lead us to AGI. Then Convnets would lead us to AGI. Then DeepRL would lead us to AGI. Now Transformers will lead us to AGI.

Benchmarks fall right and left and we keep being led to AGI but we never get there. It leaves one with such a feeling of angst. Are we ever gonna get to AGI? When's Godot coming?

reply
dyauspitr
14 hours ago
[-]
Did you read the article at all? We’re definitely not plateauing.
reply
prpl
13 hours ago
[-]
In 2016 I was asked by an Uber driver in Pittsburgh when his job would be obsolete (I’d worked around Zoox people quite a bit and Uber basically was all-in at CMU.

I told him it was at least 5 years, probably 10, though he was sure it would be 2.

I was arguably “right”, 2023-ish is probably going to be the date people put down in the books, but the future isn’t evenly distributed. It’s at least another 5 years, and maybe never, before things are distributed among major metros, especially those with ice. Even then, the AI is somehow more expensive than human solution.

I don’t think it’s in most companies interest to price AI way below the price of meat, so meat will hold out for a long time, maybe long enough for you to retire even

reply
esafak
12 hours ago
[-]
Just don't have kids?
reply
prpl
12 hours ago
[-]
you can have kids, but they can’t be salesman. Maybe carpenters
reply
throw83288
15 hours ago
[-]
This is me as well. Either:

1) Just give up computing entirely, the field I've been dreaming about since childhood. Perhaps if I immiserate myself with a dry regulated engineering field or trade I would perhaps survive to recursive self-improvement, but if anything the length it takes to pivot (I am a Junior in College that has already done probably 3/4th of my CS credits) means I probably couldn't get any foothold until all jobs are irrelevant and I've wasted more money.

2) Hard pivot into automation, AI my entire workflow, figure out how to use the bleeding edge of LLMs. Somehow. Even though I have no drive to learn LLMs and no practical project ideas with LLMs. And then I'd have to deal with the moral burden that I'm inflicting unfathomable hurt on others until recursive self-improvement, and after that it's simply a wildcard on what will happen with the monster I create.

It's like I'm suffocating constantly. The most I can do to "cope" is hold on to my (admittedly weak) faith in Christ, which provides me peace knowing that there is some eternal joy beyond the chaos here. I'm still just as lost as you.

reply
TheRizzler
15 hours ago
[-]
Yes, some tasks, even complex tasks will become more automated, and machine driven, but that will only open up more opportunities for us as humans to take on more challenging issues. Each time a great advancement comes we think it's going to kill human productivity, but really it just amplifies it.
reply
throw83288
28 minutes ago
[-]
Where this ends is general intelligence though, where all more challenging tasks can simply be done by the model.

The scenario I fear is a "selectively general" model that can successfully destroy the field I'm in but keep others alive for much longer, but not long enough for me to pivot into them before actually general intelligence.

reply
melagonster
1 hour ago
[-]
Don't worry, they will hire somebody to control AI...
reply
sensanaty
3 hours ago
[-]
Dude, you're buying into the hype way too hard. All of this LLM shit is being massively overhyped right now because investors are single-minded morons who only care about cashing out a ~year from now for triple what they put in. Look at the YCombinator batches, 90+% of them have some mention of AI in their pitch even if it's hilariously useless to have AI. You've got toothbrushes advertising AI features. It's a gold rush of people trying to get in on the hype while they still can, I guarantee you the strategy for 99% of the YCombinator AI batch is to get sold to M$ or Google for a billion bucks, not build anything sustainable or useful in any way.

It's a massive bubble, and things like these "benchmarks" are all part of the hype game. Is the tech cool and useful? For sure, but anyone trying to tell you this benchmark is in any way proof of AGI and will replace everyone is either an idiot or more likely has a vested interest in you believing them. OpenAI's whole marketing shtick is to scare people into thinking their next model is "too dangerous" to be released thus driving up hype, only to release it anyway and for it to fall flat on its face.

Also, if there's any jobs LLMs can replace right now, it's the useless managerial and C-suite, not the people doing the actual work. If these people weren't charlatans they'd be the first ones to go while pushing this on everyone else.

reply
barney54
14 hours ago
[-]
Dude chill! Eight years ago, I remember driving to some relatives for Thanksgiving and thinking that self-driving cars were just around the corner and how it made no sense for people to learn how to drive semis. Here we are eight years later and self-driving semis aren't a thing--yet. They will be some day, but we aren't there yet.

If you want to work in computing, then make it happen! Use the tools available and make great stuff. Your computing experience will be different from when I graduated from college 25 years ago, but my experience with computers was far different from my Dad's. Things change. Automation changes jobs. So far, it's been pretty good.

reply
nisa
15 hours ago
[-]
Honestly how about stop stressing and bullshitting yourself to death and instead focus on learning and mastering the material in your cs education. There is so much that ai as in openai api or hugging face models can't do yet or does poorly and there are more things to cs than churning out some half-broken JavaScript for some webapp.

It's powerful and world changing but it's also terrible overhyped at the moment.

reply
j7ake
14 hours ago
[-]
The solution is neither: you find a way to work with automation but retain your voice and craft.
reply
myko
12 hours ago
[-]
spend a little time learning how to use LLMs and i think you'll be less scared. they're not that good at doing the job of a software developer.
reply
baron816
14 hours ago
[-]
What I keep telling people is, if it becomes possible for one person or a handful of people to build and maintain a Google scale company, and my job gets eliminated as a result, then I’m going to go out and build a Google scale company.

There’s an incredibly massive amount of stuff the world needs. You probably live in a rich country, but I doubt you are lacking for want. There are billionaires who want things that don’t exist yet. And, of course, there are billions of regular folks who want some of the basics.

So long as you can imagine a better world, there will be work for you to do. New tools like AGI will just make it more accessible for you to build your better future.

reply
chairmansteve
13 hours ago
[-]
Think of AI as an excavator. You know, those machines that dig holes. 70 years ago, those holes would have been dug by 50 men with shovels. Now it's one guy in an excavator. But we don't have mass unemployment. The excavator just creates more work for bricklayers, carpenters etc.

If AI lives up to hype, you could be the excavator driver. Or, the AI will create a ton of upstream and downstream work. There will be no mass unemployment.

reply
euvin
13 hours ago
[-]
If AGI is the excavator, why wouldn't it become the driver, bricklayer, and carpenter as well?
reply
throwaway2037
9 hours ago
[-]
Jokes aside, I think building a useful, strong, agile humanoid robot that is affordable for businesses (first), then middle class homes will prove much harder than AGI.
reply
zmgsabst
12 hours ago
[-]
Horses never recovered from mechanization.
reply
chairmansteve
10 hours ago
[-]
True, but humans did. Horses were the machine that became obsolete. Just like the guys with shovels.
reply
postsantum
12 hours ago
[-]
They have been promoted to pets. Oh wait..
reply
realce
13 hours ago
[-]
Is there any possible technology that could make labor, mastery, or human expirence obsolete?

Are there no limits to this argument? Is it some absolute universal law that all new creations just create increasing economic opportunities?

reply
antihipocrat
15 hours ago
[-]
Your performance on these tests would be equivalent to the highest performing model, and you would be much cheaper.

Investment in human talent augmented by AI is the future.

reply
kenjackson
14 hours ago
[-]
That’s the least reassuring phrasing I could imagine. If you’re betting on costs not reducing for compute then you’re almost always making the wrong bet.
reply
antihipocrat
14 hours ago
[-]
If I listened to the naysayers back in the day I would have never entered the tech industry (offshoring etc). Yes, that does somewhat prove you're point given that those predictions were cost driven.

Having used AI extensively I don't feel my future is at risk at all, my work is enhanced not replaced.

reply
fjdjshsh
14 hours ago
[-]
I think you're missing the point. Offshoring (moving the job of, say, a Canadian engineer to an engineer from Belarus) has a one time cost drop, but you can't keep driving the cost down (paying the Belarus engineer less and less). If anything, the opposite is the case, since global integration means wages don't keep diverging.

The computing cost, on the other hand, is a continuous improvement. If (and it's a big if) a computer can do your job, we know the costs will keep getting lower year after year (maybe with diminishing returns, but this AI technology is pretty new so we're still seeing increasing returns)

reply
danparsonson
13 hours ago
[-]
The AI technology is new but the compute technology is not; we're getting close the physical limits of how small we can make things, so it's not clear to me at least how much more performance we can squeeze out of the same physical space, rather than scaling up which tends to make things more expensive not less.
reply
ApolloFortyNine
13 hours ago
[-]
>Seems like we’re headed toward a world where you automate someone else’s job or be automated yourself.

This has essentially been happening for thousands of years. Any optimization to work of any kind reduces the number of man hours required.

Software of pretty much any form is entirely that. Even early spreadsheet programs would replace a number of jobs at any company.

reply
anshulbhide
12 hours ago
[-]
You're actually positioned to have an amazing career.

Everyone needs to know how to either build or sell to be successful. In a world where the ability to the former is rapidly being commoditised, you will still need to sell. And human relationships matter more than ever.

reply
Art9681
13 hours ago
[-]
It's a tool. You learn to master it or not. I have greybeard coworkers that dissed the technology as a fad 3 years ago. Now they are scrambling to catch up. They have to do this while sustaining a family with pets and kids and mortgages and full time senior jobs.

You're in a position to invest substantial amounts of time compared to your seniors. Leverage that opportunity to your advantage.

We all have access to these tools for the most part, so the distinguishing factor is how much time you invest and how much more ambitious you become once you begin to master the tool.

This time its no different. Many Mechanical and Sales students in the past never got jobs in those fields either. Decades before AI. There were other circumstances and forces at play and a degree is not a guaranteed career in anything.

Keep going because what we DO know is that trying wont guarantee results, we DO know that giving up definitely won't. Roll the dice in your favor.

reply
callc
12 hours ago
[-]
> I have greybeard coworkers that dissed the technology as a fad 3 years ago. Now they are scrambling to catch up. They have to do this while sustaining a family with pets and kids and mortgages and full time senior jobs.

I want to criticize Art’s comment on the grounds of ageism or something along the lines of “any amount life outside of programming is wasted”, but regardless of Art’s intention there is important wisdom here. Use your free time wisely when you don’t have much responsibilities. It is a superpower.

As for whether to spend it on AI, eh, that’s up to you to decide.

reply
Art9681
3 hours ago
[-]
It's totally valid criticism. What I meant is that if an individual's major concern is employment, then it would be prudent to invest the amount of time necessary to ensure a favorable outcome. And given whatever stage in life they are at, use the circumstance you have in your favor.

I'm a greybeard myself.

reply
hoekit
14 hours ago
[-]
As engineers, we solve problems. Picking a problem domain close to your heart that intersects with your skills will likely be valued - and valuable. Engage the work, aim to understand and solve the human problems for those around you, and the way forward becomes clearer. Human problems (food, health, safety) are generally constant while tools may change. Learn and use whatever tools to help you, be it scientific principles, hammers or LLMs. For me, doing so and living within my means has been intrinsically satisfying. Not terribly successful materially but has been a good life so far. Good luck.
reply
post-it
14 hours ago
[-]
As long as your chosen profession isn't completing AI benchmarks for money, you should be okay.
reply
antman
14 hours ago
[-]
I think we are pretty far. I am not devaluing the o3 capability but going through actual dataset the definition of "handling novel tasks" is pretty limited. The curse of large context of llms is especially present engineering projects and does not appear it will not end up producing the plans of a bridge, or an industrial process. Sone of tasks with smaller contexts sure can be assisted, but you cant RAG or Agent a full solution for the foreseeable future. O3 adds capability towards agi, but in reality actual infinite context with less intelligence would be more disrupting at a shorter time if one was to choose.
reply
YeGoblynQueenne
14 hours ago
[-]
I suppose now that we have the technology to automatically solve coloured grid puzzles, mechanical engineering is obsolete.
reply
myko
12 hours ago
[-]
LLMs are mostly hype. They're not going to change things that much.
reply
obirunda
11 hours ago
[-]
Yeah, it may feel scary but the biggest issue yet to be overcome is that to replace engineers you need reliable long horizon problem solving skills. And crucially, you need to not be easily fooled by the progress or setbacks of a project.

These benchmark accomplishments are awesome and impressive, but you shouldn't operate on the assumption that this will emerge as an engineer because it performs well on benchmarks.

Engineering is a discipline that requires understanding tools, solutions and every project requires tiny innovations. This will make you more valuable, rather than less. Especially if you develop a deep understanding of the discipline and don't overly rely on LLMs to answer your own benchmark questions from your degree.

reply
textlapse
14 hours ago
[-]
Imagine graduating in architecture or mechanical engineering around the time PCs just came out. There were people who probably panicked.

But the arc of time intersects quite nicely with your skills if you steer it over time.

Predicting it or worrying about it does nothing.

reply
sigbottle
12 hours ago
[-]
Side note: Why do I keep seeing disses to mechanical engineering here? How is that possibly a less valuable degree than web dev or a standard CRUD backend job?

Especially with AI provably getting extremely smart now, surely engineering disciplines would be having a boon as people want these things in their homes for cheaper for various applications.

reply
hatefulmoron
8 hours ago
[-]
Was he dissing mechanical engineering? I thought he was saying that they might have been panicked but were ultimately fine.
reply
eidorb
15 hours ago
[-]
Do what you enjoy. (This is easier said than done.) What else could you do, worry?
reply
AnimalMuppet
13 hours ago
[-]
The future belongs to those who believe there will be one.

That is: If you don't believe there will be a future, you give up on trying to make one. That means that any kind of future that takes persistent work becomes unavailable to you.

If you do believe that there will be a future, you keep working. That doesn't guarantee there will be a future. But not working pretty much guarantees that there won't be one, at least not one worth having.

reply
m3kw9
13 hours ago
[-]
Always need to believe AI needs to be operated by humans, when it can go end to end to replace a human, you will likely not need to worry about money.
reply
aussieguy1234
13 hours ago
[-]
Full on mechanical engineering needs a body. While there are companies working on embodiment, were not there yet.

It'll be some time before there is a robot with enough spatial reasoning to do complicated physical work with no prior examples.

reply
cheriot
14 hours ago
[-]
I graduated high school in '02 and everyone assured me that all tech jobs were being sent to India. "Don't study CS," they said. Thankfully I didn't listen.

Either this is the dawn of something bigger than the industrial revolution or you'll have ample career opportunity. Understanding how things work and how people work is a powerful combination.

reply
AI_beffr
15 hours ago
[-]
even if you had a billion dollars and a private island you still wouldnt be ready for whats coming. consider the fact that the global order is an equilibrium where the military and economic forces of each country in the world are pushing against each other... where the forces find a global equilibrium is where borders are. each time in history that technology changed, borders changed because the equilibrium was disturbed. there is no way to escape it: agi will lead to global war. the world will be turned upside down. we are entering into an existential sinkhole. and the idiots in silicon valley are literally driving the whole thing forward as fast as possible.
reply
martin82
14 hours ago
[-]
buy bitcoin.

when the last job has been automated away, millions of AIs globally will do commerce with each other and they will use bitcoin to pay each other.

as long as the human race (including AIs) produces new goods and services, the purchasing power of bitcoin will go up, indefinitely. even more so once we unlock new industries in space (settlements on the Moon and Mars, asteroid mining etc).

The only thing that can make a dent into bitcoin's purchasing power would be all out global war where humanity destroys more than it creates.

The only other alternative is UBI, which is Communism and eternal slavery for the entire human race except the 0.0001% who run the show.

Chose wisely.

reply
conception
14 hours ago
[-]
This must be a joke since you must know how many people control the majority of bitcoin.
reply
HDThoreaun
14 hours ago
[-]
Bitcoin is a horrible currency. Its a fun proof of concept but not a scalable payment solution. Currency needs to be stable and cheap to transfer.
reply
killjoywashere
17 hours ago
[-]
I just want it to do my laundry.
reply
cryptoegorophy
22 hours ago
[-]
Besides higher scores - is there any improvements for a general use? Like asking to help setup home assistant etc etc?
reply
myrloc
16 hours ago
[-]
What is the cost of "general intelligence"? What is the price?
reply
ripped_britches
15 hours ago
[-]
About $3.50
reply
binarymax
21 hours ago
[-]
All those saying "AGI", read the article and especially the section "So is it AGI?"
reply
suprgeek
11 hours ago
[-]
Don't be put off by the reported high-cost

Make it possible->Make it fast->Make it Cheap

the eternal cycle of software.

Make no mistake - we are on the verge of the next era of change.

reply
prng2021
15 hours ago
[-]
I’m confused about the excitement. Are people just flat out ignoring the sentences below? I don’t see any breakthrough towards AGI here. I see a model doing great in another AI test but about to abysmally fail a variation of it that will come out soon. Also, aren’t these comparisons completely nonsense considering it’s o3 tuned vs other non-tuned?

> Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).

reply
oakpond
5 hours ago
[-]
Me too. This looks to me like a holiday PR stunt. Get everybody to talk about AI during the Christmas parties.
reply
jaspa99
19 hours ago
[-]
Can it play Mario 64 now?
reply
thatxliner
20 hours ago
[-]
> verified easy for humans, harder for AI

Isn’t that the premise behind the CAPTCHA?

reply
kirab
13 hours ago
[-]
FYI: Codeforces competitive programming scores (basically only) by time needed until valid solutions are posted

https://codeforces.com/blog/entry/133094

That means.. this benchmark is just saying o3 can write code faster than must humans (in a very time-limited contest, like 2 hours for 6 tasks). Beauty, readability or creativity is not rated. It’s essentially a "how fast can you make the unit tests pass" kind of competition.

reply
sigbottle
13 hours ago
[-]
Creativity is inherently rated because it's codeforces... most 2700 problems have unique, creative solutions.
reply
bilsbie
18 hours ago
[-]
When is this available? Which plans can use it?
reply
epigramx
10 hours ago
[-]
I bet it still thinks 1+1=3 if it read enough sources parroting that.
reply
dyauspitr
14 hours ago
[-]
I wish there was a way to see all the attempts it got right graphically like they show the incorrect ones.
reply
Sparkyte
13 hours ago
[-]
Kinda expensive though.
reply
c1b
20 hours ago
[-]
So o1 pro is CoT RL and o3 adds search?
reply
theincredulousk
9 hours ago
[-]
Denoting it in $ for efficiency is peak capitalism, cmv.
reply
Havoc
16 hours ago
[-]
Did they just skip o2?
reply
nextworddev
16 hours ago
[-]
Yes. For branding reasons since o2 is a telco brand in the UK
reply
Havoc
2 hours ago
[-]
ah right...makes sense
reply
tmaly
21 hours ago
[-]
Just curious, I know o1 is a model OpenAI offers. I have never heard of the o3 model. How does it differ from o1?
reply
kittikitti
12 hours ago
[-]
Congratulations
reply
dkrich
14 hours ago
[-]
These tests are meaningless until You show them doing mundane tasks
reply
rimeice
20 hours ago
[-]
Never underestimate a droid
reply
jack_pp
20 hours ago
[-]
AGI for me is something I can give a new project to and be able to use it better than me. And not because it has a huge context window, because it will update its weights after consuming that project. Until we have that I don't believe we have truly reached AGI.

Edit: it also tests the new knowledge, it has concepts such as trusting a source, verifying it etc. If I can just gaslight it into unlearning python then it's still too dumb.

reply
TypicalHog
22 hours ago
[-]
This is actually mindblowing!
reply
jdefr89
18 hours ago
[-]
Uhhhh… It was trained on ARC data? So they targeted a specific benchmark and are surprised and blown away the LLM performed well in it? What’s that law again? When a benchmark is targeted by some system the benchmark becomes useless?
reply
forgottofloss
16 hours ago
[-]
Yeah, seriously. The style of testing is public, so some engineers at OpenAI could easily have spent a few months generating millions of permutations of grid-based questions and including those in the original data for training the AI. Handshakes all around, publicity for everyone.
reply
ripped_britches
15 hours ago
[-]
They are running a business selling access these models to enterprises and consumers. People won’t pay for stuff that doesn’t solve real problems. Nobody pays for stuff just because of a benchmark. It’d be really weird to become obsessed with metrics gaming rather than racing to build something smarter than the other guys. Nothing wrong with curating any type of training set that actually produces something that is useful.
reply
brcmthrowaway
15 hours ago
[-]
How to invest in this stonk market
reply
cchance
22 hours ago
[-]
Is it just me or does looking at the ARC-AGI example questions at the bottom... make your brain hurt?
reply
drdaeman
22 hours ago
[-]
Looks pretty obvious to me, although, of course, it took me a few moments to understand what's expected as a solution.

c6e1b8da is moving rectangular figures by a given vector, 0d87d2a6 is drawing horizontal and/or vertical lines (connecting dots at the edges) and filling figures they touch, b457fec5 is filling gray figures with a given repeating color pattern.

This is pretty straightforward stuff that doesn't require much spatial thinking or keeping multiple things/aspects in memory - visual puzzles from various "IQ" tests are way harder.

This said, now I'm curious how SoTA LLMs would do on something like WAIS-IV.

reply
randyrand
21 hours ago
[-]
I'll sound like a total douche bag - but I thought they were incredibly obvious - which I think is the point of them.

What took me longer was figuring out how the question was arranged, i.e. left input, right output, 3 examples each

reply
nprateem
19 hours ago
[-]
There should be a benchmark that tells the AI it's previous answer was wrong and test the number of times it either corrects itself or incorrectly capitulates, since it seems easy to trip them up when they are in fact right.
reply
cubefox
20 hours ago
[-]
This was a surprisingly insightful blog post, going far beyond just announcing the o3 results.
reply
airstrike
22 hours ago
[-]
Uhh...some of us are apparently living under a rock, as this is the first time I hear about o3 and I'm on HN far too much every day
reply
burningion
21 hours ago
[-]
I think it was just announced today! You're fine!
reply
owenpalmer
7 hours ago
[-]
Someone asked if true intelligence requires a foundation of prior knowledge. This is the way I think about it.

I = E / K

where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.

For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math, without using any of the "tricks" that A already knew.

Now back to the question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers the question.

Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.

IK = E

low intelligence * vast knowledge = reasonable effectiveness

reply
someothherguyy
7 hours ago
[-]
reply
empiko
6 hours ago
[-]
Well put. You ask LLMs about ARC-like challenges and they are able to come up with a list of possible problem formulations even before you show them the input. The models already know that they might expect various object manipulations, symmetry problem, etc. The fact that the solution costs thousands of dollars says to me that the model iterates over many solutions while using this implicit knowledge and feedback it gets from running the program. It is still impressive, but I don't think this is what the ARC prize was supposed to be about.
reply
curl-up
6 hours ago
[-]
> while using this implicit knowledge and feedback it gets from running the program.

What feedback, and what program, are you referring to?

reply
empiko
5 hours ago
[-]
I assume that o3 can run Python scripts and observe the outputs.
reply
scotty79
6 hours ago
[-]
Basically solutions that were doing well in arc just threw thousands of ideas at the wall and picked the ones that stuck. They were literally generating thousands of python programs, running them and checking if any produced the correct output when fed with data from examples.

This o3 doesn't need to run python. It itself executes programs written in tokens inside it's own context window which is wildly inefficient but gives better results and is potentially more general.

reply
TheOtherHobbes
5 hours ago
[-]
So basically it's a massively inefficient trial-and-error leetcode solver which only works because it throws incredible amounts of compute at the problem.

This is hilarious.

reply
lorepieri
7 hours ago
[-]
There should be also a factor about resource consumption. See here: https://lorenzopieri.com/pgii/
reply
xlii
6 hours ago
[-]
An interesting point from a philosophical perspective!

But if we'd take this into consideration would it mean that 1st world engineer is by definition less inteligent than 3rd world one?

I think the (completely reasonable) knee jerk reaction is a definsive one, but I can imagine absolutarian regime escapee working side-by-side an engineer groomed in expensive, air conditioned lecture rooms. In this imaginary scenario escapee, even if slower and less efficient at the problem at hand would have to be more intelligent generally.

reply
eru
2 hours ago
[-]
That's a bit silly.

Yes, resource consumption is important. But your car guzzling a lot of gas doesn't mean he drives slower. It just means it drives slower per mol of petrol consumed.

It's good to know whether your system has a high or low 'bang for buck' metric, but that doesn't directly affect how much bang you get.

reply
spacebanana7
7 hours ago
[-]
Also perhaps a factor (with diminishing returns) for response speed?

All else equal, a student who gets 100% on a problem set in 10 minutes is more intelligent than one with the same score after 120 minutes. Likewise an LLM that can respond in 2 seconds is more impressive than one which responds in 30 seconds.

reply
owenpalmer
7 hours ago
[-]
> a student who gets 100% on a problem set in 10 minutes is more intelligent than one with the same score after 120 minutes

According to my mathematical model, the faster student would have higher effectiveness, not necessarily higher intelligence. Resource consumption and speed are practical technological concerns, but they're irrelevant in a theorical conceptualization of intelligence.

reply
baq
6 hours ago
[-]
If you disregard time, all computers have maximal intelligence, they can enumerate all programs and compute answers to any decidable question.
reply
wouldbecouldbe
6 hours ago
[-]
Yeah speed is a key factor in intelligence. And actually one of the biggest differentiators in human iq measurements
reply
eru
2 hours ago
[-]
Humans are a bit annoying that way, because it's all correlated.

So a human with a better response time, also tends to give you more intelligent answers, even when time is not a factor.

For a computer, you can arbitrarily slow them down (or speed them up), and still get the same answer.

reply
coffeebeqn
5 hours ago
[-]
Maybe. If I could ask a AI to come up with a 50% efficient mass market solar panel, I don’t really care if it takes a few weeks or a year if it can solve that though. I’m not sure if inventiveness or novelness of solution could be a metric. I suppose that is superintelligence rather than AGI? And by then there would be no question of what it is
reply
Terr_
6 hours ago
[-]
> response time

Imagine you take an extraordinarily smart person, and put them on a fast spaceship that causes time dilation.

Does that mean that they are stupider while in transit, and they regain their intelligence when it slows down?

reply
zoky
5 hours ago
[-]
Who is a better free-thrower, someone who can hit 20 free throws per minute on Earth, or the same thrower who logged 20 million free throws in the apparent two years he was gone but comes back ready for retirement?
reply
Earw0rm
6 hours ago
[-]
No, because intelligence is relative to your local context.
reply
Terr_
6 hours ago
[-]
Why should one kind of phenomenon which slows down performance on the test be given a special "you're more intelligent than you seem" exception, but not others?

If we are required to break the seal on the black-box and investigate the exactly how the agent is operating in order to judge its "intelligence"... Doesn't that kinda ruin the up-thread stuff about judging with equations?

reply
wangii
7 hours ago
[-]
Interesting formulation! it captures the intuition of the "smartness" when solving a problem. However, what about asking good questions or proposing conjectures?
reply
hanspeter
7 hours ago
[-]
Aren't those solutions to problems as well?

Find the best questions to ask. Find the best hypothesis to suggest.

reply
onemetwo
6 hours ago
[-]
An intelligent system could take more advantage of an increase of knowledge than a dumb one, so I should propose a simple formula: the derivative of efficiency with respect to knowledge is proportional to intelligence.

$$ I = \frac{partial E}{partial K} \simeq \frac{\delta E}{\delta K} $$

In order to estimate $I$ you have to consider that efficiency and knowledge are task related, so you could take some weighted mean $sum_T C(E,K,T)*I(E,K,T)$ where $T$ is task category. I am thinking in $C(E,K,T)$ as something similar to thermal capacity or electrical resistance, the equivalent concept when applied to task. An intelligent agent in a medium of low resistance should fly while a dumb one would still crawl.

reply
owenpalmer
5 hours ago
[-]
> An intelligent system could take more advantage of an increase of knowledge than a dumb one

Why?

> derivative of efficiency

Where did your efficiency variable come from?

reply
onemetwo
5 hours ago
[-]
Why? I am using dumb as a low intelligence system. A more intelligent person can take advantage of new opportunities. Efficience variable: You are right that effectiveness could be better here because we are not considering resources like computer time and power.
reply
Woodi
7 hours ago
[-]
Yep, I aways liked encyclopedia. Wiki is good too :)

What I would like to have in the future is SO answering-peoples accessible in real time via IRC. They have real answers NOW. They are even pedantic about their stuff !

reply
dmezzetti
6 hours ago
[-]
We should wait until it's released before we anoint it. It's disheartening to see how we keep repeating the same pattern that gives in to hype over the scientific method.
reply
lazide
6 hours ago
[-]
The scientific method doesn’t drive stock price (apparently).
reply
scotty79
5 hours ago
[-]
As a kid I absolutely hated math and loved physics and chemistry because solving anything in math requires vast specific K.

In comparison you can easily know everything there is to know about physics or chemistry and it's sufficient to solve interesting puzzles. In math every puzzle has it's own vast lore you need to know before you can have any chance at tackling it.

reply
owenpalmer
5 hours ago
[-]
Physics and chemistry require experimentation to verify solutions. With math however, any new knowledge can be intuited and proven from previous proofs, so yes, the lore goes deep!
reply
gardenhedge
6 hours ago
[-]
Where did someone ask that?
reply
iLoveOncall
17 hours ago
[-]
It's beyond ridiculous how the definition of AGI has shifted from being an AI that's so good it can improve itself entirely independently infinitely to "some token generator that can solve puzzles that kids could solve after burning tens of thousands of dollars".

I spend 100% of my work time working on a GenAI project, which is genuinely useful for many users, in a company that everyone has heard about, yet I recognize that LLMs are simply dogshit.

Even the current top models are barely usable, hallucinate constantly, are never reliable and are barely good enough to prototype with while we plan to replace those agents with deterministic solutions.

This will just be an iteration on dogshit, but it's the very tech behind LLMs that's rotten.

reply
uncomplexity_
15 hours ago
[-]
it's official old buddy, i'm a has been.
reply
duluca
11 hours ago
[-]
The first computers cost millions of dollars and filled entire rooms to accomplish what we would now consider simple computational tasks. That same computing power now fits into the width of a finger nail. I don’t get how technologists balk at the cost of experimental tech or assume current tech will run at the same efficiency for decades to come and melt the planet into a puddle. AGI won’t happen until you can fit enough compute that’d take several data center’s worth of compute into a brain sized vessel. So the thing can move around process the world in real time. This is all going to take some time to say the least. Progress is progress.
reply
8n4vidtmkvmk
9 hours ago
[-]
I thought you were going to say that now we're back to bigger-than-room sized computers that cost many millions just to perform the same tasks we could 40 years ago.

I of course mean we're using these LLMs for a lot of tasks that they're inappropriate for, and a clever manually coded algorithm could do better and much more efficiently.

reply
arthurcolle
9 hours ago
[-]
just ask the LLM to solve enough problems (even new problems), cache the best, do inference time compute for the rest, figure out the best/ fastest implementations, and boom, you have new training data for future AIs
reply
owenpalmer
9 hours ago
[-]
> cache the best

How do you quantify that?

reply
martinkallstrom
8 hours ago
[-]
"Assume the role of an expert in cache invalidation..."
reply
DyslexicAtheist
8 hours ago
[-]
"one does not just assume", "because the hardest problems in Tech are Johnny Cash invalidations" --Lao Tzi
reply
Terr_
6 hours ago
[-]
> "Those who invalidate caches know nothing; Those who know retain data." These words, as I am told, were spoken by Lao Tzi. If we are to believe that Lao Tzi was himself one who knew, why did he erase /var/tmp to make space for his project?

-- Poem by Cybernetic Bai Juyi, "The Philosopher [of Caching]"

reply
pavlov
6 hours ago
[-]
“Assume the role of an expert in naming things. You know, a… what do they call those people again… there must be a name for it”
reply
arthurcolle
1 hour ago
[-]
however you want
reply
adwn
9 hours ago
[-]
> and a clever manually coded algorithm could do better and much more efficiently.

Sure, but how long would it take to implement this algorithm, and would that be worth it for one-off cases?

Just today I asked Claude to create a jq query that looks for objects with a certain value for one field, but which lack a certain other field. I could have spent a long time trying to make sense of jq's man page, but instead I spent 30 seconds writing a short description of what I'm looking for in natural language, and the AI returned the correct jq invocation within seconds.

reply
freehorse
9 hours ago
[-]
I don’t think this is a bad use. A bad use would be to give Claude the dataset and ask it to tell you which elements have that value.
reply
globalise83
8 hours ago
[-]
Claude answers a lot of its questions by first writing and then running code to generate the results. Its only limitation is the access to databases and size of context window, both of which will be radically improved over the next 5 years.
reply
freehorse
5 hours ago
[-]
I would still rather be able to see the code it generates
reply
adwn
8 hours ago
[-]
Ha, I tried that before. However, the file was too large for its context window, so it only seemed to analyze the first part and gave a wrong result.
reply
Woodi
7 hours ago
[-]
It was your own data, right ? Becouse you just donated half of it...
reply
adwn
7 hours ago
[-]
It's okay, I also uploaded an NDA in a previous prompt :-)
reply
lottin
7 hours ago
[-]
But how do you know it's given you the correct answer? Just because the code appears to work it doesn't mean it's correct.
reply
adwn
7 hours ago
[-]
But how do I know if my hand-written jq query is the correct solution? Just because the query appears to work it doesn't mean it's correct.
reply
lottin
5 hours ago
[-]
Because I understand the process that I have followed to get to the solution.
reply
ogogmad
7 hours ago
[-]
It can explain its solution. Point to relevant docs as well.
reply
gf000
6 hours ago
[-]
It can also very convincingly explain a non-solution pointing to either real or hallucinated docs.
reply
freehorse
4 hours ago
[-]
Omg this is how llms used to trick me inventing out all these apis.
reply
globalise83
8 hours ago
[-]
The LLMs are now writing their own algorithms to answer questions. Not long before they can design a more efficient algorithm to complete any feasible computational task, in a millionth of the time needed by the best human.
reply
gf000
6 hours ago
[-]
> The LLMs are now writing their own algorithms to answer questions

Writing a python script, because it can't do math or any form of more complex reasoning is not what I would call "own algorithm". It's at most application of existing ones/calling APIs.

reply
bayindirh
7 hours ago
[-]
LLMs are probabilistic string blenders pulling pieces up from their training set, which unfortunately comes from us, humans.

The superset of the LLM knowledge pool is human knowledge. They can't go beyond the boundaries of their training set.

I'll not go into how humans have other processes which can alter their and collective human knowledge, but the rabbit hole starts with "emotions, opposable thumbs, language, communication and other senses".

reply
ogogmad
7 hours ago
[-]
> They can't go beyond the boundaries of their training set.

TFA says they just did. That's what the ARC-AGI benchmark was supposed to test.

reply
lxgr
11 hours ago
[-]
> take several data center’s worth of compute into a brain sized vessel. So the thing can move around process the world in real time

How so? I'd imagine a robot connected to the data center embodying its mind, connected via low-latency links, would have to walk pretty far to get into trouble when it comes to interacting with the environment.

The speed of light is about three orders of magnitude faster than the speed of signal propagation in biological neurons, after all.

reply
byw
9 hours ago
[-]
The robot brain could be layered so that more basic functions are embedded locally while higher-level reasonings and offloaded to the cloud.
reply
arthurcolle
9 hours ago
[-]
blue strip from iRobot?
reply
waldrews
10 hours ago
[-]
6 orders of magnitude if we use 120 m/s vs 300 km/s
reply
lxgr
3 hours ago
[-]
Ah, yes, I missed a “k” in that estimation!
reply
nopinsight
8 hours ago
[-]
Many of humans' capabilities are pretrained with massive computing through evolution. Inference results of o3 and its successors might be used to train the next generation of small models to be highly capable. Recent advances in the capabilities of small models such as Gemini-2.0 Flash suggest the same.

Recent research from NVIDIA suggests such an efficiency gain is quite possible in the physical realm as well. They trained a tiny model to control the full body of a robot via simulations.

---

"We trained a 1.5M-parameter neural network to control the body of a humanoid robot. It takes a lot of subconscious processing for us humans to walk, maintain balance, and maneuver our arms and legs into desired positions. We capture this “subconsciousness” in HOVER, a single model that learns how to coordinate the motors of a humanoid robot to support locomotion and manipulation."

...

"HOVER supports any humanoid that can be simulated in Isaac. Bring your own robot, and watch it come to life!"

More here: https://x.com/DrJimFan/status/1851643431803830551

---

This demonstrates that with proper training, small models can perform at a high level in both cognitive and physical domains.

reply
bigprof
8 hours ago
[-]
> Similarly, many of humans' capabilities are pretrained with massive computing through evolution.

Hmm .. my intuition is that humans' capabilities are gained during early childhood (walking, running, speaking .. etc) ... what are examples of capabilities pretrained by evolution, and how does this work?

reply
tiborsaas
7 hours ago
[-]
If you look at animals, they can walk in hours, not much time needed after being born. It takes us a longer time because we are born rather undeveloped to get the head out of the birth canal.

A more high level example, sea sickness is a evolutionary pre-learned thing, your body things it's poisoned and it automatically wants to empty your stomach.

reply
nopinsight
8 hours ago
[-]
The brain is predisposed to learn those skills. Early childhood experiences are necessary to complete the training. Perhaps that could be likened to post-training. It's not a one-to-one comparison but a rather loose analogy which I didn't make it precise because it is not the key point of the argument.

Maybe evolution could be better thought of as neural architecture search combined with some pretraining. Evidence suggests we are prebuilt with "core knowledge" by the time we're born [1].

See: Summary of cool research gained from clever & benign experiments with babies here:

[1] Core knowledge. Elizabeth S. Spelke and Katherine D. Kinzler. https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...

reply
vanviegen
7 hours ago
[-]
> The brain is predisposed to learn those skills.

Learning to walk doesn't seem to be particularly easy, having observed the process with my own children. No easier than riding a bike or skating, for which our brains are probably not 'predisposed'.

reply
nopinsight
7 hours ago
[-]
Walking is indeed a complex skill. Yet some animals walk minutes after birth. Human babies are most likely born premature due to the large brain and related physical constraints.

Young children learn to bike or skate at an older age after they have acquired basic physical skills.

Check out the reference to Core Knowledge above. There are things young infants know or are predisposed to know from birth.

reply
HumanOstrich
7 hours ago
[-]
The brain has developed, through evolution, very specific and organized structures that allow us to learn language and reading skills. If you have a genetic defect that causes those structures to be faulty or missing, you will have severe developmental problems.

That seems like a decent example of pretraining through evolution.

reply
tesch1
3 hours ago
[-]
But maybe it's something more like general symbolic manipulation, and not specifically the sounds or structure of language. Reading is fairly new and unlikely to have had much if any evolutionary pressure in many populations who are now quite literate. Same seems true for music. Maybe the hardware is actually more general and adaptable and not just for language?
reply
HumanOstrich
3 hours ago
[-]
The research disagrees with you.
reply
eru
2 hours ago
[-]
Music is really, really old.

And reading and music co-evolved to be relatively easy for humans to do.

(See how computers have a much easier time reading barcodes and QR codes, with much less general processing power than it takes them to decipher human hand-writing. But good luck trying to teach humans to read QR codes fluently.)

reply
eru
2 hours ago
[-]
> No easier than riding a bike or skating, for which our brains are probably not 'predisposed'.

What makes you think so? Humans came up with biking and skating, because they were easy enough for us to master with the hardware we had.

reply
puffybuf
7 hours ago
[-]
I think of evolution as unassisted learning where agents compete with the each other for limited resources. Over time they get better and better at surviving by passing on genes. It never ends of course.
reply
eru
2 hours ago
[-]
Your brain is well adapted to learning how to walk and speak.

Chimpanzees score pretty high on many tests of intelligence, especially short term working memory. But they can't really learn language: they lack the specialised hardware more than the general intelligence.

reply
gf000
6 hours ago
[-]
I mean, there are plenty - e.g. mimicking (say, the mother's face's emotions), which are precursors to learning more advanced "features". Also, even walking has many aspects pretrained (I assume it's mostly a musculoskeletal limitation that we can't walk immediately), humans are just born "prematurely" due to our relatively huge heads. Newborn horses can walk immediately without learning.

But there are plenty of non-learned control/movement/sensing in utero that are "pretrained".

reply
eru
2 hours ago
[-]
Interestingly, there's a bunch of reflexes that also only develop over time.

They are more nature than nurture, but they aren't 'in-born'.

Just like human aren't (usually) born with teeth, but they don't 'learn' to have teeth or pubic hair, either.

reply
lumost
11 hours ago
[-]
The concern here is mainly on practicality. The original mainframes did not command startup valuations counted in fractions of the US economy, they did qualify for billions in investment.

This is a great milestone, but OpenAI will not be successful charging 10x the cost of a human to perform a task.

reply
raincole
10 hours ago
[-]
The cost of inference has be dropping by ~100x in the past 2 years.

https://a16z.com/llmflation-llm-inference-cost/

reply
christianqchung
10 hours ago
[-]
Hmm the link is saying the price of an LLM that scores 42 or above on MMLU has dropped 100x in 2 years, equating gpt 3.5 and llama 3.2 3B. In my opinion gpt 3.5 was significantly better than llama 3B, and certainly much better than the also-equated llama 2 7B. MMLU isn't a great marker of overall model capabilities.

Obviously the drop in cost for capability in the last 2 years is big, but I'd wager it's closer to 10x than 100x.

reply
gritzko
10 hours ago
[-]
*infernonce
reply
nico
10 hours ago
[-]
*inference
reply
owenpalmer
8 hours ago
[-]
> OpenAI will not be successful charging 10x the cost of a human to perform a task.

True, but they might be successful charging 20x for 2x the skill of a human.

reply
threatripper
8 hours ago
[-]
Or 10x the skill and speed of a human in some specific class of recurrent tasks. We don't need full super-human AGI for AI to become economically viable.
reply
eru
2 hours ago
[-]
Companies routinely pay short-term contractors a lot more than their permanent staff.

If you can just unleash AI on any of your problems, without having to commit to anything long term, it might still be useful, even if they charged more than for equivalent human labour.

(Though I suspect AI labour will generally trend to be cheaper than humans over time for anything AIs can do at all.)

reply
BriggyDwiggs42
11 hours ago
[-]
I wouldn’t expect it to cost 10x in five years, if only because parallel computing still seems to be roughly obeying moore’s.
reply
fragmede
6 hours ago
[-]
How much does AWS charge for compute?

If it can be spun up with Terraform, I bet you they could.

reply
pera
9 hours ago
[-]
Maybe AGI as a goal is overvalued: If you have a machine that can, on average, perform symbolic reasoning better than humans, and at a lower cost, that's basically the end game, isn't it? You won capitalism.
reply
harrall
9 hours ago
[-]
Right now I can ask an (experienced) human to do something for me and they will either just get it done or tell me that they can’t do it.

Right now when I ask an LLM… I have to sit there and verify everything. It may have done some helpful reasoning for me but the whole point of me asking someone else (or something else) was to do nothing at all…

I’m not sure you can reliably fulfill the first scenario without achieving AGI. Maybe you can, but we are not at that point yet so we don’t know yet.

reply
raincole
8 hours ago
[-]
You do need to verify humans work though.

The difference, to me, is that humans seem to be good at canceling each other's mistakes when put in a proper environment.

reply
anavat
8 hours ago
[-]
My guess is this is an artifact of the RLHF part of the training. Answers like "I don't know" or "let me think and let's catch on this next week" are flagged down by human testers, which eventually trains LLM to avoid this path altogether. And it probably makes sense because otherwise "I don't know" would come up way too often even in cases where the LLM is perfectly able to give the answer.
reply
gf000
6 hours ago
[-]
I don't know, that seems like a fundamental limitation. LLMs don't have any ability to do reflection on their own knowledge/abilities.
reply
ben_w
6 hours ago
[-]
Humans aren't very aware of their limits, either.

Even the Dunning-Kruger effect is, ironically, widely misunderstood by people who are unreasonably confident about their knowledge.

reply
gf000
26 minutes ago
[-]
But you know if you have ever heard about call by name or value semantics.
reply
eru
2 hours ago
[-]
Yes, Dunning-Kruger's paper never found what popular science calls the 'Dunning-Kruger' effect.

Effectively, they found nothing real but a statistical artifact.

reply
pera
8 hours ago
[-]
It's not clear to me whether AGI is necessary for solving most of the issues in the current generation of LLMs. It is possible you can get there by hacking together CoTs with automated theorem provers and bruteforcing your way to the solution or something like that.

But if it's not enough then maybe it might come as a second-order effect (e.g. reasoning machines having to bootstrap an AGI so then you can have a Waymo taxi driver who is also a Fields medalist)

reply
vbezhenar
8 hours ago
[-]
There are so called "yes-men" who can't say "no" in no situation. That's rooted in their culture. I suspect that AI was trained using their assistance. I mean, answering "I can't do that" is the simplest LLM path that should work often unless they gone out of their way to downrank it.
reply
concordDance
8 hours ago
[-]
> Right now I can ask an (experienced) human to do something for me and they will either just get it done or tell me that they can’t do it.

Finding reliable honest humans is a problem governments have struggled with for over a hundred years. If you have cracked this problem at scale you really need to write it up! There are a lot of people who would be extremely interested in a solution here.

reply
eru
2 hours ago
[-]
> Finding reliable honest humans is a problem governments have struggled with for over a hundred years.

Yes, though you are downplaying the problem a lot. It's not just governments, and it's way longer than 100 years.

Btw, a solution that might work for you or me, presumably relatively obscure people, might not work for anyone famous, nor a company nor a government.

reply
Existenceblinks
8 hours ago
[-]
Honestly, it doesn't need to be local, API is some 200ms away is ok-ish, make it 50ms it will be practically usable for every majority of interaction.
reply
TechDebtDevin
10 hours ago
[-]
Batteries..
reply
otabdeveloper4
10 hours ago
[-]
Intelligence has nothing at all whatever to do with compute.
reply
oefnak
10 hours ago
[-]
Unless you're a dualist who believes in a magic spirit, I cannot understand how you think that's the case. Can you please explain?
reply
lambdaphagy
14 minutes ago
[-]
Philosophy of mind is the branch of philosophy that attempts to account for a very difficult problem: why there are apparently two different realms of phenomena, physical and mental, that are at once tightly connected and yet as different from one another as two things can possibly be.

Broadly speaking you can think that the mental reduces to the physical (physicalism), that the physical reduces to the mental (idealism), both reduce to some other third thing (neutral monism) or that neither reduces to the other (dualism). There are many arguments for dualism but I’ve never heard a philosopher appeal to “magic spirits” in order to do so.

Here’s an overview: https://plato.stanford.edu/entries/dualism/

reply
freehorse
9 hours ago
[-]
Intelligence is about learning from few examples and generalising to novel solutions. Increasing compute so that exploring the whole problem space is possible is not intelligence. There is a reason the actual ARC-AGI price has efficiency as one of the success requirements. It is not so that the solutions scale to production and whatnot, these are toy tasks. It is to help ensure that it is actually an intelligent system solving these.

So yeah, the o3 result is impressive but if the difference between o3 and the previous state of art is more compute to do a much longer CoT/evaluation loop, I am not so impressed. Reminder that these problems are solved by humans in seconds, ARC-AGI is supposed to be easy.

reply
patrickhogan1
10 hours ago
[-]
Do you think intelligence exists without prior experience? For instance, can someone instantly acquire a skill—like playing the piano—as if downloading it in The Matrix? Even prodigies like Mozart had prior exposure. His father, a composer and music teacher, introduced him to music from an early age. Does true intelligence require a foundation of prior knowledge?
reply
1659447091
9 hours ago
[-]
Intelligence requires the ability to separate the wheat from the chaff on one's own to create a foundation of knowledge to build on.

It is also entirely possible to learn a skill without prior experience. That's how it(whatever skill) was first done

reply
owenpalmer
8 hours ago
[-]
> Does true intelligence require a foundation of prior knowledge?

This is the way I think about it.

I = E / K

where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.

For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math, without using any of the "tricks" that A already knew.

Now back to your question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers your question.

Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.

IK = E

low intelligence * vast knowledge = reasonable effectiveness

reply
behnamoh
22 hours ago
[-]
So now not only are the models closed, but so are their evals?! This is a "semi-private" eval. WTH is that supposed to mean? I'm sure the model is great but I refuse to take their word for it.
reply
ZeroCool2u
22 hours ago
[-]
The private evaluation set is private from the public/OpenAI so companies can't train on those problems and cheat their way to a high score by overfitting.
reply
jsheard
22 hours ago
[-]
If the models run on OpenAIs servers then surely they could still see the questions being put into it if they wanted to cheat? That could only be prevented by making the evaluation a one-time deal that can't be repeated, or by having OpenAI distribute their models for evaluators to run themselves, which I doubt they're inclined to do.
reply
foobarqux
21 hours ago
[-]
Yes that's why it is "semi"-private: From the ARC website "This set is "semi-private" because we can assume that over time, this data will be added to LLM training data and need to be periodically updated."

I presume evaluation on the test set is gated (you have to ask ARC to run it).

reply
cchance
22 hours ago
[-]
the evals are the question/answers, ARC-AGI doesn't share the questions and answers for a portion so that models can't be trained on them, the public ones... the public knows the questions so theres a chance they could have been at least partially been trained on the question (if not the actual answer).

Thats how i understand it

reply
sys32768
21 hours ago
[-]
So in a few years, coders will be as relevant as cuneiform scribes.
reply
HarHarVeryFunny
17 hours ago
[-]
I've never seen a company looking for a "coder", anymore than they look to hire spreadsheet creators or powerpoint specialists. A software developer can code, but being able to code doesn't make you a software developer, anymore than being able to create a powerpoint makes you a manager (although in some companies it might do, so maybe bad example!).
reply
__MatrixMan__
16 hours ago
[-]
With only a 100x increase in cost, we improved performance by 0.1x and continued plotting this concave-down diminishing-returns type graph! Hurray for logarithmic x-axes!

Joking aside, better than ever before at any cost is an achievement, it just doesn't exactly scream "breakthrough" to me.

reply
whalee
16 hours ago
[-]
imo it's a mistake to interpret the marginal increases in the upper echelons of benchmarks as materially marginal gains. Chess is an example. ELO narrows heavily at the top, but each ELO point carries more relative weight. This is a bit apples and oranges since chess is adversarial, but I think the point stands.
reply
wavemode
11 hours ago
[-]
> ELO narrows heavily at the top

What do you mean by this? I'm assuming you're not speaking about simple absolute differences in value - there have been top players rated over 100 points higher than the average of the rest of the top ten.

reply
energy123
12 hours ago
[-]
o3-mini (high) uses 1/3rd of the compute of o1, and performs about 200 Elo higher than o1 on Codeforces.

o1 is the best code generation model according to Livebench.

So how is this not a breakthrough? It's a genuine movement of the frontier.

reply
handzhiev
8 hours ago
[-]
How much time does a top sprinter take a 100 m run for compared to a mediocre sprinter?
reply
dyauspitr
14 hours ago
[-]
I mean going from 10% to 85% doesn’t seem like a 0.1% improvement
reply
__MatrixMan__
10 hours ago
[-]
Oh crap I made a mistake. I was comparing o3 low to o3 high.

I'm a little disappointed by all the upvotes I got for being flat wrong. I guess as long as you're trashing AI you can get away with anything.

Really I was just trying to nitpick the chart parameters.

reply
HDThoreaun
16 hours ago
[-]
compute gets cheaper and cheaper every year. This model will be in your phone by 2030 if we continue at the pace we've been at the last few years.
reply
hajile
12 hours ago
[-]
These models are nearing 2+ trillion parameters. At 4 bits each, we're talking about somewhere around 1tb of RAM.

The problem is that RAM stopped scaling a long time ago now. We're down to the size where a single capacitor's charge is held by a mere 40,000 or so electrons and all we've been doing is making skinnier, longer cells of that size because we can't find reliable ways to boost even weaker signals, but this is a dead end because as the math shows, if the volume is consistent and you are reducing X and Y dimensions, that Z dimension starts to get crazy big really fast. The chemistry issues of burning a hole a little at a time while keeping wall thickness somewhat similar all the way down is a very hard problem.

Another problem is that Moore's law hit a wall when Dennard Scaling failed. When you look at SRAM (it's generally the smallest and most reliable stuff we can make), you see that most recent shrinks can hardly be called shrinks.

Unless we do something very different like compute in storage or have some radical breakthrough in a new technology, I don't know that we will ever get a 2T parameter model inside a phone (I'd love for someone in 10 years to show up and say how wrong I was).

reply
agentultra
15 hours ago
[-]
There’s probably enough VC money to subsidize the costs for a few more years.

But the data centres running the training for models like this are bringing up new methane power plants at a fast rate at a time when we need to be reducing reliance on O&G.

But let’s assume that the efficiency gains out pace the resource consumption with the help of all the subsidies being thrown in and we achieve AGI.

What’s the benefit? Do we get more fresh water?

reply
fastball
14 hours ago
[-]
Politically anything can happen. Maybe the billionaire class controls everything with an army of robots and it's a horrible prison-like dystopia, or maybe we end up in a post-scarcity utopia a la The Culture.

Regardless, once we have AGI (and it can scale), I don't think O&G reliance (/ climate change) is going to be something that we need concern ourselves with.

reply
hamburga
15 hours ago
[-]
Yeah, good question. I think it depends on our politics. If we’re in a techno-capital-oligarchy, people are going to have a hard time making fresh water a priority when the robots would prefer to build nuclear power everywhere and use it to desalinate sea water.

OTOH if these data centers are sufficiently decentralized and run for public benefit, maybe there’s a chance we use them to solve collective action problems.

reply
kvetching
16 hours ago
[-]
It may eventually be able to solve any problem
reply
iterance
16 hours ago
[-]
Ah. Me, too.
reply
demirbey05
19 hours ago
[-]
It is not exactly AGI but huge step toward it. I would expect this step in 2028-2030. I cant really understand why people are happy with it, this technology is so dangerous that can disrupt whole society. It's neither like smartphone nor internet. What will happen to 3rd world countries. Lots of unsolved questions and world is not prepared for such a change. Lots of people will lose their jobs I am not even mentioning their debts. No one will have chance to be rich anymore, If you are in first world country you will probably get UBI, if not you wont.
reply
Ancalagon
19 hours ago
[-]
Same, I don’t really get the excitement. None of these companies are pushing for a utopian Star Trek society either with that power.
reply
moffkalast
18 hours ago
[-]
Open models will catch up next year or the year after, there only so many things to try and there's lots of people trying them, so it's more or less an inevitability.

The part to get excited about is that there's plenty of headroom left to gain in performance. They called o1 a preview, and it was, a preview for QwQ and similar models. We get the demo from OAI and then get the real thing for free next year.

reply
FanaHOVA
19 hours ago
[-]
> I would expect this step in 2028-2030.

Do you work at one of the frontier labs?

reply
ripped_britches
15 hours ago
[-]
I’ve never understood this perspective. Companies only make money when there are billions of customers. Are you imagining a total-monopoly scenario where zero humans have any income/wealth and there are only AI companies selling/mining/etc to each other, fully on their own? In such an extreme scenario, clearly the world’s governments would nationalize these entities. I think the only realistic scenario in which the future is not markedly better for every single human is if some rogue AI system decides to exterminate us, which I find to be increasingly unlikely as safety improvements are made (like the paper released today).

As for the wealth disparity between rich and poor countries, it’s hard to know how politics will handle this one, but it’s unlikely that poor countries won’t also be drastically richer as the cost of basic living drops to basically zero. Imagine the cost of food, energy, etc in an ASI world. Today’s luxuries will surely be considered human rights necessities in the near future.

reply
Jensson
14 hours ago
[-]
> In such an extreme scenario, clearly the world’s governments would nationalize these entities

Those entities are the worlds governments regardless how things play out. People just worry they will be hostile or indifferent to humans, since that would be bad news for humans. Pet, cattle or pest, our future will be as one of those.

reply
lagrange77
18 hours ago
[-]
I hope governments will finally take action.
reply
Joeri
18 hours ago
[-]
What action do you expect them to take?

What law would effectively reduce risk from AGI? The EU passed a law that is entirely about reducing AI risk and people in the technology world almost universally considered it a bad law. Why would other countries do better? How could they do better?

reply
lagrange77
17 hours ago
[-]
If their mission is the wellbeing of their peoples, they should take any action that ensures that.

Besides regulating the technology, they could try to protect people and society from the effects of the technology. UBI for example could be an attempt to protect people from the effects of mass unemployment, as i understood it.

Actually i'm afraid even more fundamental shifts are necessary.

reply
wyager
19 hours ago
[-]
> What will happen to 3rd world countries

Probably less disruption than will happen in 1st world countries.

> No one will have chance to be rich anymore

It's strange to reach this conclusion from "look, a massive new productivity increase".

reply
janalsncm
19 hours ago
[-]
Strange indeed if we work under the assumption that the profits from this productivity will be distributed (even roughly) evenly. The problem is that most of us see no indication that they will be.

I read “no one will have a chance to be rich anymore” as a statement about economic mobility. Despite steep declines in mobility over the last 50 years, it was still theoretically possible for a poor child (say bottom 20% wealth) to climb several quintiles. Our industry (SWE) was one of the best examples. Of course there have been practical barriers (poor kids go to worse schools, and it’s hard to get into college if you can’t read) but the path was there.

If robots replace a lot of people, that path narrows. If AGI replaces all people, the path no longer exists.

reply
entropi
17 hours ago
[-]
It is not strange at all, a very big motivation of spending billions in AI research is basically to remove what is called "skill premium" from the labor market. That "skill premium" was usually how people got richer than their fathers.
reply
the8472
18 hours ago
[-]
Intelligence is the thing distinguishing humans from all previous inventions that already were superhuman in some narrow domain.

car : horse :: AGI : humans

reply
demirbey05
19 hours ago
[-]
its not like sonnet, yes current ai tools are increasing productivity and provides many ways to have chance to be rich, but agi is completely different. You need to handle evil competition between you and big fishes, probably big fishes will have more ai resources than you. What is the survival ratio in such a environment ? Very low.
reply
dyauspitr
18 hours ago
[-]
I’m extremely excited because I want to see the future and I’m trying not to think of how severely fucked my life will be.
reply
og_kalu
22 hours ago
[-]
This is also wildly ahead in SWE-bench (71.7%, previous 48%) and Frontier Math (25% on high compute, previous 2%).

So much for a plateau lol.

reply
throwup238
22 hours ago
[-]
> So much for a plateau lol.

It’s been really interesting to watch all the internet pundits’ takes on the plateau… as if the two years since the release of GPT3.5 is somehow enough data for an armchair ponce to predict the performance characteristics of an entirely novel technology that no one understands.

reply
bandwidth-bob
21 hours ago
[-]
The pundits response to the (alleged) plateau was proportional to the certainty with which CEOs of frontier labs discussed pre-training scaling. The o3 result is from scaling test time compute, which represents a meaningful change in how you would build out compute for scaling (single supercluster --> presence in regions close to users). Thus it is important to discuss.
reply
jgalt212
22 hours ago
[-]
You could make an equivalently dismissive comment about the hypesters.
reply
throwup238
22 hours ago
[-]
Yeah but anyone with half a brain knows to ignore them. Vapid cynicism is a lot more seductive to the average nerd.
reply
HarHarVeryFunny
17 hours ago
[-]
You're talking apples and oranges. The plateau the frontier models have hit is the limited further gains to be had from dataset (+ corresponding model/compute) scaling.

These new reasoning models are taking things in a new direction basically by adding search (inference time compute) on top of the basic LLM. So, the capabilities of the models are still improving, but the new variable is how deep of a search you want to do (how much compute to throw at it at inference time). Do you want your chess engine to do a 10 ply search or 20 ply? What kind of real world business problems will benefit from this?

reply
og_kalu
16 hours ago
[-]
"New" reasoning models are plain LLMs with clever reinforcement learning. o1 is itself reinforcement learning on top GPT-4o.

They found a way to make test time compute a lot more effective and that is an advance but the idea is not new, the architecture is not new.

And the vast majority of people convinced LLMs plateaued did so regardless of test time compute.

reply
HarHarVeryFunny
16 hours ago
[-]
The fact that these reasoning models may compute for extended durations, using exponentially more compute for linear performance gains (says OpenAI), resulting in outputs that while better are not necessarily any longer (more tokens) than before, all point to a different architecture - some type of iterative calling of the underlying model (essentially a reasoning agent using the underlying model).

A plain LLM does not use variable compute - it is a fixed number of transformer layers, a fixed amount of compute for every token generated.

reply
throwaway314155
14 hours ago
[-]
Architecture generally refers to the design of the model. In this case, the underlying model is still a transformer based llm and so is its architecture.

What's different is the method for _sampling_ from that model where it seems they have encouraged the underlying LLM to perform a variable length chain of thought "conversation" with itself as has been done with o1. In addition, they _repeat_ these chains of thought in parallel using a tree of some sort to search and rank the outputs. This apparently scales performance on benchmarks as you scale both length of the chain of thought and the number of chains of thought.

reply
HarHarVeryFunny
3 hours ago
[-]
No disagreement, although the sampling + search procedure is obviously adding quite a lot to the capabilities of the system as a whole, so it really should be considered as part of the architecture. It's a bit like AlphaGo or AlphaZero - generating potential moves (cf LLM) is only a component of the overall solution architecture, and the MCTS sampling/search is equally (or more) important.
reply
og_kalu
12 hours ago
[-]
I think throwaway already explained what i was getting at.

That said, i probably did downplay the achievement. It may not be a "new" idea to do something like this but finding an effective method for reflection that doesn’t just lock you into circular thinking and is applicable beyond well defined problem spaces is genuinely tough and a breakthrough.

reply
attentionmech
22 hours ago
[-]
I legit see that if there is not even a new breakthrough just one week, people start shouting plateau plateau.. Our rate of progress is extraordinary and any downplay of it seems like stupid
reply
optimalsolver
22 hours ago
[-]
>Frontier Math (25% on high compute, previous 2%)

This is so insane that I can't help but be skeptical. I know FM answer key is private, but they have to send the questions to OpenAI in order to score the models. And a significant jump on this benchmark sure would increase a company's valuation...

Happy to be wrong on this.

reply
OsrsNeedsf2P
22 hours ago
[-]
At 6,670$/task? I hope there's a jump
reply
og_kalu
22 hours ago
[-]
It's not 6,670$/task. That was the high efficiency cost for 400 questions.
reply
lagrange77
18 hours ago
[-]
> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

That's the most plausible definition of AGI i've read so far.

reply
cmrdporcupine
17 hours ago
[-]
That's a pretty dark view of humanity and human intelligence. We're defined by the tasks we can do?

Instrumental reason FTW

reply
lagrange77
17 hours ago
[-]
That implies that human intelligence is equivalent to AGI.
reply
vessenes
22 hours ago
[-]
This feels like big news to me.

First of all, ARC is definitely an intelligence test for autistic people. I say as someone with a tad of the neurodiversity. That said, I think it's a pretty interesting one, not least because as you go up in the levels, it requires (for a human) a fair amount of lateral thinking and analogy-type thinking, and of course, it requires that this go in and out of visual representation. That said, I think it's a bit funny that most of the people training these next-gen AIs are neurodiverse and we are training the AI in our own image. I continue to hope for some poet and painter-derived intelligence tests to be added to the next gen tests we all look at and score.

For those reasons, I've always really liked ARC as a test -- not as some be-all end-all for AGI, but just because I think that the most intriguing areas next for LLMs are in these analogy arenas and ability to hold more cross-domain context together for reasoning and etc.

Prompts that are interesting to play with right now on these terms range from asking multimodal models to say count to ten in a Boston accent, and then propose a regional french accent that's an equivalent and count to ten in that. (To my ear, 4o is unconvincing on this). Similar in my mind is writing and architecting code that crosses multiple languages and APIs, and asking for it to be written in different styles. (claude and o1-pro are .. okay at this, depending).

Anyway. I agree that this looks like a large step change. I'm not sure if the o3 methods here involve the spinning up of clusters of python interpreters to breadth-search for solutions -- a method used to make headway on ARC in the past; if so, this is still big, but I think less exciting than if the stack is close to what we know today, and the compute time is just more introspection / internal beam search type algorithms.

Either way, something had to assess answers and think they were right, and this is a HUGE step forward.

reply
jamiek88
22 hours ago
[-]
> most of the people training these next-gen AIs are neurodiverse

Citation needed. This is a huge claim based only on stereotype.

reply
vessenes
21 hours ago
[-]
So true. Perhaps I'm just thinking it's my people and need to update my priors.
reply
getpost
21 hours ago
[-]
> most of the people training these next-gen AIs are neurodiverse and we are training the AI in our own image

Do you have any evidence to support that? It would be fascinating if the field is primarly advancing due to a unique constellation of traits contributed by individuals who, in the past, may not have collaborated so effectively.

reply
vessenes
21 hours ago
[-]
PURELY Anecdotal. But I'll say that as of 2024 1 in 36 US children are diagnosed on the spectrum according to the CDC(!), which would mean if you met 10 AI researchers and 4 were neurodivergent you'd reasonably expect that it's a higher-than-population average representation. I'm polling from the Effective Altruist AI folks in my mind, and the number is definitely, definitely higher than 4/10.
reply
EVa5I7bHFq9mnYK
21 hours ago
[-]
Are there non-Effective Altruist AI folks?
reply
vessenes
18 hours ago
[-]
I love how this might mean "non-Effective", non-"Effective Altruist" or non-"Effective Altruist AI" folks.

Yes

reply
braden-lk
22 hours ago
[-]
If people constantly have to ask if your test is a measure of AGI, maybe it should be renamed to something else.
reply
OfficialTurkey
22 hours ago
[-]
From the post

> Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

reply
cchance
22 hours ago
[-]
Its funny when they say this, as if all humans can solve basic ass question/answer combos, people seem to forget theirs a percentage of the population that honestly believe the world is flat along with other hallucinations at the human level
reply
Jensson
17 hours ago
[-]
Humans works in groups, so you are wrong a group of human is extremely reliable on tons of tasks. These AI models also work in groups, or they don't improve from working in a group since the company uses whatever does the best on the benchmark, so it is only fair to compare AI vs group of people, AI compared to an individual will always be an unfair comparison since an AI is never alone.
reply
jppittma
22 hours ago
[-]
I don't believe AGI at that level has any commercial value.
reply
maxdoop
22 hours ago
[-]
How much longer can I get paid $150k to write code ?
reply
prmph
22 hours ago
[-]
I’ll believe the models can take the jobs of programmers when they can generate a sophisticated iOS app based on some simple prompts, ready for building and publication in the app store. That is nowhere near the horizon no matter how much things are hyped up, and it may well never arrive.
reply
vouaobrasil
21 hours ago
[-]
Nah, it will arrive. And regardless, this sort of AI reduces the skill level required to make the app. It reduces the amount of people required and thus reduces the demand for engineers. So, even though AI is not CLOSE to what you are suggesting, it can significantly reduce the salaries of those that ARE required. So maybe fewer $150K programmers will be hired with the same revenue for even higher profits.

The most bizarre thing is that programmers are literally writing code to replace themselves because once this AI started, it was a race to the bottom and nobody wants to be last.

reply
prmph
21 hours ago
[-]
They've been promising us this thing since the 60s: End-user development, 5GLs, etc. enabling the average Joe to develop sophisticated apps in minimal time. And it never arrives.

I remember attending a tech fair decades ago, and at one stand they were vending some database products. When I mentioned that I was studying computer science with a focus on software engineering, they sneered that coding will be much less important in the future since powerful databases will minimize the need for a lot of data wrangling in applications with algorithms.

What actually happened is that the demand for programmers increased, and software ate the world. I suspect something similar will happen the current AI hype.

reply
whynotminot
19 hours ago
[-]
> They've been promising us this thing since the 60s: End-user development, 5GLs, etc. enabling the average Joe to develop sophisticated apps in minimal time. And it never arrives.

This has literally already arrived. Average Joes are writing software using LLMs right now.

reply
arrosenberg
16 hours ago
[-]
Source? Which software products are built without engineers?
reply
Jensson
15 hours ago
[-]
Personal websites etc, you don't think about them as software products since they weren't built by engineers, but 30 years ago you needed engineers to build those things.
reply
arrosenberg
14 hours ago
[-]
Ok, well I’m not going to worry about my job then. 25 years ago GeoCities existed and you didn’t need an engineer. 10 year old me was writing functional HTML, definitely not an engineer at that point.
reply
whynotminot
13 hours ago
[-]
To be honest maybe no one should worry.

If AI truly overtakes knowledge work there’s not much we could reasonably do to prepare for it.

If AI never gets there though, then you saved yourself the trouble of stressing about it. So sure, relax, it’s just the second coming of GeoCities.

reply
hatefulmoron
8 hours ago
[-]
I think the fear comes from the span of time. If my job is obsolete at the same time as everybody else's, I wouldn't care. I mean, sure, the world is in for a very tough time, but I would be in good company.

The really bad situation is if my entire skill set is made obsolete while the rest of the world keeps going for a decade or two. Or maybe longer, who knows.

I realize I'm coming across quite selfish, but it's just a feeling.

reply
vouaobrasil
20 hours ago
[-]
Well, I think in the 60s we also didn't have LLMs that could actually write complete programs, either.
reply
mirsadm
19 hours ago
[-]
No one writes a "complete program" these days. Things just keep evolving forever. I spent way too much time I care to admit dealing with dependencies of libraries which change seemingly on a daily basis these days. These predictions are so far off reality it makes me wonder if the people making them have ever written any code in their life.
reply
vouaobrasil
19 hours ago
[-]
That's fair. Well, I've written a lot of code. But anyway, I do want to emphasize the following. I am not making the same prediction as some that say AI can replace a programmer. Instead, I am saying: combination of AI plus programmers will reduce the need for the number or programmers, and hence allow the software industry to exist with far fewer people, with the lucky ones accumulating even more wealth.
reply
skydhash
21 hours ago
[-]
> Nah, it will arrive

Will it?

It's already hard to get people to use computer as they are right now, where you only need to click on things and no longer have to enter commands. That because most people don't like to engage in formal reasoning. Even with one of the most intuitive computer assisted task (drawing and 3d modeling), there's so much to learn regarding theories that few people bother.

Programming has always been easy to learn, and tools to automate coding have existed for decades now. But how many people you know have had the urge to learn enough to automate their tasks?

reply
timenotwasted
22 hours ago
[-]
The absolutist type comments are such a wild take given how often they are so wrong.
reply
tsunamifury
22 hours ago
[-]
Totally... simple increases in 20% efficiency will already significant destroy demand for coders. This forum however will be resistant to admit such economic phenomenon.

Look at video bay editing after the advent of Final Cut. Significant drop in the specialized requirement as a professional field, even while content volume went up dramatically.

reply
derektank
21 hours ago
[-]
I could be misreading this, but as far as I can tell, there are more video and film editors today (29,240) than there were film editors in 1997 (9,320). Seems like an example of improved productivity shifting the skills required but ultimately driving greater demand for the profession as a whole. Salaries don't seem to have been hurt either, median wage was $35,214 in '97 and $66,600 today, right in line with inflation.

https://www.bls.gov/oes/2023/may/oes274032.htm

https://www.bls.gov/oes/tables.htm

reply
exitb
22 hours ago
[-]
Computing has been transforming countless jobs before it got to Final Cut. On one hand, programming is not the hardest job out there. On the other, it takes months to fully onboard a human developer - a person that already has years of relevant education and work experience. There are desk jobs that onboard new hires in days instead. Let’s see when they’re displaced by AI first.
reply
tsunamifury
21 hours ago
[-]
Don't know if you noticed but thats already happening. Mass layoffs in customer service etc have already happened over the last 2 years
reply
EVa5I7bHFq9mnYK
21 hours ago
[-]
That's until AI has improved enough that it can automatically navigate the menus to get me a human operator to talk to.
reply
exitb
21 hours ago
[-]
So, how does it work out? Are the customers happy? Are the bosses at my work going to be equally happy with my AI replacement?
reply
sss111
22 hours ago
[-]
3 to 5 years, max. Traditional coding is going to be dead in the water. Optimistically, the junior SWE job will evolve but more realistically dedicated AI-based programming agents will end demand for Junior SWEs
reply
lagrange77
19 hours ago
[-]
Which implies that a few years later they will not become senior SWEs either.
reply
HarHarVeryFunny
17 hours ago
[-]
You're not being paid $150K to "write code". You're being paid that to deliver solutions - to be a corporate cog than can ingest business requirements and emit (and maintain) business solutions.

If there are jobs paying $150K just to code (someone else tells you what to code, and you just code it up), then please share!

reply
arrosenberg
22 hours ago
[-]
Unless the LLMs see multiple leaps in capability, probably indefinitely. The Malthusians in this thread seem to think that LLMs are going to fix the human problems involved in executing these businesses - they won't. They make good programmers more productive and will cost some jobs at the margins, but it will be the low-level programming work that was previously outsourced to Asia and South America for cost-arbitrage.
reply
mrdependable
20 hours ago
[-]
I think they will have to figure out how to get around context limits before that happens. I also wouldn't be surprised if the future models that can actually replace workers are sold at such an exorbitant price that only larger companies will be able to afford it. Everyone else gets access to less capable models that still require someone with knowledge to get to an end result.
reply
torginus
22 hours ago
[-]
Well, considering they floated the $2000 subscription idea, and they still haven't revealed everything, they could still introduce the $2k sub with o3+agents/tool use, which means, till about next week.
reply
deadbabe
22 hours ago
[-]
There’s a very good chance that if a company can replace its programmers with pure AI then it means whatever they’re doing is probably already being offered as a SaaS product so why not just skip the AI and buy that? Much cheaper and you don’t have to worry about dealing with bugs.
reply
croemer
21 hours ago
[-]
SaaS works for general problems faced by many businesses.
reply
deadbabe
21 hours ago
[-]
Exactly. Most businesses can get away with not having developers at all if they just glue together the right combination of SaaS products. But this doesn’t happen, implying there is something more about having your own homegrown developers that SaaS cannot replace.
reply
croemer
21 hours ago
[-]
The risk is not SaaS replacing internal developers. It's about increased productivity of developers reducing the number of developers needed to achieve something.
reply
deadbabe
19 hours ago
[-]
Again, you’re assuming product complexity won’t grow as a result of new AI tools.

3 decades ago you needed a big team to create the type of video games that one person can probably make on their own today in their spare time with modern tools.

But now modern tools have been used to make even more complicated games that require more massive teams than ever and huge amounts of money. One person has no hope of replicating that now, but maybe in the future with AI they can. And then the AAA games will be even more advanced.

It will be similar with other software.

reply
kirykl
18 hours ago
[-]
If it’s any consolation, Agile priests and middle managers will be the first to go
reply
colesantiago
22 hours ago
[-]
Frontier expert specialist programmers will always be in demand.

Generalist junior and senior engineers will need to think of a different career path in less than 5 years as more layoffs will reduce the software engineering workforce.

It looks like it may be the way things are if progress in the o1, o3, oN models and other LLMs continues on.

reply
mitjam
21 hours ago
[-]
The question is: How to become a senior when there is no place to be a junior? Will future SWE need to do the 10k hours as a hobby? Will AI speed up or slow down learning?
reply
singularity2001
19 hours ago
[-]
good question and I think you gave the correct answer yes people will just do the 10,000 hours required by starting programming at the age of eight and then playing around until they're done studying
reply
deadbabe
22 hours ago
[-]
This assumes that software products in the future will remain at the same complexity as they are today, just with AI building them out.

But they won’t. AI will enable building even more complex software which counter intuitively will result in need even more human jobs to deal with this added complexity.

Think about how despite an increasing amount of free open source libraries over time enabling some powerful stuff easily, developer jobs have only increased, not decreased.

reply
hackinthebochs
21 hours ago
[-]
What about "general" in AGI do you not understand? There will be no new style of development for which the AGI will be poorly suited that all the displaced developers can move to.
reply
bandwidth-bob
20 hours ago
[-]
For true AGI (whatever that means, lets say fully replicates human abilities), discussing "developers" only is a drop in the bucket compared to all knowledge work jobs which will be displaced.
reply
dmm
21 hours ago
[-]
I've made a similar argument in the past but now I'm not so sure. It seems to me that developer demand was linked to large expansions in software demand first from PCs then the web and finally smartphones.

What if software demand is largely saturated? It seems the big tech companies have struggled to come up with the next big tech product category, despite lots of talent and capital.

reply
bandwidth-bob
20 hours ago
[-]
The new capabilities of LLMs, and generally large foundation models, expands the range of what a computer program can do. Naturally, we will need to build all of those things with code. Which will be done by a combo of people with product ideas, engineers, and LLMs. There will be then specialization and competition on each new use-case. eg., who builds the best AI doctor etc.,.
reply
deadbabe
21 hours ago
[-]
There doesn’t need to be a new category. Existing categories can just continue bloating in complexity.

Compare the early web vs the complicated JavaScript laden single page application web we have now. You need way more people now. AI will make it even worse.

Consider that in the AI driven future, there will be no more frameworks like React. Who is going to bother writing one? Instead every company will just have their own little custom framework built by an AI that works only for their company. Joining a new company means you bring generalist skills and learn how their software works from the ground up and when you leave to another company that knowledge is instantly useless.

Sounds exciting.

But there’s also plenty of unexplored categories anyway that we can’t access still because there’s insufficient technology for. Household robots with AGI for instance may require instructions for specific services sold as “apps” that have to be designed and developed by companies.

reply
cruffle_duffle
20 hours ago
[-]
This is exactly what will happen. We'll just up the complexity game to entirely new baselines. There will continue to be good money in software.

These models are tools to help engineers, not replacements. Models cannot, on their own, build novel new things no matter how much the hype suggests otherwise. What they can do is remove a hell of a lot of accidental complexity.

reply
lagrange77
18 hours ago
[-]
> These models are tools to help engineers, not replacements. Models cannot, on their own, build novel new things no matter how much the hype suggests otherwise.

But maybe models + managers/non technical people can?

reply
tsunamifury
22 hours ago
[-]
Often what happens is the golf-course phenomenon. As golfing gets less popular, low and mid tier golf courses go out of business as they simply aren't needed. But at the same time demand for high end golf courses actually skyrockets because people who want to golf either can give it up or go higher end.

This I think will happen with programmers. Rote programming will slowly die out, while demand for super high end will go dramatically up in price.

reply
CapcomGo
22 hours ago
[-]
Where does this golf-course phenomenon come from? It doesn't really match the real world or how golfing works.
reply
tsunamifury
22 hours ago
[-]
how so, witnessed it quite directly in California. Majority have closed and remaining have gone up in price and are up scale. This has been covered in various new programs like 60 minutes. You can look up death of golfing.

Also unsure what you mean by...'how golfing works'. This is the economics of it, not the game

reply
EVa5I7bHFq9mnYK
20 hours ago
[-]
Maybe its CA thing? Plenty of $50 golf courses here in Phoenix.
reply
razodactyl
22 hours ago
[-]
Great. Now we have to think of a new way to move the goalposts.
reply
a_wild_dandan
22 hours ago
[-]
Let's just define AI as "whatever computers still can't do." That'll show those dumb statistical parrots!
reply
foobarqux
21 hours ago
[-]
This is just as silly as claiming that people "moved the goalposts" when a computer beat Kasparov at chess to claim that it wasn't AGI: it wasn't a good test and some people only realize this after the computer beat Kasparov but couldn't do much else. In this case the ARC maintainers specifically have stated that this is a necessary but not sufficient test of AGI (I personally think it is neither).
reply
og_kalu
21 hours ago
[-]
It's not silly. The computer that could beat Kasparov couldn't do anything else so of course it wasn't Artificial General Intelligence.

o3 can do much much more. There is nothing narrow about SOTA LLMs. They are already General. It doesn't matter what ARC Maintainers have said. There is no common definition of General that LLMs fail to meet. It's not a binary thing.

By the time a single machine covers every little test humanity can devise, what comes out of that is not 'AGI' as the words themselves mean but a General Super Intelligence.

reply
foobarqux
20 hours ago
[-]
It is silly, the logic is the same: "Only a (world-altering) 'AGI' could do [test]" -> test is passed -> no (world-altering) 'AGI' -> conclude that [test] is not a sufficient test for (world-altering) 'AGI' -> chase new benchmark.

If you want to play games about how to define AGI go ahead. People have been claiming for years that we've already reached AGI and with every improvement they have to bizarrely claim anew that now we've really achieved AGI. But after a few months people realize it still doesn't do what you would expect of an AGI and so you chase some new benchmark ("just one more eval").

The fact is that there really hasn't been the type of world-altering impact that people generally associate with AGI and no reason to expect one.

reply
og_kalu
19 hours ago
[-]
>It is silly, the logic is the same: "Only a (world-altering) 'AGI' could do [test]" -> test is passed -> no (world-altering) 'AGI' -> conclude that [test] is not a sufficient test for (world-altering) 'AGI' -> chase new benchmark.

Basically nobody today thinks beating a single benchmark and nothing else will make you a General Intelligence. As you've already pointed out out, even the maintainers of ARC-AGI do not think this.

>If you want to play games about how to define AGI go ahead.

I'm not playing any games. ENIAC cannot do 99% of the things people use computers to do today and yet barely anybody will tell you it wasn't the first general purpose computer.

On the contrary, it is people who seem to think "General" is a moniker for everything under the sun (and then some) that are playing games with definitions.

>People have been claiming for years that we've already reached AGI and with every improvement they have to bizarrely claim anew that now we've really achieved AGI.

Who are these people ? Do you have any examples at all. Genuine question

>But after a few months people realize it still doesn't do what you would expect of an AGI and so you chase some new benchmark ("just one more eval").

What do you expect from 'AGI'? Everybody seems to have different expectations, much of it rooted in science fiction and not even reality, so this is a moot point. What exactly is World Altering to you ? Genuinely, do you even have anything other than a "I'll know it when i see it ?"

If you introduce technology most people adopt, is that world altering or are you waiting for Skynet ?

reply
foobarqux
19 hours ago
[-]
> Basically nobody today thinks beating a single benchmark and nothing else will make you a General Intelligence.

People's comments, including in this very thread, seem to suggest otherwise (c.f. comments about "goal post moving"). Are you saying that a widespread belief wasn't that a chess playing computer would require AGI? Or that Go was at some point the new test for AGI? Or the Turing test?

> I'm not playing any games... "General" is a moniker for everything under the sun that are playing games with definitions.

People have a colloquial understanding of AGI whose consequence is a significant change to daily life, not the tortured technical definition that you are using. Again your definition isn't something anyone cares about (except maybe in the legal contract between OpenAI and Microsoft).

> Who are these people ? Do you have any examples at all. Genuine question

How about you? I get the impression that you think AGI was achieved some time ago. It's a bit difficult to simultaneously argue both that we achieved AGI in GPT-N and also that GPT-(N+X) is now the real breakthrough AGI while claiming that your definition of AGI is useful.

> What do you expect from 'AGI'?

I think everyone's definition of AGI includes, as a component, significant changes to the world, which probably would be something like rapid GDP growth or unemployment (though you could have either of those without AGI). The fact that you have to argue about what the word "general" technically means is proof that we don't have AGI in a sense that anyone cares about.

reply
og_kalu
18 hours ago
[-]
>People's comments, including in this very thread, seem to suggest otherwise (c.f. comments about "goal post moving").

But you don't see this kind of discussion on the narrow models/techniques that made strides on this benchmark, do you ?

>People have a colloquial understanding of AGI whose consequence is a significant change to daily life, not the tortured technical definition that you are using

And ChatGPT has represented a significant change to the daily lives of many. It's the fastest adopted software product in history. In just 2 years, it's one of the top ten most visited sites on the planet worldwide. A lot of people have had the work they do significant change since its release. This is why I ask, what is world altering ?

>How about you? I get the impression that you think AGI was achieved some time ago.

Sure

>It's a bit difficult to simultaneously argue both that we achieved AGI in GPT-N and also that GPT-(N+X) is now the real breakthrough AGI

I have never claimed GPT-N+X is the "new breakthrough AGI". As far as I'm concerned, we hit AGI sometime ago and are making strides in competence and/or enabling even more capabilities.

You can recognize ENIAC as a general purpose computer and also recognize the breakthroughs in computing since then. They're not mutually exclusive.

And personally, I'm more impressed with o3's Frontier Math score than ARC.

>I think everyone's definition of AGI includes, as a component, significant changes to the world

Sure

>which probably would be something like rapid GDP growth or unemployment

What people imagine as "significant change" is definitely not in any broad agreement.

Even in science fiction, the existence of general intelligences more competent than today's LLMs does not necessarily precursor massive unemployment or GDP growth.

And for a lot of people, the clincher stopping them from calling a machine AGI is not even any of these things. For some, that it is "sentient" or "cannot lie" is far more important than any spike of unemployment.

reply
Jensson
17 hours ago
[-]
> But you don't see this kind of discussion on the narrow models/techniques that made strides on this benchmark, do you ?

This model was trained to pass this test, it was trained heavily on the example questions, so it was a narrow technique.

We even have proof that it isn't AGI, since it scores horribly on ARC-AGI 2. It overfitted for this test.

reply
og_kalu
16 hours ago
[-]
>This model was trained to pass this test, it was trained heavily on the example questions, so it was a narrow technique.

You are allowed to train on the train set. That's the entire point of the test.

>We even have proof that it isn't AGI, since it scores horribly on ARC-AGI 2. It overfitted for this test.

Arc 2 does not even exist yet. All we have are "early signs", not that that would be proof of anything. Whether I believe the models are generally intelligent or not doesn't depend on ARC

reply
Jensson
16 hours ago
[-]
> You are allowed to train on the train test. That's the entire point of the test.

Right, but by training on those test cases you are creating a narrow model. The whole point of training questions is to create narrow models, like all the models we did before.

reply
og_kalu
16 hours ago
[-]
That doesn't make any sense. Training on the train set does not make the models capabilities narrow. Models are narrow when you can't train them to do anything else even if you wanted to.

You are not narrow for undergoing training and it's honestly kind of ridiculous to think so. Not even the ARC maintainers believe so.

reply
Jensson
15 hours ago
[-]
> Training on the train set does not make the models capabilities narrow

Humans didn't need to see the training set to pass this, the AI needing it means it is narrower than the humans, at least on these kind of tasks.

The system might be more general than previous models, but still not as general as humans, and the G in AGI typically means being as general as humans. We are moving towards more general models, but still not at the level where we call them AGI.

reply
foobarqux
18 hours ago
[-]
> But you don't see this kind of discussion on the narrow models/techniques that made strides on this benchmark, do you ?

I don't understand what you are getting at.

Ultimately there is no axiomatic definition of the term AGI. I don't think the colloquial understanding of the word is what you think it is (i.e. if you had described to people, pre-chatgpt, today's chatgpt behavior, including all the limitations and failings and the fact that there was no change in GDP, unemployment, etc), and asked if that was AGI I seriously doubt they would say yes.)

More importantly I don't think anyone would say their life was much different from a few years ago and separately would say under AGI it would be.

But the point that started all this discussion is the fact that these "evals" are not good proxies for AGI and no one is moving goal-posts even if they realize this fact only after the tests have been beaten. You can foolishly define AGI as beating ARC but the moment ARC is beaten you realize that you don't care about that definition at all. That doesn't change if you make a 10 or 100 benchmark suite.

reply
og_kalu
16 hours ago
[-]
>I don't understand what you are getting at.

If such discussions only made when LLMs make strides in the benchmark then it's not just about beating the benchmark but also what kind of system is beating it.

>You can foolishly define AGI as beating ARC but the moment ARC is beaten you realize that you don't care about that definition at all.

If you change your definition of AGI the moment a test is beaten then yes, you are simply post moving.

If you care about other impacts like "Unemployment" and "GDP rising" but don't give any time or opportunity to see if the model is capable of such then you don't really care about that and are just mindlessly shifting posts.

How do such a person know o3 won't cause mass unemployment? The model hasn't even been released yet.

reply
foobarqux
13 hours ago
[-]
> If such discussions only made when LLMs make strides in the benchmark then it's not just about beating the benchmark but also what kind of system is beating it.

I still don't understand the point you are making. Nobody is arguing that discrete program search is AGI (and the same counter-arguments would apply if they did).

> If you change your definition of AGI the moment a test is beaten then yes, you are simply post moving.

I don't think anyone changes their definition, they just erroneously assume that any system that succeeds on the test must do so only because it has general intelligence (that was the argument for chess playing for example). When it turns out that you can pass the test with much narrower capabilities they recognize that it was a bad test (unfortunately they often replace the bad test with another bad test and repeat the error).

> If you care about other impacts like "Unemployment" and "GDP rising" but don't give any time or opportunity to see if the model is capable of such then you don't really care about that and are just mindlessly shifting posts.

We are talking about what models are doing now (is AGI here now) not what some imaginary research breakthroughs might accomplish. O3 is not going to materially change GDP or unemployment. (If you are confident otherwise please say how much you are willing to wager on it).

reply
og_kalu
13 hours ago
[-]
I'm not talking about any imaginary research breakthroughs. I'm talking about today, right now. We have a model unveiled today that seems a large improvement across several benchmarks but hasn't been released yet.

You can be confident all you want but until the model has been given the chance to not have the effect you think it won't then it's just an assertion that may or may not be entirely wrong.

If you say "this model passed this benchmark I thought would indicate AGI but didn't do this or that so I won't acknowledge it" then I can understand that. I may not agree on what the holdups are but I understand that.

If however you're "this model passed this benchmark I thought would indicate AGI but I don't think it's going to be able to do this or that so it's not AGI" then I'm sorry but that's just nonsense.

My thoughts or bets are irrelevant here.

A few days ago I saw someone seriously comparing a site with nearly 4B visits a month in under 2 years to Bitcoin and VR. People are so up in their bubbles and so assured in their way of thinking they can't see what's right in front of them, nevermind predict future usefulness. I'm just not interested in engaging "I think It won't" arguments when I can just wait and see.

I'm not saying you are one of such people. I just have no interest in such arguments.

My bet ? There's no way i would make a bet like that without playing with the model first. Why would I ? Why Would you ?

reply
Pesthuf
22 hours ago
[-]
Well right now, running this model is really expensive, but we should prepare a new cope for when equivalent models no longer are, ahead of time.
reply
cchance
22 hours ago
[-]
Ya getting costs down will be the big one, i imagine quantization, distillation and lots and lots of improvements on the compute side both hardware and software wise.
reply
tines
22 hours ago
[-]
I mean, what else do you call learning?
reply
rvz
22 hours ago
[-]
Great results. However, let's all just admit it.

It has well replaced journalists, artists and on its way to replace nearly both junior and senior engineers. The ultimate intention of "AGI" is that it is going to replace tens of millions of jobs. That is it and you know it.

It will only accelerate and we need to stop pretending and coping. Instead lets discuss solutions for those lost jobs.

So what is the replacement for these lost jobs? (It is not UBI or "better jobs" without defining them.)

reply
drdaeman
21 hours ago
[-]
> It has well replaced journalists, artists and on its way to replace nearly both junior and senior engineers.

Did it, really? Or did it just provide automation for routine no-thinking-necessary text-writing tasks, but is still ultimately completely bound by the level of human operator's intelligence? I strongly suspect it's the latter. If it had actually replaced journalists it must be junk outlets, where readers' intelligence is negligible and anything goes.

Just yesterday I've used o1 and Claude 3.5 to debug a Linux kernel issue (ultimately, a bad DSDT table causing TPM2 driver unable to reserve memory region for command response buffer, the solution was to use memmap to remove NVS flag from the relevant regions) and confirmed once again LLMs still don't reason at all - just spew out plausible-looking chains of words. The models were good listeners, and a mostly-helpful code generators (when they didn't make silliest mistakes), but they gave no traces of understanding and no attention for any nuances (e.g. LLM used `IS_ERR` to check `__request_resource` result, despite me giving it full source code for that function and there's even a comment that makes it obvious it returns a pointer or NULL, not an error code - misguided attention kind of mistake).

So, in my opinion, LLMs (as currently available to broad public, like myself) are useful for automating away some routine stuff, but their usefulness is bounded by the operator's knowledge and intelligence. And that means that the actual jobs (if they require thinking and not just writing words) are safe.

When asked about what I do at work, I used to joke that I just press buttons on my keyboard in fancy patterns. Ultimately, LLMs seem to suggest that it's not what I really do.

reply
RivieraKid
21 hours ago
[-]
The economic theory answer is that people simply switch to jobs that are not yet replaceable by AI. Doctors, nurses, electricians, construction workers, police officers, etc. People in aggregate will produce more, consume more and work less.
reply
achierius
8 hours ago
[-]
> Doctors

Many replaceable

> Police officers

Many replaceable (desk officers)

reply
whynotminot
21 hours ago
[-]
When none of us have jobs or income, there will be no ability for us to buy products. And then no reason for companies to buy ads to sell products to people who don’t have money. Without ad money (or the potential of future ad money), the people pushing the bounds of AGI into work replacement will lose the very income streams powering this research and their valuations.

Ford didn’t support a 40 hour work week out of the kindness of his heart. He wanted his workers to have time off for buying things (like his cars).

I wonder if our AGI industrialist overlords will do something similar for revenue sharing or UBI.

reply
tivert
19 hours ago
[-]
> When none of us have jobs or income, there will be no ability for us to buy products. And then no reason for companies to buy ads to sell products to people who don’t have money. Without ad money (or the potential of future ad money), the people pushing the bounds of AGI into work replacement will lose the very income streams powering this research and their valuations.

I don't think so. I agree the push for AGI will kill the modern consumer product economy, but I think it's quite possible for the economy to evolve into a new form (that will probably be terrible for most humans) that keep pushes "work replacement."

Imagine, an AGI billionare buying up land, mines, and power plants as the consumer economy dies, then shifting those resources away from the consumer economy into self-aggrandizing pet projects (e.g. ziggurats, penthouses on Mars, space yachts, life extension, and stuff like that). He might still employ a small community of servants, AGI researchers, and other specialists; but all the rest of the population will be irrelevant to him.

And individual autarky probably isn't necessary, consumption will be redirected towards the massive pet production I mentioned, with vestigial markets for power, minerals, etc.

reply
whimsicalism
21 hours ago
[-]
This picture doesn't make sense. If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.

In reality, if there really is mass unemployment, AI driven automation will make consumables so cheap that anyone will be able to buy it.

reply
astrange
16 hours ago
[-]
> If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.

This isn't possible if you want to pay sales taxes - those are what keep transactions being done in the official currency. Of course in a world of 99% unemployment presumably we don't care about this.

But yes, this world of 99% unemployment isn't possible, eg because as soon as you have two people and they trade things, they're employed again.

reply
tivert
19 hours ago
[-]
> This picture doesn't make sense. If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.

Ultimately, it all comes down to raw materials and similar resources, and all those will be claimed by people with lots of real money. Your "invented ... other money" will be useless to buy that fundamental stuff. At best, it will be useful for trading scrap and other junk among the unemployed.

> In reality, if there really is mass unemployment, AI driven automation will make consumables so cheap that anyone will be able to buy it.

No. Why would the people who own that automation want to waste their resources producing consumer goods for people with nothing to give them in return?

reply
whimsicalism
17 hours ago
[-]
if people with AI use it to somehow enclose all raw resources, then yes - the picture i painted will be wrong
reply
whynotminot
17 hours ago
[-]
Enclosing raw resources tends to be what powerful people do.
reply
astrange
16 hours ago
[-]
"Raw resources" aren't that valuable economically because they aren't where most of the value is added in production. That's why having a lot of them tends to make your country poorer (https://en.wikipedia.org/wiki/Resource_curse).
reply
Jensson
15 hours ago
[-]
Today educated humans are more valuable than anything else on earth, but AGI changes that. With cheap AGI raw resources and infrastructure will be the only two valuable things left.
reply
whynotminot
21 hours ago
[-]
> This picture doesn't make sense. If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.

Uh, this picture doesn’t make sense. Why would anyone value this randomly invented money?

reply
whimsicalism
21 hours ago
[-]
> Why would anyone value this randomly invented money?

Because they can use it to pay for goods?

Your notion is that almost everyone is going to be out of a job and thus have nothing. Okay, so I'm one of those people and I need this house built. But I'm not making any money because of AI or whatever. Maybe someone else needs someone to drive their aging relative around and they're a good builder.

If 1. neither of those people have jobs or income because of AI 2. AI isn't provisioning services for basically free,

then it makes sense for them to do an exchange of labor - even with AI (if that AI is not providing services to everyone). The original reason for having money and exchanging it still exists.

reply
staticman2
19 hours ago
[-]
You seem to be arguing that large unemployment rates are logically impossible, so we shouldn't worry about unemployment.

The fact unemployment was 25% during the great depression would seem to suggest that at a minimum, a 25% unemployment rate is possible during a disruptive event.

reply
astrange
16 hours ago
[-]
The unemployment rate in a modern economy is basically whatever the central bank wants it to be. The Great Depression was caused by bad monetary policy - I don't see a reason why having AI would cause that.
reply
staticman2
14 hours ago
[-]
The person upthread was saying that as long as someone wants a house built and someone wants a grandma driven around unemployment can't happen.

Unless nobody wanted either of those things done during the depression that's clearly not a very good mental model.

reply
astrange
13 hours ago
[-]
Yes, I disagree with that. The problem isn't the lack of demand, it's that the people with the demand can't get the money to express it with.
reply
neom
21 hours ago
[-]
Didn't money basically only emerge to deal with with difficulty of “double coincidence of wants”. Money simply solved the problem of making all forms of value interchangeable and transportable across time AND circumstance? A dollar can do with with or without AI existing no?
reply
whimsicalism
21 hours ago
[-]
Yes, that's my point
reply
whynotminot
21 hours ago
[-]
Honestly I don’t even know how to engage with your point.

Yes if we recreate society some form of money would likely emerge.

reply
neom
22 hours ago
[-]
Do you follow Jack Clark? I noticed he's been on the road a lot talking to governments and policy makers, and not just in the "AI is coming" way he used to talk.
reply
CliveBloomers
21 hours ago
[-]
Another meaningless benchmark, another month—it’s like clockwork at this point. No one’s going to remember this in a month; it’s just noise. The real test? It’s not in these flashy metrics or minor improvements. The only thing that actually matters is how fast it can wipe out the layers of middle management and all those pointless, bureaucratic jobs that add zero value.

That’s the true litmus test. Everything else? It’s just fine-tuning weights, playing around the edges. Until it starts cutting through the fat and reshaping how organizations really operate, all of this is just more of the same.

reply
oytis
20 hours ago
[-]
So far AI market seems to be focused on replacing meaningful jobs, meaningless ones look safe (which kind of makes sense if you think about it).
reply
handfuloflight
21 hours ago
[-]
Agreed, but isn't it management who decides that this would be implemented? Are they going to propogate their own removal?
reply
zamadatix
20 hours ago
[-]
Middle manager types are probably interested in their salary performance more than anything. "Real" management (more of their assets come from their ownership of the company than a salary) will override them if it's truthfully the best performing operating model for the company.
reply
sakopov
13 hours ago
[-]
Maybe I'm missing something vital, but how does anything that we've seen AI do up until this point or explained in this experiment even hint at AGI? Can any of these models ideate? Can they come up with technologies and tools? No and it's unlikely they will any time soon. However, they can make engineers infinitely more productive.
reply
jebarker
13 hours ago
[-]
You need to define ideate, tools and technologies to answer those questions. Not to mention that it's quite possible humans do those things through re-combination of learned ideas similarly to how these reasoning models are suggested to be working.
reply
sakopov
12 hours ago
[-]
Every technological advancement that we've seen in software engineering - be it in things like Postgres, Kubernetes and Cloud Infrastructure - came out from truly novel ideas. AI seems to generate outputs that appear novel but are they really? It's capable of synthesizing and combining vast amounts of information in creative ways but it's deriving everything from existing patterns found within its training data. Truly novel ideas require thinking outside the box. It's combination of cognitive, emotional and environmental factors which go beyond pattern recognition. How close are we to achieving this? Everyone seems to be shaking in their boots because we might lose our job safety in tech, but I don't see any intelligence here.
reply
panabee
16 hours ago
[-]
Nadella is a superb CEO, inarguably among the best of his generation. He believed in OpenAI when no one else did and deserves acclaim for this brilliant investment.

But his "below them, above them, around them" quote on OpenAI may haunt him in 2025/2026.

OAI or someone else will approach AGI-like capabilities (however nebulous the term), fostering the conditions to contest Microsoft's straitjacket.

Of course, OAI is hemorrhaging cash and may fail to create a sustainable business without GPU credits, but the possibility of OAI escaping Microsoft's grasp grows by the day.

Coupled with research and hardware trends, OAI's product strategy suggests the probability of a sustainable business within 1-3 years is far from certain but also higher than commonly believed.

If OAI becomes a $200b+ independent company, it would be against incredible odds given the intense competition and the Microsoft deal. PG's cannibal quote about Altman feels so apt.

It will be fascinating to see how this unfolds.

Congrats to OAI on yet another fantastic release.

reply
noah32
14 hours ago
[-]
The best AI on this graph costs 50000% more than a stem graduate to complete the tasks and even then has an error rate that is 1000% higher than the humans???
reply
agnosticmantis
17 hours ago
[-]
This is so impressive that it brings out the pessimist in me.

Hopefully my skepticism will end up being unwarranted, but how confident are we that the queries are not routed to human workers behind the API? This sounds crazy but is plausible for the fake-it-till-you-make-it crowd.

Also given the prohibitive compute costs per task, typical users won't be using this model, so the scheme could go on for quite sometime before the public knows the truth.

They could also come out in a month and say o3 was so smart it'd endanger the civilization, so we deleted the code and saved humanity!

reply
kvn8888
17 hours ago
[-]
That would be a ton of problems for a small team of PhD/Grad level experts to solve (for GPQA Diamond, etc) in a short time. Remember, on EpochAl Frontier Math, these problems require hours to days worth of reasoning by humans

The author also suggested this is a new architecture that uses existing methods, like a Monte Carlo tree search that deepmind is investigating (they use this method for AlphaZero)

I don't see the point of colluding for this sort of fraud, as these methods like tree search and pruning already exist. And other labs could genuinely produce these results

reply
agnosticmantis
16 hours ago
[-]
I had the ARC AGI in mind when I suggested human workers. I agree the other benchmark results make the use of human workers unlikely.
reply
aetherson
16 hours ago
[-]
I'm very confident that queries were not routed to human workers behind the API.

Possibly some other form of "make it seem more impressive than it is," but not that one.

reply
rsanek
17 hours ago
[-]
this is an impressive tinfoil take. but what would be their plan in the medium term? like once they release this people can check their data
reply
agnosticmantis
16 hours ago
[-]
How can people check their data?

In the medium term the plan could be to achieve AGI, and then AGI would figure out how to actually write o3. (Probably after AGI figures out the business model though: https://www.reddit.com/r/MachineLearning/s/OV4S2hGgW8)

reply