The dawn of a world simulator
63 points
4 days ago
| 11 comments
| odyssey.ml
| HN
LarsDu88
3 hours ago
[-]
I feel like there's a bit of a disconnect between the cool video demos shown here and, say, the type of world models someone like Yann LeCun is talking about.

A proper world model like JEPA should be predicting in latent space, where the representation of what is going on is highly abstract.

Video generation models by definition are either predicting in noise or pixel space (latent noise if the diffuser is diffusing in a variational autoencoder's latent space).
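To make the distinction concrete, here is a toy sketch of the two training objectives. Everything in it is an assumption for illustration: `encode` is a hypothetical fixed linear encoder, and the dimensions are arbitrary. Real JEPA-style models use learned encoders and a learned predictor, so read this only as the shape of the loss, not anyone's actual method.

```python
import random

def encode(frame, weights):
    """Hypothetical linear encoder: flat frame -> small latent vector."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pixel_space_loss(predicted_frame, target_frame):
    """Pixel-space objective: every pixel of the next frame must match."""
    return mse(predicted_frame, target_frame)

def latent_space_loss(predicted_latent, target_frame, weights):
    """Latent-space objective: only the encoded summary of the next frame must match."""
    return mse(predicted_latent, encode(target_frame, weights))

# Arbitrary toy sizes (assumptions): 16 "pixels" compressed to 4 latent values.
rng = random.Random(0)
FRAME_DIM, LATENT_DIM = 16, 4
weights = [[rng.gauss(0, 1) for _ in range(FRAME_DIM)] for _ in range(LATENT_DIM)]
target = [rng.random() for _ in range(FRAME_DIM)]

print(pixel_space_loss(target, target))
print(latent_space_loss(encode(target, weights), target, weights))
```

The point of the second objective is that the predictor is never asked to reproduce detail the encoder throws away.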

It seems like what this lab is doing is quite vanilla, and I'm wondering if they are doing any sort of research in less demo-sexy joint embedding predictive spaces.

There was a recent paper, LeJEPA, from LeCun and a postdoc that actually fixes many of the mode distribution collapse issues with the JEPA embedding models I just mentioned.

I'm waiting on the startup or research group that gives us an unsexy world model: one that, instead of giving us 1080p video of supermodels camping, gives us a slideshow of something a 6-year-old child would draw. That would be a more convincing demonstrator of an effective world model.

reply
jstanley
41 minutes ago
[-]
> Video generation models by definition are either predicting in noise or pixel space

I don't see that this follows "by definition" at all.

Just because your output is pixel values doesn't mean your internal world model is in pixel space.

reply
blueblisters
2 hours ago
[-]
Dreamer4 (https://danijar.com/project/dreamer4/) is a promising direction (by a frontier lab)
reply
godelski
4 hours ago
[-]
As a machine learning researcher, I don't get why these are called world models.

Visually, they are stunning. But they're nowhere near physical. I mean, look at that video with the girl and the lion. The tail teleports between legs and then becomes attached to the girl instead of the lion.

Just because the visuals are high quality doesn't mean it's a world model or has learned physics. I feel like we're conflating these things. I'm much happier to call something a world model if its visual quality is dogshit but it is consistent with its world. And I say its world because it doesn't need to be consistent with ours.

reply
maplethorpe
1 hour ago
[-]
The tail teleports and reattaches because that is the sort of thing that happens in this special AI world. Even though it looks like a bug, it's actually a physical process being modelled accurately.
reply
godelski
28 minutes ago
[-]
I'll remind you I am an ML researcher.

So, you need to say more. Or at least give me some reason to believe you rather than state something as an objective truth and "just trust me". In the long response to a sibling I state more precisely why I have never bought this common conjecture. Because that's what it is, conjecture.

So give me at least some reason to believe you. Because you have neither logos nor ethos. Your answer is in the form of ethos, but without the critical requisites.

reply
nurettin
4 hours ago
[-]
> Visually, they are stunning.

The input images are stunning; the model's result is another disappointing trip to the uncanny valley. But we feel OK as long as the sequence doesn't horribly contradict the original image or sound. That is the world model.

reply
godelski
3 hours ago
[-]

  > But we feel Ok as long as the sequence doesn't horribly contradict the original image or sound. 
Is the error I pointed out not "horribly contradicting"?

  > That is the world model.
I would say that if it is non-physical[0] then it's hard to call it a /world/ model. A world is consistent and has a set of rules that must be followed.

I've yet to see a claimed world model that actually captures this behavior. Yet it's something every game engine[1] gets very well. We'd call it a bad physics engine if they made the same mistakes we see even the most advanced "world models" do.

This is part of why I'm trying to explain that visual quality is actually orthogonal. Even old Atari games have consistent world models despite being pixelated. Or think about Mario on the original NES. Even the physics breaks in that game are edge cases, not the norm. But here, things like the lion's tail are not consistent even with a 2D world. I've never bought the explanation that teleporting in front of and behind the leg is an artifact of embedding 3D into 2D[2], because the issue is actually the model not understanding collision and occlusion. It does not understand how the sections relate to one another in the image.

The major problem with these systems is that they just hope that the physics is recovered through enough examples of videos. Yet if you've studied physics (beyond your basic college courses) you'd understand the naïveté of that. It took a long time to develop physics due to these specific limitations. These models don't even have the advantage of being able to interact with the environment. They have no mechanisms to form beliefs and certainly no means to test them. It's essentially impossible to develop physics through observation alone.

[0] with respect to the physics of the world being simulated. I want you to distinguish real world physics from /a physics/

[1] a game physics engine is a world model. Which, as I'm stressing in [0], does not necessarily need to follow real world physics. Mistakes happen of course but things are generally consistent.

[2] no video and almost no game is purely 2D. They tend to have backgrounds which places some layering but we'll say 2D for convenience and since we have a shared understanding
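A minimal sketch of what "consistent" means in the game-engine sense, assuming a 1-D scanline world with made-up sprites (`Sprite`, `render`, and `step` are illustrative names, not from any real engine). Depth is part of the state, so a tail passing a leg is always drawn behind it; it cannot flip in front from one frame to the next unless a rule changes its depth.

```python
from dataclasses import dataclass

@dataclass
class Sprite:
    name: str
    x: int      # left edge on a 1-D scanline
    width: int
    z: int      # depth: larger z is closer to the camera

def render(sprites, scanline_width):
    """Painter's algorithm: draw far sprites first so near ones occlude them."""
    line = ["."] * scanline_width
    for s in sorted(sprites, key=lambda s: s.z):
        for px in range(max(s.x, 0), min(s.x + s.width, scanline_width)):
            line[px] = s.name[0]
    return "".join(line)

def step(sprites, velocities):
    """Deterministic update rule: each sprite moves by its fixed velocity."""
    return [Sprite(s.name, s.x + velocities[s.name], s.width, s.z) for s in sprites]

# Toy scene (assumed values): a tail sweeping right, behind a stationary leg.
world = [Sprite("tail", 0, 3, z=0), Sprite("leg", 4, 2, z=1)]
velocities = {"tail": 1, "leg": 0}

frames = []
for _ in range(5):
    frames.append(render(world, 8))
    world = step(world, velocities)

print("\n".join(frames))
# ttt.ll..
# .tttll..
# ..ttll..
# ...tll..
# ....llt.
```

The visuals are about as crude as it gets, yet the leg occludes the tail in every frame because occlusion follows from the state rather than being guessed per frame.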

reply
kgeist
5 minutes ago
[-]
>A world is consistent and has a set of rules that must be followed.

Large language models are mostly consistent, but they make mistakes from time to time, even in grammar. And that's usually called a "hallucination". Can't we say physics errors are a kind of "hallucination" too, in a world model? I guess the question is, what hallucination rate are we willing to tolerate?
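One way to put numbers on that question, as a rough sketch: count the frames that break a stated invariant, then see how a per-frame error rate compounds over a clip. The frame records, the invariant, and the assumption that errors are independent across frames are all made up for illustration; real video errors are likely correlated.

```python
def violation_rate(frames, invariant):
    """Fraction of frames for which the invariant does not hold."""
    if not frames:
        return 0.0
    return sum(1 for f in frames if not invariant(f)) / len(frames)

def clean_clip_probability(per_frame_error_rate, num_frames):
    """Chance that a clip has no violations, ASSUMING independent per-frame errors."""
    return (1.0 - per_frame_error_rate) ** num_frames

# Hypothetical per-frame annotations: what the tail is attached to.
annotated = [
    {"tail_attached_to": "lion"},
    {"tail_attached_to": "lion"},
    {"tail_attached_to": "girl"},   # the kind of error discussed above
    {"tail_attached_to": "lion"},
]

rate = violation_rate(annotated, lambda f: f["tail_attached_to"] == "lion")
print(rate)  # 0.25

# A 1% per-frame error rate over 240 frames (10 s at 24 fps).
print(clean_clip_probability(0.01, 240))
```

Under that independence assumption, even a 1% per-frame rate leaves well under a 10% chance of a clean 10-second clip, which is one reason a tolerable rate for video may need to be far lower than it sounds.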

reply
IAmGraydon
4 hours ago
[-]
>As a machine learning researcher, I don't get why these are called world models.

It's called "world models" because it's a grift. An out-in-the-open, shameless grift. Investors, pile on.

reply
godelski
3 hours ago
[-]
I'm just trying to be a bit more political as it can be hard to communicate the issues. My first degree is actually in physics and I'll just say... over there "world model" implies something very different.

Edit: I said a bit more in the reply to the sibling comment. But we're probably on a similar page.

reply
superb_dev
10 hours ago
[-]
None of these example videos seem like the kind of “experiments” that they’re talking about simulating with these models.

I was expecting them to test a simple hypothesis and compare the model results to a real world test

reply
ainiriand
10 hours ago
[-]
It is not a world simulator, looks like a world fantasy.
reply
nl
3 hours ago
[-]
The reason they are called "world models" is because the internal representation of what they display represents a "world" instead of a video frame or image. The model needs to "understand" geometry and physics to output a video.

Just because there are errors in this doesn't mean it isn't significant. If a machine learning model understands how physical objects interact with each other that is very useful.

reply
godelski
3 hours ago
[-]

  > what they display represents a "world" instead of a video frame or image.
Do they?

I'm unconvinced. The lion and girl video is the clearest example. Nothing about that seems world representing.

reply
PunchyHamster
3 hours ago
[-]
I think the reason is "those words look nice on promo material". It is absolutely built to trigger hype from the clueless.
reply
slashdave
2 hours ago
[-]
> The model needs to "understand" geometry and physics to output a video.

No it doesn't. It merely needs to mimic.

reply
zkmon
2 hours ago
[-]
Please AI - lions have their tail attached to their back, not front. The lion's tail in the video of Girl with a lion is misplaced.
reply
rmnclmnt
9 hours ago
[-]
For a minute I was like (spoiler alert) « wow, the creepy sci-fi theories from the DEVS TV show are taking place »… then I looked up the video and that’s just video generation at this point
reply
qingcharles
4 hours ago
[-]
That's where this is headed, though. That's the end game.
reply
rmnclmnt
1 hour ago
[-]
This should be interesting then: we’ll finally be able to assert whether time is deterministic and the future and past can be modelled/predicted (if you’ve seen the show you know what I mean)
reply
jaggederest
36 minutes ago
[-]
I think that's actually already provably false if you're bloody-minded enough. I think the proof lies somewhere like Cantor's diagonalization but applied to reality, something like "if you could produce a model complex enough to model the future perfectly, it wouldn't fit into this current reality because it would require more than this reality's information".

I'm not saying it couldn't be locally violated, but it seems straightforward philosophically that each nesting doll of simulated reality must be imperfect by being less complicated.

reply
anigbrowl
8 hours ago
[-]
This appears to be a simulator that produces only nice things.
reply
01HNNWZ0MV43FF
7 hours ago
[-]
Only SFW, too
reply
nylonstrung
8 hours ago
[-]
I can't wait for companies like this to run out of money
reply
alex1138
1 hour ago
[-]
I guess this might be a chance to plug the fact that Matrix came up with their own Metaverse thing (for lack of a better word) called Third Room. It represented the rooms you joined as spaces/worlds, and they built some limited-functionality demos before the funding dried up.
reply
pedalpete
10 hours ago
[-]
This looks interesting, but can someone explain to me how this is different from video generators using the previous frames as inputs to expand on the next frame?

Is this more than recursive video? If so, how?
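For reference, the "recursive video" loop I have in mind looks roughly like the sketch below, with an optional per-step action added to show what an interactive variant might layer on top. `toy_model` is a stand-in I made up (a 1-D position with linear extrapolation), not how this or any real system works.

```python
def toy_model(context, action=0):
    """Stand-in next-frame predictor: extrapolate the last motion, plus a user nudge."""
    prev, last = context[-2], context[-1]
    return last + (last - prev) + action

def rollout(model, context, actions, context_len=2):
    """Autoregressive generation: each new frame is predicted from recent frames
    and appended, so it becomes part of the input for the next prediction."""
    frames = list(context)
    for action in actions:
        frames.append(model(frames[-context_len:], action))
    return frames

# Two seed "frames", then three generated steps; the last one is steered by the user.
print(rollout(toy_model, [0, 1], actions=[0, 0, 1]))  # [0, 1, 2, 3, 5]
```

With all actions set to zero this is plain recursive generation, so the question is whether the product adds anything beyond the action input and doing it in real time.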

reply
smusamashah
10 hours ago
[-]
See the demo on their homepage. Calling it a world simulator is a marketing gimmick. It's a worse video generator, but you can interact with it in real time and direct the video a little bit. The next version of this thing will be worth looking at; this one isn't.
reply
Animats
6 hours ago
[-]
> Calling it a world simulator is a marketing gimmick.

Yes, it should be called an AI Metaverse.

It does do a nice job of short term prediction. That's useful as a component of common sense.

reply
vrighter
3 hours ago
[-]
why would you assume anything about "the next version"?
reply
nowittyusername
8 hours ago
[-]
There is so much marketing bs around these things it drives me nuts. And it doesn't help that the large labs and credible individuals like denis use these terms. "World models" are video generators with contextual memory, but that term is so misplaced. When one thinks of a "world model", one expects the thing to at least be physics-engine driven from its foundation, not the other way around, where everything is generated and assumed at best.
reply
arminiusreturns
7 hours ago
[-]
I'm doing a metasim in full 3D with physics, I just keep seeing the limitations of the video format too much, but it is amazing when done right. The other biggest concern is licensing of output.
reply