The gauge-reading example here is great, but in reality of course having the system synthesize that Python script, run the CV tasks, come back with the answer etc. is currently quite slow.
Once things go much faster, you can also start to use image generation to have models extrapolate possible futures from photos they take, and then describe them back to themselves and make decisions based on that, loops like this. I think the assumption is that our brains do similar things unconsciously, before we integrate into our conscious conception of mind.
I'm really curious what things we could build if we had 100x or 1000x inference throughput.
A few robot legs and arms, big battery, off-the-shelf GPU. Solar panels.
Prompt: "Take care of all this land within its limits and grow some veggies."
Anyway, cool.
Of course this is for counting animal legs while giving coordinates and reading analog clocks. Not coding or or solving puzzles. I imagine the image performance to model weight of this model is very high.
So there might be awesome progress behind the scenes, just not ready for the general public.
My non-AI dishwasher can't even always keep the water inside. Nothing is perfect.
That's a bit exaggerated, no? Early roombas would get tangled in socks, drag pet poop all over the floor, break glass stuff and so on, and yet the market accepted that, evolved, and now we have plenty of cleaning robots from various companies, including cheap spying ones from china.
I actually think that there's a lot of value in being the first to deploy bots into homes, even if they aren't perfect. The amount of data you'd collect is invaluable, and by the looks of it, can't be synth generated in a lab.
I think the "safer" option is still the "bring them to factories first, offices next and homes last", but anyway I'm sure someone will jump straight to home deployments.
VLA models essentially take a webcam screenshot + some text (think "put the red block in the right box") and output motor control instructions to achieve that.
Note: "Gemini Robotics-ER" is not a VLA, though Gemini does have a VLA model too: "Gemini Robotics".