Show HN: Only 1 LLM can fly a drone
38 points
4 hours ago
| 5 comments
| github.com
avaer
14 minutes ago
[-]
Gemini 3 is the only model I've found that can reason spatially. The results here are consistent with my experiments putting LLM NPCs in simulated worlds.

I was surprised that most VLMs cannot reliably tell if a character is facing left or right; they will confidently give a wrong answer no matter what you do (even Gemini 3 cannot do it reliably). I guess it's just not in the training data.

That said, Qwen3VL models are smaller/faster and better "spatially grounded" in pixel space, because pixel coordinates are encoded in the tokens. So you can use them for detecting things in the scene and where they are (which you can project to 3D space if you are running a sim). But they are not good reasoning models, so don't ask them to think.

That means the best pipeline I've found at the moment is to tack a dumb detection prepass on before your action reasoning. This basically turns 3D sims into 1D text sims operating on labels -- which is something LLMs are good at.
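
For concreteness, a minimal sketch of that prepass-then-reason split, assuming an OpenAI-compatible client; the model names, prompt, and JSON shape are placeholders, not anything from the repo:

    # Hypothetical two-stage pipeline: a small VLM does the dumb detection
    # prepass; a stronger model reasons over text labels only.
    import json
    from openai import OpenAI

    client = OpenAI()

    def detect(frame_b64: str) -> list[dict]:
        # Ask a detection-capable VLM for labeled pixel-space hits.
        resp = client.chat.completions.create(
            model="qwen3-vl",  # placeholder model name
            messages=[{"role": "user", "content": [
                {"type": "text", "text":
                 'List visible objects as JSON: [{"label": str, "x": int, "y": int}]'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{frame_b64}"}},
            ]}],
        )
        return json.loads(resp.choices[0].message.content)

    def act(detections: list[dict]) -> str:
        # The "1D text sim" step: reason over labels, not pixels.
        scene = "\n".join(f"- {d['label']} at ({d['x']}, {d['y']})"
                          for d in detections)
        resp = client.chat.completions.create(
            model="gemini-3",  # placeholder reasoning model
            messages=[{"role": "user",
                       "content": f"Scene:\n{scene}\n\nChoose the next action."}],
        )
        return resp.choices[0].message.content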

reply
fsiefken
22 minutes ago
[-]
I am curious how these models would perform, and how much energy they'd take, detecting objects in semi-realtime: SmolVLM2-500M, Moondream 0.5B/2B/2.5B, Qwen3-VL (3B) (https://huggingface.co/collections/Qwen/qwen3-vl)

I am sure this is already being worked on in Russia, Ukraine, and the Netherlands. A lot can go wrong with autonomous flying. One could load the VLM on a high-end Android phone on the drone and have dual control.

reply
accrual
33 minutes ago
[-]
I think it's fascinating work even if LLMs aren't the ideal tool for this job right now.

There were some experiments with embodied LLMs on the front page recently (e.g. basic robot body + task), and SOTA models struggled with that too. And of course they would -- what training data is there for embodying a random device with arbitrary controls and feedback? They have to lean on the "general" aspects of their intelligence, which are still improving.

With dedicated embodiment training and an even tighter/faster feedback loop, I don't see why an LLM couldn't successfully pilot a drone. I'm sure some will still go off the rails, but software guardrails could help by preventing certain maneuvers.
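
As a hypothetical example of such a guardrail (the Command shape and the limits are made up, not from the project):

    # Clamp LLM-proposed maneuvers to a safe envelope before they reach
    # the flight controller. All numbers here are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Command:
        vx: float  # m/s
        vy: float
        vz: float  # positive = up

    MAX_SPEED = 2.0  # assumed speed limit, m/s
    MIN_ALT = 1.0    # assumed altitude floor, m

    def clamp(v: float, lo: float, hi: float) -> float:
        return max(lo, min(hi, v))

    def guard(cmd: Command, altitude: float) -> Command:
        safe = Command(
            vx=clamp(cmd.vx, -MAX_SPEED, MAX_SPEED),
            vy=clamp(cmd.vy, -MAX_SPEED, MAX_SPEED),
            vz=clamp(cmd.vz, -MAX_SPEED, MAX_SPEED),
        )
        if altitude <= MIN_ALT and safe.vz < 0:
            safe.vz = 0.0  # refuse to descend through the floor
        return safe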

reply
bigfishrunning
1 hour ago
[-]
Why would you want an LLM to fly a drone? Seems like the wrong tool for the job -- it's like saying "Only one power drill can pound roofing nails." Maybe that's true, but just get a hammer.
reply
avaer
5 minutes ago
[-]
Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.

Charitably, I guess you can question why you would ever want to use text to command a machine in the world (simulated or not).

But I don't see how it's the wrong tool given the goal.

reply
notepad0x90
1 hour ago
[-]
There are almost endless reasons why. It's like asking why you would want a self-driving car. Having a drone to transport things would be amazing, or to patrol an area. LLMs can be helpful with object identification, reacting to different events, and taking commands from users.

The first thought I had was those security guard robots that are popping up all over the place. If they were drones instead, and an LLM talked to people asking them to do or not do things, that would be an improvement.

Or a waiter drone that takes your order in a restaurant, flies to the kitchen, picks up a sealed and secured food container, flies it back to the table, opens it, and leaves. It would monitor gestures and voice commands to respond to diners and get their feedback (or abuse), take the food back if it isn't satisfactory, etc.

This is the type of stuff we used to see in futuristic movies. It's almost possible now. Glad to see this kind of tinkering.

reply
laffOr
41 minutes ago
[-]
You could have a program for the flying -- not LLM-based, though it could be an ANN -- and an LLM for oversight; the LLM could give the pilot program instructions as (x, y, z) directions, something like the sketch below. I mean, current autopilots are typically not LLMs, right?

You describe why it would be useful to have an LLM in a drone to interact with it, but you do not explain why that very same LLM should be doing the flying.
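
A hypothetical sketch of that split -- the LLM emits waypoints at a slow cadence, a conventional non-LLM loop does the flying; everything here is a stand-in, not a real autopilot API:

    import time

    def llm_next_waypoint(scene: str) -> tuple[float, float, float]:
        # Oversight layer: in practice, one chat completion per decision.
        return (10.0, 0.0, 5.0)  # stubbed

    class Pilot:
        # Control layer: classic proportional stepping, no LLM involved.
        def __init__(self) -> None:
            self.pos = [0.0, 0.0, 0.0]

        def fly_to(self, target, speed=2.0, dt=0.01):
            for _ in range(1000):  # track the waypoint for ~10 s
                for i in range(3):
                    err = target[i] - self.pos[i]
                    self.pos[i] += max(-speed * dt, min(speed * dt, err))
                time.sleep(dt)

    pilot = Pilot()
    pilot.fly_to(llm_next_waypoint("open field, tree at 10 m"))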

reply
lewispollard
1 hour ago
[-]
The point is that you don't need an LLM to pilot the thing, even if you want to integrate an LLM interface to take a request in natural language.
reply
notepad0x90
58 minutes ago
[-]
We don't need a lot of things, but new tech should also address what people want, not just what they need. I don't know how to pilot drones, nor do I care to learn, but I want to do things with drones. Does that qualify as a need? Tech is there to do the things we're too lazy to do ourselves.
reply
infecto
43 minutes ago
[-]
That’s a pretty boring point to make about what looks like a fun project. Happy to see this project, and to know I am not the only one thinking about these kinds of applications.
reply
iso1631
41 minutes ago
[-]
You want a self driving car

You don't want an LLM to drive a car

There is more to "AI" than LLMs

reply
munchler
1 hour ago
[-]
Because we’re interested in AGI (emphasis on general), and LLMs are the closest thing to AGI that we have right now.
reply
dan-bailey
59 minutes ago
[-]
When your only tool is a hammer, every problem begins to resemble a nail.
reply
bob1029
37 minutes ago
[-]
The system prompt for the drone is hilarious to me. These models are horrible at spatial reasoning tasks:

https://github.com/kxzk/snapbench/blob/main/llm_drone/src/ma...

I've been integrating GPT-5.2 into Unity. It's fantastic at scripting but completely worthless at managing transforms for scene objects. Even with elaborate planning phases, it will make a complete jackass of itself in world space every time.

LLMs are also wildly unsuitable for real-time control problems, and they always will be. A PID controller or dedicated pathfinding tool driven by the LLM will give a radically superior result.
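
A minimal sketch of that division of labor, where the LLM only moves setpoints and a PID holds them at control rate (gains and numbers are illustrative):

    class PID:
        def __init__(self, kp: float, ki: float, kd: float):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.integral = 0.0
            self.prev_err = 0.0

        def update(self, setpoint: float, measured: float, dt: float) -> float:
            err = setpoint - measured
            self.integral += err * dt
            derivative = (err - self.prev_err) / dt
            self.prev_err = err
            return self.kp * err + self.ki * self.integral + self.kd * derivative

    # The LLM might change the altitude setpoint once every few seconds;
    # this update runs at 100+ Hz in between, with no LLM in the loop.
    altitude_pid = PID(kp=1.2, ki=0.1, kd=0.3)
    thrust = altitude_pid.update(setpoint=5.0, measured=4.2, dt=0.01)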

reply
infecto
44 minutes ago
[-]
What’s the right tool then?

This looks like a pretty fun project and in my rough estimation a fun hacker project.

reply
pavlov
1 hour ago
[-]
Yeah, it feels a bit like asking "which typewriter model is the best for swimming".
reply
ralusek
22 minutes ago
[-]
Why would you want an LLM to identify plants and animals? Well, they're often better than bespoke image classification models at doing just that. Why would you want a language model to help diagnose a medical condition?

It would not surprise me at all if self-driving models are adopting a lot of their model architecture from LLMs/generative AI, and actually invoking LLMs in moments where they would otherwise have needed human intervention.

Imagine there's a decision engine at the core of a self-driving model, and it gets a classification result for what to do next. Suddenly it gets three options back with a 33.33% weight attached to each and very low confidence about which is the best choice. Maybe that's the kind of scenario that used to make self-driving refuse to choose and defer to human intervention. If it can instead first defer judgment to an LLM, which could say "that's just a goat crossing the road, INVOKE: HONK_HORN," you can imagine how that might be useful. LLMs are clearly proving to be universal reasoning agents, and it's getting tiring to hear people continuously try to reduce them to "next word predictors."
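
Something like this hypothetical handoff (the names, threshold, and stub are made up):

    def ask_llm(options: list[str]) -> str:
        # Stub: one slow LLM call instead of deferring to a human.
        return options[1]

    def choose_action(options: list[str], weights: list[float],
                      threshold: float = 0.5) -> str:
        best = max(range(len(options)), key=lambda i: weights[i])
        if weights[best] >= threshold:
            return options[best]  # primary model is confident; no LLM needed
        return ask_llm(options)   # near-tie: "just a goat -> HONK_HORN"

    choose_action(["brake", "honk_horn", "swerve"], [0.34, 0.33, 0.33])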

reply
peterpost2
1 hour ago
[-]
Did you read his post?

He answers your question

reply
macintux
1 hour ago
[-]
> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".

https://news.ycombinator.com/newsguidelines.html

reply
philipwhiuk
1 hour ago
[-]
I disagree. The nearest justification is:

> to see what happens

reply
ceejayoz
1 hour ago
[-]
Isn't that the epitome of the hacker spirit?

"Why?" "Because I can!"

reply
antisthenes
56 minutes ago
[-]
LLMs flying weaponized drones is exactly how it starts.
reply