The "UI" for saving/loading the map and calibrating the camera is exposed through a crude built-in webserver. Visualization is done via three.js instead of having a dependency on Pangolin.
If your robot can expose the camera feed as anything OpenCV can ingest (e.g., MJPEG over HTTP), you can just point it there and then receive the pose stream via HTTP/SSE.
The whole thing is distributed as an AppImage, so you just run it and connect to it.
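On the consuming side, an SSE stream is just lines of text. Here's a minimal sketch of parsing pose events out of one, assuming (hypothetically; the project's actual field names may differ) that each event carries a JSON pose object in its `data:` field:

```python
import json

def parse_sse_events(lines):
    """Parse Server-Sent Events lines into a list of JSON payloads.

    Assumes each event's `data:` field holds one JSON pose object,
    e.g. {"x": ..., "y": ..., "theta": ...} -- hypothetical field names.
    """
    events, data_buf = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            data_buf.append(line[5:].lstrip())
        elif line == "" and data_buf:
            # A blank line terminates an event; join multi-line data fields.
            events.append(json.loads("\n".join(data_buf)))
            data_buf = []
    return events
```

You'd feed this the line iterator of a streaming HTTP response and hand each parsed pose to whatever consumes it.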
All but the most basic vacuum robots map their work area and devise plans for cleaning it systematically. The rest just bump into obstacles, rotate a random amount, and continue forward.
Don't get me wrong, I love this project and the idea of building it yourself. I just feel like that (huge) part is missing from the article?
I'm not saying it's viable here to build a world map, since things like furniture can move, but some systems do. Warehouse robots, for example, use lights to triangulate, on the assumption that the lights on the tall ceiling are fixed and consistent.
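The geometry behind that trick is simple: with known landmark positions and world-frame bearings to two of them, the robot's position is the intersection of two rays. A sketch (assuming the robot's heading is already known, e.g. from a compass, so the bearings are absolute):

```python
import math

def triangulate(l1, l2, b1, b2):
    """Estimate 2D robot position from absolute bearings to two fixed lights.

    l1, l2: known (x, y) positions of the lights (from the ceiling plan).
    b1, b2: world-frame bearings (radians) from the robot to each light.
    Solves r + t_i * (cos b_i, sin b_i) = l_i for the intersection point r.
    """
    d1 = (math.cos(b1), math.sin(b1))
    d2 = (math.cos(b2), math.sin(b2))
    # t1*d1 - t2*d2 = l1 - l2: a 2x2 linear system, solved via Cramer's rule.
    bx, by = l1[0] - l2[0], l1[1] - l2[1]
    det = d2[0] * d1[1] - d1[0] * d2[1]
    if abs(det) < 1e-9:
        raise ValueError("bearings are parallel; these lights give no fix")
    t1 = (-bx * d2[1] + d2[0] * by) / det
    return (l1[0] - t1 * d1[0], l1[1] - t1 * d1[1])
```

Real systems use more than two landmarks and a least-squares fit, but the principle is the same.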
I guess the mapping capabilities vary greatly between vendors. I had a first-gen Mi Robot vacuum and it was amazing. It would map the entire floor with all the rooms, then go room by room in a zigzag pattern, then repeat each room, with no issues going from one room to another and avoiding obstacles. It also made sure not to fall down the stairs. Then later it broke and I bought a more no-name model, and despite having a lidar tower, it didn't perform as well as the Xiaomi vacuum did. It worked for a single room, but anything more and it would get lost.
For the actual cleaning, random works great.
With a seat and handle similar to "wooden bee ride on" by b. toys?
I want a vacuum that kids can actually drive, ride on, and do real vacuuming with, and that has a minimal level of safety: turning it over halts the vacuum, stairs/ledges are avoided, no rollers or parts that could snare a kid's hair, etc.
There may be benefits to fusing the child's input signals with the vacuum's supervisory route goals. It would be age-dependent; older kids would want full manual, I think.
Kids like to do real jobs, and as a parent I prefer purchasing real items for my kids rather than toy versions if practical.
Real vacuums have existed for a very long time now :P
Too little training data, and/or data of insufficient quality. Maybe let the robot run autonomously with an (expensive) VLM operating it, to bootstrap a larger training dataset without needing to annotate it yourself.
Or maybe the problem itself is poorly specified, or intractable with your chosen network architecture. But if you see that a vision LLM can pilot the bot, at least you know you have a fighting chance.
That's a cool idea. Is there any VLM you would suggest? I can think of Gemini, maybe, or would any do?
Invest in a good prompt describing the setup, your goals, and when to move. Use typed output; don't go parsing move commands out of unstructured chat output. And maybe validate first on the data you already collected: does the VLM take the same actions as your existing training set?
Then just let it run and collect data for as long as you can afford. Maybe 0.2 FPS (sample and take an action every 5 seconds) is already good enough.
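The validation step above is cheap to sketch. Assuming (hypothetically) the prompt asks the VLM to reply with a JSON object like `{"action": "FORWARD"}` from a fixed vocabulary, you can reject malformed replies and score agreement against your logged actions:

```python
import json

# Fixed action vocabulary the prompt asks the VLM to choose from.
ACTIONS = {"FORWARD", "TURN_CW", "TURN_CCW", "STOP"}

def parse_action(vlm_output: str) -> str:
    """Parse a typed action from the VLM's JSON reply.

    Rejects anything outside the vocabulary instead of fishing a
    command out of free-form chat text.
    """
    action = json.loads(vlm_output)["action"]
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return action

def agreement_rate(vlm_replies, logged_actions):
    """Fraction of frames where the VLM picks the action you recorded."""
    matches = sum(
        parse_action(reply) == logged
        for reply, logged in zip(vlm_replies, logged_actions)
    )
    return matches / len(logged_actions)
```

If the agreement rate on your existing dataset is poor, fix the prompt before spending money on autonomous collection.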
Good luck!
I would begin in one room to practice this.
The hard part is the engineering hours to make it all work well. But you can get those repaid as long as you can sell 100 million units to every nation in the world.
Do you know any cheap WiFi MCU with a little ML accelerator that we can buy off the shelf? The only one we could think of was the Jetson Orin Nano, and that's not cheap.
For the compute problem, you don't need a Jetson. The approach you want is knowledge distillation: train a large, expensive teacher model offline on a beefy GPU (cloud instance, your laptop's GPU, whatever), then distill it down into a tiny student network like a MobileNetV3-Small or EfficientNet-Lite. Quantize that student to int8 and export it to TFLite. The resulting model is 2-3 MB and runs at 10-20 FPS on a Raspberry Pi 4/5 with just the CPU, no ML accelerator needed. For even cheaper, an ESP32-S3 with a camera module can run sub-500KB models for simpler tasks. The preprocessing is trivial: resize the camera frame to 224x224, normalize pixel values, feed the tensor to the TFLite interpreter. The CNN learns its own feature extraction internally, so you don't need any classical CV preprocessing.

Looking at your observations, I think the deeper issue is what you identified: there's not enough signal in single frames. Your validation loss not converging even after augmentation and ImageNet pretraining confirms this. The fix is exactly what you listed in your future work: feed stacked temporal frames instead of single images. A simple approach is to concatenate 3-4 consecutive grayscale frames into a multi-channel input (e.g., 224x224x4). This gives the network implicit motion, velocity, and approach-rate information without needing to compute optical flow explicitly. It's the same trick DeepMind used in the original Atari DQN paper: a single frame of Pong doesn't tell you which direction the ball is moving either.

On the action space: your intuition about STOP being problematic is right. It creates a degenerate attractor; once the model predicts STOP, there's no recovery mechanism. The paper you referenced that only uses STOP at goal-reached is the better design.
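The frame stacking is a few lines of buffer management. A dependency-free sketch (in practice you'd hold NumPy arrays and `np.stack` them, but the logic is the same):

```python
from collections import deque

class FrameStack:
    """Keep the last `depth` grayscale frames as one multi-channel input.

    Frames are H x W nested lists of pixel values; stack() returns an
    H x W x depth structure. The first frame is replicated to fill the
    buffer, so the network always sees a fixed-shape input.
    """
    def __init__(self, depth=4):
        self.depth = depth
        self.frames = deque(maxlen=depth)

    def push(self, frame):
        if not self.frames:
            # First frame: replicate so early steps are well-defined.
            for _ in range(self.depth):
                self.frames.append(frame)
        else:
            self.frames.append(frame)

    def stack(self):
        # Channel-last layout: pixel (i, j) becomes a list over time steps.
        h, w = len(self.frames[0]), len(self.frames[0][0])
        return [[[f[i][j] for f in self.frames] for j in range(w)]
                for i in range(h)]
```

Push every sampled camera frame, and feed `stack()` to the network instead of the single latest frame; only the model's input shape changes.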
Also consider that TURN_CW and TURN_CCW have no obvious visual signal in a single frame (which way to turn is a function of where you've been and where you're going, not just what you see right now), which is another reason temporal stacking or adding a small recurrent/memory component would help. Even a simple LSTM or state tuple fed alongside the image could encode "I've been turning left for 3 steps, maybe try something else."

For the longer term, consider a hybrid architecture: use the distilled neural net for obstacle detection and free-space classification, but pair it with classical SLAM or even simple odometry-based mapping for path planning and coverage. Pure end-to-end behavior cloning for the full navigation stack is a hard problem; even the commercial robots use learned perception with algorithmic planning. And your data collection would get easier too, because you'd only need to label "what's in front of me" rather than "what should I do", which decouples perception from decision-making and makes each piece easier to train and debug independently.
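To make the decoupling concrete, here's a toy sketch of one control step on a 4-connected grid. The names (`plan_step`, the heading encoding) are made up for illustration; `perception` stands in for the distilled CNN's free-space head, while mapping and the move decision stay classical:

```python
# Heading -> (dx, dy) on a 4-connected grid: 0=N, 1=E, 2=S, 3=W.
STEPS = {0: (0, 1), 1: (1, 0), 2: (0, -1), 3: (-1, 0)}

def plan_step(grid, pose, frame, perception):
    """One decoupled control step.

    The learned model only answers "is the cell ahead free?"; the
    occupancy map update and the action choice are plain logic.
    Returns (action, new_pose).
    """
    x, y, heading = pose
    dx, dy = STEPS[heading]
    ahead = (x + dx, y + dy)
    free = perception(frame)                        # learned perception
    grid[ahead] = "free" if free else "blocked"     # classical mapping
    if free:
        return ("FORWARD", (ahead[0], ahead[1], heading))
    return ("TURN_CW", (x, y, (heading + 1) % 4))
```

Note the labeling payoff: training `perception` only needs "free/blocked ahead" annotations per frame, not full action labels, and the planner can be unit-tested without the network in the loop.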
(Lidar can of course also be echolocation).
Very cool project though!
It could easily understand so much about the environment with even a small multimodal model.