> We also plan to compile solved steps into micro‑policies. If you're running something like an RPA task or similar workflow as before, you can simply run the execution locally (with archon-mini running locally) and not have to worry about the planning. Over time, the planner is a background teacher, not a crutch.
Conceptually, I really like this - why re-do the work of reasoning about an already solved task? Just replay it. For a plausibly large majority of tasks, this could speed things up considerably.
> In the future we hope to run a streaming capture pipeline similar to Gemma 3. Consuming frames at 20–30 fps, emitting actions at 5–10 Hz, and verifying state on each commit.
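Those two rates imply the planner only fires every few frames. A deterministic back-of-envelope sketch of that decoupling, where every function name and rate is illustrative rather than the project's actual API:

```python
# Sketch of a capture/act/verify loop that ingests frames faster than it
# emits actions. All names here are hypothetical stand-ins.
FRAME_HZ = 30   # frame ingestion rate from the quoted target
ACTION_HZ = 10  # action emission rate

def capture_frame(i):
    """Stand-in for grabbing one screen frame."""
    return {"frame": i}

def plan_action(window):
    """Stand-in for the planner consuming the recent frame window."""
    return "noop"

def verify_state(frame):
    """Stand-in for checking that the committed action took effect."""
    return True

def agent_loop(n_frames):
    frames_per_action = FRAME_HZ // ACTION_HZ  # 3 frames per action
    window, actions = [], 0
    for i in range(n_frames):
        window.append(capture_frame(i))   # ~30 fps ingestion
        window = window[-8:]              # keep a short rolling context
        if (i + 1) % frames_per_action == 0:
            plan_action(window)           # emit at ~10 Hz
            if verify_state(window[-1]):  # verify state on each commit
                actions += 1
    return actions

agent_loop(30)  # one second of frames -> 10 committed actions
```

The point of the rolling window is that the planner never waits on a full replay of history; it sees only the last fraction of a second of screen state.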
I love targets like this. It makes you tune the architecture and abstractions to push the boundary of what's possible with a traditional agent loop.
The salience heat map compression is a great idea. I think you could take this a step further and tune a model so that it compresses an image into a textual hierarchy of semantic/interactive elements. This is effectively what browser-use does, just using JavaScript instead of a vision model.
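A minimal sketch of what that textual hierarchy could look like, with a hand-built element tree standing in for a vision model's output (all names and the tree itself are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    role: str    # e.g. "button", "textbox"
    label: str
    children: list = field(default_factory=list)

def to_text(el, depth=0, counter=None):
    """Flatten a detected UI tree into numbered lines an LLM planner can
    reference by index (roughly what browser-use emits from the DOM)."""
    if counter is None:
        counter = [0]
    counter[0] += 1
    lines = [f"{'  ' * depth}[{counter[0]}] <{el.role}> {el.label}"]
    for child in el.children:
        lines.extend(to_text(child, depth + 1, counter))
    return lines

# Hand-built tree standing in for a vision model's detections.
page = Element("page", "Checkout", [
    Element("textbox", "Email"),
    Element("button", "Pay now"),
])
print("\n".join(to_text(page)))
# [1] <page> Checkout
#   [2] <textbox> Email
#   [3] <button> Pay now
```

The payoff is that the planner's action space becomes "interact with element [3]" rather than pixel coordinates, which is far cheaper to reason over.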
This seems like a task that would benefit from narrow focus. I'm aware of the "Bitter Lesson," but my intuition tells me that chaining fit-for-purpose classifiers as inputs to an intelligent planning system is the way to go.
Without this, AI is going to be limited and kludgy. If I wanted AI to run an FEA simulation on some CAD model, I'd have to wait until the FEA software, the CAD software, the corporate models repo, etc., etc. all have AI integrations, and then create some custom agent that glues them all together. Once AI can control the computer effectively, it can look up the instruction manuals for each of these pieces of software online and just have at it e2e like a human would. It can even ping you over Slack if it gets stuck on something.
I think once stuff like this becomes possible, custom AI integrations will become less necessary. I'm sure they'll continue to exist for special cases, but the other nice thing about a generic computer-use agent is that you can record the stream and see exactly what it's doing, which is a huge increase in observability. It can even demo tasks to human workers, because it works through the same interfaces they do.
I see a ton of potential for testing. RPA can quickly get annoying because even a simple change can break the automation. LLMs' ability to “reason” could really bridge the gap.
Coupled with agents that help turn specifications/stories into a testing plan, I could really see automated end-to-end testing becoming far cheaper in the near future than it is today.
That’s a very good piece of news for system reliability.
I feel like your demo video is not the best showcase of the capability. A browsing use case likely does require a key-press→planning loop, but a gaming use case, or well-known software (e.g., Excel), may be able to think ahead 10-20 key presses before needing the next loop/verification. The current demo makes it seem slow and prototype-like.
Also, the X/Y approach is interesting as a generic way to manage the screen. But for browsers, for example, you're likely adding overhead relative to just marking the specific divs/buttons that are on screen and having those be part of the reasoning (e.g., "Click button X at div with path XX"). It may be helpful to think about the workflows you're targeting and what kind of accelerated handling you can build for them.
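To make that concrete, here's a toy action resolver where the planner can reference a pre-marked element index instead of emitting raw screen coordinates. The selector, index, and coordinate data are invented for the sketch:

```python
# Hypothetical element map built from on-screen divs/buttons; in a real
# browser workflow this would come from the DOM, not be hard-coded.
elements = {
    3: {"selector": "div#cart > button.checkout", "center": (412, 230)},
}

def act(command):
    """Normalize either 'click <index>' (element-grounded) or
    'click <x> <y>' (coordinate-grounded) to a click point."""
    parts = command.split()
    if len(parts) == 2:                       # element-grounded: "click 3"
        return elements[int(parts[1])]["center"]
    return (int(parts[1]), int(parts[2]))     # raw X/Y fallback

act("click 3")       # resolves to (412, 230) via the marked element
act("click 100 50")  # coordinate path for non-browser surfaces
```

The element-grounded path spares the model from pixel arithmetic where structure is available, while the X/Y path stays as the generic fallback for surfaces without a DOM.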
At the very least, display a very strongly worded warning when the tool is run outside a VM. Internet connectivity is still dangerous, but at least VMs can be snapshotted. And it should not be packaged as an end-user product with a bar at the top of the screen, period.
This is a risky product, and those who developed it are in a position to know that, given the history of AI hallucinations. Not because it will escape or self-replicate or do the other things claimed by the various idiotic AI religions, but because one of the first things it will inevitably do is screw up someone's work and cause data loss.
From the AI's perspective, a filesystem that vector-indexes data on the fly might make sense, along with a way for the user to grant it fine-grained permissions.
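A toy illustration of the idea, with a bag-of-words `Counter` standing in for real embeddings and a per-path grant set as the permission model (the class, paths, and scoring are all invented for the sketch):

```python
from collections import Counter

class VectorFS:
    """Toy index that 'embeds' file contents on write and answers
    semantic queries, filtered by paths the user has shared."""

    def __init__(self):
        self.docs = {}        # path -> (text, bag-of-words vector)
        self.granted = set()  # paths the user has granted to the agent

    def write(self, path, text):
        self.docs[path] = (text, Counter(text.lower().split()))

    def grant(self, path):
        self.granted.add(path)

    def search(self, query, k=1):
        q = Counter(query.lower().split())
        scored = []
        for path, (_, vec) in self.docs.items():
            if path not in self.granted:
                continue  # permission check: unshared files are invisible
            score = sum(q[w] * vec[w] for w in q)
            scored.append((score, path))
        return [p for s, p in sorted(scored, reverse=True)[:k] if s > 0]

fs = VectorFS()
fs.write("/notes/fea.md", "mesh convergence study for the bracket model")
fs.write("/secrets/payroll.csv", "salary data")
fs.grant("/notes/fea.md")
fs.search("bracket mesh")  # ['/notes/fea.md']; payroll stays hidden
```

The key design point is that permissions are enforced inside the retrieval layer, so the agent cannot even discover files the user never shared.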
I mean, that would sure be a nice demo, but it’s too probabilistic to trust AI agents with real tasks (and it seems that isn’t going to change anytime soon).
It’s all fun and games until it implies spending money and/or taking responsibility.
And be it in personal life or in businesses, money and responsibility are vital things.
Sure, you can ask LLMs to generate a minesweeper game with custom rules or to summarize headlines from HN.
Releasing a program generated by an unattended agent to real clients who pay you, or asking it to order a non-refundable flight ticket, is something else.
However, I can see the point of an agent that uses my computer while I watch it.