Show HN: Understudy – Teach a desktop agent by demonstrating a task once
120 points
22 days ago
| 10 comments
| github.com
I built Understudy because a lot of real work still spans native desktop apps, browser tabs, terminals, and chat tools. Most current agents live in only one of those surfaces.

Understudy is a local-first desktop agent runtime that can operate GUI apps, browsers, shell tools, files, and messaging in one session. The part I'm most interested in feedback on is teach-by-demonstration: you do a task once, the agent records screen video + semantic events, extracts the intent rather than coordinates, and turns it into a reusable skill.

Demo video: https://www.youtube.com/watch?v=3d5cRGnlb_0

In the demo I teach it: Google Image search -> download a photo -> remove background in Pixelmator Pro -> export -> send via Telegram. Then I ask it to do the same for Elon Musk. The replay isn't a brittle macro: the published skill stores intent steps and route options, with GUI hints kept only as a fallback. In this example it can also prefer faster routes when they are available instead of repeating every GUI step.
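
To make the "intent steps, route options, GUI hints as fallback" idea concrete, here is a minimal sketch of what such a skill record and route selection could look like. All names (`IntentStep`, `chooseRoute`, etc.) are hypothetical illustrations, not Understudy's actual schema or API:

```typescript
// Hypothetical skill format: each step stores an intent plus candidate
// routes; GUI hints are kept only as a last-resort fallback.
interface Route {
  kind: "gui" | "shell" | "api";
  cost: number;        // rough execution cost; lower is faster
  available: boolean;  // e.g. a CLI tool may not be installed
}

interface GuiHint {
  app: string;
  element: string;
  action: string;
}

interface IntentStep {
  intent: string;
  routes: Route[];
  guiFallback?: GuiHint;
}

// Prefer the cheapest available route; otherwise replay the recorded GUI hint.
function chooseRoute(step: IntentStep): Route | "gui-fallback" {
  const usable = step.routes
    .filter((r) => r.available)
    .sort((a, b) => a.cost - b.cost);
  return usable.length > 0 ? usable[0] : "gui-fallback";
}

// Example step from the demo workflow.
const removeBg: IntentStep = {
  intent: "remove background from the downloaded image",
  routes: [
    { kind: "gui", cost: 10, available: true },
    { kind: "shell", cost: 2, available: false }, // faster route, not installed
  ],
  guiFallback: { app: "Pixelmator Pro", element: "Remove Background", action: "click" },
};
```

With this shape, the replay naturally upgrades itself: if a cheaper shell or API route later becomes available, `chooseRoute` picks it without re-recording the skill.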

Current state: macOS only. Layers 1-2 are working today; Layers 3-4 are partial and still early.

    npm install -g @understudy-ai/understudy
    understudy wizard
GitHub: https://github.com/understudy-ai/understudy

Happy to answer questions about the architecture, teach-by-demonstration, or the limits of the current implementation.

obsidianbases1
22 days ago
[-]
Nice work. I scanned through the code and found this file to be an interesting read https://github.com/understudy-ai/understudy/blob/main/packag...
reply
shawntwin
22 days ago
[-]
Smart observation — it seems some interesting packages are included.
reply
rybosworld
22 days ago
[-]
I have a hard time believing this is robust.
reply
walthamstow
22 days ago
[-]
It's a really cool idea. Many desktop tasks are teachable like this.

The look-click-look-click loop it used for sending the Telegram message for Musk was pretty slow. How intelligent (and therefore slow) does a model have to be to handle this? What model was used for the demo video?

reply
bayes-song
22 days ago
[-]
In the demo, I used GPT-5.4:medium accessed through the Codex subscription.
reply
sethcronin
22 days ago
[-]
Cool idea -- the Claude Chrome extension has something like this implemented, but obviously it's restricted to the Chrome browser.
reply
bayes-song
22 days ago
[-]
I really like the Claude Chrome extension, but unfortunately it has too many limitations. Not only is it restricted to Chrome, but even within Chrome some websites, especially financial ones, are blocked.
reply
8note
22 days ago
[-]
sounds a bit sketch?

learning to do a thing means handling the edge cases, and you can't exactly do that in one pass?

when I've learned manual processes it's been at least 9 attempts: 3 watching, 3 doing with an expert watching, and 3 with the expert checking the result

reply
bayes-song
22 days ago
[-]
That’s true. The demo I showed was somewhat cherry-picked, and agentic systems inherently introduce uncertainty. To address this, a possible approach was proposed earlier in this thread: currently, after /teach is completed, we have an interactive discussion to refine the learned skill. In practice, this could likely be improved further: when the agent uses a learned skill and encounters errors, it could proactively request human help to point out the mistake. I think this could be an effective direction.
reply
jedreckoning
22 days ago
[-]
cool idea. good idea doing a demo as well.
reply
skeledrew
22 days ago
[-]
Interested, and disappointed that it's macOS only. I started something similar a while back on Linux, but only got through level 1. I'll take some ideas from this and continue work on it now that it's on my mind again.
reply
bayes-song
22 days ago
[-]
Thanks! And good luck with your project as well.

One of the motivations for open-sourcing this is exactly to see it grow beyond macOS. I personally don’t have much development experience on Windows or Linux, so it’s great to see people picking up the idea and trying it on other platforms.

Interestingly, the original spark for this project actually came from my dad. He mostly uses CAD to review architectural design files, and there are quite a few repetitive steps that are fairly mechanical. Many operations don’t seem to be accessible through normal shell automation and end up requiring GUI interactions.

So one of the next things I want to try is experimenting with similar ideas on Windows, especially for GUI-heavy workflows like that, and see how far it can go.

reply
mustafahafeez
22 days ago
[-]
Nice idea
reply
bayes-song
22 days ago
[-]
thx
reply
abraxas
22 days ago
[-]
One more tool targeting OSX only. That platform is overserved with desktop agents already while others are underserved, especially Linux.
reply
bayes-song
22 days ago
[-]
Fair point that Linux is underserved.

My own view is that the bigger long-term opportunity is actually Windows, simply because more desktop software and more professional workflows still live there. macOS-first here is mostly an implementation / iteration choice, not the thesis.

reply
renewiltord
22 days ago
[-]
That's mostly because Mac OS users make tools that solve their problems and Linux users go online to complain that no one has solved their problem but that if they did they'd want it to be free.
reply
Muhammad523
22 days ago
[-]
Listen; we're not in a "Windows vs MacOS vs Linux user" meme. We're trying to have intelligent discussion here, and surely generalizing a large amount of people simply because they use one OS is not intelligent discussion. Wake up. Real life is not what you see in funny memes.
reply
Muhammad523
22 days ago
[-]
I'd truly like to see what examples you have of Linux users "complaining about the fact no one solved their problem yet"
reply
renewiltord
22 days ago
[-]
The guy has given you everything you need to solve this problem you supposedly have. So solve it.

You have all the tools.

reply
sukhdeepprashut
22 days ago
[-]
2026 and we still pretend to not understand how llms work huh
reply