Claude-real-video - any LLM can watch a video
77 points | 6 hours ago | 13 comments | github.com | HN
fzysingularity
1 hour ago
This is a terribly expensive way to watch a video with Claude.

Use Gemini or some local VLM to do this way more efficiently. We spent quite a bit of time on video understanding, and Claude will just burn tokens.

Check out this library: https://vlm-run.github.io/mm/

You can swap models and try out different encoding methods for videos (https://vlm-run.github.io/mm/encoders/#video)

mh-
9 minutes ago
Assuming that's your project, the GitHub link from the PyPI page is a 404.
bonoboTP
3 hours ago
"Where the video goes: stays on your machine" - No, the frames (that this tool extracts) obviously get sent to Anthropic if you use Claude.
fny
49 minutes ago
"Or any LLM" on your machine.
nickpeterson
12 minutes ago
I’m currently punishing Fable by making it watch the entire series of 7th Heaven.
kingkawn
3 minutes ago
Inhumane
Lerc
45 minutes ago
Are models any good at discerning motion from multiple frames?

For instance, if I gave models multiple animations of a bouncing ball as individual frames, would they be able to tell which bounce was the more realistic motion?

(Is this a potential new benchmark? Maybe also variations of stair dismount.)

danbrooks
25 minutes ago
I’d imagine they could. I’d try Gemini 3.5 flash with high fps.
zitterbewegung
2 hours ago
This looks cool, but it should be renamed to drop Claude from the name.
walrus01
2 hours ago
llm-real-video would be a much better name
nickvec
30 minutes ago
Curious as to how many tokens are used per second of video.
ElijahLynn
3 hours ago
I was just thinking about this exact use case yesterday:

For me, it's measuring different charging speeds at different starting battery capacities and different temperatures. I thought: what if I just had a video camera pointing at the voltage going in and out, so I could see the battery percentage increase, and a temperature gun pointed at the phone as well, so I could know the temperature of the phone too? Then it could just figure it all out and create charts.

This would make reviewing different charging equipment really easy, since all you really have to do is plug it in, take a video of it, feed it to the system, and tell other people to do the same thing.

I might very well give this a try!

idiotsecant
1 hour ago
It's kind of wild how much we are abandoning basic problem solving skills in favor of just pointing an enormous stack of GPUs at it
siriusastrebe
1 hour ago
Identifying objects in pictures was considered an insurmountable task only a few years ago, like in the xkcd comic: https://xkcd.com/1425/
smallerize
30 minutes ago
In the general case, I guess. But watching gauges and dials like battery capacity only takes a little work with a deterministic computer vision library.
gvkhna
3 hours ago
Nice, @OP. I put together something similar as well. Incidentally, I found that for motion design specifically, an LLM is not able to infer specific animations from frames as well as it can when it is simply told, plainly and accurately, what is happening and the timing.

One thing that worked decently was to take the frames, put them into a grid, and have the agent look at a single image of all the frames together. It did surprisingly well but missed a lot of subtle details that it couldn’t see.
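A minimal sketch of the grid idea, assuming equally sized frames and a roughly square layout. The function name `contact_sheet_layout` and its parameters are hypothetical, not taken from any tool mentioned in this thread. It only computes where each frame would go on the sheet; the actual pasting could then be done with an imaging library of your choice.

```python
import math


def contact_sheet_layout(n_frames, tile_w, tile_h, columns=None):
    """Return (sheet_size, boxes) for tiling n_frames equally sized frames.

    sheet_size is (width, height) of the combined image in pixels.
    boxes[i] is the (x, y) top-left corner where frame i should be pasted,
    filling the grid left to right, then top to bottom.
    """
    if n_frames <= 0:
        raise ValueError("need at least one frame")
    if columns is None:
        # Assumption: a near-square grid keeps each tile as large as possible.
        columns = math.ceil(math.sqrt(n_frames))
    rows = math.ceil(n_frames / columns)
    boxes = [
        ((i % columns) * tile_w, (i // columns) * tile_h)
        for i in range(n_frames)
    ]
    return (columns * tile_w, rows * tile_h), boxes


size, boxes = contact_sheet_layout(10, tile_w=320, tile_h=180)
print(size)      # (1280, 540): 4 columns by 3 rows
print(boxes[5])  # (320, 180): second column, second row
```

The trade-off is visible in the numbers: more frames per sheet means smaller tiles, which fits the observation that subtle details get lost.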

I also tried various kinds of vision embeddings, motion heat maps, blur, etc. to show motion. But none really worked as well, so I ended up just describing it until it got it. Haven’t quite found the right solution yet.

octember
2 hours ago
Cool idea, but keyframes are not videos. Motion and object permanence are not things Claude can infer from a set of images. Nice demo though!
sawjet
1 hour ago
I have been going through this with Claude and qwenvl3:8b this week. Both are pretty decent at inferring context and analyzing contact sheets. I'm finding high-visual-interest moments with a mixture of coarse and fine keyframes.
fzysingularity
1 hour ago
Exactly! We experimented with a whole bunch of video encoding techniques for LLMs here: https://vlm-run.github.io/mm/encoders/#video
BeetleB
3 hours ago
I think this is much more useful than just LLM related applications. I'd suggest renaming it to not make it seem like it's LLM related.
fred123123
3 hours ago
How do you handle things like scrolling quickly in a video?
nxtfari
3 hours ago
this is really clever, props
cortexosmain
6 hours ago
Hi HN! I built this because I was frustrated that no LLM actually "sees" a video — Claude won't accept video files, ChatGPT reads the transcript only, and Gemini samples at a fixed 1fps (missing fast cuts, over-sampling static slides).

claude-real-video takes a URL or local file and:

1. Extracts frames at every scene change (not fixed intervals) + a density floor
2. Deduplicates with a sliding-window pixel-diff algorithm (so A-B-A interview cutaways don't re-send the same shot)
3. Transcribes audio (prefers embedded subtitles, falls back to Whisper)
4. Optionally keeps the full soundtrack for audio-capable models
5. Writes a clean MANIFEST.txt you can drop into any LLM chat
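For readers curious what step 1 can look like in practice, here is a rough sketch that builds an ffmpeg command using the `select` filter's `scene` score plus a maximum gap between kept frames. This is an illustration, not necessarily how claude-real-video does it: the function name, the 0.3 threshold, the 10-second floor, and the output pattern are all assumptions.

```python
import shlex


def scene_change_command(src, out_pattern="frame_%05d.jpg",
                         scene_threshold=0.3, max_gap_s=10.0):
    """Build an ffmpeg argument list that keeps a frame when either
    the scene-change score exceeds scene_threshold, or no frame has been
    kept for max_gap_s seconds (the "density floor").
    """
    select_expr = (
        f"gt(scene,{scene_threshold})"          # scene cut detected
        "+isnan(prev_selected_t)"               # always keep the first frame
        f"+gte(t-prev_selected_t,{max_gap_s})"  # density floor
    )
    return [
        "ffmpeg", "-i", src,
        "-vf", f"select='{select_expr}'",
        # Emit only the selected frames; newer ffmpeg builds spell this
        # option "-fps_mode vfr".
        "-vsync", "vfr",
        out_pattern,
    ]


cmd = scene_change_command("talk.mp4")
print(shlex.join(cmd))  # inspect before running, e.g. via subprocess.run(cmd)
```

In the `select` expression, `+` acts as a logical OR, so any one of the three conditions is enough to keep a frame.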

A 10-min presentation goes from ~600 fixed-interval frames to 5-15 meaningful keyframes. 90%+ token savings with better comprehension.
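As a quick check of that arithmetic, assuming the ~600 baseline comes from 1 fps over 10 minutes and that every frame costs roughly the same number of tokens (transcript tokens are ignored here):

```python
def frame_reduction(baseline_frames, kept_frames):
    """Fraction of frames avoided relative to the fixed-interval baseline."""
    return 1 - kept_frames / baseline_frames


baseline = 10 * 60 * 1  # 10 minutes at 1 frame per second = 600 frames
for kept in (5, 15):
    print(f"{kept} keyframes: {frame_reduction(baseline, kept):.1%} fewer frames")
# 5 keyframes: 99.2% fewer frames
# 15 keyframes: 97.5% fewer frames
```

Under those assumptions, even the 15-keyframe case clears the 90% mark comfortably.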

The dedup approach (v0.2.0) uses real pixel difference on 16x16 RGB thumbnails against a sliding window of the last N kept frames — inspired by videostil's pixelmatch, but simpler and self-contained.
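A simplified sketch of that idea, not the actual v0.2.0 code: thumbnails are represented as plain lists of 256 `(r, g, b)` tuples (a 16x16 image, already downscaled), and the names `diff_fraction`, `dedup`, the per-channel tolerance, and the default threshold are illustrative assumptions.

```python
from collections import deque


def diff_fraction(a, b, tolerance=16):
    """Fraction of pixels where any RGB channel differs by more than tolerance."""
    if len(a) != len(b):
        raise ValueError("thumbnails must have the same number of pixels")
    changed = sum(
        1 for pa, pb in zip(a, b)
        if any(abs(ca - cb) > tolerance for ca, cb in zip(pa, pb))
    )
    return changed / len(a)


def dedup(thumbnails, threshold=0.10, window=5):
    """Return indices of frames to keep.

    A frame is kept only if it differs by more than threshold from every
    one of the last `window` kept frames, so an A-B-A cutaway does not
    re-send shot A.
    """
    recent = deque(maxlen=window)  # sliding window of kept thumbnails
    kept = []
    for index, thumb in enumerate(thumbnails):
        if all(diff_fraction(thumb, seen) > threshold for seen in recent):
            kept.append(index)
            recent.append(thumb)
    return kept


shot_a = [(0, 0, 0)] * 256        # all-black 16x16 thumbnail
shot_b = [(255, 255, 255)] * 256  # all-white 16x16 thumbnail
print(dedup([shot_a, shot_b, shot_a, shot_a, shot_b]))  # [0, 1]
```

With `window=1` the same input would keep every switch between shots, which is why comparing against several recent keepers matters for interview-style cutaways.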

`--report` generates a self-contained HTML showing every keep/drop decision with diff percentages, so you can tune the threshold visually.

pip install claude-real-video && crv "https://youtube.com/watch?v=..." --report

MIT licensed, pure Python + ffmpeg. Happy to answer questions!

garciasn
3 hours ago
I gave Claude a video provided by a county attorney for a speeding ticket I got. It was spot on in its analysis, even though I don’t like what the video showed.

What do you mean that Claude can’t view video? It did it just fine. Or do you mean without tools?

torhorway
3 hours ago
Yeah, I'm pretty sure Claude Code can handle videos. It's been doing frame-by-frame analysis for me with generated video to iterate on pipelines.
AmazingEveryDay
3 hours ago
I think a more or less clunky name like 'llm video preprocessor' would be a better description. In any case, it seems like you came up with a good project idea. I wonder how long until the SOTA models will just have this kind of functionality built in.
ProofHouse
4 hours ago
Very cool. I have something along these lines that does this as well. I’ll dig into yours over the next few days and contribute where I can. Awesome to see!