I think this author and I have different definitions of fun.
1,000 pings, how many correctly ponged?
Imagine face recognition to work like a text chat, where the PC gets the frame from the camera and writes in the chat: "Who's that? Here's the RGB888 image in hex: ...".
Vision language models are an incredible achievement in generality and usability, but they pay a hefty price in fidelity and speed.
The image gets split into small patches (e.g. 4x4 pixels), each patch is assigned a token, similar to how text is broken up into tokens, and the whole sequence is fed into a single model.
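A rough sketch of that patch-tokenization step (illustrative only; real models typically use larger patches like 16x16 plus a learned linear projection, the 4x4 size just follows the example above):

```python
import numpy as np

def image_to_patch_tokens(image: np.ndarray, patch: int = 4) -> np.ndarray:
    """Split an HxWxC image into (patch x patch) pieces and flatten each
    piece into one vector ("token"). H and W must be divisible by patch."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, c)
    return grid.reshape(-1, patch * patch * c)      # one row per token

frame = np.zeros((32, 32, 3), dtype=np.uint8)       # dummy 32x32 RGB frame
tokens = image_to_patch_tokens(frame)
print(tokens.shape)  # (64, 48): an 8x8 grid of patches, 48 values each
```

Even this toy 32x32 frame becomes 64 tokens; a full camera frame becomes thousands, which is where the fidelity/speed cost comes from.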
> Imagine face recognition to work like a text chat, where the PC gets the frame from the camera and writes in the chat: "Who's that? Here's the RGB888 image in hex: ...".
that's pretty much how it works.
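To put a number on the "image in hex" joke, here is a toy sketch of what serializing one RGB888 camera frame into a chat message would actually cost (not any real protocol, just byte arithmetic):

```python
# Toy illustration of: "Who's that? Here's the RGB888 image in hex: ..."
width, height = 640, 480
frame = bytes(width * height * 3)     # dummy all-black RGB888 frame, 3 bytes/pixel
hex_payload = frame.hex()             # 2 hex characters per byte

print(len(frame))        # 921600 bytes for one 640x480 frame
print(len(hex_payload))  # 1843200 characters of "chat" text
```

Roughly 1.8 million characters per frame, before the model has read a single pixel.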
/skill-creator [or /create-skill] Write an agent skill with code script(s) that use an existing user-space IP library that works with your agent runtime, to [...]
ComposioHQ/awesome-claude-skills: https://github.com/ComposioHQ/awesome-claude-skills
anthopics/skills//skill-creator/SKILL.md: https://github.com/anthropics/skills/blob/main/skills/skill-...
/.agents/skills/skill-name/SKILL.md, scripts/{script_name.py,__init__.py}
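For reference, a minimal sketch of what that SKILL.md might contain, following the YAML-frontmatter convention used in the anthropics/skills repo linked above (check the linked skill-creator example for the authoritative format; the body text here is illustrative):

```markdown
---
name: skill-name
description: One-line summary the agent uses to decide when to load this skill.
---

# skill-name

Instructions the agent reads once the skill is loaded: when to run
scripts/script_name.py, what inputs it expects, and what it outputs.
```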
Even faster would be to just use code in the first place!