PaliGemma 2: Powerful Vision-Language Models, Simple Fine-Tuning
218 points | 20 days ago | 7 comments | developers.googleblog.com
minimaxir
20 days ago
Hugging Face's blog post on the release is more technical: https://huggingface.co/blog/paligemma2
xnx
20 days ago
Even more technical detail here: https://arxiv.org/html/2412.03555v1
timmg
20 days ago
I recently wanted to try to get an LLM to help me organize my photos. (I'm someone who takes a lot of photos when I travel and then backs them up to a hard drive -- assuming someday I'll organize them :)

I created a prompt to try to get an LLM to do high-level organization:

> Help categorize this photo for organization. Please output in JSON.

> First, add a field called "type" that is one of: Wildlife, Architecture, Landscape, People or other. Pick the one that most appropriately reflects the subject of the photo. Use Other if you don't feel confident in your answer.

> Next, if it is Wildlife, please add another field called "animal" that gives a general type of animal that is the focus of the photo. Use large, common types like Elephant, Bird, Lion, Fish, Antelope, etc. Do not add this field if your confidence is low.

> If the type of animal is Bird, add a field called "bird" that gives the common type of bird, if you can clearly determine it. For example: Eagle, Hummingbird, Vulture, Crow, etc.

> If it is an Architecture photo, and you can determine with good confidence what specific building (or city) it is a photo of, please add a field called "place" with that name. (Name only, please -- no description).

I've tried llama-vision via Ollama and it worked reasonably well for the top-level categories. A little less well for identifying specific birds or places. And it didn't always generate proper JSON (and sometimes added new fields to the JSON).

I also tried with Claude's API -- and it seemed to work perfectly (for a small sample size).

It will be interesting to try with PaliGemma and see what I get.

I have like 50k photos, so I don't want to pay $$$ for the Claude API to categorize them all. It will be cool someday (soon?) for an open-source DAM to have something like one of these models available to call locally.
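
For reference, here's a minimal sketch of what the PaliGemma attempt could look like with Hugging Face transformers. The checkpoint name and prompt are assumptions, and the pretrained ("pt") checkpoints expect short task prefixes like "answer en" rather than long JSON-schema instructions, so a prompt like the one above would likely need a fine-tune:

    # Sketch: single-photo categorization with PaliGemma 2 via transformers.
    # The model id and prompt here are assumptions, not a tested recipe.
    import torch
    from PIL import Image
    from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

    model_id = "google/paligemma2-3b-pt-448"
    processor = PaliGemmaProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

    image = Image.open("IMG_0001.jpg").convert("RGB")
    prompt = "<image>answer en Is this Wildlife, Architecture, Landscape, People or Other?"
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20)
    # generate() returns prompt + completion; decode only the new tokens.
    print(processor.decode(out[0][inputs["input_ids"].shape[-1]:],
                           skip_special_tokens=True))

Mapping the one-word answer into the JSON schema would then happen in plain Python around the model call.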

warangal
20 days ago
Disclaimer: I work on such a project[0]

I think a combination of CLIP and some face recognition may solve your issues! It just takes a path to your directory and can index all the images while preserving your folder hierarchy, along with high-quality face clustering. Indexing takes about 100ms per image on a CPU. Every combination can then be mixed and matched from a single interface. It doesn't take much to try, as the dependencies are very minimal. There is a self-contained app for Windows too. I have been looking for feedback, so just plugging it here in case someone has a use case.

[0] https://github.com/eagledot/hachi
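
To show the rough idea without installing the app (this is not hachi's actual API; the model name and paths are placeholders), a CLIP text-to-image search sketch with sentence-transformers looks like:

    # Generic CLIP image-search sketch -- not hachi's API.
    from pathlib import Path
    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("clip-ViT-B-32")
    paths = sorted(Path("photos").glob("**/*.jpg"))
    # One embedding per image; for ~50k photos you'd batch this
    # and persist the embeddings instead of recomputing.
    img_emb = model.encode([Image.open(p) for p in paths],
                           batch_size=32, convert_to_tensor=True)
    # Rank images against a free-text query by cosine similarity.
    query = model.encode("an elephant at a watering hole",
                         convert_to_tensor=True)
    for hit in util.semantic_search(query, img_emb, top_k=5)[0]:
        print(paths[hit["corpus_id"]], round(hit["score"], 3))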

swyx
20 days ago
> https://github.com/eagledot/hachi

Promising, but I want to see more before I try it -- could you invest a little in your README to list out the features, and maybe do a little Loom demo?

warangal
20 days ago
The readme for images is at https://github.com/eagledot/hachi/tree/main/images/readme.md with more than enough details! It is supposed to be a search engine for all modalities; for now, `images` are supported.

For a demo, I don't have the resources to host it. Would a video showcasing the features help?

Also, for Windows there is a portable app at https://github.com/eagledot/hachi/releases/download/v1.3/hac...

swyx
19 days ago
Yeah, a simple Loom/YouTube video works well!
senko
20 days ago
Simonw estimates it'd cost less than $10 to categorize 67k+ photos using Amazon Nova: https://simonwillison.net/2024/Dec/4/amazon-nova/#gamoa

I agree it'll still be cool to be able to do it all locally.

rsolva
20 days ago
The photo-organizing software Ente [0] can do this, and it is packaged into a really neat product. I have not gotten around to trying the self-hosted version yet, but it is on my list!

[0] https://ente.io/ml

magicalhippo
20 days ago
Recently played with something similar using Llama 3.2 Vision locally.

Worked pretty well, and decently fast if the model fit in GPU RAM.

Main issue was prompt adherence. In my experience prompt adherence goes down significantly when you reduce the model size.

Llama 3.2 Vision seems to be tuned hard to provide a summary at the end, usually with some social commentary, and it was difficult to get the 11B model to avoid outputting it. Also, multiple if-this-then-that clauses, like in your prompt, were often ignored by the 11B and smaller models compared to the 90B.

I've tried the Gemma 2 model before for assistant tasks, and was very pleased with the 9B performance. It had good prompt adherence and performed well on various tasks. So looking forward to trying this.

nulld3v
18 days ago
Are you aware of Immich? I believe it does mostly everything you were trying to do: https://immich.app/docs/features/smart-search It's open source, and is fairly polished too.

I think the main missing part is that the classifications are not exposed to you in the UI. But I haven't used this feature much, and my instance isn't running right now (SSD failed, still trying to get it replaced), so I'm not able to check.

thorncorona
20 days ago
None of this requires a multimodal LLM.

You can do this with traditional CNNs. For places, use GeoSpy or another geospatial model.
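
A sketch of that route with an off-the-shelf torchvision classifier (ImageNet labels are fine-grained, e.g. "African elephant" or "hummingbird", so mapping them to coarse buckets like Wildlife/Architecture is left to you):

    # Coarse photo tagging with a pretrained CNN -- no LLM involved.
    import torch
    from PIL import Image
    from torchvision.models import resnet50, ResNet50_Weights

    weights = ResNet50_Weights.IMAGENET1K_V2
    model = resnet50(weights=weights).eval()
    preprocess = weights.transforms()

    img = Image.open("IMG_0001.jpg").convert("RGB")
    with torch.no_grad():
        logits = model(preprocess(img).unsqueeze(0))
    # Top-3 ImageNet labels with confidence scores.
    top = logits.softmax(dim=1).topk(3)
    for score, idx in zip(top.values[0], top.indices[0]):
        print(weights.meta["categories"][int(idx)], float(score))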

visarga
20 days ago
Use the JSON mode in Ollama.
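
For example, against a local Ollama server (the model tag is an assumption; "format": "json" constrains the output to valid JSON, though the prompt should still ask for JSON):

    # Sketch: vision prompt with Ollama's JSON mode over its local REST API.
    import base64, json, requests

    with open("IMG_0001.jpg", "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3.2-vision",   # assumed local model tag
        "prompt": "Categorize this photo. Output JSON with a 'type' field: "
                  "Wildlife, Architecture, Landscape, People or Other.",
        "images": [img_b64],
        "format": "json",             # forces syntactically valid JSON
        "stream": False,
    })
    print(json.loads(resp.json()["response"]))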
pilooch
20 days ago
PaliGemma proves easy to train and useful for fine-tuning. Its main drawback was not being able to handle multiple images without being partly retrained. This new version does not seem to support multiple images as input at once; Qwen2-VL does. That is typically useful for vision RAG.
sigmar
20 days ago
It is probably hard to come up with good benchmarks for VLMs like this, but the "non-entailment sentences" benchmark seems ill-suited. The example non-entailment sentences included[1]: "There is a pile of horse manure in front of the horse." That is true if you mean "in front of the [photo subject] from the perspective of the camera," but I think they marked it as non-entailment because the pile is not in front of the horse's face(?)

[1] page 20 https://arxiv.org/pdf/2412.03555

turnsout
20 days ago
Does anyone know how this stacks up against other multimodal vision models?
mountainriver
20 days ago
They do an exceptionally poor job of evaluating it against competitors.
swyx
20 days ago
How about leaderboards that can slot them in?
lofaszvanitt
16 days ago
Why does that matter? Everyone has their own uses for a VLM. Compare them on the given task at hand.
dmvdoug
20 days ago
Saw name, was expecting something to do with wreaking AI upon the Pali Canon (https://en.m.wikipedia.org/wiki/Pali_Canon).
exe34
20 days ago
Does anyone know if they can output bounding box coordinates? Like "where is the ball" -> [50, 75, 150, 175].

So far CogVLM is the only one I've seen that works, but it's a bit of a pain to run.

xnx
20 days ago
Yes: "The initial four location tokens represent the coordinate of the bounding box, ranging from 0 to 1023. These coordinates are independent of the aspect ratio, as the image is assumed to be resized to 1024 x 1024."

https://developers.googleblog.com/en/gemma-explained-paligem...
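
So for "where is the ball", the raw output looks something like "<loc0250><loc0100><loc0700><loc0550> ball", with the tokens in y1, x1, y2, x2 order. A small parsing sketch (the example string here is made up):

    # Convert PaliGemma <locXXXX> detection output to pixel boxes.
    import re

    def parse_boxes(text, img_w, img_h):
        """Yield (label, (x1, y1, x2, y2)) in pixels of the original image."""
        for locs, label in re.findall(r"((?:<loc\d{4}>){4})\s*([^;<]+)", text):
            # Four tokens on a 0-1023 grid, documented order y1, x1, y2, x2.
            y1, x1, y2, x2 = (int(v) / 1023 for v in re.findall(r"\d{4}", locs))
            yield label.strip(), (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)

    out = "<loc0250><loc0100><loc0700><loc0550> ball"
    print(list(parse_boxes(out, img_w=4032, img_h=3024)))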

exe34
20 days ago
Thank you! Will have a play with that.
__jl__
20 days ago
Gemini is surprisingly good at this. Look at example 5 here: https://developers.googleblog.com/en/7-examples-of-geminis-m...

They also have a colab notebook with more examples linked in the article.

exe34
20 days ago
I meant weights-available ones, but thank you!
jsight
19 days ago
Yes, just use "detect ball" as the prompt. It will give you y1,x1,y2,x2 coordinates on a scale of 0-1024. It is really good at this.

Unfortunately, without fine-tuning you can't have it just detect everything and return all detected objects, afaict.
