I created a prompt to try to get an LLM to do high-level organization:
> Help categorize this photo for organization. Please output in JSON.
> First, add a field called "type" that is one of: Wildlife, Architecture, Landscape, People, or Other. Pick the one that most appropriately reflects the subject of the photo. Use Other if you don't feel confident in your answer.
> Next, if it is Wildlife, please add another field called "animal" that gives a general type of animal that is the focus of the photo. Use large, common types like Elephant, Bird, Lion, Fish, Antelope, etc. Do not add this field if your confidence is low.
> If the type of animal is Bird, add a field called "bird" that gives the common type of bird, if you can clearly determine it. For example: Eagle, Hummingbird, Vulture, Crow, etc.
> If it is an Architecture photo, and you can determine with good confidence what specific building (or city) it is a photo of, please add a field called "place" with that name. (Name only, please -- no description).
I've tried it with llama-vision using Ollama, and it worked reasonably well for the top-level categories, a little less well for identifying specific birds or places. It also didn't always generate proper JSON (and sometimes added new fields to the JSON).
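For what it's worth, Ollama has a JSON output mode that helps with the malformed-output problem. A rough sketch of how I'd wire it up (the model name and the field whitelist are just assumptions based on the prompt above, not something I've run at scale):

```python
# Sketch: categorize one photo with a llama vision model via Ollama's JSON mode.
# Assumes the `ollama` Python package and a pulled vision model; names are illustrative.
import json
import ollama

PROMPT = """Help categorize this photo for organization. Please output in JSON.
First, add a field called "type" that is one of: Wildlife, Architecture,
Landscape, People, or Other. ..."""  # full prompt from above

ALLOWED_TYPES = {"Wildlife", "Architecture", "Landscape", "People", "Other"}
ALLOWED_FIELDS = {"type", "animal", "bird", "place"}

def categorize(path: str) -> dict:
    resp = ollama.chat(
        model="llama3.2-vision",
        messages=[{"role": "user", "content": PROMPT, "images": [path]}],
        format="json",  # ask Ollama to constrain the output to valid JSON
    )
    data = json.loads(resp["message"]["content"])
    # Drop any extra fields the model invented and sanity-check "type".
    data = {k: v for k, v in data.items() if k in ALLOWED_FIELDS}
    if data.get("type") not in ALLOWED_TYPES:
        data["type"] = "Other"
    return data

print(categorize("photos/IMG_1234.jpg"))
```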
I also tried with Claude's API -- and it seemed to work perfectly (for a small sample size).
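For anyone who wants to reproduce the Claude test, the call looks roughly like this with the Python SDK (a sketch, not my exact code; the model name and token limit are assumptions):

```python
# Sketch: send one image plus the categorization prompt to the Anthropic API.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def categorize(path: str, prompt: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: swap in whichever model you use
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return message.content[0].text  # JSON string to parse downstream
```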
It will be interesting to try with PaliGemma and see what I get.
I have like 50k photos, so I don't want to pay $$$ for the Claude API to categorize them all. It will be cool someday (soon?) for an open-source DAM to have something like one of these models available to call locally.
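A very rough back-of-envelope, with all the numbers assumed (prices and tokens-per-image vary a lot by model and image size), just to show the shape of the math:

```python
# Sketch: very rough cost estimate for categorizing a photo library via a paid API.
# Every number below is an assumption -- check current pricing before trusting this.
NUM_PHOTOS = 50_000
TOKENS_PER_IMAGE = 1_500       # assumed: resized image + prompt
OUTPUT_TOKENS = 60             # assumed: the small JSON response
INPUT_PRICE_PER_MTOK = 3.00    # assumed USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # assumed USD per million output tokens

cost = NUM_PHOTOS * (
    TOKENS_PER_IMAGE * INPUT_PRICE_PER_MTOK
    + OUTPUT_TOKENS * OUTPUT_PRICE_PER_MTOK
) / 1_000_000
print(f"~${cost:,.0f} for the whole library")  # ~$270 with these assumed numbers
```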
I think a combination of CLIP and some face recognition may solve your issues! It just takes a path to your directory and can index all the images while preserving your folder hierarchy, along with high-quality face clustering. Each image takes about 100 ms to index on a CPU. Everything can then be mixed and matched from a single interface. It doesn't take much to try, as the dependencies are very minimal, and there's a self-contained app for Windows too. I've been looking for feedback, so I'm plugging it here in case someone has a use case.
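The general idea is along these lines -- a simplified sketch with sentence-transformers, not the actual implementation (see the repo for that):

```python
# Sketch of the general CLIP-indexing idea: embed every image once, then search by text.
from pathlib import Path
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Recursive glob keeps the folder hierarchy visible in the result paths.
paths = sorted(Path("photos").rglob("*.jpg"))
embeddings = model.encode([Image.open(p) for p in paths], convert_to_tensor=True)

def search(query: str, top_k: int = 5):
    q = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q, embeddings, top_k=top_k)[0]
    return [(paths[h["corpus_id"]], h["score"]) for h in hits]

print(search("elephant at a waterhole"))
```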
Promising, but I want to see more before I try it - could you invest a little in your README to list out the features, and maybe do a short Loom demo?
As for a demo, I don't have many resources to host it. Would a video showcasing the features help?
Also, for Windows there's a portable app at https://github.com/eagledot/hachi/releases/download/v1.3/hac...
I agree it'll still be cool to be able to do it all locally.
Worked pretty well, and it was decently fast if the model fit in GPU RAM.
The main issue was prompt adherence; in my experience, prompt adherence drops significantly as you reduce the model size.
Llama 3.2 Vision seems to be tuned hard to provide a summary at the end, usually with some social commentary, and it was difficult to get the 8B model to avoid outputting it. Also, multiple if-this-then-that clauses like the ones in your prompt were often ignored by the 8B and smaller models compared to the 90B.
I've tried the Gemma 2 models before for assistant tasks and was very pleased with the 9B's performance: good prompt adherence and solid results across various tasks. So I'm looking forward to trying this.
I think the main missing piece is that the classifications aren't exposed to you in the UI. But I haven't used this feature much, and my instance isn't running right now (the SSD failed and I'm still trying to get it replaced), so I'm not able to check.
You can do this with traditional CNNs. For place, use Geospy or another geospatial model.
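For example, a pretrained ImageNet classifier already covers a lot of the wildlife buckets. A minimal sketch with torchvision (the mapping from fine-grained ImageNet labels to your own coarse categories is left out):

```python
# Sketch: top-level animal tagging with a plain pretrained CNN (torchvision ResNet-50).
# ImageNet classes are fine-grained ("African elephant", "hummingbird", ...), so you'd
# still map them to your own coarse buckets downstream.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()

def top_label(path: str) -> tuple[str, float]:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = model(img).softmax(dim=1)[0]
    score, idx = probs.max(dim=0)
    return weights.meta["categories"][idx], float(score)

label, score = top_label("photos/IMG_1234.jpg")
print(label, round(score, 2))  # e.g. "African elephant" 0.91
```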
[1] Page 20 of https://arxiv.org/pdf/2412.03555
So far CogVLM is the only one I've seen that works, but it's a bit of a pain to run.
https://developers.googleblog.com/en/gemma-explained-paligem...
They also have a Colab notebook with more examples linked from the article.
Unfortunately, without fine-tuning you can't have it just detect everything and return all detected objects, as far as I can tell.
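For reference, prompting the mix checkpoint for detection through transformers looks roughly like this -- note that you list the classes yourself (a sketch based on the PaliGemma docs; the model id and exact prompt format are assumptions, not something I've verified here):

```python
# Sketch: PaliGemma open-vocabulary detection via the "detect ..." prompt prefix.
# You have to name the classes explicitly; it won't enumerate every object unprompted.
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-448"  # assumption: a "mix" checkpoint that handles detect prompts
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

image = Image.open("photos/IMG_1234.jpg").convert("RGB")
prompt = "detect elephant ; bird ; lion"  # classes must be listed explicitly

inputs = processor(text=prompt, images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
# The output contains <loc....> tokens encoding normalized bounding-box coordinates.
print(processor.decode(out[0], skip_special_tokens=True))
```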