CatLIP: CLIP Vision Accuracy with 2.7x Faster Pre-Training on Web-Scale Data

48 points

by panabee

9 days ago

| 2 comments

| arxiv.org

ggnore7452

9 days ago

Question: are there any good image embedding models small enough to run on-device?

I tried https://github.com/unum-cloud/uform, which I do like, especially since it also supports languages other than English. Any recommendations for alternatives?

philipkglass

9 days ago

I have successfully used OpenCLIP models for embedding and similar-image search. The smallest model listed on that UForm page is 79 million parameters, so I presume that you can use other models of similar size. There are a few OpenCLIP models with 80 million or fewer parameters listed here:

https://github.com/mlfoundations/open_clip/blob/main/docs/mo...

When embeddings are quantized to int8 they still work very well for similarity (no differences in top 10 search on my test set). I haven't tried quantizing the models themselves.
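For anyone who wants to try the int8 idea, here is a minimal sketch. The comment above does not say which quantization scheme was used, so this assumes the simplest one: L2-normalize each embedding, then scale symmetrically to [-127, 127]. The random vectors are a stand-in for real image embeddings, and `quantize_int8` and `top_k` are illustrative names, not part of OpenCLIP or UForm.

```python
import numpy as np

def quantize_int8(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize each row, then map components from [-1, 1] to int8 in [-127, 127]."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return np.round(unit * 127.0).astype(np.int8)

def top_k(query: np.ndarray, index: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k rows of `index` with the highest dot product with `query`."""
    # Accumulate in a wider type so int8 products cannot overflow.
    acc = np.int32 if np.issubdtype(index.dtype, np.integer) else np.float32
    scores = index.astype(acc) @ query.astype(acc)
    return np.argsort(-scores, kind="stable")[:k]

# Assumption: random vectors standing in for real image embeddings.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 512)).astype(np.float32)
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
q8 = quantize_int8(vectors)

float_hits = top_k(unit[0], unit, 10)  # float32 baseline
int8_hits = top_k(q8[0], q8, 10)       # same search on the int8 copies
overlap = len(set(float_hits.tolist()) & set(int8_hits.tolist()))
print(f"top-10 overlap between float32 and int8 search: {overlap}/10")
```

Random vectors are close to a worst case for this comparison because their neighbors all have nearly identical scores, so check the overlap on your own embeddings before relying on it.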

cs702

9 days ago

TL;DR: The authors pretrain the model to classify images into WordNet synsets[a] that appear in the caption, using a standard cross-entropy loss. They keep the number of classes relatively small by removing any synsets that don't show up in captions at least 500 times in the dataset. It seems to work well.

My immediate question is: Why not classify among the entire hierarchy of all WordNet synsets?

---

[a] https://wordnet.princeton.edu/
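The vocabulary-filtering step described in the TL;DR can be sketched roughly as follows. This is not the authors' code: it assumes the synsets have already been extracted from each caption (that extraction step is not shown), uses made-up toy data, and uses a threshold of 2 in place of 500 so the example stays small. `build_vocab` and `multi_hot` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(caption_synsets, min_count):
    """Keep synsets that occur at least `min_count` times; map each to a class index."""
    counts = Counter(s for synsets in caption_synsets for s in synsets)
    kept = sorted(s for s, c in counts.items() if c >= min_count)
    return {s: i for i, s in enumerate(kept)}

def multi_hot(synsets, vocab):
    """Target vector with 1.0 for every kept synset found in one caption."""
    target = [0.0] * len(vocab)
    for s in synsets:
        if s in vocab:
            target[vocab[s]] = 1.0
    return target

# Assumption: toy synset lists, one list per caption.
caption_synsets = [
    ["dog.n.01", "ball.n.01"],
    ["dog.n.01", "park.n.01"],
    ["cat.n.01", "dog.n.01"],
    ["cat.n.01"],
]

vocab = build_vocab(caption_synsets, min_count=2)  # the thread mentions 500
targets = [multi_hot(s, vocab) for s in caption_synsets]
print(vocab)
print(targets)
```

Rare synsets such as `ball.n.01` simply drop out of the label set, which is how the class count stays manageable.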

cs702

8 days ago

I've tried this for WordNet hierarchical classification with a standard cross-entropy loss:

https://github.com/glassroom/heinsen_tree#sample-usage-with-...

It worked for me, but I had to modify the code to use all hypernym paths, which gave me 147,200 classes, one per path (English only). For synsets with more than one path, I split the target probability mass over their paths. For prediction, I summed the predicted probabilities of the hypernym paths ending at each synset.
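The path bookkeeping described above can be sketched in a few lines. This is an illustration only, not the heinsen_tree code: the three paths are made-up toy data, and the even split across paths is an assumption, since the comment does not say how the mass was divided. `target_distribution` and `synset_probs` are hypothetical names.

```python
from collections import defaultdict

# Assumption: toy hypernym paths, one class per path; the last element
# of each tuple is the synset that the path ends at.
paths = [
    ("entity", "animal", "dog"),
    ("entity", "pet", "dog"),
    ("entity", "animal", "cat"),
]

def target_distribution(synset, paths):
    """Split the target probability mass evenly over every path ending at `synset`."""
    hits = {i for i, p in enumerate(paths) if p[-1] == synset}
    if not hits:
        raise KeyError(f"no path ends at {synset!r}")
    return [1.0 / len(hits) if i in hits else 0.0 for i in range(len(paths))]

def synset_probs(path_probs, paths):
    """Sum predicted path probabilities over the paths ending at each synset."""
    totals = defaultdict(float)
    for path, prob in zip(paths, path_probs):
        totals[path[-1]] += prob
    return dict(totals)

target = target_distribution("dog", paths)
predicted = synset_probs([0.5, 0.25, 0.25], paths)
print(target)
print(predicted)
```

With real WordNet data the path list would have one entry per hypernym path rather than three toy tuples, but the two operations stay the same.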