You can train it in under a minute, and it will work perfectly well on embedded devices.
Small LLMs are good choices for text classification in two cases:
- If you need to provide in-context examples and classify based on them.
- Your classification goes beyond simple subject-type classifiers. For example, multiple-choice question answering is a classification task where a small LLM will work but traditional ML methods won't.
- Zero-shot encoders like tasksource or GliNER
- Natural language inference: https://huggingface.co/blog/dleemiller/nli-xenc-ways-to-use
- GRPO training
- GEPA prompt tuning Qwen 0.6B (or GEPA, then GRPO)
- Use an embedding model and train a classifier (MLP, logistic regression, SVM)
- Use a larger LLM to generate a synthetic dataset (beware of lack of diversity, mine "seed text" from real sources first)
- Synthetically generate "hard examples" where more than one category may be valid and DPO tune your preferred responses
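To make the "embedding model plus classifier" option above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, the two labels and four training texts are made up, and the logistic regression is hand-rolled so the example has no dependencies. In practice you would swap `embed` for calls to an actual embedding model and probably use a library classifier.

```python
import math

def build_vocab(texts):
    """Map each distinct lowercase word to a vector index."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def embed(text, vocab):
    """Toy stand-in for an embedding model: L2-normalized bag of words.
    Hypothetical placeholder; a real setup would call an embedding model."""
    vec = [0.0] * len(vocab)
    for w in text.lower().split():
        if w in vocab:  # words outside the vocabulary are ignored
            vec[vocab[w]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def train_logistic(X, y, epochs=200, lr=0.5):
    """Binary logistic regression trained with plain SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label  # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(text, vocab, w, b):
    """Return the class index (0 or 1) for a new text."""
    x = embed(text, vocab)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Made-up training data, for illustration only.
LABELS = ["billing", "bug"]
train_texts = [
    "refund my order please",
    "where is my refund",
    "the app crashes on start",
    "app crashes when I login",
]
train_labels = [0, 0, 1, 1]

vocab = build_vocab(train_texts)
X = [embed(t, vocab) for t in train_texts]
w, b = train_logistic(X, train_labels)

print(LABELS[predict("my refund never arrived", vocab, w, b)])
```

The classifier itself is tiny, which is why this route trains quickly: all the heavy lifting is in the embedding step, and that part is frozen.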
Half of the time I ask Qwen 0.6B "what is 1 + 2?" it ends up in a thinking loop of "but wait, the user is asking me to ..."
Cool write-up! Really appreciate it. Incidentally, how does this categorization help you get better retrieval results?
Also, you could stick a classifier head on a BERT model as another option.
Can this specific failure mode be solved by providing a grammar that the output must adhere to? (Not sure if Qwen has this feature; it's used, e.g., to ensure the output is parseable JSON.)
It's implemented by whatever runs the model (e.g. llama.cpp) rather than by the model itself.
Note that it is hard to make work if you turn thinking on because the grammar gets complicated quickly (I don't recall if Qwen 0.6B can do thinking).
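For anyone wondering what "implemented by the runtime" means in practice, here is a rough sketch of the idea, not how llama.cpp actually does it. At each decoding step the runtime throws away every token that would break the grammar and picks the best of what is left. The "grammar" here is just a fixed set of allowed label token sequences, and `toy_scores` / `score_fn` are made-up stand-ins for model logits. The sketch also assumes no allowed output is a prefix of another, so it can skip end-of-sequence handling.

```python
def allowed_next(prefix, allowed_outputs):
    """Tokens that keep `prefix` a valid prefix of some allowed output."""
    n = len(prefix)
    return {
        seq[n]
        for seq in allowed_outputs
        if len(seq) > n and tuple(seq[:n]) == tuple(prefix)
    }

def constrained_decode(score_fn, allowed_outputs):
    """Greedy decoding where grammar-violating tokens are masked out."""
    prefix = []
    while True:
        candidates = allowed_next(prefix, allowed_outputs)
        if not candidates:  # nothing can follow: the output is complete
            return prefix
        # sorted() only makes tie-breaking deterministic
        best = max(sorted(candidates), key=lambda tok: score_fn(prefix, tok))
        prefix.append(best)

# Hypothetical label set, already split into tokens.
ALLOWED = [
    ("sports",),
    ("world", "news"),
    ("world", "politics"),
]

# Made-up scores standing in for model logits. The model "wants" to say
# "but wait", which is the looping failure mode described above.
toy_scores = {
    "but": 5.0, "wait": 4.0,
    "world": 2.0, "politics": 1.5, "sports": 1.0, "news": 0.5,
}

def score_fn(prefix, token):
    return toy_scores.get(token, 0.0)

unconstrained_first = max(toy_scores, key=toy_scores.get)
result = constrained_decode(score_fn, ALLOWED)
print(unconstrained_first, result)
```

Unconstrained, the top-scoring first token is "but"; constrained, that token is never even considered, so the output has to be one of the labels. This also shows why thinking mode complicates things: the grammar would have to permit arbitrary reasoning text first and only then constrain the answer.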
I'm also interested in it as a student for distillation.