FilterHN

Finding Optimal Tokenizers

24 points

by mcyc

15 hours ago

| 1 comment

| blog.aqnichol.com

fxtentacle

35 minutes ago

This is an interesting approach: formulate the problem as an integer program and hand it to an explicit solver. It's probably very slow, but you only have to run it once, and it produces the mathematically perfect result.

In the past, I got good results by trying to reduce the variance in entropy between tokens. It's very easy to implement: start with each single character as its own token, then greedily merge the most numerous outlier token pairs in a loop until you reach your desired token count. https://arxiv.org/abs/2206.12693
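
For anyone who wants to see the shape of that loop, here is a rough sketch in Python. It is not the method from the linked paper: the comment doesn't say how "outlier" pairs are scored, so this version simply merges the most frequent adjacent pair each round (plain BPE-style merging) as a stand-in. The function name `greedy_merge_vocab` and its parameters are made up for illustration.

```python
from collections import Counter


def greedy_merge_vocab(text, target_vocab_size):
    """Greedy pair merging, starting from single-character tokens.

    Assumption: pairs are ranked by raw frequency only. An entropy-based
    "outlier" score would replace the max() selection below.
    """
    tokens = list(text)          # each character starts as its own token
    vocab = set(tokens)
    merges = []                  # merge rules, in the order they were learned

    while len(vocab) < target_vocab_size:
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break                # fewer than two tokens left
        # Most frequent adjacent pair; ties are broken by the pair itself
        # so the result is deterministic.
        (left, right), count = max(
            pair_counts.items(), key=lambda kv: (kv[1], kv[0])
        )
        if count < 2:
            break                # nothing repeats, so merging gains nothing

        merged = left + right
        merges.append((left, right))
        vocab.add(merged)

        # Rewrite the token stream, replacing each non-overlapping
        # occurrence of the pair with the merged token.
        out = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out

    return tokens, vocab, merges
```

Calling `greedy_merge_vocab("abababab", 3)` learns the single merge `("a", "b")` and returns the token stream `["ab", "ab", "ab", "ab"]`. Each round rescans the whole stream, so this is only meant for small inputs.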