Folding diacritics makes "vähä" (little) into "vaha" (wax).
Dropping stop words like "The" misses the word for "tea" (in rather old-fashioned Finnish, but also in current Danish).
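To make the failure concrete, here's a tiny sketch (plain Python; the Danish sentence is my own illustration): run an English stop-word list over Danish text and the word for tea silently disappears.

    # A typical English stop-word list (small subset shown)
    stop = {"the", "a", "an", "of", "to", "in"}
    sentence = "kan jeg få en kop the"  # Danish: "can I get a cup of tea"
    print([w for w in sentence.split() if w.lower() not in stop])
    # ['kan', 'jeg', 'få', 'en', 'kop'] -- "the" (tea) is gone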
Stemming Finnish words is also much more complex, as we tend to append suffixes to words instead of putting small words in front of them. "talo" is "house", "talosta" is "from the house", "talostani" is "from my house", and "talostaniko" turns it into a question: "from my house?"
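If you want to see how far a standard stemmer gets with this, the snowballstemmer package ships a Finnish algorithm. A quick sketch (no promises that it strips every layer of suffixes):

    import snowballstemmer  # pip install snowballstemmer

    stemmer = snowballstemmer.stemmer("finnish")
    words = ["talo", "talosta", "talostani", "talostaniko"]
    # Print each surface form next to its stem
    for word, stem in zip(words, stemmer.stemWords(words)):
        print(word, "->", stem)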
If that sounds too easy, consider Japanese. From what little I know, they don't use whitespace to separate words, they mix two phonetic alphabets with Chinese ideograms, etc.
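You can see the whitespace problem without any libraries at all; naive splitting treats a whole Japanese sentence as a single token (morphological analyzers like MeCab exist precisely to fix this):

    # Naive whitespace tokenization on Japanese text
    sentence = "吾輩は猫である"  # "I am a cat"
    print(sentence.split())  # ['吾輩は猫である'] -- the whole sentence is one "token"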
We (ParadeDB) use a search library called Tantivy under the hood, which supports stemming in Finnish, Danish and many other languages: https://docs.paradedb.com/documentation/token-filters/stemmi...
I actually just started working on a data formatter that applies principles like these to drastically reduce the number of tokens without degrading performance the way other formats do (looking at you, tson).
However, even though the approach is “old-fashioned”, it’s still widely used for English. I’m not sure there is a universal approach that semantic search could use that would be both fast and accurate?
At the end of the day people choose a tokenizer that matches their language.
I will update the article to make all this clearer though!
The Old English "The" (Definite Article) Case Masculine (Ten) Neuter (To) Feminine (Ta) Plural (Te) Nominative Se Þæt Sēo Þā Accusative Þone Þæt Þā Þā Genitive Þæs Þæs Þære Þāra Dative Þæm Þæm Þære Þæm Instrumental Þy Þy — —
I have read somewhere that Polish was actually a more precise language to use with AI. I'm wondering if the idea of shortening words that apparently carry no meaning isn't actually hurting it more, as the article notes.
So I'm left to wonder at this point: wouldn't it be worth exploring a terser version of the language that might bridge that gap? Completely exploratory though; I don't even know if it would be helpful beyond being a toy.
I use search in my email pretty heavily, and I’m most interested in specific words in an email, and in whether those emails are from specific folks or a specific domain. But the mobile version of Gmail produces different results from the mobile Outlook app, which in turn differs from the desktop version of Gmail, and all of them are pretty terrible at search as it pertains to email.
I have a hard time getting them to pull up emails in search that I know exist, that I know have certain words, and that I know have certain email addresses in the body.
I recognize a generalized search mechanism is going to get domain-specific nuances wrong, but is it really so hard to make a search engine that works on email and email-based attachments that no one cares enough to try?
I haven’t looked, but I wonder if there is a good hackable email client that will let you substitute out the search index with a reasonable abstraction from all the complicated email protocol stuff. I feel like building an index for your use case is totally achievable if so.
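For a sense of scale, the index itself is the easy part. Here's a minimal sketch using SQLite's FTS5 (the table and field names are my own invention, and the genuinely hard part, fetching and parsing the mail, is assumed away):

    import sqlite3

    db = sqlite3.connect("mail.db")
    # Full-text index over a few mail fields (FTS5 is compiled into
    # most standard SQLite builds)
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS mail USING fts5(sender, subject, body)")
    db.execute(
        "INSERT INTO mail VALUES (?, ?, ?)",
        ("alice@example.com", "Quarterly numbers", "Report attached, see totals inside."),
    )
    # Field-scoped search: a word in the body, restricted to one sender
    for row in db.execute(
        "SELECT sender, subject FROM mail WHERE mail MATCH 'sender:alice AND body:totals'"
    ):
        print(row)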
Yeah, it'll be fewer input tokens if you omit them yourself. It's not guaranteed to keep the response the same, though: you're asking the model to work with less context and more ambiguity at that point. So stripping your prompt of stopwords is going to save you negligible $ and potentially cost a lot in model performance.
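It's easy to measure just how negligible the savings are. A quick sketch with OpenAI's tiktoken tokenizer (the stop-word list here is ad hoc):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    prompt = "Summarize the following report and list all of the key risks."
    stop = {"the", "a", "an", "and", "of", "all"}
    stripped = " ".join(w for w in prompt.split() if w.lower() not in stop)
    # Compare token counts with and without stop words
    print(len(enc.encode(prompt)), "->", len(enc.encode(stripped)))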