Folding diacritics makes "vähä" (little) into "vaha" (wax).
Dropping stop words like "The" misses the word for "tea" (in rather old-fashioned Finnish, but also in current Danish).
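To make the failure concrete, here's a tiny sketch (plain Python; the Danish sentence is my own illustration): run an English stop-word list over Danish text and the word for tea silently disappears.

    # A typical English stop-word list (small subset shown)
    stop = {"the", "a", "an", "of", "to", "in"}
    sentence = "kan jeg få en kop the"  # Danish: "can I get a cup of tea"
    print([w for w in sentence.split() if w.lower() not in stop])
    # ['kan', 'jeg', 'få', 'en', 'kop'] -- "the" (tea) is gone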
Stemming Finnish words is also much more complex, as we tend to append suffixes to words instead of putting small words in front of them. "talo" is "house", "talosta" is "from the house", "talostani" is "from my house", and "talostaniko" turns it into a question: "from my house?"
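If you want to see how far a standard stemmer gets with this, the snowballstemmer package ships a Finnish algorithm. A quick sketch (no promises that it strips every layer of suffixes):

    import snowballstemmer  # pip install snowballstemmer

    stemmer = snowballstemmer.stemmer("finnish")
    words = ["talo", "talosta", "talostani", "talostaniko"]
    # Print each surface form next to its stem
    for word, stem in zip(words, stemmer.stemWords(words)):
        print(word, "->", stem)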
If that sounds too easy, consider Japanese. From what little I know, they don't use whitespace to separate words, they mix two phonetic alphabets with Chinese ideograms, etc.
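You can see the whitespace problem without any libraries at all; naive splitting treats a whole Japanese sentence as a single token (morphological analyzers like MeCab exist precisely to fix this):

    # Naive whitespace tokenization on Japanese text
    sentence = "吾輩は猫である"  # "I am a cat"
    print(sentence.split())  # ['吾輩は猫である'] -- the whole sentence is one "token"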
We (ParadeDB) use a search library called Tantivy under the hood, which supports stemming in Finnish, Danish and many other languages: https://docs.paradedb.com/documentation/token-filters/stemmi...
I actually just started working on a data formatter that applies principles like these to drastically reduce the number of tokens without degrading performance the way other formats do (looking at you, tson).
However, even though the approach is “old-fashioned”, it’s still widely used for English. I’m not sure there is a universal approach that semantic search could use that would be both fast and accurate?
At the end of the day people choose a tokenizer that matches their language.
I will update the article to make all this clearer though!
The Old English "The" (Definite Article) Case Masculine (Ten) Neuter (To) Feminine (Ta) Plural (Te) Nominative Se Þæt Sēo Þā Accusative Þone Þæt Þā Þā Genitive Þæs Þæs Þære Þāra Dative Þæm Þæm Þære Þæm Instrumental Þy Þy — —
I have read somewhere that Polish was actually a more precise language to use with AI. I'm wondering if the idea of shortening words that apparently carry no meaning isn't actually hurting it more, as the article notes.
So I'm left to wonder at this point: wouldn't it be worth exploring a terser version of the language that might bridge that gap? Completely exploratory though; I don't even know if it would be helpful beyond being a toy.
I use search in my email pretty heavily, and I’m most interested in specific words in an email, and in whether those emails are from specific folks or a specific domain. But the mobile version of Gmail produces different results from the mobile Outlook app, which in turn differs from the desktop version of Gmail, and all of them are pretty terrible at search as it pertains to email.
I have a hard time getting them to pull up emails in search that I know exist, that I know have certain words, and that I know have certain email addresses in the body.
I recognize a generalized search mechanism is going to get domain-specific nuances wrong, but is it really so hard to make a search engine that works on email and email-based attachments that no one cares enough to try?
I haven’t looked, but I wonder if there is a good hackable email client that will let you substitute out the search index with a reasonable abstraction from all the complicated email protocol stuff. I feel like building an index for your use case is totally achievable if so.
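For a sense of scale, the index itself is the easy part. Here's a minimal sketch using SQLite's FTS5 (the table and field names are my own invention, and the genuinely hard part, fetching and parsing the mail, is assumed away):

    import sqlite3

    db = sqlite3.connect("mail.db")
    # Full-text index over a few mail fields (FTS5 is compiled into
    # most standard SQLite builds)
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS mail USING fts5(sender, subject, body)")
    db.execute(
        "INSERT INTO mail VALUES (?, ?, ?)",
        ("alice@example.com", "Quarterly numbers", "Report attached, see totals inside."),
    )
    # Field-scoped search: a word in the body, restricted to one sender
    for row in db.execute(
        "SELECT sender, subject FROM mail WHERE mail MATCH 'sender:alice AND body:totals'"
    ):
        print(row)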
Yeah, it'll be fewer input tokens if you omit them yourself. It's not guaranteed to keep the response the same, though: you're asking the model to work with less context and more ambiguity at that point. So stripping your prompt of stopwords is going to save you negligible $ and potentially cost a lot in model performance.
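It's easy to measure just how negligible the savings are. A quick sketch with OpenAI's tiktoken tokenizer (the stop-word list here is ad hoc):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    prompt = "Summarize the following report and list all of the key risks."
    stop = {"the", "a", "an", "and", "of", "all"}
    stripped = " ".join(w for w in prompt.split() if w.lower() not in stop)
    # Compare token counts with and without stop words
    print(len(enc.encode(prompt)), "->", len(enc.encode(stripped)))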