When I started learning Russian, the declensions (like the ones mentioned in the article) really threw me for a loop. I looked all over for a similar app that explains the patterns and drills them through rote practice, but never found one.
While slightly off-topic, does anyone know of such an app (web-based or macOS/iOS)?
> KOFI (Konjugation First) is the name I've given to a provocative language-learning approach I've created: to learn all the forms of a language's conjugation before even starting to formally study the language
I used the French one, years after I learned French, because my conjugation was abysmal. You can get by using basic tenses or wrong tenses, and people will understand you, but it's not what you want. The KOFI method is supposed to teach you all the conjugation patterns in a matter of months, before you start learning the language itself; I'd like to give it a try in earnest someday for a new language. My interest in French has waned, so I didn't stick with it.
Non-native Russian speaker here. In the past, I cobbled together some scripts that use the spaCy Python library with the larger of the two Russian models to provide context-aware lemmatization and grammatical tag extraction.
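Roughly what those scripts boil down to (an illustrative sketch; it assumes the large Russian pipeline, ru_core_news_lg, has been installed via python -m spacy download ru_core_news_lg):

    import spacy

    # Load the large Russian pipeline (assumed installed beforehand).
    nlp = spacy.load("ru_core_news_lg")

    doc = nlp("Мы читали интересные книги в библиотеке.")
    for token in doc:
        # lemma_ is the context-aware lemma; morph carries the grammatical
        # tags (case, number, gender, tense, ...) for each token.
        print(token.text, token.lemma_, token.morph)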
On the whole, though, my biggest gains in Russian came from letting go of the need to analytically deconstruct the inflections and instead building up a mental library of patterns (and exceptions) through use.
EDIT: I mean context within a sentence, not a broader meaning.
There was a section at the front of the dictionary with full conjugation patterns over all tenses for one sample verb in each class.
E.g., each type of stem-changing verb fell under one index, fully irregular verbs were singletons in their own classes, and some irregulars that behave similarly (IIRC tener and detener) shared one class.
So all verbs in Spanish fell neatly into a few dozen unique patterns, and the indexing was already done.
I was going to build quiz software just like you mentioned, to conjugate any verb in any tense, but “never got around to it”.
I wonder how well the reversed-string trie approach from the article would work for reconstructing the class mapping.
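A rough sketch of what I have in mind (Python; the verbs and class IDs are made up for illustration): insert each known infinitive reversed, then classify an unknown verb by the class attached to its longest matching suffix.

    # Toy reversed-string trie mapping Spanish infinitives to conjugation-class IDs.
    # The class numbers are invented; a real mapping would come from the
    # dictionary's own index.
    known = {"hablar": 1, "comer": 2, "vivir": 3, "pensar": 4, "tener": 5, "detener": 5}

    trie = {}
    for verb, cls in known.items():
        node = trie
        for ch in reversed(verb):
            node = node.setdefault(ch, {})
        node["$"] = cls  # class stored where the reversed verb ends

    def classify(verb):
        node, best = trie, None
        for ch in reversed(verb):
            if "$" in node:
                best = node["$"]  # class of the longest suffix matched so far
            if ch not in node:
                return best
            node = node[ch]
        return node.get("$", best)

    print(classify("contener"))  # hopefully lands in the same class as tener/detener (5)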
It is based on an OpenCorpora dictionary: https://opencorpora.org/dict.php
This dictionary is based on Zaliznyak's dictionary, which is consistently referenced in Wiktionary's Russian entries.
Ah, as a cheap bastard, I hate how software used to be pay-once back then, so for this one I'll just ask: what's the monthly subscription price?
That said, I very much like Codeweavers’ approach [0], which IMO is the modern equivalent to purchasing software on a physical medium: you buy it, you can re-download it as many times as you’d like, install it on as many machines as you’d like (single-user usage only), and you get 1 year of updates and support. After that, you can still keep using it indefinitely, but you don’t get updates or paid support. You get a discount if you renew before expiry. They also have a lifetime option which, so far, they’ve not indicated they’re going to change.
I have no affiliation with them, I just think it’s a good product, and a good licensing / sales model.
On the other hand, we have software that has low maintenance costs but is sold for peanuts ($0–$10) in small quantities, so authors try to introduce alternative revenue streams.
As in, it's fair to pay continuously (subscription) for continuous work (maintenance), so I don't expect that to go away. Ads, though, yuck...
Increasingly I am not buying software at all.
That solved all the issues with paying for maintenance, but sadly someone must have figured out a mandatory subscription was a better way to make more money.
Major versions come from a time when one had to produce physical media. Thus one could only do a major release every few years, and features had to be grouped together into a big-bang release.
Nowadays one can ship features as they are developed, with many small changes landing all the time.
Price is usually established based on how much something costs to make (materials, effort) plus a profit margin, combined with market conditions (abundance or shortage of products, surplus cash vs. a tough economy...).
If you want to continuously extract profit from consistent use of a hammer or vacuum cleaner, somebody else will trivially make a competing product at a lower price with no subscription.
And software like Photoshop is not trivial to copy, so it can survive being priced based on the value it provides. Competitors without a subscription exist, but they are not good enough to kill it.
The manufacturer/builder gets paid once, and you get value monthly.
The fact that you can purchase cars and houses outright, with no ongoing payment to the builder, is due to competition.
The idea of capturing reward post-receipt is feudalistic.
Encoding them into a trie like this would still be a good way to distribute the result, but you don't have to rely on the trie also being a good way to guess the declensions.
I would not be confident enough to add the data myself since I'd probably be wrong a lot of the time. When reviewing the results for the top 100 unknown names, I frequently got results that I thought _might_ be wrong, but I wasn't sure. For those, I looked up similar names in DIM to verify, and often thought "huh, I would not have declined those names like this". For that reason, I rely on the DIM data as the source of truth since it's maintained by experts on the language.
I also live in a country with a centrally governed personal name list, but you can request exceptions, and there are people who were born before the list existed, so their names won't necessarily be on the list either. Immigrants can also retain their names during naturalization I believe, and there can be lots of other complications still. So the ability to sorta-kinda predict the proper declension is still useful.
Maybe generating a minimal list of regexes that classifies 100% of names correctly? Maybe a big enough bloom filter? Maybe like a bloom filter but instead of hashes we use engineered features?
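To make the first idea concrete, here's a toy sketch (Python, made-up data; the real input would be the DIM name/pattern pairs): for each name, find the shortest suffix that is unambiguous within the dataset, and keep the distinct suffixes as the rules. It's essentially what the trie does, just written out as explicit rules.

    # Toy sketch of the "minimal suffix rules" idea. Names and patterns below
    # are illustrative only.
    names = {
        "Ástvaldur": "ur,,i,ar",
        "Haraldur": "ur,,i,ar",
        "Baldur": "ur,ur,ri,urs",
    }

    def shortest_unambiguous_suffix(name, names):
        for length in range(1, len(name) + 1):
            suffix = name[-length:]
            patterns = {p for n, p in names.items() if n.endswith(suffix)}
            if len(patterns) == 1:
                return suffix
        return name  # fall back to the whole name

    rules = {shortest_unambiguous_suffix(n, names): p for n, p in names.items()}
    print(rules)  # {'valdur': 'ur,,i,ar', 'raldur': 'ur,,i,ar', 'Baldur': 'ur,ur,ri,urs'}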
That said, I’m curious how this manifests in cross-language situations. I guess the Icelandic UI displaying French names would just always use the nominative case, and likewise for the English UI displaying Icelandic names? I assume this all mostly matters where the user is directly being addressed, or perhaps in an admin panel (“user x responded to user y”).
…
> But that quickly breaks down. There are other names ending with “ður” or “dur” that follow a different pattern of declension
My “everything should be completely orderly” comp-sci brain is always triggered by these almost trivial problems that end up being much more interesting.
Is the suffix pattern based on the pronunciation of the syllable(s) before the suffix? If one wanted to improve upon your work for unknown names, rather than consider the letters used, would you have to do some NLP on the name to get a representation of the pronunciation and look that up (in a trie or otherwise)?
Careful, this is how you fall down the Are Dependent Types The Answer?? hole.
In this particular example, having a subsequent part of an expression rely on prior parts would usually be accomplished at runtime in most languages. But some (like Idris) might allow you to encode the rules in the type system. Thus the rabbit hole.
- Ástvaldur -> ur,,i,ar
- Baldur -> ur,ur,ri,urs
The "aldur" ending is pronounced in the exact same manner, but applying the declension pattern of "Ástvaldur" to "Baldur" would yield:
- Baldur
- Bald
- Baldi
- Baldar
The three last forms feel very wrong (I asked my partner to verify and she cringed).
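For anyone following along, "applying" a pattern here (as I understand the format) just means stripping the first listed ending from the nominative and gluing on each case ending in turn; a minimal sketch:

    # Minimal sketch, assuming the comma-separated pattern lists the endings for
    # nominative, accusative, dative and genitive, relative to a stem obtained by
    # stripping the first (nominative) ending from the name.
    def apply_pattern(name, pattern):
        endings = pattern.split(",")                      # "ur,,i,ar" -> ["ur", "", "i", "ar"]
        stem = name[:-len(endings[0])] if endings[0] else name
        return [stem + e for e in endings]

    print(apply_pattern("Baldur", "ur,,i,ar"))      # ['Baldur', 'Bald', 'Baldi', 'Baldar'] -- wrong
    print(apply_pattern("Baldur", "ur,ur,ri,urs"))  # ['Baldur', 'Baldur', 'Baldri', 'Baldurs'] -- right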
Spoken Icelandic is surprisingly close to its written form. I wouldn't expect very different results for the trie if a "phonetic" version of names and their endings were used instead of their written forms.
const suffixes = [",,,", "a,u,u,u", ",,i,s", ",,,s", "i,a,a,a", ...];
and then reference patterns by their index in this list inside the serialized trie, e.g. var serializedInput = "{e:{n:{ein:0_r: ...
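In Python terms (purely illustrative data; the real version would rewrite the trie leaves during serialization), the dedup step could look like:

    # Sketch: collect the distinct suffix patterns once, then have each trie
    # leaf store only an index into that shared list. Data is made up.
    patterns = {"Einar": ",,i,s", "Baldur": "ur,ur,ri,urs",
                "Ástvaldur": "ur,,i,ar", "Haraldur": "ur,,i,ar"}

    suffixes = sorted(set(patterns.values()))               # the shared lookup table
    index_of = {p: i for i, p in enumerate(suffixes)}
    leaves = {name: index_of[p] for name, p in patterns.items()}

    print(suffixes)  # [',,i,s', 'ur,,i,ar', 'ur,ur,ri,urs']
    print(leaves)    # Ástvaldur and Haraldur end up pointing at the same index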
If you can use gzip, there's bound to be a clever way of using a suffix array as well; that might end up being better unless you can use an optimised binary format for the tree.
Of course, that would mean you lose the ability to say "name not handled".
Native speakers very frequently decline names in ways that are not technically perfect but sound correct enough. For example, my name (Alex) should not be declined, but people frequently use the declension pattern (Alex, Alex, Alexi, Alexar).
There's some parallel to be drawn with how the compressed trie applies patterns that it's learned to names. That's at least how I thought about it when designing the library.
For example, if an English person called Arthur uses the site in Icelandic, I'm not sure they'd expect their name to be changed to presumably "Arth", "Arthi" or "Arthar" even if they were a keen learner of Icelandic. Their name is their name. So, as well as storing someone's name, you also have to ask them what language their name is, or guess and get it wrong. At that point, you might as well just ask them for all the different forms for the name as well, and then you don't have to worry about whether their name is on an approved list or not.
And if the website isn't localised into Icelandic, I've also got to wonder if Icelandic visitors would have an expectation of Icelandic grammar rules being applied to English (or whatever) text. Most Icelandic people I've spoken to before have an excellent command of English anyway, and I'm sure they'd understand why their name isn't changing form in English.
So if your name was Arthur and you wanted to emigrate to Iceland, you would have to change your name.
Might still be like this.
Why not just reuse the existing standard and change everyone’s last names to Kim, Lee, or Park?
*surnames. Not last in that case, whatever the case is you're trying to make.
It's not a privacy issue if it's just "someone's" name.
> A name not already on the official list of approved names must be submitted to the naming committee for approval. A new name is considered for its compatibility with Icelandic tradition and for the likelihood that it might cause the bearer embarrassment. Under Article 5 of the Personal Names Act, names must be compatible with Icelandic grammar (in which all nouns, including proper names, have grammatical gender and change their forms in an orderly fashion according to the language's case system).
A database of those names is no more interesting or personal than a dictionary or list of names ( https://www.insee.fr/en/statistiques/6536067 ) in another language... which is where they got the data.
> Iceland has a publicly run institution, Árnastofnun, that manages the Database of Icelandic Morphology (DIM). The database was created, amongst other reasons, to support Icelandic language technology.
https://bin.arnastofnun.is/DMII/aboutDMII/
There is no more personal information being presented than saying John or providing https://en.wikipedia.org/wiki/John_(given_name) or https://www.wolframalpha.com/input?i=John
John may be your given name, but that data isn't personal data. One of the numbers 1969, 1978, 1987, 1996 might be your birth year... but https://oeis.org/A101039 isn't personal information either. Combining John with Smith and 1978 as the year of someone's birth... now you've got personal information that would be covered by the GDPR.
> John may be your given name, but that data isn't personal data. One of the numbers 1969, 1978, 1987, 1996 might be your birth year... but https://oeis.org/A101039 isn't personal information either. Combining John with Smith and 1978 as the year of someone's birth... now you've got personal information that would be covered by the GDPR.
The bare facts "John" or "Smith" or "1978" aren't PII, but any single one of them attached to some other data is, because it then provides partial identification of that other data. So, for instance, attributing a forum post to "John" makes it PII, even if there are thousands of other Johns using the system.
Actually, even that's not necessarily true. The mere fact that you are acknowledging a user exists with that name may make it PII. It's not a big deal to say our usernames include "John", "Mark", etc if there are literally thousands of them, but it's a big deal if one of the usernames is an incredibly rare name or spelling. In this case, the list presented in the article isn't PII, because the list is just a list of names downloaded from a government site that represent possible acceptable names. Just having that list provides no information about whether anyone with any of those names is using your service.