There is no public website where you can use it, free or paid; the dataset is not public; the code is not public (the GitHub URL in the article returns 404); and the claimed model intelligence is so low that it is pretty much useless at 32K context and massively inferior to GPT‑4o.
As per tradition in Portugal, some people managed to get 5.5 million to produce nothing, and no one is asking questions.
You want a better idea? Just fine-tune the open-source Kimi 2.6 with an open-source Portuguese dataset; the cost would be under a million and we would get something useful.
It would be really nice to know what happened to the 5.5 million, given they are not even able to provide a functional website to use the model.
Similar waste.
> Just fine tune the open source Kimi 2.6 with an open source Portuguese dataset
I think the tokenizers of all popular models are heavily biased towards English, or English and Mandarin.
And I don't think it is possible to replace the tokenizer without full retraining.
```
Llama3
english, 0.216
portuguese, 0.285
italian, 0.287
greek, 0.592
```
```
Gemma4
english, 0.219
portuguese, 0.246
italian, 0.249
greek, 0.537
```
```
Kimi2.6
english, 0.214
portuguese, 0.310
italian, 0.308
greek, 0.716
```
Portuguese is certainly worse than English, but it is on par with Italian (which I think has more overlap with English) and much better than Greek (which doesn't use the Latin script and is definitely not prioritized in tokenizer construction).
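(For anyone who wants to poke at this themselves: numbers like these can be approximated with Hugging Face tokenizers. The metric here is assumed to be tokens per character, lower meaning the tokenizer encodes the language more compactly; the model ID and sample sentences below are placeholders, not the exact setup behind the numbers above.)
```
# Rough sketch: compare how compactly different tokenizers encode each language.
# Assumed metric: tokens per character (lower = more compact encoding).
# Model ID and sample texts are illustrative placeholders.
from transformers import AutoTokenizer

samples = {
    "english": "The weather in Lisbon is usually mild throughout the year.",
    "portuguese": "O tempo em Lisboa costuma ser ameno durante todo o ano.",
    "italian": "Il tempo a Lisbona è di solito mite durante tutto l'anno.",
    "greek": "Ο καιρός στη Λισαβόνα είναι συνήθως ήπιος όλο τον χρόνο.",
}

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # swap in any model

for lang, text in samples.items():
    n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{lang}, {n_tokens / len(text):.3f}")  # tokens per character
```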
On your second point, tokenizer transfer allows for extending/modifying a tokenizer without retraining the model from scratch. The simplest version of this is tokenizer extension + continual pretraining, where you just add a bunch more tokens to the vocab for the language/domain that you want to improve and train a little more. It's been done for Japanese [2] and Indic languages, but afaik not Portuguese.
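The mechanical part of tokenizer extension is simple with transformers; the hard part is the continual pretraining that makes the new embedding rows useful. A minimal sketch (the model ID and added tokens are made up for illustration):
```
# Minimal sketch of tokenizer extension before continual pretraining.
# Model ID and the added tokens are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Add frequent European Portuguese words that the base vocab splits into many pieces.
new_tokens = ["autocarro", "pequeno-almoço", "frigorífico", "portagem"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token IDs get (randomly initialised) rows;
# those rows only become useful after continual pretraining on Portuguese text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```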
So I think that continual pretraining for a large base model would have probably been fine for this case with huge cost savings. But it is good to have the ability to train your own base models, so I don't think this is such a bad idea.
-----------------------
[1]: https://huggingface.co/datasets/goldfish-models/fish-food
If the budget is indeed so modest (5.5 million euros!), I would focus completely on preparing datasets and making sure all open cultural artifacts that we can find are well documented in them. That way every model, private or open, that gets trained in the future could better represent the culture and language of your country.
Sovereign SOTA models might also be possible with nation-state involvement. But this is a good stopgap.
All in all, I don't think that's a major issue here.
I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).
Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their convents to retreat to if they so choose.
That's easy to say when you're not on the other end of US defaultism.
Given enough time, all languages will change, some of them because of major political changes or conquests.
[1]: https://en.wikipedia.org/wiki/Paleohispanic_languages
[2]: https://en.wikipedia.org/wiki/Influence_of_French_on_English
[3]: https://en.wikipedia.org/wiki/Indigenous_languages_of_the_Am...
I mean, I’m a Brit who lived a long time in the US, so that’s a dynamic with which I am rather familiar.
They have concerns about Portuguese children learning to "speak Brazilian", because there is a lot more video content being produced in Brazil than in Portugal, and things like movies, video games, and software in general are available in Brazilian localization/adaptation first.
I'm not saying it's the same, but there are definitely similarities in that parents are worrying about what language their children use. And yeah, unrelated; I wasn't trying to claim it's the same or better/worse or anything, just another similar situation that other (curious) people might want to learn more about, regardless of what you think Catalan wants or not.
And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming), why not go for English or a mixture of languages, which is essentially what they did by starting with EuroLLM?
I speak Spanish, and have talked with people who only speak Portuguese, either of the variants, and also talked with Portuguese people before how they see their language, comparing it with Brazilian Portuguese, and vice-versa. So basically based on vibes and experience.
> And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English
I'm not sure how many languages you speak or have encountered in the wild before, but some languages are VERY different from each other, some are a bit different, and others are basically the same with some differences. Doing what I describe for languages that are similar is easier than for languages that are very different, for what I hope are obvious reasons.
I'm a dual citizen of Portugal and Brazil and I live in the US now, so that's my linguistic background. (Also studied bits of French, Russian, Latin and Greek.)
> Doing what I describe for languages that are similar is easier than languages that are very different, for what I hope are obvious reasons.
Not only are your reasons not obvious, your conclusion is actually wrong.
If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals), it might actually make more sense to train it in any other language BUT Brazilian Portuguese (say, English), then fine-tune it for European Portuguese.
LLMs have been shown to be very good at generalizing across languages (the transformer architecture literally comes from work on machine translation, IIRC).
Oh, I wasn't aware that was their goal. It would certainly be intuitive to avoid Brazilian Portuguese if that's the case, although I'm still not sure it actually makes sense to avoid it 100% in pre-training even if you're trying to avoid Brazilian bias: you can "skew" things pretty heavily in post-training if you so wish.
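(To make the "skew it in post-training" idea concrete, one common recipe is supervised fine-tuning on an exclusively European Portuguese instruction set. A hedged sketch using TRL; the dataset name is hypothetical, the base model is a placeholder, and exact argument names vary between TRL versions:)
```
# Hedged sketch: skewing a base model toward European Portuguese in post-training
# via supervised fine-tuning on a pt-PT-only instruction dataset.
# Dataset name and base model are placeholders; TRL argument names vary by version.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("someorg/instrucoes-pt-pt", split="train")  # hypothetical pt-PT corpus

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # any strong multilingual base model
    train_dataset=dataset,
    args=SFTConfig(output_dir="llm-pt-pt", num_train_epochs=1),
)
trainer.train()
```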
Where can I read more about this goal? It doesn't seem to be mentioned in the submitted article, just a short off-hand remark about one of the benchmarks, so I'm guessing there is some resource where they talk about this more specifically?
https://en.wikipedia.org/wiki/List_of_languages_by_number_of...
Yes, there are much smaller European countries, but those are generally the only source of truth for their specific language, so the context of an LLM query in that language steers the LLM towards facts from that country. For example, if I ask a big generic LLM something in Latvian, it will most likely answer something relevant to the context of Latvia. But Portugal, being the much smaller user of its language, has the somewhat unique problem that if I ask a generic model something in Portuguese it will probably answer something related to Brazil instead of Portugal.
Maybe the UK and Spain have somewhat similar struggles, but I suspect that neither has it as bad as Portugal in that regard.
Don't get me wrong, it is definitely impressive given Portugal's actual size, but I believe there's a hard limit tied to population and size that will be difficult to cross.
[1]: https://en.wikipedia.org/wiki/List_of_countries_by_number_of...
that's not impressive
The grammar and vocabularies don't match, but I think the worst part is the expressions. Both sides have *a lot* of expressions that vary by context and location.
Trying to force an LLM into a specific language makes you miss out on most of the world's knowledge.
Besides, there is knowledge that is locked behind languages: there are things known in Portuguese that aren't known in other languages, and the same goes for other languages too. More accessibility to those ideas wouldn't hurt.
It's just a database. If you push text in one language into it, it'll likely crap out stuff in that same language, unless the system prompt that also goes in with your query causes it not to.
I went to JCON EUROPE this year. The amount of "Europe this", "Europe that", "sovereign this, sovereign that" is mind-boggling and just a waste of time and money. The regular people know this and thus left the conference midway. But somehow the people "in charge" really need to push this. Same thing here.
What's wrong with exploring ways to keep national languages alive in the LLM era?
Already happening via low birth rates and mass migration. Without kids, there will be nobody to carry the culture forward.
> and go full on English?
Nobody is saying you have to swap your culture for English. You can have English as the mandatory language for tech and business across the EU, while still keeping your language and culture for your education, leisure, festivities, art, media, etc. This way everyone is happy. But countries like France would rather detonate their entire nuclear arsenal than accept official use of English on their own soil.
As long as resources are spent across the EU to account for every language and bureaucracy, we'll keep falling behind internationally, and the only winners will be the bureaucrats, notaries, lawyers, consultants, translators, etc. We need another Concorde moment.
And who knows what will happen to grammar?
https://simianwords.bearblog.dev/why-domain-specific-llms-wo...