AMÁLIA and the future of European Portuguese LLMs
106 points | 3 days ago | 10 comments | duarteocarmo.com
mariopt
3 hours ago
This model is a waste of Public Funds.

There is no public website to use it, free or paid; the dataset is not public; the code is not public (the GitHub URL in the article returns a 404); and the claimed model intelligence is so low that it is pretty much useless at its 32K context, and massively inferior to GPT-4o.

As per tradition in Portugal, some people managed to get 5.5 million to produce nothing, and no one is asking questions.

You want a better idea? Just fine-tune the open-source Kimi 2.6 with an open-source Portuguese dataset; the cost would be under a million, and we would get something useful.

It would be really nice to know what happened to 5.5 million whilst not even being able to provide a functional website to use the model.

reply
dr_dshiv
2 hours ago
It’s a way to suck all the money out of the room in the name of nationalism, and it’s happening all over Europe. It's the only idea everyone has had.
reply
upupupandaway
2 hours ago
As a pt-BR speaker from across the pond: https://soberania.ai/

Similar waste.

reply
vova_hn2
2 hours ago
I'm not arguing with the rest of your points, but...

> Just fine-tune the open-source Kimi 2.6 with an open-source Portuguese dataset

I think the tokenizers of all popular models are heavily biased towards English, or towards English and Mandarin.

And I don't think it is possible to replace the tokenizer without a full retraining.

reply
mcyc
1 hour ago
You are right that most tokenizers are heavily biased towards English, but the situation is not so bad for Portuguese. Here are some results on the Goldfish corpus [1] with a few different tokenizers. This measures #subwords in the tokenized corpus / #characters in the corpus, so lower is better.

```
Llama3
english,    0.216
portuguese, 0.285
italian,    0.287
greek,      0.592
```

```
Gemma4
english,    0.219
portuguese, 0.246
italian,    0.249
greek,      0.537
```

```
Kimi2.6
english,    0.214
portuguese, 0.310
italian,    0.308
greek,      0.716
```

Portuguese is certainly worse than English, but it is on par with Italian (which I think has more overlap with English) and much better than Greek (which doesn't use the Latin script and is definitely not prioritized in tokenizer construction).

On your second point, tokenizer transfer allows for extending/modifying a tokenizer without retraining the model from scratch. The simplest version of this is tokenizer extension + continual pretraining, where you just add a bunch more tokens to the vocab for the language/domain that you want to improve and train a little more. It's been done for Japanese [2] and Indic languages, but afaik not Portuguese.

So I think that continual pretraining for a large base model would have probably been fine for this case with huge cost savings. But it is good to have the ability to train your own base models, so I don't think this is such a bad idea.
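For the curious, the metric above can be sketched in a few lines. This is an illustrative toy, not the Goldfish evaluation code: `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer (in practice you would load one, e.g. via `transformers.AutoTokenizer`, and run it over each language's slice of the corpus), and the sample string is made up.

```python
# Sketch of the subwords-per-character metric: tokenize a corpus, then
# divide the token count by the character count. Lower means the
# tokenizer packs more characters into each token for that language.

def toy_tokenize(text: str) -> list[str]:
    # Hypothetical stand-in for a real subword tokenizer: split on
    # whitespace, then chop each word into at-most-4-character pieces,
    # loosely mimicking subword segmentation.
    pieces: list[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces

def subwords_per_char(corpus: str, tokenize=toy_tokenize) -> float:
    # The ratio reported in the tables above (lower is better).
    return len(tokenize(corpus)) / len(corpus)

ratio = subwords_per_char("um pequeno corpus de exemplo")
```

With a real tokenizer swapped in, differences in this ratio across languages are exactly the bias the parent comment is describing.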

-----------------------

[1]: https://huggingface.co/datasets/goldfish-models/fish-food

[2]: https://arxiv.org/abs/2404.17790

reply
pu_pe
6 hours ago
I'm not sure the direction should be to fine-tune a small local model for each country or language. These models are already not particularly great at information retrieval, so I doubt anyone would use them for questions like the author suggests (i.e. who was the president between years X and Y). Similarly, they are a little too lightweight to be used for translation, too.

If the budget is indeed so modest (5.5 million euros!), I would focus entirely on preparing datasets and making sure all the open cultural artifacts we can find are well documented in them. That way, every model, private or open, that gets trained in the future could better represent the culture and language of your country.

reply
iugtmkbdfil834
4 hours ago
I agree; the research is complex enough as is without having to worry about splitting it, Babel-like, into multiple languages.
reply
TheMagicHorsey
3 hours ago
This is the way.

Sovereign SOTA models might also be possible with nation-state involvement. But this is a good stopgap.

reply
dyauspitr
4 hours ago
Yeah, I think India is going a better route with Sarvam, which is trained from scratch and still relatively cheap.
reply
alexaholic
1 hour ago
The Amália model is not yet publicly available. Until it's ready, one can fool around with Anália at https://analia.pt
reply
swiftcoder
6 hours ago
It is definitely an interesting problem, because Portugal is a small enough country that the actual total corpus of available texts in (non-Brazilian) Portuguese is potentially problematic.
reply
embedding-shape
6 hours ago
I don't think so. Portugal the country might be small, with a small population, but there are ~250 million "Lusophones" (native Portuguese speakers), making it the fifth-most spoken native language in the world; I'd hardly call that small :) And before everyone screams: yes, European Portuguese is different from Brazilian Portuguese, but they're still both Portuguese and mutually intelligible, so it's not like text from one cannot be used to train a model for the other, or vice versa.

All in all, I don't think that's a major issue here.

reply
swiftcoder
6 hours ago
The authors are pretty clearly trying to draw only from European Portuguese sources - I feel like there's a fairly widespread attitude here that the language is being overwhelmed by the sheer number of Brazilian speakers (which there is obviously at least some truth to).

I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English)

reply
madaxe_again
6 hours ago
Man, there’s an attitude up here in Trás-os-Montes that the rest of Portugal has spoken unrecognisable trash for a century. It took me years to realise I’d learned hilariously antique Portuguese by moving there.

Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their convents to retreat to if they so choose.

reply
philipwhiuk
5 hours ago
> I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).

That's easy to say when you're not on the other end of US defaultism.

reply
augusto-moura
4 hours ago
To be fair, it is only natural: Portuguese itself only came to be because the Roman Empire conquered the Lusitanian lands [1], a lot of English comes from Norman French via the Norman conquest [2], the Americas didn't speak European languages until 500 years ago or so [3], etc.

Given enough time, all languages will change, some of them because of major political changes or conquests.

[1]: https://en.wikipedia.org/wiki/Paleohispanic_languages

[2]: https://en.wikipedia.org/wiki/Influence_of_French_on_English

[3]: https://en.wikipedia.org/wiki/Indigenous_languages_of_the_Am...

reply
swiftcoder
3 hours ago
> That's easy to say when you're not on the other end of US defaultism.

I mean, I’m a Brit who lived a long time in the US, so that’s a dynamic with which I am rather familiar

reply
mghackerlady
5 hours ago
Right, but most of those speak Brazilian Portuguese. There's so much less European Portuguese text that it becomes impossible for a model not to speak Brazilian Portuguese unless it is trained in a way that ignores Brazilian sources.
reply
evandrofisico
2 hours ago
Portugal has a growing xenophobic attitude towards immigrants, especially Brazilians, and this is reflected in linguistic prejudice.

They have concerns about Portuguese children learning to "speak Brazilian", because a lot more video content is produced in Brazil than in Portugal, and things like movies, video games, and software in general are available in Brazilian localization/adaptation first.

reply
embedding-shape
2 hours ago
We have the same thing happening here too, on multiple levels. First, some Spanish parents are afraid their children aren't listening to and watching enough Spanish media. Then, additionally, some Catalan parents are afraid their children don't get to use Catalan in school, so they won't become proficient enough to use it in society.
reply
darkwater
2 hours ago
The Catalan situation is completely different and unrelated: it is a completely different language, and it is not endangered (with or without scare quotes, as you prefer) by an ex-colony that became independent. Actually, many Catalans would like to be such an ex-colony.
reply
embedding-shape
1 hour ago
> The Catalan situation is completely different and unrelated

I'm not saying it's the same, but there are definitely similarities in that parents are worrying about what language their children use. And yeah, unrelated; I wasn't trying to claim it's the same or better/worse or anything, just another similar situation other (curious) people might want to learn more about, regardless of what you think Catalans want or not.

reply
KK7NIL
6 hours ago
The whole point of this project is to have an LLM that speaks European Portuguese, not Brazilian Portuguese.
reply
embedding-shape
6 hours ago
Right, and my point is that if you use 80% Brazilian Portuguese during base-model training + 20% European Portuguese in post-training, you get pretty much exactly that, except with a ton more available training data.
reply
KK7NIL
6 hours ago
What's your evidence for that?

And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English or a mixture of languages, which is essentially what they did by starting with EuroLLM?

reply
embedding-shape
5 hours ago
Evidence? Not so much; I didn't realize I was defending a PhD thesis here.

I speak Spanish, and I have talked with people who only speak Portuguese, in either variant, and have also talked with Portuguese people about how they see their language compared with Brazilian Portuguese, and vice versa. So, basically, vibes and experience.

> And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English

I'm not sure how many languages you speak or encountered in the wild before, but some languages are VERY different from each other, some are a bit different and others are basically the same with some differences. Doing what I describe is easier for languages that are similar than for languages that are very different, for what I hope are obvious reasons.

reply
KK7NIL
5 hours ago
> I'm not sure how many languages you speak or encountered in the wild before, but some languages are VERY different from each other, some are a bit different and others are basically the same with some differences.

I'm a dual citizen of Portugal and Brazil and I live in the US now, so that's my linguistic background. (Also studied bits of French, Russian, Latin and Greek.)

> Doing what I describe for languages that are similar is easier than languages that are very different, for what I hope are obvious reasons.

Not only are your reasons not obvious, your conclusion is actually wrong.

If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals), it might actually make more sense to train it in any other language BUT Brazilian Portuguese (say, English), then fine-tune it for European Portuguese.

LLMs have been shown to be very good at generalizing across languages (the transformer architecture literally comes from work on machine translation, IIRC).

reply
embedding-shape
2 hours ago
> If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals)

Oh, I wasn't aware that was their goal; it would certainly be intuitive to avoid Brazilian Portuguese in that case. Although I'm still not sure it makes sense to avoid it 100% in pre-training even if you're trying to avoid Brazilian bias; you can "skew" things pretty heavily in post-training if you wish.

Where can I read more about this goal? It doesn't seem to be mentioned in the submitted article, apart from a short aside about one of the benchmarks, so I'm guessing there is some resource where they talk about it more specifically?

reply
madaxe_again
6 hours ago
Mutually intelligible, yes, but far from perfectly so. I speak both, as a native anglophone, and the difference is not so much “US vs British English” so much as “Guyanese English vs British English”. Like, fundamental points of grammar differ, the spoken rhythm and syllabic stress differs (poetry does not translate well between them), never mind just vocabulary. Continental Portuguese people tend to find it easier to understand brasileiros than vice versa, largely due to mostly one-way cultural exports, but to try to roll both into a single model would create a creole at best.
reply
embedding-shape
6 hours ago
I agree, they're not the same. But they're far closer than languages that don't come from the same family.
reply
fy20
5 hours ago
European Portuguese is the 13th most spoken language in Europe. Not that small; there are many other European languages in use that are much smaller.

https://en.wikipedia.org/wiki/List_of_languages_by_number_of...

reply
SkeuomorphicBee
2 hours ago
What makes Portugal's situation unique is that its small population is eclipsed in models by the much larger population of Brazil.

Yes, there are much smaller European countries, but those are generally the only source of truth for their specific language, so the context of an LLM query in that language steers the LLM towards facts from that country. For example, if I ask a big generic LLM something in Latvian, it will most likely answer something relevant to the context of Latvia. But Portugal, being the much smaller user of its language, has the somewhat unique problem that if I ask a generic model something in Portuguese, it will probably answer something related to Brazil instead of Portugal.

Maybe the UK and Spain have somewhat similar struggles, but I suspect that none has it as bad as Portugal in that regard.

reply
augusto-moura
4 hours ago
It is pretty small when considering content output. It is only 11 million people, and only a fraction of them will be writing something that could be used in training datasets. If you look at countries by scientific contributions, for example [1], Portugal is in 28th position, while Brazil is 14th with more than double the number of contributions.

Don't get me wrong, it is definitely impressive given Portugal's actual size, but I believe there's a hard limit of population and size that will be difficult to cross.

[1]: https://en.wikipedia.org/wiki/List_of_countries_by_number_of...

reply
depaulagu
5 hours ago
> European Portuguese is the 13th most spoken language in Europe

that's not impressive

reply
senko
4 hours ago
Hello from 23rd
reply
drivebyhooting
1 hour ago
I’ve noticed that ChatGPT is noticeably dumber in languages other than English. It will even confidently repeat common but wrong superstitions from the target language as if they were fact.
reply
r2ob
2 hours ago
"This model is a waste of Public Funds". There are no "public funds"; this is a waste of money from the taxpayers.
reply
mt_
3 hours ago
5 million for a Llama-2 fine-tune, how is that impressive?
reply
algoth1
6 hours ago
Wouldn't it be easier to fine-tune a model to convert the Brazilian Portuguese corpus into European Portuguese and then use that corpus?
reply
kinow
43 minutes ago
That idea is different from what most other comments here are discussing.

The grammar and vocabulary don't match, but I think the worst part is the expressions. Both sides have *a lot* of expressions that vary by context and location.

reply
hartator
7 hours ago
What a waste of time and money.

Trying to force a LLM into a specific language makes you missed out on most of the world knowledge.

reply
embedding-shape
6 hours ago
What LLM isn't forced into a specific language? That'd be a weird language model no one could understand; you need to choose at least one language, ideally the same one the creators speak.

Besides, there is knowledge that is locked behind languages, there are things known in Portuguese that aren't known in other languages, and the same for other languages too. More accessibility to those ideas wouldn't hurt.

reply
Miraste
6 hours ago
To my knowledge, all major LLMs are multilingual. This article could really have used an evaluation of existing models' European Portuguese capabilities.
reply
numpad0
5 hours ago
Yeah, they all seem confined to being an American-consultant/Chinese-authoritarian split personality with broad second-language capabilities. I suppose they become too incoherent otherwise.
reply
cess11
6 hours ago
E.g. gemma3:4b can fake simple conversations in several European languages, including Portuguese, Swedish, and Finnish.

It's just a database. If you push text in one language into it, it'll likely crap out stuff in that same language, unless the system prompt that also goes in with your query causes it not to.

reply
CrimsonRain
3 hours ago
Europe has always had a thing for its languages. They think many languages make them stronger while spending billions in losses due to communication barriers. It is obvious they will try to do the same with LLMs and call it the best thing since sliced bread.

I went to JCON EUROPE this year. The amount of "Europe this", "Europe that", "sovereign this, sovereign that" is mind-boggling and just a waste of time and money. The regular people know this and thus left the conference midway. But somehow the people "in charge" really need to push this. Same thing here.

reply
lmf4lol
3 hours ago
What's your suggestion? That we just eradicate all of our cultures and languages and go full-on English?

What's wrong with exploring ways to keep national languages alive in the LLM era?

reply
joe_mamba
22 minutes ago
> we just eradicate all of our culture

Already happening via low birth rates and mass migration. Without kids, there will be nobody to carry the culture forwards.

>and go full on english ?

Nobody is saying you have to swap your culture for English. You can have English as the mandatory language for tech and business across the EU while still keeping your language and culture for education, leisure, festivities, art, media, etc. That way everyone is happy. But countries like France would rather detonate their entire nuclear arsenal than accept the official use of English on their own soil.

As long as resources are spent across the EU to account for every language and bureaucracy, we'll keep falling behind internationally, and the only winners will be the bureaucrats, notaries, lawyers, consultants, translators, etc. We need another Concorde moment.

reply
KK7NIL
6 hours ago
This is how Europe thinks it can catch up on tech: by having the government fund vanity projects that will be made obsolete by more general techniques in six months.
reply
xp84
1 hour ago
It's the European Way
reply
lmf4lol
3 hours ago
Everyone on this project probably learned a lot doing it, don't you think?
reply
joe_mamba
6 minutes ago
I'd also want to get paid to work on stuff not meant to bring any financial returns, just to learn and pad my resume. Sounds like a sweet gig. Where do I sign up?
reply
mistrial9
7 hours ago
> makes you missed out on most of the world knowledge

and, who knows what will happen to grammar ?

reply
simianwords
5 hours ago
Domain-specific models will never be a thing. You don't get generalised intelligence that way.

https://simianwords.bearblog.dev/why-domain-specific-llms-wo...

reply