Rio de Janeiro's city government model Rio3.5 beats Qwen3.7 in recent benchmarks
112 points | 3 hours ago | 8 comments | twitter.com | HN
VoidWhisperer
1 hour ago
[-]
https://github.com/nex-agi/Nex-N2/issues/4

It seems they didn't train a new model: they merged two existing models and then instructed the result to say it was 'Rio, trained by Rio AI Labs'.

reply
w4yai
1 hour ago
[-]
> The model is built via a merge of https://huggingface.co/nex-agi/Nex-N2-Pro and https://huggingface.co/Qwen/Qwen3.5-397B-A17B, followed by On-Policy Distillation from a stronger model. We detected an incorrect upload in the previous version, where the base merged version was uploaded instead of the final distilled model. We are sorry for the confusion and apologize profusely.

https://huggingface.co/prefeitura-rio/Rio-3.5-Open-397B/comm...
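
For anyone unsure what a "merge" means here: the quote doesn't say which method was used, so the following is only a minimal sketch of the simplest variant, linear interpolation of matching weight tensors. The `linear_merge` name, the `alpha` parameter, and the toy weights are hypothetical, and real merges operate on full checkpoints rather than small lists.

```python
def linear_merge(state_a, state_b, alpha=0.5):
    """Blend two weight dictionaries: alpha * a + (1 - alpha) * b.

    Assumption: both models share the same architecture, so every
    parameter name and shape matches. Weights are plain lists of floats
    here purely for illustration.
    """
    if state_a.keys() != state_b.keys():
        raise ValueError("models must have identical parameter names")
    merged = {}
    for name in state_a:
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged


# Toy example with a single two-element "layer".
merged = linear_merge({"w": [1.0, 3.0]}, {"w": [3.0, 5.0]})
print(merged)
```

No training happens in a step like this, which is why a merge alone would not normally be described as training a new model.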

reply
daquisu
18 minutes ago
[-]
It was a recent edit, though. Yesterday's snapshot: https://web.archive.org/web/20260613072958/https://huggingfa...
reply
mettamage
3 hours ago
[-]
https://xcancel.com/ZenMagnets/status/2065796012820848699

Correct me if I'm wrong, but reading through the comments in the thread, this seems to be post-training/fine-tuning.

reply
oceansky
2 hours ago
[-]
Yes. It's post-training on Qwen using the novel SwiReasoning framework.
reply
hedgehog
2 hours ago
[-]
I hadn't seen SwiReasoning (https://swireasoning.github.io, paper and code). It looks like it works at generation time without any requirements on the model. It increases token efficiency and accuracy, but at first skim it seems like it would be incompatible with multi-token prediction. For large reductions in token budget it could be worth it.
reply
rafaquintanilha
1 hour ago
[-]
Doesn't look like it's incompatible. Someone already released a quantization using MTP: https://huggingface.co/foxipanda/Rio-3.5-Open-397B-GGUF
reply
hedgehog
55 minutes ago
[-]
As I understand it, the basic premise of all the speculative decoding schemes is that the logits on the draft don't need to be exact so long as you mostly sample the same tokens, and because each position is fed by the embedding associated with the previous position's token, you sort of "round away" the error. With SwiReasoning I think you skip the sampling/rounding part and do something continuous using the whole distribution, so it would seem to rely on the accuracy of those values. MTP still makes sense outside the latent reasoning chunks, though.
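
A rough sketch of that "rounding" argument, assuming the continuous step feeds a probability-weighted mix of token embeddings back into the model (that is my reading, not a confirmed detail of SwiReasoning). The three-token vocabulary, the embeddings, and both logit vectors are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]


def hard_input(logits):
    """Discrete path: pick the argmax token and feed its embedding."""
    token = max(range(len(logits)), key=lambda i: logits[i])
    return EMBEDDINGS[token]


def soft_input(logits):
    """Continuous path: feed the probability-weighted mix of all embeddings."""
    probs = softmax(logits)
    dim = len(EMBEDDINGS[0])
    return [sum(p * emb[d] for p, emb in zip(probs, EMBEDDINGS)) for d in range(dim)]


def max_gap(u, v):
    """Largest per-coordinate difference between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


target_logits = [2.0, 1.0, 0.0]
draft_logits = [2.5, 0.5, 0.2]  # inexact, but the argmax is the same

hard_gap = max_gap(hard_input(target_logits), hard_input(draft_logits))
soft_gap = max_gap(soft_input(target_logits), soft_input(draft_logits))
print(f"hard gap: {hard_gap}, soft gap: {soft_gap:.3f}")
```

On the discrete path the draft's logit error disappears once both sides pick the same token, while on the continuous path the same error changes the next input directly.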
reply
Kelteseth
3 hours ago
[-]
Thanks. Firefox and uBlock do not let me watch any X content (I guess this is a good thing).
reply
drnick1
2 hours ago
[-]
Same thing here, X content and trackers are blocked by my Firefox settings. The occasional inconvenience is a small price to pay not to be profiled by X, Google, FB, Amazon, and countless other Internet parasites.
reply
adrian_b
3 hours ago
[-]
> Post-trained from Qwen 3.5 397B

Model Card:

https://huggingface.co/prefeitura-rio/Rio-3.5-Open-397B

reply
Aurornis
2 hours ago
[-]
A city government funding a fine-tune of a model is interesting.

As for the benchmarks: if you spend any time playing with fine-tunes of published models, you know that benchmarks are gamed so much that they're a useless indicator of performance for models from small teams. It's too easy to fine-tune a model to perform well on the benchmarks, release it, put a line on your resume saying you released a model that beat the major labs on benchmarks, and then try to use that to jump into a new job. The temptation is high.

There are a lot of fringe models and fine-tunes that claim to have better performance on some benchmark. Then you try to use them and find they're often worse at general tasks than the base model.

I would wait and see if these results hold across other benchmarks. It's cool that the city is doing something with AI, but this is something where extraordinary claims require extraordinary evidence. I doubt a small, previously unknown team has unlocked something secret that the team who made Qwen couldn't figure out. It's more likely it was fine-tuned for a specific outcome (possibly these benchmarks) and performance in other areas was reduced as a consequence.

reply
marcosdumay
1 hour ago
[-]
> A city government funding a fine-tune of a model is interesting.

Looks like it's a government-owned IT services company.

Most likely, they saw a business opportunity in selling it to other cities.

reply
embedding-shape
1 hour ago
[-]
Indeed, this is all very true, and I'd say it's true for the larger teams too. The entire ecosystem is so gamed by now that if you don't have your own private benchmarks, with test cases you haven't shared publicly, it's almost impossible to get a fair picture of how well a model works unless you actually sit down and use it.
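
A private benchmark doesn't need to be elaborate. Here is a minimal sketch, assuming exact-match scoring and a model exposed as a plain function from prompt to answer; `PRIVATE_CASES`, `score`, and `toy_model` are hypothetical names, and the two cases are placeholders for ones you keep unpublished.

```python
from typing import Callable

# Hypothetical private test cases; in practice these stay unpublished.
PRIVATE_CASES = [
    ("What is 17 + 25?", "42"),
    ("Spell 'model' backwards.", "ledom"),
]


def score(model: Callable[[str], str], cases=PRIVATE_CASES) -> float:
    """Return the fraction of cases where the model's answer matches exactly."""
    hits = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return hits / len(cases)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; it only knows one answer."""
    return "42" if "17 + 25" in prompt else "unknown"


print(score(toy_model))
```

Swapping `toy_model` for a call to each candidate model gives a comparison that no one could have tuned against.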
reply
arjie
2 hours ago
[-]
Benchmaxxing is the new “have a crypto trading strategy”. No one is impressed by it except non-practitioners.
reply
hmokiguess
2 hours ago
[-]
Never let them know your next move
reply
ramon156
3 hours ago
[-]
Every day I'm reminded why I don't spend time on Twitter. What use is it to claim "X is better than Y in benchmark Z, disagreeing with that means disagreeing with me"?

Information is power, dick measurements are not.

reply
itsthecourier
2 hours ago
[-]
my length is a valid data point for the sake of science
reply
reed1234
2 hours ago
[-]
No, I love Twitter, and you are wrong.
reply