Here is the awards page: https://cspaper.org/topic/116/record-breaking-acl-2025-crown...
I have a suspicion, given how quiet all the major players got in the two weeks after DeepSeek R1 was released, that they were reading and implementing everything in the accompanying papers as fast as humanly possible.
I applaud their open efforts. But being "altruistic" and being the best are two different things.
Their innovations in training efficiency were almost guaranteed to have been studied closely by the big AI labs. For example, Dario Amodei argues that the efficiency improvements are the really important contribution of DeepSeek V3 here: https://www.darioamodei.com/post/on-deepseek-and-export-cont...
> DeepSeek's team did this via some genuine and impressive innovations, mostly focused on engineering efficiency. There were particularly innovative improvements in the management of an aspect called the "Key-Value cache", and in enabling a method called "mixture of experts" to be pushed further than it had before.
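To make the KV-cache point concrete, here is a back-of-the-envelope sketch (toy numbers I made up, not DeepSeek's actual figures or their latent-attention technique) of why the cache dominates serving memory at long context, and why shrinking it is such a big efficiency win:

```python
# Toy back-of-the-envelope numbers (made up, not DeepSeek's): why the
# "Key-Value cache" dominates serving memory at long context.
n_layers   = 60       # transformer layers
n_kv_heads = 64       # K/V heads kept per layer (standard multi-head attention)
head_dim   = 128
seq_len    = 32_000   # tokens of context in one request
bytes_per  = 2        # fp16/bf16

# Each token stores one K and one V vector for every layer and every KV head.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per
total_gb = kv_bytes_per_token * seq_len / 1e9
print(f"KV cache for one 32k-token request: ~{total_gb:.0f} GB")  # ~63 GB

# Cutting n_kv_heads / head_dim (grouped-query attention, or compressing K/V
# into a small latent as DeepSeek does) shrinks this roughly proportionally,
# which is where the "KV cache management" wins come from.
```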
And the saltiness of US labs about DeepSeek is well-known. "O3, explain model distillation like I'm five."
No Sam, explain intellectual property rights to the judge in the NYT test case, asshole.
Isn't it notable that the latency improvement didn't come with a performance loss? I'm not super familiar with all the technical aspects, but that seems like it should be one of the main focuses of the paper.
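For intuition, here is a minimal sketch of block-sparse attention in general (not the paper's actual NSA algorithm, and all sizes are made up): each query only scores a small, cheaply selected subset of key blocks, so if the selection keeps the blocks carrying nearly all of the attention mass, compute drops sharply while the output barely changes.

```python
import numpy as np

# Minimal sketch of block-sparse attention for one query (illustrative only;
# the paper's NSA adds hardware-aligned block selection, compression, and
# end-to-end training of the sparsity pattern). All sizes are made up.
rng = np.random.default_rng(0)
d, block, n_blocks, keep = 64, 16, 64, 4
n = block * n_blocks                      # 1024 keys/values in total

q      = rng.standard_normal(d)
keys   = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))

# 1) Cheap coarse scoring: one score per block (here: against the block-mean key).
block_keys = keys.reshape(n_blocks, block, d).mean(axis=1)
top_blocks = np.argsort(block_keys @ q)[-keep:]

# 2) Exact softmax attention only over the selected blocks.
idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top_blocks])
s   = keys[idx] @ q / np.sqrt(d)
p   = np.exp(s - s.max())
p  /= p.sum()
out = p @ values[idx]

print(f"scored {len(idx)} of {n} keys")   # 64 of 1024 -> ~16x fewer score/value ops
```

If most of the softmax mass really does sit in a few blocks, the selected subset approximates full attention closely, which is the rough intuition behind "faster without a quality hit."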
The awards page for ACL seems to disagree with this editorialized title: https://2025.aclweb.org/program/awards/
> Industry Track Awards
> Best Paper
> Speed Without Sacrifice: Fine-Tuning Language Models with Medusa and Knowledge Distillation in Travel Applications
> Daniel Zagyva, Emmanouil Stergiadis, Laurens van der Maas, Aleksandra Dokic, Eran Fainman, Ilya Gusev, Moran Beladev
Per TFA, the paper we’re looking for is this one:
> Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
> Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
I’m not finding it by author on the page you linked, but I think it’s this reference, going by the title:
> DeepSeek × PKU × UW — Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
I did find it on this page: