For example, Sketchy Provider tells you they're running the latest and greatest, but is actually, knowingly, running some cheaper (and worse) model and pocketing the difference. These tests wouldn't help, since Sketchy Provider could detect when they're being tested and do the right thing only then (like the Volkswagen emissions scandal). Right?
If someone actually goes out of their way to bypass the check, that's a pretty different situation legally compared to just quietly shipping a cheaper quant anyway.
This is probably Kimi trying to protect their brand from bargain-basement providers that don't properly represent what the models are capable of.
I'm curious what exactly they mean by this...
"because we learned the hard way that open-sourcing a model is only half the battle."
For a truly malicious actor, you're right. But it shifts it from "well we aren't obviously committing fraud by quantizing this model and not telling people" to "we're deliberately committing fraud by verifying our deployment with one model and then serving customer requests with another".
I suspect there are a lot of semi-malicious actors who are only too happy to do the former.
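To make the idea of "these tests" concrete, here's a toy sketch of one common approach: fingerprint a deployment by sending fixed probe prompts and comparing the logprobs the served model returns against reference values from a known-good run of the full-precision model. Everything here (the numbers, the tolerance) is made up for illustration; a real check would also randomize probes so they can't be special-cased.

```python
import math

def matches_reference(served_logprobs, reference_logprobs, tol=0.05):
    """Return True if the served model's logprobs are within tolerance
    of the reference model's on every probe token. The tolerance value
    here is arbitrary; a real test would calibrate it empirically."""
    if len(served_logprobs) != len(reference_logprobs):
        return False
    return all(
        math.isclose(s, r, abs_tol=tol)
        for s, r in zip(served_logprobs, reference_logprobs)
    )

# Reference run (full precision) vs. a hypothetical quantized deployment
# that drifts noticeably on the second probe token:
reference = [-0.12, -1.87, -0.43]
served    = [-0.11, -2.31, -0.44]
print(matches_reference(served, reference))  # False: drift on token 2
```

The catch, as noted above, is that a provider who controls the serving stack can recognize the probe prompts and route just those to the real model, which is exactly the Volkswagen-style evasion being described.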
Kimi K2.6, however, is the new open-source leader, so far. Agentic evaluations are still in progress, but one-shot coding and reasoning benchmarks are ready at https://gertlabs.com/?mode=oneshot_coding
Edit: Kimi K2 uses int4 during its training as well as inference [2]. I wonder whether quality suffers when different GGUF creators don't convert these correctly?
[1] https://openrouter.ai/docs/guides/routing/model-variants/exa...
[2] https://www.reddit.com/r/LocalLLaMA/comments/1pzfuqg/why_kim...
Going to test it out, thanks!