We Ran Over Half a Million Evaluations on Quantized LLMs
11 points | 15 hours ago | 2 comments | neuralmagic.com
anotherhue
13 hours ago
> In conclusion, our comprehensive evaluation demonstrates that quantized models maintain impressive accuracy and quality compared to their full-precision counterparts, making them an essential tool for optimizing LLMs in real-world deployments.
eldar_ciki
14 hours ago
The ML community has recently questioned whether quantized LLMs can genuinely compete with their full-precision counterparts. To address this, we conducted over half a million evaluations on quantized Llama-3.1-{8B, 70B, 405B}-Instruct models across FP8, INT8, and INT4 quantization schemes. We evaluated a wide range of benchmarks: open-ended challenges like Arena-Hard, rigorous academic benchmarks such as MMLU, MMLU-Pro, Big Bench Hard, ARC-Challenge, IFEval, and GPQA (plus others from OpenLLM Leaderboard v1 and v2), and coding tests like HumanEval and HumanEval+.
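
For readers unfamiliar with what "weight-only INT4" means in practice, here is a minimal round-trip sketch (symmetric, per-group scales). It is purely illustrative: the function name, the group size of 128, and the use of plain rounding are my assumptions for the example, not Neural Magic's actual quantization recipe.

    import numpy as np

    def quantize_dequantize_int4(w, group_size=128):
        """Illustrative symmetric, per-group weight-only INT4 round trip."""
        out_f, in_f = w.shape
        groups = w.reshape(out_f, in_f // group_size, group_size)
        # One scale per group so the largest-magnitude weight maps to +/-7.
        scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
        scales = np.maximum(scales, 1e-8)                # avoid divide-by-zero
        q = np.clip(np.round(groups / scales), -8, 7)    # signed 4-bit range
        return (q * scales).reshape(out_f, in_f)         # what inference "sees"

    # Round-trip error on a random weight matrix
    w = np.random.randn(4096, 4096).astype(np.float32)
    w_hat = quantize_dequantize_int4(w)
    print("mean abs round-trip error:", np.abs(w - w_hat).mean())

Activations stay in the original precision with this kind of scheme; only the stored weights are compressed, which is why it is called "weight-only" quantization.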

-> Long story short: when models are carefully quantized, everything looks good. The lowest accuracy recovery we found was 96% relative to the unquantized baseline, and it occurred only for the 8B model under weight-only INT4 quantization, mostly because the unquantized baseline itself scores close to random on a couple of Leaderboard v2 benchmarks. As one could imagine, "recovering" random accuracy is a bit noisy.
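
Taking "accuracy recovery" to mean the quantized score divided by the full-precision score (the comment does not define it here, and all numbers below are made up purely to illustrate the noise point), the near-random-baseline effect is easy to see:

    def recovery_pct(quantized_acc, baseline_acc):
        # Recovery = quantized score as a percentage of the full-precision score.
        return 100.0 * quantized_acc / baseline_acc

    # Strong baseline: a ~1-point absolute drop barely moves recovery.
    print(recovery_pct(67.2, 68.0))   # ~98.8%

    # Near-random baseline (e.g. ~25% on 4-way multiple choice): the same
    # ~1-point wiggle swings recovery by several points in either direction.
    print(recovery_pct(25.2, 26.1))   # ~96.6%
    print(recovery_pct(26.8, 26.1))   # ~102.7%
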
