We Ran Over Half a Million Evaluations on Quantized LLMs
11 points | 15 hours ago | 2 comments | neuralmagic.com
anotherhue
13 hours ago
> In conclusion, our comprehensive evaluation demonstrates that quantized models maintain impressive accuracy and quality compared to their full-precision counterparts, making them an essential tool for optimizing LLMs in real-world deployments.
eldar_ciki
14 hours ago
The ML community has recently questioned whether quantized LLMs can genuinely compete with their full-precision counterparts. To address this, we conducted over half a million evaluations on quantized Llama-3.1-{8B, 70B, 405B}-Instruct models across FP8, INT8, and INT4 quantization schemes. We evaluated a wide range of benchmarks: open-ended challenges like Arena-Hard, rigorous academic benchmarks such as MMLU, MMLU-Pro, Big Bench Hard, ARC-Challenge, IFEval, and GPQA (plus others from OpenLLM Leaderboard v1 and v2), and coding tests like HumanEval and HumanEval+.
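
For readers unfamiliar with what "weight-only INT4" means in practice, here is a minimal round-trip sketch (symmetric, per-group scales). It is purely illustrative: the function name, the group size of 128, and the use of plain rounding are my assumptions for the example, not Neural Magic's actual quantization recipe.

    import numpy as np

    def quantize_dequantize_int4(w, group_size=128):
        """Illustrative symmetric, per-group weight-only INT4 round trip."""
        out_f, in_f = w.shape
        groups = w.reshape(out_f, in_f // group_size, group_size)
        # One scale per group so the largest-magnitude weight maps to +/-7.
        scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
        scales = np.maximum(scales, 1e-8)                # avoid divide-by-zero
        q = np.clip(np.round(groups / scales), -8, 7)    # signed 4-bit range
        return (q * scales).reshape(out_f, in_f)         # what inference "sees"

    # Round-trip error on a random weight matrix
    w = np.random.randn(4096, 4096).astype(np.float32)
    w_hat = quantize_dequantize_int4(w)
    print("mean abs round-trip error:", np.abs(w - w_hat).mean())

Activations stay in the original precision with this kind of scheme; only the stored weights are compressed, which is why it is called "weight-only" quantization.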

-> Long story short: when models are carefully quantized, everything looks good. The lowest accuracy recovery we found was 96% relative to the unquantized baseline, and it occurred only for the 8B model under weight-only INT4 quantization, mostly because the unquantized baseline itself scores close to random on a couple of Leaderboard v2 benchmarks. As one could imagine, "recovering" random accuracy is a bit noisy.
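
Taking "accuracy recovery" to mean the quantized score divided by the full-precision score (the comment does not define it here, and all numbers below are made up purely to illustrate the noise point), the near-random-baseline effect is easy to see:

    def recovery_pct(quantized_acc, baseline_acc):
        # Recovery = quantized score as a percentage of the full-precision score.
        return 100.0 * quantized_acc / baseline_acc

    # Strong baseline: a ~1-point absolute drop barely moves recovery.
    print(recovery_pct(67.2, 68.0))   # ~98.8%

    # Near-random baseline (e.g. ~25% on 4-way multiple choice): the same
    # ~1-point wiggle swings recovery by several points in either direction.
    print(recovery_pct(25.2, 26.1))   # ~96.6%
    print(recovery_pct(26.8, 26.1))   # ~102.7%
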
