How We Broke Top AI Agent Benchmarks: And What Comes Next
52 points
1 hour ago
| 5 comments
| rdi.berkeley.edu
| HN
ggillas
46 minutes ago
[-]
This is a phenomenal paper on exploits, and hopefully it changes the way benchmarking is done.

From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.

reply
operatingthetan
40 minutes ago
[-]
>hopefully changes the way benchmarking is done.

Yeah the path forward is simple: check whether the submissions actually contain solutions. If they contain exploits, then the entire result is discarded.
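That rule could be sketched as a simple gate, assuming some audit predicate (here a hypothetical `contains_exploit`; in practice it would be a manual or automated review of each submission):

```python
def validate_run(submissions, contains_exploit):
    """Gate a benchmark run: if any submission is an exploit rather
    than a real solution, the entire result is discarded.

    submissions: mapping of task_id -> submitted solution text.
    contains_exploit: hypothetical audit predicate for one submission.
    """
    if any(contains_exploit(sub) for sub in submissions.values()):
        return None  # the whole run is thrown out, not just one task
    return submissions
```

The key design choice is discarding the whole run rather than the single exploited task, so a system has nothing to gain by mixing exploits into otherwise honest submissions.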

reply
ZeroGravitas
11 minutes ago
[-]
In human multiple choice tests they sometimes use negative marking to discourage guessing. It feels like an exploit should cancel out several correct solutions.
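A minimal sketch of that negative-marking idea (the function name and the penalty weight of 5 are made up for illustration):

```python
def negative_marked_score(results, exploit_penalty=5):
    """Score a benchmark run with negative marking.

    results: list of per-task outcomes, each one of
             "correct", "wrong", or "exploit".
    Each detected exploit cancels out `exploit_penalty` correct
    solutions, so gaming the scorer is worse than just failing.
    """
    correct = results.count("correct")
    exploits = results.count("exploit")
    return max(0, correct - exploit_penalty * exploits)

# 10 correct tasks plus 1 detected exploit scores 5, not 10.
print(negative_marked_score(["correct"] * 10 + ["exploit"]))
```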
reply
siva7
28 minutes ago
[-]
Could it really be that we not only vibeslop all apps nowadays, but also don't care to even check how the AI solved a benchmark it claims to have solved?
reply
operatingthetan
19 minutes ago
[-]
A more interesting benchmark is probably one scored on the LLM finding exploits in the benchmark itself.
reply
Leynos
33 minutes ago
[-]
Also, fuzz your benchmarks
reply
zer00eyz
37 minutes ago
[-]
2024: Industry group invalidates 2,600 official Intel CPU benchmarks — SPEC says the company's compiler used unfair optimizations to boost performance https://www.tomshardware.com/pc-components/cpus/spec-invalid...

2003: Nvidia accused of cheating in 3DMark 03 https://www.gamespot.com/articles/nvidia-accused-of-cheating...

It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.

I like what LLM's are doing and providing. But the industry as a whole seems to live in a vacuum that ignores so much of the hard lessons that have been learned over the last 50 years of computing. It is doing itself a disservice.

reply
irishcoffee
29 minutes ago
[-]
> It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.

I wonder if this is common? We should call it Goodhart's law while someone does the research on how common this is.

For real, I’ve assumed from the jump these things were all gamed, with the amount of money on the line.

reply
danslo
16 minutes ago
[-]
If only the blog itself weren't written by AI.

>No reasoning. No capability. Just exploitation of how the score is computed.

shudder

reply
lnrd
28 minutes ago
[-]
I'm honestly confused by the design of SWE-bench and why it is considered reliable.

It's based on existing GitHub PRs and Issues, and the full dataset is on HuggingFace and is one year old now. All frontier models 100% have those issues and PRs in their training data, so obviously they are good at reproducing fixes for them when confronted with the same codebase and similar requests. Am I missing something? How is this considered the most reliable benchmark?

reply
charcircuit
42 minutes ago
[-]
I always assumed that these benchmarks would happen in a sandbox. I'm surprised that no one realized this sooner.
reply
ModernMech
36 minutes ago
[-]
I'm surprised anyone took them seriously in the first place.
reply
subulaz
26 minutes ago
[-]
a LOT of the people who love benchmarks are middle management hard-selling GenAI/LLM as magic tech sauce to vaguely technical executives who only want to know about the money aka headcount savings they so desperately desire.

their collective butts are already glued to the hype train as they chase numbers they (often) manufactured to justify the latest round of tech spend.

lots of good use cases out there - like the incredible progress with medical imaging analysis or complex system models for construction - and lots of crap use cases that need benchmarks to cosplay relevance.

reply
operatingthetan
35 minutes ago
[-]
We need good benchmarks or we are just left following the hype train.
reply
oliver236
22 minutes ago
[-]
what is the point of benchmarks?
reply
andai
19 minutes ago
[-]
If there was not benchmark, number would not go up.
reply