That's just straight up nonsense, no? How much cherry picking do you need?
>from under 25 minutes to over 45 minutes.
If I get my Raspberry Pi to run an LLM task it'll run for over 6 hours. And Groq will do it in 20 seconds.
It's a gibberish measurement in itself if you don't control for token speed (and quality of output).
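To put rough numbers on the token speed point, here is a back-of-the-envelope sketch. It assumes the same task produces the same number of tokens on both machines; the 10,000 token figure is made up purely for illustration, and `tokens_per_second` is just a helper name I picked, not from any library.

```python
def tokens_per_second(total_tokens: int, wall_clock_seconds: float) -> float:
    """Throughput implied by producing total_tokens in wall_clock_seconds."""
    return total_tokens / wall_clock_seconds


# Hypothetical task size, chosen only for illustration.
TASK_TOKENS = 10_000

pi_seconds = 6 * 60 * 60  # "over 6 hours" on the Raspberry Pi
groq_seconds = 20         # "20 seconds" on Groq

pi_tps = tokens_per_second(TASK_TOKENS, pi_seconds)
groq_tps = tokens_per_second(TASK_TOKENS, groq_seconds)

# Same work, same output, wildly different wall-clock time.
speedup = pi_seconds / groq_seconds

print(f"Pi:   {pi_tps:.2f} tokens/s over {pi_seconds} s")
print(f"Groq: {groq_tps:.2f} tokens/s over {groq_seconds} s")
print(f"Wall-clock ratio: {speedup:.0f}x for identical work")
```

The wall-clock ratio does not depend on the assumed task size, which is the whole problem: duration alone says nothing about how much work was done unless throughput is held fixed.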
I really hope this is a simulation example.
The fact that there is no clear trend in lower percentiles makes this more suspect to me.
If you want to control for user base evolution given the growth they've seen, look at the percentiles by cohort.
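A minimal sketch of what "percentiles by cohort" could look like, assuming you have one record per session tagged with the user's signup cohort. The session data below is synthetic and the function names are mine; nothing here reflects the real dataset or how the authors computed their numbers.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def percentiles_by_cohort(sessions, pcts=(50, 90, 99)):
    """Group (cohort, duration_minutes) pairs by cohort and compute percentiles per group."""
    by_cohort = defaultdict(list)
    for cohort, minutes in sessions:
        by_cohort[cohort].append(minutes)
    return {
        cohort: {p: nearest_rank_percentile(durations, p) for p in pcts}
        for cohort, durations in sorted(by_cohort.items())
    }


# Synthetic data for illustration only: (signup cohort, session duration in minutes).
sessions = [
    ("2025-Q1", 2.0), ("2025-Q1", 5.0), ("2025-Q1", 12.0), ("2025-Q1", 30.0),
    ("2025-Q3", 1.0), ("2025-Q3", 3.0), ("2025-Q3", 4.0), ("2025-Q3", 40.0),
]

result = percentiles_by_cohort(sessions)
for cohort, stats in result.items():
    print(cohort, stats)
```

If the top percentile rises within each cohort over time, that is a usage change; if it only rises in the pooled data, it may just be the mix of users shifting.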
I actually come away from this questioning the METR work on autonomy.
You can see the trend for other percentiles at the bottom of this, which they link to in the blog post https://cdn.sanity.io/files/4zrzovbb/website/5b4158dc1afb211...
how autonomous are humans?
do i need to continually correct them and provide guidance?
do they go off track?
do they waste time on something that doesn't matter?
autonomous humans have the same problems.