Higher effort reduces deep research accuracy for Gemini Flash 3 and GPT-5
9 points
1 hour ago
| 1 comment
| futuresearch.ai
mckennameyer
1 hour ago
We tested GPT-5 and Gemini Flash 3 at low, medium, and high effort on 169 instances with human-verified answers, using Deep Research Bench to score them against a frozen offline web corpus. For both models, high effort consistently scored worse than the lower effort levels. Methodology and raw data: https://everyrow.io/docs/notebooks/deep-research-bench-paret... (edited)
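For anyone who wants to redo the comparison from the raw data, here is a minimal sketch of the per-model, per-effort aggregation. It assumes each result row reduces to a (model, effort, correct) triple with a pass/fail score, which may not match how Deep Research Bench actually scores. The function name `accuracy_by_effort` and the toy rows are hypothetical placeholders, not the benchmark's data or API.

```python
from collections import defaultdict


def accuracy_by_effort(records):
    """Return {(model, effort): fraction correct} from (model, effort, correct) rows."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model, effort, correct in records:
        totals[(model, effort)] += 1
        hits[(model, effort)] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


# Placeholder rows for illustration only; NOT the benchmark's results.
toy_rows = [
    ("model-a", "low", True),
    ("model-a", "low", True),
    ("model-a", "high", True),
    ("model-a", "high", False),
]

scores = accuracy_by_effort(toy_rows)
for (model, effort), acc in sorted(scores.items()):
    print(f"{model:8s} {effort:6s} {acc:.2f}")
```

With real rows you would get one accuracy per model and effort level, which can then be compared across low, medium, and high.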