FilterHN

Higher effort reduces deep research accuracy for Gemini Flash 3 and GPT-5

9 points

by wawawildwildest

1 hour ago

| 1 comment

| futuresearch.ai

mckennameyer

1 hour ago

We tested GPT-5 and Gemini Flash 3 at low, medium, and high effort on 169 instances with human-verified answers, scored against a frozen offline web corpus using Deep Research Bench. For both models, high effort consistently scored worse than the lower effort levels. Methodology and raw data: https://everyrow.io/docs/notebooks/deep-research-bench-paret...