Project Benchmarks


    model data provided by Surge AI
  1. September 1, 2025
  2. November 1, 2025
  3. November 26, 2025
  4. March 23, 2026
  5. April 11, 2026
  6. April 30, 2026

Published on April 11, 2026

50 research-level problems
20 runs per problem (multi-run protocol)
Model runs performed by Surge AI
Observations:
• On 32 of the 50 problems, the three standard models' behavior is nearly identical.
• On 18 of the 50 problems, the models gave strongly divergent results.
• The higher-compute variants clearly outperform their standard counterparts.

We thank Surge AI for their support with generating and providing the data.

Model Name                  Model Type    Correct Answer
GPT-5.4 Pro                 Active Model  60% (3 runs)
Gemini 3.1 Pro Deep Think   Active Model  46% (3 runs)
GPT-5.4                     Active Model  35% (20 runs)
Gemini 3.1 Pro              Active Model  35% (20 runs)
Claude Opus 4.6             Active Model  33% (20 runs)

Based on 50 problems, evaluated with multiple independent runs.
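
The page does not say how the per-model percentage is aggregated across runs. As a minimal sketch, assuming the score is the mean over problems of the fraction of correct runs, the calculation could look like the following. The function name, the data layout, and the toy values are all hypothetical and are not taken from the benchmark data.

```python
from statistics import mean


def correct_answer_rate(results: dict[str, list[bool]]) -> float:
    """Return the mean per-problem fraction of correct runs, as a percentage.

    `results` maps a problem id to one boolean per independent run
    (True = correct answer). This layout is an assumption for illustration.
    """
    per_problem = [sum(runs) / len(runs) for runs in results.values()]
    return 100 * mean(per_problem)


# Hypothetical toy data: 2 problems with 4 runs each (not real benchmark data).
toy = {
    "problem_a": [True, True, False, False],
    "problem_b": [True, False, False, False],
}
rate = correct_answer_rate(toy)
print(f"{rate:.1f}%")
```

Averaging per problem first gives every problem equal weight even if the number of runs differs between problems. Pooling all runs into a single fraction would be an equally plausible reading of the protocol.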