Project Benchmarks

This benchmark is based on 140 submissions, each of which stumps at least 1 active model; visit the Samples Page for several sample submissions. 60 research mathematicians have contributed to this benchmark. It includes the two newly released models, Gemini 3.1 Pro and GPT-5.4, each performing significantly better than its predecessor.

Since our first benchmark in September 2025:
• OpenAI's best model improved from 43% (GPT-5) to 68% (GPT-5.4)
• Google's best model improved from 29% (Gemini 2.5 Pro) to 60% (Gemini 3.1 Pro)

Every submission was attempted once by every model, and failing attempts were rerun. All models were queried via the API, using the strongest available version.
Published March 23, 2026
Contact: contact@sciencebench.ai
Based on 140 submissions that stump at least 1 active model.
Model Name       Model Type    Correct Answer
GPT-5.4          Active Model  68%
Gemini 3.1 Pro   Active Model  60%
GPT-5.2          Legacy Model  52%
Gemini 3 Pro     Legacy Model  44%
GPT-5.1          Legacy Model  38%
GPT-5            Legacy Model  31%
DeepSeek-V3.2    Active Model  23%
Grok-4           Legacy Model  23%
o3               Legacy Model  22%
Gemini 2.5 Pro   Legacy Model  19%
Claude Opus 4.5  Active Model  18%
Grok-4.1         Active Model  17%
Based on 120 submissions that stump at least 2 active models.
Model Name       Model Type    Correct Answer
GPT-5.4          Active Model  62%
Gemini 3.1 Pro   Active Model  51%
GPT-5.2          Legacy Model  45%
Gemini 3 Pro     Legacy Model  36%
GPT-5.1          Legacy Model  31%
GPT-5            Legacy Model  26%
o3               Legacy Model  18%
Grok-4           Legacy Model  16%
Gemini 2.5 Pro   Legacy Model  14%
DeepSeek-V3.2    Active Model  12%
Grok-4.1         Active Model  10%
Claude Opus 4.5  Active Model  7%
Based on 90 submissions that stump at least 3 active models.
Model Name       Model Type    Correct Answer
GPT-5.4          Active Model  53%
Gemini 3.1 Pro   Active Model  39%
GPT-5.2          Legacy Model  37%
Gemini 3 Pro     Legacy Model  26%
GPT-5.1          Legacy Model  22%
GPT-5            Legacy Model  18%
o3               Legacy Model  13%
Gemini 2.5 Pro   Legacy Model  9%
Grok-4           Legacy Model  9%
Grok-4.1         Active Model  4%
Claude Opus 4.5  Active Model  3%
DeepSeek-V3.2    Active Model  3%
Based on 60 submissions that stump at least 4 active models.
Model Name       Model Type    Correct Answer
GPT-5.4          Active Model  29%
GPT-5.2          Legacy Model  19%
GPT-5.1          Legacy Model  10%
GPT-5            Legacy Model  9%
Gemini 3 Pro     Legacy Model  9%
o3               Legacy Model  9%
Gemini 3.1 Pro   Active Model  7%
DeepSeek-V3.2    Active Model  3%
Grok-4           Legacy Model  3%
Claude Opus 4.5  Active Model  2%
Gemini 2.5 Pro   Legacy Model  2%
Grok-4.1         Active Model  2%
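
The four tables above score every model on progressively harder subsets: submissions that stump at least 1, 2, 3, or 4 active models. The sketch below shows one way such subsets and scores could be computed. It is an illustration only, not the benchmark's actual pipeline: the model names, the toy results, and the helper functions `hard_subset` and `accuracy` are all hypothetical, and it assumes that "stump" means the model's recorded answer was incorrect.

```python
from typing import Dict, List

# Hypothetical toy data (not real benchmark results):
# RESULTS[model][i] is True if that model answered submission i correctly.
ACTIVE: List[str] = ["model_a", "model_b", "model_c"]
RESULTS: Dict[str, List[bool]] = {
    "model_a": [True, True, True, False, False],
    "model_b": [True, True, False, True, False],
    "model_c": [True, False, False, False, False],
    "legacy_x": [False, True, False, False, True],
}


def hard_subset(results: Dict[str, List[bool]], active: List[str], k: int) -> List[int]:
    """Indices of submissions that at least k active models answered incorrectly."""
    n = len(results[active[0]])
    return [
        i for i in range(n)
        if sum(1 for m in active if not results[m][i]) >= k
    ]


def accuracy(results: Dict[str, List[bool]], indices: List[int]) -> Dict[str, int]:
    """Percentage of correct answers per model on the given submissions, rounded."""
    if not indices:
        return {}
    return {
        model: round(100 * sum(answers[i] for i in indices) / len(indices))
        for model, answers in results.items()
    }


if __name__ == "__main__":
    for k in (1, 2, 3):
        idx = hard_subset(RESULTS, ACTIVE, k)
        print(f"stumps at least {k} active model(s): {len(idx)} submissions")
        ranked = sorted(accuracy(RESULTS, idx).items(), key=lambda kv: -kv[1])
        for model, pct in ranked:
            print(f"  {model}: {pct}%")
```

Under this definition each stricter subset is contained in the looser one, which is consistent with the shrinking submission counts reported above (140, 120, 90, 60). Legacy models are scored on the same subsets but do not count toward the stump threshold in this sketch.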