Project Benchmarks
This benchmark is based on 140 submissions, each stumping at least 1 active model. Visit the Samples Page for several sample submissions. 60 research mathematicians have contributed to this benchmark.
It includes two newly released models, Gemini 3.1 Pro and GPT-5.4, each performing significantly better than its predecessor. Since our first benchmark in September 2025:
• OpenAI's best model improved from 43% (GPT-5) to 68% (GPT-5.4)
• Google's best model improved from 29% (Gemini 2.5 Pro) to 60% (Gemini 3.1 Pro)
Every model attempted every submission once, with failing attempts rerun. All models were queried via the API, using the strongest available version.
Published March 23, 2026
Contact: contact@sciencebench.ai
Based on 140 submissions that stump at least 1 active model.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5.4 | Active Model | 68% |
| Gemini 3.1 Pro | Active Model | 60% |
| GPT-5.2 | Legacy Model | 52% |
| Gemini 3 Pro | Legacy Model | 44% |
| GPT-5.1 | Legacy Model | 38% |
| GPT-5 | Legacy Model | 31% |
| DeepSeek-V3.2 | Active Model | 23% |
| Grok-4 | Legacy Model | 23% |
| o3 | Legacy Model | 22% |
| Gemini 2.5 Pro | Legacy Model | 19% |
| Claude Opus 4.5 | Active Model | 18% |
| Grok-4.1 | Active Model | 17% |
Based on 120 submissions that stump at least 2 active models.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5.4 | Active Model | 62% |
| Gemini 3.1 Pro | Active Model | 51% |
| GPT-5.2 | Legacy Model | 45% |
| Gemini 3 Pro | Legacy Model | 36% |
| GPT-5.1 | Legacy Model | 31% |
| GPT-5 | Legacy Model | 26% |
| o3 | Legacy Model | 18% |
| Grok-4 | Legacy Model | 16% |
| Gemini 2.5 Pro | Legacy Model | 14% |
| DeepSeek-V3.2 | Active Model | 12% |
| Grok-4.1 | Active Model | 10% |
| Claude Opus 4.5 | Active Model | 7% |
Based on 90 submissions that stump at least 3 active models.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5.4 | Active Model | 53% |
| Gemini 3.1 Pro | Active Model | 39% |
| GPT-5.2 | Legacy Model | 37% |
| Gemini 3 Pro | Legacy Model | 26% |
| GPT-5.1 | Legacy Model | 22% |
| GPT-5 | Legacy Model | 18% |
| o3 | Legacy Model | 13% |
| Gemini 2.5 Pro | Legacy Model | 9% |
| Grok-4 | Legacy Model | 9% |
| Grok-4.1 | Active Model | 4% |
| Claude Opus 4.5 | Active Model | 3% |
| DeepSeek-V3.2 | Active Model | 3% |
Based on 60 submissions that stump at least 4 active models.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5.4 | Active Model | 29% |
| GPT-5.2 | Legacy Model | 19% |
| GPT-5.1 | Legacy Model | 10% |
| GPT-5 | Legacy Model | 9% |
| Gemini 3 Pro | Legacy Model | 9% |
| o3 | Legacy Model | 9% |
| Gemini 3.1 Pro | Active Model | 7% |
| DeepSeek-V3.2 | Active Model | 3% |
| Grok-4 | Legacy Model | 3% |
| Claude Opus 4.5 | Active Model | 2% |
| Gemini 2.5 Pro | Legacy Model | 2% |
| Grok-4.1 | Active Model | 2% |
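
The four tables above differ only in how the submission set is filtered: a submission is kept if it stumps at least k active models, and each model's score is the share of kept submissions it answered correctly. The following is a minimal sketch of that computation, not the benchmark's actual pipeline. The data layout, the function names `stumped_subset` and `accuracy_table`, and the toy models and results are all hypothetical, and rounding to whole percentages is an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: results[submission_id][model_name] is True
# when that model answered the submission correctly.
Results = Dict[str, Dict[str, bool]]


def stumped_subset(results: Results, active_models: List[str], min_stumped: int) -> Results:
    """Keep submissions that at least `min_stumped` active models answered incorrectly."""
    return {
        sid: outcomes
        for sid, outcomes in results.items()
        if sum(not outcomes[m] for m in active_models) >= min_stumped
    }


def accuracy_table(results: Results, models: List[str]) -> List[Tuple[str, int]]:
    """Return (model, percent correct) rows, highest score first."""
    n = len(results)
    if n == 0:
        return []  # no submissions survive the filter
    rows = [
        (m, round(100 * sum(outcomes[m] for outcomes in results.values()) / n))
        for m in models
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy data (invented for illustration; not benchmark results).
TOY_RESULTS: Results = {
    "s1": {"model_a": True, "model_b": False, "model_c": False},
    "s2": {"model_a": False, "model_b": False, "model_c": True},
    "s3": {"model_a": True, "model_b": True, "model_c": True},
    "s4": {"model_a": False, "model_b": True, "model_c": True},
}
TOY_ACTIVE = ["model_a", "model_b"]       # hypothetical active models
TOY_MODELS = TOY_ACTIVE + ["model_c"]     # model_c plays the role of a legacy model

if __name__ == "__main__":
    for k in (1, 2):
        subset = stumped_subset(TOY_RESULTS, TOY_ACTIVE, k)
        print(f"k={k}: {len(subset)} submissions", accuracy_table(subset, TOY_MODELS))
```

Under this reading, the filter conditions on failures by active models only, so raising k tends to lower active-model scores more sharply than legacy-model scores.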