Project Benchmarks
model data provided by Surge AI
- September 1, 2025
- November 1, 2025
- November 26, 2025
- March 23, 2026
- April 11, 2026
- April 30, 2026
Published on September 1, 2025
100 research-level problems
37 contributing researchers
Main areas: Algebra & Combinatorics
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5 | Active Model | 43% |
| DeepSeek-V3.1 | Active Model | 34% |
| Grok-4 | Active Model | 34% |
| o3 | Active Model | 32% |
| Gemini 2.5 Pro | Active Model | 29% |
| DeepSeek R1 | Legacy Model | 27% |
| o3-mini | Legacy Model | 22% |
| Gemini 2.5 Flash | Legacy Model | 18% |
| Claude Opus 4.1 | Active Model | 15% |
| Claude Sonnet 4 | Legacy Model | 9% |
Based on 100 submissions that stump at least 1 active model. All models were queried via the API, using the strongest available version.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5 | Active Model | 35% |
| DeepSeek-V3.1 | Active Model | 26% |
| Grok-4 | Active Model | 26% |
| o3 | Active Model | 23% |
| DeepSeek R1 | Legacy Model | 22% |
| Gemini 2.5 Pro | Active Model | 21% |
| o3-mini | Legacy Model | 17% |
| Gemini 2.5 Flash | Legacy Model | 11% |
| Claude Opus 4.1 | Active Model | 8% |
| Claude Sonnet 4 | Legacy Model | 6% |
Based on 80 submissions that stump at least 2 active models. All models were queried via the API, using the strongest available version.
Sample Problems
Number Theory
How many mutually non-isomorphic extensions of degree $4$ with Galois group of order at most $8$ does the field $\mathbb{Q}_2$ of $2$-adic numbers admit?
Algebraic Geometry
The graded pieces (by codimension) of the Chow ring $A^\bullet (\mathcal{F})$ of the two-step flag variety $\mathcal{F}=\operatorname{Fl}(2,4;\mathbb{C}^6)$ are vector spaces spanned by Schubert classes, which are indexed by pairs of subsets of $\{0,\ldots,5\}$: the first of size $2$, and the second of size $4$ containing the first. Let $X$ be a variety of class $[1,5; 1,3,4,5]$ and $Y$ a variety of class $[3,5; 1,3,4,5]$. (For example, the former means that the lines $L\in X$ meet a $\mathbb{P}^1$, and there are no additional conditions on the projective $3$-planes containing $L$.) Suppose that $X$ and $Y$ intersect transversally. What is the class of the intersection $X\cap Y$, written in terms of the Schubert basis of $A^k(\mathcal{F})$ in the appropriate codimension $k$?
Metric Geometry
Let $B$ be the unit ball in $\mathbb{R}^{65}$ with respect to the standard Euclidean norm. What is the smallest natural number $r$ such that there exist hermitian $r\times r$ matrices $A_0,\ldots,A_{65}$ with $B=\{p\in\mathbb{R}^{65}\mid A_0+p_1\cdot A_1+\cdots+p_{65}\cdot A_{65}\textrm{ is positive semidefinite}\}$?
Published on November 1, 2025
209 research-level problems
56 contributing researchers
Main areas: Algebra, Combinatorics & Geometry
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5 | Active Model | 43% |
| Grok-4 | Active Model | 32% |
| o3 | Active Model | 32% |
| DeepSeek-V3.1 | Active Model | 31% |
| Gemini 2.5 Pro | Active Model | 26% |
| DeepSeek R1 | Legacy Model | 23% |
| o3-mini | Legacy Model | 23% |
| Gemini 2.5 Flash | Legacy Model | 14% |
| Claude Opus 4.1 | Active Model | 12% |
| Claude Sonnet 4 | Legacy Model | 12% |
Based on 200 submissions that stump at least 1 active model. All models were queried via the API, using the strongest available version.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5 | Active Model | 36% |
| Grok-4 | Active Model | 27% |
| o3 | Active Model | 26% |
| DeepSeek-V3.1 | Active Model | 25% |
| DeepSeek R1 | Legacy Model | 20% |
| Gemini 2.5 Pro | Active Model | 20% |
| o3-mini | Legacy Model | 19% |
| Claude Sonnet 4 | Legacy Model | 9% |
| Gemini 2.5 Flash | Legacy Model | 9% |
| Claude Opus 4.1 | Active Model | 8% |
Based on 180 submissions that stump at least 2 active models. All models were queried via the API, using the strongest available version.
Sample Problems
Matroid Theory, Topology
Let $M = (E, \mathcal{B})$ be a matroid. Let $\mathbb{P}L_{M}$ be the projectivization of the space of Lorentzian polynomials whose support is exactly the collection $\mathcal{B}$. When $M = F_{7}$, the Fano matroid, find the dimension of $\mathbb{P}L_{M}$.
Homological Algebra
Let $Q = 1 \longrightarrow 2 \longrightarrow 3 \longleftarrow 4$. Consider $M = X_{[1,2]}^2 \oplus X_{[1,3]}^3 \oplus X_{[1,4]}^2 \oplus X_{[2]} \oplus X_{[2,3]}^2 \oplus X_{[2,4]}$. Calculate the generic Jordan form data of $M$.
Combinatorics, Discrete Geometry
In a card game, each card has three attributes: color, number and shape. Each attribute has 4 possible values. Color: red, blue, green, yellow. Number: 1, 2, 3, 4. Shape: circle, square, triangle, cross. Every possible card appears exactly once in the deck. Define a "quartet" to be a set of four cards from this deck such that, for each of the three attributes, the four values satisfy one of the following: all four are the same, all four are different, or they form exactly two pairs. What is the minimum number of cards that must be drawn from the deck to guarantee that at least one quartet is among them?
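The quartet condition can be checked mechanically. Below is a minimal Python sketch, assuming cards are modeled as (color, number, shape) tuples; `DECK`, `attribute_ok`, `is_quartet`, and `contains_quartet` are hypothetical helper names introduced for illustration. It only tests the definition by brute force and does not determine the minimum the problem asks for.

```python
from collections import Counter
from itertools import combinations, product

# Hypothetical model of the deck described in the problem (sketch only).
COLORS = ("red", "blue", "green", "yellow")
NUMBERS = (1, 2, 3, 4)
SHAPES = ("circle", "square", "triangle", "cross")

# Each (color, number, shape) combination appears exactly once: 4**3 = 64 cards.
DECK = list(product(COLORS, NUMBERS, SHAPES))


def attribute_ok(values) -> bool:
    """Four values are all equal, all different, or exactly two pairs."""
    pattern = sorted(Counter(values).values())
    return pattern in ([4], [1, 1, 1, 1], [2, 2])


def is_quartet(cards) -> bool:
    """Check the quartet condition for four distinct cards."""
    if len(cards) != 4 or len(set(cards)) != 4:
        return False
    return all(attribute_ok([card[k] for card in cards]) for k in range(3))


def contains_quartet(hand) -> bool:
    """Brute force: does some 4-card subset of the hand form a quartet?"""
    return any(is_quartet(four) for four in combinations(hand, 4))
```

For example, four red cards showing the number 1 with four different shapes form a quartet, while three red cards and one blue card never do, whatever the other attributes are.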
Published on November 26, 2025
140 research-level problems
56 contributing researchers
Main areas: Algebra, Combinatorics & Geometry
Now including Gemini 3 Pro, GPT-5.1, and Claude Opus 4.5, each performing significantly better than its predecessor.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| Gemini 3 Pro | Active Model | 46% |
| GPT-5.1 | Active Model | 41% |
| GPT-5 | Legacy Model | 35% |
| Grok-4 | Active Model | 23% |
| o3 | Active Model | 23% |
| Claude Opus 4.5 | Active Model | 20% |
| DeepSeek-V3.1 | Active Model | 20% |
| Gemini 2.5 Pro | Legacy Model | 20% |
| o3-mini | Legacy Model | 19% |
| DeepSeek R1 | Legacy Model | 18% |
| Claude Opus 4.1 | Legacy Model | 11% |
| Gemini 2.5 Flash | Legacy Model | 10% |
| Claude Sonnet 4 | Legacy Model | 8% |
Based on 140 submissions that stump at least 1 active model. All models were queried via the API, using the strongest available version.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| Gemini 3 Pro | Active Model | 40% |
| GPT-5.1 | Active Model | 33% |
| GPT-5 | Legacy Model | 29% |
| o3 | Active Model | 17% |
| Grok-4 | Active Model | 16% |
| DeepSeek R1 | Legacy Model | 15% |
| Gemini 2.5 Pro | Legacy Model | 15% |
| o3-mini | Legacy Model | 15% |
| DeepSeek-V3.1 | Active Model | 14% |
| Claude Opus 4.5 | Active Model | 13% |
| Claude Opus 4.1 | Legacy Model | 7% |
| Claude Sonnet 4 | Legacy Model | 7% |
| Gemini 2.5 Flash | Legacy Model | 6% |
Based on 130 submissions that stump at least 2 active models. All models were queried via the API, using the strongest available version.
Sample Problems
Commutative Algebra, Algebraic Combinatorics
Let $R=\mathbb{C}[x_1,\ldots,x_6,y_1,\ldots,y_6]$ with the bigrading $\deg(x_i)=(1,0)$, $\deg(y_i)=(0,1)$. Let $I\subset R$ be the bihomogeneous ideal generated by the $2$-minors of the generic matrix \[\begin{pmatrix} x_1 & x_2 & \cdots & x_6 \\ y_1 & y_2 & \cdots & y_6\end{pmatrix}.\] What is the dimension of the $\mathbb{C}$-vector space $(I^2/I^3)_{(5,2)}$?
Metric Geometry
Let $B$ be the unit ball in $\mathbb{R}^{8193}$ with respect to the standard Euclidean norm. What is the smallest natural number $r$ such that there exist hermitian $r\times r$ matrices $A_0,\ldots,A_{8193}$ with $B=\{p\in\mathbb{R}^{8193}\mid A_0+p_1\cdot A_1+\cdots+p_{8193}\cdot A_{8193}\textrm{ is positive semidefinite}\}$?
Algebraic Combinatorics, Commutative Algebra
Let $W$ be the Weyl algebra $\mathbb{C}\left<D,X \mid DX - XD = 1\right>$; this is the $\mathbb{C}$-algebra with two generators $D$ and $X$ and a single relation $DX - XD = 1$.
Given an integer $n \ge 0$, we define an *$n$-monomial element* to be an element of $W$ that can be written as a product of $n$ generators from the set $\left\{D,X\right\}$, i.e., as $G_1G_2\cdots G_n$ where each $G_i \in \left\{D,X\right\}$. Note that some $n$-monomial elements can be written in several ways in such a form (for instance, $DXXD = XDDX$).
Let $a_n$ denote the number of $n$-monomial elements. Find $a_{11} + a_{12}$.
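Because the monomials $X^iD^j$ form a basis of $W$, two words in $D$ and $X$ represent the same element exactly when their normally ordered forms agree. The following Python sketch (the function name `normal_order` is a hypothetical choice) moves every $X$ to the left using $D^jX = XD^j + jD^{j-1}$, which follows from $DX - XD = 1$ by induction on $j$, and confirms the identity $DXXD = XDDX$ quoted above.

```python
from collections import defaultdict


def normal_order(word: str) -> dict[tuple[int, int], int]:
    """Rewrite a word in D and X as a sum of c * X**i * D**j (X's on the left).

    Hypothetical helper (sketch only). Uses D**j X = X D**j + j D**(j-1).
    The result is returned as {(i, j): c} with zero coefficients removed.
    """
    elem = {(0, 0): 1}  # the empty product is 1
    for g in word:
        new = defaultdict(int)
        for (i, j), c in elem.items():
            if g == "D":
                # (X^i D^j) D = X^i D^(j+1)
                new[(i, j + 1)] += c
            elif g == "X":
                # (X^i D^j) X = X^(i+1) D^j + j X^i D^(j-1)
                new[(i + 1, j)] += c
                if j > 0:
                    new[(i, j - 1)] += j * c
            else:
                raise ValueError(f"unexpected generator {g!r}")
        elem = {k: v for k, v in new.items() if v != 0}
    return elem


# The example from the problem statement: DXXD = XDDX = X^2 D^2 + 2 XD.
assert normal_order("DXXD") == normal_order("XDDX") == {(2, 2): 1, (1, 1): 2}
# DX and XD differ by the relation DX - XD = 1.
assert normal_order("DX") == {(1, 1): 1, (0, 0): 1}
```

Comparing normal forms of words of a fixed length $n$ in this way is one direct means of deciding which words give the same $n$-monomial element.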
Published on March 23, 2026
140 research-level problems
60 contributing researchers
28 subfields included
Now including Gemini 3.1 Pro and GPT-5.4, each performing significantly better than its predecessor. Since our first benchmark in September 2025:
• OpenAI's best model improved from 43% (GPT-5) to 68% (GPT-5.4)
• Google's best model improved from 29% (Gemini 2.5 Pro) to 60% (Gemini 3.1 Pro)
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5.4 | Active Model | 68% |
| Gemini 3.1 Pro | Active Model | 60% |
| GPT-5.2 | Legacy Model | 52% |
| Gemini 3 Pro | Legacy Model | 44% |
| GPT-5.1 | Legacy Model | 38% |
| GPT-5 | Legacy Model | 31% |
| DeepSeek-V3.2 | Active Model | 23% |
| Grok-4 | Legacy Model | 23% |
| o3 | Legacy Model | 22% |
| Gemini 2.5 Pro | Legacy Model | 19% |
| Claude Opus 4.5 | Active Model | 18% |
| Grok-4.1 | Active Model | 17% |
Based on 140 submissions that stump at least 1 active model. All models were queried via the API, using the strongest available version.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5.4 | Active Model | 62% |
| Gemini 3.1 Pro | Active Model | 51% |
| GPT-5.2 | Legacy Model | 45% |
| Gemini 3 Pro | Legacy Model | 36% |
| GPT-5.1 | Legacy Model | 31% |
| GPT-5 | Legacy Model | 26% |
| o3 | Legacy Model | 18% |
| Grok-4 | Legacy Model | 16% |
| Gemini 2.5 Pro | Legacy Model | 14% |
| DeepSeek-V3.2 | Active Model | 12% |
| Grok-4.1 | Active Model | 10% |
| Claude Opus 4.5 | Active Model | 7% |
Based on 120 submissions that stump at least 2 active models. All models were queried via the API, using the strongest available version.
Contributing Subfields
| Subfield | Problems |
|---|---|
| Algebraic Combinatorics | 32 |
| Algebra | 19 |
| Combinatorics | 17 |
| Discrete Geometry | 11 |
| Algebraic Geometry | 10 |
| Enumerative Combinatorics | 7 |
| Homological Algebra | 7 |
| Matroid Theory | 6 |
| Commutative Algebra | 3 |
| Group Theory | 3 |
| Metric Geometry | 3 |
| Symmetric Function Theory | 3 |
| Topology | 3 |
| Algebraic Statistics | 2 |
| Analysis | 2 |
| Geometry | 2 |
| Graph Theory | 2 |
| Lie Theory | 2 |
| Complex Analysis | 1 |
| Euclidean Geometry | 1 |
| Monoid Theory | 1 |
| Number Theory | 1 |
| Partial Differential Equations | 1 |
| Polytope Theory | 1 |
| Probability Theory | 1 |
| Real Algebraic Geometry | 1 |
| Theoretical Computer Science | 1 |
| Tropical Geometry | 1 |
Sample Problems
Graph Theory
What is the number of connected, simple graphs with exactly 1 cycle, 100 vertices and all vertex degrees at most 3? To clarify, the graphs are unlabelled and not necessarily planar.
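For a single labelled graph, the conditions in the question are easy to test, since a connected graph has exactly one cycle precisely when it has as many edges as vertices. The minimal Python sketch below (the function name is hypothetical) checks simplicity, the degree bound, connectivity, and the edge count; it does not enumerate unlabelled graphs up to isomorphism, which is what the problem actually asks for.

```python
from collections import deque


def is_connected_unicyclic_max_deg3(n: int, edges: list[tuple[int, int]]) -> bool:
    """Check one labelled graph on vertices 0..n-1 (hypothetical helper, sketch only)."""
    if n == 0:
        return False
    # Simple graph: no self-loops and no repeated edges.
    as_sets = {frozenset(e) for e in edges}
    if len(as_sets) != len(edges) or any(len(e) != 2 for e in as_sets):
        return False
    adjacency = [[] for _ in range(n)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    # Every vertex degree must be at most 3.
    if any(len(nbrs) > 3 for nbrs in adjacency):
        return False
    # Connectivity via breadth-first search from vertex 0.
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    if len(seen) != n:
        return False
    # A connected graph has exactly one cycle iff it has as many edges as vertices.
    return len(edges) == n
```

For instance, a triangle with one pendant vertex passes, while two disjoint triangles fail the connectivity check even though they also have as many edges as vertices.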
Algebraic Statistics
Let $f_1,\ldots,f_m$ be generic forms of degree $2$ in $n$ complex variables $x_1,\ldots,x_n$. Consider the *log-likelihood function* $\ell = \sum_{i=1}^m s_i \log(f_i)$, where $s_1,\ldots,s_m$ are also treated as complex variables. This function is well defined on the complement of the hypersurface arrangement defined by the $f_i$. The *likelihood correspondence* $L$ is the Zariski closure in $\mathbb{P}^{n-1}\times \mathbb{P}^{m-1}$ of the critical locus of the likelihood equations. Precisely,
\[
L = \overline{\left\lbrace (x, s)\in \mathbb{C}^{n}\times \mathbb{C}^{m} :
\frac{\partial \ell}{\partial x_i}(x,s)=0, i=1,\dots,n,
\prod_{i=1}^m f_i^{s_i(x)} \neq 0,\, F(x)\in X_{r} \right\rbrace},
\]
where $X$ is the Zariski-closure of the image of $F\colon\mathbb{C}^n \rightarrow\mathbb{C}^{m}, x \mapsto (f_1(x),\dots, f_m(x))$, and $X_{r}$ is its set of nonsingular points.
Determine the dimension of the likelihood correspondence for $m=8$ and $n=5$.
Discrete Geometry
How many distinct combinatorial types of 3-dimensional polytopes can you obtain by intersecting the 4-dimensional cube with an affine hyperplane?
Published on April 11, 2026
50 research-level problems
Multi-run protocol: 20 runs per problem for the standard models (3 runs for the higher-compute variants)
Model runs performed by Surge AI
Observations:
• On 32 of the 50 problems, the three standard models behave nearly identically.
• On 18 of the 50 problems, the models gave strongly divergent results.
• The higher-compute variants clearly outperform their standard counterparts.
We thank Surge AI for their support in generating and providing the data.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5.4 Pro | Active Model | 60% (3 runs) |
| Gemini 3.1 Pro Deep Think | Active Model | 46% (3 runs) |
| GPT-5.4 | Active Model | 35% (20 runs) |
| Gemini 3.1 Pro | Active Model | 35% (20 runs) |
| Claude Opus 4.6 | Active Model | 33% (20 runs) |
Based on 50 problems, evaluated with multiple independent runs.
Published on April 30, 2026
120 research-level problems
51 contributing researchers
29 subfields included
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5.5 | Active Model | 64% |
| GPT-5.4 | Legacy Model | 58% |
| Claude Opus 4.7 | Active Model | 41% |
| GPT-5.2 | Legacy Model | 40% |
| Gemini 3.1 Pro | Active Model | 38% |
| DeepSeek V4 Pro | Active Model | 32% |
| Claude Opus 4.6 | Legacy Model | 27% |
| Gemini 3 Pro | Legacy Model | 24% |
| DeepSeek-V3.2 | Legacy Model | 13% |
| Grok-4.1 | Legacy Model | 11% |
| Grok-4.20 | Active Model | 11% |
Based on 120 submissions that stump at least 1 active model. All models were queried via the API, using the strongest available version.
| Model Name | Model Type | Correct Answer |
|---|---|---|
| GPT-5.5 | Active Model | 53% |
| GPT-5.4 | Legacy Model | 45% |
| GPT-5.2 | Legacy Model | 29% |
| Claude Opus 4.7 | Active Model | 23% |
| Gemini 3.1 Pro | Active Model | 19% |
| Claude Opus 4.6 | Legacy Model | 15% |
| DeepSeek V4 Pro | Active Model | 14% |
| DeepSeek-V3.2 | Legacy Model | 9% |
| Gemini 3 Pro | Legacy Model | 9% |
| Grok-4.20 | Active Model | 9% |
| Grok-4.1 | Legacy Model | 6% |
Based on 90 submissions that stump at least 2 active models. All models were queried via the API, using the strongest available version.
Contributing Subfields
| Subfield | Problems |
|---|---|
| Algebraic Combinatorics | 40 |
| Combinatorics | 20 |
| Algebra | 18 |
| Algebraic Geometry | 18 |
| Enumerative Combinatorics | 12 |
| Matroid Theory | 12 |
| Discrete Geometry | 11 |
| Homological Algebra | 11 |
| Commutative Algebra | 4 |
| Graph Theory | 4 |
| Group Theory | 4 |
| Representation Theory | 4 |
| Algebraic Statistics | 3 |
| Analysis | 3 |
| Lie Theory | 3 |
| Metric Geometry | 3 |
| Number Theory | 3 |
| Symmetric Function Theory | 3 |
| Geometry | 2 |
| Topology | 2 |
| Complex Analysis | 1 |
| Euclidean Geometry | 1 |
| Monoid Theory | 1 |
| Partial Differential Equations | 1 |
| Polytope Theory | 1 |
| Probability Theory | 1 |
| Real Algebraic Geometry | 1 |
| Theoretical Computer Science | 1 |
| Tropical Geometry | 1 |
Sample Problems
Number Theory, Analysis
For a positive integer $n$, define $V(n)$ to be the integer obtained by reading the base-10 digits of $n$ as a base-11 numeral. I want to evaluate the series $\sum_{p \text{ prime}} \frac{1}{V(p)}$ to an accuracy of $10^{-5}$.
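The map $V$ is straightforward to compute: Python's built-in `int(s, 11)` parses a string as a base-11 numeral, and every decimal digit is a valid base-11 digit. The sketch below (with hypothetical helper names `V`, `primes_up_to`, and `partial_sum`) computes truncated partial sums of the series; it provides no bound on the tail, so on its own it does not certify an accuracy of $10^{-5}$.

```python
import math


def V(n: int) -> int:
    """Read the base-10 digits of n as a base-11 numeral (hypothetical helper)."""
    return int(str(n), 11)


def primes_up_to(limit: int) -> list[int]:
    """Primes p <= limit via the sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0] = is_prime[1] = 0
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            # Mark p*p, p*p + p, ... as composite.
            is_prime[p * p :: p] = bytes(len(range(p * p, limit + 1, p)))
    return [p for p in range(limit + 1) if is_prime[p]]


def partial_sum(limit: int) -> float:
    """Truncated series: sum of 1/V(p) over primes p <= limit (no tail bound)."""
    return sum(1.0 / V(p) for p in primes_up_to(limit))
```

For example, `V(23)` reads "23" in base 11 and gives $2\cdot 11 + 3 = 25$, and for one-digit primes $V(p) = p$.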
Algebraic Combinatorics
Among all possible Bruhat intervals in any Coxeter group, find an interval with the smallest number of elements whose Kazhdan-Lusztig polynomial does not equal $1$. How many cover relations does this interval have?
Algebraic Geometry, Matroid Theory
Let $L$ denote the log-canonical bundle on $\overline{M}_{0,20}$ over a field of characteristic $2$. Compute the dimension of $H^{17}(\overline{M}_{0,20}, L^{-1})$.