ScienceBench|Project Benchmarks

model data provided by Surge AI

Published on September 1, 2025

100 research-level problems

contributing researchers

Main areas: Algebra & Combinatorics

Model Name	Model Type	Correct Answer
GPT-5	Active Model	43%
DeepSeek-V3.1	Active Model	34%
Grok-4	Active Model	34%
o3	Active Model	32%
Gemini 2.5 Pro	Active Model	29%
DeepSeek R1	Legacy Model	27%
o3-mini	Legacy Model	22%
Gemini 2.5 Flash	Legacy Model	18%
Claude Opus 4.1	Active Model	15%
Claude Sonnet 4	Legacy Model	9%

Based on 100 submissions that stump at least 1 active model. All models were queried via the API, using the strongest available version.
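The selection rule in the caption above can be sketched in a few lines. This is only an illustration under stated assumptions: the data layout (one dict per submission, mapping a model name to whether its answer was correct), the function names `filter_stumping` and `accuracy`, and the placeholder model names "A", "B", "C" are all hypothetical and do not describe the benchmark's actual pipeline.

```python
from typing import Dict, List, Set

def filter_stumping(results: List[Dict[str, bool]],
                    active: Set[str], k: int) -> List[Dict[str, bool]]:
    """Keep the submissions on which at least k active models answered incorrectly.

    Assumption: a model missing from a row is counted as having failed.
    """
    kept = []
    for row in results:
        failures = sum(1 for model in active if not row.get(model, False))
        if failures >= k:
            kept.append(row)
    return kept

def accuracy(results: List[Dict[str, bool]], model: str) -> float:
    """Percentage of the given submissions that `model` answered correctly."""
    if not results:
        return 0.0
    return 100.0 * sum(row.get(model, False) for row in results) / len(results)

# Toy data: "A" and "B" play the role of active models, "C" of a legacy model.
toy = [
    {"A": True,  "B": False, "C": True},
    {"A": False, "B": False, "C": True},
    {"A": True,  "B": True,  "C": True},
]
active = {"A", "B"}

at_least_1 = filter_stumping(toy, active, 1)  # keeps the first two rows
at_least_2 = filter_stumping(toy, active, 2)  # keeps only the second row
print(accuracy(at_least_1, "A"), accuracy(at_least_2, "C"))
```

Raising the threshold from 1 to 2 shrinks the problem pool to harder submissions, which is consistent with the lower percentages reported for the stricter filter.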

Model Name	Model Type	Correct Answer
GPT-5	Active Model	35%
DeepSeek-V3.1	Active Model	26%
Grok-4	Active Model	26%
o3	Active Model	23%
DeepSeek R1	Legacy Model	22%
Gemini 2.5 Pro	Active Model	21%
o3-mini	Legacy Model	17%
Gemini 2.5 Flash	Legacy Model	11%
Claude Opus 4.1	Active Model	8%
Claude Sonnet 4	Legacy Model	6%

Based on 80 submissions that stump at least 2 active models. All models were queried via the API, using the strongest available version.

Sample Problems

Number Theory

How many mutually non-isomorphic extensions of degree 4 and Galois group of order at most $8$ does the field $\mathbb{Q}_2$ of $2$-adic numbers admit?

Algebraic Geometry

The graded pieces (by codimension) of the Chow ring $A^\bullet (\mathcal{F})$ of the two-step flag variety $\mathcal{F}=\operatorname{Fl}(2,4;\mathbb{C}^6)$ are vector spaces generated by Schubert classes indexed by two subsets of $0,...,5$: the first of size $2$ and the second of size $4$ containing the former. Let $X$ be a variety of class $[1,5; 1,3,4,5]$ and $Y$ be of class $[3,5; 1,3,4,5]$. (For example, the former means lines $L\in X$ meet a $\mathbb{P}^1$, and there are no additional conditions on the projective $3$-planes containing $L$.) Suppose the intersection of $X$ and $Y$ is transversal. What is the class of the intersection $X\cap Y$ written in terms of the basis of $A^k(\mathcal{F})$ of correct codimension $k$?

Metric Geometry

Let $B$ be the unit ball in $\mathbb{R}^{65}$ with respect to the standard Euclidean norm. What is the smallest natural number $r$ such that there exist Hermitian $r\times r$ matrices $A_0,\ldots,A_{65}$ with $B=\{p\in\mathbb{R}^{65}\mid A_0+p_1\cdot A_1+\cdots+p_{65}\cdot A_{65}\textrm{ is positive semidefinite}\}$?

Published on November 1, 2025

209 research-level problems

contributing researchers

Main areas: Algebra, Combinatorics & Geometry

Model Name	Model Type	Correct Answer
GPT-5	Active Model	43%
Grok-4	Active Model	32%
o3	Active Model	32%
DeepSeek-V3.1	Active Model	31%
Gemini 2.5 Pro	Active Model	26%
DeepSeek R1	Legacy Model	23%
o3-mini	Legacy Model	23%
Gemini 2.5 Flash	Legacy Model	14%
Claude Opus 4.1	Active Model	12%
Claude Sonnet 4	Legacy Model	12%

Based on 200 submissions that stump at least 1 active model. All models were queried via the API, using the strongest available version.

Model Name	Model Type	Correct Answer
GPT-5	Active Model	36%
Grok-4	Active Model	27%
o3	Active Model	26%
DeepSeek-V3.1	Active Model	25%
DeepSeek R1	Legacy Model	20%
Gemini 2.5 Pro	Active Model	20%
o3-mini	Legacy Model	19%
Claude Sonnet 4	Legacy Model	9%
Gemini 2.5 Flash	Legacy Model	9%
Claude Opus 4.1	Active Model	8%

Based on 180 submissions that stump at least 2 active models. All models were queried via the API, using the strongest available version.

Sample Problems

Matroid Theory, Topology

Let $M = (E, \mathcal{B})$ be a matroid. Let $\mathbb{P}L_{M}$ be the projectivization of the space of Lorentzian polynomials whose support is precisely equal to the collection $\mathcal{B}$. When $M = F_{7}$, the Fano matroid, find the dimension of $\mathbb{P}L_{M}$.

Homological Algebra

Let $Q = 1 \longrightarrow 2 \longrightarrow 3 \longleftarrow 4$. Consider $M = X_{[1,2]}^2 \oplus X_{[1,3]}^3 \oplus X_{[1,4]}^2 \oplus X_{[2]} \oplus X_{[2,3]}^2 \oplus X_{[2,4]}$. Calculate the generic Jordan form data of $M$.

Combinatorics, Discrete Geometry

In a card game each card has three attributes: color, number and shape. Each of these attributes has 4 possible variants. Color: red, blue, green, yellow. Number: 1, 2, 3, 4. Shape: circle, square, triangle, cross. All possible cards appear once in the deck. Define a "quartet" to be a set of four cards from the deck above such that for each of the three attributes, one of the following three possibilities holds for the four values: all four values are the same, all four values are different, or there are exactly two pairs of values. What is the minimum number of cards that must be randomly drawn from the deck so that one can guarantee that there is always at least one quartet among them?
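The quartet condition can be made concrete with a small predicate. The sketch below only checks whether four given cards form a quartet; it does not attempt to solve the problem. The names `DECK`, `ALLOWED`, and `is_quartet`, and the encoding of a card as a `(color, number, shape)` tuple, are illustrative choices not taken from the problem statement.

```python
from collections import Counter
from itertools import product
from typing import Sequence, Tuple

Card = Tuple[str, int, str]  # (color, number, shape)

COLORS = ("red", "blue", "green", "yellow")
NUMBERS = (1, 2, 3, 4)
SHAPES = ("circle", "square", "triangle", "cross")

# Every combination appears exactly once: 4 * 4 * 4 = 64 cards.
DECK = list(product(COLORS, NUMBERS, SHAPES))

# Allowed multiplicity patterns per attribute:
# all four equal, all four different, or exactly two pairs.
ALLOWED = {(4,), (1, 1, 1, 1), (2, 2)}

def is_quartet(cards: Sequence[Card]) -> bool:
    """Return True if the four distinct cards satisfy the quartet condition."""
    if len(set(cards)) != 4:
        return False
    for attr in range(3):
        pattern = tuple(sorted(Counter(card[attr] for card in cards).values()))
        if pattern not in ALLOWED:
            return False
    return True

same_color = [("red", 1, "circle"), ("red", 2, "square"),
              ("red", 3, "triangle"), ("red", 4, "cross")]
two_pairs = [("red", 1, "circle"), ("red", 1, "square"),
             ("blue", 2, "circle"), ("blue", 2, "square")]
three_and_one = [("red", 1, "circle"), ("red", 2, "circle"),
                 ("red", 3, "circle"), ("blue", 4, "circle")]

print(is_quartet(same_color), is_quartet(two_pairs), is_quartet(three_and_one))
# → True True False
```

The third example fails because its colors split three to one, which is none of the three permitted patterns.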

Published on November 26, 2025

140 research-level problems

contributing researchers

Main areas: Algebra, Combinatorics & Geometry

Now including Gemini 3 Pro, GPT-5.1, and Claude Opus 4.5, each performing significantly better than its predecessor.

Model Name	Model Type	Correct Answer
Gemini 3 Pro	Active Model	46%
GPT-5.1	Active Model	41%
GPT-5	Legacy Model	35%
Grok-4	Active Model	23%
o3	Active Model	23%
Claude Opus 4.5	Active Model	20%
DeepSeek-V3.1	Active Model	20%
Gemini 2.5 Pro	Legacy Model	20%
o3-mini	Legacy Model	19%
DeepSeek R1	Legacy Model	18%
Claude Opus 4.1	Legacy Model	11%
Gemini 2.5 Flash	Legacy Model	10%
Claude Sonnet 4	Legacy Model	8%

Based on 140 submissions that stump at least 1 active model. All models were queried via the API, using the strongest available version.

Model Name	Model Type	Correct Answer
Gemini 3 Pro	Active Model	40%
GPT-5.1	Active Model	33%
GPT-5	Legacy Model	29%
o3	Active Model	17%
Grok-4	Active Model	16%
DeepSeek R1	Legacy Model	15%
Gemini 2.5 Pro	Legacy Model	15%
o3-mini	Legacy Model	15%
DeepSeek-V3.1	Active Model	14%
Claude Opus 4.5	Active Model	13%
Claude Opus 4.1	Legacy Model	7%
Claude Sonnet 4	Legacy Model	7%
Gemini 2.5 Flash	Legacy Model	6%

Based on 130 submissions that stump at least 2 active models. All models were queried via the API, using the strongest available version.

Sample Problems

Commutative Algebra, Algebraic Combinatorics

Let $R=\mathbb{C}[x_1,\ldots,x_6,y_1,\ldots,y_6]$ with bigrading $\deg(x_i)=(1,0)$, $\deg(y_i)=(0,1)$. Let $I\subset R$ be the bihomogeneous ideal generated by the $2$-minors of the generic matrix \[\begin{pmatrix} x_1 & x_2 & \cdots & x_6 \\ y_1 & y_2 & \cdots & y_6\end{pmatrix}.\] What is the dimension of the $\mathbb{C}$-vector space $(I^2/I^3)_{(5,2)}$?

Metric Geometry

Let $B$ be the unit ball in $\mathbb{R}^{8193}$ with respect to the standard Euclidean norm. What is the smallest natural number $r$ such that there exist Hermitian $r\times r$ matrices $A_0,\ldots,A_{8193}$ with $B=\{p\in\mathbb{R}^{8193}\mid A_0+p_1\cdot A_1+\cdots+p_{8193}\cdot A_{8193}\textrm{ is positive semidefinite}\}$?

Algebraic Combinatorics, Commutative Algebra

Let $W$ be the Weyl algebra $\mathbb{C}\left<D,X \mid DX - XD = 1\right>$; this is the $\mathbb{C}$-algebra with two generators $D$ and $X$ and a single relation $DX - XD = 1$. Given an integer $n \ge 0$, we define an *$n$-monomial element* to be an element of $W$ that can be written as a product of $n$ generators from the set $\left\{D,X\right\}$, i.e., as $G_1G_2\cdots G_n$ where each $G_i \in \left\{D,X\right\}$. Note that some $n$-monomial elements can be written in several ways in such a form (for instance, $DXXD = XDDX$). Let $a_n$ denote the number of $n$-monomial elements. Find $a_{11} + a_{12}$.
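The remark that different words can represent the same element can be checked mechanically. The sketch below rewrites a word in $D$ and $X$ into the normally ordered form $\sum c_{ij} X^i D^j$ using only the relation $DX - XD = 1$, and confirms the example $DXXD = XDDX$ from the statement. Since the monomials $X^i D^j$ form a basis of $W$ over $\mathbb{C}$, two words represent the same element exactly when their normal forms agree. The names `times_generator` and `normal_form` and the dict representation are illustrative assumptions; the sketch does not compute $a_n$.

```python
from collections import defaultdict
from typing import Dict, Tuple

# An element is stored as {(i, j): c}, meaning the sum of c * X^i D^j.
Element = Dict[Tuple[int, int], int]

def times_generator(elem: Element, gen: str) -> Element:
    """Right-multiply a normally ordered element by the generator 'D' or 'X'."""
    out: Dict[Tuple[int, int], int] = defaultdict(int)
    for (i, j), c in elem.items():
        if gen == "D":
            out[(i, j + 1)] += c
        elif gen == "X":
            # D^j X = X D^j + j D^(j-1), which follows from DX - XD = 1.
            out[(i + 1, j)] += c
            if j > 0:
                out[(i, j - 1)] += c * j
        else:
            raise ValueError(f"unknown generator: {gen!r}")
    return {key: value for key, value in out.items() if value != 0}

def normal_form(word: str) -> Element:
    """Normal form of the product G_1 G_2 ... G_n given as a string over 'D', 'X'."""
    elem: Element = {(0, 0): 1}
    for gen in word:
        elem = times_generator(elem, gen)
    return elem

print(normal_form("DX"))    # X D + 1
print(normal_form("DXXD") == normal_form("XDDX"))
# → True
```

Both words in the last line reduce to $X^2D^2 + 2XD$, which is why they count as a single $4$-monomial element.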

Published on March 23, 2026

140 research-level problems

contributing researchers

subfields included

Now including the models Gemini 3.1 Pro and GPT-5.4, each performing significantly better than its predecessor. Since our first benchmark in September 2025:
• OpenAI's best model improved from 43% (GPT-5) to 68% (GPT-5.4)
• Google's best model improved from 29% (Gemini 2.5 Pro) to 60% (Gemini 3.1 Pro)

Model Name	Model Type	Correct Answer
GPT-5.4	Active Model	68%
Gemini 3.1 Pro	Active Model	60%
GPT-5.2	Legacy Model	52%
Gemini 3 Pro	Legacy Model	44%
GPT-5.1	Legacy Model	38%
GPT-5	Legacy Model	31%
DeepSeek-V3.2	Active Model	23%
Grok-4	Legacy Model	23%
o3	Legacy Model	22%
Gemini 2.5 Pro	Legacy Model	19%
Claude Opus 4.5	Active Model	18%
Grok-4.1	Active Model	17%

Based on 140 submissions that stump at least 1 active model. All models were queried via the API, using the strongest available version.

Model Name	Model Type	Correct Answer
GPT-5.4	Active Model	62%
Gemini 3.1 Pro	Active Model	51%
GPT-5.2	Legacy Model	45%
Gemini 3 Pro	Legacy Model	36%
GPT-5.1	Legacy Model	31%
GPT-5	Legacy Model	26%
o3	Legacy Model	18%
Grok-4	Legacy Model	16%
Gemini 2.5 Pro	Legacy Model	14%
DeepSeek-V3.2	Active Model	12%
Grok-4.1	Active Model	10%
Claude Opus 4.5	Active Model	7%

Based on 120 submissions that stump at least 2 active models. All models were queried via the API, using the strongest available version.

Contributing Subfields

Algebraic Combinatorics 32

Algebra 19

Combinatorics 17

Discrete Geometry 11

Algebraic Geometry 10

Enumerative Combinatorics 7

Homological Algebra 7

Matroid Theory 6

Commutative Algebra 3

Group Theory 3

Metric Geometry 3

Symmetric Function Theory 3

Topology 3

Algebraic Statistics 2

Analysis 2

Geometry 2

Graph Theory 2

Lie Theory 2

Complex Analysis 1

Euclidean Geometry 1

Monoid Theory 1

Number Theory 1

Partial Differential Equations 1

Polytope Theory 1

Probability Theory 1

Real Algebraic Geometry 1

Theoretical Computer Science 1

Tropical Geometry 1

Sample Problems

Graph Theory

What is the number of connected, simple graphs with exactly 1 cycle, 100 vertices and all vertex degrees at most 3? To clarify, the graphs are unlabelled and not necessarily planar.

Algebraic Statistics

Let $f_1,...,f_m$ be generic forms of degree 2 in $n$ complex variables $x_1,...,x_n$. Consider the *log-likelihood function* $\ell = \sum_{i=1}^m s_i \log(f_i)$. Here the $s_1,...,s_m$ are considered as complex variables too. This function is well-defined on the complement of the hypersurface arrangement defined by the $f_i$. The *likelihood correspondence* $L$ is the Zariski closure in $\mathbb{P}^{n-1}\times \mathbb{P}^{m-1}$ of the critical locus of the likelihood equations. Precisely \[ L = \overline{\left\lbrace (x, s)\in \mathbb{C}^{n}\times \mathbb{C}^{m} : \frac{\partial \ell}{\partial x_i}(x,s)=0, i=1,\dots,n, \prod_{i=1}^m f_i^{s_i(x)} \neq 0,\, F(x)\in X_{r} \right\rbrace}, \] where $X$ is the Zariski-closure of the image of $F\colon\mathbb{C}^n \rightarrow\mathbb{C}^{m}, x \mapsto (f_1(x),\dots, f_m(x))$, and $X_{r}$ is its set of nonsingular points. Determine the dimension of the likelihood correspondence for $m=8$ and $n=5$.

Discrete Geometry

How many distinct combinatorial types of 3-dimensional polytopes can you obtain by intersecting the 4-dimensional cube with an affine hyperplane?

Published on April 11, 2026

research-level problems

Multi-run protocol: 20 runs / problem

Model runs performed by: Surge AI

Observations:
• On 32 of the 50 problems, the three standard models' behavior is nearly identical.
• On 18 of the 50 problems, the models gave strongly divergent results.
• The higher-compute variants clearly outperform their standard counterparts.

We thank Surge AI for their support with generating and providing the data.

Model Name	Model Type	Correct Answer
GPT-5.4 Pro	Active Model	60% (3 runs)
Gemini 3.1 Pro Deep Think	Active Model	46% (3 runs)
GPT-5.4	Active Model	35% (20 runs)
Gemini 3.1 Pro	Active Model	35% (20 runs)
Claude Opus 4.6	Active Model	33% (20 runs)

Based on 50 problems, evaluated with multiple independent runs.

Published on April 30, 2026

120 research-level problems

contributing researchers

subfields included

Model Name	Model Type	Correct Answer
GPT-5.5	Active Model	64%
GPT-5.4	Legacy Model	58%
Claude Opus 4.7	Active Model	41%
GPT-5.2	Legacy Model	40%
Gemini 3.1 Pro	Active Model	38%
DeepSeek V4 Pro	Active Model	32%
Claude Opus 4.6	Legacy Model	27%
Gemini 3 Pro	Legacy Model	24%
DeepSeek-V3.2	Legacy Model	13%
Grok-4.1	Legacy Model	11%
Grok-4.20	Active Model	11%

Based on 120 submissions that stump at least 1 active model. All models were queried via the API, using the strongest available version.

Model Name	Model Type	Correct Answer
GPT-5.5	Active Model	53%
GPT-5.4	Legacy Model	45%
GPT-5.2	Legacy Model	29%
Claude Opus 4.7	Active Model	23%
Gemini 3.1 Pro	Active Model	19%
Claude Opus 4.6	Legacy Model	15%
DeepSeek V4 Pro	Active Model	14%
DeepSeek-V3.2	Legacy Model	9%
Gemini 3 Pro	Legacy Model	9%
Grok-4.20	Active Model	9%
Grok-4.1	Legacy Model	6%

Based on 90 submissions that stump at least 2 active models. All models were queried via the API, using the strongest available version.

Contributing Subfields

Algebraic Combinatorics 40

Combinatorics 20

Algebra 18

Algebraic Geometry 18

Enumerative Combinatorics 12

Matroid Theory 12

Discrete Geometry 11

Homological Algebra 11

Commutative Algebra 4

Graph Theory 4

Group Theory 4

Representation Theory 4

Algebraic Statistics 3

Analysis 3

Lie Theory 3

Metric Geometry 3

Number Theory 3

Symmetric Function Theory 3

Geometry 2

Topology 2

Complex Analysis 1

Euclidean Geometry 1

Monoid Theory 1

Partial Differential Equations 1

Polytope Theory 1

Probability Theory 1

Real Algebraic Geometry 1

Theoretical Computer Science 1

Tropical Geometry 1

Sample Problems

Number Theory, Analysis

For $n$ a positive integer define $V(n)$ to be the integer obtained by using the base 10 digits of $n$ in base 11. I want to evaluate the series $\sum_{p \text{ prime}} \frac{1}{V(p)}$ to an accuracy of $10^{-5}$.
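The definition of $V(n)$ is easy to misread, so a direct transcription may help. The sketch below only implements the map $n \mapsto V(n)$ by reading the decimal digit string of $n$ as a base-11 numeral. It makes no attempt to evaluate the series, and the function name `V` simply mirrors the notation of the problem.

```python
def V(n: int) -> int:
    """Read the base-10 digits of n as a base-11 numeral."""
    if n <= 0:
        raise ValueError("V is defined for positive integers only")
    # str(n) contains only the digits 0-9, all of which are valid in base 11.
    return int(str(n), 11)

print(V(7), V(10), V(13), V(101))
# → 7 11 14 122
```

For example, $13$ has digits $1$ and $3$, so $V(13) = 1\cdot 11 + 3 = 14$. Single-digit numbers are unchanged.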

Algebraic Combinatorics

Among all possible Bruhat intervals in any Coxeter group, find an interval with the smallest number of elements whose Kazhdan-Lusztig polynomial does not equal $1$. How many cover relations does this interval have?

Algebraic Geometry, Matroid Theory

Let $L$ denote the log-canonical bundle on $\overline{M}_{0,20}$ over a field of characteristic $2$. Compute the dimension of $H^{17}(\overline{M}_{0,20}, L^{-1})$.

The Leipzig Benchmark

100 research-level problems

contributing researchers

subfields included

Research-level problems, contributed by researchers at the Benchmarks in Leipzig event.

Download problem set as JSON

Submit your solutions to christian@sciencebench.ai

Stage 1: single-attempt evaluation

Model Name	Model Type	Correct Answer
GPT-5.5	Active Model	44%
GPT-5.4	Legacy Model	28%
Gemini 3.1 Pro	Active Model	15%
Claude Opus 4.6	Legacy Model	14%
Gemini 3 Pro	Legacy Model	14%
Claude Opus 4.7	Active Model	13%
DeepSeek V4 Pro	Active Model	10%
DeepSeek-V3.2	Legacy Model	8%
Grok-4.3	Active Model	6%
Grok-4.20	Legacy Model	5%

All models were queried via the API, using the strongest available version.

Stage 2: 20-run evaluation by Surge AI

Model Name	Model Type	Correct Answer
GPT-5.5	Active Model	75% (20 runs)
Claude Opus 4.7	Active Model	44% (20 runs)
Gemini 3.1 Pro	Active Model	40% (20 runs)

All models were queried via the API by Surge AI, 20 times per problem. Percentages report the share of problems solved at least once.
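The "solved at least once" metric described above differs from the average accuracy over runs, and the difference can be illustrated with a toy computation. The sketch assumes a hypothetical layout in which each problem ID maps to a list of per-run outcomes; the function names and the toy data are invented for illustration and are not the benchmark's results.

```python
from typing import Dict, List

def share_solved_at_least_once(runs: Dict[str, List[bool]]) -> float:
    """Percentage of problems with at least one correct run."""
    if not runs:
        return 0.0
    solved = sum(1 for outcomes in runs.values() if any(outcomes))
    return 100.0 * solved / len(runs)

def mean_run_accuracy(runs: Dict[str, List[bool]]) -> float:
    """Percentage of correct runs, pooled over all problems (for comparison)."""
    total = sum(len(outcomes) for outcomes in runs.values())
    correct = sum(sum(outcomes) for outcomes in runs.values())
    return 100.0 * correct / total if total else 0.0

# Toy data with 3 runs per problem (the protocol above uses 20).
toy_runs = {
    "p1": [False, True, False],
    "p2": [False, False, False],
    "p3": [True, True, True],
    "p4": [False, False, True],
}

print(share_solved_at_least_once(toy_runs))   # 3 of 4 problems → 75.0
print(round(mean_run_accuracy(toy_runs), 1))  # 5 of 12 runs → 41.7
```

Because a problem counts as solved if any single run succeeds, this share can never be lower than the pooled per-run accuracy, which is worth keeping in mind when comparing multi-run figures with single-attempt figures.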

Stage 3: deep-thinking models

Model Name	Model Type	Correct Answer
GPT-5.5 Pro (extended thinking)	Active Model	88% (3 runs)
Gemini 3.1 Pro Deep Think	Active Model	56% (3 runs)

Both models were queried with extended thinking enabled. Percentages report the share of problems solved at least once.

Additional statistics are presented in our arXiv paper.

Contributing Subfields

Algebraic Geometry 36

Algebraic Combinatorics 22

Matroid Theory 18

Enumerative Combinatorics 15

Representation Theory 15

Combinatorics 14

Discrete Geometry 10

Algebra 7

Commutative Algebra 5

Graph Theory 5

Algebraic Statistics 4

Homological Algebra 4

Polytope Theory 4

Tropical Geometry 4

Analysis 3

Arithmetic Geometry 3

Knot Theory 3

Number Theory 3

Topology 3

Complex Analysis 2

Euclidean Geometry 2

Metric Geometry 1

Probability Theory 1

Real Algebraic Geometry 1

Theoretical Computer Science 1

Contributed by 49 researchers — see the Benchmarks in Leipzig page for the full list.