Project Benchmarks

Snapshot dates (model data provided by Surge AI, Leipzig Benchmark):
  1. September 1, 2025
  2. November 1, 2025
  3. November 26, 2025
  4. March 23, 2026
  5. April 11, 2026
  6. April 30, 2026
  7. May 26, 2026

The Leipzig Benchmark

100 research-level problems
49 contributing researchers
25 subfields included

Stage 1: single-attempt evaluation

Model Name         Model Type      Correct Answer
GPT-5.5            Active Model    44%
GPT-5.4            Legacy Model    28%
Gemini 3.1 Pro     Active Model    15%
Claude Opus 4.6    Legacy Model    14%
Gemini 3 Pro       Legacy Model    14%
Claude Opus 4.7    Active Model    13%
DeepSeek V4 Pro    Active Model    10%
DeepSeek-V3.2      Legacy Model    8%
Grok 4.3           Active Model    6%
Grok 4.20          Legacy Model    5%
All models were queried via the API, using the strongest available version.

Stage 2: 20-run evaluation by Surge AI

Model Name         Model Type      Correct Answer
GPT-5.5            Active Model    75% (20 runs)
Claude Opus 4.7    Active Model    44% (20 runs)
Gemini 3.1 Pro     Active Model    40% (20 runs)
All models were queried via the API by Surge AI, 20 times per problem. Percentages report the share of problems solved at least once.
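The "solved at least once" metric can be illustrated with a minimal sketch. The function names and the toy outcome data below are hypothetical and are not taken from the benchmark; the sketch only assumes that each run on each problem is graded as correct or incorrect.

```python
from typing import Dict, List


def share_solved_at_least_once(outcomes: Dict[str, List[bool]]) -> float:
    """Percentage of problems with at least one correct run."""
    if not outcomes:
        raise ValueError("no problems given")
    solved = sum(1 for runs in outcomes.values() if any(runs))
    return 100.0 * solved / len(outcomes)


def single_run_accuracy(outcomes: Dict[str, List[bool]], run_index: int = 0) -> float:
    """Percentage of problems answered correctly in one fixed run."""
    if not outcomes:
        raise ValueError("no problems given")
    correct = sum(1 for runs in outcomes.values() if runs[run_index])
    return 100.0 * correct / len(outcomes)


# Hypothetical grading results: 4 problems, 3 runs each (illustrative only).
example_outcomes = {
    "problem_1": [True, False, False],
    "problem_2": [False, False, False],
    "problem_3": [False, True, True],
    "problem_4": [False, False, False],
}

at_least_once = share_solved_at_least_once(example_outcomes)
first_run = single_run_accuracy(example_outcomes, run_index=0)
print(at_least_once)  # 50.0
print(first_run)      # 25.0
```

Because a problem counts as solved if any run succeeds, this figure can only stay the same or rise as the number of runs grows, so it is not directly comparable to a single-attempt score.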

Stage 3: deep-thinking models

Model Name Model Type Correct Answer
GPT-5.5 Pro (extended thinking)    Active Model    88% (3 runs)
Gemini 3.1 Pro Deep Think          Active Model    56% (3 runs)
Both models were queried with extended thinking enabled. Percentages report the share of problems solved at least once across the 3 runs.

Additional statistics are presented in our arXiv paper.

Contributing Subfields

Algebraic Geometry 36
Algebraic Combinatorics 22
Matroid Theory 18
Enumerative Combinatorics 15
Representation Theory 15
Combinatorics 14
Discrete Geometry 10
Algebra 7
Commutative Algebra 5
Graph Theory 5
Algebraic Statistics 4
Homological Algebra 4
Polytope Theory 4
Tropical Geometry 4
Analysis 3
Arithmetic Geometry 3
Knot Theory 3
Number Theory 3
Topology 3
Complex Analysis 2
Euclidean Geometry 2
Metric Geometry 1
Probability Theory 1
Real Algebraic Geometry 1
Theoretical Computer Science 1
Contributed by 49 researchers; see the Benchmarks in Leipzig page for the full list.
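As a quick consistency check on the subfield list, the sketch below tallies the per-subfield counts. The counts are copied from the list above; the variable names are illustrative. The total exceeds the 100 problems in the benchmark, which suggests that a problem may carry more than one subfield label, though that reading is an inference and is not stated on the page.

```python
# Per-subfield counts as listed under "Contributing Subfields".
subfield_counts = {
    "Algebraic Geometry": 36,
    "Algebraic Combinatorics": 22,
    "Matroid Theory": 18,
    "Enumerative Combinatorics": 15,
    "Representation Theory": 15,
    "Combinatorics": 14,
    "Discrete Geometry": 10,
    "Algebra": 7,
    "Commutative Algebra": 5,
    "Graph Theory": 5,
    "Algebraic Statistics": 4,
    "Homological Algebra": 4,
    "Polytope Theory": 4,
    "Tropical Geometry": 4,
    "Analysis": 3,
    "Arithmetic Geometry": 3,
    "Knot Theory": 3,
    "Number Theory": 3,
    "Topology": 3,
    "Complex Analysis": 2,
    "Euclidean Geometry": 2,
    "Metric Geometry": 1,
    "Probability Theory": 1,
    "Real Algebraic Geometry": 1,
    "Theoretical Computer Science": 1,
}

num_subfields = len(subfield_counts)          # matches "25 subfields included"
total_labels = sum(subfield_counts.values())  # more than the 100 problems

print(num_subfields)  # 25
print(total_labels)   # 186
```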