Benchmarks in Leipzig
A benchmark set of research-level problems, compiled by 49 researchers to test the capabilities and limitations of large language models in mathematics research.
Hosted at the Max Planck Institute for Mathematics in the Sciences.
Organized by Veronica Calvo Cortes (MPI MiS), Christian Stump (Ruhr-Universität Bochum), and Bernd Sturmfels (MPI MiS).
The Leipzig Benchmark
100 research-level problems to which we know the answers. See our arXiv paper for details.
See the Leipzig Benchmark for the models' performance.
The Challenging Problems
Starting with the Leipzig Benchmark problems on which all of our AI solution attempts failed, we present here problems that publicly available models do not appear to solve.
History
| Date | Event |
|---|---|
| 2026-05-26 | Updated the sample problems to the first two questions from the Leipzig Benchmark that remained unsolved after the three-stage evaluation process. |