AI systems have fallen short of top mathematicians in a landmark research-level test.

Elite mathematics trial shows that AI still lags behind the world’s best human problem-solvers.

Dubai: Artificial intelligence (AI) has recently shown strong capabilities in solving long-standing puzzles, mastering games, and generating code at superhuman speed. In advanced mathematical research, however, humans still appear to hold the edge.

This was the conclusion of First Proof, a new project that subjected four AI systems to what is considered one of the most challenging mathematics evaluations designed for machines.

The test included ten authentic research-level problems—questions that had recently been solved by mathematicians but had not yet been formally published.

A panel of anonymous experts in each relevant field evaluated the AI-generated solutions. The results, published on the First Proof website on June 10, showed that none of the models reached the level of top human mathematicians.

This marked the first evaluation of its kind to meet three criteria at once: research-level problems, questions not present in training data, and formal assessment by expert mathematicians.

According to Nature, earlier AI benchmarks have often been criticised for including problems that models may have already seen during training, meaning strong performance could reflect memorisation rather than genuine reasoning.

First Proof addressed this limitation by selecting problems directly from researchers’ unpublished work, making it highly unlikely that the models had encountered them before.

Who was in the running?

OpenAI was the only major tech company to participate, entering its commercially available model, ChatGPT 5.5 Pro.

The other three systems came from academic teams at:

University of California, Los Angeles
Princeton University
ETH Zurich

Notably absent were Google’s Aletheia, a system designed specifically for mathematical problem-solving, and the full, unreleased version of Anthropic’s Claude Mythos. First Proof required independent verification that no human assistance was involved. Because human assistance could not be ruled out for these models, they could not be formally included.

The participating systems also showed a familiar limitation: hallucinations. Even when instructed to carefully verify their references, the AI models still produced incorrect outputs—an issue that remains especially problematic in mathematics, where precision is critical.

AI shows promise, but mathematicians remain ahead

In May, an OpenAI chatbot gained attention for solving an 80-year-old problem posed by the late Hungarian mathematician Paul Erdős, a result that sparked excitement about AI’s growing capabilities in pure mathematics.

However, the First Proof findings place that achievement in context: solving a long-standing historical problem is very different from tackling entirely new, unpublished research-level mathematical questions—an area where human mathematicians still lead.

First Proof carried out an initial trial in February using a different set of questions. While anyone was free to test their preferred AI system and many did, the results were not formally validated. This was addressed in the June round, which introduced structured oversight to ensure a more meaningful comparison.

The First Proof team views future iterations of the test as a useful tool for the mathematics community—not to determine whether AI will replace mathematicians, but to better understand how it could assist them.

Possible applications include checking proofs for errors, suggesting new lines of inquiry, and eventually solving narrowly defined problems autonomously. However, for now, the conclusion from the test is clear: human mathematicians remain ahead.
