AI systems have fallen short of top mathematicians in a landmark research-level test.

Elite mathematics trial shows that AI still lags behind the world’s best human problem-solvers.

Dubai: Artificial intelligence (AI) has recently shown strong capabilities in solving long-standing puzzles, mastering games, and generating code at superhuman speed. In advanced mathematical research, however, humans still appear to hold the edge.

This was the conclusion of First Proof, a new project that subjected four AI systems to what is considered one of the most challenging mathematics evaluations designed for machines.

The test included ten authentic research-level problems—questions that had recently been solved by mathematicians but had not yet been formally published.

A panel of anonymous experts in each relevant field evaluated the AI-generated solutions. The results, published on the First Proof website on June 10, showed that none of the models reached the level of top human mathematicians.

This marked the first evaluation of its kind to meet three criteria at once: research-level problems, questions not present in training data, and formal assessment by expert mathematicians.

According to Nature, earlier AI benchmarks have often been criticised for including problems that models may have already seen during training, meaning strong performance could reflect memorisation rather than genuine reasoning.

First Proof addressed this limitation by selecting problems directly from researchers’ unpublished work, making it highly unlikely that the models had encountered them before.

Who was in the running?

OpenAI was the only major tech company to participate with a commercially available model, its ChatGPT 5.5 Pro.

The other three systems came from academic teams at:

  • University of California, Los Angeles
  • Princeton University
  • ETH Zurich

Notably absent were Google’s Aletheia (a system designed specifically for mathematical problem-solving) and the full, unreleased version of Anthropic’s Claude Mythos. First Proof required independent verification that a system worked without human assistance; because such assistance could not be ruled out for these models, they could not be formally included.

The participating systems also showed a familiar limitation: hallucinations. Even when instructed to carefully verify their references, the AI models still produced incorrect outputs—an issue that remains especially problematic in mathematics, where precision is critical.

AI shows promise, but mathematicians remain ahead

In May, an OpenAI chatbot gained attention for solving an 80-year-old problem posed by the late Hungarian mathematician Paul Erdős, a result that sparked excitement about AI’s growing capabilities in pure mathematics.

However, the First Proof findings place that achievement in context: solving a long-standing historical problem is very different from tackling entirely new, unpublished research-level mathematical questions—an area where human mathematicians still lead.

First Proof carried out an initial trial in February using a different set of questions. While anyone was free to test their preferred AI system and many did, the results were not formally validated. This was addressed in the June round, which introduced structured oversight to ensure a more meaningful comparison.

The First Proof team views future iterations of the test as a useful tool for the mathematics community—not to determine whether AI will replace mathematicians, but to better understand how it could assist them.

Possible applications include checking proofs for errors, suggesting new lines of inquiry, and eventually solving narrowly defined problems autonomously. However, for now, the conclusion from the test is clear: human mathematicians remain ahead.
