AI Performance on Advanced Mathematical Proofs Remains Inconsistent
The "First Proof" project, an initiative led by prominent mathematicians, has released its latest assessment of how large language models (LLMs) handle research-level mathematics. Unlike standard industry benchmarks, which often prioritize speed or basic arithmetic, this project focuses on complex problems that mirror the actual work of professional mathematicians. In the most recent round of testing, top-tier models, including OpenAI’s ChatGPT-5.5 Pro and various academic entries, scored roughly 60 to 70 percent, effectively earning a "C-" for their efforts.
Although the models demonstrated an impressive ability to synthesize literature and apply established mathematical techniques to new contexts, their performance remains unreliable. The researchers noted that the AI can occasionally provide brilliant insights but frequently generates significant errors that require extensive human oversight to correct. The grading process, in which expert mathematicians reviewed the outputs, indicated that the models are currently better suited to serve as research assistants than as autonomous problem solvers.
This evaluation is critical because it challenges the metrics currently used by AI developers to claim mathematical proficiency. By moving away from proprietary, internal benchmarks and toward rigorous, peer-reviewed testing, the First Proof project provides a more transparent look at the limitations of current technology. As AI continues to be integrated into scientific research, these findings underscore the necessity of human verification. They suggest that machines can accelerate the pace of discovery but are not yet capable of replacing the nuanced, error-checking rigor of human mathematicians.