MA-ProofBench exposes that GPT-4.5 achieves only 16% on undergraduate-level formal mathematical analysis. Mathlib hallucinations and incomplete proofs are the dominant failure modes blocking theorem-proving automation.
June 15, 2026
Summary
If you're building formal verification into AI workflows, this benchmark reveals the real ceiling: current LLMs struggle with mathematical rigor beyond algebra. You need explicit fallback strategies and validation layers, not just prompting.
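A validation layer with a fallback might look roughly like the sketch below. It is illustrative only and not taken from the benchmark: the names `classify`, `check_with_lean`, and `prove_with_fallback` are hypothetical, the checker assumes an existing Lake project that depends on Mathlib, and the matched compiler message fragments are assumptions that vary across Lean versions.

```python
import re
import subprocess
import tempfile
from pathlib import Path

# `sorry` and `admit` leave a goal unproven in Lean 4. Matching them in the raw
# source is deliberately conservative: it also flags mentions inside comments.
INCOMPLETE = re.compile(r"\b(sorry|admit)\b")


def classify(source: str, returncode: int, output: str) -> str:
    """Map one checker run to 'verified', 'incomplete', 'hallucination' or 'error'."""
    text = output.lower()  # message casing differs between Lean versions (assumption)
    if INCOMPLETE.search(source) or "uses 'sorry'" in text:
        return "incomplete"
    if "unknown identifier" in text or "unknown constant" in text:
        # The proof cites a name that does not exist, e.g. an invented Mathlib lemma.
        return "hallucination"
    return "verified" if returncode == 0 else "error"


def check_with_lean(source: str, project_dir: str) -> str:
    """Compile `source` inside `project_dir` (assumed: a Lake project with Mathlib)."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".lean", dir=project_dir, delete=False
    ) as handle:
        handle.write(source)
        path = Path(handle.name)
    try:
        run = subprocess.run(
            ["lake", "env", "lean", path.name],
            cwd=project_dir,
            capture_output=True,
            text=True,
            timeout=300,
        )
        return classify(source, run.returncode, run.stdout + run.stderr)
    except subprocess.TimeoutExpired:
        return "error"
    finally:
        path.unlink(missing_ok=True)


def prove_with_fallback(candidates, checker):
    """Return the first candidate the checker verifies, or None to escalate."""
    for source in candidates:
        if checker(source) == "verified":
            return source
    return None  # fallback: route the statement to a human or another prover
```

The point of the design is that nothing the model writes is trusted until the proof checker accepts it, and a `None` result is treated as a normal outcome rather than an exception, which matters when most attempts are expected to fail.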
Implementation verdict
Replaces nothing yet: the benchmark itself is the deliverable. Requires access to Lean/Coq formalization infrastructure and tolerance for an 84% failure rate on undergraduate-level problems. Not ready for production theorem proving; useful as a progress tracker and as a failure-analysis dataset for your own model training.
Sources
Dev Signal