theorem-proving formal-verification benchmark llm-limitations mathematical-reasoning

LLMs fail formal math proofs at scale

MA-ProofBench exposes that GPT-5.5, the best-performing model tested, achieves only 16% (Pass@8, Level I) on undergraduate-level formal mathematical analysis. Mathlib hallucinations and incomplete proofs are the dominant failure modes blocking theorem-proving automation.

June 15, 2026

Summary

If you're building formal verification into AI workflows, this benchmark reveals the real ceiling: current LLMs struggle with mathematical rigor beyond algebra. You need explicit fallback strategies and validation layers, not just prompting.
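As a minimal sketch of what a validation layer with a fallback could look like, the Python below rejects candidate proofs that still contain Lean's `sorry` placeholder (one form of incomplete proof) and returns the first candidate that a verifier accepts. The names `has_placeholder`, `first_verified`, and the `verify` callable are hypothetical; `verify` stands in for whatever actually runs the Lean checker in your setup, and the string check is only a cheap pre-filter, not a substitute for compilation.

```python
import re
from typing import Callable, Iterable, Optional

# `sorry` is Lean's placeholder for an unfinished proof. Matching it as a
# whole word is deliberately conservative: it may also flag a `sorry` that
# appears inside a comment, which is acceptable for a pre-filter.
PLACEHOLDER = re.compile(r"\bsorry\b")


def has_placeholder(proof: str) -> bool:
    """Return True if the proof text contains an unfinished-proof marker."""
    return PLACEHOLDER.search(proof) is not None


def first_verified(
    candidates: Iterable[str],
    verify: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate that passes both checks, else None.

    `verify` is a hypothetical hook: in practice it would invoke the Lean
    toolchain and report whether the proof compiles against Mathlib.
    """
    for proof in candidates:
        if has_placeholder(proof):
            continue  # skip incomplete proofs without paying for a compile
        if verify(proof):
            return proof
    return None  # caller falls back: retry, escalate, or hand off to a human
```

A `None` result is the signal to trigger the fallback path rather than to accept an unverified proof.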

Why it matters

Implementation verdict

Replaces nothing yet; the benchmark itself is the deliverable. Requires access to Lean (Mathlib) formalization infrastructure and tolerance for an 84% failure rate at Pass@8 on the Level I problems. Not ready for production theorem proving; useful as a progress tracker and failure-analysis dataset for your own model training.
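The 84% figure is the complement of the 16% Pass@8 score quoted in the sources. As a rough sketch of how such a score can be tallied, the Python below assumes Pass@k counts a theorem as solved when at least one of its first k sampled proofs verifies; the benchmark's exact protocol is not stated here, and the function name and input layout are illustrative.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of problems with at least one verified proof in the first k attempts.

    `results[i][j]` is True when attempt j on problem i was accepted by the
    proof checker. This is the simple empirical rate, an assumption about
    how the benchmark scores, not its documented method.
    """
    if not results:
        raise ValueError("results must contain at least one problem")
    if k < 1:
        raise ValueError("k must be at least 1")
    solved = sum(1 for attempts in results if any(attempts[:k]))
    return solved / len(results)
```

Under this definition the failure rate is simply `1 - pass_at_k(results, 8)`.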

Sources

1. even the best-performing model, GPT-5.5, achieves only 16% Pass@8 on Level I and 5% on Level II
2. Mathlib hallucinations and incomplete proofs as the two dominant failure modes
3. 200 formalized theorems covering 6 core topics and 27 subcategories
4. the first formal theorem-proving benchmark dedicated to Mathematical Analysis

Dev Signal
