memorization-detection financial-llms benchmark-contamination prompt-injection evaluation-methodology

Frontier LLMs memorize financial data with near-perfect recall

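NumLeak finds that top-tier LLMs recall public benchmark series such as the Fama-French market excess return at r=0.97-0.99. A Sonnet date-to-sentiment regression that correlates with true returns at r=0.74 collapses to r=0.02 once the model's own recall is residualized out, which suggests that apparent financial reasoning on such data is cached pretraining data, not learned inference.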

June 5, 2026

Summary

If you're building financial or time-series applications on frontier LLMs, memorization can masquerade as capability. Evals on public datasets will overestimate real generalization, and probing a model's actual reasoning requires white-box validation or prompt defenses, not just API calls.

Why it matters

Implementation verdict

Replaces naive API benchmarking on public financial datasets with NumLeak's dual approach: black-box API probes plus white-box logprob ranking to detect memorization. Deploying LLMs for financial analysis requires careful prompt design and residualization testing. Worth implementing now if you are shipping financial products: the one-line system-prompt defense blocks 99.8% of a non-adaptive, single-turn suffix attack set at near-zero utility cost. That figure does not cover adaptive or multi-turn attacks.
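
The residualization test mentioned above can be sketched as follows. This is a minimal illustration, not NumLeak's actual code: the data are synthetic, the helper names (`pearson_r`, `residualize`) are hypothetical, and it assumes the check is done by regressing the model-derived signal on the model's own recalled values with ordinary least squares, then correlating the residual with the true series.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

def residualize(signal, recall):
    """Return the part of `signal` not linearly explained by `recall`
    (OLS with an intercept). Assumed procedure, not taken from the source."""
    signal = np.asarray(signal, dtype=float)
    recall = np.asarray(recall, dtype=float)
    X = np.column_stack([np.ones_like(recall), recall])
    beta, *_ = np.linalg.lstsq(X, signal, rcond=None)
    return signal - X @ beta

# Synthetic stand-ins (NOT real Fama-French data or real model output).
rng = np.random.default_rng(0)
true_ret = rng.normal(0.0, 4.5, 600)                  # "true" monthly return
recall = true_ret + rng.normal(0.0, 0.5, 600)         # model's recalled value
sentiment = 0.2 * recall + rng.normal(0.0, 0.8, 600)  # signal driven only by recall

raw_r = pearson_r(sentiment, true_ret)
resid_r = pearson_r(residualize(sentiment, recall), true_ret)
print(f"raw r = {raw_r:.2f}, residualized r = {resid_r:.2f}")
```

In this toy setup the signal looks predictive only because it is built from the recalled values, so the correlation largely disappears after residualization. A signal that kept a sizeable correlation after this step would be harder to attribute to recall alone.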

Sources

1. Top-tier frontier LLMs recall the Fama-French market excess return at 3-seed pooled Pearson r=0.97-0.99
2. Parse rate collapses to 21-57% but r stays at approximately 0.99 on months answered
3. A Sonnet "date to market-sentiment" regression that correlates with true Mkt-RF at r=0.74 collapses to r=0.02 once the model's own recall is residualized out
4. A one-line system-prompt defense blocks 99.8% of a non-adaptive single-turn suffix attack set at near-zero utility cost
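
Source 2 separates two quantities that are easy to conflate: how often the model gives a usable number (parse rate) and how accurate it is when it does (r on months answered). A rough sketch of that bookkeeping is below. The reply strings, the strict "bare number" parsing rule, and the function names are all assumptions for illustration, not the benchmark's actual protocol, and the sketch does not pool across seeds.

```python
import re
import numpy as np

# Assumed rule: a reply counts as "parsed" only if it is a bare number,
# optionally followed by a percent sign. Refusals and prose do not parse.
BARE_NUMBER = re.compile(r"\s*([-+]?\d+(?:\.\d+)?)\s*%?\s*")

def parse_reply(text):
    """Return the number in a bare numeric reply, or None if it does not parse."""
    m = BARE_NUMBER.fullmatch(text)
    return float(m.group(1)) if m else None

def recall_stats(replies, truth):
    """Return (parse_rate, Pearson r on the answered subset only)."""
    parsed = [(parse_reply(r), t) for r, t in zip(replies, truth)]
    answered = [(p, t) for p, t in parsed if p is not None]
    parse_rate = len(answered) / len(parsed)
    if len(answered) < 2:
        return parse_rate, float("nan")  # r is undefined with fewer than 2 points
    preds, actual = zip(*answered)
    return parse_rate, float(np.corrcoef(preds, actual)[0, 1])

# Made-up replies and ground truth, purely for illustration.
replies = ["1.23", "I can't recall that month.", "-2.50%", "0.75", "N/A"]
truth = [1.20, 3.10, -2.40, 0.80, -1.00]

parse_rate, r_answered = recall_stats(replies, truth)
print(f"parse rate = {parse_rate:.0%}, r on answered months = {r_answered:.3f}")
```

Reporting both numbers side by side avoids reading a low answer rate as evidence that the model has not memorized the series.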

Dev Signal
