Frontier LLMs memorize financial data with near-perfect recall
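NumLeak finds that top-tier LLMs recall public benchmark series such as the Fama-French market excess return at pooled Pearson r=0.97-0.99. A Sonnet "date to market-sentiment" regression that correlates with the true Mkt-RF at r=0.74 collapses to r=0.02 once the model's own recall is residualized out, which suggests that apparent financial reasoning is cached pretraining data rather than learned inference.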
June 5, 2026
Summary
If you're building financial or time-series applications on frontier LLMs, memorization can masquerade as capability. Evals on public datasets will overestimate real generalization, and separating recall from reasoning takes residualization tests or white-box logprob checks, not just API benchmark scores.
Implementation verdict
Replaces naive API benchmarking on public financial datasets with NumLeak's dual approach: black-box API probes plus white-box logprob ranking to detect memorization. Requires careful prompt design and residualization testing if you deploy LLMs for financial analysis. Worth implementing now if you ship financial products: the one-line system-prompt defense blocks 99.8% of a non-adaptive single-turn suffix attack set at near-zero utility cost. That figure does not cover adaptive or multi-turn attacks.
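To make the residualization test concrete, here is a minimal sketch in Python with NumPy. It assumes "residualizing out the model's recall" means regressing the model's derived score on its own recalled values and correlating what is left with the true series. That reading is an assumption, not a confirmed description of NumLeak's procedure. All data below is synthetic, and the names `residualize`, `recall`, and `sentiment` are hypothetical.

```python
import numpy as np

def residualize(y, x):
    """Return the part of y left over after a linear fit on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-ins, NOT real market data or real model output.
rng = np.random.default_rng(0)
n = 2000
truth = rng.normal(0.0, 4.5, n)                      # "true" monthly return, in percent
recall = truth + rng.normal(0.0, 0.5, n)             # value the model recalls for each month
sentiment = 0.2 * recall + rng.normal(0.0, 0.8, n)   # score driven only by recall plus noise

raw_r = pearson(sentiment, truth)                         # looks like predictive skill
resid_r = pearson(residualize(sentiment, recall), truth)  # skill left once recall is removed

print(f"raw r = {raw_r:.2f}, residualized r = {resid_r:.2f}")
```

In this toy setup the score carries no information beyond the recalled value, so the raw correlation is sizeable while the residualized one sits near zero. A score that reflected real inference would keep a meaningful correlation after residualization.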
Sources
1. Top-tier frontier LLMs recall the Fama-French market excess return at 3-seed pooled Pearson r=0.97-0.99
2. Parse rate collapses to 21-57% but r stays at approximately 0.99 on months answered
3. A Sonnet "date to market-sentiment" regression that correlates with true Mkt-RF at r=0.74 collapses to r=0.02 once the model's own recall is residualized out
4. A one-line system-prompt defense blocks 99.8% of a non-adaptive single-turn suffix attack set at near-zero utility cost
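Sources 1 and 2 report two numbers side by side: how often the model returns a usable value (parse rate) and how well the values it does return track the truth (Pearson r on answered months). The sketch below shows one way to score a black-box probe on both. It assumes each reply is meant to contain a single numeric value. The function names, the regex, and the sample replies are hypothetical illustrations, not NumLeak's harness.

```python
import math
import re

# First signed decimal number in the reply. This is a simplification:
# a refusal that mentions a year would be misread as an answer.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

def parse_value(response):
    """Return the first number in a reply, or None if there is none."""
    match = NUMBER.search(response)
    return float(match.group()) if match else None

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def score_recall(responses, truth):
    """Return (parse_rate, r), with r computed only over answered months."""
    parsed = [(parse_value(resp), t) for resp, t in zip(responses, truth)]
    answered = [(p, t) for p, t in parsed if p is not None]
    parse_rate = len(answered) / len(parsed)
    if len(answered) < 3:
        return parse_rate, float("nan")
    r = pearson([p for p, _ in answered], [t for _, t in answered])
    return parse_rate, r

# Made-up replies and made-up "true" values, for illustration only.
responses = ["-9.24", "I can't recall that month.", "About 1.3 percent", "4.10", "N/A", "-2.6"]
truth = [-9.3, 0.5, 1.2, 4.0, -1.0, -2.5]

parse_rate, r = score_recall(responses, truth)
print(f"parse rate = {parse_rate:.2f}, r on answered months = {r:.3f}")
```

Reporting the two numbers separately matters because a low parse rate can hide recall: refusals lower coverage without lowering r on the months the model does answer.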
Dev Signal