Summary
Claude finds 75.4% of the vulnerabilities Snyk Code reports, but its extra findings vary widely between runs: nearly 50% of LLM-only reports appear in only one of five identical runs, while findings that match the reference set stay stable at 80%+ consistency.
Why it matters
If your CI/CD uses LLMs for security review before human code inspection, unrepeatable findings create noise in diffs and false confidence in coverage. You need to know whether the agent will flag the same issue twice or miss it on the next scan.
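One way to check repeatability on your own code is to run the same review several times and count how often each finding recurs. The sketch below is illustrative only: the finding key (file path plus rule label) and the function names are assumptions, not part of the benchmark.

```python
from collections import Counter

# Hypothetical finding identity: (file path, rule or CWE label).
Finding = tuple[str, str]


def consistency(runs: list[set[Finding]]) -> dict[Finding, float]:
    """Return the fraction of runs in which each finding appears."""
    counts = Counter(f for run in runs for f in run)
    return {f: n / len(runs) for f, n in counts.items()}


def one_off_share(runs: list[set[Finding]]) -> float:
    """Return the share of distinct findings that appear in exactly one run."""
    counts = Counter(f for run in runs for f in run)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n == 1) / len(counts)
```

A high one-off share on your own repository would suggest the same noise pattern described here; a low one would suggest the agent's output is stable enough to diff between scans.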
Implementation verdict
Do not replace SAST with LLM security review; combine them. LLMs reach 75.4% F1 against Snyk Code's reference set, a 24.6-point gap to the deterministic baseline, and the unmatched findings are too noisy for the LLM to be used alone. Keep SAST as the deterministic check and use Claude for unfamiliar patterns and prose explanations of risk. The benchmark covers only small Express apps, so test your own codebase before making LLM review a gating check.
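A minimal sketch of what combining the two could look like, assuming findings have already been normalized to comparable string IDs. The function names and the `min_runs` threshold are hypothetical choices for illustration, not values from the benchmark: SAST findings gate the build, while LLM-only findings are surfaced as advisory and only when they recur across runs.

```python
def f1_against_reference(llm: set[str], reference: set[str]) -> float:
    """F1 of one LLM run, treating the SAST output as the reference set."""
    matched = len(llm & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(llm)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


def triage(sast: set[str], llm_runs: list[set[str]], min_runs: int = 3):
    """Split findings into blocking (SAST) and advisory (recurring LLM-only).

    min_runs is an assumed threshold: an LLM-only finding must appear in
    at least this many runs to be shown to reviewers.
    """
    llm_only: dict[str, int] = {}
    for run in llm_runs:
        for finding in run - sast:
            llm_only[finding] = llm_only.get(finding, 0) + 1
    blocking = sorted(sast)
    advisory = sorted(f for f, n in llm_only.items() if n >= min_runs)
    return blocking, advisory
```

Measuring `f1_against_reference` on your own repositories, rather than relying on the small Express apps in the benchmark, gives a better basis for deciding how much weight the advisory list deserves.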
Sources
Dev Signal