security-scanning llm-reliability benchmark agentic-ai sast

LLM security scans repeat inconsistently without reference baseline

Claude reaches 75.4% F1 against Snyk Code's reference findings (81% recall in its highest-recall configuration), but its extra findings vary widely: nearly 50% of LLM-only reports appear in just one of five identical runs, while about 85% of reference-matched findings (134 of 158) appear in all five.

Summary

If your CI/CD pipeline runs an LLM security review ahead of human code inspection, unrepeatable findings add noise to diffs and create false confidence in coverage. You need to know whether the agent will flag the same issue again on the next scan or silently drop it.
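One way to check repeatability on your own repository is to run the same scan several times and count how many runs each finding shows up in. The sketch below is a minimal illustration, not the benchmark's method: the `(file, CWE)` finding key and the sample data are hypothetical, and real reports would need normalizing before two findings can be compared.

```python
from collections import Counter


def occurrence_histogram(runs):
    """Map 'number of runs a finding appeared in' to 'count of unique findings'."""
    seen = Counter()
    for run in runs:
        # Convert to a set so duplicates inside one scan count only once.
        seen.update(set(run))
    return Counter(seen.values())


def stable_fraction(runs):
    """Share of unique findings that appeared in every run."""
    hist = occurrence_histogram(runs)
    total = sum(hist.values())
    return hist[len(runs)] / total if total else 0.0


# Hypothetical findings from five identical scans, keyed as (file, CWE id).
runs = [
    {("app.js", "CWE-89"), ("auth.js", "CWE-798")},
    {("app.js", "CWE-89")},
    {("app.js", "CWE-89"), ("routes.js", "CWE-79")},
    {("app.js", "CWE-89")},
    {("app.js", "CWE-89"), ("auth.js", "CWE-798")},
]

hist = occurrence_histogram(runs)
print(dict(hist))
print(f"{stable_fraction(runs):.2f}")
```

A histogram weighted toward "seen once" suggests that single-run LLM output should not feed a diff or a gate directly.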

Why it matters

Implementation verdict

Do not replace SAST with LLM security review; combine them. The best-scoring LLM configuration reaches 75.4% F1 against Snyk Code's reference set, a 24.6-point gap to deterministic reproduction of that reference, and its unmatched findings are too noisy to rely on alone. Keep SAST as the deterministic layer and use Claude for unfamiliar patterns and prose risk explanations. The benchmark used small Express apps, so test on your own codebase before letting LLM review gate checks.
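The combined setup could be sketched as a triage step: SAST findings always block, while LLM-only findings are surfaced as advisory only if they repeat across runs. Everything here is an assumption for illustration: the `triage` function, the finding keys, and the `min_runs=3` threshold are hypothetical and not taken from the benchmark.

```python
from collections import Counter


def triage(sast_findings, llm_runs, min_runs=3):
    """Split findings into blocking, advisory, and low-confidence buckets.

    sast_findings: findings from the deterministic scanner (always blocking).
    llm_runs: one collection of findings per repeated LLM scan.
    min_runs: assumed threshold; LLM-only findings seen in fewer runs
              are kept out of the gate and out of the advisory list.
    """
    sast = set(sast_findings)
    counts = Counter()
    for run in llm_runs:
        counts.update(set(run))  # count each finding once per run

    llm_only = {f: n for f, n in counts.items() if f not in sast}
    return {
        "blocking": sorted(sast),
        "advisory": sorted(f for f, n in llm_only.items() if n >= min_runs),
        "low_confidence": sorted(f for f, n in llm_only.items() if n < min_runs),
    }


# Hypothetical data: one SAST finding and three repeated LLM scans.
sast = [("db.js", "CWE-89")]
llm_runs = [
    [("db.js", "CWE-89"), ("auth.js", "CWE-798")],
    [("db.js", "CWE-89"), ("auth.js", "CWE-798"), ("views.js", "CWE-79")],
    [("auth.js", "CWE-798")],
]

result = triage(sast, llm_runs, min_runs=3)
print(result)
```

Only the `blocking` bucket would fail a build in this sketch; the advisory bucket is meant for human reviewers, which keeps the gate itself deterministic.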

Sources

1. The highest-recall LLM configuration found only 81% of Snyk Code reference vulnerabilities
2. Nearly 50% of LLM-only vulnerability reports appeared in just 1 of 5 identical scans
3. 80 of 161 unique-unmatched findings appeared in only one of five identical repetitions, while only 22 appeared in all five
4. 134 of 158 unique reference-matched findings appeared in all five repetitions
5. The best-scoring LLM configuration reached 75.4% Snyk-reference F1, leaving a 24.6-point gap against deterministic SAST reference reproduction
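The percentages quoted in this brief follow from the counts in sources 3 to 5. The short check below reproduces them; the variable names are illustrative, and the F1 gap assumes, as source 5 does, that deterministic SAST reproduces its own reference at 100%.

```python
# Counts taken from sources 3 and 4.
unmatched_total, unmatched_once, unmatched_all = 161, 80, 22
matched_total, matched_all = 158, 134

once_share = unmatched_once / unmatched_total     # LLM-only findings seen in 1 of 5 runs
all_share = unmatched_all / unmatched_total       # LLM-only findings seen in 5 of 5 runs
matched_stable = matched_all / matched_total      # reference-matched findings seen in 5 of 5 runs

# Source 5: best F1 versus an assumed 100% for deterministic reproduction.
f1_gap = round(100.0 - 75.4, 1)

print(f"{once_share:.1%}")      # 49.7%
print(f"{all_share:.1%}")       # 13.7%
print(f"{matched_stable:.1%}")  # 84.8%
print(f1_gap)                   # 24.6
```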

Dev Signal
