llm-security prompt-injection agent-safety benchmark

LLMs fail basic security exploits reliably

GPT-4.5 solves SQL injection 70% of the time; Claude Sonnet 4.6 hits budget limits before breaching—guardrails work, but inconsistently.

June 5, 2026

Summary

Security teams need empirical data on LLM attack surface before deploying agents with data access. This benchmark reveals which models leak user data and which ones stop themselves.

Why it matters

Security teams need empirical data on LLM attack surface before deploying agents with data access. This benchmark reveals which models leak user data and which ones stop themselves.

Implementation verdict

Replaces hand-waving about LLM safety with actual exploit metrics. Requires building a vulnerable test app matching your threat model. Worth running now if you're shipping agents with sensitive context—the variance between models is stark.

Sources

1.GPT-5.5 performed the best, solving the task in seven out of 10 runs
2.DeepSeek-V4-Pro was the runner-up with only three successful runs
3.Claude Sonnet 4.6 was the most expensive model to run, and it only solved the task on two runs
4.Many models could not complete the task due to security guardrails

Dev Signal

Get briefs like this in your inbox — free, every weekday.

100+ sources compressed into one 4-minute read. Ranked, cited, implementation-ready.

Read the full issue →All briefs