llm-agents privacy-benchmark adversarial-testing on-device-inference policy-compliance

Smaller models leak privacy under adversarial probing

POLAR-Bench exposes that 1–30B open-weight models running as on-device agents leak over 50% of protected attributes, while frontier models withhold 99%+—forcing a choice between privacy and local inference.

May 27, 2026

Summary

If you're deploying LLM agents with user data locally or via private inference, your threat model now has a measured failure point. Smaller models that fit on-device consistently fail intent-following under adversarial pressure, making privacy-sensitive workloads risky at that scale.

Why it matters

Implementation verdict

POLAR-Bench is a diagnostic tool, not a solution. It localizes privacy breakdown across model size and attack strategy but doesn't replace your privacy architecture—it audits it. Worth running against your candidate models before shipping agents with PII access. Requires adapting their 5×5 diagnostic surface to your specific privacy policies and domains.

Sources

1.current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1--30B range, the class users most commonly run as their own trusted agent on-device or via private inference, score notably worse, with the weakest leaking over half
2.Across 10 domains and 7,852 samples, we score privacy and utility by deterministic set-membership
3.providing a foothold for privacy alignment where it matters most

Dev Signal

Get briefs like this in your inbox — free, every weekday.

100+ sources compressed into one 4-minute read. Ranked, cited, implementation-ready.

Read the full issue →All briefs