llm-safety adversarial-ml white-box-attack interpretability jailbreak

Single neuron disables safety across model families

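Suppressing a single identified "refusal neuron" in an MLP layer yields a 91.7% average attack success rate on JailbreakBench, given white-box access to activations. In the tested models, safety behavior appears localized and fragile rather than distributed.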

May 27, 2026

Summary

If you're deploying open-weight models in restricted environments, you need neuron-level monitoring. Prompt-based safety evaluations do not probe this attack vector, so a passing score on a benchmark like JailbreakBench is not sufficient on its own for production risk assessment.

Why it matters

Implementation verdict

This doesn't replace existing safety testing; it shows that such testing is incomplete. The attack requires white-box access to model activations, so black-box (API-only) deployments aren't directly vulnerable. If you control the inference layer, start auditing your model's MLP neurons and add neuron-suppression tests to your eval suite.
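As a rough illustration of what a neuron-suppression test could look like, the sketch below zeroes each hidden unit of a toy two-layer MLP in turn and measures how far a "refusal probability" falls. Everything here is an assumption made for illustration: the weights, the inputs, the names `mlp_forward` and `ablation_sweep`, and the `MAX_SINGLE_NEURON_DROP` threshold are hypothetical, and this is not the procedure used in the cited work. On a real model, the same idea would typically be implemented with the framework's activation hooks rather than a hand-written forward pass.

```python
import math

def mlp_forward(x, w_in, w_out, ablate=None):
    """Toy two-layer MLP with ReLU; optionally zero one hidden neuron."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if ablate is not None:
        hidden[ablate] = 0.0  # zero-ablation of a single hidden unit
    return sum(w * h for w, h in zip(w_out, hidden))

def refusal_prob(logit):
    """Map the toy output logit to a probability of refusing."""
    return 1.0 / (1.0 + math.exp(-logit))

def ablation_sweep(inputs, w_in, w_out):
    """Mean drop in refusal probability when each hidden neuron is zeroed."""
    drops = []
    for j in range(len(w_in)):
        total = 0.0
        for x in inputs:
            base = refusal_prob(mlp_forward(x, w_in, w_out))
            cut = refusal_prob(mlp_forward(x, w_in, w_out, ablate=j))
            total += base - cut
        drops.append(total / len(inputs))
    return drops

# Hypothetical weights: hidden neuron 2 carries almost all of the refusal signal.
W_IN = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_OUT = [0.1, 0.1, 6.0]
# Stand-ins for encoded prompts that the model should refuse.
HARMFUL_INPUTS = [[1.0, 1.0], [2.0, 0.5], [0.5, 2.0]]

# Arbitrary illustrative threshold for "one neuron matters too much".
MAX_SINGLE_NEURON_DROP = 0.25

drops = ablation_sweep(HARMFUL_INPUTS, W_IN, W_OUT)
worst = max(range(len(drops)), key=drops.__getitem__)
fragile = drops[worst] > MAX_SINGLE_NEURON_DROP
print(f"most critical neuron: {worst}, fragile: {fragile}")
```

An eval suite could fail the build whenever `fragile` is true, which flags models whose refusal behavior depends on a single unit. Sweeping every neuron is linear in layer width, so on a large model a test like this would likely need to be restricted to candidate neurons found by some cheaper attribution step.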

Sources

1. Flipping a single hidden neuron can disable the refusal gate entirely.
2. Suppressing one identified "refusal neuron" yields a 91.7% average attack success rate on JailbreakBench across seven models, from 1.7B to 70B parameters, spanning the Qwen-3 and Llama-3.1 families.
3. The attack requires only white-box access to model activations and no additional training, fine-tuning, or prompt engineering.
4. Safety evaluations must start probing neuron-level vulnerabilities rather than relying on aggregate loss or prompt-based tests.

Dev Signal
