Single neuron disables safety across model families
Suppressing a single hidden MLP neuron yields a 91.7% average jailbreak success rate, given white-box access to activations. Safety behavior in these models is localized and fragile, not distributed.
May 27, 2026
Summary
The cited work reports that suppressing one identified "refusal neuron" disables the refusal gate, with a 91.7% average attack success rate on JailbreakBench across seven models from 1.7B to 70B parameters in the Qwen-3 and Llama-3.1 families. The attack needs only white-box access to model activations: no additional training, fine-tuning, or prompt engineering.
Why it matters
If you deploy open-weight models in restricted environments, you need neuron-level monitoring. Current safety evaluations miss this attack vector entirely, so prompt-based benchmark runs such as JailbreakBench are not sufficient on their own for production risk assessment.
Implementation verdict
This doesn't replace existing safety testing; it exposes that testing as incomplete. Exploiting it requires white-box access to model activations, so black-box deployments aren't directly vulnerable. If you control the inference layer, start auditing your model's MLP neurons and add neuron-suppression tests to your eval suite now.
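As a rough illustration of what a neuron-suppression test might look like, here is a minimal sketch on a toy model. Everything in it is an assumption made for illustration: the hand-picked weights `W1`, `W2`, `B2`, the feature vectors in `HARMFUL_PROMPTS`, the 0.5 refusal threshold, and the helper names `refusal_prob` and `audit`. None of it comes from the cited work. A real audit would zero activations inside an actual model's MLP layers and score completions against a benchmark.

```python
import math

# Toy 2-layer MLP with hand-picked weights (hypothetical, illustration only).
# 3 input features -> 4 ReLU hidden neurons -> 1 "refusal" logit.
W1 = [
    [0.5, -0.2, 0.1],
    [0.1, 0.3, -0.4],
    [0.9, 0.8, 0.7],   # hidden neuron 2 fires strongly on the prompts below
    [-0.3, 0.2, 0.1],
]
W2 = [0.2, -0.1, 3.0, 0.1]  # neuron 2 dominates the refusal logit
B2 = -2.0


def refusal_prob(x, suppress=None):
    """Probability that the toy model refuses input x.

    If `suppress` is a hidden-neuron index, that activation is zeroed
    before the output layer, mimicking a single-neuron intervention.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if suppress is not None:
        hidden[suppress] = 0.0
    logit = sum(w * h for w, h in zip(W2, hidden)) + B2
    return 1.0 / (1.0 + math.exp(-logit))


def audit(prompts, threshold=0.5):
    """Map each hidden neuron to the fraction of refused prompts that
    stop being refused when that neuron alone is zeroed."""
    refused = [x for x in prompts if refusal_prob(x) >= threshold]
    report = {}
    for n in range(len(W1)):
        flipped = sum(refusal_prob(x, suppress=n) < threshold for x in refused)
        report[n] = flipped / len(refused) if refused else 0.0
    return report


# Stand-ins for encoded harmful prompts (hypothetical feature vectors).
HARMFUL_PROMPTS = [[1.0, 1.0, 1.0], [0.8, 1.2, 0.9], [1.5, 0.5, 1.0]]

report = audit(HARMFUL_PROMPTS)
# Flag neurons whose suppression flips more than half of the refusals.
fragile = [n for n, rate in report.items() if rate > 0.5]
print("per-neuron flip rate:", report)
print("fragile neurons:", fragile)
```

In an eval suite, a check like this could fail the build whenever any single neuron's flip rate exceeds a chosen tolerance. The per-neuron loop is the simplest design; on a real model it would likely need to be restricted to candidate layers to stay tractable.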
Sources
1. Flipping a single hidden neuron can disable the refusal gate entirely.
2. Suppressing one identified "refusal neuron" yields a 91.7% average attack success rate on JailbreakBench across seven models, from 1.7B to 70B parameters, spanning the Qwen-3 and Llama-3.1 families.
3. The attack requires only white-box access to model activations and no additional training, fine-tuning, or prompt engineering.
4. Safety evaluations must start probing neuron-level vulnerabilities rather than relying on aggregate loss or prompt-based tests.
Dev Signal