kv-cache-quantization safety-alignment quantization-robustness vllm-inference mechanistic-diagnosis

KV cache quantization silently breaks model safety alignment

Safety features occupy a low-dimensional activation subspace that is 10^2 to 10^3 times more vulnerable to quantization noise, and general perplexity metrics do not detect the damage. Per-Channel Reduction (PCR) diagnoses the failure mode and recovers up to 97% of lost alignment with 35 GPU-minutes of calibration.
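The following toy numpy sketch shows how an aggregate error measure can miss damage confined to a small subspace. It makes two simplifying assumptions that are not from the paper: the "safety subspace" is a handful of low-magnitude coordinates, and quantization is symmetric int4 with a single per-tensor scale. It illustrates the masking effect only. It is neither the paper's analysis nor PCR.

```python
import numpy as np

def quantize_per_tensor(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric quantize-dequantize with one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1          # 7 for int4
    scale = np.abs(x).max() / qmax      # scale is set by the largest magnitude
    return np.round(x / scale) * scale  # snap to the integer grid, then map back

def relative_error(ref: np.ndarray, approx: np.ndarray) -> float:
    return float(np.linalg.norm(ref - approx) / np.linalg.norm(ref))

# Hypothetical activation vector: 1000 high-magnitude "general" coordinates
# plus 8 low-magnitude coordinates standing in for the safety subspace.
general = np.linspace(-8.0, 8.0, 1000)
safety = 0.05 * np.sin(np.arange(1, 9))
x = np.concatenate([general, safety])

x_hat = quantize_per_tensor(x, bits=4)
overall_err = relative_error(x, x_hat)           # what an aggregate metric sees
safety_err = relative_error(x[-8:], x_hat[-8:])  # error inside the subspace
print(f"overall relative error:         {overall_err:.3f}")
print(f"safety-subspace relative error: {safety_err:.3f}")
```

The largest magnitude is 8, so the quantization step is 8/7 ≈ 1.14. Every safety coordinate therefore rounds to zero, and the subspace error is 100% while the overall relative error stays under 10%.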

Summary

Production LLM deployments use KV cache quantization to cut inference memory, but standard perplexity evals hide the resulting safety regression: Mistral-7B loses 15.2% of its refusals at only 1.03x perplexity. PCR gives you a diagnostic protocol to catch this before serving.
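Below is a minimal pre-serving gate that checks refusals alongside perplexity. It assumes you already have per-token log-probabilities on a held-out text set and refusal judgments on a harmful-prompt set, for both the unquantized and the KV-quantized model. `EvalResult`, `check_kv_quantization`, and both thresholds are hypothetical. The sketch reads "loses 15.2% of its refusals" as a relative drop, which is also an assumption. It is a plain pass/fail check, not PCR.

```python
import math
from dataclasses import dataclass

@dataclass
class EvalResult:
    token_logprobs: list[float]  # natural-log probs of held-out reference tokens
    refusals: int                # harmful prompts the model refused (judged separately)
    harmful_prompts: int         # size of the harmful-prompt set

def perplexity(logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(logprobs) / len(logprobs))

def check_kv_quantization(baseline: EvalResult, quantized: EvalResult,
                          max_ppl_ratio: float = 1.05,
                          max_refusal_loss: float = 0.02) -> dict:
    """Gate a quantized config on perplexity AND refusals (hypothetical thresholds)."""
    ppl_ratio = perplexity(quantized.token_logprobs) / perplexity(baseline.token_logprobs)
    base_rate = baseline.refusals / baseline.harmful_prompts
    quant_rate = quantized.refusals / quantized.harmful_prompts
    refusal_loss = 1.0 - quant_rate / base_rate  # relative share of refusals lost
    return {
        "ppl_ratio": ppl_ratio,
        "refusal_loss": refusal_loss,
        "ppl_ok": ppl_ratio <= max_ppl_ratio,
        "safety_ok": refusal_loss <= max_refusal_loss,
    }

# Made-up numbers, not the paper's data: perplexity barely moves, refusals drop.
baseline = EvalResult(token_logprobs=[-2.00] * 100, refusals=92, harmful_prompts=100)
quantized = EvalResult(token_logprobs=[-2.03] * 100, refusals=78, harmful_prompts=100)
report = check_kv_quantization(baseline, quantized)
print(report)
```

With these inputs the perplexity ratio is about 1.03 and passes, while about 15% of refusals are lost and the safety check fails. A perplexity-only eval would wave this configuration through.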

Why it matters

Implementation verdict

Replaces blind quantization with mechanistic failure classification. Training-free: it requires 20 calibration prompts, 35 GPU-minutes per model, and integration at the quantization step. Ready now for production vLLM+FP8 stacks. Validated on nine primary models, a held-out model from an independent family, and production quantizers including KIVI.
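KIVI, one of the production quantizers named above, quantizes the key cache per channel and the value cache per token. This numpy sketch shows why that axis choice matters for keys. It applies simulated 2-bit asymmetric quantize-dequantize to a made-up key tensor with two hypothetical outlier channels. It omits KIVI's grouping and full-precision residual window, it is not the FP8 path used in vLLM, and it is not PCR.

```python
import numpy as np

def asym_quant_dequant(x: np.ndarray, bits: int, axis: int) -> np.ndarray:
    """Asymmetric min/max quantize-dequantize. Stats are reduced along `axis`,
    so every slice orthogonal to it gets its own scale and offset."""
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # guard constant slices
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def rel_err(ref: np.ndarray, approx: np.ndarray) -> float:
    return float(np.linalg.norm(ref - approx) / np.linalg.norm(ref))

# Made-up key cache of shape (tokens, channels) with two large-magnitude channels.
t = np.arange(64)[:, None]
c = np.arange(128)[None, :]
keys = np.sin(0.7 * t + 1.3 * c)                    # ordinary channels in [-1, 1]
keys[:, 3] = 30.0 * np.cos(0.5 * t[:, 0])           # hypothetical outlier channel
keys[:, 17] = -30.0 * np.sin(0.3 * t[:, 0] + 1.0)   # hypothetical outlier channel

per_channel = asym_quant_dequant(keys, bits=2, axis=0)  # one scale per channel (KIVI's key axis)
per_token = asym_quant_dequant(keys, bits=2, axis=1)    # one scale per token (KIVI's value axis)
err_channel = rel_err(keys, per_channel)
err_token = rel_err(keys, per_token)
print(f"keys, per-channel: {err_channel:.3f}   keys, per-token: {err_token:.3f}")
```

With per-token scaling, each token's range stretches to cover the outlier channels, so the ordinary channels collapse onto a coarse grid. Per-channel scaling confines the coarse grid to the outlier channels themselves.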

Sources

1. low-bit quantization can silently destroy safety alignment
2. Mistral-7B loses 15.2% of its refusals at only 1.03x perplexity
3. safety features occupy a low-dimensional activation subspace 10^2-10^3x more vulnerable to quantization noise
4. PCR predicts the correct mitigation direction on all nine primary models and one held-out model from an independent family using 20 calibration prompts
5. recovers up to 97% of lost alignment at minimal memory overhead
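The brief does not define how "lost alignment" or its recovery is scored. One natural reading, assumed here, measures alignment as the refusal rate on a harmful-prompt set: r_fp for the unquantized model, r_q after KV cache quantization, and r_mit after mitigation. Recovery is then the share of the gap that mitigation closes. The worked numbers below are hypothetical, not the paper's.

```latex
\text{recovery} = \frac{r_{\text{mit}} - r_{\text{q}}}{r_{\text{fp}} - r_{\text{q}}},
\qquad \text{e.g. } r_{\text{fp}} = 0.90,\; r_{\text{q}} = 0.70,\; r_{\text{mit}} = 0.894
\;\Rightarrow\; \frac{0.194}{0.20} = 0.97
```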

Dev Signal
