Gemma 4 QAT checkpoints run on-device in under 1GB
Quantization-Aware Training (QAT) applied to Gemma 4 with a mobile-specialized schema reduces the E2B memory footprint to under 1GB (text-only, without Per-Layer Embeddings) while preserving quality, so you can ship local inference without the quality loss of post-training quantization (PTQ).
June 9, 2026
Summary
Developers can now run Gemma 4 inference on consumer GPUs and phones without the quality tradeoffs of post-training quantization. Edge deployment shifts from server-dependent to genuinely local, reducing latency and the dependency footprint of production systems.
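For readers new to the QAT/PTQ distinction, here is a minimal sketch of the general idea, not the recipe used for Gemma 4 (the source does not describe it). It assumes symmetric per-tensor quantization and a straight-through estimator; the function names, the 4-bit default, and the learning rate are illustrative choices.

```python
def fake_quantize(weights, bits=4):
    """Snap float weights to a symmetric integer grid, then map back to floats.

    Assumption: one scale per tensor, signed range [-qmax, qmax].
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = round(w / scale)                   # nearest grid point
        q = max(-qmax, min(qmax, q))           # clamp to the representable range
        out.append(q * scale)                  # dequantize back to float
    return out


def qat_step(weights, grads_at_quantized, lr=0.01):
    """One SGD update using the straight-through estimator.

    The forward pass uses fake_quantize(weights); the gradient computed
    there is applied unchanged to the full-precision weights.
    """
    return [w - lr * g for w, g in zip(weights, grads_at_quantized)]
```

In PTQ, a step like `fake_quantize` is applied once after training finishes. In QAT, the forward pass already sees quantized weights during training, so the model can adapt to the rounding error.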
Implementation verdict
Replaces the post-training quantization workflow for Gemma 4. No retraining required: use the released checkpoints directly in llama.cpp, Ollama, vLLM, or Transformers.js. Ready now, with weights on Hugging Face in GGUF and compressed-tensors formats. Test on desktop first (LM Studio), then deploy on-device via LiteRT-LM or a web runtime. Worth trying immediately if you're already using Gemma 4.
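Before moving from desktop to a device, a back-of-envelope check of weight memory can help. The sketch below is only arithmetic under stated assumptions: the 2-billion parameter count and 4-bit width are placeholders rather than figures from the release, and the estimate ignores KV cache, activations, and runtime overhead unless you pass `overhead_fraction`.

```python
def weight_memory_bytes(num_params, bits_per_weight):
    """Approximate bytes needed to store the weights alone."""
    return num_params * bits_per_weight / 8


def fits_budget(num_params, bits_per_weight, budget_bytes, overhead_fraction=0.0):
    """Return True if weights plus a fractional overhead fit in budget_bytes.

    overhead_fraction is a hypothetical knob for everything this sketch
    does not model (KV cache, activations, runtime buffers).
    """
    needed = weight_memory_bytes(num_params, bits_per_weight) * (1 + overhead_fraction)
    return needed <= budget_bytes


# Illustrative numbers only: a placeholder 2B-parameter model.
PLACEHOLDER_PARAMS = 2_000_000_000
fp16_bytes = weight_memory_bytes(PLACEHOLDER_PARAMS, 16)   # 4e9 bytes
int4_bytes = weight_memory_bytes(PLACEHOLDER_PARAMS, 4)    # 1e9 bytes
```

Measure real memory use on the target device before committing; the on-disk checkpoint size and the runtime's own reporting are more reliable than this estimate.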
Sources
1. "reduced the memory footprint of Gemma 4 E2B to 1GB"
2. "QAT integrates the quantization process directly into training"
3. "our QAT results yield even higher overall quality compared to standard PTQ baselines"
4. "Gemma 4 E2B text-only model (without Per-Layer Embeddings) requires less than 1 GB of memory"
Dev Signal