Gemma 4 QAT checkpoints run on-device in under 1 GB

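Quantization-Aware Training (QAT) with a mobile-specialized quantization schema reduces the Gemma 4 E2B memory footprint to about 1 GB (under 1 GB for the text-only model without Per-Layer Embeddings) while preserving quality, so you can ship local inference without the quality loss typical of post-training quantization (PTQ).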

June 9, 2026

Summary

Developers can now run Gemma 4 inference on consumer GPUs and phones without the quality tradeoffs of post-training quantization. Edge deployment no longer has to depend on a server: inference runs locally, which cuts network latency and removes a backend dependency from production systems.
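To make the QAT versus PTQ distinction concrete, here is a minimal sketch of the rounding step that both approaches share. It assumes a symmetric, per-tensor 4-bit scheme, which is an illustration only and not the schema used for the Gemma 4 checkpoints; the function name `fake_quantize` and the sample weights are hypothetical. PTQ applies this rounding once, after training has finished. QAT typically applies it inside the training forward pass, so the weights can adapt to the rounding error before the model ships.

```python
def fake_quantize(weights, bits=4):
    """Round weights onto a symmetric signed integer grid, then map back to floats.

    Illustrative per-tensor scheme (an assumption, not Gemma 4's actual schema).
    Returns the dequantized weights and the scale that was used.
    """
    qmax = 2 ** (bits - 1) - 1                      # largest integer level, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    levels = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in levels], scale


# Hypothetical weights, chosen only for the demonstration.
weights = [0.82, -0.41, 0.05, -1.30, 0.66]
dequantized, scale = fake_quantize(weights, bits=4)

# Without clipping, rounding moves each weight by at most half a step.
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
print(f"step size: {scale:.4f}, worst rounding error: {max_error:.4f}")
```

The worst-case error of half a step is what a PTQ model has to absorb with no chance to compensate. A QAT training loop would call something like `fake_quantize` on every forward pass and let the optimizer adjust the underlying float weights.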

Implementation verdict

Replaces the post-training quantization workflow for Gemma 4. No retraining is required: use the released checkpoints directly in llama.cpp, Ollama, vLLM, or Transformers.js. Ready now, with weights on Hugging Face in GGUF and compressed-tensors formats. Test on desktop first (for example in LM Studio), then deploy on-device via LiteRT-LM or a web runtime. Worth trying immediately if you already use Gemma 4.
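Before moving from desktop to a phone, a rough memory estimate can flag an obviously bad fit. The sketch below is back-of-the-envelope arithmetic, not a measurement: the parameter count of 2 billion is an assumption read off the "E2B" name, the uniform bit width ignores mixed-precision layers and stored scales, and `overhead_gb` is a hypothetical allowance for the KV cache and activations. The function names are illustrative.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_on_device(num_params, bits_per_weight, budget_gb, overhead_gb=0.0):
    """True if weights plus a caller-supplied runtime overhead fit in the budget."""
    return weight_memory_gb(num_params, bits_per_weight) + overhead_gb <= budget_gb


ASSUMED_PARAMS = 2e9  # assumption: roughly 2B effective parameters, not a published figure

print(weight_memory_gb(ASSUMED_PARAMS, 16))  # 4.0 GB at 16 bits per weight
print(weight_memory_gb(ASSUMED_PARAMS, 4))   # 1.0 GB at 4 bits per weight

# Hypothetical phone budget of 1.5 GB with 0.3 GB reserved for cache and activations.
print(fits_on_device(ASSUMED_PARAMS, 4, budget_gb=1.5, overhead_gb=0.3))
print(fits_on_device(ASSUMED_PARAMS, 16, budget_gb=1.5, overhead_gb=0.3))
```

Under these assumptions a 4-bit checkpoint lands near the 1 GB figure quoted for E2B, while the 16-bit weights alone would not fit the same budget. Measure the real checkpoint on the target runtime before committing to a device tier.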

Sources

  1. reduced the memory footprint of Gemma 4 E2B to 1GB
  2. QAT integrates the quantization process directly into training
  3. our QAT results yield even higher overall quality compared to standard PTQ baselines
  4. Gemma 4 E2B text-only model (without Per-Layer Embeddings) requires less than 1 GB of memory

Dev Signal
