LiteRT-LM ships native Gemma 4 multi-token prediction support
Speculative decoding with the drafter and primary model co-located on the GPU eliminates cross-IP data transfers, achieving up to 2.2x faster inference on mobile hardware.
June 5, 2026
Summary
LiteRT-LM now runs Gemma 4 multi-token prediction (MTP) natively. The MTP drafter and the primary Gemma 4 model execute together on the GPU, which removes cross-IP data transfers during speculative decoding and, per Google, delivers up to 2.2x faster inference on mobile hardware.
Why it matters
On-device LLM inference latency is a hard constraint for mobile UX. Native MTP support with optimized KV cache management means you can ship faster agentic features without rebuilding inference pipelines.
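To make the drafter/primary split concrete, here is a minimal sketch of greedy speculative decoding. It is a generic illustration, not LiteRT-LM's implementation: `speculative_decode`, `toy_target`, and `toy_drafter` are hypothetical names, and both "models" are toy functions over integer tokens. The drafter proposes `k` tokens cheaply, the primary model checks them, and the longest matching prefix is kept along with one token from the primary model.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: one primary-model step per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, drafter: NextToken, prompt: List[int], n: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < n:
        # 1. The drafter proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))

        # 2. The primary model verifies the draft. A real runtime would score all
        #    k positions in one batched forward pass; this sketch calls the toy
        #    target once per position but counts the whole check as one round.
        rounds += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # replace the first wrong guess and stop
                break
        else:
            # Every drafted token matched, so the same round yields a bonus token.
            accepted.append(target(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n], rounds


def toy_target(ctx: List[int]) -> int:
    """Stand-in for the primary model (assumption: deterministic toy rule)."""
    return (ctx[-1] * 3 + 1) % 50


def toy_drafter(ctx: List[int]) -> int:
    """Stand-in for the drafter: agrees with the target except at every third position."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    out, rounds = speculative_decode(toy_target, toy_drafter, [7], 20, k=4)
    print("matches plain greedy decoding:", out == greedy_decode(toy_target, [7], 20))
    print("verification rounds for 20 tokens:", rounds)
```

The output is identical to plain greedy decoding from the primary model; the gain comes from needing fewer primary-model rounds than generated tokens. Each round still hands drafted tokens to the primary model, which is why keeping both models on the same processor matters for latency.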
Implementation verdict
Replaces hand-rolled speculative decoding or slower runtimes like llama.cpp for Gemma 4 on Android/iOS. Adoption requires Swift, Kotlin, or JavaScript and access to the source on GitHub. Worth trying now if you're shipping Gemma 4 on mobile: the API is production-ready, but the benchmarks are Google-attributed, so confirm the gains on your own target devices.
Sources
1. up to 2.2x faster inference
2. the highest-performing runtime environment for Gemma models
3. optimizing the data interplay between the primary Gemma 4 model and the MTP drafter
4. 1.8x to 3.7x faster than competing frameworks like llama.cpp, MLX, Cactus, and ONNX
5. the ~2.58GB Gemma 4 E2B model taking just 607MB on Apple mobile CPUs
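As a quick sanity check on what the quoted figures imply, the arithmetic can be sketched as below. The 1000 ms baseline is a hypothetical number chosen for illustration, the conversion assumes 1 GB = 1000 MB, and reading 607MB as the runtime memory footprint is an assumption about the quoted claim.

```python
# Figures quoted in the sources above.
MODEL_SIZE_MB = 2.58 * 1000  # ~2.58GB model, assuming 1 GB = 1000 MB
FOOTPRINT_MB = 607           # reported figure on Apple mobile CPUs

# Share of the model size that the reported footprint represents.
footprint_fraction = FOOTPRINT_MB / MODEL_SIZE_MB
print(f"footprint is {footprint_fraction:.1%} of model size")  # 23.5%


def latency_after_speedup(baseline_ms: float, speedup: float) -> float:
    """An Nx speedup divides latency by N."""
    return baseline_ms / speedup


BASELINE_MS = 1000.0  # hypothetical baseline latency, not a measured value
for speedup in (1.8, 2.2, 3.7):
    print(f"{speedup}x faster: {latency_after_speedup(BASELINE_MS, speedup):.1f} ms")
# 1.8x faster: 555.6 ms
# 2.2x faster: 454.5 ms
# 3.7x faster: 270.3 ms
```

Note that "up to 2.2x" is a ceiling: the realized speedup on a given device and workload can be lower.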
Dev Signal