LiteRT-LM ships native Gemma 4 multi-token prediction support

Speculative decoding runs the drafter and the primary model together on the GPU, eliminating cross-IP data transfers and delivering up to 2.2x faster inference on mobile hardware.

June 5, 2026

Why it matters

On-device LLM inference latency is a hard constraint for mobile UX. Native multi-token prediction (MTP) support with optimized KV cache management means you can ship faster agentic features without rebuilding inference pipelines.

Implementation verdict

Replaces hand-rolled speculative decoding or slower runtimes like llama.cpp for Gemma 4 on Android/iOS. Requires adopting the Swift/Kotlin/JavaScript APIs and access to the source on GitHub. Worth trying now if you're shipping Gemma 4 on mobile: the benchmarks are Google-attributed and the API is production-ready.
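
For a sense of what "hand-rolled speculative decoding" involves, here is a minimal greedy draft-and-verify sketch in Python. It is illustrative only and does not use the LiteRT-LM API: `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_drafter` are hypothetical names, and the "models" are toy functions that map a token prefix to the next token. A real runtime would verify all drafted positions in one batched forward pass of the primary model and reuse its KV cache, which this sketch does not attempt.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, drafter: NextToken, prompt: List[int],
                       max_new_tokens: int, draft_len: int = 4) -> List[int]:
    """Greedy speculative decoding: the drafter proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap drafter proposes up to draft_len tokens autoregressively.
        k = min(draft_len, goal - len(tokens))
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))

        # 2. The target checks each drafted position. Written as a loop for
        #    clarity; a real runtime scores all k positions in a single pass.
        for i in range(k):
            expected = target(tokens + draft[:i])
            if expected != draft[i]:
                # First mismatch: keep the agreed prefix plus the target's own token.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            # Every drafted token matched, so accept the whole draft.
            tokens.extend(draft)
    return tokens


# Toy stand-ins (hypothetical): the drafter agrees with the target except
# when the prefix length is a multiple of 3.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 7 + 3) % 11


def toy_drafter(prefix: List[int]) -> int:
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 11


if __name__ == "__main__":
    prompt = [1, 2]
    fast = speculative_decode(toy_target, toy_drafter, prompt, max_new_tokens=10)
    slow = greedy_decode(toy_target, prompt, max_new_tokens=10)
    print(fast == slow)
```

Because every kept token is either confirmed or supplied by the target, the output matches plain greedy decoding from the target. The speedup comes from accepting several drafted tokens per verification step.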

Sources

  1. up to 2.2x faster inference
  2. the highest-performing runtime environment for Gemma models
  3. optimizing the data interplay between the primary Gemma 4 model and the MTP drafter
  4. 1.8x to 3.7x faster than competing frameworks like llama.cpp, MLX, Cactus, and ONNX
  5. the ~2.58GB Gemma 4 E2B model taking just 607MB on Apple mobile CPUs

Dev Signal