Holo3.1 adds mobile, quantized weights, local inference
Computer-use model now ships FP8/Q4/NVFP4 quantized checkpoints, function-calling support, and sub-4B variants for on-device deployment with 1.74× throughput gain over full precision.
June 3, 2026
Summary
Eliminates the eval-to-prod gap by handling mobile, desktop, and different agent frameworks without retraining. Local inference options let you run agents fully offline while cutting deployment costs and latency to 3.3s average step time.
Why it matters
Eliminates the eval-to-prod gap by handling mobile, desktop, and different agent frameworks without retraining. Local inference options let you run agents fully offline while cutting deployment costs and latency to 3.3s average step time.
Implementation verdict
Replaces Holo3 if you need mobile automation (67%→79.3% on 35B-A3B) or local deployment. Requires vLLM with NVFP4 for DGX inference or GGUF for consumer hardware. Worth trying now if you're blocked on latency or privacy constraints; benchmarks are concrete.
Sources
- 1.Holo3.1 improves robustness across the three dimensions that matter most in production: environments (web, desktop, mobile), agent frameworks, and deployment targets
- 2.On AndroidWorld, our 35B-A3B model improves from 67% to 79.3%, while the smaller 4B and 9B variants improve from 58% to 72%
- 3.more than a 25% improvement over Holo3 when evaluated inside our Holotab product harness
- 4.NVFP4 W4A16 delivers 1.41× the total token throughput of FP8 and 1.74× that of BF16
- 5.cutting average step time from 6.8s to 3.3s
Dev Signal
Get briefs like this in your inbox — free, 3x a week.
100+ sources compressed into one 4-minute read. Ranked, cited, implementation-ready.