quantization on-device-inference image-generation apple-silicon open-weights

Bonsai Image 4B runs diffusion inference on iPhones

Binary and ternary quantization shrink the FLUX.2 Klein 4B diffusion transformer from 7.75 GB to 0.93–1.21 GB while retaining 88–95% of the original model's benchmark quality, enabling local generation on Apple Silicon devices.
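The reduction factors follow directly from the sizes quoted above. Here is a quick arithmetic check, assuming the smaller 0.93 GB figure belongs to the binary variant and 1.21 GB to the ternary one (the brief does not state that mapping explicitly; `reduction_factor` is an illustrative helper name):

```python
# Diffusion transformer footprints quoted in this brief, in GB.
ORIGINAL_GB = 7.75
# Assumption: binary is the smaller of the two quantized sizes.
QUANTIZED_GB = {"binary": 0.93, "ternary": 1.21}


def reduction_factor(original_gb: float, quantized_gb: float) -> float:
    """Return how many times smaller the quantized weights are."""
    return original_gb / quantized_gb


for name, size_gb in QUANTIZED_GB.items():
    factor = reduction_factor(ORIGINAL_GB, size_gb)
    print(f"{name}: {factor:.1f}x smaller")
# binary: 8.3x smaller
# ternary: 6.4x smaller
```

Under that mapping, the ternary result matches the 6.4x reduction cited in the sources.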

June 2, 2026

Summary

On-device generation eliminates cloud round-trip latency for iterative image generation workflows and keeps prompts and assets local. Developers can embed high-quality image generation in apps on hardware users already own, removing per-request cloud costs and enabling faster creative loops.

Why it matters

Implementation verdict

Replaces cloud-only FLUX.2 Klein deployment for on-device use cases. Requires MLX (Apple Silicon) or Gemlite (CUDA) support; both variants ship as open weights. Ready now for iOS/macOS apps: 9.4 s per 512×512 image on iPhone 17 Pro Max is practical for UX patterns that can tolerate a wait of roughly ten seconds per image. Ternary variant recommended for quality; 1-bit variant for extreme memory pressure.
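The variant recommendation can be written down as a simple selection rule. The sketch below is hypothetical: `pick_variant` and its `headroom_gb` margin are illustrative names and values, not part of MLX, Gemlite, or the model release. The memory figures are the mean-active values quoted in the sources; peak usage may be higher.

```python
from typing import Optional

# Mean-active memory in GB, as quoted in this brief's sources.
MEAN_ACTIVE_GB = {"ternary": 1.96, "binary": 1.5}


def pick_variant(free_memory_gb: float, headroom_gb: float = 0.5) -> Optional[str]:
    """Prefer ternary for quality; fall back to binary under memory pressure.

    headroom_gb is an assumed safety margin, not a measured value.
    """
    for name in ("ternary", "binary"):  # ordered by quality preference
        if free_memory_gb >= MEAN_ACTIVE_GB[name] + headroom_gb:
            return name
    return None  # neither variant fits in the available memory
```

For example, `pick_variant(3.0)` returns `"ternary"`, `pick_variant(2.2)` returns `"binary"`, and `pick_variant(1.0)` returns `None`.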

Sources

1. 1.125 effective bits per weight
2. 1.71 effective bits per weight
3. the first image model in its parameter class to run directly on an iPhone
4. mean-active memory is 1.5 GB and 1.96 GB, for the binary and ternary models, compared to 11.74 GB for the original FLUX.2 Klein 4B
5. retains 95% of the FLUX.2 Klein 4B accuracy across GenEval, HPSv3, and DPG-Bench, while reducing the diffusion transformer footprint by 6.4x
6. generation can sit directly inside the product experience
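The two bits-per-weight figures are consistent with a common group-wise scheme in which each weight stores log2(levels) bits and every group of weights shares one scale. The sketch below assumes a 16-bit scale per group of 128 weights. Both values are assumptions chosen because they reproduce the quoted numbers, not details confirmed by the sources, and `effective_bits` is an illustrative name.

```python
import math


def effective_bits(levels: int, group_size: int = 128, scale_bits: int = 16) -> float:
    """Bits per weight: information per weight plus the amortized group scale.

    group_size and scale_bits are assumed values, not confirmed by the sources.
    """
    return math.log2(levels) + scale_bits / group_size


print(round(effective_bits(2), 3))  # binary  {-1, +1}:    1.125
print(round(effective_bits(3), 2))  # ternary {-1, 0, +1}: 1.71
```

The ternary value uses the information-theoretic log2(3) ≈ 1.585 bits per weight. A practical packing such as five ternary values per byte costs 1.6 bits per weight before the scale overhead, so real file sizes can be slightly larger than this estimate.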

Dev Signal
