MiniMax M3 hits production with 1M-token sparse attention
Block-sparse attention avoids the N² scaling of dense attention, making 1M-token context windows feasible. The reported speedup over dense attention is more than 9x in prefill and more than 15x in decode, at the cost of reimplementing attention kernels and multimodal preprocessing pipelines.
June 3, 2026
Summary
Long-context inference over codebases, documents, and agentic loops becomes cost-competitive. Developers building production agentic systems can now evaluate a model built for tool use at scale without prohibitive latency or KV-cache overhead.
Implementation verdict
Replaces dense attention implementations and KV-cache management strategies. Requires custom kernel work (block-major reordering, sparse paged attention integration, decode scoring optimization) and gateway-level multimodal preprocessing. Ready now via Together AI endpoint; self-hosted deployment demands kernel engineering expertise.
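To make the "decode scoring" step concrete, here is a minimal pure-Python sketch of block-sparse attention for a single decode query: score each block of keys cheaply, keep the top-scoring blocks, and run exact softmax attention over only those positions. This is a generic illustration, not MiniMax's MSA algorithm or kernel. The function name `block_sparse_decode`, the parameters `block_size` and `top_k`, and the mean-key scoring rule are all assumptions chosen for readability.

```python
import math


def block_sparse_decode(q, keys, values, block_size=4, top_k=2):
    """Attend one query over only the top_k highest-scoring key blocks.

    Hypothetical sketch: the mean-key block score is an assumption,
    not the scoring rule used by any particular model.
    """
    dim = len(q)
    scale = 1.0 / math.sqrt(dim)
    n = len(keys)

    # Split token positions into contiguous blocks.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Cheap block score: dot product of q with the block's mean key.
    # Recomputed here for clarity; a real server would cache block summaries.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return sum(a * b for a, b in zip(q, mean_key))

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])
    positions = [i for b in selected for i in blocks[b]]

    # Exact, numerically stable softmax attention over the selected positions only.
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in positions]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [
        sum(w * values[i][d] for w, i in zip(weights, positions)) / total
        for d in range(len(values[0]))
    ]
    return out, positions


# Toy cache: three blocks of four keys pointing in different directions.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4 + [[-1.0, 0.0]] * 4
values = [[float(i)] for i in range(12)]
out, positions = block_sparse_decode([1.0, 0.0], keys, values, block_size=4, top_k=1)
print(positions, out)  # [0, 1, 2, 3] [1.5]
```

In this sketch the exact attention step touches `top_k * block_size` keys regardless of how long the cache grows, which is the property that removes the N² term. Setting `top_k` to the number of blocks recovers dense attention, which gives a simple correctness check when porting the idea to a real kernel.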
Sources
- 1. 1M-token context window, native multimodality, and an architecture that demands serious engineering to serve efficiently
- 2. brings a speed up of more than 9x in the prefilling stage and more than 15x in the decoding stage
- 3. The attention computation itself no longer scales as N^2 with context length, thus making it very suitable for long context workload
- 4. MSA significantly lowers the wall time percent of the actual attention computation per iteration
Dev Signal