sparse-attention long-context open-weights inference-optimization multimodal

MiniMax M3 open-weights model ships with sparse attention

MiniMax Sparse Attention (MSA) cuts per-token attention compute to roughly 1/20th of its full-attention predecessor via block-major KV gathering, enabling a 1M-token context with around 9.7× prefill and 15.6× decode speedups on compatible hardware.
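To make the block-level gathering concrete, here is a minimal pure-Python sketch of block-sparse attention for a single query. The brief does not describe how MSA picks blocks, so the mean-key scoring rule, the function name `block_sparse_attention`, and the `block_size` and `top_k` parameters are illustrative assumptions, not MiniMax's kernel.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend to only the top_k key/value blocks for one query vector."""
    # Split the KV cache into contiguous (start, end) blocks.
    blocks = [(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    # ASSUMPTION: score a block by q . mean(keys in block). MSA's real
    # selection rule is not given in the brief.
    def block_score(span):
        lo, hi = span
        mean_key = [sum(keys[i][d] for i in range(lo, hi)) / (hi - lo)
                    for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(blocks, key=block_score, reverse=True)[:top_k]
    chosen = sorted(ranked)  # restore original token order

    # Gather whole blocks at once, then run dense attention on the gathered keys.
    idx = [i for lo, hi in chosen for i in range(lo, hi)]
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in idx])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx))
           for d in range(len(values[0]))]
    return out, idx
```

With `top_k` set to the total number of blocks, the sketch reduces to ordinary dense attention; shrinking `top_k` is what reduces the per-token work.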

June 15, 2026

Summary

Long-context agent tasks and video understanding become feasible on standard infrastructure. The sparse attention kernel design trades indexing overhead for large throughput gains, which directly reduces inference latency on coding and agentic workloads.
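A toy cost model shows how the indexing overhead and the savings interact. It assumes per-token attention cost is linear in the number of keys touched, plus one unit of scoring work per block. The block size and number of kept blocks below are hypothetical values chosen only to land near the 1/20th figure, not published MSA settings.

```python
import math

def sparse_to_full_cost_ratio(context_len, block_size, blocks_kept,
                              index_cost_per_block=1.0):
    """Sparse / full per-token attention cost under a toy linear model."""
    full_cost = context_len                          # every key is touched
    num_blocks = math.ceil(context_len / block_size)
    indexing_cost = num_blocks * index_cost_per_block  # score every block once
    gathered_cost = min(blocks_kept, num_blocks) * block_size
    return (indexing_cost + gathered_cost) / full_cost

# Hypothetical settings: 1M-token context, 128-token blocks, 330 blocks kept.
ratio = sparse_to_full_cost_ratio(context_len=1_000_000,
                                  block_size=128,
                                  blocks_kept=330)
print(f"sparse / full cost: {ratio:.3f}")
```

Under these assumptions the ratio comes out near 0.05. The same model also shows the downside: if nearly all blocks are kept, the indexing term makes sparse attention slightly more expensive than dense attention.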

Why it matters

Implementation verdict

Replaces full-attention models for 512K+ context use cases. Requires whole-stack optimization (Modular Cloud deployment or custom kernel tuning); sparse selection overhead adds complexity. Enterprise access available today; self-hosted adoption blocked until kernel libraries mature. Worth trying now if you're latency-bound on long documents.

Sources

1. 1M-token context window (with a guaranteed minimum of 512K)
2. MSA's design allows it to cut the per-token attention compute to roughly 1/20th of its full-attention predecessor
3. around 9.7× speedup on prefill and 15.6× speedup on decode
4. This structure has an added benefit of simplifying the online softmax computation
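
The brief does not explain how the block structure simplifies online softmax. As a hedged illustration, the standard streaming (online) softmax can be applied block by block, so the running maximum and normalizer are rescaled once per gathered block rather than once per key. The function name and the scalar-valued setup are assumptions made for brevity.

```python
import math

def online_softmax_attention(score_blocks, value_blocks):
    """Softmax-weighted average of scalar values, streamed one block at a time."""
    running_max = float("-inf")
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running sum of exp(score - running_max) * value
    for scores, vals in zip(score_blocks, value_blocks):
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum, once per block.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc *= correction
        for s, v in zip(scores, vals):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom
```

The result matches a softmax computed over all scores at once, but no step needs more than one block of scores in memory.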

Dev Signal
