open-source-llm inference-optimization mixture-of-experts multimodal-models cost-efficiency

Mistral Small 4 unifies reasoning multimodal coding

Single 119B-parameter MoE model replaces separate reasoning/coding/multimodal specialists with configurable reasoning_effort parameter and 40% latency reduction vs. Small 3.

July 3, 2026

Summary

Eliminates context-switching between specialized models for chat, reasoning, and agentic tasks. Shorter output tokens (20% fewer on coding tasks) directly reduce inference costs and latency in production deployments.

Why it matters

Implementation verdict

Replaces Magistral + Devstral + Mistral Small instruct workflows. Requires 4x H100, 2x H200, or 1x B200 minimum. Available now via Mistral API, NVIDIA NIM, vLLM, llama.cpp, and Transformers. Worth migrating if you're running multiple specialized models; evaluate latency/cost trade-offs against your current stack.

Sources

1.119B total parameters, with 6B active parameters per token
2.40% reduction in end-to-end completion time (latency-optimized setup)
3.3x more requests per second (throughput-optimized setup) compared to Mistral Small 3
4.Mistral Small 4 with reasoning achieves competitive scores, matching or surpassing GPT-OSS 120B on all three benchmarks, while generating significantly shorter outputs
5.reasoning_effort parameter: users can dynamically adjust the model's behavior

Dev Signal

Get briefs like this in your inbox — free, every weekday.

100+ sources compressed into one 4-minute read. Ranked, cited, implementation-ready.

Read the full issue →All briefs