May 28, 2026

MoE coding models, agent adoption surges to 59%

Tool of the Week

Laguna releases mixture-of-experts coding models

M.1 (225.8B total parameters, 23.4B activated) and XS.2 (33.4B total, 3B activated) are MoE models trained end-to-end in a versioned Model Factory stack. Both are reported as competitive with state-of-the-art open models in their weight classes on SWE-bench and terminal coding benchmarks.

MoE architecture reduces inference cost per token while maintaining competitive performance on agentic software engineering benchmarks. XS.2's Apache 2.0 release gives builders a smaller, deployable baseline for terminal-based coding workflows.

XS.2 weights are available now under Apache 2.0, which makes it a candidate to replace closed agentic models in local deployments. You still need infrastructure that can hold all 33.4B parameters in memory: activating only 3B per token cuts compute per token, not the weight footprint. Evaluate it on your own SWE-bench-like tasks before committing. M.1 is so far documented only in the technical report and has not been openly released.

“M.1 has 225.8B total parameters (23.4B activated per token) and XS.2 has 33.4B total (3B activated)”
“two Mixture-of-Experts foundation models built for long-horizon, agentic coding”
“On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes”
“Laguna XS.2 weights are released under Apache 2.0”
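To make the deployment trade-off concrete, here is a rough back-of-the-envelope sketch using only the parameter counts quoted above. The precisions (2 bytes per parameter for bf16, 0.5 for 4-bit) are assumptions for illustration, not formats confirmed by the release, and the estimate covers weights only, not KV cache or activations.

```python
# Parameter counts in billions, taken from the figures quoted above.
MODELS = {
    "M.1": {"total_b": 225.8, "active_b": 23.4},
    "XS.2": {"total_b": 33.4, "active_b": 3.0},
}

# Assumed storage precisions (hypothetical; the release format is not stated here).
BYTES_PER_PARAM = {"bf16": 2.0, "int4": 0.5}


def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return total_params_b * bytes_per_param


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of parameters used per token; drives compute, not memory."""
    return active_b / total_b


for name, p in MODELS.items():
    frac = active_fraction(p["active_b"], p["total_b"])
    sizes = ", ".join(
        f"{fmt}: {weight_memory_gb(p['total_b'], b):.1f} GB"
        for fmt, b in BYTES_PER_PARAM.items()
    )
    print(f"{name}: {frac:.1%} of parameters active per token; weights ~ {sizes}")
```

Under these assumptions, memory has to be sized for the total parameter count, while per-token compute scales with the activated count.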

moe-models code-generation agentic-ai swe-bench open-weights

Dev Signal

Quick Signals

Treat Claude Code as autonomous agent with guardrails

Stop treating Claude Code as autocomplete. Build feedback loops so it verifies its own work, and compound improvements through CLAUDE.md rules extracted from failures.

Developers using verification loops reportedly see a 2-3x quality improvement and shift from manual iteration to delegated execution. Because CLAUDE.md rules compound, the same prompt produces better output over the following weeks instead of degrading.

Replaces line-by-line pair programming. Requires committing .claude/ config to git, using plan mode (Shift+Tab twice) before coding, and capturing mistakes as rules. Worth implementing today: the concrete patterns (delegation briefs, plan review in fresh sessions, rules-from-failures) are field-tested by Anthropic's team.

“give Claude a way to verify its own work. Without that, you are the only feedback loop. With it, Claude iterates until things actually work, and Boris says this alone gives a 2-3x quality improvement”
“The model performs best if you treat it like an engineer you're delegating to, not a pair programmer you're guiding line by line”
“Claude is surprisingly good at distilling its own mistakes into precise rules”
“The single most important principle from Boris Cherny and the Anthropic team: give Claude a way to verify its own work”
“Every time Claude does something wrong, tell it: 'Update CLAUDE.md so you do not repeat this'”
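The two patterns above (a verification loop, and rules captured from failures) can be sketched generically. This is a minimal illustration, not Claude Code's API: `iterate_until_verified`, `append_rule`, and the toy `attempt`/`verify` callables are hypothetical names, and the bullet format for CLAUDE.md rules is an assumption.

```python
from typing import Callable, Optional, Tuple


def iterate_until_verified(
    attempt: Callable[[Optional[str]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Call `attempt`, check the result with `verify`, and feed failures back.

    Returns (accepted_candidate, rounds_used), or (None, max_rounds) on failure.
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = attempt(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate, round_no
    return None, max_rounds


def append_rule(claude_md: str, rule: str) -> str:
    """Return CLAUDE.md text with `rule` added as a bullet, skipping duplicates."""
    line = f"- {rule.strip()}"
    if line in claude_md.splitlines():
        return claude_md
    sep = "" if claude_md.endswith("\n") or not claude_md else "\n"
    return f"{claude_md}{sep}{line}\n"


# Toy stand-ins for an agent and a test suite (hypothetical).
def toy_attempt(feedback: Optional[str]) -> str:
    # First try is wrong; the retry uses the feedback signal to correct it.
    return "a - b" if feedback is None else "a + b"


def toy_verify(expr: str) -> Tuple[bool, str]:
    ok = eval(expr, {"a": 2, "b": 3}) == 5
    return ok, "" if ok else f"{expr!r} gave the wrong sum for a=2, b=3"


result, rounds = iterate_until_verified(toy_attempt, toy_verify)
rules = append_rule("# Project rules\n", "Run the test suite before reporting done")
```

In practice the `verify` step would be whatever check the agent can run itself, such as a test suite or a build, so that a human is no longer the only feedback loop.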

claude-code ai-agents workflow configuration prompt-engineering

Agent adoption doubles to 59% but humans stay in control

Data Point

Anchor formalizes ERP agent benchmarking with constraint optimization

Anchor generates task harnesses from constraint specs, producing verifiable ground-truth solutions and state-based rewards that eliminate artifact drift in agent evaluation.

In agent evaluation environments, the instruction, environment, oracle, and verifier frequently disagree on what a task requires, which makes tasks unsolvable or reward-hackable. Anchor's pipeline jointly produces all components from a single spec, letting you generate controllable difficulty benchmarks with known optimal solutions for production ERP workflows.

Replaces manual task harness construction with parametric generation. Requires formalizing domain workflows as constraint optimization programs. On ERP-Bench (300 tasks covering procurement and manufacturing), frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4%. The numbers are useful for calibrating agent capability, not evidence of production readiness. Worth evaluating if you own ERP agent evaluation; the task generator and dataset are released.

“artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires”
“frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials”
“harness-agnostic environments whose rewards depend solely on end-state business correctness”

agent-evaluation benchmark constraint-optimization erp-systems task-generation

Logic Apps agents execute code in Hyper-V sandboxes

Run local speech pipeline for Reachy Mini robots

Ollama switches to llama.cpp backend, adds GGUF support

Next.js fixes Turbopack imports, devtools, benchmarking