May 28, 2026

MoE coding models, agent adoption surges to 59%

Tool of the Week

Laguna releases mixture-of-experts coding models

M.1 (225.8B parameters, 23.4B activated) and XS.2 (33.4B total, 3B activated) are MoE models trained end-to-end in a versioned Model Factory stack, competitive on SWE-bench and terminal coding tasks.

MoE architecture reduces inference cost per token while maintaining competitive performance on agentic software engineering benchmarks. XS.2's Apache 2.0 release gives builders a smaller, deployable baseline for terminal-based coding workflows.

XS.2 weights are available now under Apache 2.0, making it a candidate to replace closed agentic models in local deployments. Running it still requires infrastructure sized for the full 33.4B parameters: only 3B are activated per token, which lowers compute per token, but all expert weights must be held in memory. Evaluate it on your own SWE-bench-like tasks before committing. M.1 is described only in the technical report; its weights are not yet open.

  • M.1 has 225.8B total parameters (23.4B activated per token) and XS.2 has 33.4B total (3B activated)
  • two Mixture-of-Experts foundation models built for long-horizon, agentic coding
  • On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes
  • Laguna XS.2 weights are released under Apache 2.0
moe-models, code-generation, agentic-ai, swe-bench, open-weights
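
To make the memory-versus-compute trade-off concrete, here is a back-of-the-envelope sketch in Python. It uses only the parameter counts quoted above. The bytes-per-parameter values are the usual sizes for 16-bit, 8-bit, and 4-bit weights, and the estimate assumes every expert stays resident in memory. It ignores KV cache, activations, and runtime overhead, so read the results as lower bounds rather than sizing guidance.

```python
# Rough weight-memory estimate for a mixture-of-experts (MoE) model.
# Assumption: all experts are resident, so memory scales with TOTAL
# parameters while per-token compute scales with ACTIVATED parameters.

GB = 1e9  # decimal gigabytes


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone (no KV cache, activations, or overhead)."""
    return total_params * bytes_per_param / GB


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the parameters that participate in each token's forward pass."""
    return active_params / total_params


# (total parameters, activated parameters) as quoted in the item above.
MODELS = {
    "M.1": (225.8e9, 23.4e9),
    "XS.2": (33.4e9, 3.0e9),
}

# Bytes per parameter for common weight precisions.
PRECISIONS = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}

for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
    for label, nbytes in PRECISIONS.items():
        print(f"  {label}: ~{weight_memory_gb(total, nbytes):.1f} GB of weights")
```

For XS.2 this gives roughly 66.8 GB of weights at 16-bit and 16.7 GB at 4-bit, even though only about 9% of the parameters are used for any single token.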

Dev Signal

Quick Signals

Treat Claude Code as autonomous agent with guardrails

Stop treating Claude Code as autocomplete. Build feedback loops so it verifies its own work, and compound improvements by extracting CLAUDE.md rules from failures.

Boris Cherny of Anthropic reports that verification loops alone yield a 2-3x quality improvement, shifting work from manual iteration to delegated execution. Because CLAUDE.md rules accumulate, the same prompt should produce better output over weeks rather than degrade.

Replaces line-by-line pair programming. Requires committing .claude/ config to git, using plan mode (Shift+Tab twice) before coding, and capturing mistakes as rules. Worth implementing today: the concrete patterns (delegation briefs, plan review in fresh sessions, rules-from-failures) are field-tested by Anthropic's team.

  • The single most important principle from Boris Cherny and the Anthropic team: give Claude a way to verify its own work. Without that, you are the only feedback loop. With it, Claude iterates until things actually work, and Boris says this alone gives a 2-3x quality improvement
  • The model performs best if you treat it like an engineer you're delegating to, not a pair programmer you're guiding line by line
  • Claude is surprisingly good at distilling its own mistakes into precise rules
  • Every time Claude does something wrong, tell it: 'Update CLAUDE.md so you do not repeat this'
claude-code, ai-agents, workflow, configuration, prompt-engineering
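
One way to give an agent a feedback loop is a single script it can run after every change. The sketch below is a generic illustration, not a pattern taken from Anthropic's material: the names `run_check`, `verify`, and `format_report` are hypothetical, and the check list is a placeholder you would replace with your project's own test and lint commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_check(name: str, command: list[str], timeout_s: int = 300) -> CheckResult:
    """Run one verification command and capture its output as feedback."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return CheckResult(name, False, f"timed out after {timeout_s}s")
    except FileNotFoundError:
        return CheckResult(name, False, f"command not found: {command[0]}")
    output = (proc.stdout + proc.stderr).strip()
    return CheckResult(name, proc.returncode == 0, output)


def verify(checks: dict[str, list[str]]) -> list[CheckResult]:
    """Run every check so the agent sees all failures, not just the first."""
    return [run_check(name, command) for name, command in checks.items()]


def format_report(results: list[CheckResult]) -> str:
    """Summarize results; failing checks include their captured output."""
    lines = []
    for result in results:
        status = "PASS" if result.passed else "FAIL"
        lines.append(f"[{status}] {result.name}")
        if not result.passed and result.output:
            lines.append(result.output)
    return "\n".join(lines)


# Hypothetical check list: replace with your real test and lint commands.
checks = {"python-available": [sys.executable, "--version"]}
print(format_report(verify(checks)))
```

The point of the design is that failures come back as text the agent can read and act on, so it can iterate without a human relaying error messages.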

Agent adoption nearly doubles to 59% but humans stay in control

Developers are adopting single-agent workflows with mandatory human review rather than autonomous systems; GitHub Copilot (65%) and Claude Code (50%) dominate practical implementations.

Agent usage is now embedded in daily developer work across roles (40% daily use among devs, 52% among architects), shifting the conversation from adoption to operational control and security governance. Understanding which tools integrate safely into existing CI/CD affects toolchain decisions.

Replaces manual code review with AI-assisted review; requires approval gates before agent-triggered system changes (60% of respondents block unapproved changes). Single-agent setups are production-ready now. Multi-agent orchestration remains niche: only daily multi-agent users (70% of whom use Claude Code) justify the complexity. Start with GitHub Copilot or Claude Code in gated workflows, not autonomous pipelines.

  • agentic usage has almost doubled (59%) since we last asked about it
  • 63% of technologists still rarely or never let agents run entirely on autopilot
  • Most (60%) of survey respondents block agents from making unapproved system changes
  • the tool of choice for the majority of respondents (full-stack developers) is GitHub Copilot (65%) or Claude Code (50%)
  • 1,100 developers and working professionals responded to our survey
  • Accuracy and security remain the top two concerns with using agents at work
ai-agents, workflow-integration, governance, survey-data, tooling
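
A gated workflow of the kind respondents describe can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's mechanism: `Risk`, `ProposedAction`, and `ApprovalGate` are hypothetical names, and a real setup would tie approval to your CI/CD or ticketing system rather than an in-memory set.

```python
from dataclasses import dataclass, field
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"
    SYSTEM_CHANGE = "system_change"


@dataclass(frozen=True)
class ProposedAction:
    description: str
    risk: Risk


@dataclass
class ApprovalGate:
    """Allows read-only actions; blocks system changes until a human approves."""

    approved: set[str] = field(default_factory=set)
    audit_log: list[str] = field(default_factory=list)

    def approve(self, action: ProposedAction) -> None:
        """Record a human approval for one specific action."""
        self.approved.add(action.description)

    def allows(self, action: ProposedAction) -> bool:
        """Decide whether the agent may proceed, and log the decision."""
        allowed = (
            action.risk is Risk.READ_ONLY
            or action.description in self.approved
        )
        verdict = "ALLOW" if allowed else "BLOCK"
        self.audit_log.append(f"{verdict}: {action.description}")
        return allowed


gate = ApprovalGate()
deploy = ProposedAction("restart staging service", Risk.SYSTEM_CHANGE)
print(gate.allows(deploy))  # blocked until a human approves
gate.approve(deploy)
print(gate.allows(deploy))  # allowed after approval
```

Keeping an audit log next to the decision makes it possible to review later what the agent attempted, which speaks to the accuracy and security concerns the survey highlights.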

Logic Apps agents execute code in Hyper-V sandboxes

Azure Logic Apps now runs agent-generated Python, JavaScript, C#, and PowerShell in isolated containers, eliminating the need to call external Functions for mid-workflow data transformation.

Integration workflows can now inline code generation and execution within the same security boundary, reducing latency and external API calls. Hallucinated destructive code cannot escape the sandbox, shifting risk from deployment to execution.

Replaces Azure Function invocations for lightweight transformations in agent loops. Requires an Azure Container Apps (ACA) session pool and public preview opt-in. Ready now if you're already on Logic Apps Agent Loop; the main overhead is provisioning the ACA infrastructure.

  • Each code interpreter session runs in its own Hyper-V boundary, a hardware-level isolation primitive that Microsoft also uses for its own untrusted workloads.
  • an LLM can receive a natural-language instruction, generate code to fulfill it, execute that code in a secure sandbox, and return the results, all within a single governed workflow
  • Logic Apps Agent Loop is best suited when your scenario requires orchestrating across multiple enterprise systems, ERP, CRM, databases, APIs, with built-in governance, retry logic, and audit trails.
  • Logic Apps code interpreters are available now in public preview
azure-logic-apps, code-execution, sandbox-isolation, agent-workflows, integration-platforms

Run local speech pipeline for Reachy Mini robots

A VAD → STT → LLM → TTS cascade on a single machine eliminates cloud dependency; swap components as models improve.

Removes API latency, cost, and privacy surface from voice agent deployments. Developers can iterate on pipeline components independently without redeploying entire infrastructure.

Replaces cloud speech backends (OpenAI Realtime API, Hugging Face Inference Endpoints). Requires llama.cpp, the speech-to-speech CLI, and 2-3 terminal sessions to bootstrap. Ready now: Gemma-4, Silero VAD, Parakeet-TDT, and Qwen3-TTS are tested and recommended. The latency bottleneck is LLM inference; to scale, decouple the LLM stage behind the Responses API protocol.

  • speech-to-speech, our cascaded VAD → STT → LLM → TTS pipeline that exposes a Realtime API-compatible /v1/realtime WebSocket
  • Cascades are the most flexible option in the open-source landscape today, and with the right pieces they're also the fastest
  • The main bottleneck in the system is LLM inference latency
  • Full support for the Responses API protocol, including tool-call streaming used by the speech-to-speech backend, landed in vLLM 0.21.0
voice-agents, local-inference, cascade-architecture, robotics, llm-latency
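
The reason a cascade is easy to iterate on is that each stage is just a function from one representation to the next. The Python sketch below illustrates that shape with stub stages. It does not use the speech-to-speech project's actual API: `Cascade` and the `stub_*` functions are hypothetical stand-ins for real VAD, STT, LLM, and TTS models.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Each stage is a plain callable, so any one can be swapped independently.
VAD = Callable[[bytes], Optional[bytes]]  # audio -> speech segment, or None
STT = Callable[[bytes], str]              # speech segment -> transcript
LLM = Callable[[str], str]                # transcript -> reply text
TTS = Callable[[str], bytes]              # reply text -> audio


@dataclass
class Cascade:
    vad: VAD
    stt: STT
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> Optional[bytes]:
        """Run one turn through the cascade; return None if no speech is found."""
        speech = self.vad(audio)
        if speech is None:
            return None
        transcript = self.stt(speech)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub stages for illustration only; real ones would wrap actual models.
def stub_vad(audio: bytes) -> Optional[bytes]:
    # Treat all-zero (or empty) audio as silence.
    return audio if audio.strip(b"\x00") else None


def stub_stt(speech: bytes) -> str:
    return speech.decode("utf-8")


def stub_llm(text: str) -> str:
    return f"You said: {text}"


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = Cascade(stub_vad, stub_stt, stub_llm, stub_tts)
print(pipeline.respond(b"hello robot"))

# Swapping one component leaves the rest of the pipeline untouched.
shouting = Cascade(stub_vad, stub_stt, lambda text: text.upper(), stub_tts)
print(shouting.respond(b"hello robot"))
```

Since the LLM stage is the slowest link, it is also the natural one to move behind a network protocol while the other stages stay local.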

Ollama switches to llama.cpp backend, adds GGUF support

Ollama 0.30.0-rc28 replaces its GGML foundation with direct llama.cpp integration and GGUF compatibility, with MLX acceleration on Apple Silicon.

Direct llama.cpp backend reduces abstraction layers, potentially improving performance and compatibility with the broader inference ecosystem. Developers can now use GGUF files directly, standardizing model format interchange.

Replaces GGML stack with llama.cpp; requires testing performance/memory on your hardware before production use. Two known gaps: laguna-xs.2 and llama3.2-vision unsupported. Worth trying in rc28 if you run models on Mac/Linux/Windows, but wait for 0.30.0 stable if you rely on those missing model types.

  • directly support llama.cpp instead of building on top of GGML
  • allows for compatibility with GGUF file format
  • MLX is used to accelerate model inference on Apple Silicon
  • laguna-xs.2 is not supported yet on this pre-release
  • llama3.2-vision is not supported yet on this pre-release
ollama, llama-cpp, gguf, inference, apple-silicon
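
If you plan to hand model files straight to a GGUF-compatible runtime, it helps to confirm that a file really is GGUF before loading it. The Python sketch below reads the fixed-size header: a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts. It assumes a little-endian file at format version 2 or 3 and rejects anything else. It is independent of Ollama and llama.cpp, and `read_gguf_header` is a hypothetical helper name.

```python
import io
import struct
from typing import BinaryIO, NamedTuple

GGUF_MAGIC = b"GGUF"


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Parse the fixed-size GGUF header from a binary stream.

    Assumption: little-endian file, format version 2 or 3 (64-bit counts).
    """
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")

    raw_version = stream.read(4)
    if len(raw_version) != 4:
        raise ValueError("truncated GGUF header")
    (version,) = struct.unpack("<I", raw_version)
    if version not in (2, 3):
        raise ValueError(f"unsupported GGUF version for this sketch: {version}")

    raw_counts = stream.read(16)
    if len(raw_counts) != 16:
        raise ValueError("truncated GGUF header")
    tensor_count, metadata_kv_count = struct.unpack("<QQ", raw_counts)
    return GGUFHeader(version, tensor_count, metadata_kv_count)


# Demo on a synthetic in-memory header; the counts here are made up.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))
print(read_gguf_header(fake_file))
```

For a real file, open it with `open(path, "rb")` and pass the file object in. A passing header check says nothing about whether a given runtime supports that model's architecture.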

Next.js fixes Turbopack imports, devtools, benchmarking

Turbopack now respects module-sync exports and external package subpaths; devtools detects renamed VS Code macOS binary; benchmarking adds percentile comparison and retry logic.

These fixes reduce friction in build tooling and local development iteration: external package imports work correctly, editor launch detection doesn't fail on macOS, and benchmark results become more reliable. Cumulative effect is fewer surprises during development.

Cherry-pick relevant fixes into your Next.js version if you hit the specific issues (Turbopack subpath imports, VS Code launch, benchmark flakiness). Otherwise wait for the next stable release. Low friction to adopt once released.

  • Turbopack: fix subpath imports pointing to external packages
  • fix(devtools): detect VS Code renamed macOS binary in launch-editor
  • devlow-bench: percentile-based comparison and run retries
  • Turbopack: respect the module-sync export condition
turbopack, next-js, tooling, devtools, benchmarking
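
To see why percentile-based comparison and run retries make benchmark results steadier, consider this generic Python sketch. It is not devlow-bench's implementation. The nearest-rank percentile, the 5% tolerance, and the function names are all assumptions chosen for illustration.

```python
import math
from typing import Callable, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def is_regression(
    baseline: Sequence[float],
    candidate: Sequence[float],
    p: float = 90.0,
    tolerance: float = 0.05,
) -> bool:
    """Flag a regression when the candidate's p-th percentile exceeds
    the baseline's by more than the tolerance (5% by default)."""
    return percentile(candidate, p) > percentile(baseline, p) * (1 + tolerance)


def measure_with_retries(
    run: Callable[[], float], runs: int, max_retries: int = 2
) -> list[float]:
    """Collect `runs` timings, retrying a run that raises up to max_retries times."""
    samples: list[float] = []
    for _ in range(runs):
        for attempt in range(max_retries + 1):
            try:
                samples.append(run())
                break
            except Exception:
                if attempt == max_retries:
                    raise
    return samples


baseline = [100.0] * 10
candidate = [100.0] * 9 + [200.0]  # one slow outlier among ten runs
print(is_regression(baseline, candidate, p=90))   # outlier does not trip p90
print(is_regression(baseline, candidate, p=100))  # comparing the max does
```

Comparing at a percentile below the maximum keeps a single noisy run from being reported as a regression, and retries keep one transient failure from discarding the whole measurement.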

Data Point

Anchor formalizes ERP agent benchmarking with constraint optimization

Anchor generates task harnesses from constraint specs, producing verifiable ground-truth solutions and state-based rewards that eliminate artifact drift in agent evaluation.

Agent evaluation environments frequently diverge between instruction, environment, oracle, and verifier—making tasks unsolvable or reward-hackable. Anchor's pipeline jointly produces all components from a single spec, letting you generate controllable difficulty benchmarks with known optimal solutions for production ERP workflows.

Replaces manual task harness construction with parametric generation; requires formalizing domain workflows as constraint optimization programs. On ERP-Bench (300 tasks across procurement and manufacturing), frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4%. That makes the benchmark useful for calibrating agent capability, and it suggests current agents are not production-ready for these workflows. Worth evaluating if you own ERP agent evaluation; the task generator and dataset are released.

  • artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires
  • frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials
  • harness-agnostic environments whose rewards depend solely on end-state business correctness
agent-evaluation, benchmark, constraint-optimization, erp-systems, task-generation
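
The idea of deriving the instruction, oracle, and verifier from a single spec can be shown with a toy procurement task in Python. This is a sketch of the general idea only, not Anchor's pipeline: the spec format, the brute-force oracle, and every name below are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Supplier:
    name: str
    unit_cost: int
    capacity: int


@dataclass(frozen=True)
class TaskSpec:
    demand: int
    suppliers: tuple[Supplier, ...]


def instruction(spec: TaskSpec) -> str:
    """Task text generated from the spec, so it cannot drift from the verifier."""
    offers = "; ".join(
        f"{s.name}: {s.unit_cost}/unit, max {s.capacity}" for s in spec.suppliers
    )
    return f"Order at least {spec.demand} units at minimum total cost. {offers}"


def is_feasible(spec: TaskSpec, order: dict[str, int]) -> bool:
    """Check the explicit constraints: known suppliers, capacity, demand."""
    caps = {s.name: s.capacity for s in spec.suppliers}
    if any(name not in caps for name in order):
        return False
    if any(qty < 0 or qty > caps[name] for name, qty in order.items()):
        return False
    return sum(order.values()) >= spec.demand


def cost(spec: TaskSpec, order: dict[str, int]) -> int:
    return sum(s.unit_cost * order.get(s.name, 0) for s in spec.suppliers)


def oracle(spec: TaskSpec) -> dict[str, int]:
    """Brute-force ground truth; fine for toy sizes, not for real workloads."""
    names = [s.name for s in spec.suppliers]
    ranges = [range(s.capacity + 1) for s in spec.suppliers]
    candidates = (dict(zip(names, qtys)) for qtys in product(*ranges))
    feasible = [c for c in candidates if is_feasible(spec, c)]
    if not feasible:
        raise ValueError("spec has no feasible solution")
    return min(feasible, key=lambda c: cost(spec, c))


def reward(spec: TaskSpec, order: dict[str, int]) -> dict[str, bool]:
    """State-based reward: judges only the final order, not how it was produced."""
    feasible = is_feasible(spec, order)
    optimal = feasible and cost(spec, order) == cost(spec, oracle(spec))
    return {"feasible": feasible, "optimal": optimal}


spec = TaskSpec(
    demand=10,
    suppliers=(Supplier("A", 5, 6), Supplier("B", 3, 5), Supplier("C", 4, 4)),
)
print(instruction(spec))
print(oracle(spec), cost(spec, oracle(spec)))
print(reward(spec, {"A": 6, "B": 4}))  # meets constraints, but costs more
```

The last line shows the gap the benchmark measures: an order can satisfy every explicit constraint and still fall short of the known optimum.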
