May 20, 2026

Gemini 3.5 Flash scales agentic workflows


Tool of the Week

Gemini 3.5 Flash executes agentic workflows at scale

Gemini 3.5 Flash outperforms Gemini 3.1 Pro on coding and agentic benchmarks (76.2% on Terminal-Bench 2.1) while reportedly running 4x faster than other frontier models. It is available now via the Gemini API and Antigravity.

Developers can replace multi-step manual workflows with supervised agentic execution, and latency-critical applications no longer have to trade intelligence for speed. The Antigravity harness enables parallel subagent deployment for complex tasks such as codebase refactoring and document processing.
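
The parallel subagent pattern described above can be sketched in plain Python. This is a minimal illustration of bounded fan-out with `asyncio`, not the Antigravity API: `run_subagent`, `fan_out`, and `max_parallel` are hypothetical names, and the worker is a stub standing in for a real model call.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_subagent(task: str) -> str:
    # Hypothetical stand-in for a real model or agent call (assumption).
    await asyncio.sleep(0)
    return f"done: {task}"


async def fan_out(
    tasks: List[str],
    worker: Callable[[str], Awaitable[str]],
    max_parallel: int = 4,
) -> List[str]:
    # The semaphore caps how many subagents run at the same time.
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(task: str) -> str:
        async with semaphore:
            return await worker(task)

    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*(guarded(task) for task in tasks))


def run_all(tasks: List[str]) -> List[str]:
    return asyncio.run(fan_out(tasks, run_subagent))


if __name__ == "__main__":
    print(run_all(["refactor module A", "refactor module B", "summarize docs"]))
```

Capping concurrency is a design choice here: it keeps a large task list from issuing every request at once, which matters when the worker is a rate-limited API.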

Ready now. 3.5 Flash replaces 3.1 Pro for agent-heavy workloads and coding tasks. It requires Gemini API integration or Antigravity harness setup. Early adopters (Shopify, Macquarie, Salesforce, Databricks) report that multi-step workflow automation works at production scale. Start with API access today and evaluate cost and latency tradeoffs against your current models.

  • outperforming Gemini 3.1 Pro on challenging coding and agentic benchmarks like Terminal-Bench 2.1 (76.2%), GDPval-AA (1656 Elo) and MCP Atlas (83.6%)
  • 4 times faster than other frontier models
  • available today to billions of people globally
  • often at less than half the cost of other frontier models
  • reliably execute multi-step workflows and coding tasks while sustaining frontier performance
agentic-workflows, gemini-3.5-flash, agent-api, coding-benchmarks, production-ready

Dev Signal


Quick Signals

Composer 2.5 improves long-task execution and collaboration

Targeted textual feedback during RL training fixes localized failures (bad tool calls, style violations) that global reward signals miss, enabling better long-horizon behavior without full rollout retraining.

Composer now handles multi-step coding tasks more reliably with fewer false starts, reducing iteration cycles in sustained agentic work. Better instruction following and communication style cut friction in human-AI collaboration loops.

Drop-in replacement for Composer 2 at $0.50 input / $2.50 output per 1M tokens (standard) or $3.00 / $15.00 (fast tier). It requires no client-side changes, and Cursor users get it automatically. Worth switching today if you're running long-context code tasks: the 25x increase in synthetic training tasks and the targeted feedback training are aimed at the false starts and retries that show up in multi-file editing.
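
To compare the two tiers on your own workloads, a small estimator is enough. The sketch below uses only the per-token prices quoted above; the function name and the example token counts are illustrative assumptions.

```python
# Prices in USD per 1M tokens as (input, output), from the figures quoted above.
PRICES_PER_MILLION = {
    "standard": (0.50, 2.50),
    "fast": (3.00, 15.00),
}


def estimate_cost_usd(input_tokens: int, output_tokens: int, tier: str = "standard") -> float:
    """Estimate the cost of one task on the given pricing tier."""
    input_price, output_price = PRICES_PER_MILLION[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Illustrative workload: 1M input tokens and 200K output tokens.
    for tier in PRICES_PER_MILLION:
        print(tier, estimate_cost_usd(1_000_000, 200_000, tier))
```

For that illustrative workload the estimate comes to $1.00 on the standard tier and $6.00 on the fast tier, a 6x difference that holds at any volume because both prices scale by the same factor.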

  • It is better at sustained work on long-running tasks, follows complex instructions more reliably, and is more pleasant to collaborate with
  • Composer 2.5 is trained with 25x more synthetic tasks than Composer 2
  • we trained Composer 2.5 with targeted textual feedback
  • Credit assignment during RL is becoming an increasingly difficult challenge as rollouts can span hundreds of thousands of tokens
  • Composer 2.5 is priced at $0.50/M input and $2.50/M output tokens
llm-training, reinforcement-learning, agentic-ai, long-context, cursor

Cloudflare Sandboxes Execute Claude Managed Agents

Run the agent loop on Anthropic's platform while offloading code execution, private service access, and tool calls to Cloudflare Sandboxes—decoupling the brain from the hands.

Developers can now choose execution infrastructure independent of the agent orchestration layer, enabling compliance-driven architecture, cost optimization via lightweight isolates instead of full microVMs, and zero-trust credential injection without exposing secrets to sandbox environments.
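
The credential-injection idea can be illustrated with a short sketch. It is a conceptual model only, not Cloudflare's API: `OutboundRequest`, `inject_credentials`, the host name, and the token are all hypothetical. The point is that sandboxed code builds a request without secrets, and a trusted layer outside the sandbox attaches them on the way out.

```python
from dataclasses import dataclass, field
from typing import Dict
from urllib.parse import urlparse


@dataclass(frozen=True)
class OutboundRequest:
    url: str
    headers: Dict[str, str] = field(default_factory=dict)


# Held only by the trusted egress layer, never passed into the sandbox.
# Host and token are placeholders (assumption).
SECRETS_BY_HOST = {"api.internal.example": "token-held-outside-sandbox"}


def inject_credentials(request: OutboundRequest, secrets: Dict[str, str]) -> OutboundRequest:
    """Return a copy of the request with credentials attached for known hosts."""
    host = urlparse(request.url).hostname
    token = secrets.get(host or "")
    if token is None:
        # Unknown destination: forward unchanged, with no secret attached.
        return request
    headers = {**request.headers, "Authorization": f"Bearer {token}"}
    return OutboundRequest(url=request.url, headers=headers)


if __name__ == "__main__":
    # What the agent's code inside the sandbox produces: no secret present.
    sandbox_request = OutboundRequest("https://api.internal.example/v1/items")
    print(inject_credentials(sandbox_request, SECRETS_BY_HOST).headers)
```

Because the original request object is never mutated, nothing the sandbox holds a reference to ever contains the token.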

This replaces running the entire Claude Managed Agents stack on Anthropic-provided infrastructure. It requires forking the provided deployment template and configuring the backend type (microVM or isolate). Ready now: Cloudflare provides an onboarding guide and a default template that can be deployed in minutes.

  • decoupling the brain from the hands
  • The core agent loop runs in Anthropic (the "brain"), but the infrastructure for running and executing code (the "hands") can be run anywhere, including Cloudflare
  • If we're constantly running a full microVM per agent, we'll be unnecessarily burning a ton of resources and money to enable this scale
  • Claude can read files, run commands, browse the web, and execute code
  • This lets you inject secrets into requests outside the sandbox, so the agent never has access to them
claude-agents, cloudflare-workers, sandbox-execution, infrastructure-decoupling, zero-trust-auth

Gemini 3.5 Flash executes agentic tasks at scale

3.5 Flash delivers frontier-level coding and agentic performance while running 4 times faster than other frontier models, often at less than half their cost.

Reduces latency bottlenecks in agent execution and cuts inference costs for long-horizon tasks, making complex automation more economically viable for production workloads. Multi-agent orchestration via the Antigravity harness enables parallel subagent execution without rebuilding orchestration layers.

Replaces 3.1 Pro for agentic workloads and coding tasks. Subagent coordination requires the Antigravity framework (Google's agent-first platform); the model itself is available now via the Gemini API and Google AI Studio. Worth evaluating for existing agentic systems now, since the speed and cost differences are measurable on your own workloads.

  • 4 times faster than other frontier models
  • outperforming Gemini 3.1 Pro on challenging coding and agentic benchmarks like Terminal-Bench 2.1 (76.2%), GDPval-AA (1656 Elo) and MCP Atlas (83.6%)
  • often at less than half the cost of other frontier models
  • When coupled with the updated Antigravity harness, 3.5 Flash becomes a powerful engine for deploying collaborative subagents to tackle problems at scale
gemini-3.5-flash, agentic-workflows, agent-orchestration, inference-latency, cost-optimization

Gemini 3.1 Flash-Lite launches at scale pricing

New model delivers 2.5X faster time-to-first-token than 2.5 Flash at $0.25/1M input tokens, targeting high-volume inference workloads with selectable reasoning depth.

Reduces inference cost and latency for production translation, moderation, and UI generation pipelines. Thinking levels let you dial reasoning up/down per request, managing cost-quality tradeoffs at scale.
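
Per-request control of reasoning depth usually ends up as a small routing table in application code. The sketch below shows that idea only: the level names (`low`, `medium`, `high`), the task-to-level mapping, and the request dictionary shape are assumptions for illustration, not the Gemini API's actual parameters.

```python
from typing import Dict

# Assumed level names and task mapping; check the API docs for the real values.
DEFAULT_LEVEL = "medium"
LEVEL_BY_TASK: Dict[str, str] = {
    "translation": "low",      # high volume, latency-sensitive
    "moderation": "low",
    "ui_generation": "high",   # assumed to benefit from more reasoning
}


def choose_thinking_level(task_type: str) -> str:
    """Pick a reasoning depth for a task, falling back to the default."""
    return LEVEL_BY_TASK.get(task_type, DEFAULT_LEVEL)


def build_request(task_type: str, prompt: str) -> Dict[str, str]:
    # Hypothetical request shape; a real client would map this to SDK parameters.
    return {"prompt": prompt, "thinking_level": choose_thinking_level(task_type)}


if __name__ == "__main__":
    print(build_request("translation", "Translate 'hello' to French."))
    print(build_request("ui_generation", "Generate a settings page layout."))
```

Keeping the mapping in one table makes it easy to re-tune after benchmarking each task type at different levels.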

Replaces 2.5 Flash for latency-sensitive, high-volume tasks. It requires migrating inference calls to the Gemini API or Vertex AI. The model is in preview, so production readiness is not yet established. Worth benchmarking against your current model on actual workloads now.

  • Priced at just $0.25/1M input tokens and $1.50/1M output tokens
  • 2.5X faster Time to First Answer Token and 45% increase in output speed, according to the Artificial Analysis benchmark
  • comes standard with thinking levels in AI Studio and Vertex AI, giving developers the control and flexibility to select how much the model "thinks" for a task
  • 3.1 Flash-Lite achieves an impressive Elo score of 1432 on the Arena.ai Leaderboard
gemini, inference-optimization, cost-efficiency, latency, preview

Six ModernBERT rerankers ship with training recipe

Six distilled cross-encoders (17M–1B params) built on Ettin encoders handle the rerank stage of retrieve-then-rerank pipelines, with a 1.7x–8.3x speedup over default loading when Flash Attention 2 and bfloat16 are enabled.

Rerankers require per-pair inference, making them expensive at scale. In a retrieve-then-rerank pipeline, these models score only the top-K retrieved candidates instead of the entire corpus, keeping latency bounded while improving result quality.

Drop-in replacement for generic cross-encoders in production pipelines. Requires sentence-transformers library and optional flash-attention2 + bfloat16 for speed gains. Ready now: three lines of code, supports 8K token context, benchmarked on MTEB. Start with 32M or 68M for typical trade-offs.
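
The retrieve-then-rerank flow can be sketched without any model at all. In the example below both scorers are toy stand-ins (term counting for the cheap retriever, Jaccard overlap for the reranker), and every function name is hypothetical. In a real pipeline the cheap scorer would be an embedding model and the rerank scorer would be one of the cross-encoders, for instance scoring (query, document) pairs through the Sentence Transformers `CrossEncoder` class.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,
    rerank_score: Scorer,
    top_k: int,
    top_n: int,
) -> List[Tuple[str, float]]:
    # Stage 1: the cheap scorer runs over the whole corpus and keeps top_k.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:top_k]
    # Stage 2: the expensive scorer runs only on those top_k candidates.
    scored = [(doc, rerank_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def term_count(query: str, doc: str) -> float:
    # Toy retriever: counts occurrences of query terms, repeats included.
    terms = set(query.lower().split())
    return float(sum(1 for token in doc.lower().split() if token in terms))


def jaccard(query: str, doc: str) -> float:
    # Toy reranker: set overlap between query and document tokens.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if (q | d) else 0.0


CORPUS = [
    "fast cars fast bikes fast boats",
    "rerankers are fast",
    "bananas are yellow",
]

if __name__ == "__main__":
    for doc, score in retrieve_then_rerank("fast rerankers", CORPUS, term_count, jaccard, 2, 2):
        print(f"{score:.3f}  {doc}")
```

In this toy corpus the cheap stage ranks the repetitive "fast cars" document first, and the rerank stage moves "rerankers are fast" to the top, which is the correction a cross-encoder is meant to provide on real data.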

  • six new Sentence Transformers CrossEncoder rerankers, state-of-the-art at their respective sizes
  • retrieve-then-rerank: a fast embedding model retrieves the top-K candidates (cheap), then a cross-encoder re-orders just those K with high accuracy
  • a 1.7x-8.3x speedup over default loading depending on model size and sequence length
  • All six accept up to 8K tokens of context (useful for long-document reranking) thanks to ModernBERT's long-context pre-training
reranking, cross-encoders, retrieval, sentence-transformers, modernbert

Data Point

Qwen3.7 Preview enters the Text and Vision Arena leaderboards

Qwen3.7 Max Preview ranks 13th overall in Text Arena and Qwen3.7 Plus Preview ranks 16th overall in Vision Arena, so the two placements come from different variants: one for text and one for vision.

Developers building production search or vision pipelines gain another ranked baseline to benchmark against. Arena leaderboards now include a model you can test locally before committing infrastructure.

Replaces the need to spin up custom evals against Qwen3.5. Requires nothing, since Arena is public. Worth testing now if you're comparing multimodal routing strategies, but rankings are snapshot-based and don't reflect your data distribution.

  • Qwen3.7 Preview is now on Arena for Text and Vision
  • Qwen3.7 Max Preview ranks 13th overall in Text Arena
  • Qwen3.7 Plus Preview ranks 16th overall in Vision Arena
model-ranking, multimodal, benchmark, qwen, inference
