AI agents level up: storage, code, and new models — Dev Signal
June 10, 2026
Tool of the Week
Tigris Agent Plugin teaches AI agents storage infrastructure
Pre-built skills and a specialized subagent replace hallucinated CLI flags and multi-step guesswork with deterministic Tigris operations across Claude Code, Cursor, and other agents.
Agents currently fail at infrastructure tasks: they invent nonexistent flags, skip security checks, and need manual corrections that defeat the point of automation. This plugin moves storage setup from error-prone trial-and-error to reliable, policy-enforced workflows in your existing agent environment.
Replaces manual Tigris CLI lookups and agent hallucination cycles with five auto-loading skills (auth, buckets, objects, access-keys, IAM) plus a subagent for multi-step workflows. Requires one install (marketplace, settings rule, or manual clone depending on agent), then operators handle setup and migrations via natural language. Worth trying now if you use Claude Code or Cursor; minimal friction to validate.
“Five skills that auto-load based on context, a specialized subagent for multi-step storage workflows, and a pair of security rules that keep the agent from doing anything reckless”
“Agents hallucinate flags. Without the skill, an agent might guess --region eu-west-1 (an AWS region, not a Tigris one) or invent a --public flag that doesn't exist”
“Skills contain complete CLI references with every flag, alias, and usage example”
“Buckets are private by default, presigned URLs are preferred over public access, secret keys (anything starting with tsec_) are never exposed in output”
“snapshots are taken before destructive operations”
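The rule that secret keys never appear in output can be pictured as an output filter. The following is a minimal Python sketch of that idea, not the plugin's actual mechanism: the function name `redact_secrets` is hypothetical, and the character set after the `tsec_` prefix is an assumption, since the source only states the prefix.

```python
import re

# Assumption: a secret key is the "tsec_" prefix followed by letters, digits,
# underscores, or hyphens. Only the prefix itself comes from the source.
SECRET_PATTERN = re.compile(r"\btsec_[A-Za-z0-9_-]+")


def redact_secrets(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace anything that looks like a tsec_ secret key with a placeholder."""
    return SECRET_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    # The key below is made up for illustration.
    line = "new access key created, secret: tsec_example-value_123"
    print(redact_secrets(line))
```

A filter like this would sit between the CLI output and whatever the agent echoes back to the user, so a key can still be written to a credentials file without appearing in the transcript.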
Quick Signals
Claude Fable 5 launches with state-of-the-art benchmarks
Mythos-class model achieves the strongest long-context reasoning and autonomous task execution; priced at $10 input/$50 output per million tokens, less than half the price of Claude Mythos Preview.
Developers can now delegate multi-day codebase migrations and complex analytical tasks to a single model with improved token efficiency and vision capabilities. Extended context window (millions of tokens) with persistent memory enables autonomous agents to maintain state and self-correct across longer workflows.
Replaces Opus 4.8 for code generation, knowledge work, and vision tasks—available now at public pricing. Requires updating inference clients to handle extended context and autonomous loops. Worth adopting immediately for long-horizon coding tasks (50M-line migrations tested); vision-only mode replaces scaffolded harnesses. Mythos 5 (unrestricted version) limited to trusted access program initially.
“Fable 5 compressed months of engineering into days”
“the model performed a codebase-wide migration in a day that would otherwise have taken a whole team over two months by hand”
“safeguards trigger, on average, in less than 5% of sessions”
“Fable 5 can work autonomously for longer than any previous Claude models”
“Fable 5 improved its outputs using its own notes”
“$10 per million input tokens and $50 per million output tokens—less than half the price of Claude Mythos Preview”
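To make the quoted pricing concrete, here is a small Python sketch that estimates the bill for a job from its token counts. The two rates come from the quote above; the function name and the example token counts are illustrative assumptions, and real invoices may include discounts or charges not covered here.

```python
# Rates quoted in the announcement, in USD per million tokens.
INPUT_USD_PER_MILLION = 10.0
OUTPUT_USD_PER_MILLION = 50.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate cost from token counts at the quoted list prices."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical long-running task: 2M tokens read, 500k tokens written.
    print(estimate_cost_usd(2_000_000, 500_000))  # 45.0
```

Output tokens cost five times as much as input tokens, so for long autonomous runs the amount the model writes tends to dominate the estimate.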
Code-switched speech breaks ASR pipelines predictably
Data Point
ConstMap replaces Go maps with 6x memory savings
Binary fuse filters shrink string-to-uint64 tables from 56 bytes/key to 9 bytes/key and make lookups nearly 3x faster on immutable datasets.
For large read-only lookup tables (logging IDs, config mappings, feature flags), this trades one-time construction cost for sustained memory efficiency and cache locality gains—critical when map size exceeds L3 cache. Reduces operational memory footprint without code rewrites.
Drops in as constmap.New() with identical API to Go maps. Worth adopting now for immutable string→uint64 mappings over 100k entries; not suitable for frequently modified maps or other value types. Serialize to disk to avoid reconstruction overhead.
“The map may use over 50 bytes per entry”
“you can reduce the memory usage down to almost the size of the keys, so about 8 bytes per entry”
“ConstMap is nearly 3 times faster than Go's standard map for lookups”
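The headline ratio follows directly from the per-key figures in the summary. A quick Python sketch of the arithmetic, using the 56 and 9 bytes/key numbers given above; the one-million-entry table size is an arbitrary example, not a figure from the source.

```python
# Per-key figures from the summary above.
GO_MAP_BYTES_PER_KEY = 56
CONSTMAP_BYTES_PER_KEY = 9


def table_megabytes(entries: int, bytes_per_key: int) -> float:
    """Approximate table size in decimal megabytes."""
    return entries * bytes_per_key / 1_000_000


def savings_ratio() -> float:
    """How many times smaller the compact table is per key."""
    return GO_MAP_BYTES_PER_KEY / CONSTMAP_BYTES_PER_KEY


if __name__ == "__main__":
    entries = 1_000_000  # illustrative table size
    print(table_megabytes(entries, GO_MAP_BYTES_PER_KEY))    # 56.0
    print(table_megabytes(entries, CONSTMAP_BYTES_PER_KEY))  # 9.0
    print(round(savings_ratio(), 2))                         # 6.22
```

At a million entries the difference is roughly 56 MB versus 9 MB, which is the kind of gap that decides whether a lookup table fits in cache.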
ElevenLabs Scribe V2, Gemini 3 Flash, and AssemblyAI Universal 3-Pro handle bilingual switching best; transcription errors propagate directly into downstream task failure, measurable via Answer Error Rate.
Enterprise voice agents serving bilingual customers fail silently on code-switched input—misrouted tickets, wrong policy answers. Benchmark data lets you pick ASR systems that preserve semantic meaning, not just character accuracy.
Replaces guessing which ASR handles code-switching. Requires: testing your language pairs against this benchmark (Spanish-English, French-English, Canadian French-English, German-English covered). Ready now: use AU-Harness to evaluate models. Scribe V2 and Gemini 3 Flash are your safest bets; Whisper, when called without an explicit language parameter, translates code-switched audio into English instead of transcribing it.
“ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro surface as the top models across metrics for the task”
“transcription errors propagate forward into every downstream component”
“When called without an explicit language parameter on code-switched audio, Whisper defaults to translating into English rather than transcribing”
“We report three metrics: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER)”
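Of the three metrics, Word Error Rate has a standard definition: word-level edit distance divided by the number of reference words. The Python sketch below implements that textbook formula and is not the benchmark's own scoring code; SWER and AER are left out because their exact definitions are not given here. The example sentences are invented to show how translating instead of transcribing inflates the score.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented code-switched utterance and a translated (not transcribed) output.
    spoken = "quiero cancelar my subscription today"
    translated = "I want to cancel my subscription today"
    print(word_error_rate(spoken, spoken))
    print(word_error_rate(spoken, translated))
```

In this made-up case the translated output keeps the meaning but still scores a high WER, which is one reason a benchmark might report a semantic metric and a downstream answer metric alongside it.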
30B MoE model with 3B active parameters trained on 70k verifiable coding tasks across containerized environments, optimized for cross-harness agent reliability rather than single-benchmark performance.
Agents built on single-harness-optimized models break when tooling changes (CLI vs JSON vs text). North Mini Code's multi-harness post-training reduces the friction of deploying coding agents across different frameworks without retraining.
Replaces smaller coding models (Devstral Small, Gemma 4) for agent workloads where robustness matters more than raw benchmark score. Runs with 3B active parameters (manageable inference cost); needs containerized task environments for your own RLVR stage if you want domain specificity. Worth trying now for SWE-Bench and terminal-based tasks; Apache 2.0 licensed on HuggingFace.
“30B-parameter Mixture-of-Experts model with 3B active parameters”
“specifically designed and trained for agentic software engineering tasks”
“128 experts, of which 8 are activated per token”
“over 70k verifiable tasks across ~5k unique repositories”
“code datasets correspond to 70% of trainable tokens, 43% agentic tool-use data, and 27% single-turn competitive or scientific programming data”
“achieves 61.0% pass@1 using mini-SWE-Agent”
“10% gain on the evaluation with OpenCode harness while maintaining performance with SWE-Agent”
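The quoted figures can be cross-checked with a few lines of arithmetic. This Python sketch only restates numbers from the quotes above; the variable names are illustrative.

```python
# Figures taken from the quotes above.
TOTAL_EXPERTS = 128
ACTIVE_EXPERTS_PER_TOKEN = 8
TOTAL_PARAMS_BILLIONS = 30
ACTIVE_PARAMS_BILLIONS = 3

# Share of experts consulted for each token.
expert_fraction = ACTIVE_EXPERTS_PER_TOKEN / TOTAL_EXPERTS

# Share of all parameters used for each token.
active_param_fraction = ACTIVE_PARAMS_BILLIONS / TOTAL_PARAMS_BILLIONS

# The two code sub-shares should add up to the quoted 70% code total.
AGENTIC_SHARE_PCT = 43
SINGLE_TURN_SHARE_PCT = 27
code_share_pct = AGENTIC_SHARE_PCT + SINGLE_TURN_SHARE_PCT

if __name__ == "__main__":
    print(expert_fraction)        # 0.0625
    print(active_param_fraction)  # 0.1
    print(code_share_pct)         # 70
```

The active-parameter share (10%) is higher than the active-expert share (6.25%). One plausible reason, not confirmed by the source, is that non-expert components such as attention layers and embeddings run for every token.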
Anthropic releases Fable 5 with mandatory safeguards
Fable 5 hits 80% on SWE-Bench Pro and sustains focus across millions of tokens for autonomous coding tasks, but requires 30-day data retention and costs 2x Opus.
Benchmark gains are real—Stripe modernized 50M lines of Ruby in one day versus two months of team time—but guardrails trade capability for safety, and mandatory data retention blocks privacy-sensitive workloads.
Fable 5 replaces Opus for long-context code and knowledge work. Requires API adoption ($10/$50 per million tokens) and acceptance of 30-day data retention, with no opt-out. Sub pricing expires June 22. Worth piloting on non-sensitive tasks now; hold for private deployments until data policy clarifies.
“Fable scores 80% (and Mythos 5, without the guardrails, 80.4%)”
“Stripe, for example, had Fable 5 modernize a 50-million-line Ruby codebase in one day — something the company says would've otherwise taken a team of developers two months”
“using those models means opting in to 30-day data retention — or not using them at all”
“stay "focused across millions of tokens in long-running tasks and improve its outputs using its own notes"”
“$10 per million input tokens and $50 per million output tokens. That's twice the price of Anthropic's current Opus model”
Define agent behavior in repository Markdown files with YAML frontmatter; agents execute team workflows consistently across CLI, IDE, and GitHub.
Encodes repeated task patterns and team standards once, then invokes them from the terminal without re-explaining context or re-running manual steps each time.
Replaces ad-hoc prompts and shell scripts with versioned, reviewable agent profiles stored in `.github/agents/`. Requires Copilot CLI access and repository write permissions. Worth adopting now for teams running repetitive security, compliance, or code-quality workflows.
“A custom agent is a Copilot agent that can be defined using a Markdown file.”
“Because the agent profile lives in your repository, your team can review it, version it, and share it”
“The agent profile is a Markdown file with YAML frontmatter that defines the agent's role, scope, capabilities, and guardrails, so it behaves consistently in your workflows.”
“The agent profile file ends with .agent.md – for example, accessibility.agent.md.”
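To show the file shape described in the quotes, here is a minimal Python sketch that splits a Markdown file into YAML frontmatter and body and checks the `.agent.md` suffix. It is a reading aid, not Copilot's loader: the frontmatter keys `name` and `description` in the sample are illustrative assumptions, and the parser handles only flat `key: value` lines, so a real YAML library would be needed for lists or nested values.

```python
def is_agent_profile(filename: str) -> bool:
    """Agent profile files end with .agent.md, per the quote above."""
    return filename.endswith(".agent.md")


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split '---'-delimited frontmatter from the Markdown body.

    Simplification: only flat 'key: value' lines are parsed.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    closing = [i for i in range(1, len(lines)) if lines[i].strip() == "---"]
    if not closing:
        return {}, text  # no closing delimiter, treat everything as body
    end = closing[0]

    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Hypothetical profile content; the keys and wording are made up.
SAMPLE_PROFILE = """---
name: accessibility
description: Reviews UI changes for accessibility issues
---
You are an accessibility reviewer. Flag missing alt text.
"""

if __name__ == "__main__":
    meta, body = split_frontmatter(SAMPLE_PROFILE)
    print(is_agent_profile("accessibility.agent.md"), meta, body)
```

Because the profile is plain text in the repository, the same split is what a reviewer sees in a pull request: metadata on top, instructions underneath.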