SMCEvolve cuts LLM calls for program search
Sequential Monte Carlo sampling replaces reward-maximization trial-and-error in LLM-driven code generation, with finite-sample complexity bounds on LLM budget.
Reduces wasted API calls in evolutionary code search by applying principled sampling theory instead of greedy mutations. Matters for teams iterating on symbolic regression, algorithm optimization, or automated ML research where LLM cost dominates.
Replaces ad-hoc LLM mutation loops with SMC-driven parent resampling and a mutation mixture with an acceptance step. Requires recasting your reward function as a target distribution and integrating Sequential Monte Carlo sampling. Code is available and ready for research teams now; production adoption needs validation on your domain.
- “LLM-driven program evolution has emerged as a powerful tool for automated scientific discovery”
- “recasts program search as sampling from a reward-tilted target distribution”
- “three core mechanisms emerge as principled components: adaptive parent resampling, mixture of mutation with acceptance, and automatic convergence control”
- “finite-sample complexity analysis that bounds the LLM-call budget required to reach a target approximation error”
- “SMCEvolve surpasses state-of-the-art evolving systems while using fewer LLM calls under self-determined termination”
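To make "adaptive parent resampling" from a "reward-tilted target distribution" concrete, here is a minimal sketch. It is not SMCEvolve's implementation: the exponential tilt `exp(beta * reward)`, the `beta` parameter, and all function names are assumptions chosen for illustration, and the effective sample size shown here is only one common SMC diagnostic.

```python
import math
import random


def tilted_weights(rewards, beta):
    """Normalized weights proportional to exp(beta * reward).

    The exponential tilt and `beta` are illustrative assumptions,
    not taken from the SMCEvolve paper.
    """
    top = max(rewards)  # subtract the max for numerical stability
    unnorm = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def effective_sample_size(weights):
    """Standard SMC diagnostic: 1 / sum(w^2) for normalized weights."""
    return 1.0 / sum(w * w for w in weights)


def resample_parents(population, rewards, beta, rng: random.Random):
    """Multinomial resampling: draw parents in proportion to tilted weight."""
    weights = tilted_weights(rewards, beta)
    return rng.choices(population, weights=weights, k=len(population))
```

A low effective sample size means a few candidates dominate the weights, which is the usual trigger for resampling in SMC; a greedy loop would instead always mutate the single best candidate.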
llm-optimization program-synthesis monte-carlo cost-reduction symbolic-regression
Claude Code /goal asks too many questions mid-run
Claude Code's /goal interrupts for user judgment calls and gives up early on long tasks, the opposite of the unattended paradigm the feature is meant for, while Codex runs unattended until its tokens are exhausted.
Long-horizon agent tasks fail not from capability gaps but from model behavior: Claude Code breaks autonomy by asking for direction mid-task, forcing manual monitoring. Codex demonstrates that stubborn, delegation-averse execution actually scales better than orchestrated subagents.
Don't switch to Claude Code /goal yet for overnight/unattended tasks. Codex /goal is production-ready for long runs despite its 400K default context. Claude Code 2.1.139 adds the feature but requires babysitting, so use it for interactive, short-horizon work only. Anthropic's laziness regression (April 16 rollout) was never fully reversed; the RL layer internalized the bias.
- “Claude Code added the feature in its May 12 2.1.139 release—straight to stable, not experimental”
- “Codex almost never calls subagents; it works inline unless I explicitly tell it to delegate”
- “Most importantly, it's stubborn. It almost never tells me a goal is unachievable”
- “it kept popping up to ask me to make choices. And the questions are usually on point. But under /goal, this is a bug, not a feature”
- “it proactively tells me it can't achieve the goal. Then it actually fails the goal. Sometimes after just a few dozen minutes”
- “After each compaction, Claude Code often seems to have forgotten everything that came before”
- “on April 16 they had added a 'reduce verbosity' instruction to the system prompt”
- “You can't fix that by tweaking a system prompt”
- “In extended continuous operation like /goal, this laziness gets amplified”
agentic-ai claude-code long-horizon-tasks context-management autonomous-agents
Supervise agents like services, not scripts
Three patterns—process supervision, state persistence, timeout bounds—took production agent uptime from 71% to 99.4% without changing agent code.
Agent reliability in production depends entirely on operational infrastructure, not model capability. Crashes are common; the win is recovering in under 30 seconds instead of losing a night of work.
Replaces naive while-loop agents with supervisord/systemd management, SQLite checkpoints per tool call, and signal-based timeouts on every tool. Requires 2–3 hours to wire in; worthwhile immediately if you run agents on any infrastructure you control. The author notes that operational overhead becomes significant by month three, making managed hosting ($99/mo+) a viable alternative.
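As a rough illustration of the per-tool-call SQLite checkpoint, the sketch below stores agent state after each step and reloads the latest step after a restart. The table layout, the `run_id`/`step` keys, and the function names are all hypothetical; the original post's schema is not shown in this digest.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the checkpoint store. Schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT, step INTEGER, tool TEXT, state TEXT, "
        "PRIMARY KEY (run_id, step))"
    )
    conn.commit()
    return conn


def save_checkpoint(conn, run_id, step, tool, state):
    """Persist state after one tool call; `with conn` commits atomically."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
            (run_id, step, tool, json.dumps(state)),
        )


def load_latest(conn, run_id):
    """Return (step, state) for the most recent checkpoint, or None."""
    row = conn.execute(
        "SELECT step, state FROM checkpoints "
        "WHERE run_id = ? ORDER BY step DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    if row is None:
        return None
    return row[0], json.loads(row[1])
```

On restart, a supervised agent would call `load_latest` first and resume from the returned step instead of replaying the whole run. Use a file path rather than `:memory:` so the state survives the crash.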
- “my "agent uptime" went from 71% to 99.4% in a week”
- “average time-to-recovery on a crash dropped from "next morning when I noticed" to under 30 seconds”
- “token spend on retries dropped by about 40%”
- “Checkpoint after every tool call”
- “Wrap every tool the agent can call in a timeout”
- “If the same tool times out three times in a row, mark it broken for ten minutes”
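The "three timeouts in a row, broken for ten minutes" rule can be sketched as a small circuit breaker. This is an illustrative reading of the quote, not the author's code: the class name, method names, and injectable `clock` are assumptions, and the timeout mechanism itself (signal-based in the original) is left to the caller.

```python
import time


class ToolBreaker:
    """Marks a tool broken after N consecutive timeouts, for a cooldown."""

    def __init__(self, max_timeouts=3, cooldown_s=600.0, clock=time.monotonic):
        self.max_timeouts = max_timeouts
        self.cooldown_s = cooldown_s
        self._clock = clock        # injectable so the logic is testable
        self._streak = {}          # tool -> consecutive timeout count
        self._broken_until = {}    # tool -> time when it may be retried

    def is_available(self, tool):
        until = self._broken_until.get(tool)
        if until is None:
            return True
        if self._clock() >= until:
            del self._broken_until[tool]  # cooldown elapsed
            return True
        return False

    def record_timeout(self, tool):
        self._streak[tool] = self._streak.get(tool, 0) + 1
        if self._streak[tool] >= self.max_timeouts:
            self._broken_until[tool] = self._clock() + self.cooldown_s
            self._streak[tool] = 0

    def record_success(self, tool):
        self._streak[tool] = 0  # any success resets the streak
```

The agent loop would check `is_available` before each call and skip or substitute a broken tool, which is one plausible way retries stop burning tokens on a dead dependency.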
agent-ops reliability-patterns process-supervision production-deployment state-persistence
Single vector index handles cross-lingual RAG across 25 languages
text-embedding-3-large maps queries across 100+ languages into the same embedding space, eliminating per-language indexing and translation infrastructure in production e-commerce support.
Multilingual RAG typically requires duplicate indexes or runtime translation steps. This cuts infrastructure complexity and latency for any team scaling support across regions—retrieval latency stays under 500ms with semantic + keyword hybrid search.
Replaces: language-specific vector indexes, query-time translation, Pinecone/Weaviate for serverless setups. Requires: Upstash Vector, OpenAI embeddings API, chunk-size tuning (250–500 tokens), hybrid alpha calibration (0.6 for e-commerce), score threshold for escalation (0.35). Ready now—code is complete and benchmarked at 70% automated resolution.
- “text-embedding-3-large is trained on 100+ languages”
- “70% of queries resolved without a human, with P95 retrieval latency under 500ms”
- “Retrieval precision was within 3% of English-to-English queries”
- “Chunk size for e-commerce content: 250–500 tokens is the sweet spot”
- “At 1,614 documents, a full re-index costs around $4 in API fees”
- “Upstash Vector was the only option that gave me hybrid search without managing a server”
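The hybrid alpha (0.6) and escalation threshold (0.35) can be pictured with a small routing sketch. It assumes a plain linear blend in which alpha weights the semantic score and both scores are already normalized to [0, 1]; the digest does not say how the original system applies alpha, and none of this is the Upstash Vector API. `Hit`, `hybrid_score`, and `route` are hypothetical names.

```python
from dataclasses import dataclass

ALPHA = 0.6                  # weight on the semantic score (from the post)
ESCALATION_THRESHOLD = 0.35  # below this, hand off to a human (from the post)


@dataclass
class Hit:
    doc_id: str
    semantic: float  # dense-vector similarity, assumed normalized to [0, 1]
    keyword: float   # sparse/keyword score, assumed normalized to [0, 1]


def hybrid_score(hit, alpha=ALPHA):
    """Linear blend of the two scores; the blend form is an assumption."""
    return alpha * hit.semantic + (1.0 - alpha) * hit.keyword


def route(hits, alpha=ALPHA, threshold=ESCALATION_THRESHOLD):
    """Return ("answer", doc_id) or ("escalate", None) for a query's hits."""
    if not hits:
        return ("escalate", None)
    best = max(hits, key=lambda h: hybrid_score(h, alpha))
    if hybrid_score(best, alpha) < threshold:
        return ("escalate", None)
    return ("answer", best.doc_id)
```

Because all languages share one embedding space, the same threshold applies regardless of query language, though it would still need calibration against your own content.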
rag multilingual vector-search hybrid-search e-commerce
Quantization hides bias emergence below perplexity thresholds
Standard metrics miss fairness degradation in quantized models—3-bit causes 6-21% of previously unbiased items to develop stereotypical behaviors while perplexity barely shifts.
If you're deploying quantized models to production, aggregate metrics won't catch bias emergence. You need item-level fairness audits before compression, not after, or you'll ship models that silently amplify stereotypes while passing quality gates.
This doesn't replace existing quantization pipelines yet—it replaces your confidence in standard eval metrics. Requires adding fairness benchmarks (BBQ-style) to your quantization testing matrix. Worth implementing now if deploying 4-bit or lower to any inference service touching user-facing classification or generation.
- “3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors”
- “models' willingness to select "unknown" answers declines by 17.4%”
- “perplexity increases by less than 0.5% at 8-bit and under 3% at 4-bit across all three models, yet 2.5-5.6% of items already develop new biases at 4-bit”
- “aggregate evaluation metrics systematically miss fairness-critical degradation”
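An item-level audit of the kind described here can be sketched as a comparison of per-item outcomes before and after quantization. The input format (a mapping from item ID to whether the answer was stereotypical) and the function names are assumptions for illustration; the paper's exact scoring is not given in this digest.

```python
def aggregate_bias_rate(results):
    """Fraction of all items with a biased answer (the aggregate view)."""
    return sum(results.values()) / len(results)


def emergent_bias_rate(baseline, quantized):
    """Fraction of items unbiased at baseline that become biased afterwards.

    Both arguments map item_id -> bool (True means a biased answer).
    This input format is a hypothetical simplification.
    """
    unbiased_before = [item for item, biased in baseline.items() if not biased]
    if not unbiased_before:
        return 0.0
    flipped = sum(1 for item in unbiased_before if quantized[item])
    return flipped / len(unbiased_before)


# Toy data: one item flips to biased, another flips away from it,
# so the aggregate rate does not move while the item-level rate does.
baseline = {"q1": True, "q2": False, "q3": False, "q4": True}
quantized = {"q1": False, "q2": True, "q3": False, "q4": True}
```

In the toy data the aggregate rate is identical for both models, yet half of the previously unbiased items have flipped, which is the kind of change a per-item comparison surfaces and a single summary number hides.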
quantization bias-detection fairness llm-compression evaluation