medical-ai safety-eval rl-environment benchmark llm-robustness

HealthCraft measures LLM safety collapse under clinical pressure

An RL environment with FHIR R4 state and a dual-layer safety rubric shows that frontier models fail multi-step workflows (Claude 1.0%, GPT-5.4 0.0%) despite partial single-step competence.

May 22, 2026

Summary

Static QA benchmarks miss the failure modes that matter in production medical workflows: trajectory-level safety collapse and tool misuse under sustained pressure. Developers deploying clinical LLMs now have a measurement harness aimed at the failures that would reach real patients, rather than at abstract accuracy.

Why it matters

Implementation verdict

Replaces toy medical QA evals with realistic multi-step task chains (195 tasks, 2,255 binary criteria, 515 safety-critical). Requires FHIR R4 integration, MCP tool support (24 tools exposed), and a deterministic LLM-judge overlay to control evaluator noise. Ready to pilot now: code, tasks, and a Docker bundle are released under Apache 2.0. The training-reward signal is not production-safe yet, though. By the authors' own finding, restraint criteria pass at 0.929 prevalence, which leaves the reward gameable. Use it for benchmarking before deployment; training ablations are pending.
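The 0.929 prevalence finding can be made concrete with a minimal sketch. It assumes "prevalence" means the fraction of trajectories in which a binary criterion passes. Under that reading, a criterion that passes about 93% of the time pays out most of its reward whatever the policy does, which an eval can tolerate but an optimizer can exploit. The function names, the 0.9 threshold, the criterion ids, and the data are hypothetical illustrations and are not taken from the released harness.

```python
from collections import defaultdict


def pass_prevalence(results):
    """Map each criterion id to the fraction of trajectories where it passed.

    `results` is an iterable of (criterion_id, passed) pairs, one per
    (trajectory, criterion) judgment.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for criterion_id, ok in results:
        total[criterion_id] += 1
        passed[criterion_id] += int(ok)
    return {cid: passed[cid] / total[cid] for cid in total}


def flag_gameable(prevalence, threshold=0.9):
    """Return, sorted, the criterion ids whose pass rate meets the threshold.

    The 0.9 default is an arbitrary illustrative cutoff (an assumption).
    """
    return sorted(cid for cid, p in prevalence.items() if p >= threshold)


# Synthetic judgments, NOT benchmark data: 14 trajectories per criterion.
results = (
    [("restraint_criterion_a", True)] * 13
    + [("restraint_criterion_a", False)] * 1
    + [("action_criterion_b", True)] * 4
    + [("action_criterion_b", False)] * 10
)

prevalence = pass_prevalence(results)
gameable = flag_gameable(prevalence)

for cid, p in sorted(prevalence.items()):
    print(f"{cid}: {p:.3f}")
print("high-prevalence criteria:", gameable)
```

In this synthetic example the restraint-style criterion passes in 13 of 14 trajectories (about 0.929), so it is flagged, while the action-style criterion is not. A check of this kind is one way to screen criteria before reusing an eval rubric as a training reward.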

Sources

1. the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions
2. performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps
3. safety-failure rates of 27.5% and 34.0%
4. the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot
5. Environment, tasks, rubrics, and harness are released under Apache 2.0

Dev Signal
