HealthCraft measures LLM safety collapse under clinical pressure
RL environment with FHIR R4 state and dual-layer safety rubric exposes that frontier models fail multi-step workflows (Claude 1.0%, GPT-5.4 0.0%) despite partial single-step competence.
May 22, 2026
Summary
Static QA benchmarks miss failure modes that matter in production medical workflows—trajectory-level safety collapse and tool misuse under sustained pressure. Developers deploying clinical LLMs now have a measurement harness that catches what reaches real patients, not abstract accuracy.
Why it matters
Static QA benchmarks miss failure modes that matter in production medical workflows—trajectory-level safety collapse and tool misuse under sustained pressure. Developers deploying clinical LLMs now have a measurement harness that catches what reaches real patients, not abstract accuracy.
Implementation verdict
Replaces toy medical QA evals with realistic multi-step task chains (195 tasks, 2,255 binary criteria, 515 safety-critical). Requires FHIR R4 integration, MCP tool support (24 exposed), and deterministic LLM-judge overlay for evaluator noise control. Ready to pilot now—code, tasks, Docker bundle released under Apache 2.0—but training-reward signal is not production-safe yet per authors' own 0.929 prevalence gameability finding. Use for benchmarking before deployment; training ablations pending.
Sources
- 1.the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions
- 2.performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps
- 3.safety-failure rates of 27.5% and 34.0%
- 4.the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot
- 5.Environment, tasks, rubrics, and harness are released under Apache 2.0
Dev Signal
Get briefs like this in your inbox — free, 3x a week.
100+ sources compressed into one 4-minute read. Ranked, cited, implementation-ready.