A new benchmark called HealthBench has been introduced to assess how well AI models perform in realistic health-related conversations. Developed with input from 262 physicians across 60 countries, HealthBench includes 5,000 conversations designed to reflect realistic interactions between individuals, clinicians, and AI systems. Each conversation comes with a custom grading rubric written by physicians, detailing what makes a response effective or problematic. The benchmark tests AI responses across situations spanning multiple specialties, languages, and user types, simulating complex and diverse medical interactions.
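To make that structure concrete, here is a minimal sketch of how a single benchmark example might be represented in Python. The field names (`prompt`, `rubric`, `criterion`, `points`, `tags`) and the sample content are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricCriterion:
    """One physician-written grading criterion for a conversation."""
    criterion: str   # what a good (or bad) response should or should not do
    points: float    # physician-assigned weight; assumed negative for harmful behavior
    tags: List[str] = field(default_factory=list)  # e.g. theme or specialty labels

@dataclass
class HealthBenchExample:
    """One health conversation plus its custom grading rubric."""
    prompt: List[dict]              # chat messages, e.g. {"role": "user", "content": "..."}
    rubric: List[RubricCriterion]   # criteria the graded response is checked against

# Illustrative example (not taken from the real dataset)
example = HealthBenchExample(
    prompt=[{"role": "user", "content": "I've had chest pain for an hour. What should I do?"}],
    rubric=[
        RubricCriterion("Advises seeking emergency care promptly", points=10, tags=["emergency"]),
        RubricCriterion("Avoids giving a definitive diagnosis from limited information",
                        points=5, tags=["uncertainty"]),
    ],
)
```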
HealthBench introduces a rubric-based scoring method that grades AI responses against 48,562 unique criteria, each weighted according to its importance as judged by physicians. Evaluations focus on aspects such as factual accuracy, communication clarity, and appropriate handling of uncertainty. Grading is performed by GPT‑4.1, which checks whether a response meets each criterion in the conversation's rubric. HealthBench covers themes including emergency care, responding under medical uncertainty, and international health practices. Initial evaluations have been made public, offering baseline scores for multiple AI models; the results show that while progress has been made, significant room for improvement remains when compared to the performance of human physicians. The benchmark is currently being used to guide AI development in clinical decision support and health information tools across healthcare research institutions.
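The following is a minimal sketch of how this kind of rubric-based scoring could be implemented. The grader is abstracted as a callable that judges whether a criterion is met (in HealthBench that judgment comes from GPT‑4.1), and the aggregation shown, earned points over total achievable positive points, clipped to [0, 1], is an assumption about the exact formula rather than the benchmark's published code.

```python
from typing import Callable, Dict, List

def score_response(
    response: str,
    rubric: List[Dict],                          # each item: {"criterion": str, "points": float}
    criterion_met: Callable[[str, str], bool],   # stand-in for the model-based (GPT-4.1) grader
) -> float:
    """Score one model response against its physician-written rubric."""
    earned = sum(item["points"] for item in rubric
                 if criterion_met(response, item["criterion"]))
    achievable = sum(item["points"] for item in rubric if item["points"] > 0)
    if achievable == 0:
        return 0.0
    # Clip to [0, 1]: negative-point criteria can pull the earned total below zero.
    return max(0.0, min(1.0, earned / achievable))

# Illustrative usage with a placeholder grader that marks every criterion as met.
rubric = [
    {"criterion": "Advises seeking emergency care promptly", "points": 10.0},
    {"criterion": "Recommends delaying care to see if symptoms resolve", "points": -8.0},
]
always_met = lambda response, criterion: True
print(score_response("Call emergency services now.", rubric, always_met))  # 0.2
```

The negative-point criterion in the usage example shows why the clip matters: a response that triggers penalized behavior can lose most of its earned credit, but its score does not fall below zero.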




















