Why Evaluation Matters in Healthcare AI
Healthcare AI operates in a domain where errors carry real consequences. Unlike general-purpose AI evaluation, healthcare evaluation must account for clinical context, patient safety implications, and regulatory requirements. A triage AI that misses chest pain symptoms isn't just “inaccurate”; it's potentially dangerous.
Framework Architecture
The Rubric evaluation framework operates across four interconnected layers:
- Evaluation Types: which aspects of the AI output are assessed (accuracy, safety, completeness, hallucination detection)
- Metrics: quantitative measures such as sensitivity, specificity, and custom rubric-based scores (see the sketch after this list)
- Human Review: expert clinician grading for nuanced cases that require medical judgment
- Versioning & Comparison: tracking changes over time, comparing model versions, and ensuring reproducibility
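The two core discrimination metrics mentioned above follow directly from a confusion matrix. A minimal sketch; the function names and the example numbers are illustrative, not part of the Rubric API:

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: of all truly positive cases, how many were caught.
    In triage terms, the fraction of genuinely urgent cases flagged as urgent."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: of all truly negative cases, how many were cleared."""
    return tn / (tn + fp)

# Hypothetical example: 95 of 100 urgent cases caught,
# 800 of 900 non-urgent cases correctly cleared.
print(sensitivity(tp=95, fn=5))     # 0.95
print(specificity(tn=800, fp=100))  # ~0.889
```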
Evaluation Types
Rubric supports multiple evaluation types, each targeting a specific dimension of AI performance:

| Type | Description | Use Case |
|---|---|---|
| Model Output Accuracy | Measures correctness of AI decisions against ground truth or expert consensus | Triage classification, diagnosis suggestions, treatment recommendations |
| Clinical Safety | Detects missed red flags, inappropriate escalation/de-escalation, and safety protocol violations | Emergency detection, contraindication checking, critical symptom identification |
| Hallucination Detection | Identifies fabricated medical information, unsupported claims, or invented references | Clinical note generation, patient education, medical summaries |
| Completeness & Coverage | Assesses whether the AI captured all relevant information and addressed key concerns | History taking, symptom documentation, care plan generation |
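To make the hallucination-detection row concrete, here is a deliberately simplified illustration of the underlying idea: flag generated claims with no support in the source record. A production evaluator would use semantic matching rather than keyword checks, and nothing here reflects Rubric's internal implementation:

```python
def unsupported_claims(output_sentences: list[str], source_text: str) -> list[str]:
    """Toy hallucination check: flag each generated sentence whose key terms
    never appear in the source clinical record. Illustrative only."""
    flagged = []
    for sentence in output_sentences:
        terms = [w.lower().strip(".,") for w in sentence.split() if len(w) > 4]
        if terms and not any(t in source_text.lower() for t in terms):
            flagged.append(sentence)
    return flagged

summary = ["Patient reports chest pain.", "Patient has a penicillin allergy."]
record = "45-year-old presenting with chest pain radiating to the left arm."
print(unsupported_claims(summary, record))  # ["Patient has a penicillin allergy."]
```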
Creating an Evaluation
Evaluations are created by specifying a dataset, one or more evaluators, and optional configuration:
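A minimal sketch of create_evaluation.py, assuming a hypothetical rubric Python client; the module, method names, and parameters below are illustrative assumptions, not a documented API:

```python
# create_evaluation.py — illustrative sketch; the `rubric` client and every
# parameter name here are assumptions, not a documented API.
import rubric

client = rubric.Client(api_key="YOUR_API_KEY")

evaluation = client.create_evaluation(
    dataset="triage-conversations-v2",   # labeled samples to evaluate
    evaluators=[
        "model_output_accuracy",         # correctness vs. ground truth
        "clinical_safety",               # missed red flags, bad escalation
        "hallucination_detection",       # fabricated medical information
    ],
    config={
        "confidence_threshold": 0.8,     # below this, route to human review
        "error_weights": {               # asymmetric weighting (see Key Concepts)
            "under_triage": 5.0,
            "over_triage": 1.0,
        },
    },
)
print(evaluation.id, evaluation.status)
```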
Evaluation Pipeline
A run flows through the layers in sequence: automated evaluators score each sample, samples that fall below the configured confidence threshold are routed to human review, and the combined results are versioned so runs can be compared over time.
Key Concepts
Asymmetric Error Weighting
In healthcare, not all errors are equal. Missing a heart attack (under-triage) is far more serious than sending someone to the ER unnecessarily (over-triage). Rubric allows you to configure asymmetric weights that reflect clinical reality.
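As a sketch of how such weighting might be expressed; the weight keys, acuity scale, and helper function are illustrative, not Rubric's configuration schema:

```python
# Illustrative asymmetric weights: missing an emergency (under-triage) is
# penalized five times more heavily than unnecessary escalation (over-triage).
ERROR_WEIGHTS = {"under_triage": 5.0, "over_triage": 1.0, "correct": 0.0}

def weighted_error(true_acuity: int, predicted_acuity: int) -> float:
    """Score one triage decision on a hypothetical scale where higher = more urgent."""
    if predicted_acuity < true_acuity:
        return ERROR_WEIGHTS["under_triage"]
    if predicted_acuity > true_acuity:
        return ERROR_WEIGHTS["over_triage"]
    return ERROR_WEIGHTS["correct"]

# A missed heart attack (true acuity 5, predicted 2) costs 5.0;
# an unnecessary ER referral (true 2, predicted 5) costs only 1.0.
```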
Confidence Thresholds
Every automated evaluation produces a confidence score. Samples below your configured threshold are automatically routed for human review, ensuring edge cases get expert attention.
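The routing rule itself is simple; a sketch with illustrative names and an illustrative threshold value:

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative value; tune per deployment

def route(sample_id: str, score: float, confidence: float) -> str:
    """Accept automated scores only when the evaluator is confident;
    everything else goes to a clinician review queue."""
    if confidence < CONFIDENCE_THRESHOLD:
        return f"{sample_id}: confidence {confidence:.2f} -> human review queue"
    return f"{sample_id}: auto-scored {score:.2f}"

print(route("case-041", score=0.91, confidence=0.63))  # routed to human review
```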
Clinical Context
Evaluators can be configured with clinical context (patient demographics, chief complaint categories, acuity levels) to apply appropriate standards for each case type.
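For example, context might be attached like this; a hedged sketch where the field names and the commented call are assumptions, not Rubric's schema:

```python
# Illustrative clinical context attached to an evaluator so that, e.g., chest
# pain in an elderly patient is judged against stricter escalation standards
# than the same complaint in a low-acuity visit.
clinical_context = {
    "age_group": "65+",               # patient demographics
    "chief_complaint": "chest_pain",  # complaint category
    "acuity_level": "emergent",       # expected acuity band
}

# evaluator = client.create_evaluator("clinical_safety", context=clinical_context)  # hypothetical call
```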
Next Steps
- Evaluation Types: deep dive into each evaluation type and configuration options
- Metrics Reference: complete list of metrics, formulas, and interpretation guidance
- Human Review Design: best practices for designing effective clinician review workflows
- Versioning & Comparison: track evaluations over time and compare model versions
