
Why Evaluation Matters in Healthcare AI

Healthcare AI operates in a domain where errors carry real consequences. Unlike general-purpose AI evaluation, healthcare evaluation must account for clinical context, patient safety implications, and regulatory requirements. A triage AI that misses chest pain symptoms isn’t just “inaccurate”—it’s potentially dangerous.
Patient Safety First: Rubric’s evaluation framework is built around a core principle: catching potentially dangerous AI decisions before they reach patients. Every metric and workflow is designed with this goal in mind.

Framework Architecture

The Rubric evaluation framework operates across four interconnected layers:

- Evaluation Types: what aspects of the AI output are being assessed (accuracy, safety, completeness, hallucination detection).
- Metrics: quantitative measures like sensitivity, specificity, and custom rubric-based scores.
- Human Review: expert clinician grading for nuanced cases that require medical judgment.
- Versioning & Comparison: track changes over time, compare model versions, and ensure reproducibility.

Evaluation Types

Rubric supports multiple evaluation types, each targeting a specific dimension of AI performance:
| Type | Description | Use Case |
| --- | --- | --- |
| Model Output Accuracy | Measures correctness of AI decisions against ground truth or expert consensus | Triage classification, diagnosis suggestions, treatment recommendations |
| Clinical Safety | Detects missed red flags, inappropriate escalation/de-escalation, and safety protocol violations | Emergency detection, contraindication checking, critical symptom identification |
| Hallucination Detection | Identifies fabricated medical information, unsupported claims, or invented references | Clinical note generation, patient education, medical summaries |
| Completeness & Coverage | Assesses whether the AI captured all relevant information and addressed key concerns | History taking, symptom documentation, care plan generation |
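To make the hallucination-detection row concrete, here is a toy illustration of the underlying idea (a minimal sketch under simple string-matching assumptions, not Rubric's actual detector): flag any medication cited in an AI summary that never appears in the source transcript.

```python
# Toy hallucination check: flag medications mentioned in an AI-generated
# summary that are absent from the source transcript. A minimal sketch
# using naive substring matching, not Rubric's detector.
def unsupported_medications(transcript: str, summary_meds: list[str]) -> list[str]:
    source = transcript.lower()
    return [med for med in summary_meds if med.lower() not in source]

transcript = "Patient reports taking lisinopril daily for hypertension."
flagged = unsupported_medications(transcript, ["lisinopril", "metformin"])
print(flagged)  # ['metformin'] -- not supported by the transcript
```

A production detector would need entity normalization (brand vs. generic names, abbreviations) rather than raw substring checks, which is why this is illustrative only.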

Creating an Evaluation

Evaluations are created by specifying a dataset, one or more evaluators, and optional configuration:
create_evaluation.py

```python
from rubric import Rubric

client = Rubric()

# Create a comprehensive triage evaluation
evaluation = client.evaluations.create(
    name="Triage Accuracy - Q1 2025",
    project="patient-triage-v2",
    dataset="ds_production_jan",

    evaluators=[
        {
            "type": "triage_accuracy",
            "config": {
                "severity_weights": {
                    "under_triage": 5.0,  # Missing urgent cases is critical
                    "over_triage": 1.0,   # Over-caution is less severe
                    "correct": 0.0
                }
            }
        },
        {
            "type": "red_flag_detection",
            "config": {
                "protocols": ["chest_pain", "stroke", "pediatric_fever"],
                "require_all_flags": True
            }
        },
        {
            "type": "hallucination_check",
            "config": {
                "check_medications": True,
                "check_diagnoses": True,
                "check_citations": True
            }
        }
    ],

    # Enable human review for borderline cases
    human_review={
        "enabled": True,
        "threshold": 0.7,  # Review if confidence < 70%
        "reviewer_pool": "physician"
    }
)

print(f"Evaluation started: {evaluation.id}")
```

Evaluation Pipeline

1. Ingest Samples: load dataset samples (transcripts, notes, images).
2. Run Evaluators: apply configured evaluators to each sample.
3. Route for Review: flag uncertain cases for human expert review.
4. Aggregate Scores: compute final metrics with confidence intervals.
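The four stages can be sketched in plain Python. This is a hedged illustration of the control flow only: the `exact_match` evaluator, its confidence values, and the aggregation formula are toy stand-ins, since Rubric runs these stages server-side.

```python
# Hedged sketch of the four pipeline stages; evaluator behavior and
# aggregation are illustrative assumptions, not Rubric internals.
from statistics import mean

def run_pipeline(samples, evaluators, threshold=0.7):
    results = []
    for sample in samples:                                      # 1. Ingest Samples
        scores = [evaluate(sample) for evaluate in evaluators]  # 2. Run Evaluators
        confidence = mean(s["confidence"] for s in scores)
        results.append({
            "sample": sample,
            "scores": scores,
            "needs_review": confidence < threshold,             # 3. Route for Review
        })
    mean_score = mean(                                          # 4. Aggregate Scores
        s["score"] for r in results for s in r["scores"]
    )
    return {"results": results, "mean_score": mean_score}

# Toy evaluator: exact-match accuracy with a fixed confidence per outcome.
def exact_match(sample):
    correct = sample["prediction"] == sample["ground_truth"]
    return {"score": 1.0 if correct else 0.0,
            "confidence": 0.9 if correct else 0.5}

report = run_pipeline(
    [{"prediction": "urgent", "ground_truth": "urgent"},
     {"prediction": "routine", "ground_truth": "urgent"}],
    evaluators=[exact_match],
)
print(report["mean_score"])                   # 0.5
print(report["results"][1]["needs_review"])   # True (confidence 0.5 < 0.7)
```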

Key Concepts

Asymmetric Error Weighting

In healthcare, not all errors are equal. Missing a heart attack (under-triage) is far more serious than sending someone to the ER unnecessarily (over-triage). Rubric allows you to configure asymmetric weights that reflect clinical reality.
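To show how such weights might translate into a score, here is a minimal sketch. The weight values mirror the `severity_weights` example above; the averaging formula itself is an assumption for illustration, not Rubric's internal scoring.

```python
# Asymmetric error scoring sketch. Weights mirror the severity_weights
# example above; the mean-penalty aggregation is an illustrative assumption.
SEVERITY_WEIGHTS = {"under_triage": 5.0, "over_triage": 1.0, "correct": 0.0}

def weighted_error(outcomes: list[str]) -> float:
    """Average penalty across samples, weighting each outcome type."""
    return sum(SEVERITY_WEIGHTS[o] for o in outcomes) / len(outcomes)

# Two under-triages cost five times as much as two over-triages:
print(weighted_error(["correct", "correct", "under_triage", "under_triage"]))  # 2.5
print(weighted_error(["correct", "correct", "over_triage", "over_triage"]))    # 0.5
```

With symmetric weights both lists would score identically; the asymmetry is what lets the metric reflect that under-triage is the dangerous failure mode.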

Confidence Thresholds

Every automated evaluation produces a confidence score. Samples below your configured threshold are automatically routed for human review, ensuring edge cases get expert attention.
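The routing rule itself is simple to picture. The sketch below assumes each scored sample carries a `confidence` field and uses the 0.7 threshold from the earlier example; the actual routing happens inside Rubric's pipeline.

```python
# Threshold-based routing sketch (assumed client-side illustration;
# Rubric performs this routing server-side).
def route_for_review(samples: list[dict], threshold: float = 0.7) -> list[dict]:
    """Return samples whose evaluator confidence falls below the threshold."""
    return [s for s in samples if s["confidence"] < threshold]

scored = [
    {"id": "s1", "confidence": 0.95},
    {"id": "s2", "confidence": 0.62},  # below 0.7 -> routed to human review
    {"id": "s3", "confidence": 0.71},
]
flagged = route_for_review(scored)
print([s["id"] for s in flagged])  # ['s2']
```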

Clinical Context

Evaluators can be configured with clinical context—patient demographics, chief complaint categories, acuity levels—to apply appropriate standards for each case type.
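One plausible shape for such a configuration is shown below. The `clinical_context` field and its keys (`age_band`, `chief_complaint`, `acuity`) are hypothetical illustrations of the idea, not confirmed Rubric parameters.

```python
# Hypothetical evaluator config carrying clinical context; all field
# names inside "clinical_context" are illustrative assumptions.
evaluator = {
    "type": "triage_accuracy",
    "config": {
        "clinical_context": {
            "age_band": "pediatric",       # apply pediatric vital-sign norms
            "chief_complaint": "fever",    # scope red-flag checks to fever protocols
            "acuity": "ESI-2",             # expected acuity level for this cohort
        }
    },
}
print(evaluator["config"]["clinical_context"]["acuity"])  # ESI-2
```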

Next Steps