
Why Evaluation Matters in Healthcare AI

Healthcare AI operates in a domain where errors carry real consequences. Unlike general-purpose AI evaluation, healthcare evaluation must account for clinical context, patient safety implications, and regulatory requirements. A triage AI that misses chest pain symptoms isn’t just “inaccurate”—it’s potentially dangerous.
Patient Safety First: Rubric’s evaluation framework is built around a core principle: catching potentially dangerous AI decisions before they reach patients. Every metric and workflow is designed with this goal in mind.

Framework Architecture

The Rubric evaluation framework operates across four interconnected layers:

- Evaluation Types: what aspects of the AI output are being assessed (accuracy, safety, completeness, hallucination detection).
- Metrics: quantitative measures like sensitivity, specificity, and custom rubric-based scores.
- Human Review: expert clinician grading for nuanced cases that require medical judgment.
- Versioning & Comparison: track changes over time, compare model versions, and ensure reproducibility.

Evaluation Types

Rubric supports multiple evaluation types, each targeting a specific dimension of AI performance:
| Type | Description | Use Case |
| --- | --- | --- |
| Model Output Accuracy | Measures correctness of AI decisions against ground truth or expert consensus | Triage classification, diagnosis suggestions, treatment recommendations |
| Clinical Safety | Detects missed red flags, inappropriate escalation/de-escalation, and safety protocol violations | Emergency detection, contraindication checking, critical symptom identification |
| Hallucination Detection | Identifies fabricated medical information, unsupported claims, or invented references | Clinical note generation, patient education, medical summaries |
| Completeness & Coverage | Assesses whether the AI captured all relevant information and addressed key concerns | History taking, symptom documentation, care plan generation |
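To make the hallucination-detection row concrete, here is a toy illustration of the underlying idea (a minimal sketch under simple string-matching assumptions, not Rubric's actual detector): flag any medication cited in an AI summary that never appears in the source transcript.

```python
# Toy hallucination check: flag medications mentioned in an AI-generated
# summary that are absent from the source transcript. A minimal sketch
# using naive substring matching, not Rubric's detector.
def unsupported_medications(transcript: str, summary_meds: list[str]) -> list[str]:
    source = transcript.lower()
    return [med for med in summary_meds if med.lower() not in source]

transcript = "Patient reports taking lisinopril daily for hypertension."
flagged = unsupported_medications(transcript, ["lisinopril", "metformin"])
print(flagged)  # ['metformin'] -- not supported by the transcript
```

A production detector would need entity normalization (brand vs. generic names, abbreviations) rather than raw substring checks, which is why this is illustrative only.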

Creating an Evaluation

Evaluations are created by specifying a dataset, one or more evaluators, and optional configuration:
create_evaluation.py

```python
from rubric import Rubric

client = Rubric()

# Create a comprehensive triage evaluation
evaluation = client.evaluations.create(
    name="Triage Accuracy - Q1 2025",
    project="patient-triage-v2",
    dataset="ds_production_jan",

    evaluators=[
        {
            "type": "triage_accuracy",
            "config": {
                "severity_weights": {
                    "under_triage": 5.0,  # Missing urgent cases is critical
                    "over_triage": 1.0,   # Over-caution is less severe
                    "correct": 0.0
                }
            }
        },
        {
            "type": "red_flag_detection",
            "config": {
                "protocols": ["chest_pain", "stroke", "pediatric_fever"],
                "require_all_flags": True
            }
        },
        {
            "type": "hallucination_check",
            "config": {
                "check_medications": True,
                "check_diagnoses": True,
                "check_citations": True
            }
        }
    ],

    # Enable human review for borderline cases
    human_review={
        "enabled": True,
        "threshold": 0.7,  # Review if confidence < 70%
        "reviewer_pool": "physician"
    }
)

print(f"Evaluation started: {evaluation.id}")
```

Evaluation Pipeline

1. Ingest Samples: load dataset samples (transcripts, notes, images).
2. Run Evaluators: apply configured evaluators to each sample.
3. Route for Review: flag uncertain cases for human expert review.
4. Aggregate Scores: compute final metrics with confidence intervals.
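The four stages can be sketched in plain Python. This is a hedged illustration of the control flow only: the `exact_match` evaluator, its confidence values, and the aggregation formula are toy stand-ins, since Rubric runs these stages server-side.

```python
# Hedged sketch of the four pipeline stages; evaluator behavior and
# aggregation are illustrative assumptions, not Rubric internals.
from statistics import mean

def run_pipeline(samples, evaluators, threshold=0.7):
    results = []
    for sample in samples:                                      # 1. Ingest Samples
        scores = [evaluate(sample) for evaluate in evaluators]  # 2. Run Evaluators
        confidence = mean(s["confidence"] for s in scores)
        results.append({
            "sample": sample,
            "scores": scores,
            "needs_review": confidence < threshold,             # 3. Route for Review
        })
    mean_score = mean(                                          # 4. Aggregate Scores
        s["score"] for r in results for s in r["scores"]
    )
    return {"results": results, "mean_score": mean_score}

# Toy evaluator: exact-match accuracy with a fixed confidence per outcome.
def exact_match(sample):
    correct = sample["prediction"] == sample["ground_truth"]
    return {"score": 1.0 if correct else 0.0,
            "confidence": 0.9 if correct else 0.5}

report = run_pipeline(
    [{"prediction": "urgent", "ground_truth": "urgent"},
     {"prediction": "routine", "ground_truth": "urgent"}],
    evaluators=[exact_match],
)
print(report["mean_score"])                   # 0.5
print(report["results"][1]["needs_review"])   # True (confidence 0.5 < 0.7)
```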

Key Concepts

Asymmetric Error Weighting

In healthcare, not all errors are equal. Missing a heart attack (under-triage) is far more serious than sending someone to the ER unnecessarily (over-triage). Rubric allows you to configure asymmetric weights that reflect clinical reality.
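To show how such weights might translate into a score, here is a minimal sketch. The weight values mirror the `severity_weights` example above; the averaging formula itself is an assumption for illustration, not Rubric's internal scoring.

```python
# Asymmetric error scoring sketch. Weights mirror the severity_weights
# example above; the mean-penalty aggregation is an illustrative assumption.
SEVERITY_WEIGHTS = {"under_triage": 5.0, "over_triage": 1.0, "correct": 0.0}

def weighted_error(outcomes: list[str]) -> float:
    """Average penalty across samples, weighting each outcome type."""
    return sum(SEVERITY_WEIGHTS[o] for o in outcomes) / len(outcomes)

# Two under-triages cost five times as much as two over-triages:
print(weighted_error(["correct", "correct", "under_triage", "under_triage"]))  # 2.5
print(weighted_error(["correct", "correct", "over_triage", "over_triage"]))    # 0.5
```

With symmetric weights both lists would score identically; the asymmetry is what lets the metric reflect that under-triage is the dangerous failure mode.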

Confidence Thresholds

Every automated evaluation produces a confidence score. Samples below your configured threshold are automatically routed for human review, ensuring edge cases get expert attention.
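The routing rule itself is simple to picture. The sketch below assumes each scored sample carries a `confidence` field and uses the 0.7 threshold from the earlier example; the actual routing happens inside Rubric's pipeline.

```python
# Threshold-based routing sketch (assumed client-side illustration;
# Rubric performs this routing server-side).
def route_for_review(samples: list[dict], threshold: float = 0.7) -> list[dict]:
    """Return samples whose evaluator confidence falls below the threshold."""
    return [s for s in samples if s["confidence"] < threshold]

scored = [
    {"id": "s1", "confidence": 0.95},
    {"id": "s2", "confidence": 0.62},  # below 0.7 -> routed to human review
    {"id": "s3", "confidence": 0.71},
]
flagged = route_for_review(scored)
print([s["id"] for s in flagged])  # ['s2']
```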

Clinical Context

Evaluators can be configured with clinical context—patient demographics, chief complaint categories, acuity levels—to apply appropriate standards for each case type.
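One plausible shape for such a configuration is shown below. The `clinical_context` field and its keys (`age_band`, `chief_complaint`, `acuity`) are hypothetical illustrations of the idea, not confirmed Rubric parameters.

```python
# Hypothetical evaluator config carrying clinical context; all field
# names inside "clinical_context" are illustrative assumptions.
evaluator = {
    "type": "triage_accuracy",
    "config": {
        "clinical_context": {
            "age_band": "pediatric",       # apply pediatric vital-sign norms
            "chief_complaint": "fever",    # scope red-flag checks to fever protocols
            "acuity": "ESI-2",             # expected acuity level for this cohort
        }
    },
}
print(evaluator["config"]["clinical_context"]["acuity"])  # ESI-2
```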

Next Steps