Overview

Healthcare AI evaluation requires both automated scoring and human clinical judgment. This guide explains when to use each approach and how to build effective hybrid workflows.

Automated Evaluation

Automated evaluators run code-based scoring on every sample.

Strengths

Speed: Process thousands of samples in minutes.

Consistency: The same criteria are applied uniformly to every sample.

Scale: Evaluate 100% of production traffic.

Cost: A fraction of the cost of human review.

Best For

Use Case | Example
Structured comparisons | Predicted triage vs. expected triage
Pattern matching | Red flag keyword detection
Compliance checks | Required questions asked
Metric calculation | Latency, token count, cost
Regression detection | Score changes between versions

Example

# Automated evaluators for structured checks
evaluators = [
    {
        "type": "triage_accuracy",
        "config": {
            "levels": ["emergency", "urgent", "routine"],
            "require_exact_match": True
        }
    },
    {
        "type": "red_flag_detection",
        "config": {
            "keywords": ["chest pain", "difficulty breathing", "severe bleeding"]
        }
    },
    {
        "type": "response_latency",
        "config": {
            "threshold_ms": 5000
        }
    }
]

Human Review

Human reviewers provide clinical judgment that automated systems cannot replicate.

Strengths

Clinical Judgment: Nuanced medical reasoning.

Context Understanding: Interpret ambiguous situations.

Edge Cases: Handle novel scenarios.

Ground Truth: Generate training labels.

Best For

Use Case | Example
Clinical appropriateness | "Was this triage decision safe?"
Reasoning quality | "Did the AI ask the right follow-up questions?"
Edge cases | Unusual symptom combinations
Ambiguous scenarios | When the correct answer is debatable
Ground truth creation | Labeling data for future automation

Example

# Configure human review for complex cases
client.projects.update(
    project="patient-triage",
    review_config={
        "rubric": {
            "dimensions": [
                {
                    "name": "clinical_appropriateness",
                    "description": "Was the triage decision clinically appropriate?",
                    "scale": ["inappropriate", "questionable", "appropriate", "excellent"]
                },
                {
                    "name": "safety",
                    "description": "Were all safety concerns addressed?",
                    "scale": ["unsafe", "partially_safe", "safe"]
                },
                {
                    "name": "communication",
                    "description": "Was the communication clear and empathetic?",
                    "scale": [1, 2, 3, 4, 5]
                }
            ],
            "require_notes_on_failure": True
        }
    }
)

The Hybrid Approach

The most effective strategy combines automated evaluation with targeted human review.

Workflow

  1. Automated evaluators score every sample.
  2. Routing rules and random sampling send a subset of cases to human reviewers.
  3. Reviewer feedback flows back to refine the automated evaluators and routing rules.

Configuration

# Configure hybrid workflow
client.projects.update(
    project="patient-triage",
    
    evaluation_config={
        # Automated evaluators run on everything
        "evaluators": [
            {"type": "triage_accuracy"},
            {"type": "red_flag_detection"},
            {"type": "guideline_compliance"}
        ],
        
        # Routing rules for human review
        "routing_rules": [
            # Always review emergency decisions
            {
                "condition": "output.triage_level == 'emergency'",
                "action": "route_to_review",
                "priority": "high",
                "reviewer_credential": "MD"
            },
            
            # Review when AI is uncertain
            {
                "condition": "output.confidence < 0.8",
                "action": "route_to_review",
                "priority": "medium"
            },
            
            # Review mismatches between evaluators
            {
                "condition": "scores.triage_accuracy < 1.0 AND scores.red_flag_detection == 1.0",
                "action": "route_to_review",
                "priority": "medium"
            },
            
            # Review detected safety issues
            {
                "condition": "scores.red_flag_detection < 1.0",
                "action": "route_to_review",
                "priority": "urgent",
                "reviewer_credential": "MD"
            }
        ],
        
        # Random sampling for QA
        "sampling": {
            "routine": 0.05,       # 5% of routine cases
            "urgent": 0.10,        # 10% of urgent
            "emergency": 1.0       # 100% of emergency (covered by rules above)
        }
    }
)

Routing Rules

Rule Syntax

{
    "condition": "<expression>",
    "action": "route_to_review",
    "priority": "low|medium|high|urgent",
    "reviewer_credential": "MD|NP|RN|...",
    "due_within_hours": 24
}

Condition Expressions

Expression | Description
output.triage_level == 'emergency' | Match a specific output value
output.confidence < 0.8 | Numeric comparison
scores.evaluator_name < threshold | Check an evaluator score against a threshold
'keyword' in output.symptoms | Check list membership
metadata.model_version == 'v2' | Match metadata
condition1 AND condition2 | Both conditions must match
condition1 OR condition2 | Either condition may match
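
Putting these together, a single rule can combine several expressions. The rule below is only an illustration; the output.symptoms and output.confidence fields mirror the examples above and may differ in your schema:
# Illustrative rule combining expressions from the table above
{
    "condition": "'chest pain' in output.symptoms AND output.confidence < 0.9",
    "action": "route_to_review",
    "priority": "high",
    "reviewer_credential": "MD",
    "due_within_hours": 4
}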

Priority Levels

Priority | SLA | Use Case
urgent | 1 hour | Safety-critical issues
high | 4 hours | Emergency triage decisions
medium | 24 hours | Standard flagged cases
low | 72 hours | QA random sampling
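
If you track SLAs in your own tooling, this mapping is easy to mirror client-side. The helper below is a minimal sketch, not part of the API; it converts a task's priority and creation time into a review deadline:
from datetime import datetime, timedelta, timezone

# SLA windows from the table above, in hours
PRIORITY_SLA_HOURS = {"urgent": 1, "high": 4, "medium": 24, "low": 72}

def review_due_at(priority: str, created_at: datetime) -> datetime:
    """Return the latest acceptable review time for a given priority."""
    return created_at + timedelta(hours=PRIORITY_SLA_HOURS[priority])

# An urgent task created now must be reviewed within the hour
due = review_due_at("urgent", datetime.now(timezone.utc))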

Assignment Strategies

Round-Robin

Distribute tasks evenly across available reviewers:
"assignment_strategy": {
    "type": "round_robin",
    "respect_credentials": True,
    "max_daily_per_reviewer": 50
}

Load-Balanced

Assign based on current workload:
"assignment_strategy": {
    "type": "load_balanced",
    "factors": ["queue_size", "avg_review_time"],
    "max_queue_per_reviewer": 20
}

Expertise-Based

Route to specialists by topic:
"assignment_strategy": {
    "type": "expertise",
    "routing": {
        "cardiology_symptoms": ["dr_chen", "dr_patel"],
        "pediatric": ["dr_wilson"],
        "mental_health": ["dr_garcia", "np_thompson"]
    }
}

Dual Review

Require multiple reviewers for critical cases:
"assignment_strategy": {
    "type": "dual_review",
    "conditions": {
        "emergency_decisions": 2,
        "safety_flags": 2,
        "default": 1
    },
    "require_agreement": True,
    "tie_breaker": "senior_reviewer"
}
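
The snippets above show each strategy in isolation. One plausible way to apply a strategy, assuming it nests under review_config alongside the rubric shown earlier (check your schema), is:
# Sketch: attaching an assignment strategy to the project.
# Assumes assignment_strategy lives under review_config; adjust to your schema.
client.projects.update(
    project="patient-triage",
    review_config={
        "assignment_strategy": {
            "type": "round_robin",
            "respect_credentials": True,
            "max_daily_per_reviewer": 50
        }
    }
)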

Feedback Loop

Human reviews improve automated evaluation over time.

Collecting Feedback

# When reviewer disagrees with AI
review = client.reviews.create(
    task="task_abc123",
    scores={
        "clinical_appropriateness": "inappropriate",
        "correct_triage": "emergency",  # AI said "urgent"
    },
    notes="Patient described classic ACS symptoms. Should have been triaged as emergency."
)

Training Data Generation

# Export reviewed samples for model training
training_data = client.exports.create(
    project="patient-triage",
    filters={
        "has_human_review": True,
        "review_score.clinical_appropriateness": ["appropriate", "excellent"],
        "created_after": "2025-01-01"
    },
    format="jsonl"
)

Evaluator Refinement

# Analyze disagreements to improve evaluators
disagreements = client.analytics.get_disagreements(
    project="patient-triage",
    automated_evaluator="triage_accuracy",
    min_count=10
)

for pattern in disagreements:
    print(f"Pattern: {pattern.description}")
    print(f"Count: {pattern.count}")
    print(f"Human usually says: {pattern.human_consensus}")
    print(f"Automated says: {pattern.automated_score}")

Calibration & Agreement

Inter-Rater Reliability

Monitor agreement between reviewers:
# Get agreement metrics
agreement = client.analytics.get_reviewer_agreement(
    project="patient-triage",
    dimension="clinical_appropriateness"
)

print(f"Cohen's Kappa: {agreement.cohens_kappa}")
print(f"Percent Agreement: {agreement.percent_agreement}")
print(f"Fleiss' Kappa: {agreement.fleiss_kappa}")

Calibration Sessions

Run calibration exercises with your review team:
# Create calibration set
calibration = client.calibration.create(
    project="patient-triage",
    samples=["smp_1", "smp_2", "smp_3"],  # Carefully selected cases
    reviewers=["rev_a", "rev_b", "rev_c", "rev_d"],
    gold_standard={
        "smp_1": {"triage": "emergency", "rationale": "..."},
        "smp_2": {"triage": "urgent", "rationale": "..."},
        "smp_3": {"triage": "routine", "rationale": "..."}
    }
)

# View calibration results
results = client.calibration.get_results(calibration.id)
for reviewer in results.reviewers:
    print(f"{reviewer.name}: {reviewer.accuracy}% accurate")

Best Practices

Start Conservative, Then Relax

When launching a new AI system, route more cases to human review. As you build confidence, reduce the percentage.
# Initial launch: review 30% of cases
"sampling": {"all": 0.30}

# After validation: reduce to 10%
"sampling": {"all": 0.10}

# Mature system: 5% + flagged only
"sampling": {"routine": 0.05}
Never skip human review for potential safety issues:
# Non-negotiable rule
{
    "condition": "scores.red_flag_detection < 1.0",
    "action": "route_to_review",
    "priority": "urgent",
    "reviewer_credential": "MD",
    "override_sampling": True  # Always review, ignore sampling rate
}

Close the Feedback Loop

Systematically use human reviews to improve your AI:
  1. Export high-quality reviews as training data
  2. Analyze disagreement patterns
  3. Refine automated evaluator thresholds
  4. Update routing rules based on findings

Next Steps