Overview

Healthcare AI evaluation requires both automated scoring and human clinical judgment. This guide explains when to use each approach and how to build effective hybrid workflows.

Automated Evaluation

Automated evaluators run code-based scoring on every sample.

Strengths

Speed: Process thousands of samples in minutes.

Consistency: The same criteria are applied uniformly to every sample.

Scale: Evaluate 100% of production traffic.

Cost: A fraction of the cost of human review.

Best For

Use Case | Example
Structured comparisons | Predicted triage vs. expected triage
Pattern matching | Red flag keyword detection
Compliance checks | Required questions asked
Metric calculation | Latency, token count, cost
Regression detection | Score changes between versions

Example

# Automated evaluators for structured checks
evaluators = [
    {
        "type": "triage_accuracy",
        "config": {
            "levels": ["emergency", "urgent", "routine"],
            "require_exact_match": True
        }
    },
    {
        "type": "red_flag_detection",
        "config": {
            "keywords": ["chest pain", "difficulty breathing", "severe bleeding"]
        }
    },
    {
        "type": "response_latency",
        "config": {
            "threshold_ms": 5000
        }
    }
]

Human Review

Human reviewers provide clinical judgment that automated systems cannot replicate.

Strengths

Clinical Judgment: Nuanced medical reasoning.

Context Understanding: Interpret ambiguous situations.

Edge Cases: Handle novel scenarios.

Ground Truth: Generate training labels.

Best For

Use Case | Example
Clinical appropriateness | "Was this triage decision safe?"
Reasoning quality | "Did the AI ask the right follow-up questions?"
Edge cases | Unusual symptom combinations
Ambiguous scenarios | When the correct answer is debatable
Ground truth creation | Labeling data for future automation

Example

# Configure human review for complex cases
client.projects.update(
    project="patient-triage",
    review_config={
        "rubric": {
            "dimensions": [
                {
                    "name": "clinical_appropriateness",
                    "description": "Was the triage decision clinically appropriate?",
                    "scale": ["inappropriate", "questionable", "appropriate", "excellent"]
                },
                {
                    "name": "safety",
                    "description": "Were all safety concerns addressed?",
                    "scale": ["unsafe", "partially_safe", "safe"]
                },
                {
                    "name": "communication",
                    "description": "Was the communication clear and empathetic?",
                    "scale": [1, 2, 3, 4, 5]
                }
            ],
            "require_notes_on_failure": True
        }
    }
)

The Hybrid Approach

The most effective strategy combines automated evaluation with targeted human review.

Workflow

  1. Automated evaluators score every sample.
  2. Routing rules and random sampling send a subset of cases to human reviewers.
  3. Reviewer feedback flows back to refine the automated evaluators and routing rules.

Configuration

# Configure hybrid workflow
client.projects.update(
    project="patient-triage",
    
    evaluation_config={
        # Automated evaluators run on everything
        "evaluators": [
            {"type": "triage_accuracy"},
            {"type": "red_flag_detection"},
            {"type": "guideline_compliance"}
        ],
        
        # Routing rules for human review
        "routing_rules": [
            # Always review emergency decisions
            {
                "condition": "output.triage_level == 'emergency'",
                "action": "route_to_review",
                "priority": "high",
                "reviewer_credential": "MD"
            },
            
            # Review when AI is uncertain
            {
                "condition": "output.confidence < 0.8",
                "action": "route_to_review",
                "priority": "medium"
            },
            
            # Review mismatches between evaluators
            {
                "condition": "scores.triage_accuracy < 1.0 AND scores.red_flag_detection == 1.0",
                "action": "route_to_review",
                "priority": "medium"
            },
            
            # Review detected safety issues
            {
                "condition": "scores.red_flag_detection < 1.0",
                "action": "route_to_review",
                "priority": "urgent",
                "reviewer_credential": "MD"
            }
        ],
        
        # Random sampling for QA
        "sampling": {
            "routine": 0.05,       # 5% of routine cases
            "urgent": 0.10,        # 10% of urgent
            "emergency": 1.0       # 100% of emergency (covered by rules above)
        }
    }
)

Routing Rules

Rule Syntax

{
    "condition": "<expression>",
    "action": "route_to_review",
    "priority": "low|medium|high|urgent",
    "reviewer_credential": "MD|NP|RN|...",
    "due_within_hours": 24
}

Condition Expressions

Expression | Description
output.triage_level == 'emergency' | Match a specific output value
output.confidence < 0.8 | Numeric comparison
scores.evaluator_name < threshold | Check an evaluator score against a threshold
'keyword' in output.symptoms | Check list membership
metadata.model_version == 'v2' | Match metadata
condition1 AND condition2 | Both conditions must match
condition1 OR condition2 | Either condition may match
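
Putting these together, a single rule can combine several expressions. The rule below is only an illustration; the output.symptoms and output.confidence fields mirror the examples above and may differ in your schema:
# Illustrative rule combining expressions from the table above
{
    "condition": "'chest pain' in output.symptoms AND output.confidence < 0.9",
    "action": "route_to_review",
    "priority": "high",
    "reviewer_credential": "MD",
    "due_within_hours": 4
}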

Priority Levels

Priority | SLA | Use Case
urgent | 1 hour | Safety-critical issues
high | 4 hours | Emergency triage decisions
medium | 24 hours | Standard flagged cases
low | 72 hours | QA random sampling
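
If you track SLAs in your own tooling, this mapping is easy to mirror client-side. The helper below is a minimal sketch, not part of the API; it converts a task's priority and creation time into a review deadline:
from datetime import datetime, timedelta, timezone

# SLA windows from the table above, in hours
PRIORITY_SLA_HOURS = {"urgent": 1, "high": 4, "medium": 24, "low": 72}

def review_due_at(priority: str, created_at: datetime) -> datetime:
    """Return the latest acceptable review time for a given priority."""
    return created_at + timedelta(hours=PRIORITY_SLA_HOURS[priority])

# An urgent task created now must be reviewed within the hour
due = review_due_at("urgent", datetime.now(timezone.utc))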

Assignment Strategies

Round-Robin

Distribute tasks evenly across available reviewers:
"assignment_strategy": {
    "type": "round_robin",
    "respect_credentials": True,
    "max_daily_per_reviewer": 50
}

Load-Balanced

Assign based on current workload:
"assignment_strategy": {
    "type": "load_balanced",
    "factors": ["queue_size", "avg_review_time"],
    "max_queue_per_reviewer": 20
}

Expertise-Based

Route to specialists by topic:
"assignment_strategy": {
    "type": "expertise",
    "routing": {
        "cardiology_symptoms": ["dr_chen", "dr_patel"],
        "pediatric": ["dr_wilson"],
        "mental_health": ["dr_garcia", "np_thompson"]
    }
}

Dual Review

Require multiple reviewers for critical cases:
"assignment_strategy": {
    "type": "dual_review",
    "conditions": {
        "emergency_decisions": 2,
        "safety_flags": 2,
        "default": 1
    },
    "require_agreement": True,
    "tie_breaker": "senior_reviewer"
}
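
The snippets above show each strategy in isolation. One plausible way to apply a strategy, assuming it nests under review_config alongside the rubric shown earlier (check your schema), is:
# Sketch: attaching an assignment strategy to the project.
# Assumes assignment_strategy lives under review_config; adjust to your schema.
client.projects.update(
    project="patient-triage",
    review_config={
        "assignment_strategy": {
            "type": "round_robin",
            "respect_credentials": True,
            "max_daily_per_reviewer": 50
        }
    }
)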

Feedback Loop

Human reviews improve automated evaluation over time.

Collecting Feedback

# When reviewer disagrees with AI
review = client.reviews.create(
    task="task_abc123",
    scores={
        "clinical_appropriateness": "inappropriate",
        "correct_triage": "emergency",  # AI said "urgent"
    },
    notes="Patient described classic ACS symptoms. Should have been triaged as emergency."
)

Training Data Generation

# Export reviewed samples for model training
training_data = client.exports.create(
    project="patient-triage",
    filters={
        "has_human_review": True,
        "review_score.clinical_appropriateness": ["appropriate", "excellent"],
        "created_after": "2025-01-01"
    },
    format="jsonl"
)

Evaluator Refinement

# Analyze disagreements to improve evaluators
disagreements = client.analytics.get_disagreements(
    project="patient-triage",
    automated_evaluator="triage_accuracy",
    min_count=10
)

for pattern in disagreements:
    print(f"Pattern: {pattern.description}")
    print(f"Count: {pattern.count}")
    print(f"Human usually says: {pattern.human_consensus}")
    print(f"Automated says: {pattern.automated_score}")

Calibration & Agreement

Inter-Rater Reliability

Monitor agreement between reviewers:
# Get agreement metrics
agreement = client.analytics.get_reviewer_agreement(
    project="patient-triage",
    dimension="clinical_appropriateness"
)

print(f"Cohen's Kappa: {agreement.cohens_kappa}")
print(f"Percent Agreement: {agreement.percent_agreement}")
print(f"Fleiss' Kappa: {agreement.fleiss_kappa}")

Calibration Sessions

Run calibration exercises with your review team:
# Create calibration set
calibration = client.calibration.create(
    project="patient-triage",
    samples=["smp_1", "smp_2", "smp_3"],  # Carefully selected cases
    reviewers=["rev_a", "rev_b", "rev_c", "rev_d"],
    gold_standard={
        "smp_1": {"triage": "emergency", "rationale": "..."},
        "smp_2": {"triage": "urgent", "rationale": "..."},
        "smp_3": {"triage": "routine", "rationale": "..."}
    }
)

# View calibration results
results = client.calibration.get_results(calibration.id)
for reviewer in results.reviewers:
    print(f"{reviewer.name}: {reviewer.accuracy}% accurate")

Best Practices

Start Conservative, Then Relax

When launching a new AI system, route more cases to human review. As you build confidence, reduce the percentage.
# Initial launch: review 30% of cases
"sampling": {"all": 0.30}

# After validation: reduce to 10%
"sampling": {"all": 0.10}

# Mature system: 5% + flagged only
"sampling": {"routine": 0.05}
Never skip human review for potential safety issues:
# Non-negotiable rule
{
    "condition": "scores.red_flag_detection < 1.0",
    "action": "route_to_review",
    "priority": "urgent",
    "reviewer_credential": "MD",
    "override_sampling": True  # Always review, ignore sampling rate
}

Close the Feedback Loop

Systematically use human reviews to improve your AI:
  1. Export high-quality reviews as training data
  2. Analyze disagreement patterns
  3. Refine automated evaluator thresholds
  4. Update routing rules based on findings

Next Steps