Overview

Voice AI for healthcare involves a complex pipeline: speech recognition, natural language understanding, clinical reasoning, and response generation. Rubric helps you evaluate the entire pipeline end-to-end, or drill down into individual components.

- Patient Triage Calls: nurse hotlines, symptom assessment, urgent care routing
- Voice Assistants: in-clinic voice AI, patient intake, follow-up calls
- Call Center AI: appointment scheduling, prescription refills, results delivery
- Ambient Documentation: visit recording, note generation, clinical summarization

The Voice AI Pipeline

A typical healthcare voice AI system has multiple stages, each requiring evaluation:
1. Audio Input: patient speech captured via phone, app, or device. Metrics: quality, noise level, duration.
2. Speech-to-Text: transcription via ASR (Whisper, Deepgram, etc.). Metrics: WER, medical term accuracy.
3. Clinical NLU: extract symptoms, intent, and urgency from the transcript. Metrics: entity F1, intent accuracy.
4. Triage Decision: determine urgency level and routing. Metrics: triage accuracy, safety score.
5. Response: generate appropriate guidance or escalation. Metrics: completeness, empathy, clarity.

Rubric evaluates the final clinical decision, but also lets you log intermediate outputs to pinpoint where errors originate in your pipeline.
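
For example, a minimal sketch of stage-level logging. This assumes client.log accepts the same metadata field shown for client.calls.log below; the pipeline_stage key is our own convention for tying stages back to a call:
from rubric import Rubric

client = Rubric(api_key="gr_live_xxx")

# Log each stage's output under a shared call_id so errors can be localized
client.log(
    project="patient-triage-voice",
    input={"audio_url": "s3://your-bucket/calls/call_20250115_143022.wav"},
    output={"transcript": "..."},
    metadata={"pipeline_stage": "speech_to_text", "call_id": "call_20250115_143022"},
)

client.log(
    project="patient-triage-voice",
    input={"transcript": "..."},
    output={"extracted_symptoms": [...], "intent": "symptom_report"},
    metadata={"pipeline_stage": "clinical_nlu", "call_id": "call_20250115_143022"},
)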

Logging Voice Interactions

Log the complete interaction including audio, transcript, and AI decisions:
from rubric import Rubric

client = Rubric(api_key="gr_live_xxx")

# Log a complete patient triage call
client.calls.log(
    project="patient-triage-voice",
    
    # Audio file (S3, GCS, or direct upload)
    audio_url="s3://your-bucket/calls/call_20250115_143022.wav",
    
    # Full transcript with speaker labels and timestamps
    transcript=[
        {
            "speaker": "agent",
            "text": "Thank you for calling. How can I help you today?",
            "start": 0.0,
            "end": 3.2
        },
        {
            "speaker": "patient",
            "text": "Hi, I've been having really bad chest pain since this morning.",
            "start": 3.8,
            "end": 8.1
        },
        {
            "speaker": "agent",
            "text": "I'm sorry to hear that. Can you describe the pain? Is it sharp, dull, or pressure-like?",
            "start": 8.5,
            "end": 13.2
        },
        {
            "speaker": "patient",
            "text": "It's like pressure, right in the center. And it kind of goes to my left arm.",
            "start": 13.8,
            "end": 19.4
        },
        {
            "speaker": "agent",
            "text": "That's important information. Are you also experiencing shortness of breath, sweating, or nausea?",
            "start": 19.9,
            "end": 25.6
        },
        {
            "speaker": "patient", 
            "text": "Yeah, I'm sweating a lot actually. And I feel a bit sick to my stomach.",
            "start": 26.1,
            "end": 31.2
        },
        {
            "speaker": "agent",
            "text": "Based on your symptoms, I need you to call 911 immediately or have someone drive you to the nearest emergency room right now. These symptoms could indicate a serious cardiac event. Please don't drive yourself. Is someone with you who can help?",
            "start": 31.8,
            "end": 45.3
        }
    ],
    
    # AI's extracted understanding and decision
    ai_decision={
        "extracted_symptoms": [
            {"symptom": "chest_pain", "location": "center", "quality": "pressure", "severity": "severe"},
            {"symptom": "radiation", "location": "left_arm"},
            {"symptom": "diaphoresis", "severity": "moderate"},
            {"symptom": "nausea", "severity": "mild"}
        ],
        "red_flags_detected": [
            "chest_pain_cardiac_features",
            "radiation_to_arm",
            "diaphoresis"
        ],
        "triage_level": "emergency",
        "recommended_action": "call_911",
        "confidence": 0.96,
        "protocol_used": "chest_pain_acs"
    },
    
    # Ground truth (if available from physician review)
    expected={
        "triage_level": "emergency",
        "diagnosis_category": "possible_acs",
        "appropriate_escalation": True
    },
    
    # Metadata
    metadata={
        "call_id": "call_20250115_143022",
        "call_duration_seconds": 47.8,
        "asr_provider": "deepgram",
        "agent_model": "triage-voice-v3",
        "patient_demographics": {
            "age_group": "50-65",
            "sex": "male"
        }
    }
)
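
Note the per-turn start and end timestamps in the transcript: timing-aware evaluators such as escalation_appropriateness (below) use them to measure how quickly the agent escalated once red-flag symptoms surfaced, so preserve them even when your ASR provider treats them as optional.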

Voice-Specific Evaluators

Configure evaluators designed for healthcare voice AI applications:
evaluation = client.evaluations.create(
    name="Voice Triage - Weekly Eval",
    project="patient-triage-voice",
    dataset="ds_weekly_calls",
    
    evaluators=[
        # Core triage evaluation
        {
            "type": "triage_accuracy",
            "config": {
                "levels": ["emergency", "urgent", "semi_urgent", "routine"],
                "severity_weights": {
                    "under_triage": 10.0,  # Very dangerous
                    "over_triage": 1.0     # Acceptable
                }
            }
        },
        
        # Red flag detection
        {
            "type": "red_flag_detection",
            "config": {
                "protocols": [
                    "chest_pain_acs",
                    "stroke_fast",
                    "sepsis_sirs",
                    "allergic_reaction",
                    "respiratory_distress"
                ],
                "minimum_recall": 0.99  # Must catch 99% of red flags
            }
        },
        
        # Symptom extraction from transcript
        {
            "type": "symptom_extraction",
            "config": {
                "source": "transcript",
                "check_negations": True,
                "check_temporal": True,
                "check_severity": True
            }
        },
        
        # Guideline/protocol compliance
        {
            "type": "protocol_compliance",
            "config": {
                "protocol": "schmitt_thompson",
                "required_questions_by_symptom": {
                    "chest_pain": [
                        "pain_location", "pain_quality", "pain_radiation",
                        "associated_symptoms", "onset_timing"
                    ],
                    "headache": [
                        "severity", "onset", "location", "associated_symptoms"
                    ]
                }
            }
        },
        
        # Communication quality
        {
            "type": "communication_quality",
            "config": {
                "check_empathy": True,
                "check_clarity": True,
                "check_safety_netting": True,  # Did agent give return precautions?
                "max_jargon_score": 0.1  # Limit medical jargon
            }
        },
        
        # Escalation appropriateness
        {
            "type": "escalation_appropriateness",
            "config": {
                "check_timing": True,          # Did escalation happen promptly?
                "max_escalation_delay_seconds": 60,
                "check_handoff_quality": True   # Was handoff information complete?
            }
        }
    ]
)
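
Once the run finishes, you can pull scores programmatically. A minimal sketch; the metric key names here are illustrative, keyed off the evaluator types above:
# Block until the evaluation completes, then inspect aggregate scores
results = client.evaluations.wait(evaluation.id)

# Key names are illustrative; check your results payload for the exact metrics
print(results.metrics["triage_accuracy"])
print(results.metrics["red_flag_recall"])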

ASR (Speech-to-Text) Evaluation

Evaluate your speech recognition separately:
# Log ASR output for evaluation
client.log(
    project="asr-evaluation",
    
    input={
        "audio_url": "s3://bucket/call.wav",
        "audio_duration_seconds": 47.8,
        "audio_quality": {"snr_db": 22, "sample_rate": 16000}
    },
    
    output={
        "transcript": "...",
        "word_timings": [...],
        "confidence_scores": [...],
        "asr_provider": "deepgram",
        "asr_model": "nova-2-medical"
    },
    
    expected={
        "transcript": "...",  # Human-verified transcript
        "medical_terms": ["hypertension", "metformin", "dyspnea"]
    }
)

# ASR-specific evaluators
asr_eval = client.evaluations.create(
    name="ASR Medical Accuracy",
    project="asr-evaluation",
    dataset="ds_transcribed_calls",
    
    evaluators=[
        {
            "type": "wer",  # Word Error Rate
            "config": {
                "normalize": True,
                "ignore_case": True,
                "ignore_punctuation": True
            }
        },
        {
            "type": "medical_term_accuracy",
            "config": {
                "term_categories": ["medications", "conditions", "symptoms", "anatomy"],
                "phonetic_similarity_threshold": 0.8
            }
        },
        {
            "type": "speaker_diarization_accuracy",
            "config": {
                "check_speaker_labels": True,
                "check_turn_boundaries": True
            }
        }
    ]
)
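
The wer evaluator scores the standard word error rate, WER = (S + D + I) / N: the substitutions, deletions, and insertions needed to turn the hypothesis into the reference transcript, divided by the reference word count. A minimal reference implementation for sanity-checking scores offline (lowercasing and splitting loosely mirrors the normalize and ignore_case options):
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = min edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    return d[-1][-1] / max(len(ref), 1)

# A misheard medication ("metformin" -> "met foreman") inflates medical WER
print(word_error_rate("patient takes metformin daily",
                      "patient takes met foreman daily"))  # 0.5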

Multimodal Evaluation

For systems that combine voice with other modalities (images, documents):
# Log multimodal interaction
client.log(
    project="multimodal-triage",
    
    input={
        # Voice component
        "audio_url": "s3://bucket/call.wav",
        "transcript": [...],
        
        # Image component (patient-submitted photo)
        "image_url": "s3://bucket/rash_photo.jpg",
        "image_description": "Patient submitted photo of arm rash",
        
        # Prior records
        "patient_history": {
            "conditions": ["diabetes"],
            "medications": ["metformin"]
        }
    },
    
    output={
        # Voice analysis
        "voice_symptoms": ["itching", "spreading rash", "fever"],
        "voice_triage": "urgent",
        
        # Image analysis
        "image_findings": {
            "rash_type": "maculopapular",
            "distribution": "arm",
            "severity": "moderate"
        },
        "image_triage": "urgent",
        
        # Combined decision
        "final_triage": "urgent",
        "reasoning": "Maculopapular rash with fever in diabetic patient - needs same-day evaluation"
    }
)

# Multimodal evaluators
multimodal_eval = client.evaluations.create(
    name="Multimodal Triage Eval",
    project="multimodal-triage",
    dataset="ds_multimodal_cases",
    
    evaluators=[
        # Voice-specific evaluation
        {
            "type": "triage_accuracy",
            "config": {"source": "voice_triage"}
        },
        
        # Image-specific evaluation
        {
            "type": "image_finding_accuracy",
            "config": {
                "finding_types": ["rash_type", "distribution", "severity"]
            }
        },
        
        # Integration evaluation - did voice + image combine correctly?
        {
            "type": "multimodal_integration",
            "config": {
                "check_consistency": True,   # Voice and image findings align
                "check_comprehensive": True, # Both modalities considered
                "weight_voice": 0.4,
                "weight_image": 0.6
            }
        }
    ]
)
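
One way to think about what check_consistency is testing: the per-modality triage calls should agree, and the final call should never be less acute than the most acute single modality. A rough sketch of that logic, as our own illustration rather than the evaluator's actual implementation:
ACUITY = {"routine": 0, "semi_urgent": 1, "urgent": 2, "emergency": 3}

def check_triage_consistency(voice: str, image: str, final: str) -> dict:
    """Flag modality disagreement and any downgrade in the final call."""
    return {
        "modalities_agree": voice == image,
        # The final call should be at least as acute as the most acute modality
        "no_downgrade": ACUITY[final] >= max(ACUITY[voice], ACUITY[image]),
    }

print(check_triage_consistency("urgent", "urgent", "urgent"))
# {'modalities_agree': True, 'no_downgrade': True}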

Critical Metrics for Voice AI

Key metrics to track for healthcare voice systems:
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Triage Accuracy | > 95% | Core safety metric - correct urgency level |
| Red Flag Recall | > 99% | Must catch nearly all danger signs |
| Under-Triage Rate | < 1% | Missing urgent cases is unacceptable |
| Escalation Latency | < 60s | How quickly urgent cases are escalated |
| ASR Medical WER | < 5% | Transcription accuracy for clinical terms |
| Symptom F1 | > 90% | Correct symptom extraction |
| Guideline Adherence | > 90% | Following clinical protocols |
| Patient Satisfaction | > 4.0/5 | Empathy, clarity, helpfulness |
Set asymmetric thresholds. It’s better to over-triage (send someone to ER who didn’t need it) than under-triage (send someone home who needed emergency care).
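
The severity_weights in the triage_accuracy evaluator above encode exactly this asymmetry. A toy illustration of how such weighting might score errors (illustrative only, not Rubric's internal formula):
ACUITY = {"routine": 0, "semi_urgent": 1, "urgent": 2, "emergency": 3}
UNDER_TRIAGE_WEIGHT = 10.0  # mirrors severity_weights in the evaluator config above
OVER_TRIAGE_WEIGHT = 1.0

def triage_error_cost(predicted: str, actual: str) -> float:
    """Penalize under-triage ten times more heavily than over-triage."""
    gap = ACUITY[predicted] - ACUITY[actual]
    if gap < 0:  # predicted less acute than reality: under-triage
        return UNDER_TRIAGE_WEIGHT * -gap
    return OVER_TRIAGE_WEIGHT * gap  # over-triage; exact match costs 0

print(triage_error_cost("routine", "emergency"))   # 30.0 - dangerous miss
print(triage_error_cost("emergency", "routine"))   # 3.0 - wasteful but safe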

Setting Up Clinician Review

Route flagged calls to physicians and nurses for expert review:
# Configure review routing for voice calls
client.projects.update(
    project="patient-triage-voice",
    
    review_config={
        # Automatic flagging rules
        "auto_flag_rules": [
            # Always review emergency calls
            {"condition": "triage_level == 'emergency'", "priority": "high"},
            
            # Review when AI is uncertain
            {"condition": "confidence < 0.8", "priority": "medium"},
            
            # Review red flag cases
            {"condition": "red_flags_detected.length > 0", "priority": "high"},
            
            # Review potential under-triage
            {"condition": "symptoms.contains('chest_pain') AND triage_level != 'emergency'", "priority": "urgent"}
        ],
        
        # Random sampling for QA
        "sampling_config": {
            "routine_calls": 0.05,      # 5% of routine calls
            "semi_urgent_calls": 0.10,  # 10% of semi-urgent
            "urgent_calls": 0.20        # 20% of urgent
        },
        
        # Reviewer requirements
        "reviewer_requirements": {
            "emergency_calls": ["MD", "DO", "NP"],
            "urgent_calls": ["MD", "DO", "NP", "RN"],
            "routine_calls": ["RN", "LPN"]
        },
        
        # Review interface config
        "review_interface": {
            "show_audio_player": True,
            "show_transcript": True,
            "show_ai_reasoning": True,
            "require_triage_override": True,
            "require_notes": True
        }
    }
)
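
Those sampling rates translate directly into reviewer workload. A quick back-of-envelope sketch, with a made-up weekly call mix; only the rates come from the sampling_config above:
# Hypothetical weekly call volumes by triage level
calls = {"routine_calls": 2000, "semi_urgent_calls": 600, "urgent_calls": 200}
rates = {"routine_calls": 0.05, "semi_urgent_calls": 0.10, "urgent_calls": 0.20}

sampled = {level: int(n * rates[level]) for level, n in calls.items()}
print(sampled)                # {'routine_calls': 100, 'semi_urgent_calls': 60, 'urgent_calls': 40}
print(sum(sampled.values()))  # 200 sampled reviews per week, plus all auto-flagged calls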

Integration Example

Complete example integrating Rubric into a voice triage pipeline:
from rubric import Rubric
from your_asr import transcribe_audio
from your_nlu import extract_symptoms, classify_intent
from your_triage import determine_triage_level

rubric = Rubric(api_key="gr_live_xxx")

async def process_triage_call(call_audio_url: str, call_id: str):
    """Process a patient triage call through the full pipeline."""
    
    # Step 1: Transcribe audio
    transcript = await transcribe_audio(call_audio_url)
    
    # Step 2: Extract clinical information
    symptoms = extract_symptoms(transcript)
    intent = classify_intent(transcript)
    
    # Step 3: Determine triage level
    triage_result = determine_triage_level(
        symptoms=symptoms,
        intent=intent,
        transcript=transcript
    )
    
    # Step 4: Log to Rubric for evaluation
    rubric.calls.log(
        project="patient-triage-voice",
        
        audio_url=call_audio_url,
        
        transcript=[
            {"speaker": turn.speaker, "text": turn.text, "start": turn.start, "end": turn.end}
            for turn in transcript.turns
        ],
        
        ai_decision={
            "extracted_symptoms": [s.to_dict() for s in symptoms],
            "intent": intent.label,
            "red_flags_detected": [rf.name for rf in triage_result.red_flags],
            "triage_level": triage_result.level,
            "recommended_action": triage_result.action,
            "confidence": triage_result.confidence,
            "protocol_used": triage_result.protocol
        },
        
        # Pipeline metadata for debugging
        metadata={
            "call_id": call_id,
            "asr_provider": "deepgram",
            "asr_confidence": transcript.confidence,
            "nlu_model": "clinical-bert-v3",
            "triage_model": "triage-ensemble-v2",
            "pipeline_latency_ms": get_pipeline_latency()
        }
    )
    
    return triage_result

# Run evaluation on logged calls
def run_weekly_evaluation():
    evaluation = rubric.evaluations.create(
        name=f"Voice Triage - Week of {get_week_start()}",
        project="patient-triage-voice",
        dataset="ds_this_week_calls",
        evaluators=[
            {"type": "triage_accuracy", "config": {...}},
            {"type": "red_flag_detection", "config": {...}},
            {"type": "symptom_extraction", "config": {...}},
            {"type": "guideline_compliance", "config": {...}}
        ]
    )
    
    # Wait and get results
    results = rubric.evaluations.wait(evaluation.id)
    
    # Alert if metrics degrade
    if results.metrics["under_triage_rate"] > 0.01:
        send_alert("Under-triage rate exceeded 1% threshold!")
    
    return results
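
A function like run_weekly_evaluation is a natural fit for a scheduled job (cron, Airflow, or similar), with the alert wired into your on-call path so threshold breaches reach a clinician quickly.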

Next Steps