In this tutorial, you’ll set up a complete evaluation pipeline for a clinical NLP model that generates SOAP notes from patient encounter transcripts. By the end, you’ll have:
- A dataset of clinical encounters with ground truth annotations
- Automated evaluators for accuracy, completeness, and hallucination detection
    # Add samples with ground truth annotations
    samples = [
        {
            "input": {
                "transcript": """
                Doctor: Good morning, what brings you in today?
                Patient: I've been having this persistent cough for about two weeks now.
                Doctor: Is it a dry cough or are you bringing up any mucus?
                Patient: It's mostly dry, but sometimes I cough up a little clear stuff.
                Doctor: Any fever, shortness of breath, or chest pain?
                Patient: No fever, but I do feel a bit short of breath when climbing stairs.
                Doctor: Any history of asthma or allergies?
                Patient: I have seasonal allergies, take Zyrtec.
                Doctor: Current smoker?
                Patient: No, never smoked.
                """,
                "patient_demographics": {
                    "age": 42,
                    "sex": "female"
                },
                "chief_complaint": "Persistent cough x 2 weeks"
            },
            "expected_output": {
                "subjective": "42-year-old female presents with 2-week history of persistent, predominantly dry cough. Occasional clear sputum production. Reports dyspnea on exertion (climbing stairs). Denies fever or chest pain. PMH: seasonal allergies on cetirizine. Never smoker.",
                "objective": "[To be completed by examiner]",
                "assessment": "1. Acute bronchitis, likely viral\n2. Post-nasal drip syndrome (possible contributing factor given allergy history)\n3. Rule out reactive airway disease",
                "plan": "1. Supportive care with increased fluids\n2. OTC dextromethorphan for cough suppression PRN\n3. Continue cetirizine\n4. Return if symptoms worsen or persist >3 weeks\n5. Consider PFTs if cough persists to evaluate for asthma",
                "icd_codes": ["J20.9", "R05.9"]
            },
            "metadata": {
                "encounter_type": "outpatient",
                "specialty": "primary_care",
                "complexity": "moderate"
            }
        },
        # Add more samples...
    ]

    # Batch upload samples
    result = client.samples.create_batch(
        dataset=dataset.id,
        samples=samples
    )
    print(f"Added {result.created_count} samples")
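Before uploading, it can help to check each sample locally so that a missing field or a mistyped code does not surface later as a confusing evaluator failure. The sketch below is one possible approach, not part of the SDK: `validate_sample`, `REQUIRED_KEYS`, and `ICD10_SHAPE` are hypothetical names, the required fields are simply assumed from the sample shown above, and the regular expression checks only the general shape of an ICD-10-CM code, not whether the code exists.

```python
import re

# Hypothetical schema: field names are assumed from the sample shown above.
REQUIRED_KEYS = {
    "input": {"transcript", "patient_demographics", "chief_complaint"},
    "expected_output": {"subjective", "objective", "assessment", "plan", "icd_codes"},
    "metadata": {"encounter_type", "specialty", "complexity"},
}

# Rough shape check only (letter, digit, alphanumeric, optional ".xxxx").
# It does not confirm that a code is present in the ICD-10-CM code set.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def validate_sample(sample):
    """Return a list of problems found; an empty list means the sample passed."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        block = sample.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in sorted(keys.difference(block)):
            problems.append(f"missing field: {section}.{key}")

    expected = sample.get("expected_output")
    codes = expected.get("icd_codes", []) if isinstance(expected, dict) else []
    for code in codes:
        if not isinstance(code, str) or not ICD10_SHAPE.match(code):
            problems.append(f"malformed ICD-10 code: {code!r}")
    return problems


# Abbreviated stand-in for the full sample above.
example = {
    "input": {
        "transcript": "...",
        "patient_demographics": {"age": 42, "sex": "female"},
        "chief_complaint": "Persistent cough x 2 weeks",
    },
    "expected_output": {
        "subjective": "...",
        "objective": "...",
        "assessment": "...",
        "plan": "...",
        "icd_codes": ["J20.9", "R05.9"],
    },
    "metadata": {
        "encounter_type": "outpatient",
        "specialty": "primary_care",
        "complexity": "moderate",
    },
}
print(validate_sample(example))  # []
```

Running this over the whole `samples` list and uploading only when every result is empty keeps malformed annotations out of the dataset.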
Sample Size Recommendations: For reliable metrics, aim for at least 100 samples per specialty/complexity combination. Include edge cases and known failure modes from production.
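One way to track this recommendation is to count samples per specialty/complexity pair and report which groups are still short. The following is a minimal sketch under the assumption that every sample carries the `metadata` fields shown earlier; `coverage_gaps` is a hypothetical helper, and the demo data (including the `cardiology` group) is synthetic.

```python
from collections import Counter

MIN_PER_GROUP = 100  # threshold taken from the recommendation above


def coverage_gaps(samples, minimum=MIN_PER_GROUP):
    """Map each under-filled (specialty, complexity) pair to its shortfall."""
    counts = Counter(
        (s["metadata"]["specialty"], s["metadata"]["complexity"]) for s in samples
    )
    return {group: minimum - n for group, n in counts.items() if n < minimum}


# Synthetic demo data: 120 samples in one group, 35 in another.
demo = (
    [{"metadata": {"specialty": "primary_care", "complexity": "moderate"}}] * 120
    + [{"metadata": {"specialty": "cardiology", "complexity": "high"}}] * 35
)
print(coverage_gaps(demo))  # {('cardiology', 'high'): 65}
```

Note that a combination with no samples at all never appears in the counts, so groups you expect to cover should be compared against the result separately.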