Why RAG Evaluation Matters
Clinical RAG systems pull information from multiple sources to generate summaries, care plans, and recommendations. Unlike general RAG, clinical RAG has unique failure modes:
Retrieval Accuracy
Did the system find the right documents? Did it miss critical lab results or medications?
Attribution & Grounding
Can every claim be traced to a source? Are citations accurate and verifiable?
Synthesis Quality
Is information integrated correctly? Are contradictions resolved appropriately?
Hallucination Risk
Did the model fabricate medications, lab values, or clinical findings not in sources?
Step 1: Define Your RAG Context
Configure your evaluation to capture both the retrieved documents and the generated output (`rag_evaluation_setup.py`).
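A minimal sketch of what such a setup could look like, using plain dataclasses; the `RetrievedDocument` and `RAGEvalContext` names and fields are illustrative assumptions, not a specific library's API:

```python
# Illustrative evaluation context -- field names are assumptions, not a required schema.
from dataclasses import dataclass, field


@dataclass
class RetrievedDocument:
    """A single document returned by the retriever."""
    doc_id: str             # e.g. "LAB_042", "NOTE_117"
    doc_type: str           # "lab", "medication_order", "progress_note", ...
    content: str            # raw text passed to the generator
    relevance_score: float  # retriever score, 0.0-1.0
    timestamp: str          # ISO-8601, used for recency checks


@dataclass
class RAGEvalContext:
    """Everything the evaluators need: the query, the retrieved sources, and the output."""
    query: str
    retrieved_docs: list[RetrievedDocument] = field(default_factory=list)
    generated_output: str = ""
    citations: dict[str, list[str]] = field(default_factory=dict)  # claim -> cited doc_ids
```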
Step 2: Log RAG Pipeline Outputs
Capture the full RAG pipeline, including the retrieved documents, their relevance scores, and the final generated output (`log_rag_output.py`).
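A sketch of the logging step, assuming a simple JSONL sink; the record schema and the `log_rag_output` signature are illustrative choices:

```python
# Persist one full pipeline run so downstream evaluators can replay it.
import json
from datetime import datetime, timezone


def log_rag_output(query: str,
                   retrieved_docs: list[dict],   # each with doc_id, doc_type, content, relevance_score, timestamp
                   generated_output: str,
                   citations: dict[str, list[str]],
                   path: str = "rag_eval_log.jsonl") -> None:
    """Append the query, retrieved documents with scores, citations, and output to a JSONL log."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "retrieved_docs": retrieved_docs,
        "generated_output": generated_output,
        "citations": citations,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```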
Step 3: Retrieval Evaluation
Evaluate whether your RAG system retrieved the right documents (`retrieval_metrics.py`).
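A sketch of the core retrieval metrics, assuming you can label, per query, which document IDs were relevant and which were critical (e.g. current medication list, latest labs):

```python
def retrieval_precision(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def retrieval_recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0


def critical_doc_coverage(retrieved: set[str], critical: set[str]) -> float:
    """Fraction of must-have documents present in the retrieval set (target: 100%)."""
    return len(retrieved & critical) / len(critical) if critical else 1.0


# Example: one discharge-summary query (IDs are illustrative)
retrieved = {"LAB_042", "LAB_043", "NOTE_117", "MED_009"}
relevant = {"LAB_043", "NOTE_117", "MED_009", "MED_010"}
critical = {"MED_009", "LAB_043"}

print(f"precision={retrieval_precision(retrieved, relevant):.2f}")             # 0.75
print(f"recall={retrieval_recall(retrieved, relevant):.2f}")                   # 0.75
print(f"critical coverage={critical_doc_coverage(retrieved, critical):.2f}")   # 1.00
```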
Step 4: Attribution & Grounding Evaluation
Verify that every clinical claim in the output can be traced to a source document (`attribution_evaluation.py`).
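A sketch of an attribution check, assuming claims and their cited document IDs have already been extracted upstream; the substring grounding test is a deliberately naive stand-in for entity- or NLI-based matching, and the issue labels mirror the severity table that follows:

```python
def check_attribution(claims: dict[str, list[str]],
                      sources: dict[str, str]) -> list[dict]:
    """Flag claims whose citations are missing, point at unknown documents,
    or point at a document that does not actually contain the claim."""
    issues = []
    for claim, cited_ids in claims.items():
        if not cited_ids:
            issues.append({"claim": claim, "issue": "missing_citation", "severity": "medium"})
            continue
        for doc_id in cited_ids:
            source_text = sources.get(doc_id)
            if source_text is None:
                issues.append({"claim": claim, "issue": "citation_not_found",
                               "doc_id": doc_id, "severity": "high"})
            elif claim.lower() not in source_text.lower():
                # Naive substring grounding check; swap in entity/value matching
                # or an NLI model for production use.
                issues.append({"claim": claim, "issue": "quote_mismatch",
                               "doc_id": doc_id, "severity": "high"})
    return issues


# Example: a claim cites a document that does not support it
claims = {"BNP 580": ["LAB_042"]}
sources = {"LAB_042": "ECG unremarkable.", "LAB_043": "BNP 590 on admission."}
print(check_attribution(claims, sources))
# [{'claim': 'BNP 580', 'issue': 'quote_mismatch', 'doc_id': 'LAB_042', 'severity': 'high'}]
```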
| Attribution Issue | Severity | Example |
|---|---|---|
| Citation not found | High | Cited document LAB_042 but lab value appears in LAB_043 |
| Quote mismatch | High | Cited “BNP 580” but source says “BNP 590” |
| Context distortion | Medium | Source says “rule out MI” but summary says “MI ruled out” |
| Missing citation | Medium | Lab value stated without any citation |
| Stale citation | Low | Cited old lab when newer result available |
Step 5: Hallucination Detection
Identify fabricated clinical information not present in any source document (`hallucination_detection.py`).
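A sketch of one entity-level hallucination check, medications in this case; the small formulary set and plain string matching are illustrative stand-ins for a real clinical NER pipeline:

```python
# Every medication named in the output must appear in at least one source document.
KNOWN_MEDICATIONS = {"lisinopril", "metformin", "furosemide", "apixaban", "warfarin"}


def find_fabricated_medications(output: str, sources: list[str]) -> set[str]:
    """Medications mentioned in the output that no source document mentions."""
    output_lower = output.lower()
    combined_sources = " ".join(sources).lower()
    mentioned = {med for med in KNOWN_MEDICATIONS if med in output_lower}
    supported = {med for med in KNOWN_MEDICATIONS if med in combined_sources}
    return mentioned - supported


output = "Continue lisinopril 10 mg daily and start apixaban 5 mg twice daily."
sources = ["Current med list: lisinopril 10 mg daily, metformin 500 mg BID."]
print(find_fabricated_medications(output, sources))  # {'apixaban'} -- a critical hallucination
```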
Step 6: Synthesis Quality Evaluation
Evaluate how well the system integrates information from multiple sources (`synthesis_evaluation.py`).
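A sketch of synthesis scoring with an LLM-as-judge rubric; `call_judge_model` is a placeholder for whatever model client you use, and the rubric dimensions echo the metrics table below:

```python
import json
from typing import Callable

SYNTHESIS_RUBRIC = """You are reviewing a clinical summary generated from source documents.
Score each dimension from 0 to 1 and return JSON with keys
"coherence", "contradiction_handling", "temporal_accuracy", and "rationale".

Source documents:
{sources}

Generated summary:
{summary}
"""


def evaluate_synthesis(summary: str,
                       sources: list[str],
                       call_judge_model: Callable[[str], str]) -> dict:
    """Score how well the summary integrates its sources, via a judge model."""
    prompt = SYNTHESIS_RUBRIC.format(
        sources="\n---\n".join(sources),
        summary=summary,
    )
    raw = call_judge_model(prompt)   # e.g. an API or local-model call returning JSON text
    return json.loads(raw)           # {"coherence": 0.9, "contradiction_handling": ..., ...}
```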
Step 7: Human Review for Edge Cases
Route complex cases for physician review (`rag_human_review.py`).
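A sketch of review routing, assuming the checks from the earlier steps produce the inputs shown; the thresholds and reasons are illustrative:

```python
def needs_physician_review(hallucinated_meds: set[str],
                           attribution_issues: list[dict],
                           conflicting_sources: bool,
                           retrieval_recall: float) -> tuple[bool, list[str]]:
    """Return (route_to_review, reasons) for one generated output."""
    reasons = []
    if hallucinated_meds:
        reasons.append(f"fabricated medications: {sorted(hallucinated_meds)}")
    if any(issue["severity"] == "high" for issue in attribution_issues):
        reasons.append("high-severity attribution issue")
    if conflicting_sources:
        reasons.append("sources disagree; model resolution unverified")
    if retrieval_recall < 0.90:
        reasons.append(f"retrieval recall {retrieval_recall:.2f} below 0.90 target")
    return (bool(reasons), reasons)


route, reasons = needs_physician_review(set(), [{"severity": "high"}], False, 0.95)
print(route, reasons)  # True ['high-severity attribution issue']
```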
RAG Evaluation Metrics Summary
| Metric | What It Measures | Target |
|---|---|---|
| Retrieval Precision | Relevance of retrieved documents | > 85% |
| Retrieval Recall | Coverage of relevant documents | > 90% |
| Critical Doc Coverage | Required documents retrieved | 100% |
| Attribution Accuracy | Citations match sources | > 95% |
| Citation Coverage | Claims with valid citations | > 90% |
| Hallucination Rate | Fabricated clinical facts | < 1% |
| Critical Hallucinations | Fabricated meds/labs | 0 |
| Synthesis Coherence | Logical clinical narrative | > 85% |
| Temporal Accuracy | Correct timeline and recency | > 95% |
Common RAG Failure Patterns
Stale Medication Lists
RAG retrieves an old medication list instead of the current one, leading to discontinued drugs appearing in discharge instructions. Mitigation: add recency constraints to medication retrieval and always fetch from the MAR or current orders.
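A minimal sketch of such a recency constraint, assuming documents carry a `doc_type` and a timezone-aware ISO-8601 `timestamp`; the field names and 48-hour window are illustrative:

```python
from datetime import datetime, timedelta, timezone


def filter_current_medications(docs: list[dict], max_age_hours: int = 48) -> list[dict]:
    """Keep only medication documents recent enough to reflect current orders."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    current = []
    for doc in docs:
        if doc["doc_type"] not in {"mar", "medication_order"}:
            continue
        # Assumes timezone-aware ISO-8601 timestamps.
        if datetime.fromisoformat(doc["timestamp"]) >= cutoff:
            current.append(doc)
    # Newest first, so the generator sees the latest orders before any truncation.
    return sorted(current, key=lambda d: datetime.fromisoformat(d["timestamp"]), reverse=True)
```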
Lab Value Interpolation
The model “interpolates” lab values that appear in no source, e.g., guessing a creatinine trend from pattern recognition. Mitigation: apply strict hallucination detection to all numeric values and require an exact source match.
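A minimal sketch of an exact-source-match check for numeric values; the regex and lab-name list are illustrative stand-ins for clinical NER:

```python
import re

NUMERIC_FACT = re.compile(
    r"\b(?:BNP|creatinine|troponin|potassium|INR)\s*[:=]?\s*\d+(?:\.\d+)?",
    re.IGNORECASE,
)


def detect_numeric_hallucinations(output: str, sources: list[str]) -> list[str]:
    """Return numeric clinical facts in the output that appear in no source document."""
    combined = " ".join(sources).lower()
    return [fact for fact in NUMERIC_FACT.findall(output) if fact.lower() not in combined]


output = "BNP 580 on admission; creatinine 2.1 trending up."
sources = ["Cardiology note: BNP 590, troponin negative.", "BMP: creatinine 2.1."]
print(detect_numeric_hallucinations(output, sources))  # ['BNP 580'] -- value not in any source
```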
Conflicting Source Resolution
Multiple progress notes have different assessments, and the model picks one without acknowledging the disagreement. Mitigation: train the model to acknowledge uncertainty, prefer the most recent assessment, and flag the case for review.
Context Window Truncation
Long documents get truncated, losing critical information at the end of notes such as “follow up in 1 week.” Mitigation: use chunking strategies that preserve section integrity and prioritize actionable content.
Next Steps
- Safety Gating Before Production - Block deployments on RAG failures
- Production Monitoring - Track RAG metrics in real-time
- Export for Regulators - Generate compliance reports
