Why RAG Evaluation Matters

Clinical RAG systems pull information from multiple sources to generate summaries, care plans, and recommendations. Unlike general-purpose RAG, clinical RAG has failure modes where an error can directly affect patient care:

Retrieval Accuracy

Did the system find the right documents? Did it miss critical lab results or medications?

Attribution & Grounding

Can every claim be traced to a source? Are citations accurate and verifiable?

Synthesis Quality

Is information integrated correctly? Are contradictions resolved appropriately?

Hallucination Risk

Did the model fabricate medications, lab values, or clinical findings not in sources?

Step 1: Define Your RAG Context

Configure your evaluation to capture both the retrieved documents and the generated output:
rag_evaluation_setup.py
from rubric import Rubric

client = Rubric(api_key="your-api-key")

# Create project for RAG evaluation
project = client.projects.create(
    name="clinical-summary-rag",
    data_type="clinical_notes",
    description="RAG system for patient discharge summaries"
)

# Configure RAG-specific evaluators
evaluators = [
    {
        "type": "retrieval_relevance",
        "config": {
            "document_types": ["lab_results", "medications", "progress_notes", "imaging"],
            "required_coverage": {
                "recent_labs": True,       # Must retrieve labs from last 24h
                "active_medications": True, # Must retrieve current med list
                "chief_complaint": True     # Must retrieve admission reason
            }
        }
    },
    {
        "type": "attribution_accuracy",
        "config": {
            "require_citations": True,
            "citation_granularity": "sentence",  # Each claim needs citation
            "verify_quotes": True                 # Check quoted text matches source
        }
    },
    {
        "type": "clinical_hallucination",
        "config": {
            "check_medications": True,
            "check_lab_values": True,
            "check_diagnoses": True,
            "check_procedures": True,
            "severity_weights": {
                "fabricated_medication": 10.0,
                "wrong_lab_value": 8.0,
                "unsupported_diagnosis": 7.0,
                "minor_detail_error": 1.0
            }
        }
    },
    {
        "type": "synthesis_quality",
        "config": {
            "check_contradiction_handling": True,
            "check_temporal_reasoning": True,
            "check_clinical_coherence": True
        }
    }
]
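
With the project and evaluator list defined, you can kick off a baseline run. A minimal sketch, reusing the evaluators list above and assuming the discharge-summaries-march dataset used later in this guide has already been uploaded:

# Run a baseline evaluation with the evaluators defined above
# (assumes the "discharge-summaries-march" dataset already exists)
baseline_eval = client.evaluations.create(
    name="RAG Baseline",
    project="clinical-summary-rag",
    dataset="discharge-summaries-march",
    evaluators=evaluators
)
print(f"Started evaluation: {baseline_eval.id}")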

Step 2: Log RAG Pipeline Outputs

Capture the full RAG pipeline including retrieved documents, their relevance scores, and the final generated output:
log_rag_output.py
# Log a RAG-generated clinical summary
client.samples.create(
    project="clinical-summary-rag",

    # The query/task given to the RAG system
    input={
        "task": "generate_discharge_summary",
        "patient_id": "PAT_12345",
        "encounter_id": "ENC_67890",
        "query": "Generate discharge summary for patient with CHF exacerbation"
    },

    # Retrieved documents with metadata
    retrieved_context=[
        {
            "doc_id": "NOTE_001",
            "doc_type": "progress_note",
            "timestamp": "2024-03-14T08:00:00Z",
            "relevance_score": 0.94,
            "content": "Patient admitted with acute CHF exacerbation. BNP elevated at 1250...",
            "source": "Epic - Progress Notes"
        },
        {
            "doc_id": "LAB_042",
            "doc_type": "lab_results",
            "timestamp": "2024-03-15T06:00:00Z",
            "relevance_score": 0.91,
            "content": "BNP: 580 (down from 1250), Cr: 1.2, K: 4.1...",
            "source": "Epic - Lab Results"
        },
        {
            "doc_id": "MED_LIST",
            "doc_type": "medications",
            "timestamp": "2024-03-15T10:00:00Z",
            "relevance_score": 0.89,
            "content": "Furosemide 40mg BID, Lisinopril 10mg daily, Metoprolol 25mg BID...",
            "source": "Epic - Medication Administration"
        },
        {
            "doc_id": "ECHO_001",
            "doc_type": "imaging",
            "timestamp": "2024-03-14T14:00:00Z",
            "relevance_score": 0.87,
            "content": "TTE: EF 35%, moderate MR, no pericardial effusion...",
            "source": "Epic - Imaging"
        }
    ],

    # The generated output with inline citations
    output={
        "summary": """
DISCHARGE SUMMARY

ADMISSION DIAGNOSIS: Acute on chronic systolic heart failure exacerbation [NOTE_001]

HOSPITAL COURSE:
Patient is a 68-year-old male admitted with acute CHF exacerbation.
Initial BNP was 1250 pg/mL [LAB_042]. Echocardiogram showed EF of 35%
with moderate mitral regurgitation [ECHO_001]. Patient was diuresed with
IV furosemide with good response. BNP improved to 580 pg/mL at discharge [LAB_042].

DISCHARGE MEDICATIONS:
- Furosemide 40mg PO BID [MED_LIST]
- Lisinopril 10mg PO daily [MED_LIST]
- Metoprolol succinate 25mg PO BID [MED_LIST]

FOLLOW-UP:
- Cardiology clinic in 1 week
- Daily weights, call if >3lb gain
        """,
        "citations": [
            {"marker": "NOTE_001", "doc_id": "NOTE_001", "quote": "acute CHF exacerbation"},
            {"marker": "LAB_042", "doc_id": "LAB_042", "quote": "BNP: 580 (down from 1250)"},
            {"marker": "ECHO_001", "doc_id": "ECHO_001", "quote": "EF 35%, moderate MR"},
            {"marker": "MED_LIST", "doc_id": "MED_LIST", "quote": "Furosemide 40mg BID"}
        ]
    },

    # Ground truth for evaluation (if available)
    expected={
        "required_elements": [
            "admission_diagnosis",
            "hospital_course",
            "discharge_medications",
            "follow_up_instructions"
        ],
        "key_facts": [
            {"type": "lab", "name": "BNP_admission", "value": "1250"},
            {"type": "lab", "name": "BNP_discharge", "value": "580"},
            {"type": "imaging", "name": "EF", "value": "35%"},
            {"type": "medication", "name": "furosemide", "dose": "40mg BID"}
        ]
    }
)

Step 3: Retrieval Evaluation

Evaluate whether your RAG system retrieved the right documents:
retrieval_metrics.py
# Run retrieval-focused evaluation
retrieval_eval = client.evaluations.create(
    name="Retrieval Quality Check",
    project="clinical-summary-rag",
    dataset="discharge-summaries-march",
    evaluators=[
        {
            "type": "retrieval_relevance",
            "config": {
                # Check if critical document types were retrieved
                "required_document_types": {
                    "lab_results": {"min_count": 1, "recency": "24h"},
                    "medications": {"min_count": 1, "recency": "current"},
                    "progress_notes": {"min_count": 1}
                },

                # Penalize missing critical documents
                "missing_document_penalties": {
                    "lab_results": 5.0,
                    "medications": 8.0,  # Critical - wrong meds are dangerous
                    "imaging": 3.0
                }
            }
        },
        {
            "type": "retrieval_precision",
            "config": {
                # Penalize retrieving irrelevant documents (noise)
                "max_irrelevant_docs": 2,
                "irrelevant_threshold": 0.5  # Relevance score below this
            }
        },
        {
            "type": "retrieval_recall",
            "config": {
                # Based on ground truth relevant documents
                "ground_truth_field": "expected.relevant_documents"
            }
        }
    ]
)

# Check results
results = client.evaluations.get(retrieval_eval.id)
print(f"Retrieval Precision: {results.scores['retrieval_precision']}%")
print(f"Retrieval Recall: {results.scores['retrieval_recall']}%")
print(f"Critical Doc Coverage: {results.scores['critical_coverage']}%")
Missing Medications Are Critical: If your RAG system fails to retrieve the current medication list, the generated summary may omit critical drugs or include discontinued medications. This is a high-severity failure mode that should trigger immediate review.
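
Because a missing medication list is so dangerous, it is worth adding a lightweight guard in your own pipeline as well. A minimal sketch in plain Python, independent of the Rubric SDK, that refuses to generate a summary when the retrieved context lacks a critical document type (the doc_type values follow the Step 2 sample format):

# Guard: verify critical document types are present before generating a summary.
# doc_type values follow the Step 2 sample format.
CRITICAL_DOC_TYPES = {"lab_results", "medications", "progress_note"}

def missing_critical_docs(retrieved_context):
    """Return the critical document types absent from the retrieved context."""
    retrieved_types = {doc["doc_type"] for doc in retrieved_context}
    return CRITICAL_DOC_TYPES - retrieved_types

# Example usage with a retrieval that is missing the medication list
retrieved = [
    {"doc_id": "NOTE_001", "doc_type": "progress_note", "content": "..."},
    {"doc_id": "LAB_042", "doc_type": "lab_results", "content": "..."},
]
missing = missing_critical_docs(retrieved)
if missing:
    # Fail fast rather than generating a summary from incomplete context
    raise ValueError(f"Refusing to generate summary; missing document types: {sorted(missing)}")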

Step 4: Attribution & Grounding Evaluation

Verify that every clinical claim in the output can be traced to a source document:
attribution_evaluation.py
# Run attribution evaluation
attribution_eval = client.evaluations.create(
    name="Attribution Accuracy Check",
    project="clinical-summary-rag",
    dataset="discharge-summaries-march",
    evaluators=[
        {
            "type": "attribution_accuracy",
            "config": {
                # Define what needs citations
                "citation_requirements": {
                    "lab_values": "required",      # All lab values must have citation
                    "medications": "required",      # All meds must cite source
                    "diagnoses": "required",        # Diagnoses must be supported
                    "procedures": "required",       # Procedures must be documented
                    "general_statements": "optional"  # Narrative can be unsourced
                },

                # Verify citation accuracy
                "verification_checks": {
                    "quote_accuracy": True,         # Quoted text matches source
                    "value_accuracy": True,         # Numbers match source exactly
                    "date_accuracy": True,          # Dates are correct
                    "context_accuracy": True        # Meaning preserved in context
                }
            }
        },
        {
            "type": "citation_coverage",
            "config": {
                # What percentage of claims have citations?
                "claim_types": ["factual", "numerical", "diagnostic"]
            }
        }
    ]
)

# Fetch the attribution results and analyze failures
results = client.evaluations.get(attribution_eval.id)
failures = results.get_failures(evaluator="attribution_accuracy")
for failure in failures[:5]:
    print(f"Claim: {failure.claim}")
    print(f"Cited Source: {failure.cited_source}")
    print(f"Issue: {failure.issue}")  # e.g., "Quote not found in source"
    print(f"Severity: {failure.severity}")
    print("---")
Attribution Issue  | Severity | Example
-------------------|----------|--------------------------------------------------------------
Citation not found | High     | Cited document LAB_042 but the lab value appears in LAB_043
Quote mismatch     | High     | Cited “BNP 580” but the source says “BNP 590”
Context distortion | Medium   | Source says “rule out MI” but the summary says “MI ruled out”
Missing citation   | Medium   | Lab value stated without any citation
Stale citation     | Low      | Cited an old lab when a newer result is available
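
To see which of these issues dominates, one option is to group the attribution failures by issue type. A minimal sketch, assuming the issue and severity fields printed in the loop above:

from collections import Counter

# Group attribution failures by issue type to see which problems dominate
# (failure.issue and failure.severity are the fields printed above)
failures = results.get_failures(evaluator="attribution_accuracy")
issue_counts = Counter(f.issue for f in failures)
for issue, count in issue_counts.most_common():
    print(f"{issue}: {count}")

high_severity = [f for f in failures if str(f.severity).lower() == "high"]
print(f"High-severity attribution failures: {len(high_severity)}")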

Step 5: Hallucination Detection

Identify fabricated clinical information not present in any source document:
hallucination_detection.py
# Configure clinical hallucination detection
hallucination_eval = client.evaluations.create(
    name="Clinical Hallucination Check",
    project="clinical-summary-rag",
    dataset="discharge-summaries-march",
    evaluators=[
        {
            "type": "clinical_hallucination",
            "config": {
                # Types of hallucinations to detect
                "detection_categories": {
                    "fabricated_medications": {
                        "enabled": True,
                        "severity": "critical",
                        "check_against": ["retrieved_context.medications", "retrieved_context.mar"]
                    },
                    "fabricated_lab_values": {
                        "enabled": True,
                        "severity": "critical",
                        "tolerance": {
                            "numeric": 0.0,  # Exact match required for labs
                            "units": True     # Units must match
                        }
                    },
                    "fabricated_diagnoses": {
                        "enabled": True,
                        "severity": "high",
                        "check_against": ["retrieved_context.assessments", "retrieved_context.problem_list"]
                    },
                    "fabricated_procedures": {
                        "enabled": True,
                        "severity": "high"
                    },
                    "fabricated_dates": {
                        "enabled": True,
                        "severity": "medium"
                    },
                    "fabricated_providers": {
                        "enabled": True,
                        "severity": "low"
                    }
                },

                # How to handle claims that seem plausible but unsupported
                "unsupported_claim_handling": {
                    "threshold": "strict",  # Mark as hallucination if not in sources
                    "allow_inference": False  # Don't allow logical inferences
                }
            }
        }
    ]
)

# Fetch and analyze hallucination results
results = client.evaluations.get(hallucination_eval.id)
print(f"Hallucination Rate: {results.scores['hallucination_rate']}%")
print(f"Critical Hallucinations: {results.counts['critical_hallucinations']}")

# Get specific hallucination instances
hallucinations = results.get_hallucinations()
for h in hallucinations:
    print(f"Type: {h.category}")
    print(f"Claim: '{h.fabricated_claim}'")
    print(f"Severity: {h.severity}")
    print(f"Evidence: No supporting document found")
    print("---")
Zero Tolerance for Medication Hallucinations: Fabricated medications in discharge summaries can lead to patients taking drugs they weren’t prescribed. Configure your evaluation to flag ANY medication not found in the source medication list as a critical failure.
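
One way to enforce that policy is a hard gate in your evaluation pipeline. A minimal sketch, assuming the category and severity strings configured in the detection_categories above:

# Hard gate: block the pipeline if any critical medication hallucination is found.
# Category and severity strings follow the detection_categories config above;
# treat the exact values as assumptions.
critical_med_hallucinations = [
    h for h in results.get_hallucinations()
    if h.category == "fabricated_medications" and str(h.severity).lower() == "critical"
]
if critical_med_hallucinations:
    raise RuntimeError(
        f"{len(critical_med_hallucinations)} fabricated medication claim(s) detected; "
        "block release and route the affected summaries to physician review."
    )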

Step 6: Synthesis Quality Evaluation

Evaluate how well the system integrates information from multiple sources:
synthesis_evaluation.py
# Evaluate synthesis quality
synthesis_eval = client.evaluations.create(
    name="Synthesis Quality Check",
    project="clinical-summary-rag",
    dataset="discharge-summaries-march",
    evaluators=[
        {
            "type": "contradiction_handling",
            "config": {
                # How should the system handle conflicting information?
                "expected_behavior": "acknowledge_and_resolve",
                "check_cases": [
                    "conflicting_lab_values",      # Same lab, different values
                    "conflicting_medications",     # Med changes during stay
                    "conflicting_assessments"      # Different provider opinions
                ]
            }
        },
        {
            "type": "temporal_reasoning",
            "config": {
                # Does the summary reflect the correct timeline?
                "check_chronology": True,
                "check_recency": True,  # Uses most recent values
                "check_trends": True    # Captures improvement/decline
            }
        },
        {
            "type": "clinical_coherence",
            "config": {
                # Does the summary make clinical sense?
                "check_diagnosis_treatment_alignment": True,
                "check_lab_interpretation": True,
                "check_follow_up_appropriateness": True
            }
        },
        {
            "type": "completeness",
            "config": {
                "required_sections": [
                    "admission_diagnosis",
                    "hospital_course",
                    "discharge_medications",
                    "follow_up"
                ],
                "context_aware_requirements": {
                    "if_diabetic": ["glucose_management", "a1c_if_available"],
                    "if_cardiac": ["echo_findings", "cardiac_meds"],
                    "if_infectious": ["antibiotic_course", "cultures"]
                }
            }
        }
    ]
)
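
As in Steps 3 and 5, fetch the scores once the run completes. A short sketch following the same pattern; the exact score keys are assumptions derived from the evaluator names above:

# Fetch synthesis results (score keys assumed from the evaluator names above)
results = client.evaluations.get(synthesis_eval.id)
print(f"Contradiction Handling: {results.scores['contradiction_handling']}%")
print(f"Temporal Reasoning: {results.scores['temporal_reasoning']}%")
print(f"Clinical Coherence: {results.scores['clinical_coherence']}%")
print(f"Completeness: {results.scores['completeness']}%")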

Step 7: Human Review for Edge Cases

Route complex cases for physician review:
rag_human_review.py
# Configure human review routing
client.projects.update(
    "clinical-summary-rag",
    human_review={
        "enabled": True,
        "routing_rules": [
            {
                "condition": "hallucination_detected",
                "action": "route_to_review",
                "priority": "urgent",
                "reviewer_type": "physician"
            },
            {
                "condition": "attribution_score < 80",
                "action": "route_to_review",
                "priority": "high"
            },
            {
                "condition": "contradiction_unresolved",
                "action": "route_to_review",
                "priority": "medium"
            },
            {
                "condition": "random_sample",
                "rate": 0.05,  # 5% random sample
                "action": "route_to_review"
            }
        ],

        "review_interface": {
            "show_retrieved_documents": True,
            "show_citations_inline": True,
            "highlight_unsupported_claims": True,
            "enable_source_comparison": True  # Side-by-side view
        }
    }
)

RAG Evaluation Metrics Summary

Metric                  | What It Measures                 | Target
------------------------|----------------------------------|-------
Retrieval Precision     | Relevance of retrieved documents | > 85%
Retrieval Recall        | Coverage of relevant documents   | > 90%
Critical Doc Coverage   | Required documents retrieved     | 100%
Attribution Accuracy    | Citations match sources          | > 95%
Citation Coverage       | Claims with valid citations      | > 90%
Hallucination Rate      | Fabricated clinical facts        | < 1%
Critical Hallucinations | Fabricated meds/labs             | 0
Synthesis Coherence     | Logical clinical narrative       | > 85%
Temporal Accuracy       | Correct timeline and recency     | > 95%
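
If these metrics feed a CI or release process, the targets above can act as a gate. A minimal sketch in plain Python; the score keys are assumptions based on the metric names used earlier in this guide:

# Release gate against the target thresholds above.
# Score keys are assumptions based on the names used earlier in this guide.
TARGETS = {
    "retrieval_precision": 85.0,
    "retrieval_recall": 90.0,
    "critical_coverage": 100.0,
    "attribution_accuracy": 95.0,
    "citation_coverage": 90.0,
    "synthesis_coherence": 85.0,
    "temporal_accuracy": 95.0,
}
MAX_HALLUCINATION_RATE = 1.0  # percent

def gate(scores: dict) -> list:
    """Return a human-readable failure for each metric that misses its target."""
    problems = [
        f"{metric} = {scores.get(metric, 0.0)} (target >= {target})"
        for metric, target in TARGETS.items()
        if scores.get(metric, 0.0) < target
    ]
    if scores.get("hallucination_rate", 100.0) > MAX_HALLUCINATION_RATE:
        problems.append(f"hallucination_rate = {scores['hallucination_rate']} (target < {MAX_HALLUCINATION_RATE})")
    return problems

example_scores = {
    "retrieval_precision": 88.0,
    "retrieval_recall": 86.0,
    "critical_coverage": 100.0,
    "attribution_accuracy": 96.0,
    "citation_coverage": 92.0,
    "synthesis_coherence": 87.0,
    "temporal_accuracy": 97.0,
    "hallucination_rate": 0.4,
}
for problem in gate(example_scores):
    print(f"FAILED: {problem}")  # -> FAILED: retrieval_recall = 86.0 (target >= 90.0)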

Common RAG Failure Patterns

Stale Medication Lists

RAG retrieves an old medication list instead of the current one, so discontinued drugs end up in the discharge instructions. Mitigation: add recency constraints to medication retrieval and always fetch from the MAR or current orders.
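
A sketch of the mitigation against the Step 3 retrieval evaluator config. The min_count and recency fields follow Step 3; allowed_sources is a hypothetical field shown only to illustrate constraining retrieval to the MAR or current orders:

# Tighten the Step 3 retrieval evaluator so the medication list must be current.
# "min_count" and "recency" follow the Step 3 config; "allowed_sources" is a
# hypothetical field shown for illustration only.
medication_recency_check = {
    "type": "retrieval_relevance",
    "config": {
        "required_document_types": {
            "medications": {
                "min_count": 1,
                "recency": "current",
                "allowed_sources": ["MAR", "current_orders"]
            }
        },
        "missing_document_penalties": {"medications": 8.0}
    }
}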

Lab Value Interpolation

The model “interpolates” lab values that are not in any source, e.g., guessing a creatinine trend from pattern recognition. Mitigation: apply strict hallucination detection to all numeric values and require an exact source match.
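
A cheap, model-free complement to the hallucination evaluator is to check that every number in the generated summary literally appears in at least one retrieved document. A heuristic sketch in plain Python; it will not catch unit conversions or rounded values:

import re

# Heuristic numeric-grounding check: every number in the summary should
# literally appear in at least one retrieved document's content.
def ungrounded_numbers(summary: str, retrieved_context: list) -> set:
    source_text = " ".join(doc["content"] for doc in retrieved_context)
    source_numbers = set(re.findall(r"\d+(?:\.\d+)?", source_text))
    summary_numbers = set(re.findall(r"\d+(?:\.\d+)?", summary))
    return summary_numbers - source_numbers

# Example: "72" was never retrieved, so it is flagged for review
docs = [{"content": "BNP: 580 (down from 1250), Cr: 1.2"}]
print(ungrounded_numbers("BNP improved from 1250 to 580. Heart rate 72.", docs))
# -> {'72'}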

Conflicting Source Resolution

Multiple progress notes contain different assessments, and the model picks one without acknowledging the disagreement. Mitigation: train the model to acknowledge uncertainty, prefer the most recent assessment, and flag the case for review.
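
When the model cannot be relied on to surface the disagreement itself, escalate it. One option is to tighten the Step 7 routing rule for unresolved contradictions so those cases always reach a physician; a sketch of the rule (the priority and reviewer_type values here are illustrative choices, not requirements):

# Routing rule to escalate unresolved contradictions to physician review.
# Same format as the routing_rules in Step 7; add it to that list.
contradiction_escalation_rule = {
    "condition": "contradiction_unresolved",
    "action": "route_to_review",
    "priority": "high",
    "reviewer_type": "physician"
}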

Context Window Truncation

Long documents get truncated, losing critical information at the end of notes such as “follow up in 1 week.” Mitigation: use chunking strategies that preserve section integrity and prioritize actionable content.
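
Section-aware chunking can be sketched in plain Python: split notes on section headers rather than at a fixed character count so actionable sections such as FOLLOW-UP are never cut mid-section. The header pattern and size limit below are assumptions for illustration:

import re

# Split a clinical note on section headers instead of fixed-size windows,
# so actionable sections (e.g., FOLLOW-UP) are never truncated mid-section.
# The header pattern and max_chars value are assumptions for illustration.
SECTION_HEADER = re.compile(r"^(?=[A-Z][A-Z /&-]+:)", re.MULTILINE)

def chunk_by_section(note: str, max_chars: int = 2000) -> list:
    sections = [s.strip() for s in SECTION_HEADER.split(note) if s.strip()]
    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = section
        else:
            current = f"{current}\n\n{section}".strip()
    if current:
        chunks.append(current)
    return chunks

note = "HOSPITAL COURSE:\nDiuresed with IV furosemide...\n\nFOLLOW-UP:\nCardiology clinic in 1 week."
print(chunk_by_section(note, max_chars=60))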

Next Steps