Overview

If you’ve trained custom models for clinical NLP tasks like named entity recognition (NER), medical coding, or text classification, this guide shows you how to set up Rubric to evaluate model performance with healthcare-specific metrics.

Named Entity Recognition

Extracting symptoms, medications, and diagnoses from text

Medical Coding

ICD-10, CPT, SNOMED CT code assignment

Text Classification

Urgency, specialty routing, sentiment analysis

Relation Extraction

Drug-condition, symptom-diagnosis relationships

NER Model Evaluation

For entity extraction models, log both the extracted entities and their spans:
from rubric import Rubric

client = Rubric(api_key="gr_live_xxx")

# Example clinical note
clinical_text = """
Patient is a 58-year-old male presenting with chest pain radiating to 
left arm for 2 hours. History of hypertension and type 2 diabetes. 
Currently taking metformin 500mg BID and lisinopril 10mg daily. 
Denies shortness of breath. BP 158/94, HR 88.
"""

# Your model's predictions
predicted_entities = [
    {"text": "chest pain", "label": "SYMPTOM", "start": 47, "end": 57},
    {"text": "radiating to left arm", "label": "SYMPTOM_MODIFIER", "start": 58, "end": 79},
    {"text": "2 hours", "label": "DURATION", "start": 84, "end": 91},
    {"text": "hypertension", "label": "CONDITION", "start": 104, "end": 116},
    {"text": "type 2 diabetes", "label": "CONDITION", "start": 121, "end": 136},
    {"text": "metformin", "label": "MEDICATION", "start": 156, "end": 165},
    {"text": "500mg", "label": "DOSAGE", "start": 166, "end": 171},
    {"text": "BID", "label": "FREQUENCY", "start": 172, "end": 175},
    {"text": "lisinopril", "label": "MEDICATION", "start": 180, "end": 190},
    {"text": "10mg", "label": "DOSAGE", "start": 191, "end": 195},
    {"text": "shortness of breath", "label": "SYMPTOM", "start": 217, "end": 236, "negated": True},
    {"text": "158/94", "label": "VITAL_BP", "start": 241, "end": 247},
    {"text": "88", "label": "VITAL_HR", "start": 252, "end": 254},
]

# Ground truth annotations (for evaluation)
ground_truth_entities = [
    {"text": "chest pain", "label": "SYMPTOM", "start": 47, "end": 57},
    {"text": "radiating to left arm", "label": "SYMPTOM_MODIFIER", "start": 58, "end": 79},
    {"text": "2 hours", "label": "DURATION", "start": 84, "end": 91},
    {"text": "hypertension", "label": "CONDITION", "start": 104, "end": 116},
    {"text": "type 2 diabetes", "label": "CONDITION", "start": 121, "end": 136},
    {"text": "metformin 500mg BID", "label": "MEDICATION_FULL", "start": 156, "end": 175},
    {"text": "lisinopril 10mg daily", "label": "MEDICATION_FULL", "start": 180, "end": 201},
    {"text": "shortness of breath", "label": "SYMPTOM", "start": 217, "end": 236, "negated": True},
    {"text": "BP 158/94", "label": "VITAL", "start": 238, "end": 247},
    {"text": "HR 88", "label": "VITAL", "start": 249, "end": 254},
]

# Log to Rubric
client.log(
    project="clinical-ner",
    
    input={
        "text": clinical_text,
        "document_type": "emergency_note",
        "source": "ehr_epic"
    },
    
    output={
        "entities": predicted_entities,
        "model": "clinical-bert-ner-v3",
        "processing_time_ms": 45
    },
    
    expected={
        "entities": ground_truth_entities
    },
    
    metadata={
        "model_version": "v3.2.1",
        "document_id": "doc_abc123",
        "annotator_id": "expert_001"
    }
)
Rubric supports exact, relaxed, and overlap span matching. Configure a boundary tolerance so that span variations that don’t change clinical meaning still count as matches.
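To build intuition for how relaxed matching behaves, you can approximate it locally by comparing entity boundaries within a character tolerance. The sketch below is an illustration only, not Rubric’s internal implementation; the helper names are hypothetical, and it reuses predicted_entities and ground_truth_entities from the example above:
# Illustrative sketch of relaxed span matching with a character tolerance.
# Not Rubric's internal logic; the helper names here are hypothetical.

def spans_match(pred, gold, tolerance=2, require_label_match=True):
    """True if a predicted entity matches a gold entity within the tolerance."""
    if require_label_match and pred["label"] != gold["label"]:
        return False
    return (abs(pred["start"] - gold["start"]) <= tolerance
            and abs(pred["end"] - gold["end"]) <= tolerance)

def relaxed_matches(predicted, gold, tolerance=2):
    """Greedily pair predictions with gold entities under relaxed matching."""
    unmatched_gold = list(gold)
    matches = []
    for p in predicted:
        for g in unmatched_gold:
            if spans_match(p, g, tolerance):
                matches.append((p, g))
                unmatched_gold.remove(g)
                break
    return matches

matches = relaxed_matches(predicted_entities, ground_truth_entities)
print(f"relaxed precision: {len(matches) / len(predicted_entities):.2f}")
print(f"relaxed recall:    {len(matches) / len(ground_truth_entities):.2f}")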

NER Evaluator Configuration

Configure evaluators specifically for entity extraction tasks:
evaluation = client.evaluations.create(
    name="Clinical NER - Model v3.2",
    project="clinical-ner",
    dataset="ds_annotated_notes",
    
    evaluators=[
        # Entity-level metrics
        {
            "type": "ner_accuracy",
            "config": {
                # Matching strategy
                "span_matching": "relaxed",  # or "exact", "overlap"
                "span_tolerance": 2,  # characters of boundary flexibility
                
                # What counts as a match
                "require_label_match": True,
                "case_sensitive": False,
                
                # Entity-type specific weights (for weighted F1)
                "entity_weights": {
                    "SYMPTOM": 2.0,      # High weight - clinical importance
                    "MEDICATION": 2.0,
                    "CONDITION": 1.5,
                    "VITAL": 1.0,
                    "DURATION": 0.5
                },
                
                # Critical entities that must not be missed
                "critical_entities": ["SYMPTOM", "MEDICATION", "CONDITION"]
            }
        },
        
        # Negation detection accuracy
        {
            "type": "negation_accuracy",
            "config": {
                "check_negation_scope": True,
                "negation_cues": ["denies", "no", "negative", "without", "absent"]
            }
        },
        
        # Clinical coherence - do extracted entities make sense together?
        {
            "type": "clinical_coherence",
            "config": {
                "check_medication_condition_alignment": True,
                "check_vital_ranges": True,
                "flag_contradictions": True
            }
        }
    ]
)
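To see how the entity_weights above influence an overall score, here is a rough, standalone illustration of an entity-weighted F1 computed from per-label true positive, false positive, and false negative counts. The counts are made up for illustration, and the formula is a simplification rather than Rubric’s exact scoring:
# Rough illustration of how per-entity weights fold into a weighted F1.
# The per-label (tp, fp, fn) counts below are made up for illustration;
# this is a simplification, not necessarily Rubric's exact formula.

entity_weights = {"SYMPTOM": 2.0, "MEDICATION": 2.0, "CONDITION": 1.5,
                  "VITAL": 1.0, "DURATION": 0.5}

counts = {
    "SYMPTOM":    (180, 12, 9),
    "MEDICATION": (240, 20, 15),
    "CONDITION":  (150, 18, 22),
    "VITAL":      (90, 5, 4),
    "DURATION":   (60, 10, 12),
}

def weighted_f1(counts, weights, default_weight=1.0):
    """Average per-label F1, weighted by clinical importance."""
    total, total_weight = 0.0, 0.0
    for label, (tp, fp, fn) in counts.items():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        w = weights.get(label, default_weight)
        total += w * f1
        total_weight += w
    return total / total_weight if total_weight else 0.0

print(f"entity-weighted F1: {weighted_f1(counts, entity_weights):.3f}")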

Medical Coding Evaluation

For ICD-10, CPT, or SNOMED coding models, evaluate code accuracy with hierarchy-aware metrics:
# Log medical coding predictions
client.log(
    project="icd-coder",
    
    input={
        "text": """Assessment: Type 2 diabetes mellitus with diabetic 
chronic kidney disease, stage 3. Hypertensive heart disease with 
heart failure. Obesity.""",
        "document_type": "assessment_plan"
    },
    
    output={
        "predicted_codes": [
            {"code": "E11.65", "description": "Type 2 DM with hyperglycemia", "confidence": 0.89},
            {"code": "E11.22", "description": "Type 2 DM with CKD", "confidence": 0.92},
            {"code": "N18.3", "description": "CKD stage 3", "confidence": 0.88},
            {"code": "I11.0", "description": "Hypertensive heart disease with HF", "confidence": 0.94},
            {"code": "E66.9", "description": "Obesity, unspecified", "confidence": 0.91}
        ],
        "coding_system": "ICD-10-CM",
        "model": "icd-coder-v2"
    },
    
    expected={
        "codes": ["E11.22", "N18.3", "I11.0", "E66.9"]
    }
)

# Coding evaluators
evaluation = client.evaluations.create(
    name="ICD-10 Coder Evaluation",
    project="icd-coder",
    dataset="ds_coded_notes",
    
    evaluators=[
        {
            "type": "coding_accuracy",
            "config": {
                "coding_system": "ICD-10-CM",
                
                # Hierarchy-aware matching
                "hierarchy_matching": True,
                "hierarchy_depth_tolerance": 1,  # Allow parent/child matches
                
                # Specificity scoring
                "reward_specificity": True,
                "penalize_over_coding": True,
                
                # Critical codes that must not be missed
                "critical_codes": ["I21.*", "I63.*", "J96.*"],  # MI, stroke, resp failure
                
                # Code category weights
                "category_weights": {
                    "principal_diagnosis": 3.0,
                    "secondary_diagnosis": 1.0,
                    "procedure": 2.0
                }
            }
        }
    ]
)
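The intuition behind hierarchy-aware matching is that ICD-10-CM codes are nested: E11.2 is the parent of E11.22, so a prediction one level away can still earn partial credit. A crude way to approximate depth-tolerant matching is to compare code prefixes, and wildcard patterns like "I21.*" can be checked with fnmatch. The sketch below is an illustration of the idea, not the evaluator’s actual algorithm:
# Crude illustration of hierarchy-aware ICD-10 matching via prefix comparison,
# plus a wildcard check for critical codes. Not Rubric's actual algorithm.
from fnmatch import fnmatch

def hierarchy_match(pred_code, gold_code, depth_tolerance=1):
    """Match if one code is an ancestor of the other within the depth tolerance."""
    p, g = pred_code.replace(".", ""), gold_code.replace(".", "")
    if p == g:
        return True
    shorter, longer = sorted([p, g], key=len)
    return longer.startswith(shorter) and (len(longer) - len(shorter)) <= depth_tolerance

def missed_critical_codes(gold_codes, pred_codes, critical_patterns):
    """Gold codes matching a critical pattern that no prediction covers."""
    missed = []
    for gold in gold_codes:
        if any(fnmatch(gold, pattern) for pattern in critical_patterns):
            if not any(hierarchy_match(pred, gold) for pred in pred_codes):
                missed.append(gold)
    return missed

predicted = ["E11.65", "E11.22", "N18.3", "I11.0", "E66.9"]
expected = ["E11.22", "N18.3", "I11.0", "E66.9"]
print(missed_critical_codes(expected, predicted, ["I21.*", "I63.*", "J96.*"]))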

Text Classification Evaluation

For models that classify clinical text into categories:
# Log classification predictions
client.log(
    project="triage-classifier",
    
    input={
        "text": "Sudden onset worst headache of my life with neck stiffness and photophobia.",
        "document_type": "chief_complaint"
    },
    
    output={
        "predicted_class": "emergency",
        "probabilities": {
            "emergency": 0.94,
            "urgent": 0.04,
            "semi_urgent": 0.01,
            "routine": 0.01
        },
        "model": "triage-bert-v2"
    },
    
    expected={
        "class": "emergency",
        "rationale": "Thunderclap headache with meningeal signs"
    }
)

# Classification evaluators
evaluation = client.evaluations.create(
    name="Triage Classifier Eval",
    project="triage-classifier",
    dataset="ds_labeled_triage",
    
    evaluators=[
        {
            "type": "classification_accuracy",
            "config": {
                "classes": ["emergency", "urgent", "semi-urgent", "routine"],
                
                # Ordinal-aware metrics (adjacent misses less severe)
                "ordinal_classes": True,
                "adjacency_tolerance": 1,
                
                # Asymmetric cost matrix
                "cost_matrix": {
                    "emergency->urgent": 5.0,      # Under-triage: dangerous
                    "emergency->semi-urgent": 10.0,
                    "emergency->routine": 20.0,
                    "urgent->emergency": 0.5,      # Over-triage: acceptable
                    "routine->emergency": 0.2
                },
                
                # Per-class thresholds
                "confidence_thresholds": {
                    "emergency": 0.7,  # Lower threshold = more sensitive
                    "routine": 0.9    # Higher threshold = more specific
                }
            }
        },
        
        # Calibration check - are confidence scores reliable?
        {
            "type": "calibration",
            "config": {
                "n_bins": 10,
                "metric": "expected_calibration_error"
            }
        }
    ]
)
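To make the asymmetric cost matrix concrete, the sketch below computes a cost-weighted error over a small batch of labels. The keys read as "true->predicted", consistent with the under-/over-triage comments above; the default cost of 1.0 for unlisted pairs and the sample labels are assumptions for illustration:
# Small illustration of a cost-weighted classification error.
# Keys are "true->predicted"; the default cost and sample labels are assumptions.

cost_matrix = {
    "emergency->urgent": 5.0,
    "emergency->semi-urgent": 10.0,
    "emergency->routine": 20.0,
    "urgent->emergency": 0.5,
    "routine->emergency": 0.2,
}

def weighted_cost(true_labels, predicted_labels, costs, default_cost=1.0):
    """Average misclassification cost; correct predictions cost nothing."""
    total = 0.0
    for t, p in zip(true_labels, predicted_labels):
        if t != p:
            total += costs.get(f"{t}->{p}", default_cost)
    return total / len(true_labels)

y_true = ["emergency", "urgent", "routine", "emergency"]
y_pred = ["emergency", "emergency", "routine", "urgent"]
print(f"cost-weighted error: {weighted_cost(y_true, y_pred, cost_matrix):.2f}")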

Model Version Comparison

Compare performance across model versions to catch regressions:
# Run evaluation on same dataset with different model versions
eval_v2 = client.evaluations.create(
    name="NER Model v2.0 Baseline",
    project="clinical-ner",
    dataset="ds_golden_test_set",
    evaluators=[...],
    metadata={"model_version": "v2.0"}
)

eval_v3 = client.evaluations.create(
    name="NER Model v3.0 Candidate",
    project="clinical-ner", 
    dataset="ds_golden_test_set",  # Same test set
    evaluators=[...],
    metadata={"model_version": "v3.0"}
)

# Compare results
comparison = client.experiments.compare(
    evaluations=[eval_v2.id, eval_v3.id],
    
    metrics=[
        "ner_f1_macro",
        "ner_f1_per_class",
        "critical_entity_recall",
        "negation_accuracy"
    ],
    
    # Statistical significance testing
    significance_test="mcnemar",
    confidence_level=0.95
)

print(f"v3.0 vs v2.0:")
print(f"  F1 Macro: {comparison.v3.f1_macro:.3f} vs {comparison.v2.f1_macro:.3f}")
print(f"  Improvement: {comparison.delta.f1_macro:+.3f}")
print(f"  Significant: {comparison.is_significant}")
Set up automated alerts when new model versions show degraded performance on critical entity types. A 2% drop in SYMPTOM recall could mean missed diagnoses.
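One way to wire that up is a small regression gate that compares per-class recall between the baseline and candidate runs and fails the pipeline when a critical entity type drops. The sketch below operates on plain metric dictionaries with made-up values; how you fetch the per-class recalls and where the alert goes is up to your stack:
# Sketch of a per-class recall regression gate between two model versions.
# The recall values below are made up; plug in your own evaluation results.

CRITICAL_TYPES = ["SYMPTOM", "MEDICATION", "CONDITION"]
MAX_RECALL_DROP = 0.02  # mirrors the 2% threshold mentioned above

def recall_regressions(baseline, candidate, entity_types, max_drop):
    """Critical entity types whose recall dropped by more than max_drop."""
    return [
        t for t in entity_types
        if baseline.get(t, 0.0) - candidate.get(t, 0.0) > max_drop
    ]

baseline_recall = {"SYMPTOM": 0.96, "MEDICATION": 0.93, "CONDITION": 0.91}
candidate_recall = {"SYMPTOM": 0.92, "MEDICATION": 0.94, "CONDITION": 0.91}

regressions = recall_regressions(baseline_recall, candidate_recall,
                                 CRITICAL_TYPES, MAX_RECALL_DROP)
if regressions:
    raise SystemExit(f"Recall regression on critical entity types: {regressions}")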

Metrics Reference

Key metrics available for clinical NLP evaluation:
Metric | Task | Description
ner_f1_macro | NER | Macro-averaged F1 across all entity types
ner_f1_weighted | NER | F1 weighted by entity importance
ner_precision | NER | Precision (avoiding false extractions)
ner_recall | NER | Recall (catching all entities)
span_exact_match | NER | Exact boundary matching rate
negation_accuracy | NER | Correct negation detection rate
coding_f1 | Coding | F1 for code prediction
coding_hierarchy_score | Coding | Hierarchy-aware accuracy
classification_accuracy | Classification | Overall accuracy
classification_weighted_cost | Classification | Cost-weighted error rate
calibration_ece | Classification | Expected calibration error
auc_roc | Classification | Area under ROC curve
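
For reference, expected calibration error bins predictions by confidence and averages the gap between each bin’s mean confidence and its accuracy, weighted by bin size. A minimal sketch of that standard formulation is below; Rubric’s calibration evaluator may differ in binning details:
# Minimal sketch of expected calibration error (ECE) with equal-width bins.
# Standard formulation; Rubric's calibration evaluator may bin differently.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between bin confidence and bin accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

conf = [0.94, 0.72, 0.65, 0.88, 0.55]
hit = [1, 1, 0, 1, 0]  # 1 = predicted class was correct
print(f"ECE: {expected_calibration_error(conf, hit):.3f}")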

CI/CD Integration

Integrate Rubric into your model training pipeline:
# .github/workflows/model-eval.yml
name: Clinical NLP Model Evaluation

on:
  push:
    paths:
      - 'models/**'
      - 'training/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Run model inference on test set
        run: python scripts/run_inference.py --model $GITHUB_SHA
      
      - name: Upload predictions to Rubric
        env:
          RUBRIC_API_KEY: ${{ secrets.RUBRIC_API_KEY }}
        run: |
          python scripts/upload_predictions.py \
            --project clinical-ner \
            --model-version $GITHUB_SHA \
            --predictions outputs/predictions.json
      
      - name: Run evaluation
        run: |
          python -c "
          from rubric import Rubric
          client = Rubric()
          
          evaluation = client.evaluations.create(
              name='CI Eval - ${GITHUB_SHA:0:7}',
              project='clinical-ner',
              dataset='ds_ci_test_set',
              evaluators=[...]
          )
          
          # Wait for completion
          result = client.evaluations.wait(evaluation.id)
          
          # Fail if below threshold
          if result.metrics['critical_entity_recall'] < 0.95:
              raise Exception('Critical entity recall below 95%')
          "
      
      - name: Post results to PR
        uses: actions/github-script@v6
        with:
          script: |
            // Post evaluation summary as PR comment
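
The scripts/upload_predictions.py step above is not shown in this guide. A minimal sketch of what it might look like, reusing the client.log call from earlier sections, is below; the argument names and the layout of predictions.json are assumptions:
# scripts/upload_predictions.py -- hypothetical sketch of the upload step
# referenced in the workflow above. The CLI flags and the JSON layout of
# predictions.json are assumptions, not a prescribed format.
import argparse
import json
import os

from rubric import Rubric

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--model-version", required=True)
    parser.add_argument("--predictions", required=True)
    args = parser.parse_args()

    client = Rubric(api_key=os.environ["RUBRIC_API_KEY"])

    with open(args.predictions) as f:
        # Assumed format: a list of {"input": ..., "output": ..., "expected": ...}
        records = json.load(f)

    for record in records:
        client.log(
            project=args.project,
            input=record["input"],
            output=record["output"],
            expected=record.get("expected"),
            metadata={"model_version": args.model_version},
        )

if __name__ == "__main__":
    main()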

Next Steps