Overview

If you’ve trained custom models for clinical NLP tasks like named entity recognition (NER), medical coding, or text classification, this guide shows you how to set up Rubric to evaluate model performance with healthcare-specific metrics.

Named Entity Recognition

Extracting symptoms, medications, and diagnoses from text

Medical Coding

ICD-10, CPT, SNOMED CT code assignment

Text Classification

Urgency, specialty routing, sentiment analysis

Relation Extraction

Drug-condition, symptom-diagnosis relationships

NER Model Evaluation

For entity extraction models, log both the extracted entities and their spans:
from rubric import Rubric

client = Rubric(api_key="gr_live_xxx")

# Example clinical note
clinical_text = """
Patient is a 58-year-old male presenting with chest pain radiating to 
left arm for 2 hours. History of hypertension and type 2 diabetes. 
Currently taking metformin 500mg BID and lisinopril 10mg daily. 
Denies shortness of breath. BP 158/94, HR 88.
"""

# Your model's predictions
predicted_entities = [
    {"text": "chest pain", "label": "SYMPTOM", "start": 47, "end": 57},
    {"text": "radiating to left arm", "label": "SYMPTOM_MODIFIER", "start": 58, "end": 79},
    {"text": "2 hours", "label": "DURATION", "start": 84, "end": 91},
    {"text": "hypertension", "label": "CONDITION", "start": 104, "end": 116},
    {"text": "type 2 diabetes", "label": "CONDITION", "start": 121, "end": 136},
    {"text": "metformin", "label": "MEDICATION", "start": 156, "end": 165},
    {"text": "500mg", "label": "DOSAGE", "start": 166, "end": 171},
    {"text": "BID", "label": "FREQUENCY", "start": 172, "end": 175},
    {"text": "lisinopril", "label": "MEDICATION", "start": 180, "end": 190},
    {"text": "10mg", "label": "DOSAGE", "start": 191, "end": 195},
    {"text": "shortness of breath", "label": "SYMPTOM", "start": 217, "end": 236, "negated": True},
    {"text": "158/94", "label": "VITAL_BP", "start": 241, "end": 247},
    {"text": "88", "label": "VITAL_HR", "start": 252, "end": 254},
]

# Ground truth annotations (for evaluation)
ground_truth_entities = [
    {"text": "chest pain", "label": "SYMPTOM", "start": 47, "end": 57},
    {"text": "radiating to left arm", "label": "SYMPTOM_MODIFIER", "start": 58, "end": 79},
    {"text": "2 hours", "label": "DURATION", "start": 84, "end": 91},
    {"text": "hypertension", "label": "CONDITION", "start": 104, "end": 116},
    {"text": "type 2 diabetes", "label": "CONDITION", "start": 121, "end": 136},
    {"text": "metformin 500mg BID", "label": "MEDICATION_FULL", "start": 156, "end": 175},
    {"text": "lisinopril 10mg daily", "label": "MEDICATION_FULL", "start": 180, "end": 201},
    {"text": "shortness of breath", "label": "SYMPTOM", "start": 217, "end": 236, "negated": True},
    {"text": "BP 158/94", "label": "VITAL", "start": 238, "end": 247},
    {"text": "HR 88", "label": "VITAL", "start": 249, "end": 254},
]

# Log to Rubric
client.log(
    project="clinical-ner",
    
    input={
        "text": clinical_text,
        "document_type": "emergency_note",
        "source": "ehr_epic"
    },
    
    output={
        "entities": predicted_entities,
        "model": "clinical-bert-ner-v3",
        "processing_time_ms": 45
    },
    
    expected={
        "entities": ground_truth_entities
    },
    
    metadata={
        "model_version": "v3.2.1",
        "document_id": "doc_abc123",
        "annotator_id": "expert_001"
    }
)
Rubric supports exact, relaxed, and overlap span matching. Configure a boundary tolerance so that span variations that don’t change clinical meaning still count as matches.
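To build intuition for how relaxed matching behaves, you can approximate it locally by comparing entity boundaries within a character tolerance. The sketch below is an illustration only, not Rubric’s internal implementation; the helper names are hypothetical, and it reuses predicted_entities and ground_truth_entities from the example above:
# Illustrative sketch of relaxed span matching with a character tolerance.
# Not Rubric's internal logic; the helper names here are hypothetical.

def spans_match(pred, gold, tolerance=2, require_label_match=True):
    """True if a predicted entity matches a gold entity within the tolerance."""
    if require_label_match and pred["label"] != gold["label"]:
        return False
    return (abs(pred["start"] - gold["start"]) <= tolerance
            and abs(pred["end"] - gold["end"]) <= tolerance)

def relaxed_matches(predicted, gold, tolerance=2):
    """Greedily pair predictions with gold entities under relaxed matching."""
    unmatched_gold = list(gold)
    matches = []
    for p in predicted:
        for g in unmatched_gold:
            if spans_match(p, g, tolerance):
                matches.append((p, g))
                unmatched_gold.remove(g)
                break
    return matches

matches = relaxed_matches(predicted_entities, ground_truth_entities)
print(f"relaxed precision: {len(matches) / len(predicted_entities):.2f}")
print(f"relaxed recall:    {len(matches) / len(ground_truth_entities):.2f}")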

NER Evaluator Configuration

Configure evaluators specifically for entity extraction tasks:
evaluation = client.evaluations.create(
    name="Clinical NER - Model v3.2",
    project="clinical-ner",
    dataset="ds_annotated_notes",
    
    evaluators=[
        # Entity-level metrics
        {
            "type": "ner_accuracy",
            "config": {
                # Matching strategy
                "span_matching": "relaxed",  # or "exact", "overlap"
                "span_tolerance": 2,  # characters of boundary flexibility
                
                # What counts as a match
                "require_label_match": True,
                "case_sensitive": False,
                
                # Entity-type specific weights (for weighted F1)
                "entity_weights": {
                    "SYMPTOM": 2.0,      # High weight - clinical importance
                    "MEDICATION": 2.0,
                    "CONDITION": 1.5,
                    "VITAL": 1.0,
                    "DURATION": 0.5
                },
                
                # Critical entities that must not be missed
                "critical_entities": ["SYMPTOM", "MEDICATION", "CONDITION"]
            }
        },
        
        # Negation detection accuracy
        {
            "type": "negation_accuracy",
            "config": {
                "check_negation_scope": True,
                "negation_cues": ["denies", "no", "negative", "without", "absent"]
            }
        },
        
        # Clinical coherence - do extracted entities make sense together?
        {
            "type": "clinical_coherence",
            "config": {
                "check_medication_condition_alignment": True,
                "check_vital_ranges": True,
                "flag_contradictions": True
            }
        }
    ]
)
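To see how the entity_weights above influence an overall score, here is a rough, standalone illustration of an entity-weighted F1 computed from per-label true positive, false positive, and false negative counts. The counts are made up for illustration, and the formula is a simplification rather than Rubric’s exact scoring:
# Rough illustration of how per-entity weights fold into a weighted F1.
# The per-label (tp, fp, fn) counts below are made up for illustration;
# this is a simplification, not necessarily Rubric's exact formula.

entity_weights = {"SYMPTOM": 2.0, "MEDICATION": 2.0, "CONDITION": 1.5,
                  "VITAL": 1.0, "DURATION": 0.5}

counts = {
    "SYMPTOM":    (180, 12, 9),
    "MEDICATION": (240, 20, 15),
    "CONDITION":  (150, 18, 22),
    "VITAL":      (90, 5, 4),
    "DURATION":   (60, 10, 12),
}

def weighted_f1(counts, weights, default_weight=1.0):
    """Average per-label F1, weighted by clinical importance."""
    total, total_weight = 0.0, 0.0
    for label, (tp, fp, fn) in counts.items():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        w = weights.get(label, default_weight)
        total += w * f1
        total_weight += w
    return total / total_weight if total_weight else 0.0

print(f"entity-weighted F1: {weighted_f1(counts, entity_weights):.3f}")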

Medical Coding Evaluation

For ICD-10, CPT, or SNOMED coding models, evaluate code accuracy with hierarchy-aware metrics:
# Log medical coding predictions
client.log(
    project="icd-coder",
    
    input={
        "text": """Assessment: Type 2 diabetes mellitus with diabetic 
chronic kidney disease, stage 3. Hypertensive heart disease with 
heart failure. Obesity.""",
        "document_type": "assessment_plan"
    },
    
    output={
        "predicted_codes": [
            {"code": "E11.65", "description": "Type 2 DM with hyperglycemia", "confidence": 0.89},
            {"code": "E11.22", "description": "Type 2 DM with CKD", "confidence": 0.92},
            {"code": "N18.3", "description": "CKD stage 3", "confidence": 0.88},
            {"code": "I11.0", "description": "Hypertensive heart disease with HF", "confidence": 0.94},
            {"code": "E66.9", "description": "Obesity, unspecified", "confidence": 0.91}
        ],
        "coding_system": "ICD-10-CM",
        "model": "icd-coder-v2"
    },
    
    expected={
        "codes": ["E11.22", "N18.3", "I11.0", "E66.9"]
    }
)

# Coding evaluators
evaluation = client.evaluations.create(
    name="ICD-10 Coder Evaluation",
    project="icd-coder",
    dataset="ds_coded_notes",
    
    evaluators=[
        {
            "type": "coding_accuracy",
            "config": {
                "coding_system": "ICD-10-CM",
                
                # Hierarchy-aware matching
                "hierarchy_matching": True,
                "hierarchy_depth_tolerance": 1,  # Allow parent/child matches
                
                # Specificity scoring
                "reward_specificity": True,
                "penalize_over_coding": True,
                
                # Critical codes that must not be missed
                "critical_codes": ["I21.*", "I63.*", "J96.*"],  # MI, stroke, resp failure
                
                # Code category weights
                "category_weights": {
                    "principal_diagnosis": 3.0,
                    "secondary_diagnosis": 1.0,
                    "procedure": 2.0
                }
            }
        }
    ]
)
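The intuition behind hierarchy-aware matching is that ICD-10-CM codes are nested: E11.2 is the parent of E11.22, so a prediction one level away can still earn partial credit. A crude way to approximate depth-tolerant matching is to compare code prefixes, and wildcard patterns like "I21.*" can be checked with fnmatch. The sketch below is an illustration of the idea, not the evaluator’s actual algorithm:
# Crude illustration of hierarchy-aware ICD-10 matching via prefix comparison,
# plus a wildcard check for critical codes. Not Rubric's actual algorithm.
from fnmatch import fnmatch

def hierarchy_match(pred_code, gold_code, depth_tolerance=1):
    """Match if one code is an ancestor of the other within the depth tolerance."""
    p, g = pred_code.replace(".", ""), gold_code.replace(".", "")
    if p == g:
        return True
    shorter, longer = sorted([p, g], key=len)
    return longer.startswith(shorter) and (len(longer) - len(shorter)) <= depth_tolerance

def missed_critical_codes(gold_codes, pred_codes, critical_patterns):
    """Gold codes matching a critical pattern that no prediction covers."""
    missed = []
    for gold in gold_codes:
        if any(fnmatch(gold, pattern) for pattern in critical_patterns):
            if not any(hierarchy_match(pred, gold) for pred in pred_codes):
                missed.append(gold)
    return missed

predicted = ["E11.65", "E11.22", "N18.3", "I11.0", "E66.9"]
expected = ["E11.22", "N18.3", "I11.0", "E66.9"]
print(missed_critical_codes(expected, predicted, ["I21.*", "I63.*", "J96.*"]))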

Text Classification Evaluation

For models that classify clinical text into categories:
# Log classification predictions
client.log(
    project="triage-classifier",
    
    input={
        "text": "Sudden onset worst headache of my life with neck stiffness and photophobia.",
        "document_type": "chief_complaint"
    },
    
    output={
        "predicted_class": "emergency",
        "probabilities": {
            "emergency": 0.94,
            "urgent": 0.04,
            "semi_urgent": 0.01,
            "routine": 0.01
        },
        "model": "triage-bert-v2"
    },
    
    expected={
        "class": "emergency",
        "rationale": "Thunderclap headache with meningeal signs"
    }
)

# Classification evaluators
evaluation = client.evaluations.create(
    name="Triage Classifier Eval",
    project="triage-classifier",
    dataset="ds_labeled_triage",
    
    evaluators=[
        {
            "type": "classification_accuracy",
            "config": {
                "classes": ["emergency", "urgent", "semi-urgent", "routine"],
                
                # Ordinal-aware metrics (adjacent misses less severe)
                "ordinal_classes": True,
                "adjacency_tolerance": 1,
                
                # Asymmetric cost matrix
                "cost_matrix": {
                    "emergency->urgent": 5.0,      # Under-triage: dangerous
                    "emergency->semi-urgent": 10.0,
                    "emergency->routine": 20.0,
                    "urgent->emergency": 0.5,      # Over-triage: acceptable
                    "routine->emergency": 0.2
                },
                
                # Per-class thresholds
                "confidence_thresholds": {
                    "emergency": 0.7,  # Lower threshold = more sensitive
                    "routine": 0.9    # Higher threshold = more specific
                }
            }
        },
        
        # Calibration check - are confidence scores reliable?
        {
            "type": "calibration",
            "config": {
                "n_bins": 10,
                "metric": "expected_calibration_error"
            }
        }
    ]
)
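To make the asymmetric cost matrix concrete, the sketch below computes a cost-weighted error over a small batch of labels. The keys read as "true->predicted", consistent with the under-/over-triage comments above; the default cost of 1.0 for unlisted pairs and the sample labels are assumptions for illustration:
# Small illustration of a cost-weighted classification error.
# Keys are "true->predicted"; the default cost and sample labels are assumptions.

cost_matrix = {
    "emergency->urgent": 5.0,
    "emergency->semi-urgent": 10.0,
    "emergency->routine": 20.0,
    "urgent->emergency": 0.5,
    "routine->emergency": 0.2,
}

def weighted_cost(true_labels, predicted_labels, costs, default_cost=1.0):
    """Average misclassification cost; correct predictions cost nothing."""
    total = 0.0
    for t, p in zip(true_labels, predicted_labels):
        if t != p:
            total += costs.get(f"{t}->{p}", default_cost)
    return total / len(true_labels)

y_true = ["emergency", "urgent", "routine", "emergency"]
y_pred = ["emergency", "emergency", "routine", "urgent"]
print(f"cost-weighted error: {weighted_cost(y_true, y_pred, cost_matrix):.2f}")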

Model Version Comparison

Compare performance across model versions to catch regressions:
# Run evaluation on same dataset with different model versions
eval_v2 = client.evaluations.create(
    name="NER Model v2.0 Baseline",
    project="clinical-ner",
    dataset="ds_golden_test_set",
    evaluators=[...],
    metadata={"model_version": "v2.0"}
)

eval_v3 = client.evaluations.create(
    name="NER Model v3.0 Candidate",
    project="clinical-ner", 
    dataset="ds_golden_test_set",  # Same test set
    evaluators=[...],
    metadata={"model_version": "v3.0"}
)

# Compare results
comparison = client.experiments.compare(
    evaluations=[eval_v2.id, eval_v3.id],
    
    metrics=[
        "ner_f1_macro",
        "ner_f1_per_class",
        "critical_entity_recall",
        "negation_accuracy"
    ],
    
    # Statistical significance testing
    significance_test="mcnemar",
    confidence_level=0.95
)

print(f"v3.0 vs v2.0:")
print(f"  F1 Macro: {comparison.v3.f1_macro:.3f} vs {comparison.v2.f1_macro:.3f}")
print(f"  Improvement: {comparison.delta.f1_macro:+.3f}")
print(f"  Significant: {comparison.is_significant}")
Set up automated alerts when new model versions show degraded performance on critical entity types. A 2% drop in SYMPTOM recall could mean missed diagnoses.
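One way to wire that up is a small regression gate that compares per-class recall between the baseline and candidate runs and fails the pipeline when a critical entity type drops. The sketch below operates on plain metric dictionaries with made-up values; how you fetch the per-class recalls and where the alert goes is up to your stack:
# Sketch of a per-class recall regression gate between two model versions.
# The recall values below are made up; plug in your own evaluation results.

CRITICAL_TYPES = ["SYMPTOM", "MEDICATION", "CONDITION"]
MAX_RECALL_DROP = 0.02  # mirrors the 2% threshold mentioned above

def recall_regressions(baseline, candidate, entity_types, max_drop):
    """Critical entity types whose recall dropped by more than max_drop."""
    return [
        t for t in entity_types
        if baseline.get(t, 0.0) - candidate.get(t, 0.0) > max_drop
    ]

baseline_recall = {"SYMPTOM": 0.96, "MEDICATION": 0.93, "CONDITION": 0.91}
candidate_recall = {"SYMPTOM": 0.92, "MEDICATION": 0.94, "CONDITION": 0.91}

regressions = recall_regressions(baseline_recall, candidate_recall,
                                 CRITICAL_TYPES, MAX_RECALL_DROP)
if regressions:
    raise SystemExit(f"Recall regression on critical entity types: {regressions}")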

Metrics Reference

Key metrics available for clinical NLP evaluation:
Metric | Task | Description
ner_f1_macro | NER | Macro-averaged F1 across all entity types
ner_f1_weighted | NER | F1 weighted by entity importance
ner_precision | NER | Precision (avoiding false extractions)
ner_recall | NER | Recall (catching all entities)
span_exact_match | NER | Exact boundary matching rate
negation_accuracy | NER | Correct negation detection rate
coding_f1 | Coding | F1 for code prediction
coding_hierarchy_score | Coding | Hierarchy-aware accuracy
classification_accuracy | Classification | Overall accuracy
classification_weighted_cost | Classification | Cost-weighted error rate
calibration_ece | Classification | Expected calibration error
auc_roc | Classification | Area under ROC curve
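
For reference, expected calibration error bins predictions by confidence and averages the gap between each bin’s mean confidence and its accuracy, weighted by bin size. A minimal sketch of that standard formulation is below; Rubric’s calibration evaluator may differ in binning details:
# Minimal sketch of expected calibration error (ECE) with equal-width bins.
# Standard formulation; Rubric's calibration evaluator may bin differently.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between bin confidence and bin accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

conf = [0.94, 0.72, 0.65, 0.88, 0.55]
hit = [1, 1, 0, 1, 0]  # 1 = predicted class was correct
print(f"ECE: {expected_calibration_error(conf, hit):.3f}")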

CI/CD Integration

Integrate Rubric into your model training pipeline:
# .github/workflows/model-eval.yml
name: Clinical NLP Model Evaluation

on:
  push:
    paths:
      - 'models/**'
      - 'training/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Run model inference on test set
        run: python scripts/run_inference.py --model $GITHUB_SHA
      
      - name: Upload predictions to Rubric
        env:
          RUBRIC_API_KEY: ${{ secrets.RUBRIC_API_KEY }}
        run: |
          python scripts/upload_predictions.py \
            --project clinical-ner \
            --model-version $GITHUB_SHA \
            --predictions outputs/predictions.json
      
      - name: Run evaluation
        run: |
          python -c "
          from rubric import Rubric
          client = Rubric()
          
          evaluation = client.evaluations.create(
              name='CI Eval - ${GITHUB_SHA:0:7}',
              project='clinical-ner',
              dataset='ds_ci_test_set',
              evaluators=[...]
          )
          
          # Wait for completion
          result = client.evaluations.wait(evaluation.id)
          
          # Fail if below threshold
          if result.metrics['critical_entity_recall'] < 0.95:
              raise Exception('Critical entity recall below 95%')
          "
      
      - name: Post results to PR
        uses: actions/github-script@v6
        with:
          script: |
            // Post evaluation summary as PR comment
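
The scripts/upload_predictions.py step above is not shown in this guide. A minimal sketch of what it might look like, reusing the client.log call from earlier sections, is below; the argument names and the layout of predictions.json are assumptions:
# scripts/upload_predictions.py -- hypothetical sketch of the upload step
# referenced in the workflow above. The CLI flags and the JSON layout of
# predictions.json are assumptions, not a prescribed format.
import argparse
import json
import os

from rubric import Rubric

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--model-version", required=True)
    parser.add_argument("--predictions", required=True)
    args = parser.parse_args()

    client = Rubric(api_key=os.environ["RUBRIC_API_KEY"])

    with open(args.predictions) as f:
        # Assumed format: a list of {"input": ..., "output": ..., "expected": ...}
        records = json.load(f)

    for record in records:
        client.log(
            project=args.project,
            input=record["input"],
            output=record["output"],
            expected=record.get("expected"),
            metadata={"model_version": args.model_version},
        )

if __name__ == "__main__":
    main()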

Next Steps