Overview
If you’ve trained custom models for clinical NLP tasks like named entity recognition (NER), medical coding, or text classification, this guide shows you how to set up Rubric to evaluate model performance with healthcare-specific metrics. The guide covers four task types:

- Named Entity Recognition: extracting symptoms, medications, and diagnoses from text
- Medical Coding: ICD-10, CPT, and SNOMED CT code assignment
- Text Classification: urgency, specialty routing, and sentiment analysis
- Relation Extraction: drug-condition and symptom-diagnosis relationships
NER Model Evaluation
For entity extraction models, log both the extracted entities and their character spans; a logging sketch follows. Rubric supports both exact and relaxed span matching, and you can configure a tolerance for boundary variations that don’t affect clinical meaning.
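The exact call shape depends on your Rubric SDK version; the snippet below is a minimal sketch assuming a hypothetical Python client with a `Client` constructor and a `log_prediction` method (all names are illustrative, not confirmed API):

```python
# Hypothetical Rubric SDK usage; client, method, and field names are illustrative only.
import rubric  # assumed package name

client = rubric.Client(project="clinical-ner-eval")  # assumed constructor

# Log one prediction: each entity carries its label, surface text, character span,
# and a negation flag so span matching and negation_accuracy can be computed later.
client.log_prediction(
    model_version="ner-v2.3",
    input_text="Patient denies chest pain but reports shortness of breath.",
    predicted_entities=[
        {"label": "SYMPTOM", "text": "chest pain", "start": 15, "end": 25, "negated": True},
        {"label": "SYMPTOM", "text": "shortness of breath", "start": 38, "end": 57, "negated": False},
    ],
    gold_entities=[
        {"label": "SYMPTOM", "text": "chest pain", "start": 15, "end": 25, "negated": True},
        {"label": "SYMPTOM", "text": "shortness of breath", "start": 38, "end": 57, "negated": False},
    ],
)
```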
NER Evaluator Configuration
Configure evaluators specifically for entity extraction tasks, as in the sketch below.
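A hedged configuration sketch, assuming a hypothetical `NEREvaluator` class; the option names (span matching mode, boundary tolerance, negation tracking) are illustrative stand-ins for whatever options your Rubric version exposes:

```python
# Hypothetical evaluator configuration; class and parameter names are illustrative.
from rubric import NEREvaluator  # assumed import

evaluator = NEREvaluator(
    entity_types=["SYMPTOM", "MEDICATION", "DIAGNOSIS"],
    span_matching="relaxed",      # "exact" or "relaxed" boundary matching
    boundary_tolerance=2,         # accept span boundaries that differ by up to 2 characters
    track_negation=True,          # include negation_accuracy in the results
    metrics=["ner_f1_macro", "ner_precision", "ner_recall", "span_exact_match"],
)

results = evaluator.run(dataset="clinical-notes-test-v1")  # assumed dataset identifier
print(results["ner_f1_macro"])
```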
Medical Coding Evaluation
For ICD-10, CPT, or SNOMED CT coding models, evaluate code accuracy with hierarchy-aware metrics; see the sketch below.
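Again a sketch with assumed names: a hypothetical `CodingEvaluator` that reports the coding_f1 and coding_hierarchy_score metrics listed in the reference table, giving partial credit when a predicted code falls under the same parent category as the gold code:

```python
# Hypothetical coding evaluator; class and parameter names are illustrative.
from rubric import CodingEvaluator  # assumed import

evaluator = CodingEvaluator(
    code_system="ICD-10",
    metrics=["coding_f1", "coding_hierarchy_score"],
    # Partial credit when prediction and gold share the 3-character ICD-10 category:
    # predicting I21.4 against gold I21.9 scores higher than predicting I50.9.
    hierarchy_partial_credit=0.5,
)

results = evaluator.run(
    predictions=[{"note_id": "n-001", "codes": ["I21.4", "E11.9"]}],
    gold=[{"note_id": "n-001", "codes": ["I21.9", "E11.9"]}],
)
print(results["coding_hierarchy_score"])
```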
Text Classification Evaluation
For models that classify clinical text into categories, configure evaluation as sketched below.
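A hedged sketch, assuming a hypothetical `ClassificationEvaluator`; the asymmetric error-cost map illustrates how a cost-weighted error rate (classification_weighted_cost) can penalize clinically dangerous misclassifications more heavily:

```python
# Hypothetical classification evaluator; class and parameter names are illustrative.
from rubric import ClassificationEvaluator  # assumed import

evaluator = ClassificationEvaluator(
    labels=["routine", "urgent", "emergent"],
    metrics=[
        "classification_accuracy",
        "classification_weighted_cost",
        "calibration_ece",
        "auc_roc",
    ],
    # Misclassifying an emergent note as routine is far more costly than the
    # reverse, so errors are weighted asymmetrically.
    error_costs={("emergent", "routine"): 10.0, ("routine", "emergent"): 1.0},
)

results = evaluator.run(dataset="triage-notes-test")  # assumed dataset identifier
print(results["classification_weighted_cost"])
```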
Model Version Comparison
Compare performance across model versions to catch regressions; a comparison sketch follows.
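One possible shape for version comparison, assuming a hypothetical `compare_runs` helper; the run identifiers, threshold parameter, and report fields are illustrative:

```python
# Hypothetical comparison helper; function and field names are illustrative.
from rubric import compare_runs  # assumed import

# Compare a candidate model's evaluation run against a baseline run and
# flag any metric that regresses by more than the threshold.
report = compare_runs(
    baseline_run="ner-v2.2-eval",
    candidate_run="ner-v2.3-eval",
    regression_threshold=0.01,
)

for metric, delta in report.regressions.items():
    print(f"REGRESSION: {metric} changed by {delta:+.3f}")
```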
Metrics Reference
Key metrics available for clinical NLP evaluation:

| Metric | Task | Description |
|---|---|---|
| ner_f1_macro | NER | Macro-averaged F1 across all entity types |
| ner_f1_weighted | NER | F1 weighted by entity importance |
| ner_precision | NER | Precision (avoiding false extractions) |
| ner_recall | NER | Recall (catching all entities) |
| span_exact_match | NER | Exact boundary matching rate |
| negation_accuracy | NER | Correct negation detection rate |
| coding_f1 | Coding | F1 for code prediction |
| coding_hierarchy_score | Coding | Hierarchy-aware accuracy |
| classification_accuracy | Classification | Overall accuracy |
| classification_weighted_cost | Classification | Cost-weighted error rate |
| calibration_ece | Classification | Expected calibration error |
| auc_roc | Classification | Area under ROC curve |
