## Overview

If you’ve trained custom models for clinical NLP tasks such as named entity recognition (NER), medical coding, or text classification, this guide shows you how to set up Rubric to evaluate model performance with healthcare-specific metrics.

Rubric supports evaluation across the following task types:

- **Named Entity Recognition**: extracting symptoms, medications, and diagnoses from text
- **Medical Coding**: ICD-10, CPT, and SNOMED CT code assignment
- **Text Classification**: urgency, specialty routing, and sentiment analysis
- **Relation Extraction**: drug-condition and symptom-diagnosis relationships
## NER Model Evaluation

For entity extraction models, log both the extracted entities and their spans.

Rubric supports both exact and relaxed span matching. Configure a tolerance for boundary variations that don’t affect clinical meaning.
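The difference between exact and relaxed matching can be sketched in plain Python (this illustrates the matching logic only, not Rubric SDK code; the `tolerance` parameter and character-offset spans are assumptions):

```python
def spans_match(pred, gold, tolerance=0):
    """Compare two (start, end) character spans.

    tolerance=0 reproduces exact matching; a small positive
    tolerance accepts boundary drift (e.g. a trailing plural 's'
    or punctuation) that doesn't change clinical meaning.
    """
    return (abs(pred[0] - gold[0]) <= tolerance
            and abs(pred[1] - gold[1]) <= tolerance)

# Gold annotation: "shortness of breath" at characters 10..29
gold = (10, 29)

spans_match((10, 29), gold)               # exact match
spans_match((10, 30), gold, tolerance=1)  # accepted under relaxed matching
spans_match((10, 30), gold)               # rejected under exact matching
```

In practice the tolerance should be kept small enough that a shifted boundary cannot pick up a clinically meaningful token such as a negation cue.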
## NER Evaluator Configuration

Configure evaluators specifically for entity extraction tasks.

## Medical Coding Evaluation
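To make “hierarchy-aware” concrete: for ICD-10, a prediction that misses the exact code but lands in the correct three-character category can earn partial credit. A minimal sketch (the scoring logic and credit values here are illustrative assumptions, not the Rubric implementation):

```python
def icd10_hierarchy_score(predicted, gold):
    """Score an ICD-10 prediction with partial credit.

    1.0  exact code match           (J45.901 == J45.901)
    0.5  same 3-character category  (J45.22  vs J45.901)
    0.0  different category
    """
    if predicted == gold:
        return 1.0
    if predicted[:3] == gold[:3]:  # ICD-10 category, e.g. "J45" (asthma)
        return 0.5
    return 0.0

icd10_hierarchy_score("J45.901", "J45.901")  # → 1.0
icd10_hierarchy_score("J45.22", "J45.901")   # → 0.5, both asthma codes
icd10_hierarchy_score("I10", "J45.901")      # → 0.0, hypertension vs asthma
```

CPT and SNOMED CT can be scored the same way using their own parent relations instead of the ICD-10 category prefix.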
For ICD-10, CPT, or SNOMED CT coding models, evaluate code accuracy with hierarchy-aware metrics.

## Text Classification Evaluation
For models that classify clinical text into categories, evaluate accuracy, cost-weighted errors, and calibration.

## Model Version Comparison
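A version-to-version regression check can be as simple as diffing per-metric scores against a tolerance (a hypothetical helper, not the Rubric SDK; the `max_drop` threshold is an assumption):

```python
def find_regressions(baseline, candidate, max_drop=0.01):
    """Return metrics where the candidate model scored more than
    max_drop below the baseline, as {metric: (old, new)}."""
    return {
        metric: (baseline[metric], candidate[metric])
        for metric in baseline
        if baseline[metric] - candidate.get(metric, 0.0) > max_drop
    }

v1 = {"ner_f1_macro": 0.87, "negation_accuracy": 0.95}
v2 = {"ner_f1_macro": 0.88, "negation_accuracy": 0.91}

find_regressions(v1, v2)  # → {'negation_accuracy': (0.95, 0.91)}
```

Here the candidate improves overall F1 but regresses on negation handling, exactly the kind of clinically significant slip an aggregate score can hide.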
Compare performance across model versions to catch regressions.

## Metrics Reference
Key metrics available for clinical NLP evaluation:

| Metric | Task | Description |
|---|---|---|
| `ner_f1_macro` | NER | Macro-averaged F1 across all entity types |
| `ner_f1_weighted` | NER | F1 weighted by entity importance |
| `ner_precision` | NER | Precision (avoiding false extractions) |
| `ner_recall` | NER | Recall (catching all entities) |
| `span_exact_match` | NER | Exact boundary matching rate |
| `negation_accuracy` | NER | Correct negation detection rate |
| `coding_f1` | Coding | F1 for code prediction |
| `coding_hierarchy_score` | Coding | Hierarchy-aware accuracy |
| `classification_accuracy` | Classification | Overall accuracy |
| `classification_weighted_cost` | Classification | Cost-weighted error rate |
| `calibration_ece` | Classification | Expected calibration error |
| `auc_roc` | Classification | Area under ROC curve |
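As a reference for how one of these metrics is defined, `calibration_ece` follows the standard expected-calibration-error recipe: bin predictions by confidence and average the gap between confidence and accuracy in each bin. A plain-Python sketch (not the Rubric implementation; `n_bins=10` is a conventional default):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |mean confidence - accuracy| gap, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # confidence 1.0 falls into the top bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# A model that is 90% confident and right 90% of the time
# is well calibrated: ECE ≈ 0
expected_calibration_error([0.9] * 10, [True] * 9 + [False])
```

Low ECE matters clinically because downstream triage rules often act on the reported confidence, not just the predicted label.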
## CI/CD Integration

Integrate Rubric into your model training pipeline.

## Next Steps
- **Create Your First Evaluation**: run evaluators on your model outputs
- **Create Datasets**: organize test sets for consistent evaluation
- **Python SDK Reference**: full SDK documentation
- **Evaluating Voice AI**: voice and multimodal evaluation
