AI / ML Metrics Definitions
Comprehensive reference for machine learning and evaluation metrics used in Rubric’s healthcare AI assessment platform.

Classification Metrics
Accuracy
The proportion of correct predictions among the total number of cases examined.
Healthcare Context: Overall correctness of triage classifications. Note: Can be misleading with imbalanced classes (e.g., rare conditions).
Typical Range: 0.0 - 1.0 (or 0% - 100%)
Precision
The proportion of positive identifications that were actually correct. Also called Positive Predictive Value (PPV).
Healthcare Context: When the AI flags a case as urgent, how often is it actually urgent? High precision reduces unnecessary escalations.
Typical Range: 0.0 - 1.0
Recall (Sensitivity)
The proportion of actual positives that were identified correctly. Also called True Positive Rate (TPR).
Healthcare Context: Of all truly urgent cases, how many did the AI correctly identify? Critical for patient safety—missing urgent cases (false negatives) can be dangerous.
Typical Range: 0.0 - 1.0 (aim for >0.95 in safety-critical applications)
Specificity
The proportion of actual negatives that were identified correctly. Also called True Negative Rate (TNR).
Healthcare Context: Of all non-urgent cases, how many were correctly identified as non-urgent? Helps assess over-triage rates.
Typical Range: 0.0 - 1.0
F1 Score
The harmonic mean of precision and recall, providing a single score that balances both metrics.
Healthcare Context: Useful when you need to balance finding urgent cases (recall) with not over-escalating (precision).
Typical Range: 0.0 - 1.0
F-beta Score
Generalization of F1 that allows weighting recall relative to precision via the parameter β.
Healthcare Context: Use β > 1 (e.g., F2) when recall is more important (catching all urgent cases). Use β < 1 (e.g., F0.5) when precision is prioritized.
Common Values: F0.5, F1, F2
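As an illustration, the classification metrics above can all be derived from the four confusion-matrix counts. This is a generic sketch (function and variable names are illustrative, not part of the Rubric platform):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int,
                           beta: float = 1.0) -> dict:
    """Compute basic classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # PPV
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # sensitivity / TPR
    specificity = tn / (tn + fp) if (tn + fp) else 0.0 # TNR
    # F-beta: beta > 1 weights recall more heavily, beta < 1 favors precision
    b2 = beta * beta
    fbeta = ((1 + b2) * precision * recall / (b2 * precision + recall)
             if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, f"f{beta:g}": fbeta}
```

For example, with 90 true positives, 20 false positives, 880 true negatives, and 10 false negatives, accuracy is 0.97 but precision is only ≈0.82 — a concrete case of the class-imbalance caveat noted under Accuracy.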
Probabilistic Metrics
AUC-ROC
Area Under the Receiver Operating Characteristic Curve. Measures the ability to distinguish between classes across all classification thresholds.
Healthcare Context: How well can the model separate urgent from non-urgent cases, regardless of the specific threshold chosen?
Interpretation:
- 1.0: Perfect discrimination
- 0.9-1.0: Excellent
- 0.8-0.9: Good
- 0.7-0.8: Fair
- 0.5: No discrimination (random)
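AUC-ROC has an equivalent rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (the Mann-Whitney U formulation). A minimal sketch, with illustrative names:

```python
def roc_auc(scores_pos, scores_neg):
    """AUC-ROC as the probability that a positive case outranks a negative
    one (Mann-Whitney U formulation); ties count as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tie: no preference either way
    return wins / (len(scores_pos) * len(scores_neg))
```

The O(n²) pairwise loop is for clarity; production code would sort once and use ranks.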
AUC-PR
Area Under the Precision-Recall Curve. More informative than AUC-ROC for imbalanced datasets.
Healthcare Context: Preferred for rare conditions or uncommon urgency levels where the positive class is much smaller than the negative class.
Typical Range: 0.0 - 1.0 (baseline depends on class prevalence)
Calibration
How well predicted probabilities match actual outcomes. A model predicting 70% confidence should be correct 70% of the time.
Healthcare Context: Critical for clinical decision support—clinicians need to trust that confidence scores are meaningful.
Measurement: Calibration curves, Expected Calibration Error (ECE), Brier Score
Brier Score
Mean squared difference between predicted probabilities and actual outcomes.
Healthcare Context: Lower is better. Captures both calibration and discrimination.
Typical Range: 0.0 (perfect) - 1.0 (worst)
Log Loss (Cross-Entropy)
The negative log-likelihood of the observed outcomes under the model’s predicted probabilities.
Healthcare Context: Penalizes confident wrong predictions heavily—important when AI expresses high certainty.
Typical Range: 0.0 (perfect) - ∞
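Both probabilistic scoring rules above are a few lines of code. A minimal sketch for binary outcomes (names are illustrative; the clipping constant guards against log(0)):

```python
import math

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome (0/1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)
    when a prediction is fully confident and wrong."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)
```

Note the different penalties: a confident wrong prediction (p = 0.99, outcome 0) contributes at most 1.0 to the Brier score but ≈4.6 to log loss, which is why log loss is the stricter check on overconfident models.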
Healthcare-Specific Metrics
Under-Triage Rate
Percentage of cases where the AI assigned a lower urgency than appropriate.
Healthcare Context: Primary safety metric. Under-triage can delay critical care and lead to adverse outcomes.
Target: < 5% for high-acuity conditions, < 2% for life-threatening conditions
Over-Triage Rate
Percentage of cases where the AI assigned a higher urgency than necessary.
Healthcare Context: Affects resource utilization and patient experience. Some over-triage is acceptable if it reduces under-triage.
Target: < 30% is generally acceptable; trade-off with under-triage
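If urgency levels are encoded ordinally (higher value = more urgent), both triage rates fall out of a case-by-case comparison against the reference standard. A sketch under that encoding assumption (names illustrative):

```python
def triage_rates(predicted, reference):
    """Under- and over-triage rates from ordinal urgency levels,
    where a higher value means more urgent."""
    n = len(predicted)
    under = sum(1 for p, r in zip(predicted, reference) if p < r)
    over = sum(1 for p, r in zip(predicted, reference) if p > r)
    return under / n, over / n
```

For safety reporting, the under-triage count is usually further broken down by reference acuity level, since the targets above differ for high-acuity versus life-threatening cases.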
Red Flag Detection Rate
Percentage of critical symptoms or warning signs correctly identified by the AI.
Healthcare Context: Essential for patient safety. Missing red flags like chest pain radiation or stroke symptoms can be fatal.
Target: > 98% for critical red flags
Guideline Adherence
Percentage of AI decisions that align with established clinical guidelines and protocols.
Healthcare Context: Measures whether the AI follows evidence-based medicine principles and organizational protocols.
Target: > 90% adherence to applicable guidelines
Clinical Concordance
Agreement rate between AI decisions and expert clinician judgments.
Healthcare Context: Gold standard comparison using board-certified clinician review.
Target: > 85% concordance with expert panel
Time to Error Correction
Average time for incorrect triage decisions to be identified and corrected.
Healthcare Context: Measures the effectiveness of human-in-the-loop review processes.
Target: < 30 minutes for urgent cases, < 4 hours for routine cases
NLP & Text Metrics
Medical Entity Extraction Accuracy
Correctness of identified medical entities (symptoms, medications, conditions) from clinical text.
Healthcare Context: Foundation for downstream clinical reasoning. Includes:
- Exact Match: Entity boundaries perfectly aligned
- Partial Match: Overlapping but not exact
- Type Accuracy: Correct entity category
Negation Detection Accuracy
Ability to correctly identify negated medical concepts (e.g., “no chest pain” vs “chest pain”).
Healthcare Context: Critical for accurate clinical understanding. False positives from missed negations can lead to incorrect triage.
Target: > 95% accuracy on negation scope
Temporal Extraction Accuracy
Correctness of extracted time references and durations from clinical narratives.
Healthcare Context: “Chest pain for 2 days” vs “chest pain 2 years ago” have vastly different clinical implications.
Target: > 90% accuracy on temporal extraction
Hallucination Rate
Percentage of AI-generated content that contains fabricated or unsupported clinical information.
Healthcare Context: AI must not invent symptoms, medications, or findings not present in the source data.
Target: < 1% for clinical documentation
Completeness
Percentage of clinically relevant information captured from source content.
Healthcare Context: For clinical note generation, measures whether all important findings, symptoms, and plans are documented.
Target: > 95% for critical clinical elements
Imaging Metrics
Detection Sensitivity
Percentage of true abnormalities detected by the imaging AI.
Healthcare Context: For radiology AI, measures ability to find nodules, fractures, or other findings.
Target: > 95% for critical findings (e.g., pneumothorax)
IoU (Intersection over Union)
Measures overlap between predicted and actual abnormality locations.
Healthcare Context: Important for surgical planning and treatment targeting.
Typical Range: 0.0 - 1.0 (> 0.5 generally considered acceptable)
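For axis-aligned bounding boxes, IoU reduces to a few coordinate comparisons. A minimal sketch using (x1, y1, x2, y2) corner coordinates (an assumption; segmentation masks use the same ratio over pixel sets):

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis; clamp at 0 when boxes don't overlap
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```

Note that two unit squares offset by half their width already score only 1/3, which is why the > 0.5 acceptability threshold above is a fairly strict localization requirement.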
False Positives per Scan
Average number of incorrect positive findings per imaging study.
Healthcare Context: High FP rates lead to unnecessary follow-up procedures and patient anxiety.
Target: < 1 FP per scan for screening applications
FROC (Free-Response ROC)
Extension of ROC analysis for lesion detection, plotting sensitivity vs false positives per image.
Healthcare Context: Standard metric for medical imaging AI competitions and FDA submissions.
Operational Metrics
Latency
Time from input submission to AI response generation.
Healthcare Context: Critical for real-time triage. Long latencies impact clinical workflow.
Target: < 2 seconds for synchronous triage, < 30 seconds for complex analysis
Throughput
Number of evaluations processed per unit time.
Healthcare Context: Must handle peak volumes during high-census periods.
Typical: Evaluations per minute/hour
Uptime
Percentage of time the system is operational and accessible.
Healthcare Context: Healthcare AI must maintain high availability for patient safety.
Target: > 99.9% uptime (< 8.76 hours downtime/year)
Override Rate
Percentage of AI decisions modified by human reviewers.
Healthcare Context: Indicates AI-human agreement and areas needing model improvement.
Typical Range: 5-15% (very low may indicate rubber-stamping)
Statistical Concepts
Confidence Interval
Range of values within which the true metric value likely falls, given sampling variability.
Healthcare Context: Report metrics with 95% CIs, especially for safety-critical measures.
Example: “Sensitivity: 0.94 (95% CI: 0.91-0.97)”
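For metrics that are proportions (sensitivity, specificity, adherence rates), a 95% CI can be computed with the Wilson score interval, which behaves better than the naive normal approximation for proportions near 0 or 1. A sketch (function name illustrative):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For example, 94 correct urgent calls out of 100 gives a sensitivity of 0.94 with a 95% CI of roughly 0.88-0.97, matching the reporting style in the example above.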
Statistical Significance (p-value)
The p-value is the probability of observing a difference at least as large as the one seen, assuming there is no true difference; a small p-value indicates the result is unlikely to be explained by random chance alone.
Healthcare Context: When comparing model versions, ensure improvements are statistically significant (p < 0.05).
Effect Size
Magnitude of difference between groups, independent of sample size.
Healthcare Context: A statistically significant but tiny improvement may not be clinically meaningful.
Common Measures: Cohen’s d, Odds Ratio, Risk Ratio
Inter-Rater Reliability
Agreement between multiple human reviewers on the same cases.
Healthcare Context: Important for establishing ground truth quality. Use Cohen’s Kappa or Fleiss’ Kappa.
Interpretation: > 0.8 (excellent), 0.6-0.8 (good), 0.4-0.6 (moderate)
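For two raters, Cohen's kappa corrects raw agreement for the agreement expected by chance alone. A minimal sketch (function name illustrative; Fleiss' kappa generalizes this to more than two raters):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement between two raters, corrected
    for the agreement expected by chance from each rater's label rates."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: product of each rater's marginal label frequencies
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                   for l in labels)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

Kappa near 0 means the reviewers agree no more often than chance, which is why a high raw agreement percentage alone is not sufficient evidence of ground-truth quality.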
