
Clinical Accuracy Metrics

Triage Accuracy

Measures the percentage of cases where the AI assigned the correct urgency level.
Triage Accuracy = (Correct Classifications) / (Total Cases) × 100%

Weighted Accuracy = Σ(weight_i × correct_i) / Σ(weight_i)

Where weights reflect clinical severity of each triage level.
| Metric | Formula | Interpretation |
| --- | --- | --- |
| Raw Accuracy | Correct / Total | Simple correctness rate |
| Weighted Accuracy | Σ(w × correct) / Σw | Accounts for severity differences |
| Under-triage Rate | Under-triaged / Total | Cases assigned a lower urgency level than needed |
| Over-triage Rate | Over-triaged / Total | Cases assigned a higher urgency level than needed |
Under-triage vs Over-triage: In healthcare, under-triage (missing emergencies) is generally far more dangerous than over-triage (unnecessary ER visits). Weight your metrics accordingly.
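
To make the formulas concrete, here is a minimal plain-Python sketch of the four triage metrics. The triage levels and severity weights are illustrative assumptions, not values defined by Rubric.

LEVELS = ["self-care", "routine", "urgent", "emergency"]            # low → high urgency (illustrative)
WEIGHTS = {"self-care": 1, "routine": 2, "urgent": 5, "emergency": 10}  # example severity weights

def triage_metrics(cases):
    """cases: list of (expected_level, predicted_level) tuples."""
    total = len(cases)
    correct = sum(1 for exp, pred in cases if exp == pred)
    under = sum(1 for exp, pred in cases if LEVELS.index(pred) < LEVELS.index(exp))
    over = sum(1 for exp, pred in cases if LEVELS.index(pred) > LEVELS.index(exp))

    # Weighted accuracy: Σ(weight_i × correct_i) / Σ(weight_i), weighted by expected severity
    weighted_correct = sum(WEIGHTS[exp] for exp, pred in cases if exp == pred)
    total_weight = sum(WEIGHTS[exp] for exp, _ in cases)

    return {
        "raw_accuracy": correct / total,
        "weighted_accuracy": weighted_correct / total_weight,
        "under_triage_rate": under / total,
        "over_triage_rate": over / total,
    }

print(triage_metrics([("emergency", "emergency"), ("routine", "urgent"), ("urgent", "routine")]))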

Sensitivity & Specificity

Critical metrics for evaluating detection of specific conditions or red flags.
Sensitivity (Recall) = True Positives / (True Positives + False Negatives)
"Of all actual positives, how many did we catch?"

Specificity = True Negatives / (True Negatives + False Positives)
"Of all actual negatives, how many did we correctly rule out?"

Positive Predictive Value (PPV) = TP / (TP + FP)
"When we say positive, how often are we right?"

Negative Predictive Value (NPV) = TN / (TN + FN)
"When we say negative, how often are we right?"

Clinical Context

| Condition | Priority Metric | Target | Rationale |
| --- | --- | --- | --- |
| Chest Pain → MI | Sensitivity | ≥ 99% | Cannot miss heart attacks |
| Stroke Symptoms | Sensitivity | ≥ 99% | Time-critical intervention |
| Pediatric Sepsis | Sensitivity | ≥ 98% | Rapid deterioration risk |
| Routine Follow-up | Specificity | ≥ 85% | Avoid unnecessary burden |
| Mental Health Crisis | Sensitivity | ≥ 95% | Safety-critical detection |
sensitivity_calculation.py
from rubric import Rubric

client = Rubric()

# Get evaluation results with sensitivity/specificity breakdown
results = client.evaluations.get("eval_abc123")

for condition in results.conditions:
    print(f"""
Condition: {condition.name}
  Sensitivity: {condition.sensitivity:.1%}
  Specificity: {condition.specificity:.1%}
  PPV: {condition.ppv:.1%}
  NPV: {condition.npv:.1%}
  F1 Score: {condition.f1:.3f}

  Confusion Matrix:
    TP: {condition.true_positives}  FP: {condition.false_positives}
    FN: {condition.false_negatives}  TN: {condition.true_negatives}
""")

Rubric-Based Scoring

For complex outputs like clinical notes or patient communications, rubric-based scoring provides structured evaluation across multiple dimensions.
rubric_definition.py
rubric = {
    "name": "Clinical Note Quality",
    "version": "2.1",
    "dimensions": [
        {
            "name": "Clinical Accuracy",
            "weight": 0.30,
            "criteria": [
                {"score": 5, "description": "All medical facts accurate, appropriate terminology"},
                {"score": 4, "description": "Minor terminology issues, facts correct"},
                {"score": 3, "description": "Some inaccuracies, no patient safety impact"},
                {"score": 2, "description": "Multiple inaccuracies, potential confusion"},
                {"score": 1, "description": "Significant errors, safety concerns"}
            ]
        },
        {
            "name": "Completeness",
            "weight": 0.25,
            "criteria": [
                {"score": 5, "description": "All required elements present, thorough"},
                {"score": 4, "description": "Minor omissions, key info captured"},
                {"score": 3, "description": "Some gaps but usable for care"},
                {"score": 2, "description": "Missing important elements"},
                {"score": 1, "description": "Critically incomplete"}
            ]
        },
        {
            "name": "Safety Documentation",
            "weight": 0.25,
            "criteria": [
                {"score": 5, "description": "All safety concerns documented, clear follow-up"},
                {"score": 4, "description": "Safety items present, minor gaps"},
                {"score": 3, "description": "Basic safety documented"},
                {"score": 2, "description": "Safety gaps present"},
                {"score": 1, "description": "Critical safety omissions"}
            ]
        },
        {
            "name": "Actionability",
            "weight": 0.20,
            "criteria": [
                {"score": 5, "description": "Clear next steps, specific instructions"},
                {"score": 4, "description": "Good guidance, minor ambiguity"},
                {"score": 3, "description": "Adequate but could be clearer"},
                {"score": 2, "description": "Vague or confusing recommendations"},
                {"score": 1, "description": "No clear action items"}
            ]
        }
    ],

    "aggregate_method": "weighted_average",
    "passing_threshold": 3.5
}

# Use in evaluation
evaluation = client.evaluations.create(
    name="Clinical Note Evaluation",
    dataset="ds_clinical_notes",
    evaluators=[{
        "type": "rubric",
        "config": {"rubric": rubric}
    }]
)
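
With aggregate_method set to weighted_average, the overall score is the weight-normalized sum of the per-dimension scores, compared against passing_threshold. A minimal sketch of that arithmetic, using made-up dimension scores for illustration:

def aggregate_rubric_score(rubric, dimension_scores):
    """Weighted average of per-dimension scores, plus pass/fail against the rubric threshold."""
    total_weight = sum(d["weight"] for d in rubric["dimensions"])
    score = sum(d["weight"] * dimension_scores[d["name"]] for d in rubric["dimensions"]) / total_weight
    return {"score": score, "passed": score >= rubric["passing_threshold"]}

# Illustrative scores: Accuracy 4, Completeness 3, Safety 5, Actionability 3
print(aggregate_rubric_score(rubric, {
    "Clinical Accuracy": 4,
    "Completeness": 3,
    "Safety Documentation": 5,
    "Actionability": 3,
}))
# score = (0.30*4 + 0.25*3 + 0.25*5 + 0.20*3) / 1.0 = 3.80 → passed (≥ 3.5)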

Custom Metrics

Define metrics specific to your clinical domain and use cases.
custom_metrics.py
from rubric import Metric, MetricResult

class SymptomCoverageMetric(Metric):
    """
    Measures what percentage of documented symptoms
    were addressed in the AI's assessment.
    """

    name = "symptom_coverage"
    display_name = "Symptom Coverage"
    description = "Percentage of symptoms addressed in assessment"

    def calculate(self, sample) -> MetricResult:
        documented_symptoms = set(sample.input.get("symptoms", []))
        addressed_symptoms = set(sample.ai_output.get("addressed_symptoms", []))

        if not documented_symptoms:
            return MetricResult(
                value=1.0,
                details={"note": "No symptoms documented"}
            )

        coverage = len(addressed_symptoms & documented_symptoms) / len(documented_symptoms)
        missed = documented_symptoms - addressed_symptoms

        return MetricResult(
            value=coverage,
            details={
                "total_symptoms": len(documented_symptoms),
                "addressed": len(addressed_symptoms & documented_symptoms),
                "missed_symptoms": list(missed)
            },
            flags=["incomplete_assessment"] if coverage < 0.8 else []
        )


class TimeToTriageMetric(Metric):
    """Measures efficiency of triage decision-making."""

    name = "time_to_triage"
    display_name = "Time to Triage Decision"
    description = "Conversation turns before triage decision"

    def calculate(self, sample) -> MetricResult:
        turns = sample.ai_output.get("conversation_turns", 0)
        triage_turn = sample.ai_output.get("triage_decision_turn", turns)

        # Clinical benchmark: triage within 5 turns for most cases.
        # Full credit through turn 3, then a linear penalty; clamp to [0, 1].
        efficiency_score = min(1.0, max(0.0, 1 - (triage_turn - 3) / 10))

        return MetricResult(
            value=efficiency_score,
            details={
                "total_turns": turns,
                "triage_turn": triage_turn,
                "benchmark": 5
            }
        )

# Register custom metrics
client.metrics.register(SymptomCoverageMetric)
client.metrics.register(TimeToTriageMetric)
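
It can be useful to exercise a custom metric locally before registering it. The sketch below assumes the Metric subclass can be instantiated directly and that sample.input and sample.ai_output behave as plain dicts, as the calculate implementations above imply; the SimpleNamespace sample is a stand-in, not the Rubric sample type.

from types import SimpleNamespace

# Hand-built stand-in sample (not the real Rubric sample object)
sample = SimpleNamespace(
    input={"symptoms": ["chest pain", "nausea", "shortness of breath"]},
    ai_output={"addressed_symptoms": ["chest pain", "shortness of breath"]},
)

result = SymptomCoverageMetric().calculate(sample)
print(result.value)    # ≈ 0.67 — two of three documented symptoms addressed
print(result.details)  # includes the missed symptom: "nausea"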

Confidence Intervals

All metrics include confidence intervals to quantify uncertainty, which is especially important for small sample sizes.
results = client.evaluations.get("eval_abc123")

print(f"""
Triage Accuracy: {results.triage_accuracy.value:.1%}
  95% CI: [{results.triage_accuracy.ci_lower:.1%}, {results.triage_accuracy.ci_upper:.1%}]
  Sample Size: {results.triage_accuracy.n}

Sensitivity (Chest Pain): {results.sensitivity_chest_pain.value:.1%}
  95% CI: [{results.sensitivity_chest_pain.ci_lower:.1%}, {results.sensitivity_chest_pain.ci_upper:.1%}]
""")
Sample Size Matters: For rare conditions, you may need larger datasets to achieve narrow confidence intervals. Rubric warns when sample sizes are too small for reliable conclusions.
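
If you need to reproduce an interval outside Rubric, the Wilson score interval is a common choice for proportions such as accuracy or sensitivity. A minimal sketch; this is a standard statistical formula, not necessarily the exact method Rubric uses:

from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion (z = 1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# 93 correct triage decisions out of 100 cases
low, high = wilson_ci(93, 100)
print(f"93.0% accuracy, 95% CI [{low:.1%}, {high:.1%}]")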

Metric Aggregation

Combine multiple metrics into composite scores for overall model assessment.
| Method | Description | Use Case |
| --- | --- | --- |
| Weighted Average | Sum of (weight × metric) | General performance score |
| Minimum | Lowest individual metric | Ensure no weak spots |
| Geometric Mean | ∏(metric)^(1/n) | Balance across metrics |
| Threshold Gate | Pass only if all thresholds met | Safety-critical deployment |
evaluation = client.evaluations.create(
    name="Production Readiness Check",
    dataset="ds_validation",
    evaluators=[...],

    aggregation={
        "method": "threshold_gate",
        "thresholds": {
            "triage_accuracy": {"min": 0.85},
            "sensitivity_critical": {"min": 0.95},
            "hallucination_rate": {"max": 0.01},
            "red_flag_detection": {"min": 0.98}
        },
        "require_all": True  # Must pass ALL thresholds
    }
)
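
Each aggregation method in the table reduces to simple arithmetic over the individual metric values. A minimal sketch with made-up metric values, weights, and thresholds:

from math import prod

metrics = {"triage_accuracy": 0.91, "sensitivity_critical": 0.97, "red_flag_detection": 0.99}
weights = {"triage_accuracy": 0.4, "sensitivity_critical": 0.4, "red_flag_detection": 0.2}
thresholds = {"triage_accuracy": 0.85, "sensitivity_critical": 0.95, "red_flag_detection": 0.98}

weighted_avg = sum(weights[k] * v for k, v in metrics.items()) / sum(weights.values())
minimum = min(metrics.values())                                  # lowest individual metric
geometric_mean = prod(metrics.values()) ** (1 / len(metrics))    # ∏(metric)^(1/n)
threshold_gate = all(metrics[k] >= t for k, t in thresholds.items())  # pass only if all thresholds met

print(weighted_avg, minimum, geometric_mean, threshold_gate)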

Exporting Metrics

Export metrics in formats suitable for regulatory documentation, dashboards, or CI/CD pipelines.
# Export for regulatory documentation
client.evaluations.export(
    "eval_abc123",
    format="regulatory_report",
    output_path="./reports/fda_submission.pdf",
    include=[
        "methodology",
        "dataset_description",
        "metric_definitions",
        "results_summary",
        "confidence_intervals",
        "failure_analysis"
    ]
)

# Export for CI/CD
results = client.evaluations.export("eval_abc123", format="json")

# Use in deployment decision
if results["composite_score"] >= 0.9 and results["safety_gate"] == "pass":
    deploy_model()
else:
    block_deployment(reason=results["failure_reasons"])