
Clinical Accuracy Metrics

Triage Accuracy

Measures the percentage of cases where the AI assigned the correct urgency level.
Triage Accuracy = (Correct Classifications) / (Total Cases) × 100%

Weighted Accuracy = Σ(weight_i × correct_i) / Σ(weight_i)

Where weights reflect clinical severity of each triage level.
| Metric | Formula | Interpretation |
| --- | --- | --- |
| Raw Accuracy | Correct / Total | Simple correctness rate |
| Weighted Accuracy | Σ(w × correct) / Σw | Accounts for severity differences |
| Under-triage Rate | Under-triaged / Total | Cases assigned a lower urgency level than needed |
| Over-triage Rate | Over-triaged / Total | Cases assigned a higher urgency level than needed |
Under-triage vs Over-triage: In healthcare, under-triage (missing emergencies) is generally far more dangerous than over-triage (unnecessary ER visits). Weight your metrics accordingly.
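
To make the formulas concrete, here is a minimal plain-Python sketch of the four triage metrics. The triage levels and severity weights are illustrative assumptions, not values defined by Rubric.

LEVELS = ["self-care", "routine", "urgent", "emergency"]            # low → high urgency (illustrative)
WEIGHTS = {"self-care": 1, "routine": 2, "urgent": 5, "emergency": 10}  # example severity weights

def triage_metrics(cases):
    """cases: list of (expected_level, predicted_level) tuples."""
    total = len(cases)
    correct = sum(1 for exp, pred in cases if exp == pred)
    under = sum(1 for exp, pred in cases if LEVELS.index(pred) < LEVELS.index(exp))
    over = sum(1 for exp, pred in cases if LEVELS.index(pred) > LEVELS.index(exp))

    # Weighted accuracy: Σ(weight_i × correct_i) / Σ(weight_i), weighted by expected severity
    weighted_correct = sum(WEIGHTS[exp] for exp, pred in cases if exp == pred)
    total_weight = sum(WEIGHTS[exp] for exp, _ in cases)

    return {
        "raw_accuracy": correct / total,
        "weighted_accuracy": weighted_correct / total_weight,
        "under_triage_rate": under / total,
        "over_triage_rate": over / total,
    }

print(triage_metrics([("emergency", "emergency"), ("routine", "urgent"), ("urgent", "routine")]))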

Sensitivity & Specificity

Critical metrics for evaluating detection of specific conditions or red flags.
Sensitivity (Recall) = True Positives / (True Positives + False Negatives)
"Of all actual positives, how many did we catch?"

Specificity = True Negatives / (True Negatives + False Positives)
"Of all actual negatives, how many did we correctly rule out?"

Positive Predictive Value (PPV) = TP / (TP + FP)
"When we say positive, how often are we right?"

Negative Predictive Value (NPV) = TN / (TN + FN)
"When we say negative, how often are we right?"

Clinical Context

| Condition | Priority Metric | Target | Rationale |
| --- | --- | --- | --- |
| Chest Pain → MI | Sensitivity | ≥ 99% | Cannot miss heart attacks |
| Stroke Symptoms | Sensitivity | ≥ 99% | Time-critical intervention |
| Pediatric Sepsis | Sensitivity | ≥ 98% | Rapid deterioration risk |
| Routine Follow-up | Specificity | ≥ 85% | Avoid unnecessary burden |
| Mental Health Crisis | Sensitivity | ≥ 95% | Safety-critical detection |
sensitivity_calculation.py
from rubric import Rubric

client = Rubric()

# Get evaluation results with sensitivity/specificity breakdown
results = client.evaluations.get("eval_abc123")

for condition in results.conditions:
    print(f"""
Condition: {condition.name}
  Sensitivity: {condition.sensitivity:.1%}
  Specificity: {condition.specificity:.1%}
  PPV: {condition.ppv:.1%}
  NPV: {condition.npv:.1%}
  F1 Score: {condition.f1:.3f}

  Confusion Matrix:
    TP: {condition.true_positives}  FP: {condition.false_positives}
    FN: {condition.false_negatives}  TN: {condition.true_negatives}
""")

Rubric-Based Scoring

For complex outputs like clinical notes or patient communications, rubric-based scoring provides structured evaluation across multiple dimensions.
rubric_definition.py
rubric = {
    "name": "Clinical Note Quality",
    "version": "2.1",
    "dimensions": [
        {
            "name": "Clinical Accuracy",
            "weight": 0.30,
            "criteria": [
                {"score": 5, "description": "All medical facts accurate, appropriate terminology"},
                {"score": 4, "description": "Minor terminology issues, facts correct"},
                {"score": 3, "description": "Some inaccuracies, no patient safety impact"},
                {"score": 2, "description": "Multiple inaccuracies, potential confusion"},
                {"score": 1, "description": "Significant errors, safety concerns"}
            ]
        },
        {
            "name": "Completeness",
            "weight": 0.25,
            "criteria": [
                {"score": 5, "description": "All required elements present, thorough"},
                {"score": 4, "description": "Minor omissions, key info captured"},
                {"score": 3, "description": "Some gaps but usable for care"},
                {"score": 2, "description": "Missing important elements"},
                {"score": 1, "description": "Critically incomplete"}
            ]
        },
        {
            "name": "Safety Documentation",
            "weight": 0.25,
            "criteria": [
                {"score": 5, "description": "All safety concerns documented, clear follow-up"},
                {"score": 4, "description": "Safety items present, minor gaps"},
                {"score": 3, "description": "Basic safety documented"},
                {"score": 2, "description": "Safety gaps present"},
                {"score": 1, "description": "Critical safety omissions"}
            ]
        },
        {
            "name": "Actionability",
            "weight": 0.20,
            "criteria": [
                {"score": 5, "description": "Clear next steps, specific instructions"},
                {"score": 4, "description": "Good guidance, minor ambiguity"},
                {"score": 3, "description": "Adequate but could be clearer"},
                {"score": 2, "description": "Vague or confusing recommendations"},
                {"score": 1, "description": "No clear action items"}
            ]
        }
    ],

    "aggregate_method": "weighted_average",
    "passing_threshold": 3.5
}

# Use in evaluation
evaluation = client.evaluations.create(
    name="Clinical Note Evaluation",
    dataset="ds_clinical_notes",
    evaluators=[{
        "type": "rubric",
        "config": {"rubric": rubric}
    }]
)
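
With aggregate_method set to weighted_average, the overall score is the weight-normalized sum of the per-dimension scores, compared against passing_threshold. A minimal sketch of that arithmetic, using made-up dimension scores for illustration:

def aggregate_rubric_score(rubric, dimension_scores):
    """Weighted average of per-dimension scores, plus pass/fail against the rubric threshold."""
    total_weight = sum(d["weight"] for d in rubric["dimensions"])
    score = sum(d["weight"] * dimension_scores[d["name"]] for d in rubric["dimensions"]) / total_weight
    return {"score": score, "passed": score >= rubric["passing_threshold"]}

# Illustrative scores: Accuracy 4, Completeness 3, Safety 5, Actionability 3
print(aggregate_rubric_score(rubric, {
    "Clinical Accuracy": 4,
    "Completeness": 3,
    "Safety Documentation": 5,
    "Actionability": 3,
}))
# score = (0.30*4 + 0.25*3 + 0.25*5 + 0.20*3) / 1.0 = 3.80 → passed (≥ 3.5)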

Custom Metrics

Define metrics specific to your clinical domain and use cases.
custom_metrics.py
from rubric import Metric, MetricResult

class SymptomCoverageMetric(Metric):
    """
    Measures what percentage of documented symptoms
    were addressed in the AI's assessment.
    """

    name = "symptom_coverage"
    display_name = "Symptom Coverage"
    description = "Percentage of symptoms addressed in assessment"

    def calculate(self, sample) -> MetricResult:
        documented_symptoms = set(sample.input.get("symptoms", []))
        addressed_symptoms = set(sample.ai_output.get("addressed_symptoms", []))

        if not documented_symptoms:
            return MetricResult(
                value=1.0,
                details={"note": "No symptoms documented"}
            )

        coverage = len(addressed_symptoms & documented_symptoms) / len(documented_symptoms)
        missed = documented_symptoms - addressed_symptoms

        return MetricResult(
            value=coverage,
            details={
                "total_symptoms": len(documented_symptoms),
                "addressed": len(addressed_symptoms & documented_symptoms),
                "missed_symptoms": list(missed)
            },
            flags=["incomplete_assessment"] if coverage < 0.8 else []
        )


class TimeToTriageMetric(Metric):
    """Measures efficiency of triage decision-making."""

    name = "time_to_triage"
    display_name = "Time to Triage Decision"
    description = "Conversation turns before triage decision"

    def calculate(self, sample) -> MetricResult:
        turns = sample.ai_output.get("conversation_turns", 0)
        triage_turn = sample.ai_output.get("triage_decision_turn", turns)

        # Clinical benchmark: triage within 5 turns for most cases.
        # Full credit through turn 3, then a linear penalty; clamp to [0, 1].
        efficiency_score = min(1.0, max(0.0, 1 - (triage_turn - 3) / 10))

        return MetricResult(
            value=efficiency_score,
            details={
                "total_turns": turns,
                "triage_turn": triage_turn,
                "benchmark": 5
            }
        )

# Register custom metrics
client.metrics.register(SymptomCoverageMetric)
client.metrics.register(TimeToTriageMetric)
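
It can be useful to exercise a custom metric locally before registering it. The sketch below assumes the Metric subclass can be instantiated directly and that sample.input and sample.ai_output behave as plain dicts, as the calculate implementations above imply; the SimpleNamespace sample is a stand-in, not the Rubric sample type.

from types import SimpleNamespace

# Hand-built stand-in sample (not the real Rubric sample object)
sample = SimpleNamespace(
    input={"symptoms": ["chest pain", "nausea", "shortness of breath"]},
    ai_output={"addressed_symptoms": ["chest pain", "shortness of breath"]},
)

result = SymptomCoverageMetric().calculate(sample)
print(result.value)    # ≈ 0.67 — two of three documented symptoms addressed
print(result.details)  # includes the missed symptom: "nausea"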

Confidence Intervals

All metrics include confidence intervals to quantify uncertainty, which is especially important for small sample sizes.
results = client.evaluations.get("eval_abc123")

print(f"""
Triage Accuracy: {results.triage_accuracy.value:.1%}
  95% CI: [{results.triage_accuracy.ci_lower:.1%}, {results.triage_accuracy.ci_upper:.1%}]
  Sample Size: {results.triage_accuracy.n}

Sensitivity (Chest Pain): {results.sensitivity_chest_pain.value:.1%}
  95% CI: [{results.sensitivity_chest_pain.ci_lower:.1%}, {results.sensitivity_chest_pain.ci_upper:.1%}]
""")
Sample Size Matters: For rare conditions, you may need larger datasets to achieve narrow confidence intervals. Rubric warns when sample sizes are too small for reliable conclusions.
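
If you need to reproduce an interval outside Rubric, the Wilson score interval is a common choice for proportions such as accuracy or sensitivity. A minimal sketch; this is a standard statistical formula, not necessarily the exact method Rubric uses:

from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion (z = 1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# 93 correct triage decisions out of 100 cases
low, high = wilson_ci(93, 100)
print(f"93.0% accuracy, 95% CI [{low:.1%}, {high:.1%}]")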

Metric Aggregation

Combine multiple metrics into composite scores for overall model assessment.
| Method | Description | Use Case |
| --- | --- | --- |
| Weighted Average | Sum of (weight × metric) | General performance score |
| Minimum | Lowest individual metric | Ensure no weak spots |
| Geometric Mean | ∏(metric)^(1/n) | Balance across metrics |
| Threshold Gate | Pass only if all thresholds met | Safety-critical deployment |
evaluation = client.evaluations.create(
    name="Production Readiness Check",
    dataset="ds_validation",
    evaluators=[...],

    aggregation={
        "method": "threshold_gate",
        "thresholds": {
            "triage_accuracy": {"min": 0.85},
            "sensitivity_critical": {"min": 0.95},
            "hallucination_rate": {"max": 0.01},
            "red_flag_detection": {"min": 0.98}
        },
        "require_all": True  # Must pass ALL thresholds
    }
)
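
Each aggregation method in the table reduces to simple arithmetic over the individual metric values. A minimal sketch with made-up metric values, weights, and thresholds:

from math import prod

metrics = {"triage_accuracy": 0.91, "sensitivity_critical": 0.97, "red_flag_detection": 0.99}
weights = {"triage_accuracy": 0.4, "sensitivity_critical": 0.4, "red_flag_detection": 0.2}
thresholds = {"triage_accuracy": 0.85, "sensitivity_critical": 0.95, "red_flag_detection": 0.98}

weighted_avg = sum(weights[k] * v for k, v in metrics.items()) / sum(weights.values())
minimum = min(metrics.values())                                  # lowest individual metric
geometric_mean = prod(metrics.values()) ** (1 / len(metrics))    # ∏(metric)^(1/n)
threshold_gate = all(metrics[k] >= t for k, t in thresholds.items())  # pass only if all thresholds met

print(weighted_avg, minimum, geometric_mean, threshold_gate)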

Exporting Metrics

Export metrics in formats suitable for regulatory documentation, dashboards, or CI/CD pipelines.
# Export for regulatory documentation
client.evaluations.export(
    "eval_abc123",
    format="regulatory_report",
    output_path="./reports/fda_submission.pdf",
    include=[
        "methodology",
        "dataset_description",
        "metric_definitions",
        "results_summary",
        "confidence_intervals",
        "failure_analysis"
    ]
)

# Export for CI/CD
results = client.evaluations.export("eval_abc123", format="json")

# Use in deployment decision
if results["composite_score"] >= 0.9 and results["safety_gate"] == "pass":
    deploy_model()
else:
    block_deployment(reason=results["failure_reasons"])