Overview

Evaluations are the core of Rubric. They automatically score your AI’s outputs against clinical criteria like triage accuracy, red flag detection, and guideline compliance.

Prerequisites

Before running an evaluation, you need:

  - A Rubric project containing your AI's outputs
  - A dataset to evaluate, or a date range of logged outputs

Available Evaluators

Triage Accuracy

Measures whether the AI assigned the correct urgency level. Penalizes under-triage more heavily than over-triage.
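
To illustrate the asymmetry, here is a minimal sketch of a direction-weighted penalty, assuming a hypothetical four-level urgency scale; the weights mirror the severity_weights config in the SDK example below, and this is not Rubric's internal implementation.

# Conceptual sketch only: penalize misses by direction, with under-triage
# weighted more heavily. The urgency scale below is hypothetical.
URGENCY_ORDER = ["self_care", "routine", "urgent", "emergency"]

def triage_penalty(expected, predicted, under_weight=5.0, over_weight=1.0):
    gap = URGENCY_ORDER.index(expected) - URGENCY_ORDER.index(predicted)
    if gap == 0:
        return 0.0  # correct urgency level
    # gap > 0 means the AI triaged below the expected urgency (under-triage)
    return gap * under_weight if gap > 0 else -gap * over_weight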

Red Flag Detection

Checks if the AI identified critical symptoms requiring immediate attention.
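
Conceptually, the check compares the symptoms the AI flagged against a per-protocol red-flag list. A simplified sketch with made-up symptom lists (the real evaluator is configured by protocol name, as in the SDK example below):

# Illustration only; the red-flag lists here are made up.
RED_FLAGS = {
    "chest_pain": {"radiating pain", "shortness of breath", "diaphoresis"},
    "stroke": {"facial droop", "arm weakness", "slurred speech"},
}

def missed_red_flags(protocol, present_symptoms, flagged_by_ai):
    # Red-flag symptoms present in the case but not surfaced by the AI
    return (RED_FLAGS[protocol] & set(present_symptoms)) - set(flagged_by_ai)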

Guideline Compliance

Scores adherence to clinical protocols and decision trees.

Hallucination Detection

Identifies claims not supported by the patient’s input.
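
Evaluators are selected by type name when you create an evaluation (full example in the next section). A minimal sketch of selecting all four; the guideline_compliance and hallucination_detection type strings are assumptions inferred from the naming pattern of the two documented evaluators:

# The last two type strings are assumptions; check the evaluator reference
# for exact names and configuration options.
evaluators = [
    {"type": "triage_accuracy"},
    {"type": "red_flag_detection", "config": {"protocols": ["chest_pain", "stroke"]}},
    {"type": "guideline_compliance"},
    {"type": "hallucination_detection"},
]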

Run an Evaluation

Via Dashboard

  1. Navigate to your project
  2. Click Run Evaluation
  3. Select evaluators and configure weights
  4. Choose dataset or date range
  5. Click Start

Via SDK

from rubric import Rubric

client = Rubric()

# Create and run an evaluation
evaluation = client.evaluations.create(
    name="Weekly Triage Review",
    project="proj_abc123",
    dataset="ds_xyz789",  # Or use filters instead
    evaluators=[
        {
            "type": "triage_accuracy",
            "config": {
                "severity_weights": {
                    "under_triage": 5.0,  # Penalize more heavily
                    "over_triage": 1.0
                }
            }
        },
        {
            "type": "red_flag_detection",
            "config": {
                "protocols": ["chest_pain", "headache", "stroke"]
            }
        }
    ]
)

print(f"Evaluation started: {evaluation.id}")

Monitor Progress

Evaluations run asynchronously. Check progress via:

# Poll for completion
evaluation = client.evaluations.get("eval_abc123")
print(f"Status: {evaluation.status}")
print(f"Progress: {evaluation.progress.completed}/{evaluation.progress.total}")

Or use webhooks for real-time updates:

# Configure webhook in dashboard or via API
client.webhooks.create(
    url="https://your-app.com/webhooks/rubric",
    events=["evaluation.completed", "evaluation.failed"]
)
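
On the receiving side, your endpoint only needs to acknowledge the POST and react to the event type. A sketch using Flask; the payload field names ("event", "data", "id") are assumptions, so verify them against the webhook payload schema:

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/rubric", methods=["POST"])
def rubric_webhook():
    payload = request.get_json()
    # Field names here are assumptions about the webhook payload schema
    if payload.get("event") == "evaluation.completed":
        evaluation_id = payload["data"]["id"]
        print(f"Evaluation finished: {evaluation_id}")  # fetch results here
    return "", 200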

View Results

Once complete, examine results in the dashboard or via API:

results = client.evaluations.get_results("eval_abc123")

print(f"Overall Score: {results.overall_score}%")

for evaluator in results.evaluators:
    print(f"{evaluator.name}: {evaluator.score}%")
    
# Get samples that failed specific criteria
failed_triage = client.evaluations.get_samples(
    "eval_abc123",
    evaluator="triage_accuracy",
    status="failed"
)
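
From there you can iterate over the failing samples for manual review. The per-sample fields used below (id, expected, predicted) are assumptions about the sample schema:

# Field names on each sample are assumptions; adjust to the actual schema
for sample in failed_triage:
    print(f"{sample.id}: expected {sample.expected}, got {sample.predicted}")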

Interpret Scores

| Score Range | Interpretation | Action |
|-------------|----------------|--------|
| 90-100% | Excellent | Monitor for regression |
| 80-89% | Good | Review edge cases |
| 70-79% | Needs improvement | Analyze failure patterns |
| < 70% | Critical | Immediate review required |

Safety-critical evaluators like Red Flag Detection should maintain > 95% accuracy. Any missed red flag should trigger immediate review.
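
These thresholds are straightforward to enforce after each run. A sketch that applies the bands above plus the stricter 95% bar for Red Flag Detection, assuming evaluator.name matches the evaluator type string:

# Map scores to the actions in the table, with a stricter bar for
# safety-critical evaluators (assumes names match the type strings).
SAFETY_CRITICAL = {"red_flag_detection"}

def review_action(name, score):
    if name in SAFETY_CRITICAL and score < 95:
        return "Immediate review required"
    if score >= 90:
        return "Monitor for regression"
    if score >= 80:
        return "Review edge cases"
    if score >= 70:
        return "Analyze failure patterns"
    return "Immediate review required"

results = client.evaluations.get_results("eval_abc123")
for evaluator in results.evaluators:
    print(f"{evaluator.name}: {evaluator.score}% -> {review_action(evaluator.name, evaluator.score)}")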

Compare Evaluations

Track progress over time by comparing evaluations:

comparison = client.evaluations.compare([
    "eval_week1",
    "eval_week2", 
    "eval_week3"
])

for metric in comparison.metrics:
    print(f"{metric.name}: {metric.trend}")  # "improving", "stable", "regressing"

Next Steps