Overview
Evaluations are the core of Rubric. They automatically score your AI’s outputs against clinical criteria like triage accuracy, red flag detection, and guideline compliance.
Prerequisites
Before running an evaluation, you need:
- A project containing the AI outputs you want to score (e.g. proj_abc123 in the examples below)
- A dataset to evaluate, or a date range of logged traffic to filter on
- At least one evaluator selected and configured
Available Evaluators
- Triage Accuracy: Measures whether the AI assigned the correct urgency level. Penalizes under-triage more heavily than over-triage.
- Red Flag Detection: Checks if the AI identified critical symptoms requiring immediate attention.
- Guideline Compliance: Scores adherence to clinical protocols and decision trees.
- Hallucination Detection: Identifies claims not supported by the patient’s input.
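Each evaluator is passed to an evaluation as an entry with a type and an optional config, as shown in the SDK example below. That example covers triage_accuracy and red_flag_detection; as a rough sketch, the other two evaluators might be wired in like this. The config fields shown here (guideline_set in particular) are illustrative assumptions, not documented parameters:

# Illustrative entries for the remaining evaluators.
# The "guideline_set" config key is an assumption, not a documented parameter.
additional_evaluators = [
    {
        "type": "guideline_compliance",
        "config": {
            "guideline_set": "adult_triage_v2"  # Hypothetical protocol set name
        }
    },
    {
        "type": "hallucination_detection",
        "config": {}  # Assumed to work with defaults
    }
]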
Run an Evaluation
Via Dashboard
1. Navigate to your project.
2. Click Run Evaluation.
3. Select evaluators and configure weights.
4. Choose a dataset or date range.
5. Click Start.
Via SDK
from rubric import Rubric
client = Rubric()
# Create and run an evaluation
evaluation = client.evaluations.create(
    name="Weekly Triage Review",
    project="proj_abc123",
    dataset="ds_xyz789",  # Or use filters instead
    evaluators=[
        {
            "type": "triage_accuracy",
            "config": {
                "severity_weights": {
                    "under_triage": 5.0,  # Penalize more heavily
                    "over_triage": 1.0
                }
            }
        },
        {
            "type": "red_flag_detection",
            "config": {
                "protocols": ["chest_pain", "headache", "stroke"]
            }
        }
    ]
)

print(f"Evaluation started: {evaluation.id}")
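The dataset argument above has a filter-based alternative (see the comment on that line and step 4 of the dashboard flow). A minimal sketch of that variant, assuming a filters parameter that accepts a date range; the parameter name and its keys are assumptions, not confirmed SDK fields:

# Hypothetical filter-based run instead of a fixed dataset.
# The "filters" parameter and its keys are assumptions.
evaluation = client.evaluations.create(
    name="Last Week of Production Traffic",
    project="proj_abc123",
    filters={
        "start_date": "2024-01-01",
        "end_date": "2024-01-07"
    },
    evaluators=[{"type": "triage_accuracy"}]
)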
Monitor Progress
Evaluations run asynchronously. Check progress via:
# Poll for completion
evaluation = client.evaluations.get("eval_abc123")
print(f"Status: {evaluation.status}")
print(f"Progress: {evaluation.progress.completed}/{evaluation.progress.total}")
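For scripted runs, you can wrap the status check in a simple polling loop. A minimal sketch, assuming the status field ends up as "completed" or "failed"; the terminal values are inferred from the webhook event names below, not confirmed:

import time

# Poll until the evaluation reaches a terminal state.
# Terminal status strings are assumptions inferred from the webhook events.
while True:
    evaluation = client.evaluations.get("eval_abc123")
    if evaluation.status in ("completed", "failed"):
        break
    print(f"Progress: {evaluation.progress.completed}/{evaluation.progress.total}")
    time.sleep(30)  # Avoid hammering the API

print(f"Finished with status: {evaluation.status}")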
Or use webhooks for real-time updates:
# Configure webhook in dashboard or via API
client.webhooks.create(
    url="https://your-app.com/webhooks/rubric",
    events=["evaluation.completed", "evaluation.failed"]
)
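On the receiving end, your endpoint only needs to accept a POST and branch on the event name. A minimal receiver sketch using Flask; the payload shape (event, data.id) is an assumption, so adjust it to the actual webhook schema and add signature verification if Rubric provides it:

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/rubric", methods=["POST"])
def rubric_webhook():
    payload = request.get_json(force=True)
    event = payload.get("event")  # Payload shape is an assumption
    if event == "evaluation.completed":
        evaluation_id = payload.get("data", {}).get("id")
        print(f"Evaluation {evaluation_id} completed")
    elif event == "evaluation.failed":
        print("Evaluation failed; check the dashboard for details")
    return "", 200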
View Results
Once complete, examine results in the dashboard or via API:
results = client.evaluations.get_results("eval_abc123")

print(f"Overall Score: {results.overall_score}%")
for evaluator in results.evaluators:
    print(f"{evaluator.name}: {evaluator.score}%")

# Get samples that failed specific criteria
failed_triage = client.evaluations.get_samples(
    "eval_abc123",
    evaluator="triage_accuracy",
    status="failed"
)
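From there, each failed sample can be pulled up for manual review. The attribute names below (id, input, expected, actual) are assumptions about the sample object, so adjust them to whatever fields your SDK version actually returns:

# Review a handful of under-triaged cases by hand.
# Attribute names on each sample are assumptions.
for sample in failed_triage[:10]:
    print(sample.id)
    print("  Patient input:", sample.input)
    print("  Expected triage:", sample.expected)
    print("  Model triage:", sample.actual)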
Interpret Scores
| Score Range | Interpretation | Action |
| --- | --- | --- |
| 90-100% | Excellent | Monitor for regression |
| 80-89% | Good | Review edge cases |
| 70-79% | Needs improvement | Analyze failure patterns |
| < 70% | Critical | Immediate review required |
Safety-critical evaluators like Red Flag Detection should maintain > 95% accuracy. Any missed red flag should trigger immediate review.
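One way to operationalize these thresholds is a gate in CI or a scheduled job that fails the run when a safety-critical evaluator drops below its floor. A sketch under the assumption that evaluator.name matches the evaluator type string and that scores are the same percentages shown above:

import sys

results = client.evaluations.get_results("eval_abc123")

# Minimum acceptable scores; red_flag_detection gets the stricter 95% floor.
# Assumes evaluator.name matches the evaluator type string.
thresholds = {"red_flag_detection": 95.0, "triage_accuracy": 80.0}

failing = [
    e.name
    for e in results.evaluators
    if e.name in thresholds and e.score < thresholds[e.name]
]

if failing:
    print(f"Below threshold: {', '.join(failing)} -- immediate review required")
    sys.exit(1)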
Compare Evaluations
Track progress over time by comparing evaluations:
comparison = client.evaluations.compare([
    "eval_week1",
    "eval_week2",
    "eval_week3"
])

for metric in comparison.metrics:
    print(f"{metric.name}: {metric.trend}")  # "improving", "stable", "regressing"
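The trend values can also drive automated alerts; aside from the comparison object above, everything here is plain Python:

# Surface any metric the comparison marks as regressing.
regressing = [m.name for m in comparison.metrics if m.trend == "regressing"]
if regressing:
    print("Regressing metrics:", ", ".join(regressing))
    # e.g. notify the clinical review channel or open a ticket here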
Next Steps