Overview
This guide walks you through creating a complete evaluation from scratch:

- Create a dataset to hold your test cases
- Add sample data (transcripts, decisions)
- Select evaluators to run
- Execute the evaluation
- View and interpret results
Estimated time: 10-15 minutes
Step 1: Create a Dataset
Datasets are collections of samples that you evaluate together. Think of them as your test sets. You can create a dataset with the Python SDK or from the dashboard.
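The snippet below is a minimal sketch of dataset creation via the Python SDK; the `Rubric` client class, `api_key` argument, and `datasets.create` method are assumed names, not a confirmed API surface.

```python
# Hypothetical SDK usage: client class, auth, and method names are assumptions.
from rubric import Rubric

client = Rubric(api_key="YOUR_API_KEY")

# Create a dataset to hold the triage test cases.
dataset = client.datasets.create(
    name="triage-eval-quickstart",
    description="Sample triage calls for the quickstart guide",
)
```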
Step 2: Add Samples
Samples are individual test cases with inputs, AI outputs, and expected results.

Sample Structure
Each sample contains:

| Field | Description | Required |
|---|---|---|
| input | The data the AI received (transcript, audio URL, etc.) | Yes |
| output | What the AI produced (triage level, symptoms, etc.) | Yes |
| expected | Ground truth for evaluation | No (but recommended) |
| metadata | Additional context (call ID, model version, etc.) | No |
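For example, a single sample might look like the dict below; the field names come from the table above, while the values are purely illustrative.

```python
# One sample as a plain dict; fields match the table above, values are made up.
sample = {
    "input": {"transcript": "Caller reports chest pain radiating to the left arm."},
    "output": {"triage_level": "emergency", "symptoms": ["chest pain"]},
    "expected": {"triage_level": "emergency"},
    "metadata": {"call_id": "call_0042", "model_version": "v3.1"},
}
```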
Add Samples via SDK
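A sketch of adding the sample defined above; `add_sample` is an assumed method name, not a confirmed part of the SDK.

```python
# Hypothetical call: unpack the sample dict into the (assumed) add_sample method.
dataset.add_sample(**sample)
```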
Batch Upload
For larger datasets, use batch upload:
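A minimal sketch of batch upload, assuming an `add_samples` method that accepts a list of sample dicts:

```python
# call_records stands in for your own data; two toy rows shown here.
call_records = [
    {"transcript": "Sudden severe headache", "predicted": "emergency", "actual": "emergency"},
    {"transcript": "Mild cough for three days", "predicted": "routine", "actual": "routine"},
]

# Build samples following the structure above; add_samples is an assumed method.
samples = [
    {
        "input": {"transcript": r["transcript"]},
        "output": {"triage_level": r["predicted"]},
        "expected": {"triage_level": r["actual"]},
    }
    for r in call_records
]
dataset.add_samples(samples)
```

Upload from File

If your samples live on disk, you can load them yourself and reuse the same (assumed) batch method; this sketch assumes one JSON object per line (JSONL):

```python
import json

# Load JSONL from disk and batch-upload; add_samples is still an assumption.
with open("samples.jsonl") as f:
    dataset.add_samples([json.loads(line) for line in f])
```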
Step 3: Select Evaluators
Evaluators are the scoring functions that assess your AI’s performance.

Available Evaluators
triage_accuracy

Compares predicted triage level against expected.

Output metrics:
- accuracy: Overall accuracy percentage
- under_triage_rate: Rate of dangerous under-classification
- over_triage_rate: Rate of over-classification
- confusion_matrix: Full breakdown by class
red_flag_detection

Checks if critical symptoms were identified.

Output metrics:
- recall: Percentage of red flags caught
- precision: Accuracy of red flag calls
- f1: Balanced score
- missed_flags: List of missed critical symptoms
guideline_compliance

Measures adherence to clinical protocols.

Output metrics:
- compliance_score: Overall compliance percentage
- steps_followed: Number of protocol steps followed
- deviations: List of protocol deviations
symptom_extraction

Evaluates symptom identification accuracy.

Output metrics:
- entity_f1: F1 score for entity extraction
- precision: Extraction precision
- recall: Extraction recall
- false_positives: Incorrectly identified symptoms
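To make the metric lists concrete, here is an illustrative shape for one red_flag_detection result; the key names come from the list above, but the nesting and values are placeholders, not real output.

```python
# Illustrative payload only: keys from the metric list above, values invented.
red_flag_result = {
    "recall": 0.96,                      # percentage of red flags caught
    "precision": 0.88,                   # accuracy of red flag calls
    "f1": 0.92,                          # balanced score
    "missed_flags": ["slurred speech"],  # missed critical symptoms
}
```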
Step 4: Run the Evaluation
Now let’s execute the evaluation:
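A sketch of kicking off a run; `evaluations.run` and its parameters are assumed names that mirror the concepts in this guide.

```python
# Hypothetical run call; evaluator names match the list in Step 3.
evaluation = client.evaluations.run(
    dataset=dataset,
    evaluators=[
        "triage_accuracy",
        "red_flag_detection",
        "guideline_compliance",
        "symptom_extraction",
    ],
)
```

Monitor Progress

For long runs, a simple polling loop might look like this, assuming the evaluation object exposes `status` and `reload`:

```python
import time

# Hypothetical polling; the status values and reload() are assumptions.
while evaluation.status in ("queued", "running"):
    time.sleep(5)
    evaluation.reload()

print(f"Evaluation finished with status: {evaluation.status}")
```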
Step 5: View Results
Summary Metrics
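A sketch of reading aggregate metrics; `summary()` and the key layout are assumptions that mirror the evaluator docs in Step 3.

```python
# Hypothetical accessor; keys mirror the evaluator names and metrics above.
summary = evaluation.summary()
print(summary["triage_accuracy"]["accuracy"])
print(summary["red_flag_detection"]["recall"])
print(summary["guideline_compliance"]["compliance_score"])
```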
Per-Sample Results
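Inspecting individual samples might look like the loop below; `results()`, `passed`, `sample_id`, and `scores` are illustrative names, not confirmed attributes.

```python
# Hypothetical iteration over per-sample results to surface failures.
for result in evaluation.results():
    if not result.passed:
        print(result.sample_id, result.scores)
```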
View in Dashboard
- Navigate to your project in app.rubric.ai
- Click Evaluations in the sidebar
- Click on your evaluation
- Explore:
  - Summary: Overall metrics and trends
  - Samples: Per-sample breakdown with filtering
  - Issues: Failed samples grouped by error type
  - Compare: Side-by-side with previous evaluations
Next Steps
- Core Concepts: Understand Rubric’s data model
- Evaluation Lifecycle: Learn about evaluation states and triggers
- Human Review: Route flagged samples to clinicians
- API Reference: Full API documentation
