Overview
This guide walks you through creating a complete evaluation from scratch:
- Create a dataset to hold your test cases
- Add sample data (transcripts, decisions)
- Select evaluators to run
- Execute the evaluation
- View and interpret results
Estimated time: 10-15 minutes
Step 1: Create a Dataset
Datasets are collections of samples that you evaluate together. Think of them as your test sets. You can create a dataset with either the Python SDK or the Dashboard:
- Python SDK (see the sketch below)
- Dashboard
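For the SDK path, a minimal sketch is shown below. The `rubric` package name, the `Client` constructor, and the `create_dataset` method are assumptions for illustration; check the SDK reference for the exact names.

```python
# Hypothetical sketch: create a dataset with the Python SDK.
# Package name, client constructor, and method name are assumptions.
from rubric import Client

client = Client(api_key="YOUR_API_KEY")  # assumed authentication style

dataset = client.create_dataset(
    name="triage-regression-suite",
    description="Transcripts and expected triage decisions",
)
print(dataset.id)  # keep the ID for adding samples in Step 2
```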
Step 2: Add Samples
Samples are individual test cases with inputs, AI outputs, and expected results.
Sample Structure
Each sample contains:

| Field | Description | Required |
|---|---|---|
| input | The data the AI received (transcript, audio URL, etc.) | Yes |
| output | What the AI produced (triage level, symptoms, etc.) | Yes |
| expected | Ground truth for evaluation | No (but recommended) |
| metadata | Additional context (call ID, model version, etc.) | No |
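As a concrete illustration of that structure, a single sample might look like the dictionary below; the values are invented for a triage scenario.

```python
# Illustrative sample following the field layout above (values are invented).
sample = {
    "input": {"transcript": "Caller reports chest pain radiating to the left arm..."},
    "output": {"triage_level": "urgent", "symptoms": ["chest pain"]},
    "expected": {"triage_level": "emergency", "symptoms": ["chest pain", "arm pain"]},
    "metadata": {"call_id": "call_1234", "model_version": "v2.1"},
}
```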
Add Samples via SDK
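A sketch of adding a single sample, reusing the `client` and `dataset` from Step 1 and the `sample` dictionary above; the `add_sample` method and its keyword arguments are assumptions about the SDK surface.

```python
# Hypothetical sketch: add one sample to the dataset created in Step 1.
# `add_sample` and its keyword arguments are assumptions.
client.add_sample(
    dataset_id=dataset.id,
    input=sample["input"],
    output=sample["output"],
    expected=sample["expected"],
    metadata=sample["metadata"],
)
```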
Batch Upload
For larger datasets, use batch upload:
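A sketch of a batch call, assuming the SDK exposes something like `add_samples` that accepts a list of sample dictionaries (both the method name and payload shape are assumptions).

```python
# Hypothetical sketch: add many samples in one call.
# `add_samples` and its arguments are assumptions about the SDK.
samples = [
    {
        "input": {"transcript": "Caller reports a persistent cough for two weeks..."},
        "output": {"triage_level": "routine"},
        "expected": {"triage_level": "routine"},
        "metadata": {"call_id": "call_1235"},
    },
    {
        "input": {"transcript": "Caller reports sudden severe headache and confusion..."},
        "output": {"triage_level": "urgent"},
        "expected": {"triage_level": "emergency"},
        "metadata": {"call_id": "call_1236"},
    },
]

client.add_samples(dataset_id=dataset.id, samples=samples)
```

Upload from File
If your samples already live on disk, you can load them yourself and reuse the same batch call. The JSONL layout (one sample object per line, matching the field table above) is an assumption about how you store them, not a required format.

```python
import json

# Hypothetical sketch: read samples from a local JSONL file and batch-upload them.
with open("samples.jsonl", "r", encoding="utf-8") as f:
    file_samples = [json.loads(line) for line in f if line.strip()]

client.add_samples(dataset_id=dataset.id, samples=file_samples)
```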
Step 3: Select Evaluators
Evaluators are the scoring functions that assess your AI’s performance.
Available Evaluators
triage_accuracy
Compares predicted triage level against expected.
Output metrics:
- accuracy: Overall accuracy percentage
- under_triage_rate: Rate of dangerous under-classification
- over_triage_rate: Rate of over-classification
- confusion_matrix: Full breakdown by class
red_flag_detection
Checks if critical symptoms were identified.
Output metrics:
- recall: Percentage of red flags caught
- precision: Accuracy of red flag calls
- f1: Balanced score
- missed_flags: List of missed critical symptoms
guideline_compliance
Measures adherence to clinical protocols.
Output metrics:
- compliance_score: Overall compliance percentage
- steps_followed: Number of protocol steps followed
- deviations: List of protocol deviations
symptom_extraction
Evaluates symptom identification accuracy.
Output metrics:
- entity_f1: F1 score for entity extraction
- precision: Extraction precision
- recall: Extraction recall
- false_positives: Incorrectly identified symptoms
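The run in Step 4 needs to know which evaluators to apply. Exactly how they are referenced is an assumption, but a plain list of the identifiers above is a reasonable sketch:

```python
# Evaluator identifiers described above; passing them as strings to the run
# call in Step 4 is an assumption about the SDK.
evaluators = [
    "triage_accuracy",
    "red_flag_detection",
    "guideline_compliance",
    "symptom_extraction",
]
```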
Step 4: Run the Evaluation
Now let’s execute the evaluation:
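A sketch of starting a run, reusing the `client`, `dataset`, and `evaluators` from the earlier steps; the `run_evaluation` method and its arguments are assumptions.

```python
# Hypothetical sketch: start an evaluation run over the dataset from Step 1
# using the evaluators selected in Step 3.
evaluation = client.run_evaluation(
    dataset_id=dataset.id,
    evaluators=evaluators,
    name="nightly-triage-eval",
)
print(evaluation.id, evaluation.status)
```

Monitor Progress
Runs are typically asynchronous, so you may want to poll until they finish. The `get_evaluation` call, the status values, and the `progress` attribute below are assumptions.

```python
import time

# Hypothetical polling loop; status names and attributes are assumptions.
while True:
    evaluation = client.get_evaluation(evaluation.id)
    print(f"status={evaluation.status} progress={evaluation.progress:.0%}")
    if evaluation.status in ("completed", "failed"):
        break
    time.sleep(10)
```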
Step 5: View Results
Summary Metrics
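Once the run completes, you can pull the aggregate scores. The `summary` attribute and its keys below mirror the evaluator metrics from Step 3, but the exact field names are assumptions.

```python
# Hypothetical sketch: print overall metrics for the completed run.
summary = evaluation.summary  # assumed attribute holding aggregate metrics
print(summary["triage_accuracy"]["accuracy"])
print(summary["red_flag_detection"]["recall"])
print(summary["guideline_compliance"]["compliance_score"])
```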
Per-Sample Results
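To inspect individual test cases, for example to find mis-triaged calls, a loop like the one below could work; the `get_results` accessor and the per-result fields are assumptions.

```python
# Hypothetical sketch: iterate per-sample results and flag triage mismatches.
for result in client.get_results(evaluation.id):  # assumed accessor
    predicted = result.output.get("triage_level")
    expected = result.expected.get("triage_level")
    if predicted != expected:
        print(result.sample_id, f"predicted={predicted} expected={expected}")
```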
View in Dashboard
- Navigate to your project in app.rubric.ai
- Click Evaluations in the sidebar
- Click on your evaluation
- Explore:
  - Summary: Overall metrics and trends
  - Samples: Per-sample breakdown with filtering
  - Issues: Failed samples grouped by error type
  - Compare: Side-by-side with previous evaluations
