
Overview

This guide walks you through creating a complete evaluation from scratch:
  1. Create a dataset to hold your test cases
  2. Add sample data (transcripts, decisions)
  3. Select evaluators to run
  4. Execute the evaluation
  5. View and interpret results
Estimated time: 10-15 minutes

Step 1: Create a Dataset

Datasets are collections of samples that you evaluate together. Think of them as your test sets.
from rubric import Rubric

client = Rubric()

# Create a dataset
dataset = client.datasets.create(
    name="Triage Test Set - Q1 2025",
    description="Golden test set for triage model evaluation",
    project="patient-triage",
    
    # Optional: add tags for organization
    tags=["golden-set", "quarterly-eval"]
)

print(f"Created dataset: {dataset.id}")
# Output: Created dataset: ds_abc123

Step 2: Add Samples

Samples are individual test cases with inputs, AI outputs, and expected results.

Sample Structure

Each sample contains:
  • input (required): The data the AI received (transcript, audio URL, etc.)
  • output (required): What the AI produced (triage level, symptoms, etc.)
  • expected (optional, but recommended): Ground truth for evaluation
  • metadata (optional): Additional context (call ID, model version, etc.)

Add Samples via SDK

# Add a single sample
sample = client.samples.create(
    dataset="ds_abc123",
    
    input={
        "transcript": [
            {"speaker": "agent", "text": "How can I help you today?"},
            {"speaker": "patient", "text": "I have severe chest pain radiating to my left arm."},
            {"speaker": "agent", "text": "Are you experiencing shortness of breath?"},
            {"speaker": "patient", "text": "Yes, and I'm sweating a lot."}
        ],
        "patient_context": {
            "age": 58,
            "sex": "male"
        }
    },
    
    output={
        "triage_level": "emergency",
        "symptoms": ["chest_pain", "arm_radiation", "dyspnea", "diaphoresis"],
        "red_flags": ["acs_symptoms"],
        "recommended_action": "call_911",
        "confidence": 0.95
    },
    
    expected={
        "triage_level": "emergency",
        "should_escalate": True,
        "correct_protocol": "chest_pain_acs"
    },
    
    metadata={
        "call_id": "call_12345",
        "model_version": "v2.3.1",
        "annotator": "dr_smith"
    }
)

print(f"Added sample: {sample.id}")

Batch Upload

For larger datasets, use batch upload:
# Prepare samples
samples = [
    {
        "input": {...},
        "output": {...},
        "expected": {...}
    },
    # ... more samples
]

# Batch upload (up to 1000 samples per call)
result = client.samples.batch_create(
    dataset="ds_abc123",
    samples=samples
)

print(f"Added {result.created} samples")
print(f"Failed: {result.failed}")

Upload from File

# Upload from JSONL file
client.samples.upload(
    dataset="ds_abc123",
    file="test_cases.jsonl"
)

# Upload from CSV
client.samples.upload(
    dataset="ds_abc123",
    file="test_cases.csv",
    mapping={
        "input.transcript": "transcript_column",
        "output.triage_level": "ai_triage",
        "expected.triage_level": "expert_triage"
    }
)
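
The on-disk format for JSONL isn't shown above; a reasonable assumption is one JSON object per line using the same input / output / expected / metadata fields as samples.create. A sketch of writing such a file with the standard library (the exact schema expected by samples.upload is an assumption here):

import json

# One sample per line; the field layout mirrors the samples.create example above.
jsonl_samples = [
    {
        "input": {"transcript": [{"speaker": "patient", "text": "I have a mild headache."}]},
        "output": {"triage_level": "routine"},
        "expected": {"triage_level": "routine"},
        "metadata": {"call_id": "call_67890"}
    }
]

with open("test_cases.jsonl", "w") as f:
    for sample in jsonl_samples:
        f.write(json.dumps(sample) + "\n")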

Step 3: Select Evaluators

Evaluators are the scoring functions that assess your AI’s performance.

Available Evaluators

Triage Accuracy (triage_accuracy)

Compares the predicted triage level against the expected level.

Output metrics:
  • accuracy: Overall accuracy percentage
  • under_triage_rate: Rate of dangerous under-classification
  • over_triage_rate: Rate of over-classification
  • confusion_matrix: Full breakdown by class
Configuration:
{
    "type": "triage_accuracy",
    "config": {
        "levels": ["emergency", "urgent", "semi-urgent", "routine"],
        "severity_weights": {
            "under_triage": 5.0,  # Penalize missing emergencies
            "over_triage": 1.0   # Acceptable to over-classify
        }
    }
}
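
The severity_weights express that missing an emergency (under-triage) is treated as far more costly than over-escalating. Conceptually, a weighted penalty might be aggregated like the sketch below; this is illustrative only, not the evaluator's actual implementation:

# Illustrative only: aggregating classification errors with severity weights.
weights = {"under_triage": 5.0, "over_triage": 1.0}

def weighted_penalty(counts):
    # counts, e.g. {"correct": 89, "over_triage": 9, "under_triage": 2}
    total = sum(counts.values())
    penalty = sum(weights.get(kind, 0.0) * n for kind, n in counts.items())
    return penalty / total

print(weighted_penalty({"correct": 89, "over_triage": 9, "under_triage": 2}))  # 0.19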

Red Flag Detection (red_flag_detection)

Checks whether critical symptoms were identified.

Output metrics:
  • recall: Percentage of red flags caught
  • precision: Accuracy of red flag calls
  • f1: Balanced score
  • missed_flags: List of missed critical symptoms
Configuration:
{
    "type": "red_flag_detection",
    "config": {
        "protocols": ["chest_pain", "stroke", "sepsis"],
        "require_all_flags": False,
        "minimum_recall": 0.99
    }
}

Guideline Compliance (guideline_compliance)

Measures adherence to clinical protocols.

Output metrics:
  • compliance_score: Overall compliance percentage
  • steps_followed: Number of protocol steps followed
  • deviations: List of protocol deviations
Configuration:
{
    "type": "guideline_compliance",
    "config": {
        "guideline": "schmitt_thompson",
        "required_questions": ["pain_location", "duration", "severity"],
        "strict_mode": False
    }
}

Symptom Extraction (symptom_extraction)

Evaluates symptom identification accuracy.

Output metrics:
  • entity_f1: F1 score for entity extraction
  • precision: Extraction precision
  • recall: Extraction recall
  • false_positives: Incorrectly identified symptoms
Configuration:
{
    "type": "symptom_extraction",
    "config": {
        "entity_types": ["symptom", "medication", "condition"],
        "partial_match": True,
        "case_sensitive": False
    }
}

Step 4: Run the Evaluation

Now let’s execute the evaluation:
# Create and run evaluation
evaluation = client.evaluations.create(
    name="Triage Model v2.3 - Quarterly Review",
    project="patient-triage",
    dataset="ds_abc123",
    
    evaluators=[
        {
            "type": "triage_accuracy",
            "config": {
                "severity_weights": {"under_triage": 5.0, "over_triage": 1.0}
            }
        },
        {
            "type": "red_flag_detection",
            "config": {
                "protocols": ["chest_pain", "stroke", "sepsis"]
            }
        },
        {
            "type": "guideline_compliance",
            "config": {
                "guideline": "schmitt_thompson"
            }
        }
    ],
    
    metadata={
        "model_version": "v2.3.1",
        "triggered_by": "quarterly_review"
    }
)

print(f"Evaluation started: {evaluation.id}")
print(f"Status: {evaluation.status}")

Monitor Progress

# Poll for completion
import time

while True:
    status = client.evaluations.get_status(evaluation.id)
    
    print(f"Progress: {status.completed}/{status.total} samples")
    
    if status.status == "completed":
        break
    elif status.status == "failed":
        raise Exception(f"Evaluation failed: {status.error}")
    
    time.sleep(5)

# Or use the convenience method
result = client.evaluations.wait(evaluation.id, timeout=300)

Step 5: View Results

Summary Metrics

# Get evaluation results
results = client.evaluations.get(evaluation.id)

print("=== Evaluation Results ===")
print(f"Overall Score: {results.score}%")
print()
print("Metrics:")
for metric, value in results.metrics.items():
    print(f"  {metric}: {value}")
Example output:
=== Evaluation Results ===
Overall Score: 87%

Metrics:
  triage_accuracy: 0.89
  under_triage_rate: 0.02
  over_triage_rate: 0.09
  red_flag_recall: 0.97
  red_flag_precision: 0.91
  guideline_compliance: 0.85

Per-Sample Results

# Get detailed per-sample results
sample_results = client.evaluations.get_samples(evaluation.id)

for sample in sample_results:
    print(f"Sample {sample.id}:")
    print(f"  Score: {sample.score}")
    print(f"  Issues: {sample.issues}")
    print()
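
To focus review time on problem cases, you can filter the per-sample results locally. A small sketch using the same sample_results objects as above (the 0.8 cutoff is an arbitrary assumption; pick a threshold that matches your quality bar):

# Collect the lowest-scoring samples for manual review.
THRESHOLD = 0.8  # assumed cutoff, not defined by the platform

flagged = [s for s in sample_results if s.score < THRESHOLD]
flagged.sort(key=lambda s: s.score)

print(f"{len(flagged)} samples scored below {THRESHOLD}")
for s in flagged[:10]:
    print(f"  {s.id}: score={s.score}, issues={s.issues}")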

View in Dashboard

  1. Navigate to your project in app.rubric.ai
  2. Click Evaluations in the sidebar
  3. Click on your evaluation
  4. Explore:
    • Summary: Overall metrics and trends
    • Samples: Per-sample breakdown with filtering
    • Issues: Failed samples grouped by error type
    • Compare: Side-by-side with previous evaluations

Next Steps