
Data Modalities

Rubric supports evaluation across all major healthcare AI data types:

Voice & Audio

Patient calls, triage conversations, voice assistants

Clinical Notes

SOAP notes, discharge summaries, visit documentation

Medical Imaging

DICOM studies, X-rays, CT, MRI, pathology slides

Voice & Audio

  • Audio file support: WAV, MP3, M4A, FLAC up to 2 hours
  • Transcript formats: JSON with speaker labels and timestamps (example below)
  • Real-time streaming: WebSocket API for live call evaluation
  • Multi-speaker: Automatic speaker diarization
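
A transcript payload with speaker labels and timestamps might look like the sketch below. The field names (segments, speaker, start, end) are illustrative assumptions, not the exact schema; see the API reference for the canonical format.
transcript_example.py
# Illustrative transcript structure with speaker labels and timestamps.
# Field names are assumptions for the sake of example.
transcript = {
    "audio_url": "s3://bucket/call.wav",
    "duration_sec": 412,
    "segments": [
        {"speaker": "agent", "start": 0.0, "end": 4.2,
         "text": "Thanks for calling, how can I help you today?"},
        {"speaker": "patient", "start": 4.6, "end": 11.3,
         "text": "I've had chest pain since this morning."},
    ],
}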

Clinical Notes

  • Document types: SOAP, H&P, progress notes, discharge summaries
  • Structured extraction: ICD-10, CPT, and SNOMED CT code validation (example below)
  • Section parsing: Automatic section identification
  • Template support: Custom documentation templates
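
As a minimal sketch of structured extraction in practice, a note-evaluation log call can pair the note text with the codes the model extracted so they can be validated. This assumes the client.log call shown in the Logging Example later on this page; the output field names (icd10_codes, cpt_codes, sections) are illustrative, not a fixed schema.
note_example.py
from rubric import Rubric

client = Rubric()

# Minimal sketch: log a SOAP note and the codes the model extracted from it so
# the structured-extraction evaluators can validate them. Field names are
# illustrative assumptions.
note_text = "S: 45M with 3 days of productive cough... A: Community-acquired pneumonia. P: ..."

client.log(
    project="clinical-notes",
    input={"note_type": "SOAP", "note_text": note_text},
    output={
        "icd10_codes": ["J18.9"],   # extracted diagnosis codes to validate
        "cpt_codes": ["99213"],     # extracted procedure codes
        "sections": ["subjective", "objective", "assessment", "plan"],
    },
)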

Medical Imaging (DICOM)

  • Modalities: CR, CT, MR, US, PT, MG, DX, and more
  • PACS integration: DICOMweb (WADO-RS, STOW-RS)
  • Coordinate systems: Pixel, anatomical, and normalized coordinates (example below)
  • Series handling: Multi-frame and multi-series support
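
To make the study reference and coordinate options concrete, the sketch below shows a finding tied to a DICOMweb study URL and localized with a pixel-coordinate bounding box. The URL, field names, and coordinate convention are assumptions for illustration only.
imaging_example.py
# Illustrative imaging payload: a DICOMweb study reference plus a finding
# localized with a pixel-coordinate bounding box. Field names are assumptions.
finding = {
    "study_url": "https://pacs.example.org/dicomweb/studies/1.2.840.113619.2.55.3",
    "modality": "DX",
    "finding": "right lower lobe opacity",
    "bounding_box": {
        "coordinate_system": "pixel",
        "x": 1024, "y": 1480, "width": 310, "height": 265,
    },
}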

Evaluation Framework

The core of Rubric — automated clinical evaluation powered by healthcare-specific evaluators.

Evaluation Types

  • Correctness: Validates that AI model outputs are correct and match expected results. Use cases: classification accuracy, entity extraction, structured output validation
  • Safety: Evaluates whether AI outputs meet clinical safety standards and don't cause patient harm. Checks: red flag detection, contraindication identification, escalation appropriateness
  • Hallucination detection: Identifies when AI generates information not grounded in the source data. Methods: citation verification, fact checking, source attribution analysis
  • Completeness: Measures whether AI captured all relevant information from the input. Metrics: recall, coverage score, missing element identification

Metrics

  • Clinical Accuracy: Validates medical information correctness against clinical guidelines
  • Sensitivity / Specificity: Measures true positive and true negative rates for clinical decisions (example below)
  • Rubric-Based Scoring: Multi-dimensional scoring using customizable clinical rubrics
  • Custom Metrics: Define your own metrics for specialized evaluation needs
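
Sensitivity and specificity follow the standard definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The short example below computes both from a confusion matrix for a binary escalate / don't-escalate decision; it is plain Python, not Rubric-specific code.
sensitivity_specificity.py
# Sensitivity and specificity from a binary confusion matrix (standard definitions).
def sensitivity_specificity(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    sensitivity = tp / (tp + fn)   # true positive rate: urgent cases correctly escalated
    specificity = tn / (tn + fp)   # true negative rate: non-urgent cases correctly not escalated
    return sensitivity, specificity

# 92 urgent cases caught, 8 missed; 870 non-urgent handled correctly, 30 over-escalated
print(sensitivity_specificity(tp=92, fp=30, tn=870, fn=8))  # (0.92, ~0.967)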

Human Review Design

Configure how clinical experts review AI outputs:
  • Review templates: Pre-built forms for common clinical review tasks
  • Grading rubrics: Multi-criteria scoring with weighted dimensions
  • Annotation tools: Highlight, comment, and label AI outputs
  • Side-by-side comparison: View AI output alongside source data
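
To make the weighted-dimension idea concrete, here is a minimal sketch of how a multi-criteria score can combine per-dimension scores and weights. The dimension names and weights are illustrative assumptions, not a built-in rubric.
weighted_rubric.py
# Illustrative weighted rubric: each dimension is scored 1-5 by the reviewer
# and weighted into an overall score. Names and weights are example values.
rubric_scores = {
    "clinical_accuracy": {"weight": 0.5, "score": 4},
    "completeness":      {"weight": 0.3, "score": 5},
    "communication":     {"weight": 0.2, "score": 3},
}

overall = sum(d["weight"] * d["score"] for d in rubric_scores.values())
print(round(overall, 2))  # 4.1 (weights sum to 1.0)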

Consensus & Disagreement Handling

When multiple reviewers evaluate the same output:
  • Multi-reviewer assignment: Route samples to 2+ reviewers for consensus
  • Adjudication workflows: Escalate disagreements to senior reviewers
  • Inter-rater reliability: Calculate Cohen's kappa and agreement metrics (example below)
  • Tie-breaking rules: Configurable resolution for split decisions
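
Cohen's kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two reviewers' labels using the standard formula; it is independent of any Rubric API.
cohens_kappa.py
# Cohen's kappa for two reviewers labeling the same outputs (standard formula).
from collections import Counter

def cohens_kappa(reviewer_a: list[str], reviewer_b: list[str]) -> float:
    n = len(reviewer_a)
    observed = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n
    counts_a, counts_b = Counter(reviewer_a), Counter(reviewer_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in counts_a | counts_b) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["urgent", "routine", "urgent", "routine", "urgent"]
b = ["urgent", "routine", "routine", "routine", "urgent"]
print(round(cohens_kappa(a, b), 2))  # 0.62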

Evaluation Versioning

Track changes to your evaluation configurations over time:
  • Version history: Full audit trail of evaluation changes
  • Rollback support: Revert to previous evaluation versions
  • Change comparison: Diff view between evaluation versions
  • Release management: Tag and deploy evaluation versions

Comparing Model Runs

Compare model versions, prompts, or configurations with statistical rigor.
experiments.py
from rubric import Rubric

client = Rubric()

# Run A/B evaluation
experiment = client.experiments.create(
    name="Triage Model v2 vs v3",
    project="patient-triage",
    dataset="ds_golden_test",

    variants=[
        {"name": "v2-baseline", "model": "triage-v2"},
        {"name": "v3-candidate", "model": "triage-v3"}
    ],

    evaluators=[
        {"type": "triage_accuracy"},
        {"type": "red_flag_detection"},
        {"type": "latency"}
    ],

    # Statistical config
    significance_level=0.05,
    min_sample_size=500
)

# Get comparison results
results = client.experiments.get_results(experiment.id)
print(f"Winner: {results.winner}")
print(f"Improvement: {results.improvement_pct}%")
print(f"Significant: {results.is_significant}")

Reproducibility Guarantees

Ensure consistent evaluation results:
  • Deterministic evaluation: Seeded random sampling and consistent ordering
  • Environment pinning: Lock evaluator versions and dependencies
  • Input hashing: Verify dataset integrity across runs
  • Audit logging: Complete record of evaluation parameters and results
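
Input hashing, for example, can be as simple as a stable digest over the dataset records; the sketch below uses SHA-256 over a canonical JSON serialization. It illustrates the concept rather than Rubric's internal mechanism.
dataset_fingerprint.py
# Illustrative dataset fingerprint: a stable SHA-256 over canonicalized records,
# so two runs can verify they evaluated identical inputs.
import hashlib
import json

def dataset_fingerprint(records: list[dict]) -> str:
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

records = [
    {"id": "s1", "transcript": "..."},
    {"id": "s2", "transcript": "..."},
]
print(dataset_fingerprint(records))  # compare against a stored hash before a run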

Observability & Logging

Real-time visibility into your healthcare AI in production.

Structured Logging

Log inputs, outputs, and metadata with healthcare-specific schemas

Real-time Dashboard

Monitor evaluation metrics, error rates, and trends

Alerting

Get notified when metrics degrade or safety thresholds are breached

Tracing

Track requests through multi-step AI pipelines

Logging Example

logging.py
from rubric import Rubric

client = Rubric()

# Log with full context
client.log(
    project="patient-triage",
    
    input={
        "transcript": transcript,
        "audio_url": "s3://bucket/call.wav",
        "patient_context": {"age": 45, "sex": "M"}
    },
    
    output={
        "triage_level": "urgent",
        "symptoms": ["chest_pain", "shortness_of_breath"],
        "recommended_action": "schedule_same_day"
    },
    
    metadata={
        "model_version": "v2.3.1",
        "latency_ms": 234,
        "call_id": "call_abc123"
    }
)

Human Expert Network

Route AI outputs to clinical experts for review, feedback, and ground truth generation.

Who Reviews

Our network includes credentialed healthcare professionals across specialties:

Physicians

Board-certified MDs and DOs across specialties

Nurses

RNs and NPs with clinical experience

Coders

Certified medical coders (CPC, CCS, RHIA)

Dietitians

Registered dietitians and nutritionists

Mental Health Coaches

Licensed counselors and therapists

Allied Health

Physical therapists, pharmacists, and more

Credentialing & Verification

All reviewers undergo rigorous verification:
  • License verification: Active licenses confirmed with state boards
  • Education validation: Degrees verified with institutions
  • Background check: Criminal and sanctions screening
  • Skills assessment: Domain-specific competency testing
  • Ongoing monitoring: Continuous license and sanctions monitoring

Reviewer Assignment Logic

Intelligent matching of reviews to qualified experts:
  • Credential matching: Route to reviewers with appropriate licenses
  • Specialty alignment: Match clinical domain expertise
  • Workload balancing: Distribute work evenly across pool
  • Availability windows: Respect reviewer schedules and time zones
  • Performance-based routing: Prioritize high-quality reviewers
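
As a rough sketch of the matching criteria above (credential, specialty, availability, workload), assignment can be thought of as a filter followed by a least-loaded pick. The real matcher also factors in schedules and reviewer performance; the code below is illustrative only.
assignment_sketch.py
# Illustrative assignment logic: filter by credential, specialty, and
# availability, then pick the reviewer with the fewest open reviews.
def assign_reviewer(task: dict, reviewers: list[dict]) -> dict | None:
    eligible = [
        r for r in reviewers
        if task["required_credential"] in r["credentials"]
        and task["specialty"] in r["specialties"]
        and r["available"]
    ]
    return min(eligible, key=lambda r: r["open_reviews"], default=None)

reviewers = [
    {"name": "Reviewer A", "credentials": ["MD"], "specialties": ["cardiology"],
     "available": True, "open_reviews": 4},
    {"name": "Reviewer B", "credentials": ["MD"], "specialties": ["cardiology"],
     "available": True, "open_reviews": 1},
]
task = {"required_credential": "MD", "specialty": "cardiology"}
print(assign_reviewer(task, reviewers)["name"])  # Reviewer B (lowest current workload)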

Conflict-of-Interest Controls

Ensure unbiased reviews:
  • Blinded review: Hide customer identity from reviewers
  • Employer exclusions: Block reviews of competitor organizations
  • Relationship declarations: Reviewers disclose potential conflicts
  • Rotation policies: Prevent over-familiarity with specific outputs

Quality Assurance & Calibration

Maintain consistent, high-quality reviews:
  • Gold standard datasets: Test reviewers against known-correct answers
  • Inter-rater reliability: Monitor agreement across reviewers
  • Calibration sessions: Regular alignment on scoring criteria
  • Performance tracking: Individual reviewer quality metrics
  • Feedback loops: Share aggregated feedback with reviewers

Security & Compliance

HIPAA Compliant

BAA available, PHI handling, audit logs

SOC 2 Type II

Annual audits, security controls

Encryption

AES-256 at rest, TLS 1.3 in transit

Access Control

RBAC, SSO, MFA support

Data Handling

  • PHI De-identification: Automatic PII/PHI detection and redaction
  • Data Residency: Choose US, EU, or custom regions
  • Retention Policies: Configurable retention with secure deletion
  • Audit Logging: Complete audit trail for compliance
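
To give a feel for what PHI de-identification does, the sketch below applies pattern-based redaction to a few identifier types. Rubric's detection is automatic and covers far more than these three patterns; this is purely illustrative.
redaction_sketch.py
# Illustrative pattern-based redaction of a few identifier types.
import re

PATTERNS = {
    "[PHONE]": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[MRN]":   re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

def redact(text: str) -> str:
    for token, pattern in PATTERNS.items():
        text = pattern.sub(token, text)
    return text

print(redact("Patient MRN: 00482913, callback 555-867-5309."))
# -> "Patient [MRN], callback [PHONE]."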

Integrations

EHR Systems

Epic, Cerner, Meditech

Voice Platforms

Twilio, Vonage, Amazon Connect

PACS

DICOMweb, Orthanc, dcm4chee

LLM Providers

OpenAI, Anthropic, Azure, AWS Bedrock

CI/CD

GitHub Actions, GitLab CI, Jenkins

Monitoring

Datadog, Grafana, PagerDuty

Next Steps