
Why CI/CD Evaluation?

Healthcare AI requires rigorous testing before deployment. By integrating evaluation into your CI/CD pipeline, you can:
  • Automatically validate every model change against safety criteria
  • Catch regressions before they reach patients
  • Maintain audit trails of all evaluation results
  • Enforce quality gates for production deployment
Estimated Time: 20 minutes
Prerequisites: CI/CD system (GitHub Actions, GitLab, etc.), Rubric API key
Difficulty: Intermediate

Architecture Overview

The Evaluate stage sits between your existing tests and deployment: each run submits the candidate model to the Rubric API, automated scoring (plus asynchronous human review) produces a pass/fail result, and only a pass lets the Deploy stage proceed.

┌─────────────────────────────────────────────────────────────────┐
│                        CI/CD Pipeline                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐ │
│   │   Build  │───▶│   Test   │───▶│ Evaluate │───▶│  Deploy  │ │
│   └──────────┘    └──────────┘    └────┬─────┘    └──────────┘ │
│                                        │                        │
│                                        ▼                        │
│                                  ┌──────────┐                   │
│                                  │  Rubric  │                   │
│                                  │   API    │                   │
│                                  └────┬─────┘                   │
│                                       │                         │
│                              ┌────────┴────────┐                │
│                              ▼                 ▼                │
│                        ┌──────────┐     ┌──────────┐           │
│                        │ Automated│     │  Human   │           │
│                        │  Scoring │     │  Review  │           │
│                        └────┬─────┘     └────┬─────┘           │
│                              │               │                  │
│                              └───────┬───────┘                  │
│                                      ▼                          │
│                               ┌──────────┐                      │
│                               │  Pass/   │                      │
│                               │  Fail    │                      │
│                               └──────────┘                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Step 1: Configure Evaluation Definition

First, create a versioned evaluation definition that will be used consistently across all pipeline runs.
evaluations/triage_safety.yaml
# evaluations/triage_safety.yaml
# Store in your repository for version control

name: Triage Safety Gate
version: 1.2.0

evaluators:
  - type: triage_accuracy
    version: 1.0.0
    config:
      levels: [emergent, urgent, semi_urgent, routine]
      severity_weights:
        under_triage: 5.0
        over_triage: 1.0

  - type: red_flag_detection
    version: 1.0.0
    config:
      protocols: [chest_pain_v2, stroke_fast, sepsis]
      critical_miss_penalty: 100

# Quality gates - must pass ALL to deploy
gates:
  - name: triage_accuracy_gate
    metric: triage_accuracy
    operator: gte
    threshold: 0.85

  - name: red_flag_sensitivity_gate
    metric: red_flag_sensitivity
    operator: gte
    threshold: 0.98

  - name: no_critical_misses
    metric: critical_miss_count
    operator: eq
    threshold: 0

# Compare against baseline
baseline:
  type: production
  model_tag: production-current
  max_regression:
    triage_accuracy: 0.02
    red_flag_sensitivity: 0.01
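
Because this definition is versioned and shared by every pipeline run, a quick structural check before the pipeline consumes it can catch malformed gates early. A minimal sketch; the script name and the specific checks are illustrative, not part of the Rubric tooling:

scripts/validate_eval_config.py
#!/usr/bin/env python3
"""Sanity-check an evaluation definition before CI uses it (illustrative sketch)."""

import sys
import yaml

REQUIRED_TOP_LEVEL = {"name", "version", "evaluators", "gates"}
VALID_OPERATORS = {"gte", "lte", "eq"}


def validate(path):
    with open(path) as f:
        config = yaml.safe_load(f)

    errors = []
    missing = REQUIRED_TOP_LEVEL - config.keys()
    if missing:
        errors.append(f"missing top-level keys: {sorted(missing)}")

    for gate in config.get("gates", []):
        if gate.get("operator") not in VALID_OPERATORS:
            errors.append(f"gate {gate.get('name')}: unknown operator {gate.get('operator')!r}")
        if not isinstance(gate.get("threshold"), (int, float)):
            errors.append(f"gate {gate.get('name')}: threshold must be numeric")

    return errors


if __name__ == "__main__":
    problems = validate(sys.argv[1])
    for problem in problems:
        print(f"❌ {problem}")
    sys.exit(1 if problems else 0)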

Step 2: GitHub Actions Integration

.github/workflows/model-evaluation.yml
name: Model Evaluation

on:
  pull_request:
    paths:
      - 'models/**'
      - 'prompts/**'
  push:
    branches: [main]
    paths:
      - 'models/**'
      - 'prompts/**'

env:
  RUBRIC_API_KEY: ${{ secrets.RUBRIC_API_KEY }}

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install rubric pytest
          pip install -r requirements.txt

      - name: Build model artifacts
        run: |
          python scripts/build_model.py

      - name: Run Rubric Evaluation
        id: evaluation
        run: |
          python scripts/run_evaluation.py \
            --model-path ./dist/model \
            --config ./evaluations/triage_safety.yaml \
            --dataset ds_validation_v2 \
            --output ./evaluation_results.json

      - name: Check Quality Gates
        run: |
          python scripts/check_gates.py \
            --results ./evaluation_results.json \
            --config ./evaluations/triage_safety.yaml

      - name: Upload Evaluation Report
        # Run even if the evaluation step failed, so the report is still archived
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: ./evaluation_results.json

      - name: Comment on PR
        # Post results even when gates fail, so reviewers can see why
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('./evaluation_results.json'));

            const body = `
            ## Model Evaluation Results

            | Metric | Score | Threshold | Status |
            |--------|-------|-----------|--------|
            | Triage Accuracy | ${(results.triage_accuracy * 100).toFixed(1)}% | ≥85% | ${results.triage_accuracy >= 0.85 ? '✅' : '❌'} |
            | Red Flag Sensitivity | ${(results.red_flag_sensitivity * 100).toFixed(1)}% | ≥98% | ${results.red_flag_sensitivity >= 0.98 ? '✅' : '❌'} |
            | Critical Misses | ${results.critical_misses} | 0 | ${results.critical_misses === 0 ? '✅' : '❌'} |

            **Overall:** ${results.gates_passed ? '✅ All gates passed' : '❌ Some gates failed'}

            [View full report](${results.report_url})
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

  deploy:
    needs: evaluate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest

    steps:
      - name: Download evaluation results
        uses: actions/download-artifact@v4
        with:
          name: evaluation-report

      - name: Verify gates passed
        run: |
          PASSED=$(jq '.gates_passed' evaluation_results.json)
          if [ "$PASSED" != "true" ]; then
            echo "❌ Evaluation gates not passed. Blocking deployment."
            exit 1
          fi

      - name: Deploy to production
        run: |
          echo "Deploying model to production..."
          # Your deployment script here
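
The Check Quality Gates step above calls scripts/check_gates.py, which isn't shown elsewhere in this guide. A minimal sketch, assuming it re-reads the JSON written by run_evaluation.py and the gates section of the YAML config; note that the example output uses the key critical_misses for the critical_miss_count gate, so a small alias map is included:

scripts/check_gates.py
#!/usr/bin/env python3
"""Re-check quality gates against a previously written results file (sketch)."""

import argparse
import json
import sys

import yaml

# Map gate metric names to the keys used in evaluation_results.json
ALIASES = {"critical_miss_count": "critical_misses"}

OPERATORS = {
    "gte": lambda value, threshold: value >= threshold,
    "lte": lambda value, threshold: value <= threshold,
    "eq": lambda value, threshold: value == threshold,
}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", required=True)
    parser.add_argument("--config", required=True)
    args = parser.parse_args()

    with open(args.results) as f:
        results = json.load(f)
    with open(args.config) as f:
        config = yaml.safe_load(f)

    failed = []
    for gate in config["gates"]:
        key = ALIASES.get(gate["metric"], gate["metric"])
        value = results.get(key)
        passed = value is not None and OPERATORS[gate["operator"]](value, gate["threshold"])
        print(f"{'✅' if passed else '❌'} {gate['name']}: {value} "
              f"(threshold: {gate['operator']} {gate['threshold']})")
        if not passed:
            failed.append(gate["name"])

    if failed:
        print(f"Gates failed: {', '.join(failed)}")
        sys.exit(1)
    print("All gates passed")


if __name__ == "__main__":
    main()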

Step 3: Evaluation Script

scripts/run_evaluation.py
#!/usr/bin/env python3
"""Run Rubric evaluation as part of CI/CD pipeline."""

import argparse
import hashlib
import json
import os
import subprocess
import sys

import yaml
from rubric import Rubric

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-path', required=True)
    parser.add_argument('--config', required=True)
    parser.add_argument('--dataset', required=True)
    parser.add_argument('--output', required=True)
    parser.add_argument('--timeout', type=int, default=3600)
    args = parser.parse_args()

    # Load evaluation config
    with open(args.config) as f:
        config = yaml.safe_load(f)

    # Initialize client
    client = Rubric()

    # Get model version from build
    model_version = get_model_version(args.model_path)

    print(f"Running evaluation for model: {model_version}")
    print(f"Dataset: {args.dataset}")
    print(f"Config: {config['name']} v{config['version']}")

    # Create evaluation
    evaluation = client.evaluations.create(
        name=f"CI/CD Evaluation - {model_version}",
        dataset=args.dataset,
        model_version=model_version,
        evaluators=config['evaluators'],

        # CI mode: wait for automated scoring only
        # (human review happens async, shouldn't block deploy)
        ci_mode=True,

        # Tag for tracking
        tags=["ci-cd", f"commit:{get_git_commit()}"],

        metadata={
            "git_commit": get_git_commit(),
            "git_branch": get_git_branch(),
            "pipeline_run": get_pipeline_id()
        }
    )

    print(f"Evaluation started: {evaluation.id}")

    # Wait for completion (automated scoring only)
    try:
        evaluation.wait(timeout=args.timeout, stage="automated")
    except TimeoutError:
        print(f"❌ Evaluation timed out after {args.timeout}s")
        sys.exit(1)

    # Get results
    results = client.evaluations.get(evaluation.id)

    # Check gates
    gates_passed = check_gates(results, config['gates'])

    # Check for regression against baseline
    regression_check = None
    if 'baseline' in config:
        regression_check = check_regression(client, results, config['baseline'])

    # Prepare output
    output = {
        "evaluation_id": evaluation.id,
        "model_version": model_version,
        "triage_accuracy": results.scores.triage_accuracy,
        "red_flag_sensitivity": results.scores.red_flag_sensitivity,
        "critical_misses": results.safety.critical_misses,
        "gates_passed": gates_passed,
        "regression_check": regression_check,
        "report_url": results.dashboard_url
    }

    # Write results
    with open(args.output, 'w') as f:
        json.dump(output, f, indent=2)

    print(f"\nResults written to {args.output}")
    print(f"Dashboard: {results.dashboard_url}")

    if not gates_passed:
        print("\n❌ FAILED: Quality gates not met")
        sys.exit(1)

    if regression_check and not regression_check['passed']:
        print("\n❌ FAILED: Regression detected")
        sys.exit(1)

    print("\n✅ PASSED: All checks passed")


def check_gates(results, gates):
    """Check if all quality gates are met."""
    all_passed = True

    for gate in gates:
        metric_value = getattr(results.scores, gate['metric'], None)
        if metric_value is None:
            metric_value = getattr(results.safety, gate['metric'], None)

        if metric_value is None:
            # Fail closed if the metric is missing rather than crashing on a None comparison
            print(f"  ❌ {gate['name']}: metric '{gate['metric']}' not found in results")
            all_passed = False
            continue

        if gate['operator'] == 'gte':
            passed = metric_value >= gate['threshold']
        elif gate['operator'] == 'lte':
            passed = metric_value <= gate['threshold']
        elif gate['operator'] == 'eq':
            passed = metric_value == gate['threshold']
        else:
            passed = False

        status = "✅" if passed else "❌"
        print(f"  {status} {gate['name']}: {metric_value} (threshold: {gate['operator']} {gate['threshold']})")

        if not passed:
            all_passed = False

    return all_passed


def check_regression(client, results, baseline_config):
    """Check for regression against baseline."""
    # Get baseline results
    baseline = client.evaluations.get_by_tag(
        baseline_config['model_tag']
    )

    if not baseline:
        print("  ⚠️ No baseline found, skipping regression check")
        return None

    regression_found = False
    details = {}

    for metric, max_drop in baseline_config['max_regression'].items():
        current = getattr(results.scores, metric)
        previous = getattr(baseline.scores, metric)
        diff = current - previous

        if diff < -max_drop:
            regression_found = True
            print(f"  ❌ Regression in {metric}: {previous:.1%}{current:.1%} (max allowed: -{max_drop:.1%})")
        else:
            print(f"  ✅ {metric}: {previous:.1%}{current:.1%}")

        details[metric] = {
            "baseline": previous,
            "current": current,
            "diff": diff,
            "max_allowed": max_drop
        }

    return {
        "passed": not regression_found,
        "baseline_evaluation": baseline.id,
        "details": details
    }


def get_model_version(path):
    """Derive a model version identifier from the artifact's content hash."""
    with open(path, 'rb') as f:
        return f"model-{hashlib.sha256(f.read()).hexdigest()[:8]}"


def get_git_commit():
    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()[:8]


def get_git_branch():
    return subprocess.check_output(['git', 'rev-parse', '--abbrev-ref', 'HEAD']).decode().strip()


def get_pipeline_id():
    return os.environ.get('GITHUB_RUN_ID', os.environ.get('CI_PIPELINE_ID', 'local'))


if __name__ == '__main__':
    main()
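
The pipeline diagram includes a Test stage and the GitHub workflow installs pytest, but no test is shown. A minimal sketch of a unit test for the gate logic; the file name and repository layout are illustrative, and importing the script assumes the rubric SDK is installed (it is in the workflow above):

tests/test_check_gates.py
"""Unit test for the gate-checking logic in scripts/run_evaluation.py (sketch)."""

import sys
from types import SimpleNamespace

sys.path.insert(0, "scripts")
from run_evaluation import check_gates  # noqa: E402


def test_gates_fail_on_low_sensitivity():
    # Stand-in for the Rubric results object: scores and safety attribute groups
    results = SimpleNamespace(
        scores=SimpleNamespace(triage_accuracy=0.90, red_flag_sensitivity=0.95),
        safety=SimpleNamespace(critical_miss_count=0),
    )
    gates = [
        {"name": "triage_accuracy_gate", "metric": "triage_accuracy", "operator": "gte", "threshold": 0.85},
        {"name": "red_flag_sensitivity_gate", "metric": "red_flag_sensitivity", "operator": "gte", "threshold": 0.98},
        {"name": "no_critical_misses", "metric": "critical_miss_count", "operator": "eq", "threshold": 0},
    ]
    # Sensitivity of 0.95 is below the 0.98 threshold, so the overall result is a failure
    assert check_gates(results, gates) is False

You could run this with a pytest step in the Test stage of the pipeline, before the evaluation step.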

Step 4: GitLab CI Integration

For GitLab users, here’s the equivalent configuration:
.gitlab-ci.yml
stages:
  - build
  - evaluate
  - deploy

# RUBRIC_API_KEY should be defined as a masked CI/CD variable in the project
# settings (Settings > CI/CD > Variables); it is then available to every job.

build:
  stage: build
  script:
    - pip install -r requirements.txt
    - python scripts/build_model.py
  artifacts:
    paths:
      - dist/

evaluate:
  stage: evaluate
  script:
    - pip install rubric -r requirements.txt
    - python scripts/run_evaluation.py
        --model-path ./dist/model
        --config ./evaluations/triage_safety.yaml
        --dataset ds_validation_v2
        --output ./evaluation_results.json
  artifacts:
    paths:
      - evaluation_results.json
    reports:
      # Requires the evaluation script to also emit a JUnit-format XML report;
      # drop this line if you only produce the JSON output shown above.
      junit: evaluation_results.xml
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

deploy:
  stage: deploy
  script:
    - |
      PASSED=$(jq '.gates_passed' evaluation_results.json)
      if [ "$PASSED" != "true" ]; then
        echo "Evaluation gates not passed"
        exit 1
      fi
    - ./scripts/deploy.sh
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  needs:
    - evaluate
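
GitLab has no direct equivalent of the github-script step above, but the same results table can be posted to the merge request via the GitLab Notes API. A sketch, assuming a project access token is stored as a GITLAB_API_TOKEN CI/CD variable (the job token generally cannot create MR notes) and that the script runs in a merge request pipeline:

scripts/post_mr_comment.py
#!/usr/bin/env python3
"""Post evaluation results as a merge request note via the GitLab API (sketch)."""

import json
import os
import urllib.parse
import urllib.request

with open("evaluation_results.json") as f:
    results = json.load(f)

status = "✅ All gates passed" if results["gates_passed"] else "❌ Some gates failed"
body = (
    "## Model Evaluation Results\n\n"
    "| Metric | Score |\n|--------|-------|\n"
    f"| Triage Accuracy | {results['triage_accuracy']:.1%} |\n"
    f"| Red Flag Sensitivity | {results['red_flag_sensitivity']:.1%} |\n"
    f"| Critical Misses | {results['critical_misses']} |\n\n"
    f"**Overall:** {status}\n\n[View full report]({results['report_url']})"
)

# Notes endpoint for the current merge request, built from predefined CI variables
url = (
    f"{os.environ['CI_API_V4_URL']}/projects/{os.environ['CI_PROJECT_ID']}"
    f"/merge_requests/{os.environ['CI_MERGE_REQUEST_IID']}/notes"
)
request = urllib.request.Request(
    url,
    data=urllib.parse.urlencode({"body": body}).encode(),
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_API_TOKEN"]},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(f"Posted MR note: HTTP {response.status}")

Add it as an extra script line in the evaluate job, guarded by a rule limiting it to merge request pipelines.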

Best Practices

  • Use versioned evaluation configs: reproducibility and an audit trail
  • Store configs in version control: change tracking and review
  • Set appropriate timeouts: don't block the pipeline indefinitely
  • Use ci_mode for speed: human review shouldn't block deployment
  • Check for regressions: catch performance drops early
  • Archive all results: required for regulatory compliance
  • Post results to PRs: visibility for reviewers

Safety-Critical Deployments: For production deployments of safety-critical AI, consider requiring completed human review before final deployment, not just automated evaluation.
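
One way to do that is a separate release gate that runs after the automated pipeline and refuses to proceed until the evaluation's human review is complete. A rough sketch only: the human_review_status field assumed here is not shown anywhere in this guide, so check the Rubric SDK for the actual attribute and values.

scripts/require_human_review.py
#!/usr/bin/env python3
"""Block the final release until human review of the evaluation is complete (sketch)."""

import sys

from rubric import Rubric

client = Rubric()
evaluation_id = sys.argv[1]  # e.g. the evaluation_id recorded in evaluation_results.json

evaluation = client.evaluations.get(evaluation_id)

# `human_review_status` is an assumed field name; the real attribute may differ.
if getattr(evaluation, "human_review_status", None) != "completed":
    print("❌ Human review not yet complete; holding the production release.")
    sys.exit(1)

print("✅ Human review complete; release may proceed.")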

Monitoring After Deploy

After a successful deployment, tag the released model so the next pipeline run can compare against it as the production-current baseline referenced in the evaluation config above.

# Assumes the `client` and `evaluation` objects and the get_pipeline_id() helper
# from the evaluation script above.
from datetime import datetime

# Tag the deployed model for baseline comparison
client.models.tag(
    model_version="model-abc123",
    tag="production-current",

    # Include evaluation results
    evaluation_id=evaluation.id,

    # Track deployment
    metadata={
        "deployed_at": datetime.now().isoformat(),
        "deployed_by": "ci-pipeline",
        "pipeline_id": get_pipeline_id()
    }
)