
Why CI/CD Evaluation?

Healthcare AI requires rigorous testing before deployment. By integrating evaluation into your CI/CD pipeline, you can:
  • Automatically validate every model change against safety criteria
  • Catch regressions before they reach patients
  • Maintain audit trails of all evaluation results
  • Enforce quality gates for production deployment
Estimated Time: 20 minutes
Prerequisites: CI/CD system (GitHub Actions, GitLab, etc.), Rubric API key
Difficulty: Intermediate

Architecture Overview

The Evaluate stage sits between your existing tests and deployment: each run submits the candidate model to the Rubric API, automated scoring (plus asynchronous human review) produces a pass/fail result, and only a pass lets the Deploy stage proceed.

┌─────────────────────────────────────────────────────────────────┐
│                        CI/CD Pipeline                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐ │
│   │   Build  │───▶│   Test   │───▶│ Evaluate │───▶│  Deploy  │ │
│   └──────────┘    └──────────┘    └────┬─────┘    └──────────┘ │
│                                        │                        │
│                                        ▼                        │
│                                  ┌──────────┐                   │
│                                  │  Rubric  │                   │
│                                  │   API    │                   │
│                                  └────┬─────┘                   │
│                                       │                         │
│                              ┌────────┴────────┐                │
│                              ▼                 ▼                │
│                        ┌──────────┐     ┌──────────┐           │
│                        │ Automated│     │  Human   │           │
│                        │  Scoring │     │  Review  │           │
│                        └────┬─────┘     └────┬─────┘           │
│                              │               │                  │
│                              └───────┬───────┘                  │
│                                      ▼                          │
│                               ┌──────────┐                      │
│                               │  Pass/   │                      │
│                               │  Fail    │                      │
│                               └──────────┘                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Step 1: Configure Evaluation Definition

First, create a versioned evaluation definition that will be used consistently across all pipeline runs.
evaluations/triage_safety.yaml
# evaluations/triage_safety.yaml
# Store in your repository for version control

name: Triage Safety Gate
version: 1.2.0

evaluators:
  - type: triage_accuracy
    version: 1.0.0
    config:
      levels: [emergent, urgent, semi_urgent, routine]
      severity_weights:
        under_triage: 5.0
        over_triage: 1.0

  - type: red_flag_detection
    version: 1.0.0
    config:
      protocols: [chest_pain_v2, stroke_fast, sepsis]
      critical_miss_penalty: 100

# Quality gates - must pass ALL to deploy
gates:
  - name: triage_accuracy_gate
    metric: triage_accuracy
    operator: gte
    threshold: 0.85

  - name: red_flag_sensitivity_gate
    metric: red_flag_sensitivity
    operator: gte
    threshold: 0.98

  - name: no_critical_misses
    metric: critical_miss_count
    operator: eq
    threshold: 0

# Compare against baseline
baseline:
  type: production
  model_tag: production-current
  max_regression:
    triage_accuracy: 0.02
    red_flag_sensitivity: 0.01
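
Because this definition is versioned and shared by every pipeline run, a quick structural check before the pipeline consumes it can catch malformed gates early. A minimal sketch; the script name and the specific checks are illustrative, not part of the Rubric tooling:

scripts/validate_eval_config.py
#!/usr/bin/env python3
"""Sanity-check an evaluation definition before CI uses it (illustrative sketch)."""

import sys
import yaml

REQUIRED_TOP_LEVEL = {"name", "version", "evaluators", "gates"}
VALID_OPERATORS = {"gte", "lte", "eq"}


def validate(path):
    with open(path) as f:
        config = yaml.safe_load(f)

    errors = []
    missing = REQUIRED_TOP_LEVEL - config.keys()
    if missing:
        errors.append(f"missing top-level keys: {sorted(missing)}")

    for gate in config.get("gates", []):
        if gate.get("operator") not in VALID_OPERATORS:
            errors.append(f"gate {gate.get('name')}: unknown operator {gate.get('operator')!r}")
        if not isinstance(gate.get("threshold"), (int, float)):
            errors.append(f"gate {gate.get('name')}: threshold must be numeric")

    return errors


if __name__ == "__main__":
    problems = validate(sys.argv[1])
    for problem in problems:
        print(f"❌ {problem}")
    sys.exit(1 if problems else 0)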

Step 2: GitHub Actions Integration

.github/workflows/model-evaluation.yml
name: Model Evaluation

on:
  pull_request:
    paths:
      - 'models/**'
      - 'prompts/**'
  push:
    branches: [main]
    paths:
      - 'models/**'
      - 'prompts/**'

env:
  RUBRIC_API_KEY: ${{ secrets.RUBRIC_API_KEY }}

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install rubric pytest
          pip install -r requirements.txt

      - name: Build model artifacts
        run: |
          python scripts/build_model.py

      - name: Run Rubric Evaluation
        id: evaluation
        run: |
          python scripts/run_evaluation.py \
            --model-path ./dist/model \
            --config ./evaluations/triage_safety.yaml \
            --dataset ds_validation_v2 \
            --output ./evaluation_results.json

      - name: Check Quality Gates
        run: |
          python scripts/check_gates.py \
            --results ./evaluation_results.json \
            --config ./evaluations/triage_safety.yaml

      - name: Upload Evaluation Report
        # Run even if the evaluation step failed, so the report is still archived
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: ./evaluation_results.json

      - name: Comment on PR
        # Post results even when gates fail, so reviewers can see why
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('./evaluation_results.json'));

            const body = `
            ## Model Evaluation Results

            | Metric | Score | Threshold | Status |
            |--------|-------|-----------|--------|
            | Triage Accuracy | ${(results.triage_accuracy * 100).toFixed(1)}% | ≥85% | ${results.triage_accuracy >= 0.85 ? '✅' : '❌'} |
            | Red Flag Sensitivity | ${(results.red_flag_sensitivity * 100).toFixed(1)}% | ≥98% | ${results.red_flag_sensitivity >= 0.98 ? '✅' : '❌'} |
            | Critical Misses | ${results.critical_misses} | 0 | ${results.critical_misses === 0 ? '✅' : '❌'} |

            **Overall:** ${results.gates_passed ? '✅ All gates passed' : '❌ Some gates failed'}

            [View full report](${results.report_url})
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

  deploy:
    needs: evaluate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest

    steps:
      - name: Download evaluation results
        uses: actions/download-artifact@v4
        with:
          name: evaluation-report

      - name: Verify gates passed
        run: |
          PASSED=$(jq '.gates_passed' evaluation_results.json)
          if [ "$PASSED" != "true" ]; then
            echo "❌ Evaluation gates not passed. Blocking deployment."
            exit 1
          fi

      - name: Deploy to production
        run: |
          echo "Deploying model to production..."
          # Your deployment script here
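
The Check Quality Gates step above calls scripts/check_gates.py, which isn't shown elsewhere in this guide. A minimal sketch, assuming it re-reads the JSON written by run_evaluation.py and the gates section of the YAML config; note that the example output uses the key critical_misses for the critical_miss_count gate, so a small alias map is included:

scripts/check_gates.py
#!/usr/bin/env python3
"""Re-check quality gates against a previously written results file (sketch)."""

import argparse
import json
import sys

import yaml

# Map gate metric names to the keys used in evaluation_results.json
ALIASES = {"critical_miss_count": "critical_misses"}

OPERATORS = {
    "gte": lambda value, threshold: value >= threshold,
    "lte": lambda value, threshold: value <= threshold,
    "eq": lambda value, threshold: value == threshold,
}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", required=True)
    parser.add_argument("--config", required=True)
    args = parser.parse_args()

    with open(args.results) as f:
        results = json.load(f)
    with open(args.config) as f:
        config = yaml.safe_load(f)

    failed = []
    for gate in config["gates"]:
        key = ALIASES.get(gate["metric"], gate["metric"])
        value = results.get(key)
        passed = value is not None and OPERATORS[gate["operator"]](value, gate["threshold"])
        print(f"{'✅' if passed else '❌'} {gate['name']}: {value} "
              f"(threshold: {gate['operator']} {gate['threshold']})")
        if not passed:
            failed.append(gate["name"])

    if failed:
        print(f"Gates failed: {', '.join(failed)}")
        sys.exit(1)
    print("All gates passed")


if __name__ == "__main__":
    main()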

Step 3: Evaluation Script

scripts/run_evaluation.py
#!/usr/bin/env python3
"""Run Rubric evaluation as part of CI/CD pipeline."""

import argparse
import hashlib
import json
import os
import subprocess
import sys

import yaml
from rubric import Rubric

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-path', required=True)
    parser.add_argument('--config', required=True)
    parser.add_argument('--dataset', required=True)
    parser.add_argument('--output', required=True)
    parser.add_argument('--timeout', type=int, default=3600)
    args = parser.parse_args()

    # Load evaluation config
    with open(args.config) as f:
        config = yaml.safe_load(f)

    # Initialize client
    client = Rubric()

    # Get model version from build
    model_version = get_model_version(args.model_path)

    print(f"Running evaluation for model: {model_version}")
    print(f"Dataset: {args.dataset}")
    print(f"Config: {config['name']} v{config['version']}")

    # Create evaluation
    evaluation = client.evaluations.create(
        name=f"CI/CD Evaluation - {model_version}",
        dataset=args.dataset,
        model_version=model_version,
        evaluators=config['evaluators'],

        # CI mode: wait for automated scoring only
        # (human review happens async, shouldn't block deploy)
        ci_mode=True,

        # Tag for tracking
        tags=["ci-cd", f"commit:{get_git_commit()}"],

        metadata={
            "git_commit": get_git_commit(),
            "git_branch": get_git_branch(),
            "pipeline_run": get_pipeline_id()
        }
    )

    print(f"Evaluation started: {evaluation.id}")

    # Wait for completion (automated scoring only)
    try:
        evaluation.wait(timeout=args.timeout, stage="automated")
    except TimeoutError:
        print(f"❌ Evaluation timed out after {args.timeout}s")
        sys.exit(1)

    # Get results
    results = client.evaluations.get(evaluation.id)

    # Check gates
    gates_passed = check_gates(results, config['gates'])

    # Check for regression against baseline
    regression_check = None
    if 'baseline' in config:
        regression_check = check_regression(client, results, config['baseline'])

    # Prepare output
    output = {
        "evaluation_id": evaluation.id,
        "model_version": model_version,
        "triage_accuracy": results.scores.triage_accuracy,
        "red_flag_sensitivity": results.scores.red_flag_sensitivity,
        "critical_misses": results.safety.critical_misses,
        "gates_passed": gates_passed,
        "regression_check": regression_check,
        "report_url": results.dashboard_url
    }

    # Write results
    with open(args.output, 'w') as f:
        json.dump(output, f, indent=2)

    print(f"\nResults written to {args.output}")
    print(f"Dashboard: {results.dashboard_url}")

    if not gates_passed:
        print("\n❌ FAILED: Quality gates not met")
        sys.exit(1)

    if regression_check and not regression_check['passed']:
        print("\n❌ FAILED: Regression detected")
        sys.exit(1)

    print("\n✅ PASSED: All checks passed")


def check_gates(results, gates):
    """Check if all quality gates are met."""
    all_passed = True

    for gate in gates:
        metric_value = getattr(results.scores, gate['metric'], None)
        if metric_value is None:
            metric_value = getattr(results.safety, gate['metric'], None)

        if metric_value is None:
            # Fail closed if the metric is missing rather than crashing on a None comparison
            print(f"  ❌ {gate['name']}: metric '{gate['metric']}' not found in results")
            all_passed = False
            continue

        if gate['operator'] == 'gte':
            passed = metric_value >= gate['threshold']
        elif gate['operator'] == 'lte':
            passed = metric_value <= gate['threshold']
        elif gate['operator'] == 'eq':
            passed = metric_value == gate['threshold']
        else:
            passed = False

        status = "✅" if passed else "❌"
        print(f"  {status} {gate['name']}: {metric_value} (threshold: {gate['operator']} {gate['threshold']})")

        if not passed:
            all_passed = False

    return all_passed


def check_regression(client, results, baseline_config):
    """Check for regression against baseline."""
    # Get baseline results
    baseline = client.evaluations.get_by_tag(
        baseline_config['model_tag']
    )

    if not baseline:
        print("  ⚠️ No baseline found, skipping regression check")
        return None

    regression_found = False
    details = {}

    for metric, max_drop in baseline_config['max_regression'].items():
        current = getattr(results.scores, metric)
        previous = getattr(baseline.scores, metric)
        diff = current - previous

        if diff < -max_drop:
            regression_found = True
            print(f"  ❌ Regression in {metric}: {previous:.1%}{current:.1%} (max allowed: -{max_drop:.1%})")
        else:
            print(f"  ✅ {metric}: {previous:.1%}{current:.1%}")

        details[metric] = {
            "baseline": previous,
            "current": current,
            "diff": diff,
            "max_allowed": max_drop
        }

    return {
        "passed": not regression_found,
        "baseline_evaluation": baseline.id,
        "details": details
    }


def get_model_version(path):
    """Derive a model version identifier from the artifact's content hash."""
    with open(path, 'rb') as f:
        return f"model-{hashlib.sha256(f.read()).hexdigest()[:8]}"


def get_git_commit():
    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()[:8]


def get_git_branch():
    return subprocess.check_output(['git', 'rev-parse', '--abbrev-ref', 'HEAD']).decode().strip()


def get_pipeline_id():
    return os.environ.get('GITHUB_RUN_ID', os.environ.get('CI_PIPELINE_ID', 'local'))


if __name__ == '__main__':
    main()
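
The pipeline diagram includes a Test stage and the GitHub workflow installs pytest, but no test is shown. A minimal sketch of a unit test for the gate logic; the file name and repository layout are illustrative, and importing the script assumes the rubric SDK is installed (it is in the workflow above):

tests/test_check_gates.py
"""Unit test for the gate-checking logic in scripts/run_evaluation.py (sketch)."""

import sys
from types import SimpleNamespace

sys.path.insert(0, "scripts")
from run_evaluation import check_gates  # noqa: E402


def test_gates_fail_on_low_sensitivity():
    # Stand-in for the Rubric results object: scores and safety attribute groups
    results = SimpleNamespace(
        scores=SimpleNamespace(triage_accuracy=0.90, red_flag_sensitivity=0.95),
        safety=SimpleNamespace(critical_miss_count=0),
    )
    gates = [
        {"name": "triage_accuracy_gate", "metric": "triage_accuracy", "operator": "gte", "threshold": 0.85},
        {"name": "red_flag_sensitivity_gate", "metric": "red_flag_sensitivity", "operator": "gte", "threshold": 0.98},
        {"name": "no_critical_misses", "metric": "critical_miss_count", "operator": "eq", "threshold": 0},
    ]
    # Sensitivity of 0.95 is below the 0.98 threshold, so the overall result is a failure
    assert check_gates(results, gates) is False

You could run this with a pytest step in the Test stage of the pipeline, before the evaluation step.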

Step 4: GitLab CI Integration

For GitLab users, here’s the equivalent configuration:
.gitlab-ci.yml
stages:
  - build
  - evaluate
  - deploy

# RUBRIC_API_KEY should be defined as a masked CI/CD variable in the project
# settings (Settings > CI/CD > Variables); it is then available to every job.

build:
  stage: build
  script:
    - pip install -r requirements.txt
    - python scripts/build_model.py
  artifacts:
    paths:
      - dist/

evaluate:
  stage: evaluate
  script:
    - pip install rubric -r requirements.txt
    - python scripts/run_evaluation.py
        --model-path ./dist/model
        --config ./evaluations/triage_safety.yaml
        --dataset ds_validation_v2
        --output ./evaluation_results.json
  artifacts:
    paths:
      - evaluation_results.json
    reports:
      # Requires the evaluation script to also emit a JUnit-format XML report;
      # drop this line if you only produce the JSON output shown above.
      junit: evaluation_results.xml
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

deploy:
  stage: deploy
  script:
    - |
      PASSED=$(jq '.gates_passed' evaluation_results.json)
      if [ "$PASSED" != "true" ]; then
        echo "Evaluation gates not passed"
        exit 1
      fi
    - ./scripts/deploy.sh
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  needs:
    - evaluate
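
GitLab has no direct equivalent of the github-script step above, but the same results table can be posted to the merge request via the GitLab Notes API. A sketch, assuming a project access token is stored as a GITLAB_API_TOKEN CI/CD variable (the job token generally cannot create MR notes) and that the script runs in a merge request pipeline:

scripts/post_mr_comment.py
#!/usr/bin/env python3
"""Post evaluation results as a merge request note via the GitLab API (sketch)."""

import json
import os
import urllib.parse
import urllib.request

with open("evaluation_results.json") as f:
    results = json.load(f)

status = "✅ All gates passed" if results["gates_passed"] else "❌ Some gates failed"
body = (
    "## Model Evaluation Results\n\n"
    "| Metric | Score |\n|--------|-------|\n"
    f"| Triage Accuracy | {results['triage_accuracy']:.1%} |\n"
    f"| Red Flag Sensitivity | {results['red_flag_sensitivity']:.1%} |\n"
    f"| Critical Misses | {results['critical_misses']} |\n\n"
    f"**Overall:** {status}\n\n[View full report]({results['report_url']})"
)

# Notes endpoint for the current merge request, built from predefined CI variables
url = (
    f"{os.environ['CI_API_V4_URL']}/projects/{os.environ['CI_PROJECT_ID']}"
    f"/merge_requests/{os.environ['CI_MERGE_REQUEST_IID']}/notes"
)
request = urllib.request.Request(
    url,
    data=urllib.parse.urlencode({"body": body}).encode(),
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_API_TOKEN"]},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(f"Posted MR note: HTTP {response.status}")

Add it as an extra script line in the evaluate job, guarded by a rule limiting it to merge request pipelines.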

Best Practices

  • Use versioned evaluation configs: reproducibility and an audit trail
  • Store configs in version control: change tracking and review
  • Set appropriate timeouts: don't block the pipeline indefinitely
  • Use ci_mode for speed: human review shouldn't block deployment
  • Check for regressions: catch performance drops early
  • Archive all results: required for regulatory compliance
  • Post results to PRs: visibility for reviewers

Safety-Critical Deployments: For production deployments of safety-critical AI, consider requiring completed human review before final deployment, not just automated evaluation.
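
One way to do that is a separate release gate that runs after the automated pipeline and refuses to proceed until the evaluation's human review is complete. A rough sketch only: the human_review_status field assumed here is not shown anywhere in this guide, so check the Rubric SDK for the actual attribute and values.

scripts/require_human_review.py
#!/usr/bin/env python3
"""Block the final release until human review of the evaluation is complete (sketch)."""

import sys

from rubric import Rubric

client = Rubric()
evaluation_id = sys.argv[1]  # e.g. the evaluation_id recorded in evaluation_results.json

evaluation = client.evaluations.get(evaluation_id)

# `human_review_status` is an assumed field name; the real attribute may differ.
if getattr(evaluation, "human_review_status", None) != "completed":
    print("❌ Human review not yet complete; holding the production release.")
    sys.exit(1)

print("✅ Human review complete; release may proceed.")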

Monitoring After Deploy

After a successful deployment, tag the released model so the next pipeline run can compare against it as the production-current baseline referenced in the evaluation config above.

# Assumes the `client` and `evaluation` objects and the get_pipeline_id() helper
# from the evaluation script above.
from datetime import datetime

# Tag the deployed model for baseline comparison
client.models.tag(
    model_version="model-abc123",
    tag="production-current",

    # Include evaluation results
    evaluation_id=evaluation.id,

    # Track deployment
    metadata={
        "deployed_at": datetime.now().isoformat(),
        "deployed_by": "ci-pipeline",
        "pipeline_id": get_pipeline_id()
    }
)