Why CI/CD Evaluation?
Healthcare AI requires rigorous testing before deployment. By integrating evaluation into your CI/CD pipeline, you can:
- Automatically validate every model change against safety criteria
- Catch regressions before they reach patients
- Maintain audit trails of all evaluation results
- Enforce quality gates for production deployment
| Estimated Time | Prerequisites | Difficulty |
|---|---|---|
| 20 minutes | CI/CD system (GitHub Actions, GitLab, etc.), Rubric API key | Intermediate |
Architecture Overview
                       CI/CD Pipeline
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Build   │───▶│   Test   │───▶│ Evaluate │───▶│  Deploy  │
└──────────┘    └──────────┘    └────┬─────┘    └──────────┘
                                     │
                                     ▼
                                ┌──────────┐
                                │  Rubric  │
                                │   API    │
                                └────┬─────┘
                                     │
                            ┌────────┴────────┐
                            ▼                 ▼
                       ┌──────────┐      ┌──────────┐
                       │ Automated│      │  Human   │
                       │ Scoring  │      │  Review  │
                       └────┬─────┘      └────┬─────┘
                            │                 │
                            └────────┬────────┘
                                     ▼
                                ┌──────────┐
                                │  Pass/   │
                                │   Fail   │
                                └──────────┘
Step 1: Configure Evaluation Definition
First, create a versioned evaluation definition that will be used consistently across all pipeline runs.
evaluations/triage_safety.yaml
# evaluations/triage_safety.yaml
# Store in your repository for version control

name: Triage Safety Gate
version: 1.2.0

evaluators:
  - type: triage_accuracy
    version: 1.0.0
    config:
      levels: [emergent, urgent, semi_urgent, routine]
      severity_weights:
        under_triage: 5.0
        over_triage: 1.0

  - type: red_flag_detection
    version: 1.0.0
    config:
      protocols: [chest_pain_v2, stroke_fast, sepsis]
      critical_miss_penalty: 100

# Quality gates - must pass ALL to deploy
gates:
  - name: triage_accuracy_gate
    metric: triage_accuracy
    operator: gte
    threshold: 0.85

  - name: red_flag_sensitivity_gate
    metric: red_flag_sensitivity
    operator: gte
    threshold: 0.98

  - name: no_critical_misses
    metric: critical_miss_count
    operator: eq
    threshold: 0

# Compare against baseline
baseline:
  type: production
  model_tag: production-current
  max_regression:
    triage_accuracy: 0.02
    red_flag_sensitivity: 0.01
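Because every pipeline run treats this file as the source of truth, it helps to fail fast on a malformed definition. Below is a minimal sketch of a hypothetical `scripts/validate_eval_config.py` helper (not part of the Rubric SDK) that loads the YAML and checks the keys the later scripts rely on; you could run it as an early pipeline step.

```python
# scripts/validate_eval_config.py -- hypothetical helper, not part of the Rubric SDK.
# Loads the evaluation definition and sanity-checks the fields used by
# run_evaluation.py and check_gates.py before the pipeline does any real work.
import sys

import yaml

VALID_OPERATORS = {"gte", "lte", "eq"}


def validate(path: str) -> None:
    with open(path) as f:
        config = yaml.safe_load(f)

    # Required top-level keys referenced later in the pipeline
    for key in ("name", "version", "evaluators", "gates"):
        if key not in config:
            sys.exit(f"Missing required key: {key}")

    # Every gate must use an operator the gate checker understands
    for gate in config["gates"]:
        if gate["operator"] not in VALID_OPERATORS:
            sys.exit(f"Gate {gate['name']} uses unknown operator {gate['operator']}")

    print(f"{config['name']} v{config['version']}: {len(config['gates'])} gates OK")


if __name__ == "__main__":
    validate(sys.argv[1] if len(sys.argv) > 1 else "evaluations/triage_safety.yaml")
```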
Step 2: GitHub Actions Integration
.github/workflows/model-evaluation.yml
name: Model Evaluation

on:
  pull_request:
    paths:
      - 'models/**'
      - 'prompts/**'
  push:
    branches: [main]
    paths:
      - 'models/**'
      - 'prompts/**'

env:
  RUBRIC_API_KEY: ${{ secrets.RUBRIC_API_KEY }}

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install rubric pytest
          pip install -r requirements.txt

      - name: Build model artifacts
        run: |
          python scripts/build_model.py

      - name: Run Rubric Evaluation
        id: evaluation
        run: |
          python scripts/run_evaluation.py \
            --model-path ./dist/model \
            --config ./evaluations/triage_safety.yaml \
            --dataset ds_validation_v2 \
            --output ./evaluation_results.json

      - name: Check Quality Gates
        run: |
          python scripts/check_gates.py \
            --results ./evaluation_results.json \
            --config ./evaluations/triage_safety.yaml

      - name: Upload Evaluation Report
        if: always()  # archive the report even when gates fail
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: ./evaluation_results.json

      - name: Comment on PR
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('./evaluation_results.json'));
            const body = `
            ## Model Evaluation Results

            | Metric | Score | Threshold | Status |
            |--------|-------|-----------|--------|
            | Triage Accuracy | ${(results.triage_accuracy * 100).toFixed(1)}% | ≥85% | ${results.triage_accuracy >= 0.85 ? '✅' : '❌'} |
            | Red Flag Sensitivity | ${(results.red_flag_sensitivity * 100).toFixed(1)}% | ≥98% | ${results.red_flag_sensitivity >= 0.98 ? '✅' : '❌'} |
            | Critical Misses | ${results.critical_misses} | 0 | ${results.critical_misses === 0 ? '✅' : '❌'} |

            **Overall:** ${results.gates_passed ? '✅ All gates passed' : '❌ Some gates failed'}

            [View full report](${results.report_url})
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

  deploy:
    needs: evaluate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Download evaluation results
        uses: actions/download-artifact@v4
        with:
          name: evaluation-report

      - name: Verify gates passed
        run: |
          PASSED=$(jq '.gates_passed' evaluation_results.json)
          if [ "$PASSED" != "true" ]; then
            echo "❌ Evaluation gates not passed. Blocking deployment."
            exit 1
          fi

      - name: Deploy to production
        run: |
          echo "Deploying model to production..."
          # Your deployment script here
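The workflow also calls `scripts/check_gates.py`, which is not shown elsewhere in this guide. Here is a minimal sketch, assuming the `evaluation_results.json` layout written by `run_evaluation.py` in Step 3; note that the script stores the critical-miss count under the key `critical_misses`, so the `critical_miss_count` gate metric is aliased below.

```python
#!/usr/bin/env python3
"""scripts/check_gates.py -- minimal sketch of a standalone gate check.

Re-applies the gates from the evaluation definition to the JSON written by
run_evaluation.py and exits non-zero if any gate fails, so the CI job fails.
"""
import argparse
import json
import sys

import yaml

# The results file uses `critical_misses`; the gate config uses `critical_miss_count`.
METRIC_ALIASES = {"critical_miss_count": "critical_misses"}

OPERATORS = {
    "gte": lambda value, threshold: value >= threshold,
    "lte": lambda value, threshold: value <= threshold,
    "eq": lambda value, threshold: value == threshold,
}


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", required=True)
    parser.add_argument("--config", required=True)
    args = parser.parse_args()

    with open(args.results) as f:
        results = json.load(f)
    with open(args.config) as f:
        config = yaml.safe_load(f)

    failed = []
    for gate in config["gates"]:
        metric = METRIC_ALIASES.get(gate["metric"], gate["metric"])
        value = results.get(metric)
        op = OPERATORS.get(gate["operator"])
        # Fail closed if the metric is missing or the operator is unknown
        passed = value is not None and op is not None and op(value, gate["threshold"])
        status = "PASS" if passed else "FAIL"
        print(f"{status} {gate['name']}: {value} ({gate['operator']} {gate['threshold']})")
        if not passed:
            failed.append(gate["name"])

    if failed:
        sys.exit(f"Quality gates failed: {', '.join(failed)}")


if __name__ == "__main__":
    main()
```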
Step 3: Evaluation Script
scripts/run_evaluation.py
#!/usr/bin/env python3
"""Run Rubric evaluation as part of CI/CD pipeline."""

import argparse
import json
import sys

import yaml

from rubric import Rubric


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-path', required=True)
    parser.add_argument('--config', required=True)
    parser.add_argument('--dataset', required=True)
    parser.add_argument('--output', required=True)
    parser.add_argument('--timeout', type=int, default=3600)
    args = parser.parse_args()

    # Load evaluation config
    with open(args.config) as f:
        config = yaml.safe_load(f)

    # Initialize client
    client = Rubric()

    # Get model version from build
    model_version = get_model_version(args.model_path)

    print(f"Running evaluation for model: {model_version}")
    print(f"Dataset: {args.dataset}")
    print(f"Config: {config['name']} v{config['version']}")

    # Create evaluation
    evaluation = client.evaluations.create(
        name=f"CI/CD Evaluation - {model_version}",
        dataset=args.dataset,
        model_version=model_version,
        evaluators=config['evaluators'],

        # CI mode: wait for automated scoring only
        # (human review happens async, shouldn't block deploy)
        ci_mode=True,

        # Tag for tracking
        tags=["ci-cd", f"commit:{get_git_commit()}"],
        metadata={
            "git_commit": get_git_commit(),
            "git_branch": get_git_branch(),
            "pipeline_run": get_pipeline_id()
        }
    )

    print(f"Evaluation started: {evaluation.id}")

    # Wait for completion (automated scoring only)
    try:
        evaluation.wait(timeout=args.timeout, stage="automated")
    except TimeoutError:
        print(f"❌ Evaluation timed out after {args.timeout}s")
        sys.exit(1)

    # Get results
    results = client.evaluations.get(evaluation.id)

    # Check gates
    gates_passed = check_gates(results, config['gates'])

    # Check for regression against baseline
    regression_check = None
    if 'baseline' in config:
        regression_check = check_regression(client, results, config['baseline'])

    # Prepare output
    output = {
        "evaluation_id": evaluation.id,
        "model_version": model_version,
        "triage_accuracy": results.scores.triage_accuracy,
        "red_flag_sensitivity": results.scores.red_flag_sensitivity,
        "critical_misses": results.safety.critical_misses,
        "gates_passed": gates_passed,
        "regression_check": regression_check,
        "report_url": results.dashboard_url
    }

    # Write results
    with open(args.output, 'w') as f:
        json.dump(output, f, indent=2)

    print(f"\nResults written to {args.output}")
    print(f"Dashboard: {results.dashboard_url}")

    if not gates_passed:
        print("\n❌ FAILED: Quality gates not met")
        sys.exit(1)

    if regression_check and not regression_check['passed']:
        print("\n❌ FAILED: Regression detected")
        sys.exit(1)

    print("\n✅ PASSED: All checks passed")


def check_gates(results, gates):
    """Check if all quality gates are met."""
    all_passed = True

    for gate in gates:
        metric_value = getattr(results.scores, gate['metric'], None)
        if metric_value is None:
            metric_value = getattr(results.safety, gate['metric'], None)

        if metric_value is None:
            # Unknown metric: fail closed rather than comparing against None
            passed = False
        elif gate['operator'] == 'gte':
            passed = metric_value >= gate['threshold']
        elif gate['operator'] == 'lte':
            passed = metric_value <= gate['threshold']
        elif gate['operator'] == 'eq':
            passed = metric_value == gate['threshold']
        else:
            passed = False

        status = "✅" if passed else "❌"
        print(f"  {status} {gate['name']}: {metric_value} (threshold: {gate['operator']} {gate['threshold']})")

        if not passed:
            all_passed = False

    return all_passed


def check_regression(client, results, baseline_config):
    """Check for regression against baseline."""
    # Get baseline results
    baseline = client.evaluations.get_by_tag(
        baseline_config['model_tag']
    )

    if not baseline:
        print("  ⚠️ No baseline found, skipping regression check")
        return None

    regression_found = False
    details = {}

    for metric, max_drop in baseline_config['max_regression'].items():
        current = getattr(results.scores, metric)
        previous = getattr(baseline.scores, metric)
        diff = current - previous

        if diff < -max_drop:
            regression_found = True
            print(f"  ❌ Regression in {metric}: {previous:.1%} → {current:.1%} (max allowed: -{max_drop:.1%})")
        else:
            print(f"  ✅ {metric}: {previous:.1%} → {current:.1%}")

        details[metric] = {
            "baseline": previous,
            "current": current,
            "diff": diff,
            "max_allowed": max_drop
        }

    return {
        "passed": not regression_found,
        "baseline_evaluation": baseline.id,
        "details": details
    }


def get_model_version(path):
    import hashlib
    # Generate version from model hash
    return f"model-{hashlib.sha256(open(path, 'rb').read()).hexdigest()[:8]}"


def get_git_commit():
    import subprocess
    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()[:8]


def get_git_branch():
    import subprocess
    return subprocess.check_output(['git', 'rev-parse', '--abbrev-ref', 'HEAD']).decode().strip()


def get_pipeline_id():
    import os
    return os.environ.get('GITHUB_RUN_ID', os.environ.get('CI_PIPELINE_ID', 'local'))


if __name__ == '__main__':
    main()
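Since the workflow already installs pytest, the gate logic is easy to cover with a unit test. The sketch below assumes a hypothetical `tests/test_gates.py` and that `scripts/` is importable (for example via an `__init__.py` or `PYTHONPATH`); it exercises `check_gates` with a lightweight stand-in for the Rubric results object.

```python
# tests/test_gates.py -- hypothetical unit test for the gate logic in
# scripts/run_evaluation.py; run with `pytest` (already installed in the workflow).
# Assumes scripts/ is importable (e.g. contains an __init__.py or is on PYTHONPATH).
from types import SimpleNamespace

from scripts.run_evaluation import check_gates


def make_results(accuracy: float, sensitivity: float, misses: int) -> SimpleNamespace:
    # Lightweight stand-in for the Rubric results object consumed by check_gates.
    return SimpleNamespace(
        scores=SimpleNamespace(triage_accuracy=accuracy, red_flag_sensitivity=sensitivity),
        safety=SimpleNamespace(critical_miss_count=misses),
    )


GATES = [
    {"name": "triage_accuracy_gate", "metric": "triage_accuracy", "operator": "gte", "threshold": 0.85},
    {"name": "red_flag_sensitivity_gate", "metric": "red_flag_sensitivity", "operator": "gte", "threshold": 0.98},
    {"name": "no_critical_misses", "metric": "critical_miss_count", "operator": "eq", "threshold": 0},
]


def test_all_gates_pass():
    assert check_gates(make_results(0.90, 0.99, 0), GATES) is True


def test_under_threshold_fails():
    assert check_gates(make_results(0.80, 0.99, 0), GATES) is False


def test_critical_miss_fails():
    assert check_gates(make_results(0.90, 0.99, 1), GATES) is False
```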
Step 4: GitLab CI Integration
For GitLab users, here’s the equivalent configuration:
.gitlab-ci.yml
stages:
  - build
  - evaluate
  - deploy

variables:
  RUBRIC_API_KEY: $RUBRIC_API_KEY

build:
  stage: build
  script:
    - pip install -r requirements.txt
    - python scripts/build_model.py
  artifacts:
    paths:
      - dist/

evaluate:
  stage: evaluate
  script:
    - pip install rubric
    - python scripts/run_evaluation.py
        --model-path ./dist/model
        --config ./evaluations/triage_safety.yaml
        --dataset ds_validation_v2
        --output ./evaluation_results.json
  artifacts:
    paths:
      - evaluation_results.json
    reports:
      # Only needed if your evaluation script also emits a JUnit XML report
      junit: evaluation_results.xml
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

deploy:
  stage: deploy
  script:
    - |
      PASSED=$(jq '.gates_passed' evaluation_results.json)
      if [ "$PASSED" != "true" ]; then
        echo "Evaluation gates not passed"
        exit 1
      fi
    - ./scripts/deploy.sh
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  needs:
    - evaluate
Best Practices
| Practice | Rationale |
|---|---|
| Use versioned evaluation configs | Reproducibility and audit trail |
| Store configs in version control | Change tracking and review |
| Set appropriate timeouts | Don’t block pipeline indefinitely |
| Use ci_mode for speed | Human review shouldn’t block deployment |
| Check for regressions | Catch performance drops early |
| Archive all results | Required for regulatory compliance |
| Post results to PRs | Visibility for reviewers |
Safety-Critical Deployments: For production deployments of safety-critical AI, consider requiring completed human review before final deployment, not just automated evaluation.
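One way to enforce that is a separate pre-deploy check that blocks until reviewers have signed off. The sketch below is only illustrative: it assumes the SDK's `wait()` helper also accepts a human-review stage name (mirroring the `stage="automated"` call in Step 3), so adapt it to whatever your Rubric client actually exposes.

```python
# scripts/require_human_review.py -- hypothetical pre-deploy check.
# Assumes the Rubric SDK's wait() accepts a human-review stage name, mirroring
# the stage="automated" call used in Step 3; adjust to your actual SDK surface.
import argparse
import sys

from rubric import Rubric


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--evaluation-id", required=True)
    parser.add_argument("--timeout", type=int, default=24 * 3600)  # reviews can take hours
    args = parser.parse_args()

    client = Rubric()
    evaluation = client.evaluations.get(args.evaluation_id)

    try:
        # Block until clinician review is complete (assumed stage name).
        evaluation.wait(timeout=args.timeout, stage="human_review")
    except TimeoutError:
        sys.exit("Human review not completed in time; blocking deployment.")

    print("Human review complete; deployment may proceed.")


if __name__ == "__main__":
    main()
```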
Monitoring After Deploy
# Tag the deployed model for baseline comparison
from datetime import datetime

client.models.tag(
    model_version="model-abc123",
    tag="production-current",

    # Include evaluation results
    evaluation_id=evaluation.id,

    # Track deployment
    metadata={
        "deployed_at": datetime.now().isoformat(),
        "deployed_by": "ci-pipeline",
        "pipeline_id": get_pipeline_id()
    }
)
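On the next pipeline run, `check_regression` in Step 3 resolves this baseline through the `production-current` tag. The same lookup can be used interactively to inspect what is currently deployed; the snippet below simply reuses the attributes already shown in that script.

```python
# Inspect the current production baseline outside the pipeline
# (same get_by_tag lookup used by check_regression in Step 3).
from rubric import Rubric

client = Rubric()
baseline = client.evaluations.get_by_tag("production-current")

if not baseline:
    print("No production baseline tagged yet")
else:
    print(f"Baseline evaluation: {baseline.id}")
    print(f"Triage accuracy: {baseline.scores.triage_accuracy:.1%}")
    print(f"Red flag sensitivity: {baseline.scores.red_flag_sensitivity:.1%}")
```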
