Stella Ops Upgrade Runbook

This runbook provides step-by-step procedures for upgrading Stella Ops while preserving evidence continuity.

Quick Reference

| Phase | Duration | Owner | Rollback Point |
|-------|----------|-------|----------------|
| Pre-Upgrade | 2-4 hours | Platform Team | N/A |
| Backup | 1-2 hours | DBA | Full restore |
| Deploy Green | 30-60 min | Platform Team | Abort deploy |
| Cutover | 15-30 min | Platform Team | Instant rollback |
| Validation | 1-2 hours | QA + Security | 72h observation |
| Cleanup | 30 min | Platform Team | N/A |

Pre-Upgrade Checklist

Environment Verification

# Step 1: Record current version
stella version > /tmp/pre-upgrade-version.txt
echo "Current version: $(cat /tmp/pre-upgrade-version.txt)"

# Step 2: Verify system health
stella doctor --full --output /tmp/pre-upgrade-health.json
if [ $? -ne 0 ]; then
  echo "ABORT: System health check failed"
  exit 1
fi

# Step 3: Check pending migrations
stella system migrations-status
# Ensure no pending migrations before upgrade

# Step 4: Verify queue depths
stella queue status --all
# All queues should be empty or near-empty
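The abort-on-failure pattern used in Step 2 can be factored into a small helper so that every check is gated the same way. This is a minimal sketch; `run_step` is a hypothetical helper name, not part of the `stella` CLI.

```shell
# run_step DESCRIPTION COMMAND [ARGS...]
# Runs the command; prints an ABORT message and returns non-zero if it fails.
# (run_step is a hypothetical helper, not a stella subcommand.)
run_step() {
  step_desc=$1
  shift
  echo "==> ${step_desc}"
  if ! "$@"; then
    echo "ABORT: ${step_desc} failed" >&2
    return 1
  fi
}

# Example usage with commands from the steps above:
#   run_step "System health check" \
#     stella doctor --full --output /tmp/pre-upgrade-health.json || exit 1
#   run_step "Migration status" stella system migrations-status || exit 1
```

Passing the command as arguments rather than checking `$?` afterwards avoids the status being overwritten by an intervening command.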

Evidence Integrity Baseline

# Step 5: Capture evidence baseline
stella evidence verify-all \
  --output /backup/pre-upgrade-evidence-baseline.json \
  --include-merkle-roots

# Step 6: Export Merkle root summary
stella evidence roots-export \
  --output /backup/pre-upgrade-merkle-roots.json

# Step 7: Record evidence counts
stella evidence stats > /backup/pre-upgrade-evidence-stats.txt

Backup Procedures

# Step 8: PostgreSQL backup
BACKUP_TIMESTAMP=$(date +%Y%m%d-%H%M%S)
pg_dump -Fc -d stellaops -f /backup/stellaops-${BACKUP_TIMESTAMP}.dump

# Step 9: Verify backup integrity
pg_restore --list /backup/stellaops-${BACKUP_TIMESTAMP}.dump > /dev/null
if [ $? -ne 0 ]; then
  echo "ABORT: Backup verification failed"
  exit 1
fi

# Step 10: Evidence bundle backup
stella evidence export \
  --all \
  --output /backup/evidence-bundles-${BACKUP_TIMESTAMP}/

# Step 11: Configuration backup
kubectl get configmap -n stellaops -o yaml > /backup/configmaps-${BACKUP_TIMESTAMP}.yaml
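# Secret values are only base64-encoded, so restrict access to this file
kubectl get secret -n stellaops -o yaml > /backup/secrets-${BACKUP_TIMESTAMP}.yaml
chmod 600 /backup/secrets-${BACKUP_TIMESTAMP}.yaml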
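Because a rollback depends on these backup files being intact days later, it may help to record checksums alongside them. The sketch below assumes GNU coreutils `sha256sum` and a manifest file named `SHA256SUMS`; both function names are hypothetical.

```shell
# write_manifest DIR: record SHA-256 digests of every file under DIR.
write_manifest() {
  (
    cd "$1" || exit 1
    # Exclude the manifest itself so it never hashes its own contents.
    find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS
  )
}

# verify_manifest DIR: re-hash the files and fail if any digest differs.
verify_manifest() {
  (
    cd "$1" || exit 1
    sha256sum --check --quiet SHA256SUMS
  )
}

# Example usage:
#   write_manifest /backup      # right after Step 11
#   verify_manifest /backup     # before any restore
```

Running both functions inside a subshell keeps the `cd` from changing the caller's working directory.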

Pre-Flight Approval

Complete this checklist before proceeding:

  • Current version documented
  • System health: GREEN
  • Evidence baseline captured
  • PostgreSQL backup completed and verified
  • Evidence bundles exported
  • Configuration backed up
  • Maintenance window approved
  • Stakeholders notified
  • Rollback plan reviewed

Approver signature: __________________ Date: __________
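Several checklist items correspond to files produced in Steps 1 to 11, so their presence can be checked mechanically before sign-off. A minimal sketch follows; `require_nonempty` is a hypothetical helper, and the paths in the usage comment are the ones used earlier in this runbook.

```shell
# require_nonempty PATH...
# Returns non-zero if any file is missing or empty, or any directory has no entries.
require_nonempty() {
  missing=0
  for artifact in "$@"; do
    if [ -d "$artifact" ]; then
      # A directory counts as present only if it contains at least one entry.
      [ -n "$(ls -A "$artifact")" ] || { echo "EMPTY: $artifact" >&2; missing=1; }
    elif [ ! -s "$artifact" ]; then
      echo "MISSING OR EMPTY: $artifact" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   require_nonempty \
#     /tmp/pre-upgrade-version.txt \
#     /backup/pre-upgrade-evidence-baseline.json \
#     /backup/pre-upgrade-merkle-roots.json \
#     "/backup/stellaops-${BACKUP_TIMESTAMP}.dump" \
#     "/backup/evidence-bundles-${BACKUP_TIMESTAMP}" || exit 1
```

This only confirms that the artifacts exist; the approval and notification items still need a human check.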

Upgrade Execution

Deploy Green Environment

# Step 12: Create green namespace
kubectl create namespace stellaops-green

# Step 13: Copy secrets to green namespace
kubectl get secret stellaops-secrets -n stellaops -o yaml | \
  sed 's/namespace: stellaops/namespace: stellaops-green/' | \
  kubectl apply -f -

# Step 14: Deploy new version
# --install creates the release if it does not yet exist in the green namespace
helm upgrade --install stellaops-green ./helm/stellaops \
  --namespace stellaops-green \
  --values values-production.yaml \
  --set image.tag="${TARGET_VERSION}" \
  --wait --timeout 10m

# Step 15: Verify deployment
kubectl get pods -n stellaops-green -w
# Wait for all pods to be Running and Ready, then press Ctrl+C to stop watching
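For unattended runs, the interactive watch in Step 15 can be replaced with a check that blocks until each rollout completes or times out. This sketch assumes the chart's workloads are Deployments (StatefulSets and Jobs are not covered); `wait_for_rollouts` is a hypothetical helper name.

```shell
# wait_for_rollouts NAMESPACE [TIMEOUT]
# Blocks until every Deployment in the namespace has rolled out; fails on timeout.
wait_for_rollouts() {
  ns=$1
  wait_timeout=${2:-10m}
  deployments=$(kubectl get deployments -n "$ns" -o name) || return 1
  if [ -z "$deployments" ]; then
    echo "No deployments found in namespace $ns" >&2
    return 1
  fi
  for deploy in $deployments; do
    kubectl rollout status "$deploy" -n "$ns" --timeout="$wait_timeout" || return 1
  done
}

# Example usage:
#   wait_for_rollouts stellaops-green 10m || exit 1
```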

Run Migrations

# Step 16: Apply Category A migrations (startup)
stella system migrations-run \
  --category A \
  --namespace stellaops-green

# Step 17: Verify migration success
stella system migrations-status --namespace stellaops-green
# All migrations should show "Applied"

# Step 18: Apply Category B migrations if needed (manual)
# Review migration list first
stella system migrations-pending --category B --namespace stellaops-green

# Apply after review
stella system migrations-run \
  --category B \
  --namespace stellaops-green \
  --confirm

Evidence Migration (If Required)

# Step 19: Check if evidence migration needed
stella evidence migrate --dry-run --namespace stellaops-green

# Step 20: If migration needed, execute
stella evidence migrate \
  --namespace stellaops-green \
  --batch-size 100 \
  --progress

# Step 21: Verify evidence integrity post-migration
stella evidence verify-all \
  --namespace stellaops-green \
  --output /tmp/post-migration-evidence.json

Health Validation

# Step 22: Run health checks on green
stella doctor --full --namespace stellaops-green

# Step 23: Run smoke tests
stella test smoke --namespace stellaops-green

# Step 24: Verify critical paths
stella test critical-paths --namespace stellaops-green

Traffic Cutover

Gradual Cutover

# Step 25: Enable canary (10%)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: stellaops-canary
  namespace: stellaops-green
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  ingressClassName: nginx
  rules:
  - host: stellaops.company.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: stellaops-api
            port:
              number: 80
EOF

# Step 26: Monitor for 15 minutes
# Check error rates, latency, evidence operations

# Step 27: Increase to 50%
kubectl patch ingress stellaops-canary -n stellaops-green \
  --type='json' \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "50"}]'

# Step 28: Monitor for 15 minutes

# Step 29: Complete cutover (100%)
kubectl patch ingress stellaops-canary -n stellaops-green \
  --type='json' \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "100"}]'
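Steps 27 and 29 differ only in the weight value, so the patch can be generated from one place. The sketch below wraps the same `kubectl patch` call shown above; the function names are hypothetical, and the range check assumes ingress-nginx's default canary weight range of 0 to 100.

```shell
# canary_weight_patch WEIGHT
# Prints a JSON Patch that sets the canary-weight annotation.
canary_weight_patch() {
  case "$1" in
    ''|*[!0-9]*) echo "weight must be an integer 0-100" >&2; return 1 ;;
  esac
  if [ "$1" -gt 100 ]; then
    echo "weight must be an integer 0-100" >&2
    return 1
  fi
  # "~1" is the JSON Pointer (RFC 6901) escape for "/" inside the annotation key.
  printf '[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "%s"}]\n' "$1"
}

# set_canary_weight WEIGHT
# Applies the patch to the canary Ingress created in Step 25.
set_canary_weight() {
  patch=$(canary_weight_patch "$1") || return 1
  kubectl patch ingress stellaops-canary -n stellaops-green --type=json -p "$patch"
}

# Example usage:
#   set_canary_weight 50
```

Validating the weight before calling `kubectl` prevents a typo from writing an unusable annotation value during the cutover.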

Monitoring During Cutover

Watch these dashboards:

  • Grafana: Stella Ops Overview
  • Grafana: Evidence Operations
  • Grafana: Attestation Pipeline

Alert thresholds:

  • Error rate > 1%: Pause cutover
  • p99 latency > 5s: Investigate
  • Evidence failures > 0: Roll back
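The thresholds above can be encoded as a single decision function so that the operator's action does not depend on memory during the cutover. This is a sketch only: the metric values are assumed to be read manually from the dashboards, the function name is hypothetical, and the precedence (rollback over pause over investigate) is an assumption rather than something this runbook states.

```shell
# cutover_decision ERROR_RATE_PERCENT P99_SECONDS EVIDENCE_FAILURES
# Prints one of: rollback, pause, investigate, continue.
cutover_decision() {
  awk -v err="$1" -v p99="$2" -v fails="$3" 'BEGIN {
    if (fails + 0 > 0)     print "rollback"      # any evidence failure
    else if (err + 0 > 1)  print "pause"         # error rate above 1%
    else if (p99 + 0 > 5)  print "investigate"   # p99 latency above 5s
    else                   print "continue"
  }'
}

# Example usage:
#   cutover_decision 0.4 2.1 0
```

`awk` is used for the comparisons because POSIX shell arithmetic does not handle fractional values such as an error rate of 0.4.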

Post-Upgrade Validation

Evidence Continuity Verification

# Step 30: Verify chain-of-custody
stella evidence verify-continuity \
  --baseline /backup/pre-upgrade-evidence-baseline.json \
  --output /reports/continuity-report.html

# Step 31: Verify Merkle roots
stella evidence verify-roots \
  --baseline /backup/pre-upgrade-merkle-roots.json \
  --output /reports/roots-verification.json

# Step 32: Compare evidence stats
stella evidence stats > /tmp/post-upgrade-evidence-stats.txt
diff /backup/pre-upgrade-evidence-stats.txt /tmp/post-upgrade-evidence-stats.txt

# Step 33: Generate audit report
# UPGRADE_START_TIME must be set to the time the upgrade began
stella evidence audit-report \
  --since "${UPGRADE_START_TIME}" \
  --format pdf \
  --output /reports/upgrade-audit-$(date +%Y%m%d).pdf

Functional Validation

# Step 34: Full integration test
stella test integration --full

# Step 35: Scan test
stella scan \
  --image registry.company.com/test-app:latest \
  --sbom-format spdx-2.3

# Step 36: Attestation test
stella attest \
  --subject sha256:test123 \
  --predicate-type slsa-provenance

# Step 37: Policy evaluation test
stella policy evaluate \
  --artifact sha256:test123 \
  --environment production

Post-Upgrade Checklist

  • Evidence continuity verified
  • Merkle roots consistent
  • All services healthy
  • Integration tests passing
  • Scan capability verified
  • Attestation generation working
  • Policy evaluation working
  • No elevated error rates
  • Latency within SLO

Validator signature: __________________ Date: __________

Rollback Procedures

Immediate Rollback (During Cutover)

# Revert canary to 0%
kubectl patch ingress stellaops-canary -n stellaops-green \
  --type='json' \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'

# Or delete canary entirely
kubectl delete ingress stellaops-canary -n stellaops-green

Full Rollback (After Cutover)

# Step R1: Assess database state
stella system migrations-status

# Step R2: If migrations are backward-compatible,
# redeploy the previous version and skip Step R3
# Pass the production values file; with --set alone, helm resets other values to chart defaults
helm upgrade stellaops ./helm/stellaops \
  --namespace stellaops \
  --values values-production.yaml \
  --set image.tag="${PREVIOUS_VERSION}" \
  --wait

# Step R3: If a database restore is needed instead,
# stop all services first
kubectl scale deployment --all --replicas=0 -n stellaops

# Restore database (BACKUP_TIMESTAMP must match the dump taken in Step 8)
pg_restore -d stellaops -c /backup/stellaops-${BACKUP_TIMESTAMP}.dump

# Redeploy previous version
helm upgrade stellaops ./helm/stellaops \
  --namespace stellaops \
  --values values-production.yaml \
  --set image.tag="${PREVIOUS_VERSION}" \
  --wait

# Step R4: Verify rollback
stella doctor --full
stella evidence verify-all
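A rollback is often run from a fresh shell in which `BACKUP_TIMESTAMP` is no longer set. Because the timestamp format from Step 8 (`%Y%m%d-%H%M%S`) sorts chronologically, the newest dump can be located by name. This is a sketch under that assumption; `latest_dump` is a hypothetical helper.

```shell
# latest_dump DIR
# Prints the newest stellaops-YYYYMMDD-HHMMSS.dump found in DIR.
latest_dump() {
  newest=""
  for dump in "$1"/stellaops-*.dump; do
    [ -e "$dump" ] || continue   # glob matched nothing
    newest=$dump                 # glob results are sorted, so the last match is newest
  done
  if [ -z "$newest" ]; then
    echo "no dump found in $1" >&2
    return 1
  fi
  echo "$newest"
}

# Example usage: confirm the dump is readable before stopping services in Step R3.
#   DUMP=$(latest_dump /backup) || exit 1
#   pg_restore --list "$DUMP" > /dev/null || exit 1
```

If more than one upgrade attempt has been made, confirm that the newest dump is the one taken before the version being rolled back.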

Cleanup

After 72-Hour Observation

# Step 40: Verify stable operation
stella doctor --full
stella evidence verify-all

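# Step 41: Remove blue environment
# Earlier steps use the namespace "stellaops" for the pre-upgrade (blue) environment;
# confirm the namespace name before deleting
kubectl delete namespace stellaops-blue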

# Step 42: Archive upgrade artifacts
# The /tmp glob also picks up pre-upgrade-health.json from Step 2
tar -czf /archive/upgrade-${UPGRADE_TIMESTAMP}.tar.gz \
  /backup/ \
  /reports/ \
  /tmp/pre-upgrade-*

# Step 43: Update documentation
echo "${TARGET_VERSION}" > docs/CURRENT_VERSION.md

Appendix

Version-Specific Notes

See docs/releases/{version}/MIGRATION.md for version-specific migration notes.

Breaking Changes Matrix

| From | To | Breaking Changes | Migration Required |
|------|----|------------------|--------------------|
| 2027.Q1 | 2027.Q2 | None | No |
| 2026.Q4 | 2027.Q1 | Policy schema v2 | Yes |

Support Contacts