old sprints work, new sprints for exposing functionality via cli, improve code_of_conduct and other agents instructions
381
docs/operations/upgrade-runbook.md
Normal file
@@ -0,0 +1,381 @@
# Stella Ops Upgrade Runbook

This runbook provides step-by-step procedures for upgrading Stella Ops while preserving evidence continuity.

## Quick Reference

| Phase | Duration | Owner | Rollback Point |
|-------|----------|-------|----------------|
| Pre-Upgrade | 2-4 hours | Platform Team | N/A |
| Backup | 1-2 hours | DBA | Full restore |
| Deploy Green | 30-60 min | Platform Team | Abort deploy |
| Cutover | 15-30 min | Platform Team | Instant rollback |
| Validation | 1-2 hours | QA + Security | 72h observation |
| Cleanup | 30 min | Platform Team | N/A |

## Pre-Upgrade Checklist

### Environment Verification

```bash
# Step 1: Record current version
stella version > /tmp/pre-upgrade-version.txt
echo "Current version: $(cat /tmp/pre-upgrade-version.txt)"

# Step 2: Verify system health
if ! stella doctor --full --output /tmp/pre-upgrade-health.json; then
  echo "ABORT: System health check failed"
  exit 1
fi

# Step 3: Check pending migrations
stella system migrations-status
# Ensure no pending migrations before upgrade

# Step 4: Verify queue depths
stella queue status --all
# All queues should be empty or near-empty
```
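
The health check above and the backup verification later in this runbook follow the same pattern: run a command and abort if it exits non-zero. A minimal sketch of that pattern as a reusable function follows. The name `gate` is hypothetical and is not part of the `stella` CLI; the function only wraps whatever command it is given.

```bash
# Hypothetical helper (an assumption, not a stella feature):
# run a check and report an abort if it exits non-zero.
gate() {
  local label="$1"
  shift
  if ! "$@"; then
    echo "ABORT: ${label} failed" >&2
    return 1
  fi
  echo "OK: ${label}"
}

# Possible usage with the Step 2 command from this runbook:
#   gate "System health check" \
#     stella doctor --full --output /tmp/pre-upgrade-health.json || exit 1
```

Returning a status instead of calling `exit` inside the function leaves the decision to stop with the caller, which is safer when the commands are pasted into an interactive shell.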

### Evidence Integrity Baseline

```bash
# Step 5: Capture evidence baseline
stella evidence verify-all \
  --output /backup/pre-upgrade-evidence-baseline.json \
  --include-merkle-roots

# Step 6: Export Merkle root summary
stella evidence roots-export \
  --output /backup/pre-upgrade-merkle-roots.json

# Step 7: Record evidence counts
stella evidence stats > /backup/pre-upgrade-evidence-stats.txt
```

### Backup Procedures

```bash
# Step 8: PostgreSQL backup
BACKUP_TIMESTAMP=$(date +%Y%m%d-%H%M%S)
pg_dump -Fc -d stellaops -f "/backup/stellaops-${BACKUP_TIMESTAMP}.dump"

# Step 9: Verify backup integrity (the archive must be listable)
if ! pg_restore --list "/backup/stellaops-${BACKUP_TIMESTAMP}.dump" > /dev/null; then
  echo "ABORT: Backup verification failed"
  exit 1
fi

# Step 10: Evidence bundle backup
stella evidence export \
  --all \
  --output "/backup/evidence-bundles-${BACKUP_TIMESTAMP}/"

# Step 11: Configuration backup
kubectl get configmap -n stellaops -o yaml > "/backup/configmaps-${BACKUP_TIMESTAMP}.yaml"
kubectl get secret -n stellaops -o yaml > "/backup/secrets-${BACKUP_TIMESTAMP}.yaml"
# Secret values are only base64-encoded, so restrict access to this file
chmod 600 "/backup/secrets-${BACKUP_TIMESTAMP}.yaml"
```
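
The backup steps check that the PostgreSQL dump is listable but do not record anything that would reveal later corruption of the backup files themselves. One possible addition, sketched here under the assumption of a Linux host with GNU coreutils and findutils (`sha256sum`, `sort -z`, `xargs -r`), is a checksum manifest. The function names and the `SHA256SUMS` file name are illustrative choices, not part of the runbook or of the `stella` CLI.

```bash
# Hypothetical helpers: write and verify a SHA-256 manifest for a directory.
# Assumes GNU coreutils and findutils.
write_checksums() {
  local dir="$1"
  (
    cd "$dir" || exit 1
    # Hash every regular file except the manifest itself, in a stable order
    find . -type f ! -name SHA256SUMS -print0 \
      | sort -z \
      | xargs -0 -r sha256sum > SHA256SUMS
  )
}

verify_checksums() {
  local dir="$1"
  (
    cd "$dir" || exit 1
    # Prints only failures; exits non-zero if any file does not match
    sha256sum --check --quiet SHA256SUMS
  )
}

# Possible usage after Step 11 and again before any restore:
#   write_checksums /backup
#   verify_checksums /backup || echo "ABORT: backup files changed"
```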

### Pre-Flight Approval

Complete this checklist before proceeding:

- [ ] Current version documented
- [ ] System health: GREEN
- [ ] Evidence baseline captured
- [ ] PostgreSQL backup completed and verified
- [ ] Evidence bundles exported
- [ ] Configuration backed up
- [ ] Maintenance window approved
- [ ] Stakeholders notified
- [ ] Rollback plan reviewed

**Approver signature**: __________________ **Date**: __________
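
The commands in the remaining sections reference `TARGET_VERSION`, `PREVIOUS_VERSION`, `UPGRADE_START_TIME`, and `UPGRADE_TIMESTAMP`, none of which is assigned anywhere in this runbook. A small guard, sketched below, can confirm they are set before execution starts. The function name `require_vars` is hypothetical, and the expected format of each value is not specified in this document, so none is assumed here.

```bash
# Hypothetical guard (bash-specific: uses indirect expansion ${!name}).
# Reports every named variable that is unset or empty.
require_vars() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "ABORT: ${name} is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible usage before Step 12:
#   require_vars TARGET_VERSION PREVIOUS_VERSION \
#     UPGRADE_START_TIME UPGRADE_TIMESTAMP || exit 1
```

Checking all names before returning, rather than stopping at the first missing one, lets the operator fix every gap in a single pass.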

## Upgrade Execution

### Deploy Green Environment

```bash
# Step 12: Create green namespace
kubectl create namespace stellaops-green

# Step 13: Copy secrets to green namespace
kubectl get secret stellaops-secrets -n stellaops -o yaml | \
  sed 's/namespace: stellaops/namespace: stellaops-green/' | \
  kubectl apply -f -

# Step 14: Deploy new version
# --install creates the release on first run; a plain "helm upgrade"
# fails when no release named stellaops-green exists yet
helm upgrade --install stellaops-green ./helm/stellaops \
  --namespace stellaops-green \
  --values values-production.yaml \
  --set image.tag="${TARGET_VERSION}" \
  --wait --timeout 10m

# Step 15: Verify deployment
kubectl get pods -n stellaops-green -w
# Wait for all pods to be Running and Ready, then press Ctrl+C to stop watching
```

### Run Migrations

```bash
# Step 16: Apply Category A migrations (startup)
stella system migrations-run \
  --category A \
  --namespace stellaops-green

# Step 17: Verify migration success
stella system migrations-status --namespace stellaops-green
# All migrations should show "Applied"

# Step 18: Apply Category B migrations if needed (manual)
# Review migration list first
stella system migrations-pending --category B

# Apply after review
stella system migrations-run \
  --category B \
  --namespace stellaops-green \
  --confirm
```

### Evidence Migration (If Required)

```bash
# Step 19: Check if evidence migration needed
stella evidence migrate --dry-run --namespace stellaops-green

# Step 20: If migration needed, execute
stella evidence migrate \
  --namespace stellaops-green \
  --batch-size 100 \
  --progress

# Step 21: Verify evidence integrity post-migration
stella evidence verify-all \
  --namespace stellaops-green \
  --output /tmp/post-migration-evidence.json
```

### Health Validation

```bash
# Step 22: Run health checks on green
stella doctor --full --namespace stellaops-green

# Step 23: Run smoke tests
stella test smoke --namespace stellaops-green

# Step 24: Verify critical paths
stella test critical-paths --namespace stellaops-green
```

## Traffic Cutover

### Gradual Cutover

```bash
# Step 25: Enable canary (10%)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: stellaops-canary
  namespace: stellaops-green
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  ingressClassName: nginx
  rules:
    - host: stellaops.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: stellaops-api
                port:
                  number: 80
EOF

# Step 26: Monitor for 15 minutes
# Check error rates, latency, evidence operations

# Step 27: Increase to 50%
kubectl patch ingress stellaops-canary -n stellaops-green \
  --type='json' \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "50"}]'

# Step 28: Monitor for 15 minutes

# Step 29: Complete cutover (100%)
kubectl patch ingress stellaops-canary -n stellaops-green \
  --type='json' \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "100"}]'
```
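
Steps 27 and 29 repeat the same JSON Patch with a different weight, and the immediate rollback uses it again with `0`. In the patch path, `~1` is the JSON Pointer escape for `/`, which is why the annotation key `nginx.ingress.kubernetes.io/canary-weight` appears as `nginx.ingress.kubernetes.io~1canary-weight`. The sketch below wraps the patch in two hypothetical functions (the names are illustrative, not an existing tool) so that a mistyped weight is rejected before `kubectl` is called.

```bash
# Hypothetical helper: print the JSON Patch that sets the canary weight.
# Accepts only an integer from 0 to 100.
canary_weight_patch() {
  local weight="$1"
  case "$weight" in
    ''|*[!0-9]*)
      echo "weight must be an integer 0-100" >&2
      return 1
      ;;
  esac
  if [ "$weight" -gt 100 ]; then
    echo "weight must be an integer 0-100" >&2
    return 1
  fi
  printf '[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "%s"}]\n' "$weight"
}

# Hypothetical wrapper: apply the patch to the canary Ingress from Step 25.
set_canary_weight() {
  local patch
  patch="$(canary_weight_patch "$1")" || return 1
  kubectl patch ingress stellaops-canary -n stellaops-green \
    --type='json' -p="$patch"
}

# Possible usage: set_canary_weight 50
```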

### Monitoring During Cutover

Watch these dashboards:
- Grafana: Stella Ops Overview
- Grafana: Evidence Operations
- Grafana: Attestation Pipeline

Alert thresholds:
- Error rate > 1%: Pause cutover
- p99 latency > 5s: Investigate
- Evidence failures > 0: Rollback
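
The alert thresholds above can be read as a small decision rule. The sketch below encodes them in a hypothetical function that takes the three readings as arguments; how those readings are obtained from Grafana is outside its scope. The precedence (rollback first, then pause, then investigate) is an assumption, since the list does not say which action wins when several thresholds are crossed at once.

```bash
# Hypothetical helper: map the three readings to the runbook's actions.
#   $1 = error rate in percent, $2 = p99 latency in seconds,
#   $3 = number of evidence failures
# Assumed precedence: rollback > pause > investigate > continue.
cutover_decision() {
  awk -v err="$1" -v p99="$2" -v fails="$3" 'BEGIN {
    if (fails + 0 > 0)      print "rollback"
    else if (err + 0 > 1)   print "pause"
    else if (p99 + 0 > 5)   print "investigate"
    else                    print "continue"
  }'
}

# Possible usage: cutover_decision 0.4 2.1 0
```

`awk` handles the comparison because shell arithmetic with `[ ... -gt ... ]` accepts integers only, and error rates and latencies are usually fractional.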

## Post-Upgrade Validation

### Evidence Continuity Verification

```bash
# Step 30: Verify chain-of-custody
stella evidence verify-continuity \
  --baseline /backup/pre-upgrade-evidence-baseline.json \
  --output /reports/continuity-report.html

# Step 31: Verify Merkle roots
stella evidence verify-roots \
  --baseline /backup/pre-upgrade-merkle-roots.json \
  --output /reports/roots-verification.json

# Step 32: Compare evidence stats
stella evidence stats > /tmp/post-upgrade-evidence-stats.txt
diff /backup/pre-upgrade-evidence-stats.txt /tmp/post-upgrade-evidence-stats.txt

# Step 33: Generate audit report
stella evidence audit-report \
  --since "${UPGRADE_START_TIME}" \
  --format pdf \
  --output "/reports/upgrade-audit-$(date +%Y%m%d).pdf"
```
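
Step 32 prints differences but leaves the pass or fail judgment to the reader. A minimal sketch of a wrapper that turns the comparison into an explicit result follows. The function name is hypothetical, and treating any difference as a mismatch is an assumption: if new evidence is legitimately created during the upgrade window, the counts may differ and the output needs manual review instead.

```bash
# Hypothetical helper: compare two stats files and report the outcome.
# diff exits 0 when the files are identical and non-zero when they
# differ or cannot be read.
compare_evidence_stats() {
  local before="$1" after="$2"
  if diff -u "$before" "$after"; then
    echo "MATCH: evidence stats unchanged"
  else
    echo "MISMATCH: review the differences above" >&2
    return 1
  fi
}

# Possible usage:
#   compare_evidence_stats /backup/pre-upgrade-evidence-stats.txt \
#     /tmp/post-upgrade-evidence-stats.txt
```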

### Functional Validation

```bash
# Step 34: Full integration test
stella test integration --full

# Step 35: Scan test
stella scan \
  --image registry.company.com/test-app:latest \
  --sbom-format spdx-2.3

# Step 36: Attestation test
stella attest \
  --subject sha256:test123 \
  --predicate-type slsa-provenance

# Step 37: Policy evaluation test
stella policy evaluate \
  --artifact sha256:test123 \
  --environment production
```

### Post-Upgrade Checklist

- [ ] Evidence continuity verified
- [ ] Merkle roots consistent
- [ ] All services healthy
- [ ] Integration tests passing
- [ ] Scan capability verified
- [ ] Attestation generation working
- [ ] Policy evaluation working
- [ ] No elevated error rates
- [ ] Latency within SLO

**Validator signature**: __________________ **Date**: __________

## Rollback Procedures

### Immediate Rollback (During Cutover)

```bash
# Revert canary to 0%
kubectl patch ingress stellaops-canary -n stellaops-green \
  --type='json' \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'

# Or delete canary entirely
kubectl delete ingress stellaops-canary -n stellaops-green
```

### Full Rollback (After Cutover)

```bash
# Step R1: Assess database state
stella system migrations-status

# Step R2: If migrations are backward-compatible,
# simply redeploy the previous version
helm upgrade stellaops ./helm/stellaops \
  --namespace stellaops \
  --set image.tag="${PREVIOUS_VERSION}" \
  --wait

# Step R3: If a database restore is needed,
# stop all services first
kubectl scale deployment --all --replicas=0 -n stellaops

# Restore database
pg_restore -d stellaops -c "/backup/stellaops-${BACKUP_TIMESTAMP}.dump"

# Redeploy previous version
helm upgrade stellaops ./helm/stellaops \
  --namespace stellaops \
  --set image.tag="${PREVIOUS_VERSION}" \
  --wait

# Step R4: Verify rollback
stella doctor --full
stella evidence verify-all
```

## Cleanup

### After 72-Hour Observation

```bash
# Step 40: Verify stable operation
stella doctor --full
stella evidence verify-all

# Step 41: Remove blue environment
# NOTE: earlier steps run the previous version in the "stellaops" namespace
# and never create "stellaops-blue"; confirm which namespace holds the blue
# environment before deleting anything.
kubectl delete namespace stellaops-blue

# Step 42: Archive upgrade artifacts
# The /tmp glob also matches pre-upgrade-health.json from Step 2
tar -czf "/archive/upgrade-${UPGRADE_TIMESTAMP}.tar.gz" \
  /backup/ \
  /reports/ \
  /tmp/pre-upgrade-*

# Step 43: Update documentation
echo "${TARGET_VERSION}" > docs/CURRENT_VERSION.md
```
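
Step 42 creates the archive but does not confirm that it can be read back. One way to check, sketched below with a hypothetical function name, is to list the archive contents with `tar -tzf`, which fails when the file is truncated or is not a gzip-compressed tar archive. It would make sense to run such a check before any of the archived source files are removed.

```bash
# Hypothetical helper: confirm an archive exists, is non-empty, and can be
# listed end to end by tar.
verify_archive() {
  local archive="$1"
  if [ ! -s "$archive" ]; then
    echo "ABORT: ${archive} is missing or empty" >&2
    return 1
  fi
  if ! tar -tzf "$archive" > /dev/null; then
    echo "ABORT: ${archive} is not a readable gzip tar archive" >&2
    return 1
  fi
  echo "OK: ${archive}"
}

# Possible usage after Step 42:
#   verify_archive "/archive/upgrade-${UPGRADE_TIMESTAMP}.tar.gz" || exit 1
```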

## Appendix

### Version-Specific Notes

See `docs/releases/{version}/MIGRATION.md` for version-specific migration notes.

### Breaking Changes Matrix

| From | To | Breaking Changes | Migration Required |
|------|-----|-----------------|-------------------|
| 2027.Q1 | 2027.Q2 | None | No |
| 2026.Q4 | 2027.Q1 | Policy schema v2 | Yes |

### Support Contacts

- Platform Team: platform@company.com
- DBA Team: dba@company.com
- Security Team: security@company.com
- On-Call: +1-555-OPS-CALL