
Launch Cutover Runbook - Stella Ops

Document owner: DevOps Guild (2025-10-26)
Scope: Full-platform launch from staging to production for release 2025.09.2.

1. Roles and Communication

| Role | Primary | Backup | Contact |
| --- | --- | --- | --- |
| Cutover lead | DevOps Guild (on-call engineer) | Platform Ops lead | #launch-bridge (Mattermost) |
| Authority stack | Authority Core guild rep | Security guild rep | #authority |
| Scanner / Queue | Scanner WebService guild rep | Runtime guild rep | #scanner |
| Storage | Mongo/MinIO operators | Backup DB admin | Pager escalation |
| Observability | Telemetry guild rep | SRE on-call | #telemetry |
| Approvals | Product owner + CTO | DevOps lead | Approval recorded in change ticket |

Set up a bridge call 30 minutes before start and keep #launch-bridge updated every 10 minutes.

2. Timeline Overview (UTC)

| Time | Activity | Owner |
| --- | --- | --- |
| T-24h | Change ticket approved, prod secrets verified, offline kit build status checked (DEVOPS-OFFLINE-18-005). | DevOps lead |
| T-12h | Run deploy/tools/validate-profiles.sh; capture logs in ticket. | DevOps engineer |
| T-6h | Freeze non-launch deployments; notify guild leads. | Product owner |
| T-2h | Execute rehearsal in staging (Section 3) using values-stage.yaml to verify scripts. | DevOps + module reps |
| T-30m | Final go/no-go with guild leads; confirm monitoring dashboards green. | Cutover lead |
| T0 | Execute production cutover steps (Section 4). | Cutover team |
| T+45m | Smoke tests complete (Section 5); announce success or trigger rollback. | Cutover lead |
| T+4h | Post-cutover metrics review, notify stakeholders, close ticket. | DevOps + product owner |
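
To turn the relative offsets above into wall-clock times for the change ticket, a small helper along these lines may be useful. This is a sketch that assumes GNU date (the -d flag); the function names milestone and print_timeline and the example T0 value are illustrative and not part of the Stella Ops tooling.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU date. Function names are hypothetical.

# Print the UTC time that lies OFFSET_MINUTES away from T0.
milestone() {
  local t0="$1" offset_minutes="$2" epoch
  epoch=$(date -u -d "$t0" +%s) || return 1
  date -u -d "@$((epoch + offset_minutes * 60))" +%Y-%m-%dT%H:%MZ
}

# Print every milestone from the timeline table, one per line.
print_timeline() {
  local t0="$1" entry
  for entry in "T-24h:-1440" "T-12h:-720" "T-6h:-360" "T-2h:-120" \
               "T-30m:-30" "T0:0" "T+45m:45" "T+4h:240"; do
    printf '%-6s %s\n' "${entry%%:*}" "$(milestone "$t0" "${entry##*:}")"
  done
}

# Example (hypothetical T0):
#   print_timeline 2025-11-01T12:00:00Z
```

Working from epoch seconds avoids the ambiguity GNU date can show when a relative offset directly follows a timestamp with a zone suffix.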

3. Rehearsal (Staging) Checklist

  1. Create the front-door network if it is not already present on the staging jump host: docker network create stellaops_frontdoor || true.
  2. Run deploy/tools/validate-profiles.sh and archive output.
  3. Apply staging secrets (kubectl apply -f secrets/stage/ or helm secrets upgrade), ensuring the stellaops-stage credentials align with values-stage.yaml.
  4. Run helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-stage.yaml against the staging cluster.
  5. Verify health endpoints: curl https://authority.stage.../healthz, curl https://scanner.stage.../healthz.
  6. Execute smoke CLI: stellaops-cli scan submit --profile staging --sbom samples/sbom/demo.json and confirm report status in UI.
  7. Document total wall time and any deviations in the rehearsal log.

The rehearsal must complete without manual intervention before proceeding to production.
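
Step 7 asks for total wall time and any deviations. One way to capture per-step timing is a thin wrapper that runs each rehearsal command and prints a tab-separated log line. This is a minimal sketch; run_step is a hypothetical helper name and the log format is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: run_step is a hypothetical helper, not part of the repo.

# Run a command, then print: UTC timestamp, label, duration, exit code.
run_step() {
  local label="$1" start end status=0
  shift
  start=$(date +%s)
  "$@" || status=$?
  end=$(date +%s)
  printf '%s\t%s\t%ss\texit=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$label" "$((end - start))" "$status"
  return "$status"
}

# Example:
#   run_step "validate-profiles" deploy/tools/validate-profiles.sh
#   run_step "helm-upgrade-stage" helm upgrade stellaops deploy/helm/stellaops \
#     -f deploy/helm/stellaops/values-stage.yaml
```

The wrapper returns the wrapped command's exit code, so a failing step still stops a script that runs under set -e.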

4. Production Cutover Steps

4.1 Pre-flight

  • Confirm production secrets in the appropriate secret store (stellaops-prod-core, stellaops-prod-mongo, stellaops-prod-minio, stellaops-prod-notify) contain the keys referenced in values-prod.yaml.
  • Ensure the external reverse proxy network exists: docker network create stellaops_frontdoor || true on each compose host.
  • Back up current configuration and data:
    • Mongo snapshot: mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-$(date -Iseconds).
    • MinIO bucket mirror: mc mirror --overwrite minio/stellaops minio-backup/stellaops-$(date +%Y%m%d%H%M).
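
The two backup commands compute their timestamps independently and in different formats, which can make it harder to pair the Mongo dump with the MinIO mirror during a restore. One option is to generate a single stamp and reuse it. The sketch below only prints the commands for review (a dry run); backup_commands and the stamp format are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: prints the backup commands with one shared timestamp.
# backup_commands is a hypothetical helper; nothing is executed here.

backup_commands() {
  local stamp="$1"
  # $MONGO_BACKUP_URI is printed literally on purpose so that it is
  # expanded when the operator runs the command, not when it is printed.
  printf 'mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-%s\n' "$stamp"
  printf 'mc mirror --overwrite minio/stellaops minio-backup/stellaops-%s\n' "$stamp"
}

# Example:
#   backup_commands "$(date -u +%Y%m%dT%H%M%SZ)"
```

Record the stamp in the change ticket so that the rollback steps can refer to the same value.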

4.2 Apply Updates (Compose)

  1. On each compose node, pull updated images for release 2025.09.2:
    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml pull
    
  2. Deploy changes:
    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml up -d
    
  3. Confirm the containers are healthy with docker compose ps and docker compose logs --tail 50 <service> (pass the same --env-file and -f flags as above).

4.3 Apply Updates (Helm/Kubernetes)

If deploying on Kubernetes, run:

helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml --namespace stellaops --atomic --timeout 15m

Monitor the rollout with kubectl get pods -n stellaops --watch and kubectl rollout status deployment/<service> -n stellaops.
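
Checking each deployment by hand is easy to get wrong under time pressure. The sketch below loops over every deployment in a namespace and waits for each rollout. It assumes kubectl is already pointed at the target cluster; wait_rollouts is a hypothetical helper name and the 10m default timeout is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: wait_rollouts is a hypothetical helper.

# Wait for every deployment in NAMESPACE to finish rolling out.
# Returns non-zero if listing fails or any rollout fails or times out.
wait_rollouts() {
  local ns="$1" timeout="${2:-10m}" deployments dep failed=0
  deployments=$(kubectl get deployments --namespace "$ns" --output name) || return 1
  for dep in $deployments; do
    kubectl rollout status "$dep" --namespace "$ns" --timeout "$timeout" || failed=1
  done
  return "$failed"
}

# Example:
#   wait_rollouts stellaops 15m
```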

4.4 Configuration Validation

  • Verify Authority issuer metadata: curl https://authority.prod.../.well-known/openid-configuration.
  • Validate Signer DSSE endpoint: stellaops-cli signer verify --base-url https://signer.prod... --bundle samples/dsse/demo.json.
  • Check Scanner queue connectivity: docker exec stellaops-scanner-web dotnet StellaOps.Scanner.WebService.dll health queue; the command should report success.
  • Ensure the legacy Notify service remains accessible while the Notifier migration is pending.
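
Endpoints may take a short while to come up after the upgrade, so a single failed probe is not necessarily a failure. A generic retry wrapper such as the following can be placed in front of any of the checks above. This is a sketch; retry is a hypothetical helper and the attempt count and delay are illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: retry is a hypothetical helper.

# Usage: retry ATTEMPTS DELAY_SECONDS command [args...]
# Runs the command until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example (URL left elided as in this runbook):
#   retry 5 10 curl -fsS https://authority.prod.../.well-known/openid-configuration
```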

5. Smoke Tests

| Test | Command / Action | Expected Result |
| --- | --- | --- |
| API health | curl https://scanner.prod.../healthz | HTTP 200 with "status":"Healthy" |
| Scan submit | stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json | Scan completes in under 5 minutes; report accessible with signed DSSE |
| Runtime event ingest | Post sample event from Zastava observer fixture | /runtime/events responds 202 Accepted; record visible in Mongo runtime_events |
| Signing | stellaops-cli signer sign --bundle demo.json | Returns DSSE with matching SHA256 and signer metadata |
| Attestor verify | stellaops-cli attestor verify --uuid <uuid> | Verification result ok=true |
| Web UI | Manual login, verify dashboards render and latency within budget | UI loads under 2 seconds; policy views consistent |

Log results in the change ticket with timestamps and screenshots where applicable.
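
The API health row can be checked mechanically rather than by eye. The sketch below tests an HTTP status code and response body against the expected result from the table. It assumes the health endpoint returns JSON containing a "status" field, as the table suggests; is_healthy is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: is_healthy is a hypothetical helper. The response shape
# ("status":"Healthy" somewhere in the body) is assumed from the table.

# Usage: is_healthy HTTP_CODE BODY
is_healthy() {
  local code="$1" body="$2"
  [ "$code" = "200" ] &&
    printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"Healthy"'
}

# Example (URL left elided as in this runbook):
#   url=https://scanner.prod.../healthz
#   code=$(curl -sS -o /dev/null -w '%{http_code}' "$url")
#   body=$(curl -sS "$url")
#   is_healthy "$code" "$body" && echo "API health: pass"
```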

6. Rollback Procedure

  1. Assess failure scope; if systemic, initiate rollback immediately while preserving logs/artifacts.
  2. For Compose:
    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml down
    docker compose --env-file stage.env -f deploy/compose/docker-compose.stage.yaml up -d
    
  3. For Helm:
    helm rollback stellaops <previous-release-number> --namespace stellaops
    
  4. Restore the Mongo snapshot if data inconsistency is detected: mongorestore --uri "$MONGO_BACKUP_URI" --drop /backups/launch-<timestamp>.
  5. Restore MinIO mirror if required: mc mirror minio-backup/stellaops-<timestamp> minio/stellaops.
  6. Notify stakeholders of rollback and capture root cause notes in incident ticket.
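
Step 4 needs the <timestamp> of the pre-flight dump. If it was not recorded in the change ticket, the most recent dump directory can be looked up by name. This is a sketch that assumes all dumps live under one directory and were stamped in the same timezone, so that lexical order matches chronological order; latest_backup is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: latest_backup is a hypothetical helper.

# Print the lexically last launch-* directory under ROOT (default /backups).
# Returns non-zero if none exists.
latest_backup() {
  local root="${1:-/backups}" latest="" dir
  # Glob results are sorted, so the last match is the newest stamp.
  for dir in "$root"/launch-*/; do
    [ -d "$dir" ] || continue
    latest="${dir%/}"
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Example:
#   mongorestore --uri "$MONGO_BACKUP_URI" --drop "$(latest_backup /backups)"
```

Confirm the printed path against the change ticket before running mongorestore with --drop, since that flag drops existing collections before restoring them.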

7. Post-cutover Actions

  • Maintain heightened monitoring for 4 hours after cutover; track latency, error rates, and queue depth.
  • Confirm audit trails: Authority tokens issued, Scanner events recorded, Attestor submissions stored.
  • Update docs/ops/launch-readiness.md if any new gaps or follow-ups are discovered.
  • Schedule retrospective within 48 hours; include DevOps, module guilds, and product owner.

8. Approval Matrix

| Step | Required Approvers | Record Location |
| --- | --- | --- |
| Production deployment plan | CTO + DevOps lead | Change ticket comment |
| Cutover start (T0) | DevOps lead + module reps | #launch-bridge summary |
| Post-smoke success | DevOps lead + product owner | Change ticket closure |
| Rollback (if invoked) | DevOps lead + CTO | Incident ticket |

Retain all approvals and logs for audit. Update this runbook after each execution to record actual timings and lessons learned.

9. Rehearsal Log

| Date (UTC) | What We Exercised | Outcome | Follow-up |
| --- | --- | --- | --- |
| 2025-10-26 | Dry-run of compose/Helm validation via deploy/tools/validate-profiles.sh (dev/stage/prod/airgap/mirror). Network creation simulated (docker network create stellaops_frontdoor planned) and stage CLI submission reviewed. | Validation script succeeded; all profiles templated cleanly. Stage deployment apply deferred because no staging cluster is accessible from the current environment. | Schedule full stage rehearsal once staging cluster credentials are available; reuse this log section to capture timings. |