Restructure solution layout by module
			Some checks failed
				Docs CI / lint-and-preview (push) Has been cancelled
		| @@ -1,128 +1,128 @@ | ||||
| # Launch Cutover Runbook - Stella Ops | ||||
|  | ||||
| _Document owner: DevOps Guild (2025-10-26)_   | ||||
| _Scope:_ Full-platform launch from staging to production for release `2025.09.2`. | ||||
|  | ||||
| ## 1. Roles and Communication | ||||
|  | ||||
| | Role | Primary | Backup | Contact | | ||||
| | --- | --- | --- | --- | | ||||
| | Cutover lead | DevOps Guild (on-call engineer) | Platform Ops lead | `#launch-bridge` (Mattermost) | | ||||
| | Authority stack | Authority Core guild rep | Security guild rep | `#authority` | | ||||
| | Scanner / Queue | Scanner WebService guild rep | Runtime guild rep | `#scanner` | | ||||
| | Storage | Mongo/MinIO operators | Backup DB admin | Pager escalation | | ||||
| | Observability | Telemetry guild rep | SRE on-call | `#telemetry` | | ||||
| | Approvals | Product owner + CTO | DevOps lead | Approval recorded in change ticket | | ||||
|  | ||||
| Set up a bridge call 30 minutes before start and keep `#launch-bridge` updated every 10 minutes. | ||||
|  | ||||
| ## 2. Timeline Overview (UTC) | ||||
|  | ||||
| | Time | Activity | Owner | | ||||
| | --- | --- | --- | | ||||
| | T-24h | Change ticket approved, prod secrets verified, offline kit build status checked (`DEVOPS-OFFLINE-18-005`). | DevOps lead | | ||||
| | T-12h | Run `deploy/tools/validate-profiles.sh`; capture logs in ticket. | DevOps engineer | | ||||
| | T-6h | Freeze non-launch deployments; notify guild leads. | Product owner | | ||||
| | T-2h | Execute rehearsal in staging (Section 3) using `values-stage.yaml` to verify scripts. | DevOps + module reps | | ||||
| | T-30m | Final go/no-go with guild leads; confirm monitoring dashboards green. | Cutover lead | | ||||
| | T0 | Execute production cutover steps (Section 4). | Cutover team | | ||||
| | T+45m | Smoke tests complete (Section 5); announce success or trigger rollback. | Cutover lead | | ||||
| | T+4h | Post-cutover metrics review, notify stakeholders, close ticket. | DevOps + product owner | | ||||
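The offsets in the timeline can be turned into absolute UTC times once T0 is fixed. The following is a minimal sketch, not part of the runbook tooling: the function name `milestone` and the sample T0 value are hypothetical, and it assumes GNU `date` (for `-d` and the `@epoch` form).

```bash
# Print the absolute UTC time of a milestone.
#   $1: T0 in a form GNU date understands, e.g. "2025-11-01 14:00:00 UTC" (hypothetical)
#   $2: offset from T0 in minutes (negative for T-minus milestones)
milestone() {
  local t0=$1 offset_min=$2 epoch
  # Convert T0 to epoch seconds first so the offset is plain integer arithmetic.
  epoch=$(date -u -d "$t0" +%s) || return 1
  date -u -d "@$((epoch + offset_min * 60))" +%Y-%m-%dT%H:%MZ
}

# Example usage (not executed here):
#   milestone "2025-11-01 14:00:00 UTC" -1440   # T-24h
#   milestone "2025-11-01 14:00:00 UTC" 45      # T+45m
```

Going through epoch seconds avoids relying on relative-date strings, which GNU `date` can misread as time zone offsets.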
|  | ||||
| ## 3. Rehearsal (Staging) Checklist | ||||
|  | ||||
| 1. `docker network create stellaops_frontdoor || true` (if not present on staging jump host). | ||||
| 2. Run `deploy/tools/validate-profiles.sh` and archive output. | ||||
| 3. Apply staging secrets (`kubectl apply -f secrets/stage/` or `helm secrets upgrade`) and confirm that the `stellaops-stage` credentials align with `values-stage.yaml`. | ||||
| 4. Perform `helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-stage.yaml` in staging cluster. | ||||
| 5. Verify health endpoints: `curl https://authority.stage.../healthz`, `curl https://scanner.stage.../healthz`. | ||||
| 6. Execute smoke CLI: `stellaops-cli scan submit --profile staging --sbom samples/sbom/demo.json` and confirm report status in UI. | ||||
| 7. Document total wall time and any deviations in the rehearsal log. | ||||
|  | ||||
| Rehearsal must complete without manual intervention before proceeding to production. | ||||
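Step 7 asks for total wall time. One way to capture it per step is a small wrapper around each command; this is a sketch only, and the helper name `timed` and the log file name in the usage comment are hypothetical. It relies on the bash built-in `SECONDS` counter, so resolution is whole seconds.

```bash
# Run a command and print "<label>\t<elapsed>s\texit=<status>" afterwards.
# The wrapped command's own output is passed through untouched.
timed() {
  local label=$1 start=$SECONDS status=0
  shift
  "$@" || status=$?
  printf '%s\t%ss\texit=%s\n' "$label" "$((SECONDS - start))" "$status"
  return "$status"
}

# Example usage (not executed here):
#   timed validate-profiles deploy/tools/validate-profiles.sh | tee -a rehearsal.log
```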
|  | ||||
| ## 4. Production Cutover Steps | ||||
|  | ||||
| ### 4.1 Pre-flight | ||||
| - Confirm production secrets in the appropriate secret store (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`) contain the keys referenced in `values-prod.yaml`. | ||||
| - Ensure the external reverse proxy network exists: `docker network create stellaops_frontdoor || true` on each compose host. | ||||
| - Back up current configuration and data: | ||||
|   - Mongo snapshot: `mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-$(date -Iseconds)`. | ||||
|   - MinIO bucket mirror: `mc mirror --overwrite minio/stellaops minio-backup/stellaops-$(date +%Y%m%d%H%M)`. | ||||
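The two backup commands above each call `date` separately, so their `<timestamp>` suffixes differ. As a sketch under the assumption that a shared suffix is acceptable (the runbook itself uses two formats), the targets can be derived from one captured value so the rollback steps refer to matching names. The function name `backup_targets` is hypothetical.

```bash
# Derive both backup targets from a single timestamp string.
#   $1: timestamp chosen by the operator, e.g. "$(date -u +%Y%m%dT%H%M%SZ)"
backup_targets() {
  local stamp=$1
  printf 'mongo=/backups/launch-%s\n' "$stamp"
  printf 'minio=minio-backup/stellaops-%s\n' "$stamp"
}

# Example usage (not executed here):
#   backup_targets "$(date -u +%Y%m%dT%H%M%SZ)" | tee backup-targets.txt
```

Recording the printed lines in the change ticket gives the rollback operator the exact paths to restore from.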
|  | ||||
| ### 4.2 Apply Updates (Compose) | ||||
| 1. On each compose node, pull updated images for release `2025.09.2`: | ||||
|    ```bash | ||||
|    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml pull | ||||
|    ``` | ||||
| 2. Deploy changes: | ||||
|    ```bash | ||||
|    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml up -d | ||||
|    ``` | ||||
| 3. Confirm containers are healthy via `docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml ps` and `docker logs <service> --tail 50`. | ||||
|  | ||||
| ### 4.3 Apply Updates (Helm/Kubernetes) | ||||
| If using Kubernetes, perform: | ||||
| ```bash | ||||
| helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml --atomic --timeout 15m | ||||
| ``` | ||||
| Monitor rollout with `kubectl get pods -n stellaops --watch` and `kubectl rollout status deployment/<service> -n stellaops`. | ||||
|  | ||||
| ### 4.4 Configuration Validation | ||||
| - Verify Authority issuer metadata: `curl https://authority.prod.../.well-known/openid-configuration`. | ||||
| - Validate Signer DSSE endpoint: `stellaops-cli signer verify --base-url https://signer.prod... --bundle samples/dsse/demo.json`. | ||||
| - Check Scanner queue connectivity: `docker exec stellaops-scanner-web dotnet StellaOps.Scanner.WebService.dll health queue` (returns success). | ||||
| - Ensure Notify (legacy) is still accessible while the Notifier migration is pending. | ||||
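Services may take a short while to answer after an upgrade, so the validation commands above are often wrapped in a bounded retry. The sketch below is illustrative only: the helper name `retry` and the `AUTHORITY_URL` variable in the usage comment are hypothetical, and the attempt count and delay are placeholders rather than values from this runbook.

```bash
# Run a command up to N times, sleeping DELAY seconds between failed attempts.
#   $1: maximum attempts   $2: delay in seconds   rest: command to run
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example usage (not executed here; AUTHORITY_URL is a placeholder):
#   retry 10 6 curl -fsS "$AUTHORITY_URL/.well-known/openid-configuration"
```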
|  | ||||
| ## 5. Smoke Tests | ||||
|  | ||||
| | Test | Command / Action | Expected Result | | ||||
| | --- | --- | --- | | ||||
| | API health | `curl https://scanner.prod.../healthz` | HTTP 200 with `"status":"Healthy"` | | ||||
| | Scan submit | `stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json` | Scan completes < 5 minutes; report accessible with signed DSSE | | ||||
| | Runtime event ingest | Post sample event from Zastava observer fixture | `/runtime/events` responds 202 Accepted; record visible in Mongo `runtime_events` | | ||||
| | Signing | `stellaops-cli signer sign --bundle demo.json` | Returns DSSE with matching SHA256 and signer metadata | | ||||
| | Attestor verify | `stellaops-cli attestor verify --uuid <uuid>` | Verification result `ok=true` | | ||||
| | Web UI | Manual login, verify dashboards render and latency within budget | UI loads under 2 seconds; policy views consistent | | ||||
|  | ||||
| Log results in the change ticket with timestamps and screenshots where applicable. | ||||
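To keep the ticket entries consistent, the API health check and the timestamped log line can be scripted. This is a minimal sketch assuming the health endpoint returns a JSON body containing a `status` field as in the table above; the helper names `is_healthy` and `log_result` and the `SCANNER_URL` variable are hypothetical.

```bash
# Succeed if a health response body reports "status":"Healthy".
# Optional whitespace around the colon is tolerated.
is_healthy() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"Healthy"'
}

# Print one tab-separated log line: UTC timestamp, test name, result.
log_result() {
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2"
}

# Example usage (not executed here; SCANNER_URL is a placeholder):
#   body=$(curl -fsS "$SCANNER_URL/healthz") && is_healthy "$body" \
#     && log_result "API health" PASS || log_result "API health" FAIL
```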
|  | ||||
| ## 6. Rollback Procedure | ||||
|  | ||||
| 1. Assess failure scope; if systemic, initiate rollback immediately while preserving logs/artifacts. | ||||
| 2. For Compose: | ||||
|    ```bash | ||||
|    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml down | ||||
|    docker compose --env-file stage.env -f deploy/compose/docker-compose.stage.yaml up -d | ||||
|    ``` | ||||
| 3. For Helm: | ||||
|    ```bash | ||||
|    helm rollback stellaops <previous-release-number> --namespace stellaops | ||||
|    ``` | ||||
| 4. Restore Mongo snapshot if data inconsistency detected: `mongorestore --uri "$MONGO_BACKUP_URI" --drop /backups/launch-<timestamp>`. | ||||
| 5. Restore MinIO mirror if required: `mc mirror minio-backup/stellaops-<timestamp> minio/stellaops`. | ||||
| 6. Notify stakeholders of rollback and capture root cause notes in incident ticket. | ||||
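Steps 4 and 5 need the `<timestamp>` of the pre-flight backup. If it was not recorded, the most recent snapshot directory can be located as sketched below. The function name `latest_backup` is hypothetical, and the sketch assumes all snapshot directories share the `launch-` prefix and a timestamp format that sorts chronologically when sorted as text (true for ISO 8601 values taken in the same time zone).

```bash
# Print the newest "launch-*" directory under a backup root.
#   $1: backup root, e.g. /backups
# Returns 1 and prints nothing if no snapshot directory exists.
latest_backup() {
  local root=$1 dir latest=""
  # Glob results are sorted, so the last match is the newest timestamp.
  for dir in "$root"/launch-*/; do
    [ -d "$dir" ] || continue
    latest=${dir%/}
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Example usage (not executed here):
#   snapshot=$(latest_backup /backups) && echo "Restoring from $snapshot"
```

Confirm the printed path against the change ticket before running `mongorestore --drop`, since that option drops each collection before restoring it.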
|  | ||||
| ## 7. Post-cutover Actions | ||||
|  | ||||
| - Keep heightened monitoring for 4 hours post cutover; track latency, error rates, and queue depth. | ||||
| - Confirm audit trails: Authority tokens issued, Scanner events recorded, Attestor submissions stored. | ||||
| - Update `docs/ops/launch-readiness.md` if any new gaps or follow-ups are discovered. | ||||
| - Schedule retrospective within 48 hours; include DevOps, module guilds, and product owner. | ||||
|  | ||||
| ## 8. Approval Matrix | ||||
|  | ||||
| | Step | Required Approvers | Record Location | | ||||
| | --- | --- | --- | | ||||
| | Production deployment plan | CTO + DevOps lead | Change ticket comment | | ||||
| | Cutover start (T0) | DevOps lead + module reps | `#launch-bridge` summary | | ||||
| | Post-smoke success | DevOps lead + product owner | Change ticket closure | | ||||
| | Rollback (if invoked) | DevOps lead + CTO | Incident ticket | | ||||
|  | ||||
| Retain all approvals and logs for audit. Update this runbook after each execution to record actual timings and lessons learned. | ||||
|  | ||||
| ## 9. Rehearsal Log | ||||
|  | ||||
| | Date (UTC) | What We Exercised | Outcome | Follow-up | | ||||
| | --- | --- | --- | --- | | ||||
| | 2025-10-26 | Dry-run of compose/Helm validation via `deploy/tools/validate-profiles.sh` (dev/stage/prod/airgap/mirror). Network creation simulated (`docker network create stellaops_frontdoor` planned) and stage CLI submission reviewed. | Validation script succeeded; all profiles templated cleanly. Stage deployment apply deferred because no staging cluster is accessible from the current environment. | Schedule full stage rehearsal once staging cluster credentials are available; reuse this log section to capture timings. | | ||||