# Launch Cutover Runbook - Stella Ops

_Document owner: DevOps Guild (2025-10-26)_

_Scope:_ Full-platform launch from staging to production for release `2025.09.2`.

> **Note (2025-12):** This document reflects the state at initial launch. Since then, MongoDB has been fully removed (Sprint 4400) and replaced with PostgreSQL. MinIO references now use RustFS. Redis references now use Valkey. See current deployment docs in `deploy/` for up-to-date configuration.

## 1. Roles and Communication

| Role | Primary | Backup | Contact |
| --- | --- | --- | --- |
| Cutover lead | DevOps Guild (on-call engineer) | Platform Ops lead | `#launch-bridge` (Mattermost) |
| Authority stack | Authority Core guild rep | Security guild rep | `#authority` |
| Scanner / Queue | Scanner WebService guild rep | Runtime guild rep | `#scanner` |
| Storage | Mongo/MinIO operators | Backup DB admin | Pager escalation |
| Observability | Telemetry guild rep | SRE on-call | `#telemetry` |
| Approvals | Product owner + CTO | DevOps lead | Approval recorded in change ticket |

Set up a bridge call 30 minutes before start and keep `#launch-bridge` updated every 10 minutes.

## 2. Timeline Overview (UTC)

| Time | Activity | Owner |
| --- | --- | --- |
| T-24h | Change ticket approved, prod secrets verified, offline kit build status checked (`DEVOPS-OFFLINE-18-005`). | DevOps lead |
| T-12h | Run `deploy/tools/validate-profiles.sh`; capture logs in ticket. | DevOps engineer |
| T-6h | Freeze non-launch deployments; notify guild leads. | Product owner |
| T-2h | Execute rehearsal in staging (Section 3) using `values-stage.yaml` to verify scripts. | DevOps + module reps |
| T-30m | Final go/no-go with guild leads; confirm monitoring dashboards green. | Cutover lead |
| T0 | Execute production cutover steps (Section 4). | Cutover team |
| T+45m | Smoke tests complete (Section 5); announce success or trigger rollback. | Cutover lead |
| T+4h | Post-cutover metrics review, notify stakeholders, close ticket. | DevOps + product owner |

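
Once T0 is fixed, the offsets in the timeline can be turned into concrete UTC times for the change ticket. The following is a minimal sketch, assuming GNU `date` is available; `cutover_time` and `print_schedule` are hypothetical helper names and are not part of the Stella Ops tooling.

```bash
#!/usr/bin/env bash
# Sketch: convert the timeline's T-offsets into UTC wall-clock times.
# Assumes GNU date (-d). cutover_time and print_schedule are hypothetical helpers.

# cutover_time "<T0 as 'YYYY-MM-DD HH:MM', read as UTC>" <offset in minutes>
cutover_time() {
  local t0_epoch
  t0_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$(( t0_epoch + $2 * 60 ))" '+%Y-%m-%d %H:%M UTC'
}

# Print one line per milestone of the timeline table for a given T0.
print_schedule() {
  local t0=$1 label offset
  while read -r label offset; do
    printf '%-6s %s\n' "$label" "$(cutover_time "$t0" "$offset")"
  done <<'EOF'
T-24h -1440
T-12h -720
T-6h -360
T-2h -120
T-30m -30
T0 0
T+45m 45
T+4h 240
EOF
}
```

For example, `print_schedule "2025-10-27 12:00"` lists the eight milestones for a hypothetical T0 of noon UTC on 2025-10-27.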
## 3. Rehearsal (Staging) Checklist

1. Create the front-door network if it is not already present on the staging jump host: `docker network create stellaops_frontdoor || true`.
2. Run `deploy/tools/validate-profiles.sh` and archive output.
3. Apply staging secrets (`kubectl apply -f secrets/stage/` or `helm secrets upgrade`), ensuring `stellaops-stage` credentials align with `values-stage.yaml`. Pass the directory rather than an unquoted `*.yaml` glob: the shell expands the glob into several arguments, and `-f` takes only one.
4. Run `helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-stage.yaml` against the staging cluster.
5. Verify health endpoints: `curl https://authority.stage.../healthz`, `curl https://scanner.stage.../healthz`.
6. Execute smoke CLI: `stellaops-cli scan submit --profile staging --sbom samples/sbom/demo.json` and confirm report status in UI.
7. Document total wall time and any deviations in the rehearsal log.

The rehearsal must complete without manual intervention before proceeding to production.

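
Step 7 asks for wall time and deviations, which is easier to capture if each step is timed as it runs. A minimal sketch of such a wrapper follows; `run_step` and `REHEARSAL_LOG` are hypothetical names introduced here for illustration only.

```bash
#!/usr/bin/env bash
# Sketch: time each rehearsal step and append one line per step to a log file.
# REHEARSAL_LOG and run_step are hypothetical, not part of the Stella Ops tooling.

REHEARSAL_LOG=${REHEARSAL_LOG:-rehearsal-$(date -u +%Y%m%dT%H%M%SZ).log}

# run_step "<label>" <command> [args...]
# Runs the command, records UTC start time, duration in seconds and exit status,
# then returns the command's exit status so the caller can stop on failure.
run_step() {
  local label=$1 started start_s status=0
  shift
  started=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  start_s=$(date +%s)
  "$@" || status=$?
  printf '%s\t%s\t%ss\texit=%s\n' \
    "$started" "$label" "$(( $(date +%s) - start_s ))" "$status" >> "$REHEARSAL_LOG"
  return "$status"
}
```

A rehearsal script could then call, for instance, `run_step "validate profiles" deploy/tools/validate-profiles.sh || exit 1`, and the resulting log can be pasted into the rehearsal log table.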
## 4. Production Cutover Steps

### 4.1 Pre-flight

- Confirm production secrets in the appropriate secret store (`stellaops-prod-core`, `stellaops-prod-mongo`, `stellaops-prod-minio`, `stellaops-prod-notify`) contain the keys referenced in `values-prod.yaml`.
- Ensure the external reverse proxy network exists: `docker network create stellaops_frontdoor || true` on each compose host.
- Back up current configuration and data:
  - Mongo snapshot: `mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-$(date -Iseconds)`.
  - MinIO bucket mirror: `mc mirror --overwrite minio/stellaops minio-backup/stellaops-$(date +%Y%m%d%H%M)`.
- Record the exact backup paths in the change ticket; the rollback procedure (Section 6) needs them as `<timestamp>`.

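
Because the rollback commands refer to the backups by timestamp, it can help to compute both backup locations once and save them. The sketch below assumes GNU `date`; `launch_backup_paths` is a hypothetical helper, and it uses a UTC timestamp with a `Z` suffix rather than the local-time output of `date -Iseconds`.

```bash
#!/usr/bin/env bash
# Sketch: derive both backup locations from a single UTC timestamp so the
# rollback steps can refer to the exact same paths later.
# launch_backup_paths is hypothetical; the path and bucket prefixes are the
# ones used in the backup commands of this runbook.

# launch_backup_paths [epoch seconds]   (defaults to "now")
# Prints two KEY=value lines suitable for saving to a file and sourcing.
launch_backup_paths() {
  local now=${1:-$(date -u +%s)}
  printf 'MONGO_DUMP_DIR=/backups/launch-%s\n' "$(date -u -d "@$now" +%Y-%m-%dT%H:%M:%SZ)"
  printf 'MINIO_MIRROR=minio-backup/stellaops-%s\n' "$(date -u -d "@$now" +%Y%m%d%H%M)"
}
```

One possible usage is `launch_backup_paths > launch-backup.env`, then sourcing that file and passing `"$MONGO_DUMP_DIR"` to `mongodump --out` and `"$MINIO_MIRROR"` to `mc mirror`.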
### 4.2 Apply Updates (Compose)

1. On each compose node, pull updated images for release `2025.09.2`:

   ```bash
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml pull
   ```

2. Deploy changes:

   ```bash
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml up -d
   ```

3. Confirm containers are healthy via `docker compose ps` and `docker logs <service> --tail 50`.

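
Reading `docker compose ps` by eye works on one node but is easy to get wrong across several. As a rough sketch, the filter below flags any container whose health status is not `healthy`; `not_healthy` is a hypothetical helper, and the `docker inspect` invocation in the comment is one possible way to feed it, not a documented part of this deployment.

```bash
#!/usr/bin/env bash
# Sketch: list containers whose health status is anything other than "healthy".
# not_healthy is a hypothetical helper. It reads "<name> <health>" pairs on
# stdin, prints the offenders, and exits non-zero if there are any.
#
# One possible way to produce the input (containers without a healthcheck
# report "none" here and are therefore flagged as well):
#   docker inspect \
#     --format '{{.Name}} {{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
#     $(docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml ps -q) \
#     | not_healthy

not_healthy() {
  awk 'NF && $2 != "healthy" { print $1 " (" $2 ")"; bad = 1 } END { exit bad }'
}
```

A non-zero exit status from `not_healthy` is a signal to inspect the listed containers with `docker logs` before continuing.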
### 4.3 Apply Updates (Helm/Kubernetes)

If using Kubernetes, perform:

```bash
helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml --atomic --timeout 15m
```

Monitor rollout with `kubectl get pods -n stellaops --watch` and `kubectl rollout status deployment/<service> -n stellaops`.

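
Checking `kubectl rollout status` one deployment at a time can be looped over every Deployment in the namespace. The sketch below assumes the `stellaops` namespace used in the commands above; `wait_for_rollouts` and the `KUBECTL` override are hypothetical and exist only so the loop can be dry-run with a stub.

```bash
#!/usr/bin/env bash
# Sketch: wait for every Deployment in a namespace to finish rolling out.
# wait_for_rollouts is a hypothetical helper; KUBECTL can point at a stub
# command or function for dry runs.

KUBECTL=${KUBECTL:-kubectl}

# wait_for_rollouts <namespace> [per-deployment timeout, default 5m]
# Returns non-zero if any deployment fails to roll out within the timeout.
wait_for_rollouts() {
  local ns=$1 timeout=${2:-5m} deploy failed=0
  for deploy in $("$KUBECTL" get deployments -n "$ns" -o name); do
    "$KUBECTL" rollout status "$deploy" -n "$ns" --timeout="$timeout" || failed=1
  done
  return "$failed"
}
```

For example, `wait_for_rollouts stellaops 10m` after the `helm upgrade` returns. Note that the loop does nothing, and reports success, if listing the deployments itself fails, so a stricter version would check that case explicitly.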
### 4.4 Configuration Validation

- Verify Authority issuer metadata: `curl https://authority.prod.../.well-known/openid-configuration`.
- Validate Signer DSSE endpoint: `stellaops-cli signer verify --base-url https://signer.prod... --bundle samples/dsse/demo.json`.
- Check Scanner queue connectivity: `docker exec stellaops-scanner-web dotnet StellaOps.Scanner.WebService.dll health queue` (expect a success result).
- Ensure Notify (legacy) is still accessible while the Notifier migration is pending.

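
For the issuer metadata check, one field worth confirming is that the `issuer` value in the discovery document equals the URL clients are configured with. The sketch below is a plain-text comparison under that assumption; `issuer_matches` is a hypothetical helper, and a JSON-aware tool would be more robust (for example against escaped slashes).

```bash
#!/usr/bin/env bash
# Sketch: check that a discovery document read from stdin advertises the
# expected issuer. issuer_matches is a hypothetical helper; it does a literal
# text match rather than real JSON parsing.

# issuer_matches <expected issuer URL>   (document on stdin)
issuer_matches() {
  local expected=$1
  # Strip whitespace so '"issuer": "x"' and '"issuer":"x"' compare equal.
  tr -d '[:space:]' | grep -qF "\"issuer\":\"$expected\""
}
```

The output of the `curl` command above can be piped into it, with the expected issuer URL passed as the single argument.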
## 5. Smoke Tests

| Test | Command / Action | Expected Result |
| --- | --- | --- |
| API health | `curl https://scanner.prod.../healthz` | HTTP 200 with `"status":"Healthy"` |
| Scan submit | `stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json` | Scan completes in under 5 minutes; report accessible with signed DSSE |
| Runtime event ingest | Post sample event from Zastava observer fixture | `/runtime/events` responds 202 Accepted; record visible in Mongo `runtime_events` |
| Signing | `stellaops-cli signer sign --bundle demo.json` | Returns DSSE with matching SHA256 and signer metadata |
| Attestor verify | `stellaops-cli attestor verify --uuid <uuid>` | Verification result `ok=true` |
| Web UI | Manual login, verify dashboards render and latency within budget | UI loads in under 2 seconds; policy views consistent |

Log results in the change ticket with timestamps and screenshots where applicable.

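
The "API health" row has two pass criteria (status code and body), and a bare `curl` shows only the body. A minimal sketch of checking both follows; `is_healthy` is a hypothetical helper, and `SCANNER_URL` in the usage note is a placeholder for the elided production URL.

```bash
#!/usr/bin/env bash
# Sketch: apply the "API health" pass criteria to a response already fetched.
# is_healthy is a hypothetical helper.

# is_healthy <http status code> <response body>
# Succeeds only for HTTP 200 with a body containing "status":"Healthy".
is_healthy() {
  local code=$1 body=$2
  [ "$code" = "200" ] &&
    printf '%s' "$body" | tr -d '[:space:]' | grep -qF '"status":"Healthy"'
}
```

One way to feed it is `code=$(curl -sS -o healthz.json -w '%{http_code}' "$SCANNER_URL/healthz")` followed by `is_healthy "$code" "$(cat healthz.json)"`, which also leaves `healthz.json` on disk as evidence for the change ticket.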
## 6. Rollback Procedure

1. Assess failure scope; if systemic, initiate rollback immediately while preserving logs/artifacts.
2. For Compose:

   ```bash
   docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml down
   docker compose --env-file stage.env -f deploy/compose/docker-compose.stage.yaml up -d
   ```

3. For Helm:

   ```bash
   helm rollback stellaops <previous-release-number> --namespace stellaops
   ```

4. Restore Mongo snapshot if data inconsistency detected: `mongorestore --uri "$MONGO_BACKUP_URI" --drop /backups/launch-<timestamp>`.
5. Restore MinIO mirror if required: `mc mirror minio-backup/stellaops-<timestamp> minio/stellaops`.
6. Notify stakeholders of rollback and capture root cause notes in incident ticket.

## 7. Post-cutover Actions

- Keep heightened monitoring for 4 hours post-cutover; track latency, error rates, and queue depth.
- Confirm audit trails: Authority tokens issued, Scanner events recorded, Attestor submissions stored.
- Update `docs/modules/devops/runbooks/launch-readiness.md` if any new gaps or follow-ups are discovered.
- Schedule retrospective within 48 hours; include DevOps, module guilds, and product owner.

## 8. Approval Matrix

| Step | Required Approvers | Record Location |
| --- | --- | --- |
| Production deployment plan | CTO + DevOps lead | Change ticket comment |
| Cutover start (T0) | DevOps lead + module reps | `#launch-bridge` summary |
| Post-smoke success | DevOps lead + product owner | Change ticket closure |
| Rollback (if invoked) | DevOps lead + CTO | Incident ticket |

Retain all approvals and logs for audit. Update this runbook after each execution to record actual timings and lessons learned.

## 9. Rehearsal Log

| Date (UTC) | What We Exercised | Outcome | Follow-up |
| --- | --- | --- | --- |
| 2025-10-26 | Dry-run of compose/Helm validation via `deploy/tools/validate-profiles.sh` (dev/stage/prod/airgap/mirror). Network creation simulated (`docker network create stellaops_frontdoor` planned) and stage CLI submission reviewed. | Validation script succeeded; all profiles templated cleanly. Stage deployment apply deferred because no staging cluster is accessible from the current environment. | Schedule full stage rehearsal once staging cluster credentials are available; reuse this log section to capture timings. |