Files
git.stella-ops.org/docs/ops/deployment-upgrade-runbook.md
root 68da90a11a
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Restructure solution layout by module
2025-10-28 15:10:40 +02:00

152 lines
6.6 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# StellaOps Deployment Upgrade & Rollback Runbook
_Last updated: 2025-10-26 (Sprint 14 DEVOPS-OPS-14-003)._
This runbook describes how to promote a new release across the supported deployment profiles (Helm and Docker Compose), how to roll back safely, and how to keep channels (`edge`, `stable`, `airgap`) aligned. All steps assume you are working from a clean checkout of the release branch/tag.
---
## 1. Channel overview
| Channel | Release manifest | Helm values | Compose profile |
|---------|------------------|-------------|-----------------|
| `edge` | `deploy/releases/2025.10-edge.yaml` | `deploy/helm/stellaops/values-dev.yaml` | `deploy/compose/docker-compose.dev.yaml` |
| `stable` | `deploy/releases/2025.09-stable.yaml` | `deploy/helm/stellaops/values-stage.yaml`, `deploy/helm/stellaops/values-prod.yaml` | `deploy/compose/docker-compose.stage.yaml`, `deploy/compose/docker-compose.prod.yaml` |
| `airgap` | `deploy/releases/2025.09-airgap.yaml` | `deploy/helm/stellaops/values-airgap.yaml` | `deploy/compose/docker-compose.airgap.yaml` |
Infrastructure components (MongoDB, MinIO, RustFS) are pinned in the release manifests and inherited by the deployment profiles. Supporting dependencies such as `nats` remain on upstream LTS tags; review `deploy/compose/*.yaml` for the authoritative set.
---
## 2. Pre-flight checklist
1. **Refresh release manifest**
Pull the latest manifest for the channel you are promoting (`deploy/releases/<version>-<channel>.yaml`).
2. **Align deployment bundles with the manifest**
Run the alignment checker for every profile that should pick up the release. Pass `--ignore-repo nats` to skip auxiliary services.
```bash
./deploy/tools/check-channel-alignment.py \
--release deploy/releases/2025.10-edge.yaml \
--target deploy/helm/stellaops/values-dev.yaml \
--target deploy/compose/docker-compose.dev.yaml \
--ignore-repo nats
```
Repeat for other channels (`stable`, `airgap`), substituting the manifest and target files.
3. **Lint and template profiles**
```bash
./deploy/tools/validate-profiles.sh
```
4. **Smoke the Offline Kit debug store (edge/stable only)**
When the release pipeline has generated `out/release/debug/.build-id/**`, mirror the assets into the Offline Kit staging tree:
```bash
./ops/offline-kit/mirror_debug_store.py \
--release-dir out/release \
--offline-kit-dir out/offline-kit
```
Archive the resulting `out/offline-kit/metadata/debug-store.json` alongside the kit bundle.
5. **Review compatibility matrix**
Confirm MongoDB, MinIO, and RustFS versions in the release manifest match platform SLOs. The default targets are `mongo@sha256:c258`, `minio@sha256:14ce`, `rustfs:2025.10.0-edge`.
6. **Create a rollback bookmark**
Record the current Helm revision (`helm history stellaops -n stellaops`) and compose tag (`git describe --tags`) before applying changes.
---
## 3. Helm upgrade procedure (staging → production)
1. Switch to the deployment branch and ensure secrets/config maps are current.
2. Apply the upgrade in the staging cluster:
```bash
helm upgrade stellaops deploy/helm/stellaops \
-f deploy/helm/stellaops/values-stage.yaml \
--namespace stellaops \
--atomic \
--timeout 15m
```
3. Run smoke tests (`scripts/smoke-tests.sh` or environment-specific checks).
4. Promote to production using the prod values file and the same command.
5. Record the new revision number and Git SHA in the change log.
### Rollback (Helm)
1. Identify the previous revision: `helm history stellaops -n stellaops`.
2. Execute:
```bash
helm rollback stellaops <revision> \
--namespace stellaops \
--wait \
--timeout 10m
```
3. Verify `kubectl get pods` returns healthy workloads; rerun smoke tests.
4. Update the incident/operations log with root cause and rollback details.
---
## 4. Docker Compose upgrade procedure
1. Update environment files (`deploy/compose/env/*.env.example`) with any new settings and sync secrets to hosts.
2. Pull the tagged repository state corresponding to the release (e.g. `git checkout 2025.09.2` for stable).
3. Apply the upgrade:
```bash
docker compose \
--env-file deploy/compose/env/prod.env \
-f deploy/compose/docker-compose.prod.yaml \
pull
docker compose \
--env-file deploy/compose/env/prod.env \
-f deploy/compose/docker-compose.prod.yaml \
up -d
```
4. Tail logs for critical services (`docker compose logs -f authority concelier`).
5. Update monitoring dashboards/alerts to confirm normal operation.
### Rollback (Compose)
1. Check out the previous release tag (e.g. `git checkout 2025.09.1`).
2. Re-run `docker compose pull` and `docker compose up -d` with that profile. Docker will restore the prior digests.
3. If reverting to a known-good snapshot is required, restore volume backups (see `docs/ops/authority-backup-restore.md` and associated service guides).
4. Log the rollback in the operations journal.
---
## 5. Channel promotion workflow
1. Author or update the channel manifest under `deploy/releases/`.
2. Mirror the new digests into Helm/Compose values and run the alignment script for each profile.
3. Commit the changes with a message that references the release version and channel (e.g. `deploy: promote 2025.10.0-edge`).
4. Publish release notes and update `deploy/releases/README.md` (if applicable).
5. Tag the repository when promoting stable or airgap builds.
---
## 6. Upgrade rehearsal & rollback drill log
Maintain rehearsal notes in `docs/ops/launch-cutover.md` or the relevant sprint planning document. After each drill capture:
- Release version tested
- Date/time
- Participants
- Issues encountered & fixes
- Rollback duration (if executed)
Attach the log to the sprint retro or operational wiki.
| Date (UTC) | Channel | Outcome | Notes |
|------------|---------|---------|-------|
| 2025-10-26 | Documentation dry-run | Planned | Runbook refreshed; next live drill scheduled for 2025-11 edge → stable promotion.
---
## 7. References
- `deploy/README.md` structure and validation workflow for deployment bundles.
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` release automation and signing pipeline.
- `docs/ARCHITECTURE_DEVOPS.md` high-level DevOps architecture, SLOs, and compliance requirements.
- `ops/offline-kit/mirror_debug_store.py` debug-store mirroring helper.
- `deploy/tools/check-channel-alignment.py` release vs deployment digest alignment checker.