
# Deployment Profiles
This directory contains deterministic deployment bundles for the core Stella Ops stack. All manifests reference immutable image digests and map 1:1 to the release manifests stored under `deploy/releases/`.
## Structure
- `releases/`: canonical release manifests (edge, stable, airgap) used to source image digests.
- `compose/`: Docker Compose bundles for dev/stage/airgap targets plus `.env` seed files.
- `compose/docker-compose.mirror.yaml`: managed mirror bundle for `*.stella-ops.org` with gateway cache and multi-tenant auth.
- `compose/docker-compose.telemetry.yaml`: optional OpenTelemetry collector overlay (mutual TLS, OTLP pipelines).
- `compose/docker-compose.telemetry-storage.yaml`: optional Prometheus/Tempo/Loki stack for observability backends.
- `helm/stellaops/`: multi-profile Helm chart with values files for dev/stage/airgap.
- `helm/stellaops/INSTALL.md`: install/runbook for prod and airgap profiles with digest pins.
- `telemetry/`: shared OpenTelemetry collector configuration and certificate artefacts (generated via tooling).
- `tools/validate-profiles.sh`: helper that runs `docker compose config` and `helm lint/template` for every profile.
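
The kind of per-profile validation that `tools/validate-profiles.sh` performs can be pictured with the sketch below. It is not the script itself: the Helm release name `stellaops` and the file names in the usage note are assumptions for illustration, and the function only builds the command lines rather than running them.

```python
def validation_commands(compose_files, chart_dir, values_files):
    """Build the CLI invocations that would validate each profile.

    Returns a list of argument lists suitable for subprocess.run().
    The Helm release name "stellaops" is a placeholder, not a project convention.
    """
    commands = []
    for compose_file in compose_files:
        # `docker compose config` parses and resolves the Compose file.
        commands.append(["docker", "compose", "-f", compose_file, "config"])
    for values_file in values_files:
        # `helm lint` checks chart structure; `helm template` renders it locally.
        commands.append(["helm", "lint", chart_dir, "--values", values_file])
        commands.append(
            ["helm", "template", "stellaops", chart_dir, "--values", values_file]
        )
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)`, for example with a hypothetical `compose/docker-compose.dev.yaml` and `helm/stellaops/values-dev.yaml`, so that the first failing profile stops the run.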
## Workflow
1. Update or add a release manifest under `releases/` with the new digests.
2. Mirror the digests into the Compose and Helm profiles that correspond to that channel.
3. Run `deploy/tools/validate-profiles.sh` (requires Docker CLI and Helm) to ensure the bundles lint and template cleanly.
4. If telemetry ingest is required for the release, generate development certificates using `./ops/devops/telemetry/generate_dev_tls.sh` and run the collector smoke test with `python ./ops/devops/telemetry/smoke_otel_collector.py` to verify the OTLP endpoints.
5. Commit the change alongside any documentation updates (e.g. install guide cross-links).
Maintaining the digest linkage keeps offline/air-gapped installs reproducible and avoids tag drift between environments.
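
To make the digest linkage concrete, here is a minimal sketch of the two checks implied by the workflow: every image is pinned by digest, and each profile references exactly what the release manifest lists. It assumes the manifests have already been parsed into `{service: image}` mappings. The service and registry names in the usage note are hypothetical, and this is not the implementation of `deploy/tools/check-channel-alignment.py`.

```python
import re

# An image reference counts as pinned only if it ends in an explicit sha256 digest.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_images(images):
    """Return the image references that are not pinned to a sha256 digest."""
    return [ref for ref in images if not DIGEST_RE.search(ref)]


def misaligned_images(release, profile):
    """Compare two {service: image} mappings and report every mismatch.

    Returns {service: (release_ref, profile_ref)} for services whose profile
    reference is missing (None) or differs from the release manifest.
    """
    return {
        service: (ref, profile.get(service))
        for service, ref in release.items()
        if profile.get(service) != ref
    }
```

For example, a profile that still carries `registry.example/scanner-web:latest` would be reported by both functions, which is the tag drift the workflow is meant to prevent.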
### Surface.Env rollout warnings
- Compose (`deploy/compose/env/*.env.example`) and Helm (`deploy/helm/stellaops/values-*.yaml`) now seed `SCANNER_SURFACE_*` _and_ `ZASTAVA_SURFACE_*` variables so Scanner Worker/WebService and Zastava Observer/Webhook resolve cache roots, Surface.FS endpoints, and secrets providers through `StellaOps.Scanner.Surface.Env`.
- During rollout, watch for structured log messages (and readiness output) prefixed with `surface.env.`—for example, `surface.env.cache_root_missing`, `surface.env.endpoint_unreachable`, or `surface.env.secrets_provider_invalid`.
- Treat these warnings as deployment blockers: update the endpoint/cache/secrets values or permissions before promoting the environment; otherwise workers will fail fast at startup.
- Air-gapped bundles default the secrets provider to `file` with root `/etc/stellaops/secrets`; connected clusters default to `kubernetes`. Adjust the provider/root pair if your secrets manager differs.
- Secret provisioning workflows for Kubernetes/Compose/Offline Kit are documented in `ops/devops/secrets/surface-secrets-provisioning.md`; follow that for `Surface.Secrets` handles and RBAC/permissions.
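
One way to gate a rollout on these warnings is to scan captured service logs for `surface.env.` codes. The sketch below assumes the logs are available as plain text lines (JSON or otherwise) and matches on the prefix only. The exact log schema is not specified here, so treat the pattern as an assumption.

```python
import re

# Matches event codes such as "surface.env.cache_root_missing".
SURFACE_ENV_CODE = re.compile(r"\bsurface\.env\.[a-z0-9_]+")


def surface_env_warnings(log_lines):
    """Return the distinct surface.env.* codes found, in first-seen order."""
    seen = []
    for line in log_lines:
        for code in SURFACE_ENV_CODE.findall(line):
            if code not in seen:
                seen.append(code)
    return seen
```

A promotion script could refuse to continue whenever the returned list is non-empty.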
### Mongo2Go OpenSSL prerequisites
- Linux runners that execute Mongo2Go-backed suites (Excititor, Scheduler, Graph, etc.) must expose OpenSSL 1.1 (`libcrypto.so.1.1`, `libssl.so.1.1`). The canonical copies live under `tests/native/openssl-1.1/linux-x64`.
- Export `LD_LIBRARY_PATH="$(git rev-parse --show-toplevel)/tests/native/openssl-1.1/linux-x64:${LD_LIBRARY_PATH:-}"` before invoking `dotnet test`. Example:\
`LD_LIBRARY_PATH="$(pwd)/tests/native/openssl-1.1/linux-x64" dotnet test src/Excititor/__Tests/StellaOps.Excititor.WebService.Tests/StellaOps.Excititor.WebService.Tests.csproj --nologo`.
- CI agents or Dockerfiles that host these tests should either mount the directory into the container or copy the two `.so` files into a directory that is already on the runtime library path.
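
For runners that launch tests from a script rather than a shell, the same setup can be sketched in Python. The helper names below are illustrative, not part of the repository. Unlike the shell form, this version omits the trailing `:` when `LD_LIBRARY_PATH` was previously unset.

```python
import os
from pathlib import Path

# The two shared objects that Mongo2Go-backed suites need at runtime.
REQUIRED_LIBS = ("libcrypto.so.1.1", "libssl.so.1.1")


def missing_openssl_libs(lib_dir):
    """Return the required OpenSSL 1.1 libraries that are absent from lib_dir."""
    return [name for name in REQUIRED_LIBS if not (Path(lib_dir) / name).is_file()]


def build_test_env(repo_root, base_env=None):
    """Copy base_env and prepend the bundled OpenSSL directory to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    lib_dir = Path(repo_root) / "tests" / "native" / "openssl-1.1" / "linux-x64"
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{existing}" if existing else str(lib_dir)
    return env
```

The resulting mapping can be passed as `env=` to `subprocess.run(["dotnet", "test", project, "--nologo"], ...)`, after checking that `missing_openssl_libs` returns an empty list.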
### Additional tooling
- `deploy/tools/check-channel-alignment.py` verifies that Helm/Compose profiles reference the exact images listed in a release manifest. Run it for each channel before promoting a release.
- `ops/devops/telemetry/generate_dev_tls.sh` produces local CA/server/client certificates for Compose-based collector testing.
- `ops/devops/telemetry/smoke_otel_collector.py` sends OTLP traffic and asserts the collector accepted traces, metrics, and logs.
- `ops/devops/telemetry/package_offline_bundle.py` packages telemetry assets (config/Helm/Compose) into a signed tarball for air-gapped installs.
- `docs/modules/devops/runbooks/deployment-upgrade.md`: end-to-end instructions for upgrade, rollback, and channel promotion workflows (Helm + Compose).
### Tenancy observability & chaos (DEVOPS-TEN-49-001)
- Import `ops/devops/tenant/recording-rules.yaml` and `ops/devops/tenant/alerts.yaml` into your Prometheus rule groups.
- Add Grafana dashboard `ops/devops/tenant/dashboards/tenant-audit.json` (folder `StellaOps / Tenancy`) to watch latency/error/auth cache ratios per tenant/service.
- Run the multi-tenant k6 harness `ops/devops/tenant/k6-tenant-load.js` to hit 5k concurrent tenant-labelled requests (defaults to read/write 90/10, header `X-StellaOps-Tenant`).
- Execute JWKS outage chaos via `ops/devops/tenant/jwks-chaos.sh` on an isolated agent with sudo/iptables; watch alerts `jwks_cache_miss_spike` and `tenant_auth_failures_spike` while load is active.
## CI smoke checks
The `.gitea/workflows/build-test-deploy.yml` pipeline includes a `notify-smoke` stage that validates scanner event propagation after staging deployments. Configure the following repository secrets (or environment-level secrets) so the job can connect to Redis and the Notify API:
- `NOTIFY_SMOKE_REDIS_DSN`: Redis connection string (`redis://user:pass@host:port/db`).
- `NOTIFY_SMOKE_NOTIFY_BASEURL`: Base URL for the staging Notify WebService (e.g. `https://notify.stage.stella-ops.internal`).
- `NOTIFY_SMOKE_NOTIFY_TOKEN`: OAuth bearer token (service account) with permission to read deliveries.
- `NOTIFY_SMOKE_NOTIFY_TENANT`: Tenant identifier used for the smoke validation requests.
- *(Optional)* `NOTIFY_SMOKE_NOTIFY_TENANT_HEADER`: Override for the tenant header name (defaults to `X-StellaOps-Tenant`).
Define the following repository variables (or secrets) to drive the assertions performed by the smoke check:
- `NOTIFY_SMOKE_EXPECT_KINDS`: Comma-separated event kinds the checker must observe (for example `scanner.report.ready,scanner.scan.completed`).
- `NOTIFY_SMOKE_LOOKBACK_MINUTES`: Time window (in minutes) used when scanning the Redis stream for recent events (for example `30`).
All of the above values, except the optional tenant header override, are required: the workflow fails fast with a descriptive error if any are missing or empty. Provide the variables at the organisation or repository scope before enabling the smoke stage.
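
The fail-fast check can be pictured roughly as follows. This is a sketch under assumptions, not the workflow's actual code: the function name, the returned keys, and the error wording are all illustrative. Only the variable names and the `X-StellaOps-Tenant` default come from the list above.

```python
import os

REQUIRED = (
    "NOTIFY_SMOKE_REDIS_DSN",
    "NOTIFY_SMOKE_NOTIFY_BASEURL",
    "NOTIFY_SMOKE_NOTIFY_TOKEN",
    "NOTIFY_SMOKE_NOTIFY_TENANT",
    "NOTIFY_SMOKE_EXPECT_KINDS",
    "NOTIFY_SMOKE_LOOKBACK_MINUTES",
)


def load_smoke_config(env=None):
    """Validate the smoke-check settings and return them as a dict.

    Raises ValueError naming every required variable that is missing or blank.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name, "").strip()]
    if missing:
        raise ValueError("missing or empty: " + ", ".join(missing))
    kinds = env["NOTIFY_SMOKE_EXPECT_KINDS"].split(",")
    return {
        "redis_dsn": env["NOTIFY_SMOKE_REDIS_DSN"],
        "base_url": env["NOTIFY_SMOKE_NOTIFY_BASEURL"],
        "token": env["NOTIFY_SMOKE_NOTIFY_TOKEN"],
        "tenant": env["NOTIFY_SMOKE_NOTIFY_TENANT"],
        # Optional override; falls back to the documented default header.
        "tenant_header": env.get("NOTIFY_SMOKE_NOTIFY_TENANT_HEADER")
        or "X-StellaOps-Tenant",
        "expect_kinds": [k.strip() for k in kinds if k.strip()],
        "lookback_minutes": int(env["NOTIFY_SMOKE_LOOKBACK_MINUTES"]),
    }
```

Reporting all missing names in one error, rather than stopping at the first, saves a round trip when several secrets are unset.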