Some checks failed
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Concelier Attestation Tests / attestation-tests (push) Has been cancelled
Docs CI / lint-and-preview (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled
Export Center CI / export-ci (push) Has been cancelled
Policy Simulation / policy-simulate (push) Has been cancelled
SDK Publish & Sign / sdk-publish (push) Has been cancelled
Signals CI & Image / signals-ci (push) Has been cancelled
- Created documentation for Sprint 200, 202, 203, 204, and 205 focusing on CLI enhancements and SDKs. - Normalized legacy filenames to prevent divergent updates. - Documented completed tasks, dependencies, and active items for CLI commands related to observability, orchestration, packaging, and policy management. - Implemented API governance tooling and OpenAPI composition for Sprint 511, detailing task statuses and dependencies. - Updated legacy web sprint documentation to reflect new naming conventions and standard templates.
75 lines
6.8 KiB
Markdown
75 lines
6.8 KiB
Markdown
# Deployment Profiles
|
||
|
||
This directory contains deterministic deployment bundles for the core Stella Ops stack. All manifests reference immutable image digests and map 1:1 to the release manifests stored under `deploy/releases/`.
|
||
|
||
## Structure
|
||
|
||
- `releases/` – canonical release manifests (edge, stable, airgap) used to source image digests.
|
||
- `compose/` – Docker Compose bundles for dev/stage/airgap targets plus `.env` seed files.
|
||
- `compose/docker-compose.mirror.yaml` – managed mirror bundle for `*.stella-ops.org` with gateway cache and multi-tenant auth.
|
||
- `compose/docker-compose.telemetry.yaml` – optional OpenTelemetry collector overlay (mutual TLS, OTLP pipelines).
|
||
- `compose/docker-compose.telemetry-storage.yaml` – optional Prometheus/Tempo/Loki stack for observability backends.
|
||
- `helm/stellaops/` – multi-profile Helm chart with values files for dev/stage/airgap.
|
||
- `helm/stellaops/INSTALL.md` – install/runbook for prod and airgap profiles with digest pins.
|
||
- `telemetry/` – shared OpenTelemetry collector configuration and certificate artefacts (generated via tooling).
|
||
- `tools/validate-profiles.sh` – helper that runs `docker compose config` and `helm lint/template` for every profile.
|
||
|
||
## Workflow
|
||
|
||
1. Update or add a release manifest under `releases/` with the new digests.
|
||
2. Mirror the digests into the Compose and Helm profiles that correspond to that channel.
|
||
3. Run `deploy/tools/validate-profiles.sh` (requires Docker CLI and Helm) to ensure the bundles lint and template cleanly.
|
||
4. If telemetry ingest is required for the release, generate development certificates using
|
||
`./ops/devops/telemetry/generate_dev_tls.sh` and run the collector smoke test with
|
||
`python ./ops/devops/telemetry/smoke_otel_collector.py` to verify the OTLP endpoints.
|
||
5. Commit the change alongside any documentation updates (e.g. install guide cross-links).
|
||
|
||
Maintaining the digest linkage keeps offline/air-gapped installs reproducible and avoids tag drift between environments.
|
||
|
||
### Surface.Env rollout warnings
|
||
|
||
- Compose (`deploy/compose/env/*.env.example`) and Helm (`deploy/helm/stellaops/values-*.yaml`) now seed `SCANNER_SURFACE_*` _and_ `ZASTAVA_SURFACE_*` variables so Scanner Worker/WebService and Zastava Observer/Webhook resolve cache roots, Surface.FS endpoints, and secrets providers through `StellaOps.Scanner.Surface.Env`.
|
||
- During rollout, watch for structured log messages (and readiness output) prefixed with `surface.env.`—for example, `surface.env.cache_root_missing`, `surface.env.endpoint_unreachable`, or `surface.env.secrets_provider_invalid`.
|
||
- Treat these warnings as deployment blockers: update the endpoint/cache/secrets values or permissions before promoting the environment, otherwise workers will fail fast at startup.
|
||
- Air-gapped bundles default the secrets provider to `file` with `/etc/stellaops/secrets`; connected clusters default to `kubernetes`. Adjust the provider/root pair if your secrets manager differs.
|
||
- Secret provisioning workflows for Kubernetes/Compose/Offline Kit are documented in `ops/devops/secrets/surface-secrets-provisioning.md`; follow that for `Surface.Secrets` handles and RBAC/permissions.
|
||
|
||
### Mongo2Go OpenSSL prerequisites
|
||
|
||
- Linux runners that execute Mongo2Go-backed suites (Excititor, Scheduler, Graph, etc.) must expose OpenSSL 1.1 (`libcrypto.so.1.1`, `libssl.so.1.1`). The canonical copies live under `tests/native/openssl-1.1/linux-x64`.
|
||
- Export `LD_LIBRARY_PATH="$(git rev-parse --show-toplevel)/tests/native/openssl-1.1/linux-x64:${LD_LIBRARY_PATH:-}"` before invoking `dotnet test`. Example:\
|
||
`LD_LIBRARY_PATH="$(pwd)/tests/native/openssl-1.1/linux-x64" dotnet test src/Excititor/__Tests/StellaOps.Excititor.WebService.Tests/StellaOps.Excititor.WebService.Tests.csproj --nologo`.
|
||
- CI agents or Dockerfiles that host these tests should either mount the directory into the container or copy the two `.so` files into a directory that is already on the runtime library path.
|
||
|
||
### Additional tooling
|
||
|
||
- `deploy/tools/check-channel-alignment.py` – verifies that Helm/Compose profiles reference the exact images listed in a release manifest. Run it for each channel before promoting a release.
|
||
- `ops/devops/telemetry/generate_dev_tls.sh` – produces local CA/server/client certificates for Compose-based collector testing.
|
||
- `ops/devops/telemetry/smoke_otel_collector.py` – sends OTLP traffic and asserts the collector accepted traces, metrics, and logs.
|
||
- `ops/devops/telemetry/package_offline_bundle.py` – packages telemetry assets (config/Helm/Compose) into a signed tarball for air-gapped installs.
|
||
- `docs/modules/devops/runbooks/deployment-upgrade.md` – end-to-end instructions for upgrade, rollback, and channel promotion workflows (Helm + Compose).
|
||
|
||
### Tenancy observability & chaos (DEVOPS-TEN-49-001)
|
||
|
||
- Import `ops/devops/tenant/recording-rules.yaml` and `ops/devops/tenant/alerts.yaml` into your Prometheus rule groups.
|
||
- Add Grafana dashboard `ops/devops/tenant/dashboards/tenant-audit.json` (folder `StellaOps / Tenancy`) to watch latency/error/auth cache ratios per tenant/service.
|
||
- Run the multi-tenant k6 harness `ops/devops/tenant/k6-tenant-load.js` to hit 5k concurrent tenant-labelled requests (defaults to read/write 90/10, header `X-StellaOps-Tenant`).
|
||
- Execute JWKS outage chaos via `ops/devops/tenant/jwks-chaos.sh` on an isolated agent with sudo/iptables; watch alerts `jwks_cache_miss_spike` and `tenant_auth_failures_spike` while load is active.
|
||
|
||
## CI smoke checks
|
||
|
||
The `.gitea/workflows/build-test-deploy.yml` pipeline includes a `notify-smoke` stage that validates scanner event propagation after staging deployments. Configure the following repository secrets (or environment-level secrets) so the job can connect to Redis and the Notify API:
|
||
|
||
- `NOTIFY_SMOKE_REDIS_DSN` – Redis connection string (`redis://user:pass@host:port/db`).
|
||
- `NOTIFY_SMOKE_NOTIFY_BASEURL` – Base URL for the staging Notify WebService (e.g. `https://notify.stage.stella-ops.internal`).
|
||
- `NOTIFY_SMOKE_NOTIFY_TOKEN` – OAuth bearer token (service account) with permission to read deliveries.
|
||
- `NOTIFY_SMOKE_NOTIFY_TENANT` – Tenant identifier used for the smoke validation requests.
|
||
- *(Optional)* `NOTIFY_SMOKE_NOTIFY_TENANT_HEADER` – Override for the tenant header name (defaults to `X-StellaOps-Tenant`).
|
||
|
||
Define the following repository variables (or secrets) to drive the assertions performed by the smoke check:
|
||
|
||
- `NOTIFY_SMOKE_EXPECT_KINDS` – Comma-separated event kinds the checker must observe (for example `scanner.report.ready,scanner.scan.completed`).
|
||
- `NOTIFY_SMOKE_LOOKBACK_MINUTES` – Time window (in minutes) used when scanning the Redis stream for recent events (for example `30`).
|
||
|
||
All of the above values are required—the workflow fails fast with a descriptive error if any are missing or empty. Provide the variables at the organisation or repository scope before enabling the smoke stage.
|