devops folders consolidate

This commit is contained in:
master
2026-01-25 23:27:41 +02:00
parent 6e687b523a
commit a50bbb38ef
334 changed files with 35079 additions and 5569 deletions

@@ -2,34 +2,44 @@
This directory contains operational tooling, deployment configurations, and CI/CD support for StellaOps.
## Infrastructure Stack
| Component | Technology | Purpose |
|-----------|------------|---------|
| Database | PostgreSQL 18.1 | Primary data store |
| Messaging/Cache | Valkey 9.0.1 | Queues, caching, pub/sub |
| Object Storage | RustFS | S3-compatible storage |
| Transparency Log | Rekor v2 | Sigstore transparency |
## Directory Structure
```
devops/
├── ansible/ # Ansible playbooks for deployment automation
├── compose/ # Docker Compose configurations
├── compose/ # Docker Compose configurations (consolidated)
│ ├── docker-compose.stella-ops.yml # Main stack
│ ├── docker-compose.telemetry.yml # Observability stack
│ ├── docker-compose.testing.yml # CI/testing services
│ └── docker-compose.compliance-*.yml # Regional crypto overlays
├── database/ # Database schemas and migrations
│ ├── mongo/ # MongoDB (deprecated)
│ └── postgres/ # PostgreSQL schemas
│ ├── migrations/ # Schema migration scripts
│ └── postgres/ # PostgreSQL configuration
├── docker/ # Dockerfiles and container build scripts
│ ├── Dockerfile.ci # CI runner environment
│ └── base/ # Base images
│ └── repro-builders/ # Reproducible build containers
├── docs/ # This documentation
├── gitlab/ # GitLab CI templates (legacy)
├── helm/ # Helm charts for Kubernetes deployment
│ └── stellaops/ # Main Helm chart with env-specific values
├── logging/ # Logging configuration templates
│ ├── serilog.json.template # Serilog config for .NET services
│ ├── filebeat.yml # Filebeat for log shipping
│ └── logrotate.conf # Log rotation configuration
├── observability/ # Monitoring, metrics, and tracing
├── observability/ # Monitoring, alerting, and dashboards
├── offline/ # Air-gap deployment support
│ ├── airgap/ # Air-gap bundle scripts
│ └── kit/ # Offline installation kit
├── releases/ # Release artifacts and manifests
├── scripts/ # Operational scripts
├── scripts/ # Operational scripts and libraries
├── services/ # Per-service operational configs
├── telemetry/ # OpenTelemetry and metrics configs
└── tools/ # DevOps tooling
├── telemetry/ # OpenTelemetry collector and storage
└── tools/ # DevOps tooling and helpers
```
## Quick Start

@@ -9,8 +9,8 @@ This directory contains deterministic deployment bundles for the core Stella Ops
- `compose/docker-compose.mirror.yaml`: managed mirror bundle for `*.stella-ops.org` with gateway cache and multi-tenant auth.
- `compose/docker-compose.telemetry.yaml`: optional OpenTelemetry collector overlay (mutual TLS, OTLP pipelines).
- `compose/docker-compose.telemetry-storage.yaml`: optional Prometheus/Tempo/Loki stack for observability backends.
- `helm/stellaops/`: multi-profile Helm chart with values files for dev/stage/airgap.
- `helm/stellaops/INSTALL.md`: install guide and runbook for prod and airgap profiles with digest pins.
- `telemetry/`: shared OpenTelemetry collector configuration and certificate artefacts (generated via tooling).
- `tools/validate-profiles.sh` helper that runs `docker compose config` and `helm lint/template` for every profile.
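
The validation pass described for `tools/validate-profiles.sh` can be sketched as follows. This is not the script itself, whose contents are not shown here; it is a minimal Python illustration that only assembles the `docker compose config` and `helm lint`/`helm template` invocations. The function names, the release name `stellaops`, and the example file paths are assumptions.

```python
import subprocess


def build_validation_commands(compose_files, helm_chart, helm_values):
    """Return the argv lists a validation pass would run, without executing them."""
    commands = []
    for compose_file in compose_files:
        # `docker compose config --quiet` parses the file and exits non-zero on errors.
        commands.append(
            ["docker", "compose", "-f", str(compose_file), "config", "--quiet"]
        )
    for values_file in helm_values:
        # Lint the chart, then render it, once per values file (profile).
        commands.append(["helm", "lint", str(helm_chart), "--values", str(values_file)])
        commands.append(
            # "stellaops" is a hypothetical release name used only for rendering.
            ["helm", "template", "stellaops", str(helm_chart), "--values", str(values_file)]
        )
    return commands


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes the list easy to inspect or log in CI before anything runs.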
@@ -24,37 +24,30 @@ This directory contains deterministic deployment bundles for the core Stella Ops
`python ./ops/devops/telemetry/smoke_otel_collector.py` to verify the OTLP endpoints.
5. Commit the change alongside any documentation updates (e.g. install guide cross-links).
Maintaining the digest linkage keeps offline/air-gapped installs reproducible and avoids tag drift between environments.
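
As an illustration of guarding against tag drift, the sketch below flags image references that are not pinned by a `sha256` digest. It assumes references are plain strings in the usual `repository@sha256:<64 hex chars>` form; the function name and the example registry are hypothetical and not part of the repository tooling.

```python
import re

# A digest-pinned reference looks like "registry/name@sha256:<64 hex characters>".
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def unpinned_images(image_refs):
    """Return the references that use a tag (or nothing) instead of a digest."""
    return [ref for ref in image_refs if not DIGEST_REF.match(ref)]
```

An empty result means every reference is digest-pinned, so the same bytes are pulled in every environment.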
### Surface.Env rollout warnings
- Compose (`deploy/compose/env/*.env.example`) and Helm (`deploy/helm/stellaops/values-*.yaml`) now seed `SCANNER_SURFACE_*` _and_ `ZASTAVA_SURFACE_*` variables so Scanner Worker/WebService and Zastava Observer/Webhook resolve cache roots, Surface.FS endpoints, and secrets providers through `StellaOps.Scanner.Surface.Env`.
- During rollout, watch for structured log messages (and readiness output) prefixed with `surface.env.`—for example, `surface.env.cache_root_missing`, `surface.env.endpoint_unreachable`, or `surface.env.secrets_provider_invalid`.
- Treat these warnings as deployment blockers: update the endpoint/cache/secrets values or permissions before promoting the environment; otherwise, workers will fail fast at startup.
- Air-gapped bundles default the secrets provider to `file` with `/etc/stellaops/secrets`; connected clusters default to `kubernetes`. Adjust the provider/root pair if your secrets manager differs.
- Secret provisioning workflows for Kubernetes/Compose/Offline Kit are documented in `ops/devops/secrets/surface-secrets-provisioning.md`; follow that for `Surface.Secrets` handles and RBAC/permissions.
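
A rollout gate based on the `surface.env.` prefix could look like the sketch below. It assumes log output is available as plain text lines and that warning codes consist of lowercase letters, digits, and underscores, as in the examples above; the function names are hypothetical.

```python
import re
from collections import Counter

# Matches codes such as "surface.env.cache_root_missing".
SURFACE_ENV_CODE = re.compile(r"\bsurface\.env\.[a-z0-9_]+")


def find_surface_env_warnings(log_lines):
    """Count each surface.env.* code that appears in the given log lines."""
    counts = Counter()
    for line in log_lines:
        counts.update(SURFACE_ENV_CODE.findall(line))
    return counts


def is_promotable(log_lines):
    """Treat any surface.env.* warning as a deployment blocker."""
    return not find_surface_env_warnings(log_lines)
```

For example, feeding it the worker logs collected during rollout and refusing promotion when `is_promotable` returns `False` mirrors the "deployment blocker" rule.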
### Mongo2Go OpenSSL prerequisites
- Linux runners that execute Mongo2Go-backed suites (Excititor, Scheduler, Graph, etc.) must expose OpenSSL 1.1 (`libcrypto.so.1.1`, `libssl.so.1.1`). The canonical copies live under `tests/native/openssl-1.1/linux-x64`.
- Export `LD_LIBRARY_PATH="$(git rev-parse --show-toplevel)/tests/native/openssl-1.1/linux-x64:${LD_LIBRARY_PATH:-}"` before invoking `dotnet test`. Example:\
`LD_LIBRARY_PATH="$(pwd)/tests/native/openssl-1.1/linux-x64" dotnet test src/Excititor/__Tests/StellaOps.Excititor.WebService.Tests/StellaOps.Excititor.WebService.Tests.csproj --nologo`.
- CI agents or Dockerfiles that host these tests should either mount the directory into the container or copy the two `.so` files into a directory that is already on the runtime library path.
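
The prerequisite check can be automated before `dotnet test` runs. The sketch below assumes a Linux runner, with the library directory and file names taken from the bullets above; the helper names are hypothetical.

```python
import os
from pathlib import Path

REQUIRED_LIBS = ("libcrypto.so.1.1", "libssl.so.1.1")


def missing_openssl_libs(lib_dir):
    """Return the required OpenSSL 1.1 libraries that are absent from lib_dir."""
    lib_dir = Path(lib_dir)
    return [name for name in REQUIRED_LIBS if not (lib_dir / name).is_file()]


def dotnet_test_env(repo_root, base_env=None):
    """Copy the environment and prepend the bundled OpenSSL directory to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    lib_dir = Path(repo_root) / "tests" / "native" / "openssl-1.1" / "linux-x64"
    existing = env.get("LD_LIBRARY_PATH", "")
    # Avoid a trailing ":" when the variable was unset: an empty entry makes the
    # dynamic loader search the current working directory.
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{existing}" if existing else str(lib_dir)
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run(["dotnet", "test", ...])`, after failing early if `missing_openssl_libs` reports anything.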
### Additional tooling
- `deploy/tools/check-channel-alignment.py` verifies that Helm/Compose profiles reference the exact images listed in a release manifest. Run it for each channel before promoting a release.
- `ops/devops/telemetry/generate_dev_tls.sh` produces local CA/server/client certificates for Compose-based collector testing.
- `ops/devops/telemetry/smoke_otel_collector.py` sends OTLP traffic and asserts the collector accepted traces, metrics, and logs.
- `ops/devops/telemetry/package_offline_bundle.py` packages telemetry assets (config/Helm/Compose) into a signed tarball for air-gapped installs.
- `docs/modules/devops/runbooks/deployment-upgrade.md`: end-to-end instructions for upgrade, rollback, and channel promotion workflows (Helm + Compose).
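
The idea behind the channel alignment check, comparing profile images against a release manifest, can be sketched as below. This does not reproduce `check-channel-alignment.py` or its manifest format, neither of which is shown here; it assumes both sides have already been reduced to lists of image reference strings, and the function name is hypothetical.

```python
def find_misaligned_images(release_images, profile_images):
    """Return profile image references that the release manifest does not list exactly."""
    allowed = set(release_images)
    # Exact string comparison: a differing tag or digest counts as misaligned.
    return sorted(image for image in set(profile_images) if image not in allowed)
```

An empty list for every Helm and Compose profile in a channel is the condition to meet before promoting a release.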
### Tenancy observability & chaos (DEVOPS-TEN-49-001)
- Import `ops/devops/tenant/recording-rules.yaml` and `ops/devops/tenant/alerts.yaml` into your Prometheus rule groups.
- Add Grafana dashboard `ops/devops/tenant/dashboards/tenant-audit.json` (folder `StellaOps / Tenancy`) to watch latency/error/auth cache ratios per tenant/service.
- Run the multi-tenant k6 harness `ops/devops/tenant/k6-tenant-load.js` to hit 5k concurrent tenant-labelled requests (defaults to read/write 90/10, header `X-StellaOps-Tenant`).
- Execute JWKS outage chaos via `ops/devops/tenant/jwks-chaos.sh` on an isolated agent with sudo/iptables; watch alerts `jwks_cache_miss_spike` and `tenant_auth_failures_spike` while load is active.
## CI smoke checks