# Stella Ops Compose Profiles
These Compose bundles ship the minimum services required to exercise the scanner pipeline plus control-plane dependencies. Every profile is pinned to immutable image digests sourced from `deploy/releases/*.yaml` and is linted via `docker compose config` in CI.
## Layout
| Path | Purpose |
| ---- | ------- |
| `docker-compose.dev.yaml` | Edge/nightly stack tuned for laptops and iterative work. |
| `docker-compose.stage.yaml` | Stable channel stack mirroring pre-production clusters. |
| `docker-compose.prod.yaml` | Production cutover stack with front-door network hand-off and Notify events enabled. |
| `docker-compose.airgap.yaml` | Stable stack with air-gapped defaults (no outbound hostnames). |
| `docker-compose.mirror.yaml` | Managed mirror topology for `*.stella-ops.org` distribution (Concelier + Excititor + CDN gateway). |
| `docker-compose.telemetry.yaml` | Optional OpenTelemetry collector overlay (mutual TLS, OTLP ingest endpoints). |
| `docker-compose.telemetry-storage.yaml` | Prometheus/Tempo/Loki storage overlay with multi-tenant defaults. |
| `docker-compose.gpu.yaml` | Optional GPU overlay enabling NVIDIA devices for Advisory AI web/worker. Apply with `-f docker-compose.<env>.yaml -f docker-compose.gpu.yaml`. |
| `env/*.env.example` | Seed `.env` files that document required secrets and ports per profile. |
| `scripts/backup.sh` | Pauses workers and creates a tar.gz archive of the Mongo/MinIO/Redis volumes (deterministic snapshot). |
| `scripts/reset.sh` | Stops the stack and removes the Mongo/MinIO/Redis volumes after explicit confirmation. |
| `scripts/quickstart.sh` | Helper that validates config and starts the dev stack; set `USE_MOCK=1` to include the `docker-compose.mock.yaml` overlay. |
| `docker-compose.mock.yaml` | Dev-only overlay with placeholder digests for missing services (orchestrator, policy-registry, packs, task-runner, VEX/Vuln stack). Use only with the mock release manifest `deploy/releases/2025.09-mock-dev.yaml`. |
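
The two volume helpers are typically run straight from this directory; a quick sketch (both scripts are described in the table above, and neither takes flags documented here):

```bash
# sketch: volume lifecycle helpers (run from the compose bundle directory)
./scripts/backup.sh   # pauses workers, snapshots Mongo/MinIO/Redis volumes to tar.gz
./scripts/reset.sh    # stops the stack and removes the same volumes after confirmation
```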
## Usage
```bash
cp env/dev.env.example dev.env
docker compose --env-file dev.env -f docker-compose.dev.yaml config
docker compose --env-file dev.env -f docker-compose.dev.yaml up -d
```
The stage and airgap variants behave the same way; swap the file names accordingly. All profiles expose 443/8443 for the UI and REST APIs, and they share a `stellaops` Docker network scoped to the compose project.
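
For example, a minimal stage bring-up follows the dev flow above with the file names swapped (assumes an `env/stage.env.example` seed per the `env/*.env.example` convention):

```bash
# sketch: stage profile bring-up; env file name follows the documented pattern
cp env/stage.env.example stage.env
docker compose --env-file stage.env -f docker-compose.stage.yaml up -d
```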
> **Graph Explorer reminder:** If you enable Cartographer or Graph API containers alongside these profiles, update `etc/authority.yaml` so the `cartographer-service` client is marked with `properties.serviceIdentity: "cartographer"` and carries a tenant hint. The Authority host now refuses `graph:write` tokens without that marker, so apply the configuration change before rolling out the updated images.
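
A quick pre-rollout check, sketched with `grep` (the exact YAML layout of `etc/authority.yaml` is not shown in this document, so treat the expected output as an assumption):

```bash
# sketch: confirm the cartographer client entry carries the required marker
grep -n -A4 'cartographer-service' etc/authority.yaml
# expect properties.serviceIdentity: "cartographer" plus a tenant hint nearby
```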
### Telemetry collector overlay
The OpenTelemetry collector overlay is optional and can be layered on top of any profile:
```bash
./ops/devops/telemetry/generate_dev_tls.sh
docker compose -f docker-compose.telemetry.yaml up -d
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
docker compose -f docker-compose.telemetry-storage.yaml up -d
```
The generator script creates a development CA plus server/client certificates under `deploy/telemetry/certs/`. The smoke test sends OTLP/HTTP payloads using the generated client certificate and asserts the collector reports accepted traces, metrics, and logs.

The storage overlay starts Prometheus, Tempo, and Loki with multitenancy enabled so you can validate the end-to-end pipeline before promoting changes to staging. Adjust the configs in `deploy/telemetry/storage/` before running in production. Mount the same certificates when running workloads so the collector can enforce mutual TLS.
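
As a sketch of that mutual-TLS hand-off (image name, endpoint, and certificate file names are assumptions; the env vars are the standard OTLP exporter settings):

```bash
# sketch: reuse the generated dev certificates from a workload container
docker run --rm \
  -v "$PWD/deploy/telemetry/certs:/etc/otel/certs:ro" \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=https://localhost:4318 \
  -e OTEL_EXPORTER_OTLP_CERTIFICATE=/etc/otel/certs/ca.crt \
  -e OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE=/etc/otel/certs/client.crt \
  -e OTEL_EXPORTER_OTLP_CLIENT_KEY=/etc/otel/certs/client.key \
  my-workload:latest
```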
For production cutovers, copy `env/prod.env.example` to `prod.env`, update the secret placeholders, and create the external network expected by the profile:
```bash
docker network create stellaops_frontdoor
docker compose --env-file prod.env -f docker-compose.prod.yaml config
```
### Scanner event stream settings
Scanner WebService can emit signed `scanner.report.*` events to Redis Streams when `SCANNER__EVENTS__ENABLED=true`. Each profile ships environment placeholders you can override in the `.env` file; a sample excerpt appears below:

- `SCANNER_EVENTS_ENABLED` – toggles emission on/off (defaults to `false`).
- `SCANNER_EVENTS_DRIVER` – currently only `redis` is supported.
- `SCANNER_EVENTS_DSN` – Redis endpoint; leave blank to reuse the queue DSN when it uses `redis://`.
- `SCANNER_EVENTS_STREAM` – stream name (`stella.events` by default).
- `SCANNER_EVENTS_PUBLISH_TIMEOUT_SECONDS` – per-publish timeout window (defaults to `5`).
- `SCANNER_EVENTS_MAX_STREAM_LENGTH` – maximum stream length before Redis trims entries (defaults to `10000`).

Helm values mirror the same knobs under each service’s `env` map (see `deploy/helm/stellaops/values-*.yaml`).
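
A sample `.env` excerpt, as referenced above (the Redis DSN hostname is an assumption; the other values restate the documented defaults):

```bash
# hypothetical dev.env excerpt enabling signed event emission
SCANNER_EVENTS_ENABLED=true
SCANNER_EVENTS_DRIVER=redis
SCANNER_EVENTS_DSN=redis://redis:6379
SCANNER_EVENTS_STREAM=stella.events
SCANNER_EVENTS_PUBLISH_TIMEOUT_SECONDS=5
SCANNER_EVENTS_MAX_STREAM_LENGTH=10000
```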
### Scheduler worker configuration
Every Compose profile now provisions the `scheduler-worker` container (backed by the `StellaOps.Scheduler.Worker.Host` entrypoint). The environment placeholders exposed in the `.env` samples match the options bound by `AddSchedulerWorker`:

- `SCHEDULER_QUEUE_KIND` – queue transport (`Nats` or `Redis`).
- `SCHEDULER_QUEUE_NATS_URL` – NATS connection string used by planner/runner consumers.
- `SCHEDULER_STORAGE_DATABASE` – MongoDB database name for scheduler state.
- `SCHEDULER_SCANNER_BASEADDRESS` – base URL the runner uses when invoking Scanner’s `/api/v1/reports` (defaults to the in-cluster `http://scanner-web:8444`).

Helm deployments inherit the same defaults from `services.scheduler-worker.env` in `values.yaml`; override them per environment as needed.
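
A sample `.env` excerpt for the worker (the NATS URL and database name are illustrative; the Scanner address restates the documented in-cluster default):

```bash
# hypothetical .env excerpt for scheduler-worker
SCHEDULER_QUEUE_KIND=Nats
SCHEDULER_QUEUE_NATS_URL=nats://nats:4222
SCHEDULER_STORAGE_DATABASE=scheduler
SCHEDULER_SCANNER_BASEADDRESS=http://scanner-web:8444
```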
### Advisory AI configuration
`advisory-ai-web` hosts the API/plan cache while `advisory-ai-worker` executes queued tasks. Both containers mount the shared volumes (`advisory-ai-queue`, `advisory-ai-plans`, `advisory-ai-outputs`) so they always read/write the same deterministic state. New environment knobs (a sample excerpt appears below):

- `ADVISORY_AI_SBOM_BASEADDRESS` – endpoint the SBOM context client hits (defaults to the in-cluster Scanner URL).
- `ADVISORY_AI_INFERENCE_MODE` – `Local` (default) keeps inference on-prem; `Remote` posts sanitized prompts to the URL supplied via `ADVISORY_AI_REMOTE_BASEADDRESS`. Optional `ADVISORY_AI_REMOTE_APIKEY` carries the bearer token when remote inference is enabled.
- `ADVISORY_AI_WEB_PORT` – host port for `advisory-ai-web`.

The Helm chart mirrors these settings under `services.advisory-ai-web` / `advisory-ai-worker` and expects a PVC named `stellaops-advisory-ai-data` so both deployments can mount the same RWX volume.
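
A sample `.env` excerpt switching to remote inference (URL, token, and port are placeholders, not defaults from this repository):

```bash
# hypothetical .env excerpt; sanitized prompts leave the host in Remote mode
ADVISORY_AI_INFERENCE_MODE=Remote
ADVISORY_AI_REMOTE_BASEADDRESS=https://inference.internal.example
ADVISORY_AI_REMOTE_APIKEY=change-me
ADVISORY_AI_WEB_PORT=8447
```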
### Front-door network hand-off
`docker-compose.prod.yaml` adds a `frontdoor` network so operators can attach Traefik, Envoy, or an on-prem load balancer that terminates TLS. Override `FRONTDOOR_NETWORK` in `prod.env` if your reverse proxy uses a different bridge name. Attach only the externally reachable services (Authority, Signer, Attestor, Concelier, Scanner Web, Notify Web, UI) to that network; internal infrastructure (Mongo, MinIO, RustFS, NATS) stays on the private `stellaops` network.
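
A sketch of pointing the profile at an existing proxy bridge (the `traefik_proxy` name is an assumption):

```bash
# sketch: reuse a reverse proxy's bridge instead of the default front-door network
docker network create traefik_proxy   # skip if the proxy already created it
echo 'FRONTDOOR_NETWORK=traefik_proxy' >> prod.env
docker compose --env-file prod.env -f docker-compose.prod.yaml config >/dev/null
```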
### Updating to a new release
1. Import the new manifest into `deploy/releases/` (see `deploy/README.md`).
2. Update image digests in the relevant Compose file(s).
3. Re-run `docker compose config` to confirm the bundle is deterministic.
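
For example, re-validating the stage bundle after a digest bump (a sketch; pick the env file matching the profile you edited):

```bash
# sketch: step 3 for the stage profile
docker compose --env-file stage.env -f docker-compose.stage.yaml config >/dev/null \
  && echo "stage bundle renders deterministically"
```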
### Mock overlay for missing digests (dev only)
Until official digests land, you can exercise Compose packaging with mock placeholders:
```bash
# assumes docker-compose.dev.yaml as the base profile
USE_MOCK=1 ./scripts/quickstart.sh env/dev.env.example
```
The overlay pins the missing services (orchestrator, policy-registry, packs-registry, task-runner, VEX/Vuln stack) to mock digests from `deploy/releases/2025.09-mock-dev.yaml` and starts their real entrypoints so integration flows can be exercised end-to-end. Replace the mock pins with production digests once releases publish; keep the mock overlay dev-only.

Keep digests synchronized between Compose, Helm, and the release manifest to preserve reproducibility guarantees. `deploy/tools/validate-profiles.sh` performs a quick audit.
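
The audit is a plain script invocation from the repository root (a sketch; this README documents no flags for it):

```bash
# sketch: cross-check digests across Compose, Helm, and the release manifest
./deploy/tools/validate-profiles.sh
```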
### GPU toggle for Advisory AI
GPU is disabled by default. To run inference on NVIDIA GPUs:
```bash
docker compose \
  --env-file prod.env \
  -f docker-compose.prod.yaml \
  -f docker-compose.gpu.yaml \
  up -d
```
The GPU overlay requests one GPU for `advisory-ai-worker` and `advisory-ai-web` and sets `ADVISORY_AI_INFERENCE_GPU=true`. Ensure the host has the NVIDIA container runtime and that the base compose file still sets the correct digests.
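
To confirm the device request actually took effect, one option is to run `nvidia-smi` inside the worker (assumes the image ships the tool):

```bash
# sketch: verify the worker sees the GPU after applying the overlay
docker compose --env-file prod.env \
  -f docker-compose.prod.yaml -f docker-compose.gpu.yaml \
  exec advisory-ai-worker nvidia-smi
```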