> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
---
# Epic 13: Containerized Distribution & Quickstart
**Short name:** `Containerized Distribution & Quickstart`
**Primary components:** OCI images for all services, Compose Quickstart, Helm chart for production, air-gap bundles
**Surfaces:** Container registry, `/deploy/*`, `/docs/install/*`, Console onboarding screen
**Touches:** Authority (authN/Z), Web Services API, Orchestrator, Task Runner, Policy Engine, Conseiller (Feedser), Excitator (Vexer), Findings Ledger, Export Center, Notifications Studio, Advisory AI Assistant, Object Storage/KMS, Telemetry
**AOC ground rule reminder:** Conseiller and Excitator aggregate and link advisories/VEX. They never merge or mutate source records. Containerized deployments must preserve this behavior and expose links to originals.
---
## 1) What it is
A complete, reproducible containerized distribution of StellaOps with three delivery modes:
1. **Quickstart (single host)** using Docker Compose: one command to run a full stack suitable for evaluation and local development. Ships with seed data and sane defaults.
2. **Production Helm chart** for Kubernetes: modular, scalable, secure-by-default deployment with optional HA and external dependencies.
3. **Air-gapped bundles**: signed offline packages containing images, seed configs, and installation scripts for disconnected environments.

All images are multi-arch (amd64/arm64), signed, SBOM-attached, and versioned with consistent tags. A “Download & Install” doc set guides users from zero to a working system in minutes and to a production-ready posture in hours.
---
## 2) Why (brief)
People don't adopt tools they can't run quickly or securely. Containers make our deployment reproducible; Quickstart removes friction; Helm unlocks real ops. Air-gap bundles acknowledge reality in regulated environments.
---
## 3) How it should work (maximum detail)
### 3.1 Image catalog
Build and publish OCI images for the following:
* `stella-api` (Web Services API)
* `stella-console` (Web UI)
* `stella-orchestrator` (source/job scheduler)
* `stella-task-runner` (executes Task Packs remotely)
* `stella-conseiller` (Feedser; advisory aggregator)
* `stella-excitator` (Vexer; VEX aggregator)
* `stella-policy` (Policy Engine)
* `stella-ledger` (Findings Ledger worker; if separated from API)
* `stella-export` (Export Center worker; optional if part of API)
* `stella-notify` (Notifications Studio worker)
* `stella-ai` (Advisory AI Assistant; lightweight service calling configured LLM backends or local models)
* Support services (optionally bundled for Quickstart): `postgres`, `redis`, `object-store` (S3-compatible), `queue` (NATS or RabbitMQ), `otel-collector`.
**Image standards**
* **Base:** distroless or minimal; non-root user; read-only filesystem; writable `/tmp` only if needed.
* **Ports:** declare via labels; expose health endpoints `/health/liveness`, `/health/readiness`.
* **Env:** explicit, documented, with safe defaults; secrets via env or file mounts only.
* **Config:** `STELLA_*` envs or mounted config directory `/etc/stella/`.
* **SBOM:** attach SPDX JSON as OCI artifact and include in `/app/sbom.spdx.json` baked at build time.
* **Signing:** cosign attestations for image, SBOM, and provenance.
* **Labels:** `org.opencontainers.image.*` (title, version, revision, source, licenses).
* **Entrypoint:** runs as PID 1 and reaps child processes; graceful shutdown on SIGTERM; configurable termination grace period.
* **Logs:** structured JSON by default; stdout/stderr only.
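
The env and config conventions above could be resolved with a small lookup like the one below. This is a minimal sketch, not the services' actual loader: the `STELLA_<NAME>_FILE` indirection and the one-file-per-key layout under `/etc/stella/` are assumptions made for illustration, and `read_setting` is a hypothetical helper name.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_setting(
    name: str,
    env: Mapping[str, str] = os.environ,
    config_dir: Path = Path("/etc/stella"),
    default: Optional[str] = None,
) -> Optional[str]:
    """Resolve one setting: STELLA_<NAME>_FILE, then STELLA_<NAME>, then a mounted file."""
    key = f"STELLA_{name.upper()}"

    # 1. Assumed convention: STELLA_<NAME>_FILE points at a mounted secret file.
    file_ref = env.get(f"{key}_FILE")
    if file_ref:
        return Path(file_ref).read_text(encoding="utf-8").strip()

    # 2. Plain environment variable.
    if key in env:
        return env[key]

    # 3. Assumed layout: one file per key inside the mounted config directory.
    candidate = config_dir / name.lower()
    if candidate.is_file():
        return candidate.read_text(encoding="utf-8").strip()

    return default
```

Checking the file reference first keeps secret values out of the process environment when a mount is available, which matches the "secrets via env or file mounts only" rule.
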
**Tagging scheme**
* `:vX.Y.Z` (immutable release)
* `:vX.Y.Z-rc.N` (release candidate)
* `:edge` (latest main)
* `:nightly-YYYYMMDD` (optional)
* Multi-arch manifest lists for linux/amd64 and linux/arm64.
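
A release pipeline could check tags against this scheme before pushing. The sketch below encodes the four tag shapes listed above as regular expressions; the function names and kind labels are illustrative assumptions, and treating only `:vX.Y.Z` as immutable follows the wording of the list rather than any stated registry policy.

```python
import re
from typing import Optional

# One pattern per tag shape in the scheme; fullmatch() rejects lookalikes.
_TAG_KINDS = [
    ("release", re.compile(r"v\d+\.\d+\.\d+")),
    ("release-candidate", re.compile(r"v\d+\.\d+\.\d+-rc\.\d+")),
    ("edge", re.compile(r"edge")),
    ("nightly", re.compile(r"nightly-\d{8}")),
]


def classify_tag(tag: str) -> Optional[str]:
    """Return the kind of tag, or None when it does not fit the scheme."""
    for kind, pattern in _TAG_KINDS:
        if pattern.fullmatch(tag):
            return kind
    return None


def is_immutable(tag: str) -> bool:
    """Only final releases are described as immutable in the scheme."""
    return classify_tag(tag) == "release"
```

A CI step could refuse to overwrite any tag for which `is_immutable` returns true and reject tags for which `classify_tag` returns `None`.
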
### 3.2 Quickstart (Compose)
**Goal:** `curl | sh` equivalent that yields a working stack on a laptop/server with defaults and demo data. No internet beyond pulling images, unless configured.
**Compose file `deploy/compose/docker-compose.yml`**
* Services:
  * `api`, `console`, `orchestrator`, `task-runner`, `conseiller`, `excitator`, `policy`, `notify`, `export`, `ai`
  * `postgres`, `redis`, `minio` (S3), `nats` or `rabbitmq`, `otel-collector`
* Volumes:
  * `pgdata`, `minio-data`, `redis-data`, `stella-state` (for local cache, packs registry)
* Networks:
  * `stella-net` bridge
* Ports (defaults):
  * Console `8080`, API `8081`, MinIO `9000`, NATS/RabbitMQ default ports
* Env files:
  * `.env.example` with safe defaults; users copy to `.env`.
**Seed data**
* Seed admin account and tenant on first run via `stella-api` migration/seed job.
* Seed demo SBOMs, advisories, VEX samples, baseline policy, and a task pack.
* On first login, Console shows “Welcome” wizard: confirm endpoints, generate API token, run sample scan import, open Vulnerability Explorer.
**Security posture**
* Default credentials only for Quickstart; randomize secrets on first `up` and store them under `.secrets/`.
* All services run as non-root and bind to localhost by default unless `EXPOSE_PUBLIC=1` is set.
* TLS is optional via a `CADDY` or `nginx` sidecar, which is disabled by default.
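
Randomizing secrets on first `up` could look roughly like this. It is a sketch under assumptions: it treats `.secrets/` as a directory holding one file per secret, and the secret names in the usage note are hypothetical. The important property is idempotence, so a second `up` reuses what the first one generated.

```python
import os
import secrets
from pathlib import Path
from typing import Dict, Iterable


def ensure_secrets(secrets_dir: Path, names: Iterable[str]) -> Dict[str, str]:
    """Create one random secret per name on first run; reuse existing files afterwards."""
    secrets_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    values = {}
    for name in names:
        path = secrets_dir / name
        if not path.exists():
            # O_EXCL refuses to clobber an existing file; 0o600 keeps it owner-only.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                handle.write(secrets.token_urlsafe(32))
        values[name] = path.read_text(encoding="utf-8")
    return values
```

For example, `ensure_secrets(Path(".secrets"), ["postgres_password", "minio_secret_key"])` would return the same values on every run after the first.
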
**One-liner**
* `./deploy/compose/quickstart.sh` runs preflight checks, pulls images, writes `.env`, runs `docker compose up -d`, polls readiness, and prints URLs and credentials.
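
The "polls readiness" step is the part most worth getting right, so here is one way it might work, written in Python for illustration even though the script itself is shell. The deadline and interval defaults are assumptions; the probe, clock, and sleep are injectable so the loop can be tested without a running stack.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """True when the endpoint answers 200; any error counts as not ready yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    urls: Iterable[str],
    probe: Callable[[str], bool] = http_ready,
    deadline_s: float = 120.0,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until every URL is ready or the deadline passes; return the URLs still failing."""
    pending = list(urls)
    give_up_at = clock() + deadline_s
    while pending:
        pending = [url for url in pending if not probe(url)]
        if not pending or clock() >= give_up_at:
            break
        sleep(interval_s)
    return pending
```

With the default ports above, a caller might pass `["http://localhost:8081/health/readiness"]` and print the URLs and credentials only when the returned list is empty.
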
**Backups & reset**
* `./deploy/compose/backup.sh` creates a tarball of volumes and config.
* `./deploy/compose/reset.sh` nukes persistent volumes behind a big scary prompt, unless `--yes` is passed.
### 3.3 Production Helm chart
**Chart location:** `deploy/helm/stella/` with subcharts or toggles.
**Chart features**
* Components enabled via values: `api`, `console`, `orchestrator`, `taskRunner`, `conseiller`, `excitator`, `policy`, `notify`, `export`, `ai`.
* External dependencies by default:
  * PostgreSQL, Redis, S3 bucket, message queue, and OTel endpoint provided via values.
  * Optional “bundled” mode for lab clusters using StatefulSets.
* Security:
  * Pod and container security contexts: `runAsNonRoot`, `readOnlyRootFilesystem`, `fsGroup` when needed.
  * NetworkPolicy for east-west traffic; deny-all, then allow specific ports.
  * Secrets as `Secret` from External Secrets operator or sealed secrets.
  * HPA per component; PDBs; liveness/readiness probes.
* Ingress:
  * One hostname for Console, one for API; TLS required in production values.
  * Option to serve Console as static behind CDN while API behind private ingress gateway.
* Config:
  * Values for Authority provider, token TTLs, policy cache TTL, pack registry endpoint, notifications sinks, export locations.
  * Feature flags per epic enablement.
* Migrations:
  * `stella-migrator` Job runs before rollouts; idempotent migrations.
  * Optional “break glass” manual job.
* Observability:
  * `/metrics` endpoints scraped by Prometheus; exemplars via OTel; logs structured.
  * OpenTelemetry auto-config via env if collector provided.
* Upgrades:
  * Blue/green or rolling; readiness gates based on background indexers catching up.
  * Chart hooks to block until Conseiller/Excitator catch up to feed watermarks.
### 3.4 Air-gapped distribution
**Bundle format**
* `stella-bundle-vX.Y.Z.tar.zst` containing:
  * All images as OCI layout (multi-arch), cosign signatures, SBOMs, SLSA provenance.
  * `load.sh` to import into a local registry.
  * `compose/` and `helm/` directories with pinned image digests.
  * `checksums.txt` and `bundle.sig`.

**Process**
* Online build job crafts bundle; signatures produced by CI keys.
* Offline install:
  * Verify `bundle.sig`
  * `./load.sh --to registry.local:5000`
  * `helm install stella ./helm -f values-airgap.yaml --set image.registry=registry.local:5000`
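
Once `bundle.sig` has been verified, the per-file check against `checksums.txt` could be sketched as follows. The epic does not specify the manifest format; this assumes `sha256sum`-style lines (`<hex digest>  <relative path>`), and `verify_checksums` is a hypothetical helper. Checksums only detect corruption or tampering relative to the manifest, so they complement the signature check rather than replace it.

```python
import hashlib
from pathlib import Path
from typing import List


def verify_checksums(bundle_dir: Path, manifest_name: str = "checksums.txt") -> List[str]:
    """Return entries whose file is missing or whose SHA-256 differs from the manifest."""
    failures = []
    manifest = (bundle_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        parts = line.split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            failures.append(line.strip())  # malformed entry
            continue
        expected, rel_path = parts[0].lower(), parts[1].strip().lstrip("*")
        target = bundle_dir / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        digest = hashlib.sha256()
        with target.open("rb") as handle:
            # Stream in 1 MiB chunks so large image layers do not fill memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected:
            failures.append(rel_path)
    return failures
```

An installer would abort before `load.sh` runs if the returned list is non-empty.
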
### 3.5 Configuration matrix
Document every config knob in a single table:
* Auth: Authority issuer, JWKS, RBAC cache TTL.
* Storage: DB URL, pool sizes, migration flags.
* Object store: S3 endpoint, buckets, SSE, IAM.
* Queue: URL, prefetch, retention.
* Policy engine: rule cache TTL, default policy version.
* Conseiller/Excitator: polling intervals, feed sources, retry backoff, max in-flight; **merge disabled** enforced.
* Orchestrator/Task Runner: concurrency, sandbox, network egress policy, artifact retention.
* Notifications: sinks, templates path, batch windows.
* Export Center: formats enabled, rate limits.
* AI Assistant: model endpoint, token limits, guardrails, disabled by default.
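
Two rows of the matrix are invariants rather than tunables: merge stays disabled for Conseiller/Excitator, and the AI Assistant is off unless enabled. A startup check along these lines could enforce them. `DISABLE_MERGE` is named later in this epic; `STELLA_AI_ENABLED`, `STELLA_POLL_INTERVAL_SECONDS`, and the 300-second default are hypothetical names and values used only for this sketch.

```python
from dataclasses import dataclass
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean env var; an unset variable yields the default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


@dataclass(frozen=True)
class Settings:
    merge_disabled: bool
    ai_enabled: bool
    poll_interval_s: int


def load_settings(env: Mapping[str, str]) -> Settings:
    settings = Settings(
        merge_disabled=env_flag(env, "DISABLE_MERGE", True),
        ai_enabled=env_flag(env, "STELLA_AI_ENABLED", False),  # assumed name
        poll_interval_s=int(env.get("STELLA_POLL_INTERVAL_SECONDS", "300")),  # assumed name
    )
    # AOC ground rule: refuse to start rather than run with merging enabled.
    if not settings.merge_disabled:
        raise ValueError("DISABLE_MERGE must remain true for Conseiller/Excitator")
    if settings.poll_interval_s <= 0:
        raise ValueError("poll interval must be a positive number of seconds")
    return settings
```

Failing fast at startup means a misconfigured values file shows up as a crash-looping pod instead of silently merged records.
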
### 3.6 Health, readiness, and upgrades
* **Health endpoints:** `GET /health/liveness` returns 200 if process responsive; `GET /health/readiness` checks dependencies with timeout.
* **Graceful shutdown:** SIGTERM starts drain; HTTP returns 503; background workers flush; exit on deadline.
* **Upgrade choreography:** migrations run, API becomes ready, workers rolling restart, indexes catch up, AOC evaluation warms caches, then flip traffic.
* **Version skew policy:** define supported skew between components; chart validates.
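
The liveness, readiness, and drain behavior described above can be sketched as plain functions, independent of any web framework. The dependency checks are passed in as callables, and the one-second budget and the result labels are assumptions for illustration. All checks share a single deadline so a hung dependency cannot stall the probe beyond the timeout.

```python
import signal
import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CheckTimeout
from typing import Callable, Dict, Tuple

draining = threading.Event()  # set once SIGTERM arrives


def install_drain_handler() -> None:
    """Call from the main thread at startup; SIGTERM then flips readiness to 503."""
    signal.signal(signal.SIGTERM, lambda signum, frame: draining.set())


def liveness() -> int:
    return 200  # the process answered, so it is responsive


def readiness(
    checks: Dict[str, Callable[[], bool]], timeout_s: float = 1.0
) -> Tuple[int, Dict[str, str]]:
    if draining.is_set():
        return 503, {"status": "draining"}
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(check) for name, check in checks.items()}
    deadline = time.monotonic() + timeout_s  # one shared budget for all checks
    results = {}
    for name, future in futures.items():
        try:
            remaining = max(0.0, deadline - time.monotonic())
            results[name] = "ok" if future.result(timeout=remaining) else "failed"
        except CheckTimeout:
            results[name] = "timeout"
        except Exception:
            results[name] = "failed"
    pool.shutdown(wait=False)  # do not block the probe on a hung dependency
    healthy = all(state == "ok" for state in results.values())
    return (200 if healthy else 503), results
```

A service would map `GET /health/liveness` to `liveness()` and `GET /health/readiness` to `readiness({...})` with one callable per dependency, such as a database ping.
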
### 3.7 Security & compliance
* **Image signing & verification:** cosign attestations; optional admission policy to verify signatures by key.
* **SBOM provenance:** attach SPDX and provenance attestations; publish via registry referrers.
* **Non-root & least privilege:** capabilities dropped; only `NET_BIND_SERVICE` for proxies if needed.
* **Secrets handling:** mount from files; avoid putting secrets in args; redacted logs by default.
* **Audit:** container labels propagate release metadata to all logs and spans.
* **AOC enforcement:** images for Conseiller/Excitator hard-disable merge code paths via env/defaults.
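
Since logs are structured JSON, "redacted logs by default" can be applied to the event before it is serialized. The sketch below redacts by key name; the list of sensitive key fragments is an assumption, and a real deployment would likely derive it from the configuration reference.

```python
import json
import re
from typing import Any

# Assumed key fragments that mark a value as sensitive (case-insensitive).
_SENSITIVE_KEY = re.compile(r"(?i)(password|passwd|secret|token|api[_-]?key|authorization)")
REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Recursively replace values stored under sensitive-looking keys."""
    if isinstance(value, dict):
        return {
            key: REDACTED if _SENSITIVE_KEY.search(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def log_line(event: dict) -> str:
    """Serialize one structured log event with secrets removed."""
    return json.dumps(redact(event), sort_keys=True)
```

Key-based redaction does not catch secrets embedded in free-text messages, which is one reason to keep secrets out of args and messages in the first place.
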
### 3.8 Quickstart UX polish
* Console shows “Connected to Quickstart” banner with a button “View install docs” and “Export pack to production.”
* One click to generate a `Task Pack` that exports seed data from Quickstart to a production tenant via Export Center.
---
## 4) Architecture
### 4.1 Repos & layout
```
/deploy
  /compose
    docker-compose.yml
    .env.example
    quickstart.sh
    backup.sh
    reset.sh
  /helm
    /stella
      Chart.yaml
      values.yaml
      values-prod.yaml
      values-airgap.yaml
      templates/*.yaml
  /docker
    stella-api.Dockerfile
    stella-console.Dockerfile
    stella-orchestrator.Dockerfile
    stella-task-runner.Dockerfile
    stella-conseiller.Dockerfile
    stella-excitator.Dockerfile
    stella-policy.Dockerfile
    stella-notify.Dockerfile
    stella-export.Dockerfile
    stella-ai.Dockerfile
```
### 4.2 CI/CD flow
* Build multi-arch with buildx; run unit/integration tests; embed version metadata and SBOM.
* Sign images; push to registry; publish Helm chart with pinned digests.
* Generate air-gap bundle and signatures.
* Smoke test Quickstart on fresh VM; e2e tests exercise Console and CLI parity (Epic 12).
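
"Pinned digests" means chart and bundle manifests reference `repository@sha256:...` rather than a mutable tag. A release script could do the rewrite as below. This is a sketch: the digest map is assumed to come from the push step, and `pin_image` is a hypothetical helper.

```python
from typing import Mapping


def pin_image(ref: str, digests: Mapping[str, str]) -> str:
    """Rewrite repo:tag to repo@sha256:...; refs that are already pinned pass through."""
    if "@sha256:" in ref:
        return ref
    digest = digests.get(ref)
    if digest is None:
        raise KeyError(f"no digest recorded for {ref}")
    # Strip the tag from the last path segment only, so a registry port
    # such as registry.local:5000 is not mistaken for a tag.
    head, _, last = ref.rpartition("/")
    name = last.split(":", 1)[0]
    repository = f"{head}/{name}" if head else name
    return f"{repository}@{digest}"
```

Raising on an unknown reference is deliberate: a chart published with an unpinned image would defeat the reproducibility goal.
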
---
## 5) APIs and contracts
No new external APIs, but every service must expose:
* `GET /health/liveness` and `GET /health/readiness`.
* `GET /version` returning `{ version, gitCommit, buildDate }`.
* `GET /metrics` when enabled.
* Config discovery endpoint for Console with trimmed, safe values (no secrets).
* Conseiller/Excitator must expose `GET /capabilities` returning `{"merge": false}` to prove merge is disabled.
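
The response shapes for `/version` and `/capabilities` are small enough to show directly. The routing function below is a framework-free sketch; the placeholder build values and the `is_aggregator` switch are assumptions, and real services would have build metadata injected at image build time.

```python
import json
from typing import Tuple

# Placeholder values; assumed to be injected at build time in a real image.
BUILD_INFO = {"version": "0.0.0-dev", "gitCommit": "unknown", "buildDate": "unknown"}


def handle(path: str, is_aggregator: bool = False) -> Tuple[int, str]:
    """Return (status, JSON body) for the contract endpoints every service exposes."""
    if path == "/version":
        return 200, json.dumps(BUILD_INFO)
    # Only Conseiller/Excitator expose capabilities, and merge is always false.
    if path == "/capabilities" and is_aggregator:
        return 200, json.dumps({"merge": False})
    return 404, json.dumps({"error": "not found"})
```

An acceptance test can then assert that the aggregator's `/capabilities` body parses to `{"merge": false}` before any data is ingested.
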
---
## 6) Documentation changes
Create/update:
1. `/docs/install/overview.md`
Supported deployment modes, hardware requirements, network ports, quickstart vs production.
2. `/docs/install/compose-quickstart.md`
Preconditions, one-liner, first-login wizard, seed data, reset/backup, common pitfalls.
3. `/docs/install/helm-prod.md`
Prereqs, external dependencies, values reference, TLS/ingress, HPA, PDB, upgrades, rollbacks.
4. `/docs/install/airgap.md`
Bundle verification, loading into private registry, running without internet, patching images.
5. `/docs/install/configuration-reference.md`
The full configuration matrix with examples.
6. `/docs/security/supply-chain.md`
Image signing, SBOMs, provenance, admission controls, non-root posture.
7. `/docs/operations/health-and-readiness.md`
Endpoints, probes, troubleshooting, expected states during upgrades.
8. `/docs/release/image-catalog.md`
All image names, tags, architectures, checksums; mapping between chart version and image digests.
9. `/docs/console/onboarding.md`
Quickstart banner, links to install docs, exporting data to production.
Add at the top of each page:
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
---
## 7) Implementation plan
### New modules/artifacts
* Dockerfiles per service under `/deploy/docker/` with common builder stages.
* Helm chart under `/deploy/helm/stella`.
* Compose quickstart under `/deploy/compose/`.
* Air-gap bundle generator in CI, script `tools/make-airgap-bundle.sh`.
* Seed dataset packaged as container image layer or mounted config.
### Changes to services
* Add health/version/metrics endpoints where missing.
* Ensure all services read config from env/files with defaults suitable for Quickstart.
* Conseiller/Excitator: add hard config flag `DISABLE_MERGE=true` defaulted in images and values.
* API: seed job and migration runner; serve `/welcome` state for Console wizard.
* Console: onboarding wizard and Quickstart banner.
* Task Runner: respect offline mode by failing gracefully if egress is blocked.
### Packaging & signing
* Embed SBOM in all images; publish as OCI referrers.
* Cosign sign images and attest provenance; verify in CI.
* Publish checksums and signatures on release page.
---
## 8) Engineering tasks
**Images**
* [ ] Author multi-stage Dockerfiles with cache-efficient builds.
* [ ] Add non-root user, drop capabilities, read-only FS, healthcheck scripts.
* [ ] Generate and attach SBOM for each image.
* [ ] Implement `/health/*`, `/version`, optional `/metrics`.
**Compose**
* [ ] Write `docker-compose.yml` with all core services and deps.
* [ ] Create `.env.example`, `quickstart.sh`, `backup.sh`, `reset.sh`.
* [ ] Seed job container and sample data ingestion on first run.
**Helm**
* [ ] Scaffold chart; values for each component; pinned digests.
* [ ] Ingress, TLS, HPA, PDB, NetworkPolicy, ServiceAccount/RBAC.
* [ ] Migration Job and upgrade hooks; readiness gates for indexers.
* [ ] Documentation of values with `helm-docs` generator.
**Air-gap**
* [ ] Build script to save images to OCI layout; compress, sign, and checksum.
* [ ] `load.sh` to import into private registry and rewrite manifests.
* [ ] `values-airgap.yaml` with image registry overrides.
**Console & API**
* [ ] Onboarding wizard, Quickstart banner, links to docs.
* [ ] Seed data endpoints guarded behind `QUICKSTART_MODE`.
* [ ] Config discovery endpoint for console.
**Security**
* [ ] Cosign integration; key management; CI verification step.
* [ ] Admission policy example in docs to enforce signatures.
* [ ] Secret redaction in logs; env var audit.
**Observability**
* [ ] OTel config sample; `/metrics` endpoints; compose prom scrape.
* [ ] Helm values for tracing and metrics.
**Validation**
* [ ] Fresh VM smoke test for Compose quickstart.
* [ ] Kind cluster e2e for Helm path.
* [ ] Air-gap install test in CI with a local registry.
**Docs**
* [ ] Write all pages listed in §6 with copy-pasteable commands and screenshots.
* [ ] Include a troubleshooting matrix: symptom → probable cause → fix.
* [ ] Add “Imposed rule” header line to each page.
---
## 9) Feature changes required
* **Console:** Onboarding wizard, Quickstart banner, and deep links to install docs; “Copy CLI” buttons should prefer the `stella` container image in quickstart if local binary missing.
* **API:** Seed job and health endpoints; version reporting; feature flag `QUICKSTART_MODE`.
* **Registry/Release tooling:** Publish image catalog and checksums; maintain compatibility matrix per chart version.
* **Task Runner:** Offline mode awareness and an explicit error when attempting egress in an air-gapped environment.
* **Conseiller/Excitator:** enforce non-merge at runtime and expose the capability endpoint.
---
## 10) Acceptance criteria
* Quickstart: from clean host to working Console in under 5 minutes on a typical laptop; seed data visible; AOC rules active.
* Helm: install succeeds with external dependencies; roll forward and roll back with zero data loss; probes green.
* Air-gap: bundle verifies, loads to a private registry, and installs without external network.
* All images: signed, SBOM-attached, non-root, read-only FS, health endpoints exposed.
* Docs: a new user can complete Quickstart without assistance; a platform team can deploy the chart with only values editing.
* Conseiller/Excitator: capability endpoint confirms `merge=false`; tests prove aggregation-only behavior.
---
## 11) Risks & mitigations
* **Config sprawl.** Centralize in `/docs/install/configuration-reference.md` and ship sane defaults.
* **Drift between Compose and Helm.** Pin digests; generate manifests from a common values source where possible; CI diff.
* **Resource contention in Quickstart.** Limit concurrency; ship low default worker counts; document overrides.
* **Air-gap surprises.** Remove implicit egress; provide offline doc copies in bundle; deterministic artifact paths.
* **Security regressions.** Enforce non-root/read-only in CI; signature verification gates release.
---
## 12) Philosophy
* **First run matters.** Quickstart must be boring, predictable, and immediately useful.
* **Prod isn't a flag.** Helm defaults are safe; “convenience” belongs in Quickstart, not production.
* **Prove your supply chain.** Signed images, SBOMs, and provenance are table stakes, not an upsell.