# component_architecture_devops.md — **Stella Ops Release & Operations** (2025Q4)

> **Scope.** Implementation‑ready blueprint for **how Stella Ops is built, versioned, signed, distributed, upgraded, licensed (PoE)**, and operated in customer environments (online and air‑gapped). Covers reproducible builds, supply‑chain attestations, registries, offline kits, migration/rollback, artifact lifecycle (RustFS default + Mongo, S3 fallback), monitoring SLOs, and customer activation.

---

## 0) Product vision (operations lens)

Stella Ops must be **trustable at a glance** and **boringly operable**:

* Every release ships with **first‑party SBOMs, provenance, and signatures**; services verify **each other’s** integrity at runtime.
* Customers can deploy by **digest** and stay aligned with **LTS/stable/edge** channels.
* Paid customers receive **attestation authority** (Signer accepts their PoE) while the core platform remains **free to run**.
* Air‑gapped customers receive **offline kits** with verifiable digests and deterministic import.
* Artifacts expire predictably; operators know what’s kept, for how long, and why.

---

## 1) Release trains & versioning

### 1.1 Channels

* **LTS** (12‑month support window): quarterly cadence (Q1/Q2/Q3/Q4).
* **Stable** (default): monthly rollup (bug fixes + compatible features).
* **Edge**: weekly; for early adopters, no guarantees.

### 1.2 Version strings

Semantic core + calendar tag:

```
MAJOR.MINOR.PATCH (YYYY.MM)
e.g., 2.4.1 (2027.06)
```

* **MAJOR**: breaking API/DB changes (rare).
* **MINOR**: new features, compatible schema migrations (expand/contract pattern).
* **PATCH**: bug fixes, perf and security updates.
* **Calendar tag** exposes **release year** used by Signer for **PoE window checks**.

### 1.3 Component alignment

A release is a **bundle** of image digests + charts + manifests. All services in a bundle are **wire‑compatible**. Mixed minor versions are allowed within a bounded skew:

* **Web UI ↔ backend**: `±1 minor`.
* **Scanner ↔ Policy/Excititor/Concelier**: `±1 minor`.
* **Authority/Signer/Attestor triangle**: **must** be same minor (crypto and DPoP/mTLS binding rules).

At startup, services **self‑advertise** their semver & channel; the UI surfaces **mismatch warnings**.

---

## 2) Supply‑chain pipeline (how a release is built)

### 2.1 Deterministic builds

* **Builders**: isolated **BuildKit** workers with pinned base images (digest only).
* **Pinning**: lock files or `go.mod`, `package-lock.json`, `global.json`, `Directory.Packages.props` are **frozen** at tag.
* **Reproducibility**: timestamps normalized; source date epoch; deterministic zips/tars.
* **Multi‑arch**: linux/amd64 + linux/arm64 (Windows images track M2 roadmap).

### 2.2 First‑party SBOMs & provenance

* Each image gets **CycloneDX (JSON+Protobuf) SBOM** and **SLSA‑style provenance** attached as **OCI referrers**.
* Scanner’s **Buildx generator** is used to produce SBOMs *during* build; a separate post‑build scan verifies parity (red flag if drift).
* **Release manifest** (see §6.1) lists all digests and SBOM/attestation refs.

### 2.3 Signing & transparency

* Images are **cosign‑signed** (keyless) with a Stella Ops release identity; inclusion in a **transparency log** (Rekor) is required.
* SBOM and provenance attestations are **DSSE** and also transparency‑logged.
* Release keys (Fulcio roots or public keys) are embedded in **Signer** policy (for **scanner‑release validation** at customer side).

### 2.4 Gates & tests

* **Static**: linters, codegen checks, protobuf API freeze (backward‑compat tests).
* **Unit/integration**: per‑component, plus **end‑to‑end** flows (scan→vex→policy→sign→attest).
* **Perf SLOs**: hot paths (SBOM compose, diff, export) measured against budgets.
* **Security**: dependency audit vs Concelier export; container hardening tests; minimal caps.
* **Canary cohort**: internal staging + selected customers; one week on **edge** before **stable** tag.
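
The reproducibility rules in §2.1 (normalized timestamps, source date epoch, deterministic tars) can be illustrated with a small sketch. It is not the Stella Ops build tooling: the function name `build_deterministic_tar` and its parameters are hypothetical, and it only shows the general technique of fixing entry order and metadata so that identical inputs yield byte-identical archives.

```python
import hashlib
import io
import tarfile
from typing import Mapping


def build_deterministic_tar(files: Mapping[str, bytes], source_date_epoch: int = 0) -> bytes:
    """Pack files into a tar whose bytes depend only on names, contents, and the epoch.

    Hypothetical helper for illustration; not part of any Stella Ops tool.
    """
    buf = io.BytesIO()
    # USTAR avoids PAX extended headers, keeping the header layout fixed.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):            # stable entry order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = source_date_epoch    # normalized timestamp
            info.mode = 0o644                 # fixed permissions
            info.uid = info.gid = 0           # no builder-specific ownership
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 of the given bytes."""
    return hashlib.sha256(data).hexdigest()
```

Two builds from the same inputs and the same epoch should then hash identically, regardless of the order in which the files were supplied, which is the property a parity check between builders would compare.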

---

## 3) Distribution & activation

### 3.1 Registries

* **Primary**: `registry.stella-ops.org` (OCI v2, supports Referrers API).
* **Mirrors**: GHCR (read‑only), regional mirrors for latency.
* Operational runbook: see `docs/ops/concelier-mirror-operations.md` for deployment profiles, CDN guidance, and sync automation.
* **Pull by digest only** in Kubernetes/Compose manifests.

**Gating policy**:

* **Core images** (Authority, Scanner, Concelier, Excititor, Attestor, UI): public **read**.
* **Enterprise add‑ons** (if any) and **pre‑release**: private repos via OAuth2 token service.

> Monetization lever is **signing** (PoE gate), not image pulls, so the core remains simple to consume.

### 3.2 OAuth2 token service (for private repos)

* Docker Registry’s token flow backed by **Authority**:

  1. Client hits registry (`401` with `WWW-Authenticate: Bearer realm=…`).
  2. Client gets an **access token** from the token service (validated by Authority) with `scope=repository:…:pull`.
  3. Registry allows pull for the requested repo.
* Tokens are **short‑lived** (60–300 s) and **DPoP‑bound**.

### 3.3 Offline kits (air‑gapped)

* Tarball per release channel:

  ```
  stellaops-kit--.tar.zst
    /images/     OCI layout with all first-party images (multi-arch)
    /sboms/      CycloneDX JSON+PB for each image
    /attest/     DSSE bundles + Rekor proofs
    /charts/     Helm charts + values templates
    /compose/    docker-compose.yml + .env template
    /plugins/    Concelier/Excititor connectors (restart-time)
    /policy/     example policies
    /manifest/   release.yaml (see §6.1)
  ```
* Import via CLI `offline kit import`; checks digests and signatures before load.

---

## 4) Licensing (PoE) & monetization

**Principle**: **Only paid Stella Ops issues valid signed attestations.** Running the stack is free; signing requires PoE.

### 4.1 PoE issuance

* Customers purchase a plan and obtain a **PoE artifact** from `www.stella-ops.org`:

  * **PoE‑JWT** (DPoP/mTLS‑bound) **or** **PoE mTLS client certificate**.
  * Contains: `license_id`, `plan`, `valid_release_year`, `max_version`, `exp`, optional `tenant/customer` IDs.

### 4.2 Online enforcement

* **Signer** calls **Licensing /license/introspect** on every signing request (see signer doc).
* If **revoked/expired/out‑of‑window** → deny with machine‑readable reason.
* All **valid** bundles are DSSE‑signed and **Attestor** logs them; Rekor UUID returned.
* UI badges: “**Verified by Stella Ops**” with link to the public log.

### 4.3 Air‑gapped / offline

* Customers obtain a **time‑boxed PoE lease** (signed JSON, 7–30 days).
* Signer accepts the lease and emits **provisional** attestations (clearly labeled).
* When connectivity returns, a background job **endorses** the provisional entries with the cloud service, updating their status to **verified**.
* Operators can export a **verification bundle** for auditors even before endorsement (contains DSSE + local Rekor proof + lease snapshot).

### 4.4 Stolen/abused PoE

* Customers report theft; **Licensing** flags `license_id` as **revoked**.
* Subsequent Signer requests **deny**; previous attestations remain but can be marked **contested** (UI shows badge, optional re‑sign path upon new PoE).

---

## 5) Deployment path (customer side)

### 5.1 First install

* **Helm** (Kubernetes) or **Compose** (VMs). Example (K8s):

```bash
helm repo add stellaops https://charts.stella-ops.org
helm install stella stellaops/platform \
  --version 2.4.0 \
  --set global.channel=stable \
  --set authority.issuer=https://authority.stella.local \
  --set scanner.minio.endpoint=http://minio.stella.local:9000 \
  --set scanner.mongo.uri=mongodb://mongo/scanner \
  --set concelier.mongo.uri=mongodb://mongo/concelier \
  --set excititor.mongo.uri=mongodb://mongo/excititor
```

* Post‑install job registers **Authority clients** (Scanner, Signer, Attestor, UI) and prints **bootstrap** URLs and client credentials (sealed secrets).
* UI banner shows **release bundle** and verification state (cosign OK? Rekor OK?).
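
Because §3.1 asks for **pull by digest only** in Kubernetes/Compose manifests, a pre-install check on image references may be useful. The following is a minimal sketch under stated assumptions: the helper names `is_digest_pinned` and `unpinned_images` are hypothetical, and the check only inspects the reference syntax (`@sha256:` followed by 64 hex characters); it does not contact a registry or verify signatures.

```python
import re
from typing import Iterable, List

# A digest-pinned reference ends in "@sha256:" plus 64 lowercase hex characters.
_DIGEST_REF = re.compile(r"[^\s@]+@sha256:[0-9a-f]{64}")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True when the reference is pinned to a full sha256 digest."""
    return _DIGEST_REF.fullmatch(image_ref) is not None


def unpinned_images(image_refs: Iterable[str]) -> List[str]:
    """Return the references that rely on a tag or an abbreviated digest."""
    return [ref for ref in image_refs if not is_digest_pinned(ref)]
```

A tag-only reference such as `mongo:7` would be reported by `unpinned_images`, as would an abbreviated placeholder digest like `sha256:cc..dd`.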

### 5.2 Updates

* **Blue/green**: pull new bundle by **digest**; deploy side‑by‑side; cut traffic.
* **Rolling**: upgrade components in a safe order:

  1. Authority (stateless, dual‑key rotation ready)
  2. Signer/Attestor (same minor)
  3. Scanner WebService & Workers
  4. Concelier, then Excititor (schema migrations are expand/contract)
  5. UI last
* **DB migrations** are **expand/contract**:

  * Phase A (release N): **add** new fields/indexes, write old+new.
  * Phase B (N+1): **read** new fields; **drop** old.
* Rollback is a matter of redeploying previous images and keeping both schemas valid.

### 5.3 Rollback

* Images are referenced by **digest**; keep the release manifests for the previous `K` versions.
* `helm rollback` or compose `docker compose -f release-K.yml up -d`.
* Mongo migrations are additive; **no destructive changes** within a single minor.

---

## 6) Release payloads & manifests

### 6.1 Release manifest (`release.yaml`)

```yaml
release:
  version: "2.4.1"
  channel: "stable"
  date: "2027-06-20T12:00:00Z"
  calendar: "2027.06"
components:
  - name: scanner-webservice
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:aa..bb
    sbom: oci://.../referrers/cdx-json@sha256:11..22
    provenance: oci://.../attest/provenance@sha256:33..44
    signature: { rekorUUID: "…" }
  - name: signer
    image: registry.stella-ops.org/stellaops/signer@sha256:cc..dd
    signature: { rekorUUID: "…" }
charts:
  - name: platform
    version: "2.4.1"
    digest: "sha256:ee..ff"
compose:
  file: "docker-compose.yml"
  digest: "sha256:77..88"
checksums:
  sha256: "… digest of this release.yaml …"
```

The manifest is **cosign‑signed**; UI/CLI can verify a bundle without talking to registries.

> Deployment guardrails – The repository keeps channel-aligned Compose bundles
> in `deploy/compose/` and Helm overlays in `deploy/helm/stellaops/`. Both sets
> pull their digests from `deploy/releases/` and are validated by
> `deploy/tools/validate-profiles.sh` to guarantee lint/dry-run cleanliness.
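
The `compose.digest` and `charts[].digest` fields in the manifest above suggest an offline check: hash the bundled file and compare it with the recorded value. The sketch below assumes the digest is a plain SHA-256 over the file bytes, written as `sha256:` plus 64 hex characters; that encoding is an assumption rather than something the manifest format states, and the function names are hypothetical. Verifying the manifest's own cosign signature is a separate step that is not shown here.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Format a SHA-256 hash the way the manifest example writes digests."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def matches_manifest_digest(data: bytes, expected: str) -> bool:
    """Compare file bytes against a 'sha256:<64 hex>' value from the manifest.

    Assumption: the manifest digest is a plain SHA-256 of the file content.
    Malformed or abbreviated values (for example 'sha256:77..88') never match.
    """
    algo, _, hex_value = expected.partition(":")
    if algo != "sha256" or len(hex_value) != 64:
        return False
    return sha256_digest(data) == expected.lower()
```

A caller would read `docker-compose.yml` from the kit as bytes, pass it with the manifest's `compose.digest`, and refuse to apply the file on a mismatch.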

### 6.2 Image labels (release metadata)

Each image sets OCI labels:

```
org.opencontainers.image.version        = "2.4.1"
org.opencontainers.image.revision       = ""
org.opencontainers.image.created        = "2027-06-20T12:00:00Z"
org.stellaops.release.calendar          = "2027.06"
org.stellaops.release.channel           = "stable"
org.stellaops.build.slsaProvenance      = "oci://…"
```

Signer validates **scanner** image’s cosign identity + calendar tag for **release window** checks.

---

## 7) Artifact lifecycle & storage (RustFS/Mongo)

### 7.1 Buckets & prefixes (RustFS)

```
rustfs://stellaops/
  scanner/
    layers//sbom.cdx.json.zst
    images//inventory.cdx.pb
    images//usage.cdx.pb
    diffs/_/diff.json.zst
    attest/.dsse.json
  concelier/
    json//...
    trivy//...
  excititor/
    exports//...
  attestor/
    dsse/.json
    proof/.json
```

### 7.2 ILM classes

* **`short`**: working artifacts (diffs, queues) — TTL 7–14 days.
* **`default`**: SBOMs & indexes — TTL 90–180 days (configurable).
* **`compliance`**: signed reports & attested exports — retention enforced via RustFS hold or S3 Object Lock (governance/compliance) 1–7 years.

### 7.3 Artifact Lifecycle Controller (ALC)

* A background worker (part of Scanner.WebService) enforces **TTL** and **reference counting**:

  * Artifacts referenced by **reports** or **tickets** are pinned.
* ILM actions logged; UI shows per‑class usage & upcoming purges.

> **Migration note.** Follow `docs/ops/scanner-rustfs-migration.md` when transitioning existing
> MinIO buckets to RustFS. The provided migrator is idempotent and safe to rerun per prefix.

### 7.4 Mongo retention

* **Scanner**: `runtime.events` use TTL (e.g., 30–90 days); **catalog** permanent.
* **Concelier/Excititor**: raw docs keep **last N windows**; canonical stores permanent.
* **Attestor**: `entries` permanent; `dedupe` TTL 24–48h.

### 7.5 Mongo server baseline

* **Minimum supported server:** MongoDB **4.2+**.
  Driver 3.5.0 removes compatibility shims for 4.0; upstream has already announced 4.0 support will be dropped in upcoming C# driver releases.
* **Deploy images:** Compose/Helm defaults stay on `mongo:7.x`. For air-gapped installs, refresh Offline Kit bundles so the packaged `mongod` matches ≥4.2.
* **Upgrade guard:** During rollout, verify replica sets reach FCV `4.2` or above before swapping binaries; automation should hard-stop if FCV is <4.2.

---

## 8) Observability & SLOs (operations)

* **Uptime SLO**: 99.9% for Signer/Authority/Attestor; 99.5% for Scanner WebService; Excititor/Concelier 99.0%.
* **Error budgets**: tracked per month; dashboards show burn rates.
* **Golden signals**:

  * **Latency**: token issuance, sign→attest round‑trip, scan enqueue→emit, export build.
  * **Saturation**: queue depth, Mongo write IOPS, RustFS throughput / queue depth (or S3 metrics when in fallback mode).
  * **Traffic**: scans/min, attestations/min, webhook admits/min.
  * **Errors**: 5xx rates, cosign verification failures, Rekor timeouts.

Prometheus + OTLP; Grafana dashboards ship in the charts.

---

## 9) Security & compliance operations

* **Key rotation**:

  * Authority JWKS: 60‑day cadence, dual‑key overlap.
  * Release signing identities: rotate per minor or quarterly.
  * Sigstore roots mirrored and pinned; alarms on drift.
* **FIPS mode** (Gov build):

  * Enforce `ES256` + KMS/HSM; disable Ed25519; MLS ciphers only.
  * Local **Rekor v2** and **Fulcio** alternatives; **air‑gapped** CA.
* **Vulnerability response**:

  * Concelier red-flag advisories trigger accelerated **stable** patch rollout; UI/CLI “security patch available” notice.
  * 2025-10: Pinned `MongoDB.Driver` **3.5.0** and `SharpCompress` **0.41.0** across services (DEVOPS-SEC-10-301) to eliminate NU1902/NU1903 warnings surfaced during scanner cache/worker test runs; repacked the local `Mongo2Go` feed so test fixtures inherit the patched dependencies; future bumps follow the same central override pattern.
* **Backups/DR**:

  * Mongo nightly snapshots; MinIO versioning + replication (if configured).
  * Restore runbooks tested quarterly with synthetic data.

---

## 10) Customer update flow (how versions are fetched & activated)

### 10.1 Online clusters

* **UI** surfaces update banner with **release manifest** diff and risk notes.
* Operator approves → **Controller** pulls new images by digest; health‑checks; moves traffic; deprecates old revision.
* Post‑switch, **schema Phase B** migrations (if any) run automatically.

### 10.2 Air‑gapped clusters

* Operator downloads **offline kit** from a mirror → `stellaops offline kit import`.
* Controller validates bundle checksums and **cosign signatures**; applies charts/compose by digest.
* After install, **verify** page shows green checks: image sigs, SBOMs attached, provenance logged.

### 10.3 CLI self‑update (optional)

* `stellaops self-update` pulls a **signed release manifest** and verifies the **CLI binary** with cosign before swapping (admin can disable).

---

## 11) Compatibility & deprecation policy

* **APIs** are stable within a **major**; breaking changes imply **MAJOR++** and deprecation period of one minor.
* **Storage**: expand/contract; “drop old fields” only after one minor grace.
* **Config**: feature flags (default off) for risky features (e.g., eBPF).

---

## 12) Runbooks (selected)

### 12.1 Lost PoE

1. Suspend **automatic attestation** jobs.
2. Use CLI `stellaops signer status` to confirm `entitlement_denied`.
3. Obtain new PoE from portal; verify on Signer `/poe/verify`.
4. Re‑enable; optionally **re‑sign** last N reports (UI button → batch).

### 12.2 Rekor outage (self‑hosted)

* Attestor returns `202 (pending)` with queued proof fetch.
* Keep DSSE bundles locally; re‑submit on schedule; UI badge shows **Pending**.
* If outage > SLA, you can switch to a **mirror** log in config; Attestor writes to both when restored.

### 12.3 Emergency downgrade

* Identify prior release manifest (UI → Admin → Releases).
* `helm rollback stella ` (or compose apply previous file).
* Services tolerate skew per §1.3; ensure **Signer/Authority/Attestor** are rolled together.

---

## 13) Example: cluster bootstrap (Compose)

```yaml
version: "3.9"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority@sha256:...
    env_file: ./env/authority.env
    ports: ["8440:8440"]
  signer:
    image: registry.stella-ops.org/stellaops/signer@sha256:...
    depends_on: [authority]
    environment:
      - SIGNER__POE__LICENSING__INTROSPECTURL=https://www.stella-ops.org/api/v1/license/introspect
  attestor:
    image: registry.stella-ops.org/stellaops/attestor@sha256:...
    depends_on: [signer]
  scanner-web:
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:...
    environment:
      - SCANNER__S3__ENDPOINT=http://minio:9000
  scanner-worker:
    image: registry.stella-ops.org/stellaops/scanner-worker@sha256:...
    deploy: { replicas: 4 }
  concelier:
    image: registry.stella-ops.org/stellaops/concelier@sha256:...
  excititor:
    image: registry.stella-ops.org/stellaops/excititor@sha256:...
  web-ui:
    image: registry.stella-ops.org/stellaops/web-ui@sha256:...
  mongo:
    image: mongo:7
  minio:
    image: minio/minio:RELEASE.2025-07-10T00-00-00Z
```

---

## 14) Governance & keys (who owns the trust root)

* **Release key policy**: only the Release Engineering group can push signed releases; 4‑eyes approval; TUF‑style manifest possible in future.
* **Signer acceptance policy**: embedded release identities are updated **only** via minor upgrade; emergency CRL supported.
* **Customer keys**: none needed for core use; enterprise add‑ons may require per‑customer registries and keys.

---

## 15) Roadmap (Ops)

* **Windows containers GA** (Scanner + Zastava).
* **Key Transparency** for Signer certs.
* **Delta‑kit** (offline) for incremental updates.
* **Operator CRDs** (K8s) to manage policy and ILM declaratively.
* **SBOM protobuf** as default transport at rest (smaller, faster).

---

### Appendix A — Minimal SLO monitors

* `authority.tokens_issued_total` slope ≈ normal.
* `signer.requests_total{result="success"}/minute` > 0 (when scans occur).
* `attestor.submit_latency_seconds{quantile=0.95}` < 0.3.
* `scanner.scan_latency_seconds{quantile=0.95}` < target per image size.
* `concelier.export.duration_seconds` stable; `excititor.consensus.conflicts_total` not exploding after policy changes.
* RustFS request error rate near zero (or `s3_requests_errors_total` when operating against S3); Mongo `opcounters` hit expected baseline.

### Appendix B — Upgrade safety checklist

* Verify **release manifest** signature.
* Ensure **Signer/Authority/Attestor** are same minor.
* Verify **DB backups** < 24h old.
* Confirm **ILM** won’t purge compliance artifacts during upgrade window.
* Roll **one component** at a time; watch SLOs; abort on regression.

---

**End — component_architecture_devops.md**