# component_architecture_devops.md — **Stella Ops Release & Operations** (2025Q4)

> **Scope.** Implementation‑ready blueprint for **how Stella Ops is built, versioned, signed, distributed, upgraded, licensed (PoE)**, and operated in customer environments (online and air‑gapped). Covers reproducible builds, supply‑chain attestations, registries, offline kits, migration/rollback, artifact lifecycle (MinIO/Mongo), monitoring SLOs, and customer activation.

---

## 0) Product vision (operations lens)

Stella Ops must be **trustable at a glance** and **boringly operable**:

* Every release ships with **first‑party SBOMs, provenance, and signatures**; services verify **each other’s** integrity at runtime.
* Customers can deploy by **digest** and stay aligned with **LTS/stable/edge** channels.
* Paid customers receive **attestation authority** (Signer accepts their PoE) while the core platform remains **free to run**.
* Air‑gapped customers receive **offline kits** with verifiable digests and deterministic import.
* Artifacts expire predictably; operators know what’s kept, for how long, and why.

---

## 1) Release trains & versioning

### 1.1 Channels

* **LTS** (12‑month support window): quarterly cadence (Q1/Q2/Q3/Q4).
* **Stable** (default): monthly rollup (bug fixes + compatible features).
* **Edge**: weekly; for early adopters, no guarantees.

### 1.2 Version strings

Semantic core + calendar tag:

```
<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)     e.g., 2.4.1 (2027.06)
```

* **MAJOR**: breaking API/DB changes (rare).
* **MINOR**: new features, compatible schema migrations (expand/contract pattern).
* **PATCH**: bug fixes, performance and security updates.
* **Calendar tag** exposes the **release year** that Signer uses for **PoE window checks**.
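
The version format above can be parsed mechanically. The following is a minimal sketch, assuming the string always has exactly the shape shown in the example; the `ReleaseVersion` type and `parse_release_version` function are hypothetical names for illustration, not part of any Stella Ops API.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    year: int   # calendar tag year
    month: int  # calendar tag month


# "<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)", e.g. "2.4.1 (2027.06)"
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\s+\((\d{4})\.(\d{2})\)$")


def parse_release_version(text: str) -> ReleaseVersion:
    """Split a release string into its semantic core and calendar tag."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a release version string: {text!r}")
    version = ReleaseVersion(*(int(group) for group in match.groups()))
    if not 1 <= version.month <= 12:
        raise ValueError(f"calendar month out of range: {text!r}")
    return version
```

With this sketch, `parse_release_version("2.4.1 (2027.06)").year` yields the release year that a PoE window check would compare against.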

### 1.3 Component alignment

A release is a **bundle** of image digests + charts + manifests. All services in a bundle are **wire‑compatible**. Mixed minor versions are allowed within a bounded skew:

* **Web UI ↔ backend**: `±1 minor`.
* **Scanner ↔ Policy/Excititor/Concelier**: `±1 minor`.
* **Authority/Signer/Attestor triangle**: **must** be same minor (crypto and DPoP/mTLS binding rules).

At startup, services **self‑advertise** their semver & channel; the UI surfaces **mismatch warnings**.
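
A mismatch check over the self‑advertised versions could look like the sketch below. It is a simplification under stated assumptions: the `±1 minor` rule is applied to the whole bundle (the spread between the lowest and highest minor), services are identified by hypothetical lowercase names, and versions are plain `MAJOR.MINOR.PATCH` strings.

```python
# Services that must run the same minor (assumed lowercase service names).
STRICT_GROUP = {"authority", "signer", "attestor"}


def skew_violations(versions: dict[str, str]) -> list[str]:
    """Return human-readable skew problems for a {service: semver} mapping."""
    parsed = {
        name: tuple(int(part) for part in version.split("."))
        for name, version in versions.items()
    }
    problems = []

    majors = {semver[0] for semver in parsed.values()}
    if len(majors) > 1:
        problems.append("mixed major versions in one bundle")

    strict_minors = {parsed[name][1] for name in STRICT_GROUP if name in parsed}
    if len(strict_minors) > 1:
        problems.append("authority/signer/attestor must share one minor")

    minors = [semver[1] for semver in parsed.values()]
    if minors and max(minors) - min(minors) > 1:
        problems.append("minor skew exceeds ±1")

    return problems
```

An empty list means the bundle is within the allowed skew; a UI could render each returned string as a warning.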

---

## 2) Supply‑chain pipeline (how a release is built)

### 2.1 Deterministic builds

* **Builders**: isolated **BuildKit** workers with pinned base images (digest only).
* **Pinning**: lock files and version manifests (`go.mod`, `package-lock.json`, `global.json`, `Directory.Packages.props`) are **frozen** at tag.
* **Reproducibility**: timestamps normalized via `SOURCE_DATE_EPOCH`; deterministic zips/tars.
* **Multi‑arch**: linux/amd64 + linux/arm64 (Windows images track M2 roadmap).
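
To illustrate what "deterministic tars" means in practice, here is a minimal sketch using only the Python standard library. It assumes the archive content is available in memory and that the build exports `SOURCE_DATE_EPOCH`; the function name is hypothetical and this is not the actual Stella Ops build tooling.

```python
import io
import os
import tarfile


def deterministic_tar(files: dict[str, bytes], epoch: int) -> bytes:
    """Build a tar whose bytes depend only on file names, contents and epoch."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):            # stable member order
            data = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = epoch                # normalized timestamp
            info.mode = 0o644                 # fixed permissions
            info.uid = info.gid = 0           # no builder-specific ownership
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Fall back to 0 when the variable is not set (assumption for this sketch).
BUILD_EPOCH = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))
```

Two builders that feed the same files and the same epoch into this function produce byte‑identical archives, so their digests match regardless of insertion order or wall‑clock time.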

### 2.2 First‑party SBOMs & provenance

* Each image gets a **CycloneDX (JSON+Protobuf) SBOM** and **SLSA‑style provenance** attached as **OCI referrers**.
* Scanner’s **Buildx generator** is used to produce SBOMs *during* build; a separate post‑build scan verifies parity (any drift is a red flag).
* **Release manifest** (see §6.1) lists all digests and SBOM/attestation refs.

### 2.3 Signing & transparency

* Images are **cosign‑signed** (keyless) with a Stella Ops release identity; inclusion in a **transparency log** (Rekor) is required.
* SBOM and provenance attestations are **DSSE** and also transparency‑logged.
* Release keys (Fulcio roots or public keys) are embedded in **Signer** policy (for **scanner‑release validation** on the customer side).

### 2.4 Gates & tests

* **Static**: linters, codegen checks, protobuf API freeze (backward‑compat tests).
* **Unit/integration**: per‑component, plus **end‑to‑end** flows (scan→vex→policy→sign→attest).
* **Perf SLOs**: hot paths (SBOM compose, diff, export) measured against budgets.
* **Security**: dependency audit vs Concelier export; container hardening tests; minimal caps.
* **Canary cohort**: internal staging + selected customers; one week on **edge** before **stable** tag.

---

## 3) Distribution & activation

### 3.1 Registries

* **Primary**: `registry.stella-ops.org` (OCI v2, supports Referrers API).
* **Mirrors**: GHCR (read‑only), regional mirrors for latency.
* **Pull by digest only** in Kubernetes/Compose manifests.

**Gating policy**:

* **Core images** (Authority, Scanner, Concelier, Excititor, Attestor, UI): public **read**.
* **Enterprise add‑ons** (if any) and **pre‑release**: private repos via OAuth2 token service.

> Monetization lever is **signing** (PoE gate), not image pulls, so the core remains simple to consume.

### 3.2 OAuth2 token service (for private repos)

* Docker Registry’s token flow backed by **Authority**:

  1. Client hits the registry and receives `401` with `WWW-Authenticate: Bearer realm=…`.
  2. Client gets an **access token** from the token service (validated by Authority) with `scope=repository:…:pull`.
  3. Registry allows the pull for the requested repo.
* Tokens are **short‑lived** (60–300 s) and **DPoP‑bound**.
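
Steps 1 and 2 of the flow above can be sketched as follows: the client parses the `Bearer` challenge and derives the token request URL from it. The host names in the usage note are placeholders, the helper names are hypothetical, and the DPoP binding is deliberately left out of this sketch.

```python
import re
from urllib.parse import urlencode

# key="value" pairs inside a WWW-Authenticate challenge
_PARAM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_bearer_challenge(header: str) -> dict[str, str]:
    """Extract realm/service/scope from a 'Bearer ...' challenge header."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError(f"unsupported auth scheme: {scheme!r}")
    return dict(_PARAM_RE.findall(params))


def token_request_url(challenge: dict[str, str]) -> str:
    """Build the GET URL the client calls on the token service (the realm)."""
    query = {key: challenge[key] for key in ("service", "scope") if key in challenge}
    return f"{challenge['realm']}?{urlencode(query)}"
```

For a challenge such as `Bearer realm="https://auth.example.test/token",service="registry.example.test",scope="repository:stellaops/addon:pull"`, the client would request the realm URL with `service` and `scope` as query parameters and then retry the pull with the returned token.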

### 3.3 Offline kits (air‑gapped)

* Tarball per release channel:

```
stellaops-kit-<ver>-<channel>.tar.zst
  /images/         OCI layout with all first-party images (multi-arch)
  /sboms/          CycloneDX JSON+PB for each image
  /attest/         DSSE bundles + Rekor proofs
  /charts/         Helm charts + values templates
  /compose/        docker-compose.yml + .env template
  /plugins/        Concelier/Excititor connectors (restart-time)
  /policy/         example policies
  /manifest/       release.yaml (see §6.1)
```

* Import via CLI `stellaops offline kit import`; it checks digests and signatures before load.
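
The digest check that precedes the load can be pictured with the sketch below. It assumes the expected digests are available as a mapping from kit‑relative path to `sha256:<hex>`; how the real importer derives that mapping from the kit is not specified here, and signature verification is out of scope for this example. Both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return 'sha256:<hex>'."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def failed_entries(kit_root: Path, expected: dict[str, str]) -> list[str]:
    """Return kit-relative paths that are missing or whose digest differs."""
    failures = []
    for relative_path, want in sorted(expected.items()):
        target = kit_root / relative_path
        if not target.is_file() or sha256_file(target) != want:
            failures.append(relative_path)
    return failures
```

An importer following this pattern would refuse to load anything unless `failed_entries(...)` returns an empty list.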

---

## 4) Licensing (PoE) & monetization

**Principle**: **Only a paid Stella Ops deployment can issue valid signed attestations.** Running the stack is free; signing requires PoE.

### 4.1 PoE issuance

* Customers purchase a plan and obtain a **PoE artifact** from `www.stella-ops.org`:

  * **PoE‑JWT** (DPoP/mTLS‑bound) **or** **PoE mTLS client certificate**.
  * Contains: `license_id`, `plan`, `valid_release_year`, `max_version`, `exp`, optional `tenant/customer` IDs.

### 4.2 Online enforcement

* **Signer** calls the **Licensing** endpoint `/license/introspect` on every signing request (see signer doc).
* If the PoE is **revoked/expired/out‑of‑window** → deny with a machine‑readable reason.
* All **valid** bundles are DSSE‑signed and **Attestor** logs them; the Rekor UUID is returned.
* UI badges: “**Verified by Stella Ops**” with link to the public log.
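
The deny path can be sketched as a pure function over the PoE claims listed in §4.1. The semantics used here are assumptions, not confirmed by this document: `valid_release_year` is treated as the latest release year covered, `max_version` as a full `MAJOR.MINOR.PATCH` upper bound, and `exp` as a Unix timestamp. The reason strings and the function name are hypothetical.

```python
from typing import Optional


def _semver(text: str) -> tuple[int, ...]:
    return tuple(int(part) for part in text.split("."))


def poe_denial_reason(claims: dict, release_version: str, release_calendar: str,
                      now: int, revoked: bool = False) -> Optional[str]:
    """Return a machine-readable denial reason, or None if signing may proceed."""
    if revoked:
        return "revoked"
    if now >= claims["exp"]:
        return "expired"
    release_year = int(release_calendar.split(".")[0])   # "2027.06" -> 2027
    if release_year > claims["valid_release_year"]:       # assumed semantics
        return "out_of_window"
    if _semver(release_version) > _semver(claims["max_version"]):
        return "out_of_window"
    return None
```

Keeping the decision in one side‑effect‑free function makes it easy to unit test each denial reason separately from the introspection call.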

### 4.3 Air‑gapped / offline

* Customers obtain a **time‑boxed PoE lease** (signed JSON, 7–30 days).
* Signer accepts the lease and emits **provisional** attestations (clearly labeled).
* When connectivity returns, a background job **endorses** the provisional entries with the cloud service, updating their status to **verified**.
* Operators can export a **verification bundle** for auditors even before endorsement (contains DSSE + local Rekor proof + lease snapshot).

### 4.4 Stolen/abused PoE

* Customers report theft; **Licensing** flags the `license_id` as **revoked**.
* Subsequent Signer requests are **denied**; previous attestations remain but can be marked **contested** (UI shows a badge, with an optional re‑sign path once a new PoE is issued).

---

## 5) Deployment path (customer side)

### 5.1 First install

* **Helm** (Kubernetes) or **Compose** (VMs). Example (K8s):

```bash
helm repo add stellaops https://charts.stella-ops.org
helm install stella stellaops/platform \
  --version 2.4.0 \
  --set global.channel=stable \
  --set authority.issuer=https://authority.stella.local \
  --set scanner.minio.endpoint=http://minio.stella.local:9000 \
  --set scanner.mongo.uri=mongodb://mongo/scanner \
  --set concelier.mongo.uri=mongodb://mongo/concelier \
  --set excititor.mongo.uri=mongodb://mongo/excititor
```

* Post‑install job registers **Authority clients** (Scanner, Signer, Attestor, UI) and prints **bootstrap** URLs and client credentials (sealed secrets).
* UI banner shows **release bundle** and verification state (cosign OK? Rekor OK?).

### 5.2 Updates

* **Blue/green**: pull new bundle by **digest**; deploy side‑by‑side; cut traffic.

* **Rolling**: upgrade components in a safe order:

  1. Authority (stateless, dual‑key rotation ready)
  2. Signer/Attestor (same minor)
  3. Scanner WebService & Workers
  4. Concelier, then Excititor (schema migrations are expand/contract)
  5. UI last

* **DB migrations** are **expand/contract**:

  * Phase A (release N): **add** new fields/indexes, write old+new.
  * Phase B (N+1): **read** new fields; **drop** old.
* Rollback means redeploying the previous images; it stays safe as long as both schemas remain valid, that is, until Phase B drops the old fields.
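
The expand/contract phases can be illustrated on a single document. This is a sketch with plain dictionaries standing in for Mongo documents; the field names `sev` (old) and `severity` (new) are invented for the example.

```python
def write_phase_a(doc: dict, severity: str) -> dict:
    """Release N: dual-write so both old and new readers keep working."""
    updated = dict(doc)
    updated["sev"] = severity        # old field, still read by release N-1
    updated["severity"] = severity   # new field, read from release N+1 on
    return updated


def read_compat(doc: dict):
    """Reader that tolerates both schemas: prefer new, fall back to old."""
    return doc.get("severity", doc.get("sev"))


def contract_phase_b(doc: dict) -> dict:
    """Release N+1: drop the old field once nothing reads it anymore."""
    updated = dict(doc)
    updated.pop("sev", None)
    return updated
```

A document written during Phase A is readable by the previous release (via `sev`) and by the next one (via `severity`), which is what makes an image rollback safe before the contract step runs.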

### 5.3 Rollback

* Images are referenced by **digest**; keep the release manifests of the previous `K` versions.
* `helm rollback` or, for Compose, `docker compose -f release-K.yml up -d`.
* Mongo migrations are additive; **no destructive changes** within a single minor.

---

## 6) Release payloads & manifests

### 6.1 Release manifest (`release.yaml`)

```yaml
release:
  version: "2.4.1"
  channel: "stable"
  date: "2027-06-20T12:00:00Z"
  calendar: "2027.06"
components:
  - name: scanner-webservice
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:aa..bb
    sbom: oci://.../referrers/cdx-json@sha256:11..22
    provenance: oci://.../attest/provenance@sha256:33..44
    signature: { rekorUUID: "…" }
  - name: signer
    image: registry.stella-ops.org/stellaops/signer@sha256:cc..dd
    signature: { rekorUUID: "…" }
charts:
  - name: platform
    version: "2.4.1"
    digest: "sha256:ee..ff"
compose:
  file: "docker-compose.yml"
  digest: "sha256:77..88"
checksums:
  sha256: "… digest of this release.yaml …"
```

The manifest is **cosign‑signed**; UI/CLI can verify a bundle without talking to registries.
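
One cheap offline check on such a manifest is that every component image is pinned by digest rather than by tag. The sketch below assumes the YAML has already been parsed into a dictionary with the shape shown above and that real digests are full 64‑character hex strings (the manifest example abbreviates them); the function name is hypothetical.

```python
import re

# "<repository>@sha256:<64 hex chars>", no tag-only references allowed
_DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def unpinned_components(manifest: dict) -> list[str]:
    """Return names of components whose image is not pinned by digest."""
    return [
        component["name"]
        for component in manifest.get("components", [])
        if not _DIGEST_REF.match(component.get("image", ""))
    ]
```

A release gate or the CLI could reject the bundle whenever this list is non‑empty, before any signature verification is attempted.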

### 6.2 Image labels (release metadata)

Each image sets OCI labels:

```
org.opencontainers.image.version        = "2.4.1"
org.opencontainers.image.revision       = "<git sha>"
org.opencontainers.image.created        = "2027-06-20T12:00:00Z"
org.stellaops.release.calendar          = "2027.06"
org.stellaops.release.channel           = "stable"
org.stellaops.build.slsaProvenance      = "oci://…"
```

Signer validates the **scanner** image’s cosign identity and calendar tag for **release window** checks.

---

## 7) Artifact lifecycle & storage (MinIO/Mongo)

### 7.1 Buckets & prefixes (MinIO)

```
s3://stellaops/
  scanner/
    layers/<sha256>/sbom.cdx.json.zst
    images/<imgDigest>/inventory.cdx.pb
    images/<imgDigest>/usage.cdx.pb
    diffs/<old>_<new>/diff.json.zst
    attest/<artifactSha256>.dsse.json
  concelier/
    json/<exportId>/...
    trivy/<exportId>/...
  excititor/
    exports/<exportId>/...
  attestor/
    dsse/<bundleSha256>.json
    proof/<rekorUuid>.json
```

### 7.2 ILM classes

* **`short`**: working artifacts (diffs, queues); TTL 7–14 days.
* **`default`**: SBOMs & indexes; TTL 90–180 days (configurable).
* **`compliance`**: signed reports & attested exports; **Object Lock** (governance/compliance) for 1–7 years.

### 7.3 Artifact Lifecycle Controller (ALC)

* A background worker (part of Scanner.WebService) enforces **TTL** and **reference counting**:

  * Artifacts referenced by **reports** or **tickets** are pinned.
  * ILM actions are logged; the UI shows per‑class usage & upcoming purges.
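
The purge decision combines the ILM class, the artifact age, and the reference count. The following sketch assumes the upper TTL bounds from §7.2 as fixed values and leaves `compliance` objects entirely to Object Lock; the names are hypothetical and the real controller may weigh these inputs differently.

```python
from datetime import datetime, timedelta, timezone

# Upper TTL bounds from the ILM classes above (assumed, normally configurable).
TTL_DAYS = {"short": 14, "default": 180}


def is_purgeable(ilm_class: str, created_at: datetime, ref_count: int,
                 now: datetime) -> bool:
    """Decide whether the lifecycle worker may delete an artifact."""
    if ilm_class == "compliance":
        return False                    # retention is governed by Object Lock
    if ref_count > 0:
        return False                    # pinned by a report or ticket
    return now - created_at > timedelta(days=TTL_DAYS[ilm_class])
```

Because pinning is checked before age, an old diff that a report still references survives until the reference is released.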

### 7.4 Mongo retention

* **Scanner**: `runtime.events` use TTL (e.g., 30–90 days); **catalog** permanent.
* **Concelier/Excititor**: raw docs keep **last N windows**; canonical stores permanent.
* **Attestor**: `entries` permanent; `dedupe` TTL 24–48h.

---

## 8) Observability & SLOs (operations)

* **Uptime SLO**: 99.9% for Signer/Authority/Attestor; 99.5% for Scanner WebService; 99.0% for Excititor/Concelier.
* **Error budgets**: tracked per month; dashboards show burn rates.
* **Golden signals**:

  * **Latency**: token issuance, sign→attest round‑trip, scan enqueue→emit, export build.
  * **Saturation**: queue depth, Mongo write IOPS, MinIO net throughput.
  * **Traffic**: scans/min, attestations/min, webhook admits/min.
  * **Errors**: 5xx rates, cosign verification failures, Rekor timeouts.

Prometheus + OTLP; Grafana dashboards ship in the charts.
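
To make the uptime targets concrete: over a 30‑day month (43,200 minutes), a 99.9% SLO leaves about 43.2 minutes of error budget, 99.5% leaves 216 minutes, and 99.0% leaves 432 minutes. The sketch below reproduces that arithmetic and a simple burn‑rate figure; the 30‑day window and the function names are assumptions made for illustration.

```python
def monthly_error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return days * 24 * 60 * (1 - slo_percent / 100)


def burn_rate(bad_minutes: float, elapsed_minutes: float, slo_percent: float) -> float:
    """Observed error fraction divided by the fraction the SLO allows.

    1.0 means the budget is consumed exactly at the sustainable pace;
    values above 1.0 exhaust it before the window ends.
    """
    allowed_fraction = 1 - slo_percent / 100
    return (bad_minutes / elapsed_minutes) / allowed_fraction
```

For example, 4.32 bad minutes within one day against a 99.9% target corresponds to a burn rate of roughly 3, meaning the monthly budget would be gone in about ten days at that pace.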

---

## 9) Security & compliance operations

* **Key rotation**:

  * Authority JWKS: 60‑day cadence, dual‑key overlap.
  * Release signing identities: rotate per minor or quarterly.
  * Sigstore roots mirrored and pinned; alarms on drift.

* **FIPS mode** (Gov build):

  * Enforce `ES256` + KMS/HSM; disable Ed25519; FIPS‑approved TLS ciphers only.
  * Local **Rekor v2** and **Fulcio** alternatives; **air‑gapped** CA.

* **Vulnerability response**:

  * Concelier red‑flag advisories trigger accelerated **stable** patch rollout; UI/CLI “security patch available” notice.

* **Backups/DR**:

  * Mongo nightly snapshots; MinIO versioning + replication (if configured).
  * Restore runbooks tested quarterly with synthetic data.

---

## 10) Customer update flow (how versions are fetched & activated)

### 10.1 Online clusters

* **UI** surfaces an update banner with the **release manifest** diff and risk notes.
* Operator approves → **Controller** pulls new images by digest, runs health checks, moves traffic, and retires the old revision.
* Post‑switch, **schema Phase B** migrations (if any) run automatically.

### 10.2 Air‑gapped clusters

* Operator downloads the **offline kit** from a mirror → `stellaops offline kit import`.
* Controller validates bundle checksums and **cosign signatures**; applies charts/compose by digest.
* After install, the **verify** page shows green checks: image sigs, SBOMs attached, provenance logged.

### 10.3 CLI self‑update (optional)

* `stellaops self-update` pulls a **signed release manifest** and verifies the **CLI binary** with cosign before swapping (admin can disable).

---

## 11) Compatibility & deprecation policy

* **APIs** are stable within a **major**; breaking changes require a **MAJOR** bump and a deprecation period of one minor.
* **Storage**: expand/contract; “drop old fields” only after one minor grace.
* **Config**: feature flags (default off) for risky features (e.g., eBPF).

---

## 12) Runbooks (selected)

### 12.1 Lost PoE

1. Suspend **automatic attestation** jobs.
2. Use CLI `stellaops signer status` to confirm `entitlement_denied`.
3. Obtain a new PoE from the portal; verify it on Signer `/poe/verify`.
4. Re‑enable; optionally **re‑sign** last N reports (UI button → batch).

### 12.2 Rekor outage (self‑hosted)

* Attestor returns `202 (pending)` with queued proof fetch.
* Keep DSSE bundles locally; re‑submit on schedule; UI badge shows **Pending**.
* If the outage exceeds the SLA, you can switch to a **mirror** log in config; Attestor writes to both when the primary is restored.

### 12.3 Emergency downgrade

* Identify prior release manifest (UI → Admin → Releases).
* `helm rollback stella <revision>` (or compose apply previous file).
* Services tolerate skew per §1.3; ensure **Signer/Authority/Attestor** are rolled together.

---

## 13) Example: cluster bootstrap (Compose)

```yaml
version: "3.9"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority@sha256:...
    env_file: ./env/authority.env
    ports: ["8440:8440"]
  signer:
    image: registry.stella-ops.org/stellaops/signer@sha256:...
    depends_on: [authority]
    environment:
      - SIGNER__POE__LICENSING__INTROSPECTURL=https://www.stella-ops.org/api/v1/license/introspect
  attestor:
    image: registry.stella-ops.org/stellaops/attestor@sha256:...
    depends_on: [signer]
  scanner-web:
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:...
    environment:
      - SCANNER__S3__ENDPOINT=http://minio:9000
  scanner-worker:
    image: registry.stella-ops.org/stellaops/scanner-worker@sha256:...
    deploy: { replicas: 4 }
  concelier:
    image: registry.stella-ops.org/stellaops/concelier@sha256:...
  excititor:
    image: registry.stella-ops.org/stellaops/excititor@sha256:...
  web-ui:
    image: registry.stella-ops.org/stellaops/web-ui@sha256:...
  mongo:
    image: mongo:7
  minio:
    image: minio/minio:RELEASE.2025-07-10T00-00-00Z
    # The MinIO image needs an explicit server command and data path.
    command: server /data
```

---

## 14) Governance & keys (who owns the trust root)

* **Release key policy**: only the Release Engineering group can push signed releases; 4‑eyes approval; TUF‑style manifest possible in future.
* **Signer acceptance policy**: embedded release identities are updated **only** via minor upgrade; emergency CRL supported.
* **Customer keys**: none needed for core use; enterprise add‑ons may require per‑customer registries and keys.

---

## 15) Roadmap (Ops)

* **Windows containers GA** (Scanner + Zastava).
* **Key Transparency** for Signer certs.
* **Delta‑kit** (offline) for incremental updates.
* **Operator CRDs** (K8s) to manage policy and ILM declaratively.
* **SBOM protobuf** as default transport at rest (smaller, faster).

---

### Appendix A — Minimal SLO monitors

* `authority.tokens_issued_total` rate stays near its normal baseline.
* `signer.requests_total{result="success"}` per minute > 0 (when scans occur).
* `attestor.submit_latency_seconds{quantile="0.95"}` < 0.3.
* `scanner.scan_latency_seconds{quantile="0.95"}` < target per image size.
* `concelier.export.duration_seconds` stable; `excititor.consensus.conflicts_total` not spiking after policy changes.
* MinIO `s3_requests_errors_total` near zero; Mongo `opcounters` at the expected baseline.

### Appendix B — Upgrade safety checklist

* Verify the **release manifest** signature.
* Ensure **Signer/Authority/Attestor** are on the same minor.
* Verify **DB backups** are less than 24 h old.
* Confirm **ILM** won’t purge compliance artifacts during the upgrade window.
* Roll **one component** at a time; watch SLOs; abort on regression.

---

**End — component_architecture_devops.md**