Rewrite architecture docs and add Vexer connector template

This commit is contained in:
2025-10-17 19:34:43 +03:00
parent 29a7d51e41
commit fbd1826ef3
25 changed files with 4885 additions and 777 deletions

462
docs/ARCHITECTURE_DEVOPS.md Normal file
View File

@@ -0,0 +1,462 @@
# component_architecture_devops.md — **StellaOps Release & Operations** (2025Q4)
> **Scope.** Implementationready blueprint for **how StellaOps is built, versioned, signed, distributed, upgraded, licensed (PoE)**, and operated in customer environments (online and airgapped). Covers reproducible builds, supplychain attestations, registries, offline kits, migration/rollback, artifact lifecycle (MinIO/Mongo), monitoring SLOs, and customer activation.
---
## 0) Product vision (operations lens)
StellaOps must be **trustable at a glance** and **boringly operable**:
* Every release ships with **firstparty SBOMs, provenance, and signatures**; services verify **each others** integrity at runtime.
* Customers can deploy by **digest** and stay aligned with **LTS/stable/edge** channels.
* Paid customers receive **attestation authority** (Signer accepts their PoE) while the core platform remains **free to run**.
* Airgapped customers receive **offline kits** with verifiable digests and deterministic import.
* Artifacts expire predictably; operators know whats kept, for how long, and why.
---
## 1) Release trains & versioning
### 1.1 Channels
* **LTS** (12month support window): quarterly cadence (Q1/Q2/Q3/Q4).
* **Stable** (default): monthly rollup (bug fixes + compatible features).
* **Edge**: weekly; for early adopters, no guarantees.
### 1.2 Version strings
Semantic core + calendar tag:
```
<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>) e.g., 2.4.1 (2027.06)
```
* **MAJOR**: breaking API/DB changes (rare).
* **MINOR**: new features, compatible schema migrations (expand/contract pattern).
* **PATCH**: bug fixes, perf and security updates.
* **Calendar tag** exposes **release year** used by Signer for **PoE window checks**.
### 1.3 Component alignment
A release is a **bundle** of image digests + charts + manifests. All services in a bundle are **wirecompatible**. Mixed minor versions are allowed within a bounded skew:
* **Web UI ↔ backend**: `±1 minor`.
* **Scanner ↔ Policy/Vexer/Feedser**: `±1 minor`.
* **Authority/Signer/Attestor triangle**: **must** be same minor (crypto and DPoP/mTLS binding rules).
At startup, services **selfadvertise** their semver & channel; the UI surfaces **mismatch warnings**.
---
## 2) Supplychain pipeline (how a release is built)
### 2.1 Deterministic builds
* **Builders**: isolated **BuildKit** workers with pinned base images (digest only).
* **Pinning**: lock files or `go.mod`, `package-lock.json`, `global.json`, `Directory.Packages.props` are **frozen** at tag.
* **Reproducibility**: timestamps normalized; source date epoch; deterministic zips/tars.
* **Multiarch**: linux/amd64 + linux/arm64 (Windows images track M2 roadmap).
### 2.2 Firstparty SBOMs & provenance
* Each image gets **CycloneDX (JSON+Protobuf) SBOM** and **SLSAstyle provenance** attached as **OCI referrers**.
* Scanners **Buildx generator** is used to produce SBOMs *during* build; a separate postbuild scan verifies parity (red flag if drift).
* **Release manifest** (see §6.1) lists all digests and SBOM/attestation refs.
### 2.3 Signing & transparency
* Images are **cosignsigned** (keyless) with a StellaOps release identity; inclusion in a **transparency log** (Rekor) is required.
* SBOM and provenance attestations are **DSSE** and also transparencylogged.
* Release keys (Fulcio roots or public keys) are embedded in **Signer** policy (for **scannerrelease validation** at customer side).
### 2.4 Gates & tests
* **Static**: linters, codegen checks, protobuf API freeze (backwardcompat tests).
* **Unit/integration**: percomponent, plus **endtoend** flows (scan→vex→policy→sign→attest).
* **Perf SLOs**: hot paths (SBOM compose, diff, export) measured against budgets.
* **Security**: dependency audit vs Feedser export; container hardening tests; minimal caps.
* **Canary cohort**: internal staging + selected customers; one week on **edge** before **stable** tag.
---
## 3) Distribution & activation
### 3.1 Registries
* **Primary**: `registry.stella-ops.org` (OCI v2, supports Referrers API).
* **Mirrors**: GHCR (readonly), regional mirrors for latency.
* **Pull by digest only** in Kubernetes/Compose manifests.
**Gating policy**:
* **Core images** (Authority, Scanner, Feedser, Vexer, Attestor, UI): public **read**.
* **Enterprise addons** (if any) and **prerelease**: private repos via OAuth2 token service.
> Monetization lever is **signing** (PoE gate), not image pulls, so the core remains simple to consume.
### 3.2 OAuth2 token service (for private repos)
* Docker Registrys token flow backed by **Authority**:
1. Client hits registry (`401` with `WWW-Authenticate: Bearer realm=…`).
2. Client gets an **access token** from the token service (validated by Authority) with `scope=repository:…:pull`.
3. Registry allows pull for the requested repo.
* Tokens are **shortlived** (60300s) and **DPoPbound**.
### 3.3 Offline kits (airgapped)
* Tarball per release channel:
```
stellaops-kit-<ver>-<channel>.tar.zst
/images/ OCI layout with all first-party images (multi-arch)
/sboms/ CycloneDX JSON+PB for each image
/attest/ DSSE bundles + Rekor proofs
/charts/ Helm charts + values templates
/compose/ docker-compose.yml + .env template
/plugins/ Feedser/Vexer connectors (restart-time)
/policy/ example policies
/manifest/ release.yaml (see §6.1)
```
* Import via CLI `offline kit import`; checks digests and signatures before load.
---
## 4) Licensing (PoE) & monetization
**Principle**: **Only paid StellaOps issues valid signed attestations.** Running the stack is free; signing requires PoE.
### 4.1 PoE issuance
* Customers purchase a plan and obtain a **PoE artifact** from `www.stella-ops.org`:
* **PoEJWT** (DPoP/mTLSbound) **or** **PoE mTLS client certificate**.
* Contains: `license_id`, `plan`, `valid_release_year`, `max_version`, `exp`, optional `tenant/customer` IDs.
### 4.2 Online enforcement
* **Signer** calls **Licensing /license/introspect** on every signing request (see signer doc).
* If **revoked/expired/outofwindow** → deny with machinereadable reason.
* All **valid** bundles are DSSEsigned and **Attestor** logs them; Rekor UUID returned.
* UI badges: “**Verified by StellaOps**” with link to the public log.
### 4.3 Airgapped / offline
* Customers obtain a **timeboxed PoE lease** (signed JSON, 730 days).
* Signer accepts the lease and emits **provisional** attestations (clearly labeled).
* When connectivity returns, a background job **endorses** the provisional entries with the cloud service, updating their status to **verified**.
* Operators can export a **verification bundle** for auditors even before endorsement (contains DSSE + local Rekor proof + lease snapshot).
### 4.4 Stolen/abused PoE
* Customers report theft; **Licensing** flags `license_id` as **revoked**.
* Subsequent Signer requests **deny**; previous attestations remain but can be marked **contested** (UI shows badge, optional resign path upon new PoE).
---
## 5) Deployment path (customer side)
### 5.1 First install
* **Helm** (Kubernetes) or **Compose** (VMs). Example (K8s):
```bash
helm repo add stellaops https://charts.stella-ops.org
helm install stella stellaops/platform \
--version 2.4.0 \
--set global.channel=stable \
--set authority.issuer=https://authority.stella.local \
--set scanner.minio.endpoint=http://minio.stella.local:9000 \
--set scanner.mongo.uri=mongodb://mongo/scanner \
--set feedser.mongo.uri=mongodb://mongo/feedser \
--set vexer.mongo.uri=mongodb://mongo/vexer
```
* Postinstall job registers **Authority clients** (Scanner, Signer, Attestor, UI) and prints **bootstrap** URLs and client credentials (sealed secrets).
* UI banner shows **release bundle** and verification state (cosign OK? Rekor OK?).
### 5.2 Updates
* **Blue/green**: pull new bundle by **digest**; deploy sidebyside; cut traffic.
* **Rolling**: upgrade stateful components in safe order:
1. Authority (stateless, dualkey rotation ready)
2. Signer/Attestor (same minor)
3. Scanner WebService & Workers
4. Feedser, then Vexer (schema migrations are expand/contract)
5. UI last
* **DB migrations** are **expand/contract**:
* Phase A (release N): **add** new fields/indexes, write old+new.
* Phase B (N+1): **read** new fields; **drop** old.
* Rollback is a matter of redeploying previous images and keeping both schemas valid.
### 5.3 Rollback
* Images referenced by **digest**; keep previous release manifest `K` versions back.
* `helm rollback` or compose `docker compose -f release-K.yml up -d`.
* Mongo migrations are additive; **no destructive changes** within a single minor.
---
## 6) Release payloads & manifests
### 6.1 Release manifest (`release.yaml`)
```yaml
release:
version: "2.4.1"
channel: "stable"
date: "2027-06-20T12:00:00Z"
calendar: "2027.06"
components:
- name: scanner-webservice
image: registry.stella-ops.org/stellaops/scanner-web@sha256:aa..bb
sbom: oci://.../referrers/cdx-json@sha256:11..22
provenance: oci://.../attest/provenance@sha256:33..44
signature: { rekorUUID: "…" }
- name: signer
image: registry.stella-ops.org/stellaops/signer@sha256:cc..dd
signature: { rekorUUID: "…" }
charts:
- name: platform
version: "2.4.1"
digest: "sha256:ee..ff"
compose:
file: "docker-compose.yml"
digest: "sha256:77..88"
checksums:
sha256: "… digest of this release.yaml …"
```
The manifest is **cosignsigned**; UI/CLI can verify a bundle without talking to registries.
### 6.2 Image labels (release metadata)
Each image sets OCI labels:
```
org.opencontainers.image.version = "2.4.1"
org.opencontainers.image.revision = "<git sha>"
org.opencontainers.image.created = "2027-06-20T12:00:00Z"
org.stellaops.release.calendar = "2027.06"
org.stellaops.release.channel = "stable"
org.stellaops.build.slsaProvenance = "oci://…"
```
Signer validates **scanner** images cosign identity + calendar tag for **release window** checks.
---
## 7) Artifact lifecycle & storage (MinIO/Mongo)
### 7.1 Buckets & prefixes (MinIO)
```
s3://stellaops/
scanner/
layers/<sha256>/sbom.cdx.json.zst
images/<imgDigest>/inventory.cdx.pb
images/<imgDigest>/usage.cdx.pb
diffs/<old>_<new>/diff.json.zst
attest/<artifactSha256>.dsse.json
feedser/
json/<exportId>/...
trivy/<exportId>/...
vexer/
exports/<exportId>/...
attestor/
dsse/<bundleSha256>.json
proof/<rekorUuid>.json
```
### 7.2 ILM classes
* **`short`**: working artifacts (diffs, queues) — TTL 714 days.
* **`default`**: SBOMs & indexes — TTL 90180 days (configurable).
* **`compliance`**: signed reports & attested exports — **Object Lock** (governance/compliance) 17 years.
### 7.3 Artifact Lifecycle Controller (ALC)
* A background worker (part of Scanner.WebService) enforces **TTL** and **reference counting**:
* Artifacts referenced by **reports** or **tickets** are pinned.
* ILM actions logged; UI shows perclass usage & upcoming purges.
### 7.4 Mongo retention
* **Scanner**: `runtime.events` use TTL (e.g., 3090 days); **catalog** permanent.
* **Feedser/Vexer**: raw docs keep **last N windows**; canonical stores permanent.
* **Attestor**: `entries` permanent; `dedupe` TTL 2448h.
---
## 8) Observability & SLOs (operations)
* **Uptime SLO**: 99.9% for Signer/Authority/Attestor; 99.5% for Scanner WebService; Vexer/Feedser 99.0%.
* **Error budgets**: tracked per month; dashboards show burn rates.
* **Golden signals**:
* **Latency**: token issuance, sign→attest roundtrip, scan enqueue→emit, export build.
* **Saturation**: queue depth, Mongo write IOPS, MinIO net throughput.
* **Traffic**: scans/min, attestations/min, webhook admits/min.
* **Errors**: 5xx rates, cosign verification failures, Rekor timeouts.
Prometheus + OTLP; Grafana dashboards ship in the charts.
---
## 9) Security & compliance operations
* **Key rotation**:
* Authority JWKS: 60day cadence, dualkey overlap.
* Release signing identities: rotate per minor or quarterly.
* Sigstore roots mirrored and pinned; alarms on drift.
* **FIPS mode** (Gov build):
* Enforce `ES256` + KMS/HSM; disable Ed25519; MLS ciphers only.
* Local **Rekor v2** and **Fulcio** alternatives; **airgapped** CA.
* **Vulnerability response**:
* Feedser redflag advisories trigger accelerated **stable** patch rollout; UI/CLI “security patch available” notice.
* **Backups/DR**:
* Mongo nightly snapshots; MinIO versioning + replication (if configured).
* Restore runbooks tested quarterly with synthetic data.
---
## 10) Customer update flow (how versions are fetched & activated)
### 10.1 Online clusters
* **UI** surfaces update banner with **release manifest** diff and risk notes.
* Operator approves → **Controller** pulls new images by digest; healthchecks; moves traffic; deprecates old revision.
* Postswitch, **schema Phase B** migrations (if any) run automatically.
### 10.2 Airgapped clusters
* Operator downloads **offline kit** from a mirror → `stellaops offline kit import`.
* Controller validates bundle checksums and **cosign signatures**; applies charts/compose by digest.
* After install, **verify** page shows green checks: image sigs, SBOMs attached, provenance logged.
### 10.3 CLI selfupdate (optional)
* `stellaops self-update` pulls a **signed release manifest** and verifies the **CLI binary** with cosign before swapping (admin can disable).
---
## 11) Compatibility & deprecation policy
* **APIs** are stable within a **major**; breaking changes imply **MAJOR++** and deprecation period of one minor.
* **Storage**: expand/contract; “drop old fields” only after one minor grace.
* **Config**: feature flags (default off) for risky features (e.g., eBPF).
---
## 12) Runbooks (selected)
### 12.1 Lost PoE
1. Suspend **automatic attestation** jobs.
2. Use CLI `stellaops signer status` to confirm `entitlement_denied`.
3. Obtain new PoE from portal; verify on Signer `/poe/verify`.
4. Reenable; optionally **resign** last N reports (UI button → batch).
### 12.2 Rekor outage (selfhosted)
* Attestor returns `202 (pending)` with queued proof fetch.
* Keep DSSE bundles locally; resubmit on schedule; UI badge shows **Pending**.
* If outage > SLA, you can switch to a **mirror** log in config; Attestor writes to both when restored.
### 12.3 Emergency downgrade
* Identify prior release manifest (UI → Admin → Releases).
* `helm rollback stella <revision>` (or compose apply previous file).
* Services tolerate skew per §1.3; ensure **Signer/Authority/Attestor** are rolled together.
---
## 13) Example: cluster bootstrap (Compose)
```yaml
version: "3.9"
services:
authority:
image: registry.stella-ops.org/stellaops/authority@sha256:...
env_file: ./env/authority.env
ports: ["8440:8440"]
signer:
image: registry.stella-ops.org/stellaops/signer@sha256:...
depends_on: [authority]
environment:
- SIGNER__POE__LICENSING__INTROSPECTURL=https://www.stella-ops.org/api/v1/license/introspect
attestor:
image: registry.stella-ops.org/stellaops/attestor@sha256:...
depends_on: [signer]
scanner-web:
image: registry.stella-ops.org/stellaops/scanner-web@sha256:...
environment:
- SCANNER__S3__ENDPOINT=http://minio:9000
scanner-worker:
image: registry.stella-ops.org/stellaops/scanner-worker@sha256:...
deploy: { replicas: 4 }
feedser:
image: registry.stella-ops.org/stellaops/feedser@sha256:...
vexer:
image: registry.stella-ops.org/stellaops/vexer@sha256:...
web-ui:
image: registry.stella-ops.org/stellaops/web-ui@sha256:...
mongo:
image: mongo:7
minio:
image: minio/minio:RELEASE.2025-07-10T00-00-00Z
```
---
## 14) Governance & keys (who owns the trust root)
* **Release key policy**: only the Release Engineering group can push signed releases; 4eyes approval; TUFstyle manifest possible in future.
* **Signer acceptance policy**: embedded release identities are updated **only** via minor upgrade; emergency CRL supported.
* **Customer keys**: none needed for core use; enterprise addons may require percustomer registries and keys.
---
## 15) Roadmap (Ops)
* **Windows containers GA** (Scanner + Zastava).
* **Key Transparency** for Signer certs.
* **Deltakit** (offline) for incremental updates.
* **Operator CRDs** (K8s) to manage policy and ILM declaratively.
* **SBOM **protobuf** as default transport at rest (smaller, faster).
---
### Appendix A — Minimal SLO monitors
* `authority.tokens_issued_total` slope ≈ normal.
* `signer.requests_total{result="success"}/minute` > 0 (when scans occur).
* `attestor.submit_latency_seconds{quantile=0.95}` < 0.3.
* `scanner.scan_latency_seconds{quantile=0.95}` < target per image size.
* `feedser.export.duration_seconds` stable; `vexer.consensus.conflicts_total` not exploding after policy changes.
* MinIO `s3_requests_errors_total` near zero; Mongo `opcounters` hit expected baseline.
### Appendix B — Upgrade safety checklist
* Verify **release manifest** signature.
* Ensure **Signer/Authority/Attestor** are same minor.
* Verify **DB backups** < 24h old.
* Confirm **ILM** wont purge compliance artifacts during upgrade window.
* Roll **one component** at a time; watch SLOs; abort on regression.
---
**End — component_architecture_devops.md**