git.stella-ops.org/docs/ARCHITECTURE_DEVOPS.md
master 5fd4032c7c
2025-10-19 23:29:34 +03:00


# component_architecture_devops.md — **StellaOps Release & Operations** (2025Q4)
> **Scope.** Implementation-ready blueprint for **how StellaOps is built, versioned, signed, distributed, upgraded, licensed (PoE)**, and operated in customer environments (online and air-gapped). Covers reproducible builds, supply-chain attestations, registries, offline kits, migration/rollback, artifact lifecycle (MinIO/Mongo), monitoring SLOs, and customer activation.
---
## 0) Product vision (operations lens)
StellaOps must be **trustable at a glance** and **boringly operable**:
* Every release ships with **first-party SBOMs, provenance, and signatures**; services verify **each other's** integrity at runtime.
* Customers can deploy by **digest** and stay aligned with **LTS/stable/edge** channels.
* Paid customers receive **attestation authority** (Signer accepts their PoE) while the core platform remains **free to run**.
* Air-gapped customers receive **offline kits** with verifiable digests and deterministic import.
* Artifacts expire predictably; operators know what's kept, for how long, and why.
---
## 1) Release trains & versioning
### 1.1 Channels
* **LTS** (12-month support window): quarterly cadence (Q1/Q2/Q3/Q4).
* **Stable** (default): monthly rollup (bug fixes + compatible features).
* **Edge**: weekly; for early adopters, no guarantees.
### 1.2 Version strings
Semantic core + calendar tag:
```
<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>) e.g., 2.4.1 (2027.06)
```
* **MAJOR**: breaking API/DB changes (rare).
* **MINOR**: new features, compatible schema migrations (expand/contract pattern).
* **PATCH**: bug fixes, perf and security updates.
* **Calendar tag** exposes **release year** used by Signer for **PoE window checks**.
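
A minimal sketch of parsing the version string format shown above into its semantic and calendar parts. The `ReleaseVersion` type and `parse_release_version` function are hypothetical names introduced for illustration; they are not part of the StellaOps codebase.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    """Semantic core plus calendar tag, e.g. 2.4.1 (2027.06)."""
    major: int
    minor: int
    patch: int
    year: int
    month: int


# <MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+) \((\d{4})\.(\d{2})\)$")


def parse_release_version(text: str) -> ReleaseVersion:
    """Parse a release version string; raise ValueError if malformed."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a release version string: {text!r}")
    version = ReleaseVersion(*(int(group) for group in match.groups()))
    if not 1 <= version.month <= 12:
        raise ValueError(f"calendar month out of range: {text!r}")
    return version
```

A caller interested only in the release year for a PoE window check would read `parse_release_version("2.4.1 (2027.06)").year`.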
### 1.3 Component alignment
A release is a **bundle** of image digests + charts + manifests. All services in a bundle are **wire-compatible**. Mixed minor versions are allowed within a bounded skew:
* **Web UI ↔ backend**: `±1 minor`.
* **Scanner ↔ Policy/Excititor/Concelier**: `±1 minor`.
* **Authority/Signer/Attestor triangle**: **must** be same minor (crypto and DPoP/mTLS binding rules).
At startup, services **self-advertise** their semver & channel; the UI surfaces **mismatch warnings**.
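
The skew rules above can be sketched as a small compatibility check. This is an illustration under stated assumptions: the lowercase service names are hypothetical identifiers, the sketch requires matching major versions, and it applies the `±1 minor` allowance to every pair outside the Authority/Signer/Attestor triangle, which is broader than the two pairings the list names.

```python
# Hypothetical identifiers for the services that must share a minor version.
STRICT_GROUP = {"authority", "signer", "attestor"}


def parse_semver(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def skew_ok(name_a: str, ver_a: str, name_b: str, ver_b: str) -> bool:
    """Return True when two services are within the allowed minor skew."""
    major_a, minor_a, _ = parse_semver(ver_a)
    major_b, minor_b, _ = parse_semver(ver_b)
    if major_a != major_b:
        # Assumption: a major mismatch is never wire-compatible.
        return False
    # Triangle members must match exactly; other pairs may differ by one.
    allowed = 0 if {name_a, name_b} <= STRICT_GROUP else 1
    return abs(minor_a - minor_b) <= allowed
```

A UI could run such a check over every advertised pair and surface a mismatch warning for each pair that returns `False`.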
---
## 2) Supply-chain pipeline (how a release is built)
### 2.1 Deterministic builds
* **Builders**: isolated **BuildKit** workers with pinned base images (digest only).
* **Pinning**: lock files or `go.mod`, `package-lock.json`, `global.json`, `Directory.Packages.props` are **frozen** at tag.
* **Reproducibility**: timestamps normalized; source date epoch; deterministic zips/tars.
* **Multi-arch**: linux/amd64 + linux/arm64 (Windows images track M2 roadmap).
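
To illustrate what "deterministic zips/tars" means in practice, here is a minimal sketch using Python's standard `tarfile` module. It is not the StellaOps build tooling; the function name is hypothetical. Entries are sorted and every variable metadata field (mtime, owner, mode) is pinned, so the same inputs give byte-identical archives.

```python
import io
import tarfile
from typing import Mapping


def build_deterministic_tar(files: Mapping[str, bytes], source_date_epoch: int) -> bytes:
    """Build an uncompressed tar whose bytes depend only on names and contents."""
    buffer = io.BytesIO()
    # USTAR avoids format-specific extended headers for these simple entries.
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):          # stable entry order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = source_date_epoch  # normalized timestamp
            info.mode = 0o644               # fixed permissions
            info.uid = info.gid = 0         # no builder-specific ownership
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()
```

Build systems commonly pass the timestamp through the `SOURCE_DATE_EPOCH` environment variable; the sketch takes it as a parameter to stay free of side effects.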
### 2.2 First-party SBOMs & provenance
* Each image gets **CycloneDX (JSON+Protobuf) SBOM** and **SLSA-style provenance** attached as **OCI referrers**.
* Scanner's **Buildx generator** is used to produce SBOMs *during* build; a separate post-build scan verifies parity (red flag if drift).
* **Release manifest** (see §6.1) lists all digests and SBOM/attestation refs.
### 2.3 Signing & transparency
* Images are **cosign-signed** (keyless) with a StellaOps release identity; inclusion in a **transparency log** (Rekor) is required.
* SBOM and provenance attestations are **DSSE** and also transparency-logged.
* Release keys (Fulcio roots or public keys) are embedded in **Signer** policy (for **scanner-release validation** at customer side).
### 2.4 Gates & tests
* **Static**: linters, codegen checks, protobuf API freeze (backward-compat tests).
* **Unit/integration**: per-component, plus **end-to-end** flows (scan→vex→policy→sign→attest).
* **Perf SLOs**: hot paths (SBOM compose, diff, export) measured against budgets.
* **Security**: dependency audit vs Concelier export; container hardening tests; minimal caps.
* **Canary cohort**: internal staging + selected customers; one week on **edge** before **stable** tag.
---
## 3) Distribution & activation
### 3.1 Registries
* **Primary**: `registry.stella-ops.org` (OCI v2, supports Referrers API).
* **Mirrors**: GHCR (read-only), regional mirrors for latency.
* Operational runbook: see `docs/ops/concelier-mirror-operations.md` for deployment profiles, CDN guidance, and sync automation.
* **Pull by digest only** in Kubernetes/Compose manifests.
**Gating policy**:
* **Core images** (Authority, Scanner, Concelier, Excititor, Attestor, UI): public **read**.
* **Enterprise add-ons** (if any) and **pre-release**: private repos via OAuth2 token service.
> Monetization lever is **signing** (PoE gate), not image pulls, so the core remains simple to consume.
### 3.2 OAuth2 token service (for private repos)
* Docker Registry's token flow backed by **Authority**:
  1. Client hits registry (`401` with `WWW-Authenticate: Bearer realm=…`).
  2. Client gets an **access token** from the token service (validated by Authority) with `scope=repository:…:pull`.
  3. Registry allows pull for the requested repo.
* Tokens are **short-lived** (60-300 s) and **DPoP-bound**.
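
Steps 1 and 2 of the token flow can be sketched as follows: parse the `WWW-Authenticate` challenge and derive the token request URL from it, as in the standard Docker Registry token authentication scheme. The function names are hypothetical, and DPoP binding and the HTTP calls themselves are deliberately left out.

```python
import re
from urllib.parse import urlencode

# Matches key="value" pairs in the challenge parameters.
_PARAM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_bearer_challenge(header: str) -> dict:
    """Extract realm/service/scope from a Bearer WWW-Authenticate header."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError("expected a Bearer challenge")
    return dict(_PARAM_RE.findall(params))


def token_request_url(challenge: dict) -> str:
    """Build the URL the client calls on the token service (the realm)."""
    query = {key: challenge[key] for key in ("service", "scope") if key in challenge}
    return f"{challenge['realm']}?{urlencode(query)}"
```

The client then sends the returned token as `Authorization: Bearer <token>` when retrying the registry request (step 3).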
### 3.3 Offline kits (air-gapped)
* Tarball per release channel:
```
stellaops-kit-<ver>-<channel>.tar.zst
  /images/     OCI layout with all first-party images (multi-arch)
  /sboms/      CycloneDX JSON+PB for each image
  /attest/     DSSE bundles + Rekor proofs
  /charts/     Helm charts + values templates
  /compose/    docker-compose.yml + .env template
  /plugins/    Concelier/Excititor connectors (restart-time)
  /policy/     example policies
  /manifest/   release.yaml (see §6.1)
```
* Import via CLI `offline kit import`; checks digests and signatures before load.
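
The digest half of the import check can be sketched as below. This is not the CLI's implementation: the `expected` mapping (relative path to `sha256:<hex>`) is a hypothetical input format, and signature verification is out of scope here.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path) -> str:
    """Stream a file and return its digest as 'sha256:<hex>'."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def verify_kit(root: Path, expected: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose digest differs."""
    failures = []
    for rel_path, want in sorted(expected.items()):
        target = root / rel_path
        if not target.is_file() or sha256_file(target) != want:
            failures.append(rel_path)
    return failures
```

An importer following this shape would refuse to load any image while `verify_kit` returns a non-empty list.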
---
## 4) Licensing (PoE) & monetization
**Principle**: **Only paid StellaOps issues valid signed attestations.** Running the stack is free; signing requires PoE.
### 4.1 PoE issuance
* Customers purchase a plan and obtain a **PoE artifact** from `www.stella-ops.org`:
  * **PoE-JWT** (DPoP/mTLS-bound) **or** **PoE mTLS client certificate**.
  * Contains: `license_id`, `plan`, `valid_release_year`, `max_version`, `exp`, optional `tenant/customer` IDs.
### 4.2 Online enforcement
* **Signer** calls **Licensing /license/introspect** on every signing request (see signer doc).
* If **revoked/expired/out-of-window** → deny with machine-readable reason.
* All **valid** bundles are DSSE-signed and **Attestor** logs them; Rekor UUID returned.
* UI badges: “**Verified by StellaOps**” with link to the public log.
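
A minimal sketch of the deny decision, under explicit assumptions: the reason codes (`revoked`, `expired`, `out_of_window`) are hypothetical strings, `valid_release_year` is read as the latest release year covered, `max_version` as the highest allowed semver, and the revocation flag stands in for the result of the Licensing introspection call. The actual Signer semantics are defined in the signer doc, not here.

```python
from typing import Optional


def _semver(version: str) -> tuple:
    """Turn 'MAJOR.MINOR.PATCH' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def check_poe(claims: dict, release_version: str, release_calendar: str,
              now: int, revoked: bool = False) -> Optional[str]:
    """Return None when signing may proceed, else a machine-readable reason."""
    if revoked:
        return "revoked"
    if now >= claims["exp"]:
        return "expired"
    release_year = int(release_calendar.split(".")[0])
    if release_year > claims["valid_release_year"]:
        return "out_of_window"
    if _semver(release_version) > _semver(claims["max_version"]):
        return "out_of_window"
    return None
```

Returning a reason code rather than raising keeps the result easy to serialize into an API error body.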
### 4.3 Air-gapped / offline
* Customers obtain a **time-boxed PoE lease** (signed JSON, 7-30 days).
* Signer accepts the lease and emits **provisional** attestations (clearly labeled).
* When connectivity returns, a background job **endorses** the provisional entries with the cloud service, updating their status to **verified**.
* Operators can export a **verification bundle** for auditors even before endorsement (contains DSSE + local Rekor proof + lease snapshot).
### 4.4 Stolen/abused PoE
* Customers report theft; **Licensing** flags `license_id` as **revoked**.
* Subsequent Signer requests **deny**; previous attestations remain but can be marked **contested** (UI shows badge, optional re-sign path upon new PoE).
---
## 5) Deployment path (customer side)
### 5.1 First install
* **Helm** (Kubernetes) or **Compose** (VMs). Example (K8s):
```bash
helm repo add stellaops https://charts.stella-ops.org
helm install stella stellaops/platform \
--version 2.4.0 \
--set global.channel=stable \
--set authority.issuer=https://authority.stella.local \
--set scanner.minio.endpoint=http://minio.stella.local:9000 \
--set scanner.mongo.uri=mongodb://mongo/scanner \
--set concelier.mongo.uri=mongodb://mongo/concelier \
--set excititor.mongo.uri=mongodb://mongo/excititor
```
* Post-install job registers **Authority clients** (Scanner, Signer, Attestor, UI) and prints **bootstrap** URLs and client credentials (sealed secrets).
* UI banner shows **release bundle** and verification state (cosign OK? Rekor OK?).
### 5.2 Updates
* **Blue/green**: pull new bundle by **digest**; deploy side-by-side; cut traffic.
* **Rolling**: upgrade components in a safe order:
  1. Authority (stateless, dual-key rotation ready)
  2. Signer/Attestor (same minor)
  3. Scanner WebService & Workers
  4. Concelier, then Excititor (schema migrations are expand/contract)
  5. UI last
* **DB migrations** are **expand/contract**:
  * Phase A (release N): **add** new fields/indexes, write old+new.
  * Phase B (N+1): **read** new fields; **drop** old.
* Rollback is a matter of redeploying previous images and keeping both schemas valid.
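
The expand/contract phases can be illustrated with plain dictionaries standing in for Mongo documents. The field names `imageDigest` (old) and `artifact.digest` (new) are hypothetical and chosen only to show the pattern.

```python
def write_phase_a(doc: dict, digest: str) -> dict:
    """Release N (expand): write both the old and the new field."""
    updated = dict(doc)
    updated["imageDigest"] = digest           # old field, still read by N-1
    updated["artifact"] = {"digest": digest}  # new field, read from N+1 on
    return updated


def read_old(doc: dict) -> str:
    """Reader used by releases up to N."""
    return doc["imageDigest"]


def read_new(doc: dict) -> str:
    """Reader used by release N+1 and later."""
    return doc["artifact"]["digest"]


def contract_phase_b(doc: dict) -> dict:
    """Release N+1 (contract): drop the old field once nothing reads it."""
    updated = dict(doc)
    updated.pop("imageDigest", None)
    return updated
```

Because a Phase A document satisfies both readers, rolling back from N to N-1 needs no data repair; only after `contract_phase_b` has run does the old reader stop working.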
### 5.3 Rollback
* Images referenced by **digest**; keep previous release manifest `K` versions back.
* `helm rollback` or compose `docker compose -f release-K.yml up -d`.
* Mongo migrations are additive; **no destructive changes** within a single minor.
---
## 6) Release payloads & manifests
### 6.1 Release manifest (`release.yaml`)
```yaml
release:
  version: "2.4.1"
  channel: "stable"
  date: "2027-06-20T12:00:00Z"
  calendar: "2027.06"
  components:
    - name: scanner-webservice
      image: registry.stella-ops.org/stellaops/scanner-web@sha256:aa..bb
      sbom: oci://.../referrers/cdx-json@sha256:11..22
      provenance: oci://.../attest/provenance@sha256:33..44
      signature: { rekorUUID: "…" }
    - name: signer
      image: registry.stella-ops.org/stellaops/signer@sha256:cc..dd
      signature: { rekorUUID: "…" }
  charts:
    - name: platform
      version: "2.4.1"
      digest: "sha256:ee..ff"
  compose:
    file: "docker-compose.yml"
    digest: "sha256:77..88"
  checksums:
    sha256: "… digest of this release.yaml …"
```
The manifest is **cosign-signed**; UI/CLI can verify a bundle without talking to registries.
> **Deployment guardrails.** The repository keeps channel-aligned Compose bundles
> in `deploy/compose/` and Helm overlays in `deploy/helm/stellaops/`. Both sets
> pull their digests from `deploy/releases/` and are validated by
> `deploy/tools/validate-profiles.sh` to guarantee lint/dry-run cleanliness.
### 6.2 Image labels (release metadata)
Each image sets OCI labels:
```
org.opencontainers.image.version = "2.4.1"
org.opencontainers.image.revision = "<git sha>"
org.opencontainers.image.created = "2027-06-20T12:00:00Z"
org.stellaops.release.calendar = "2027.06"
org.stellaops.release.channel = "stable"
org.stellaops.build.slsaProvenance = "oci://…"
```
Signer validates the **scanner** image's cosign identity + calendar tag for **release window** checks.
---
## 7) Artifact lifecycle & storage (MinIO/Mongo)
### 7.1 Buckets & prefixes (MinIO)
```
s3://stellaops/
  scanner/
    layers/<sha256>/sbom.cdx.json.zst
    images/<imgDigest>/inventory.cdx.pb
    images/<imgDigest>/usage.cdx.pb
    diffs/<old>_<new>/diff.json.zst
    attest/<artifactSha256>.dsse.json
  concelier/
    json/<exportId>/...
    trivy/<exportId>/...
  excititor/
    exports/<exportId>/...
  attestor/
    dsse/<bundleSha256>.json
    proof/<rekorUuid>.json
```
### 7.2 ILM classes
* **`short`**: working artifacts (diffs, queues), TTL 7-14 days.
* **`default`**: SBOMs & indexes, TTL 90-180 days (configurable).
* **`compliance`**: signed reports & attested exports, **Object Lock** (governance/compliance) 1-7 years.
### 7.3 Artifact Lifecycle Controller (ALC)
* A background worker (part of Scanner.WebService) enforces **TTL** and **reference counting**:
  * Artifacts referenced by **reports** or **tickets** are pinned.
  * ILM actions logged; UI shows per-class usage & upcoming purges.
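
The TTL plus reference-counting rule can be sketched as a pure selection function. Everything here is illustrative: the `Artifact` shape, the TTL values picked for the `short` and `default` classes, and the choice to leave `compliance` objects entirely to Object Lock are assumptions of the sketch, not the controller's actual behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

# Illustrative TTLs; real values are configurable per deployment.
TTL_BY_CLASS = {
    "short": timedelta(days=14),
    "default": timedelta(days=180),
}


@dataclass
class Artifact:
    key: str
    ilm_class: str
    created: datetime
    ref_count: int = 0  # number of reports/tickets pinning this artifact


def purge_candidates(artifacts: Iterable[Artifact], now: datetime) -> List[str]:
    """Return keys that are past their TTL and not pinned by any reference."""
    keys = []
    for artifact in artifacts:
        ttl = TTL_BY_CLASS.get(artifact.ilm_class)
        if ttl is None:
            continue  # classes without a TTL here (e.g. compliance) are never purged
        if artifact.ref_count > 0:
            continue  # pinned by a report or ticket
        if now - artifact.created >= ttl:
            keys.append(artifact.key)
    return keys
```

Keeping the decision separate from the deletion makes it straightforward to log each ILM action and to show upcoming purges in a UI before they happen.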
### 7.4 Mongo retention
* **Scanner**: `runtime.events` use TTL (e.g., 30-90 days); **catalog** permanent.
* **Concelier/Excititor**: raw docs keep **last N windows**; canonical stores permanent.
* **Attestor**: `entries` permanent; `dedupe` TTL 24-48h.
### 7.5 Mongo server baseline
* **Minimum supported server:** MongoDB **4.2+**. Driver 3.5.0 removes compatibility shims for 4.0; upstream has already announced 4.0 support will be dropped in upcoming C# driver releases.
* **Deploy images:** Compose/Helm defaults stay on `mongo:7.x`. For air-gapped installs, refresh Offline Kit bundles so the packaged `mongod` matches ≥4.2.
* **Upgrade guard:** During rollout, verify replica sets reach FCV `4.2` or above before swapping binaries; automation should hard-stop if FCV is <4.2.
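
The upgrade guard can be sketched as a numeric version comparison that stops the rollout when any replica set member reports too low a feature compatibility version. The function names and the `member_fcvs` mapping are hypothetical; in a real deployment the values would come from MongoDB's `getParameter` command with `featureCompatibilityVersion: 1`.

```python
from typing import Dict


def _as_tuple(version: str) -> tuple:
    """Compare versions numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def fcv_at_least(fcv: str, minimum: str = "4.2") -> bool:
    """Return True when the FCV meets the minimum supported baseline."""
    return _as_tuple(fcv) >= _as_tuple(minimum)


def guard_upgrade(member_fcvs: Dict[str, str], minimum: str = "4.2") -> None:
    """Hard-stop (raise) if any member's FCV is below the minimum."""
    lagging = sorted(host for host, fcv in member_fcvs.items()
                     if not fcv_at_least(fcv, minimum))
    if lagging:
        raise RuntimeError(
            f"FCV below {minimum} on: {', '.join(lagging)}; aborting rollout"
        )
```

Comparing tuples of integers avoids the string-ordering trap where `"4.10"` would sort before `"4.2"`.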
---
## 8) Observability & SLOs (operations)
* **Uptime SLO**: 99.9% for Signer/Authority/Attestor; 99.5% for Scanner WebService; Excititor/Concelier 99.0%.
* **Error budgets**: tracked per month; dashboards show burn rates.
* **Golden signals**:
  * **Latency**: token issuance, sign→attest round-trip, scan enqueue→emit, export build.
  * **Saturation**: queue depth, Mongo write IOPS, MinIO net throughput.
  * **Traffic**: scans/min, attestations/min, webhook admits/min.
  * **Errors**: 5xx rates, cosign verification failures, Rekor timeouts.
Prometheus + OTLP; Grafana dashboards ship in the charts.
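
As a worked example of the error budgets behind these SLOs, the sketch below converts an uptime target into allowed downtime and computes a burn rate. It assumes a 30-day month, which gives roughly 43.2, 216, and 432 minutes for the 99.9%, 99.5%, and 99.0% targets; the function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an uptime SLO over the window."""
    window_minutes = days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


def burn_rate(bad_minutes: float, elapsed_days: float,
              slo_percent: float, days: int = 30) -> float:
    """Ratio of budget spent to window elapsed; above 1.0 means overspending."""
    spent_fraction = bad_minutes / error_budget_minutes(slo_percent, days)
    elapsed_fraction = elapsed_days / days
    return spent_fraction / elapsed_fraction
```

For instance, 21.6 minutes of Signer downtime in the first 3 days of a month spends half of the 99.9% budget in a tenth of the window, a burn rate of about 5.
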
---
## 9) Security & compliance operations
* **Key rotation**:
  * Authority JWKS: 60-day cadence, dual-key overlap.
  * Release signing identities: rotate per minor or quarterly.
  * Sigstore roots mirrored and pinned; alarms on drift.
* **FIPS mode** (Gov build):
  * Enforce `ES256` + KMS/HSM; disable Ed25519; MLS ciphers only.
  * Local **Rekor v2** and **Fulcio** alternatives; **air-gapped** CA.
* **Vulnerability response**:
  * Concelier red-flag advisories trigger accelerated **stable** patch rollout; UI/CLI “security patch available” notice.
  * 2025-10: Pinned `MongoDB.Driver` **3.5.0** and `SharpCompress` **0.41.0** across services (DEVOPS-SEC-10-301) to eliminate NU1902/NU1903 warnings surfaced during scanner cache/worker test runs; future dependency bumps follow the same central override pattern.
* **Backups/DR**:
  * Mongo nightly snapshots; MinIO versioning + replication (if configured).
  * Restore runbooks tested quarterly with synthetic data.
---
## 10) Customer update flow (how versions are fetched & activated)
### 10.1 Online clusters
* **UI** surfaces update banner with **release manifest** diff and risk notes.
* Operator approves → **Controller** pulls new images by digest; healthchecks; moves traffic; deprecates old revision.
* Post-switch, **schema Phase B** migrations (if any) run automatically.
### 10.2 Air-gapped clusters
* Operator downloads **offline kit** from a mirror → `stellaops offline kit import`.
* Controller validates bundle checksums and **cosign signatures**; applies charts/compose by digest.
* After install, **verify** page shows green checks: image sigs, SBOMs attached, provenance logged.
### 10.3 CLI self-update (optional)
* `stellaops self-update` pulls a **signed release manifest** and verifies the **CLI binary** with cosign before swapping (admin can disable).
---
## 11) Compatibility & deprecation policy
* **APIs** are stable within a **major**; breaking changes imply **MAJOR++** and deprecation period of one minor.
* **Storage**: expand/contract; “drop old fields” only after one minor grace.
* **Config**: feature flags (default off) for risky features (e.g., eBPF).
---
## 12) Runbooks (selected)
### 12.1 Lost PoE
1. Suspend **automatic attestation** jobs.
2. Use CLI `stellaops signer status` to confirm `entitlement_denied`.
3. Obtain new PoE from portal; verify on Signer `/poe/verify`.
4. Re-enable; optionally **re-sign** last N reports (UI button → batch).
### 12.2 Rekor outage (self-hosted)
* Attestor returns `202 (pending)` with queued proof fetch.
* Keep DSSE bundles locally; resubmit on schedule; UI badge shows **Pending**.
* If outage > SLA, you can switch to a **mirror** log in config; Attestor writes to both when restored.
### 12.3 Emergency downgrade
* Identify prior release manifest (UI → Admin → Releases).
* `helm rollback stella <revision>` (or compose apply previous file).
* Services tolerate skew per §1.3; ensure **Signer/Authority/Attestor** are rolled together.
---
## 13) Example: cluster bootstrap (Compose)
```yaml
version: "3.9"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority@sha256:...
    env_file: ./env/authority.env
    ports: ["8440:8440"]
  signer:
    image: registry.stella-ops.org/stellaops/signer@sha256:...
    depends_on: [authority]
    environment:
      - SIGNER__POE__LICENSING__INTROSPECTURL=https://www.stella-ops.org/api/v1/license/introspect
  attestor:
    image: registry.stella-ops.org/stellaops/attestor@sha256:...
    depends_on: [signer]
  scanner-web:
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:...
    environment:
      - SCANNER__S3__ENDPOINT=http://minio:9000
  scanner-worker:
    image: registry.stella-ops.org/stellaops/scanner-worker@sha256:...
    deploy: { replicas: 4 }
  concelier:
    image: registry.stella-ops.org/stellaops/concelier@sha256:...
  excititor:
    image: registry.stella-ops.org/stellaops/excititor@sha256:...
  web-ui:
    image: registry.stella-ops.org/stellaops/web-ui@sha256:...
  mongo:
    image: mongo:7
  minio:
    image: minio/minio:RELEASE.2025-07-10T00-00-00Z
```
---
## 14) Governance & keys (who owns the trust root)
* **Release key policy**: only the Release Engineering group can push signed releases; 4-eyes approval; TUF-style manifest possible in future.
* **Signer acceptance policy**: embedded release identities are updated **only** via minor upgrade; emergency CRL supported.
* **Customer keys**: none needed for core use; enterprise add-ons may require per-customer registries and keys.
---
## 15) Roadmap (Ops)
* **Windows containers GA** (Scanner + Zastava).
* **Key Transparency** for Signer certs.
* **Delta-kit** (offline) for incremental updates.
* **Operator CRDs** (K8s) to manage policy and ILM declaratively.
* **SBOM protobuf** as default transport at rest (smaller, faster).
---
### Appendix A — Minimal SLO monitors
* `authority.tokens_issued_total` slope ≈ normal.
* `signer.requests_total{result="success"}/minute` > 0 (when scans occur).
* `attestor.submit_latency_seconds{quantile=0.95}` < 0.3.
* `scanner.scan_latency_seconds{quantile=0.95}` < target per image size.
* `concelier.export.duration_seconds` stable; `excititor.consensus.conflicts_total` not exploding after policy changes.
* MinIO `s3_requests_errors_total` near zero; Mongo `opcounters` hit expected baseline.
### Appendix B — Upgrade safety checklist
* Verify **release manifest** signature.
* Ensure **Signer/Authority/Attestor** are same minor.
* Verify **DB backups** < 24h old.
* Confirm **ILM** won't purge compliance artifacts during upgrade window.
* Roll **one component** at a time; watch SLOs; abort on regression.
---
**End — component_architecture_devops.md**