git.stella-ops.org/docs/ARCHITECTURE_DEVOPS.md
2025-10-18 20:44:59 +03:00


component_architecture_devops.md — StellaOps Release & Operations (2025Q4)

Scope. Implementation-ready blueprint for how StellaOps is built, versioned, signed, distributed, upgraded, licensed (PoE), and operated in customer environments (online and air-gapped). Covers reproducible builds, supply-chain attestations, registries, offline kits, migration/rollback, artifact lifecycle (MinIO/Mongo), monitoring SLOs, and customer activation.


0) Product vision (operations lens)

StellaOps must be trustable at a glance and boringly operable:

  • Every release ships with first-party SBOMs, provenance, and signatures; services verify each other's integrity at runtime.
  • Customers can deploy by digest and stay aligned with LTS/stable/edge channels.
  • Paid customers receive attestation authority (Signer accepts their PoE) while the core platform remains free to run.
  • Air-gapped customers receive offline kits with verifiable digests and deterministic import.
  • Artifacts expire predictably; operators know what's kept, for how long, and why.

1) Release trains & versioning

1.1 Channels

  • LTS (12-month support window): quarterly cadence (Q1/Q2/Q3/Q4).
  • Stable (default): monthly rollup (bug fixes + compatible features).
  • Edge: weekly; for early adopters, no guarantees.

1.2 Version strings

Semantic core + calendar tag:

<MAJOR>.<MINOR>.<PATCH>  (<YYYY>.<MM>)   e.g., 2.4.1 (2027.06)
  • MAJOR: breaking API/DB changes (rare).
  • MINOR: new features, compatible schema migrations (expand/contract pattern).
  • PATCH: bug fixes, perf and security updates.
  • Calendar tag exposes release year used by Signer for PoE window checks.
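
As an illustration of the format above, the following minimal sketch parses a version string into its semantic core and calendar tag. The `ReleaseVersion` type and the `parse_release_version` function are hypothetical names used only for this example; they are not part of StellaOps.

```python
import re
from dataclasses import dataclass

# Matches "<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)", e.g. "2.4.1 (2027.06)".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\s+\((\d{4})\.(\d{2})\)$")


@dataclass(frozen=True)
class ReleaseVersion:
    """Hypothetical container for a parsed release version string."""
    major: int
    minor: int
    patch: int
    year: int   # calendar tag year (the part a PoE window check would use)
    month: int  # calendar tag month, 1..12


def parse_release_version(text: str) -> ReleaseVersion:
    """Parse a version string; raise ValueError if it does not fit the format."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a release version string: {text!r}")
    major, minor, patch, year, month = (int(group) for group in match.groups())
    if not 1 <= month <= 12:
        raise ValueError(f"calendar month out of range: {month}")
    return ReleaseVersion(major, minor, patch, year, month)
```

For example, `parse_release_version("2.4.1 (2027.06)")` yields major 2, minor 4, patch 1 and the calendar tag 2027.06.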

1.3 Component alignment

A release is a bundle of image digests + charts + manifests. All services in a bundle are wire-compatible. Mixed minor versions are allowed within a bounded skew:

  • Web UI ↔ backend: ±1 minor.
  • Scanner ↔ Policy/Excititor/Feedser: ±1 minor.
  • Authority/Signer/Attestor triangle: must be same minor (crypto and DPoP/mTLS binding rules).

At startup, services self-advertise their semver & channel; the UI surfaces mismatch warnings.
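
The skew rules can be expressed as a small table. The sketch below is only an illustration: the service keys and the `skew_violations` helper are hypothetical, versions are given as (major, minor) tuples, and it assumes that any major-version mismatch counts as a violation. The Web UI/backend rule is left out because the text does not name a single backend service.

```python
# (left service, right service, allowed minor skew). Names are hypothetical.
SKEW_RULES = [
    ("scanner", "policy", 1),
    ("scanner", "excititor", 1),
    ("scanner", "feedser", 1),
    ("authority", "signer", 0),
    ("authority", "attestor", 0),
    ("signer", "attestor", 0),
]


def skew_violations(versions: dict[str, tuple[int, int]]) -> list[str]:
    """Return one human-readable message per violated skew rule.

    `versions` maps a service name to its (major, minor) pair.
    Rules that mention a service missing from `versions` are skipped.
    """
    problems = []
    for left, right, max_skew in SKEW_RULES:
        if left not in versions or right not in versions:
            continue
        (l_major, l_minor), (r_major, r_minor) = versions[left], versions[right]
        # Assumption: a differing major is always out of bounds.
        if l_major != r_major or abs(l_minor - r_minor) > max_skew:
            problems.append(
                f"{left} {l_major}.{l_minor} vs {right} {r_major}.{r_minor} "
                f"(allowed skew: {max_skew} minor)"
            )
    return problems
```

A UI warning banner could then be driven by whether the returned list is empty.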


2) Supply-chain pipeline (how a release is built)

2.1 Deterministic builds

  • Builders: isolated BuildKit workers with pinned base images (digest only).
  • Pinning: lock files and package manifests (go.mod, package-lock.json, global.json, Directory.Packages.props) are frozen at tag.
  • Reproducibility: timestamps normalized via SOURCE_DATE_EPOCH; deterministic zips/tars.
  • Multi-arch: linux/amd64 + linux/arm64 (Windows images track M2 roadmap).

2.2 First-party SBOMs & provenance

  • Each image gets CycloneDX (JSON+Protobuf) SBOM and SLSA-style provenance attached as OCI referrers.
  • Scanner's Buildx generator is used to produce SBOMs during build; a separate post-build scan verifies parity (red flag if drift).
  • Release manifest (see §6.1) lists all digests and SBOM/attestation refs.

2.3 Signing & transparency

  • Images are cosign-signed (keyless) with a StellaOps release identity; inclusion in a transparency log (Rekor) is required.
  • SBOM and provenance attestations are DSSE and also transparency-logged.
  • Release keys (Fulcio roots or public keys) are embedded in Signer policy (for scanner-release validation at customer side).

2.4 Gates & tests

  • Static: linters, codegen checks, protobuf API freeze (backward-compat tests).
  • Unit/integration: per-component, plus end-to-end flows (scan→vex→policy→sign→attest).
  • Perf SLOs: hot paths (SBOM compose, diff, export) measured against budgets.
  • Security: dependency audit vs Feedser export; container hardening tests; minimal caps.
  • Canary cohort: internal staging + selected customers; one week on edge before stable tag.

3) Distribution & activation

3.1 Registries

  • Primary: registry.stella-ops.org (OCI v2, supports Referrers API).
  • Mirrors: GHCR (read-only), regional mirrors for latency.
  • Pull by digest only in Kubernetes/Compose manifests.

Gating policy:

  • Core images (Authority, Scanner, Feedser, Excititor, Attestor, UI): public read.
  • Enterprise add-ons (if any) and pre-release: private repos via OAuth2 token service.

Monetization lever is signing (PoE gate), not image pulls, so the core remains simple to consume.

3.2 OAuth2 token service (for private repos)

  • Docker Registry's token flow backed by Authority:

    1. Client hits registry (401 with WWW-Authenticate: Bearer realm=…).
    2. Client gets an access token from the token service (validated by Authority) with scope=repository:…:pull.
    3. Registry allows pull for the requested repo.
  • Tokens are short-lived (60–300s) and DPoP-bound.
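
Step 1 of the flow returns a challenge header that the client has to parse before it can request a token. The following is a minimal sketch of that parsing, based on the standard Docker Registry token authentication challenge format. The function names are hypothetical, the host names in the usage note are placeholders, and DPoP binding is not shown.

```python
import re
from urllib.parse import urlencode

# key="value" pairs as they appear in a Bearer challenge.
_PARAM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_bearer_challenge(header: str) -> dict[str, str]:
    """Split a `WWW-Authenticate: Bearer ...` value into its parameters."""
    scheme, _, params = header.strip().partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError("expected a Bearer challenge")
    return dict(_PARAM_RE.findall(params))


def token_request_url(challenge: dict[str, str]) -> str:
    """Build the token-service URL from the realm, service and scope."""
    query = {key: challenge[key] for key in ("service", "scope") if key in challenge}
    return f"{challenge['realm']}?{urlencode(query)}"
```

Given a challenge such as `Bearer realm="https://auth.example.test/token",service="registry.example.test",scope="repository:stellaops/addon:pull"`, the client would request a token from the realm URL with the service and scope passed as query parameters, then retry the pull with the returned token.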

3.3 Offline kits (air-gapped)

  • Tarball per release channel:

    stellaops-kit-<ver>-<channel>.tar.zst
      /images/   OCI layout with all first-party images (multi-arch)
      /sboms/    CycloneDX JSON+PB for each image
      /attest/   DSSE bundles + Rekor proofs
      /charts/   Helm charts + values templates
      /compose/  docker-compose.yml + .env template
      /plugins/  Feedser/Excititor connectors (restart-time)
      /policy/   example policies
      /manifest/ release.yaml  (see §6.1)
    
  • Import via the CLI (stellaops offline kit import); digests and signatures are checked before load.
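
The digest check that precedes loading could look roughly like the sketch below. It assumes a hypothetical mapping from kit-relative paths to `sha256:<hex>` digests; how digests are actually laid out in release.yaml is described in §6.1 and may differ. Signature verification is out of scope here.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return it as 'sha256:<hex>'."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def verify_kit(root: Path, expected: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest differs.

    `expected` maps kit-relative paths to digests (hypothetical shape).
    An empty result means every listed file matched.
    """
    failures = []
    for rel_path, want in sorted(expected.items()):
        target = root / rel_path
        if not target.is_file() or sha256_file(target) != want:
            failures.append(rel_path)
    return failures
```

An importer built this way would refuse to load anything unless `verify_kit` returns an empty list.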


4) Licensing (PoE) & monetization

Principle: Only paid StellaOps issues valid signed attestations. Running the stack is free; signing requires PoE.

4.1 PoE issuance

  • Customers purchase a plan and obtain a PoE artifact from www.stella-ops.org:

    • PoE-JWT (DPoP/mTLS-bound) or PoE mTLS client certificate.
    • Contains: license_id, plan, valid_release_year, max_version, exp, optional tenant/customer IDs.

4.2 Online enforcement

  • Signer calls Licensing /license/introspect on every signing request (see signer doc).
  • If revoked/expired/out-of-window → deny with machine-readable reason.
  • All valid bundles are DSSE-signed and Attestor logs them; Rekor UUID returned.
  • UI badges: “Verified by StellaOps” with link to the public log.
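
The deny logic above can be sketched as a pure function over an introspection result. Everything here is an assumption made for illustration: the `PoeStatus` shape, the reason strings, and the window rule (a scanner release year later than `valid_release_year` is treated as out of window). The real contract is defined in the signer doc.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PoeStatus:
    """Hypothetical subset of a /license/introspect response."""
    license_id: str
    revoked: bool
    exp: int                 # expiry, seconds since the Unix epoch
    valid_release_year: int  # latest release year the PoE covers (assumed meaning)


def signing_denial_reason(
    status: PoeStatus, scanner_release_year: int, now: int
) -> Optional[str]:
    """Return a machine-readable denial reason, or None if signing may proceed."""
    if status.revoked:
        return "license_revoked"
    if now >= status.exp:
        return "license_expired"
    if scanner_release_year > status.valid_release_year:
        return "release_out_of_window"
    return None
```

Keeping the decision free of I/O makes each deny path easy to unit-test independently of the Licensing service.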

4.3 Air-gapped / offline

  • Customers obtain a time-boxed PoE lease (signed JSON, 7–30 days).
  • Signer accepts the lease and emits provisional attestations (clearly labeled).
  • When connectivity returns, a background job endorses the provisional entries with the cloud service, updating their status to verified.
  • Operators can export a verification bundle for auditors even before endorsement (contains DSSE + local Rekor proof + lease snapshot).

4.4 Stolen/abused PoE

  • Customers report theft; Licensing flags license_id as revoked.
  • Subsequent Signer requests deny; previous attestations remain but can be marked contested (UI shows badge, optional re-sign path upon new PoE).

5) Deployment path (customer side)

5.1 First install

  • Helm (Kubernetes) or Compose (VMs). Example (K8s):
helm repo add stellaops https://charts.stella-ops.org
helm install stella stellaops/platform \
  --version 2.4.0 \
  --set global.channel=stable \
  --set authority.issuer=https://authority.stella.local \
  --set scanner.minio.endpoint=http://minio.stella.local:9000 \
  --set scanner.mongo.uri=mongodb://mongo/scanner \
  --set feedser.mongo.uri=mongodb://mongo/feedser \
  --set excititor.mongo.uri=mongodb://mongo/excititor
  • Post-install job registers Authority clients (Scanner, Signer, Attestor, UI) and prints bootstrap URLs and client credentials (sealed secrets).
  • UI banner shows release bundle and verification state (cosign OK? Rekor OK?).

5.2 Updates

  • Blue/green: pull new bundle by digest; deploy side-by-side; cut traffic.

  • Rolling: upgrade stateful components in safe order:

    1. Authority (stateless, dual-key rotation ready)
    2. Signer/Attestor (same minor)
    3. Scanner WebService & Workers
    4. Feedser, then Excititor (schema migrations are expand/contract)
    5. UI last
  • DB migrations are expand/contract:

    • Phase A (release N): add new fields/indexes, write old+new.
    • Phase B (N+1): read new fields; drop old.
    • Rollback is a matter of redeploying previous images and keeping both schemas valid.
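
To make the two phases concrete, here is a minimal sketch that uses plain dictionaries as stand-ins for Mongo documents. The field names (`owner` as the old field, `owner_ref` as the new one) and all function names are hypothetical.

```python
def write_phase_a(doc: dict, owner: str) -> dict:
    """Release N (expand): write both the old and the new representation."""
    updated = dict(doc)
    updated["owner"] = owner                  # old field, still read by release N-1
    updated["owner_ref"] = {"name": owner}    # new field, read from release N+1 on
    return updated


def read_legacy(doc: dict) -> str:
    """Reader used by the previous release: only knows the old field."""
    return doc["owner"]


def read_phase_b(doc: dict) -> str:
    """Release N+1 (contract): read the new field only."""
    return doc["owner_ref"]["name"]


def contract(doc: dict) -> dict:
    """Release N+1 cleanup: drop the old field once nothing reads it."""
    return {key: value for key, value in doc.items() if key != "owner"}
```

A document written in Phase A is readable by both `read_legacy` and `read_phase_b`, which is what makes redeploying the previous images safe. Once `contract` has run, only the new reader still works, so the drop belongs in the later release.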

5.3 Rollback

  • Images referenced by digest; keep previous release manifest K versions back.
  • helm rollback, or with Compose: docker compose -f release-K.yml up -d.
  • Mongo migrations are additive; no destructive changes within a single minor.

6) Release payloads & manifests

6.1 Release manifest (release.yaml)

release:
  version: "2.4.1"
  channel: "stable"
  date: "2027-06-20T12:00:00Z"
  calendar: "2027.06"
  components:
    - name: scanner-webservice
      image: registry.stella-ops.org/stellaops/scanner-web@sha256:aa..bb
      sbom: oci://.../referrers/cdx-json@sha256:11..22
      provenance: oci://.../attest/provenance@sha256:33..44
      signature: { rekorUUID: "…" }
    - name: signer
      image: registry.stella-ops.org/stellaops/signer@sha256:cc..dd
      signature: { rekorUUID: "…" }
  charts:
    - name: platform
      version: "2.4.1"
      digest: "sha256:ee..ff"
  compose:
    file: "docker-compose.yml"
    digest: "sha256:77..88"
  checksums:
    sha256: "… digest of this release.yaml …"

The manifest is cosign-signed; UI/CLI can verify a bundle without talking to registries.
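
One offline check a CLI could run on a parsed manifest is that every component image is pinned by digest, in line with the pull-by-digest rule in §3.1. The sketch below assumes the manifest has already been loaded into a dictionary with the shape shown above; the helper name is hypothetical, and signature verification is not covered.

```python
import re

# "<repository>@sha256:<64 lowercase hex characters>"
_DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def unpinned_components(manifest: dict) -> list[str]:
    """Return names of components whose image is not a full sha256 digest reference."""
    components = manifest.get("release", {}).get("components", [])
    return [
        component["name"]
        for component in components
        if not _DIGEST_REF.match(component.get("image", ""))
    ]
```

An empty result means every listed image reference is digest-pinned; tag-based references such as `repo/image:2.4.1` would be reported by name.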

6.2 Image labels (release metadata)

Each image sets OCI labels:

org.opencontainers.image.version = "2.4.1"
org.opencontainers.image.revision = "<git sha>"
org.opencontainers.image.created = "2027-06-20T12:00:00Z"
org.stellaops.release.calendar = "2027.06"
org.stellaops.release.channel  = "stable"
org.stellaops.build.slsaProvenance = "oci://…"

Signer validates the scanner image's cosign identity + calendar tag for release window checks.


7) Artifact lifecycle & storage (MinIO/Mongo)

7.1 Buckets & prefixes (MinIO)

s3://stellaops/
  scanner/
    layers/<sha256>/sbom.cdx.json.zst
    images/<imgDigest>/inventory.cdx.pb
    images/<imgDigest>/usage.cdx.pb
    diffs/<old>_<new>/diff.json.zst
    attest/<artifactSha256>.dsse.json
  feedser/
    json/<exportId>/...
    trivy/<exportId>/...
  excititor/
    exports/<exportId>/...
  attestor/
    dsse/<bundleSha256>.json
    proof/<rekorUuid>.json

7.2 ILM classes

  • short: working artifacts (diffs, queues); TTL 7–14 days.
  • default: SBOMs & indexes; TTL 90–180 days (configurable).
  • compliance: signed reports & attested exports; Object Lock (governance/compliance) 1–7 years.

7.3 Artifact Lifecycle Controller (ALC)

  • A background worker (part of Scanner.WebService) enforces TTL and reference counting:

    • Artifacts referenced by reports or tickets are pinned.
    • ILM actions logged; UI shows per-class usage & upcoming purges.
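
The interaction between TTL and pinning can be sketched as a single predicate. This is an illustration under stated assumptions: the TTL values use the upper bounds of the ranges in §7.2, the compliance class is never purged by this function (retention is left to Object Lock), and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed TTLs: upper bounds of the ranges given for each ILM class.
ILM_TTL = {
    "short": timedelta(days=14),
    "default": timedelta(days=180),
}


def should_purge(
    ilm_class: str, created_at: datetime, ref_count: int, now: datetime
) -> bool:
    """Decide whether an artifact is eligible for purging.

    Pinned artifacts (ref_count > 0) are never purged. Classes without a
    TTL entry, including "compliance", are left untouched.
    """
    if ref_count > 0:
        return False
    ttl = ILM_TTL.get(ilm_class)
    if ttl is None:
        return False
    return now - created_at > ttl
```

For instance, an unreferenced `short` artifact created 15 days ago is eligible for purging, while the same artifact referenced by one report is kept.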

7.4 Mongo retention

  • Scanner: runtime.events use TTL (e.g., 30–90 days); catalog permanent.
  • Feedser/Excititor: raw docs keep last N windows; canonical stores permanent.
  • Attestor: entries permanent; dedupe TTL 24–48h.

8) Observability & SLOs (operations)

  • Uptime SLO: 99.9% for Signer/Authority/Attestor; 99.5% for Scanner WebService; Excititor/Feedser 99.0%.

  • Error budgets: tracked per month; dashboards show burn rates.

  • Golden signals:

    • Latency: token issuance, sign→attest round-trip, scan enqueue→emit, export build.
    • Saturation: queue depth, Mongo write IOPS, MinIO net throughput.
    • Traffic: scans/min, attestations/min, webhook admits/min.
    • Errors: 5xx rates, cosign verification failures, Rekor timeouts.

Prometheus + OTLP; Grafana dashboards ship in the charts.
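
As a worked example of the error-budget arithmetic, the sketch below converts an uptime SLO into allowed downtime and derives a simple burn rate. It assumes a 30-day window as an approximation of "per month"; the function names are hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for the window at the given uptime SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def burn_rate(
    downtime_minutes: float,
    elapsed_days: float,
    slo_percent: float,
    window_days: int = 30,
) -> float:
    """Ratio of budget consumed to budget 'earned' so far.

    A value above 1.0 means the budget runs out before the window ends
    if the current pace continues.
    """
    budget = error_budget_minutes(slo_percent, window_days)
    allowed_so_far = budget * elapsed_days / window_days
    return downtime_minutes / allowed_so_far
```

Under these assumptions, 99.9% allows roughly 43.2 minutes of downtime per 30 days, 99.5% roughly 216 minutes, and 99.0% roughly 432 minutes.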


9) Security & compliance operations

  • Key rotation:

    • Authority JWKS: 60-day cadence, dual-key overlap.
    • Release signing identities: rotate per minor or quarterly.
    • Sigstore roots mirrored and pinned; alarms on drift.
  • FIPS mode (Gov build):

    • Enforce ES256 + KMS/HSM; disable Ed25519; MLS ciphers only.
    • Local Rekor v2 and Fulcio alternatives; air-gapped CA.
  • Vulnerability response:

    • Feedser red-flag advisories trigger accelerated stable patch rollout; UI/CLI “security patch available” notice.
  • Backups/DR:

    • Mongo nightly snapshots; MinIO versioning + replication (if configured).
    • Restore runbooks tested quarterly with synthetic data.

10) Customer update flow (how versions are fetched & activated)

10.1 Online clusters

  • UI surfaces update banner with release manifest diff and risk notes.
  • Operator approves → Controller pulls new images by digest; healthchecks; moves traffic; deprecates old revision.
  • Post-switch, schema Phase B migrations (if any) run automatically.

10.2 Air-gapped clusters

  • Operator downloads offline kit from a mirror → stellaops offline kit import.
  • Controller validates bundle checksums and cosign signatures; applies charts/compose by digest.
  • After install, verify page shows green checks: image sigs, SBOMs attached, provenance logged.

10.3 CLI self-update (optional)

  • stellaops self-update pulls a signed release manifest and verifies the CLI binary with cosign before swapping (admin can disable).

11) Compatibility & deprecation policy

  • APIs are stable within a major; breaking changes imply MAJOR++ and deprecation period of one minor.
  • Storage: expand/contract; “drop old fields” only after one minor grace.
  • Config: feature flags (default off) for risky features (e.g., eBPF).

12) Runbooks (selected)

12.1 Lost PoE

  1. Suspend automatic attestation jobs.
  2. Use CLI stellaops signer status to confirm entitlement_denied.
  3. Obtain new PoE from portal; verify on Signer /poe/verify.
  4. Re-enable; optionally re-sign last N reports (UI button → batch).

12.2 Rekor outage (self-hosted)

  • Attestor returns 202 (pending) with queued proof fetch.
  • Keep DSSE bundles locally; resubmit on schedule; UI badge shows Pending.
  • If outage > SLA, you can switch to a mirror log in config; Attestor writes to both when restored.

12.3 Emergency downgrade

  • Identify prior release manifest (UI → Admin → Releases).
  • helm rollback stella <revision> (or compose apply previous file).
  • Services tolerate skew per §1.3; ensure Signer/Authority/Attestor are rolled together.

13) Example: cluster bootstrap (Compose)

version: "3.9"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority@sha256:...
    env_file: ./env/authority.env
    ports: ["8440:8440"]
  signer:
    image: registry.stella-ops.org/stellaops/signer@sha256:...
    depends_on: [authority]
    environment:
      - SIGNER__POE__LICENSING__INTROSPECTURL=https://www.stella-ops.org/api/v1/license/introspect
  attestor:
    image: registry.stella-ops.org/stellaops/attestor@sha256:...
    depends_on: [signer]
  scanner-web:
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:...
    environment:
      - SCANNER__S3__ENDPOINT=http://minio:9000
  scanner-worker:
    image: registry.stella-ops.org/stellaops/scanner-worker@sha256:...
    deploy: { replicas: 4 }
  feedser:
    image: registry.stella-ops.org/stellaops/feedser@sha256:...
  excititor:
    image: registry.stella-ops.org/stellaops/excititor@sha256:...
  web-ui:
    image: registry.stella-ops.org/stellaops/web-ui@sha256:...
  mongo:
    image: mongo:7
  minio:
    image: minio/minio:RELEASE.2025-07-10T00-00-00Z

14) Governance & keys (who owns the trust root)

  • Release key policy: only the Release Engineering group can push signed releases; 4-eyes approval; TUF-style manifest possible in future.
  • Signer acceptance policy: embedded release identities are updated only via minor upgrade; emergency CRL supported.
  • Customer keys: none needed for core use; enterprise add-ons may require per-customer registries and keys.

15) Roadmap (Ops)

  • Windows containers GA (Scanner + Zastava).
  • Key Transparency for Signer certs.
  • Delta-kit (offline) for incremental updates.
  • Operator CRDs (K8s) to manage policy and ILM declaratively.
  • SBOM protobuf as default transport at rest (smaller, faster).

Appendix A — Minimal SLO monitors

  • authority.tokens_issued_total slope ≈ normal.
  • signer.requests_total{result="success"}/minute > 0 (when scans occur).
  • attestor.submit_latency_seconds{quantile="0.95"} < 0.3.
  • scanner.scan_latency_seconds{quantile="0.95"} < target per image size.
  • feedser.export.duration_seconds stable; excititor.consensus.conflicts_total not exploding after policy changes.
  • MinIO s3_requests_errors_total near zero; Mongo opcounters hit expected baseline.

Appendix B — Upgrade safety checklist

  • Verify release manifest signature.
  • Ensure Signer/Authority/Attestor are same minor.
  • Verify DB backups < 24h old.
  • Confirm ILM won't purge compliance artifacts during upgrade window.
  • Roll one component at a time; watch SLOs; abort on regression.

End — component_architecture_devops.md