component_architecture_devops.md — StellaOps Release & Operations (2025Q4)

Scope. Implementation-ready blueprint for how StellaOps is built, versioned, signed, distributed, upgraded, licensed (PoE), and operated in customer environments (online and air-gapped). Covers reproducible builds, supply-chain attestations, registries, offline kits, migration/rollback, artifact lifecycle (RustFS default + Mongo, S3 fallback), monitoring SLOs, and customer activation.


0) Product vision (operations lens)

StellaOps must be trustable at a glance and boringly operable:

  • Every release ships with first-party SBOMs, provenance, and signatures; services verify each other's integrity at runtime.
  • Customers can deploy by digest and stay aligned with LTS/stable/edge channels.
  • Paid customers receive attestation authority (Signer accepts their PoE) while the core platform remains free to run.
  • Air-gapped customers receive offline kits with verifiable digests and deterministic import.
  • Artifacts expire predictably; operators know what's kept, for how long, and why.

1) Release trains & versioning

1.1 Channels

  • LTS (12-month support window): quarterly cadence (Q1/Q2/Q3/Q4).
  • Stable (default): monthly roll-up (bug fixes + compatible features).
  • Edge: weekly; for early adopters, no guarantees.

1.2 Version strings

Semantic core + calendar tag:

<MAJOR>.<MINOR>.<PATCH>  (<YYYY>.<MM>)   e.g., 2.4.1 (2027.06)
  • MAJOR: breaking API/DB changes (rare).
  • MINOR: new features, compatible schema migrations (expand/contract pattern).
  • PATCH: bug fixes, perf and security updates.
  • Calendar tag exposes release year used by Signer for PoE window checks.
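
As a minimal sketch, the version string described above could be split into its semantic core and calendar tag as follows. The `ReleaseVersion` type and the `parse_release_version` helper are hypothetical names chosen for illustration; they are not taken from the StellaOps code base.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    """Semantic core plus calendar tag, e.g. 2.4.1 (2027.06)."""
    major: int
    minor: int
    patch: int
    year: int
    month: int


# <MAJOR>.<MINOR>.<PATCH>, whitespace, then (<YYYY>.<MM>)
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\s+\((\d{4})\.(\d{2})\)$")


def parse_release_version(text: str) -> ReleaseVersion:
    """Parse a release string; raise ValueError if it does not match the format."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a release version string: {text!r}")
    version = ReleaseVersion(*(int(group) for group in match.groups()))
    if not 1 <= version.month <= 12:
        raise ValueError(f"calendar month out of range: {text!r}")
    return version
```

The `year` field is the part a PoE window check would consume; the semantic fields drive compatibility decisions.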

1.3 Component alignment

A release is a bundle of image digests + charts + manifests. All services in a bundle are wire-compatible. Mixed minor versions are allowed within a bounded skew:

  • Web UI ↔ backend: ±1 minor.
  • Scanner ↔ Policy/Excititor/Concelier: ±1 minor.
  • Authority/Signer/Attestor triangle: must be same minor (crypto and DPoP/mTLS binding rules).

At startup, services self-advertise their semver & channel; the UI surfaces mismatch warnings.
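
The skew rules above can be sketched as a small check over the versions that services advertise. This is an illustration under assumptions: the component names are hypothetical, "backend" is read as every non-UI service, and a major-version mismatch is treated as a violation.

```python
from itertools import combinations

# Component names are illustrative; real service identifiers may differ.
TRIANGLE = ("authority", "signer", "attestor")
SCANNER_PEERS = ("policy", "excititor", "concelier")


def _major_minor(version: str) -> tuple[int, int]:
    major, minor, _patch = version.split(".")
    return int(major), int(minor)


def skew_violations(versions: dict[str, str]) -> list[str]:
    """Return human-readable skew violations for the advertised versions."""
    parsed = {name: _major_minor(value) for name, value in versions.items()}
    problems: list[str] = []

    def check(a: str, b: str, allowed: int) -> None:
        if a not in parsed or b not in parsed:
            return  # component not deployed; nothing to compare
        (major_a, minor_a), (major_b, minor_b) = parsed[a], parsed[b]
        if major_a != major_b or abs(minor_a - minor_b) > allowed:
            problems.append(
                f"{a} {versions[a]} vs {b} {versions[b]} (allowed minor skew {allowed})"
            )

    # Authority/Signer/Attestor must share the same minor.
    for a, b in combinations(TRIANGLE, 2):
        check(a, b, 0)
    # Scanner may differ by one minor from Policy/Excititor/Concelier.
    for peer in SCANNER_PEERS:
        check("scanner", peer, 1)
    # Web UI may differ by one minor from any backend service (assumption).
    for backend in sorted(parsed):
        if backend != "web-ui":
            check("web-ui", backend, 1)
    return problems
```

An empty list means the bundle is within the bounded skew; each entry otherwise corresponds to a mismatch warning.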


2) Supply-chain pipeline (how a release is built)

2.1 Deterministic builds

  • Builders: isolated BuildKit workers with pinned base images (digest only).
  • Pinning: lock files or go.mod, package-lock.json, global.json, Directory.Packages.props are frozen at tag.
  • Reproducibility: timestamps normalized; source date epoch; deterministic zips/tars.
  • Multi-arch: linux/amd64 + linux/arm64 (Windows images track M2 roadmap).
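
To illustrate what "timestamps normalized; deterministic zips/tars" means in practice, the following sketch builds a tar archive whose bytes depend only on file names, contents, and a fixed epoch. It is not the StellaOps build tooling; `deterministic_tar` is a hypothetical helper, and the epoch argument plays the role of `SOURCE_DATE_EPOCH`.

```python
import io
import tarfile


def deterministic_tar(files: dict[str, bytes], source_date_epoch: int) -> bytes:
    """Build an uncompressed tar whose bytes are independent of insertion order."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.GNU_FORMAT) as archive:
        for name in sorted(files):  # stable member order
            data = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = source_date_epoch  # normalized timestamp
            info.mode = 0o644               # fixed permissions
            info.uid = info.gid = 0         # no builder-specific ownership
            info.uname = info.gname = ""
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()
```

Two builders that feed the same inputs and epoch get byte-identical archives, so their digests can be compared directly.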

2.2 First-party SBOMs & provenance

  • Each image gets CycloneDX (JSON+Protobuf) SBOM and SLSA-style provenance attached as OCI referrers.
  • Scanner's Buildx generator is used to produce SBOMs during build; a separate post-build scan verifies parity (red flag if drift).
  • Release manifest (see §6.1) lists all digests and SBOM/attestation refs.

2.3 Signing & transparency

  • Images are cosign-signed (keyless) with a StellaOps release identity; inclusion in a transparency log (Rekor) is required.
  • SBOM and provenance attestations are DSSE and also transparency-logged.
  • Release keys (Fulcio roots or public keys) are embedded in Signer policy (for scanner-release validation at customer side).

2.4 Gates & tests

  • Static: linters, codegen checks, protobuf API freeze (backward-compat tests).
  • Unit/integration: per-component, plus end-to-end flows (scan→vex→policy→sign→attest).
  • Perf SLOs: hot paths (SBOM compose, diff, export) measured against budgets.
  • Security: dependency audit vs Concelier export; container hardening tests; minimal caps.
  • Analyzer smoke: restart-time language plug-ins (currently Python) verified via dotnet run --project tools/LanguageAnalyzerSmoke to ensure manifest integrity plus cold vs warm determinism (<30s / <5s budgets); the harness logs deviations from repository goldens for follow-up.
  • Canary cohort: internal staging + selected customers; one week on edge before stable tag.

2.5 Debug-store artefacts

  • Every release exports stripped debug information for ELF binaries discovered in service images. Debug files follow the GNU build-id layout (debug/.build-id/<aa>/<rest>.debug) and are generated via objcopy --only-keep-debug.
  • debug/debug-manifest.json captures build-id → component/image/source mappings with SHA-256 checksums so operators can mirror the directory into debuginfod or offline symbol stores. The manifest (and its .sha256 companion) ships with every release bundle and Offline Kit.
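
The GNU build-id layout mentioned above maps a build-id to a path by splitting off its first two hex characters. A minimal sketch, with `debug_path` as a hypothetical helper name:

```python
def debug_path(build_id: str) -> str:
    """Map a hex build-id to debug/.build-id/<aa>/<rest>.debug."""
    normalized = build_id.strip().lower()
    if len(normalized) < 3 or any(ch not in "0123456789abcdef" for ch in normalized):
        raise ValueError(f"not a hex build-id: {build_id!r}")
    # First two hex characters form the directory, the remainder the file name.
    return f"debug/.build-id/{normalized[:2]}/{normalized[2:]}.debug"
```

The same mapping is what a symbol server or an offline symbol store would use to locate the file for a given binary.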

3) Distribution & activation

3.1 Registries

  • Primary: registry.stella-ops.org (OCI v2, supports Referrers API).
  • Mirrors: GHCR (read-only), regional mirrors for latency.
    • Operational runbook: see docs/ops/concelier-mirror-operations.md for deployment profiles, CDN guidance, and sync automation.
  • Pull by digest only in Kubernetes/Compose manifests.

Gating policy:

  • Core images (Authority, Scanner, Concelier, Excititor, Attestor, UI): public read.
  • Enterprise add-ons (if any) and pre-release: private repos via the Registry Token Service (src/Registry/StellaOps.Registry.TokenService) which exchanges Authority-issued OpToks for short-lived Docker registry bearer tokens.

Monetization lever is signing (PoE gate), not image pulls, so the core remains simple to consume.

3.2 OAuth2 token service (for private repos)

  • Docker Registry's token flow, backed by Authority:

    1. Client hits registry (401 with WWW-Authenticate: Bearer realm=…).
    2. Client gets an access token from the token service (validated by Authority) with scope=repository:…:pull.
    3. Registry allows pull for the requested repo.
  • Tokens are short-lived (60-300 s) and DPoP-bound.

The token service enforces plan gating via registry-token.yaml (see docs/ops/registry-token-service.md) and exposes Prometheus metrics (registry_token_issued_total, registry_token_rejected_total). Revoked licence identifiers halt issuance even when scope requirements are met.
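
Steps 1 and 2 of the flow above can be sketched from the client side: parse the Bearer challenge returned with the 401 and derive the token request URL from it. The host names in the usage note are placeholders, and DPoP binding and the actual HTTP calls are deliberately left out of this sketch.

```python
import re
from urllib.parse import urlencode


def parse_bearer_challenge(header: str) -> dict[str, str]:
    """Parse 'Bearer realm="...",service="...",scope="..."' into a dict."""
    scheme, _, params = header.strip().partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError(f"unsupported auth scheme: {scheme!r}")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def token_request_url(header: str) -> str:
    """Build the token endpoint URL the client should call next."""
    challenge = parse_bearer_challenge(header)
    # Only service and scope are forwarded; the realm is the endpoint itself.
    query = {key: challenge[key] for key in ("service", "scope") if key in challenge}
    return f"{challenge['realm']}?{urlencode(query)}"
```

For a challenge with realm `https://tokens.example.test/token` and scope `repository:stellaops/addon:pull`, the client would request a token from that realm with the scope passed as a query parameter, then retry the pull with the returned bearer token.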

3.3 Offline kits (air-gapped)

  • Tarball per release channel:

    stellaops-kit-<ver>-<channel>.tar.zst
      /images/   OCI layout with all first-party images (multi-arch)
      /sboms/    CycloneDX JSON+PB for each image
      /attest/   DSSE bundles + Rekor proofs
      /charts/   Helm charts + values templates
      /compose/  docker-compose.yml + .env template
      /plugins/  Concelier/Excititor connectors (restart-time)
      /policy/   example policies
      /manifest/ release.yaml  (see §6.1)
    
  • Import via the CLI (stellaops offline kit import); digests and signatures are checked before load.
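
The digest half of that import check can be sketched as follows. This is not the CLI's implementation: the `expected` mapping of relative paths to SHA-256 digests is an assumed input (for example, extracted from the release manifest), and signature verification is out of scope here.

```python
import hashlib
from pathlib import Path
from typing import BinaryIO


def sha256_of(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large layers do not load into memory."""
    digest = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(root: Path, expected: dict[str, str]) -> list[str]:
    """Compare files under an unpacked kit against their expected digests."""
    problems: list[str] = []
    for relative, digest in sorted(expected.items()):
        path = root / relative
        if not path.is_file():
            problems.append(f"missing: {relative}")
            continue
        with path.open("rb") as stream:
            actual = sha256_of(stream)
        if actual != digest.removeprefix("sha256:"):
            problems.append(f"digest mismatch: {relative}")
    return problems
```

An importer built this way would refuse to load anything unless the returned list is empty.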


4) Licensing (PoE) & monetization

Principle: Only paid StellaOps issues valid signed attestations. Running the stack is free; signing requires PoE.

4.1 PoE issuance

  • Customers purchase a plan and obtain a PoE artifact from www.stella-ops.org:

    • PoE-JWT (DPoP/mTLS-bound) or PoE mTLS client certificate.
    • Contains: license_id, plan, valid_release_year, max_version, exp, optional tenant/customer IDs.

4.2 Online enforcement

  • Signer calls Licensing /license/introspect on every signing request (see signer doc).
  • If revoked/expired/out-of-window → deny with machine-readable reason.
  • All valid bundles are DSSE-signed and Attestor logs them; Rekor UUID returned.
  • UI badges: “Verified by StellaOps” with link to the public log.

4.3 Air-gapped / offline

  • Customers obtain a time-boxed PoE lease (signed JSON, 7-30 days).
  • Signer accepts the lease and emits provisional attestations (clearly labeled).
  • When connectivity returns, a background job endorses the provisional entries with the cloud service, updating their status to verified.
  • Operators can export a verification bundle for auditors even before endorsement (contains DSSE + local Rekor proof + lease snapshot).

4.4 Stolen/abused PoE

  • Customers report theft; Licensing flags license_id as revoked.
  • Subsequent Signer requests are denied; previous attestations remain but can be marked contested (UI shows badge, optional re-sign path upon new PoE).

5) Deployment path (customer side)

5.1 First install

  • Helm (Kubernetes) or Compose (VMs). Example (K8s):
helm repo add stellaops https://charts.stella-ops.org
helm install stella stellaops/platform \
  --version 2.4.0 \
  --set global.channel=stable \
  --set authority.issuer=https://authority.stella.local \
  --set scanner.minio.endpoint=http://minio.stella.local:9000 \
  --set scanner.mongo.uri=mongodb://mongo/scanner \
  --set concelier.mongo.uri=mongodb://mongo/concelier \
  --set excititor.mongo.uri=mongodb://mongo/excititor
  • Post-install job registers Authority clients (Scanner, Signer, Attestor, UI) and prints bootstrap URLs and client credentials (sealed secrets).
  • UI banner shows release bundle and verification state (cosign OK? Rekor OK?).

5.2 Updates

  • Blue/green: pull new bundle by digest; deploy side-by-side; cut traffic.

  • Rolling: upgrade stateful components in safe order:

    1. Authority (stateless, dual-key rotation ready)
    2. Signer/Attestor (same minor)
    3. Scanner WebService & Workers
    4. Concelier, then Excititor (schema migrations are expand/contract)
    5. UI last
  • DB migrations are expand/contract:

    • Phase A (release N): add new fields/indexes, write old+new.
    • Phase B (N+1): read new fields; drop old.
    • Rollback is a matter of redeploying previous images and keeping both schemas valid.
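
The expand/contract phases can be illustrated with plain dictionaries standing in for Mongo documents. The field names `sev` (old) and `severity` (new) are invented for this sketch and do not refer to an actual StellaOps schema.

```python
def write_phase_a(doc: dict, severity: str) -> dict:
    """Release N (expand): write both the old and the new field."""
    return {**doc, "sev": severity, "severity": severity}


def read_previous_release(doc: dict) -> str:
    """Reader from the previous release: only knows the old field."""
    return doc["sev"]


def read_phase_b(doc: dict) -> str:
    """Release N+1: reads the new field only."""
    return doc["severity"]


def contract(doc: dict) -> dict:
    """Release N+1 cleanup: drop the old field once no reader depends on it."""
    return {key: value for key, value in doc.items() if key != "sev"}
```

Because Phase A documents satisfy both readers, redeploying the previous images after release N needs no data migration; only after `contract` has run does the old reader stop working.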

5.3 Rollback

  • Images referenced by digest; keep previous release manifest K versions back.
  • helm rollback, or for Compose: docker compose -f release-K.yml up -d.
  • Mongo migrations are additive; no destructive changes within a single minor.

6) Release payloads & manifests

6.1 Release manifest (release.yaml)

release:
  version: "2.4.1"
  channel: "stable"
  date: "2027-06-20T12:00:00Z"
  calendar: "2027.06"
  components:
    - name: scanner-webservice
      image: registry.stella-ops.org/stellaops/scanner-web@sha256:aa..bb
      sbom: oci://.../referrers/cdx-json@sha256:11..22
      provenance: oci://.../attest/provenance@sha256:33..44
      signature: { rekorUUID: "…" }
    - name: signer
      image: registry.stella-ops.org/stellaops/signer@sha256:cc..dd
      signature: { rekorUUID: "…" }
  charts:
    - name: platform
      version: "2.4.1"
      digest: "sha256:ee..ff"
  compose:
    file: "docker-compose.yml"
    digest: "sha256:77..88"
  checksums:
    sha256: "… digest of this release.yaml …"

The manifest is cosign-signed; UI/CLI can verify a bundle without talking to registries.

Deployment guardrails. The repository keeps channel-aligned Compose bundles in deploy/compose/ and Helm overlays in deploy/helm/stellaops/. Both sets pull their digests from deploy/releases/ and are validated by deploy/tools/validate-profiles.sh to guarantee lint/dry-run cleanliness.
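
One guardrail that follows from the manifest shape in §6.1 is checking that every component image is pinned by digest. The sketch below works on an already-parsed manifest (a plain dict) so it stays dependency-free; it is an illustration, not a description of what validate-profiles.sh does.

```python
import re

# A pinned reference ends in @sha256:<64 lowercase hex characters>.
_PINNED_RE = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def unpinned_components(manifest: dict) -> list[str]:
    """Return names of components whose image is not referenced by digest."""
    components = manifest.get("release", {}).get("components", [])
    return [
        component["name"]
        for component in components
        if not _PINNED_RE.match(component.get("image", ""))
    ]
```

A tag-based reference such as `repo/image:2.4.1` would be reported, since a tag can be moved after the manifest is signed while a digest cannot.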

6.2 Image labels (release metadata)

Each image sets OCI labels:

org.opencontainers.image.version = "2.4.1"
org.opencontainers.image.revision = "<git sha>"
org.opencontainers.image.created = "2027-06-20T12:00:00Z"
org.stellaops.release.calendar = "2027.06"
org.stellaops.release.channel  = "stable"
org.stellaops.build.slsaProvenance = "oci://…"

Signer validates the scanner image's cosign identity + calendar tag for release window checks.


7) Artifact lifecycle & storage (RustFS/Mongo)

7.1 Buckets & prefixes (RustFS)

rustfs://stellaops/
  scanner/
    layers/<sha256>/sbom.cdx.json.zst
    images/<imgDigest>/inventory.cdx.pb
    images/<imgDigest>/usage.cdx.pb
    diffs/<old>_<new>/diff.json.zst
    attest/<artifactSha256>.dsse.json
  concelier/
    json/<exportId>/...
    trivy/<exportId>/...
  excititor/
    exports/<exportId>/...
  attestor/
    dsse/<bundleSha256>.json
    proof/<rekorUuid>.json

7.2 ILM classes

  • short: working artifacts (diffs, queues); TTL 7-14 days.
  • default: SBOMs & indexes; TTL 90-180 days (configurable).
  • compliance: signed reports & attested exports; retention enforced via RustFS hold or S3 Object Lock (governance/compliance) for 1-7 years.

7.3 Artifact Lifecycle Controller (ALC)

  • A background worker (part of Scanner.WebService) enforces TTL and reference counting:

    • Artifacts referenced by reports or tickets are pinned.
    • ILM actions logged; UI shows per-class usage & upcoming purges.
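
The TTL-plus-pinning decision can be sketched as below. The TTL values are example settings rather than product defaults, the `Artifact` shape is hypothetical, and compliance-class objects are simply skipped because their retention is enforced by the storage hold instead of this worker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Example TTLs in days; real values are configurable per deployment.
TTL_DAYS = {"short": 14, "default": 180}


@dataclass(frozen=True)
class Artifact:
    key: str
    ilm_class: str
    created: datetime
    reference_count: int = 0  # reports or tickets that reference the artifact


def purge_candidates(artifacts: list[Artifact], now: datetime) -> list[str]:
    """Return keys of artifacts that are past TTL and not pinned."""
    due: list[str] = []
    for artifact in artifacts:
        ttl = TTL_DAYS.get(artifact.ilm_class)
        if ttl is None:
            continue  # compliance (or unknown) class: never purged by TTL here
        if artifact.reference_count > 0:
            continue  # pinned by a report or ticket
        if now - artifact.created > timedelta(days=ttl):
            due.append(artifact.key)
    return due
```

Logging the returned keys before deleting them gives the "upcoming purges" view a natural data source.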

Migration note. Follow docs/ops/scanner-rustfs-migration.md when transitioning existing MinIO buckets to RustFS. The provided migrator is idempotent and safe to rerun per prefix.

7.4 Mongo retention

  • Scanner: runtime.events use TTL (e.g., 30-90 days); catalog permanent.
  • Concelier/Excititor: raw docs keep last N windows; canonical stores permanent.
  • Attestor: entries permanent; dedupe TTL 24-48h.

7.5 Mongo server baseline

  • Minimum supported server: MongoDB 4.2+. Driver 3.5.0 removes compatibility shims for 4.0; upstream has already announced that 4.0 support will be dropped in upcoming C# driver releases.
  • Deploy images: Compose/Helm defaults stay on mongo:7.x. For air-gapped installs, refresh Offline Kit bundles so the packaged mongod matches ≥4.2.
  • Upgrade guard: During rollout, verify replica sets reach FCV 4.2 or above before swapping binaries; automation should hard-stop if FCV is <4.2.
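
The upgrade guard can be sketched as a check on the reply of MongoDB's `getParameter` admin command with `featureCompatibilityVersion: 1`, which reports the value under `featureCompatibilityVersion.version`. The helper names are hypothetical, and obtaining the reply from a driver is left out so the sketch stays dependency-free.

```python
MINIMUM_FCV = (4, 2)


def fcv_from_reply(reply: dict) -> tuple[int, int]:
    """Extract (major, minor) from a getParameter featureCompatibilityVersion reply."""
    version = reply["featureCompatibilityVersion"]["version"]
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def fcv_supported(reply: dict) -> bool:
    """True when the replica set's FCV is at or above the supported minimum."""
    return fcv_from_reply(reply) >= MINIMUM_FCV


def guard_rollout(reply: dict) -> None:
    """Hard-stop the rollout when FCV is below the minimum."""
    if not fcv_supported(reply):
        raise SystemExit(f"FCV {fcv_from_reply(reply)} is below {MINIMUM_FCV}; aborting")
```

Comparing integer tuples rather than strings avoids ordering mistakes once versions reach two digits.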

8) Observability & SLOs (operations)

  • Uptime SLO: 99.9% for Signer/Authority/Attestor; 99.5% for Scanner WebService; Excititor/Concelier 99.0%.

  • Error budgets: tracked per month; dashboards show burn rates.

  • Golden signals:

    • Latency: token issuance, sign→attest round-trip, scan enqueue→emit, export build.
    • Saturation: queue depth, Mongo write IOPS, RustFS throughput / queue depth (or S3 metrics when in fallback mode).
    • Traffic: scans/min, attestations/min, webhook admits/min.
    • Errors: 5xx rates, cosign verification failures, Rekor timeouts.

Prometheus + OTLP; Grafana dashboards ship in the charts.
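
For orientation, the arithmetic behind an uptime SLO and its monthly error budget can be sketched as follows, assuming a 30-day window. A 99.9% target leaves roughly 43 minutes of budget per month, and a burn rate above 1 means the budget would run out before the window ends. Function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for the given SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def burn_rate(bad_events: int, total_events: int, slo_percent: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows."""
    allowed = (100.0 - slo_percent) / 100.0
    if allowed <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    return (bad_events / total_events) / allowed
```

With the targets listed above, Scanner WebService at 99.5% gets about five times the monthly budget of the Signer/Authority/Attestor group at 99.9%.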


9) Security & compliance operations

  • Key rotation:

    • Authority JWKS: 60-day cadence, dual-key overlap.
    • Release signing identities: rotate per minor or quarterly.
    • Sigstore roots mirrored and pinned; alarms on drift.
  • FIPS mode (Gov build):

    • Enforce ES256 + KMS/HSM; disable Ed25519; MLS ciphers only.
    • Local Rekor v2 and Fulcio alternatives; air-gapped CA.
  • Vulnerability response:

    • Concelier red-flag advisories trigger accelerated stable patch rollout; UI/CLI “security patch available” notice.
    • 2025-10: Pinned MongoDB.Driver 3.5.0 and SharpCompress 0.41.0 across services (DEVOPS-SEC-10-301) to eliminate NU1902/NU1903 warnings surfaced during scanner cache/worker test runs; repacked the local Mongo2Go feed so test fixtures inherit the patched dependencies; future bumps follow the same central override pattern.
  • Backups/DR:

    • Mongo nightly snapshots; MinIO versioning + replication (if configured).
    • Restore runbooks tested quarterly with synthetic data.

10) Customer update flow (how versions are fetched & activated)

10.1 Online clusters

  • UI surfaces update banner with release manifest diff and risk notes.
  • Operator approves → Controller pulls new images by digest; healthchecks; moves traffic; deprecates old revision.
  • Post-switch, schema Phase B migrations (if any) run automatically.

10.2 Air-gapped clusters

  • Operator downloads offline kit from a mirror → stellaops offline kit import.
  • Controller validates bundle checksums and cosign signatures; applies charts/compose by digest.
  • After install, verify page shows green checks: image sigs, SBOMs attached, provenance logged.

10.3 CLI self-update (optional)

  • stellaops self-update pulls a signed release manifest and verifies the CLI binary with cosign before swapping (admin can disable).

11) Compatibility & deprecation policy

  • APIs are stable within a major; breaking changes imply MAJOR++ and deprecation period of one minor.
  • Storage: expand/contract; “drop old fields” only after one minor grace.
  • Config: feature flags (default off) for risky features (e.g., eBPF).

12) Runbooks (selected)

12.1 Lost PoE

  1. Suspend automatic attestation jobs.
  2. Use CLI stellaops signer status to confirm entitlement_denied.
  3. Obtain new PoE from portal; verify on Signer /poe/verify.
  4. Re-enable; optionally re-sign last N reports (UI button → batch).

12.2 Rekor outage (self-hosted)

  • Attestor returns 202 (pending) with queued proof fetch.
  • Keep DSSE bundles locally; resubmit on schedule; UI badge shows Pending.
  • If outage > SLA, you can switch to a mirror log in config; Attestor writes to both when restored.

12.3 Emergency downgrade

  • Identify prior release manifest (UI → Admin → Releases).
  • helm rollback stella <revision> (or compose apply previous file).
  • Services tolerate skew per §1.3; ensure Signer/Authority/Attestor are rolled together.

13) Example: cluster bootstrap (Compose)

version: "3.9"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority@sha256:...
    env_file: ./env/authority.env
    ports: ["8440:8440"]
  signer:
    image: registry.stella-ops.org/stellaops/signer@sha256:...
    depends_on: [authority]
    environment:
      - SIGNER__POE__LICENSING__INTROSPECTURL=https://www.stella-ops.org/api/v1/license/introspect
  attestor:
    image: registry.stella-ops.org/stellaops/attestor@sha256:...
    depends_on: [signer]
  scanner-web:
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:...
    environment:
      - SCANNER__S3__ENDPOINT=http://minio:9000
  scanner-worker:
    image: registry.stella-ops.org/stellaops/scanner-worker@sha256:...
    deploy: { replicas: 4 }
  concelier:
    image: registry.stella-ops.org/stellaops/concelier@sha256:...
  excititor:
    image: registry.stella-ops.org/stellaops/excititor@sha256:...
  web-ui:
    image: registry.stella-ops.org/stellaops/web-ui@sha256:...
  mongo:
    image: mongo:7
  minio:
    image: minio/minio:RELEASE.2025-07-10T00-00-00Z

14) Governance & keys (who owns the trust root)

  • Release key policy: only the Release Engineering group can push signed releases; 4-eyes approval; TUF-style manifest possible in future.
  • Signer acceptance policy: embedded release identities are updated only via minor upgrade; emergency CRL supported.
  • Customer keys: none needed for core use; enterprise add-ons may require per-customer registries and keys.

15) Roadmap (Ops)

  • Windows containers GA (Scanner + Zastava).
  • Key Transparency for Signer certs.
  • Delta-kit (offline) for incremental updates.
  • Operator CRDs (K8s) to manage policy and ILM declaratively.
  • SBOM protobuf as default transport at rest (smaller, faster).

Appendix A — Minimal SLO monitors

  • authority.tokens_issued_total slope ≈ normal.
  • signer.requests_total{result="success"}/minute > 0 (when scans occur).
  • attestor.submit_latency_seconds{quantile=0.95} < 0.3.
  • scanner.scan_latency_seconds{quantile=0.95} < target per image size.
  • concelier.export.duration_seconds stable; excititor.consensus.conflicts_total not exploding after policy changes.
  • RustFS request error rate near zero (or s3_requests_errors_total when operating against S3); Mongo opcounters hit expected baseline.

Appendix B — Upgrade safety checklist

  • Verify release manifest signature.
  • Ensure Signer/Authority/Attestor are same minor.
  • Verify DB backups < 24h old.
  • Confirm ILM won't purge compliance artifacts during upgrade window.
  • Roll one component at a time; watch SLOs; abort on regression.

End — component_architecture_devops.md