component_architecture_devops.md — Stella Ops Release & Operations (2025Q4)
Scope. Implementation‑ready blueprint for how Stella Ops is built, versioned, signed, distributed, upgraded, licensed (PoE), and operated in customer environments (online and air‑gapped). Covers reproducible builds, supply‑chain attestations, registries, offline kits, migration/rollback, artifact lifecycle (MinIO/Mongo), monitoring SLOs, and customer activation.
0) Product vision (operations lens)
Stella Ops must be trustable at a glance and boringly operable:
- Every release ships with first‑party SBOMs, provenance, and signatures; services verify each other’s integrity at runtime.
- Customers can deploy by digest and stay aligned with LTS/stable/edge channels.
- Paid customers receive attestation authority (Signer accepts their PoE) while the core platform remains free to run.
- Air‑gapped customers receive offline kits with verifiable digests and deterministic import.
- Artifacts expire predictably; operators know what’s kept, for how long, and why.
1) Release trains & versioning
1.1 Channels
- LTS (12‑month support window): quarterly cadence (Q1/Q2/Q3/Q4).
- Stable (default): monthly rollup (bug fixes + compatible features).
- Edge: weekly; for early adopters, no guarantees.
1.2 Version strings
Semantic core + calendar tag:
<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>) e.g., 2.4.1 (2027.06)
- MAJOR: breaking API/DB changes (rare).
- MINOR: new features, compatible schema migrations (expand/contract pattern).
- PATCH: bug fixes, perf and security updates.
- Calendar tag exposes release year used by Signer for PoE window checks.
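The version format above can be parsed mechanically. The following is a minimal sketch, assuming the string always has the exact shape `<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)`; the names `ReleaseVersion` and `parse_release_version` are illustrative and not part of any Stella Ops API.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    """Hypothetical container for the semantic core plus calendar tag."""
    major: int
    minor: int
    patch: int
    year: int
    month: int


# Matches "<MAJOR>.<MINOR>.<PATCH> (<YYYY>.<MM>)", e.g. "2.4.1 (2027.06)".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+) \((\d{4})\.(\d{2})\)$")


def parse_release_version(text: str) -> ReleaseVersion:
    """Split a release string into its semver and calendar parts."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a release version string: {text!r}")
    version = ReleaseVersion(*(int(part) for part in match.groups()))
    if not 1 <= version.month <= 12:
        raise ValueError(f"calendar month out of range: {text!r}")
    return version
```

A caller doing a PoE window check would read `parse_release_version("2.4.1 (2027.06)").year` to obtain the release year.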
1.3 Component alignment
A release is a bundle of image digests + charts + manifests. All services in a bundle are wire‑compatible. Mixed minor versions are allowed within a bounded skew:
- Web UI ↔ backend: ±1 minor.
- Scanner ↔ Policy/Excititor/Concelier: ±1 minor.
- Authority/Signer/Attestor triangle: must be same minor (crypto and DPoP/mTLS binding rules).
At startup, services self‑advertise their semver & channel; the UI surfaces mismatch warnings.
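The skew rules can be expressed as a small table of component pairs and the minor-version distance each pair tolerates. This is a sketch only: the component keys, the `RULES` table, and the choice to treat a differing major as a mismatch are assumptions made for illustration, not the platform's actual check.

```python
Version = tuple[int, int]  # (major, minor)

# (left, right, allowed minor skew). Component keys are hypothetical.
RULES = [
    ("web-ui", "scanner-web", 1),
    ("scanner-web", "concelier", 1),
    ("scanner-web", "excititor", 1),
    ("authority", "signer", 0),    # trust triangle: same minor required
    ("signer", "attestor", 0),
    ("authority", "attestor", 0),
]


def skew_warnings(versions: dict[str, Version]) -> list[str]:
    """Return one warning per rule that the advertised versions violate."""
    warnings = []
    for left, right, max_skew in RULES:
        if left not in versions or right not in versions:
            continue  # component not deployed; nothing to compare
        (lmaj, lmin), (rmaj, rmin) = versions[left], versions[right]
        # Assumption: a major mismatch is always out of bounds.
        if lmaj != rmaj or abs(lmin - rmin) > max_skew:
            warnings.append(
                f"{left} {lmaj}.{lmin} vs {right} {rmaj}.{rmin}: "
                f"allowed minor skew is {max_skew}"
            )
    return warnings
```

An empty list means the advertised versions are within bounds; anything else is what a UI could surface as a mismatch warning.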
2) Supply‑chain pipeline (how a release is built)
2.1 Deterministic builds
- Builders: isolated BuildKit workers with pinned base images (digest only).
- Pinning: lock files or `go.mod`, `package-lock.json`, `global.json`, `Directory.Packages.props` are frozen at tag.
- Reproducibility: timestamps normalized; source date epoch; deterministic zips/tars.
- Multi‑arch: linux/amd64 + linux/arm64 (Windows images track M2 roadmap).
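To illustrate what "deterministic tars" means in practice, here is a minimal sketch using only the Python standard library. It assumes the archive content is already in memory and that the caller supplies the epoch (typically from the `SOURCE_DATE_EPOCH` environment variable); the function name is illustrative and this is not the actual build tooling.

```python
import gzip
import io
import tarfile


def deterministic_tar_gz(files: dict[str, bytes], epoch: int) -> bytes:
    """Build a .tar.gz whose bytes depend only on file names, contents, and epoch."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):          # fixed member order
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = epoch              # normalized timestamp
            info.mode = 0o644               # fixed permissions
            info.uid = info.gid = 0         # no builder-specific ownership
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(files[name]))
    out = io.BytesIO()
    # The gzip header also embeds a timestamp and file name; pin both.
    with gzip.GzipFile(filename="", mode="wb", fileobj=out, mtime=epoch) as gz:
        gz.write(raw.getvalue())
    return out.getvalue()
```

Two builders that call this with the same inputs and the same epoch, on the same Python and zlib versions, should produce byte-identical archives, which is what lets a post-build check compare digests.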
2.2 First‑party SBOMs & provenance
- Each image gets CycloneDX (JSON+Protobuf) SBOM and SLSA‑style provenance attached as OCI referrers.
- Scanner’s Buildx generator is used to produce SBOMs during build; a separate post‑build scan verifies parity (red flag if drift).
- Release manifest (see §6.1) lists all digests and SBOM/attestation refs.
2.3 Signing & transparency
- Images are cosign‑signed (keyless) with a Stella Ops release identity; inclusion in a transparency log (Rekor) is required.
- SBOM and provenance attestations are DSSE and also transparency‑logged.
- Release keys (Fulcio roots or public keys) are embedded in Signer policy (for scanner‑release validation at customer side).
2.4 Gates & tests
- Static: linters, codegen checks, protobuf API freeze (backward‑compat tests).
- Unit/integration: per‑component, plus end‑to‑end flows (scan→vex→policy→sign→attest).
- Perf SLOs: hot paths (SBOM compose, diff, export) measured against budgets.
- Security: dependency audit vs Concelier export; container hardening tests; minimal caps.
- Canary cohort: internal staging + selected customers; one week on edge before stable tag.
3) Distribution & activation
3.1 Registries
- Primary: `registry.stella-ops.org` (OCI v2, supports Referrers API).
- Mirrors: GHCR (read‑only), regional mirrors for latency.
  - Operational runbook: see `docs/ops/concelier-mirror-operations.md` for deployment profiles, CDN guidance, and sync automation.
- Pull by digest only in Kubernetes/Compose manifests.
Gating policy:
- Core images (Authority, Scanner, Concelier, Excititor, Attestor, UI): public read.
- Enterprise add‑ons (if any) and pre‑release: private repos via OAuth2 token service.
Monetization lever is signing (PoE gate), not image pulls, so the core remains simple to consume.
3.2 OAuth2 token service (for private repos)
- Docker Registry’s token flow backed by Authority:
  - Client hits registry (`401` with `WWW-Authenticate: Bearer realm=…`).
  - Client gets an access token from the token service (validated by Authority) with `scope=repository:…:pull`.
  - Registry allows pull for the requested repo.
- Tokens are short‑lived (60–300 s) and DPoP‑bound.
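The first two steps of the token flow amount to reading the `WWW-Authenticate` challenge and turning it into a token request. The sketch below shows only that parsing step, following the standard registry Bearer challenge format; the host names in the usage note are placeholders, and DPoP binding and the actual HTTP calls are deliberately left out.

```python
import re
from urllib.parse import urlencode


def parse_bearer_challenge(header: str) -> dict[str, str]:
    """Extract key="value" pairs from a 'Bearer ...' WWW-Authenticate value."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError(f"unsupported auth scheme: {scheme!r}")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def token_request_url(header: str) -> str:
    """Build the token-service URL the client should call next."""
    fields = parse_bearer_challenge(header)
    # Only forward the parameters the token service needs.
    query = {key: fields[key] for key in ("service", "scope") if key in fields}
    return fields["realm"] + "?" + urlencode(query)
```

For a challenge such as `Bearer realm="https://auth.example.test/token",service="registry.example.test",scope="repository:stellaops/addon:pull"`, the client would request the realm URL with `service` and `scope` as query parameters and then retry the pull with the returned token.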
3.3 Offline kits (air‑gapped)
- Tarball per release channel:

  stellaops-kit-<ver>-<channel>.tar.zst
    /images/     OCI layout with all first-party images (multi-arch)
    /sboms/      CycloneDX JSON+PB for each image
    /attest/     DSSE bundles + Rekor proofs
    /charts/     Helm charts + values templates
    /compose/    docker-compose.yml + .env template
    /plugins/    Concelier/Excititor connectors (restart-time)
    /policy/     example policies
    /manifest/   release.yaml (see §6.1)

- Import via CLI `offline kit import`; checks digests and signatures before load.
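The digest half of the import check can be pictured as follows. This is a sketch under stated assumptions: the `expected` mapping of kit-relative paths to `sha256:<hex>` strings is a hypothetical input (the real kit's checksum layout is not specified here), and signature verification is out of scope.

```python
import hashlib
from pathlib import Path


def sha256_digest(path: Path) -> str:
    """Stream a file and return its digest in 'sha256:<hex>' form."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return "sha256:" + hasher.hexdigest()


def find_digest_mismatches(kit_root: Path, expected: dict[str, str]) -> list[str]:
    """Return kit-relative paths that are missing or whose digest differs."""
    mismatches = []
    for relative, wanted in sorted(expected.items()):
        candidate = kit_root / relative
        if not candidate.is_file() or sha256_digest(candidate) != wanted:
            mismatches.append(relative)
    return mismatches
```

An importer following this pattern would refuse to load anything unless the returned list is empty.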
4) Licensing (PoE) & monetization
Principle: Only paid Stella Ops issues valid signed attestations. Running the stack is free; signing requires PoE.
4.1 PoE issuance
- Customers purchase a plan and obtain a PoE artifact from `www.stella-ops.org`:
  - PoE‑JWT (DPoP/mTLS‑bound) or PoE mTLS client certificate.
  - Contains: `license_id`, `plan`, `valid_release_year`, `max_version`, `exp`, optional `tenant`/`customer` IDs.
4.2 Online enforcement
- Signer calls Licensing /license/introspect on every signing request (see signer doc).
- If revoked/expired/out‑of‑window → deny with machine‑readable reason.
- All valid bundles are DSSE‑signed and Attestor logs them; Rekor UUID returned.
- UI badges: “Verified by Stella Ops” with link to the public log.
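The deny path can be sketched as a pure function over the PoE fields. Several things here are assumptions for illustration: the `revoked` flag, the reason codes, and the rule that a release is in window when its calendar year does not exceed `valid_release_year`. The real introspection response and reason vocabulary are defined in the Signer documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PoE:
    """Subset of PoE fields relevant to the window check (shape is hypothetical)."""
    license_id: str
    valid_release_year: int
    exp: int               # expiry, seconds since the Unix epoch
    revoked: bool = False  # assumed to come from license introspection


def poe_denial_reason(poe: PoE, release_year: int, now: int) -> Optional[str]:
    """Return a machine-readable reason to deny signing, or None to allow."""
    if poe.revoked:
        return "poe_revoked"
    if now >= poe.exp:
        return "poe_expired"
    # Assumption: releases newer than the licensed year are out of window.
    if release_year > poe.valid_release_year:
        return "release_out_of_window"
    return None
```

Here `release_year` would be taken from the calendar tag of the scanner release being validated.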
4.3 Air‑gapped / offline
- Customers obtain a time‑boxed PoE lease (signed JSON, 7–30 days).
- Signer accepts the lease and emits provisional attestations (clearly labeled).
- When connectivity returns, a background job endorses the provisional entries with the cloud service, updating their status to verified.
- Operators can export a verification bundle for auditors even before endorsement (contains DSSE + local Rekor proof + lease snapshot).
4.4 Stolen/abused PoE
- Customers report theft; Licensing flags `license_id` as revoked.
- Subsequent Signer requests are denied; previous attestations remain but can be marked contested (UI shows badge, optional re‑sign path upon new PoE).
5) Deployment path (customer side)
5.1 First install
- Helm (Kubernetes) or Compose (VMs). Example (K8s):
helm repo add stellaops https://charts.stella-ops.org
helm install stella stellaops/platform \
--version 2.4.0 \
--set global.channel=stable \
--set authority.issuer=https://authority.stella.local \
--set scanner.minio.endpoint=http://minio.stella.local:9000 \
--set scanner.mongo.uri=mongodb://mongo/scanner \
--set concelier.mongo.uri=mongodb://mongo/concelier \
--set excititor.mongo.uri=mongodb://mongo/excititor
- Post‑install job registers Authority clients (Scanner, Signer, Attestor, UI) and prints bootstrap URLs and client credentials (sealed secrets).
- UI banner shows release bundle and verification state (cosign OK? Rekor OK?).
5.2 Updates
- Blue/green: pull new bundle by digest; deploy side‑by‑side; cut traffic.
- Rolling: upgrade components in a safe order:
  - Authority (stateless, dual‑key rotation ready)
  - Signer/Attestor (same minor)
  - Scanner WebService & Workers
  - Concelier, then Excititor (schema migrations are expand/contract)
  - UI last
- DB migrations are expand/contract:
  - Phase A (release N): add new fields/indexes, write old+new.
  - Phase B (N+1): read new fields; drop old.
  - Rollback is a matter of redeploying previous images and keeping both schemas valid.
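The expand/contract phases can be illustrated on a single document. In this sketch the field names `sev` (old) and `severity` (new) are invented for the example, and plain dictionaries stand in for Mongo documents.

```python
OLD_FIELD = "sev"       # hypothetical legacy field
NEW_FIELD = "severity"  # hypothetical replacement field


def write_phase_a(doc: dict, value: str) -> dict:
    """Release N (expand): write both fields so old and new readers work."""
    updated = dict(doc)
    updated[OLD_FIELD] = value
    updated[NEW_FIELD] = value
    return updated


def read_value(doc: dict):
    """Tolerant reader: prefer the new field, fall back to the old one."""
    return doc.get(NEW_FIELD, doc.get(OLD_FIELD))


def contract_phase_b(doc: dict) -> dict:
    """Release N+1 (contract): drop the old field once nothing reads it."""
    return {key: value for key, value in doc.items() if key != OLD_FIELD}
```

Because Phase A documents carry both fields, redeploying the previous images after a failed upgrade still finds the data it expects, which is why rollback stays cheap until Phase B has run.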
5.3 Rollback
- Images referenced by digest; keep previous release manifests `K` versions back.
- `helm rollback` or compose `docker compose -f release-K.yml up -d`.
- Mongo migrations are additive; no destructive changes within a single minor.
6) Release payloads & manifests
6.1 Release manifest (release.yaml)
release:
  version: "2.4.1"
  channel: "stable"
  date: "2027-06-20T12:00:00Z"
  calendar: "2027.06"
components:
  - name: scanner-webservice
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:aa..bb
    sbom: oci://.../referrers/cdx-json@sha256:11..22
    provenance: oci://.../attest/provenance@sha256:33..44
    signature: { rekorUUID: "…" }
  - name: signer
    image: registry.stella-ops.org/stellaops/signer@sha256:cc..dd
    signature: { rekorUUID: "…" }
charts:
  - name: platform
    version: "2.4.1"
    digest: "sha256:ee..ff"
compose:
  file: "docker-compose.yml"
  digest: "sha256:77..88"
checksums:
  sha256: "… digest of this release.yaml …"
The manifest is cosign‑signed; UI/CLI can verify a bundle without talking to registries.
Deployment guardrails: The repository keeps channel-aligned Compose bundles in `deploy/compose/` and Helm overlays in `deploy/helm/stellaops/`. Both sets pull their digests from `deploy/releases/` and are validated by `deploy/tools/validate-profiles.sh` to guarantee lint/dry-run cleanliness.
6.2 Image labels (release metadata)
Each image sets OCI labels:
org.opencontainers.image.version = "2.4.1"
org.opencontainers.image.revision = "<git sha>"
org.opencontainers.image.created = "2027-06-20T12:00:00Z"
org.stellaops.release.calendar = "2027.06"
org.stellaops.release.channel = "stable"
org.stellaops.build.slsaProvenance = "oci://…"
Signer validates scanner image’s cosign identity + calendar tag for release window checks.
7) Artifact lifecycle & storage (MinIO/Mongo)
7.1 Buckets & prefixes (MinIO)
s3://stellaops/
  scanner/
    layers/<sha256>/sbom.cdx.json.zst
    images/<imgDigest>/inventory.cdx.pb
    images/<imgDigest>/usage.cdx.pb
    diffs/<old>_<new>/diff.json.zst
    attest/<artifactSha256>.dsse.json
  concelier/
    json/<exportId>/...
    trivy/<exportId>/...
  excititor/
    exports/<exportId>/...
  attestor/
    dsse/<bundleSha256>.json
    proof/<rekorUuid>.json
7.2 ILM classes
- `short`: working artifacts (diffs, queues); TTL 7–14 days.
- `default`: SBOMs & indexes; TTL 90–180 days (configurable).
- `compliance`: signed reports & attested exports; Object Lock (governance/compliance) 1–7 years.
7.3 Artifact Lifecycle Controller (ALC)
- A background worker (part of Scanner.WebService) enforces TTL and reference counting:
  - Artifacts referenced by reports or tickets are pinned.
  - ILM actions logged; UI shows per‑class usage & upcoming purges.
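The interaction between TTL and pinning can be sketched as a selection function. This is not the controller's implementation: the `Artifact` shape is hypothetical, the TTLs are the upper bounds of the ranges in §7.2, and `compliance` objects are simply skipped on the assumption that Object Lock governs them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Upper bounds of the ILM ranges; "compliance" is intentionally absent.
TTL_BY_CLASS = {
    "short": timedelta(days=14),
    "default": timedelta(days=180),
}


@dataclass
class Artifact:
    key: str
    ilm_class: str
    created: datetime
    ref_count: int = 0  # references from reports or tickets


def purgeable(artifacts: list[Artifact], now: datetime) -> list[str]:
    """Return keys that are past their TTL and not pinned by any reference."""
    keys = []
    for artifact in artifacts:
        ttl = TTL_BY_CLASS.get(artifact.ilm_class)
        if ttl is None:
            continue  # compliance or unknown class: never purged here
        if artifact.ref_count > 0:
            continue  # pinned by a report or ticket
        if now - artifact.created > ttl:
            keys.append(artifact.key)
    return keys
```

Running the same function with a future `now` gives the list of upcoming purges that a UI could display ahead of time.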
7.4 Mongo retention
- Scanner: `runtime.events` use TTL (e.g., 30–90 days); catalog permanent.
- Concelier/Excititor: raw docs keep last N windows; canonical stores permanent.
- Attestor: `entries` permanent; `dedupe` TTL 24–48h.
7.5 Mongo server baseline
- Minimum supported server: MongoDB 4.2+. Driver 3.5.0 removes compatibility shims for 4.0; upstream has already announced 4.0 support will be dropped in upcoming C# driver releases.
- Deploy images: Compose/Helm defaults stay on `mongo:7.x`. For air-gapped installs, refresh Offline Kit bundles so the packaged `mongod` matches ≥4.2.
- Upgrade guard: During rollout, verify replica sets reach FCV `4.2` or above before swapping binaries; automation should hard-stop if FCV is <4.2.
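The upgrade guard reduces to a version comparison per replica-set member. The sketch below assumes the FCV strings have already been collected (for example with the `getParameter` admin command and `featureCompatibilityVersion: 1`); the function names and the member-to-FCV mapping are illustrative.

```python
MIN_FCV = "4.2"


def fcv_at_least(fcv: str, minimum: str = MIN_FCV) -> bool:
    """Compare dotted version strings numerically, not lexically."""
    def as_tuple(value: str) -> tuple[int, ...]:
        return tuple(int(part) for part in value.split("."))
    return as_tuple(fcv) >= as_tuple(minimum)


def lagging_members(member_fcvs: dict[str, str]) -> list[str]:
    """Return the members whose FCV is below the supported minimum."""
    return sorted(name for name, fcv in member_fcvs.items() if not fcv_at_least(fcv))


def guard_rollout(member_fcvs: dict[str, str]) -> None:
    """Hard-stop the rollout if any member is below the minimum FCV."""
    lagging = lagging_members(member_fcvs)
    if lagging:
        raise RuntimeError(f"FCV below {MIN_FCV} on: {', '.join(lagging)}")
```

Comparing integer tuples rather than strings matters once versions reach two digits, since `"10.0" < "4.2"` as text.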
8) Observability & SLOs (operations)
- Uptime SLO: 99.9% for Signer/Authority/Attestor; 99.5% for Scanner WebService; Excititor/Concelier 99.0%.
- Error budgets: tracked per month; dashboards show burn rates.
- Golden signals:
  - Latency: token issuance, sign→attest round‑trip, scan enqueue→emit, export build.
  - Saturation: queue depth, Mongo write IOPS, MinIO net throughput.
  - Traffic: scans/min, attestations/min, webhook admits/min.
  - Errors: 5xx rates, cosign verification failures, Rekor timeouts.
Prometheus + OTLP; Grafana dashboards ship in the charts.
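As a worked example of the error budgets above: over a 30-day month (43,200 minutes), a 99.9% SLO allows about 43.2 minutes of unavailability, 99.5% allows about 216 minutes, and 99.0% allows about 432 minutes. The sketch below reproduces that arithmetic and a simple burn-rate figure; the 30-day month and the function names are assumptions for illustration.

```python
def monthly_error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of allowed unavailability in a month for a given uptime SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def burn_rate(bad_minutes: float, elapsed_minutes: float, slo_percent: float) -> float:
    """Observed error fraction divided by the fraction the SLO allows.

    A value of 1.0 means the budget runs out exactly at the end of the
    period; values above 1.0 mean it runs out early.
    """
    allowed_fraction = (100.0 - slo_percent) / 100.0
    return (bad_minutes / elapsed_minutes) / allowed_fraction
```

For instance, 21.6 bad minutes in the first 15 days against a 99.9% SLO is a burn rate of roughly 1.0, meaning the month's budget is being consumed exactly on pace.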
9) Security & compliance operations
- Key rotation:
  - Authority JWKS: 60‑day cadence, dual‑key overlap.
  - Release signing identities: rotate per minor or quarterly.
  - Sigstore roots mirrored and pinned; alarms on drift.
- FIPS mode (Gov build):
  - Enforce `ES256` + KMS/HSM; disable Ed25519; MLS ciphers only.
  - Local Rekor v2 and Fulcio alternatives; air‑gapped CA.
- Vulnerability response:
  - Concelier red-flag advisories trigger accelerated stable patch rollout; UI/CLI “security patch available” notice.
  - 2025-10: Pinned `MongoDB.Driver` 3.5.0 and `SharpCompress` 0.41.0 across services (DEVOPS-SEC-10-301) to eliminate NU1902/NU1903 warnings surfaced during scanner cache/worker test runs; repacked the local `Mongo2Go` feed so test fixtures inherit the patched dependencies; future bumps follow the same central override pattern.
- Backups/DR:
  - Mongo nightly snapshots; MinIO versioning + replication (if configured).
  - Restore runbooks tested quarterly with synthetic data.
10) Customer update flow (how versions are fetched & activated)
10.1 Online clusters
- UI surfaces update banner with release manifest diff and risk notes.
- Operator approves → Controller pulls new images by digest; health‑checks; moves traffic; deprecates old revision.
- Post‑switch, schema Phase B migrations (if any) run automatically.
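A release manifest diff of the kind shown in the update banner can be computed from two name-to-digest maps. This sketch assumes each manifest has already been reduced to such a map (component name to image digest); the function name and output shape are illustrative.

```python
def manifest_diff(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify components as added, removed, or changed between two releases."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }
```

Components that appear in none of the three lists keep the same digest and do not need to be pulled again.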
10.2 Air‑gapped clusters
- Operator downloads offline kit from a mirror → `stellaops offline kit import`.
- Controller validates bundle checksums and cosign signatures; applies charts/compose by digest.
- After install, verify page shows green checks: image sigs, SBOMs attached, provenance logged.
10.3 CLI self‑update (optional)
- `stellaops self-update` pulls a signed release manifest and verifies the CLI binary with cosign before swapping (admin can disable).
11) Compatibility & deprecation policy
- APIs are stable within a major; breaking changes imply MAJOR++ and deprecation period of one minor.
- Storage: expand/contract; “drop old fields” only after one minor grace.
- Config: feature flags (default off) for risky features (e.g., eBPF).
12) Runbooks (selected)
12.1 Lost PoE
- Suspend automatic attestation jobs.
- Use CLI `stellaops signer status` to confirm `entitlement_denied`.
- Obtain new PoE from portal; verify on Signer `/poe/verify`.
- Re‑enable; optionally re‑sign last N reports (UI button → batch).
12.2 Rekor outage (self‑hosted)
- Attestor returns `202 (pending)` with queued proof fetch.
- Keep DSSE bundles locally; re‑submit on schedule; UI badge shows Pending.
- If outage > SLA, you can switch to a mirror log in config; Attestor writes to both when restored.
12.3 Emergency downgrade
- Identify prior release manifest (UI → Admin → Releases).
- `helm rollback stella <revision>` (or compose apply previous file).
- Services tolerate skew per §1.3; ensure Signer/Authority/Attestor are rolled together.
13) Example: cluster bootstrap (Compose)
version: "3.9"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority@sha256:...
    env_file: ./env/authority.env
    ports: ["8440:8440"]
  signer:
    image: registry.stella-ops.org/stellaops/signer@sha256:...
    depends_on: [authority]
    environment:
      - SIGNER__POE__LICENSING__INTROSPECTURL=https://www.stella-ops.org/api/v1/license/introspect
  attestor:
    image: registry.stella-ops.org/stellaops/attestor@sha256:...
    depends_on: [signer]
  scanner-web:
    image: registry.stella-ops.org/stellaops/scanner-web@sha256:...
    environment:
      - SCANNER__S3__ENDPOINT=http://minio:9000
  scanner-worker:
    image: registry.stella-ops.org/stellaops/scanner-worker@sha256:...
    deploy: { replicas: 4 }
  concelier:
    image: registry.stella-ops.org/stellaops/concelier@sha256:...
  excititor:
    image: registry.stella-ops.org/stellaops/excititor@sha256:...
  web-ui:
    image: registry.stella-ops.org/stellaops/web-ui@sha256:...
  mongo:
    image: mongo:7
  minio:
    image: minio/minio:RELEASE.2025-07-10T00-00-00Z
14) Governance & keys (who owns the trust root)
- Release key policy: only the Release Engineering group can push signed releases; 4‑eyes approval; TUF‑style manifest possible in future.
- Signer acceptance policy: embedded release identities are updated only via minor upgrade; emergency CRL supported.
- Customer keys: none needed for core use; enterprise add‑ons may require per‑customer registries and keys.
15) Roadmap (Ops)
- Windows containers GA (Scanner + Zastava).
- Key Transparency for Signer certs.
- Delta‑kit (offline) for incremental updates.
- Operator CRDs (K8s) to manage policy and ILM declaratively.
- SBOM protobuf as default transport at rest (smaller, faster).
Appendix A — Minimal SLO monitors
- `authority.tokens_issued_total` slope ≈ normal.
- `signer.requests_total{result="success"}/minute` > 0 (when scans occur).
- `attestor.submit_latency_seconds{quantile=0.95}` < 0.3.
- `scanner.scan_latency_seconds{quantile=0.95}` < target per image size.
- `concelier.export.duration_seconds` stable; `excititor.consensus.conflicts_total` not exploding after policy changes.
- MinIO `s3_requests_errors_total` near zero; Mongo `opcounters` hit expected baseline.
Appendix B — Upgrade safety checklist
- Verify release manifest signature.
- Ensure Signer/Authority/Attestor are same minor.
- Verify DB backups < 24h old.
- Confirm ILM won’t purge compliance artifacts during upgrade window.
- Roll one component at a time; watch SLOs; abort on regression.
End — component_architecture_devops.md