High-Level Architecture: StellaOps (Consolidated • 2025Q4)

Purpose. A complete, implementation-ready map of StellaOps: product vision, all runtime components, trust boundaries, tokens/licensing, control/data flows, storage, APIs, security, scale, DevOps, and verification logic.

Scope. This file replaces the separate components.md; all component details now live here.


0) Product vision & principles

Vision. StellaOps is a deterministic SBOM + VEX platform for CI/CD and runtime, tuned for speed (per-layer deltas), quiet output (usage-scoped views), and verifiability (DSSE + Rekor v2). It is self-hostable, air-gap capable, and commercially enforceable: only licensed installations can produce StellaOps-verified attestations.

Operating principles.

  • Scanner-owned SBOMs. We generate our own BOMs; we do not warehouse third-party SBOM content (we can link to attested SBOMs).
  • Deterministic evidence. Facts come from package DBs, installed metadata, linkers, and verified attestations; no fuzzy guessing in the core.
  • Per-layer caching. Cache fragments by layer digest and compose image SBOMs via CycloneDX BOM-Link / SPDX ExternalRef.
  • Inventory vs Usage. Always record the full inventory of what exists; separately present usage (entrypoint closure + loaded libs).
  • Backend decides. PASS/FAIL is produced by Policy + VEX + Advisories. The scanner reports facts.
  • Attest or it didn't happen. Every export is signed as in-toto/DSSE and logged in Rekor v2.
  • Sovereign-ready. Cloud is used only for licensing and optional endorsement; everything else is first-party and self-hostable.

1) Service topology & trust boundaries

1.1 Runtime inventory (first-party)

| Service / Tool | Container image | Core role | Scale pattern |
|---|---|---|---|
| Scanner.WebService | stellaops/scanner-web | Control plane for scans; catalog; SBOM composition (inventory & usage); diff; exports; analysis-only report runs for Scheduler. | Stateless; N replicas behind LB. |
| Scanner.Worker | stellaops/scanner-worker | Runs analyzers (OS, Lang: Java/Node/Python/Go/.NET/Rust, Native ELF/PE/Mach-O, EntryTrace); emits per-layer SBOMs and composes image SBOMs. | Horizontal; queue-driven; sharded by layer digest. |
| Scanner.Sbomer.BuildXPlugin | stellaops/sbom-indexer | BuildKit generator for build-time SBOMs as OCI referrers. | CI-side; ephemeral. |
| Scanner.Sbomer.DockerImage | stellaops/scanner-cli | CLI-orchestrated scanner container for post-build scans. | Local/CI; ephemeral. |
| Concelier.WebService | stellaops/concelier-web | Vulnerability ingest/normalize/merge/export (JSON + Trivy DB). | HA via Mongo locks. |
| Excititor.WebService | stellaops/excititor-web | VEX ingest/normalize/consensus; conflict retention; exports. | HA via Mongo locks. |
| Policy Engine | (in scanner-web) | YAML DSL evaluator (waivers, vendor preferences, KEV/EPSS, license, usage-gating); produces policy digest. | In-process; cache per digest. |
| Scheduler.WebService | stellaops/scheduler-web | Schedules re-evaluation runs; consumes Concelier/Excititor deltas; selects impacted images via BOM-Index; orchestrates analysis-only reports. | Stateless API. |
| Scheduler.Worker | stellaops/scheduler-worker | Executes selection and enqueues batches toward Scanner; enforces rate/limits and windows; maintains impact cursors. | Horizontal; queue-driven. |
| Notify.WebService | stellaops/notify-web | Rules engine for outbound notifications; manages channels, templates, throttle/digest logic. | Stateless API. |
| Notify.Worker | stellaops/notify-worker | Delivers to Slack/Teams/Email/Webhooks; idempotent retries; digests. | Horizontal; per-channel rate limits. |
| Signer | stellaops/signer | Hard gate: validates entitlement + release integrity; mints signing cert (Fulcio keyless) or uses KMS; signs DSSE. | Stateless; HPA by QPS. |
| Attestor | stellaops/attestor | Posts DSSE bundles to Rekor v2; verification endpoints. | Stateless; HPA by QPS. |
| Authority | stellaops/authority | On-prem OIDC issuing short-lived OpToks with DPoP/mTLS sender constraint. | HA behind LB. |
| Zastava (Runtime) | stellaops/zastava | Runtime inspector/enforcer (observer + optional Admission Webhook). | DaemonSet + Webhook. |
| Web UI | stellaops/ui | Angular app for scans, diffs, policy, VEX, Scheduler, Notify, runtime, reports. | Stateless. |
| StellaOps.Cli | stellaops/cli | CLI for init/scan/export/diff/policy/report/verify; Buildx helper; schedule and notify verbs. | Local/CI. |

1.2 Third-party (self-hosted)

  • Fulcio (Sigstore CA): issues short-lived signing certs (keyless).
  • Rekor v2 (tile-backed transparency log).
  • MinIO: S3-compatible object store with lifecycle & Object Lock.
  • MongoDB: catalog, advisories, VEX, scheduler, notify.
  • Queue: Redis Streams / NATS / RabbitMQ (pluggable).
  • OCI Registry: must support Referrers API (discover SBOMs/signatures).

1.3 Cloud licensing (StellaOps)

  • Licensing Service (www.stella-ops.org): issues long-lived License Tokens (LT); exchanges LT → Proof-of-Entitlement (PoE) bound to an installation key; revoke/introspect PoE; optional cross-log endorsement.

1.4 Diagram (control/data planes & trust)

flowchart LR
  subgraph Cloud["www.stella-ops.org (Cloud)"]
    LS[Licensing Service<br/>LT→PoE / revoke / introspect]
  end

  subgraph OnPrem["Customer Site (Self-hosted)"]
    Auth["Authority (OIDC)<br/>OpTok (DPoP/mTLS)"]
    SW[Scanner.WebService]
    WK[Scanner.Worker xN]
    CONC[Concelier]
    EXC[Excititor]
    SCHW[Scheduler.Web]
    SCH[Scheduler.Worker xN]
    NOTW[Notify.Web]
    NOT[Notify.Worker xN]
    POL["Policy Engine (in Scanner.Web)"]
    SGN["Signer<br/>(entitlement + signing)"]
    ATT["Attestor<br/>(Rekor v2 submit/verify)"]
    UI["Web UI (Angular)"]
    Z["Zastava<br/>(Runtime Inspector/Enforcer)"]
    MIN[(MinIO S3)]
    MGO[(MongoDB)]
    QUE[(Queue/Streams)]
  end

  CLI[StellaOps.Cli / Buildx Plugin]
  REG[(OCI Registry with Referrers)]
  FUL[ Fulcio ]
  REK["Rekor v2 (tiles)"]

  CLI -->|scan/build| SW
  SW -->|jobs| QUE
  QUE --> WK
  WK --> MIN
  SW --> MGO
  CONC --> MGO
  EXC --> MGO
  UI --> SW
  Z --> SW

  %% New event-driven loop
  CONC -- export.delta --> SCHW
  EXC  -- export.delta --> SCHW
  SCHW --> SCH
  SCH --> SW
  SW -- report.ready --> NOTW
  Z  -- admission/observe --> NOTW

  SGN <--> Auth
  SGN --> FUL
  SGN -->|mTLS| ATT
  ATT --> REK

  SGN <-->|verify referrers| REG

Trust boundaries. Only Signer can sign; only Attestor can write to Rekor v2. Scanner/UI/Scheduler/Notify never sign.


2) Licensing & tokens (installation-ready, theft-resistant)

Two-token model.

  • License Token (LT): long-lived JWT from Licensing Service; used once to enroll the installation; never used in hot path.
  • Proof-of-Entitlement (PoE): bound to the installation key (mTLS client cert or DPoP-bound JWT with cnf); medium-lived; renewable; revocable.
  • Operational token (OpTok): 2-5 min OIDC token from Authority, sender-constrained (DPoP or mTLS). Used to authenticate to Signer/Scanner.WebService/Scheduler.Web/Notify.Web.

Signer enforces both: PoE proves entitlement; OpTok proves “who is calling now”. It also independently verifies that the scanner image digest is StellaOps-signed via Referrers + cosign before signing anything.
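
The gate described above can be sketched as a pure decision function. This is only an illustration under stated assumptions: the `SignRequest` fields and the reason strings are hypothetical stand-ins for checks the real Signer would perform against Authority, the Licensing Service, and the registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignRequest:
    # Hypothetical flags; each stands for the outcome of one verification step.
    optok_valid: bool             # OpTok signature and expiry validated
    optok_sender_bound: bool      # DPoP proof or mTLS binding matches the caller
    poe_active: bool              # PoE not expired and not revoked
    scanner_image_verified: bool  # scanner image digest is StellaOps-signed


def signer_gate(req: SignRequest) -> tuple[bool, str]:
    """Return (allowed, reason). Every check must pass before anything is signed."""
    if not (req.optok_valid and req.optok_sender_bound):
        return False, "optok"
    if not req.poe_active:
        return False, "poe"
    if not req.scanner_image_verified:
        return False, "release-integrity"
    return True, "ok"
```

The ordering (caller identity, then entitlement, then release integrity) is a design choice of this sketch, not something the text prescribes.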

Enrollment sequence (LT → PoE).

@startuml
actor Operator
participant "Install Agent" as IA
participant "Licensing Service" as LS
Operator -> IA: Provide LT
IA -> IA: Generate K_inst
IA -> LS: /license/enroll {LT, pub(K_inst)}
LS --> IA: PoE (mTLS client cert or JWT with cnf=K_inst), CRL/OCSP/introspect
@enduml

3) Scanner subsystem (facts engine)

3.1 Analyzers (deterministic only)

  • OS packages: apk/dpkg/rpm (Linux); Windows MSI/SxS/GAC (M2).

  • Language (installed state):

    • Java (pom.properties / MANIFEST) → pkg:maven/...
    • Node (node_modules/*/package.json) → pkg:npm/...
    • Python (*.dist-info/METADATA) → pkg:pypi/...
    • Go (buildinfo) → pkg:golang/...
    • .NET (*.deps.json) → pkg:nuget/...
    • Rust: deterministic language markers (symbol mangling) and crates only when present; otherwise bin:{sha256}.
  • Native: ELF/PE/Mach-O imports, DT_NEEDED, RPATH/RUNPATH, symbol versions, PE version info.

  • EntryTrace: parse ENTRYPOINT/CMD; shell AST; resolve launchers (Java/Node/Python) to terminal program; record file:line chain.

3.2 Caching & composition

  • Layer cache: {layerDigest → SBOM fragment + analyzer meta}.

  • File CAS: {sha256(file) → parse result (ELF/JAR metadata/etc.)}.

  • Composition: build image SBOMs from fragments via BOM-Link/ExternalRef; emit two views:

    • Inventory (complete filesystem inventory).
    • Usage (entrypoint closure + linked libs).
  • Transport: JSON and CycloneDX Protobuf (compact, fast to parse).

  • Index: BOM-Index sidecar with purl table + roaring bitmap + usedByEntrypoint flag for fast joins.
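
As a rough illustration of the layer cache and composition step, the sketch below memoizes analyzer output by layer digest and reuses it across images. It simplifies heavily: a fragment is modeled as a plain `{package: version}` dictionary and composition merges dictionaries, whereas the text describes linking fragments via BOM-Link/ExternalRef. `LayerCache` and `compose_inventory` are hypothetical names.

```python
from typing import Callable

Fragment = dict[str, str]  # package name -> version; simplified SBOM fragment


class LayerCache:
    """Memoizes analyzer results per layer digest."""

    def __init__(self, analyze: Callable[[str], Fragment]) -> None:
        self._analyze = analyze
        self._fragments: dict[str, Fragment] = {}
        self.misses = 0

    def fragment(self, layer_digest: str) -> Fragment:
        # Only run the analyzer the first time a digest is seen.
        if layer_digest not in self._fragments:
            self.misses += 1
            self._fragments[layer_digest] = self._analyze(layer_digest)
        return self._fragments[layer_digest]


def compose_inventory(cache: LayerCache, layer_digests: list[str]) -> Fragment:
    """Build an image-level view from cached fragments, in layer order."""
    image: Fragment = {}
    for digest in layer_digests:
        image.update(cache.fragment(digest))  # later layers win in this sketch
    return image
```

Two images that share a base layer trigger the analyzer for that layer only once, which is the effect the per-layer cache is meant to achieve.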

3.3 Diff (image → layer → package)

  • Added / Removed / Version-changed changes, attributed to the layer that caused them.
  • Raw diffs preserved; backend view applies VEX + Policy.
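
A minimal sketch of such a diff, assuming each image is given as an ordered list of `(layerDigest, {package: version})` pairs. The attribution rule is an assumption of this sketch: additions and version changes are attributed to the new layer that supplies the package, and removals to the old layer that used to supply it.

```python
Layer = tuple[str, dict[str, str]]  # (layer digest, {package: version})


def flatten(layers: list[Layer]) -> dict[str, tuple[str, str]]:
    """Map package -> (version, digest of the last layer that set it)."""
    merged: dict[str, tuple[str, str]] = {}
    for digest, packages in layers:
        for name, version in packages.items():
            merged[name] = (version, digest)
    return merged


def diff_images(old: list[Layer], new: list[Layer]) -> dict[str, list[dict]]:
    old_flat, new_flat = flatten(old), flatten(new)
    added = [
        {"package": n, "version": v, "layer": d}
        for n, (v, d) in sorted(new_flat.items()) if n not in old_flat
    ]
    removed = [
        {"package": n, "version": v, "layer": d}
        for n, (v, d) in sorted(old_flat.items()) if n not in new_flat
    ]
    changed = [
        {"package": n, "from": old_flat[n][0], "to": v, "layer": d}
        for n, (v, d) in sorted(new_flat.items())
        if n in old_flat and old_flat[n][0] != v
    ]
    return {"added": added, "removed": removed, "changed": changed}
```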

3.4 Build-time SBOMs (fast CI path)

  • Buildx generator runs analyzers during docker buildx build --attest=type=sbom,generator=stellaops/sbom-indexer, attaches SBOMs as OCI referrers.
  • Scanner.WebService can trust these (policy-configurable) and skip rescan; DSSE + Rekor v2 can be done either at build time or post-push via Signer/Attestor.

3.5 Events / integrations

  • Out: report.ready (summary + verdict + Rekor UUID) → internal bus for Notify & UI.
  • Expose: image-level BOM-Index metadata for Scheduler impact selection.

4) Backend evaluation (decider)

4.1 Concelier (advisories)

  • Ingests vendor, distro, OSS feeds; normalizes & merges; persists canonical advisories in Mongo; exports deterministic JSON and Trivy DB.
  • Offline kit bundles for air-gapped sites.

4.2 Excititor (VEX)

  • Ingests OpenVEX / CSAF VEX / CycloneDX VEX; normalizes claims; retains conflicts; computes consensus with provider trust weights and justification gates.

4.3 Policy Engine (YAML DSL)

  • Matchers: image/repo/env/purl/cve/vendor/source/path/layerDigest/usedByEntrypoint
  • Actions: ignore(until, justification), fail, warn, defer, requireVEX{vendors, justifications}, escalate {sev, KEV, EPSS}, license constraints.
  • Produces a policy digest (SHA-256 of canonicalized policy).
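
One way to picture the policy digest is shown below. The canonical form is an assumption of this sketch (JSON with sorted keys and compact separators); the document does not specify how StellaOps canonicalizes a policy, and `policy_digest` is a hypothetical helper name.

```python
import hashlib
import json


def policy_digest(policy: dict) -> str:
    """SHA-256 over an assumed canonical form: sorted keys, compact JSON, UTF-8."""
    canonical = json.dumps(
        policy, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because keys are sorted before hashing, two policies that differ only in key order yield the same digest, which is what makes the digest usable as a cache key and as an audit reference.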

4.4 PASS/FAIL flow

  1. SBOM (Inventory / Usage) → join with Concelier advisories.
  2. Apply Excititor consensus (statuses & justifications).
  3. Apply Policy; compute PASS/FAIL with waiver TTLs.
  4. Sign the final report (DSSE via Signer) and log to Rekor v2 via Attestor.
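
Steps 1 to 3 can be sketched as a single function. All data shapes here are illustrative assumptions: advisories as `{purl: [(cve, severity)]}`, VEX consensus as `{(purl, cve): status}` using OpenVEX-style status strings, and waivers as `{cve: expiry}`. Signing and logging (step 4) are out of scope for the sketch.

```python
from datetime import datetime


def evaluate(components, advisories, vex, waivers, fail_severities, now: datetime):
    """Join SBOM components with advisories, apply VEX and waivers, decide PASS/FAIL."""
    findings = []
    for purl in components:
        for cve, severity in advisories.get(purl, []):
            # Step 2: VEX consensus can suppress a finding.
            if vex.get((purl, cve)) in {"not_affected", "fixed"}:
                continue
            # Step 3: a waiver suppresses the finding only until its TTL expires.
            expiry = waivers.get(cve)
            if expiry is not None and now < expiry:
                continue
            findings.append((purl, cve, severity))
    failing = any(sev in fail_severities for _, _, sev in findings)
    return ("FAIL" if failing else "PASS"), findings
```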

5) Runtime enforcement (Zastava)

  • Observer: inventories running containers, checks image signatures, SBOM presence (referrers), detects drift (entrypoint chain divergence), flags unapproved images.
  • Admission Webhook (optional): blocks policy-fail pods (dry-run first).
  • Integration: posts runtime events to Scanner.WebService; can request delta scans on changed layers.

6) Storage & catalogs (MinIO/Mongo)

MinIO layout

s3://stellaops/
  layers/<sha256>/sbom.cdx.json.zst
  layers/<sha256>/sbom.spdx.json.zst
  images/<imgDigest>/inventory.cdx.pb
  images/<imgDigest>/usage.cdx.pb
  indexes/<imgDigest>/bom-index.bin
  attest/<artifactSha256>.dsse.json

Catalog (Mongo)

  • artifacts (type/format/sha/size/rekor/ttl/immutable/refCount/createdAt)
  • images, layers, links, lifecycleRules
  • Scheduler: schedules, runs, locks, impact_cursors
  • Notify: rules, deliveries, channels, templates

Retention

  • MinIO ILM for coarse TTL; Scanner.WebService GC decrements refCount and deletes unreferenced metadata; Object Lock for immutable classes (auditable artifacts).
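
The reference-counting part of retention might look roughly like this. The catalog is modeled as an in-memory dictionary keyed by artifact id with `refCount` and `immutable` fields taken from the catalog description above; `release` and `collect` are hypothetical helper names, and real deletion would also involve MinIO and Mongo.

```python
def release(catalog: dict[str, dict], artifact_id: str) -> None:
    """Drop one reference to an artifact; the count never goes below zero."""
    entry = catalog[artifact_id]
    entry["refCount"] = max(0, entry["refCount"] - 1)


def collect(catalog: dict[str, dict]) -> list[str]:
    """Delete unreferenced, non-immutable entries and return their ids."""
    doomed = sorted(
        artifact_id
        for artifact_id, entry in catalog.items()
        if entry["refCount"] == 0 and not entry["immutable"]
    )
    for artifact_id in doomed:
        del catalog[artifact_id]
    return doomed
```

Immutable entries are skipped even at `refCount == 0`, mirroring the role Object Lock plays for auditable artifacts.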

7) APIs (consolidated surface)

7.1 Scanner.WebService

POST /api/scans                          { imageRef|digest, force? } → { scanId }
GET  /api/scans/{id}                     → { status, digests, artifacts[] }
GET  /api/sboms/{imageDigest}            ?format=cdx-json|cdx-pb|spdx-json&view=inventory|usage
GET  /api/diff?old=<digest>&new=<digest> → { added[], removed[], changed[], byLayer[] }
POST /api/exports                        { imageDigest, format, view } → { artifactId, rekorUrl }
POST /api/reports                        { imageDigest, policyRevision?, vexSnapshot? } → { reportId, verdict, rekorUrl }
GET  /api/catalog/artifacts/{id}         → { size, ttl, immutable, rekor, refs }
GET  /healthz | /readyz | /metrics
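
For orientation, a client might build the scan request along these lines. The base URL is a placeholder, the body field names follow the `{ imageRef|digest, force? }` shape listed above, and authentication (OpTok with DPoP or mTLS) is deliberately omitted; the request is constructed but not sent.

```python
import json
import urllib.request


def build_scan_request(base_url: str, image_ref: str, force: bool = False) -> urllib.request.Request:
    """Prepare POST /api/scans; the caller is expected to add auth before sending."""
    body = json.dumps({"imageRef": image_ref, "force": force}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/scans",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```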

7.2 Signer (mTLS; hard gate)

POST /sign/dsse    # body: {subjectHash, imageDigest, predicate}; headers: OpTok (DPoP/mTLS) + PoE
GET  /verify/referrers?imageDigest=sha256:...  # is this image StellaOps-signed?

7.3 Attestor (mTLS)

POST /rekor/entries      # DSSE bundle → {uuid, index, proof, logURL}
GET  /rekor/entries/{uuid}

7.4 Authority (OIDC)

  • /.well-known/openid-configuration, /oauth/token (DPoP/mTLS), /oauth/introspect, /jwks

7.5 Licensing (cloud)

POST /license/enroll      { LT, pubKey }           → PoE + introspection endpoints
POST /license/revoke      { license_id }           → ok
POST /license/introspect  { poe }                  → { active, claims, exp }
POST /attest/endorse      { bundle }               → endorsement bundle (optional)

7.6 Scheduler

POST /api/v1/scheduler/schedules         {yaml|json}      → { scheduleId }
GET  /api/v1/scheduler/schedules                          → [ { id, nextRun, status, stats } ]
POST /api/v1/scheduler/run               { id|selector }   → { runId }
GET  /api/v1/scheduler/runs/{id}                          → { status, counts, links }
GET  /api/v1/scheduler/cursor                            → { lastConcelierExportId, lastExcititorExportId }

7.7 Notify

POST /api/v1/notify/test                 { channel, target } → { delivered }
POST /api/v1/notify/rules                {yaml|json}         → { ruleId }
GET  /api/v1/notify/rules                                   → [ { id, match, actions, enabled } ]
GET  /api/v1/notify/deliveries                              → [ { id, eventId, channel, status, attempts } ]

8) Security & verifiability

  • Sender-constrained tokens. All operational calls use DPoP (RFC 9449) or mTLS-bound tokens (RFC 8705).
  • Entitlement. PoE is mandatory; revocation honored online.
  • Release integrity. Signer independently verifies scanner image digest via Referrers + cosign before signing.
  • Separation of duties. Scanner/UI/Scheduler/Notify cannot sign; only Signer can sign; only Attestor can write to Rekor v2.
  • Verifiers. Anyone can verify: DSSE signature → certificate chain to StellaOps Fulcio/KMS root → Rekor v2 inclusion.
  • RBAC. Roles: scanner.admin|read, scheduler.admin|read, notify.admin|read, zastava.admin|read.
  • Community vs Authorized. Free/community runs throttled with no official attestations; authorized runs full speed and produce StellaOps-verified bundles.

DSSE predicate (SBOM/report)

{
  "predicateType": "https://stella-ops.org/attestations/sbom/1",
  "subject": [{ "name": "s3://stellaops/images/<digest>/inventory.cdx.pb", "digest": { "sha256": "<sha256>" } }],
  "predicate": {
    "image_digest": "<sha256:...>",
    "stellaops_version": "2.3.1 (2027.04)",
    "license_id": "LIC-9F2A...",
    "customer_id": "CUST-ACME",
    "plan": "pro",
    "policy_digest": "sha256:...",
    "views": ["inventory","usage"],
    "created": "2025-10-17T12:34:56Z"
  }
}
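
When a statement like the one above is wrapped in a DSSE envelope, the signature is computed over the DSSE pre-authentication encoding (PAE) of the payload type and payload bytes rather than over the raw JSON. The sketch below shows that encoding as defined by the DSSE specification; how StellaOps serializes the statement before signing is not stated in this document.

```python
def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: 'DSSEv1 <len type> <type> <len body> <body>'."""
    type_bytes = payload_type.encode("utf-8")
    # Lengths are ASCII decimal byte counts, separated by single spaces.
    return b"DSSEv1 %d %s %d %s" % (
        len(type_bytes), type_bytes, len(payload), payload
    )
```

For in-toto statements the payload type is conventionally `application/vnd.in-toto+json`; a verifier recomputes the same bytes from the envelope before checking the signature.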

BOM-Index sidecar. Binary header + purl table + roaring bitmaps; optional usedByEntrypoint flags for fast policy joins.
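
To show why a purl table plus bitmaps makes impact selection cheap, here is a toy index. Python integers stand in for roaring bitmaps, and the class and method names are hypothetical; the on-disk binary format is not modeled.

```python
class BomIndex:
    """Toy index: purl table + per-image bitmaps (plain ints instead of roaring)."""

    def __init__(self) -> None:
        self._purl_ids: dict[str, int] = {}           # purl -> bit position
        self._images: dict[str, tuple[int, int]] = {}  # digest -> (present, used)

    def _bit(self, purl: str) -> int:
        return 1 << self._purl_ids.setdefault(purl, len(self._purl_ids))

    def add_image(self, digest: str, purls, used_by_entrypoint=()) -> None:
        present = 0
        for purl in purls:
            present |= self._bit(purl)
        used = 0
        for purl in used_by_entrypoint:
            used |= self._bit(purl)
        # A component can only count as used if it is present in the image.
        self._images[digest] = (present, used & present)

    def impacted(self, changed_purls, usage_only: bool = False) -> list[str]:
        mask = 0
        for purl in changed_purls:
            if purl in self._purl_ids:
                mask |= 1 << self._purl_ids[purl]
        view = 1 if usage_only else 0
        # One bitwise AND per image decides whether it needs re-evaluation.
        return sorted(d for d, bits in self._images.items() if bits[view] & mask)
```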


9) Scale, performance & quotas

  • Workers: horizontal; distributed lock per layer digest; global CAS in MinIO.

  • Queues: Redis Streams / NATS / RabbitMQ. HPA by queue depth, CPU, memory.

  • Registry throttling: per-registry concurrency budgets.

  • Targets:

    • Build-time path P95 ≤ 3-5 s on warmed bases.
    • Post-build delta scan P95 ≤ 10 s for 200 MB images.
    • Policy + VEX evaluation ≤ 500 ms for 5k components using BOM-Index.
    • Event → notification p95 ≤ 30-60 s under nominal load.
    • Export delta → re-evaluation verdict p95 ≤ 5 min for 10k impacted images.
  • Quotas: license plan enforces QPS/concurrency/size; Signer throttles and can deny DSSE.


10) DevOps & distribution

  • Releases: all first-party images cosign-signed; labels embed org.stellaops.version and org.stellaops.release_date.

  • Channels:

    • Community (public registry): throttled, non-attesting.
    • Authorized (private registry): full speed, DSSE enabled.
  • Client update flow: containers self-verify signatures at boot; report version; Signer enforces valid_release_year / max_version from PoE before signing.

  • Compose skeleton:

services:
  authority:       { image: stellaops/authority }
  fulcio:          { image: sigstore/fulcio }
  rekor:           { image: sigstore/rekor-v2 }
  minio:           { image: minio/minio, command: server /data --console-address ":9001" }
  mongo:           { image: mongo:7 }
  signer:          { image: stellaops/signer, depends_on: [authority, fulcio] }
  attestor:        { image: stellaops/attestor, depends_on: [rekor, signer] }
  scanner-web:     { image: stellaops/scanner-web, depends_on: [mongo, minio, signer, attestor] }
  scanner-worker:  { image: stellaops/scanner-worker, deploy: { replicas: 4 }, depends_on: [scanner-web] }
  concelier:       { image: stellaops/concelier-web, depends_on: [mongo] }
  excititor:       { image: stellaops/excititor-web, depends_on: [mongo] }
  scheduler-web:   { image: stellaops/scheduler-web, depends_on: [mongo] }
  scheduler-worker: { image: stellaops/scheduler-worker, deploy: { replicas: 2 }, depends_on: [scheduler-web] }
  notify-web:      { image: stellaops/notify-web, depends_on: [mongo] }
  notify-worker:   { image: stellaops/notify-worker, deploy: { replicas: 2 }, depends_on: [notify-web] }
  ui:              { image: stellaops/ui, depends_on: [scanner-web, concelier, excititor, scheduler-web, notify-web] }
  • Backups: Mongo dumps; MinIO versioned buckets & replication; Rekor v2 DB snapshots; JWKS/Fulcio/KMS key rotation.
  • Ops runbooks: Scheduler catch-up after Concelier/Excititor recovery; connector key rotation (Slack/Teams/SMTP).
  • SLOs & alerts: lag between Concelier/Excititor export and first rescan verdict; delivery failure rates by channel.

11) Observability & audit

  • Metrics: scan latency, layer cache hit %, artifact bytes, DSSE/Rekor latency, policy evaluation time, queue depth, admission decisions (Zastava).
  • Scheduler metrics: scheduler.impacted_images_total, scheduler.jobs_enqueued_total, scheduler.selection_ms, end-to-end p95 (event → verdict).
  • Notify metrics: notify.sent_total{channel}, notify.dropped_total{reason}, notify.digest_coalesced_total, notify.latency_ms.
  • Tracing: per-stage spans; correlation IDs across Scanner→Signer→Attestor and Concelier/Excititor→Scheduler→Scanner→Notify.
  • Audit logs: every signing records license_id, image_digest, policy_digest, and Rekor UUID; Scheduler records who scheduled what; Notify records where, when, and why messages were sent or deduped.
  • Compliance: MinIO Object Lock for immutable artifacts; reproducible outputs via policy digest + SBOM digest in predicate.

12) Roadmap (anchored to this architecture)

  • M2: Windows MSI/SxS/GAC analyzers; deeper Rust (DWARF enrichers).
  • M2: Buildx generator certified flows; cross-registry trust policies.
  • M3: PatchPresence plugin (signature-based backport detection), opt-in.
  • M3: Zastava Admission control GA with policy presets and dry-run → enforce stages.
  • M3: Scheduler GA with export-delta impact routing and capacity-aware pacing.
  • M3: Notify GA with digests, Slack/Teams/Email/Webhooks; M4: PagerDuty/Opsgenie connectors.
  • Continuous: Policy UX (waiver TTLs, vendor rules), Excititor connectors expansion.

13) Canonical sequences (verification, re-evaluation & notify)

Sign & log (OpTok + PoE, image verify, DSSE, Rekor).

sequenceDiagram
  autonumber
  participant Scan as Scanner.WebService
  participant Auth as Authority (OIDC)
  participant Sign as Signer
  participant Reg as OCI Registry
  participant Ful as Fulcio/KMS
  participant Att as Attestor
  participant Rek as Rekor v2

  Scan->>Auth: Get OpTok (DPoP/mTLS)
  Scan->>Sign: sign(request) + OpTok + PoE + DPoP proof
  Sign->>Auth: Validate OpTok & sender-constraint
  Sign->>Sign: Validate PoE (introspect/revocation)
  Sign->>Reg: Verify scanner image is StellaOps-signed (Referrers + cosign)
  alt OK
    Sign->>Ful: Get signing cert (keyless) or use KMS key
    Sign-->>Scan: DSSE bundle (cert chain)
    Scan->>Att: Submit bundle
    Att-->>Rek: Create entry
    Rek-->>Att: {uuid,index,proof}
    Att-->>Scan: Rekor URL
  else Deny
    Sign-->>Scan: 403 (no attestation)
  end

Event-driven re-evaluation & notify.

sequenceDiagram
  participant CONC as Concelier
  participant EXC as Excititor
  participant SCH as Scheduler
  participant SC as Scanner.WebService
  participant NO as Notify

  CONC->>SCH: export.delta {changedProductKeys, exportId}
  EXC ->>SCH: export.delta {changedProductKeys, exportId}
  SCH->>SCH: Impact select via BOM-Index bitmaps
  SCH->>SC: Enqueue analysis-only reports (batches)
  SC-->>SCH: verdict stream (PASS/FAIL, deltas)
  SCH->>NO: rescan.delta {imageDigest, newCriticals, links}
  NO-->>Slack/Teams/Email/Webhook: deliver (throttle/digest rules applied)

14) Minimal data shapes (Scheduler & Notify)

Scheduler schedule (YAML via UI/CLI)

name: nightly-eu
when: "0 2 * * * Europe/Sofia"
mode: analysis-only        # or content-refresh
selection:
  scope: all-images        # or tenant/ns/repo label selectors
  onlyIf: { lastReportOlderThanDays: 7 }
notify:
  onNewFindings: true
  minSeverity: high
limits:
  maxJobs: 5000
  ratePerSecond: 50
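
The `limits` block implies simple pacing arithmetic, sketched below. How StellaOps treats images beyond `maxJobs` is not specified here; deferring them to a later run is an assumption of this sketch, and `plan_run` is a hypothetical helper.

```python
import math


def plan_run(impacted_images: int, max_jobs: int, rate_per_second: int) -> dict[str, int]:
    """Cap a run at max_jobs and estimate the shortest time needed to enqueue it."""
    jobs = min(impacted_images, max_jobs)
    return {
        "jobs": jobs,
        "deferred": impacted_images - jobs,  # assumed to wait for a later run
        "minSeconds": math.ceil(jobs / rate_per_second),
    }
```

With the values from the schedule above (`maxJobs: 5000`, `ratePerSecond: 50`), a full run takes at least 100 seconds to enqueue.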

Notify rule (YAML)

name: high-critical-alerts
match:
  eventKinds: ["report.ready","rescan.delta","zastava.admission"]
  minSeverity: high
  namespaces: ["prod-*"]
  vex: { includeAcceptedJustifications: false }
actions:
  - channel: slack
    target: "#sec-alerts"
    template: "concise"
    throttle: "5m"
  - channel: email
    target: "soc@acme.org"
    digest: "hourly"
enabled: true
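
Matching an event against a rule like the one above could work roughly as follows. The event fields (`kind`, `severity`, `namespace`) and the severity ordering are assumptions of this sketch; namespace patterns are treated as shell-style globs, which fits `prod-*` but is not confirmed by the document. Throttle and digest handling are left out.

```python
from fnmatch import fnmatchcase

# Assumed ordering; the document does not enumerate severity levels.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def rule_matches(rule: dict, event: dict) -> bool:
    """Return True when an enabled rule's match block accepts the event."""
    if not rule.get("enabled", True):
        return False
    match = rule["match"]
    if event["kind"] not in match["eventKinds"]:
        return False
    if SEVERITY_ORDER.index(event["severity"]) < SEVERITY_ORDER.index(match["minSeverity"]):
        return False
    return any(fnmatchcase(event["namespace"], pattern) for pattern in match["namespaces"])
```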