Full implementation plan (first draft)

This commit is contained in:
master
2025-10-19 00:28:48 +03:00
parent 052da7a7d0
commit 8dc7273e27
125 changed files with 5438 additions and 166 deletions


@@ -1,8 +1,3 @@
Below is the **revised, consolidated** `high_level_architecture.md`.
It **absorbs** all content from `components.md` so you have a single, authoritative file. No separate components doc is required.
---
# High-Level Architecture — **StellaOps** (Consolidated • 2025Q4)
> **Purpose.** A complete, implementation-ready map of StellaOps: product vision, all runtime components, trust boundaries, tokens/licensing, control/data flows, storage, APIs, security, scale, DevOps, and verification logic.
@@ -30,28 +25,32 @@ It **absorbs** all content from `components.md` so you have a single, authoritat
### 1.1 Runtime inventory (first-party)
| Service / Tool | Container image | Core role | Scale pattern |
| ------------------------------- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| **Scanner.WebService** | `stellaops/scanner-web` | Control plane for scans; catalog; SBOM composition (inventory & usage); diff; exports. | Stateless; N replicas behind LB. |
| **Scanner.Worker** | `stellaops/scanner-worker` | Runs analyzers (OS, Lang: Java/Node/Python/Go/.NET/Rust, Native ELF/PE/Mach-O, EntryTrace); emits per-layer SBOMs and composes image SBOMs. | Horizontal; queue-driven; sharded by layer digest. |
| **Scanner.Sbomer.BuildXPlugin** | `stellaops/sbom-indexer` | BuildKit **generator** for build-time SBOMs as OCI **referrers**. | CI-side; ephemeral. |
| **Scanner.Sbomer.DockerImage** | `stellaops/scanner-cli` | CLI-orchestrated scanner container for post-build scans. | Local/CI; ephemeral. |
| **Concelier.WebService** | `stellaops/concelier-web` | Vulnerability ingest/normalize/merge/export (JSON + Trivy DB). | HA via Mongo locks. |
| **Excititor.WebService** | `stellaops/excititor-web` | VEX ingest/normalize/consensus; conflict retention; exports. | HA via Mongo locks. |
| **Policy Engine** | (in `scanner-web`) | YAML DSL evaluator (waivers, vendor preferences, KEV/EPSS, license, usage-gating); produces **policy digest**. | In-process; cache per digest. |
| **Signer** | `stellaops/signer` | **Hard gate:** validates entitlement + release integrity; mints signing cert (Fulcio keyless) or uses KMS; signs DSSE. | Stateless; HPA by QPS. |
| **Attestor** | `stellaops/attestor` | Posts DSSE bundles to **Rekor v2**; verification endpoints. | Stateless; HPA by QPS. |
| **Authority** | `stellaops/authority` | On-prem OIDC issuing **short-lived OpToks** with DPoP/mTLS sender constraint. | HA behind LB. |
| **Zastava** (Runtime) | `stellaops/zastava` | Runtime inspector/enforcer (observer + optional Admission Webhook). | DaemonSet + Webhook. |
| **Web UI** | `stellaops/ui` | Angular app for scans, diffs, policy, VEX, runtime, reports. | Stateless. |
| **StellaOps.Cli** | `stellaops/cli` | CLI for init/scan/export/diff/policy/report/verify; Buildx helper. | Local/CI. |
| Service / Tool | Container image | Core role | Scale pattern |
| ------------------------------- | ---------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| **Scanner.WebService** | `stellaops/scanner-web` | Control plane for scans; catalog; SBOM composition (inventory & usage); diff; exports; **analysis-only report runs** for Scheduler. | Stateless; N replicas behind LB. |
| **Scanner.Worker** | `stellaops/scanner-worker` | Runs analyzers (OS, Lang: Java/Node/Python/Go/.NET/Rust, Native ELF/PE/Mach-O, EntryTrace); emits per-layer SBOMs and composes image SBOMs. | Horizontal; queue-driven; sharded by layer digest. |
| **Scanner.Sbomer.BuildXPlugin** | `stellaops/sbom-indexer` | BuildKit **generator** for build-time SBOMs as OCI **referrers**. | CI-side; ephemeral. |
| **Scanner.Sbomer.DockerImage** | `stellaops/scanner-cli` | CLI-orchestrated scanner container for post-build scans. | Local/CI; ephemeral. |
| **Concelier.WebService** | `stellaops/concelier-web` | Vulnerability ingest/normalize/merge/export (JSON + Trivy DB). | HA via Mongo locks. |
| **Excititor.WebService** | `stellaops/excititor-web` | VEX ingest/normalize/consensus; conflict retention; exports. | HA via Mongo locks. |
| **Policy Engine** | (in `scanner-web`) | YAML DSL evaluator (waivers, vendor preferences, KEV/EPSS, license, usage-gating); produces **policy digest**. | In-process; cache per digest. |
| **Scheduler.WebService** | `stellaops/scheduler-web` | Schedules **re-evaluation** runs; consumes Concelier/Excititor deltas; selects **impacted images** via BOM-Index; orchestrates analysis-only reports. | Stateless API. |
| **Scheduler.Worker** | `stellaops/scheduler-worker` | Executes selection and enqueues batches toward Scanner; enforces rate/limits and windows; maintains impact cursors. | Horizontal; queue-driven. |
| **Notify.WebService** | `stellaops/notify-web` | Rules engine for outbound notifications; manages channels, templates, throttle/digest logic. | Stateless API. |
| **Notify.Worker** | `stellaops/notify-worker` | Delivers to Slack/Teams/Email/Webhooks; idempotent retries; digests. | Horizontal; per-channel rate limits. |
| **Signer** | `stellaops/signer` | **Hard gate:** validates entitlement + release integrity; mints signing cert (Fulcio keyless) or uses KMS; signs DSSE. | Stateless; HPA by QPS. |
| **Attestor** | `stellaops/attestor` | Posts DSSE bundles to **Rekor v2**; verification endpoints. | Stateless; HPA by QPS. |
| **Authority** | `stellaops/authority` | On-prem OIDC issuing **short-lived OpToks** with DPoP/mTLS sender constraint. | HA behind LB. |
| **Zastava** (Runtime) | `stellaops/zastava` | Runtime inspector/enforcer (observer + optional Admission Webhook). | DaemonSet + Webhook. |
| **Web UI** | `stellaops/ui` | Angular app for scans, diffs, policy, VEX, **Scheduler**, **Notify**, runtime, reports. | Stateless. |
| **StellaOps.Cli** | `stellaops/cli` | CLI for init/scan/export/diff/policy/report/verify; Buildx helper; **schedule** and **notify** verbs. | Local/CI. |
### 1.2 Third-party (self-hosted)
* **Fulcio** (Sigstore CA) — issues short-lived signing certs (keyless).
* **Rekor v2** (tile-backed transparency log).
* **MinIO** — S3-compatible object store with lifecycle & Object Lock.
* **MongoDB** — catalog, advisories, VEX.
* **MongoDB** — catalog, advisories, VEX, scheduler, notify.
* **Queue** — Redis Streams / NATS / RabbitMQ (pluggable).
* **OCI Registry** — must support **Referrers API** (discover SBOMs/signatures).
@@ -71,8 +70,12 @@ flowchart LR
Auth[Authority (OIDC)\nOpTok (DPoP/mTLS)]
SW[Scanner.WebService]
WK[Scanner.Worker xN]
FEED[Concelier]
VEX[Excititor]
CONC[Concelier]
EXC[Excititor]
SCHW[Scheduler.Web]
SCH[Scheduler.Worker xN]
NOTW[Notify.Web]
NOT[Notify.Worker xN]
POL[Policy Engine (in Scanner.Web)]
SGN[Signer\n(entitlement + signing)]
ATT[Attestor\n(Rekor v2 submit/verify)]
@@ -93,11 +96,19 @@ flowchart LR
QUE --> WK
WK --> MIN
SW --> MGO
FEED --> MGO
VEX --> MGO
CONC --> MGO
EXC --> MGO
UI --> SW
Z --> SW
%% New event-driven loop
CONC -- export.delta --> SCHW
EXC -- export.delta --> SCHW
SCHW --> SCH
SCH --> SW
SW -- report.ready --> NOTW
Z -- admission/observe --> NOTW
SGN <--> Auth
SGN --> FUL
SGN -->|mTLS| ATT
@@ -106,7 +117,7 @@ flowchart LR
SGN <-->|verify referrers| REG
```
**Trust boundaries.** Only **Signer** can sign; only **Attestor** can write to **Rekor v2**. Scanner/UI never sign.
**Trust boundaries.** Only **Signer** can sign; only **Attestor** can write to **Rekor v2**. Scanner/UI/Scheduler/Notify never sign.
---
@@ -116,7 +127,7 @@ flowchart LR
* **License Token (LT)** — long-lived JWT from **Licensing Service**; used **once** to enroll the installation; never used in hot path.
* **Proof-of-Entitlement (PoE)** — bound to the installation key (mTLS client cert **or** DPoP-bound JWT with `cnf`); medium-lived; renewable; revocable.
* **Operational token (OpTok)** — 25min OIDC token from **Authority**, **sender-constrained** (DPoP or mTLS). Used to authenticate to **Signer**/**Scanner.WebService**.
* **Operational token (OpTok)** — 25min OIDC token from **Authority**, **sender-constrained** (DPoP or mTLS). Used to authenticate to **Signer**/**Scanner.WebService**/**Scheduler.Web**/**Notify.Web**.
**Signer enforces both:** PoE proves entitlement; OpTok proves “who is calling now”. It also **independently verifies** the **scanner image digest** is **StellaOps-signed** via **Referrers + cosign** before signing anything.
@@ -173,6 +184,11 @@ LS --> IA: PoE (mTLS client cert or JWT with cnf=K_inst), CRL/OCSP/introspect
* Buildx **generator** runs analyzers during `docker buildx build --attest=type=sbom,generator=stellaops/sbom-indexer`, attaches SBOMs as **OCI referrers**.
* Scanner.WebService can trust these (policy-configurable) and **skip** rescan; DSSE + Rekor v2 can be done either at build time or post-push via Signer/Attestor.
### 3.5 Events / integrations
* **Out:** `report.ready` (summary + verdict + Rekor UUID) → internal bus for **Notify** & UI.
* **Expose:** image-level **BOM-Index** metadata for **Scheduler** impact selection.
---
## 4) Backend evaluation (decider)
@@ -227,6 +243,8 @@ s3://stellaops/
* `artifacts` (type/format/sha/size/rekor/ttl/immutable/refCount/createdAt)
* `images`, `layers`, `links`, `lifecycleRules`
* **Scheduler:** `schedules`, `runs`, `locks`, `impact_cursors`
* **Notify:** `rules`, `deliveries`, `channels`, `templates`
**Retention**
@@ -239,13 +257,13 @@ s3://stellaops/
### 7.1 Scanner.WebService
```
POST /api/scans { imageRef|digest, force? } → { scanId }
GET /api/scans/{id} → { status, digests, artifacts[] }
GET /api/sboms/{imageDigest} ?format=cdx-json|cdx-pb|spdx-json&view=inventory|usage
POST /api/scans { imageRef|digest, force? } → { scanId }
GET /api/scans/{id} → { status, digests, artifacts[] }
GET /api/sboms/{imageDigest} ?format=cdx-json|cdx-pb|spdx-json&view=inventory|usage
GET /api/diff?old=<digest>&new=<digest> → { added[], removed[], changed[], byLayer[] }
POST /api/exports { imageDigest, format, view } → { artifactId, rekorUrl }
POST /api/reports { imageDigest, policyRevision? } → { reportId, rekorUrl }
GET /api/catalog/artifacts/{id} → { size, ttl, immutable, rekor, refs }
POST /api/exports { imageDigest, format, view } → { artifactId, rekorUrl }
POST /api/reports { imageDigest, policyRevision?, vexSnapshot? } → { reportId, verdict, rekorUrl }
GET /api/catalog/artifacts/{id} → { size, ttl, immutable, rekor, refs }
GET /healthz | /readyz | /metrics
```
@@ -276,6 +294,25 @@ POST /license/introspect { poe } → { active, claims, exp }
POST /attest/endorse { bundle } → endorsement bundle (optional)
```
### 7.6 Scheduler
```
POST /api/v1/scheduler/schedules {yaml|json} → { scheduleId }
GET /api/v1/scheduler/schedules → [ { id, nextRun, status, stats } ]
POST /api/v1/scheduler/run { id|selector } → { runId }
GET /api/v1/scheduler/runs/{id} → { status, counts, links }
GET /api/v1/scheduler/cursor → { lastConcelierExportId, lastExcititorExportId }
```
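As an illustration of the `POST /api/v1/scheduler/run` endpoint listed above, the following is a minimal sketch that only builds the HTTP request without sending it. The base URL, the `build_run_request` helper, the sample schedule ID, and the use of a plain `Authorization: Bearer` header are all assumptions made for the example. An mTLS-bound OpTok could plausibly be presented this way over a mutually authenticated connection, whereas a DPoP-bound token would additionally need a DPoP proof header, which is omitted here.

```python
import json
from urllib.request import Request

# Hypothetical base URL; the source does not specify a host name.
BASE_URL = "https://scheduler.example.internal"


def build_run_request(schedule_id: str, op_tok: str) -> Request:
    """Build (but do not send) a request that triggers a run by schedule id."""
    # The listing shows the body as { id|selector }; this sketch uses the id form.
    body = json.dumps({"id": schedule_id}).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/api/v1/scheduler/run",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: token sent as a bearer credential over mTLS.
            "Authorization": f"Bearer {op_tok}",
        },
    )


req = build_run_request("sched-123", "example-token")
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is left out so the sketch stays runnable without a live Scheduler instance.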
### 7.7 Notify
```
POST /api/v1/notify/test { channel, target } → { delivered }
POST /api/v1/notify/rules {yaml|json} → { ruleId }
GET /api/v1/notify/rules → [ { id, match, actions, enabled } ]
GET /api/v1/notify/deliveries → [ { id, eventId, channel, status, attempts } ]
```
---
## 8) Security & verifiability
@@ -283,8 +320,9 @@ POST /attest/endorse { bundle } → endorsement bundle (optio
* **Sender-constrained tokens.** All operational calls use **DPoP** (RFC 9449) or **mTLS-bound** tokens (RFC 8705).
* **Entitlement.** **PoE** is mandatory; revocation honored online.
* **Release integrity.** **Signer** independently verifies **scanner image digest** via **Referrers + cosign** before signing.
* **Separation of duties.** Scanner/UI cannot sign; only **Signer** can sign; only **Attestor** can write to **Rekor v2**.
* **Separation of duties.** Scanner/UI/Scheduler/Notify cannot sign; only **Signer** can sign; only **Attestor** can write to **Rekor v2**.
* **Verifiers.** Anyone can verify: DSSE signature → certificate chain to **StellaOps Fulcio/KMS root** → **Rekor v2** inclusion.
* **RBAC.** Roles: `scanner.admin|read`, `scheduler.admin|read`, `notify.admin|read`, `zastava.admin|read`.
* **Community vs Authorized.** Free/community runs are throttled and produce no official attestations; authorized runs execute at full speed and produce **StellaOps-verified** bundles.
**DSSE predicate (SBOM/report)**
@@ -321,6 +359,8 @@ Binary header + purl table + roaring bitmaps; optional `usedByEntrypoint` flags
* Build-time path P95 ≤35s on warmed bases.
* Post-build delta scan P95 ≤10s for 200MB images.
* Policy + VEX evaluation ≤500ms for 5k components using BOM-Index.
* **Event → notification** p95 ≤ **30–60 s** under nominal load.
* **Export delta → re-evaluation verdict** p95 ≤ **5 min** for 10k impacted images.
* **Quotas:** license plan enforces QPS/concurrency/size; **Signer** throttles and can deny DSSE.
---
@@ -337,32 +377,37 @@ Binary header + purl table + roaring bitmaps; optional `usedByEntrypoint` flags
```yaml
services:
authority: { image: stellaops/authority }
fulcio: { image: sigstore/fulcio }
rekor: { image: sigstore/rekor-v2 }
minio: { image: minio/minio, command: server /data --console-address ":9001" }
mongo: { image: mongo:7 }
signer: { image: stellaops/signer, depends_on: [authority, fulcio] }
attestor: { image: stellaops/attestor, depends_on: [rekor, signer] }
scanner-web:{ image: stellaops/scanner-web, depends_on: [mongo, minio, signer, attestor] }
scanner-worker:
image: stellaops/scanner-worker
deploy: { replicas: 4 }
depends_on: [scanner-web]
concelier: { image: stellaops/concelier-web, depends_on: [mongo] }
excititor: { image: stellaops/excititor-web, depends_on: [mongo] }
ui: { image: stellaops/ui, depends_on: [scanner-web, concelier, excititor] }
authority: { image: stellaops/authority }
fulcio: { image: sigstore/fulcio }
rekor: { image: sigstore/rekor-v2 }
minio: { image: minio/minio, command: server /data --console-address ":9001" }
mongo: { image: mongo:7 }
signer: { image: stellaops/signer, depends_on: [authority, fulcio] }
attestor: { image: stellaops/attestor, depends_on: [rekor, signer] }
scanner-web: { image: stellaops/scanner-web, depends_on: [mongo, minio, signer, attestor] }
scanner-worker: { image: stellaops/scanner-worker, deploy: { replicas: 4 }, depends_on: [scanner-web] }
concelier: { image: stellaops/concelier-web, depends_on: [mongo] }
excititor: { image: stellaops/excititor-web, depends_on: [mongo] }
scheduler-web: { image: stellaops/scheduler-web, depends_on: [mongo] }
scheduler-worker:{ image: stellaops/scheduler-worker, deploy: { replicas: 2 }, depends_on: [scheduler-web] }
notify-web: { image: stellaops/notify-web, depends_on: [mongo] }
notify-worker: { image: stellaops/notify-worker, deploy: { replicas: 2 }, depends_on: [notify-web] }
ui: { image: stellaops/ui, depends_on: [scanner-web, concelier, excititor, scheduler-web, notify-web] }
```
* **Backups:** Mongo dumps; MinIO versioned buckets & replication; Rekor v2 DB snapshots; JWKS/Fulcio/KMS key rotation.
* **Ops runbooks:** Scheduler catch-up after Concelier/Excititor recovery; connector key rotation (Slack/Teams/SMTP).
* **SLOs & alerts:** lag between Concelier/Excititor export and first rescan verdict; delivery failure rates by channel.
---
## 11) Observability & audit
* **Metrics:** scan latency, layer cache hit %, artifact bytes, DSSE/Rekor latency, policy evaluation time, queue depth, admission decisions (Zastava).
* **Tracing:** per-stage spans; correlation IDs across Scanner→Signer→Attestor.
* **Audit logs:** every signing records `license_id`, `image_digest`, `policy_digest`, and Rekor UUID.
* **Scheduler metrics:** `scheduler.impacted_images_total`, `scheduler.jobs_enqueued_total`, `scheduler.selection_ms`, endtoend p95 (event → verdict).
* **Notify metrics:** `notify.sent_total{channel}`, `notify.dropped_total{reason}`, `notify.digest_coalesced_total`, `notify.latency_ms`.
* **Tracing:** per-stage spans; correlation IDs across Scanner→Signer→Attestor and Concelier/Excititor→Scheduler→Scanner→Notify.
* **Audit logs:** every signing records `license_id`, `image_digest`, `policy_digest`, and Rekor UUID; Scheduler records who scheduled what; Notify records where, when, and why messages were sent or deduped.
* **Compliance:** MinIO **Object Lock** for immutable artifacts; reproducible outputs via policy digest + SBOM digest in predicate.
---
@@ -373,11 +418,13 @@ services:
* M2: Buildx generator certified flows; cross-registry trust policies.
* M3: PatchPresence plugin (signature-based backport detection), opt-in.
* M3: Zastava Admission control GA with policy presets and dry-run→enforce stages.
* M3: **Scheduler GA** with export-delta impact routing and capacity-aware pacing.
* M3: **Notify GA** with digests, Slack/Teams/Email/Webhooks; **M4:** PagerDuty/Opsgenie connectors.
* Continuous: Policy UX (waiver TTLs, vendor rules), Excititor connectors expansion.
---
## 13) Canonical sequences (verification & signing)
## 13) Canonical sequences (verification, re-evaluation & notify)
**Sign & log (OpTok + PoE, image verify, DSSE, Rekor).**
@@ -409,22 +456,62 @@ sequenceDiagram
end
```
**Verification (third party).**
**Event-driven re-evaluation & notify.**
```plantuml
@startuml
actor Verifier
participant "stellaops verify" as Tool
database "Fulcio/KMS root" as Root
participant "Rekor v2" as R2
Verifier -> Tool: bundle (URL/file)
Tool -> Tool: Verify DSSE signature
Tool -> Root: Verify cert chain to StellaOps root
Tool -> R2: Verify inclusion proof / query by UUID
Tool -> Verifier: OK + claims (license_id, policy_digest, version)
@enduml
```mermaid
sequenceDiagram
participant CONC as Concelier
participant EXC as Excititor
participant SCH as Scheduler
participant SC as Scanner.WebService
participant NO as Notify
CONC->>SCH: export.delta {changedProductKeys, exportId}
EXC ->>SCH: export.delta {changedProductKeys, exportId}
SCH->>SCH: Impact select via BOM-Index bitmaps
SCH->>SC: Enqueue analysis-only reports (batches)
SC-->>SCH: verdict stream (PASS/FAIL, deltas)
SCH->>NO: rescan.delta {imageDigest, newCriticals, links}
NO-->>Slack/Teams/Email/Webhook: deliver (throttle/digest rules applied)
```
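The impact-selection step in the sequence above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: each image's BOM-Index is modelled as a plain Python set of product keys (standing in for the roaring bitmaps the document mentions), and `changedProductKeys` is treated as a list of strings. The `select_impacted` function and the sample digests and keys are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def select_impacted(
    bom_index: Dict[str, Set[str]],
    changed_product_keys: Iterable[str],
) -> List[str]:
    """Return digests of images whose components intersect the changed keys."""
    changed = set(changed_product_keys)
    # An image is impacted when at least one of its components changed.
    return sorted(digest for digest, keys in bom_index.items() if keys & changed)


# Hypothetical sample data: image digest -> product keys found in its SBOM.
bom_index = {
    "sha256:aaa": {"pkg:npm/lodash", "pkg:deb/openssl"},
    "sha256:bbb": {"pkg:pypi/requests"},
    "sha256:ccc": {"pkg:deb/openssl", "pkg:golang/x/net"},
}

impacted = select_impacted(bom_index, ["pkg:deb/openssl"])
print(impacted)
```

A bitmap-backed index would replace the set intersection with a bitwise AND over component IDs, but the selection rule (any overlap means the image is re-evaluated) would stay the same under these assumptions.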
---
**End of `high_level_architecture.md` (Consolidated).**
## 14) Minimal data shapes (Scheduler & Notify)
**Scheduler schedule (YAML via UI/CLI)**
```yaml
name: nightly-eu
when: "0 2 * * * Europe/Sofia"
mode: analysis-only # or content-refresh
selection:
scope: all-images # or tenant/ns/repo label selectors
onlyIf: { lastReportOlderThanDays: 7 }
notify:
onNewFindings: true
minSeverity: high
limits:
maxJobs: 5000
ratePerSecond: 50
```
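One possible reading of the `limits` block above is that `maxJobs` caps the total number of enqueued jobs per run and `ratePerSecond` bounds how many are enqueued in each one-second window. The sketch below illustrates that interpretation only; the exact semantics are an assumption, and `plan_batches` is a hypothetical helper, not part of the Scheduler.

```python
from typing import Iterator, List, Sequence


def plan_batches(
    digests: Sequence[str], max_jobs: int, rate_per_second: int
) -> Iterator[List[str]]:
    """Yield one batch per second, never exceeding max_jobs in total."""
    if max_jobs < 0 or rate_per_second <= 0:
        raise ValueError("max_jobs must be >= 0 and rate_per_second must be > 0")
    # Apply the overall cap first, then slice into per-second batches.
    capped = list(digests[:max_jobs])
    for start in range(0, len(capped), rate_per_second):
        yield capped[start:start + rate_per_second]


# Hypothetical input: 12,000 candidate images against the limits shown above.
candidates = [f"sha256:{i:05d}" for i in range(12000)]
batches = list(plan_batches(candidates, max_jobs=5000, rate_per_second=50))
print(len(batches), len(batches[0]))
```

A worker following this plan would sleep for one second between batches; that pacing loop is left out to keep the example deterministic.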
**Notify rule (YAML)**
```yaml
name: high-critical-alerts
match:
eventKinds: ["report.ready","rescan.delta","zastava.admission"]
minSeverity: high
namespaces: ["prod-*"]
vex: { includeAcceptedJustifications: false }
actions:
- channel: slack
target: "#sec-alerts"
template: "concise"
throttle: "5m"
- channel: email
target: "soc@acme.org"
digest: "hourly"
enabled: true
```
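To show how the `match` section of such a rule might be evaluated, here is a minimal sketch. It assumes an event is a dictionary with `kind`, `severity`, and `namespace` fields, that severities are ordered low < medium < high < critical, and that `namespaces` entries are shell-style glob patterns. None of these details are specified in the source, and the `vex` filter is not modelled.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict

# Assumed severity ordering; the source does not enumerate the levels.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def rule_matches(rule: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True when an enabled rule's match block accepts the event."""
    if not rule.get("enabled", True):
        return False
    match = rule["match"]
    if event["kind"] not in match["eventKinds"]:
        return False
    if SEVERITY_RANK[event["severity"]] < SEVERITY_RANK[match["minSeverity"]]:
        return False
    # Namespace patterns such as "prod-*" are treated as globs (assumption).
    return any(fnmatchcase(event["namespace"], p) for p in match["namespaces"])


rule = {
    "name": "high-critical-alerts",
    "match": {
        "eventKinds": ["report.ready", "rescan.delta", "zastava.admission"],
        "minSeverity": "high",
        "namespaces": ["prod-*"],
    },
    "enabled": True,
}

# Hypothetical event shape, used only for this illustration.
event = {"kind": "rescan.delta", "severity": "critical", "namespace": "prod-eu"}
print(rule_matches(rule, event))
```

Throttle and digest handling would apply after a match, per action, and is deliberately outside the scope of this sketch.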