Full implementation plan (first draft)

This commit is contained in:
2025-10-19 00:28:48 +03:00
parent 6524626230
commit c4980d9625
125 changed files with 5438 additions and 166 deletions

View File

@@ -1,8 +1,3 @@
Below is the **revised, consolidated** `high_level_architecture.md`.
It **absorbs** all content from `components.md` so you have a single, authoritative file. No separate components doc is required.
---
# High-Level Architecture — **StellaOps** (Consolidated • 2025-Q4)
> **Purpose.** A complete, implementation-ready map of StellaOps: product vision, all runtime components, trust boundaries, tokens/licensing, control/data flows, storage, APIs, security, scale, DevOps, and verification logic.
@@ -30,28 +25,32 @@ It **absorbs** all content from `components.md` so you have a single, authoritat
### 1.1 Runtime inventory (first-party)
| Service / Tool | Container image | Core role | Scale pattern |
| ------------------------------- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| **Scanner.WebService** | `stellaops/scanner-web` | Control plane for scans; catalog; SBOM composition (inventory & usage); diff; exports. | Stateless; N replicas behind LB. |
| **Scanner.Worker** | `stellaops/scanner-worker` | Runs analyzers (OS, Lang: Java/Node/Python/Go/.NET/Rust, Native ELF/PE/Mach-O, EntryTrace); emits per-layer SBOMs and composes image SBOMs. | Horizontal; queue-driven; sharded by layer digest. |
| **Scanner.Sbomer.BuildXPlugin** | `stellaops/sbom-indexer` | BuildKit **generator** for build-time SBOMs as OCI **referrers**. | CI-side; ephemeral. |
| **Scanner.Sbomer.DockerImage** | `stellaops/scanner-cli` | CLI-orchestrated scanner container for post-build scans. | Local/CI; ephemeral. |
| **Concelier.WebService** | `stellaops/concelier-web` | Vulnerability ingest/normalize/merge/export (JSON + Trivy DB). | HA via Mongo locks. |
| **Excititor.WebService** | `stellaops/excititor-web` | VEX ingest/normalize/consensus; conflict retention; exports. | HA via Mongo locks. |
| **Policy Engine** | (in `scanner-web`) | YAML DSL evaluator (waivers, vendor preferences, KEV/EPSS, license, usage-gating); produces **policy digest**. | In-process; cache per digest. |
| **Signer** | `stellaops/signer` | **Hard gate:** validates entitlement + release integrity; mints signing cert (Fulcio keyless) or uses KMS; signs DSSE. | Stateless; HPA by QPS. |
| **Attestor** | `stellaops/attestor` | Posts DSSE bundles to **Rekor v2**; verification endpoints. | Stateless; HPA by QPS. |
| **Authority** | `stellaops/authority` | On-prem OIDC issuing **short-lived OpToks** with DPoP/mTLS sender constraint. | HA behind LB. |
| **Zastava** (Runtime) | `stellaops/zastava` | Runtime inspector/enforcer (observer + optional Admission Webhook). | DaemonSet + Webhook. |
| **Web UI** | `stellaops/ui` | Angular app for scans, diffs, policy, VEX, runtime, reports. | Stateless. |
| **StellaOps.Cli** | `stellaops/cli` | CLI for init/scan/export/diff/policy/report/verify; Buildx helper. | Local/CI. |
| Service / Tool | Container image | Core role | Scale pattern |
| ------------------------------- | ---------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| **Scanner.WebService** | `stellaops/scanner-web` | Control plane for scans; catalog; SBOM composition (inventory & usage); diff; exports; **analysis-only report runs** for Scheduler. | Stateless; N replicas behind LB. |
| **Scanner.Worker** | `stellaops/scanner-worker` | Runs analyzers (OS, Lang: Java/Node/Python/Go/.NET/Rust, Native ELF/PE/Mach-O, EntryTrace); emits per-layer SBOMs and composes image SBOMs. | Horizontal; queue-driven; sharded by layer digest. |
| **Scanner.Sbomer.BuildXPlugin** | `stellaops/sbom-indexer` | BuildKit **generator** for build-time SBOMs as OCI **referrers**. | CI-side; ephemeral. |
| **Scanner.Sbomer.DockerImage** | `stellaops/scanner-cli` | CLI-orchestrated scanner container for post-build scans. | Local/CI; ephemeral. |
| **Concelier.WebService** | `stellaops/concelier-web` | Vulnerability ingest/normalize/merge/export (JSON + Trivy DB). | HA via Mongo locks. |
| **Excititor.WebService** | `stellaops/excititor-web` | VEX ingest/normalize/consensus; conflict retention; exports. | HA via Mongo locks. |
| **Policy Engine** | (in `scanner-web`) | YAML DSL evaluator (waivers, vendor preferences, KEV/EPSS, license, usage-gating); produces **policy digest**. | In-process; cache per digest. |
| **Scheduler.WebService** | `stellaops/scheduler-web` | Schedules **re-evaluation** runs; consumes Concelier/Excititor deltas; selects **impacted images** via BOM-Index; orchestrates analysis-only reports. | Stateless API. |
| **Scheduler.Worker** | `stellaops/scheduler-worker` | Executes selection and enqueues batches toward Scanner; enforces rate limits and windows; maintains impact cursors. | Horizontal; queue-driven. |
| **Notify.WebService** | `stellaops/notify-web` | Rules engine for outbound notifications; manages channels, templates, throttle/digest logic. | Stateless API. |
| **Notify.Worker** | `stellaops/notify-worker` | Delivers to Slack/Teams/Email/Webhooks; idempotent retries; digests. | Horizontal; per-channel rate limits. |
| **Signer** | `stellaops/signer` | **Hard gate:** validates entitlement + release integrity; mints signing cert (Fulcio keyless) or uses KMS; signs DSSE. | Stateless; HPA by QPS. |
| **Attestor** | `stellaops/attestor` | Posts DSSE bundles to **Rekor v2**; verification endpoints. | Stateless; HPA by QPS. |
| **Authority** | `stellaops/authority` | On-prem OIDC issuing **short-lived OpToks** with DPoP/mTLS sender constraint. | HA behind LB. |
| **Zastava** (Runtime) | `stellaops/zastava` | Runtime inspector/enforcer (observer + optional Admission Webhook). | DaemonSet + Webhook. |
| **Web UI** | `stellaops/ui` | Angular app for scans, diffs, policy, VEX, **Scheduler**, **Notify**, runtime, reports. | Stateless. |
| **StellaOps.Cli** | `stellaops/cli` | CLI for init/scan/export/diff/policy/report/verify; Buildx helper; **schedule** and **notify** verbs. | Local/CI. |
### 1.2 Third-party (self-hosted)
* **Fulcio** (Sigstore CA) — issues short-lived signing certs (keyless).
* **Rekor v2** (tile-backed transparency log).
* **MinIO** — S3-compatible object store with lifecycle & Object Lock.
* **MongoDB** — catalog, advisories, VEX.
* **MongoDB** — catalog, advisories, VEX, scheduler, notify.
* **Queue** — Redis Streams / NATS / RabbitMQ (pluggable).
* **OCI Registry** — must support **Referrers API** (discover SBOMs/signatures).
@@ -71,8 +70,12 @@ flowchart LR
Auth[Authority (OIDC)\nOpTok (DPoP/mTLS)]
SW[Scanner.WebService]
WK[Scanner.Worker xN]
FEED[Concelier]
VEX[Excititor]
CONC[Concelier]
EXC[Excititor]
SCHW[Scheduler.Web]
SCH[Scheduler.Worker xN]
NOTW[Notify.Web]
NOT[Notify.Worker xN]
POL[Policy Engine (in Scanner.Web)]
SGN[Signer\n(entitlement + signing)]
ATT[Attestor\n(Rekor v2 submit/verify)]
@@ -93,11 +96,19 @@ flowchart LR
QUE --> WK
WK --> MIN
SW --> MGO
FEED --> MGO
VEX --> MGO
CONC --> MGO
EXC --> MGO
UI --> SW
Z --> SW
%% New event-driven loop
CONC -- export.delta --> SCHW
EXC -- export.delta --> SCHW
SCHW --> SCH
SCH --> SW
SW -- report.ready --> NOTW
Z -- admission/observe --> NOTW
SGN <--> Auth
SGN --> FUL
SGN -->|mTLS| ATT
@@ -106,7 +117,7 @@ flowchart LR
SGN <-->|verify referrers| REG
```
**Trust boundaries.** Only **Signer** can sign; only **Attestor** can write to **Rekor v2**. Scanner/UI never sign.
**Trust boundaries.** Only **Signer** can sign; only **Attestor** can write to **Rekor v2**. Scanner/UI/Scheduler/Notify never sign.
---
@@ -116,7 +127,7 @@ flowchart LR
* **License Token (LT)** — long-lived JWT from **Licensing Service**; used **once** to enroll the installation; never used in hot path.
* **Proof-of-Entitlement (PoE)** — bound to the installation key (mTLS client cert **or** DPoP-bound JWT with `cnf`); medium-lived; renewable; revocable.
* **Operational token (OpTok)** — 2-5 min OIDC token from **Authority**, **sender-constrained** (DPoP or mTLS). Used to authenticate to **Signer**/**Scanner.WebService**.
* **Operational token (OpTok)** — 2-5 min OIDC token from **Authority**, **sender-constrained** (DPoP or mTLS). Used to authenticate to **Signer**/**Scanner.WebService**/**Scheduler.Web**/**Notify.Web**.
**Signer enforces both:** PoE proves entitlement; OpTok proves “who is calling now”. It also **independently verifies** the **scanner image digest** is **StellaOps-signed** via **Referrers + cosign** before signing anything.
@@ -173,6 +184,11 @@ LS --> IA: PoE (mTLS client cert or JWT with cnf=K_inst), CRL/OCSP/introspect
* Buildx **generator** runs analyzers during `docker buildx build --attest=type=sbom,generator=stellaops/sbom-indexer`, attaches SBOMs as **OCI referrers**.
* Scanner.WebService can trust these (policy-configurable) and **skip** rescan; DSSE + Rekor v2 can be done either at build time or post-push via Signer/Attestor.
### 3.5 Events / integrations
* **Out:** `report.ready` (summary + verdict + Rekor UUID) → internal bus for **Notify** & UI.
* **Expose:** image-level **BOM-Index** metadata for **Scheduler** impact selection.
---
## 4) Backend evaluation (decider)
@@ -227,6 +243,8 @@ s3://stellaops/
* `artifacts` (type/format/sha/size/rekor/ttl/immutable/refCount/createdAt)
* `images`, `layers`, `links`, `lifecycleRules`
* **Scheduler:** `schedules`, `runs`, `locks`, `impact_cursors`
* **Notify:** `rules`, `deliveries`, `channels`, `templates`
**Retention**
@@ -239,13 +257,13 @@ s3://stellaops/
### 7.1 Scanner.WebService
```
POST /api/scans { imageRef|digest, force? } → { scanId }
GET /api/scans/{id} → { status, digests, artifacts[] }
GET /api/sboms/{imageDigest} ?format=cdx-json|cdx-pb|spdx-json&view=inventory|usage
POST /api/scans { imageRef|digest, force? } → { scanId }
GET /api/scans/{id} → { status, digests, artifacts[] }
GET /api/sboms/{imageDigest} ?format=cdx-json|cdx-pb|spdx-json&view=inventory|usage
GET /api/diff?old=<digest>&new=<digest> → { added[], removed[], changed[], byLayer[] }
POST /api/exports { imageDigest, format, view } → { artifactId, rekorUrl }
POST /api/reports { imageDigest, policyRevision? } → { reportId, rekorUrl }
GET /api/catalog/artifacts/{id} → { size, ttl, immutable, rekor, refs }
POST /api/exports { imageDigest, format, view } → { artifactId, rekorUrl }
POST /api/reports { imageDigest, policyRevision?, vexSnapshot? } → { reportId, verdict, rekorUrl }
GET /api/catalog/artifacts/{id} → { size, ttl, immutable, rekor, refs }
GET /healthz | /readyz | /metrics
```
@@ -276,6 +294,25 @@ POST /license/introspect { poe } → { active, claims, exp }
POST /attest/endorse { bundle } → endorsement bundle (optional)
```
### 7.6 Scheduler
```
POST /api/v1/scheduler/schedules {yaml|json} → { scheduleId }
GET /api/v1/scheduler/schedules → [ { id, nextRun, status, stats } ]
POST /api/v1/scheduler/run { id|selector } → { runId }
GET /api/v1/scheduler/runs/{id} → { status, counts, links }
GET /api/v1/scheduler/cursor → { lastConcelierExportId, lastExcititorExportId }
```
### 7.7 Notify
```
POST /api/v1/notify/test { channel, target } → { delivered }
POST /api/v1/notify/rules {yaml|json} → { ruleId }
GET /api/v1/notify/rules → [ { id, match, actions, enabled } ]
GET /api/v1/notify/deliveries → [ { id, eventId, channel, status, attempts } ]
```
---
## 8) Security & verifiability
@@ -283,8 +320,9 @@ POST /attest/endorse { bundle } → endorsement bundle (optio
* **Sender-constrained tokens.** All operational calls use **DPoP** (RFC 9449) or **mTLS-bound** tokens (RFC 8705).
* **Entitlement.** **PoE** is mandatory; revocation honored online.
* **Release integrity.** **Signer** independently verifies **scanner image digest** via **Referrers + cosign** before signing.
* **Separation of duties.** Scanner/UI cannot sign; only **Signer** can sign; only **Attestor** can write to **Rekor v2**.
* **Separation of duties.** Scanner/UI/Scheduler/Notify cannot sign; only **Signer** can sign; only **Attestor** can write to **Rekor v2**.
* **Verifiers.** Anyone can verify: DSSE signature → certificate chain to **StellaOps Fulcio/KMS root** → **Rekor v2** inclusion.
* **RBAC.** Roles: `scanner.admin|read`, `scheduler.admin|read`, `notify.admin|read`, `zastava.admin|read`.
* **Community vs Authorized.** Free/community runs throttled with no official attestations; authorized runs full speed and produce **StellaOps-verified** bundles.
**DSSE predicate (SBOM/report)**
@@ -321,6 +359,8 @@ Binary header + purl table + roaring bitmaps; optional `usedByEntrypoint` flags
* Build-time path P95 ≤35s on warmed bases.
* Post-build delta scan P95 ≤10s for 200 MB images.
* Policy + VEX evaluation ≤500ms for 5k components using BOM-Index.
* **Event → notification** p95 ≤ **30-60 s** under nominal load.
* **Export delta → re-evaluation verdict** p95 ≤ **5 min** for 10k impacted images.
* **Quotas:** license plan enforces QPS/concurrency/size; **Signer** throttles and can deny DSSE.
---
@@ -337,32 +377,37 @@ Binary header + purl table + roaring bitmaps; optional `usedByEntrypoint` flags
```yaml
services:
authority: { image: stellaops/authority }
fulcio: { image: sigstore/fulcio }
rekor: { image: sigstore/rekor-v2 }
minio: { image: minio/minio, command: server /data --console-address ":9001" }
mongo: { image: mongo:7 }
signer: { image: stellaops/signer, depends_on: [authority, fulcio] }
attestor: { image: stellaops/attestor, depends_on: [rekor, signer] }
scanner-web:{ image: stellaops/scanner-web, depends_on: [mongo, minio, signer, attestor] }
scanner-worker:
image: stellaops/scanner-worker
deploy: { replicas: 4 }
depends_on: [scanner-web]
concelier: { image: stellaops/concelier-web, depends_on: [mongo] }
excititor: { image: stellaops/excititor-web, depends_on: [mongo] }
ui: { image: stellaops/ui, depends_on: [scanner-web, concelier, excititor] }
authority: { image: stellaops/authority }
fulcio: { image: sigstore/fulcio }
rekor: { image: sigstore/rekor-v2 }
minio: { image: minio/minio, command: server /data --console-address ":9001" }
mongo: { image: mongo:7 }
signer: { image: stellaops/signer, depends_on: [authority, fulcio] }
attestor: { image: stellaops/attestor, depends_on: [rekor, signer] }
scanner-web: { image: stellaops/scanner-web, depends_on: [mongo, minio, signer, attestor] }
scanner-worker: { image: stellaops/scanner-worker, deploy: { replicas: 4 }, depends_on: [scanner-web] }
concelier: { image: stellaops/concelier-web, depends_on: [mongo] }
excititor: { image: stellaops/excititor-web, depends_on: [mongo] }
scheduler-web: { image: stellaops/scheduler-web, depends_on: [mongo] }
scheduler-worker:{ image: stellaops/scheduler-worker, deploy: { replicas: 2 }, depends_on: [scheduler-web] }
notify-web: { image: stellaops/notify-web, depends_on: [mongo] }
notify-worker: { image: stellaops/notify-worker, deploy: { replicas: 2 }, depends_on: [notify-web] }
ui: { image: stellaops/ui, depends_on: [scanner-web, concelier, excititor, scheduler-web, notify-web] }
```
* **Backups:** Mongo dumps; MinIO versioned buckets & replication; Rekor v2 DB snapshots; JWKS/Fulcio/KMS key rotation.
* **Ops runbooks:** Scheduler catch-up after Concelier/Excititor recovery; connector key rotation (Slack/Teams/SMTP).
* **SLOs & alerts:** lag between Concelier/Excititor export and first rescan verdict; delivery failure rates by channel.
---
## 11) Observability & audit
* **Metrics:** scan latency, layer cache hit %, artifact bytes, DSSE/Rekor latency, policy evaluation time, queue depth, admission decisions (Zastava).
* **Tracing:** per-stage spans; correlation IDs across Scanner→Signer→Attestor.
* **Audit logs:** every signing records `license_id`, `image_digest`, `policy_digest`, and Rekor UUID.
* **Scheduler metrics:** `scheduler.impacted_images_total`, `scheduler.jobs_enqueued_total`, `scheduler.selection_ms`, end-to-end p95 (event → verdict).
* **Notify metrics:** `notify.sent_total{channel}`, `notify.dropped_total{reason}`, `notify.digest_coalesced_total`, `notify.latency_ms`.
* **Tracing:** per-stage spans; correlation IDs across Scanner→Signer→Attestor and Concelier/Excititor→Scheduler→Scanner→Notify.
* **Audit logs:** every signing records `license_id`, `image_digest`, `policy_digest`, and Rekor UUID; Scheduler records who scheduled what; Notify records where, when, and why messages were sent or deduped.
* **Compliance:** MinIO **Object Lock** for immutable artifacts; reproducible outputs via policy digest + SBOM digest in predicate.
---
@@ -373,11 +418,13 @@ services:
* M2: Buildx generator certified flows; cross-registry trust policies.
* M3: PatchPresence plugin (signature-based backport detection), opt-in.
* M3: Zastava Admission control GA with policy presets and dry-run→enforce stages.
* M3: **Scheduler GA** with export-delta impact routing and capacity-aware pacing.
* M3: **Notify GA** with digests, Slack/Teams/Email/Webhooks; **M4:** PagerDuty/Opsgenie connectors.
* Continuous: Policy UX (waiver TTLs, vendor rules), Excititor connectors expansion.
---
## 13) Canonical sequences (verification & signing)
## 13) Canonical sequences (verification, re-evaluation & notify)
**Sign & log (OpTok + PoE, image verify, DSSE, Rekor).**
@@ -409,22 +456,62 @@ sequenceDiagram
end
```
**Verification (third party).**
**Event-driven re-evaluation & notify.**
```plantuml
@startuml
actor Verifier
participant "stellaops verify" as Tool
database "Fulcio/KMS root" as Root
participant "Rekor v2" as R2
Verifier -> Tool: bundle (URL/file)
Tool -> Tool: Verify DSSE signature
Tool -> Root: Verify cert chain to StellaOps root
Tool -> R2: Verify inclusion proof / query by UUID
Tool -> Verifier: OK + claims (license_id, policy_digest, version)
@enduml
```
```mermaid
sequenceDiagram
participant CONC as Concelier
participant EXC as Excititor
participant SCH as Scheduler
participant SC as Scanner.WebService
participant NO as Notify
CONC->>SCH: export.delta {changedProductKeys, exportId}
EXC ->>SCH: export.delta {changedProductKeys, exportId}
SCH->>SCH: Impact select via BOM-Index bitmaps
SCH->>SC: Enqueue analysis-only reports (batches)
SC-->>SCH: verdict stream (PASS/FAIL, deltas)
SCH->>NO: rescan.delta {imageDigest, newCriticals, links}
NO-->>Slack/Teams/Email/Webhook: deliver (throttle/digest rules applied)
```
---
**End of `high_level_architecture.md` (Consolidated).**
## 14) Minimal data shapes (Scheduler & Notify)
**Scheduler schedule (YAML via UI/CLI)**
```yaml
name: nightly-eu
when: "0 2 * * * Europe/Sofia"
mode: analysis-only # or content-refresh
selection:
scope: all-images # or tenant/ns/repo label selectors
onlyIf: { lastReportOlderThanDays: 7 }
notify:
onNewFindings: true
minSeverity: high
limits:
maxJobs: 5000
ratePerSecond: 50
```
**Notify rule (YAML)**
```yaml
name: high-critical-alerts
match:
eventKinds: ["report.ready","rescan.delta","zastava.admission"]
minSeverity: high
namespaces: ["prod-*"]
vex: { includeAcceptedJustifications: false }
actions:
- channel: slack
target: "#sec-alerts"
template: "concise"
throttle: "5m"
- channel: email
target: "soc@acme.org"
digest: "hourly"
enabled: true
```

View File

@@ -37,6 +37,8 @@ src/
**Language/runtime**: .NET 10 **Native AOT** for speed/startup; Linux builds use **musl** static when possible.
**Plug-in verbs.** Non-core verbs (Excititor, runtime helpers, future integrations) ship as restart-time plug-ins under `plugins/cli/**` with manifest descriptors. The launcher loads plug-ins on startup; hot reloading is intentionally unsupported.
**OS targets**: linux-x64/arm64, windows-x64/arm64, macOS-x64/arm64.
---
@@ -386,4 +388,3 @@ script:
* macOS: 13-15 (x64, arm64).
* Windows: 10/11, Server 2019/2022 (x64, arm64).
* Docker engines: Docker Desktop, containerd-based runners.

456
docs/ARCHITECTURE_NOTIFY.md Normal file
View File

@@ -0,0 +1,456 @@
> **Scope.** Implementation-ready architecture for **Notify**: a rules-driven, tenant-aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator-defined routing rules, renders **channel-specific messages** (Slack/Teams/Email/Webhook), and delivers them **reliably** with idempotency, throttling, and digests. It is UI-managed, auditable, and safe by default (no secrets leakage, no spam storms).
---
## 0) Mission & boundaries
**Mission.** Convert **facts** from StellaOps into **actionable, noise-controlled** signals where teams already live (chat/email/webhooks), with **explainable** reasons and deep links to the UI.
**Boundaries.**
* Notify **does not make policy decisions** and **does not rescan**; it **consumes** events from Scanner/Scheduler/Vexer/Feedser/Attestor/Zastava and routes them.
* Attachments are **links** (UI/attestation pages); Notify **does not** attach SBOMs or large blobs to messages.
* Secrets for channels (Slack tokens, SMTP creds) are **referenced**, not stored raw in Mongo.
---
## 1) Runtime shape & projects
```
src/
├─ StellaOps.Notify.WebService/ # REST: rules/channels CRUD, test send, deliveries browse
├─ StellaOps.Notify.Worker/ # consumers + evaluators + renderers + delivery workers
├─ StellaOps.Notify.Connectors.* / # channel plug-ins: Slack, Teams, Email, Webhook (v1)
│ └─ *.Tests/
├─ StellaOps.Notify.Engine/ # rules engine, templates, idempotency, digests, throttles
├─ StellaOps.Notify.Models/ # DTOs (Rule, Channel, Event, Delivery, Template)
├─ StellaOps.Notify.Storage.Mongo/ # rules, channels, deliveries, digests, locks
├─ StellaOps.Notify.Queue/ # bus client (Redis Streams/NATS JetStream)
└─ StellaOps.Notify.Tests.* # unit/integration/e2e
```
**Deployables**:
* **Notify.WebService** (stateless API)
* **Notify.Worker** (horizontal scale)
**Dependencies**: Authority (OpToks; DPoP/mTLS), MongoDB, Redis/NATS (bus), HTTP egress to Slack/Teams/Webhooks, SMTP relay for Email.
---
## 2) Responsibilities
1. **Ingest** platform events from internal bus with strong ordering per key (e.g., image digest).
2. **Evaluate rules** (tenant-scoped) with matchers: severity changes, namespaces, repos, labels, KEV flags, provider provenance (VEX), component keys, admission decisions, etc.
3. **Control noise**: **throttle**, **coalesce** (digest windows), and **dedupe** via idempotency keys.
4. **Render** channel-specific messages using safe templates; include **evidence** and **links**.
5. **Deliver** with retries/backoff; record outcome; expose delivery history to UI.
6. **Test** paths (send test to channel targets) without touching live rules.
7. **Audit**: log who configured what, when, and why a message was sent.
---
## 3) Event model (inputs)
Notify subscribes to the **internal event bus** (produced by services, escaped JSON; gzip allowed with caps):
* `scanner.scan.completed` — new SBOM(s) composed; artifacts ready
* `scanner.report.ready` — analysis verdict (policy+vex) available; carries deltas summary
* `scheduler.rescan.delta` — new findings after Feedser/Vexer deltas (already summarized)
* `attestor.logged` — Rekor UUID returned (sbom/report/vex export)
* `zastava.admission` — admit/deny with reasons, namespace, image digests
* `feedser.export.completed` — new export ready (rarely notified directly; usually drives Scheduler)
* `vexer.export.completed` — new consensus snapshot (ditto)
**Canonical envelope (bus → Notify.Engine):**
```json
{
"eventId": "uuid",
"kind": "scanner.report.ready",
"tenant": "tenant-01",
"ts": "2025-10-18T05:41:22Z",
"actor": "scanner-webservice",
"scope": { "namespace":"payments", "repo":"ghcr.io/acme/api", "digest":"sha256:..." },
"payload": { /* kind-specific fields, see below */ }
}
```
**Examples (payload cores):**
* `scanner.report.ready`:
```json
{ "verdict":"fail|warn|pass",
"delta": { "newCritical":1, "newHigh":2, "kev":["CVE-2025-..."] },
"topFindings":[{"purl":"pkg:rpm/openssl","vulnId":"CVE-2025-...","severity":"critical"}],
"links":{"ui":"https://ui/...","rekor":"https://rekor/..."} }
```
* `zastava.admission`:
```json
{ "decision":"deny|allow", "reasons":["unsigned image","missing SBOM"],
"images":[{"digest":"sha256:...","signed":false,"hasSbom":false}] }
```
---
## 4) Rules engine — semantics
**Rule shape (simplified):**
```yaml
name: "high-critical-alerts-prod"
enabled: true
match:
eventKinds: ["scanner.report.ready","scheduler.rescan.delta","zastava.admission"]
namespaces: ["prod-*"]
repos: ["ghcr.io/acme/*"]
minSeverity: "high" # min of new findings (delta context)
kev: true # require KEV-tagged or allow any if false
verdict: ["fail","deny"] # filter for report/admission
vex:
includeRejectedJustifications: false # notify only on accepted 'affected'
actions:
- channel: "slack:sec-alerts" # reference to Channel object
template: "concise"
throttle: "5m"
- channel: "email:soc"
digest: "hourly"
template: "detailed"
```
**Evaluation order**
1. **Tenant check** → discard if rule tenant ≠ event tenant.
2. **Kind filter** → discard early.
3. **Scope match** (namespace/repo/labels).
4. **Delta/severity gates** (if event carries `delta`).
5. **VEX gate** (drop if the event's finding is not affected under policy consensus, unless the rule says otherwise).
6. **Throttling/dedup** (idempotency key) — skip if suppressed.
7. **Actions** → enqueue per-channel job(s). A matcher sketch follows this list.
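A minimal C# sketch of steps 1-5 above, assuming simplified stand-in DTOs (`NotifyEvent`, `NotifyRule`) rather than the real `StellaOps.Notify.Models` types:
```csharp
// Illustrative matcher for the evaluation order above; steps 6-7 (throttle, actions) happen later in the pipeline.
using System;
using System.Linq;
using System.Text.RegularExpressions;

record NotifyEvent(string Tenant, string Kind, string Namespace, string Repo, int NewCritical, int NewHigh);
record NotifyRule(string Tenant, string[] EventKinds, string[] Namespaces, string MinSeverity);

static class RuleMatcher
{
    static readonly string[] SeverityOrder = { "low", "medium", "high", "critical" };

    public static bool Matches(NotifyRule rule, NotifyEvent evt)
    {
        if (!string.Equals(rule.Tenant, evt.Tenant, StringComparison.Ordinal)) return false;              // 1. tenant check
        if (!rule.EventKinds.Contains(evt.Kind, StringComparer.Ordinal)) return false;                    // 2. kind filter
        if (rule.Namespaces.Length > 0 && !rule.Namespaces.Any(g => Glob(g, evt.Namespace))) return false; // 3. scope match
        return PassesSeverityGate(rule.MinSeverity, evt);                                                  // 4. delta/severity gate
        // 5. VEX gate would consult policy consensus here; omitted in this sketch.
    }

    static bool PassesSeverityGate(string minSeverity, NotifyEvent evt)
    {
        var min = Array.IndexOf(SeverityOrder, minSeverity);
        // Treat new criticals/highs in the delta as the highest severity present.
        var observed = evt.NewCritical > 0 ? 3 : evt.NewHigh > 0 ? 2 : -1;
        return observed >= min;
    }

    // "prod-*" style matching: escape the pattern, then turn '*' into '.*'.
    static bool Glob(string pattern, string value) =>
        Regex.IsMatch(value, "^" + Regex.Escape(pattern).Replace("\\*", ".*") + "$");
}
```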
**Idempotency key**: `hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket)`; ensures “same alert” doesn't fire more than once within the throttle window.
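A possible shape for computing that key, assuming the separator and day-bucket format shown in the formula (the exact canonical form is an implementation detail):
```csharp
// Deterministic idempotency key: same inputs within the same UTC day bucket yield the same key.
using System;
using System.Security.Cryptography;
using System.Text;

static class IdempotencyKey
{
    public static string Compute(string ruleId, string actionId, string eventKind,
                                 string scopeDigest, string deltaHash, DateTimeOffset ts)
    {
        var dayBucket = ts.UtcDateTime.ToString("yyyy-MM-dd");
        var material = string.Join("|", ruleId, actionId, eventKind, scopeDigest, deltaHash, dayBucket);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(material));
        return "idem:" + Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```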
**Digest windows**: maintain per action a **coalescer**:
* Window: `5m|15m|1h|1d` (configurable); coalesces events by tenant + namespace/repo or by digest group.
* Digest messages summarize top N items and counts, with safe truncation.
---
## 5) Channels & connectors (plugins)
Channel config is **two-part**: a **Channel** record (name, type, options) and a Secret **reference** (Vault/K8s Secret). Connectors are **restart-time plug-ins** discovered on service start (same manifest convention as Concelier/Excititor) and live under `plugins/notify/<channel>/`.
**Built-in v1:**
* **Slack**: Bot token (xoxb…), `chat.postMessage` + `blocks`; rate limit aware (HTTP 429).
* **Microsoft Teams**: Incoming Webhook (or Graph card later); adaptive card payloads.
* **Email (SMTP)**: TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
* **Generic Webhook**: POST JSON with an HMAC (SHA-256) or Ed25519 signature in headers.
**Connector contract:** (implemented by plug-in assemblies)
```csharp
public interface INotifyConnector {
string Type { get; } // "slack" | "teams" | "email" | "webhook" | ...
Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}
```
**DeliveryContext** includes **rendered content** and **raw event** for audit.
**Secrets**: `ChannelConfig.secretRef` points to Authoritymanaged secret handle or K8s Secret path; workers load at send-time; plug-in manifests (`notify-plugin.json`) declare capabilities and version.
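An illustrative webhook connector against the `INotifyConnector` contract above; `ChannelConfig`, `DeliveryContext`, `DeliveryResult`, and `HealthResult` are simplified stand-ins, not the shipped `StellaOps.Notify.Models` shapes:
```csharp
// Minimal webhook connector sketch; real connectors also apply signing, redaction, and retry policy.
using System;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public sealed record ChannelConfig(string Endpoint, string SecretRef);
public sealed record DeliveryContext(ChannelConfig Channel, string RenderedBody);
public sealed record DeliveryResult(bool Success, int? StatusCode, string? Error);
public sealed record HealthResult(bool Healthy, string? Detail);

public sealed class WebhookConnector : INotifyConnector
{
    private static readonly HttpClient Http = new() { Timeout = TimeSpan.FromSeconds(10) };

    public string Type => "webhook";

    public async Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct)
    {
        using var req = new HttpRequestMessage(HttpMethod.Post, ctx.Channel.Endpoint)
        {
            Content = new StringContent(ctx.RenderedBody, Encoding.UTF8, "application/json")
        };
        // The X-StellaOps-Signature header would be attached here in the real pipeline (see §11).
        using var resp = await Http.SendAsync(req, ct);
        return resp.IsSuccessStatusCode
            ? new DeliveryResult(true, (int)resp.StatusCode, null)
            : new DeliveryResult(false, (int)resp.StatusCode, resp.ReasonPhrase);
    }

    public Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct) =>
        Task.FromResult(new HealthResult(Uri.IsWellFormedUriString(cfg.Endpoint, UriKind.Absolute),
                                         "endpoint format check only"));
}
```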
---
## 6) Templates & rendering
**Template engine**: strongly typed, safe Handlebars-style; no arbitrary code. Partial templates per channel. Deterministic outputs (prop order, no locale drift unless requested).
**Variables** (examples):
* `event.kind`, `event.ts`, `scope.namespace`, `scope.repo`, `scope.digest`
* `payload.verdict`, `payload.delta.newCritical`, `payload.links.ui`, `payload.links.rekor`
* `topFindings[]` with `purl`, `vulnId`, `severity`
* `policy.name`, `policy.revision` (if available)
**Helpers**:
* `severity_icon(sev)`, `link(text,url)`, `pluralize(n, "finding")`, `truncate(text, n)`, `code(text)`.
**Channel mapping**:
* Slack: title + blocks, limited to 50 blocks/3000 chars per section; long lists → link to UI.
* Teams: Adaptive Card schema 1.5; fallback text for older channels.
* Email: HTML + text; inline table of top N findings, rest behind UI link.
* Webhook: JSON with `event`, `ruleId`, `actionId`, `summary`, `links`, and raw `payload` subset.
**i18n**: template set per locale (English default; Bulgarian built-in).
---
## 7) Data model (Mongo)
**Database**: `notify`
* `rules`
```
{ _id, tenantId, name, enabled, match, actions, createdBy, updatedBy, createdAt, updatedAt }
```
* `channels`
```
{ _id, tenantId, name:"slack:sec-alerts", type:"slack",
config:{ webhookUrl?:"", channel:"#sec-alerts", workspace?: "...", secretRef:"ref://..." },
createdAt, updatedAt }
```
* `deliveries`
```
{ _id, tenantId, ruleId, actionId, eventId, kind, scope, status:"sent|failed|throttled|digested|dropped",
attempts:[{ts, status, code, reason}],
rendered:{ title, body, target }, // redacted for PII; body hash stored
sentAt, lastError? }
```
* `digests`
```
{ _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" }
```
* `throttles`
```
{ key:"idem:<hash>", ttlAt } // short-lived, also cached in Redis
```
**Indexes**: rules by `{tenantId, enabled}`, deliveries by `{tenantId, sentAt desc}`, digests by `{tenantId, actionKey}`.
---
## 8) External APIs (WebService)
Base path: `/api/v1/notify` (Authority OpToks; scopes: `notify.admin` for write, `notify.read` for view).
* **Channels**
* `POST /channels` | `GET /channels` | `GET /channels/{id}` | `PATCH /channels/{id}` | `DELETE /channels/{id}`
* `POST /channels/{id}/test` → send sample message (no rule evaluation)
* `GET /channels/{id}/health` → connector self-check
* **Rules**
* `POST /rules` | `GET /rules` | `GET /rules/{id}` | `PATCH /rules/{id}` | `DELETE /rules/{id}`
* `POST /rules/{id}/test` → dry-run rule against a **sample event** (no delivery unless `--send`)
* **Deliveries**
* `GET /deliveries?tenant=...&since=...` → list
* `GET /deliveries/{id}` → detail (redacted body + metadata)
* `POST /deliveries/{id}/retry` → force retry (admin)
* **Admin**
* `GET /stats` (per tenant counts, last hour/day)
* `GET /healthz|readyz` (liveness)
**Ingestion**: workers do **not** expose public ingestion; they **subscribe** to the internal bus. (Optional `/events/test` for integration testing, admin-only.)
---
## 9) Delivery pipeline (worker)
```
[Event bus] → [Ingestor] → [RuleMatcher] → [Throttle/Dedupe] → [DigestCoalescer] → [Renderer] → [Connector] → [Result]
└────────→ [DeliveryStore]
```
* **Ingestor**: N consumers with per-key ordering (key = tenant|digest|namespace).
* **RuleMatcher**: loads active rules snapshot for tenant into memory; vectorized predicate check.
* **Throttle/Dedupe**: consult Redis + Mongo `throttles`; if hit → record `status=throttled`.
* **DigestCoalescer**: append to open digest window or flush when timer expires.
* **Renderer**: select template (channel+locale), inject variables, enforce length limits, compute `bodyHash`.
* **Connector**: send; handle provider-specific rate limits and backoffs; `maxAttempts` with exponential jitter; overflow → DLQ (dead-letter topic) + UI surfacing.
**Idempotency**: per action **idempotency key** stored in Redis (TTL = `throttle window` or `digest window`). Connectors also respect **provider** idempotency where available (e.g., Slack `client_msg_id`).
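One way to implement that throttle check with Redis (StackExchange.Redis shown as an example client; key naming and TTL source are assumptions):
```csharp
// SET key 1 EX <window> NX — only the first caller inside the window acquires the key.
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public sealed class ThrottleGate
{
    private readonly IDatabase _redis;
    public ThrottleGate(IConnectionMultiplexer mux) => _redis = mux.GetDatabase();

    /// Returns true when the action may proceed; false when an identical alert
    /// was already recorded inside the throttle (or digest) window.
    public Task<bool> TryAcquireAsync(string idempotencyKey, TimeSpan window) =>
        _redis.StringSetAsync(idempotencyKey, "1", window, When.NotExists);
}
```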
---
## 10) Reliability & rate controls
* **Per-tenant** RPM caps (default 600/min) + **per-channel** concurrency (Slack 1-4, Teams 1-2, Email 8-32 depending on the relay); see the pacing sketch after this list.
* **Backoff** map: Slack 429 → respect `Retry-After`; SMTP 4xx → retry; 5xx → retry with jitter; permanent rejects → drop with status recorded.
* **DLQ**: NATS/Redis stream `notify.dlq` with `{event, rule, action, error}` for operator inspection; UI shows DLQ items.
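A pacing sketch using `System.Threading.RateLimiting`; the per-channel permit counts mirror the configuration defaults shown later, while the wiring itself is an assumption rather than the shipped Notify.Worker code:
```csharp
// Per-channel concurrency caps via ConcurrencyLimiter; one limiter instance per channel type.
using System;
using System.Collections.Concurrent;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

public sealed class ChannelLimiter
{
    private readonly ConcurrentDictionary<string, RateLimiter> _limiters = new();

    private static RateLimiter Create(string channel) => channel switch
    {
        "slack" => new ConcurrencyLimiter(new ConcurrencyLimiterOptions
            { PermitLimit = 2, QueueLimit = 100, QueueProcessingOrder = QueueProcessingOrder.OldestFirst }),
        "email" => new ConcurrencyLimiter(new ConcurrencyLimiterOptions
            { PermitLimit = 8, QueueLimit = 500, QueueProcessingOrder = QueueProcessingOrder.OldestFirst }),
        _ => new ConcurrencyLimiter(new ConcurrencyLimiterOptions
            { PermitLimit = 1, QueueLimit = 100, QueueProcessingOrder = QueueProcessingOrder.OldestFirst }),
    };

    public async Task SendWithLimitAsync(string channel, Func<Task> send)
    {
        var limiter = _limiters.GetOrAdd(channel, Create);
        using RateLimitLease lease = await limiter.AcquireAsync(1);
        if (lease.IsAcquired) await send();
        // else: queue overflowed — caller should retry later or route the job to the DLQ.
    }
}
```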
---
## 11) Security & privacy
* **AuthZ**: all APIs require **Authority** OpToks; actions scoped by tenant.
* **Secrets**: `secretRef` only; Notify fetches just-in-time from Authority Secret proxy or K8s Secret (mounted). No plaintext secrets in Mongo.
* **Egress TLS**: validate certificates; pin domains per channel config; optional CA bundle override for on-prem SMTP.
* **Webhook signing**: HMAC or Ed25519 signatures in `X-StellaOps-Signature` plus a replay-window timestamp; include the canonical body hash in the header (see the signing sketch after this list).
* **Redaction**: deliveries store **hashes** of bodies, not full payloads for chat/email to minimize PII retention (configurable).
* **Quiet hours**: per tenant (e.g., 22:00-06:00) route high-severity notifications only; defer others to digests.
* **Loop prevention**: Webhook target allowlist + event origin tags; do not ingest own webhooks.
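One possible way to produce the `X-StellaOps-Signature` value described above (HMAC-SHA256 variant); the `t=<unix>,v1=<hex>` layout is an assumption, not a published wire format:
```csharp
// Binds the signature to both the timestamp (replay window) and a hash of the canonical body.
using System;
using System.Security.Cryptography;
using System.Text;

static class WebhookSigner
{
    public static string Sign(byte[] secret, string canonicalBody, DateTimeOffset now)
    {
        var ts = now.ToUnixTimeSeconds();
        var bodyHash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(canonicalBody)));
        var payload = $"{ts}.{bodyHash}";
        var mac = Convert.ToHexString(HMACSHA256.HashData(secret, Encoding.UTF8.GetBytes(payload)));
        return $"t={ts},v1={mac.ToLowerInvariant()}";
    }
}
```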
---
## 12) Observability (Prometheus + OTEL)
* `notify.events_consumed_total{kind}`
* `notify.rules_matched_total{ruleId}`
* `notify.throttled_total{reason}`
* `notify.digest_coalesced_total{window}`
* `notify.sent_total{channel}` / `notify.failed_total{channel,code}`
* `notify.delivery_latency_seconds{channel}` (end-to-end)
* **Tracing**: spans `ingest`, `match`, `render`, `send`; correlation id = `eventId`.
**SLO targets**
* Event→delivery p95 **≤ 30-60 s** under nominal load.
* Failure rate p95 **< 0.5%** per hour (excluding provider outages).
* Duplicate rate **≈ 0** (idempotency working).
---
## 13) Configuration (YAML)
```yaml
notify:
authority:
issuer: "https://authority.internal"
require: "dpop" # or "mtls"
bus:
kind: "redis" # or "nats"
streams:
- "scanner.events"
- "scheduler.events"
- "attestor.events"
- "zastava.events"
mongo:
uri: "mongodb://mongo/notify"
limits:
perTenantRpm: 600
perChannel:
slack: { concurrency: 2 }
teams: { concurrency: 1 }
email: { concurrency: 8 }
webhook: { concurrency: 8 }
digests:
defaultWindow: "1h"
maxItems: 100
quietHours:
enabled: true
window: "22:00-06:00"
minSeverity: "critical"
webhooks:
sign:
method: "ed25519" # or "hmac-sha256"
keyRef: "ref://notify/webhook-sign-key"
```
---
## 14) UI touchpoints
* **Notifications → Channels**: add Slack/Teams/Email/Webhook; run **health**; rotate secrets.
* **Notifications → Rules**: create/edit YAML rules with linting; test with sample events; see match rate.
* **Notifications → Deliveries**: timeline with filters (status, channel, rule); inspect last error; retry.
* **Digest preview**: shows current window contents and when it will flush.
* **Quiet hours**: configure per tenant; show overrides.
* **DLQ**: browse dead-letters; requeue after fix.
---
## 15) Failure modes & responses
| Condition | Behavior |
| ----------------------------------- | ------------------------------------------------------------------------------------- |
| Slack 429 / Teams 429 | Respect `Retry-After`, backoff with jitter, reduce concurrency |
| SMTP transient 4xx | Retry up to `maxAttempts`; escalate to DLQ on exhaust |
| Invalid channel secret | Mark channel unhealthy; suppress sends; surface in UI |
| Rule explosion (matches everything) | Safety valve: per-tenant RPM caps; auto-pause rule after X drops; UI alert |
| Bus outage | Buffer to local queue (bounded); resume consuming when healthy |
| Mongo slowness | Fall back to Redis throttles; batch write deliveries; shed low-priority notifications |
---
## 16) Testing matrix
* **Unit**: matchers, throttle math, digest coalescing, idempotency keys, template rendering edge cases.
* **Connectors**: provider-level rate limits, payload size truncation, error mapping.
* **Integration**: synthetic event storm (10k/min), ensure p95 latency & duplicate rate.
* **Security**: DPoP/mTLS on APIs; secretRef resolution; webhook signing & replay windows.
* **i18n**: localized templates render deterministically.
* **Chaos**: Slack/Teams API flaps; SMTP greylisting; Redis hiccups; ensure graceful degradation.
---
## 17) Sequences (representative)
**A) New criticals after Feedser delta (Slack immediate + Email hourly digest)**
```mermaid
sequenceDiagram
autonumber
participant SCH as Scheduler
participant NO as Notify.Worker
participant SL as Slack
participant SMTP as Email
SCH->>NO: bus event scheduler.rescan.delta { newCritical:1, digest:sha256:... }
NO->>NO: match rules (Slack immediate; Email hourly digest)
NO->>SL: chat.postMessage (concise)
SL-->>NO: 200 OK
NO->>NO: append to digest window (email:soc)
Note over NO: At window close → render digest email
NO->>SMTP: send email (detailed digest)
SMTP-->>NO: 250 OK
```
**B) Admission deny (Teams card + Webhook)**
```mermaid
sequenceDiagram
autonumber
participant ZA as Zastava
participant NO as Notify.Worker
participant TE as Teams
participant WH as Webhook
ZA->>NO: bus event zastava.admission { decision: "deny", reasons: [...] }
NO->>TE: POST adaptive card
TE-->>NO: 200 OK
NO->>WH: POST JSON (signed)
WH-->>NO: 2xx
```
---
## 18) Implementation notes
* **Language**: .NET 10; minimal API; `System.Text.Json` with canonical writer for body hashing; Channels for pipelines.
* **Bus**: Redis Streams (**XGROUP** consumers) or NATS JetStream for at-least-once delivery with ack; per-tenant consumer groups to localize backpressure.
* **Templates**: compile and cache per rule+channel+locale; version with rule `updatedAt` to invalidate.
* **Rules**: store raw YAML + parsed AST; validate with schema + static checks (e.g., nonsensical combos).
* **Secrets**: pluggable secret resolver (Authority Secret proxy, K8s, Vault).
* **Rate limiting**: `System.Threading.RateLimiting` + per-connector adapters.
---
## 19) Roadmap (postv1)
* **PagerDuty/Opsgenie** connectors; **Jira** ticket creation.
* **User inbox** (in-app notifications) + mobile push via webhook relay.
* **Anomaly suppression**: auto-pause noisy rules with hints (learned thresholds).
* **Graph rules**: “only notify if *not_affected → affected* transition at consensus layer”.
* **Label enrichment**: pluggable taggers (business criticality, data classification) to refine matchers.

View File

@@ -40,6 +40,8 @@ src/
└─ StellaOps.Scanner.Sbomer.DockerImage/ # CLIdriven scanner container
```
Analyzer assemblies and buildx generators are packaged as **restart-time plug-ins** under `plugins/scanner/**` with manifests; services must restart to activate new plug-ins.
**Runtime formfactor:** two deployables
* **Scanner.WebService** (stateless REST)
@@ -410,4 +412,3 @@ vector<string> purls
map<purlIndex, roaring_bitmap> components
optional map<purlIndex, roaring_bitmap> usedByEntrypoint
```

View File

@@ -0,0 +1,424 @@
# component_architecture_scheduler.md — **StellaOps Scheduler** (2025-Q4)
> **Scope.** Implementation-ready architecture for **Scheduler**: a service that (1) **re-evaluates** already-cataloged images when intel changes (Feedser/Vexer/policy), (2) orchestrates **nightly** and **ad-hoc** runs, (3) targets only the **impacted** images using the BOM-Index, and (4) emits **report-ready** events that downstream **Notify** fans out. Default mode is **analysis-only** (no image pull); optional **content-refresh** can be enabled per schedule.
---
## 0) Mission & boundaries
**Mission.** Keep scan results **current** without rescanning the world. When new advisories or VEX claims land, **pinpoint** affected images and ask the backend to recompute **verdicts** against the **existing SBOMs**. Surface only **meaningful deltas** to humans and ticket queues.
**Boundaries.**
* Scheduler **does not** compute SBOMs and **does not** sign. It calls Scanner.WebService's **/reports (analysis-only)** endpoint and lets the backend (Policy + Vexer + Feedser) decide PASS/FAIL.
* Scheduler **may** ask Scanner to **content-refresh** selected targets (e.g., mutable tags) but the default is **no** image pull.
* Notifications are **not** sent directly; Scheduler emits events consumed by **Notify**.
---
## 1) Runtime shape & projects
```
src/
├─ StellaOps.Scheduler.WebService/ # REST (schedules CRUD, runs, admin)
├─ StellaOps.Scheduler.Worker/ # planners + runners (N replicas)
├─ StellaOps.Scheduler.ImpactIndex/ # purl→images inverted index (roaring bitmaps)
├─ StellaOps.Scheduler.Models/ # DTOs (Schedule, Run, ImpactSet, Deltas)
├─ StellaOps.Scheduler.Storage.Mongo/ # schedules, runs, cursors, locks
├─ StellaOps.Scheduler.Queue/ # Redis Streams / NATS abstraction
├─ StellaOps.Scheduler.Tests.* # unit/integration/e2e
```
**Deployables**:
* **Scheduler.WebService** (stateless)
* **Scheduler.Worker** (scale-out; planners + executors)
**Dependencies**: Authority (OpTok + DPoP/mTLS), Scanner.WebService, Feedser, Vexer, MongoDB, Redis/NATS, (optional) Notify.
---
## 2) Core responsibilities
1. **Time-based** runs: cron windows per tenant/timezone (e.g., “02:00 Europe/Sofia”).
2. **Event-driven** runs: react to **Feedser export** and **Vexer export** deltas (changed product keys / advisories / claims).
3. **Impact targeting**: map changes to **image sets** using a **global inverted index** built from Scanner's per-image **BOM-Index** sidecars.
4. **Run planning**: shard, pace, and rate-limit jobs to avoid thundering herds.
5. **Execution**: call Scanner **/reports (analysis-only)** or **/scans (content-refresh)**; aggregate **delta** results.
6. **Events**: publish `rescan.delta` and `report.ready` summaries for **Notify** & **UI**.
7. **Control plane**: CRUD schedules, **pause/resume**, dry-run previews, audit.
---
## 3) Data model (Mongo)
**Database**: `scheduler`
* `schedules`
```
{ _id, tenantId, name, enabled, whenCron, timezone,
mode: "analysis-only" | "content-refresh",
selection: { scope: "all-images" | "by-namespace" | "by-repo" | "by-digest" | "by-labels",
includeTags?: ["prod-*"], digests?: [sha256...], resolvesTags?: bool },
onlyIf: { lastReportOlderThanDays?: int, policyRevision?: string },
notify: { onNewFindings: bool, minSeverity: "low|medium|high|critical", includeKEV: bool },
limits: { maxJobs?: int, ratePerSecond?: int, parallelism?: int },
createdAt, updatedAt, createdBy, updatedBy }
```
* `runs`
```
{ _id, scheduleId?, tenantId, trigger: "cron|feedser|vexer|manual",
reason?: { feedserExportId?, vexerExportId?, cursor? },
state: "planning|queued|running|completed|error|cancelled",
stats: { candidates: int, deduped: int, queued: int, completed: int, deltas: int, newCriticals: int },
startedAt, finishedAt, error? }
```
* `impact_cursors`
```
{ _id: tenantId, feedserLastExportId, vexerLastExportId, updatedAt }
```
* `locks` (singleton schedulers, run leases)
* `audit` (CRUD actions, run outcomes)
**Indexes**:
* `schedules` on `{tenantId, enabled}`, `{whenCron}`.
* `runs` on `{tenantId, startedAt desc}`, `{state}`.
* TTL optional for completed runs (e.g., 180 days).
---
## 4) ImpactIndex (global inverted index)
Goal: translate **change keys** → **image sets** in **milliseconds**.
**Source**: Scanner produces per-image **BOM-Index** sidecars (purls, and `usedByEntrypoint` bitmaps). Scheduler ingests/refreshes them to build a **global** index.
**Representation**:
* Assign **image IDs** (dense ints) to catalog images.
* Keep **Roaring Bitmaps**:
* `Contains[purl] → bitmap(imageIds)`
* `UsedBy[purl] → bitmap(imageIds)` (subset of Contains)
* Optionally keep **Owner maps**: `{imageId → {tenantId, namespaces[], repos[]}}` for selection filters.
* Persist in RocksDB/LMDB or Redis modules; cache hot shards in memory; snapshot to Mongo for cold start.
**Update paths**:
* On new/updated image SBOM: **merge** the per-image set into global maps.
* On image remove/expiry: **clear** id from bitmaps.
**API (internal)**:
```csharp
IImpactIndex {
ImpactSet ResolveByPurls(IEnumerable<string> purls, bool usageOnly, Selector sel);
ImpactSet ResolveByVulns(IEnumerable<string> vulnIds, bool usageOnly, Selector sel); // optional (vuln->purl precomputed by Feedser)
ImpactSet ResolveAll(Selector sel); // for nightly
}
```
**Selector filters**: tenant, namespaces, repos, labels, digest allowlists, `includeTags` patterns.
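A deliberately simplified in-memory sketch of the index above, with `HashSet<int>` standing in for roaring bitmaps and the selector reduced to a tenant filter:
```csharp
// Illustrative only: the real ImpactIndex uses roaring bitmaps and persists shards (RocksDB/Redis).
using System.Collections.Generic;

public sealed class InMemoryImpactIndex
{
    private readonly Dictionary<string, HashSet<int>> _contains = new(); // purl -> imageIds
    private readonly Dictionary<string, HashSet<int>> _usedBy   = new(); // purl -> imageIds (subset)
    private readonly Dictionary<int, string> _tenantOf = new();          // imageId -> tenant

    public void MergeImage(int imageId, string tenant, IEnumerable<string> purls, IEnumerable<string> usedPurls)
    {
        _tenantOf[imageId] = tenant;
        foreach (var p in purls)     Add(_contains, p, imageId);
        foreach (var p in usedPurls) Add(_usedBy,  p, imageId);
    }

    public IReadOnlyCollection<int> ResolveByPurls(IEnumerable<string> purls, bool usageOnly, string tenant)
    {
        var source = usageOnly ? _usedBy : _contains;
        var result = new HashSet<int>();
        foreach (var p in purls)
            if (source.TryGetValue(p, out var ids))
                result.UnionWith(ids);                     // bitmap OR in the real index
        result.RemoveWhere(id => _tenantOf[id] != tenant); // Selector filter (tenant only here)
        return result;
    }

    private static void Add(Dictionary<string, HashSet<int>> map, string purl, int id)
    {
        if (!map.TryGetValue(purl, out var set)) map[purl] = set = new HashSet<int>();
        set.Add(id);
    }
}
```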
---
## 5) External interfaces (REST)
Base path: `/api/v1/scheduler` (Authority OpToks; scopes: `scheduler.read`, `scheduler.admin`).
### 5.1 Schedules CRUD
* `POST /schedules` → create
* `GET /schedules` → list (filter by tenant)
* `GET /schedules/{id}` → details + next run
* `PATCH /schedules/{id}` → pause/resume/update
* `DELETE /schedules/{id}` → delete (soft delete, optional)
### 5.2 Run control & introspection
* `POST /run` — ad-hoc run
```json
{ "mode": "analysis-only|content-refresh", "selection": {...}, "reason": "manual" }
```
* `GET /runs` — list with paging
* `GET /runs/{id}` — status, stats, links to deltas
* `POST /runs/{id}/cancel` — best-effort cancel
### 5.3 Previews (dry-run)
* `POST /preview/impact` — returns **candidate count** and a small sample of impacted digests for given change keys or selection.
### 5.4 Event webhooks (optional push from Feedser/Vexer)
* `POST /events/feedser-export`
```json
{ "exportId":"...", "changedProductKeys":["pkg:rpm/openssl", ...], "kev": ["CVE-..."], "window": { "from":"...","to":"..." } }
```
* `POST /events/vexer-export`
```json
{ "exportId":"...", "changedClaims":[ { "productKey":"pkg:deb/...", "vulnId":"CVE-...", "status":"not_affected→affected"} ], ... }
```
**Security**: webhook callers must present **mTLS** or an **HMAC** (SHA-256) / **Ed25519** signature in `X-Scheduler-Signature`, plus an Authority token; a verification sketch follows below.
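A verification sketch for the HMAC variant; the separate timestamp input and hex-encoded MAC are assumptions — the important parts are the replay-window check and the constant-time comparison:
```csharp
// Verifies an HMAC-SHA256 signed webhook call with a bounded replay window.
using System;
using System.Security.Cryptography;
using System.Text;

static class WebhookVerifier
{
    public static bool Verify(byte[] secret, string body, long unixTimestamp,
                              string presentedHexMac, TimeSpan replayWindow, DateTimeOffset now)
    {
        // Reject stale or future-dated requests before doing any crypto work.
        var age = now - DateTimeOffset.FromUnixTimeSeconds(unixTimestamp);
        if (age < TimeSpan.Zero || age > replayWindow) return false;

        var payload = Encoding.UTF8.GetBytes($"{unixTimestamp}.{body}");
        var expected = HMACSHA256.HashData(secret, payload);
        byte[] presented;
        try { presented = Convert.FromHexString(presentedHexMac); }
        catch (FormatException) { return false; }

        return CryptographicOperations.FixedTimeEquals(expected, presented);
    }
}
```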
---
## 6) Planner → Runner pipeline
### 6.1 Planning algorithm (event-driven)
```
On Export Event (Feedser/Vexer):
keys = Normalize(change payload) # productKeys or vulnIds→productKeys
usageOnly = schedule/policy hint? # default true
sel = Selector for tenant/scope from schedules subscribed to events
impacted = ImpactIndex.ResolveByPurls(keys, usageOnly, sel)
impacted = ApplyOwnerFilters(impacted, sel) # namespaces/repos/labels
impacted = DeduplicateByDigest(impacted)
impacted = EnforceLimits(impacted, limits.maxJobs)
shards = Shard(impacted, byHashPrefix, n=limits.parallelism)
For each shard:
Enqueue RunSegment (runId, shard, rate=limits.ratePerSecond)
```
**Fairness & pacing**
* Use **leaky bucket** per tenant and per registry host.
* Prioritize **KEV-tagged** and **critical** images first if oversubscribed (see the sharding sketch below).
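A sketch of the `Shard(...)` step from the pseudocode above: deterministic shard assignment by digest hash prefix, with KEV/critical candidates ordered first (the `Candidate` record is illustrative):
```csharp
// Stable sharding: the same digest always lands in the same shard, so retries are idempotent.
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public sealed record Candidate(string ImageDigest, bool Kev, int NewCriticals);

static class Planner
{
    public static List<List<Candidate>> Shard(IEnumerable<Candidate> impacted, int parallelism)
    {
        var shards = Enumerable.Range(0, parallelism).Select(_ => new List<Candidate>()).ToList();
        foreach (var c in impacted.OrderByDescending(c => c.Kev).ThenByDescending(c => c.NewCriticals))
        {
            // First byte of SHA-256(digest) picks the shard.
            var h = SHA256.HashData(Encoding.UTF8.GetBytes(c.ImageDigest));
            shards[h[0] % parallelism].Add(c);
        }
        return shards;
    }
}
```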
### 6.2 Nightly planning
```
At cron tick:
sel = resolve selection
candidates = ImpactIndex.ResolveAll(sel)
if lastReportOlderThanDays present → filter by report age (via Scanner catalog)
shard & enqueue as above
```
### 6.3 Execution (Runner)
* Pop **RunSegment** job → for each image digest:
* **analysis-only**: `POST scanner/reports { imageDigest, policyRevision? }` (see the client sketch after this list)
* **content-refresh**: resolve tag→digest if needed; `POST scanner/scans { imageRef, attest? false }` then `POST /reports`
* Collect **delta**: `newFindings`, `newCriticals`/`highs`, `links` (UI deep link, Rekor if present).
* Persist per-image outcome in `runs.{id}.stats` (incremental counters).
* Emit `scheduler.rescan.delta` events to **Notify** only when **delta > 0** and matches severity rule.
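A runner-side client sketch for the analysis-only call; the request/response DTOs are assumptions inferred from the API examples in the high-level architecture (§7.1), not generated clients:
```csharp
// Calls Scanner's analysis-only report endpoint for one image digest.
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading;
using System.Threading.Tasks;

public sealed record ReportRequest(string ImageDigest, string? PolicyRevision);
public sealed record ReportDelta(int NewCriticals, int NewHigh);
public sealed record ReportResponse(string ReportId, string Verdict, ReportDelta Delta);

public sealed class ScannerReportClient
{
    private readonly HttpClient _http; // BaseAddress = scannerUrl; OpTok attached by a DPoP/mTLS handler

    public ScannerReportClient(HttpClient http) => _http = http;

    public async Task<ReportResponse?> RunAnalysisOnlyAsync(string imageDigest, string? policyRevision, CancellationToken ct)
    {
        using var resp = await _http.PostAsJsonAsync("/api/reports",
            new ReportRequest(imageDigest, policyRevision), ct);
        resp.EnsureSuccessStatusCode();
        return await resp.Content.ReadFromJsonAsync<ReportResponse>(cancellationToken: ct);
    }
}
```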
---
## 7) Event model (outbound)
**Topic**: `rescan.delta` (internal bus → Notify; UI subscribes via backend).
```json
{
"tenant": "tenant-01",
"runId": "324af…",
"imageDigest": "sha256:…",
"newCriticals": 1,
"newHigh": 2,
"kevHits": ["CVE-2025-..."],
"topFindings": [
{ "purl":"pkg:rpm/openssl@3.0.12-...","vulnId":"CVE-2025-...","severity":"critical","link":"https://ui/scans/..." }
],
"reportUrl": "https://ui/.../scans/sha256:.../report",
"attestation": { "uuid":"rekor-uuid", "verified": true },
"ts": "2025-10-18T03:12:45Z"
}
```
**Also**: `report.ready` for “no-change” summaries (digest + zero delta), which Notify can ignore by rule.
---
## 8) Security posture
* **AuthN/Z**: Authority OpToks with `aud=scheduler`; DPoP (preferred) or mTLS.
* **Multi-tenant**: every schedule, run, and event carries `tenantId`; ImpactIndex filters by tenant-visible images.
* **Webhook** callers (Feedser/Vexer) present **mTLS** or **HMAC** and Authority token.
* **Input hardening**: size caps on changed key lists; reject >100k keys per event; compress (zstd/gzip) allowed with limits.
* **No secrets** in logs; redact tokens and signatures.
---
## 9) Observability & SLOs
**Metrics (Prometheus)**
* `scheduler.events_total{source, result}`
* `scheduler.impact_resolve_seconds{quantile}`
* `scheduler.images_selected_total{mode}`
* `scheduler.jobs_enqueued_total{mode}`
* `scheduler.run_latency_seconds{quantile}` // event → first verdict
* `scheduler.delta_images_total{severity}`
* `scheduler.rate_limited_total{reason}`
**Targets**
* Resolve 10k changed keys → impacted set in **<300ms** (hot cache).
* Event → first rescan verdict in **≤60s** (p95).
* Nightly coverage 50k images in **≤10min** with 10 workers (analysisonly).
**Tracing** (OTEL): spans `plan`, `resolve`, `enqueue`, `report_call`, `persist`, `emit`.
---
## 10) Configuration (YAML)
```yaml
scheduler:
authority:
issuer: "https://authority.internal"
require: "dpop" # or "mtls"
queue:
kind: "redis" # or "nats"
url: "redis://redis:6379/4"
mongo:
uri: "mongodb://mongo/scheduler"
impactIndex:
storage: "rocksdb" # "rocksdb" | "redis" | "memory"
warmOnStart: true
usageOnlyDefault: true
limits:
defaultRatePerSecond: 50
defaultParallelism: 8
maxJobsPerRun: 50000
integrates:
scannerUrl: "https://scanner-web.internal"
feedserWebhook: true
vexerWebhook: true
notifications:
emitBus: "internal" # deliver to Notify via internal bus
```
---
## 11) UI touchpoints
* **Schedules** page: CRUD, enable/pause, next run, last run stats, mode (analysis/content), selector preview.
* **Runs** page: timeline; heatmap of deltas; drilldown to affected images.
* **Dry-run preview** modal: “This Feedser export touches ~3,214 images; projected deltas: ~420 (34 KEV).”
---
## 12) Failure modes & degradations
| Condition | Behavior |
| ------------------------------------ | ---------------------------------------------------------------------------------------- |
| ImpactIndex cold / incomplete | Fall back to **All** selection for nightly; for events, cap to KEV+critical until warmed |
| Feedser/Vexer webhook storm | Coalesce by exportId; debounce 30-60s; keep last |
| Scanner under load (429) | Backoff with jitter; respect pertenant/leaky bucket |
| Oversubscription (too many impacted) | Prioritize KEV/critical first; spillover to next window; UI banner shows backlog |
| Notify down | Buffer outbound events in queue (TTL 24h) |
| Mongo slow | Cut batch sizes; sample-log; alert ops; don't drop runs unless critical |
---
## 13) Testing matrix
* **ImpactIndex**: correctness (purl→image sets), performance, persistence after restart, memory pressure with 1M purls.
* **Planner**: dedupe, shard, fairness, limit enforcement, KEV prioritization.
* **Runner**: parallel report calls, error backoff, partial failures, idempotency.
* **End-to-end**: Feedser export → deltas visible in UI in ≤60s.
* **Security**: webhook auth (mTLS/HMAC), DPoP nonce dance, tenant isolation.
* **Chaos**: drop scanner availability; simulate registry throttles (content-refresh mode).
* **Nightly**: cron tick correctness across timezones and DST.
---
## 14) Implementation notes
* **Language**: .NET 10 minimal API; Channels-based pipeline; `System.Threading.RateLimiting`.
* **Bitmaps**: Roaring via `RoaringBitmap` bindings; memory-map large shards if RocksDB used.
* **Cron**: Quartz-style parser with timezone support; clock skew tolerated ±60s.
* **Dry-run**: use ImpactIndex only; never call scanner.
* **Idempotency**: run segments carry deterministic keys; retries safe.
* **Backpressure**: per-tenant buckets; per-host registry budgets respected when content-refresh is enabled.
---
## 15) Sequences (representative)
**A) Event-driven rescan (Feedser delta)**
```mermaid
sequenceDiagram
autonumber
participant FE as Feedser
participant SCH as Scheduler.Worker
participant IDX as ImpactIndex
participant SC as Scanner.WebService
participant NO as Notify
FE->>SCH: POST /events/feedser-export {exportId, changedProductKeys}
SCH->>IDX: ResolveByPurls(keys, usageOnly=true, sel)
IDX-->>SCH: bitmap(imageIds) → digests list
SCH->>SC: POST /reports {imageDigest} (batch/sequenced)
SC-->>SCH: report deltas (new criticals/highs)
alt delta>0
SCH->>NO: rescan.delta {digest, newCriticals, links}
end
```
**B) Nightly rescan**
```mermaid
sequenceDiagram
autonumber
participant CRON as Cron
participant SCH as Scheduler.Worker
participant IDX as ImpactIndex
participant SC as Scanner.WebService
CRON->>SCH: tick (02:00 Europe/Sofia)
SCH->>IDX: ResolveAll(selector)
IDX-->>SCH: candidates
SCH->>SC: POST /reports {digest} (paced)
SC-->>SCH: results
SCH-->>SCH: aggregate, store run stats
```
**C) Content-refresh (tag followers)**
```mermaid
sequenceDiagram
autonumber
participant SCH as Scheduler
participant SC as Scanner
SCH->>SC: resolve tag→digest (if changed)
alt digest changed
SCH->>SC: POST /scans {imageRef} # new SBOM
SC-->>SCH: scan complete (artifacts)
SCH->>SC: POST /reports {imageDigest}
else unchanged
SCH->>SC: POST /reports {imageDigest} # analysis-only
end
```
---
## 16) Roadmap
* **Vuln-centric impact**: pre-join vuln→purl→images to rank by **KEV** and **exploited-in-the-wild** signals.
* **Policy diff preview**: when a staged policy changes, show projected breakage set before promotion.
* **Cross-cluster federation**: one Scheduler instance driving many Scanner clusters (tenant isolation).
* **Windows containers**: integrate Zastava runtime hints for Usage view tightening.
---
**End — component_architecture_scheduler.md**

View File

@@ -31,31 +31,33 @@ Everything here is opensource and versioned— when you check out a git ta
- **03[Vision & Roadmap](03_VISION.md)**
- **04[Feature Matrix](04_FEATURE_MATRIX.md)**
### Reference & concepts
- **05[System Requirements Specification](05_SYSTEM_REQUIREMENTS_SPEC.md)**
- **07[HighLevel Architecture](07_HIGH_LEVEL_ARCHITECTURE.md)**
- **08Module Architecture Dossiers**
- [Scanner](ARCHITECTURE_SCANNER.md)
- [Concelier](ARCHITECTURE_CONCELIER.md)
- [Excititor](ARCHITECTURE_EXCITITOR.md)
- [Signer](ARCHITECTURE_SIGNER.md)
- [Attestor](ARCHITECTURE_ATTESTOR.md)
- [Authority](ARCHITECTURE_AUTHORITY.md)
- [CLI](ARCHITECTURE_CLI.md)
- [WebUI](ARCHITECTURE_UI.md)
- [Zastava Runtime](ARCHITECTURE_ZASTAVA.md)
- [Release & Operations](ARCHITECTURE_DEVOPS.md)
- **09[API&CLI Reference](09_API_CLI_REFERENCE.md)**
- **10[Plugin SDK Guide](10_PLUGIN_SDK_GUIDE.md)**
- **10[Concelier CLI Quickstart](10_CONCELIER_CLI_QUICKSTART.md)**
- **30[Excititor Connector Packaging Guide](dev/30_EXCITITOR_CONNECTOR_GUIDE.md)**
- **30Developer Templates**
- [Excititor Connector Skeleton](dev/templates/excititor-connector/)
- **11[Authority Service](11_AUTHORITY.md)**
- **11[Data Schemas](11_DATA_SCHEMAS.md)**
- **12[Performance Workbook](12_PERFORMANCE_WORKBOOK.md)**
- **13[ReleaseEngineering Playbook](13_RELEASE_ENGINEERING_PLAYBOOK.md)**
- **30[Fixture Maintenance](dev/fixtures.md)**
### Reference & concepts
- **05[System Requirements Specification](05_SYSTEM_REQUIREMENTS_SPEC.md)**
- **07[HighLevel Architecture](07_HIGH_LEVEL_ARCHITECTURE.md)**
- **08Module Architecture Dossiers**
- [Scanner](ARCHITECTURE_SCANNER.md)
- [Concelier](ARCHITECTURE_CONCELIER.md)
- [Excititor](ARCHITECTURE_EXCITITOR.md)
- [Signer](ARCHITECTURE_SIGNER.md)
- [Attestor](ARCHITECTURE_ATTESTOR.md)
- [Authority](ARCHITECTURE_AUTHORITY.md)
- [Notify](ARCHITECTURE_NOTIFY.md)
- [Scheduler](ARCHITECTURE_SCHEDULER.md)
- [CLI](ARCHITECTURE_CLI.md)
- [WebUI](ARCHITECTURE_UI.md)
- [Zastava Runtime](ARCHITECTURE_ZASTAVA.md)
- [Release & Operations](ARCHITECTURE_DEVOPS.md)
- **09[API&CLI Reference](09_API_CLI_REFERENCE.md)**
- **10[Plugin SDK Guide](10_PLUGIN_SDK_GUIDE.md)**
- **10[Concelier CLI Quickstart](10_CONCELIER_CLI_QUICKSTART.md)**
- **30[Excititor Connector Packaging Guide](dev/30_EXCITITOR_CONNECTOR_GUIDE.md)**
- **30Developer Templates**
- [Excititor Connector Skeleton](dev/templates/excititor-connector/)
- **11[Authority Service](11_AUTHORITY.md)**
- **11[Data Schemas](11_DATA_SCHEMAS.md)**
- **12[Performance Workbook](12_PERFORMANCE_WORKBOOK.md)**
- **13[ReleaseEngineering Playbook](13_RELEASE_ENGINEERING_PLAYBOOK.md)**
- **30[Fixture Maintenance](dev/fixtures.md)**
### User & operator guides
- **14[Glossary](14_GLOSSARY_OF_TERMS.md)**
@@ -64,18 +66,18 @@ Everything here is opensource and versioned— when you check out a git ta
- **18[Coding Standards](18_CODING_STANDARDS.md)**
- **19[TestSuite Overview](19_TEST_SUITE_OVERVIEW.md)**
- **21[Install Guide](21_INSTALL_GUIDE.md)**
- **22[CI/CD Recipes Library](ci/20_CI_RECIPES.md)**
- **23[FAQ](23_FAQ_MATRIX.md)**
- **24[Offline Update Kit Admin Guide](24_OFFLINE_KIT.md)**
- **25[Concelier Apple Connector Operations](ops/concelier-apple-operations.md)**
- **26[Authority Key Rotation Playbook](ops/authority-key-rotation.md)**
- **27[Concelier CCCS Connector Operations](ops/concelier-cccs-operations.md)**
- **28[Concelier CISA ICS Connector Operations](ops/concelier-icscisa-operations.md)**
- **29[Concelier CERT-Bund Connector Operations](ops/concelier-certbund-operations.md)**
- **30[Concelier MSRC Connector AAD Onboarding](ops/concelier-msrc-operations.md)**
### Legal & licence
- **31[Legal & Quota FAQ](29_LEGAL_FAQ_QUOTA.md)**
- **22[CI/CD Recipes Library](ci/20_CI_RECIPES.md)**
- **23[FAQ](23_FAQ_MATRIX.md)**
- **24[Offline Update Kit Admin Guide](24_OFFLINE_KIT.md)**
- **25[Concelier Apple Connector Operations](ops/concelier-apple-operations.md)**
- **26[Authority Key Rotation Playbook](ops/authority-key-rotation.md)**
- **27[Concelier CCCS Connector Operations](ops/concelier-cccs-operations.md)**
- **28[Concelier CISA ICS Connector Operations](ops/concelier-icscisa-operations.md)**
- **29[Concelier CERT-Bund Connector Operations](ops/concelier-certbund-operations.md)**
- **30[Concelier MSRC Connector AAD Onboarding](ops/concelier-msrc-operations.md)**
### Legal & licence
- **31[Legal & Quota FAQ](29_LEGAL_FAQ_QUOTA.md)**
</details>

View File

@@ -9,6 +9,9 @@
| DOC5.Concelier-Runbook | DONE (2025-10-12) | Docs Guild | DOC3.Concelier-Authority | Produce dedicated Concelier authority audit runbook covering log fields, monitoring recommendations, and troubleshooting steps. | ✅ Runbook published; ✅ linked from DOC3/DOC5; ✅ alerting guidance included. |
| FEEDDOCS-DOCS-05-001 | DONE (2025-10-11) | Docs Guild | FEEDMERGE-ENGINE-04-001, FEEDMERGE-ENGINE-04-002 | Publish Concelier conflict resolution runbook covering precedence workflow, merge-event auditing, and Sprint 3 metrics. | ✅ `docs/ops/concelier-conflict-resolution.md` committed; ✅ metrics/log tables align with latest merge code; ✅ Ops alert guidance handed to Concelier team. |
| FEEDDOCS-DOCS-05-002 | DONE (2025-10-16) | Docs Guild, Concelier Ops | FEEDDOCS-DOCS-05-001 | Ops sign-off captured: conflict runbook circulated, alert thresholds tuned, and rollout decisions documented in change log. | ✅ Ops review recorded; ✅ alert thresholds finalised using `docs/ops/concelier-authority-audit-runbook.md`; ✅ change-log entry linked from runbook once GHSA/NVD/OSV regression fixtures land. |
| DOCS-ADR-09-001 | TODO | Docs Guild, DevEx | — | Establish ADR process (`docs/adr/0000-template.md`) and document usage guidelines. | Template published; README snippet linking ADR process; announcement posted. |
| DOCS-EVENTS-09-002 | TODO | Docs Guild, Platform Events | SCANNER-EVENTS-15-201 | Publish event schema catalog (`docs/events/`) for `scanner.report.ready@1`, `scheduler.rescan.delta@1`, `attestor.logged@1`. | Schemas validated; docs/events/README summarises usage; Notify/Scheduler teams acknowledge. |
| DOCS-RUNTIME-17-004 | TODO | Docs Guild, Runtime Guild | SCANNER-EMIT-17-701, ZASTAVA-OBS-17-005, DEVOPS-REL-17-002 | Document build-id workflows: SBOM exposure, runtime event payloads, debug-store layout, and operator guidance for symbol retrieval. | Architecture + operator docs updated with build-id sections, examples show `readelf` output + debuginfod usage, references linked from Offline Kit/Release guides. |
> Update statuses (TODO/DOING/REVIEW/DONE/BLOCKED) as progress changes. Keep guides in sync with configuration samples under `etc/`.

18
docs/adr/0000-template.md Normal file
View File

@@ -0,0 +1,18 @@
# ADR-0000: Title
## Status
Proposed
## Context
- What decision needs to be made?
- What are the forces (requirements, constraints, stakeholders)?
## Decision
- Summary of the chosen option.
## Consequences
- Positive/negative consequences.
- Follow-up actions or tasks.
## References
- Links to related ADRs, issues, documents.

9
docs/events/README.md Normal file
View File

@@ -0,0 +1,9 @@
# Event Envelope Schemas
Versioned JSON Schemas for platform events consumed by Scheduler, Notify, and UI.
- `scanner.report.ready@1.json`
- `scheduler.rescan.delta@1.json`
- `attestor.logged@1.json`
Producers must bump the version suffix when introducing breaking changes; consumers validate incoming payloads against these schemas.
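A consumer-side validation sketch, using NJsonSchema as one possible library (the library choice and file layout are assumptions):
```csharp
// Validates an incoming event envelope against one of the versioned schemas in docs/events/.
using System;
using System.Threading.Tasks;
using NJsonSchema;

static class EventValidation
{
    public static async Task<bool> IsValidAsync(string schemaPath, string envelopeJson)
    {
        var schema = await JsonSchema.FromFileAsync(schemaPath);
        var errors = schema.Validate(envelopeJson);
        foreach (var error in errors)
            Console.Error.WriteLine($"{error.Path}: {error.Kind}"); // surface violations for ops logs
        return errors.Count == 0;
    }
}

// Usage: await EventValidation.IsValidAsync("docs/events/scanner.report.ready@1.json", payload);
```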

View File

@@ -0,0 +1,38 @@
{
"$id": "https://stella-ops.org/schemas/events/attestor.logged@1.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["eventId", "kind", "tenant", "ts", "payload"],
"properties": {
"eventId": {"type": "string", "format": "uuid"},
"kind": {"const": "attestor.logged"},
"tenant": {"type": "string"},
"ts": {"type": "string", "format": "date-time"},
"payload": {
"type": "object",
"required": ["artifactSha256", "rekor", "subject"],
"properties": {
"artifactSha256": {"type": "string"},
"rekor": {
"type": "object",
"required": ["uuid", "url"],
"properties": {
"uuid": {"type": "string"},
"url": {"type": "string", "format": "uri"},
"index": {"type": "integer", "minimum": 0}
}
},
"subject": {
"type": "object",
"required": ["type", "name"],
"properties": {
"type": {"enum": ["sbom", "report", "vex-export"]},
"name": {"type": "string"}
}
}
},
"additionalProperties": true
}
},
"additionalProperties": false
}

View File

@@ -0,0 +1,46 @@
{
"$id": "https://stella-ops.org/schemas/events/scanner.report.ready@1.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["eventId", "kind", "tenant", "ts", "scope", "payload"],
"properties": {
"eventId": {"type": "string", "format": "uuid"},
"kind": {"const": "scanner.report.ready"},
"tenant": {"type": "string"},
"ts": {"type": "string", "format": "date-time"},
"scope": {
"type": "object",
"required": ["repo", "digest"],
"properties": {
"namespace": {"type": "string"},
"repo": {"type": "string"},
"digest": {"type": "string"}
}
},
"payload": {
"type": "object",
"required": ["verdict", "delta", "links"],
"properties": {
"verdict": {"enum": ["pass", "warn", "fail"]},
"delta": {
"type": "object",
"properties": {
"newCritical": {"type": "integer", "minimum": 0},
"newHigh": {"type": "integer", "minimum": 0},
"kev": {"type": "array", "items": {"type": "string"}}
}
},
"links": {
"type": "object",
"properties": {
"ui": {"type": "string", "format": "uri"},
"rekor": {"type": "string", "format": "uri"}
},
"additionalProperties": false
}
},
"additionalProperties": true
}
},
"additionalProperties": false
}

View File

@@ -0,0 +1,33 @@
{
"$id": "https://stella-ops.org/schemas/events/scheduler.rescan.delta@1.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["eventId", "kind", "tenant", "ts", "payload"],
"properties": {
"eventId": {"type": "string", "format": "uuid"},
"kind": {"const": "scheduler.rescan.delta"},
"tenant": {"type": "string"},
"ts": {"type": "string", "format": "date-time"},
"payload": {
"type": "object",
"required": ["scheduleId", "impactedDigests", "summary"],
"properties": {
"scheduleId": {"type": "string"},
"impactedDigests": {
"type": "array",
"items": {"type": "string"}
},
"summary": {
"type": "object",
"properties": {
"newCritical": {"type": "integer", "minimum": 0},
"newHigh": {"type": "integer", "minimum": 0},
"total": {"type": "integer", "minimum": 0}
}
}
},
"additionalProperties": true
}
},
"additionalProperties": false
}