Rewrite architecture docs and add Vexer connector template

2025-10-17 19:34:43 +03:00
parent 29a7d51e41
commit fbd1826ef3
25 changed files with 4885 additions and 777 deletions


@@ -0,0 +1,413 @@
# component_architecture_scanner.md — **StellaOps Scanner** (2025Q4)
> **Scope.** Implementation-ready architecture for the **Scanner** subsystem: WebService, Workers, analyzers, SBOM assembly (inventory & usage), per-layer caching, three-way diffs, artifact catalog (MinIO+Mongo), attestation handoff, and scale/security posture. This document is the contract between the scanning plane and everything else (Policy, Vexer, Feedser, UI, CLI).
---
## 0) Mission & boundaries
**Mission.** Produce **deterministic**, **explainable** SBOMs and diffs for container images and filesystems, quickly and repeatedly, without guessing. Emit two views: **Inventory** (everything present) and **Usage** (entrypoint closure + actually linked libs). Attach attestations through **Signer→Attestor→Rekor v2**.
**Boundaries.**
* Scanner **does not** produce PASS/FAIL. The backend (Policy + Vexer + Feedser) decides presentation and verdicts.
* Scanner **does not** keep third-party SBOM warehouses. It may **bind** to existing attestations for exact hashes.
* Core analyzers are **deterministic** (no fuzzy identity). Optional heuristic plugins (e.g., patch-presence) run under explicit flags and never contaminate the core SBOM.
---
## 1) Solution & project layout
```
src/
├─ StellaOps.Scanner.WebService/ # REST control plane, catalog, diff, exports
├─ StellaOps.Scanner.Worker/ # queue consumer; executes analyzers
├─ StellaOps.Scanner.Models/ # DTOs, evidence, graph nodes, CDX/SPDX adapters
├─ StellaOps.Scanner.Storage/ # Mongo repositories; MinIO object client; ILM/GC
├─ StellaOps.Scanner.Queue/ # queue abstraction (Redis/NATS/RabbitMQ)
├─ StellaOps.Scanner.Cache/ # layer cache; file CAS; bloom/bitmap indexes
├─ StellaOps.Scanner.EntryTrace/ # ENTRYPOINT/CMD → terminal program resolver (shell AST)
├─ StellaOps.Scanner.Analyzers.OS.[Apk|Dpkg|Rpm]/
├─ StellaOps.Scanner.Analyzers.Lang.[Java|Node|Python|Go|DotNet|Rust]/
├─ StellaOps.Scanner.Analyzers.Native.[ELF|PE|MachO]/ # PE/Mach-O planned (M2)
├─ StellaOps.Scanner.Emit.CDX/ # CycloneDX (JSON + Protobuf)
├─ StellaOps.Scanner.Emit.SPDX/ # SPDX 3.0.1 JSON
├─ StellaOps.Scanner.Diff/ # image→layer→component three-way diff
├─ StellaOps.Scanner.Index/ # BOM-Index sidecar (purls + roaring bitmaps)
├─ StellaOps.Scanner.Tests.* # unit/integration/e2e fixtures
└─ tools/
   ├─ StellaOps.Scanner.Sbomer.BuildXPlugin/ # BuildKit generator (image referrer SBOMs)
   └─ StellaOps.Scanner.Sbomer.DockerImage/ # CLI-driven scanner container
```
**Runtime form factor:** two deployables
* **Scanner.WebService** (stateless REST)
* **Scanner.Worker** (N replicas; queue-driven)
---
## 2) External dependencies
* **OCI registry** with **Referrers API** (discover attached SBOMs/signatures).
* **MinIO** (S3-compatible) for SBOM artifacts; **Object Lock** for immutable classes; **ILM** for TTL.
* **MongoDB** for catalog, job state, diffs, ILM rules.
* **Queue** (Redis Streams/NATS/RabbitMQ).
* **Authority** (on-prem OIDC) for **OpToks** (DPoP/mTLS).
* **Signer** + **Attestor** (+ **Fulcio/KMS** + **Rekor v2**) for DSSE + transparency.
---
## 3) Contracts & data model
### 3.1 Evidence-first component model
**Nodes**
* `Image`, `Layer`, `File`
* `Component` (`purl?`, `name`, `version?`, `type`, `id` — may be `bin:{sha256}`)
* `Executable` (ELF/PE/Mach-O), `Library` (native or managed), `EntryScript` (shell/launcher)
**Edges** (all carry **Evidence**)
* `contains(Image|Layer → File)`
* `installs(PackageDB → Component)` (OS database row)
* `declares(InstalledMetadata → Component)` (dist-info, pom.properties, deps.json…)
* `links_to(Executable → Library)` (ELF `DT_NEEDED`, PE imports)
* `calls(EntryScript → Program)` (file:line from shell AST)
* `attests(Rekor → Component|Image)` (SBOM/predicate binding)
* `bound_from_attestation(Component_attested → Component_observed)` (hash equality proof)
**Evidence**
```
{ source: enum, locator: (path|offset|line), sha256?, method: enum, timestamp }
```
No confidences. Either a fact is proven with listed mechanisms, or it is not claimed.
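The evidence rule above can be sketched as a pair of immutable records. This is only an illustration under stated assumptions: the enum-like strings (`"dpkg-status"`, `"package-db-row"`) and the `file_sha256` field name are hypothetical, and the real DTOs live in `StellaOps.Scanner.Models`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    source: str                 # hypothetical enum value, e.g. "dpkg-status"
    locator: str                # path, offset, or file:line
    method: str                 # hypothetical enum value, e.g. "package-db-row"
    timestamp: str              # ISO 8601, UTC
    sha256: Optional[str] = None


@dataclass(frozen=True)
class Component:
    name: str
    type: str
    evidence: tuple             # one or more Evidence records
    purl: Optional[str] = None
    version: Optional[str] = None
    file_sha256: Optional[str] = None   # hypothetical field backing bin:{sha256}

    def __post_init__(self):
        # No evidence, no claim: refuse to construct an unproven component.
        if not self.evidence:
            raise ValueError("a component must be backed by evidence")
        if self.purl is None and self.file_sha256 is None:
            raise ValueError("identity needs a purl or a file hash")

    @property
    def id(self) -> str:
        # Deterministic identity: purl when known, else bin:{sha256}.
        return self.purl if self.purl else f"bin:{self.file_sha256}"
```

There is deliberately no confidence or score field; a component either carries evidence or cannot be constructed.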
### 3.2 Catalog schema (Mongo)
* `artifacts`
```
{ _id, type: layer-bom|image-bom|diff|index,
format: cdx-json|cdx-pb|spdx-json,
bytesSha256, size, rekor: { uuid,index,url }?,
ttlClass, immutable, refCount, createdAt }
```
* `images { imageDigest, repo, tag?, arch, createdAt, lastSeen }`
* `layers { layerDigest, mediaType, size, createdAt, lastSeen }`
* `links { fromType, fromDigest, artifactId }` // image/layer -> artifact
* `jobs { _id, kind, args, state, startedAt, heartbeatAt, endedAt, error }`
* `lifecycleRules { ruleId, scope, ttlDays, retainIfReferenced, immutable }`
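How `lifecycleRules` interacts with `artifacts` is not spelled out here, so the following is a guess at the semantics inferred from the field names alone: immutable artifacts are never collected, referenced artifacts are kept when `retainIfReferenced` is set, and everything else expires after `ttlDays`. The function name `eligible_for_gc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def eligible_for_gc(artifact: dict, rule: dict, now: datetime) -> bool:
    """Return True when the artifact may be deleted under the rule (assumed semantics)."""
    if artifact.get("immutable") or rule.get("immutable"):
        return False
    if rule.get("retainIfReferenced") and artifact.get("refCount", 0) > 0:
        return False
    return now - artifact["createdAt"] > timedelta(days=rule["ttlDays"])


now = datetime(2025, 10, 17, tzinfo=timezone.utc)
rule = {"ruleId": "default", "ttlDays": 30, "retainIfReferenced": True, "immutable": False}
stale = {"createdAt": now - timedelta(days=40), "immutable": False, "refCount": 0}
print(eligible_for_gc(stale, rule, now))
```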
### 3.3 Object store layout (MinIO)
```
layers/<sha256>/sbom.cdx.json.zst
layers/<sha256>/sbom.spdx.json.zst
images/<imgDigest>/inventory.cdx.pb # CycloneDX Protobuf
images/<imgDigest>/usage.cdx.pb
indexes/<imgDigest>/bom-index.bin # purls + roaring bitmaps
diffs/<old>_<new>/diff.json.zst
attest/<artifactSha256>.dsse.json # DSSE bundle (cert chain + Rekor proof)
```
---
## 4) REST API (Scanner.WebService)
All under `/api/v1/scanner`. Auth: **OpTok** (DPoP/mTLS); RBAC scopes.
```
POST /scans { imageRef|digest, force?:bool } → { scanId }
GET /scans/{id} → { status, imageDigest, artifacts[], rekor? }
GET /sboms/{imageDigest} ?format=cdx-json|cdx-pb|spdx-json&view=inventory|usage → bytes
GET /diff?old=<digest>&new=<digest>&view=inventory|usage → diff.json
POST /exports { imageDigest, format, view, attest?:bool } → { artifactId, rekor? }
POST /reports { imageDigest, policyRevision? } → { reportId, rekor? } # delegates to backend policy+vex
GET /catalog/artifacts/{id} → { meta }
GET /healthz | /readyz | /metrics
```
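A client might assemble the SBOM and diff URLs from the routes above as follows. This is a sketch only: the host name is a placeholder, leaving the `:` in digests unencoded is an assumption, and no request is sent (authentication with an OpTok is out of scope here).

```python
from urllib.parse import quote, urlencode

BASE_PATH = "/api/v1/scanner"
FORMATS = {"cdx-json", "cdx-pb", "spdx-json"}
VIEWS = {"inventory", "usage"}


def sbom_url(host: str, image_digest: str, fmt: str = "cdx-json", view: str = "inventory") -> str:
    """Build the GET /sboms/{imageDigest} URL, rejecting unknown formats and views."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if view not in VIEWS:
        raise ValueError(f"unsupported view: {view}")
    query = urlencode({"format": fmt, "view": view})
    return f"{host}{BASE_PATH}/sboms/{quote(image_digest, safe=':')}?{query}"


def diff_url(host: str, old: str, new: str, view: str = "inventory") -> str:
    """Build the GET /diff URL for two image digests."""
    if view not in VIEWS:
        raise ValueError(f"unsupported view: {view}")
    query = urlencode({"old": old, "new": new, "view": view}, safe=":")
    return f"{host}{BASE_PATH}/diff?{query}"
```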
---
## 5) Execution flow (Worker)
### 5.1 Acquire & verify
1. **Resolve image** (prefer `repo@sha256:`).
2. **(Optional) verify image signature** per policy (cosign).
3. **Pull blobs**, compute layer digests; record metadata.
### 5.2 Layer union FS
* Apply whiteouts; materialize final filesystem; map **file → first introducing layer**.
* Windows layers (MSI/SxS/GAC) planned in **M2**.
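The union step can be illustrated with OCI whiteout handling over in-memory layers. This sketch assumes each layer is a mapping of path to content hash and interprets "first introducing layer" as the earliest layer that supplied the surviving content; both are assumptions, as is the function name `union_layers`.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def union_layers(layers):
    """layers: list of (layer_digest, {path: sha256}), base layer first.
    Returns {path: (introducing_layer_digest, sha256)} for the final filesystem."""
    merged = {}
    for digest, files in layers:
        # Whiteouts hide entries from lower layers, so apply them first.
        for path in files:
            parent, name = posixpath.split(path)
            if name == OPAQUE_MARKER:
                prefix = parent.rstrip("/") + "/"
                doomed = [p for p in merged if p.startswith(prefix)]
            elif name.startswith(WHITEOUT_PREFIX):
                target = posixpath.join(parent, name[len(WHITEOUT_PREFIX):])
                doomed = [p for p in merged if p == target or p.startswith(target + "/")]
            else:
                continue
            for p in doomed:
                del merged[p]
        # Then add this layer's real files; identical content keeps its earlier attribution.
        for path, sha256 in files.items():
            if posixpath.basename(path).startswith(WHITEOUT_PREFIX):
                continue
            previous = merged.get(path)
            if previous is None or previous[1] != sha256:
                merged[path] = (digest, sha256)
    return merged
```

Directory entries, hard links, and symlinks are ignored in this sketch; a real implementation has to handle them.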
### 5.3 Evidence harvest (parallel analyzers; deterministic only)
**A) OS packages**
* **apk**: `/lib/apk/db/installed`
* **dpkg**: `/var/lib/dpkg/status`, `/var/lib/dpkg/info/*.list`
* **rpm**: `/var/lib/rpm/Packages` (via librpm or parser)
* Record `name`, `version` (epoch/revision), `arch`, source package where present, and **declared file lists**.
**B) Language ecosystems (installed state only)**
* **Java**: `META-INF/maven/*/pom.properties`, MANIFEST → `pkg:maven/...`
* **Node**: `node_modules/**/package.json` → `pkg:npm/...`
* **Python**: `*.dist-info/{METADATA,RECORD}` → `pkg:pypi/...`
* **Go**: Go **buildinfo** in binaries → `pkg:golang/...`
* **.NET**: `*.deps.json` + assembly metadata → `pkg:nuget/...`
* **Rust**: crates only when **explicitly present** (embedded metadata or cargo/registry traces); otherwise binaries reported as `bin:{sha256}`.
> **Rule:** We only report components proven **on disk** with authoritative metadata. Lockfiles are evidence only.
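For the Python case, the rule can be sketched as: a component exists only if a `*.dist-info/METADATA` file yields both a name and a version. The helper name `pypi_component` is hypothetical, and the name normalization (lowercase, underscores to hyphens) is the purl convention for `pypi` as far as this sketch assumes.

```python
from email.parser import Parser
from typing import Optional


def pypi_component(metadata_text: str) -> Optional[str]:
    """Return a pkg:pypi purl from METADATA content, or None when identity is not proven."""
    msg = Parser().parsestr(metadata_text, headersonly=True)
    name, version = msg["Name"], msg["Version"]
    if not name or not version:
        return None          # no authoritative metadata, so no component is claimed
    normalized = name.strip().lower().replace("_", "-")
    return f"pkg:pypi/{normalized}@{version.strip()}"


print(pypi_component("Metadata-Version: 2.1\nName: Typing_Extensions\nVersion: 4.12.2\n"))
```

A `requirements.txt` or lockfile alone never reaches this function; it would only be recorded as evidence.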
**C) Native link graph**
* **ELF**: parse `PT_INTERP`, `DT_NEEDED`, RPATH/RUNPATH, **GNU symbol versions**; map **SONAMEs** to file paths; link executables → libs.
* **PE/Mach-O** (planned M2): import table, delay-imports; version resources; code signatures.
* Map libs back to **OS packages** if possible (via file lists); else emit `bin:{sha256}` components.
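The SONAME resolution and package attribution steps might look like the sketch below. It takes already-parsed `DT_NEEDED` and RUNPATH values as input, uses a simplified search order (RUNPATH with `$ORIGIN` expansion, then two default directories), and ignores `ld.so.cache` and `LD_LIBRARY_PATH`; the function names are hypothetical.

```python
import posixpath

DEFAULT_DIRS = ("/lib", "/usr/lib")   # simplification of the dynamic loader's defaults


def resolve_needed(exe_path, needed, runpath, rootfs_files):
    """Map each DT_NEEDED SONAME to a path inside the rootfs, or None if unresolved."""
    origin = posixpath.dirname(exe_path)
    search = [d.replace("$ORIGIN", origin) for d in runpath] + list(DEFAULT_DIRS)
    resolved = {}
    for soname in needed:
        resolved[soname] = None
        for directory in search:
            candidate = posixpath.normpath(posixpath.join(directory, soname))
            if candidate in rootfs_files:
                resolved[soname] = candidate
                break
    return resolved


def owner_of(path, sha256, package_file_lists):
    """Return the owning package purl from declared file lists, else bin:{sha256}."""
    for purl, files in package_file_lists.items():
        if path in files:
            return purl
    return f"bin:{sha256}"
```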
**D) EntryTrace (ENTRYPOINT/CMD → terminal program)**
* Read image config; parse shell (POSIX/Bash subset) with AST: `source`/`.` includes; `case/if`; `exec`/`command`; `run-parts`.
* Resolve commands via **PATH** within the **built rootfs**; follow language launchers (Java/Node/Python) to identify the terminal program (ELF/JAR/venv script).
* Record **file:line** and choices for each hop; output chain graph.
* Unresolvable dynamic constructs are recorded as **unknown** edges with reasons (e.g., `$FOO` unresolved).
**E) Attestation & SBOM bind (optional)**
* For each **file hash** or **binary hash**, query local cache of **Rekor v2** indices; if an SBOM attestation is found for **exact hash**, bind it to the component (origin=`attested`).
* For the **image** digest, likewise bind SBOM attestations (build-time referrers).
### 5.4 Component normalization (exact only)
* Create `Component` nodes only with deterministic identities: purl, or **`bin:{sha256}`** for unlabeled binaries.
* Record **origin** (OS DB, installed metadata, linker, attestation).
### 5.5 SBOM assembly & emit
* **Per-layer SBOM fragments**: components introduced by the layer (+ relationships).
* **Image SBOMs**: merge fragments; refer back to them via **CycloneDX BOM-Link** (or SPDX ExternalRef).
* Emit both **Inventory** & **Usage** views.
* Serialize **CycloneDX JSON** and **CycloneDX Protobuf**; optionally **SPDX 3.0.1 JSON**.
* Build **BOM-Index** sidecar: purl table + roaring bitmap; flag `usedByEntrypoint` components for fast backend joins.
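One way to picture the Inventory/Usage split is as graph reachability: the Usage view keeps the components that own a file reachable from the terminal program over `links_to` edges. The data shapes and function names below are illustrative assumptions, not the assembler's actual types.

```python
from collections import deque


def usage_closure(entry_files, links_to):
    """Breadth-first closure of files reachable from the entrypoint via links_to edges."""
    seen = set(entry_files)
    queue = deque(entry_files)
    while queue:
        current = queue.popleft()
        for lib in links_to.get(current, ()):
            if lib not in seen:
                seen.add(lib)
                queue.append(lib)
    return seen


def split_views(components, entry_files, links_to):
    """Return (inventory_ids, usage_ids); usage is always a subset of inventory."""
    used_files = usage_closure(entry_files, links_to)
    inventory = sorted(c["id"] for c in components)
    usage = sorted(c["id"] for c in components if used_files & set(c["files"]))
    return inventory, usage
```

The ids in the usage list are the ones that would carry the `usedByEntrypoint` flag in the index.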
### 5.6 DSSE attestation (via Signer/Attestor)
* WebService constructs **predicate** with `image_digest`, `stellaops_version`, `license_id`, `policy_digest?` (when emitting **final reports**), timestamps.
* Calls **Signer** (requires **OpTok + PoE**); Signer verifies **entitlement + scanner image integrity** and returns **DSSE bundle**.
* **Attestor** logs to **Rekor v2**; returns `{uuid,index,proof}` → stored in `artifacts.rekor`.
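For orientation, the bytes that a DSSE signer actually signs are the pre-authentication encoding (PAE) of the payload type and payload. The sketch below builds a predicate with the fields listed above and applies PAE; the `created_at` field name and the canonical-JSON serialization are assumptions, and signing itself stays with Signer.

```python
import json


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: 'DSSEv1 <len> <type> <len> <payload>'."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %s %d %s" % (len(type_bytes), type_bytes, len(payload), payload)


def build_predicate(image_digest, stellaops_version, license_id, created_at, policy_digest=None):
    """Serialize the predicate deterministically (sorted keys, compact separators)."""
    predicate = {
        "image_digest": image_digest,
        "stellaops_version": stellaops_version,
        "license_id": license_id,
        "created_at": created_at,          # hypothetical name for the timestamp field
    }
    if policy_digest is not None:          # only present on final reports
        predicate["policy_digest"] = policy_digest
    return json.dumps(predicate, sort_keys=True, separators=(",", ":")).encode("utf-8")


print(pae("http://example.com/HelloWorld", b"hello world"))
```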
---
## 6) Three-way diff (image → layer → component)
### 6.1 Keys & classification
* Component key: **purl** when present; else `bin:{sha256}`.
* Diff classes: `added`, `removed`, `version_changed` (`upgraded|downgraded`), `metadata_changed` (e.g., origin from attestation vs observed).
* Layer attribution: for each change, resolve the **introducing/removing layer**.
### 6.2 Algorithm (outline)
```
A = components(imageOld, key)
B = components(imageNew, key)
added = B \ A
removed = A \ B
changed = { k in A∩B : version(A[k]) != version(B[k]) || origin changed }
for each item in added/removed/changed:
  layer = attribute_to_layer(item, imageOld|imageNew)
  usageFlag = usedByEntrypoint(item, imageNew)
emit diff.json (grouped by layer with badges)
```
Diffs are stored as artifacts and feed **UI** and **CLI**.
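The outline translates fairly directly into set operations. One assumption is made explicit in this sketch: components are keyed on the purl with its `@version` suffix removed, since a key that included the version could never produce a `version_changed` entry. The input dictionaries and the `diff_images` name are illustrative.

```python
def component_key(component: dict) -> str:
    purl = component.get("purl")
    if purl:
        return purl.split("@", 1)[0]     # assumption: key on the versionless purl
    return f"bin:{component['sha256']}"


def diff_images(old: list, new: list) -> dict:
    """Return changes grouped by the introducing or removing layer digest."""
    a = {component_key(c): c for c in old}
    b = {component_key(c): c for c in new}
    changes = [("added", k, b[k]) for k in sorted(b.keys() - a.keys())]
    changes += [("removed", k, a[k]) for k in sorted(a.keys() - b.keys())]
    for k in sorted(a.keys() & b.keys()):
        if a[k].get("version") != b[k].get("version"):
            changes.append(("version_changed", k, b[k]))
        elif a[k].get("origin") != b[k].get("origin"):
            changes.append(("metadata_changed", k, b[k]))
    by_layer = {}
    for kind, key, comp in changes:
        # Usage is evaluated against the new image, so removed items are never "used".
        used = kind != "removed" and bool(comp.get("usedByEntrypoint"))
        by_layer.setdefault(comp["layer"], []).append(
            {"change": kind, "key": key, "usedByEntrypoint": used})
    return by_layer
```

Distinguishing `upgraded` from `downgraded` needs ecosystem-specific version ordering and is left out.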
---
## 7) Build-time SBOMs (fast CI path)
**Scanner.Sbomer.BuildXPlugin** can act as a BuildKit **generator**:
* During `docker buildx build --attest=type=sbom,generator=stellaops/sbom-indexer`, run analyzers on the build context/output; attach SBOMs as OCI **referrers** to the built image.
* Optionally request **Signer/Attestor** to produce **StellaOps-verified** attestation immediately; else, Scanner.WebService can verify and re-attest post-push.
* Scanner.WebService trusts build-time SBOMs per policy, enabling **no-rescan** for unchanged bases.
---
## 8) Configuration (YAML)
```yaml
scanner:
  queue:
    kind: redis
    url: "redis://queue:6379/0"
  mongo:
    uri: "mongodb://mongo/scanner"
  s3:
    endpoint: "http://minio:9000"
    bucket: "stellaops"
    objectLock: "governance" # or 'compliance'
  analyzers:
    os: { apk: true, dpkg: true, rpm: true }
    lang: { java: true, node: true, python: true, go: true, dotnet: true, rust: true }
    native: { elf: true, pe: false, macho: false } # PE/Mach-O in M2
    entryTrace: { enabled: true, shellMaxDepth: 64, followRunParts: true }
  emit:
    cdx: { json: true, protobuf: true }
    spdx: { json: true }
    compress: "zstd"
  rekor:
    url: "https://rekor-v2.internal"
  signer:
    url: "https://signer.internal"
  limits:
    maxParallel: 8
    perRegistryConcurrency: 2
  policyHints:
    verifyImageSignature: false
    trustBuildTimeSboms: true
```
---
## 9) Scale & performance
* **Parallelism**: per-analyzer concurrency; bounded directory walkers; file CAS dedupe by sha256.
* **Distributed locks** per **layer digest** to prevent duplicate work across Workers.
* **Registry throttles**: per-host concurrency budgets; exponential backoff on 429/5xx.
* **Targets**:
  * **Build-time**: P95 ≤35s on warmed bases (CI generator).
  * **Post-build delta**: P95 ≤10s for 200MB images with cache hit.
  * **Emit**: CycloneDX Protobuf ≤150ms for 5k components; JSON ≤500ms.
  * **Diff**: ≤200ms for 5k vs 5k components.
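The registry throttle behaviour (exponential backoff on 429/5xx) could be sketched with the "full jitter" variant, where each delay is drawn uniformly between zero and a capped exponential bound. The base, cap, and function names are assumptions chosen for illustration.

```python
import random


def should_retry(status: int) -> bool:
    """Retry only on rate limiting and server-side errors."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random) -> float:
    """Full-jitter delay in seconds for a zero-based attempt number."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


for attempt in range(5):
    print(attempt, round(backoff_delay(attempt), 3))
```

Jitter spreads retries from many Workers so they do not hit the same registry host in lockstep.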
---
## 10) Security posture
* **AuthN**: Authority-issued short OpToks (DPoP/mTLS).
* **AuthZ**: scopes (`scanner.scan`, `scanner.export`, `scanner.catalog.read`).
* **mTLS** to **Signer**/**Attestor**; only **Signer** can sign.
* **No network fetches** during analysis (except registry pulls and optional Rekor index reads).
* **Sandboxing**: non-root containers; read-only FS; seccomp profiles; disable execution of scanned content.
* **Release integrity**: all first-party images are **cosign-signed**; Workers/WebService self-verify at startup.
---
## 11) Observability & audit
* **Metrics**:
  * `scanner.jobs_inflight`, `scanner.scan_latency_seconds`
  * `scanner.layer_cache_hits_total`, `scanner.file_cas_hits_total`
  * `scanner.artifact_bytes_total{format}`
  * `scanner.attestation_latency_seconds`, `scanner.rekor_failures_total`
* **Tracing**: spans for acquire→union→analyzers→compose→emit→sign→log.
* **Audit logs**: DSSE requests log `license_id`, `image_digest`, `artifactSha256`, `policy_digest?`, Rekor UUID on success.
---
## 12) Testing matrix
* **Determinism:** given same image + analyzers → byte-identical **CDX Protobuf**; JSON normalized.
* **OS packages:** ground-truth images per distro; compare to package DB.
* **Lang ecosystems:** sample images per ecosystem (Java/Node/Python/Go/.NET/Rust) with installed metadata; negative tests w/ lockfile-only.
* **Native & EntryTrace:** ELF graph correctness; shell AST cases (includes, run-parts, exec, case/if).
* **Diff:** layer attribution against synthetic two-image sequences.
* **Performance:** cold vs warm cache; large `node_modules` and `site-packages`.
* **Security:** ensure no code execution from image; fuzz parser inputs; path traversal resistance on layer extract.
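A determinism test for the JSON output might normalize the document before hashing. What "normalized" means is not defined here, so this sketch assumes sorted object keys, compact separators, and components ordered by a stable key; the helper names are hypothetical.

```python
import hashlib
import json


def canonical_json(doc) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


def sbom_digest(sbom: dict) -> str:
    """Hash an SBOM after ordering components so traversal order cannot leak into the bytes."""
    normalized = dict(sbom)
    normalized["components"] = sorted(
        sbom.get("components", []),
        key=lambda c: (c.get("purl") or "", c.get("name") or "", c.get("version") or ""),
    )
    return hashlib.sha256(canonical_json(normalized)).hexdigest()
```

Two scans of the same image should then produce the same digest even if the analyzers ran in a different order.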
---
## 13) Failure modes & degradations
* **Missing OS DB** (files exist, DB removed): record **files**; do **not** fabricate package components; emit `bin:{sha256}` where unavoidable; flag in evidence.
* **Unreadable metadata** (corrupt dist-info): record file evidence; skip component creation; annotate.
* **Dynamic shell constructs**: mark unresolved edges with reasons (env var unknown) and continue; **Usage** view may be partial.
* **Registry rate limits**: honor backoff; queue job retries with jitter.
* **Signer refusal** (license/plan/version): scan completes; artifact produced; **no attestation**; WebService marks result as **unverified**.
---
## 14) Optional plugins (off by default)
* **Patch-presence detector** (signature-based backport checks). Reads curated function-level signatures from advisories; inspects binaries for patched code snippets to lower false positives for backported fixes. Runs as a sidecar analyzer that **annotates** components; never overrides core identities.
* **Runtime probes** (with Zastava): when allowed, compare **/proc/<pid>/maps** (DSOs actually loaded) with static **Usage** view for precision.
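The runtime comparison can be sketched as parsing the text of a `maps` file and diffing the loaded shared objects against the static view. Filtering on `.so` in the file name is a heuristic assumed for this sketch, and how Zastava actually collects the data is not described here.

```python
import posixpath

DELETED_SUFFIX = " (deleted)"


def loaded_dsos(maps_text: str) -> set:
    """Extract file-backed shared-object paths from /proc/<pid>/maps content."""
    dsos = set()
    for line in maps_text.splitlines():
        parts = line.split(None, 5)      # address perms offset dev inode [pathname]
        if len(parts) < 6:
            continue                     # anonymous mapping without a pathname
        path = parts[5].strip()
        if path.endswith(DELETED_SUFFIX):
            path = path[: -len(DELETED_SUFFIX)]
        if path.startswith("/") and ".so" in posixpath.basename(path):
            dsos.add(path)
    return dsos


def compare_usage(maps_text: str, static_usage: set) -> dict:
    """Report DSOs seen only at runtime and DSOs predicted statically but not loaded."""
    loaded = loaded_dsos(maps_text)
    return {"runtime_only": sorted(loaded - static_usage),
            "static_only": sorted(static_usage - loaded)}
```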
---
## 15) DevOps & operations
* **HA**: WebService horizontal scale; Workers autoscale by queue depth & CPU; distributed locks on layers.
* **Retention**: ILM rules per artifact class (`short`, `default`, `compliance`); **Object Lock** for compliance artifacts (reports, signed SBOMs).
* **Upgrades**: bump **cache schema** when analyzer outputs change; WebService triggers refresh of dependent artifacts.
* **Backups**: Mongo (daily dumps); MinIO (versioned buckets, replication); Rekor v2 DB snapshots.
---
## 16) CLI & UI touch points
* **CLI**: `stellaops scan <ref>`, `stellaops diff --old --new`, `stellaops export`, `stellaops verify attestation <bundle|url>`.
* **UI**: Scan detail shows **Inventory/Usage** toggles, **Diff by Layer**, **Attestation badge** (verified/unverified), Rekor link, and **EntryTrace** chain with file:line breadcrumbs.
---
## 17) Roadmap (Scanner)
* **M2**: Windows containers (MSI/SxS/GAC analyzers), PE/Mach-O native analyzer, deeper Rust metadata.
* **M2**: Buildx generator GA (certified external registries), cross-registry trust policies.
* **M3**: Patch-presence plugin GA (opt-in), cross-image corpus clustering (evidence-only; not identity).
* **M3**: Advanced EntryTrace (POSIX shell features breadth, busybox detection).
---
### Appendix A — EntryTrace resolution (pseudo)
```csharp
ResolveEntrypoint(ImageConfig cfg, RootFs fs):
  cmd = Normalize(cfg.ENTRYPOINT, cfg.CMD)
  stack = [ Script(cmd, path=FindOnPath(cmd[0], fs)) ]
  visited = set()
  depth = 0                                   # hops followed so far
  while stack not empty and depth < MAX:
    cur = stack.pop()
    depth = depth + 1
    if cur in visited: continue               # guard against include/exec cycles
    visited.add(cur)
    if IsShellScript(cur.path):
      ast = ParseShell(cur.path)
      foreach directive in ast:
        if directive is Source include:
          p = ResolveInclude(include.path, cur.env, fs)
          stack.push(Script(p))
        if directive is Exec call:
          p = ResolveExec(call.argv[0], cur.env, fs)
          stack.push(Program(p, argv=call.argv))
        if directive is Interpreter call (python -m / node / java -jar):
          term = ResolveInterpreterTarget(call, fs)
          stack.push(Program(term))
    else:
      return Terminal(cur.path)
  return Unknown(reason="depth limit reached or no terminal program found")
```
### Appendix B — BOM-Index sidecar
```
struct Header { magic, version, imageDigest, createdAt }
vector<string> purls
map<purlIndex, roaring_bitmap> components
optional map<purlIndex, roaring_bitmap> usedByEntrypoint
```