	component_architecture_scanner.md — Stella Ops Scanner (2025Q4)
Scope. Implementation‑ready architecture for the Scanner subsystem: WebService, Workers, analyzers, SBOM assembly (inventory & usage), per‑layer caching, three‑way diffs, artifact catalog (MinIO+Mongo), attestation hand‑off, and scale/security posture. This document is the contract between the scanning plane and everything else (Policy, Vexer, Feedser, UI, CLI).
0) Mission & boundaries
Mission. Produce deterministic, explainable SBOMs and diffs for container images and filesystems, quickly and repeatedly, without guessing. Emit two views: Inventory (everything present) and Usage (entrypoint closure + actually linked libs). Attach attestations through Signer→Attestor→Rekor v2.
Boundaries.
- Scanner does not produce PASS/FAIL. The backend (Policy + Vexer + Feedser) decides presentation and verdicts.
- Scanner does not keep third‑party SBOM warehouses. It may bind to existing attestations for exact hashes.
- Core analyzers are deterministic (no fuzzy identity). Optional heuristic plug‑ins (e.g., patch‑presence) run under explicit flags and never contaminate the core SBOM.
1) Solution & project layout
src/
 ├─ StellaOps.Scanner.WebService/            # REST control plane, catalog, diff, exports
 ├─ StellaOps.Scanner.Worker/                # queue consumer; executes analyzers
 ├─ StellaOps.Scanner.Models/                # DTOs, evidence, graph nodes, CDX/SPDX adapters
 ├─ StellaOps.Scanner.Storage/               # Mongo repositories; MinIO object client; ILM/GC
 ├─ StellaOps.Scanner.Queue/                 # queue abstraction (Redis/NATS/RabbitMQ)
 ├─ StellaOps.Scanner.Cache/                 # layer cache; file CAS; bloom/bitmap indexes
 ├─ StellaOps.Scanner.EntryTrace/            # ENTRYPOINT/CMD → terminal program resolver (shell AST)
 ├─ StellaOps.Scanner.Analyzers.OS.[Apk|Dpkg|Rpm]/
 ├─ StellaOps.Scanner.Analyzers.Lang.[Java|Node|Python|Go|DotNet|Rust]/
 ├─ StellaOps.Scanner.Analyzers.Native.[ELF|PE|MachO]/   # PE/Mach-O planned (M2)
 ├─ StellaOps.Scanner.Emit.CDX/              # CycloneDX (JSON + Protobuf)
 ├─ StellaOps.Scanner.Emit.SPDX/             # SPDX 3.0.1 JSON
 ├─ StellaOps.Scanner.Diff/                  # image→layer→component three‑way diff
 ├─ StellaOps.Scanner.Index/                 # BOM‑Index sidecar (purls + roaring bitmaps)
 ├─ StellaOps.Scanner.Tests.*                # unit/integration/e2e fixtures
 └─ tools/
     ├─ StellaOps.Scanner.Sbomer.BuildXPlugin/   # BuildKit generator (image referrer SBOMs)
     └─ StellaOps.Scanner.Sbomer.DockerImage/    # CLI‑driven scanner container
Runtime form‑factor: two deployables
- Scanner.WebService (stateless REST)
- Scanner.Worker (N replicas; queue‑driven)
2) External dependencies
- OCI registry with Referrers API (discover attached SBOMs/signatures).
- MinIO (S3‑compatible) for SBOM artifacts; Object Lock for immutable classes; ILM for TTL.
- MongoDB for catalog, job state, diffs, ILM rules.
- Queue (Redis Streams/NATS/RabbitMQ).
- Authority (on‑prem OIDC) for OpToks (DPoP/mTLS).
- Signer + Attestor (+ Fulcio/KMS + Rekor v2) for DSSE + transparency.
3) Contracts & data model
3.1 Evidence‑first component model
Nodes
- Image, Layer, File
- Component (purl?, name, version?, type, id; id may be bin:{sha256})
- Executable (ELF/PE/Mach‑O), Library (native or managed), EntryScript (shell/launcher)
Edges (all carry Evidence)
- contains (Image|Layer → File)
- installs (PackageDB → Component): OS database row
- declares (InstalledMetadata → Component): dist‑info, pom.properties, deps.json…
- links_to (Executable → Library): ELF DT_NEEDED, PE imports
- calls (EntryScript → Program): file:line from shell AST
- attests (Rekor → Component|Image): SBOM/predicate binding
- bound_from_attestation (Component_attested → Component_observed): hash equality proof
Evidence
{ source: enum, locator: (path|offset|line), sha256?, method: enum, timestamp }
No confidences. Either a fact is proven with listed mechanisms, or it is not claimed.
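A minimal sketch of how the evidence record and an evidence-carrying edge could be represented. Only the field list comes from the record above; the enum members, the class names (Source, Method, Evidence, Edge), and the example paths are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Source(Enum):              # hypothetical values; the record only says "enum"
    OS_DB = "os-db"
    INSTALLED_METADATA = "installed-metadata"
    LINKER = "linker"
    SHELL_AST = "shell-ast"
    ATTESTATION = "attestation"


class Method(Enum):              # hypothetical values
    PARSE = "parse"
    HASH_MATCH = "hash-match"


@dataclass(frozen=True)
class Evidence:
    source: Source
    locator: str                 # path, byte offset, or file:line
    method: Method
    timestamp: datetime
    sha256: Optional[str] = None # optional, as in the record above


@dataclass(frozen=True)
class Edge:
    kind: str                    # contains | installs | declares | links_to | calls | ...
    src: str
    dst: str
    evidence: Evidence           # no confidence score: a fact is proven or not claimed

    def __post_init__(self):
        if self.evidence is None:
            raise ValueError("every edge must carry evidence")


edge = Edge(
    kind="links_to",
    src="/usr/bin/curl",
    dst="/usr/lib/libcurl.so.4",
    evidence=Evidence(Source.LINKER, "/usr/bin/curl", Method.PARSE,
                      datetime(2025, 1, 1, tzinfo=timezone.utc)),
)
```

Making evidence a required field mirrors the rule that an edge without a listed mechanism is simply not emitted.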
3.2 Catalog schema (Mongo)
- artifacts { _id, type: layer-bom|image-bom|diff|index, format: cdx-json|cdx-pb|spdx-json, bytesSha256, size, rekor: { uuid, index, url }?, ttlClass, immutable, refCount, createdAt }
- images { imageDigest, repo, tag?, arch, createdAt, lastSeen }
- layers { layerDigest, mediaType, size, createdAt, lastSeen }
- links { fromType, fromDigest, artifactId }   // image/layer -> artifact
- jobs { _id, kind, args, state, startedAt, heartbeatAt, endedAt, error }
- lifecycleRules { ruleId, scope, ttlDays, retainIfReferenced, immutable }
3.3 Object store layout (MinIO)
layers/<sha256>/sbom.cdx.json.zst
layers/<sha256>/sbom.spdx.json.zst
images/<imgDigest>/inventory.cdx.pb            # CycloneDX Protobuf
images/<imgDigest>/usage.cdx.pb
indexes/<imgDigest>/bom-index.bin              # purls + roaring bitmaps
diffs/<old>_<new>/diff.json.zst
attest/<artifactSha256>.dsse.json              # DSSE bundle (cert chain + Rekor proof)
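The layout above can be captured in a handful of key-building helpers so that WebService and Workers never assemble paths by hand. This is a sketch only: the function names are hypothetical, and digests are treated as opaque strings because the layout does not say whether they carry a "sha256:" prefix.

```python
VIEWS = ("inventory", "usage")
LAYER_FILES = {
    "cdx-json": "sbom.cdx.json.zst",
    "spdx-json": "sbom.spdx.json.zst",
}


def layer_sbom_key(layer_sha256: str, fmt: str) -> str:
    # Per-layer fragments exist as compressed CycloneDX or SPDX JSON.
    return f"layers/{layer_sha256}/{LAYER_FILES[fmt]}"


def image_sbom_key(image_digest: str, view: str) -> str:
    # Image-level SBOMs are stored as CycloneDX Protobuf, one object per view.
    if view not in VIEWS:
        raise ValueError(f"unknown view: {view}")
    return f"images/{image_digest}/{view}.cdx.pb"


def index_key(image_digest: str) -> str:
    return f"indexes/{image_digest}/bom-index.bin"


def diff_key(old_digest: str, new_digest: str) -> str:
    return f"diffs/{old_digest}_{new_digest}/diff.json.zst"


def attestation_key(artifact_sha256: str) -> str:
    return f"attest/{artifact_sha256}.dsse.json"
```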
4) REST API (Scanner.WebService)
All under /api/v1/scanner. Auth: OpTok (DPoP/mTLS); RBAC scopes.
POST /scans                        { imageRef|digest, force?:bool } → { scanId }
GET  /scans/{id}                   → { status, imageDigest, artifacts[], rekor? }
GET  /sboms/{imageDigest}          ?format=cdx-json|cdx-pb|spdx-json&view=inventory|usage → bytes
GET  /diff?old=<digest>&new=<digest>&view=inventory|usage → diff.json
POST /exports                      { imageDigest, format, view, attest?:bool } → { artifactId, rekor? }
POST /reports                      { imageDigest, policyRevision? } → { reportId, rekor? }   # delegates to backend policy+vex
GET  /catalog/artifacts/{id}       → { meta }
GET  /healthz | /readyz | /metrics
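As an illustration of the SBOM download route, the sketch below builds and validates the request URL on the client side. The host name in the usage line is a placeholder, and leaving the colon of the digest unescaped in the path is an assumption; the route, parameter names, and allowed values are taken from the listing above.

```python
from urllib.parse import quote, urlencode

FORMATS = {"cdx-json", "cdx-pb", "spdx-json"}
VIEWS = {"inventory", "usage"}


def sbom_url(base_url: str, image_digest: str,
             fmt: str = "cdx-json", view: str = "inventory") -> str:
    """Build the GET /sboms/{imageDigest} URL; reject unknown formats and views early."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if view not in VIEWS:
        raise ValueError(f"unsupported view: {view}")
    path = f"/api/v1/scanner/sboms/{quote(image_digest, safe=':')}"
    query = urlencode({"format": fmt, "view": view})
    return f"{base_url.rstrip('/')}{path}?{query}"


# Placeholder host; a real call would also attach the OpTok (DPoP/mTLS).
url = sbom_url("https://scanner.internal/", "sha256:abc123", "cdx-pb", "usage")
```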
5) Execution flow (Worker)
5.1 Acquire & verify
- Resolve image (prefer repo@sha256:…).
- (Optional) verify image signature per policy (cosign).
- Pull blobs, compute layer digests; record metadata.
5.2 Layer union FS
- Apply whiteouts; materialize final filesystem; map file → first introducing layer.
- Windows layers (MSI/SxS/GAC) planned in M2.
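The whiteout step can be sketched on an in-memory model of the layers. The code below follows the OCI layer convention (a file named .wh.<name> deletes <name> from lower layers, and .wh..wh..opq hides everything a lower layer placed in that directory). The function name and the input shape are assumptions, and the sketch attributes each surviving path to the layer that supplied its final content.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def union_layers(layers):
    """layers: list of (layer_digest, [paths]) ordered base first.
    Returns {path: digest of the layer that supplied the surviving entry}."""
    merged = {}
    for digest, paths in layers:
        # Whiteouts only affect lower layers, so apply them before
        # adding this layer's own entries.
        for path in paths:
            parent, name = posixpath.split(path)
            if name == OPAQUE_MARKER:
                prefix = parent.rstrip("/") + "/"
                doomed = [p for p in merged if p.startswith(prefix)]
            elif name.startswith(WHITEOUT_PREFIX):
                target = posixpath.join(parent, name[len(WHITEOUT_PREFIX):])
                doomed = [p for p in merged
                          if p == target or p.startswith(target + "/")]
            else:
                continue
            for p in doomed:
                del merged[p]
        for path in paths:
            if not posixpath.basename(path).startswith(WHITEOUT_PREFIX):
                merged[path] = digest
    return merged


final_fs = union_layers([
    ("sha256:base", ["/etc/a.conf", "/opt/app/x", "/opt/app/y", "/bin/sh"]),
    ("sha256:top", ["/etc/.wh.a.conf", "/opt/app/.wh..wh..opq", "/opt/app/z"]),
])
```

Here /bin/sh stays attributed to the base layer, /opt/app/z to the top layer, and the whiteout marker files themselves never appear in the result.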
5.3 Evidence harvest (parallel analyzers; deterministic only)
A) OS packages
- apk: /lib/apk/db/installed
- dpkg: /var/lib/dpkg/status, /var/lib/dpkg/info/*.list
- rpm: /var/lib/rpm/Packages (via librpm or parser)
- Record name, version (epoch/revision), arch, source package where present, and declared file lists.
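For the dpkg case, harvesting amounts to reading the stanza-formatted status file and keeping only packages whose state is "installed". The parser below is a simplified sketch, not the analyzer's actual implementation; the function name and the output dictionary shape are assumptions.

```python
import re


def parse_dpkg_status(text):
    """Parse /var/lib/dpkg/status content into a list of installed packages."""
    packages = []
    for stanza in re.split(r"\n\s*\n", text.strip()):
        fields, last = {}, None
        for line in stanza.splitlines():
            if line[:1] in (" ", "\t"):
                if last:                       # continuation of a multi-line field
                    fields[last] += "\n" + line.strip()
            elif ":" in line:
                key, _, value = line.partition(":")
                last = key.strip()
                fields[last] = value.strip()
        status = fields.get("Status", "").split()
        # Status is "<want> <flag> <state>"; only state == installed counts.
        if "Package" not in fields or status[-1:] != ["installed"]:
            continue
        source = fields.get("Source")
        packages.append({
            "name": fields["Package"],
            "version": fields.get("Version"),   # keeps any epoch and revision
            "arch": fields.get("Architecture"),
            # Source may read "name (version)"; keep the name, only where present.
            "source": source.split()[0] if source else None,
        })
    return packages
```

Packages left in states such as config-files are skipped, so a removed package with leftover configuration never becomes a component.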
B) Language ecosystems (installed state only)
- Java: META-INF/maven/*/pom.properties, MANIFEST → pkg:maven/...
- Node: node_modules/**/package.json → pkg:npm/...
- Python: *.dist-info/{METADATA,RECORD} → pkg:pypi/...
- Go: Go buildinfo in binaries → pkg:golang/...
- .NET: *.deps.json + assembly metadata → pkg:nuget/...
- Rust: crates only when explicitly present (embedded metadata or cargo/registry traces); otherwise binaries reported as bin:{sha256}.
Rule: We only report components proven on disk with authoritative metadata. Lockfiles are evidence only.
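The "installed state only" rule can be illustrated with the Python ecosystem: a component is emitted only when a *.dist-info/METADATA file is present, while a lockfile such as requirements.txt produces nothing. The sketch below assumes an in-memory {path: text} view of the root filesystem and hypothetical function names; it lowercases the name and replaces underscores with dashes as the purl pypi type requires, and it omits percent-encoding of unusual version strings.

```python
from email.parser import Parser


def pypi_purl_from_metadata(metadata_text):
    """Derive pkg:pypi/<name>@<version> from dist-info METADATA headers."""
    msg = Parser().parsestr(metadata_text, headersonly=True)
    name, version = msg.get("Name"), msg.get("Version")
    if not name or not version:
        return None                       # unreadable metadata: no component
    normalized = name.strip().lower().replace("_", "-")
    return f"pkg:pypi/{normalized}@{version.strip()}"


def python_components(files):
    """files: {path: text}. Returns {purl: path of the METADATA that proves it}."""
    found = {}
    for path, text in sorted(files.items()):
        if path.endswith(".dist-info/METADATA"):
            purl = pypi_purl_from_metadata(text)
            if purl:
                found[purl] = path
    return found


rootfs = {
    "/usr/lib/python3/site-packages/Flask_Cors-4.0.0.dist-info/METADATA":
        "Metadata-Version: 2.1\nName: Flask_Cors\nVersion: 4.0.0\n\nDescription body\n",
    "/app/requirements.txt": "requests==2.31.0\n",   # lockfile only: not a component
}
components = python_components(rootfs)
```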
C) Native link graph
- ELF: parse PT_INTERP, DT_NEEDED, RPATH/RUNPATH, GNU symbol versions; map SONAMEs to file paths; link executables → libs.
- PE/Mach‑O (planned M2): import table, delay‑imports; version resources; code signatures.
- Map libs back to OS packages if possible (via file lists); else emit bin:{sha256} components.
D) EntryTrace (ENTRYPOINT/CMD → terminal program)
- Read image config; parse shell (POSIX/Bash subset) with AST: source/. includes; case/if; exec/command; run‑parts.
- Resolve commands via PATH within the built rootfs; follow language launchers (Java/Node/Python) to identify the terminal program (ELF/JAR/venv script).
- Record file:line and choices for each hop; output chain graph.
- Unresolvable dynamic constructs are recorded as unknown edges with reasons (e.g., $FOO unresolved).
E) Attestation & SBOM bind (optional)
- For each file hash or binary hash, query local cache of Rekor v2 indices; if an SBOM attestation is found for exact hash, bind it to the component (origin=attested).
- For the image digest, likewise bind SBOM attestations (build‑time referrers).
5.4 Component normalization (exact only)
- Create Component nodes only with deterministic identities: purl, or bin:{sha256} for unlabeled binaries.
- Record origin (OS DB, installed metadata, linker, attestation).
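A small sketch of the identity rule: a purl wins when one is proven, a content hash is the fallback, and anything else is refused instead of guessed. The function names and the dictionary shape are hypothetical, and the sketch assumes the braces in bin:{sha256} are a placeholder for the hex digest.

```python
import hashlib


def component_id(purl=None, file_bytes=None):
    """Return a deterministic identity or raise; never invent one."""
    if purl:
        return purl
    if file_bytes is not None:
        return "bin:" + hashlib.sha256(file_bytes).hexdigest()
    raise ValueError("no deterministic identity; do not emit a component")


def make_component(origin, purl=None, file_bytes=None, name=None, version=None):
    # origin records how the component was proven:
    # OS DB, installed metadata, linker, or attestation.
    return {
        "id": component_id(purl, file_bytes),
        "purl": purl,
        "name": name,
        "version": version,
        "origin": origin,
    }


unlabeled = make_component("linker", file_bytes=b"")
```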
5.5 SBOM assembly & emit
- Per‑layer SBOM fragments: components introduced by the layer (+ relationships).
- Image SBOMs: merge fragments; refer back to them via CycloneDX BOM‑Link (or SPDX ExternalRef).
- Emit both Inventory & Usage views.
- Serialize CycloneDX JSON and CycloneDX Protobuf; optionally SPDX 3.0.1 JSON.
- Build BOM‑Index sidecar: purl table + roaring bitmap; flag usedByEntrypoint components for fast backend joins.
5.6 DSSE attestation (via Signer/Attestor)
- WebService constructs predicate with image_digest, stellaops_version, license_id, policy_digest? (when emitting final reports), and timestamps.
- Calls Signer (requires OpTok + PoE); Signer verifies entitlement + scanner image integrity and returns DSSE bundle.
- Attestor logs to Rekor v2; returns { uuid, index, proof }, which is stored in artifacts.rekor.
6) Three‑way diff (image → layer → component)
6.1 Keys & classification
- Component key: purl when present; else bin:{sha256}.
- Diff classes: added, removed, version_changed (upgraded|downgraded), metadata_changed (e.g., origin from attestation vs observed).
- Layer attribution: for each change, resolve the introducing/removing layer.
6.2 Algorithm (outline)
A = components(imageOld, key)
B = components(imageNew, key)
added   = B \ A
removed = A \ B
changed = { k in A∩B : version(A[k]) != version(B[k]) || origin changed }
for each item in added/removed/changed:
   layer = attribute_to_layer(item, imageOld|imageNew)
   usageFlag = usedByEntrypoint(item, imageNew)
emit diff.json (grouped by layer with badges)
Diffs are stored as artifacts and feed UI and CLI.
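The outline above translates almost directly into set operations. The sketch below is one possible reading, with several assumptions: component maps are plain dictionaries, keys are version-less (a purl that embeds the version could never match on both sides of a version change), the upgraded/downgraded split is left out because version ordering is ecosystem-specific, and the usage flag is always read from the new image as in the outline.

```python
def diff_components(old, new):
    """old/new: {key: {"version", "origin", "layer", "used"}}.
    Returns {layer_digest: [change records]}."""
    changes = []
    for key in sorted(new.keys() - old.keys()):
        changes.append((key, "added", new[key]["layer"]))
    for key in sorted(old.keys() - new.keys()):
        # A removed component is attributed to the layer that held it in the old image.
        changes.append((key, "removed", old[key]["layer"]))
    for key in sorted(old.keys() & new.keys()):
        if old[key]["version"] != new[key]["version"]:
            changes.append((key, "version_changed", new[key]["layer"]))
        elif old[key]["origin"] != new[key]["origin"]:
            changes.append((key, "metadata_changed", new[key]["layer"]))

    by_layer = {}
    for key, kind, layer in changes:
        by_layer.setdefault(layer, []).append({
            "key": key,
            "change": kind,
            # Per the outline, usage is evaluated against the new image.
            "usedByEntrypoint": new.get(key, {}).get("used", False),
        })
    return by_layer


old_image = {
    "pkg:deb/debian/openssl": {"version": "3.0.9", "origin": "os-db",
                               "layer": "sha256:l1", "used": True},
    "pkg:npm/left-pad": {"version": "1.3.0", "origin": "installed-metadata",
                         "layer": "sha256:l2", "used": False},
}
new_image = {
    "pkg:deb/debian/openssl": {"version": "3.0.11", "origin": "os-db",
                               "layer": "sha256:n1", "used": True},
    "bin:abc": {"version": None, "origin": "linker",
                "layer": "sha256:n2", "used": True},
}
diff = diff_components(old_image, new_image)
```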
7) Build‑time SBOMs (fast CI path)
Scanner.Sbomer.BuildXPlugin can act as a BuildKit generator:
- During docker buildx build --attest=type=sbom,generator=stellaops/sbom-indexer, run analyzers on the build context/output; attach SBOMs as OCI referrers to the built image.
- Optionally request Signer/Attestor to produce Stella Ops‑verified attestation immediately; else, Scanner.WebService can verify and re‑attest post‑push.
- Scanner.WebService trusts build‑time SBOMs per policy, enabling no‑rescan for unchanged bases.
8) Configuration (YAML)
scanner:
  queue:
    kind: redis
    url: "redis://queue:6379/0"
  mongo:
    uri: "mongodb://mongo/scanner"
  s3:
    endpoint: "http://minio:9000"
    bucket: "stellaops"
    objectLock: "governance"   # or 'compliance'
  analyzers:
    os: { apk: true, dpkg: true, rpm: true }
    lang: { java: true, node: true, python: true, go: true, dotnet: true, rust: true }
    native: { elf: true, pe: false, macho: false }    # PE/Mach-O in M2
    entryTrace: { enabled: true, shellMaxDepth: 64, followRunParts: true }
  emit:
    cdx: { json: true, protobuf: true }
    spdx: { json: true }
    compress: "zstd"
  rekor:
    url: "https://rekor-v2.internal"
  signer:
    url: "https://signer.internal"
  limits:
    maxParallel: 8
    perRegistryConcurrency: 2
  policyHints:
    verifyImageSignature: false
    trustBuildTimeSboms: true
9) Scale & performance
- Parallelism: per‑analyzer concurrency; bounded directory walkers; file CAS dedupe by sha256.
- Distributed locks per layer digest to prevent duplicate work across Workers.
- Registry throttles: per‑host concurrency budgets; exponential backoff on 429/5xx.
- Targets:
  - Build‑time: P95 ≤ 3–5 s on warmed bases (CI generator).
  - Post‑build delta: P95 ≤ 10 s for 200 MB images with cache hit.
  - Emit: CycloneDX Protobuf ≤ 150 ms for 5k components; JSON ≤ 500 ms.
  - Diff: ≤ 200 ms for 5k vs 5k components.
10) Security posture
- AuthN: Authority‑issued short OpToks (DPoP/mTLS).
- AuthZ: scopes (scanner.scan, scanner.export, scanner.catalog.read).
- mTLS to Signer/Attestor; only Signer can sign.
- No network fetches during analysis (except registry pulls and optional Rekor index reads).
- Sandboxing: non‑root containers; read‑only FS; seccomp profiles; disable execution of scanned content.
- Release integrity: all first‑party images are cosign‑signed; Workers/WebService self‑verify at startup.
11) Observability & audit
- Metrics:
  - scanner.jobs_inflight, scanner.scan_latency_seconds
  - scanner.layer_cache_hits_total, scanner.file_cas_hits_total
  - scanner.artifact_bytes_total{format}
  - scanner.attestation_latency_seconds, scanner.rekor_failures_total
- Tracing: spans for acquire→union→analyzers→compose→emit→sign→log.
- Audit logs: DSSE requests log license_id, image_digest, artifactSha256, policy_digest?, and the Rekor UUID on success.
12) Testing matrix
- Determinism: given same image + analyzers → byte‑identical CDX Protobuf; JSON normalized.
- OS packages: ground‑truth images per distro; compare to package DB.
- Lang ecosystems: sample images per ecosystem (Java/Node/Python/Go/.NET/Rust) with installed metadata; negative tests w/ lockfile‑only.
- Native & EntryTrace: ELF graph correctness; shell AST cases (includes, run‑parts, exec, case/if).
- Diff: layer attribution against synthetic two‑image sequences.
- Performance: cold vs warm cache; large node_modules and site‑packages.
- Security: ensure no code execution from image; fuzz parser inputs; path traversal resistance on layer extract.
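The determinism check for JSON output depends on a normalization step before bytes are compared. A minimal sketch, assuming normalization means sorted keys, compact separators, and components ordered by bom-ref (the document does not spell out the rules), and assuming volatile fields such as timestamps have already been pinned or stripped:

```python
import hashlib
import json


def canonical_json(sbom: dict) -> bytes:
    """Serialize an SBOM dict so that logically equal inputs give equal bytes."""
    doc = dict(sbom)
    doc["components"] = sorted(
        sbom.get("components", []), key=lambda c: c.get("bom-ref", "")
    )
    return json.dumps(doc, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def digest(sbom: dict) -> str:
    return hashlib.sha256(canonical_json(sbom)).hexdigest()


# Two runs that differ only in key order and component order.
run_a = {"bomFormat": "CycloneDX", "components": [
    {"bom-ref": "pkg:npm/b@2", "name": "b"},
    {"name": "a", "bom-ref": "pkg:npm/a@1"},
]}
run_b = {"components": [
    {"bom-ref": "pkg:npm/a@1", "name": "a"},
    {"bom-ref": "pkg:npm/b@2", "name": "b"},
], "bomFormat": "CycloneDX"}
same = digest(run_a) == digest(run_b)
```

A test in this style would scan the same image twice and assert that the two digests match.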
13) Failure modes & degradations
- Missing OS DB (files exist, DB removed): record files; do not fabricate package components; emit bin:{sha256} where unavoidable; flag in evidence.
- Unreadable metadata (corrupt dist‑info): record file evidence; skip component creation; annotate.
- Dynamic shell constructs: mark unresolved edges with reasons (env var unknown) and continue; Usage view may be partial.
- Registry rate limits: honor backoff; queue job retries with jitter.
- Signer refusal (license/plan/version): scan completes; artifact produced; no attestation; WebService marks result as unverified.
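The registry rate-limit handling (backoff on 429/5xx, retries with jitter) can be sketched as a retry wrapper around a fetch callable. The base delay, cap, attempt count, and function names are illustrative assumptions; the sketch uses "full jitter", where each delay is drawn uniformly between zero and the exponential ceiling, and it retries in-process for brevity where the Worker would re-queue the job.

```python
import random
import time

RETRYABLE = {429} | set(range(500, 600))


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def pull_with_retry(fetch, max_attempts=5, sleep=time.sleep, rng=random.random):
    """fetch() returns (status, body). Retries on 429 and 5xx only."""
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in RETRYABLE:
            break
        if attempt + 1 < max_attempts:     # no sleep after the final attempt
            sleep(backoff_delay(attempt, rng=rng))
    return status, body


# Simulated registry: throttled once, failing once, then succeeding.
responses = iter([(429, b""), (503, b""), (200, b"blob")])
slept = []
result = pull_with_retry(lambda: next(responses),
                         sleep=slept.append, rng=lambda: 1.0)
```

Injecting sleep and rng keeps the wrapper testable without real waiting.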
14) Optional plug‑ins (off by default)
- Patch‑presence detector (signature‑based backport checks). Reads curated function‑level signatures from advisories; inspects binaries for patched code snippets to lower false‑positives for backported fixes. Runs as a sidecar analyzer that annotates components; never overrides core identities.
- Runtime probes (with Zastava): when allowed, compare /proc/<pid>/maps (DSOs actually loaded) with static Usage view for precision.
15) DevOps & operations
- HA: WebService horizontal scale; Workers autoscale by queue depth & CPU; distributed locks on layers.
- Retention: ILM rules per artifact class (short, default, compliance); Object Lock for compliance artifacts (reports, signed SBOMs).
- Upgrades: bump cache schema when analyzer outputs change; WebService triggers refresh of dependent artifacts.
- Backups: Mongo (daily dumps); MinIO (versioned buckets, replication); Rekor v2 DB snapshots.
16) CLI & UI touch points
- CLI: stellaops scan <ref>, stellaops diff --old --new, stellaops export, stellaops verify attestation <bundle|url>.
- UI: Scan detail shows Inventory/Usage toggles, Diff by Layer, Attestation badge (verified/unverified), Rekor link, and EntryTrace chain with file:line breadcrumbs.
17) Roadmap (Scanner)
- M2: Windows containers (MSI/SxS/GAC analyzers), PE/Mach‑O native analyzer, deeper Rust metadata.
- M2: Buildx generator GA (certified external registries), cross‑registry trust policies.
- M3: Patch‑presence plug‑in GA (opt‑in), cross‑image corpus clustering (evidence‑only; not identity).
- M3: Advanced EntryTrace (POSIX shell features breadth, busybox detection).
Appendix A — EntryTrace resolution (pseudo)
ResolveEntrypoint(ImageConfig cfg, RootFs fs):
  cmd = Normalize(cfg.ENTRYPOINT, cfg.CMD)
  stack = [ Script(cmd, path=FindOnPath(cmd[0], fs)) ]
  visited = set()
  depth = 0
  reason = "no terminal program reached"
  while stack not empty:
    if depth >= MAX:                        # bound, e.g. entryTrace.shellMaxDepth
       return Unknown("max depth exceeded")
    cur = stack.pop()
    depth = depth + 1
    if cur.path in visited: continue        # cycle guard
    visited.add(cur.path)
    if cur.path is unresolved:
       reason = "unresolved: " + cur        # recorded as an unknown edge
       continue
    if IsShellScript(cur.path):
       ast = ParseShell(cur.path)
       foreach directive in ast:
         if directive is Source include:
            p = ResolveInclude(include.path, cur.env, fs)
            stack.push(Script(p))
         if directive is Exec call:
            p = ResolveExec(call.argv[0], cur.env, fs)
            stack.push(Program(p, argv=call.argv))
         if directive is Interpreter call (python -m / node / java -jar):
            term = ResolveInterpreterTarget(call, fs)
            stack.push(Program(term))
    else:
       return Terminal(cur.path)
  return Unknown(reason)
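To make the control flow concrete, here is a toy, runnable version of the loop above over an in-memory root filesystem. It is deliberately not a shell AST: it only recognizes lines that start with exec, source, or ".", treats any argument containing "$" as unresolved, and ignores interpreter launchers. All names, the default PATH, and the fs shape are assumptions.

```python
import shlex

MAX_DEPTH = 64  # mirrors entryTrace.shellMaxDepth in the sample config


def find_on_path(cmd, fs, path="/usr/local/bin:/usr/bin:/bin"):
    if "/" in cmd:
        return cmd if cmd in fs else None
    for directory in path.split(":"):
        candidate = f"{directory}/{cmd}"
        if candidate in fs:
            return candidate
    return None


def resolve_entrypoint(entrypoint, cmd, fs):
    """fs: {path: str for scripts, bytes for binaries}.
    Returns (kind, value, chain); chain holds (file:line, directive, target)."""
    argv = list(entrypoint or []) + list(cmd or [])
    if not argv:
        return ("unknown", "empty ENTRYPOINT/CMD", [])
    start = find_on_path(argv[0], fs)
    if start is None:
        return ("unknown", f"{argv[0]} not found on PATH", [])
    stack, visited, chain = [start], set(), []
    reason, steps = "no terminal program reached", 0
    while stack:
        if steps >= MAX_DEPTH:
            return ("unknown", "max depth exceeded", chain)
        steps += 1
        cur = stack.pop()
        if cur in visited:
            continue
        visited.add(cur)
        content = fs[cur]
        if not (isinstance(content, str) and content.startswith("#!")):
            return ("terminal", cur, chain)      # not a script: terminal program
        for lineno, line in enumerate(content.splitlines(), 1):
            try:
                words = shlex.split(line, comments=True)
            except ValueError:                   # unbalanced quotes: skip the line
                continue
            if len(words) < 2 or words[0] not in ("exec", ".", "source"):
                continue
            target = words[1]
            resolved = None if "$" in target else find_on_path(target, fs)
            if resolved is None:
                reason = f"{cur}:{lineno}: {target} unresolved"
                continue
            chain.append((f"{cur}:{lineno}", words[0], resolved))
            stack.append(resolved)
    return ("unknown", reason, chain)


rootfs = {
    "/usr/local/bin/docker-entrypoint.sh":
        "#!/bin/sh\n. /etc/app/env.sh\nexec app-server --port 8080\n",
    "/etc/app/env.sh": "#!/bin/sh\nexport FOO=bar\n",
    "/usr/bin/app-server": b"\x7fELF",
}
outcome = resolve_entrypoint(["docker-entrypoint.sh"], [], rootfs)
```

In this example the chain records the include at line 2 and the exec at line 3, and the walk ends at /usr/bin/app-server.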
Appendix B — BOM‑Index sidecar
struct Header { magic, version, imageDigest, createdAt }
vector<string> purls
map<purlIndex, roaring_bitmap> components
optional map<purlIndex, roaring_bitmap> usedByEntrypoint
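The structure above leaves open what the bitmaps index. The sketch below assumes each bitmap holds the ordinals of the layers in which a purl appears, and it stands in a plain Python integer bitmask for a roaring bitmap. The function names and the input tuple shape are hypothetical.

```python
import bisect


def build_bom_index(components):
    """components: iterable of (purl, layer_ordinal, used_by_entrypoint)."""
    purls = sorted({purl for purl, _, _ in components})   # sorted for determinism
    position = {purl: i for i, purl in enumerate(purls)}
    present, used = {}, {}
    for purl, layer, is_used in components:
        i = position[purl]
        present[i] = present.get(i, 0) | (1 << layer)
        if is_used:
            used[i] = used.get(i, 0) | (1 << layer)
    return {"purls": purls, "components": present, "usedByEntrypoint": used}


def layers_of(index, purl, table="components"):
    """Return the layer ordinals recorded for a purl; empty if absent."""
    purls = index["purls"]
    i = bisect.bisect_left(purls, purl)       # binary search in the purl table
    if i == len(purls) or purls[i] != purl:
        return []
    mask = index[table].get(i, 0)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


index = build_bom_index([
    ("pkg:npm/a@1", 0, False),
    ("pkg:npm/b@2", 1, True),
    ("pkg:npm/a@1", 2, False),
])
```

A backend join such as "which images use this purl from the entrypoint" then reduces to a table lookup plus a bitmap test.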