
component_architecture_scanner.md — StellaOps Scanner (2025Q4)

Scope. Implementation-ready architecture for the Scanner subsystem: WebService, Workers, analyzers, SBOM assembly (inventory & usage), per-layer caching, three-way diffs, artifact catalog (RustFS default + Mongo, S3-compatible fallback), attestation hand-off, and scale/security posture. This document is the contract between the scanning plane and everything else (Policy, Excititor, Concelier, UI, CLI).


0) Mission & boundaries

Mission. Produce deterministic, explainable SBOMs and diffs for container images and filesystems, quickly and repeatedly, without guessing. Emit two views: Inventory (everything present) and Usage (entrypoint closure + actually linked libs). Attach attestations through Signer→Attestor→Rekor v2.

Boundaries.

  • Scanner does not produce PASS/FAIL. The backend (Policy + Excititor + Concelier) decides presentation and verdicts.
  • Scanner does not keep third-party SBOM warehouses. It may bind to existing attestations for exact hashes.
  • Core analyzers are deterministic (no fuzzy identity). Optional heuristic plugins (e.g., patch-presence) run under explicit flags and never contaminate the core SBOM.

1) Solution & project layout

src/
 ├─ StellaOps.Scanner.WebService/            # REST control plane, catalog, diff, exports
 ├─ StellaOps.Scanner.Worker/                # queue consumer; executes analyzers
 ├─ StellaOps.Scanner.Models/                # DTOs, evidence, graph nodes, CDX/SPDX adapters
 ├─ StellaOps.Scanner.Storage/               # Mongo repositories; RustFS object client (default) + S3 fallback; ILM/GC
 ├─ StellaOps.Scanner.Queue/                 # queue abstraction (Redis/NATS/RabbitMQ)
 ├─ StellaOps.Scanner.Cache/                 # layer cache; file CAS; bloom/bitmap indexes
 ├─ StellaOps.Scanner.EntryTrace/            # ENTRYPOINT/CMD → terminal program resolver (shell AST)
 ├─ StellaOps.Scanner.Analyzers.OS.[Apk|Dpkg|Rpm]/
 ├─ StellaOps.Scanner.Analyzers.Lang.[Java|Node|Python|Go|DotNet|Rust]/
 ├─ StellaOps.Scanner.Analyzers.Native.[ELF|PE|MachO]/   # PE/Mach-O planned (M2)
 ├─ StellaOps.Scanner.Emit.CDX/              # CycloneDX (JSON + Protobuf)
 ├─ StellaOps.Scanner.Emit.SPDX/             # SPDX 3.0.1 JSON
 ├─ StellaOps.Scanner.Diff/                  # image→layer→component three-way diff
 ├─ StellaOps.Scanner.Index/                 # BOMIndex sidecar (purls + roaring bitmaps)
 ├─ StellaOps.Scanner.Tests.*                # unit/integration/e2e fixtures
 └─ tools/
     ├─ StellaOps.Scanner.Sbomer.BuildXPlugin/   # BuildKit generator (image referrer SBOMs)
      └─ StellaOps.Scanner.Sbomer.DockerImage/    # CLI-driven scanner container

Analyzer assemblies and buildx generators are packaged as restart-time plug-ins under plugins/scanner/** with manifests; services must restart to activate new plug-ins.

1.1 Queue backbone (Redis / NATS)

StellaOps.Scanner.Queue exposes a transport-agnostic contract (IScanQueue/IScanQueueLease) used by the WebService producer and Worker consumers. Sprint 9 introduces two first-party transports:

  • Redis Streams (default). Uses consumer groups, deterministic idempotency keys (scanner:jobs:idemp:*), and supports lease claim (XCLAIM), renewal, exponential-backoff retries, and a scanner:jobs:dead stream for exhausted attempts.
  • NATS JetStream. Provisions the SCANNER_JOBS work-queue stream + durable consumer scanner-workers, publishes with MsgId for dedupe, applies backoff via NAK delays, and routes dead-lettered jobs to SCANNER_JOBS_DEAD.

Metrics are emitted via Meter counters (scanner_queue_enqueued_total, scanner_queue_retry_total, scanner_queue_deadletter_total), and ScannerQueueHealthCheck pings the active backend (Redis PING, NATS PING). Configuration is bound from scanner.queue:

scanner:
  queue:
    kind: redis # or nats
    redis:
      connectionString: "redis://queue:6379/0"
      streamName: "scanner:jobs"
    nats:
      url: "nats://queue:4222"
      stream: "SCANNER_JOBS"
      subject: "scanner.jobs"
      durableConsumer: "scanner-workers"
      deadLetterSubject: "scanner.jobs.dead"
    maxDeliveryAttempts: 5
    retryInitialBackoff: 00:00:05
    retryMaxBackoff: 00:02:00

The DI extension (AddScannerQueue) wires the selected transport, so future additions (e.g., RabbitMQ) only implement the same contract and register.
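
For orientation, here is a minimal sketch of what that transport-agnostic contract could look like; the member names below are illustrative assumptions, not the actual StellaOps.Scanner.Queue surface.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch only; the real IScanQueue/IScanQueueLease members may differ.
public interface IScanQueue
{
    // Enqueue a scan job; the idempotency key deduplicates retries of the same request.
    ValueTask<string> EnqueueAsync(ScanJobPayload payload, string idempotencyKey, CancellationToken ct = default);

    // Claim up to maxJobs leased messages for this consumer (XCLAIM / durable pull under the hood).
    ValueTask<IReadOnlyList<IScanQueueLease>> LeaseAsync(int maxJobs, TimeSpan leaseDuration, CancellationToken ct = default);
}

public interface IScanQueueLease
{
    string MessageId { get; }
    int Attempt { get; }
    ScanJobPayload Payload { get; }

    ValueTask RenewAsync(TimeSpan extension, CancellationToken ct = default);  // keep the lease alive
    ValueTask CompleteAsync(CancellationToken ct = default);                   // acknowledge success
    ValueTask RetryAsync(TimeSpan backoff, CancellationToken ct = default);    // re-deliver after backoff (NAK delay / pending re-claim)
    ValueTask DeadLetterAsync(string reason, CancellationToken ct = default);  // route to scanner:jobs:dead / SCANNER_JOBS_DEAD
}

public sealed record ScanJobPayload(string ImageDigest, bool Force);
```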

Runtime form-factor: two deployables

  • Scanner.WebService (stateless REST)
  • Scanner.Worker (N replicas; queue-driven)

2) External dependencies

  • OCI registry with Referrers API (discover attached SBOMs/signatures).
  • RustFS (default, offline-first) for SBOM artifacts; optional S3/MinIO compatibility retained for migration; Object Lock semantics emulated via retention headers; ILM for TTL.
  • MongoDB for catalog, job state, diffs, ILM rules.
  • Queue (Redis Streams/NATS/RabbitMQ).
  • Authority (on-prem OIDC) for OpToks (DPoP/mTLS).
  • Signer + Attestor (+ Fulcio/KMS + Rekor v2) for DSSE + transparency.

3) Contracts & data model

3.1 Evidence-first component model

Nodes

  • Image, Layer, File
  • Component (purl?, name, version?, type, id — may be bin:{sha256})
  • Executable (ELF/PE/MachO), Library (native or managed), EntryScript (shell/launcher)

Edges (all carry Evidence)

  • contains(Image|Layer → File)
  • installs(PackageDB → Component) (OS database row)
  • declares(InstalledMetadata → Component) (dist-info, pom.properties, deps.json…)
  • links_to(Executable → Library) (ELF DT_NEEDED, PE imports)
  • calls(EntryScript → Program) (file:line from shell AST)
  • attests(Rekor → Component|Image) (SBOM/predicate binding)
  • bound_from_attestation(Component_attested → Component_observed) (hash equality proof)

Evidence

{ source: enum, locator: (path|offset|line), sha256?, method: enum, timestamp }

No confidences. Either a fact is proven with listed mechanisms, or it is not claimed.
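
Read as a hypothetical C# model, the evidence shape above maps to something like the following (type and enum names are illustrative, not the StellaOps.Scanner.Models contract):

```csharp
using System;

// Illustrative mapping of the evidence shape; not the actual StellaOps.Scanner.Models types.
public enum EvidenceSource { OsPackageDb, InstalledMetadata, Linker, ShellAst, Attestation }
public enum EvidenceMethod { ExactPath, HashMatch, DeclaredFileList, DynamicLink, AstCall }

public sealed record EvidenceLocator(string Path, long? Offset = null, int? Line = null);

public sealed record Evidence(
    EvidenceSource Source,      // which analyzer produced the fact
    EvidenceLocator Locator,    // path / offset / line where it was observed
    string? Sha256,             // optional content hash backing the claim
    EvidenceMethod Method,      // mechanism used to prove it (never a confidence score)
    DateTimeOffset Timestamp);
```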

3.2 Catalog schema (Mongo)

  • artifacts

    { _id, type: layer-bom|image-bom|diff|index,
      format: cdx-json|cdx-pb|spdx-json,
      bytesSha256, size, rekor: { uuid,index,url }?,
      ttlClass, immutable, refCount, createdAt }
    
  • images { imageDigest, repo, tag?, arch, createdAt, lastSeen }

  • layers { layerDigest, mediaType, size, createdAt, lastSeen }

  • links { fromType, fromDigest, artifactId } // image/layer -> artifact

  • jobs { _id, kind, args, state, startedAt, heartbeatAt, endedAt, error }

  • lifecycleRules { ruleId, scope, ttlDays, retainIfReferenced, immutable }

3.3 Object store layout (RustFS)

layers/<sha256>/sbom.cdx.json.zst
layers/<sha256>/sbom.spdx.json.zst
images/<imgDigest>/inventory.cdx.pb            # CycloneDX Protobuf
images/<imgDigest>/usage.cdx.pb
indexes/<imgDigest>/bom-index.bin              # purls + roaring bitmaps
diffs/<old>_<new>/diff.json.zst
attest/<artifactSha256>.dsse.json              # DSSE bundle (cert chain + Rekor proof)

RustFS exposes a deterministic HTTP API (PUT|GET|DELETE /api/v1/buckets/{bucket}/objects/{key}). Scanner clients tag immutable uploads with X-RustFS-Immutable: true and, when retention applies, X-RustFS-Retain-Seconds: <ttlSeconds>. Additional headers can be injected via scanner.artifactStore.headers to support custom auth or proxy requirements. Legacy MinIO/S3 deployments remain supported by setting scanner.artifactStore.driver = "s3" during phased migrations.
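
A minimal client-side sketch of that upload path, assuming the endpoint shape and header names quoted above; the bucket, key, and class wrapper are examples, and auth/extra headers from scanner.artifactStore.headers are omitted.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

internal static class RustFsArtifactClient
{
    // Illustrative upload of an immutable, zstd-compressed layer SBOM to RustFS.
    public static async Task UploadImmutableAsync(HttpClient http, byte[] zstdBytes, string layerDigest, long retainSeconds)
    {
        var key = $"layers/{layerDigest}/sbom.cdx.json.zst";
        using var request = new HttpRequestMessage(HttpMethod.Put, $"/api/v1/buckets/stellaops/objects/{key}")
        {
            Content = new ByteArrayContent(zstdBytes)
        };

        request.Headers.Add("X-RustFS-Immutable", "true");                        // tag the object immutable
        request.Headers.Add("X-RustFS-Retain-Seconds", retainSeconds.ToString()); // retention window, when applicable

        using var response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();
    }
}
```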


4) REST API (Scanner.WebService)

All under /api/v1/scanner. Auth: OpTok (DPoP/mTLS); RBAC scopes.

POST /scans                        { imageRef|digest, force?:bool } → { scanId }
GET  /scans/{id}                   → { status, imageDigest, artifacts[], rekor? }
GET  /sboms/{imageDigest}          ?format=cdx-json|cdx-pb|spdx-json&view=inventory|usage → bytes
GET  /diff?old=<digest>&new=<digest>&view=inventory|usage → diff.json
POST /exports                      { imageDigest, format, view, attest?:bool } → { artifactId, rekor? }
POST /reports                      { imageDigest, policyRevision? } → { reportId, rekor? }   # delegates to backend policy+vex
GET  /catalog/artifacts/{id}       → { meta }
GET  /healthz | /readyz | /metrics

Report events

When scanner.events.enabled = true, the WebService serialises the signed report (canonical JSON + DSSE envelope) with NotifyCanonicalJsonSerializer and publishes two Redis Stream entries (scanner.report.ready, scanner.scan.completed) to the configured stream (default stella.events). The stream fields carry the whole envelope plus lightweight headers (kind, tenant, ts) so Notify and UI timelines can consume the event bus without recomputing signatures. Publish timeouts and bounded stream length are controlled via scanner:events:publishTimeoutSeconds and scanner:events:maxStreamLength. If the queue driver is already Redis and no explicit events DSN is provided, the host reuses the queue connection and auto-enables event emission so deployments get live envelopes without extra wiring. Compose/Helm bundles expose the same knobs via the SCANNER__EVENTS__* environment variables for quick tuning.
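
An illustrative configuration fragment for this event emission; only the enabled flag, the default stream name, publishTimeoutSeconds, and maxStreamLength are named above, and the remaining keys plus the sample values are assumptions.

```yaml
scanner:
  events:
    enabled: true
    dsn: "redis://queue:6379/0"    # assumed key; omit to reuse the Redis queue connection
    stream: "stella.events"        # default stream name
    publishTimeoutSeconds: 5
    maxStreamLength: 10000
```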


5) Execution flow (Worker)

5.1 Acquire & verify

  1. Resolve image (prefer repo@sha256:…).
  2. (Optional) verify image signature per policy (cosign).
  3. Pull blobs, compute layer digests; record metadata.

5.2 Layer union FS

  • Apply whiteouts; materialize final filesystem; map file → first introducing layer.
  • Windows layers (MSI/SxS/GAC) planned in M2.

5.3 Evidence harvest (parallel analyzers; deterministic only)

A) OS packages

  • apk: /lib/apk/db/installed
  • dpkg: /var/lib/dpkg/status, /var/lib/dpkg/info/*.list
  • rpm: /var/lib/rpm/Packages (via librpm or parser)
  • Record name, version (epoch/revision), arch, source package where present, and declared file lists.

Data flow note: Each OS analyzer now writes its canonical output into the shared ScanAnalysisStore under analysis.os.packages (raw results), analysis.os.fragments (per-analyzer layer fragments), and contributes to analysis.layers.fragments (the aggregated view consumed by emit/diff pipelines). Helpers in ScanAnalysisCompositionBuilder convert these fragments into SBOM composition requests and component graphs so the diff/emit stages no longer reach back into individual analyzer implementations.

B) Language ecosystems (installed state only)

  • Java: META-INF/maven/*/pom.properties, MANIFEST → pkg:maven/...
  • Node: node_modules/**/package.json → pkg:npm/...
  • Python: *.dist-info/{METADATA,RECORD} → pkg:pypi/...
  • Go: Go buildinfo in binaries → pkg:golang/...
  • .NET: *.deps.json + assembly metadata → pkg:nuget/...
  • Rust: crates only when explicitly present (embedded metadata or cargo/registry traces); otherwise binaries reported as bin:{sha256}.

Rule: We only report components proven on disk with authoritative metadata. Lockfiles are evidence only.

C) Native link graph

  • ELF: parse PT_INTERP, DT_NEEDED, RPATH/RUNPATH, GNU symbol versions; map SONAMEs to file paths; link executables → libs.
  • PE/MachO (planned M2): import table, delay imports; version resources; code signatures.
  • Map libs back to OS packages if possible (via file lists); else emit bin:{sha256} components.
  • The exported metadata (stellaops.os.* properties, license list, source package) feeds policy scoring and export pipelines directly: Policy evaluates quiet rules against package provenance, while Exporters forward the enriched fields into downstream JSON/Trivy payloads.

D) EntryTrace (ENTRYPOINT/CMD → terminal program)

  • Read image config; parse shell (POSIX/Bash subset) with AST: source/. includes; case/if; exec/command; run-parts.
  • Resolve commands via PATH within the built rootfs; follow language launchers (Java/Node/Python) to identify the terminal program (ELF/JAR/venv script).
  • Record file:line and choices for each hop; output chain graph.
  • Unresolvable dynamic constructs are recorded as unknown edges with reasons (e.g., $FOO unresolved).

E) Attestation & SBOM bind (optional)

  • For each file hash or binary hash, query local cache of Rekor v2 indices; if an SBOM attestation is found for exact hash, bind it to the component (origin=attested).
  • For the image digest, likewise bind SBOM attestations (build-time referrers).

5.4 Component normalization (exact only)

  • Create Component nodes only with deterministic identities: purl, or bin:{sha256} for unlabeled binaries.
  • Record origin (OS DB, installed metadata, linker, attestation).

5.5 SBOM assembly & emit

  • Per-layer SBOM fragments: components introduced by the layer (+ relationships).
  • Image SBOMs: merge fragments; refer back to them via CycloneDX BOMLink (or SPDX ExternalRef).
  • Emit both Inventory & Usage views.
  • When the native analyzer reports an ELF buildId, attach it to component metadata and surface it as stellaops:buildId in CycloneDX properties (and diff metadata). This keeps SBOM/diff output in lockstep with runtime events and the debug-store manifest.
  • Serialize CycloneDX JSON and CycloneDX Protobuf; optionally SPDX 3.0.1 JSON.
  • Build BOMIndex sidecar: purl table + roaring bitmap; flag usedByEntrypoint components for fast backend joins.

The emitted buildId metadata is preserved in component hashes, diff payloads, and /policy/runtime responses so operators can pivot from SBOM entries → runtime events → debug/.build-id/<aa>/<rest>.debug within the Offline Kit or release bundle.
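
For example, a component that carries an ELF buildId might surface it as a CycloneDX property like this (component and digest values are illustrative):

```json
{
  "type": "library",
  "name": "libssl",
  "version": "3.0.13",
  "purl": "pkg:deb/debian/libssl3@3.0.13-1",
  "properties": [
    { "name": "stellaops:buildId", "value": "aabbccddeeff00112233445566778899aabbccdd" }
  ]
}
```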

5.6 DSSE attestation (via Signer/Attestor)

  • WebService constructs predicate with image_digest, stellaops_version, license_id, policy_digest? (when emitting final reports), timestamps.
  • Calls Signer (requires OpTok + PoE); Signer verifies entitlement + scanner image integrity and returns DSSE bundle.
  • Attestor logs to Rekor v2; returns {uuid,index,proof} → stored in artifacts.rekor.

6) Three-way diff (image → layer → component)

6.1 Keys & classification

  • Component key: purl when present; else bin:{sha256}.
  • Diff classes: added, removed, version_changed (upgraded|downgraded), metadata_changed (e.g., origin from attestation vs observed).
  • Layer attribution: for each change, resolve the introducing/removing layer.

6.2 Algorithm (outline)

A = components(imageOld, key)
B = components(imageNew, key)

added   = B \ A
removed = A \ B
changed = { k in A∩B : version(A[k]) != version(B[k]) || origin changed }

for each item in added/removed/changed:
   layer = attribute_to_layer(item, imageOld|imageNew)
   usageFlag = usedByEntrypoint(item, imageNew)
emit diff.json (grouped by layer with badges)

Diffs are stored as artifacts and feed UI and CLI.
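
A condensed C# rendering of the outline above; the record and helper names are placeholders rather than the Scanner.Diff API, and layer attribution / usage flagging are left out.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative diff over component maps keyed by purl or bin:{sha256}.
public sealed record ComponentInfo(string Key, string? Version, string Origin);

public static class ComponentDiff
{
    public static (IReadOnlyList<ComponentInfo> Added,
                   IReadOnlyList<ComponentInfo> Removed,
                   IReadOnlyList<(ComponentInfo Old, ComponentInfo New)> Changed)
        Compute(IReadOnlyDictionary<string, ComponentInfo> a, IReadOnlyDictionary<string, ComponentInfo> b)
    {
        var added   = b.Keys.Except(a.Keys).Select(k => b[k]).ToList();
        var removed = a.Keys.Except(b.Keys).Select(k => a[k]).ToList();
        var changed = a.Keys.Intersect(b.Keys)
            .Where(k => a[k].Version != b[k].Version || a[k].Origin != b[k].Origin)  // version_changed or metadata_changed
            .Select(k => (Old: a[k], New: b[k]))
            .ToList();
        return (added, removed, changed);
    }
}
```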


7) Build-time SBOMs (fast CI path)

Scanner.Sbomer.BuildXPlugin can act as a BuildKit generator:

  • During docker buildx build --attest=type=sbom,generator=stellaops/sbom-indexer, run analyzers on the build context/output; attach SBOMs as OCI referrers to the built image.
  • Optionally request Signer/Attestor to produce a StellaOps-verified attestation immediately; otherwise, Scanner.WebService can verify and re-attest post-push.
  • Scanner.WebService trusts build-time SBOMs per policy, enabling no-rescan for unchanged bases.

8) Configuration (YAML)

scanner:
  queue:
    kind: redis
    url: "redis://queue:6379/0"
  mongo:
    uri: "mongodb://mongo/scanner"
  s3:
    endpoint: "http://minio:9000"
    bucket: "stellaops"
    objectLock: "governance"   # or 'compliance'
  analyzers:
    os: { apk: true, dpkg: true, rpm: true }
    lang: { java: true, node: true, python: true, go: true, dotnet: true, rust: true }
    native: { elf: true, pe: false, macho: false }    # PE/Mach-O in M2
    entryTrace: { enabled: true, shellMaxDepth: 64, followRunParts: true }
  emit:
    cdx: { json: true, protobuf: true }
    spdx: { json: true }
    compress: "zstd"
  rekor:
    url: "https://rekor-v2.internal"
  signer:
    url: "https://signer.internal"
  limits:
    maxParallel: 8
    perRegistryConcurrency: 2
  policyHints:
    verifyImageSignature: false
    trustBuildTimeSboms: true

9) Scale & performance

  • Parallelism: per-analyzer concurrency; bounded directory walkers; file CAS dedupe by sha256.

  • Distributed locks per layer digest to prevent duplicate work across Workers.

  • Registry throttles: per-host concurrency budgets; exponential backoff on 429/5xx.

  • Targets:

    • Build-time: P95 ≤35s on warmed bases (CI generator).
    • Post-build delta: P95 ≤10s for 200MB images with cache hit.
    • Emit: CycloneDX Protobuf ≤150ms for 5k components; JSON ≤500ms.
    • Diff: ≤200ms for 5k vs 5k components.

10) Security posture

  • AuthN: Authority-issued short OpToks (DPoP/mTLS).
  • AuthZ: scopes (scanner.scan, scanner.export, scanner.catalog.read).
  • mTLS to Signer/Attestor; only Signer can sign.
  • No network fetches during analysis (except registry pulls and optional Rekor index reads).
  • Sandboxing: non-root containers; read-only FS; seccomp profiles; disable execution of scanned content.
  • Release integrity: all first-party images are cosign-signed; Workers/WebService self-verify at startup.

11) Observability & audit

  • Metrics:

    • scanner.jobs_inflight, scanner.scan_latency_seconds
    • scanner.layer_cache_hits_total, scanner.file_cas_hits_total
    • scanner.artifact_bytes_total{format}
    • scanner.attestation_latency_seconds, scanner.rekor_failures_total
    • scanner_analyzer_golang_heuristic_total{indicator,version_hint} — increments whenever the Go analyzer falls back to heuristics (build-id or runtime markers). Grafana panel: sum by (indicator) (rate(scanner_analyzer_golang_heuristic_total[5m])); alert when the rate is ≥ 1 for 15 minutes to highlight unexpected stripped binaries.
  • Tracing: spans for acquire→union→analyzers→compose→emit→sign→log.

  • Audit logs: DSSE requests log license_id, image_digest, artifactSha256, policy_digest?, Rekor UUID on success.


12) Testing matrix

  • Determinism: given same image + analyzers → byte-identical CDX Protobuf; JSON normalized.
  • OS packages: ground-truth images per distro; compare to package DB.
  • Lang ecosystems: sample images per ecosystem (Java/Node/Python/Go/.NET/Rust) with installed metadata; negative tests w/ lockfile-only.
  • Native & EntryTrace: ELF graph correctness; shell AST cases (includes, run-parts, exec, case/if).
  • Diff: layer attribution against synthetic two-image sequences.
  • Performance: cold vs warm cache; large node_modules and site-packages.
  • Security: ensure no code execution from image; fuzz parser inputs; path traversal resistance on layer extract.

13) Failure modes & degradations

  • Missing OS DB (files exist, DB removed): record files; do not fabricate package components; emit bin:{sha256} where unavoidable; flag in evidence.
  • Unreadable metadata (corrupt dist-info): record file evidence; skip component creation; annotate.
  • Dynamic shell constructs: mark unresolved edges with reasons (env var unknown) and continue; Usage view may be partial.
  • Registry rate limits: honor backoff; queue job retries with jitter.
  • Signer refusal (license/plan/version): scan completes; artifact produced; no attestation; WebService marks result as unverified.

14) Optional plugins (off by default)

  • Patch-presence detector (signature-based backport checks). Reads curated function-level signatures from advisories; inspects binaries for patched code snippets to lower false positives for backported fixes. Runs as a sidecar analyzer that annotates components; never overrides core identities.
  • Runtime probes (with Zastava): when allowed, compare /proc/<pid>/maps (DSOs actually loaded) with the static Usage view for precision.

15) DevOps & operations

  • HA: WebService horizontal scale; Workers autoscale by queue depth & CPU; distributed locks on layers.
  • Retention: ILM rules per artifact class (short, default, compliance); Object Lock for compliance artifacts (reports, signed SBOMs).
  • Upgrades: bump cache schema when analyzer outputs change; WebService triggers refresh of dependent artifacts.
  • Backups: Mongo (daily dumps); RustFS snapshots (filesystem-level rsync/ZFS) or S3 versioning when legacy driver enabled; Rekor v2 DB snapshots.

16) CLI & UI touch points

  • CLI: stellaops scan <ref>, stellaops diff --old --new, stellaops export, stellaops verify attestation <bundle|url>.
  • UI: Scan detail shows Inventory/Usage toggles, Diff by Layer, Attestation badge (verified/unverified), Rekor link, and EntryTrace chain with file:line breadcrumbs.

17) Roadmap (Scanner)

  • M2: Windows containers (MSI/SxS/GAC analyzers), PE/MachO native analyzer, deeper Rust metadata.
  • M2: Buildx generator GA (certified external registries), cross-registry trust policies.
  • M3: Patch-presence plugin GA (opt-in), cross-image corpus clustering (evidence-only; not identity).
  • M3: Advanced EntryTrace (POSIX shell features breadth, busybox detection).

Appendix A — EntryTrace resolution (pseudo)

ResolveEntrypoint(ImageConfig cfg, RootFs fs):
  cmd = Normalize(cfg.ENTRYPOINT, cfg.CMD)
  stack = [ Script(cmd, path=FindOnPath(cmd[0], fs)) ]
  visited = set()
  depth = 0

  while stack not empty and depth < MAX:
    depth += 1
    cur = stack.pop()
    if cur in visited: continue
    visited.add(cur)

    if IsShellScript(cur.path):
       ast = ParseShell(cur.path)
       foreach directive in ast:
         if directive is Source include:
            p = ResolveInclude(include.path, cur.env, fs)
            stack.push(Script(p))
         if directive is Exec call:
            p = ResolveExec(call.argv[0], cur.env, fs)
            stack.push(Program(p, argv=call.argv))
         if directive is Interpreter (python -m / node / java -jar):
            term = ResolveInterpreterTarget(call, fs)
            stack.push(Program(term))
    else:
       return Terminal(cur.path)

  return Unknown(reason)

Appendix A.1 — EntryTrace Explainability

EntryTrace emits structured diagnostics and metrics so operators can quickly understand why resolution succeeded or degraded:

| Reason | Description | Typical Mitigation |
|---|---|---|
| CommandNotFound | A command referenced in the script cannot be located in the layered root filesystem or PATH. | Ensure binaries exist in the image or extend PATH hints. |
| MissingFile | source/./run-parts targets are missing. | Bundle the script or guard the include. |
| DynamicEnvironmentReference | Path depends on $VARS that are unknown at scan time. | Provide defaults via scan metadata or accept partial usage. |
| RecursionLimitReached | Nested includes exceeded the analyzer depth limit (default 64). | Flatten indirection or increase the limit in options. |
| RunPartsEmpty | run-parts directory contained no executable entries. | Remove empty directories or ignore if intentional. |
| JarNotFound / ModuleNotFound | Java/Python targets missing, preventing interpreter tracing. | Ship the jar/module with the image or adjust the launcher. |

Diagnostics drive two metrics published by EntryTraceMetrics:

  • entrytrace_resolutions_total{outcome} — resolution attempts segmented by outcome (resolved, partially-resolved, unresolved).
  • entrytrace_unresolved_total{reason} — diagnostic counts keyed by reason.

Structured logs include entrytrace.path, entrytrace.command, entrytrace.reason, and entrytrace.depth, all correlated with scan/job IDs. Timestamps are normalized to UTC (microsecond precision) to keep DSSE attestations and UI traces explainable.
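
As an illustration of how these counters can be emitted with System.Diagnostics.Metrics (the meter name and helper methods here are assumptions, not the actual EntryTraceMetrics implementation):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative emitter for the two EntryTrace counters described above.
public sealed class EntryTraceMetricsSketch
{
    private static readonly Meter ScannerMeter = new("StellaOps.Scanner.EntryTrace");

    private static readonly Counter<long> Resolutions =
        ScannerMeter.CreateCounter<long>("entrytrace_resolutions_total");

    private static readonly Counter<long> Unresolved =
        ScannerMeter.CreateCounter<long>("entrytrace_unresolved_total");

    public void RecordResolution(string outcome) =>   // "resolved" | "partially-resolved" | "unresolved"
        Resolutions.Add(1, new KeyValuePair<string, object?>("outcome", outcome));

    public void RecordUnresolved(string reason) =>    // e.g. "DynamicEnvironmentReference"
        Unresolved.Add(1, new KeyValuePair<string, object?>("reason", reason));
}
```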

Appendix B — BOMIndex sidecar

struct Header { magic, version, imageDigest, createdAt }
vector<string> purls
map<purlIndex, roaring_bitmap> components
optional map<purlIndex, roaring_bitmap> usedByEntrypoint
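
A hypothetical reader-side lookup against this sidecar could look like the sketch below; the on-disk format uses roaring bitmaps, but a plain HashSet<int> of component IDs stands in here so the example stays self-contained.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative in-memory model of the BOM-Index sidecar lookup.
public sealed class BomIndexSketch
{
    private readonly Dictionary<string, int> _purlToIndex;
    private readonly Dictionary<int, HashSet<int>> _components;        // purl index -> component IDs
    private readonly Dictionary<int, HashSet<int>>? _usedByEntrypoint; // optional usage bitmaps

    public BomIndexSketch(IEnumerable<string> purls,
                          Dictionary<int, HashSet<int>> components,
                          Dictionary<int, HashSet<int>>? usedByEntrypoint = null)
    {
        _purlToIndex = purls.Select((purl, index) => (purl, index))
                            .ToDictionary(x => x.purl, x => x.index);
        _components = components;
        _usedByEntrypoint = usedByEntrypoint;
    }

    // Fast backend join: which components match a purl, and are any on the entrypoint closure?
    public (IReadOnlyCollection<int> ComponentIds, bool UsedByEntrypoint) Lookup(string purl)
    {
        if (!_purlToIndex.TryGetValue(purl, out var idx) || !_components.TryGetValue(idx, out var ids))
            return (System.Array.Empty<int>(), false);

        var used = _usedByEntrypoint != null
                   && _usedByEntrypoint.TryGetValue(idx, out var usage)
                   && usage.Overlaps(ids);
        return (ids, used);
    }
}
```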