Refactor code structure for improved readability and maintainability
Some checks failed
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Docs CI / lint-and-preview (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled

This commit is contained in:
StellaOps Bot
2025-12-06 21:48:12 +02:00
parent f6c22854a4
commit dd0067ea0b
105 changed files with 12662 additions and 427 deletions

View File

@@ -1,6 +1,6 @@
# component_architecture_scanner.md — **StellaOps Scanner** (2025Q4)
> Aligned with Epic6 Vulnerability Explorer and Epic10 Export Center.
# component_architecture_scanner.md — **StellaOps Scanner** (2025Q4)
> Aligned with Epic6 Vulnerability Explorer and Epic10 Export Center.
> **Scope.** Implementationready architecture for the **Scanner** subsystem: WebService, Workers, analyzers, SBOM assembly (inventory & usage), perlayer caching, threeway diffs, artifact catalog (RustFS default + Mongo, S3-compatible fallback), attestation handoff, and scale/security posture. This document is the contract between the scanning plane and everything else (Policy, Excititor, Concelier, UI, CLI).
@@ -30,31 +30,31 @@ src/
├─ StellaOps.Scanner.Cache/ # layer cache; file CAS; bloom/bitmap indexes
├─ StellaOps.Scanner.EntryTrace/ # ENTRYPOINT/CMD → terminal program resolver (shell AST)
├─ StellaOps.Scanner.Analyzers.OS.[Apk|Dpkg|Rpm]/
├─ StellaOps.Scanner.Analyzers.Lang.[Java|Node|Python|Go|DotNet|Rust]/
├─ StellaOps.Scanner.Analyzers.Native.[ELF|PE|MachO]/ # PE/Mach-O planned (M2)
├─ StellaOps.Scanner.Symbols.Native/ # NEW native symbol reader/demangler (Sprint 401)
├─ StellaOps.Scanner.CallGraph.Native/ # NEW function/call-edge builder + CAS emitter
├─ StellaOps.Scanner.Analyzers.Lang.[Java|Node|Bun|Python|Go|DotNet|Rust|Ruby|Php]/
├─ StellaOps.Scanner.Analyzers.Native.[ELF|PE|MachO]/ # PE/Mach-O planned (M2)
├─ StellaOps.Scanner.Symbols.Native/ # NEW native symbol reader/demangler (Sprint 401)
├─ StellaOps.Scanner.CallGraph.Native/ # NEW function/call-edge builder + CAS emitter
├─ StellaOps.Scanner.Emit.CDX/ # CycloneDX (JSON + Protobuf)
├─ StellaOps.Scanner.Emit.SPDX/ # SPDX 3.0.1 JSON
├─ StellaOps.Scanner.Diff/ # image→layer→component threeway diff
├─ StellaOps.Scanner.Index/ # BOMIndex sidecar (purls + roaring bitmaps)
├─ StellaOps.Scanner.Tests.* # unit/integration/e2e fixtures
└─ Tools/
├─ StellaOps.Scanner.Sbomer.BuildXPlugin/ # BuildKit generator (image referrer SBOMs)
└─ StellaOps.Scanner.Sbomer.DockerImage/ # CLIdriven scanner container
└─ Tools/
├─ StellaOps.Scanner.Sbomer.BuildXPlugin/ # BuildKit generator (image referrer SBOMs)
└─ StellaOps.Scanner.Sbomer.DockerImage/ # CLIdriven scanner container
```
Analyzer assemblies and buildx generators are packaged as **restart-time plug-ins** under `plugins/scanner/**` with manifests; services must restart to activate new plug-ins.
### 1.2 Native reachability upgrades (Nov 2026)
- **Stripped-binary pipeline**: native analyzers must recover functions even without symbols (prolog patterns, xrefs, PLT/GOT, vtables). Emit a tool-agnostic neutral JSON (NJIF) with functions, CFG/CG, and evidence tags. Keep heuristics deterministic and record toolchain hashes in the scan manifest.
- **Synthetic roots**: treat `.preinit_array`, `.init_array`, legacy `.ctors`, and `_init` as graph entrypoints; add roots for constructors in each `DT_NEEDED` dependency. Tag edges from these roots with `phase=load` for explainers.
- **Build-id capture**: read `.note.gnu.build-id` for every ELF, store hex build-id alongside soname/path, propagate into `SymbolID`/`code_id`, and expose it to SBOM + runtime joiners. If missing, fall back to file hash and mark source accordingly.
- **PURL-resolved edges**: annotate call edges with the callee purl and `symbol_digest` so graphs merge with SBOM components. See `docs/reachability/purl-resolved-edges.md` for schema rules and acceptance tests.
- **Unknowns emission**: when symbol → purl mapping or edge targets remain unresolved, emit structured Unknowns to Signals (see `docs/signals/unknowns-registry.md`) instead of dropping evidence.
- **Hybrid attestation**: emit **graph-level DSSE** for every `richgraph-v1` (mandatory) and optional **edge-bundle DSSE** (≤512 edges) for runtime/init-root/contested edges or third-party provenance. Publish graph DSSE digests to Rekor by default; edge-bundle Rekor publish is policy-driven. CAS layout: `cas://reachability/graphs/{blake3}` for graph body, `.../{blake3}.dsse` for envelope, and `cas://reachability/edges/{graph_hash}/{bundle_id}[.dsse]` for bundles. Deterministic ordering before hashing/signing is required.
- **Deterministic call-graph manifest**: capture analyzer versions, feed hashes, toolchain digests, and flags in a manifest stored alongside `richgraph-v1`; replaying with the same manifest MUST yield identical node/edge sets and hashes (see `docs/reachability/lead.md`).
Analyzer assemblies and buildx generators are packaged as **restart-time plug-ins** under `plugins/scanner/**` with manifests; services must restart to activate new plug-ins.
### 1.2 Native reachability upgrades (Nov 2026)
- **Stripped-binary pipeline**: native analyzers must recover functions even without symbols (prolog patterns, xrefs, PLT/GOT, vtables). Emit a tool-agnostic neutral JSON (NJIF) with functions, CFG/CG, and evidence tags. Keep heuristics deterministic and record toolchain hashes in the scan manifest.
- **Synthetic roots**: treat `.preinit_array`, `.init_array`, legacy `.ctors`, and `_init` as graph entrypoints; add roots for constructors in each `DT_NEEDED` dependency. Tag edges from these roots with `phase=load` for explainers.
- **Build-id capture**: read `.note.gnu.build-id` for every ELF, store hex build-id alongside soname/path, propagate into `SymbolID`/`code_id`, and expose it to SBOM + runtime joiners. If missing, fall back to file hash and mark source accordingly.
- **PURL-resolved edges**: annotate call edges with the callee purl and `symbol_digest` so graphs merge with SBOM components. See `docs/reachability/purl-resolved-edges.md` for schema rules and acceptance tests.
- **Unknowns emission**: when symbol → purl mapping or edge targets remain unresolved, emit structured Unknowns to Signals (see `docs/signals/unknowns-registry.md`) instead of dropping evidence.
- **Hybrid attestation**: emit **graph-level DSSE** for every `richgraph-v1` (mandatory) and optional **edge-bundle DSSE** (≤512 edges) for runtime/init-root/contested edges or third-party provenance. Publish graph DSSE digests to Rekor by default; edge-bundle Rekor publish is policy-driven. CAS layout: `cas://reachability/graphs/{blake3}` for graph body, `.../{blake3}.dsse` for envelope, and `cas://reachability/edges/{graph_hash}/{bundle_id}[.dsse]` for bundles. Deterministic ordering before hashing/signing is required.
- **Deterministic call-graph manifest**: capture analyzer versions, feed hashes, toolchain digests, and flags in a manifest stored alongside `richgraph-v1`; replaying with the same manifest MUST yield identical node/edge sets and hashes (see `docs/reachability/lead.md`).
### 1.1 Queue backbone (Redis / NATS)
@@ -144,9 +144,10 @@ No confidences. Either a fact is proven with listed mechanisms, or it is not cla
* `images { imageDigest, repo, tag?, arch, createdAt, lastSeen }`
* `layers { layerDigest, mediaType, size, createdAt, lastSeen }`
* `links { fromType, fromDigest, artifactId }` // image/layer -> artifact
* `jobs { _id, kind, args, state, startedAt, heartbeatAt, endedAt, error }`
* `lifecycleRules { ruleId, scope, ttlDays, retainIfReferenced, immutable }`
* `ruby.packages { _id: scanId, imageDigest, generatedAtUtc, packages[] }` // decoded `RubyPackageInventory` documents for CLI/Policy reuse
* `jobs { _id, kind, args, state, startedAt, heartbeatAt, endedAt, error }`
* `lifecycleRules { ruleId, scope, ttlDays, retainIfReferenced, immutable }`
* `ruby.packages { _id: scanId, imageDigest, generatedAtUtc, packages[] }` // decoded `RubyPackageInventory` documents for CLI/Policy reuse
* `bun.packages { _id: scanId, imageDigest, generatedAtUtc, packages[] }` // decoded `BunPackageInventory` documents for CLI/Policy reuse
### 3.3 Object store layout (RustFS)
@@ -175,10 +176,11 @@ All under `/api/v1/scanner`. Auth: **OpTok** (DPoP/mTLS); RBAC scopes.
```
POST /scans { imageRef|digest, force?:bool } → { scanId }
GET /scans/{id} → { status, imageDigest, artifacts[], rekor? }
GET /sboms/{imageDigest} ?format=cdx-json|cdx-pb|spdx-json&view=inventory|usage → bytes
GET /scans/{id}/ruby-packages → { scanId, imageDigest, generatedAt, packages[] }
GET /diff?old=<digest>&new=<digest>&view=inventory|usage → diff.json
GET /scans/{id} → { status, imageDigest, artifacts[], rekor? }
GET /sboms/{imageDigest} ?format=cdx-json|cdx-pb|spdx-json&view=inventory|usage → bytes
GET /scans/{id}/ruby-packages → { scanId, imageDigest, generatedAt, packages[] }
GET /scans/{id}/bun-packages → { scanId, imageDigest, generatedAt, packages[] }
GET /diff?old=<digest>&new=<digest>&view=inventory|usage → diff.json
POST /exports { imageDigest, format, view, attest?:bool } → { artifactId, rekor? }
POST /reports { imageDigest, policyRevision? } → { reportId, rekor? } # delegates to backend policy+vex
GET /catalog/artifacts/{id} → { meta }
@@ -223,6 +225,7 @@ When `scanner.events.enabled = true`, the WebService serialises the signed repor
* **Java**: `META-INF/maven/*/pom.properties`, MANIFEST → `pkg:maven/...`
* **Node**: `node_modules/**/package.json` → `pkg:npm/...`
* **Bun**: `bun.lock` (JSONC text) + `node_modules/**/package.json` + `node_modules/.bun/**/package.json` (isolated linker) → `pkg:npm/...`; `bun.lockb` (binary) emits remediation guidance
* **Python**: `*.dist-info/{METADATA,RECORD}` → `pkg:pypi/...`
* **Go**: Go **buildinfo** in binaries → `pkg:golang/...`
* **.NET**: `*.deps.json` + assembly metadata → `pkg:nuget/...`
@@ -230,18 +233,18 @@ When `scanner.events.enabled = true`, the WebService serialises the signed repor
> **Rule:** We only report components proven **on disk** with authoritative metadata. Lockfiles are evidence only.
**C) Native link graph**
* **ELF**: parse `PT_INTERP`, `DT_NEEDED`, RPATH/RUNPATH, **GNU symbol versions**; map **SONAMEs** to file paths; link executables → libs.
* **PE/MachO** (planned M2): import table, delayimports; version resources; code signatures.
* Map libs back to **OS packages** if possible (via file lists); else emit `bin:{sha256}` components.
* The exported metadata (`stellaops.os.*` properties, license list, source package) feeds policy scoring and export pipelines
directly Policy evaluates quiet rules against package provenance while Exporters forward the enriched fields into
downstream JSON/Trivy payloads.
* **Reachability lattice**: analyzers + runtime probes emit `Evidence`/`Mitigation` records (see `docs/reachability/lattice.md`). The lattice engine joins static path evidence, runtime hits (EventPipe/JFR), taint flows, environment gates, and mitigations into `ReachDecision` documents that feed VEX gating and event graph storage.
* Sprint401 introduces `StellaOps.Scanner.Symbols.Native` (DWARF/PDB reader + demangler) and `StellaOps.Scanner.CallGraph.Native`
(function boundary detector + call-edge builder). These libraries feed `FuncNode`/`CallEdge` CAS bundles and enrich reachability
graphs with `{code_id, confidence, evidence}` so Signals/Policy/UI can cite function-level justifications.
**C) Native link graph**
* **ELF**: parse `PT_INTERP`, `DT_NEEDED`, RPATH/RUNPATH, **GNU symbol versions**; map **SONAMEs** to file paths; link executables → libs.
* **PE/MachO** (planned M2): import table, delayimports; version resources; code signatures.
* Map libs back to **OS packages** if possible (via file lists); else emit `bin:{sha256}` components.
* The exported metadata (`stellaops.os.*` properties, license list, source package) feeds policy scoring and export pipelines
directly Policy evaluates quiet rules against package provenance while Exporters forward the enriched fields into
downstream JSON/Trivy payloads.
* **Reachability lattice**: analyzers + runtime probes emit `Evidence`/`Mitigation` records (see `docs/reachability/lattice.md`). The lattice engine joins static path evidence, runtime hits (EventPipe/JFR), taint flows, environment gates, and mitigations into `ReachDecision` documents that feed VEX gating and event graph storage.
* Sprint401 introduces `StellaOps.Scanner.Symbols.Native` (DWARF/PDB reader + demangler) and `StellaOps.Scanner.CallGraph.Native`
(function boundary detector + call-edge builder). These libraries feed `FuncNode`/`CallEdge` CAS bundles and enrich reachability
graphs with `{code_id, confidence, evidence}` so Signals/Policy/UI can cite function-level justifications.
**D) EntryTrace (ENTRYPOINT/CMD → terminal program)**
@@ -273,10 +276,10 @@ The emitted `buildId` metadata is preserved in component hashes, diff payloads,
### 5.6 DSSE attestation (via Signer/Attestor)
* WebService constructs **predicate** with `image_digest`, `stellaops_version`, `license_id`, `policy_digest?` (when emitting **final reports**), timestamps.
* Calls **Signer** (requires **OpTok + PoE**); Signer verifies **entitlement + scanner image integrity** and returns **DSSE bundle**.
* **Attestor** logs to **Rekor v2**; returns `{uuid,index,proof}` → stored in `artifacts.rekor`.
* Operator enablement runbooks (toggles, env-var map, rollout guidance) live in [`operations/dsse-rekor-operator-guide.md`](operations/dsse-rekor-operator-guide.md) per SCANNER-ENG-0015.
* WebService constructs **predicate** with `image_digest`, `stellaops_version`, `license_id`, `policy_digest?` (when emitting **final reports**), timestamps.
* Calls **Signer** (requires **OpTok + PoE**); Signer verifies **entitlement + scanner image integrity** and returns **DSSE bundle**.
* **Attestor** logs to **Rekor v2**; returns `{uuid,index,proof}` → stored in `artifacts.rekor`.
* Operator enablement runbooks (toggles, env-var map, rollout guidance) live in [`operations/dsse-rekor-operator-guide.md`](operations/dsse-rekor-operator-guide.md) per SCANNER-ENG-0015.
---
@@ -333,7 +336,7 @@ scanner:
objectLock: "governance" # or 'compliance'
analyzers:
os: { apk: true, dpkg: true, rpm: true }
lang: { java: true, node: true, python: true, go: true, dotnet: true, rust: true }
lang: { java: true, node: true, bun: true, python: true, go: true, dotnet: true, rust: true, ruby: true, php: true }
native: { elf: true, pe: false, macho: false } # PE/Mach-O in M2
entryTrace: { enabled: true, shellMaxDepth: 64, followRunParts: true }
emit:
@@ -478,17 +481,17 @@ ResolveEntrypoint(ImageConfig cfg, RootFs fs):
return Unknown(reason)
```
### Appendix A.1 — EntryTrace Explainability
### Appendix A.0 — Replay / Record mode
- WebService ships a **RecordModeService** that assembles replay manifests (schema v1) with policy/feed/tool pins and reachability references, then writes deterministic input/output bundles to the configured object store (RustFS default, S3/Minio fallback) under `replay/<head>/<digest>.tar.zst`.
- Bundles contain canonical manifest JSON plus inputs (policy/feed/tool/analyzer digests) and outputs (SBOM, findings, optional VEX/logs); CAS URIs follow `cas://replay/...` and are attached to scan snapshots as `ReplayArtifacts`.
- Reachability graphs/traces are folded into the manifest via `ReachabilityReplayWriter`; manifests and bundles hash with stable ordering for replay verification (`docs/replay/DETERMINISTIC_REPLAY.md`).
- Worker sealed-mode intake reads `replay.bundle.uri` + `replay.bundle.sha256` (plus determinism feed/policy pins) from job metadata, persists bundle refs in analysis and surface manifest, and validates hashes before use.
- Deterministic execution switches (`docs/modules/scanner/deterministic-execution.md`) must be enabled when generating replay bundles to keep hashes stable.
EntryTrace emits structured diagnostics and metrics so operators can quickly understand why resolution succeeded or degraded:
### Appendix A.1 — EntryTrace Explainability
### Appendix A.0 — Replay / Record mode
- WebService ships a **RecordModeService** that assembles replay manifests (schema v1) with policy/feed/tool pins and reachability references, then writes deterministic input/output bundles to the configured object store (RustFS default, S3/Minio fallback) under `replay/<head>/<digest>.tar.zst`.
- Bundles contain canonical manifest JSON plus inputs (policy/feed/tool/analyzer digests) and outputs (SBOM, findings, optional VEX/logs); CAS URIs follow `cas://replay/...` and are attached to scan snapshots as `ReplayArtifacts`.
- Reachability graphs/traces are folded into the manifest via `ReachabilityReplayWriter`; manifests and bundles hash with stable ordering for replay verification (`docs/replay/DETERMINISTIC_REPLAY.md`).
- Worker sealed-mode intake reads `replay.bundle.uri` + `replay.bundle.sha256` (plus determinism feed/policy pins) from job metadata, persists bundle refs in analysis and surface manifest, and validates hashes before use.
- Deterministic execution switches (`docs/modules/scanner/deterministic-execution.md`) must be enabled when generating replay bundles to keep hashes stable.
EntryTrace emits structured diagnostics and metrics so operators can quickly understand why resolution succeeded or degraded:
| Reason | Description | Typical Mitigation |
|--------|-------------|--------------------|

View File

@@ -0,0 +1,146 @@
# Bun Analyzer Developer Gotchas
This document covers common pitfalls and considerations when working with the Bun analyzer.
## 1. Isolated Installs Are Symlink-Heavy
Bun's isolated linker (`bun install --linker isolated`) creates a flat store under `node_modules/.bun/` with symlinks for package resolution. This differs from the default hoisted layout.
**Implications:**
- The analyzer must traverse `node_modules/.bun/**/package.json` in addition to `node_modules/**/package.json`
- Symlink safety guards are critical to prevent infinite loops and out-of-root traversal
- Both logical and real paths are recorded in evidence for traceability
- Performance guards (`MaxSymlinkDepth=10`, `MaxFilesPerRoot=50000`) are enforced
**Testing:**
- Use the `IsolatedLinkerInstallIsParsedAsync` test fixture to verify `.bun/` traversal
- Use the `SymlinkSafetyIsEnforcedAsync` test fixture for symlink corner cases
## 2. `node_modules/.bun/` Scanning Requirement
Unlike Node.js, Bun may store packages entirely under `node_modules/.bun/` with only symlinks in the top-level `node_modules/`. If your scanner configuration excludes `.bun/` directories, you will miss dependencies.
**Checklist:**
- Ensure glob patterns include `.bun/` subdirectories
- Do not filter out hidden directories in container scans
- Verify evidence shows packages from both `node_modules/` and `node_modules/.bun/`
## 3. `bun.lockb` Migration Path
The binary lockfile (`bun.lockb`) format is undocumented and unstable. The analyzer treats it as **unsupported** and emits a remediation finding.
**Migration command:**
```bash
bun install --save-text-lockfile
```
This generates `bun.lock` (JSONC text format) which the analyzer can parse.
**WebService response:** When only `bun.lockb` is present:
- The scan completes but reports unsupported status
- Remediation guidance is included in findings
- No package inventory is generated
## 4. JSONC Lockfile Format
`bun.lock` uses JSONC (JSON with Comments) format supporting:
- Single-line comments (`// ...`)
- Multi-line comments (`/* ... */`)
- Trailing commas in arrays and objects
**Parser considerations:**
- The `BunLockParser` tolerates these JSONC features
- Standard JSON parsers will fail on `bun.lock` files
- Format may evolve with Bun releases; parser is intentionally tolerant
## 5. Multi-Stage Build Implications
In multi-stage Docker builds, the final image may contain only production artifacts without the lockfile or `node_modules/.bun/` directory.
**Scanning strategies:**
1. **Image scanning (recommended for production):** Scans the final image filesystem. Set `include_dev: false` to filter dev dependencies
2. **Repository scanning:** Scans `bun.lock` from source. Includes all dependencies by default (`include_dev: true`)
**Best practice:** Scan both the repository (for complete visibility) and production images (for runtime accuracy).
## 6. npm Ecosystem Reuse
Bun packages are npm packages. The analyzer:
- Emits `pkg:npm/<name>@<version>` PURLs (same as Node analyzer)
- Uses `ecosystem = npm` for vulnerability lookups
- Adds `package_manager = bun` metadata for differentiation
This means:
- Vulnerability intelligence is shared with Node analyzer
- VEX statements for npm packages apply to Bun
- No separate Bun-specific advisory database is needed
## 7. Source Detection in Lockfile
`bun.lock` entries include source information that determines package type:
| Source Pattern | Type | Example |
|---------------|------|---------|
| No source / default registry | `registry` | `lodash@4.17.21` |
| `git+https://...` or `git://...` | `git` | VCS dependency |
| `file:` or `link:` | `tarball` | Local package |
| `workspace:` | `workspace` | Monorepo member |
The analyzer records source type in evidence for provenance tracking.
## 8. Workspace/Monorepo Handling
Bun workspaces use a single `bun.lock` at the root with multiple `package.json` files in subdirectories.
**Analyzer behavior:**
- Discovers the root by presence of `bun.lock` + `package.json`
- Traverses all `node_modules/` directories under the root
- Deduplicates packages by `(name, version)` while accumulating occurrence paths
- Records workspace member paths in metadata
**Testing:** Use the `WorkspacesAreParsedAsync` test fixture.
## 9. Dev/Prod Dependency Filtering
The `include_dev` configuration option controls whether dev dependencies are included:
| Context | Default `include_dev` | Rationale |
|---------|----------------------|-----------|
| Repository scan (lockfile-only) | `true` | Full visibility for developers |
| Image scan (installed packages) | `true` | Packages are present regardless of intent |
**Override:** Set `include_dev: false` in scan configuration to exclude dev dependencies from results.
## 10. Evidence Model
Each Bun package includes evidence with:
- `source`: Where the package was found (`node_modules`, `bun.lock`, `node_modules/.bun`)
- `locator`: File path to the evidence
- `resolved`: The resolved URL from lockfile (if available)
- `integrity`: SHA hash from lockfile (if available)
- `sha256`: File hash for installed packages
Evidence enables:
- Tracing packages to their origin
- Validating integrity
- Explaining presence in SBOM
## CLI Reference
### Inspect local workspace
```bash
stellaops-cli bun inspect --root /path/to/project
```
### Resolve packages from scan
```bash
stellaops-cli bun resolve --scan-id <id>
stellaops-cli bun resolve --digest sha256:<hash>
stellaops-cli bun resolve --ref myregistry.io/myapp:latest
```
### Output formats
```bash
stellaops-cli bun inspect --format json > packages.json
stellaops-cli bun inspect --format table
```