feat: Update analyzer fixtures and metadata for improved license handling and provenance tracking
- Added license expressions and provenance fields to expected JSON outputs for .NET and Rust analyzers.
- Introduced new .nuspec files for StellaOps.Runtime.SelfContained and StellaOps.Toolkit packages, including license information.
- Created LICENSE.txt files for both toolkit packages with clear licensing terms.
- Updated expected JSON for signed and simple analyzers to include license information and provenance.
- Enhanced the SPRINTS_LANG_IMPLEMENTATION_PLAN.md with detailed progress and future sprint outlines, ensuring clarity on deliverables and acceptance metrics.
		| @@ -1,43 +1,43 @@ | ||||
| # StellaOps Scanner — Language Analyzer Implementation Plan (2025Q4) | ||||
|  | ||||
| > **Goal.** Deliver best-in-class language analyzers that outperform competitors on fidelity, determinism, and offline readiness while integrating tightly with Scanner Worker orchestration and SBOM composition. | ||||
|  | ||||
| All sprints below assume prerequisites from SP10-G2 (core scaffolding + Java analyzer) are complete. Each sprint is sized for a focused guild (≈1–1.5 weeks) and produces definitive gates for downstream teams (Emit, Policy, Scheduler). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## Sprint LA1 — Node Analyzer & Workspace Intelligence (Tasks 10-302, 10-307, 10-308, 10-309 subset) *(DOING — 2025-10-19)* | ||||
| - **Scope:** Resolve hoisted `node_modules`, PNPM structures, Yarn Berry Plug'n'Play, symlinked workspaces, and detect security-sensitive scripts. | ||||
| - **Deliverables:** | ||||
|   - `StellaOps.Scanner.Analyzers.Lang.Node` plug-in with manifest + DI registration. | ||||
|   - Deterministic walker supporting >100 k modules with streaming JSON parsing. | ||||
|   - Workspace graph persisted as analyzer metadata (`package.json` provenance + symlink target proofs). | ||||
| - **Acceptance Metrics:** | ||||
|   - 10 k module fixture scans <1.8 s on 4 vCPU (p95). | ||||
|   - Memory ceiling <220 MB (tracked via deterministic benchmark harness). | ||||
|   - All symlink targets canonicalized; path traversal guarded. | ||||
| - **Gate Artifacts:** | ||||
|   - `Fixtures/lang/node/**` golden outputs. | ||||
|   - Analyzer benchmark CSV + flamegraph (commit under `bench/Scanner.Analyzers`). | ||||
|   - Worker integration sample enabling Node analyzer via manifest. | ||||
| - **Progress (2025-10-21):** Module walker with package-lock/yarn/pnpm resolution, workspace attribution, integrity metadata, and deterministic fixture harness committed; Node tasks 10-302A/B remain green. Shared component mapper + canonical result harness landed, closing tasks 10-307/308. Script metadata & telemetry (10-302C) emit policy hints, hashed evidence, and feed `scanner_analyzer_node_scripts_total` into Worker OpenTelemetry pipeline. Restart-time packaging closed (10-309): manifest added, Worker language catalog loads the Node analyzer, integration tests cover dispatch + layer fragments, and Offline Kit docs call out bundled language plug-ins. | ||||
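
The acceptance metric "all symlink targets canonicalized; path traversal guarded" can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the analyzer's own language, and the function name `resolve_within_root` is hypothetical, not part of the StellaOps codebase. The idea is to resolve every candidate path (following symlinks) and reject anything that lands outside the scan root.

```python
import os


def resolve_within_root(root: str, candidate: str) -> str:
    """Return the canonical path for `candidate`, or raise if it escapes `root`.

    Hypothetical helper: both paths are resolved with realpath so that
    symlinks and `..` segments are collapsed before the containment check.
    """
    real_root = os.path.realpath(root)
    real_target = os.path.realpath(os.path.join(real_root, candidate))
    # commonpath compares whole path components, so /scan/root-evil is not
    # mistaken for a child of /scan/root.
    if os.path.commonpath([real_root, real_target]) != real_root:
        raise ValueError(f"path escapes scan root: {candidate!r}")
    return real_target
```

A walker built this way would record the returned canonical path as the symlink target proof and skip (or flag) any entry that raises.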
|  | ||||
| ## Sprint LA2 — Python Analyzer & Entry Point Attribution (Tasks 10-303, 10-307, 10-308, 10-309 subset) | ||||
| - **Scope:** Parse `*.dist-info`, `RECORD` hashes, entry points, and pip-installed editable packages; integrate usage hints from EntryTrace. | ||||
| - **Deliverables:** | ||||
|   - `StellaOps.Scanner.Analyzers.Lang.Python` plug-in. | ||||
|   - RECORD hash validation with optional Zip64 support for `.whl` caches. | ||||
|   - Entry-point mapping into `UsageFlags` for Emit stage. | ||||
| - **Acceptance Metrics:** | ||||
|   - Hash verification throughput ≥75 MB/s sustained with streaming reader. | ||||
|   - False-positive rate for editable installs <1 % on curated fixtures. | ||||
|   - Determinism check across CPython 3.8–3.12 generated metadata. | ||||
| - **Gate Artifacts:** | ||||
|   - Golden fixtures for `site-packages`, virtualenv, and layered pip caches. | ||||
|   - Usage hint propagation tests (EntryTrace → analyzer → SBOM). | ||||
|   - Metrics counters (`scanner_analyzer_python_components_total`) documented. | ||||
| - **Progress (2025-10-21):** Python analyzer landed; Tasks 10-303A/B/C are DONE with dist-info parsing, RECORD verification, editable install detection, and deterministic `simple-venv` fixture + benchmark hooks recorded. | ||||
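
As an illustration of the RECORD verification named in the deliverables, the sketch below checks each `RECORD` row against the file on disk using a streaming reader. It relies on the standard wheel `RECORD` layout (`path,algorithm=urlsafe-base64-digest,size`, with padding stripped). The function names and status strings are hypothetical, and a real analyzer would also apply a path-traversal guard to each row, which is omitted here.

```python
import base64
import csv
import hashlib
import io
import os

# Assumption: only these algorithms are accepted; anything else is reported.
ALLOWED_ALGORITHMS = {"sha256", "sha384", "sha512"}


def record_digest(stream, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks and encode it the way RECORD does."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return base64.urlsafe_b64encode(digest.digest()).rstrip(b"=").decode("ascii")


def verify_record(site_packages: str, record_text: str) -> list:
    """Return a sorted list of (path, status) pairs for every RECORD row."""
    results = []
    for row in csv.reader(io.StringIO(record_text)):
        if not row:
            continue
        path = row[0]
        hash_spec = row[1] if len(row) > 1 else ""
        if not hash_spec:
            # RECORD lists itself (and some generated files) without a hash.
            results.append((path, "unhashed"))
            continue
        algorithm, _, expected = hash_spec.partition("=")
        if algorithm not in ALLOWED_ALGORITHMS:
            results.append((path, "unsupported"))
            continue
        try:
            with open(os.path.join(site_packages, path), "rb") as handle:
                actual = record_digest(handle, algorithm)
        except FileNotFoundError:
            results.append((path, "missing"))
            continue
        results.append((path, "ok" if actual == expected else "mismatch"))
    # Sorting keeps the output stable regardless of row order in RECORD.
    return sorted(results)
```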
|  | ||||
|  | ||||
| ## Sprint LA3 — Go Analyzer & Build Info Synthesis (Tasks 10-304, 10-307, 10-308, 10-309 subset) | ||||
| - **Scope:** Extract Go build metadata from `.note.go.buildid`, embedded module info, and fallback to `bin:{sha256}`; surface VCS provenance. | ||||
| - **Deliverables:** | ||||
| @@ -51,67 +51,67 @@ All sprints below assume prerequisites from SP10-G2 (core scaffolding + Java ana | ||||
| - **Gate Artifacts:** | ||||
|   - Benchmarks vs competitor open-source tool (Trivy or Syft) demonstrating faster metadata extraction. | ||||
|   - Documentation snippet explaining VCS metadata fields for Policy team. | ||||
| - **Progress (2025-10-22):** Build-info decoder shipped with DWARF-string fallback for `vcs.*` markers, plus cached metadata keyed by binary length/timestamp. Added Go test fixtures covering build-info and DWARF-only binaries with deterministic goldens; analyzer now emits `go.dwarf` evidence alongside `go.buildinfo` metadata to feed downstream provenance rules. Completed stripped-binary heuristics with deterministic `golang::bin::sha256` components and a new `stripped` fixture to guard quiet-provenance behaviour. Heuristic fallbacks now emit `scanner_analyzer_golang_heuristic_total{indicator,version_hint}` counters, and shared buffer pooling (`ArrayPool<byte>`) keeps concurrent scans allocation-lite. Bench harness (`bench/Scanner.Analyzers/config.json`) gained a dedicated Go scenario with baseline mean 4.02 ms; comparison against Syft v1.29.1 on the same fixture shows a 22 % speed advantage (see `bench/Scanner.Analyzers/lang/go/syft-comparison-20251021.csv`). | ||||
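
The `bin:{sha256}` fallback used for stripped binaries can be sketched as follows. This is a Python illustration under stated assumptions: the helper name `binary_fallback_key` is hypothetical, and the lowercase hex encoding of the digest is a guess, since the plan does not say how the hash is rendered.

```python
import hashlib


def binary_fallback_key(path: str, chunk_size: int = 1 << 20) -> str:
    """Derive a content-addressed component key for a binary with no metadata.

    The file is hashed in fixed-size chunks so memory use stays flat even
    for large binaries. Hex encoding is an assumption, not confirmed by the plan.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"bin:{digest.hexdigest()}"
```

Because the key depends only on file content, two scans of the same binary produce the same component identity regardless of path or timestamp.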
|  | ||||
| ## Sprint LA4 — .NET Analyzer & RID Variants (Tasks 10-305, 10-307, 10-308, 10-309 subset) | ||||
| - **Scope:** Parse `*.deps.json`, `runtimeconfig.json`, assembly metadata, and RID-specific assets; correlate with native dependencies. | ||||
| - **Deliverables:** | ||||
|   - `StellaOps.Scanner.Analyzers.Lang.DotNet` plug-in. | ||||
|   - Strong-name + Authenticode optional verification when offline cert bundle provided. | ||||
|   - RID-aware component grouping with fallback to `bin:{sha256}` for self-contained apps. | ||||
| - **Acceptance Metrics:** | ||||
|   - Multi-target app fixture processed <1.2 s; memory <250 MB. | ||||
|   - RID variant collapse reduces component explosion by ≥40 % vs naive listing. | ||||
|   - All security metadata (signing Publisher, timestamp) surfaced deterministically. | ||||
| - **Gate Artifacts:** | ||||
|   - Signed .NET sample apps (framework-dependent & self-contained) under `samples/scanner/lang/dotnet/`. | ||||
|   - Tests verifying dual runtimeconfig merge logic. | ||||
|   - Guidance for Policy on license propagation from NuGet metadata. | ||||
| - **Progress (2025-10-22):** Completed task 10-305A with a deterministic deps/runtimeconfig ingest pipeline producing `pkg:nuget` components across RID targets. Added dotnet fixture + golden output to the shared harness, wired analyzer plugin availability, and surfaced RID metadata in component records for downstream emit/diff work. License provenance and quiet flagging now ride through the shared helpers (task 10-307D), including nuspec license expression/file ingestion, manifest provenance tagging, and concurrency-safe file metadata caching with new parallel tests. | ||||
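
One way to picture the RID variant collapse from the acceptance metrics is the sketch below. It is in Python for brevity, and the input shape (tuples of package, version, RID), the `rids` field name, and the `any` placeholder for RID-neutral assets are all assumptions made for illustration. Only the `pkg:nuget/<name>@<version>` purl form follows the package URL convention. Instead of emitting one component per package per RID, assets are grouped by package identity and the RIDs are attached as sorted metadata.

```python
from collections import defaultdict


def collapse_rid_variants(assets):
    """Group (name, version, rid) tuples into one component per package.

    Hypothetical input shape: `rid` may be None for RID-neutral assets,
    which are recorded under the placeholder "any".
    """
    grouped = defaultdict(set)
    for name, version, rid in assets:
        grouped[(name, version)].add(rid or "any")
    # Sort by package identity, and sort RIDs, so output order is stable.
    return [
        {"purl": f"pkg:nuget/{name}@{version}", "rids": sorted(rids)}
        for (name, version), rids in sorted(grouped.items())
    ]
```

A naive listing of the same input would produce one entry per tuple, so the ratio between input length and output length gives a rough measure of the reduction the acceptance metric asks for.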
|  | ||||
| ## Sprint LA5 — Rust Analyzer & Binary Fingerprinting (Tasks 10-306, 10-307, 10-308, 10-309 subset) | ||||
| - **Scope:** Detect crates via metadata in `.fingerprint`, Cargo.lock fragments, or embedded `rustc` markers; robust fallback to binary hash classification. | ||||
| - **Deliverables:** | ||||
|   - `StellaOps.Scanner.Analyzers.Lang.Rust` plug-in. | ||||
|   - Symbol table heuristics capable of attributing stripped binaries by leveraging `.comment` and section names without violating determinism. | ||||
|   - Quiet-provenance flags to differentiate heuristics from hard evidence. | ||||
| - **Acceptance Metrics:** | ||||
|   - Accurate crate attribution ≥85 % on curated Cargo workspace fixtures. | ||||
|   - Heuristic fallback clearly labeled; no false “certain” claims. | ||||
|   - Analyzer completes in <1 s on a 500-binary corpus. | ||||
| - **Gate Artifacts:** | ||||
|   - Fixtures covering Cargo workspaces and binaries whose embedded metadata has been stripped. | ||||
|   - ADR documenting heuristic boundaries + risk mitigations. | ||||
|  | ||||
| ## Sprint LA6 — Shared Evidence Enhancements & Worker Integration (Tasks 10-307, 10-308, 10-309 finalization) | ||||
| - **Scope:** Finalize shared helpers, deterministic harness expansion, Worker/Emit wiring, and macro benchmarks. | ||||
| - **Deliverables:** | ||||
|   - Consolidated `LanguageComponentWriter` extensions for license, vulnerability hints, and usage propagation. | ||||
|   - Worker dispatcher loading plug-ins via manifest registry + health checks. | ||||
|   - Combined analyzer benchmark suite executed in CI with regression thresholds. | ||||
| - **Acceptance Metrics:** | ||||
|   - Worker executes mixed analyzer suite (Java+Node+Python+Go+.NET+Rust) within SLA: warm scan <6 s, cold <25 s. | ||||
|   - CI determinism guard fails on any output drift (zero diff tolerance) across all fixtures. | ||||
|   - Telemetry coverage: each analyzer emits timing + component counters. | ||||
| - **Gate Artifacts:** | ||||
|   - `SPRINTS_LANG_IMPLEMENTATION_PLAN.md` progress log updated (this file). | ||||
|   - `bench/Scanner.Analyzers/lang-matrix.csv` recorded + referenced in docs. | ||||
|   - Ops notes for packaging plug-ins into Offline Kit. | ||||
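
A determinism guard of the kind described in the acceptance metrics could look roughly like this. The sketch is Python, not the project's harness; `canonical_json` and `golden_drift` are hypothetical names, and the golden file is assumed to be stored in the same canonical form. The analyzer output is serialized with sorted keys and compared line by line against the golden fixture, and any non-empty diff fails the check.

```python
import difflib
import json


def canonical_json(value) -> str:
    """Serialize with sorted keys and fixed indentation so equal data gives equal text."""
    return json.dumps(value, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def golden_drift(actual, golden_text: str) -> list:
    """Return unified-diff lines between the golden text and the actual output.

    An empty list means no drift. A CI job would fail on any non-empty result.
    """
    actual_text = canonical_json(actual)
    return list(
        difflib.unified_diff(
            golden_text.splitlines(),
            actual_text.splitlines(),
            fromfile="golden",
            tofile="actual",
            lineterm="",
        )
    )
```

Returning the diff lines rather than a boolean lets the CI log show exactly which field drifted.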
|  | ||||
| --- | ||||
|  | ||||
| ## Cross-Sprint Considerations | ||||
| - **Security:** All analyzers must enforce path canonicalization, guard against zip-slip, and expose provenance classifications (`observed`, `heuristic`, `attested`). | ||||
| - **Offline-first:** No network calls; rely on cached metadata and optional offline bundles (license texts, signature roots). | ||||
| - **Determinism:** Normalise timestamps to `0001-01-01T00:00:00Z` when persisting synthetic data; sort collections by stable keys. | ||||
| - **Benchmarking:** Extend `bench/Scanner.Analyzers` to compare against open-source scanners (Syft/Trivy) and document performance wins. | ||||
| - **Hand-offs:** Emit guild requires consistent component schemas; Policy needs license + provenance metadata; Scheduler depends on usage flags for ImpactIndex. | ||||
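
The determinism rule above (fixed timestamp for synthetic data, collections sorted by stable keys) can be sketched as a small normalisation pass. This is a Python illustration only: the field names `synthetic`, `timestamp`, `type`, `name`, and `version` are hypothetical and do not come from the actual component schema.

```python
SYNTHETIC_TIMESTAMP = "0001-01-01T00:00:00Z"


def normalise_components(components):
    """Return a new list with synthetic timestamps pinned and a stable order.

    Input dicts are copied, so the caller's data is left untouched.
    """
    normalised = []
    for component in components:
        item = dict(component)
        if item.get("synthetic"):
            # Synthetic records have no real-world time; pin them to the sentinel.
            item["timestamp"] = SYNTHETIC_TIMESTAMP
        normalised.append(item)
    # Sort on identity fields only, never on discovery order.
    return sorted(
        normalised,
        key=lambda c: (c["type"], c["name"], c.get("version", "")),
    )
```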
|  | ||||
| ## Tracking & Reporting | ||||
| - Update `TASKS.md` per sprint (TODO → DOING → DONE) with date stamps. | ||||
| - Log sprint summaries in `docs/updates/` once each sprint lands. | ||||
| - Use module-specific CI pipeline to run analyzer suites nightly (determinism + perf). | ||||
|  | ||||
| --- | ||||
|  | ||||
| **Next Action:** Start Sprint LA1 (Node Analyzer) — move tasks 10-302, 10-307, 10-308, 10-309 → DOING and spin up fixtures + benchmarks. | ||||
|   | ||||