master 09d21d977c feat: Update analyzer fixtures and metadata for improved license handling and provenance tracking
- Added license expressions and provenance fields to expected JSON outputs for .NET and Rust analyzers.
- Introduced new .nuspec files for StellaOps.Runtime.SelfContained and StellaOps.Toolkit packages, including license information.
- Created LICENSE.txt files for both toolkit packages with clear licensing terms.
- Updated expected JSON for signed and simple analyzers to include license information and provenance.
- Enhanced the SPRINTS_LANG_IMPLEMENTATION_PLAN.md with detailed progress and future sprint outlines, ensuring clarity on deliverables and acceptance metrics.
2025-10-23 07:57:16 +03:00


StellaOps Scanner — Language Analyzer Implementation Plan (2025Q4)

Goal. Deliver best-in-class language analyzers that outperform competitors on fidelity, determinism, and offline readiness while integrating tightly with Scanner Worker orchestration and SBOM composition.

All sprints below assume the prerequisites from SP10-G2 (core scaffolding + Java analyzer) are complete. Each sprint is sized for a focused guild (≈1–1.5 weeks) and produces definitive gates for downstream teams (Emit, Policy, Scheduler).


Sprint LA1 — Node Analyzer & Workspace Intelligence (Tasks 10-302, 10-307, 10-308, 10-309 subset) (DOING — 2025-10-19)

  • Scope: Resolve hoisted node_modules, PNPM structures, Yarn Berry Plug'n'Play, symlinked workspaces, and detect security-sensitive scripts.
  • Deliverables:
    • StellaOps.Scanner.Analyzers.Lang.Node plug-in with manifest + DI registration.
    • Deterministic walker supporting >100k modules with streaming JSON parsing.
    • Workspace graph persisted as analyzer metadata (package.json provenance + symlink target proofs).
  • Acceptance Metrics:
    • 10k-module fixture scans <1.8s on 4 vCPU (p95).
    • Memory ceiling <220MB (tracked via deterministic benchmark harness).
    • All symlink targets canonicalized; path traversal guarded.
  • Gate Artifacts:
    • Fixtures/lang/node/** golden outputs.
    • Analyzer benchmark CSV + flamegraph (commit under bench/Scanner.Analyzers).
    • Worker integration sample enabling Node analyzer via manifest.
  • Progress (2025-10-21): Module walker with package-lock/yarn/pnpm resolution, workspace attribution, integrity metadata, and deterministic fixture harness committed; Node tasks 10-302A/B remain green. Shared component mapper + canonical result harness landed, closing tasks 10-307/308. Script metadata & telemetry (10-302C) emit policy hints, hashed evidence, and feed scanner_analyzer_node_scripts_total into Worker OpenTelemetry pipeline. Restart-time packaging closed (10-309): manifest added, Worker language catalog loads the Node analyzer, integration tests cover dispatch + layer fragments, and Offline Kit docs call out bundled language plug-ins.
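The symlink-canonicalization and traversal guard in the acceptance metrics reduces to one invariant: every resolved module path must stay under the scan root after links are followed. A minimal Python sketch of that check (illustrative only — the actual plug-in is C#, and the helper name is an assumption):

```python
import os

def resolve_module_path(scan_root: str, candidate: str) -> str:
    """Canonicalize a node_modules entry (following symlinks) and reject any
    target that escapes the scan root -- the zip-slip / traversal guard."""
    root = os.path.realpath(scan_root)
    resolved = os.path.realpath(os.path.join(root, candidate))
    # commonpath == root iff `resolved` is root itself or lives beneath it
    if os.path.commonpath([root, resolved]) != root:
        raise ValueError(f"symlink or path escapes scan root: {candidate!r}")
    return resolved
```

The same guard applies whether the entry came from a hoisted tree, a PNPM virtual store, or a Yarn Berry cache: canonicalize first, then compare against the root.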

Sprint LA2 — Python Analyzer & Entry Point Attribution (Tasks 10-303, 10-307, 10-308, 10-309 subset)

  • Scope: Parse *.dist-info, RECORD hashes, entry points, and pip-installed editable packages; integrate usage hints from EntryTrace.
  • Deliverables:
    • StellaOps.Scanner.Analyzers.Lang.Python plug-in.
    • RECORD hash validation with optional Zip64 support for .whl caches.
    • Entry-point mapping into UsageFlags for Emit stage.
  • Acceptance Metrics:
    • Hash verification throughput ≥75MB/s sustained with streaming reader.
    • False-positive rate for editable installs <1% on curated fixtures.
    • Determinism check across metadata generated by CPython 3.8–3.12.
  • Gate Artifacts:
    • Golden fixtures for site-packages, virtualenv, and layered pip caches.
    • Usage hint propagation tests (EntryTrace → analyzer → SBOM).
    • Metrics counters (scanner_analyzer_python_components_total) documented.
  • Progress (2025-10-21): Python analyzer landed; Tasks 10-303A/B/C are DONE with dist-info parsing, RECORD verification, editable install detection, and deterministic simple-venv fixture + benchmark hooks recorded.
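The RECORD verification deliverable boils down to streaming each installed file through the hash algorithm named in its RECORD row and comparing the urlsafe-base64 digest with padding stripped (the wheel/PEP 376 encoding). A Python sketch of that check — the function name and return shape are illustrative, not the plug-in's API:

```python
import base64
import csv
import hashlib
import os

def verify_record(site_packages: str, record_path: str) -> list:
    """Return the relative paths whose on-disk hash does not match RECORD."""
    mismatches = []
    with open(record_path, newline="") as fh:
        for path, digest, _size in csv.reader(fh):
            if not digest:  # RECORD itself and *.pyc rows carry no hash
                continue
            algo, _, expected = digest.partition("=")
            h = hashlib.new(algo)
            with open(os.path.join(site_packages, path), "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):  # stream, never slurp
                    h.update(chunk)
            actual = base64.urlsafe_b64encode(h.digest()).rstrip(b"=").decode()
            if actual != expected:
                mismatches.append(path)
    return mismatches
```

Streaming in fixed-size chunks is what makes the ≥75MB/s sustained-throughput target realistic on large site-packages trees.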

Sprint LA3 — Go Analyzer & Build Info Synthesis (Tasks 10-304, 10-307, 10-308, 10-309 subset)

  • Scope: Extract Go build metadata from .note.go.buildid, embedded module info, and fallback to bin:{sha256}; surface VCS provenance.
  • Deliverables:
    • StellaOps.Scanner.Analyzers.Lang.Go plug-in.
    • DWARF-lite parser to enrich component origin (commit hash + dirty flag) when available.
    • Shared hash cache to dedupe repeated binaries across layers.
  • Acceptance Metrics:
    • Analyzer latency ≤400µs per binary (hot cache) / ≤2ms (cold).
    • Provenance coverage ≥95% on representative Go fixture suite.
    • Zero allocations in happy path beyond pooled buffers (validated via BenchmarkDotNet).
  • Gate Artifacts:
    • Benchmarks vs competitor open-source tool (Trivy or Syft) demonstrating faster metadata extraction.
    • Documentation snippet explaining VCS metadata fields for Policy team.
  • Progress (2025-10-22): Build-info decoder shipped with DWARF-string fallback for vcs.* markers, plus cached metadata keyed by binary length/timestamp. Added Go test fixtures covering build-info and DWARF-only binaries with deterministic goldens; analyzer now emits go.dwarf evidence alongside go.buildinfo metadata to feed downstream provenance rules. Completed stripped-binary heuristics with deterministic golang::bin::sha256 components and a new stripped fixture to guard quiet-provenance behaviour. Heuristic fallbacks now emit scanner_analyzer_golang_heuristic_total{indicator,version_hint} counters, and shared buffer pooling (ArrayPool<byte>) keeps concurrent scans allocation-lite. Bench harness (bench/Scanner.Analyzers/config.json) gained a dedicated Go scenario with baseline mean 4.02ms; comparison against Syft v1.29.1 on the same fixture shows a 22% speed advantage (see bench/Scanner.Analyzers/lang/go/syft-comparison-20251021.csv).
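The shared hash cache deliverable — and the "cached metadata keyed by binary length/timestamp" mentioned in the progress note — can be sketched as a memoized extractor. The key shape and `extract_fn` below are assumptions for illustration; the shipping implementation is C# with pooled buffers:

```python
class BuildInfoCache:
    """Memoize per-binary metadata so identical binaries repeated across
    image layers are parsed once. (length, mtime) is a cheap pre-filter key
    that avoids re-hashing or re-walking DWARF for duplicate blobs."""

    def __init__(self, extract_fn):
        self._extract = extract_fn  # e.g. build-info / DWARF-lite decoder
        self._cache = {}

    def get(self, path, length, mtime):
        key = (length, mtime)
        if key not in self._cache:
            self._cache[key] = self._extract(path)
        return self._cache[key]
```

A (length, mtime) pre-key can collide in principle, so a production cache would confirm with a content hash before trusting the hit; the sketch omits that step for brevity.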

Sprint LA4 — .NET Analyzer & RID Variants (Tasks 10-305, 10-307, 10-308, 10-309 subset)

  • Scope: Parse *.deps.json, runtimeconfig.json, assembly metadata, and RID-specific assets; correlate with native dependencies.
  • Deliverables:
    • StellaOps.Scanner.Analyzers.Lang.DotNet plug-in.
    • Strong-name + Authenticode optional verification when offline cert bundle provided.
    • RID-aware component grouping with fallback to bin:{sha256} for self-contained apps.
  • Acceptance Metrics:
    • Multi-target app fixture processed <1.2s; memory <250MB.
    • RID variant collapse reduces component explosion by ≥40% vs naive listing.
    • All security metadata (signing Publisher, timestamp) surfaced deterministically.
  • Gate Artifacts:
    • Signed .NET sample apps (framework-dependent & self-contained) under samples/scanner/lang/dotnet/.
    • Tests verifying dual runtimeconfig merge logic.
    • Guidance for Policy on license propagation from NuGet metadata.
  • Progress (2025-10-22): Completed task 10-305A with a deterministic deps/runtimeconfig ingest pipeline producing pkg:nuget components across RID targets. Added dotnet fixture + golden output to the shared harness, wired analyzer plugin availability, and surfaced RID metadata in component records for downstream emit/diff work. License provenance and quiet flagging now ride through the shared helpers (task 10-307D), including nuspec license expression/file ingestion, manifest provenance tagging, and concurrency-safe file metadata caching with new parallel tests.
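The deps.json ingest described above maps each library entry of type `package` to a `pkg:nuget` identity (project references are skipped). A minimal Python sketch under that assumption — the function is illustrative, not the analyzer's surface:

```python
import json

def nuget_components(deps_json: str) -> list:
    """Map *.deps.json 'libraries' entries of type 'package' to pkg:nuget purls."""
    doc = json.loads(deps_json)
    purls = set()
    for lib_id, lib in doc.get("libraries", {}).items():
        if lib.get("type") != "package":
            continue  # 'project' and 'referenceassembly' entries are not NuGet packages
        name, _, version = lib_id.partition("/")  # entries are "Name/Version"
        purls.add("pkg:nuget/%s@%s" % (name, version))
    return sorted(purls)  # stable ordering for deterministic output
```

RID-aware grouping then collapses the per-runtime asset lists under these identities rather than emitting one component per RID asset.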

Sprint LA5 — Rust Analyzer & Binary Fingerprinting (Tasks 10-306, 10-307, 10-308, 10-309 subset)

  • Scope: Detect crates via metadata in .fingerprint, Cargo.lock fragments, or embedded rustc markers; robust fallback to binary hash classification.
  • Deliverables:
    • StellaOps.Scanner.Analyzers.Lang.Rust plug-in.
    • Symbol table heuristics capable of attributing stripped binaries by leveraging .comment and section names without violating determinism.
    • Quiet-provenance flags to differentiate heuristics from hard evidence.
  • Acceptance Metrics:
    • Accurate crate attribution ≥85% on curated Cargo workspace fixtures.
    • Heuristic fallback clearly labeled; no false “certain” claims.
    • Analyzer completes <1s on 500 binary corpus.
  • Gate Artifacts:
    • Fixtures covering cargo workspaces, binaries with embedded metadata stripped.
    • ADR documenting heuristic boundaries + risk mitigations.
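One of the symbol-table heuristics above can be sketched as a deterministic scan for embedded Cargo registry paths, with every hit tagged `heuristic` rather than `observed` — the regex and record shape are assumptions for illustration, not the plug-in's implementation:

```python
import re

# Matches ".../cargo/registry/src/<index>/<crate>-<semver>" fragments in binaries.
CRATE_PATH = re.compile(
    rb"cargo[/\\]registry[/\\]src[/\\][^/\\]+[/\\]([A-Za-z0-9_-]+)-(\d+\.\d+\.\d+)"
)

def attribute_crates(binary: bytes) -> list:
    """Attribute crates from embedded registry paths; never claims certainty."""
    found = {}
    for m in CRATE_PATH.finditer(binary):
        name, version = m.group(1).decode(), m.group(2).decode()
        found[(name, version)] = {
            "name": name,
            "version": version,
            "provenance": "heuristic",  # quiet-provenance flag, per ADR boundary
        }
    return [found[k] for k in sorted(found)]  # stable key sort for determinism
```

Sorting by (name, version) keeps output byte-stable regardless of where the strings sit in the binary, which is what lets the heuristic coexist with the determinism guard.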

Sprint LA6 — Shared Evidence Enhancements & Worker Integration (Tasks 10-307, 10-308, 10-309 finalization)

  • Scope: Finalize shared helpers, deterministic harness expansion, Worker/Emit wiring, and macro benchmarks.
  • Deliverables:
    • Consolidated LanguageComponentWriter extensions for license, vulnerability hints, and usage propagation.
    • Worker dispatcher loading plug-ins via manifest registry + health checks.
    • Combined analyzer benchmark suite executed in CI with regression thresholds.
  • Acceptance Metrics:
    • Worker executes mixed analyzer suite (Java+Node+Python+Go+.NET+Rust) within SLA: warm scan <6s, cold <25s.
    • CI determinism guard catches output drift (>0 diff tolerance) across all fixtures.
    • Telemetry coverage: each analyzer emits timing + component counters.
  • Gate Artifacts:
    • SPRINTS_LANG_IMPLEMENTATION_PLAN.md progress log updated (this file).
    • bench/Scanner.Analyzers/lang-matrix.csv recorded + referenced in docs.
    • Ops notes for packaging plug-ins into Offline Kit.
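The CI determinism guard's ">0 diff tolerance" contract amounts to: run the analyzer repeatedly over a fixture and fail on any byte-level drift. A Python sketch of that gate, with the callback shape assumed for illustration:

```python
import hashlib

def determinism_guard(run_analyzer, fixture: str, runs: int = 3) -> str:
    """Run the analyzer `runs` times over one fixture; any byte drift fails.
    Returns the single stable digest on success."""
    digests = {hashlib.sha256(run_analyzer(fixture)).hexdigest() for _ in range(runs)}
    if len(digests) != 1:
        raise AssertionError("output drift on %s: %s" % (fixture, sorted(digests)))
    return digests.pop()
```

Recording the returned digest per fixture also gives later runs a baseline to diff against, catching drift introduced between CI invocations, not just within one.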

Cross-Sprint Considerations

  • Security: All analyzers must enforce path canonicalization, guard against zip-slip, and expose provenance classifications (observed, heuristic, attested).
  • Offline-first: No network calls; rely on cached metadata and optional offline bundles (license texts, signature roots).
  • Determinism: Normalise timestamps to 0001-01-01T00:00:00Z when persisting synthetic data; sort collections by stable keys.
  • Benchmarking: Extend bench/Scanner.Analyzers to compare against open-source scanners (Syft/Trivy) and document performance wins.
  • Hand-offs: Emit guild requires consistent component schemas; Policy needs license + provenance metadata; Scheduler depends on usage flags for ImpactIndex.
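The determinism rules above (epoch-pinned synthetic timestamps, stable sort keys) can be folded into a single canonicalization step before persistence. A Python sketch; the field names are assumptions for illustration:

```python
import json

SYNTHETIC_EPOCH = "0001-01-01T00:00:00Z"  # normalization target for synthetic data

def canonicalize(component: dict) -> str:
    """Emit a byte-stable JSON record: synthetic timestamps pinned to the
    epoch sentinel, keys sorted, no incidental whitespace."""
    out = dict(component)
    if out.get("synthetic"):
        out["timestamp"] = SYNTHETIC_EPOCH
    return json.dumps(out, sort_keys=True, separators=(",", ":"))
```

Two records that differ only in wall-clock noise or key insertion order then serialize identically, which is what the zero-diff CI guard depends on.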

Tracking & Reporting

  • Update TASKS.md per sprint (TODO → DOING → DONE) with date stamps.
  • Log sprint summaries in docs/updates/ once each sprint lands.
  • Use module-specific CI pipeline to run analyzer suites nightly (determinism + perf).

Next Action: Start Sprint LA5 (Rust Analyzer) — LA1 through LA4 are underway or landed per the progress notes above; move task 10-306 plus the remaining 10-307/10-308/10-309 subsets → DOING and spin up fixtures + benchmarks.