git.stella-ops.org/docs/implplan/SPRINT_137_scanner_gap_design.md


Sprint 137 - Scanner & Surface

Phase focus: Scanner.VIII — Analyzer gap design & readiness.

  • Depends on: Sprint 136 · Scanner.VII (Surface env/fs/secrets) to ensure shared primitives exist.
  • Feeds: Sprint 138 (Ruby parity) and Sprint 139 (language-specific analyzers) by locking designs + policy hooks.
| Task ID | State | Summary | Owner / Source | Depends On |
| --- | --- | --- | --- | --- |
| SCANNER-ENG-0002 | DONE (2025-11-09) | Design the Node.js lockfile collector + CLI validator per docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md, capturing Surface + policy requirements before implementation. | Scanner Guild, CLI Guild (docs/modules/scanner) | |
| SCANNER-ENG-0003 | DONE (2025-11-09) | Design Python lockfile + editable-install parity checks with policy predicates and CLI workflow coverage as outlined in the gap analysis. | Python Analyzer Guild, CLI Guild (docs/modules/scanner) | |
| SCANNER-ENG-0004 | DONE (2025-11-09) | Design Java lockfile ingestion/validation (Gradle/SBT collectors, CLI verb, policy hooks) to close comparison gaps. | Java Analyzer Guild, CLI Guild (docs/modules/scanner) | |
| SCANNER-ENG-0005 | DONE (2025-11-09) | Enhance Go stripped-binary fallback inference design, including inferred module metadata + policy integration, per the gap analysis. | Go Analyzer Guild (docs/modules/scanner) | |
| SCANNER-ENG-0006 | DONE (2025-11-09) | Expand Rust fingerprint coverage design (enriched fingerprint catalogue + policy controls) per the comparison matrix. | Rust Analyzer Guild (docs/modules/scanner) | |
| SCANNER-ENG-0007 | DONE (2025-11-09) | Design the deterministic secret leak detection pipeline covering rule packaging, Policy Engine integration, and CLI workflow. | Scanner Guild, Policy Guild (docs/modules/scanner) | |

2025-11-09: The gap designs below capture analyzer, Surface, CLI, and policy contracts for SCANNER-ENG-0002…0007; the tasks were moved from DOING to DONE after this review.

Implementation progress (2025-11-09)

  • Gradle/Maven lock ingestion is now wired into JavaLanguageAnalyzer: JavaLockFileCollector sorts lock metadata deterministically, merges it with archive findings (lockConfiguration, lockRepository, lockResolved), and emits declared-only components (with declaredOnly=true, lockSource, lockLocator) whenever jars are missing. CLI/Surface telemetry tags were updated to carry per-language declared/missing counters.
  • stella java lock-validate shares the HandleLanguageLockValidateAsync helper with Node/Python, has table/JSON output parity, and is documented alongside the scanner README + CLI guide (including the new metric stellaops.cli.java.lock_validate.count). Tests now cover the Ruby/Node/Java lock workflows end-to-end via CommandHandlersTests.

Design outcomes

SCANNER-ENG-0002 — Node.js lockfile collector + CLI validator

Scope & goals

  • Provide deterministic ingestion of pnpm-lock.yaml, package-lock.json, and yarn.lock so declared dependencies are preserved even when node_modules is absent.
  • Offer a CLI validator that runs without scheduling a scan, reusing the same collector and Surface safety rails.

Design decisions

  • Add NodeLockfileCollector under StellaOps.Scanner.Analyzers.Lang.Node. The collector normalises manifests into a shared model (package name, version, resolved, integrity, registry, workspace path) and emits DeclaredOnly = true components stored beside installed fragments (LayerComponentFragment.DeclaredSources).
  • Reuse LanguageAnalyzerContext merge rules so installed packages supersede declared-only entries while retaining discrepancies for policy.
  • Gate execution through Surface.Validation (scanner.lockfiles.node.* knobs) that enforce max lockfile size, workspace limits, and registry allowlists; violations fail fast with deterministic error IDs.
  • Private registries referenced in lockfiles must use secret:// handles. Surface.Secrets resolves these handles before validation and the resolved metadata (never the secret) is attached to the collector context for auditing.
  • EntryTrace usage hints annotate runtime packages; when a package is used at runtime but missing from the lockfile, the merge step tags it with UsageWithoutDeclaration.
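
The merge rule described above can be sketched roughly as follows. This is an illustrative Python sketch, not the analyzer's actual code: `Component`, `merge_components`, and the `VersionMismatch` flag are hypothetical names, and lock entries are simplified to name/version pairs. Only the `UsageWithoutDeclaration` tag and the "installed supersedes declared-only" rule come from the design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Simplified stand-in for a merged component record (hypothetical shape)."""
    name: str
    version: str
    declared_only: bool = False
    flags: tuple = ()


def merge_components(installed, declared, runtime_used):
    """Merge installed packages with lockfile declarations.

    installed / declared: dict mapping package name -> version.
    runtime_used: set of package names seen at runtime (EntryTrace-style hints).
    """
    merged = []
    # Sorting the union of names keeps the output deterministic.
    for name in sorted(set(installed) | set(declared)):
        if name in installed:
            flags = []
            if name in declared and declared[name] != installed[name]:
                # Hypothetical flag: keep the discrepancy visible for policy.
                flags.append("VersionMismatch")
            if name not in declared and name in runtime_used:
                flags.append("UsageWithoutDeclaration")
            # Installed evidence supersedes the declared-only entry.
            merged.append(Component(name, installed[name], False, tuple(flags)))
        else:
            # Present only in the lockfile: emit a declared-only component.
            merged.append(Component(name, declared[name], True, ()))
    return merged
```

For example, a package that is installed and used at runtime but absent from the lockfile comes back with `flags == ("UsageWithoutDeclaration",)`, while a lockfile entry with no installed counterpart comes back with `declared_only=True`.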

CLI, policy, docs

  • Add stella node lock-validate [path] --format {auto|pnpm|npm|yarn} that runs locally, reuses Surface controls, and returns canonical JSON + table summaries. The CLI inherits --surface-config so air-gapped configs stay consistent.
  • Scanner/WebService gains --node-lockfiles / SCANNER__NODE__LOCKFILES__ENABLED toggles to control ingestion during full scans.
  • Policy Engine receives predicates: node.lock.declaredMissing, node.lock.registryDisallowed, node.lock.declarationOnly. Templates show how to fail on disallowed registries while only warning on declared-only findings that never reach runtime.
  • Update docs/modules/scanner/architecture.md and policy DSL appendices with the new evidence flags and CLI workflow.

Testing, telemetry, rollout

  • Golden fixtures for pnpm v8, npm v9, and yarn berry lockfiles live under tests/Scanner.Analyzers.Node/__fixtures__/lockfiles. Deterministic snapshots are asserted in both analyzer and CLI tests.
  • Add integration coverage in tests/Scanner.Cli.Node verifying exit codes and explain output for mismatched packages/registries.
  • Emit counters (scanner.node.lock.declared, scanner.node.lock.mismatch, scanner.node.lock.registry_blocked) plus structured logs keyed by lockfile digest.
  • Offline Kit ships the parser tables and CLI binary help under offline/scanner/node-lockfiles/README.md.

Implementation status (2025-11-09)

  • Lockfile declarations now emit DeclaredOnly components in StellaOps.Scanner.Analyzers.Lang.Node with lock source/locator metadata and deterministic evidence for policy use.
  • CLI verb stella node lock-validate inspects lockfiles locally, rendering declared-only/missing-lock summaries and emitting stellaops.cli.node.lock_validate.count telemetry.
  • Node analyzer determinism fixtures updated with declared-only coverage; CLI unit suite exercises the new handler.
  • Python analyzer ingests requirements*.txt, Pipfile.lock, and poetry.lock, tagging installed distributions with lockSource metadata and creating declared-only components. stella python lock-validate mirrors the workflow for offline validation and records stellaops.cli.python.lock_validate.count.

SCANNER-ENG-0003 — Python lockfile + editable-install parity

Scope & goals

  • Parse Python lockfiles (poetry.lock, Pipfile.lock, hashed requirements*.txt) to capture declared graphs pre-install.
  • Detect editable installs and local path references so policy can assert parity between lockfiles and runtime contents.

Design decisions

  • Introduce PythonLockfileCollector in StellaOps.Scanner.Analyzers.Lang.Python, capable of reading Poetry, Pipenv, pip-tools, and raw requirements syntax (including environment markers, extras, hashes, VCS refs).
  • Extend the collector with an EditableResolver that inspects lockfile entries (path =, editable = true, -e ./pkg) and consults Surface.FS to normalise the referenced directory, capturing EditablePath, SourceDigest, and VcsRef metadata.
  • Merge results with installed *.dist-info data using LanguageAnalyzerContext. Installed evidence overrides declared-only components; editable packages missing from the artifact layer are tagged EditableMissing.
  • Surface.Validation adds knobs scanner.lockfiles.python.maxBytes, scanner.lockfiles.python.allowedIndexes, and ensures hashes are present when policy mandates repeatable environments. Private index credentials are provided via Surface.Secrets and never persisted.
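
To make the editable-install handling concrete, here is a minimal Python sketch of how `-e` / `--editable` entries in a requirements file might be detected and normalised against a checkout root. The function name `find_editables`, the result keys, and the `/workspace` default root are assumptions for illustration; the real EditableResolver works through Surface.FS and also captures digests and VCS refs, which this sketch does not attempt.

```python
import posixpath


def find_editables(requirements_text, root="/workspace"):
    """Return editable entries found in requirements-style text.

    Local paths are normalised against `root` and checked for containment;
    VCS/URL targets are reported but not resolved.
    """
    results = []
    for raw in requirements_text.splitlines():
        # Drop trailing comments (pip treats " #" as a comment start).
        line = raw.split(" #", 1)[0].strip()
        if line.startswith("-e "):
            target = line[len("-e "):].strip()
        elif line.startswith("--editable "):
            target = line[len("--editable "):].strip()
        else:
            continue
        if "://" in target:
            results.append({"target": target, "kind": "vcs", "inside_root": None})
            continue
        resolved = posixpath.normpath(posixpath.join(root, target))
        base = root.rstrip("/")
        inside = resolved == base or resolved.startswith(base + "/")
        results.append(
            {"target": target, "kind": "path", "resolved": resolved, "inside_root": inside}
        )
    return results
```

An entry such as `-e ../outside` normalises to a path outside the root and is reported with `inside_root=False`, which is the kind of signal the CLI parity diagnostics could surface.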

CLI, policy, docs

  • New CLI verb stella python lock-validate mirrors the Node workflow, validates editable references resolve within the checked-out tree, and emits parity diagnostics.
  • Scanner runs accept --python-lockfiles to toggle ingestion per tenant.
  • Policy predicates: python.lock.declaredMissing, python.lock.editableUnpinned, python.lock.indexDisallowed. Editable packages missing from the filesystem can be set to fail builds or raise waivers.
  • Document the workflow in docs/modules/scanner/architecture.md and the policy cookbook, including guidance on handling build-system backends.

Testing, telemetry, rollout

  • Fixtures covering Poetry 1.6, Pipenv 2024.x, requirements.txt with markers, and mixed editable/VCS entries live beside the analyzer tests.
  • CLI golden output asserts deterministic ordering and masking of secrets in URLs.
  • Metrics: scanner.python.lock.declared, scanner.python.lock.editable, scanner.python.lock.failures.
  • Offline Kit bundles include parser definitions and sample policies to keep air-gapped tenants aligned.

SCANNER-ENG-0004 — Java/Gradle/SBT lockfile ingestion & validation

Scope & goals

  • Capture Gradle, Maven, and SBT dependency locks before artifacts are built, along with repository provenance and configuration scopes.
  • Provide CLI validation and policy predicates enforcing repository allowlists and declared/runtime parity.

Design decisions

  • Add collectors: GradleLockfileCollector (reads gradle.lockfile and gradle/dependency-locks/*.lock), MavenLockfileCollector (parses pom.xml/pom.lock + dependencyManagement overrides), and SbtLockfileCollector (reads Ivy resolution outputs or dependencies.lock).
  • Each collector emits normalized records keyed by groupId:artifactId:version plus config scope (compileClasspath, runtimeClasspath, etc.), repository URI, checksum, and optional classifier. Records are stored as DeclaredOnly fragments associated with their workspace path.
  • Surface.Validation enforces file-size limits, repository allowlists (scanner.lockfiles.java.allowedRepos), and optional checksum requirements. Private Maven credentials flow through Surface.Secrets.
  • JavaLanguageAnalyzer merges declared entries with installed archives. Runtime usage from EntryTrace is attached so policies can prioritize gaps that reach runtime.
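
As an illustration of the normalized records described above, the following Python sketch parses the single-file `gradle.lockfile` format, in which each entry reads `group:artifact:version=configuration[,configuration...]` and a trailing `empty=` line lists configurations with no dependencies. The function name and the tuple layout are assumptions; repository URI, checksum, and classifier are omitted because the lockfile itself does not carry them.

```python
def parse_gradle_lockfile(text):
    """Parse gradle.lockfile content into sorted (group, artifact, version, config) tuples."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the generated header comments.
        if not line or line.startswith("#"):
            continue
        coordinate, _, configs = line.partition("=")
        if coordinate == "empty":
            # "empty=" lists configurations that resolved to nothing.
            continue
        parts = coordinate.split(":")
        if len(parts) != 3:
            raise ValueError(f"unexpected lock entry: {line!r}")
        group, artifact, version = parts
        for config in configs.split(","):
            if config:
                records.append((group, artifact, version, config))
    # Sorting makes the output independent of input order.
    return sorted(records)
```

Emitting one record per configuration scope keeps `compileClasspath` and `runtimeClasspath` findings separable, so a policy can weight runtime gaps differently.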

CLI, policy, docs

  • CLI verb stella java lock-validate supports Gradle/Maven/SBT modes, prints mismatched dependencies, and checks repository policy.
  • Scanner flags --java-lockfiles or env SCANNER__JAVA__LOCKFILES__ENABLED gate ingestion. Lockfile artifacts are uploaded to Surface.FS for evidence replay.
  • Policy predicates: java.lock.declaredMissing, java.lock.repoDisallowed, java.lock.unpinned (no checksum). Explain traces cite repository + config scope for each discrepancy.
  • Docs: update scanner module dossier and policy template library with repository governance examples.

Testing, telemetry, rollout

  • Fixtures derived from sample Gradle multi-projects, Maven BOM hierarchies, and SBT builds validate parser coverage and CLI messaging.
  • Metrics scanner.java.lock.declared, scanner.java.lock.missing, scanner.java.lock.repo_blocked feed the observability dashboards.
  • Offline kits include parser grammars and CLI docs so air-gapped tenants can enforce repo policies without SaaS dependencies.

SCANNER-ENG-0005 — Go stripped-binary fallback inference

Scope & goals

  • Enrich the stripped-binary fallback so Go modules remain explainable even without embedded buildinfo, and give Policy Engine knobs to treat inferred evidence differently.

Design decisions

  • Extend GoBinaryScanner with an inference pipeline that, when build info is absent, parses ELF/Mach-O symbol tables and DWARF data using the existing ElfSharp bindings. Symbols feed into a new GoSymbolInferenceEngine that matches against a signed GoFingerprintCatalog under StellaOps.Scanner.Analyzers.Lang.Go.Fingerprints.
  • Inferred results carry Confidence (0–1), matched symbol counts, and reasons (BuildInfoMissing, SymbolMatches, PkgPathFallback). Records are emitted as InferredModule metadata alongside hashed fallback components.
  • Update fragment schemas so DSSE-composed BOMs include both the hashed fallback and the inference summary, enabling deterministic replay.
  • Surface.Validation exposes scanner.analyzers.go.fallback.enabled, scanner.analyzers.go.fallback.maxSymbolBytes, ensuring workloads can opt out or constrain processing time.
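
One way to picture the inference step is a set intersection between the symbols recovered from a binary and the symbols a catalog expects for each module. The Python sketch below is only that picture: the scoring formula (matched symbols divided by expected symbols), the `min_matches` cut-off, and the catalog shape are all assumptions, since the design does not specify how Confidence is computed.

```python
def infer_modules(symbols, catalog, min_matches=2):
    """Match recovered symbols against a fingerprint catalog.

    symbols: iterable of symbol names recovered from the stripped binary.
    catalog: dict mapping module path -> frozenset of expected symbol names.
    Returns inferred-module records sorted by module path.
    """
    found = set(symbols)
    results = []
    for module in sorted(catalog):
        expected = catalog[module]
        matched = len(expected & found)
        if not expected or matched < min_matches:
            continue
        results.append(
            {
                "module": module,
                "matchedSymbols": matched,
                # Assumed scoring: fraction of catalog symbols that were seen.
                "confidence": round(matched / len(expected), 2),
                "reasons": ["BuildInfoMissing", "SymbolMatches"],
            }
        )
    return results
```

Iterating the catalog in sorted order and rounding the score keeps repeated runs byte-for-byte identical, which is the property the deterministic-replay requirement depends on.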

Policy, CLI, docs

  • Policy predicates go.module.inferenceConfidence and go.module.hashOnly let tenants fail when only hashed provenance exists or warn when inference confidence falls below a tenant-defined threshold.
  • CLI flag --go-fallback-detail (and corresponding API query) prints hashed vs inferred modules, confidence, and remediation hints (e.g., rebuild with -buildvcs).
  • Documentation updates cover inference details, how confidence feeds lattice weights, and how to author waivers.

Testing, telemetry, rollout

  • Add stripped binary fixtures (Linux, macOS) plus intentionally obfuscated samples. Tests assert deterministic inference and hashing.
  • Metrics scanner.go.inference.count, scanner.go.inference.confidence_bucket ensure observability; logs include imageDigest, binaryPath, confidence.
  • Offline Kit bundles the fingerprint catalog and inference changelog so air-gapped tenants can audit provenance.

SCANNER-ENG-0006 — Rust fingerprint coverage expansion

Scope & goals

  • Improve Rust evidence for stripped binaries by expanding fingerprint sources, symbol parsing, and policy controls over heuristic findings.

Design decisions

  • Build a new RustFingerprintCatalog signed and versioned, fed by Cargo crate metadata, community hash contributions, and curated fingerprints from StellaOps scans. Catalog lives under StellaOps.Scanner.Analyzers.Lang.Rust.Fingerprints with deterministic ordering.
  • Extend RustAnalyzerCollector with symbol parsing (DWARF, ELF build IDs) via SymbolGraphResolver. Resolver correlates crate sections, monomorphized symbol prefixes, and #[panic_handler] markers to infer crate names and versions.
  • Emit inference metadata (fingerprintId, confidence, symbolEvidence[]) alongside hashed fallbacks. Authoritative Cargo.lock data (when present) still wins in merges.
  • Surface.Validation adds toggles for fingerprint freshness and maximum catalog size per tenant. Offline bundles deliver catalog updates signed via DSSE.

Policy, CLI, docs

  • Policy predicates: rust.fingerprint.confidence, rust.fingerprint.catalogAgeDays. Templates show how to warn when only heuristic data exists, or fail if catalog updates are stale.
  • CLI flag --rust-fingerprint-detail prints authoritative vs inferred crates, symbol samples, and guidance.
  • Documentation (scanner module + policy guide) explains how inference is stored, how catalog publishing works, and how to tune policy weights.
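
The two predicates could combine into a decision along these lines. This is a hedged Python sketch rather than Policy Engine DSL: the function name, the 30-day and 0.8 thresholds, and the return shape are hypothetical, and the sketch narrows "warn on heuristic-only data" to heuristic findings below a confidence threshold.

```python
from datetime import date


def evaluate_rust_finding(authoritative, confidence, catalog_published, today,
                          max_catalog_age_days=30, min_confidence=0.8):
    """Return (decision, predicate) for a single Rust crate finding."""
    catalog_age_days = (today - catalog_published).days
    # A stale catalog fails regardless of how good the individual match is.
    if catalog_age_days > max_catalog_age_days:
        return ("fail", "rust.fingerprint.catalogAgeDays")
    # Heuristic-only evidence below the threshold produces a warning.
    if not authoritative and confidence < min_confidence:
        return ("warn", "rust.fingerprint.confidence")
    return ("pass", None)
```

Passing `today` explicitly, instead of reading the clock inside the function, keeps the evaluation reproducible for replayed scans.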

Testing, telemetry, rollout

  • Add fixtures for stripped Rust binaries across editions (2018–2024) and with/without LTO. Determinism tests compare catalog revisions and inference outputs.
  • Metrics scanner.rust.fingerprint.authoritative, scanner.rust.fingerprint.inferred, scanner.rust.fingerprint.catalog_version feed dashboards and alerts.
  • Offline kit updates include catalog packages, verification instructions, and waiver templates tied to predicate names.

SCANNER-ENG-0007 — Deterministic secret leak detection pipeline

Scope & goals

  • Provide first-party secret leak detection that matches competitor capabilities while preserving deterministic, offline-friendly execution and explainability.

Design decisions

  • Introduce StellaOps.Scanner.Analyzers.Secrets, a restart-time plug-in that consumes rule bundles (ruleset.tgz) signed with DSSE and versioned (semantic version + hash). Bundles live under plugins/scanner/secrets/rules/<version>.
  • Rule bundles contain deterministic regex/entropy definitions, context windows, and masking directives. A rule index is generated at build time to guarantee deterministic ordering.
  • Analyzer executes after Surface validation of each file/layer. Files pass through a streaming matcher that outputs SecretLeakEvidence (rule id, severity, confidence, file path, byte ranges, masking applied). Findings persist in ScanAnalysisStore and align with DSSE exports.
  • Surface.Validation introduces scanner.secrets.rules.bundle, scanner.secrets.maxFileBytes, and scanner.secrets.targetGlobs. Surface.Secrets supplies allowlist tokens (e.g., approved test keys) without exposing plaintext to analyzers.
  • Events/attestations: findings are optionally published via the existing Redis events, and Export Center bundles include masked evidence plus rule metadata.
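
The regex-plus-entropy matching and masking described above might look like the Python sketch below. It is a simplification under stated assumptions: it scans an in-memory byte string instead of streaming, the rule `demo.generic-token` and its thresholds are invented for illustration, and the evidence keys only approximate the SecretLeakEvidence fields named in the design.

```python
import math
import re
from collections import Counter

# Hypothetical rule set; real bundles would come from a signed ruleset.tgz.
RULES = [
    {
        "id": "demo.generic-token",
        "severity": "high",
        "pattern": re.compile(rb"token=([A-Za-z0-9+/]{20,})"),
        "min_entropy": 3.0,
    },
]


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte string, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def scan_bytes(path, data, rules=RULES):
    """Return masked findings for `data`, ordered deterministically."""
    findings = []
    for rule in sorted(rules, key=lambda r: r["id"]):
        for match in rule["pattern"].finditer(data):
            secret = match.group(1)
            # Low-entropy matches (e.g. placeholder values) are dropped.
            if shannon_entropy(secret) < rule["min_entropy"]:
                continue
            findings.append(
                {
                    "ruleId": rule["id"],
                    "severity": rule["severity"],
                    "path": path,
                    "byteRange": match.span(1),
                    # Keep two characters for triage; mask the rest.
                    "masked": secret[:2].decode("ascii") + "*" * (len(secret) - 2),
                }
            )
    return sorted(findings, key=lambda f: (f["path"], f["byteRange"], f["ruleId"]))
```

Because the plaintext never leaves `scan_bytes`, downstream consumers only ever see the rule ID, byte range, and masked value.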

CLI, policy, docs

  • Add stella secrets scan [path|image] plus --secrets flag on stella scan to run the analyzer inline. CLI output redacts payloads, shows rule IDs, severity, and remediation hints.
  • Policy Engine ingests secret.leak evidence, including ruleId, confidence, masking.applied, enabling predicates like secret.leak.highConfidence, secret.leak.ruleDisabled. Templates cover severities, approvals, and ticket automation.
  • Documentation updates: scanner module dossier (new analyzer), policy cookbook (rule management), and Offline Kit guide (bundling rule updates).

Testing, telemetry, rollout

  • Rule-pack regression tests ensure deterministic matching and masking; analyzer unit tests cover regex + entropy combos, while integration tests run across sample repositories and OCI layers.
  • Metrics: scanner.secrets.ruleset.version, scanner.secrets.findings.total, scanner.secrets.findings.high_confidence. Logs include rule ID, masked hash, and file digests for auditing.
  • Offline Kit delivers the signed ruleset catalog, upgrade guide, and policy defaults so fully air-gapped tenants can keep pace without internet access.