git.stella-ops.org/docs/modules/scanner/language-analyzers-contract.md
StellaOps Bot 6e45066e37
2025-12-13 09:37:15 +02:00


Scanner Language Analyzer Contracts (Identity / Evidence / Container Layout)

This document freezes the cross-analyzer contracts that are shared by the language analyzers (Java, .NET, Python, Node, Bun). These rules exist to prevent false matches, keep outputs deterministic, and protect against host-path leakage.

1) Identity Safety Contract (PURL vs Explicit Key)

1.1 Goals

  • No fake versions: never encode version ranges, tags, local paths, or git URLs as a versioned PURL.
  • No collisions: explicit-key identities must not collide with concrete PURLs and must be deterministic across OS path separators.
  • Proof-first: emit concrete PURLs only when the analyzer has concrete, replayable evidence for the version.

1.2 When to emit a concrete PURL

Emit a concrete (versioned) PURL only when both of the following are true:

  1. The analyzer can determine a concrete version (ecosystem-specific) for the component.
  2. The version is backed by replayable evidence (e.g., installed artifact metadata or lockfile-resolved entry).

Typical sources that qualify:

  • Installed inventory (e.g., node_modules/**/package.json, Python *.dist-info/METADATA, .NET deps.json entries).
  • Lockfile-resolved inventory (e.g., bun.lock entry with name@version and integrity/resolved URL).

1.3 When to emit an explicit-key component (required)

Emit an explicit-key component when the dependency is declared-only or otherwise non-concrete:

  • Version ranges, operators, and tags (^, ~, >=, <, *, x, latest, etc.).
  • Workspace/link/file dependencies (workspace:*, link:, file:, local path refs, editable installs).
  • Git dependencies (git URL / commit / ref) when a concrete semantic version is not provable from local evidence.
  • Unknown / missing version.

Rule: If the analyzer cannot prove a concrete version from local evidence, it must not emit a versioned PURL for that dependency.
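
The decision in 1.2 and 1.3 can be sketched as a single predicate. This is a minimal illustration, not the analyzers' actual code: the function name and the evidence-kind labels ("installed", "lockfile", "declared") are hypothetical, and it assumes the caller has already resolved the version using ecosystem-specific rules.

```python
from typing import Optional

# Hypothetical labels for where a version came from; the contract does not
# define these strings. Only the first two count as replayable evidence here.
REPLAYABLE_EVIDENCE_KINDS = frozenset({"installed", "lockfile"})


def may_emit_versioned_purl(version: Optional[str], evidence_kind: str) -> bool:
    """Return True only when both conditions of section 1.2 hold.

    `version` is assumed to be an already-resolved concrete version
    (not a range, tag, path, or URL); `evidence_kind` says what backs it.
    """
    has_concrete_version = bool(version and version.strip())
    has_replayable_evidence = evidence_kind in REPLAYABLE_EVIDENCE_KINDS
    return has_concrete_version and has_replayable_evidence
```

When the predicate returns False, the dependency falls under 1.3 and is emitted as an explicit-key component instead.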

1.4 Explicit-key format (canonical)

For declared-only / non-concrete identities, analyzers must emit:

  • componentKey: explicit::<analyzerId>::<ecosystem>::<name>::sha256:<digest>
  • purl: null
  • version: null

Where <digest> is the sha256 of the UTF-8 encoding of the canonical string below, with the four fields joined by a single \n:

<ecosystem>\n<normalizedName>\n<normalizedSpec>\n<originLocator>

Canonicalization rules:

  • <normalizedName> uses ecosystem naming rules (e.g., npm scoped names keep @scope/name).
  • <normalizedSpec> is the original declared specifier (range/tag/url/path), trimmed; for unknown, use "".
  • <originLocator> is project-relative with / separators (e.g., package.json#dependencies, requirements.txt, Directory.Packages.props#PackageVersion:Foo).
  • No absolute paths, drive letters, or host roots appear in any input to the digest.
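
As a sketch of the format above, the key can be computed as follows. Several details are assumptions of this example rather than statements of the contract: the digest is rendered as lowercase hex, the <name> segment of the key is taken to be the normalized name, name normalization is left to the caller, and the absolute-path check is a simple heuristic. The function names are hypothetical.

```python
import hashlib
import re
from typing import Optional

# Heuristic for Windows drive prefixes such as "C:"; an assumption of this sketch.
_DRIVE_PREFIX = re.compile(r"^[A-Za-z]:")


def _require_project_relative(locator: str) -> None:
    """Reject locators that look absolute or use backslash separators."""
    if locator.startswith("/") or "\\" in locator or _DRIVE_PREFIX.match(locator):
        raise ValueError(f"origin locator must be project-relative: {locator!r}")


def explicit_component_key(
    analyzer_id: str,
    ecosystem: str,
    normalized_name: str,
    declared_spec: Optional[str],
    origin_locator: str,
) -> str:
    """Build explicit::<analyzerId>::<ecosystem>::<name>::sha256:<digest>."""
    _require_project_relative(origin_locator)
    # Unknown specs become the empty string; known specs are trimmed only.
    normalized_spec = (declared_spec or "").strip()
    canonical = "\n".join(
        [ecosystem, normalized_name, normalized_spec, origin_locator]
    )
    # Hex encoding of the digest is an assumption; the contract names only sha256.
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"explicit::{analyzer_id}::{ecosystem}::{normalized_name}::sha256:{digest}"
```

For example, explicit_component_key("node", "npm", "@scope/pkg", "^1.2.3", "package.json#dependencies") yields the same key on every OS, because no host path or platform separator enters the digest input. The analyzer id and ecosystem strings in that call are illustrative.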

1.5 Required metadata for explicit-key components

Explicit-key components must include (at minimum) these metadata keys:

  • declaredOnly=true
  • declared.source=<file> (e.g., package.json, Directory.Packages.props)
  • declared.locator=<originLocator> (same string used in digest)
  • declared.versionSpec=<normalizedSpec> (original specifier or empty)
  • declared.scope=<prod|dev|peer|optional|unknown> when applicable
  • declared.sourceType=<range|tag|git|tarball|file|link|workspace|path|editable|unknown>
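
A minimal sketch of assembling this metadata, assuming metadata is a flat string-to-string map (the contract lists keys but not the container type). The helper name is hypothetical; the allowed values are copied from the list above.

```python
from typing import Dict, Optional

ALLOWED_SCOPES = frozenset({"prod", "dev", "peer", "optional", "unknown"})
ALLOWED_SOURCE_TYPES = frozenset({
    "range", "tag", "git", "tarball", "file",
    "link", "workspace", "path", "editable", "unknown",
})


def declared_metadata(
    source: str,
    locator: str,
    version_spec: Optional[str],
    source_type: str,
    scope: Optional[str] = None,
) -> Dict[str, str]:
    """Build the minimum metadata for an explicit-key component."""
    if source_type not in ALLOWED_SOURCE_TYPES:
        raise ValueError(f"unsupported declared.sourceType: {source_type!r}")
    if scope is not None and scope not in ALLOWED_SCOPES:
        raise ValueError(f"unsupported declared.scope: {scope!r}")

    metadata = {
        "declaredOnly": "true",
        "declared.source": source,
        # Must be the same string that fed the digest in section 1.4.
        "declared.locator": locator,
        "declared.versionSpec": (version_spec or "").strip(),
        "declared.sourceType": source_type,
    }
    # declared.scope is emitted only "when applicable".
    if scope is not None:
        metadata["declared.scope"] = scope
    return metadata
```

Validating against the closed value sets keeps a typo such as "workspaces" from reaching the output.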

2) Evidence Locator Contract

2.1 General rules

  • Evidence locators are external-facing and must be stable and parseable.
  • Every locator is project-relative with / separators (never absolute).
  • Evidence content/hashing must be bounded; when bounds are exceeded, emit deterministic skipped markers in metadata rather than silently omitting the evidence.

2.2 Locator formats (canonical)

File evidence

  • locator: <relativePath> (e.g., packages/app/package.json)
  • source: a stable discriminator (e.g., package.json, pom.xml, METADATA)

Lockfile entry evidence

  • locator: <lockfileRelativePath>:<selector>
  • Examples:
    • Node package-lock: package-lock.json:packages/app/node_modules/foo
    • Bun lock: bun.lock:packages[foo@1.2.3]
    • Maven/Gradle lock: gradle.lockfile:com.example:foo:1.2.3

Nested artifact evidence

  • locator: <outer>!<inner>!<path>, with one ! separator per nesting level (a single-level archive yields <outer>!<path>)
  • Example: demo-jni.jar!META-INF/native-image/demo/jni-config.json

Derived evidence

  • locator: a stable synthetic name (e.g., phase22.ndjson)
  • source: a stable synthetic source (e.g., node.observation)
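
The file, lockfile-entry, and nested-artifact formats above could be produced by small builders like these. The function names are hypothetical, and the rejection of ".." segments is an extra safety assumption of this sketch rather than a stated rule.

```python
import re

_DRIVE_PREFIX = re.compile(r"^[A-Za-z]:/")


def file_locator(relative_path: str) -> str:
    """Normalize separators to '/' and reject absolute or escaping paths."""
    path = relative_path.replace("\\", "/")
    if path.startswith("/") or _DRIVE_PREFIX.match(path):
        raise ValueError(f"locator must be project-relative: {relative_path!r}")
    if ".." in path.split("/"):
        raise ValueError(f"locator must not escape the project: {relative_path!r}")
    return path


def lockfile_locator(lockfile_relative_path: str, selector: str) -> str:
    """<lockfileRelativePath>:<selector>"""
    return f"{file_locator(lockfile_relative_path)}:{selector}"


def nested_locator(outer: str, *inner_parts: str) -> str:
    """<outer>!<inner>!<path>, one '!' per nesting level."""
    if not inner_parts:
        raise ValueError("nested locator needs at least one inner part")
    return "!".join([file_locator(outer), *inner_parts])
```

Calling nested_locator("demo-jni.jar", "META-INF/native-image/demo/jni-config.json") reproduces the example given for nested artifact evidence.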

2.3 Hashing rules (baseline)

  • Hash only bounded inputs (default: 1 MiB per evidence value/file; analyzers may choose a tighter cap).
  • Hash algorithm: sha256 over UTF-8 bytes for textual evidence, raw bytes for file evidence.
  • If hashing is skipped due to bounds or errors, emit deterministic metadata markers (e.g., hashSkipped=true, hashSkipped.reason=sizeCap).
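
A minimal sketch of bounded hashing under these rules. The "sha256" result key and the "ioError" reason are hypothetical names chosen for illustration; only hashSkipped, hashSkipped.reason, and sizeCap appear in the contract. Hex output is likewise an assumption.

```python
import hashlib
from typing import Dict

DEFAULT_HASH_CAP_BYTES = 1024 * 1024  # 1 MiB baseline; analyzers may go tighter


def _skipped(reason: str) -> Dict[str, str]:
    """Deterministic marker emitted instead of a hash."""
    return {"hashSkipped": "true", "hashSkipped.reason": reason}


def hash_text_evidence(value: str, cap: int = DEFAULT_HASH_CAP_BYTES) -> Dict[str, str]:
    """sha256 over the UTF-8 bytes of textual evidence, within the cap."""
    data = value.encode("utf-8")
    if len(data) > cap:
        return _skipped("sizeCap")
    return {"sha256": hashlib.sha256(data).hexdigest()}


def hash_file_evidence(path: str, cap: int = DEFAULT_HASH_CAP_BYTES) -> Dict[str, str]:
    """sha256 over the raw bytes of a file, within the cap."""
    try:
        with open(path, "rb") as handle:
            # Read one byte past the cap so oversize files are detected
            # without loading them fully into memory.
            data = handle.read(cap + 1)
    except OSError:
        return _skipped("ioError")  # hypothetical reason name
    if len(data) > cap:
        return _skipped("sizeCap")
    return {"sha256": hashlib.sha256(data).hexdigest()}
```

Returning a marker map in every failure path means the output records that a hash was skipped and why, rather than simply lacking one.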

3) Container Layout Discovery Contract

3.1 Layer root candidates

Language analyzers that support container-root discovery must treat these as candidate roots under the analysis root:

  • layers/* (direct children)
  • .layers/* (direct children; must not be skipped)
  • layer* (direct children of the analysis root, e.g., layer1/, layer2/)

Each candidate root is scanned independently for projects.
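
Candidate enumeration can be sketched as below. The function name is hypothetical. One interpretation is an assumption of this sketch: the layer* pattern also matches the directory named layers, which is excluded here because its children are already candidates in their own right. Symlink handling is out of scope for this fragment and is covered by the traversal rules in 3.2.

```python
import os
from typing import List


def candidate_layer_roots(analysis_root: str) -> List[str]:
    """Return candidate roots as sorted, '/'-separated relative paths."""
    candidates: List[str] = []

    # layers/* and .layers/* (direct children only).
    for container in ("layers", ".layers"):
        base = os.path.join(analysis_root, container)
        if not os.path.isdir(base):
            continue
        for name in os.listdir(base):
            if os.path.isdir(os.path.join(base, name)):
                candidates.append(f"{container}/{name}")

    # layer* directly under the analysis root (e.g. layer1/, layer2/).
    for name in os.listdir(analysis_root):
        if name == "layers":
            continue  # assumption: handled above via its children
        if name.startswith("layer") and os.path.isdir(
            os.path.join(analysis_root, name)
        ):
            candidates.append(name)

    # Ordinal sort keeps the result independent of filesystem order.
    return sorted(candidates)
```

Sorting the final list, rather than relying on os.listdir order, is what makes the enumeration deterministic across filesystems.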

3.2 Bounds and traversal safety (required)

  • Deterministic traversal (sorted directory enumeration).
  • Depth caps per candidate root; hard cap on total discovered project roots.
  • Must never recurse into node_modules/ (Node/Bun) or equivalent heavy dirs.
  • Hidden directories may be skipped, except .layers, which is treated as a top-level candidate root.
  • No symlink escape: if symlinks are followed, resolved targets must remain within the candidate root prefix and cycles must be prevented.
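
The traversal rules above can be combined in one bounded walk. This is a sketch under stated assumptions, not the analyzers' implementation: the function name, the marker-file approach to recognizing a project, the default caps, and the "depthCap" / "projectCap" reason strings are all hypothetical.

```python
import os
from typing import List, Set, Tuple

HEAVY_DIR_NAMES = frozenset({"node_modules"})  # never recursed into


def discover_project_roots(
    candidate_root: str,
    marker_files: Tuple[str, ...] = ("package.json",),
    max_depth: int = 4,
    max_projects: int = 100,
) -> Tuple[List[str], List[str]]:
    """Return (project roots relative to candidate_root, skipped reasons)."""
    root_real = os.path.realpath(candidate_root)
    found: List[str] = []
    skipped: Set[str] = set()
    visited: Set[str] = set()  # resolved paths, for cycle prevention

    def inside_root(real_path: str) -> bool:
        return real_path == root_real or real_path.startswith(root_real + os.sep)

    def walk(rel_dir: str, depth: int) -> None:
        real_dir = os.path.realpath(os.path.join(root_real, rel_dir))
        # No symlink escape, and no revisiting a directory via a cycle.
        if not inside_root(real_dir) or real_dir in visited:
            return
        visited.add(real_dir)
        try:
            names = sorted(os.listdir(real_dir))  # deterministic order
        except OSError:
            return
        if any(os.path.isfile(os.path.join(real_dir, m)) for m in marker_files):
            if len(found) >= max_projects:
                skipped.add("projectCap")
                return
            found.append(rel_dir or ".")
        for name in names:
            if name in HEAVY_DIR_NAMES or name.startswith("."):
                continue
            if not os.path.isdir(os.path.join(real_dir, name)):
                continue
            if depth >= max_depth:
                skipped.add("depthCap")  # record, do not drop silently
                continue
            walk(f"{rel_dir}/{name}" if rel_dir else name, depth + 1)

    walk("", 0)
    return found, sorted(skipped)
```

The second return value lets the caller emit deterministic skipped markers when a cap prevented full traversal, which is the behavior 3.3 asks for even from analyzers without overlay support.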

3.3 Overlay/whiteout semantics

  • If an analyzer implements overlay semantics (notably Python container adapters), whiteouts and precedence rules must be explicit, deterministic, and fixture-tested.
  • If an analyzer does not implement overlay semantics, it must still keep discovery bounded and must not silently drop projects; emit deterministic "skipped" markers when bounds prevent full traversal.

Compliance

Sprints docs/implplan/SPRINT_0403_0001_0001_scanner_java_detection_gaps.md through docs/implplan/SPRINT_0407_0001_0001_scanner_bun_detection_gaps.md (and the program sprint docs/implplan/SPRINT_0408_0001_0001_scanner_language_detection_gaps_program.md) carry the per-analyzer implementation and test evidence required to enforce this contract.