# Scanner Language Analyzer Contracts (Identity / Evidence / Container Layout)

This document freezes the cross-analyzer contracts that are shared by the language analyzers (Java, .NET, Python, Node, Bun). These rules exist to prevent false matches, keep outputs deterministic, and protect against host-path leakage.

## 1) Identity Safety Contract (PURL vs Explicit Key)

### 1.1 Goals

- **No fake versions**: never encode version ranges, tags, local paths, or git URLs as a versioned PURL.
- **No collisions**: explicit-key identities must not collide with concrete PURLs and must be deterministic across OS path separators.
- **Proof-first**: emit concrete PURLs only when the analyzer has concrete, replayable evidence for the version.

### 1.2 When to emit a concrete PURL

Emit a concrete (versioned) PURL only when **both** are true:

1) The analyzer can determine a **concrete version** (ecosystem-specific) for the component.
2) The version is backed by **replayable evidence** (e.g., installed artifact metadata or lockfile-resolved entry).

Typical sources that qualify:

- **Installed inventory** (e.g., `node_modules/**/package.json`, Python `*.dist-info/METADATA`, .NET `deps.json` entries).
- **Lockfile-resolved inventory** (e.g., `bun.lock` entry with `name@version` and integrity/resolved URL).

### 1.3 When to emit an explicit-key component (required)

Emit an explicit-key component when the dependency is **declared-only** or otherwise **non-concrete**:

- Version ranges / operators (`^`, `~`, `>=`, `<`, `*`, `x`, `latest`, etc.).
- Workspace/link/file dependencies (`workspace:*`, `link:`, `file:`, local path refs, editable installs).
- Git dependencies (git URL / commit / ref) when a concrete semantic version is not provable from local evidence.
- Unknown / missing version.

**Rule:** If the analyzer cannot prove a concrete version from local evidence, it must not emit a versioned PURL for that dependency.
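
The decision in sections 1.2 and 1.3 can be sketched as a small predicate. This is an illustrative sketch only, not the analyzers' actual code: the function names `is_concrete_version` and `choose_identity` are hypothetical, and the sketch assumes a semver-like `MAJOR.MINOR.PATCH` shape, whereas the contract leaves the notion of a concrete version ecosystem-specific.

```python
import re
from typing import Optional

# Assumption: a "concrete version" looks like MAJOR.MINOR.PATCH with optional
# pre-release/build suffixes. Real analyzers apply ecosystem-specific rules.
_EXACT_VERSION = re.compile(
    r"\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?"
)


def is_concrete_version(spec: Optional[str]) -> bool:
    """Return True only for an exact version string (no ranges, tags, URLs, paths)."""
    if spec is None:
        return False
    trimmed = spec.strip()
    if not trimmed:
        return False
    # Ranges (^, ~, >=, *, x), tags (latest), and workspace:/link:/file:/git
    # specifiers all fail a full match against the exact-version pattern.
    return _EXACT_VERSION.fullmatch(trimmed) is not None


def choose_identity(spec: Optional[str], has_replayable_evidence: bool) -> str:
    """Pick 'purl' only when both conditions from section 1.2 hold."""
    if has_replayable_evidence and is_concrete_version(spec):
        return "purl"
    return "explicit-key"
```

Note that an exact-looking specifier without replayable evidence (for example, a pinned `1.2.3` that only appears in a manifest) still falls back to the explicit-key path in this sketch, which matches the proof-first goal.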
### 1.4 Explicit-key format (canonical)

For declared-only / non-concrete identities, analyzers must emit:

- `componentKey`: `explicit::::::::sha256:`
- `purl`: `null`
- `version`: `null`

Where `` is `sha256` of the canonical UTF-8 string:

```
\n\n\n
```

Canonicalization rules:

- `` uses ecosystem naming rules (e.g., npm scoped names keep `@scope/name`).
- `` is the **original declared specifier** (range/tag/url/path), trimmed; for unknown, use `""`.
- `` is project-relative with `/` separators (e.g., `package.json#dependencies`, `requirements.txt`, `Directory.Packages.props#PackageVersion:Foo`).
- No absolute paths, drive letters, or host roots appear in any input to the digest.

### 1.5 Required metadata for explicit-key components

Explicit-key components must include (at minimum) these metadata keys:

- `declaredOnly=true`
- `declared.source=` (e.g., `package.json`, `Directory.Packages.props`)
- `declared.locator=` (same string used in digest)
- `declared.versionSpec=` (original specifier or empty)
- `declared.scope=` when applicable
- `declared.sourceType=`

## 2) Evidence Locator Contract

### 2.1 General rules

- Evidence locators are **external-facing** and must be stable and parseable.
- Every locator is **project-relative** with `/` separators (never absolute).
- Evidence content/hashing must be bounded; when bounds are exceeded, emit deterministic `skipped` markers in metadata instead of silently omitting.
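
The project-relative, `/`-separated rule can be illustrated with a small normalizer. This is a hedged sketch under stated assumptions: `normalize_locator` is a hypothetical helper name, it handles only the file-path portion of a locator, and rejecting bad input with `ValueError` is an illustrative choice rather than something the contract prescribes.

```python
import posixpath
import re

# Matches a Windows drive prefix such as "C:".
_DRIVE_PREFIX = re.compile(r"^[A-Za-z]:")


def normalize_locator(relative_path: str) -> str:
    """Normalize a project-relative path to '/' separators, rejecting host paths."""
    # Unify separators first so the result is identical on every OS.
    path = relative_path.strip().replace("\\", "/")
    if not path:
        raise ValueError("empty locator")
    # Absolute paths, UNC paths, and drive letters would leak host layout.
    if path.startswith("/") or _DRIVE_PREFIX.match(path):
        raise ValueError(f"locator must be project-relative: {relative_path!r}")
    normalized = posixpath.normpath(path)
    # After collapsing '.' and '..', the path must still be inside the project.
    if normalized == ".." or normalized.startswith("../"):
        raise ValueError(f"locator escapes the project root: {relative_path!r}")
    return normalized
```

For example, `normalize_locator("packages\\app\\package.json")` and `normalize_locator("./packages/app/package.json")` both yield `packages/app/package.json`, so the same file produces the same locator regardless of the host's path separator.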
### 2.2 Locator formats (canonical)

**File evidence**

- `locator`: `` (e.g., `packages/app/package.json`)
- `source`: a stable discriminator (e.g., `package.json`, `pom.xml`, `METADATA`)

**Lockfile entry evidence**

- `locator`: `:`
- Examples:
  - Node package-lock: `package-lock.json:packages/app/node_modules/foo`
  - Bun lock: `bun.lock:packages[foo@1.2.3]`
  - Maven/Gradle lock: `gradle.lockfile:com.example:foo:1.2.3`

**Nested artifact evidence**

- `locator`: `!!`
- Example: `demo-jni.jar!META-INF/native-image/demo/jni-config.json`

**Derived evidence**

- `locator`: a stable synthetic name (e.g., `phase22.ndjson`)
- `source`: a stable synthetic source (e.g., `node.observation`)

### 2.3 Hashing rules (baseline)

- Hash only bounded inputs (default: 1 MiB per evidence value/file; analyzers may choose a tighter cap).
- Hash algorithm: `sha256` over UTF-8 bytes for textual evidence, raw bytes for file evidence.
- If hashing is skipped due to bounds or errors, emit deterministic metadata markers (e.g., `hashSkipped=true`, `hashSkipped.reason=sizeCap`).

## 3) Container Layout Discovery Contract

### 3.1 Layer root candidates

Language analyzers that support container-root discovery must treat these as **candidate roots** under the analysis root:

- `layers/*` (direct children)
- `.layers/*` (direct children; **must not be skipped**)
- `layer*` (direct children of the analysis root, e.g., `layer1/`, `layer2/`)

Each candidate root is scanned independently for projects.

### 3.2 Bounds and traversal safety (required)

- Deterministic traversal (sorted directory enumeration).
- Depth caps per candidate root; hard cap on total discovered project roots.
- Must never recurse into `node_modules/` (Node/Bun) or equivalent heavy dirs.
- Hidden directories may be skipped **except** `.layers` which is treated as a top-level candidate root.
- No symlink escape: if symlinks are followed, resolved targets must remain within the candidate root prefix and cycles must be prevented.
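
A minimal sketch of candidate-root discovery under the rules in sections 3.1 and 3.2 might look like the following. Several details are assumptions made for illustration and are not stated in the contract: the function name `candidate_layer_roots`, the `max_roots` default of 64, treating a directory named exactly `layers` as a container rather than as a `layer*` match, and the conservative choice not to follow symlinks at all.

```python
import os
from typing import List, Tuple


def candidate_layer_roots(analysis_root: str, max_roots: int = 64) -> Tuple[List[str], bool]:
    """Return ('/'-separated candidate roots, truncated flag) in sorted order."""
    found: List[str] = []
    # Sorted enumeration keeps the result deterministic across file systems.
    for name in sorted(os.listdir(analysis_root)):
        full = os.path.join(analysis_root, name)
        # Assumption: symlinks are simply not followed, which rules out escapes.
        if os.path.islink(full) or not os.path.isdir(full):
            continue
        if name in ("layers", ".layers"):
            # Direct children of layers/ and .layers/ are candidate roots;
            # .layers is hidden but must not be skipped.
            for child in sorted(os.listdir(full)):
                child_full = os.path.join(full, child)
                if os.path.isdir(child_full) and not os.path.islink(child_full):
                    found.append(f"{name}/{child}")
        elif name.startswith("layer"):
            # layer1/, layer2/, ... directly under the analysis root.
            found.append(name)
    # A hard cap bounds the work; the flag lets callers emit a "skipped" marker.
    truncated = len(found) > max_roots
    return found[:max_roots], truncated
```

Returning the `truncated` flag alongside the list is one way to let the caller record a deterministic skipped marker instead of silently dropping roots when the cap is hit.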
### 3.3 Overlay/whiteout semantics

- If an analyzer implements overlay semantics (notably Python container adapters), whiteouts and precedence rules must be explicit, deterministic, and fixture-tested.
- If an analyzer does **not** implement overlay semantics, it must still keep discovery bounded and must not silently drop projects; emit deterministic "skipped" markers when bounds prevent full traversal.

## Compliance

Sprints `docs/implplan/SPRINT_0403_0001_0001_scanner_java_detection_gaps.md` through `docs/implplan/SPRINT_0407_0001_0001_scanner_bun_detection_gaps.md` (and the program sprint `docs/implplan/SPRINT_0408_0001_0001_scanner_language_detection_gaps_program.md`) carry the per-analyzer implementation and test evidence required to enforce this contract.