Scanner Language Analyzer Contracts (Identity / Evidence / Container Layout)
This document freezes the cross-analyzer contracts that are shared by the language analyzers (Java, .NET, Python, Node, Bun). These rules exist to prevent false matches, keep outputs deterministic, and protect against host-path leakage.
1) Identity Safety Contract (PURL vs Explicit Key)
1.1 Goals
- No fake versions: never encode version ranges, tags, local paths, or git URLs as a versioned PURL.
- No collisions: explicit-key identities must not collide with concrete PURLs and must be deterministic across OS path separators.
- Proof-first: emit concrete PURLs only when the analyzer has concrete, replayable evidence for the version.
1.2 When to emit a concrete PURL
Emit a concrete (versioned) PURL only when both are true:
- The analyzer can determine a concrete version (ecosystem-specific) for the component.
- The version is backed by replayable evidence (e.g., installed artifact metadata or lockfile-resolved entry).
Typical sources that qualify:
- Installed inventory (e.g., node_modules/**/package.json, Python *.dist-info/METADATA, .NET deps.json entries).
- Lockfile-resolved inventory (e.g., a bun.lock entry with name@version and integrity/resolved URL).
1.3 When to emit an explicit-key component (required)
Emit an explicit-key component when the dependency is declared-only or otherwise non-concrete:
- Version ranges / operators (^, ~, >=, <, *, x, latest, etc.).
- Workspace/link/file dependencies (workspace:*, link:, file:, local path refs, editable installs).
- Git dependencies (git URL / commit / ref) when a concrete semantic version is not provable from local evidence.
- Unknown / missing version.
Rule: If the analyzer cannot prove a concrete version from local evidence, it must not emit a versioned PURL for that dependency.
1.4 Explicit-key format (canonical)
For declared-only / non-concrete identities, analyzers must emit:
- componentKey: explicit::<analyzerId>::<ecosystem>::<name>::sha256:<digest>
- purl: null
- version: null
Where <digest> is sha256 of the canonical UTF-8 string:
<ecosystem>\n<normalizedName>\n<normalizedSpec>\n<originLocator>
Canonicalization rules:
- <normalizedName> uses ecosystem naming rules (e.g., npm scoped names keep @scope/name).
- <normalizedSpec> is the original declared specifier (range/tag/url/path), trimmed; for unknown, use "".
- <originLocator> is project-relative with / separators (e.g., package.json#dependencies, requirements.txt, Directory.Packages.props#PackageVersion:Foo).
- No absolute paths, drive letters, or host roots appear in any input to the digest.
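The digest construction above can be sketched as follows. This is a minimal illustration, not the analyzers' actual code: the function names explicit_key and normalize_locator are hypothetical, the lowercase hex encoding of the digest is an assumption (the contract does not state the encoding), and the caller is assumed to pass a name that is already normalized for its ecosystem.

```python
import hashlib


def normalize_locator(locator: str) -> str:
    """Force '/' separators and reject host-specific paths (hypothetical helper)."""
    loc = locator.replace("\\", "/")
    has_drive = len(loc) >= 2 and loc[0].isalpha() and loc[1] == ":"
    if loc.startswith("/") or has_drive:
        raise ValueError(f"locator must be project-relative: {locator!r}")
    return loc


def explicit_key(analyzer_id: str, ecosystem: str, name: str,
                 spec: str, origin_locator: str) -> str:
    """Build an explicit-key identity for a declared-only dependency."""
    normalized_spec = (spec or "").strip()  # unknown spec becomes ""
    canonical = "\n".join([
        ecosystem,
        name,  # assumed already normalized by ecosystem rules
        normalized_spec,
        normalize_locator(origin_locator),
    ])
    # Assumption: the digest is rendered as lowercase hex.
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"explicit::{analyzer_id}::{ecosystem}::{name}::sha256:{digest}"
```

Because the locator is normalized before hashing, the same manifest entry should yield the same key whether the analyzer runs on Windows or on a POSIX host.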
1.5 Required metadata for explicit-key components
Explicit-key components must include (at minimum) these metadata keys:
- declaredOnly=true
- declared.source=<file> (e.g., package.json, Directory.Packages.props)
- declared.locator=<originLocator> (same string used in digest)
- declared.versionSpec=<normalizedSpec> (original specifier or empty)
- declared.scope=<prod|dev|peer|optional|unknown> when applicable
- declared.sourceType=<range|tag|git|tarball|file|link|workspace|path|editable|unknown>
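As an illustration of the required keys, the sketch below assembles the metadata for one explicit-key component and validates the enumerated values. The function name declared_metadata and the choice to return keys in sorted order are assumptions made for this example; only the key names and allowed values come from the list above.

```python
from typing import Optional

ALLOWED_SCOPES = {"prod", "dev", "peer", "optional", "unknown"}
ALLOWED_SOURCE_TYPES = {
    "range", "tag", "git", "tarball", "file",
    "link", "workspace", "path", "editable", "unknown",
}


def declared_metadata(source: str, locator: str, version_spec: str,
                      source_type: str, scope: Optional[str] = None) -> dict:
    """Return the minimum metadata for an explicit-key component (hypothetical helper)."""
    if source_type not in ALLOWED_SOURCE_TYPES:
        raise ValueError(f"unsupported sourceType: {source_type!r}")
    meta = {
        "declaredOnly": "true",
        "declared.source": source,
        "declared.locator": locator,  # same string that went into the digest
        "declared.versionSpec": version_spec.strip(),
        "declared.sourceType": source_type,
    }
    if scope is not None:  # scope is only emitted when applicable
        if scope not in ALLOWED_SCOPES:
            raise ValueError(f"unsupported scope: {scope!r}")
        meta["declared.scope"] = scope
    # Sorted keys keep serialized output deterministic (an assumption of this sketch).
    return dict(sorted(meta.items()))
```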
2) Evidence Locator Contract
2.1 General rules
- Evidence locators are external-facing and must be stable and parseable.
- Every locator is project-relative with / separators (never absolute).
- Evidence content/hashing must be bounded; when bounds are exceeded, emit deterministic skipped markers in metadata instead of silently omitting.
2.2 Locator formats (canonical)
File evidence
- locator: <relativePath> (e.g., packages/app/package.json)
- source: a stable discriminator (e.g., package.json, pom.xml, METADATA)
Lockfile entry evidence
- locator: <lockfileRelativePath>:<selector>
- Examples:
  - Node package-lock: package-lock.json:packages/app/node_modules/foo
  - Bun lock: bun.lock:packages[foo@1.2.3]
  - Maven/Gradle lock: gradle.lockfile:com.example:foo:1.2.3
Nested artifact evidence
- locator: <outer>!<inner>!<path>
- Example: demo-jni.jar!META-INF/native-image/demo/jni-config.json
Derived evidence
- locator: a stable synthetic name (e.g., phase22.ndjson)
- source: a stable synthetic source (e.g., node.observation)
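The file, lockfile-entry, and nested-artifact formats can be produced by small helpers such as the ones below. This is a sketch under stated assumptions: the helper names are hypothetical, and the selector and inner segments are assumed to be already canonical for their ecosystem, so they are passed through unchanged.

```python
from pathlib import PurePosixPath, PureWindowsPath


def file_locator(relative_path: str) -> str:
    """Project-relative locator with '/' separators; rejects absolute paths."""
    if relative_path.startswith(("/", "\\")) or PureWindowsPath(relative_path).drive:
        raise ValueError(f"locator must be project-relative: {relative_path!r}")
    # PurePosixPath collapses "./" segments and duplicate slashes.
    return str(PurePosixPath(relative_path.replace("\\", "/")))


def lockfile_locator(lockfile_path: str, selector: str) -> str:
    """<lockfileRelativePath>:<selector>"""
    return f"{file_locator(lockfile_path)}:{selector}"


def nested_locator(outer: str, *inner: str) -> str:
    """<outer>!<inner>!<path>, one '!' per nesting level."""
    return "!".join([file_locator(outer), *inner])
```

For example, lockfile_locator("bun.lock", "packages[foo@1.2.3]") yields the Bun lock form shown above, and file_locator raises on inputs such as C:\repo\package.json instead of leaking a host path.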
2.3 Hashing rules (baseline)
- Hash only bounded inputs (default: 1 MiB per evidence value/file; analyzers may choose a tighter cap).
- Hash algorithm: sha256 over UTF-8 bytes for textual evidence, raw bytes for file evidence.
- If hashing is skipped due to bounds or errors, emit deterministic metadata markers (e.g., hashSkipped=true, hashSkipped.reason=sizeCap).
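A minimal sketch of the bounded-hashing rule follows. The function name hash_evidence and the "sha256" result key are assumptions for illustration; the hashSkipped markers and the 1 MiB default come from the rules above.

```python
import hashlib
from typing import Union

DEFAULT_CAP_BYTES = 1024 * 1024  # 1 MiB baseline; analyzers may choose a tighter cap


def hash_evidence(value: Union[str, bytes], cap: int = DEFAULT_CAP_BYTES) -> dict:
    """Hash one evidence value, or return deterministic skip markers."""
    # Textual evidence is hashed over its UTF-8 bytes, file evidence over raw bytes.
    data = value.encode("utf-8") if isinstance(value, str) else bytes(value)
    if len(data) > cap:
        return {"hashSkipped": "true", "hashSkipped.reason": "sizeCap"}
    # Assumption: the digest is stored as lowercase hex under a "sha256" key.
    return {"sha256": hashlib.sha256(data).hexdigest()}
```

Returning markers instead of raising keeps the output shape identical across runs, so an oversized file produces the same metadata every time rather than disappearing from the result.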
3) Container Layout Discovery Contract
3.1 Layer root candidates
Language analyzers that support container-root discovery must treat these as candidate roots under the analysis root:
- layers/* (direct children)
- .layers/* (direct children; must not be skipped)
- layer* (direct children of the analysis root, e.g., layer1/, layer2/)
Each candidate root is scanned independently for projects.
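Candidate-root enumeration might look like the sketch below. The function name candidate_layer_roots is hypothetical, and excluding the layers/ container itself from the layer* match is an assumption made here to avoid scanning the same tree twice; the contract does not address that case.

```python
import os


def candidate_layer_roots(analysis_root: str) -> list:
    """List candidate layer roots as project-relative paths in a stable order."""
    roots = []
    # layers/* and .layers/*: direct children of each container directory.
    for container in ("layers", ".layers"):
        base = os.path.join(analysis_root, container)
        if os.path.isdir(base):
            for name in sorted(os.listdir(base)):
                if os.path.isdir(os.path.join(base, name)):
                    roots.append(f"{container}/{name}")
    # layer*: direct children of the analysis root (layer1/, layer2/, ...).
    for name in sorted(os.listdir(analysis_root)):
        is_dir = os.path.isdir(os.path.join(analysis_root, name))
        # Assumption: the "layers" container is not itself a layer* root.
        if name.startswith("layer") and name != "layers" and is_dir:
            roots.append(name)
    return roots
```

Sorting each directory listing keeps the result independent of filesystem enumeration order.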
3.2 Bounds and traversal safety (required)
- Deterministic traversal (sorted directory enumeration).
- Depth caps per candidate root; hard cap on total discovered project roots.
- Must never recurse into node_modules/ (Node/Bun) or equivalent heavy dirs.
- Hidden directories may be skipped except .layers which is treated as a top-level candidate root.
- No symlink escape: if symlinks are followed, resolved targets must remain within the candidate root prefix and cycles must be prevented.
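The traversal rules above can be combined into one bounded walk, sketched below for a Node-style project marker. Everything named here is an assumption for illustration: the function discover_projects, the default caps, the package.json marker, and the discoverySkipped metadata keys are not defined by this contract.

```python
import os

HEAVY_DIRS = {"node_modules"}  # never recursed into


def discover_projects(candidate_root, marker="package.json",
                      max_depth=4, max_projects=100):
    """Return (project paths relative to the root, skip markers)."""
    root_real = os.path.realpath(candidate_root)
    found, markers, visited = [], {}, set()

    def skip(reason):
        markers["discoverySkipped"] = "true"
        markers["discoverySkipped.reason"] = reason

    def walk(directory, depth):
        real = os.path.realpath(directory)
        try:
            inside = os.path.commonpath([root_real, real]) == root_real
        except ValueError:  # e.g. different drives on Windows
            inside = False
        if not inside or real in visited:
            return  # symlink escape or cycle
        visited.add(real)
        if os.path.isfile(os.path.join(real, marker)):
            if len(found) >= max_projects:
                skip("projectCap")
                return
            found.append(os.path.relpath(real, root_real).replace(os.sep, "/"))
        for name in sorted(os.listdir(real)):  # deterministic order
            child = os.path.join(real, name)
            if name in HEAVY_DIRS or name.startswith("."):
                continue
            if not os.path.isdir(child):
                continue
            if depth + 1 > max_depth:
                skip("depthCap")
                continue
            walk(child, depth + 1)

    walk(root_real, 0)
    return found, markers
```

Resolving each directory with realpath before the containment check means a symlink that points outside the candidate root is ignored, and the visited set stops symlink cycles inside it.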
3.3 Overlay/whiteout semantics
- If an analyzer implements overlay semantics (notably Python container adapters), whiteouts and precedence rules must be explicit, deterministic, and fixture-tested.
- If an analyzer does not implement overlay semantics, it must still keep discovery bounded and must not silently drop projects; emit deterministic "skipped" markers when bounds prevent full traversal.
Compliance
Sprints docs/implplan/SPRINT_0403_0001_0001_scanner_java_detection_gaps.md through docs/implplan/SPRINT_0407_0001_0001_scanner_bun_detection_gaps.md (and the program sprint docs/implplan/SPRINT_0408_0001_0001_scanner_language_detection_gaps_program.md) carry the per-analyzer implementation and test evidence required to enforce this contract.