git.stella-ops.org/docs/modules/scanner/language-analyzers-contract.md
StellaOps Bot 6e45066e37
2025-12-13 09:37:15 +02:00
# Scanner Language Analyzer Contracts (Identity / Evidence / Container Layout)
This document freezes the cross-analyzer contracts that are shared by the language analyzers (Java, .NET, Python, Node, Bun). These rules exist to prevent false matches, keep outputs deterministic, and protect against host-path leakage.
## 1) Identity Safety Contract (PURL vs Explicit Key)
### 1.1 Goals
- **No fake versions**: never encode version ranges, tags, local paths, or git URLs as a versioned PURL.
- **No collisions**: explicit-key identities must not collide with concrete PURLs and must be deterministic across OS path separators.
- **Proof-first**: emit concrete PURLs only when the analyzer has concrete, replayable evidence for the version.
### 1.2 When to emit a concrete PURL
Emit a concrete (versioned) PURL only when **both** are true:
1) The analyzer can determine a **concrete version** (ecosystem-specific) for the component.
2) The version is backed by **replayable evidence** (e.g., installed artifact metadata or lockfile-resolved entry).
Typical sources that qualify:
- **Installed inventory** (e.g., `node_modules/**/package.json`, Python `*.dist-info/METADATA`, .NET `deps.json` entries).
- **Lockfile-resolved inventory** (e.g., `bun.lock` entry with `name@version` and integrity/resolved URL).
### 1.3 When to emit an explicit-key component (required)
Emit an explicit-key component when the dependency is **declared-only** or otherwise **non-concrete**:
- Version ranges, operators, or tags (`^`, `~`, `>=`, `<`, `*`, `x`, `latest`, etc.).
- Workspace/link/file dependencies (`workspace:*`, `link:`, `file:`, local path refs, editable installs).
- Git dependencies (git URL / commit / ref) when a concrete semantic version is not provable from local evidence.
- Unknown / missing version.
**Rule:** If the analyzer cannot prove a concrete version from local evidence, it must not emit a versioned PURL for that dependency.
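The decision in 1.2 and 1.3 can be sketched as follows. This is a minimal illustration, not the analyzers' actual code: `classify_spec` and `identity_kind` are hypothetical names, and the prefix and regex patterns are assumptions loosely modelled on npm and pip specifier syntax.

```python
import re
from typing import Optional

# Hypothetical patterns (assumptions): range operators / wildcards, and bare tags.
_RANGE = re.compile(r"[\^~<>=*|]|\s-\s|(?:^|\.)[xX](?:\.|$)")
_TAG = re.compile(r"^[A-Za-z][\w.-]*$")


def classify_spec(spec: Optional[str]) -> str:
    """Map a declared specifier to a `declared.sourceType`-style label."""
    s = (spec or "").strip()
    if not s:
        return "unknown"
    for prefix, kind in (("workspace:", "workspace"), ("link:", "link"), ("file:", "file")):
        if s.startswith(prefix):
            return kind
    if s.startswith("-e "):
        return "editable"
    bare = s.split("#", 1)[0]  # drop a trailing "#ref" fragment before suffix checks
    if s.startswith(("git+", "git://", "git@", "github:")) or bare.endswith(".git"):
        return "git"
    if bare.endswith((".tgz", ".tar.gz")):
        return "tarball"
    if s.startswith(("./", "../", "/")):
        return "path"
    if _RANGE.search(s):
        return "range"
    if _TAG.match(s):
        return "tag"
    return "unknown"


def identity_kind(resolved_version: Optional[str], evidence_locator: Optional[str]) -> str:
    """Return "purl" only when a concrete version AND replayable evidence exist."""
    has_version = bool(resolved_version and resolved_version.strip())
    has_evidence = bool(evidence_locator and evidence_locator.strip())
    return "purl" if (has_version and has_evidence) else "explicit-key"
```

Note that the declared specifier never feeds `identity_kind`: in this sketch only a version resolved from installed or lockfile evidence can unlock a versioned PURL.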
### 1.4 Explicit-key format (canonical)
For declared-only / non-concrete identities, analyzers must emit:
- `componentKey`: `explicit::<analyzerId>::<ecosystem>::<name>::sha256:<digest>`
- `purl`: `null`
- `version`: `null`
Where `<digest>` is the `sha256` digest of the canonical UTF-8 string below, in which `\n` denotes a single line-feed separator between the four fields:
```
<ecosystem>\n<normalizedName>\n<normalizedSpec>\n<originLocator>
```
Canonicalization rules:
- `<normalizedName>` uses ecosystem naming rules (e.g., npm scoped names keep `@scope/name`).
- `<normalizedSpec>` is the **original declared specifier** (range/tag/url/path), trimmed; for unknown, use `""`.
- `<originLocator>` is project-relative with `/` separators (e.g., `package.json#dependencies`, `requirements.txt`, `Directory.Packages.props#PackageVersion:Foo`).
- No absolute paths, drive letters, or host roots appear in any input to the digest.
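The key derivation above could look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the digest is rendered as lowercase hex (the contract does not state the encoding), and `name` is expected to be already normalized per ecosystem rules.

```python
import hashlib
import re

_DRIVE = re.compile(r"^[A-Za-z]:/")  # e.g. "C:/..." after separator normalization


def normalize_locator(locator: str) -> str:
    """Force `/` separators and reject absolute or drive-rooted locators."""
    loc = locator.strip().replace("\\", "/")
    if loc.startswith("/") or _DRIVE.match(loc):
        raise ValueError(f"locator must be project-relative: {locator!r}")
    return loc


def explicit_component_key(analyzer_id: str, ecosystem: str, name: str,
                           spec: str, origin_locator: str) -> str:
    normalized_spec = (spec or "").strip()  # unknown spec becomes ""
    locator = normalize_locator(origin_locator)
    canonical = "\n".join([ecosystem, name, normalized_spec, locator])
    # Assumption: lowercase hex digest.
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"explicit::{analyzer_id}::{ecosystem}::{name}::sha256:{digest}"
```

Because the locator is normalized before hashing, `packages\app\package.json#dependencies` and `packages/app/package.json#dependencies` yield the same key, which is what makes the identity stable across OS path separators.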
### 1.5 Required metadata for explicit-key components
Explicit-key components must include (at minimum) these metadata keys:
- `declaredOnly=true`
- `declared.source=<file>` (e.g., `package.json`, `Directory.Packages.props`)
- `declared.locator=<originLocator>` (same string used in digest)
- `declared.versionSpec=<normalizedSpec>` (original specifier or empty)
- `declared.scope=<prod|dev|peer|optional|unknown>` when applicable
- `declared.sourceType=<range|tag|git|tarball|file|link|workspace|path|editable|unknown>`
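As an illustration, the required metadata could be assembled like this. The helper name is hypothetical, and representing every value as a string (including `declaredOnly`) is an assumption; the contract only fixes the keys.

```python
from typing import Dict, Optional

SOURCE_TYPES = {"range", "tag", "git", "tarball", "file", "link",
                "workspace", "path", "editable", "unknown"}
SCOPES = {"prod", "dev", "peer", "optional", "unknown"}


def explicit_key_metadata(source: str, locator: str, version_spec: Optional[str],
                          source_type: str, scope: Optional[str] = None) -> Dict[str, str]:
    """Build the minimum metadata set for an explicit-key component."""
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"unsupported declared.sourceType: {source_type!r}")
    if scope is not None and scope not in SCOPES:
        raise ValueError(f"unsupported declared.scope: {scope!r}")
    metadata = {
        "declaredOnly": "true",
        "declared.source": source,
        "declared.locator": locator,  # must be the same string used in the digest
        "declared.versionSpec": (version_spec or "").strip(),
        "declared.sourceType": source_type,
    }
    if scope is not None:  # scope is emitted only when applicable
        metadata["declared.scope"] = scope
    # Sorted keys keep serialized output deterministic.
    return dict(sorted(metadata.items()))
```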
## 2) Evidence Locator Contract
### 2.1 General rules
- Evidence locators are **external-facing** and must be stable and parseable.
- Every locator is **project-relative** with `/` separators (never absolute).
- Evidence content and hashing must be bounded; when bounds are exceeded, emit deterministic `skipped` markers in metadata instead of silently omitting the evidence.
### 2.2 Locator formats (canonical)
**File evidence**
- `locator`: `<relativePath>` (e.g., `packages/app/package.json`)
- `source`: a stable discriminator (e.g., `package.json`, `pom.xml`, `METADATA`)
**Lockfile entry evidence**
- `locator`: `<lockfileRelativePath>:<selector>`
- Examples:
  - Node package-lock: `package-lock.json:packages/app/node_modules/foo`
  - Bun lock: `bun.lock:packages[foo@1.2.3]`
  - Maven/Gradle lock: `gradle.lockfile:com.example:foo:1.2.3`
**Nested artifact evidence**
- `locator`: `<outer>!<inner>!<path>`
- Example (a single level of nesting): `demo-jni.jar!META-INF/native-image/demo/jni-config.json`
**Derived evidence**
- `locator`: a stable synthetic name (e.g., `phase22.ndjson`)
- `source`: a stable synthetic source (e.g., `node.observation`)
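The locator shapes above can be produced with a few small helpers. This is a sketch with hypothetical function names; it assumes callers hold a project root and a path beneath it, and that lockfile selectors are already in their canonical form.

```python
from pathlib import PurePath


def file_locator(project_root: str, path: str) -> str:
    """Project-relative locator with `/` separators.

    PurePath.relative_to raises ValueError when `path` is outside the root,
    so an absolute host path can never leak into the result.
    """
    return PurePath(path).relative_to(project_root).as_posix()


def lockfile_locator(lockfile_rel: str, selector: str) -> str:
    """`<lockfileRelativePath>:<selector>`"""
    return f"{lockfile_rel.replace(chr(92), '/')}:{selector}"


def nested_locator(outer: str, *inner: str) -> str:
    """Join an outer artifact and its inner segments with `!`."""
    return "!".join((outer,) + inner)
```

For example, `nested_locator("demo-jni.jar", "META-INF/native-image/demo/jni-config.json")` reproduces the nested artifact example given above.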
### 2.3 Hashing rules (baseline)
- Hash only bounded inputs (default: 1 MiB per evidence value/file; analyzers may choose a tighter cap).
- Hash algorithm: `sha256` over UTF-8 bytes for textual evidence, raw bytes for file evidence.
- If hashing is skipped due to bounds or errors, emit deterministic metadata markers (e.g., `hashSkipped=true`, `hashSkipped.reason=sizeCap`).
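A bounded hashing helper along these lines would satisfy the baseline. It is only a sketch: the function name and the `sha256` result key are assumptions, while the `hashSkipped` markers follow the example given above.

```python
import hashlib
from typing import Dict, Union

DEFAULT_CAP_BYTES = 1024 * 1024  # 1 MiB default; analyzers may pass a tighter cap


def hash_evidence(value: Union[str, bytes], cap: int = DEFAULT_CAP_BYTES) -> Dict[str, str]:
    """Hash textual evidence as UTF-8 bytes and file evidence as raw bytes."""
    data = value.encode("utf-8") if isinstance(value, str) else bytes(value)
    if len(data) > cap:
        # Deterministic marker instead of silently dropping the hash.
        return {"hashSkipped": "true", "hashSkipped.reason": "sizeCap"}
    return {"sha256": hashlib.sha256(data).hexdigest()}
```

The cap is checked against the encoded byte length, not the character count, so multi-byte text is bounded by what is actually hashed.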
## 3) Container Layout Discovery Contract
### 3.1 Layer root candidates
Language analyzers that support container-root discovery must treat these as **candidate roots** under the analysis root:
- `layers/*` (direct children)
- `.layers/*` (direct children; **must not be skipped**)
- `layer*` (direct children of the analysis root, e.g., `layer1/`, `layer2/`)
Each candidate root is scanned independently for projects.
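Candidate enumeration might be sketched as below. The function name is hypothetical, and one point is an assumption rather than part of the contract: `layers` itself is excluded from the `layer*` match because its children are already candidates.

```python
import os
from typing import List


def candidate_layer_roots(analysis_root: str) -> List[str]:
    """Return candidate roots as sorted, `/`-separated relative paths."""
    roots = []
    # Direct children of `layers/` and `.layers/` (the hidden one is not skipped).
    for parent in ("layers", ".layers"):
        base = os.path.join(analysis_root, parent)
        if os.path.isdir(base):
            for name in os.listdir(base):
                if os.path.isdir(os.path.join(base, name)):
                    roots.append(f"{parent}/{name}")
    # `layer*` directories directly under the analysis root.
    for name in os.listdir(analysis_root):
        if name.startswith("layer") and name != "layers":  # assumption, see lead-in
            if os.path.isdir(os.path.join(analysis_root, name)):
                roots.append(name)
    return sorted(roots)  # sorted output keeps discovery deterministic
```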
### 3.2 Bounds and traversal safety (required)
- Deterministic traversal (sorted directory enumeration).
- Depth caps per candidate root; hard cap on total discovered project roots.
- Must never recurse into `node_modules/` (Node/Bun) or equivalent heavy dirs.
- Hidden directories may be skipped, **except** `.layers`, which is treated as a top-level candidate root.
- No symlink escape: if symlinks are followed, resolved targets must remain within the candidate root prefix and cycles must be prevented.
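The traversal rules above can be combined into one bounded walk. This is a minimal sketch, not the analyzers' implementation: the function name, the `package.json` marker, the default caps, and the `discoverySkipped` metadata keys are all assumptions chosen for illustration.

```python
import os
from typing import Dict, List, Tuple

HEAVY_DIRS = {"node_modules"}  # never recursed into


def discover_projects(candidate_root: str, marker: str = "package.json",
                      max_depth: int = 4, max_projects: int = 100
                      ) -> Tuple[List[str], Dict[str, str]]:
    root_real = os.path.realpath(candidate_root)
    found: List[str] = []
    skipped: Dict[str, str] = {}
    seen = set()  # resolved directories already visited (cycle guard)

    def mark(reason: str) -> None:
        skipped["discoverySkipped"] = "true"
        skipped.setdefault("discoverySkipped.reason", reason)

    def walk(dir_path: str, depth: int) -> None:
        if len(found) >= max_projects:
            mark("projectCap")
            return
        real = os.path.realpath(dir_path)
        # Symlink escape check: the resolved target must stay under the root.
        if real != root_real and not real.startswith(root_real + os.sep):
            return
        if real in seen:
            return
        seen.add(real)
        try:
            entries = sorted(os.listdir(dir_path))  # deterministic order
        except OSError:
            return
        if marker in entries and os.path.isfile(os.path.join(dir_path, marker)):
            rel = os.path.relpath(dir_path, candidate_root)
            found.append(rel.replace(os.sep, "/"))
        for name in entries:
            if name in HEAVY_DIRS or name.startswith("."):
                continue
            child = os.path.join(dir_path, name)
            if not os.path.isdir(child):
                continue
            if depth + 1 > max_depth:
                mark("depthCap")
                continue
            walk(child, depth + 1)

    walk(candidate_root, 0)
    return found, skipped
```

The returned `skipped` dictionary is empty when traversal completed within bounds, so callers can merge it into component or scan metadata unconditionally.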
### 3.3 Overlay/whiteout semantics
- If an analyzer implements overlay semantics (notably Python container adapters), whiteouts and precedence rules must be explicit, deterministic, and fixture-tested.
- If an analyzer does **not** implement overlay semantics, it must still keep discovery bounded and must not silently drop projects; emit deterministic "skipped" markers when bounds prevent full traversal.
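For analyzers that do implement overlay semantics, explicit precedence might look like the sketch below. It assumes OCI-style whiteout naming (`.wh.<name>` removes a lower-layer entry, `.wh..wh..opq` hides all lower-layer content of its directory); the contract itself does not prescribe these names, and `merge_layers` is a hypothetical helper operating on in-memory path lists.

```python
import posixpath
from typing import Iterable, List

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def merge_layers(layers: Iterable[Iterable[str]]) -> List[str]:
    """Merge layers (lowest first) into the sorted list of visible paths."""
    visible = set()
    for layer in layers:
        entries = sorted(layer)  # deterministic processing order
        # 1) Apply this layer's whiteouts to everything from lower layers.
        for path in entries:
            parent, name = posixpath.split(path)
            if name == OPAQUE_MARKER:
                prefix = parent + "/" if parent else ""
                visible = {p for p in visible if not p.startswith(prefix)}
            elif name.startswith(WHITEOUT_PREFIX):
                target = posixpath.join(parent, name[len(WHITEOUT_PREFIX):])
                visible = {p for p in visible
                           if p != target and not p.startswith(target + "/")}
        # 2) Add this layer's real entries; whiteout markers are never visible.
        for path in entries:
            if not posixpath.basename(path).startswith(WHITEOUT_PREFIX):
                visible.add(path)
    return sorted(visible)
```

Applying whiteouts before adding the same layer's entries means a layer can replace a directory's contents in one step, which is the kind of precedence rule that should be pinned down by fixtures.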
## Compliance
Sprints `docs/implplan/SPRINT_0403_0001_0001_scanner_java_detection_gaps.md` through `docs/implplan/SPRINT_0407_0001_0001_scanner_bun_detection_gaps.md` (and the program sprint `docs/implplan/SPRINT_0408_0001_0001_scanner_language_detection_gaps_program.md`) carry the per-analyzer implementation and test evidence required to enforce this contract.