# Java Analyzer (Scanner)

## What it does

- Inventories Maven coordinates from JVM archives (JAR/WAR/EAR/fat JAR) without executing build tools.
- Prefers installed artifact metadata (`META-INF/maven/**/pom.properties`), with a `pom.xml` fallback when properties are missing.
- Enriches output with bounded embedded-library scan metadata and JNI usage hints.

## Inputs and precedence

1. **Installed archive inventory**: parse Maven coordinates from `META-INF/maven/**/pom.properties` in each discovered archive.
2. **`pom.xml` fallback**: when the archive contains no `pom.properties`, parse `META-INF/maven/**/pom.xml` and emit a Maven PURL only when `groupId`, `artifactId`, and `version` are concrete (no placeholders like `${...}`).
3. **Lock augmentation (current)**: when a lock entry matches an installed artifact, merge lock metadata onto the component; unmatched lock entries still emit declared-only components.
4. **Multi-module lock precedence (pending)**: deterministic precedence rules are tracked in `SCAN-JAVA-403-003` (blocked).
5. **Runtime images (pending)**: runtime component identity is tracked in `SCAN-JAVA-403-004` (blocked).

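
The first two precedence steps can be illustrated with a minimal Python sketch. It is not the analyzer's implementation (which lives in the C# file listed under References); the function names `parse_pom_properties`, `coordinates_from_pom_xml`, and `inventory` are hypothetical, and the sketch assumes coordinates are declared directly on the `<project>` element rather than inherited from a parent POM. For simplicity it applies the same concreteness check to both sources.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

PLACEHOLDER = re.compile(r"\$\{[^}]*\}")          # matches ${...} placeholders
POM_NS = "{http://maven.apache.org/POM/4.0.0}"    # standard Maven POM namespace


def parse_pom_properties(raw: bytes) -> dict:
    """Parse the key=value lines of a pom.properties entry."""
    props = {}
    for line in raw.decode("utf-8", errors="replace").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def coordinates_from_pom_xml(raw: bytes) -> dict:
    """Read groupId/artifactId/version declared directly under <project>."""
    root = ET.fromstring(raw)

    def text(tag):
        node = root.find(POM_NS + tag)
        if node is None:
            node = root.find(tag)  # tolerate POMs without a namespace
        return node.text.strip() if node is not None and node.text else None

    return {k: text(k) for k in ("groupId", "artifactId", "version")}


def is_concrete(value) -> bool:
    """A value is concrete when it is non-empty and has no ${...} placeholder."""
    return bool(value) and not PLACEHOLDER.search(value)


def inventory(archive: zipfile.ZipFile) -> list:
    """Return one Maven PURL (or None when unresolved) per metadata entry."""
    names = sorted(n for n in archive.namelist() if n.startswith("META-INF/maven/"))
    prop_entries = [n for n in names if n.endswith("/pom.properties")]
    if prop_entries:
        coords = [parse_pom_properties(archive.read(n)) for n in prop_entries]
    else:
        # Fallback: only consulted when the archive has no pom.properties.
        xml_entries = [n for n in names if n.endswith("/pom.xml")]
        coords = [coordinates_from_pom_xml(archive.read(n)) for n in xml_entries]
    purls = []
    for c in coords:
        g, a, v = c.get("groupId"), c.get("artifactId"), c.get("version")
        ok = all(is_concrete(x) for x in (g, a, v))
        purls.append(f"pkg:maven/{g}/{a}@{v}" if ok else None)
    return purls
```

Sorting the entry names keeps the output order stable across runs, in the same spirit as the deterministic behaviour described elsewhere in this document. A `None` result stands in for the unresolved-coordinates case; how that case is actually emitted is covered in the section on incomplete coordinates.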
## Embedded archives (fat JAR / WAR / EAR layouts)

The analyzer scans embedded library jars without extracting them to disk:

- `BOOT-INF/lib/*.jar`
- `WEB-INF/lib/*.jar`
- `APP-INF/lib/*.jar`
- `lib/*.jar`

### Locator format

Evidence locators are nested deterministically using `!` separators:

- `outer.jar!BOOT-INF/lib/inner.jar!META-INF/maven/.../pom.properties`

### Bounds and skip markers

Embedded scanning is bounded and deterministic:

- Max embedded jars per archive: `256`
- Max embedded jar bytes: `25 MiB`

When embedded scanning is skipped or truncated, the outer component metadata includes deterministic markers:

- `embeddedScan.candidateJars`, `embeddedScan.scannedJars`, `embeddedScan.emittedComponents`
- `embeddedScanSkipped=true`, `embeddedScan.skippedJars`, `embeddedScanSkipReasons=<...>` (when applicable)

Embedded components include:

- `embedded=true`
- `embedded.containerJarPath=`
- `embedded.entryPath=`

## Evidence and hashing

- Evidence locators are project-relative, use `/` separators, and use `!` for nested artifact paths.
- `sha256` for `pom.properties` and `pom.xml` evidence is computed over the raw entry bytes.

## `pom.xml` with incomplete coordinates

When `pom.xml` is present but coordinates are incomplete (missing values or `${...}` placeholders), the analyzer emits an explicit-key component:

- `purl=null`, `version=null`
- `metadata.unresolvedCoordinates=true`
- `componentKey` follows the cross-analyzer explicit-key scheme via `LanguageExplicitKey.Create("java", "maven", ...)`

## JNI metadata (bytecode-based)

JNI hints are derived from parsed bytecode (native method flags and load call sites), not raw ASCII scanning.
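
To make "native method flags" concrete, the following Python sketch walks a class file's structure and counts methods whose access flags include `ACC_NATIVE` (`0x0100` in the JVM class file format). It is only an illustration under stated assumptions: the function names `count_native_methods` and `_skip_members` are hypothetical, detection of load call sites is not covered, and the analyzer's actual C# implementation may differ.

```python
import struct

ACC_NATIVE = 0x0100  # method access flag from the JVM class file format

# Payload size in bytes (after the 1-byte tag) of fixed-size constant-pool entries.
_CP_SIZES = {3: 4, 4: 4, 5: 8, 6: 8, 7: 2, 8: 2, 9: 4, 10: 4, 11: 4,
             12: 4, 15: 3, 16: 2, 17: 4, 18: 4, 19: 2, 20: 2}


def _skip_members(data: bytes, pos: int):
    """Walk a fields or methods table; return (access_flags_list, new_pos)."""
    count, = struct.unpack_from(">H", data, pos)
    pos += 2
    flags = []
    for _member in range(count):
        access, _name, _desc, attr_count = struct.unpack_from(">HHHH", data, pos)
        pos += 8
        for _attr in range(attr_count):
            _attr_name, attr_len = struct.unpack_from(">HI", data, pos)
            pos += 6 + attr_len  # u2 name index + u4 length + attribute body
        flags.append(access)
    return flags, pos


def count_native_methods(data: bytes) -> int:
    """Count methods declared `native` in a single .class file."""
    magic, = struct.unpack_from(">I", data, 0)
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file")
    pos = 8  # skip magic, minor_version, major_version
    cp_count, = struct.unpack_from(">H", data, pos)
    pos += 2
    index = 1  # constant-pool indices start at 1
    while index < cp_count:
        tag = data[pos]
        pos += 1
        if tag == 1:  # CONSTANT_Utf8: u2 length followed by that many bytes
            length, = struct.unpack_from(">H", data, pos)
            pos += 2 + length
        else:
            pos += _CP_SIZES[tag]
        index += 2 if tag in (5, 6) else 1  # long/double occupy two slots
    pos += 6  # access_flags, this_class, super_class
    iface_count, = struct.unpack_from(">H", data, pos)
    pos += 2 + 2 * iface_count
    _field_flags, pos = _skip_members(data, pos)
    method_flags, pos = _skip_members(data, pos)
    return sum(1 for f in method_flags if f & ACC_NATIVE)
```

Reading the flag from the parsed method table avoids the false positives that a raw byte search can produce, for example when a string constant happens to contain the text `native`.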

When bytecode analysis finds JNI edges (`jni.edgeCount > 0`), components are annotated with bounded, deterministic metadata:

- `jni.edgeCount`, `jni.nativeMethodCount`, `jni.loadCallCount`, optional `jni.warningCount`
- `jni.reasons` (distinct reason codes)
- `jni.targetLibraries` (top-N stable sample; currently 12)

## Known limitations

- Shaded jars that strip Maven metadata remain best-effort; embedded libs without Maven metadata do not emit components.
- Gradle multi-module lock precedence and runtime image component identity remain blocked until explicit decisions land.

## References

- Sprint: `docs/implplan/SPRINT_0403_0001_0001_scanner_java_detection_gaps.md`
- Cross-analyzer contract: `docs/modules/scanner/language-analyzers-contract.md`
- Implementation: `src/Scanner/__Libraries/StellaOps.Scanner.Analyzers.Lang.Java/JavaLanguageAnalyzer.cs`