# Python Analyzer (Scanner)

## What it does
- Inventories Python distributions without executing `python`/`pip` (static inspection only).
- Prefers installed distribution metadata (`*.dist-info/`) and validates `RECORD` when present (bounded, streaming IO).
- Emits deterministic component metadata (`pkg.kind`, `pkg.confidence`, `pkg.location`) and evidence locators for replay/audit.

## Inputs and precedence
1. **Installed inventory (preferred)**: detect site-packages roots and parse `*.dist-info/` / `*.egg-info/` metadata for concrete `pkg:pypi/@` components.
2. **Archive inventory**: mount wheels (`*.whl`) and zipapps (`*.pyz`, `*.pyzw`) into the Python VFS and enrich any in-archive `*.dist-info/` metadata (including `RECORD` verification).
3. **Lock augmentation (current)**: parse root-level `requirements*.txt` pinned entries (`==`/`===`), `Pipfile.lock` `default` section, and `poetry.lock`; when a lock entry matches an installed component, merge lock metadata.
4. **Declared-only (current)**: lock entries not present in installed inventory still emit components:
   - concrete versions emit a versioned `pkg:pypi/...@` PURL
   - non-concrete declarations (e.g., editable paths) emit explicit-key components (see Identity Rules)

## Project discovery (including container roots)
The analyzer is layout-aware and bounded:
- Virtualenv layout roots are detected via `pyvenv.cfg` or `venv/`-style directories.
- Site-packages roots include `lib/python*/site-packages` and `lib/python*/dist-packages`.
- Container unpack layouts are supported as additional candidate roots:
  - `layers/*` (direct children)
  - `.layers/*` (direct children)
  - `layer*` (direct children of the analysis root)

## Virtual filesystem (VFS) and determinism
- Inputs are normalized deterministically (dedupe + stable ordering); later/higher-confidence inputs override earlier ones in the VFS overlay.
- Archive virtual roots are stable and collision-safe:
  - `archives/wheel/`
  - `archives/zipapp/`
  - `archives/sdist/`
  - collisions use a deterministic `~N` suffix
- Evidence locators are always analysis-root relative and use `/` separators.

## Identity rules (PURL vs explicit key)
Concrete versions emit a PURL:
- `purl = pkg:pypi/@`

Non-concrete declarations emit an explicit key:
- `componentKey = explicit::::pypi::::sha256:`
- `purl = null`, `version = null`
- generated via `LanguageExplicitKey.Create(...)` and aligned with `docs/modules/scanner/language-analyzers-contract.md`

Editable declarations (from requirements `--editable` / `-e`) normalize the specifier:
- project-relative paths stay relative (`editable-src`)
- absolute/host paths are redacted and never appear in the digest input

## Evidence and metadata
Installed and archive distributions emit evidence for (when present):
- `METADATA`, `RECORD`, `WHEEL`, `INSTALLER`, `entry_points.txt`, `direct_url.json`

`RECORD` verification emits deterministic counters:
- `record.totalEntries`, `record.hashedEntries`, `record.missingFiles`, `record.hashMismatches`, `record.ioErrors`
- plus `record.unsupportedAlgorithms` when algorithms outside the supported set are present

Declared-only/lock-only components include:
- `declaredOnly=true`
- `lockSource`, `lockLocator`, optional `lockResolved`, `lockIndex`, `lockExtras`, `lockEditablePath`

## Container overlay semantics (pending contract)
When scanning raw OCI layer trees, correct overlay/whiteout handling is contract-driven. Until that contract lands, treat per-layer inventory as best-effort and do not rely on it as a merged-rootfs truth source.

## Vendored/bundled packages (pending contract)
Vendored directory signals are detected but representation (separate components vs parent-only metadata) is contract-driven to avoid false vulnerability joins.

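
## Illustrative sketch: `RECORD` verification counters
The `RECORD` verification counters listed under Evidence and metadata can be pictured with a minimal Python sketch. This is an illustration only, not the analyzer's implementation (which lives in the C# file listed under References). The function name `verify_record`, the `SUPPORTED_ALGORITHMS` set, the chunk size, and the exact rule for when each counter increments are all assumptions made for the example. The only externally grounded parts are the standard `RECORD` layout (CSV rows of `path,algorithm=digest,size`, with an unpadded URL-safe base64 digest) and the counter names from this document.

```python
import base64
import csv
import hashlib
from pathlib import Path

# Assumption: illustrative supported set; the analyzer's real set is not stated here.
SUPPORTED_ALGORITHMS = {"sha256", "sha384", "sha512"}
CHUNK_SIZE = 64 * 1024  # assumption: fixed-size reads keep memory use bounded


def _digest(path, algorithm):
    """Stream the file in chunks and return its unpadded URL-safe base64 digest."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            hasher.update(chunk)
    return base64.urlsafe_b64encode(hasher.digest()).rstrip(b"=").decode("ascii")


def verify_record(site_packages, record_path):
    """Hypothetical helper: check RECORD rows against files under site_packages."""
    counters = {
        "record.totalEntries": 0,
        "record.hashedEntries": 0,
        "record.missingFiles": 0,
        "record.hashMismatches": 0,
        "record.ioErrors": 0,
        "record.unsupportedAlgorithms": 0,
    }
    with record_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if not row:
                continue
            counters["record.totalEntries"] += 1
            hash_field = row[1] if len(row) > 1 else ""
            if not hash_field:
                continue  # rows without a hash (RECORD itself, for example)
            algorithm, _, expected = hash_field.partition("=")
            if algorithm not in SUPPORTED_ALGORITHMS:
                counters["record.unsupportedAlgorithms"] += 1
                continue
            counters["record.hashedEntries"] += 1
            target = site_packages / row[0]
            if not target.is_file():
                counters["record.missingFiles"] += 1
                continue
            try:
                actual = _digest(target, algorithm)
            except OSError:
                counters["record.ioErrors"] += 1
                continue
            if actual != expected.rstrip("="):
                counters["record.hashMismatches"] += 1
    return counters
```

Calling `verify_record(Path("site-packages"), Path("site-packages/pkg-1.0.dist-info/RECORD"))` returns a plain dictionary keyed by the counter names above. The rows are processed in file order and nothing depends on timestamps or directory enumeration order, so repeated runs over the same tree give the same counters.
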
## References
- Sprint: `docs/implplan/SPRINT_0405_0001_0001_scanner_python_detection_gaps.md`
- Cross-analyzer contract: `docs/modules/scanner/language-analyzers-contract.md`
- Implementation: `src/Scanner/__Libraries/StellaOps.Scanner.Analyzers.Lang.Python/PythonLanguageAnalyzer.cs`