# Python Analyzer (Scanner)

## What it does

- Inventories Python distributions without executing `python`/`pip` (static inspection only).
- Prefers installed distribution metadata (`*.dist-info/`) and validates `RECORD` when present (bounded, streaming IO).
- Emits deterministic component metadata (`pkg.kind`, `pkg.confidence`, `pkg.location`) and evidence locators for replay/audit.

## Inputs and precedence

1. **Installed inventory (preferred)**: detect site-packages roots and parse `*.dist-info/` / `*.egg-info/` metadata for concrete `pkg:pypi/<name>@<version>` components.
2. **Archive inventory**: mount wheels (`*.whl`) and zipapps (`*.pyz`, `*.pyzw`) into the Python VFS and enrich any in-archive `*.dist-info/` metadata (including `RECORD` verification).
3. **Lock augmentation (current)**: parse root-level `requirements*.txt` pinned entries (`==`/`===`), `Pipfile.lock` `default` section, and `poetry.lock`; when a lock entry matches an installed component, merge lock metadata.
4. **Declared-only (current)**: lock entries not present in installed inventory still emit components:
   - concrete versions emit a versioned `pkg:pypi/...@<version>` PURL
   - non-concrete declarations (e.g., editable paths) emit explicit-key components (see Identity Rules)
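
Steps 3 and 4 can be illustrated with a minimal sketch. It is written in Python for readability only; the analyzer itself is implemented in C# (see References). The names `PINNED`, `parse_pinned`, and `merge_lock` are hypothetical, only `requirements*.txt` pins are handled, and matching lock entries to installed components by normalized name alone is an assumption, not documented behaviour.

```python
import re

# Hypothetical pattern for pinned entries: "name==1.2.3" or "name===1.2.3".
# Extras are skipped; markers, hashes and trailing comments are ignored.
PINNED = re.compile(
    r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[[^\]]*\])?\s*={2,3}\s*([^\s;#\\]+)"
)


def normalize_name(name: str) -> str:
    """PEP 503 style normalization (assumed): lowercase, runs of -_. become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pinned(text: str) -> dict[str, str]:
    """Collect concrete pins; ranges, wildcards and -e lines are skipped here."""
    pins = {}
    for line in text.splitlines():
        m = PINNED.match(line)
        if m and "*" not in m.group(2):
            pins[normalize_name(m.group(1))] = m.group(2)
    return pins


def merge_lock(installed: dict[str, str], pins: dict[str, str], lock_source: str) -> list[dict]:
    """Installed inventory wins; unmatched pins become declared-only components."""
    components = []
    for name in sorted(installed.keys() | pins.keys()):  # stable ordering
        if name in installed:
            comp = {"purl": f"pkg:pypi/{name}@{installed[name]}"}
            if name in pins:
                comp["lockSource"] = lock_source  # merged lock metadata
        else:
            comp = {
                "purl": f"pkg:pypi/{name}@{pins[name]}",
                "declaredOnly": True,
                "lockSource": lock_source,
            }
        components.append(comp)
    return components
```

In this sketch, an installed `requests` that is also pinned gains `lockSource`, while a pinned package with no installed metadata is emitted with `declaredOnly` set.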

## Project discovery (including container roots)

The analyzer is layout-aware and bounded:

- Virtualenv layout roots are detected via `pyvenv.cfg` or `venv/`-style directories.
- Site-packages roots include `lib/python*/site-packages` and `lib/python*/dist-packages`.
- Container unpack layouts are supported as additional candidate roots:
  - `layers/*` (direct children)
  - `.layers/*` (direct children)
  - `layer*` (direct children of the analysis root)
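
The discovery rules above can be sketched with fixed, non-recursive glob patterns, which is one way to keep the search bounded. This is an illustrative Python sketch, not the analyzer's C# implementation; `candidate_roots`, `is_virtualenv`, and `site_packages_roots` are hypothetical names, and applying the `lib/python*` patterns relative to each candidate root is an assumption.

```python
from pathlib import Path

LAYER_PATTERNS = ("layers/*", ".layers/*", "layer*")
SITE_PATTERNS = ("lib/python*/site-packages", "lib/python*/dist-packages")


def candidate_roots(analysis_root: Path) -> list[Path]:
    """Analysis root plus container-layer directories, deduped and sorted."""
    roots = {analysis_root}
    for pattern in LAYER_PATTERNS:
        # Note: the `layer*` glob also matches a directory literally named
        # `layers`; whether the analyzer filters that out is not stated here.
        roots.update(p for p in analysis_root.glob(pattern) if p.is_dir())
    return sorted(roots, key=lambda p: p.as_posix())


def is_virtualenv(root: Path) -> bool:
    """A `pyvenv.cfg` file marks a virtualenv layout root."""
    return (root / "pyvenv.cfg").is_file()


def site_packages_roots(root: Path) -> list[Path]:
    """Site-packages / dist-packages directories directly under `root`."""
    found = set()
    for pattern in SITE_PATTERNS:
        found.update(p for p in root.glob(pattern) if p.is_dir())
    return sorted(found, key=lambda p: p.as_posix())
```

Because none of the patterns use `**`, the number of directories touched stays proportional to the direct children of each root rather than to the size of the tree.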

## Virtual filesystem (VFS) and determinism

- Inputs are normalized deterministically (dedupe + stable ordering); later/higher-confidence inputs override earlier ones in the VFS overlay.
- Archive virtual roots are stable and collision-safe:
  - `archives/wheel/<file>`
  - `archives/zipapp/<file>`
  - `archives/sdist/<file>`
  - collisions use a deterministic `~N` suffix
- Evidence locators are always analysis-root relative and use `/` separators.
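
A small sketch of the virtual-root and locator rules above, in Python for illustration only. `assign_virtual_roots` and `evidence_locator` are hypothetical helpers; the starting value of `N` in the `~N` suffix and the sort key used for stable ordering are assumptions.

```python
from pathlib import PurePosixPath


def assign_virtual_roots(archives: list[tuple[str, str]]) -> dict[tuple[str, str], str]:
    """Map (kind, source path) to a virtual root such as archives/wheel/<file>.

    Inputs are deduplicated and sorted first, so the same set of archives
    always receives the same roots regardless of discovery order.
    """
    assigned = {}
    used = set()
    for kind, source in sorted(set(archives)):
        base = f"archives/{kind}/{PurePosixPath(source).name}"
        candidate, n = base, 1
        while candidate in used:  # collision: append a deterministic ~N suffix
            candidate = f"{base}~{n}"
            n += 1
        used.add(candidate)
        assigned[(kind, source)] = candidate
    return assigned


def evidence_locator(analysis_root: str, path: str) -> str:
    """Return `path` relative to the analysis root, using '/' separators."""
    root = analysis_root.replace("\\", "/").rstrip("/")
    full = path.replace("\\", "/")
    if not full.startswith(root + "/"):
        raise ValueError("path is outside the analysis root")
    return full[len(root) + 1:]
```

Two wheels with the same file name in different directories end up as `archives/wheel/<file>` and `archives/wheel/<file>~1` in this sketch, and the assignment does not change if the inputs arrive in a different order.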

## Identity rules (PURL vs explicit key)

Concrete versions emit a PURL:

- `purl = pkg:pypi/<normalizedName>@<version>`

Non-concrete declarations emit an explicit key:

- `componentKey = explicit::<analyzerId>::pypi::<name>::sha256:<digest>`
- `purl = null`, `version = null`
- generated via `LanguageExplicitKey.Create(...)` and aligned with `docs/modules/scanner/language-analyzers-contract.md`

Editable declarations (from requirements `--editable` / `-e`) normalize the specifier:

- project-relative paths stay relative (`editable-src`)
- absolute/host paths are redacted and never appear in the digest input
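
The two identity shapes can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the real key comes from `LanguageExplicitKey.Create(...)`, whose digest input is defined by the cross-analyzer contract and is not reproduced in this document. The digest input format, the `REDACTED` placeholder, the `python` analyzer id used in the usage note, and the use of PEP 503 style name normalization are all assumptions.

```python
import hashlib
import re

REDACTED = "<redacted>"  # hypothetical placeholder for absolute/host paths


def normalize_name(name: str) -> str:
    """PEP 503 style normalization (assumed for <normalizedName>)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def purl(name: str, version: str) -> str:
    """Concrete versions: pkg:pypi/<normalizedName>@<version>."""
    return f"pkg:pypi/{normalize_name(name)}@{version}"


def normalize_editable(spec: str) -> str:
    """Keep project-relative paths; redact absolute or home-relative ones."""
    path = spec.strip().replace("\\", "/")
    absolute = path.startswith(("/", "~")) or re.match(r"[A-Za-z]:/", path) is not None
    if absolute:
        return REDACTED
    return path[2:] if path.startswith("./") else path


def explicit_key(analyzer_id: str, name: str, spec: str) -> str:
    """Non-concrete declarations: explicit::<analyzerId>::pypi::<name>::sha256:<digest>."""
    # Assumed digest input: name plus the normalized (possibly redacted)
    # specifier, so a host path never reaches the hash.
    digest_input = f"{name}\n{normalize_editable(spec)}"
    digest = hashlib.sha256(digest_input.encode("utf-8")).hexdigest()
    return f"explicit::{analyzer_id}::pypi::{name}::sha256:{digest}"
```

With this sketch, `explicit_key("python", "mypkg", "/home/alice/pkg")` and `explicit_key("python", "mypkg", "/srv/build/pkg")` produce the same key, because both specifiers are redacted before hashing, which keeps the key stable across build hosts.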

## Evidence and metadata

Installed and archive distributions emit evidence for (when present):

- `METADATA`, `RECORD`, `WHEEL`, `INSTALLER`, `entry_points.txt`, `direct_url.json`

`RECORD` verification emits deterministic counters:

- `record.totalEntries`, `record.hashedEntries`, `record.missingFiles`, `record.hashMismatches`, `record.ioErrors`
- plus `record.unsupportedAlgorithms` when algorithms outside the supported set are present

Declared-only/lock-only components include:

- `declaredOnly=true`
- `lockSource`, `lockLocator`, optional `lockResolved`, `lockIndex`, `lockExtras`, `lockEditablePath`
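
The `RECORD` counters listed in this section can be illustrated with a streaming verifier. The sketch below is in Python and is not the analyzer's implementation; `verify_record` and `SUPPORTED` are hypothetical names, the supported algorithm set is an assumption, and so is the choice to exclude unsupported-algorithm rows from `record.hashedEntries`. The `RECORD` layout itself (CSV rows of `path,algo=urlsafe-base64-digest,size`, with the digest unpadded) follows the Python packaging specification.

```python
import base64
import csv
import hashlib
import io
from pathlib import Path

SUPPORTED = {"sha256", "sha384", "sha512"}  # assumed supported set


def b64_digest(digest: bytes) -> str:
    """RECORD encodes digests as URL-safe base64 without '=' padding."""
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_record(site_packages: Path, record_text: str) -> dict[str, int]:
    names = ("totalEntries", "hashedEntries", "missingFiles",
             "hashMismatches", "ioErrors", "unsupportedAlgorithms")
    c = dict.fromkeys(names, 0)
    for row in csv.reader(io.StringIO(record_text)):
        if not row:
            continue
        c["totalEntries"] += 1
        hash_field = row[1] if len(row) > 1 else ""
        if not hash_field:
            continue  # e.g. the RECORD file itself carries no hash
        algo, _, expected = hash_field.partition("=")
        if algo not in SUPPORTED:
            c["unsupportedAlgorithms"] += 1
            continue
        c["hashedEntries"] += 1
        target = site_packages / row[0]
        if not target.is_file():
            c["missingFiles"] += 1
            continue
        try:
            h = hashlib.new(algo)
            with target.open("rb") as fh:
                # Stream in fixed-size chunks so memory use stays bounded.
                for chunk in iter(lambda: fh.read(65536), b""):
                    h.update(chunk)
        except OSError:
            c["ioErrors"] += 1
            continue
        if b64_digest(h.digest()) != expected:
            c["hashMismatches"] += 1
    out = {f"record.{k}": v for k, v in c.items()}
    if out["record.unsupportedAlgorithms"] == 0:
        del out["record.unsupportedAlgorithms"]  # emitted only when present
    return out
```

The counters depend only on the `RECORD` contents and the files on disk, so repeated runs over the same tree yield identical values.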

## Container overlay semantics (pending contract)

When scanning raw OCI layer trees, correct overlay/whiteout handling is contract-driven. Until that contract lands, treat per-layer inventory as best-effort and do not rely on it as a merged-rootfs truth source.

## Vendored/bundled packages (pending contract)

Vendored directory signals are detected but representation (separate components vs parent-only metadata) is contract-driven to avoid false vulnerability joins.

## References

- Sprint: `docs/implplan/SPRINT_0405_0001_0001_scanner_python_detection_gaps.md`
- Cross-analyzer contract: `docs/modules/scanner/language-analyzers-contract.md`
- Implementation: `src/Scanner/__Libraries/StellaOps.Scanner.Analyzers.Lang.Python/PythonLanguageAnalyzer.cs`