git.stella-ops.org/docs/modules/scanner/analyzers-python.md
StellaOps Bot 6e45066e37
2025-12-13 09:37:15 +02:00

Python Analyzer (Scanner)

What it does

  • Inventories Python distributions by static inspection only; it never executes python or pip.
  • Prefers installed distribution metadata (*.dist-info/) and validates RECORD when present (bounded, streaming IO).
  • Emits deterministic component metadata (pkg.kind, pkg.confidence, pkg.location) and evidence locators for replay/audit.

Inputs and precedence

  1. Installed inventory (preferred): detect site-packages roots and parse *.dist-info/ and *.egg-info/ metadata to emit concrete pkg:pypi/<name>@<version> components.
  2. Archive inventory: mount wheels (*.whl) and zipapps (*.pyz, *.pyzw) into the Python VFS and enrich any in-archive *.dist-info/ metadata (including RECORD verification).
  3. Lock augmentation (current): parse root-level requirements*.txt pinned entries (==/===), Pipfile.lock default section, and poetry.lock; when a lock entry matches an installed component, merge lock metadata.
  4. Declared-only (current): lock entries not present in installed inventory still emit components:
    • concrete versions emit a versioned pkg:pypi/...@<version> PURL
    • non-concrete declarations (e.g., editable paths) emit explicit-key components (see Identity Rules)
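
The lock augmentation and declared-only steps can be sketched in Python for the requirements*.txt case. This is an illustrative sketch, not the analyzer's C# implementation: the function names, the dict shapes, the PEP 503-style name normalization, and the choice to match lock entries on normalized name plus exact version are all assumptions.

```python
import re

# Matches "name==version" or "name===version", tolerating extras such as "pkg[extra]".
# Simplified on purpose; real requirements syntax has more forms.
PIN_RE = re.compile(
    r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[[^\]]*\])?\s*(?:===|==)\s*([^\s;#]+)"
)


def normalize_name(name):
    """PEP 503-style normalization (assumed to match <normalizedName>)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pinned(lines):
    """Return {normalized_name: version} for pinned entries only."""
    pins = {}
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comments, and option lines (-e, -r, --hash, ...).
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        match = PIN_RE.match(line)
        if match and "*" not in match.group(2):  # "==1.*" is not a concrete pin
            pins[normalize_name(match.group(1))] = match.group(2)
    return pins


def merge_inventory(installed, pins, lock_source):
    """Installed components first; unmatched lock entries become declared-only."""
    components = []
    for name in sorted(installed):
        component = {"purl": f"pkg:pypi/{name}@{installed[name]}"}
        if pins.get(name) == installed[name]:  # assumption: match on name + version
            component["lockSource"] = lock_source
        components.append(component)
    for name in sorted(pins):
        if installed.get(name) != pins[name]:
            components.append({
                "purl": f"pkg:pypi/{name}@{pins[name]}",
                "declaredOnly": True,
                "lockSource": lock_source,
            })
    return components
```

Sorting by name before emitting keeps the output independent of input order, in line with the deterministic-output goal stated above.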

Project discovery (including container roots)

The analyzer is layout-aware and bounded:

  • Virtualenv layout roots are detected via pyvenv.cfg or venv/-style directories.
  • Site-packages roots include lib/python*/site-packages and lib/python*/dist-packages.
  • Container unpack layouts are supported as additional candidate roots:
    • layers/* (direct children)
    • .layers/* (direct children)
    • layer* (direct children of the analysis root)
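
A rough Python sketch of this discovery follows. It is not the analyzer's code: the fixed search prefixes (usr, usr/local), the one-level pyvenv.cfg probe, and the exclusion of the layers directory from the layer* match are assumptions chosen to keep the search bounded.

```python
from pathlib import Path

# Hypothetical bounded prefixes under which lib/python*/... is probed.
PREFIXES = ("", "usr", "usr/local")
PATTERNS = ("lib/python*/site-packages", "lib/python*/dist-packages")


def candidate_roots(root):
    """Analysis root plus container unpack layouts (direct children only)."""
    roots = [root]
    for name in ("layers", ".layers"):
        parent = root / name
        if parent.is_dir():
            roots.extend(p for p in parent.iterdir() if p.is_dir())
    # layer* children of the analysis root; "layers" itself is handled above (assumption).
    roots.extend(p for p in root.glob("layer*") if p.is_dir() and p.name != "layers")
    return roots


def venv_roots(base):
    """Directories marked by pyvenv.cfg, in base itself or one level below."""
    configs = list(base.glob("pyvenv.cfg")) + list(base.glob("*/pyvenv.cfg"))
    return [cfg.parent for cfg in configs]


def site_packages_roots(root):
    """Return sorted, root-relative, '/'-separated site-packages roots."""
    found = set()
    for base in candidate_roots(root):
        search_bases = [base / prefix for prefix in PREFIXES] + venv_roots(base)
        for search_base in search_bases:
            if not search_base.is_dir():
                continue
            for pattern in PATTERNS:
                found.update(p for p in search_base.glob(pattern) if p.is_dir())
    return sorted(p.relative_to(root).as_posix() for p in found)
```

Collecting into a set and sorting the relative POSIX paths gives the same result regardless of directory enumeration order.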

Virtual filesystem (VFS) and determinism

  • Inputs are normalized deterministically (deduplicated and stably ordered); in the VFS overlay, later or higher-confidence inputs override earlier ones.
  • Archive virtual roots are stable and collision-safe:
    • archives/wheel/<file>
    • archives/zipapp/<file>
    • archives/sdist/<file>
    • collisions use a deterministic ~N suffix
  • Evidence locators are always analysis-root relative and use / separators.
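
The virtual-root and locator rules can be illustrated with a short Python sketch. The helper names are hypothetical, and the numbering of the ~N suffix (first collision gets ~1) is an assumption; the document only states that the suffix is deterministic.

```python
from pathlib import PurePosixPath


def evidence_locator(analysis_root, path):
    """Return an analysis-root-relative locator that always uses '/' separators."""
    root = PurePosixPath(analysis_root.replace("\\", "/"))
    target = PurePosixPath(path.replace("\\", "/"))
    return target.relative_to(root).as_posix()


def assign_virtual_roots(archives):
    """Map each (kind, relative_path) archive to a stable virtual root.

    Inputs are deduplicated and sorted first, so the same set of archives
    always receives the same roots and the same ~N suffixes.
    """
    used = set()
    assigned = {}
    for kind, rel_path in sorted(set(archives)):
        base = f"archives/{kind}/{PurePosixPath(rel_path).name}"
        candidate, n = base, 0
        while candidate in used:
            n += 1  # assumption: suffixes start at ~1
            candidate = f"{base}~{n}"
        used.add(candidate)
        assigned[rel_path] = candidate
    return assigned
```

For example, two wheels with the same file name under vendor/a and vendor/b would map to archives/wheel/<file> and archives/wheel/<file>~1 in this sketch, whatever order they were discovered in.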

Identity rules (PURL vs explicit key)

Concrete versions emit a PURL:

  • purl = pkg:pypi/<normalizedName>@<version>

Non-concrete declarations emit an explicit key:

  • componentKey = explicit::<analyzerId>::pypi::<name>::sha256:<digest>
  • purl = null, version = null
  • generated via LanguageExplicitKey.Create(...) and aligned with docs/modules/scanner/language-analyzers-contract.md

Editable declarations (from requirements --editable / -e) normalize the specifier:

  • project-relative paths stay relative (editable-src)
  • absolute/host paths are redacted and never appear in the digest input
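
The identity rules can be sketched in Python as below. This is not LanguageExplicitKey.Create(...): the analyzer id value, the redaction placeholder, the digest input layout, and the use of PEP 503-style normalization for the name are assumptions made only to show the shape of the two outcomes. Percent-encoding of the PURL version is omitted for brevity.

```python
import hashlib
import re
from pathlib import PurePosixPath

ANALYZER_ID = "python"      # hypothetical analyzer id
REDACTED = "<redacted>"     # hypothetical placeholder for host paths


def normalize_name(name):
    """PEP 503-style normalization (assumed to match <normalizedName>)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def normalize_editable(specifier):
    """Keep project-relative paths relative; redact absolute/host paths."""
    spec = specifier.replace("\\", "/")
    is_absolute = (
        spec.startswith("/")
        or spec.startswith("file://")
        or re.match(r"^[A-Za-z]:/", spec) is not None
    )
    if is_absolute:
        return REDACTED  # the host path never reaches the digest input
    return PurePosixPath(spec).as_posix()


def identity(name, version=None, editable=None):
    """Return PURL identity for concrete versions, explicit key otherwise."""
    normalized = normalize_name(name)
    if version:
        return {"purl": f"pkg:pypi/{normalized}@{version}", "version": version}
    # Hypothetical digest input: normalized name plus normalized specifier.
    digest_input = f"{normalized}\n{normalize_editable(editable or '')}"
    digest = hashlib.sha256(digest_input.encode("utf-8")).hexdigest()
    key = f"explicit::{ANALYZER_ID}::pypi::{normalized}::sha256:{digest}"
    return {"componentKey": key, "purl": None, "version": None}
```

Because absolute paths collapse to the same placeholder before hashing, two machines with different checkout locations would produce the same key in this sketch.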

Evidence and metadata

Installed and archive distributions emit evidence for (when present):

  • METADATA, RECORD, WHEEL, INSTALLER, entry_points.txt, direct_url.json

RECORD verification emits deterministic counters:

  • record.totalEntries, record.hashedEntries, record.missingFiles, record.hashMismatches, record.ioErrors
  • plus record.unsupportedAlgorithms when algorithms outside the supported set are present
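
A minimal Python sketch of how such counters could be produced is shown here. It relies on the standard RECORD layout (CSV rows of path, algorithm=urlsafe-base64-digest without padding, size); the supported algorithm set, the choice not to count unsupported entries as hashed, and the function name are assumptions, and the analyzer's size and time bounds are not modeled.

```python
import base64
import csv
import hashlib
from pathlib import Path

SUPPORTED = {"sha256", "sha384", "sha512"}  # hypothetical supported set


def verify_record(site_packages, dist_info_name):
    """Stream-verify RECORD entries and return deterministic counters."""
    counters = {
        "record.totalEntries": 0,
        "record.hashedEntries": 0,
        "record.missingFiles": 0,
        "record.hashMismatches": 0,
        "record.ioErrors": 0,
        "record.unsupportedAlgorithms": 0,
    }
    record_path = Path(site_packages) / dist_info_name / "RECORD"
    with record_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if not row:
                continue
            counters["record.totalEntries"] += 1
            hash_spec = row[1] if len(row) > 1 else ""
            if not hash_spec:
                continue  # e.g. the RECORD file itself carries no hash
            algorithm, _, expected = hash_spec.partition("=")
            if algorithm not in SUPPORTED:
                counters["record.unsupportedAlgorithms"] += 1
                continue
            counters["record.hashedEntries"] += 1
            target = Path(site_packages) / row[0]
            if not target.is_file():
                counters["record.missingFiles"] += 1
                continue
            try:
                digest = hashlib.new(algorithm)
                with target.open("rb") as stream:
                    # Read in fixed-size chunks so large files are never fully loaded.
                    for chunk in iter(lambda: stream.read(65536), b""):
                        digest.update(chunk)
            except OSError:
                counters["record.ioErrors"] += 1
                continue
            actual = base64.urlsafe_b64encode(digest.digest()).rstrip(b"=").decode("ascii")
            if actual != expected:
                counters["record.hashMismatches"] += 1
    return counters
```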

Declared-only/lock-only components include:

  • declaredOnly=true
  • lockSource, lockLocator, optional lockResolved, lockIndex, lockExtras, lockEditablePath
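
As an illustration, the sketch below derives declared-only entries from the default section of a Pipfile.lock. The metadata keys come from the list above, but the value formats (for example lockLocator as a root-relative file path and lockExtras as a comma-joined string) and the function name are assumptions, not the analyzer's documented output.

```python
import json


def declared_only_from_pipfile_lock(text, locator="Pipfile.lock"):
    """Build declared-only entries from the 'default' section of a Pipfile.lock."""
    data = json.loads(text)
    components = []
    for name, entry in sorted(data.get("default", {}).items()):
        version = entry.get("version", "")
        if not version.startswith("=="):
            continue  # non-concrete entries follow the explicit-key path instead
        metadata = {
            "declaredOnly": "true",
            "lockSource": "Pipfile.lock",
            "lockLocator": locator,  # assumed: analysis-root-relative path
        }
        if "index" in entry:
            metadata["lockIndex"] = entry["index"]
        if entry.get("extras"):
            metadata["lockExtras"] = ",".join(sorted(entry["extras"]))
        components.append({
            "name": name,
            "version": version[2:],  # strip the leading "=="
            "metadata": metadata,
        })
    return components
```

Optional keys are added only when the lock entry provides them, which mirrors the "optional" wording in the list above.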

Container overlay semantics (pending contract)

When scanning raw OCI layer trees, correct overlay and whiteout handling depends on a contract that has not yet landed. Until it does, treat per-layer inventory as best-effort and do not treat it as the source of truth for the merged root filesystem.

Vendored/bundled packages (pending contract)

Vendored directory signals are detected, but how they are represented (as separate components or as metadata on the parent only) is contract-driven, to avoid false vulnerability joins.

References

  • Sprint: docs/implplan/SPRINT_0405_0001_0001_scanner_python_detection_gaps.md
  • Cross-analyzer contract: docs/modules/scanner/language-analyzers-contract.md
  • Implementation: src/Scanner/__Libraries/StellaOps.Scanner.Analyzers.Lang.Python/PythonLanguageAnalyzer.cs