git.stella-ops.org/docs/modules/scanner/design/native-reachability-plan.md
StellaOps Bot 999e26a48e up
2025-12-13 02:22:15 +02:00

Native Reachability Graph Plan (Scanner · Signals Alignment)

Goals

  • Extract native reachability graphs from ELF binaries across layers (stripped and unstripped), emitting:
    • Build IDs (.note.gnu.build-id) and code IDs per file.
    • Symbol digests (purl+symbol) and edges (callgraph) with deterministic ordering.
    • Synthetic roots for _init, .init_array, .preinit_array, entry points.
    • DSSE graph bundle per layer for Signals ingestion.
  • Offline-friendly, deterministic outputs (stable ordering, UTF-8, UTC).

Inputs

  • Layered filesystem with ELF binaries and shared objects.
  • Layer metadata: digests from scanner.rootfs.layers and scanner.layer.archives (when provided).
  • Optional runtime proc snapshot for reconciliation (if available via Signals pipeline).

Approach

  • Discovery: Walk layer directories; identify ELF binaries (e_ident, machine, class). Record per-layer path.
  • Identifiers: Capture the build-id (the hex-encoded descriptor of the .note.gnu.build-id note); fall back to a SHA-256 of .text when it is absent; store code-id (PE/ELF-friendly string).
  • Symbols: Parse .symtab/.dynsym; compute stable symbol digests (e.g., SHA-256 over symbol bytes + name); include size/address for ordering.
  • Edges: Build callgraph from relocation/import tables and (when available) .eh_frame/.plt linkage; emit Unknown edges when target unresolved.
  • Synthetic Roots: Insert edges from synthetic root nodes (per binary) to _start, _init, .init_array entries.
  • Layer Bundles: Emit DSSE bundle per layer with edges, symbols, identifiers, and provenance (layer digest, path, sha256).
  • Determinism: Sort by layer digest, path, symbol name; normalize paths to POSIX separators; timestamps fixed to generation time in UTC ISO-8601.
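
The discovery and determinism steps above can be sketched as follows. This is a minimal illustration, not the planned StellaOps.Scanner.Analyzers.Native implementation: the function names `identify_elf` and `discover_elf_files` are hypothetical, and only the fixed ELF header layout (magic, EI_CLASS, EI_DATA, e_type, e_machine) is taken from the ELF format itself.

```python
import os
import struct

ELF_MAGIC = b"\x7fELF"
EI_CLASS = {1: "ELF32", 2: "ELF64"}
EI_DATA = {1: "<", 2: ">"}  # struct byte-order prefix for little / big endian


def identify_elf(header):
    """Return basic ELF identity from the first 20 bytes, or None if not ELF."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        return None
    elf_class = EI_CLASS.get(header[4])
    endian = EI_DATA.get(header[5])
    if elf_class is None or endian is None:
        return None
    # e_type and e_machine are two 16-bit fields right after the 16-byte e_ident.
    e_type, e_machine = struct.unpack_from(endian + "HH", header, 16)
    return {
        "class": elf_class,
        "endian": "little" if endian == "<" else "big",
        "e_type": e_type,
        "e_machine": e_machine,
    }


def discover_elf_files(layer_root):
    """Walk one extracted layer and return (posix_path, identity) pairs, sorted by path."""
    found = []
    for dirpath, dirnames, filenames in os.walk(layer_root):
        dirnames.sort()  # stable traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            with open(full, "rb") as handle:
                identity = identify_elf(handle.read(20))
            if identity is not None:
                rel = os.path.relpath(full, layer_root).replace(os.sep, "/")
                found.append((rel, identity))
    return sorted(found, key=lambda item: item[0])
```

Reading only the first 20 bytes keeps discovery cheap; section and symbol parsing would happen later, and only for files that pass this check.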

Deliverables

  • Library: StellaOps.Scanner.Analyzers.Native (new) with ELF reader and graph builder.
  • Tests: fixtures under src/Scanner/__Tests/StellaOps.Scanner.Analyzers.Native.Tests using stripped/unstripped ELF samples (no network).
  • DSSE bundle schema: shared constants/types reused by Signals ingestion.
  • Sprint doc links: referenced from SPRINT_0146_0001_0001_scanner_analyzer_gap_close.md.

Task Backlog (initial)

  1. Skeleton project StellaOps.Scanner.Analyzers.Native + plugin registration for scanner worker.
  2. ELF reader: header detection, build-id extraction, code-id calculation, section loader with deterministic sorting.
  3. Symbol digests: compute sha256(name + addr + size + binding); emit per-symbol evidence and purl+symbol IDs.
  4. Callgraph builder: edges from PLT/relocs/imports; Unknown targets captured; synthetic roots for init arrays.
  5. Layer attribution: carry layer digest/source through evidence; emit DSSE bundle per layer with signatures stubbed for now.
  6. Tests/fixtures: stripped+unstripped ELF, shared objects, missing build-id, init array edges; golden JSON/NDJSON bundles.
  7. Signals alignment: finalize DSSE graph schema and bundle naming; hook into reachability ingestion contract.
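
Backlog item 3 can be illustrated with a short sketch. The plan only states `sha256(name + addr + size + binding)`; the NUL separator, the hexadecimal address formatting, and the `symbol_digest` / `sort_symbols` helper names below are assumptions made for the example, not a settled encoding.

```python
import hashlib


def symbol_digest(name, addr, size, binding):
    """Hash the symbol fields in a fixed order (separator and formats are assumed)."""
    fields = [name, f"0x{addr:x}", str(size), binding]
    material = "\0".join(fields).encode("utf-8")
    return "sha256:" + hashlib.sha256(material).hexdigest()


def sort_symbols(symbols):
    """Order symbols by name, then address, so output does not depend on table order."""
    return sorted(symbols, key=lambda s: (s["name"], s["addr"]))


symbols = [
    {"name": "ssl3_read_bytes", "addr": 0x401000, "size": 256, "binding": "global"},
    {"name": "helper", "addr": 0x401100, "size": 32, "binding": "local"},
]
digests = [
    symbol_digest(s["name"], s["addr"], s["size"], s["binding"])
    for s in sort_symbols(symbols)
]
```

An explicit separator avoids ambiguity between, for example, the name `a1` with address `2` and the name `a` with address `12`, which plain concatenation would not distinguish.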

Open Questions

  • Final DSSE payload shape (Signals team) — currently assumed graph.bundle with edges, symbols, metadata.
  • Whether to include debugline info for coverage (could add optional module later).

8. Native Schema Alignment with richgraph-v1 (Sprint 0401)

Native callgraph output must conform to richgraph-v1 (see docs/contracts/richgraph-v1.md). This section defines the native-specific mappings.

8.1 NativeFunction Node Schema

Maps ELF/PE/Mach-O symbols to richgraph-v1 nodes:

{
  "id": "sym:binary:...",
  "symbol_id": "sym:binary:base64url(sha256(tuple))",
  "lang": "binary",
  "kind": "function",
  "display": "ssl3_read_bytes",
  "code_id": "code:binary:base64url(...)",
  "code_block_hash": "sha256:deadbeef...",
  "symbol": {
    "mangled": "_Z15ssl3_read_bytesP6ssl_stPviijPi",
    "demangled": "ssl3_read_bytes(ssl_st*, void*, int, int, int, int*)",
    "source": "DWARF",
    "confidence": 0.98
  },
  "purl": "pkg:deb/ubuntu/openssl@3.0.2?arch=amd64",
  "build_id": "gnu-build-id:a1b2c3d4e5f6...",
  "symbol_digest": "sha256:...",
  "evidence": ["dynsym", "dwarf"],
  "attributes": {
    "section": ".text",
    "address": "0x401000",
    "size": 256,
    "binding": "global",
    "visibility": "default",
    "elf_type": "STT_FUNC"
  }
}

8.2 SymbolID Construction for Native

Canonical tuple (NUL-separated, per richgraph-v1 §SymbolID):

binary:
  {file_hash}\0{section}\0{addr}\0{name}\0{linkage}\0{code_block_hash?}

Examples:
  sym:binary:base64url(sha256("sha256:abc...\0.text\00x401000\0ssl3_read_bytes\0global\0"))
  sym:binary:base64url(sha256("sha256:abc...\0.text\00x401000\0\0local\0sha256:deadbeef"))  # stripped
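
The tuple construction above can be sketched as follows. In the examples, `\00x401000` reads as a NUL byte followed by the string `0x401000`. The helper name `native_symbol_id` is hypothetical, and stripping the base64url padding is an assumption; the authoritative rules are those in richgraph-v1.

```python
import base64
import hashlib


def native_symbol_id(file_hash, section, addr, name, linkage, code_block_hash=""):
    """Build sym:binary:<base64url(sha256(tuple))> from the NUL-separated tuple."""
    parts = [file_hash, section, f"0x{addr:x}", name, linkage, code_block_hash]
    # An empty trailing field still yields the trailing NUL shown in the first example.
    tuple_bytes = "\0".join(parts).encode("utf-8")
    digest = hashlib.sha256(tuple_bytes).digest()
    # Assumption: unpadded base64url.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sym:binary:" + encoded


named = native_symbol_id("sha256:abc...", ".text", 0x401000, "ssl3_read_bytes", "global")
stripped = native_symbol_id(
    "sha256:abc...", ".text", 0x401000, "", "local", "sha256:deadbeef"
)
```

For a stripped function the name field is empty, so the optional `code_block_hash` is what keeps two functions with identical metadata in different files or builds from colliding.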

8.3 NativeCallEdge Schema

Maps PLT/GOT/relocation-based calls to richgraph-v1 edges:

{
  "from": "sym:binary:...",
  "to": "sym:binary:...",
  "kind": "call",
  "purl": "pkg:deb/ubuntu/openssl@3.0.2?arch=amd64",
  "symbol_digest": "sha256:...",
  "confidence": 0.85,
  "evidence": ["plt", "got", "reloc"],
  "candidates": [],
  "attributes": {
    "reloc_type": "R_X86_64_PLT32",
    "got_offset": "0x602020",
    "plt_index": 42
  }
}

8.4 Edge Kind Mapping

| Native Call Type | richgraph-v1 kind | Confidence | Evidence |
|---|---|---|---|
| Direct call (resolved) | call | 1.0 | ["disasm"] |
| PLT call (resolved) | call | 0.95 | ["plt", "got"] |
| PLT call (unresolved) | indirect | 0.5 | ["plt"] + candidates[] |
| GOT indirect | indirect | 0.6 | ["got", "reloc"] |
| Function pointer | indirect | 0.3 | ["disasm", "heuristic"] |
| Init array entry | init | 1.0 | ["init_array"] |
| TLS constructor | init | 1.0 | ["tls_init"] |
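
One way to keep this mapping in a single place is a lookup table, sketched below. The dictionary keys (`plt_unresolved` and so on) and the `make_edge` helper are hypothetical names chosen for the example; the kinds, confidences, and evidence lists are copied from the mapping above.

```python
# Native call type -> (richgraph-v1 kind, confidence, evidence)
EDGE_KIND_TABLE = {
    "direct_call": ("call", 1.0, ["disasm"]),
    "plt_resolved": ("call", 0.95, ["plt", "got"]),
    "plt_unresolved": ("indirect", 0.5, ["plt"]),
    "got_indirect": ("indirect", 0.6, ["got", "reloc"]),
    "function_pointer": ("indirect", 0.3, ["disasm", "heuristic"]),
    "init_array": ("init", 1.0, ["init_array"]),
    "tls_constructor": ("init", 1.0, ["tls_init"]),
}


def make_edge(call_type, src, dst, candidates=()):
    """Build an edge dict for one detected call; raises KeyError on unknown types."""
    kind, confidence, evidence = EDGE_KIND_TABLE[call_type]
    return {
        "from": src,
        "to": dst,
        "kind": kind,
        "confidence": confidence,
        "evidence": list(evidence),  # copy so callers cannot mutate the table
        # Only unresolved PLT calls carry candidates in the mapping above.
        "candidates": sorted(candidates) if call_type == "plt_unresolved" else [],
    }
```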

8.5 Native Root Nodes

Synthetic roots for native entry points:

{
  "roots": [
    {
      "id": "sym:binary:..._start",
      "phase": "load",
      "source": "e_entry"
    },
    {
      "id": "sym:binary:...main",
      "phase": "runtime",
      "source": "symbol"
    },
    {
      "id": "init:binary:0x401000",
      "phase": "init",
      "source": "DT_INIT_ARRAY[0]"
    },
    {
      "id": "init:binary:0x401020",
      "phase": "init",
      "source": ".ctors[0]"
    }
  ]
}
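
A small sketch of how such a root list could be assembled once the entry point and constructor addresses are known. The `build_roots` function and its parameters are hypothetical; the `phase` and `source` values follow the example above.

```python
def build_roots(start_id, main_id, init_array=(), ctors=()):
    """Assemble the roots document; array order is preserved because it is load order."""
    roots = []
    if start_id:
        roots.append({"id": start_id, "phase": "load", "source": "e_entry"})
    if main_id:
        roots.append({"id": main_id, "phase": "runtime", "source": "symbol"})
    for index, addr in enumerate(init_array):
        roots.append({
            "id": f"init:binary:0x{addr:x}",
            "phase": "init",
            "source": f"DT_INIT_ARRAY[{index}]",
        })
    for index, addr in enumerate(ctors):
        roots.append({
            "id": f"init:binary:0x{addr:x}",
            "phase": "init",
            "source": f".ctors[{index}]",
        })
    return {"roots": roots}


roots_doc = build_roots(
    "sym:binary:..._start", "sym:binary:...main",
    init_array=[0x401000], ctors=[0x401020],
)
```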

8.6 Build ID and Code ID Handling

| Source | build_id format | code_id fallback |
|---|---|---|
| ELF .note.gnu.build-id | gnu-build-id:{hex} | N/A |
| PE Debug Directory | pdb-guid:{guid}:{age} | N/A |
| Mach-O LC_UUID | macho-uuid:{uuid} | N/A |
| Missing build-id | None | sha256:{file_hash} |

When build-id is missing:

  1. Set build_id to null
  2. Set code_id using file hash: code:binary:base64url(sha256("{file_hash}\0{section}\0{addr}\0{size}"))
  3. Add "build_id_source": "FileHash" to attributes
  4. Emit U1 uncertainty state with entropy based on % of symbols missing build-id
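
The ELF case and the fallback steps can be sketched together. The note layout (namesz, descsz, type, then 4-byte-aligned name and descriptor, with type 3 and owner `GNU` for build IDs) comes from the ELF note format; the function names, the hexadecimal address formatting, and the unpadded base64url encoding are assumptions. The U1 uncertainty emission is left out because its entropy formula is not defined here.

```python
import base64
import hashlib
import struct

NT_GNU_BUILD_ID = 3  # note type used for GNU build IDs


def parse_gnu_build_id(note_data, endian="<"):
    """Scan the raw bytes of a note section and return the build ID as hex, or None."""
    offset = 0
    while offset + 12 <= len(note_data):
        namesz, descsz, note_type = struct.unpack_from(endian + "III", note_data, offset)
        offset += 12
        name = note_data[offset:offset + namesz]
        offset += (namesz + 3) & ~3  # name is padded to 4 bytes
        desc = note_data[offset:offset + descsz]
        offset += (descsz + 3) & ~3  # descriptor is padded to 4 bytes
        complete = len(desc) == descsz and descsz > 0
        if note_type == NT_GNU_BUILD_ID and name == b"GNU\0" and complete:
            return desc.hex()
    return None


def fallback_code_id(file_hash, section, addr, size):
    """code:binary:<base64url(sha256(file_hash NUL section NUL addr NUL size))>."""
    parts = [file_hash, section, f"0x{addr:x}", str(size)]
    digest = hashlib.sha256("\0".join(parts).encode("utf-8")).digest()
    return "code:binary:" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def describe_identity(note_data, file_hash, section, addr, size):
    """Return build_id when present; otherwise follow the missing build-id steps."""
    build_id = parse_gnu_build_id(note_data)
    if build_id is not None:
        return {"build_id": f"gnu-build-id:{build_id}", "attributes": {}}
    return {
        "build_id": None,
        "code_id": fallback_code_id(file_hash, section, addr, size),
        "attributes": {"build_id_source": "FileHash"},
    }
```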

8.7 Stripped Binary Handling

For stripped binaries without symbol names:

  1. Synthetic name: sub_{address} (e.g., sub_401000)
  2. Code block hash: SHA-256 of function bytes (sha256:{hex})
  3. Confidence: 0.4 (heuristic function boundary detection)
  4. Evidence: ["heuristic", "cfg"]

Example node:

{
  "id": "sym:binary:...",
  "symbol_id": "sym:binary:...",
  "lang": "binary",
  "kind": "function",
  "display": "sub_401000",
  "code_id": "code:binary:...",
  "code_block_hash": "sha256:deadbeef...",
  "symbol": {
    "mangled": null,
    "demangled": null,
    "source": "NONE",
    "confidence": 0.4
  },
  "evidence": ["heuristic", "cfg"]
}
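
A sketch of how such a node could be produced from a detected function range. The `stripped_function_node` helper is hypothetical, and the identifiers are passed in rather than computed; how function boundaries are detected is outside this example.

```python
import hashlib


def stripped_function_node(symbol_id, code_id, address, function_bytes):
    """Build a node for a function that has no symbol name."""
    block_hash = "sha256:" + hashlib.sha256(function_bytes).hexdigest()
    return {
        "id": symbol_id,
        "symbol_id": symbol_id,
        "lang": "binary",
        "kind": "function",
        "display": f"sub_{address:x}",  # e.g. sub_401000
        "code_id": code_id,
        "code_block_hash": block_hash,
        "symbol": {
            "mangled": None,
            "demangled": None,
            "source": "NONE",
            "confidence": 0.4,  # heuristic boundary detection
        },
        "evidence": ["heuristic", "cfg"],
    }


node = stripped_function_node(
    "sym:binary:...", "code:binary:...", 0x401000, b"\x55\x48\x89\xe5\xc3"
)
```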

8.8 Unknown Edge Targets

When a call target cannot be resolved:

  1. Create synthetic target node with "kind": "unknown"
  2. Add to candidates[] on edge if multiple possibilities
  3. Emit edge with low confidence (0.3)
  4. Register in Unknowns registry

Example edge:

{
  "from": "sym:binary:...caller",
  "to": "unknown:binary:plt_42",
  "kind": "indirect",
  "confidence": 0.3,
  "candidates": [
    "pkg:deb/ubuntu/libssl@3.0.2",
    "pkg:deb/ubuntu/libcrypto@3.0.2"
  ],
  "evidence": ["plt", "unresolved"]
}
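
The four steps can be sketched as one helper. The in-memory `registry` list stands in for the Unknowns registry, whose real interface is not described here, and the function name is hypothetical.

```python
def emit_unknown_edge(caller_id, plt_index, candidates, registry):
    """Create the synthetic target node and low-confidence edge, then record the unknown."""
    target_id = f"unknown:binary:plt_{plt_index}"
    node = {"id": target_id, "kind": "unknown"}
    edge = {
        "from": caller_id,
        "to": target_id,
        "kind": "indirect",
        "confidence": 0.3,
        "candidates": sorted(set(candidates)),  # de-duplicated, stable order
        "evidence": ["plt", "unresolved"],
    }
    registry.append({"target": target_id, "from": caller_id})
    return node, edge


unknowns = []
unknown_node, unknown_edge = emit_unknown_edge(
    "sym:binary:...caller",
    42,
    ["pkg:deb/ubuntu/libssl@3.0.2", "pkg:deb/ubuntu/libcrypto@3.0.2"],
    unknowns,
)
```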

8.9 DSSE Bundle for Native Graphs

Per-layer DSSE bundle structure:

{
  "payloadType": "application/vnd.stellaops.graph+json",
  "payload": "<base64(canonical_graph_json)>",
  "signatures": [
    {
      "keyid": "stellaops:scanner:native:v1",
      "sig": "<base64(signature)>"
    }
  ]
}

Subject path: cas://reachability/graphs/{blake3}
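
A minimal sketch of envelope assembly. The pre-authentication encoding (PAE) follows the DSSE specification; the canonical JSON form (sorted keys, compact separators, UTF-8) is an assumption, since the final payload shape is still open, and the `sign` callable is a placeholder consistent with the backlog note that signatures are stubbed for now. The BLAKE3 subject digest is omitted because it needs a third-party library.

```python
import base64
import json

PAYLOAD_TYPE = "application/vnd.stellaops.graph+json"
KEY_ID = "stellaops:scanner:native:v1"


def canonical_json(graph):
    """Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8."""
    text = json.dumps(graph, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def pae(payload_type, payload):
    """DSSE pre-authentication encoding: the bytes that actually get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)


def build_envelope(graph, sign):
    """Wrap a graph in a DSSE envelope; `sign` maps bytes to signature bytes."""
    payload = canonical_json(graph)
    signature = sign(pae(PAYLOAD_TYPE, payload))
    return {
        "payloadType": PAYLOAD_TYPE,
        "payload": base64.b64encode(payload).decode("ascii"),
        "signatures": [
            {"keyid": KEY_ID, "sig": base64.b64encode(signature).decode("ascii")}
        ],
    }


# Placeholder signer for illustration only; it provides no security.
envelope = build_envelope({"nodes": [], "edges": []}, sign=lambda message: b"stub")
```

Signing the PAE output rather than the raw payload binds the payload type into the signature, so a verifier cannot be tricked into interpreting the same bytes under a different type.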

8.10 Implementation Checklist

  • NativeFunctionNode maps to richgraph-v1 node schema
  • NativeCallEdge maps to richgraph-v1 edge schema
  • SymbolID uses sym:binary: prefix with canonical tuple
  • CodeID uses code:binary: prefix for stripped symbols
  • Graph hash uses BLAKE3-256 (blake3:{hex})
  • Symbol digest uses SHA-256 (sha256:{hex})
  • Init array roots use phase: "init"
  • Missing build-id triggers U1 uncertainty
  • DSSE envelope per layer with stellaops:scanner:native:v1 key