# Native Reachability Graph Plan (Scanner · Signals Alignment)

## Goals
- Extract native reachability graphs from ELF binaries across layers (stripped and unstripped), emitting:
  - Build IDs (`.note.gnu.build-id`) and code IDs per file.
  - Symbol digests (purl+symbol) and edges (callgraph) with deterministic ordering.
  - Synthetic roots for `_init`, `.init_array`, `.preinit_array`, entry points.
  - DSSE graph bundle per layer for Signals ingestion.
- Offline-friendly, deterministic outputs (stable ordering, UTF-8, UTC).

## Inputs
- Layered filesystem with ELF binaries and shared objects.
- Layer metadata: digests from `scanner.rootfs.layers` and `scanner.layer.archives` (when provided).
- Optional runtime proc snapshot for reconciliation (if available via Signals pipeline).

## Approach
- **Discovery**: Walk layer directories; identify ELF binaries (`e_ident`, machine, class). Record per-layer path.
- **Identifiers**: Capture build-id (hash of `.note.gnu.build-id`), fallback to SHA-256 of `.text` when absent; store code-id (PE/ELF-friendly string).
- **Symbols**: Parse `.symtab`/`.dynsym`; compute stable symbol digests (e.g., SHA-256 over symbol bytes + name); include size/address for ordering.
- **Edges**: Build callgraph from relocation/import tables and (when available) `.eh_frame`/`.plt` linkage; emit Unknown edges when target unresolved.
- **Synthetic Roots**: Insert edges from synthetic root nodes (per binary) to `_start`, `_init`, `.init_array` entries.
- **Layer Bundles**: Emit DSSE bundle per layer with edges, symbols, identifiers, and provenance (layer digest, path, sha256).
- **Determinism**: Sort by layer digest, path, symbol name; normalize paths to POSIX separators; timestamps fixed to generation time in UTC ISO-8601.

## Deliverables
- Library: `StellaOps.Scanner.Analyzers.Native` (new) with ELF reader and graph builder.
- Tests: fixtures under `src/Scanner/__Tests/StellaOps.Scanner.Analyzers.Native.Tests` using stripped/unstripped ELF samples (no network).
- DSSE bundle schema: shared constants/types reused by Signals ingestion.
- Sprint doc links: referenced from `SPRINT_0146_0001_0001_scanner_analyzer_gap_close.md`.

## Task Backlog (initial)
1) Skeleton project `StellaOps.Scanner.Analyzers.Native` + plugin registration for scanner worker.
2) ELF reader: header detection, build-id extraction, code-id calculation, section loader with deterministic sorting.
3) Symbol digests: compute `sha256(name + addr + size + binding)`; emit per-symbol evidence and purl+symbol IDs.
4) Callgraph builder: edges from PLT/relocs/imports; Unknown targets captured; synthetic roots for init arrays.
5) Layer attribution: carry layer digest/source through evidence; emit DSSE bundle per layer with signatures stubbed for now.
6) Tests/fixtures: stripped+unstripped ELF, shared objects, missing build-id, init array edges; golden JSON/NDJSON bundles.
7) Signals alignment: finalize DSSE graph schema and bundle naming; hook into reachability ingestion contract.

## Open Questions
- Final DSSE payload shape (Signals team): currently assumed `graph.bundle` with edges, symbols, metadata.
- Whether to include debugline info for coverage (could add optional module later).

---

## 8. Native Schema Alignment with richgraph-v1 (Sprint 0401)

Native callgraph output must conform to `richgraph-v1` (see `docs/contracts/richgraph-v1.md`). This section defines the native-specific mappings.

### 8.1 NativeFunction Node Schema

Maps ELF/PE/Mach-O symbols to richgraph-v1 nodes:

```json
{
  "id": "sym:binary:...",
  "symbol_id": "sym:binary:base64url(sha256(tuple))",
  "lang": "binary",
  "kind": "function",
  "display": "ssl3_read_bytes",
  "code_id": "code:binary:base64url(...)",
  "code_block_hash": "sha256:deadbeef...",
  "symbol": {
    "mangled": "_Z15ssl3_read_bytesP6ssl_stPviijPi",
    "demangled": "ssl3_read_bytes(ssl_st*, void*, int, int, int, int*)",
    "source": "DWARF",
    "confidence": 0.98
  },
  "purl": "pkg:deb/ubuntu/openssl@3.0.2?arch=amd64",
  "build_id": "gnu-build-id:a1b2c3d4e5f6...",
  "symbol_digest": "sha256:...",
  "evidence": ["dynsym", "dwarf"],
  "attributes": {
    "section": ".text",
    "address": "0x401000",
    "size": 256,
    "binding": "global",
    "visibility": "default",
    "elf_type": "STT_FUNC"
  }
}
```

### 8.2 SymbolID Construction for Native

Canonical tuple (NUL-separated, per `richgraph-v1` §SymbolID):

```
binary:
  {file_hash}\0{section}\0{addr}\0{name}\0{linkage}\0{code_block_hash?}

Examples:
  sym:binary:base64url(sha256("sha256:abc...\0.text\00x401000\0ssl3_read_bytes\0global\0"))
  sym:binary:base64url(sha256("sha256:abc...\0.text\00x401000\0\0local\0sha256:deadbeef"))  # stripped
```

### 8.3 NativeCallEdge Schema

Maps PLT/GOT/relocation-based calls to richgraph-v1 edges:

```json
{
  "from": "sym:binary:...",
  "to": "sym:binary:...",
  "kind": "call",
  "purl": "pkg:deb/ubuntu/openssl@3.0.2?arch=amd64",
  "symbol_digest": "sha256:...",
  "confidence": 0.85,
  "evidence": ["plt", "got", "reloc"],
  "candidates": [],
  "attributes": {
    "reloc_type": "R_X86_64_PLT32",
    "got_offset": "0x602020",
    "plt_index": 42
  }
}
```

### 8.4 Edge Kind Mapping

| Native Call Type | richgraph-v1 `kind` | Confidence | Evidence |
|------------------|---------------------|------------|----------|
| Direct call (resolved) | `call` | 1.0 | `["disasm"]` |
| PLT call (resolved) | `call` | 0.95 | `["plt", "got"]` |
| PLT call (unresolved) | `indirect` | 0.5 | `["plt"]` + `candidates[]` |
| GOT indirect | `indirect` | 0.6 | `["got", "reloc"]` |
| Function pointer | `indirect` | 0.3 | `["disasm", "heuristic"]` |
| Init array entry | `init` | 1.0 | `["init_array"]` |
| TLS constructor | `init` | 1.0 | `["tls_init"]` |

### 8.5 Native Root Nodes

Synthetic roots for native entry points:

```json
{
  "roots": [
    { "id": "sym:binary:..._start", "phase": "load", "source": "e_entry" },
    { "id": "sym:binary:...main", "phase": "runtime", "source": "symbol" },
    { "id": "init:binary:0x401000", "phase": "init", "source": "DT_INIT_ARRAY[0]" },
    { "id": "init:binary:0x401020", "phase": "init", "source": ".ctors[0]" }
  ]
}
```

### 8.6 Build ID and Code ID Handling

| Source | build_id format | code_id fallback |
|--------|-----------------|------------------|
| ELF `.note.gnu.build-id` | `gnu-build-id:{hex}` | N/A |
| PE Debug Directory | `pdb-guid:{guid}:{age}` | N/A |
| Mach-O `LC_UUID` | `macho-uuid:{uuid}` | N/A |
| Missing build-id | None | `sha256:{file_hash}` |

When build-id is missing:
1. Set `build_id` to null
2. Set `code_id` using file hash: `code:binary:base64url(sha256("{file_hash}\0{section}\0{addr}\0{size}"))`
3. Add `"build_id_source": "FileHash"` to attributes
4. Emit `U1` uncertainty state with entropy based on % of symbols missing build-id

### 8.7 Stripped Binary Handling

For stripped binaries without symbol names:

1. **Synthetic name:** `sub_{address}` (e.g., `sub_401000`)
2. **Code block hash:** SHA-256 of function bytes (`sha256:{hex}`)
3. **Confidence:** 0.4 (heuristic function boundary detection)
4. **Evidence:** `["heuristic", "cfg"]`

Example node:

```json
{
  "id": "sym:binary:...",
  "symbol_id": "sym:binary:...",
  "lang": "binary",
  "kind": "function",
  "display": "sub_401000",
  "code_id": "code:binary:...",
  "code_block_hash": "sha256:deadbeef...",
  "symbol": {
    "mangled": null,
    "demangled": null,
    "source": "NONE",
    "confidence": 0.4
  },
  "evidence": ["heuristic", "cfg"]
}
```

### 8.8 Unknown Edge Targets

When call target cannot be resolved:

1. Create synthetic target node with `"kind": "unknown"`
2. Add to `candidates[]` on edge if multiple possibilities
3. Emit edge with low confidence (0.3)
4. Register in Unknowns registry

```json
{
  "from": "sym:binary:...caller",
  "to": "unknown:binary:plt_42",
  "kind": "indirect",
  "confidence": 0.3,
  "candidates": [
    "pkg:deb/ubuntu/libssl@3.0.2",
    "pkg:deb/ubuntu/libcrypto@3.0.2"
  ],
  "evidence": ["plt", "unresolved"]
}
```

### 8.9 DSSE Bundle for Native Graphs

Per-layer DSSE bundle structure:

```json
{
  "payloadType": "application/vnd.stellaops.graph+json",
  "payload": "",
  "signatures": [
    {
      "keyid": "stellaops:scanner:native:v1",
      "sig": ""
    }
  ]
}
```

Subject path: `cas://reachability/graphs/{blake3}`

### 8.10 Implementation Checklist

- [ ] `NativeFunctionNode` maps to `richgraph-v1` node schema
- [ ] `NativeCallEdge` maps to `richgraph-v1` edge schema
- [ ] SymbolID uses `sym:binary:` prefix with canonical tuple
- [ ] CodeID uses `code:binary:` prefix for stripped symbols
- [ ] Graph hash uses BLAKE3-256 (`blake3:{hex}`)
- [ ] Symbol digest uses SHA-256 (`sha256:{hex}`)
- [ ] Init array roots use `phase: "init"`
- [ ] Missing build-id triggers U1 uncertainty
- [ ] DSSE envelope per layer with `stellaops:scanner:native:v1` key
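
The SymbolID construction from §8.2 (and the `sym:binary:` checklist item) can be sketched in Python as below. This is an illustrative sketch, not the scanner's implementation: the function name `native_symbol_id`, the UTF-8 encoding of the tuple, the lowercase `0x` address formatting, and the unpadded base64url output are all assumptions that `richgraph-v1` may specify differently. Note that in the §8.2 examples `\00x401000` denotes a NUL separator followed by the address `0x401000`.

```python
import base64
import hashlib


def native_symbol_id(file_hash, section, addr, name="", linkage="global",
                     code_block_hash=""):
    """Build a sym:binary: identifier from the NUL-separated tuple in section 8.2.

    Hypothetical helper: field order follows the canonical tuple
    {file_hash}\\0{section}\\0{addr}\\0{name}\\0{linkage}\\0{code_block_hash?}.
    """
    # Empty fields are kept so the number of NUL separators is always five.
    fields = [file_hash, section, f"0x{addr:x}", name, linkage, code_block_hash]
    canonical = "\0".join(fields).encode("utf-8")  # assumption: UTF-8 bytes
    digest = hashlib.sha256(canonical).digest()
    # Assumption: unpadded base64url; the contract may require padding instead.
    token = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sym:binary:{token}"


if __name__ == "__main__":
    # Named symbol: the optional code_block_hash is left empty.
    print(native_symbol_id("sha256:abc", ".text", 0x401000,
                           "ssl3_read_bytes", "global"))
    # Stripped symbol: empty name, identified by its code block hash.
    print(native_symbol_id("sha256:abc", ".text", 0x401000,
                           "", "local", "sha256:deadbeef"))
```

Joining a fixed list of six fields keeps the trailing NUL when `code_block_hash` is absent, which matches the first §8.2 example and keeps the named and stripped forms unambiguous.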