Native Reachability Graph Plan (Scanner · Signals Alignment)
Goals
- Extract native reachability graphs from ELF binaries across layers (stripped and unstripped), emitting:
  - Build IDs (.note.gnu.build-id) and code IDs per file.
  - Symbol digests (purl+symbol) and edges (callgraph) with deterministic ordering.
  - Synthetic roots for _init, .init_array, .preinit_array, entry points.
  - DSSE graph bundle per layer for Signals ingestion.
- Offline-friendly, deterministic outputs (stable ordering, UTF-8, UTC).
Inputs
- Layered filesystem with ELF binaries and shared objects.
- Layer metadata: digests from scanner.rootfs.layers and scanner.layer.archives (when provided).
- Optional runtime proc snapshot for reconciliation (if available via Signals pipeline).
Approach
- Discovery: Walk layer directories; identify ELF binaries (e_ident, machine, class). Record per-layer path.
- Identifiers: Capture build-id (hash of .note.gnu.build-id), fallback to SHA-256 of .text when absent; store code-id (PE/ELF-friendly string).
- Symbols: Parse .symtab/.dynsym; compute stable symbol digests (e.g., SHA-256 over symbol bytes + name); include size/address for ordering.
- Edges: Build callgraph from relocation/import tables and (when available) .eh_frame/.plt linkage; emit Unknown edges when target unresolved.
- Synthetic Roots: Insert edges from synthetic root nodes (per binary) to _start, _init, .init_array entries.
- Layer Bundles: Emit DSSE bundle per layer with edges, symbols, identifiers, and provenance (layer digest, path, sha256).
- Determinism: Sort by layer digest, path, symbol name; normalize paths to POSIX separators; timestamps fixed to generation time in UTC ISO-8601.
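The Discovery and Identifiers steps can be sketched roughly as follows. This is a minimal illustration, not the analyzer's actual code: the function names identify_elf and parse_gnu_build_id are hypothetical, and the second function assumes the caller has already located the raw bytes of the .note.gnu.build-id section.

```python
import struct

ELF_MAGIC = b"\x7fELF"
NT_GNU_BUILD_ID = 3  # note type used for GNU build IDs


def identify_elf(header: bytes):
    """Return (class, byte order, e_machine) or None if this is not an ELF header."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        return None
    elf_class = {1: "ELF32", 2: "ELF64"}.get(header[4])  # e_ident[EI_CLASS]
    fmt = {1: "<", 2: ">"}.get(header[5])                # e_ident[EI_DATA]
    if elf_class is None or fmt is None:
        return None
    # e_type and e_machine are 16-bit fields right after the 16-byte e_ident.
    _e_type, e_machine = struct.unpack_from(fmt + "HH", header, 16)
    return elf_class, "little" if fmt == "<" else "big", e_machine


def parse_gnu_build_id(note: bytes, fmt: str = "<"):
    """Scan a note section and return the GNU build ID as hex, or None."""
    offset = 0
    while offset + 12 <= len(note):
        namesz, descsz, ntype = struct.unpack_from(fmt + "III", note, offset)
        offset += 12
        name = note[offset:offset + namesz]
        offset += (namesz + 3) & ~3  # name is padded to 4 bytes
        desc = note[offset:offset + descsz]
        offset += (descsz + 3) & ~3  # desc is padded to 4 bytes
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\x00":
            return desc.hex()
    return None
```

A caller would read the first 20 bytes of each candidate file, skip anything for which identify_elf returns None, and only then load sections.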
Deliverables
- Library: StellaOps.Scanner.Analyzers.Native (new) with ELF reader and graph builder.
- Tests: fixtures under src/Scanner/__Tests/StellaOps.Scanner.Analyzers.Native.Tests using stripped/unstripped ELF samples (no network).
- DSSE bundle schema: shared constants/types reused by Signals ingestion.
- Sprint doc links: referenced from SPRINT_0146_0001_0001_scanner_analyzer_gap_close.md.
Task Backlog (initial)
- Skeleton project StellaOps.Scanner.Analyzers.Native + plugin registration for scanner worker.
- ELF reader: header detection, build-id extraction, code-id calculation, section loader with deterministic sorting.
- Symbol digests: compute sha256(name + addr + size + binding); emit per-symbol evidence and purl+symbol IDs.
- Callgraph builder: edges from PLT/relocs/imports; Unknown targets captured; synthetic roots for init arrays.
- Layer attribution: carry layer digest/source through evidence; emit DSSE bundle per layer with signatures stubbed for now.
- Tests/fixtures: stripped+unstripped ELF, shared objects, missing build-id, init array edges; golden JSON/NDJSON bundles.
- Signals alignment: finalize DSSE graph schema and bundle naming; hook into reachability ingestion contract.
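As an illustration of the symbol digest task, the sketch below hashes the four fields named in the backlog. The plan does not specify how the fields are encoded or joined, so the NUL separator, the hex address, and the decimal size are assumptions, as is the function name symbol_digest.

```python
import hashlib


def symbol_digest(name: str, addr: int, size: int, binding: str) -> str:
    """Digest over (name, addr, size, binding); field encoding is an assumption."""
    # NUL-separated fields avoid ambiguity between e.g. ("ab", "c") and ("a", "bc").
    fields = [name, f"0x{addr:x}", str(size), binding]
    material = "\0".join(fields).encode("utf-8")
    return "sha256:" + hashlib.sha256(material).hexdigest()
```

Because the input is built only from the symbol's own fields, the same binary yields the same digest on every scan, which is what the golden-file tests in the backlog rely on.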
Open Questions
- Final DSSE payload shape (Signals team); currently assumed to be graph.bundle with edges, symbols, metadata.
- Whether to include debugline info for coverage (could add optional module later).
8. Native Schema Alignment with richgraph-v1 (Sprint 0401)
Native callgraph output must conform to richgraph-v1 (see docs/contracts/richgraph-v1.md). This section defines the native-specific mappings.
8.1 NativeFunction Node Schema
Maps ELF/PE/Mach-O symbols to richgraph-v1 nodes:
{
  "id": "sym:binary:...",
  "symbol_id": "sym:binary:base64url(sha256(tuple))",
  "lang": "binary",
  "kind": "function",
  "display": "ssl3_read_bytes",
  "code_id": "code:binary:base64url(...)",
  "code_block_hash": "sha256:deadbeef...",
  "symbol": {
    "mangled": "_Z15ssl3_read_bytesP6ssl_stPviijPi",
    "demangled": "ssl3_read_bytes(ssl_st*, void*, int, int, int, int*)",
    "source": "DWARF",
    "confidence": 0.98
  },
  "purl": "pkg:deb/ubuntu/openssl@3.0.2?arch=amd64",
  "build_id": "gnu-build-id:a1b2c3d4e5f6...",
  "symbol_digest": "sha256:...",
  "evidence": ["dynsym", "dwarf"],
  "attributes": {
    "section": ".text",
    "address": "0x401000",
    "size": 256,
    "binding": "global",
    "visibility": "default",
    "elf_type": "STT_FUNC"
  }
}
8.2 SymbolID Construction for Native
Canonical tuple (NUL-separated, per richgraph-v1 §SymbolID):
binary:
{file_hash}\0{section}\0{addr}\0{name}\0{linkage}\0{code_block_hash?}
Examples:
sym:binary:base64url(sha256("sha256:abc...\0.text\00x401000\0ssl3_read_bytes\0global\0"))
sym:binary:base64url(sha256("sha256:abc...\0.text\00x401000\0\0local\0sha256:deadbeef")) # stripped
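The tuple and examples above translate into a short helper. This is a sketch only: native_symbol_id is a hypothetical name, and stripping the base64 "=" padding is an assumption, since the exact base64url variant is defined in richgraph-v1 rather than here.

```python
import base64
import hashlib


def base64url(data: bytes) -> str:
    # URL-safe alphabet; dropping "=" padding is an assumption.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def native_symbol_id(file_hash, section, addr, name, linkage, code_block_hash=None):
    """Build sym:binary:<base64url(sha256(tuple))> from the NUL-separated tuple."""
    # Missing name (stripped symbol) or missing code_block_hash become empty
    # fields, so the tuple always has six NUL-separated positions.
    parts = [file_hash, section, addr, name or "", linkage, code_block_hash or ""]
    tuple_bytes = "\0".join(parts).encode("utf-8")
    return "sym:binary:" + base64url(hashlib.sha256(tuple_bytes).digest())
```

For a stripped function the name is passed as an empty string and code_block_hash carries the distinguishing information, mirroring the second example.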
8.3 NativeCallEdge Schema
Maps PLT/GOT/relocation-based calls to richgraph-v1 edges:
{
  "from": "sym:binary:...",
  "to": "sym:binary:...",
  "kind": "call",
  "purl": "pkg:deb/ubuntu/openssl@3.0.2?arch=amd64",
  "symbol_digest": "sha256:...",
  "confidence": 0.85,
  "evidence": ["plt", "got", "reloc"],
  "candidates": [],
  "attributes": {
    "reloc_type": "R_X86_64_PLT32",
    "got_offset": "0x602020",
    "plt_index": 42
  }
}
8.4 Edge Kind Mapping
| Native Call Type | richgraph-v1 kind | Confidence | Evidence |
|---|---|---|---|
| Direct call (resolved) | call | 1.0 | ["disasm"] |
| PLT call (resolved) | call | 0.95 | ["plt", "got"] |
| PLT call (unresolved) | indirect | 0.5 | ["plt"] + candidates[] |
| GOT indirect | indirect | 0.6 | ["got", "reloc"] |
| Function pointer | indirect | 0.3 | ["disasm", "heuristic"] |
| Init array entry | init | 1.0 | ["init_array"] |
| TLS constructor | init | 1.0 | ["tls_init"] |
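One way to keep this mapping in a single place is a lookup table keyed by call type. The sketch below copies the kinds, confidences, and evidence from the table; the key strings (such as "plt_unresolved") and the make_edge helper are hypothetical names chosen for illustration.

```python
# call type -> (richgraph-v1 kind, confidence, evidence)
EDGE_KIND_MAP = {
    "direct_resolved":  ("call",     1.0,  ["disasm"]),
    "plt_resolved":     ("call",     0.95, ["plt", "got"]),
    "plt_unresolved":   ("indirect", 0.5,  ["plt"]),
    "got_indirect":     ("indirect", 0.6,  ["got", "reloc"]),
    "function_pointer": ("indirect", 0.3,  ["disasm", "heuristic"]),
    "init_array":       ("init",     1.0,  ["init_array"]),
    "tls_constructor":  ("init",     1.0,  ["tls_init"]),
}


def make_edge(call_type, from_id, to_id, candidates=None):
    """Build an edge dict using the kind/confidence/evidence for call_type."""
    kind, confidence, evidence = EDGE_KIND_MAP[call_type]
    return {
        "from": from_id,
        "to": to_id,
        "kind": kind,
        "confidence": confidence,
        "evidence": list(evidence),          # copy so callers cannot mutate the table
        "candidates": list(candidates or []),
    }
```

An unknown call type raises KeyError, which seems preferable to silently emitting an edge with a guessed confidence.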
8.5 Native Root Nodes
Synthetic roots for native entry points:
{
  "roots": [
    {
      "id": "sym:binary:..._start",
      "phase": "load",
      "source": "e_entry"
    },
    {
      "id": "sym:binary:...main",
      "phase": "runtime",
      "source": "symbol"
    },
    {
      "id": "init:binary:0x401000",
      "phase": "init",
      "source": "DT_INIT_ARRAY[0]"
    },
    {
      "id": "init:binary:0x401020",
      "phase": "init",
      "source": ".ctors[0]"
    }
  ]
}
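A roots list of this shape could be assembled as sketched below. The helper name build_native_roots and its parameters are hypothetical; the symbol IDs for _start and main are assumed to have been computed elsewhere, and the init-array and .ctors addresses are assumed to be already decoded into integers.

```python
def build_native_roots(start_id, main_id, init_array_addrs, ctors_addrs):
    """Return {"roots": [...]} in the order: entry, main, init array, .ctors."""
    roots = []
    if start_id:
        roots.append({"id": start_id, "phase": "load", "source": "e_entry"})
    if main_id:
        roots.append({"id": main_id, "phase": "runtime", "source": "symbol"})
    # Keep the original array order: it reflects constructor execution order.
    for i, addr in enumerate(init_array_addrs):
        roots.append({
            "id": f"init:binary:0x{addr:x}",
            "phase": "init",
            "source": f"DT_INIT_ARRAY[{i}]",
        })
    for i, addr in enumerate(ctors_addrs):
        roots.append({
            "id": f"init:binary:0x{addr:x}",
            "phase": "init",
            "source": f".ctors[{i}]",
        })
    return {"roots": roots}
```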
8.6 Build ID and Code ID Handling
| Source | build_id format | code_id fallback |
|---|---|---|
| ELF .note.gnu.build-id | gnu-build-id:{hex} | N/A |
| PE Debug Directory | pdb-guid:{guid}:{age} | N/A |
| Mach-O LC_UUID | macho-uuid:{uuid} | N/A |
| Missing build-id | None | sha256:{file_hash} |
When build-id is missing:
- Set build_id to null
- Set code_id using file hash: code:binary:base64url(sha256("{file_hash}\0{section}\0{addr}\0{size}"))
- Add "build_id_source": "FileHash" to attributes
- Emit U1 uncertainty state with entropy based on % of symbols missing build-id
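The first three steps of the missing build-id path might look like the following sketch. The names fallback_code_id and mark_missing_build_id are hypothetical, unpadded base64url is an assumption, and the U1 entropy calculation is left out because its formula is not defined in this section.

```python
import base64
import hashlib


def base64url(data: bytes) -> str:
    # Unpadded URL-safe base64; the padding choice is an assumption.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def fallback_code_id(file_hash: str, section: str, addr: str, size: int) -> str:
    """code:binary:<base64url(sha256(file_hash NUL section NUL addr NUL size))>."""
    material = "\0".join([file_hash, section, addr, str(size)]).encode("utf-8")
    return "code:binary:" + base64url(hashlib.sha256(material).digest())


def mark_missing_build_id(node: dict, file_hash: str, section: str, addr: str, size: int) -> dict:
    """Return a copy of node with build_id cleared and a file-hash-based code_id."""
    updated = dict(node)
    updated["build_id"] = None  # serialized as JSON null
    updated["code_id"] = fallback_code_id(file_hash, section, addr, size)
    attributes = dict(node.get("attributes", {}))
    attributes["build_id_source"] = "FileHash"
    updated["attributes"] = attributes
    return updated
```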
8.7 Stripped Binary Handling
For stripped binaries without symbol names:
- Synthetic name: sub_{address} (e.g., sub_401000)
- Code block hash: SHA-256 of function bytes (sha256:{hex})
- Confidence: 0.4 (heuristic function boundary detection)
- Evidence: ["heuristic", "cfg"]
Example node:
{
  "id": "sym:binary:...",
  "symbol_id": "sym:binary:...",
  "lang": "binary",
  "kind": "function",
  "display": "sub_401000",
  "code_id": "code:binary:...",
  "code_block_hash": "sha256:deadbeef...",
  "symbol": {
    "mangled": null,
    "demangled": null,
    "source": "NONE",
    "confidence": 0.4
  },
  "evidence": ["heuristic", "cfg"]
}
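The stripped-function fields can be derived from just an address and the function's bytes, as in this sketch. The helper name stripped_function_node is hypothetical, and it deliberately omits id, symbol_id, and code_id, which are assumed to be computed by separate routines.

```python
import hashlib


def stripped_function_node(address: int, function_bytes: bytes) -> dict:
    """Partial node for a function recovered without a symbol name."""
    return {
        "lang": "binary",
        "kind": "function",
        "display": f"sub_{address:x}",  # e.g. sub_401000 for 0x401000
        "code_block_hash": "sha256:" + hashlib.sha256(function_bytes).hexdigest(),
        "symbol": {
            "mangled": None,
            "demangled": None,
            "source": "NONE",
            "confidence": 0.4,  # heuristic function boundary detection
        },
        "evidence": ["heuristic", "cfg"],
    }
```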
8.8 Unknown Edge Targets
When call target cannot be resolved:
- Create synthetic target node with "kind": "unknown"
- Add to candidates[] on edge if multiple possibilities
- Emit edge with low confidence (0.3)
- Register in Unknowns registry
{
  "from": "sym:binary:...caller",
  "to": "unknown:binary:plt_42",
  "kind": "indirect",
  "confidence": 0.3,
  "candidates": [
    "pkg:deb/ubuntu/libssl@3.0.2",
    "pkg:deb/ubuntu/libcrypto@3.0.2"
  ],
  "evidence": ["plt", "unresolved"]
}
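The four steps above could be combined into one helper, sketched below. Everything named here is illustrative: unknown_plt_edge is a hypothetical function, the Unknowns registry is modeled as a plain list because its real interface is not described in this section, and sorting the candidates is an assumption made to keep output ordering stable.

```python
def unknown_plt_edge(caller_id: str, plt_index: int, candidates, registry: list):
    """Create the synthetic target and low-confidence edge, and record the unknown."""
    target = {"id": f"unknown:binary:plt_{plt_index}", "kind": "unknown"}
    edge = {
        "from": caller_id,
        "to": target["id"],
        "kind": "indirect",
        "confidence": 0.3,
        "candidates": sorted(candidates),  # sorted for deterministic output (assumption)
        "evidence": ["plt", "unresolved"],
    }
    # Stand-in for the Unknowns registry: remember which target was unresolved.
    registry.append(target["id"])
    return target, edge
```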
8.9 DSSE Bundle for Native Graphs
Per-layer DSSE bundle structure:
{
  "payloadType": "application/vnd.stellaops.graph+json",
  "payload": "<base64(canonical_graph_json)>",
  "signatures": [
    {
      "keyid": "stellaops:scanner:native:v1",
      "sig": "<base64(signature)>"
    }
  ]
}
Subject path: cas://reachability/graphs/{blake3}
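A rough sketch of assembling such an envelope follows. The pre-authentication encoding (PAE) matches the DSSE v1 format; the rest is assumption: sorted-key compact JSON stands in for whatever canonicalization the Signals contract settles on, and the sign callback is a placeholder for the real signer, which the backlog says is stubbed for now. Computing the blake3 subject path is omitted because BLAKE3 is not in the Python standard library.

```python
import base64
import json

PAYLOAD_TYPE = "application/vnd.stellaops.graph+json"
KEY_ID = "stellaops:scanner:native:v1"


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE v1 pre-authentication encoding: the bytes that actually get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)


def build_envelope(graph: dict, sign) -> dict:
    """Wrap a graph in a DSSE envelope; sign is a callable bytes -> bytes."""
    # Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8.
    payload = json.dumps(
        graph, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    signature = sign(pae(PAYLOAD_TYPE, payload))
    return {
        "payloadType": PAYLOAD_TYPE,
        "payload": base64.b64encode(payload).decode("ascii"),
        "signatures": [
            {"keyid": KEY_ID, "sig": base64.b64encode(signature).decode("ascii")}
        ],
    }
```

Signing the PAE output rather than the raw payload binds the payload type into the signature, so a verifier cannot be tricked into interpreting the same bytes under a different type.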
8.10 Implementation Checklist
- NativeFunctionNode maps to richgraph-v1 node schema
- NativeCallEdge maps to richgraph-v1 edge schema
- SymbolID uses sym:binary: prefix with canonical tuple
- CodeID uses code:binary: prefix for stripped symbols
- Graph hash uses BLAKE3-256 (blake3:{hex})
- Symbol digest uses SHA-256 (sha256:{hex})
- Init array roots use phase: "init"
- Missing build-id triggers U1 uncertainty
- DSSE envelope per layer with stellaops:scanner:native:v1 key