# Native Reachability Graph Plan (Scanner · Signals Alignment)

## Goals

- Extract native reachability graphs from ELF binaries across layers (stripped and unstripped), emitting:
  - Build IDs (`.note.gnu.build-id`) and code IDs per file.
  - Symbol digests (purl+symbol) and edges (callgraph) with deterministic ordering.
  - Synthetic roots for `_init`, `.init_array`, `.preinit_array`, and entry points.
  - DSSE graph bundle per layer for Signals ingestion.
- Offline-friendly, deterministic outputs (stable ordering, UTF-8, UTC).

## Inputs

- Layered filesystem with ELF binaries and shared objects.
- Layer metadata: digests from `scanner.rootfs.layers` and `scanner.layer.archives` (when provided).
- Optional runtime proc snapshot for reconciliation (if available via Signals pipeline).

## Approach

- **Discovery**: Walk layer directories; identify ELF binaries (`e_ident`, machine, class). Record the per-layer path.
- **Identifiers**: Capture the build-id (the hash stored in `.note.gnu.build-id`); fall back to SHA-256 of `.text` when it is absent; store the code-id (PE/ELF-friendly string).
- **Symbols**: Parse `.symtab`/`.dynsym`; compute stable symbol digests (e.g., SHA-256 over symbol bytes + name); include size/address for ordering.
- **Edges**: Build the callgraph from relocation/import tables and (when available) `.eh_frame`/`.plt` linkage; emit Unknown edges when the target is unresolved.
- **Synthetic Roots**: Insert edges from synthetic root nodes (per binary) to `_start`, `_init`, and `.init_array` entries.
- **Layer Bundles**: Emit a DSSE bundle per layer with edges, symbols, identifiers, and provenance (layer digest, path, sha256).
- **Determinism**: Sort by layer digest, path, symbol name; normalize paths to POSIX separators; timestamps fixed to generation time in UTC ISO-8601.

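The Discovery step can be illustrated with a minimal sketch that classifies a file from its first 20 bytes. It is written in Python for brevity and is not the planned `StellaOps.Scanner.Analyzers.Native` reader; the names `ElfIdent` and `identify_elf` are hypothetical. It relies only on the standard ELF header layout (`e_ident` magic, `EI_CLASS`, `EI_DATA`, and the 16-bit `e_machine` field at offset 18).

```python
import struct
from typing import NamedTuple, Optional

ELF_MAGIC = b"\x7fELF"


class ElfIdent(NamedTuple):
    elf_class: int        # 32 or 64, decoded from e_ident[EI_CLASS]
    little_endian: bool   # decoded from e_ident[EI_DATA]
    machine: int          # raw e_machine value, e.g. 62 for x86-64


def identify_elf(header: bytes) -> Optional[ElfIdent]:
    """Return ELF identification fields, or None if `header` is not ELF."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        return None
    ei_class, ei_data = header[4], header[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        return None  # invalid class or byte-order marker
    byte_order = "<" if ei_data == 1 else ">"
    # e_machine is a 16-bit field at offset 18 in both ELF32 and ELF64.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ElfIdent(32 if ei_class == 1 else 64, ei_data == 1, machine)


# Hand-built 20-byte prefix of a little-endian ELF64 x86-64 shared object
# (e_type = 3, e_machine = 62); a real caller would read it from the layer.
sample = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 3, 62)
print(identify_elf(sample))
```

Reading only a fixed-size prefix keeps the directory walk cheap: files that fail the magic check are skipped before any section parsing happens.
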
## Deliverables

- Library: `StellaOps.Scanner.Analyzers.Native` (new) with ELF reader and graph builder.
- Tests: fixtures under `src/Scanner/__Tests/StellaOps.Scanner.Analyzers.Native.Tests` using stripped/unstripped ELF samples (no network).
- DSSE bundle schema: shared constants/types reused by Signals ingestion.
- Sprint doc links: referenced from `SPRINT_0146_0001_0001_scanner_analyzer_gap_close.md`.

## Task Backlog (initial)

1) Skeleton project `StellaOps.Scanner.Analyzers.Native` + plugin registration for scanner worker.
2) ELF reader: header detection, build-id extraction, code-id calculation, section loader with deterministic sorting.
3) Symbol digests: compute `sha256(name + addr + size + binding)`; emit per-symbol evidence and purl+symbol IDs.
4) Callgraph builder: edges from PLT/relocs/imports; Unknown targets captured; synthetic roots for init arrays.
5) Layer attribution: carry layer digest/source through evidence; emit DSSE bundle per layer with signatures stubbed for now.
6) Tests/fixtures: stripped+unstripped ELF, shared objects, missing build-id, init array edges; golden JSON/NDJSON bundles.
7) Signals alignment: finalize DSSE graph schema and bundle naming; hook into reachability ingestion contract.

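Backlog item 3 leaves the byte encoding of `sha256(name + addr + size + binding)` open. The sketch below shows one possible reading, in Python for illustration only. The NUL separator, the `0x`-prefixed hex address, and the function name `symbol_digest` are assumptions, not part of the plan.

```python
import hashlib


def symbol_digest(name: str, addr: int, size: int, binding: str) -> str:
    """Digest a symbol's identifying fields into a `sha256:{hex}` string."""
    # Assumed encoding: UTF-8 fields joined by NUL, so that ("ab", "c") and
    # ("a", "bc") cannot collide the way plain concatenation would allow.
    fields = [name, f"0x{addr:x}", str(size), binding]
    material = "\0".join(fields).encode("utf-8")
    return "sha256:" + hashlib.sha256(material).hexdigest()


symbols = [
    ("ssl3_read_bytes", 0x401000, 256, "global"),
    ("helper", 0x401100, 32, "local"),
]
# Sort by (name, address) before emitting so the output order is stable.
for name, addr, size, binding in sorted(symbols, key=lambda s: (s[0], s[1])):
    print(name, symbol_digest(name, addr, size, binding))
```

Whatever encoding is finally chosen, an explicit field separator is worth keeping, because it makes the digest input unambiguous.
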
## Open Questions

- Final DSSE payload shape (Signals team): currently assumed to be `graph.bundle` with edges, symbols, and metadata.
- Whether to include debugline info for coverage (could be added later as an optional module).

---

## 8. Native Schema Alignment with richgraph-v1 (Sprint 0401)

Native callgraph output must conform to `richgraph-v1` (see `docs/contracts/richgraph-v1.md`). This section defines the native-specific mappings.

### 8.1 NativeFunction Node Schema

Maps ELF/PE/Mach-O symbols to richgraph-v1 nodes:

```json
{
  "id": "sym:binary:...",
  "symbol_id": "sym:binary:base64url(sha256(tuple))",
  "lang": "binary",
  "kind": "function",
  "display": "ssl3_read_bytes",
  "code_id": "code:binary:base64url(...)",
  "code_block_hash": "sha256:deadbeef...",
  "symbol": {
    "mangled": "_Z15ssl3_read_bytesP6ssl_stPviijPi",
    "demangled": "ssl3_read_bytes(ssl_st*, void*, int, int, unsigned int, int*)",
    "source": "DWARF",
    "confidence": 0.98
  },
  "purl": "pkg:deb/ubuntu/openssl@3.0.2?arch=amd64",
  "build_id": "gnu-build-id:a1b2c3d4e5f6...",
  "symbol_digest": "sha256:...",
  "evidence": ["dynsym", "dwarf"],
  "attributes": {
    "section": ".text",
    "address": "0x401000",
    "size": 256,
    "binding": "global",
    "visibility": "default",
    "elf_type": "STT_FUNC"
  }
}
```

### 8.2 SymbolID Construction for Native

Canonical tuple (NUL-separated, per `richgraph-v1` §SymbolID):

```
binary:
  {file_hash}\0{section}\0{addr}\0{name}\0{linkage}\0{code_block_hash?}

Examples:
  sym:binary:base64url(sha256("sha256:abc...\0.text\00x401000\0ssl3_read_bytes\0global\0"))
  sym:binary:base64url(sha256("sha256:abc...\0.text\00x401000\0\0local\0sha256:deadbeef"))  # stripped
```

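The tuple above can be turned into a SymbolID with a few lines of standard-library Python. This is a sketch under two assumptions that `richgraph-v1` would need to confirm: the base64url output is unpadded, and a missing `code_block_hash` is encoded as an empty trailing field (which matches the trailing `\0` in the first example). The function name `native_symbol_id` and the sample file hash are hypothetical.

```python
import base64
import hashlib


def native_symbol_id(file_hash: str, section: str, addr: str, name: str,
                     linkage: str, code_block_hash: str = "") -> str:
    """Build `sym:binary:<base64url(sha256(tuple))>` from the canonical tuple."""
    fields = [file_hash, section, addr, name, linkage, code_block_hash]
    tuple_bytes = "\0".join(fields).encode("utf-8")
    digest = hashlib.sha256(tuple_bytes).digest()
    # Assumption: unpadded base64url; a 32-byte digest yields 43 characters.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sym:binary:" + encoded


file_hash = "sha256:" + "ab" * 32  # placeholder value for illustration
named = native_symbol_id(file_hash, ".text", "0x401000", "ssl3_read_bytes", "global")
stripped = native_symbol_id(file_hash, ".text", "0x401000", "", "local", "sha256:deadbeef")
print(named)
print(stripped)
```

Joining a list of fields also avoids writing the tuple as a string literal: in Python, `"\00x401000"` would parse `\00` as an octal escape and silently drop the leading `0` of the address.
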
### 8.3 NativeCallEdge Schema

Maps PLT/GOT/relocation-based calls to richgraph-v1 edges:

```json
{
  "from": "sym:binary:...",
  "to": "sym:binary:...",
  "kind": "call",
  "purl": "pkg:deb/ubuntu/openssl@3.0.2?arch=amd64",
  "symbol_digest": "sha256:...",
  "confidence": 0.85,
  "evidence": ["plt", "got", "reloc"],
  "candidates": [],
  "attributes": {
    "reloc_type": "R_X86_64_PLT32",
    "got_offset": "0x602020",
    "plt_index": 42
  }
}
```

### 8.4 Edge Kind Mapping

| Native Call Type | richgraph-v1 `kind` | Confidence | Evidence |
|------------------|---------------------|------------|----------|
| Direct call (resolved) | `call` | 1.0 | `["disasm"]` |
| PLT call (resolved) | `call` | 0.95 | `["plt", "got"]` |
| PLT call (unresolved) | `indirect` | 0.5 | `["plt"]` + `candidates[]` |
| GOT indirect | `indirect` | 0.6 | `["got", "reloc"]` |
| Function pointer | `indirect` | 0.3 | `["disasm", "heuristic"]` |
| Init array entry | `init` | 1.0 | `["init_array"]` |
| TLS constructor | `init` | 1.0 | `["tls_init"]` |

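One way to keep this table authoritative in code is a single lookup structure that the edge builder consults. The Python sketch below transcribes the rows above; the dictionary keys (`direct_call`, `plt_resolved`, and so on) and the helper `make_edge` are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class EdgeDefaults(NamedTuple):
    kind: str
    confidence: float
    evidence: Tuple[str, ...]


# Transcription of the mapping table; keys are illustrative identifiers.
EDGE_DEFAULTS: Dict[str, EdgeDefaults] = {
    "direct_call": EdgeDefaults("call", 1.0, ("disasm",)),
    "plt_resolved": EdgeDefaults("call", 0.95, ("plt", "got")),
    "plt_unresolved": EdgeDefaults("indirect", 0.5, ("plt",)),
    "got_indirect": EdgeDefaults("indirect", 0.6, ("got", "reloc")),
    "function_pointer": EdgeDefaults("indirect", 0.3, ("disasm", "heuristic")),
    "init_array": EdgeDefaults("init", 1.0, ("init_array",)),
    "tls_constructor": EdgeDefaults("init", 1.0, ("tls_init",)),
}


def make_edge(call_type: str, src: str, dst: str,
              candidates: Iterable[str] = ()) -> dict:
    """Build an edge dict using the table's kind, confidence and evidence."""
    defaults = EDGE_DEFAULTS[call_type]
    return {
        "from": src,
        "to": dst,
        "kind": defaults.kind,
        "confidence": defaults.confidence,
        "evidence": list(defaults.evidence),
        "candidates": sorted(candidates),  # sorted for deterministic output
    }


print(make_edge("plt_unresolved", "sym:binary:a", "sym:binary:b", ["z", "y"]))
```
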
### 8.5 Native Root Nodes

Synthetic roots for native entry points:

```json
{
  "roots": [
    {
      "id": "sym:binary:..._start",
      "phase": "load",
      "source": "e_entry"
    },
    {
      "id": "sym:binary:...main",
      "phase": "runtime",
      "source": "symbol"
    },
    {
      "id": "init:binary:0x401000",
      "phase": "init",
      "source": "DT_INIT_ARRAY[0]"
    },
    {
      "id": "init:binary:0x401020",
      "phase": "init",
      "source": ".ctors[0]"
    }
  ]
}
```

### 8.6 Build ID and Code ID Handling

| Source | build_id format | code_id fallback |
|--------|-----------------|------------------|
| ELF `.note.gnu.build-id` | `gnu-build-id:{hex}` | N/A |
| PE Debug Directory | `pdb-guid:{guid}:{age}` | N/A |
| Mach-O `LC_UUID` | `macho-uuid:{uuid}` | N/A |
| Missing build-id | None | `sha256:{file_hash}` |

When build-id is missing:

1. Set `build_id` to null
2. Set `code_id` using file hash: `code:binary:base64url(sha256("{file_hash}\0{section}\0{addr}\0{size}"))`
3. Add `"build_id_source": "FileHash"` to attributes
4. Emit `U1` uncertainty state with entropy based on % of symbols missing build-id

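Steps 1 to 3 of the fallback can be sketched as a small Python function. It covers only the ELF row of the table and does not model the `U1` uncertainty state from step 4. The function name `identifier_fields`, the lowercasing of the hex build-id, and the unpadded base64url encoding are assumptions.

```python
import base64
import hashlib
from typing import Optional


def identifier_fields(build_id_hex: Optional[str], file_hash: str,
                      section: str, addr: str, size: int) -> dict:
    """Return build_id/code_id fields, using the file hash when build-id is absent."""
    if build_id_hex:
        # Assumption: hex is normalised to lowercase for stable comparison.
        return {"build_id": "gnu-build-id:" + build_id_hex.lower(),
                "attributes": {}}
    material = "\0".join([file_hash, section, addr, str(size)]).encode("utf-8")
    digest = hashlib.sha256(material).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return {
        "build_id": None,                                  # step 1
        "code_id": "code:binary:" + encoded,               # step 2
        "attributes": {"build_id_source": "FileHash"},     # step 3
    }


with_id = identifier_fields("A1B2C3", "sha256:" + "ab" * 32, ".text", "0x401000", 256)
without_id = identifier_fields(None, "sha256:" + "ab" * 32, ".text", "0x401000", 256)
print(with_id)
print(without_id)
```
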
### 8.7 Stripped Binary Handling

For stripped binaries without symbol names:

1. **Synthetic name:** `sub_{address}` (e.g., `sub_401000`)
2. **Code block hash:** SHA-256 of function bytes (`sha256:{hex}`)
3. **Confidence:** 0.4 (heuristic function boundary detection)
4. **Evidence:** `["heuristic", "cfg"]`

Example node:

```json
{
  "id": "sym:binary:...",
  "symbol_id": "sym:binary:...",
  "lang": "binary",
  "kind": "function",
  "display": "sub_401000",
  "code_id": "code:binary:...",
  "code_block_hash": "sha256:deadbeef...",
  "symbol": {
    "mangled": null,
    "demangled": null,
    "source": "NONE",
    "confidence": 0.4
  },
  "evidence": ["heuristic", "cfg"]
}
```

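A builder for such nodes might look like the following Python sketch. It assumes the caller has already computed the `sym:binary:` and `code:binary:` identifiers and recovered the function's bytes, and it assumes `id` and `symbol_id` carry the same value, as the example suggests. The function name `stripped_function_node` is hypothetical.

```python
import hashlib


def stripped_function_node(symbol_id: str, code_id: str,
                           addr: int, code: bytes) -> dict:
    """Build a node for a function recovered from a stripped binary."""
    return {
        "id": symbol_id,            # assumed identical to symbol_id
        "symbol_id": symbol_id,
        "lang": "binary",
        "kind": "function",
        "display": f"sub_{addr:x}",  # synthetic name, e.g. sub_401000
        "code_id": code_id,
        "code_block_hash": "sha256:" + hashlib.sha256(code).hexdigest(),
        "symbol": {
            "mangled": None,
            "demangled": None,
            "source": "NONE",
            "confidence": 0.4,       # heuristic boundary detection
        },
        "evidence": ["heuristic", "cfg"],
    }


# Placeholder identifiers and a two-byte body, for illustration only.
node = stripped_function_node("sym:binary:example", "code:binary:example",
                              0x401000, b"\x55\xc3")
print(node["display"], node["code_block_hash"])
```
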
### 8.8 Unknown Edge Targets

When a call target cannot be resolved:

1. Create a synthetic target node with `"kind": "unknown"`
2. Add the possible targets to `candidates[]` on the edge if there are multiple possibilities
3. Emit the edge with low confidence (0.3)
4. Register the target in the Unknowns registry

```json
{
  "from": "sym:binary:...caller",
  "to": "unknown:binary:plt_42",
  "kind": "indirect",
  "confidence": 0.3,
  "candidates": [
    "pkg:deb/ubuntu/libssl@3.0.2",
    "pkg:deb/ubuntu/libcrypto@3.0.2"
  ],
  "evidence": ["plt", "unresolved"]
}
```

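The four steps can be sketched together in Python as follows. The Unknowns registry is modelled here as a plain list, which is purely an assumption since its real interface is not described in this document. The function name `unresolved_plt_edge` is hypothetical, and candidates are kept in the order given.

```python
from typing import List, Sequence, Tuple


def unresolved_plt_edge(caller_id: str, plt_index: int,
                        candidates: Sequence[str],
                        unknowns: List[str]) -> Tuple[dict, dict]:
    """Return (synthetic target node, edge) for an unresolved PLT call."""
    target_id = f"unknown:binary:plt_{plt_index}"
    node = {"id": target_id, "kind": "unknown"}          # step 1
    edge = {
        "from": caller_id,
        "to": target_id,
        "kind": "indirect",
        "confidence": 0.3,                               # step 3
        "candidates": list(candidates),                  # step 2
        "evidence": ["plt", "unresolved"],
    }
    unknowns.append(target_id)                           # step 4 (stand-in registry)
    return node, edge


registry: List[str] = []
node, edge = unresolved_plt_edge(
    "sym:binary:caller", 42,
    ["pkg:deb/ubuntu/libssl@3.0.2", "pkg:deb/ubuntu/libcrypto@3.0.2"],
    registry,
)
print(edge)
```
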
### 8.9 DSSE Bundle for Native Graphs

Per-layer DSSE bundle structure:

```json
{
  "payloadType": "application/vnd.stellaops.graph+json",
  "payload": "<base64(canonical_graph_json)>",
  "signatures": [
    {
      "keyid": "stellaops:scanner:native:v1",
      "sig": "<base64(signature)>"
    }
  ]
}
```

Subject path: `cas://reachability/graphs/{blake3}`

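A minimal Python sketch of assembling this envelope is shown below. The pre-authentication encoding (`pae`) follows the DSSE specification; everything else is an assumption: "canonical" JSON is approximated by sorted keys and compact separators, the signer is a caller-supplied stub (in line with backlog item 5), and the BLAKE3 subject path is left out because BLAKE3 is not in the Python standard library.

```python
import base64
import json
from typing import Callable

PAYLOAD_TYPE = "application/vnd.stellaops.graph+json"
KEY_ID = "stellaops:scanner:native:v1"


def canonical_json(graph: dict) -> bytes:
    # Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8.
    return json.dumps(graph, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the bytes that actually get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes,
                                    len(payload), payload)


def build_envelope(graph: dict, sign: Callable[[bytes], bytes]) -> dict:
    payload = canonical_json(graph)
    signature = sign(pae(PAYLOAD_TYPE, payload))
    return {
        "payloadType": PAYLOAD_TYPE,
        "payload": base64.b64encode(payload).decode("ascii"),
        "signatures": [
            {"keyid": KEY_ID, "sig": base64.b64encode(signature).decode("ascii")}
        ],
    }


# Stub signer returning fixed bytes; a real signer would use the scanner key.
envelope = build_envelope({"nodes": [], "edges": []}, lambda message: b"stub")
print(json.dumps(envelope, indent=2))
```

Signing the PAE output rather than the raw payload binds the signature to the payload type as well as the payload bytes.
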
### 8.10 Implementation Checklist

- [ ] `NativeFunctionNode` maps to `richgraph-v1` node schema
- [ ] `NativeCallEdge` maps to `richgraph-v1` edge schema
- [ ] SymbolID uses `sym:binary:` prefix with canonical tuple
- [ ] CodeID uses `code:binary:` prefix for stripped symbols
- [ ] Graph hash uses BLAKE3-256 (`blake3:{hex}`)
- [ ] Symbol digest uses SHA-256 (`sha256:{hex}`)
- [ ] Init array roots use `phase: "init"`
- [ ] Missing build-id triggers U1 uncertainty
- [ ] DSSE envelope per layer with `stellaops:scanner:native:v1` key