diff --git a/AGENTS.md b/AGENTS.md
index d9bbed491..4b49d9751 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -162,16 +162,19 @@ You will be explicitly told which role you are acting in. Your behavior must cha
 Your goals:
 
-1. Review new advisory files against:
-
-   * Archived advisories: `docs/product-advisories/archive/*.md`.
-   * Implementation plans: `docs/implplan/SPRINT_*.md`.
-   * Historical tasks: `docs/implplan/archived/all-tasks.md`.
-2. Identify new topics or features that require implementation.
-3. For genuinely new items (not already implemented or planned):
-
-   * Check the relevant module docs: `docs/modules//*arch*.md` for compatibility or contradictions.
-   * If contradictions arise, you must surface and discuss them with the requester (in prose) and propose alignments.
+1. Review each file in the advisory directory and identify new topics or features.
+2. Determine whether each topic is genuinely new:
+   1. Go through the files one by one and extract the essentials first: themes, topics, architecture decisions.
+   2. Read each of the `archive/*.md` files and check whether the topic has already been advised. If it exists, or is close, drop the topic from the new advisory; otherwise keep it.
+   3. Check the relevant module docs (`docs/modules//*arch*.md`) for compatibility or contradictions.
+   4. Check the implementation plans: `docs/implplan/SPRINT_*.md`.
+   5. Check the historical tasks: `docs/implplan/archived/all-tasks.md`.
+   6. For each remaining topic, search the `SPRINT*.md` files and `src/*` (in the relevant modules) for an existing implementation of the same topic. If it is the same, or close, drop it; otherwise keep it.
+   7. If the topic is still genuinely new, and it makes sense for the product, keep it.
+3. When all files are processed and the genuinely new topics are known, present a report. The report must include:
+   - all topics,
+   - what is genuinely new,
+   - what contradicts existing tasks or implementations but might still make sense to implement.
 4.
Once scope is agreed, hand over to your **project manager** role (4.2) to define implementation sprints and tasks. 5. **Advisory and design decision sync**: diff --git a/docs/product-advisories/17-Nov-2026 - 1 copy.md b/docs/product-advisories/17-Nov-2026 - SBOM-Provenance-Spine.md similarity index 100% rename from docs/product-advisories/17-Nov-2026 - 1 copy.md rename to docs/product-advisories/17-Nov-2026 - SBOM-Provenance-Spine.md diff --git a/docs/product-advisories/17-Nov-2026 - 1.md b/docs/product-advisories/17-Nov-2026 - Stripped-ELF-Reachability.md similarity index 100% rename from docs/product-advisories/17-Nov-2026 - 1.md rename to docs/product-advisories/17-Nov-2026 - Stripped-ELF-Reachability.md diff --git a/docs/product-advisories/18-Nov-2026 - 1 copy 4.md b/docs/product-advisories/18-Nov-2026 - Binary-Reachability-Engine.md similarity index 100% rename from docs/product-advisories/18-Nov-2026 - 1 copy 4.md rename to docs/product-advisories/18-Nov-2026 - Binary-Reachability-Engine.md diff --git a/docs/product-advisories/18-Nov-2026 - 1 copy 2.md b/docs/product-advisories/18-Nov-2026 - CSharp-Binary-Analyzer.md similarity index 100% rename from docs/product-advisories/18-Nov-2026 - 1 copy 2.md rename to docs/product-advisories/18-Nov-2026 - CSharp-Binary-Analyzer.md diff --git a/docs/product-advisories/18-Nov-2026 - 1 copy.md b/docs/product-advisories/18-Nov-2026 - Patch-Oracles.md similarity index 100% rename from docs/product-advisories/18-Nov-2026 - 1 copy.md rename to docs/product-advisories/18-Nov-2026 - Patch-Oracles.md diff --git a/docs/product-advisories/18-Nov-2026 - 1.md b/docs/product-advisories/18-Nov-2026 - SBOM-Provenance-Spine.md similarity index 100% rename from docs/product-advisories/18-Nov-2026 - 1.md rename to docs/product-advisories/18-Nov-2026 - SBOM-Provenance-Spine.md diff --git a/docs/product-advisories/18-Nov-2026 - 1 copy 5.md b/docs/product-advisories/18-Nov-2026 - Unknowns-Registry.md similarity index 100% rename from 
docs/product-advisories/18-Nov-2026 - 1 copy 5.md rename to docs/product-advisories/18-Nov-2026 - Unknowns-Registry.md diff --git a/docs/product-advisories/20-Nov-2026 - Branch · Attach ELF Build‑IDs for Stable PURL Mapping.md b/docs/product-advisories/20-Nov-2026 - Branch · Attach ELF Build‑IDs for Stable PURL Mapping.md new file mode 100644 index 000000000..34c59abdf --- /dev/null +++ b/docs/product-advisories/20-Nov-2026 - Branch · Attach ELF Build‑IDs for Stable PURL Mapping.md @@ -0,0 +1,1246 @@ +Here’s a quick, practical win for your SBOM/runtime join story: **record the ELF build‑id alongside soname and path when mapping modules to purls.** + +Why it matters: + +* **build‑id** (from `.note.gnu.build-id`) is a **content hash** that uniquely identifies an ELF image—even if filenames/paths change. +* Distros and **debuginfod** index debug symbols **by build‑id**, so you can reliably join runtime traces → binaries → SBOM entries → debug artifacts. +* It hardens reachability and VEX joins (no “same soname, different bits” ambiguity). 
+
+### What to capture per ELF
+
+* `soname` (if shared object)
+* `full path` at runtime
+* `purl` (package URL from your resolver)
+* **`build_id`** (hex, no colons)
+* `arch`, `file type` (ET_DYN/ET_EXEC), and `build-id source` (NT_GNU_BUILD_ID)
+
+### How to read it (portable snippets)
+
+**CLI**
+
+```bash
+# show build-id quickly
+readelf -n /path/to/bin | awk '/Build ID:/ {print $3}'
+# or:
+objdump -s --section .note.gnu.build-id /path/to/bin
+```
+
+**C (runtime collector)**
+
+```c
+#include <link.h>
+#include <elf.h>
+static int note_cb(struct dl_phdr_info *info, size_t size, void *data) {
+  for (int i = 0; i < info->dlpi_phnum; i++) {
+    const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
+    if (ph->p_type == PT_NOTE) {
+      // scan notes for NT_GNU_BUILD_ID (type=3, name="GNU")
+      // extract desc bytes → hex string build_id
+    }
+  }
+  return 0;
+}
+// call dl_iterate_phdr(note_cb, NULL);
+```
+
+**Go (scanner)**
+
+```go
+f, _ := elf.Open(path)
+if s := f.Section(".note.gnu.build-id"); s != nil {
+	data, _ := s.Data()
+	// note header: namesz, descsz, type (uint32 each), then "GNU\0", then desc
+	if len(data) >= 16 {
+		descsz := int(f.ByteOrder.Uint32(data[4:8]))
+		if 16+descsz <= len(data) {
+			buildID := fmt.Sprintf("%x", data[16:16+descsz])
+			_ = buildID // record buildID
+		}
+	}
+}
+```
+
+### Suggested Stella Ops schema (add field, no versioning break)
+
+```json
+{
+  "module": {
+    "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
+    "soname": "libssl.so.3",
+    "purl": "pkg:deb/ubuntu/openssl@3.0.2-0ubuntu1.10?arch=amd64",
+    "elf": {
+      "build_id": "a1b2c3d4e5f6...",
+      "type": "ET_DYN",
+      "arch": "x86_64",
+      "notes": { "source": "NT_GNU_BUILD_ID" }
+    }
+  }
+}
+```
+
+### Join strategy
+
+1. **Runtime → build‑id:** collect from process maps (or dl_iterate_phdr) and a file-scan fallback.
+2. **SBOM → candidate binaries:** map by purl/filename, then **confirm by build‑id** where available.
+3. **Debug/Source:** query debuginfod or distro debug repos by build‑id to fetch symbols for precise call‑graph and reachability.
+4. **VEX/Policies:** treat build‑id as the primary key for binary‑level assertions; purl stays as the package‑level key.
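The join strategy above reduces to a keyed lookup once build-ids are collected. A minimal Go sketch of step 2 (confirming SBOM candidates by build-id); the record types and field names here are illustrative assumptions, not the Stella Ops schema:

```go
package main

import "fmt"

// RuntimeModule is a hypothetical runtime observation of a loaded ELF.
type RuntimeModule struct {
	Path    string
	BuildID string // hex, lowercase, no colons
}

// SBOMEntry is a hypothetical package record; BuildID is empty when unknown.
type SBOMEntry struct {
	Purl    string
	BuildID string
}

// JoinByBuildID keys SBOM entries by build-id, then confirms each runtime
// module against them, returning path -> purl for confirmed matches only.
func JoinByBuildID(mods []RuntimeModule, sbom []SBOMEntry) map[string]string {
	byID := make(map[string]string, len(sbom))
	for _, e := range sbom {
		if e.BuildID != "" {
			byID[e.BuildID] = e.Purl
		}
	}
	confirmed := make(map[string]string)
	for _, m := range mods {
		if purl, ok := byID[m.BuildID]; ok {
			confirmed[m.Path] = purl
		}
	}
	return confirmed
}

func main() {
	mods := []RuntimeModule{{Path: "/usr/lib/libssl.so.3", BuildID: "a1b2"}}
	sbom := []SBOMEntry{{Purl: "pkg:deb/ubuntu/openssl@3.0.2", BuildID: "a1b2"}}
	fmt.Println(JoinByBuildID(mods, sbom)["/usr/lib/libssl.so.3"])
	// prints: pkg:deb/ubuntu/openssl@3.0.2
}
```

Modules whose build-id appears nowhere in the SBOM simply fall out of the map, which is exactly the "same soname, different bits" ambiguity this field removes.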
+ +### Edge handling + +* **Stripped binaries:** build‑id still present in the note; if missing, fall back to **full‑file hash** and flag `build_id_absent=true`. +* **Containers:** compute build‑id inside image layers and cache in your “Proof‑of‑Integrity Graph.” +* **Kernel/Modules:** same idea—`/sys/module/*/notes/.note.gnu.build-id`. + +### Quick acceptance tests + +* Scan a container image (Debian/Ubuntu/RHEL) and verify >90% of ELF objects yield a build‑id. +* Cross‑check one binary: path changes across containers, **build‑id stays identical**. +* Fetch symbols via debuginfod using that build‑id and run a tiny call‑graph demo to prove determinism. + +If you want, I can draft the exact .NET 10 collector for Linux (P/Invoke `dl_iterate_phdr`) and a CycloneDX extension block to store `build_id`. +Here’s a concrete “implementation spec” for a C# dev to build an **ELF metadata / build-id collector** (“elf builder”). I’ll treat this as a small reusable .NET library plus some process-level helpers. + +--- + +## 1. Goal & Scope + +**Goal:** From C# on Linux, be able to: + +1. Given an ELF file path, extract: + + * `build-id` (from `.note.gnu.build-id`, i.e. NT_GNU_BUILD_ID) + * `soname` (for shared objects) + * ELF type (ET_EXEC / ET_DYN / etc.) + * machine architecture + * file path + * optional fallback: full-file hash if build-id is missing + +2. Given a running process (usually self), enumerate loaded ELF modules and attach the above metadata per module. + +The output will power your SBOM/runtime join (path + soname + build-id → purl). + +--- + +## 2. Public API Spec + +### 2.1 Core model + +```csharp +public enum ElfFileType +{ + Unknown = 0, + Relocatable = 1, // ET_REL + Executable = 2, // ET_EXEC + SharedObject = 3, // ET_DYN + Core = 4 // ET_CORE +} + +public sealed class ElfMetadata +{ + public required string Path { get; init; } + public string? Soname { get; init; } + public string? 
BuildId { get; init; }       // Hex, lowercase, no colons
+    public string BuildIdSource { get; init; } = ""; // "NT_GNU_BUILD_ID" | "FileHash" | ""
+    public ElfFileType FileType { get; init; }
+
+    public string Machine { get; init; } = ""; // e.g. "x86_64", "aarch64"
+    public bool Is64Bit { get; init; }
+    public bool IsLittleEndian { get; init; }
+
+    public string? FileHashSha256 { get; init; } // only if BuildId == null
+}
+```
+
+### 2.2 File-level API
+
+```csharp
+public static class ElfReader
+{
+    /// <summary>
+    /// Parse the ELF file at the given path and extract metadata.
+    /// Throws if the file is not ELF or cannot be read.
+    /// </summary>
+    public static ElfMetadata ReadMetadata(string path);
+}
+```
+
+**Behavior:**
+
+* Validates the ELF magic.
+* Supports both 32-bit and 64-bit ELF.
+* Supports little and big endian (but you can initially test only little-endian).
+* Uses program headers (PT_NOTE) and note parsing to extract the build-id.
+* Uses section headers + `.dynamic` to extract `DT_SONAME`.
+* Sets `BuildIdSource = "NT_GNU_BUILD_ID"` if a build-id is present.
+* If there is no build-id, computes `FileHashSha256` and sets `BuildIdSource = "FileHash"`.
+
+### 2.3 Process-level API (Linux)
+
+```csharp
+public static class ElfProcessScanner
+{
+    /// <summary>
+    /// Enumerate ELF modules for the current process (default) or a given pid.
+    /// Only returns unique paths that are actual ELF files.
+    /// </summary>
+    public static IReadOnlyList<ElfMetadata> GetProcessModules(int? pid = null);
+}
+```
+
+**Default implementation:**
+
+* Only supports Linux.
+* Reads `/proc/<pid>/maps`.
+* Filters entries that map regular files (path not `[vdso]`, `[heap]`, etc.).
+* De-duplicates by canonical path (i.e. `realpath` behavior).
+* For each unique path:
+
+  * Check the first 4 bytes for the ELF magic.
+  * Call `ElfReader.ReadMetadata(path)`.
+
+---
+
+## 3. ELF Parsing: Binary Layout & Rules
+
+You do **not** need unsafe code; a `BinaryReader` is enough.
+
+### 3.1 ELF header
+
+First 16 bytes: `e_ident[]`.
+ +Key fields: + +* `e_ident[0..3]` = `0x7F, 'E', 'L', 'F'` (magic) +* `e_ident[4]` = `EI_CLASS`: + + * 1 = 32-bit (`ELFCLASS32`) + * 2 = 64-bit (`ELFCLASS64`) +* `e_ident[5]` = `EI_DATA`: + + * 1 = little-endian (`ELFDATA2LSB`) + * 2 = big-endian (`ELFDATA2MSB`) + +Then the “native” header fields, which differ slightly between 32 & 64 bit. + +Define two internal structs (don’t use `[StructLayout]`; just read fields manually): + +```csharp +internal sealed class ElfHeaderCommon +{ + public byte[] Ident = new byte[16]; + public ushort Type; // e_type + public ushort Machine; // e_machine + public uint Version; // e_version + public ulong Entry; // e_entry (32/64 sized) + public ulong Phoff; // e_phoff + public ulong Shoff; // e_shoff + public uint Flags; // e_flags + public ushort Ehsize; // e_ehsize + public ushort Phentsize; // e_phentsize + public ushort Phnum; // e_phnum + public ushort Shentsize; // e_shentsize + public ushort Shnum; // e_shnum + public ushort Shstrndx; // e_shstrndx +} +``` + +**Algorithm to read header:** + +1. `ReadBytes(16)` → `Ident`. Validate magic & EI_CLASS/EI_DATA. + +2. Decide `is64` (from EI_CLASS) and `littleEndian` (from EI_DATA). + +3. Use helper methods: + + ```csharp + static ushort ReadUInt16(BinaryReader br, bool little) { ... } + static uint ReadUInt32(BinaryReader br, bool little) { ... } + static ulong ReadUInt64(BinaryReader br, bool little) { ... } + ``` + + Where these helpers swap bytes if file is big-endian and host is little-endian. + +4. For 32-bit ELF: fields `Entry`, `Phoff`, `Shoff` are 4-byte values that you zero-extend to 64-bit. + +5. For 64-bit ELF: fields are 8-byte values. 
+
+### 3.2 Program headers (for build-id)
+
+Each program header:
+
+* 32-bit:
+
+  ```text
+  uint32 p_type;
+  uint32 p_offset;
+  uint32 p_vaddr;
+  uint32 p_paddr;
+  uint32 p_filesz;
+  uint32 p_memsz;
+  uint32 p_flags;
+  uint32 p_align;
+  ```
+
+* 64-bit:
+
+  ```text
+  uint32 p_type;
+  uint32 p_flags;
+  uint64 p_offset;
+  uint64 p_vaddr;
+  uint64 p_paddr;
+  uint64 p_filesz;
+  uint64 p_memsz;
+  uint64 p_align;
+  ```
+
+You only really need:
+
+* `p_type` (look for `PT_NOTE` = 4)
+* `p_offset`
+* `p_filesz`
+
+**Reading algorithm:**
+
+```csharp
+internal sealed class ProgramHeader
+{
+    public uint Type;
+    public ulong Offset;
+    public ulong FileSize;
+}
+```
+
+* Seek to `header.Phoff`.
+* For `i = 0..Phnum-1`:
+
+  * For 32-bit:
+
+    * `Type = ReadUInt32()`
+    * `Offset = ReadUInt32()` (i.e. `p_offset`)
+    * Skip `p_vaddr` and `p_paddr` (8 bytes).
+    * `FileSize = ReadUInt32()` (i.e. `p_filesz`)
+    * Skip the rest (`p_memsz`, `p_flags`, `p_align`).
+  * For 64-bit:
+
+    * `Type = ReadUInt32()`
+    * `flags = ReadUInt32()` (ignored)
+    * `Offset = ReadUInt64()`
+    * Skip `p_vaddr` and `p_paddr` (16 bytes).
+    * `FileSize = ReadUInt64()`
+    * Skip the rest.
+* Store those with `Type == 4` (PT_NOTE).
+
+### 3.3 Note segments & NT_GNU_BUILD_ID
+
+Each **note** has:
+
+```text
+uint32 namesz;
+uint32 descsz;
+uint32 type;
+char name[namesz]; // padded to 4-byte boundary
+byte desc[descsz]; // padded to 4-byte boundary
+```
+
+We care about:
+
+* `type == 3` (NT_GNU_BUILD_ID)
+* `name == "GNU"` (null-terminated; usually `"GNU\0"`)
+
+**Algorithm:**
+
+For each `PT_NOTE` program header:
+
+1. Seek to `ph.Offset`, set `remaining = ph.FileSize`.
+2. While `remaining >= 12`:
+
+   * `namesz = ReadUInt32()`
+   * `descsz = ReadUInt32()`
+   * `type = ReadUInt32()`
+   * `remaining -= 12`.
+   * Read `nameBytes = ReadBytes(namesz)`; `remaining -= namesz`.
+
+     * Skip padding: `pad = (4 - (namesz % 4)) & 3`; `Seek(pad)`, `remaining -= pad`.
+   * Read `desc = ReadBytes(descsz)`; `remaining -= descsz`.
+
+     * Skip padding: `pad = (4 - (descsz % 4)) & 3`; `Seek(pad)`, `remaining -= pad`.
+ * If `type == 3` and `Encoding.ASCII.GetString(nameBytes).TrimEnd('\0') == "GNU"`: + + * Convert `desc` to hex: + + ```csharp + string buildId = BitConverter.ToString(desc).Replace("-", "").ToLowerInvariant(); + ``` + + * Return immediately. + +If no note matches, return null, and you can later fall back to `FileHashSha256`. + +### 3.4 Section headers & SONAME + +You need `DT_SONAME` from the dynamic section. Steps: + +1. Read **section headers** from `Shoff` (ELF header). + + Minimal section header model: + + ```csharp + internal sealed class SectionHeader + { + public uint Name; // index into shstrtab + public uint Type; // SHT_* + public ulong Offset; + public ulong Size; + public uint Link; // for some types + } + ``` + + For each section: + + * Read `Name`, `Type`, `Flags` (ignored), `Addr` (ignored), `Offset`, `Size`, `Link`, etc. + * Keep these in an array. + +2. Find the **section header string table** (`shstrtab`): + + * Use `header.Shstrndx` to locate its section header. + * Read that section’s bytes into `shStrTab`. + * Define helper to get section name: + + ```csharp + static string ReadNullTerminatedString(byte[] table, uint offset) + { + int i = (int)offset; + int start = i; + while (i < table.Length && table[i] != 0) i++; + return Encoding.ASCII.GetString(table, start, i - start); + } + ``` + +3. Use `shStrTab` to find: + + * `.dynamic` section (`Type == 6` i.e. `SHT_DYNAMIC`). + * The string table it references (`SectionHeader.Link` → index of the dynamic string table, often `.dynstr`). + +4. Parse the **dynamic section**: + + * `Elf64_Dyn` is array of entries: + + ```text + int64 d_tag; + uint64 d_val; + ``` + + (For 32-bit, both are 4 bytes; you can cast to 64-bit.) + + * For each entry: + + * Read `d_tag` (signed, but you can treat as 64-bit). + * Read `d_val`. + * If `d_tag == 14` (`DT_SONAME`), then `d_val` is an offset into the dynstr string table. + +5. 
Read `SONAME`: + + * Use dynstr bytes + `d_val` as index, decode null-terminated ASCII → `Soname`. + +If there is no `.dynamic` section or no `DT_SONAME`, set `Soname = null`. + +### 3.5 Mapping `e_machine` to architecture string + +`e_machine` is a numeric code. Map the most common ones: + +```csharp +static string MapMachine(ushort eMachine) => eMachine switch +{ + 3 => "x86", // EM_386 + 62 => "x86_64", // EM_X86_64 + 40 => "arm", // EM_ARM + 183 => "aarch64", // EM_AARCH64 + 8 => "mips", // EM_MIPS + _ => $"unknown({eMachine})" +}; +``` + +### 3.6 Mapping `e_type` to `ElfFileType` + +```csharp +static ElfFileType MapFileType(ushort eType) => eType switch +{ + 1 => ElfFileType.Relocatable, // ET_REL + 2 => ElfFileType.Executable, // ET_EXEC + 3 => ElfFileType.SharedObject,// ET_DYN + 4 => ElfFileType.Core, // ET_CORE + _ => ElfFileType.Unknown +}; +``` + +### 3.7 Fallback: SHA-256 hash + +If build-id is missing: + +```csharp +static string ComputeFileSha256(string path) +{ + using var sha = System.Security.Cryptography.SHA256.Create(); + using var fs = File.OpenRead(path); + var hash = sha.ComputeHash(fs); + return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant(); +} +``` + +Set: + +* `BuildId = null` +* `BuildIdSource = "FileHash"` +* `FileHashSha256 = computedHash` + +--- + +## 4. Implementation Skeleton (ElfReader) + +Here’s a compact skeleton tying it together: + +```csharp +public static class ElfReader +{ + public static ElfMetadata ReadMetadata(string path) + { + using var fs = File.OpenRead(path); + using var br = new BinaryReader(fs); + + // 1. Read e_ident + byte[] ident = br.ReadBytes(16); + if (ident.Length < 16 || + ident[0] != 0x7F || ident[1] != (byte)'E' || + ident[2] != (byte)'L' || ident[3] != (byte)'F') + { + throw new InvalidDataException("Not an ELF file."); + } + + bool is64 = ident[4] == 2; // EI_CLASS + bool little = ident[5] == 1; // EI_DATA + + // 2. 
Read header
+        var header = ReadElfHeader(br, ident, is64, little);
+
+        // 3. Read program headers
+        var phdrs = ReadProgramHeaders(br, header, is64, little);
+
+        // 4. Extract build-id from PT_NOTE
+        string? buildId = TryReadBuildIdFromNotes(br, phdrs, little, is64);
+
+        // 5. Read SONAME from .dynamic
+        string? soname = TryReadSoname(br, header, is64, little);
+
+        // 6. Map machine & type
+        string machine = MapMachine(header.Machine);
+        ElfFileType fileType = MapFileType(header.Type);
+
+        // 7. Hash fallback
+        string? fileHash = null;
+        string source;
+        if (buildId is null)
+        {
+            fileHash = ComputeFileSha256(path);
+            source = "FileHash";
+        }
+        else
+        {
+            source = "NT_GNU_BUILD_ID";
+        }
+
+        return new ElfMetadata
+        {
+            Path = path,
+            Soname = soname,
+            BuildId = buildId,
+            BuildIdSource = source,
+            FileType = fileType,
+            Machine = machine,
+            Is64Bit = is64,
+            IsLittleEndian = little,
+            FileHashSha256 = fileHash
+        };
+    }
+
+    // ... implement ReadElfHeader, ReadProgramHeaders,
+    // TryReadBuildIdFromNotes, TryReadSoname, MapMachine,
+    // MapFileType, ComputeFileSha256, plus endian helpers ...
+}
+```
+
+I didn’t expand *every* helper to keep this readable, but all helpers follow exactly the rules in section 3.
+
+---
+
+## 5. Process Scanner Spec (Linux)
+
+### 5.1 Reading `/proc/<pid>/maps`
+
+Each line looks roughly like:
+
+```text
+7f2d9c214000-7f2d9c234000 r--p 00000000 08:01 1234567 /usr/lib/x86_64-linux-gnu/libssl.so.3
+```
+
+The last field is the file path, if any.
+
+**Algorithm:**
+
+```csharp
+public static class ElfProcessScanner
+{
+    public static IReadOnlyList<ElfMetadata> GetProcessModules(int? pid = null)
+    {
+        int actualPid = pid ?? Environment.ProcessId;
+        string mapsPath = $"/proc/{actualPid}/maps";
+
+        if (!File.Exists(mapsPath))
+            throw new PlatformNotSupportedException("Only supported on Linux with /proc.");
+
+        var paths = new HashSet<string>(StringComparer.Ordinal);
+        foreach (var line in File.ReadLines(mapsPath))
+        {
+            int idx = line.IndexOf('/');
+            if (idx < 0)
+                continue; // no file path: [heap], [vdso], anonymous mappings, etc.
+
+            string p = line.Substring(idx).Trim();
+            if (!File.Exists(p))
+                continue;
+
+            paths.Add(p); // de-duplicates
+        }
+
+        var result = new List<ElfMetadata>();
+        foreach (var p in paths)
+        {
+            if (!IsElfFile(p))
+                continue;
+
+            try
+            {
+                var meta = ElfReader.ReadMetadata(p);
+                result.Add(meta);
+            }
+            catch
+            {
+                // swallow or log; not all mapped files are valid ELF
+            }
+        }
+
+        return result;
+    }
+
+    private static bool IsElfFile(string path)
+    {
+        try
+        {
+            using var fs = File.OpenRead(path);
+            Span<byte> magic = stackalloc byte[4];
+            if (fs.Read(magic) != 4) return false;
+            return magic[0] == 0x7F && magic[1] == (byte)'E' &&
+                   magic[2] == (byte)'L' && magic[3] == (byte)'F';
+        }
+        catch { return false; }
+    }
+}
+```
+
+This is simple and robust. If you later want **even more accurate** results (e.g., also non-file-backed shared objects), you can add a P/Invoke path that uses `dl_iterate_phdr`, but `/proc/<pid>/maps` gets you the SBOM-relevant modules.
+
+---
+
+## 6. 
JSON / SBOM Integration (Optional but Recommended) + +When you serialize `ElfMetadata` into your runtime evidence / graph, I’d recommend a nested ELF block, e.g.: + +```json +{ + "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3", + "soname": "libssl.so.3", + "purl": "pkg:deb/ubuntu/openssl@3.0.2-0ubuntu1.10?arch=amd64", + "elf": { + "build_id": "a1b2c3d4e5f6...", + "build_id_source": "NT_GNU_BUILD_ID", + "file_type": "SharedObject", + "machine": "x86_64", + "is_64bit": true, + "is_little_endian": true, + "file_hash_sha256": null + } +} +``` + +You can keep `purl` on the higher-level module object; `build_id` becomes the primary key for binary-accurate joins. + +--- + +## 7. Testing Checklist + +For a C# dev implementing this, I’d suggest these tests: + +1. **Basic ELF detection** + + * Provide a non-ELF file → `ReadMetadata` throws. + * Provide `/bin/ls` (or similar) → parses successfully. + +2. **32-bit vs 64-bit** + + * If you have a 32-bit ELF on the system, verify `Is64Bit` and basic fields. + +3. **Build-id extraction** + + * Compare `ReadMetadata(path).BuildId` against `readelf -n path | grep 'Build ID:'`. + * Ensure they match. + +4. **SONAME extraction** + + * For a shared library, ensure `Soname` matches `readelf -d path | grep SONAME`. + +5. **Hash fallback** + + * On an ELF without build-id (you can strip or use a test file), verify `BuildId == null` and `FileHashSha256 != null`. + +6. **Process scanner** + + * Call `GetProcessModules()` in your test app; ensure it returns at least: + + * your own executable + * core system libs (`libc.so.6`, `ld-linux*`, etc.) + +--- + +If you’d like, next step I can do is expand any of the helper methods (e.g. `ReadElfHeader` or `TryReadSoname`) into complete C# code, but this should be enough as a “full spec & directions” for a C# dev to implement the ELF build-id collector. +You’re already in “pretty good” territory with the current spec. 
To get to “best in class,” I’d tighten it in a few dimensions: **API design, correctness, performance, observability, testing, and extensibility**.
+
+I’ll walk through concrete ways to upgrade what we already have, without rewriting everything from scratch.
+
+---
+
+## 1. Sharpen the public API
+
+### 1.1 Split responsibilities into clear layers
+
+Right now we have:
+
+* `ElfReader.ReadMetadata(string path)`
+* `ElfProcessScanner.GetProcessModules(int? pid = null)`
+
+For a best‑in‑class library, I’d explicitly layer things:
+
+```csharp
+public interface IElfParser
+{
+    ElfMetadata Parse(Stream stream, string? pathHint = null);
+}
+
+public interface IElfFileInspector
+{
+    ElfMetadata InspectFile(string path);
+}
+
+public interface IElfProcessInspector
+{
+    IReadOnlyList<ElfMetadata> GetProcessModules(ElfProcessScanOptions? options = null);
+}
+```
+
+With default implementations:
+
+* `ElfParser` – pure, stateless binary parser (no file I/O).
+* `ElfFileInspector` – wraps `ElfParser` + file system.
+* `ElfProcessInspector` – wraps `/proc/<pid>/maps` (and optionally `dl_iterate_phdr`).
+
+This makes testing simpler (you can feed a `MemoryStream`) and keeps “how we read” decoupled from “how we parse.”
+
+### 1.2 Options objects & async variants
+
+Give users knobs and modern .NET ergonomics:
+
+```csharp
+public sealed class ElfProcessScanOptions
+{
+    public int? Pid { get; init; }
+    public bool IncludeNonElfFiles { get; init; } = false;
+    public bool ParallelFileParsing { get; init; } = true;
+    public bool ComputeHashWhenBuildIdMissing { get; init; } = true;
+    public int? MaxFiles { get; init; } // safety valve on huge systems
+}
+
+public static class ElfProcessScanner
+{
+    public static IReadOnlyList<ElfMetadata> GetProcessModules(
+        ElfProcessScanOptions? options = null);
+
+    public static IAsyncEnumerable<ElfMetadata> GetProcessModulesAsync(
+        ElfProcessScanOptions? options = null,
+        CancellationToken cancellationToken = default);
+}
+```
+
+Same for file scans:
+
+```csharp
+public sealed class ElfFileScanOptions
+{
+    public bool ComputeFileHashWhenBuildIdPresent { get; init; } = false;
+    public bool ThrowOnNonElf { get; init; } = true;
+}
+
+public static ElfMetadata ReadMetadata(
+    string path,
+    ElfFileScanOptions? options = null);
+```
+
+### 1.3 Strong types for identity
+
+Instead of `string BuildId`, add a value type:
+
+```csharp
+public readonly struct ElfBuildId : IEquatable<ElfBuildId>
+{
+    public string HexString { get; } // "a1b2c3..."
+    public string DebugPathComponent => $"{HexString[..2]}/{HexString[2..]}";
+
+    // Parse, TryParse, equality, GetHashCode, etc.
+}
+```
+
+Then in `ElfMetadata`:
+
+```csharp
+public ElfBuildId? BuildId { get; init; }  // nullable
+public string BuildIdSource { get; init; } // "NT_GNU_BUILD_ID" | "FileHash" | "None"
+```
+
+This prevents subtle bugs from string normalization and gives you the debuginfod‑style path precomputed.
+
+---
+
+## 2. Make parsing spec‑accurate & robust
+
+### 2.1 Handle both PT_NOTE and SHT_NOTE `.note.gnu.build-id`
+
+Many binaries place the build‑id in:
+
+* `PT_NOTE` segments **and/or**
+* a section named `.note.gnu.build-id` (`SHT_NOTE`)
+
+Your spec only mentions `PT_NOTE`. For best coverage:
+
+1. Search all `PT_NOTE` segments for `NT_GNU_BUILD_ID`.
+2. If none is found, search `SHT_NOTE` sections with the name `.note.gnu.build-id`.
+3. If both exist and disagree (extremely rare), decide a precedence and log a diagnostic.
+
+### 2.2 Correct note alignment rules
+
+Spec nuance:
+
+* Note *fields* (`namesz`, `descsz`, `type`) are always 4‑byte aligned.
+* On 64‑bit, the **overall note segment** may be aligned to 8 bytes, but the internal padding rules still use 4‑byte boundaries.
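A hypothetical Go sketch of a note walker applying this 4-byte padding rule end to end (little-endian only for brevity; the function names are mine, not part of the spec):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// notePadding implements the 4-byte internal padding rule.
func notePadding(n int) int { return (4 - (n & 3)) & 3 }

// findGNUBuildID walks a note blob (from a PT_NOTE segment or a
// .note.gnu.build-id section) for type 3 (NT_GNU_BUILD_ID) with name "GNU"
// and returns the desc bytes as lowercase hex, or "" if absent/corrupt.
func findGNUBuildID(b []byte) string {
	le := binary.LittleEndian
	for len(b) >= 12 {
		namesz := int(le.Uint32(b[0:4]))
		descsz := int(le.Uint32(b[4:8]))
		typ := le.Uint32(b[8:12])
		b = b[12:]
		nameEnd := namesz + notePadding(namesz)
		descEnd := nameEnd + descsz + notePadding(descsz)
		if descEnd > len(b) {
			return "" // sizes lie: fail gracefully, never over-read
		}
		name := string(bytes.TrimRight(b[:namesz], "\x00"))
		if typ == 3 && name == "GNU" {
			return hex.EncodeToString(b[nameEnd : nameEnd+descsz])
		}
		b = b[descEnd:]
	}
	return ""
}

func main() {
	// Synthetic note: namesz=4 ("GNU\0"), descsz=4, type=3, desc a1 b2 c3 d4.
	note := []byte{4, 0, 0, 0, 4, 0, 0, 0, 3, 0, 0, 0,
		'G', 'N', 'U', 0, 0xA1, 0xB2, 0xC3, 0xD4}
	fmt.Println(findGNUBuildID(note)) // a1b2c3d4
}
```

Note the bounds check before every slice: this is the same "never trust the file" posture required of the C# parser below.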
+ +Your spec uses `pad = (4 - (size % 4)) & 3`, which is correct, but I’d codify it clearly: + +```csharp +static int NotePadding(int size) => (4 - (size & 3)) & 3; +``` + +And call that everywhere you advance across notes so future maintainers don’t “optimize” it incorrectly. + +### 2.3 Be strict on bounds & corruption + +Add explicit, defensive checks: + +* Do not trust `p_offset` + `p_filesz` blindly. +* Before any read, verify `offset + length <= streamLength`. +* If the file lies about sizes, **fail gracefully** with a structured error. + +E.g.: + +```csharp +public sealed class ElfParseException : Exception +{ + public ElfParseErrorKind Kind { get; } + public string? Detail { get; } + + // ... +} + +public enum ElfParseErrorKind +{ + NotElf, + TruncatedHeader, + TruncatedProgramHeader, + TruncatedSectionHeader, + TruncatedNote, + UnsupportedClass, + UnsupportedEndianess, + IoError, + Unknown +} +``` + +And then: + +```csharp +if (header.Phoff + (ulong)header.Phnum * header.Phentsize > (ulong)fs.Length) + throw new ElfParseException(ElfParseErrorKind.TruncatedProgramHeader, "..."); +``` + +Best‑in‑class means you *never* trust the file, and your errors are debuggable. + +### 2.4 Big‑endian and 32‑bit are first‑class citizens + +Even if your primary target is x86_64 Linux, a robust spec: + +* Fully supports EI_CLASS = 1 and 2 (32/64). +* Fully supports EI_DATA = 1 and 2 (LSB/MSB). +* Has tests for at least one big‑endian ELF (e.g., sample artifacts in your test assets). + +Your current spec *mentions* big-endian, but I’d explicitly require: + +* A generic `EndianBinaryReader` abstraction that: + + * Wraps a `Stream` + * Exposes `ReadUInt16/32/64`, `ReadInt64`, `ReadBytes` with endianness. + +--- + +## 3. Performance & scale improvements + +### 3.1 Avoid full-file reads by design + +Your current design lets devs accidentally hash everything or read all sections even when not needed. 
Refine the spec so that the **default path** is minimal I/O:
+
+* Read the ELF header.
+* Read the program headers.
+* Read only:
+
+  * PT_NOTE ranges
+  * Section headers (once)
+  * `.shstrtab`, `.dynamic`, and its dynstr.
+
+Only compute SHA‑256 when expressly configured (via `ElfFileScanOptions.ComputeFileHashWhenBuildIdPresent` or `ComputeHashWhenBuildIdMissing`).
+
+### 3.2 Optional memory‑mapped mode
+
+For very large scans (filesystem crawls, containers), allow a mode that uses `MemoryMappedFile`:
+
+```csharp
+public sealed class ElfReaderOptions
+{
+    public bool UseMemoryMappedFile { get; init; } = false;
+}
+```
+
+Internally, you can spec that the implementation:
+
+* Uses `MemoryMappedFile.CreateFromFile`
+* Creates views over relevant ranges (header, program headers, etc.)
+* Avoids multiple OS reads for repeated random access.
+
+### 3.3 Parallel directory / image scanning
+
+If you foresee scanning whole images or file trees, define a helper:
+
+```csharp
+public static class ElfDirectoryScanner
+{
+    public static IReadOnlyList<ElfMetadata> Scan(
+        string rootDirectory,
+        ElfDirectoryScanOptions? options = null);
+
+    public static IAsyncEnumerable<ElfMetadata> ScanAsync(
+        string rootDirectory,
+        ElfDirectoryScanOptions? options = null,
+        CancellationToken cancellationToken = default);
+}
+
+public sealed class ElfDirectoryScanOptions
+{
+    public SearchOption SearchOption { get; init; } = SearchOption.AllDirectories;
+    public int MaxDegreeOfParallelism { get; init; } = Environment.ProcessorCount;
+    public Func<string, bool>? PathFilter { get; init; } // e.g., skip /proc, /sys
+}
+```
+
+And explicitly say that the implementation:
+
+* Uses `Parallel.ForEach` (or `Parallel.ForEachAsync` on .NET 6+) with bounded parallelism.
+* Shares a single `ElfParser` across threads (it’s stateless).
+* De‑dups by `(device, inode)` when possible (see below).
+
+---
+
+## 4. Process scanner: correctness & completeness
+
+### 4.1 De‑duplication by inode, not just path
+
+The current spec de‑dups only by path.
On Linux:
+
+* The same inode may have multiple paths (hard links, bind mounts, chroot/container overlays).
+
+For best‑in‑class accuracy of “unique binaries,” spec:
+
+* De‑duplicate entries by `(st_dev, st_ino)` from `stat(2)`, not just the string path.
+* Provide both views: unique by file identity, and per path.
+
+API example:
+
+```csharp
+public sealed class ElfProcessModules
+{
+    public IReadOnlyList<ElfMetadata> UniqueFiles { get; init; }     // dedup by inode
+    public IReadOnlyList<ElfModuleInstance> Instances { get; init; } // per mapping
+}
+
+public sealed class ElfModuleInstance
+{
+    public ElfMetadata Metadata { get; init; }
+    public string Path { get; init; }
+    public string? MappingRange { get; init; } // "7f2d9c214000-7f2d9c234000"
+}
+```
+
+And `ElfProcessScanner.GetProcessModules` returns an `ElfProcessModules`, not just a flat list.
+
+### 4.2 Optional `dl_iterate_phdr` P/Invoke path
+
+For a “maximum correctness” mode, you can specify:
+
+* A secondary implementation that uses `dl_iterate_phdr` via P/Invoke.
+* This gives you module base addresses and sometimes more consistent views across distros.
+* You can hybridize: use `/proc/<pid>/maps` for path enumeration and `dl_iterate_phdr` to confirm loaded segments (future feature).
+
+You don’t **have** to implement it day one, but the spec can carve out an extension point:
+
+```csharp
+public enum ElfProcessModuleSource
+{
+    ProcMaps,
+    DlIteratePhdr
+}
+
+public sealed class ElfProcessScanOptions
+{
+    public ElfProcessModuleSource Source { get; init; } = ElfProcessModuleSource.ProcMaps;
+}
+```
+
+And define the behavior when the requested source isn’t available.
+
+---
+
+## 5. Observability & diagnostics
+
+Best‑in‑class libraries are easy to debug.
+ +### 5.1 Structured diagnostics on parse failures + +Instead of “swallow or log” in the scanner, define: + +```csharp +public sealed class ElfScanResult +{ + public IReadOnlyList<ElfMetadata> Successes { get; init; } + public IReadOnlyList<ElfScanError> Errors { get; init; } +} + +public sealed class ElfScanError +{ + public string Path { get; init; } + public ElfParseErrorKind Kind { get; init; } + public string Message { get; init; } +} +``` + +And make `ElfProcessScanner.GetProcessModules` optionally return `ElfScanResult` (or have an overload). + +This way you can: + +* Report how many files failed. +* See common misconfigurations (e.g., insufficient permissions, truncated files). + +### 5.2 Logging hooks instead of hard-coded logging + +Don’t bake in a logging framework, but add a hook: + +```csharp +public interface IElfLogger +{ + void Debug(string message); + void Info(string message); + void Warn(string message); + void Error(string message, Exception? ex = null); +} + +public sealed class ElfReaderOptions +{ + public IElfLogger? Logger { get; init; } +} +``` + +Then use it for “soft failures” (skipping non‑ELF files, ignoring suspect sections, etc.). + +--- + +## 6. Security & safety considerations + +### 6.1 Treat inputs as untrusted + +Spec explicitly that: + +* No ELF is ever loaded or executed. +* No ld.so / dynamic loading is used: all reading is via `FileStream` / `MemoryMappedFile`. +* No writes occur to inspected paths. + +### 6.2 Control resource usage + +For environments scanning untrusted file trees (e.g., user uploads): + +* Have configurable caps on: + + * `MaxFileSizeBytes` to parse. + * `MaxNotesPerSegment` / `MaxSections` to avoid pathological “zip bomb” style ELFs. +* Fail with `ElfParseErrorKind.TruncatedHeader` or `Unsupported` rather than exhausting RAM. + +--- + +## 7. Testing & validation: make it part of the spec + +Instead of just “add tests,” bake them in as requirements.
+ +### 7.1 Golden tests vs `readelf` or `llvm-readobj` + +Define that CI must include: + +* For a set of ELFs (32‑bit, 64‑bit, big‑endian, stripped, PIE, static): + + * Compare `ElfMetadata.BuildId` with `readelf -n` output. + * Compare `ElfMetadata.Soname` with `readelf -d` / `objdump -p`. + +You don’t need to name the exact tools in the API, but the spec can say: + +> The library’s test suite **must** cross‑validate build‑id and SONAME values against a trusted system tool (such as `readelf` or `llvm-readobj`) for a curated set of binaries. + +### 7.2 Fuzzing & corruption tests + +Add: + +* A small fuzz harness that: + + * Mutates bytes in real ELF samples. + * Feeds them to `ElfParser`. + * Asserts: no crashes, only `ElfParseException`s. + +This directly supports the “never trust input” goal. + +### 7.3 Regression fixtures + +Check in a `testdata/` folder with: + +* Minimal 32‑bit/64‑bit ELF with build‑id. +* Minimal ELF without build‑id. +* Shared library with SONAME. +* Big‑endian sample. + +--- + +## 8. Extensibility hooks (future-friendly) + +Even if you only care about Linux/ELF today, you can design with “other formats later” in mind. + +### 8.1 Generalized module metadata interface + +```csharp +public interface IModuleMetadata +{ + string Path { get; } + string? Soname { get; } + string? BuildId { get; } + string Format { get; } // "ELF", "PE", "MachO" +} +``` + +`ElfMetadata` implements `IModuleMetadata`. That way, a future `PeMetadata` or `MachOMetadata` can slot into the same pipelines. + +### 8.2 Integration with SBOM & VEX + +Add a tiny, optional interface that lines up with your SBOM graph: + +```csharp +public interface IHasPackageCoordinates +{ + string? Purl { get; } +} + +public sealed partial class ElfMetadata : IHasPackageCoordinates +{ + public string? 
Purl { get; init; } // populated by your higher-layer resolver +} +``` + +The ELF layer doesn’t know how to compute `Purl`, but it gives a spot for higher layers to attach it without wrapping everything in another type. + +--- + +## 9. Documentation & usage examples + +Finally, “best in class” is as much about *developer experience* as code. + +Your spec should require: + +* XML docs on all public types/members (shown in IntelliSense). +* Samples: + + * “Read build‑id from a single file” + * “Enumerate current process modules and print build‑ids” + * “Scan a container filesystem for unique ELFs and dump JSON” + +For example: + +```csharp +// Example: dump all modules for the current process +var modules = ElfProcessScanner.GetProcessModules(); +foreach (var m in modules) +{ + Console.WriteLine($"{m.Path} | SONAME={m.Soname} | BUILD-ID={m.BuildId?.HexString ?? ""}"); +} +``` + +--- + +## TL;DR: What to actually change in your current spec + +If you just want a concrete checklist: + +1. **Refine API** + + * Introduce `ElfBuildId` struct, options objects, async variants. + * Split parser vs file/process scanners. + +2. **Parsing correctness** + + * Support build‑id in both PT_NOTE and `.note.gnu.build-id`. + * Add strict bounds checks and `ElfParseException` with `ElfParseErrorKind`. + * Treat big‑endian & 32‑bit as first‑class. + +3. **Performance** + + * Make full file hashing opt‑in. + * Avoid unnecessary section reads. + * Add optional memory‑mapped mode. + +4. **Process scanner** + + * De‑dup by inode, not just path. + * Return both unique files and per‑mapping instances. + * Add structured error reporting (successes + failures). + +5. **Testing & security** + + * Mandate cross‑validation vs `readelf`. + * Add fuzz/corruption tests. + * Add resource caps (max file size, max sections/notes). 
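The fuzz/corruption testing called out in item 5 (and in §7.2 above) can be sketched as a small reusable harness. This is an illustrative sketch only: the `parse` delegate stands in for the library's parser (`ElfParser.Parse` in this spec), and `acceptable` would be `typeof(ElfParseException)`; any other escaping exception fails the run, which is exactly the property being tested.

```csharp
using System;

// Sketch of the fuzz harness: mutate one random bit per iteration and
// feed the result to the parser. Only the "acceptable" structured parse
// exception may be thrown; anything else escapes and fails the harness.
public static class ElfFuzzHarness
{
    public static void Run(byte[] original, Action<byte[]> parse,
                           Type acceptable, int iterations, int seed)
    {
        if (original.Length == 0)
            return;

        var rng = new Random(seed); // fixed seed => reproducible failures
        for (int i = 0; i < iterations; i++)
        {
            var mutated = (byte[])original.Clone();
            int index = rng.Next(mutated.Length);
            mutated[index] ^= (byte)(1 << rng.Next(8)); // flip one random bit

            try
            {
                parse(mutated); // must never crash or hang the process
            }
            catch (Exception ex) when (acceptable.IsInstanceOfType(ex))
            {
                // Expected: a structured parse failure, not a crash.
            }
        }
    }
}
```

Because the seed is fixed, a failing mutation can be replayed deterministically, which fits the broader replayable-scan goals.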
+ +If you’d like, next step I can do is **rewrite the public C# surface** (interfaces, classes, XML docs) in one place with all of these improvements baked in, so your team can just drop it into a project and fill in the internals. diff --git a/docs/product-advisories/20-Nov-2026 - Branch · Model .init_array Constructors as Reachability Roots.md b/docs/product-advisories/20-Nov-2026 - Branch · Model .init_array Constructors as Reachability Roots.md new file mode 100644 index 000000000..106903742 --- /dev/null +++ b/docs/product-advisories/20-Nov-2026 - Branch · Model .init_array Constructors as Reachability Roots.md @@ -0,0 +1,768 @@ +Here’s a quick, practical heads‑up about **binary initialization routines** and why they matter for reachability and vuln triage. + +--- + +### What’s happening before `main()` + +In ELF binaries/shared objects, the runtime linker runs **constructors** *before* `main()`: + +* `.preinit_array` → runs first (rare, but highest priority) +* `.init_array` → common place for constructors (ordered by index) +* Legacy sections: `.init` (function) and `.ctors` (older toolchains) +* On exit you also have `.fini_array` / `.fini` + +These constructors can: + +* Register signal/atexit handlers +* Start threads, open sockets/files, tweak `LD_PRELOAD` hooks +* Call library code you assumed was only used later + +So if you’re doing **call‑graph reachability** for vulnerability impact, starting from only `main()` (or exported APIs) can **miss real edges** that execute at load time. + +--- + +### What to model (synthetic roots) + +Treat the following as **synthetic entry points** in your graph: + +1. All function pointers in `.preinit_array` +2. All function pointers in `.init_array` +3. The symbol `_init` (if present) and legacy `.ctors` entries +4. For completeness on teardown paths: `.fini_array`, `_fini` +5. 
**Dynamic loader interposition**: if `DT_NEEDED` libs have their own constructors, they’re roots too (even if you never call them explicitly) + +For PIE/DSO builds, remember that every loaded **dependency’s** init arrays run as part of `dlopen()`/program start—model those edges across DSOs. + +--- + +### How to extract quickly + +* **Static parse**: read `PT_DYNAMIC`, then `DT_PREINIT_ARRAY`, `DT_INIT_ARRAY`, their sizes; iterate pointers and add edges to your graph. +* **Symbol fallback**: if `DT_INIT`/`_init` exists, add it as a root. +* **Ordering**: preserve index order inside arrays (it can matter). +* **Relocations**: resolve `R_X86_64_RELATIVE` (etc.) so pointers point to the real code addresses. + +Mini‑C example (constructor runs pre‑main): + +```c +static void __attribute__((constructor)) boot(void) { + // vulnerable call here executes before main() +} +int main(){ return 0; } +``` + +--- + +### For Stella Ops (binary reachability) + +* **Graph seeds**: `roots = { init arrays of main ELF + all DT_NEEDED DSOs }` +* **Policy**: mark edges from these roots as `phase=load` vs `phase=runtime`, so your explainer can say “reachable at load time.” +* **PURLs**: attach edges to the package/node that owns the constructor symbol (DSO package purl), not just the main app. +* **Attestation**: store the discovered root list (addresses + resolved symbols + DSO soname) in your deterministic scan manifest, so audits can replay it. +* **Heuristics**: if `dlopen()` is detected statically (strings/symbols), add a potential root “DLOPEN_INIT[*]” bucket for libs found under common plugin dirs. 
+ +--- + +### Quick checklist + +* [ ] Parse `.preinit_array`, `.init_array`, `.init` (and legacy `.ctors`) +* [ ] Resolve relocations; preserve order +* [ ] Seed graph with these as **synthetic roots** +* [ ] Include constructors of every `DT_NEEDED` DSO +* [ ] Tag edges as `phase=load` for prioritization/explainability +* [ ] Persist root list in the scan’s evidence bundle + +If you want, I can drop in a tiny .NET/ELF parser snippet or a Rust routine that walks `DT_INIT_ARRAY` and returns symbol‑resolved roots next. +Here’s a concrete, C#‑oriented spec you can hand to a developer to implement ELF init/constructor discovery and plug it into a reachability engine like Stella Ops. + +I’ll structure it like an internal design doc: + +1. What we need to do +2. Public API (what the rest of the system calls) +3. ELF parsing details (minimal, but correct) +4. Constructor / init routine discovery algorithm +5. Dynamic deps (DT_NEEDED) and load‑time roots +6. Integration with the call graph / reachability +7. Attestation / evidence output +8. Testing strategy + +--- + +## 1. 
Goal / Requirements + +**Business goal** + +When scanning ELF binaries and shared libraries, we must model functions that run **before `main()`** or at **library load/unload** as *synthetic entry points* in the call graph: + +* `.preinit_array` (pre‑init constructors) +* `.init_array` (constructors) +* Legacy constructs: + + * `.ctors` array + * `_init` (via `DT_INIT`) +* For teardown (optional but recommended): + + * `.fini_array` + * `_fini` (via `DT_FINI`) + +**We must:** + +* Discover all these routines in: + + * The main executable + * All its `DT_NEEDED` shared libraries (and any DSOs subsequently loaded, if we scan them) +* Represent them as **roots** in the reachability graph: + + * `phase = Load` for preinit/init/constructors + * `phase = Unload` for finalizers +* Resolve each routine to: + + * Owning binary path and SONAME + * Virtual address in the ELF + * Best‑effort symbol name (`_ZN...`, `my_ctor`, etc.) + * Order/index within its array (to preserve call order) +* Emit a structured **evidence/attestation** record so scans are replayable. + +--- + +## 2. Public API (C#) + +### 2.1 Data model + +Create a small domain model in a library, e.g. `StellaOps.ElfInit`: + +```csharp +namespace StellaOps.ElfInit; + +public enum InitRoutineKind +{ + PreInitArray, + InitArray, + LegacyCtorsSection, + LegacyInitSymbol, + FiniArray, + LegacyFiniSymbol +} + +public enum InitPhase +{ + Load, + Unload +} + +public sealed record InitRoutineRoot( + string BinaryPath, // Full path on disk + string? Soname, // From DT_SONAME if present + InitRoutineKind Kind, + InitPhase Phase, + ulong VirtualAddress, // VA within this ELF + ulong? FileOffset, // File offset (if resolved), null if unknown + string? SymbolName, // Best-effort name from symbol table + int? 
ArrayIndex // Index for array-based roots +); +``` + +### 2.2 Discovery service + +Public entry point that other components use: + +```csharp +public interface IInitRoutineDiscovery +{ + /// <summary> + /// Discover load/unload routines (constructors) in a single ELF file + /// and, optionally, in its DT_NEEDED dependencies. + /// </summary> + InitDiscoveryResult Discover(string elfPath, InitDiscoveryOptions options); +} + +public sealed record InitDiscoveryOptions +{ + /// <summary> + /// If true, also discover init routines in DT_NEEDED shared libraries + /// (using IElfDependencyResolver to locate them on disk). + /// </summary> + public bool IncludeDependencies { get; init; } = true; + + /// <summary> + /// If true, include fini routines (.fini_array, DT_FINI, etc.) + /// as unload-phase roots. + /// </summary> + public bool IncludeUnloadPhase { get; init; } = true; +} + +public sealed record InitDiscoveryResult( + IReadOnlyList<InitRoutineRoot> Roots, + IReadOnlyList<InitRoutineError> Errors // non-fatal problems per binary +); + +public sealed record InitRoutineError( + string BinaryPath, + string Message, + Exception? Exception = null +); +``` + +### 2.3 Dependency resolution + +We don’t hard‑code how to find `DT_NEEDED` libraries on disk. Define an abstraction: + +```csharp +public interface IElfDependencyResolver +{ + /// <summary> + /// Resolve SONAME (e.g. "libc.so.6") to a local file path. + /// Returns null if not found. + /// </summary> + string? ResolveLibrary(string soname, string referencingBinaryPath); +} +``` + +The implementation can respect `LD_LIBRARY_PATH`, typical system dirs, container images, etc., but that’s outside this spec. + +`IInitRoutineDiscovery` will depend on: + +* `IElfParser` +* `IElfDependencyResolver` +* `ISymbolResolver` (symbol tables) + +--- + +## 3. ELF Parsing Spec (C#‑friendly) + +You can either use a NuGet ELF library or implement a minimal in‑house parser.
This spec assumes a **minimal custom parser** that supports: + +* ELF64, little‑endian +* ET_EXEC, ET_DYN +* x86‑64 (`e_machine == EM_X86_64`) as v1; keep architecture pluggable for later + +### 3.1 Core types + +Create an internal parser namespace, e.g. `StellaOps.Elf`: + +```csharp +internal sealed class ElfFile +{ + public string Path { get; } + public ElfClass ElfClass { get; } + public ElfEndianness Endianness { get; } + public ElfHeader Header { get; } + public IReadOnlyList<ProgramHeader> ProgramHeaders { get; } + public IReadOnlyList<SectionHeader> SectionHeaders { get; } + public DynamicSection? Dynamic { get; } + + public ReadOnlyMemory<byte> RawBytes { get; } + + // Helper: mapping VA -> file offset using PT_LOAD segments + public bool TryMapVaToFileOffset(ulong virtualAddress, out ulong fileOffset); +} + +internal enum ElfClass { Elf32, Elf64 } +internal enum ElfEndianness { Little, Big } + +// Fill out ElfHeader / ProgramHeader / SectionHeader / DynamicEntry types +``` + +Implementation notes: + +* Read ELF header: + + * Validate magic: `0x7F 'E' 'L' 'F'` + * `EI_CLASS` → 32/64‑bit + * `EI_DATA` → endianness +* Read **program headers** (`e_phoff`, `e_phnum`). + + * Identify `PT_LOAD` (for VA→file mapping). + * Identify `PT_DYNAMIC` (for `DynamicSection`). +* Read **section headers** (`e_shoff`, `e_shnum`). + + * Identify sections by name: `.preinit_array`, `.init_array`, `.fini_array`, `.ctors`. + * You need the section name string table `.shstrtab` to decode names. + +### 3.2 Dynamic section parsing + +Define dynamic section model: + +```csharp +internal sealed class DynamicSection +{ + public IReadOnlyList<DynamicEntry> Entries { get; } + public ulong? InitFunction { get; } // DT_INIT + public ulong? FiniFunction { get; } // DT_FINI + public ulong? InitArrayAddress { get; } // DT_INIT_ARRAY + public ulong? InitArraySize { get; } // DT_INIT_ARRAYSZ + public ulong? FiniArrayAddress { get; } // DT_FINI_ARRAY + public ulong? FiniArraySize { get; } // DT_FINI_ARRAYSZ + public ulong?
PreInitArrayAddress { get; } // DT_PREINIT_ARRAY + public ulong? PreInitArraySize { get; } // DT_PREINIT_ARRAYSZ + + public string? Soname { get; } // DT_SONAME (decoded via DT_STRTAB) + public IReadOnlyList<string> Needed { get; } // DT_NEEDED list + + public ulong? StrTabAddress { get; } + public ulong? SymTabAddress { get; } + public ulong? StrTabSize { get; } +} +``` + +Implementation details: + +* Dynamic entries are at `PT_DYNAMIC.p_offset`, each `Elf64_Dyn`: + + * `d_tag` (signed 64‑bit) + * `d_un` union (`d_val` or `d_ptr`, treat as `ulong`) + +* Tags of interest (values are from ELF spec): + + * `DT_NULL = 0` + * `DT_NEEDED = 1` + * `DT_STRTAB = 5` + * `DT_SYMTAB = 6` + * `DT_STRSZ = 10` + * `DT_INIT = 12` + * `DT_FINI = 13` + * `DT_SONAME = 14` + * `DT_INIT_ARRAY = 25` + * `DT_FINI_ARRAY = 26` + * `DT_INIT_ARRAYSZ = 27` + * `DT_FINI_ARRAYSZ = 28` + * `DT_PREINIT_ARRAY = 32` + * `DT_PREINIT_ARRAYSZ = 33` + +* To decode SONAME and NEEDED: + + * Use `DT_STRTAB` as base VA of the dynamic string table. + * Map VA to file offset with `TryMapVaToFileOffset`. + * For each `DT_NEEDED` / `DT_SONAME`, treat `d_val` as an offset into that string table; read a null‑terminated UTF‑8 C‑string. + +--- + +## 4. Constructor & Init Routine Discovery + +We now define the algorithm implemented by `InitRoutineDiscovery` for a **single ELF file**. + +High‑level steps: + +1. Parse `ElfFile`. +2. Parse `DynamicSection`. +3. Resolve: + + * Pre‑init array (`DT_PREINIT_ARRAY`, `.preinit_array`) + * Init array (`DT_INIT_ARRAY`, `.init_array`) + * Legacy `.ctors` + * `_init`, `_fini` via `DT_INIT`/`DT_FINI` + * Fini array (`DT_FINI_ARRAY`, `.fini_array`) +4. For each VA, optionally resolve symbol name. +5. Build `InitRoutineRoot` entries. + +### 4.1 Pointer size & endianness + +* For ELF64: + + * Pointer size = 8 bytes. +* For ELF32: + + * Pointer size = 4 bytes (if/when you support it). +* Use `BinaryPrimitives.ReadUInt64LittleEndian` or `ReadUInt64BigEndian` depending on `ElfEndianness`.
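The §4.1 rules collapse into one small helper. A minimal sketch only, not part of the proposed public surface: the two boolean flags stand in for the `ElfClass`/`ElfEndianness` enums used elsewhere in this spec.

```csharp
using System;
using System.Buffers.Binary;

public static class ElfPointer
{
    // Read one pointer-sized array entry at `offset`, honoring ELF class
    // (4 vs 8 bytes) and byte order, widening 32-bit values to ulong.
    public static ulong Read(ReadOnlySpan<byte> data, int offset,
                             bool is64Bit, bool littleEndian)
    {
        var slice = data.Slice(offset);
        if (is64Bit)
            return littleEndian
                ? BinaryPrimitives.ReadUInt64LittleEndian(slice)
                : BinaryPrimitives.ReadUInt64BigEndian(slice);
        return littleEndian
            ? BinaryPrimitives.ReadUInt32LittleEndian(slice)
            : BinaryPrimitives.ReadUInt32BigEndian(slice);
    }
}
```

Keeping this branch in one place means the array readers in §4.3 and §4.4 never repeat the class/endianness logic.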
+ +### 4.2 Mapping VA → file offset + +`ElfFile.TryMapVaToFileOffset`: + +* Iterate `ProgramHeaders` with `p_type == PT_LOAD`. +* If `virtualAddress` in `[p_vaddr, p_vaddr + p_memsz)`: + + * `fileOffset = p_offset + (virtualAddress - p_vaddr)` +* Return false if no matching segment. + +### 4.3 Reading init arrays + +Generic helper: + +```csharp +internal static IReadOnlyList<ulong> ReadPointerArray( + ElfFile elf, + ulong arrayVa, + ulong arrayBytes) +{ + var results = new List<ulong>(); + if (!elf.TryMapVaToFileOffset(arrayVa, out var fileOffset)) + return results; + + int pointerSize = elf.ElfClass == ElfClass.Elf64 ? 8 : 4; + int count = (int)(arrayBytes / (ulong)pointerSize); + + var span = elf.RawBytes.Span; + for (int i = 0; i < count; i++) + { + ulong offset = fileOffset + (ulong)(i * pointerSize); + if (offset + (ulong)pointerSize > (ulong)span.Length) + break; + + ulong pointerValue = elf.Endianness switch + { + ElfEndianness.Little when pointerSize == 8 + => System.Buffers.Binary.BinaryPrimitives.ReadUInt64LittleEndian(span[(int)offset..]), + ElfEndianness.Little + => System.Buffers.Binary.BinaryPrimitives.ReadUInt32LittleEndian(span[(int)offset..]), + ElfEndianness.Big when pointerSize == 8 + => System.Buffers.Binary.BinaryPrimitives.ReadUInt64BigEndian(span[(int)offset..]), + _ // Big, 32-bit + => System.Buffers.Binary.BinaryPrimitives.ReadUInt32BigEndian(span[(int)offset..]), + }; + + if (pointerValue != 0) + results.Add(pointerValue); + } + + return results; +} +``` + +Apply to: + +* Pre‑init: if `Dynamic.PreInitArrayAddress` and `Dynamic.PreInitArraySize` present. +* Init: if `Dynamic.InitArrayAddress` and `Dynamic.InitArraySize` present. +* Fini: if `Dynamic.FiniArrayAddress` and `Dynamic.FiniArraySize` present. + +### 4.4 Legacy `.ctors` section + +Fallback for older toolchains: + +* Find section with `Name == ".ctors"`. +* Its contents are just an array of pointers (same pointer size as ELF). +* Some compilers include a sentinel `-1` or `0` at beginning or end.
Treat: + + * `0` or `0xFFFFFFFFFFFFFFFF` (for 64‑bit) as sentinel; skip them. +* Use similar `ReadPointerArray` logic but starting from `sh_offset` rather than a VA. + +### 4.5 `_init` / `_fini` functions + +* `Dynamic.InitFunction` (from `DT_INIT`) is a single VA. +* `Dynamic.FiniFunction` (from `DT_FINI`) likewise. + +Even if arrays exist, these may also be present; treat them as **independent roots**. + +--- + +## 5. Symbol Resolution (best‑effort names) + +Define interface: + +```csharp +public interface ISymbolResolver +{ + /// <summary> + /// Find the symbol whose address matches `virtualAddress` exactly, + /// or, if not found, the closest preceding symbol (with an offset). + /// </summary> + SymbolInfo? ResolveSymbol(ElfFile elf, ulong virtualAddress); +} + +public sealed record SymbolInfo( + string Name, + ulong Value, + ulong Size +); +``` + +Implementation sketch: + +* Use `.dynsym` (dynamic symbol table), and `.symtab` (full symbol table) if available. +* Each symbol entry includes: + + * Name offset in string table + * Value (VA) + * Size + * Type/binding (function, object, etc.) +* Build an in‑memory index (e.g. sorted by `Value`) per ELF file. +* `ResolveSymbol`: + + * Prefer exact match of `Value`. + * If none, find symbol with largest `Value` less than `virtualAddress` and treat as “nearest symbol + offset”. + * You can show just `Name` or `Name+0xOFFSET` in explanations; for `InitRoutineRoot` we store plain `Name`. + +--- + +## 6. Dynamic Dependencies & Load-Time Roots + +When `InitDiscoveryOptions.IncludeDependencies == true`: + +1. For root ELF: + + * Discover its roots as above. +2. For each `neededSoname` in `Dynamic.Needed`: + + * Ask `IElfDependencyResolver.ResolveLibrary(neededSoname, rootElfPath)`. + * If it returns a path not yet processed: + + * Parse this ELF and recursively discover its roots. +3. Return a **flat list** of all `InitRoutineRoot` objects, but with their own `BinaryPath`/`Soname`.
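Steps 1–3 above amount to a breadth-first walk with a visited set, which also guards against dependency cycles (`libA` needing `libB` needing `libA`). A self-contained sketch, with per-file discovery and SONAME resolution reduced to delegates since the real code would go through `IInitRoutineDiscovery` and `IElfDependencyResolver`:

```csharp
using System;
using System.Collections.Generic;

public static class DependencyWalk
{
    // Returns the distinct binaries to discover roots in, in visit order.
    // `neededOf` yields the DT_NEEDED sonames of a file; `resolve` maps a
    // soname (plus the referencing binary's path) to a local file path.
    public static List<string> WalkBreadthFirst(
        string rootPath,
        Func<string, IReadOnlyList<string>> neededOf,
        Func<string, string, string?> resolve)
    {
        var visited = new HashSet<string>(StringComparer.Ordinal);
        var order = new List<string>();
        var queue = new Queue<string>();
        queue.Enqueue(rootPath);

        while (queue.Count > 0)
        {
            var path = queue.Dequeue();
            if (!visited.Add(path))
                continue; // already processed: breaks dependency cycles

            order.Add(path);
            foreach (var soname in neededOf(path))
            {
                var dep = resolve(soname, path);
                if (dep is not null && !visited.Contains(dep))
                    queue.Enqueue(dep);
            }
        }

        return order;
    }
}
```

Each visited path then gets the single-file discovery of §4 applied, and the resulting `InitRoutineRoot` lists are concatenated into the flat result.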
+ +Important: **We do not implicitly model `dlopen()`** at this stage. That’s separate: + +* As an optional heuristic, if the binary imports `dlopen`, tag those DSOs so later we can add “potential plugin load” roots. You can park this as a TODO in the comments. + +--- + +## 7. Call Graph / Reachability Integration + +This depends on your existing modeling, but here’s a generic spec a C# dev can follow. + +Assume there is an internal model: + +```csharp +public sealed class CallGraph +{ + public Node GetOrCreateNode(string binaryPath, ulong virtualAddress, string? symbolName); + public Node GetOrCreateSyntheticRoot(string rootId, string description); + public void AddEdge(Node from, Node to, CallEdgeMetadata metadata); +} + +public sealed record CallEdgeMetadata( + string EdgeKind, // e.g. "loader-init" + InitPhase Phase, // Load / Unload + InitRoutineKind InitKind, + int? ArrayIndex +); +``` + +### 7.1 Synthetic loader node + +Create a single graph node representing the dynamic loader / program start: + +```csharp +var loaderNode = callGraph.GetOrCreateSyntheticRoot( + "LOADER", + "ELF dynamic loader / process start" +); +``` + +### 7.2 Adding edges for each root + +For each `InitRoutineRoot root`: + +1. Get or create a node for the target function: + + ```csharp + var target = callGraph.GetOrCreateNode( + root.BinaryPath, + root.VirtualAddress, + root.SymbolName + ); + ``` + +2. Add edge from loader: + + ```csharp + callGraph.AddEdge( + loaderNode, + target, + new CallEdgeMetadata( + EdgeKind: "loader-init", + Phase: root.Phase, + InitKind: root.Kind, + ArrayIndex: root.ArrayIndex + ) + ); + ``` + +3. Optional: If you model **per‑library** loader nodes, you can add: + + * `LOADER -> libLoaderNode` + * `libLoaderNode -> each constructor` + + but that’s a nice‑to‑have, not required. 
+ +### 7.3 Phases + +* For `.preinit_array`, `.init_array`, `.ctors`, `_init`: + + * `Phase = InitPhase.Load` +* For `.fini_array`, `_fini`: + + * `Phase = InitPhase.Unload` + +This allows downstream UI to say e.g.: + +> This vulnerable function is reachable at **load time** via constructor `foo()` in `libbar.so`. + +--- + +## 8. Attestation / Evidence Output + +We want deterministic, auditable output per scan. + +Define a JSON schema (C# record) stored alongside other scan artifacts: + +```csharp +public sealed record InitRoutineEvidence( + string ScannerVersion, + DateTimeOffset ScanTimeUtc, + IReadOnlyList Entries +); + +public sealed record InitRoutineEvidenceEntry( + string BinaryPath, + string? Soname, + InitRoutineKind Kind, + InitPhase Phase, + ulong VirtualAddress, + ulong? FileOffset, + string? SymbolName, + int? ArrayIndex +); +``` + +Implementation details: + +* After `IInitRoutineDiscovery.Discover` completes: + + * Convert each `InitRoutineRoot` to `InitRoutineEvidenceEntry`. + * Serialize with `System.Text.Json` (property names in camelCase or snake_case; choose a stable convention). +* Store the evidence file e.g. `init_roots.json` inside the scan’s result directory. + +--- + +## 9. Implementation Details & Edge Cases + +### 9.1 Architectures + +First version: + +* Support: + + * `ElfClass.Elf64` + * `ElfEndianness.Little` + * `EM_X86_64` +* For anything else: + + * Log an `InitRoutineError` and skip (but don’t hard‑fail the whole scan). + +Design the parser so architecture is an enum: + +```csharp +internal enum ElfMachine : ushort +{ + X86_64 = 62, + // others later +} +``` + +### 9.2 Relocations (simplification) + +Real loaders apply relocations to constructor arrays; some pointers may be stored as relative relocations. + +For **v1 implementation**: + +* Assume that: + + * Array entries are already absolute VAs in the ELF’s address space (which is typical for non‑PIE or when link‑time addresses are used). 
+* If you need better fidelity later: + + * Parse `.rela.dyn` / `.rel.dyn`. + * Apply `R_X86_64_RELATIVE` relocations whose `r_offset` falls within the array’s address range: + + * Effective address = (base address + addend); if you treat base as 0, you get a VA that’s correct **within the file** (relative). + +Document this as a TODO so later you can extend without breaking the API. + +### 9.3 Error handling + +* All parsing errors **must be non‑fatal** to the overall scan: + + * Record `InitRoutineError` with `BinaryPath`, message, and exception. + * Continue with other binaries. +* If a binary is not ELF or has invalid magic: + + * Return no roots, but optionally log a low‑severity error. + +--- + +## 10. Testing Strategy + +### 10.1 Unit tests with synthetic ELF fixtures + +Create a small test project `StellaOps.ElfInit.Tests` with known ELF files checked into test resources: + +* Binaries compiled with small C programs like: + + ```c + static void __attribute__((constructor)) c1(void) {} + static void __attribute__((constructor)) c2(void) {} + static void __attribute__((destructor)) d1(void) {} + int main() { return 0; } + ``` + +* Variants: + + * Using `.ctors` (old GCC flags) for legacy coverage. + * Shared library with `__attribute__((constructor))` and `DT_NEEDED` from a main binary. + * Binary with no constructors (expect zero roots). + +Assertions: + +* The count of `InitRoutineRoot` matches expected. +* `Kind` and `Phase` are correct. +* `ArrayIndex` is correctly ordered: 0,1,2 … +* `SymbolName` contains expected mangled function names (if compiler doesn’t drop them). +* For dependencies: + + * Discover roots in `libfoo.so` when main depends on it via `DT_NEEDED`. + +### 10.2 Integration tests with call graph + +* Given a small binary and a known vulnerable function reachable from a constructor: + + * Run full pipeline. + * Assert that the vulnerable function is marked reachable from synthetic `LOADER` node via the constructor. 
+ +### 10.3 Fuzz / robustness + +* Run the discovery on: + + * Random non‑ELF files. + * Truncated ELF files. + * Very large binaries. +* Ensure no unhandled exceptions; only `InitRoutineError` entries. + +--- + +## 11. Suggested C# Project Layout + +```text +src/ + StellaOps.ElfInit/ + IInitRoutineDiscovery.cs + InitRoutineModels.cs + InitRoutineDiscovery.cs + IElfDependencyResolver.cs + ISymbolResolver.cs + Evidence/ + InitRoutineEvidence.cs + Elf/ + ElfFile.cs + ElfParser.cs + ElfHeader.cs + ProgramHeader.cs + SectionHeader.cs + DynamicSection.cs + VaMapper.cs + PointerArrayReader.cs +tests/ + StellaOps.ElfInit.Tests/ + Resources/ + sample_no_ctor + sample_init_array + sample_preinit_init_fini + sample_with_deps_main + libsample_ctor.so + InitRoutineDiscoveryTests.cs +``` + +--- + +If you’d like, I can next: + +* Draft `InitRoutineDiscovery` in C# with full method bodies, or +* Provide a minimal `ElfFile`/`ElfParser` implementation skeleton you can fill in. diff --git a/docs/product-advisories/20-Nov-2026 - Branch · Reachability & Moat Watch — Verified 2025 Updates.md b/docs/product-advisories/20-Nov-2026 - Branch · Reachability & Moat Watch — Verified 2025 Updates.md new file mode 100644 index 000000000..640c28196 --- /dev/null +++ b/docs/product-advisories/20-Nov-2026 - Branch · Reachability & Moat Watch — Verified 2025 Updates.md @@ -0,0 +1,1859 @@ +Short answer: +Yes, we have more than enough public information and ecosystem knowledge to design *better* reachability graphs for PHP and JavaScript than what Snyk/Semgrep expose today—especially in terms of openness, precision, and determinism. What we do **not** have is their exact proprietary heuristics, but we don’t need those to surpass them architecturally. + +Let me break it down in the dimensions that matter for Stella Ops. + +--- + +## 1. 
What we concretely know from Snyk & Semgrep + +From public material we can infer the *shape* of their systems: + +* **Snyk** + + * Builds a call graph of the application + dependencies to decide if vulnerable functions are on an execution path from “entry points” (e.g., HTTP handlers, CLI entry, etc.). ([Snyk][1]) + * For its “Reachable Vulnerabilities” feature, Snyk explicitly states that it ingests your repo, builds a call graph, then discards source and keeps only the graph + function names. ([docs.snyk.io][2]) + * Combines SCA with static analysis and uses reachability as a factor in “risk score” / prioritization. ([docs.snyk.io][3]) + +* **Semgrep (Supply Chain)** + + * Reachability is computed by correlating manifests/lockfiles with static analysis of the code to see whether vulnerable components are actually used. ([semgrep.dev][4]) + * Uses a dependency graph for supply chain (including transitive deps) and classifies findings as “always reachable / conditionally reachable / needs review / no reachability analysis.” ([semgrep.dev][5]) + * For PHP specifically, they now advertise reachability as GA in Supply Chain (we saw that in your earlier search). This tells us they do at least basic call-graph level reasoning + data flow for PHP. + +Conceptually, that already gives us the core primitives: + +* Call graphs (application + dependencies). +* Entry point modeling. +* Mapping vulnerable symbols (functions/methods/routes) to nodes in that graph. +* Reachability classification at the level of “reachable / no-path / conditional / not analyzed”. + +We also have additional public references (Endor Labs, Coana, GitLab, GitHub, etc.) that all describe more or less the same model: build call graphs or code property graphs and do forward/backward reachability over them. ([endorlabs.com][6]) + +So: the algorithmic *space* is well-documented. The secret sauce is mostly heuristics and engineering, not unknown math. + +--- + +## 2. 
Where the gaps actually are + +What we **do not** get from Snyk/Semgrep publicly: + +* Concrete internal call-graph algorithms and framework models (how they resolve dynamic imports, reflection, magic in PHP, complex JS bundler semantics). +* Their framework-specific “entry point catalogs” (e.g., mapping Express/Koa/NestJS/Next.js routes, Laravel/Symfony/WordPress hooks, etc.). +* Their internal tuning of false-positive / false-negative trade-offs per language and framework. +* Their private benchmarks and labeled datasets. + +That means we cannot “clone Snyk’s reachability,” but we absolutely can design: + +1. A **better graph spec**. +2. A **more transparent and deterministic pipeline**. +3. Stronger **binary + container + SBOM/VEX integration**. + +Which is exactly aligned with your Stella Ops vision. + +--- + +## 3. For PHP & JavaScript specifically: can we beat them? + +For **graph quality and expressiveness**, yes, we can. + +### JavaScript / TypeScript + +Existing tools face these pain points: + +* Highly dynamic imports (`require(...)`, `import()`, bundlers). +* Multiple module systems (CJS, ESM, UMD), tree-shaking, dead code elimination. +* Framework magic (Next.js, React SSR, Express middlewares, serverless handlers). + +Public info shows Snyk builds a call graph and analyzes execution paths, but details on how they handle all JS edge cases are abstracted away. ([Snyk][1]) + +What we can do better in Stella Ops graphs: + +* **First-class “resolution nodes”**: + + * Represent module resolution, bundler steps, and dynamic import decisions as explicit nodes/edges in the graph. + * This makes ambiguity *visible* instead of hidden inside a heuristic. +* **Framework contracts**: + + * Have pluggable “route/handler mappers” per framework (Express, Nest, Next, Fastify, serverless wrappers) so entry points are explicit graph roots, not magic. +* **Multiple call-graph layers**: + + * Source-level graph (TS/JS). + * Bundled output graph (Webpack/Vite/Rollup). 
+ * Runtime-inferred hints (if we later choose to add traces), all merged into a unified reachability graph with provenance tags. + +If we design our graph format to preserve all uncertainty explicitly (e.g., edges tagged as “over-approximate”, “dynamic-guess”, “runtime-confirmed”), we will have *better analytical quality* even if raw coverage is comparable. + +### PHP + +Semgrep now has PHP reachability GA in Supply Chain, but again we only see the outcomes, not the internal graph model. ([DEV Community][7]) + +We can exploit known pain points in PHP: + +* Dynamic includes / autoloaders. +* Magic methods, dynamic dispatch, frameworks like Laravel/Symfony/WordPress/Drupal. +* Templating / view layers that act as “hidden” entry points. + +Improvements in the Stella Ops model: + +* **Autoloader-aware graph layer**: + + * Model Composer autoloading rules explicitly; edges from `composer.json` and PSR-4/PSR-0 rules into the graph. +* **Framework profiles**: + + * For Laravel/Symfony/etc., we ship profiles that define how controllers, routes, middlewares, commands, and events are wired. Those profiles become graph generators, not just regex signatures. +* **Source-to-SBOM linkage**: + + * Nodes are annotated with PURLs and SBOM component IDs, so you get reachability graph edges directly against SBOM + VEX. + +Again, even without their internals, we can design a **richer, more transparent graph representation**. + +--- + +## 4. How Stella Ops can clearly surpass them (graph-wise) + +Given your existing roadmap (SBOM spine, deterministic replay, lattice policies), we can deliberately design a reachability graph system that outclasses them in these axes: + +1. **Open, documented graph spec** + + * Define a “Reachability Graph Manifest”: + + * Nodes: functions/methods/routes/files/modules + dependency components (PURLs). + * Edges: call edges, data-flow edges, dependency edges, “resolution” edges. + * Metadata: language, framework, hashes, provenance, SBOM linkage. 
+ * Publish it so others can generate/consume the same graphs.
+
+2. **Deterministic, replayable scans**
+
+   * Every scan is defined by:
+
+     * Exact sources (hashes).
+     * Analyzer version.
+     * Ruleset + framework profiles.
+   * Result: any reachability verdict can be re-computed bit-for-bit later.
+
+3. **PURL-level edges for supply chain**
+
+   * Reachability graph includes direct edges:
+
+     * `app:function` → `package:function` → `CVE`.
+   * This is exactly what most tools conceptually do, but we make it explicit and exportable.
+
+4. **Rich status model beyond “reachable / not”**
+
+   * Adopt and extend Semgrep-like statuses: `always_reachable`, `conditionally_reachable`, `unreachable`, `not_analyzed`, `ambiguous`.
+   * Add:
+
+     * Confidence levels.
+     * Types of ambiguity (dynamic dispatch, reflection, unresolved include, etc.).
+
+5. **Integration with lattice / policy engine**
+
+   * You already plan a Trust Algebra / lattice engine.
+   * The reachability graph becomes a first-class input:
+
+     * Policies like “treat `conditionally_reachable` through untrusted input as High Risk; treat `always_reachable` behind auth as Medium; treat `unreachable` but exploitable via RCE as Special Case”.
+
+6. **Offline, sovereign, binary-aware**
+
+   * Most commercial tools are SaaS and source-centric.
+   * We can:
+
+     * Run completely offline.
+     * Attach reachability graphs to container images, SBOMs, and in-toto attestations.
+     * Extend to compiled artifacts later (ELF/PE/Mach-O, which you already plan).
+
+All of this is achievable with publicly available theory + our own engineering, independent of Snyk/Semgrep proprietary code.
+
+---
+
+## 5. Practical conclusion
+
+So, to answer your question precisely:
+
+* We do **not** have enough information to duplicate Snyk or Semgrep’s *internal implementations*.
+* We absolutely have enough information (from their docs + the broader reachability literature + open tools) to: + + * Design a **more expressive, auditable reachability graph model** for PHP and JavaScript. + * Implement our own call-graph construction and reachability analysis that is: + + * Deterministic, + * SBOM/VEX-native, + * Framework-aware, + * And aligned with Stella Ops’ offline/attested architecture. + +If you want, next step I can sketch a concrete “Reachability Graph v1” JSON schema for PHP/JS with PURL edges and confidence annotations that can plug directly into your Scanner/WebService and SBOM spine. + +[1]: https://snyk.io/blog/optimizing-prioritization-with-deep-application-level-context/?utm_source=chatgpt.com "Optimizing prioritization with deep application-level context" +[2]: https://docs.snyk.io/snyk-data-and-governance/how-snyk-handles-your-data?utm_source=chatgpt.com "How Snyk handles your data | Snyk User Docs" +[3]: https://docs.snyk.io/manage-risk/prioritize-issues-for-fixing/reachability-analysis?utm_source=chatgpt.com "Reachability analysis - Snyk User Docs" +[4]: https://semgrep.dev/blog/2024/sca-reachability-analysis-methods?utm_source=chatgpt.com "Comparing Reachability Analysis methods" +[5]: https://semgrep.dev/blog/2024/less-effort-more-insight-introducing-dependency-graph-for-supply-chain?utm_source=chatgpt.com "Less effort, more insight: Introducing Dependency Graph ..." +[6]: https://www.endorlabs.com/learn/what-is-reachability-based-dependency-analysis?utm_source=chatgpt.com "What is Reachability-Based Dependency Analysis? | Blog" +[7]: https://dev.to/semgrep/ai-memories-php-reachability-cve-policies-and-benchmarking-3naj?utm_source=chatgpt.com "AI Code Assistant Memories, PHP Reachability, CVE ..." +Good, let’s turn this into something a mid-level engineer can actually implement for PHP without guesswork. 
+ +Below is a **concrete, implementation-ready reachability spec** for PHP v1, structured so you can give it directly to an engineer as requirements. + +--- + +## 1. Scope of PHP Reachability v1 + +**Goal** +Given: + +* A PHP project (source code), +* `composer.json` + `composer.lock`, +* A list of vulnerable symbols (e.g., FQNs from a vulnerability DB, each tied to a PURL), + +produce: + +1. A **call graph** of PHP functions/methods (with nodes and edges). +2. A **mapping** between nodes and dependency components (PURLs). +3. A **reachability report** per vulnerable symbol: + + * `reachable`, `maybe_reachable`, `unreachable`, `not_analyzed` + * With example call paths when reachable. + +**Deliberate limitations of v1 (explicit, to keep it doable):** + +* Supported code: + + * Plain PHP functions. + * Class methods (normal, static). +* Supported calls: + + * Direct function calls: `foo()` + * Method calls: `$obj->bar()`, `Foo::bar()` +* Supported resolution features: + + * Namespaces + `use` imports. + * Composer autoload mapping (PSR-4/0, classmap) from `composer.json`. +* Not fully supported (treated conservatively as “maybe”): + + * Dynamic function names (`$fn()`). + * Dynamic method calls (`$obj->$name()`). + * Heavy reflection magic. + * Complex framework containers (Laravel, Symfony DI) – reserved for v2. + +--- + +## 2. Reachability Graph Document (JSON) + +The main artifact is a **graph document**. One file per scan: + +```json +{ + "schemaVersion": "1.0.0", + "language": "php", + "project": { + "projectId": "my-app", + "rootDir": "/src/app", + "hash": "sha256:..." + }, + "components": [ + { + "id": "comp-1", + "purl": "pkg:composer/vendor/lib-a@1.2.3", + "name": "vendor/lib-a", + "version": "1.2.3" + } + ], + "nodes": [], + "edges": [], + "vulnerabilities": [], + "reachabilityResults": [] +} +``` + +### 2.1 Node model + +Every node is a **callable** (function or method) or an **entry point**. 
+ +```json +{ + "id": "node-uuid-or-hash", + "kind": "function | method | entrypoint", + "name": "index", + "fqn": "\\App\\Controller\\HomeController::index", + "file": "src/Controller/HomeController.php", + "line": 42, + "componentId": "comp-1", + "purl": "pkg:composer/vendor/lib-a@1.2.3", + "entryPointType": "http_route | cli | unknown | null", + "extras": { + "namespace": "\\App\\Controller", + "className": "HomeController", + "visibility": "public | protected | private | null" + } +} +``` + +**Rules for node creation** + +* **Function node** + + * `kind = "function"` + * `fqn` = `\Namespace\functionName` +* **Method node** + + * `kind = "method"` + * `fqn` = `\Namespace\ClassName::methodName` +* **Entrypoint node** + + * `kind = "entrypoint"` + * `entryPointType` set accordingly (may be `unknown` initially). + * Typically represents: + + * `public/index.php` + * `bin/console` commands, etc. + * Entrypoints can either: + + * Be separate nodes that **call** real functions/methods, or + * Be the same node as a method/function flagged as `entrypoint`. + For v1, keep it simple: **separate entrypoint nodes** that call “real” nodes. + +### 2.2 Edge model + +Edges capture relationships in the graph. + +```json +{ + "id": "edge-uuid-or-hash", + "from": "node-id-1", + "to": "node-id-2", + "type": "call | include | autoload | entry_call", + "confidence": "high | medium | low", + "extras": { + "callExpression": "Foo::bar($x)", + "file": "src/Controller/HomeController.php", + "line": 50 + } +} +``` + +**Edge types (v1)** + +* `call` + From a function/method to another function/method (resolved). +* `include` + From a file-level node or entrypoint to nodes defined in included file (optional for v1; can be “expanded” by treating all included definitions as reachable). +* `autoload` + From usage site to class definition when resolved via Composer autoload (optional to expose as a separate edge type; good for debug). 
+* `entry_call` + From an entrypoint node to the first callable(s) it invokes. + +For v1, an engineer can implement **only `call` + `entry_call`** and treat `include`/`autoload` as internal mechanics that result in `call` edges. + +### 2.3 Vulnerabilities model + +Input from your vulnerability database (or later from VEX) mapped into the graph: + +```json +{ + "id": "CVE-2020-1234", + "source": "internal-db-or-nvd-id", + "componentPurl": "pkg:composer/vendor/lib-a@1.2.3", + "symbolFqn": "\\Vendor\\LibA\\Foo::dangerousMethod", + "symbolKind": "method | function", + "severity": "critical | high | medium | low", + "extras": { + "description": "RCE in Foo::dangerousMethod", + "range": ">=1.0.0,<1.2.5" + } +} +``` + +At graph build time, you **pre-resolve** `symbolFqn` to `node.id` where possible and record it in `extras`. + +--- + +## 3. Reachability Results Structure + +Once you have the graph and the vulnerability list, you run reachability and produce: + +```json +{ + "vulnerabilityId": "CVE-2020-1234", + "componentPurl": "pkg:composer/vendor/lib-a@1.2.3", + "symbolFqn": "\\Vendor\\LibA\\Foo::dangerousMethod", + "targetNodeId": "node-123", + "status": "reachable | maybe_reachable | unreachable | not_analyzed", + "reason": "short explanation string", + "paths": [ + ["entry-node-1", "node-10", "node-20", "node-123"] + ], + "analysisMeta": { + "algorithmVersion": "1.0.0", + "maxDepth": 100, + "timestamp": "2025-11-20T19:30:00Z" + } +} +``` + +**Status semantics:** + +* `reachable` + There exists at least one **concrete call path** from an entrypoint node to `targetNodeId` using only `confidence = high` edges. +* `maybe_reachable` + A path exists but at least one edge along any path has `confidence = medium | low` (dynamic call, unresolved class alias, etc.). +* `unreachable` + No path exists from any entrypoint to the target node in the constructed graph. 
+* `not_analyzed`
+  We failed to build a node for the symbol or failed the analysis (parse errors, missing source, etc.).
+
+---
+
+## 4. Analysis Pipeline Spec (Step-by-Step)
+
+This is the part a mid-level engineer can follow as tasks.
+
+### 4.1 Inputs
+
+* Directory with PHP code (`/app`).
+* `composer.json`, `composer.lock`.
+* List of vulnerabilities (as above).
+* Optional SBOM mapping PURLs to file paths (if you have it; otherwise use Composer metadata only).
+
+---
+
+### 4.2 Step 1 – Parse Composer Metadata & Build Components
+
+1. Read `composer.lock`.
+2. For each package in `"packages"`:
+
+   * Build `purl` like:
+     `pkg:composer/<vendor>/<package>@<version>`
+   * Create `components[]` entry (with generated `componentId`).
+3. For the root project, create one component (e.g., `app`) with `purl = null` or a synthetic one (`pkg:composer/mycompany/myapp@dev`).
+
+**Output:**
+
+* `components[]` array.
+* `componentIndex`: map from package name to `componentId`.
+
+---
+
+### 4.3 Step 2 – PHP AST & Symbol Table
+
+Use a standard AST library (e.g., `nikic/php-parser`) – explicitly allowed and expected.
+
+For each PHP file in:
+
+* application source dirs (e.g. `src/`, `app/`),
+* vendor dirs (if you choose to parse vendor code; v1 may do that only for needed components):
+
+Perform:
+
+1. Parse file → AST.
+2. Extract:
+
+   * File namespace.
+   * `use` imports (class aliases).
+   * Function definitions: name, line.
+   * Class definitions: name, namespace, methods.
+3. Build **symbol table**:
+
+```php
+// conceptual structure:
+class SymbolTable {
+    // Fully qualified class or function name → node meta
+    public array $functionsByFqn;
+    public array $methodsByFqn; // "\Ns\Class::method"
+}
+```
+
+4. Determine `componentId` for each file:
+
+   * If path under `vendor/vendor-name/package-name/` → map to that Composer package → `componentId`.
+   * Else → root app component.
+
+5. Create **nodes**:
+
+* For each function:
+
+  * Node `kind = "function"`.
+* For each method: + + * Node `kind = "method"`. + +Assign `id`, `file`, `line`, `fqn`, `componentId`, `purl`. + +**Output:** + +* `nodes[]` with all functions/methods. +* `symbolTable` (for resolving calls). + +--- + +### 4.4 Step 3 – Entrypoint Detection + +v1 simple rules: + +1. Any of: + + * `public/index.php` + * `index.php` in project root + * Files under `bin/` or `cli/` with `#!/usr/bin/env php` shebang + are considered **entrypoint files**. + +2. For each entrypoint file: + + * Create an `entrypoint` node with: + + * `file` = that file + * `entryPointType` = `"http_route"` (for `public/index.php`) or `"cli"` (for `bin/*`) or `"unknown"`. + * Add to `nodes[]`. + +3. Later, when scanning each entrypoint file’s AST, you will create `entry_call` edges from the entrypoint node to the first layer of call targets inside that file. + +**Output:** + +* Additional `entrypoint` nodes. + +--- + +### 4.5 Step 4 – Call Graph Construction + +For each parsed file: + +1. Traverse AST for call expressions: + + * `foo()` → candidate function call. + * `$obj->bar()` → instance method call. + * `Foo::bar()` → static method call. + +2. **Resolve function calls**: + + Given: + + * Called name (may be qualified, relative, or unqualified). + * Current file namespace. + + Resolution rules: + + * If fully qualified (starts with `\`): use directly as FQN. + * Else: + + * Check `use` imports for alias match. + * If no alias, prepend current namespace. + * Look up FQN in `symbolTable.functionsByFqn` or `methodsByFqn`. + * If found → **resolved call** with `confidence = "high"`. + * If not found → mark `confidence = "low"` and set `to` to a synthetic node id like `unknown` or skip creating an edge in v1 (implementation choice – recommended: create edge to special `unknown` node). + +3. **Resolve method calls `$obj->bar()`** (v1 simplified): + + * Assume dynamic instance type is not known statically → resolution is ambiguous. 
+ * For v1, treat these as: + + * `confidence = "medium"` and: + + * If `$obj` variable has a clear `new ClassName` assignment in the same function, try to infer class and use same resolution rules as static calls. + * Otherwise, create edges from calling node to all methods named `bar` in **any class inside the same component**. + * This is over-approximate but conservative. + +4. **Resolve static method calls `Foo::bar()`**: + + * Resolve `Foo` to FQN using namespace + imports (same as functions). + * Build FQN `\Ns\Foo::bar`. + * Look up in `symbolTable.methodsByFqn`. + * Mark `confidence = "high"` when resolved. + +5. **Connect entrypoints**: + + * For each entrypoint file: + + * Identify top-level calls in that file (same rules as above). + * Edges: + + * `type = "entry_call"` + * `from = entrypointNodeId` + * `to = resolved callee node` + +**Output:** + +* `edges[]` with `call` and `entry_call` edges. + +--- + +### 4.6 Step 5 – Map Vulnerabilities to Nodes + +For each vulnerability: + +1. If `symbolFqn` is not null: + + * If `symbolKind == "method"` → look into `symbolTable.methodsByFqn`. + * If `symbolKind == "function"` → `symbolTable.functionsByFqn`. + +2. If found → record `targetNodeId` in a lookup: `vulnId → nodeId`. + +3. If not found → `status` will later become `not_analyzed`. + +--- + +### 4.7 Step 6 – Reachability Algorithm + +Core logic: multiple BFS (or DFS) from entrypoints over the call graph. + +**Pre-compute entry roots:** + +* `entryNodes` = ids of all nodes with `kind = "entrypoint"`. 
+ +**Algorithm (BFS from all entrypoints):** + +Pseudo-code (language-agnostic): + +```php +function computeReachability(Graph $graph, array $entryNodes): ReachabilityContext { + $queue = new SplQueue(); + $visited = []; // nodeId => true + $predecessor = []; // nodeId => parent nodeId (for path reconstruction) + $edgeConfidenceOnPath = []; // nodeId => "high" | "medium" | "low" + + foreach ($entryNodes as $entryId) { + $queue->enqueue($entryId); + $visited[$entryId] = true; + $edgeConfidenceOnPath[$entryId] = "high"; + } + + while (!$queue->isEmpty()) { + $current = $queue->dequeue(); + + foreach ($graph->outEdges($current) as $edge) { + if ($edge->type !== 'call' && $edge->type !== 'entry_call') { + continue; + } + + $next = $edge->to; + if (isset($visited[$next])) { + continue; + } + + $visited[$next] = true; + $predecessor[$next] = $current; + + // propagate confidence (lowest on the path wins) + $prevConf = $edgeConfidenceOnPath[$current] ?? "high"; + $edgeConf = $edge->confidence; // "high"/"medium"/"low" + $edgeConfidenceOnPath[$next] = minConfidence($prevConf, $edgeConf); + + $queue->enqueue($next); + } + } + + return new ReachabilityContext($visited, $predecessor, $edgeConfidenceOnPath); +} + +function minConfidence(string $a, string $b): string { + $order = ["high" => 3, "medium" => 2, "low" => 1]; + return ($order[$a] <= $order[$b]) ? $a : $b; +} +``` + +**Classify each vulnerability:** + +For each vulnerability with `targetNodeId`: + +1. If `targetNodeId` is missing → `status = "not_analyzed"`. +2. Else if `targetNodeId` is **not** in `visited` → `status = "unreachable"`. +3. Else: + + * Let `conf = edgeConfidenceOnPath[targetNodeId]`. + * If `conf == "high"` → `status = "reachable"`. + * If `conf == "medium" or "low"` → `status = "maybe_reachable"`. 
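The classification rules above are mechanical enough to capture in a few lines. Here is a minimal sketch in TypeScript (chosen only because the JS spec later in this document uses it; the PHP pipeline can be hosted in any language). The `classify` function and the sample `ctx` value are hypothetical illustrations, not part of the spec:

```typescript
type Confidence = "high" | "medium" | "low";

interface ReachabilityContext {
  visited: Record<string, boolean>;
  edgeConfidenceOnPath: Record<string, Confidence>;
}

// Apply the status semantics: not resolved → not_analyzed; never visited
// by the BFS → unreachable; visited with only high-confidence edges on the
// path → reachable; otherwise → maybe_reachable.
function classify(ctx: ReachabilityContext, targetNodeId: string | null): string {
  if (targetNodeId === null) return "not_analyzed";
  if (!ctx.visited[targetNodeId]) return "unreachable";
  return ctx.edgeConfidenceOnPath[targetNodeId] === "high"
    ? "reachable"
    : "maybe_reachable";
}

// Toy context: entry-1 → node-10 (high) → node-123 (medium).
const ctx: ReachabilityContext = {
  visited: { "entry-1": true, "node-10": true, "node-123": true },
  edgeConfidenceOnPath: { "entry-1": "high", "node-10": "high", "node-123": "medium" },
};

console.log(classify(ctx, "node-123")); // maybe_reachable (medium edge on path)
console.log(classify(ctx, "node-999")); // unreachable (never visited)
console.log(classify(ctx, null));       // not_analyzed (symbol not resolved)
```

The point of keeping this step trivially small is that all the hard work (confidence propagation) already happened in the BFS; classification stays a pure, testable function.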
+ +**Path reconstruction:** + +To generate one example path: + +```php +function reconstructPath(array $predecessor, string $targetId): array { + $path = []; + $current = $targetId; + while (isset($predecessor[$current])) { + array_unshift($path, $current); + $current = $predecessor[$current]; + } + array_unshift($path, $current); // entrypoint at start + return $path; +} +``` + +Store that `path` array in `reachabilityResults[].paths[]`. + +--- + +## 5. Handling PHP “messy bits” (v1 rules) + +This is where we mark things as `maybe` instead of pretending we know. + +1. **Dynamic function names** `$fn()`: + + * Create **no edges** by default in v1. + * Optionally, if `$fn` is a constant string literal obvious in the same function, treat as a normal call. + * Otherwise: leave it out and accept that some cases will be missed → vulnerability may be marked `unreachable` but flagged with `analysisMeta.dynamicCallsIgnored = true`. + +2. **Dynamic methods** `$obj->$method()`: + + * Same principle as above. + +3. **Reflection / `call_user_func` / `call_user_func_array`**: + + * v1: do not try to resolve. + * Optional: track the call sites; mark their outgoing edges as `confidence = "low"` and connect to **all** functions/methods of that name when the name is a string literal. + +4. **Includes** (`include`, `require`, `require_once`, `include_once`): + + * v1 simplest rule: + + * Treat the included file as **fully reachable** from the including file. + * Pseudo-implementation: when building symbol table, everything defined in the included file is considered potentially called by the including file’s entrypoint logic. + * Implementation shortcut: + + * For the first version, you can even skip modeling edges, and instead mark all nodes in included files as “reachable from the entrypoint” if included directly by an entrypoint file. Later refine. + +--- + +## 6. What the engineer actually builds (modules & tasks) + +You can frame it to them like this: + +1. 
**Module `PhpProjectLoader`**
+
+   * Reads project root, finds `composer.json`, `composer.lock`.
+   * Produces `components[]` and mapping from file-path → componentId.
+
+2. **Module `PhpAstIndexer`**
+
+   * Uses `nikic/php-parser`.
+   * For each `.php` file:
+
+     * Produces entries in `symbolTable`.
+     * Produces base `nodes[]` (functions, methods).
+   * Creates `entrypoint` nodes based on known file patterns.
+
+3. **Module `PhpCallGraphBuilder`**
+
+   * Walks AST again:
+
+     * For each callable body, finds call expressions.
+     * Resolves calls via `symbolTable`.
+   * Produces `edges[]`.
+
+4. **Module `PhpReachabilityEngine`**
+
+   * Runs BFS from entrypoints.
+   * Classifies per-vulnerability reachability.
+
+5. **Module `GraphSerializer`**
+
+   * Assembles everything into the JSON schema described in sections 2–3.
+
+Each module is testable with small sample projects.
+
+---
+
+## 7. Minimal working example (very small)
+
+Project:
+
+```php
+// public/index.php
+<?php
+require __DIR__ . '/../vendor/autoload.php';
+
+$controller = new \App\Controller\HomeController();
+$controller->index();
+
+// src/Controller/HomeController.php
+<?php
+namespace App\Controller;
+
+use Vendor\LibA\Foo;
+
+class HomeController
+{
+    public function index(): void
+    {
+        $foo = new Foo();
+        $foo->dangerousMethod();
+    }
+}
+```
+
+Vulnerability:
+
+```json
+{
+  "id": "CVE-2020-1234",
+  "componentPurl": "pkg:composer/vendor/lib-a@1.2.3",
+  "symbolFqn": "\\Vendor\\LibA\\Foo::dangerousMethod",
+  "symbolKind": "method"
+}
+```
+
+Expected reachability path (conceptually):
+
+```json
+[
+  "entry:public/index.php",
+  "\\App\\Controller\\HomeController::index",
+  "\\Vendor\\LibA\\Foo::dangerousMethod"
+]
+```
+
+Status: `reachable` with `confidence = high`.
+
+---
+
+If you’d like, next step I can:
+
+* Turn this into a **formal JSON Schema** file (`reachability-php-graph.schema.json`) and
+* Propose a **directory layout + interfaces** in C#/.NET 10 for `StellaOps.Scanner.Php` so you can drop it straight into the repo.
+Here is a JavaScript/TypeScript reachability spec that a mid-level engineer can actually implement, but which is still “best in class” in terms of clarity, determinism, and extensibility.
+ +I’ll mirror the PHP structure you already have so Scanner/WebService and Sbomer can treat them uniformly. + +--- + +## 1. Scope of JS Reachability v1 + +**Goal** + +Given: + +* A JS/TS project (Node-centric), +* `package.json` + lockfile (`package-lock.json` / `yarn.lock` / `pnpm-lock.yaml`), +* A list of vulnerable symbols (tied to npm PURLs), + +produce: + +1. A **function-level call graph** (nodes + edges). +2. Mapping of nodes to **components** (`pkg:npm/...` PURLs). +3. A **reachability verdict** for each vulnerable symbol: + + * `reachable`, `maybe_reachable`, `unreachable`, `not_analyzed` + * With at least one example call path when reachable/maybe_reachable. + +**Deliberate v1 constraints** + +To keep it very implementable: + +* Target runtime: **Node.js** (server-side). +* Source: **TypeScript + JavaScript** in one unified analysis. + + * Use TypeScript compiler with `allowJs: true` so JS and TS share the same Program. +* Modules: + + * ES Modules (`import`/`export`). + * CommonJS (`require`, `module.exports`, `exports`). +* Supported calls: + + * Direct calls: `foo()`. + * Method calls: `obj.method()`, `Class.method()`. +* Bundlers (Webpack, Vite, etc.): **out of scope v1** (treat source before bundling). +* Dynamic features (handled conservatively, see below): + + * `eval`, `Function` constructor, dynamic imports, `obj[methodName]()`, etc. + +--- + +## 2. Reachability Graph Document (JSON) + +Same high-level shape as PHP, but annotated for JS/TS. + +```json +{ + "schemaVersion": "1.0.0", + "language": "javascript", + "project": { + "projectId": "my-node-app", + "rootDir": "/app", + "hash": "sha256:..." + }, + "components": [], + "nodes": [], + "edges": [], + "vulnerabilities": [], + "reachabilityResults": [] +} +``` + +### 2.1 Components + +Each npm package (including the root app) is a component. 
+
+```json
+{
+  "id": "comp-1",
+  "purl": "pkg:npm/express@4.19.2",
+  "name": "express",
+  "version": "4.19.2",
+  "isRoot": false,
+  "extras": {
+    "resolvedPath": "node_modules/express"
+  }
+}
+```
+
+For the root project, you can use:
+
+```json
+{
+  "id": "comp-root",
+  "purl": "pkg:npm/my-company-my-app@1.0.0",
+  "name": "my-company-my-app",
+  "version": "1.0.0",
+  "isRoot": true
+}
+```
+
+A mid-level engineer can easily build this from `package.json` + the chosen lockfile.
+
+---
+
+### 2.2 Nodes (callables & entrypoints)
+
+Every node is a callable or an entrypoint.
+
+```json
+{
+  "id": "node-uuid-or-hash",
+  "kind": "function | method | arrow | class_constructor | entrypoint",
+  "name": "handleRequest",
+  "fqn": "src/controllers/userController.ts::handleRequest",
+  "file": "src/controllers/userController.ts",
+  "line": 42,
+  "componentId": "comp-root",
+  "purl": "pkg:npm/my-company-my-app@1.0.0",
+  "exportName": "handleRequest",
+  "exportKind": "named | default | none",
+  "className": "UserController",
+  "entryPointType": "http_route | cli | worker | unknown | null",
+  "extras": {
+    "isAsync": true,
+    "isRouteHandler": true
+  }
+}
+```
+
+**Rules for node creation**
+
+* **Function node**
+
+  * `kind = "function"` for `function foo() {}` and `export function foo() {}`.
+  * `fqn` = `<relativeFilePath>::foo`.
+* **Arrow function node**
+
+  * `kind = "arrow"` when it is used as a callback that matters (e.g. Express handler).
+  * Option: generate a synthetic name such as `file.ts::<line>:<col>`.
+* **Method node**
+
+  * `kind = "method"` for class methods.
+  * `fqn` = `<relativeFilePath>::ClassName.methodName`.
+* **Class constructor node**
+
+  * `kind = "class_constructor"` for `constructor()` if you want constructor-level analysis.
+* **Entrypoint node**
+
+  * `kind = "entrypoint"`.
+  * `entryPointType` according to detection rules (see §4).
+  * `fqn` = `<relativeFilePath>::<entrypointName>`, e.g. `src/server.ts::node-entry`.
+
+You don’t need to over-engineer FQNs; they just need to be stable and unique.
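Because ids must be stable across runs to support the deterministic, replayable-scan goal, one option is deriving them from the FQN itself rather than minting random UUIDs. A small illustrative sketch; `fnv1a` and `makeNodeId` are hypothetical helpers, and FNV-1a stands in for whatever stable hash (e.g. SHA-256) the implementation settles on:

```typescript
// Derive a stable node id from the FQN so that re-running the same scan
// yields byte-identical graphs. FNV-1a (32-bit) is used here purely for
// illustration; any deterministic hash works.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned 32-bit
  }
  return hash.toString(16).padStart(8, "0");
}

function makeNodeId(fqn: string): string {
  return `node-${fnv1a(fqn)}`;
}

// Same FQN in, same id out – across machines and runs:
const id1 = makeNodeId("src/controllers/userController.ts::handleRequest");
const id2 = makeNodeId("src/controllers/userController.ts::handleRequest");
console.log(id1 === id2); // true
```

Content-addressed ids like this also make graph diffs between scans trivial: an unchanged callable keeps its id, so only genuinely new or removed nodes show up in the delta.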
+ +--- + +### 2.3 Edges + +Edges model function/method/module relationships. + +```json +{ + "id": "edge-uuid-or-hash", + "from": "node-id-1", + "to": "node-id-2", + "type": "call | entry_call | import | export", + "confidence": "high | medium | low", + "extras": { + "callExpression": "userController.handleRequest(req, res)", + "file": "src/routes/userRoutes.ts", + "line": 30 + } +} +``` + +For reachability v1, **only `call` and `entry_call` are required**. `import`/`export` edges are useful for debugging but not strictly necessary for BFS reachability. + +--- + +### 2.4 Vulnerabilities + +Library-level vulnerabilities are described in terms of npm PURL and symbol. + +```json +{ + "id": "CVE-2020-1234", + "source": "internal-db-or-nvd-id", + "componentPurl": "pkg:npm/some-lib@1.2.3", + "packageName": "some-lib", + "symbolExportName": "dangerousFunction", + "symbolKind": "function | method", + "severity": "critical", + "extras": { + "description": "Prototype pollution in dangerousFunction", + "range": ">=1.0.0 <1.2.5" + } +} +``` + +At graph-build time, you pre-resolve `symbolExportName` → `node.id` where possible. + +--- + +### 2.5 Reachability Results + +Exactly the same shape as for PHP. + +```json +{ + "vulnerabilityId": "CVE-2020-1234", + "componentPurl": "pkg:npm/some-lib@1.2.3", + "symbolExportName": "dangerousFunction", + "targetNodeId": "node-123", + "status": "reachable | maybe_reachable | unreachable | not_analyzed", + "reason": "short explanation", + "paths": [ + ["entry-node-1", "node-20", "node-50", "node-123"] + ], + "analysisMeta": { + "algorithmVersion": "1.0.0", + "maxDepth": 200, + "timestamp": "2025-11-20T19:30:00Z" + } +} +``` + +--- + +## 3. Module & Symbol Resolution (JS/TS specifics) + +Backend: **TypeScript compiler API** with `allowJs: true`. + +### 3.1 Build TS Program + +1. Generate a `tsconfig.reachability.json` with: + + * `allowJs: true` + * `checkJs: true` + * `moduleResolution: "node"` or `"bundler"` depending on project. 
+ * `rootDir` set to project root. +2. Use TS API to create `Program`. +3. Use `TypeChecker` to resolve symbols where possible. + +This gives you: + +* File list (including JS/TS). +* Symbols for exports/imports. +* Class and function definitions. + +### 3.2 Export indexing per module + +For each source file: + +* Enumerate: + + * `export function foo() {}` + * `export const bar = () => {}` + * `export default function () {}` / `export default class {}`. + * `export { foo }` statements. + * `module.exports = ...` / `exports.foo = ...` (handle as CommonJS exports). + +Build an index: + +```ts +interface ExportedSymbol { + moduleFile: string; // relative path + exportName: string; // "foo", "default" + nodeId: string; // ID in nodes[] +} +``` + +### 3.3 Import resolution + +For each `ImportDeclaration`: + +* `import { foo as localFoo } from 'some-lib'` + + * Map `localFoo` → `(module='some-lib', exportName='foo')`. + +* `import foo from 'some-lib'` + + * Map `foo` → `(module='some-lib', exportName='default')`. + +* `import * as lib from 'some-lib'` + + * Map namespace `lib` → `(module='some-lib', exportName='*')`. + +For CommonJS: + +* `const x = require('some-lib')` + + * Map `x` → `(module='some-lib', exportName='*')`. + +* `const { dangerousFunction } = require('some-lib')` + + * Map `dangerousFunction` → `(module='some-lib', exportName='dangerousFunction')`. + +Later, when you see calls, you use this mapping. + +--- + +## 4. Entrypoint Detection (Node-centric) + +v1 rules that are easy to implement: + +1. **CLI entrypoints** + + * Files listed in `bin` section of `package.json`. + * Files with `#!/usr/bin/env node` shebang. + * Node: + + * `kind = "entrypoint"`, + * `entryPointType = "cli"`. + +2. **Server entrypoints** + + * Heuristic: look for `src/server.ts`, `src/index.ts`, `index.js` at project root. + * Mark them as `entrypoint` with `entryPointType = "http_route"`. + +3. 
**Framework routes (Express v1)** + + * Pattern: `const app = express(); app.get('/path', handler)`: + + * `handler` can be: + + * Identifier (function name), + * Arrow function, + * Function expression. + + For each such route: + + * Create an `entrypoint` node per route or mark handler callable as reachable from server entrypoint: + + * Easiest v1: create **`entry_call` edge**: + + * From server entrypoint node (e.g., file `src/server.ts`) to handler node. + * Mark handler node `extras.isRouteHandler = true`. + +You do not have to model individual HTTP methods or paths semantically in v1; just treat each handler as a reachable entrypoint into business logic. + +--- + +## 5. Call Graph Construction + +This is the heart of the algorithm. + +### 5.1 Node creation (summary) + +While visiting AST: + +* For each: + + * `FunctionDeclaration` + * `MethodDeclaration` + * `ArrowFunction` (that is: + + * exported, or + * assigned to a variable that is used as a callback/handler) +* Create a `node`. + +Tie each node to: + +* `file` (relative path), +* `line` (start line), +* `componentId` (from mapping file path → package), +* optional `exportName` (if exported from module). + +### 5.2 Call extraction rules + +For each function/method body (i.e., node): + +#### 5.2.1 Direct calls: `foo()` + +* If callee is an identifier `foo`: + + 1. Check if `foo` is a **local function** in the same file. + 2. If not, check import alias table: + + * If `foo` maps to `(module='pkg', exportName='bar')`, then: + + * Resolve to exported symbol for `pkg` + `bar` if you have its sources. + * If library source not indexed, create a synthetic node for that library export (optional). + 3. If resolved, add edge: + + * `type = "call"`, + * `confidence = "high"`. + +#### 5.2.2 Property calls: `obj.method()` + +* If callee is `obj.method(...)`: + + 1. If `obj` is an imported namespace: + + * e.g. `import * as lib from 'some-lib'; lib.dangerousFunction()`. 
+ * Then treat: + + * `module='some-lib'`, `exportName='dangerousFunction'`. + * Edge `confidence = "high"`. + + 2. If `obj` is created via `new ClassName()` where `ClassName` is known: + + * Use TypeScript type checker or simple pattern: + + * Look for `const obj = new ClassName(...)` in same function. + * Map to method `ClassName.method`. + * Edge `confidence = "high"`. + + 3. Else: + + * As a v1 heuristic, you **do not** spread to everything; instead: + + * Either: + + * Skip edge and lose some coverage, or + * Add `confidence = "medium"` edge from current node to **all methods called `method`** in the same component. + * Recommended: medium-confidence to all same-name methods in same component (conservative, but safe). + +#### 5.2.3 CommonJS require patterns + +* `const x = require('some-lib'); x.dangerousFunction()`: + + * Track variable → module mapping from `require`. + * When you see `x.something()`: + + * `module='some-lib'`, `exportName='something'`. + * `confidence = "medium"` (less structured than ES import). + +#### 5.2.4 Dynamic imports & very dynamic calls + +* `await import('some-lib')`, `obj[methodName]()`, `eval`, `Function`, etc.: + + v1 policy (simple and honest): + + * Do **not** create specific edges unless: + + * The target module name is a **string literal** and the method name is a **string literal** in same expression. + * Otherwise: + + * Optionally create a single edge from current node to a special `node-unknown` with `confidence = "low"`. + * This preserves a record that “something dynamic happens here” without lying. + +--- + +## 6. Mapping Nodes to Components (PURLs) + +Using the filesystem: + +* If file path begins with `node_modules/<pkgName>/...`: + + * Map that file to component with `name = pkgName` and the version from lockfile. + +* All other files belong to the root component (the app) or to a local “workspace” package if you support monorepos later. + +Each node inherits `componentId` from its file.
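The path-based mapping rule above can be sketched in a few lines. A minimal illustration, assuming a `lockVersions` map (package name to resolved version, derived from the lockfile) and an `npm:`-prefixed component id; both names are examples, not part of the spec:

```typescript
// Illustrative sketch of the file-path -> component rule.
// `lockVersions` and the "npm:" componentId prefix are assumptions for this example.

interface ComponentRef {
  componentId: string;
  purl?: string;
}

function mapFileToComponent(
  relPath: string,
  lockVersions: Record<string, string>, // package name -> resolved version
  rootComponentId: string
): ComponentRef {
  // node_modules/<pkgName>/... (scoped packages occupy two path segments)
  const m = relPath.match(/^node_modules\/((?:@[^/]+\/)?[^/]+)\//);
  if (m) {
    const name = m[1];
    const version = lockVersions[name];
    return {
      componentId: `npm:${name}`,
      purl: version ? `pkg:npm/${name}@${version}` : undefined,
    };
  }
  // Everything else belongs to the root (application) component.
  return { componentId: rootComponentId };
}
```

Nodes then simply inherit the `componentId` returned for their containing file.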
Each component has a `purl`: + +* `pkg:npm/<name>@<version>`. + +This is how you connect reachability to SBOM/VEX later. + +--- + +## 7. Vulnerability → Node mapping + +Given a vulnerability: + +```json +{ + "componentPurl": "pkg:npm/some-lib@1.2.3", + "packageName": "some-lib", + "symbolExportName": "dangerousFunction" +} +``` + +Steps: + +1. Find `componentId` by matching `componentPurl` or `packageName`. +2. In that component, find node(s) where: + + * `exportName == "dangerousFunction"`, or + * For CommonJS, any top-level function marked as part of the module’s exports under that name. +3. If found: + + * `targetNodeId = node.id`. +4. If not: + + * Mark `not_analyzed` later. + +--- + +## 8. Reachability Algorithm (BFS) + +Exactly like PHP v1, but now over JS nodes. + +**Pre-compute:** + +* `entryNodes` = all nodes where `kind = "entrypoint"`. + +**Compute reachable set:** + +```ts +function computeReachability(graph: Graph, entryNodes: string[]): ReachabilityContext { + const queue: string[] = []; + const visited: Record<string, boolean> = {}; + const predecessor: Record<string, string> = {}; + const edgeConfidenceOnPath: Record<string, "high" | "medium" | "low"> = {}; + + for (const entry of entryNodes) { + queue.push(entry); + visited[entry] = true; + edgeConfidenceOnPath[entry] = "high"; + } + + while (queue.length > 0) { + const current = queue.shift()!; + + for (const edge of graph.outEdges(current)) { + if (edge.type !== "call" && edge.type !== "entry_call") continue; + + const next = edge.to; + if (visited[next]) continue; + + visited[next] = true; + predecessor[next] = current; + + const prevConf = edgeConfidenceOnPath[current] ?? "high"; + const edgeConf = edge.confidence; + edgeConfidenceOnPath[next] = minConfidence(prevConf, edgeConf); + + queue.push(next); + } + } + + return { visited, predecessor, edgeConfidenceOnPath }; +} + +function minConfidence(a: "high" | "medium" | "low", + b: "high" | "medium" | "low"): "high" | "medium" | "low" { + const order: Record<"high" | "medium" | "low", number> = { high: 3, medium: 2, low: 1 }; + return order[a] <= order[b] ?
a : b; +} +``` + +**Classify per vulnerability:** + +For each vulnerability with `targetNodeId`: + +1. If missing → `status = "not_analyzed"`. +2. If `targetNodeId` not in `visited` → `status = "unreachable"`. +3. Otherwise: + + * `conf = edgeConfidenceOnPath[targetNodeId]`. + * If `conf == "high"` → `status = "reachable"`. + * Else (`medium` or `low`) → `status = "maybe_reachable"`. + +**Path reconstruction:** + +Same as PHP: + +```ts +function reconstructPath(predecessor: Record<string, string>, + targetId: string): string[] { + const path: string[] = []; + let current: string | undefined = targetId; + + while (current !== undefined) { + path.unshift(current); + current = predecessor[current]; + } + + return path; +} +``` + +Store at least one path in `paths[]`. + +--- + +## 9. Handling JS “messy bits” (v1 rules) + +You want to be honest, not magical. So: + +1. **eval, new Function, dynamic import with non-literal arguments** + + * Do not pretend you know where control goes. + * Either: + + * Ignore for graph (recommended v1), or + * Edge to `node-unknown` with `confidence="low"`. + * Mark in `analysisMeta` that dynamic features were detected. + +2. **obj[methodName]() with unknown methodName** + + * If `methodName` is a string literal and `obj` is clearly typed, you can resolve. + * Otherwise: no edges (or low-confidence to `node-unknown`). + +3. **No source for library** + + * If you do not index `node_modules`, you cannot trace inside the vulnerable library. + * Still useful: we just need the library’s exported symbol node as “synthetic”: + + * Create a synthetic node representing `some-lib::dangerousFunction` and attach all calls to it. + * That node gets `componentId` for `some-lib`. + * Reachability is still valid (we do not need the internal implementation for SCA). + +--- + +## 10. Implementation plan for a mid-level engineer + +Assume this runs in a **Node.js/TypeScript container** that Scanner calls, returning JSON. + +### 10.1 Modules to build + +1.
`JsProjectLoader` + + * Reads `package.json` + lockfile. + * Builds `components[]` (npm packages + root app). + * Maps file paths → `componentId`. + +2. `TsProgramBuilder` + + * Generates `tsconfig.reachability.json`. + * Creates TS Program with `allowJs: true`. + * Exposes `sourceFiles` and `typeChecker`. + +3. `JsSymbolIndexer` + + * Walks all source files. + * Indexes: + + * Exported functions/classes. + * Imported bindings / requires. + * Creates base `nodes[]` and export index. + +4. `JsEntrypointDetector` + + * Reads `package.json` for `bin` and main entry. + * Applies server/Express heuristics. + * Adds `entrypoint` nodes. + +5. `JsCallGraphBuilder` + + * For each function/method node: + + * Traverses its AST. + * Emits `call` edges as per §5. + * Emits `entry_call` edges for server/route wiring. + +6. `VulnerabilityMapper` + + * Takes vulnerability input (package + symbolExportName). + * Resolves them to `targetNodeId`. + +7. `ReachabilityEngine` + + * Implements BFS algorithm. + * Produces `reachabilityResults[]` with paths. + +8. `GraphSerializer` + + * Packs everything into the JSON structure described in §§2–3. + +### 10.2 Minimal example (to sanity check) + +Project snippet: + +```ts +// src/server.ts +import express from 'express'; +import { dangerousFunction } from 'some-lib'; + +const app = express(); + +app.get('/test', (req, res) => { + dangerousFunction(req.query.input); + res.send('ok'); +}); + +app.listen(3000); +``` + +Vulnerability: + +```json +{ + "id": "CVE-2020-1234", + "componentPurl": "pkg:npm/some-lib@1.2.3", + "packageName": "some-lib", + "symbolExportName": "dangerousFunction" +} +``` + +Expected graph & result: + +* Entrypoint node: + + * `src/server.ts::node-entry` (or similar). +* Handler arrow function node for `/test`. +* Node for `some-lib::dangerousFunction` (real or synthetic). 
+ +Reachability path: + +```json +[ + "entry:src/server.ts::node-entry", + "node-handler-/test", + "node-some-lib::dangerousFunction" +] +``` + +Status: `reachable`, `confidence = high`. + +--- + +If you want, next iteration I can: + +* Turn this into a formal JSON Schema (`reachability-js-graph.schema.json`), +* And sketch the exact Node/TS code structure (file layout, TS interfaces) for the container that StellaOps.Scanner would call. +You already have a very solid v1 for both PHP and JS. The next step is not to add “more stuff” randomly, but to make the specs: + +* More **uniform** (one core reachability model, multiple languages). +* More **honest** about uncertainty. +* More **useful** for scoring, policy, and SBOM/VEX. + +Here is what I would improve. + +--- + +## 1. Cross‑language improvements (applies to both PHP & JS) + +### 1.1 Unify under a single core schema + +Right now PHP and JS are parallel but not explicitly unified. I would define: + +* A **language‑agnostic core**: + + * `Node` (id, kind, file, line, componentId, purl, tags). + * `Edge` (id, from, to, type, confidence, tags). + * `Vulnerability` (id, componentPurl, symbolId or symbolFqn, severity, tags). + * `ReachabilityResult` (vulnId, targetNodeId, status, paths[], analysisMeta). +* A **language extension block**: + + * `phpExtras` (namespace, className, visibility, etc.). + * `jsExtras` (exportName, exportKind, isAsync, etc.). + +This gives you one “Reachability Graph 1.x” spec with per‑language specialisation instead of two separate specs. + +### 1.2 Stronger identity & hashing rules + +Make node and edge IDs deterministic and explicitly specified: + +* Node ID derived from: + + * `language`, `componentId`, `file`, `fqn`, `kind` → `sha256` truncated. +* Edge ID derived from: + + * `from`, `to`, `type`, `file`, `line`. + +Benefits: + +* Stable IDs across runs for the same code → easy diffing, caching, incremental scans. +* Downstream tools (policy engine, UI) can key on IDs confidently. 
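The deterministic ID rule can be sketched directly. In this illustration the field order, the NUL separator, and the 16-hex-character truncation are implementation choices, not normative parts of the spec:

```typescript
import { createHash } from "node:crypto";

// Sketch of a deterministic node ID: same identity fields -> same ID across runs.
interface NodeIdentity {
  language: string;
  componentId: string;
  file: string;
  fqn: string;
  kind: string;
}

function computeNodeId(n: NodeIdentity): string {
  // Join with a separator that cannot appear in the fields, so that
  // ("a", "bc") and ("ab", "c") never collide.
  const canonical = [n.language, n.componentId, n.file, n.fqn, n.kind].join("\u0000");
  const digest = createHash("sha256").update(canonical, "utf8").digest("hex");
  return `node-${digest.slice(0, 16)}`; // truncated sha256
}
```

Because the hash ignores anything that varies between runs (addresses, timestamps, absolute paths), the same source tree always yields the same IDs, which is what makes diffing and incremental scans cheap.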
+ +### 1.3 Multi‑axis confidence instead of a single label + +Replace the single `confidence` enum with **multi‑axis confidence**: + +```json +"confidence": { + "resolution": "high|medium|low", // how well we resolved the callee + "typeInference": "high|medium|low", + "controlFlow": "high|medium|low" +} +``` + +And define: + +* `pathConfidence` = min of all axes along the path. +* `status` still uses `reachable` / `maybe_reachable` / etc., but you retain the underlying breakdown for scoring and debugging. + +### 1.4 Path conditions and guards (lightweight) + +Introduce optional **path condition annotations** on edges: + +```json +"extras": { + "guard": "if ($userIsLoggedIn)", + "guardType": "auth | feature_flag | input_validation | unknown" +} +``` + +You do not need full symbolic execution. A simple heuristic suffices: + +* Detect `if (...)` around the call and capture the textual condition. +* Categorize by simple patterns (presence of `isAdmin`, `feature`, `flag`, etc.). + +Later, the Trust Algebra can say: “reachable only under feature flag + behind auth → downgrade risk.” + +### 1.5 Partial coverage & truncation flags + +Make the graph self‑describing about its **limitations**: + +At graph level: + +```json +"analysisMeta": { + "languages": ["php"], + "vendorCodeParsed": true, + "dynamicFeaturesHandled": ["dynamic-includes-partial", "reflection-ignored"], + "maxNodes": 500000, + "truncated": false +} +``` + +Per‑node or per‑file: + +```json +"extras": { + "parseErrors": false, + "analysisSkippedReason": null +} +``` + +Per‑vulnerability: + +* Add `coverageStatus`: `full`, `partial`, `unknown` to complement `status`. + +This avoids a common trap: tools silently dropping edges/nodes and still reporting “unreachable.” + +### 1.6 First‑class SBOM/VEX linkage + +You already include PURLs. Go one step further: + +* `componentId` links to: + + * `bomRef` (CycloneDX) or `componentId` (SPDX) if available. 
+* `vulnerabilityId` links to: + + * `vexRef` in any existing VEX document. + +This allows: + +* A VEX producer to say “not affected / affected but not exploited” with **explicit reference** to the reachability graph and specific `targetNodeId`s. + +--- + +## 2. PHP‑specific improvements + +### 2.1 Autoloader‑aware edges as first‑class concept + +Right now autoload is mostly implicit. Make it explicit and deterministic: + +* During Composer metadata processing, build: + + * **Autoload map**: `FQN class → file`. +* Add `autoload` edges: + + * From “usage site” node (where `new ClassName()` first appears) to a **file‑level node** representing the defining file. + +Why it helps: + +* Clarifies how classes were resolved (or not). +* Easier to debug “class not found” vs “we never parsed vendor code.” + +### 2.2 More precise includes / requires + +Upgrade the naive rule “everything in included file is reachable”: + +1. Represent each file as a special node `kind="file"`. +2. `include` / `require` statements produce `include` edges from current node/file to the file node. +3. Then: + + * All functions/methods defined in that file get `define_in` edges from file node. + * A separate simple pass marks them reachable from that file’s callers. + +Add a nuance: + +* If the include path is static and resolved at scan time → `resolution.high`. +* If dynamic (e.g., `include $baseDir.'/file.php';`) → `resolution.medium` or `low`. + +### 2.3 Better dynamic dispatch handling for methods + +Current v1 rule (“connect to all methods with that name in the component”) is safe but noisy. + +Refinement: + +* Use **local type inference** in the same function/method: + + * `$x = new Foo(); $x->bar();` → high resolution. + * `$x = factory(); $x->bar();`: + + * If factory returns a union of known types, edges to those types with `resolution.medium`. +* Introduce a tag on edges: + + * `extras.dispatchKind = "static" | "local-new" | "factory-heuristic" | "unknown"`. 
+ +This preserves the safety of your current design but cuts down false positives for common patterns. + +### 2.4 Framework‑aware entrypoints (v2, but spec‑ready now) + +Extend `entryPointType` with framework flavors, even if initial implementation is shallow: + +* `laravel_http`, `symfony_http`, `wordpress_hook`, `drupal_hook`, etc. + +And allow: + +```json +"extras": { + "framework": "laravel", + "route": "GET /users", + "hookName": "init" +} +``` + +You do not have to implement every framework in v1, but the spec should allow these so you can ship small, incremental framework profiles without changing the schema. + +--- + +## 3. JavaScript/TypeScript‑specific improvements + +### 3.1 Explicit async / event‑loop edges + +Today all calls are treated uniformly. For JS/TS, you should model: + +* `setTimeout`, `setInterval`, `setImmediate`, `queueMicrotask`, `process.nextTick`, `Promise.then/catch/finally`, event emitters. + +Two improvements: + +1. Additional edge types: + + * `async_call`, `event_callback`, `timer_callback`. +2. Node extras: + + * `extras.trigger = "timer" | "promise" | "event" | "unknown"`. + +This lets you later express policies like: “reachable only via a rarely used cron‑like timer” vs “reachable via normal HTTP request.” + +### 3.2 Bundler awareness (but spec‑only in v1) + +Even if v1 implementation ignores bundlers, the spec should anticipate them: + +* Allow a **bundle mapping block**: + +```json +"bundles": [ + { + "id": "bundle-main", + "tool": "webpack", + "inputFiles": ["src/index.ts", "src/server.ts"], + "outputFiles": ["dist/main.js"] + } +] +``` + +* Optionally, allow edges: + + * `type = "bundle_map"` from source file nodes to bundled file nodes. + +You can attach reachability graphs to either pre‑bundle or post‑bundle views later, without breaking the schema. 
+ +### 3.3 Stronger TypeScript‑based resolution + +Encode the fact that a call was resolved using TS type information vs heuristic: + +* On edges, add: + +```json +"extras": { + "resolutionStrategy": "ts-typechecker | local-scope | require-heuristic | unresolved" +} +``` + +This provides a clear line between “hard” and “soft” links for the scoring engine and for debugging why something is `maybe_reachable`. + +### 3.4 Workspace / monorepo semantics + +Support Yarn / pnpm / npm workspaces at the schema level: + +* Allow components to have: + +```json +"extras": { + "workspace": "packages/service-a", + "isWorkspaceRoot": false +} +``` + +And support edges: + +* `type = "workspace_dep"` for internal package imports. + +This makes it straightforward to see when a vulnerable library is pulled via an internal package boundary, which is common in large JS monorepos. + +--- + +## 4. Operational & lifecycle improvements + +### 4.1 Explicit incremental scan support + +Add an optional **delta section** so a scanner can emit only changes: + +```json +"delta": { + "baseGraphHash": "sha256:...", + "addedNodes": [...], + "removedNodeIds": [...], + "addedEdges": [...], + "removedEdgeIds": [...] +} +``` + +This is particularly valuable for large repos where full graphs are costly and CI needs fast turnaround. + +### 4.2 Test / non‑prod code classification + +Mark nodes/edges originating from tests or non‑prod code: + +* `extras.codeRole = "prod | test | devtool | unknown"`. + +Entry points from test runners (e.g., PHPUnit, Jest, Mocha) should either be: + +* Ignored (default), or +* Explicitly flagged as `entryPointType = "test"` so policies can decide whether to count that reachability. + +### 4.3 Normative definitions of statuses + +You already use `reachable`, `maybe_reachable`, `unreachable`, `not_analyzed`. Make the semantics **normative** in the spec: + +* Tie `reachable` / `maybe_reachable` to: + + * Existence of a path from **at least one recognized entrypoint**. 
+ * Minimum `pathConfidence` thresholds. +* Require that tools distinguish: + + * “No path in the graph” vs “graph incomplete here.” + +This allows multiple tools to implement the spec and still produce comparable, auditable results. + +--- + +If you want, the next concrete step could be: + +* A **“Reachability Graph 1.1”** document that: + + * Extracts the shared core, + * Adds multi‑axis confidence, + * Adds partial‑coverage metadata, + * Extends the enums for edge types and entrypoint types for PHP/JS. + +That gives your team a clean target for implementation without materially increasing complexity for a mid‑level engineer. diff --git a/docs/product-advisories/20-Nov-2026 - Encoding Binary Reachability with PURL‑Resolved Edges.md b/docs/product-advisories/20-Nov-2026 - Encoding Binary Reachability with PURL‑Resolved Edges.md new file mode 100644 index 000000000..c069d38d6 --- /dev/null +++ b/docs/product-advisories/20-Nov-2026 - Encoding Binary Reachability with PURL‑Resolved Edges.md @@ -0,0 +1,1088 @@ + + + + +Here’s a simple, practical way to think about **binary reachability** that cleanly joins call graphs with SBOMs—without reusing external tools. + +--- + +### The big idea (plain English) + +* Each **function call edge** in a binary’s call graph is annotated with: + + * a **purl** (package URL) identifying which component the callee belongs to, and + * a **symbol digest** (stable hash of the callee’s normalized symbol signature). +* With those two tags, call graphs from **PE/ELF/Mach‑O** can be merged across binaries and mapped onto your **SBOM components**, giving a **single vulnerability graph** that answers: *“Is this vulnerable function reachable in my deployment?”* + +--- + +### Why this matters for Stella Ops + +* **One graph to rule them all:** Libraries used by multiple services merge naturally via the same purl, so you see cross‑service blast radius instantly. 
+* **Deterministic & auditable:** Digests + purls make edges reproducible (great for “replayable scans” and audit trails). +* **Zero tool reuse required:** You can implement PE/ELF/Mach‑O parsing once in C# and still interoperate with SBOM/VEX ecosystems via purls. + +--- + +### Minimal data model + +```json +{ + "nodes": [ + {"id":"sym:hash:callee","kind":"symbol","purl":"pkg:nuget/Newtonsoft.Json@13.0.3","sig":"Newtonsoft.Json.JsonConvert::DeserializeObject(string)"}, + {"id":"bin:hash:myapi","kind":"binary","format":"pe","name":"MyApi.exe","build":"sha256:..."} + ], + "edges": [ + { + "from":"sym:hash:caller", + "to":"sym:hash:callee", + "etype":"calls", + "purl":"pkg:nuget/Newtonsoft.Json@13.0.3", + "sym_digest":"sha256:SYM_CALLEE", + "site":{"binary":"bin:hash:myapi","offset":"0x0041AFD0"} + } + ], + "sbom": [ + {"purl":"pkg:nuget/Newtonsoft.Json@13.0.3","component_id":"c-123","files":["/app/MyApi.exe"] } + ] +} +``` + +--- + +### How to build it (C#‑centric, binary‑first) + +1. **Lift symbols per format** + + * **PE**: parse COFF + PDB (if present), fallback to export tables; normalize “namespace.type::method(sig)”. + * **ELF**: `.dynsym`/`.symtab` + DWARF (if present); demangle (Itanium/LLVM rules). + * **Mach‑O**: LC_SYMTAB + DWARF; demangle. +2. **Compute `symbol digests`** + + * Hash of normalized signature + (optionally) instruction fingerprint for resilience to addresses. +3. **Build intra‑binary call graph** + + * Conservative static: function→function edges from **import thunks**, relocation targets, and lightweight disassembly of direct calls. + * Optional dynamic refinement: PERF/eBPF or ETW traces to mark *observed* edges. +4. **Resolve each callee to a `purl`** + + * Map import/segment to owning file → map file to SBOM component → emit its purl. + * If multiple candidates, emit edge with a small `candidates[]` set; policy later can prune. +5. 
**Merge graphs across binaries** + + * Union by `(purl, sym_digest)` for callees; keep multiple `site` locations. +6. **Attach vulnerabilities** + + * From VEX/CVE → affected package purls → mark reachable if any path exists from entrypoints to a vulnerable `(purl, sym_digest)`. + +--- + +### Practical policies that work well + +* **Entrypoints:** ASP.NET controller actions, `Main`, exported handlers, cron entry shims. +* **Edge confidence:** tag edges as `import`, `reloc`, `disasm`, or `runtime`; prefer runtime in prioritization. +* **Unknowns registry:** if symbol can’t be resolved, record `purl:"pkg:unknown"` with reason (stripped, obfuscated, thunk), so it’s visible—not silently dropped. + +--- + +### Quick win you can ship first + +* Start with **imports-only reachability** (no disassembly). For most CVEs in popular packages, imports + SBOM mapping already highlights real risk. +* Add **light disassembly** for direct `call` opcodes later to improve precision. + +If you want, I can turn this into a ready‑to‑drop **.NET 10 library skeleton**: parsers (PE/ELF/Mach‑O), symbol normalizer, digestor, graph model, and SBOM mapper with purl resolvers. + +Below is a concrete, implementation-ready specification aimed at a solid, “average” C# developer. The goal is that they can build this module without knowing all of StellaOps context. + +--- + +## 1. Purpose and Scope + +Implement a reusable .NET library that: + +1. Reads binaries (PE, ELF, Mach-O). +2. Extracts **functions/symbols** and their **call relationships** (call graph). +3. Annotates each call edge with: + + * The **callee’s purl** (package URL / SBOM component). + * A **symbol digest** (stable function identifier). +4. Produces a **reachability graph** in memory and as JSON. + +This will be used by other StellaOps services (Scanner / Sbomer / Vexer) to answer: +“Is this vulnerable function from package X reachable in my environment?” + +Non-goals for v1: + +* No dynamic tracing (no eBPF, no ETW). 
Static only. +* No external CLI tools (no `objdump`, `llvm-nm`, etc.). Everything in-process and in C#. + +--- + +## 2. Project Structure + +Create a new class library: + +* Project: `StellaOps.Scanner.BinaryReachability` +* TargetFramework: `net10.0` +* Nullable: `enable` +* Language: latest C# available for .NET 10 + +Recommended namespaces: + +* `StellaOps.Scanner.BinaryReachability` +* `StellaOps.Scanner.BinaryReachability.Model` +* `StellaOps.Scanner.BinaryReachability.Parsing` +* `StellaOps.Scanner.BinaryReachability.Parsing.Pe` +* `StellaOps.Scanner.BinaryReachability.Parsing.Elf` +* `StellaOps.Scanner.BinaryReachability.Parsing.MachO` +* `StellaOps.Scanner.BinaryReachability.Sbom` +* `StellaOps.Scanner.BinaryReachability.Graph` + +--- + +## 3. Core Domain Model + +### 3.1 Enumerations + +```csharp +namespace StellaOps.Scanner.BinaryReachability.Model; + +public enum BinaryFormat +{ + Pe, + Elf, + MachO +} + +public enum SymbolKind +{ + Function, + Method, + Constructor, + Destructor, + ImportStub, + Thunk, + Unknown +} + +public enum EdgeKind +{ + DirectCall, + IndirectCall, + ImportCall, + ConstructorInit, // e.g. .init_array + Other +} + +public enum EdgeConfidence +{ + High, // import, relocation, clear direct call + Medium, // best-effort disassembly + Low // heuristics, fallback +} +``` + +### 3.2 Node and Edge Records + +```csharp +namespace StellaOps.Scanner.BinaryReachability.Model; + +public sealed record BinaryNode( + string BinaryId, // e.g. "bin:sha256:..." + string FilePath, // path in image or filesystem + BinaryFormat Format, + string? BuildId, // ELF build-id, Mach-O UUID, PE pdb-signature (optional) + string FileHash // sha256 of binary bytes +); + +public sealed record SymbolNode( + string SymbolId, // stable within this graph: "sym:{digest}" + string NormalizedName, // normalized signature/name + SymbolKind Kind, + string? 
Purl, // nullable: may be unknown + string SymbolDigest // sha256 of normalized name +); +``` + +### 3.3 Call Edge and Call Site + +```csharp +namespace StellaOps.Scanner.BinaryReachability.Model; + +public sealed record CallSite( + string BinaryId, + ulong Offset, // RVA / file offset + string? SourceFile, // Optional, if we can resolve + int? SourceLine // Optional +); + +public sealed record CallEdge( + string FromSymbolId, + string ToSymbolId, + EdgeKind EdgeKind, + EdgeConfidence Confidence, + string? CalleePurl, // resolved package of callee + string CalleeSymbolDigest, // same as target SymbolDigest + CallSite Site +); +``` + +### 3.4 Graph Container + +```csharp +namespace StellaOps.Scanner.BinaryReachability.Graph; + +using StellaOps.Scanner.BinaryReachability.Model; + +public sealed class ReachabilityGraph +{ + public Dictionary<string, BinaryNode> Binaries { get; } = new(); + public Dictionary<string, SymbolNode> Symbols { get; } = new(); + public List<CallEdge> Edges { get; } = new(); + + public void AddBinary(BinaryNode binary) => Binaries[binary.BinaryId] = binary; + public void AddSymbol(SymbolNode symbol) => Symbols[symbol.SymbolId] = symbol; + public void AddEdge(CallEdge edge) => Edges.Add(edge); +} +``` + +--- + +## 4. Public API (what other modules call) + +Define a simple facade service that other StellaOps components use. + +```csharp +namespace StellaOps.Scanner.BinaryReachability; + +using StellaOps.Scanner.BinaryReachability.Graph; +using StellaOps.Scanner.BinaryReachability.Model; +using StellaOps.Scanner.BinaryReachability.Sbom; + +public interface IBinaryReachabilityService +{ + /// + /// Builds a reachability graph for all binaries in the given directory (e.g. unpacked container filesystem), + /// using SBOM data to resolve PURLs. + /// + ReachabilityGraph BuildGraph( + string rootDirectory, + ISbomComponentResolver sbomResolver); + + /// + /// Serialize the graph to JSON for persistence / later replay.
+ /// + string SerializeGraph(ReachabilityGraph graph); +} +``` + +Implementation class: + +```csharp +public sealed class BinaryReachabilityService : IBinaryReachabilityService +{ + // Will compose format-specific parsers and SBOM resolver inside. +} +``` + +--- + +## 5. SBOM Component Resolver + +We need only a minimal interface to attach PURLs to binaries and symbols. + +```csharp +namespace StellaOps.Scanner.BinaryReachability.Sbom; + +public interface ISbomComponentResolver +{ + /// + /// Resolve the purl for a binary file (by path or build-id). + /// Return null if not found. + /// + string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash); + + /// + /// Optional: resolve purl by a library name only (e.g. "libssl.so.3", "libcrypto.so.3"). + /// Used when we have imports but not full path. + /// + string? ResolvePurlByLibraryName(string libraryName); +} +``` + +For the C# dev: + +* Implementation will consume **CycloneDX/SPDX SBOMs** that already map files (hash/path/buildId) to components and purls. +* For v1, a simple resolver that: + + * Loads SBOM JSON. + * Indexes components by: + + * File path (normalized). + * File hash. + * BuildId where available. + * Implements the two methods above using dictionary lookups. + +--- + +## 6. Binary Parsing Abstractions + +### 6.1 Common Interface + +```csharp +namespace StellaOps.Scanner.BinaryReachability.Parsing; + +using StellaOps.Scanner.BinaryReachability.Model; + +public interface IBinaryParser +{ + bool CanParse(string filePath, ReadOnlySpan<byte> header); + + /// + /// Parse basic binary metadata: format, build-id, file-hash already computed by caller. + /// + BinaryNode ParseBinaryMetadata(string filePath, string fileHash); + + /// + /// Parse functions/symbols from this binary. + /// Return a list of SymbolNode with Purl left null (will be set later).
+ /// + IReadOnlyList<SymbolNode> ParseSymbols(BinaryNode binary); + + /// + /// Build intra-binary call edges (from this binary’s functions to others), without PURL info. + /// ToSymbolId should be based on SymbolDigest; PURL will be attached later. + /// + IReadOnlyList<CallEdge> ParseCallGraph(BinaryNode binary, IReadOnlyList<SymbolNode> symbols); +} +``` + +### 6.2 Parser Implementations + +Create three concrete parsers: + +* `PeBinaryParser` in `Parsing.Pe` +* `ElfBinaryParser` in `Parsing.Elf` +* `MachOBinaryParser` in `Parsing.MachO` + +And a small factory: + +```csharp +public sealed class BinaryParserFactory +{ + private readonly List<IBinaryParser> _parsers; + + public BinaryParserFactory() + { + _parsers = new List<IBinaryParser> + { + new Pe.PeBinaryParser(), + new Elf.ElfBinaryParser(), + new MachO.MachOBinaryParser() + }; + } + + public IBinaryParser? GetParser(string filePath, ReadOnlySpan<byte> header) + { + // ReadOnlySpan<byte> is a ref struct and cannot be captured by a lambda, + // so iterate with a plain loop instead of FirstOrDefault. + foreach (var parser in _parsers) + { + if (parser.CanParse(filePath, header)) + { + return parser; + } + } + + return null; + } +} +``` + +--- + +## 7. Symbol Normalization and Digesting + +Create a small helper for consistent symbol IDs. + +```csharp +namespace StellaOps.Scanner.BinaryReachability.Model; + +public static class SymbolIdFactory +{ + public static string ComputeNormalizedName(string rawName) + => rawName.Trim(); // v1: minimal; later we can extend (demangling, etc.) + + public static string ComputeSymbolDigest(string normalizedName) + { + using var sha = System.Security.Cryptography.SHA256.Create(); + var bytes = System.Text.Encoding.UTF8.GetBytes(normalizedName); + var hash = sha.ComputeHash(bytes); + var hex = Convert.ToHexString(hash).ToLowerInvariant(); + return hex; + } + + public static string CreateSymbolId(string symbolDigest) + => $"sym:{symbolDigest}"; +} +``` + +Usage in parsers: + +* For each function name the parser finds: + + * `normalizedName = SymbolIdFactory.ComputeNormalizedName(rawName);` + * `digest = SymbolIdFactory.ComputeSymbolDigest(normalizedName);` + * `symbolId = SymbolIdFactory.CreateSymbolId(digest);` + * Create `SymbolNode`.
+ +Notes for developer: + +* Do not include file path or address in the digest (we want determinism across builds). +* In the future we can expand normalization to include demangled signatures and parameter types. + +--- + +## 8. Building the Graph (step-by-step) + +Implementation of `BinaryReachabilityService.BuildGraph` should follow this algorithm. + +### 8.1 Scan Files + +1. Recursively enumerate all files under `rootDirectory`. +2. For each file: + + * Open as stream. + * Read first 4–8 bytes as header. + * Try `BinaryParserFactory.GetParser`. + * If no parser, skip file. + +### 8.2 Parse Binary Metadata and Symbols + +For each parseable file: + +1. Compute SHA256 of file content → `fileHash`. +2. `parser.ParseBinaryMetadata(filePath, fileHash)` → `BinaryNode`. +3. Add `BinaryNode` to `ReachabilityGraph.Binaries`. +4. `parser.ParseSymbols(binary)` → list of `SymbolNode`. +5. For each symbol: + + * Add to `ReachabilityGraph.Symbols` if not already present: + + * Key: `SymbolId`. + * If existing, keep first or merge (for v1: keep first). + +Maintain an in-memory index: + +```csharp +// symbolDigest -> SymbolNode +Dictionary<string, SymbolNode> symbolsByDigest; +``` + +### 8.3 Parse Call Graph per Binary + +For each binary: + +1. `parser.ParseCallGraph(binary, itsSymbols)` → edges (without PURL attached). +2. For each edge: + + * Ensure `FromSymbolId` and `ToSymbolId` correspond to known `SymbolNode`: + + * `ToSymbolId` should be `sym:{digest}` for the callee. + * Add edge to `ReachabilityGraph.Edges`. + +At this point, edges know only `FromSymbolId`, `ToSymbolId`, kind, confidence, and `CallSite`. + +### 8.4 Attach PURLs + +Now run a second pass to attach PURLs to symbols and edges: + +1. For each `BinaryNode`: + + * Call `sbomResolver.ResolvePurlForBinary(binary.FilePath, binary.BuildId, binary.FileHash)`. + * If not null, this is the **binary’s own purl** (used for "who owns these functions"). +2. Maintain: + +```csharp +Dictionary<string, string?> binaryPurlsById; // BinaryId -> purl? +``` + +3.
For each `CallEdge`: + + * Get callee symbol: + + * `var symbol = graph.Symbols[edge.ToSymbolId];` + * If `symbol.Purl` is null: + + * If callee is local (same binary – parser may mark it via metadata or `CallSite.BinaryId`): + + * Assign `symbol.Purl = binaryPurlsById[callSite.BinaryId]` (can be null). + * If callee is imported from an external library: + + * Parser should provide library name in `NormalizedName` or additional metadata (for v1, you can store library in a separate structure). + * Use `sbomResolver.ResolvePurlByLibraryName(libraryName)` to find purl. + * Set `symbol.Purl` to that value (even if null). + * Set `edge.CalleePurl = symbol.Purl`. + * Set `edge.CalleeSymbolDigest = symbol.SymbolDigest`. + +Note: For v1 you can simplify: + +* Assume all callees in this binary belong to `binary`’s purl. +* Later, extend to per-library mapping. + +--- + +## 9. Format-Specific Minimum Requirements + +For each parser, aim for this minimum. + +### 9.1 PE Parser (Windows) + +Tasks: + +1. Identify PE by `MZ` + PE header. +2. Extract: + + * Machine type. + * Optional: PDB signature / age (for potential BuildId in the future). +3. Symbols: + + * Use export table for exported functions. + * Use import table for imported functions (these represent edges from this binary to others). +4. Call graph: + + * For v1: edges from each local function to imported functions via import table. + * Later: add simple disassembly of `.text` section to detect intra-binary calls. + +Practical approach: + +* Use `System.Reflection.PortableExecutable` if possible, or a small custom PE reader. +* Represent imported function name as `"<module>!<function>"` in `NormalizedName`. + +### 9.2 ELF Parser (Linux) + +Tasks: + +1. Detect ELF by magic `0x7F 'E' 'L' 'F'`. +2. Extract: + + * BuildId (from `.note.gnu.build-id` if present). + * Architecture. +3. Symbols: + + * Read `.dynsym` (dynamic symbols) and `.symtab` if present. + * Functions only (symbol type FUNC). +4. 
Call graph (minimum): + + * Imports via PLT/GOT entries (function calls to shared libs). + * Map symbol names to `SymbolNode` as above. + +Implementation: + +* Write a simple ELF reader: parse header, section headers, locate `.dynsym`, `.strtab`, `.symtab`, `.note.gnu.build-id`. + +### 9.3 Mach-O Parser (macOS) + +Tasks: + +1. Detect Mach-O via magic (`0xFEEDFACE`, `0xFEEDFACF`, etc.). +2. Extract: + + * UUID (LC_UUID) as BuildId equivalent. +3. Symbols: + + * Use LC_SYMTAB and associated string table. +4. Call graph: + + * Similar approach as ELF for imports; minimum: cross-binary call edges via import stubs. + +Implementation: + +* Minimal Mach-O parser: read load commands, find LC_SYMTAB and LC_UUID. + +--- + +## 10. JSON Serialization Format + +Use System.Text.Json with simple DTOs mirroring `ReachabilityGraph`. For v1, you can serialize the domain model directly. + +Example structure (for reference only): + +```json +{ + "nodes": { + "binaries": [ + { "binaryId": "bin:sha256:...", "filePath": "/app/MyApi.exe", "format": "Pe", "buildId": null, "fileHash": "..." } + ], + "symbols": [ + { "symbolId": "sym:...", "normalizedName": "MyNamespace.MyType::MyMethod()", "kind": "Function", "purl": "pkg:nuget/MyLib@1.2.3", "symbolDigest": "..." } + ] + }, + "edges": [ + { + "fromSymbolId": "sym:...", + "toSymbolId": "sym:...", + "edgeKind": "ImportCall", + "confidence": "High", + "calleePurl": "pkg:nuget/MyLib@1.2.3", + "calleeSymbolDigest": "...", + "site": { "binaryId": "bin:sha256:...", "offset": "0", "sourceFile": null, "sourceLine": null } + } + ] +} +``` + +--- + +## 11. Error Handling & Logging + +* For unreadable or unsupported binaries: + + * Log a warning and continue. +* For parsing errors: + + * Catch exceptions, log with file path and format, continue with other files. +* For SBOM resolution failures: + + * Not an error; leave Purl as null. + +Logs should at least include: + +* Number of binaries discovered, parsed successfully, failed. 
+* Number of symbols and edges created. +* Number of edges with `CalleePurl` null vs non-null. + +--- + +## 12. Test Plan (high-level) + +1. **Unit tests** for: + + * `SymbolIdFactory` (deterministic digests). + * `BinaryReachabilityService` with mocked parsers & SBOM resolver. +2. **Integration tests** (per platform) using small sample binaries: + + * A PE with one import (e.g. `MessageBoxA`). + * An ELF binary calling `printf`. + * A Mach-O binary with a simple imported function. +3. Check that: + + * Graph contains expected binaries and symbols. + * Call edges exist and have correct `FromSymbolId` / `ToSymbolId`. + * PURLs are attached when SBOM resolver is provided with matching entries. + +--- + +You can push these specs significantly closer to “best in class” by tightening a few correctness details, making some implicit assumptions explicit, and adding non‑functional and quality gates. Below is a structured list of concrete improvements you can fold back into the spec. + +I’ll focus on changes that materially affect correctness, extensibility, and implementation quality for an “average C# dev” without exploding complexity. + +--- + +## 1. Clarify Non‑Functional Requirements + +Right now the spec is almost entirely functional. Add a short NFR section so the developer has explicit targets: + +**Add a “Non‑Functional Requirements” section:** + +* **Performance** + + * Target scanning throughput, e.g. “On commodity hardware, aim for at least 50–100 MB/s of binaries scanned in static mode.” + * Specify acceptable complexity: “All parsing operations must be linear in file size where possible; avoid quadratic algorithms over symbol tables.” + +* **Memory** + + * Provide a rough upper bound, e.g. 
“Graph building must not exceed 512 MB RAM for 10k binaries with typical Linux container images.” + +* **Thread safety** + + * Clarify: “All parser implementations must be stateless and thread‑safe; `BinaryReachabilityService.BuildGraph` may scan binaries in parallel.” + +* **Portability** + + * Minimum supported OS set (Windows, Linux, macOS) and CPU architectures (x86_64, ARM64); important because ELF/Mach‑O vary. + +This keeps the implementation from being “correct but unusably slow” and tells the dev what “good enough” looks like. + +--- + +## 2. Fix and Strengthen Symbol Identity (Very Important) + +Current spec uses `SymbolId = "sym:{digest}"` where digest is only based on normalized name. That will collapse distinct functions that happen to share the same name/signature across different libraries/packages, which is unacceptable once you care about cross‑component reachability. + +**Improve the spec as follows:** + +1. **Split “symbol node identity” from “canonical symbol key”:** + + * Keep a local identity that is always unique per binary: + + ```csharp + public sealed record SymbolNode( + string SymbolId, // e.g. "sym:{binaryId}:{localIndex}" + string NormalizedName, + SymbolKind Kind, + string? Purl, + string SymbolDigest // stable digest of NormalizedName + ); + ``` + + * Define a **canonical symbol key** struct for cross‑binary grouping: + + ```csharp + public readonly record struct CanonicalSymbolKey( + string SymbolDigest, // sha256(normalizedName) + string? Purl // null for unknown package + ); + ``` + + * Inside `ReachabilityGraph`, add: + + ```csharp + // canonical key -> SymbolIds of the matching nodes + public Dictionary<CanonicalSymbolKey, List<string>> CanonicalSymbolIndex { get; } = new(); + ``` + +2. **Clarify behavior:** + + * Never merge two `SymbolNode`s just because they share the same digest. + * For “global reasoning” (e.g. “all call sites to the vulnerable function X from package Y”), use `CanonicalSymbolKey(SymbolDigest, Purl)`. + +3. **Update `CallEdge`:** + + * Keep `FromSymbolId` and `ToSymbolId` as node IDs. 
+ * Include the canonical key in a dedicated field: + + ```csharp + public sealed record CallEdge( + string FromSymbolId, + string ToSymbolId, + EdgeKind EdgeKind, + EdgeConfidence Confidence, + CanonicalSymbolKey? CalleeKey, + CallSite Site + ); + ``` + +This single change prevents subtle and serious misattribution across libraries with overlapping APIs. + +--- + +## 3. Explicit Build Identity Semantics (PE/ELF/Mach‑O) + +The spec currently says `BuildId` is “optional” and format‑specific, but does not define **how** to compute it per format. Best‑in‑class means this is deterministic and documented. + +**Extend the spec with a “Binary Identity” section:** + +* **PE (Windows)** + + * `BuildId` = PDB GUID + Age if available (from CodeView debug directory). + * If PDB info is missing, set `BuildId = null` and rely on `FileHash`. +* **ELF (Linux)** + + * `BuildId` = contents of `.note.gnu.build-id` if present. +* **Mach‑O (macOS)** + + * `BuildId` = UUID from `LC_UUID` load command. + +Also specify: + +* **Primary identity order**: `(BuildId, FileHash)`; if `BuildId` is null, use `FileHash` only. +* SBOM resolvers MUST treat `(BuildId, FileHash)` as the canonical key to map binaries to components, with file path only as a hint. + +This gives you robust correlation between SBOM entries and binaries, across containers and file renames. + +--- + +## 4. Enrich the Edge Model and Call Site Semantics + +For precision and debuggability, specify what edges mean more rigorously. + +**Add fields and definitions:** + +1. **Direction and type:** + + Add a small discriminator describing the origin of the edge: + + ```csharp + public enum EdgeSource + { + ImportTable, // import thunk / PLT / stub + Relocation, // relocation to symbol + Disassembly, // decoded CALL / BL / JAL + Metadata, // .NET metadata, DWARF, etc. 
+ Other + } + ``` + + Extend `CallEdge`: + + ```csharp + public sealed record CallEdge( + string FromSymbolId, + string ToSymbolId, + EdgeKind EdgeKind, + EdgeConfidence Confidence, + EdgeSource Source, + CanonicalSymbolKey? CalleeKey, + CallSite Site + ); + ``` + +2. **Intra‑ vs inter‑binary** + + * Define: `Site.BinaryId` always refers to the binary containing the call instruction. + * Intra‑binary edge: `FromSymbol` and `ToSymbol` share same `BinaryId`. + * Inter‑binary edge: otherwise. + +3. **Unknown or unresolved callees** + + * Do not drop unresolved calls; add a special `UnknownSymbolNode` per binary: + + * `NormalizedName = "<unknown>"`, `Kind = SymbolKind.Unknown`, `Purl = null`. + * Edges to unknown must have `Confidence = EdgeConfidence.Low`. + +This lets downstream consumers distinguish “we are sure this is a call to libX.Y” from “we saw a call but do not know its target”. + +--- + +## 5. Strengthen Symbol Normalization Rules (Demangling etc.) + +For best‑in‑class results, you want reproducible signatures independent of compiler version, and you want to unify mangled C++/Rust/etc. names. + +**Extend the `SymbolIdFactory` spec with clear rules:** + +1. **Language‑agnostic core** + + * Always: + + * Demangle if possible. + * Normalize whitespace. + * Normalize namespace separators to `.` and member separator to `::`. + * Remove address/offset suffixes embedded in names. + +2. **Format‑ and language‑specific guidance** + + * For C/C++ (MSVC / Itanium ABI): + + * Use a demangler (your own or a library) to get `retType namespace.Type::Func(paramTypes...)`. + * Omit return type in normalization to make signatures more stable: `namespace.Type::Func(paramTypes...)`. + * For Rust: + + * Strip hash suffixes from symbol name. + * Use “crate::module::Type::func(params...)” pattern where possible. + * For Go: + + * Normalize from `runtime.main_main` → `runtime.main.main` etc. 
+ * For .NET (if/when you add managed parsing later): + + * Use fully qualified CLR names: `Namespace.Type::Method(ParamType1,ParamType2)`. + +3. **Document stability guarantees** + + * Given identical source (function name + parameter list), the `SymbolDigest` must remain stable across builds, architectures, optimization levels, and link addresses. + * If demangling fails, fall back to the raw name but strip obvious hashes if safe. + +Specify this in prose and keep the implementation flexible, but the rules must be clear enough that two developers implementing the parser will produce the same digest for the same symbol. + +--- + +## 6. More Precise SBOM & PURL Resolution Behavior + +The SBOM integration is crucial to StellaOps; push this further so it is deterministic and auditable. + +**Extend `ISbomComponentResolver` behavior:** + +1. **Resolution order** + + Document a strict order: + + 1. `(BuildId, FileHash)` match. + 2. `FileHash` only. + 3. Normalized file path if SBOM has explicit path mapping. + 4. Library name fallback via `ResolvePurlByLibraryName`. + +2. **Multiple SBOMs and conflicts** + + * Allow multiple SBOM sources; if two SBOMs claim different purls for the same `(BuildId, FileHash)`, define a policy: + + * e.g. fail fast with a “conflicting SBOM” error; or choose a deterministic priority order. + +3. **Library name mapping contract** + + Add a small DTO to make the mapping explicit: + + ```csharp + public sealed record LibraryReference( + string BinaryId, + string LibraryName, // "libssl.so.3" / "KERNEL32.dll" + string? ResolvedPath // if the loader path is known + ); + ``` + + Extend `IBinaryParser` with: + + ```csharp + IReadOnlyList<LibraryReference> ParseLibraryReferences(BinaryNode binary); + ``` + + Then describe how `BinaryReachabilityService` uses those to call `ResolvePurlByLibraryName`. + +4. 
**Unknown purls** + + * Require that unknowns are explicit: + + * When `ResolvePurlForBinary` returns null, store `Purl = null` and flag this in logs: “No SBOM component for binary X (BuildId=..., Hash=...)”. + +This ensures SBOM resolution remains a traceable, deterministic step rather than a best‑effort guess. + +--- + +## 7. Explicit JSON Schema & Versioning + +For replayability and compatibility, define a clear JSON schema and version. + +**Add:** + +* A top‑level metadata section: + + ```json + { + "schemaVersion": "1.0.0", + "generatedAt": "2025-11-20T12:34:56Z", + "tool": "StellaOps.Scanner.BinaryReachability", + "toolVersion": "1.0.0", + "graph": { ... } + } + ``` + +* Commit to: + + * Only additive changes in minor versions. + * Backwards‑compatible changes within the same major version. + * If you change anything structural (e.g. how symbol IDs work), bump `schemaVersion` major. + +Optionally, provide a compact JSON schema file (or at least a documented shape) so other teams can implement readers in other languages. + +--- + +## 8. Concurrency, Streaming, and Large Images + +For best‑in‑class scalability, specify how large images are handled. + +**Clarify in the spec:** + +1. **Parallelization** + + * `BinaryReachabilityService.BuildGraph`: + + * May scan binaries in parallel using `Parallel.ForEach`. + * All parsers must be thread‑safe and not rely on shared mutable state. + +2. **Streaming option (optional but recommended)** + + * Provide a second API for very large repositories: + + ```csharp + public interface IGraphSink + { + void OnBinary(BinaryNode binary); + void OnSymbol(SymbolNode symbol); + void OnEdge(CallEdge edge); + } + + void BuildGraphStreaming(string rootDirectory, ISbomComponentResolver sbomResolver, IGraphSink sink); + ``` + + * This allows building graphs into a database or message bus without keeping everything in memory. 
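To make the sink contract concrete, here is a minimal sketch of a streaming consumer. It assumes the `IGraphSink`, `BinaryNode`, `SymbolNode`, and `CallEdge` shapes sketched in this spec; `JsonLinesGraphSink` is a hypothetical illustration, not part of the required API:

```csharp
using System.IO;
using System.Text.Json;

// Illustrative IGraphSink implementation that writes the graph as
// newline-delimited JSON instead of holding everything in memory.
public sealed class JsonLinesGraphSink : IGraphSink
{
    private readonly TextWriter _writer;

    public JsonLinesGraphSink(TextWriter writer) => _writer = writer;

    // Each callback serializes one record immediately, so memory use
    // stays roughly constant regardless of how many binaries are scanned.
    public void OnBinary(BinaryNode binary) => Write("binary", binary);
    public void OnSymbol(SymbolNode symbol) => Write("symbol", symbol);
    public void OnEdge(CallEdge edge) => Write("edge", edge);

    private void Write<T>(string kind, T payload) =>
        _writer.WriteLine(JsonSerializer.Serialize(new { kind, payload }));
}
```

A database- or message-bus-backed sink would implement the same three callbacks; the service never needs to know where records end up.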
+ +Even if you do not implement streaming immediately, designing the interface now keeps the architecture future‑proof. + +--- + +## 9. Observability and Diagnostics + +Best‑in‑class implementation requires good introspection for debugging wrong reachability conclusions. + +**Specify minimal observability requirements:** + +* **Logging** + + * At least: + + * Info: number of binaries, symbols, edges, time taken. + * Warning: unsupported binary formats, SBOM resolution failures, demangling failures. + * Error: parser exceptions per file (with file path and format). + +* **Debug artifacts** + + * Optional environment or flag that dumps per‑binary debug info: + + * Raw symbol table (names + addresses). + * Normalized names and digests. + * Library references. + * Call edges for that binary. + +* **Metrics hooks** + + * Provide a simple interface for metrics: + + ```csharp + public interface IReachabilityMetrics + { + void IncrementCounter(string name, long value = 1); + void ObserveDuration(string name, TimeSpan duration); + } + ``` + + And allow `BinaryReachabilityService` to be constructed with an optional metrics implementation. + +--- + +## 10. Expanded Test Strategy and Quality Gates + +Your test plan is decent but can be made more systematic. + +**Extend test plan:** + +1. **Golden corpus** + + * Maintain a small but curated set of PE/ELF/Mach‑O binaries (checked in or generated) where: + + * Expected symbols and edges are stored as JSON. + * CI compares current output with the golden graph byte‑for‑byte (or structurally). + +2. **Cross‑compiler coverage** + + * At least: + + * C/C++ built by different toolchains (MSVC, clang, gcc). + * Different optimization levels (`-O0`, `-O2`) to ensure stability of parsing. + +3. **Fuzzing / robustness** + + * Create tests with truncated / corrupted binaries to ensure: + + * No crashes. + * Meaningful, bounded error behavior. + +4. 
**SBOM integration tests** + + * For a test root directory: + + * A synthetic SBOM that maps files to binaries. + * Validate correct purl assignment and conflict handling. + +5. **Determinism tests** + + * Run `BuildGraph` twice on the same directory and assert that: + + * Graph is structurally identical (including order‑independent comparison). + +This makes it much harder for regressions to slip in when you extend parsers or normalization. + +--- + +## 11. Clear Extension Points and Roadmap Notes + +Finally, add a short “Future Extensions” section so the dev knows what to keep in mind when structuring code: + +* Support for: + + * Inlined function tracking (via DWARF/PDB). + * Managed .NET assemblies’ metadata (C# IL call graph). + * Dynamic edge sources (runtime traces) merged into the same graph. +* The spec should instruct: “Design parsers and the graph model so they can accept additional `EdgeSource` types and symbol metadata without breaking existing consumers.” + +That gives the current implementation a clear direction and prevents design dead ends.
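To illustrate the determinism test described in the test strategy, here is a sketch in xUnit style. The `BuildGraph` call shape and the graph property names (`Binaries`, `Symbols`, `Edges`) are assumed from the v1 spec; `CreateService` is a hypothetical test helper:

```csharp
using System;
using System.Linq;
using Xunit;

public sealed class DeterminismTests
{
    [Fact]
    public void BuildGraph_RunTwice_ProducesStructurallyIdenticalGraphs()
    {
        var service = CreateService(); // hypothetical helper wiring up real parsers

        var first = service.BuildGraph("testdata/sample-root", sbomResolver: null);
        var second = service.BuildGraph("testdata/sample-root", sbomResolver: null);

        // Compare order-independently: parallel scanning may emit nodes and
        // edges in a different order, so sort by stable keys before comparing.
        Assert.Equal(Canonicalize(first), Canonicalize(second));
    }

    private static string Canonicalize(ReachabilityGraph graph)
    {
        var binaries = graph.Binaries.Select(b => b.BinaryId)
            .OrderBy(x => x, StringComparer.Ordinal);
        var symbols = graph.Symbols.Values.Select(s => s.SymbolId)
            .OrderBy(x => x, StringComparer.Ordinal);
        var edges = graph.Edges
            .Select(e => $"{e.FromSymbolId}->{e.ToSymbolId}:{e.EdgeKind}")
            .OrderBy(x => x, StringComparer.Ordinal);
        return string.Join("\n", binaries.Concat(symbols).Concat(edges));
    }
}
```

The same canonicalization helper can be reused for the golden-corpus comparison, so both tests agree on what “structurally identical” means.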