Vlad, here’s a concrete, **pure-C#** blueprint to build a multi-format binary analyzer (Mach-O, ELF, PE) that produces **call graphs + reachability**, with **no external tools**. Where needed, I point to permissively-licensed code you can **port** (copy) from other ecosystems.

---

## 0) Targets & non-negotiables

* **Formats:** Mach-O (incl. LC_DYLD_INFO / LC_DYLD_CHAINED_FIXUPS), ELF (SysV gABI), PE/COFF
* **Architectures:** x86-64 (and x86), AArch64 (ARM64)
* **Outputs:** JSON with **purls** per module + function-level call graph & reachability
* **No tool reuse:** Only pure C# libraries or code **ported** from permissive sources

---

## 1) Parsing the containers (pure C#)

**Pick one C# reader per format, keeping licenses permissive:**

* **ELF & Mach-O:** `ELFSharp` (pure managed C#; ELF + Mach-O reading). MIT/X11 license. ([GitHub][1])
* **ELF & PE (+ DWARF v4):** `LibObjectFile` (C#, BSD-2). Good ELF relocations (i386, x86_64, ARM, AArch64), PE directories, DWARF sections. Use it as your **common object model** for ELF+PE, then add a Mach-O adapter. ([GitHub][2])
* **PE (optional alternative):** `PeNet` (pure C#, broad PE directories, imp/exp, TLS, certs). Apache-2.0. Useful if you want a second implementation for cross-checks. ([GitHub][3])

> Why two libs? `LibObjectFile` gives you DWARF and clean models for ELF/PE; `ELFSharp` covers Mach-O today (and ELF as a fallback). You control the code paths.

**Spec references you’ll implement against** (for correctness of your readers & link-time semantics):

* **ELF (gABI, AMD64 supplement):** dynamic section, PLT/GOT, `R_X86_64_JUMP_SLOT` semantics (eager vs lazy). ([refspecs.linuxbase.org][4])
* **PE/COFF:** imports/exports/IAT, delay-load, TLS. ([Microsoft Learn][5])
* **Mach-O:** file layout, load commands (`LC_SYMTAB`, `LC_DYSYMTAB`, `LC_FUNCTION_STARTS`, `LC_DYLD_INFO(_ONLY)`), and the modern `LC_DYLD_CHAINED_FIXUPS`.
([leopard-adc.pepas.com][6])

---

## 2) Mach-O: what you must **port** (byte-for-byte compatible)

Apple moved from traditional dyld bind opcodes to **chained fixups** on macOS 12/iOS 15+; you need both:

* **Dyld bind opcodes** (`LC_DYLD_INFO(_ONLY)`): parse the BIND/LAZY_BIND streams (tuples of `<seg-index, seg-offset, type, library-ordinal, symbol-name, addend>`). Port minimal logic from **LLVM** or **LIEF** (both Apache-2.0-compatible) into C#. ([LIEF][7])
* **Chained fixups** (`LC_DYLD_CHAINED_FIXUPS`): port `dyld_chained_fixups_header` structs & chain walking from LLVM’s `MachO.h` or Apple’s dyld headers. This restores imports/rebases without running dyld. ([LLVM][8])
* **Function discovery hint:** read `LC_FUNCTION_STARTS` (ULEB128 deltas) to seed function boundaries; this is very helpful on stripped binaries. ([Stack Overflow][9])
* **Stubs mapping:** resolve `__TEXT,__stubs` ↔ `__DATA,__la_symbol_ptr` via the **indirect symbol table**; conceptually identical to ELF’s PLT/GOT. ([MaskRay][10])

> If you prefer an in-C# base for Mach-O manipulation, **Melanzana.MachO** exists (MIT) and has been used by .NET folks for Mach-O/Code Signing/obj writing; you can mine its approach for load-command modeling. ([GitHub][11])

---

## 3) Disassembly (pure C#, multi-arch)

* **x86/x64:** `iced` (C# decoder/disassembler/encoder; MIT; fast & complete). ([GitHub][12])
* **AArch64/ARM64:** two options that keep you pure-C#:
  * **Disarm** (pure C# ARM64 disassembler; MIT). Good starting point to decode & get branch/call kinds. ([GitHub][13])
  * **Port from Ryujinx ARMeilleure** (ARMv8 decoder/JIT in C#, MIT). You can lift only the **decoder** pieces you need. ([Gitee][14])
* **x86 fallback:** `SharpDisasm` (udis86 port in C#; BSD-2). Older than iced; keep as a reference.
([GitHub][15])

---

## 4) Call graph recovery (static)

**4.1 Function seeds**

* From symbols (`.dynsym`/`LC_SYMTAB`/PE exports)
* From **LC_FUNCTION_STARTS** (Mach-O) for stripped code ([Stack Overflow][9])
* From entrypoints (`_start`/`main` or PE AddressOfEntryPoint)
* From exception/unwind tables & DWARF (when present); `LibObjectFile` already models DWARF v4. ([GitHub][2])

**4.2 CFG & interprocedural calls**

* **Decode** with iced/Disarm from each seed; form **basic blocks** by following control-flow until terminators (ret/jmp/call).
* **Direct calls:** immediate targets become edges (PC-relative fixups where needed).
* **Imported calls:**
  * **ELF:** calls to PLT stubs → resolve via `.rela.plt` & `R_*_JUMP_SLOT` to symbol names (link-time target). ([cs61.seas.harvard.edu][16])
  * **PE:** calls through the **IAT** → resolve via `IMAGE_IMPORT_DESCRIPTOR` / thunk tables. ([Microsoft Learn][5])
  * **Mach-O:** calls to `__stubs` use **indirect symbol table** + `__la_symbol_ptr` (or chained fixups) → map to dylib/symbol. ([reinterpretcast.com][17])
* **Indirect calls within the binary:** heuristics only (function pointer tables, vtables, small constant pools). Keep them labeled **“indirect-unresolved”** unless a heuristic yields a concrete target.

**4.3 Cross-binary graph**

* Build module-level edges by simulating the platform’s loader:
  * **ELF:** honor `DT_NEEDED`, `DT_RPATH/RUNPATH`, versioning (`.gnu.version*`) to pick the definer of an imported symbol. gABI rules apply. ([refspecs.linuxbase.org][4])
  * **PE:** pick DLL from the import descriptors. ([Microsoft Learn][5])
  * **Mach-O:** `LC_LOAD_DYLIB` + dyld binding / chained fixups determine the provider image. ([LIEF][7])

---

## 5) Reachability analysis

Represent the **call graph** using a .NET graph lib (or a simple adjacency set). I suggest:

* **QuikGraph** (successor of QuickGraph; MS-PL) for algorithms (DFS/BFS, SCCs). Use it to compute reachability from chosen roots (entrypoint(s), exported APIs, or “sinks”).
([GitHub][18])

You can visualize with **MSAGL** (MIT) when you need layouts, but your core output is JSON. ([GitHub][19])

---

## 6) Symbol demangling (nice-to-have, pure C#)

* **Itanium (ELF/Mach-O):** Either port LLVM’s Itanium demangler or use a C# lib like **CxxDemangler** (a C# rewrite of `cpp_demangle`). ([LLVM][20])
* **MSVC (PE):** Port LLVM’s `MicrosoftDemangle.cpp` (Apache-2.0 with LLVM exception) to C#. ([LLVM][21])

---

## 7) JSON output (with purls)

Use a stable schema (example) to feed SBOM/vuln matching downstream:

```json
{
  "modules": [
    {
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64",
      "format": "ELF",
      "arch": "x86_64",
      "path": "/usr/lib/x86_64-linux-gnu/libssl.so.1.1",
      "exports": ["SSL_read", "SSL_write"],
      "imports": ["BIO_new", "EVP_CipherInit_ex"],
      "functions": [{"name":"SSL_do_handshake","va":"0x401020","size":512,"demangled": "..."}]
    }
  ],
  "graph": {
    "nodes": [
      {"id":"bin:main@0x401000","module": "pkg:generic/myapp@1.0.0"},
      {"id":"lib:SSL_read","module":"pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64"}
    ],
    "edges": [
      {"src":"bin:main@0x401000","dst":"lib:SSL_read","kind":"import_call","evidence":"ELF.R_X86_64_JUMP_SLOT"}
    ]
  },
  "reachability": {
    "roots": ["bin:_start","bin:main@0x401000"],
    "reachable": ["lib:SSL_read", "lib:SSL_write"],
    "unresolved_indirect_calls": [
      {"site":"0x402ABC","reason":"register-indirect"}
    ]
  }
}
```

---

## 8) Minimal C# module layout (sketch)

```
Stella.Analysis.Core/
  BinaryModule.cs          // common model (sections, symbols, relocs, imports/exports)
  Loader/
    PeLoader.cs            // wrap LibObjectFile (or PeNet) to BinaryModule
    ElfLoader.cs           // wrap LibObjectFile to BinaryModule
    MachOLoader.cs         // wrap ELFSharp + your ported Dyld/ChainedFixups
  Disasm/
    X86Disassembler.cs     // iced bridge: bytes -> instructions
    Arm64Disassembler.cs   // Disarm (or ARMeilleure port) bridge
  Graph/
    CallGraphBuilder.cs    // builds CFG per function + inter-procedural edges
    Reachability.cs        // BFS/DFS over QuikGraph
  Demangle/
    ItaniumDemangler.cs    // port or wrap CxxDemangler
    MicrosoftDemangler.cs  // port from LLVM
  Export/
    JsonWriter.cs          // writes schema above
```

---

## 9) Implementation notes (where issues usually bite)

* **Mach-O moderns:** Implement both dyld opcode **and** chained fixups; many macOS 12+/iOS 15+ binaries only have chained fixups. ([emergetools.com][22])
* **Stubs vs real targets (Mach-O):** map `__stubs` → `__la_symbol_ptr` via **indirect symbols** to the true imported symbol (or its post-fixup target). ([reinterpretcast.com][17])
* **ELF PLT/GOT:** treat `.plt` entries as **call trampolines**; ultimate edge should point to the symbol (library) that satisfies `DT_NEEDED` + version. ([refspecs.linuxbase.org][4])
* **PE delay-load:** don’t forget `IMAGE_DELAYLOAD_DESCRIPTOR` for delayed IATs. ([Microsoft Learn][5])
* **Function discovery:** use `LC_FUNCTION_STARTS` when symbols are stripped; it’s a cheap way to seed analysis. ([Stack Overflow][9])
* **Name clarity:** demangle Itanium/MSVC so downstream vuln rules can match consistently. ([LLVM][20])

---

## 10) What to **copy/port** verbatim (safe licenses)

* **Dyld bind & exports trie logic:** from **LLVM** or **LIEF** Mach-O (Apache-2.0). Great for getting the exact opcode semantics right. ([LIEF][7])
* **Chained fixups structs/walkers:** from **LLVM MachO.h** or Apple dyld headers (permissive headers). ([LLVM][8])
* **Itanium/MS demanglers:** LLVM demangler sources are standalone; easy to translate to C#. ([LLVM][23])
* **ARM64 decoder:** if Disarm gaps hurt, lift just the **decoder** pieces from **Ryujinx ARMeilleure** (MIT). ([Gitee][14])

*(Avoid GPL’d parsers like binutils/BFD; they will contaminate your codebase’s licensing.)*

---

## 11) End-to-end pipeline (per container image)

1. **Enumerate binaries** in the container FS.
2. **Parse** each with the appropriate loader → `BinaryModule` (+ imports/exports/symbols/relocs).
3. **Simulate linking** per platform to resolve imported functions to provider libraries. ([refspecs.linuxbase.org][4])
4. **Disassemble** functions (iced/Disarm) → CFGs → **call edges** (direct, PLT/IAT/stub, indirect).
5. **Assemble call graph** across modules; normalize names via demangling.
6. **Reachability**: given roots (entry or user-specified) compute reachable set; emit JSON with **purls** (from your SBOM/package resolver).
7. **(Optional)** dump GraphViz / MSAGL views for debugging. ([GitHub][19])

---

## 12) Quick heuristics for vulnerability triage

* **Sink maps**: flag edges to high-risk APIs (`strcpy`, `gets`, legacy SSL ciphers) even without CVE versioning.
* **DWARF line info** (when present): attach file:line to nodes for developer action. `LibObjectFile` gives you DWARF v4 reads. ([GitHub][2])

---

## 13) Test corpora

* **ELF:** glibc/openssl/libpng from distro repos; validate `R_*_JUMP_SLOT` handling and PLT edges. ([cs61.seas.harvard.edu][16])
* **PE:** system DLLs (Kernel32, Advapi32) and a small MSVC console app; validate IAT & delay-load. ([Microsoft Learn][5])
* **Mach-O:** Xcode-built binaries across macOS 11 & 12+ to cover both dyld opcode and chained fixups paths; verify `LC_FUNCTION_STARTS` improves discovery. ([Stack Overflow][9])

---

## 14) Deliverables you can start coding now

* **MachOLoader.cs**
  * Parse headers + load commands (ELFSharp).
  * Implement `DyldInfoParser` (port from LLVM/LIEF) and `ChainedFixupsParser` (port structs & walkers). ([LIEF][7])
* **X86Disassembler.cs / Arm64Disassembler.cs** (iced / Disarm bridges). ([GitHub][12])
* **CallGraphBuilder.cs** (recursive descent + linear sweep fallback; PLT/IAT/stub resolution).
* **Reachability.cs** (QuikGraph BFS/DFS). ([GitHub][18])
* **JsonWriter.cs** (schema above with purls).

---

### References (core, load-bearing)

* **ELFSharp** (ELF + Mach-O pure C#). ([GitHub][1])
* **LibObjectFile** (ELF/PE/DWARF C#, BSD-2). ([GitHub][2])
* **iced** (x86/x64 disasm, C#, MIT). ([GitHub][12])
* **Disarm** (ARM64 disasm, C#, MIT). ([GitHub][13])
* **Ryujinx (ARMeilleure)** (ARMv8 decode/JIT in C#, MIT).
([Gitee][14])
* **ELF gABI & AMD64 supplement** (PLT/GOT, relocations). ([refspecs.linuxbase.org][4])
* **PE/COFF** (imports/exports/IAT). ([Microsoft Learn][5])
* **Mach-O docs** (load commands; LC_FUNCTION_STARTS; dyld bindings; chained fixups). ([Apple Developer][24])

---

If you want, I can draft **`MachOLoader` + `DyldInfoParser`** in C# next, including chained-fixups structs (ported from LLVM’s headers) and an **iced**-based call-edge walker for x86-64.

[1]: https://github.com/konrad-kruczynski/elfsharp "GitHub - konrad-kruczynski/elfsharp: Pure managed C# library for reading ELF, UImage, Mach-O binaries."
[2]: https://github.com/xoofx/LibObjectFile "GitHub - xoofx/LibObjectFile: LibObjectFile is a .NET library to read, manipulate and write linker and executable object files (e.g ELF, PE, DWARF, ar...)"
[3]: https://github.com/secana/PeNet "secana/PeNet: Portable Executable (PE) library written in . ..."
[4]: https://refspecs.linuxbase.org/elf/gabi4%2B/contents.html "System V Application Binary Interface - DRAFT - 24 April 2001"
[5]: https://learn.microsoft.com/en-us/windows/win32/debug/pe-format "PE Format - Win32 apps"
[6]: https://leopard-adc.pepas.com/documentation/DeveloperTools/Conceptual/MachOTopics/0-Introduction/introduction.html "Mach-O Programming Topics: Introduction"
[7]: https://lief.re/doc/stable/doxygen/classLIEF_1_1MachO_1_1DyldInfo.html "MachO::DyldInfo Class Reference - LIEF"
[8]: https://llvm.org/doxygen/structllvm_1_1MachO_1_1dyld__chained__fixups__header.html "MachO::dyld_chained_fixups_header Struct Reference"
[9]: https://stackoverflow.com/questions/9602438/mach-o-file-lc-function-starts-load-command "Mach-O file LC_FUNCTION_STARTS load command"
[10]: https://maskray.me/blog/2021-09-19-all-about-procedure-linkage-table "All about Procedure Linkage Table"
[11]: https://github.com/dotnet/runtime/issues/77178 "Discussion: ObjWriter in C# · Issue #77178 · dotnet/runtime · GitHub"
[12]: https://github.com/icedland/iced "icedland/iced: Blazing fast and correct x86/x64 ..."
[13]: https://github.com/SamboyCoding/Disarm "SamboyCoding/Disarm: Fast, pure-C# ARM64 Disassembler"
[14]: https://gitee.com/ryujinx/Ryujinx/blob/master/LICENSE.txt "Ryujinx/Ryujinx"
[15]: https://github.com/justinstenning/SharpDisasm "justinstenning/SharpDisasm"
[16]: https://cs61.seas.harvard.edu/site/2022/pdf/x86-64-abi-20210928.pdf "System V Application Binary Interface"
[17]: https://www.reinterpretcast.com/hello-world-mach-o "The Nitty Gritty of “Hello World” on macOS | reinterpretcast.com"
[18]: https://github.com/KeRNeLith/QuikGraph "KeRNeLith/QuikGraph: Generic Graph Data Structures and ..."
[19]: https://github.com/microsoft/automatic-graph-layout "microsoft/automatic-graph-layout: A set of tools for ..."
[20]: https://llvm.org/doxygen/structllvm_1_1ItaniumPartialDemangler.html "ItaniumPartialDemangler Struct Reference"
[21]: https://llvm.org/doxygen/MicrosoftDemangle_8cpp_source.html "lib/Demangle/MicrosoftDemangle.cpp Source File"
[22]: https://www.emergetools.com/blog/posts/iOS15LaunchTime "How iOS 15 makes your app launch faster"
[23]: https://llvm.org/doxygen/ItaniumDemangle_8cpp.html "lib/Demangle/ItaniumDemangle.cpp File Reference"
[24]: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/CodeFootprint/Articles/MachOOverview.html "Overview of the Mach-O Executable Format"

Good, this is exactly the point where you want to nail the internal model, otherwise everything on top (VEX, policy, reporting) will be shaky.
Below is a detailed, implementation-ready plan for a **reachability graph with purl-aware edges**, covering ELF, PE, and Mach-O, in C#.

I’ll structure it as:

1. Overall graph design (3 layers: function, module, purl)
2. Core C# data model
3. Pipeline steps (end-to-end)
4. Format-specific edge construction (ELF / PE / Mach-O)
5. Reachability queries (from entrypoints to vulnerable purls / functions)
6. Purl edges (derived graph)
7. JSON output layout and integration with SBOM
8. Concrete implementation tasks

---

## 1. Overall graph design

You want three tightly linked graph layers:

1. **Function-level call graph (FLG)**
   * Nodes: individual **functions** inside binaries
   * Edges: calls from function A → function B (intra- or inter-module)
2. **Module-level graph (MLG)**
   * Nodes: **binaries** (ELF/PE/Mach-O files)
   * Edges: “module A calls module B at least once” (aggregated from FLG)
3. **Purl-level graph (PLG)**
   * Nodes: **purls** (packages or generic artifacts)
   * Edges: “purl P1 depends-at-runtime on purl P2” (aggregated from module edges)

The **reachability algorithm** runs primarily on the **function graph**, but:

* You can project reachability results to **module** and **purl** nodes.
* You can also run coarse-grained analysis directly on **purl graph** when needed (“Is any code in purl X reachable from the container entrypoint?”).

---

## 2. Core C# data model

### 2.1 Identifiers and enums

```csharp
public enum BinaryFormat { Elf, Pe, MachO }

public readonly record struct ModuleId(string Path, BinaryFormat Format);

public readonly record struct Purl(string Value);

public enum EdgeKind
{
    IntraModuleDirect,   // call foo -> bar in same module
    ImportCall,          // call via plt/iat/stub to imported function
    SyntheticRoot,       // root (entrypoint) edge
    IndirectUnresolved   // optional: we saw an indirect call we couldn't resolve
}
```

### 2.2 Function node

```csharp
public sealed class FunctionNode
{
    public int Id { get; init; }                  // internal numeric id
    public ModuleId Module { get; init; }
    public Purl Purl { get; init; }               // resolved from Module -> Purl
    public ulong Address { get; init; }           // VA or RVA
    public string Name { get; init; }             // mangled
    public string? DemangledName { get; init; }   // optional
    public bool IsExported { get; init; }
    public bool IsImportedStub { get; init; }     // e.g. PLT stub, Mach-O stub, PE thunks
    public bool IsRoot { get; set; }              // _start/main/entrypoint etc.
}
```

### 2.3 Edges

```csharp
public sealed class CallEdge
{
    public int FromId { get; init; }       // FunctionNode.Id
    public int ToId { get; init; }         // FunctionNode.Id
    public EdgeKind Kind { get; init; }
    public string Evidence { get; init; }  // e.g. "ELF.R_X86_64_JUMP_SLOT", "PE.IAT", "MachO.indirectSym"
}
```

### 2.4 Graph container

```csharp
public sealed class CallGraph
{
    public IReadOnlyDictionary<int, FunctionNode> Nodes { get; init; }
    public IReadOnlyDictionary<int, IReadOnlyList<CallEdge>> OutEdges { get; init; }
    public IReadOnlyDictionary<int, IReadOnlyList<CallEdge>> InEdges { get; init; }

    // Convenience: mappings from module / purl to function ids
    public IReadOnlyDictionary<ModuleId, IReadOnlyList<int>> FunctionsByModule { get; init; }
    public IReadOnlyDictionary<Purl, IReadOnlyList<int>> FunctionsByPurl { get; init; }
}
```

### 2.5 Purl-level graph view

You don’t store a separate physical graph; you **derive** it on demand:

```csharp
public sealed class PurlEdge
{
    public Purl From { get; init; }
    public Purl To { get; init; }
    public List<(int FromFnId, int ToFnId)> SupportingCalls { get; init; }
}

public sealed class PurlGraphView
{
    public IReadOnlyDictionary<Purl, HashSet<Purl>> Adjacent { get; init; }
    public IReadOnlyList<PurlEdge> Edges { get; init; }
}
```

---

## 3. Pipeline steps (end-to-end)

### Step 0 – Inputs

* Set of binaries (files) extracted from container image.
* SBOM or other metadata that can map a file path (or hash) → **purl**.

### Step 1 – Parse binaries → `BinaryModule` objects

You define a common in-memory model:

```csharp
public sealed class BinaryModule
{
    public ModuleId Id { get; init; }
    public Purl Purl { get; init; }
    public BinaryFormat Format { get; init; }

    // NOTE: SectionInfo, SymbolInfo and RelocationInfo are placeholder element
    // type names; substitute whatever your common model calls them.

    // Raw sections / segments
    public IReadOnlyList<SectionInfo> Sections { get; init; }

    // Symbols
    public IReadOnlyList<SymbolInfo> Symbols { get; init; }   // imports + exports + locals

    // Relocations / fixups
    public IReadOnlyList<RelocationInfo> Relocations { get; init; }

    // Import/export tables (PE)/dylib commands (Mach-O)/DT_NEEDED (ELF)
    public ImportInfo[] Imports { get; init; }
    public ExportInfo[] Exports { get; init; }
}
```

Implement format-specific loaders:

* `ElfLoader : IBinaryLoader`
* `PeLoader : IBinaryLoader`
* `MachOLoader : IBinaryLoader`

Each loader uses your chosen C# parsers or ported code and fills `BinaryModule`.

### Step 2 – Disassembly → basic blocks & candidate functions

For each `BinaryModule`:

1. Use appropriate decoder (iced for x86/x64; Disarm/ported ARMeilleure for AArch64).
2. Seed function starts:
   * Exported functions
   * Entry points (`_start`, `main`, AddressOfEntryPoint)
   * Mach-O `LC_FUNCTION_STARTS` if available
3. Walk instructions to build basic blocks:
   * Stop blocks at conditional/unconditional branches, calls, rets.
   * Record for each call site:
     * Address of caller function
     * Operand type (immediate, memory with import table address, etc.)

Disassembler outputs a list of `FunctionNode` skeletons (no cross-module link yet) and a list of **raw call sites**:

```csharp
public sealed class RawCallSite
{
    public int CallerFunctionId { get; init; }
    public ulong InstructionAddress { get; init; }
    public ulong? DirectTargetAddress { get; init; }   // e.g. CALL 0x401000
    public ulong? MemoryTargetAddress { get; init; }   // e.g. CALL [0x404000]
    public bool IsIndirect { get; init; }              // register-based etc.
}
```

### Step 3 – Build function nodes

Using disassembly + symbol tables:

* For each discovered function:
  * Determine: address, name (if sym available), export/import flags.
  * Map `ModuleId` → `Purl` using `IPurlResolver`.
* Populate `FunctionNode` instances and index them by `Id`.

### Step 4 – Construct intra-module edges

For each `RawCallSite`:

* If `DirectTargetAddress` falls inside a known function’s address range in the **same module**, add **IntraModuleDirect** edge.

This gives you “normal” calls like `foo()` calling `bar()` in the same .so/.dll/.

### Step 5 – Construct inter-module edges (import calls)

This is where ELF/PE/Mach-O differ; details in section 4 below. But the abstract logic is:

1. For each call site with `MemoryTargetAddress` (IAT slot / GOT entry / la_symbol_ptr / PLT):
2. From the module’s import, relocation or fixup tables, determine:
   * Which **imported symbol** it corresponds to (name, ordinal, etc.).
   * Which **imported module / dylib / DLL** provides that symbol.
3. Find (or create) a `FunctionNode` representing that imported symbol in the **provider module**.
4. Add an **ImportCall** edge from caller function to the provider `FunctionNode`.

This is the key to turning low-level dynamic linking into **purl-aware cross-module edges**, because each `FunctionNode` is already stamped with a `Purl`.

### Step 6 – Build adjacency structures

Once you have all `FunctionNode`s and `CallEdge`s:

* Build `OutEdges` and `InEdges` dictionaries keyed by `FunctionNode.Id`.
* Build `FunctionsByModule` / `FunctionsByPurl`.

---

## 4. Format-specific edge construction

This is the “how” for step 5, per binary format.

### 4.1 ELF

Goal: map call sites that go via PLT/GOT to an imported function in a `DT_NEEDED` library.

Algorithm:

1. Parse:
   * `.dynsym`, `.dynstr` – dynamic symbol table
   * `.rela.plt` / `.rel.plt` – relocation entries for PLT
   * `.got.plt` / `.got` – PLT’s GOT
   * `DT_NEEDED` entries – list of linked shared objects and their sonames
2. For each relocation of type `R_*_JUMP_SLOT`:
   * It applies to an entry in the PLT GOT; that GOT entry is what CALL instructions read from.
   * Relocation gives you:
     * Offset in GOT (`r_offset`)
     * Symbol index (`r_info` → symbol) → dynamic symbol (`ElfSymbol`)
     * Symbol name, type (FUNC), binding, etc.
3. Link GOT entries to call sites:
   * For each `RawCallSite` with `MemoryTargetAddress`, check if that address falls inside `.got.plt` (or `.got`). If it does:
     * Find relocation whose `r_offset` equals that GOT entry offset.
     * That tells you which **symbol** is being called.
   * For a `RawCallSite` whose `DirectTargetAddress` lands in `.plt` (the usual case when code calls `foo@plt`), first decode the stub’s indirect jump to recover the GOT slot it reads, then resolve that slot the same way.
4. Determine provider module:
   * From the symbol’s `st_name` and `DT_NEEDED` list, decide which shared object is expected to define it (an approximation is: first DT_NEEDED that provides that name).
   * Map DT_NEEDED → `ModuleId` (you’ll have loaded these modules separately, or you can create “placeholder modules” if they’re not in the container image).
5. Create edges:
   * Create/find `FunctionNode` for the **imported symbol** in provider module.
   * Add `CallEdge` from caller function to imported function, `EdgeKind = ImportCall`, `Evidence = "ELF.R_X86_64_JUMP_SLOT"` (or arch-specific).

This yields edges like:

* `myapp:main` → `libssl.so.1.1:SSL_read`
* `libfoo.so:foo` → `libc.so.6:malloc`

### 4.2 PE

Goal: map call sites that go via the Import Address Table (IAT) to imported functions in DLLs.

Algorithm:

1. Parse:
   * `IMAGE_IMPORT_DESCRIPTOR[]` – each for a DLL name.
   * Original thunk table (INT) – names/ordinals of imported symbols.
   * IAT – where the loader writes function addresses at runtime.
2. For each import entry:
   * Determine:
     * DLL name (`Name`)
     * Function name or ordinal (from INT)
     * IAT slot address (RVA)
3. Link IAT slots to call sites:
   * For each `RawCallSite` with `MemoryTargetAddress`:
     * Check if this address equals the VA of an IAT slot.
     * If yes, the call site is effectively calling that imported function.
4. Determine provider module:
   * The DLL name gives you a target module (e.g. `KERNEL32.dll` → `ModuleId`).
   * Ensure that DLL is represented as a `BinaryModule` or a “placeholder” if not present in image.
5. Create edges:
   * Create/find `FunctionNode` for imported function in provider module.
   * Add `CallEdge` with `EdgeKind = ImportCall` and `Evidence = "PE.IAT"` (or `"PE.DelayLoad"` if using delay load descriptors).

Example:

* `myservice.exe:Start` → `SSPICLI.dll:AcquireCredentialsHandleW`

### 4.3 Mach-O

Goal: map stub calls via `__TEXT,__stubs` / `__DATA,__la_symbol_ptr` (and / or chained fixups) to symbols in dependent dylibs.

Algorithm (for classic dyld opcodes, not chained fixups, then extend):

1. Parse:
   * Load commands:
     * `LC_SYMTAB`, `LC_DYSYMTAB`
     * `LC_LOAD_DYLIB` (to know dependent dylibs)
     * `LC_FUNCTION_STARTS` (for seeding functions)
     * `LC_DYLD_INFO` (rebase/bind/lazy bind)
   * `__TEXT,__stubs` – stub code
   * `__DATA,__la_symbol_ptr` (or `__DATA_CONST,__la_symbol_ptr`) – lazy pointer table
   * **Indirect symbol table** – maps slot indices to symbol table indices
2. Stub → la_symbol_ptr mapping:
   * Stubs are small functions (usually a few instructions) that indirect through the corresponding `la_symbol_ptr` entry.
   * For each stub function:
     * Determine which la_symbol_ptr entry it uses. The stub index is the stub’s offset into `__stubs` divided by the stub size (the section header’s `reserved2`); the section’s `reserved1` is its starting index in the indirect symbol table.
     * From the indirect symbol table, find which dynamic symbol that entry corresponds to.
     * This gives you the symbol name and, via the library ordinal stored in the symbol’s `n_desc`, the index into the `LC_LOAD_DYLIB` list (dylib ordinal).
3. Link stub call sites:
   * In disassembly, treat calls to these stub functions as **import calls**.
   * For each call instruction `CALL stub_function` (`BL` on ARM64):
     * `RawCallSite.DirectTargetAddress` lies inside `__TEXT,__stubs`.
     * Resolve stub → la_symbol_ptr → symbol → dylib.
4. Determine provider module:
   * From dylib ordinal and load commands, get the path / install name of dylib (`libssl.1.1.dylib`, etc.).
   * Map that to a `ModuleId` in your module set.
5. Create edges:
   * Create/find imported `FunctionNode` in provider module.
   * Add `CallEdge` from caller to that function with `EdgeKind = ImportCall`, `Evidence = "MachO.IndirectSymbol"`.

For **chained fixups** (`LC_DYLD_CHAINED_FIXUPS`), you’ll compute a similar mapping but walking chain entries instead of traditional lazy/weak binds. The key is still:

* Map a stub or function to a **fixup** entry.
* From fixup, determine the symbol and dylib.
* Then connect call-site → imported function.

---

## 5. Reachability queries

Once the graph is built, reachability is “just graph algorithms” + mapping back to purls.

### 5.1 Roots

Decide what are your **root functions**:

* Binary entrypoints:
  * ELF: `_start`, `main`, constructors (`.init_array`)
  * PE: AddressOfEntryPoint, registered service entrypoints
  * Mach-O: `_main`, constructors
* Optionally, any exported API function that a container orchestrator or plugin system will call.

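The root choices listed above can be collected mechanically once the loaders have run. The following is a minimal sketch, not the definitive selection logic: `FunctionInfo` and `RootSelector` are hypothetical names introduced only for illustration (they are not part of the data model in section 2), and it assumes the loader hands over the entry-point address and any constructor addresses (for example from `.init_array`) it found.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum BinaryFormat { Elf, Pe, MachO }

// Hypothetical minimal view of a discovered function (not the full FunctionNode).
public sealed record FunctionInfo(int Id, string Name, ulong Address);

public static class RootSelector
{
    // Well-known entry symbol names per format, taken from the list above.
    private static readonly Dictionary<BinaryFormat, string[]> EntryNames = new()
    {
        [BinaryFormat.Elf] = new[] { "_start", "main" },
        [BinaryFormat.Pe] = Array.Empty<string>(),   // PE roots come from AddressOfEntryPoint
        [BinaryFormat.MachO] = new[] { "_main" },
    };

    // Returns the ids of functions that match a well-known entry name,
    // the entry-point address, or one of the constructor addresses.
    public static HashSet<int> SelectRootIds(
        BinaryFormat format,
        IEnumerable<FunctionInfo> functions,
        ulong? entryPointAddress,
        IEnumerable<ulong> constructorAddresses)
    {
        var names = new HashSet<string>(EntryNames[format], StringComparer.Ordinal);
        var addresses = new HashSet<ulong>(constructorAddresses);
        if (entryPointAddress is ulong entry) addresses.Add(entry);

        return functions
            .Where(f => names.Contains(f.Name) || addresses.Contains(f.Address))
            .Select(f => f.Id)
            .ToHashSet();
    }
}
```

Exported API roots are left out on purpose: whether an export counts as a root depends on how the image is used, so that decision is better passed in by the caller than baked into a helper like this.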
Mark them as `FunctionNode.IsRoot = true` and create synthetic edges from a special root node if you want:

```csharp
var syntheticRoot = new FunctionNode
{
    Id = 0,
    Name = "",
    IsRoot = true,
    // Module, Purl can be special markers
};

foreach (var fn in allFunctions.Where(f => f.IsRoot))
{
    edges.Add(new CallEdge
    {
        FromId = syntheticRoot.Id,
        ToId = fn.Id,
        Kind = EdgeKind.SyntheticRoot,
        Evidence = "Root"
    });
}
```

### 5.2 Reachability algorithm (function-level)

Use BFS/DFS from the root node(s):

```csharp
public sealed class ReachabilityResult
{
    // init accessor so the result can be populated via an object initializer
    public HashSet<int> ReachableFunctions { get; init; } = new();
}

public ReachabilityResult ComputeReachableFunctions(CallGraph graph, IEnumerable<int> rootIds)
{
    var visited = new HashSet<int>();
    var stack = new Stack<int>();

    // Seed the worklist with every root exactly once.
    foreach (var root in rootIds)
    {
        if (visited.Add(root))
            stack.Push(root);
    }

    // Iterative DFS over outgoing call edges.
    while (stack.Count > 0)
    {
        var current = stack.Pop();
        if (!graph.OutEdges.TryGetValue(current, out var edges))
            continue;

        foreach (var edge in edges)
        {
            if (visited.Add(edge.ToId))
                stack.Push(edge.ToId);
        }
    }

    return new ReachabilityResult { ReachableFunctions = visited };
}
```

### 5.3 Project reachability to modules and purls

Given `ReachableFunctions`:

```csharp
public sealed class ReachabilityProjection
{
    public HashSet<ModuleId> ReachableModules { get; } = new();
    public HashSet<Purl> ReachablePurls { get; } = new();
}

public ReachabilityProjection ProjectToModulesAndPurls(CallGraph graph, ReachabilityResult result)
{
    var projection = new ReachabilityProjection();

    foreach (var fnId in result.ReachableFunctions)
    {
        if (!graph.Nodes.TryGetValue(fnId, out var fn))
            continue;

        projection.ReachableModules.Add(fn.Module);
        projection.ReachablePurls.Add(fn.Purl);
    }

    return projection;
}
```

Now you can answer questions like:

* “Is any code from purl `pkg:deb/openssl@1.1.1w-1` reachable from the container entrypoint?”
* “Which purls are reachable at all?”

### 5.4 Vulnerability reachability

Assume you’ve mapped each vulnerability to:

* `Purl` (where it lives)
* `AffectedFunctionNames` (symbols; optionally demangled)

You can implement:

```csharp
public sealed class VulnerabilitySink
{
    public string VulnerabilityId { get; init; }   // CVE-...
    public Purl Purl { get; init; }
    public string FunctionName { get; init; }      // symbol name or demangled
}
```

Resolution algorithm:

1. For each `VulnerabilitySink`, find all `FunctionNode` with:
   * `node.Purl == sink.Purl` and
   * `node.Name` or `node.DemangledName` matches `sink.FunctionName`.
2. For each such node, check `ReachableFunctions.Contains(node.Id)`.
3. Build a `VulnerabilityFinding` object:

```csharp
public sealed class VulnerabilityFinding
{
    public string VulnerabilityId { get; init; }
    public Purl Purl { get; init; }
    public bool IsReachable { get; init; }
    public List<int> SinkFunctionIds { get; init; } = new();
}
```

Plus, if you want **path evidence**, you run a shortest-path search (BFS predecessor map) from root to sink and store the sequence of `FunctionNode.Id`s.

---

## 6. Purl edges (derived graph)

For reporting and analytics, it’s useful to produce a **purl-level dependency graph**.
Given `CallGraph`:

```csharp
public PurlGraphView BuildPurlGraph(CallGraph graph)
{
    var edgesByPair = new Dictionary<(Purl From, Purl To), PurlEdge>();

    foreach (var kv in graph.OutEdges)
    {
        var fromFn = graph.Nodes[kv.Key];

        foreach (var edge in kv.Value)
        {
            var toFn = graph.Nodes[edge.ToId];

            if (fromFn.Purl.Equals(toFn.Purl))
                continue; // intra-purl, skip if you only care about inter-purl

            var key = (fromFn.Purl, toFn.Purl);
            if (!edgesByPair.TryGetValue(key, out var pe))
            {
                pe = new PurlEdge
                {
                    From = fromFn.Purl,
                    To = toFn.Purl,
                    SupportingCalls = new List<(int, int)>()
                };
                edgesByPair[key] = pe;
            }

            pe.SupportingCalls.Add((fromFn.Id, toFn.Id));
        }
    }

    var adj = new Dictionary<Purl, HashSet<Purl>>();
    foreach (var kv in edgesByPair)
    {
        var (from, to) = kv.Key;
        if (!adj.TryGetValue(from, out var list))
        {
            list = new HashSet<Purl>();
            adj[from] = list;
        }
        list.Add(to);
    }

    return new PurlGraphView
    {
        Adjacent = adj,
        Edges = edgesByPair.Values.ToList()
    };
}
```

This gives you:

* A coarse view of runtime dependencies between purls (“Purl A calls into Purl B”).
* Enough context to emit purl-level VEX or to reason about trust at package granularity.

---

## 7. JSON output and SBOM integration

### 7.1 JSON shape (high level)

You can emit a composite document:

```json
{
  "image": "registry.example.com/app@sha256:...",
  "modules": [
    {
      "moduleId": { "path": "/usr/lib/libssl.so.1.1", "format": "Elf" },
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
      "arch": "x86_64"
    }
  ],
  "functions": [
    {
      "id": 42,
      "name": "SSL_do_handshake",
      "demangledName": null,
      "module": { "path": "/usr/lib/libssl.so.1.1", "format": "Elf" },
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
      "address": "0x401020",
      "exported": true
    }
  ],
  "edges": [
    {
      "from": 10,
      "to": 42,
      "kind": "ImportCall",
      "evidence": "ELF.R_X86_64_JUMP_SLOT"
    }
  ],
  "reachability": {
    "roots": [1],
    "reachableFunctions": [1,10,42]
  },
  "purlGraph": {
    "edges": [
      {
        "from": "pkg:generic/myapp@1.0.0",
        "to": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
        "supportingCalls": [[10,42]]
      }
    ]
  },
  "vulnerabilities": [
    {
      "id": "CVE-2024-XXXX",
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
      "sinkFunctions": [42],
      "reachable": true,
      "paths": [
        [1, 10, 42]
      ]
    }
  ]
}
```

### 7.2 Purl resolution

Implement an `IPurlResolver` interface:

```csharp
public interface IPurlResolver
{
    Purl ResolveForModule(string filePath, byte[] contentHash);
}
```

Possible implementations:

* `SbomPurlResolver` – given a CycloneDX/SPDX SBOM for the image, match by path or checksum.
* `LinuxPackagePurlResolver` – read `/var/lib/dpkg/status` / rpm DB in the filesystem.
* `GenericPurlResolver` – fallback: `pkg:generic/`.

You call the resolver in your loaders so that **every `BinaryModule` has a purl** and thus every `FunctionNode` has a purl.

---

## 8. Concrete implementation tasks for your team

1. **Data model & interfaces**
   * Implement `ModuleId`, `FunctionNode`, `CallEdge`, `CallGraph`.
   * Define `RawCallSite`, `BinaryModule`, and `IPurlResolver`.
2. **Loaders**
   * `ElfLoader`: fill symbols, dynamic relocations (PLT), DT_NEEDED, etc.
   * `PeLoader`: import descriptors, IAT, delay-load.
   * `MachOLoader`: load commands, stubs, la_symbol_ptr, indirect symbols / chained fixups.
3. **Disassembly**
   * `X86Disassembler` (iced) and `Arm64Disassembler` (Disarm or port).
   * Function detection and `RawCallSite` extraction.
4. **CallGraphBuilder**
   * Build intra-module edges from direct calls.
   * Build inter-module edges using the format-specific rules above.
   * Construct final `CallGraph` with adjacency maps and purl mappings.
5. **Reachability**
   * Implement BFS/DFS from root functions.
   * Projection to modules + purls.
   * Vulnerability sink resolution & path reconstruction.
6. **Export**
   * JSON serializer for the schema above.
   * Optional: purl-level summary / VEX generator.

---

If you want, next step I can do a **more concrete design for `CallGraphBuilder`** (including per-format helper classes with method signatures) or a **C# skeleton** for the `ElfImportResolver`, `PeImportResolver`, and `MachOStubResolver` that plug directly into this plan.
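
In the meantime, the path reconstruction item under task 5 (the BFS predecessor map mentioned in section 5.4) can be sketched as below. This is only an illustration under stated assumptions: `PathFinder` is a hypothetical helper name, and the adjacency shape (function id → list of callee ids) is a simplification of `CallGraph.OutEdges`, which in the plan holds `CallEdge` objects rather than bare ids.

```csharp
using System.Collections.Generic;

public static class PathFinder
{
    // Returns the shortest call path (by edge count) from rootId to sinkId,
    // or null when the sink is not reachable from the root.
    // 'outEdges' maps a function id to the ids it calls (assumed shape).
    public static List<int>? FindShortestPath(
        IReadOnlyDictionary<int, IReadOnlyList<int>> outEdges, int rootId, int sinkId)
    {
        // predecessor[n] is the node from which BFS first reached n.
        var predecessor = new Dictionary<int, int> { [rootId] = rootId };
        var queue = new Queue<int>();
        queue.Enqueue(rootId);

        while (queue.Count > 0)
        {
            var current = queue.Dequeue();
            if (current == sinkId) break;
            if (!outEdges.TryGetValue(current, out var targets)) continue;

            foreach (var next in targets)
            {
                // TryAdd returns false when 'next' was already discovered.
                if (predecessor.TryAdd(next, current))
                    queue.Enqueue(next);
            }
        }

        if (!predecessor.ContainsKey(sinkId)) return null;

        // Walk the predecessor chain back from the sink, then reverse it.
        var path = new List<int>();
        for (var node = sinkId; ; node = predecessor[node])
        {
            path.Add(node);
            if (node == rootId) break;
        }
        path.Reverse();
        return path;
    }
}
```

With a synthetic root node wired to all real roots, a single call per sink is enough to fill the `paths` array in the JSON document; without one, you would call it once per root and keep the shortest result.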