git.stella-ops.org/18-Nov-2026 - CSharp-Binary-Analyzer.md at 522fff73cd1dea85bfa61dba58a8f1805abcd0b4 - git.stella-ops.org

2025-11-20 23:16:02 +02:00

Vlad, here’s a concrete, pure‑C# blueprint to build a multi‑format binary analyzer (Mach‑O, ELF, PE) that produces call graphs + reachability, with no external tools. Where needed, I point to permissively‑licensed code you can port (copy) from other ecosystems.

0) Targets & non‑negotiables

Formats: Mach‑O (inc. LC_DYLD_INFO / LC_DYLD_CHAINED_FIXUPS), ELF (SysV gABI), PE/COFF
Architectures: x86‑64 (and x86), AArch64 (ARM64)
Outputs: JSON with purls per module + function‑level call graph & reachability
No tool reuse: Only pure C# libraries or code ported from permissive sources

1) Parsing the containers (pure C#)

Pick one C# reader per format, keeping licenses permissive:

ELF & Mach‑O: ELFSharp (pure managed C#; ELF + Mach‑O reading). MIT/X11 license. (GitHub)
ELF & PE (+ DWARF v4): LibObjectFile (C#, BSD‑2). Good ELF relocations (i386, x86_64, ARM, AArch64), PE directories, DWARF sections. Use it as your common object model for ELF+PE, then add a Mach‑O adapter. (GitHub)
PE (optional alternative): PeNet (pure C#, broad PE directories, imp/exp, TLS, certs). MIT. Useful if you want a second implementation for cross‑checks. (GitHub)

Why two libs? LibObjectFile gives you DWARF and clean models for ELF/PE; ELFSharp covers Mach‑O today (and ELF as a fallback). You control the code paths.

Spec references you’ll implement against (for correctness of your readers & link‑time semantics):

ELF (gABI, AMD64 supplement): dynamic section, PLT/GOT, R_X86_64_JUMP_SLOT semantics (eager vs lazy). (refspecs.linuxbase.org)
PE/COFF: imports/exports/IAT, delay‑load, TLS. (Microsoft Learn)
Mach‑O: file layout, load commands (LC_SYMTAB, LC_DYSYMTAB, LC_FUNCTION_STARTS, LC_DYLD_INFO(_ONLY)), and the modern LC_DYLD_CHAINED_FIXUPS. (leopard-adc.pepas.com)

2) Mach‑O: what you must port (byte‑for‑byte compatible)

Apple moved from traditional dyld bind opcodes to chained fixups on macOS 12/iOS 15+; you need both:

Dyld bind opcodes (LC_DYLD_INFO(_ONLY)): parse the BIND/LAZY_BIND streams (tuples of <seg,off,type,ordinal,symbol,addend>). Port minimal logic from LLVM or LIEF (both Apache‑2.0‑compatible) into C#. (LIEF)
Chained fixups (LC_DYLD_CHAINED_FIXUPS): port dyld_chained_fixups_header structs & chain walking from LLVM’s MachO.h or Apple’s dyld headers. This restores imports/rebases without running dyld. (LLVM)
Function discovery hint: read LC_FUNCTION_STARTS (ULEB128 deltas) to seed function boundaries—very helpful on stripped binaries. (Stack Overflow)
Stubs mapping: resolve __TEXT,__stubs ↔ __DATA,__la_symbol_ptr via the indirect symbol table; conceptually identical to ELF’s PLT/GOT. (MaskRay)
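To make the LC_FUNCTION_STARTS hint concrete, here is a minimal sketch of the ULEB128 delta decoding. The class and method names are hypothetical, and it assumes you have already sliced the payload bytes out of __LINKEDIT and know the __TEXT segment's vmaddr (the base that the first delta is relative to).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library named above.
public static class FunctionStartsDecoder
{
    // Decodes an LC_FUNCTION_STARTS payload: a run of ULEB128 deltas.
    // The first delta is relative to textBase (assumed: __TEXT vmaddr);
    // each later delta is relative to the previous function start.
    // A zero delta terminates the list.
    public static List<ulong> Decode(ReadOnlySpan<byte> data, ulong textBase)
    {
        var starts = new List<ulong>();
        ulong address = textBase;
        int pos = 0;

        while (pos < data.Length)
        {
            ulong delta = 0;
            int shift = 0;
            byte b;
            do
            {
                if (pos >= data.Length)
                    return starts; // truncated payload: keep what was decoded

                b = data[pos++];
                if (shift < 64)
                    delta |= (ulong)(b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);

            if (delta == 0)
                break; // terminator

            address += delta;
            starts.Add(address);
        }

        return starts;
    }
}
```

Feed the returned addresses into your function seed set alongside symbol-derived starts; on ARM thumb targets the low bit may carry mode information, which this sketch ignores.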

If you prefer an in‑C# base for Mach‑O manipulation, Melanzana.MachO exists (MIT) and has been used by .NET folks for Mach‑O/Code Signing/obj writing; you can mine its approach for load‑command modeling. (GitHub)

3) Disassembly (pure C#, multi‑arch)

x86/x64: iced (C# decoder/disassembler/encoder; MIT; fast & complete). (GitHub)
AArch64/ARM64: two options that keep you pure‑C#:
- Disarm (pure C# ARM64 disassembler; MIT). Good starting point to decode & get branch/call kinds. (GitHub)
- Port from Ryujinx ARMeilleure (ARMv8 decoder/JIT in C#, MIT). You can lift only the decoder pieces you need. (Gitee)
x86 fallback: SharpDisasm (udis86 port in C#; BSD‑2). Older than iced; keep as a reference. (GitHub)

4) Call graph recovery (static)

4.1 Function seeds

From symbols (.dynsym/LC_SYMTAB/PE exports)
From LC_FUNCTION_STARTS (Mach‑O) for stripped code (Stack Overflow)
From entrypoints (_start/main or PE AddressOfEntryPoint)
From exception/unwind tables & DWARF (when present)—LibObjectFile already models DWARF v4. (GitHub)

4.2 CFG & interprocedural calls

Decode with iced/Disarm from each seed; form basic blocks by following control‑flow until terminators (ret/jmp/call).
Direct calls: immediate targets become edges (PC‑relative fixups where needed).
Imported calls:
- ELF: calls to PLT stubs → resolve via .rela.plt & R_*_JUMP_SLOT to symbol names (link‑time target). (cs61.seas.harvard.edu)
- PE: calls through the IAT → resolve via IMAGE_IMPORT_DESCRIPTOR / thunk tables. (Microsoft Learn)
- Mach‑O: calls to __stubs use indirect symbol table + __la_symbol_ptr (or chained fixups) → map to dylib/symbol. (reinterpretcast.com)
Indirect calls within the binary: heuristics only (function pointer tables, vtables, small constant pools). Keep them labeled “indirect‑unresolved” unless a heuristic yields a concrete target.

4.3 Cross‑binary graph

Build module‑level edges by simulating the platform’s loader:
- ELF: honor DT_NEEDED, DT_RPATH/RUNPATH, versioning (.gnu.version*) to pick the definer of an imported symbol. gABI rules apply. (refspecs.linuxbase.org)
- PE: pick DLL from the import descriptors. (Microsoft Learn)
- Mach‑O: LC_LOAD_DYLIB + dyld binding / chained fixups determine the provider image. (LIEF)
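For the ELF bullet above, the simplest approximation of "pick the definer" is a first-match scan over the DT_NEEDED list in order. The sketch below assumes you have already collected each library's exported names; the type and parameter names are hypothetical, and it deliberately ignores symbol versioning, RPATH/RUNPATH search, and transitive dependencies.

```csharp
using System.Collections.Generic;

// Hypothetical helper: first-match provider lookup over DT_NEEDED order.
public static class ElfProviderLookup
{
    // Returns the soname of the first DT_NEEDED library that exports `symbol`,
    // or null when none of the known libraries defines it (in that case,
    // attach the edge to a placeholder module instead).
    public static string? FindProvider(
        string symbol,
        IReadOnlyList<string> neededInOrder,
        IReadOnlyDictionary<string, HashSet<string>> exportsBySoname)
    {
        foreach (var soname in neededInOrder)
        {
            if (exportsBySoname.TryGetValue(soname, out var exports) &&
                exports.Contains(symbol))
            {
                return soname;
            }
        }

        return null;
    }
}
```

The order of `neededInOrder` matters: if two libraries export the same name, the earlier DT_NEEDED entry wins in this approximation.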

5) Reachability analysis

Represent the call graph using a .NET graph lib (or a simple adjacency set). I suggest:

QuikGraph (successor of QuickGraph; MIT) for algorithms (DFS/BFS, SCCs). Use it to compute reachability from chosen roots (entrypoint(s), exported APIs, or “sinks”). (GitHub)

You can visualize with MSAGL (MIT) when you need layouts, but your core output is JSON. (GitHub)

6) Symbol demangling (nice‑to‑have, pure C#)

Itanium (ELF/Mach‑O): Either port LLVM’s Itanium demangler or use a C# lib like CxxDemangler (a C# rewrite of cpp_demangle). (LLVM)
MSVC (PE): Port LLVM’s MicrosoftDemangle.cpp (Apache‑2.0 with LLVM exception) to C#. (LLVM)

7) JSON output (with purls)

Use a stable schema (example) to feed SBOM/vuln matching downstream:

{
  "modules": [
    {
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64",
      "format": "ELF",
      "arch": "x86_64",
      "path": "/usr/lib/x86_64-linux-gnu/libssl.so.1.1",
      "exports": ["SSL_read", "SSL_write"],
      "imports": ["BIO_new", "EVP_CipherInit_ex"],
      "functions": [{"name":"SSL_do_handshake","va":"0x401020","size":512,"demangled": "..."}]
    }
  ],
  "graph": {
    "nodes": [
      {"id":"bin:main@0x401000","module": "pkg:generic/myapp@1.0.0"},
      {"id":"lib:SSL_read","module":"pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64"}
    ],
    "edges": [
      {"src":"bin:main@0x401000","dst":"lib:SSL_read","kind":"import_call","evidence":"ELF.R_X86_64_JUMP_SLOT"}
    ]
  },
  "reachability": {
    "roots": ["bin:_start","bin:main@0x401000"],
    "reachable": ["lib:SSL_read", "lib:SSL_write"],
    "unresolved_indirect_calls": [
      {"site":"0x402ABC","reason":"register-indirect"}
    ]
  }
}

8) Minimal C# module layout (sketch)

Stella.Analysis.Core/
  BinaryModule.cs            // common model (sections, symbols, relocs, imports/exports)
  Loader/
    PeLoader.cs              // wrap LibObjectFile (or PeNet) to BinaryModule
    ElfLoader.cs             // wrap LibObjectFile to BinaryModule
    MachOLoader.cs           // wrap ELFSharp + your ported Dyld/ChainedFixups
  Disasm/
    X86Disassembler.cs       // iced bridge: bytes -> instructions
    Arm64Disassembler.cs     // Disarm (or ARMeilleure port) bridge
  Graph/
    CallGraphBuilder.cs      // builds CFG per function + inter-procedural edges
    Reachability.cs          // BFS/DFS over QuikGraph
  Demangle/
    ItaniumDemangler.cs      // port or wrap CxxDemangler
    MicrosoftDemangler.cs    // port from LLVM
  Export/
    JsonWriter.cs            // writes schema above

9) Implementation notes (where issues usually bite)

Modern Mach‑O: implement both the dyld opcode and chained fixups paths; many binaries built for macOS 12+/iOS 15+ carry only chained fixups. (emergetools.com)
Stubs vs real targets (Mach‑O): map __stubs → __la_symbol_ptr via indirect symbols to the true imported symbol (or its post‑fixup target). (reinterpretcast.com)
ELF PLT/GOT: treat .plt entries as call trampolines; ultimate edge should point to the symbol (library) that satisfies DT_NEEDED + version. (refspecs.linuxbase.org)
PE delay‑load: don’t forget IMAGE_DELAYLOAD_DESCRIPTOR for delayed IATs. (Microsoft Learn)
Function discovery: use LC_FUNCTION_STARTS when symbols are stripped; it’s a cheap way to seed analysis. (Stack Overflow)
Name clarity: demangle Itanium/MSVC so downstream vuln rules can match consistently. (LLVM)

10) What to copy/port verbatim (safe licenses)

Dyld bind & exports trie logic: from LLVM or LIEF Mach‑O (Apache‑2.0). Great for getting the exact opcode semantics right. (LIEF)
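Chained fixups structs/walkers: from LLVM MachO.h (Apache‑2.0 with LLVM exception). Apple's dyld headers are published under the APSL, so use them as a reference and check license compatibility before copying from them. (LLVM)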
Itanium/MS demanglers: LLVM demangler sources are standalone; easy to translate to C#. (LLVM)
ARM64 decoder: if Disarm gaps hurt, lift just the decoder pieces from Ryujinx ARMeilleure (MIT). (Gitee)

(Avoid GPL’d parsers like binutils/BFD; they will contaminate your codebase’s licensing.)

11) End‑to‑end pipeline (per container image)

Enumerate binaries in the container FS.
Parse each with the appropriate loader → BinaryModule (+ imports/exports/symbols/relocs).
Simulate linking per platform to resolve imported functions to provider libraries. (refspecs.linuxbase.org)
Disassemble functions (iced/Disarm) → CFGs → call edges (direct, PLT/IAT/stub, indirect).
Assemble call graph across modules; normalize names via demangling.
Reachability: given roots (entry or user‑specified) compute reachable set; emit JSON with purls (from your SBOM/package resolver).
(Optional) dump GraphViz / MSAGL views for debugging. (GitHub)

12) Quick heuristics for vulnerability triage

Sink maps: flag edges to high‑risk APIs (strcpy, gets, legacy SSL ciphers) even without CVE versioning.
DWARF line info (when present): attach file:line to nodes for developer action. LibObjectFile gives you DWARF v4 reads. (GitHub)

13) Test corpora

ELF: glibc/openssl/libpng from distro repos; validate R_*_JUMP_SLOT handling and PLT edges. (cs61.seas.harvard.edu)
PE: system DLLs (Kernel32, Advapi32) and a small MSVC console app; validate IAT & delay‑load. (Microsoft Learn)
Mach‑O: Xcode‑built binaries across macOS 11 & 12+ to cover both dyld opcode and chained fixups paths; verify LC_FUNCTION_STARTS improves discovery. (Stack Overflow)

14) Deliverables you can start coding now

MachOLoader.cs
- Parse headers + load commands (ELFSharp).
- Implement DyldInfoParser (port from LLVM/LIEF) and ChainedFixupsParser (port structs & walkers). (LIEF)
X86Disassembler.cs / Arm64Disassembler.cs (iced / Disarm bridges). (GitHub)
CallGraphBuilder.cs (recursive descent + linear sweep fallback; PLT/IAT/stub resolution).
Reachability.cs (QuikGraph BFS/DFS). (GitHub)
JsonWriter.cs (schema above with purls).

References (core, load‑bearing)

ELFSharp (ELF + Mach‑O pure C#). (GitHub)
LibObjectFile (ELF/PE/DWARF C#, BSD‑2). (GitHub)
iced (x86/x64 disasm, C#, MIT). (GitHub)
Disarm (ARM64 disasm, C#, MIT). (GitHub)
Ryujinx (ARMeilleure) (ARMv8 decode/JIT in C#, MIT). (Gitee)
ELF gABI & AMD64 supplement (PLT/GOT, relocations). (refspecs.linuxbase.org)
PE/COFF (imports/exports/IAT). (Microsoft Learn)
Mach‑O docs (load commands; LC_FUNCTION_STARTS; dyld bindings; chained fixups). (Apple Developer)

If you want, I can draft MachOLoader + DyldInfoParser in C# next, including chained‑fixups structs (ported from LLVM’s headers) and an iced‑based call‑edge walker for x86‑64.

Good, this is exactly the point where you want to nail the internal model, otherwise everything on top (VEX, policy, reporting) will be shaky.

Below is a detailed, implementation-ready plan for a reachability graph with purl-aware edges, covering ELF, PE, and Mach-O, in C#.

I’ll structure it as:

Overall graph design (3 layers: function, module, purl)
Core C# data model
Pipeline steps (end-to-end)
Format-specific edge construction (ELF / PE / Mach-O)
Reachability queries (from entrypoints to vulnerable purls / functions)
JSON output layout and integration with SBOM

1. Overall graph design

You want three tightly linked graph layers:

Function-level call graph (FLG)
- Nodes: individual functions inside binaries
- Edges: calls from function A → function B (intra- or inter-module)
Module-level graph (MLG)
- Nodes: binaries (ELF/PE/Mach-O files)
- Edges: “module A calls module B at least once” (aggregated from FLG)
Purl-level graph (PLG)
- Nodes: purls (packages or generic artifacts)
- Edges: “purl P1 depends-at-runtime on purl P2” (aggregated from module edges)

The reachability algorithm runs primarily on the function graph, but:

You can project reachability results to module and purl nodes.
You can also run coarse-grained analysis directly on purl graph when needed (“Is any code in purl X reachable from the container entrypoint?”).

2. Core C# data model

2.1 Identifiers and enums

public enum BinaryFormat { Elf, Pe, MachO }

public readonly record struct ModuleId(string Path, BinaryFormat Format);

public readonly record struct Purl(string Value);

public enum EdgeKind
{
    IntraModuleDirect,       // call foo -> bar in same module
    ImportCall,              // call via plt/iat/stub to imported function
    SyntheticRoot,           // root (entrypoint) edge
    IndirectUnresolved       // optional: we saw an indirect call we couldn't resolve
}

2.2 Function node

public sealed class FunctionNode
{
    public int Id { get; init; }                // internal numeric id
    public ModuleId Module { get; init; }
    public Purl Purl { get; init; }             // resolved from Module -> Purl
    public ulong Address { get; init; }         // VA or RVA
    public string Name { get; init; }           // mangled
    public string? DemangledName { get; init; } // optional
    public bool IsExported { get; init; }
    public bool IsImportedStub { get; init; }   // e.g. PLT stub, Mach-O stub, PE thunks
    public bool IsRoot { get; set; }            // _start/main/entrypoint etc.
}

2.3 Edges

public sealed class CallEdge
{
    public int FromId { get; init; }        // FunctionNode.Id
    public int ToId { get; init; }          // FunctionNode.Id
    public EdgeKind Kind { get; init; }
    public string Evidence { get; init; }   // e.g. "ELF.R_X86_64_JUMP_SLOT", "PE.IAT", "MachO.indirectSym"
}

2.4 Graph container

public sealed class CallGraph
{
    public IReadOnlyDictionary<int, FunctionNode> Nodes { get; init; }
    public IReadOnlyDictionary<int, List<CallEdge>> OutEdges { get; init; }
    public IReadOnlyDictionary<int, List<CallEdge>> InEdges { get; init; }

    // Convenience: mappings
    public IReadOnlyDictionary<ModuleId, List<int>> FunctionsByModule { get; init; }
    public IReadOnlyDictionary<Purl, List<int>> FunctionsByPurl { get; init; }
}

2.5 Purl-level graph view

You don’t store a separate physical graph; you derive it on demand:

public sealed class PurlEdge
{
    public Purl From { get; init; }
    public Purl To { get; init; }
    public List<(int FromFnId, int ToFnId)> SupportingCalls { get; init; }
}

public sealed class PurlGraphView
{
    public IReadOnlyDictionary<Purl, HashSet<Purl>> Adjacent { get; init; }
    public IReadOnlyList<PurlEdge> Edges { get; init; }
}

3. Pipeline steps (end-to-end)

Step 0 – Inputs

Set of binaries (files) extracted from container image.
SBOM or other metadata that can map a file path (or hash) → purl.

Step 1 – Parse binaries → `BinaryModule` objects

You define a common in-memory model:

public sealed class BinaryModule
{
    public ModuleId Id { get; init; }
    public Purl Purl { get; init; }
    public BinaryFormat Format { get; init; }

    // Raw sections / segments
    public IReadOnlyList<SectionInfo> Sections { get; init; }

    // Symbols
    public IReadOnlyList<SymbolInfo> Symbols { get; init; }     // imports + exports + locals

    // Relocations / fixups
    public IReadOnlyList<RelocationInfo> Relocations { get; init; }

    // Import/export tables (PE)/dylib commands (Mach-O)/DT_NEEDED (ELF)
    public ImportInfo[] Imports { get; init; }
    public ExportInfo[] Exports { get; init; }
}

Implement format-specific loaders:

ElfLoader : IBinaryLoader
PeLoader : IBinaryLoader
MachOLoader : IBinaryLoader

Each loader uses your chosen C# parsers or ported code and fills BinaryModule.

Step 2 – Disassembly → basic blocks & candidate functions

For each BinaryModule:

Use appropriate decoder (iced for x86/x64; Disarm/ported ARMeilleure for AArch64).
Seed function starts:
- Exported functions
- Entry points (_start, main, AddressOfEntryPoint)
- Mach-O LC_FUNCTION_STARTS if available
Walk instructions to build basic blocks:
- Stop blocks at conditional/unconditional branches, calls, rets.
- Record for each call site:
  - Address of caller function
  - Operand type (immediate, memory with import table address, etc.)

The disassembler outputs a list of FunctionNode skeletons (no cross-module links yet) and a list of raw call sites:

public sealed class RawCallSite
{
    public int CallerFunctionId { get; init; }
    public ulong InstructionAddress { get; init; }
    public ulong? DirectTargetAddress { get; init; }     // e.g. CALL 0x401000
    public ulong? MemoryTargetAddress { get; init; }     // e.g. CALL [0x404000]
    public bool IsIndirect { get; init; }                // register-based etc.
}

Step 3 – Build function nodes

Using disassembly + symbol tables:

For each discovered function:
- Determine: address, name (if sym available), export/import flags.
- Map ModuleId → Purl using IPurlResolver.
Populate FunctionNode instances and index them by Id.

Step 4 – Construct intra-module edges

For each RawCallSite:

If DirectTargetAddress falls inside a known function’s address range in the same module, add IntraModuleDirect edge.

This gives you “normal” calls like foo() calling bar() in the same .so/.dll/.dylib.
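The "falls inside a known function's address range" check can be sketched as a binary search over sorted function starts. The type below is hypothetical, and it assumes each function has a known size; when sizes are missing you could treat the next function's start as the exclusive end.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical index: maps an address to the id of the function containing it.
public sealed class FunctionRangeIndex
{
    private readonly ulong[] _starts; // sorted ascending
    private readonly ulong[] _ends;   // exclusive end of each function
    private readonly int[] _ids;

    public FunctionRangeIndex(IEnumerable<(int Id, ulong Start, ulong Size)> functions)
    {
        var sorted = new List<(int Id, ulong Start, ulong Size)>(functions);
        sorted.Sort((a, b) => a.Start.CompareTo(b.Start));

        _starts = new ulong[sorted.Count];
        _ends = new ulong[sorted.Count];
        _ids = new int[sorted.Count];
        for (int i = 0; i < sorted.Count; i++)
        {
            _starts[i] = sorted[i].Start;
            _ends[i] = sorted[i].Start + sorted[i].Size;
            _ids[i] = sorted[i].Id;
        }
    }

    // Returns the id of the function whose [start, end) range contains
    // the address, or null when the address falls in a gap.
    public int? FindContaining(ulong address)
    {
        int idx = Array.BinarySearch(_starts, address);
        if (idx < 0)
            idx = ~idx - 1; // last start that is <= address
        if (idx < 0)
            return null;    // address is below the first function

        return address < _ends[idx] ? _ids[idx] : (int?)null;
    }
}
```

Build one index per module, then call `FindContaining(site.DirectTargetAddress.Value)` for each direct call site; a non-null result that differs from nothing else in the module is your IntraModuleDirect edge target.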

Step 5 – Construct inter-module edges (import calls)

This is where ELF/PE/Mach-O differ; details in section 4 below.

But the abstract logic is:

For each call site that targets an import slot or trampoline (a MemoryTargetAddress on an IAT slot / GOT entry / la_symbol_ptr, or a DirectTargetAddress inside a PLT / __stubs entry):
From the module’s import, relocation or fixup tables, determine:
- Which imported symbol it corresponds to (name, ordinal, etc.).
- Which imported module / dylib / DLL provides that symbol.
Find (or create) a FunctionNode representing that imported symbol in the provider module.
Add an ImportCall edge from caller function to the provider FunctionNode.

This is the key to turning low-level dynamic linking into purl-aware cross-module edges, because each FunctionNode is already stamped with a Purl.

Step 6 – Build adjacency structures

Once you have all FunctionNodes and CallEdges:

Build OutEdges and InEdges dictionaries keyed by FunctionNode.Id.
Build FunctionsByModule / FunctionsByPurl.
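A minimal sketch of the adjacency build, using a simplified stand-in record (`Edge`, hypothetical) in place of CallEdge so the example stays self-contained. It also de-duplicates repeated caller/callee pairs, which is an assumption on my part; drop that if you want one edge per call site.

```csharp
using System.Collections.Generic;

// Simplified stand-in for CallEdge (hypothetical); records give value equality.
public sealed record Edge(int FromId, int ToId);

public static class AdjacencyBuilder
{
    // Builds outgoing and incoming adjacency maps keyed by function id.
    public static (Dictionary<int, List<Edge>> Out, Dictionary<int, List<Edge>> In)
        Build(IEnumerable<Edge> edges)
    {
        var outEdges = new Dictionary<int, List<Edge>>();
        var inEdges = new Dictionary<int, List<Edge>>();
        var seen = new HashSet<Edge>();

        foreach (var edge in edges)
        {
            if (!seen.Add(edge))
                continue; // same caller/callee pair already recorded

            Append(outEdges, edge.FromId, edge);
            Append(inEdges, edge.ToId, edge);
        }

        return (outEdges, inEdges);
    }

    private static void Append(Dictionary<int, List<Edge>> map, int key, Edge edge)
    {
        if (!map.TryGetValue(key, out var list))
        {
            list = new List<Edge>();
            map[key] = list;
        }
        list.Add(edge);
    }
}
```

Nodes without outgoing calls simply have no key in `Out`, which is why the reachability loop should use TryGetValue rather than the indexer.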

4. Format-specific edge construction

This is the “how” for step 5, per binary format.

4.1 ELF

Goal: map call sites that go via PLT/GOT to an imported function in a DT_NEEDED library.

Algorithm:

Parse:
- .dynsym, .dynstr – dynamic symbol table
- .rela.plt / .rel.plt – relocation entries for PLT
- .got.plt / .got – PLT’s GOT
- DT_NEEDED entries – list of linked shared objects and their sonames
For each relocation of type R_*_JUMP_SLOT:
- It applies to an entry in the PLT GOT (.got.plt); the PLT stub's indirect jump reads its target from that GOT entry (code built with -fno-plt reads it directly in the CALL instruction).
- Relocation gives you:
  - Offset in GOT (r_offset)
  - Symbol index (r_info → symbol) → dynamic symbol (ElfSymbol)
  - Symbol name, type (FUNC), binding, etc.
Link GOT entries to call sites:
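- For each RawCallSite, take MemoryTargetAddress if it falls inside .got.plt (or .got); for the more common direct call into a PLT stub (DirectTargetAddress inside .plt), decode the stub's indirect jump to get the GOT entry address. Then: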
  - Find relocation whose r_offset equals that GOT entry offset.
  - That tells you which symbol is being called.
Determine provider module:
- From the symbol’s st_name and DT_NEEDED list, decide which shared object is expected to define it (an approximation is: first DT_NEEDED that provides that name).
- Map DT_NEEDED → ModuleId (you’ll have loaded these modules separately, or you can create “placeholder modules” if they’re not in the container image).
Create edges:
- Create/find FunctionNode for the imported symbol in provider module.
- Add CallEdge from caller function to imported function, EdgeKind = ImportCall, Evidence = "ELF.R_X86_64_JUMP_SLOT" (or arch-specific).

This yields edges like:

myapp:main → libssl.so.1.1:SSL_read
libfoo.so:foo → libc.so.6:malloc
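The relocation step can be sketched as follows for ELF64 RELA entries. The record and class names are hypothetical, and the sketch assumes you have already read .rela.plt and resolved .dynsym names into a list indexed by symbol number. The relocation type constants (7 for R_X86_64_JUMP_SLOT, 1026 for R_AARCH64_JUMP_SLOT) come from the respective psABI documents.

```csharp
using System.Collections.Generic;

// Hypothetical mirror of an Elf64_Rela entry.
public readonly record struct Rela(ulong Offset, ulong Info, long Addend);

public static class ElfPltResolver
{
    public const uint R_X86_64_JUMP_SLOT = 7;
    public const uint R_AARCH64_JUMP_SLOT = 1026;

    // Maps each GOT slot address (r_offset) to the imported symbol name.
    // ELF64 packs the symbol index into the high 32 bits of r_info and
    // the relocation type into the low 32 bits.
    public static Dictionary<ulong, string> MapGotSlotsToSymbols(
        IEnumerable<Rela> relaPlt,
        IReadOnlyList<string> dynsymNames,
        uint jumpSlotType)
    {
        var map = new Dictionary<ulong, string>();

        foreach (var rela in relaPlt)
        {
            uint type = (uint)(rela.Info & 0xFFFFFFFF);
            uint symIndex = (uint)(rela.Info >> 32);

            if (type != jumpSlotType)
                continue; // not a PLT slot relocation
            if (symIndex == 0 || symIndex >= (uint)dynsymNames.Count)
                continue; // index 0 is the reserved null symbol

            map[rela.Offset] = dynsymNames[(int)symIndex];
        }

        return map;
    }
}
```

With this map in hand, a call site whose resolved GOT entry address is a key in the dictionary becomes an ImportCall edge to the named symbol. 32-bit ELF uses a different r_info split (8-bit type), which this sketch does not cover.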

4.2 PE

Goal: map call sites that go via the Import Address Table (IAT) to imported functions in DLLs.

Algorithm:

Parse:
- IMAGE_IMPORT_DESCRIPTOR[] – each for a DLL name.
- Original thunk table (INT) – names/ordinals of imported symbols.
- IAT – where the loader writes function addresses at runtime.
For each import entry:
- Determine:
  - DLL name (Name)
  - Function name or ordinal (from INT)
  - IAT slot address (RVA)
Link IAT slots to call sites:
- For each RawCallSite with MemoryTargetAddress:
  - Check if this address equals the VA of an IAT slot.
  - If yes, the call site is effectively calling that imported function.
Determine provider module:
- The DLL name gives you a target module (e.g. KERNEL32.dll → ModuleId).
- Ensure that DLL is represented as a BinaryModule or a “placeholder” if not present in image.
Create edges:
- Create/find FunctionNode for imported function in provider module.
- Add CallEdge with EdgeKind = ImportCall and Evidence = "PE.IAT" (or "PE.DelayLoad" if using delay load descriptors).

Example:

myservice.exe:Start → SSPICLI.dll:AcquireCredentialsHandleW
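As a sketch of the thunk handling for PE32+ images: each 64-bit import lookup entry either carries an ordinal (bit 63 set, ordinal in the low 16 bits) or the RVA of a hint/name entry (low 31 bits), and IAT slot i of a descriptor lives at ImageBase + FirstThunk + 8*i. The type and method names below are hypothetical; PE32 images use 32-bit entries with bit 31 as the flag, which is not shown.

```csharp
// Hypothetical decoded form of one import lookup table entry.
public readonly record struct ImportThunk(bool ByOrdinal, ushort Ordinal, uint HintNameRva);

public static class PeImportThunks
{
    private const ulong OrdinalFlag64 = 0x8000000000000000UL;

    // Decodes one PE32+ import lookup table (INT) entry.
    public static ImportThunk Decode64(ulong entry)
    {
        if ((entry & OrdinalFlag64) != 0)
            return new ImportThunk(true, (ushort)(entry & 0xFFFF), 0);

        return new ImportThunk(false, 0, (uint)(entry & 0x7FFFFFFF));
    }

    // Virtual address of the index-th IAT slot for one import descriptor.
    public static ulong IatSlotVa(ulong imageBase, uint firstThunkRva, int index)
    {
        return imageBase + firstThunkRva + (ulong)index * 8;
    }
}
```

Walk the INT until you hit a zero entry, decode each thunk, and store `IatSlotVa(...)` as the dictionary key; a RawCallSite whose MemoryTargetAddress matches a key then resolves to (DLL name, function name or ordinal).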

4.3 Mach-O

Goal: map stub calls via __TEXT,__stubs / __DATA,__la_symbol_ptr (and / or chained fixups) to symbols in dependent dylibs.

Algorithm (classic dyld opcodes first; the chained fixups extension follows):

Parse:
- Load commands:
  - LC_SYMTAB, LC_DYSYMTAB
  - LC_LOAD_DYLIB (to know dependent dylibs)
  - LC_FUNCTION_STARTS (for seeding functions)
  - LC_DYLD_INFO (rebase/bind/lazy bind)
- __TEXT,__stubs – stub code
- __DATA,__la_symbol_ptr (or __DATA_CONST,__la_symbol_ptr) – lazy pointer table
- Indirect symbol table – maps slot indices to symbol table indices
Stub → la_symbol_ptr mapping:
- Stubs are small functions (usually a few instructions) that indirect through the corresponding la_symbol_ptr entry.
- For each stub function:
  - Determine which la_symbol_ptr entry it uses (based on stub index and linking metadata).
  - From the indirect symbol table, find which dynamic symbol that la_symbol_ptr entry corresponds to.
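    - This gives you the symbol name; in two-level-namespace binaries the dylib ordinal (the 1-based index into the LC_LOAD_DYLIB-family commands) is encoded in the high 8 bits of that symbol's n_desc.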
Link stub call sites:
- In disassembly, treat calls to these stub functions as import calls.
- For each call instruction CALL stub_function:
  - RawCallSite.DirectTargetAddress lies inside __TEXT,__stubs.
  - Resolve stub → la_symbol_ptr → symbol → dylib.
Determine provider module:
- From dylib ordinal and load commands, get the path / install name of dylib (libssl.1.1.dylib, etc.).
- Map that to a ModuleId in your module set.
Create edges:
- Create/find imported FunctionNode in provider module.
- Add CallEdge from caller to that function with EdgeKind = ImportCall, Evidence = "MachO.IndirectSymbol".
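A minimal sketch of the stub lookup, under the assumption that you have the __stubs section header fields (addr, size, reserved1 = first indirect symbol index, reserved2 = stub size in bytes) and the indirect symbol table as an array of symbol table indices. The class name is hypothetical. It reads the indirect symbol entry that belongs to the stub itself, which names the same symbol as the matching la_symbol_ptr entry.

```csharp
using System.Collections.Generic;

// Hypothetical helper for classic (non-chained) Mach-O stub resolution.
public static class MachOStubLookup
{
    public const uint INDIRECT_SYMBOL_LOCAL = 0x80000000;
    public const uint INDIRECT_SYMBOL_ABS = 0x40000000;

    // Returns the symbol table index for a call target inside __stubs,
    // or null when the target is outside the section or not an import.
    public static uint? ResolveStubSymbolIndex(
        ulong target,
        ulong stubsAddr,
        ulong stubsSize,
        uint stubSize,        // section.reserved2
        uint indirectStart,   // section.reserved1
        IReadOnlyList<uint> indirectSymbols)
    {
        if (stubSize == 0 || target < stubsAddr || target >= stubsAddr + stubsSize)
            return null;

        ulong stubIndex = (target - stubsAddr) / stubSize;
        ulong slot = indirectStart + stubIndex;
        if (slot >= (ulong)indirectSymbols.Count)
            return null;

        uint symbolIndex = indirectSymbols[(int)slot];
        if ((symbolIndex & (INDIRECT_SYMBOL_LOCAL | INDIRECT_SYMBOL_ABS)) != 0)
            return null; // local or absolute entry, not an imported symbol

        return symbolIndex;
    }

    // Two-level namespace: the library ordinal sits in the high 8 bits of n_desc.
    public static int LibraryOrdinal(ushort nDesc)
    {
        return (nDesc >> 8) & 0xFF;
    }
}
```

Ordinals 1..N index the dylib load commands in file order; special values such as 0 (self) and the dynamic-lookup marker need separate handling before you map to a ModuleId.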

For chained fixups (LC_DYLD_CHAINED_FIXUPS), you’ll compute a similar mapping but walking chain entries instead of traditional lazy/weak binds. The key is still:

Map a stub or function to a fixup entry.
From fixup, determine the symbol and dylib.
Then connect call-site → imported function.

5. Reachability queries

Once the graph is built, reachability is “just graph algorithms” + mapping back to purls.

5.1 Roots

Decide which functions are your roots:

Binary entrypoints:
- ELF: _start, main, constructors (.init_array)
- PE: AddressOfEntryPoint, registered service entrypoints
- Mach-O: _main, constructors
Optionally, any exported API function that a container orchestrator or plugin system will call.

Mark them as FunctionNode.IsRoot = true and create synthetic edges from a special root node if you want:

var syntheticRoot = new FunctionNode
{
    Id = 0,
    Name = "<root>",
    IsRoot = true,
    // Module, Purl can be special markers
};

foreach (var fn in allFunctions.Where(f => f.IsRoot))
{
    edges.Add(new CallEdge
    {
        FromId = syntheticRoot.Id,
        ToId = fn.Id,
        Kind = EdgeKind.SyntheticRoot,
        Evidence = "Root"
    });
}

5.2 Reachability algorithm (function-level)

Use BFS/DFS from the root node(s):

public sealed class ReachabilityResult
{
    public HashSet<int> ReachableFunctions { get; init; } = new(); // init: assigned via object initializer below
}

public ReachabilityResult ComputeReachableFunctions(CallGraph graph, IEnumerable<int> rootIds)
{
    var visited = new HashSet<int>();
    var stack = new Stack<int>();

    foreach (var root in rootIds)
    {
        if (visited.Add(root))
            stack.Push(root);
    }

    while (stack.Count > 0)
    {
        var current = stack.Pop();

        if (!graph.OutEdges.TryGetValue(current, out var edges))
            continue;

        foreach (var edge in edges)
        {
            if (visited.Add(edge.ToId))
                stack.Push(edge.ToId);
        }
    }

    return new ReachabilityResult { ReachableFunctions = visited };
}

5.3 Project reachability to modules and purls

Given ReachableFunctions:

public sealed class ReachabilityProjection
{
    public HashSet<ModuleId> ReachableModules { get; } = new();
    public HashSet<Purl> ReachablePurls { get; } = new();
}

public ReachabilityProjection ProjectToModulesAndPurls(CallGraph graph, ReachabilityResult result)
{
    var projection = new ReachabilityProjection();

    foreach (var fnId in result.ReachableFunctions)
    {
        if (!graph.Nodes.TryGetValue(fnId, out var fn))
            continue;

        projection.ReachableModules.Add(fn.Module);
        projection.ReachablePurls.Add(fn.Purl);
    }

    return projection;
}

Now you can answer questions like:

“Is any code from purl pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1 reachable from the container entrypoint?”
“Which purls are reachable at all?”

5.4 Vulnerability reachability

Assume you’ve mapped each vulnerability to:

Purl (where it lives)
AffectedFunctionNames (symbols; optionally demangled)

You can implement:

public sealed class VulnerabilitySink
{
    public string VulnerabilityId { get; init; } // CVE-...
    public Purl Purl { get; init; }
    public string FunctionName { get; init; }    // symbol name or demangled
}

Resolution algorithm:

For each VulnerabilitySink, find all FunctionNode with:
- node.Purl == sink.Purl and
- node.Name or node.DemangledName matches sink.FunctionName.
For each such node, check ReachableFunctions.Contains(node.Id).
Build a Finding object:

public sealed class VulnerabilityFinding
{
    public string VulnerabilityId { get; init; }
    public Purl Purl { get; init; }
    public bool IsReachable { get; init; }
    public List<int> SinkFunctionIds { get; init; } = new();
}

Plus, if you want path evidence, you run a shortest-path search (BFS predecessor map) from root to sink and store the sequence of FunctionNode.Ids.
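The path-evidence step could look like the sketch below: a BFS that records each node's predecessor and then walks back from the sink. It uses a plain successor map of function ids (an assumption made to keep the example self-contained; in your code you would read targets from CallGraph.OutEdges), and the class name is hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical helper: shortest call path by edge count.
public static class PathFinder
{
    // Returns the function ids from root to sink (inclusive),
    // or null when the sink is not reachable from the root.
    public static List<int>? ShortestPath(
        IReadOnlyDictionary<int, List<int>> successors, int root, int sink)
    {
        // predecessor[n] = node from which n was first discovered
        var predecessor = new Dictionary<int, int> { [root] = root };
        var queue = new Queue<int>();
        queue.Enqueue(root);

        while (queue.Count > 0)
        {
            int current = queue.Dequeue();
            if (current == sink)
                break;

            if (!successors.TryGetValue(current, out var next))
                continue;

            foreach (int n in next)
            {
                if (predecessor.TryAdd(n, current))
                    queue.Enqueue(n);
            }
        }

        if (!predecessor.ContainsKey(sink))
            return null;

        // Walk back from sink to root, then reverse.
        var path = new List<int>();
        for (int at = sink; ; at = predecessor[at])
        {
            path.Add(at);
            if (at == root)
                break;
        }
        path.Reverse();
        return path;
    }
}
```

Because BFS explores by increasing depth, the first time the sink is discovered gives a path with the fewest calls, which tends to be the most readable evidence in a report.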

6. Purl edges (derived graph)

For reporting and analytics, it’s useful to produce a purl-level dependency graph.

Given CallGraph:

public PurlGraphView BuildPurlGraph(CallGraph graph)
{
    var edgesByPair = new Dictionary<(Purl From, Purl To), PurlEdge>();

    foreach (var kv in graph.OutEdges)
    {
        var fromFn = graph.Nodes[kv.Key];

        foreach (var edge in kv.Value)
        {
            var toFn = graph.Nodes[edge.ToId];

            if (fromFn.Purl.Equals(toFn.Purl))
                continue; // intra-purl, skip if you only care about inter-purl

            var key = (fromFn.Purl, toFn.Purl);
            if (!edgesByPair.TryGetValue(key, out var pe))
            {
                pe = new PurlEdge
                {
                    From = fromFn.Purl,
                    To = toFn.Purl,
                    SupportingCalls = new List<(int, int)>()
                };
                edgesByPair[key] = pe;
            }

            pe.SupportingCalls.Add((fromFn.Id, toFn.Id));
        }
    }

    var adj = new Dictionary<Purl, HashSet<Purl>>();

    foreach (var kv in edgesByPair)
    {
        var (from, to) = kv.Key;
        if (!adj.TryGetValue(from, out var list))
        {
            list = new HashSet<Purl>();
            adj[from] = list;
        }
        list.Add(to);
    }

    return new PurlGraphView
    {
        Adjacent = adj,
        Edges = edgesByPair.Values.ToList()
    };
}

This gives you:

A coarse view of runtime dependencies between purls (“Purl A calls into Purl B”).
Enough context to emit purl-level VEX or to reason about trust at package granularity.

7. JSON output and SBOM integration

7.1 JSON shape (high level)

You can emit a composite document:

{
  "image": "registry.example.com/app@sha256:...",
  "modules": [
    {
      "moduleId": { "path": "/usr/lib/libssl.so.1.1", "format": "Elf" },
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
      "arch": "x86_64"
    }
  ],
  "functions": [
    {
      "id": 42,
      "name": "SSL_do_handshake",
      "demangledName": null,
      "module": { "path": "/usr/lib/libssl.so.1.1", "format": "Elf" },
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
      "address": "0x401020",
      "exported": true
    }
  ],
  "edges": [
    {
      "from": 10,
      "to": 42,
      "kind": "ImportCall",
      "evidence": "ELF.R_X86_64_JUMP_SLOT"
    }
  ],
  "reachability": {
    "roots": [1],
    "reachableFunctions": [1,10,42]
  },
  "purlGraph": {
    "edges": [
      {
        "from": "pkg:generic/myapp@1.0.0",
        "to": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
        "supportingCalls": [[10,42]]
      }
    ]
  },
  "vulnerabilities": [
    {
      "id": "CVE-2024-XXXX",
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
      "sinkFunctions": [42],
      "reachable": true,
      "paths": [
        [1, 10, 42]
      ]
    }
  ]
}

7.2 Purl resolution

Implement an IPurlResolver interface:

public interface IPurlResolver
{
    Purl ResolveForModule(string filePath, byte[] contentHash);
}

Possible implementations:

SbomPurlResolver – given a CycloneDX/SPDX SBOM for the image, match by path or checksum.
LinuxPackagePurlResolver – read /var/lib/dpkg/status / rpm DB in the filesystem.
GenericPurlResolver – fallback: pkg:generic/<hash>.

You call the resolver in your loaders so that every BinaryModule has a purl and thus every FunctionNode has a purl.
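A minimal sketch of one resolver with the generic fallback built in. The class name and the path-to-purl dictionary are hypothetical stand-ins for whatever your SBOM reader produces; the Purl record and the interface are repeated only to keep the example self-contained, and the fallback assumes the caller passes a precomputed content hash.

```csharp
using System;
using System.Collections.Generic;

public readonly record struct Purl(string Value);

public interface IPurlResolver
{
    Purl ResolveForModule(string filePath, byte[] contentHash);
}

// Hypothetical resolver: looks up the file path in an SBOM-derived map and
// falls back to pkg:generic/<hash> when the path is unknown.
public sealed class PathMapPurlResolver : IPurlResolver
{
    private readonly IReadOnlyDictionary<string, string> _purlByPath;

    public PathMapPurlResolver(IReadOnlyDictionary<string, string> purlByPath)
    {
        _purlByPath = purlByPath;
    }

    public Purl ResolveForModule(string filePath, byte[] contentHash)
    {
        if (_purlByPath.TryGetValue(filePath, out var purl))
            return new Purl(purl);

        // Fallback: lowercase hex of the supplied hash as the generic name.
        string hex = Convert.ToHexString(contentHash).ToLowerInvariant();
        return new Purl("pkg:generic/" + hex);
    }
}
```

If you later add a checksum-keyed lookup or a dpkg/rpm-backed resolver, you can chain them in front of the fallback without changing the interface.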

8. Concrete implementation tasks for your team

Data model & interfaces
- Implement ModuleId, FunctionNode, CallEdge, CallGraph.
- Define RawCallSite, BinaryModule, and IPurlResolver.
Loaders
- ElfLoader: fill symbols, dynamic relocations (PLT), DT_NEEDED, etc.
- PeLoader: import descriptors, IAT, delay-load.
- MachOLoader: load commands, stubs, la_symbol_ptr, indirect symbols / chained fixups.
Disassembly
- X86Disassembler (iced) and Arm64Disassembler (Disarm or port).
- Function detection and RawCallSite extraction.
CallGraphBuilder
- Build intra-module edges from direct calls.
- Build inter-module edges using the format-specific rules above.
- Construct final CallGraph with adjacency maps and purl mappings.
Reachability
- Implement BFS/DFS from root functions.
- Projection to modules + purls.
- Vulnerability sink resolution & path reconstruction.
Export
- JSON serializer for the schema above.
- Optional: purl-level summary / VEX generator.

If you want, as a next step I can produce a more concrete design for CallGraphBuilder (including per-format helper classes with method signatures) or a C# skeleton for ElfImportResolver, PeImportResolver, and MachOStubResolver that plugs directly into this plan.
