feat: Add new provenance and crypto registry documentation

- Introduced attestation inventory and subject-rekor mapping files for tracking Docker packages.
- Added a comprehensive crypto registry decision document outlining defaults and required follow-ups.
- Created an offline feeds manifest for bundling air-gap resources.
- Implemented a script to generate and update binary manifests for curated binaries.
- Added a verification script to ensure binary artefacts are located in approved directories.
- Defined new schemas for AdvisoryEvidenceBundle, OrchestratorEnvelope, ScannerReportReadyPayload, and ScannerScanCompletedPayload.
- Established project files for StellaOps.Orchestrator.Schemas and StellaOps.PolicyAuthoritySignals.Contracts.
- Updated vendor manifest to track pinned binaries for integrity.
2025-11-18 23:47:13 +02:00
parent d3ecd7f8e6
commit e91da22836
44 changed files with 6793 additions and 99 deletions


@@ -0,0 +1,989 @@
Vlad, here's a concrete, **pure-C#** blueprint to build a multi-format binary analyzer (Mach-O, ELF, PE) that produces **call graphs + reachability**, with **no external tools**. Where needed, I point to permissively licensed code you can **port** (copy) from other ecosystems.
---
## 0) Targets & non-negotiables
* **Formats:** Mach-O (inc. LC_DYLD_INFO / LC_DYLD_CHAINED_FIXUPS), ELF (SysV gABI), PE/COFF
* **Architectures:** x86-64 (and x86), AArch64 (ARM64)
* **Outputs:** JSON with **purls** per module + function-level call graph & reachability
* **No tool reuse:** Only pure C# libraries or code **ported** from permissive sources
---
## 1) Parsing the containers (pure C#)
**Pick one C# reader per format, keeping licenses permissive:**
* **ELF & Mach-O:** `ELFSharp` (pure managed C#; ELF + Mach-O reading). MIT/X11 license. ([GitHub][1])
* **ELF & PE (+ DWARF v4):** `LibObjectFile` (C#, BSD-2). Good ELF relocations (i386, x86_64, ARM, AArch64), PE directories, DWARF sections. Use it as your **common object model** for ELF+PE, then add a Mach-O adapter. ([GitHub][2])
* **PE (optional alternative):** `PeNet` (pure C#, broad PE directories, imp/exp, TLS, certs). MIT. Useful if you want a second implementation for cross-checks. ([GitHub][3])
> Why two libs? `LibObjectFile` gives you DWARF and clean models for ELF/PE; `ELFSharp` covers Mach-O today (and ELF as a fallback). You control the code paths.
**Spec references you'll implement against** (for correctness of your readers & link-time semantics):
* **ELF (gABI, AMD64 supplement):** dynamic section, PLT/GOT, `R_X86_64_JUMP_SLOT` semantics (eager vs lazy). ([refspecs.linuxbase.org][4])
* **PE/COFF:** imports/exports/IAT, delay-load, TLS. ([Microsoft Learn][5])
* **Mach-O:** file layout, load commands (`LC_SYMTAB`, `LC_DYSYMTAB`, `LC_FUNCTION_STARTS`, `LC_DYLD_INFO(_ONLY)`), and the modern `LC_DYLD_CHAINED_FIXUPS`. ([leopard-adc.pepas.com][6])
---
## 2) Mach-O: what you must **port** (byte-for-byte compatible)
Apple moved from traditional dyld bind opcodes to **chained fixups** on macOS 12/iOS 15+; you need both:
* **Dyld bind opcodes** (`LC_DYLD_INFO(_ONLY)`): parse the BIND/LAZY_BIND streams (tuples of `<seg,off,type,ordinal,symbol,addend>`). Port minimal logic from **LLVM** or **LIEF** (both Apache-2.0-compatible) into C#. ([LIEF][7])
* **Chained fixups** (`LC_DYLD_CHAINED_FIXUPS`): port `dyld_chained_fixups_header` structs & chain walking from LLVM's `MachO.h` or Apple's dyld headers. This restores imports/rebases without running dyld. ([LLVM][8])
* **Function discovery hint:** read `LC_FUNCTION_STARTS` (ULEB128 deltas) to seed function boundaries—very helpful on stripped binaries. ([Stack Overflow][9])
* **Stubs mapping:** resolve `__TEXT,__stubs` → `__DATA,__la_symbol_ptr` via the **indirect symbol table**; conceptually identical to ELF's PLT/GOT. ([MaskRay][10])
> If you prefer an in-C# base for Mach-O manipulation, **Melanzana.MachO** exists (MIT) and has been used by .NET folks for Mach-O/Code Signing/obj writing; you can mine its approach for load-command modeling. ([GitHub][11])
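To make the `LC_FUNCTION_STARTS` bullet concrete, here is a minimal sketch of decoding its payload. It assumes you have already sliced the payload bytes out of `__LINKEDIT` and know the `__TEXT` segment's `vmaddr`; the class and method names (`FunctionStarts`, `Decode`) are hypothetical, not part of any library mentioned above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: decodes the ULEB128 delta stream of LC_FUNCTION_STARTS.
public static class FunctionStarts
{
    // Reads one unsigned LEB128 value from data, advancing pos past it.
    public static ulong ReadUleb128(ReadOnlySpan<byte> data, ref int pos)
    {
        ulong value = 0;
        int shift = 0;
        while (true)
        {
            if (pos >= data.Length)
                throw new FormatException("Truncated ULEB128 value.");
            byte b = data[pos++];
            if (shift < 64)
                value |= (ulong)(b & 0x7F) << shift; // low 7 bits carry payload
            shift += 7;
            if ((b & 0x80) == 0)                     // high bit clear: last byte
                return value;
        }
    }

    // Each value is a delta from the previous function start; the first is
    // relative to the __TEXT segment base. A zero delta ends the list.
    public static List<ulong> Decode(ReadOnlySpan<byte> payload, ulong textVmAddr)
    {
        var starts = new List<ulong>();
        ulong address = textVmAddr;
        int pos = 0;
        while (pos < payload.Length)
        {
            ulong delta = ReadUleb128(payload, ref pos);
            if (delta == 0)
                break; // terminator (alignment padding may follow)
            address += delta;
            starts.Add(address);
        }
        return starts;
    }
}
```

The resulting addresses are only seeds: function sizes still come from the next start or from disassembly.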
---
## 3) Disassembly (pure C#, multi-arch)
* **x86/x64:** `iced` (C# decoder/disassembler/encoder; MIT; fast & complete). ([GitHub][12])
* **AArch64/ARM64:** two options that keep you pure-C#:
  * **Disarm** (pure C# ARM64 disassembler; MIT). Good starting point to decode & get branch/call kinds. ([GitHub][13])
  * **Port from Ryujinx ARMeilleure** (ARMv8 decoder/JIT in C#, MIT). You can lift only the **decoder** pieces you need. ([Gitee][14])
* **x86 fallback:** `SharpDisasm` (udis86 port in C#; BSD-2). Older than iced; keep as a reference. ([GitHub][15])
---
## 4) Call graph recovery (static)
**4.1 Function seeds**
* From symbols (`.dynsym`/`LC_SYMTAB`/PE exports)
* From **LC_FUNCTION_STARTS** (Mach-O) for stripped code ([Stack Overflow][9])
* From entrypoints (`_start`/`main` or PE AddressOfEntryPoint)
* From exception/unwind tables & DWARF (when present)—`LibObjectFile` already models DWARF v4. ([GitHub][2])
**4.2 CFG & interprocedural calls**
* **Decode** with iced/Disarm from each seed; form **basic blocks** by following control-flow until terminators (ret/jmp/call).
* **Direct calls:** immediate targets become edges (PC-relative fixups where needed).
* **Imported calls:**
  * **ELF:** calls to PLT stubs → resolve via `.rela.plt` & `R_*_JUMP_SLOT` to symbol names (link-time target). ([cs61.seas.harvard.edu][16])
  * **PE:** calls through the **IAT** → resolve via `IMAGE_IMPORT_DESCRIPTOR` / thunk tables. ([Microsoft Learn][5])
  * **Mach-O:** calls to `__stubs` use **indirect symbol table** + `__la_symbol_ptr` (or chained fixups) → map to dylib/symbol. ([reinterpretcast.com][17])
* **Indirect calls within the binary:** heuristics only (function pointer tables, vtables, small constant pools). Keep them labeled **“indirect-unresolved”** unless a heuristic yields a concrete target.
**4.3 Cross-binary graph**
* Build module-level edges by simulating the platform's loader:
  * **ELF:** honor `DT_NEEDED`, `DT_RPATH/RUNPATH`, versioning (`.gnu.version*`) to pick the definer of an imported symbol. gABI rules apply. ([refspecs.linuxbase.org][4])
  * **PE:** pick DLL from the import descriptors. ([Microsoft Learn][5])
  * **Mach-O:** `LC_LOAD_DYLIB` + dyld binding / chained fixups determine the provider image. ([LIEF][7])
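As a rough illustration of the ELF case, the sketch below picks the provider of an imported symbol by walking `DT_NEEDED` dependencies breadth-first from the importing binary, which approximates the dynamic linker's default lookup order. It deliberately ignores symbol versioning, `LD_PRELOAD`, and `DT_SYMBOLIC`; the `ElfProviderResolver` name and the dictionary-based inputs are assumptions made for the example.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper: approximates ELF symbol lookup order over DT_NEEDED.
public static class ElfProviderResolver
{
    // needed:  module name -> its DT_NEEDED entries, in file order
    // exports: module name -> names of the dynamic symbols it defines
    // Returns the first module in breadth-first order that defines the symbol.
    public static string? Resolve(
        string root,
        string symbol,
        IReadOnlyDictionary<string, string[]> needed,
        IReadOnlyDictionary<string, HashSet<string>> exports)
    {
        var seen = new HashSet<string> { root };
        var queue = new Queue<string>();
        queue.Enqueue(root);

        while (queue.Count > 0)
        {
            string module = queue.Dequeue();
            if (exports.TryGetValue(module, out var defined) && defined.Contains(symbol))
                return module;

            if (!needed.TryGetValue(module, out var deps))
                continue; // not loaded in our module set: treat as a leaf

            foreach (string dep in deps)
            {
                if (seen.Add(dep))
                    queue.Enqueue(dep);
            }
        }
        return null; // unresolved: caller may create a placeholder module
    }
}
```

A `null` result is the natural place to fall back to a placeholder module rather than dropping the edge.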
---
## 5) Reachability analysis
Represent the **call graph** using a .NET graph lib (or a simple adjacency set). I suggest:
* **QuikGraph** (successor of QuickGraph; MIT) for algorithms (DFS/BFS, SCCs). Use it to compute reachability from chosen roots (entrypoint(s), exported APIs, or “sinks”). ([GitHub][18])
You can visualize with **MSAGL** (MIT) when you need layouts, but your core output is JSON. ([GitHub][19])
---
## 6) Symbol demangling (nice-to-have, pure C#)
* **Itanium (ELF/Mach-O):** Either port LLVM's Itanium demangler or use a C# lib like **CxxDemangler** (a C# rewrite of `cpp_demangle`). ([LLVM][20])
* **MSVC (PE):** Port LLVM's `MicrosoftDemangle.cpp` (Apache-2.0 with LLVM exception) to C#. ([LLVM][21])
---
## 7) JSON output (with purls)
Use a stable schema (example) to feed SBOM/vuln matching downstream:
```json
{
  "modules": [
    {
      "purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64",
      "format": "ELF",
      "arch": "x86_64",
      "path": "/usr/lib/x86_64-linux-gnu/libssl.so.1.1",
      "exports": ["SSL_read", "SSL_write"],
      "imports": ["BIO_new", "EVP_CipherInit_ex"],
      "functions": [{"name":"SSL_do_handshake","va":"0x401020","size":512,"demangled": "..."}]
    }
  ],
  "graph": {
    "nodes": [
      {"id":"bin:main@0x401000","module": "pkg:generic/myapp@1.0.0"},
      {"id":"lib:SSL_read","module":"pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1?arch=amd64"}
    ],
    "edges": [
      {"src":"bin:main@0x401000","dst":"lib:SSL_read","kind":"import_call","evidence":"ELF.R_X86_64_JUMP_SLOT"}
    ]
  },
  "reachability": {
    "roots": ["bin:_start","bin:main@0x401000"],
    "reachable": ["lib:SSL_read", "lib:SSL_write"],
    "unresolved_indirect_calls": [
      {"site":"0x402ABC","reason":"register-indirect"}
    ]
  }
}
```
---
## 8) Minimal C# module layout (sketch)
```
Stella.Analysis.Core/
  BinaryModule.cs          // common model (sections, symbols, relocs, imports/exports)
  Loader/
    PeLoader.cs            // wrap LibObjectFile (or PeNet) to BinaryModule
    ElfLoader.cs           // wrap LibObjectFile to BinaryModule
    MachOLoader.cs         // wrap ELFSharp + your ported Dyld/ChainedFixups
  Disasm/
    X86Disassembler.cs     // iced bridge: bytes -> instructions
    Arm64Disassembler.cs   // Disarm (or ARMeilleure port) bridge
  Graph/
    CallGraphBuilder.cs    // builds CFG per function + inter-procedural edges
    Reachability.cs        // BFS/DFS over QuikGraph
  Demangle/
    ItaniumDemangler.cs    // port or wrap CxxDemangler
    MicrosoftDemangler.cs  // port from LLVM
  Export/
    JsonWriter.cs          // writes schema above
```
---
## 9) Implementation notes (where issues usually bite)
* **Mach-O moderns:** Implement both dyld opcode **and** chained fixups; many macOS 12+/iOS 15+ binaries only have chained fixups. ([emergetools.com][22])
* **Stubs vs real targets (Mach-O):** map `__stubs` → `__la_symbol_ptr` via **indirect symbols** to the true imported symbol (or its post-fixup target). ([reinterpretcast.com][17])
* **ELF PLT/GOT:** treat `.plt` entries as **call trampolines**; ultimate edge should point to the symbol (library) that satisfies `DT_NEEDED` + version. ([refspecs.linuxbase.org][4])
* **PE delay-load:** don't forget `IMAGE_DELAYLOAD_DESCRIPTOR` for delayed IATs. ([Microsoft Learn][5])
* **Function discovery:** use `LC_FUNCTION_STARTS` when symbols are stripped; it's a cheap way to seed analysis. ([Stack Overflow][9])
* **Name clarity:** demangle Itanium/MSVC so downstream vuln rules can match consistently. ([LLVM][20])
---
## 10) What to **copy/port** verbatim (safe licenses)
* **Dyld bind & exports trie logic:** from **LLVM** or **LIEF** Mach-O (Apache-2.0). Great for getting the exact opcode semantics right. ([LIEF][7])
* **Chained fixups structs/walkers:** from **LLVM MachO.h** or Apple dyld headers (permissive headers). ([LLVM][8])
* **Itanium/MS demanglers:** LLVM demangler sources are standalone; easy to translate to C#. ([LLVM][23])
* **ARM64 decoder:** if Disarm gaps hurt, lift just the **decoder** pieces from **Ryujinx ARMeilleure** (MIT). ([Gitee][14])
*(Avoid GPL'd parsers like binutils/BFD; they will contaminate your codebase's licensing.)*
---
## 11) End-to-end pipeline (per container image)
1. **Enumerate binaries** in the container FS.
2. **Parse** each with the appropriate loader → `BinaryModule` (+ imports/exports/symbols/relocs).
3. **Simulate linking** per platform to resolve imported functions to provider libraries. ([refspecs.linuxbase.org][4])
4. **Disassemble** functions (iced/Disarm) → CFGs → **call edges** (direct, PLT/IAT/stub, indirect).
5. **Assemble call graph** across modules; normalize names via demangling.
6. **Reachability**: given roots (entry or user-specified) compute reachable set; emit JSON with **purls** (from your SBOM/package resolver).
7. **(Optional)** dump GraphViz / MSAGL views for debugging. ([GitHub][19])
---
## 12) Quick heuristics for vulnerability triage
* **Sink maps**: flag edges to high-risk APIs (`strcpy`, `gets`, legacy SSL ciphers) even without CVE versioning.
* **DWARF line info** (when present): attach file:line to nodes for developer action. `LibObjectFile` gives you DWARF v4 reads. ([GitHub][2])
---
## 13) Test corpora
* **ELF:** glibc/openssl/libpng from distro repos; validate `R_*_JUMP_SLOT` handling and PLT edges. ([cs61.seas.harvard.edu][16])
* **PE:** system DLLs (Kernel32, Advapi32) and a small MSVC console app; validate IAT & delay-load. ([Microsoft Learn][5])
* **Mach-O:** Xcode-built binaries across macOS 11 & 12+ to cover both dyld opcode and chained fixups paths; verify `LC_FUNCTION_STARTS` improves discovery. ([Stack Overflow][9])
---
## 14) Deliverables you can start coding now
* **MachOLoader.cs**
  * Parse headers + load commands (ELFSharp).
  * Implement `DyldInfoParser` (port from LLVM/LIEF) and `ChainedFixupsParser` (port structs & walkers). ([LIEF][7])
* **X86Disassembler.cs / Arm64Disassembler.cs** (iced / Disarm bridges). ([GitHub][12])
* **CallGraphBuilder.cs** (recursive descent + linear sweep fallback; PLT/IAT/stub resolution).
* **Reachability.cs** (QuikGraph BFS/DFS). ([GitHub][18])
* **JsonWriter.cs** (schema above with purls).
---
### References (core, load-bearing)
* **ELFSharp** (ELF + Mach-O pure C#). ([GitHub][1])
* **LibObjectFile** (ELF/PE/DWARF C#, BSD-2). ([GitHub][2])
* **iced** (x86/x64 disasm, C#, MIT). ([GitHub][12])
* **Disarm** (ARM64 disasm, C#, MIT). ([GitHub][13])
* **Ryujinx (ARMeilleure)** (ARMv8 decode/JIT in C#, MIT). ([Gitee][14])
* **ELF gABI & AMD64 supplement** (PLT/GOT, relocations). ([refspecs.linuxbase.org][4])
* **PE/COFF** (imports/exports/IAT). ([Microsoft Learn][5])
* **Mach-O docs** (load commands; LC_FUNCTION_STARTS; dyld bindings; chained fixups). ([Apple Developer][24])
---
If you want, I can draft **`MachOLoader` + `DyldInfoParser`** in C# next, including chained-fixups structs (ported from LLVM's headers) and an **iced**-based call-edge walker for x86-64.
[1]: https://github.com/konrad-kruczynski/elfsharp "GitHub - konrad-kruczynski/elfsharp: Pure managed C# library for reading ELF, UImage, Mach-O binaries."
[2]: https://github.com/xoofx/LibObjectFile "GitHub - xoofx/LibObjectFile: LibObjectFile is a .NET library to read, manipulate and write linker and executable object files (e.g ELF, PE, DWARF, ar...)"
[3]: https://github.com/secana/PeNet?utm_source=chatgpt.com "secana/PeNet: Portable Executable (PE) library written in . ..."
[4]: https://refspecs.linuxbase.org/elf/gabi4%2B/contents.html?utm_source=chatgpt.com "System V Application Binary Interface - DRAFT - 24 April 2001"
[5]: https://learn.microsoft.com/en-us/windows/win32/debug/pe-format?utm_source=chatgpt.com "PE Format - Win32 apps"
[6]: https://leopard-adc.pepas.com/documentation/DeveloperTools/Conceptual/MachOTopics/0-Introduction/introduction.html?utm_source=chatgpt.com "Mach-O Programming Topics: Introduction"
[7]: https://lief.re/doc/stable/doxygen/classLIEF_1_1MachO_1_1DyldInfo.html?utm_source=chatgpt.com "MachO::DyldInfo Class Reference - LIEF"
[8]: https://llvm.org/doxygen/structllvm_1_1MachO_1_1dyld__chained__fixups__header.html?utm_source=chatgpt.com "MachO::dyld_chained_fixups_header Struct Reference"
[9]: https://stackoverflow.com/questions/9602438/mach-o-file-lc-function-starts-load-command?utm_source=chatgpt.com "Mach-O file LC_FUNCTION_STARTS load command"
[10]: https://maskray.me/blog/2021-09-19-all-about-procedure-linkage-table?utm_source=chatgpt.com "All about Procedure Linkage Table"
[11]: https://github.com/dotnet/runtime/issues/77178 "Discussion: ObjWriter in C# · Issue #77178 · dotnet/runtime · GitHub"
[12]: https://github.com/icedland/iced?utm_source=chatgpt.com "icedland/iced: Blazing fast and correct x86/x64 ..."
[13]: https://github.com/SamboyCoding/Disarm?utm_source=chatgpt.com "SamboyCoding/Disarm: Fast, pure-C# ARM64 Disassembler"
[14]: https://gitee.com/ryujinx/Ryujinx/blob/master/LICENSE.txt?utm_source=chatgpt.com "Ryujinx/Ryujinx"
[15]: https://github.com/justinstenning/SharpDisasm?utm_source=chatgpt.com "justinstenning/SharpDisasm"
[16]: https://cs61.seas.harvard.edu/site/2022/pdf/x86-64-abi-20210928.pdf?utm_source=chatgpt.com "System V Application Binary Interface"
[17]: https://www.reinterpretcast.com/hello-world-mach-o?utm_source=chatgpt.com "The Nitty Gritty of “Hello World” on macOS | reinterpretcast.com"
[18]: https://github.com/KeRNeLith/QuikGraph?utm_source=chatgpt.com "KeRNeLith/QuikGraph: Generic Graph Data Structures and ..."
[19]: https://github.com/microsoft/automatic-graph-layout?utm_source=chatgpt.com "microsoft/automatic-graph-layout: A set of tools for ..."
[20]: https://llvm.org/doxygen/structllvm_1_1ItaniumPartialDemangler.html?utm_source=chatgpt.com "ItaniumPartialDemangler Struct Reference"
[21]: https://llvm.org/doxygen/MicrosoftDemangle_8cpp_source.html?utm_source=chatgpt.com "lib/Demangle/MicrosoftDemangle.cpp Source File"
[22]: https://www.emergetools.com/blog/posts/iOS15LaunchTime?utm_source=chatgpt.com "How iOS 15 makes your app launch faster"
[23]: https://llvm.org/doxygen/ItaniumDemangle_8cpp.html?utm_source=chatgpt.com "lib/Demangle/ItaniumDemangle.cpp File Reference"
[24]: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/CodeFootprint/Articles/MachOOverview.html?utm_source=chatgpt.com "Overview of the Mach-O Executable Format"
Good, this is exactly the point where you want to nail the internal model, otherwise everything on top (VEX, policy, reporting) will be shaky.
Below is a detailed, implementation-ready plan for a **reachability graph with purl-aware edges**, covering ELF, PE, and Mach-O, in C#.
I'll structure it as:
1. Overall graph design (3 layers: function, module, purl)
2. Core C# data model
3. Pipeline steps (end-to-end)
4. Format-specific edge construction (ELF / PE / Mach-O)
5. Reachability queries (from entrypoints to vulnerable purls / functions)
6. JSON output layout and integration with SBOM
---
## 1. Overall graph design
You want three tightly linked graph layers:
1. **Function-level call graph (FLG)**
   * Nodes: individual **functions** inside binaries
   * Edges: calls from function A → function B (intra- or inter-module)
2. **Module-level graph (MLG)**
   * Nodes: **binaries** (ELF/PE/Mach-O files)
   * Edges: “module A calls module B at least once” (aggregated from FLG)
3. **Purl-level graph (PLG)**
   * Nodes: **purls** (packages or generic artifacts)
   * Edges: “purl P1 depends-at-runtime on purl P2” (aggregated from module edges)
The **reachability algorithm** runs primarily on the **function graph**, but:
* You can project reachability results to **module** and **purl** nodes.
* You can also run coarse-grained analysis directly on **purl graph** when needed (“Is any code in purl X reachable from the container entrypoint?”).
---
## 2. Core C# data model
### 2.1 Identifiers and enums
```csharp
public enum BinaryFormat { Elf, Pe, MachO }
public readonly record struct ModuleId(string Path, BinaryFormat Format);
public readonly record struct Purl(string Value);
public enum EdgeKind
{
    IntraModuleDirect,  // call foo -> bar in same module
    ImportCall,         // call via plt/iat/stub to imported function
    SyntheticRoot,      // root (entrypoint) edge
    IndirectUnresolved  // optional: we saw an indirect call we couldn't resolve
}
```
### 2.2 Function node
```csharp
public sealed class FunctionNode
{
    public int Id { get; init; }                 // internal numeric id
    public ModuleId Module { get; init; }
    public Purl Purl { get; init; }              // resolved from Module -> Purl
    public ulong Address { get; init; }          // VA or RVA
    public string Name { get; init; }            // mangled
    public string? DemangledName { get; init; }  // optional
    public bool IsExported { get; init; }
    public bool IsImportedStub { get; init; }    // e.g. PLT stub, Mach-O stub, PE thunks
    public bool IsRoot { get; set; }             // _start/main/entrypoint etc.
}
```
### 2.3 Edges
```csharp
public sealed class CallEdge
{
    public int FromId { get; init; }       // FunctionNode.Id
    public int ToId { get; init; }         // FunctionNode.Id
    public EdgeKind Kind { get; init; }
    public string Evidence { get; init; }  // e.g. "ELF.R_X86_64_JUMP_SLOT", "PE.IAT", "MachO.indirectSym"
}
```
### 2.4 Graph container
```csharp
public sealed class CallGraph
{
    public IReadOnlyDictionary<int, FunctionNode> Nodes { get; init; }
    public IReadOnlyDictionary<int, List<CallEdge>> OutEdges { get; init; }
    public IReadOnlyDictionary<int, List<CallEdge>> InEdges { get; init; }
    // Convenience: mappings
    public IReadOnlyDictionary<ModuleId, List<int>> FunctionsByModule { get; init; }
    public IReadOnlyDictionary<Purl, List<int>> FunctionsByPurl { get; init; }
}
```
### 2.5 Purl-level graph view
You don't store a separate physical graph; you **derive** it on demand:
```csharp
public sealed class PurlEdge
{
    public Purl From { get; init; }
    public Purl To { get; init; }
    public List<(int FromFnId, int ToFnId)> SupportingCalls { get; init; }
}
public sealed class PurlGraphView
{
    public IReadOnlyDictionary<Purl, HashSet<Purl>> Adjacent { get; init; }
    public IReadOnlyList<PurlEdge> Edges { get; init; }
}
```
---
## 3. Pipeline steps (end-to-end)
### Step 0: Inputs
* Set of binaries (files) extracted from container image.
* SBOM or other metadata that can map a file path (or hash) → **purl**.
### Step 1: Parse binaries → `BinaryModule` objects
You define a common in-memory model:
```csharp
public sealed class BinaryModule
{
    public ModuleId Id { get; init; }
    public Purl Purl { get; init; }
    public BinaryFormat Format { get; init; }
    // Raw sections / segments
    public IReadOnlyList<SectionInfo> Sections { get; init; }
    // Symbols
    public IReadOnlyList<SymbolInfo> Symbols { get; init; }  // imports + exports + locals
    // Relocations / fixups
    public IReadOnlyList<RelocationInfo> Relocations { get; init; }
    // Import/export tables (PE)/dylib commands (Mach-O)/DT_NEEDED (ELF)
    public ImportInfo[] Imports { get; init; }
    public ExportInfo[] Exports { get; init; }
}
```
Implement format-specific loaders:
* `ElfLoader : IBinaryLoader`
* `PeLoader : IBinaryLoader`
* `MachOLoader : IBinaryLoader`
Each loader uses your chosen C# parsers or ported code and fills `BinaryModule`.
### Step 2: Disassembly → basic blocks & candidate functions
For each `BinaryModule`:
1. Use appropriate decoder (iced for x86/x64; Disarm/ported ARMeilleure for AArch64).
2. Seed function starts:
   * Exported functions
   * Entry points (`_start`, `main`, AddressOfEntryPoint)
   * Mach-O `LC_FUNCTION_STARTS` if available
3. Walk instructions to build basic blocks:
   * Stop blocks at conditional/unconditional branches, calls, rets.
   * Record for each call site:
     * Address of caller function
     * Operand type (immediate, memory with import table address, etc.)
Disassembler outputs a list of `FunctionNode` skeletons (no cross-module link yet) and a list of **raw call sites**:
```csharp
public sealed class RawCallSite
{
    public int CallerFunctionId { get; init; }
    public ulong InstructionAddress { get; init; }
    public ulong? DirectTargetAddress { get; init; }  // e.g. CALL 0x401000
    public ulong? MemoryTargetAddress { get; init; }  // e.g. CALL [0x404000]
    public bool IsIndirect { get; init; }             // register-based etc.
}
```
### Step 3: Build function nodes
Using disassembly + symbol tables:
* For each discovered function:
  * Determine: address, name (if sym available), export/import flags.
  * Map `ModuleId` → `Purl` using `IPurlResolver`.
* Populate `FunctionNode` instances and index them by `Id`.
### Step 4: Construct intra-module edges
For each `RawCallSite`:
* If `DirectTargetAddress` falls inside a known function's address range in the **same module**, add **IntraModuleDirect** edge.
This gives you “normal” calls like `foo()` calling `bar()` in the same .so/.dll.
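One way to do the "falls inside a known function's address range" check is a binary search over sorted function starts. The sketch below assumes you keep two parallel arrays per module (sorted start addresses and exclusive end addresses); `FunctionRangeIndex` and `FindContaining` are hypothetical names for illustration.

```csharp
using System;

// Hypothetical helper: maps a call target address to the index of the
// function whose [start, end) range contains it.
public static class FunctionRangeIndex
{
    // starts must be sorted ascending; ends[i] is the exclusive end of function i.
    // Returns the function index, or -1 when the target is in no known function.
    public static int FindContaining(ulong[] starts, ulong[] ends, ulong target)
    {
        int index = Array.BinarySearch(starts, target);
        if (index < 0)
            index = ~index - 1; // nearest start strictly below the target

        if (index < 0 || target >= ends[index])
            return -1;          // before the first function, or in a gap

        return index;
    }
}
```

A target that lands mid-function (not on a start) is worth flagging, since it often indicates a missed function seed or a tail-call into shared code.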
### Step 5: Construct inter-module edges (import calls)
This is where ELF/PE/Mach-O differ; details in section 4 below.
But the abstract logic is:
1. Take each call site with `MemoryTargetAddress` (IAT slot / GOT entry / la_symbol_ptr / PLT).
2. From the module's import, relocation or fixup tables, determine:
   * Which **imported symbol** it corresponds to (name, ordinal, etc.).
   * Which **imported module / dylib / DLL** provides that symbol.
3. Find (or create) a `FunctionNode` representing that imported symbol in the **provider module**.
4. Add an **ImportCall** edge from caller function to the provider `FunctionNode`.
This is the key to turning low-level dynamic linking into **purl-aware cross-module edges**, because each `FunctionNode` is already stamped with a `Purl`.
### Step 6: Build adjacency structures
Once you have all `FunctionNode`s and `CallEdge`s:
* Build `OutEdges` and `InEdges` dictionaries keyed by `FunctionNode.Id`.
* Build `FunctionsByModule` / `FunctionsByPurl`.
---
## 4. Format-specific edge construction
This is the “how” for step 5, per binary format.
### 4.1 ELF
Goal: map call sites that go via PLT/GOT to an imported function in a `DT_NEEDED` library.
Algorithm:
1. Parse:
   * `.dynsym`, `.dynstr`: dynamic symbol table
   * `.rela.plt` / `.rel.plt`: relocation entries for the PLT
   * `.got.plt` / `.got`: the PLT's GOT
   * `DT_NEEDED` entries: list of linked shared objects and their sonames
2. For each relocation of type `R_*_JUMP_SLOT`:
   * It applies to an entry in the PLT GOT; that GOT entry is what the PLT stub (or a direct `call` through the GOT) reads from.
   * Relocation gives you:
     * Offset in GOT (`r_offset`)
     * Symbol index (`r_info` → symbol) → dynamic symbol (`ElfSymbol`)
     * Symbol name, type (FUNC), binding, etc.
3. Link GOT entries to call sites:
   * For each `RawCallSite` with `MemoryTargetAddress`, check if that address falls inside `.got.plt` (or `.got`). If it does:
     * Find relocation whose `r_offset` equals that GOT entry offset.
     * That tells you which **symbol** is being called.
   * For a call whose `DirectTargetAddress` lands in `.plt` (the usual shape unless the code was built with `-fno-plt`), first decode the stub's indirect jump to recover the GOT slot it reads, then apply the same lookup.
4. Determine provider module:
   * From the symbol's name (`st_name`) and the `DT_NEEDED` list, decide which shared object is expected to define it (an approximation is: first DT_NEEDED that provides that name).
   * Map DT_NEEDED → `ModuleId` (you'll have loaded these modules separately, or you can create “placeholder modules” if they're not in the container image).
5. Create edges:
   * Create/find `FunctionNode` for the **imported symbol** in provider module.
   * Add `CallEdge` from caller function to imported function, `EdgeKind = ImportCall`, `Evidence = "ELF.R_X86_64_JUMP_SLOT"` (or arch-specific).
This yields edges like:
* `myapp:main` → `libssl.so.1.1:SSL_read`
* `libfoo.so:foo` → `libc.so.6:malloc`
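The relocation-to-symbol step can be sketched as below, assuming a little-endian ELF64 file whose `.rela.plt` bytes and dynamic symbol names have already been read. Each `Elf64_Rela` entry is 24 bytes (`r_offset`, `r_info`, `r_addend`), the symbol index is the high 32 bits of `r_info`, and the type is the low 32 bits. `ElfPltRelocs` and `MapGotSlots` are hypothetical names; only the x86-64 relocation type is handled here.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical helper: builds "GOT slot address -> imported symbol name"
// from the raw bytes of .rela.plt (ELF64, little-endian).
public static class ElfPltRelocs
{
    public const uint R_X86_64_JUMP_SLOT = 7;
    private const int Elf64RelaSize = 24; // r_offset + r_info + r_addend

    public static Dictionary<ulong, string> MapGotSlots(
        ReadOnlySpan<byte> relaPlt, IReadOnlyList<string> dynsymNames)
    {
        var slots = new Dictionary<ulong, string>();
        for (int off = 0; off + Elf64RelaSize <= relaPlt.Length; off += Elf64RelaSize)
        {
            ulong rOffset = BinaryPrimitives.ReadUInt64LittleEndian(relaPlt.Slice(off, 8));
            ulong rInfo = BinaryPrimitives.ReadUInt64LittleEndian(relaPlt.Slice(off + 8, 8));

            uint type = (uint)(rInfo & 0xFFFFFFFF); // ELF64_R_TYPE
            int symIndex = (int)(rInfo >> 32);      // ELF64_R_SYM

            if (type != R_X86_64_JUMP_SLOT)
                continue;
            if (symIndex <= 0 || symIndex >= dynsymNames.Count)
                continue; // index 0 is the reserved null symbol

            slots[rOffset] = dynsymNames[symIndex];
        }
        return slots;
    }
}
```

In executables and shared objects `r_offset` is the virtual address of the GOT slot, so it can be compared directly with `RawCallSite.MemoryTargetAddress`.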
### 4.2 PE
Goal: map call sites that go via the Import Address Table (IAT) to imported functions in DLLs.
Algorithm:
1. Parse:
   * `IMAGE_IMPORT_DESCRIPTOR[]`: each for a DLL name.
   * Original thunk table (INT): names/ordinals of imported symbols.
   * IAT: where the loader writes function addresses at runtime.
2. For each import entry:
   * Determine:
     * DLL name (`Name`)
     * Function name or ordinal (from INT)
     * IAT slot address (RVA)
3. Link IAT slots to call sites:
   * For each `RawCallSite` with `MemoryTargetAddress`:
     * Check if this address equals the VA of an IAT slot.
     * If yes, the call site is effectively calling that imported function.
4. Determine provider module:
   * The DLL name gives you a target module (e.g. `KERNEL32.dll` → `ModuleId`).
   * Ensure that DLL is represented as a `BinaryModule` or a “placeholder” if not present in image.
5. Create edges:
   * Create/find `FunctionNode` for imported function in provider module.
   * Add `CallEdge` with `EdgeKind = ImportCall` and `Evidence = "PE.IAT"` (or `"PE.DelayLoad"` if using delay load descriptors).
Example:
* `myservice.exe:Start` → `SSPICLI.dll:AcquireCredentialsHandleW`
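A minimal sketch of the IAT slot table for a PE32+ (64-bit) image follows. It assumes the import descriptors are already parsed into a small record, that thunks are 8 bytes wide, and that the caller supplies a delegate that reads a NUL-terminated name at an RVA. `PeIatMap`, `ImportDescriptor`, and `readNameAtRva` are hypothetical names, not APIs of LibObjectFile or PeNet.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: builds "IAT slot VA -> (DLL, symbol)" for PE32+ images.
public static class PeIatMap
{
    private const ulong OrdinalFlag64 = 0x8000000000000000UL; // IMAGE_ORDINAL_FLAG64

    // Assumed pre-parsed view of one IMAGE_IMPORT_DESCRIPTOR:
    // LookupEntries holds the import lookup table (INT) values in order.
    public sealed record ImportDescriptor(string DllName, uint FirstThunkRva, ulong[] LookupEntries);

    public static Dictionary<ulong, (string Dll, string Symbol)> Build(
        ulong imageBase,
        IEnumerable<ImportDescriptor> descriptors,
        Func<uint, string> readNameAtRva)
    {
        var map = new Dictionary<ulong, (string Dll, string Symbol)>();
        foreach (var d in descriptors)
        {
            for (int i = 0; i < d.LookupEntries.Length; i++)
            {
                ulong entry = d.LookupEntries[i];
                if (entry == 0)
                    break; // null entry terminates the table

                string symbol = (entry & OrdinalFlag64) != 0
                    ? "#" + (entry & 0xFFFF)                           // import by ordinal
                    : readNameAtRva((uint)(entry & 0x7FFFFFFF) + 2);   // skip 2-byte hint

                // Slot i of the IAT parallels entry i of the lookup table.
                ulong slotVa = imageBase + d.FirstThunkRva + (ulong)i * 8;
                map[slotVa] = (d.DllName, symbol);
            }
        }
        return map;
    }
}
```

For 32-bit PE the same idea applies with 4-byte thunks and the 32-bit ordinal flag; delay-load descriptors would need their own table.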
### 4.3 Mach-O
Goal: map stub calls via `__TEXT,__stubs` / `__DATA,__la_symbol_ptr` (and / or chained fixups) to symbols in dependent dylibs.
Algorithm (for classic dyld opcodes, not chained fixups, then extend):
1. Parse:
   * Load commands:
     * `LC_SYMTAB`, `LC_DYSYMTAB`
     * `LC_LOAD_DYLIB` (to know dependent dylibs)
     * `LC_FUNCTION_STARTS` (for seeding functions)
     * `LC_DYLD_INFO` (rebase/bind/lazy bind)
   * `__TEXT,__stubs`: stub code
   * `__DATA,__la_symbol_ptr` (or `__DATA_CONST,__la_symbol_ptr`): lazy pointer table
   * **Indirect symbol table**: maps slot indices to symbol table indices
2. Stub → la_symbol_ptr mapping:
   * Stubs are small functions (usually a few instructions) that indirect through the corresponding `la_symbol_ptr` entry.
   * For each stub function:
     * Determine which la_symbol_ptr entry it uses (based on stub index and linking metadata).
     * From the indirect symbol table, find which dynamic symbol that la_symbol_ptr entry corresponds to.
     * This gives you symbol name and the index in `LC_LOAD_DYLIB` (dylib ordinal).
3. Link stub call sites:
   * In disassembly, treat calls to these stub functions as **import calls**.
   * For each call instruction `CALL stub_function`:
     * `RawCallSite.DirectTargetAddress` lies inside `__TEXT,__stubs`.
     * Resolve stub → la_symbol_ptr → symbol → dylib.
4. Determine provider module:
   * From dylib ordinal and load commands, get the path / install name of dylib (`libssl.1.1.dylib`, etc.).
   * Map that to a `ModuleId` in your module set.
5. Create edges:
   * Create/find imported `FunctionNode` in provider module.
   * Add `CallEdge` from caller to that function with `EdgeKind = ImportCall`, `Evidence = "MachO.IndirectSymbol"`.
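The stub-index arithmetic in step 2 can be sketched as follows. For a section of type `S_SYMBOL_STUBS`, the section header's `reserved1` field is the first index into the indirect symbol table and `reserved2` is the size of one stub, so stub `i` corresponds to `indirectSymbols[reserved1 + i]`. The `MachOStubs` class and its parameters are hypothetical; inputs are assumed to be already parsed.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper: resolves a call target inside __TEXT,__stubs to the
// name of the imported symbol, via the indirect symbol table.
public static class MachOStubs
{
    public const uint IndirectSymbolLocal = 0x80000000; // INDIRECT_SYMBOL_LOCAL
    public const uint IndirectSymbolAbs = 0x40000000;   // INDIRECT_SYMBOL_ABS

    public static string? ResolveStub(
        ulong target,
        ulong stubsAddr, ulong stubsSize,  // section addr / size
        uint reserved1, uint reserved2,    // first indirect index / stub size
        uint[] indirectSymbols,
        IReadOnlyList<string> symbolNames) // indexed by symbol table index
    {
        if (reserved2 == 0 || target < stubsAddr || target >= stubsAddr + stubsSize)
            return null;                   // not a stub call

        ulong offset = target - stubsAddr;
        if (offset % reserved2 != 0)
            return null;                   // not on a stub boundary

        ulong slot = reserved1 + offset / reserved2;
        if (slot >= (ulong)indirectSymbols.Length)
            return null;

        uint symIndex = indirectSymbols[slot];
        if ((symIndex & (IndirectSymbolLocal | IndirectSymbolAbs)) != 0)
            return null;                   // special marker, no symbol entry

        return symIndex < symbolNames.Count ? symbolNames[(int)symIndex] : null;
    }
}
```

For two-level-namespace binaries the dylib ordinal of the resolved symbol sits in the high byte of its `n_desc` field, which is what ties the name back to an `LC_LOAD_DYLIB` entry.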
For **chained fixups** (`LC_DYLD_CHAINED_FIXUPS`), you'll compute a similar mapping but walking chain entries instead of traditional lazy/weak binds. The key is still:
* Map a stub or function to a **fixup** entry.
* From fixup, determine the symbol and dylib.
* Then connect call-site → imported function.
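For the chain walk itself, each 64-bit slot encodes either a bind or a rebase plus the distance to the next slot. The sketch below decodes one slot in the plain `DYLD_CHAINED_PTR_64` layout (bind: 24-bit ordinal, 8-bit addend, 12-bit `next`, top bit set; rebase: 36-bit target). It is an assumption that your binaries use this pointer format; arm64e images use different layouts that are not covered here. `ChainedPtr64` is a hypothetical name.

```csharp
using System;

// Hypothetical helper: decodes one slot of a DYLD_CHAINED_PTR_64 fixup chain.
public static class ChainedPtr64
{
    // Ordinal indexes the chained-fixups imports table (not a dylib ordinal).
    // Next is the distance to the following fixup in 4-byte units; 0 ends the chain.
    public readonly record struct Decoded(bool IsBind, uint Ordinal, byte Addend, ulong Target, uint Next);

    public static Decoded Decode(ulong raw)
    {
        bool isBind = (raw >> 63) != 0;           // bit 63: bind flag
        uint next = (uint)((raw >> 51) & 0xFFF);  // bits 51..62

        if (isBind)
        {
            uint ordinal = (uint)(raw & 0xFFFFFF);      // bits 0..23
            byte addend = (byte)((raw >> 24) & 0xFF);   // bits 24..31
            return new Decoded(true, ordinal, addend, 0, next);
        }

        ulong target = raw & 0xFFFFFFFFFUL;       // bits 0..35 (rebase target)
        return new Decoded(false, 0, 0, target, next);
    }
}
```

A walker would start at each page's first fixup offset, decode the slot, record binds as "slot address → import ordinal", and advance by `Next * 4` bytes until `Next` is zero.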
---
## 5. Reachability queries
Once the graph is built, reachability is “just graph algorithms” + mapping back to purls.
### 5.1 Roots
Decide what are your **root functions**:
* Binary entrypoints:
  * ELF: `_start`, `main`, constructors (`.init_array`)
  * PE: AddressOfEntryPoint, registered service entrypoints
  * Mach-O: `_main`, constructors
* Optionally, any exported API function that a container orchestrator or plugin system will call.
Mark them as `FunctionNode.IsRoot = true` and create synthetic edges from a special root node if you want:
```csharp
var syntheticRoot = new FunctionNode
{
    Id = 0,
    Name = "<root>",
    IsRoot = true,
    // Module, Purl can be special markers
};
foreach (var fn in allFunctions.Where(f => f.IsRoot))
{
    edges.Add(new CallEdge
    {
        FromId = syntheticRoot.Id,
        ToId = fn.Id,
        Kind = EdgeKind.SyntheticRoot,
        Evidence = "Root"
    });
}
```
### 5.2 Reachability algorithm (function-level)
Use BFS/DFS from the root node(s):
```csharp
public sealed class ReachabilityResult
{
    // init-only (not get-only) so the traversal below can hand over its visited set
    public HashSet<int> ReachableFunctions { get; init; } = new();
}
public ReachabilityResult ComputeReachableFunctions(CallGraph graph, IEnumerable<int> rootIds)
{
    var visited = new HashSet<int>();
    var stack = new Stack<int>();
    foreach (var root in rootIds)
    {
        if (visited.Add(root))
            stack.Push(root);
    }
    // Iterative DFS over outgoing call edges
    while (stack.Count > 0)
    {
        var current = stack.Pop();
        if (!graph.OutEdges.TryGetValue(current, out var edges))
            continue;
        foreach (var edge in edges)
        {
            if (visited.Add(edge.ToId))
                stack.Push(edge.ToId);
        }
    }
    return new ReachabilityResult { ReachableFunctions = visited };
}
```
### 5.3 Project reachability to modules and purls
Given `ReachableFunctions`:
```csharp
public sealed class ReachabilityProjection
{
    public HashSet<ModuleId> ReachableModules { get; } = new();
    public HashSet<Purl> ReachablePurls { get; } = new();
}
public ReachabilityProjection ProjectToModulesAndPurls(CallGraph graph, ReachabilityResult result)
{
    var projection = new ReachabilityProjection();
    foreach (var fnId in result.ReachableFunctions)
    {
        if (!graph.Nodes.TryGetValue(fnId, out var fn))
            continue;
        projection.ReachableModules.Add(fn.Module);
        projection.ReachablePurls.Add(fn.Purl);
    }
    return projection;
}
```
Now you can answer questions like:
* “Is any code from purl `pkg:deb/openssl@1.1.1w-1` reachable from the container entrypoint?”
* “Which purls are reachable at all?”
### 5.4 Vulnerability reachability
Assume you've mapped each vulnerability to:
* `Purl` (where it lives)
* `AffectedFunctionNames` (symbols; optionally demangled)
You can implement:
```csharp
public sealed class VulnerabilitySink
{
    public string VulnerabilityId { get; init; }  // CVE-...
    public Purl Purl { get; init; }
    public string FunctionName { get; init; }     // symbol name or demangled
}
```
Resolution algorithm:
1. For each `VulnerabilitySink`, find all `FunctionNode` with:
   * `node.Purl == sink.Purl` and
   * `node.Name` or `node.DemangledName` matches `sink.FunctionName`.
2. For each such node, check `ReachableFunctions.Contains(node.Id)`.
3. Build a `VulnerabilityFinding` object:
```csharp
public sealed class VulnerabilityFinding
{
    public string VulnerabilityId { get; init; }
    public Purl Purl { get; init; }
    public bool IsReachable { get; init; }
    public List<int> SinkFunctionIds { get; init; } = new();
}
```
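The three resolution steps could be wired together roughly as below. To keep the sketch self-contained it uses simplified stand-in records (`Node`, `Sink`, `Finding`) with plain-string purls instead of the classes defined earlier, and exact string matching on names; all of these are assumptions for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: matches vulnerability sinks against function nodes
// and marks a finding reachable if any matching node was visited.
public static class SinkMatcher
{
    public sealed record Node(int Id, string Purl, string Name, string? DemangledName);
    public sealed record Sink(string VulnerabilityId, string Purl, string FunctionName);
    public sealed record Finding(string VulnerabilityId, string Purl, bool IsReachable, List<int> SinkFunctionIds);

    public static List<Finding> Match(
        IEnumerable<Node> nodes, IEnumerable<Sink> sinks, ISet<int> reachable)
    {
        var nodeList = nodes.ToList();
        var findings = new List<Finding>();

        foreach (var sink in sinks)
        {
            // Step 1: same purl, and mangled or demangled name matches.
            var ids = nodeList
                .Where(n => n.Purl == sink.Purl &&
                            (n.Name == sink.FunctionName || n.DemangledName == sink.FunctionName))
                .Select(n => n.Id)
                .ToList();

            // Step 2: reachable if any sink function is in the visited set.
            bool isReachable = ids.Any(id => reachable.Contains(id));

            // Step 3: one finding per sink, even when nothing matched.
            findings.Add(new Finding(sink.VulnerabilityId, sink.Purl, isReachable, ids));
        }
        return findings;
    }
}
```

Emitting a finding with an empty `SinkFunctionIds` list keeps "function not present in this build" distinguishable from "present but unreachable".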
Plus, if you want **path evidence**, you run a shortest-path search (BFS predecessor map) from root to sink and store the sequence of `FunctionNode.Id`s.
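A minimal sketch of that path search, assuming the graph is reduced to a plain `id → successor ids` adjacency map (the `PathFinder` class and its signature are hypothetical):

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper: BFS with a predecessor map, returning one shortest
// call path (by edge count) from root to sink, or null if none exists.
public static class PathFinder
{
    public static List<int>? ShortestPath(
        IReadOnlyDictionary<int, List<int>> outEdges, int root, int sink)
    {
        // predecessor[n] = node from which n was first discovered
        var predecessor = new Dictionary<int, int> { [root] = root };
        var queue = new Queue<int>();
        queue.Enqueue(root);

        while (queue.Count > 0)
        {
            int current = queue.Dequeue();
            if (current == sink)
                break;
            if (!outEdges.TryGetValue(current, out var targets))
                continue;

            foreach (int next in targets)
            {
                if (predecessor.ContainsKey(next))
                    continue; // already discovered via a path at least as short
                predecessor[next] = current;
                queue.Enqueue(next);
            }
        }

        if (!predecessor.ContainsKey(sink))
            return null;

        // Walk predecessors back from the sink, then reverse.
        var path = new List<int>();
        for (int at = sink; ; at = predecessor[at])
        {
            path.Add(at);
            if (at == root)
                break;
        }
        path.Reverse();
        return path;
    }
}
```

Storing the edge `Evidence` alongside each hop would make the path directly usable as an audit trail in reports.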
---
## 6. Purl edges (derived graph)
For reporting and analytics, it's useful to produce a **purl-level dependency graph**.
Given `CallGraph`:
```csharp
public PurlGraphView BuildPurlGraph(CallGraph graph)
{
var edgesByPair = new Dictionary<(Purl From, Purl To), PurlEdge>();
foreach (var kv in graph.OutEdges)
{
var fromFn = graph.Nodes[kv.Key];
foreach (var edge in kv.Value)
{
var toFn = graph.Nodes[edge.ToId];
if (fromFn.Purl.Equals(toFn.Purl))
continue; // intra-purl, skip if you only care about inter-purl
var key = (fromFn.Purl, toFn.Purl);
if (!edgesByPair.TryGetValue(key, out var pe))
{
pe = new PurlEdge
{
From = fromFn.Purl,
To = toFn.Purl,
SupportingCalls = new List<(int, int)>()
};
edgesByPair[key] = pe;
}
pe.SupportingCalls.Add((fromFn.Id, toFn.Id));
}
}
var adj = new Dictionary<Purl, HashSet<Purl>>();
foreach (var kv in edgesByPair)
{
var (from, to) = kv.Key;
if (!adj.TryGetValue(from, out var list))
{
list = new HashSet<Purl>();
adj[from] = list;
}
list.Add(to);
}
return new PurlGraphView
{
Adjacent = adj,
Edges = edgesByPair.Values.ToList()
};
}
```
This gives you:
* A coarse view of runtime dependencies between purls (“Purl A calls into Purl B”).
* Enough context to emit purl-level VEX or to reason about trust at package granularity.
---
## 7. JSON output and SBOM integration
### 7.1 JSON shape (high level)
You can emit a composite document:
```json
{
"image": "registry.example.com/app@sha256:...",
"modules": [
{
"moduleId": { "path": "/usr/lib/libssl.so.1.1", "format": "Elf" },
"purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
"arch": "x86_64"
}
],
"functions": [
{
"id": 42,
"name": "SSL_do_handshake",
"demangledName": null,
"module": { "path": "/usr/lib/libssl.so.1.1", "format": "Elf" },
"purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
"address": "0x401020",
"exported": true
}
],
"edges": [
{
"from": 10,
"to": 42,
"kind": "ImportCall",
"evidence": "ELF.R_X86_64_JUMP_SLOT"
}
],
"reachability": {
"roots": [1],
"reachableFunctions": [1,10,42]
},
"purlGraph": {
"edges": [
{
"from": "pkg:generic/myapp@1.0.0",
"to": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
"supportingCalls": [[10,42]]
}
]
},
"vulnerabilities": [
{
"id": "CVE-2024-XXXX",
"purl": "pkg:deb/ubuntu/openssl@1.1.1w-0ubuntu1",
"sinkFunctions": [42],
"reachable": true,
"paths": [
[1, 10, 42]
]
}
]
}
```
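One way to produce this shape is `System.Text.Json` with a camelCase naming policy, so C# property names map to the keys above without per-property attributes. The sketch below covers only the `image`, `edges`, and `reachability` parts; the DTO record names are assumptions chosen for illustration.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical DTOs mirroring a subset of the JSON document.
public sealed record EdgeDto(int From, int To, string Kind, string Evidence);
public sealed record ReachabilityDto(List<int> Roots, List<int> ReachableFunctions);
public sealed record AnalysisDocument(string Image, List<EdgeDto> Edges, ReachabilityDto Reachability);

public static class AnalysisJson
{
    private static readonly JsonSerializerOptions Options = new()
    {
        // From -> "from", ReachableFunctions -> "reachableFunctions"
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        WriteIndented = true
    };

    public static string Serialize(AnalysisDocument doc) =>
        JsonSerializer.Serialize(doc, Options);

    public static AnalysisDocument? Deserialize(string json) =>
        JsonSerializer.Deserialize<AnalysisDocument>(json, Options);
}
```

Keeping DTOs separate from the in-memory graph types lets the on-disk schema stay stable while the analysis model changes. For large images you would likely stream the `functions` and `edges` arrays instead of building one string.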
### 7.2 Purl resolution
Implement an `IPurlResolver` interface:
```csharp
public interface IPurlResolver
{
Purl ResolveForModule(string filePath, byte[] contentHash);
}
```
Possible implementations:
* `SbomPurlResolver`: given a CycloneDX/SPDX SBOM for the image, match by path or checksum.
* `LinuxPackagePurlResolver`: read `/var/lib/dpkg/status` / rpm DB in the filesystem.
* `GenericPurlResolver`: fallback, `pkg:generic/<hash>`.
You call the resolver in your loaders so that **every `BinaryModule` has a purl** and thus every `FunctionNode` has a purl.
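As a rough illustration, an SBOM-backed resolver with a generic fallback might look like the sketch below. It assumes the SBOM has already been parsed into two lookup tables (digest to purl, path to purl); that pre-parsing step, the constructor shape, and the `Purl` record are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the plan's Purl type (assumed shape).
public sealed record Purl(string Value);

public interface IPurlResolver
{
    Purl ResolveForModule(string filePath, byte[] contentHash);
}

// Hypothetical resolver: checksum match first, then path match, then pkg:generic.
public sealed class SbomPurlResolver : IPurlResolver
{
    private readonly IReadOnlyDictionary<string, Purl> _byHash; // lowercase hex digest -> purl
    private readonly IReadOnlyDictionary<string, Purl> _byPath; // absolute path in image -> purl

    public SbomPurlResolver(
        IReadOnlyDictionary<string, Purl> byHash,
        IReadOnlyDictionary<string, Purl> byPath)
    {
        _byHash = byHash;
        _byPath = byPath;
    }

    public Purl ResolveForModule(string filePath, byte[] contentHash)
    {
        var hex = Convert.ToHexString(contentHash).ToLowerInvariant();

        // Identical content is the strongest evidence, so try it before the path.
        if (_byHash.TryGetValue(hex, out var byHash))
            return byHash;
        if (_byPath.TryGetValue(filePath, out var byPath))
            return byPath;

        // Fallback so that every module still ends up with a purl.
        return new Purl($"pkg:generic/{hex}");
    }
}
```

Trying the checksum before the path avoids misattributing a binary that was replaced in place after package installation.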
---
## 8. Concrete implementation tasks for your team
1. **Data model & interfaces**
* Implement `ModuleId`, `FunctionNode`, `CallEdge`, `CallGraph`.
* Define `RawCallSite`, `BinaryModule`, and `IPurlResolver`.
2. **Loaders**
* `ElfLoader`: fill symbols, dynamic relocations (PLT), DT_NEEDED, etc.
* `PeLoader`: import descriptors, IAT, delay-load.
* `MachOLoader`: load commands, stubs, la_symbol_ptr, indirect symbols / chained fixups.
3. **Disassembly**
* `X86Disassembler` (iced) and `Arm64Disassembler` (Disarm or port).
* Function detection and `RawCallSite` extraction.
4. **CallGraphBuilder**
* Build intra-module edges from direct calls.
* Build inter-module edges using the format-specific rules above.
* Construct final `CallGraph` with adjacency maps and purl mappings.
5. **Reachability**
* Implement BFS/DFS from root functions.
* Projection to modules + purls.
* Vulnerability sink resolution & path reconstruction.
6. **Export**
* JSON serializer for the schema above.
* Optional: purl-level summary / VEX generator.
---
If you want, as a next step I can produce a **more concrete design for `CallGraphBuilder`** (including per-format helper classes with method signatures) or a **C# skeleton** for the `ElfImportResolver`, `PeImportResolver`, and `MachOStubResolver` that plugs directly into this plan.