Here’s a simple, practical way to think about **binary reachability** that cleanly joins call graphs with SBOMs, without reusing external tools.

---

### The big idea (plain English)

* Each **function call edge** in a binary’s call graph is annotated with:
  * a **purl** (package URL) identifying which component the callee belongs to, and
  * a **symbol digest** (stable hash of the callee’s normalized symbol signature).
* With those two tags, call graphs from **PE/ELF/Mach-O** can be merged across binaries and mapped onto your **SBOM components**, giving a **single vulnerability graph** that answers: *“Is this vulnerable function reachable in my deployment?”*

---

### Why this matters for Stella Ops

* **One graph to rule them all:** Libraries used by multiple services merge naturally via the same purl, so you see cross-service blast radius instantly.
* **Deterministic & auditable:** Digests + purls make edges reproducible (great for “replayable scans” and audit trails).
* **Zero tool reuse required:** You can implement PE/ELF/Mach-O parsing once in C# and still interoperate with SBOM/VEX ecosystems via purls.

---

### Minimal data model

```json
{
  "nodes": [
    {"id":"sym:hash:callee","kind":"symbol","purl":"pkg:nuget/Newtonsoft.Json@13.0.3","sig":"Newtonsoft.Json.JsonConvert::DeserializeObject(string)"},
    {"id":"bin:hash:myapi","kind":"binary","format":"pe","name":"MyApi.exe","build":"sha256:..."}
  ],
  "edges": [
    {
      "from":"sym:hash:caller",
      "to":"sym:hash:callee",
      "etype":"calls",
      "purl":"pkg:nuget/Newtonsoft.Json@13.0.3",
      "sym_digest":"sha256:SYM_CALLEE",
      "site":{"binary":"bin:hash:myapi","offset":"0x0041AFD0"}
    }
  ],
  "sbom": [
    {"purl":"pkg:nuget/Newtonsoft.Json@13.0.3","component_id":"c-123","files":["/app/MyApi.exe"]}
  ]
}
```

---

### How to build it (C#-centric, binary-first)

1. **Lift symbols per format**
   * **PE**: parse COFF + PDB (if present), fall back to export tables; normalize “namespace.type::method(sig)”.
   * **ELF**: `.dynsym`/`.symtab` + DWARF (if present); demangle (Itanium/LLVM rules).
   * **Mach-O**: LC_SYMTAB + DWARF; demangle.
2. **Compute `symbol digests`**
   * Hash of normalized signature + (optionally) instruction fingerprint for resilience to addresses.
3. **Build intra-binary call graph**
   * Conservative static: function→function edges from **import thunks**, relocation targets, and lightweight disassembly of direct calls.
   * Optional dynamic refinement: perf/eBPF or ETW traces to mark *observed* edges.
4. **Resolve each callee to a `purl`**
   * Map import/segment to owning file → map file to SBOM component → emit its purl.
   * If multiple candidates, emit edge with a small `candidates[]` set; policy later can prune.
5. **Merge graphs across binaries**
   * Union by `(purl, sym_digest)` for callees; keep multiple `site` locations.
6. **Attach vulnerabilities**
   * From VEX/CVE → affected package purls → mark reachable if any path exists from entrypoints to a vulnerable `(purl, sym_digest)`.

---

### Practical policies that work well

* **Entrypoints:** ASP.NET controller actions, `Main`, exported handlers, cron entry shims.
* **Edge confidence:** tag edges as `import`, `reloc`, `disasm`, or `runtime`; prefer runtime in prioritization.
* **Unknowns registry:** if a symbol can’t be resolved, record `purl:"pkg:unknown"` with a reason (stripped, obfuscated, thunk), so it’s visible rather than silently dropped.

---

### Quick win you can ship first

* Start with **imports-only reachability** (no disassembly). For most CVEs in popular packages, imports + SBOM mapping already highlights real risk.
* Add **light disassembly** for direct `call` opcodes later to improve precision.

If you want, I can turn this into a ready-to-drop **.NET 10 library skeleton**: parsers (PE/ELF/Mach-O), symbol normalizer, digestor, graph model, and SBOM mapper with purl resolvers.

Below is a concrete, implementation-ready specification aimed at a solid, “average” C# developer.
The goal is that they can build this module without knowing the full StellaOps context.

---

## 1. Purpose and Scope

Implement a reusable .NET library that:

1. Reads binaries (PE, ELF, Mach-O).
2. Extracts **functions/symbols** and their **call relationships** (call graph).
3. Annotates each call edge with:
   * The **callee’s purl** (package URL / SBOM component).
   * A **symbol digest** (stable function identifier).
4. Produces a **reachability graph** in memory and as JSON.

This will be used by other StellaOps services (Scanner / Sbomer / Vexer) to answer: “Is this vulnerable function from package X reachable in my environment?”

Non-goals for v1:

* No dynamic tracing (no eBPF, no ETW). Static only.
* No external CLI tools (no `objdump`, `llvm-nm`, etc.). Everything in-process and in C#.

---

## 2. Project Structure

Create a new class library:

* Project: `StellaOps.Scanner.BinaryReachability`
* TargetFramework: `net10.0`
* Nullable: `enable`
* Language: latest C# available for .NET 10

Recommended namespaces:

* `StellaOps.Scanner.BinaryReachability`
* `StellaOps.Scanner.BinaryReachability.Model`
* `StellaOps.Scanner.BinaryReachability.Parsing`
* `StellaOps.Scanner.BinaryReachability.Parsing.Pe`
* `StellaOps.Scanner.BinaryReachability.Parsing.Elf`
* `StellaOps.Scanner.BinaryReachability.Parsing.MachO`
* `StellaOps.Scanner.BinaryReachability.Sbom`
* `StellaOps.Scanner.BinaryReachability.Graph`

---

## 3. Core Domain Model

### 3.1 Enumerations

```csharp
namespace StellaOps.Scanner.BinaryReachability.Model;

public enum BinaryFormat { Pe, Elf, MachO }

public enum SymbolKind
{
    Function,
    Method,
    Constructor,
    Destructor,
    ImportStub,
    Thunk,
    Unknown
}

public enum EdgeKind
{
    DirectCall,
    IndirectCall,
    ImportCall,
    ConstructorInit, // e.g. .init_array
    Other
}

public enum EdgeConfidence
{
    High,   // import, relocation, clear direct call
    Medium, // best-effort disassembly
    Low     // heuristics, fallback
}
```

### 3.2 Node and Edge Records

```csharp
namespace StellaOps.Scanner.BinaryReachability.Model;

public sealed record BinaryNode(
    string BinaryId,      // e.g. "bin:sha256:..."
    string FilePath,      // path in image or filesystem
    BinaryFormat Format,
    string? BuildId,      // ELF build-id, Mach-O UUID, PE pdb-signature (optional)
    string FileHash       // sha256 of binary bytes
);

public sealed record SymbolNode(
    string SymbolId,       // stable within this graph: "sym:{digest}"
    string NormalizedName, // normalized signature/name
    SymbolKind Kind,
    string? Purl,          // nullable: may be unknown
    string SymbolDigest    // sha256 of normalized name
);
```

### 3.3 Call Edge and Call Site

```csharp
namespace StellaOps.Scanner.BinaryReachability.Model;

public sealed record CallSite(
    string BinaryId,
    ulong Offset,       // RVA / file offset
    string? SourceFile, // Optional, if we can resolve
    int? SourceLine     // Optional
);

public sealed record CallEdge(
    string FromSymbolId,
    string ToSymbolId,
    EdgeKind EdgeKind,
    EdgeConfidence Confidence,
    string? CalleePurl,         // resolved package of callee
    string CalleeSymbolDigest,  // same as target SymbolDigest
    CallSite Site
);
```

### 3.4 Graph Container

```csharp
namespace StellaOps.Scanner.BinaryReachability.Graph;

using System.Collections.Generic;
using StellaOps.Scanner.BinaryReachability.Model;

public sealed class ReachabilityGraph
{
    // Keyed by BinaryId and SymbolId respectively.
    public Dictionary<string, BinaryNode> Binaries { get; } = new();
    public Dictionary<string, SymbolNode> Symbols { get; } = new();
    public List<CallEdge> Edges { get; } = new();

    public void AddBinary(BinaryNode binary) => Binaries[binary.BinaryId] = binary;
    public void AddSymbol(SymbolNode symbol) => Symbols[symbol.SymbolId] = symbol;
    public void AddEdge(CallEdge edge) => Edges.Add(edge);
}
```

---

## 4. Public API (what other modules call)

Define a simple facade service that other StellaOps components use.
```csharp
namespace StellaOps.Scanner.BinaryReachability;

using StellaOps.Scanner.BinaryReachability.Graph;
using StellaOps.Scanner.BinaryReachability.Model;
using StellaOps.Scanner.BinaryReachability.Sbom;

public interface IBinaryReachabilityService
{
    /// <summary>
    /// Builds a reachability graph for all binaries in the given directory (e.g. unpacked container filesystem),
    /// using SBOM data to resolve PURLs.
    /// </summary>
    ReachabilityGraph BuildGraph(
        string rootDirectory,
        ISbomComponentResolver sbomResolver);

    /// <summary>
    /// Serialize the graph to JSON for persistence / later replay.
    /// </summary>
    string SerializeGraph(ReachabilityGraph graph);
}
```

Implementation class:

```csharp
public sealed class BinaryReachabilityService : IBinaryReachabilityService
{
    // Will compose format-specific parsers and SBOM resolver inside.
}
```

---

## 5. SBOM Component Resolver

We need only a minimal interface to attach PURLs to binaries and symbols.

```csharp
namespace StellaOps.Scanner.BinaryReachability.Sbom;

public interface ISbomComponentResolver
{
    /// <summary>
    /// Resolve the purl for a binary file (by path or build-id).
    /// Return null if not found.
    /// </summary>
    string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash);

    /// <summary>
    /// Optional: resolve purl by a library name only (e.g. "libssl.so.3", "libcrypto.so.3").
    /// Used when we have imports but not full path.
    /// </summary>
    string? ResolvePurlByLibraryName(string libraryName);
}
```

For the C# dev:

* Implementation will consume **CycloneDX/SPDX SBOMs** that already map files (hash/path/buildId) to components and purls.
* For v1, a simple resolver that:
  * Loads SBOM JSON.
  * Indexes components by:
    * File path (normalized).
    * File hash.
    * BuildId where available.
  * Implements the two methods above using dictionary lookups.

---

## 6. Binary Parsing Abstractions

### 6.1 Common Interface

```csharp
namespace StellaOps.Scanner.BinaryReachability.Parsing;

using System;
using System.Collections.Generic;
using StellaOps.Scanner.BinaryReachability.Model;

public interface IBinaryParser
{
    bool CanParse(string filePath, ReadOnlySpan<byte> header);

    /// <summary>
    /// Parse basic binary metadata: format, build-id, file-hash already computed by caller.
    /// </summary>
    BinaryNode ParseBinaryMetadata(string filePath, string fileHash);

    /// <summary>
    /// Parse functions/symbols from this binary.
    /// Return a list of SymbolNode with Purl left null (will be set later).
    /// </summary>
    IReadOnlyList<SymbolNode> ParseSymbols(BinaryNode binary);

    /// <summary>
    /// Build intra-binary call edges (from this binary’s functions to others), without PURL info.
    /// ToSymbolId should be based on SymbolDigest; PURL will be attached later.
    /// </summary>
    IReadOnlyList<CallEdge> ParseCallGraph(BinaryNode binary, IReadOnlyList<SymbolNode> symbols);
}
```

### 6.2 Parser Implementations

Create three concrete parsers:

* `PeBinaryParser` in `Parsing.Pe`
* `ElfBinaryParser` in `Parsing.Elf`
* `MachOBinaryParser` in `Parsing.MachO`

And a small factory:

```csharp
public sealed class BinaryParserFactory
{
    private readonly List<IBinaryParser> _parsers;

    public BinaryParserFactory()
    {
        _parsers = new List<IBinaryParser>
        {
            new Pe.PeBinaryParser(),
            new Elf.ElfBinaryParser(),
            new MachO.MachOBinaryParser()
        };
    }

    public IBinaryParser? GetParser(string filePath, ReadOnlySpan<byte> header)
    {
        // A ReadOnlySpan<byte> cannot be captured by a lambda, so iterate explicitly instead of using LINQ.
        foreach (var parser in _parsers)
        {
            if (parser.CanParse(filePath, header))
                return parser;
        }
        return null;
    }
}
```

---

## 7. Symbol Normalization and Digesting

Create a small helper for consistent symbol IDs.

```csharp
namespace StellaOps.Scanner.BinaryReachability.Model;

public static class SymbolIdFactory
{
    // v1: minimal; later we can extend (demangling, etc.)
    public static string ComputeNormalizedName(string rawName) => rawName.Trim();

    public static string ComputeSymbolDigest(string normalizedName)
    {
        using var sha = System.Security.Cryptography.SHA256.Create();
        var bytes = System.Text.Encoding.UTF8.GetBytes(normalizedName);
        var hash = sha.ComputeHash(bytes);
        // Lowercase hex string of the SHA-256 hash.
        return System.Convert.ToHexString(hash).ToLowerInvariant();
    }

    public static string CreateSymbolId(string symbolDigest) => $"sym:{symbolDigest}";
}
```

Usage in parsers:

* For each function name the parser finds:
  * `normalizedName = SymbolIdFactory.ComputeNormalizedName(rawName);`
  * `digest = SymbolIdFactory.ComputeSymbolDigest(normalizedName);`
  * `symbolId = SymbolIdFactory.CreateSymbolId(digest);`
  * Create `SymbolNode`.

Notes for developer:

* Do not include file path or address in the digest (we want determinism across builds).
* In the future we can expand normalization to include demangled signatures and parameter types.

---

## 8. Building the Graph (step-by-step)

Implementation of `BinaryReachabilityService.BuildGraph` should follow this algorithm.

### 8.1 Scan Files

1. Recursively enumerate all files under `rootDirectory`.
2. For each file:
   * Open as stream.
   * Read first 4–8 bytes as header.
   * Try `BinaryParserFactory.GetParser`.
   * If no parser, skip file.

### 8.2 Parse Binary Metadata and Symbols

For each parseable file:

1. Compute SHA256 of file content → `fileHash`.
2. `parser.ParseBinaryMetadata(filePath, fileHash)` → `BinaryNode`.
3. Add `BinaryNode` to `ReachabilityGraph.Binaries`.
4. `parser.ParseSymbols(binary)` → list of `SymbolNode`.
5. For each symbol:
   * Add to `ReachabilityGraph.Symbols` if not already present:
     * Key: `SymbolId`.
     * If existing, keep first or merge (for v1: keep first).

Maintain an in-memory index:

```csharp
// symbolDigest -> SymbolNode
Dictionary<string, SymbolNode> symbolsByDigest;
```

### 8.3 Parse Call Graph per Binary

For each binary:

1. `parser.ParseCallGraph(binary, itsSymbols)` → edges (without PURL attached).
2. For each edge:
   * Ensure `FromSymbolId` and `ToSymbolId` correspond to known `SymbolNode`:
     * `ToSymbolId` should be `sym:{digest}` for the callee.
   * Add edge to `ReachabilityGraph.Edges`.

At this point, edges know only `FromSymbolId`, `ToSymbolId`, kind, confidence, and `CallSite`.

### 8.4 Attach PURLs

Now run a second pass to attach PURLs to symbols and edges:

1. For each `BinaryNode`:
   * Call `sbomResolver.ResolvePurlForBinary(binary.FilePath, binary.BuildId, binary.FileHash)`.
   * If not null, this is the **binary’s own purl** (used for "who owns these functions").
2. Maintain:

   ```csharp
   Dictionary<string, string?> binaryPurlsById; // BinaryId -> purl?
   ```
3. For each `CallEdge`:
   * Get callee symbol:
     * `var symbol = graph.Symbols[edge.ToSymbolId];`
   * If `symbol.Purl` is null:
     * If callee is local (same binary; the parser may mark it via metadata or `CallSite.BinaryId`):
       * Assign `symbol.Purl = binaryPurlsById[edge.Site.BinaryId]` (can be null).
     * If callee is imported from an external library:
       * Parser should provide library name in `NormalizedName` or additional metadata (for v1, you can store library in a separate structure).
       * Use `sbomResolver.ResolvePurlByLibraryName(libraryName)` to find purl.
       * Set `symbol.Purl` to that value (even if null).
   * Set `edge.CalleePurl = symbol.Purl`.
   * Set `edge.CalleeSymbolDigest = symbol.SymbolDigest`.

`SymbolNode` and `CallEdge` are immutable records, so “assign” and “set” here mean creating an updated copy with a `with` expression and storing it back in the graph.

Note: For v1 you can simplify:

* Assume all callees in this binary belong to `binary`’s purl.
* Later, extend to per-library mapping.

---

## 9. Format-Specific Minimum Requirements

For each parser, aim for this minimum.

### 9.1 PE Parser (Windows)

Tasks:

1. Identify PE by the `MZ` magic at offset 0 plus the `PE\0\0` signature at the offset stored at `0x3C` (`e_lfanew`). The signature lies beyond the initial header bytes, so `CanParse` may need to read further from `filePath`.
2. Extract:
   * Machine type.
   * Optional: PDB signature / age (for potential BuildId in the future).
3. Symbols:
   * Use export table for exported functions.
   * Use import table for imported functions (these represent edges from this binary to others).
4. Call graph:
   * For v1: edges from each local function to imported functions via import table.
   * Later: add simple disassembly of `.text` section to detect intra-binary calls.

Practical approach:

* Use `System.Reflection.PortableExecutable` if possible, or a small custom PE reader.
* Represent imported function name as `"<library>!<function>"` in `NormalizedName`.

### 9.2 ELF Parser (Linux)

Tasks:

1. Detect ELF by magic `0x7F 'E' 'L' 'F'`.
2. Extract:
   * BuildId (from `.note.gnu.build-id` if present).
   * Architecture.
3. Symbols:
   * Read `.dynsym` (dynamic symbols) and `.symtab` if present.
   * Functions only (symbol type FUNC).
4. Call graph (minimum):
   * Imports via PLT/GOT entries (function calls to shared libs).
   * Map symbol names to `SymbolNode` as above.

Implementation:

* Write a simple ELF reader: parse header, section headers, locate `.dynsym`, `.strtab`, `.symtab`, `.note.gnu.build-id`.

### 9.3 Mach-O Parser (macOS)

Tasks:

1. Detect Mach-O via magic (`0xFEEDFACE`, `0xFEEDFACF`, etc.).
2. Extract:
   * UUID (LC_UUID) as BuildId equivalent.
3. Symbols:
   * Use LC_SYMTAB and associated string table.
4. Call graph:
   * Similar approach as ELF for imports; minimum: cross-binary call edges via import stubs.

Implementation:

* Minimal Mach-O parser: read load commands, find LC_SYMTAB and LC_UUID.

---

## 10. JSON Serialization Format

Use System.Text.Json with simple DTOs mirroring `ReachabilityGraph`. For v1, you can serialize the domain model directly. The example assumes camelCase property names and string enum values, which System.Text.Json produces only when configured with `JsonNamingPolicy.CamelCase` and `JsonStringEnumConverter`.

Example structure (for reference only):

```json
{
  "nodes": {
    "binaries": [
      {
        "binaryId": "bin:sha256:...",
        "filePath": "/app/MyApi.exe",
        "format": "Pe",
        "buildId": null,
        "fileHash": "..."
      }
    ],
    "symbols": [
      {
        "symbolId": "sym:...",
        "normalizedName": "MyNamespace.MyType::MyMethod()",
        "kind": "Function",
        "purl": "pkg:nuget/MyLib@1.2.3",
        "symbolDigest": "..."
      }
    ]
  },
  "edges": [
    {
      "fromSymbolId": "sym:...",
      "toSymbolId": "sym:...",
      "edgeKind": "ImportCall",
      "confidence": "High",
      "calleePurl": "pkg:nuget/MyLib@1.2.3",
      "calleeSymbolDigest": "...",
      "site": {
        "binaryId": "bin:sha256:...",
        "offset": "0",
        "sourceFile": null,
        "sourceLine": null
      }
    }
  ]
}
```

---

## 11. Error Handling & Logging

* For unreadable or unsupported binaries:
  * Log a warning and continue.
* For parsing errors:
  * Catch exceptions, log with file path and format, continue with other files.
* For SBOM resolution failures:
  * Not an error; leave Purl as null.

Logs should at least include:

* Number of binaries discovered, parsed successfully, failed.
* Number of symbols and edges created.
* Number of edges with `CalleePurl` null vs non-null.

---

## 12. Test Plan (high-level)

1. **Unit tests** for:
   * `SymbolIdFactory` (deterministic digests).
   * `BinaryReachabilityService` with mocked parsers & SBOM resolver.
2. **Integration tests** (per platform) using small sample binaries:
   * A PE with one import (e.g. `MessageBoxA`).
   * An ELF binary calling `printf`.
   * A Mach-O binary with a simple imported function.
3. Check that:
   * Graph contains expected binaries and symbols.
   * Call edges exist and have correct `FromSymbolId` / `ToSymbolId`.
   * PURLs are attached when SBOM resolver is provided with matching entries.

---

If you want, as a next step I can break this into a concrete task list (Jira-style) for a single mid-level C# developer over 1–2 sprints, including approximate order and dependencies.

You can push these specs significantly closer to “best in class” by tightening a few correctness details, making some implicit assumptions explicit, and adding non-functional and quality gates.

Below is a structured list of concrete improvements you can fold back into the spec. I’ll focus on changes that materially affect correctness, extensibility, and implementation quality for an “average C# dev” without exploding complexity.

---

## 1. Clarify Non-Functional Requirements

Right now the spec is almost entirely functional. Add a short NFR section so the developer has explicit targets:

**Add a “Non-Functional Requirements” section:**

* **Performance**
  * Target scanning throughput, e.g. “On commodity hardware, aim for at least 50–100 MB/s of binaries scanned in static mode.”
  * Specify acceptable complexity: “All parsing operations must be linear in file size where possible; avoid quadratic algorithms over symbol tables.”
* **Memory**
  * Provide a rough upper bound, e.g. “Graph building must not exceed 512 MB RAM for 10k binaries with typical Linux container images.”
* **Thread safety**
  * Clarify: “All parser implementations must be stateless and thread-safe; `BinaryReachabilityService.BuildGraph` may scan binaries in parallel.”
* **Portability**
  * Minimum supported OS set (Windows, Linux, macOS) and CPU architectures (x86_64, ARM64); important because ELF/Mach-O vary.

This keeps the implementation from being “correct but unusably slow” and tells the dev what “good enough” looks like.

---

## 2. Fix and Strengthen Symbol Identity (Very Important)

The current spec uses `SymbolId = "sym:{digest}"`, where the digest is based only on the normalized name. That will collapse distinct functions that happen to share the same name/signature across different libraries/packages, which is unacceptable once you care about cross-component reachability.

**Improve the spec as follows:**

1. **Split “symbol node identity” from “canonical symbol key”:**
   * Keep a local identity that is always unique per binary:

     ```csharp
     public sealed record SymbolNode(
         string SymbolId,       // e.g. "sym:{binaryId}:{localIndex}"
         string NormalizedName,
         SymbolKind Kind,
         string? Purl,
         string SymbolDigest    // stable digest of NormalizedName
     );
     ```
   * Define a **canonical symbol key** struct for cross-binary grouping:

     ```csharp
     public readonly record struct CanonicalSymbolKey(
         string SymbolDigest, // sha256(normalizedName)
         string? Purl         // null for unknown package
     );
     ```
   * Inside `ReachabilityGraph`, add:

     ```csharp
     // Maps each canonical key to the SymbolIds of the nodes that share it
     // (value type assumed to be a list of SymbolId strings).
     public Dictionary<CanonicalSymbolKey, List<string>> CanonicalSymbolIndex { get; } = new();
     ```
2. **Clarify behavior:**
   * Never merge two `SymbolNode`s just because they share the same digest.
   * For “global reasoning” (e.g. “all call sites to the vulnerable function X from package Y”), use `CanonicalSymbolKey(SymbolDigest, Purl)`.
3. **Update `CallEdge`:**
   * Keep `FromSymbolId` and `ToSymbolId` as node IDs.
   * Include the canonical key in a dedicated field:

     ```csharp
     public sealed record CallEdge(
         string FromSymbolId,
         string ToSymbolId,
         EdgeKind EdgeKind,
         EdgeConfidence Confidence,
         CanonicalSymbolKey? CalleeKey,
         CallSite Site
     );
     ```

This single change prevents subtle and serious misattribution across libraries with overlapping APIs.

---

## 3. Explicit Build Identity Semantics (PE/ELF/Mach-O)

The spec currently says `BuildId` is “optional” and format-specific, but does not define **how** to compute it per format. Best-in-class means this is deterministic and documented.

**Extend the spec with a “Binary Identity” section:**

* **PE (Windows)**
  * `BuildId` = PDB GUID + Age if available (from CodeView debug directory).
  * If PDB info is missing, set `BuildId = null` and rely on `FileHash`.
* **ELF (Linux)**
  * `BuildId` = contents of `.note.gnu.build-id` if present.
* **Mach-O (macOS)**
  * `BuildId` = UUID from `LC_UUID` load command.

Also specify:

* **Primary identity order**: `(BuildId, FileHash)`; if `BuildId` is null, use `FileHash` only.
* SBOM resolvers MUST treat `(BuildId, FileHash)` as the canonical key to map binaries to components, with file path only as a hint.

This gives you robust correlation between SBOM entries and binaries, across containers and file renames.

---

## 4. Enrich the Edge Model and Call Site Semantics

For precision and debuggability, specify what edges mean more rigorously.

**Add fields and definitions:**

1. **Direction and type:**
   Add a small discriminator describing the origin of the edge:

   ```csharp
   public enum EdgeSource
   {
       ImportTable, // import thunk / PLT / stub
       Relocation,  // relocation to symbol
       Disassembly, // decoded CALL / BL / JAL
       Metadata,    // .NET metadata, DWARF, etc.
       Other
   }
   ```

   Extend `CallEdge`:

   ```csharp
   public sealed record CallEdge(
       string FromSymbolId,
       string ToSymbolId,
       EdgeKind EdgeKind,
       EdgeConfidence Confidence,
       EdgeSource Source,
       CanonicalSymbolKey? CalleeKey,
       CallSite Site
   );
   ```
2. **Intra- vs inter-binary**
   * Define: `Site.BinaryId` always refers to the binary containing the call instruction.
   * Intra-binary edge: `FromSymbol` and `ToSymbol` share same `BinaryId`.
   * Inter-binary edge: otherwise.
3. **Unknown or unresolved callees**
   * Do not drop unresolved calls; add a special `UnknownSymbolNode` per binary:
     * `NormalizedName = "<unknown>"`, `Kind = SymbolKind.Unknown`, `Purl = null`.
   * Edges to unknown must have `Confidence = EdgeConfidence.Low`.

This lets downstream consumers distinguish “we are sure this is a call to libX.Y” from “we saw a call but do not know to where”.

---

## 5. Strengthen Symbol Normalization Rules (Demangling etc.)

For best-in-class results, you want reproducible signatures independent of compiler version, and you want to unify mangled C++/Rust/etc. names.

**Extend the `SymbolIdFactory` spec with clear rules:**

1. **Language-agnostic core**
   * Always:
     * Demangle if possible.
     * Normalize whitespace.
     * Normalize namespace separators to `.` and member separator to `::`.
     * Remove address/offset suffixes embedded in names.
2. **Format- and language-specific guidance**
   * For C/C++ (MSVC / Itanium ABI):
     * Use a demangler (your own or library) to get `retType namespace.Type::Func(paramTypes...)`.
     * Omit return type in normalization to make signatures more stable: `namespace.Type::Func(paramTypes...)`.
   * For Rust:
     * Strip hash suffixes from symbol name.
     * Use “crate::module::Type::func(params...)” pattern where possible.
   * For Go:
     * Normalize from `runtime.main_main` → `runtime.main.main` etc.
   * For .NET (if/when you add managed parsing later):
     * Use fully qualified CLR names: `Namespace.Type::Method(ParamType1,ParamType2)`.
3. **Document stability guarantees**
   * Given identical source (function name + parameter list), the `SymbolDigest` must remain stable across builds, architectures, optimization levels, and link addresses.
   * If demangling fails, fall back to the raw name but strip obvious hashes if safe.

Specify this in prose and keep the implementation flexible, but the rules must be clear enough that two developers implementing the parser will produce the same digest for the same symbol.

---

## 6. More Precise SBOM & PURL Resolution Behavior

The SBOM integration is crucial to StellaOps; push this further so it is deterministic and auditable.

**Extend `ISbomComponentResolver` behavior:**

1. **Resolution order**
   Document a strict order:
   1. `(BuildId, FileHash)` match.
   2. `FileHash` only.
   3. Normalized file path if SBOM has explicit path mapping.
   4. Library name fallback via `ResolvePurlByLibraryName`.
2. **Multiple SBOMs and conflicts**
   * Allow multiple SBOM sources; if two SBOMs claim different purls for the same `(BuildId, FileHash)`, define a policy:
     * e.g. fail fast with a “conflicting SBOM” error; or choose a deterministic priority order.
3. **Library name mapping contract**
   Add a small DTO to make the mapping explicit:

   ```csharp
   public sealed record LibraryReference(
       string BinaryId,
       string LibraryName,  // "libssl.so.3" / "KERNEL32.dll"
       string? ResolvedPath // if the loader path is known
   );
   ```

   Extend `IBinaryParser` with:

   ```csharp
   IReadOnlyList<LibraryReference> ParseLibraryReferences(BinaryNode binary);
   ```

   Then describe how `BinaryReachabilityService` uses those to call `ResolvePurlByLibraryName`.
4. **Unknown purls**
   * Require that unknowns are explicit:
     * When `ResolvePurlForBinary` returns null, store `Purl = null` and flag this in logs: “No SBOM component for binary X (BuildId=..., Hash=...)”.

This ensures SBOM resolution remains a traceable, deterministic step rather than a best-effort guess.

---

## 7. Explicit JSON Schema & Versioning

For replayability and compatibility, define a clear JSON schema and version.

**Add:**

* A top-level metadata section:

  ```json
  {
    "schemaVersion": "1.0.0",
    "generatedAt": "2025-11-20T12:34:56Z",
    "tool": "StellaOps.Scanner.BinaryReachability",
    "toolVersion": "1.0.0",
    "graph": { ... }
  }
  ```
* Commit to:
  * Only additive changes in minor versions.
  * Backwards-compatible changes within the same major version.
* If you change anything structural (e.g. how symbol IDs work), bump `schemaVersion` major.

Optionally, provide a compact JSON schema file (or at least a documented shape) so other teams can implement readers in other languages.

---

## 8. Concurrency, Streaming, and Large Images

For best-in-class scalability, specify how large images are handled.

**Clarify in the spec:**

1. **Parallelization**
   * `BinaryReachabilityService.BuildGraph`:
     * May scan binaries in parallel using `Parallel.ForEach`.
     * All parsers must be thread-safe and not rely on shared mutable state.
     * Writes into `ReachabilityGraph` must be synchronized, or per-binary results merged after the parallel phase, because `Dictionary` and `List` are not thread-safe.
2. **Streaming option (optional but recommended)**
   * Provide a second API for very large repositories:

     ```csharp
     public interface IGraphSink
     {
         void OnBinary(BinaryNode binary);
         void OnSymbol(SymbolNode symbol);
         void OnEdge(CallEdge edge);
     }

     // Streaming entry point (e.g. an additional member of IBinaryReachabilityService):
     void BuildGraphStreaming(string rootDirectory, ISbomComponentResolver sbomResolver, IGraphSink sink);
     ```
   * This allows building graphs into a database or message bus without keeping everything in memory.

Even if you do not implement streaming immediately, designing the interface now keeps the architecture future-proof.

---

## 9. Observability and Diagnostics

A best-in-class implementation requires good introspection for debugging wrong reachability conclusions.

**Specify minimal observability requirements:**

* **Logging**
  * At least:
    * Info: number of binaries, symbols, edges, time taken.
    * Warning: unsupported binary formats, SBOM resolution failures, demangling failures.
    * Error: parser exceptions per file (with file path and format).
* **Debug artifacts**
  * Optional environment or flag that dumps per-binary debug info:
    * Raw symbol table (names + addresses).
    * Normalized names and digests.
    * Library references.
    * Call edges for that binary.
* **Metrics hooks**
  * Provide a simple interface for metrics:

    ```csharp
    public interface IReachabilityMetrics
    {
        void IncrementCounter(string name, long value = 1);
        void ObserveDuration(string name, TimeSpan duration);
    }
    ```

    And allow `BinaryReachabilityService` to be constructed with an optional metrics implementation.

---

## 10. Expanded Test Strategy and Quality Gates

Your test plan is decent but can be made more systematic.

**Extend test plan:**

1. **Golden corpus**
   * Maintain a small but curated set of PE/ELF/Mach-O binaries (checked in or generated) where:
     * Expected symbols and edges are stored as JSON.
     * CI compares current output with the golden graph byte-for-byte (or structurally).
2. **Cross-compiler coverage**
   * At least:
     * C/C++ built by different toolchains (MSVC, clang, gcc).
     * Different optimization levels (`-O0`, `-O2`) to ensure stability of parsing.
3. **Fuzzing / robustness**
   * Create tests with truncated / corrupted binaries to ensure:
     * No crashes.
     * Meaningful, bounded error behavior.
4. **SBOM integration tests**
   * For a test root directory:
     * Synthetic SBOM mapping files to binaries.
     * Validate correct purl assignment and conflict handling.
5. **Determinism tests**
   * Run `BuildGraph` twice on the same directory and assert that:
     * Graph is structurally identical (including order-independent comparison).
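
The order-independent comparison in the determinism test could be sketched as follows. This is a minimal illustration under stated assumptions, not part of the spec: `GraphFingerprint` and `Compute` are hypothetical names, and edges are reduced to plain `(From, To, Kind)` string tuples rather than the `CallEdge` record so the snippet stands alone. The idea is to sort a canonical text form of every edge before hashing, so two runs that emit the same edges in a different order yield the same fingerprint.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper (not defined in the spec): an order-independent
// fingerprint over a set of call edges.
public static class GraphFingerprint
{
    // Each edge is simplified to (From, To, Kind) strings for illustration.
    public static string Compute(IEnumerable<(string From, string To, string Kind)> edges)
    {
        // Canonical text form per edge, sorted ordinally so that discovery
        // order (for example from parallel scanning) does not affect the result.
        // Assumes the IDs themselves never contain '|' or a newline.
        var lines = edges
            .Select(e => $"{e.From}|{e.To}|{e.Kind}")
            .OrderBy(line => line, StringComparer.Ordinal);

        var payload = string.Join("\n", lines);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(payload));
        return Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```

A determinism test would then run `BuildGraph` twice, project each graph's edges into this tuple form, and assert that the two fingerprints are equal. The same approach extends to binaries and symbols by fingerprinting each collection separately.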

This makes it much harder for regressions to slip in when you extend parsers or normalization.

---

## 11. Clear Extension Points and Roadmap Notes

Finally, add a short “Future Extensions” section so the dev knows what to keep in mind when structuring code:

* Support for:
  * Inlined function tracking (via DWARF/PDB).
  * Managed .NET assemblies’ metadata (C# IL call graph).
  * Dynamic edge sources (runtime traces) merged into the same graph.
* The spec should instruct: “Design parsers and the graph model so they can accept additional `EdgeSource` types and symbol metadata without breaking existing consumers.”

That gives the current implementation a clear direction and prevents design dead ends.

---

If you want, I can now take your current spec and rewrite it into a single, consolidated “v1.0 Best-in-Class Spec” document that incorporates all of the above changes, so you can hand it directly to an implementation team.