git.stella-ops.org/docs/product-advisories/archived/20-Nov-2025 - Encoding Binary Reachability with PURL‑Resolved Edges.md
2025-11-23 17:18:17 +02:00


Here's a simple, practical way to think about binary reachability that cleanly joins call graphs with SBOMs, without reusing external tools.


The big idea (plain English)

  • Each function call edge in a binary's call graph is annotated with:

    • a purl (package URL) identifying which component the callee belongs to, and
    • a symbol digest (stable hash of the callee's normalized symbol signature).
  • With those two tags, call graphs from PE/ELF/Mach-O can be merged across binaries and mapped onto your SBOM components, giving a single vulnerability graph that answers: “Is this vulnerable function reachable in my deployment?”


Why this matters for StellaOps

  • One graph to rule them all: Libraries used by multiple services merge naturally via the same purl, so you see cross-service blast radius instantly.
  • Deterministic & auditable: Digests + purls make edges reproducible (great for “replayable scans” and audit trails).
  • Zero tool reuse required: You can implement PE/ELF/Mach-O parsing once in C# and still interoperate with SBOM/VEX ecosystems via purls.

Minimal data model

{
  "nodes": [
    {"id":"sym:hash:callee","kind":"symbol","purl":"pkg:nuget/Newtonsoft.Json@13.0.3","sig":"Newtonsoft.Json.JsonConvert::DeserializeObject<T>(string)"},
    {"id":"bin:hash:myapi","kind":"binary","format":"pe","name":"MyApi.exe","build":"sha256:..."}
  ],
  "edges": [
    {
      "from":"sym:hash:caller",
      "to":"sym:hash:callee",
      "etype":"calls",
      "purl":"pkg:nuget/Newtonsoft.Json@13.0.3",
      "sym_digest":"sha256:SYM_CALLEE",
      "site":{"binary":"bin:hash:myapi","offset":"0x0041AFD0"}
    }
  ],
  "sbom": [
    {"purl":"pkg:nuget/Newtonsoft.Json@13.0.3","component_id":"c-123","files":["/app/MyApi.exe"] }
  ]
}

How to build it (C#-centric, binary-first)

  1. Lift symbols per format

    • PE: parse COFF + PDB (if present), fallback to export tables; normalize “namespace.type::method(sig)”.
    • ELF: .dynsym/.symtab + DWARF (if present); demangle (Itanium/LLVM rules).
    • Mach-O: LC_SYMTAB + DWARF; demangle.
  2. Compute symbol digests

    • Hash of normalized signature + (optionally) instruction fingerprint for resilience to addresses.
  3. Build intra-binary call graph

    • Conservative static: function→function edges from import thunks, relocation targets, and lightweight disassembly of direct calls.
    • Optional dynamic refinement: PERF/eBPF or ETW traces to mark observed edges.
  4. Resolve each callee to a purl

    • Map import/segment to owning file → map file to SBOM component → emit its purl.
    • If multiple candidates, emit edge with a small candidates[] set; policy later can prune.
  5. Merge graphs across binaries

    • Union by (purl, sym_digest) for callees; keep multiple site locations.
  6. Attach vulnerabilities

    • From VEX/CVE → affected package purls → mark reachable if any path exists from entrypoints to a vulnerable (purl, sym_digest).
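
Step 6 boils down to a graph search from entrypoints. A minimal sketch, assuming edges have already been reduced to (from, to) symbol-ID pairs and that the vulnerable symbol IDs were derived from the affected (purl, sym_digest) pairs; ReachabilitySketch and FindReachable are illustrative names, not an existing StellaOps API.

```csharp
using System.Collections.Generic;

// Hypothetical helper: breadth-first search from entrypoints over "calls" edges.
public static class ReachabilitySketch
{
    // Returns the vulnerable symbol IDs reachable from at least one entrypoint.
    public static HashSet<string> FindReachable(
        IEnumerable<(string From, string To)> edges,
        IEnumerable<string> entrypoints,
        ISet<string> vulnerable)
    {
        // Adjacency list: caller -> callees.
        var adjacency = new Dictionary<string, List<string>>();
        foreach (var (from, to) in edges)
        {
            if (!adjacency.TryGetValue(from, out var callees))
                adjacency[from] = callees = new List<string>();
            callees.Add(to);
        }

        var visited = new HashSet<string>();
        var queue = new Queue<string>();
        foreach (var entry in entrypoints)
            if (visited.Add(entry)) queue.Enqueue(entry);

        var hits = new HashSet<string>();
        while (queue.Count > 0)
        {
            var current = queue.Dequeue();
            if (vulnerable.Contains(current)) hits.Add(current);
            if (!adjacency.TryGetValue(current, out var next)) continue;
            foreach (var callee in next)
                if (visited.Add(callee)) queue.Enqueue(callee);
        }
        return hits;
    }
}
```

The visited set keeps the search linear in nodes plus edges even when the call graph contains cycles (recursion, mutual calls).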

Practical policies that work well

  • Entrypoints: ASP.NET controller actions, Main, exported handlers, cron entry shims.
  • Edge confidence: tag edges as import, reloc, disasm, or runtime; prefer runtime in prioritization.
  • Unknowns registry: if a symbol can't be resolved, record purl:"pkg:unknown" with a reason (stripped, obfuscated, thunk), so it's visible rather than silently dropped.

Quick win you can ship first

  • Start with imports-only reachability (no disassembly). For most CVEs in popular packages, imports + SBOM mapping already highlights real risk.
  • Add light disassembly for direct call opcodes later to improve precision.

If you want, I can turn this into a ready-to-drop .NET 10 library skeleton: parsers (PE/ELF/Mach-O), symbol normalizer, digestor, graph model, and SBOM mapper with purl resolvers.

Below is a concrete, implementation-ready specification aimed at a solid, “average” C# developer. The goal is that they can build this module without knowing all of StellaOps' context.


1. Purpose and Scope

Implement a reusable .NET library that:

  1. Reads binaries (PE, ELF, Mach-O).

  2. Extracts functions/symbols and their call relationships (call graph).

  3. Annotates each call edge with:

    • The callee's purl (package URL / SBOM component).
    • A symbol digest (stable function identifier).
  4. Produces a reachability graph in memory and as JSON.

This will be used by other StellaOps services (Scanner / Sbomer / Vexer) to answer: “Is this vulnerable function from package X reachable in my environment?”

Non-goals for v1:

  • No dynamic tracing (no eBPF, no ETW). Static only.
  • No external CLI tools (no objdump, llvm-nm, etc.). Everything in-process and in C#.

2. Project Structure

Create a new class library:

  • Project: StellaOps.Scanner.BinaryReachability
  • TargetFramework: net10.0
  • Nullable: enable
  • Language: latest C# available for .NET 10

Recommended namespaces:

  • StellaOps.Scanner.BinaryReachability
  • StellaOps.Scanner.BinaryReachability.Model
  • StellaOps.Scanner.BinaryReachability.Parsing
  • StellaOps.Scanner.BinaryReachability.Parsing.Pe
  • StellaOps.Scanner.BinaryReachability.Parsing.Elf
  • StellaOps.Scanner.BinaryReachability.Parsing.MachO
  • StellaOps.Scanner.BinaryReachability.Sbom
  • StellaOps.Scanner.BinaryReachability.Graph

3. Core Domain Model

3.1 Enumerations

namespace StellaOps.Scanner.BinaryReachability.Model;

public enum BinaryFormat
{
    Pe,
    Elf,
    MachO
}

public enum SymbolKind
{
    Function,
    Method,
    Constructor,
    Destructor,
    ImportStub,
    Thunk,
    Unknown
}

public enum EdgeKind
{
    DirectCall,
    IndirectCall,
    ImportCall,
    ConstructorInit,   // e.g. .init_array
    Other
}

public enum EdgeConfidence
{
    High,       // import, relocation, clear direct call
    Medium,     // best-effort disassembly
    Low         // heuristics, fallback
}

3.2 Node and Edge Records

namespace StellaOps.Scanner.BinaryReachability.Model;

public sealed record BinaryNode(
    string BinaryId,             // e.g. "bin:sha256:..."
    string FilePath,             // path in image or filesystem
    BinaryFormat Format,
    string? BuildId,             // ELF build-id, Mach-O UUID, PE pdb-signature (optional)
    string FileHash              // sha256 of binary bytes
);

public sealed record SymbolNode(
    string SymbolId,             // stable within this graph: "sym:{digest}"
    string NormalizedName,       // normalized signature/name
    SymbolKind Kind,
    string? Purl,                // nullable: may be unknown
    string SymbolDigest          // sha256 of normalized name
);

3.3 Call Edge and Call Site

namespace StellaOps.Scanner.BinaryReachability.Model;

public sealed record CallSite(
    string BinaryId,
    ulong Offset,                // RVA / file offset
    string? SourceFile,          // Optional, if we can resolve
    int? SourceLine              // Optional
);

public sealed record CallEdge(
    string FromSymbolId,
    string ToSymbolId,
    EdgeKind EdgeKind,
    EdgeConfidence Confidence,
    string? CalleePurl,          // resolved package of callee
    string CalleeSymbolDigest,   // same as target SymbolDigest
    CallSite Site
);

3.4 Graph Container

namespace StellaOps.Scanner.BinaryReachability.Graph;

using StellaOps.Scanner.BinaryReachability.Model;

public sealed class ReachabilityGraph
{
    public Dictionary<string, BinaryNode> Binaries { get; } = new();
    public Dictionary<string, SymbolNode> Symbols { get; } = new();
    public List<CallEdge> Edges { get; } = new();

    public void AddBinary(BinaryNode binary) => Binaries[binary.BinaryId] = binary;
    public void AddSymbol(SymbolNode symbol) => Symbols[symbol.SymbolId] = symbol;
    public void AddEdge(CallEdge edge) => Edges.Add(edge);
}

4. Public API (what other modules call)

Define a simple facade service that other StellaOps components use.

namespace StellaOps.Scanner.BinaryReachability;

using StellaOps.Scanner.BinaryReachability.Graph;
using StellaOps.Scanner.BinaryReachability.Model;
using StellaOps.Scanner.BinaryReachability.Sbom;

public interface IBinaryReachabilityService
{
    /// <summary>
    /// Builds a reachability graph for all binaries in the given directory (e.g. unpacked container filesystem),
    /// using SBOM data to resolve PURLs.
    /// </summary>
    ReachabilityGraph BuildGraph(
        string rootDirectory,
        ISbomComponentResolver sbomResolver);

    /// <summary>
    /// Serialize the graph to JSON for persistence / later replay.
    /// </summary>
    string SerializeGraph(ReachabilityGraph graph);
}

Implementation class:

public sealed class BinaryReachabilityService : IBinaryReachabilityService
{
    // Will compose format-specific parsers and SBOM resolver inside.
}

5. SBOM Component Resolver

We need only a minimal interface to attach PURLs to binaries and symbols.

namespace StellaOps.Scanner.BinaryReachability.Sbom;

public interface ISbomComponentResolver
{
    /// <summary>
    /// Resolve the purl for a binary file (by path or build-id).
    /// Return null if not found.
    /// </summary>
    string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash);

    /// <summary>
    /// Optional: resolve purl by a library name only (e.g. "libssl.so.3", "libcrypto.so.3").
    /// Used when we have imports but not full path.
    /// </summary>
    string? ResolvePurlByLibraryName(string libraryName);
}

For the C# dev:

  • Implementation will consume CycloneDX/SPDX SBOMs that already map files (hash/path/buildId) to components and purls.

  • For v1, a simple resolver that:

    • Loads SBOM JSON.

    • Indexes components by:

      • File path (normalized).
      • File hash.
      • BuildId where available.
    • Implements the two methods above using dictionary lookups.
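
The dictionary-based resolver described above might look roughly like this. It is a sketch under stated assumptions: SbomFileEntry is a hypothetical, already-flattened row (a real loader would produce such rows from CycloneDX/SPDX JSON), DictionarySbomResolver is an illustrative name, and the interface is repeated only so the block compiles on its own.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public interface ISbomComponentResolver // repeated from the spec for self-containment
{
    string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash);
    string? ResolvePurlByLibraryName(string libraryName);
}

// Hypothetical flattened SBOM row: one file belonging to one component.
public sealed record SbomFileEntry(string Purl, string? Path, string? FileHash, string? BuildId);

public sealed class DictionarySbomResolver : ISbomComponentResolver
{
    private readonly Dictionary<string, string> _byBuildId = new(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<string, string> _byHash = new(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<string, string> _byPath = new(StringComparer.Ordinal);
    private readonly Dictionary<string, string> _byLibraryName = new(StringComparer.OrdinalIgnoreCase);

    public DictionarySbomResolver(IEnumerable<SbomFileEntry> entries)
    {
        foreach (var e in entries)
        {
            // TryAdd keeps the first entry for a key, so later duplicates never overwrite it.
            if (e.BuildId is not null) _byBuildId.TryAdd(e.BuildId, e.Purl);
            if (e.FileHash is not null) _byHash.TryAdd(e.FileHash, e.Purl);
            if (e.Path is not null)
            {
                var path = NormalizePath(e.Path);
                _byPath.TryAdd(path, e.Purl);
                _byLibraryName.TryAdd(FileNameOf(path), e.Purl);
            }
        }
    }

    // Lookup order: build-id, then file hash, then normalized path.
    public string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash)
    {
        if (buildId is not null && _byBuildId.TryGetValue(buildId, out var purl)) return purl;
        if (_byHash.TryGetValue(fileHash, out purl)) return purl;
        return _byPath.TryGetValue(NormalizePath(filePath), out purl) ? purl : null;
    }

    public string? ResolvePurlByLibraryName(string libraryName)
        => _byLibraryName.TryGetValue(libraryName, out var purl) ? purl : null;

    private static string NormalizePath(string path) => path.Replace('\\', '/');

    private static string FileNameOf(string normalizedPath)
        => normalizedPath[(normalizedPath.LastIndexOf('/') + 1)..];
}
```

Keeping the first entry for a duplicated key is an arbitrary choice made for this sketch; a production resolver would need an explicit conflict policy.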


6. Binary Parsing Abstractions

6.1 Common Interface

namespace StellaOps.Scanner.BinaryReachability.Parsing;

using StellaOps.Scanner.BinaryReachability.Model;

public interface IBinaryParser
{
    bool CanParse(string filePath, ReadOnlySpan<byte> header);

    /// <summary>
    /// Parse basic binary metadata: format, build-id, file-hash already computed by caller.
    /// </summary>
    BinaryNode ParseBinaryMetadata(string filePath, string fileHash);

    /// <summary>
    /// Parse functions/symbols from this binary.
    /// Return a list of SymbolNode with Purl left null (will be set later).
    /// </summary>
    IReadOnlyList<SymbolNode> ParseSymbols(BinaryNode binary);

    /// <summary>
    /// Build intra-binary call edges (from this binary's functions to others), without PURL info.
    /// ToSymbolId should be based on SymbolDigest; PURL will be attached later.
    /// </summary>
    IReadOnlyList<CallEdge> ParseCallGraph(BinaryNode binary, IReadOnlyList<SymbolNode> symbols);
}

6.2 Parser Implementations

Create three concrete parsers:

  • PeBinaryParser in Parsing.Pe
  • ElfBinaryParser in Parsing.Elf
  • MachOBinaryParser in Parsing.MachO

And a small factory:

public sealed class BinaryParserFactory
{
    private readonly List<IBinaryParser> _parsers;

    public BinaryParserFactory()
    {
        _parsers = new List<IBinaryParser>
        {
            new Pe.PeBinaryParser(),
            new Elf.ElfBinaryParser(),
            new MachO.MachOBinaryParser()
        };
    }

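    // ReadOnlySpan<byte> is a ref struct and cannot be captured by a lambda,
    // so iterate explicitly instead of calling FirstOrDefault.
    public IBinaryParser? GetParser(string filePath, ReadOnlySpan<byte> header)
    {
        foreach (var parser in _parsers)
        {
            if (parser.CanParse(filePath, header))
                return parser;
        }

        return null;
    }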
}

7. Symbol Normalization and Digesting

Create a small helper for consistent symbol IDs.

namespace StellaOps.Scanner.BinaryReachability.Model;

public static class SymbolIdFactory
{
    public static string ComputeNormalizedName(string rawName)
        => rawName.Trim();  // v1: minimal; later we can extend (demangling, etc.)

    public static string ComputeSymbolDigest(string normalizedName)
    {
        using var sha = System.Security.Cryptography.SHA256.Create();
        var bytes = System.Text.Encoding.UTF8.GetBytes(normalizedName);
        var hash = sha.ComputeHash(bytes);
        var hex = Convert.ToHexString(hash).ToLowerInvariant();
        return hex;
    }

    public static string CreateSymbolId(string symbolDigest)
        => $"sym:{symbolDigest}";
}

Usage in parsers:

  • For each function name the parser finds:

    • normalizedName = SymbolIdFactory.ComputeNormalizedName(rawName);
    • digest = SymbolIdFactory.ComputeSymbolDigest(normalizedName);
    • symbolId = SymbolIdFactory.CreateSymbolId(digest);
    • Create SymbolNode.

Notes for developer:

  • Do not include file path or address in the digest (we want determinism across builds).
  • In the future we can expand normalization to include demangled signatures and parameter types.

8. Building the Graph (step-by-step)

Implementation of BinaryReachabilityService.BuildGraph should follow this algorithm.

8.1 Scan Files

  1. Recursively enumerate all files under rootDirectory.

  2. For each file:

    • Open as stream.
    • Read first 4-8 bytes as header.
    • Try BinaryParserFactory.GetParser.
    • If no parser, skip file.
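
Format detection from the leading magic bytes can be sketched as below. MagicSniffer and DetectedFormat are hypothetical names; the spec's CanParse implementations could share similar logic. The sketch deliberately ignores fat/universal Mach-O files, and for PE it checks only the "MZ" stub, since confirming the "PE\0\0" signature means following e_lfanew at offset 0x3C, which lies beyond a short header.

```csharp
using System;
using System.Buffers.Binary;

public enum DetectedFormat { Unknown, Pe, Elf, MachO }

// Hypothetical helper: classify a file by its first bytes only.
public static class MagicSniffer
{
    public static DetectedFormat Detect(ReadOnlySpan<byte> header)
    {
        // PE images begin with the DOS stub signature "MZ".
        if (header.Length >= 2 && header[0] == (byte)'M' && header[1] == (byte)'Z')
            return DetectedFormat.Pe;

        if (header.Length < 4)
            return DetectedFormat.Unknown;

        // ELF: 0x7F 'E' 'L' 'F'.
        if (header[0] == 0x7F && header[1] == (byte)'E' &&
            header[2] == (byte)'L' && header[3] == (byte)'F')
            return DetectedFormat.Elf;

        // Mach-O stores its magic in the file's own byte order, so read the bytes
        // as big-endian and accept both the direct and the byte-swapped values.
        uint magic = BinaryPrimitives.ReadUInt32BigEndian(header);
        return magic switch
        {
            0xFEEDFACE or 0xFEEDFACF => DetectedFormat.MachO, // big-endian 32/64-bit
            0xCEFAEDFE or 0xCFFAEDFE => DetectedFormat.MachO, // little-endian 32/64-bit
            _ => DetectedFormat.Unknown
        };
    }
}
```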

8.2 Parse Binary Metadata and Symbols

For each parseable file:

  1. Compute SHA256 of file content → fileHash.

  2. parser.ParseBinaryMetadata(filePath, fileHash) → BinaryNode.

  3. Add BinaryNode to ReachabilityGraph.Binaries.

  4. parser.ParseSymbols(binary) → list of SymbolNode.

  5. For each symbol:

    • Add to ReachabilityGraph.Symbols if not already present:

      • Key: SymbolId.
      • If existing, keep first or merge (for v1: keep first).

Maintain an in-memory index:

// symbolDigest -> SymbolNode
Dictionary<string, SymbolNode> symbolsByDigest;

8.3 Parse Call Graph per Binary

For each binary:

  1. parser.ParseCallGraph(binary, itsSymbols) → edges (without PURL attached).

  2. For each edge:

    • Ensure FromSymbolId and ToSymbolId correspond to known SymbolNode:

      • ToSymbolId should be sym:{digest} for the callee.
    • Add edge to ReachabilityGraph.Edges.

At this point, edges know only FromSymbolId, ToSymbolId, kind, confidence, and CallSite.

8.4 Attach PURLs

Now run a second pass to attach PURLs to symbols and edges:

  1. For each BinaryNode:

    • Call sbomResolver.ResolvePurlForBinary(binary.FilePath, binary.BuildId, binary.FileHash).
    • If not null, this is the binary's own purl (used for "who owns these functions").
  2. Maintain:

Dictionary<string, string?> binaryPurlsById; // BinaryId -> purl?
  3. For each CallEdge:

    • Get callee symbol:

      • var symbol = graph.Symbols[edge.ToSymbolId];
    • If symbol.Purl is null:

      • If callee is local (same binary; the parser may mark it via metadata or CallSite.BinaryId):

        • Assign symbol.Purl = binaryPurlsById[callSite.BinaryId] (can be null).
      • If callee is imported from an external library:

        • Parser should provide library name in NormalizedName or additional metadata (for v1, you can store library in a separate structure).
        • Use sbomResolver.ResolvePurlByLibraryName(libraryName) to find purl.
        • Set symbol.Purl to that value (even if null).
    • Set edge.CalleePurl = symbol.Purl.

    • Set edge.CalleeSymbolDigest = symbol.SymbolDigest.

Note: For v1 you can simplify:

  • Assume all callees in this binary belong to the binary's purl.
  • Later, extend to per-library mapping.
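
Because the spec declares SymbolNode and CallEdge as positional records, their properties are init-only, so "setting" a purl means replacing the stored instance through a with-expression. A minimal sketch of the simplified v1 pass, using trimmed-down copies of the records (SiteBinaryId stands in for Site.BinaryId, and PurlAttacher is a hypothetical name):

```csharp
#nullable enable
using System.Collections.Generic;

// Trimmed copies of the spec's records, kept only to make the sketch self-contained.
public sealed record SymbolNode(string SymbolId, string NormalizedName, string? Purl, string SymbolDigest);
public sealed record CallEdge(string FromSymbolId, string ToSymbolId, string? CalleePurl,
                              string CalleeSymbolDigest, string SiteBinaryId);

public static class PurlAttacher
{
    // v1 simplification: a callee without a purl inherits the purl of the binary containing the call site.
    public static void Attach(
        Dictionary<string, SymbolNode> symbols,
        List<CallEdge> edges,
        IReadOnlyDictionary<string, string?> binaryPurlsById)
    {
        for (var i = 0; i < edges.Count; i++)
        {
            var edge = edges[i];
            if (!symbols.TryGetValue(edge.ToSymbolId, out var symbol))
                continue; // callee not in the graph: leave the edge untouched

            if (symbol.Purl is null)
            {
                binaryPurlsById.TryGetValue(edge.SiteBinaryId, out var purl);
                symbol = symbol with { Purl = purl };   // records are immutable: copy, then store
                symbols[symbol.SymbolId] = symbol;
            }

            edges[i] = edge with
            {
                CalleePurl = symbol.Purl,
                CalleeSymbolDigest = symbol.SymbolDigest
            };
        }
    }
}
```

The sketch assumes the Symbols dictionary is keyed by SymbolId, as in the ReachabilityGraph container.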

9. Format-Specific Minimum Requirements

For each parser, aim for this minimum.

9.1 PE Parser (Windows)

Tasks:

  1. Identify PE by MZ + PE header.

  2. Extract:

    • Machine type.
    • Optional: PDB signature / age (for potential BuildId in the future).
  3. Symbols:

    • Use export table for exported functions.
    • Use import table for imported functions (these represent edges from this binary to others).
  4. Call graph:

    • For v1: edges from each local function to imported functions via import table.
    • Later: add simple disassembly of .text section to detect intra-binary calls.

Practical approach:

  • Use System.Reflection.PortableExecutable if possible, or a small custom PE reader.
  • Represent imported function name as "<dllName>!<functionName>" in NormalizedName.

9.2 ELF Parser (Linux)

Tasks:

  1. Detect ELF by magic 0x7F 'E' 'L' 'F'.

  2. Extract:

    • BuildId (from .note.gnu.build-id if present).
    • Architecture.
  3. Symbols:

    • Read .dynsym (dynamic symbols) and .symtab if present.
    • Functions only (symbol type FUNC).
  4. Call graph (minimum):

    • Imports via PLT/GOT entries (function calls to shared libs).
    • Map symbol names to SymbolNode as above.

Implementation:

  • Write a simple ELF reader: parse header, section headers, locate .dynsym, .dynstr, .symtab, .strtab, .note.gnu.build-id.

9.3 Mach-O Parser (macOS)

Tasks:

  1. Detect Mach-O via magic (0xFEEDFACE, 0xFEEDFACF, etc.).

  2. Extract:

    • UUID (LC_UUID) as BuildId equivalent.
  3. Symbols:

    • Use LC_SYMTAB and associated string table.
  4. Call graph:

    • Similar approach as ELF for imports; minimum: cross-binary call edges via import stubs.

Implementation:

  • Minimal Mach-O parser: read load commands, find LC_SYMTAB and LC_UUID.

10. JSON Serialization Format

Use System.Text.Json with simple DTOs mirroring ReachabilityGraph. For v1, you can serialize the domain model directly.

Example structure (for reference only):

{
  "nodes": {
    "binaries": [
      { "binaryId": "bin:sha256:...", "filePath": "/app/MyApi.exe", "format": "Pe", "buildId": null, "fileHash": "..." }
    ],
    "symbols": [
      { "symbolId": "sym:...", "normalizedName": "MyNamespace.MyType::MyMethod()", "kind": "Function", "purl": "pkg:nuget/MyLib@1.2.3", "symbolDigest": "..." }
    ]
  },
  "edges": [
    {
      "fromSymbolId": "sym:...",
      "toSymbolId": "sym:...",
      "edgeKind": "ImportCall",
      "confidence": "High",
      "calleePurl": "pkg:nuget/MyLib@1.2.3",
      "calleeSymbolDigest": "...",
      "site": { "binaryId": "bin:sha256:...", "offset": "0", "sourceFile": null, "sourceLine": null }
    }
  ]
}
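
System.Text.Json defaults to PascalCase property names and numeric enum values, so producing the camelCase names and string enums shown in the example requires explicit options. A minimal sketch with a trimmed BinaryNode; GraphJson is a hypothetical helper name.

```csharp
#nullable enable
using System.Text.Json;
using System.Text.Json.Serialization;

public enum BinaryFormat { Pe, Elf, MachO }

public sealed record BinaryNode(
    string BinaryId, string FilePath, BinaryFormat Format, string? BuildId, string FileHash);

public static class GraphJson
{
    // camelCase property names plus enums written as strings ("Pe", "Elf", "MachO").
    public static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        Converters = { new JsonStringEnumConverter() }
    };

    public static string Serialize(BinaryNode node)
        => JsonSerializer.Serialize(node, Options);

    public static BinaryNode? Deserialize(string json)
        => JsonSerializer.Deserialize<BinaryNode>(json, Options);
}
```

Reusing one static options instance avoids rebuilding serializer metadata on every call.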

11. Error Handling & Logging

  • For unreadable or unsupported binaries:

    • Log a warning and continue.
  • For parsing errors:

    • Catch exceptions, log with file path and format, continue with other files.
  • For SBOM resolution failures:

    • Not an error; leave Purl as null.

Logs should at least include:

  • Number of binaries discovered, parsed successfully, failed.
  • Number of symbols and edges created.
  • Number of edges with CalleePurl null vs non-null.

12. Test Plan (high-level)

  1. Unit tests for:

    • SymbolIdFactory (deterministic digests).
    • BinaryReachabilityService with mocked parsers & SBOM resolver.
  2. Integration tests (per platform) using small sample binaries:

    • A PE with one import (e.g. MessageBoxA).
    • An ELF binary calling printf.
    • A Mach-O binary with a simple imported function.
  3. Check that:

    • Graph contains expected binaries and symbols.
    • Call edges exist and have correct FromSymbolId / ToSymbolId.
    • PURLs are attached when SBOM resolver is provided with matching entries.

If you want, as a next step I can break this into a concrete task list (Jira-style) for a single mid-level C# developer over 1-2 sprints, including approximate order and dependencies.

You can push these specs significantly closer to “best in class” by tightening a few correctness details, making some implicit assumptions explicit, and adding non-functional requirements and quality gates. Below is a structured list of concrete improvements you can fold back into the spec.

I'll focus on changes that materially affect correctness, extensibility, and implementation quality for an “average C# dev” without exploding complexity.


1. Clarify Non-Functional Requirements

Right now the spec is almost entirely functional. Add a short NFR section so the developer has explicit targets:

Add a “Non-Functional Requirements” section:

  • Performance

    • Target scanning throughput, e.g. “On commodity hardware, aim for at least 50-100 MB/s of binaries scanned in static mode.”
    • Specify acceptable complexity: “All parsing operations must be linear in file size where possible; avoid quadratic algorithms over symbol tables.”
  • Memory

    • Provide a rough upper bound, e.g. “Graph building must not exceed 512 MB RAM for 10k binaries with typical Linux container images.”
  • Thread safety

    • Clarify: “All parser implementations must be stateless and thread-safe; BinaryReachabilityService.BuildGraph may scan binaries in parallel.”
  • Portability

    • Minimum supported OS set (Windows, Linux, macOS) and CPU architectures (x86_64, ARM64); important because ELF/Mach-O vary.

This keeps the implementation from being “correct but unusably slow” and tells the dev what “good enough” looks like.


2. Fix and Strengthen Symbol Identity (Very Important)

The current spec uses SymbolId = "sym:{digest}" where the digest is based only on the normalized name. That will collapse distinct functions that happen to share the same name/signature across different libraries/packages, which is unacceptable once you care about cross-component reachability.

Improve the spec as follows:

  1. Split “symbol node identity” from “canonical symbol key”:

    • Keep a local identity that is always unique per binary:

      public sealed record SymbolNode(
          string SymbolId,           // e.g. "sym:{binaryId}:{localIndex}"
          string NormalizedName,
          SymbolKind Kind,
          string? Purl,
          string SymbolDigest        // stable digest of NormalizedName
      );
      
    • Define a canonical symbol key struct for cross-binary grouping:

      public readonly record struct CanonicalSymbolKey(
          string SymbolDigest,       // sha256(normalizedName)
          string? Purl               // null for unknown package
      );
      
    • Inside ReachabilityGraph, add:

      public Dictionary<CanonicalSymbolKey, List<string>> CanonicalSymbolIndex { get; } = new();
      
  2. Clarify behavior:

    • Never merge two SymbolNodes just because they share the same digest.
    • For “global reasoning” (e.g. “all call sites to the vulnerable function X from package Y”), use CanonicalSymbolKey(SymbolDigest, Purl).
  3. Update CallEdge:

    • Keep FromSymbolId and ToSymbolId as node IDs.

    • Include the canonical key in a dedicated field:

      public sealed record CallEdge(
          string FromSymbolId,
          string ToSymbolId,
          EdgeKind EdgeKind,
          EdgeConfidence Confidence,
          CanonicalSymbolKey? CalleeKey,
          CallSite Site
      );
      

This single change prevents subtle and serious misattribution across libraries with overlapping APIs.
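
To make the split concrete, the sketch below builds the canonical index without ever merging nodes: two symbols with the same digest but different purls end up under different keys. The records are trimmed copies (SymbolKind omitted), and CanonicalIndexBuilder is a hypothetical name.

```csharp
#nullable enable
using System.Collections.Generic;

public readonly record struct CanonicalSymbolKey(string SymbolDigest, string? Purl);

// Trimmed copy of the revised SymbolNode, kept only for self-containment.
public sealed record SymbolNode(string SymbolId, string NormalizedName, string? Purl, string SymbolDigest);

public static class CanonicalIndexBuilder
{
    // Groups node IDs by (digest, purl). Nodes stay distinct; only the index groups them.
    public static Dictionary<CanonicalSymbolKey, List<string>> Build(IEnumerable<SymbolNode> symbols)
    {
        var index = new Dictionary<CanonicalSymbolKey, List<string>>();
        foreach (var symbol in symbols)
        {
            var key = new CanonicalSymbolKey(symbol.SymbolDigest, symbol.Purl);
            if (!index.TryGetValue(key, out var ids))
                index[key] = ids = new List<string>();
            ids.Add(symbol.SymbolId);
        }
        return index;
    }
}
```

A record struct gets value-based Equals and GetHashCode, which is what lets it serve directly as a dictionary key, including when Purl is null.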


3. Explicit Build Identity Semantics (PE/ELF/Mach-O)

The spec currently says BuildId is “optional” and format-specific, but does not define how to compute it per format. Best-in-class means this is deterministic and documented.

Extend the spec with a “Binary Identity” section:

  • PE (Windows)

    • BuildId = PDB GUID + Age if available (from CodeView debug directory).
    • If PDB info is missing, set BuildId = null and rely on FileHash.
  • ELF (Linux)

    • BuildId = contents of .note.gnu.build-id if present.
  • Mach-O (macOS)

    • BuildId = UUID from LC_UUID load command.

Also specify:

  • Primary identity order: (BuildId, FileHash); if BuildId is null, use FileHash only.
  • SBOM resolvers MUST treat (BuildId, FileHash) as the canonical key to map binaries to components, with file path only as a hint.

This gives you robust correlation between SBOM entries and binaries, across containers and file renames.


4. Enrich the Edge Model and Call Site Semantics

For precision and debuggability, specify what edges mean more rigorously.

Add fields and definitions:

  1. Direction and type:

    Add a small discriminator describing the origin of the edge:

    public enum EdgeSource
    {
        ImportTable,    // import thunk / PLT / stub
        Relocation,     // relocation to symbol
        Disassembly,    // decoded CALL / BL / JAL
        Metadata,       // .NET metadata, DWARF, etc.
        Other
    }
    

    Extend CallEdge:

    public sealed record CallEdge(
        string FromSymbolId,
        string ToSymbolId,
        EdgeKind EdgeKind,
        EdgeConfidence Confidence,
        EdgeSource Source,
        CanonicalSymbolKey? CalleeKey,
        CallSite Site
    );
    
  2. Intra- vs inter-binary

    • Define: Site.BinaryId always refers to the binary containing the call instruction.
    • Intra-binary edge: FromSymbol and ToSymbol share the same BinaryId.
    • Inter-binary edge: otherwise.
  3. Unknown or unresolved callees

    • Do not drop unresolved calls; add a special UnknownSymbolNode per binary:

      • NormalizedName = "<unknown>", Kind = SymbolKind.Unknown, Purl = null.
    • Edges to unknown must have Confidence = EdgeConfidence.Low.

This makes downstream consumers able to distinguish “we are sure this is a call to libX.Y” from “we saw a call but do not know to where”.


5. Strengthen Symbol Normalization Rules (Demangling etc.)

For best-in-class results, you want reproducible signatures independent of compiler version, and you want to unify mangled C++/Rust/etc. names.

Extend the SymbolIdFactory spec with clear rules:

  1. Language-agnostic core

    • Always:

      • Demangle if possible.
      • Normalize whitespace.
      • Normalize namespace separators to . and member separator to ::.
      • Remove address/offset suffixes embedded in names.
  2. Format- and language-specific guidance

    • For C/C++ (MSVC / Itanium ABI):

      • Use a demangler (your own or library) to get retType namespace.Type::Func(paramTypes...).
      • Omit return type in normalization to make signatures more stable: namespace.Type::Func(paramTypes...).
    • For Rust:

      • Strip hash suffixes from symbol name.
      • Use “crate::module::Type::func(params...)” pattern where possible.
    • For Go:

      • Normalize from runtime.main_main → runtime.main.main etc.
    • For .NET (if/when you add managed parsing later):

      • Use fully qualified CLR names: Namespace.Type::Method(ParamType1,ParamType2).
  3. Document stability guarantees

    • Given identical source (function name + parameter list), the SymbolDigest must remain stable across builds, architectures, optimization levels, and link addresses.
    • If demangling fails, fallback to raw name but strip obvious hashes if safe.

Specify this in prose and keep the implementation flexible, but the rules must be clear enough that two developers implementing the parser will produce the same digest for the same symbol.
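
A few of the language-agnostic rules can be sketched with plain string handling. This covers only a subset (whitespace, trailing offset suffixes, and the legacy Rust hash suffix) and assumes the input is already demangled; SymbolNormalizerSketch is a hypothetical name, and separator rewriting is left out because it depends on the demangler's output format.

```csharp
using System.Text.RegularExpressions;

// Hypothetical normalizer covering a subset of the rules above.
public static class SymbolNormalizerSketch
{
    private static readonly Regex Whitespace = new(@"\s+", RegexOptions.Compiled);

    // Legacy Rust mangling leaves "::h" plus 16 hex digits at the end of the demangled path.
    private static readonly Regex RustHashSuffix = new(@"::h[0-9a-f]{16}$", RegexOptions.Compiled);

    // Offsets such as "+0x1c" that some tools append to a symbol name.
    private static readonly Regex OffsetSuffix = new(@"\+0x[0-9a-fA-F]+$", RegexOptions.Compiled);

    public static string Normalize(string demangledName)
    {
        // Collapse runs of whitespace so formatting differences do not change the digest.
        var name = Whitespace.Replace(demangledName.Trim(), " ");

        // Strip the offset first, so a hash suffix hidden behind it is still found.
        name = OffsetSuffix.Replace(name, "");
        name = RustHashSuffix.Replace(name, "");

        // Drop spaces that only reflect demangler formatting: "f(int, char)" -> "f(int,char)".
        return name.Replace(", ", ",").Replace("( ", "(").Replace(" )", ")");
    }
}
```

Whatever rules are chosen, the output of this function is what gets hashed, so any later change to it is effectively a schema change for stored digests.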


6. More Precise SBOM & PURL Resolution Behavior

The SBOM integration is crucial to StellaOps; push this further so it is deterministic and auditable.

Extend ISbomComponentResolver behavior:

  1. Resolution order

    Document a strict order:

    1. (BuildId, FileHash) match.
    2. FileHash only.
    3. Normalized file path if SBOM has explicit path mapping.
    4. Library name fallback via ResolvePurlByLibraryName.
  2. Multiple SBOMs and conflicts

    • Allow multiple SBOM sources; if two SBOMs claim different purls for the same (BuildId, FileHash), define a policy:

      • e.g. fail fast with a “conflicting SBOM” error; or choose a deterministic priority order.
  3. Library name mapping contract

    Add a small DTO to make the mapping explicit:

    public sealed record LibraryReference(
        string BinaryId,
        string LibraryName,    // "libssl.so.3" / "KERNEL32.dll"
        string? ResolvedPath  // if the loader path is known
    );
    

    Extend IBinaryParser with:

    IReadOnlyList<LibraryReference> ParseLibraryReferences(BinaryNode binary);
    

    Then describe how BinaryReachabilityService uses those to call ResolvePurlByLibraryName.

  4. Unknown purls

    • Require that unknowns are explicit:

      • When ResolvePurlForBinary returns null, store Purl = null and flag this in logs: “No SBOM component for binary X (BuildId=..., Hash=...)”.

This ensures SBOM resolution remains a traceable, deterministic step rather than a best-effort guess.


7. Explicit JSON Schema & Versioning

For replayability and compatibility, define a clear JSON schema and version.

Add:

  • A top-level metadata section:

    {
      "schemaVersion": "1.0.0",
      "generatedAt": "2025-11-20T12:34:56Z",
      "tool": "StellaOps.Scanner.BinaryReachability",
      "toolVersion": "1.0.0",
      "graph": { ... }
    }
    
  • Commit to:

    • Only additive changes in minor versions.
    • Backwards-compatible changes within the same major version.
    • If you change anything structural (e.g. how symbol IDs work), bump schemaVersion major.

Optionally, provide a compact JSON schema file (or at least a documented shape) so other teams can implement readers in other languages.
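
On the reader side, the versioning commitment translates into a simple gate: accept any document whose major version matches the reader's. A minimal sketch, where SchemaVersionCheck and the Supported constant are hypothetical:

```csharp
using System;

public static class SchemaVersionCheck
{
    // Version this hypothetical reader was written against.
    public const string Supported = "1.0.0";

    // Same major version => readable; minor/patch differences are additive by contract.
    public static bool CanRead(string documentSchemaVersion)
        => Version.TryParse(documentSchemaVersion, out var document)
           && Version.TryParse(Supported, out var reader)
           && document.Major == reader.Major;
}
```

This relies on readers ignoring unknown JSON properties, which is the default behavior of System.Text.Json.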


8. Concurrency, Streaming, and Large Images

For best-in-class scalability, specify how large images are handled.

Clarify in the spec:

  1. Parallelization

    • BinaryReachabilityService.BuildGraph:

      • May scan binaries in parallel using Parallel.ForEach.
      • All parsers must be thread-safe and not rely on shared mutable state.
  2. Streaming option (optional but recommended)

    • Provide a second API for very large repositories:

      public interface IGraphSink
      {
          void OnBinary(BinaryNode binary);
          void OnSymbol(SymbolNode symbol);
          void OnEdge(CallEdge edge);
      }
      
      void BuildGraphStreaming(string rootDirectory, ISbomComponentResolver sbomResolver, IGraphSink sink);
      
    • This allows building graphs into a database or message bus without keeping everything in memory.

Even if you do not implement streaming immediately, designing the interface now keeps the architecture future-proof.
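
As an illustration of what a sink could look like, the sketch below writes one JSON object per line to any TextWriter. The records are trimmed stand-ins for the spec's types, and JsonLinesGraphSink is a hypothetical name; the lock is there because the spec allows parallel scanning.

```csharp
using System.IO;
using System.Text.Json;

// Trimmed stand-ins for the spec's records, kept only for self-containment.
public sealed record BinaryNode(string BinaryId, string FilePath);
public sealed record SymbolNode(string SymbolId, string NormalizedName);
public sealed record CallEdge(string FromSymbolId, string ToSymbolId);

public interface IGraphSink
{
    void OnBinary(BinaryNode binary);
    void OnSymbol(SymbolNode symbol);
    void OnEdge(CallEdge edge);
}

// Hypothetical sink: emits JSON Lines so the graph never has to fit in memory.
public sealed class JsonLinesGraphSink : IGraphSink
{
    private readonly TextWriter _writer;
    private readonly object _gate = new();

    public JsonLinesGraphSink(TextWriter writer) => _writer = writer;

    public void OnBinary(BinaryNode binary) => Write("binary", binary);
    public void OnSymbol(SymbolNode symbol) => Write("symbol", symbol);
    public void OnEdge(CallEdge edge) => Write("edge", edge);

    private void Write<T>(string type, T payload)
    {
        // Serialize outside the lock; only the actual write is serialized between threads.
        var line = JsonSerializer.Serialize(new { type, payload });
        lock (_gate) _writer.WriteLine(line);
    }
}
```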


9. Observability and Diagnostics

Best-in-class implementation requires good introspection for debugging wrong reachability conclusions.

Specify minimal observability requirements:

  • Logging

    • At least:

      • Info: number of binaries, symbols, edges, time taken.
      • Warning: unsupported binary formats, SBOM resolution failures, demangling failures.
      • Error: parser exceptions per file (with file path and format).
  • Debug artifacts

    • Optional environment or flag that dumps per-binary debug info:

      • Raw symbol table (names + addresses).
      • Normalized names and digests.
      • Library references.
      • Call edges for that binary.
  • Metrics hooks

    • Provide a simple interface for metrics:

      public interface IReachabilityMetrics
      {
          void IncrementCounter(string name, long value = 1);
          void ObserveDuration(string name, TimeSpan duration);
      }
      

      And allow BinaryReachabilityService to be constructed with an optional metrics implementation.


10. Expanded Test Strategy and Quality Gates

Your test plan is decent but can be made more systematic.

Extend test plan:

  1. Golden corpus

    • Maintain a small but curated set of PE/ELF/MachO binaries (checked in or generated) where:

      • Expected symbols and edges are stored as JSON.
      • CI compares current output with the golden graph byte-for-byte (or structurally).
  2. Cross-compiler coverage

    • At least:

      • C/C++ built by different toolchains (MSVC, clang, gcc).
      • Different optimization levels (-O0, -O2) to ensure stability of parsing.
  3. Fuzzing / robustness

    • Create tests with truncated / corrupted binaries to ensure:

      • No crashes.
      • Meaningful, bounded error behavior.
  4. SBOM integration tests

    • For a test root directory:

      • Synthetic SBOM mapping files to binaries.
      • Validate correct purl assignment and conflict handling.
  5. Determinism tests

    • Run BuildGraph twice on the same directory and assert that:

      • Graph is structurally identical (including order-independent comparison).

This makes it much harder for regressions to slip in when you extend parsers or normalization.
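
An order-independent comparison for the determinism test can be as simple as sorting edges into a canonical order before comparing. A minimal sketch with a flattened CallEdge (BinaryId and Offset stand in for the call site); GraphComparer is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Flattened stand-in for the spec's CallEdge; records compare by value.
public sealed record CallEdge(string FromSymbolId, string ToSymbolId, string BinaryId, ulong Offset);

public static class GraphComparer
{
    // Canonical order, so discovery order (e.g. from parallel parsing) does not matter.
    public static IReadOnlyList<CallEdge> Canonicalize(IEnumerable<CallEdge> edges) =>
        edges.OrderBy(e => e.FromSymbolId, StringComparer.Ordinal)
             .ThenBy(e => e.ToSymbolId, StringComparer.Ordinal)
             .ThenBy(e => e.BinaryId, StringComparer.Ordinal)
             .ThenBy(e => e.Offset)
             .ToList();

    // True when both edge collections contain the same edges, regardless of order.
    public static bool StructurallyEqual(IEnumerable<CallEdge> a, IEnumerable<CallEdge> b) =>
        Canonicalize(a).SequenceEqual(Canonicalize(b));
}
```

Ordinal string comparison is used so the ordering does not vary with the machine's culture settings.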


11. Clear Extension Points and Roadmap Notes

Finally, add a short “Future Extensions” section so the dev knows what to keep in mind when structuring code:

  • Support for:

    • Inlined function tracking (via DWARF/PDB).
    • Managed .NET assemblies metadata (C# IL call graph).
    • Dynamic edge sources (runtime traces) merged into the same graph.
  • The spec should instruct: “Design parsers and the graph model so they can accept additional EdgeSource types and symbol metadata without breaking existing consumers.”

That gives the current implementation a clear direction and prevents design dead ends.


If you want, I can now take your current spec and rewrite it into a single, consolidated “v1.0 Best-in-Class Spec” document that incorporates all of the above changes, so you can hand it directly to an implementation team.