Here’s a simple, practical way to think about binary reachability that cleanly joins call graphs with SBOMs—without reusing external tools.
The big idea (plain English)
- Each function call edge in a binary’s call graph is annotated with:
  - a purl (package URL) identifying which component the callee belongs to, and
  - a symbol digest (stable hash of the callee’s normalized symbol signature).
- With those two tags, call graphs from PE/ELF/Mach‑O can be merged across binaries and mapped onto your SBOM components, giving a single vulnerability graph that answers: “Is this vulnerable function reachable in my deployment?”
Why this matters for Stella Ops
- One graph to rule them all: Libraries used by multiple services merge naturally via the same purl, so you see cross‑service blast radius instantly.
- Deterministic & auditable: Digests + purls make edges reproducible (great for “replayable scans” and audit trails).
- Zero tool reuse required: You can implement PE/ELF/Mach‑O parsing once in C# and still interoperate with SBOM/VEX ecosystems via purls.
Minimal data model
{
"nodes": [
{"id":"sym:hash:callee","kind":"symbol","purl":"pkg:nuget/Newtonsoft.Json@13.0.3","sig":"Newtonsoft.Json.JsonConvert::DeserializeObject<T>(string)"},
{"id":"bin:hash:myapi","kind":"binary","format":"pe","name":"MyApi.exe","build":"sha256:..."}
],
"edges": [
{
"from":"sym:hash:caller",
"to":"sym:hash:callee",
"etype":"calls",
"purl":"pkg:nuget/Newtonsoft.Json@13.0.3",
"sym_digest":"sha256:SYM_CALLEE",
"site":{"binary":"bin:hash:myapi","offset":"0x0041AFD0"}
}
],
"sbom": [
{"purl":"pkg:nuget/Newtonsoft.Json@13.0.3","component_id":"c-123","files":["/app/MyApi.exe"] }
]
}
How to build it (C#‑centric, binary‑first)
- Lift symbols per format
  - PE: parse COFF + PDB (if present), fallback to export tables; normalize “namespace.type::method(sig)”.
  - ELF: .dynsym/.symtab + DWARF (if present); demangle (Itanium/LLVM rules).
  - Mach‑O: LC_SYMTAB + DWARF; demangle.
- Compute symbol digests
  - Hash of normalized signature + (optionally) instruction fingerprint for resilience to addresses.
- Build intra‑binary call graph
  - Conservative static: function→function edges from import thunks, relocation targets, and lightweight disassembly of direct calls.
  - Optional dynamic refinement: PERF/eBPF or ETW traces to mark observed edges.
- Resolve each callee to a purl
  - Map import/segment to owning file → map file to SBOM component → emit its purl.
  - If multiple candidates, emit edge with a small candidates[] set; policy later can prune.
- Merge graphs across binaries
  - Union by (purl, sym_digest) for callees; keep multiple site locations.
- Attach vulnerabilities
  - From VEX/CVE → affected package purls → mark reachable if any path exists from entrypoints to a vulnerable (purl, sym_digest).
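The final step ("mark reachable if any path exists from entrypoints to a vulnerable (purl, sym_digest)") can be sketched as a breadth-first search over the merged edges. This is a minimal illustration, not the implementation: the `Edge` record and `ReachabilitySketch` class are hypothetical names, and the edge shape is trimmed to the fields the search needs.

```csharp
using System.Collections.Generic;

// Hypothetical, trimmed-down edge shape for illustration only.
public sealed record Edge(string From, string To, string Purl, string SymDigest);

public static class ReachabilitySketch
{
    // Returns true if some path starting at an entrypoint contains a call
    // whose callee is tagged with the given (purl, symDigest) pair.
    public static bool IsReachable(
        IEnumerable<Edge> edges,
        IEnumerable<string> entrypoints,
        string vulnerablePurl,
        string vulnerableSymDigest)
    {
        // Adjacency list keyed by caller symbol id.
        var outgoing = new Dictionary<string, List<Edge>>();
        foreach (var e in edges)
        {
            if (!outgoing.TryGetValue(e.From, out var list))
                outgoing[e.From] = list = new List<Edge>();
            list.Add(e);
        }

        var visited = new HashSet<string>(entrypoints);
        var queue = new Queue<string>(visited);
        while (queue.Count > 0)
        {
            var current = queue.Dequeue();
            if (!outgoing.TryGetValue(current, out var calls)) continue;
            foreach (var call in calls)
            {
                if (call.Purl == vulnerablePurl && call.SymDigest == vulnerableSymDigest)
                    return true;
                if (visited.Add(call.To))
                    queue.Enqueue(call.To);
            }
        }
        return false;
    }
}
```

Because the match is on the (purl, digest) tag carried by the edge, the same search works unchanged after graphs from several binaries are unioned.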
Practical policies that work well
- Entrypoints: ASP.NET controller actions, Main, exported handlers, cron entry shims.
- Edge confidence: tag edges as import, reloc, disasm, or runtime; prefer runtime in prioritization.
- Unknowns registry: if a symbol can’t be resolved, record purl:"pkg:unknown" with a reason (stripped, obfuscated, thunk), so it stays visible instead of being silently dropped.
Quick win you can ship first
- Start with imports-only reachability (no disassembly). For most CVEs in popular packages, imports + SBOM mapping already highlights real risk.
- Add light disassembly for direct call opcodes later to improve precision.
If you want, I can turn this into a ready‑to‑drop .NET 10 library skeleton: parsers (PE/ELF/Mach‑O), symbol normalizer, digestor, graph model, and SBOM mapper with purl resolvers.
Below is a concrete, implementation-ready specification aimed at a solid, “average” C# developer. The goal is that they can build this module without knowing all of StellaOps context.
1. Purpose and Scope
Implement a reusable .NET library that:
- Reads binaries (PE, ELF, Mach-O).
- Extracts functions/symbols and their call relationships (call graph).
- Annotates each call edge with:
  - The callee’s purl (package URL / SBOM component).
  - A symbol digest (stable function identifier).
- Produces a reachability graph in memory and as JSON.
This will be used by other StellaOps services (Scanner / Sbomer / Vexer) to answer: “Is this vulnerable function from package X reachable in my environment?”
Non-goals for v1:
- No dynamic tracing (no eBPF, no ETW). Static only.
- No external CLI tools (no objdump, llvm-nm, etc.). Everything in-process and in C#.
2. Project Structure
Create a new class library:
- Project: StellaOps.Scanner.BinaryReachability
- TargetFramework: net10.0
- Nullable: enable
- Language: latest C# available for .NET 10
Recommended namespaces:
- StellaOps.Scanner.BinaryReachability
- StellaOps.Scanner.BinaryReachability.Model
- StellaOps.Scanner.BinaryReachability.Parsing
- StellaOps.Scanner.BinaryReachability.Parsing.Pe
- StellaOps.Scanner.BinaryReachability.Parsing.Elf
- StellaOps.Scanner.BinaryReachability.Parsing.MachO
- StellaOps.Scanner.BinaryReachability.Sbom
- StellaOps.Scanner.BinaryReachability.Graph
3. Core Domain Model
3.1 Enumerations
namespace StellaOps.Scanner.BinaryReachability.Model;
public enum BinaryFormat
{
Pe,
Elf,
MachO
}
public enum SymbolKind
{
Function,
Method,
Constructor,
Destructor,
ImportStub,
Thunk,
Unknown
}
public enum EdgeKind
{
DirectCall,
IndirectCall,
ImportCall,
ConstructorInit, // e.g. .init_array
Other
}
public enum EdgeConfidence
{
High, // import, relocation, clear direct call
Medium, // best-effort disassembly
Low // heuristics, fallback
}
3.2 Node and Edge Records
namespace StellaOps.Scanner.BinaryReachability.Model;
public sealed record BinaryNode(
string BinaryId, // e.g. "bin:sha256:..."
string FilePath, // path in image or filesystem
BinaryFormat Format,
string? BuildId, // ELF build-id, Mach-O UUID, PE pdb-signature (optional)
string FileHash // sha256 of binary bytes
);
public sealed record SymbolNode(
string SymbolId, // stable within this graph: "sym:{digest}"
string NormalizedName, // normalized signature/name
SymbolKind Kind,
string? Purl, // nullable: may be unknown
string SymbolDigest // sha256 of normalized name
);
3.3 Call Edge and Call Site
namespace StellaOps.Scanner.BinaryReachability.Model;
public sealed record CallSite(
string BinaryId,
ulong Offset, // RVA / file offset
string? SourceFile, // Optional, if we can resolve
int? SourceLine // Optional
);
public sealed record CallEdge(
string FromSymbolId,
string ToSymbolId,
EdgeKind EdgeKind,
EdgeConfidence Confidence,
string? CalleePurl, // resolved package of callee
string CalleeSymbolDigest, // same as target SymbolDigest
CallSite Site
);
3.4 Graph Container
namespace StellaOps.Scanner.BinaryReachability.Graph;
using StellaOps.Scanner.BinaryReachability.Model;
public sealed class ReachabilityGraph
{
public Dictionary<string, BinaryNode> Binaries { get; } = new();
public Dictionary<string, SymbolNode> Symbols { get; } = new();
public List<CallEdge> Edges { get; } = new();
public void AddBinary(BinaryNode binary) => Binaries[binary.BinaryId] = binary;
public void AddSymbol(SymbolNode symbol) => Symbols[symbol.SymbolId] = symbol;
public void AddEdge(CallEdge edge) => Edges.Add(edge);
}
4. Public API (what other modules call)
Define a simple facade service that other StellaOps components use.
namespace StellaOps.Scanner.BinaryReachability;
using StellaOps.Scanner.BinaryReachability.Graph;
using StellaOps.Scanner.BinaryReachability.Model;
using StellaOps.Scanner.BinaryReachability.Sbom;
public interface IBinaryReachabilityService
{
/// <summary>
/// Builds a reachability graph for all binaries in the given directory (e.g. unpacked container filesystem),
/// using SBOM data to resolve PURLs.
/// </summary>
ReachabilityGraph BuildGraph(
string rootDirectory,
ISbomComponentResolver sbomResolver);
/// <summary>
/// Serialize the graph to JSON for persistence / later replay.
/// </summary>
string SerializeGraph(ReachabilityGraph graph);
}
Implementation class:
public sealed class BinaryReachabilityService : IBinaryReachabilityService
{
// Will compose format-specific parsers and SBOM resolver inside.
}
5. SBOM Component Resolver
We need only a minimal interface to attach PURLs to binaries and symbols.
namespace StellaOps.Scanner.BinaryReachability.Sbom;
public interface ISbomComponentResolver
{
/// <summary>
/// Resolve the purl for a binary file (by path or build-id).
/// Return null if not found.
/// </summary>
string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash);
/// <summary>
/// Optional: resolve purl by a library name only (e.g. "libssl.so.3", "libcrypto.so.3").
/// Used when we have imports but not full path.
/// </summary>
string? ResolvePurlByLibraryName(string libraryName);
}
For the C# dev:
- Implementation will consume CycloneDX/SPDX SBOMs that already map files (hash/path/buildId) to components and purls.
- For v1, a simple resolver that:
  - Loads SBOM JSON.
  - Indexes components by:
    - File path (normalized).
    - File hash.
    - BuildId where available.
  - Implements the two methods above using dictionary lookups.
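A dictionary-backed resolver along those lines might look like the sketch below. It assumes the SBOM has already been parsed into flat entries: `SbomFileEntry` and `DictionarySbomResolver` are hypothetical names, the lookup preference (build-id, then hash, then path) is an assumption rather than something the spec fixes at this point, and indexing library names by file name is a simplification.

```csharp
using System;
using System.Collections.Generic;

public interface ISbomComponentResolver
{
    string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash);
    string? ResolvePurlByLibraryName(string libraryName);
}

// Hypothetical, already-parsed SBOM file entry; CycloneDX/SPDX parsing is out of scope here.
public sealed record SbomFileEntry(string Purl, string Path, string? FileHash, string? BuildId);

public sealed class DictionarySbomResolver : ISbomComponentResolver
{
    private readonly Dictionary<string, string> _byPath = new(StringComparer.Ordinal);
    private readonly Dictionary<string, string> _byHash = new(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<string, string> _byBuildId = new(StringComparer.OrdinalIgnoreCase);
    // Assumption: library names are matched case-insensitively by file name.
    private readonly Dictionary<string, string> _byLibraryName = new(StringComparer.OrdinalIgnoreCase);

    public DictionarySbomResolver(IEnumerable<SbomFileEntry> entries)
    {
        foreach (var e in entries)
        {
            // TryAdd keeps the first entry on conflict (a v1 simplification).
            _byPath.TryAdd(NormalizePath(e.Path), e.Purl);
            if (e.FileHash is not null) _byHash.TryAdd(e.FileHash, e.Purl);
            if (e.BuildId is not null) _byBuildId.TryAdd(e.BuildId, e.Purl);
            _byLibraryName.TryAdd(FileNameOf(e.Path), e.Purl);
        }
    }

    public string? ResolvePurlForBinary(string filePath, string? buildId, string fileHash)
    {
        if (buildId is not null && _byBuildId.TryGetValue(buildId, out var purl)) return purl;
        if (_byHash.TryGetValue(fileHash, out purl)) return purl;
        return _byPath.TryGetValue(NormalizePath(filePath), out purl) ? purl : null;
    }

    public string? ResolvePurlByLibraryName(string libraryName)
        => _byLibraryName.TryGetValue(libraryName, out var purl) ? purl : null;

    private static string NormalizePath(string path) => path.Replace('\\', '/');

    private static string FileNameOf(string path)
    {
        var normalized = NormalizePath(path);
        return normalized[(normalized.LastIndexOf('/') + 1)..];
    }
}
```

Returning null for a miss, rather than throwing, matches the interface contract above and lets the caller decide how to record unknowns.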
6. Binary Parsing Abstractions
6.1 Common Interface
namespace StellaOps.Scanner.BinaryReachability.Parsing;
using StellaOps.Scanner.BinaryReachability.Model;
public interface IBinaryParser
{
bool CanParse(string filePath, ReadOnlySpan<byte> header);
/// <summary>
/// Parse basic binary metadata: format, build-id, file-hash already computed by caller.
/// </summary>
BinaryNode ParseBinaryMetadata(string filePath, string fileHash);
/// <summary>
/// Parse functions/symbols from this binary.
/// Return a list of SymbolNode with Purl left null (will be set later).
/// </summary>
IReadOnlyList<SymbolNode> ParseSymbols(BinaryNode binary);
/// <summary>
/// Build intra-binary call edges (from this binary’s functions to others), without PURL info.
/// ToSymbolId should be based on SymbolDigest; PURL will be attached later.
/// </summary>
IReadOnlyList<CallEdge> ParseCallGraph(BinaryNode binary, IReadOnlyList<SymbolNode> symbols);
}
6.2 Parser Implementations
Create three concrete parsers:
- PeBinaryParser in Parsing.Pe
- ElfBinaryParser in Parsing.Elf
- MachOBinaryParser in Parsing.MachO
And a small factory:
public sealed class BinaryParserFactory
{
    private readonly List<IBinaryParser> _parsers;

    public BinaryParserFactory()
    {
        _parsers = new List<IBinaryParser>
        {
            new Pe.PeBinaryParser(),
            new Elf.ElfBinaryParser(),
            new MachO.MachOBinaryParser()
        };
    }

    public IBinaryParser? GetParser(string filePath, ReadOnlySpan<byte> header)
    {
        // ReadOnlySpan<byte> is a ref struct and cannot be captured by a lambda,
        // so iterate explicitly instead of using FirstOrDefault.
        foreach (var parser in _parsers)
        {
            if (parser.CanParse(filePath, header))
                return parser;
        }
        return null;
    }
}
7. Symbol Normalization and Digesting
Create a small helper for consistent symbol IDs.
namespace StellaOps.Scanner.BinaryReachability.Model;
public static class SymbolIdFactory
{
public static string ComputeNormalizedName(string rawName)
=> rawName.Trim(); // v1: minimal; later we can extend (demangling, etc.)
public static string ComputeSymbolDigest(string normalizedName)
{
using var sha = System.Security.Cryptography.SHA256.Create();
var bytes = System.Text.Encoding.UTF8.GetBytes(normalizedName);
var hash = sha.ComputeHash(bytes);
var hex = Convert.ToHexString(hash).ToLowerInvariant();
return hex;
}
public static string CreateSymbolId(string symbolDigest)
=> $"sym:{symbolDigest}";
}
Usage in parsers:
- For each function name the parser finds:
  - normalizedName = SymbolIdFactory.ComputeNormalizedName(rawName);
  - digest = SymbolIdFactory.ComputeSymbolDigest(normalizedName);
  - symbolId = SymbolIdFactory.CreateSymbolId(digest);
  - Create SymbolNode.
Notes for developer:
- Do not include file path or address in the digest (we want determinism across builds).
- In the future we can expand normalization to include demangled signatures and parameter types.
8. Building the Graph (step-by-step)
Implementation of BinaryReachabilityService.BuildGraph should follow this algorithm.
8.1 Scan Files
- Recursively enumerate all files under rootDirectory.
- For each file:
  - Open as stream.
  - Read first 4–8 bytes as header.
  - Try BinaryParserFactory.GetParser.
  - If no parser, skip file.
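The header check in this step can be sketched as a small magic-number sniffer. `MagicSniffer` and `DetectedFormat` are hypothetical names; the sketch only looks at the first four bytes, so it treats any "MZ" file as PE (a full check would follow the DOS header to the "PE\0\0" signature) and does not cover universal (fat) Mach-O binaries.

```csharp
using System;
using System.Buffers.Binary;

public enum DetectedFormat { Unknown, Pe, Elf, MachO }

public static class MagicSniffer
{
    public static DetectedFormat Detect(ReadOnlySpan<byte> header)
    {
        if (header.Length < 4) return DetectedFormat.Unknown;

        // ELF: 0x7F 'E' 'L' 'F'
        if (header[0] == 0x7F && header[1] == (byte)'E' &&
            header[2] == (byte)'L' && header[3] == (byte)'F')
            return DetectedFormat.Elf;

        // PE images start with the DOS signature "MZ".
        if (header[0] == (byte)'M' && header[1] == (byte)'Z')
            return DetectedFormat.Pe;

        // Thin Mach-O: 0xFEEDFACE (32-bit) or 0xFEEDFACF (64-bit), in either byte order.
        uint bigEndian = BinaryPrimitives.ReadUInt32BigEndian(header);
        uint littleEndian = BinaryPrimitives.ReadUInt32LittleEndian(header);
        if (IsMachOMagic(bigEndian) || IsMachOMagic(littleEndian))
            return DetectedFormat.MachO;

        return DetectedFormat.Unknown;
    }

    private static bool IsMachOMagic(uint value)
        => value == 0xFEEDFACEu || value == 0xFEEDFACFu;
}
```

Each parser's CanParse could delegate to a check like this, so files that are not binaries are skipped after reading only a few bytes.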
8.2 Parse Binary Metadata and Symbols
For each parseable file:
- Compute SHA256 of file content → fileHash.
- parser.ParseBinaryMetadata(filePath, fileHash) → BinaryNode.
- Add BinaryNode to ReachabilityGraph.Binaries.
- parser.ParseSymbols(binary) → list of SymbolNode.
- For each symbol:
  - Add to ReachabilityGraph.Symbols if not already present:
    - Key: SymbolId.
    - If existing, keep first or merge (for v1: keep first).
  - Maintain an in-memory index:
// symbolDigest -> SymbolNode
Dictionary<string, SymbolNode> symbolsByDigest;
8.3 Parse Call Graph per Binary
For each binary:
- parser.ParseCallGraph(binary, itsSymbols) → edges (without PURL attached).
- For each edge:
  - Ensure FromSymbolId and ToSymbolId correspond to known SymbolNode; ToSymbolId should be sym:{digest} for the callee.
  - Add edge to ReachabilityGraph.Edges.
At this point, edges know only FromSymbolId, ToSymbolId, kind, confidence, and CallSite.
8.4 Attach PURLs
Now run a second pass to attach PURLs to symbols and edges:
- For each BinaryNode:
  - Call sbomResolver.ResolvePurlForBinary(binary.FilePath, binary.BuildId, binary.FileHash).
  - If not null, this is the binary’s own purl (used for "who owns these functions").
- Maintain:
Dictionary<string, string?> binaryPurlsById; // BinaryId -> purl?
- For each CallEdge:
  - Get callee symbol:
var symbol = graph.Symbols[edge.ToSymbolId];
  - If symbol.Purl is null:
    - If callee is local (same binary; the parser may mark it via metadata or CallSite.BinaryId):
      - Assign the purl from binaryPurlsById[edge.Site.BinaryId] (can be null).
    - If callee is imported from an external library:
      - Parser should provide library name in NormalizedName or additional metadata (for v1, you can store library in a separate structure).
      - Use sbomResolver.ResolvePurlByLibraryName(libraryName) to find purl.
      - Set symbol.Purl to that value (even if null).
  - Set edge.CalleePurl = symbol.Purl.
  - Set edge.CalleeSymbolDigest = symbol.SymbolDigest.
- SymbolNode and CallEdge are positional records, so their properties are init-only. "Set" here means replacing the stored instance with an updated copy (for example symbol with { Purl = ... }), not assigning in place.
Note: For v1 you can simplify:
- Assume all callees in this binary belong to the binary’s purl.
- Later, extend to per-library mapping.
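The second pass, in its simplified v1 form, can be sketched as below. Since the spec's records expose init-only properties, the sketch replaces stored instances via `with` expressions. The record shapes here are trimmed stand-ins for the spec's SymbolNode and CallEdge (the call site is reduced to a hypothetical `SiteBinaryId` string), and `PurlAttacher` is a hypothetical name.

```csharp
using System.Collections.Generic;

// Trimmed stand-ins for the spec's records, for illustration only.
public sealed record SymbolNode(string SymbolId, string NormalizedName, string? Purl, string SymbolDigest);
public sealed record CallEdge(string FromSymbolId, string ToSymbolId, string? CalleePurl, string CalleeSymbolDigest, string SiteBinaryId);

public static class PurlAttacher
{
    // v1 simplification: every callee is assumed to belong to the calling binary's purl.
    public static void Attach(
        Dictionary<string, SymbolNode> symbols,
        List<CallEdge> edges,
        IReadOnlyDictionary<string, string?> binaryPurlsById)
    {
        for (int i = 0; i < edges.Count; i++)
        {
            var edge = edges[i];
            var symbol = symbols[edge.ToSymbolId];

            if (symbol.Purl is null)
            {
                // purl stays null when the binary has no SBOM match.
                binaryPurlsById.TryGetValue(edge.SiteBinaryId, out var purl);
                symbol = symbol with { Purl = purl };
                symbols[symbol.SymbolId] = symbol;
            }

            // Records are immutable, so store an updated copy of the edge.
            edges[i] = edge with
            {
                CalleePurl = symbol.Purl,
                CalleeSymbolDigest = symbol.SymbolDigest
            };
        }
    }
}
```

Iterating by index rather than with foreach is what allows the list slots to be replaced while walking the edges.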
9. Format-Specific Minimum Requirements
For each parser, aim for this minimum.
9.1 PE Parser (Windows)
Tasks:
- Identify PE by MZ + PE header.
- Extract:
  - Machine type.
  - Optional: PDB signature / age (for potential BuildId in the future).
- Symbols:
  - Use export table for exported functions.
  - Use import table for imported functions (these represent edges from this binary to others).
- Call graph:
  - For v1: edges from each local function to imported functions via import table.
  - Later: add simple disassembly of .text section to detect intra-binary calls.
Practical approach:
- Use System.Reflection.PortableExecutable if possible, or a small custom PE reader.
- Represent imported function name as "<dllName>!<functionName>" in NormalizedName.
9.2 ELF Parser (Linux)
Tasks:
- Detect ELF by magic 0x7F 'E' 'L' 'F'.
- Extract:
  - BuildId (from .note.gnu.build-id if present).
  - Architecture.
- Symbols:
  - Read .dynsym (dynamic symbols) and .symtab if present.
  - Functions only (symbol type FUNC).
- Call graph (minimum):
  - Imports via PLT/GOT entries (function calls to shared libs).
  - Map symbol names to SymbolNode as above.
Implementation:
- Write a simple ELF reader: parse header, section headers, locate .dynsym (names live in .dynstr), .symtab (names live in .strtab), and .note.gnu.build-id.
9.3 Mach-O Parser (macOS)
Tasks:
- Detect Mach-O via magic (0xFEEDFACE, 0xFEEDFACF, etc.).
- Extract:
  - UUID (LC_UUID) as BuildId equivalent.
- Symbols:
  - Use LC_SYMTAB and associated string table.
- Call graph:
  - Similar approach as ELF for imports; minimum: cross-binary call edges via import stubs.
Implementation:
- Minimal Mach-O parser: read load commands, find LC_SYMTAB and LC_UUID.
10. JSON Serialization Format
Use System.Text.Json with simple DTOs mirroring ReachabilityGraph. For v1, you can serialize the domain model directly.
Example structure (for reference only):
{
"nodes": {
"binaries": [
{ "binaryId": "bin:sha256:...", "filePath": "/app/MyApi.exe", "format": "Pe", "buildId": null, "fileHash": "..." }
],
"symbols": [
{ "symbolId": "sym:...", "normalizedName": "MyNamespace.MyType::MyMethod()", "kind": "Function", "purl": "pkg:nuget/MyLib@1.2.3", "symbolDigest": "..." }
]
},
"edges": [
{
"fromSymbolId": "sym:...",
"toSymbolId": "sym:...",
"edgeKind": "ImportCall",
"confidence": "High",
"calleePurl": "pkg:nuget/MyLib@1.2.3",
"calleeSymbolDigest": "...",
"site": { "binaryId": "bin:sha256:...", "offset": "0", "sourceFile": null, "sourceLine": null }
}
]
}
11. Error Handling & Logging
- For unreadable or unsupported binaries:
  - Log a warning and continue.
- For parsing errors:
  - Catch exceptions, log with file path and format, continue with other files.
- For SBOM resolution failures:
  - Not an error; leave Purl as null.
Logs should at least include:
- Number of binaries discovered, parsed successfully, failed.
- Number of symbols and edges created.
- Number of edges with CalleePurl null vs non-null.
12. Test Plan (high-level)
- Unit tests for:
  - SymbolIdFactory (deterministic digests).
  - BinaryReachabilityService with mocked parsers & SBOM resolver.
- Integration tests (per platform) using small sample binaries:
  - A PE with one import (e.g. MessageBoxA).
  - An ELF binary calling printf.
  - A Mach-O binary with a simple imported function.
- Check that:
  - Graph contains expected binaries and symbols.
  - Call edges exist and have correct FromSymbolId/ToSymbolId.
  - PURLs are attached when SBOM resolver is provided with matching entries.
If you want, next step I can break this into a concrete task list (Jira-style) for a single mid-level C# developer over 1–2 sprints, including approximate order and dependencies.
You can push these specs significantly closer to “best in class” by tightening a few correctness details, making some implicit assumptions explicit, and adding non‑functional and quality gates. Below is a structured list of concrete improvements you can fold back into the spec.
I’ll focus on changes that materially affect correctness, extensibility, and implementation quality for an “average C# dev” without exploding complexity.
1. Clarify Non‑Functional Requirements
Right now the spec is almost entirely functional. Add a short NFR section so the developer has explicit targets:
Add a “Non‑Functional Requirements” section:
- Performance
  - Target scanning throughput, e.g. “On commodity hardware, aim for at least 50–100 MB/s of binaries scanned in static mode.”
  - Specify acceptable complexity: “All parsing operations must be linear in file size where possible; avoid quadratic algorithms over symbol tables.”
- Memory
  - Provide a rough upper bound, e.g. “Graph building must not exceed 512 MB RAM for 10k binaries with typical Linux container images.”
- Thread safety
  - Clarify: “All parser implementations must be stateless and thread‑safe; BinaryReachabilityService.BuildGraph may scan binaries in parallel.”
- Portability
  - Minimum supported OS set (Windows, Linux, macOS) and CPU architectures (x86_64, ARM64); important because ELF/Mach‑O vary.
This keeps the implementation from being “correct but unusably slow” and tells the dev what “good enough” looks like.
2. Fix and Strengthen Symbol Identity (Very Important)
Current spec uses SymbolId = "sym:{digest}" where digest is only based on normalized name. That will collapse distinct functions that happen to share the same name/signature across different libraries/packages, which is unacceptable once you care about cross‑component reachability.
Improve the spec as follows:
- Split “symbol node identity” from “canonical symbol key”:
  - Keep a local identity that is always unique per binary:
public sealed record SymbolNode(
    string SymbolId, // e.g. "sym:{binaryId}:{localIndex}"
    string NormalizedName,
    SymbolKind Kind,
    string? Purl,
    string SymbolDigest // stable digest of NormalizedName
);
  - Define a canonical symbol key struct for cross‑binary grouping:
public readonly record struct CanonicalSymbolKey(
    string SymbolDigest, // sha256(normalizedName)
    string? Purl // null for unknown package
);
  - Inside ReachabilityGraph, add:
public Dictionary<CanonicalSymbolKey, List<string>> CanonicalSymbolIndex { get; } = new();
- Clarify behavior:
  - Never merge two SymbolNodes just because they share the same digest.
  - For “global reasoning” (e.g. “all call sites to the vulnerable function X from package Y”), use CanonicalSymbolKey(SymbolDigest, Purl).
- Update CallEdge:
  - Keep FromSymbolId and ToSymbolId as node IDs.
  - Include the canonical key in a dedicated field:
public sealed record CallEdge(
    string FromSymbolId,
    string ToSymbolId,
    EdgeKind EdgeKind,
    EdgeConfidence Confidence,
    CanonicalSymbolKey? CalleeKey,
    CallSite Site
);
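Populating the canonical index might look like the sketch below. It relies on record-struct value equality, so two symbols land in the same bucket only when both digest and purl match. The `SymbolNode` shape is a trimmed stand-in for the spec's record and `CanonicalIndexBuilder` is a hypothetical helper name.

```csharp
using System.Collections.Generic;

public readonly record struct CanonicalSymbolKey(string SymbolDigest, string? Purl);

// Trimmed stand-in for the spec's SymbolNode, for illustration only.
public sealed record SymbolNode(string SymbolId, string NormalizedName, string? Purl, string SymbolDigest);

public static class CanonicalIndexBuilder
{
    // Groups per-binary symbol ids under their (digest, purl) key without merging the nodes themselves.
    public static Dictionary<CanonicalSymbolKey, List<string>> Build(IEnumerable<SymbolNode> symbols)
    {
        var index = new Dictionary<CanonicalSymbolKey, List<string>>();
        foreach (var symbol in symbols)
        {
            var key = new CanonicalSymbolKey(symbol.SymbolDigest, symbol.Purl);
            if (!index.TryGetValue(key, out var ids))
                index[key] = ids = new List<string>();
            ids.Add(symbol.SymbolId);
        }
        return index;
    }
}
```

A same-named function shipped by two different packages therefore yields two keys, which is the misattribution the section is guarding against.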
This single change prevents subtle and serious misattribution across libraries with overlapping APIs.
3. Explicit Build Identity Semantics (PE/ELF/Mach‑O)
The spec currently says BuildId is “optional” and format‑specific, but does not define how to compute it per format. Best‑in‑class means this is deterministic and documented.
Extend the spec with a “Binary Identity” section:
- PE (Windows)
  - BuildId = PDB GUID + Age if available (from CodeView debug directory).
  - If PDB info is missing, set BuildId = null and rely on FileHash.
- ELF (Linux)
  - BuildId = contents of .note.gnu.build-id if present.
- Mach‑O (macOS)
  - BuildId = UUID from LC_UUID load command.
Also specify:
- Primary identity order: (BuildId, FileHash); if BuildId is null, use FileHash only.
- SBOM resolvers MUST treat (BuildId, FileHash) as the canonical key to map binaries to components, with file path only as a hint.
This gives you robust correlation between SBOM entries and binaries, across containers and file renames.
4. Enrich the Edge Model and Call Site Semantics
For precision and debuggability, specify what edges mean more rigorously.
Add fields and definitions:
- Direction and type:
  Add a small discriminator describing the origin of the edge:
public enum EdgeSource
{
    ImportTable, // import thunk / PLT / stub
    Relocation, // relocation to symbol
    Disassembly, // decoded CALL / BL / JAL
    Metadata, // .NET metadata, DWARF, etc.
    Other
}
  Extend CallEdge:
public sealed record CallEdge(
    string FromSymbolId,
    string ToSymbolId,
    EdgeKind EdgeKind,
    EdgeConfidence Confidence,
    EdgeSource Source,
    CanonicalSymbolKey? CalleeKey,
    CallSite Site
);
- Intra‑ vs inter‑binary
  - Define: Site.BinaryId always refers to the binary containing the call instruction.
  - Intra‑binary edge: FromSymbol and ToSymbol share same BinaryId.
  - Inter‑binary edge: otherwise.
- Unknown or unresolved callees
  - Do not drop unresolved calls; add a special UnknownSymbolNode per binary:
    - NormalizedName = "<unknown>",
    - Kind = SymbolKind.Unknown,
    - Purl = null.
  - Edges to unknown must have Confidence = EdgeConfidence.Low.
This makes downstream consumers able to distinguish “we are sure this is a call to libX.Y” from “we saw a call but do not know to where”.
5. Strengthen Symbol Normalization Rules (Demangling etc.)
For best‑in‑class results, you want reproducible signatures independent of compiler version, and you want to unify mangled C++/Rust/etc. names.
Extend the SymbolIdFactory spec with clear rules:
- Language‑agnostic core
  - Always:
    - Demangle if possible.
    - Normalize whitespace.
    - Normalize namespace separators to . and member separator to ::.
    - Remove address/offset suffixes embedded in names.
- Format‑ and language‑specific guidance
  - For C/C++ (MSVC / Itanium ABI):
    - Use a demangler (your own or library) to get retType namespace.Type::Func(paramTypes...).
    - Omit return type in normalization to make signatures more stable: namespace.Type::Func(paramTypes...).
  - For Rust:
    - Strip hash suffixes from symbol name.
    - Use “crate::module::Type::func(params...)” pattern where possible.
  - For Go:
    - Normalize from runtime.main_main → runtime.main.main etc.
  - For .NET (if/when you add managed parsing later):
    - Use fully qualified CLR names: Namespace.Type::Method(ParamType1,ParamType2).
- Document stability guarantees
  - Given identical source (function name + parameter list), the SymbolDigest must remain stable across builds, architectures, optimization levels, and link addresses.
  - If demangling fails, fall back to the raw name but strip obvious hashes if safe.
Specify this in prose and keep the implementation flexible, but the rules must be clear enough that two developers implementing the parser will produce the same digest for the same symbol.
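Two of these rules (whitespace normalization and Rust hash-suffix stripping) are simple enough to sketch directly. The sketch assumes its input is already demangled and handles only the legacy Rust form, where the demangled name ends in "::h" followed by 16 hex digits; separator normalization and the other language rules are left out. `SymbolNormalizerSketch` is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

public static class SymbolNormalizerSketch
{
    private static readonly Regex Whitespace = new(@"\s+", RegexOptions.Compiled);

    // Legacy Rust demangled names end in "::h" + 16 lowercase hex digits.
    private static readonly Regex RustHashSuffix = new(@"::h[0-9a-f]{16}$", RegexOptions.Compiled);

    // Assumes the input is already demangled; only trims, collapses whitespace,
    // and removes a trailing legacy Rust hash if one is present.
    public static string Normalize(string demangledName)
    {
        var name = Whitespace.Replace(demangledName.Trim(), " ");
        return RustHashSuffix.Replace(name, string.Empty);
    }
}
```

Keeping each rule as a separate, named pattern makes it easier for two implementers to agree on exactly what was stripped before hashing.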
6. More Precise SBOM & PURL Resolution Behavior
The SBOM integration is crucial to StellaOps; push this further so it is deterministic and auditable.
Extend ISbomComponentResolver behavior:
- Resolution order
  Document a strict order:
  1. (BuildId, FileHash) match.
  2. FileHash only.
  3. Normalized file path if SBOM has explicit path mapping.
  4. Library name fallback via ResolvePurlByLibraryName.
- Multiple SBOMs and conflicts
  - Allow multiple SBOM sources; if two SBOMs claim different purls for the same (BuildId, FileHash), define a policy:
    - e.g. fail fast with a “conflicting SBOM” error; or choose a deterministic priority order.
- Library name mapping contract
  Add a small DTO to make the mapping explicit:
public sealed record LibraryReference(
    string BinaryId,
    string LibraryName, // "libssl.so.3" / "KERNEL32.dll"
    string? ResolvedPath // if the loader path is known
);
  Extend IBinaryParser with:
IReadOnlyList<LibraryReference> ParseLibraryReferences(BinaryNode binary);
  Then describe how BinaryReachabilityService uses those to call ResolvePurlByLibraryName.
- Unknown purls
  - Require that unknowns are explicit:
    - When ResolvePurlForBinary returns null, store Purl = null and flag this in logs: “No SBOM component for binary X (BuildId=..., Hash=...)”.
This ensures SBOM resolution remains a traceable, deterministic step rather than a best‑effort guess.
7. Explicit JSON Schema & Versioning
For replayability and compatibility, define a clear JSON schema and version.
Add:
- A top‑level metadata section:
{
  "schemaVersion": "1.0.0",
  "generatedAt": "2025-11-20T12:34:56Z",
  "tool": "StellaOps.Scanner.BinaryReachability",
  "toolVersion": "1.0.0",
  "graph": { ... }
}
- Commit to:
  - Only additive changes in minor versions.
  - Backwards‑compatible changes within the same major version.
  - If you change anything structural (e.g. how symbol IDs work), bump schemaVersion major.
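A versioned envelope of this shape can be sketched with System.Text.Json as below. `GraphEnvelope` and `GraphJson` are hypothetical names, the graph payload is left generic, and the "accept any 1.x" reader policy is one possible reading of the compatibility rules above rather than a fixed requirement.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical envelope mirroring the top-level metadata section.
public sealed record GraphEnvelope<TGraph>(
    string SchemaVersion,
    DateTimeOffset GeneratedAt,
    string Tool,
    string ToolVersion,
    TGraph Graph);

public static class GraphJson
{
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        Converters = { new JsonStringEnumConverter() } // enums as "Pe", "High", ...
    };

    // The timestamp is passed in so that callers (and tests) control it.
    public static string Serialize<TGraph>(TGraph graph, string toolVersion, DateTimeOffset generatedAt)
        => JsonSerializer.Serialize(
            new GraphEnvelope<TGraph>("1.0.0", generatedAt, "StellaOps.Scanner.BinaryReachability", toolVersion, graph),
            Options);

    // Assumed reader policy: accept any document within major version 1.
    public static bool IsSupported(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var version = doc.RootElement.GetProperty("schemaVersion").GetString();
        return version is not null && version.StartsWith("1.", StringComparison.Ordinal);
    }
}
```

Taking generatedAt as a parameter keeps the rest of the output reproducible, since the timestamp is the only field that would otherwise differ between two runs over the same input.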
Optionally, provide a compact JSON schema file (or at least a documented shape) so other teams can implement readers in other languages.
8. Concurrency, Streaming, and Large Images
For best‑in‑class scalability, specify how large images are handled.
Clarify in the spec:
- Parallelization
  - BinaryReachabilityService.BuildGraph:
    - May scan binaries in parallel using Parallel.ForEach.
    - All parsers must be thread‑safe and not rely on shared mutable state.
    - ReachabilityGraph itself uses Dictionary and List, which are not safe for concurrent writes, so parallel workers should collect per‑binary results and merge them on one thread (or under a lock).
- Streaming option (optional but recommended)
  - Provide a second API for very large repositories:
public interface IGraphSink
{
    void OnBinary(BinaryNode binary);
    void OnSymbol(SymbolNode symbol);
    void OnEdge(CallEdge edge);
}
void BuildGraphStreaming(string rootDirectory, ISbomComponentResolver sbomResolver, IGraphSink sink);
  - This allows building graphs into a database or message bus without keeping everything in memory.
Even if you do not implement streaming immediately, designing the interface now keeps the architecture future‑proof.
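As an illustration of how small a sink can be, here is a counting implementation that stores nothing. The record types are one-field stand-ins for the spec's models and `CountingGraphSink` is a hypothetical name. It uses Interlocked on the assumption that callbacks may arrive from parallel scans.

```csharp
using System.Threading;

// One-field stand-ins for the spec's model records, for illustration only.
public sealed record BinaryNode(string BinaryId);
public sealed record SymbolNode(string SymbolId);
public sealed record CallEdge(string FromSymbolId, string ToSymbolId);

public interface IGraphSink
{
    void OnBinary(BinaryNode binary);
    void OnSymbol(SymbolNode symbol);
    void OnEdge(CallEdge edge);
}

// Counts items instead of storing them, so memory use stays constant.
public sealed class CountingGraphSink : IGraphSink
{
    private long _binaries;
    private long _symbols;
    private long _edges;

    public long Binaries => Interlocked.Read(ref _binaries);
    public long Symbols => Interlocked.Read(ref _symbols);
    public long Edges => Interlocked.Read(ref _edges);

    public void OnBinary(BinaryNode binary) => Interlocked.Increment(ref _binaries);
    public void OnSymbol(SymbolNode symbol) => Interlocked.Increment(ref _symbols);
    public void OnEdge(CallEdge edge) => Interlocked.Increment(ref _edges);
}
```

A sink like this doubles as a cheap way to produce the summary counts the logging section asks for.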
9. Observability and Diagnostics
Best‑in‑class implementation requires good introspection for debugging wrong reachability conclusions.
Specify minimal observability requirements:
- Logging
  - At least:
    - Info: number of binaries, symbols, edges, time taken.
    - Warning: unsupported binary formats, SBOM resolution failures, demangling failures.
    - Error: parser exceptions per file (with file path and format).
- Debug artifacts
  - Optional environment or flag that dumps per‑binary debug info:
    - Raw symbol table (names + addresses).
    - Normalized names and digests.
    - Library references.
    - Call edges for that binary.
- Metrics hooks
  - Provide a simple interface for metrics:
public interface IReachabilityMetrics
{
    void IncrementCounter(string name, long value = 1);
    void ObserveDuration(string name, TimeSpan duration);
}
  And allow BinaryReachabilityService to be constructed with an optional metrics implementation.
10. Expanded Test Strategy and Quality Gates
Your test plan is decent but can be made more systematic.
Extend test plan:
- Golden corpus
  - Maintain a small but curated set of PE/ELF/Mach‑O binaries (checked in or generated) where:
    - Expected symbols and edges are stored as JSON.
    - CI compares current output with the golden graph byte‑for‑byte (or structurally).
- Cross‑compiler coverage
  - At least:
    - C/C++ built by different toolchains (MSVC, clang, gcc).
    - Different optimization levels (-O0, -O2) to ensure stability of parsing.
- Fuzzing / robustness
  - Create tests with truncated / corrupted binaries to ensure:
    - No crashes.
    - Meaningful, bounded error behavior.
- SBOM integration tests
  - For a test root directory:
    - Synthetic SBOM mapping files to binaries.
    - Validate correct purl assignment and conflict handling.
- Determinism tests
  - Run BuildGraph twice on the same directory and assert that:
    - Graph is structurally identical (including order‑independent comparison).
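The order-independent comparison in the determinism test can be sketched by reducing each run to a sorted canonical form. `EdgeTuple` is a hypothetical flattened edge used only for comparison and `GraphDeterminism` is a hypothetical helper. The sketch assumes symbol ids do not contain the '|' separator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flattened edge used only for comparison.
public sealed record EdgeTuple(string From, string To, string Kind, ulong Offset);

public static class GraphDeterminism
{
    // Canonical form: one line per edge, sorted ordinally, so list order does not matter.
    // Duplicate edges are kept, so the comparison is over multisets.
    public static string Canonicalize(IEnumerable<EdgeTuple> edges)
        => string.Join("\n", edges
            .Select(e => $"{e.From}|{e.To}|{e.Kind}|{e.Offset}")
            .OrderBy(line => line, StringComparer.Ordinal));

    public static bool AreEquivalent(IEnumerable<EdgeTuple> first, IEnumerable<EdgeTuple> second)
        => Canonicalize(first) == Canonicalize(second);
}
```

The same canonical string can be written to disk as the golden file, which gives the byte-for-byte comparison mentioned under "Golden corpus" without depending on enumeration order.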
This makes it much harder for regressions to slip in when you extend parsers or normalization.
11. Clear Extension Points and Roadmap Notes
Finally, add a short “Future Extensions” section so the dev knows what to keep in mind when structuring code:
- Support for:
  - Inlined function tracking (via DWARF/PDB).
  - Managed .NET assemblies’ metadata (C# IL call graph).
  - Dynamic edge sources (runtime traces) merged into the same graph.
- The spec should instruct: “Design parsers and the graph model so they can accept additional EdgeSource types and symbol metadata without breaking existing consumers.”
That gives the current implementation a clear direction and prevents design dead ends.
If you want, I can now take your current spec and rewrite it into a single, consolidated “v1.0 Best‑in‑Class Spec” document that incorporates all of the above changes, so you can hand it directly to an implementation team.