Here’s a quick, practical win for your SBOM/runtime join story: record the ELF build‑id alongside soname and path when mapping modules to purls.
Why it matters:
- build-id (from .note.gnu.build-id) is a content hash that uniquely identifies an ELF image, even if filenames/paths change.
- Distros and debuginfod index debug symbols by build-id, so you can reliably join runtime traces → binaries → SBOM entries → debug artifacts.
- It hardens reachability and VEX joins (no “same soname, different bits” ambiguity).
What to capture per ELF
- soname (if shared object)
- full path at runtime
- purl (package URL from your resolver)
- build_id (hex, no colons)
- arch, file type (ET_DYN/ET_EXEC), and build-id source (NT_GNU_BUILD_ID)
How to read it (portable snippets)
CLI
# show build-id quickly
readelf -n /path/to/bin | awk '/Build ID:/ {print $3}'
# or:
objdump -s --section .note.gnu.build-id /path/to/bin
C (runtime collector)
#define _GNU_SOURCE              // needed for dl_iterate_phdr in <link.h>
#include <link.h>
#include <string.h>

static int note_cb(struct dl_phdr_info *info, size_t size, void *data) {
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type == PT_NOTE) {
            // notes start at info->dlpi_addr + ph->p_vaddr and span ph->p_memsz bytes
            // scan notes for NT_GNU_BUILD_ID (type=3, name="GNU")
            // extract desc bytes → hex string build_id
        }
    }
    return 0;
}
// call dl_iterate_phdr(note_cb, NULL);
Go (scanner)
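f, err := elf.Open(path)
if err != nil { return err }
defer f.Close()
// debug/elf does not expose parsed notes, so read the note section directly.
if sec := f.Section(".note.gnu.build-id"); sec != nil {
    if b, err := sec.Data(); err == nil && len(b) >= 16 {
        namesz, descsz, typ := f.ByteOrder.Uint32(b[0:]), f.ByteOrder.Uint32(b[4:]), f.ByteOrder.Uint32(b[8:])
        off := 12 + uint64(namesz+3)&^3 // desc follows the 4-byte-padded name
        if typ == 3 && string(b[12:16]) == "GNU\x00" && off+uint64(descsz) <= uint64(len(b)) { // 3 = NT_GNU_BUILD_ID
            buildID := fmt.Sprintf("%x", b[off:off+uint64(descsz)])
            _ = buildID // record buildID
        }
    }
}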
Suggested Stella Ops schema (add field, no versioning break)
{
"module": {
"path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
"soname": "libssl.so.3",
"purl": "pkg:deb/ubuntu/openssl@3.0.2-0ubuntu1.10?arch=amd64",
"elf": {
"build_id": "a1b2c3d4e5f6...",
"type": "ET_DYN",
"arch": "x86_64",
"notes": { "source": "NT_GNU_BUILD_ID" }
}
}
}
Join strategy
- Runtime → build‑id: collect from process maps (or dl_iterate_phdr) and file scan fallback.
- SBOM → candidate binaries: map by purl/filename, then confirm by build‑id where available.
- Debug/Source: query debuginfod or distro debug repos by build‑id to fetch symbols for precise call‑graph and reachability.
- VEX/Policies: treat build‑id as the primary key for binary‑level assertions; purl stays as the package‑level key.
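The join order above can be sketched as a two-step lookup: build-id first, file name only as a guarded fallback. This is a minimal illustration, assuming C# 9 or later; `RuntimeModule`, `SbomBinary`, and `BuildIdJoin` are hypothetical names invented here, not part of any existing schema or resolver.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes for illustration only.
public sealed record RuntimeModule(string Path, string? BuildId);
public sealed record SbomBinary(string Purl, string FileName, string? BuildId);

public static class BuildIdJoin
{
    // Returns the purl for a runtime module, or null when no safe match exists.
    public static string? ResolvePurl(RuntimeModule module, IReadOnlyCollection<SbomBinary> sbom)
    {
        // 1. Exact match on build-id: the binary-level key.
        if (module.BuildId is not null)
        {
            SbomBinary? exact = sbom.FirstOrDefault(b =>
                string.Equals(b.BuildId, module.BuildId, StringComparison.OrdinalIgnoreCase));
            if (exact is not null) return exact.Purl;
        }

        // 2. Fall back to the file name, but only when it is unambiguous and
        //    the single candidate does not carry a different build-id.
        string fileName = System.IO.Path.GetFileName(module.Path);
        List<SbomBinary> candidates = sbom.Where(b => b.FileName == fileName).ToList();
        if (candidates.Count != 1) return null;

        SbomBinary only = candidates[0];
        // Both sides have a build-id and step 1 did not match: same name, different bits.
        bool conflict = module.BuildId is not null && only.BuildId is not null;
        return conflict ? null : only.Purl;
    }
}
```

Refusing the file-name fallback on a build-id conflict is the design choice that removes the "same soname, different bits" ambiguity; whether a conflict should instead be reported as a finding is left to the caller.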
Edge handling
- Stripped binaries: the build-id normally survives stripping because it lives in a note; if it is missing, fall back to a full-file hash and flag build_id_absent=true.
- Containers: read build-ids from the ELF files in each image layer and cache them in your “Proof-of-Integrity Graph.”
- Kernel/Modules: same idea, read /sys/module/*/notes/.note.gnu.build-id.
Quick acceptance tests
- Scan a container image (Debian/Ubuntu/RHEL) and verify >90% of ELF objects yield a build‑id.
- Cross‑check one binary: path changes across containers, build‑id stays identical.
- Fetch symbols via debuginfod using that build‑id and run a tiny call‑graph demo to prove determinism.
If you want, I can draft the exact .NET 10 collector for Linux (P/Invoke dl_iterate_phdr) and a CycloneDX extension block to store build_id.
Here’s a concrete “implementation spec” for a C# dev to build an ELF metadata / build-id collector (“elf builder”). I’ll treat this as a small reusable .NET library plus some process-level helpers.
1. Goal & Scope
Goal: From C# on Linux, be able to:
- Given an ELF file path, extract:
  - build-id (from .note.gnu.build-id, i.e. NT_GNU_BUILD_ID)
  - soname (for shared objects)
  - ELF type (ET_EXEC / ET_DYN / etc.)
  - machine architecture
  - file path
  - optional fallback: full-file hash if build-id is missing
- Given a running process (usually self), enumerate loaded ELF modules and attach the above metadata per module.
The output will power your SBOM/runtime join (path + soname + build-id → purl).
2. Public API Spec
2.1 Core model
public enum ElfFileType
{
Unknown = 0,
Relocatable = 1, // ET_REL
Executable = 2, // ET_EXEC
SharedObject = 3, // ET_DYN
Core = 4 // ET_CORE
}
public sealed class ElfMetadata
{
public required string Path { get; init; }
public string? Soname { get; init; }
public string? BuildId { get; init; } // Hex, lowercase, no colons
public string BuildIdSource { get; init; } = ""; // "NT_GNU_BUILD_ID" | "FileHash" | ""
public ElfFileType FileType { get; init; }
public string Machine { get; init; } = ""; // e.g. "x86_64", "aarch64"
public bool Is64Bit { get; init; }
public bool IsLittleEndian { get; init; }
public string? FileHashSha256 { get; init; } // only if BuildId == null
}
2.2 File-level API
public static class ElfReader
{
/// <summary>
/// Parse the ELF file at the given path and extract metadata.
/// Throws if file is not ELF or cannot be read.
/// </summary>
public static ElfMetadata ReadMetadata(string path);
}
Behavior:
- Validates ELF magic.
- Supports both 32-bit and 64-bit ELF.
- Supports little and big endian (but you can initially only test little-endian).
- Uses program headers (PT_NOTE) and note parsing to extract build-id.
- Uses section headers + .dynamic to extract DT_SONAME.
- Sets BuildIdSource = "NT_GNU_BUILD_ID" if build-id present.
- If no build-id, computes FileHashSha256 and sets BuildIdSource = "FileHash".
2.3 Process-level API (Linux)
public static class ElfProcessScanner
{
/// <summary>
/// Enumerate ELF modules for the current process (default) or a given pid.
/// Only returns unique paths that are actual ELF files.
/// </summary>
public static IReadOnlyList<ElfMetadata> GetProcessModules(int? pid = null);
}
Default implementation:
- Only supports Linux.
- Reads /proc/<pid>/maps.
- Filters entries that map regular files (path not [vdso], [heap], etc.).
- De-duplicates by canonical path (e.g. realpath behavior).
- For each unique path:
  - Check first 4 bytes for ELF magic.
  - Call ElfReader.ReadMetadata(path).
3. ELF Parsing: Binary Layout & Rules
You do not need unsafe code; a BinaryReader is enough.
3.1 ELF header
First 16 bytes: e_ident[].
Key fields:
- e_ident[0..3] = 0x7F, 'E', 'L', 'F' (magic)
- e_ident[4] = EI_CLASS:
  - 1 = 32-bit (ELFCLASS32)
  - 2 = 64-bit (ELFCLASS64)
- e_ident[5] = EI_DATA:
  - 1 = little-endian (ELFDATA2LSB)
  - 2 = big-endian (ELFDATA2MSB)
Then the “native” header fields, which differ slightly between 32 & 64 bit.
Define one internal class that covers both the 32-bit and 64-bit layouts (don’t use [StructLayout]; just read fields manually and widen the 32-bit values):
internal sealed class ElfHeaderCommon
{
public byte[] Ident = new byte[16];
public ushort Type; // e_type
public ushort Machine; // e_machine
public uint Version; // e_version
public ulong Entry; // e_entry (32/64 sized)
public ulong Phoff; // e_phoff
public ulong Shoff; // e_shoff
public uint Flags; // e_flags
public ushort Ehsize; // e_ehsize
public ushort Phentsize; // e_phentsize
public ushort Phnum; // e_phnum
public ushort Shentsize; // e_shentsize
public ushort Shnum; // e_shnum
public ushort Shstrndx; // e_shstrndx
}
Algorithm to read header:
- ReadBytes(16) → Ident. Validate magic & EI_CLASS/EI_DATA.
- Decide is64 (from EI_CLASS) and littleEndian (from EI_DATA).
- Use helper methods:
  static ushort ReadUInt16(BinaryReader br, bool little) { ... }
  static uint ReadUInt32(BinaryReader br, bool little) { ... }
  static ulong ReadUInt64(BinaryReader br, bool little) { ... }
  BinaryReader always decodes multi-byte integers as little-endian, so these helpers must reverse the bytes whenever the file is big-endian.
- For 32-bit ELF: fields Entry, Phoff, Shoff are 4-byte values that you zero-extend to 64-bit.
- For 64-bit ELF: fields are 8-byte values.
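The helper bodies are elided above. One possible way to fill them in, assuming .NET Core 2.1 or later for `System.Buffers.Binary.BinaryPrimitives`; the class name `EndianHelpers` and the extra `ReadNative` and `ReadExact` methods are hypothetical additions, not part of the spec:

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical container class; the spec only names the three Read* helpers.
internal static class EndianHelpers
{
    // Reads exactly `count` bytes or throws, so truncated files fail loudly.
    private static byte[] ReadExact(BinaryReader br, int count)
    {
        byte[] buf = br.ReadBytes(count);
        if (buf.Length != count)
            throw new EndOfStreamException($"Expected {count} bytes, got {buf.Length}.");
        return buf;
    }

    public static ushort ReadUInt16(BinaryReader br, bool little)
    {
        byte[] b = ReadExact(br, 2);
        return little ? BinaryPrimitives.ReadUInt16LittleEndian(b)
                      : BinaryPrimitives.ReadUInt16BigEndian(b);
    }

    public static uint ReadUInt32(BinaryReader br, bool little)
    {
        byte[] b = ReadExact(br, 4);
        return little ? BinaryPrimitives.ReadUInt32LittleEndian(b)
                      : BinaryPrimitives.ReadUInt32BigEndian(b);
    }

    public static ulong ReadUInt64(BinaryReader br, bool little)
    {
        byte[] b = ReadExact(br, 8);
        return little ? BinaryPrimitives.ReadUInt64LittleEndian(b)
                      : BinaryPrimitives.ReadUInt64BigEndian(b);
    }

    // Address/offset fields are 4 bytes in ELF32 and 8 bytes in ELF64;
    // the 32-bit value is zero-extended to ulong.
    public static ulong ReadNative(BinaryReader br, bool is64, bool little)
        => is64 ? ReadUInt64(br, little) : ReadUInt32(br, little);
}
```

Decoding through `BinaryPrimitives` makes the result independent of host endianness, which avoids the manual byte-swap logic entirely.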
3.2 Program headers (for build-id)
Each program header:
- 32-bit:
  uint32 p_type;
  uint32 p_offset;
  uint32 p_vaddr;
  uint32 p_paddr;
  uint32 p_filesz;
  uint32 p_memsz;
  uint32 p_flags;
  uint32 p_align;
- 64-bit:
  uint32 p_type;
  uint32 p_flags;
  uint64 p_offset;
  uint64 p_vaddr;
  uint64 p_paddr;
  uint64 p_filesz;
  uint64 p_memsz;
  uint64 p_align;
You only really need:
- p_type (look for PT_NOTE = 4)
- p_offset
- p_filesz
Reading algorithm:
internal sealed class ProgramHeader
{
public uint Type;
public ulong Offset;
public ulong FileSize;
}
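- Seek to header.Phoff.
- For i = 0..Phnum-1 (entry i starts at Phoff + i * Phentsize):
  - For 32-bit:
    - Type = ReadUInt32()
    - Offset = ReadUInt32()
    - Skip p_vaddr and p_paddr (8 bytes).
    - FileSize = ReadUInt32()
    - Skip the rest.
  - For 64-bit:
    - Type = ReadUInt32()
    - flags = ReadUInt32() (ignored)
    - Offset = ReadUInt64()
    - Skip p_vaddr and p_paddr (16 bytes).
    - FileSize = ReadUInt64()
    - Skip the rest.
- Store those with Type == 4 (PT_NOTE).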
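Because p_filesz sits after p_vaddr and p_paddr in both layouts, it is easy to read the wrong bytes when pulling fields sequentially. A sketch that reads each entry into a buffer and picks fields by their fixed offsets instead; `ProgramHeaderReader` and `ReadNoteSegments` are hypothetical names, and the method takes the relevant ELF header values as plain parameters so it stays self-contained:

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;

internal static class ProgramHeaderReader
{
    private const uint PtNote = 4;

    // Returns (p_offset, p_filesz) for every PT_NOTE entry in the program header table.
    public static List<(ulong Offset, ulong FileSize)> ReadNoteSegments(
        Stream s, ulong phoff, ushort phentsize, ushort phnum, bool is64, bool little)
    {
        int needed = is64 ? 40 : 20;   // bytes up to and including p_filesz
        if (phentsize < needed)
            throw new InvalidDataException("e_phentsize is too small for a program header.");

        var notes = new List<(ulong Offset, ulong FileSize)>();
        byte[] e = new byte[needed];
        for (int i = 0; i < phnum; i++)
        {
            // checked: a hostile e_phoff must not wrap around silently.
            long pos = checked((long)(phoff + (ulong)i * phentsize));
            s.Seek(pos, SeekOrigin.Begin);
            ReadFully(s, e);

            uint type = U32(e, 0, little);
            // ELF32: p_offset at 4, p_filesz at 16. ELF64: p_offset at 8, p_filesz at 32.
            ulong offset = is64 ? U64(e, 8, little) : U32(e, 4, little);
            ulong fileSize = is64 ? U64(e, 32, little) : U32(e, 16, little);
            if (type == PtNote) notes.Add((offset, fileSize));
        }
        return notes;
    }

    private static void ReadFully(Stream s, byte[] buf)
    {
        int got = 0;
        while (got < buf.Length)
        {
            int n = s.Read(buf, got, buf.Length - got);
            if (n == 0) throw new EndOfStreamException("Truncated program header table.");
            got += n;
        }
    }

    private static uint U32(byte[] b, int at, bool little) => little
        ? BinaryPrimitives.ReadUInt32LittleEndian(b.AsSpan(at))
        : BinaryPrimitives.ReadUInt32BigEndian(b.AsSpan(at));

    private static ulong U64(byte[] b, int at, bool little) => little
        ? BinaryPrimitives.ReadUInt64LittleEndian(b.AsSpan(at))
        : BinaryPrimitives.ReadUInt64BigEndian(b.AsSpan(at));
}
```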
3.3 Note segments & NT_GNU_BUILD_ID
Each note has:
uint32 namesz;
uint32 descsz;
uint32 type;
char name[namesz]; // padded to 4-byte boundary
byte desc[descsz]; // padded to 4-byte boundary
We care about:
- type == 3 (NT_GNU_BUILD_ID)
- name == "GNU" (null-terminated; usually "GNU\0")
Algorithm:
For each PT_NOTE program header:
- Seek to ph.Offset, set remaining = ph.FileSize.
- While remaining >= 12:
  - namesz = ReadUInt32()
  - descsz = ReadUInt32()
  - type = ReadUInt32()
  - remaining -= 12.
  - If namesz or descsz is larger than remaining, stop: the note is truncated or corrupt.
  - Read nameBytes = ReadBytes(namesz); remaining -= namesz.
    - Skip padding: pad = (4 - (namesz % 4)) & 3; Seek(pad), remaining -= pad.
  - Read desc = ReadBytes(descsz); remaining -= descsz.
    - Skip padding: pad = (4 - (descsz % 4)) & 3; Seek(pad), remaining -= pad.
  - If type == 3 and Encoding.ASCII.GetString(nameBytes).TrimEnd('\0') == "GNU":
    - Convert desc to hex: string buildId = BitConverter.ToString(desc).Replace("-", "").ToLowerInvariant();
    - Return immediately.
If no note matches, return null, and you can later fall back to FileHashSha256.
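The note walk can be sketched over an in-memory copy of one PT_NOTE segment, which keeps the bounds checks in one place. This assumes .NET 5 or later (for `Convert.ToHexString`); `NoteParser` and its members are hypothetical names, and returning null on a corrupt note is one possible policy rather than the only one:

```csharp
using System;
using System.Buffers.Binary;

internal static class NoteParser
{
    private const uint NtGnuBuildId = 3;
    private static readonly byte[] GnuName = { (byte)'G', (byte)'N', (byte)'U', 0 };

    // Bytes needed to round `size` up to the next 4-byte boundary.
    private static long Pad4(uint size) => (4 - (size & 3)) & 3;

    private static uint U32(ReadOnlySpan<byte> s, bool little) => little
        ? BinaryPrimitives.ReadUInt32LittleEndian(s)
        : BinaryPrimitives.ReadUInt32BigEndian(s);

    // `segment` holds the raw bytes of one PT_NOTE segment.
    // Returns the build-id as lowercase hex, or null if none is found.
    public static string? TryReadBuildId(ReadOnlySpan<byte> segment, bool little)
    {
        long pos = 0;
        while (segment.Length - pos >= 12)
        {
            int p = (int)pos;
            uint namesz = U32(segment.Slice(p), little);
            uint descsz = U32(segment.Slice(p + 4), little);
            uint type = U32(segment.Slice(p + 8), little);

            // long arithmetic: hostile sizes cannot overflow before the check.
            long nameStart = pos + 12;
            long descStart = nameStart + namesz + Pad4(namesz);
            long descEnd = descStart + descsz;
            if (descEnd > segment.Length) return null;   // truncated or corrupt note

            ReadOnlySpan<byte> name = segment.Slice((int)nameStart, (int)namesz);
            ReadOnlySpan<byte> desc = segment.Slice((int)descStart, (int)descsz);
            if (type == NtGnuBuildId && descsz > 0 &&
                name.SequenceEqual(new ReadOnlySpan<byte>(GnuName)))
            {
                return Convert.ToHexString(desc).ToLowerInvariant();
            }

            pos = descEnd + Pad4(descsz);   // advance to the next note
        }
        return null;
    }
}
```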
3.4 Section headers & SONAME
You need DT_SONAME from the dynamic section. Steps:
- Read section headers from Shoff (ELF header).
  Minimal section header model:
  internal sealed class SectionHeader
  {
      public uint Name;     // index into shstrtab
      public uint Type;     // SHT_*
      public ulong Offset;
      public ulong Size;
      public uint Link;     // for some types
  }
  For each section:
  - Read Name, Type, Flags (ignored), Addr (ignored), Offset, Size, Link, etc.
  - Keep these in an array.
- Find the section header string table (shstrtab):
  - Use header.Shstrndx to locate its section header.
  - Read that section’s bytes into shStrTab.
  - Define helper to get section name:
    static string ReadNullTerminatedString(byte[] table, uint offset)
    {
        int i = (int)offset;
        int start = i;
        while (i < table.Length && table[i] != 0) i++;
        return Encoding.ASCII.GetString(table, start, i - start);
    }
- Use shStrTab to find:
  - .dynamic section (Type == 6 i.e. SHT_DYNAMIC).
  - The string table it references (SectionHeader.Link → index of the dynamic string table, often .dynstr).
- Parse the dynamic section:
  - Elf64_Dyn is an array of entries:
    int64 d_tag;
    uint64 d_val;
    (For 32-bit, both are 4 bytes; you can cast to 64-bit.)
  - For each entry:
    - Read d_tag (signed, but you can treat as 64-bit).
    - Read d_val.
    - If d_tag == 0 (DT_NULL), stop: it terminates the array.
    - If d_tag == 14 (DT_SONAME), then d_val is an offset into the dynstr string table.
- Read SONAME:
  - Use dynstr bytes + d_val as index, decode null-terminated ASCII → Soname.
If there is no .dynamic section or no DT_SONAME, set Soname = null.
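The last two steps can be sketched as a pure function over the already-loaded bytes of .dynamic and its string table. `DynamicParser` is a hypothetical name, and the method assumes the caller has located and read both sections as described above:

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

internal static class DynamicParser
{
    private const long DtNull = 0, DtSoname = 14;

    // `dynamic` = raw bytes of the .dynamic section, `dynstr` = its linked string table.
    public static string? TryReadSoname(
        ReadOnlySpan<byte> dynamic, ReadOnlySpan<byte> dynstr, bool is64, bool little)
    {
        int entrySize = is64 ? 16 : 8;   // sizeof(Elf64_Dyn) / sizeof(Elf32_Dyn)
        for (int pos = 0; pos + entrySize <= dynamic.Length; pos += entrySize)
        {
            ReadOnlySpan<byte> e = dynamic.Slice(pos, entrySize);
            long tag;
            ulong val;
            if (is64)
            {
                tag = little ? BinaryPrimitives.ReadInt64LittleEndian(e) : BinaryPrimitives.ReadInt64BigEndian(e);
                val = little ? BinaryPrimitives.ReadUInt64LittleEndian(e.Slice(8)) : BinaryPrimitives.ReadUInt64BigEndian(e.Slice(8));
            }
            else
            {
                tag = little ? BinaryPrimitives.ReadInt32LittleEndian(e) : BinaryPrimitives.ReadInt32BigEndian(e);
                val = little ? BinaryPrimitives.ReadUInt32LittleEndian(e.Slice(4)) : BinaryPrimitives.ReadUInt32BigEndian(e.Slice(4));
            }

            if (tag == DtNull) break;          // end of the dynamic array
            if (tag != DtSoname) continue;

            if (val >= (ulong)dynstr.Length) return null;   // offset outside the string table
            ReadOnlySpan<byte> rest = dynstr.Slice((int)val);
            int end = rest.IndexOf((byte)0);
            if (end < 0) return null;                        // unterminated string
            return Encoding.ASCII.GetString(rest.Slice(0, end));
        }
        return null;
    }
}
```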
3.5 Mapping e_machine to architecture string
e_machine is a numeric code. Map the most common ones:
static string MapMachine(ushort eMachine) => eMachine switch
{
3 => "x86", // EM_386
62 => "x86_64", // EM_X86_64
40 => "arm", // EM_ARM
183 => "aarch64", // EM_AARCH64
8 => "mips", // EM_MIPS
_ => $"unknown({eMachine})"
};
3.6 Mapping e_type to ElfFileType
static ElfFileType MapFileType(ushort eType) => eType switch
{
1 => ElfFileType.Relocatable, // ET_REL
2 => ElfFileType.Executable, // ET_EXEC
3 => ElfFileType.SharedObject,// ET_DYN
4 => ElfFileType.Core, // ET_CORE
_ => ElfFileType.Unknown
};
3.7 Fallback: SHA-256 hash
If build-id is missing:
static string ComputeFileSha256(string path)
{
using var sha = System.Security.Cryptography.SHA256.Create();
using var fs = File.OpenRead(path);
var hash = sha.ComputeHash(fs);
return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
}
Set:
- BuildId = null
- BuildIdSource = "FileHash"
- FileHashSha256 = computedHash
4. Implementation Skeleton (ElfReader)
Here’s a compact skeleton tying it together:
public static class ElfReader
{
public static ElfMetadata ReadMetadata(string path)
{
using var fs = File.OpenRead(path);
using var br = new BinaryReader(fs);
// 1. Read e_ident
byte[] ident = br.ReadBytes(16);
if (ident.Length < 16 ||
ident[0] != 0x7F || ident[1] != (byte)'E' ||
ident[2] != (byte)'L' || ident[3] != (byte)'F')
{
throw new InvalidDataException("Not an ELF file.");
}
bool is64 = ident[4] == 2; // EI_CLASS
bool little = ident[5] == 1; // EI_DATA
// 2. Read header
var header = ReadElfHeader(br, ident, is64, little);
// 3. Read program headers
var phdrs = ReadProgramHeaders(br, header, is64, little);
// 4. Extract build-id from PT_NOTE
string? buildId = TryReadBuildIdFromNotes(br, phdrs, little, is64);
// 5. Read SONAME from .dynamic
string? soname = TryReadSoname(br, header, is64, little);
// 6. Map machine & type
string machine = MapMachine(header.Machine);
ElfFileType fileType = MapFileType(header.Type);
// 7. Hash fallback
string? fileHash = null;
string source;
if (buildId is null)
{
fileHash = ComputeFileSha256(path);
source = "FileHash";
}
else
{
source = "NT_GNU_BUILD_ID";
}
return new ElfMetadata
{
Path = path,
Soname = soname,
BuildId = buildId,
BuildIdSource = source,
FileType = fileType,
Machine = machine,
Is64Bit = is64,
IsLittleEndian = little,
FileHashSha256 = fileHash
};
}
// ... implement ReadElfHeader, ReadProgramHeaders,
// TryReadBuildIdFromNotes, TryReadSoname, MapMachine,
// MapFileType, ComputeFileSha256, + endian helpers ...
}
I didn’t expand every helper to keep this readable, but all helpers follow exactly the rules in section 3.
5. Process Scanner Spec (Linux)
5.1 Reading /proc/<pid>/maps
Each line looks roughly like:
7f2d9c214000-7f2d9c234000 r--p 00000000 08:01 1234567 /usr/lib/x86_64-linux-gnu/libssl.so.3
Last field is the file path, if any.
Algorithm:
public static class ElfProcessScanner
{
public static IReadOnlyList<ElfMetadata> GetProcessModules(int? pid = null)
{
int actualPid = pid ?? Environment.ProcessId;
string mapsPath = $"/proc/{actualPid}/maps";
if (!File.Exists(mapsPath))
throw new PlatformNotSupportedException("Only supported on Linux with /proc.");
var paths = new HashSet<string>(StringComparer.Ordinal);
foreach (var line in File.ReadLines(mapsPath))
{
int idx = line.IndexOf('/');
if (idx < 0)
continue;
string p = line.Substring(idx).Trim();
// Pseudo entries such as [heap] or [vdso] contain no '/', so the
// IndexOf check above has already skipped them.
if (!File.Exists(p))
continue;
// De-duplicate by path string; HashSet.Add ignores repeats
paths.Add(p);
}
var result = new List<ElfMetadata>();
foreach (var p in paths)
{
if (!IsElfFile(p))
continue;
try
{
var meta = ElfReader.ReadMetadata(p);
result.Add(meta);
}
catch
{
// swallow or log; not all mapped files are valid ELF
}
}
return result;
}
private static bool IsElfFile(string path)
{
try
{
using var fs = File.OpenRead(path);
Span<byte> magic = stackalloc byte[4];
if (fs.Read(magic) != 4) return false;
return magic[0] == 0x7F && magic[1] == (byte)'E' &&
magic[2] == (byte)'L' && magic[3] == (byte)'F';
}
catch { return false; }
}
}
This is simple and robust. If you later want even more accurate results (e.g., also non-file-backed shared objects), you can add a P/Invoke path that uses dl_iterate_phdr, but /proc/<pid>/maps gets you the SBOM-relevant modules.
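The scanner above locates the path with IndexOf('/'). A slightly stricter alternative is to split the line into its six documented fields, so that a path containing spaces stays intact and pseudo entries are rejected explicitly. A sketch, assuming .NET Core 2.1 or later; `ProcMaps.TryGetMappedFile` is a hypothetical helper, and skipping " (deleted)" mappings is an assumption about the desired policy:

```csharp
using System;

internal static class ProcMaps
{
    // Fields of a /proc/<pid>/maps line: address perms offset dev inode [pathname].
    // Returns the mapped file path, or null for anonymous and pseudo mappings.
    public static string? TryGetMappedFile(string line)
    {
        // Limit to 6 parts so that spaces inside the pathname are preserved.
        string[] fields = line.Split(' ', 6, StringSplitOptions.RemoveEmptyEntries);
        if (fields.Length < 6) return null;          // anonymous mapping

        string path = fields[5].Trim();
        if (!path.StartsWith('/')) return null;      // [heap], [stack], [vdso], ...

        // The kernel appends " (deleted)" when the backing file was unlinked.
        // The path may now point at a different file, so skip it here.
        if (path.EndsWith(" (deleted)", StringComparison.Ordinal)) return null;

        return path;
    }
}
```

A file whose real name ends in " (deleted)" is indistinguishable from an unlinked one in this format, so a production scanner might confirm identity through /proc/<pid>/map_files instead.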
6. JSON / SBOM Integration (Optional but Recommended)
When you serialize ElfMetadata into your runtime evidence / graph, I’d recommend a nested ELF block, e.g.:
{
"path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
"soname": "libssl.so.3",
"purl": "pkg:deb/ubuntu/openssl@3.0.2-0ubuntu1.10?arch=amd64",
"elf": {
"build_id": "a1b2c3d4e5f6...",
"build_id_source": "NT_GNU_BUILD_ID",
"file_type": "SharedObject",
"machine": "x86_64",
"is_64bit": true,
"is_little_endian": true,
"file_hash_sha256": null
}
}
You can keep purl on the higher-level module object; build_id becomes the primary key for binary-accurate joins.
7. Testing Checklist
For a C# dev implementing this, I’d suggest these tests:
- Basic ELF detection
  - Provide a non-ELF file → ReadMetadata throws.
  - Provide /bin/ls (or similar) → parses successfully.
- 32-bit vs 64-bit
  - If you have a 32-bit ELF on the system, verify Is64Bit and basic fields.
- Build-id extraction
  - Compare ReadMetadata(path).BuildId against readelf -n path | grep 'Build ID:'.
  - Ensure they match.
- SONAME extraction
  - For a shared library, ensure Soname matches readelf -d path | grep SONAME.
- Hash fallback
  - On an ELF without build-id (stripping alone usually keeps the note, so link a test file with -Wl,--build-id=none), verify BuildId == null and FileHashSha256 != null.
- Process scanner
  - Call GetProcessModules() in your test app; ensure it returns at least:
    - your own executable
    - core system libs (libc.so.6, ld-linux*, etc.)
If you’d like, next step I can do is expand any of the helper methods (e.g. ReadElfHeader or TryReadSoname) into complete C# code, but this should be enough as a “full spec & directions” for a C# dev to implement the ELF build-id collector.
You’re already in “pretty good” territory with the current spec. To get to “best in class,” I’d tighten it in a few dimensions: API design, correctness, performance, observability, testing, and extensibility.
I’ll walk through concrete ways to upgrade what we already have, without rewriting everything from scratch.
1. Sharpen the public API
1.1 Split responsibilities into clear layers
Right now we have:
- ElfReader.ReadMetadata(string path)
- ElfProcessScanner.GetProcessModules(int? pid = null)
For a best‑in‑class library, I’d explicitly layer things:
public interface IElfParser
{
ElfMetadata Parse(Stream stream, string? pathHint = null);
}
public interface IElfFileInspector
{
ElfMetadata InspectFile(string path);
}
public interface IElfProcessInspector
{
IReadOnlyList<ElfMetadata> GetProcessModules(ElfProcessScanOptions? options = null);
}
With default implementations:
- ElfParser: pure, stateless binary parser (no file I/O).
- ElfFileInspector: wraps ElfParser + file system.
- ElfProcessInspector: wraps /proc/<pid>/maps (and optionally dl_iterate_phdr).
This makes testing simpler (you can feed a MemoryStream) and keeps “how we read” decoupled from “how we parse.”
1.2 Options objects & async variants
Give users knobs and modern .NET ergonomics:
public sealed class ElfProcessScanOptions
{
public int? Pid { get; init; }
public bool IncludeNonElfFiles { get; init; } = false;
public bool ParallelFileParsing { get; init; } = true;
public bool ComputeHashWhenBuildIdMissing { get; init; } = true;
public int? MaxFiles { get; init; } // safety valve on huge systems
}
public static class ElfProcessScanner
{
public static IReadOnlyList<ElfMetadata> GetProcessModules(
ElfProcessScanOptions? options = null);
public static IAsyncEnumerable<ElfMetadata> GetProcessModulesAsync(
ElfProcessScanOptions? options = null,
CancellationToken cancellationToken = default);
}
Same for file scans:
public sealed class ElfFileScanOptions
{
public bool ComputeFileHashWhenBuildIdPresent { get; init; } = false;
public bool ThrowOnNonElf { get; init; } = true;
}
public static ElfMetadata ReadMetadata(
string path,
ElfFileScanOptions? options = null);
1.3 Strong types for identity
Instead of string BuildId, add a value type:
public readonly struct ElfBuildId : IEquatable<ElfBuildId>
{
public string HexString { get; } // "a1b2c3..."
public string DebugPathComponent => $"{HexString[..2]}/{HexString[2..]}";
// Parse, TryParse, equality, GetHashCode, etc.
}
Then in ElfMetadata:
public ElfBuildId? BuildId { get; init; } // nullable
public string BuildIdSource { get; init; } // "NT_GNU_BUILD_ID" | "FileHash" | "None"
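This prevents subtle bugs from string normalization and gives you the two-level path component (first byte as directory, rest as file name) used by .build-id debug directories such as /usr/lib/debug/.build-id/ precomputed.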
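The elided members of the struct could look like the following. This is a sketch under stated assumptions: `TryParse` normalizes to lowercase and rejects colons, odd lengths, and anything shorter than two bytes so that `DebugPathComponent` is always well formed; those validation rules are choices made here, not requirements from the spec.

```csharp
using System;

public readonly struct ElfBuildId : IEquatable<ElfBuildId>
{
    public string HexString { get; }   // lowercase hex, no separators

    private ElfBuildId(string hex) => HexString = hex;

    // "a1b2c3..." -> "a1/b2c3...". Not valid on default(ElfBuildId).
    public string DebugPathComponent => $"{HexString[..2]}/{HexString[2..]}";

    public static bool TryParse(string? text, out ElfBuildId id)
    {
        id = default;
        if (string.IsNullOrEmpty(text) || text.Length < 4 || text.Length % 2 != 0)
            return false;
        foreach (char c in text)
        {
            if (!Uri.IsHexDigit(c)) return false;   // rejects ':' and other separators
        }
        id = new ElfBuildId(text.ToLowerInvariant());
        return true;
    }

    public bool Equals(ElfBuildId other) =>
        string.Equals(HexString, other.HexString, StringComparison.Ordinal);

    public override bool Equals(object? obj) => obj is ElfBuildId other && Equals(other);

    public override int GetHashCode() =>
        HexString is null ? 0 : StringComparer.Ordinal.GetHashCode(HexString);

    public override string ToString() => HexString ?? string.Empty;
}
```

Because the constructor is private, every instance other than `default` has passed validation, which is what makes ordinal equality safe.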
2. Make parsing spec‑accurate & robust
2.1 Handle both PT_NOTE and SHT_NOTE .note.gnu.build-id
Many binaries place build‑id in:
- PT_NOTE segments and/or
- a section named .note.gnu.build-id (SHT_NOTE)
Your spec only mentions PT_NOTE. For best coverage:
- Search all PT_NOTE segments for NT_GNU_BUILD_ID.
- If none found, search SHT_NOTE sections with name .note.gnu.build-id.
- If both exist and disagree (extremely rare), decide a precedence and log a diagnostic.
2.2 Correct note alignment rules
Spec nuance:
- Note fields (namesz, descsz, type) are always 4-byte aligned.
- On 64-bit, the overall note segment may be aligned to 8 bytes, but the internal padding rules still use 4-byte boundaries.
Your spec uses pad = (4 - (size % 4)) & 3, which is correct, but I’d codify it clearly:
static int NotePadding(int size) => (4 - (size & 3)) & 3;
And call that everywhere you advance across notes so future maintainers don’t “optimize” it incorrectly.
2.3 Be strict on bounds & corruption
Add explicit, defensive checks:
- Do not trust p_offset + p_filesz blindly.
- Before any read, verify offset + length <= streamLength.
- If the file lies about sizes, fail gracefully with a structured error.
E.g.:
public sealed class ElfParseException : Exception
{
public ElfParseErrorKind Kind { get; }
public string? Detail { get; }
// ...
}
public enum ElfParseErrorKind
{
NotElf,
TruncatedHeader,
TruncatedProgramHeader,
TruncatedSectionHeader,
TruncatedNote,
UnsupportedClass,
UnsupportedEndianness,
IoError,
Unknown
}
And then:
if (header.Phoff + (ulong)header.Phnum * header.Phentsize > (ulong)fs.Length)
throw new ElfParseException(ElfParseErrorKind.TruncatedProgramHeader, "...");
Best‑in‑class means you never trust the file, and your errors are debuggable.
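One detail worth encoding: the sum offset + length can itself wrap around when both values come from a hostile file, so the comparison is safer written as a subtraction. A small sketch; `Bounds.Fits` is a hypothetical helper name:

```csharp
using System;

internal static class Bounds
{
    // True when the range [offset, offset + length) lies inside a stream
    // of `streamLength` bytes. Written so that no addition can overflow.
    public static bool Fits(ulong offset, ulong length, long streamLength)
    {
        ulong total = (ulong)Math.Max(streamLength, 0L);
        return offset <= total && length <= total - offset;
    }
}
```

A caller would test `Bounds.Fits(ph.Offset, ph.FileSize, stream.Length)` before each read and throw the structured parse exception when it returns false.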
2.4 Big‑endian and 32‑bit are first‑class citizens
Even if your primary target is x86_64 Linux, a robust spec:
- Fully supports EI_CLASS = 1 and 2 (32/64).
- Fully supports EI_DATA = 1 and 2 (LSB/MSB).
- Has tests for at least one big‑endian ELF (e.g., sample artifacts in your test assets).
Your current spec mentions big-endian, but I’d explicitly require:
- A generic EndianBinaryReader abstraction that:
  - Wraps a Stream
  - Exposes ReadUInt16/32/64, ReadInt64, ReadBytes with endianness.
3. Performance & scale improvements
3.1 Avoid full-file reads by design
Your current design lets devs accidentally hash everything or read all sections even when not needed.
Refine the spec so that default path is minimal I/O:
- Read ELF header.
- Read program headers.
- Read only:
  - PT_NOTE ranges
  - Section headers (once)
  - .shstrtab, .dynamic, and its dynstr.
Only compute SHA-256 when expressly configured (via ElfFileScanOptions.ComputeFileHashWhenBuildIdPresent or ElfProcessScanOptions.ComputeHashWhenBuildIdMissing).
3.2 Optional memory‑mapped mode
For very large scans (filesystem crawls, containers), allow a mode that uses MemoryMappedFile:
public sealed class ElfReaderOptions
{
public bool UseMemoryMappedFile { get; init; } = false;
}
Internally, you can spec that the implementation:
- Uses MemoryMappedFile.CreateFromFile
- Creates views over relevant ranges (header, program headers, etc.)
- Avoids multiple OS reads for repeated random access.
3.3 Parallel directory / image scanning
If you foresee scanning whole images or file trees, define a helper:
public static class ElfDirectoryScanner
{
public static IReadOnlyList<ElfMetadata> Scan(
string rootDirectory,
ElfDirectoryScanOptions? options = null);
public static IAsyncEnumerable<ElfMetadata> ScanAsync(
string rootDirectory,
ElfDirectoryScanOptions? options = null,
CancellationToken cancellationToken = default);
}
public sealed class ElfDirectoryScanOptions
{
public SearchOption SearchOption { get; init; } = SearchOption.AllDirectories;
public int MaxDegreeOfParallelism { get; init; } = Environment.ProcessorCount;
public Func<string, bool>? PathFilter { get; init; } // e.g., skip /proc, /sys
}
And explicitly say that the implementation:
- Uses Parallel.ForEach (or Parallel.ForEachAsync, available since .NET 6) with bounded parallelism.
- Shares a single ElfParser across threads (it’s stateless).
- De-dups by (device, inode) when possible (see below).
4. Process scanner: correctness & completeness
4.1 De‑duplication by inode, not just path
The current spec de‑dups only by path. On Linux:
- Same inode may have multiple paths (hard links, bind mounts, chroot/container overlays).
For best‑in‑class accuracy of “unique binaries,” spec:
- De-duplicate entries by (st_dev, st_ino) from stat(2), not just string path.
- Provide both views: unique by file identity and by path.
API example:
public sealed class ElfProcessModules
{
public IReadOnlyList<ElfMetadata> UniqueFiles { get; init; } // dedup by inode
public IReadOnlyList<ElfModuleInstance> Instances { get; init; } // per mapping
}
public sealed class ElfModuleInstance
{
public ElfMetadata Metadata { get; init; }
public string Path { get; init; }
public string? MappingRange { get; init; } // "7f2d9c214000-7f2d9c234000"
}
And ElfProcessScanner.GetProcessModules returns an ElfProcessModules, not just a flat list.
4.2 Optional dl_iterate_phdr P/Invoke path
For a “maximum correctness” mode, you can specify:
- A secondary implementation that uses dl_iterate_phdr via P/Invoke (current process only, since it walks the caller's own loaded objects).
- This gives you module base addresses and sometimes more consistent views across distros.
- You can hybridize: use /proc/<pid>/maps for path enumeration and dl_iterate_phdr to confirm loaded segments (future feature).
You don’t have to implement it day one, but the spec can carve out an extension point:
public enum ElfProcessModuleSource
{
ProcMaps,
DlIteratePhdr
}
public sealed class ElfProcessScanOptions
{
public ElfProcessModuleSource Source { get; init; } = ElfProcessModuleSource.ProcMaps;
}
And define behavior if the requested source isn’t available.
5. Observability & diagnostics
Best‑in‑class libraries are easy to debug.
5.1 Structured diagnostics on parse failures
Instead of “swallow or log” in the scanner, define:
public sealed class ElfScanResult
{
public IReadOnlyList<ElfMetadata> Successes { get; init; }
public IReadOnlyList<ElfScanError> Errors { get; init; }
}
public sealed class ElfScanError
{
public string Path { get; init; }
public ElfParseErrorKind Kind { get; init; }
public string Message { get; init; }
}
And make ElfProcessScanner.GetProcessModules optionally return ElfScanResult (or have an overload).
This way you can:
- Report how many files failed.
- See common misconfigurations (e.g., insufficient permissions, truncated files).
5.2 Logging hooks instead of hard-coded logging
Don’t bake in a logging framework, but add a hook:
public interface IElfLogger
{
void Debug(string message);
void Info(string message);
void Warn(string message);
void Error(string message, Exception? ex = null);
}
public sealed class ElfReaderOptions
{
public IElfLogger? Logger { get; init; }
}
Then use it for “soft failures” (skipping non‑ELF files, ignoring suspect sections, etc.).
6. Security & safety considerations
6.1 Treat inputs as untrusted
Spec explicitly that:
- No ELF is ever loaded or executed.
- No ld.so / dynamic loading is used: all reading is via FileStream / MemoryMappedFile.
- No writes occur to inspected paths.
6.2 Control resource usage
For environments scanning untrusted file trees (e.g., user uploads):
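- Have configurable caps on:
  - MaxFileSizeBytes to parse.
  - MaxNotesPerSegment / MaxSections to avoid pathological “zip bomb” style ELFs.
- Fail with a structured ElfParseException rather than exhausting RAM. The ElfParseErrorKind enum above has no limit-specific member, so either add one or reuse an existing kind such as TruncatedHeader.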
7. Testing & validation: make it part of the spec
Instead of just “add tests,” bake them in as requirements.
7.1 Golden tests vs readelf or llvm-readobj
Define that CI must include:
- For a set of ELFs (32-bit, 64-bit, big-endian, stripped, PIE, static):
  - Compare ElfMetadata.BuildId with readelf -n output.
  - Compare ElfMetadata.Soname with readelf -d / objdump -p.
You don’t need to name the exact tools in the API, but the spec can say:
“The library’s test suite must cross-validate build-id and SONAME values against a trusted system tool (such as readelf or llvm-readobj) for a curated set of binaries.”
7.2 Fuzzing & corruption tests
Add:
- A small fuzz harness that:
  - Mutates bytes in real ELF samples.
  - Feeds them to ElfParser.
  - Asserts: no crashes, only ElfParseExceptions.
This directly supports the “never trust input” goal.
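A mutation harness of this kind can be very small. The sketch below is parser-agnostic: it takes the parse call as a delegate and the expected exception type as a type parameter, so it does not depend on ElfParser existing yet. `MutationFuzzer` is a hypothetical name, and bit-flipping is only one of several possible mutation strategies.

```csharp
using System;

public static class MutationFuzzer
{
    // Flips one to four random bits in a copy of `seed` per iteration and feeds
    // the result to `parse`. Returns how many iterations threw something other
    // than TExpected (the structured parse exception).
    public static int Run<TExpected>(
        byte[] seed, Action<byte[]> parse, int iterations, int rngSeed = 12345)
        where TExpected : Exception
    {
        if (seed.Length == 0) throw new ArgumentException("Seed must not be empty.", nameof(seed));

        var rng = new Random(rngSeed);   // fixed seed keeps failures reproducible
        int unexpected = 0;
        for (int i = 0; i < iterations; i++)
        {
            byte[] sample = (byte[])seed.Clone();
            int flips = 1 + rng.Next(4);
            for (int j = 0; j < flips; j++)
            {
                sample[rng.Next(sample.Length)] ^= (byte)(1 << rng.Next(8));
            }

            try { parse(sample); }
            catch (TExpected) { /* structured failure: acceptable */ }
            catch (Exception) { unexpected++; }
        }
        return unexpected;
    }
}
```

In a test, the assertion would be that `Run<ElfParseException>(sampleBytes, b => parser.Parse(new MemoryStream(b)), 10_000)` returns 0, where `parser` is whatever IElfParser implementation is under test.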
7.3 Regression fixtures
Check in a testdata/ folder with:
- Minimal 32‑bit/64‑bit ELF with build‑id.
- Minimal ELF without build‑id.
- Shared library with SONAME.
- Big‑endian sample.
8. Extensibility hooks (future-friendly)
Even if you only care about Linux/ELF today, you can design with “other formats later” in mind.
8.1 Generalized module metadata interface
public interface IModuleMetadata
{
string Path { get; }
string? Soname { get; }
string? BuildId { get; }
string Format { get; } // "ELF", "PE", "MachO"
}
ElfMetadata implements IModuleMetadata. That way, a future PeMetadata or MachOMetadata can slot into the same pipelines.
8.2 Integration with SBOM & VEX
Add a tiny, optional interface that lines up with your SBOM graph:
public interface IHasPackageCoordinates
{
string? Purl { get; }
}
public sealed partial class ElfMetadata : IHasPackageCoordinates
{
public string? Purl { get; init; } // populated by your higher-layer resolver
}
The ELF layer doesn’t know how to compute Purl, but it gives a spot for higher layers to attach it without wrapping everything in another type.
9. Documentation & usage examples
Finally, “best in class” is as much about developer experience as code.
Your spec should require:
- XML docs on all public types/members (shown in IntelliSense).
- Samples:
  - “Read build-id from a single file”
  - “Enumerate current process modules and print build-ids”
  - “Scan a container filesystem for unique ELFs and dump JSON”
For example:
// Example: dump all modules for the current process
var modules = ElfProcessScanner.GetProcessModules();
foreach (var m in modules)
{
Console.WriteLine($"{m.Path} | SONAME={m.Soname} | BUILD-ID={m.BuildId?.HexString ?? "<none>"}");
}
TL;DR: What to actually change in your current spec
If you just want a concrete checklist:
- Refine API
  - Introduce ElfBuildId struct, options objects, async variants.
  - Split parser vs file/process scanners.
- Parsing correctness
  - Support build-id in both PT_NOTE and .note.gnu.build-id.
  - Add strict bounds checks and ElfParseException with ElfParseErrorKind.
  - Treat big-endian & 32-bit as first-class.
- Performance
  - Make full file hashing opt-in.
  - Avoid unnecessary section reads.
  - Add optional memory-mapped mode.
- Process scanner
  - De-dup by inode, not just path.
  - Return both unique files and per-mapping instances.
  - Add structured error reporting (successes + failures).
- Testing & security
  - Mandate cross-validation vs readelf.
  - Add fuzz/corruption tests.
  - Add resource caps (max file size, max sections/notes).
If you’d like, next step I can do is rewrite the public C# surface (interfaces, classes, XML docs) in one place with all of these improvements baked in, so your team can just drop it into a project and fill in the internals.