Here's a quick, practical win for your SBOM/runtime join story: **record the ELF build-id alongside soname and path when mapping modules to purls.**

Why it matters:

* **build-id** (from `.note.gnu.build-id`) is an identifier the linker embeds in the image, typically a **content hash** (the linker can also be told to emit a UUID or a fixed value). It keeps identifying the same ELF image even if filenames/paths change.
* Distros and **debuginfod** index debug symbols **by build-id**, so you can reliably join runtime traces → binaries → SBOM entries → debug artifacts.
* It hardens reachability and VEX joins (no "same soname, different bits" ambiguity).

### What to capture per ELF

* `soname` (if shared object)
* `full path` at runtime
* `purl` (package URL from your resolver)
* **`build_id`** (hex, no colons)
* `arch`, `file type` (ET_DYN/ET_EXEC), and `build-id source` (NT_GNU_BUILD_ID)

### How to read it (portable snippets)

**CLI**

```bash
# show build-id quickly
readelf -n /path/to/bin | awk '/Build ID:/ {print $3}'
# or:
objdump -s --section .note.gnu.build-id /path/to/bin
```

**C (runtime collector)**

```c
#define _GNU_SOURCE   // required for dl_iterate_phdr
#include <link.h>     // dl_iterate_phdr, struct dl_phdr_info, ElfW()
#include <elf.h>      // PT_NOTE, NT_GNU_BUILD_ID

static int note_cb(struct dl_phdr_info *info, size_t size, void *data) {
  for (int i = 0; i < info->dlpi_phnum; i++) {
    const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
    if (ph->p_type == PT_NOTE) {
      // scan notes for NT_GNU_BUILD_ID (type=3, name="GNU")
      // extract desc bytes → hex string build_id
    }
  }
  return 0;
}
// call dl_iterate_phdr(note_cb, NULL);
```

**Go (scanner)**

```go
f, err := elf.Open(path) // needs "debug/elf" and "fmt"
if err != nil {
    return err
}
defer f.Close()
if sec := f.Section(".note.gnu.build-id"); sec != nil { // debug/elf has no parsed-notes API
    if b, err := sec.Data(); err == nil && len(b) >= 12 {
        namesz, descsz := uint64(f.ByteOrder.Uint32(b)), uint64(f.ByteOrder.Uint32(b[4:]))
        off := 12 + (namesz+3)&^3 // desc follows the 4-byte-padded name
        if f.ByteOrder.Uint32(b[8:]) == 3 && off+descsz <= uint64(len(b)) { // 3 = NT_GNU_BUILD_ID
            fmt.Printf("build-id %x\n", b[off:off+descsz]) // record buildID
        }
    }
}
```

### Suggested Stella Ops schema (add field, no versioning break)

```json
{
  "module": {
    "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
    "soname": "libssl.so.3",
    "purl": "pkg:deb/ubuntu/openssl@3.0.2-0ubuntu1.10?arch=amd64",
    "elf": {
      "build_id": "a1b2c3d4e5f6...",
      "type": "ET_DYN",
      "arch": "x86_64",
      "notes": { "source": "NT_GNU_BUILD_ID" }
    }
  }
}
```

### Join strategy

1. **Runtime → build-id:** collect from process maps (or `dl_iterate_phdr`), with a file scan as fallback.
2. **SBOM → candidate binaries:** map by purl/filename, then **confirm by build-id** where available.
3. **Debug/Source:** query debuginfod or distro debug repos by build-id to fetch symbols for precise call-graph and reachability.
4. **VEX/Policies:** treat build-id as the primary key for binary-level assertions; purl stays as the package-level key.

### Edge handling

* **Stripped binaries:** build-id is normally still present in the note; if missing, fall back to **full-file hash** and flag `build_id_absent=true`.
* **Containers:** compute build-id inside image layers and cache in your "Proof-of-Integrity Graph."
* **Kernel/Modules:** same idea, via `/sys/module/*/notes/.note.gnu.build-id`.

### Quick acceptance tests

* Scan a container image (Debian/Ubuntu/RHEL) and verify >90% of ELF objects yield a build-id.
* Cross-check one binary: path changes across containers, **build-id stays identical**.
* Fetch symbols via debuginfod using that build-id and run a tiny call-graph demo to prove determinism.

If you want, I can draft the exact .NET 10 collector for Linux (P/Invoke `dl_iterate_phdr`) and a CycloneDX extension block to store `build_id`.

Here's a concrete "implementation spec" for a C# dev to build an **ELF metadata / build-id collector** ("elf builder"). I'll treat this as a small reusable .NET library plus some process-level helpers.

---

## 1. Goal & Scope

**Goal:** From C# on Linux, be able to:

1. Given an ELF file path, extract:
   * `build-id` (from `.note.gnu.build-id`, i.e. NT_GNU_BUILD_ID)
   * `soname` (for shared objects)
   * ELF type (ET_EXEC / ET_DYN / etc.)
   * machine architecture
   * file path
   * optional fallback: full-file hash if build-id is missing
2. Given a running process (usually self), enumerate loaded ELF modules and attach the above metadata per module.

The output will power your SBOM/runtime join (path + soname + build-id → purl).

---

## 2. Public API Spec

### 2.1 Core model

```csharp
public enum ElfFileType
{
    Unknown = 0,
    Relocatable = 1,  // ET_REL
    Executable = 2,   // ET_EXEC
    SharedObject = 3, // ET_DYN
    Core = 4          // ET_CORE
}

public sealed class ElfMetadata
{
    public required string Path { get; init; }
    public string? Soname { get; init; }
    public string? BuildId { get; init; }            // Hex, lowercase, no colons
    public string BuildIdSource { get; init; } = ""; // "NT_GNU_BUILD_ID" | "FileHash" | ""
    public ElfFileType FileType { get; init; }
    public string Machine { get; init; } = "";       // e.g. "x86_64", "aarch64"
    public bool Is64Bit { get; init; }
    public bool IsLittleEndian { get; init; }
    public string? FileHashSha256 { get; init; }     // only if BuildId == null
}
```

### 2.2 File-level API

```csharp
public static class ElfReader
{
    /// <summary>
    /// Parse the ELF file at the given path and extract metadata.
    /// Throws if file is not ELF or cannot be read.
    /// </summary>
    public static ElfMetadata ReadMetadata(string path);
}
```

**Behavior:**

* Validates ELF magic.
* Supports both 32-bit and 64-bit ELF.
* Supports little and big endian (but you can initially only test little-endian).
* Uses program headers (PT_NOTE) and note parsing to extract build-id.
* Uses section headers + `.dynamic` to extract `DT_SONAME`.
* Sets `BuildIdSource = "NT_GNU_BUILD_ID"` if build-id present.
* If no build-id, computes `FileHashSha256` and sets `BuildIdSource = "FileHash"`.

### 2.3 Process-level API (Linux)

```csharp
public static class ElfProcessScanner
{
    /// <summary>
    /// Enumerate ELF modules for the current process (default) or a given pid.
    /// Only returns unique paths that are actual ELF files.
    /// </summary>
    public static IReadOnlyList<ElfMetadata> GetProcessModules(int? pid = null);
}
```

**Default implementation:**

* Only supports Linux.
* Reads `/proc/<pid>/maps`.
* Filters entries that map regular files (path not `[vdso]`, `[heap]`, etc.).
* De-duplicates by canonical path (e.g. `realpath` behavior).
* For each unique path:
  * Check first 4 bytes for ELF magic.
  * Call `ElfReader.ReadMetadata(path)`.

---

## 3. ELF Parsing: Binary Layout & Rules

You do **not** need unsafe code; a `BinaryReader` is enough.

### 3.1 ELF header

First 16 bytes: `e_ident[]`. Key fields:

* `e_ident[0..3]` = `0x7F, 'E', 'L', 'F'` (magic)
* `e_ident[4]` = `EI_CLASS`:
  * 1 = 32-bit (`ELFCLASS32`)
  * 2 = 64-bit (`ELFCLASS64`)
* `e_ident[5]` = `EI_DATA`:
  * 1 = little-endian (`ELFDATA2LSB`)
  * 2 = big-endian (`ELFDATA2MSB`)

Then the "native" header fields, which differ slightly between 32 & 64 bit. Define one internal header model wide enough for both classes (don't use `[StructLayout]`; just read fields manually):

```csharp
internal sealed class ElfHeaderCommon
{
    public byte[] Ident = new byte[16];
    public ushort Type;      // e_type
    public ushort Machine;   // e_machine
    public uint Version;     // e_version
    public ulong Entry;      // e_entry (32/64 sized)
    public ulong Phoff;      // e_phoff
    public ulong Shoff;      // e_shoff
    public uint Flags;       // e_flags
    public ushort Ehsize;    // e_ehsize
    public ushort Phentsize; // e_phentsize
    public ushort Phnum;     // e_phnum
    public ushort Shentsize; // e_shentsize
    public ushort Shnum;     // e_shnum
    public ushort Shstrndx;  // e_shstrndx
}
```

**Algorithm to read header:**

1. `ReadBytes(16)` → `Ident`. Validate magic & EI_CLASS/EI_DATA.
2. Decide `is64` (from EI_CLASS) and `littleEndian` (from EI_DATA).
3. Use helper methods:

   ```csharp
   static ushort ReadUInt16(BinaryReader br, bool little) { ... }
   static uint ReadUInt32(BinaryReader br, bool little) { ... }
   static ulong ReadUInt64(BinaryReader br, bool little) { ... }
   ```

   These helpers swap bytes whenever the file's byte order differs from the host's (in practice: big-endian file on a little-endian host).
4. For 32-bit ELF: fields `Entry`, `Phoff`, `Shoff` are 4-byte values that you zero-extend to 64-bit.
5. For 64-bit ELF: fields are 8-byte values.
### 3.2 Program headers (for build-id)

Each program header:

* 32-bit:

  ```text
  uint32 p_type;
  uint32 p_offset;
  uint32 p_vaddr;
  uint32 p_paddr;
  uint32 p_filesz;
  uint32 p_memsz;
  uint32 p_flags;
  uint32 p_align;
  ```

* 64-bit:

  ```text
  uint32 p_type;
  uint32 p_flags;
  uint64 p_offset;
  uint64 p_vaddr;
  uint64 p_paddr;
  uint64 p_filesz;
  uint64 p_memsz;
  uint64 p_align;
  ```

You only really need:

* `p_type` (look for `PT_NOTE` = 4)
* `p_offset`
* `p_filesz`

**Reading algorithm:**

```csharp
internal sealed class ProgramHeader
{
    public uint Type;
    public ulong Offset;
    public ulong FileSize;
}
```

* Seek to `header.Phoff`.
* For `i = 0..Phnum-1` (each entry is `Phentsize` bytes):
  * For 32-bit:
    * `Type = ReadUInt32()`
    * `Offset = ReadUInt32()` (`p_offset`)
    * Skip `p_vaddr` and `p_paddr` (8 bytes).
    * `FileSize = ReadUInt32()` (`p_filesz`)
    * Skip the rest.
  * For 64-bit:
    * `Type = ReadUInt32()`
    * `flags = ReadUInt32()` (ignored)
    * `Offset = ReadUInt64()`
    * Skip `p_vaddr` and `p_paddr` (16 bytes).
    * `FileSize = ReadUInt64()`
    * Skip the rest.
  * Store those with `Type == 4` (PT_NOTE).

### 3.3 Note segments & NT_GNU_BUILD_ID

Each **note** has:

```text
uint32 namesz;
uint32 descsz;
uint32 type;
char name[namesz]; // padded to 4-byte boundary
byte desc[descsz]; // padded to 4-byte boundary
```

We care about:

* `type == 3` (NT_GNU_BUILD_ID)
* `name == "GNU"` (null-terminated; usually `"GNU\0"`)

**Algorithm:** For each `PT_NOTE` program header:

1. Seek to `ph.Offset`, set `remaining = ph.FileSize`.
2. While `remaining >= 12`:
   * `namesz = ReadUInt32()`
   * `descsz = ReadUInt32()`
   * `type = ReadUInt32()`
   * `remaining -= 12`.
   * Read `nameBytes = ReadBytes(namesz)`; `remaining -= namesz`.
   * Skip padding: `pad = (4 - (namesz % 4)) & 3`; `Seek(pad)`, `remaining -= pad`.
   * Read `desc = ReadBytes(descsz)`; `remaining -= descsz`.
   * Skip padding: `pad = (4 - (descsz % 4)) & 3`; `Seek(pad)`, `remaining -= pad`.
   * If `type == 3` and `Encoding.ASCII.GetString(nameBytes).TrimEnd('\0') == "GNU"`:
     * Convert `desc` to hex:

       ```csharp
       string buildId = BitConverter.ToString(desc).Replace("-", "").ToLowerInvariant();
       ```

     * Return immediately.
If no note matches, return null, and you can later fall back to `FileHashSha256`.

### 3.4 Section headers & SONAME

You need `DT_SONAME` from the dynamic section. Steps:

1. Read **section headers** from `Shoff` (ELF header). Minimal section header model:

   ```csharp
   internal sealed class SectionHeader
   {
       public uint Name;    // index into shstrtab
       public uint Type;    // SHT_*
       public ulong Offset;
       public ulong Size;
       public uint Link;    // for some types
   }
   ```

   For each section:

   * Read `Name`, `Type`, `Flags` (ignored), `Addr` (ignored), `Offset`, `Size`, `Link`, etc. (`Flags`, `Addr`, `Offset`, and `Size` are 4 bytes wide in 32-bit files and 8 bytes in 64-bit files.)
   * Keep these in an array.

2. Find the **section header string table** (`shstrtab`):

   * Use `header.Shstrndx` to locate its section header.
   * Read that section's bytes into `shStrTab`.
   * Define helper to get section name:

     ```csharp
     static string ReadNullTerminatedString(byte[] table, uint offset)
     {
         int i = (int)offset;
         int start = i;
         while (i < table.Length && table[i] != 0) i++;
         return Encoding.ASCII.GetString(table, start, i - start);
     }
     ```

3. Use `shStrTab` to find:

   * `.dynamic` section (`Type == 6` i.e. `SHT_DYNAMIC`).
   * The string table it references (`SectionHeader.Link` → index of the dynamic string table, often `.dynstr`).

4. Parse the **dynamic section**:

   * `Elf64_Dyn` is array of entries:

     ```text
     int64 d_tag;
     uint64 d_val;
     ```

     (For 32-bit, both are 4 bytes; you can cast to 64-bit.)

   * For each entry, stopping at `d_tag == 0` (`DT_NULL`):
     * Read `d_tag` (signed, but you can treat as 64-bit).
     * Read `d_val`.
     * If `d_tag == 14` (`DT_SONAME`), then `d_val` is an offset into the dynstr string table.

5. Read `SONAME`:

   * Use dynstr bytes + `d_val` as index, decode null-terminated ASCII → `Soname`.

If there is no `.dynamic` section or no `DT_SONAME`, set `Soname = null`.

### 3.5 Mapping `e_machine` to architecture string

`e_machine` is a numeric code. Map the most common ones:

```csharp
static string MapMachine(ushort eMachine) => eMachine switch
{
    3 => "x86",       // EM_386
    62 => "x86_64",   // EM_X86_64
    40 => "arm",      // EM_ARM
    183 => "aarch64", // EM_AARCH64
    8 => "mips",      // EM_MIPS
    _ => $"unknown({eMachine})"
};
```

### 3.6 Mapping `e_type` to `ElfFileType`

```csharp
static ElfFileType MapFileType(ushort eType) => eType switch
{
    1 => ElfFileType.Relocatable,  // ET_REL
    2 => ElfFileType.Executable,   // ET_EXEC
    3 => ElfFileType.SharedObject, // ET_DYN
    4 => ElfFileType.Core,         // ET_CORE
    _ => ElfFileType.Unknown
};
```

### 3.7 Fallback: SHA-256 hash

If build-id is missing:

```csharp
static string ComputeFileSha256(string path)
{
    using var sha = System.Security.Cryptography.SHA256.Create();
    using var fs = File.OpenRead(path);
    var hash = sha.ComputeHash(fs);
    return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
}
```

Set:

* `BuildId = null`
* `BuildIdSource = "FileHash"`
* `FileHashSha256 = computedHash`

---

## 4. Implementation Skeleton (ElfReader)

Here's a compact skeleton tying it together:

```csharp
public static class ElfReader
{
    public static ElfMetadata ReadMetadata(string path)
    {
        using var fs = File.OpenRead(path);
        using var br = new BinaryReader(fs);

        // 1. Read e_ident
        byte[] ident = br.ReadBytes(16);
        if (ident.Length < 16 || ident[0] != 0x7F || ident[1] != (byte)'E' ||
            ident[2] != (byte)'L' || ident[3] != (byte)'F')
        {
            throw new InvalidDataException("Not an ELF file.");
        }
        bool is64 = ident[4] == 2;   // EI_CLASS
        bool little = ident[5] == 1; // EI_DATA

        // 2. Read header
        var header = ReadElfHeader(br, ident, is64, little);

        // 3. Read program headers
        var phdrs = ReadProgramHeaders(br, header, is64, little);

        // 4. Extract build-id from PT_NOTE
        string? buildId = TryReadBuildIdFromNotes(br, phdrs, little, is64);

        // 5. Read SONAME from .dynamic
        string? soname = TryReadSoname(br, header, is64, little);

        // 6. Map machine & type
        string machine = MapMachine(header.Machine);
        ElfFileType fileType = MapFileType(header.Type);

        // 7. Hash fallback
        string? fileHash = null;
        string source;
        if (buildId is null)
        {
            fileHash = ComputeFileSha256(path);
            source = "FileHash";
        }
        else
        {
            source = "NT_GNU_BUILD_ID";
        }

        return new ElfMetadata
        {
            Path = path,
            Soname = soname,
            BuildId = buildId,
            BuildIdSource = source,
            FileType = fileType,
            Machine = machine,
            Is64Bit = is64,
            IsLittleEndian = little,
            FileHashSha256 = fileHash
        };
    }

    // ... implement ReadElfHeader, ReadProgramHeaders,
    // TryReadBuildIdFromNotes, TryReadSoname, MapMachine,
    // MapFileType, ComputeFileSha256, + endian helpers ...
}
```

I didn't expand *every* helper to keep this readable, but all helpers follow exactly the rules in section 3.

---

## 5. Process Scanner Spec (Linux)

### 5.1 Reading `/proc/<pid>/maps`

Each line looks roughly like:

```text
7f2d9c214000-7f2d9c234000 r--p 00000000 08:01 1234567 /usr/lib/x86_64-linux-gnu/libssl.so.3
```

Last field is the file path, if any.

**Algorithm:**

```csharp
public static class ElfProcessScanner
{
    public static IReadOnlyList<ElfMetadata> GetProcessModules(int? pid = null)
    {
        int actualPid = pid ?? Environment.ProcessId;
        string mapsPath = $"/proc/{actualPid}/maps";
        if (!File.Exists(mapsPath))
            throw new PlatformNotSupportedException("Only supported on Linux with /proc.");

        var paths = new HashSet<string>(StringComparer.Ordinal);
        foreach (var line in File.ReadLines(mapsPath))
        {
            // Pseudo-entries such as [heap] or [vdso] contain no '/', so they are skipped here.
            int idx = line.IndexOf('/');
            if (idx < 0) continue;
            string p = line.Substring(idx).Trim();
            if (!File.Exists(p)) continue; // also drops "... (deleted)" entries
            paths.Add(p);                  // HashSet de-duplicates repeated mappings
        }

        var result = new List<ElfMetadata>();
        foreach (var p in paths)
        {
            if (!IsElfFile(p)) continue;
            try
            {
                var meta = ElfReader.ReadMetadata(p);
                result.Add(meta);
            }
            catch
            {
                // swallow or log; not all mapped files are valid ELF
            }
        }
        return result;
    }

    private static bool IsElfFile(string path)
    {
        try
        {
            using var fs = File.OpenRead(path);
            Span<byte> magic = stackalloc byte[4];
            if (fs.Read(magic) != 4) return false;
            return magic[0] == 0x7F && magic[1] == (byte)'E' &&
                   magic[2] == (byte)'L' && magic[3] == (byte)'F';
        }
        catch
        {
            return false;
        }
    }
}
```

This is simple and robust. If you later want **even more accurate** results (e.g., also non-file-backed shared objects), you can add a P/Invoke path that uses `dl_iterate_phdr`, but `/proc/<pid>/maps` gets you the SBOM-relevant modules.

---

## 6. JSON / SBOM Integration (Optional but Recommended)

When you serialize `ElfMetadata` into your runtime evidence / graph, I'd recommend a nested ELF block, e.g.:

```json
{
  "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
  "soname": "libssl.so.3",
  "purl": "pkg:deb/ubuntu/openssl@3.0.2-0ubuntu1.10?arch=amd64",
  "elf": {
    "build_id": "a1b2c3d4e5f6...",
    "build_id_source": "NT_GNU_BUILD_ID",
    "file_type": "SharedObject",
    "machine": "x86_64",
    "is_64bit": true,
    "is_little_endian": true,
    "file_hash_sha256": null
  }
}
```

You can keep `purl` on the higher-level module object; `build_id` becomes the primary key for binary-accurate joins.

---

## 7. Testing Checklist

For a C# dev implementing this, I'd suggest these tests:

1. **Basic ELF detection**
   * Provide a non-ELF file → `ReadMetadata` throws.
   * Provide `/bin/ls` (or similar) → parses successfully.
2. **32-bit vs 64-bit**
   * If you have a 32-bit ELF on the system, verify `Is64Bit` and basic fields.
3. **Build-id extraction**
   * Compare `ReadMetadata(path).BuildId` against `readelf -n path | grep 'Build ID:'`.
   * Ensure they match.
4. **SONAME extraction**
   * For a shared library, ensure `Soname` matches `readelf -d path | grep SONAME`.
5. **Hash fallback**
   * On an ELF without build-id (you can strip or use a test file), verify `BuildId == null` and `FileHashSha256 != null`.
6. **Process scanner**
   * Call `GetProcessModules()` in your test app; ensure it returns at least:
     * your own executable
     * core system libs (`libc.so.6`, `ld-linux*`, etc.)

---

If you'd like, next step I can do is expand any of the helper methods (e.g. `ReadElfHeader` or `TryReadSoname`) into complete C# code, but this should be enough as a "full spec & directions" for a C# dev to implement the ELF build-id collector.

You're already in "pretty good" territory with the current spec. To get to "best in class," I'd tighten it in a few dimensions: **API design, correctness, performance, observability, testing, and extensibility**. I'll walk through concrete ways to upgrade what we already have, without rewriting everything from scratch.

---

## 1. Sharpen the public API

### 1.1 Split responsibilities into clear layers

Right now we have:

* `ElfReader.ReadMetadata(string path)`
* `ElfProcessScanner.GetProcessModules(int? pid = null)`

For a best-in-class library, I'd explicitly layer things:

```csharp
public interface IElfParser
{
    ElfMetadata Parse(Stream stream, string? pathHint = null);
}

public interface IElfFileInspector
{
    ElfMetadata InspectFile(string path);
}

public interface IElfProcessInspector
{
    IReadOnlyList<ElfMetadata> GetProcessModules(ElfProcessScanOptions? options = null);
}
```

With default implementations:

* `ElfParser`: pure, stateless binary parser (no file I/O).
* `ElfFileInspector`: wraps `ElfParser` + file system.
* `ElfProcessInspector`: wraps `/proc/<pid>/maps` (and optionally `dl_iterate_phdr`).
This makes testing simpler (you can feed a `MemoryStream`) and keeps "how we read" decoupled from "how we parse."

### 1.2 Options objects & async variants

Give users knobs and modern .NET ergonomics:

```csharp
public sealed class ElfProcessScanOptions
{
    public int? Pid { get; init; }
    public bool IncludeNonElfFiles { get; init; } = false;
    public bool ParallelFileParsing { get; init; } = true;
    public bool ComputeHashWhenBuildIdMissing { get; init; } = true;
    public int? MaxFiles { get; init; } // safety valve on huge systems
}

public static class ElfProcessScanner
{
    public static IReadOnlyList<ElfMetadata> GetProcessModules(
        ElfProcessScanOptions? options = null);

    public static IAsyncEnumerable<ElfMetadata> GetProcessModulesAsync(
        ElfProcessScanOptions? options = null,
        CancellationToken cancellationToken = default);
}
```

Same for file scans:

```csharp
public sealed class ElfFileScanOptions
{
    public bool ComputeFileHashWhenBuildIdPresent { get; init; } = false;
    public bool ThrowOnNonElf { get; init; } = true;
}

public static ElfMetadata ReadMetadata(
    string path, ElfFileScanOptions? options = null);
```

### 1.3 Strong types for identity

Instead of `string BuildId`, add a value type:

```csharp
public readonly struct ElfBuildId : IEquatable<ElfBuildId>
{
    public string HexString { get; } // "a1b2c3..."
    public string DebugPathComponent => $"{HexString[..2]}/{HexString[2..]}";
    // Parse, TryParse, equality, GetHashCode, etc.
}
```

Then in `ElfMetadata`:

```csharp
public ElfBuildId? BuildId { get; init; }      // nullable
public string BuildIdSource { get; init; }     // "NT_GNU_BUILD_ID" | "FileHash" | "None"
```

This prevents subtle bugs from string normalization and gives you the `xx/yyyy...` path component precomputed, which is the layout distro debug packages use under `.build-id/`.

---

## 2. Make parsing spec-accurate & robust

### 2.1 Handle both PT_NOTE and SHT_NOTE `.note.gnu.build-id`

Many binaries place build-id in:

* `PT_NOTE` segments **and/or**
* a section named `.note.gnu.build-id` (`SHT_NOTE`)

Your spec only mentions `PT_NOTE`. For best coverage:

1. Search all `PT_NOTE` segments for `NT_GNU_BUILD_ID`.
2. If none found, search `SHT_NOTE` sections with name `.note.gnu.build-id`.
3. If both exist and disagree (extremely rare), decide a precedence and log a diagnostic.

### 2.2 Correct note alignment rules

Spec nuance:

* Note *fields* (`namesz`, `descsz`, `type`) are always 4-byte words.
* Build-id notes pad `name` and `desc` to 4-byte boundaries. On 64-bit, a note segment can declare 8-byte alignment (`p_align == 8`, as seen with `.note.gnu.property`), and notes in such a segment can be padded to 8 bytes, so use the segment's alignment when walking it instead of assuming 4 everywhere.

Your spec uses `pad = (4 - (size % 4)) & 3`, which is correct for the build-id note, but I'd codify it clearly:

```csharp
static int NotePadding(int size) => (4 - (size & 3)) & 3;
```

And call that everywhere you advance across notes so future maintainers don't "optimize" it incorrectly.

### 2.3 Be strict on bounds & corruption

Add explicit, defensive checks:

* Do not trust `p_offset` + `p_filesz` blindly.
* Before any read, verify `offset + length <= streamLength`.
* If the file lies about sizes, **fail gracefully** with a structured error.

E.g.:

```csharp
public sealed class ElfParseException : Exception
{
    public ElfParseErrorKind Kind { get; }
    public string? Detail { get; }
    // ...
}

public enum ElfParseErrorKind
{
    NotElf,
    TruncatedHeader,
    TruncatedProgramHeader,
    TruncatedSectionHeader,
    TruncatedNote,
    UnsupportedClass,
    UnsupportedEndianness,
    IoError,
    Unknown
}
```

And then:

```csharp
if (header.Phoff + (ulong)header.Phnum * header.Phentsize > (ulong)fs.Length)
    throw new ElfParseException(ElfParseErrorKind.TruncatedProgramHeader, "...");
```

Best-in-class means you *never* trust the file, and your errors are debuggable.

### 2.4 Big-endian and 32-bit are first-class citizens

Even if your primary target is x86_64 Linux, a robust spec:

* Fully supports EI_CLASS = 1 and 2 (32/64).
* Fully supports EI_DATA = 1 and 2 (LSB/MSB).
* Has tests for at least one big-endian ELF (e.g., sample artifacts in your test assets).
Your current spec *mentions* big-endian, but I'd explicitly require:

* A generic `EndianBinaryReader` abstraction that:
  * Wraps a `Stream`
  * Exposes `ReadUInt16/32/64`, `ReadInt64`, `ReadBytes` with endianness.

---

## 3. Performance & scale improvements

### 3.1 Avoid full-file reads by design

Your current design lets devs accidentally hash everything or read all sections even when not needed. Refine the spec so that **default path** is minimal I/O:

* Read ELF header.
* Read program headers.
* Read only:
  * PT_NOTE ranges
  * Section headers (once)
  * `.shstrtab`, `.dynamic`, and its dynstr.

Only compute SHA-256 when expressly configured (via `ElfFileScanOptions.ComputeFileHashWhenBuildIdPresent` or `ElfProcessScanOptions.ComputeHashWhenBuildIdMissing`).

### 3.2 Optional memory-mapped mode

For very large scans (filesystem crawls, containers), allow a mode that uses `MemoryMappedFile`:

```csharp
public sealed class ElfReaderOptions
{
    public bool UseMemoryMappedFile { get; init; } = false;
}
```

Internally, you can spec that the implementation:

* Uses `MemoryMappedFile.CreateFromFile`
* Creates views over relevant ranges (header, program headers, etc.)
* Avoids multiple OS reads for repeated random access.

### 3.3 Parallel directory / image scanning

If you foresee scanning whole images or file trees, define a helper:

```csharp
public static class ElfDirectoryScanner
{
    public static IReadOnlyList<ElfMetadata> Scan(
        string rootDirectory, ElfDirectoryScanOptions? options = null);

    public static IAsyncEnumerable<ElfMetadata> ScanAsync(
        string rootDirectory,
        ElfDirectoryScanOptions? options = null,
        CancellationToken cancellationToken = default);
}

public sealed class ElfDirectoryScanOptions
{
    public SearchOption SearchOption { get; init; } = SearchOption.AllDirectories;
    public int MaxDegreeOfParallelism { get; init; } = Environment.ProcessorCount;
    public Func<string, bool>? PathFilter { get; init; } // e.g., skip /proc, /sys
}
```

And explicitly say that the implementation:

* Uses `Parallel.ForEach` (or `Parallel.ForEachAsync`, available since .NET 6) with bounded parallelism.
* Shares a single `ElfParser` across threads (it's stateless).
* De-dups by `(device, inode)` when possible (see below).

---

## 4. Process scanner: correctness & completeness

### 4.1 De-duplication by inode, not just path

The current spec de-dups only by path. On Linux:

* Same inode may have multiple paths (hard links, bind mounts, chroot/container overlays).

For best-in-class accuracy of "unique binaries," spec:

* De-duplicate entries by `(st_dev, st_ino)` from `stat(2)`, not just string path.
* Provide both views: unique by file identity and by path.

API example:

```csharp
public sealed class ElfProcessModules
{
    public IReadOnlyList<ElfMetadata> UniqueFiles { get; init; }      // dedup by inode
    public IReadOnlyList<ElfModuleInstance> Instances { get; init; }  // per mapping
}

public sealed class ElfModuleInstance
{
    public ElfMetadata Metadata { get; init; }
    public string Path { get; init; }
    public string? MappingRange { get; init; } // "7f2d9c214000-7f2d9c234000"
}
```

And `ElfProcessScanner.GetProcessModules` returns an `ElfProcessModules`, not just a flat list.

### 4.2 Optional `dl_iterate_phdr` P/Invoke path

For a "maximum correctness" mode, you can specify:

* A secondary implementation that uses `dl_iterate_phdr` via P/Invoke.
* This gives you module base addresses and sometimes more consistent views across distros.
* You can hybridize: use `/proc/<pid>/maps` for path enumeration and `dl_iterate_phdr` to confirm loaded segments (future feature).

You don't **have** to implement it day one, but the spec can carve out an extension point:

```csharp
public enum ElfProcessModuleSource
{
    ProcMaps,
    DlIteratePhdr
}

public sealed class ElfProcessScanOptions
{
    // added to the options class from section 1.2
    public ElfProcessModuleSource Source { get; init; } = ElfProcessModuleSource.ProcMaps;
}
```

And define behavior if the requested source isn't available.
---

## 5. Observability & diagnostics

Best-in-class libraries are easy to debug.

### 5.1 Structured diagnostics on parse failures

Instead of "swallow or log" in the scanner, define:

```csharp
public sealed class ElfScanResult
{
    public IReadOnlyList<ElfMetadata> Successes { get; init; }
    public IReadOnlyList<ElfScanError> Errors { get; init; }
}

public sealed class ElfScanError
{
    public string Path { get; init; }
    public ElfParseErrorKind Kind { get; init; }
    public string Message { get; init; }
}
```

And make `ElfProcessScanner.GetProcessModules` optionally return `ElfScanResult` (or have an overload). This way you can:

* Report how many files failed.
* See common misconfigurations (e.g., insufficient permissions, truncated files).

### 5.2 Logging hooks instead of hard-coded logging

Don't bake in a logging framework, but add a hook:

```csharp
public interface IElfLogger
{
    void Debug(string message);
    void Info(string message);
    void Warn(string message);
    void Error(string message, Exception? ex = null);
}

public sealed class ElfReaderOptions
{
    public IElfLogger? Logger { get; init; }
}
```

Then use it for "soft failures" (skipping non-ELF files, ignoring suspect sections, etc.).

---

## 6. Security & safety considerations

### 6.1 Treat inputs as untrusted

Spec explicitly that:

* No ELF is ever loaded or executed.
* No ld.so / dynamic loading is used: all reading is via `FileStream` / `MemoryMappedFile`.
* No writes occur to inspected paths.

### 6.2 Control resource usage

For environments scanning untrusted file trees (e.g., user uploads):

* Have configurable caps on:
  * `MaxFileSizeBytes` to parse.
  * `MaxNotesPerSegment` / `MaxSections` to avoid pathological "zip bomb" style ELFs.
* Fail with `ElfParseErrorKind.TruncatedHeader` or `Unsupported` rather than exhausting RAM.

---

## 7. Testing & validation: make it part of the spec

Instead of just "add tests," bake them in as requirements.
### 7.1 Golden tests vs `readelf` or `llvm-readobj`

Define that CI must include:

* For a set of ELFs (32-bit, 64-bit, big-endian, stripped, PIE, static):
  * Compare `ElfMetadata.BuildId` with `readelf -n` output.
  * Compare `ElfMetadata.Soname` with `readelf -d` / `objdump -p`.

You don't need to name the exact tools in the API, but the spec can say:

> The library's test suite **must** cross-validate build-id and SONAME values against a trusted system tool (such as `readelf` or `llvm-readobj`) for a curated set of binaries.

### 7.2 Fuzzing & corruption tests

Add:

* A small fuzz harness that:
  * Mutates bytes in real ELF samples.
  * Feeds them to `ElfParser`.
  * Asserts: no crashes, only `ElfParseException`s.

This directly supports the "never trust input" goal.

### 7.3 Regression fixtures

Check in a `testdata/` folder with:

* Minimal 32-bit/64-bit ELF with build-id.
* Minimal ELF without build-id.
* Shared library with SONAME.
* Big-endian sample.

---

## 8. Extensibility hooks (future-friendly)

Even if you only care about Linux/ELF today, you can design with "other formats later" in mind.

### 8.1 Generalized module metadata interface

```csharp
public interface IModuleMetadata
{
    string Path { get; }
    string? Soname { get; }
    string? BuildId { get; }
    string Format { get; } // "ELF", "PE", "MachO"
}
```

`ElfMetadata` implements `IModuleMetadata`. That way, a future `PeMetadata` or `MachOMetadata` can slot into the same pipelines.

### 8.2 Integration with SBOM & VEX

Add a tiny, optional interface that lines up with your SBOM graph:

```csharp
public interface IHasPackageCoordinates
{
    string? Purl { get; }
}

public sealed partial class ElfMetadata : IHasPackageCoordinates
{
    public string? Purl { get; init; } // populated by your higher-layer resolver
}
```

The ELF layer doesn't know how to compute `Purl`, but it gives a spot for higher layers to attach it without wrapping everything in another type.

---

## 9. Documentation & usage examples

Finally, "best in class" is as much about *developer experience* as code. Your spec should require:

* XML docs on all public types/members (shown in IntelliSense).
* Samples:
  * "Read build-id from a single file"
  * "Enumerate current process modules and print build-ids"
  * "Scan a container filesystem for unique ELFs and dump JSON"

For example:

```csharp
// Example: dump all modules for the current process
var modules = ElfProcessScanner.GetProcessModules();
foreach (var m in modules)
{
    Console.WriteLine($"{m.Path} | SONAME={m.Soname} | BUILD-ID={m.BuildId?.HexString ?? ""}");
}
```

---

## TL;DR: What to actually change in your current spec

If you just want a concrete checklist:

1. **Refine API**
   * Introduce `ElfBuildId` struct, options objects, async variants.
   * Split parser vs file/process scanners.
2. **Parsing correctness**
   * Support build-id in both PT_NOTE and `.note.gnu.build-id`.
   * Add strict bounds checks and `ElfParseException` with `ElfParseErrorKind`.
   * Treat big-endian & 32-bit as first-class.
3. **Performance**
   * Make full file hashing opt-in.
   * Avoid unnecessary section reads.
   * Add optional memory-mapped mode.
4. **Process scanner**
   * De-dup by inode, not just path.
   * Return both unique files and per-mapping instances.
   * Add structured error reporting (successes + failures).
5. **Testing & security**
   * Mandate cross-validation vs `readelf`.
   * Add fuzz/corruption tests.
   * Add resource caps (max file size, max sections/notes).

If you'd like, next step I can do is **rewrite the public C# surface** (interfaces, classes, XML docs) in one place with all of these improvements baked in, so your team can just drop it into a project and fill in the internals.