20-Nov-2026 - Branch · Attach ELF Build‑IDs for Stable PURL Mapping.md

feat: Add comprehensive documentation for binary reachability with PURL-resolved edges

- Introduced a detailed specification for encoding binary reachability that integrates call graphs with SBOMs.
- Defined a minimal data model including nodes, edges, and SBOM components.
- Outlined a step-by-step guide for building the reachability graph in a C#-centric manner.
- Established core domain models, including enumerations for binary formats and symbol kinds.
- Created a public API for the binary reachability service, including methods for graph building and serialization.
- Specified SBOM component resolution and binary parsing abstractions for PE, ELF, and Mach-O formats.
- Enhanced symbol normalization and digesting processes to ensure deterministic signatures.
- Included error handling, logging, and a high-level test plan to ensure robustness and correctness.
- Added non-functional requirements to guide performance, memory usage, and thread safety.

2025-11-20 23:16:02 +02:00

Here’s a quick, practical win for your SBOM/runtime join story: record the ELF build‑id alongside soname and path when mapping modules to purls.

Why it matters:

build‑id (from .note.gnu.build-id) is an identifier the linker embeds in the ELF image, normally a hash of the linked contents, so it identifies the image even if filenames/paths change.
Distros and debuginfod index debug symbols by build‑id, so you can reliably join runtime traces → binaries → SBOM entries → debug artifacts.
It hardens reachability and VEX joins (no “same soname, different bits” ambiguity).

What to capture per ELF

soname (if shared object)
full path at runtime
purl (package URL from your resolver)
build_id (hex, no colons)
arch, file type (ET_DYN/ET_EXEC), and build-id source (NT_GNU_BUILD_ID)

How to read it (portable snippets)

CLI

# show build-id quickly
readelf -n /path/to/bin | awk '/Build ID:/ {print $3}'
# or:
objdump -s --section .note.gnu.build-id /path/to/bin

C (runtime collector)

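#define _GNU_SOURCE  // must precede the includes so <link.h> declares dl_iterate_phdr
#include <link.h>
#include <stdio.h>
#include <string.h>
static int note_cb(struct dl_phdr_info *info, size_t size, void *data) {
  for (int i = 0; i < info->dlpi_phnum; i++) {
    const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
    if (ph->p_type != PT_NOTE) continue;
    // the segment is already mapped at load bias + p_vaddr
    const unsigned char *p = (const unsigned char *)(info->dlpi_addr + ph->p_vaddr);
    const unsigned char *end = p + ph->p_memsz;
    while (p + sizeof(ElfW(Nhdr)) <= end) {
      const ElfW(Nhdr) *nh = (const ElfW(Nhdr) *)p;
      const unsigned char *name = p + sizeof(*nh);
      const unsigned char *desc = name + ((nh->n_namesz + 3u) & ~3u);  // name padded to 4 bytes
      if (desc + nh->n_descsz > end) break;  // malformed note
      if (nh->n_type == NT_GNU_BUILD_ID && nh->n_namesz == 4 && memcmp(name, "GNU", 4) == 0) {
        printf("%s ", info->dlpi_name);  // empty string for the main executable
        for (unsigned j = 0; j < nh->n_descsz; j++) printf("%02x", desc[j]);  // desc bytes → hex build_id
        putchar('\n');
      }
      p = desc + ((nh->n_descsz + 3u) & ~3u);  // desc padded to 4 bytes
    }
  }
  return 0;  // returning non-zero stops the iteration
}
// call dl_iterate_phdr(note_cb, NULL);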

Go (scanner)

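f, err := elf.Open(path) // imports: "debug/elf", "fmt"
if err != nil {
    return err
}
defer f.Close()
// debug/elf has no Notes field, so decode the .note.gnu.build-id section by hand.
if sec := f.Section(".note.gnu.build-id"); sec != nil {
    if b, err := sec.Data(); err == nil && len(b) >= 12 {
        off := 12 + (uint64(f.ByteOrder.Uint32(b[0:]))+3)&^3 // skip header and padded name
        end := off + uint64(f.ByteOrder.Uint32(b[4:]))       // descsz bytes of build-id
        if end <= uint64(len(b)) {
            fmt.Printf("%x\n", b[off:end]) // record the build-id here instead of printing
        }
    }
}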

Suggested Stella Ops schema (add field, no versioning break)

{
  "module": {
    "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
    "soname": "libssl.so.3",
    "purl": "pkg:deb/ubuntu/openssl@3.0.2-0ubuntu1.10?arch=amd64",
    "elf": {
      "build_id": "a1b2c3d4e5f6...",
      "type": "ET_DYN",
      "arch": "x86_64",
      "notes": { "source": "NT_GNU_BUILD_ID" }
    }
  }
}

Join strategy

Runtime → build‑id: collect from process maps (or dl_iterate_phdr) and file scan fallback.
SBOM → candidate binaries: map by purl/filename, then confirm by build‑id where available.
Debug/Source: query debuginfod or distro debug repos by build‑id to fetch symbols for precise call‑graph and reachability.
VEX/Policies: treat build‑id as the primary key for binary‑level assertions; purl stays as the package‑level key.

Edge handling

Stripped binaries: build‑id still present in the note; if missing, fall back to full‑file hash and flag build_id_absent=true.
Containers: compute build‑id inside image layers and cache in your “Proof‑of‑Integrity Graph.”
Kernel/Modules: same idea—/sys/module/*/notes/.note.gnu.build-id.

Quick acceptance tests

Scan a container image (Debian/Ubuntu/RHEL) and verify >90% of ELF objects yield a build‑id.
Cross‑check one binary: path changes across containers, build‑id stays identical.
Fetch symbols via debuginfod using that build‑id and run a tiny call‑graph demo to prove determinism.

If you want, I can draft the exact .NET 10 collector for Linux (P/Invoke dl_iterate_phdr) and a CycloneDX extension block to store build_id.

Here’s a concrete “implementation spec” for a C# dev to build an ELF metadata / build-id collector (“elf builder”). I’ll treat this as a small reusable .NET library plus some process-level helpers.

1. Goal & Scope

Goal: From C# on Linux, be able to:

Given an ELF file path, extract:
- build-id (from .note.gnu.build-id, i.e. NT_GNU_BUILD_ID)
- soname (for shared objects)
- ELF type (ET_EXEC / ET_DYN / etc.)
- machine architecture
- file path
- optional fallback: full-file hash if build-id is missing
Given a running process (usually self), enumerate loaded ELF modules and attach the above metadata per module.

The output will power your SBOM/runtime join (path + soname + build-id → purl).

2. Public API Spec

2.1 Core model

public enum ElfFileType
{
    Unknown = 0,
    Relocatable = 1,   // ET_REL
    Executable = 2,    // ET_EXEC
    SharedObject = 3,  // ET_DYN
    Core = 4           // ET_CORE
}

public sealed class ElfMetadata
{
    public required string Path { get; init; }
    public string? Soname { get; init; }
    public string? BuildId { get; init; }           // Hex, lowercase, no colons
    public string BuildIdSource { get; init; } = ""; // "NT_GNU_BUILD_ID" | "FileHash" | ""
    public ElfFileType FileType { get; init; }

    public string Machine { get; init; } = "";      // e.g. "x86_64", "aarch64"
    public bool Is64Bit { get; init; }
    public bool IsLittleEndian { get; init; }

    public string? FileHashSha256 { get; init; }    // only if BuildId == null
}

2.2 File-level API

public static class ElfReader
{
    /// <summary>
    /// Parse the ELF file at the given path and extract metadata.
    /// Throws if file is not ELF or cannot be read.
    /// </summary>
    public static ElfMetadata ReadMetadata(string path);
}

Behavior:

Validates ELF magic.
Supports both 32-bit and 64-bit ELF.
Supports little and big endian (initial testing can cover little-endian only).
Uses program headers (PT_NOTE) and note parsing to extract build-id.
Uses section headers + .dynamic to extract DT_SONAME.
Sets BuildIdSource = "NT_GNU_BUILD_ID" if build-id present.
If no build-id, computes FileHashSha256 and sets BuildIdSource = "FileHash".

2.3 Process-level API (Linux)

public static class ElfProcessScanner
{
    /// <summary>
    /// Enumerate ELF modules for the current process (default) or a given pid.
    /// Only returns unique paths that are actual ELF files.
    /// </summary>
    public static IReadOnlyList<ElfMetadata> GetProcessModules(int? pid = null);
}

Default implementation:

Only supports Linux.
Reads /proc/<pid>/maps.
Filters entries that map regular files (path not [vdso], [heap], etc.).
De-duplicates by canonical path (e.g. realpath behavior).
For each unique path:
- Check first 4 bytes for ELF magic.
- Call ElfReader.ReadMetadata(path).

3. ELF Parsing: Binary Layout & Rules

You do not need unsafe code; a BinaryReader is enough.

3.1 ELF header

First 16 bytes: e_ident[].

Key fields:

e_ident[0..3] = 0x7F, 'E', 'L', 'F' (magic)
e_ident[4] = EI_CLASS:
- 1 = 32-bit (ELFCLASS32)
- 2 = 64-bit (ELFCLASS64)
e_ident[5] = EI_DATA:
- 1 = little-endian (ELFDATA2LSB)
- 2 = big-endian (ELFDATA2MSB)

Then the “native” header fields, which differ slightly between 32 & 64 bit.

Define one internal class that covers both the 32-bit and 64-bit layouts (don’t use [StructLayout]; just read fields manually):

internal sealed class ElfHeaderCommon
{
    public byte[] Ident = new byte[16];
    public ushort Type;       // e_type
    public ushort Machine;    // e_machine
    public uint Version;      // e_version
    public ulong Entry;       // e_entry (32/64 sized)
    public ulong Phoff;       // e_phoff
    public ulong Shoff;       // e_shoff
    public uint Flags;        // e_flags
    public ushort Ehsize;     // e_ehsize
    public ushort Phentsize;  // e_phentsize
    public ushort Phnum;      // e_phnum
    public ushort Shentsize;  // e_shentsize
    public ushort Shnum;      // e_shnum
    public ushort Shstrndx;   // e_shstrndx
}

Algorithm to read header:

ReadBytes(16) → Ident. Validate magic & EI_CLASS/EI_DATA.
Decide is64 (from EI_CLASS) and littleEndian (from EI_DATA).

Use helper methods:

static ushort ReadUInt16(BinaryReader br, bool little) { ... }
static uint   ReadUInt32(BinaryReader br, bool little) { ... }
static ulong  ReadUInt64(BinaryReader br, bool little) { ... }

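Where these helpers swap bytes if the file is big-endian. BinaryReader always decodes integers as little-endian regardless of the host, so the file's EI_DATA alone decides whether to swap.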

For 32-bit ELF: fields Entry, Phoff, Shoff are 4-byte values that you zero-extend to 64-bit.
For 64-bit ELF: fields are 8-byte values.
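
The header-reading rules above can be sketched as follows. This is a minimal illustration rather than the spec's implementation: it parses from an in-memory span instead of a BinaryReader, and the names ElfHeaderInfo and ElfHeaderSketch are hypothetical. It assumes the standard 52-byte (ELF32) and 64-byte (ELF64) header layouts.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical type for illustration; it mirrors a subset of ElfHeaderCommon.
public sealed record ElfHeaderInfo(
    bool Is64, bool Little, ushort Type, ushort Machine, ulong Phoff, ulong Shoff,
    ushort Phentsize, ushort Phnum, ushort Shentsize, ushort Shnum, ushort Shstrndx);

public static class ElfHeaderSketch
{
    public static ElfHeaderInfo Parse(ReadOnlySpan<byte> data)
    {
        if (data.Length < 16 || data[0] != 0x7F || data[1] != (byte)'E' ||
            data[2] != (byte)'L' || data[3] != (byte)'F')
            throw new FormatException("Not an ELF file.");
        if (data[4] != 1 && data[4] != 2) throw new FormatException("Unsupported EI_CLASS.");
        if (data[5] != 1 && data[5] != 2) throw new FormatException("Unsupported EI_DATA.");

        bool is64 = data[4] == 2;
        bool little = data[5] == 1;
        if (data.Length < (is64 ? 64 : 52)) throw new FormatException("Truncated ELF header.");

        int pos = 16;                                    // fields after e_ident
        ushort type = U16(data, ref pos, little);        // e_type
        ushort machine = U16(data, ref pos, little);     // e_machine
        pos += 4;                                        // e_version
        Word(data, ref pos, little, is64);               // e_entry (not needed here)
        ulong phoff = Word(data, ref pos, little, is64); // e_phoff
        ulong shoff = Word(data, ref pos, little, is64); // e_shoff
        pos += 4 + 2;                                    // e_flags, e_ehsize
        ushort phentsize = U16(data, ref pos, little);
        ushort phnum = U16(data, ref pos, little);
        ushort shentsize = U16(data, ref pos, little);
        ushort shnum = U16(data, ref pos, little);
        ushort shstrndx = U16(data, ref pos, little);

        return new ElfHeaderInfo(is64, little, type, machine, phoff, shoff,
            phentsize, phnum, shentsize, shnum, shstrndx);
    }

    private static ushort U16(ReadOnlySpan<byte> d, ref int pos, bool little)
    {
        var s = d.Slice(pos, 2);
        pos += 2;
        return little ? BinaryPrimitives.ReadUInt16LittleEndian(s)
                      : BinaryPrimitives.ReadUInt16BigEndian(s);
    }

    // Reads a 4-byte (ELF32) or 8-byte (ELF64) field, zero-extended to 64 bits.
    private static ulong Word(ReadOnlySpan<byte> d, ref int pos, bool little, bool is64)
    {
        int n = is64 ? 8 : 4;
        var s = d.Slice(pos, n);
        pos += n;
        if (is64)
            return little ? BinaryPrimitives.ReadUInt64LittleEndian(s)
                          : BinaryPrimitives.ReadUInt64BigEndian(s);
        return little ? BinaryPrimitives.ReadUInt32LittleEndian(s)
                      : BinaryPrimitives.ReadUInt32BigEndian(s);
    }
}
```

Using BinaryPrimitives makes the byte order explicit at each read, so the same code path serves both little-endian and big-endian files.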

3.2 Program headers (for build-id)

Each program header:

32-bit:

uint32 p_type;
uint32 p_offset;
uint32 p_vaddr;
uint32 p_paddr;
uint32 p_filesz;
uint32 p_memsz;
uint32 p_flags;
uint32 p_align;

64-bit:

uint32 p_type;
uint32 p_flags;
uint64 p_offset;
uint64 p_vaddr;
uint64 p_paddr;
uint64 p_filesz;
uint64 p_memsz;
uint64 p_align;

You only really need:

p_type (look for PT_NOTE = 4)
p_offset
p_filesz

Reading algorithm:

internal sealed class ProgramHeader
{
    public uint Type;
    public ulong Offset;
    public ulong FileSize;
}

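Seek to header.Phoff.
For i = 0..Phnum-1 (seek to Phoff + i * Phentsize before each entry rather than assuming the entry size):
- For 32-bit:
  - Type = ReadUInt32()
  - Offset = ReadUInt32() (p_offset)
  - Skip p_vaddr and p_paddr (8 bytes).
  - FileSize = ReadUInt32() (p_filesz)
  - Skip the rest.
- For 64-bit:
  - Type = ReadUInt32()
  - flags = ReadUInt32() (ignored)
  - Offset = ReadUInt64()
  - Skip p_vaddr and p_paddr (16 bytes).
  - FileSize = ReadUInt64()
  - Skip the rest.
Store those with Type == 4 (PT_NOTE).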
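
As a rough sketch of that walk, the snippet below picks p_offset and p_filesz out of each entry at their fixed positions inside the 32-bit and 64-bit layouts listed above. NoteSegment and ProgramHeaderSketch are hypothetical names, and the whole file is assumed to be in memory for brevity.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical holder for the two fields the build-id lookup needs.
public readonly record struct NoteSegment(ulong Offset, ulong FileSize);

public static class ProgramHeaderSketch
{
    private const uint PT_NOTE = 4;

    public static List<NoteSegment> FindNoteSegments(
        ReadOnlySpan<byte> file, ulong phoff, ushort phentsize, ushort phnum,
        bool is64, bool little)
    {
        if (phentsize < (is64 ? 56 : 32))
            throw new FormatException("e_phentsize is smaller than a program header.");

        // Overflow-safe check that the whole table lies inside the file.
        ulong tableBytes = (ulong)phnum * phentsize;
        ulong fileLen = (ulong)file.Length;
        if (phoff > fileLen || tableBytes > fileLen - phoff)
            throw new FormatException("Program header table runs past end of file.");

        var notes = new List<NoteSegment>();
        for (int i = 0; i < phnum; i++)
        {
            var e = file.Slice((int)phoff + i * phentsize, phentsize);
            if (U32(e, 0, little) != PT_NOTE) continue;

            // ELF64: p_offset at byte 8, p_filesz at byte 32.
            // ELF32: p_offset at byte 4, p_filesz at byte 16.
            notes.Add(is64
                ? new NoteSegment(U64(e, 8, little), U64(e, 32, little))
                : new NoteSegment(U32(e, 4, little), U32(e, 16, little)));
        }
        return notes;
    }

    private static uint U32(ReadOnlySpan<byte> d, int at, bool little) =>
        little ? BinaryPrimitives.ReadUInt32LittleEndian(d.Slice(at))
               : BinaryPrimitives.ReadUInt32BigEndian(d.Slice(at));

    private static ulong U64(ReadOnlySpan<byte> d, int at, bool little) =>
        little ? BinaryPrimitives.ReadUInt64LittleEndian(d.Slice(at))
               : BinaryPrimitives.ReadUInt64BigEndian(d.Slice(at));
}
```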

3.3 Note segments & NT_GNU_BUILD_ID

Each note has:

uint32 namesz;
uint32 descsz;
uint32 type;
char   name[namesz]; // padded to 4-byte boundary
byte   desc[descsz]; // padded to 4-byte boundary

We care about:

type == 3 (NT_GNU_BUILD_ID)
name == "GNU" (null-terminated; usually "GNU\0")

Algorithm:

For each PT_NOTE program header:

Seek to ph.Offset, set remaining = ph.FileSize.
While remaining >= 12:
- namesz = ReadUInt32()
- descsz = ReadUInt32()
- type = ReadUInt32()
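- remaining -= 12.
- If namesz (plus its padding) or descsz exceeds remaining, stop: the note is truncated or corrupt.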
- Read nameBytes = ReadBytes(namesz); remaining -= namesz.
  - Skip padding: pad = (4 - (namesz % 4)) & 3; Seek(pad), remaining -= pad.
- Read desc = ReadBytes(descsz); remaining -= descsz.
  - Skip padding: pad = (4 - (descsz % 4)) & 3; Seek(pad), remaining -= pad.
- If type == 3 and Encoding.ASCII.GetString(nameBytes).TrimEnd('\0') == "GNU":
  - Convert desc to hex:
```
string buildId = BitConverter.ToString(desc).Replace("-", "").ToLowerInvariant();
```
  - Return immediately.

If no note matches, return null, and you can later fall back to FileHashSha256.
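
A minimal sketch of the note walk, assuming the PT_NOTE bytes have already been read into memory and that entries are padded to 4 bytes as described above. BuildIdNoteSketch is a hypothetical name; the bounds checks are an addition to the steps listed, so a corrupt namesz or descsz ends the scan instead of throwing.

```csharp
using System;
using System.Buffers.Binary;

public static class BuildIdNoteSketch
{
    private const uint NT_GNU_BUILD_ID = 3;

    // Returns the build-id as lowercase hex, or null if the segment has none.
    public static string? TryReadBuildId(ReadOnlySpan<byte> notes, bool little)
    {
        int pos = 0;
        while (notes.Length - pos >= 12)
        {
            uint namesz = U32(notes, pos, little);
            uint descsz = U32(notes, pos + 4, little);
            uint type = U32(notes, pos + 8, little);
            pos += 12;

            // 64-bit arithmetic so hostile sizes cannot wrap around.
            long descStart = pos + Align4(namesz);
            long descEnd = descStart + descsz;
            if (descEnd > notes.Length) return null;   // truncated or corrupt note

            var name = notes.Slice(pos, (int)namesz);
            var desc = notes.Slice((int)descStart, (int)descsz);
            if (type == NT_GNU_BUILD_ID && name.SequenceEqual("GNU\0"u8) && desc.Length > 0)
                return Convert.ToHexString(desc).ToLowerInvariant();

            // The final note may omit its trailing padding, so clamp to the end.
            pos = (int)Math.Min(descStart + Align4(descsz), notes.Length);
        }
        return null;
    }

    private static long Align4(uint n) => ((long)n + 3) & ~3L;

    private static uint U32(ReadOnlySpan<byte> d, int at, bool little) =>
        little ? BinaryPrimitives.ReadUInt32LittleEndian(d.Slice(at))
               : BinaryPrimitives.ReadUInt32BigEndian(d.Slice(at));
}
```

Comparing the name against the four bytes "GNU\0" avoids allocating a string per note; the TrimEnd approach in the steps above is equivalent for well-formed input.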

3.4 Section headers & SONAME

You need DT_SONAME from the dynamic section. Steps:

Read section headers from Shoff (ELF header).

Minimal section header model:

internal sealed class SectionHeader
{
    public uint Name;        // index into shstrtab
    public uint Type;        // SHT_*
    public ulong Offset;
    public ulong Size;
    public uint Link;        // for some types
}

For each section:

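Read Name, Type, Flags (ignored), Addr (ignored), Offset, Size, Link, etc. Name, Type, and Link are always 4 bytes; Flags, Addr, Offset, and Size are 4 bytes in 32-bit ELF and 8 bytes in 64-bit ELF.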
Keep these in an array.

Find the section header string table (shstrtab):

Use header.Shstrndx to locate its section header.
Read that section’s bytes into shStrTab.

Define helper to get section name:

static string ReadNullTerminatedString(byte[] table, uint offset)
{
    int i = (int)offset;
    int start = i;
    while (i < table.Length && table[i] != 0) i++;
    return Encoding.ASCII.GetString(table, start, i - start);
}

Use shStrTab to find:
- .dynamic section (Type == 6 i.e. SHT_DYNAMIC).
- The string table it references (SectionHeader.Link → index of the dynamic string table, often .dynstr).
Parse the dynamic section:
- Elf64_Dyn is array of entries:
```
int64  d_tag;
uint64 d_val;
```
  (For 32-bit, both are 4 bytes; you can cast to 64-bit.)
- For each entry (stop at d_tag == 0, DT_NULL, which terminates the array):
  - Read d_tag (signed, but you can treat as 64-bit).
  - Read d_val.
  - If d_tag == 14 (DT_SONAME), then d_val is an offset into the dynstr string table.
Read SONAME:
- Use dynstr bytes + d_val as index, decode null-terminated ASCII → Soname.

If there is no .dynamic section or no DT_SONAME, set Soname = null.
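
The dynamic-section scan can be sketched like this, assuming the raw bytes of .dynamic and of its linked string table have already been read. SonameSketch is a hypothetical name; returning null on an out-of-range or unterminated string is a choice made here, not something the steps above prescribe.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

public static class SonameSketch
{
    private const long DT_NULL = 0;
    private const long DT_SONAME = 14;

    public static string? TryReadSoname(
        ReadOnlySpan<byte> dynamic, ReadOnlySpan<byte> dynstr, bool is64, bool little)
    {
        int entrySize = is64 ? 16 : 8;   // Elf64_Dyn vs Elf32_Dyn
        for (int pos = 0; pos + entrySize <= dynamic.Length; pos += entrySize)
        {
            var e = dynamic.Slice(pos, entrySize);
            long tag = is64 ? I64(e, little) : I32(e, little);
            ulong val = is64 ? U64(e.Slice(8), little) : U32(e.Slice(4), little);

            if (tag == DT_NULL) break;          // end of the dynamic array
            if (tag != DT_SONAME) continue;

            if (val >= (ulong)dynstr.Length) return null;   // offset outside the table
            var tail = dynstr.Slice((int)val);
            int end = tail.IndexOf((byte)0);
            if (end < 0) return null;                        // unterminated string
            return Encoding.ASCII.GetString(tail.Slice(0, end));
        }
        return null;
    }

    private static long I64(ReadOnlySpan<byte> d, bool little) =>
        little ? BinaryPrimitives.ReadInt64LittleEndian(d) : BinaryPrimitives.ReadInt64BigEndian(d);
    private static int I32(ReadOnlySpan<byte> d, bool little) =>
        little ? BinaryPrimitives.ReadInt32LittleEndian(d) : BinaryPrimitives.ReadInt32BigEndian(d);
    private static ulong U64(ReadOnlySpan<byte> d, bool little) =>
        little ? BinaryPrimitives.ReadUInt64LittleEndian(d) : BinaryPrimitives.ReadUInt64BigEndian(d);
    private static uint U32(ReadOnlySpan<byte> d, bool little) =>
        little ? BinaryPrimitives.ReadUInt32LittleEndian(d) : BinaryPrimitives.ReadUInt32BigEndian(d);
}
```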

3.5 Mapping `e_machine` to architecture string

e_machine is a numeric code. Map the most common ones:

static string MapMachine(ushort eMachine) => eMachine switch
{
    3   => "x86",        // EM_386
    62  => "x86_64",     // EM_X86_64
    40  => "arm",        // EM_ARM
    183 => "aarch64",    // EM_AARCH64
    8   => "mips",       // EM_MIPS
    _   => $"unknown({eMachine})"
};

3.6 Mapping `e_type` to `ElfFileType`

static ElfFileType MapFileType(ushort eType) => eType switch
{
    1 => ElfFileType.Relocatable, // ET_REL
    2 => ElfFileType.Executable,  // ET_EXEC
    3 => ElfFileType.SharedObject,// ET_DYN
    4 => ElfFileType.Core,        // ET_CORE
    _ => ElfFileType.Unknown
};

3.7 Fallback: SHA-256 hash

If build-id is missing:

static string ComputeFileSha256(string path)
{
    using var sha = System.Security.Cryptography.SHA256.Create();
    using var fs = File.OpenRead(path);
    var hash = sha.ComputeHash(fs);
    return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
}

Set:

BuildId = null
BuildIdSource = "FileHash"
FileHashSha256 = computedHash

4. Implementation Skeleton (ElfReader)

Here’s a compact skeleton tying it together:

public static class ElfReader
{
    public static ElfMetadata ReadMetadata(string path)
    {
        using var fs = File.OpenRead(path);
        using var br = new BinaryReader(fs);

        // 1. Read e_ident
        byte[] ident = br.ReadBytes(16);
        if (ident.Length < 16 ||
            ident[0] != 0x7F || ident[1] != (byte)'E' ||
            ident[2] != (byte)'L' || ident[3] != (byte)'F')
        {
            throw new InvalidDataException("Not an ELF file.");
        }

        bool is64 = ident[4] == 2;    // EI_CLASS
        bool little = ident[5] == 1;  // EI_DATA

        // 2. Read header
        var header = ReadElfHeader(br, ident, is64, little);

        // 3. Read program headers
        var phdrs = ReadProgramHeaders(br, header, is64, little);

        // 4. Extract build-id from PT_NOTE
        string? buildId = TryReadBuildIdFromNotes(br, phdrs, little, is64);

        // 5. Read SONAME from .dynamic
        string? soname = TryReadSoname(br, header, is64, little);

        // 6. Map machine & type
        string machine = MapMachine(header.Machine);
        ElfFileType fileType = MapFileType(header.Type);

        // 7. Hash fallback
        string? fileHash = null;
        string source;
        if (buildId is null)
        {
            fileHash = ComputeFileSha256(path);
            source = "FileHash";
        }
        else
        {
            source = "NT_GNU_BUILD_ID";
        }

        return new ElfMetadata
        {
            Path = path,
            Soname = soname,
            BuildId = buildId,
            BuildIdSource = source,
            FileType = fileType,
            Machine = machine,
            Is64Bit = is64,
            IsLittleEndian = little,
            FileHashSha256 = fileHash
        };
    }

    // ... implement ReadElfHeader, ReadProgramHeaders,
    // TryReadBuildIdFromNotes, TryReadSoname, MapMachine,
    // MapFileType, ComputeFileSha256, + endian helpers ...
}

I didn’t expand every helper to keep this readable, but all helpers follow exactly the rules in section 3.

5. Process Scanner Spec (Linux)

5.1 Reading `/proc/<pid>/maps`

Each line looks roughly like:

7f2d9c214000-7f2d9c234000 r--p 00000000 08:01 1234567 /usr/lib/x86_64-linux-gnu/libssl.so.3

Last field is the file path, if any.

Algorithm:

public static class ElfProcessScanner
{
    public static IReadOnlyList<ElfMetadata> GetProcessModules(int? pid = null)
    {
        int actualPid = pid ?? Environment.ProcessId;
        string mapsPath = $"/proc/{actualPid}/maps";

        if (!File.Exists(mapsPath))
            throw new PlatformNotSupportedException("Only supported on Linux with /proc.");

        var paths = new HashSet<string>(StringComparer.Ordinal);
        foreach (var line in File.ReadLines(mapsPath))
        {
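            int idx = line.IndexOf('/');
            if (idx < 0)
                continue; // anonymous mappings and pseudo-entries ([heap], [vdso]) have no path

            string p = line.Substring(idx).Trim();

            // Unlinked files show up as "<path> (deleted)" and fail this check too.
            if (!File.Exists(p))
                continue;

            // HashSet<string>.Add de-duplicates repeated mappings of the same path.
            paths.Add(p);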
        }

        var result = new List<ElfMetadata>();
        foreach (var p in paths)
        {
            if (!IsElfFile(p))
                continue;

            try
            {
                var meta = ElfReader.ReadMetadata(p);
                result.Add(meta);
            }
            catch
            {
                // swallow or log; not all mapped files are valid ELF
            }
        }

        return result;
    }

    private static bool IsElfFile(string path)
    {
        try
        {
            using var fs = File.OpenRead(path);
            Span<byte> magic = stackalloc byte[4];
            if (fs.Read(magic) != 4) return false;
            return magic[0] == 0x7F && magic[1] == (byte)'E' &&
                   magic[2] == (byte)'L' && magic[3] == (byte)'F';
        }
        catch { return false; }
    }
}

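This is simple and needs no native interop. One caveat: paths in /proc/<pid>/maps are as seen by the target process, so for a pid in another mount namespace (a container, say) open them through /proc/<pid>/root instead. If you later want even more accurate results (e.g., also non-file-backed shared objects), you can add a P/Invoke path that uses dl_iterate_phdr, but /proc/<pid>/maps gets you the SBOM-relevant modules.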

6. JSON / SBOM Integration (Optional but Recommended)

When you serialize ElfMetadata into your runtime evidence / graph, I’d recommend a nested ELF block, e.g.:

{
  "path": "/usr/lib/x86_64-linux-gnu/libssl.so.3",
  "soname": "libssl.so.3",
  "purl": "pkg:deb/ubuntu/openssl@3.0.2-0ubuntu1.10?arch=amd64",
  "elf": {
    "build_id": "a1b2c3d4e5f6...",
    "build_id_source": "NT_GNU_BUILD_ID",
    "file_type": "SharedObject",
    "machine": "x86_64",
    "is_64bit": true,
    "is_little_endian": true,
    "file_hash_sha256": null
  }
}

You can keep purl on the higher-level module object; build_id becomes the primary key for binary-accurate joins.

7. Testing Checklist

For a C# dev implementing this, I’d suggest these tests:

Basic ELF detection
- Provide a non-ELF file → ReadMetadata throws.
- Provide /bin/ls (or similar) → parses successfully.
32-bit vs 64-bit
- If you have a 32-bit ELF on the system, verify Is64Bit and basic fields.
Build-id extraction
- Compare ReadMetadata(path).BuildId against readelf -n path | grep 'Build ID:'.
- Ensure they match.
SONAME extraction
- For a shared library, ensure Soname matches readelf -d path | grep SONAME.
Hash fallback
- On an ELF without build-id (you can strip or use a test file), verify BuildId == null and FileHashSha256 != null.
Process scanner
- Call GetProcessModules() in your test app; ensure it returns at least:
  - your own executable
  - core system libs (libc.so.6, ld-linux*, etc.)

If you’d like, the next step could be to expand any of the helper methods (e.g. ReadElfHeader or TryReadSoname) into complete C# code, but this should be enough as a “full spec & directions” for a C# dev to implement the ELF build-id collector.

You’re already in “pretty good” territory with the current spec. To get to “best in class,” I’d tighten it in a few dimensions: API design, correctness, performance, observability, testing, and extensibility.

I’ll walk through concrete ways to upgrade what we already have, without rewriting everything from scratch.

1. Sharpen the public API

1.1 Split responsibilities into clear layers

Right now we have:

ElfReader.ReadMetadata(string path)
ElfProcessScanner.GetProcessModules(int? pid = null)

For a best‑in‑class library, I’d explicitly layer things:

public interface IElfParser
{
    ElfMetadata Parse(Stream stream, string? pathHint = null);
}

public interface IElfFileInspector
{
    ElfMetadata InspectFile(string path);
}

public interface IElfProcessInspector
{
    IReadOnlyList<ElfMetadata> GetProcessModules(ElfProcessScanOptions? options = null);
}

With default implementations:

ElfParser – pure, stateless binary parser (no file I/O).
ElfFileInspector – wraps ElfParser + file system.
ElfProcessInspector – wraps /proc/<pid>/maps (and optionally dl_iterate_phdr).

This makes testing simpler (you can feed a MemoryStream) and keeps “how we read” decoupled from “how we parse.”

1.2 Options objects & async variants

Give users knobs and modern .NET ergonomics:

public sealed class ElfProcessScanOptions
{
    public int? Pid { get; init; }
    public bool IncludeNonElfFiles { get; init; } = false;
    public bool ParallelFileParsing { get; init; } = true;
    public bool ComputeHashWhenBuildIdMissing { get; init; } = true;
    public int? MaxFiles { get; init; } // safety valve on huge systems
}

public static class ElfProcessScanner
{
    public static IReadOnlyList<ElfMetadata> GetProcessModules(
        ElfProcessScanOptions? options = null);

    public static IAsyncEnumerable<ElfMetadata> GetProcessModulesAsync(
        ElfProcessScanOptions? options = null,
        CancellationToken cancellationToken = default);
}

Same for file scans:

public sealed class ElfFileScanOptions
{
    public bool ComputeFileHashWhenBuildIdPresent { get; init; } = false;
    public bool ThrowOnNonElf { get; init; } = true;
}

public static ElfMetadata ReadMetadata(
    string path, 
    ElfFileScanOptions? options = null);

1.3 Strong types for identity

Instead of string BuildId, add a value type:

public readonly struct ElfBuildId : IEquatable<ElfBuildId>
{
    public string HexString { get; }     // "a1b2c3..."
    public string DebugPathComponent => $"{HexString[..2]}/{HexString[2..]}";

    // Parse, TryParse, equality, GetHashCode, etc.
}

Then in ElfMetadata:

public ElfBuildId? BuildId { get; init; }   // nullable
public string BuildIdSource { get; init; }  // "NT_GNU_BUILD_ID" | "FileHash" | "None"

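This prevents subtle bugs from string normalization and gives you the .build-id path component precomputed (the aa/bbbb… layout used under /usr/lib/debug/.build-id/; debuginfod itself is queried with the full hex string).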
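
One way the elided members might be filled in is sketched below. The validation rules (non-empty, even length, hex digits only, normalized to lowercase) and the FromBytes helper are assumptions for illustration, not something the spec above fixes.

```csharp
using System;

public readonly struct ElfBuildId : IEquatable<ElfBuildId>
{
    private readonly string? _hex;   // null only for default(ElfBuildId)

    private ElfBuildId(string hex) => _hex = hex;

    public string HexString => _hex ?? "";

    // "a1b2c3..." -> "a1/b2c3..."; too-short values are returned unchanged.
    public string DebugPathComponent =>
        HexString.Length > 2 ? $"{HexString[..2]}/{HexString[2..]}" : HexString;

    public static ElfBuildId FromBytes(ReadOnlySpan<byte> desc) =>
        new(Convert.ToHexString(desc).ToLowerInvariant());

    public static bool TryParse(string? text, out ElfBuildId id)
    {
        id = default;
        if (string.IsNullOrEmpty(text) || text.Length % 2 != 0) return false;
        foreach (char c in text)
            if (!Uri.IsHexDigit(c)) return false;
        id = new ElfBuildId(text.ToLowerInvariant());
        return true;
    }

    public bool Equals(ElfBuildId other) =>
        string.Equals(HexString, other.HexString, StringComparison.Ordinal);

    public override bool Equals(object? obj) => obj is ElfBuildId other && Equals(other);
    public override int GetHashCode() => StringComparer.Ordinal.GetHashCode(HexString);
    public override string ToString() => HexString;

    public static bool operator ==(ElfBuildId a, ElfBuildId b) => a.Equals(b);
    public static bool operator !=(ElfBuildId a, ElfBuildId b) => !a.Equals(b);
}
```

Normalizing to lowercase once, at construction, is what lets equality stay a plain ordinal comparison.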

2. Make parsing spec‑accurate & robust

2.1 Handle both PT_NOTE and SHT_NOTE `.note.gnu.build-id`

Many binaries place build‑id in:

PT_NOTE segments and/or
a section named .note.gnu.build-id (SHT_NOTE)

Your spec only mentions PT_NOTE. For best coverage:

Search all PT_NOTE segments for NT_GNU_BUILD_ID.
If none found, search SHT_NOTE sections with name .note.gnu.build-id.
If both exist and disagree (extremely rare), decide a precedence and log a diagnostic.

2.2 Correct note alignment rules

Spec nuance:

Note fields (namesz, descsz, type) are always 4‑byte aligned.
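On 64‑bit, the overall note segment may be aligned to 8 bytes, but for the segment that carries the build‑id the internal padding still uses 4‑byte boundaries. A PT_NOTE segment with p_align = 8 (in practice .note.gnu.property) pads its entries to 8 bytes, so honor p_align if you ever walk those too.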

Your spec uses pad = (4 - (size % 4)) & 3, which is correct, but I’d codify it clearly:

static int NotePadding(int size) => (4 - (size & 3)) & 3;

And call that everywhere you advance across notes so future maintainers don’t “optimize” it incorrectly.

2.3 Be strict on bounds & corruption

Add explicit, defensive checks:

Do not trust p_offset + p_filesz blindly.
Before any read, verify offset + length <= streamLength.
If the file lies about sizes, fail gracefully with a structured error.

E.g.:

public sealed class ElfParseException : Exception
{
    public ElfParseErrorKind Kind { get; }
    public string? Detail { get; }

    // ...
}

public enum ElfParseErrorKind
{
    NotElf,
    TruncatedHeader,
    TruncatedProgramHeader,
    TruncatedSectionHeader,
    TruncatedNote,
    UnsupportedClass,
    UnsupportedEndianness,
    IoError,
    Unknown
}

And then:

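// Compare by subtraction so a huge Phoff cannot wrap the sum around.
ulong phTableBytes = (ulong)header.Phnum * header.Phentsize;
if (header.Phoff > (ulong)fs.Length || phTableBytes > (ulong)fs.Length - header.Phoff)
    throw new ElfParseException(ElfParseErrorKind.TruncatedProgramHeader, "...");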

Best‑in‑class means you never trust the file, and your errors are debuggable.
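
The "verify offset + length <= streamLength" rule is easy to get subtly wrong, because the addition itself can overflow when the file supplies a huge offset. A small helper along these lines (BoundsSketch and RangeFits are hypothetical names) keeps the comparison overflow-safe by subtracting instead of adding:

```csharp
using System;

public static class BoundsSketch
{
    // True when [offset, offset + length) lies entirely inside the stream.
    public static bool RangeFits(ulong offset, ulong length, long streamLength)
    {
        if (streamLength < 0) return false;
        ulong total = (ulong)streamLength;
        // offset <= total guarantees that total - offset cannot underflow.
        return offset <= total && length <= total - offset;
    }
}
```

Calling this before every seek-and-read gives each failure a single place to map onto the matching ElfParseErrorKind.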

2.4 Big‑endian and 32‑bit are first‑class citizens

Even if your primary target is x86_64 Linux, a robust spec:

Fully supports EI_CLASS = 1 and 2 (32/64).
Fully supports EI_DATA = 1 and 2 (LSB/MSB).
Has tests for at least one big‑endian ELF (e.g., sample artifacts in your test assets).

Your current spec mentions big-endian, but I’d explicitly require:

A generic EndianBinaryReader abstraction that:
- Wraps a Stream
- Exposes ReadUInt16/32/64, ReadInt64, ReadBytes with endianness.
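
A minimal sketch of such a reader follows. It assumes .NET 7 or later for Stream.ReadExactly, and the member set shown is only what the list above names plus a Position property; anything beyond that is left out.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Not thread-safe: the scratch buffer is shared between calls.
public sealed class EndianBinaryReader
{
    private readonly Stream _stream;
    private readonly bool _little;
    private readonly byte[] _buf = new byte[8];

    public EndianBinaryReader(Stream stream, bool littleEndian)
    {
        _stream = stream ?? throw new ArgumentNullException(nameof(stream));
        _little = littleEndian;
    }

    public long Position
    {
        get => _stream.Position;
        set => _stream.Position = value;
    }

    public byte[] ReadBytes(int count)
    {
        var bytes = new byte[count];
        _stream.ReadExactly(bytes);   // throws EndOfStreamException on a short read
        return bytes;
    }

    public ushort ReadUInt16()
    {
        Fill(2);
        return _little ? BinaryPrimitives.ReadUInt16LittleEndian(_buf)
                       : BinaryPrimitives.ReadUInt16BigEndian(_buf);
    }

    public uint ReadUInt32()
    {
        Fill(4);
        return _little ? BinaryPrimitives.ReadUInt32LittleEndian(_buf)
                       : BinaryPrimitives.ReadUInt32BigEndian(_buf);
    }

    public ulong ReadUInt64()
    {
        Fill(8);
        return _little ? BinaryPrimitives.ReadUInt64LittleEndian(_buf)
                       : BinaryPrimitives.ReadUInt64BigEndian(_buf);
    }

    public long ReadInt64() => unchecked((long)ReadUInt64());

    private void Fill(int n) => _stream.ReadExactly(_buf, 0, n);
}
```

Unlike BinaryReader.ReadBytes, which can silently return fewer bytes at end of stream, this version fails loudly on truncation.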

3. Performance & scale improvements

3.1 Avoid full-file reads by design

Your current design lets devs accidentally hash everything or read all sections even when not needed.

Refine the spec so that default path is minimal I/O:

Read ELF header.
Read program headers.
Read only:
- PT_NOTE ranges
- Section headers (once)
- .shstrtab, .dynamic, and its dynstr.

Only compute SHA‑256 when expressly configured (via ElfFileScanOptions.ComputeFileHashWhenBuildIdPresent or ElfProcessScanOptions.ComputeHashWhenBuildIdMissing).

3.2 Optional memory‑mapped mode

For very large scans (filesystem crawls, containers), allow a mode that uses MemoryMappedFile:

public sealed class ElfReaderOptions
{
    public bool UseMemoryMappedFile { get; init; } = false;
}

Internally, you can spec that the implementation:

Uses MemoryMappedFile.CreateFromFile
Creates views over relevant ranges (header, program headers, etc.)
Avoids multiple OS reads for repeated random access.
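
As a small illustration of the memory-mapped path, the sketch below maps a file read-only and copies out the 16-byte e_ident through a view. MmapSketch is a hypothetical name, and the file is assumed to be at least 16 bytes long; a shorter file makes the view creation throw.

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MmapSketch
{
    public static byte[] ReadIdent(string path)
    {
        // capacity 0 means "use the file's current size"; mapName must be null on Linux.
        using var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);

        // A view restricted to the header range; other ranges would get their own views.
        using var view = mmf.CreateViewAccessor(0, 16, MemoryMappedFileAccess.Read);

        var ident = new byte[16];
        view.ReadArray(0, ident, 0, ident.Length);
        return ident;
    }
}
```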

3.3 Parallel directory / image scanning

If you foresee scanning whole images or file trees, define a helper:

public static class ElfDirectoryScanner
{
    public static IReadOnlyList<ElfMetadata> Scan(
        string rootDirectory,
        ElfDirectoryScanOptions? options = null);

    public static IAsyncEnumerable<ElfMetadata> ScanAsync(
        string rootDirectory,
        ElfDirectoryScanOptions? options = null,
        CancellationToken cancellationToken = default);
}

public sealed class ElfDirectoryScanOptions
{
    public SearchOption SearchOption { get; init; } = SearchOption.AllDirectories;
    public int MaxDegreeOfParallelism { get; init; } = Environment.ProcessorCount;
    public Func<string, bool>? PathFilter { get; init; }  // e.g., skip /proc, /sys
}

And explicitly say that the implementation:

Uses Parallel.ForEach (or Parallel.ForEachAsync, available since .NET 6) with bounded parallelism.
Shares a single ElfParser across threads (it’s stateless).
De‑dups by (device, inode) when possible (see below).
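
A rough sketch of the bounded-parallelism walk, limited to the cheap first pass of finding files that start with the ELF magic. ElfCandidateScanSketch is a hypothetical name; the sketch applies no PathFilter and no inode de-duplication, so it should only be pointed at ordinary directories, not /proc or /dev.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class ElfCandidateScanSketch
{
    public static IReadOnlyList<string> FindElfFiles(string root, int maxParallelism)
    {
        var walk = new EnumerationOptions
        {
            RecurseSubdirectories = true,
            IgnoreInaccessible = true,   // skip directories we cannot read
            AttributesToSkip = 0         // the default would skip hidden files
        };

        var found = new ConcurrentBag<string>();
        Parallel.ForEach(
            Directory.EnumerateFiles(root, "*", walk),
            new ParallelOptions { MaxDegreeOfParallelism = maxParallelism },
            path =>
            {
                if (HasElfMagic(path)) found.Add(path);
            });

        // Sort so the output is deterministic regardless of scheduling.
        return found.OrderBy(p => p, StringComparer.Ordinal).ToList();
    }

    private static bool HasElfMagic(string path)
    {
        try
        {
            using var fs = File.OpenRead(path);
            Span<byte> magic = stackalloc byte[4];
            if (fs.Read(magic) != 4) return false;
            return magic[0] == 0x7F && magic[1] == (byte)'E' &&
                   magic[2] == (byte)'L' && magic[3] == (byte)'F';
        }
        catch (IOException) { return false; }
        catch (UnauthorizedAccessException) { return false; }
    }
}
```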

4. Process scanner: correctness & completeness

4.1 De‑duplication by inode, not just path

The current spec de‑dups only by path. On Linux:

Same inode may have multiple paths (hard links, bind mounts, chroot/container overlays).

For best‑in‑class accuracy of “unique binaries,” spec:

De‑duplicate entries by (st_dev, st_ino) from stat(2), not just string path.
Provide both views: unique by file identity and by path.

API example:

public sealed class ElfProcessModules
{
    public IReadOnlyList<ElfMetadata> UniqueFiles { get; init; }      // dedup by inode
    public IReadOnlyList<ElfModuleInstance> Instances { get; init; }  // per mapping
}

public sealed class ElfModuleInstance
{
    public ElfMetadata Metadata { get; init; }
    public string Path { get; init; }
    public string? MappingRange { get; init; } // "7f2d9c214000-7f2d9c234000"
}

And ElfProcessScanner.GetProcessModules returns an ElfProcessModules, not just a flat list.
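
.NET has no direct stat(2) wrapper, but each /proc/<pid>/maps line already carries the device (major:minor) and inode of the mapped file, so a first cut at identity-based de-duplication can work from the text alone. The sketch below groups mappings by that pair; MapsEntry and ProcMapsSketch are hypothetical names, and the device/inode values are taken as the kernel reports them (filesystems such as overlayfs may need extra care).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record for one file-backed mapping.
public readonly record struct MapsEntry(string Range, string Device, ulong Inode, string Path);

public static class ProcMapsSketch
{
    public static bool TryParseLine(string line, out MapsEntry entry)
    {
        entry = default;
        // Fields: address range, perms, offset, dev, inode, pathname.
        // Limiting the split to 6 keeps paths that contain spaces in one piece.
        string[] parts = line.Split(' ', 6, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length < 6) return false;                 // anonymous mapping

        string path = parts[5].Trim();
        if (!path.StartsWith('/')) return false;            // [heap], [vdso], ...
        if (!ulong.TryParse(parts[4], out ulong inode) || inode == 0) return false;

        entry = new MapsEntry(parts[0], parts[3], inode, path);
        return true;
    }

    // One group per underlying file, however many paths or mappings refer to it.
    public static Dictionary<(string Device, ulong Inode), List<MapsEntry>> GroupByFileIdentity(
        IEnumerable<string> lines)
    {
        var groups = new Dictionary<(string Device, ulong Inode), List<MapsEntry>>();
        foreach (string line in lines)
        {
            if (!TryParseLine(line, out var e)) continue;
            var key = (e.Device, e.Inode);
            if (!groups.TryGetValue(key, out var list))
                groups[key] = list = new List<MapsEntry>();
            list.Add(e);
        }
        return groups;
    }
}
```

Each dictionary key corresponds to one candidate for UniqueFiles, while the entries in its list map naturally onto ElfModuleInstance.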

4.2 Optional `dl_iterate_phdr` P/Invoke path

For a “maximum correctness” mode, you can specify:

A secondary implementation that uses dl_iterate_phdr via P/Invoke.
This gives you module base addresses straight from the dynamic loader, but only for the calling process; it cannot inspect another pid.
You can hybridize: use /proc/<pid>/maps for path enumeration and dl_iterate_phdr to confirm loaded segments (future feature).

You don’t have to implement it day one, but the spec can carve out an extension point:

public enum ElfProcessModuleSource
{
    ProcMaps,
    DlIteratePhdr
}

public sealed class ElfProcessScanOptions
{
    public ElfProcessModuleSource Source { get; init; } = ElfProcessModuleSource.ProcMaps;
}

And define behavior if the requested source isn’t available.

5. Observability & diagnostics

Best‑in‑class libraries are easy to debug.

5.1 Structured diagnostics on parse failures

Instead of “swallow or log” in the scanner, define:

public sealed class ElfScanResult
{
    public IReadOnlyList<ElfMetadata> Successes { get; init; }
    public IReadOnlyList<ElfScanError> Errors { get; init; }
}

public sealed class ElfScanError
{
    public string Path { get; init; }
    public ElfParseErrorKind Kind { get; init; }
    public string Message { get; init; }
}

And make ElfProcessScanner.GetProcessModules optionally return ElfScanResult (or have an overload).

This way you can:

Report how many files failed.
See common misconfigurations (e.g., insufficient permissions, truncated files).

5.2 Logging hooks instead of hard-coded logging

Don’t bake in a logging framework, but add a hook:

public interface IElfLogger
{
    void Debug(string message);
    void Info(string message);
    void Warn(string message);
    void Error(string message, Exception? ex = null);
}

public sealed class ElfReaderOptions
{
    public IElfLogger? Logger { get; init; }
}

Then use it for “soft failures” (skipping non‑ELF files, ignoring suspect sections, etc.).

6. Security & safety considerations

6.1 Treat inputs as untrusted

Spec explicitly that:

No ELF is ever loaded or executed.
No ld.so / dynamic loading is used: all reading is via FileStream / MemoryMappedFile.
No writes occur to inspected paths.

6.2 Control resource usage

For environments scanning untrusted file trees (e.g., user uploads):

Have configurable caps on:
- MaxFileSizeBytes to parse.
- MaxNotesPerSegment / MaxSections to avoid pathological “zip bomb” style ELFs.
Fail with a structured error (ElfParseErrorKind.TruncatedHeader, or a dedicated limit-exceeded kind if you add one) rather than exhausting RAM.

7. Testing & validation: make it part of the spec

Instead of just “add tests,” bake them in as requirements.

7.1 Golden tests vs `readelf` or `llvm-readobj`

Define that CI must include:

For a set of ELFs (32‑bit, 64‑bit, big‑endian, stripped, PIE, static):
- Compare ElfMetadata.BuildId with readelf -n output.
- Compare ElfMetadata.Soname with readelf -d / objdump -p.

You don’t need to name the exact tools in the API, but the spec can say:

The library’s test suite must cross‑validate build‑id and SONAME values against a trusted system tool (such as readelf or llvm-readobj) for a curated set of binaries.

7.2 Fuzzing & corruption tests

Add:

A small fuzz harness that:
- Mutates bytes in real ELF samples.
- Feeds them to ElfParser.
- Asserts: no crashes, only ElfParseExceptions.

This directly supports the “never trust input” goal.
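
A tiny mutation harness in that spirit might look like the following. It is deliberately generic: the parser is passed in as a delegate, FuzzSketch is a hypothetical name, and the byte-overwrite mutation is only one of many possible strategies. The seed sample is assumed to be non-empty.

```csharp
using System;

public static class FuzzSketch
{
    // Returns how many mutated inputs made the parser throw something
    // other than an exception accepted by isExpected.
    public static int Run(
        byte[] seed, Action<byte[]> parse, Func<Exception, bool> isExpected,
        int iterations, int rngSeed)
    {
        var rng = new Random(rngSeed);   // fixed seed keeps failures reproducible
        int unexpected = 0;

        for (int i = 0; i < iterations; i++)
        {
            var sample = (byte[])seed.Clone();
            int mutations = 1 + rng.Next(4);
            for (int m = 0; m < mutations; m++)
                sample[rng.Next(sample.Length)] = (byte)rng.Next(256);

            try
            {
                parse(sample);
            }
            catch (Exception ex) when (isExpected(ex))
            {
                // Structured parse failure: the acceptable outcome.
            }
            catch (Exception)
            {
                unexpected++;
            }
        }
        return unexpected;
    }
}
```

In the real suite, parse would call ElfParser on a MemoryStream over the sample and isExpected would be `ex => ex is ElfParseException`; the test then asserts that the returned count is zero.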

7.3 Regression fixtures

Check in a testdata/ folder with:

Minimal 32‑bit/64‑bit ELF with build‑id.
Minimal ELF without build‑id.
Shared library with SONAME.
Big‑endian sample.

8. Extensibility hooks (future-friendly)

Even if you only care about Linux/ELF today, you can design with “other formats later” in mind.

8.1 Generalized module metadata interface

public interface IModuleMetadata
{
    string Path { get; }
    string? Soname { get; }
    string? BuildId { get; }
    string Format { get; }  // "ELF", "PE", "MachO"
}

ElfMetadata implements IModuleMetadata (if you adopt the ElfBuildId struct from 1.3, expose its hex string through this interface member). That way, a future PeMetadata or MachOMetadata can slot into the same pipelines.

8.2 Integration with SBOM & VEX

Add a tiny, optional interface that lines up with your SBOM graph:

public interface IHasPackageCoordinates
{
    string? Purl { get; }
}

public sealed partial class ElfMetadata : IHasPackageCoordinates // the original declaration must also be marked partial
{
    public string? Purl { get; init; }  // populated by your higher-layer resolver
}

The ELF layer doesn’t know how to compute Purl, but it gives a spot for higher layers to attach it without wrapping everything in another type.

9. Documentation & usage examples

Finally, “best in class” is as much about developer experience as code.

Your spec should require:

XML docs on all public types/members (shown in IntelliSense).
Samples:
- “Read build‑id from a single file”
- “Enumerate current process modules and print build‑ids”
- “Scan a container filesystem for unique ELFs and dump JSON”

For example:

// Example: dump all modules for the current process
var modules = ElfProcessScanner.GetProcessModules();
foreach (var m in modules)
{
    Console.WriteLine($"{m.Path} | SONAME={m.Soname} | BUILD-ID={m.BuildId?.HexString ?? "<none>"}");
}

TL;DR: What to actually change in your current spec

If you just want a concrete checklist:

Refine API
- Introduce ElfBuildId struct, options objects, async variants.
- Split parser vs file/process scanners.
Parsing correctness
- Support build‑id in both PT_NOTE and .note.gnu.build-id.
- Add strict bounds checks and ElfParseException with ElfParseErrorKind.
- Treat big‑endian & 32‑bit as first‑class.
Performance
- Make full file hashing opt‑in.
- Avoid unnecessary section reads.
- Add optional memory‑mapped mode.
Process scanner
- De‑dup by inode, not just path.
- Return both unique files and per‑mapping instances.
- Add structured error reporting (successes + failures).
Testing & security
- Mandate cross‑validation vs readelf.
- Add fuzz/corruption tests.
- Add resource caps (max file size, max sections/notes).

If you’d like, the next step could be to rewrite the public C# surface (interfaces, classes, XML docs) in one place with all of these improvements baked in, so your team can just drop it into a project and fill in the internals.
