Function Behavior Corpus Guide

This guide describes the StellaOps Function Behavior Corpus, a BSim-like capability that identifies functions by their semantic behavior rather than by symbols or prior CVE signatures.

Overview

The Function Behavior Corpus is a database of known library functions with pre-computed fingerprints that enable identification of functions in stripped binaries. When a binary is analyzed, functions can be matched against the corpus to determine:

  • Library origin - Which library (glibc, OpenSSL, zlib, etc.) the function comes from
  • Version information - Which version(s) of the library contain this function
  • CVE associations - Whether the function is linked to known vulnerabilities
  • Patch status - Whether a function matches a vulnerable or patched variant

Architecture

┌───────────────────────────────────────────────────────────────────────┐
│                    Function Behavior Corpus                           │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                 Corpus Ingestion Layer                          │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │GlibcCorpus │  │OpenSSL     │  │ZlibCorpus  │  ...            │  │
│  │  │Connector   │  │Connector   │  │Connector   │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                              │                                        │
│                              v                                        │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                 Fingerprint Generation                          │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │Instruction │  │Semantic    │  │API Call    │                 │  │
│  │  │Hash        │  │KSG Hash    │  │Graph       │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                              │                                        │
│                              v                                        │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                 Corpus Storage (PostgreSQL)                     │  │
│  │                                                                 │  │
│  │  corpus.libraries       - Known libraries                       │  │
│  │  corpus.library_versions- Version snapshots                     │  │
│  │  corpus.build_variants  - Architecture/compiler variants        │  │
│  │  corpus.functions       - Function metadata                     │  │
│  │  corpus.fingerprints    - Fingerprint index                     │  │
│  │  corpus.function_clusters- Similar function groups              │  │
│  │  corpus.function_cves   - CVE associations                      │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘

Core Services

ICorpusIngestionService

Handles ingestion of library binaries into the corpus.

public interface ICorpusIngestionService
{
    // Ingest a single library binary
    Task<IngestionResult> IngestLibraryAsync(
        LibraryIngestionMetadata metadata,
        Stream binaryStream,
        IngestionOptions? options = null,
        CancellationToken ct = default);

    // Ingest from a library connector (bulk)
    IAsyncEnumerable<IngestionResult> IngestFromConnectorAsync(
        string libraryName,
        ILibraryCorpusConnector connector,
        IngestionOptions? options = null,
        CancellationToken ct = default);

    // Update CVE associations for functions
    Task<int> UpdateCveAssociationsAsync(
        string cveId,
        IReadOnlyList<FunctionCveAssociation> associations,
        CancellationToken ct = default);

    // Check job status
    Task<IngestionJob?> GetJobStatusAsync(Guid jobId, CancellationToken ct = default);
}

ICorpusQueryService

Queries the corpus to identify functions by their fingerprints.

public interface ICorpusQueryService
{
    // Identify a single function
    Task<ImmutableArray<FunctionMatch>> IdentifyFunctionAsync(
        FunctionFingerprints fingerprints,
        IdentifyOptions? options = null,
        CancellationToken ct = default);

    // Batch identify multiple functions
    Task<ImmutableDictionary<int, ImmutableArray<FunctionMatch>>> IdentifyBatchAsync(
        IReadOnlyList<FunctionFingerprints> fingerprintSets,
        IdentifyOptions? options = null,
        CancellationToken ct = default);

    // Get corpus statistics
    Task<CorpusStatistics> GetStatisticsAsync(CancellationToken ct = default);

    // List available libraries
    Task<ImmutableArray<LibrarySummary>> ListLibrariesAsync(CancellationToken ct = default);
}

ILibraryCorpusConnector

Interface for library-specific connectors that fetch binaries for ingestion.

public interface ILibraryCorpusConnector
{
    string LibraryName { get; }
    string[] SupportedArchitectures { get; }

    // Get available versions
    Task<ImmutableArray<string>> GetAvailableVersionsAsync(CancellationToken ct);

    // Fetch binaries for ingestion
    IAsyncEnumerable<LibraryBinary> FetchBinariesAsync(
        IReadOnlyList<string> versions,
        string architecture,
        LibraryFetchOptions? options = null,
        CancellationToken ct = default);
}
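
To make the connector contract concrete, here is a minimal sketch of a custom connector that serves pre-downloaded binaries from a local directory. It is illustrative only: LocalDirectoryConnector, the <root>/<version>/<architecture>/ directory layout, and the shapes of LibraryBinary and LibraryFetchOptions are assumptions made for this example (the guide does not show those types), and the interface is repeated so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.IO;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins: the real LibraryBinary / LibraryFetchOptions are not shown in this guide.
public sealed record LibraryBinary(string Version, string Architecture, Stream Content);
public sealed record LibraryFetchOptions(bool IncludeDebugBuilds = false);

// Repeated from the interface listing above so the sketch is self-contained.
public interface ILibraryCorpusConnector
{
    string LibraryName { get; }
    string[] SupportedArchitectures { get; }
    Task<ImmutableArray<string>> GetAvailableVersionsAsync(CancellationToken ct);
    IAsyncEnumerable<LibraryBinary> FetchBinariesAsync(
        IReadOnlyList<string> versions,
        string architecture,
        LibraryFetchOptions? options = null,
        CancellationToken ct = default);
}

// Hypothetical connector reading binaries laid out as <root>/<version>/<architecture>/<file>.
public sealed class LocalDirectoryConnector : ILibraryCorpusConnector
{
    private readonly string _root;

    public LocalDirectoryConnector(string libraryName, string root)
    {
        LibraryName = libraryName;
        _root = root;
    }

    public string LibraryName { get; }
    public string[] SupportedArchitectures { get; } = { "x86_64", "aarch64" };

    // Each first-level directory name under the root is treated as a version string.
    public Task<ImmutableArray<string>> GetAvailableVersionsAsync(CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        var versions = Directory.Exists(_root)
            ? Directory.EnumerateDirectories(_root)
                .Select(d => Path.GetFileName(d))
                .OfType<string>()
                .OrderBy(v => v, StringComparer.Ordinal)
                .ToImmutableArray()
            : ImmutableArray<string>.Empty;
        return Task.FromResult(versions);
    }

    // Streams one LibraryBinary per file; the options argument is unused in this sketch.
    public async IAsyncEnumerable<LibraryBinary> FetchBinariesAsync(
        IReadOnlyList<string> versions,
        string architecture,
        LibraryFetchOptions? options = null,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        foreach (var version in versions)
        {
            var dir = Path.Combine(_root, version, architecture);
            if (!Directory.Exists(dir)) continue; // no build for this version/architecture

            foreach (var file in Directory.EnumerateFiles(dir))
            {
                ct.ThrowIfCancellationRequested();
                var bytes = await File.ReadAllBytesAsync(file, ct);
                yield return new LibraryBinary(version, architecture, new MemoryStream(bytes, writable: false));
            }
        }
    }
}
```

Yielding binaries lazily lets ingestion start on the first file while later ones are still being read, which fits the IAsyncEnumerable shape of the interface.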

Fingerprint Algorithms

The corpus uses multiple fingerprint algorithms to enable matching under different conditions:

Semantic K-Skip-Gram Hash (semantic_ksg)

Based on Ghidra BSim's approach:

  • Analyzes normalized p-code operations
  • Generates k-skip-gram features from instruction sequences
  • Robust against register renaming and basic-block reordering
  • Best for matching functions across optimization levels
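
The k-skip-gram idea can be sketched in a few lines. The example below is not the StellaOps or BSim implementation: it assumes the function has already been lifted to a flat list of normalized p-code opcode names, emits skip-bigrams (ordered pairs with at most k operations skipped between them), and compares two feature sets with Jaccard similarity. The class name, the ">" feature separator, and the choice of Jaccard are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SkipGramSketch
{
    // Emits every ordered pair (ops[i], ops[j]) with i < j and at most k operations skipped between them.
    public static HashSet<string> SkipBigrams(IReadOnlyList<string> ops, int k)
    {
        var features = new HashSet<string>(StringComparer.Ordinal);
        for (var i = 0; i < ops.Count; i++)
        {
            var last = Math.Min(ops.Count - 1, i + 1 + k);
            for (var j = i + 1; j <= last; j++)
            {
                features.Add(ops[i] + ">" + ops[j]);
            }
        }
        return features;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty sets are treated as identical.
    public static double Jaccard(HashSet<string> a, HashSet<string> b)
    {
        if (a.Count == 0 && b.Count == 0) return 1.0;
        var intersection = a.Count(feature => b.Contains(feature));
        return (double)intersection / (a.Count + b.Count - intersection);
    }
}
```

For the sequences COPY, INT_ADD, LOAD, STORE and COPY, INT_ADD, COPY, LOAD, STORE (one extra COPY inserted), plain bigrams (k = 0) give a similarity of 0.4, while k = 1 gives 0.5, because the skip lets INT_ADD>LOAD survive the insertion.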

Instruction Basic-Block Hash (instruction_bb)

  • Hashes normalized instruction sequences per basic block
  • More sensitive to compiler differences
  • Faster to compute than semantic hash
  • Good for exact or near-exact matches
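
As a rough illustration of per-block hashing, the sketch below normalizes x86-64 assembly text by replacing register names and immediates with placeholders, then hashes the normalized block with SHA-256. The normalization rules, the regular expressions, and the use of SHA-256 are assumptions for this example; the guide does not specify how instruction_bb normalizes or hashes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

public static class BasicBlockHashSketch
{
    // Hypothetical rules: a subset of x86-64 general-purpose registers, and decimal or hex immediates.
    private static readonly Regex Register = new(
        @"\b(r[a-d]x|r[sd]i|r[sb]p|r(?:[89]|1[0-5])|e[a-d]x|e[sd]i|e[sb]p)\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    private static readonly Regex Immediate = new(
        @"\b(0x[0-9a-f]+|\d+)\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // "mov rax, [rbp-0x8]" becomes "mov REG, [REG-IMM]".
    public static string Normalize(string instruction)
    {
        // Registers are replaced first so that names such as r8 are not mistaken for immediates.
        var text = Register.Replace(instruction.Trim().ToLowerInvariant(), "REG");
        return Immediate.Replace(text, "IMM");
    }

    // Hashes one basic block; instruction order is significant.
    public static string HashBlock(IEnumerable<string> instructions)
    {
        var normalized = string.Join("\n", instructions.Select(i => Normalize(i)));
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(normalized));
        return Convert.ToHexString(digest);
    }
}
```

Two blocks that differ only in register allocation or constant values hash identically under these rules, while a changed mnemonic produces a different hash, which is consistent with this fingerprint being better suited to exact or near-exact matches.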

Control-Flow Graph Hash (cfg_wl)

  • Weisfeiler-Lehman graph hash of the CFG
  • Captures structural similarity
  • Works well even when instruction sequences differ
  • Useful for detecting refactored code
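
A minimal Weisfeiler-Lehman sketch over a CFG is shown below. It assumes the CFG is given as a successor list per basic block, seeds each block with its (in-degree, out-degree) pair, and refines labels for a fixed number of rounds using both successors and predecessors. The seed label, the round count, and SHA-256 as the relabeling function are illustrative choices, not the documented cfg_wl parameters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class WlHashSketch
{
    // edges[i] lists the successor block indices of block i.
    public static string Hash(IReadOnlyList<int[]> edges, int iterations = 3)
    {
        var n = edges.Count;

        // Build predecessor lists so labels propagate in both directions.
        var preds = Enumerable.Range(0, n).Select(_ => new List<int>()).ToArray();
        for (var u = 0; u < n; u++)
        {
            foreach (var v in edges[u]) preds[v].Add(u);
        }

        // Seed label: "in-degree:out-degree", which does not depend on block numbering.
        var labels = Enumerable.Range(0, n)
            .Select(i => $"{preds[i].Count}:{edges[i].Length}")
            .ToArray();

        for (var round = 0; round < iterations; round++)
        {
            var next = new string[n];
            for (var i = 0; i < n; i++)
            {
                // Neighbor labels are sorted so the result is independent of edge order.
                var succ = string.Join(",", edges[i].Select(v => labels[v]).OrderBy(s => s, StringComparer.Ordinal));
                var pred = string.Join(",", preds[i].Select(u => labels[u]).OrderBy(s => s, StringComparer.Ordinal));
                next[i] = Digest($"{labels[i]}|{succ}|{pred}");
            }
            labels = next;
        }

        // The graph hash is the digest of the sorted multiset of final labels.
        return Digest(string.Join(";", labels.OrderBy(s => s, StringComparer.Ordinal)));
    }

    private static string Digest(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
}
```

Because only the graph shape feeds the hash, two functions whose blocks contain different instructions but branch the same way produce the same value. Renumbering the blocks does not change the result either.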

Usage Examples

Ingesting a Library

// Create ingestion metadata
var metadata = new LibraryIngestionMetadata(
    Name: "openssl",
    Version: "3.0.15",
    Architecture: "x86_64",
    Compiler: "gcc",
    CompilerVersion: "12.2",
    OptimizationLevel: "O2",
    IsSecurityRelease: true);

// Ingest from file
await using var stream = File.OpenRead("libssl.so.3");
var result = await ingestionService.IngestLibraryAsync(metadata, stream);

Console.WriteLine($"Indexed {result.FunctionsIndexed} functions");
Console.WriteLine($"Generated {result.FingerprintsGenerated} fingerprints");

Bulk Ingestion via Connector

// Use the OpenSSL connector to fetch and ingest multiple versions
var connector = new OpenSslCorpusConnector(httpClientFactory, logger);

await foreach (var result in ingestionService.IngestFromConnectorAsync(
    "openssl",
    connector,
    new IngestionOptions { GenerateClusters = true }))
{
    Console.WriteLine($"Ingested {result.LibraryName} {result.Version}: {result.FunctionsIndexed} functions");
}

Identifying Functions

// Build fingerprints from analyzed function
var fingerprints = new FunctionFingerprints(
    SemanticHash: semanticHashBytes,
    InstructionHash: instructionHashBytes,
    CfgHash: cfgHashBytes,
    ApiCalls: ["malloc", "memcpy", "free"],
    SizeBytes: 256);

// Query the corpus
var matches = await queryService.IdentifyFunctionAsync(
    fingerprints,
    new IdentifyOptions
    {
        MinSimilarity = 0.85m,
        MaxResults = 5,
        IncludeCveAssociations = true
    });

foreach (var match in matches)
{
    Console.WriteLine($"Match: {match.LibraryName} {match.Version} - {match.FunctionName}");
    Console.WriteLine($"  Similarity: {match.Similarity:P1}");
    Console.WriteLine($"  Match method: {match.MatchMethod}");

    if (match.CveAssociations.Any())
    {
        foreach (var cve in match.CveAssociations)
        {
            Console.WriteLine($"  CVE: {cve.CveId} ({cve.AffectedState})");
        }
    }
}

Checking CVE Associations

// When a function matches, check whether the first match is associated with a known-vulnerable variant.
// Guard against an empty result set: First() throws when nothing matched.
if (!matches.IsDefaultOrEmpty &&
    matches[0].CveAssociations.Any(c => c.AffectedState == CveAffectedState.Vulnerable))
{
    Console.WriteLine("WARNING: Function matches a known vulnerable variant!");
}

Database Schema

The corpus uses a dedicated PostgreSQL schema with the following key tables:

| Table | Purpose |
|-------|---------|
| corpus.libraries | Master list of tracked libraries |
| corpus.library_versions | Version records with release metadata |
| corpus.build_variants | Architecture/compiler/optimization variants |
| corpus.functions | Function metadata (name, address, size, etc.) |
| corpus.fingerprints | Fingerprint hashes indexed for lookup |
| corpus.function_clusters | Groups of similar functions |
| corpus.function_cves | CVE-to-function associations |
| corpus.ingestion_jobs | Job tracking for bulk ingestion |

Supported Libraries

The corpus supports ingestion from these common libraries:

| Library | Connector | Architectures |
|---------|-----------|---------------|
| glibc | GlibcCorpusConnector | x86_64, aarch64, armv7, i686 |
| OpenSSL | OpenSslCorpusConnector | x86_64, aarch64, armv7 |
| zlib | ZlibCorpusConnector | x86_64, aarch64 |
| curl | CurlCorpusConnector | x86_64, aarch64 |
| SQLite | SqliteCorpusConnector | x86_64, aarch64 |

Integration with Scanner

The corpus integrates with the Scanner module through IBinaryVulnerabilityService:

// Scanner can identify functions from fingerprints
var matches = await binaryVulnService.IdentifyFunctionFromCorpusAsync(
    new FunctionFingerprintSet(
        FunctionAddress: 0x4000,
        SemanticHash: hash,
        InstructionHash: null,
        CfgHash: null,
        ApiCalls: null,
        SizeBytes: 128),
    new CorpusLookupOptions
    {
        MinSimilarity = 0.9m,
        MaxResults = 3
    });

Performance Considerations

  • Batch queries: Use IdentifyBatchAsync for multiple functions to reduce round-trips
  • Fingerprint selection: Semantic hash is most robust but slowest; instruction hash is faster for exact matches
  • Similarity threshold: Higher thresholds reduce false positives but may miss legitimate matches
  • Clustering: Pre-computed clusters speed up similarity searches
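
When a binary contains many thousands of functions, one option is to send fingerprints in bounded chunks and merge the results. The helper below is a sketch under two assumptions that this guide does not confirm: that the integer keys returned by a batch call are indexes into the list passed to that call, and that capping the request size is desirable. It is generic over the fingerprint and match types, so the identifyBatch delegate (a hypothetical parameter name) would simply forward to IdentifyBatchAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BatchQuerySketch
{
    // Splits the input into chunks, calls identifyBatch once per chunk, and
    // re-keys each chunk-local index to the index in the original list.
    public static async Task<Dictionary<int, TMatch>> IdentifyInChunksAsync<TFingerprint, TMatch>(
        IReadOnlyList<TFingerprint> fingerprints,
        int chunkSize,
        Func<IReadOnlyList<TFingerprint>, CancellationToken, Task<IReadOnlyDictionary<int, TMatch>>> identifyBatch,
        CancellationToken ct = default)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var merged = new Dictionary<int, TMatch>();
        var offset = 0;

        foreach (var chunk in fingerprints.Chunk(chunkSize))
        {
            ct.ThrowIfCancellationRequested();

            // One round-trip per chunk instead of one per function.
            var results = await identifyBatch(chunk, ct);
            foreach (var (localIndex, match) in results)
            {
                merged[offset + localIndex] = match;
            }

            offset += chunk.Length;
        }

        return merged;
    }
}
```

With 5 fingerprints and a chunk size of 2, the delegate is invoked three times and the merged dictionary is keyed 0 through 4.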

Security Notes

  • Corpus connectors fetch from external sources; ensure network policies allow required endpoints
  • Ingested binaries are hashed to prevent duplicate processing
  • CVE associations include confidence scores and evidence types for auditability
  • All timestamps use UTC for consistency
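
The duplicate check mentioned above can be pictured as a content-hash lookup. The sketch below assumes SHA-256 over the raw binary stream and uses an in-memory dictionary as a stand-in for whatever store the corpus actually uses; IngestionDeduplicator and TryRegisterAsync are hypothetical names for this example.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;
using System.Threading;
using System.Threading.Tasks;

public sealed class IngestionDeduplicator
{
    // Maps content digest to the UTC time it was first seen (in-memory stand-in for a database table).
    private readonly ConcurrentDictionary<string, DateTimeOffset> _seen = new(StringComparer.Ordinal);

    // Returns true the first time a given binary content is registered, false for duplicates.
    public async Task<bool> TryRegisterAsync(Stream binary, CancellationToken ct = default)
    {
        using var sha = SHA256.Create();
        var digest = await sha.ComputeHashAsync(binary, ct);

        // Rewind seekable streams so the caller can still ingest the same stream afterwards.
        if (binary.CanSeek) binary.Position = 0;

        return _seen.TryAdd(Convert.ToHexString(digest), DateTimeOffset.UtcNow);
    }
}
```

Hashing the content rather than the file name means the same library binary fetched from two mirrors is processed only once.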