# Function Behavior Corpus Guide

This document describes StellaOps' Function Behavior Corpus system - a BSim-like capability for identifying functions by their semantic behavior rather than relying on symbols or prior CVE signatures.

## Overview

The Function Behavior Corpus is a database of known library functions with pre-computed fingerprints that enable identification of functions in stripped binaries. When a binary is analyzed, functions can be matched against the corpus to determine:

- **Library origin** - Which library (glibc, OpenSSL, zlib, etc.) the function comes from
- **Version information** - Which version(s) of the library contain this function
- **CVE associations** - Whether the function is linked to known vulnerabilities
- **Patch status** - Whether a function matches a vulnerable or patched variant

## Architecture

```
┌───────────────────────────────────────────────────────────────────────┐
│                       Function Behavior Corpus                        │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     Corpus Ingestion Layer                      │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │GlibcCorpus │  │OpenSSL     │  │ZlibCorpus  │  ...            │  │
│  │  │Connector   │  │Connector   │  │Connector   │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   v                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     Fingerprint Generation                      │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │Instruction │  │Semantic    │  │API Call    │                 │  │
│  │  │Hash        │  │KSG Hash    │  │Graph       │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   v                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                   Corpus Storage (PostgreSQL)                   │  │
│  │                                                                 │  │
│  │  corpus.libraries         - Known libraries                     │  │
│  │  corpus.library_versions  - Version snapshots                   │  │
│  │  corpus.build_variants    - Architecture/compiler variants      │  │
│  │  corpus.functions         - Function metadata                   │  │
│  │  corpus.fingerprints      - Fingerprint index                   │  │
│  │  corpus.function_clusters - Similar function groups             │  │
│  │  corpus.function_cves     - CVE associations                    │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
```

## Core Services

### ICorpusIngestionService

Handles ingestion of library binaries into the corpus.

```csharp
public interface ICorpusIngestionService
{
    // Ingest a single library binary
    Task IngestLibraryAsync(
        LibraryIngestionMetadata metadata,
        Stream binaryStream,
        IngestionOptions? options = null,
        CancellationToken ct = default);

    // Ingest from a library connector (bulk)
    IAsyncEnumerable IngestFromConnectorAsync(
        string libraryName,
        ILibraryCorpusConnector connector,
        IngestionOptions? options = null,
        CancellationToken ct = default);

    // Update CVE associations for functions
    Task UpdateCveAssociationsAsync(
        string cveId,
        IReadOnlyList associations,
        CancellationToken ct = default);

    // Check job status
    Task GetJobStatusAsync(Guid jobId, CancellationToken ct = default);
}
```

### ICorpusQueryService

Queries the corpus to identify functions by their fingerprints.

```csharp
public interface ICorpusQueryService
{
    // Identify a single function
    Task> IdentifyFunctionAsync(
        FunctionFingerprints fingerprints,
        IdentifyOptions? options = null,
        CancellationToken ct = default);

    // Batch identify multiple functions
    Task>> IdentifyBatchAsync(
        IReadOnlyList fingerprintSets,
        IdentifyOptions? options = null,
        CancellationToken ct = default);

    // Get corpus statistics
    Task GetStatisticsAsync(CancellationToken ct = default);

    // List available libraries
    Task> ListLibrariesAsync(CancellationToken ct = default);
}
```

### ILibraryCorpusConnector

Interface for library-specific connectors that fetch binaries for ingestion.

```csharp
public interface ILibraryCorpusConnector
{
    string LibraryName { get; }
    string[] SupportedArchitectures { get; }

    // Get available versions
    Task> GetAvailableVersionsAsync(CancellationToken ct);

    // Fetch binaries for ingestion
    IAsyncEnumerable FetchBinariesAsync(
        IReadOnlyList versions,
        string architecture,
        LibraryFetchOptions? options = null,
        CancellationToken ct = default);
}
```

## Fingerprint Algorithms

The corpus uses multiple fingerprint algorithms to enable matching under different conditions:

### Semantic K-Skip-Gram Hash (`semantic_ksg`)

Based on Ghidra BSim's approach:

- Analyzes normalized p-code operations
- Generates k-skip-gram features from instruction sequences
- Robust against register renaming and basic-block reordering
- Best for matching functions across optimization levels

### Instruction Basic-Block Hash (`instruction_bb`)

- Hashes normalized instruction sequences per basic block
- More sensitive to compiler differences
- Faster to compute than semantic hash
- Good for exact or near-exact matches

### Control-Flow Graph Hash (`cfg_wl`)

- Weisfeiler-Lehman graph hash of the CFG
- Captures structural similarity
- Works well even when instruction sequences differ
- Useful for detecting refactored code

## Usage Examples

### Ingesting a Library

```csharp
// Create ingestion metadata
var metadata = new LibraryIngestionMetadata(
    Name: "openssl",
    Version: "3.0.15",
    Architecture: "x86_64",
    Compiler: "gcc",
    CompilerVersion: "12.2",
    OptimizationLevel: "O2",
    IsSecurityRelease: true);

// Ingest from file
await using var stream = File.OpenRead("libssl.so.3");
var result = await ingestionService.IngestLibraryAsync(metadata, stream);

Console.WriteLine($"Indexed {result.FunctionsIndexed} functions");
Console.WriteLine($"Generated {result.FingerprintsGenerated} fingerprints");
```

### Bulk Ingestion via Connector

```csharp
// Use the OpenSSL connector to fetch and ingest multiple versions
var connector = new OpenSslCorpusConnector(httpClientFactory, logger);

await foreach (var result in ingestionService.IngestFromConnectorAsync(
    "openssl",
    connector,
    new IngestionOptions { GenerateClusters = true }))
{
    Console.WriteLine($"Ingested {result.LibraryName} {result.Version}: {result.FunctionsIndexed} functions");
}
```

### Identifying Functions

```csharp
// Build fingerprints from analyzed function
var fingerprints = new FunctionFingerprints(
    SemanticHash: semanticHashBytes,
    InstructionHash: instructionHashBytes,
    CfgHash: cfgHashBytes,
    ApiCalls: ["malloc", "memcpy", "free"],
    SizeBytes: 256);

// Query the corpus
var matches = await queryService.IdentifyFunctionAsync(
    fingerprints,
    new IdentifyOptions
    {
        MinSimilarity = 0.85m,
        MaxResults = 5,
        IncludeCveAssociations = true
    });

foreach (var match in matches)
{
    Console.WriteLine($"Match: {match.LibraryName} {match.Version} - {match.FunctionName}");
    Console.WriteLine($"  Similarity: {match.Similarity:P1}");
    Console.WriteLine($"  Match method: {match.MatchMethod}");

    if (match.CveAssociations.Any())
    {
        foreach (var cve in match.CveAssociations)
        {
            Console.WriteLine($"  CVE: {cve.CveId} ({cve.AffectedState})");
        }
    }
}
```

### Checking CVE Associations

```csharp
// When a function matches, check if it's associated with known CVEs
var match = matches.First();
if (match.CveAssociations.Any(c => c.AffectedState == CveAffectedState.Vulnerable))
{
    Console.WriteLine("WARNING: Function matches a known vulnerable variant!");
}
```

## Database Schema

The corpus uses a dedicated PostgreSQL schema with the following key tables:

| Table | Purpose |
|-------|---------|
| `corpus.libraries` | Master list of tracked libraries |
| `corpus.library_versions` | Version records with release metadata |
| `corpus.build_variants` | Architecture/compiler/optimization variants |
| `corpus.functions` | Function metadata (name, address, size, etc.) |
| `corpus.fingerprints` | Fingerprint hashes indexed for lookup |
| `corpus.function_clusters` | Groups of similar functions |
| `corpus.function_cves` | CVE-to-function associations |
| `corpus.ingestion_jobs` | Job tracking for bulk ingestion |

## Supported Libraries

The corpus supports ingestion from these common libraries:

| Library | Connector | Architectures |
|---------|-----------|---------------|
| glibc | `GlibcCorpusConnector` | x86_64, aarch64, armv7, i686 |
| OpenSSL | `OpenSslCorpusConnector` | x86_64, aarch64, armv7 |
| zlib | `ZlibCorpusConnector` | x86_64, aarch64 |
| curl | `CurlCorpusConnector` | x86_64, aarch64 |
| SQLite | `SqliteCorpusConnector` | x86_64, aarch64 |

## Integration with Scanner

The corpus integrates with the Scanner module through `IBinaryVulnerabilityService`:

```csharp
// Scanner can identify functions from fingerprints
var matches = await binaryVulnService.IdentifyFunctionFromCorpusAsync(
    new FunctionFingerprintSet(
        FunctionAddress: 0x4000,
        SemanticHash: hash,
        InstructionHash: null,
        CfgHash: null,
        ApiCalls: null,
        SizeBytes: 128),
    new CorpusLookupOptions
    {
        MinSimilarity = 0.9m,
        MaxResults = 3
    });
```

## Performance Considerations

- **Batch queries**: Use `IdentifyBatchAsync` for multiple functions to reduce round-trips
- **Fingerprint selection**: Semantic hash is most robust but slowest; instruction hash is faster for exact matches
- **Similarity threshold**: Higher thresholds reduce false positives but may miss legitimate matches
- **Clustering**: Pre-computed clusters speed up similarity searches

## Security Notes

- Corpus connectors fetch from external sources; ensure network policies allow required endpoints
- Ingested binaries are hashed to prevent duplicate processing
- CVE associations include confidence scores and evidence types for auditability
- All timestamps use UTC for consistency

## Related Documentation

- [Binary Index Architecture](architecture.md)
- [Semantic Diffing](semantic-diffing.md)
- [Scanner Module](../scanner/architecture.md)
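
## Appendix: Illustrative CFG Hash Sketch

To make the `cfg_wl` idea from the Fingerprint Algorithms section more concrete, here is a minimal sketch of a Weisfeiler-Lehman style hash over a control-flow graph. It is not the StellaOps implementation: the `CfgWlSketch` class, its `Hash` method, the successor-list input format, the use of out-degree as the initial label, and the choice of SHA-256 are all assumptions made for illustration. It also propagates labels along successor edges only, which is a simplification of the usual neighbourhood aggregation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of StellaOps: a toy Weisfeiler-Lehman hash of a CFG.
public static class CfgWlSketch
{
    // successors[i] lists the indices of the basic blocks reachable from block i.
    public static string Hash(IReadOnlyList<int[]> successors, int iterations = 3)
    {
        // Initial label is the out-degree, so block contents and numbering are ignored.
        var labels = successors.Select(s => s.Length.ToString()).ToArray();

        for (var round = 0; round < iterations; round++)
        {
            var next = new string[labels.Length];
            for (var i = 0; i < labels.Length; i++)
            {
                // Sort successor labels so the result does not depend on edge order.
                var neighbours = successors[i]
                    .Select(j => labels[j])
                    .OrderBy(l => l, StringComparer.Ordinal);
                next[i] = Digest(labels[i] + "|" + string.Join(",", neighbours));
            }
            labels = next;
        }

        // Graph hash: digest of the sorted multiset of final node labels.
        return Digest(string.Join(";", labels.OrderBy(l => l, StringComparer.Ordinal)));
    }

    private static string Digest(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
}
```

Because the labels depend only on graph shape, a diamond-shaped CFG such as `new[] { new[] { 1, 2 }, new[] { 3 }, new[] { 3 }, Array.Empty<int>() }` hashes to the same value after its blocks are renumbered, while a four-block straight-line CFG hashes differently. A production fingerprint would likely seed the labels with per-block features rather than out-degree alone.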