Function Behavior Corpus Guide
This document describes the StellaOps Function Behavior Corpus, a BSim-like capability that identifies functions by their semantic behavior rather than by symbols or prior CVE signatures.
Overview
The Function Behavior Corpus is a database of known library functions with pre-computed fingerprints that enable identification of functions in stripped binaries. When a binary is analyzed, functions can be matched against the corpus to determine:
- Library origin - Which library (glibc, OpenSSL, zlib, etc.) the function comes from
- Version information - Which version(s) of the library contain this function
- CVE associations - Whether the function is linked to known vulnerabilities
- Patch status - Whether a function matches a vulnerable or patched variant
Architecture
┌───────────────────────────────────────────────────────────────────────┐
│                       Function Behavior Corpus                        │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     Corpus Ingestion Layer                      │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │GlibcCorpus │  │OpenSSL     │  │ZlibCorpus  │  ...            │  │
│  │  │Connector   │  │Connector   │  │Connector   │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   v                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     Fingerprint Generation                      │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │Instruction │  │Semantic    │  │API Call    │                 │  │
│  │  │Hash        │  │KSG Hash    │  │Graph       │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   v                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                   Corpus Storage (PostgreSQL)                   │  │
│  │                                                                 │  │
│  │  corpus.libraries         - Known libraries                     │  │
│  │  corpus.library_versions  - Version snapshots                   │  │
│  │  corpus.build_variants    - Architecture/compiler variants      │  │
│  │  corpus.functions         - Function metadata                   │  │
│  │  corpus.fingerprints      - Fingerprint index                   │  │
│  │  corpus.function_clusters - Similar function groups             │  │
│  │  corpus.function_cves     - CVE associations                    │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
Core Services
ICorpusIngestionService
Handles ingestion of library binaries into the corpus.
public interface ICorpusIngestionService
{
    // Ingest a single library binary
    Task<IngestionResult> IngestLibraryAsync(
        LibraryIngestionMetadata metadata,
        Stream binaryStream,
        IngestionOptions? options = null,
        CancellationToken ct = default);

    // Ingest from a library connector (bulk)
    IAsyncEnumerable<IngestionResult> IngestFromConnectorAsync(
        string libraryName,
        ILibraryCorpusConnector connector,
        IngestionOptions? options = null,
        CancellationToken ct = default);

    // Update CVE associations for functions
    Task<int> UpdateCveAssociationsAsync(
        string cveId,
        IReadOnlyList<FunctionCveAssociation> associations,
        CancellationToken ct = default);

    // Check job status
    Task<IngestionJob?> GetJobStatusAsync(Guid jobId, CancellationToken ct = default);
}
ICorpusQueryService
Queries the corpus to identify functions by their fingerprints.
public interface ICorpusQueryService
{
    // Identify a single function
    Task<ImmutableArray<FunctionMatch>> IdentifyFunctionAsync(
        FunctionFingerprints fingerprints,
        IdentifyOptions? options = null,
        CancellationToken ct = default);

    // Batch identify multiple functions
    Task<ImmutableDictionary<int, ImmutableArray<FunctionMatch>>> IdentifyBatchAsync(
        IReadOnlyList<FunctionFingerprints> fingerprintSets,
        IdentifyOptions? options = null,
        CancellationToken ct = default);

    // Get corpus statistics
    Task<CorpusStatistics> GetStatisticsAsync(CancellationToken ct = default);

    // List available libraries
    Task<ImmutableArray<LibrarySummary>> ListLibrariesAsync(CancellationToken ct = default);
}
ILibraryCorpusConnector
Interface for library-specific connectors that fetch binaries for ingestion.
public interface ILibraryCorpusConnector
{
    string LibraryName { get; }
    string[] SupportedArchitectures { get; }

    // Get available versions
    Task<ImmutableArray<string>> GetAvailableVersionsAsync(CancellationToken ct);

    // Fetch binaries for ingestion
    IAsyncEnumerable<LibraryBinary> FetchBinariesAsync(
        IReadOnlyList<string> versions,
        string architecture,
        LibraryFetchOptions? options = null,
        CancellationToken ct = default);
}
Fingerprint Algorithms
The corpus uses multiple fingerprint algorithms to enable matching under different conditions:
Semantic K-Skip-Gram Hash (semantic_ksg)
Based on Ghidra BSim's approach:
- Analyzes normalized p-code operations
- Generates k-skip-gram features from instruction sequences
- Robust against register renaming and basic-block reordering
- Best for matching functions across optimization levels
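The feature-generation step above can be sketched as follows. This is a minimal illustration of skip-bigrams over a token sequence, not the StellaOps or Ghidra BSim implementation: the class name `SkipGramSketch`, the token strings, and the use of Jaccard similarity to compare feature sets are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; names and token format are hypothetical.
public static class SkipGramSketch
{
    // Returns every ordered pair (ops[i], ops[j]) in which at most `maxSkip`
    // tokens lie between the two positions, i.e. 1 <= j - i <= maxSkip + 1.
    public static HashSet<string> SkipBigrams(IReadOnlyList<string> ops, int maxSkip)
    {
        var features = new HashSet<string>(StringComparer.Ordinal);
        for (var i = 0; i < ops.Count; i++)
        {
            var last = Math.Min(ops.Count - 1, i + maxSkip + 1);
            for (var j = i + 1; j <= last; j++)
            {
                features.Add(ops[i] + ">" + ops[j]);
            }
        }
        return features;
    }

    // Jaccard similarity of two feature sets: |A ∩ B| / |A ∪ B|.
    public static double Jaccard(HashSet<string> a, HashSet<string> b)
    {
        if (a.Count == 0 && b.Count == 0)
        {
            return 1.0;
        }

        var intersection = 0;
        foreach (var feature in a)
        {
            if (b.Contains(feature))
            {
                intersection++;
            }
        }
        return (double)intersection / (a.Count + b.Count - intersection);
    }
}
```

Allowing skips is what gives tolerance to small edits: if a compiler inserts one extra operation between two others, the pair that used to be adjacent is still emitted as a skip-bigram, so the two feature sets overlap more than plain bigrams would.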
Instruction Basic-Block Hash (instruction_bb)
- Hashes normalized instruction sequences per basic block
- More sensitive to compiler differences
- Faster to compute than semantic hash
- Good for exact or near-exact matches
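A per-block hash of this kind might look like the sketch below. The normalization rules (replace a handful of x86-64 general-purpose register names with `REG` and numeric literals with `IMM`) and the choice of SHA-256 are assumptions for illustration; the document does not specify which normalization or hash function the corpus actually uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

// Illustrative sketch only; normalization rules are hypothetical.
public static class BasicBlockHashSketch
{
    // Assumed subset of x86-64 general-purpose registers.
    private static readonly Regex Register = new Regex(
        @"\b(r[a-d]x|r[sd]i|r[sb]p|r(?:8|9|1[0-5]))\b", RegexOptions.IgnoreCase);

    // Hexadecimal or decimal literals.
    private static readonly Regex Immediate = new Regex(
        @"\b(0x[0-9a-f]+|\d+)\b", RegexOptions.IgnoreCase);

    // Lower-cases the instruction and replaces registers and literals
    // with placeholders so that renamed registers hash identically.
    public static string Normalize(string instruction)
    {
        var text = instruction.Trim().ToLowerInvariant();
        text = Register.Replace(text, "REG");
        text = Immediate.Replace(text, "IMM");
        return text;
    }

    // Hashes the normalized instructions of one basic block, in order.
    public static string HashBlock(IEnumerable<string> instructions)
    {
        var normalized = string.Join("\n", instructions.Select(Normalize));
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(normalized));
        return Convert.ToHexString(digest);
    }
}
```

Because instruction order and opcodes feed directly into the digest, two blocks hash equal only when they differ in nothing but the normalized details, which is why this style of fingerprint suits exact or near-exact matching.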
Control-Flow Graph Hash (cfg_wl)
- Weisfeiler-Lehman graph hash of the CFG
- Captures structural similarity
- Works well even when instruction sequences differ
- Useful for detecting refactored code
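To make the idea concrete, here is a simplified Weisfeiler-Lehman style hash over a CFG given as successor lists. It is a sketch under stated assumptions: the initial label is each block's out-degree, only successors are aggregated, three refinement rounds are used by default, and SHA-256 compresses labels. None of these choices are confirmed by the document, and `WlHashSketch` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Illustrative sketch only; a simplified directed WL variant.
public static class WlHashSketch
{
    private static string Sha(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));

    // successors[i] lists the indices of the blocks reachable from block i.
    public static string Hash(IReadOnlyList<int[]> successors, int rounds = 3)
    {
        // Initial label: the out-degree of each block.
        var labels = successors
            .Select(s => s.Length.ToString(CultureInfo.InvariantCulture))
            .ToArray();

        for (var round = 0; round < rounds; round++)
        {
            var next = new string[labels.Length];
            for (var i = 0; i < labels.Length; i++)
            {
                // Combine the block's own label with the sorted labels of its successors.
                var neighbours = successors[i]
                    .Select(j => labels[j])
                    .OrderBy(l => l, StringComparer.Ordinal);
                next[i] = Sha(labels[i] + "|" + string.Join(",", neighbours));
            }
            labels = next;
        }

        // The graph hash is the hash of the sorted multiset of final labels,
        // so it does not depend on how the blocks are numbered.
        return Sha(string.Join(",", labels.OrderBy(l => l, StringComparer.Ordinal)));
    }
}
```

Since no instruction text enters the hash, two functions with the same branch structure but different instruction sequences produce the same value, which matches the structural-similarity role described above.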
Usage Examples
Ingesting a Library
// Create ingestion metadata
var metadata = new LibraryIngestionMetadata(
    Name: "openssl",
    Version: "3.0.15",
    Architecture: "x86_64",
    Compiler: "gcc",
    CompilerVersion: "12.2",
    OptimizationLevel: "O2",
    IsSecurityRelease: true);

// Ingest from file
await using var stream = File.OpenRead("libssl.so.3");
var result = await ingestionService.IngestLibraryAsync(metadata, stream);
Console.WriteLine($"Indexed {result.FunctionsIndexed} functions");
Console.WriteLine($"Generated {result.FingerprintsGenerated} fingerprints");
Bulk Ingestion via Connector
// Use the OpenSSL connector to fetch and ingest multiple versions
var connector = new OpenSslCorpusConnector(httpClientFactory, logger);
await foreach (var result in ingestionService.IngestFromConnectorAsync(
    "openssl",
    connector,
    new IngestionOptions { GenerateClusters = true }))
{
    Console.WriteLine($"Ingested {result.LibraryName} {result.Version}: {result.FunctionsIndexed} functions");
}
Identifying Functions
// Build fingerprints from analyzed function
var fingerprints = new FunctionFingerprints(
    SemanticHash: semanticHashBytes,
    InstructionHash: instructionHashBytes,
    CfgHash: cfgHashBytes,
    ApiCalls: ["malloc", "memcpy", "free"],
    SizeBytes: 256);

// Query the corpus
var matches = await queryService.IdentifyFunctionAsync(
    fingerprints,
    new IdentifyOptions
    {
        MinSimilarity = 0.85m,
        MaxResults = 5,
        IncludeCveAssociations = true
    });

foreach (var match in matches)
{
    Console.WriteLine($"Match: {match.LibraryName} {match.Version} - {match.FunctionName}");
    Console.WriteLine($"  Similarity: {match.Similarity:P1}");
    Console.WriteLine($"  Match method: {match.MatchMethod}");
    if (match.CveAssociations.Any())
    {
        foreach (var cve in match.CveAssociations)
        {
            Console.WriteLine($"  CVE: {cve.CveId} ({cve.AffectedState})");
        }
    }
}
Checking CVE Associations
// When a function matches, check whether it is associated with known CVEs.
// Guard against an empty result set before reading the first match.
if (!matches.IsDefaultOrEmpty)
{
    var match = matches[0];
    if (match.CveAssociations.Any(c => c.AffectedState == CveAffectedState.Vulnerable))
    {
        Console.WriteLine("WARNING: Function matches a known vulnerable variant!");
    }
}
Database Schema
The corpus uses a dedicated PostgreSQL schema with the following key tables:
| Table | Purpose |
|---|---|
| corpus.libraries | Master list of tracked libraries |
| corpus.library_versions | Version records with release metadata |
| corpus.build_variants | Architecture/compiler/optimization variants |
| corpus.functions | Function metadata (name, address, size, etc.) |
| corpus.fingerprints | Fingerprint hashes indexed for lookup |
| corpus.function_clusters | Groups of similar functions |
| corpus.function_cves | CVE-to-function associations |
| corpus.ingestion_jobs | Job tracking for bulk ingestion |
Supported Libraries
The corpus supports ingestion from these common libraries:
| Library | Connector | Architectures |
|---|---|---|
| glibc | GlibcCorpusConnector | x86_64, aarch64, armv7, i686 |
| OpenSSL | OpenSslCorpusConnector | x86_64, aarch64, armv7 |
| zlib | ZlibCorpusConnector | x86_64, aarch64 |
| curl | CurlCorpusConnector | x86_64, aarch64 |
| SQLite | SqliteCorpusConnector | x86_64, aarch64 |
Integration with Scanner
The corpus integrates with the Scanner module through IBinaryVulnerabilityService:
// Scanner can identify functions from fingerprints
var matches = await binaryVulnService.IdentifyFunctionFromCorpusAsync(
    new FunctionFingerprintSet(
        FunctionAddress: 0x4000,
        SemanticHash: hash,
        InstructionHash: null,
        CfgHash: null,
        ApiCalls: null,
        SizeBytes: 128),
    new CorpusLookupOptions
    {
        MinSimilarity = 0.9m,
        MaxResults = 3
    });
Performance Considerations
- Batch queries: Use IdentifyBatchAsync for multiple functions to reduce round-trips
- Fingerprint selection: Semantic hash is most robust but slowest; instruction hash is faster for exact matches
- Similarity threshold: Higher thresholds reduce false positives but may miss legitimate matches
- Clustering: Pre-computed clusters speed up similarity searches
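The threshold trade-off noted in the list above can be seen in a small in-memory sketch. The `Candidate` record and `ThresholdSketch` class are hypothetical stand-ins, and the assumption that a candidate is kept when its similarity is greater than or equal to the minimum is ours; the corpus query service may apply its thresholds differently.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a scored corpus candidate.
public sealed record Candidate(string FunctionName, decimal Similarity);

// Illustrative sketch only; not the corpus query implementation.
public static class ThresholdSketch
{
    // Keeps candidates at or above the threshold, best first, capped at maxResults.
    public static IReadOnlyList<Candidate> Select(
        IEnumerable<Candidate> candidates,
        decimal minSimilarity,
        int maxResults)
    {
        return candidates
            .Where(c => c.Similarity >= minSimilarity)
            .OrderByDescending(c => c.Similarity)
            .Take(maxResults)
            .ToList();
    }
}
```

With candidates scored 0.95, 0.88, and 0.80, a threshold of 0.85 keeps the first two while 0.90 keeps only the first: raising the threshold trims weaker matches, some of which may have been legitimate.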
Security Notes
- Corpus connectors fetch from external sources; ensure network policies allow required endpoints
- Ingested binaries are hashed to prevent duplicate processing
- CVE associations include confidence scores and evidence types for auditability
- All timestamps use UTC for consistency
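As an illustration of hash-based duplicate detection during ingestion, the sketch below digests each incoming binary with SHA-256 and skips any digest it has already seen. The class name `IngestionDeduplicator`, the in-memory set, and the choice of SHA-256 are assumptions for the example; the document does not say which hash the corpus uses or where seen digests are stored.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Illustrative sketch only; a real service would likely persist digests.
public sealed class IngestionDeduplicator
{
    private readonly HashSet<string> _seen = new HashSet<string>(StringComparer.Ordinal);

    // Returns true when the binary has not been seen before and should be processed.
    // Reads the stream from its current position to the end.
    public bool TryRegister(Stream binary)
    {
        using var sha = SHA256.Create();
        var digest = Convert.ToHexString(sha.ComputeHash(binary));
        return _seen.Add(digest);
    }
}
```

A caller would check `TryRegister` before starting fingerprint generation and skip the binary when it returns false.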