save progress

StellaOps Bot
2026-01-06 09:42:02 +02:00
parent 94d68bee8b
commit 37e11918e0
443 changed files with 85863 additions and 897 deletions

@@ -0,0 +1,313 @@
# Function Behavior Corpus Guide
This document describes the StellaOps Function Behavior Corpus, a BSim-like capability that identifies functions by their semantic behavior rather than by symbols or prior CVE signatures.
## Overview
The Function Behavior Corpus is a database of known library functions with pre-computed fingerprints that enable identification of functions in stripped binaries. When a binary is analyzed, functions can be matched against the corpus to determine:
- **Library origin** - Which library (glibc, OpenSSL, zlib, etc.) the function comes from
- **Version information** - Which version(s) of the library contain this function
- **CVE associations** - Whether the function is linked to known vulnerabilities
- **Patch status** - Whether a function matches a vulnerable or patched variant
## Architecture
```
┌───────────────────────────────────────────────────────────────────────┐
│                       Function Behavior Corpus                        │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     Corpus Ingestion Layer                      │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │GlibcCorpus │  │OpenSSL     │  │ZlibCorpus  │  ...            │  │
│  │  │Connector   │  │Connector   │  │Connector   │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   v                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     Fingerprint Generation                      │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │Instruction │  │Semantic    │  │API Call    │                 │  │
│  │  │Hash        │  │KSG Hash    │  │Graph       │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   v                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                   Corpus Storage (PostgreSQL)                   │  │
│  │                                                                 │  │
│  │  corpus.libraries         - Known libraries                     │  │
│  │  corpus.library_versions  - Version snapshots                   │  │
│  │  corpus.build_variants    - Architecture/compiler variants      │  │
│  │  corpus.functions         - Function metadata                   │  │
│  │  corpus.fingerprints      - Fingerprint index                   │  │
│  │  corpus.function_clusters - Similar function groups             │  │
│  │  corpus.function_cves     - CVE associations                    │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
```
## Core Services
### ICorpusIngestionService
Handles ingestion of library binaries into the corpus.
```csharp
public interface ICorpusIngestionService
{
    // Ingest a single library binary
    Task<IngestionResult> IngestLibraryAsync(
        LibraryIngestionMetadata metadata,
        Stream binaryStream,
        IngestionOptions? options = null,
        CancellationToken ct = default);

    // Ingest from a library connector (bulk)
    IAsyncEnumerable<IngestionResult> IngestFromConnectorAsync(
        string libraryName,
        ILibraryCorpusConnector connector,
        IngestionOptions? options = null,
        CancellationToken ct = default);

    // Update CVE associations for functions
    Task<int> UpdateCveAssociationsAsync(
        string cveId,
        IReadOnlyList<FunctionCveAssociation> associations,
        CancellationToken ct = default);

    // Check job status
    Task<IngestionJob?> GetJobStatusAsync(Guid jobId, CancellationToken ct = default);
}
```
### ICorpusQueryService
Queries the corpus to identify functions by their fingerprints.
```csharp
public interface ICorpusQueryService
{
    // Identify a single function
    Task<ImmutableArray<FunctionMatch>> IdentifyFunctionAsync(
        FunctionFingerprints fingerprints,
        IdentifyOptions? options = null,
        CancellationToken ct = default);

    // Batch identify multiple functions
    Task<ImmutableDictionary<int, ImmutableArray<FunctionMatch>>> IdentifyBatchAsync(
        IReadOnlyList<FunctionFingerprints> fingerprintSets,
        IdentifyOptions? options = null,
        CancellationToken ct = default);

    // Get corpus statistics
    Task<CorpusStatistics> GetStatisticsAsync(CancellationToken ct = default);

    // List available libraries
    Task<ImmutableArray<LibrarySummary>> ListLibrariesAsync(CancellationToken ct = default);
}
```
### ILibraryCorpusConnector
Interface for library-specific connectors that fetch binaries for ingestion.
```csharp
public interface ILibraryCorpusConnector
{
    string LibraryName { get; }
    string[] SupportedArchitectures { get; }

    // Get available versions
    Task<ImmutableArray<string>> GetAvailableVersionsAsync(CancellationToken ct);

    // Fetch binaries for ingestion
    IAsyncEnumerable<LibraryBinary> FetchBinariesAsync(
        IReadOnlyList<string> versions,
        string architecture,
        LibraryFetchOptions? options = null,
        CancellationToken ct = default);
}
```
## Fingerprint Algorithms
The corpus uses multiple fingerprint algorithms to enable matching under different conditions:
### Semantic K-Skip-Gram Hash (`semantic_ksg`)
Based on Ghidra BSim's approach:
- Analyzes normalized p-code operations
- Generates k-skip-gram features from instruction sequences
- Robust against register renaming and basic-block reordering
- Best for matching functions across optimization levels
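The feature-generation step can be illustrated with a minimal sketch. It assumes a function has already been lifted to a list of normalized p-code op names and simplifies k-skip-grams to ordered pairs with a bounded gap. `SkipGramSketch`, the `>` separator, and the use of SHA-256 are hypothetical choices for illustration, not the StellaOps implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not part of the StellaOps API.
public static class SkipGramSketch
{
    // Emits every ordered pair of ops with at most `maxSkip` ops skipped between them.
    public static SortedSet<string> Features(IReadOnlyList<string> ops, int maxSkip)
    {
        var features = new SortedSet<string>(StringComparer.Ordinal);
        for (var i = 0; i < ops.Count; i++)
        {
            for (var j = i + 1; j < ops.Count && j - i - 1 <= maxSkip; j++)
            {
                features.Add($"{ops[i]}>{ops[j]}");
            }
        }
        return features;
    }

    // Hashes the sorted feature set, so the digest does not depend on discovery order.
    public static string Hash(IReadOnlyList<string> ops, int maxSkip)
    {
        var joined = string.Join("\n", Features(ops, maxSkip));
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(joined)));
    }
}
```

Because the features are built from op names only, two functions that differ only in register allocation produce the same feature set in this sketch. A production fingerprint would more likely compare feature sets for similarity than rely on one exact digest.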
### Instruction Basic-Block Hash (`instruction_bb`)
- Hashes normalized instruction sequences per basic block
- More sensitive to compiler differences
- Faster to compute than semantic hash
- Good for exact or near-exact matches
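As a rough illustration of per-block hashing, the sketch below normalizes x86-64 style assembly text by replacing register names and numeric literals in the operands with placeholders, then hashes the block. The regular expressions, the `REG`/`IMM` placeholders, and the `BasicBlockHashSketch` name are assumptions made for this example. The normalization rules the corpus actually applies are not specified in this guide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the StellaOps API.
public static class BasicBlockHashSketch
{
    // Assumed patterns: common x86-64 register names and decimal/hex literals.
    private static readonly Regex Register = new(@"\b(r[a-z]{2}|r\d{1,2}[dwb]?|e[a-z]{2})\b");
    private static readonly Regex Immediate = new(@"\b(0x[0-9a-f]+|\d+)\b");

    // Keeps the mnemonic and replaces registers and literals in the operands.
    public static string Normalize(string instruction)
    {
        var parts = instruction.Trim().ToLowerInvariant().Split(' ', 2);
        if (parts.Length == 1)
        {
            return parts[0];
        }
        var operands = Register.Replace(parts[1], "REG");
        operands = Immediate.Replace(operands, "IMM");
        return $"{parts[0]} {operands}";
    }

    // Hashes the normalized instruction sequence of one basic block.
    public static string HashBlock(IEnumerable<string> instructions)
    {
        var normalized = string.Join("\n", instructions.Select(Normalize));
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(normalized)));
    }
}
```

In this sketch, `mov rax, [rbp-0x10]` and `mov rcx, [rbp-0x20]` normalize to the same text, while a changed mnemonic or instruction order changes the digest. That matches the trade-off described above: cheap to compute, but sensitive to compiler differences.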
### Control-Flow Graph Hash (`cfg_wl`)
- Weisfeiler-Lehman graph hash of the CFG
- Captures structural similarity
- Works well even when instruction sequences differ
- Useful for detecting refactored code
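A Weisfeiler-Lehman hash can be sketched in a few lines. The version below assumes the CFG is given as a successor list per basic block, uses out-degree as the initial node label, and runs a fixed number of refinement rounds. These choices and the `WlHashSketch` name are illustrative assumptions, not the parameters used by `cfg_wl`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not part of the StellaOps API.
public static class WlHashSketch
{
    // successors[i] lists the block indices reachable from block i.
    public static string Hash(IReadOnlyList<int[]> successors, int iterations = 3)
    {
        // Assumed initial label: the out-degree of each block.
        var labels = successors.Select(s => s.Length.ToString()).ToArray();

        for (var round = 0; round < iterations; round++)
        {
            var next = new string[labels.Length];
            for (var node = 0; node < labels.Length; node++)
            {
                // Sort neighbor labels so successor ordering does not matter.
                var neighborLabels = successors[node]
                    .Select(n => labels[n])
                    .OrderBy(l => l, StringComparer.Ordinal);
                next[node] = Digest(labels[node] + "|" + string.Join(",", neighborLabels));
            }
            labels = next;
        }

        // Sort the final labels so block numbering does not matter.
        return Digest(string.Join(";", labels.OrderBy(l => l, StringComparer.Ordinal)));
    }

    private static string Digest(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
}
```

Since the labels carry no instruction content, two functions with the same branch structure hash identically here even when their instructions differ, which is why a structural hash complements the instruction-level fingerprints.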
## Usage Examples
### Ingesting a Library
```csharp
// Create ingestion metadata
var metadata = new LibraryIngestionMetadata(
    Name: "openssl",
    Version: "3.0.15",
    Architecture: "x86_64",
    Compiler: "gcc",
    CompilerVersion: "12.2",
    OptimizationLevel: "O2",
    IsSecurityRelease: true);

// Ingest from file
await using var stream = File.OpenRead("libssl.so.3");
var result = await ingestionService.IngestLibraryAsync(metadata, stream);

Console.WriteLine($"Indexed {result.FunctionsIndexed} functions");
Console.WriteLine($"Generated {result.FingerprintsGenerated} fingerprints");
```
### Bulk Ingestion via Connector
```csharp
// Use the OpenSSL connector to fetch and ingest multiple versions
var connector = new OpenSslCorpusConnector(httpClientFactory, logger);

await foreach (var result in ingestionService.IngestFromConnectorAsync(
    "openssl",
    connector,
    new IngestionOptions { GenerateClusters = true }))
{
    Console.WriteLine($"Ingested {result.LibraryName} {result.Version}: {result.FunctionsIndexed} functions");
}
```
### Identifying Functions
```csharp
// Build fingerprints from analyzed function
var fingerprints = new FunctionFingerprints(
    SemanticHash: semanticHashBytes,
    InstructionHash: instructionHashBytes,
    CfgHash: cfgHashBytes,
    ApiCalls: ["malloc", "memcpy", "free"],
    SizeBytes: 256);

// Query the corpus
var matches = await queryService.IdentifyFunctionAsync(
    fingerprints,
    new IdentifyOptions
    {
        MinSimilarity = 0.85m,
        MaxResults = 5,
        IncludeCveAssociations = true
    });

foreach (var match in matches)
{
    Console.WriteLine($"Match: {match.LibraryName} {match.Version} - {match.FunctionName}");
    Console.WriteLine($"  Similarity: {match.Similarity:P1}");
    Console.WriteLine($"  Match method: {match.MatchMethod}");

    if (match.CveAssociations.Any())
    {
        foreach (var cve in match.CveAssociations)
        {
            Console.WriteLine($"  CVE: {cve.CveId} ({cve.AffectedState})");
        }
    }
}
```
### Checking CVE Associations
```csharp
// When a function matches, check if it's associated with known CVEs.
// The result set may be empty, so guard before reading the top match.
if (!matches.IsEmpty)
{
    var match = matches[0];
    if (match.CveAssociations.Any(c => c.AffectedState == CveAffectedState.Vulnerable))
    {
        Console.WriteLine("WARNING: Function matches a known vulnerable variant!");
    }
}
```
## Database Schema
The corpus uses a dedicated PostgreSQL schema with the following key tables:
| Table | Purpose |
|-------|---------|
| `corpus.libraries` | Master list of tracked libraries |
| `corpus.library_versions` | Version records with release metadata |
| `corpus.build_variants` | Architecture/compiler/optimization variants |
| `corpus.functions` | Function metadata (name, address, size, etc.) |
| `corpus.fingerprints` | Fingerprint hashes indexed for lookup |
| `corpus.function_clusters` | Groups of similar functions |
| `corpus.function_cves` | CVE-to-function associations |
| `corpus.ingestion_jobs` | Job tracking for bulk ingestion |
## Supported Libraries
The corpus supports ingestion from these common libraries:
| Library | Connector | Architectures |
|---------|-----------|---------------|
| glibc | `GlibcCorpusConnector` | x86_64, aarch64, armv7, i686 |
| OpenSSL | `OpenSslCorpusConnector` | x86_64, aarch64, armv7 |
| zlib | `ZlibCorpusConnector` | x86_64, aarch64 |
| curl | `CurlCorpusConnector` | x86_64, aarch64 |
| SQLite | `SqliteCorpusConnector` | x86_64, aarch64 |
## Integration with Scanner
The corpus integrates with the Scanner module through `IBinaryVulnerabilityService`:
```csharp
// Scanner can identify functions from fingerprints
var matches = await binaryVulnService.IdentifyFunctionFromCorpusAsync(
    new FunctionFingerprintSet(
        FunctionAddress: 0x4000,
        SemanticHash: hash,
        InstructionHash: null,
        CfgHash: null,
        ApiCalls: null,
        SizeBytes: 128),
    new CorpusLookupOptions
    {
        MinSimilarity = 0.9m,
        MaxResults = 3
    });
```
## Performance Considerations
- **Batch queries**: Use `IdentifyBatchAsync` for multiple functions to reduce round-trips
- **Fingerprint selection**: The semantic hash is the most robust but the slowest to compute; the instruction hash is faster and suits exact or near-exact matches
- **Similarity threshold**: Higher thresholds reduce false positives but may miss legitimate matches
- **Clustering**: Pre-computed clusters speed up similarity searches
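To make the threshold trade-off concrete, the sketch below scores two feature sets with Jaccard similarity and compares the score against a minimum, using `decimal` to mirror the `MinSimilarity = 0.85m` style seen in the usage examples. Jaccard is an assumed metric chosen for illustration, since this guide does not state how the corpus computes similarity, and `SimilaritySketch` is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the StellaOps API.
public static class SimilaritySketch
{
    // Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 1 when both sets are empty.
    public static decimal Jaccard(IReadOnlySet<string> a, IReadOnlySet<string> b)
    {
        if (a.Count == 0 && b.Count == 0)
        {
            return 1m;
        }
        var intersection = a.Count(x => b.Contains(x));
        var union = a.Count + b.Count - intersection;
        return (decimal)intersection / union;
    }

    // A candidate is accepted only when its score reaches the threshold.
    public static bool IsMatch(IReadOnlySet<string> a, IReadOnlySet<string> b, decimal minSimilarity) =>
        Jaccard(a, b) >= minSimilarity;
}
```

With nine of ten features shared, the score is 0.9, so the pair is accepted at a threshold of 0.85 and rejected at 0.95. Raising the threshold trims borderline candidates like this one, at the cost of missing variants that drifted slightly between builds.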
## Security Notes
- Corpus connectors fetch from external sources; ensure network policies allow required endpoints
- Ingested binaries are hashed to prevent duplicate processing
- CVE associations include confidence scores and evidence types for auditability
- All timestamps use UTC for consistency
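The duplicate-detection idea can be sketched as a content hash checked against a set of digests already seen. The example keeps the set in memory and uses SHA-256; both are assumptions for illustration, as are the `IngestionDeduplicatorSketch` and `TryRegister` names. The real service presumably persists digests alongside its ingestion records.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration; not part of the StellaOps API.
public sealed class IngestionDeduplicatorSketch
{
    private readonly HashSet<string> _seen = new(StringComparer.Ordinal);

    // Returns true the first time a given byte sequence is registered,
    // and false for any later stream with identical content.
    public bool TryRegister(Stream binary)
    {
        using var sha = SHA256.Create();
        var digest = Convert.ToHexString(sha.ComputeHash(binary));
        return _seen.Add(digest);
    }
}
```

A caller would skip fingerprint generation whenever `TryRegister` returns false, so re-running a connector over versions that were already ingested costs one hash per binary rather than a full analysis.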
## Related Documentation
- [Binary Index Architecture](architecture.md)
- [Semantic Diffing](semantic-diffing.md)
- [Scanner Module](../scanner/architecture.md)