# Function Behavior Corpus Guide

This document describes the StellaOps Function Behavior Corpus, a BSim-like capability that identifies functions by their semantic behavior rather than by symbols or prior CVE signatures.

## Overview

The Function Behavior Corpus is a database of known library functions with pre-computed fingerprints, used to identify functions in stripped binaries. When a binary is analyzed, its functions are matched against the corpus to determine:

- **Library origin**: Which library (glibc, OpenSSL, zlib, etc.) the function comes from
- **Version information**: Which version(s) of the library contain this function
- **CVE associations**: Whether the function is linked to known vulnerabilities
- **Patch status**: Whether the function matches a vulnerable or a patched variant

## Architecture

```
┌───────────────────────────────────────────────────────────────────────┐
│                       Function Behavior Corpus                        │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     Corpus Ingestion Layer                      │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │GlibcCorpus │  │OpenSSL     │  │ZlibCorpus  │  ...            │  │
│  │  │Connector   │  │Connector   │  │Connector   │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   v                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     Fingerprint Generation                      │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │Instruction │  │Semantic    │  │API Call    │                 │  │
│  │  │Hash        │  │KSG Hash    │  │Graph       │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   v                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                   Corpus Storage (PostgreSQL)                   │  │
│  │                                                                 │  │
│  │  corpus.libraries         - Known libraries                     │  │
│  │  corpus.library_versions  - Version snapshots                   │  │
│  │  corpus.build_variants    - Architecture/compiler variants      │  │
│  │  corpus.functions         - Function metadata                   │  │
│  │  corpus.fingerprints      - Fingerprint index                   │  │
│  │  corpus.function_clusters - Similar function groups             │  │
│  │  corpus.function_cves     - CVE associations                    │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
```

## Core Services

### ICorpusIngestionService

Handles ingestion of library binaries into the corpus.

```csharp
public interface ICorpusIngestionService
{
    // Ingest a single library binary
    Task<IngestionResult> IngestLibraryAsync(
        LibraryIngestionMetadata metadata,
        Stream binaryStream,
        IngestionOptions? options = null,
        CancellationToken ct = default);

    // Ingest from a library connector (bulk)
    IAsyncEnumerable<IngestionResult> IngestFromConnectorAsync(
        string libraryName,
        ILibraryCorpusConnector connector,
        IngestionOptions? options = null,
        CancellationToken ct = default);

    // Update CVE associations for functions
    Task<int> UpdateCveAssociationsAsync(
        string cveId,
        IReadOnlyList<FunctionCveAssociation> associations,
        CancellationToken ct = default);

    // Check job status
    Task<IngestionJob?> GetJobStatusAsync(Guid jobId, CancellationToken ct = default);
}
```

### ICorpusQueryService

Queries the corpus to identify functions by their fingerprints.

```csharp
public interface ICorpusQueryService
{
    // Identify a single function
    Task<ImmutableArray<FunctionMatch>> IdentifyFunctionAsync(
        FunctionFingerprints fingerprints,
        IdentifyOptions? options = null,
        CancellationToken ct = default);

    // Batch identify multiple functions
    Task<ImmutableDictionary<int, ImmutableArray<FunctionMatch>>> IdentifyBatchAsync(
        IReadOnlyList<FunctionFingerprints> fingerprintSets,
        IdentifyOptions? options = null,
        CancellationToken ct = default);

    // Get corpus statistics
    Task<CorpusStatistics> GetStatisticsAsync(CancellationToken ct = default);

    // List available libraries
    Task<ImmutableArray<LibrarySummary>> ListLibrariesAsync(CancellationToken ct = default);
}
```

### ILibraryCorpusConnector

Interface for library-specific connectors that fetch binaries for ingestion.

```csharp
public interface ILibraryCorpusConnector
{
    string LibraryName { get; }
    string[] SupportedArchitectures { get; }

    // Get available versions
    Task<ImmutableArray<string>> GetAvailableVersionsAsync(CancellationToken ct);

    // Fetch binaries for ingestion
    IAsyncEnumerable<LibraryBinary> FetchBinariesAsync(
        IReadOnlyList<string> versions,
        string architecture,
        LibraryFetchOptions? options = null,
        CancellationToken ct = default);
}
```

## Fingerprint Algorithms

The corpus uses multiple fingerprint algorithms to enable matching under different conditions:

### Semantic K-Skip-Gram Hash (`semantic_ksg`)

Based on Ghidra BSim's approach:
- Analyzes normalized p-code operations
- Generates k-skip-gram features from instruction sequences
- Robust against register renaming and basic-block reordering
- Best for matching functions across optimization levels
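
As a rough illustration of the skip-gram idea only, the sketch below builds k-skip-bigram features from a sequence of already-normalized operation tokens and compares two feature sets with Jaccard similarity. It is not the BSim or StellaOps implementation: the class name `SkipGramSketch`, its methods, the token strings, and the choice of Jaccard similarity are all assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the StellaOps API.
public static class SkipGramSketch
{
    // Collects ordered token pairs that have at most maxSkip tokens between them
    // (a k-skip-bigram with k = maxSkip).
    public static HashSet<string> SkipBigrams(IReadOnlyList<string> ops, int maxSkip)
    {
        var features = new HashSet<string>(StringComparer.Ordinal);
        for (var i = 0; i < ops.Count; i++)
        {
            for (var j = i + 1; j < ops.Count && j - i - 1 <= maxSkip; j++)
            {
                features.Add(ops[i] + ">" + ops[j]);
            }
        }
        return features;
    }

    // Jaccard similarity of two feature sets: |A ∩ B| / |A ∪ B|.
    public static double Jaccard(HashSet<string> a, HashSet<string> b)
    {
        if (a.Count == 0 && b.Count == 0) return 1.0;
        var intersection = a.Count(x => b.Contains(x));
        var union = a.Count + b.Count - intersection;
        return (double)intersection / union;
    }
}
```

Because skipped pairs survive when an unrelated operation is inserted between two tokens, two sequences that differ by a single instruction still share part of their feature set, which is the property that makes this family of features tolerant of small compiler-induced changes.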

### Instruction Basic-Block Hash (`instruction_bb`)

- Hashes normalized instruction sequences per basic block
- More sensitive to compiler differences
- Faster to compute than semantic hash
- Good for exact or near-exact matches
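
A minimal sketch of per-block hashing, under stated assumptions: instructions arrive as disassembly text, normalization only lowercases, collapses whitespace, and replaces immediates with a placeholder, and SHA-256 is the digest. The class `BasicBlockHashSketch` and its methods are hypothetical; the normalization rules the corpus actually applies are not described in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the StellaOps API.
public static class BasicBlockHashSketch
{
    // Matches hexadecimal or decimal immediates in lowercased text.
    private static readonly Regex Immediate = new(@"\b0x[0-9a-f]+\b|\b\d+\b");
    private static readonly Regex Whitespace = new(@"\s+");

    // Lowercases, collapses whitespace, and replaces immediates with "IMM".
    public static string Normalize(string instruction)
    {
        var collapsed = Whitespace.Replace(instruction.Trim().ToLowerInvariant(), " ");
        return Immediate.Replace(collapsed, "IMM");
    }

    // SHA-256 (hex) over the normalized instructions of one basic block.
    public static string HashBlock(IEnumerable<string> instructions)
    {
        var text = string.Join("\n", instructions.Select(i => Normalize(i)));
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
    }

    // One hash per block, sorted so the result does not depend on block order.
    public static IReadOnlyList<string> HashFunction(IEnumerable<IEnumerable<string>> blocks) =>
        blocks.Select(b => HashBlock(b)).OrderBy(h => h, StringComparer.Ordinal).ToList();
}
```

Registers are deliberately left untouched here, so a different register allocation changes the hash. That matches the sensitivity to compiler differences noted in the list above.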

### Control-Flow Graph Hash (`cfg_wl`)

- Weisfeiler-Lehman graph hash of the CFG
- Captures structural similarity
- Works well even when instruction sequences differ
- Useful for detecting refactored code
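
The Weisfeiler-Lehman idea can be sketched as follows. This is a simplified directed variant written for illustration, not the corpus implementation: the CFG is assumed to be a successor map in which every node appears as a key, the initial node label is the out-degree, and SHA-256 provides stable label hashing. `CfgWlHashSketch` and its parameters are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of the StellaOps API.
public static class CfgWlHashSketch
{
    // successors maps each basic-block id to the ids of its successor blocks.
    // Every referenced successor must also be present as a key.
    public static string Hash(IReadOnlyDictionary<int, int[]> successors, int iterations = 3)
    {
        // Initial label: the out-degree of each node.
        var labels = successors.ToDictionary(
            kv => kv.Key,
            kv => kv.Value.Length.ToString(CultureInfo.InvariantCulture));

        for (var round = 0; round < iterations; round++)
        {
            var next = new Dictionary<int, string>();
            foreach (var (node, succ) in successors)
            {
                // Relabel from the node's own label plus its sorted successor labels.
                var neighborLabels = succ
                    .Select(s => labels[s])
                    .OrderBy(l => l, StringComparer.Ordinal);
                next[node] = Sha256Hex(labels[node] + "|" + string.Join(",", neighborLabels));
            }
            labels = next;
        }

        // The graph hash is the digest of the sorted multiset of final labels.
        return Sha256Hex(string.Join(",", labels.Values.OrderBy(l => l, StringComparer.Ordinal)));
    }

    private static string Sha256Hex(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
}
```

Node identifiers never enter the labels, so two CFGs with the same shape hash identically regardless of block addresses or numbering.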

## Usage Examples

### Ingesting a Library

```csharp
// Create ingestion metadata
var metadata = new LibraryIngestionMetadata(
    Name: "openssl",
    Version: "3.0.15",
    Architecture: "x86_64",
    Compiler: "gcc",
    CompilerVersion: "12.2",
    OptimizationLevel: "O2",
    IsSecurityRelease: true);

// Ingest from file
await using var stream = File.OpenRead("libssl.so.3");
var result = await ingestionService.IngestLibraryAsync(metadata, stream);

Console.WriteLine($"Indexed {result.FunctionsIndexed} functions");
Console.WriteLine($"Generated {result.FingerprintsGenerated} fingerprints");
```

### Bulk Ingestion via Connector

```csharp
// Use the OpenSSL connector to fetch and ingest multiple versions
var connector = new OpenSslCorpusConnector(httpClientFactory, logger);

await foreach (var result in ingestionService.IngestFromConnectorAsync(
    "openssl",
    connector,
    new IngestionOptions { GenerateClusters = true }))
{
    Console.WriteLine($"Ingested {result.LibraryName} {result.Version}: {result.FunctionsIndexed} functions");
}
```

### Identifying Functions

```csharp
// Build fingerprints from analyzed function
var fingerprints = new FunctionFingerprints(
    SemanticHash: semanticHashBytes,
    InstructionHash: instructionHashBytes,
    CfgHash: cfgHashBytes,
    ApiCalls: ["malloc", "memcpy", "free"],
    SizeBytes: 256);

// Query the corpus
var matches = await queryService.IdentifyFunctionAsync(
    fingerprints,
    new IdentifyOptions
    {
        MinSimilarity = 0.85m,
        MaxResults = 5,
        IncludeCveAssociations = true
    });

foreach (var match in matches)
{
    Console.WriteLine($"Match: {match.LibraryName} {match.Version} - {match.FunctionName}");
    Console.WriteLine($"  Similarity: {match.Similarity:P1}");
    Console.WriteLine($"  Match method: {match.MatchMethod}");

    if (match.CveAssociations.Any())
    {
        foreach (var cve in match.CveAssociations)
        {
            Console.WriteLine($"  CVE: {cve.CveId} ({cve.AffectedState})");
        }
    }
}
```

### Checking CVE Associations

```csharp
// When a function matches, check whether it is associated with known CVEs.
// Guard against an empty result before taking the first match.
if (!matches.IsDefaultOrEmpty)
{
    var match = matches[0];
    if (match.CveAssociations.Any(c => c.AffectedState == CveAffectedState.Vulnerable))
    {
        Console.WriteLine("WARNING: Function matches a known vulnerable variant!");
    }
}
```

## Database Schema

The corpus uses a dedicated PostgreSQL schema with the following key tables:

| Table | Purpose |
|-------|---------|
| `corpus.libraries` | Master list of tracked libraries |
| `corpus.library_versions` | Version records with release metadata |
| `corpus.build_variants` | Architecture/compiler/optimization variants |
| `corpus.functions` | Function metadata (name, address, size, etc.) |
| `corpus.fingerprints` | Fingerprint hashes indexed for lookup |
| `corpus.function_clusters` | Groups of similar functions |
| `corpus.function_cves` | CVE-to-function associations |
| `corpus.ingestion_jobs` | Job tracking for bulk ingestion |

## Supported Libraries

The corpus supports ingestion from these common libraries:

| Library | Connector | Architectures |
|---------|-----------|---------------|
| glibc | `GlibcCorpusConnector` | x86_64, aarch64, armv7, i686 |
| OpenSSL | `OpenSslCorpusConnector` | x86_64, aarch64, armv7 |
| zlib | `ZlibCorpusConnector` | x86_64, aarch64 |
| curl | `CurlCorpusConnector` | x86_64, aarch64 |
| SQLite | `SqliteCorpusConnector` | x86_64, aarch64 |

## Integration with Scanner

The corpus integrates with the Scanner module through `IBinaryVulnerabilityService`:

```csharp
// Scanner can identify functions from fingerprints
var matches = await binaryVulnService.IdentifyFunctionFromCorpusAsync(
    new FunctionFingerprintSet(
        FunctionAddress: 0x4000,
        SemanticHash: hash,
        InstructionHash: null,
        CfgHash: null,
        ApiCalls: null,
        SizeBytes: 128),
    new CorpusLookupOptions
    {
        MinSimilarity = 0.9m,
        MaxResults = 3
    });
```

## Performance Considerations

- **Batch queries**: Use `IdentifyBatchAsync` for multiple functions to reduce round-trips
- **Fingerprint selection**: Semantic hash is most robust but slowest; instruction hash is faster for exact matches
- **Similarity threshold**: Higher thresholds reduce false positives but may miss legitimate matches
- **Clustering**: Pre-computed clusters speed up similarity searches
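
For the batching point, one possible approach is to split the fingerprint sets of a scanned binary into fixed-size groups and issue one `IdentifyBatchAsync` call per group, so N functions cost roughly N / batchSize round-trips instead of N. The helper below is a sketch under assumptions: `BatchingSketch` is a hypothetical name, and whether the service needs or imposes a batch-size limit is not stated in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the StellaOps API.
public static class BatchingSketch
{
    // Splits items into consecutive batches of at most batchSize elements,
    // preserving the original order. Requires .NET 6+ for Enumerable.Chunk.
    public static IReadOnlyList<IReadOnlyList<T>> Split<T>(IReadOnlyList<T> items, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return items.Chunk(batchSize).Select(c => (IReadOnlyList<T>)c).ToList();
    }
}
```

Each returned batch can then be passed as the `fingerprintSets` argument. Note that the indices in the result dictionary would be relative to the batch, so a caller has to add the batch offset to map them back to the original list (assuming the `int` key is the position within `fingerprintSets`).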

## Security Notes

- Corpus connectors fetch from external sources; ensure network policies allow required endpoints
- Ingested binaries are hashed to prevent duplicate processing
- CVE associations include confidence scores and evidence types for auditability
- All timestamps use UTC for consistency
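
Content-hash deduplication of ingested binaries might look like the sketch below. It is an in-memory illustration only: the class `IngestionDeduplicatorSketch` is hypothetical, SHA-256 is an assumed digest, and a real deployment would more plausibly rely on a uniqueness check in the database than on a process-local set.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not part of the StellaOps API.
public sealed class IngestionDeduplicatorSketch
{
    private readonly HashSet<string> _seenDigests = new(StringComparer.Ordinal);

    // Returns true when the content has not been seen before, false for a duplicate.
    // Reads the stream from its current position to the end; rewind a seekable
    // stream before passing it on to ingestion.
    public bool TryRegister(Stream binary)
    {
        using var sha256 = SHA256.Create();
        var digest = Convert.ToHexString(sha256.ComputeHash(binary));
        return _seenDigests.Add(digest);
    }
}
```

Keying on the content digest rather than on file name or version means the same binary fetched from two mirrors is processed once.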

## Related Documentation

- [Binary Index Architecture](architecture.md)
- [Semantic Diffing](semantic-diffing.md)
- [Scanner Module](../scanner/architecture.md)