save progress

This commit is contained in:
StellaOps Bot
2026-01-06 09:42:02 +02:00
parent 94d68bee8b
commit 37e11918e0
443 changed files with 85863 additions and 897 deletions

View File

@@ -218,7 +218,198 @@ public sealed record VulnFingerprint(
public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }
```
#### 2.2.5 Semantic Analysis Library
> **Library:** `StellaOps.BinaryIndex.Semantic`
> **Sprint:** 20260105_001_001_BINDEX - Semantic Diffing Phase 1
The Semantic Analysis Library extends fingerprint generation with IR-level semantic matching, enabling detection of semantically equivalent code despite compiler optimizations, instruction reordering, and register allocation differences.
**Key Insight:** Traditional instruction-level fingerprinting loses roughly 15-20% accuracy on optimized binaries. Semantic analysis instead lifts code to B2R2's Intermediate Representation (LowUIR), extracts key-semantics graphs, and computes similarity via graph hashing.
##### 2.2.5.1 Architecture
```
Binary Input
v
B2R2 Disassembly → Raw Instructions
v
IR Lifting Service → LowUIR Statements
v
Semantic Graph Extractor → Key-Semantics Graph (KSG)
v
Graph Fingerprinting → Semantic Fingerprint
v
Semantic Matcher → Similarity Score + Deltas
```
##### 2.2.5.2 Core Components
**IR Lifting Service** (`IIrLiftingService`)
Lifts disassembled instructions to B2R2 LowUIR:
```csharp
public interface IIrLiftingService
{
Task<LiftedFunction> LiftToIrAsync(
IReadOnlyList<DisassembledInstruction> instructions,
string functionName,
LiftOptions? options = null,
CancellationToken ct = default);
}
public sealed record LiftedFunction(
string Name,
ImmutableArray<IrStatement> Statements,
ImmutableArray<IrBasicBlock> BasicBlocks);
```
**Semantic Graph Extractor** (`ISemanticGraphExtractor`)
Extracts key-semantics graphs capturing data dependencies, control flow, and memory operations:
```csharp
public interface ISemanticGraphExtractor
{
Task<KeySemanticsGraph> ExtractGraphAsync(
LiftedFunction function,
GraphExtractionOptions? options = null,
CancellationToken ct = default);
}
public sealed record KeySemanticsGraph(
string FunctionName,
ImmutableArray<SemanticNode> Nodes,
ImmutableArray<SemanticEdge> Edges,
GraphProperties Properties);
public enum SemanticNodeType { Compute, Load, Store, Branch, Call, Return, Phi }
public enum SemanticEdgeType { DataDependency, ControlDependency, MemoryDependency }
```
**Semantic Fingerprint Generator** (`ISemanticFingerprintGenerator`)
Generates semantic fingerprints using Weisfeiler-Lehman graph hashing:
```csharp
public interface ISemanticFingerprintGenerator
{
Task<SemanticFingerprint> GenerateAsync(
KeySemanticsGraph graph,
SemanticFingerprintOptions? options = null,
CancellationToken ct = default);
}
public sealed record SemanticFingerprint(
string FunctionName,
string GraphHashHex, // WL graph hash (SHA-256)
string OperationHashHex, // Normalized operation sequence hash
string DataFlowHashHex, // Data dependency pattern hash
int NodeCount,
int EdgeCount,
int CyclomaticComplexity,
ImmutableArray<string> ApiCalls,
SemanticFingerprintAlgorithm Algorithm);
```
**Semantic Matcher** (`ISemanticMatcher`)
Computes semantic similarity with weighted components:
```csharp
public interface ISemanticMatcher
{
Task<SemanticMatchResult> MatchAsync(
SemanticFingerprint a,
SemanticFingerprint b,
MatchOptions? options = null,
CancellationToken ct = default);
Task<SemanticMatchResult> MatchWithDeltasAsync(
SemanticFingerprint a,
SemanticFingerprint b,
MatchOptions? options = null,
CancellationToken ct = default);
}
public sealed record SemanticMatchResult(
decimal Similarity, // 0.00-1.00
decimal GraphSimilarity,
decimal OperationSimilarity,
decimal DataFlowSimilarity,
decimal ApiCallSimilarity,
MatchConfidence Confidence);
```
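Putting the four services together, a minimal end-to-end sketch (assuming DI registrations for the interfaces above; `provider`, `instructionsA`/`instructionsB`, and `ct` are illustrative, not part of the library):
```csharp
// Sketch: lift two functions to IR, extract graphs, fingerprint, compare.
var lifter = provider.GetRequiredService<IIrLiftingService>();
var extractor = provider.GetRequiredService<ISemanticGraphExtractor>();
var generator = provider.GetRequiredService<ISemanticFingerprintGenerator>();
var matcher = provider.GetRequiredService<ISemanticMatcher>();
async Task<SemanticFingerprint> FingerprintAsync(
    IReadOnlyList<DisassembledInstruction> instructions, string name, CancellationToken ct)
{
    var lifted = await lifter.LiftToIrAsync(instructions, name, ct: ct);
    var graph = await extractor.ExtractGraphAsync(lifted, ct: ct);
    return await generator.GenerateAsync(graph, ct: ct);
}
var a = await FingerprintAsync(instructionsA, "sub_401000", ct);
var b = await FingerprintAsync(instructionsB, "sub_402000", ct);
var result = await matcher.MatchAsync(a, b, ct: ct);
Console.WriteLine($"Similarity {result.Similarity:0.00} ({result.Confidence})");
```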
##### 2.2.5.3 Algorithm Details
**Weisfeiler-Lehman Graph Hashing:**
- 3 iterations of label propagation
- SHA-256 for final hash computation
- Deterministic node ordering via canonical sort
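A condensed sketch of that scheme (illustrative only; it assumes `SemanticNode` exposes an `Id` and `SemanticEdge` exposes `SourceId`/`TargetId`, which the records above do not spell out):
```csharp
using System.Security.Cryptography;
using System.Text;
// Illustrative Weisfeiler-Lehman hash over a KeySemanticsGraph.
static string WlGraphHash(KeySemanticsGraph g, int iterations = 3)
{
    // Initial labels: the node's semantic type.
    var labels = g.Nodes.ToDictionary(n => n.Id, n => n.Type.ToString());
    for (var i = 0; i < iterations; i++)
    {
        // Each iteration appends the canonically sorted labels of out-neighbors.
        labels = g.Nodes.ToDictionary(
            n => n.Id,
            n =>
            {
                var neighbors = g.Edges
                    .Where(e => e.SourceId == n.Id)
                    .Select(e => labels[e.TargetId])
                    .OrderBy(l => l, StringComparer.Ordinal);
                return $"{labels[n.Id]}({string.Join(",", neighbors)})";
            });
    }
    // Canonical sort of the final label multiset, then SHA-256.
    var canonical = string.Join("|", labels.Values.OrderBy(l => l, StringComparer.Ordinal));
    return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(canonical)));
}
```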
**Similarity Weights (Default):**
| Component | Weight |
|-----------|--------|
| Graph Hash | 0.35 |
| Operation Hash | 0.25 |
| Data Flow Hash | 0.25 |
| API Calls | 0.15 |
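With these defaults, the composite score is a weighted sum of the four component similarities reported in `SemanticMatchResult`:
```csharp
// Composite similarity under the default weights above (sum to 1.00).
static decimal CompositeSimilarity(SemanticMatchResult r) =>
    0.35m * r.GraphSimilarity +
    0.25m * r.OperationSimilarity +
    0.25m * r.DataFlowSimilarity +
    0.15m * r.ApiCallSimilarity;
```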
##### 2.2.5.4 Integration Points
The semantic library integrates with existing BinaryIndex components:
**DeltaSignatureGenerator Extension:**
```csharp
// Optional semantic services via constructor injection
services.AddDeltaSignaturesWithSemantic();
// Extended SymbolSignature with semantic properties
public sealed record SymbolSignature
{
// ... existing properties ...
public string? SemanticHashHex { get; init; }
public ImmutableArray<string> SemanticApiCalls { get; init; }
}
```
**PatchDiffEngine Extension:**
```csharp
// SemanticWeight in HashWeights
public decimal SemanticWeight { get; init; } = 0.2m;
// FunctionFingerprint extended with semantic fingerprint
public SemanticFingerprint? SemanticFingerprint { get; init; }
```
##### 2.2.5.5 Test Coverage
| Category | Tests | Coverage |
|----------|-------|----------|
| Unit Tests (IR lifting, graph extraction, hashing) | 53 | Core algorithms |
| Integration Tests (full pipeline) | 9 | End-to-end flow |
| Golden Corpus (compiler variations) | 11 | Register allocation, optimization, compiler variants |
| Benchmarks (accuracy, performance) | 7 | Baseline metrics |
##### 2.2.5.6 Current Baselines
> **Note:** Baselines reflect foundational implementation; accuracy improves as semantic features mature.
| Metric | Baseline | Target |
|--------|----------|--------|
| Similarity (register allocation variants) | ≥0.55 | ≥0.85 |
| Overall accuracy | ≥40% | ≥70% |
| False positive rate | <10% | <5% |
| P95 fingerprint latency | <100ms | <50ms |
#### 2.2.6 Binary Vulnerability Service
Main query interface for consumers.
@@ -688,8 +879,11 @@ binaryindex:
- Scanner Native Analysis: `src/Scanner/StellaOps.Scanner.Analyzers.Native/`
- Existing Fingerprinting: `src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/`
- Build-ID Index: `src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/`
- **Semantic Diffing Sprint:** `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- **Semantic Library:** `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- **Semantic Tests:** `src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/`
---
*Document Version: 1.1.0*
*Last Updated: 2026-01-05*

View File

@@ -0,0 +1,439 @@
# BSim PostgreSQL Database Setup Guide
**Version:** 1.0
**Sprint:** SPRINT_20260105_001_003_BINDEX
**Task:** GHID-011
## Overview
Ghidra's BSim (Binary Similarity) feature requires a separate PostgreSQL database for storing and querying function signatures. This guide covers setup and configuration.
## Architecture
```
┌──────────────────────────────────────────────────────┐
│ StellaOps BinaryIndex │
├──────────────────────────────────────────────────────┤
│ Main Corpus DB │ BSim DB (Ghidra) │
│ (corpus.* schema) │ (separate instance) │
│ │ │
│ - Function metadata │ - BSim signatures │
│ - Fingerprints │ - Feature vectors │
│ - Clusters │ - Similarity index │
│ - CVE associations │ │
└──────────────────────────────────────────────────────┘
```
**Why Separate?**
- BSim uses Ghidra-specific schema and stored procedures
- Different access patterns (corpus: OLTP, BSim: analytical)
- BSim database can be shared across multiple Ghidra instances
- Isolation prevents schema conflicts
## Prerequisites
- PostgreSQL 14+ (BSim requires specific PostgreSQL features)
- Ghidra 11.x with BSim extension
- Network connectivity between BinaryIndex services and BSim database
- At least 10GB storage for initial database (scales with corpus size)
## Database Setup
### 1. Create BSim Database
```bash
# Create database
createdb bsim_corpus
# Create user
psql -c "CREATE USER bsim_user WITH PASSWORD 'secure_password_here';"
psql -c "GRANT ALL PRIVILEGES ON DATABASE bsim_corpus TO bsim_user;"
```
### 2. Initialize BSim Schema
Ghidra provides scripts to initialize the BSim database schema:
```bash
# Set Ghidra home
export GHIDRA_HOME=/opt/ghidra
# Run BSim database initialization
$GHIDRA_HOME/Ghidra/Features/BSim/data/postgresql_init.sh \
--host localhost \
--port 5432 \
--database bsim_corpus \
--user bsim_user \
--password secure_password_here
```
Alternatively, use Ghidra's BSim server setup:
```bash
# Create BSim server configuration
$GHIDRA_HOME/support/bsimServerSetup \
postgresql://localhost:5432/bsim_corpus \
--user bsim_user \
--password secure_password_here
```
### 3. Verify Installation
```bash
# Connect to database
psql -h localhost -U bsim_user -d bsim_corpus
# Check BSim tables exist
\dt
# Expected tables:
# - bsim_functions
# - bsim_executables
# - bsim_vectors
# - bsim_clusters
# etc.
# Exit
\q
```
## Docker Deployment
### Docker Compose Configuration
```yaml
# docker-compose.bsim.yml
version: '3.8'
services:
bsim-postgres:
image: postgres:16
container_name: stellaops-bsim-db
environment:
POSTGRES_DB: bsim_corpus
POSTGRES_USER: bsim_user
POSTGRES_PASSWORD: ${BSIM_DB_PASSWORD}
POSTGRES_INITDB_ARGS: "-E UTF8 --locale=C"
volumes:
- bsim-data:/var/lib/postgresql/data
- ./scripts/init-bsim.sh:/docker-entrypoint-initdb.d/10-init-bsim.sh:ro
ports:
- "5433:5432" # Different port to avoid conflict with main DB
networks:
- stellaops
healthcheck:
test: ["CMD-SHELL", "pg_isready -U bsim_user -d bsim_corpus"]
interval: 10s
timeout: 5s
retries: 5
ghidra-headless:
image: stellaops/ghidra-headless:11.2
container_name: stellaops-ghidra
depends_on:
bsim-postgres:
condition: service_healthy
environment:
BSIM_DB_URL: "postgresql://bsim-postgres:5432/bsim_corpus"
BSIM_DB_USER: bsim_user
BSIM_DB_PASSWORD: ${BSIM_DB_PASSWORD}
JAVA_HOME: /opt/java/openjdk
MAXMEM: 4G
volumes:
- ghidra-projects:/projects
- ghidra-scripts:/scripts
networks:
- stellaops
deploy:
resources:
limits:
cpus: '4'
memory: 8G
volumes:
bsim-data:
driver: local
ghidra-projects:
ghidra-scripts:
networks:
stellaops:
driver: bridge
```
### Initialization Script
Create `scripts/init-bsim.sh`:
```bash
#!/bin/bash
set -e
# Wait for PostgreSQL to be ready
until pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"; do
echo "Waiting for PostgreSQL..."
sleep 2
done
echo "PostgreSQL is ready. Installing BSim schema..."
# Note: Actual BSim schema SQL would be sourced from Ghidra distribution
# This is a placeholder - replace with actual Ghidra BSim schema
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
-- BSim schema will be initialized by Ghidra tools
-- This script just ensures the database is ready
COMMENT ON DATABASE bsim_corpus IS 'Ghidra BSim function signature database';
EOSQL
echo "BSim database initialized successfully"
```
### Start Services
```bash
# Set password
export BSIM_DB_PASSWORD="your_secure_password"
# Start services
docker-compose -f docker-compose.bsim.yml up -d
# Check logs
docker-compose -f docker-compose.bsim.yml logs -f ghidra-headless
```
## Configuration
### BinaryIndex Configuration
Configure BSim connection in `appsettings.json`:
```json
{
"BinaryIndex": {
"Ghidra": {
"Enabled": true,
"GhidraHome": "/opt/ghidra",
"BSim": {
"Enabled": true,
"ConnectionString": "Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=...",
"MinSimilarity": 0.7,
"MaxResults": 10
}
}
}
}
```
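On the .NET side, the section above might bind to an options class like the following (an illustrative shape; the actual options types live in `StellaOps.BinaryIndex.Ghidra`):
```csharp
// Illustrative options mirroring the "BinaryIndex:Ghidra:BSim" section above.
public sealed class BSimOptions
{
    public bool Enabled { get; init; }
    public string ConnectionString { get; init; } = "";
    public double MinSimilarity { get; init; } = 0.7;
    public int MaxResults { get; init; } = 10;
}
// Binding via the standard Options pattern:
services.Configure<BSimOptions>(configuration.GetSection("BinaryIndex:Ghidra:BSim"));
```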
### Environment Variables
```bash
# BSim database connection
export STELLAOPS_BSIM_CONNECTION="Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=..."
# BSim feature
export STELLAOPS_BSIM_ENABLED=true
# Query tuning
export STELLAOPS_BSIM_MIN_SIMILARITY=0.7
export STELLAOPS_BSIM_QUERY_TIMEOUT=30
```
## Usage
### Ingesting Functions into BSim
```csharp
using StellaOps.BinaryIndex.Ghidra;
var bsimService = serviceProvider.GetRequiredService<IBSimService>();
// Analyze binary with Ghidra
var ghidraService = serviceProvider.GetRequiredService<IGhidraService>();
var analysis = await ghidraService.AnalyzeAsync(binaryStream, ct: ct);
// Generate BSim signatures
var signatures = await bsimService.GenerateSignaturesAsync(analysis, ct: ct);
// Ingest into BSim database
await bsimService.IngestAsync("glibc", "2.31", signatures, ct);
```
### Querying BSim
```csharp
// Query for similar functions
var queryOptions = new BSimQueryOptions
{
MinSimilarity = 0.7,
MinSignificance = 0.5,
MaxResults = 10
};
var matches = await bsimService.QueryAsync(signature, queryOptions, ct);
foreach (var match in matches)
{
Console.WriteLine($"Match: {match.MatchedLibrary} {match.MatchedVersion} - {match.MatchedFunction}");
Console.WriteLine($"Similarity: {match.Similarity:P2}, Confidence: {match.Confidence:P2}");
}
```
## Maintenance
### Database Vacuum
```bash
# Regular vacuum (run weekly)
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM ANALYZE;"
# Full vacuum (run monthly; VACUUM FULL takes exclusive locks, so schedule it in a maintenance window)
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM FULL;"
```
### Backup and Restore
```bash
# Backup
pg_dump -h localhost -U bsim_user -d bsim_corpus -F c -f bsim_backup_$(date +%Y%m%d).dump
# Restore
pg_restore -h localhost -U bsim_user -d bsim_corpus -c bsim_backup_20260105.dump
```
### Monitoring
```sql
-- Check database size
SELECT pg_size_pretty(pg_database_size('bsim_corpus'));
-- Check signature count
SELECT COUNT(*) FROM bsim_functions;
-- Check recent ingest activity
SELECT * FROM bsim_ingest_log ORDER BY ingested_at DESC LIMIT 10;
```
## Performance Tuning
### PostgreSQL Configuration
Add to `postgresql.conf`:
```ini
# Memory settings for BSim workload
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 256MB
maintenance_work_mem = 1GB
# Query parallelism
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
# Indexes
random_page_cost = 1.1 # For SSD storage
```
### Indexing Strategy
BSim automatically creates required indexes. Monitor slow queries:
```sql
-- Enable query logging
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1s
SELECT pg_reload_conf();
-- Check slow queries (requires the pg_stat_statements extension)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query LIKE '%bsim%'
ORDER BY mean_exec_time DESC
LIMIT 10;
```
## Troubleshooting
### Connection Refused
```
Error: could not connect to server: Connection refused
```
**Solution:**
1. Verify PostgreSQL is running: `systemctl status postgresql`
2. Check port: `netstat -an | grep 5433`
3. Verify firewall rules
4. Check `pg_hba.conf` for access rules
### Schema Not Found
```
Error: relation "bsim_functions" does not exist
```
**Solution:**
1. Re-run BSim schema initialization
2. Verify Ghidra version compatibility
3. Check BSim extension is installed in Ghidra
### Poor Query Performance
```
Warning: BSim queries taking > 5s
```
**Solution:**
1. Run `VACUUM ANALYZE` on BSim tables
2. Increase `work_mem` for complex queries
3. Check index usage: `EXPLAIN ANALYZE` on slow queries
4. Consider partitioning large tables
## Security Considerations
1. **Network Access:** BSim database should only be accessible from BinaryIndex services and Ghidra instances
2. **Authentication:** Use strong passwords, consider certificate-based authentication
3. **Encryption:** Enable SSL/TLS for database connections in production
4. **Access Control:** Grant minimum necessary privileges
```sql
-- Create read-only user for query services
CREATE USER bsim_readonly WITH PASSWORD '...';
GRANT CONNECT ON DATABASE bsim_corpus TO bsim_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO bsim_readonly;
```
## Integration with Corpus
The BSim database complements the main corpus database:
- **Corpus DB:** Stores function metadata, fingerprints, CVE associations
- **BSim DB:** Stores Ghidra-specific behavioral signatures and feature vectors
Functions are cross-referenced by:
- Library name + version
- Function name
- Binary hash
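Modeled in code, that composite key might look like this (an illustrative record, not an existing type):
```csharp
// Illustrative composite key linking corpus and BSim records.
public sealed record FunctionCrossReference(
    string LibraryName,    // e.g. "glibc"
    string LibraryVersion, // e.g. "2.31"
    string FunctionName,   // symbol or recovered name
    string BinarySha256);  // hash of the containing binary
```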
## Status: GHID-011 Resolution
**Implementation Status:** Service code complete (`BSimService.cs` implemented)
**Database Status:** Schema initialization documented, awaiting infrastructure provisioning
**Blocker Resolution:** This guide provides complete setup instructions. Database can be provisioned by:
1. Operations team following Docker Compose setup above
2. Developers using local PostgreSQL with manual schema init
3. CI/CD using containerized BSim database for integration tests
**Next Steps:**
1. Provision BSim PostgreSQL instance (dev/staging/prod)
2. Run BSim schema initialization
3. Test BSimService connectivity
4. Ingest initial corpus into BSim
## References
- Ghidra BSim Documentation: https://ghidra.re/ghidra_docs/api/ghidra/features/bsim/
- Sprint: `docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
- BSimService Implementation: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ghidra/Services/BSimService.cs`

View File

@@ -0,0 +1,232 @@
# Corpus Ingestion Operations Guide
**Version:** 1.0
**Sprint:** SPRINT_20260105_001_002_BINDEX
**Status:** Implementation Complete - Operational Execution Pending
## Overview
This guide describes how to execute corpus ingestion operations to populate the function behavior corpus with fingerprints from known library functions.
## Prerequisites
- StellaOps.BinaryIndex.Corpus library built and deployed
- PostgreSQL database with corpus schema (see `docs/db/schemas/corpus.sql`)
- Network access to package mirrors (or local package cache)
- Sufficient disk space (~100GB for full corpus)
- Required tools:
- .NET 10 runtime
- HTTP client access to package repositories
## Implementation Status
**CORP-015, CORP-016, CORP-017: Implementation COMPLETE**
All corpus connector implementations are complete and build successfully:
- ✓ GlibcCorpusConnector (GNU C Library)
- ✓ OpenSslCorpusConnector (OpenSSL)
- ✓ ZlibCorpusConnector (zlib)
- ✓ CurlCorpusConnector (libcurl)
**Status:** Code implementation is done. These tasks require **operational execution** to download and ingest real package data.
## Running Corpus Ingestion
### 1. Configure Package Sources
Set up access to package mirrors in your configuration:
```yaml
# config/corpus-ingestion.yaml
packageSources:
debian:
mirrorUrl: "http://deb.debian.org/debian"
distributions: ["bullseye", "bookworm"]
components: ["main"]
ubuntu:
mirrorUrl: "http://archive.ubuntu.com/ubuntu"
distributions: ["focal", "jammy"]
alpine:
mirrorUrl: "https://dl-cdn.alpinelinux.org/alpine"
versions: ["v3.18", "v3.19"]
```
### 2. Environment Variables
```bash
# Database connection
export STELLAOPS_CORPUS_DB="Host=localhost;Database=stellaops;Username=corpus_user;Password=..."
# Package cache directory (optional)
export STELLAOPS_PACKAGE_CACHE="/var/cache/stellaops/packages"
# Concurrent workers
export STELLAOPS_INGESTION_WORKERS=4
```
### 3. Execute Ingestion (CLI)
```bash
# Ingest specific library version
stellaops corpus ingest --library glibc --version 2.31 --architectures x86_64,aarch64
# Ingest version range
stellaops corpus ingest --library openssl --version-range "1.1.0..1.1.1" --architectures x86_64
# Ingest from local binary
stellaops corpus ingest-binary --library glibc --version 2.31 --arch x86_64 --path /usr/lib/x86_64-linux-gnu/libc.so.6
# Full ingestion job (all configured libraries)
stellaops corpus ingest-full --config config/corpus-ingestion.yaml
```
### 4. Execute Ingestion (Programmatic)
```csharp
using StellaOps.BinaryIndex.Corpus;
using StellaOps.BinaryIndex.Corpus.Connectors;
// Setup
var serviceProvider = ...; // Configure DI
var ingestionService = serviceProvider.GetRequiredService<ICorpusIngestionService>();
var glibcConnector = serviceProvider.GetRequiredService<GlibcCorpusConnector>();
// Fetch available versions
var versions = await glibcConnector.GetAvailableVersionsAsync(ct);
// Ingest the first five returned versions for each target architecture
foreach (var version in versions.Take(5))
{
foreach (var arch in new[] { "x86_64", "aarch64" })
{
try
{
var binary = await glibcConnector.FetchBinaryAsync(version, arch, abi: "gnu", ct);
var metadata = new LibraryIngestionMetadata(
Name: "glibc",
Version: version,
Architecture: arch,
Abi: "gnu",
Compiler: "gcc",
OptimizationLevel: "O2"
);
using var stream = File.OpenRead(binary.Path);
var result = await ingestionService.IngestLibraryAsync(metadata, stream, ct: ct);
Console.WriteLine($"Ingested {result.FunctionsIndexed} functions from glibc {version} {arch}");
}
catch (Exception ex)
{
Console.WriteLine($"Failed to ingest glibc {version} {arch}: {ex.Message}");
}
}
}
```
## Ingestion Workflow
```
1. Package Discovery
└─> Query package mirror for available versions
2. Package Download
└─> Fetch .deb/.apk/.rpm package
└─> Extract binary files
3. Binary Analysis
└─> Disassemble with B2R2
└─> Lift to IR (semantic fingerprints)
└─> Extract functions, imports, exports
4. Fingerprint Generation
└─> Instruction-level fingerprints
└─> Semantic graph fingerprints
└─> API call sequence fingerprints
└─> Combined fingerprints
5. Database Storage
└─> Insert library/version records
└─> Insert build variant records
└─> Insert function records
└─> Insert fingerprint records
6. Clustering (post-ingestion)
└─> Group similar functions across versions
└─> Compute centroids
```
## Expected Corpus Coverage
### Phase 2a (Priority Libraries)
| Library | Versions | Architectures | Est. Functions | Status |
|---------|----------|---------------|----------------|--------|
| glibc | 2.17, 2.28, 2.31, 2.35, 2.38 | x64, arm64, armv7 | ~15,000 | Ready to ingest |
| OpenSSL | 1.0.2, 1.1.0, 1.1.1, 3.0, 3.1 | x64, arm64 | ~8,000 | Ready to ingest |
| zlib | 1.2.8, 1.2.11, 1.2.13, 1.3 | x64, arm64 | ~200 | Ready to ingest |
| libcurl | 7.50-7.88 (select) | x64, arm64 | ~2,000 | Ready to ingest |
| SQLite | 3.30-3.44 (select) | x64, arm64 | ~1,500 | Ready to ingest |
**Total Phase 2a:** ~26,700 unique functions, ~80,000 fingerprints (with variants)
## Monitoring Ingestion
```bash
# Check ingestion job status
stellaops corpus jobs list
# View statistics
stellaops corpus stats
# Query specific library coverage
stellaops corpus query --library glibc --show-versions
```
## Performance Considerations
- **Parallel ingestion:** Use multiple workers for concurrent processing
- **Disk I/O:** Local package cache significantly speeds up repeated ingestion
- **Database:** Ensure PostgreSQL has adequate memory for bulk inserts
- **Network:** Mirror selection impacts download speed
## Troubleshooting
### Package Download Failures
```
Error: Failed to download package from mirror
Solution: Check mirror availability, try alternative mirror
```
### Fingerprint Generation Failures
```
Error: Failed to generate semantic fingerprint for function X
Solution: Check B2R2 support for architecture, verify binary format
```
### Database Connection Issues
```
Error: Could not connect to corpus database
Solution: Verify STELLAOPS_CORPUS_DB connection string, check PostgreSQL is running
```
## Next Steps
After successful ingestion:
1. Run clustering: `stellaops corpus cluster --library glibc`
2. Update CVE associations: `stellaops corpus update-cves`
3. Validate query performance: `stellaops corpus benchmark-query`
4. Export statistics: `stellaops corpus export-stats --output corpus-stats.json`
## Related Documentation
- Database Schema: `docs/db/schemas/corpus.sql`
- Architecture: `docs/modules/binary-index/corpus-management.md`
- Sprint: `docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`

View File

@@ -0,0 +1,313 @@
# Function Behavior Corpus Guide
This document describes StellaOps' Function Behavior Corpus system - a BSim-like capability for identifying functions by their semantic behavior rather than relying on symbols or prior CVE signatures.
## Overview
The Function Behavior Corpus is a database of known library functions with pre-computed fingerprints that enable identification of functions in stripped binaries. When a binary is analyzed, functions can be matched against the corpus to determine:
- **Library origin** - Which library (glibc, OpenSSL, zlib, etc.) the function comes from
- **Version information** - Which version(s) of the library contain this function
- **CVE associations** - Whether the function is linked to known vulnerabilities
- **Patch status** - Whether a function matches a vulnerable or patched variant
## Architecture
```
┌───────────────────────────────────────────────────────────────────────┐
│ Function Behavior Corpus │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Corpus Ingestion Layer │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │GlibcCorpus │ │OpenSSL │ │ZlibCorpus │ ... │ │
│ │ │Connector │ │Connector │ │Connector │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Fingerprint Generation │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │Instruction │ │Semantic │ │API Call │ │ │
│ │ │Hash │ │KSG Hash │ │Graph │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Corpus Storage (PostgreSQL) │ │
│ │ │ │
│ │ corpus.libraries - Known libraries │ │
│  │  corpus.library_versions - Version snapshots             │  │
│ │ corpus.build_variants - Architecture/compiler variants │ │
│ │ corpus.functions - Function metadata │ │
│ │ corpus.fingerprints - Fingerprint index │ │
│  │  corpus.function_clusters - Similar function groups      │  │
│ │ corpus.function_cves - CVE associations │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
```
## Core Services
### ICorpusIngestionService
Handles ingestion of library binaries into the corpus.
```csharp
public interface ICorpusIngestionService
{
// Ingest a single library binary
Task<IngestionResult> IngestLibraryAsync(
LibraryIngestionMetadata metadata,
Stream binaryStream,
IngestionOptions? options = null,
CancellationToken ct = default);
// Ingest from a library connector (bulk)
IAsyncEnumerable<IngestionResult> IngestFromConnectorAsync(
string libraryName,
ILibraryCorpusConnector connector,
IngestionOptions? options = null,
CancellationToken ct = default);
// Update CVE associations for functions
Task<int> UpdateCveAssociationsAsync(
string cveId,
IReadOnlyList<FunctionCveAssociation> associations,
CancellationToken ct = default);
// Check job status
Task<IngestionJob?> GetJobStatusAsync(Guid jobId, CancellationToken ct = default);
}
```
### ICorpusQueryService
Queries the corpus to identify functions by their fingerprints.
```csharp
public interface ICorpusQueryService
{
// Identify a single function
Task<ImmutableArray<FunctionMatch>> IdentifyFunctionAsync(
FunctionFingerprints fingerprints,
IdentifyOptions? options = null,
CancellationToken ct = default);
// Batch identify multiple functions
Task<ImmutableDictionary<int, ImmutableArray<FunctionMatch>>> IdentifyBatchAsync(
IReadOnlyList<FunctionFingerprints> fingerprintSets,
IdentifyOptions? options = null,
CancellationToken ct = default);
// Get corpus statistics
Task<CorpusStatistics> GetStatisticsAsync(CancellationToken ct = default);
// List available libraries
Task<ImmutableArray<LibrarySummary>> ListLibrariesAsync(CancellationToken ct = default);
}
```
### ILibraryCorpusConnector
Interface for library-specific connectors that fetch binaries for ingestion.
```csharp
public interface ILibraryCorpusConnector
{
string LibraryName { get; }
string[] SupportedArchitectures { get; }
// Get available versions
Task<ImmutableArray<string>> GetAvailableVersionsAsync(CancellationToken ct);
// Fetch binaries for ingestion
IAsyncEnumerable<LibraryBinary> FetchBinariesAsync(
IReadOnlyList<string> versions,
string architecture,
LibraryFetchOptions? options = null,
CancellationToken ct = default);
}
```
## Fingerprint Algorithms
The corpus uses multiple fingerprint algorithms to enable matching under different conditions:
### Semantic K-Skip-Gram Hash (`semantic_ksg`)
Based on Ghidra BSim's approach:
- Analyzes normalized p-code operations
- Generates k-skip-gram features from instruction sequences
- Robust against register renaming and basic-block reordering
- Best for matching functions across optimization levels
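A minimal sketch of the k-skip-gram step described above (feature generation only; BSim's actual feature extraction and weighting are more involved):
```csharp
using System.Collections.Generic;
// 2-grams that skip up to k intermediate operations, tolerating
// local instruction reordering. Input is the normalized op sequence.
static IEnumerable<string> KSkipGrams(IReadOnlyList<string> ops, int k = 2)
{
    for (var i = 0; i < ops.Count; i++)
        for (var gap = 1; gap <= k + 1 && i + gap < ops.Count; gap++)
            yield return $"{ops[i]}>{ops[i + gap]}";
}
// ["load","add","store"] with k=1 yields "load>add", "load>store", "add>store".
```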
### Instruction Basic-Block Hash (`instruction_bb`)
- Hashes normalized instruction sequences per basic block
- More sensitive to compiler differences
- Faster to compute than semantic hash
- Good for exact or near-exact matches
### Control-Flow Graph Hash (`cfg_wl`)
- Weisfeiler-Lehman graph hash of the CFG
- Captures structural similarity
- Works well even when instruction sequences differ
- Useful for detecting refactored code
## Usage Examples
### Ingesting a Library
```csharp
// Create ingestion metadata
var metadata = new LibraryIngestionMetadata(
Name: "openssl",
Version: "3.0.15",
Architecture: "x86_64",
Compiler: "gcc",
CompilerVersion: "12.2",
OptimizationLevel: "O2",
IsSecurityRelease: true);
// Ingest from file
await using var stream = File.OpenRead("libssl.so.3");
var result = await ingestionService.IngestLibraryAsync(metadata, stream);
Console.WriteLine($"Indexed {result.FunctionsIndexed} functions");
Console.WriteLine($"Generated {result.FingerprintsGenerated} fingerprints");
```
### Bulk Ingestion via Connector
```csharp
// Use the OpenSSL connector to fetch and ingest multiple versions
var connector = new OpenSslCorpusConnector(httpClientFactory, logger);
await foreach (var result in ingestionService.IngestFromConnectorAsync(
"openssl",
connector,
new IngestionOptions { GenerateClusters = true }))
{
Console.WriteLine($"Ingested {result.LibraryName} {result.Version}: {result.FunctionsIndexed} functions");
}
```
### Identifying Functions
```csharp
// Build fingerprints from analyzed function
var fingerprints = new FunctionFingerprints(
SemanticHash: semanticHashBytes,
InstructionHash: instructionHashBytes,
CfgHash: cfgHashBytes,
ApiCalls: ["malloc", "memcpy", "free"],
SizeBytes: 256);
// Query the corpus
var matches = await queryService.IdentifyFunctionAsync(
fingerprints,
new IdentifyOptions
{
MinSimilarity = 0.85m,
MaxResults = 5,
IncludeCveAssociations = true
});
foreach (var match in matches)
{
Console.WriteLine($"Match: {match.LibraryName} {match.Version} - {match.FunctionName}");
Console.WriteLine($" Similarity: {match.Similarity:P1}");
Console.WriteLine($" Match method: {match.MatchMethod}");
if (match.CveAssociations.Any())
{
foreach (var cve in match.CveAssociations)
{
Console.WriteLine($" CVE: {cve.CveId} ({cve.AffectedState})");
}
}
}
```
### Checking CVE Associations
```csharp
// When a function matches, check if it's associated with known CVEs
var match = matches.First();
if (match.CveAssociations.Any(c => c.AffectedState == CveAffectedState.Vulnerable))
{
Console.WriteLine("WARNING: Function matches a known vulnerable variant!");
}
```
## Database Schema
The corpus uses a dedicated PostgreSQL schema with the following key tables:
| Table | Purpose |
|-------|---------|
| `corpus.libraries` | Master list of tracked libraries |
| `corpus.library_versions` | Version records with release metadata |
| `corpus.build_variants` | Architecture/compiler/optimization variants |
| `corpus.functions` | Function metadata (name, address, size, etc.) |
| `corpus.fingerprints` | Fingerprint hashes indexed for lookup |
| `corpus.function_clusters` | Groups of similar functions |
| `corpus.function_cves` | CVE-to-function associations |
| `corpus.ingestion_jobs` | Job tracking for bulk ingestion |
## Supported Libraries
The corpus supports ingestion from these common libraries:
| Library | Connector | Architectures |
|---------|-----------|---------------|
| glibc | `GlibcCorpusConnector` | x86_64, aarch64, armv7, i686 |
| OpenSSL | `OpenSslCorpusConnector` | x86_64, aarch64, armv7 |
| zlib | `ZlibCorpusConnector` | x86_64, aarch64 |
| curl | `CurlCorpusConnector` | x86_64, aarch64 |
| SQLite | `SqliteCorpusConnector` | x86_64, aarch64 |
## Integration with Scanner
The corpus integrates with the Scanner module through `IBinaryVulnerabilityService`:
```csharp
// Scanner can identify functions from fingerprints
var matches = await binaryVulnService.IdentifyFunctionFromCorpusAsync(
new FunctionFingerprintSet(
FunctionAddress: 0x4000,
SemanticHash: hash,
InstructionHash: null,
CfgHash: null,
ApiCalls: null,
SizeBytes: 128),
new CorpusLookupOptions
{
MinSimilarity = 0.9m,
MaxResults = 3
});
```
## Performance Considerations
- **Batch queries**: Use `IdentifyBatchAsync` for multiple functions to reduce round-trips (see the sketch below)
- **Fingerprint selection**: Semantic hash is most robust but slowest; instruction hash is faster for exact matches
- **Similarity threshold**: Higher thresholds reduce false positives but may miss legitimate matches
- **Clustering**: Pre-computed clusters speed up similarity searches
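For example, a batched lookup using `IdentifyBatchAsync` (a sketch; `allFingerprints` is an assumed `IReadOnlyList<FunctionFingerprints>`):
```csharp
// Identify many functions in one round-trip; results are keyed by the
// index of each fingerprint set in the input list.
var batch = await queryService.IdentifyBatchAsync(
    allFingerprints,
    new IdentifyOptions { MinSimilarity = 0.85m, MaxResults = 3 },
    ct);
foreach (var (index, matches) in batch)
{
    if (matches.IsEmpty) { Console.WriteLine($"[{index}] no match"); continue; }
    var best = matches[0];
    Console.WriteLine($"[{index}] {best.LibraryName} {best.Version} {best.FunctionName} ({best.Similarity:P1})");
}
```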
## Security Notes
- Corpus connectors fetch from external sources; ensure network policies allow required endpoints
- Ingested binaries are hashed to prevent duplicate processing
- CVE associations include confidence scores and evidence types for auditability
- All timestamps use UTC for consistency
## Related Documentation
- [Binary Index Architecture](architecture.md)
- [Semantic Diffing](semantic-diffing.md)
- [Scanner Module](../scanner/architecture.md)

File diff suppressed because it is too large

View File

@@ -0,0 +1,304 @@
# BinaryIndex ML Model Training Guide
This document describes how to train, export, and deploy ML models for the BinaryIndex binary similarity detection system.
## Overview
The BinaryIndex ML pipeline uses transformer-based models to generate function embeddings that capture semantic similarity. The primary model is **CodeBERT-Binary**, a fine-tuned variant of CodeBERT optimized for decompiled binary code comparison.
## Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Model Training Pipeline │
│ │
│ ┌───────────────┐ ┌────────────────┐ ┌──────────────────┐ │
│ │ Training Data │ -> │ Fine-tuning │ -> │ Model Export │ │
│ │ (Function │ │ (Contrastive │ │ (ONNX format) │ │
│ │ Pairs) │ │ Learning) │ │ │ │
│ └───────────────┘ └────────────────┘ └──────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Inference Pipeline │ │
│ │ │ │
│ │ Code -> Tokenizer -> ONNX Runtime -> Embedding (768-dim) │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```
## Training Data Requirements
### Positive Pairs (Similar Functions)
| Source | Description | Estimated Count |
|--------|-------------|-----------------|
| Same function, different optimization | O0 vs O2 vs O3 compilations | ~50,000 |
| Same function, different compiler | GCC vs Clang vs MSVC | ~30,000 |
| Same function, different version | From corpus snapshots | ~100,000 |
| Vulnerability patches | Vulnerable vs fixed versions | ~20,000 |
### Negative Pairs (Dissimilar Functions)
| Source | Description | Estimated Count |
|--------|-------------|-----------------|
| Random function pairs | Random sampling from corpus | ~100,000 |
| Similar-named different functions | Hard negatives for robustness | ~50,000 |
| Same library, different functions | Medium-difficulty negatives | ~50,000 |
**Total training data:** ~400,000 labeled pairs
### Data Format
Training data is stored in JSON Lines (JSONL) format:
```json
{"function_a": "int sum(int* a, int n) { int s = 0; for (int i = 0; i < n; i++) s += a[i]; return s; }", "function_b": "int total(int* arr, int len) { int t = 0; for (int j = 0; j < len; j++) t += arr[j]; return t; }", "is_similar": true, "similarity_score": 0.95}
{"function_a": "int sum(int* a, int n) { ... }", "function_b": "void print(char* s) { ... }", "is_similar": false, "similarity_score": 0.1}
```
## Training Process
### Prerequisites
- Python 3.10+
- PyTorch 2.0+
- Transformers 4.30+
- CUDA 11.8+ (for GPU training)
- 64GB RAM, 32GB VRAM (V100 or A100 recommended)
### Installation
```bash
cd tools/ml
pip install -r requirements.txt
```
### Configuration
Create a training configuration file `config/training.yaml`:
```yaml
model:
base_model: microsoft/codebert-base
embedding_dim: 768
max_sequence_length: 512
training:
batch_size: 32
epochs: 10
learning_rate: 1e-5
warmup_steps: 1000
weight_decay: 0.01
contrastive:
margin: 0.5
temperature: 0.07
data:
train_path: data/train.jsonl
val_path: data/val.jsonl
test_path: data/test.jsonl
output:
model_dir: models/codebert-binary
checkpoint_interval: 1000
```
### Running Training
```bash
python train_codebert_binary.py --config config/training.yaml
```
Training logs are written to `logs/` and checkpoints to `models/`.
### Training Script Overview
```python
# tools/ml/train_codebert_binary.py
import torch
from transformers import RobertaModel
class CodeBertBinaryModel(torch.nn.Module):
"""CodeBERT fine-tuned for binary code similarity."""
def __init__(self, pretrained_model="microsoft/codebert-base"):
super().__init__()
self.encoder = RobertaModel.from_pretrained(pretrained_model)
self.projection = torch.nn.Linear(768, 768)
def forward(self, input_ids, attention_mask):
outputs = self.encoder(input_ids, attention_mask=attention_mask)
pooled = outputs.last_hidden_state[:, 0, :] # [CLS] token
projected = self.projection(pooled)
return torch.nn.functional.normalize(projected, p=2, dim=1)
class ContrastiveLoss(torch.nn.Module):
"""Contrastive loss for learning similarity embeddings."""
def __init__(self, margin=0.5):
super().__init__()
self.margin = margin
def forward(self, embedding_a, embedding_b, label):
distance = torch.nn.functional.pairwise_distance(embedding_a, embedding_b)
# label=1: similar, label=0: dissimilar
loss = label * distance.pow(2) + \
(1 - label) * torch.clamp(self.margin - distance, min=0).pow(2)
return loss.mean()
```
## Model Export
After training, export the model to ONNX format for inference:
```bash
python export_onnx.py \
--model models/codebert-binary/best.pt \
--output models/codebert-binary.onnx \
--opset 17
```
### Export Script
```python
# tools/ml/export_onnx.py
import torch
def export_to_onnx(model, output_path):
model.eval()
dummy_input = torch.randint(0, 50000, (1, 512))
dummy_mask = torch.ones(1, 512)
torch.onnx.export(
model,
(dummy_input, dummy_mask),
output_path,
input_names=['input_ids', 'attention_mask'],
output_names=['embedding'],
dynamic_axes={
'input_ids': {0: 'batch', 1: 'seq'},
'attention_mask': {0: 'batch', 1: 'seq'},
'embedding': {0: 'batch'}
},
opset_version=17
)
```
## Deployment
### Configuration
Configure the ML service in your application:
```yaml
# etc/binaryindex.yaml
ml:
enabled: true
model_path: /opt/stellaops/models/codebert-binary.onnx
vocabulary_path: /opt/stellaops/models/vocab.txt
num_threads: 4
batch_size: 16
```
### Code Integration
```csharp
// Register ML services
services.AddMlServices(options =>
{
options.ModelPath = config["ml:model_path"];
options.VocabularyPath = config["ml:vocabulary_path"];
options.NumThreads = config.GetValue<int>("ml:num_threads");
});
// Use embedding service
var embedding = await embeddingService.GenerateEmbeddingAsync(
new EmbeddingInput(decompiledCode, null, null, EmbeddingInputType.DecompiledCode));
// Compare embeddings
var similarity = embeddingService.ComputeSimilarity(embA, embB, SimilarityMetric.Cosine);
```
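For reference, cosine similarity over two embeddings reduces to a dot product once the vectors are L2-normalized (which the model's forward pass guarantees); a minimal sketch:
```csharp
// Cosine similarity; for L2-normalized embeddings this equals the dot product.
static float CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    float dot = 0f, normA = 0f, normB = 0f;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}
```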
### Fallback Mode
When no ONNX model is available, the system generates hash-based pseudo-embeddings:
```csharp
// In OnnxInferenceEngine.cs
if (_session is null)
{
// Fallback: generate hash-based pseudo-embedding for testing
vector = GenerateFallbackEmbedding(text, 768);
}
```
This allows the system to operate without a trained model (useful for testing), but hash-based pseudo-embeddings only reproduce exact-text matches; similarity scores between different inputs carry no semantic meaning.
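One way such a fallback can work (a sketch, not necessarily the shipped `GenerateFallbackEmbedding`): seed a deterministic PRNG from a digest of the input and emit a normalized vector:
```csharp
using System.Security.Cryptography;
using System.Text;
// Deterministic pseudo-embedding: SHA-256 of the text seeds a PRNG, which
// fills and L2-normalizes a fixed-size vector. Identical inputs map to
// identical vectors; similar-but-different inputs do not.
static float[] FallbackEmbedding(string text, int dim = 768)
{
    var seed = BitConverter.ToInt32(SHA256.HashData(Encoding.UTF8.GetBytes(text)), 0);
    var rng = new Random(seed);
    var v = new float[dim];
    var norm = 0f;
    for (var i = 0; i < dim; i++) { v[i] = (float)(rng.NextDouble() * 2 - 1); norm += v[i] * v[i]; }
    norm = MathF.Sqrt(norm);
    for (var i = 0; i < dim; i++) v[i] /= norm;
    return v;
}
```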
## Evaluation
### Metrics
| Metric | Definition | Target |
|--------|------------|--------|
| Accuracy | (TP + TN) / Total | > 90% |
| Precision | TP / (TP + FP) | > 95% |
| Recall | TP / (TP + FN) | > 85% |
| F1 Score | 2 * P * R / (P + R) | > 90% |
| Latency | Per-function embedding time | < 100ms |
### Running Evaluation
```bash
python evaluate.py \
--model models/codebert-binary.onnx \
--test data/test.jsonl \
--output results/evaluation.json
```
### Benchmark Results
From `EnsembleAccuracyBenchmarks`:
| Approach | Accuracy | Precision | Recall | F1 Score | Latency |
|----------|----------|-----------|--------|----------|---------|
| Phase 1 (Hash only) | 70% | 100% | 0% | 0% | 1ms |
| AST only | 75% | 80% | 70% | 74% | 5ms |
| Embedding only | 80% | 85% | 75% | 80% | 50ms |
| Ensemble (Phase 4) | 92% | 95% | 88% | 91% | 80ms |
## Troubleshooting
### Common Issues
**Model not loading:**
- Verify ONNX file path is correct
- Check ONNX Runtime is installed: `dotnet add package Microsoft.ML.OnnxRuntime`
- Ensure model was exported with compatible opset version
**Low accuracy:**
- Verify training data quality and balance
- Check for data leakage between train/test splits
- Adjust contrastive loss margin
**High latency:**
- Reduce max sequence length (default 512)
- Enable batching for bulk operations
- Consider GPU acceleration for high-volume deployments
### Logging
Enable detailed ML logging:
```csharp
services.AddLogging(builder =>
{
builder.AddFilter("StellaOps.BinaryIndex.ML", LogLevel.Debug);
});
```
## References
- [CodeBERT Paper](https://arxiv.org/abs/2002.08155)
- [Binary Code Similarity Detection](https://arxiv.org/abs/2308.01463)
- [ONNX Runtime Documentation](https://onnxruntime.ai/docs/)
- [Contrastive Learning for Code](https://arxiv.org/abs/2103.03143)