docs consolidation, big sln build fixes, new advisories and sprints/tasks

This commit is contained in:
master
2026-01-05 18:37:04 +02:00
parent d0a7b88398
commit d7bdca6d97
175 changed files with 10322 additions and 307 deletions

# Semantic Diffing Architecture
> **Status:** PLANNED
>
> **Version:** 1.0.0
>
> **Related Sprints:**
> - `SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
> - `SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
> - `SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
> - `SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`
---
## 1. Executive Summary
Semantic diffing is a binary analysis capability that matches functions by **behavior** rather than **syntax**. It targets vulnerability detection in cases where byte-level or symbol-based matching fails:
- **Compiler optimizations** - Same source compiled to different instructions
- **Obfuscation** - Code structure altered intentionally
- **Stripped binaries** - No symbols or debug information
- **Cross-compiler builds** - GCC and Clang emit different code for the same source
- **Backported patches** - Same fix applied to a different version
### Expected Impact
| Metric | Current | Target With Semantic Diffing |
|--------|---------|------------------------------|
| Patch detection (optimized) | ~70% | 92%+ |
| Function identification (stripped) | ~50% | 85%+ |
| Obfuscation resilience | ~40% | 75%+ |
| False positive rate | ~5% | <2% |
---
## 2. Architecture Overview
```
┌───────────────────────────────────────────────────────────────────────────────┐
│                         Semantic Diffing Architecture                         │
│                                                                               │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │                              Analysis Layer                               │ │
│ │                                                                           │ │
│ │   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   │ │
│ │   │    B2R2     │   │   Ghidra    │   │ Decompiler  │   │     ML      │   │ │
│ │   │  (Primary)  │   │ (Fallback)  │   │ (Optional)  │   │ (Optional)  │   │ │
│ │   │             │   │             │   │             │   │             │   │ │
│ │   │ - Disasm    │   │ - P-Code    │   │ - C output  │   │ - CodeBERT  │   │ │
│ │   │ - LowUIR    │   │ - BSim      │   │ - AST parse │   │ - GraphSAGE │   │ │
│ │   │ - CFG       │   │ - Ver.Track │   │ - Normalize │   │ - Embedding │   │ │
│ │   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   │ │
│ │          │                 │                 │                 │          │ │
│ └──────────┴─────────────────┴─────────────────┴─────────────────┴──────────┘ │
│                                       │                                       │
│                                       v                                       │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │                             Fingerprint Layer                             │ │
│ │                                                                           │ │
│ │   ┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐   │ │
│ │   │    Instruction    │   │     Semantic      │   │    Decompiled     │   │ │
│ │   │    Fingerprint    │   │    Fingerprint    │   │    Fingerprint    │   │ │
│ │   │                   │   │                   │   │                   │   │ │
│ │   │ - BasicBlock hash │   │ - KSG graph hash  │   │ - AST hash        │   │ │
│ │   │ - CFG edge hash   │   │ - WL hash         │   │ - Normalized code │   │ │
│ │   │ - String refs     │   │ - DataFlow hash   │   │ - API sequence    │   │ │
│ │   │ - Rolling chunks  │   │ - API calls       │   │ - Pattern hash    │   │ │
│ │   └───────────────────┘   └───────────────────┘   └───────────────────┘   │ │
│ │                                                                           │ │
│ │   ┌───────────────────┐   ┌───────────────────┐                           │ │
│ │   │       BSim        │   │   ML Embedding    │                           │ │
│ │   │     Signature     │   │      Vector       │                           │ │
│ │   │                   │   │                   │                           │ │
│ │   │ - Feature vector  │   │ - 768-dim float[] │                           │ │
│ │   │ - Significance    │   │ - Cosine sim      │                           │ │
│ │   └───────────────────┘   └───────────────────┘                           │ │
│ │                                                                           │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
│                                       │                                       │
│                                       v                                       │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │                              Matching Layer                               │ │
│ │                                                                           │ │
│ │  ┌─────────────────────────────────────────────────────────────────────┐  │ │
│ │  │                      Ensemble Decision Engine                       │  │ │
│ │  │                                                                     │  │ │
│ │  │  Signal Weights:                                                    │  │ │
│ │  │  - Instruction fingerprint: 15%                                     │  │ │
│ │  │  - Semantic graph:          25%                                     │  │ │
│ │  │  - Decompiled AST:          35%                                     │  │ │
│ │  │  - ML embedding:            25%                                     │  │ │
│ │  │                                                                     │  │ │
│ │  │  Output: Confidence-weighted similarity score                       │  │ │
│ │  │                                                                     │  │ │
│ │  └─────────────────────────────────────────────────────────────────────┘  │ │
│ │                                                                           │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
│                                       │                                       │
│                                       v                                       │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │                               Storage Layer                               │ │
│ │                                                                           │ │
│ │   PostgreSQL               RustFS                   Valkey                │ │
│ │   - corpus.* tables        - Fingerprint blobs      - Query cache         │ │
│ │   - binaries.* tables      - Model artifacts        - Embedding index     │ │
│ │   - BSim database          - Training data                                │ │
│ │                                                                           │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────────────┘
```
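
The ensemble step in the diagram can be read as a weighted average of per-signal similarities. The sketch below is a minimal illustration, not the planned `EnsembleDecisionEngine`: the `EnsembleSketch` class, the signal keys, and the rule of rescaling weights when an optional signal (decompiler, ML) was not computed are all assumptions. Only the four weights come from the diagram.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; only the four weights are taken from the diagram.
public static class EnsembleSketch
{
    // Signal weights from the Matching Layer box; they sum to 1.00.
    public static readonly IReadOnlyDictionary<string, decimal> Weights =
        new Dictionary<string, decimal>
        {
            ["instruction"] = 0.15m,
            ["semantic"] = 0.25m,
            ["decompiled"] = 0.35m,
            ["ml"] = 0.25m,
        };

    // Combines per-signal similarities (each in [0, 1]) into one score.
    // Signals that were not computed are absent from the input, and the
    // weights of the remaining signals are rescaled to sum to 1 (assumption).
    public static decimal Combine(IReadOnlyDictionary<string, decimal> similarities)
    {
        decimal weighted = 0m, totalWeight = 0m;
        foreach (var (signal, similarity) in similarities)
        {
            if (!Weights.TryGetValue(signal, out var weight))
                throw new ArgumentException($"Unknown signal '{signal}'.");
            if (similarity < 0m || similarity > 1m)
                throw new ArgumentOutOfRangeException(nameof(similarities));
            weighted += weight * similarity;
            totalWeight += weight;
        }
        // No signals at all: report zero similarity rather than dividing by zero.
        return totalWeight == 0m ? 0m : weighted / totalWeight;
    }
}
```

With all four signals at 0.9, 0.8, 0.7 and 0.6 the combined score is 0.73. With only the instruction and semantic signals available, their weights are effectively rescaled from 0.15/0.25 to 0.375/0.625.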
---
## 3. Implementation Phases
### Phase 1: IR-Level Semantic Analysis (Foundation)
**Sprint:** `SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
Use B2R2's intermediate representation (IR) to compare functions at the semantic level.
**Key Components:**
- `IrLiftingService` - Lift instructions to LowUIR
- `SemanticGraphExtractor` - Build Key-Semantics Graph (KSG)
- `WeisfeilerLehmanHasher` - Graph fingerprinting
- `SemanticMatcher` - Semantic similarity scoring
**Deliverables:**
- `StellaOps.BinaryIndex.Semantic` library
- 20 tasks, ~3 weeks
### Phase 2: Function Behavior Corpus (Scale)
**Sprint:** `SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
Build a comprehensive database of known library functions.
**Key Components:**
- Library corpus connectors (glibc, OpenSSL, zlib, curl, SQLite)
- `CorpusIngestionService` - Batch fingerprint generation
- `FunctionClusteringService` - Group similar functions
- `CorpusQueryService` - Function identification
**Deliverables:**
- `StellaOps.BinaryIndex.Corpus` library
- PostgreSQL `corpus.*` schema
- ~30,000 indexed functions
- 22 tasks, ~4 weeks
### Phase 3: Ghidra Integration (Depth)
**Sprint:** `SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
Add Ghidra as a secondary backend for complex cases.
**Key Components:**
- `GhidraHeadlessManager` - Process lifecycle
- `VersionTrackingService` - Multi-correlator diffing
- `GhidriffBridge` - Python interop
- `BSimService` - Behavioral similarity
**Deliverables:**
- `StellaOps.BinaryIndex.Ghidra` library
- Docker image for Ghidra Headless
- 20 tasks, ~4 weeks
### Phase 4: Decompiler & ML (Excellence)
**Sprint:** `SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`
Add the highest-fidelity semantic analysis: decompiled-code comparison and ML embeddings.
**Key Components:**
- `IDecompilerService` - Ghidra decompilation
- `AstComparisonEngine` - Structural similarity
- `OnnxInferenceEngine` - ML embeddings
- `EnsembleDecisionEngine` - Multi-signal fusion
**Deliverables:**
- `StellaOps.BinaryIndex.Decompiler` library
- `StellaOps.BinaryIndex.ML` library
- Trained CodeBERT-Binary model
- 30 tasks, ~5 weeks
---
## 4. Fingerprint Types
### 4.1 Instruction Fingerprint (Existing)
**Algorithm:** BasicBlock hash + CFG edge hash + String refs hash
**Properties:**
- Fast to compute
- Sensitive to instruction changes
- Good for exact/near-exact matches
**Weight in ensemble:** 15%
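
As a rough illustration of how the three components could be folded into one value, the sketch below hashes each basic block, the CFG edges, and the referenced strings, then hashes the combination. The class name, the input shapes, and the use of SHA-256 are assumptions made for the example; the existing fingerprint implementation may differ in all of them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of "BasicBlock hash + CFG edge hash + String refs hash".
public static class InstructionFingerprintSketch
{
    // blocks: raw bytes of each basic block, in address order.
    // edges: directed (from, to) pairs over block indices.
    // strings: string literals referenced by the function.
    public static string Compute(
        IReadOnlyList<byte[]> blocks,
        IEnumerable<(int From, int To)> edges,
        IEnumerable<string> strings)
    {
        // One hash per block, kept in address order.
        var blockHashes = blocks.Select(b => Convert.ToHexString(SHA256.HashData(b)));

        // Edges and strings are sorted so that enumeration order does not matter.
        var edgeText = edges
            .OrderBy(e => e.From).ThenBy(e => e.To)
            .Select(e => e.From + "->" + e.To);
        var stringText = strings.OrderBy(s => s, StringComparer.Ordinal);

        var combined = string.Join("\n",
            "blocks:" + string.Join(",", blockHashes),
            "edges:" + string.Join(",", edgeText),
            "strings:" + string.Join("\u0000", stringText));

        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(combined)));
    }
}
```

Because raw block bytes feed the hash, a single changed instruction byte changes the result, which matches the "sensitive to instruction changes" property listed above.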
### 4.2 Semantic Fingerprint (Phase 1)
**Algorithm:** Key-Semantics Graph + Weisfeiler-Lehman hash
**Properties:**
- Captures data/control dependencies
- Resilient to register renaming
- Resilient to instruction reordering
**Weight in ensemble:** 25%
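
The Weisfeiler-Lehman step can be sketched as repeated label refinement: each node's label is replaced by a hash of its own label plus the sorted labels of its neighbours, and the graph hash is taken over the sorted multiset of final labels. The code below is a generic WL-style hash under assumed inputs (string node labels, directed edges, three rounds); it is not the planned `WeisfeilerLehmanHasher`, and how the Key-Semantics Graph labels its nodes is not specified here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of a Weisfeiler-Lehman style graph hash.
public static class WeisfeilerLehmanSketch
{
    private static string Hash(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));

    // labels[i] is the initial label of node i (for example an operation kind);
    // edges are directed (from, to) pairs over node indices.
    public static string Fingerprint(
        IReadOnlyList<string> labels,
        IReadOnlyList<(int From, int To)> edges,
        int iterations = 3)
    {
        var current = labels.ToArray();
        for (var round = 0; round < iterations; round++)
        {
            var next = new string[current.Length];
            for (var node = 0; node < current.Length; node++)
            {
                // Sort neighbour labels so the result does not depend on edge order.
                var preds = string.Join(",", edges
                    .Where(e => e.To == node)
                    .Select(e => current[e.From])
                    .OrderBy(l => l, StringComparer.Ordinal));
                var succs = string.Join(",", edges
                    .Where(e => e.From == node)
                    .Select(e => current[e.To])
                    .OrderBy(l => l, StringComparer.Ordinal));
                next[node] = Hash(current[node] + "|in:" + preds + "|out:" + succs);
            }
            current = next;
        }

        // Hash the sorted multiset of final labels, which makes the
        // fingerprint independent of how the nodes were numbered.
        return Hash(string.Join(";", current.OrderBy(l => l, StringComparer.Ordinal)));
    }
}
```

Two graphs that differ only in node numbering (for instance after register renaming, if node labels describe operations rather than registers) produce the same fingerprint, while changing one operation label changes it.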
### 4.3 Decompiled Fingerprint (Phase 4)
**Algorithm:** Normalized AST hash + Pattern detection
**Properties:**
- Highest semantic fidelity
- Captures algorithmic structure
- Resilient to most optimizations
**Weight in ensemble:** 35%
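
One way to picture "normalized AST hash" is to erase the names a decompiler invents before hashing the tree. The sketch below assumes a toy `AstNode` type and a single normalization rule (identifiers are renamed `v0`, `v1`, ... in order of first appearance); the real normalization rules and AST model are not described in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical, minimal AST node: a kind ("ident", "binary", "call", ...),
// optional text (identifier name, operator, callee), and children.
public sealed class AstNode
{
    public AstNode(string kind, string text = "", params AstNode[] children)
    {
        Kind = kind;
        Text = text;
        Children = children;
    }

    public string Kind { get; }
    public string Text { get; }
    public IReadOnlyList<AstNode> Children { get; }
}

public static class AstHashSketch
{
    public static string Hash(AstNode root)
    {
        var aliases = new Dictionary<string, string>(StringComparer.Ordinal);
        var sb = new StringBuilder();
        Write(root, aliases, sb);
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(sb.ToString())));
    }

    // Serializes the tree as an S-expression with normalized identifiers.
    private static void Write(AstNode node, Dictionary<string, string> aliases, StringBuilder sb)
    {
        sb.Append('(').Append(node.Kind);
        if (node.Kind == "ident")
        {
            // Decompiler-chosen names carry no meaning, so each distinct name
            // is replaced by a placeholder numbered by first appearance.
            if (!aliases.TryGetValue(node.Text, out var alias))
            {
                alias = "v" + aliases.Count;
                aliases[node.Text] = alias;
            }
            sb.Append(' ').Append(alias);
        }
        else if (node.Text.Length > 0)
        {
            // Operators and callee names are kept as-is.
            sb.Append(' ').Append(node.Text);
        }
        foreach (var child in node.Children)
            Write(child, aliases, sb);
        sb.Append(')');
    }
}
```

Under this rule `x + y * x` and `local_10 + iVar3 * local_10` hash identically, while `a + b * b` does not, because the pattern of identifier reuse differs.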
### 4.4 ML Embedding (Phase 4)
**Algorithm:** CodeBERT-Binary transformer, 768-dim vectors
**Properties:**
- Learned similarity metric
- Captures latent patterns
- Resilient to obfuscation
**Weight in ensemble:** 25%
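
Embedding vectors are compared by cosine similarity, as noted in the architecture diagram. A minimal sketch follows; the class name and the choice to return 0 for a zero vector are assumptions made for the example.

```csharp
using System;

// Hypothetical helper for comparing two embedding vectors of equal length
// (768 floats in this design).
public static class EmbeddingSimilaritySketch
{
    public static double Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embedding dimensions differ.");

        // Accumulate in double to limit rounding error over many terms.
        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // Cosine is undefined for a zero vector; treat it as "no similarity".
        if (normA == 0 || normB == 0)
            return 0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

The result lies in [-1, 1]; vectors pointing in the same direction score 1 regardless of magnitude, and orthogonal vectors score 0.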
---
## 5. Matching Pipeline
```mermaid
sequenceDiagram
    participant Client
    participant DiffEngine as PatchDiffEngine
    participant B2R2
    participant Ghidra
    participant Corpus
    participant Ensemble
    Client->>DiffEngine: Compare(oldBinary, newBinary)
    par Parallel Analysis
        DiffEngine->>B2R2: Disassemble + IR lift
        DiffEngine->>Ghidra: Decompile (if needed)
    end
    B2R2-->>DiffEngine: SemanticFingerprints[]
    Ghidra-->>DiffEngine: DecompiledFunctions[]
    DiffEngine->>Corpus: IdentifyFunctions(fingerprints)
    Corpus-->>DiffEngine: FunctionMatches[]
    DiffEngine->>Ensemble: ComputeSimilarity(old, new)
    Ensemble-->>DiffEngine: EnsembleResult
    DiffEngine-->>Client: PatchDiffResult
```
---
## 6. Fallback Strategy
The system uses a tiered fallback strategy:
```
Tier 1: B2R2 IR + Semantic Graph (fast, ~90% coverage)
  │  If confidence < threshold OR architecture unsupported
  v
Tier 2: Ghidra Version Tracking (slower, ~95% coverage)
  │  If function is high-value (CVE-relevant)
  v
Tier 3: Decompiled AST + ML Embedding (slowest, ~99% coverage)
```
**Selection Criteria:**
| Condition | Backend | Reason |
|-----------|---------|--------|
| Standard x64/ARM64 binary | B2R2 only | Fast, accurate |
| Low B2R2 confidence (<0.7) | B2R2 + Ghidra | Validation |
| Exotic architecture | Ghidra only | Better coverage |
| CVE-affected function | Full pipeline | Maximum accuracy |
| Obfuscated binary | ML embedding | Obfuscation resilience |
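
The selection table can be expressed as a small decision function. The sketch below is one possible reading of it: the `Backends` enum, the `AnalysisContext` record, and the precedence between rows (CVE-affected first, then architecture, then confidence and obfuscation as additive flags) are assumptions, since the table does not say how overlapping conditions combine. The 0.7 threshold is the one given in the table.

```csharp
using System;

[Flags]
public enum Backends
{
    None = 0,
    B2R2 = 1,
    Ghidra = 2,
    Decompiler = 4,
    MlEmbedding = 8,
    FullPipeline = B2R2 | Ghidra | Decompiler | MlEmbedding,
}

// Hypothetical input describing what is known about the function under analysis.
// B2R2Confidence is null when B2R2 has not produced a result.
public sealed record AnalysisContext(
    bool ArchitectureSupportedByB2R2,
    decimal? B2R2Confidence,
    bool IsCveAffected,
    bool LooksObfuscated);

public static class BackendSelectorSketch
{
    public const decimal MinB2R2Confidence = 0.7m;

    public static Backends Select(AnalysisContext ctx)
    {
        // CVE-affected functions always get the full pipeline.
        if (ctx.IsCveAffected)
            return Backends.FullPipeline;

        // Exotic architectures go to Ghidra only; everything else starts with B2R2.
        var selected = ctx.ArchitectureSupportedByB2R2 ? Backends.B2R2 : Backends.Ghidra;

        // Low B2R2 confidence adds Ghidra as a validation pass.
        if (ctx.ArchitectureSupportedByB2R2
            && ctx.B2R2Confidence is decimal confidence
            && confidence < MinB2R2Confidence)
        {
            selected |= Backends.Ghidra;
        }

        // Obfuscated input additionally requests the ML embedding (assumption:
        // it supplements rather than replaces the other backends).
        if (ctx.LooksObfuscated)
            selected |= Backends.MlEmbedding;

        return selected;
    }
}
```

For example, a standard x64 function with B2R2 confidence 0.5 selects `B2R2 | Ghidra`, while a confidence of exactly 0.7 stays on B2R2 alone because the table's condition is strictly "<0.7".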
---
## 7. Corpus Coverage
### Priority Libraries
| Library | Priority | Functions | CVEs |
|---------|----------|-----------|------|
| glibc | Critical | ~15,000 | 50+ |
| OpenSSL | Critical | ~8,000 | 100+ |
| zlib | High | ~200 | 5+ |
| libcurl | High | ~2,000 | 80+ |
| SQLite | High | ~1,500 | 30+ |
| libxml2 | Medium | ~1,200 | 40+ |
| libpng | Medium | ~300 | 10+ |
| expat | Medium | ~150 | 15+ |
### Architecture Coverage
| Architecture | B2R2 | Ghidra | Status |
|--------------|------|--------|--------|
| x86_64 | Excellent | Excellent | Primary |
| ARM64 | Excellent | Excellent | Primary |
| ARM32 | Good | Excellent | Secondary |
| MIPS32 | Fair | Excellent | Fallback |
| MIPS64 | Fair | Excellent | Fallback |
| RISC-V | Good | Good | Emerging |
| PPC32/64 | Fair | Excellent | Fallback |
---
## 8. Performance Characteristics
### Latency Budget
| Operation | Target | Notes |
|-----------|--------|-------|
| B2R2 disassembly | <100ms | Per function |
| IR lifting | <50ms | Per function |
| Semantic fingerprint | <50ms | Per function |
| Ghidra analysis | <30s | Per binary (startup) |
| Decompilation | <500ms | Per function |
| ML inference | <100ms | Per function |
| Ensemble decision | <10ms | Per comparison |
| **Total (Tier 1)** | **<200ms** | Per function (disassembly + IR lifting + semantic fingerprint) |
| **Total (Full)** | **<1s** | Per function, excluding the per-binary Ghidra startup |
### Memory Budget
| Component | Memory | Notes |
|-----------|--------|-------|
| B2R2 per binary | ~100MB | Scales with binary size |
| Ghidra per project | ~2GB | Persistent cache |
| ML model | ~500MB | ONNX loaded |
| Corpus query cache | ~100MB | LRU eviction |
---
## 9. Integration Points
### 9.1 Scanner Integration
```csharp
// Scanner.Worker uses semantic diffing for binary vulnerability detection
var result = await _binaryVulnerabilityService.LookupByFingerprintAsync(
    fingerprint,
    minSimilarity: 0.85m,
    useSemanticMatching: true, // Enable semantic diffing
    ct);
```
### 9.2 PatchDiffEngine Enhancement
```csharp
// PatchDiffEngine now includes semantic comparison
var diff = await _patchDiffEngine.DiffAsync(
    vulnerableBinary,
    patchedBinary,
    new PatchDiffOptions
    {
        UseSemanticAnalysis = true,
        SemanticThreshold = 0.7m,
        IncludeDecompilation = true,
        IncludeMlEmbedding = true
    },
    ct);
```
### 9.3 DeltaSignature Enhancement
```csharp
// Delta signatures now include semantic fingerprints
var signature = await _deltaSignatureGenerator.GenerateSignaturesAsync(
    binaryStream,
    new DeltaSignatureRequest
    {
        Cve = "CVE-2024-1234",
        TargetSymbols = ["vulnerable_func"],
        IncludeSemanticFingerprint = true,
        IncludeDecompiledHash = true
    },
    ct);
```
---
## 10. Security Considerations
### 10.1 Sandbox Requirements
All binary analysis runs in sandboxed environments:
- Seccomp profile restricting syscalls
- Read-only root filesystem
- No network access during analysis
- Memory/CPU limits
### 10.2 Model Security
ML models are:
- Signed with DSSE attestations
- Verified before loading
- Not user-uploadable (pre-trained only)
### 10.3 Corpus Integrity
Corpus data is:
- Ingested from trusted sources only
- Signed at snapshot level
- Version-controlled with audit trail
---
## 11. Configuration
```yaml
# binaryindex.yaml - Semantic diffing configuration
binaryindex:
  semantic_diffing:
    enabled: true
    # Analysis backends
    backends:
      b2r2:
        enabled: true
        ir_lifting: true
        semantic_graph: true
      ghidra:
        enabled: true
        fallback_only: true
        min_b2r2_confidence: 0.7
        headless_timeout_ms: 30000
      decompiler:
        enabled: true
        high_value_only: true # Only for CVE-affected functions
      ml:
        enabled: true
        model_path: /models/codebert_binary_v1.onnx
        embedding_dimension: 768
    # Ensemble weights
    ensemble:
      instruction_weight: 0.15
      semantic_weight: 0.25
      decompiled_weight: 0.35
      ml_weight: 0.25
      min_confidence: 0.6
    # Corpus
    corpus:
      auto_update: true
      update_interval_hours: 24
      libraries:
        - glibc
        - openssl
        - zlib
        - curl
        - sqlite
    # Performance
    performance:
      max_parallel_analyses: 4
      cache_ttl_seconds: 3600
      max_function_size_bytes: 1048576 # 1MB
```
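
Because the ensemble weights are meant to be read as percentages of one score, it may be worth validating them at startup. The sketch below assumes a hypothetical `EnsembleOptions` class mirroring the `ensemble` keys, and it assumes that the weights must sum to exactly 1; neither the class nor that rule is stated by the configuration itself. How the YAML is bound to the object is left out.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical options type mirroring the `ensemble` section; defaults are
// the values shown in the configuration above.
public sealed class EnsembleOptions
{
    public decimal InstructionWeight { get; init; } = 0.15m;
    public decimal SemanticWeight { get; init; } = 0.25m;
    public decimal DecompiledWeight { get; init; } = 0.35m;
    public decimal MlWeight { get; init; } = 0.25m;
    public decimal MinConfidence { get; init; } = 0.6m;

    // Returns a list of problems; an empty list means the options are usable.
    public IReadOnlyList<string> Validate()
    {
        var errors = new List<string>();
        var weights = new[] { InstructionWeight, SemanticWeight, DecompiledWeight, MlWeight };

        if (weights.Any(w => w < 0m || w > 1m))
            errors.Add("Each ensemble weight must be between 0 and 1.");

        // decimal arithmetic is exact for these values, so no tolerance is needed.
        var sum = weights.Sum();
        if (sum != 1m)
            errors.Add($"Ensemble weights must sum to 1, but sum to {sum}.");

        if (MinConfidence < 0m || MinConfidence > 1m)
            errors.Add("min_confidence must be between 0 and 1.");

        return errors;
    }
}
```

Using `decimal` rather than `double` keeps sums such as 0.15 + 0.25 + 0.35 + 0.25 exact, so the equality check does not need an epsilon.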
---
## 12. Metrics & Observability
### Metrics
| Metric | Type | Labels |
|--------|------|--------|
| `semantic_diffing_analysis_total` | Counter | backend, result |
| `semantic_diffing_latency_ms` | Histogram | backend, tier |
| `semantic_diffing_accuracy` | Gauge | comparison_type |
| `corpus_functions_total` | Gauge | library |
| `ml_inference_latency_ms` | Histogram | model |
| `ensemble_signal_weight` | Gauge | signal_type |
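
The counter and histogram rows above map naturally onto `System.Diagnostics.Metrics`. The following sketch shows two of them; the meter name, the wrapper class, and the `RecordAnalysis` method are assumptions for illustration, while the instrument names and labels are taken from the table.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical wrapper around two of the instruments listed in the table.
public sealed class SemanticDiffingMetrics : IDisposable
{
    private readonly Meter _meter;
    private readonly Counter<long> _analysisTotal;
    private readonly Histogram<double> _latencyMs;

    // The default meter name is an assumption, not an established convention.
    public SemanticDiffingMetrics(string meterName = "StellaOps.BinaryIndex.SemanticDiffing")
    {
        _meter = new Meter(meterName);
        _analysisTotal = _meter.CreateCounter<long>("semantic_diffing_analysis_total");
        _latencyMs = _meter.CreateHistogram<double>("semantic_diffing_latency_ms", unit: "ms");
    }

    // Records one finished analysis with the labels from the metrics table.
    public void RecordAnalysis(string backend, string tier, string result, double elapsedMs)
    {
        _analysisTotal.Add(1,
            new KeyValuePair<string, object?>("backend", backend),
            new KeyValuePair<string, object?>("result", result));
        _latencyMs.Record(elapsedMs,
            new KeyValuePair<string, object?>("backend", backend),
            new KeyValuePair<string, object?>("tier", tier));
    }

    public void Dispose() => _meter.Dispose();
}
```

A call such as `metrics.RecordAnalysis("b2r2", "tier1", "match", 42.0)` increments the counter and adds one latency sample; an exporter or `MeterListener` subscribed to the meter receives both measurements.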
### Traces
- `semantic_diffing.analyze` - Full analysis span
- `semantic_diffing.b2r2.lift` - IR lifting
- `semantic_diffing.ghidra.decompile` - Decompilation
- `semantic_diffing.ml.inference` - ML embedding
- `semantic_diffing.ensemble.decide` - Ensemble decision
---
## 13. Testing Strategy
### Unit Tests
| Test Suite | Coverage |
|------------|----------|
| `IrLiftingServiceTests` | IR lifting correctness |
| `SemanticGraphExtractorTests` | Graph construction |
| `WeisfeilerLehmanHasherTests` | Hash stability |
| `AstComparisonEngineTests` | AST similarity |
| `OnnxInferenceEngineTests` | ML inference |
| `EnsembleDecisionEngineTests` | Weight combination |
### Integration Tests
| Test Suite | Coverage |
|------------|----------|
| `EndToEndSemanticDiffTests` | Full pipeline |
| `OptimizationResilienceTests` | O0 vs O2 vs O3 |
| `CompilerVariantTests` | GCC vs Clang |
| `GhidraFallbackTests` | Fallback scenarios |
### Golden Corpus Tests
Pre-computed test cases with known results:
- 100 CVE patch pairs (vulnerable -> fixed)
- 50 optimization variant sets
- 25 compiler variant sets
- 25 obfuscation variant sets
---
## 14. Roadmap
| Phase | Status | ETA | Impact |
|-------|--------|-----|--------|
| Phase 1: IR Semantics | Planned | 2026-01-24 | +15% accuracy |
| Phase 2: Corpus | Planned | 2026-02-15 | +10% coverage |
| Phase 3: Ghidra | Planned | 2026-02-28 | +5% edge cases |
| Phase 4: Decompiler/ML | Planned | 2026-03-31 | +10% obfuscation |
| **Total** | | | **+35-40%** |
---
## 15. References
### Internal
- `docs/modules/binary-index/architecture.md`
- `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.DeltaSig/`
- `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Fingerprints/`
### External
- [B2R2 Binary Analysis Framework](https://b2r2.org/)
- [Ghidra Patch Diffing Guide](https://cve-north-stars.github.io/docs/Ghidra-Patch-Diffing)
- [ghidriff Tool](https://github.com/clearbluejar/ghidriff)
- [SemDiff Paper (arXiv)](https://arxiv.org/abs/2308.01463)
- [SEI Semantic Equivalence Research](https://www.sei.cmu.edu/annual-reviews/2022-research-review/semantic-equivalence-checking-of-decompiled-binaries/)
---
*Document Version: 1.0.0*
*Last Updated: 2026-01-05*