docs consolidation, big sln build fixes, new advisories and sprints/tasks

2026-01-05 18:37:04 +02:00
parent d0a7b88398
commit d7bdca6d97
175 changed files with 10322 additions and 307 deletions
--- a/docs/modules/binary-index/semantic-diffing.md
+++ b/docs/modules/binary-index/semantic-diffing.md
@@ -0,0 +1,564 @@
+# Semantic Diffing Architecture
+
+> **Status:** PLANNED
+> **Version:** 1.0.0
+> **Related Sprints:**
+> - `SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
+> - `SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
+> - `SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
+> - `SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`
+
+---
+
+## 1. Executive Summary
+
+Semantic diffing is an advanced binary analysis capability that detects function equivalence based on **behavior** rather than **syntax**. This enables accurate vulnerability detection in scenarios where traditional byte-level or symbol-based matching fails:
+
+- **Compiler optimizations** - Same source, different instructions
+- **Obfuscation** - Intentionally altered code structure
+- **Stripped binaries** - No symbols or debug information
+- **Cross-compiler** - GCC and Clang produce different output for the same source
+- **Backported patches** - Different version, same fix
+
+### Expected Impact
+
+| Metric | Current | Target with Semantic Diffing |
+|--------|---------|------------------------------|
+| Patch detection (optimized) | ~70% | 92%+ |
+| Function identification (stripped) | ~50% | 85%+ |
+| Obfuscation resilience | ~40% | 75%+ |
+| False positive rate | ~5% | <2% |
+
+---
+
+## 2. Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────────────────────┐
+│                        Semantic Diffing Architecture                             │
+│                                                                                  │
+│  ┌─────────────────────────────────────────────────────────────────────────────┐│
+│  │                         Analysis Layer                                       ││
+│  │                                                                              ││
+│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        ││
+│  │  │   B2R2      │  │   Ghidra    │  │ Decompiler  │  │     ML      │        ││
+│  │  │  (Primary)  │  │ (Fallback)  │  │  (Optional) │  │ (Optional)  │        ││
+│  │  │             │  │             │  │             │  │             │        ││
+│  │  │ - Disasm    │  │ - P-Code    │  │ - C output  │  │ - CodeBERT  │        ││
+│  │  │ - LowUIR    │  │ - BSim      │  │ - AST parse │  │ - GraphSAGE │        ││
+│  │  │ - CFG       │  │ - Ver.Track │  │ - Normalize │  │ - Embedding │        ││
+│  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        ││
+│  │         │                │                │                │               ││
+│  └─────────┴────────────────┴────────────────┴────────────────┴───────────────┘│
+│                                      │                                          │
+│                                      v                                          │
+│  ┌─────────────────────────────────────────────────────────────────────────────┐│
+│  │                       Fingerprint Layer                                      ││
+│  │                                                                              ││
+│  │  ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐       ││
+│  │  │   Instruction     │  │    Semantic       │  │   Decompiled      │       ││
+│  │  │   Fingerprint     │  │    Fingerprint    │  │   Fingerprint     │       ││
+│  │  │                   │  │                   │  │                   │       ││
+│  │  │ - BasicBlock hash │  │ - KSG hash        │  │ - AST hash        │       ││
+│  │  │ - CFG edge hash   │  │ - WL hash         │  │ - Normalized code │       ││
+│  │  │ - String refs     │  │ - DataFlow hash   │  │ - API sequence    │       ││
+│  │  │ - Rolling chunks  │  │ - API calls       │  │ - Pattern hash    │       ││
+│  │  └───────────────────┘  └───────────────────┘  └───────────────────┘       ││
+│  │                                                                              ││
+│  │  ┌───────────────────┐  ┌───────────────────┐                               ││
+│  │  │      BSim         │  │   ML Embedding    │                               ││
+│  │  │   Signature       │  │     Vector        │                               ││
+│  │  │                   │  │                   │                               ││
+│  │  │ - Feature vector  │  │ - 768-dim float[] │                               ││
+│  │  │ - Significance    │  │ - Cosine sim      │                               ││
+│  │  └───────────────────┘  └───────────────────┘                               ││
+│  │                                                                              ││
+│  └─────────────────────────────────────────────────────────────────────────────┘│
+│                                      │                                          │
+│                                      v                                          │
+│  ┌─────────────────────────────────────────────────────────────────────────────┐│
+│  │                       Matching Layer                                         ││
+│  │                                                                              ││
+│  │  ┌───────────────────────────────────────────────────────────────────────┐  ││
+│  │  │                    Ensemble Decision Engine                            │  ││
+│  │  │                                                                        │  ││
+│  │  │  Signal Weights:                                                       │  ││
+│  │  │  - Instruction fingerprint:  15%                                       │  ││
+│  │  │  - Semantic graph:           25%                                       │  ││
+│  │  │  - Decompiled AST:           35%                                       │  ││
+│  │  │  - ML embedding:             25%                                       │  ││
+│  │  │                                                                        │  ││
+│  │  │  Output: Confidence-weighted similarity score                          │  ││
+│  │  │                                                                        │  ││
+│  │  └───────────────────────────────────────────────────────────────────────┘  ││
+│  │                                                                              ││
+│  └─────────────────────────────────────────────────────────────────────────────┘│
+│                                      │                                          │
+│                                      v                                          │
+│  ┌─────────────────────────────────────────────────────────────────────────────┐│
+│  │                       Storage Layer                                          ││
+│  │                                                                              ││
+│  │  PostgreSQL                RustFS                 Valkey                    ││
+│  │  - corpus.* tables         - Fingerprint blobs    - Query cache             ││
+│  │  - binaries.* tables       - Model artifacts      - Embedding index         ││
+│  │  - BSim database           - Training data                                  ││
+│  │                                                                              ││
+│  └─────────────────────────────────────────────────────────────────────────────┘│
+└─────────────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 3. Implementation Phases
+
+### Phase 1: IR-Level Semantic Analysis (Foundation)
+
+**Sprint:** `SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
+
+Leverage B2R2's Intermediate Representation (IR) for semantic-level function comparison.
+
+**Key Components:**
+- `IrLiftingService` - Lift instructions to LowUIR
+- `SemanticGraphExtractor` - Build Key-Semantics Graph (KSG)
+- `WeisfeilerLehmanHasher` - Graph fingerprinting
+- `SemanticMatcher` - Semantic similarity scoring
+
+**Deliverables:**
+- `StellaOps.BinaryIndex.Semantic` library
+- 20 tasks, ~3 weeks
+
+### Phase 2: Function Behavior Corpus (Scale)
+
+**Sprint:** `SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
+
+Build a comprehensive database of known library functions.
+
+**Key Components:**
+- Library corpus connectors (glibc, OpenSSL, zlib, curl, SQLite)
+- `CorpusIngestionService` - Batch fingerprint generation
+- `FunctionClusteringService` - Group similar functions
+- `CorpusQueryService` - Function identification
+
+**Deliverables:**
+- `StellaOps.BinaryIndex.Corpus` library
+- PostgreSQL `corpus.*` schema
+- ~30,000 indexed functions
+- 22 tasks, ~4 weeks
+
+### Phase 3: Ghidra Integration (Depth)
+
+**Sprint:** `SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
+
+Add Ghidra as a secondary backend for complex cases.
+
+**Key Components:**
+- `GhidraHeadlessManager` - Process lifecycle
+- `VersionTrackingService` - Multi-correlator diffing
+- `GhidriffBridge` - Python interop
+- `BSimService` - Behavioral similarity
+
+**Deliverables:**
+- `StellaOps.BinaryIndex.Ghidra` library
+- Docker image for Ghidra Headless
+- 20 tasks, ~4 weeks
+
+### Phase 4: Decompiler & ML (Excellence)
+
+**Sprint:** `SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`
+
+Add decompiler-based and ML-based comparison for the highest-fidelity semantic analysis.
+
+**Key Components:**
+- `IDecompilerService` - Ghidra decompilation
+- `AstComparisonEngine` - Structural similarity
+- `OnnxInferenceEngine` - ML embeddings
+- `EnsembleDecisionEngine` - Multi-signal fusion
+
+**Deliverables:**
+- `StellaOps.BinaryIndex.Decompiler` library
+- `StellaOps.BinaryIndex.ML` library
+- Trained CodeBERT-Binary model
+- 30 tasks, ~5 weeks
+
+---
+
+## 4. Fingerprint Types
+
+### 4.1 Instruction Fingerprint (Existing)
+
+**Algorithm:** BasicBlock hash + CFG edge hash + String refs hash
+
+**Properties:**
+- Fast to compute
+- Sensitive to instruction changes
+- Good for exact/near-exact matches
+
+**Weight in ensemble:** 15%
+
+### 4.2 Semantic Fingerprint (Phase 1)
+
+**Algorithm:** Key-Semantics Graph + Weisfeiler-Lehman hash
+
+**Properties:**
+- Captures data/control dependencies
+- Resilient to register renaming
+- Resilient to instruction reordering
+
+**Weight in ensemble:** 25%
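+
+The Weisfeiler-Lehman step can be illustrated with a minimal sketch. It assumes the Key-Semantics Graph has already been reduced to labelled nodes and directed edges; the `WlSketch` class, the node labels, the three refinement rounds, and the use of SHA-256 are illustrative assumptions, not the planned `WeisfeilerLehmanHasher` API.
+
+```csharp
+using System;
+using System.Linq;
+using System.Security.Cryptography;
+using System.Text;
+
+// Two tiny graphs with the same shape but different node numbering.
+var labelsA = new[] { "load", "add", "store" };
+var edgesA = new[] { (0, 1), (1, 2) };
+var labelsB = new[] { "store", "load", "add" };
+var edgesB = new[] { (1, 2), (2, 0) };
+
+Console.WriteLine(WlSketch.Hash(labelsA, edgesA) == WlSketch.Hash(labelsB, edgesB)); // True
+
+// Hypothetical helper for illustration only.
+static class WlSketch
+{
+    public static string Hash(string[] labels, (int From, int To)[] edges, int iterations = 3)
+    {
+        var current = (string[])labels.Clone();
+        for (var round = 0; round < iterations; round++)
+        {
+            var next = new string[current.Length];
+            for (var node = 0; node < current.Length; node++)
+            {
+                // Collect successor labels and sort them so edge order does not matter.
+                var neighbours = edges.Where(e => e.From == node)
+                                      .Select(e => current[e.To])
+                                      .OrderBy(l => l, StringComparer.Ordinal);
+                next[node] = Digest(current[node] + "|" + string.Join(",", neighbours));
+            }
+            current = next;
+        }
+        // The graph hash is the digest of the sorted multiset of final node labels.
+        return Digest(string.Join(";", current.OrderBy(l => l, StringComparer.Ordinal)));
+    }
+
+    private static string Digest(string text) =>
+        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
+}
+```
+
+Because each round sorts neighbour labels and the final step sorts node labels, the result does not depend on node numbering or edge order, which is the kind of invariance the properties above rely on.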
+
+### 4.3 Decompiled Fingerprint (Phase 4)
+
+**Algorithm:** Normalized AST hash + Pattern detection
+
+**Properties:**
+- Highest semantic fidelity
+- Captures algorithmic structure
+- Resilient to most optimizations
+
+**Weight in ensemble:** 35%
+
+### 4.4 ML Embedding (Phase 4)
+
+**Algorithm:** CodeBERT-Binary transformer, 768-dim vectors
+
+**Properties:**
+- Learned similarity metric
+- Captures latent patterns
+- Resilient to obfuscation
+
+**Weight in ensemble:** 25%
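+
+The architecture diagram lists cosine similarity as the comparison for embedding vectors. A minimal sketch of that comparison follows; `EmbeddingMath` is a hypothetical helper, and the 3-element vectors stand in for the 768-dimensional embeddings.
+
+```csharp
+using System;
+
+var a = new float[] { 1f, 0f, 0f };
+var b = new float[] { 0f, 1f, 0f };
+Console.WriteLine(EmbeddingMath.Cosine(a, a)); // 1
+Console.WriteLine(EmbeddingMath.Cosine(a, b)); // 0
+
+// Hypothetical helper for illustration only.
+static class EmbeddingMath
+{
+    public static double Cosine(ReadOnlySpan<float> x, ReadOnlySpan<float> y)
+    {
+        if (x.Length != y.Length)
+            throw new ArgumentException("Vectors must have the same dimension.");
+
+        double dot = 0, normX = 0, normY = 0;
+        for (var i = 0; i < x.Length; i++)
+        {
+            dot += x[i] * y[i];
+            normX += x[i] * x[i];
+            normY += y[i] * y[i];
+        }
+
+        // Treat a zero vector as dissimilar to everything rather than dividing by zero.
+        return normX == 0 || normY == 0 ? 0 : dot / Math.Sqrt(normX * normY);
+    }
+}
+```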
+
+---
+
+## 5. Matching Pipeline
+
+```mermaid
+sequenceDiagram
+    participant Client
+    participant DiffEngine as PatchDiffEngine
+    participant B2R2
+    participant Ghidra
+    participant Corpus
+    participant Ensemble
+
+    Client->>DiffEngine: Compare(oldBinary, newBinary)
+
+    par Parallel Analysis
+        DiffEngine->>B2R2: Disassemble + IR lift
+        DiffEngine->>Ghidra: Decompile (if needed)
+    end
+
+    B2R2-->>DiffEngine: SemanticFingerprints[]
+    Ghidra-->>DiffEngine: DecompiledFunctions[]
+
+    DiffEngine->>Corpus: IdentifyFunctions(fingerprints)
+    Corpus-->>DiffEngine: FunctionMatches[]
+
+    DiffEngine->>Ensemble: ComputeSimilarity(old, new)
+    Ensemble-->>DiffEngine: EnsembleResult
+
+    DiffEngine-->>Client: PatchDiffResult
+```
+
+---
+
+## 6. Fallback Strategy
+
+The system uses a tiered fallback strategy:
+
+```
+Tier 1: B2R2 IR + Semantic Graph (fast, ~90% coverage)
+   │
+   │ If confidence < threshold OR architecture unsupported
+   v
+Tier 2: Ghidra Version Tracking (slower, ~95% coverage)
+   │
+   │ If function is high-value (CVE-relevant)
+   v
+Tier 3: Decompiled AST + ML Embedding (slowest, ~99% coverage)
+```
+
+**Selection Criteria:**
+
+| Condition | Backend | Reason |
+|-----------|---------|--------|
+| Standard x64/ARM64 binary | B2R2 only | Fast, accurate |
+| Low B2R2 confidence (<0.7) | B2R2 + Ghidra | Validation |
+| Exotic architecture | Ghidra only | Better coverage |
+| CVE-affected function | Full pipeline | Maximum accuracy |
+| Obfuscated binary | ML embedding | Obfuscation resilience |
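+
+One possible reading of the selection table, written as code. The `Backends` enum, the `FunctionContext` record, and the precedence between rows (CVE relevance first, then architecture support, then confidence and obfuscation) are assumptions made for illustration; the table itself does not define an order.
+
+```csharp
+using System;
+
+Console.WriteLine(BackendSelector.Select(new FunctionContext(true, 0.92, false, false)));  // B2R2
+Console.WriteLine(BackendSelector.Select(new FunctionContext(true, 0.55, false, false)));  // B2R2, Ghidra
+Console.WriteLine(BackendSelector.Select(new FunctionContext(false, 0.0, false, false)));  // Ghidra
+Console.WriteLine(BackendSelector.Select(new FunctionContext(true, 0.92, true, false)));   // B2R2, Ghidra, Decompiler, Ml
+
+[Flags]
+enum Backends { None = 0, B2R2 = 1, Ghidra = 2, Decompiler = 4, Ml = 8 }
+
+// Hypothetical input describing one function under analysis.
+record FunctionContext(bool B2R2SupportsArch, double B2R2Confidence, bool IsCveAffected, bool LooksObfuscated);
+
+static class BackendSelector
+{
+    public static Backends Select(FunctionContext ctx, double minConfidence = 0.7)
+    {
+        // Exotic architectures skip B2R2 and start from Ghidra.
+        var selected = ctx.B2R2SupportsArch ? Backends.B2R2 : Backends.Ghidra;
+
+        // CVE-affected functions get the full pipeline.
+        if (ctx.IsCveAffected)
+            return selected | Backends.Ghidra | Backends.Decompiler | Backends.Ml;
+
+        // Low B2R2 confidence adds Ghidra as a validation pass.
+        if (ctx.B2R2SupportsArch && ctx.B2R2Confidence < minConfidence)
+            selected |= Backends.Ghidra;
+
+        // Suspected obfuscation adds the ML embedding signal.
+        if (ctx.LooksObfuscated)
+            selected |= Backends.Ml;
+
+        return selected;
+    }
+}
+```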
+
+---
+
+## 7. Corpus Coverage
+
+### Priority Libraries
+
+| Library | Priority | Functions | CVEs |
+|---------|----------|-----------|------|
+| glibc | Critical | ~15,000 | 50+ |
+| OpenSSL | Critical | ~8,000 | 100+ |
+| zlib | High | ~200 | 5+ |
+| libcurl | High | ~2,000 | 80+ |
+| SQLite | High | ~1,500 | 30+ |
+| libxml2 | Medium | ~1,200 | 40+ |
+| libpng | Medium | ~300 | 10+ |
+| expat | Medium | ~150 | 15+ |
+
+### Architecture Coverage
+
+| Architecture | B2R2 | Ghidra | Status |
+|--------------|------|--------|--------|
+| x86_64 | Excellent | Excellent | Primary |
+| ARM64 | Excellent | Excellent | Primary |
+| ARM32 | Good | Excellent | Secondary |
+| MIPS32 | Fair | Excellent | Fallback |
+| MIPS64 | Fair | Excellent | Fallback |
+| RISC-V | Good | Good | Emerging |
+| PPC32/64 | Fair | Excellent | Fallback |
+
+---
+
+## 8. Performance Characteristics
+
+### Latency Budget
+
+| Operation | Target | Notes |
+|-----------|--------|-------|
+| B2R2 disassembly | <100ms | Per function |
+| IR lifting | <50ms | Per function |
+| Semantic fingerprint | <50ms | Per function |
+| Ghidra analysis | <30s | Per binary (startup) |
+| Decompilation | <500ms | Per function |
+| ML inference | <100ms | Per function |
+| Ensemble decision | <10ms | Per comparison |
+| **Total (Tier 1)** | **<200ms** | Per function |
+| **Total (Full)** | **<1s** | Per function |
+
+### Memory Budget
+
+| Component | Memory | Notes |
+|-----------|--------|-------|
+| B2R2 per binary | ~100MB | Scales with binary size |
+| Ghidra per project | ~2GB | Persistent cache |
+| ML model | ~500MB | ONNX loaded |
+| Corpus query cache | ~100MB | LRU eviction |
+
+---
+
+## 9. Integration Points
+
+### 9.1 Scanner Integration
+
+```csharp
+// Scanner.Worker uses semantic diffing for binary vulnerability detection
+var result = await _binaryVulnerabilityService.LookupByFingerprintAsync(
+    fingerprint,
+    minSimilarity: 0.85m,
+    useSemanticMatching: true,  // Enable semantic diffing
+    ct);
+```
+
+### 9.2 PatchDiffEngine Enhancement
+
+```csharp
+// Planned: PatchDiffEngine gains opt-in semantic comparison
+var diff = await _patchDiffEngine.DiffAsync(
+    vulnerableBinary,
+    patchedBinary,
+    new PatchDiffOptions
+    {
+        UseSemanticAnalysis = true,
+        SemanticThreshold = 0.7m,
+        IncludeDecompilation = true,
+        IncludeMlEmbedding = true
+    },
+    ct);
+```
+
+### 9.3 DeltaSignature Enhancement
+
+```csharp
+// Planned: delta signatures also carry semantic fingerprints
+var signatures = await _deltaSignatureGenerator.GenerateSignaturesAsync(
+    binaryStream,
+    new DeltaSignatureRequest
+    {
+        Cve = "CVE-2024-1234",
+        TargetSymbols = ["vulnerable_func"],
+        IncludeSemanticFingerprint = true,
+        IncludeDecompiledHash = true
+    },
+    ct);
+```
+
+---
+
+## 10. Security Considerations
+
+### 10.1 Sandbox Requirements
+
+All binary analysis runs in sandboxed environments:
+- Seccomp profile restricting syscalls
+- Read-only root filesystem
+- No network access during analysis
+- Memory/CPU limits
+
+### 10.2 Model Security
+
+ML models are:
+- Signed with DSSE attestations
+- Verified before loading
+- Not user-uploadable (pre-trained only)
+
+### 10.3 Corpus Integrity
+
+Corpus data is:
+- Ingested from trusted sources only
+- Signed at snapshot level
+- Version-controlled with audit trail
+
+---
+
+## 11. Configuration
+
+```yaml
+# binaryindex.yaml - Semantic diffing configuration
+binaryindex:
+  semantic_diffing:
+    enabled: true
+
+    # Analysis backends
+    backends:
+      b2r2:
+        enabled: true
+        ir_lifting: true
+        semantic_graph: true
+      ghidra:
+        enabled: true
+        fallback_only: true
+        min_b2r2_confidence: 0.7
+        headless_timeout_ms: 30000
+      decompiler:
+        enabled: true
+        high_value_only: true  # Only for CVE-affected functions
+      ml:
+        enabled: true
+        model_path: /models/codebert_binary_v1.onnx
+        embedding_dimension: 768
+
+    # Ensemble weights
+    ensemble:
+      instruction_weight: 0.15
+      semantic_weight: 0.25
+      decompiled_weight: 0.35
+      ml_weight: 0.25
+      min_confidence: 0.6
+
+    # Corpus
+    corpus:
+      auto_update: true
+      update_interval_hours: 24
+      libraries:
+        - glibc
+        - openssl
+        - zlib
+        - curl
+        - sqlite
+
+    # Performance
+    performance:
+      max_parallel_analyses: 4
+      cache_ttl_seconds: 3600
+      max_function_size_bytes: 1048576  # 1MB
+```
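+
+The ensemble weights above could be combined into a single score roughly as follows. This is a sketch under stated assumptions: `EnsembleSketch` and the signal names are hypothetical, and renormalizing over the signals that are actually present (for example when the decompiler and ML backends are skipped) is one possible policy that this document does not specify.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Globalization;
+
+// Weights taken from the ensemble section of the configuration.
+var weights = new Dictionary<string, double>
+{
+    ["instruction"] = 0.15, ["semantic"] = 0.25, ["decompiled"] = 0.35, ["ml"] = 0.25
+};
+
+// Only Tier 1 signals are available here; decompiled and ML scores are missing.
+var scores = new Dictionary<string, double> { ["instruction"] = 0.60, ["semantic"] = 0.90 };
+
+var score = EnsembleSketch.Combine(weights, scores);
+Console.WriteLine(score.ToString("F4", CultureInfo.InvariantCulture)); // 0.7875
+Console.WriteLine(score >= 0.6);                                       // True (meets min_confidence)
+
+// Hypothetical helper for illustration only.
+static class EnsembleSketch
+{
+    public static double Combine(
+        IReadOnlyDictionary<string, double> weights,
+        IReadOnlyDictionary<string, double> scores)
+    {
+        double weighted = 0, totalWeight = 0;
+        foreach (var (signal, value) in scores)
+        {
+            // Ignore signals that have no configured weight.
+            if (!weights.TryGetValue(signal, out var w)) continue;
+            weighted += w * value;
+            totalWeight += w;
+        }
+
+        // Renormalize so missing signals do not drag the score toward zero.
+        return totalWeight == 0 ? 0 : weighted / totalWeight;
+    }
+}
+```
+
+With all four signals present the total weight is 1.0, so the renormalization has no effect and the result is the plain weighted sum.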
+
+---
+
+## 12. Metrics & Observability
+
+### Metrics
+
+| Metric | Type | Labels |
+|--------|------|--------|
+| `semantic_diffing_analysis_total` | Counter | backend, result |
+| `semantic_diffing_latency_ms` | Histogram | backend, tier |
+| `semantic_diffing_accuracy` | Gauge | comparison_type |
+| `corpus_functions_total` | Gauge | library |
+| `ml_inference_latency_ms` | Histogram | model |
+| `ensemble_signal_weight` | Gauge | signal_type |
+
+### Traces
+
+- `semantic_diffing.analyze` - Full analysis span
+- `semantic_diffing.b2r2.lift` - IR lifting
+- `semantic_diffing.ghidra.decompile` - Decompilation
+- `semantic_diffing.ml.inference` - ML embedding
+- `semantic_diffing.ensemble.decide` - Ensemble decision
+
+---
+
+## 13. Testing Strategy
+
+### Unit Tests
+
+| Test Suite | Coverage |
+|------------|----------|
+| `IrLiftingServiceTests` | IR lifting correctness |
+| `SemanticGraphExtractorTests` | Graph construction |
+| `WeisfeilerLehmanHasherTests` | Hash stability |
+| `AstComparisonEngineTests` | AST similarity |
+| `OnnxInferenceEngineTests` | ML inference |
+| `EnsembleDecisionEngineTests` | Weight combination |
+
+### Integration Tests
+
+| Test Suite | Coverage |
+|------------|----------|
+| `EndToEndSemanticDiffTests` | Full pipeline |
+| `OptimizationResilienceTests` | O0 vs O2 vs O3 |
+| `CompilerVariantTests` | GCC vs Clang |
+| `GhidraFallbackTests` | Fallback scenarios |
+
+### Golden Corpus Tests
+
+Pre-computed test cases with known results:
+- 100 CVE patch pairs (vulnerable -> fixed)
+- 50 optimization variant sets
+- 25 compiler variant sets
+- 25 obfuscation variant sets
+
+---
+
+## 14. Roadmap
+
+| Phase | Status | ETA | Impact |
+|-------|--------|-----|--------|
+| Phase 1: IR Semantics | Planned | 2026-01-24 | +15% accuracy |
+| Phase 2: Corpus | Planned | 2026-02-15 | +10% coverage |
+| Phase 3: Ghidra | Planned | 2026-02-28 | +5% edge cases |
+| Phase 4: Decompiler/ML | Planned | 2026-03-31 | +10% obfuscation |
+| **Total** | | | **+35-40%** |
+
+---
+
+## 15. References
+
+### Internal
+
+- `docs/modules/binary-index/architecture.md`
+- `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.DeltaSig/`
+- `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Fingerprints/`
+
+### External
+
+- [B2R2 Binary Analysis Framework](https://b2r2.org/)
+- [Ghidra Patch Diffing Guide](https://cve-north-stars.github.io/docs/Ghidra-Patch-Diffing)
+- [ghidriff Tool](https://github.com/clearbluejar/ghidriff)
+- [SemDiff Paper (arXiv)](https://arxiv.org/abs/2308.01463)
+- [SEI Semantic Equivalence Research](https://www.sei.cmu.edu/annual-reviews/2022-research-review/semantic-equivalence-checking-of-decompiled-binaries/)
+
+---
+
+*Document Version: 1.0.0*
+*Last Updated: 2026-01-05*