Semantic Diffing Architecture

Status: PLANNED
Version: 1.0.0

Related Sprints:

SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md
SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md
SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md
SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md

1. Executive Summary

Semantic diffing is a binary analysis capability that matches functions by behavior rather than by syntax. It targets accurate vulnerability detection in scenarios where traditional byte-level or symbol-based matching fails:

Compiler optimizations - Same source, different instructions
Obfuscation - Intentionally altered code structure
Stripped binaries - No symbols or debug information
Cross-compiler - GCC vs Clang produce different output
Backported patches - Different version, same fix

Expected Impact

| Capability | Current | With Semantic Diffing (target) |
| --- | --- | --- |
| Patch detection accuracy (optimized builds) | ~70% | 92%+ |
| Function identification accuracy (stripped binaries) | ~50% | 85%+ |
| Obfuscation resilience | ~40% | 75%+ |
| False positive rate | ~5% | <2% |

2. Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────────┐
│                        Semantic Diffing Architecture                             │
│                                                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                         Analysis Layer                                       ││
│  │                                                                              ││
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        ││
│  │  │   B2R2      │  │   Ghidra    │  │ Decompiler  │  │     ML      │        ││
│  │  │  (Primary)  │  │ (Fallback)  │  │  (Optional) │  │ (Optional)  │        ││
│  │  │             │  │             │  │             │  │             │        ││
│  │  │ - Disasm    │  │ - P-Code    │  │ - C output  │  │ - CodeBERT  │        ││
│  │  │ - LowUIR    │  │ - BSim      │  │ - AST parse │  │ - GraphSAGE │        ││
│  │  │ - CFG       │  │ - Ver.Track │  │ - Normalize │  │ - Embedding │        ││
│  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        ││
│  │         │                │                │                │               ││
│  └─────────┴────────────────┴────────────────┴────────────────┴───────────────┘│
│                                      │                                          │
│                                      v                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                       Fingerprint Layer                                      ││
│  │                                                                              ││
│  │  ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐       ││
│  │  │   Instruction     │  │    Semantic       │  │   Decompiled      │       ││
│  │  │   Fingerprint     │  │    Fingerprint    │  │   Fingerprint     │       ││
│  │  │                   │  │                   │  │                   │       ││
│  │  │ - BasicBlock hash │  │ - KSG graph hash  │  │ - AST hash        │       ││
│  │  │ - CFG edge hash   │  │ - WL hash         │  │ - Normalized code │       ││
│  │  │ - String refs     │  │ - DataFlow hash   │  │ - API sequence    │       ││
│  │  │ - Rolling chunks  │  │ - API calls       │  │ - Pattern hash    │       ││
│  │  └───────────────────┘  └───────────────────┘  └───────────────────┘       ││
│  │                                                                              ││
│  │  ┌───────────────────┐  ┌───────────────────┐                               ││
│  │  │      BSim         │  │   ML Embedding    │                               ││
│  │  │   Signature       │  │     Vector        │                               ││
│  │  │                   │  │                   │                               ││
│  │  │ - Feature vector  │  │ - 768-dim float[] │                               ││
│  │  │ - Significance    │  │ - Cosine sim      │                               ││
│  │  └───────────────────┘  └───────────────────┘                               ││
│  │                                                                              ││
│  └─────────────────────────────────────────────────────────────────────────────┘│
│                                      │                                          │
│                                      v                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                       Matching Layer                                         ││
│  │                                                                              ││
│  │  ┌───────────────────────────────────────────────────────────────────────┐  ││
│  │  │                    Ensemble Decision Engine                            │  ││
│  │  │                                                                        │  ││
│  │  │  Signal Weights:                                                       │  ││
│  │  │  - Instruction fingerprint:  15%                                       │  ││
│  │  │  - Semantic graph:           25%                                       │  ││
│  │  │  - Decompiled AST:           35%                                       │  ││
│  │  │  - ML embedding:             25%                                       │  ││
│  │  │                                                                        │  ││
│  │  │  Output: Confidence-weighted similarity score                          │  ││
│  │  │                                                                        │  ││
│  │  └───────────────────────────────────────────────────────────────────────┘  ││
│  │                                                                              ││
│  └─────────────────────────────────────────────────────────────────────────────┘│
│                                      │                                          │
│                                      v                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                       Storage Layer                                          ││
│  │                                                                              ││
│  │  PostgreSQL                RustFS                 Valkey                    ││
│  │  - corpus.* tables         - Fingerprint blobs    - Query cache             ││
│  │  - binaries.* tables       - Model artifacts      - Embedding index         ││
│  │  - BSim database           - Training data                                  ││
│  │                                                                              ││
│  └─────────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────────┘
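
The signal weights in the Matching Layer can be read as a weighted average of per-signal similarity scores. The following is a minimal sketch under stated assumptions: `EnsembleScorer`, its `Combine` method, and the string signal keys are hypothetical names, not the planned `EnsembleDecisionEngine` API, and renormalizing over whichever signals are available is an assumed policy for tiers that do not compute every signal.

```csharp
using System;
using System.Collections.Generic;

// Usage: a Tier 1 comparison where only two of the four signals were computed.
var tier1Scores = new Dictionary<string, double>
{
    ["instruction"] = 0.60,
    ["semantic"] = 0.90,
};
Console.WriteLine(EnsembleScorer.Combine(tier1Scores));

public static class EnsembleScorer
{
    // Weights taken from the Matching Layer diagram; they sum to 1.0.
    public static readonly IReadOnlyDictionary<string, double> Weights =
        new Dictionary<string, double>
        {
            ["instruction"] = 0.15,
            ["semantic"] = 0.25,
            ["decompiled"] = 0.35,
            ["ml"] = 0.25,
        };

    // Weighted average over the signals that are present.
    // Assumption: missing signals are skipped and the remaining weights are renormalized.
    public static double Combine(IReadOnlyDictionary<string, double> signalScores)
    {
        double weighted = 0.0, totalWeight = 0.0;
        foreach (var (signal, score) in signalScores)
        {
            if (!Weights.TryGetValue(signal, out var weight))
                throw new ArgumentException($"Unknown signal '{signal}'.");
            if (score < 0.0 || score > 1.0)
                throw new ArgumentOutOfRangeException(nameof(signalScores), "Scores must be in [0, 1].");
            weighted += weight * score;
            totalWeight += weight;
        }
        return totalWeight == 0.0 ? 0.0 : weighted / totalWeight;
    }
}
```

Without renormalization, a Tier 1 result could never exceed 0.40, which would make a single confidence threshold hard to apply across tiers; that is the reason for the assumed policy here.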

3. Implementation Phases

Phase 1: IR-Level Semantic Analysis (Foundation)

Sprint: SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md

Leverage B2R2's Intermediate Representation (IR) for semantic-level function comparison.

Key Components:

IrLiftingService - Lift instructions to LowUIR
SemanticGraphExtractor - Build Key-Semantics Graph (KSG)
WeisfeilerLehmanHasher - Graph fingerprinting
SemanticMatcher - Semantic similarity scoring

Deliverables:

StellaOps.BinaryIndex.Semantic library
20 tasks, ~3 weeks

Phase 2: Function Behavior Corpus (Scale)

Sprint: SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md

Build a comprehensive database of known library functions.

Key Components:

Library corpus connectors (glibc, OpenSSL, zlib, curl, SQLite)
CorpusIngestionService - Batch fingerprint generation
FunctionClusteringService - Group similar functions
CorpusQueryService - Function identification

Deliverables:

StellaOps.BinaryIndex.Corpus library
PostgreSQL corpus.* schema
~30,000 indexed functions
22 tasks, ~4 weeks

Phase 3: Ghidra Integration (Depth)

Sprint: SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md

Add Ghidra as a secondary backend for complex cases.

Key Components:

GhidraHeadlessManager - Process lifecycle
VersionTrackingService - Multi-correlator diffing
GhidriffBridge - Python interop
BSimService - Behavioral similarity

Deliverables:

StellaOps.BinaryIndex.Ghidra library
Docker image for Ghidra Headless
20 tasks, ~4 weeks

Phase 4: Decompiler & ML (Excellence)

Sprint: SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md

Add the highest-fidelity semantic analysis: decompiled-code comparison and ML embeddings, fused with the earlier signals.

Key Components:

IDecompilerService - Ghidra decompilation
AstComparisonEngine - Structural similarity
OnnxInferenceEngine - ML embeddings
EnsembleDecisionEngine - Multi-signal fusion

Deliverables:

StellaOps.BinaryIndex.Decompiler library
StellaOps.BinaryIndex.ML library
Trained CodeBERT-Binary model
30 tasks, ~5 weeks

4. Fingerprint Types

4.1 Instruction Fingerprint (Existing)

Algorithm: BasicBlock hash + CFG edge hash + String refs hash

Properties:

Fast to compute
Sensitive to instruction changes
Good for exact/near-exact matches

Weight in ensemble: 15%

4.2 Semantic Fingerprint (Phase 1)

Algorithm: Key-Semantics Graph + Weisfeiler-Lehman hash

Properties:

Captures data/control dependencies
Resilient to register renaming
Resilient to instruction reordering

Weight in ensemble: 25%
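
To illustrate why a Weisfeiler-Lehman hash tolerates reordering, here is a minimal sketch of WL label refinement over a small labeled graph. It is not the planned `WeisfeilerLehmanHasher`: the `WlHasher` name, the operation-kind node labels, the undirected edges, the three iterations, and SHA-256 as the label compressor are all assumptions chosen for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Usage: the same 3-node graph, listed in a different node order, hashes identically.
var first = WlHasher.Hash(
    new[] { "load", "add", "store" },
    new[] { (0, 1), (1, 2) });
var second = WlHasher.Hash(
    new[] { "store", "load", "add" },
    new[] { (1, 2), (2, 0) });
Console.WriteLine(first == second); // True

public static class WlHasher
{
    public static string Hash(string[] nodeLabels, (int U, int V)[] edges, int iterations = 3)
    {
        // Build undirected adjacency lists (a simplification; real dependency edges are directed).
        var neighbors = nodeLabels.Select(_ => new List<int>()).ToArray();
        foreach (var (u, v) in edges)
        {
            neighbors[u].Add(v);
            neighbors[v].Add(u);
        }

        var labels = (string[])nodeLabels.Clone();
        for (var i = 0; i < iterations; i++)
        {
            var next = new string[labels.Length];
            for (var n = 0; n < labels.Length; n++)
            {
                // New label = own label plus the sorted multiset of neighbor labels, compressed by hashing.
                var sorted = neighbors[n].Select(m => labels[m]).OrderBy(x => x, StringComparer.Ordinal);
                next[n] = Sha256Hex(labels[n] + "|" + string.Join(",", sorted));
            }
            labels = next;
        }

        // Graph fingerprint = hash of the sorted multiset of final node labels.
        return Sha256Hex(string.Join(";", labels.OrderBy(x => x, StringComparer.Ordinal)));
    }

    private static string Sha256Hex(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
}
```

Because node labels here carry operation kinds rather than register names, and every step works on sorted multisets, neither register renaming nor node ordering affects the result. An exact hash only answers "same or different"; producing a graded similarity score would need an additional comparison step that this sketch does not cover.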

4.3 Decompiled Fingerprint (Phase 4)

Algorithm: Normalized AST hash + Pattern detection

Properties:

Highest semantic fidelity
Captures algorithmic structure
Resilient to most optimizations

Weight in ensemble: 35%
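
The effect of normalization before hashing can be shown with a deliberately simplified stand-in. The sketch below does not parse an AST; it swaps in token-level identifier renaming to show why a normalized hash is stable when only names and whitespace change. `DecompiledNormalizer` and its keyword list are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

// Usage: two decompilations that differ only in variable names and spacing.
var hashA = DecompiledNormalizer.Hash("int f(int a, int b) { int t = a + b; return t * 2; }");
var hashB = DecompiledNormalizer.Hash("int f(int x,int y){ int sum = x + y; return sum * 2; }");
Console.WriteLine(hashA == hashB); // True

public static class DecompiledNormalizer
{
    // A small, incomplete C keyword list; enough for the example only.
    private static readonly HashSet<string> Keywords = new(StringComparer.Ordinal)
    {
        "int", "char", "void", "long", "unsigned", "return", "if", "else", "while", "for",
    };

    // Tokens: identifiers, integer literals, or any single non-whitespace character.
    private static readonly Regex Token = new(@"[A-Za-z_]\w*|\d+|\S", RegexOptions.Compiled);

    public static string Normalize(string code)
    {
        var names = new Dictionary<string, string>(StringComparer.Ordinal);
        var output = new List<string>();
        foreach (Match match in Token.Matches(code))
        {
            var token = match.Value;
            var isIdentifier = char.IsLetter(token[0]) || token[0] == '_';
            if (isIdentifier && !Keywords.Contains(token))
            {
                // Rename identifiers by order of first appearance: id0, id1, ...
                if (!names.TryGetValue(token, out var canonical))
                {
                    canonical = "id" + names.Count;
                    names[token] = canonical;
                }
                token = canonical;
            }
            output.Add(token);
        }
        return string.Join(" ", output);
    }

    public static string Hash(string code) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(Normalize(code))));
}
```

This stand-in renames every identifier, including callee names, so it would discard the API-sequence signal; a real normalizer would presumably preserve the names of imported functions.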

4.4 ML Embedding (Phase 4)

Algorithm: CodeBERT-Binary transformer, 768-dim vectors

Properties:

Learned similarity metric
Captures latent patterns
Resilient to obfuscation

Weight in ensemble: 25%
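
Embedding vectors are compared by cosine similarity, as noted in the architecture diagram. A minimal sketch follows; the `Embedding` class name and the choice to return 0 for zero-length vectors are assumptions for illustration.

```csharp
using System;

// Usage: parallel vectors score close to 1 regardless of magnitude.
var left = new float[] { 1f, 0f, 2f };
var right = new float[] { 2f, 0f, 4f };
Console.WriteLine(Embedding.CosineSimilarity(left, right));

public static class Embedding
{
    public static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        // Accumulate in double to limit rounding error over 768 components.
        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // Assumption: treat an all-zero vector as dissimilar instead of dividing by zero.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Cosine similarity ranges over [-1, 1], so it would need clamping or rescaling to [0, 1] before being combined with the other ensemble signals.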

5. Matching Pipeline

sequenceDiagram
    participant Client
    participant DiffEngine as PatchDiffEngine
    participant B2R2
    participant Ghidra
    participant Corpus
    participant Ensemble

    Client->>DiffEngine: Compare(oldBinary, newBinary)

    par Parallel Analysis
        DiffEngine->>B2R2: Disassemble + IR lift
        DiffEngine->>Ghidra: Decompile (if needed)
    end

    B2R2-->>DiffEngine: SemanticFingerprints[]
    Ghidra-->>DiffEngine: DecompiledFunctions[]

    DiffEngine->>Corpus: IdentifyFunctions(fingerprints)
    Corpus-->>DiffEngine: FunctionMatches[]

    DiffEngine->>Ensemble: ComputeSimilarity(old, new)
    Ensemble-->>DiffEngine: EnsembleResult

    DiffEngine-->>Client: PatchDiffResult

6. Fallback Strategy

The system uses a tiered fallback strategy:

Tier 1: B2R2 IR + Semantic Graph (fast, ~90% coverage)
   │
   │ If confidence < threshold OR architecture unsupported
   v
Tier 2: Ghidra Version Tracking (slower, ~95% coverage)
   │
   │ If function is high-value (CVE-relevant)
   v
Tier 3: Decompiled AST + ML Embedding (slowest, ~99% coverage)

Selection Criteria:

| Condition | Backend | Reason |
| --- | --- | --- |
| Standard x64/ARM64 binary | B2R2 only | Fast, accurate |
| Low B2R2 confidence (<0.7) | B2R2 + Ghidra | Validation |
| Exotic architecture | Ghidra only | Better coverage |
| CVE-affected function | Full pipeline | Maximum accuracy |
| Obfuscated binary | ML embedding | Obfuscation resilience |
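
The selection criteria can be expressed as a small decision function. This is a sketch under stated assumptions, not the planned implementation: `Backends`, `FunctionContext`, and `BackendSelector` are hypothetical types, the rule precedence (CVE first, then architecture, confidence, obfuscation) is assumed, and "ML embedding" for obfuscated binaries is interpreted as adding the ML signal to the other backends.

```csharp
using System;

// Usage: a supported architecture with B2R2 confidence below the 0.7 threshold.
Console.WriteLine(BackendSelector.Select(new FunctionContext(
    ArchitectureSupportedByB2R2: true,
    B2R2Confidence: 0.65,
    IsCveAffected: false,
    IsObfuscated: false))); // B2R2, Ghidra

[Flags]
public enum Backends { None = 0, B2R2 = 1, Ghidra = 2, Decompiler = 4, Ml = 8 }

public sealed record FunctionContext(
    bool ArchitectureSupportedByB2R2,
    double B2R2Confidence,
    bool IsCveAffected,
    bool IsObfuscated);

public static class BackendSelector
{
    // Matches the <0.7 threshold in the selection criteria.
    public const double MinB2R2Confidence = 0.7;

    public static Backends Select(FunctionContext ctx)
    {
        // CVE-affected functions always get the full pipeline.
        if (ctx.IsCveAffected)
            return Backends.B2R2 | Backends.Ghidra | Backends.Decompiler | Backends.Ml;

        // Exotic architectures skip B2R2 and go straight to Ghidra.
        var selected = ctx.ArchitectureSupportedByB2R2 ? Backends.B2R2 : Backends.Ghidra;

        // Low Tier 1 confidence adds Ghidra for validation.
        if (ctx.ArchitectureSupportedByB2R2 && ctx.B2R2Confidence < MinB2R2Confidence)
            selected |= Backends.Ghidra;

        // Obfuscated binaries add the ML embedding signal.
        if (ctx.IsObfuscated)
            selected |= Backends.Ml;

        return selected;
    }
}
```

Returning a flags value rather than a single backend keeps combinations such as "B2R2 + Ghidra" representable without a separate enum member for each row of the table.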

7. Corpus Coverage

Priority Libraries

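| Library | Priority | Functions | CVEs |
| --- | --- | --- | --- |
| glibc | Critical | ~15,000 | 50+ |
| OpenSSL | Critical | ~8,000 | 100+ |
| zlib | High | ~200 | 5+ |
| libcurl | High | ~2,000 | 80+ |
| SQLite | High | ~1,500 | 30+ |
| libxml2 | Medium | ~1,200 | 40+ |
| libpng | Medium | ~300 | 10+ |
| expat | Medium | ~150 | 15+ |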

Architecture Coverage

| Architecture | B2R2 | Ghidra | Status |
| --- | --- | --- | --- |
| x86_64 | Excellent | Excellent | Primary |
| ARM64 | Excellent | Excellent | Primary |
| ARM32 | Good | Excellent | Secondary |
| MIPS32 | Fair | Excellent | Fallback |
| MIPS64 | Fair | Excellent | Fallback |
| RISC-V | Good | Good | Emerging |
| PPC32/64 | Fair | Excellent | Fallback |

8. Performance Characteristics

Latency Budget

| Operation | Target | Notes |
| --- | --- | --- |
| B2R2 disassembly | <100ms | Per function |
| IR lifting | <50ms | Per function |
| Semantic fingerprint | <50ms | Per function |
| Ghidra analysis | <30s | Per binary (startup) |
| Decompilation | <500ms | Per function |
| ML inference | <100ms | Per function |
| Ensemble decision | <10ms | Per comparison |
| Total (Tier 1) | <200ms | Per function |
| Total (Full) | <1s | Per function (excludes per-binary Ghidra startup) |

Memory Budget

| Component | Memory | Notes |
| --- | --- | --- |
| B2R2 per binary | ~100MB | Scales with binary size |
| Ghidra per project | ~2GB | Persistent cache |
| ML model | ~500MB | ONNX loaded |
| Corpus query cache | ~100MB | LRU eviction |

9. Integration Points

9.1 Scanner Integration

// Scanner.Worker uses semantic diffing for binary vulnerability detection
var result = await _binaryVulnerabilityService.LookupByFingerprintAsync(
    fingerprint,
    minSimilarity: 0.85m,
    useSemanticMatching: true,  // Enable semantic diffing
    ct);

9.2 PatchDiffEngine Enhancement

// PatchDiffEngine now includes semantic comparison
var diff = await _patchDiffEngine.DiffAsync(
    vulnerableBinary,
    patchedBinary,
    new PatchDiffOptions
    {
        UseSemanticAnalysis = true,
        SemanticThreshold = 0.7m,
        IncludeDecompilation = true,
        IncludeMlEmbedding = true
    },
    ct);

9.3 DeltaSignature Enhancement

// Delta signatures now include semantic fingerprints
var signature = await _deltaSignatureGenerator.GenerateSignaturesAsync(
    binaryStream,
    new DeltaSignatureRequest
    {
        Cve = "CVE-2024-1234",
        TargetSymbols = ["vulnerable_func"],
        IncludeSemanticFingerprint = true,
        IncludeDecompiledHash = true
    },
    ct);

10. Security Considerations

10.1 Sandbox Requirements

All binary analysis runs in sandboxed environments:

Seccomp profile restricting syscalls
Read-only root filesystem
No network access during analysis
Memory/CPU limits

10.2 Model Security

ML models are:

Signed with DSSE attestations
Verified before loading
Not user-uploadable (pre-trained only)

10.3 Corpus Integrity

Corpus data is:

Ingested from trusted sources only
Signed at snapshot level
Version-controlled with audit trail

11. Configuration

# binaryindex.yaml - Semantic diffing configuration
binaryindex:
  semantic_diffing:
    enabled: true

    # Analysis backends
    backends:
      b2r2:
        enabled: true
        ir_lifting: true
        semantic_graph: true
      ghidra:
        enabled: true
        fallback_only: true
        min_b2r2_confidence: 0.7
        headless_timeout_ms: 30000
      decompiler:
        enabled: true
        high_value_only: true  # Only for CVE-affected functions
      ml:
        enabled: true
        model_path: /models/codebert_binary_v1.onnx
        embedding_dimension: 768

    # Ensemble weights
    ensemble:
      instruction_weight: 0.15
      semantic_weight: 0.25
      decompiled_weight: 0.35
      ml_weight: 0.25
      min_confidence: 0.6

    # Corpus
    corpus:
      auto_update: true
      update_interval_hours: 24
      libraries:
        - glibc
        - openssl
        - zlib
        - curl
        - sqlite

    # Performance
    performance:
      max_parallel_analyses: 4
      cache_ttl_seconds: 3600
      max_function_size_bytes: 1048576  # 1MB

12. Metrics & Observability

Metrics

| Metric | Type | Labels |
| --- | --- | --- |
| `semantic_diffing_analysis_total` | Counter | backend, result |
| `semantic_diffing_latency_ms` | Histogram | backend, tier |
| `semantic_diffing_accuracy` | Gauge | comparison_type |
| `corpus_functions_total` | Gauge | library |
| `ml_inference_latency_ms` | Histogram | model |
| `ensemble_signal_weight` | Gauge | signal_type |

Traces

semantic_diffing.analyze - Full analysis span
semantic_diffing.b2r2.lift - IR lifting
semantic_diffing.ghidra.decompile - Decompilation
semantic_diffing.ml.inference - ML embedding
semantic_diffing.ensemble.decide - Ensemble decision
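
As a sketch of how two of these instruments and the top-level span could be emitted from .NET, the example below uses `System.Diagnostics.Metrics` and `ActivitySource`. The instrument, label, and span names come from the lists above; the `SemanticDiffTelemetry` class and the meter/source name `StellaOps.BinaryIndex.Semantic` are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Usage: record one successful Tier 1 analysis that took 42 ms.
SemanticDiffTelemetry.RecordAnalysis("b2r2", "match", tier: 1, elapsedMs: 42.0);

public static class SemanticDiffTelemetry
{
    // Meter and source names are assumptions; instrument names come from the Metrics table.
    private static readonly Meter SemanticMeter = new("StellaOps.BinaryIndex.Semantic");
    private static readonly ActivitySource Source = new("StellaOps.BinaryIndex.Semantic");

    private static readonly Counter<long> AnalysisTotal =
        SemanticMeter.CreateCounter<long>("semantic_diffing_analysis_total");
    private static readonly Histogram<double> LatencyMs =
        SemanticMeter.CreateHistogram<double>("semantic_diffing_latency_ms", unit: "ms");

    public static void RecordAnalysis(string backend, string result, int tier, double elapsedMs)
    {
        // Labels follow the table: (backend, result) and (backend, tier).
        AnalysisTotal.Add(1,
            new KeyValuePair<string, object?>("backend", backend),
            new KeyValuePair<string, object?>("result", result));
        LatencyMs.Record(elapsedMs,
            new KeyValuePair<string, object?>("backend", backend),
            new KeyValuePair<string, object?>("tier", tier));
    }

    // Returns null when no ActivityListener is subscribed to this source.
    public static Activity? StartAnalyze() => Source.StartActivity("semantic_diffing.analyze");
}
```

Callers would typically wrap the analysis in `using var activity = SemanticDiffTelemetry.StartAnalyze();` and tolerate a null activity when tracing is disabled.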

13. Testing Strategy

Unit Tests

| Test Suite | Coverage |
| --- | --- |
| `IrLiftingServiceTests` | IR lifting correctness |
| `SemanticGraphExtractorTests` | Graph construction |
| `WeisfeilerLehmanHasherTests` | Hash stability |
| `AstComparisonEngineTests` | AST similarity |
| `OnnxInferenceEngineTests` | ML inference |
| `EnsembleDecisionEngineTests` | Weight combination |

Integration Tests

| Test Suite | Coverage |
| --- | --- |
| `EndToEndSemanticDiffTests` | Full pipeline |
| `OptimizationResilienceTests` | O0 vs O2 vs O3 |
| `CompilerVariantTests` | GCC vs Clang |
| `GhidraFallbackTests` | Fallback scenarios |

Golden Corpus Tests

Pre-computed test cases with known results:

100 CVE patch pairs (vulnerable -> fixed)
50 optimization variant sets
25 compiler variant sets
25 obfuscation variant sets

14. Roadmap

| Phase | Status | ETA | Impact |
| --- | --- | --- | --- |
| Phase 1: IR Semantics | Planned | 2026-01-24 | +15% accuracy |
| Phase 2: Corpus | Planned | 2026-02-15 | +10% coverage |
| Phase 3: Ghidra | Planned | 2026-02-28 | +5% edge cases |
| Phase 4: Decompiler/ML | Planned | 2026-03-31 | +10% obfuscation |
| Total | | | +35-40% |

15. References

Internal

docs/modules/binary-index/architecture.md
src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.DeltaSig/
src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Fingerprints/

Document Version: 1.0.0
Last Updated: 2026-01-05