Semantic Diffing Architecture

Status: PLANNED

Version: 1.0.0

Related Sprints:

  • SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md
  • SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md
  • SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md
  • SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md

1. Executive Summary

Semantic diffing is an advanced binary analysis capability that detects function equivalence based on behavior rather than syntax. This enables accurate vulnerability detection in scenarios where traditional byte-level or symbol-based matching fails:

  • Compiler optimizations - Same source, different instructions
  • Obfuscation - Intentionally altered code structure
  • Stripped binaries - No symbols or debug information
  • Cross-compiler - GCC vs Clang produce different output
  • Backported patches - Different version, same fix

Expected Impact

| Capability | Current Accuracy | With Semantic Diffing |
|---|---|---|
| Patch detection (optimized) | ~70% | 92%+ |
| Function identification (stripped) | ~50% | 85%+ |
| Obfuscation resilience | ~40% | 75%+ |
| False positive rate | ~5% | <2% |

2. Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────────┐
│                        Semantic Diffing Architecture                             │
│                                                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                         Analysis Layer                                       ││
│  │                                                                              ││
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        ││
│  │  │   B2R2      │  │   Ghidra    │  │ Decompiler  │  │     ML      │        ││
│  │  │  (Primary)  │  │ (Fallback)  │  │  (Optional) │  │ (Optional)  │        ││
│  │  │             │  │             │  │             │  │             │        ││
│  │  │ - Disasm    │  │ - P-Code    │  │ - C output  │  │ - CodeBERT  │        ││
│  │  │ - LowUIR    │  │ - BSim      │  │ - AST parse │  │ - GraphSAGE │        ││
│  │  │ - CFG       │  │ - Ver.Track │  │ - Normalize │  │ - Embedding │        ││
│  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        ││
│  │         │                │                │                │               ││
│  └─────────┴────────────────┴────────────────┴────────────────┴───────────────┘│
│                                      │                                          │
│                                      v                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                       Fingerprint Layer                                      ││
│  │                                                                              ││
│  │  ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐       ││
│  │  │   Instruction     │  │    Semantic       │  │   Decompiled      │       ││
│  │  │   Fingerprint     │  │    Fingerprint    │  │   Fingerprint     │       ││
│  │  │                   │  │                   │  │                   │       ││
│  │  │ - BasicBlock hash │  │ - KSG graph hash  │  │ - AST hash        │       ││
│  │  │ - CFG edge hash   │  │ - WL hash         │  │ - Normalized code │       ││
│  │  │ - String refs     │  │ - DataFlow hash   │  │ - API sequence    │       ││
│  │  │ - Rolling chunks  │  │ - API calls       │  │ - Pattern hash    │       ││
│  │  └───────────────────┘  └───────────────────┘  └───────────────────┘       ││
│  │                                                                              ││
│  │  ┌───────────────────┐  ┌───────────────────┐                               ││
│  │  │      BSim         │  │   ML Embedding    │                               ││
│  │  │   Signature       │  │     Vector        │                               ││
│  │  │                   │  │                   │                               ││
│  │  │ - Feature vector  │  │ - 768-dim float[] │                               ││
│  │  │ - Significance    │  │ - Cosine sim      │                               ││
│  │  └───────────────────┘  └───────────────────┘                               ││
│  │                                                                              ││
│  └─────────────────────────────────────────────────────────────────────────────┘│
│                                      │                                          │
│                                      v                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                       Matching Layer                                         ││
│  │                                                                              ││
│  │  ┌───────────────────────────────────────────────────────────────────────┐  ││
│  │  │                    Ensemble Decision Engine                            │  ││
│  │  │                                                                        │  ││
│  │  │  Signal Weights:                                                       │  ││
│  │  │  - Instruction fingerprint:  15%                                       │  ││
│  │  │  - Semantic graph:           25%                                       │  ││
│  │  │  - Decompiled AST:           35%                                       │  ││
│  │  │  - ML embedding:             25%                                       │  ││
│  │  │                                                                        │  ││
│  │  │  Output: Confidence-weighted similarity score                          │  ││
│  │  │                                                                        │  ││
│  │  └───────────────────────────────────────────────────────────────────────┘  ││
│  │                                                                              ││
│  └─────────────────────────────────────────────────────────────────────────────┘│
│                                      │                                          │
│                                      v                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                       Storage Layer                                          ││
│  │                                                                              ││
│  │  PostgreSQL                RustFS                 Valkey                    ││
│  │  - corpus.* tables         - Fingerprint blobs    - Query cache             ││
│  │  - binaries.* tables       - Model artifacts      - Embedding index         ││
│  │  - BSim database           - Training data                                  ││
│  │                                                                              ││
│  └─────────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────────┘
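The Matching Layer's "confidence-weighted similarity score" can be read as a weighted mean of the per-signal similarities. The sketch below is a minimal illustration under that assumption. The weights come from the diagram, but the function name `ensemble_score` and the rule for renormalizing weights when an optional signal (decompiler, ML) is absent are hypothetical, not taken from the design.

```python
from typing import Mapping, Optional

# Signal weights as listed in the Matching Layer diagram.
SIGNAL_WEIGHTS = {
    "instruction": 0.15,
    "semantic": 0.25,
    "decompiled": 0.35,
    "ml": 0.25,
}


def ensemble_score(signals: Mapping[str, Optional[float]]) -> float:
    """Weighted mean of the available per-signal similarities in [0, 1].

    Assumption: a signal that is missing or None (for example when the
    decompiler did not run) is skipped and the remaining weights are
    renormalized so the result stays in [0, 1].
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for name, weight in SIGNAL_WEIGHTS.items():
        value = signals.get(name)
        if value is None:
            continue
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"similarity for {name!r} must be in [0, 1]")
        weighted_sum += weight * value
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("at least one signal is required")
    return weighted_sum / total_weight
```

With only the instruction and semantic signals present, the two weights (0.15 and 0.25) are rescaled to sum to 1, so a strong semantic match is not penalized merely because the optional backends were skipped.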

3. Implementation Phases

Phase 1: IR-Level Semantic Analysis (Foundation)

Sprint: SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md

Leverage B2R2's Intermediate Representation (IR) for semantic-level function comparison.

Key Components:

  • IrLiftingService - Lift instructions to LowUIR
  • SemanticGraphExtractor - Build Key-Semantics Graph (KSG)
  • WeisfeilerLehmanHasher - Graph fingerprinting
  • SemanticMatcher - Semantic similarity scoring

Deliverables:

  • StellaOps.BinaryIndex.Semantic library
  • 20 tasks, ~3 weeks

Phase 2: Function Behavior Corpus (Scale)

Sprint: SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md

Build a comprehensive database of known library functions.

Key Components:

  • Library corpus connectors (glibc, OpenSSL, zlib, curl, SQLite)
  • CorpusIngestionService - Batch fingerprint generation
  • FunctionClusteringService - Group similar functions
  • CorpusQueryService - Function identification

Deliverables:

  • StellaOps.BinaryIndex.Corpus library
  • PostgreSQL corpus.* schema
  • ~30,000 indexed functions
  • 22 tasks, ~4 weeks

Phase 3: Ghidra Integration (Depth)

Sprint: SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md

Add Ghidra as a secondary backend for complex cases.

Key Components:

  • GhidraHeadlessManager - Process lifecycle
  • VersionTrackingService - Multi-correlator diffing
  • GhidriffBridge - Python interop
  • BSimService - Behavioral similarity

Deliverables:

  • StellaOps.BinaryIndex.Ghidra library
  • Docker image for Ghidra Headless
  • 20 tasks, ~4 weeks

Phase 4: Decompiler & ML (Excellence)

Sprint: SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md

Add decompiler-based and ML-based comparison for the highest-fidelity semantic analysis.

Key Components:

  • IDecompilerService - Ghidra decompilation
  • AstComparisonEngine - Structural similarity
  • OnnxInferenceEngine - ML embeddings
  • EnsembleDecisionEngine - Multi-signal fusion

Deliverables:

  • StellaOps.BinaryIndex.Decompiler library
  • StellaOps.BinaryIndex.ML library
  • Trained CodeBERT-Binary model
  • 30 tasks, ~5 weeks

4. Fingerprint Types

4.1 Instruction Fingerprint (Existing)

Algorithm: BasicBlock hash + CFG edge hash + String refs hash

Properties:

  • Fast to compute
  • Sensitive to instruction changes
  • Good for exact/near-exact matches

Weight in ensemble: 15%
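As an illustration of how the three component hashes might be combined, here is a minimal sketch. It is not the existing implementation: the `InstructionFingerprint` type, the use of SHA-256, and the choice to sort hashes so that block layout order does not matter are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class InstructionFingerprint:  # hypothetical container type
    basic_block_hash: str
    cfg_edge_hash: str
    string_refs_hash: str


def instruction_fingerprint(
    blocks: Sequence[bytes],
    edges: Iterable[Tuple[int, int]],
    string_refs: Iterable[str],
) -> InstructionFingerprint:
    """Hash basic-block bytes, CFG edges, and referenced strings.

    blocks      raw bytes of each basic block
    edges       (source_index, target_index) pairs into `blocks`
    string_refs string literals referenced by the function
    """
    block_hashes = [_sha256(b) for b in blocks]
    # Sort so the order in which blocks are laid out does not matter.
    bb_hash = _sha256("".join(sorted(block_hashes)).encode())
    # Describe each edge by the content hashes of its endpoints, so block
    # numbering does not leak into the fingerprint.
    edge_keys = sorted(f"{block_hashes[s]}>{block_hashes[t]}" for s, t in edges)
    cfg_hash = _sha256("|".join(edge_keys).encode())
    strings_hash = _sha256("\0".join(sorted(set(string_refs))).encode())
    return InstructionFingerprint(bb_hash, cfg_hash, strings_hash)
```

Because the inputs are raw block bytes, changing a single instruction byte changes the basic-block hash, which is the sensitivity to instruction changes noted in the properties above.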

4.2 Semantic Fingerprint (Phase 1)

Algorithm: Key-Semantics Graph + Weisfeiler-Lehman hash

Properties:

  • Captures data/control dependencies
  • Resilient to register renaming
  • Resilient to instruction reordering

Weight in ensemble: 25%
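The Weisfeiler-Lehman step can be sketched as iterative label refinement: each node's label is repeatedly replaced by a hash of its own label plus the sorted labels of its neighbors, and the graph hash is a hash of the final label multiset. The example below is a generic WL sketch, not the planned `WeisfeilerLehmanHasher`. The graph encoding (operation labels plus adjacency lists) and the default of three iterations are assumptions.

```python
import hashlib
from typing import List, Mapping


def wl_hash(
    labels: Mapping[str, str],
    neighbors: Mapping[str, List[str]],
    iterations: int = 3,
) -> str:
    """Weisfeiler-Lehman style hash of a labeled graph.

    labels     node id -> operation label (e.g. "load", "add")
    neighbors  node id -> adjacent node ids
    Node ids are only used for lookup and never enter the hash, so
    renaming nodes (or the registers they stand for) leaves it unchanged.
    """
    current = dict(labels)
    for _ in range(iterations):
        refined = {}
        for node, label in current.items():
            neighbor_labels = sorted(current[n] for n in neighbors.get(node, []))
            payload = label + "|" + ",".join(neighbor_labels)
            refined[node] = hashlib.sha256(payload.encode()).hexdigest()
        current = refined
    # The graph hash is a hash of the sorted multiset of final labels.
    return hashlib.sha256(",".join(sorted(current.values())).encode()).hexdigest()
```

Labeling nodes by operation rather than by register or address is what makes the hash insensitive to register renaming; sorting neighbor labels removes any dependence on the order in which instructions were emitted.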

4.3 Decompiled Fingerprint (Phase 4)

Algorithm: Normalized AST hash + Pattern detection

Properties:

  • Highest semantic fidelity
  • Captures algorithmic structure
  • Resilient to most optimizations

Weight in ensemble: 35%
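To make "normalized AST hash" concrete, the sketch below canonicalizes variable names in a small tuple-based tree before hashing, so two decompilations that differ only in generated identifiers produce the same hash. The tuple encoding, the `("var", name)` convention, and the function names are hypothetical. A real implementation would operate on the decompiler's AST.

```python
import hashlib
from typing import Dict


def normalize(node, names: Dict[str, str]):
    """Rename variables to v0, v1, ... in order of first appearance."""
    if isinstance(node, tuple):
        if len(node) == 2 and node[0] == "var":
            canonical = names.setdefault(node[1], f"v{len(names)}")
            return ("var", canonical)
        # Keep the node kind, normalize children left to right.
        return (node[0],) + tuple(normalize(child, names) for child in node[1:])
    return node  # constants and other leaves are kept as-is


def ast_hash(tree) -> str:
    """SHA-256 over the canonical form of the tree."""
    canonical = normalize(tree, {})
    return hashlib.sha256(repr(canonical).encode()).hexdigest()


# Two decompilations of `x = y + 4` with different generated names.
tree_a = ("assign", ("var", "iVar1"), ("add", ("var", "param_1"), ("const", 4)))
tree_b = ("assign", ("var", "local_8"), ("add", ("var", "arg0"), ("const", 4)))
same_structure = ast_hash(tree_a) == ast_hash(tree_b)
```

Constants are deliberately left untouched here, so a patch that changes a bounds check from 4 to 8 still changes the hash.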

4.4 ML Embedding (Phase 4)

Algorithm: CodeBERT-Binary transformer, 768-dim vectors

Properties:

  • Learned similarity metric
  • Captures latent patterns
  • Resilient to obfuscation

Weight in ensemble: 25%
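The architecture diagram lists cosine similarity as the comparison for embedding vectors. A minimal, dependency-free sketch of that comparison follows. The function name and the choice to return 0.0 for a zero vector are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)
```

Cosine similarity ranges over [-1, 1], so a caller feeding it into a [0, 1] ensemble score would need to clamp or rescale it. How the design handles this is not specified.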


5. Matching Pipeline

sequenceDiagram
    participant Client
    participant DiffEngine as PatchDiffEngine
    participant B2R2
    participant Ghidra
    participant Corpus
    participant Ensemble

    Client->>DiffEngine: Compare(oldBinary, newBinary)

    par Parallel Analysis
        DiffEngine->>B2R2: Disassemble + IR lift
        DiffEngine->>Ghidra: Decompile (if needed)
    end

    B2R2-->>DiffEngine: SemanticFingerprints[]
    Ghidra-->>DiffEngine: DecompiledFunctions[]

    DiffEngine->>Corpus: IdentifyFunctions(fingerprints)
    Corpus-->>DiffEngine: FunctionMatches[]

    DiffEngine->>Ensemble: ComputeSimilarity(old, new)
    Ensemble-->>DiffEngine: EnsembleResult

    DiffEngine-->>Client: PatchDiffResult
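
The "Parallel Analysis" step of the diagram can be sketched with `asyncio.gather`, which runs the B2R2 and Ghidra work for both binaries concurrently and waits for all results. Everything below is a stand-in: the stub coroutines, their return values, and the `decompile` flag are hypothetical and only mirror the shape of the sequence.

```python
import asyncio
from typing import Dict, List


async def b2r2_fingerprints(binary: str) -> List[str]:
    """Stub for 'Disassemble + IR lift'; returns placeholder fingerprints."""
    await asyncio.sleep(0)  # stands in for real analysis work
    return [f"{binary}:sem-fp"]


async def ghidra_decompile(binary: str, needed: bool) -> List[str]:
    """Stub for 'Decompile (if needed)'; skipped when not needed."""
    if not needed:
        return []
    await asyncio.sleep(0)
    return [f"{binary}:decompiled"]


async def compare(old: str, new: str, decompile: bool = False) -> Dict[str, List[str]]:
    # Run all four analyses concurrently; results come back in call order.
    old_fp, new_fp, old_dec, new_dec = await asyncio.gather(
        b2r2_fingerprints(old),
        b2r2_fingerprints(new),
        ghidra_decompile(old, decompile),
        ghidra_decompile(new, decompile),
    )
    return {"fingerprints": old_fp + new_fp, "decompiled": old_dec + new_dec}
```

The corpus lookup and ensemble steps are omitted; they would follow sequentially, since both depend on the fingerprints produced here.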

6. Fallback Strategy

The system uses a tiered fallback strategy:

Tier 1: B2R2 IR + Semantic Graph (fast, ~90% coverage)
   │
   │ If confidence < threshold OR architecture unsupported
   v
Tier 2: Ghidra Version Tracking (slower, ~95% coverage)
   │
   │ If function is high-value (CVE-relevant)
   v
Tier 3: Decompiled AST + ML Embedding (slowest, ~99% coverage)

Selection Criteria:

| Condition | Backend | Reason |
|---|---|---|
| Standard x64/ARM64 binary | B2R2 only | Fast, accurate |
| Low B2R2 confidence (<0.7) | B2R2 + Ghidra | Validation |
| Exotic architecture | Ghidra only | Better coverage |
| CVE-affected function | Full pipeline | Maximum accuracy |
| Obfuscated binary | ML embedding | Obfuscation resilience |
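
The selection criteria can be expressed as a small decision function. The sketch below is one possible reading, with several assumptions: the precedence between rows (CVE-affected first, then architecture), treating every architecture other than x86_64 and ARM64 as "exotic", and adding ML alongside B2R2 for obfuscated binaries are choices made for the example, and `AnalysisContext` and `select_backends` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Assumption: only the two "Primary" architectures count as standard.
B2R2_PRIMARY_ARCHS = frozenset({"x86_64", "arm64"})


@dataclass(frozen=True)
class AnalysisContext:  # hypothetical input type
    architecture: str
    b2r2_confidence: Optional[float] = None  # None until B2R2 has run
    cve_affected: bool = False
    obfuscated: bool = False


def select_backends(ctx: AnalysisContext, min_b2r2_confidence: float = 0.7) -> FrozenSet[str]:
    """Map the selection-criteria table to a set of backends."""
    if ctx.cve_affected:
        # Full pipeline for maximum accuracy.
        return frozenset({"b2r2", "ghidra", "decompiler", "ml"})
    if ctx.architecture not in B2R2_PRIMARY_ARCHS:
        return frozenset({"ghidra"})
    backends = {"b2r2"}
    if ctx.b2r2_confidence is not None and ctx.b2r2_confidence < min_b2r2_confidence:
        backends.add("ghidra")  # validate a low-confidence B2R2 result
    if ctx.obfuscated:
        backends.add("ml")
    return frozenset(backends)
```

For example, `select_backends(AnalysisContext("x86_64", b2r2_confidence=0.5))` yields B2R2 plus Ghidra, matching the "Low B2R2 confidence" row.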

7. Corpus Coverage

Priority Libraries

| Library | Priority | Functions | CVEs |
|---|---|---|---|
| glibc | Critical | ~15,000 | 50+ |
| OpenSSL | Critical | ~8,000 | 100+ |
| zlib | High | ~200 | 5+ |
| libcurl | High | ~2,000 | 80+ |
| SQLite | High | ~1,500 | 30+ |
| libxml2 | Medium | ~1,200 | 40+ |
| libpng | Medium | ~300 | 10+ |
| expat | Medium | ~150 | 15+ |

Architecture Coverage

| Architecture | B2R2 | Ghidra | Status |
|---|---|---|---|
| x86_64 | Excellent | Excellent | Primary |
| ARM64 | Excellent | Excellent | Primary |
| ARM32 | Good | Excellent | Secondary |
| MIPS32 | Fair | Excellent | Fallback |
| MIPS64 | Fair | Excellent | Fallback |
| RISC-V | Good | Good | Emerging |
| PPC32/64 | Fair | Excellent | Fallback |

8. Performance Characteristics

Latency Budget

| Operation | Target | Notes |
|---|---|---|
| B2R2 disassembly | <100ms | Per function |
| IR lifting | <50ms | Per function |
| Semantic fingerprint | <50ms | Per function |
| Ghidra analysis | <30s | Per binary (startup) |
| Decompilation | <500ms | Per function |
| ML inference | <100ms | Per function |
| Ensemble decision | <10ms | Per comparison |
| Total (Tier 1) | <200ms | Per function |
| Total (Full) | <1s | Per function |

Memory Budget

| Component | Memory | Notes |
|---|---|---|
| B2R2 per binary | ~100MB | Scales with binary size |
| Ghidra per project | ~2GB | Persistent cache |
| ML model | ~500MB | ONNX loaded |
| Corpus query cache | ~100MB | LRU eviction |

9. Integration Points

9.1 Scanner Integration

// Scanner.Worker uses semantic diffing for binary vulnerability detection
var result = await _binaryVulnerabilityService.LookupByFingerprintAsync(
    fingerprint,
    minSimilarity: 0.85m,
    useSemanticMatching: true,  // Enable semantic diffing
    ct);

9.2 PatchDiffEngine Enhancement

// PatchDiffEngine now includes semantic comparison
var diff = await _patchDiffEngine.DiffAsync(
    vulnerableBinary,
    patchedBinary,
    new PatchDiffOptions
    {
        UseSemanticAnalysis = true,
        SemanticThreshold = 0.7m,
        IncludeDecompilation = true,
        IncludeMlEmbedding = true
    },
    ct);

9.3 DeltaSignature Enhancement

// Delta signatures now include semantic fingerprints
var signature = await _deltaSignatureGenerator.GenerateSignaturesAsync(
    binaryStream,
    new DeltaSignatureRequest
    {
        Cve = "CVE-2024-1234",
        TargetSymbols = ["vulnerable_func"],
        IncludeSemanticFingerprint = true,
        IncludeDecompiledHash = true
    },
    ct);

10. Security Considerations

10.1 Sandbox Requirements

All binary analysis runs in sandboxed environments:

  • Seccomp profile restricting syscalls
  • Read-only root filesystem
  • No network access during analysis
  • Memory/CPU limits

10.2 Model Security

ML models are:

  • Signed with DSSE attestations
  • Verified before loading
  • Not user-uploadable (pre-trained only)

10.3 Corpus Integrity

Corpus data is:

  • Ingested from trusted sources only
  • Signed at snapshot level
  • Version-controlled with audit trail

11. Configuration

# binaryindex.yaml - Semantic diffing configuration
binaryindex:
  semantic_diffing:
    enabled: true

    # Analysis backends
    backends:
      b2r2:
        enabled: true
        ir_lifting: true
        semantic_graph: true
      ghidra:
        enabled: true
        fallback_only: true
        min_b2r2_confidence: 0.7
        headless_timeout_ms: 30000
      decompiler:
        enabled: true
        high_value_only: true  # Only for CVE-affected functions
      ml:
        enabled: true
        model_path: /models/codebert_binary_v1.onnx
        embedding_dimension: 768

    # Ensemble weights
    ensemble:
      instruction_weight: 0.15
      semantic_weight: 0.25
      decompiled_weight: 0.35
      ml_weight: 0.25
      min_confidence: 0.6

    # Corpus
    corpus:
      auto_update: true
      update_interval_hours: 24
      libraries:
        - glibc
        - openssl
        - zlib
        - curl
        - sqlite

    # Performance
    performance:
      max_parallel_analyses: 4
      cache_ttl_seconds: 3600
      max_function_size_bytes: 1048576  # 1MB
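
A loader for this configuration would likely want to sanity-check the `ensemble` section before use. The sketch below validates a plain dictionary mirroring that section. The rule that the four weights must sum to 1.0 is an assumption inferred from the 15/25/35/25 percentages, and `validate_ensemble` is a hypothetical helper, not part of the described system.

```python
import math
from typing import Mapping

WEIGHT_KEYS = (
    "instruction_weight",
    "semantic_weight",
    "decompiled_weight",
    "ml_weight",
)


def validate_ensemble(cfg: Mapping[str, float]) -> None:
    """Raise ValueError if the ensemble section is inconsistent."""
    weights = [cfg[key] for key in WEIGHT_KEYS]
    if any(w < 0.0 for w in weights):
        raise ValueError("ensemble weights must be non-negative")
    # Assumption: weights are fractions of a whole and must sum to 1.0.
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError(f"ensemble weights sum to {sum(weights)}, expected 1.0")
    if not 0.0 <= cfg["min_confidence"] <= 1.0:
        raise ValueError("min_confidence must be in [0, 1]")


# Values copied from the configuration above.
validate_ensemble({
    "instruction_weight": 0.15,
    "semantic_weight": 0.25,
    "decompiled_weight": 0.35,
    "ml_weight": 0.25,
    "min_confidence": 0.6,
})
```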

12. Metrics & Observability

Metrics

| Metric | Type | Labels |
|---|---|---|
| semantic_diffing_analysis_total | Counter | backend, result |
| semantic_diffing_latency_ms | Histogram | backend, tier |
| semantic_diffing_accuracy | Gauge | comparison_type |
| corpus_functions_total | Gauge | library |
| ml_inference_latency_ms | Histogram | model |
| ensemble_signal_weight | Gauge | signal_type |

Traces

  • semantic_diffing.analyze - Full analysis span
  • semantic_diffing.b2r2.lift - IR lifting
  • semantic_diffing.ghidra.decompile - Decompilation
  • semantic_diffing.ml.inference - ML embedding
  • semantic_diffing.ensemble.decide - Ensemble decision

13. Testing Strategy

Unit Tests

| Test Suite | Coverage |
|---|---|
| IrLiftingServiceTests | IR lifting correctness |
| SemanticGraphExtractorTests | Graph construction |
| WeisfeilerLehmanHasherTests | Hash stability |
| AstComparisonEngineTests | AST similarity |
| OnnxInferenceEngineTests | ML inference |
| EnsembleDecisionEngineTests | Weight combination |

Integration Tests

| Test Suite | Coverage |
|---|---|
| EndToEndSemanticDiffTests | Full pipeline |
| OptimizationResilienceTests | O0 vs O2 vs O3 |
| CompilerVariantTests | GCC vs Clang |
| GhidraFallbackTests | Fallback scenarios |

Golden Corpus Tests

Pre-computed test cases with known results:

  • 100 CVE patch pairs (vulnerable -> fixed)
  • 50 optimization variant sets
  • 25 compiler variant sets
  • 25 obfuscation variant sets
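
One way to turn golden cases into the accuracy and false-positive figures quoted in the Expected Impact table is to score each pair and compare the thresholded prediction with the known label. The sketch below assumes each case is a `(similarity, expected_match)` pair and borrows the 0.85 threshold from the scanner integration example. The `evaluate` helper is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def evaluate(cases: Iterable[Tuple[float, bool]], threshold: float = 0.85) -> Dict[str, float]:
    """Compute accuracy and false positive rate over labeled cases."""
    tp = fp = tn = fn = 0
    for score, expected_match in cases:
        predicted_match = score >= threshold
        if predicted_match and expected_match:
            tp += 1
        elif predicted_match and not expected_match:
            fp += 1
        elif not predicted_match and not expected_match:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    negatives = fp + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / negatives if negatives else 0.0,
    }
```

Running this per category (CVE pairs, optimization variants, compiler variants, obfuscation variants) would give one figure per row of the Expected Impact table rather than a single blended number.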

14. Roadmap

| Phase | Status | ETA | Impact |
|---|---|---|---|
| Phase 1: IR Semantics | Planned | 2026-01-24 | +15% accuracy |
| Phase 2: Corpus | Planned | 2026-02-15 | +10% coverage |
| Phase 3: Ghidra | Planned | 2026-02-28 | +5% edge cases |
| Phase 4: Decompiler/ML | Planned | 2026-03-31 | +10% obfuscation |
| Total | | | +35-40% |

15. References

Internal

  • docs/modules/binary-index/architecture.md
  • src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.DeltaSig/
  • src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Fingerprints/

External


Document Version: 1.0.0

Last Updated: 2026-01-05