# Semantic Diffing Architecture

> **Status:** PHASE 1 IMPLEMENTED (B2R2 IR Lifting)
> **Version:** 1.1.0
> **Related Sprints:**
> - `SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
> - `SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
> - `SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
> - `SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`

---

## 1. Executive Summary

Semantic diffing is an advanced binary analysis capability that detects function equivalence based on **behavior** rather than **syntax**. This enables accurate vulnerability detection in scenarios where traditional byte-level or symbol-based matching fails:

- **Compiler optimizations** - Same source, different instructions
- **Obfuscation** - Intentionally altered code structure
- **Stripped binaries** - No symbols or debug information
- **Cross-compiler** - GCC vs Clang produce different output
- **Backported patches** - Different version, same fix

### Expected Impact

| Capability | Current Accuracy | With Semantic Diffing |
|------------|-----------------|----------------------|
| Patch detection (optimized) | ~70% | 92%+ |
| Function identification (stripped) | ~50% | 85%+ |
| Obfuscation resilience | ~40% | 75%+ |
| False positive rate | ~5% | <2% |

---

## 2. Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                          Semantic Diffing Architecture                          │
│                                                                                 │
│ ┌─────────────────────────────────────────────────────────────────────────────┐ │
│ │                               Analysis Layer                                │ │
│ │                                                                             │ │
│ │    ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐    │ │
│ │    │    B2R2     │   │   Ghidra    │   │ Decompiler  │   │     ML      │    │ │
│ │    │  (Primary)  │   │ (Fallback)  │   │ (Optional)  │   │ (Optional)  │    │ │
│ │    │             │   │             │   │             │   │             │    │ │
│ │    │ - Disasm    │   │ - P-Code    │   │ - C output  │   │ - CodeBERT  │    │ │
│ │    │ - LowUIR    │   │ - BSim      │   │ - AST parse │   │ - GraphSage │    │ │
│ │    │ - CFG       │   │ - Ver.Track │   │ - Normalize │   │ - Embedding │    │ │
│ │    └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘    │ │
│ │           │                 │                 │                 │           │ │
│ └───────────┴─────────────────┴─────────────────┴─────────────────┴───────────┘ │
│                                        │                                        │
│                                        v                                        │
│ ┌─────────────────────────────────────────────────────────────────────────────┐ │
│ │                              Fingerprint Layer                              │ │
│ │                                                                             │ │
│ │    ┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐    │ │
│ │    │    Instruction    │   │     Semantic      │   │    Decompiled     │    │ │
│ │    │    Fingerprint    │   │    Fingerprint    │   │    Fingerprint    │    │ │
│ │    │                   │   │                   │   │                   │    │ │
│ │    │ - BasicBlock hash │   │ - KSG graph hash  │   │ - AST hash        │    │ │
│ │    │ - CFG edge hash   │   │ - WL hash         │   │ - Normalized code │    │ │
│ │    │ - String refs     │   │ - DataFlow hash   │   │ - API sequence    │    │ │
│ │    │ - Rolling chunks  │   │ - API calls       │   │ - Pattern hash    │    │ │
│ │    └───────────────────┘   └───────────────────┘   └───────────────────┘    │ │
│ │                                                                             │ │
│ │    ┌───────────────────┐   ┌───────────────────┐                            │ │
│ │    │       BSim        │   │   ML Embedding    │                            │ │
│ │    │     Signature     │   │      Vector       │                            │ │
│ │    │                   │   │                   │                            │ │
│ │    │ - Feature vector  │   │ - 768-dim float[] │                            │ │
│ │    │ - Significance    │   │ - Cosine sim      │                            │ │
│ │    └───────────────────┘   └───────────────────┘                            │ │
│ │                                                                             │ │
│ └─────────────────────────────────────────────────────────────────────────────┘ │
│                                        │                                        │
│                                        v                                        │
│ ┌─────────────────────────────────────────────────────────────────────────────┐ │
│ │                               Matching Layer                                │ │
│ │                                                                             │ │
│ │  ┌───────────────────────────────────────────────────────────────────────┐  │ │
│ │  │                       Ensemble Decision Engine                        │  │ │
│ │  │                                                                       │  │ │
│ │  │  Signal Weights:                                                      │  │ │
│ │  │  - Instruction fingerprint: 15%                                       │  │ │
│ │  │  - Semantic graph:          25%                                       │  │ │
│ │  │  - Decompiled AST:          35%                                       │  │ │
│ │  │  - ML embedding:            25%                                       │  │ │
│ │  │                                                                       │  │ │
│ │  │  Output: Confidence-weighted similarity score                         │  │ │
│ │  │                                                                       │  │ │
│ │  └───────────────────────────────────────────────────────────────────────┘  │ │
│ │                                                                             │ │
│ └─────────────────────────────────────────────────────────────────────────────┘ │
│                                        │                                        │
│                                        v                                        │
│ ┌─────────────────────────────────────────────────────────────────────────────┐ │
│ │                                Storage Layer                                │ │
│ │                                                                             │ │
│ │    PostgreSQL                RustFS                    Valkey               │ │
│ │    - corpus.* tables         - Fingerprint blobs       - Query cache        │ │
│ │    - binaries.* tables       - Model artifacts         - Embedding index    │ │
│ │    - BSim database           - Training data                                │ │
│ │                                                                             │ │
│ └─────────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
```

---

## 3. Implementation Phases

### Phase 1: IR-Level Semantic Analysis (Foundation)

**Sprints:**
- `SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- `SPRINT_20260112_004_BINIDX_b2r2_lowuir_perf_cache.md` (Performance & Ops)

Leverage B2R2's Intermediate Representation (IR) for semantic-level function comparison.
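To make "semantic-level comparison" more concrete, here is a minimal sketch of Weisfeiler-Lehman (WL) label refinement over a small labelled dependency graph. It is only an illustration under stated assumptions: the `WlSketch` class, the node/edge shapes, the use of successor labels only, and the iteration count are all hypothetical and are not taken from the `WeisfeilerLehmanHasher` implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch; not the WeisfeilerLehmanHasher shipped in StellaOps.BinaryIndex.Semantic.
public static class WlSketch
{
    // nodes: node id -> operation label (e.g. "load", "add").
    // edges: directed dependencies between node ids.
    public static string Fingerprint(
        IReadOnlyDictionary<string, string> nodes,
        IReadOnlyCollection<(string From, string To)> edges,
        int iterations = 3)
    {
        var labels = nodes.ToDictionary(n => n.Key, n => n.Value, StringComparer.Ordinal);

        for (var i = 0; i < iterations; i++)
        {
            var next = new Dictionary<string, string>(StringComparer.Ordinal);
            foreach (var id in nodes.Keys)
            {
                // Collect successor labels and sort them ordinally so that
                // the refined label does not depend on edge enumeration order.
                var successorLabels = edges
                    .Where(e => e.From == id)
                    .Select(e => labels[e.To])
                    .OrderBy(l => l, StringComparer.Ordinal)
                    .ToList();

                next[id] = Sha256Hex(labels[id] + "|" + string.Join(",", successorLabels));
            }
            labels = next;
        }

        // The fingerprint hashes the sorted multiset of final labels.
        // Node ids (for example register names) never enter the hash.
        var sorted = labels.Values.OrderBy(l => l, StringComparer.Ordinal);
        return Sha256Hex(string.Join(";", sorted));
    }

    private static string Sha256Hex(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
}
```

Because only operation labels and graph shape feed the hash, two graphs that differ solely in node names (a stand-in for register renaming) yield the same fingerprint, while changing an operation label changes it.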

**Key Components:**
- `B2R2LowUirLiftingService` - Lifts instructions to B2R2 LowUIR, maps to Stella IR model
- `B2R2LifterPool` - Bounded pool with warm preload for lifter reuse
- `FunctionIrCacheService` - Valkey-backed cache for semantic fingerprints
- `SemanticGraphExtractor` - Build Key-Semantics Graph (KSG)
- `WeisfeilerLehmanHasher` - Graph fingerprinting
- `SemanticMatcher` - Semantic similarity scoring

**B2R2LowUirLiftingService Implementation:**
- Supports Intel, ARM, MIPS, RISC-V, PowerPC, SPARC, SH4, AVR, EVM
- Maps B2R2 LowUIR statements to `IrStatement` model
- Applies SSA numbering to temporary registers
- Deterministic block ordering (by entry address)
- InvariantCulture formatting throughout

**B2R2LifterPool Implementation:**
- Bounded per-ISA pooling (default 4 lifters/ISA)
- Warm preload at startup for common ISAs
- Per-ISA stats (pooled, active, max)
- Automatic return on dispose

**FunctionIrCacheService Implementation:**
- Cache key: `(isa, b2r2_version, normalization_recipe, canonical_ir_hash)`
- Valkey as hot cache (default 4h TTL)
- PostgreSQL persistence for fingerprint records
- Hit/miss/eviction statistics

**Ops Endpoints:**
- `GET /api/v1/ops/binaryindex/health` - Lifter warmness, cache status
- `POST /api/v1/ops/binaryindex/bench/run` - Benchmark latency
- `GET /api/v1/ops/binaryindex/cache` - Cache statistics
- `GET /api/v1/ops/binaryindex/config` - Effective configuration

**Deliverables:**
- `StellaOps.BinaryIndex.Semantic` library
- `StellaOps.BinaryIndex.Disassembly.B2R2` (LowUIR adapter, lifter pool)
- `StellaOps.BinaryIndex.Cache` (function IR cache)
- BinaryIndexOpsController
- 20+ tasks, ~3 weeks

### Phase 2: Function Behavior Corpus (Scale)

**Sprint:** `SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`

Build a comprehensive database of known library functions.
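
As a rough picture of what identifying a function against such a database could look like, the sketch below scores a query against corpus entries and returns the best match above a threshold. Everything here is an assumption for illustration: the `CorpusEntry` record, the `CorpusLookupSketch` class, the representation of a fingerprint as a set of feature hashes, and the use of Jaccard similarity are not taken from the corpus library; the `0.85` default merely echoes the `minSimilarity` value used elsewhere in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; not the CorpusQueryService API.
public sealed record CorpusEntry(string Library, string Function, IReadOnlySet<string> FeatureHashes);

public static class CorpusLookupSketch
{
    // Jaccard similarity: |A ∩ B| / |A ∪ B|; defined here as 0 when both sets are empty.
    public static double Jaccard(IReadOnlySet<string> a, IReadOnlySet<string> b)
    {
        var intersection = a.Count(x => b.Contains(x));
        var union = a.Count + b.Count - intersection;
        return union == 0 ? 0.0 : (double)intersection / union;
    }

    // Returns the best-scoring entry at or above minSimilarity, or null when none qualifies.
    public static (CorpusEntry Entry, double Score)? Identify(
        IReadOnlySet<string> query,
        IEnumerable<CorpusEntry> corpus,
        double minSimilarity = 0.85)
    {
        var best = corpus
            .Select(e => (Entry: e, Score: Jaccard(query, e.FeatureHashes)))
            .OrderByDescending(m => m.Score)
            .ThenBy(m => m.Entry.Function, StringComparer.Ordinal) // deterministic tie-break
            .FirstOrDefault();

        // FirstOrDefault yields (null, 0) for an empty corpus.
        if (best.Entry is null || best.Score < minSimilarity)
        {
            return null;
        }

        return best;
    }
}
```

A linear scan like this is only workable for small corpora; at the scale targeted by this phase, an index or clustering step would be needed in front of the exact comparison.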

**Key Components:**
- Library corpus connectors (glibc, OpenSSL, zlib, curl, SQLite)
- `CorpusIngestionService` - Batch fingerprint generation
- `FunctionClusteringService` - Group similar functions
- `CorpusQueryService` - Function identification

**Deliverables:**
- `StellaOps.BinaryIndex.Corpus` library
- PostgreSQL `corpus.*` schema
- ~30,000 indexed functions
- 22 tasks, ~4 weeks

### Phase 3: Ghidra Integration (Depth)

**Sprint:** `SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`

Add Ghidra as a secondary backend for complex cases.

**Key Components:**
- `GhidraHeadlessManager` - Process lifecycle
- `VersionTrackingService` - Multi-correlator diffing
- `GhidriffBridge` - Python interop
- `BSimService` - Behavioral similarity

**Deliverables:**
- `StellaOps.BinaryIndex.Ghidra` library
- Docker image for Ghidra Headless
- 20 tasks, ~4 weeks

### Phase 4: Decompiler & ML (Excellence)

**Sprint:** `SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`

Highest-fidelity semantic analysis.

**Key Components:**
- `IDecompilerService` - Ghidra decompilation
- `AstComparisonEngine` - Structural similarity
- `OnnxInferenceEngine` - ML embeddings
- `EnsembleDecisionEngine` - Multi-signal fusion

**Deliverables:**
- `StellaOps.BinaryIndex.Decompiler` library
- `StellaOps.BinaryIndex.ML` library
- Trained CodeBERT-Binary model
- 30 tasks, ~5 weeks

---

## 4. Fingerprint Types

### 4.1 Instruction Fingerprint (Existing)

**Algorithm:** BasicBlock hash + CFG edge hash + String refs hash

**Properties:**
- Fast to compute
- Sensitive to instruction changes
- Good for exact/near-exact matches

**Weight in ensemble:** 15%

### 4.2 Semantic Fingerprint (Phase 1)

**Algorithm:** Key-Semantics Graph + Weisfeiler-Lehman hash

**Properties:**
- Captures data/control dependencies
- Resilient to register renaming
- Resilient to instruction reordering

**Weight in ensemble:** 25%

### 4.3 Decompiled Fingerprint (Phase 4)

**Algorithm:** Normalized AST hash + Pattern detection

**Properties:**
- Highest semantic fidelity
- Captures algorithmic structure
- Resilient to most optimizations

**Weight in ensemble:** 35%

### 4.4 ML Embedding (Phase 4)

**Algorithm:** CodeBERT-Binary transformer, 768-dim vectors

**Properties:**
- Learned similarity metric
- Captures latent patterns
- Resilient to obfuscation

**Weight in ensemble:** 25%

---

## 5. Matching Pipeline

```mermaid
sequenceDiagram
    participant Client
    participant DiffEngine as PatchDiffEngine
    participant B2R2
    participant Ghidra
    participant Corpus
    participant Ensemble

    Client->>DiffEngine: Compare(oldBinary, newBinary)

    par Parallel Analysis
        DiffEngine->>B2R2: Disassemble + IR lift
        DiffEngine->>Ghidra: Decompile (if needed)
    end

    B2R2-->>DiffEngine: SemanticFingerprints[]
    Ghidra-->>DiffEngine: DecompiledFunctions[]

    DiffEngine->>Corpus: IdentifyFunctions(fingerprints)
    Corpus-->>DiffEngine: FunctionMatches[]

    DiffEngine->>Ensemble: ComputeSimilarity(old, new)
    Ensemble-->>DiffEngine: EnsembleResult

    DiffEngine-->>Client: PatchDiffResult
```

---

## 6. Fallback Strategy

The system uses a tiered fallback strategy:

```
Tier 1: B2R2 IR + Semantic Graph (fast, ~90% coverage)
    │
    │ If confidence < threshold OR architecture unsupported
    v
Tier 2: Ghidra Version Tracking (slower, ~95% coverage)
    │
    │ If function is high-value (CVE-relevant)
    v
Tier 3: Decompiled AST + ML Embedding (slowest, ~99% coverage)
```

**Selection Criteria:**

| Condition | Backend | Reason |
|-----------|---------|--------|
| Standard x64/ARM64 binary | B2R2 only | Fast, accurate |
| Low B2R2 confidence (<0.7) | B2R2 + Ghidra | Validation |
| Exotic architecture | Ghidra only | Better coverage |
| CVE-affected function | Full pipeline | Maximum accuracy |
| Obfuscated binary | ML embedding | Obfuscation resilience |

---

## 7. Corpus Coverage

### Priority Libraries

| Library | Priority | Functions | CVEs |
|---------|----------|-----------|------|
| glibc | Critical | ~15,000 | 50+ |
| OpenSSL | Critical | ~8,000 | 100+ |
| zlib | High | ~200 | 5+ |
| libcurl | High | ~2,000 | 80+ |
| SQLite | High | ~1,500 | 30+ |
| libxml2 | Medium | ~1,200 | 40+ |
| libpng | Medium | ~300 | 10+ |
| expat | Medium | ~150 | 15+ |

### Architecture Coverage

| Architecture | B2R2 | Ghidra | Status |
|--------------|------|--------|--------|
| x86_64 | Excellent | Excellent | Primary |
| ARM64 | Excellent | Excellent | Primary |
| ARM32 | Good | Excellent | Secondary |
| MIPS32 | Fair | Excellent | Fallback |
| MIPS64 | Fair | Excellent | Fallback |
| RISC-V | Good | Good | Emerging |
| PPC32/64 | Fair | Excellent | Fallback |

---

## 8. Performance Characteristics

### Latency Budget

| Operation | Target | Notes |
|-----------|--------|-------|
| B2R2 disassembly | <100ms | Per function |
| IR lifting | <50ms | Per function |
| Semantic fingerprint | <50ms | Per function |
| Ghidra analysis | <30s | Per binary (startup) |
| Decompilation | <500ms | Per function |
| ML inference | <100ms | Per function |
| Ensemble decision | <10ms | Per comparison |
| **Total (Tier 1)** | **<200ms** | Per function |
| **Total (Full)** | **<1s** | Per function |

### Memory Budget

| Component | Memory | Notes |
|-----------|--------|-------|
| B2R2 per binary | ~100MB | Scales with binary size |
| Ghidra per project | ~2GB | Persistent cache |
| ML model | ~500MB | ONNX loaded |
| Corpus query cache | ~100MB | LRU eviction |

---

## 9. Integration Points

### 9.1 Scanner Integration

```csharp
// Scanner.Worker uses semantic diffing for binary vulnerability detection
var result = await _binaryVulnerabilityService.LookupByFingerprintAsync(
    fingerprint,
    minSimilarity: 0.85m,
    useSemanticMatching: true,  // Enable semantic diffing
    ct);
```

### 9.2 PatchDiffEngine Enhancement

```csharp
// PatchDiffEngine now includes semantic comparison
var diff = await _patchDiffEngine.DiffAsync(
    vulnerableBinary,
    patchedBinary,
    new PatchDiffOptions
    {
        UseSemanticAnalysis = true,
        SemanticThreshold = 0.7m,
        IncludeDecompilation = true,
        IncludeMlEmbedding = true
    },
    ct);
```

### 9.3 DeltaSignature Enhancement

```csharp
// Delta signatures now include semantic fingerprints
var signature = await _deltaSignatureGenerator.GenerateSignaturesAsync(
    binaryStream,
    new DeltaSignatureRequest
    {
        Cve = "CVE-2024-1234",
        TargetSymbols = ["vulnerable_func"],
        IncludeSemanticFingerprint = true,
        IncludeDecompiledHash = true
    },
    ct);
```

---

## 10. Security Considerations

### 10.1 Sandbox Requirements

All binary analysis runs in sandboxed environments:
- Seccomp profile restricting syscalls
- Read-only root filesystem
- No network access during analysis
- Memory/CPU limits

### 10.2 Model Security

ML models are:
- Signed with DSSE attestations
- Verified before loading
- Not user-uploadable (pre-trained only)

### 10.3 Corpus Integrity

Corpus data is:
- Ingested from trusted sources only
- Signed at snapshot level
- Version-controlled with audit trail

---

## 11. Configuration

```yaml
# binaryindex.yaml - Semantic diffing configuration
binaryindex:
  semantic_diffing:
    enabled: true

    # Analysis backends
    backends:
      b2r2:
        enabled: true
        ir_lifting: true
        semantic_graph: true
      ghidra:
        enabled: true
        fallback_only: true
        min_b2r2_confidence: 0.7
        headless_timeout_ms: 30000
      decompiler:
        enabled: true
        high_value_only: true  # Only for CVE-affected functions
      ml:
        enabled: true
        model_path: /models/codebert_binary_v1.onnx
        embedding_dimension: 768

    # Ensemble weights
    ensemble:
      instruction_weight: 0.15
      semantic_weight: 0.25
      decompiled_weight: 0.35
      ml_weight: 0.25
      min_confidence: 0.6

    # Corpus
    corpus:
      auto_update: true
      update_interval_hours: 24
      libraries:
        - glibc
        - openssl
        - zlib
        - curl
        - sqlite

    # Performance
    performance:
      max_parallel_analyses: 4
      cache_ttl_seconds: 3600
      max_function_size_bytes: 1048576  # 1MB
```

Additional appsettings sections (case-insensitive):
- `BinaryIndex:B2R2Pool` - lifter pool sizing and warm ISA list.
- `BinaryIndex:SemanticLifting` - LowUIR enablement and deterministic controls.
- `BinaryIndex:FunctionCache` - Valkey function cache configuration.
- `Postgres:BinaryIndex` - persistence for canonical IR fingerprints.

---

## 12. Metrics & Observability

### Ops Endpoints

BinaryIndex exposes read-only ops endpoints for health, bench, cache, and effective configuration:
- GET `/api/v1/ops/binaryindex/health` -> BinaryIndexOpsHealthResponse
- POST `/api/v1/ops/binaryindex/bench/run` -> BinaryIndexBenchResponse
- GET `/api/v1/ops/binaryindex/cache` -> BinaryIndexFunctionCacheStats
- GET `/api/v1/ops/binaryindex/config` -> BinaryIndexEffectiveConfig

### Metrics

| Metric | Type | Labels |
|--------|------|--------|
| `semantic_diffing_analysis_total` | Counter | backend, result |
| `semantic_diffing_latency_ms` | Histogram | backend, tier |
| `semantic_diffing_accuracy` | Gauge | comparison_type |
| `corpus_functions_total` | Gauge | library |
| `ml_inference_latency_ms` | Histogram | model |
| `ensemble_signal_weight` | Gauge | signal_type |

### Traces

- `semantic_diffing.analyze` - Full analysis span
- `semantic_diffing.b2r2.lift` - IR lifting
- `semantic_diffing.ghidra.decompile` - Decompilation
- `semantic_diffing.ml.inference` - ML embedding
- `semantic_diffing.ensemble.decide` - Ensemble decision

---

## 13. Testing Strategy

### Unit Tests

| Test Suite | Coverage |
|------------|----------|
| `IrLiftingServiceTests` | IR lifting correctness |
| `SemanticGraphExtractorTests` | Graph construction |
| `WeisfeilerLehmanHasherTests` | Hash stability |
| `AstComparisonEngineTests` | AST similarity |
| `OnnxInferenceEngineTests` | ML inference |
| `EnsembleDecisionEngineTests` | Weight combination |

### Integration Tests

| Test Suite | Coverage |
|------------|----------|
| `EndToEndSemanticDiffTests` | Full pipeline |
| `OptimizationResilienceTests` | O0 vs O2 vs O3 |
| `CompilerVariantTests` | GCC vs Clang |
| `GhidraFallbackTests` | Fallback scenarios |

### Golden Corpus Tests

Pre-computed test cases with known results:
- 100 CVE patch pairs (vulnerable -> fixed)
- 50 optimization variant sets
- 25 compiler variant sets
- 25 obfuscation variant sets

---

## 14. Roadmap

| Phase | Status | ETA | Impact |
|-------|--------|-----|--------|
| Phase 1: IR Semantics | Implemented (B2R2 IR lifting) | 2026-01-24 | +15% accuracy |
| Phase 2: Corpus | Planned | 2026-02-15 | +10% coverage |
| Phase 3: Ghidra | Planned | 2026-02-28 | +5% edge cases |
| Phase 4: Decompiler/ML | Planned | 2026-03-31 | +10% obfuscation |
| **Total** | | | **+35-40%** |

---

## 15. Delta-Sig Predicate Attestation

**Sprint Reference**: `SPRINT_20260117_003_BINDEX_delta_sig_predicate`

Delta-sig predicates provide a supply chain attestation format for binary patches, enabling policy-gated releases based on function-level change scope.

### 15.1 Predicate Structure

```jsonc
{
  "_type": "https://in-toto.io/Statement/v1",
  "predicateType": "https://stellaops.io/delta-sig/v1",
  "subject": [
    {
      "name": "libexample-1.1.so",
      "digest": { "sha256": "abc123..." }
    }
  ],
  "predicate": {
    "before": {
      "name": "libexample-1.0.so",
      "digest": { "sha256": "def456..." }
    },
    "after": {
      "name": "libexample-1.1.so",
      "digest": { "sha256": "abc123..." }
    },
    "diff": [
      {
        "function": "process_input",
        "changeType": "modified",
        "beforeHash": "sha256:old...",
        "afterHash": "sha256:new...",
        "bytesDelta": 48,
        "semanticSimilarity": 0.87
      },
      {
        "function": "new_handler",
        "changeType": "added",
        "afterHash": "sha256:new...",
        "bytesDelta": 256
      }
    ],
    "summary": {
      "functionsAdded": 1,
      "functionsRemoved": 0,
      "functionsModified": 1,
      "totalBytesChanged": 304
    },
    "timestamp": "2026-01-16T12:00:00Z"
  }
}
```

### 15.2 Policy Gate Integration

The `DeltaScopePolicyGate` enforces limits on patch scope:

```yaml
policy:
  deltaSig:
    maxAddedFunctions: 10
    maxRemovedFunctions: 5
    maxModifiedFunctions: 20
    maxBytesChanged: 50000
    minSemanticSimilarity: 0.5
    requireSemanticAnalysis: false
```

### 15.3 Attestor Integration

Delta-sig predicates integrate with the Attestor module:

1. **Generate** - Create predicate from before/after binary analysis
2. **Sign** - Create DSSE envelope with cosign/fulcio signature
3. **Submit** - Log to Rekor transparency log
4. **Verify** - Validate signature and inclusion proof

### 15.4 CLI Commands

```bash
# Generate delta-sig predicate
stella binary diff --before old.so --after new.so --output delta.json

# Generate and attest in one step
stella binary attest --before old.so --after new.so --sign --rekor

# Verify attestation
stella binary verify --predicate delta.json --signature sig.dsse

# Check against policy gate
stella binary gate --predicate delta.json --policy policy.yaml
```

### 15.5 Semantic Similarity Scoring

When `requireSemanticAnalysis` is enabled, the gate also checks each modified function's `semanticSimilarity` score, interpreted as follows:

| Threshold | Meaning |
|-----------|---------|
| > 0.9 | Near-identical (cosmetic changes) |
| 0.7 - 0.9 | Similar (refactoring, optimization) |
| 0.5 - 0.7 | Moderate changes (significant logic) |
| < 0.5 | Major rewrite (requires review) |

### 15.6 Evidence Storage

Delta-sig predicates are stored in the Evidence Locker and can be included in portable bundles for air-gapped verification.

---

## 16. References

### Internal

- `docs/modules/binary-index/architecture.md`
- `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.DeltaSig/`
- `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Fingerprints/`

### External

- [B2R2 Binary Analysis Framework](https://b2r2.org/)
- [Ghidra Patch Diffing Guide](https://cve-north-stars.github.io/docs/Ghidra-Patch-Diffing)
- [ghidriff Tool](https://github.com/clearbluejar/ghidriff)
- [SemDiff Paper (arXiv)](https://arxiv.org/abs/2308.01463)
- [SEI Semantic Equivalence Research](https://www.sei.cmu.edu/annual-reviews/2022-research-review/semantic-equivalence-checking-of-decompiled-binaries/)
- [in-toto Attestation Framework](https://in-toto.io/)
- [SLSA Provenance Spec](https://slsa.dev/provenance/v1)

---

## 17. B2R2 Troubleshooting Guide

This section covers common issues and resolutions when using B2R2 for IR lifting.

### 17.1 Lifting Failures

**Symptom:** `B2R2LiftingException: Failed to lift function at address 0x...`

**Common Causes:**
1. **Unsupported instruction** - B2R2 may not recognize certain instructions
2. **Invalid entry point** - Function address is not a valid entry point
3. **Obfuscated code** - Heavy obfuscation defeats parsing

**Resolution:**

```csharp
// Check if architecture is supported before lifting
if (!_liftingService.SupportsArchitecture(binary.Architecture))
{
    // Fall back to disassembly-only mode
    return await _disassemblyService.DisassembleAsync(binary, ct);
}

// Use try-lift with fallback
var result = await _liftingService.TryLiftWithFallbackAsync(
    binary,
    new LiftingOptions { FallbackToDisassembly = true },
    ct);
```

### 17.2 Memory Issues

**Symptom:** `OutOfMemoryException` during lifting of large binaries

**Common Causes:**
1. **Pool exhaustion** - Too many concurrent lifter instances
2. **Large function** - Single function exceeds memory budget
3. **Memory leak** - Lifter instances not properly disposed

**Resolution:**

```yaml
# Adjust pool configuration in appsettings.yaml
BinaryIndex:
  B2R2Pool:
    MaxInstancesPerIsa: 4          # Reduce if OOM
    RecycleAfterOperations: 1000   # Force recycle more often
    MaxFunctionSizeBytes: 1048576  # Skip very large functions
```

### 17.3 Performance Issues

**Symptom:** Lifting takes longer than expected (>30s for small binaries)

**Common Causes:**
1. **Cold pool** - No warm lifter instances available
2. **Complex CFG** - Function has extremely complex control flow
3. **Cache misses** - IR cache not configured or full

**Resolution:**

```csharp
// Ensure pool is warmed at startup
await _lifterPool.WarmAsync(new[] { ISA.AMD64, ISA.ARM64 }, ct);

// Check cache health
var stats = await _cacheService.GetStatisticsAsync(ct);
if (stats.HitRate < 0.5)
{
    _logger.LogWarning("Low cache hit rate: {HitRate:P}", stats.HitRate);
}
```

### 17.4 Determinism Issues

**Symptom:** Same binary produces different IR hashes on repeated lifts

**Common Causes:**
1. **Non-deterministic block ordering** - Blocks not sorted by address
2. **Timestamp inclusion** - IR includes lift timestamp
3. **B2R2 version mismatch** - Different versions produce different IR

**Resolution:**
- Ensure `InvariantCulture` is used for all string formatting
- Sort basic blocks by entry address before hashing
- Include B2R2 version in cache keys
- Use `DeterministicHash` utility for consistent hashing

### 17.5 Architecture Detection Issues

**Symptom:** Wrong architecture selected for multi-arch binary (fat binary)

**Common Causes:**
1. **Universal binary** - macOS fat binaries contain multiple architectures
2. **ELF with multiple ABIs** - Rare but possible

**Resolution:**

```csharp
// Explicitly specify target architecture
var liftOptions = new LiftingOptions
{
    TargetArchitecture = ISA.AMD64,  // Force x86-64
    IgnoreOtherArchitectures = true
};
```

### 17.6 LowUIR Mapping Issues

**Symptom:** Specific B2R2 LowUIR statements not mapped correctly

**Reference: LowUIR Statement Type Mapping**

| B2R2 LowUIR | Stella IR Model | Notes |
|-------------|-----------------|-------|
| `LMark` | `IrLabel` | Block label markers |
| `Put` | `IrAssignment` | Register write |
| `Store` | `IrStore` | Memory write |
| `InterJmp` | `IrJump` | Cross-function jump |
| `IntraJmp` | `IrJump` | Intra-function jump |
| `InterCJmp` | `IrConditionalJump` | Cross-function conditional |
| `IntraCJmp` | `IrConditionalJump` | Intra-function conditional |
| `SideEffect` | `IrCall`/`IrReturn` | Function calls, returns |
| `Def`/`Use`/`Phi` | `IrPhi` | SSA form constructs |

### 17.7 Diagnostic Commands

```bash
# Check B2R2 health
stella ops binaryindex health --verbose

# Run benchmark suite
stella ops binaryindex bench --iterations 100 --binary sample.so

# View cache statistics
stella ops binaryindex cache --stats

# Dump effective configuration
stella ops binaryindex config
```

---

*Document Version: 1.1.0*
*Last Updated: 2026-01-19*