# Call-Ngram Fingerprinting for Binary Similarity Analysis

## Module
BinaryIndex

## Status
PARTIALLY_IMPLEMENTED

## Description
Call-sequence n-gram extraction from lifted IR for improved cross-compiler binary similarity matching. Generates n-grams (n=2,3,4) from function call sequences and integrates into the semantic fingerprint pipeline with configurable dimension weights (instruction 0.4, CFG 0.3, call-ngram 0.2, semantic 0.1).

## Implementation Details
- **Modules**: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- **Key Classes**:
  - `CallNgramGenerator` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/CallNgramGenerator.cs`) - generates `CallNgramFingerprint` from `LiftedFunction` call sequences; computes Jaccard similarity between fingerprints
  - `CallNgramFingerprint` (record in same file) - contains n-gram hash sets and metadata; has `Empty` sentinel for functions without calls
- **Interfaces**: `ICallNgramGenerator` (defined in `CallNgramGenerator.cs`) - `Generate(LiftedFunction)` and `ComputeSimilarity(CallNgramFingerprint, CallNgramFingerprint)`
- **Integration**: Used by `EnsembleDecisionEngine` and `FunctionAnalysisBuilder` as one of the matching dimensions with 0.2 default weight
- **Source**: SPRINT_20260118_026_BinaryIndex_deltasig_enhancements.md

## E2E Test Plan
- [ ] Generate call-ngram fingerprint from a function with known call sequences and verify correct n-gram extraction (n=2,3,4)
- [ ] Compute similarity between identical call sequences and verify similarity = 1.0
- [ ] Compute similarity between disjoint call sequences and verify similarity = 0.0
- [ ] Verify `CallNgramFingerprint.Empty` is returned for functions without call instructions
- [ ] Verify call-ngram dimension integrates into ensemble scoring with configurable weight (default 0.2)
- [ ] Verify cross-compiler similarity: same source compiled with GCC vs Clang should produce similar call n-grams

## Implementation Gaps (QA 2026-02-11)
- `CallNgramGenerator` is present, but no dedicated call-ngram behavioral tests exist in `src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/`.
- `EnsembleDecisionEngine` currently combines syntactic/semantic/embedding signals only and does not expose a call-ngram signal or the claimed default 0.2 call-ngram weight path.
- `FunctionAnalysisBuilder` does not compute or attach call-ngram fingerprints into the ensemble analysis pipeline.
- `CallNgramOptions.MinCallCount` is not enforced in generator output flow.

## Verification Outcome
- Tier 0/1/2 artifacts: `docs/qa/feature-checks/runs/binaryindex/call-ngram-fingerprinting-for-binary-similarity-analysis/run-001/`.
- Result: not implemented at claim parity.
- Missing behavior:
  - `CallNgramGenerator` exists, but no first-class integration path wires call-ngram fingerprints into `FunctionAnalysisBuilder` outputs.
  - `EnsembleDecisionEngine` and `EnsembleOptions` expose only syntactic/semantic/embedding dimensions; no call-ngram dimension with claimed default weight.
  - No dedicated call-ngram generator behavioral tests verify n=2/3/4 extraction and similarity semantics as described by the dossier.
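
For reference when writing the missing behavioral tests, the extraction and similarity semantics listed in the E2E Test Plan can be illustrated with a minimal, language-neutral sketch. This is not the `CallNgramGenerator` implementation: the function names `call_ngrams` and `jaccard` are hypothetical, n-grams are kept as plain tuples of callee names rather than hashes, and returning 0.0 when both sets are empty is an assumption the dossier does not state.

```python
from typing import FrozenSet, Sequence, Tuple

Ngram = Tuple[str, ...]


def call_ngrams(calls: Sequence[str], sizes: Sequence[int] = (2, 3, 4)) -> FrozenSet[Ngram]:
    """Collect every contiguous n-gram of the call sequence for each n in sizes.

    Hypothetical helper: a sequence shorter than n simply contributes no
    n-grams of that size, so a function without calls yields an empty set.
    """
    grams = set()
    for n in sizes:
        for start in range(len(calls) - n + 1):
            grams.add(tuple(calls[start:start + n]))
    return frozenset(grams)


def jaccard(a: FrozenSet[Ngram], b: FrozenSet[Ngram]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two n-gram sets."""
    if not a and not b:
        # Assumption: two empty fingerprints are treated as dissimilar (0.0).
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    gcc_calls = ["malloc", "memcpy", "free"]
    clang_calls = ["malloc", "memcpy", "close"]
    print(sorted(call_ngrams(gcc_calls)))
    print(jaccard(call_ngrams(gcc_calls), call_ngrams(clang_calls)))
```

In this sketch a three-call sequence produces two 2-grams, one 3-gram, and no 4-grams. Identical sequences score 1.0 and sequences with no shared n-grams score 0.0, which matches the expectations in the test plan.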