# Call-Ngram Fingerprinting for Binary Similarity Analysis

## Module
BinaryIndex

## Status
IMPLEMENTED

## Description
Call-sequence n-gram extraction from lifted IR for improved cross-compiler binary similarity matching. Generates n-grams (n=2,3,4) from function call sequences and integrates into the semantic fingerprint pipeline with configurable dimension weights (instruction 0.4, CFG 0.3, call-ngram 0.2, semantic 0.1).

## Implementation Details
- **Modules**: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- **Key Classes**:
  - `CallNgramGenerator` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/CallNgramGenerator.cs`) - generates `CallNgramFingerprint` from `LiftedFunction` call sequences; computes Jaccard similarity between fingerprints
  - `CallNgramFingerprint` (record in same file) - contains n-gram hash sets and metadata; has `Empty` sentinel for functions without calls
- **Interfaces**: `ICallNgramGenerator` (defined in `CallNgramGenerator.cs`) - `Generate(LiftedFunction)` and `ComputeSimilarity(CallNgramFingerprint, CallNgramFingerprint)`
- **Integration**: Used by `EnsembleDecisionEngine` and `FunctionAnalysisBuilder` as one of the matching dimensions with 0.2 default weight
- **Source**: SPRINT_20260118_026_BinaryIndex_deltasig_enhancements.md

## E2E Test Plan
- [ ] Generate call-ngram fingerprint from a function with known call sequences and verify correct n-gram extraction (n=2,3,4)
- [ ] Compute similarity between identical call sequences and verify similarity = 1.0
- [ ] Compute similarity between disjoint call sequences and verify similarity = 0.0
- [ ] Verify `CallNgramFingerprint.Empty` is returned for functions without call instructions
- [ ] Verify call-ngram dimension integrates into ensemble scoring with configurable weight (default 0.2)
- [ ] Verify cross-compiler similarity: same source compiled with GCC vs Clang should produce similar call n-grams
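
For orientation, the technique named in the Description (n-grams with n=2,3,4 over a call sequence, Jaccard similarity over the resulting hash sets, and a weighted contribution to an ensemble score) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `CallNgramGenerator.cs`. The function names (`call_ngrams`, `fingerprint`, `jaccard`, `ensemble_score`), the use of plain strings for call targets, the truncated SHA-256 hashing, the treatment of empty fingerprints, and the weighted-average combination are all hypothetical choices made for the example. Only the n values and the default weights come from this document.

```python
import hashlib
from typing import Dict, FrozenSet, List, Sequence, Tuple

NGRAM_SIZES = (2, 3, 4)  # n values named in the Description

# Default dimension weights named in the Description.
DEFAULT_WEIGHTS = {"instruction": 0.4, "cfg": 0.3, "call_ngram": 0.2, "semantic": 0.1}


def call_ngrams(calls: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n call targets, in order."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def hash_ngram(ngram: Tuple[str, ...]) -> str:
    """Hash one n-gram. Assumption: truncated SHA-256 over NUL-joined names."""
    joined = "\x00".join(ngram).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()[:16]


def fingerprint(calls: Sequence[str], sizes=NGRAM_SIZES) -> FrozenSet[str]:
    """Set of n-gram hashes for all sizes; empty when there are fewer than 2 calls."""
    return frozenset(hash_ngram(g) for n in sizes for g in call_ngrams(calls, n))


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|.

    Assumption: an empty fingerprint carries no evidence, so the score is 0.0.
    The real ComputeSimilarity may handle the Empty sentinel differently.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def ensemble_score(scores: Dict[str, float], weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Assumption: dimensions are combined as a normalized weighted average."""
    total = sum(weights.values())
    return sum(w * scores.get(name, 0.0) for name, w in weights.items()) / total


if __name__ == "__main__":
    gcc_like = ["malloc", "memcpy", "free"]
    clang_like = ["malloc", "memcpy", "memset", "free"]  # one extra call inserted
    sim = jaccard(fingerprint(gcc_like), fingerprint(clang_like))
    print("call-ngram similarity:", sim)
    print("ensemble:", ensemble_score(
        {"instruction": 0.9, "cfg": 0.8, "call_ngram": sim, "semantic": 0.7}))
```

Hashing each n-gram keeps the fingerprint a flat set of fixed-size values, so comparing two functions reduces to set intersection and union. In this sketch a single inserted call changes every n-gram that spans it, which is why short call sequences are sensitive to small compiler differences.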