Call-Ngram Fingerprinting for Binary Similarity Analysis
Module
BinaryIndex
Status
IMPLEMENTED
Description
Call-sequence n-gram extraction from lifted IR for improved cross-compiler binary similarity matching. Generates n-grams (n=2,3,4) from function call sequences and integrates them into the semantic fingerprint pipeline as one of four weighted dimensions. The weights are configurable; the defaults are instruction 0.4, CFG 0.3, call-ngram 0.2, semantic 0.1.
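As a rough illustration of the extraction step, the Python sketch below builds every contiguous 2-, 3-, and 4-gram from an ordered list of callee names. It is not the C# implementation in CallNgramGenerator.cs: the function name `call_ngrams` and the sample call sequence are hypothetical, and the sketch keeps n-grams as tuples where the real fingerprint stores hashes.

```python
from typing import List, Set, Tuple

Ngram = Tuple[str, ...]


def call_ngrams(calls: List[str], sizes: Tuple[int, ...] = (2, 3, 4)) -> Set[Ngram]:
    """Collect every contiguous n-gram of callee names for each n in `sizes`."""
    grams: Set[Ngram] = set()
    for n in sizes:
        # A sequence shorter than n contributes no n-grams of that size.
        for start in range(len(calls) - n + 1):
            grams.add(tuple(calls[start:start + n]))
    return grams


# Hypothetical call sequence recovered from a lifted function.
calls = ["malloc", "memcpy", "strlen", "free"]
grams = call_ngrams(calls)

for gram in sorted(grams, key=lambda g: (len(g), g)):
    print(" -> ".join(gram))
```

Keying on the order of calls rather than on the surrounding instructions is what makes this dimension comparatively insensitive to compiler-specific instruction selection.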
Implementation Details
- Modules: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- Key Classes:
  - `CallNgramGenerator` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/CallNgramGenerator.cs`) - generates `CallNgramFingerprint` from `LiftedFunction` call sequences; computes Jaccard similarity between fingerprints
  - `CallNgramFingerprint` (record in same file) - contains n-gram hash sets and metadata; has `Empty` sentinel for functions without calls
- Interfaces:
  - `ICallNgramGenerator` (defined in `CallNgramGenerator.cs`) - `Generate(LiftedFunction)` and `ComputeSimilarity(CallNgramFingerprint, CallNgramFingerprint)`
- Integration: Used by `EnsembleDecisionEngine` and `FunctionAnalysisBuilder` as one of the matching dimensions with 0.2 default weight
- Source: SPRINT_20260118_026_BinaryIndex_deltasig_enhancements.md
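The similarity and integration steps can be sketched as follows, in Python rather than the project's C#. Two points are assumptions and are not confirmed by this document: that the ensemble combines its dimensions as a plain weighted sum, and that two empty n-gram sets score 0.0. The names `jaccard`, `ensemble_score`, and `DEFAULT_WEIGHTS` are hypothetical; only the weight values come from the description above.

```python
from typing import Dict, Hashable, Set

# Default dimension weights as listed in the feature description.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "instruction": 0.4,
    "cfg": 0.3,
    "call_ngram": 0.2,
    "semantic": 0.1,
}


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard similarity |A & B| / |A | B| between two n-gram hash sets."""
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are treated as dissimilar.
        return 0.0
    return len(a & b) / len(union)


def ensemble_score(scores: Dict[str, float],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Assumed combination rule: weighted sum of per-dimension similarities."""
    # A dimension missing from `scores` contributes nothing to the total.
    return sum(weight * scores.get(name, 0.0) for name, weight in weights.items())


# Hypothetical per-dimension similarities for one candidate match.
call_sim = jaccard({101, 202, 303}, {202, 303, 404})
total = ensemble_score(
    {"instruction": 0.9, "cfg": 0.8, "call_ngram": call_sim, "semantic": 0.7}
)
print(f"call-ngram similarity={call_sim:.2f}, ensemble score={total:.2f}")
```

Because the default weights sum to 1.0, the combined score stays in the same 0 to 1 range as the individual similarities under this assumed rule.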
E2E Test Plan
- Generate call-ngram fingerprint from a function with known call sequences and verify correct n-gram extraction (n=2,3,4)
- Compute similarity between identical call sequences and verify similarity = 1.0
- Compute similarity between disjoint call sequences and verify similarity = 0.0
- Verify `CallNgramFingerprint.Empty` is returned for functions without call instructions
- Verify call-ngram dimension integrates into ensemble scoring with configurable weight (default 0.2)
- Verify cross-compiler similarity: same source compiled with GCC vs Clang should produce similar call n-grams
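The checks in this plan could be prototyped along the lines below before they are written against the real `ICallNgramGenerator`. This is a self-contained Python sketch, not the project's test code: `fingerprint`, `similarity`, `EMPTY`, and both call sequences are hypothetical stand-ins, and the "GCC" and "Clang" sequences are invented to show one compiler emitting an extra call.

```python
from typing import FrozenSet, List, Tuple

Ngram = Tuple[str, ...]
EMPTY: FrozenSet[Ngram] = frozenset()  # stand-in for the Empty sentinel


def fingerprint(calls: List[str]) -> FrozenSet[Ngram]:
    """Set of all contiguous 2-, 3-, and 4-grams of the call sequence."""
    return frozenset(
        tuple(calls[i:i + n])
        for n in (2, 3, 4)
        for i in range(len(calls) - n + 1)
    )


def similarity(a: FrozenSet[Ngram], b: FrozenSet[Ngram]) -> float:
    """Jaccard similarity; 0.0 for two empty sets is an assumption."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def run_checks() -> float:
    seq = ["open", "read", "parse", "close"]
    # Identical call sequences must score exactly 1.0.
    assert similarity(fingerprint(seq), fingerprint(seq)) == 1.0
    # Sequences with no shared n-gram must score exactly 0.0.
    assert similarity(fingerprint(seq), fingerprint(["socket", "bind", "listen"])) == 0.0
    # A function without calls yields the empty fingerprint.
    assert fingerprint([]) == EMPTY
    # Invented cross-compiler pair: the second build keeps one extra call.
    gcc_calls = seq
    clang_calls = ["open", "read", "parse", "log", "close"]
    cross = similarity(fingerprint(gcc_calls), fingerprint(clang_calls))
    assert 0.0 < cross < 1.0
    return cross


print(f"cross-compiler similarity: {run_checks():.2f}")
```

A single inserted call breaks every n-gram that spans it, so the real cross-compiler test will probably need an empirically chosen threshold rather than an expectation of near-identical fingerprints.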