Call-Ngram Fingerprinting for Binary Similarity Analysis
Module
BinaryIndex
Status
PARTIALLY_IMPLEMENTED
Description
Call-sequence n-gram extraction from lifted IR for improved cross-compiler binary similarity matching. Generates n-grams (n=2,3,4) from function call sequences and integrates into the semantic fingerprint pipeline with configurable dimension weights (instruction 0.4, CFG 0.3, call-ngram 0.2, semantic 0.1).
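The weighting described above can be illustrated with a short sketch. It is written in Python for brevity and is not the project's C# code; the names `DEFAULT_WEIGHTS` and `combined_score` are hypothetical, and renormalising when a dimension is absent is an assumption of this sketch rather than documented behavior. Only the four weight values come from the description.

```python
from typing import Mapping

# Default dimension weights as listed in the description; they sum to 1.0.
DEFAULT_WEIGHTS = {
    "instruction": 0.4,
    "cfg": 0.3,
    "call_ngram": 0.2,
    "semantic": 0.1,
}


def combined_score(scores: Mapping[str, float],
                   weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-dimension similarities, each in [0, 1].

    Dimensions missing from `scores` are skipped and the remaining weights
    are renormalised (an assumption of this sketch).
    """
    used = {name: w for name, w in weights.items() if name in scores}
    total = sum(used.values())
    if total == 0:
        return 0.0
    return sum(scores[name] * w for name, w in used.items()) / total


if __name__ == "__main__":
    example = {"instruction": 0.9, "cfg": 0.8, "call_ngram": 0.5, "semantic": 1.0}
    print(round(combined_score(example), 2))
```

With the example scores, the call-ngram dimension contributes 0.5 × 0.2 = 0.10 to a combined score of roughly 0.80.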
Implementation Details
- Modules: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- Key Classes:
  - `CallNgramGenerator` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/CallNgramGenerator.cs`) - generates `CallNgramFingerprint` from `LiftedFunction` call sequences; computes Jaccard similarity between fingerprints
  - `CallNgramFingerprint` (record in same file) - contains n-gram hash sets and metadata; has `Empty` sentinel for functions without calls
- Interfaces: `ICallNgramGenerator` (defined in `CallNgramGenerator.cs`) - `Generate(LiftedFunction)` and `ComputeSimilarity(CallNgramFingerprint, CallNgramFingerprint)`
- Integration: Used by `EnsembleDecisionEngine` and `FunctionAnalysisBuilder` as one of the matching dimensions with 0.2 default weight
- Source: SPRINT_20260118_026_BinaryIndex_deltasig_enhancements.md
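As a rough illustration of the generator and similarity behavior listed above, the following Python sketch builds n-gram hash sets (n=2,3,4) from a call sequence and compares them with Jaccard similarity. It is not the C# implementation: `Fingerprint`, `generate`, and `similarity` are hypothetical stand-ins, the SHA-256 truncation is an arbitrary choice, and returning 0.0 when either side has no n-grams is an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical stand-in for CallNgramFingerprint."""
    hashes: FrozenSet[str] = frozenset()
    call_count: int = 0


# Sentinel for functions without calls (mirrors the documented Empty sentinel).
EMPTY = Fingerprint()


def generate(calls: Sequence[str], sizes=(2, 3, 4)) -> Fingerprint:
    """Hash every contiguous window of n callee names for each n in `sizes`."""
    if not calls:
        return EMPTY
    hashes = set()
    for n in sizes:
        for i in range(len(calls) - n + 1):
            # NUL separator keeps ("ab", "c") distinct from ("a", "bc").
            gram = "\x00".join(calls[i:i + n])
            hashes.add(hashlib.sha256(gram.encode("utf-8")).hexdigest()[:16])
    return Fingerprint(frozenset(hashes), len(calls))


def similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the n-gram hash sets."""
    if not a.hashes or not b.hashes:
        return 0.0  # assumption: no n-grams means no evidence of similarity
    return len(a.hashes & b.hashes) / len(a.hashes | b.hashes)


if __name__ == "__main__":
    a = generate(["open", "read", "close"])
    b = generate(["open", "read", "write", "close"])
    print(similarity(a, a), similarity(a, b))
```

In the example, the first sequence yields 3 n-grams and the second yields 6; they share only `(open, read)`, so the Jaccard similarity is 1/8 = 0.125, while comparing a fingerprint with itself gives 1.0.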
E2E Test Plan
- Generate call-ngram fingerprint from a function with known call sequences and verify correct n-gram extraction (n=2,3,4)
- Compute similarity between identical call sequences and verify similarity = 1.0
- Compute similarity between disjoint call sequences and verify similarity = 0.0
- Verify `CallNgramFingerprint.Empty` is returned for functions without call instructions
- Verify call-ngram dimension integrates into ensemble scoring with configurable weight (default 0.2)
- Verify cross-compiler similarity: same source compiled with GCC vs Clang should produce similar call n-grams
Implementation Gaps (QA 2026-02-11)
- `CallNgramGenerator` is present, but no dedicated call-ngram behavioral tests exist in `src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/`.
- `EnsembleDecisionEngine` currently combines syntactic/semantic/embedding signals only and does not expose a call-ngram signal or the claimed default 0.2 call-ngram weight path.
- `FunctionAnalysisBuilder` does not compute or attach call-ngram fingerprints into the ensemble analysis pipeline.
- `CallNgramOptions.MinCallCount` is not enforced in generator output flow.
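To make the `MinCallCount` gap concrete, here is one way such a threshold could be enforced, sketched in Python rather than the project's C#. The interpretation is an assumption: the source names the option but does not define it, so this sketch treats it as the minimum number of calls a function needs before a non-empty n-gram set is emitted. `Options`, its default of 2, and `ngrams_or_empty` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Tuple


@dataclass(frozen=True)
class Options:
    """Hypothetical mirror of CallNgramOptions; only MinCallCount is named in the source."""
    min_call_count: int = 2  # assumed default, not taken from the source


def ngrams_or_empty(calls: Sequence[str],
                    options: Options = Options(),
                    sizes=(2, 3, 4)) -> FrozenSet[Tuple[str, ...]]:
    """Return the n-gram set, or an empty set when the function has too few calls."""
    if len(calls) < options.min_call_count:
        return frozenset()  # below threshold: behave like the Empty sentinel
    return frozenset(
        tuple(calls[i:i + n])
        for n in sizes
        for i in range(len(calls) - n + 1)
    )


if __name__ == "__main__":
    calls = ["open", "read", "close"]
    print(len(ngrams_or_empty(calls)))
    print(len(ngrams_or_empty(calls, Options(min_call_count=4))))
```

Placing the check before any n-gram work means short functions never produce a fingerprint that could later inflate or deflate similarity scores.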
Verification Outcome
- Tier 0/1/2 artifacts: `docs/qa/feature-checks/runs/binaryindex/call-ngram-fingerprinting-for-binary-similarity-analysis/run-001/`.
- Result: not implemented at claim parity.
- Missing behavior:
  - `CallNgramGenerator` exists, but no first-class integration path wires call-ngram fingerprints into `FunctionAnalysisBuilder` outputs.
  - `EnsembleDecisionEngine` and `EnsembleOptions` expose only syntactic/semantic/embedding dimensions; no call-ngram dimension with claimed default weight.
  - No dedicated call-ngram generator behavioral tests verify n=2/3/4 extraction and similarity semantics as described by the dossier.