Call-Ngram Fingerprinting for Binary Similarity Analysis

Module

BinaryIndex

Status

IMPLEMENTED

Description

Extracts call-sequence n-grams from lifted IR to improve cross-compiler binary similarity matching. The generator builds n-grams (n = 2, 3, 4) from each function's call sequence and feeds them into the semantic fingerprint pipeline as one of four weighted dimensions. The dimension weights are configurable; the defaults are instruction 0.4, CFG 0.3, call-ngram 0.2, and semantic 0.1.
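
To make the extraction step concrete, here is a minimal sketch of sliding-window n-gram generation over a call sequence. It is written in Python for brevity and is not the project's C# code: the names `call_ngrams`, `extract_all`, and `NGRAM_SIZES`, and the example call names, are all hypothetical. It also assumes the call sequence is already available as an ordered list of callee names.

```python
from typing import Dict, List, Sequence, Tuple

# n-gram sizes named in the description above.
NGRAM_SIZES = (2, 3, 4)


def call_ngrams(calls: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n calls, in order of appearance."""
    # A sequence shorter than n yields an empty range, hence no n-grams.
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def extract_all(calls: Sequence[str]) -> Dict[int, List[Tuple[str, ...]]]:
    """Group the n-grams of a call sequence by window size."""
    return {n: call_ngrams(calls, n) for n in NGRAM_SIZES}


# Hypothetical call sequence for one lifted function.
example = extract_all(["malloc", "memcpy", "strlen", "free"])
for size, grams in example.items():
    print(size, grams)
```

A sequence of k calls yields k - n + 1 n-grams for each n, so short functions contribute few or no 4-grams.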

Implementation Details

  • Modules: src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/
  • Key Classes:
    • CallNgramGenerator (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/CallNgramGenerator.cs) - generates CallNgramFingerprint from LiftedFunction call sequences; computes Jaccard similarity between fingerprints
    • CallNgramFingerprint (record in same file) - contains n-gram hash sets and metadata; has Empty sentinel for functions without calls
  • Interfaces: ICallNgramGenerator (defined in CallNgramGenerator.cs) - exposes Generate(LiftedFunction) and ComputeSimilarity(CallNgramFingerprint, CallNgramFingerprint)
  • Integration: Used by EnsembleDecisionEngine and FunctionAnalysisBuilder as one of the matching dimensions with 0.2 default weight
  • Source: SPRINT_20260118_026_BinaryIndex_deltasig_enhancements.md
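
The fingerprint and similarity behavior listed above can be sketched as follows. This is an illustrative Python model, not the C# `CallNgramGenerator`: the `Fingerprint` type, the `EMPTY` value, the SHA-256-based hash, and the choice to score two empty fingerprints as 0.0 are all assumptions made for the example. Only the Jaccard formula itself (intersection size divided by union size) comes from the description.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Tuple


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical stand-in for a call n-gram fingerprint."""
    hashes: FrozenSet[int]
    call_count: int


# Sentinel for functions that make no calls (mirrors the Empty idea).
EMPTY = Fingerprint(frozenset(), 0)


def _hash_ngram(ngram: Tuple[str, ...]) -> int:
    # Assumed hash: first 8 bytes of SHA-256 over NUL-joined callee names.
    digest = hashlib.sha256("\x00".join(ngram).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def generate(calls: Sequence[str], sizes: Tuple[int, ...] = (2, 3, 4)) -> Fingerprint:
    """Hash every n-gram of the call sequence into one set."""
    if not calls:
        return EMPTY
    hashes = {
        _hash_ngram(tuple(calls[i:i + n]))
        for n in sizes
        for i in range(len(calls) - n + 1)
    }
    return Fingerprint(frozenset(hashes), len(calls))


def similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    union = a.hashes | b.hashes
    if not union:
        return 0.0  # assumption: no n-grams on either side means no evidence
    return len(a.hashes & b.hashes) / len(union)


left = generate(["open", "read", "close"])
right = generate(["open", "read", "write"])
print(similarity(left, left), similarity(left, right))
```

Storing hashes rather than the n-grams themselves keeps fingerprints compact and makes the set operations cheap, at the cost of a small collision probability.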

E2E Test Plan

  • Generate call-ngram fingerprint from a function with known call sequences and verify correct n-gram extraction (n=2,3,4)
  • Compute similarity between identical call sequences and verify similarity = 1.0
  • Compute similarity between disjoint call sequences and verify similarity = 0.0
  • Verify CallNgramFingerprint.Empty is returned for functions without call instructions
  • Verify call-ngram dimension integrates into ensemble scoring with configurable weight (default 0.2)
  • Verify cross-compiler similarity: same source compiled with GCC vs Clang should produce similar call n-grams
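
For the ensemble-weight check in the plan above, the arithmetic can be sketched as a weighted average over the four dimensions. This is a hedged Python illustration only: `ensemble_score`, the dimension keys, and the renormalization by the weight sum are assumptions, and the real `EnsembleDecisionEngine` may combine scores differently. The default weights are the ones stated in the Description.

```python
from typing import Dict, Mapping, Optional

# Default dimension weights from the Description section.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "instruction": 0.4,
    "cfg": 0.3,
    "call_ngram": 0.2,
    "semantic": 0.1,
}


def ensemble_score(
    scores: Mapping[str, float],
    overrides: Optional[Mapping[str, float]] = None,
) -> float:
    """Weighted average of per-dimension similarities in [0, 1]."""
    weights = dict(DEFAULT_WEIGHTS)
    if overrides:
        unknown = set(overrides) - set(weights)
        if unknown:
            raise KeyError(f"unknown dimensions: {sorted(unknown)}")
        weights.update(overrides)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: divide by the weight sum so overrides need not sum to 1.
    return sum(w * scores.get(dim, 0.0) for dim, w in weights.items()) / total


# Hypothetical per-dimension similarities for one function pair.
scores = {"instruction": 0.9, "cfg": 0.8, "call_ngram": 0.5, "semantic": 0.7}
default_score = ensemble_score(scores)
boosted_score = ensemble_score(scores, {"call_ngram": 0.4})
print(default_score, boosted_score)
```

Raising the call-ngram weight pulls the combined score toward that dimension's similarity, which is the effect a configurable-weight test would want to observe.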