git.stella-ops.org/docs/features/unimplemented/binaryindex/call-ngram-fingerprinting-for-binary-similarity-analysis.md
2026-02-12 10:27:23 +02:00

Call-Ngram Fingerprinting for Binary Similarity Analysis

Module

BinaryIndex

Status

PARTIALLY_IMPLEMENTED

Description

Call-sequence n-gram extraction from lifted IR for improved cross-compiler binary similarity matching. Generates n-grams (n=2,3,4) from function call sequences and integrates into the semantic fingerprint pipeline with configurable dimension weights (instruction 0.4, CFG 0.3, call-ngram 0.2, semantic 0.1).
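
As a rough illustration of the extraction step, the sketch below slides windows of size 2, 3, and 4 over an ordered list of callee names and hashes each window. It is written in Python for brevity and is not the C# implementation; the function name `call_ngrams`, the NUL separator, and the truncated SHA-256 hash are all assumptions made for this example.

```python
import hashlib
from typing import Dict, Iterable, Sequence, Set


def call_ngrams(calls: Sequence[str],
                sizes: Iterable[int] = (2, 3, 4)) -> Dict[int, Set[str]]:
    """Return, for each n, the set of hashed n-grams over the call sequence.

    Hypothetical helper: the real generator's hashing and normalization
    rules are not described in this document.
    """
    result: Dict[int, Set[str]] = {}
    for n in sizes:
        grams: Set[str] = set()
        # A sequence shorter than n yields no windows, so the set stays empty.
        for i in range(len(calls) - n + 1):
            window = "\x00".join(calls[i:i + n])
            digest = hashlib.sha256(window.encode("utf-8")).hexdigest()
            grams.add(digest[:16])  # truncated hash, chosen only for the sketch
        result[n] = grams
    return result


if __name__ == "__main__":
    sequence = ["malloc", "memcpy", "free", "malloc", "memcpy"]
    for n, grams in call_ngrams(sequence).items():
        print(n, len(grams))
```

Because the n-grams are stored as sets, a repeated window such as `malloc, memcpy` in the sequence above contributes only once.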

Implementation Details

  • Modules: src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/
  • Key Classes:
    • CallNgramGenerator (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/CallNgramGenerator.cs) - generates CallNgramFingerprint from LiftedFunction call sequences; computes Jaccard similarity between fingerprints
    • CallNgramFingerprint (record in same file) - contains n-gram hash sets and metadata; has Empty sentinel for functions without calls
  • Interfaces: ICallNgramGenerator (defined in CallNgramGenerator.cs) - Generate(LiftedFunction) and ComputeSimilarity(CallNgramFingerprint, CallNgramFingerprint)
  • Integration: Per the feature claim, used by EnsembleDecisionEngine and FunctionAnalysisBuilder as one of the matching dimensions with a default weight of 0.2 (QA did not find this wiring; see Implementation Gaps)
  • Source: SPRINT_20260118_026_BinaryIndex_deltasig_enhancements.md
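
The Jaccard similarity and the Empty sentinel mentioned above can be sketched as follows. This is a Python stand-in under stated assumptions, not the C# record: `NgramFingerprint`, its fields, and the choice to return 0.0 when both fingerprints are empty are hypothetical, since the document does not say how the real `ComputeSimilarity` treats that case.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class NgramFingerprint:
    """Hypothetical stand-in for CallNgramFingerprint."""
    hashes: FrozenSet[str] = frozenset()
    call_count: int = 0


# Sentinel for functions without calls (mirrors the idea of an Empty value).
EMPTY = NgramFingerprint()


def similarity(a: NgramFingerprint, b: NgramFingerprint) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    union = a.hashes | b.hashes
    if not union:
        # Assumption: two empty fingerprints carry no evidence of similarity.
        return 0.0
    return len(a.hashes & b.hashes) / len(union)


if __name__ == "__main__":
    left = NgramFingerprint(frozenset({"h1", "h2", "h3"}), call_count=4)
    right = NgramFingerprint(frozenset({"h2", "h3", "h4"}), call_count=4)
    print(similarity(left, right))
```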

E2E Test Plan

  • Generate call-ngram fingerprint from a function with known call sequences and verify correct n-gram extraction (n=2,3,4)
  • Compute similarity between identical call sequences and verify similarity = 1.0
  • Compute similarity between disjoint call sequences and verify similarity = 0.0
  • Verify CallNgramFingerprint.Empty is returned for functions without call instructions
  • Verify call-ngram dimension integrates into ensemble scoring with configurable weight (default 0.2)
  • Verify cross-compiler similarity: same source compiled with GCC vs Clang should produce similar call n-grams
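
For the ensemble-weight check, a reference calculation along these lines could supply expected values. It uses the four default weights quoted in the Description; the function name `ensemble_score`, the dictionary keys, and the normalization by the weight total are assumptions for this Python sketch and do not describe the current EnsembleDecisionEngine.

```python
import math
from typing import Dict, Mapping

# Default weights as stated in the Description section.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "instruction": 0.4,
    "cfg": 0.3,
    "call_ngram": 0.2,
    "semantic": 0.1,
}


def ensemble_score(scores: Mapping[str, float],
                   weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-dimension similarity scores in [0, 1].

    Missing dimensions count as 0.0; dividing by the weight total keeps the
    result in [0, 1] even if custom weights do not sum to 1 (an assumption).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * scores.get(name, 0.0) for name, w in weights.items()) / total


if __name__ == "__main__":
    only_calls_match = {"call_ngram": 1.0}
    # With default weights, a perfect call-ngram match alone contributes ~0.2.
    print(math.isclose(ensemble_score(only_calls_match), 0.2))
```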

Implementation Gaps (QA 2026-02-11)

  • CallNgramGenerator is present, but no dedicated call-ngram behavioral tests exist in src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/.
  • EnsembleDecisionEngine currently combines syntactic/semantic/embedding signals only and does not expose a call-ngram signal or the claimed default 0.2 call-ngram weight path.
  • FunctionAnalysisBuilder does not compute or attach call-ngram fingerprints into the ensemble analysis pipeline.
  • CallNgramOptions.MinCallCount is not enforced in the generator's output flow.
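
One plausible reading of the MinCallCount gap is that functions with fewer calls than the threshold should yield an empty fingerprint. The Python sketch below shows that guard; the semantics, the default of 2, and the name `generate_with_min_calls` are assumptions, since the document does not define the intended behavior of the option.

```python
from typing import FrozenSet, Sequence, Tuple


def generate_with_min_calls(calls: Sequence[str],
                            min_call_count: int = 2,
                            n: int = 2) -> FrozenSet[Tuple[str, ...]]:
    """Return the set of n-grams, or an empty set below the call threshold.

    Hypothetical guard: the real option's default and semantics are unknown.
    """
    if len(calls) < min_call_count:
        # Too few calls to be meaningful: behave like the Empty sentinel.
        return frozenset()
    return frozenset(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


if __name__ == "__main__":
    print(generate_with_min_calls(["open", "read", "close"], min_call_count=5))
    print(sorted(generate_with_min_calls(["open", "read", "close"])))
```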

Verification Outcome

  • Tier 0/1/2 artifacts: docs/qa/feature-checks/runs/binaryindex/call-ngram-fingerprinting-for-binary-similarity-analysis/run-001/.
  • Result: not implemented at parity with the documented claims.
  • Missing behavior:
    • CallNgramGenerator exists, but no first-class integration path wires call-ngram fingerprints into FunctionAnalysisBuilder outputs.
    • EnsembleDecisionEngine and EnsembleOptions expose only syntactic/semantic/embedding dimensions; no call-ngram dimension with claimed default weight.
    • No dedicated call-ngram generator behavioral tests verify n=2/3/4 extraction and similarity semantics as described by the dossier.