semi implemented and features implemented save checkpoint

This commit is contained in:
master
2026-02-08 18:00:49 +02:00
parent 04360dff63
commit 1bf6bbf395
20895 changed files with 716795 additions and 64 deletions

@@ -0,0 +1,27 @@
# Call-Ngram Fingerprinting for Binary Similarity Analysis
## Module
BinaryIndex
## Status
IMPLEMENTED
## Description
Extracts call-sequence n-grams from lifted IR to improve cross-compiler binary similarity matching. The generator builds n-grams (n = 2, 3, 4) from each function's call sequence and feeds them into the semantic fingerprint pipeline as one of four weighted dimensions. The dimension weights are configurable, with defaults of instruction 0.4, CFG 0.3, call-ngram 0.2, and semantic 0.1.
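To make the extraction step concrete, here is a minimal Python sketch of building contiguous n-grams (n = 2, 3, 4) from an ordered list of callee names. It is an illustration only, not the C# `CallNgramGenerator`: the function name `call_ngrams`, the use of plain string tuples instead of hashes, and the sample call sequence are all assumptions.

```python
from typing import Iterable, Sequence, Set, Tuple


def call_ngrams(
    calls: Sequence[str], sizes: Iterable[int] = (2, 3, 4)
) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-grams over an ordered call sequence.

    Hypothetical helper: the real generator works on a LiftedFunction and
    stores hashes; here we keep the raw tuples for readability.
    """
    grams: Set[Tuple[str, ...]] = set()
    for n in sizes:
        # range() is empty when the sequence is shorter than n,
        # so short sequences simply contribute no n-grams of that size.
        for i in range(len(calls) - n + 1):
            grams.add(tuple(calls[i:i + n]))
    return grams


# Sample call sequence (made up for illustration).
calls = ["malloc", "memcpy", "strlen", "free"]
grams = call_ngrams(calls)
```

For a four-call sequence this yields three 2-grams, two 3-grams, and one 4-gram. A sequence with fewer than two calls yields an empty set, which is the case the `Empty` sentinel would plausibly cover.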
## Implementation Details
- **Modules**: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- **Key Classes**:
  - `CallNgramGenerator` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/CallNgramGenerator.cs`) - generates `CallNgramFingerprint` from `LiftedFunction` call sequences; computes Jaccard similarity between fingerprints
  - `CallNgramFingerprint` (record in same file) - contains n-gram hash sets and metadata; has `Empty` sentinel for functions without calls
- **Interfaces**: `ICallNgramGenerator` (defined in `CallNgramGenerator.cs`) - `Generate(LiftedFunction)` and `ComputeSimilarity(CallNgramFingerprint, CallNgramFingerprint)`
- **Integration**: Used by `EnsembleDecisionEngine` and `FunctionAnalysisBuilder` as one of the matching dimensions with 0.2 default weight
- **Source**: SPRINT_20260118_026_BinaryIndex_deltasig_enhancements.md
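
The Jaccard similarity mentioned for `ComputeSimilarity` is the size of the intersection of two n-gram sets divided by the size of their union. The Python sketch below shows that formula under stated assumptions: the function name `jaccard` is hypothetical, and returning 0.0 when both sets are empty is an assumed convention, since the source does not say how two `Empty` fingerprints compare.

```python
from typing import AbstractSet, Hashable


def jaccard(a: AbstractSet[Hashable], b: AbstractSet[Hashable]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two n-gram sets."""
    if not a and not b:
        # Assumption: two call-free functions give no positive evidence.
        # The real implementation may choose a different convention.
        return 0.0
    return len(a & b) / len(a | b)


same = jaccard({1, 2}, {1, 2})             # identical sets
disjoint = jaccard({1}, {2})               # no shared n-grams
partial = jaccard({1, 2, 3}, {2, 3, 4})    # 2 shared out of 4 distinct
```

Identical sets score 1.0 and disjoint sets score 0.0, matching the first two similarity checks in the test plan.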
## E2E Test Plan
- [ ] Generate call-ngram fingerprint from a function with known call sequences and verify correct n-gram extraction (n=2,3,4)
- [ ] Compute similarity between identical call sequences and verify similarity = 1.0
- [ ] Compute similarity between disjoint call sequences and verify similarity = 0.0
- [ ] Verify `CallNgramFingerprint.Empty` is returned for functions without call instructions
- [ ] Verify call-ngram dimension integrates into ensemble scoring with configurable weight (default 0.2)
- [ ] Verify cross-compiler similarity: same source compiled with GCC vs Clang should produce similar call n-grams
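
As a rough aid for the ensemble-weight check, the Python sketch below combines per-dimension similarity scores using the default weights listed in the description. It assumes a normalized weighted sum, which the source does not confirm as the rule used by `EnsembleDecisionEngine`; the names `DEFAULT_WEIGHTS` and `ensemble_score` and the dictionary keys are hypothetical.

```python
from typing import Dict, Mapping

# Default weights taken from the feature description; key names are made up.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "instruction": 0.4,
    "cfg": 0.3,
    "call_ngram": 0.2,
    "semantic": 0.1,
}


def ensemble_score(
    scores: Mapping[str, float],
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted average of per-dimension similarities (assumed combination rule)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dimensions missing from `scores` are treated as 0.0 similarity.
    return sum(w * scores.get(dim, 0.0) for dim, w in weights.items()) / total


# Only the call-ngram dimension matches: the score equals its weight share.
only_calls = ensemble_score({"call_ngram": 1.0})
# Every dimension matches perfectly.
all_match = ensemble_score({dim: 1.0 for dim in DEFAULT_WEIGHTS})
```

Under this assumed rule, a perfect call-ngram match with no other signal contributes about 0.2 to the final score, and overriding `weights` changes that share accordingly.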