ML Function Embedding Service (CodeBERT/ONNX Inference)
Module
BinaryIndex
Status
IMPLEMENTED
Description
ONNX-based function embedding inference service for binary function matching using CodeBERT-derived models. Includes training corpus schema, embedding generation pipeline, and ensemble integration with existing matchers. No direct match in known features list.
Implementation Details
- Modules: src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/, src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/
- Key Classes:
  - IEmbeddingService (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/IEmbeddingService.cs) - generates FunctionEmbedding from binary functions; supports batch generation, similarity computation, and nearest-neighbor search
  - InMemoryEmbeddingIndex (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/InMemoryEmbeddingIndex.cs) - in-memory vector index for fast embedding similarity search with cosine similarity
  - MlEmbeddingMatcherAdapter (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/MlEmbeddingMatcherAdapter.cs) - adapts ML embeddings for ensemble decision engine
  - GroundTruthCorpusBuilder (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/GroundTruthCorpusBuilder.cs) - builds training corpus from ground truth data with JsonLines/Json export
  - ICorpusBuilder (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/ICorpusBuilder.cs) - training corpus building interface with CorpusExportFormat enum
  - FunctionEmbedding - vector embedding record for binary functions
- Integration: FunctionAnalysisBuilder (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/FunctionAnalysisBuilder.cs) passes ML embeddings into ensemble scoring
- Registration: TrainingServiceCollectionExtensions for DI setup
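The cosine-similarity search attributed to InMemoryEmbeddingIndex can be illustrated with a minimal, language-neutral sketch. This is not the C# implementation: the class name InMemoryIndexSketch, the method names add and find_nearest, and the fields of the FunctionEmbedding stand-in are all assumptions made for illustration, and the brute-force scan is simply the most basic way to do such a search.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FunctionEmbedding:
    """Hypothetical stand-in for the FunctionEmbedding record (fields assumed)."""
    function_id: str
    vector: Tuple[float, ...]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


class InMemoryIndexSketch:
    """Brute-force index: scores every stored embedding on each query."""

    def __init__(self):
        self._items = {}

    def add(self, embedding):
        # A later embedding with the same id replaces the earlier one.
        self._items[embedding.function_id] = embedding

    def find_nearest(self, query, top_k=3):
        """Return up to top_k (function_id, similarity) pairs, best first."""
        scored = [
            (cosine_similarity(query, e.vector), e.function_id)
            for e in self._items.values()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(fid, score) for score, fid in scored[:top_k]]


if __name__ == "__main__":
    index = InMemoryIndexSketch()
    index.add(FunctionEmbedding("memcpy_O2", (0.9, 0.1, 0.0)))
    index.add(FunctionEmbedding("strlen_O2", (0.0, 0.2, 0.9)))
    print(index.find_nearest((1.0, 0.0, 0.0), top_k=1))
```

A linear scan like this costs one similarity computation per stored embedding on every query, which is usually acceptable for a modest in-memory corpus; whether the real index does anything more elaborate is not stated here.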
E2E Test Plan
- Generate a function embedding from a known binary function and verify vector dimensions are correct
- Compute similarity between embeddings of identical functions (compiled with different flags) and verify high similarity
- Add embeddings to InMemoryEmbeddingIndex and verify nearest-neighbor search returns correct matches
- Build a training corpus from ground truth pairs via GroundTruthCorpusBuilder
- Verify MlEmbeddingMatcherAdapter integrates with ensemble decision engine
- Verify batch embedding generation processes multiple functions efficiently
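For the corpus-building check, it may help to see how the two export formats named for GroundTruthCorpusBuilder differ on the wire. The sketch below is an illustration only, not the C# API: the enum member names, the export_corpus function, and the record fields (function_a, function_b, is_match) are hypothetical, and the real corpus schema is not shown in this document.

```python
import json
from enum import Enum


class CorpusExportFormat(Enum):
    """Two formats named in this document; member names and values are assumed."""
    JSON_LINES = "jsonl"
    JSON = "json"


def export_corpus(pairs, fmt):
    """Serialize ground-truth pairs as one JSON array or one object per line."""
    if fmt is CorpusExportFormat.JSON:
        # A single document: the whole corpus must be parsed at once.
        return json.dumps(pairs, indent=2)
    if fmt is CorpusExportFormat.JSON_LINES:
        # One compact object per line: can be streamed record by record.
        return "\n".join(json.dumps(p, sort_keys=True) for p in pairs)
    raise ValueError(f"unsupported format: {fmt}")


if __name__ == "__main__":
    # Hypothetical ground-truth pairs; field names are illustrative.
    sample = [
        {"function_a": "libfoo:parse@O0", "function_b": "libfoo:parse@O2", "is_match": True},
        {"function_a": "libfoo:parse@O0", "function_b": "libbar:hash@O2", "is_match": False},
    ]
    print(export_corpus(sample, CorpusExportFormat.JSON_LINES))
```

An end-to-end test along these lines would export a small corpus in each format, parse the output back, and confirm that the records survive the round trip unchanged.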