# ML Function Embedding Service (CodeBERT/ONNX Inference)

## Module
BinaryIndex

## Status
IMPLEMENTED

## Description
ONNX-based function embedding inference service for binary function matching using CodeBERT-derived models. Includes training corpus schema, embedding generation pipeline, and ensemble integration with existing matchers. No direct match in known features list.

## Implementation Details
- **Modules**: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/`, `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/`
- **Key Classes**:
  - `IEmbeddingService` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/IEmbeddingService.cs`) - generates `FunctionEmbedding` from binary functions; supports batch generation, similarity computation, and nearest-neighbor search
  - `InMemoryEmbeddingIndex` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/InMemoryEmbeddingIndex.cs`) - in-memory vector index for fast embedding similarity search with cosine similarity
  - `MlEmbeddingMatcherAdapter` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/MlEmbeddingMatcherAdapter.cs`) - adapts ML embeddings for ensemble decision engine
  - `GroundTruthCorpusBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/GroundTruthCorpusBuilder.cs`) - builds training corpus from ground truth data with JsonLines/Json export
  - `ICorpusBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/ICorpusBuilder.cs`) - training corpus building interface with `CorpusExportFormat` enum
  - `FunctionEmbedding` - vector embedding record for binary functions
- **Integration**: `FunctionAnalysisBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/FunctionAnalysisBuilder.cs`) passes ML embeddings into ensemble scoring
- **Registration**: `TrainingServiceCollectionExtensions` for DI setup

## E2E Test Plan
- [ ] Generate a function embedding from a known binary function and verify vector dimensions are correct
- [ ] Compute similarity between embeddings of identical functions (compiled with different flags) and verify high similarity
- [ ] Add embeddings to `InMemoryEmbeddingIndex` and verify nearest-neighbor search returns correct matches
- [ ] Build a training corpus from ground truth pairs via `GroundTruthCorpusBuilder`
- [ ] Verify `MlEmbeddingMatcherAdapter` integrates with ensemble decision engine
- [ ] Verify batch embedding generation processes multiple functions efficiently
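
The cosine-similarity nearest-neighbor search attributed to `InMemoryEmbeddingIndex` can be illustrated with a minimal sketch. This is not the StellaOps code, which is C# and whose API is not shown in this document; it is a language-neutral illustration written in Python. The class `ToyEmbeddingIndex`, its methods, the function IDs, and the three-dimensional vectors are all hypothetical, and a linear scan is assumed in place of whatever search strategy the real index uses.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class ToyEmbeddingIndex:
    """Hypothetical stand-in for an in-memory index: a dict scanned linearly."""

    vectors: dict = field(default_factory=dict)

    def add(self, function_id, vector):
        # Store a copy so later changes to the caller's list do not affect the index.
        self.vectors[function_id] = list(vector)

    def nearest(self, query, k=1):
        """Return up to k (function_id, similarity) pairs, most similar first."""
        scored = [
            (fid, cosine_similarity(query, vec)) for fid, vec in self.vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Made-up 3-dimensional embeddings; real model outputs have far more dimensions.
index = ToyEmbeddingIndex()
index.add("memcpy_O2", [0.9, 0.1, 0.0])
index.add("strlen_O2", [0.0, 0.2, 0.9])

top = index.nearest([1.0, 0.0, 0.0], k=1)
print(top[0][0])  # ID of the closest stored function
```

A linear scan costs one similarity computation per stored vector, which is usually acceptable for small in-memory indexes and keeps the nearest-neighbor checks in the test plan easy to reason about.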