# ML Function Embedding Service (CodeBERT/ONNX Inference)

## Module

BinaryIndex

## Status

IMPLEMENTED

## Description

An ONNX-based inference service that generates function embeddings for binary function matching using CodeBERT-derived models. It includes a training corpus schema, an embedding generation pipeline, and ensemble integration with the existing matchers. This feature has no direct match in the known features list.

## Implementation Details

- **Modules**: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/`, `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/`
- **Key Classes**:
  - `IEmbeddingService` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/IEmbeddingService.cs`) - generates `FunctionEmbedding` from binary functions; supports batch generation, similarity computation, and nearest-neighbor search
  - `InMemoryEmbeddingIndex` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/InMemoryEmbeddingIndex.cs`) - in-memory vector index for fast embedding similarity search with cosine similarity
  - `MlEmbeddingMatcherAdapter` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/MlEmbeddingMatcherAdapter.cs`) - adapts ML embeddings for ensemble decision engine
  - `GroundTruthCorpusBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/GroundTruthCorpusBuilder.cs`) - builds training corpus from ground truth data with JsonLines/Json export
  - `ICorpusBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/ICorpusBuilder.cs`) - training corpus building interface with `CorpusExportFormat` enum
  - `FunctionEmbedding` - vector embedding record for binary functions
- **Integration**: `FunctionAnalysisBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/FunctionAnalysisBuilder.cs`) passes ML embeddings into ensemble scoring
- **Registration**: `TrainingServiceCollectionExtensions` for DI setup
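
The cosine-similarity search attributed to `InMemoryEmbeddingIndex` above can be illustrated with a minimal, language-neutral sketch in Python. This is not the C# implementation: `Embedding`, `BruteForceIndex`, and `cosine_similarity` are hypothetical names, and the linear scan is an assumption made for clarity rather than a statement about how the real index searches.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Embedding:
    """Hypothetical stand-in for a FunctionEmbedding: an id plus a float vector."""
    function_id: str
    vector: tuple


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros, to avoid dividing by zero.
    """
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class BruteForceIndex:
    """Illustrative in-memory index: linear scan ranked by cosine similarity."""

    def __init__(self):
        self._items = []

    def add(self, embedding):
        self._items.append(embedding)

    def nearest(self, query, k=1):
        """Return up to k (function_id, similarity) pairs, most similar first."""
        scored = [
            (e.function_id, cosine_similarity(query, e.vector))
            for e in self._items
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

Calling `nearest(query, k)` on an index that holds a handful of embeddings returns the stored function ids ranked by similarity to `query`. A linear scan costs time proportional to the number of stored vectors per query, which is reasonable for small in-memory sets.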

## E2E Test Plan

- [ ] Generate a function embedding from a known binary function and verify that the vector has the expected number of dimensions
- [ ] Compute the similarity between embeddings of the same function compiled with different flags and verify that it is high
- [ ] Add embeddings to `InMemoryEmbeddingIndex` and verify that nearest-neighbor search returns the correct matches
- [ ] Build a training corpus from ground truth pairs via `GroundTruthCorpusBuilder`
- [ ] Verify that `MlEmbeddingMatcherAdapter` integrates with the ensemble decision engine
- [ ] Verify that batch embedding generation processes multiple functions efficiently
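
For the corpus-building check, it helps to be clear about how the two export formats named for `GroundTruthCorpusBuilder` differ: JSON Lines writes one object per line, while JSON writes a single array. The Python sketch below shows that difference only. `ExportFormat`, `export_corpus`, and the record fields (`function_a`, `function_b`, `is_match`) are hypothetical and are not taken from the actual corpus schema or the `CorpusExportFormat` enum.

```python
import json
from enum import Enum


class ExportFormat(Enum):
    """Hypothetical mirror of the two export formats named in the text."""
    JSON_LINES = "jsonl"
    JSON = "json"


def export_corpus(pairs, fmt):
    """Serialize a list of dict records.

    JSON_LINES: one JSON object per line, each line newline-terminated.
    JSON: a single JSON array holding all records.
    """
    if fmt is ExportFormat.JSON_LINES:
        return "".join(json.dumps(p, sort_keys=True) + "\n" for p in pairs)
    if fmt is ExportFormat.JSON:
        return json.dumps(pairs, sort_keys=True, indent=2)
    raise ValueError(f"unsupported format: {fmt}")


# Assumed record shape, for illustration only.
sample_pairs = [
    {"function_a": "f1", "function_b": "f2", "is_match": True},
    {"function_a": "f1", "function_b": "g9", "is_match": False},
]
```

A round-trip test along these lines would parse each exported line (or the whole array) back and compare it with the input pairs, which confirms that no records were dropped or reordered during export.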