git.stella-ops.org/docs/features/checked/binaryindex/ml-function-embedding-service.md
2026-02-14 09:11:48 +02:00

# ML Function Embedding Service (CodeBERT/ONNX Inference)
## Module
BinaryIndex
## Status
IMPLEMENTED
## Description
ONNX-based inference service that generates function embeddings for binary function matching using CodeBERT-derived models. Includes a training corpus schema, an embedding generation pipeline, and ensemble integration with the existing matchers. This feature has no direct match in the known features list.
## Implementation Details
- **Modules**: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/`, `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/`
- **Key Classes**:
  - `IEmbeddingService` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/IEmbeddingService.cs`) - generates `FunctionEmbedding` from binary functions; supports batch generation, similarity computation, and nearest-neighbor search
  - `InMemoryEmbeddingIndex` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/InMemoryEmbeddingIndex.cs`) - in-memory vector index for fast embedding similarity search with cosine similarity
  - `MlEmbeddingMatcherAdapter` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/MlEmbeddingMatcherAdapter.cs`) - adapts ML embeddings for ensemble decision engine
  - `GroundTruthCorpusBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/GroundTruthCorpusBuilder.cs`) - builds training corpus from ground truth data with JsonLines/Json export
  - `ICorpusBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/ICorpusBuilder.cs`) - training corpus building interface with `CorpusExportFormat` enum
  - `FunctionEmbedding` - vector embedding record for binary functions
- **Integration**: `FunctionAnalysisBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/FunctionAnalysisBuilder.cs`) passes ML embeddings into ensemble scoring
- **Registration**: `TrainingServiceCollectionExtensions` for DI setup
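
The cosine-similarity search that `InMemoryEmbeddingIndex` is described as providing can be illustrated with a minimal, language-neutral sketch. It is written in Python for brevity and is not the C# implementation: `ToyEmbeddingIndex`, its `add` and `nearest` methods, the function ids, and the 3-dimensional vectors are all hypothetical names and data chosen for illustration, and the brute-force scan is an assumption about the simplest possible strategy.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class ToyEmbeddingIndex:
    """Hypothetical stand-in for an in-memory embedding index (brute-force scan)."""
    entries: dict = field(default_factory=dict)

    def add(self, function_id, vector):
        # Store a copy so later mutation of the caller's list does not affect the index.
        self.entries[function_id] = list(vector)

    def nearest(self, query, k=1):
        # Score every stored vector against the query, then keep the k best.
        scored = [(cosine_similarity(query, vec), fid) for fid, vec in self.entries.items()]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(fid, score) for score, fid in scored[:k]]


# Illustrative data only: real embeddings would have far more dimensions.
index = ToyEmbeddingIndex()
index.add("memcpy_O0", [0.9, 0.1, 0.0])
index.add("strlen_O2", [0.0, 0.2, 0.9])
matches = index.nearest([1.0, 0.0, 0.0], k=1)
```

A linear scan like this costs one similarity computation per stored embedding, which is usually acceptable for small in-memory indexes; whether the actual class uses a scan or a smarter structure is not stated in this document.
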
## E2E Test Plan
- [ ] Generate a function embedding from a known binary function and verify vector dimensions are correct
- [ ] Compute similarity between embeddings of the same function compiled with different flags and verify the similarity score is high
- [ ] Add embeddings to `InMemoryEmbeddingIndex` and verify nearest-neighbor search returns correct matches
- [ ] Build a training corpus from ground truth pairs via `GroundTruthCorpusBuilder`
- [ ] Verify `MlEmbeddingMatcherAdapter` integrates with ensemble decision engine
- [ ] Verify batch embedding generation processes multiple functions efficiently
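
For the corpus-building check, it may help to see how a JSON Lines export differs from a plain JSON export. The sketch below is a Python illustration under stated assumptions, not the `GroundTruthCorpusBuilder` API: `export_corpus`, the `"jsonlines"` and `"json"` format strings, and the record fields (`function_a`, `function_b`, `equivalent`) are hypothetical and do not reflect the real corpus schema.

```python
import json

# Hypothetical ground-truth pairs; the field names are illustrative only.
pairs = [
    {"function_a": "libfoo@1.0:parse", "function_b": "libfoo@1.1:parse", "equivalent": True},
    {"function_a": "libfoo@1.0:parse", "function_b": "libbar@2.0:render", "equivalent": False},
]


def export_corpus(records, fmt):
    """Serialize records as JSON Lines (one object per line) or as a single JSON array."""
    if fmt == "jsonlines":
        # One compact JSON object per line, each line terminated by a newline.
        return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)
    if fmt == "json":
        # One JSON document containing the whole list.
        return json.dumps(records, sort_keys=True, indent=2)
    raise ValueError(f"unsupported export format: {fmt}")


jsonl_text = export_corpus(pairs, "jsonlines")
json_text = export_corpus(pairs, "json")
```

A test along these lines could parse each exported form back and compare it with the input pairs. JSON Lines can be read one record at a time, which tends to suit large training corpora, while a single JSON array must be parsed as a whole.
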