semi implemented and features implemented save checkpoint

This commit is contained in:
master
2026-02-08 18:00:49 +02:00
parent 04360dff63
commit 1bf6bbf395
20895 changed files with 716795 additions and 64 deletions

@@ -0,0 +1,30 @@
# ML Function Embedding Service (CodeBERT/ONNX Inference)
## Module
BinaryIndex
## Status
IMPLEMENTED
## Description
ONNX-based inference service that generates function embeddings for binary function matching, using CodeBERT-derived models. It includes a training corpus schema, an embedding generation pipeline, and ensemble integration with the existing matchers. This feature has no direct match in the known features list.
## Implementation Details
- **Modules**: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/`, `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/`
- **Key Classes**:
  - `IEmbeddingService` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/IEmbeddingService.cs`) - generates `FunctionEmbedding` from binary functions; supports batch generation, similarity computation, and nearest-neighbor search
  - `InMemoryEmbeddingIndex` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/InMemoryEmbeddingIndex.cs`) - in-memory vector index for fast embedding similarity search with cosine similarity
  - `MlEmbeddingMatcherAdapter` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/MlEmbeddingMatcherAdapter.cs`) - adapts ML embeddings for ensemble decision engine
  - `GroundTruthCorpusBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/GroundTruthCorpusBuilder.cs`) - builds training corpus from ground truth data with JsonLines/Json export
  - `ICorpusBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/ICorpusBuilder.cs`) - training corpus building interface with `CorpusExportFormat` enum
  - `FunctionEmbedding` - vector embedding record for binary functions
- **Integration**: `FunctionAnalysisBuilder` (`src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/FunctionAnalysisBuilder.cs`) passes ML embeddings into ensemble scoring
- **Registration**: `TrainingServiceCollectionExtensions` for DI setup
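
The cosine-similarity nearest-neighbor search attributed to `InMemoryEmbeddingIndex` above can be illustrated with a minimal, language-neutral sketch. This is not the C# implementation: `ToyEmbeddingIndex`, its method names, and the brute-force scan are assumptions made purely for illustration, and the real index may be organized quite differently.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyEmbeddingIndex:
    """Hypothetical stand-in for an in-memory vector index (brute-force scan)."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, function_id: str, vector: Sequence[float]) -> None:
        # Store a copy so later mutation by the caller does not affect the index.
        self._vectors[function_id] = list(vector)

    def nearest(self, query: Sequence[float], k: int = 1) -> List[Tuple[str, float]]:
        # Score every stored vector, then keep the k most similar.
        scored = [
            (function_id, cosine_similarity(query, vector))
            for function_id, vector in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

A brute-force scan costs one similarity computation per stored embedding per query. That is acceptable for small in-memory sets, while larger corpora usually call for an approximate nearest-neighbor structure.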
## E2E Test Plan
- [ ] Generate a function embedding from a known binary function and verify vector dimensions are correct
- [ ] Compute similarity between embeddings of the same function compiled with different flags and verify high similarity
- [ ] Add embeddings to `InMemoryEmbeddingIndex` and verify nearest-neighbor search returns correct matches
- [ ] Build a training corpus from ground truth pairs via `GroundTruthCorpusBuilder`
- [ ] Verify `MlEmbeddingMatcherAdapter` integrates with ensemble decision engine
- [ ] Verify batch embedding generation processes multiple functions efficiently
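
For the corpus-building item in the plan above, the difference between the two export formats can be sketched as follows. The record fields (`function_a`, `function_b`, `is_match`) and the `export_corpus` helper are hypothetical and are not taken from `GroundTruthCorpusBuilder` or `CorpusExportFormat`. The sketch only shows that JSON Lines writes one record per line, whereas JSON writes a single array.

```python
import json
from typing import Dict, List


def export_corpus(pairs: List[Dict[str, object]], fmt: str) -> str:
    """Serialize training pairs as JSON Lines ("jsonl") or one JSON array ("json")."""
    if fmt == "jsonl":
        # One compact JSON object per line; convenient for streaming large corpora.
        return "\n".join(json.dumps(pair, sort_keys=True) for pair in pairs)
    if fmt == "json":
        # A single array holding every record.
        return json.dumps(pairs, sort_keys=True, indent=2)
    raise ValueError(f"unsupported export format: {fmt}")


# Hypothetical ground-truth pairs: two builds of one function, plus a non-match.
sample_pairs = [
    {"function_a": "libfoo-O0:parse", "function_b": "libfoo-O2:parse", "is_match": True},
    {"function_a": "libfoo-O0:parse", "function_b": "libfoo-O2:emit", "is_match": False},
]
```

An end-to-end check could export in both formats, parse the output back, and confirm that the record count and labels survive the round trip.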