git.stella-ops.org/docs/features/checked/binaryindex/ml-function-embedding-service.md
2026-02-14 09:11:48 +02:00

ML Function Embedding Service (CodeBERT/ONNX Inference)

Module

BinaryIndex

Status

IMPLEMENTED

Description

ONNX-based inference service that generates function embeddings for binary function matching, using CodeBERT-derived models. It includes a training corpus schema, an embedding generation pipeline, and ensemble integration with the existing matchers. This feature has no direct match in the known features list.

Implementation Details

  • Modules: src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/, src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/
  • Key Classes:
    • IEmbeddingService (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/IEmbeddingService.cs) - generates FunctionEmbedding from binary functions; supports batch generation, similarity computation, and nearest-neighbor search
    • InMemoryEmbeddingIndex (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/InMemoryEmbeddingIndex.cs) - in-memory vector index for fast nearest-neighbor search over embeddings, ranked by cosine similarity
    • MlEmbeddingMatcherAdapter (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/MlEmbeddingMatcherAdapter.cs) - adapts ML embeddings for use by the ensemble decision engine
    • GroundTruthCorpusBuilder (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/GroundTruthCorpusBuilder.cs) - builds training corpus from ground truth data with JsonLines/Json export
    • ICorpusBuilder (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.ML/Training/ICorpusBuilder.cs) - training corpus building interface with CorpusExportFormat enum
    • FunctionEmbedding - vector embedding record for binary functions
  • Integration: FunctionAnalysisBuilder (src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ensemble/FunctionAnalysisBuilder.cs) passes ML embeddings into ensemble scoring
  • Registration: TrainingServiceCollectionExtensions for DI setup
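
The similarity and nearest-neighbor behavior described for IEmbeddingService and InMemoryEmbeddingIndex can be illustrated with a minimal, language-neutral sketch. It is written in Python for brevity and is not the C# implementation: the class InMemoryIndexSketch, the field names on FunctionEmbedding, and the method names add and nearest are all assumptions made for illustration. Only the use of cosine similarity over an in-memory collection comes from the description above.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FunctionEmbedding:
    """Hypothetical stand-in for the FunctionEmbedding record (field names assumed)."""
    function_id: str
    vector: Tuple[float, ...]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


class InMemoryIndexSketch:
    """Brute-force index: scores every stored embedding against the query."""

    def __init__(self) -> None:
        self._items: List[FunctionEmbedding] = []

    def add(self, embedding: FunctionEmbedding) -> None:
        self._items.append(embedding)

    def nearest(self, query: Sequence[float], top_k: int = 1) -> List[Tuple[str, float]]:
        scored = [(e.function_id, cosine_similarity(query, e.vector)) for e in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
        return scored[:top_k]


# Usage with made-up three-dimensional vectors (real embeddings are much larger).
index = InMemoryIndexSketch()
index.add(FunctionEmbedding("memcpy_O0", (1.0, 0.0, 0.0)))
index.add(FunctionEmbedding("memcpy_O2", (0.9, 0.1, 0.0)))
index.add(FunctionEmbedding("strlen_O2", (0.0, 1.0, 0.0)))
matches = index.nearest((1.0, 0.05, 0.0), top_k=2)
```

A linear scan like this is only reasonable for small collections; whether the actual index uses a scan or a more elaborate structure is not stated in this document.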

E2E Test Plan

  • Generate a function embedding from a known binary function and verify vector dimensions are correct
  • Compute similarity between embeddings of the same function compiled with different flags and verify high similarity
  • Add embeddings to InMemoryEmbeddingIndex and verify nearest-neighbor search returns correct matches
  • Build a training corpus from ground truth pairs via GroundTruthCorpusBuilder
  • Verify MlEmbeddingMatcherAdapter integrates with the ensemble decision engine
  • Verify batch embedding generation processes multiple functions efficiently
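
For the corpus-building step, the difference between the two export formats can be sketched as follows. This is a Python illustration under stated assumptions, not the GroundTruthCorpusBuilder code: the record type GroundTruthPair, its fields, the enum member names, and the function export_corpus are hypothetical. Only the existence of a CorpusExportFormat enum with JsonLines and Json options is taken from the implementation details above.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import List


class CorpusExportFormat(Enum):
    """Mirrors the two formats named in this document; member names are assumed."""
    JSON_LINES = "jsonlines"
    JSON = "json"


@dataclass
class GroundTruthPair:
    """Hypothetical training record: two functions and whether they match."""
    function_a: str
    function_b: str
    is_match: bool


def export_corpus(pairs: List[GroundTruthPair], fmt: CorpusExportFormat) -> str:
    records = [asdict(p) for p in pairs]
    if fmt is CorpusExportFormat.JSON_LINES:
        # One standalone JSON object per line, convenient for streaming.
        return "\n".join(json.dumps(r, sort_keys=True) for r in records)
    if fmt is CorpusExportFormat.JSON:
        # A single JSON array holding every record.
        return json.dumps(records, sort_keys=True, indent=2)
    raise ValueError(f"unsupported format: {fmt}")


pairs = [
    GroundTruthPair("memcpy_O0", "memcpy_O2", True),
    GroundTruthPair("memcpy_O0", "strlen_O2", False),
]
as_lines = export_corpus(pairs, CorpusExportFormat.JSON_LINES)
as_array = export_corpus(pairs, CorpusExportFormat.JSON)
```

An E2E check for the corpus step could parse both outputs back and confirm that they contain the same records in the same order.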