# BinaryIndex ML Model Training Guide
This document describes how to train, export, and deploy ML models for the BinaryIndex binary similarity detection system.
## Overview
The BinaryIndex ML pipeline uses transformer-based models to generate function embeddings that capture semantic similarity. The primary model is **CodeBERT-Binary**, a fine-tuned variant of CodeBERT optimized for decompiled binary code comparison.
## Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Model Training Pipeline │
│ │
│ ┌───────────────┐ ┌────────────────┐ ┌──────────────────┐ │
│ │ Training Data │ -> │ Fine-tuning │ -> │ Model Export │ │
│ │ (Function │ │ (Contrastive │ │ (ONNX format) │ │
│ │ Pairs) │ │ Learning) │ │ │ │
│ └───────────────┘ └────────────────┘ └──────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Inference Pipeline │ │
│ │ │ │
│ │ Code -> Tokenizer -> ONNX Runtime -> Embedding (768-dim) │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```
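As a concrete illustration of the inference path, here is a minimal Python sketch that tokenizes a function and runs the exported model through ONNX Runtime. It assumes the model was exported as described later in this guide and that the stock CodeBERT tokenizer is used; the production service implements the equivalent in C#.
```python
# Minimal inference sketch: Code -> Tokenizer -> ONNX Runtime -> 768-dim embedding
import onnxruntime as ort
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("microsoft/codebert-base")
session = ort.InferenceSession("models/codebert-binary.onnx")

def embed(code: str):
    enc = tokenizer(code, truncation=True, max_length=512, return_tensors="np")
    (embedding,) = session.run(
        ["embedding"],
        {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
    )
    return embedding[0]  # L2-normalized 768-dim vector
```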
## Training Data Requirements
### Positive Pairs (Similar Functions)
| Source | Description | Estimated Count |
|--------|-------------|-----------------|
| Same function, different optimization | O0 vs O2 vs O3 compilations | ~50,000 |
| Same function, different compiler | GCC vs Clang vs MSVC | ~30,000 |
| Same function, different version | From corpus snapshots | ~100,000 |
| Vulnerability patches | Vulnerable vs fixed versions | ~20,000 |
### Negative Pairs (Dissimilar Functions)
| Source | Description | Estimated Count |
|--------|-------------|-----------------|
| Random function pairs | Random sampling from corpus | ~100,000 |
| Similar-named different functions | Hard negatives for robustness | ~50,000 |
| Same library, different functions | Medium-difficulty negatives | ~50,000 |
**Total training data:** ~400,000 labeled pairs
### Data Format
Training data is stored in JSON Lines (JSONL) format:
```json
{"function_a": "int sum(int* a, int n) { int s = 0; for (int i = 0; i < n; i++) s += a[i]; return s; }", "function_b": "int total(int* arr, int len) { int t = 0; for (int j = 0; j < len; j++) t += arr[j]; return t; }", "is_similar": true, "similarity_score": 0.95}
{"function_a": "int sum(int* a, int n) { ... }", "function_b": "void print(char* s) { ... }", "is_similar": false, "similarity_score": 0.1}
```
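A loader for this format can be a few lines. The sketch below is illustrative (not part of the repository); it streams pairs and converts the boolean label to the 0/1 integer the contrastive loss expects:
```python
import json

def load_pairs(path):
    """Yield (function_a, function_b, label) tuples from a JSONL pair file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            yield rec["function_a"], rec["function_b"], int(rec["is_similar"])
```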
## Training Process
### Prerequisites
- Python 3.10+
- PyTorch 2.0+
- Transformers 4.30+
- CUDA 11.8+ (for GPU training)
- 64GB RAM, 32GB VRAM (V100 or A100 recommended)
### Installation
```bash
cd tools/ml
pip install -r requirements.txt
```
### Configuration
Create a training configuration file `config/training.yaml`:
```yaml
model:
  base_model: microsoft/codebert-base
  embedding_dim: 768
  max_sequence_length: 512

training:
  batch_size: 32
  epochs: 10
  learning_rate: 1e-5
  warmup_steps: 1000
  weight_decay: 0.01

contrastive:
  margin: 0.5
  temperature: 0.07

data:
  train_path: data/train.jsonl
  val_path: data/val.jsonl
  test_path: data/test.jsonl

output:
  model_dir: models/codebert-binary
  checkpoint_interval: 1000
```
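If you need to inspect or override these values programmatically, the file is plain YAML. For example (assuming PyYAML is installed):
```python
import yaml

with open("config/training.yaml") as f:
    cfg = yaml.safe_load(f)

# Note: PyYAML parses "1e-5" as a string (no decimal point), so cast numerics
lr = float(cfg["training"]["learning_rate"])
```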
### Running Training
```bash
python train_codebert_binary.py --config config/training.yaml
```
Training logs are written to `logs/` and checkpoints to `models/`.
### Training Script Overview
```python
# tools/ml/train_codebert_binary.py
import torch
from transformers import RobertaModel


class CodeBertBinaryModel(torch.nn.Module):
    """CodeBERT fine-tuned for binary code similarity."""

    def __init__(self, pretrained_model="microsoft/codebert-base"):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(pretrained_model)
        self.projection = torch.nn.Linear(768, 768)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        projected = self.projection(pooled)
        # L2-normalize so cosine similarity reduces to a dot product
        return torch.nn.functional.normalize(projected, p=2, dim=1)


class ContrastiveLoss(torch.nn.Module):
    """Contrastive loss for learning similarity embeddings."""

    def __init__(self, margin=0.5):
        super().__init__()
        self.margin = margin

    def forward(self, embedding_a, embedding_b, label):
        distance = torch.nn.functional.pairwise_distance(embedding_a, embedding_b)
        # label=1: similar pair (pull embeddings together);
        # label=0: dissimilar pair (push apart until at least `margin` away)
        loss = label * distance.pow(2) + \
               (1 - label) * torch.clamp(self.margin - distance, min=0).pow(2)
        return loss.mean()
```
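The script wires these two modules into a standard fine-tuning loop. Below is a simplified single step using the hyperparameters from `config/training.yaml`; it is illustrative only, and the actual script adds warmup scheduling, checkpointing, and validation:
```python
import torch
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("microsoft/codebert-base")
model = CodeBertBinaryModel()
criterion = ContrastiveLoss(margin=0.5)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

def train_step(code_a, code_b, labels):
    """One optimization step on a batch of (code_a, code_b, label) pairs."""
    enc_a = tokenizer(code_a, padding=True, truncation=True, max_length=512, return_tensors="pt")
    enc_b = tokenizer(code_b, padding=True, truncation=True, max_length=512, return_tensors="pt")
    emb_a = model(enc_a["input_ids"], enc_a["attention_mask"])
    emb_b = model(enc_b["input_ids"], enc_b["attention_mask"])
    loss = criterion(emb_a, emb_b, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```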
## Model Export
After training, export the model to ONNX format for inference:
```bash
python export_onnx.py \
--model models/codebert-binary/best.pt \
--output models/codebert-binary.onnx \
--opset 17
```
### Export Script
```python
# tools/ml/export_onnx.py
import torch


def export_to_onnx(model, output_path):
    model.eval()
    # Dummy inputs only fix shapes and dtypes; dynamic_axes below keeps
    # batch and sequence dimensions flexible at inference time.
    dummy_input = torch.randint(0, 50000, (1, 512))
    dummy_mask = torch.ones(1, 512, dtype=torch.int64)  # match the tokenizer's int64 masks
    torch.onnx.export(
        model,
        (dummy_input, dummy_mask),
        output_path,
        input_names=['input_ids', 'attention_mask'],
        output_names=['embedding'],
        dynamic_axes={
            'input_ids': {0: 'batch', 1: 'seq'},
            'attention_mask': {0: 'batch', 1: 'seq'},
            'embedding': {0: 'batch'}
        },
        opset_version=17
    )
```
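Before deploying, it is worth sanity-checking the export by comparing PyTorch and ONNX Runtime outputs on the same input. A sketch, assuming `model` is the trained `CodeBertBinaryModel`:
```python
import numpy as np
import onnxruntime as ort
import torch

session = ort.InferenceSession("models/codebert-binary.onnx")
ids = np.random.randint(0, 50000, (1, 512), dtype=np.int64)
mask = np.ones((1, 512), dtype=np.int64)

(onnx_emb,) = session.run(["embedding"], {"input_ids": ids, "attention_mask": mask})
with torch.no_grad():
    torch_emb = model(torch.from_numpy(ids), torch.from_numpy(mask)).numpy()

assert np.allclose(onnx_emb, torch_emb, atol=1e-4)
```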
## Deployment
### Configuration
Configure the ML service in your application:
```yaml
# etc/binaryindex.yaml
ml:
  enabled: true
  model_path: /opt/stellaops/models/codebert-binary.onnx
  vocabulary_path: /opt/stellaops/models/vocab.txt
  num_threads: 4
  batch_size: 16
```
### Code Integration
```csharp
// Register ML services
services.AddMlServices(options =>
{
    options.ModelPath = config["ml:model_path"];
    options.VocabularyPath = config["ml:vocabulary_path"];
    options.NumThreads = config.GetValue<int>("ml:num_threads");
});

// Use embedding service
var embedding = await embeddingService.GenerateEmbeddingAsync(
    new EmbeddingInput(decompiledCode, null, null, EmbeddingInputType.DecompiledCode));

// Compare embeddings
var similarity = embeddingService.ComputeSimilarity(embA, embB, SimilarityMetric.Cosine);
```
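Because the model L2-normalizes its output embeddings, cosine similarity reduces to a plain dot product. An illustrative Python sketch of what `ComputeSimilarity` computes for `SimilarityMetric.Cosine`:
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # The model L2-normalizes its outputs, so the dot product is the cosine
    return float(np.dot(a, b))
```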
### Fallback Mode
When no ONNX model is available, the system generates hash-based pseudo-embeddings:
```csharp
// In OnnxInferenceEngine.cs
if (_session is null)
{
    // Fallback: generate hash-based pseudo-embedding for testing
    vector = GenerateFallbackEmbedding(text, 768);
}
```
This allows the system to operate without a trained model (useful for testing) but with reduced accuracy.
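The fallback is deterministic but carries no semantic signal: it seeds a pseudo-random vector from a hash of the input text. A Python sketch of the idea (the real implementation is the C# `GenerateFallbackEmbedding`, whose exact algorithm may differ):
```python
import hashlib
import numpy as np

def fallback_embedding(text: str, dim: int = 768) -> np.ndarray:
    """Deterministic hash-seeded pseudo-embedding (no semantic meaning)."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    vec = np.random.default_rng(seed).standard_normal(dim).astype(np.float32)
    return vec / np.linalg.norm(vec)  # L2-normalize like the real model
```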
## Evaluation
### Metrics
| Metric | Definition | Target |
|--------|------------|--------|
| Accuracy | (TP + TN) / Total | > 90% |
| Precision | TP / (TP + FP) | > 95% |
| Recall | TP / (TP + FN) | > 85% |
| F1 Score | 2 * P * R / (P + R) | > 90% |
| Latency | Per-function embedding time | < 100ms |
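For reference, here is how these metrics are computed in code; the conventions for the 0/0 cases matter for degenerate baselines such as the hash-only approach in the benchmark below:
```python
def classification_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # no positive predictions -> 100%
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1
```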
### Running Evaluation
```bash
python evaluate.py \
--model models/codebert-binary.onnx \
--test data/test.jsonl \
--output results/evaluation.json
```
### Benchmark Results
From `EnsembleAccuracyBenchmarks`:
| Approach | Accuracy | Precision | Recall | F1 Score | Latency |
|----------|----------|-----------|--------|----------|---------|
| Phase 1 (Hash only) | 70% | 100% | 0% | 0% | 1ms |
| AST only | 75% | 80% | 70% | 74% | 5ms |
| Embedding only | 80% | 85% | 75% | 80% | 50ms |
| Ensemble (Phase 4) | 92% | 95% | 88% | 91% | 80ms |

The hash-only baseline finds no matches across compiler and optimization variants, hence 0% recall; because it also produces no false positives, its precision is 100% under the 0/0 convention noted above.
## Troubleshooting
### Common Issues
**Model not loading:**
- Verify ONNX file path is correct
- Check ONNX Runtime is installed: `dotnet add package Microsoft.ML.OnnxRuntime`
- Ensure model was exported with compatible opset version
**Low accuracy:**
- Verify training data quality and balance
- Check for data leakage between train/test splits
- Adjust contrastive loss margin
**High latency:**
- Reduce max sequence length (default 512)
- Enable batching for bulk operations
- Consider GPU acceleration for high-volume deployments
### Logging
Enable detailed ML logging:
```csharp
services.AddLogging(builder =>
{
    builder.AddFilter("StellaOps.BinaryIndex.ML", LogLevel.Debug);
});
```
## References
- [CodeBERT Paper](https://arxiv.org/abs/2002.08155)
- [Binary Code Similarity Detection](https://arxiv.org/abs/2308.01463)
- [ONNX Runtime Documentation](https://onnxruntime.ai/docs/)
- [Contrastive Learning for Code](https://arxiv.org/abs/2103.03143)