# BinaryIndex ML Model Training Guide

This document describes how to train, export, and deploy ML models for the BinaryIndex binary similarity detection system.

## Overview

The BinaryIndex ML pipeline uses transformer-based models to generate function embeddings that capture semantic similarity. The primary model is **CodeBERT-Binary**, a fine-tuned variant of CodeBERT optimized for decompiled binary code comparison.

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                    Model Training Pipeline                          │
│                                                                     │
│  ┌───────────────┐    ┌────────────────┐    ┌──────────────────┐  │
│  │ Training Data │ -> │ Fine-tuning    │ -> │ Model Export     │  │
│  │ (Function     │    │ (Contrastive   │    │ (ONNX format)    │  │
│  │ Pairs)        │    │ Learning)      │    │                  │  │
│  └───────────────┘    └────────────────┘    └──────────────────┘  │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │                    Inference Pipeline                         │ │
│  │                                                               │ │
│  │  Code -> Tokenizer -> ONNX Runtime -> Embedding (768-dim)    │ │
│  │                                                               │ │
│  └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```

## Training Data Requirements

### Positive Pairs (Similar Functions)

| Source | Description | Estimated Count |
|--------|-------------|-----------------|
| Same function, different optimization | O0 vs O2 vs O3 compilations | ~50,000 |
| Same function, different compiler | GCC vs Clang vs MSVC | ~30,000 |
| Same function, different version | From corpus snapshots | ~100,000 |
| Vulnerability patches | Vulnerable vs fixed versions | ~20,000 |

### Negative Pairs (Dissimilar Functions)

| Source | Description | Estimated Count |
|--------|-------------|-----------------|
| Random function pairs | Random sampling from corpus | ~100,000 |
| Similar-named different functions | Hard negatives for robustness | ~50,000 |
| Same library, different functions | Medium-difficulty negatives | ~50,000 |

**Total training data:** ~400,000 labeled pairs

### Data Format

Training data is stored in JSON Lines (JSONL) format, with one labeled function pair per line:

```json
{"function_a": "int sum(int* a, int n) { int s = 0; for (int i = 0; i < n; i++) s += a[i]; return s; }", "function_b": "int total(int* arr, int len) { int t = 0; for (int j = 0; j < len; j++) t += arr[j]; return t; }", "is_similar": true, "similarity_score": 0.95}
{"function_a": "int sum(int* a, int n) { ... }", "function_b": "void print(char* s) { ... }", "is_similar": false, "similarity_score": 0.1}
```
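
A loader for this format might look like the sketch below. It is not the project's actual data pipeline: the `FunctionPair` and `parse_pairs` names are hypothetical, and the check that `similarity_score` lies in [0, 1] is an assumption inferred from the two sample records.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FunctionPair:
    function_a: str
    function_b: str
    is_similar: bool
    similarity_score: float


def parse_pairs(lines: Iterable[str]) -> Iterator[FunctionPair]:
    """Yield validated pairs from JSONL lines, skipping blank lines."""
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        try:
            pair = FunctionPair(
                function_a=str(record["function_a"]),
                function_b=str(record["function_b"]),
                is_similar=bool(record["is_similar"]),
                similarity_score=float(record["similarity_score"]),
            )
        except KeyError as missing:
            raise ValueError(f"line {line_number}: missing field {missing}") from None
        # Assumed range; the guide only shows scores of 0.95 and 0.1.
        if not 0.0 <= pair.similarity_score <= 1.0:
            raise ValueError(f"line {line_number}: similarity_score outside [0, 1]")
        yield pair
```

Because an open file iterates line by line, `parse_pairs(open("data/train.jsonl", encoding="utf-8"))` streams the file without loading it all into memory.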

## Training Process

### Prerequisites

- Python 3.10+
- PyTorch 2.0+
- Transformers 4.30+
- CUDA 11.8+ (for GPU training)
- 64GB RAM, 32GB VRAM (V100 or A100 recommended)

### Installation

```bash
cd tools/ml
pip install -r requirements.txt
```

### Configuration

Create a training configuration file `config/training.yaml`:

```yaml
model:
  base_model: microsoft/codebert-base
  embedding_dim: 768
  max_sequence_length: 512

training:
  batch_size: 32
  epochs: 10
  learning_rate: 1e-5
  warmup_steps: 1000
  weight_decay: 0.01

contrastive:
  margin: 0.5
  temperature: 0.07

data:
  train_path: data/train.jsonl
  val_path: data/val.jsonl
  test_path: data/test.jsonl

output:
  model_dir: models/codebert-binary
  checkpoint_interval: 1000
```
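
The configuration sets `learning_rate` and `warmup_steps` but does not say which schedule the training script applies. The sketch below assumes linear warmup followed by linear decay to zero (the shape produced by `get_linear_schedule_with_warmup` in Transformers) and one optimizer step per batch. Both function names are hypothetical.

```python
def total_training_steps(num_pairs: int, batch_size: int = 32, epochs: int = 10) -> int:
    """Optimizer steps, assuming one step per batch (no gradient accumulation)."""
    steps_per_epoch = -(-num_pairs // batch_size)  # ceiling division
    return steps_per_epoch * epochs


def learning_rate_at(step: int, total_steps: int,
                     peak_lr: float = 1e-5, warmup_steps: int = 1000) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

For a hypothetical train split of 320,000 pairs, this gives 10,000 steps per epoch and 100,000 steps in total, so the 1,000 warmup steps cover the first 1% of training.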

### Running Training

```bash
python train_codebert_binary.py --config config/training.yaml
```

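Training logs are written to `logs/`, and checkpoints are written under `models/` (the configured `output.model_dir`, which is `models/codebert-binary` in the example above).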

### Training Script Overview

```python
# tools/ml/train_codebert_binary.py

import torch
from transformers import RobertaModel


class CodeBertBinaryModel(torch.nn.Module):
    """CodeBERT fine-tuned for binary code similarity."""

    def __init__(self, pretrained_model="microsoft/codebert-base"):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(pretrained_model)
        self.projection = torch.nn.Linear(768, 768)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        projected = self.projection(pooled)
        return torch.nn.functional.normalize(projected, p=2, dim=1)


class ContrastiveLoss(torch.nn.Module):
    """Contrastive loss for learning similarity embeddings."""

    def __init__(self, margin=0.5):
        super().__init__()
        self.margin = margin

    def forward(self, embedding_a, embedding_b, label):
        distance = torch.nn.functional.pairwise_distance(embedding_a, embedding_b)
        # label=1: similar, label=0: dissimilar
        loss = label * distance.pow(2) + \
               (1 - label) * torch.clamp(self.margin - distance, min=0).pow(2)
        return loss.mean()
```
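
To see what the margin means in practice, the per-pair loss can be reproduced in plain Python. This is a worked illustration of the formula in `ContrastiveLoss`, not part of the training script, and the helper names are hypothetical. Because the model L2-normalizes its output, Euclidean distance and cosine similarity are tied by d² = 2 − 2·cos.

```python
def contrastive_loss(distance: float, label: int, margin: float = 0.5) -> float:
    """Per-pair loss: label=1 pulls the pair together, label=0 pushes it apart."""
    if label == 1:
        return distance ** 2
    # Dissimilar pairs are only penalized while they are closer than the margin.
    return max(0.0, margin - distance) ** 2


def cosine_from_distance(distance: float) -> float:
    """For unit-length vectors, d^2 = 2 - 2*cos, so cos = 1 - d^2 / 2."""
    return 1.0 - distance ** 2 / 2.0
```

With the configured margin of 0.5, a dissimilar pair stops contributing to the loss once its distance reaches 0.5, which corresponds to a cosine similarity of 0.875 for unit-length embeddings. Keep this in mind when tuning the margin: a larger margin pushes negatives to lower cosine values.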

## Model Export

After training, export the model to ONNX format for inference:

```bash
python export_onnx.py \
    --model models/codebert-binary/best.pt \
    --output models/codebert-binary.onnx \
    --opset 17
```

### Export Script

```python
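# tools/ml/export_onnx.py

import torch


def export_to_onnx(model, output_path, opset=17):
    """Export the trained model to ONNX with dynamic batch and sequence axes."""
    model.eval()
    # Dummy inputs fix only dtype and rank; the sizes stay dynamic (see below).
    dummy_input = torch.randint(0, 50000, (1, 512))
    # torch.ones defaults to float32, so the exported attention_mask input is
    # float32 and the inference side must feed the same dtype.
    dummy_mask = torch.ones(1, 512)

    torch.onnx.export(
        model,
        (dummy_input, dummy_mask),
        output_path,
        input_names=['input_ids', 'attention_mask'],
        output_names=['embedding'],
        dynamic_axes={
            'input_ids': {0: 'batch', 1: 'seq'},
            'attention_mask': {0: 'batch', 1: 'seq'},
            'embedding': {0: 'batch'}
        },
        opset_version=opset  # default matches the --opset value shown above
    )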
```
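
After exporting, a quick smoke test can confirm that the ONNX file honors the dynamic axes and still emits normalized 768-dimensional vectors. The sketch below is an assumption about how such a check could look, not an existing script in `tools/ml`. It needs `numpy` and `onnxruntime`, reads each input's dtype from the exported graph, and uses the input and output names from the export call above.

```python
import math
from typing import Sequence


def rows_are_unit_norm(rows: Sequence[Sequence[float]], tol: float = 1e-3) -> bool:
    """True when every row has Euclidean length within tol of 1."""
    return all(abs(math.sqrt(sum(x * x for x in row)) - 1.0) <= tol for row in rows)


def check_exported_model(onnx_path: str, batch: int = 2, seq_len: int = 128) -> bool:
    """Run the exported model once and check output shape and normalization."""
    # Imported lazily: numpy and onnxruntime are only needed for the real check.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    # Feed each input with the dtype recorded in the exported graph.
    onnx_types = {node.name: node.type for node in session.get_inputs()}

    def dtype_for(name: str):
        return np.float32 if onnx_types[name] == "tensor(float)" else np.int64

    rng = np.random.default_rng(0)
    # A batch and sequence length that differ from the export-time dummy input.
    input_ids = rng.integers(0, 50000, size=(batch, seq_len)).astype(dtype_for("input_ids"))
    attention_mask = np.ones((batch, seq_len), dtype=dtype_for("attention_mask"))
    (embedding,) = session.run(
        ["embedding"],
        {"input_ids": input_ids, "attention_mask": attention_mask},
    )
    return embedding.shape == (batch, 768) and rows_are_unit_norm(embedding.tolist())
```

Calling `check_exported_model("models/codebert-binary.onnx")` with a batch of 2 and a sequence length of 128 exercises both dynamic axes, since the export used a 1×512 dummy input.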

## Deployment

### Configuration

Configure the ML service in your application:

```yaml
# etc/binaryindex.yaml
ml:
  enabled: true
  model_path: /opt/stellaops/models/codebert-binary.onnx
  vocabulary_path: /opt/stellaops/models/vocab.txt
  num_threads: 4
  batch_size: 16
```

### Code Integration

```csharp
// Register ML services
services.AddMlServices(options =>
{
    options.ModelPath = config["ml:model_path"];
    options.VocabularyPath = config["ml:vocabulary_path"];
    options.NumThreads = config.GetValue<int>("ml:num_threads");
});

// Use embedding service
var embedding = await embeddingService.GenerateEmbeddingAsync(
    new EmbeddingInput(decompiledCode, null, null, EmbeddingInputType.DecompiledCode));

// Compare embeddings
var similarity = embeddingService.ComputeSimilarity(embA, embB, SimilarityMetric.Cosine);
```
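
For reference, the cosine metric itself is shown below in Python. This is the standard definition, not the C# service's implementation, and the function name is hypothetical. For the L2-normalized vectors the model produces, it reduces to a plain dot product.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch; the value is undefined
    return dot / (norm_a * norm_b)
```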

### Fallback Mode

When no ONNX model is available, the system generates hash-based pseudo-embeddings:

```csharp
// In OnnxInferenceEngine.cs
if (_session is null)
{
    // Fallback: generate hash-based pseudo-embedding for testing
    vector = GenerateFallbackEmbedding(text, 768);
}
```

This lets the system run without a trained model, which is useful for testing, but similarity results are less accurate than with the trained model.
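
The body of `GenerateFallbackEmbedding` is not shown in this guide. As an illustration only, one way to build a deterministic hash-based vector is sketched below in Python. The SHA-256 counter scheme and the `fallback_embedding` name are assumptions, not the C# implementation.

```python
import hashlib
import math
from typing import List


def fallback_embedding(text: str, dim: int = 768) -> List[float]:
    """Deterministic pseudo-embedding derived only from a hash of the text."""
    values: List[float] = []
    counter = 0
    while len(values) < dim:
        # Hash the text with a counter prefix to get as many bytes as needed.
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        # Map each byte from [0, 255] to [-1.0, 1.0].
        values.extend(byte / 127.5 - 1.0 for byte in digest)
        counter += 1
    values = values[:dim]
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]
```

In this sketch, identical text always maps to the same unit-length vector, while any change to the text yields an unrelated vector, so the result behaves like exact matching and carries no learned similarity.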

## Evaluation

### Metrics

| Metric | Definition | Target |
|--------|------------|--------|
| Accuracy | (TP + TN) / Total | > 90% |
| Precision | TP / (TP + FP) | > 95% |
| Recall | TP / (TP + FN) | > 85% |
| F1 Score | 2 * P * R / (P + R) | > 90% |
| Latency | Per-function embedding time | < 100ms |
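
The classification metrics in the table can be computed from confusion counts as sketched below. The 0.5 decision threshold on the similarity score and both function names are assumptions, and the sketch returns 0.0 wherever a ratio has a zero denominator.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(scores: Sequence[float], labels: Sequence[bool],
                     threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), predicting 'similar' when score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and not label:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 as defined in the table above."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```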

### Running Evaluation

```bash
python evaluate.py \
    --model models/codebert-binary.onnx \
    --test data/test.jsonl \
    --output results/evaluation.json
```

### Benchmark Results

From `EnsembleAccuracyBenchmarks`:

| Approach | Accuracy | Precision | Recall | F1 Score | Latency |
|----------|----------|-----------|--------|----------|---------|
| Phase 1 (Hash only) | 70% | 100% | 0% | 0% | 1ms |
| AST only | 75% | 80% | 70% | 74% | 5ms |
| Embedding only | 80% | 85% | 75% | 80% | 50ms |
| Ensemble (Phase 4) | 92% | 95% | 88% | 91% | 80ms |

## Troubleshooting

### Common Issues

**Model not loading:**
- Verify ONNX file path is correct
- Confirm the project references the ONNX Runtime package; if it is missing, add it with `dotnet add package Microsoft.ML.OnnxRuntime`
- Ensure model was exported with compatible opset version

**Low accuracy:**
- Verify training data quality and balance
- Check for data leakage between train/test splits
- Adjust contrastive loss margin
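
One way to guard against leakage is to assign splits by hashing a group key, so every pair derived from the same source function lands in the same split. The sketch below assumes such a key exists for each pair (for example a library and function identifier). The data format above does not define one, and the `assign_split` name and the 80/10/10 proportions are hypothetical.

```python
import hashlib


def assign_split(group_key: str, val_fraction: float = 0.1,
                 test_fraction: float = 0.1) -> str:
    """Map a group key to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a fraction in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"
```

Because the assignment depends only on the key, re-running the split after adding new data never moves an existing function from one split to another.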

**High latency:**
- Reduce max sequence length (default 512)
- Enable batching for bulk operations
- Consider GPU acceleration for high-volume deployments

### Logging

Enable detailed ML logging:

```csharp
services.AddLogging(builder =>
{
    builder.AddFilter("StellaOps.BinaryIndex.ML", LogLevel.Debug);
});
```

## References

- [CodeBERT Paper](https://arxiv.org/abs/2002.08155)
- [Binary Code Similarity Detection](https://arxiv.org/abs/2308.01463)
- [ONNX Runtime Documentation](https://onnxruntime.ai/docs/)
- [Contrastive Learning for Code](https://arxiv.org/abs/2103.03143)