# BinaryIndex
**Status:** Implemented
**Source:** `src/BinaryIndex/`
**Owner:** Scanner Guild + Concelier Guild
## Purpose
BinaryIndex detects vulnerable binaries independently of package metadata. Package version strings can be misleading (backports, custom builds, stripped metadata), so BinaryIndex identifies vulnerabilities binary-first, using Build-IDs, hash catalogs, and function fingerprints.
## Components
**Libraries:**
- `StellaOps.BinaryIndex.Core` - Core binary identity extraction and matching engine
- `StellaOps.BinaryIndex.Corpus` - Binary-to-advisory mapping database
- `StellaOps.BinaryIndex.Corpus.Debian` - Debian-specific corpus support
- `StellaOps.BinaryIndex.Fingerprints` - Function fingerprint storage and matching (CFG/basic-block hashes)
- `StellaOps.BinaryIndex.FixIndex` - Patch-aware backport handling
- `StellaOps.BinaryIndex.Persistence` - Storage adapters for binary catalogs
## Configuration
Configuration is typically embedded in Scanner and Concelier module settings.
Key features:
- Three-tier binary identification (package/version, Build-ID/hash, function fingerprints)
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Integration with Scanner.Worker for binary lookup
- Offline-first design with deterministic outputs
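The three identification tiers listed above can be pictured as a lookup that consults each catalog in turn. The following is a minimal Python sketch, not the BinaryIndex implementation: the `BinaryIdentity` and `Match` types, the in-memory catalogs, the advisory names, and the `lookup` function are all hypothetical, and the real libraries are .NET code backed by PostgreSQL. The sketch returns every tier that matched and does not assume any precedence between tiers.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class BinaryIdentity:
    """Hypothetical identity record; any field may be missing."""
    package: Optional[str] = None
    version: Optional[str] = None
    build_id: Optional[str] = None
    function_fingerprints: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Match:
    tier: str
    advisory: str


# Hypothetical in-memory catalogs with placeholder advisory names.
PACKAGE_INDEX = {("examplelib", "1.0.0"): "EXAMPLE-ADVISORY-1"}
BUILD_ID_INDEX = {"0123456789abcdef0123456789abcdef01234567": "EXAMPLE-ADVISORY-1"}
FINGERPRINT_INDEX = {"fp-aaaa": "EXAMPLE-ADVISORY-2"}


def lookup(identity: BinaryIdentity) -> List[Match]:
    """Return all matches, tier by tier, in a deterministic order."""
    matches: List[Match] = []

    # Tier 1: package name and version, when metadata is present.
    key = (identity.package, identity.version)
    if key in PACKAGE_INDEX:
        matches.append(Match("package", PACKAGE_INDEX[key]))

    # Tier 2: Build-ID (or another binary-level identifier).
    if identity.build_id in BUILD_ID_INDEX:
        matches.append(Match("build-id", BUILD_ID_INDEX[identity.build_id]))

    # Tier 3: function fingerprints, sorted so output order is stable.
    for fingerprint in sorted(identity.function_fingerprints):
        if fingerprint in FINGERPRINT_INDEX:
            matches.append(Match("fingerprint", FINGERPRINT_INDEX[fingerprint]))

    return matches


if __name__ == "__main__":
    # A binary with stripped package metadata is still found via its Build-ID.
    stripped = BinaryIdentity(build_id="0123456789abcdef0123456789abcdef01234567")
    for match in lookup(stripped):
        print(match.tier, match.advisory)
```

Sorting the fingerprints before lookup is one simple way to keep results reproducible across runs, in the spirit of the deterministic-output goal above.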
## Dependencies
- PostgreSQL (integrated with Scanner/Concelier schemas)
- Scanner.Analyzers.Native (for binary disassembly/analysis)
- Concelier (for advisory-to-binary mapping)
## Related Documentation
- Architecture: `./architecture.md`
- High-Level Architecture: `../../ARCHITECTURE_OVERVIEW.md`
- Scanner Architecture: `../scanner/architecture.md`
- Concelier Architecture: `../concelier/architecture.md`
## Current Status
The library implementation is complete, with support for ELF (Build-ID), PE (CodeView GUID), and Mach-O (UUID) binary formats. It is integrated into Scanner's native binary analysis pipeline.
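To make the ELF case concrete: a GNU Build-ID is stored as an ELF note of type `NT_GNU_BUILD_ID` (3) with the owner name `GNU`. The Python sketch below parses the raw bytes of a note section and returns the Build-ID as hex. It is an illustration only, not the BinaryIndex extraction code; the function name `parse_gnu_build_id` is hypothetical, locating the note section inside the ELF file is left out, and the sketch assumes the 4-byte padding that GNU toolchains use for this note.

```python
import struct
from typing import Optional

NT_GNU_BUILD_ID = 3  # note type used by GNU toolchains for Build-IDs


def _align4(size: int) -> int:
    """Round size up to the next multiple of 4."""
    return (size + 3) & ~3


def parse_gnu_build_id(note_data: bytes, little_endian: bool = True) -> Optional[str]:
    """Scan raw note bytes and return the Build-ID as a hex string, or None."""
    header_format = "<III" if little_endian else ">III"
    offset = 0
    while offset + 12 <= len(note_data):
        # Each note starts with three 32-bit words: namesz, descsz, type.
        namesz, descsz, note_type = struct.unpack_from(header_format, note_data, offset)
        offset += 12
        name = note_data[offset:offset + namesz]
        offset += _align4(namesz)
        desc = note_data[offset:offset + descsz]
        offset += _align4(descsz)
        if note_type == NT_GNU_BUILD_ID and name == b"GNU\x00":
            if len(desc) != descsz:
                return None  # truncated note, do not guess
            return desc.hex()
    return None


if __name__ == "__main__":
    # Synthetic note: owner "GNU", 20-byte descriptor, type NT_GNU_BUILD_ID.
    note = struct.pack("<III", 4, 20, NT_GNU_BUILD_ID) + b"GNU\x00" + bytes(range(20))
    print(parse_gnu_build_id(note))
```

PE and Mach-O carry their identifiers in different structures (the CodeView debug record and the `LC_UUID` load command respectively), so each format needs its own parser.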
---
## Semantic Diffing Roadmap
A major enhancement to BinaryIndex is planned to enable **semantic-level binary diffing** - detecting function equivalence based on behavior rather than syntax. This addresses limitations in current byte/symbol-based matching when dealing with:
- Compiler optimizations (same source, different instructions)
- Stripped binaries (no symbols)
- Cross-compiler builds (GCC vs Clang)
- Obfuscated code
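The limitation of byte-level matching can be shown with a toy example. In the Python sketch below, two basic blocks do the same work but use different scratch registers, as two compilers might. A hash over the raw instruction text differs, while a hash over register-normalized text matches. This is only an illustration of why normalization helps: it is still syntactic, it is not the planned IR-level analysis, and the `normalize` and `block_hash` helpers and the sample blocks are hypothetical.

```python
import hashlib
import re
from typing import Iterable

# Matches a subset of x86-64 general-purpose register names (rax..rdx, rsi,
# rdi, rsp, rbp, r8..r15). Illustrative only, not a complete register list.
_X86_64_REGISTER = re.compile(r"\br(?:[a-d]x|[sd]i|[sb]p|[89]|1[0-5])\b")


def normalize(instruction: str) -> str:
    """Replace every recognized register name with the token REG."""
    return _X86_64_REGISTER.sub("REG", instruction.strip().lower())


def block_hash(instructions: Iterable[str], normalized: bool) -> str:
    """SHA-256 over the block's instruction text, optionally normalized."""
    lines = [
        normalize(instruction) if normalized else instruction.strip().lower()
        for instruction in instructions
    ]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Same computation, different scratch register (r10 versus r11).
BLOCK_A = ["mov r10, rdi", "add r10, 16", "mov rax, r10", "ret"]
BLOCK_B = ["mov r11, rdi", "add r11, 16", "mov rax, r11", "ret"]

if __name__ == "__main__":
    print("raw hashes equal:       ", block_hash(BLOCK_A, False) == block_hash(BLOCK_B, False))
    print("normalized hashes equal:", block_hash(BLOCK_A, True) == block_hash(BLOCK_B, True))
```

Collapsing all registers into one token is lossy and can make genuinely different blocks collide, which is one reason behavior-based comparison is attractive.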
### Planned Phases
| Phase | Description | Impact | Status |
|-------|-------------|--------|--------|
| **Phase 1** | IR-Level Semantic Analysis | +15% accuracy on optimized binaries | Planned |
| **Phase 2** | Function Behavior Corpus | +10% coverage on stripped binaries | Planned |
| **Phase 3** | Ghidra Integration | +5% edge case handling | Planned |
| **Phase 4** | Decompiler & ML Similarity | +10% obfuscation resilience | Planned |
### New Libraries (Planned)
- `StellaOps.BinaryIndex.Semantic` - IR lifting and semantic graph fingerprints
- `StellaOps.BinaryIndex.Corpus` - 30K+ function behavior database (shares its name with the existing Corpus library listed under Components)
- `StellaOps.BinaryIndex.Ghidra` - Ghidra Headless integration
- `StellaOps.BinaryIndex.Decompiler` - Decompiled code AST comparison
- `StellaOps.BinaryIndex.ML` - CodeBERT-based function embeddings
- `StellaOps.BinaryIndex.Ensemble` - Multi-signal decision fusion
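As a rough picture of what multi-signal decision fusion could look like, the Python sketch below combines per-signal similarity scores with a weighted average and skips signals that are unavailable. Everything in it is an assumption made for illustration: the signal names, the weights, the threshold, and the `fuse` function are hypothetical and do not describe the planned `StellaOps.BinaryIndex.Ensemble` design.

```python
from typing import Dict, Optional, Tuple

# Hypothetical weights per signal; real values would need calibration.
WEIGHTS: Dict[str, float] = {
    "syntactic": 0.2,
    "semantic": 0.4,
    "decompiler": 0.2,
    "embedding": 0.2,
}


def fuse(scores: Dict[str, Optional[float]], threshold: float = 0.8) -> Tuple[float, bool]:
    """Weighted average over the signals that produced a score.

    Returns (fused_score, is_match). Signals set to None or not listed in
    WEIGHTS are ignored, and the remaining weights are renormalized.
    """
    available = {
        name: score
        for name, score in scores.items()
        if score is not None and name in WEIGHTS
    }
    if not available:
        return 0.0, False
    total_weight = sum(WEIGHTS[name] for name in available)
    fused = sum(WEIGHTS[name] * score for name, score in available.items()) / total_weight
    return fused, fused >= threshold


if __name__ == "__main__":
    # The embedding signal is unavailable here, so it is left out of the average.
    print(fuse({"syntactic": 0.6, "semantic": 0.95, "embedding": None}))
```

Renormalizing over the available signals keeps the fused score in the same range when a signal is missing, for example when no decompiler output exists for a function.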
### Expected Outcomes
| Metric | Current | Target |
|--------|---------|--------|
| Patch detection accuracy | ~70% | 92%+ |
| Function identification (stripped) | ~50% | 85%+ |
| False positive rate | ~5% | <2% |
### Sprint Files
- `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- `docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
- `docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
- `docs/implplan/SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`
### Architecture Documentation
See `./semantic-diffing.md` for comprehensive architecture documentation.