# BinaryIndex

**Status:** Implemented
**Source:** `src/BinaryIndex/`
**Owner:** Scanner Guild + Concelier Guild

## Purpose

BinaryIndex detects vulnerable binaries independently of package metadata. Package version strings can be misleading (backports, custom builds, stripped metadata), so BinaryIndex identifies vulnerabilities from the binaries themselves, using Build-IDs, hash catalogs, and function fingerprints.
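To make the Build-ID idea concrete, here is a minimal Python sketch of reading a GNU Build-ID from the raw bytes of an ELF note section or segment. It is an illustration only, not BinaryIndex's own code: the function name `find_gnu_build_id` is hypothetical, a little-endian layout with 4-byte note alignment is assumed, and locating the note bytes inside the ELF file is left out.

```python
import struct
from typing import Optional

NT_GNU_BUILD_ID = 3  # note type the GNU toolchain uses for Build-IDs


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (ELF note padding)."""
    return (n + 3) & ~3


def find_gnu_build_id(note_bytes: bytes) -> Optional[str]:
    """Walk ELF note records and return the GNU Build-ID as lowercase hex.

    Assumes little-endian, 4-byte-aligned notes (hypothetical helper).
    Returns None when no Build-ID note is present or the data is truncated.
    """
    offset = 0
    while offset + 12 <= len(note_bytes):
        # Each note starts with three 32-bit fields: namesz, descsz, type.
        namesz, descsz, note_type = struct.unpack_from("<III", note_bytes, offset)
        name_start = offset + 12
        desc_start = name_start + _align4(namesz)
        desc_end = desc_start + descsz
        if desc_end > len(note_bytes):
            return None  # truncated note: stop rather than read garbage
        name = note_bytes[name_start:name_start + namesz]
        if note_type == NT_GNU_BUILD_ID and name == b"GNU\x00":
            return note_bytes[desc_start:desc_end].hex()
        offset = desc_start + _align4(descsz)
    return None


# Usage: a hand-built note carrying a 4-byte Build-ID.
sample = (
    struct.pack("<III", 4, 4, NT_GNU_BUILD_ID)
    + b"GNU\x00"
    + bytes.fromhex("deadbeef")
)
print(find_gnu_build_id(sample))  # prints "deadbeef"
```

Because the Build-ID is embedded by the linker, it survives package renames and stripped version metadata, which is what makes it usable as a catalog key.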

## Components

**Libraries:**
- `StellaOps.BinaryIndex.Core` - Core binary identity extraction and matching engine
- `StellaOps.BinaryIndex.Corpus` - Binary-to-advisory mapping database
- `StellaOps.BinaryIndex.Corpus.Debian` - Debian-specific corpus support
- `StellaOps.BinaryIndex.Fingerprints` - Function fingerprint storage and matching (CFG/basic-block hashes)
- `StellaOps.BinaryIndex.FixIndex` - Patch-aware backport handling
- `StellaOps.BinaryIndex.Persistence` - Storage adapters for binary catalogs
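The basic-block hashing mentioned for the Fingerprints library can be sketched roughly as follows. This is an assumption-laden Python illustration, not the library's algorithm: the operand-stripping normalization, the set-of-hashes fingerprint, and the Jaccard comparison are all choices made for this example, and every name in it is hypothetical.

```python
import hashlib
from typing import FrozenSet, List

BasicBlock = List[str]  # one instruction per string, e.g. "mov eax, 1"


def block_hash(block: BasicBlock) -> str:
    """Hash a basic block by its mnemonic sequence, ignoring operands.

    Dropping operands is a deliberately crude normalization (an assumption
    of this sketch) so that register allocation changes do not alter the hash.
    """
    mnemonics = [ins.split()[0].lower() for ins in block if ins.strip()]
    return hashlib.sha256(" ".join(mnemonics).encode("utf-8")).hexdigest()


def fingerprint(blocks: List[BasicBlock]) -> FrozenSet[str]:
    """A function fingerprint here is just the set of its block hashes."""
    return frozenset(block_hash(b) for b in blocks)


def similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity between two fingerprints (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Usage: the same function rebuilt with different registers, and a patched
# variant that gained a bounds-check block.
vulnerable = [["push rbp", "mov rbp, rsp"], ["mov eax, [rdi]", "ret"]]
rebuilt = [["push rbp", "mov rbp, rsp"], ["mov ecx, [rsi]", "ret"]]
patched = [
    ["push rbp", "mov rbp, rsp"],
    ["cmp rsi, 16", "ja fail"],
    ["mov eax, [rdi]", "ret"],
]

print(similarity(fingerprint(vulnerable), fingerprint(rebuilt)))
print(similarity(fingerprint(vulnerable), fingerprint(patched)))
```

In this toy data the rebuilt function matches the vulnerable one exactly, while the patched function shares two of three blocks. A production matcher would also need control-flow edges, not just block contents.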

## Configuration

Configuration is typically embedded in Scanner and Concelier module settings.

Key features:
- Three-tier binary identification (package/version, Build-ID/hash, function fingerprints)
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Integration with Scanner.Worker for binary lookup
- Offline-first design with deterministic outputs
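The three-tier identification listed above could be modeled as a cascade of lookups. The Python sketch below is a hedged illustration: the order in which tiers are consulted, the "first tier with a hit wins" rule, the in-memory catalogs, and all type names, package names, and advisory IDs are assumptions made for the example, not documented BinaryIndex behavior.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class BinaryEvidence:
    """Hypothetical evidence bundle extracted from one binary."""
    package: Optional[Tuple[str, str]] = None  # (name, version) from metadata
    build_id: Optional[str] = None             # e.g. GNU Build-ID as hex
    fingerprint_ids: Tuple[str, ...] = ()      # function fingerprint keys


@dataclass(frozen=True)
class Match:
    tier: str
    advisories: Tuple[str, ...]


Lookup = Callable[[BinaryEvidence], Optional[Tuple[str, ...]]]

# Placeholder catalogs; a real deployment would query a database instead.
PACKAGE_CATALOG = {("libexample", "1.2.3"): ("ADV-0002", "ADV-0001")}
BUILD_ID_CATALOG = {"deadbeef": ("ADV-0001",)}
FINGERPRINT_CATALOG = {"fp-parse-header": ("ADV-0003",)}


def by_package(e: BinaryEvidence) -> Optional[Tuple[str, ...]]:
    return PACKAGE_CATALOG.get(e.package) if e.package else None


def by_build_id(e: BinaryEvidence) -> Optional[Tuple[str, ...]]:
    return BUILD_ID_CATALOG.get(e.build_id) if e.build_id else None


def by_fingerprint(e: BinaryEvidence) -> Optional[Tuple[str, ...]]:
    hits = [adv for fp in e.fingerprint_ids for adv in FINGERPRINT_CATALOG.get(fp, ())]
    return tuple(hits) or None


TIERS: List[Tuple[str, Lookup]] = [
    ("package", by_package),
    ("build-id", by_build_id),
    ("fingerprint", by_fingerprint),
]


def identify(evidence: BinaryEvidence, tiers: List[Tuple[str, Lookup]]) -> Optional[Match]:
    """Return the first tier that yields advisories; sort for deterministic output."""
    for name, lookup in tiers:
        advisories = lookup(evidence)
        if advisories:
            return Match(tier=name, advisories=tuple(sorted(advisories)))
    return None


# Usage: a binary with no package metadata still resolves via its Build-ID.
stripped = BinaryEvidence(build_id="deadbeef")
print(identify(stripped, TIERS))
```

Sorting the advisory list before returning it is one simple way to keep results reproducible across runs, in the spirit of the deterministic-output goal.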

## Dependencies

- PostgreSQL (integrated with Scanner/Concelier schemas)
- Scanner.Analyzers.Native (for binary disassembly/analysis)
- Concelier (for advisory-to-binary mapping)

## Related Documentation

- Architecture: `./architecture.md`
- High-Level Architecture: `../../ARCHITECTURE_OVERVIEW.md`
- Scanner Architecture: `../scanner/architecture.md`
- Concelier Architecture: `../concelier/architecture.md`
## Current Status

The library implementation is complete and supports the ELF (Build-ID), PE (CodeView GUID), and Mach-O (UUID) binary formats. It is integrated into Scanner's native binary analysis pipeline.

---

## Semantic Diffing Roadmap

A major enhancement to BinaryIndex is planned: **semantic-level binary diffing**, which detects function equivalence based on behavior rather than syntax. It addresses limitations of the current byte/symbol-based matching when dealing with:

- Compiler optimizations (same source, different instructions)
- Stripped binaries (no symbols)
- Cross-compiler builds (GCC vs Clang)
- Obfuscated code
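The difference between syntactic and behavioral matching can be illustrated with a toy example. The Python sketch below uses a made-up single-register IR and compares programs by their outputs on a fixed set of probe inputs. It is not the planned IR-lifting approach, only a small demonstration of why two instruction sequences can hash differently yet behave the same. All names are hypothetical.

```python
import hashlib
from typing import List, Tuple

Instr = Tuple[str, int]  # (opcode, immediate) in a toy single-register IR

MASK = 0xFFFFFFFF  # emulate 32-bit wraparound arithmetic
PROBES = (0, 1, 2, 7, 0x7FFFFFFF, 0xFFFFFFFF)  # fixed inputs for sampling


def run(program: List[Instr], x: int) -> int:
    """Execute a straight-line program on one 32-bit accumulator."""
    acc = x & MASK
    for op, imm in program:
        if op == "add":
            acc = (acc + imm) & MASK
        elif op == "mul":
            acc = (acc * imm) & MASK
        elif op == "shl":
            acc = (acc << imm) & MASK
        else:
            raise ValueError(f"unknown opcode: {op}")
    return acc


def syntactic_hash(program: List[Instr]) -> str:
    """Hash of the literal instruction list: any rewrite changes it."""
    return hashlib.sha256(repr(program).encode("utf-8")).hexdigest()


def behavioral_signature(program: List[Instr]) -> Tuple[int, ...]:
    """Outputs on the probe inputs: evidence of equivalence, not a proof."""
    return tuple(run(program, x) for x in PROBES)


# Usage: the same computation (8 * x + 2) before and after optimization.
unoptimized = [("mul", 8), ("add", 1), ("add", 1)]
optimized = [("shl", 3), ("add", 2)]

print(syntactic_hash(unoptimized) == syntactic_hash(optimized))
print(behavioral_signature(unoptimized) == behavioral_signature(optimized))
```

The first comparison is false and the second is true: the instruction lists differ, but every probe input produces the same result. Matching probe outputs only suggests equivalence, so a real system would need stronger evidence.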

### Planned Phases

| Phase | Description | Impact | Status |
|-------|-------------|--------|--------|
| **Phase 1** | IR-Level Semantic Analysis | +15% accuracy on optimized binaries | Planned |
| **Phase 2** | Function Behavior Corpus | +10% coverage on stripped binaries | Planned |
| **Phase 3** | Ghidra Integration | +5% edge case handling | Planned |
| **Phase 4** | Decompiler & ML Similarity | +10% obfuscation resilience | Planned |

### New Libraries (Planned)

- `StellaOps.BinaryIndex.Semantic` - IR lifting and semantic graph fingerprints
- `StellaOps.BinaryIndex.Corpus` - 30K+ function behavior database
- `StellaOps.BinaryIndex.Ghidra` - Ghidra Headless integration
- `StellaOps.BinaryIndex.Decompiler` - Decompiled code AST comparison
- `StellaOps.BinaryIndex.ML` - CodeBERT-based function embeddings
- `StellaOps.BinaryIndex.Ensemble` - Multi-signal decision fusion
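Multi-signal decision fusion, as named for the Ensemble library, could take many forms. One simple possibility is a weighted mean over whichever similarity signals are available, sketched below in Python. The signal names, weights, and threshold are hypothetical placeholders, since the roadmap does not specify how signals are combined.

```python
from typing import Dict

# Hypothetical weights: the roadmap does not define how signals are combined.
WEIGHTS: Dict[str, float] = {"semantic": 0.4, "decompiler": 0.3, "ml": 0.3}


def fuse(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted mean of the similarity signals that are present.

    Weights are renormalized over the available signals, so a missing
    signal (for example, no decompiler output) does not drag the score down.
    """
    usable = {name: value for name, value in scores.items() if name in weights}
    if not usable:
        raise ValueError("no known signals supplied")
    for name, value in usable.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name!r} must be in [0, 1]")
    total_weight = sum(weights[name] for name in usable)
    return sum(weights[name] * value for name, value in usable.items()) / total_weight


def is_match(scores: Dict[str, float], threshold: float = 0.8) -> bool:
    """Declare a match when the fused score reaches a (hypothetical) threshold."""
    return fuse(scores) >= threshold


# Usage: strong semantic agreement, weak ML agreement, no decompiler signal.
partial = {"semantic": 0.9, "ml": 0.5}
print(round(fuse(partial), 3), is_match(partial))
```

Renormalizing over the available signals keeps the fused score comparable whether or not every analyzer ran, which matters when some signals are expensive or unavailable offline.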

### Expected Outcomes

| Metric | Current | Target |
|--------|---------|--------|
| Patch detection accuracy | ~70% | 92%+ |
| Function identification (stripped) | ~50% | 85%+ |
| False positive rate | ~5% | <2% |

### Sprint Files

- `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- `docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
- `docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
- `docs/implplan/SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`

### Architecture Documentation

See `./semantic-diffing.md` for comprehensive architecture documentation.