docs consolidation, big sln build fixes, new advisories and sprints/tasks
docs/modules/binary-index/README.md
@@ -0,0 +1,94 @@
# BinaryIndex

**Status:** Implemented
**Source:** `src/BinaryIndex/`
**Owner:** Scanner Guild + Concelier Guild

## Purpose

BinaryIndex detects vulnerable binaries independently of package metadata. It closes the gap left when package version strings are misleading (backports, custom builds, stripped metadata) by identifying vulnerabilities binary-first, using Build-IDs, hash catalogs, and function fingerprints.

## Components

**Libraries:**

- `StellaOps.BinaryIndex.Core` - Core binary identity extraction and matching engine
- `StellaOps.BinaryIndex.Corpus` - Binary-to-advisory mapping database
- `StellaOps.BinaryIndex.Corpus.Debian` - Debian-specific corpus support
- `StellaOps.BinaryIndex.Fingerprints` - Function fingerprint storage and matching (CFG/basic-block hashes)
- `StellaOps.BinaryIndex.FixIndex` - Patch-aware backport handling
- `StellaOps.BinaryIndex.Persistence` - Storage adapters for binary catalogs
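
To make the "basic-block hashes" idea behind the Fingerprints library concrete, here is a minimal Python sketch. It is not the `StellaOps.BinaryIndex.Fingerprints` API. The function names, the use of SHA-256 over instruction mnemonics, and the Jaccard similarity measure are all assumptions chosen for illustration.

```python
import hashlib


def block_hash(instructions):
    """Hash one basic block, given as a sequence of instruction mnemonics.

    Hashing mnemonics only (no operands) is an assumption of this sketch.
    """
    joined = "\n".join(instructions).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def function_fingerprint(basic_blocks):
    """Fingerprint a function as the set of its basic-block hashes."""
    return frozenset(block_hash(block) for block in basic_blocks)


def jaccard(a, b):
    """Similarity of two fingerprints: shared blocks / all distinct blocks."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical functions: the patched build changes one block and adds another.
vulnerable = function_fingerprint(
    [["push", "mov", "call"], ["cmp", "jne"], ["ret"]]
)
patched = function_fingerprint(
    [["push", "mov", "call"], ["cmp", "jae"], ["test", "je"], ["ret"]]
)

print(jaccard(vulnerable, vulnerable))
print(jaccard(vulnerable, patched))
```

A set-based fingerprint ignores block order and control-flow edges. A CFG-aware scheme would also need to hash the graph structure, which this sketch leaves out.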

## Configuration

Configuration is typically embedded in Scanner and Concelier module settings.

Key features:

- Three-tier binary identification (package/version, Build-ID/hash, function fingerprints)
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Integration with Scanner.Worker for binary lookup
- Offline-first design with deterministic outputs
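
The three-tier identification and the deterministic-output goal listed above can be sketched as follows. This is an illustration under stated assumptions, not the actual BinaryIndex lookup. The in-memory indexes, the `BinaryIdentity` type, the advisory IDs, and the choice to report evidence from every tier (rather than stopping at the first hit) are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory catalogs; a real deployment would query storage.
PACKAGE_INDEX = {("openssl", "1.1.1k"): ["ADV-0001"]}
BUILD_ID_INDEX = {"abc123": ["ADV-0001"]}
FINGERPRINT_INDEX = {"fn-hash-1": ["ADV-0002"]}


@dataclass(frozen=True)
class BinaryIdentity:
    package: Optional[str] = None
    version: Optional[str] = None
    build_id: Optional[str] = None
    function_hashes: frozenset = frozenset()


def lookup(identity):
    """Collect advisory matches from each tier and return canonical JSON."""
    findings = []
    # Tier 1: package/version metadata, when present.
    if identity.package and identity.version:
        key = (identity.package, identity.version)
        for adv in PACKAGE_INDEX.get(key, []):
            findings.append({"advisory": adv, "tier": "package"})
    # Tier 2: Build-ID or hash catalog.
    if identity.build_id:
        for adv in BUILD_ID_INDEX.get(identity.build_id, []):
            findings.append({"advisory": adv, "tier": "build-id"})
    # Tier 3: function fingerprints.
    for fn_hash in identity.function_hashes:
        for adv in FINGERPRINT_INDEX.get(fn_hash, []):
            findings.append({"advisory": adv, "tier": "fingerprint"})
    # Sort and serialize canonically so identical inputs give identical bytes.
    findings.sort(key=lambda f: (f["advisory"], f["tier"]))
    return json.dumps(findings, sort_keys=True, separators=(",", ":"))


# A binary with stripped package metadata still matches on tiers 2 and 3.
stripped = BinaryIdentity(build_id="abc123",
                          function_hashes=frozenset({"fn-hash-1"}))
print(lookup(stripped))
```

Sorting the findings before serialization keeps the output independent of set iteration order, which is one simple way to get reproducible results offline.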

## Dependencies

- PostgreSQL (integrated with Scanner/Concelier schemas)
- Scanner.Analyzers.Native (for binary disassembly/analysis)
- Concelier (for advisory-to-binary mapping)

## Related Documentation

- Architecture: `./architecture.md`
- High-Level Architecture: `../../ARCHITECTURE_OVERVIEW.md`
- Scanner Architecture: `../scanner/architecture.md`
- Concelier Architecture: `../concelier/architecture.md`

## Current Status

The library implementation is complete and supports the ELF (Build-ID), PE (CodeView GUID), and Mach-O (UUID) binary formats. It is integrated into Scanner's native binary analysis pipeline.
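
For the ELF case, the Build-ID lives in a note whose layout is fixed by the ELF format: three 32-bit words (`namesz`, `descsz`, `type`), then the name and the descriptor, each padded to a 4-byte boundary. The sketch below parses such a note blob in Python. It is not the BinaryIndex extractor. It assumes the caller has already located the raw bytes of the `.note.gnu.build-id` section (or a `PT_NOTE` segment), and the function name and 4-byte alignment assumption are illustrative. PE and Mach-O are not covered.

```python
import struct

NT_GNU_BUILD_ID = 3  # note type used for GNU Build-IDs


def _align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def parse_gnu_build_id(note_data, little_endian=True):
    """Return the Build-ID as a hex string, or None if no such note exists."""
    header = "<III" if little_endian else ">III"
    offset = 0
    while offset + 12 <= len(note_data):
        namesz, descsz, note_type = struct.unpack_from(header, note_data, offset)
        name_start = offset + 12
        desc_start = name_start + _align4(namesz)
        desc_end = desc_start + descsz
        if desc_end > len(note_data):
            return None  # truncated or malformed note
        name = note_data[name_start:name_start + namesz]
        if name == b"GNU\x00" and note_type == NT_GNU_BUILD_ID:
            return note_data[desc_start:desc_end].hex()
        # Skip to the next note in the blob.
        offset = desc_start + _align4(descsz)
    return None


# Synthetic example: an unrelated GNU note followed by a 20-byte Build-ID.
abi_note = struct.pack("<III", 4, 16, 1) + b"GNU\x00" + bytes(16)
build_id_note = struct.pack("<III", 4, 20, NT_GNU_BUILD_ID) + b"GNU\x00" + bytes(range(20))
print(parse_gnu_build_id(abi_note + build_id_note))
```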

---

## Semantic Diffing Roadmap

A major enhancement to BinaryIndex is planned to enable **semantic-level binary diffing** - detecting function equivalence based on behavior rather than syntax. This addresses limitations in current byte/symbol-based matching when dealing with:

- Compiler optimizations (same source, different instructions)
- Stripped binaries (no symbols)
- Cross-compiler builds (GCC vs Clang)
- Obfuscated code

### Planned Phases

| Phase | Description | Impact | Status |
|-------|-------------|--------|--------|
| **Phase 1** | IR-Level Semantic Analysis | +15% accuracy on optimized binaries | Planned |
| **Phase 2** | Function Behavior Corpus | +10% coverage on stripped binaries | Planned |
| **Phase 3** | Ghidra Integration | +5% edge case handling | Planned |
| **Phase 4** | Decompiler & ML Similarity | +10% obfuscation resilience | Planned |

### New Libraries (Planned)

- `StellaOps.BinaryIndex.Semantic` - IR lifting and semantic graph fingerprints
- `StellaOps.BinaryIndex.Corpus` - 30K+ function behavior database
- `StellaOps.BinaryIndex.Ghidra` - Ghidra Headless integration
- `StellaOps.BinaryIndex.Decompiler` - Decompiled code AST comparison
- `StellaOps.BinaryIndex.ML` - CodeBERT-based function embeddings
- `StellaOps.BinaryIndex.Ensemble` - Multi-signal decision fusion
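
Since the Ensemble library is only planned, the following Python sketch shows just one plausible shape for "multi-signal decision fusion": a weighted average over whichever similarity signals are available, compared against a threshold. The signal names, weights, threshold, and the `fuse` function are assumptions made for illustration and do not describe the planned design.

```python
# Hypothetical per-signal weights; real values would need calibration.
WEIGHTS = {"semantic": 0.4, "decompiler": 0.3, "ml": 0.3}


def fuse(scores, weights=WEIGHTS, threshold=0.8):
    """Combine per-signal similarity scores in [0, 1] into one decision.

    Signals missing from `scores` are skipped and the remaining weights
    are renormalized, so a partial analysis still yields a usable score.
    Returns (fused_score, is_match).
    """
    used = {name: w for name, w in weights.items() if name in scores}
    total = sum(used.values())
    if total == 0:
        return 0.0, False  # no usable signal, so no match is claimed
    fused = sum(scores[name] * w for name, w in used.items()) / total
    return fused, fused >= threshold


# All three signals present.
print(fuse({"semantic": 0.9, "decompiler": 0.8, "ml": 0.7}))
# Only the semantic signal is available.
print(fuse({"semantic": 0.9}))
```

Renormalizing over the available signals avoids penalizing a function just because one analyzer could not run, at the cost of treating a single strong signal as sufficient evidence.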

### Expected Outcomes

| Metric | Current | Target |
|--------|---------|--------|
| Patch detection accuracy | ~70% | 92%+ |
| Function identification (stripped) | ~50% | 85%+ |
| False positive rate | ~5% | <2% |

### Sprint Files

- `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- `docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
- `docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
- `docs/implplan/SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`

### Architecture Documentation

See `./semantic-diffing.md` for comprehensive architecture documentation.