# BinaryIndex

**Status:** Implemented
**Source:** `src/BinaryIndex/`
**Owner:** Scanner Guild + Concelier Guild

## Purpose

BinaryIndex provides vulnerable binary detection independent of package metadata. It addresses the gap where package version strings can lie (backports, custom builds, stripped metadata) through binary-first vulnerability identification using Build-IDs, hash catalogs, and function fingerprints.

## Components

**Libraries:**
- `StellaOps.BinaryIndex.Core` - Core binary identity extraction and matching engine
- `StellaOps.BinaryIndex.Corpus` - Binary-to-advisory mapping database
- `StellaOps.BinaryIndex.Corpus.Debian` - Debian-specific corpus support
- `StellaOps.BinaryIndex.Fingerprints` - Function fingerprint storage and matching (CFG/basic-block hashes)
- `StellaOps.BinaryIndex.FixIndex` - Patch-aware backport handling
- `StellaOps.BinaryIndex.Persistence` - Storage adapters for binary catalogs

## Configuration

Configuration is typically embedded in Scanner and Concelier module settings.

Key features:
- Three-tier binary identification (package/version, Build-ID/hash, function fingerprints)
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Integration with Scanner.Worker for binary lookup
- Offline-first design with deterministic outputs

## Dependencies

- PostgreSQL (integrated with Scanner/Concelier schemas)
- Scanner.Analyzers.Native (for binary disassembly/analysis)
- Concelier (for advisory-to-binary mapping)

## Related Documentation

- Architecture: `./architecture.md`
- High-Level Architecture: `../../ARCHITECTURE_OVERVIEW.md`
- Scanner Architecture: `../scanner/architecture.md`
- Concelier Architecture: `../concelier/architecture.md`

## Current Status

Library implementation complete with support for ELF (Build-ID), PE (CodeView GUID), and Mach-O (UUID) binary formats. Integrated into Scanner's native binary analysis pipeline.

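
To make the ELF Build-ID identity mentioned above more concrete, the following is a minimal sketch of reading a GNU Build-ID from the raw bytes of an ELF note section (for example `.note.gnu.build-id`). It is an illustration only, not the BinaryIndex implementation: the function name `parse_gnu_build_id` and its parameters are hypothetical, and it assumes the caller has already located the note bytes and knows the file's byte order.

```python
import struct
from typing import Optional

NT_GNU_BUILD_ID = 3  # note type GNU toolchains use for the Build-ID


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (note fields are 4-byte aligned)."""
    return (n + 3) & ~3


def parse_gnu_build_id(note_data: bytes, little_endian: bool = True) -> Optional[str]:
    """Return the Build-ID as lowercase hex, or None if no Build-ID note is found.

    Hypothetical helper: note_data is the raw content of an ELF note
    section or PT_NOTE segment, which may hold several notes back to back.
    """
    header = "<III" if little_endian else ">III"
    offset = 0
    while offset + 12 <= len(note_data):
        # Each note starts with three 32-bit words: namesz, descsz, type.
        namesz, descsz, note_type = struct.unpack_from(header, note_data, offset)
        offset += 12
        name = note_data[offset:offset + namesz]
        offset += _align4(namesz)
        desc = note_data[offset:offset + descsz]
        offset += _align4(descsz)
        is_build_id = note_type == NT_GNU_BUILD_ID and name == b"GNU\x00"
        if is_build_id and len(desc) == descsz:  # skip truncated notes
            return desc.hex()
    return None
```

The hex string returned here is the kind of stable key a Build-ID/hash lookup tier could use, because it identifies the build itself rather than whatever version string the package metadata reports.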

---

## Semantic Diffing Roadmap

A major enhancement to BinaryIndex is planned to enable **semantic-level binary diffing** - detecting function equivalence based on behavior rather than syntax. This addresses limitations in current byte/symbol-based matching when dealing with:

- Compiler optimizations (same source, different instructions)
- Stripped binaries (no symbols)
- Cross-compiler builds (GCC vs Clang)
- Obfuscated code

### Planned Phases

| Phase | Description | Impact | Status |
|-------|-------------|--------|--------|
| **Phase 1** | IR-Level Semantic Analysis | +15% accuracy on optimized binaries | Planned |
| **Phase 2** | Function Behavior Corpus | +10% coverage on stripped binaries | Planned |
| **Phase 3** | Ghidra Integration | +5% edge case handling | Planned |
| **Phase 4** | Decompiler & ML Similarity | +10% obfuscation resilience | Planned |

### New Libraries (Planned)

- `StellaOps.BinaryIndex.Semantic` - IR lifting and semantic graph fingerprints
- `StellaOps.BinaryIndex.Corpus` - 30K+ function behavior database
- `StellaOps.BinaryIndex.Ghidra` - Ghidra Headless integration
- `StellaOps.BinaryIndex.Decompiler` - Decompiled code AST comparison
- `StellaOps.BinaryIndex.ML` - CodeBERT-based function embeddings
- `StellaOps.BinaryIndex.Ensemble` - Multi-signal decision fusion

### Expected Outcomes

| Metric | Current | Target |
|--------|---------|--------|
| Patch detection accuracy | ~70% | 92%+ |
| Function identification (stripped) | ~50% | 85%+ |
| False positive rate | ~5% | <2% |

### Sprint Files

- `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- `docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
- `docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
- `docs/implplan/SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`

### Architecture Documentation

See `./semantic-diffing.md` for comprehensive architecture documentation.
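
As a rough illustration of the "multi-signal decision fusion" idea listed for the planned `StellaOps.BinaryIndex.Ensemble` library, the sketch below combines per-signal similarity scores into one match decision using a weighted average. Everything in it is an assumption made for illustration: the signal names, the weights, the threshold, and the `fuse_signals` function itself are hypothetical and are not taken from the roadmap or the sprint files.

```python
from typing import Dict, Optional, Tuple

# Hypothetical weights; a real design would presumably calibrate these on a corpus.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "semantic_graph": 0.4,
    "decompiled_ast": 0.3,
    "ml_embedding": 0.3,
}


def fuse_signals(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float] = DEFAULT_WEIGHTS,
    threshold: float = 0.8,  # hypothetical decision threshold
) -> Tuple[float, bool]:
    """Return (fused_score, is_match) from per-signal similarities in [0, 1].

    Signals that are missing (None) or have no weight are skipped, and the
    remaining weights are renormalised so a partial result is not penalised.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    # Iterate in sorted order so floating-point summation is deterministic.
    for name in sorted(scores):
        score = scores[name]
        if score is None or name not in weights:
            continue
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} must be in [0, 1]")
        weighted_sum += weights[name] * score
        total_weight += weights[name]
    if total_weight == 0.0:
        return 0.0, False  # no usable signal: report no match
    fused = weighted_sum / total_weight
    return fused, fused >= threshold
```

Renormalising over the signals that are actually present is one possible design choice: it lets a stripped or partially analysable binary still produce a decision. A real ensemble might instead require a minimum number of signals before reporting a match.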