save progress
This commit is contained in:
439
docs/modules/binary-index/bsim-setup.md
Normal file
439
docs/modules/binary-index/bsim-setup.md
Normal file
@@ -0,0 +1,439 @@
|
||||
# BSim PostgreSQL Database Setup Guide
|
||||
|
||||
**Version:** 1.0
|
||||
**Sprint:** SPRINT_20260105_001_003_BINDEX
|
||||
**Task:** GHID-011
|
||||
|
||||
## Overview
|
||||
|
||||
Ghidra's BSim (Binary Similarity) feature requires a separate PostgreSQL database for storing and querying function signatures. This guide covers setup and configuration.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────┐
|
||||
│ StellaOps BinaryIndex │
|
||||
├──────────────────────────────────────────────────────┤
|
||||
│ Main Corpus DB │ BSim DB (Ghidra) │
|
||||
│ (corpus.* schema) │ (separate instance) │
|
||||
│ │ │
|
||||
│ - Function metadata │ - BSim signatures │
|
||||
│ - Fingerprints │ - Feature vectors │
|
||||
│ - Clusters │ - Similarity index │
|
||||
│ - CVE associations │ │
|
||||
└──────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Why Separate?**
|
||||
- BSim uses Ghidra-specific schema and stored procedures
|
||||
- Different access patterns (corpus: OLTP, BSim: analytical)
|
||||
- BSim database can be shared across multiple Ghidra instances
|
||||
- Isolation prevents schema conflicts
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- PostgreSQL 14+ (BSim requires specific PostgreSQL features)
|
||||
- Ghidra 11.x with BSim extension
|
||||
- Network connectivity between BinaryIndex services and BSim database
|
||||
- At least 10GB storage for initial database (scales with corpus size)
|
||||
|
||||
## Database Setup
|
||||
|
||||
### 1. Create BSim Database
|
||||
|
||||
```bash
|
||||
# Create database
|
||||
createdb bsim_corpus
|
||||
|
||||
# Create user
|
||||
psql -c "CREATE USER bsim_user WITH PASSWORD 'secure_password_here';"
|
||||
psql -c "GRANT ALL PRIVILEGES ON DATABASE bsim_corpus TO bsim_user;"
|
||||
```
|
||||
|
||||
### 2. Initialize BSim Schema
|
||||
|
||||
Ghidra provides scripts to initialize the BSim database schema:
|
||||
|
||||
```bash
|
||||
# Set Ghidra home
|
||||
export GHIDRA_HOME=/opt/ghidra
|
||||
|
||||
# Run BSim database initialization
|
||||
$GHIDRA_HOME/Ghidra/Features/BSim/data/postgresql_init.sh \
|
||||
--host localhost \
|
||||
--port 5432 \
|
||||
--database bsim_corpus \
|
||||
--user bsim_user \
|
||||
--password secure_password_here
|
||||
```
|
||||
|
||||
Alternatively, use Ghidra's BSim server setup:
|
||||
|
||||
```bash
|
||||
# Create BSim server configuration
|
||||
$GHIDRA_HOME/support/bsimServerSetup \
|
||||
postgresql://localhost:5432/bsim_corpus \
|
||||
--user bsim_user \
|
||||
--password secure_password_here
|
||||
```
|
||||
|
||||
### 3. Verify Installation
|
||||
|
||||
```bash
|
||||
# Connect to database
|
||||
psql -h localhost -U bsim_user -d bsim_corpus
|
||||
|
||||
# Check BSim tables exist
|
||||
\dt
|
||||
|
||||
# Expected tables:
|
||||
# - bsim_functions
|
||||
# - bsim_executables
|
||||
# - bsim_vectors
|
||||
# - bsim_clusters
|
||||
# etc.
|
||||
|
||||
# Exit
|
||||
\q
|
||||
```
|
||||
|
||||
## Docker Deployment
|
||||
|
||||
### Docker Compose Configuration
|
||||
|
||||
```yaml
|
||||
# docker-compose.bsim.yml
|
||||
version: '3.8'
|
||||
|
||||
services:
|
||||
bsim-postgres:
|
||||
image: postgres:16
|
||||
container_name: stellaops-bsim-db
|
||||
environment:
|
||||
POSTGRES_DB: bsim_corpus
|
||||
POSTGRES_USER: bsim_user
|
||||
POSTGRES_PASSWORD: ${BSIM_DB_PASSWORD}
|
||||
POSTGRES_INITDB_ARGS: "-E UTF8 --locale=C"
|
||||
volumes:
|
||||
- bsim-data:/var/lib/postgresql/data
|
||||
- ./scripts/init-bsim.sh:/docker-entrypoint-initdb.d/10-init-bsim.sh:ro
|
||||
ports:
|
||||
- "5433:5432" # Different port to avoid conflict with main DB
|
||||
networks:
|
||||
- stellaops
|
||||
healthcheck:
|
||||
test: ["CMD-SHELL", "pg_isready -U bsim_user -d bsim_corpus"]
|
||||
interval: 10s
|
||||
timeout: 5s
|
||||
retries: 5
|
||||
|
||||
ghidra-headless:
|
||||
image: stellaops/ghidra-headless:11.2
|
||||
container_name: stellaops-ghidra
|
||||
depends_on:
|
||||
bsim-postgres:
|
||||
condition: service_healthy
|
||||
environment:
|
||||
BSIM_DB_URL: "postgresql://bsim-postgres:5432/bsim_corpus"
|
||||
BSIM_DB_USER: bsim_user
|
||||
BSIM_DB_PASSWORD: ${BSIM_DB_PASSWORD}
|
||||
JAVA_HOME: /opt/java/openjdk
|
||||
MAXMEM: 4G
|
||||
volumes:
|
||||
- ghidra-projects:/projects
|
||||
- ghidra-scripts:/scripts
|
||||
networks:
|
||||
- stellaops
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
cpus: '4'
|
||||
memory: 8G
|
||||
|
||||
volumes:
|
||||
bsim-data:
|
||||
driver: local
|
||||
ghidra-projects:
|
||||
ghidra-scripts:
|
||||
|
||||
networks:
|
||||
stellaops:
|
||||
driver: bridge
|
||||
```
|
||||
|
||||
### Initialization Script
|
||||
|
||||
Create `scripts/init-bsim.sh`:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
until pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"; do
|
||||
echo "Waiting for PostgreSQL..."
|
||||
sleep 2
|
||||
done
|
||||
|
||||
echo "PostgreSQL is ready. Installing BSim schema..."
|
||||
|
||||
# Note: Actual BSim schema SQL would be sourced from Ghidra distribution
|
||||
# This is a placeholder - replace with actual Ghidra BSim schema
|
||||
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
|
||||
-- BSim schema will be initialized by Ghidra tools
|
||||
-- This script just ensures the database is ready
|
||||
|
||||
COMMENT ON DATABASE bsim_corpus IS 'Ghidra BSim function signature database';
|
||||
EOSQL
|
||||
|
||||
echo "BSim database initialized successfully"
|
||||
```
|
||||
|
||||
### Start Services
|
||||
|
||||
```bash
|
||||
# Set password
|
||||
export BSIM_DB_PASSWORD="your_secure_password"
|
||||
|
||||
# Start services
|
||||
docker-compose -f docker-compose.bsim.yml up -d
|
||||
|
||||
# Check logs
|
||||
docker-compose -f docker-compose.bsim.yml logs -f ghidra-headless
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
### BinaryIndex Configuration
|
||||
|
||||
Configure BSim connection in `appsettings.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"BinaryIndex": {
|
||||
"Ghidra": {
|
||||
"Enabled": true,
|
||||
"GhidraHome": "/opt/ghidra",
|
||||
"BSim": {
|
||||
"Enabled": true,
|
||||
"ConnectionString": "Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=...",
|
||||
"MinSimilarity": 0.7,
|
||||
"MaxResults": 10
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Environment Variables
|
||||
|
||||
```bash
|
||||
# BSim database connection
|
||||
export STELLAOPS_BSIM_CONNECTION="Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=..."
|
||||
|
||||
# BSim feature
|
||||
export STELLAOPS_BSIM_ENABLED=true
|
||||
|
||||
# Query tuning
|
||||
export STELLAOPS_BSIM_MIN_SIMILARITY=0.7
|
||||
export STELLAOPS_BSIM_QUERY_TIMEOUT=30
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### Ingesting Functions into BSim
|
||||
|
||||
```csharp
|
||||
using StellaOps.BinaryIndex.Ghidra;
|
||||
|
||||
var bsimService = serviceProvider.GetRequiredService<IBSimService>();
|
||||
|
||||
// Analyze binary with Ghidra
|
||||
var ghidraService = serviceProvider.GetRequiredService<IGhidraService>();
|
||||
var analysis = await ghidraService.AnalyzeAsync(binaryStream, ct: ct);
|
||||
|
||||
// Generate BSim signatures
|
||||
var signatures = await bsimService.GenerateSignaturesAsync(analysis, ct: ct);
|
||||
|
||||
// Ingest into BSim database
|
||||
await bsimService.IngestAsync("glibc", "2.31", signatures, ct);
|
||||
```
|
||||
|
||||
### Querying BSim
|
||||
|
||||
```csharp
|
||||
// Query for similar functions
|
||||
var queryOptions = new BSimQueryOptions
|
||||
{
|
||||
MinSimilarity = 0.7,
|
||||
MinSignificance = 0.5,
|
||||
MaxResults = 10
|
||||
};
|
||||
|
||||
var matches = await bsimService.QueryAsync(signature, queryOptions, ct);
|
||||
|
||||
foreach (var match in matches)
|
||||
{
|
||||
Console.WriteLine($"Match: {match.MatchedLibrary} {match.MatchedVersion} - {match.MatchedFunction}");
|
||||
Console.WriteLine($"Similarity: {match.Similarity:P2}, Confidence: {match.Confidence:P2}");
|
||||
}
|
||||
```
|
||||
|
||||
## Maintenance
|
||||
|
||||
### Database Vacuum
|
||||
|
||||
```bash
|
||||
# Regular vacuum (run weekly)
|
||||
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM ANALYZE;"
|
||||
|
||||
# Full vacuum (run monthly)
|
||||
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM FULL;"
|
||||
```
|
||||
|
||||
### Backup and Restore
|
||||
|
||||
```bash
|
||||
# Backup
|
||||
pg_dump -h localhost -U bsim_user -d bsim_corpus -F c -f bsim_backup_$(date +%Y%m%d).dump
|
||||
|
||||
# Restore
|
||||
pg_restore -h localhost -U bsim_user -d bsim_corpus -c bsim_backup_20260105.dump
|
||||
```
|
||||
|
||||
### Monitoring
|
||||
|
||||
```sql
|
||||
-- Check database size
|
||||
SELECT pg_size_pretty(pg_database_size('bsim_corpus'));
|
||||
|
||||
-- Check signature count
|
||||
SELECT COUNT(*) FROM bsim_functions;
|
||||
|
||||
-- Check recent ingest activity
|
||||
SELECT * FROM bsim_ingest_log ORDER BY ingested_at DESC LIMIT 10;
|
||||
```
|
||||
|
||||
## Performance Tuning
|
||||
|
||||
### PostgreSQL Configuration
|
||||
|
||||
Add to `postgresql.conf`:
|
||||
|
||||
```ini
|
||||
# Memory settings for BSim workload
|
||||
shared_buffers = 4GB
|
||||
effective_cache_size = 12GB
|
||||
work_mem = 256MB
|
||||
maintenance_work_mem = 1GB
|
||||
|
||||
# Query parallelism
|
||||
max_parallel_workers_per_gather = 4
|
||||
max_parallel_workers = 8
|
||||
|
||||
# Indexes
|
||||
random_page_cost = 1.1 # For SSD storage
|
||||
```
|
||||
|
||||
### Indexing Strategy
|
||||
|
||||
BSim automatically creates required indexes. Monitor slow queries:
|
||||
|
||||
```sql
|
||||
-- Enable query logging
|
||||
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1s
|
||||
SELECT pg_reload_conf();
|
||||
|
||||
-- Check slow queries
|
||||
SELECT query, mean_exec_time, calls
|
||||
FROM pg_stat_statements
|
||||
WHERE query LIKE '%bsim%'
|
||||
ORDER BY mean_exec_time DESC
|
||||
LIMIT 10;
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Connection Refused
|
||||
|
||||
```
|
||||
Error: could not connect to server: Connection refused
|
||||
```
|
||||
|
||||
**Solution:**
|
||||
1. Verify PostgreSQL is running: `systemctl status postgresql`
|
||||
2. Check port: `netstat -an | grep 5433`
|
||||
3. Verify firewall rules
|
||||
4. Check `pg_hba.conf` for access rules
|
||||
|
||||
### Schema Not Found
|
||||
|
||||
```
|
||||
Error: relation "bsim_functions" does not exist
|
||||
```
|
||||
|
||||
**Solution:**
|
||||
1. Re-run BSim schema initialization
|
||||
2. Verify Ghidra version compatibility
|
||||
3. Check BSim extension is installed in Ghidra
|
||||
|
||||
### Poor Query Performance
|
||||
|
||||
```
|
||||
Warning: BSim queries taking > 5s
|
||||
```
|
||||
|
||||
**Solution:**
|
||||
1. Run `VACUUM ANALYZE` on BSim tables
|
||||
2. Increase `work_mem` for complex queries
|
||||
3. Check index usage: `EXPLAIN ANALYZE` on slow queries
|
||||
4. Consider partitioning large tables
|
||||
|
||||
## Security Considerations
|
||||
|
||||
1. **Network Access:** BSim database should only be accessible from BinaryIndex services and Ghidra instances
|
||||
2. **Authentication:** Use strong passwords, consider certificate-based authentication
|
||||
3. **Encryption:** Enable SSL/TLS for database connections in production
|
||||
4. **Access Control:** Grant minimum necessary privileges
|
||||
|
||||
```sql
|
||||
-- Create read-only user for query services
|
||||
CREATE USER bsim_readonly WITH PASSWORD '...';
|
||||
GRANT CONNECT ON DATABASE bsim_corpus TO bsim_readonly;
|
||||
GRANT SELECT ON ALL TABLES IN SCHEMA public TO bsim_readonly;
|
||||
```
|
||||
|
||||
## Integration with Corpus
|
||||
|
||||
The BSim database complements the main corpus database:
|
||||
|
||||
- **Corpus DB:** Stores function metadata, fingerprints, CVE associations
|
||||
- **BSim DB:** Stores Ghidra-specific behavioral signatures and feature vectors
|
||||
|
||||
Functions are cross-referenced by:
|
||||
- Library name + version
|
||||
- Function name
|
||||
- Binary hash
|
||||
|
||||
## Status: GHID-011 Resolution
|
||||
|
||||
**Implementation Status:** Service code complete (`BSimService.cs` implemented)
|
||||
|
||||
**Database Status:** Schema initialization documented, awaiting infrastructure provisioning
|
||||
|
||||
**Blocker Resolution:** This guide provides complete setup instructions. Database can be provisioned by:
|
||||
1. Operations team following Docker Compose setup above
|
||||
2. Developers using local PostgreSQL with manual schema init
|
||||
3. CI/CD using containerized BSim database for integration tests
|
||||
|
||||
**Next Steps:**
|
||||
1. Provision BSim PostgreSQL instance (dev/staging/prod)
|
||||
2. Run BSim schema initialization
|
||||
3. Test BSimService connectivity
|
||||
4. Ingest initial corpus into BSim
|
||||
|
||||
## References
|
||||
|
||||
- Ghidra BSim Documentation: https://ghidra.re/ghidra_docs/api/ghidra/features/bsim/
|
||||
- Sprint: `docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
|
||||
- BSimService Implementation: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ghidra/Services/BSimService.cs`
|
||||
Reference in New Issue
Block a user