440 lines
11 KiB
Markdown
440 lines
11 KiB
Markdown
# BSim PostgreSQL Database Setup Guide
|
|
|
|
**Version:** 1.0
|
|
**Sprint:** SPRINT_20260105_001_003_BINDEX
|
|
**Task:** GHID-011
|
|
|
|
## Overview
|
|
|
|
Ghidra's BSim (Binary Similarity) feature requires a separate PostgreSQL database for storing and querying function signatures. This guide covers setup and configuration.
|
|
|
|
## Architecture
|
|
|
|
```
|
|
┌──────────────────────────────────────────────────────┐
|
|
│ StellaOps BinaryIndex │
|
|
├──────────────────────────────────────────────────────┤
|
|
│ Main Corpus DB │ BSim DB (Ghidra) │
|
|
│ (corpus.* schema) │ (separate instance) │
|
|
│ │ │
|
|
│ - Function metadata │ - BSim signatures │
|
|
│ - Fingerprints │ - Feature vectors │
|
|
│ - Clusters │ - Similarity index │
|
|
│ - CVE associations │ │
|
|
└──────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
**Why Separate?**
|
|
- BSim uses Ghidra-specific schema and stored procedures
|
|
- Different access patterns (corpus: OLTP, BSim: analytical)
|
|
- BSim database can be shared across multiple Ghidra instances
|
|
- Isolation prevents schema conflicts
|
|
|
|
## Prerequisites
|
|
|
|
- PostgreSQL 14+ (BSim requires specific PostgreSQL features)
|
|
- Ghidra 11.x with BSim extension
|
|
- Network connectivity between BinaryIndex services and BSim database
|
|
- At least 10GB storage for initial database (scales with corpus size)
|
|
|
|
## Database Setup
|
|
|
|
### 1. Create BSim Database
|
|
|
|
```bash
|
|
# Create database
|
|
createdb bsim_corpus
|
|
|
|
# Create user
|
|
psql -c "CREATE USER bsim_user WITH PASSWORD 'secure_password_here';"
|
|
psql -c "GRANT ALL PRIVILEGES ON DATABASE bsim_corpus TO bsim_user;"
|
|
```
|
|
|
|
### 2. Initialize BSim Schema
|
|
|
|
Ghidra provides scripts to initialize the BSim database schema:
|
|
|
|
```bash
|
|
# Set Ghidra home
|
|
export GHIDRA_HOME=/opt/ghidra
|
|
|
|
# Run BSim database initialization
|
|
$GHIDRA_HOME/Ghidra/Features/BSim/data/postgresql_init.sh \
|
|
--host localhost \
|
|
--port 5432 \
|
|
--database bsim_corpus \
|
|
--user bsim_user \
|
|
--password secure_password_here
|
|
```
|
|
|
|
Alternatively, use Ghidra's BSim server setup:
|
|
|
|
```bash
|
|
# Create BSim server configuration
|
|
$GHIDRA_HOME/support/bsimServerSetup \
|
|
postgresql://localhost:5432/bsim_corpus \
|
|
--user bsim_user \
|
|
--password secure_password_here
|
|
```
|
|
|
|
### 3. Verify Installation
|
|
|
|
```bash
|
|
# Connect to database
|
|
psql -h localhost -U bsim_user -d bsim_corpus
|
|
|
|
# Check BSim tables exist
|
|
\dt
|
|
|
|
# Expected tables:
|
|
# - bsim_functions
|
|
# - bsim_executables
|
|
# - bsim_vectors
|
|
# - bsim_clusters
|
|
# etc.
|
|
|
|
# Exit
|
|
\q
|
|
```
|
|
|
|
## Docker Deployment
|
|
|
|
### Docker Compose Configuration
|
|
|
|
```yaml
|
|
# docker-compose.bsim.yml
|
|
version: '3.8'
|
|
|
|
services:
|
|
bsim-postgres:
|
|
image: postgres:16
|
|
container_name: stellaops-bsim-db
|
|
environment:
|
|
POSTGRES_DB: bsim_corpus
|
|
POSTGRES_USER: bsim_user
|
|
POSTGRES_PASSWORD: ${BSIM_DB_PASSWORD}
|
|
POSTGRES_INITDB_ARGS: "-E UTF8 --locale=C"
|
|
volumes:
|
|
- bsim-data:/var/lib/postgresql/data
|
|
- ./scripts/init-bsim.sh:/docker-entrypoint-initdb.d/10-init-bsim.sh:ro
|
|
ports:
|
|
- "5433:5432" # Different port to avoid conflict with main DB
|
|
networks:
|
|
- stellaops
|
|
healthcheck:
|
|
test: ["CMD-SHELL", "pg_isready -U bsim_user -d bsim_corpus"]
|
|
interval: 10s
|
|
timeout: 5s
|
|
retries: 5
|
|
|
|
ghidra-headless:
|
|
image: stellaops/ghidra-headless:11.2
|
|
container_name: stellaops-ghidra
|
|
depends_on:
|
|
bsim-postgres:
|
|
condition: service_healthy
|
|
environment:
|
|
BSIM_DB_URL: "postgresql://bsim-postgres:5432/bsim_corpus"
|
|
BSIM_DB_USER: bsim_user
|
|
BSIM_DB_PASSWORD: ${BSIM_DB_PASSWORD}
|
|
JAVA_HOME: /opt/java/openjdk
|
|
MAXMEM: 4G
|
|
volumes:
|
|
- ghidra-projects:/projects
|
|
- ghidra-scripts:/scripts
|
|
networks:
|
|
- stellaops
|
|
deploy:
|
|
resources:
|
|
limits:
|
|
cpus: '4'
|
|
memory: 8G
|
|
|
|
volumes:
|
|
bsim-data:
|
|
driver: local
|
|
ghidra-projects:
|
|
ghidra-scripts:
|
|
|
|
networks:
|
|
stellaops:
|
|
driver: bridge
|
|
```
|
|
|
|
### Initialization Script
|
|
|
|
Create `scripts/init-bsim.sh`:
|
|
|
|
```bash
|
|
#!/bin/bash
|
|
set -e
|
|
|
|
# Wait for PostgreSQL to be ready
|
|
until pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"; do
|
|
echo "Waiting for PostgreSQL..."
|
|
sleep 2
|
|
done
|
|
|
|
echo "PostgreSQL is ready. Installing BSim schema..."
|
|
|
|
# Note: Actual BSim schema SQL would be sourced from Ghidra distribution
|
|
# This is a placeholder - replace with actual Ghidra BSim schema
|
|
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
|
|
-- BSim schema will be initialized by Ghidra tools
|
|
-- This script just ensures the database is ready
|
|
|
|
COMMENT ON DATABASE bsim_corpus IS 'Ghidra BSim function signature database';
|
|
EOSQL
|
|
|
|
echo "BSim database initialized successfully"
|
|
```
|
|
|
|
### Start Services
|
|
|
|
```bash
|
|
# Set password
|
|
export BSIM_DB_PASSWORD="your_secure_password"
|
|
|
|
# Start services
|
|
docker-compose -f docker-compose.bsim.yml up -d
|
|
|
|
# Check logs
|
|
docker-compose -f docker-compose.bsim.yml logs -f ghidra-headless
|
|
```
|
|
|
|
## Configuration
|
|
|
|
### BinaryIndex Configuration
|
|
|
|
Configure BSim connection in `appsettings.json`:
|
|
|
|
```json
|
|
{
|
|
"BinaryIndex": {
|
|
"Ghidra": {
|
|
"Enabled": true,
|
|
"GhidraHome": "/opt/ghidra",
|
|
"BSim": {
|
|
"Enabled": true,
|
|
"ConnectionString": "Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=...",
|
|
"MinSimilarity": 0.7,
|
|
"MaxResults": 10
|
|
}
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
### Environment Variables
|
|
|
|
```bash
|
|
# BSim database connection
|
|
export STELLAOPS_BSIM_CONNECTION="Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=..."
|
|
|
|
# BSim feature
|
|
export STELLAOPS_BSIM_ENABLED=true
|
|
|
|
# Query tuning
|
|
export STELLAOPS_BSIM_MIN_SIMILARITY=0.7
|
|
export STELLAOPS_BSIM_QUERY_TIMEOUT=30
|
|
```
|
|
|
|
## Usage
|
|
|
|
### Ingesting Functions into BSim
|
|
|
|
```csharp
|
|
using StellaOps.BinaryIndex.Ghidra;
|
|
|
|
var bsimService = serviceProvider.GetRequiredService<IBSimService>();
|
|
|
|
// Analyze binary with Ghidra
|
|
var ghidraService = serviceProvider.GetRequiredService<IGhidraService>();
|
|
var analysis = await ghidraService.AnalyzeAsync(binaryStream, ct: ct);
|
|
|
|
// Generate BSim signatures
|
|
var signatures = await bsimService.GenerateSignaturesAsync(analysis, ct: ct);
|
|
|
|
// Ingest into BSim database
|
|
await bsimService.IngestAsync("glibc", "2.31", signatures, ct);
|
|
```
|
|
|
|
### Querying BSim
|
|
|
|
```csharp
|
|
// Query for similar functions
|
|
var queryOptions = new BSimQueryOptions
|
|
{
|
|
MinSimilarity = 0.7,
|
|
MinSignificance = 0.5,
|
|
MaxResults = 10
|
|
};
|
|
|
|
var matches = await bsimService.QueryAsync(signature, queryOptions, ct);
|
|
|
|
foreach (var match in matches)
|
|
{
|
|
Console.WriteLine($"Match: {match.MatchedLibrary} {match.MatchedVersion} - {match.MatchedFunction}");
|
|
Console.WriteLine($"Similarity: {match.Similarity:P2}, Confidence: {match.Confidence:P2}");
|
|
}
|
|
```
|
|
|
|
## Maintenance
|
|
|
|
### Database Vacuum
|
|
|
|
```bash
|
|
# Regular vacuum (run weekly)
|
|
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM ANALYZE;"
|
|
|
|
# Full vacuum (run monthly)
|
|
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM FULL;"
|
|
```
|
|
|
|
### Backup and Restore
|
|
|
|
```bash
|
|
# Backup
|
|
pg_dump -h localhost -U bsim_user -d bsim_corpus -F c -f bsim_backup_$(date +%Y%m%d).dump
|
|
|
|
# Restore
|
|
pg_restore -h localhost -U bsim_user -d bsim_corpus -c bsim_backup_20260105.dump
|
|
```
|
|
|
|
### Monitoring
|
|
|
|
```sql
|
|
-- Check database size
|
|
SELECT pg_size_pretty(pg_database_size('bsim_corpus'));
|
|
|
|
-- Check signature count
|
|
SELECT COUNT(*) FROM bsim_functions;
|
|
|
|
-- Check recent ingest activity
|
|
SELECT * FROM bsim_ingest_log ORDER BY ingested_at DESC LIMIT 10;
|
|
```
|
|
|
|
## Performance Tuning
|
|
|
|
### PostgreSQL Configuration
|
|
|
|
Add to `postgresql.conf`:
|
|
|
|
```ini
|
|
# Memory settings for BSim workload
|
|
shared_buffers = 4GB
|
|
effective_cache_size = 12GB
|
|
work_mem = 256MB
|
|
maintenance_work_mem = 1GB
|
|
|
|
# Query parallelism
|
|
max_parallel_workers_per_gather = 4
|
|
max_parallel_workers = 8
|
|
|
|
# Indexes
|
|
random_page_cost = 1.1 # For SSD storage
|
|
```
|
|
|
|
### Indexing Strategy
|
|
|
|
BSim automatically creates required indexes. Monitor slow queries:
|
|
|
|
```sql
|
|
-- Enable query logging
|
|
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1s
|
|
SELECT pg_reload_conf();
|
|
|
|
-- Check slow queries
|
|
SELECT query, mean_exec_time, calls
|
|
FROM pg_stat_statements
|
|
WHERE query LIKE '%bsim%'
|
|
ORDER BY mean_exec_time DESC
|
|
LIMIT 10;
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### Connection Refused
|
|
|
|
```
|
|
Error: could not connect to server: Connection refused
|
|
```
|
|
|
|
**Solution:**
|
|
1. Verify PostgreSQL is running: `systemctl status postgresql`
|
|
2. Check port: `netstat -an | grep 5433`
|
|
3. Verify firewall rules
|
|
4. Check `pg_hba.conf` for access rules
|
|
|
|
### Schema Not Found
|
|
|
|
```
|
|
Error: relation "bsim_functions" does not exist
|
|
```
|
|
|
|
**Solution:**
|
|
1. Re-run BSim schema initialization
|
|
2. Verify Ghidra version compatibility
|
|
3. Check BSim extension is installed in Ghidra
|
|
|
|
### Poor Query Performance
|
|
|
|
```
|
|
Warning: BSim queries taking > 5s
|
|
```
|
|
|
|
**Solution:**
|
|
1. Run `VACUUM ANALYZE` on BSim tables
|
|
2. Increase `work_mem` for complex queries
|
|
3. Check index usage: `EXPLAIN ANALYZE` on slow queries
|
|
4. Consider partitioning large tables
|
|
|
|
## Security Considerations
|
|
|
|
1. **Network Access:** BSim database should only be accessible from BinaryIndex services and Ghidra instances
|
|
2. **Authentication:** Use strong passwords, consider certificate-based authentication
|
|
3. **Encryption:** Enable SSL/TLS for database connections in production
|
|
4. **Access Control:** Grant minimum necessary privileges
|
|
|
|
```sql
|
|
-- Create read-only user for query services
|
|
CREATE USER bsim_readonly WITH PASSWORD '...';
|
|
GRANT CONNECT ON DATABASE bsim_corpus TO bsim_readonly;
|
|
GRANT SELECT ON ALL TABLES IN SCHEMA public TO bsim_readonly;
|
|
```
|
|
|
|
## Integration with Corpus
|
|
|
|
The BSim database complements the main corpus database:
|
|
|
|
- **Corpus DB:** Stores function metadata, fingerprints, CVE associations
|
|
- **BSim DB:** Stores Ghidra-specific behavioral signatures and feature vectors
|
|
|
|
Functions are cross-referenced by:
|
|
- Library name + version
|
|
- Function name
|
|
- Binary hash
|
|
|
|
## Status: GHID-011 Resolution
|
|
|
|
**Implementation Status:** Service code complete (`BSimService.cs` implemented)
|
|
|
|
**Database Status:** Schema initialization documented, awaiting infrastructure provisioning
|
|
|
|
**Blocker Resolution:** This guide provides complete setup instructions. Database can be provisioned by:
|
|
1. Operations team following Docker Compose setup above
|
|
2. Developers using local PostgreSQL with manual schema init
|
|
3. CI/CD using containerized BSim database for integration tests
|
|
|
|
**Next Steps:**
|
|
1. Provision BSim PostgreSQL instance (dev/staging/prod)
|
|
2. Run BSim schema initialization
|
|
3. Test BSimService connectivity
|
|
4. Ingest initial corpus into BSim
|
|
|
|
## References
|
|
|
|
- Ghidra BSim Documentation: https://ghidra.re/ghidra_docs/api/ghidra/features/bsim/
|
|
- Sprint: `docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
|
|
- BSimService Implementation: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ghidra/Services/BSimService.cs`
|