Files
StellaOps Bot 37e11918e0 save progress
2026-01-06 09:42:20 +02:00

440 lines
11 KiB
Markdown

# BSim PostgreSQL Database Setup Guide
**Version:** 1.0
**Sprint:** SPRINT_20260105_001_003_BINDEX
**Task:** GHID-011
## Overview
Ghidra's BSim (Binary Similarity) feature requires a separate PostgreSQL database for storing and querying function signatures. This guide covers setup and configuration.
## Architecture
```
┌──────────────────────────────────────────────────────┐
│ StellaOps BinaryIndex │
├──────────────────────────────────────────────────────┤
│ Main Corpus DB │ BSim DB (Ghidra) │
│ (corpus.* schema) │ (separate instance) │
│ │ │
│ - Function metadata │ - BSim signatures │
│ - Fingerprints │ - Feature vectors │
│ - Clusters │ - Similarity index │
│ - CVE associations │ │
└──────────────────────────────────────────────────────┘
```
**Why Separate?**
- BSim uses Ghidra-specific schema and stored procedures
- Different access patterns (corpus: OLTP, BSim: analytical)
- BSim database can be shared across multiple Ghidra instances
- Isolation prevents schema conflicts
## Prerequisites
- PostgreSQL 14+ (BSim requires specific PostgreSQL features)
- Ghidra 11.x with BSim extension
- Network connectivity between BinaryIndex services and BSim database
- At least 10GB storage for initial database (scales with corpus size)
## Database Setup
### 1. Create BSim Database
```bash
# Create database
createdb bsim_corpus
# Create user
psql -c "CREATE USER bsim_user WITH PASSWORD 'secure_password_here';"
psql -c "GRANT ALL PRIVILEGES ON DATABASE bsim_corpus TO bsim_user;"
```
### 2. Initialize BSim Schema
Ghidra provides scripts to initialize the BSim database schema:
```bash
# Set Ghidra home
export GHIDRA_HOME=/opt/ghidra
# Run BSim database initialization
$GHIDRA_HOME/Ghidra/Features/BSim/data/postgresql_init.sh \
--host localhost \
--port 5432 \
--database bsim_corpus \
--user bsim_user \
--password secure_password_here
```
Alternatively, use Ghidra's BSim server setup:
```bash
# Create BSim server configuration
$GHIDRA_HOME/support/bsimServerSetup \
postgresql://localhost:5432/bsim_corpus \
--user bsim_user \
--password secure_password_here
```
### 3. Verify Installation
```bash
# Connect to database
psql -h localhost -U bsim_user -d bsim_corpus
# Check BSim tables exist
\dt
# Expected tables:
# - bsim_functions
# - bsim_executables
# - bsim_vectors
# - bsim_clusters
# etc.
# Exit
\q
```
## Docker Deployment
### Docker Compose Configuration
```yaml
# docker-compose.bsim.yml
version: '3.8'
services:
bsim-postgres:
image: postgres:16
container_name: stellaops-bsim-db
environment:
POSTGRES_DB: bsim_corpus
POSTGRES_USER: bsim_user
POSTGRES_PASSWORD: ${BSIM_DB_PASSWORD}
POSTGRES_INITDB_ARGS: "-E UTF8 --locale=C"
volumes:
- bsim-data:/var/lib/postgresql/data
- ./scripts/init-bsim.sh:/docker-entrypoint-initdb.d/10-init-bsim.sh:ro
ports:
- "5433:5432" # Different port to avoid conflict with main DB
networks:
- stellaops
healthcheck:
test: ["CMD-SHELL", "pg_isready -U bsim_user -d bsim_corpus"]
interval: 10s
timeout: 5s
retries: 5
ghidra-headless:
image: stellaops/ghidra-headless:11.2
container_name: stellaops-ghidra
depends_on:
bsim-postgres:
condition: service_healthy
environment:
BSIM_DB_URL: "postgresql://bsim-postgres:5432/bsim_corpus"
BSIM_DB_USER: bsim_user
BSIM_DB_PASSWORD: ${BSIM_DB_PASSWORD}
JAVA_HOME: /opt/java/openjdk
MAXMEM: 4G
volumes:
- ghidra-projects:/projects
- ghidra-scripts:/scripts
networks:
- stellaops
deploy:
resources:
limits:
cpus: '4'
memory: 8G
volumes:
bsim-data:
driver: local
ghidra-projects:
ghidra-scripts:
networks:
stellaops:
driver: bridge
```
### Initialization Script
Create `scripts/init-bsim.sh`:
```bash
#!/bin/bash
set -e
# Wait for PostgreSQL to be ready
until pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"; do
echo "Waiting for PostgreSQL..."
sleep 2
done
echo "PostgreSQL is ready. Installing BSim schema..."
# Note: Actual BSim schema SQL would be sourced from Ghidra distribution
# This is a placeholder - replace with actual Ghidra BSim schema
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
-- BSim schema will be initialized by Ghidra tools
-- This script just ensures the database is ready
COMMENT ON DATABASE bsim_corpus IS 'Ghidra BSim function signature database';
EOSQL
echo "BSim database initialized successfully"
```
### Start Services
```bash
# Set password
export BSIM_DB_PASSWORD="your_secure_password"
# Start services
docker-compose -f docker-compose.bsim.yml up -d
# Check logs
docker-compose -f docker-compose.bsim.yml logs -f ghidra-headless
```
## Configuration
### BinaryIndex Configuration
Configure BSim connection in `appsettings.json`:
```json
{
"BinaryIndex": {
"Ghidra": {
"Enabled": true,
"GhidraHome": "/opt/ghidra",
"BSim": {
"Enabled": true,
"ConnectionString": "Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=...",
"MinSimilarity": 0.7,
"MaxResults": 10
}
}
}
}
```
### Environment Variables
```bash
# BSim database connection
export STELLAOPS_BSIM_CONNECTION="Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=..."
# BSim feature
export STELLAOPS_BSIM_ENABLED=true
# Query tuning
export STELLAOPS_BSIM_MIN_SIMILARITY=0.7
export STELLAOPS_BSIM_QUERY_TIMEOUT=30
```
## Usage
### Ingesting Functions into BSim
```csharp
using StellaOps.BinaryIndex.Ghidra;
var bsimService = serviceProvider.GetRequiredService<IBSimService>();
// Analyze binary with Ghidra
var ghidraService = serviceProvider.GetRequiredService<IGhidraService>();
var analysis = await ghidraService.AnalyzeAsync(binaryStream, ct: ct);
// Generate BSim signatures
var signatures = await bsimService.GenerateSignaturesAsync(analysis, ct: ct);
// Ingest into BSim database
await bsimService.IngestAsync("glibc", "2.31", signatures, ct);
```
### Querying BSim
```csharp
// Query for similar functions
var queryOptions = new BSimQueryOptions
{
MinSimilarity = 0.7,
MinSignificance = 0.5,
MaxResults = 10
};
var matches = await bsimService.QueryAsync(signature, queryOptions, ct);
foreach (var match in matches)
{
Console.WriteLine($"Match: {match.MatchedLibrary} {match.MatchedVersion} - {match.MatchedFunction}");
Console.WriteLine($"Similarity: {match.Similarity:P2}, Confidence: {match.Confidence:P2}");
}
```
## Maintenance
### Database Vacuum
```bash
# Regular vacuum (run weekly)
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM ANALYZE;"
# Full vacuum (run monthly)
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM FULL;"
```
### Backup and Restore
```bash
# Backup
pg_dump -h localhost -U bsim_user -d bsim_corpus -F c -f bsim_backup_$(date +%Y%m%d).dump
# Restore
pg_restore -h localhost -U bsim_user -d bsim_corpus -c bsim_backup_20260105.dump
```
### Monitoring
```sql
-- Check database size
SELECT pg_size_pretty(pg_database_size('bsim_corpus'));
-- Check signature count
SELECT COUNT(*) FROM bsim_functions;
-- Check recent ingest activity
SELECT * FROM bsim_ingest_log ORDER BY ingested_at DESC LIMIT 10;
```
## Performance Tuning
### PostgreSQL Configuration
Add to `postgresql.conf`:
```ini
# Memory settings for BSim workload
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 256MB
maintenance_work_mem = 1GB
# Query parallelism
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
# Indexes
random_page_cost = 1.1 # For SSD storage
```
### Indexing Strategy
BSim automatically creates required indexes. Monitor slow queries:
```sql
-- Enable query logging
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1s
SELECT pg_reload_conf();
-- Check slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query LIKE '%bsim%'
ORDER BY mean_exec_time DESC
LIMIT 10;
```
## Troubleshooting
### Connection Refused
```
Error: could not connect to server: Connection refused
```
**Solution:**
1. Verify PostgreSQL is running: `systemctl status postgresql`
2. Check port: `netstat -an | grep 5433`
3. Verify firewall rules
4. Check `pg_hba.conf` for access rules
### Schema Not Found
```
Error: relation "bsim_functions" does not exist
```
**Solution:**
1. Re-run BSim schema initialization
2. Verify Ghidra version compatibility
3. Check BSim extension is installed in Ghidra
### Poor Query Performance
```
Warning: BSim queries taking > 5s
```
**Solution:**
1. Run `VACUUM ANALYZE` on BSim tables
2. Increase `work_mem` for complex queries
3. Check index usage: `EXPLAIN ANALYZE` on slow queries
4. Consider partitioning large tables
## Security Considerations
1. **Network Access:** BSim database should only be accessible from BinaryIndex services and Ghidra instances
2. **Authentication:** Use strong passwords, consider certificate-based authentication
3. **Encryption:** Enable SSL/TLS for database connections in production
4. **Access Control:** Grant minimum necessary privileges
```sql
-- Create read-only user for query services
CREATE USER bsim_readonly WITH PASSWORD '...';
GRANT CONNECT ON DATABASE bsim_corpus TO bsim_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO bsim_readonly;
```
## Integration with Corpus
The BSim database complements the main corpus database:
- **Corpus DB:** Stores function metadata, fingerprints, CVE associations
- **BSim DB:** Stores Ghidra-specific behavioral signatures and feature vectors
Functions are cross-referenced by:
- Library name + version
- Function name
- Binary hash
## Status: GHID-011 Resolution
**Implementation Status:** Service code complete (`BSimService.cs` implemented)
**Database Status:** Schema initialization documented, awaiting infrastructure provisioning
**Blocker Resolution:** This guide provides complete setup instructions. Database can be provisioned by:
1. Operations team following Docker Compose setup above
2. Developers using local PostgreSQL with manual schema init
3. CI/CD using containerized BSim database for integration tests
**Next Steps:**
1. Provision BSim PostgreSQL instance (dev/staging/prod)
2. Run BSim schema initialization
3. Test BSimService connectivity
4. Ingest initial corpus into BSim
## References
- Ghidra BSim Documentation: https://ghidra.re/ghidra_docs/api/ghidra/features/bsim/
- Sprint: `docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
- BSimService Implementation: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ghidra/Services/BSimService.cs`