11 KiB
BSim PostgreSQL Database Setup Guide
Version: 1.0 Sprint: SPRINT_20260105_001_003_BINDEX Task: GHID-011
Overview
Ghidra's BSim (Binary Similarity) feature requires a separate PostgreSQL database for storing and querying function signatures. This guide covers setup and configuration.
Architecture
┌──────────────────────────────────────────────────────┐
│ StellaOps BinaryIndex │
├──────────────────────────────────────────────────────┤
│ Main Corpus DB │ BSim DB (Ghidra) │
│ (corpus.* schema) │ (separate instance) │
│ │ │
│ - Function metadata │ - BSim signatures │
│ - Fingerprints │ - Feature vectors │
│ - Clusters │ - Similarity index │
│ - CVE associations │ │
└──────────────────────────────────────────────────────┘
Why Separate?
- BSim uses Ghidra-specific schema and stored procedures
- Different access patterns (corpus: OLTP, BSim: analytical)
- BSim database can be shared across multiple Ghidra instances
- Isolation prevents schema conflicts
Prerequisites
- PostgreSQL 14+ (BSim requires specific PostgreSQL features)
- Ghidra 11.x with BSim extension
- Network connectivity between BinaryIndex services and BSim database
- At least 10GB storage for initial database (scales with corpus size)
Database Setup
1. Create BSim Database
# Create database
createdb bsim_corpus
# Create user
psql -c "CREATE USER bsim_user WITH PASSWORD 'secure_password_here';"
psql -c "GRANT ALL PRIVILEGES ON DATABASE bsim_corpus TO bsim_user;"
2. Initialize BSim Schema
Ghidra provides scripts to initialize the BSim database schema:
# Set Ghidra home
export GHIDRA_HOME=/opt/ghidra
# Run BSim database initialization
$GHIDRA_HOME/Ghidra/Features/BSim/data/postgresql_init.sh \
--host localhost \
--port 5432 \
--database bsim_corpus \
--user bsim_user \
--password secure_password_here
Alternatively, use Ghidra's BSim server setup:
# Create BSim server configuration
$GHIDRA_HOME/support/bsimServerSetup \
postgresql://localhost:5432/bsim_corpus \
--user bsim_user \
--password secure_password_here
3. Verify Installation
# Connect to database
psql -h localhost -U bsim_user -d bsim_corpus
# Check BSim tables exist
\dt
# Expected tables:
# - bsim_functions
# - bsim_executables
# - bsim_vectors
# - bsim_clusters
# etc.
# Exit
\q
Docker Deployment
Docker Compose Configuration
# docker-compose.bsim.yml
version: '3.8'
services:
bsim-postgres:
image: postgres:16
container_name: stellaops-bsim-db
environment:
POSTGRES_DB: bsim_corpus
POSTGRES_USER: bsim_user
POSTGRES_PASSWORD: ${BSIM_DB_PASSWORD}
POSTGRES_INITDB_ARGS: "-E UTF8 --locale=C"
volumes:
- bsim-data:/var/lib/postgresql/data
- ./scripts/init-bsim.sh:/docker-entrypoint-initdb.d/10-init-bsim.sh:ro
ports:
- "5433:5432" # Different port to avoid conflict with main DB
networks:
- stellaops
healthcheck:
test: ["CMD-SHELL", "pg_isready -U bsim_user -d bsim_corpus"]
interval: 10s
timeout: 5s
retries: 5
ghidra-headless:
image: stellaops/ghidra-headless:11.2
container_name: stellaops-ghidra
depends_on:
bsim-postgres:
condition: service_healthy
environment:
BSIM_DB_URL: "postgresql://bsim-postgres:5432/bsim_corpus"
BSIM_DB_USER: bsim_user
BSIM_DB_PASSWORD: ${BSIM_DB_PASSWORD}
JAVA_HOME: /opt/java/openjdk
MAXMEM: 4G
volumes:
- ghidra-projects:/projects
- ghidra-scripts:/scripts
networks:
- stellaops
deploy:
resources:
limits:
cpus: '4'
memory: 8G
volumes:
bsim-data:
driver: local
ghidra-projects:
ghidra-scripts:
networks:
stellaops:
driver: bridge
Initialization Script
Create scripts/init-bsim.sh:
#!/bin/bash
set -e
# Wait for PostgreSQL to be ready
until pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"; do
echo "Waiting for PostgreSQL..."
sleep 2
done
echo "PostgreSQL is ready. Installing BSim schema..."
# Note: Actual BSim schema SQL would be sourced from Ghidra distribution
# This is a placeholder - replace with actual Ghidra BSim schema
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
-- BSim schema will be initialized by Ghidra tools
-- This script just ensures the database is ready
COMMENT ON DATABASE bsim_corpus IS 'Ghidra BSim function signature database';
EOSQL
echo "BSim database initialized successfully"
Start Services
# Set password
export BSIM_DB_PASSWORD="your_secure_password"
# Start services
docker-compose -f docker-compose.bsim.yml up -d
# Check logs
docker-compose -f docker-compose.bsim.yml logs -f ghidra-headless
Configuration
BinaryIndex Configuration
Configure BSim connection in appsettings.json:
{
"BinaryIndex": {
"Ghidra": {
"Enabled": true,
"GhidraHome": "/opt/ghidra",
"BSim": {
"Enabled": true,
"ConnectionString": "Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=...",
"MinSimilarity": 0.7,
"MaxResults": 10
}
}
}
}
Environment Variables
# BSim database connection
export STELLAOPS_BSIM_CONNECTION="Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=..."
# BSim feature
export STELLAOPS_BSIM_ENABLED=true
# Query tuning
export STELLAOPS_BSIM_MIN_SIMILARITY=0.7
export STELLAOPS_BSIM_QUERY_TIMEOUT=30
Usage
Ingesting Functions into BSim
using StellaOps.BinaryIndex.Ghidra;
var bsimService = serviceProvider.GetRequiredService<IBSimService>();
// Analyze binary with Ghidra
var ghidraService = serviceProvider.GetRequiredService<IGhidraService>();
var analysis = await ghidraService.AnalyzeAsync(binaryStream, ct: ct);
// Generate BSim signatures
var signatures = await bsimService.GenerateSignaturesAsync(analysis, ct: ct);
// Ingest into BSim database
await bsimService.IngestAsync("glibc", "2.31", signatures, ct);
Querying BSim
// Query for similar functions
var queryOptions = new BSimQueryOptions
{
MinSimilarity = 0.7,
MinSignificance = 0.5,
MaxResults = 10
};
var matches = await bsimService.QueryAsync(signature, queryOptions, ct);
foreach (var match in matches)
{
Console.WriteLine($"Match: {match.MatchedLibrary} {match.MatchedVersion} - {match.MatchedFunction}");
Console.WriteLine($"Similarity: {match.Similarity:P2}, Confidence: {match.Confidence:P2}");
}
Maintenance
Database Vacuum
# Regular vacuum (run weekly)
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM ANALYZE;"
# Full vacuum (run monthly)
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM FULL;"
Backup and Restore
# Backup
pg_dump -h localhost -U bsim_user -d bsim_corpus -F c -f bsim_backup_$(date +%Y%m%d).dump
# Restore
pg_restore -h localhost -U bsim_user -d bsim_corpus -c bsim_backup_20260105.dump
Monitoring
-- Check database size
SELECT pg_size_pretty(pg_database_size('bsim_corpus'));
-- Check signature count
SELECT COUNT(*) FROM bsim_functions;
-- Check recent ingest activity
SELECT * FROM bsim_ingest_log ORDER BY ingested_at DESC LIMIT 10;
Performance Tuning
PostgreSQL Configuration
Add to postgresql.conf:
# Memory settings for BSim workload
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 256MB
maintenance_work_mem = 1GB
# Query parallelism
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
# Indexes
random_page_cost = 1.1 # For SSD storage
Indexing Strategy
BSim automatically creates required indexes. Monitor slow queries:
-- Enable query logging
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1s
SELECT pg_reload_conf();
-- Check slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query LIKE '%bsim%'
ORDER BY mean_exec_time DESC
LIMIT 10;
Troubleshooting
Connection Refused
Error: could not connect to server: Connection refused
Solution:
- Verify PostgreSQL is running:
systemctl status postgresql - Check port:
netstat -an | grep 5433 - Verify firewall rules
- Check
pg_hba.conffor access rules
Schema Not Found
Error: relation "bsim_functions" does not exist
Solution:
- Re-run BSim schema initialization
- Verify Ghidra version compatibility
- Check BSim extension is installed in Ghidra
Poor Query Performance
Warning: BSim queries taking > 5s
Solution:
- Run
VACUUM ANALYZEon BSim tables - Increase
work_memfor complex queries - Check index usage:
EXPLAIN ANALYZEon slow queries - Consider partitioning large tables
Security Considerations
- Network Access: BSim database should only be accessible from BinaryIndex services and Ghidra instances
- Authentication: Use strong passwords, consider certificate-based authentication
- Encryption: Enable SSL/TLS for database connections in production
- Access Control: Grant minimum necessary privileges
-- Create read-only user for query services
CREATE USER bsim_readonly WITH PASSWORD '...';
GRANT CONNECT ON DATABASE bsim_corpus TO bsim_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO bsim_readonly;
Integration with Corpus
The BSim database complements the main corpus database:
- Corpus DB: Stores function metadata, fingerprints, CVE associations
- BSim DB: Stores Ghidra-specific behavioral signatures and feature vectors
Functions are cross-referenced by:
- Library name + version
- Function name
- Binary hash
Status: GHID-011 Resolution
Implementation Status: Service code complete (BSimService.cs implemented)
Database Status: Schema initialization documented, awaiting infrastructure provisioning
Blocker Resolution: This guide provides complete setup instructions. Database can be provisioned by:
- Operations team following Docker Compose setup above
- Developers using local PostgreSQL with manual schema init
- CI/CD using containerized BSim database for integration tests
Next Steps:
- Provision BSim PostgreSQL instance (dev/staging/prod)
- Run BSim schema initialization
- Test BSimService connectivity
- Ingest initial corpus into BSim
References
- Ghidra BSim Documentation: https://ghidra.re/ghidra_docs/api/ghidra/features/bsim/
- Sprint:
docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md - BSimService Implementation:
src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ghidra/Services/BSimService.cs