# BSim PostgreSQL Database Setup Guide **Version:** 1.0 **Sprint:** SPRINT_20260105_001_003_BINDEX **Task:** GHID-011 ## Overview Ghidra's BSim (Binary Similarity) feature requires a separate PostgreSQL database for storing and querying function signatures. This guide covers setup and configuration. ## Architecture ``` ┌──────────────────────────────────────────────────────┐ │ StellaOps BinaryIndex │ ├──────────────────────────────────────────────────────┤ │ Main Corpus DB │ BSim DB (Ghidra) │ │ (corpus.* schema) │ (separate instance) │ │ │ │ │ - Function metadata │ - BSim signatures │ │ - Fingerprints │ - Feature vectors │ │ - Clusters │ - Similarity index │ │ - CVE associations │ │ └──────────────────────────────────────────────────────┘ ``` **Why Separate?** - BSim uses Ghidra-specific schema and stored procedures - Different access patterns (corpus: OLTP, BSim: analytical) - BSim database can be shared across multiple Ghidra instances - Isolation prevents schema conflicts ## Prerequisites - PostgreSQL 14+ (BSim requires specific PostgreSQL features) - Ghidra 11.x with BSim extension - Network connectivity between BinaryIndex services and BSim database - At least 10GB storage for initial database (scales with corpus size) ## Database Setup ### 1. Create BSim Database ```bash # Create database createdb bsim_corpus # Create user psql -c "CREATE USER bsim_user WITH PASSWORD 'secure_password_here';" psql -c "GRANT ALL PRIVILEGES ON DATABASE bsim_corpus TO bsim_user;" ``` ### 2. Initialize BSim Schema Ghidra provides scripts to initialize the BSim database schema: ```bash # Set Ghidra home export GHIDRA_HOME=/opt/ghidra # Run BSim database initialization $GHIDRA_HOME/Ghidra/Features/BSim/data/postgresql_init.sh \ --host localhost \ --port 5432 \ --database bsim_corpus \ --user bsim_user \ --password secure_password_here ``` Alternatively, use Ghidra's BSim server setup: ```bash # Create BSim server configuration $GHIDRA_HOME/support/bsimServerSetup \ postgresql://localhost:5432/bsim_corpus \ --user bsim_user \ --password secure_password_here ``` ### 3. Verify Installation ```bash # Connect to database psql -h localhost -U bsim_user -d bsim_corpus # Check BSim tables exist \dt # Expected tables: # - bsim_functions # - bsim_executables # - bsim_vectors # - bsim_clusters # etc. # Exit \q ``` ## Docker Deployment ### Docker Compose Configuration ```yaml # docker-compose.bsim.yml version: '3.8' services: bsim-postgres: image: postgres:16 container_name: stellaops-bsim-db environment: POSTGRES_DB: bsim_corpus POSTGRES_USER: bsim_user POSTGRES_PASSWORD: ${BSIM_DB_PASSWORD} POSTGRES_INITDB_ARGS: "-E UTF8 --locale=C" volumes: - bsim-data:/var/lib/postgresql/data - ./scripts/init-bsim.sh:/docker-entrypoint-initdb.d/10-init-bsim.sh:ro ports: - "5433:5432" # Different port to avoid conflict with main DB networks: - stellaops healthcheck: test: ["CMD-SHELL", "pg_isready -U bsim_user -d bsim_corpus"] interval: 10s timeout: 5s retries: 5 ghidra-headless: image: stellaops/ghidra-headless:11.2 container_name: stellaops-ghidra depends_on: bsim-postgres: condition: service_healthy environment: BSIM_DB_URL: "postgresql://bsim-postgres:5432/bsim_corpus" BSIM_DB_USER: bsim_user BSIM_DB_PASSWORD: ${BSIM_DB_PASSWORD} JAVA_HOME: /opt/java/openjdk MAXMEM: 4G volumes: - ghidra-projects:/projects - ghidra-scripts:/scripts networks: - stellaops deploy: resources: limits: cpus: '4' memory: 8G volumes: bsim-data: driver: local ghidra-projects: ghidra-scripts: networks: stellaops: driver: bridge ``` ### Initialization Script Create `scripts/init-bsim.sh`: ```bash #!/bin/bash set -e # Wait for PostgreSQL to be ready until pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"; do echo "Waiting for PostgreSQL..." sleep 2 done echo "PostgreSQL is ready. Installing BSim schema..." # Note: Actual BSim schema SQL would be sourced from Ghidra distribution # This is a placeholder - replace with actual Ghidra BSim schema psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL -- BSim schema will be initialized by Ghidra tools -- This script just ensures the database is ready COMMENT ON DATABASE bsim_corpus IS 'Ghidra BSim function signature database'; EOSQL echo "BSim database initialized successfully" ``` ### Start Services ```bash # Set password export BSIM_DB_PASSWORD="your_secure_password" # Start services docker-compose -f docker-compose.bsim.yml up -d # Check logs docker-compose -f docker-compose.bsim.yml logs -f ghidra-headless ``` ## Configuration ### BinaryIndex Configuration Configure BSim connection in `appsettings.json`: ```json { "BinaryIndex": { "Ghidra": { "Enabled": true, "GhidraHome": "/opt/ghidra", "BSim": { "Enabled": true, "ConnectionString": "Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=...", "MinSimilarity": 0.7, "MaxResults": 10 } } } } ``` ### Environment Variables ```bash # BSim database connection export STELLAOPS_BSIM_CONNECTION="Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=..." # BSim feature export STELLAOPS_BSIM_ENABLED=true # Query tuning export STELLAOPS_BSIM_MIN_SIMILARITY=0.7 export STELLAOPS_BSIM_QUERY_TIMEOUT=30 ``` ## Usage ### Ingesting Functions into BSim ```csharp using StellaOps.BinaryIndex.Ghidra; var bsimService = serviceProvider.GetRequiredService(); // Analyze binary with Ghidra var ghidraService = serviceProvider.GetRequiredService(); var analysis = await ghidraService.AnalyzeAsync(binaryStream, ct: ct); // Generate BSim signatures var signatures = await bsimService.GenerateSignaturesAsync(analysis, ct: ct); // Ingest into BSim database await bsimService.IngestAsync("glibc", "2.31", signatures, ct); ``` ### Querying BSim ```csharp // Query for similar functions var queryOptions = new BSimQueryOptions { MinSimilarity = 0.7, MinSignificance = 0.5, MaxResults = 10 }; var matches = await bsimService.QueryAsync(signature, queryOptions, ct); foreach (var match in matches) { Console.WriteLine($"Match: {match.MatchedLibrary} {match.MatchedVersion} - {match.MatchedFunction}"); Console.WriteLine($"Similarity: {match.Similarity:P2}, Confidence: {match.Confidence:P2}"); } ``` ## Maintenance ### Database Vacuum ```bash # Regular vacuum (run weekly) psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM ANALYZE;" # Full vacuum (run monthly) psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM FULL;" ``` ### Backup and Restore ```bash # Backup pg_dump -h localhost -U bsim_user -d bsim_corpus -F c -f bsim_backup_$(date +%Y%m%d).dump # Restore pg_restore -h localhost -U bsim_user -d bsim_corpus -c bsim_backup_20260105.dump ``` ### Monitoring ```sql -- Check database size SELECT pg_size_pretty(pg_database_size('bsim_corpus')); -- Check signature count SELECT COUNT(*) FROM bsim_functions; -- Check recent ingest activity SELECT * FROM bsim_ingest_log ORDER BY ingested_at DESC LIMIT 10; ``` ## Performance Tuning ### PostgreSQL Configuration Add to `postgresql.conf`: ```ini # Memory settings for BSim workload shared_buffers = 4GB effective_cache_size = 12GB work_mem = 256MB maintenance_work_mem = 1GB # Query parallelism max_parallel_workers_per_gather = 4 max_parallel_workers = 8 # Indexes random_page_cost = 1.1 # For SSD storage ``` ### Indexing Strategy BSim automatically creates required indexes. Monitor slow queries: ```sql -- Enable query logging ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1s SELECT pg_reload_conf(); -- Check slow queries SELECT query, mean_exec_time, calls FROM pg_stat_statements WHERE query LIKE '%bsim%' ORDER BY mean_exec_time DESC LIMIT 10; ``` ## Troubleshooting ### Connection Refused ``` Error: could not connect to server: Connection refused ``` **Solution:** 1. Verify PostgreSQL is running: `systemctl status postgresql` 2. Check port: `netstat -an | grep 5433` 3. Verify firewall rules 4. Check `pg_hba.conf` for access rules ### Schema Not Found ``` Error: relation "bsim_functions" does not exist ``` **Solution:** 1. Re-run BSim schema initialization 2. Verify Ghidra version compatibility 3. Check BSim extension is installed in Ghidra ### Poor Query Performance ``` Warning: BSim queries taking > 5s ``` **Solution:** 1. Run `VACUUM ANALYZE` on BSim tables 2. Increase `work_mem` for complex queries 3. Check index usage: `EXPLAIN ANALYZE` on slow queries 4. Consider partitioning large tables ## Security Considerations 1. **Network Access:** BSim database should only be accessible from BinaryIndex services and Ghidra instances 2. **Authentication:** Use strong passwords, consider certificate-based authentication 3. **Encryption:** Enable SSL/TLS for database connections in production 4. **Access Control:** Grant minimum necessary privileges ```sql -- Create read-only user for query services CREATE USER bsim_readonly WITH PASSWORD '...'; GRANT CONNECT ON DATABASE bsim_corpus TO bsim_readonly; GRANT SELECT ON ALL TABLES IN SCHEMA public TO bsim_readonly; ``` ## Integration with Corpus The BSim database complements the main corpus database: - **Corpus DB:** Stores function metadata, fingerprints, CVE associations - **BSim DB:** Stores Ghidra-specific behavioral signatures and feature vectors Functions are cross-referenced by: - Library name + version - Function name - Binary hash ## Status: GHID-011 Resolution **Implementation Status:** Service code complete (`BSimService.cs` implemented) **Database Status:** Schema initialization documented, awaiting infrastructure provisioning **Blocker Resolution:** This guide provides complete setup instructions. Database can be provisioned by: 1. Operations team following Docker Compose setup above 2. Developers using local PostgreSQL with manual schema init 3. CI/CD using containerized BSim database for integration tests **Next Steps:** 1. Provision BSim PostgreSQL instance (dev/staging/prod) 2. Run BSim schema initialization 3. Test BSimService connectivity 4. Ingest initial corpus into BSim ## References - Ghidra BSim Documentation: https://ghidra.re/ghidra_docs/api/ghidra/features/bsim/ - Sprint: `docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md` - BSimService Implementation: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Ghidra/Services/BSimService.cs`