2026-01-06 09:42:20 +02:00

BSim PostgreSQL Database Setup Guide

Version: 1.0
Sprint: SPRINT_20260105_001_003_BINDEX
Task: GHID-011

Overview

Ghidra's BSim (Binary Similarity) feature requires a separate PostgreSQL database for storing and querying function signatures. This guide covers setup and configuration.

Architecture

┌──────────────────────────────────────────────────────┐
│              StellaOps BinaryIndex                   │
├──────────────────────────────────────────────────────┤
│  Main Corpus DB          │  BSim DB (Ghidra)        │
│  (corpus.* schema)       │  (separate instance)     │
│                          │                          │
│  - Function metadata     │  - BSim signatures       │
│  - Fingerprints          │  - Feature vectors       │
│  - Clusters              │  - Similarity index      │
│  - CVE associations      │                          │
└──────────────────────────────────────────────────────┘

Why Separate?

  • BSim uses Ghidra-specific schema and stored procedures
  • Different access patterns (corpus: OLTP, BSim: analytical)
  • BSim database can be shared across multiple Ghidra instances
  • Isolation prevents schema conflicts

Prerequisites

  • PostgreSQL 14+ (BSim's PostgreSQL backend relies on server-side support, notably the lshvector extension bundled with Ghidra)
  • Ghidra 11.x with BSim extension
  • Network connectivity between BinaryIndex services and BSim database
  • At least 10GB storage for initial database (scales with corpus size)
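
The version prerequisite can be checked from a script before any setup runs. The following is a minimal sketch: `pg_version_at_least` is a hypothetical helper name, and it assumes the version text looks like the output of `psql --version` (for example `psql (PostgreSQL) 16.2`).

```shell
# Hypothetical helper: succeeds when the version text reports a major
# version greater than or equal to the requested minimum.
# Accepts strings such as "psql (PostgreSQL) 16.2" or a bare "14.11".
pg_version_at_least() {
  local min_major="$1" version_text="$2" major
  # Keep only the first run of digits, which is the major version
  major="$(printf '%s\n' "$version_text" | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p')"
  [ -n "$major" ] && [ "$major" -ge "$min_major" ]
}

# Example use (needs the psql client on PATH):
#   pg_version_at_least 14 "$(psql --version)" || echo "PostgreSQL 14+ required" >&2
```

Note that `psql --version` reports the client version; to check the server itself, pass the result of `SHOW server_version;` instead.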

Database Setup

1. Create BSim Database

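# Create the BSim role first so it can own the database
psql -c "CREATE USER bsim_user WITH PASSWORD 'secure_password_here';"

# Create the database owned by that role. On PostgreSQL 15+, GRANT ALL ON DATABASE
# alone does not let a non-owner create tables in the public schema.
createdb -O bsim_user bsim_corpus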

2. Initialize BSim Schema

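Ghidra provides scripts to initialize the BSim database schema. Script names and locations vary between Ghidra releases, so confirm that the path below exists in your installation before running it: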

# Set Ghidra home
export GHIDRA_HOME=/opt/ghidra

# Run BSim database initialization
$GHIDRA_HOME/Ghidra/Features/BSim/data/postgresql_init.sh \
    --host localhost \
    --port 5432 \
    --database bsim_corpus \
    --user bsim_user \
    --password secure_password_here

Alternatively, use Ghidra's BSim server setup:

# Create BSim server configuration
$GHIDRA_HOME/support/bsimServerSetup \
    postgresql://localhost:5432/bsim_corpus \
    --user bsim_user \
    --password secure_password_here

3. Verify Installation

# Connect to database
psql -h localhost -U bsim_user -d bsim_corpus

# At the psql prompt, list tables to check that the BSim tables exist
\dt

# Expected tables:
# - bsim_functions
# - bsim_executables
# - bsim_vectors
# - bsim_clusters
# etc.

# Exit
\q
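
The manual check above can also be scripted. This sketch assumes the four table names listed in this guide are the ones to look for; `missing_bsim_tables` is a hypothetical helper that reads table names from standard input, one per line.

```shell
# Hypothetical helper: prints each expected table that is absent from the
# names read on stdin and returns non-zero if any table is missing.
# The expected names are taken from the list in this guide.
missing_bsim_tables() {
  local expected="bsim_functions bsim_executables bsim_vectors bsim_clusters"
  local present table status=0
  present="$(cat)"
  for table in $expected; do
    # -x matches whole lines only, so "bsim_functions_old" does not count
    if ! printf '%s\n' "$present" | grep -qx -- "$table"; then
      echo "$table"
      status=1
    fi
  done
  return "$status"
}

# Example use against a live database:
#   psql -h localhost -U bsim_user -d bsim_corpus -Atc \
#     "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" | missing_bsim_tables
```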

Docker Deployment

Docker Compose Configuration

# docker-compose.bsim.yml
version: '3.8'

services:
  bsim-postgres:
    image: postgres:16
    container_name: stellaops-bsim-db
    environment:
      POSTGRES_DB: bsim_corpus
      POSTGRES_USER: bsim_user
      POSTGRES_PASSWORD: ${BSIM_DB_PASSWORD}
      POSTGRES_INITDB_ARGS: "-E UTF8 --locale=C"
    volumes:
      - bsim-data:/var/lib/postgresql/data
      - ./scripts/init-bsim.sh:/docker-entrypoint-initdb.d/10-init-bsim.sh:ro
    ports:
      - "5433:5432"  # Different port to avoid conflict with main DB
    networks:
      - stellaops
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U bsim_user -d bsim_corpus"]
      interval: 10s
      timeout: 5s
      retries: 5

  ghidra-headless:
    image: stellaops/ghidra-headless:11.2
    container_name: stellaops-ghidra
    depends_on:
      bsim-postgres:
        condition: service_healthy
    environment:
      BSIM_DB_URL: "postgresql://bsim-postgres:5432/bsim_corpus"
      BSIM_DB_USER: bsim_user
      BSIM_DB_PASSWORD: ${BSIM_DB_PASSWORD}
      JAVA_HOME: /opt/java/openjdk
      MAXMEM: 4G
    volumes:
      - ghidra-projects:/projects
      - ghidra-scripts:/scripts
    networks:
      - stellaops
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G

volumes:
  bsim-data:
    driver: local
  ghidra-projects:
  ghidra-scripts:

networks:
  stellaops:
    driver: bridge

Initialization Script

Create scripts/init-bsim.sh:

#!/bin/bash
set -e

# Wait for PostgreSQL to be ready
until pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"; do
  echo "Waiting for PostgreSQL..."
  sleep 2
done

echo "PostgreSQL is ready. Installing BSim schema..."

# Note: Actual BSim schema SQL would be sourced from Ghidra distribution
# This is a placeholder - replace with actual Ghidra BSim schema
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    -- BSim schema will be initialized by Ghidra tools
    -- This script just ensures the database is ready

    COMMENT ON DATABASE "$POSTGRES_DB" IS 'Ghidra BSim function signature database';
EOSQL

echo "BSim database initialized successfully"

Start Services

# Set password
export BSIM_DB_PASSWORD="your_secure_password"

# Start services
docker-compose -f docker-compose.bsim.yml up -d

# Check logs
docker-compose -f docker-compose.bsim.yml logs -f ghidra-headless
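
When a script starts the stack, it may help to block until the database accepts connections before running anything that depends on it. The sketch below is one way to do that; `retry` is a hypothetical helper, and the attempt count and delay in the example are arbitrary.

```shell
# Hypothetical helper: retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example use from the host, using the port mapping in the compose file:
#   retry 30 2 pg_isready -h localhost -p 5433 -U bsim_user -d bsim_corpus
```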

Configuration

BinaryIndex Configuration

Configure BSim connection in appsettings.json:

{
  "BinaryIndex": {
    "Ghidra": {
      "Enabled": true,
      "GhidraHome": "/opt/ghidra",
      "BSim": {
        "Enabled": true,
        "ConnectionString": "Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=...",
        "MinSimilarity": 0.7,
        "MaxResults": 10
      }
    }
  }
}

Environment Variables

# BSim database connection
export STELLAOPS_BSIM_CONNECTION="Host=localhost;Port=5433;Database=bsim_corpus;Username=bsim_user;Password=..."

# BSim feature
export STELLAOPS_BSIM_ENABLED=true

# Query tuning
export STELLAOPS_BSIM_MIN_SIMILARITY=0.7
export STELLAOPS_BSIM_QUERY_TIMEOUT=30
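
A typo in the similarity threshold can be caught before the service starts. This sketch assumes the threshold must be a decimal in the range 0 to 1, which matches the 0.7 used throughout this guide; `valid_similarity` is a hypothetical helper and not part of BinaryIndex.

```shell
# Hypothetical helper: succeeds when the argument is a plain decimal
# number in the closed range [0, 1].
valid_similarity() {
  case "$1" in
    # Reject empty input, any character other than digits and dots,
    # more than one dot, and a lone dot
    ''|*[!0-9.]*|*.*.*|.) return 1 ;;
  esac
  awk -v v="$1" 'BEGIN { exit !(v + 0 >= 0 && v + 0 <= 1) }'
}

# Example use:
#   valid_similarity "${STELLAOPS_BSIM_MIN_SIMILARITY:-0.7}" \
#     || echo "STELLAOPS_BSIM_MIN_SIMILARITY must be between 0 and 1" >&2
```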

Usage

Ingesting Functions into BSim

using StellaOps.BinaryIndex.Ghidra;

var bsimService = serviceProvider.GetRequiredService<IBSimService>();

// Analyze binary with Ghidra
var ghidraService = serviceProvider.GetRequiredService<IGhidraService>();
var analysis = await ghidraService.AnalyzeAsync(binaryStream, ct: ct);

// Generate BSim signatures
var signatures = await bsimService.GenerateSignaturesAsync(analysis, ct: ct);

// Ingest into BSim database
await bsimService.IngestAsync("glibc", "2.31", signatures, ct);

Querying BSim

// Query for similar functions
var queryOptions = new BSimQueryOptions
{
    MinSimilarity = 0.7,
    MinSignificance = 0.5,
    MaxResults = 10
};

var matches = await bsimService.QueryAsync(signature, queryOptions, ct);

foreach (var match in matches)
{
    Console.WriteLine($"Match: {match.MatchedLibrary} {match.MatchedVersion} - {match.MatchedFunction}");
    Console.WriteLine($"Similarity: {match.Similarity:P2}, Confidence: {match.Confidence:P2}");
}

Maintenance

Database Vacuum

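# Routine vacuum (run weekly; autovacuum handles most day-to-day cleanup)
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM ANALYZE;"

# Full vacuum: rewrites each table under an ACCESS EXCLUSIVE lock, blocking reads and writes.
# Run it in a maintenance window, and only when table bloat justifies it.
psql -h localhost -U bsim_user -d bsim_corpus -c "VACUUM FULL;"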

Backup and Restore

# Backup
pg_dump -h localhost -U bsim_user -d bsim_corpus -F c -f bsim_backup_$(date +%Y%m%d).dump

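# Restore (-c drops existing objects first; --if-exists avoids errors for objects that are absent)
pg_restore -h localhost -U bsim_user -d bsim_corpus -c --if-exists bsim_backup_20260105.dump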
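
Dated dumps accumulate quickly, so a retention step is often paired with the backup command. The sketch below keeps the newest N files and relies on the `bsim_backup_YYYYMMDD.dump` naming used above, where lexical order equals date order; `prune_backups` is a hypothetical helper.

```shell
# Hypothetical helper: prune_backups <directory> <keep_count>
# Deletes the oldest bsim_backup_*.dump files until only keep_count remain.
prune_backups() {
  local dir="$1" keep="$2"
  # Glob results are sorted, so the oldest date comes first
  set -- "$dir"/bsim_backup_*.dump
  # With no matches the pattern stays literal and names no real file
  [ -e "$1" ] || return 0
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"
    shift
  done
}

# Example use: keep the seven most recent dumps in the current directory
#   prune_backups . 7
```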

Monitoring

-- Check database size
SELECT pg_size_pretty(pg_database_size('bsim_corpus'));

-- Check signature count
SELECT COUNT(*) FROM bsim_functions;

-- Check recent ingest activity
SELECT * FROM bsim_ingest_log ORDER BY ingested_at DESC LIMIT 10;

Performance Tuning

PostgreSQL Configuration

Add to postgresql.conf:

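# Memory settings for BSim workload (values assume a dedicated host with roughly 16GB RAM)
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 256MB             # Applies per sort or hash operation; lower it if many queries run concurrently
maintenance_work_mem = 1GB

# Query parallelism
max_parallel_workers_per_gather = 4
max_parallel_workers = 8

# Planner cost
random_page_cost = 1.1  # For SSD storage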
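
The memory values above can be scaled to other host sizes. This sketch assumes the common starting heuristic of 25% of RAM for shared_buffers and 75% for effective_cache_size on a dedicated database host, which reproduces the 4GB and 12GB figures for a 16GB machine. `suggest_pg_memory` is a hypothetical helper, and its output is a starting point to validate under load, not a tuned configuration.

```shell
# Hypothetical helper: suggest_pg_memory <ram_in_gb>
# Prints postgresql.conf lines using the 25% / 75% starting heuristic.
suggest_pg_memory() {
  local ram_gb="$1"
  local ram_mb=$((ram_gb * 1024))
  echo "shared_buffers = $((ram_mb / 4))MB"
  echo "effective_cache_size = $((ram_mb * 3 / 4))MB"
}

# Example use:
#   suggest_pg_memory 16
```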

Indexing Strategy

BSim automatically creates required indexes. Monitor slow queries:

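-- Log statements that run longer than 1 second (ALTER SYSTEM requires superuser)
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();

-- Check slow queries. Requires pg_stat_statements in shared_preload_libraries
-- and CREATE EXTENSION pg_stat_statements; mean_exec_time exists in PostgreSQL 13+.
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query LIKE '%bsim%'
ORDER BY mean_exec_time DESC
LIMIT 10;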

Troubleshooting

Connection Refused

Error: could not connect to server: Connection refused

Solution:

  1. Verify PostgreSQL is running: systemctl status postgresql
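  2. Check that the port is listening: netstat -an | grep 5433 (5433 is the Docker-mapped port; a native install listens on 5432 by default)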
  3. Verify firewall rules
  4. Check pg_hba.conf for access rules

Schema Not Found

Error: relation "bsim_functions" does not exist

Solution:

  1. Re-run BSim schema initialization
  2. Verify Ghidra version compatibility
  3. Check BSim extension is installed in Ghidra

Poor Query Performance

Warning: BSim queries taking > 5s

Solution:

  1. Run VACUUM ANALYZE on BSim tables
  2. Increase work_mem for complex queries
  3. Check index usage: EXPLAIN ANALYZE on slow queries
  4. Consider partitioning large tables

Security Considerations

  1. Network Access: BSim database should only be accessible from BinaryIndex services and Ghidra instances
  2. Authentication: Use strong passwords, consider certificate-based authentication
  3. Encryption: Enable SSL/TLS for database connections in production
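  4. Access Control: Grant minimum necessary privileges, for example a read-only user for query services:

-- Create read-only user for query services
CREATE USER bsim_readonly WITH PASSWORD '...';
GRANT CONNECT ON DATABASE bsim_corpus TO bsim_readonly;
GRANT USAGE ON SCHEMA public TO bsim_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO bsim_readonly;
-- Cover tables that bsim_user creates later
ALTER DEFAULT PRIVILEGES FOR ROLE bsim_user IN SCHEMA public GRANT SELECT ON TABLES TO bsim_readonly;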

Integration with Corpus

The BSim database complements the main corpus database:

  • Corpus DB: Stores function metadata, fingerprints, CVE associations
  • BSim DB: Stores Ghidra-specific behavioral signatures and feature vectors

Functions are cross-referenced by:

  • Library name + version
  • Function name
  • Binary hash
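
To illustrate how those three attributes could combine into a single lookup key, here is a minimal sketch. The key layout, the `|` separator, and the choice of SHA-256 are assumptions made for illustration and are not taken from the BinaryIndex implementation; `xref_key` is a hypothetical helper.

```shell
# Hypothetical helper: xref_key <library> <version> <function> <binary_path>
# Prints "library|version|function|sha256-of-binary".
xref_key() {
  local library="$1" version="$2" function_name="$3" binary_path="$4" digest
  # sha256sum prints "<hex digest>  <path>"; keep only the digest
  digest="$(sha256sum "$binary_path" | cut -d' ' -f1)"
  # An empty digest means the file could not be read
  [ -n "$digest" ] || return 1
  printf '%s|%s|%s|%s\n' "$library" "$version" "$function_name" "$digest"
}

# Example use (the path is illustrative):
#   xref_key glibc 2.31 memcpy /lib/x86_64-linux-gnu/libc.so.6
```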

Status: GHID-011 Resolution

Implementation Status: Service code complete (BSimService.cs implemented)

Database Status: Schema initialization documented, awaiting infrastructure provisioning

Blocker Resolution: This guide provides complete setup instructions. The database can be provisioned by:

  1. Operations team following Docker Compose setup above
  2. Developers using local PostgreSQL with manual schema init
  3. CI/CD using containerized BSim database for integration tests

Next Steps:

  1. Provision BSim PostgreSQL instance (dev/staging/prod)
  2. Run BSim schema initialization
  3. Test BSimService connectivity
  4. Ingest initial corpus into BSim

References