# PostgreSQL Migration Strategy

**Version:** 1.0

**Last Updated:** 2025-12-03

**Status:** Active

## Overview

This document defines the migration strategy for StellaOps PostgreSQL databases. It covers initial setup, per-release migrations, multi-instance coordination, and air-gapped operation.

## Principles

1. **Forward-Only**: No down migrations. Fixes are applied as new forward migrations.
2. **Idempotent**: All migrations must be safe to re-run (use `IF NOT EXISTS`, `ON CONFLICT DO NOTHING`).
3. **Deterministic**: The same input produces an identical schema state across environments.
4. **Air-Gap Compatible**: All migrations are embedded in assemblies, with no external dependencies.
5. **Zero-Downtime**: Non-breaking migrations run at startup; breaking changes require coordination.

## Migration Categories

### Category A: Startup Migrations (Automatic)

These run automatically when the application starts and must complete within 60 seconds.

**Allowed Operations:**

- `CREATE SCHEMA IF NOT EXISTS`
- `CREATE TABLE IF NOT EXISTS`
- `CREATE INDEX IF NOT EXISTS`
- `CREATE INDEX CONCURRENTLY IF NOT EXISTS` (non-blocking). It cannot run inside a transaction block. A failed build leaves an `INVALID` index, which must be dropped before retrying.
- `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`, for nullable columns or columns with a non-volatile default. A volatile default forces a full table rewrite.
- `CREATE TYPE` for enums, wrapped in a guard such as `DO $$ BEGIN CREATE TYPE ...; EXCEPTION WHEN duplicate_object THEN NULL; END $$;`. PostgreSQL has no `CREATE TYPE ... IF NOT EXISTS`.
- Adding new enum values with `ALTER TYPE ... ADD VALUE IF NOT EXISTS`. The new value cannot be used until the transaction that adds it commits.
- Inserting seed data with `ON CONFLICT DO NOTHING`

**Forbidden Operations:**

- `DROP TABLE`, `DROP INDEX`, and `ALTER TABLE ... DROP COLUMN`
- `ALTER TABLE ALTER COLUMN TYPE`
- `TRUNCATE`
- Large data migrations (more than 10,000 rows affected)
- Any operation requiring an `ACCESS EXCLUSIVE` lock for an extended period

### Category B: Release Migrations (Manual/CLI)

These require explicit execution via the CLI before deployment and are used for breaking changes.

**Typical Operations:**

- Dropping deprecated columns/tables
- Column type changes
- Large data backfills
- Index rebuilds
- Table renames
- Constraint modifications

### Category C: Data Migrations (Batched)

Long-running data transformations that run as background jobs.
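A Category C job can be pictured as a keyset-paginated loop that records a checkpoint after every batch. This lets it stop and resume without reprocessing rows.

The Python below is a minimal, hypothetical illustration, not the StellaOps data-migration API. The following are all assumptions:

- `BackfillCheckpoint` and `run_backfill`
- the in-memory `rows` dict standing in for a table
- the `tenant_id` backfill, loosely modeled on the `DM001_BackfillTenantIds` example

A real job would persist the checkpoint in the database and issue `WHERE id > :last_id ORDER BY id LIMIT :batch_size` queries.

```python
from dataclasses import dataclass


@dataclass
class BackfillCheckpoint:
    """Hypothetical progress record; a real job would persist this in the database."""
    last_id: int = 0    # highest primary key already processed
    rows_done: int = 0  # running total, used for progress tracking


def run_backfill(rows, checkpoint, batch_size=1000, max_batches=None):
    """Process `rows` (dict of id -> record) in keyset-ordered batches.

    Each batch takes ids greater than checkpoint.last_id, transforms them,
    and advances the checkpoint, so an interrupted run resumes where it stopped.
    Returns the number of batches processed by this call.
    """
    batches = 0
    while max_batches is None or batches < max_batches:
        # Keyset pagination, equivalent to:
        #   SELECT ... WHERE id > :last_id ORDER BY id LIMIT :batch_size
        batch_ids = sorted(i for i in rows if i > checkpoint.last_id)[:batch_size]
        if not batch_ids:
            break
        for row_id in batch_ids:
            record = rows[row_id]
            # Idempotent transformation: only fill the value when it is missing.
            if record.get("tenant_id") is None:
                record["tenant_id"] = "default"
        checkpoint.last_id = batch_ids[-1]
        checkpoint.rows_done += len(batch_ids)
        batches += 1
    return batches
```

Calling `run_backfill(rows, cp, max_batches=1)` simulates an interruption after one batch. A second call with the same checkpoint continues from `cp.last_id`.

Keyset pagination is used rather than `OFFSET` so that each batch query stays cheap no matter how far the job has progressed.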
**Characteristics:**

- Batched processing (1,000 to 10,000 rows per batch)
- Resumable after interruption
- Progress tracking
- Can run alongside the application

## Migration File Structure

```
src//__Libraries/StellaOps..Storage.Postgres/
├── Migrations/
│   ├── 001_initial_schema.sql        # Category A
│   ├── 002_add_audit_columns.sql     # Category A
│   ├── 003_add_search_index.sql      # Category A
│   └── 100_drop_legacy_columns.sql   # Category B (100+ = manual)
├── Seeds/
│   ├── 001_default_roles.sql         # Seed data
│   └── 002_builtin_policies.sql      # Seed data
└── DataMigrations/
    └── DM001_BackfillTenantIds.cs    # Category C (code-based)
```

### Naming Convention

| Prefix | Category | Description |
|--------|----------|-------------|
| `001-099` | A (Startup) | Automatic, non-breaking |
| `100-199` | B (Release) | Manual, breaking changes |
| `200-299` | B (Release) | Major version migrations |
| `S001-S999` | Seed | Reference data |
| `DM001-DM999` | C (Data) | Batched data migrations |

## Execution Flow

### Application Startup

```
┌─────────────────────────────────────────────────────────────┐
│                     Application Startup                     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Acquire Advisory Lock (pg_try_advisory_lock)             │
│    Key: hash of schema name                                 │
│    If busy: retry up to 120s, then fail startup             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Create schema_migrations table if not exists             │
│    Columns: migration_name, applied_at, checksum, category  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Load embedded migrations (001-099 only)                  │
│    - Sort by name                                           │
│    - Compute checksums                                      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Compare with applied migrations                          │
│    - Detect checksum mismatches (FATAL ERROR)               │
│    - Identify pending migrations                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Check for pending Category B migrations                  │
│    - If any 100+ migrations are pending: FAIL STARTUP       │
│    - Log: "Run 'stellaops system migrations-run' first"     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Execute pending Category A migrations                    │
│    - Each in a transaction                                  │
│    - Record in schema_migrations                            │
│    - Log timing                                             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 7. Execute seed data (if not already applied)               │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 8. Release Advisory Lock                                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 9. Continue Application Startup                             │
└─────────────────────────────────────────────────────────────┘
```

### Release Migration (CLI)

```bash
# Before deployment - run breaking migrations
stellaops system migrations-run --module Authority --category release

# Verify migration state
stellaops system migrations-status --module Authority

# Dry run (show what would be executed)
stellaops system migrations-run --module Authority --dry-run
```

## Multi-Instance Coordination

### Advisory Locks

Each module uses a unique advisory lock key derived from its schema name:

```sql
-- Lock key calculation
SELECT pg_try_advisory_lock(hashtext('auth'));       -- Authority
SELECT pg_try_advisory_lock(hashtext('scheduler'));  -- Scheduler
SELECT pg_try_advisory_lock(hashtext('vuln'));       -- Concelier
SELECT pg_try_advisory_lock(hashtext('policy'));     -- Policy
SELECT pg_try_advisory_lock(hashtext('notify'));     -- Notify
SELECT pg_try_advisory_lock(hashtext('vex'));        -- Excititor
```

`pg_try_advisory_lock` returns `true` or `false` immediately rather than waiting, so the runner retries it until the 120-second timeout expires. The lock is session-level. It stays held until `pg_advisory_unlock` is called on the same connection, or until that session ends.

### Race Condition Handling

```
Instance A                    Instance B
    │                             │
    ├─ Acquire lock (success) ──► │
    │                             ├─ Acquire lock (BLOCKED)
    ├─ Run migrations             │  Wait up to 120s
    │                             │
    ├─ Release lock ────────────► │
    │                             ├─ Acquire lock (success)
    │                             ├─ Check migrations (none pending)
    │                             ├─ Release lock
    │                             │
    ▼                             ▼
 Running                       Running
```

## Schema Migrations Table

Each schema maintains its own migration history:

```sql
CREATE TABLE IF NOT EXISTS {schema}.schema_migrations (
    migration_name TEXT PRIMARY KEY,
    category TEXT NOT NULL DEFAULT 'startup',
    checksum TEXT NOT NULL,
    applied_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    applied_by TEXT,
    duration_ms INT,
    CONSTRAINT valid_category CHECK (category IN ('startup', 'release', 'seed', 'data'))
);

CREATE INDEX IF NOT EXISTS idx_schema_migrations_applied_at
    ON {schema}.schema_migrations(applied_at DESC);
```

## Module-Specific Schemas

| Module | Schema | Lock Key | Tables |
|--------|--------|----------|--------|
| Authority | `auth` | `hashtext('auth')` | tenants, users, roles, tokens, sessions |
| Scheduler | `scheduler` | `hashtext('scheduler')` | jobs, triggers, workers, locks |
| Concelier | `vuln` | `hashtext('vuln')` | advisories, affected, aliases, sources |
| Policy | `policy` | `hashtext('policy')` | packs, versions, rules, evaluations |
| Notify | `notify` | `hashtext('notify')` | templates, channels, deliveries |
| Excititor | `vex` | `hashtext('vex')` | statements, documents, products |

## Release Workflow

### Pre-Deployment

```bash
# 1. Review pending migrations
stellaops system migrations-status --module all

# 2. Backup database (if required)
pg_dump -Fc stellaops > backup_$(date +%Y%m%d).dump

# 3. Run release migrations in maintenance window
stellaops system migrations-run --category release --module all

# 4. Verify schema state
stellaops system migrations-verify --module all
```

### Deployment

1. Deploy the new application version.
2. Application startup runs Category A migrations automatically.
3. Health checks pass after migrations complete.

### Post-Deployment

```bash
# Check migration status
stellaops system migrations-status --module all

# Run any data migrations (background)
stellaops system migrations-run --category data --module all
```

## Rollback Strategy

Since we use forward-only migrations, rollback is achieved through:

1. **Fix-Forward**: Deploy a new migration that reverses the problematic change.
2. **Blue/Green Deployment**: Switch back to the previous version (requires backward-compatible migrations).
3. **Point-in-Time Recovery**: Restore from backup (last resort).

### Backward Compatibility Window

For zero-downtime deployments, migrations must remain backward compatible with the previous application version (N-1). In other words, the schema produced by version N must still work with the code of version N-1:

```
Version N:   Adds new nullable column 'status_v2'
Version N+1: Application uses 'status_v2', keeps 'status' populated
Version N+2: Migration removes 'status' column (Category B)
```

## Air-Gapped Operation

All migrations are embedded as assembly resources:

```xml
```

No network access is required during migration execution.
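Because every migration ships inside the assembly, steps 3 to 5 of the startup flow can be planned entirely offline. The Python sketch below shows that planning logic under these assumptions:

- The embedded resources are modeled as a dict of file name to SQL text.
- Checksums are SHA-256 hex digests of the UTF-8 content. This document does not specify the real algorithm.
- `plan_startup`, `category_of` and `ChecksumMismatchError` are hypothetical names, not the `MigrationRunner` API.

```python
import hashlib
import re


class ChecksumMismatchError(Exception):
    """Raised when an applied migration no longer matches its recorded checksum."""


def checksum(sql_text: str) -> str:
    # Assumption: SHA-256 over the UTF-8 file content.
    return hashlib.sha256(sql_text.encode("utf-8")).hexdigest()


def category_of(name: str) -> str:
    """Map a file name to its category using the naming-convention table."""
    if name.startswith("DM"):
        return "data"
    if name.startswith("S"):
        return "seed"
    match = re.match(r"(\d{3})_", name)
    if match is None:
        raise ValueError(f"unrecognized migration name: {name}")
    number = int(match.group(1))
    if 1 <= number <= 99:
        return "startup"
    if 100 <= number <= 299:
        return "release"
    raise ValueError(f"migration number out of range: {name}")


def plan_startup(embedded: dict, applied: dict):
    """Return (pending_startup, pending_release) lists.

    embedded: file name -> SQL text, as loaded from assembly resources
    applied:  file name -> checksum recorded in schema_migrations
    """
    pending_startup, pending_release = [], []
    for name in sorted(embedded):  # zero-padded prefixes sort correctly by name
        current = checksum(embedded[name])
        if name in applied:
            if applied[name] != current:
                raise ChecksumMismatchError(name)  # step 4: fatal
            continue
        category = category_of(name)
        if category == "startup":
            pending_startup.append(name)
        elif category == "release":
            pending_release.append(name)
    return pending_startup, pending_release
```

In this sketch, a non-empty second list corresponds to step 5. Startup should fail and direct the operator to `stellaops system migrations-run --category release`.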
## Monitoring & Observability

### Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `stellaops_migration_duration_seconds` | Histogram | Time to run a migration |
| `stellaops_migration_pending_count` | Gauge | Number of pending migrations |
| `stellaops_migration_applied_total` | Counter | Total migrations applied |
| `stellaops_migration_failed_total` | Counter | Total migration failures |

### Logging

```
[INF] Migration: Acquiring lock for schema 'auth'
[INF] Migration: Lock acquired, checking pending migrations
[INF] Migration: 2 pending migrations found
[INF] Migration: Applying 003_add_audit_columns.sql (checksum: a1b2c3...)
[INF] Migration: 003_add_audit_columns.sql completed in 245ms
[INF] Migration: Applying 004_add_search_index.sql (checksum: d4e5f6...)
[INF] Migration: 004_add_search_index.sql completed in 1823ms
[INF] Migration: All migrations applied, releasing lock
```

### Alerts

- Migration lock held for more than 5 minutes
- Migration failure
- Checksum mismatch detected
- Pending Category B migrations blocking startup

## Development Workflow

### Creating a New Migration

```bash
# 1. Create the migration file in the module's Migrations directory
MIGRATIONS_DIR=src/Authority/__Libraries/StellaOps.Authority.Storage.Postgres/Migrations
touch "$MIGRATIONS_DIR/005_add_mfa_columns.sql"

# 2. Write idempotent SQL (to the same path, not the current directory)
cat > "$MIGRATIONS_DIR/005_add_mfa_columns.sql" << 'EOF'
-- Migration: 005_add_mfa_columns
-- Category: startup
-- Description: Add MFA support columns to users table

ALTER TABLE auth.users ADD COLUMN IF NOT EXISTS mfa_enabled BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE auth.users ADD COLUMN IF NOT EXISTS mfa_secret TEXT;
ALTER TABLE auth.users ADD COLUMN IF NOT EXISTS mfa_backup_codes TEXT[];

CREATE INDEX IF NOT EXISTS idx_users_mfa_enabled ON auth.users(mfa_enabled) WHERE mfa_enabled = TRUE;
EOF

# 3. Test locally (startup applies pending Category A migrations)
dotnet run --project src/Authority/StellaOps.Authority.WebService

# 4. Verify migration applied
stellaops system migrations-status --module Authority
```

### Testing Migrations

```bash
# Run integration tests with migrations
dotnet test --filter "Category=Migration"

# Test idempotency (run twice)
stellaops system migrations-run --module Authority
stellaops system migrations-run --module Authority  # Should be a no-op
```

## Troubleshooting

### Lock Timeout

```
ERROR: Could not acquire migration lock within 120 seconds
```

**Cause**: Another instance is running migrations, or an instance crashed while holding the lock. Session-level advisory locks are released automatically when the holding session ends. A crashed instance therefore keeps the lock only until PostgreSQL detects that its connection is gone.

**Resolution**:

```sql
-- Check active advisory locks and the sessions (pids) holding them
SELECT pid, objid, granted FROM pg_locks WHERE locktype = 'advisory';

-- pg_advisory_unlock_all() only releases locks held by the *current* session,
-- so it cannot free another instance's lock. Terminate the holding backend
-- instead (use with caution; replace 12345 with the pid from the query above):
SELECT pg_terminate_backend(12345);
```

### Checksum Mismatch

```
ERROR: Migration checksum mismatch for '003_add_audit_columns.sql'
Expected: a1b2c3d4e5f6...
Found: x9y8z7w6v5u4...
```

**Cause**: The migration file was modified after being applied.

**Resolution**:

1. Never modify applied migrations. Restore the file to its applied content and put the fix in a new migration instead.
2. Only if the change was intentional and does not alter the schema (for example, a comment-only edit), update the stored checksum manually in `schema_migrations`.

### Pending Release Migrations

```
ERROR: Cannot start application - pending release migrations require manual execution
Pending: 100_drop_legacy_columns.sql
Run: stellaops system migrations-run --module Authority --category release
```

**Resolution**: Run the CLI migration command before deployment.
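The lock-timeout error and the race diagram both rely on retrying a non-blocking lock attempt until a deadline. Below is a minimal Python sketch of that loop. It assumes a one-second poll interval; the document only fixes the 120-second timeout.

`acquire_migration_lock` and `LockTimeoutError` are hypothetical names. `try_lock` stands in for a call that runs `SELECT pg_try_advisory_lock(hashtext('auth'))` on a dedicated connection and returns its boolean result.

```python
import time


class LockTimeoutError(Exception):
    """Raised when the migration lock cannot be acquired within the timeout."""


def acquire_migration_lock(try_lock, timeout_s=120.0, poll_interval_s=1.0,
                           clock=time.monotonic, sleep=time.sleep):
    """Poll a non-blocking lock attempt until it succeeds or the timeout elapses.

    try_lock: callable returning True/False (hypothetically wrapping
              pg_try_advisory_lock on a dedicated connection).
    clock/sleep are injectable so the loop can be tested without real waiting.
    Returns the number of attempts made.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if try_lock():
            return attempts
        if clock() >= deadline:
            raise LockTimeoutError(
                f"Could not acquire migration lock within {timeout_s:g} seconds")
        sleep(poll_interval_s)
```

Injecting `clock` and `sleep` keeps the loop deterministic under test. The connection that `try_lock` uses must stay open until migrations finish and `pg_advisory_unlock` runs on it, because a session-level advisory lock belongs to the session that acquired it.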
## Integration Guide

### Adding Startup Migrations to a Module

```csharp
// In Program.cs or Startup.cs
using StellaOps.Infrastructure.Postgres.Migrations;

// Option 1: Using PostgresOptions
services.AddStartupMigrations(
    schemaName: "auth",
    moduleName: "Authority",
    migrationsAssembly: typeof(AuthorityDataSource).Assembly,
    configureOptions: options =>
    {
        options.LockTimeoutSeconds = 120;
        options.FailOnPendingReleaseMigrations = true;
    });

// Option 2: Using custom options type
services.AddStartupMigrations(
    schemaName: "auth",
    moduleName: "Authority",
    migrationsAssembly: typeof(AuthorityDataSource).Assembly,
    connectionStringSelector: opts => opts.Storage.ConnectionString);

// Add migration status service for health checks
services.AddMigrationStatus(
    schemaName: "auth",
    moduleName: "Authority",
    migrationsAssembly: typeof(AuthorityDataSource).Assembly,
    connectionStringSelector: opts => opts.ConnectionString);
```

### Embedding Migrations in Assembly

```xml
```

### Health Check Integration

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Add migration status to health checks.
// AddAsyncCheck is the overload that accepts an async delegate
// (Func<CancellationToken, Task<HealthCheckResult>>); AddCheck only accepts synchronous ones.
// migrationStatusService must be in scope here (the migration status service
// registered via AddMigrationStatus).
services.AddHealthChecks()
    .AddAsyncCheck("migrations", async cancellationToken =>
    {
        var status = await migrationStatusService.GetStatusAsync(cancellationToken);

        if (status.HasBlockingIssues)
        {
            return HealthCheckResult.Unhealthy(
                $"Pending release migrations: {status.PendingReleaseCount}, " +
                $"Checksum errors: {status.ChecksumErrors.Count}");
        }

        if (status.PendingStartupCount > 0)
        {
            return HealthCheckResult.Degraded(
                $"Pending startup migrations: {status.PendingStartupCount}");
        }

        return HealthCheckResult.Healthy($"Applied: {status.AppliedCount}");
    });
```

## Implementation Files

| File | Description |
|------|-------------|
| `src/__Libraries/StellaOps.Infrastructure.Postgres/Migrations/MigrationRunner.cs` | Core migration execution logic |
| `src/__Libraries/StellaOps.Infrastructure.Postgres/Migrations/MigrationCategory.cs` | Migration category enum and helpers |
| `src/__Libraries/StellaOps.Infrastructure.Postgres/Migrations/StartupMigrationHost.cs` | `IHostedService` for automatic migrations |
| `src/__Libraries/StellaOps.Infrastructure.Postgres/Migrations/MigrationServiceExtensions.cs` | DI registration extensions |

## Reference

- [PostgreSQL Advisory Locks](https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS)
- [Zero-Downtime Migrations](https://docs.stellaops.org/operations/migrations)
- [StellaOps CLI Reference](../09_API_CLI_REFERENCE.md)