PostgreSQL Migration Strategy

Version: 1.0 | Last Updated: 2025-12-03 | Status: Active

Overview

This document defines the migration strategy for StellaOps PostgreSQL databases. It covers initial setup, per-release migrations, multi-instance coordination, and air-gapped operation.

Principles

  1. Forward-Only: No down migrations. Fixes are applied as new forward migrations.
  2. Idempotent: All migrations must be safe to re-run (use IF NOT EXISTS, ON CONFLICT DO NOTHING).
  3. Deterministic: Same input produces identical schema state across environments.
  4. Air-Gap Compatible: All migrations embedded in assemblies, no external dependencies.
  5. Zero-Downtime: Non-breaking migrations run at startup; breaking changes require coordination.

Migration Categories

Category A: Startup Migrations (Automatic)

Run automatically when the application starts. Must complete within 60 seconds.

Allowed Operations:

  • CREATE SCHEMA IF NOT EXISTS
  • CREATE TABLE IF NOT EXISTS
  • CREATE INDEX IF NOT EXISTS
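  • CREATE INDEX CONCURRENTLY IF NOT EXISTS (non-blocking; cannot run inside a transaction block, so it must execute outside the per-migration transaction)
  • ALTER TABLE ADD COLUMN IF NOT EXISTS (nullable or with a default)
  • CREATE TYPE for enums, guarded by an existence check (PostgreSQL has no CREATE TYPE ... IF NOT EXISTS)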
  • Adding new enum values (ALTER TYPE ... ADD VALUE IF NOT EXISTS)
  • Insert seed data with ON CONFLICT DO NOTHING

Forbidden Operations:

  • DROP TABLE/COLUMN/INDEX
  • ALTER TABLE DROP COLUMN
  • ALTER TABLE ALTER COLUMN TYPE
  • TRUNCATE
  • Large data migrations (> 10,000 rows affected)
  • Any operation requiring ACCESS EXCLUSIVE lock for extended periods
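
The forbidden list lends itself to an automated pre-merge check. The following is a minimal sketch in Python, not part of the StellaOps tooling: `find_forbidden_operations` and the pattern list are hypothetical, the patterns cover only the statement-level rules above (not row counts or lock duration), and the comment stripping is naive about string literals.

```python
import re

# Patterns for statements the Category A rules forbid.
# Hypothetical helper: the list is illustrative, not exhaustive.
FORBIDDEN_IN_STARTUP = [
    (re.compile(r"\bDROP\s+(TABLE|INDEX|COLUMN)\b", re.IGNORECASE),
     "DROP TABLE/COLUMN/INDEX"),
    (re.compile(r"\bALTER\s+COLUMN\s+\S+\s+(SET\s+DATA\s+)?TYPE\b", re.IGNORECASE),
     "ALTER COLUMN TYPE"),
    (re.compile(r"\bTRUNCATE\b", re.IGNORECASE), "TRUNCATE"),
]


def strip_sql_comments(sql: str) -> str:
    """Remove /* */ and -- comments (naive: does not parse string literals)."""
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)
    return re.sub(r"--[^\n]*", " ", sql)


def find_forbidden_operations(sql: str) -> list[str]:
    """Return labels of forbidden operations found in a startup migration."""
    body = strip_sql_comments(sql)
    return [label for pattern, label in FORBIDDEN_IN_STARTUP if pattern.search(body)]
```

A check like this could run in CI against every file numbered 001-099 and fail the build when the returned list is non-empty.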

Category B: Release Migrations (Manual/CLI)

Require explicit execution via CLI before deployment. Used for breaking changes.

Typical Operations:

  • Dropping deprecated columns/tables
  • Column type changes
  • Large data backfills
  • Index rebuilds
  • Table renames
  • Constraint modifications

Category C: Data Migrations (Batched)

Long-running data transformations that run as background jobs.

Characteristics:

  • Batched processing (1000-10000 rows per batch)
  • Resumable after interruption
  • Progress tracking
  • Can run alongside application
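
The characteristics above can be illustrated with a small resumable loop. This is a sketch in Python under stated assumptions, not the code behind DM001_BackfillTenantIds.cs: `run_batched` and its four callables are hypothetical stand-ins for the database access, and rows are assumed to have a monotonically increasing integer id that serves as the checkpoint.

```python
from typing import Callable, Optional


def run_batched(
    fetch_batch: Callable[[Optional[int], int], list[int]],
    transform: Callable[[list[int]], None],
    load_checkpoint: Callable[[], Optional[int]],
    save_checkpoint: Callable[[int], None],
    batch_size: int = 1000,
) -> int:
    """Process rows in id order, batch by batch, resuming from the last checkpoint.

    fetch_batch(after_id, limit) returns up to `limit` row ids greater than
    after_id (from the start when after_id is None), in ascending order.
    Returns the number of rows processed in this run.
    """
    last_id = load_checkpoint()
    processed = 0
    while True:
        batch = fetch_batch(last_id, batch_size)
        if not batch:
            return processed
        transform(batch)
        last_id = batch[-1]
        # Persist progress so an interrupted run restarts after this batch.
        save_checkpoint(last_id)
        processed += len(batch)
```

If a run is interrupted between `transform` and `save_checkpoint`, the last batch is processed again on resume, so the transformation itself should be idempotent.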

Migration File Structure

src/<Module>/__Libraries/StellaOps.<Module>.Storage.Postgres/
├── Migrations/
│   ├── 001_initial_schema.sql          # Category A
│   ├── 002_add_audit_columns.sql       # Category A
│   ├── 003_add_search_index.sql        # Category A
│   └── 100_drop_legacy_columns.sql     # Category B (100+ = manual)
├── Seeds/
│   ├── 001_default_roles.sql           # Seed data
│   └── 002_builtin_policies.sql        # Seed data
└── DataMigrations/
    └── DM001_BackfillTenantIds.cs      # Category C (code-based)

Naming Convention

| Prefix | Category | Description |
|--------|----------|-------------|
| 001-099 | A (Startup) | Automatic, non-breaking |
| 100-199 | B (Release) | Manual, breaking changes |
| 200-299 | B (Release) | Major version migrations |
| S001-S999 | Seed | Reference data |
| DM001-DM999 | C (Data) | Batched data migrations |
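
The prefix rules can be expressed as a small classifier. The sketch below is in Python for illustration only; the real helper lives in MigrationCategory.cs and may behave differently. `classify_migration` is a hypothetical name, the returned strings mirror the category values stored in schema_migrations, and raising on empty or unrecognized names is an assumption. Every numeric prefix of 100 or above is treated as a release migration, following the "100+ = manual" note in the file layout.

```python
import re
from typing import Optional

_NUMERIC_PREFIX = re.compile(r"^(\d+)_")


def classify_migration(name: Optional[str]) -> str:
    """Map a migration file name to 'startup', 'release', 'seed', or 'data'."""
    if name is None or not name.strip():
        raise ValueError("migration name must be non-empty")
    name = name.strip()
    upper = name.upper()
    # DM001-DM999: batched data migrations (checked before the seed prefix).
    if upper.startswith("DM") and upper[2:3].isdigit():
        return "data"
    # S001-S999: seed data.
    if upper.startswith("S") and upper[1:2].isdigit():
        return "seed"
    match = _NUMERIC_PREFIX.match(name)
    if match is None:
        raise ValueError(f"unrecognized migration name: {name!r}")
    # 001-099 run at startup; 100 and above require the CLI.
    return "startup" if int(match.group(1)) < 100 else "release"
```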

Execution Flow

Application Startup

┌─────────────────────────────────────────────────────────────┐
│                    Application Startup                       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  1. Acquire Advisory Lock (pg_try_advisory_lock)            │
│     Key: hash of schema name                                 │
│     If lock is busy: retry up to 120s, then fail startup    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  2. Create schema_migrations table if not exists             │
│     Columns: migration_name, applied_at, checksum, category │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  3. Load embedded migrations (001-099 only)                  │
│     - Sort by name                                           │
│     - Compute checksums                                      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  4. Compare with applied migrations                          │
│     - Detect checksum mismatches (FATAL ERROR)              │
│     - Identify pending migrations                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  5. Check for pending Category B migrations                  │
│     - If any 100+ migrations are pending: FAIL STARTUP      │
│     - Log: "Run 'stellaops system migrations-run' first"    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  6. Execute pending Category A migrations                    │
│     - Each in transaction                                    │
│     - Record in schema_migrations                            │
│     - Log timing                                             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  7. Execute seed data (if not already applied)               │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  8. Release Advisory Lock                                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  9. Continue Application Startup                             │
└─────────────────────────────────────────────────────────────┘
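
Steps 3 to 5 of the flow amount to a pure planning function over two maps. The following Python sketch shows one way to write it; it is not the logic in MigrationRunner.cs. The function and exception names are hypothetical, and SHA-256 over the UTF-8 file content is an assumed checksum algorithm, since this document does not name one.

```python
import hashlib


class ChecksumMismatch(Exception):
    """An applied migration's content no longer matches its recorded checksum."""


class PendingReleaseMigrations(Exception):
    """Category B migrations must be run via the CLI before startup."""


def checksum(sql: str) -> str:
    # Assumption: SHA-256 of the UTF-8 bytes; the real algorithm is not specified here.
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def numeric_prefix(name: str) -> int:
    return int(name.split("_", 1)[0])


def plan_startup(embedded: dict[str, str], applied: dict[str, str]) -> list[str]:
    """Return pending Category A migration names in execution order.

    embedded maps file name -> SQL text (files from Migrations/ only);
    applied maps file name -> checksum recorded in schema_migrations.
    """
    # Step 4: a changed, already-applied migration is fatal.
    for name, sql in embedded.items():
        if name in applied and applied[name] != checksum(sql):
            raise ChecksumMismatch(name)
    pending = sorted(name for name in embedded if name not in applied)
    # Step 5: refuse to start while any 100+ migration is pending.
    blocking = [name for name in pending if numeric_prefix(name) >= 100]
    if blocking:
        raise PendingReleaseMigrations(", ".join(blocking))
    # Input to step 6: only 001-099 remain, sorted by name.
    return pending
```

Keeping this step free of database calls makes the mismatch and blocking rules easy to unit test without a PostgreSQL instance.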

Release Migration (CLI)

# Before deployment - run breaking migrations
stellaops system migrations-run --module Authority --category release

# Verify migration state
stellaops system migrations-status --module Authority

# Dry run (show what would be executed)
stellaops system migrations-run --module Authority --dry-run

Multi-Instance Coordination

Advisory Locks

Each module uses a unique advisory lock key derived from its schema name:

-- Lock key calculation
SELECT pg_try_advisory_lock(hashtext('auth'));        -- Authority
SELECT pg_try_advisory_lock(hashtext('scheduler'));   -- Scheduler
SELECT pg_try_advisory_lock(hashtext('vuln'));        -- Concelier
SELECT pg_try_advisory_lock(hashtext('policy'));      -- Policy
SELECT pg_try_advisory_lock(hashtext('notify'));      -- Notify
SELECT pg_try_advisory_lock(hashtext('vex'));         -- Excititor

Race Condition Handling

Instance A                      Instance B
    │                               │
    ├─ Acquire lock (success) ──►   │
    │                               ├─ Acquire lock (BLOCKED)
    ├─ Run migrations               │     Wait up to 120s
    │                               │
    ├─ Release lock ────────────►   │
    │                               ├─ Acquire lock (success)
    │                               ├─ Check migrations (none pending)
    │                               ├─ Release lock
    │                               │
    ▼                               ▼
  Running                        Running
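
Because pg_try_advisory_lock returns immediately with true or false instead of blocking, the "wait up to 120s" behaviour for Instance B has to be a polling loop on the caller's side. A minimal Python sketch of such a loop follows. `acquire_with_timeout` is a hypothetical name, `try_lock` stands in for executing the SELECT on the session that will run the migrations, and the one-second poll interval is an assumption.

```python
import time
from typing import Callable


def acquire_with_timeout(
    try_lock: Callable[[], bool],
    timeout_seconds: float = 120.0,
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a non-blocking lock attempt until it succeeds or the timeout elapses.

    Returns True when the lock is granted and False on timeout, in which
    case the caller is expected to fail startup.
    """
    deadline = clock() + timeout_seconds
    while True:
        if try_lock():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

The lock is session-level, so the attempt, the migrations, and the final unlock all need to use the same database connection. `clock` and `sleep` are injectable only so the loop can be tested without real waiting.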

Schema Migrations Table

Each schema maintains its own migration history:

CREATE TABLE IF NOT EXISTS {schema}.schema_migrations (
    migration_name TEXT PRIMARY KEY,
    category TEXT NOT NULL DEFAULT 'startup',
    checksum TEXT NOT NULL,
    applied_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    applied_by TEXT,
    duration_ms INT,

    CONSTRAINT valid_category CHECK (category IN ('startup', 'release', 'seed', 'data'))
);

CREATE INDEX IF NOT EXISTS idx_schema_migrations_applied_at
    ON {schema}.schema_migrations(applied_at DESC);

Module-Specific Schemas

| Module | Schema | Lock Key | Tables |
|--------|--------|----------|--------|
| Authority | auth | hashtext('auth') | tenants, users, roles, tokens, sessions |
| Scheduler | scheduler | hashtext('scheduler') | jobs, triggers, workers, locks |
| Concelier | vuln | hashtext('vuln') | advisories, affected, aliases, sources |
| Policy | policy | hashtext('policy') | packs, versions, rules, evaluations |
| Notify | notify | hashtext('notify') | templates, channels, deliveries |
| Excititor | vex | hashtext('vex') | statements, documents, products |

Release Workflow

Pre-Deployment

# 1. Review pending migrations
stellaops system migrations-status --module all

# 2. Backup database (if required)
pg_dump -Fc stellaops > backup_$(date +%Y%m%d).dump

# 3. Run release migrations in maintenance window
stellaops system migrations-run --category release --module all

# 4. Verify schema state
stellaops system migrations-verify --module all

Deployment

  1. Deploy new application version
  2. Application startup runs Category A migrations automatically
  3. Health checks pass after migrations complete

Post-Deployment

# Check migration status
stellaops system migrations-status --module all

# Run any data migrations (background)
stellaops system migrations-run --category data --module all

Rollback Strategy

Since we use forward-only migrations, rollback is achieved through:

  1. Fix-Forward: Deploy a new migration that reverses the problematic change
  2. Blue/Green Deployment: Switch back to previous version (requires backward-compatible migrations)
  3. Point-in-Time Recovery: Restore from backup (last resort)

Backward Compatibility Window

For zero-downtime deployments, each migration must remain backward compatible with the previous application version (N-1), so destructive changes are spread across several releases:

Version N:   Adds new nullable column 'status_v2'
Version N+1: Application uses 'status_v2', keeps 'status' populated
Version N+2: Migration removes 'status' column (Category B)

Air-Gapped Operation

All migrations are embedded as assembly resources:

<!-- In .csproj file -->
<ItemGroup>
  <EmbeddedResource Include="Migrations\*.sql" LogicalName="%(Filename)%(Extension)" />
  <EmbeddedResource Include="Seeds\*.sql" LogicalName="%(Filename)%(Extension)" />
</ItemGroup>

No network access required during migration execution.

Monitoring & Observability

Metrics

| Metric | Type | Description |
|--------|------|-------------|
| stellaops_migration_duration_seconds | Histogram | Time to run migration |
| stellaops_migration_pending_count | Gauge | Number of pending migrations |
| stellaops_migration_applied_total | Counter | Total migrations applied |
| stellaops_migration_failed_total | Counter | Total migration failures |

Logging

[INF] Migration: Acquiring lock for schema 'auth'
[INF] Migration: Lock acquired, checking pending migrations
[INF] Migration: 2 pending migrations found
[INF] Migration: Applying 003_add_audit_columns.sql (checksum: a1b2c3...)
[INF] Migration: 003_add_audit_columns.sql completed in 245ms
[INF] Migration: Applying 004_add_search_index.sql (checksum: d4e5f6...)
[INF] Migration: 004_add_search_index.sql completed in 1823ms
[INF] Migration: All migrations applied, releasing lock

Alerts

  • Migration lock held > 5 minutes
  • Migration failure
  • Checksum mismatch detected
  • Pending Category B migrations blocking startup

Development Workflow

Creating a New Migration

# 1. Create migration file
MIGRATION=src/Authority/__Libraries/StellaOps.Authority.Storage.Postgres/Migrations/005_add_mfa_columns.sql
touch "$MIGRATION"

# 2. Write idempotent SQL (to the same path, not the current directory)
cat > "$MIGRATION" << 'EOF'
-- Migration: 005_add_mfa_columns
-- Category: startup
-- Description: Add MFA support columns to users table

ALTER TABLE auth.users ADD COLUMN IF NOT EXISTS mfa_enabled BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE auth.users ADD COLUMN IF NOT EXISTS mfa_secret TEXT;
ALTER TABLE auth.users ADD COLUMN IF NOT EXISTS mfa_backup_codes TEXT[];

CREATE INDEX IF NOT EXISTS idx_users_mfa_enabled ON auth.users(mfa_enabled) WHERE mfa_enabled = TRUE;
EOF

# 3. Test locally
dotnet run --project src/Authority/StellaOps.Authority.WebService

# 4. Verify migration applied
stellaops system migrations-status --module Authority

Testing Migrations

# Run integration tests with migrations
dotnet test --filter "Category=Migration"

# Test idempotency (run twice)
stellaops system migrations-run --module Authority
stellaops system migrations-run --module Authority  # Should be no-op

Troubleshooting

Lock Timeout

ERROR: Could not acquire migration lock within 120 seconds

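Cause: Another instance is still running migrations, or its database session is hung while holding the lock. Session-level advisory locks are released automatically when the holding session ends, so an instance whose connection has closed does not leave a stale lock.

Resolution:

-- Find the sessions holding advisory locks
SELECT l.pid, a.application_name, a.state, a.query_start
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory' AND l.granted;

-- Force release (use with caution): terminate the holding session.
-- pg_advisory_unlock_all() only releases locks held by the current session.
SELECT pg_terminate_backend(<pid>);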

Checksum Mismatch

ERROR: Migration checksum mismatch for '003_add_audit_columns.sql'
  Expected: a1b2c3d4e5f6...
  Found:    x9y8z7w6v5u4...

Cause: Migration file was modified after being applied.

Resolution:

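  1. Never modify a migration after it has been applied; restore the file to its applied content.
  2. Ship the fix as a new migration instead.
  3. If the change was intentional, update the checksum manually in schema_migrations as a last resort.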

Pending Release Migrations

ERROR: Cannot start application - pending release migrations require manual execution
  Pending: 100_drop_legacy_columns.sql
  Run: stellaops system migrations-run --module Authority --category release

Resolution: Run the CLI command shown in the error message before deploying the new version.

Integration Guide

Adding Startup Migrations to a Module

// In Program.cs or Startup.cs
using StellaOps.Infrastructure.Postgres.Migrations;

// Option 1: Using PostgresOptions
services.AddStartupMigrations(
    schemaName: "auth",
    moduleName: "Authority",
    migrationsAssembly: typeof(AuthorityDataSource).Assembly,
    configureOptions: options =>
    {
        options.LockTimeoutSeconds = 120;
        options.FailOnPendingReleaseMigrations = true;
    });

// Option 2: Using custom options type
services.AddStartupMigrations<AuthorityOptions>(
    schemaName: "auth",
    moduleName: "Authority",
    migrationsAssembly: typeof(AuthorityDataSource).Assembly,
    connectionStringSelector: opts => opts.Storage.ConnectionString);

// Add migration status service for health checks
services.AddMigrationStatus<PostgresOptions>(
    schemaName: "auth",
    moduleName: "Authority",
    migrationsAssembly: typeof(AuthorityDataSource).Assembly,
    connectionStringSelector: opts => opts.ConnectionString);

Embedding Migrations in Assembly

<!-- In .csproj file -->
<ItemGroup>
  <EmbeddedResource Include="Migrations\*.sql" LogicalName="%(Filename)%(Extension)" />
  <EmbeddedResource Include="Seeds\*.sql" LogicalName="%(Filename)%(Extension)" />
</ItemGroup>

Health Check Integration

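// Add migration status to health checks.
// An async delegate requires AddAsyncCheck; migrationStatusService is assumed to be resolved beforehand.
services.AddHealthChecks()
    .AddAsyncCheck("migrations", async (cancellationToken) =>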
    {
        var status = await migrationStatusService.GetStatusAsync(cancellationToken);

        if (status.HasBlockingIssues)
        {
            return HealthCheckResult.Unhealthy(
                $"Pending release migrations: {status.PendingReleaseCount}, " +
                $"Checksum errors: {status.ChecksumErrors.Count}");
        }

        if (status.PendingStartupCount > 0)
        {
            return HealthCheckResult.Degraded(
                $"Pending startup migrations: {status.PendingStartupCount}");
        }

        return HealthCheckResult.Healthy($"Applied: {status.AppliedCount}");
    });

Implementation Files

| File | Description |
|------|-------------|
| src/__Libraries/StellaOps.Infrastructure.Postgres/Migrations/MigrationRunner.cs | Core migration execution logic |
| src/__Libraries/StellaOps.Infrastructure.Postgres/Migrations/MigrationCategory.cs | Migration category enum and helpers |
| src/__Libraries/StellaOps.Infrastructure.Postgres/Migrations/StartupMigrationHost.cs | IHostedService for automatic migrations |
| src/__Libraries/StellaOps.Infrastructure.Postgres/Migrations/MigrationServiceExtensions.cs | DI registration extensions |

Reference