
ADR-0001: PostgreSQL for Control-Plane Storage

Status

Accepted

Date

2025-12-04

Authors

  • Platform Team

Deciders

  • Architecture Guild
  • Platform Team

Context

StellaOps control-plane services (Authority, Scheduler, Notify, Concelier/Excititor, Policy) require persistent storage for:

  • Identity and authorization data (users, roles, tokens, sessions)
  • Job scheduling and execution state
  • Notification rules, templates, and delivery tracking
  • Vulnerability advisories and VEX statements
  • Policy packs, rules, and evaluation history

Triggers for this decision:

  1. Licensing trust and ecosystem stability: PostgreSQL is distributed under the permissive PostgreSQL License (similar to MIT/BSD), which is OSI-approved and carries no vendor lock-in concerns. MongoDB's SSPL (2018) is not OSI-approved and creates uncertainty for self-hosted and sovereign deployments. For a platform that emphasizes sovereignty and auditability, database licensing must be beyond reproach.
  2. Schema complexity: Control-plane domains have well-defined relational schemas with referential integrity requirements (foreign keys, cascading deletes, constraints).
  3. Query patterns: Complex joins, aggregations, and window functions are common (e.g., finding all images affected by a newly published CVE).
  4. ACID requirements: Job scheduling, token issuance, and notification delivery require strong transactional guarantees.
  5. Multi-tenancy: Row-level security (RLS) is needed for tenant isolation without schema-per-tenant overhead.
  6. Migration tooling: Multi-instance deployments need deterministic, forward-only migrations coordinated through advisory locks.
  7. Air-gap operation: All schema and data must be embeddable in assemblies, with no external network dependencies.
  8. Auditability: PostgreSQL's mature ecosystem includes proven audit logging, compliance tooling, and forensic capabilities trusted by regulated industries.

Decision

Adopt PostgreSQL (≥15) as the primary database for all StellaOps control-plane domains.

Key architectural choices:

1. Per-Module Schema Isolation

Each module owns exactly one PostgreSQL schema:

| Schema | Owner | Description |
|---|---|---|
| auth | Authority | Identity, authentication, authorization, licensing |
| vuln | Concelier | Vulnerability advisories, sources, affected packages |
| vex | Excititor | VEX statements, graphs, observations, consensus |
| scheduler | Scheduler | Jobs, triggers, workers, execution history |
| notify | Notify | Channels, templates, rules, deliveries |
| policy | Policy | Policy packs, rules, risk profiles |
| audit | Shared | Cross-cutting audit log (optional) |

Rationale:

  • Clear ownership boundaries
  • Independent migration lifecycles
  • Schema-level access control
  • Simplified testing and development
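
As a rough illustration of the ownership model, the sketch below generates bootstrap DDL for one module schema. Only the schema names and owning modules come from this ADR. The `bootstrap_ddl` helper and the per-module role name passed to it are hypothetical, and the shared `audit` schema is left out because it is optional.

```python
# Schema-to-owner mapping taken from the table in this ADR.
MODULE_SCHEMAS = {
    "auth": "Authority",
    "vuln": "Concelier",
    "vex": "Excititor",
    "scheduler": "Scheduler",
    "notify": "Notify",
    "policy": "Policy",
}


def bootstrap_ddl(schema: str, role: str) -> list[str]:
    """Return DDL that creates one module schema owned by a single role.

    `role` is a hypothetical per-module database role; the ADR names no roles.
    """
    if schema not in MODULE_SCHEMAS:
        raise ValueError(f"unknown module schema: {schema!r}")
    return [
        f'CREATE SCHEMA IF NOT EXISTS "{schema}" AUTHORIZATION "{role}";',
        # Defensive: make sure PUBLIC holds no privileges on the schema.
        f'REVOKE ALL ON SCHEMA "{schema}" FROM PUBLIC;',
    ]
```

Because each schema has exactly one owning role, a module's migrations can run under credentials that cannot alter another module's tables.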

2. Multi-Tenancy via tenant_id Column

All tenants share a single database and a single set of schemas; every tenant-scoped table carries a tenant_id column.

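-- Session-level tenant context
SET app.tenant_id = '<tenant-uuid>';

-- Row-level security (defense in depth)
-- Policies are only enforced once RLS is enabled on the table.
ALTER TABLE <table> ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON <table>
    USING (tenant_id = current_setting('app.tenant_id')::uuid);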

Rationale:

  • Simplest operational model
  • Shared connection pooling
  • Easy cross-tenant queries for admin operations
  • Composite indexes on (tenant_id, ...) for query performance
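
With shared connection pooling, a session-level `SET` stays on the connection after it is returned to the pool. One way to avoid that is to set the tenant per transaction instead. The sketch below is a minimal illustration, assuming a DB-API 2.0 driver with `%s` placeholders (for example psycopg) and autocommit disabled. The `tenant_transaction` helper is hypothetical; `set_config` is the built-in PostgreSQL function.

```python
from contextlib import contextmanager


@contextmanager
def tenant_transaction(conn, tenant_id: str):
    """Run a block inside one transaction scoped to a single tenant.

    `conn` is any DB-API 2.0 connection to PostgreSQL with autocommit off.
    """
    cur = conn.cursor()
    try:
        # set_config(name, value, is_local=true) behaves like SET LOCAL: the
        # value is discarded at commit or rollback, so it cannot leak to the
        # next user of a pooled connection.
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Queries issued through the yielded cursor are then filtered by the `tenant_isolation` policy. The connecting role must not own the table or hold `BYPASSRLS`, since row-level security is skipped for both unless `FORCE ROW LEVEL SECURITY` is set on the table.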

3. Forward-Only Migrations with Advisory Locks

Migrations are embedded in assemblies and executed at startup with PostgreSQL advisory locks:

SELECT pg_try_advisory_lock(hashtext('auth'));  -- Per-schema lock
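
The lock call above could be wrapped so that it is always released. This is a minimal sketch, not the StellaOps implementation: it assumes a DB-API 2.0 connection in autocommit mode, and the `schema_migration_lock` helper is hypothetical. `pg_try_advisory_lock` and `pg_advisory_unlock` are the built-in PostgreSQL functions.

```python
from contextlib import contextmanager


@contextmanager
def schema_migration_lock(conn, schema: str):
    """Yield True if this instance holds the per-schema advisory lock, else False."""
    cur = conn.cursor()
    # Non-blocking: returns false immediately if another instance holds the lock.
    cur.execute("SELECT pg_try_advisory_lock(hashtext(%s))", (schema,))
    acquired = bool(cur.fetchone()[0])
    try:
        yield acquired
    finally:
        if acquired:
            # Session-level advisory locks persist until released or the session ends.
            cur.execute("SELECT pg_advisory_unlock(hashtext(%s))", (schema,))
        cur.close()
```

A caller would enter `with schema_migration_lock(conn, "auth") as held:` and run migrations only when `held` is true. An instance that does not get the lock can wait and re-check, since another instance is already migrating that schema.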

Migration categories:

  • Startup (001-099): Automatic, non-breaking DDL
  • Release (100-199): Manual CLI, breaking changes
  • Seed (S001-S999): Reference data
  • Data (DM001-DM999): Batched background jobs
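
The prefix ranges above lend themselves to a small classifier. The sketch below is illustrative only: the file name shape `<prefix>_<description>.sql` is an assumption, as is returning `None` for missing, blank, or out-of-range names. The `categorize` function and `MigrationCategory` enum are hypothetical names.

```python
import re
from enum import Enum
from typing import Optional


class MigrationCategory(Enum):
    STARTUP = "startup"
    RELEASE = "release"
    SEED = "seed"
    DATA = "data"


# Assumed file name shape: "<prefix>_<description>.sql" (not specified by the ADR).
_PREFIX = re.compile(r"^(DM|S)?(\d{3})_")


def categorize(name: Optional[str]) -> Optional[MigrationCategory]:
    """Map a migration file name to its category, or None if it fits no range."""
    if name is None or not name.strip():
        return None
    match = _PREFIX.match(name.strip())
    if match is None:
        return None
    letters, number = match.group(1), int(match.group(2))
    if letters == "DM":
        return MigrationCategory.DATA if 1 <= number <= 999 else None
    if letters == "S":
        return MigrationCategory.SEED if 1 <= number <= 999 else None
    if 1 <= number <= 99:
        return MigrationCategory.STARTUP
    if 100 <= number <= 199:
        return MigrationCategory.RELEASE
    return None
```

A startup host could use such a classifier to apply only `STARTUP` migrations automatically and to refuse to start while `RELEASE` migrations are pending.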

Rationale:

  • No down migrations needed (forward-only with fix-forward)
  • Advisory locks prevent concurrent migrations across instances
  • Checksum validation catches unauthorized modifications
  • Air-gap compatible (no external migration service needed)
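
Checksum validation can be pictured as comparing the digest recorded when a migration ran against the digest of the text embedded in the current build. The sketch below assumes SHA-256 and line-ending normalization, neither of which is specified by this ADR. The `checksum` and `find_modified` helpers are hypothetical.

```python
import hashlib


def checksum(sql_text: str) -> str:
    """SHA-256 of the migration text; the hash algorithm is an assumption."""
    # Normalize line endings so a Windows checkout does not change the digest.
    normalized = sql_text.replace("\r\n", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_modified(applied: dict[str, str], embedded: dict[str, str]) -> list[str]:
    """Return names of applied migrations whose embedded text no longer matches.

    applied:  name -> checksum recorded when the migration ran
    embedded: name -> SQL text shipped with the current build
    """
    return sorted(
        name
        for name, recorded in applied.items()
        if name in embedded and checksum(embedded[name]) != recorded
    )
```

Under a forward-only policy, a non-empty result means an already-applied file was edited. The fix is a new migration, not a change to the old one.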

4. RustFS for Binary Artifacts

PostgreSQL stores metadata and indexes; RustFS stores binary artifacts (SBOMs, attestations, reports):

PostgreSQL: Schema definitions, relationships, indexes, audit trails
RustFS:     sbom.cdx.json.zst, inventory.cdx.pb, bom-index.bin, *.dsse.json

Rationale:

  • Right tool for each job
  • PostgreSQL excels at structured queries
  • Object storage is better suited to large binary blobs
  • Clear separation of concerns
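
One way to picture the split is a small metadata record that PostgreSQL would hold for each blob kept in object storage. Everything in this sketch is an assumption made for illustration: the `ArtifactRef` fields, the digest-prefixed key layout, and the `store_artifact` helper. A plain dict stands in for an object-store client so the example runs on its own.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactRef:
    """Metadata row kept in PostgreSQL; the bytes themselves live in object storage."""
    object_key: str
    sha256: str
    size_bytes: int


def store_artifact(blobs: dict[str, bytes], name: str, payload: bytes) -> ArtifactRef:
    """Write the payload to the blob store and return the row to persist.

    `blobs` stands in for an object-store client; a dict keeps the sketch runnable.
    """
    digest = hashlib.sha256(payload).hexdigest()
    # Prefix the key with the digest so the stored object can be verified
    # against the metadata row later.
    key = f"{digest}/{name}"
    blobs[key] = payload
    return ArtifactRef(object_key=key, sha256=digest, size_bytes=len(payload))
```

Keeping only the key, digest, and size in the database keeps rows small and indexable, while the blob can be re-verified against the recorded digest whenever it is read.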

Consequences

Positive

  1. Licensing trust: The PostgreSQL License is permissive, OSI-approved, and widely accepted. There is no vendor lock-in and no license ambiguity for sovereign deployments. PostgreSQL is trusted by governments, regulated industries, and security-conscious organizations.
  2. Ecosystem stability: 30+ years of development, packaged in all major distributions, and no history of relicensing. Community governance supports long-term trust.
  3. Relational integrity: Foreign keys, constraints, and transactions ensure data consistency.
  4. Query flexibility: Complex joins, CTEs, window functions, and full-text search are available natively.
  5. Operational maturity: Well-understood backup, replication, and monitoring ecosystem.
  6. Row-level security: Built-in tenant isolation without application-layer workarounds.
  7. Schema evolution: Mature migration tooling with online DDL capabilities.
  8. Performance: Excellent query planning and indexing options, with connection pooling available through PgBouncer.
  9. Auditability: Proven audit logging extensions (pgAudit), compliance certifications, and forensic tooling.

Negative

  1. Schema rigidity: Changes require migrations; less flexible than document stores for rapidly evolving schemas.
  2. Operational overhead: Requires PostgreSQL expertise for tuning, vacuuming, and monitoring.
  3. Connection limits: PostgreSQL uses one backend process per connection, so high-concurrency workloads need a pooler such as PgBouncer.

Follow-up Actions

  • Create docs/db/ documentation directory with specification, rules, and conversion plan
  • Define migration infrastructure in StellaOps.Infrastructure.Postgres
  • Complete phased conversion from MongoDB per docs/db/tasks/PHASE_*.md
  • Update deployment guides for PostgreSQL requirements
  • Add PostgreSQL health checks to all control-plane services

Rollback Criteria

Revert to MongoDB (or hybrid) if:

  • Migration performance is unacceptable (startup time > 60 s)
  • Query complexity exceeds PostgreSQL capabilities
  • Operational burden exceeds team capacity

Alternatives Considered

Option A: Continue with MongoDB

Pros:

  • Already in use for some components
  • Flexible schema
  • Good for document-centric workloads

Cons:

  • Licensing uncertainty: MongoDB's SSPL (Server Side Public License, 2018) is not OSI-approved. This creates legal ambiguity for sovereign and self-hosted deployments, especially in regulated industries and government contexts where license provenance matters.
  • Ecosystem trust erosion: The switch to the SSPL led major distributions (Debian, Fedora, RHEL) to drop MongoDB packages. Sovereign customers may have policies against non-OSI licenses.
  • No referential integrity (app-enforced)
  • Limited join capabilities
  • Multi-tenancy requires additional logic
  • No row-level security
  • Less mature migration tooling

Rejected because: Licensing uncertainty is incompatible with StellaOps' sovereign-first positioning. Control-plane domains are also fundamentally relational with strong consistency requirements.

Option B: Hybrid (PostgreSQL + MongoDB)

Pros:

  • Use each database for appropriate workloads
  • Gradual migration possible

Cons:

  • Two databases to operate and monitor
  • Complex deployment
  • Cross-database consistency challenges
  • Higher operational burden

Rejected because: Unified PostgreSQL approach is simpler and sufficient for all control-plane needs.

Option C: CockroachDB / YugabyteDB

Pros:

  • PostgreSQL-compatible
  • Built-in horizontal scaling
  • Multi-region capabilities

Cons:

  • Additional operational complexity
  • Less mature than PostgreSQL
  • Overkill for current scale
  • Air-gap deployment challenges

Rejected because: PostgreSQL provides sufficient scale and simpler operations for current requirements. Can revisit if horizontal scaling becomes necessary.

References