# MongoDB to PostgreSQL Conversion Plan

**Version:** 2.0.0
**Status:** APPROVED
**Created:** 2025-11-28
**Last Updated:** 2025-11-28

---

## Executive Summary

This document outlines the strategic plan to **convert** (not migrate) StellaOps from MongoDB to PostgreSQL for control-plane domains. The conversion follows a "strangler fig" pattern, introducing PostgreSQL repositories alongside existing MongoDB implementations and gradually switching each bounded context.

**Key Finding:** StellaOps already has production-ready PostgreSQL patterns in the Orchestrator and Findings modules that serve as templates for all other modules.

### Related Documents
| Document | Purpose |
|----------|---------|
| [SPECIFICATION.md](./SPECIFICATION.md) | Schema designs, naming conventions, data types |
| [RULES.md](./RULES.md) | Database coding rules and patterns |
| [VERIFICATION.md](./VERIFICATION.md) | Testing and verification requirements |
| [tasks/](./tasks/) | Detailed task definitions per phase |
---
## 1. Principles & Scope
### 1.1 Goals
Convert **control-plane** domains from MongoDB to PostgreSQL:
| Domain | Current DB | Target | Priority |
|--------|-----------|--------|----------|
| Authority | `stellaops_authority` | PostgreSQL | P0 |
| Scheduler | `stellaops_scheduler` | PostgreSQL | P0 |
| Notify | `stellaops_notify` | PostgreSQL | P1 |
| Policy | `stellaops_policy` | PostgreSQL | P1 |
| Vulnerabilities (Concelier) | `concelier` | PostgreSQL | P2 |
| VEX & Graph (Excititor) | `excititor` | PostgreSQL | P2 |
| PacksRegistry | `stellaops_packs` | PostgreSQL | P3 |
| IssuerDirectory | `stellaops_issuer` | PostgreSQL | P3 |
### 1.2 Non-Goals
- Scanner result storage (remains on object storage plus MongoDB for now)
- Real-time event streams (separate infrastructure)
- Legacy data archive (can remain in MongoDB read-only)
### 1.3 Constraints
**MUST Preserve:**
- Deterministic, replayable scans
- "Preserve/prune source" rule for Concelier/Excititor
- Lattice logic in `Scanner.WebService` (not in DB)
- Air-gap friendliness and offline-kit packaging
- Multi-tenant isolation patterns
- Zero downtime during conversion
### 1.4 Conversion vs Migration
This is a **conversion**, not a 1:1 document→row mapping:
| Approach | When to Use |
|----------|-------------|
| **Normalize** | Identities, jobs, schedules, relationships |
| **Keep JSONB** | Advisory payloads, provenance trails, evidence manifests |
| **Drop/Archive** | Ephemeral data (caches, locks), historical logs |
---
## 2. Architecture
### 2.1 Strangler Fig Pattern
```
┌─────────────────────────────────────────────────────────────┐
│                        Service Layer                        │
├─────────────────────────────────────────────────────────────┤
│                    Repository Interface                     │
│                 (e.g., IScheduleRepository)                 │
├──────────────────────┬──────────────────────────────────────┤
│   MongoRepository    │          PostgresRepository          │
│      (existing)      │                (new)                 │
├──────────────────────┴──────────────────────────────────────┤
│              DI Container (configured switch)               │
└─────────────────────────────────────────────────────────────┘
```
### 2.2 Configuration-Driven Backend Selection
```json
{
  "Persistence": {
    "Authority": "Postgres",
    "Scheduler": "Postgres",
    "Concelier": "Mongo",
    "Excititor": "Mongo",
    "Notify": "Postgres",
    "Policy": "Mongo"
  }
}
```
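
The switch above can be sketched in C# as a small factory that reads the `Persistence:<Module>` key and returns the matching repository. This is a minimal illustration, not the StellaOps implementation: `PersistenceBackend`, `PersistenceSwitch`, the two repository classes, and the choice of Mongo as the default for modules without an entry are all assumptions, and a plain dictionary stands in for the real configuration system.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names throughout; only IScheduleRepository appears in the plan.
public enum PersistenceBackend { Mongo, Postgres }

public interface IScheduleRepository
{
    string BackendName { get; }
}

public sealed class MongoScheduleRepository : IScheduleRepository
{
    public string BackendName => "Mongo";
}

public sealed class PostgresScheduleRepository : IScheduleRepository
{
    public string BackendName => "Postgres";
}

public static class PersistenceSwitch
{
    // Reads "Persistence:<Module>" from a flattened configuration map.
    // Modules without an entry stay on Mongo (assumed default).
    public static PersistenceBackend Resolve(
        IReadOnlyDictionary<string, string> config, string module)
    {
        if (!config.TryGetValue($"Persistence:{module}", out var value))
            return PersistenceBackend.Mongo;

        // IsDefined rejects numeric strings such as "7" that TryParse accepts.
        if (Enum.TryParse<PersistenceBackend>(value, ignoreCase: true, out var backend)
            && Enum.IsDefined(typeof(PersistenceBackend), backend))
            return backend;

        throw new InvalidOperationException(
            $"Unknown persistence backend '{value}' for module '{module}'.");
    }

    // Callers depend only on the interface, so flipping the setting is
    // the whole cutover (and the whole rollback).
    public static IScheduleRepository CreateScheduleRepository(
        IReadOnlyDictionary<string, string> config)
        => Resolve(config, "Scheduler") switch
        {
            PersistenceBackend.Postgres => new PostgresScheduleRepository(),
            _ => new MongoScheduleRepository(),
        };
}
```

Failing fast on an unrecognized value is a deliberate choice in this sketch: a typo in the setting should not silently route a module to the wrong database.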
### 2.3 Existing PostgreSQL Patterns
The codebase already contains production-ready patterns:
| Module | Location | Reusable Components |
|--------|----------|---------------------|
| Orchestrator | `src/Orchestrator/.../Infrastructure/Postgres/` | DataSource, tenant context, repository pattern |
| Findings | `src/Findings/StellaOps.Findings.Ledger/Infrastructure/Postgres/` | Ledger events, Merkle anchors, projections |
**Reference Implementation:** `OrchestratorDataSource.cs`

---
## 3. Data Tiering
### 3.1 Tier Definitions
| Tier | Description | Strategy |
|------|-------------|----------|
| **A** | Critical business data | Full conversion with verification |
| **B** | Important but recoverable | Convert active records only |
| **C** | Ephemeral/cache data | Fresh start, no migration |
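
As a rough illustration of how the tiers translate into a per-record decision during conversion, the sketch below encodes the three strategies as a single predicate. The names `DataTier`, `TierPolicy`, and `ShouldConvert` are hypothetical, and treating "active" and "recent" as an either/or test for Tier B is an assumption; the retention window would come from the per-collection strategy (for example 90 or 180 days).

```csharp
using System;

public enum DataTier { A, B, C }

public static class TierPolicy
{
    // Tier A: always converted.
    // Tier B: converted only if still active or touched inside the window.
    // Tier C: never converted; the PostgreSQL store starts empty.
    public static bool ShouldConvert(
        DataTier tier,
        bool isActive,
        DateTimeOffset lastTouched,
        DateTimeOffset now,
        TimeSpan retentionWindow)
        => tier switch
        {
            DataTier.A => true,
            DataTier.B => isActive || now - lastTouched <= retentionWindow,
            DataTier.C => false,
            _ => throw new ArgumentOutOfRangeException(nameof(tier)),
        };
}
```

Passing `now` explicitly, rather than reading the clock inside the method, keeps a conversion run repeatable: re-running it with the same cutoff selects the same records.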
### 3.2 Module Tiering
#### Authority
| Collection | Tier | Strategy |
|------------|------|----------|
| `authority_users` | A | Full conversion |
| `authority_clients` | A | Full conversion |
| `authority_scopes` | A | Full conversion |
| `authority_tokens` | B | Active tokens only |
| `authority_service_accounts` | A | Full conversion |
| `authority_login_attempts` | B | Recent 90 days |
| `authority_revocations` | A | Full conversion |
#### Scheduler
| Collection | Tier | Strategy |
|------------|------|----------|
| `schedules` | A | Full conversion |
| `runs` | B | Recent 180 days |
| `graph_jobs` | B | Active/recent only |
| `policy_jobs` | B | Active/recent only |
| `impact_snapshots` | B | Recent 90 days |
| `locks` | C | Fresh start |
#### Concelier (Vulnerabilities)
| Collection | Tier | Strategy |
|------------|------|----------|
| `advisory` | A | Full conversion |
| `advisory_raw` | B | GridFS refs only |
| `alias` | A | Full conversion |
| `affected` | A | Full conversion |
| `source` | A | Full conversion |
| `source_state` | A | Full conversion |
| `jobs`, `locks` | C | Fresh start |
#### Excititor (VEX)
| Collection | Tier | Strategy |
|------------|------|----------|
| `vex.statements` | A | Full conversion |
| `vex.observations` | A | Full conversion |
| `vex.linksets` | A | Full conversion |
| `vex.consensus` | A | Full conversion |
| `vex.raw` | B | Active/recent only |
| `vex.cache` | C | Fresh start |
---
## 4. Execution Phases
### Phase Overview
```
Phase 0: Foundations                    [1 sprint]
  ├─→ Phase 1: Authority                [1 sprint]
  ├─→ Phase 2: Scheduler                [1 sprint]
  ├─→ Phase 3: Notify                   [1 sprint]
  ├─→ Phase 4: Policy                   [1 sprint]
  └─→ Phase 5: Concelier                [2 sprints]
        └─→ Phase 6: Excititor          [2-3 sprints]
              └─→ Phase 7: Cleanup      [1 sprint]
```
### Phase Summary
| Phase | Scope | Duration | Dependencies | Deliverable |
|-------|-------|----------|--------------|-------------|
| 0 | Foundations | 1 sprint | None | PostgreSQL infrastructure, shared library |
| 1 | Authority | 1 sprint | Phase 0 | Identity management on PostgreSQL |
| 2 | Scheduler | 1 sprint | Phase 0 | Job scheduling on PostgreSQL |
| 3 | Notify | 1 sprint | Phase 0 | Notifications on PostgreSQL |
| 4 | Policy | 1 sprint | Phase 0 | Policy engine on PostgreSQL |
| 5 | Concelier | 2 sprints | Phase 0 | Vulnerability index on PostgreSQL |
| 6 | Excititor | 2-3 sprints | Phase 5 | VEX & graphs on PostgreSQL |
| 7 | Cleanup | 1 sprint | All | MongoDB retired, docs updated |

**Total: 10-11 sprints** when the phases run sequentially (the sum of the durations above; Section 9.1 schedules 11).
### Detailed Task Definitions
See:
- [tasks/PHASE_0_FOUNDATIONS.md](./tasks/PHASE_0_FOUNDATIONS.md)
- [tasks/PHASE_1_AUTHORITY.md](./tasks/PHASE_1_AUTHORITY.md)
- [tasks/PHASE_2_SCHEDULER.md](./tasks/PHASE_2_SCHEDULER.md)
- [tasks/PHASE_3_NOTIFY.md](./tasks/PHASE_3_NOTIFY.md)
- [tasks/PHASE_4_POLICY.md](./tasks/PHASE_4_POLICY.md)
- [tasks/PHASE_5_VULNERABILITIES.md](./tasks/PHASE_5_VULNERABILITIES.md)
- [tasks/PHASE_6_VEX_GRAPH.md](./tasks/PHASE_6_VEX_GRAPH.md)
- [tasks/PHASE_7_CLEANUP.md](./tasks/PHASE_7_CLEANUP.md)
---
## 5. Conversion Strategy
### 5.1 Per-Module Approach
```
1. Create PostgreSQL storage project
2. Implement schema migrations
3. Implement repository interfaces
4. Add configuration switch
5. (Retired) Dual-write was used during migration for Tier A; all modules are now Postgres-only.
6. Run verification tests
7. Switch to PostgreSQL-only
8. Archive MongoDB data
```
### 5.2 Dual-Write Pattern
For Tier A data requiring historical continuity:
```
┌──────────────────────────────────────────────────────────────┐
│                     DualWriteRepository                      │
├──────────────────────────────────────────────────────────────┤
│ Write:  PostgreSQL (primary) + MongoDB (secondary)           │
│ Read:   PostgreSQL (primary) → MongoDB (fallback)            │
│ Config: WriteToBoth, FallbackToMongo, ConvertOnRead          │
└──────────────────────────────────────────────────────────────┘
```
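
A minimal C# sketch of this decorator follows. Only the name `DualWriteRepository` and the three option names come from the diagram; `IDocumentRepository`, `InMemoryRepository`, and `DualWriteOptions` are hypothetical stand-ins, the calls are synchronous for brevity, and reading `ConvertOnRead` as "backfill PostgreSQL when the MongoDB fallback hits" is an assumption about the intended semantics.

```csharp
#nullable enable
using System.Collections.Generic;

public interface IDocumentRepository
{
    void Upsert(string id, string document);
    string? Find(string id);
}

// Stand-in for a real Mongo- or Postgres-backed repository.
public sealed class InMemoryRepository : IDocumentRepository
{
    private readonly Dictionary<string, string> _rows = new();
    public void Upsert(string id, string document) => _rows[id] = document;
    public string? Find(string id) => _rows.TryGetValue(id, out var doc) ? doc : null;
}

public sealed record DualWriteOptions(
    bool WriteToBoth = true, bool FallbackToMongo = true, bool ConvertOnRead = false);

public sealed class DualWriteRepository : IDocumentRepository
{
    private readonly IDocumentRepository _postgres;
    private readonly IDocumentRepository _mongo;
    private readonly DualWriteOptions _options;

    public DualWriteRepository(
        IDocumentRepository postgres, IDocumentRepository mongo, DualWriteOptions options)
    {
        _postgres = postgres;
        _mongo = mongo;
        _options = options;
    }

    public void Upsert(string id, string document)
    {
        _postgres.Upsert(id, document);      // primary
        if (_options.WriteToBoth)
            _mongo.Upsert(id, document);     // secondary
    }

    public string? Find(string id)
    {
        var row = _postgres.Find(id);
        if (row is not null || !_options.FallbackToMongo)
            return row;

        // Not yet converted: read the legacy copy and optionally backfill it.
        var legacy = _mongo.Find(id);
        if (legacy is not null && _options.ConvertOnRead)
            _postgres.Upsert(id, legacy);
        return legacy;
    }
}
```

Because PostgreSQL is written first, a failed secondary write leaves the primary store authoritative; how such a failure is surfaced or retried is left open here.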
### 5.3 Fresh Start Pattern
For Tier C ephemeral data:
```
┌──────────────────────────────────────────────────────────────┐
│ 1. Deploy PostgreSQL schema                                  │
│ 2. Switch configuration to PostgreSQL                        │
│ 3. New data goes to PostgreSQL only                          │
│ 4. Old MongoDB data ages out naturally                       │
└──────────────────────────────────────────────────────────────┘
```
---
## 6. Risk Assessment
### 6.1 Technical Risks
| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Data loss during conversion | High | Low | Dual-write mode, extensive verification |
| Performance regression | Medium | Medium | Load testing before switch, index optimization |
| Determinism violation | High | Medium | Automated verification tests, parallel pipeline |
| Schema evolution conflicts | Medium | Low | Migration framework, schema versioning |
| Transaction semantics differences | Medium | Low | Code review, integration tests |
### 6.2 Operational Risks
| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Extended conversion timeline | Medium | Medium | Phase-based approach, clear milestones |
| Team learning curve | Low | Medium | Reference implementations, documentation |
| Rollback complexity | Medium | Low | Keep Mongo data until verified, feature flags |
### 6.3 Rollback Strategy
Each phase has independent rollback capability:
| Level | Action | Recovery Time |
|-------|--------|---------------|
| Configuration | Change `Persistence:<Module>` to `Mongo` | Minutes |
| Data | MongoDB data retained during dual-write | None needed (historical note; dual-write ended after cutover) |
| Code | Git revert (PostgreSQL code isolated) | Hours |
---
## 7. Success Criteria
### 7.1 Per-Module Criteria
- [ ] All existing integration tests pass with PostgreSQL backend
- [ ] No performance regression >10% on critical paths
- [ ] Deterministic outputs verified against MongoDB baseline
- [ ] Zero data loss during conversion
- [ ] Tenant isolation verified
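
Two of these criteria are mechanical enough to sketch in C#: comparing outputs against the MongoDB baseline and checking the 10% regression budget. The helper names below are hypothetical, and the digest assumes each record has already been serialized to a canonical single-line string (for example, JSON with sorted keys); how StellaOps canonicalizes records is not specified in this plan.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class ConversionChecks
{
    // Sorts records ordinally before hashing so that two stores holding the
    // same data yield the same digest regardless of retrieval order.
    // Assumes records contain no newline characters.
    public static string Digest(IEnumerable<string> canonicalRecords)
    {
        var ordered = canonicalRecords.OrderBy(r => r, StringComparer.Ordinal);
        var bytes = Encoding.UTF8.GetBytes(string.Join("\n", ordered));
        return Convert.ToHexString(SHA256.HashData(bytes));
    }

    // True when the candidate latency is at most (1 + budget) times the baseline.
    public static bool WithinRegressionBudget(
        double baselineMs, double candidateMs, double budget = 0.10)
        => candidateMs <= baselineMs * (1.0 + budget);
}
```

A verification run would compute `Digest` over the same logical query against both backends and require equal values, then compare measured latencies on the critical paths with `WithinRegressionBudget`.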
### 7.2 Overall Criteria
- [ ] All control-plane modules running on PostgreSQL
- [ ] MongoDB retired from production for converted modules
- [ ] Air-gap kit updated with PostgreSQL support
- [ ] Documentation updated for PostgreSQL operations
- [ ] Runbooks updated for PostgreSQL troubleshooting
---
## 8. Project Structure
### 8.1 New Projects
```
src/
├── Shared/
│   └── StellaOps.Infrastructure.Postgres/
│       ├── DataSourceBase.cs
│       ├── Migrations/
│       │   ├── IPostgresMigration.cs
│       │   └── PostgresMigrationRunner.cs
│       ├── Extensions/
│       │   └── NpgsqlExtensions.cs
│       └── ServiceCollectionExtensions.cs
├── Authority/
│   └── __Libraries/
│       └── StellaOps.Authority.Storage.Postgres/
│           ├── AuthorityDataSource.cs
│           ├── Repositories/
│           ├── Migrations/
│           └── ServiceCollectionExtensions.cs
├── Scheduler/
│   └── __Libraries/
│       └── StellaOps.Scheduler.Storage.Postgres/
├── Notify/
│   └── __Libraries/
│       └── StellaOps.Notify.Storage.Postgres/
├── Policy/
│   └── __Libraries/
│       └── StellaOps.Policy.Storage.Postgres/
├── Concelier/
│   └── __Libraries/
│       └── StellaOps.Concelier.Storage.Postgres/
└── Excititor/
    └── __Libraries/
        └── StellaOps.Excititor.Storage.Postgres/
```
### 8.2 Schema Files
```
docs/db/
├── schemas/
│   ├── authority.sql
│   ├── vuln.sql
│   ├── vex.sql
│   ├── scheduler.sql
│   ├── notify.sql
│   └── policy.sql
```
---
## 9. Timeline
### 9.1 Sprint Schedule
| Sprint | Phase | Focus |
|--------|-------|-------|
| 1 | 0 | PostgreSQL infrastructure, shared library |
| 2 | 1 | Authority module conversion |
| 3 | 2 | Scheduler module conversion |
| 4 | 3 | Notify module conversion |
| 5 | 4 | Policy module conversion |
| 6-7 | 5 | Concelier/Vulnerability conversion |
| 8-10 | 6 | Excititor/VEX conversion |
| 11 | 7 | Cleanup, optimization, documentation |
### 9.2 Milestones
| Milestone | Sprint | Criteria |
|-----------|--------|----------|
| M1: Infrastructure Ready | 1 | PostgreSQL cluster operational, CI tests passing |
| M2: Identity Converted | 2 | Authority on PostgreSQL, auth flows working |
| M3: Scheduling Converted | 3 | Scheduler on PostgreSQL, jobs executing |
| M4: Core Services Converted | 5 | Notify + Policy on PostgreSQL |
| M5: Vulnerability Index Converted | 7 | Concelier on PostgreSQL, scans deterministic |
| M6: VEX Converted | 10 | Excititor on PostgreSQL, graphs stable |
| M7: MongoDB Retired | 11 | All modules converted, Mongo archived |
---
## 10. Governance
### 10.1 Decision Log
| Date | Decision | Rationale | Approver |
|------|----------|-----------|----------|
| 2025-11-28 | Strangler fig pattern | Allows gradual rollout with rollback | Architecture Team |
| 2025-11-28 | JSONB for semi-structured data | Preserves flexibility, simplifies conversion | Architecture Team |
| 2025-11-28 | Phase 0 first | Infrastructure must be stable before modules | Architecture Team |
### 10.2 Change Control
Changes to this plan require:
1. Impact assessment documented
2. Risk analysis updated
3. Approval from Architecture Team
4. Updated task definitions in `docs/db/tasks/`
### 10.3 Status Reporting
Weekly status updates in sprint files tracking:
- Tasks completed
- Blockers encountered
- Verification results
- Next sprint objectives
---
## Appendix A: Reference Implementation
### DataSource Pattern
```csharp
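using System;
using System.Threading;
using System.Threading.Tasks;
using Npgsql;

public sealed class ModuleDataSource : IAsyncDisposable
{
    private readonly NpgsqlDataSource _dataSource;

    public ModuleDataSource(NpgsqlDataSource dataSource) => _dataSource = dataSource;

    // Opens a connection whose session is bound to the given tenant.
    public async Task<NpgsqlConnection> OpenConnectionAsync(
        string tenantId,
        CancellationToken cancellationToken = default)
    {
        var connection = await _dataSource.OpenConnectionAsync(cancellationToken);
        try
        {
            await ConfigureSessionAsync(connection, tenantId, cancellationToken);
            return connection;
        }
        catch
        {
            // Do not leak the connection if session setup fails.
            await connection.DisposeAsync();
            throw;
        }
    }

    private static async Task ConfigureSessionAsync(
        NpgsqlConnection connection,
        string tenantId,
        CancellationToken cancellationToken)
    {
        await using var cmd = connection.CreateCommand();
        // Bind the tenant as a parameter; interpolating it into SET would
        // allow SQL injection through the tenant identifier.
        cmd.CommandText = """
            SELECT set_config('app.tenant_id', @tenant_id, false);
            SET timezone = 'UTC';
            SET statement_timeout = '30s';
            """;
        cmd.Parameters.AddWithValue("tenant_id", tenantId);
        await cmd.ExecuteNonQueryAsync(cancellationToken);
    }

    public ValueTask DisposeAsync() => _dataSource.DisposeAsync();
}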
```
### Repository Pattern
See [RULES.md](./RULES.md) Section 1 for complete repository implementation guidelines.

---
## Appendix B: Glossary
| Term | Definition |
|------|------------|
| **Strangler Fig** | Pattern where new system grows alongside old, gradually replacing it |
| **Dual-Write** | Writing to both MongoDB and PostgreSQL during transition |
| **Tier A/B/C** | Data classification by criticality; determines the conversion strategy for each collection |
| **DataSource** | Npgsql connection factory with tenant context configuration |
| **Determinism** | Property that same inputs always produce same outputs |
---
*Document Version: 2.0.0*
*Last Updated: 2025-11-28*