Phase 2: Scheduler Module Conversion

Sprint: 3 Duration: 1 sprint Status: DOING (fresh-start approved; Mongo backfill skipped) Dependencies: Phase 0 (Foundations) — DONE


Objectives

  1. Create StellaOps.Scheduler.Storage.Postgres project
  2. Implement Scheduler schema in PostgreSQL
  3. Implement 7+ repository interfaces
  4. Replace MongoDB job tracking with PostgreSQL
  5. Implement PostgreSQL advisory locks for distributed locking
  6. Backfill Mongo data or explicitly decide on fresh-start (PG-T2.9 to T2.11)

Deliverables

| Deliverable | Acceptance Criteria |
|---|---|
| Scheduler schema | All tables created with indexes |
| Repository implementations | All 7+ interfaces implemented |
| Advisory locks | Distributed locking working |
| Integration tests | 100% coverage of CRUD operations |
| Verification report | Schedule execution verified |

Schema Reference

See SPECIFICATION.md Section 5.4 for complete Scheduler schema.

Tables:

  • scheduler.schedules
  • scheduler.triggers
  • scheduler.runs
  • scheduler.graph_jobs
  • scheduler.policy_jobs
  • scheduler.impact_snapshots
  • scheduler.workers
  • scheduler.execution_logs
  • scheduler.locks
  • scheduler.run_summaries
  • scheduler.audit

Task Breakdown

T2.1: Create Scheduler.Storage.Postgres Project

Status: DONE Assignee: Scheduler Guild Estimate: 0.5 days

Subtasks:

  • T2.1.1: Create project structure
  • T2.1.2: Add NuGet references
  • T2.1.3: Create SchedulerDataSource class
  • T2.1.4: Create ServiceCollectionExtensions.cs

T2.2: Implement Schema Migrations

Status: DONE Assignee: Scheduler Guild Estimate: 1 day

Subtasks:

  • T2.2.1: Create V001_CreateSchedulerSchema migration
  • T2.2.2: Include all tables and indexes
  • T2.2.3: Add partial index for active schedules
  • T2.2.4: Test migration idempotency

T2.3: Implement Schedule Repository

Status: DONE Assignee: Scheduler Guild Estimate: 1 day

Interface:

public interface IScheduleRepository
{
    Task<Schedule?> GetAsync(string tenantId, string scheduleId, CancellationToken ct);
    Task<IReadOnlyList<Schedule>> ListAsync(string tenantId, ScheduleQueryOptions? options, CancellationToken ct);
    Task UpsertAsync(Schedule schedule, CancellationToken ct);
    Task<bool> SoftDeleteAsync(string tenantId, string scheduleId, string deletedBy, DateTimeOffset deletedAt, CancellationToken ct);
    Task<IReadOnlyList<Schedule>> GetDueSchedulesAsync(DateTimeOffset now, CancellationToken ct);
}

Subtasks:

  • T2.3.1: Implement all interface methods
  • T2.3.2: Handle soft delete correctly
  • T2.3.3: Implement GetDueSchedules for trigger calculation
  • T2.3.4: Write integration tests
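
The selection rule behind GetDueSchedulesAsync can be illustrated with a small in-memory sketch. It is only an illustration: the ScheduleRow record, its Enabled, DeletedAt, and NextRunAt fields, and the DueSchedules helper are hypothetical names, not the actual columns of scheduler.schedules, and the real repository would express the same filter as a PostgreSQL query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: of four schedules, only the enabled, non-deleted one whose run time has passed is due.
var now = new DateTimeOffset(2025, 12, 5, 8, 0, 0, TimeSpan.Zero);
var rows = new List<ScheduleRow>
{
    new("tenant-a", "nightly-scan", true, null, now.AddMinutes(-5)),
    new("tenant-a", "weekly-report", true, null, now.AddDays(2)),
    new("tenant-b", "paused", false, null, now.AddMinutes(-1)),
    new("tenant-b", "removed", true, now.AddDays(-1), now.AddMinutes(-1)),
};
foreach (var due in DueSchedules.Filter(rows, now))
    Console.WriteLine($"{due.TenantId}/{due.ScheduleId}"); // prints "tenant-a/nightly-scan"

// Hypothetical row shape (assumption); the real columns are defined in SPECIFICATION.md.
public sealed record ScheduleRow(
    string TenantId,
    string ScheduleId,
    bool Enabled,
    DateTimeOffset? DeletedAt,
    DateTimeOffset NextRunAt);

public static class DueSchedules
{
    // Assumed rule: a schedule is due when it is enabled, not soft-deleted,
    // and its next run time is at or before "now". Results are oldest first.
    public static IReadOnlyList<ScheduleRow> Filter(IEnumerable<ScheduleRow> rows, DateTimeOffset now)
        => rows
            .Where(r => r.Enabled && r.DeletedAt is null && r.NextRunAt <= now)
            .OrderBy(r => r.NextRunAt)
            .ToList();
}
```

Passing `now` in explicitly, as the interface does, keeps the rule deterministic and easy to test at boundary times.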

T2.4: Implement Run Repository

Status: DONE Assignee: Scheduler Guild Estimate: 1 day

Interface:

public interface IRunRepository
{
    Task<Run?> GetAsync(string tenantId, Guid runId, CancellationToken ct);
    Task<IReadOnlyList<Run>> ListAsync(string tenantId, RunQueryOptions? options, CancellationToken ct);
    Task<Run> CreateAsync(Run run, CancellationToken ct);
    Task<Run> UpdateAsync(Run run, CancellationToken ct);
    Task<IReadOnlyList<Run>> GetPendingRunsAsync(string tenantId, CancellationToken ct);
    Task<IReadOnlyList<Run>> GetRunsByScheduleAsync(string tenantId, Guid scheduleId, int limit, CancellationToken ct);
}

Subtasks:

  • T2.4.1: Implement all interface methods
  • T2.4.2: Handle state transitions
  • T2.4.3: Implement efficient pagination
  • T2.4.4: Write integration tests
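
One way to handle state transitions (T2.4.2) is to check every update against an explicit transition table before it is persisted. The sketch below assumes a hypothetical set of run states (Pending, Running, Succeeded, Failed, Cancelled); the actual states and the rules between them are defined by the Scheduler schema, not by this example.

```csharp
using System;
using System.Collections.Generic;

// Usage: a pending run may start, but a finished run may not be restarted.
Console.WriteLine(RunStateMachine.CanTransition(RunState.Pending, RunState.Running));   // True
Console.WriteLine(RunStateMachine.CanTransition(RunState.Succeeded, RunState.Running)); // False

// Hypothetical state set (assumption, not taken from the schema).
public enum RunState { Pending, Running, Succeeded, Failed, Cancelled }

public static class RunStateMachine
{
    // Allowed target states per source state; terminal states allow none.
    private static readonly IReadOnlyDictionary<RunState, RunState[]> Allowed =
        new Dictionary<RunState, RunState[]>
        {
            [RunState.Pending] = new[] { RunState.Running, RunState.Cancelled },
            [RunState.Running] = new[] { RunState.Succeeded, RunState.Failed, RunState.Cancelled },
            [RunState.Succeeded] = Array.Empty<RunState>(),
            [RunState.Failed] = Array.Empty<RunState>(),
            [RunState.Cancelled] = Array.Empty<RunState>(),
        };

    public static bool CanTransition(RunState from, RunState to)
        => Array.IndexOf(Allowed[from], to) >= 0;

    // Returns the new state, or throws if the transition is not in the table.
    public static RunState Transition(RunState from, RunState to)
        => CanTransition(from, to)
            ? to
            : throw new InvalidOperationException($"Illegal run transition {from} -> {to}");
}
```

A repository's UpdateAsync could call Transition before writing, so that illegal updates fail in one place; this also gives the state machine tests listed under Risks & Mitigations a single target.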

T2.5: Implement Graph Job Repository

Status: DONE Assignee: Scheduler Guild Estimate: 0.5 days

Subtasks:

  • T2.5.1: Implement CRUD operations
  • T2.5.2: Implement status queries
  • T2.5.3: Write integration tests

T2.6: Implement Policy Job Repository

Status: DONE Assignee: Scheduler Guild Estimate: 0.5 days

Subtasks:

  • T2.6.1: Implement CRUD operations
  • T2.6.2: Implement status queries
  • T2.6.3: Write integration tests

T2.7: Implement Impact Snapshot Repository

Status: DONE Assignee: Scheduler Guild Estimate: 0.5 days

Subtasks:

  • T2.7.1: Implement CRUD operations
  • T2.7.2: Implement queries by run
  • T2.7.3: Write integration tests

T2.8: Implement Distributed Locking

Status: DONE Assignee: Scheduler Guild Estimate: 1 day

Description: Implement distributed locking using PostgreSQL advisory locks.

Options:

  1. PostgreSQL advisory locks (pg_advisory_lock)
  2. Table-based locks with SELECT FOR UPDATE SKIP LOCKED
  3. Combination approach

Subtasks:

  • T2.8.1: Choose locking strategy
  • T2.8.2: Implement IDistributedLock interface
  • T2.8.3: Implement lock acquisition with timeout
  • T2.8.4: Implement lock renewal
  • T2.8.5: Implement lock release
  • T2.8.6: Write concurrency tests

Implementation Example:

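using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

public sealed class PostgresDistributedLock : IDistributedLock
{
    private readonly SchedulerDataSource _dataSource;

    public PostgresDistributedLock(SchedulerDataSource dataSource) => _dataSource = dataSource;

    public async Task<IAsyncDisposable?> TryAcquireAsync(
        string lockKey,
        TimeSpan timeout,
        CancellationToken ct)
    {
        var lockId = ComputeLockId(lockKey);
        var deadline = DateTimeOffset.UtcNow + timeout;
        // Session-level advisory locks belong to the connection, so it must stay
        // open until the returned handle is disposed (hence no "await using" here).
        var connection = await _dataSource.OpenConnectionAsync("system", ct);
        var acquired = false;
        try
        {
            while (true)
            {
                await using var cmd = connection.CreateCommand();
                cmd.CommandText = "SELECT pg_try_advisory_lock(@lock_id)";
                cmd.Parameters.AddWithValue("lock_id", lockId);

                acquired = await cmd.ExecuteScalarAsync(ct) is true;
                if (acquired) return new LockHandle(connection, lockId); // handle owns the connection
                if (DateTimeOffset.UtcNow >= deadline) return null;
                await Task.Delay(TimeSpan.FromMilliseconds(100), ct); // poll interval is an assumption
            }
        }
        finally
        {
            if (!acquired) await connection.DisposeAsync();
        }
    }

    // string.GetHashCode() is randomized per process in .NET, so workers would
    // disagree on the lock id; derive it from a SHA-256 digest of the key instead.
    private static long ComputeLockId(string key)
        => BinaryPrimitives.ReadInt64BigEndian(SHA256.HashData(Encoding.UTF8.GetBytes(key)));
}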
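
For the concurrency tests in T2.8.6, an in-memory test double can exercise calling code without a database. The following is a minimal sketch under stated assumptions: the IDistributedLock shape is inferred from the implementation example above, and InMemoryDistributedLock is a hypothetical helper that only models mutual exclusion within a single process. It does not replace tests against real PostgreSQL advisory locks.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Usage: eight concurrent callers compete for one key; exactly one should win.
var locks = new InMemoryDistributedLock();
var attempts = await Task.WhenAll(Enumerable.Range(0, 8).Select(_ =>
    Task.Run(() => locks.TryAcquireAsync("schedule:nightly-scan", TimeSpan.Zero, CancellationToken.None))));
Console.WriteLine($"winners: {attempts.Count(h => h is not null)}"); // winners: 1

// Disposing the winning handle makes the key available again.
await attempts.Single(h => h is not null)!.DisposeAsync();
var again = await locks.TryAcquireAsync("schedule:nightly-scan", TimeSpan.Zero, CancellationToken.None);
Console.WriteLine(again is not null); // True

// Interface shape inferred from the example above (assumption).
public interface IDistributedLock
{
    Task<IAsyncDisposable?> TryAcquireAsync(string lockKey, TimeSpan timeout, CancellationToken ct);
}

// Hypothetical single-process test double.
public sealed class InMemoryDistributedLock : IDistributedLock
{
    private readonly ConcurrentDictionary<string, byte> _held = new();

    public async Task<IAsyncDisposable?> TryAcquireAsync(string lockKey, TimeSpan timeout, CancellationToken ct)
    {
        var deadline = DateTimeOffset.UtcNow + timeout;
        while (true)
        {
            // TryAdd is atomic, so at most one caller can hold a given key.
            if (_held.TryAdd(lockKey, 0)) return new Handle(_held, lockKey);
            if (DateTimeOffset.UtcNow >= deadline) return null;
            await Task.Delay(10, ct);
        }
    }

    private sealed class Handle : IAsyncDisposable
    {
        private readonly ConcurrentDictionary<string, byte> _held;
        private readonly string _key;
        private int _released;

        public Handle(ConcurrentDictionary<string, byte> held, string key)
        {
            _held = held;
            _key = key;
        }

        public ValueTask DisposeAsync()
        {
            // Release only once, so a double dispose cannot free a lock
            // that another caller has acquired in the meantime.
            if (Interlocked.Exchange(ref _released, 1) == 0)
                _held.TryRemove(_key, out _);
            return ValueTask.CompletedTask;
        }
    }
}
```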

T2.9: Implement Worker Registration

Status: DONE Assignee: TBD Estimate: 0.5 days

Subtasks:

  • T2.9.1: Implement worker registration
  • T2.9.2: Implement heartbeat updates
  • T2.9.3: Implement dead worker detection
  • T2.9.4: Write integration tests
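
Dead worker detection (T2.9.3) typically reduces to comparing each worker's last heartbeat against a timeout. The sketch below shows that comparison in memory; the WorkerRow record, its LastHeartbeatAt field, and the 90-second timeout are illustrative assumptions rather than values from scheduler.workers or the Scheduler configuration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: worker-2 has not sent a heartbeat for five minutes and is reported as dead.
var now = new DateTimeOffset(2025, 12, 5, 8, 0, 0, TimeSpan.Zero);
var workers = new[]
{
    new WorkerRow("worker-1", now.AddSeconds(-10)),
    new WorkerRow("worker-2", now.AddMinutes(-5)),
};
foreach (var dead in WorkerHealth.FindDead(workers, now, TimeSpan.FromSeconds(90)))
    Console.WriteLine(dead.WorkerId); // prints "worker-2"

// Hypothetical row shape (assumption).
public sealed record WorkerRow(string WorkerId, DateTimeOffset LastHeartbeatAt);

public static class WorkerHealth
{
    // A worker counts as dead when its last heartbeat is older than the timeout.
    public static IReadOnlyList<WorkerRow> FindDead(
        IEnumerable<WorkerRow> workers, DateTimeOffset now, TimeSpan heartbeatTimeout)
        => workers.Where(w => now - w.LastHeartbeatAt > heartbeatTimeout).ToList();
}
```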

T2.10: Add Configuration Switch

Status: DONE Assignee: Scheduler Guild Estimate: 0.5 days

Subtasks:

  • T2.10.1: Update service registration
  • T2.10.2: Test backend switching
  • T2.10.3: Document configuration

T2.11: Run Verification Tests

Status: DONE (fresh-start; Postgres-only verification) Assignee: Scheduler Guild Estimate: 1 day

Subtasks:

  • T2.11.1: Test schedule CRUD
  • T2.11.2: Test run creation and state transitions
  • T2.11.3: Test trigger calculation
  • T2.11.4: Test distributed locking under concurrency
  • T2.11.5: Test job execution end-to-end
  • T2.11.6: Generate verification report (fresh-start baseline; Mongo parity not applicable)

T2.12: Switch to PostgreSQL-Only

Status: DONE Assignee: Scheduler Guild Estimate: 0.5 days

Subtasks:

  • T2.12.1: Update configuration (Persistence:Scheduler=Postgres)
  • T2.12.2: Deploy to staging
  • T2.12.3: Run integration tests
  • T2.12.4: Deploy to production
  • T2.12.5: Monitor metrics

Exit Criteria

  • All repository interfaces implemented
  • Distributed locking working correctly
  • All integration tests pass (module-level)
  • Fresh-start verification completed (no Mongo parity/backfill)
  • Scheduler running on PostgreSQL in staging/production

Execution Log

| Date (UTC) | Update | Owner |
|---|---|---|
| 2025-11-28 | Project + schema migration created; repos implemented (T2.1 to T2.8) | Scheduler Guild |
| 2025-11-30 | Determinism and concurrency tests added; advisory locks in place | Scheduler Guild |
| 2025-12-02 | Backfill tool added; Mongo endpoint unavailable → parity/backfill blocked | Scheduler Guild |
| 2025-12-05 | Phase 0 unblocked; fresh-start approved (skip Mongo backfill). Verification done on Postgres-only baseline; cutover pending config switch/deploy. | PM |
| 2025-12-05 | Config switched to Postgres; deployed to staging and production; integration smoke passed; monitoring active. | Scheduler Guild |

Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Lock contention | Medium | Medium | Test under load, tune timeouts |
| Trigger calculation errors | Low | High | Extensive testing with edge cases |
| State transition bugs | Medium | Medium | State machine tests |

Phase Version: 1.0.0 Last Updated: 2025-11-28