git.stella-ops.org/docs/modules/scheduler/hlc-migration-guide.md
2026-01-08 09:06:03 +02:00


HLC Queue Ordering Migration Guide

This guide describes how to enable Hybrid Logical Clock (HLC) ordering for the Scheduler queue and move from the legacy (priority, created_at) ordering to HLC-based ordering with cryptographic chain linking.

Overview

HLC ordering provides:

  • Deterministic global ordering: Causal consistency across distributed nodes
  • Cryptographic chain linking: Audit-safe job sequence proofs
  • Reproducible processing: The same input produces the same chain
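
To make "cryptographic chain linking" concrete, here is a minimal sketch of how a link could be derived from the previous link and the current entry. This guide does not specify the Scheduler's actual link formula, so the class name `ChainLinkSketch`, the method `ComputeLink`, the field order, and the zero-filled stand-in for a missing previous link are all assumptions made for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch: NOT the Scheduler's actual link formula.
public static class ChainLinkSketch
{
    // Assumed layout: link = SHA-256(prevLink || jobId || tHlc || payloadHash).
    // A null prevLink (start of a chain) is replaced by 32 zero bytes.
    // A production format would also need unambiguous field delimiting.
    public static byte[] ComputeLink(byte[]? prevLink, string jobId, string tHlc, byte[] payloadHash)
    {
        using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
        hasher.AppendData(prevLink ?? new byte[32]);
        hasher.AppendData(Encoding.UTF8.GetBytes(jobId));
        hasher.AppendData(Encoding.UTF8.GetBytes(tHlc));
        hasher.AppendData(payloadHash);
        return hasher.GetHashAndReset();
    }
}
```

Because each link depends on the one before it, changing or reordering any earlier entry changes every later link, which is what makes the sequence auditable.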

Prerequisites

  1. PostgreSQL 16+ with the scheduler schema
  2. HLC library dependency (StellaOps.HybridLogicalClock)
  3. Schema migration 002_hlc_queue_chain.sql applied

Migration Phases

Phase 1: Deploy with Dual-Write Mode

Enable dual-write to populate the new scheduler_log table without affecting existing operations.

# appsettings.yaml or environment configuration
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: false  # Keep using legacy ordering for reads
      DualWriteMode: true       # Write to both legacy and HLC tables

// Program.cs or Startup.cs
services.AddOptions<SchedulerQueueOptions>()
    .Bind(configuration.GetSection("Scheduler:Queue"))
    .ValidateDataAnnotations()
    .ValidateOnStart();

// Register HLC services
services.AddHlcSchedulerServices();

// Register HLC clock
services.AddSingleton<IHybridLogicalClock>(sp =>
{
    var nodeId = Environment.MachineName; // or use a stable node identifier
    return new HybridLogicalClock(nodeId, TimeProvider.System);
});

Verification:

  • Monitor the scheduler_hlc_enqueues_total metric for dual-write activity
  • Verify that the scheduler_log table is being populated
  • Check that chain verification passes: scheduler_chain_verifications_total{result="valid"}

Phase 2: Backfill Historical Data (Optional)

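If you need existing jobs in the HLC chain, backfill them from the scheduler.jobs table. The script below covers only jobs that are still pending, scheduled, or running and that have no scheduler_log entry yet: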

-- Backfill script (run during maintenance window)
-- Note: This creates a new chain starting from historical data
-- The chain will not have valid prev_link values for historical entries

INSERT INTO scheduler.scheduler_log (
    tenant_id, t_hlc, partition_key, job_id, payload_hash, prev_link, link
)
SELECT
    tenant_id,
    -- Generate synthetic HLC timestamps based on created_at
    -- Format: YYYYMMDDHHMMSS-nodeid-counter
    -- id breaks created_at ties so the counter is assigned deterministically
    TO_CHAR(created_at AT TIME ZONE 'UTC', 'YYYYMMDDHH24MISS') || '-backfill-' ||
        LPAD(ROW_NUMBER() OVER (PARTITION BY tenant_id ORDER BY created_at, id)::TEXT, 6, '0'),
    COALESCE(project_id, ''),
    id,
    DECODE(payload_digest, 'hex'),
    NULL,  -- No chain linking for historical data
    DECODE(payload_digest, 'hex')  -- Use payload_digest as link placeholder
FROM scheduler.jobs
WHERE status IN ('pending', 'scheduled', 'running')
  AND NOT EXISTS (
    SELECT 1 FROM scheduler.scheduler_log sl
    WHERE sl.job_id = jobs.id
  )
ORDER BY tenant_id, created_at;
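
The synthetic timestamp built by the script can be mirrored in application code, for example to predict or check backfilled values. The sketch below reproduces the same `YYYYMMDDHHMMSS-backfill-counter` layout. `BackfillHlcSketch` and `Format` are hypothetical names and not part of the HLC library.

```csharp
using System;
using System.Globalization;

// Hypothetical helper mirroring the SQL expression in the backfill script.
public static class BackfillHlcSketch
{
    public static string Format(DateTimeOffset createdAt, long rowNumber)
    {
        // Same as TO_CHAR(created_at AT TIME ZONE 'UTC', 'YYYYMMDDHH24MISS').
        string stamp = createdAt.UtcDateTime.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture);

        // Same as LPAD(row_number::TEXT, 6, '0') for counters up to 999999.
        // Unlike PadLeft, LPAD truncates longer input to 6 characters, so the
        // SQL counter is only unique below one million rows per tenant.
        string counter = rowNumber.ToString(CultureInfo.InvariantCulture).PadLeft(6, '0');

        return $"{stamp}-backfill-{counter}";
    }
}
```

For a job created at 2026-01-08 09:06:03 +02:00 with row number 42, `Format` returns `20260108070603-backfill-000042`.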

Phase 3: Enable HLC Ordering for Reads

Once dual-write is stable and backfill (if needed) is complete:

Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: true   # Use HLC ordering for reads
      DualWriteMode: true       # Keep dual-write during transition
      VerifyOnDequeue: false    # Optional: enable for extra validation

Verification:

  • Monitor dequeue latency (it should be comparable to legacy ordering)
  • Verify job processing order matches HLC order
  • Check chain integrity periodically
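
One way to check that processing order matches HLC order is to collect the t_hlc values in the order the jobs were processed and look for the first inversion. The sketch below assumes that t_hlc strings sort correctly under ordinal string comparison, which this guide does not confirm. `HlcOrderCheckSketch` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical check; assumes t_hlc values are ordinally sortable strings.
public static class HlcOrderCheckSketch
{
    // Returns the index of the first entry that sorts before its predecessor,
    // or -1 when the whole sequence is in non-decreasing HLC order.
    public static int FirstOutOfOrder(IReadOnlyList<string> processedHlc)
    {
        for (int i = 1; i < processedHlc.Count; i++)
        {
            if (string.CompareOrdinal(processedHlc[i - 1], processedHlc[i]) > 0)
            {
                return i;
            }
        }
        return -1;
    }
}
```

Run it per tenant and partition: entries from different partitions are not necessarily expected to interleave in a single global order.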

Phase 4: Disable Dual-Write Mode

Once you are confident in HLC ordering, stop writing to the legacy table:

Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: true
      DualWriteMode: false      # Stop writing to legacy table
      VerifyOnDequeue: false

Configuration Reference

SchedulerHlcOptions

| Property          | Type | Default | Description                                 |
|-------------------|------|---------|---------------------------------------------|
| EnableHlcOrdering | bool | false   | Use HLC ordering for queue reads            |
| DualWriteMode     | bool | false   | Write to both legacy and HLC tables         |
| VerifyOnDequeue   | bool | false   | Verify chain integrity on each dequeue      |
| MaxClockDriftMs   | int  | 60000   | Maximum allowed clock drift in milliseconds |
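
As a reading aid, the options above can be pictured as a plain options class. The sketch below is not the real SchedulerHlcOptions type. The class name `SchedulerHlcOptionsSketch`, the `Range` attribute, and the `DescribePhase` helper are assumptions, while the property names and defaults come from the table.

```csharp
using System.ComponentModel.DataAnnotations;

// Hypothetical mirror of the documented options; not the library's own class.
public sealed class SchedulerHlcOptionsSketch
{
    public bool EnableHlcOrdering { get; set; } = false;
    public bool DualWriteMode { get; set; } = false;
    public bool VerifyOnDequeue { get; set; } = false;

    // Assumed constraint: a negative drift limit would make no sense.
    [Range(0, int.MaxValue)]
    public int MaxClockDriftMs { get; set; } = 60000;

    // Maps the two main flags to the migration phase they correspond to.
    public string DescribePhase() => (EnableHlcOrdering, DualWriteMode) switch
    {
        (false, false) => "legacy only",
        (false, true) => "phase 1: dual-write, legacy reads",
        (true, true) => "phase 3: dual-write, HLC reads",
        (true, false) => "phase 4: HLC only",
    };
}
```

Logging `DescribePhase()` at startup is one way to confirm which phase a node is actually running.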

Metrics

| Metric                                    | Type      | Description                       |
|-------------------------------------------|-----------|-----------------------------------|
| scheduler_hlc_enqueues_total              | Counter   | Total HLC enqueue operations      |
| scheduler_hlc_enqueue_deduplicated_total  | Counter   | Deduplicated enqueue operations   |
| scheduler_hlc_enqueue_duration_seconds    | Histogram | Enqueue operation duration        |
| scheduler_hlc_dequeues_total              | Counter   | Total HLC dequeue operations      |
| scheduler_hlc_dequeued_entries_total      | Counter   | Total entries dequeued            |
| scheduler_chain_verifications_total       | Counter   | Chain verification operations     |
| scheduler_chain_verification_issues_total | Counter   | Chain verification issues found   |
| scheduler_batch_snapshots_created_total   | Counter   | Batch snapshots created           |

Troubleshooting

Chain Verification Failures

If chain verification reports issues:

  1. Check scheduler_chain_verification_issues_total for the issue count

  2. Run the chain verifier and log each reported issue:

    var result = await chainVerifier.VerifyAsync(tenantId);
    foreach (var issue in result.Issues)
    {
        logger.LogError(
            "Chain issue at job {JobId}: {Type} - {Description}",
            issue.JobId, issue.IssueType, issue.Description);
    }
    
  3. Common causes:

    • Database corruption: Restore from backup
    • Concurrent writes without proper locking: Check transaction isolation
    • Clock drift: Check that node clocks are synchronized and that MaxClockDriftMs (default 60000) suits your environment
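
To illustrate the clock drift cause, the sketch below shows the kind of comparison a drift limit implies: the physical time carried by an incoming timestamp is compared with the local clock and rejected when the gap exceeds MaxClockDriftMs. How the HLC library actually enforces the limit is not described in this guide. `ClockDriftSketch`, `IsWithinDrift`, and the use of an absolute difference are assumptions.

```csharp
using System;

// Hypothetical drift check; the real library may only reject future timestamps.
public static class ClockDriftSketch
{
    public static bool IsWithinDrift(DateTimeOffset remotePhysicalTime, DateTimeOffset localNow, int maxClockDriftMs)
    {
        // Absolute gap between the remote physical time and the local clock.
        double driftMs = Math.Abs((remotePhysicalTime - localNow).TotalMilliseconds);
        return driftMs <= maxClockDriftMs;
    }
}
```

With the default of 60000, a node whose clock is 30 seconds ahead passes the check, while one that is 61 seconds ahead does not.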

Performance Considerations

  • Index usage: Confirm with EXPLAIN that dequeue queries use idx_scheduler_log_tenant_hlc
  • Chain head caching: The chain_heads table provides O(1) access to the latest link
  • Batch sizes: Adjust dequeue batch size based on workload

Rollback Procedure

To roll back to legacy ordering:

Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: false
      DualWriteMode: false

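The scheduler_log table can be retained for audit purposes or dropped if it is no longer needed. If you roll back after Phase 4, jobs enqueued while DualWriteMode was false may exist only in scheduler_log, so confirm they have been processed before switching reads back to the legacy table.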