git.stella-ops.org/docs/modules/scheduler/hlc-migration-guide.md
2026-01-08 09:06:03 +02:00


# HLC Queue Ordering Migration Guide
This guide describes how to enable HLC (Hybrid Logical Clock) ordering for the Scheduler queue, transitioning from legacy `(priority, created_at)` ordering to HLC-based ordering with cryptographic chain linking.
## Overview
HLC ordering provides:
- **Deterministic global ordering**: Causal consistency across distributed nodes
- **Cryptographic chain linking**: Audit-safe job sequence proofs
- **Reproducible processing**: Same input produces same chain
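
To make the chain-linking idea concrete, the following is a minimal sketch, not the StellaOps implementation. It assumes each link is a SHA-256 hash over the previous link, the job ID, the HLC timestamp string, and the payload hash. This guide does not specify the actual inputs, field order, or encoding used for `scheduler_log`, so `ChainLinkSketch` and `ComputeLink` are hypothetical names.

```csharp
#nullable enable
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch: the real link derivation may use different inputs or encoding.
public static class ChainLinkSketch
{
    // link = SHA-256(prevLink || jobId || tHlc || payloadHash).
    // prevLink is treated as empty for the first entry in a chain.
    // Fields are concatenated without length prefixes, which is fine for
    // illustration but would be ambiguous in a production format.
    public static byte[] ComputeLink(byte[]? prevLink, string jobId, string tHlc, byte[] payloadHash)
    {
        using var hash = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
        hash.AppendData(prevLink ?? Array.Empty<byte>());
        hash.AppendData(Encoding.UTF8.GetBytes(jobId));
        hash.AppendData(Encoding.UTF8.GetBytes(tHlc));
        hash.AppendData(payloadHash);
        return hash.GetHashAndReset();
    }
}
```

Because every link depends on the one before it, replaying the same jobs in the same HLC order reproduces the same chain, and altering any earlier entry changes every later link.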
## Prerequisites
1. PostgreSQL 16+ with the scheduler schema
2. HLC library dependency (`StellaOps.HybridLogicalClock`)
3. Schema migration `002_hlc_queue_chain.sql` applied
## Migration Phases
### Phase 1: Deploy with Dual-Write Mode
Enable dual-write to populate the new `scheduler_log` table without affecting existing operations.
```yaml
# appsettings.yaml or environment configuration
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: false  # Keep using legacy ordering for reads
      DualWriteMode: true       # Write to both legacy and HLC tables
```
```csharp
// Program.cs or Startup.cs
services.AddOptions<SchedulerQueueOptions>()
    .Bind(configuration.GetSection("Scheduler:Queue"))
    .ValidateDataAnnotations()
    .ValidateOnStart();

// Register HLC services
services.AddHlcSchedulerServices();

// Register HLC clock
services.AddSingleton<IHybridLogicalClock>(sp =>
{
    var nodeId = Environment.MachineName; // or use a stable node identifier
    return new HybridLogicalClock(nodeId, TimeProvider.System);
});
```
**Verification:**
- Monitor `scheduler_hlc_enqueues_total` metric for dual-write activity
- Verify `scheduler_log` table is being populated
- Check chain verification passes: `scheduler_chain_verifications_total{result="valid"}`
### Phase 2: Backfill Historical Data (Optional)
If you need historical jobs in the HLC chain, backfill from the existing `scheduler.jobs` table:
```sql
-- Backfill script (run during maintenance window)
-- Note: This creates a new chain starting from historical data
-- The chain will not have valid prev_link values for historical entries
INSERT INTO scheduler.scheduler_log (
    tenant_id, t_hlc, partition_key, job_id, payload_hash, prev_link, link
)
SELECT
    tenant_id,
    -- Generate synthetic HLC timestamps based on created_at
    -- Format: YYYYMMDDHHMMSS-nodeid-counter
    -- id breaks ties so rows with identical created_at get a stable counter
    TO_CHAR(created_at AT TIME ZONE 'UTC', 'YYYYMMDDHH24MISS') || '-backfill-' ||
        LPAD(ROW_NUMBER() OVER (PARTITION BY tenant_id ORDER BY created_at, id)::TEXT, 6, '0'),
    COALESCE(project_id, ''),
    id,
    DECODE(payload_digest, 'hex'),
    NULL,                          -- No chain linking for historical data
    DECODE(payload_digest, 'hex')  -- Use payload_digest as link placeholder
FROM scheduler.jobs
WHERE status IN ('pending', 'scheduled', 'running')
  AND NOT EXISTS (
      SELECT 1 FROM scheduler.scheduler_log sl
      WHERE sl.job_id = jobs.id
  )
ORDER BY tenant_id, created_at, id;
```
### Phase 3: Enable HLC Ordering for Reads
Once dual-write is stable and backfill (if needed) is complete:
```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: true  # Use HLC ordering for reads
      DualWriteMode: true      # Keep dual-write during transition
      VerifyOnDequeue: false   # Optional: enable for extra validation
```
**Verification:**
- Monitor dequeue latency (should be similar to legacy)
- Verify job processing order matches HLC order
- Check chain integrity periodically
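
A periodic integrity check can be approximated as a walk over the log in HLC order. The sketch below is illustrative only and is not the project's chain verifier. `LogEntry` and `ChainOrderCheck` are hypothetical names that mirror the `scheduler_log` columns mentioned in this guide, and the code assumes the `t_hlc` string form sorts correctly under ordinal comparison.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape mirroring the scheduler_log columns named in this guide.
public sealed record LogEntry(string THlc, string JobId, byte[]? PrevLink, byte[] Link);

public static class ChainOrderCheck
{
    // Returns the job IDs whose prev_link does not equal the link of the
    // preceding entry when entries are sorted by HLC timestamp.
    public static List<string> FindBrokenLinks(IEnumerable<LogEntry> entries)
    {
        var broken = new List<string>();
        byte[]? previousLink = null;

        foreach (var entry in entries.OrderBy(e => e.THlc, StringComparer.Ordinal))
        {
            // A missing link is treated as empty, so the first entry passes with a null prev_link.
            var expected = previousLink ?? Array.Empty<byte>();
            var actual = entry.PrevLink ?? Array.Empty<byte>();

            if (!expected.AsSpan().SequenceEqual(actual))
            {
                broken.Add(entry.JobId);
            }

            previousLink = entry.Link;
        }

        return broken;
    }
}
```

Entries created by the Phase 2 backfill carry a `NULL` `prev_link`, so a check like this would flag them; exclude or special-case backfilled rows when interpreting the result.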
### Phase 4: Disable Dual-Write Mode
Once confident in HLC ordering:
```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: true
      DualWriteMode: false    # Stop writing to legacy table
      VerifyOnDequeue: false
```
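
The way the two main flags combine across the phases can be summarized in code. This is a sketch of the behavior inferred from the phase descriptions in this guide, not the scheduler's actual routing logic; `QueueStore` and `QueueRoutingSketch` are hypothetical names.

```csharp
using System;

[Flags]
public enum QueueStore
{
    None = 0,
    Legacy = 1,
    Hlc = 2,
}

// Hypothetical summary of how the flags are assumed to select stores.
public static class QueueRoutingSketch
{
    // Dual-write targets both stores; otherwise writes follow the store used for reads.
    public static QueueStore WriteTargets(bool enableHlcOrdering, bool dualWriteMode)
    {
        if (dualWriteMode)
        {
            return QueueStore.Legacy | QueueStore.Hlc;
        }

        return enableHlcOrdering ? QueueStore.Hlc : QueueStore.Legacy;
    }

    // Reads come from the HLC log only once HLC ordering is enabled.
    public static QueueStore ReadSource(bool enableHlcOrdering) =>
        enableHlcOrdering ? QueueStore.Hlc : QueueStore.Legacy;
}
```

Under these assumptions, Phase 1 writes to both stores and reads from legacy, Phase 3 writes to both and reads from HLC, and Phase 4 uses the HLC log for both.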
## Configuration Reference
### SchedulerHlcOptions
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `EnableHlcOrdering` | bool | false | Use HLC ordering for queue reads |
| `DualWriteMode` | bool | false | Write to both legacy and HLC tables |
| `VerifyOnDequeue` | bool | false | Verify chain integrity on each dequeue |
| `MaxClockDriftMs` | int | 60000 | Maximum allowed clock drift in milliseconds |
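
For readers unfamiliar with how a drift limit interacts with a hybrid logical clock, the sketch below shows the standard HLC update rules with a drift check on received timestamps. It is not the `StellaOps.HybridLogicalClock` API: `HlcTimestamp` and `HlcSketch` are hypothetical, the sketch is not thread-safe, and the assumption that `MaxClockDriftMs` is enforced by rejecting remote timestamps too far ahead of the local wall clock is not confirmed by this guide.

```csharp
using System;

// Hypothetical minimal HLC timestamp: physical milliseconds plus a logical counter.
public readonly record struct HlcTimestamp(long PhysicalMs, int Counter);

public sealed class HlcSketch
{
    private readonly long _maxDriftMs;
    private long _physical;
    private int _counter;

    public HlcSketch(long maxDriftMs) => _maxDriftMs = maxDriftMs;

    // Local or send event: adopt the wall clock if it moved forward,
    // otherwise bump the logical counter.
    public HlcTimestamp Tick(long wallClockMs)
    {
        if (wallClockMs > _physical)
        {
            _physical = wallClockMs;
            _counter = 0;
        }
        else
        {
            _counter++;
        }

        return new HlcTimestamp(_physical, _counter);
    }

    // Receive event: merge a remote timestamp, rejecting it when it is
    // further ahead of the local wall clock than the allowed drift.
    public HlcTimestamp Receive(HlcTimestamp remote, long wallClockMs)
    {
        if (remote.PhysicalMs - wallClockMs > _maxDriftMs)
        {
            throw new InvalidOperationException("Remote timestamp exceeds allowed clock drift.");
        }

        long merged = Math.Max(Math.Max(_physical, remote.PhysicalMs), wallClockMs);

        if (merged == _physical && merged == remote.PhysicalMs)
            _counter = Math.Max(_counter, remote.Counter) + 1;
        else if (merged == _physical)
            _counter++;
        else if (merged == remote.PhysicalMs)
            _counter = remote.Counter + 1;
        else
            _counter = 0;

        _physical = merged;
        return new HlcTimestamp(_physical, _counter);
    }
}
```

With a limit of 60000 ms, a remote timestamp 4 seconds ahead of the local wall clock is merged, while one 90 seconds ahead is rejected.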
## Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `scheduler_hlc_enqueues_total` | Counter | Total HLC enqueue operations |
| `scheduler_hlc_enqueue_deduplicated_total` | Counter | Deduplicated enqueue operations |
| `scheduler_hlc_enqueue_duration_seconds` | Histogram | Enqueue operation duration |
| `scheduler_hlc_dequeues_total` | Counter | Total HLC dequeue operations |
| `scheduler_hlc_dequeued_entries_total` | Counter | Total entries dequeued |
| `scheduler_chain_verifications_total` | Counter | Chain verification operations |
| `scheduler_chain_verification_issues_total` | Counter | Chain verification issues found |
| `scheduler_batch_snapshots_created_total` | Counter | Batch snapshots created |
## Troubleshooting
### Chain Verification Failures
If chain verification reports issues:
1. Check `scheduler_chain_verification_issues_total` for issue count
2. Run the chain verifier for the affected tenant and log each reported issue:
```csharp
var result = await chainVerifier.VerifyAsync(tenantId);
foreach (var issue in result.Issues)
{
    logger.LogError(
        "Chain issue at job {JobId}: {Type} - {Description}",
        issue.JobId, issue.IssueType, issue.Description);
}
```
3. Common causes:
- Database corruption: Restore from backup
- Concurrent writes without proper locking: Check transaction isolation
- Clock drift: Verify `MaxClockDriftMs` setting
### Performance Considerations
- **Index usage**: Ensure `idx_scheduler_log_tenant_hlc` is being used
- **Chain head caching**: The `chain_heads` table provides O(1) access to latest link
- **Batch sizes**: Adjust dequeue batch size based on workload
## Rollback Procedure
To roll back to legacy ordering:
```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: false
      DualWriteMode: false
```
The `scheduler_log` table can be retained for audit purposes or dropped if no longer needed.
## Related Documentation
- [Scheduler Architecture](architecture.md)
- [HLC Library Documentation](../../__Libraries/StellaOps.HybridLogicalClock/README.md)
- [Product Advisory: Audit-safe Job Queue Ordering](../../product/advisories/audit-safe-job-queue-ordering.md)