# HLC Queue Ordering Migration Guide

This guide describes how to enable HLC (Hybrid Logical Clock) ordering for the Scheduler queue, transitioning from legacy `(priority, created_at)` ordering to HLC-based ordering with cryptographic chain linking.

## Overview

HLC ordering provides:
- **Deterministic global ordering**: Causal consistency across distributed nodes
- **Cryptographic chain linking**: Audit-safe job sequence proofs
- **Reproducible processing**: The same input produces the same chain

## Prerequisites

1. PostgreSQL 16+ with the scheduler schema
2. HLC library dependency (`StellaOps.HybridLogicalClock`)
3. Schema migration `002_hlc_queue_chain.sql` applied

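The database prerequisites can be checked from `psql` before deploying. The sketch below assumes that migration `002_hlc_queue_chain.sql` is what creates `scheduler.scheduler_log`; that is inferred from the later phases of this guide, so adjust the lookup if the migration uses different object names.

```sql
-- Prerequisite 1: server_version_num is 160000 or higher on PostgreSQL 16+.
SELECT current_setting('server_version_num')::int >= 160000 AS pg16_or_newer;

-- Prerequisite 3 (assumption: the migration creates scheduler.scheduler_log).
-- to_regclass returns NULL when the relation does not exist.
SELECT to_regclass('scheduler.scheduler_log') IS NOT NULL AS hlc_log_table_exists;
```

Both queries should return `true` before you move on to Phase 1.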

## Migration Phases

### Phase 1: Deploy with Dual-Write Mode

Enable dual-write to populate the new `scheduler_log` table without affecting existing operations.

```yaml
# appsettings.yaml or environment configuration
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: false  # Keep using legacy ordering for reads
      DualWriteMode: true       # Write to both legacy and HLC tables
```

```csharp
// Program.cs or Startup.cs
services.AddOptions<SchedulerQueueOptions>()
    .Bind(configuration.GetSection("Scheduler:Queue"))
    .ValidateDataAnnotations()
    .ValidateOnStart();

// Register HLC services
services.AddHlcSchedulerServices();

// Register HLC clock
services.AddSingleton<IHybridLogicalClock>(sp =>
{
    var nodeId = Environment.MachineName; // or use a stable node identifier
    return new HybridLogicalClock(nodeId, TimeProvider.System);
});
```

**Verification:**
- Monitor the `scheduler_hlc_enqueues_total` metric for dual-write activity
- Verify that the `scheduler_log` table is being populated
- Check that chain verification passes: `scheduler_chain_verifications_total{result="valid"}`

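To confirm that `scheduler_log` is being populated, a per-tenant row count is usually enough. The query below is a minimal sketch that uses only the columns named in this guide's backfill script; it assumes `t_hlc` values sort in HLC order, which this guide implies but does not state.

```sql
-- Row count and most recent HLC timestamp per tenant.
-- Run it twice a few minutes apart: under dual-write, log_entries should
-- grow and latest_t_hlc should advance while jobs are being enqueued.
SELECT
    tenant_id,
    COUNT(*)   AS log_entries,
    MAX(t_hlc) AS latest_t_hlc
FROM scheduler.scheduler_log
GROUP BY tenant_id
ORDER BY tenant_id;
```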

### Phase 2: Backfill Historical Data (Optional)

If you need historical jobs in the HLC chain, backfill from the existing `scheduler.jobs` table:

```sql
-- Backfill script (run during maintenance window)
-- Note: This creates a new chain starting from historical data
-- The chain will not have valid prev_link values for historical entries

INSERT INTO scheduler.scheduler_log (
    tenant_id, t_hlc, partition_key, job_id, payload_hash, prev_link, link
)
SELECT
    tenant_id,
    -- Generate synthetic HLC timestamps based on created_at
    -- Format: YYYYMMDDHHMMSS-nodeid-counter
    TO_CHAR(created_at AT TIME ZONE 'UTC', 'YYYYMMDDHH24MISS') || '-backfill-' ||
    LPAD(ROW_NUMBER() OVER (PARTITION BY tenant_id ORDER BY created_at)::TEXT, 6, '0'),
    COALESCE(project_id, ''),
    id,
    DECODE(payload_digest, 'hex'),
    NULL,  -- No chain linking for historical data
    DECODE(payload_digest, 'hex')  -- Use payload_digest as link placeholder
FROM scheduler.jobs
WHERE status IN ('pending', 'scheduled', 'running')
  AND NOT EXISTS (
      SELECT 1 FROM scheduler.scheduler_log sl
      WHERE sl.job_id = jobs.id
  )
ORDER BY tenant_id, created_at;
```
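Before running the backfill, it can help to preview what it would touch. The sketch below reuses the predicates of the script above, so it introduces no new tables or columns. The second query assumes `payload_digest` is a text column holding plain hex, which is what the `DECODE(payload_digest, 'hex')` call in the script requires.

```sql
-- Dry run: how many jobs per tenant would the backfill insert?
SELECT j.tenant_id, COUNT(*) AS jobs_to_backfill
FROM scheduler.jobs AS j
WHERE j.status IN ('pending', 'scheduled', 'running')
  AND NOT EXISTS (
      SELECT 1 FROM scheduler.scheduler_log AS sl
      WHERE sl.job_id = j.id
  )
GROUP BY j.tenant_id
ORDER BY j.tenant_id;

-- Digests that are NULL or not an even-length hex string.
-- DECODE(..., 'hex') would fail or yield NULL for these rows.
SELECT COUNT(*) AS undecodable_digests
FROM scheduler.jobs
WHERE status IN ('pending', 'scheduled', 'running')
  AND (payload_digest IS NULL
       OR payload_digest !~ '^([0-9a-fA-F]{2})+$');
```

A non-zero `undecodable_digests` count suggests fixing or excluding those rows before the maintenance window.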

### Phase 3: Enable HLC Ordering for Reads

Once dual-write is stable and backfill (if needed) is complete:

```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: true   # Use HLC ordering for reads
      DualWriteMode: true       # Keep dual-write during transition
      VerifyOnDequeue: false    # Optional: enable for extra validation
```

**Verification:**
- Monitor dequeue latency (it should be similar to legacy ordering)
- Verify that the job processing order matches HLC order
- Check chain integrity periodically

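To compare observed processing order with HLC order, you can list the head of one tenant's log sorted by `t_hlc`. This is a sketch only: `'tenant-a'` is a hypothetical placeholder, and the literal may need a cast if `tenant_id` is not a text column.

```sql
-- First 20 log entries for one tenant in HLC order.
-- 'tenant-a' is a placeholder; substitute a real tenant_id value.
SELECT t_hlc, partition_key, job_id
FROM scheduler.scheduler_log
WHERE tenant_id = 'tenant-a'
ORDER BY t_hlc
LIMIT 20;
```

Jobs picked up by workers should appear in the same relative order as the rows returned here.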

### Phase 4: Disable Dual-Write Mode

Once confident in HLC ordering:

```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: true
      DualWriteMode: false      # Stop writing to legacy table
      VerifyOnDequeue: false
```

## Configuration Reference

### SchedulerHlcOptions

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `EnableHlcOrdering` | bool | false | Use HLC ordering for queue reads |
| `DualWriteMode` | bool | false | Write to both legacy and HLC tables |
| `VerifyOnDequeue` | bool | false | Verify chain integrity on each dequeue |
| `MaxClockDriftMs` | int | 60000 | Maximum allowed clock drift in milliseconds |

## Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `scheduler_hlc_enqueues_total` | Counter | Total HLC enqueue operations |
| `scheduler_hlc_enqueue_deduplicated_total` | Counter | Deduplicated enqueue operations |
| `scheduler_hlc_enqueue_duration_seconds` | Histogram | Enqueue operation duration |
| `scheduler_hlc_dequeues_total` | Counter | Total HLC dequeue operations |
| `scheduler_hlc_dequeued_entries_total` | Counter | Total entries dequeued |
| `scheduler_chain_verifications_total` | Counter | Chain verification operations |
| `scheduler_chain_verification_issues_total` | Counter | Chain verification issues found |
| `scheduler_batch_snapshots_created_total` | Counter | Batch snapshots created |

## Troubleshooting

### Chain Verification Failures

If chain verification reports issues:

1. Check `scheduler_chain_verification_issues_total` for the issue count
2. Query the log for specific issues:
   ```csharp
   var result = await chainVerifier.VerifyAsync(tenantId);
   foreach (var issue in result.Issues)
   {
       logger.LogError(
           "Chain issue at job {JobId}: {Type} - {Description}",
           issue.JobId, issue.IssueType, issue.Description);
   }
   ```

3. Common causes:
   - Database corruption: Restore from backup
   - Concurrent writes without proper locking: Check transaction isolation
   - Clock drift: Verify the `MaxClockDriftMs` setting

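Broken links can also be located directly in SQL by comparing each entry's `prev_link` with the `link` of the entry before it. The sketch below makes two assumptions that this guide does not confirm: chains are scoped per `(tenant_id, partition_key)`, and the first entry of a chain stores `NULL` in `prev_link`. It checks linkage only and does not recompute any hashes.

```sql
-- Entries whose prev_link differs from the preceding entry's link.
-- Assumption: one chain per (tenant_id, partition_key), ordered by t_hlc.
WITH ordered AS (
    SELECT
        tenant_id,
        partition_key,
        job_id,
        t_hlc,
        prev_link,
        LAG(link) OVER (
            PARTITION BY tenant_id, partition_key
            ORDER BY t_hlc
        ) AS expected_prev_link
    FROM scheduler.scheduler_log
)
SELECT tenant_id, partition_key, job_id, t_hlc
FROM ordered
WHERE prev_link IS DISTINCT FROM expected_prev_link
ORDER BY tenant_id, partition_key, t_hlc;
```

Rows inserted by the Phase 2 backfill carry `prev_link = NULL` by design, so expect them to show up here; filter them out (for example on `t_hlc LIKE '%-backfill-%'`) if you only care about live entries.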

### Performance Considerations

- **Index usage**: Ensure `idx_scheduler_log_tenant_hlc` is being used
- **Chain head caching**: The `chain_heads` table provides O(1) access to the latest link
- **Batch sizes**: Adjust the dequeue batch size based on workload

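Index usage can be confirmed with `EXPLAIN`. The query below is a hedged sketch: it assumes `idx_scheduler_log_tenant_hlc` covers `(tenant_id, t_hlc)`, which the index name suggests but this guide does not spell out, and `'tenant-a'` is a hypothetical placeholder.

```sql
-- Inspect the plan for a typical HLC-ordered read.
-- 'tenant-a' is a placeholder; substitute a real tenant_id value.
EXPLAIN (ANALYZE, BUFFERS)
SELECT job_id, t_hlc
FROM scheduler.scheduler_log
WHERE tenant_id = 'tenant-a'
ORDER BY t_hlc
LIMIT 100;
```

A plan that reports an index scan using `idx_scheduler_log_tenant_hlc` with no separate `Sort` node indicates that the index serves both the filter and the ordering. A sequential scan followed by a sort suggests that the index is missing or that table statistics are stale.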

## Rollback Procedure

To roll back to legacy ordering:

```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: false
      DualWriteMode: false
```

The `scheduler_log` table can be retained for audit purposes or dropped if no longer needed.

## Related Documentation

- [Scheduler Architecture](architecture.md)
- [HLC Library Documentation](../../__Libraries/StellaOps.HybridLogicalClock/README.md)
- [Product Advisory: Audit-safe Job Queue Ordering](../../product/advisories/audit-safe-job-queue-ordering.md)