save progress
190
docs/modules/scheduler/hlc-migration-guide.md
Normal file
@@ -0,0 +1,190 @@
# HLC Queue Ordering Migration Guide

This guide describes how to enable HLC (Hybrid Logical Clock) ordering for the Scheduler queue, transitioning from legacy `(priority, created_at)` ordering to HLC-based ordering with cryptographic chain linking.

## Overview

HLC ordering provides:
- **Deterministic global ordering**: Causal consistency across distributed nodes
- **Cryptographic chain linking**: Audit-safe job sequence proofs
- **Reproducible processing**: The same input produces the same chain

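To make the chain-linking idea concrete, here is a minimal sketch. It is not the StellaOps implementation. It assumes each entry's link is the SHA-256 hash of the previous link, the job ID, the HLC timestamp, and the payload hash. The class name `ChainLinkSketch`, the field order, and the all-zero link for the first entry are illustrative assumptions.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration only; the real link derivation may differ.
public static class ChainLinkSketch
{
    // Assumed layout: link = SHA-256(prevLink || jobId || tHlc || payloadHash).
    // A production format would also need unambiguous field framing
    // (for example length prefixes) so that field boundaries cannot shift.
    public static byte[] ComputeLink(byte[]? prevLink, string jobId, string tHlc, byte[] payloadHash)
    {
        using var hash = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
        hash.AppendData(prevLink ?? new byte[32]);      // assumed all-zero link for the first entry
        hash.AppendData(Encoding.UTF8.GetBytes(jobId));
        hash.AppendData(Encoding.UTF8.GetBytes(tHlc));
        hash.AppendData(payloadHash);
        return hash.GetHashAndReset();                  // 32-byte SHA-256 digest
    }
}
```

Because every link depends on the previous one, changing or removing an entry changes all later links, and that property is what makes the sequence auditable. Identical inputs always yield the same link, which matches the reproducibility property listed above.
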
## Prerequisites

1. PostgreSQL 16+ with the scheduler schema
2. HLC library dependency (`StellaOps.HybridLogicalClock`)
3. Schema migration `002_hlc_queue_chain.sql` applied

## Migration Phases

### Phase 1: Deploy with Dual-Write Mode

Enable dual-write to populate the new `scheduler_log` table without affecting existing operations.

```yaml
# appsettings.yaml or environment configuration
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: false  # Keep using legacy ordering for reads
      DualWriteMode: true       # Write to both legacy and HLC tables
```

```csharp
// Program.cs or Startup.cs
services.AddOptions<SchedulerQueueOptions>()
    .Bind(configuration.GetSection("Scheduler:Queue"))
    .ValidateDataAnnotations()
    .ValidateOnStart();

// Register HLC services
services.AddHlcSchedulerServices();

// Register HLC clock
services.AddSingleton<IHybridLogicalClock>(sp =>
{
    var nodeId = Environment.MachineName; // or use a stable node identifier
    return new HybridLogicalClock(nodeId, TimeProvider.System);
});
```

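The registration above falls back to `Environment.MachineName`, which can change between restarts in containerized deployments. One possible way to obtain a stable node identifier is sketched below. The `NodeIdSketch` class and the `SCHEDULER_NODE_ID` variable name are hypothetical and not part of the library.

```csharp
using System;

// Hypothetical helper: prefers an explicitly configured node ID over the machine name.
public static class NodeIdSketch
{
    // "SCHEDULER_NODE_ID" is an assumed variable name chosen for this example.
    public const string VariableName = "SCHEDULER_NODE_ID";

    // getEnv is injected so the lookup can be exercised without touching the process environment.
    public static string Resolve(Func<string, string?> getEnv, string machineName)
    {
        var configured = getEnv(VariableName);
        return string.IsNullOrWhiteSpace(configured) ? machineName : configured.Trim();
    }
}
```

In the clock registration this could be called as `NodeIdSketch.Resolve(Environment.GetEnvironmentVariable, Environment.MachineName)`.
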
**Verification:**
- Monitor `scheduler_hlc_enqueues_total` metric for dual-write activity
- Verify `scheduler_log` table is being populated
- Check chain verification passes: `scheduler_chain_verifications_total{result="valid"}`

### Phase 2: Backfill Historical Data (Optional)

If you need historical jobs in the HLC chain, backfill from the existing `scheduler.jobs` table:

```sql
-- Backfill script (run during maintenance window)
-- Note: This creates a new chain starting from historical data
-- The chain will not have valid prev_link values for historical entries

INSERT INTO scheduler.scheduler_log (
    tenant_id, t_hlc, partition_key, job_id, payload_hash, prev_link, link
)
SELECT
    tenant_id,
    -- Generate synthetic HLC timestamps based on created_at
    -- Format: YYYYMMDDHHMMSS-nodeid-counter
    -- id breaks ties between rows with equal created_at so the numbering is deterministic
    TO_CHAR(created_at AT TIME ZONE 'UTC', 'YYYYMMDDHH24MISS') || '-backfill-' ||
        LPAD(ROW_NUMBER() OVER (PARTITION BY tenant_id ORDER BY created_at, id)::TEXT, 6, '0'),
    COALESCE(project_id, ''),
    id,
    DECODE(payload_digest, 'hex'),
    NULL,  -- No chain linking for historical data
    DECODE(payload_digest, 'hex')  -- Use payload_digest as link placeholder
FROM scheduler.jobs
WHERE status IN ('pending', 'scheduled', 'running')
  AND NOT EXISTS (
      SELECT 1 FROM scheduler.scheduler_log sl
      WHERE sl.job_id = jobs.id
  )
ORDER BY tenant_id, created_at;
```

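The synthetic timestamp produced by the backfill script can be reproduced in application code, for example to predict or validate backfilled values. The sketch below mirrors the SQL expression; `BackfillHlcSketch` is a hypothetical name. PostgreSQL's `LPAD` truncates values longer than the target width, so a tenant with more than 999,999 backfilled rows would get colliding counters from the SQL above; this sketch pads but never truncates.

```csharp
using System;
using System.Globalization;

// Hypothetical helper mirroring the SQL:
// TO_CHAR(created_at AT TIME ZONE 'UTC', 'YYYYMMDDHH24MISS') || '-backfill-' || LPAD(row_number, 6, '0')
public static class BackfillHlcSketch
{
    public static string Format(DateTimeOffset createdAt, long rowNumber)
    {
        // Convert to UTC first so the result does not depend on the input offset.
        var physical = createdAt.UtcDateTime.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture);
        // PadLeft only pads; unlike LPAD it leaves longer values intact.
        var counter = rowNumber.ToString(CultureInfo.InvariantCulture).PadLeft(6, '0');
        return physical + "-backfill-" + counter;
    }
}
```
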
### Phase 3: Enable HLC Ordering for Reads

Once dual-write is stable and backfill (if needed) is complete:

```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: true   # Use HLC ordering for reads
      DualWriteMode: true       # Keep dual-write during transition
      VerifyOnDequeue: false    # Optional: enable for extra validation
```

**Verification:**
- Monitor dequeue latency (should be similar to legacy)
- Verify job processing order matches HLC order
- Check chain integrity periodically

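One way to spot-check that processing order matches HLC order is to collect the `t_hlc` values of dequeued jobs in processing order and confirm they never decrease. The sketch below assumes the string form of `t_hlc` sorts lexicographically in HLC order, which this guide does not confirm; `HlcOrderCheckSketch` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical check: assumes t_hlc strings compare ordinally in HLC order.
public static class HlcOrderCheckSketch
{
    // Returns the index of the first entry that sorts before its predecessor, or -1 if none does.
    public static int FirstOutOfOrder(IReadOnlyList<string> dequeuedHlc)
    {
        for (var i = 1; i < dequeuedHlc.Count; i++)
        {
            if (string.CompareOrdinal(dequeuedHlc[i - 1], dequeuedHlc[i]) > 0)
            {
                return i;
            }
        }
        return -1;
    }
}
```
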
### Phase 4: Disable Dual-Write Mode

Once confident in HLC ordering:

```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: true
      DualWriteMode: false      # Stop writing to legacy table
      VerifyOnDequeue: false
```

## Configuration Reference

### SchedulerHlcOptions

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `EnableHlcOrdering` | bool | false | Use HLC ordering for queue reads |
| `DualWriteMode` | bool | false | Write to both legacy and HLC tables |
| `VerifyOnDequeue` | bool | false | Verify chain integrity on each dequeue |
| `MaxClockDriftMs` | int | 60000 | Maximum allowed clock drift in milliseconds |

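The two boolean flags together determine which migration phase a deployment is in. The mapping used throughout this guide can be summarized in a small sketch; the `MigrationPhase` enum and `MigrationPhaseSketch` class are illustrative names, not library types.

```csharp
// Hypothetical names; the flag combinations come from the phase configurations in this guide.
public enum MigrationPhase
{
    Legacy,                 // both flags false: defaults, also the rollback configuration
    DualWrite,              // Phase 1: legacy reads, writes to both tables
    HlcReadsWithDualWrite,  // Phase 3: HLC reads, writes to both tables
    HlcOnly                 // Phase 4: HLC reads, legacy writes stopped
}

public static class MigrationPhaseSketch
{
    public static MigrationPhase Describe(bool enableHlcOrdering, bool dualWriteMode) =>
        (enableHlcOrdering, dualWriteMode) switch
        {
            (false, false) => MigrationPhase.Legacy,
            (false, true) => MigrationPhase.DualWrite,
            (true, true) => MigrationPhase.HlcReadsWithDualWrite,
            (true, false) => MigrationPhase.HlcOnly,
        };
}
```

Logging the resolved phase at startup is one way to confirm that every node is running the configuration you expect.
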
## Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `scheduler_hlc_enqueues_total` | Counter | Total HLC enqueue operations |
| `scheduler_hlc_enqueue_deduplicated_total` | Counter | Deduplicated enqueue operations |
| `scheduler_hlc_enqueue_duration_seconds` | Histogram | Enqueue operation duration |
| `scheduler_hlc_dequeues_total` | Counter | Total HLC dequeue operations |
| `scheduler_hlc_dequeued_entries_total` | Counter | Total entries dequeued |
| `scheduler_chain_verifications_total` | Counter | Chain verification operations |
| `scheduler_chain_verification_issues_total` | Counter | Chain verification issues found |
| `scheduler_batch_snapshots_created_total` | Counter | Batch snapshots created |

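For readers who want to see how instruments with these names could be declared, the sketch below uses the standard `System.Diagnostics.Metrics` API. This guide does not say how the Scheduler emits its metrics, so treat this as an assumption: the meter name, the class name, and the `"invalid"` tag value are placeholders. Only the metric names and the `result="valid"` label come from this guide.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical declaration of three of the instruments listed in the table.
public static class SchedulerHlcMetricsSketch
{
    // Placeholder meter name; the real one is not documented here.
    private static readonly Meter Meter = new("StellaOps.Scheduler.Hlc.Sketch");

    public static readonly Counter<long> Enqueues =
        Meter.CreateCounter<long>("scheduler_hlc_enqueues_total", description: "Total HLC enqueue operations");

    public static readonly Histogram<double> EnqueueDuration =
        Meter.CreateHistogram<double>("scheduler_hlc_enqueue_duration_seconds", unit: "s");

    public static readonly Counter<long> ChainVerifications =
        Meter.CreateCounter<long>("scheduler_chain_verifications_total");

    // Tags the measurement with result="valid" or (assumed) result="invalid".
    public static void RecordVerification(bool valid) =>
        ChainVerifications.Add(1, new KeyValuePair<string, object?>("result", valid ? "valid" : "invalid"));
}
```
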
## Troubleshooting

### Chain Verification Failures

If chain verification reports issues:

1. Check `scheduler_chain_verification_issues_total` for issue count
2. Query the log for specific issues:
   ```csharp
   var result = await chainVerifier.VerifyAsync(tenantId);
   foreach (var issue in result.Issues)
   {
       logger.LogError(
           "Chain issue at job {JobId}: {Type} - {Description}",
           issue.JobId, issue.IssueType, issue.Description);
   }
   ```

3. Common causes:
   - Database corruption: Restore from backup
   - Concurrent writes without proper locking: Check transaction isolation
   - Clock drift: Verify `MaxClockDriftMs` setting

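To illustrate the clock-drift cause, the sketch below shows the kind of check a drift limit typically implies: a timestamp received from another node is rejected when its physical component is further ahead of the local clock than `MaxClockDriftMs`. This is a general HLC pattern and an assumption about how the setting is applied; `ClockDriftSketch` is a hypothetical name.

```csharp
using System;

// Hypothetical check; the library's actual drift handling may differ.
public static class ClockDriftSketch
{
    // Returns true when the remote physical time is at most maxClockDriftMs ahead of local time.
    // Timestamps in the past are always accepted in this sketch.
    public static bool IsWithinDrift(
        DateTimeOffset remotePhysicalTime,
        DateTimeOffset localNow,
        int maxClockDriftMs = 60_000)   // default taken from the configuration reference
    {
        var ahead = remotePhysicalTime - localNow;
        return ahead <= TimeSpan.FromMilliseconds(maxClockDriftMs);
    }
}
```

If verification issues correlate with nodes whose clocks run ahead, checking time synchronization on those nodes is usually a better first step than raising the limit.
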
### Performance Considerations

- **Index usage**: Confirm (for example with `EXPLAIN`) that queue reads use `idx_scheduler_log_tenant_hlc`
- **Chain head caching**: The `chain_heads` table provides O(1) access to the latest link
- **Batch sizes**: Adjust the dequeue batch size based on workload

## Rollback Procedure

To roll back to legacy ordering:

```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: false
      DualWriteMode: false
```

The `scheduler_log` table can be retained for audit purposes or dropped if it is no longer needed.

## Related Documentation

- [Scheduler Architecture](architecture.md)
- [HLC Library Documentation](../../__Libraries/StellaOps.HybridLogicalClock/README.md)
- [Product Advisory: Audit-safe Job Queue Ordering](../../product-advisories/audit-safe-job-queue-ordering.md)