# HLC Troubleshooting Runbook

Version: 1.0.0
Sprint: SPRINT_20260105_002_004_BE
Last Updated: 2026-01-07

This runbook covers troubleshooting procedures for Hybrid Logical Clock (HLC) based queue ordering in StellaOps.

## Table of Contents

- Chain Verification Failure
- Clock Skew Issues
- Time Offset Drift
- Merge Conflicts
- Slow Air-Gap Sync
- No Enqueues
- Batch Snapshot Failures
- Duplicate Node ID

## 1. Chain Verification Failure

### Symptoms

- Alert: `HlcChainVerificationFailure`
- Metric: `scheduler_chain_verification_failures_total` increasing
- Log: `Chain verification failed: expected {expected}, got {actual}`

### Severity

Critical - Indicates potential data tampering or corruption.

### Investigation Steps

1. Identify the affected chain segment:

   ```sql
   SELECT job_id, t_hlc,
          encode(prev_link, 'hex') AS prev_link,
          encode(link, 'hex') AS link,
          created_at
   FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant_id>'
   ORDER BY t_hlc DESC
   LIMIT 100;
   ```

2. Find the break point (a scripted cross-check follows this list):

   ```sql
   WITH chain AS (
       SELECT job_id, t_hlc, prev_link, link,
              LAG(link) OVER (ORDER BY t_hlc) AS expected_prev
       FROM scheduler.scheduler_log
       WHERE tenant_id = '<tenant_id>'
   )
   SELECT * FROM chain
   WHERE prev_link IS DISTINCT FROM expected_prev
   ORDER BY t_hlc;
   ```

3. Check for unauthorized modifications:
   - Review database audit logs
   - Check for direct SQL updates bypassing the application

4. Verify chain head consistency:

   ```sql
   SELECT * FROM scheduler.chain_heads WHERE tenant_id = '<tenant_id>';
   ```
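
The SQL above finds where `prev_link` stops matching the previous row's `link`. For a scripted cross-check that also recomputes each link, a minimal sketch follows. The link derivation used here (SHA-256 over `prev_link`, `job_id`, and `t_hlc`) is an assumption for illustration; confirm the actual chain-hash definition in the scheduler source before acting on reported mismatches.

```python
import hashlib
import psycopg2  # any PostgreSQL driver works; psycopg2 is only an example

def expected_link(prev_link: bytes, job_id: str, t_hlc: str) -> bytes:
    # Assumed rule for illustration: link = SHA-256(prev_link || job_id || t_hlc).
    return hashlib.sha256(prev_link + job_id.encode() + t_hlc.encode()).digest()

def verify_chain(dsn: str, tenant_id: str) -> None:
    """Walk the tenant's log in HLC order and report chain breaks."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT job_id, t_hlc::text, prev_link, link "
            "FROM scheduler.scheduler_log WHERE tenant_id = %s ORDER BY t_hlc",
            (tenant_id,),
        )
        prev = None
        for job_id, t_hlc, prev_link, link in cur:
            prev_link = bytes(prev_link) if prev_link is not None else b""
            link = bytes(link)
            if prev is not None and prev_link != prev:
                print(f"prev_link break at job {job_id} (t_hlc={t_hlc})")
            if expected_link(prev_link, job_id, t_hlc) != link:
                print(f"link mismatch at job {job_id} (t_hlc={t_hlc})")
            prev = link
```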

### Resolution

If corruption is isolated:

1. Mark affected jobs for re-processing
2. Rebuild chain from the last valid point
3. Update chain head

If tampering is suspected:

1. Escalate to Security team immediately
2. Preserve all logs and database state
3. Initiate incident response procedure

## 2. Clock Skew Issues

### Symptoms

- Alert: `HlcClockSkewExceedsTolerance`
- Metric: `hlc_clock_skew_rejections_total` increasing
- Log: `Clock skew exceeds tolerance: {skew}ms > {tolerance}ms`

### Severity

Critical - Can cause job ordering inconsistencies.

### Investigation Steps

1. Check NTP synchronization:

   ```bash
   # On affected node
   timedatectl status
   ntpq -p
   chronyc tracking  # if using chrony
   ```

2. Verify time sources:

   ```bash
   ntpq -pn
   ```

3. Check for leap second issues:

   ```bash
   dmesg | grep -i leap
   ```

4. Compare with other nodes (a latency-adjusted sketch follows this list):

   ```bash
   for node in node-1 node-2 node-3; do
     echo "$node: $(ssh $node date +%s.%N)"
   done
   ```
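
The loop above ignores SSH latency, so small offsets are inconclusive. As a rough cross-check, the sketch below halves the round-trip time to estimate each node's offset from the local clock. It assumes passwordless SSH and placeholder hostnames, and it gives an estimate only, not an NTP-grade measurement.

```python
import subprocess
import time

NODES = ["node-1", "node-2", "node-3"]  # replace with real hostnames

def remote_offset(node: str) -> float:
    """Return remote_clock - local_clock in seconds, crudely latency-adjusted."""
    t0 = time.time()
    out = subprocess.run(
        ["ssh", node, "date", "+%s.%N"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    t1 = time.time()
    remote = float(out.stdout.strip())
    midpoint = (t0 + t1) / 2  # assume the remote read happened mid round-trip
    return remote - midpoint

if __name__ == "__main__":
    for node in NODES:
        print(f"{node}: offset {remote_offset(node) * 1000:+.1f} ms")
```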

### Resolution

1. Restart NTP client:

   ```bash
   sudo systemctl restart chronyd  # or ntpd
   ```

2. Force time sync:

   ```bash
   sudo chronyc makestep
   ```

3. Temporarily increase tolerance (emergency only):

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         MaxClockSkewMs: 60000  # Increase from default 5000
   ```

4. Restart the affected service to reset HLC state.

## 3. Time Offset Drift

### Symptoms

- Alert: `HlcPhysicalTimeOffset`
- Metric: `hlc_physical_time_offset_seconds` > 0.5

### Severity

Warning - May cause timestamp anomalies in diagnostics.

### Investigation Steps

1. Check the current offset:

   ```promql
   hlc_physical_time_offset_seconds{node_id="<node>"}
   ```

2. Review HLC state:

   ```bash
   curl -s http://localhost:5000/health/hlc | jq
   ```

3. Check for a high logical counter: if the logical counter is very high, it indicates frequent same-millisecond events.

### Resolution

Usually self-correcting as the wall clock advances (the sketch below shows why). If the offset persists:

- Review job submission rate
- Consider horizontal scaling to distribute load
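
For context on why sustained same-millisecond traffic shows up as both a high logical counter and a growing physical-time offset, here is a minimal sketch of an HLC local-event rule with a bounded counter. It is illustrative only: the class, the counter bound, and the overflow behaviour are assumptions, not the StellaOps implementation, and the mapping to `hlc_physical_time_offset_seconds` is approximate.

```python
import time

MAX_LOGICAL = 0xFFFF  # assumed bound; the real bound is implementation-specific

class HybridLogicalClock:
    """Sketch of an HLC local-event ('tick') rule; not the StellaOps code."""

    def __init__(self) -> None:
        self.physical_ms = 0  # highest millisecond observed so far
        self.logical = 0      # disambiguates events within that millisecond

    def tick(self) -> tuple:
        wall = int(time.time() * 1000)
        if wall > self.physical_ms:
            # Wall clock moved ahead: adopt it and reset the counter.
            self.physical_ms, self.logical = wall, 0
        elif self.logical < MAX_LOGICAL:
            # Same millisecond (or wall clock behind): bump the counter.
            self.logical += 1
        else:
            # Counter saturated: borrow from the physical part. Sustained
            # same-millisecond traffic (or a backward clock step) can push
            # physical_ms ahead of the wall clock - roughly the gap that
            # hlc_physical_time_offset_seconds reports.
            self.physical_ms, self.logical = self.physical_ms + 1, 0
        return self.physical_ms, self.logical
```

This is why the condition usually clears on its own once the wall clock catches up, and why reducing the submission rate or spreading load across nodes helps when it does not.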

## 4. Merge Conflicts

### Symptoms

- Alert: `HlcMergeConflictRateHigh`
- Metric: `airgap_merge_conflicts_total` increasing
- Log: `Merge conflict: job {jobId} has conflicting payloads`

### Severity

Warning - May indicate duplicate job submissions or clock issues on offline nodes.

### Investigation Steps

1. Identify conflict types:

   ```promql
   sum by (conflict_type) (airgap_merge_conflicts_total)
   ```

2. Review merge logs:

   ```bash
   grep "merge conflict" /var/log/stellaops/scheduler.log | tail -100
   ```

3. Check offline node clocks:
   - Were offline nodes synchronized before disconnection?
   - How long were nodes offline?

### Resolution

- For duplicate jobs: Use idempotency keys to prevent duplicates (see the merge sketch after this list)
- For payload conflicts: Review job submission logic
- For ordering conflicts: Verify NTP on all nodes before disconnection
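
To make the conflict categories concrete, here is a minimal merge sketch. The entry layout, the HLC tuple shape, and the conflict rule (same `job_id`, different payload) are illustrative assumptions, not the StellaOps merge implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEntry:
    job_id: str
    t_hlc: tuple      # e.g. (physical_ms, logical, node_id) - assumed layout
    payload: bytes

def merge(local: list, imported: list) -> tuple:
    """Deterministically merge two logs by HLC order and report payload conflicts."""
    conflicts = []
    by_job = {e.job_id: e for e in local}
    merged = list(local)
    for entry in imported:
        seen = by_job.get(entry.job_id)
        if seen is None:
            by_job[entry.job_id] = entry
            merged.append(entry)
        elif seen.payload != entry.payload:
            # Same job_id, different payload: this is what surfaces as
            # "Merge conflict: job {jobId} has conflicting payloads".
            conflicts.append(entry.job_id)
        # Exact duplicates are dropped silently (idempotent replay).
    merged.sort(key=lambda e: e.t_hlc)  # HLC order gives the replay total order
    return merged, conflicts
```

An idempotency key amounts to keeping `job_id` stable across retries of the same submission, so re-imports fall into the "exact duplicate" branch instead of the conflict branch.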

## 5. Slow Air-Gap Sync

### Symptoms

- Alert: `HlcSyncDurationHigh`
- Metric: `airgap_sync_duration_seconds` p95 > 30s

### Severity

Warning - Delays job processing.

### Investigation Steps

1. Check bundle sizes:

   ```promql
   histogram_quantile(0.95, airgap_bundle_size_bytes_bucket)
   ```

2. Check database performance:

   ```sql
   SELECT * FROM pg_stat_activity
   WHERE state = 'active' AND query LIKE '%scheduler_log%';
   ```

3. Review index usage:

   ```sql
   EXPLAIN ANALYZE
   SELECT * FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant>'
   ORDER BY t_hlc
   LIMIT 1000;
   ```

### Resolution

- Chunk large bundles: Split bundles > 10K entries (see the sketch after this list)
- Optimize database: Ensure indexes are used
- Increase resources: Scale up database if needed
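
A minimal illustration of bundle chunking (the 10K threshold comes from the guidance above; the bundle structure and the `write_bundle` helper are placeholders, not the StellaOps bundle format):

```python
MAX_ENTRIES_PER_CHUNK = 10_000  # threshold suggested above

def chunk_entries(entries, size=MAX_ENTRIES_PER_CHUNK):
    """Yield successive chunks of at most `size` entries, preserving order."""
    chunk = []
    for entry in entries:
        chunk.append(entry)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Usage: export each chunk as its own bundle so that no single import
# transaction has to process more than MAX_ENTRIES_PER_CHUNK entries.
# for i, chunk in enumerate(chunk_entries(all_entries)):
#     write_bundle(f"bundle-{i:04d}", chunk)  # write_bundle is hypothetical
```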

## 6. No Enqueues

### Symptoms

- Alert: `HlcEnqueueRateZero`
- No jobs appearing in the HLC queue

### Severity

Info - May be expected or indicate misconfiguration.

### Investigation Steps

1. Check if HLC ordering is enabled:

   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc'
   ```

2. Verify the service is receiving jobs:

   ```promql
   rate(scheduler_jobs_received_total[5m])
   ```

3. Check for errors:

   ```bash
   grep -i "hlc\|enqueue" /var/log/stellaops/scheduler.log | grep -i error
   ```

### Resolution

1. If HLC should be enabled:

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         EnableHlcOrdering: true
   ```

2. If dual-write mode is needed:

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         DualWriteMode: true
   ```
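
After changing either setting and restarting, re-read the config endpoint used above to confirm the running value. The sketch below assumes the endpoint returns camelCase JSON matching the `jq` path earlier; adjust the field names to whatever the response actually contains.

```python
import json
import urllib.request

# Same endpoint as the curl command in the investigation steps.
with urllib.request.urlopen("http://localhost:5000/config") as resp:
    cfg = json.load(resp)

# Field casing below is an assumption; verify against the real response.
hlc = cfg.get("scheduler", {}).get("queue", {}).get("hlc", {})
print(json.dumps(hlc, indent=2))
if not hlc.get("enableHlcOrdering"):
    print("WARNING: HLC ordering still appears to be disabled")
```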

## 7. Batch Snapshot Failures

### Symptoms

- Alert: `HlcBatchSnapshotFailures`
- Missing DSSE-signed batch proofs

### Severity

Warning - Audit proofs may be incomplete.

### Investigation Steps

1. Check the signing key:

   ```bash
   stella signer status
   ```

2. Verify DSSE configuration:

   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc.batchSigning'
   ```

3. Check database connectivity:

   ```sql
   SELECT 1; -- Simple connectivity test
   ```

### Resolution

- Refresh signing credentials
- Check certificate expiry
- Verify database permissions for the batch_snapshots table
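
If snapshot rows exist but the proofs are suspect, a structural check of the stored envelopes can narrow things down. The sketch below assumes each batch proof is a standard DSSE JSON envelope (`payloadType`, base64 `payload`, `signatures`); it recomputes the DSSE pre-authentication encoding but does not verify signatures, which requires the signing key material.

```python
import base64
import json

def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE v1 pre-authentication encoding: the bytes each signature covers."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %s %d %s" % (len(t), t, len(payload), payload)

def inspect_envelope(raw: str) -> None:
    """Decode a DSSE envelope and report what a verifier would sign over."""
    env = json.loads(raw)
    payload = base64.b64decode(env["payload"])
    message = pae(env["payloadType"], payload)
    print(f"payloadType : {env['payloadType']}")
    print(f"payload     : {len(payload)} bytes")
    print(f"signatures  : {[s.get('keyid', '<no keyid>') for s in env['signatures']]}")
    print(f"PAE message : {len(message)} bytes (verify each signature over these bytes)")
```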

## 8. Duplicate Node ID

### Symptoms

- Alert: `HlcDuplicateNodeId`
- Multiple instances with the same node_id

### Severity

Critical - Will cause chain corruption.

### Investigation Steps

1. Identify affected instances (see the API sketch after this list):

   ```promql
   group by (node_id, instance) (hlc_ticks_total)
   ```

2. Check node ID configuration:

   ```bash
   # On each instance
   grep -r "NodeId" /etc/stellaops/
   ```
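
To pull the duplicate check directly from the Prometheus API instead of the console, a small sketch (the Prometheus URL is a placeholder for your environment):

```python
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus:9090"  # placeholder; point at your Prometheus
QUERY = "count by (node_id) (group by (node_id, instance) (hlc_ticks_total)) > 1"

url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)["data"]["result"]

if not result:
    print("No node_id is reported by more than one instance")
for series in result:
    node_id = series["metric"]["node_id"]
    count = series["value"][1]
    print(f"node_id {node_id} is reported by {count} instances")
```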

### Resolution

Immediate action required:

1. Stop one of the duplicate instances
2. Reconfigure with a unique node ID
3. Restart and verify
4. Check chain integrity for the affected time period

## Escalation Matrix

| Issue | First Responder | Escalation L2 | Escalation L3 |
|---|---|---|---|
| Chain verification failure | On-call SRE | Scheduler team | Security team |
| Clock skew | On-call SRE | Infrastructure | Architecture |
| Merge conflicts | On-call SRE | Scheduler team | - |
| Performance issues | On-call SRE | Database team | - |
| Duplicate node ID | On-call SRE | Scheduler team | - |

## Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-01-07 | Agent | Initial release |