# HLC Troubleshooting Runbook

> **Version**: 1.0.0
> **Sprint**: SPRINT_20260105_002_004_BE
> **Last Updated**: 2026-01-07

This runbook covers troubleshooting procedures for Hybrid Logical Clock (HLC) based queue ordering in StellaOps.

---

## Table of Contents

1. [Chain Verification Failure](#1-chain-verification-failure)
2. [Clock Skew Issues](#2-clock-skew-issues)
3. [Time Offset Drift](#3-time-offset-drift)
4. [Merge Conflicts](#4-merge-conflicts)
5. [Slow Air-Gap Sync](#5-slow-air-gap-sync)
6. [No Enqueues](#6-no-enqueues)
7. [Batch Snapshot Failures](#7-batch-snapshot-failures)
8. [Duplicate Node ID](#8-duplicate-node-id)

---

## 1. Chain Verification Failure

### Symptoms

- Alert: `HlcChainVerificationFailure`
- Metric: `scheduler_chain_verification_failures_total` increasing
- Log: `Chain verification failed: expected {expected}, got {actual}`

### Severity

**Critical** - Indicates potential data tampering or corruption.

### Investigation Steps

1. **Identify the affected chain segment**:

   ```sql
   SELECT job_id, t_hlc,
          encode(prev_link, 'hex') as prev_link,
          encode(link, 'hex') as link,
          created_at
   FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant-id>'
   ORDER BY t_hlc DESC
   LIMIT 100;
   ```

2. **Find the break point**:

   ```sql
   WITH chain AS (
       SELECT job_id, t_hlc, prev_link, link,
              LAG(link) OVER (ORDER BY t_hlc) as expected_prev
       FROM scheduler.scheduler_log
       WHERE tenant_id = '<tenant-id>'
   )
   SELECT * FROM chain
   WHERE prev_link IS DISTINCT FROM expected_prev
   ORDER BY t_hlc;
   ```

3. **Check for unauthorized modifications**:
   - Review database audit logs
   - Check for direct SQL updates bypassing the application

4. **Verify chain head consistency**:

   ```sql
   SELECT * FROM scheduler.chain_heads WHERE tenant_id = '<tenant-id>';
   ```

### Resolution

**If corruption is isolated**:

1. Mark affected jobs for re-processing
2. Rebuild the chain from the last valid point
3. Update the chain head

**If tampering is suspected**:

1. Escalate to the Security team immediately
2. Preserve all logs and database state
3. Initiate the incident response procedure

---

## 2. Clock Skew Issues

### Symptoms

- Alert: `HlcClockSkewExceedsTolerance`
- Metric: `hlc_clock_skew_rejections_total` increasing
- Log: `Clock skew exceeds tolerance: {skew}ms > {tolerance}ms`

### Severity

**Critical** - Can cause job ordering inconsistencies.

### Investigation Steps

1. **Check NTP synchronization**:

   ```bash
   # On the affected node
   timedatectl status
   ntpq -p
   chronyc tracking  # if using chrony
   ```

2. **Verify time sources**:

   ```bash
   ntpq -pn
   ```

3. **Check for leap second issues**:

   ```bash
   dmesg | grep -i leap
   ```

4. **Compare with other nodes**:

   ```bash
   for node in node-1 node-2 node-3; do
       echo "$node: $(ssh $node date +%s.%N)"
   done
   ```

### Resolution

1. **Restart the NTP client**:

   ```bash
   sudo systemctl restart chronyd  # or ntpd
   ```

2. **Force a time sync**:

   ```bash
   sudo chronyc makestep
   ```

3. **Temporarily increase the tolerance** (emergency only):

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         MaxClockSkewMs: 60000  # Increase from default 5000
   ```

4. **Restart the affected service** to reset HLC state.

---

## 3. Time Offset Drift

### Symptoms

- Alert: `HlcPhysicalTimeOffset`
- Metric: `hlc_physical_time_offset_seconds` > 0.5

### Severity

**Warning** - May cause timestamp anomalies in diagnostics.

### Investigation Steps

1. **Check the current offset**:

   ```promql
   hlc_physical_time_offset_seconds{node_id="<node-id>"}
   ```

2. **Review HLC state**:

   ```bash
   curl -s http://localhost:5000/health/hlc | jq
   ```

3. **Check for a high logical counter**: a very high logical counter indicates frequent same-millisecond events (see the tick-rule sketch below).
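To see why the logical counter climbs, recall the conventional HLC tick rule: the physical component tracks the wall clock, and the logical counter advances only while successive events land in the same physical millisecond. The sketch below illustrates that rule only; the class and field names are assumptions, not the StellaOps implementation.

```python
import time

class HybridLogicalClock:
    """Minimal HLC tick rule for local events (illustrative sketch only)."""

    def __init__(self) -> None:
        self.physical_ms = 0  # highest physical time observed, in ms
        self.logical = 0      # counter for events within the same ms

    def tick(self) -> tuple[int, int]:
        wall_ms = int(time.time() * 1000)
        if wall_ms > self.physical_ms:
            # Wall clock advanced: adopt it and reset the logical counter.
            self.physical_ms = wall_ms
            self.logical = 0
        else:
            # Same (or lagging) millisecond: only the logical counter can
            # advance, so bursts of enqueues in one ms drive it upward.
            self.logical += 1
        return (self.physical_ms, self.logical)

clock = HybridLogicalClock()
print([clock.tick() for _ in range(5)])  # tight loop => rising logical part
```

Because the logical counter resets as soon as the wall clock moves forward, a persistently high value means the node is either absorbing bursts of same-millisecond events or its wall clock is lagging.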
### Resolution

Usually self-correcting as the wall clock advances. If the offset persists:

1. Review the job submission rate
2. Consider horizontal scaling to distribute load

---

## 4. Merge Conflicts

### Symptoms

- Alert: `HlcMergeConflictRateHigh`
- Metric: `airgap_merge_conflicts_total` increasing
- Log: `Merge conflict: job {jobId} has conflicting payloads`

### Severity

**Warning** - May indicate duplicate job submissions or clock issues on offline nodes.

### Investigation Steps

1. **Identify conflict types**:

   ```promql
   sum by (conflict_type) (airgap_merge_conflicts_total)
   ```

2. **Review merge logs**:

   ```bash
   grep "merge conflict" /var/log/stellaops/scheduler.log | tail -100
   ```

3. **Check offline node clocks**:
   - Were the offline nodes synchronized before disconnection?
   - How long were the nodes offline?

### Resolution

1. **For duplicate jobs**: Use idempotency keys to prevent duplicates
2. **For payload conflicts**: Review the job submission logic
3. **For ordering conflicts**: Verify NTP on all nodes before disconnection

---

## 5. Slow Air-Gap Sync

### Symptoms

- Alert: `HlcSyncDurationHigh`
- Metric: `airgap_sync_duration_seconds` p95 > 30s

### Severity

**Warning** - Delays job processing.

### Investigation Steps

1. **Check bundle sizes**:

   ```promql
   histogram_quantile(0.95, airgap_bundle_size_bytes_bucket)
   ```

2. **Check database performance**:

   ```sql
   SELECT * FROM pg_stat_activity
   WHERE state = 'active' AND query LIKE '%scheduler_log%';
   ```

3. **Review index usage**:

   ```sql
   EXPLAIN ANALYZE
   SELECT * FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant-id>'
   ORDER BY t_hlc
   LIMIT 1000;
   ```

### Resolution

1. **Chunk large bundles**: Split bundles larger than 10K entries
2. **Optimize the database**: Ensure indexes are being used
3. **Increase resources**: Scale up the database if needed

---

## 6. No Enqueues

### Symptoms

- Alert: `HlcEnqueueRateZero`
- No jobs appearing in the HLC queue

### Severity

**Info** - May be expected or indicate misconfiguration.

### Investigation Steps

1. **Check whether HLC ordering is enabled**:

   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc'
   ```

2. **Verify the service is receiving jobs**:

   ```promql
   rate(scheduler_jobs_received_total[5m])
   ```

3. **Check for errors**:

   ```bash
   grep -i "hlc\|enqueue" /var/log/stellaops/scheduler.log | grep -i error
   ```

### Resolution

1. If HLC ordering should be enabled:

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         EnableHlcOrdering: true
   ```

2. If dual-write mode is needed:

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         DualWriteMode: true
   ```

---

## 7. Batch Snapshot Failures

### Symptoms

- Alert: `HlcBatchSnapshotFailures`
- Missing DSSE-signed batch proofs

### Severity

**Warning** - Audit proofs may be incomplete.

### Investigation Steps

1. **Check the signing key**:

   ```bash
   stella signer status
   ```

2. **Verify the DSSE configuration**:

   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc.batchSigning'
   ```

3. **Check database connectivity**:

   ```sql
   SELECT 1; -- Simple connectivity test
   ```

### Resolution

1. **Refresh signing credentials**
2. **Check certificate expiry**
3. **Verify database permissions for the batch_snapshots table**

---

## 8. Duplicate Node ID

### Symptoms

- Alert: `HlcDuplicateNodeId`
- Multiple instances with the same node_id

### Severity

**Critical** - Will cause chain corruption.

### Investigation Steps

1. **Identify the affected instances**:

   ```promql
   group by (node_id, instance) (hlc_ticks_total)
   ```

2. **Check the node ID configuration**:

   ```bash
   # On each instance
   grep -r "NodeId" /etc/stellaops/
   ```
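When reconfiguring (step 2 of the resolution below), each instance needs a node ID that is stable across restarts yet unique per instance. One common approach, sketched here as an illustration rather than a StellaOps requirement, is to derive the ID from the host's machine identity plus the instance name; the file path and truncation length are assumptions.

```python
import hashlib
from pathlib import Path

def derive_node_id(instance_name: str) -> str:
    """Derive a stable, per-instance node ID (illustrative approach only).

    Combines /etc/machine-id (unique per Linux host) with the instance
    name so that two instances on the same host still get distinct IDs.
    """
    machine_id = Path("/etc/machine-id").read_text().strip()
    digest = hashlib.sha256(f"{machine_id}:{instance_name}".encode()).hexdigest()
    return digest[:16]  # truncation length is an arbitrary choice

print(derive_node_id("scheduler-1"))
```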
### Resolution

**Immediate action required**:

1. Stop one of the duplicate instances
2. Reconfigure it with a unique node ID (see the sketch above)
3. Restart and verify
4. Check chain integrity for the affected time period

---

## Escalation Matrix

| Issue | First Responder | Escalation L2 | Escalation L3 |
|-------|-----------------|---------------|---------------|
| Chain verification failure | On-call SRE | Scheduler team | Security team |
| Clock skew | On-call SRE | Infrastructure | Architecture |
| Merge conflicts | On-call SRE | Scheduler team | - |
| Performance issues | On-call SRE | Database team | - |
| Duplicate node ID | On-call SRE | Scheduler team | - |

---

## Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-01-07 | Agent | Initial release |
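---

## Appendix: Chain Walk Sketch

The sketch below complements the SQL queries in §1 by recomputing chain links outside the database. It assumes each link is a SHA-256 over the previous link plus the entry's fields; the actual derivation is defined by the scheduler, so adjust `compute_link` to match before trusting its output. Column names mirror the §1 queries; everything else is an assumption.

```python
import hashlib

def compute_link(prev_link: bytes, job_id: str, t_hlc: str, payload: bytes) -> bytes:
    """Recompute one chain link.

    ASSUMPTION: the real scheme may hash different fields or use a
    different encoding; replace with the scheduler's actual derivation.
    """
    h = hashlib.sha256()
    h.update(prev_link)
    h.update(job_id.encode())
    h.update(t_hlc.encode())
    h.update(payload)
    return h.digest()

def find_break(rows: list[dict]) -> dict | None:
    """Walk rows ordered ascending by t_hlc and return the first entry
    whose stored link does not match the recomputed one (None if intact)."""
    prev = b""  # genesis entry: assumed empty previous link
    for row in rows:
        expected = compute_link(prev, row["job_id"], row["t_hlc"], row["payload"])
        if row["link"] != expected:
            return row
        prev = row["link"]
    return None

# rows = [...]  # fetch via the §1 segment query, ordered ascending by t_hlc
# print(find_break(rows))
```

Its result should agree with the `LAG`-based break-point query in §1; a mismatch between the two points at a link derivation problem rather than tampering.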