HLC Troubleshooting Runbook

Version: 1.0.0
Sprint: SPRINT_20260105_002_004_BE
Last Updated: 2026-01-07

This runbook covers troubleshooting procedures for Hybrid Logical Clock (HLC) based queue ordering in StellaOps.


Table of Contents

  1. Chain Verification Failure
  2. Clock Skew Issues
  3. Time Offset Drift
  4. Merge Conflicts
  5. Slow Air-Gap Sync
  6. No Enqueues
  7. Batch Snapshot Failures
  8. Duplicate Node ID

1. Chain Verification Failure

Symptoms

  • Alert: HlcChainVerificationFailure
  • Metric: scheduler_chain_verification_failures_total increasing
  • Log: Chain verification failed: expected {expected}, got {actual}

Severity

Critical - Indicates potential data tampering or corruption.

Investigation Steps

  1. Identify the affected chain segment:

    SELECT 
        job_id, 
        t_hlc, 
        encode(prev_link, 'hex') as prev_link,
        encode(link, 'hex') as link,
        created_at
    FROM scheduler.scheduler_log
    WHERE tenant_id = '<tenant_id>'
    ORDER BY t_hlc DESC
    LIMIT 100;
    
  2. Find the break point:

    WITH chain AS (
        SELECT 
            job_id,
            t_hlc,
            prev_link,
            link,
            LAG(link) OVER (ORDER BY t_hlc) as expected_prev
        FROM scheduler.scheduler_log
        WHERE tenant_id = '<tenant_id>'
    )
    SELECT * FROM chain
    WHERE prev_link IS DISTINCT FROM expected_prev
    ORDER BY t_hlc;
    
  3. Check for unauthorized modifications:

    • Review database audit logs
    • Check for direct SQL updates bypassing the application
  4. Verify chain head consistency:

    SELECT * FROM scheduler.chain_heads
    WHERE tenant_id = '<tenant_id>';
    

Resolution

If corruption is isolated:

  1. Mark affected jobs for re-processing
  2. Rebuild the chain from the last valid point (see the sketch after this list)
  3. Update the chain head
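
To locate the rebuild point, narrow the break-point query from step 2 above to the first broken entry; everything strictly before its t_hlc is the valid prefix. A minimal sketch, where SCHEDULER_DSN is a placeholder for your connection string:

    psql "$SCHEDULER_DSN" <<'SQL'
    WITH chain AS (
        SELECT t_hlc, prev_link, link,
               LAG(link) OVER (ORDER BY t_hlc) AS expected_prev
        FROM scheduler.scheduler_log
        WHERE tenant_id = '<tenant_id>'
    )
    SELECT t_hlc AS first_broken_hlc
    FROM chain
    WHERE prev_link IS DISTINCT FROM expected_prev
      AND expected_prev IS NOT NULL  -- skip the genesis row, which has no predecessor
    ORDER BY t_hlc
    LIMIT 1;
    SQL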

If tampering is suspected:

  1. Escalate to Security team immediately
  2. Preserve all logs and database state (see the sketch after this list)
  3. Initiate incident response procedure
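
A sketch for preserving the relevant tables before any remediation (the dump path and SCHEDULER_DSN are placeholders):

    # Snapshot the evidence tables in PostgreSQL custom format for later analysis
    pg_dump "$SCHEDULER_DSN" \
      --table=scheduler.scheduler_log \
      --table=scheduler.chain_heads \
      --format=custom \
      --file="/var/tmp/hlc-incident-$(date +%Y%m%dT%H%M%S).dump"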

2. Clock Skew Issues

Symptoms

  • Alert: HlcClockSkewExceedsTolerance
  • Metric: hlc_clock_skew_rejections_total increasing
  • Log: Clock skew exceeds tolerance: {skew}ms > {tolerance}ms

Severity

Critical - Can cause job ordering inconsistencies.

Investigation Steps

  1. Check NTP synchronization:

    # On affected node
    timedatectl status
    ntpq -p
    chronyc tracking  # if using chrony
    
  2. Verify time sources:

    ntpq -pn
    
  3. Check for leap second issues:

    dmesg | grep -i leap
    
  4. Compare with other nodes:

    for node in node-1 node-2 node-3; do
      echo "$node: $(ssh $node date +%s.%N)"
    done
    

Resolution

  1. Restart NTP client:

    sudo systemctl restart chronyd  # or ntpd
    
  2. Force time sync:

    sudo chronyc makestep
    
  3. Temporarily increase tolerance (emergency only):

    Scheduler:
      Queue:
        Hlc:
          MaxClockSkewMs: 60000  # Increase from default 5000
    
  4. Restart the affected service to reset HLC state (see below).
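
Assuming a systemd deployment; the unit name stellaops-scheduler is an illustrative assumption:

    sudo systemctl restart stellaops-scheduler  # adjust the unit name to your deployment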


3. Time Offset Drift

Symptoms

  • Alert: HlcPhysicalTimeOffset
  • Metric: hlc_physical_time_offset_seconds > 0.5

Severity

Warning - May cause timestamp anomalies in diagnostics.

Investigation Steps

  1. Check current offset:

    hlc_physical_time_offset_seconds{node_id="<node>"}
    
  2. Review HLC state:

    curl -s http://localhost:5000/health/hlc | jq
    
  3. Check for a high logical counter: a very high counter indicates frequent same-millisecond events (see the sketch after this list).
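
For intuition, here is a minimal sketch of the HLC local-tick rule in bash (an illustration of the general algorithm, not StellaOps code): when the wall clock has not advanced past the HLC's physical component, the logical counter increments instead of the timestamp.

    # Illustration of the HLC local-tick rule (not StellaOps code)
    hlc_ms=0; hlc_ctr=0
    hlc_tick() {
      local now_ms
      now_ms=$(date +%s%3N)           # wall clock in milliseconds (GNU date)
      if (( now_ms > hlc_ms )); then
        hlc_ms=$now_ms; hlc_ctr=0     # clock advanced: adopt it, reset the counter
      else
        hlc_ctr=$((hlc_ctr + 1))      # same millisecond: bump the counter
      fi
      echo "$hlc_ms.$hlc_ctr"
    }
    hlc_tick; hlc_tick; hlc_tick      # back-to-back calls may land in the same millisecond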

Resolution

Usually self-correcting as the wall clock advances. If the offset persists:

  1. Review job submission rate
  2. Consider horizontal scaling to distribute load

4. Merge Conflicts

Symptoms

  • Alert: HlcMergeConflictRateHigh
  • Metric: airgap_merge_conflicts_total increasing
  • Log: Merge conflict: job {jobId} has conflicting payloads

Severity

Warning - May indicate duplicate job submissions or clock issues on offline nodes.

Investigation Steps

  1. Identify conflict types:

    sum by (conflict_type) (airgap_merge_conflicts_total)
    
  2. Review merge logs:

    grep "merge conflict" /var/log/stellaops/scheduler.log | tail -100
    
  3. Check offline node clocks (a quick check is sketched after this list):

    • Were offline nodes synchronized before disconnection?
    • How long were nodes offline?
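
A quick way to inspect the time-sync state of a returning node, assuming chrony (the node name is a placeholder):

    # Last measured offset and reference time on the returning node
    ssh <offline-node> chronyc tracking | grep -E 'Ref time|Last offset|System time'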

Resolution

  1. For duplicate jobs: Use idempotency keys so repeated submissions collapse to one job
  2. For payload conflicts: Review the job submission logic
  3. For ordering conflicts: Verify NTP on all nodes before disconnection

5. Slow Air-Gap Sync

Symptoms

  • Alert: HlcSyncDurationHigh
  • Metric: airgap_sync_duration_seconds p95 > 30s

Severity

Warning - Delays job processing.

Investigation Steps

  1. Check bundle sizes:

    histogram_quantile(0.95, sum by (le) (rate(airgap_bundle_size_bytes_bucket[5m])))
    
  2. Check database performance:

    SELECT * FROM pg_stat_activity
    WHERE state = 'active' AND query LIKE '%scheduler_log%';
    
  3. Review index usage:

    EXPLAIN ANALYZE
    SELECT * FROM scheduler.scheduler_log
    WHERE tenant_id = '<tenant>'
    ORDER BY t_hlc
    LIMIT 1000;
    

Resolution

  1. Chunk large bundles: Split bundles larger than 10K entries
  2. Optimize database: Ensure the query from step 3 uses an index (a candidate is sketched below)
  3. Increase resources: Scale up the database if needed
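
If the EXPLAIN output from step 3 shows a sequential scan, a composite index like the following usually restores ordered reads; the index name is illustrative, so verify against the existing schema before creating it:

    psql "$SCHEDULER_DSN" <<'SQL'
    -- Supports WHERE tenant_id = ... ORDER BY t_hlc without an extra sort step
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_scheduler_log_tenant_hlc
        ON scheduler.scheduler_log (tenant_id, t_hlc);
    SQL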

6. No Enqueues

Symptoms

  • Alert: HlcEnqueueRateZero
  • No jobs appearing in HLC queue

Severity

Info - May be expected or indicate misconfiguration.

Investigation Steps

  1. Check if HLC ordering is enabled:

    curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc'
    
  2. Verify service is receiving jobs:

    rate(scheduler_jobs_received_total[5m])
    
  3. Check for errors:

    grep -i "hlc\|enqueue" /var/log/stellaops/scheduler.log | grep -i error
    

Resolution

  1. If HLC should be enabled:

    Scheduler:
      Queue:
        Hlc:
          EnableHlcOrdering: true
    
  2. If dual-write mode is needed:

    Scheduler:
      Queue:
        Hlc:
          DualWriteMode: true
    

7. Batch Snapshot Failures

Symptoms

  • Alert: HlcBatchSnapshotFailures
  • Missing DSSE-signed batch proofs

Severity

Warning - Audit proofs may be incomplete.

Investigation Steps

  1. Check signing key:

    stella signer status
    
  2. Verify DSSE configuration:

    curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc.batchSigning'
    
  3. Check database connectivity:

    SELECT 1;  -- Simple connectivity test
    

Resolution

  1. Refresh signing credentials
  2. Check certificate expiry
  3. Verify database permissions for the batch_snapshots table (a query is sketched below)
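
A sketch for listing grants on the snapshots table; the scheduler schema name follows the other tables in this runbook and should be verified:

    psql "$SCHEDULER_DSN" <<'SQL'
    -- Which roles hold which privileges on batch_snapshots
    SELECT grantee, privilege_type
    FROM information_schema.table_privileges
    WHERE table_schema = 'scheduler'
      AND table_name = 'batch_snapshots';
    SQL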

8. Duplicate Node ID

Symptoms

  • Alert: HlcDuplicateNodeId
  • Multiple instances with same node_id

Severity

Critical - Will cause chain corruption.

Investigation Steps

  1. Identify node IDs claimed by more than one instance:

    count by (node_id) (
      group by (node_id, instance) (hlc_ticks_total)
    ) > 1
    
  2. Check node ID configuration:

    # On each instance
    grep -r "NodeId" /etc/stellaops/
    

Resolution

Immediate action required:

  1. Stop one of the duplicate instances
  2. Reconfigure it with a unique node ID
  3. Restart and verify
  4. Check chain integrity for the affected time period (see the sketch after this list)
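
To bound the integrity check to the affected window, restrict the break-point query from section 1 by created_at. A sketch with placeholder window bounds:

    psql "$SCHEDULER_DSN" <<'SQL'
    WITH chain AS (
        SELECT job_id, t_hlc, prev_link, link,
               LAG(link) OVER (ORDER BY t_hlc) AS expected_prev
        FROM scheduler.scheduler_log
        WHERE tenant_id = '<tenant_id>'
          AND created_at BETWEEN '<window_start>' AND '<window_end>'
    )
    SELECT * FROM chain
    WHERE prev_link IS DISTINCT FROM expected_prev
      AND expected_prev IS NOT NULL;  -- the window's first row has no in-window predecessor
    SQL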

Escalation Matrix

| Issue                      | First Responder | Escalation L2  | Escalation L3 |
|----------------------------|-----------------|----------------|---------------|
| Chain verification failure | On-call SRE     | Scheduler team | Security team |
| Clock skew                 | On-call SRE     | Infrastructure | Architecture  |
| Merge conflicts            | On-call SRE     | Scheduler team | -             |
| Performance issues         | On-call SRE     | Database team  | -             |
| Duplicate node ID          | On-call SRE     | Scheduler team | -             |

Revision History

| Version | Date       | Author | Changes         |
|---------|------------|--------|-----------------|
| 1.0.0   | 2026-01-07 | Agent  | Initial release |