# HLC Troubleshooting Runbook

Version: 1.0.0
Sprint: SPRINT_20260105_002_004_BE
Last Updated: 2026-01-07

This runbook covers troubleshooting procedures for Hybrid Logical Clock (HLC) based queue ordering in StellaOps.

## Table of Contents

- Chain Verification Failure
- Clock Skew Issues
- Time Offset Drift
- Merge Conflicts
- Slow Air-Gap Sync
- No Enqueues
- Batch Snapshot Failures
- Duplicate Node ID

## 1. Chain Verification Failure

### Symptoms

- Alert: `HlcChainVerificationFailure`
- Metric: `scheduler_chain_verification_failures_total` increasing
- Log: `Chain verification failed: expected {expected}, got {actual}`

### Severity

Critical - Indicates potential data tampering or corruption.

### Investigation Steps

1. Identify the affected chain segment:

   ```sql
   SELECT job_id, t_hlc,
          encode(prev_link, 'hex') AS prev_link,
          encode(link, 'hex') AS link,
          created_at
   FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant_id>'
   ORDER BY t_hlc DESC
   LIMIT 100;
   ```

2. Find the break point (a scripted cross-check follows this list):

   ```sql
   WITH chain AS (
       SELECT job_id, t_hlc, prev_link, link,
              LAG(link) OVER (ORDER BY t_hlc) AS expected_prev
       FROM scheduler.scheduler_log
       WHERE tenant_id = '<tenant_id>'
   )
   SELECT * FROM chain
   WHERE prev_link IS DISTINCT FROM expected_prev
   ORDER BY t_hlc;
   ```

3. Check for unauthorized modifications:
   - Review database audit logs
   - Check for direct SQL updates bypassing the application

4. Verify chain head consistency:

   ```sql
   SELECT * FROM scheduler.chain_heads WHERE tenant_id = '<tenant_id>';
   ```
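
The SQL above finds where `prev_link` stops matching the previous row's `link`. For a scripted cross-check that also recomputes each link, a minimal sketch follows. The link derivation used here (SHA-256 over `prev_link`, `job_id`, and `t_hlc`) is an assumption for illustration; confirm the actual chain-hash definition in the scheduler source before acting on reported mismatches.

```python
import hashlib
import psycopg2  # any PostgreSQL driver works; psycopg2 is only an example

def expected_link(prev_link: bytes, job_id: str, t_hlc: str) -> bytes:
    # Assumed rule for illustration: link = SHA-256(prev_link || job_id || t_hlc).
    return hashlib.sha256(prev_link + job_id.encode() + t_hlc.encode()).digest()

def verify_chain(dsn: str, tenant_id: str) -> None:
    """Walk the tenant's log in HLC order and report chain breaks."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT job_id, t_hlc::text, prev_link, link "
            "FROM scheduler.scheduler_log WHERE tenant_id = %s ORDER BY t_hlc",
            (tenant_id,),
        )
        prev = None
        for job_id, t_hlc, prev_link, link in cur:
            prev_link = bytes(prev_link) if prev_link is not None else b""
            link = bytes(link)
            if prev is not None and prev_link != prev:
                print(f"prev_link break at job {job_id} (t_hlc={t_hlc})")
            if expected_link(prev_link, job_id, t_hlc) != link:
                print(f"link mismatch at job {job_id} (t_hlc={t_hlc})")
            prev = link
```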

### Resolution

If corruption is isolated:

1. Mark affected jobs for re-processing
2. Rebuild chain from the last valid point
3. Update chain head

If tampering is suspected:

1. Escalate to Security team immediately
2. Preserve all logs and database state
3. Initiate incident response procedure

## 2. Clock Skew Issues

### Symptoms

- Alert: `HlcClockSkewExceedsTolerance`
- Metric: `hlc_clock_skew_rejections_total` increasing
- Log: `Clock skew exceeds tolerance: {skew}ms > {tolerance}ms`

### Severity

Critical - Can cause job ordering inconsistencies.

### Investigation Steps

1. Check NTP synchronization:

   ```bash
   # On affected node
   timedatectl status
   ntpq -p
   chronyc tracking  # if using chrony
   ```

2. Verify time sources:

   ```bash
   ntpq -pn
   ```

3. Check for leap second issues:

   ```bash
   dmesg | grep -i leap
   ```

4. Compare with other nodes (a latency-adjusted sketch follows this list):

   ```bash
   for node in node-1 node-2 node-3; do
     echo "$node: $(ssh $node date +%s.%N)"
   done
   ```
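
The loop above ignores SSH latency, so small offsets are inconclusive. As a rough cross-check, the sketch below halves the round-trip time to estimate each node's offset from the local clock. It assumes passwordless SSH and placeholder hostnames, and it gives an estimate only, not an NTP-grade measurement.

```python
import subprocess
import time

NODES = ["node-1", "node-2", "node-3"]  # replace with real hostnames

def remote_offset(node: str) -> float:
    """Return remote_clock - local_clock in seconds, crudely latency-adjusted."""
    t0 = time.time()
    out = subprocess.run(
        ["ssh", node, "date", "+%s.%N"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    t1 = time.time()
    remote = float(out.stdout.strip())
    midpoint = (t0 + t1) / 2  # assume the remote read happened mid round-trip
    return remote - midpoint

if __name__ == "__main__":
    for node in NODES:
        print(f"{node}: offset {remote_offset(node) * 1000:+.1f} ms")
```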

### Resolution

1. Restart NTP client:

   ```bash
   sudo systemctl restart chronyd  # or ntpd
   ```

2. Force time sync:

   ```bash
   sudo chronyc makestep
   ```

3. Temporarily increase tolerance (emergency only):

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         MaxClockSkewMs: 60000  # Increase from default 5000
   ```

4. Restart the affected service to reset HLC state.

## 3. Time Offset Drift

### Symptoms

- Alert: `HlcPhysicalTimeOffset`
- Metric: `hlc_physical_time_offset_seconds` > 0.5

### Severity

Warning - May cause timestamp anomalies in diagnostics.

### Investigation Steps

1. Check the current offset:

   ```promql
   hlc_physical_time_offset_seconds{node_id="<node>"}
   ```

2. Review HLC state:

   ```bash
   curl -s http://localhost:5000/health/hlc | jq
   ```

3. Check for a high logical counter: if the logical counter is very high, it indicates frequent same-millisecond events.

### Resolution

Usually self-correcting as the wall clock advances (the sketch below shows why). If the offset persists:

- Review job submission rate
- Consider horizontal scaling to distribute load
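
For context on why sustained same-millisecond traffic shows up as both a high logical counter and a growing physical-time offset, here is a minimal sketch of an HLC local-event rule with a bounded counter. It is illustrative only: the class, the counter bound, and the overflow behaviour are assumptions, not the StellaOps implementation, and the mapping to `hlc_physical_time_offset_seconds` is approximate.

```python
import time

MAX_LOGICAL = 0xFFFF  # assumed bound; the real bound is implementation-specific

class HybridLogicalClock:
    """Sketch of an HLC local-event ('tick') rule; not the StellaOps code."""

    def __init__(self) -> None:
        self.physical_ms = 0  # highest millisecond observed so far
        self.logical = 0      # disambiguates events within that millisecond

    def tick(self) -> tuple:
        wall = int(time.time() * 1000)
        if wall > self.physical_ms:
            # Wall clock moved ahead: adopt it and reset the counter.
            self.physical_ms, self.logical = wall, 0
        elif self.logical < MAX_LOGICAL:
            # Same millisecond (or wall clock behind): bump the counter.
            self.logical += 1
        else:
            # Counter saturated: borrow from the physical part. Sustained
            # same-millisecond traffic (or a backward clock step) can push
            # physical_ms ahead of the wall clock - roughly the gap that
            # hlc_physical_time_offset_seconds reports.
            self.physical_ms, self.logical = self.physical_ms + 1, 0
        return self.physical_ms, self.logical
```

This is why the condition usually clears on its own once the wall clock catches up, and why reducing the submission rate or spreading load across nodes helps when it does not.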

## 4. Merge Conflicts

### Symptoms

- Alert: `HlcMergeConflictRateHigh`
- Metric: `airgap_merge_conflicts_total` increasing
- Log: `Merge conflict: job {jobId} has conflicting payloads`

### Severity

Warning - May indicate duplicate job submissions or clock issues on offline nodes.

### Investigation Steps

1. Identify conflict types:

   ```promql
   sum by (conflict_type) (airgap_merge_conflicts_total)
   ```

2. Review merge logs:

   ```bash
   grep "merge conflict" /var/log/stellaops/scheduler.log | tail -100
   ```

3. Check offline node clocks:
   - Were offline nodes synchronized before disconnection?
   - How long were nodes offline?

### Resolution

- For duplicate jobs: Use idempotency keys to prevent duplicates (see the merge sketch after this list)
- For payload conflicts: Review job submission logic
- For ordering conflicts: Verify NTP on all nodes before disconnection
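
To make the conflict categories concrete, here is a minimal merge sketch. The entry layout, the HLC tuple shape, and the conflict rule (same `job_id`, different payload) are illustrative assumptions, not the StellaOps merge implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEntry:
    job_id: str
    t_hlc: tuple      # e.g. (physical_ms, logical, node_id) - assumed layout
    payload: bytes

def merge(local: list, imported: list) -> tuple:
    """Deterministically merge two logs by HLC order and report payload conflicts."""
    conflicts = []
    by_job = {e.job_id: e for e in local}
    merged = list(local)
    for entry in imported:
        seen = by_job.get(entry.job_id)
        if seen is None:
            by_job[entry.job_id] = entry
            merged.append(entry)
        elif seen.payload != entry.payload:
            # Same job_id, different payload: this is what surfaces as
            # "Merge conflict: job {jobId} has conflicting payloads".
            conflicts.append(entry.job_id)
        # Exact duplicates are dropped silently (idempotent replay).
    merged.sort(key=lambda e: e.t_hlc)  # HLC order gives the replay total order
    return merged, conflicts
```

An idempotency key amounts to keeping `job_id` stable across retries of the same submission, so re-imports fall into the "exact duplicate" branch instead of the conflict branch.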

## 5. Slow Air-Gap Sync

### Symptoms

- Alert: `HlcSyncDurationHigh`
- Metric: `airgap_sync_duration_seconds` p95 > 30s

### Severity

Warning - Delays job processing.

### Investigation Steps

1. Check bundle sizes:

   ```promql
   histogram_quantile(0.95, airgap_bundle_size_bytes_bucket)
   ```

2. Check database performance:

   ```sql
   SELECT * FROM pg_stat_activity
   WHERE state = 'active' AND query LIKE '%scheduler_log%';
   ```

3. Review index usage:

   ```sql
   EXPLAIN ANALYZE
   SELECT * FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant>'
   ORDER BY t_hlc
   LIMIT 1000;
   ```

### Resolution

- Chunk large bundles: Split bundles > 10K entries (see the sketch after this list)
- Optimize database: Ensure indexes are used
- Increase resources: Scale up database if needed
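
A minimal illustration of bundle chunking (the 10K threshold comes from the guidance above; the bundle structure and the `write_bundle` helper are placeholders, not the StellaOps bundle format):

```python
MAX_ENTRIES_PER_CHUNK = 10_000  # threshold suggested above

def chunk_entries(entries, size=MAX_ENTRIES_PER_CHUNK):
    """Yield successive chunks of at most `size` entries, preserving order."""
    chunk = []
    for entry in entries:
        chunk.append(entry)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Usage: export each chunk as its own bundle so that no single import
# transaction has to process more than MAX_ENTRIES_PER_CHUNK entries.
# for i, chunk in enumerate(chunk_entries(all_entries)):
#     write_bundle(f"bundle-{i:04d}", chunk)  # write_bundle is hypothetical
```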

## 6. No Enqueues

### Symptoms

- Alert: `HlcEnqueueRateZero`
- No jobs appearing in the HLC queue

### Severity

Info - May be expected or indicate misconfiguration.

### Investigation Steps

1. Check if HLC ordering is enabled:

   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc'
   ```

2. Verify the service is receiving jobs:

   ```promql
   rate(scheduler_jobs_received_total[5m])
   ```

3. Check for errors:

   ```bash
   grep -i "hlc\|enqueue" /var/log/stellaops/scheduler.log | grep -i error
   ```

### Resolution

1. If HLC should be enabled:

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         EnableHlcOrdering: true
   ```

2. If dual-write mode is needed:

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         DualWriteMode: true
   ```
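
After changing either setting and restarting, re-read the config endpoint used above to confirm the running value. The sketch below assumes the endpoint returns camelCase JSON matching the `jq` path earlier; adjust the field names to whatever the response actually contains.

```python
import json
import urllib.request

# Same endpoint as the curl command in the investigation steps.
with urllib.request.urlopen("http://localhost:5000/config") as resp:
    cfg = json.load(resp)

# Field casing below is an assumption; verify against the real response.
hlc = cfg.get("scheduler", {}).get("queue", {}).get("hlc", {})
print(json.dumps(hlc, indent=2))
if not hlc.get("enableHlcOrdering"):
    print("WARNING: HLC ordering still appears to be disabled")
```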

## 7. Batch Snapshot Failures

### Symptoms

- Alert: `HlcBatchSnapshotFailures`
- Missing DSSE-signed batch proofs

### Severity

Warning - Audit proofs may be incomplete.

### Investigation Steps

1. Check the signing key:

   ```bash
   stella signer status
   ```

2. Verify DSSE configuration:

   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc.batchSigning'
   ```

3. Check database connectivity:

   ```sql
   SELECT 1; -- Simple connectivity test
   ```

### Resolution

- Refresh signing credentials
- Check certificate expiry
- Verify database permissions for the batch_snapshots table
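
If snapshot rows exist but the proofs are suspect, a structural check of the stored envelopes can narrow things down. The sketch below assumes each batch proof is a standard DSSE JSON envelope (`payloadType`, base64 `payload`, `signatures`); it recomputes the DSSE pre-authentication encoding but does not verify signatures, which requires the signing key material.

```python
import base64
import json

def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE v1 pre-authentication encoding: the bytes each signature covers."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %s %d %s" % (len(t), t, len(payload), payload)

def inspect_envelope(raw: str) -> None:
    """Decode a DSSE envelope and report what a verifier would sign over."""
    env = json.loads(raw)
    payload = base64.b64decode(env["payload"])
    message = pae(env["payloadType"], payload)
    print(f"payloadType : {env['payloadType']}")
    print(f"payload     : {len(payload)} bytes")
    print(f"signatures  : {[s.get('keyid', '<no keyid>') for s in env['signatures']]}")
    print(f"PAE message : {len(message)} bytes (verify each signature over these bytes)")
```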

## 8. Duplicate Node ID

### Symptoms

- Alert: `HlcDuplicateNodeId`
- Multiple instances with the same node_id

### Severity

Critical - Will cause chain corruption.

### Investigation Steps

1. Identify affected instances (see the API sketch after this list):

   ```promql
   group by (node_id, instance) (hlc_ticks_total)
   ```

2. Check node ID configuration:

   ```bash
   # On each instance
   grep -r "NodeId" /etc/stellaops/
   ```
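
To pull the duplicate check directly from the Prometheus API instead of the console, a small sketch (the Prometheus URL is a placeholder for your environment):

```python
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus:9090"  # placeholder; point at your Prometheus
QUERY = "count by (node_id) (group by (node_id, instance) (hlc_ticks_total)) > 1"

url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)["data"]["result"]

if not result:
    print("No node_id is reported by more than one instance")
for series in result:
    node_id = series["metric"]["node_id"]
    count = series["value"][1]
    print(f"node_id {node_id} is reported by {count} instances")
```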

### Resolution

Immediate action required:

1. Stop one of the duplicate instances
2. Reconfigure with a unique node ID
3. Restart and verify
4. Check chain integrity for the affected time period

## Escalation Matrix

| Issue | First Responder | Escalation L2 | Escalation L3 |
|---|---|---|---|
| Chain verification failure | On-call SRE | Scheduler team | Security team |
| Clock skew | On-call SRE | Infrastructure | Architecture |
| Merge conflicts | On-call SRE | Scheduler team | - |
| Performance issues | On-call SRE | Database team | - |
| Duplicate node ID | On-call SRE | Scheduler team | - |

## Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-01-07 | Agent | Initial release |