# HLC Troubleshooting Runbook
> **Version**: 1.0.0
> **Sprint**: SPRINT_20260105_002_004_BE
> **Last Updated**: 2026-01-07

This runbook covers troubleshooting procedures for Hybrid Logical Clock (HLC)-based queue ordering in StellaOps.

---
## Table of Contents
1. [Chain Verification Failure](#1-chain-verification-failure)
2. [Clock Skew Issues](#2-clock-skew-issues)
3. [Time Offset Drift](#3-time-offset-drift)
4. [Merge Conflicts](#4-merge-conflicts)
5. [Slow Air-Gap Sync](#5-slow-air-gap-sync)
6. [No Enqueues](#6-no-enqueues)
7. [Batch Snapshot Failures](#7-batch-snapshot-failures)
8. [Duplicate Node ID](#8-duplicate-node-id)
---
## 1. Chain Verification Failure
### Symptoms
- Alert: `HlcChainVerificationFailure`
- Metric: `scheduler_chain_verification_failures_total` increasing
- Log: `Chain verification failed: expected {expected}, got {actual}`
### Severity
**Critical** - Indicates potential data tampering or corruption.
### Investigation Steps
1. **Identify the affected chain segment**:
```sql
SELECT
    job_id,
    t_hlc,
    encode(prev_link, 'hex') AS prev_link,
    encode(link, 'hex') AS link,
    created_at
FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
ORDER BY t_hlc DESC
LIMIT 100;
```
2. **Find the break point**:
```sql
WITH chain AS (
    SELECT
        job_id,
        t_hlc,
        prev_link,
        link,
        LAG(link) OVER (ORDER BY t_hlc) AS expected_prev
    FROM scheduler.scheduler_log
    WHERE tenant_id = '<tenant_id>'
)
-- Note: the earliest row has no LAG() predecessor and may appear as a false positive.
SELECT * FROM chain
WHERE prev_link IS DISTINCT FROM expected_prev
ORDER BY t_hlc;
```
3. **Check for unauthorized modifications**:
- Review database audit logs
- Check for direct SQL updates bypassing the application
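
One quick, non-invasive cross-check, assuming the log table is append-only by design, is PostgreSQL's table statistics; non-zero update or delete counters on `scheduler_log` suggest writes made outside the application path:
```sql
-- An append-only log should accumulate no in-place updates or deletes.
SELECT schemaname, relname, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE schemaname = 'scheduler' AND relname = 'scheduler_log';
```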
4. **Verify chain head consistency**:
```sql
SELECT * FROM scheduler.chain_heads
WHERE tenant_id = '<tenant_id>';
```
### Resolution
**If corruption is isolated**:
1. Mark affected jobs for re-processing
2. Rebuild chain from the last valid point
3. Update chain head
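
If the corruption is isolated, steps 2 and 3 amount to resetting the head to the last entry that still verifies. A rough sketch, assuming `chain_heads` stores the current head in a `head_link` column (verify against the actual schema before running):
```sql
-- Last valid entry before the first mismatch reported by the break-point query
SELECT job_id, t_hlc, encode(link, 'hex') AS last_valid_link
FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
  AND t_hlc < '<t_hlc_of_first_mismatch>'
ORDER BY t_hlc DESC
LIMIT 1;

-- Reset the chain head to that link (column name is an assumption)
UPDATE scheduler.chain_heads
SET head_link = decode('<last_valid_link>', 'hex')
WHERE tenant_id = '<tenant_id>';
```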
**If tampering is suspected**:
1. Escalate to Security team immediately
2. Preserve all logs and database state
3. Initiate incident response procedure
---
## 2. Clock Skew Issues
### Symptoms
- Alert: `HlcClockSkewExceedsTolerance`
- Metric: `hlc_clock_skew_rejections_total` increasing
- Log: `Clock skew exceeds tolerance: {skew}ms > {tolerance}ms`
### Severity
**Critical** - Can cause job ordering inconsistencies.
### Investigation Steps
1. **Check NTP synchronization**:
```bash
# On affected node
timedatectl status
ntpq -p
chronyc tracking # if using chrony
```
2. **Verify time sources**:
```bash
ntpq -pn
```
3. **Check for leap second issues**:
```bash
dmesg | grep -i leap
```
4. **Compare with other nodes**:
```bash
for node in node-1 node-2 node-3; do
  echo "$node: $(ssh "$node" date +%s.%N)"
done
```
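
The database server's clock can be included in the same comparison as an additional reference point (an informal cross-check, not a substitute for proper NTP monitoring):
```sql
SELECT now() AS db_wall_clock,
       extract(epoch FROM clock_timestamp()) AS db_epoch_seconds;
```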
### Resolution
1. **Restart NTP client**:
```bash
sudo systemctl restart chronyd # or ntpd
```
2. **Force time sync**:
```bash
sudo chronyc makestep
```
3. **Temporarily increase tolerance** (emergency only):
```yaml
Scheduler:
  Queue:
    Hlc:
      MaxClockSkewMs: 60000  # Increase from default 5000
```
4. **Restart affected service** to reset HLC state.
---
## 3. Time Offset Drift
### Symptoms
- Alert: `HlcPhysicalTimeOffset`
- Metric: `hlc_physical_time_offset_seconds` > 0.5
### Severity
**Warning** - May cause timestamp anomalies in diagnostics.
### Investigation Steps
1. **Check current offset**:
```promql
hlc_physical_time_offset_seconds{node_id="<node>"}
```
2. **Review HLC state**:
```bash
curl -s http://localhost:5000/health/hlc | jq
```
3. **Check for high logical counter**:
If the logical counter is very high, the node is generating many events within the same physical millisecond; the counter only increments when the wall clock has not advanced between ticks.
### Resolution
This usually self-corrects as the wall clock advances. If it persists:
1. Review job submission rate
2. Consider horizontal scaling to distribute load
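
To gauge how often same-millisecond bursts occur, the log itself can be sampled (a rough sketch, assuming `created_at` carries at least millisecond precision):
```sql
SELECT date_trunc('milliseconds', created_at) AS ms_bucket,
       count(*) AS entries
FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
  AND created_at > now() - interval '1 hour'
GROUP BY 1
ORDER BY entries DESC
LIMIT 10;
```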
---
## 4. Merge Conflicts
### Symptoms
- Alert: `HlcMergeConflictRateHigh`
- Metric: `airgap_merge_conflicts_total` increasing
- Log: `Merge conflict: job {jobId} has conflicting payloads`
### Severity
**Warning** - May indicate duplicate job submissions or clock issues on offline nodes.
### Investigation Steps
1. **Identify conflict types**:
```promql
sum by (conflict_type) (airgap_merge_conflicts_total)
```
2. **Review merge logs**:
```bash
grep "merge conflict" /var/log/stellaops/scheduler.log | tail -100
```
3. **Check offline node clocks**:
- Were offline nodes synchronized before disconnection?
- How long were nodes offline?
### Resolution
1. **For duplicate jobs**: Use idempotency keys to prevent duplicates
2. **For payload conflicts**: Review job submission logic
3. **For ordering conflicts**: Verify NTP on all nodes before disconnection
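
For the duplicate-job case, one way to enforce idempotency at the database level is a partial unique index; both the target table and the `idempotency_key` column below are illustrative assumptions, so adapt them to the actual job intake schema:
```sql
CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS ux_scheduler_log_idempotency
    ON scheduler.scheduler_log (tenant_id, idempotency_key)
    WHERE idempotency_key IS NOT NULL;
```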
---
## 5. Slow Air-Gap Sync
### Symptoms
- Alert: `HlcSyncDurationHigh`
- Metric: `airgap_sync_duration_seconds` p95 > 30s
### Severity
**Warning** - Delays job processing.
### Investigation Steps
1. **Check bundle sizes**:
```promql
histogram_quantile(0.95, sum by (le) (rate(airgap_bundle_size_bytes_bucket[5m])))
```
2. **Check database performance**:
```sql
SELECT * FROM pg_stat_activity
WHERE state = 'active' AND query LIKE '%scheduler_log%';
```
3. **Review index usage**:
```sql
EXPLAIN ANALYZE
SELECT * FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
ORDER BY t_hlc
LIMIT 1000;
```
### Resolution
1. **Chunk large bundles**: Split bundles > 10K entries
2. **Optimize database**: Ensure indexes are used
3. **Increase resources**: Scale up database if needed
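
If the `EXPLAIN ANALYZE` above shows a sequential scan, a composite index matching the sync query's access pattern usually helps (sketch only; confirm an equivalent index does not already exist under another name):
```sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_scheduler_log_tenant_hlc
    ON scheduler.scheduler_log (tenant_id, t_hlc);
```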
---
## 6. No Enqueues
### Symptoms
- Alert: `HlcEnqueueRateZero`
- No jobs appearing in HLC queue
### Severity
**Info** - May be expected or indicate misconfiguration.
### Investigation Steps
1. **Check if HLC ordering is enabled**:
```bash
curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc'
```
2. **Verify service is receiving jobs**:
```promql
rate(scheduler_jobs_received_total[5m])
```
3. **Check for errors**:
```bash
grep -i "hlc\|enqueue" /var/log/stellaops/scheduler.log | grep -i error
```
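
A direct database check can confirm whether anything has been written recently, using the same log table as section 1:
```sql
SELECT tenant_id, max(created_at) AS last_enqueue
FROM scheduler.scheduler_log
GROUP BY tenant_id
ORDER BY last_enqueue DESC;
```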
### Resolution
1. If HLC should be enabled:
```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: true
```
2. If dual-write mode is needed:
```yaml
Scheduler:
  Queue:
    Hlc:
      DualWriteMode: true
```
---
## 7. Batch Snapshot Failures
### Symptoms
- Alert: `HlcBatchSnapshotFailures`
- Missing DSSE-signed batch proofs
### Severity
**Warning** - Audit proofs may be incomplete.
### Investigation Steps
1. **Check signing key**:
```bash
stella signer status
```
2. **Verify DSSE configuration**:
```bash
curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc.batchSigning'
```
3. **Check database connectivity**:
```sql
SELECT 1; -- Simple connectivity test
```
### Resolution
1. **Refresh signing credentials**
2. **Check certificate expiry**
3. **Verify database permissions for batch_snapshots table**
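
Permissions on the snapshot table can be checked directly; the role name below is a placeholder for whatever account the scheduler connects as, and the schema qualification assumes the table lives alongside the other scheduler tables:
```sql
SELECT has_table_privilege('<scheduler_db_role>',
                           'scheduler.batch_snapshots',
                           'INSERT, SELECT') AS can_write_snapshots;
```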
---
## 8. Duplicate Node ID
### Symptoms
- Alert: `HlcDuplicateNodeId`
- Multiple instances with same node_id
### Severity
**Critical** - Will cause chain corruption.
### Investigation Steps
1. **Identify affected instances**:
```promql
group by (node_id, instance) (hlc_ticks_total)
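# Optional refinement: node_ids reported by more than one instance indicate duplicates
count by (node_id) (group by (node_id, instance) (hlc_ticks_total)) > 1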
```
2. **Check node ID configuration**:
```bash
# On each instance
grep -r "NodeId" /etc/stellaops/
```
### Resolution
**Immediate action required**:
1. Stop one of the duplicate instances
2. Reconfigure with unique node ID
3. Restart and verify
4. Check chain integrity for affected time period
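
Chain integrity for the affected window can then be checked with the same break-point query from section 1, bounded to the overlap period:
```sql
WITH chain AS (
    SELECT
        job_id,
        t_hlc,
        prev_link,
        link,
        LAG(link) OVER (ORDER BY t_hlc) AS expected_prev
    FROM scheduler.scheduler_log
    WHERE tenant_id = '<tenant_id>'
      AND created_at BETWEEN '<overlap_start>' AND '<overlap_end>'
)
-- The first row inside the window has no LAG() predecessor and can be ignored.
SELECT * FROM chain
WHERE prev_link IS DISTINCT FROM expected_prev
ORDER BY t_hlc;
```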
---
## Escalation Matrix
| Issue | First Responder | Escalation L2 | Escalation L3 |
|-------|-----------------|---------------|---------------|
| Chain verification failure | On-call SRE | Scheduler team | Security team |
| Clock skew | On-call SRE | Infrastructure | Architecture |
| Merge conflicts | On-call SRE | Scheduler team | - |
| Performance issues | On-call SRE | Database team | - |
| Duplicate node ID | On-call SRE | Scheduler team | - |
---
## Revision History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-01-07 | Agent | Initial release |