# HLC Troubleshooting Runbook

> **Version**: 1.0.0
> **Sprint**: SPRINT_20260105_002_004_BE
> **Last Updated**: 2026-01-07

This runbook covers troubleshooting procedures for Hybrid Logical Clock (HLC) based queue ordering in StellaOps.

---

## Table of Contents

1. [Chain Verification Failure](#1-chain-verification-failure)
2. [Clock Skew Issues](#2-clock-skew-issues)
3. [Time Offset Drift](#3-time-offset-drift)
4. [Merge Conflicts](#4-merge-conflicts)
5. [Slow Air-Gap Sync](#5-slow-air-gap-sync)
6. [No Enqueues](#6-no-enqueues)
7. [Batch Snapshot Failures](#7-batch-snapshot-failures)
8. [Duplicate Node ID](#8-duplicate-node-id)

---
## 1. Chain Verification Failure

### Symptoms

- Alert: `HlcChainVerificationFailure`
- Metric: `scheduler_chain_verification_failures_total` increasing
- Log: `Chain verification failed: expected {expected}, got {actual}`

### Severity

**Critical** - Indicates potential data tampering or corruption.

### Investigation Steps

1. **Identify the affected chain segment**:

   ```sql
   SELECT
       job_id,
       t_hlc,
       encode(prev_link, 'hex') as prev_link,
       encode(link, 'hex') as link,
       created_at
   FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant_id>'
   ORDER BY t_hlc DESC
   LIMIT 100;
   ```

2. **Find the break point**:

   ```sql
   WITH chain AS (
       SELECT
           job_id,
           t_hlc,
           prev_link,
           link,
           LAG(link) OVER (ORDER BY t_hlc) as expected_prev
       FROM scheduler.scheduler_log
       WHERE tenant_id = '<tenant_id>'
   )
   SELECT * FROM chain
   WHERE prev_link IS DISTINCT FROM expected_prev
   ORDER BY t_hlc;
   ```

3. **Check for unauthorized modifications** (see the statistics check after these steps):
   - Review database audit logs
   - Check for direct SQL updates bypassing the application

4. **Verify chain head consistency**:

   ```sql
   SELECT * FROM scheduler.chain_heads
   WHERE tenant_id = '<tenant_id>';
   ```

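Because the entries form a hash chain, `scheduler.scheduler_log` should normally be append-only. As a quick heuristic for step 3 (a sketch, assuming PostgreSQL with the standard statistics collector), check whether the table has seen any updates or deletes since the last statistics reset; non-zero counters on an append-only log are worth correlating with the audit logs:

```sql
-- Heuristic: an append-only chain log should show no updates or deletes.
-- Counters are cumulative since the last statistics reset.
SELECT n_tup_ins AS inserts,
       n_tup_upd AS updates,
       n_tup_del AS deletes
FROM pg_stat_user_tables
WHERE schemaname = 'scheduler'
  AND relname = 'scheduler_log';
```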
### Resolution

**If corruption is isolated**:

1. Mark affected jobs for re-processing
2. Rebuild the chain from the last valid point (see the sketch below)
3. Update the chain head

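A minimal sketch of steps 2-3, assuming `scheduler.chain_heads` keeps one row per tenant with the latest link and HLC timestamp under columns named `head_link` and `head_t_hlc` (illustrative names; verify the actual schema first). It points the chain head back at the last entry before the break found by the `LAG` query above; re-submit the discarded jobs through the normal application path so fresh links are computed by the service rather than by hand:

```sql
-- Sketch only: head_link / head_t_hlc are assumed column names on
-- scheduler.chain_heads; check the real schema before running this.
WITH last_valid AS (
    SELECT link, t_hlc
    FROM scheduler.scheduler_log
    WHERE tenant_id = '<tenant_id>'
      AND t_hlc < '<t_hlc_of_first_broken_entry>'
    ORDER BY t_hlc DESC
    LIMIT 1
)
UPDATE scheduler.chain_heads ch
SET head_link  = lv.link,
    head_t_hlc = lv.t_hlc
FROM last_valid lv
WHERE ch.tenant_id = '<tenant_id>';
```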
**If tampering is suspected**:

1. Escalate to the Security team immediately
2. Preserve all logs and database state
3. Initiate the incident response procedure

---
## 2. Clock Skew Issues

### Symptoms

- Alert: `HlcClockSkewExceedsTolerance`
- Metric: `hlc_clock_skew_rejections_total` increasing
- Log: `Clock skew exceeds tolerance: {skew}ms > {tolerance}ms`

### Severity

**Critical** - Can cause job ordering inconsistencies.

### Investigation Steps

1. **Check NTP synchronization**:

   ```bash
   # On affected node
   timedatectl status
   ntpq -p
   chronyc tracking  # if using chrony
   ```

2. **Verify time sources**:

   ```bash
   ntpq -pn
   ```

3. **Check for leap second issues**:

   ```bash
   dmesg | grep -i leap
   ```

4. **Compare with other nodes**:

   ```bash
   for node in node-1 node-2 node-3; do
     echo "$node: $(ssh $node date +%s.%N)"
   done
   ```

### Resolution

1. **Restart NTP client**:

   ```bash
   sudo systemctl restart chronyd  # or ntpd
   ```

2. **Force time sync**:

   ```bash
   sudo chronyc makestep
   ```

3. **Temporarily increase tolerance** (emergency only):

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         MaxClockSkewMs: 60000  # Increase from default 5000
   ```

4. **Restart affected service** to reset HLC state.

---
## 3. Time Offset Drift

### Symptoms

- Alert: `HlcPhysicalTimeOffset`
- Metric: `hlc_physical_time_offset_seconds` > 0.5

### Severity

**Warning** - May cause timestamp anomalies in diagnostics.

### Investigation Steps

1. **Check current offset**:

   ```promql
   hlc_physical_time_offset_seconds{node_id="<node>"}
   ```

2. **Review HLC state**:

   ```bash
   curl -s http://localhost:5000/health/hlc | jq
   ```

3. **Check for a high logical counter**:

   A very high logical counter indicates frequent same-millisecond events (see the query sketch after these steps).

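To confirm the same-millisecond hypothesis from step 3, a query along these lines (a sketch; it uses the `created_at` column shown in the Section 1 queries as a proxy for enqueue time) surfaces bursts where several entries land in the same millisecond:

```sql
-- Sketch: millisecond buckets with multiple enqueues in the last hour.
SELECT date_trunc('milliseconds', created_at) AS ms_bucket,
       count(*) AS entries
FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
  AND created_at > now() - interval '1 hour'
GROUP BY 1
HAVING count(*) > 1
ORDER BY entries DESC
LIMIT 20;
```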
### Resolution

Usually self-correcting as the wall clock advances. If persistent:

1. Review the job submission rate
2. Consider horizontal scaling to distribute load

---
## 4. Merge Conflicts

### Symptoms

- Alert: `HlcMergeConflictRateHigh`
- Metric: `airgap_merge_conflicts_total` increasing
- Log: `Merge conflict: job {jobId} has conflicting payloads`

### Severity

**Warning** - May indicate duplicate job submissions or clock issues on offline nodes.

### Investigation Steps

1. **Identify conflict types**:

   ```promql
   sum by (conflict_type) (airgap_merge_conflicts_total)
   ```

2. **Review merge logs**:

   ```bash
   grep "merge conflict" /var/log/stellaops/scheduler.log | tail -100
   ```

3. **Check offline node clocks**:
   - Were offline nodes synchronized before disconnection?
   - How long were nodes offline?

### Resolution

1. **For duplicate jobs**: Use idempotency keys to prevent duplicates (see the duplicate check below)
2. **For payload conflicts**: Review job submission logic
3. **For ordering conflicts**: Verify NTP on all nodes before disconnection

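For the duplicate-job case, a quick way to confirm the diagnosis before introducing idempotency keys (a sketch over the columns shown in Section 1) is to look for `job_id` values that appear more than once in the merged log:

```sql
-- Sketch: job_ids with more than one chain entry after a merge.
SELECT job_id, count(*) AS occurrences
FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
GROUP BY job_id
HAVING count(*) > 1
ORDER BY occurrences DESC;
```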
---
## 5. Slow Air-Gap Sync

### Symptoms

- Alert: `HlcSyncDurationHigh`
- Metric: `airgap_sync_duration_seconds` p95 > 30s

### Severity

**Warning** - Delays job processing.

### Investigation Steps

1. **Check bundle sizes**:

   ```promql
   histogram_quantile(0.95, airgap_bundle_size_bytes_bucket)
   ```

2. **Check database performance**:

   ```sql
   SELECT * FROM pg_stat_activity
   WHERE state = 'active' AND query LIKE '%scheduler_log%';
   ```

3. **Review index usage**:

   ```sql
   EXPLAIN ANALYZE
   SELECT * FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant>'
   ORDER BY t_hlc
   LIMIT 1000;
   ```

### Resolution

1. **Chunk large bundles**: Split bundles > 10K entries (see the keyset-pagination sketch below)
2. **Optimize database**: Ensure indexes are used (see the index sketch below)
3. **Increase resources**: Scale up the database if needed

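Sketches for resolution steps 1 and 2, assuming PostgreSQL. The keyset-pagination query exports a bundle in bounded chunks ordered by `t_hlc`; the index statement is only needed if the `EXPLAIN ANALYZE` above shows a sequential scan and no equivalent index exists (the index name is illustrative):

```sql
-- Sketch for step 1: keyset pagination for chunked bundle export.
-- Repeat with the last t_hlc of the previous chunk until no rows return.
SELECT *
FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
  AND t_hlc > '<last_t_hlc_of_previous_chunk>'
ORDER BY t_hlc
LIMIT 10000;

-- Sketch for step 2: composite index matching the sync access pattern.
-- Skip if an equivalent index already exists; CONCURRENTLY avoids blocking
-- writes but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_scheduler_log_tenant_thlc
    ON scheduler.scheduler_log (tenant_id, t_hlc);
```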
---
## 6. No Enqueues

### Symptoms

- Alert: `HlcEnqueueRateZero`
- No jobs appearing in HLC queue

### Severity

**Info** - May be expected or indicate misconfiguration.

### Investigation Steps

1. **Check if HLC ordering is enabled**:

   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc'
   ```

2. **Verify service is receiving jobs** (a database-level check follows these steps):

   ```promql
   rate(scheduler_jobs_received_total[5m])
   ```

3. **Check for errors**:

   ```bash
   grep -i "hlc\|enqueue" /var/log/stellaops/scheduler.log | grep -i error
   ```
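As a database-level complement to step 2 (a sketch using the columns from Section 1), check whether any HLC log entries have been written recently for the tenant; this separates "no jobs are being submitted" from "jobs arrive but are not HLC-ordered":

```sql
-- Sketch: recent HLC enqueue activity for one tenant.
SELECT max(created_at) AS last_entry,
       count(*) FILTER (WHERE created_at > now() - interval '15 minutes')
           AS entries_last_15m
FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>';
```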
### Resolution

1. If HLC should be enabled:

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         EnableHlcOrdering: true
   ```

2. If dual-write mode is needed:

   ```yaml
   Scheduler:
     Queue:
       Hlc:
         DualWriteMode: true
   ```

---
## 7. Batch Snapshot Failures

### Symptoms

- Alert: `HlcBatchSnapshotFailures`
- Missing DSSE-signed batch proofs

### Severity

**Warning** - Audit proofs may be incomplete.

### Investigation Steps

1. **Check signing key**:

   ```bash
   stella signer status
   ```

2. **Verify DSSE configuration**:

   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc.batchSigning'
   ```

3. **Check database connectivity**:

   ```sql
   SELECT 1; -- Simple connectivity test
   ```

### Resolution

1. **Refresh signing credentials**
2. **Check certificate expiry**
3. **Verify database permissions for the batch_snapshots table** (see the query sketch below)

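A sketch for resolution step 3, assuming the table lives in the `scheduler` schema, carries a `created_at` column, and the service connects as `<service_role>` (all assumptions; substitute the real schema, column, and role). It checks recent snapshot activity and the privileges the service role actually holds:

```sql
-- Sketch: table, role, and created_at column names are assumptions;
-- adjust to the deployment before running.
SELECT count(*) AS snapshots_last_24h
FROM scheduler.batch_snapshots
WHERE created_at > now() - interval '24 hours';

SELECT has_table_privilege('<service_role>', 'scheduler.batch_snapshots', 'INSERT') AS can_insert,
       has_table_privilege('<service_role>', 'scheduler.batch_snapshots', 'SELECT') AS can_select;
```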
---
## 8. Duplicate Node ID

### Symptoms

- Alert: `HlcDuplicateNodeId`
- Multiple instances with the same node_id

### Severity

**Critical** - Will cause chain corruption.

### Investigation Steps

1. **Identify affected instances**:

   ```promql
   group by (node_id, instance) (hlc_ticks_total)
   ```

2. **Check node ID configuration**:

   ```bash
   # On each instance
   grep -r "NodeId" /etc/stellaops/
   ```

### Resolution

**Immediate action required**:

1. Stop one of the duplicate instances
2. Reconfigure with a unique node ID
3. Restart and verify
4. Check chain integrity for the affected time period (see the scoped query below)

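For step 4, the break-point query from Section 1 can be scoped to the window in which both instances were running (a sketch; `created_at` bounds are used because the overlap is usually known from deployment timestamps):

```sql
-- Sketch: re-run the chain break check, limited to the overlap window.
-- The first row in the window shows NULL expected_prev; ignore it.
WITH chain AS (
    SELECT job_id, t_hlc, prev_link, link,
           LAG(link) OVER (ORDER BY t_hlc) AS expected_prev
    FROM scheduler.scheduler_log
    WHERE tenant_id = '<tenant_id>'
      AND created_at BETWEEN '<overlap_start>' AND '<overlap_end>'
)
SELECT * FROM chain
WHERE prev_link IS DISTINCT FROM expected_prev
ORDER BY t_hlc;
```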
---
## Escalation Matrix

| Issue | First Responder | Escalation L2 | Escalation L3 |
|-------|-----------------|---------------|---------------|
| Chain verification failure | On-call SRE | Scheduler team | Security team |
| Clock skew | On-call SRE | Infrastructure | Architecture |
| Merge conflicts | On-call SRE | Scheduler team | - |
| Performance issues | On-call SRE | Database team | - |
| Duplicate node ID | On-call SRE | Scheduler team | - |

---

## Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-01-07 | Agent | Initial release |