# HLC Troubleshooting Runbook
> **Version**: 1.0.0
> **Sprint**: SPRINT_20260105_002_004_BE
> **Last Updated**: 2026-01-07

This runbook covers troubleshooting procedures for Hybrid Logical Clock (HLC)-based queue ordering in StellaOps.

---
## Table of Contents
1. [Chain Verification Failure](#1-chain-verification-failure)
2. [Clock Skew Issues](#2-clock-skew-issues)
3. [Time Offset Drift](#3-time-offset-drift)
4. [Merge Conflicts](#4-merge-conflicts)
5. [Slow Air-Gap Sync](#5-slow-air-gap-sync)
6. [No Enqueues](#6-no-enqueues)
7. [Batch Snapshot Failures](#7-batch-snapshot-failures)
8. [Duplicate Node ID](#8-duplicate-node-id)
---
## 1. Chain Verification Failure
### Symptoms
- Alert: `HlcChainVerificationFailure`
- Metric: `scheduler_chain_verification_failures_total` increasing
- Log: `Chain verification failed: expected {expected}, got {actual}`
### Severity
**Critical** - Indicates potential data tampering or corruption.
### Investigation Steps
1. **Identify the affected chain segment**:
```sql
SELECT
    job_id,
    t_hlc,
    encode(prev_link, 'hex') AS prev_link,
    encode(link, 'hex') AS link,
    created_at
FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
ORDER BY t_hlc DESC
LIMIT 100;
```
2. **Find the break point**:
```sql
WITH chain AS (
    SELECT
        job_id,
        t_hlc,
        prev_link,
        link,
        LAG(link) OVER (ORDER BY t_hlc) AS expected_prev
    FROM scheduler.scheduler_log
    WHERE tenant_id = '<tenant_id>'
)
-- Note: the earliest row has no LAG() predecessor and may appear as a false positive.
SELECT * FROM chain
WHERE prev_link IS DISTINCT FROM expected_prev
ORDER BY t_hlc;
```
3. **Check for unauthorized modifications**:
- Review database audit logs
- Check for direct SQL updates bypassing the application
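
One quick, non-invasive cross-check, assuming the log table is append-only by design, is PostgreSQL's table statistics; non-zero update or delete counters on `scheduler_log` suggest writes made outside the application path:
```sql
-- An append-only log should accumulate no in-place updates or deletes.
SELECT schemaname, relname, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE schemaname = 'scheduler' AND relname = 'scheduler_log';
```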
4. **Verify chain head consistency**:
```sql
SELECT * FROM scheduler.chain_heads
WHERE tenant_id = '<tenant_id>';
```
### Resolution
**If corruption is isolated**:
1. Mark affected jobs for re-processing
2. Rebuild chain from the last valid point
3. Update chain head
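
If the corruption is isolated, steps 2 and 3 amount to resetting the head to the last entry that still verifies. A rough sketch, assuming `chain_heads` stores the current head in a `head_link` column (verify against the actual schema before running):
```sql
-- Last valid entry before the first mismatch reported by the break-point query
SELECT job_id, t_hlc, encode(link, 'hex') AS last_valid_link
FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
  AND t_hlc < '<t_hlc_of_first_mismatch>'
ORDER BY t_hlc DESC
LIMIT 1;

-- Reset the chain head to that link (column name is an assumption)
UPDATE scheduler.chain_heads
SET head_link = decode('<last_valid_link>', 'hex')
WHERE tenant_id = '<tenant_id>';
```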
**If tampering is suspected**:
1. Escalate to Security team immediately
2. Preserve all logs and database state
3. Initiate incident response procedure
---
## 2. Clock Skew Issues
### Symptoms
- Alert: `HlcClockSkewExceedsTolerance`
- Metric: `hlc_clock_skew_rejections_total` increasing
- Log: `Clock skew exceeds tolerance: {skew}ms > {tolerance}ms`
### Severity
**Critical** - Can cause job ordering inconsistencies.
### Investigation Steps
1. **Check NTP synchronization**:
```bash
# On affected node
timedatectl status
ntpq -p
chronyc tracking # if using chrony
```
2. **Verify time sources**:
```bash
ntpq -pn
```
3. **Check for leap second issues**:
```bash
dmesg | grep -i leap
```
4. **Compare with other nodes**:
```bash
for node in node-1 node-2 node-3; do
  echo "$node: $(ssh "$node" date +%s.%N)"
done
```
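
The database server's clock can be included in the same comparison as an additional reference point (an informal cross-check, not a substitute for proper NTP monitoring):
```sql
SELECT now() AS db_wall_clock,
       extract(epoch FROM clock_timestamp()) AS db_epoch_seconds;
```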
### Resolution
1. **Restart NTP client**:
```bash
sudo systemctl restart chronyd # or ntpd
```
2. **Force time sync**:
```bash
sudo chronyc makestep
```
3. **Temporarily increase tolerance** (emergency only):
```yaml
Scheduler:
  Queue:
    Hlc:
      MaxClockSkewMs: 60000  # Increase from default 5000
```
4. **Restart affected service** to reset HLC state.
---
## 3. Time Offset Drift
### Symptoms
- Alert: `HlcPhysicalTimeOffset`
- Metric: `hlc_physical_time_offset_seconds` > 0.5
### Severity
**Warning** - May cause timestamp anomalies in diagnostics.
### Investigation Steps
1. **Check current offset**:
```promql
hlc_physical_time_offset_seconds{node_id="<node>"}
```
2. **Review HLC state**:
```bash
curl -s http://localhost:5000/health/hlc | jq
```
3. **Check for high logical counter**:
If the logical counter is very high, the node is generating many events within the same physical millisecond; the counter only increments when the wall clock has not advanced between ticks.
### Resolution
This usually self-corrects as the wall clock advances. If it persists:
1. Review job submission rate
2. Consider horizontal scaling to distribute load
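
To gauge how often same-millisecond bursts occur, the log itself can be sampled (a rough sketch, assuming `created_at` carries at least millisecond precision):
```sql
SELECT date_trunc('milliseconds', created_at) AS ms_bucket,
       count(*) AS entries
FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
  AND created_at > now() - interval '1 hour'
GROUP BY 1
ORDER BY entries DESC
LIMIT 10;
```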
---
## 4. Merge Conflicts
### Symptoms
- Alert: `HlcMergeConflictRateHigh`
- Metric: `airgap_merge_conflicts_total` increasing
- Log: `Merge conflict: job {jobId} has conflicting payloads`
### Severity
**Warning** - May indicate duplicate job submissions or clock issues on offline nodes.
### Investigation Steps
1. **Identify conflict types**:
```promql
sum by (conflict_type) (airgap_merge_conflicts_total)
```
2. **Review merge logs**:
```bash
grep "merge conflict" /var/log/stellaops/scheduler.log | tail -100
```
3. **Check offline node clocks**:
- Were offline nodes synchronized before disconnection?
- How long were nodes offline?
### Resolution
1. **For duplicate jobs**: Use idempotency keys to prevent duplicates
2. **For payload conflicts**: Review job submission logic
3. **For ordering conflicts**: Verify NTP on all nodes before disconnection
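
For the duplicate-job case, one way to enforce idempotency at the database level is a partial unique index; both the target table and the `idempotency_key` column below are illustrative assumptions, so adapt them to the actual job intake schema:
```sql
CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS ux_scheduler_log_idempotency
    ON scheduler.scheduler_log (tenant_id, idempotency_key)
    WHERE idempotency_key IS NOT NULL;
```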
---
## 5. Slow Air-Gap Sync
### Symptoms
- Alert: `HlcSyncDurationHigh`
- Metric: `airgap_sync_duration_seconds` p95 > 30s
### Severity
**Warning** - Delays job processing.
### Investigation Steps
1. **Check bundle sizes**:
```promql
histogram_quantile(0.95, sum by (le) (rate(airgap_bundle_size_bytes_bucket[5m])))
```
2. **Check database performance**:
```sql
SELECT * FROM pg_stat_activity
WHERE state = 'active' AND query LIKE '%scheduler_log%';
```
3. **Review index usage**:
```sql
EXPLAIN ANALYZE
SELECT * FROM scheduler.scheduler_log
WHERE tenant_id = '<tenant_id>'
ORDER BY t_hlc
LIMIT 1000;
```
### Resolution
1. **Chunk large bundles**: Split bundles > 10K entries
2. **Optimize database**: Ensure indexes are used
3. **Increase resources**: Scale up database if needed
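
If the `EXPLAIN ANALYZE` above shows a sequential scan, a composite index matching the sync query's access pattern usually helps (sketch only; confirm an equivalent index does not already exist under another name):
```sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_scheduler_log_tenant_hlc
    ON scheduler.scheduler_log (tenant_id, t_hlc);
```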
---
## 6. No Enqueues
### Symptoms
- Alert: `HlcEnqueueRateZero`
- No jobs appearing in HLC queue
### Severity
**Info** - May be expected or indicate misconfiguration.
### Investigation Steps
1. **Check if HLC ordering is enabled**:
```bash
curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc'
```
2. **Verify service is receiving jobs**:
```promql
rate(scheduler_jobs_received_total[5m])
```
3. **Check for errors**:
```bash
grep -i "hlc\|enqueue" /var/log/stellaops/scheduler.log | grep -i error
```
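
A direct database check can confirm whether anything has been written recently, using the same log table as section 1:
```sql
SELECT tenant_id, max(created_at) AS last_enqueue
FROM scheduler.scheduler_log
GROUP BY tenant_id
ORDER BY last_enqueue DESC;
```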
### Resolution
1. If HLC should be enabled:
```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: true
```
2. If dual-write mode is needed:
```yaml
Scheduler:
  Queue:
    Hlc:
      DualWriteMode: true
```
---
## 7. Batch Snapshot Failures
### Symptoms
- Alert: `HlcBatchSnapshotFailures`
- Missing DSSE-signed batch proofs
### Severity
**Warning** - Audit proofs may be incomplete.
### Investigation Steps
1. **Check signing key**:
```bash
stella signer status
```
2. **Verify DSSE configuration**:
```bash
curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc.batchSigning'
```
3. **Check database connectivity**:
```sql
SELECT 1; -- Simple connectivity test
```
### Resolution
1. **Refresh signing credentials**
2. **Check certificate expiry**
3. **Verify database permissions for batch_snapshots table**
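
Permissions on the snapshot table can be checked directly; the role name below is a placeholder for whatever account the scheduler connects as, and the schema qualification assumes the table lives alongside the other scheduler tables:
```sql
SELECT has_table_privilege('<scheduler_db_role>',
                           'scheduler.batch_snapshots',
                           'INSERT, SELECT') AS can_write_snapshots;
```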
---
## 8. Duplicate Node ID
### Symptoms
- Alert: `HlcDuplicateNodeId`
- Multiple instances with same node_id
### Severity
**Critical** - Will cause chain corruption.
### Investigation Steps
1. **Identify affected instances**:
```promql
group by (node_id, instance) (hlc_ticks_total)
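# Optional refinement: node_ids reported by more than one instance indicate duplicates
count by (node_id) (group by (node_id, instance) (hlc_ticks_total)) > 1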
```
2. **Check node ID configuration**:
```bash
# On each instance
grep -r "NodeId" /etc/stellaops/
```
### Resolution
**Immediate action required**:
1. Stop one of the duplicate instances
2. Reconfigure with unique node ID
3. Restart and verify
4. Check chain integrity for affected time period
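
Chain integrity for the affected window can then be checked with the same break-point query from section 1, bounded to the overlap period:
```sql
WITH chain AS (
    SELECT
        job_id,
        t_hlc,
        prev_link,
        link,
        LAG(link) OVER (ORDER BY t_hlc) AS expected_prev
    FROM scheduler.scheduler_log
    WHERE tenant_id = '<tenant_id>'
      AND created_at BETWEEN '<overlap_start>' AND '<overlap_end>'
)
-- The first row inside the window has no LAG() predecessor and can be ignored.
SELECT * FROM chain
WHERE prev_link IS DISTINCT FROM expected_prev
ORDER BY t_hlc;
```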
---
## Escalation Matrix
| Issue | First Responder | Escalation L2 | Escalation L3 |
|-------|-----------------|---------------|---------------|
| Chain verification failure | On-call SRE | Scheduler team | Security team |
| Clock skew | On-call SRE | Infrastructure | Architecture |
| Merge conflicts | On-call SRE | Scheduler team | - |
| Performance issues | On-call SRE | Database team | - |
| Duplicate node ID | On-call SRE | Scheduler team | - |
---
## Revision History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-01-07 | Agent | Initial release |