audit work, fixed StellaOps.sln warnings/errors, fixed tests, sprints work, new advisories

docs/operations/runbooks/hlc-troubleshooting.md (new file)
@@ -0,0 +1,385 @@
# HLC Troubleshooting Runbook

> **Version**: 1.0.0
> **Sprint**: SPRINT_20260105_002_004_BE
> **Last Updated**: 2026-01-07

This runbook covers troubleshooting procedures for Hybrid Logical Clock (HLC) based queue ordering in StellaOps.

---

## Table of Contents

1. [Chain Verification Failure](#1-chain-verification-failure)
2. [Clock Skew Issues](#2-clock-skew-issues)
3. [Time Offset Drift](#3-time-offset-drift)
4. [Merge Conflicts](#4-merge-conflicts)
5. [Slow Air-Gap Sync](#5-slow-air-gap-sync)
6. [No Enqueues](#6-no-enqueues)
7. [Batch Snapshot Failures](#7-batch-snapshot-failures)
8. [Duplicate Node ID](#8-duplicate-node-id)

---

## 1. Chain Verification Failure

### Symptoms
- Alert: `HlcChainVerificationFailure`
- Metric: `scheduler_chain_verification_failures_total` increasing
- Log: `Chain verification failed: expected {expected}, got {actual}`

### Severity
**Critical** - Indicates potential data tampering or corruption.

### Investigation Steps

1. **Identify the affected chain segment**:
   ```sql
   SELECT
       job_id,
       t_hlc,
       encode(prev_link, 'hex') as prev_link,
       encode(link, 'hex') as link,
       created_at
   FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant_id>'
   ORDER BY t_hlc DESC
   LIMIT 100;
   ```

2. **Find the break point**:
   ```sql
   WITH chain AS (
       SELECT
           job_id,
           t_hlc,
           prev_link,
           link,
           LAG(link) OVER (ORDER BY t_hlc) as expected_prev
       FROM scheduler.scheduler_log
       WHERE tenant_id = '<tenant_id>'
   )
   SELECT * FROM chain
   WHERE prev_link IS DISTINCT FROM expected_prev
   ORDER BY t_hlc;
   ```

3. **Check for unauthorized modifications** (see the sketch below):
   - Review database audit logs
   - Check for direct SQL updates bypassing the application
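   If no dedicated audit extension is installed, a minimal sketch using PostgreSQL's statistics views can flag unexpected write activity (with `pgaudit` or similar available, query its log instead):
   ```sql
   -- Sketch: write activity on scheduler tables since the last stats reset.
   -- An append-only log should show near-zero updates/deletes; anything
   -- else warrants escalation.
   SELECT relname, n_tup_upd, n_tup_del
   FROM pg_stat_user_tables
   WHERE schemaname = 'scheduler';
   ```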

4. **Verify chain head consistency**:
   ```sql
   SELECT * FROM scheduler.chain_heads
   WHERE tenant_id = '<tenant_id>';
   ```

### Resolution

**If corruption is isolated**:
1. Mark affected jobs for re-processing
2. Rebuild chain from the last valid point
3. Update chain head (see the sketch below)
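
A minimal sketch of the chain-head rebuild, assuming `chain_heads` carries a `link` column mirroring `scheduler_log.link` (verify column names against the actual schema before running):

```sql
-- Sketch: point the chain head at the last entry before the break.
-- '<first_broken_t_hlc>' is the t_hlc reported by the break-point query above.
UPDATE scheduler.chain_heads
SET link = (
    SELECT link
    FROM scheduler.scheduler_log
    WHERE tenant_id = '<tenant_id>'
      AND t_hlc < '<first_broken_t_hlc>'
    ORDER BY t_hlc DESC
    LIMIT 1
)
WHERE tenant_id = '<tenant_id>';
```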

**If tampering is suspected**:
1. Escalate to Security team immediately
2. Preserve all logs and database state
3. Initiate incident response procedure

---

## 2. Clock Skew Issues

### Symptoms
- Alert: `HlcClockSkewExceedsTolerance`
- Metric: `hlc_clock_skew_rejections_total` increasing
- Log: `Clock skew exceeds tolerance: {skew}ms > {tolerance}ms`

### Severity
**Critical** - Can cause job ordering inconsistencies.

### Investigation Steps

1. **Check NTP synchronization**:
   ```bash
   # On affected node
   timedatectl status
   ntpq -p
   chronyc tracking  # if using chrony
   ```

2. **Verify time sources**:
   ```bash
   ntpq -pn
   ```

3. **Check for leap second issues**:
   ```bash
   dmesg | grep -i leap
   ```

4. **Compare with other nodes**:
   ```bash
   for node in node-1 node-2 node-3; do
     echo "$node: $(ssh $node date +%s.%N)"
   done
   ```

### Resolution

1. **Restart NTP client**:
   ```bash
   sudo systemctl restart chronyd  # or ntpd
   ```

2. **Force time sync**:
   ```bash
   sudo chronyc makestep
   ```

3. **Temporarily increase tolerance** (emergency only):
   ```yaml
   Scheduler:
     Queue:
       Hlc:
         MaxClockSkewMs: 60000  # Increase from default 5000
   ```

4. **Restart affected service** to reset HLC state, for example:
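   ```bash
   # Sketch: restart the scheduler to rebuild HLC state.
   # Unit name is an assumption — match it to your deployment.
   sudo systemctl restart stellaops-scheduler
   ```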

---

## 3. Time Offset Drift

### Symptoms
- Alert: `HlcPhysicalTimeOffset`
- Metric: `hlc_physical_time_offset_seconds` > 0.5

### Severity
**Warning** - May cause timestamp anomalies in diagnostics.

### Investigation Steps

1. **Check current offset**:
   ```promql
   hlc_physical_time_offset_seconds{node_id="<node>"}
   ```

2. **Review HLC state**:
   ```bash
   curl -s http://localhost:5000/health/hlc | jq
   ```

3. **Check for high logical counter**:
   A very high logical counter indicates frequent same-millisecond events, for example:
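   ```bash
   # Sketch: pull just the counter from the health payload.
   # Field name is assumed — check the actual /health/hlc response shape.
   curl -s http://localhost:5000/health/hlc | jq '.logicalCounter'
   ```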

### Resolution

Usually self-correcting as the wall clock advances. If persistent:
1. Review the job submission rate (see the query below)
2. Consider horizontal scaling to distribute load
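
A quick check of the submission rate, using the same metric referenced in section 6:

```promql
rate(scheduler_jobs_received_total[5m])
```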

---

## 4. Merge Conflicts

### Symptoms
- Alert: `HlcMergeConflictRateHigh`
- Metric: `airgap_merge_conflicts_total` increasing
- Log: `Merge conflict: job {jobId} has conflicting payloads`

### Severity
**Warning** - May indicate duplicate job submissions or clock issues on offline nodes.

### Investigation Steps

1. **Identify conflict types**:
   ```promql
   sum by (conflict_type) (airgap_merge_conflicts_total)
   ```

2. **Review merge logs**:
   ```bash
   grep "merge conflict" /var/log/stellaops/scheduler.log | tail -100
   ```

3. **Check offline node clocks**:
   - Were offline nodes synchronized before disconnection?
   - How long were nodes offline?

### Resolution

1. **For duplicate jobs**: Use idempotency keys to prevent duplicates
2. **For payload conflicts**: Review job submission logic (the sketch below helps identify the noisiest jobs)
3. **For ordering conflicts**: Verify NTP on all nodes before disconnection
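
A minimal sketch for ranking the job IDs that conflict most often, assuming the log format shown under Symptoms:

```bash
# Sketch: count merge conflicts per job ID and show the top offenders.
grep -i "merge conflict" /var/log/stellaops/scheduler.log \
  | grep -o 'job [^ ]*' | sort | uniq -c | sort -rn | head -20
```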

---

## 5. Slow Air-Gap Sync

### Symptoms
- Alert: `HlcSyncDurationHigh`
- Metric: `airgap_sync_duration_seconds` p95 > 30s

### Severity
**Warning** - Delays job processing.

### Investigation Steps

1. **Check bundle sizes**:
   ```promql
   histogram_quantile(0.95, airgap_bundle_size_bytes_bucket)
   ```

2. **Check database performance**:
   ```sql
   SELECT * FROM pg_stat_activity
   WHERE state = 'active' AND query LIKE '%scheduler_log%';
   ```

3. **Review index usage**:
   ```sql
   EXPLAIN ANALYZE
   SELECT * FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant>'
   ORDER BY t_hlc
   LIMIT 1000;
   ```

### Resolution

1. **Chunk large bundles**: Split bundles > 10K entries
2. **Optimize database**: Ensure indexes are used (see the sketch below)
3. **Increase resources**: Scale up the database if needed
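
If the `EXPLAIN ANALYZE` above shows a sequential scan, a composite index over the queried columns is the usual fix. A sketch, assuming no equivalent index already exists under another name (check the schema migrations first):

```sql
-- Sketch: supports the tenant_id filter + t_hlc ordering used by sync reads.
-- CONCURRENTLY avoids blocking writes but cannot run inside a transaction.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_scheduler_log_tenant_thlc
    ON scheduler.scheduler_log (tenant_id, t_hlc);
```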

---

## 6. No Enqueues

### Symptoms
- Alert: `HlcEnqueueRateZero`
- No jobs appearing in HLC queue

### Severity
**Info** - May be expected, or may indicate misconfiguration.

### Investigation Steps

1. **Check if HLC ordering is enabled**:
   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc'
   ```

2. **Verify service is receiving jobs**:
   ```promql
   rate(scheduler_jobs_received_total[5m])
   ```

3. **Check for errors**:
   ```bash
   grep -i "hlc\|enqueue" /var/log/stellaops/scheduler.log | grep -i error
   ```

### Resolution

1. If HLC should be enabled:
   ```yaml
   Scheduler:
     Queue:
       Hlc:
         EnableHlcOrdering: true
   ```

2. If dual-write mode is needed:
   ```yaml
   Scheduler:
     Queue:
       Hlc:
         DualWriteMode: true
   ```

---

## 7. Batch Snapshot Failures

### Symptoms
- Alert: `HlcBatchSnapshotFailures`
- Missing DSSE-signed batch proofs

### Severity
**Warning** - Audit proofs may be incomplete.

### Investigation Steps

1. **Check signing key**:
   ```bash
   stella signer status
   ```

2. **Verify DSSE configuration**:
   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc.batchSigning'
   ```

3. **Check database connectivity**:
   ```sql
   SELECT 1; -- Simple connectivity test
   ```

### Resolution

1. **Refresh signing credentials**
2. **Check certificate expiry** (see the sketch below)
3. **Verify database permissions for the `batch_snapshots` table** (see the query below)
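
A sketch for the expiry check, assuming a PEM certificate on disk (the path is hypothetical — take the real one from the signer configuration):

```bash
# Sketch: print the expiry date of the batch-signing certificate.
openssl x509 -enddate -noout -in /etc/stellaops/signing/batch-signing.crt
```

And for the permissions check, assuming the table lives in the `scheduler` schema:

```sql
-- Sketch: list which roles can write to batch_snapshots.
SELECT grantee, privilege_type
FROM information_schema.table_privileges
WHERE table_schema = 'scheduler' AND table_name = 'batch_snapshots';
```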

---

## 8. Duplicate Node ID

### Symptoms
- Alert: `HlcDuplicateNodeId`
- Multiple instances with the same `node_id`

### Severity
**Critical** - Will cause chain corruption.

### Investigation Steps

1. **Identify affected instances**:
   ```promql
   group by (node_id, instance) (hlc_ticks_total)
   ```

2. **Check node ID configuration**:
   ```bash
   # On each instance
   grep -r "NodeId" /etc/stellaops/
   ```

### Resolution

**Immediate action required**:
1. Stop one of the duplicate instances
2. Reconfigure with a unique node ID (see the sketch below)
3. Restart and verify
4. Check chain integrity for the affected time period
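
A configuration sketch, following the YAML shape used elsewhere in this runbook (the exact key placement is assumed — confirm against your configuration schema):

```yaml
Scheduler:
  Queue:
    Hlc:
      NodeId: "scheduler-node-02"  # must be unique across all instances
```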

---

## Escalation Matrix

| Issue | First Responder | Escalation L2 | Escalation L3 |
|-------|-----------------|---------------|---------------|
| Chain verification failure | On-call SRE | Scheduler team | Security team |
| Clock skew | On-call SRE | Infrastructure | Architecture |
| Merge conflicts | On-call SRE | Scheduler team | - |
| Performance issues | On-call SRE | Database team | - |
| Duplicate node ID | On-call SRE | Scheduler team | - |

---

## Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-01-07 | Agent | Initial release |

@@ -13,19 +13,19 @@ Status: DRAFT — pending policy-registry overlay and production digests. Use fo
 - Prod: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json`
 - Confirm `.gitea/workflows/release-manifest-verify.yml` is green for the target manifest change.
 2) Render deployment plan (no apply yet)
-- Helm: `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml`
-- Compose (dev): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml`
+- Helm: `helm template stellaops ./devops/helm/stellaops -f devops/helm/stellaops/values-prod.yaml -f devops/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml`
+- Compose (dev): `USE_MOCK=1 devops/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f devops/compose/docker-compose.dev.yaml -f devops/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml`
 3) Backups
-- Run `deploy/compose/scripts/backup.sh` before production rollout; archive PostgreSQL/Redis/ObjectStore snapshots to the regulated vault.
+- Run `devops/compose/scripts/backup.sh` before production rollout; archive PostgreSQL/Redis/ObjectStore snapshots to the regulated vault.
 
 ## Canary publish → promote
 1) Prepare override (temporary)
-- Create `deploy/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
+- Create `devops/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
 - Keep `mock.enabled=false`; only use real digests when available.
 2) Dry-run render
-- `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml`
+- `helm template stellaops ./devops/helm/stellaops -f devops/helm/stellaops/values-prod.yaml -f devops/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml`
 3) Apply canary
-- `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m`
+- `helm upgrade --install stellaops ./devops/helm/stellaops -f devops/helm/stellaops/values-prod.yaml -f devops/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m`
 - Monitor: `kubectl logs deployment/policy-registry -n stellaops --tail=200 -f` and readiness probes; rollback on errors.
 4) Promote
 - Remove the canary override from the release branch; rerender with `values-prod.yaml` only and redeploy.

@@ -7,12 +7,12 @@ Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production rollouts
 - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
 - Prod: rerun against `deploy/releases/2025.09-stable.yaml` once VEX digests land.
 2) Render plan
-- Helm (mock overlay): `helm template vex-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vex-mock.yaml`
-- Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/vex-compose.yaml`
+- Helm (mock overlay): `helm template vex-mock ./devops/helm/stellaops -f devops/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vex-mock.yaml`
+- Compose (dev with overlay): `USE_MOCK=1 devops/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f devops/compose/docker-compose.dev.yaml -f devops/compose/docker-compose.mock.yaml config > /tmp/vex-compose.yaml`
 3) Backups (when touching prod data) — not required for mock, but in prod take PostgreSQL snapshots for issuer-directory and VEX state before rollout.
 
 ## Deploy (mock path)
-- Helm dry-run already covers structural checks. To apply in a dev cluster: `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
+- Helm dry-run already covers structural checks. To apply in a dev cluster: `helm upgrade --install stellaops ./devops/helm/stellaops -f devops/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
 - Observe VEX Lens pod logs: `kubectl logs deploy/vex-lens -n stellaops --tail=200 -f`.
 - Issuer Directory seed: ensure `issuer-directory-config` ConfigMap includes `csaf-publishers.json`; mock overlay already mounts default seed.
 

@@ -10,13 +10,13 @@ Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production steps ne
 - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
 - Prod: rerun against `deploy/releases/2025.09-stable.yaml` once ledger/api digests land.
 2) Render plan
-- Helm (mock overlay): `helm template vuln-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vuln-mock.yaml`
-- Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml config > /tmp/vuln-compose.yaml`
+- Helm (mock overlay): `helm template vuln-mock ./devops/helm/stellaops -f devops/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vuln-mock.yaml`
+- Compose (dev with overlay): `USE_MOCK=1 devops/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml config > /tmp/vuln-compose.yaml`
 3) Backups (prod only)
 - PostgreSQL dump for Findings Ledger DB; copy object-store buckets tied to projector anchors.
 
 ## Deploy (mock path)
-- Helm apply (dev): `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
+- Helm apply (dev): `helm upgrade --install stellaops ./devops/helm/stellaops -f devops/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
 - Compose: quickstart already starts ledger + vuln API with mock pins; validate health at `https://localhost:8443/swagger` (dev certs).
 
 ## Incident drills