audit work, fixed StellaOps.sln warnings/errors, fixed tests, sprints work, new advisories

docs/operations/runbooks/hlc-troubleshooting.md (new file)
@@ -0,0 +1,385 @@
# HLC Troubleshooting Runbook

> **Version**: 1.0.0
> **Sprint**: SPRINT_20260105_002_004_BE
> **Last Updated**: 2026-01-07

This runbook covers troubleshooting procedures for Hybrid Logical Clock (HLC) based queue ordering in StellaOps.

---

## Table of Contents

1. [Chain Verification Failure](#1-chain-verification-failure)
2. [Clock Skew Issues](#2-clock-skew-issues)
3. [Time Offset Drift](#3-time-offset-drift)
4. [Merge Conflicts](#4-merge-conflicts)
5. [Slow Air-Gap Sync](#5-slow-air-gap-sync)
6. [No Enqueues](#6-no-enqueues)
7. [Batch Snapshot Failures](#7-batch-snapshot-failures)
8. [Duplicate Node ID](#8-duplicate-node-id)

---

## 1. Chain Verification Failure

### Symptoms
- Alert: `HlcChainVerificationFailure`
- Metric: `scheduler_chain_verification_failures_total` increasing
- Log: `Chain verification failed: expected {expected}, got {actual}`

### Severity
**Critical** - Indicates potential data tampering or corruption.

### Investigation Steps

1. **Identify the affected chain segment**:
   ```sql
   SELECT
       job_id,
       t_hlc,
       encode(prev_link, 'hex') as prev_link,
       encode(link, 'hex') as link,
       created_at
   FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant_id>'
   ORDER BY t_hlc DESC
   LIMIT 100;
   ```

2. **Find the break point**:
   ```sql
   WITH chain AS (
       SELECT
           job_id,
           t_hlc,
           prev_link,
           link,
           LAG(link) OVER (ORDER BY t_hlc) as expected_prev
       FROM scheduler.scheduler_log
       WHERE tenant_id = '<tenant_id>'
   )
   SELECT * FROM chain
   WHERE prev_link IS DISTINCT FROM expected_prev
   ORDER BY t_hlc;
   ```

3. **Check for unauthorized modifications** (see the sketch below):
   - Review database audit logs
   - Check for direct SQL updates bypassing the application
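   If no dedicated audit extension is installed, a minimal sketch using PostgreSQL's statistics views can flag unexpected write activity (with `pgaudit` or similar available, query its log instead):
   ```sql
   -- Sketch: write activity on scheduler tables since the last stats reset.
   -- An append-only log should show near-zero updates/deletes; anything
   -- else warrants escalation.
   SELECT relname, n_tup_upd, n_tup_del
   FROM pg_stat_user_tables
   WHERE schemaname = 'scheduler';
   ```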

4. **Verify chain head consistency**:
   ```sql
   SELECT * FROM scheduler.chain_heads
   WHERE tenant_id = '<tenant_id>';
   ```

### Resolution

**If corruption is isolated**:
1. Mark affected jobs for re-processing
2. Rebuild chain from the last valid point
3. Update chain head (see the sketch below)
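
A minimal sketch of the chain-head rebuild, assuming `chain_heads` carries a `link` column mirroring `scheduler_log.link` (verify column names against the actual schema before running):

```sql
-- Sketch: point the chain head at the last entry before the break.
-- '<first_broken_t_hlc>' is the t_hlc reported by the break-point query above.
UPDATE scheduler.chain_heads
SET link = (
    SELECT link
    FROM scheduler.scheduler_log
    WHERE tenant_id = '<tenant_id>'
      AND t_hlc < '<first_broken_t_hlc>'
    ORDER BY t_hlc DESC
    LIMIT 1
)
WHERE tenant_id = '<tenant_id>';
```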

**If tampering is suspected**:
1. Escalate to Security team immediately
2. Preserve all logs and database state
3. Initiate incident response procedure

---

## 2. Clock Skew Issues

### Symptoms
- Alert: `HlcClockSkewExceedsTolerance`
- Metric: `hlc_clock_skew_rejections_total` increasing
- Log: `Clock skew exceeds tolerance: {skew}ms > {tolerance}ms`

### Severity
**Critical** - Can cause job ordering inconsistencies.

### Investigation Steps

1. **Check NTP synchronization**:
   ```bash
   # On affected node
   timedatectl status
   ntpq -p
   chronyc tracking  # if using chrony
   ```

2. **Verify time sources**:
   ```bash
   ntpq -pn
   ```

3. **Check for leap second issues**:
   ```bash
   dmesg | grep -i leap
   ```

4. **Compare with other nodes**:
   ```bash
   for node in node-1 node-2 node-3; do
     echo "$node: $(ssh $node date +%s.%N)"
   done
   ```

### Resolution

1. **Restart NTP client**:
   ```bash
   sudo systemctl restart chronyd  # or ntpd
   ```

2. **Force time sync**:
   ```bash
   sudo chronyc makestep
   ```

3. **Temporarily increase tolerance** (emergency only):
   ```yaml
   Scheduler:
     Queue:
       Hlc:
         MaxClockSkewMs: 60000  # Increase from default 5000
   ```

4. **Restart affected service** to reset HLC state, for example:
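   ```bash
   # Sketch: restart the scheduler to rebuild HLC state.
   # Unit name is an assumption — match it to your deployment.
   sudo systemctl restart stellaops-scheduler
   ```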

---

## 3. Time Offset Drift

### Symptoms
- Alert: `HlcPhysicalTimeOffset`
- Metric: `hlc_physical_time_offset_seconds` > 0.5

### Severity
**Warning** - May cause timestamp anomalies in diagnostics.

### Investigation Steps

1. **Check current offset**:
   ```promql
   hlc_physical_time_offset_seconds{node_id="<node>"}
   ```

2. **Review HLC state**:
   ```bash
   curl -s http://localhost:5000/health/hlc | jq
   ```

3. **Check for high logical counter**:
   A very high logical counter indicates frequent same-millisecond events, for example:
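   ```bash
   # Sketch: pull just the counter from the health payload.
   # Field name is assumed — check the actual /health/hlc response shape.
   curl -s http://localhost:5000/health/hlc | jq '.logicalCounter'
   ```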

### Resolution

Usually self-correcting as the wall clock advances. If persistent:
1. Review the job submission rate (see the query below)
2. Consider horizontal scaling to distribute load
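
A quick check of the submission rate, using the same metric referenced in section 6:

```promql
rate(scheduler_jobs_received_total[5m])
```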

---

## 4. Merge Conflicts

### Symptoms
- Alert: `HlcMergeConflictRateHigh`
- Metric: `airgap_merge_conflicts_total` increasing
- Log: `Merge conflict: job {jobId} has conflicting payloads`

### Severity
**Warning** - May indicate duplicate job submissions or clock issues on offline nodes.

### Investigation Steps

1. **Identify conflict types**:
   ```promql
   sum by (conflict_type) (airgap_merge_conflicts_total)
   ```

2. **Review merge logs**:
   ```bash
   grep "merge conflict" /var/log/stellaops/scheduler.log | tail -100
   ```

3. **Check offline node clocks**:
   - Were offline nodes synchronized before disconnection?
   - How long were nodes offline?

### Resolution

1. **For duplicate jobs**: Use idempotency keys to prevent duplicates
2. **For payload conflicts**: Review job submission logic (the sketch below helps identify the noisiest jobs)
3. **For ordering conflicts**: Verify NTP on all nodes before disconnection
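
A minimal sketch for ranking the job IDs that conflict most often, assuming the log format shown under Symptoms:

```bash
# Sketch: count merge conflicts per job ID and show the top offenders.
grep -i "merge conflict" /var/log/stellaops/scheduler.log \
  | grep -o 'job [^ ]*' | sort | uniq -c | sort -rn | head -20
```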

---

## 5. Slow Air-Gap Sync

### Symptoms
- Alert: `HlcSyncDurationHigh`
- Metric: `airgap_sync_duration_seconds` p95 > 30s

### Severity
**Warning** - Delays job processing.

### Investigation Steps

1. **Check bundle sizes**:
   ```promql
   histogram_quantile(0.95, airgap_bundle_size_bytes_bucket)
   ```

2. **Check database performance**:
   ```sql
   SELECT * FROM pg_stat_activity
   WHERE state = 'active' AND query LIKE '%scheduler_log%';
   ```

3. **Review index usage**:
   ```sql
   EXPLAIN ANALYZE
   SELECT * FROM scheduler.scheduler_log
   WHERE tenant_id = '<tenant>'
   ORDER BY t_hlc
   LIMIT 1000;
   ```

### Resolution

1. **Chunk large bundles**: Split bundles > 10K entries
2. **Optimize database**: Ensure indexes are used (see the sketch below)
3. **Increase resources**: Scale up the database if needed
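
If the `EXPLAIN ANALYZE` above shows a sequential scan, a composite index over the queried columns is the usual fix. A sketch, assuming no equivalent index already exists under another name (check the schema migrations first):

```sql
-- Sketch: supports the tenant_id filter + t_hlc ordering used by sync reads.
-- CONCURRENTLY avoids blocking writes but cannot run inside a transaction.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_scheduler_log_tenant_thlc
    ON scheduler.scheduler_log (tenant_id, t_hlc);
```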

---

## 6. No Enqueues

### Symptoms
- Alert: `HlcEnqueueRateZero`
- No jobs appearing in HLC queue

### Severity
**Info** - May be expected, or may indicate misconfiguration.

### Investigation Steps

1. **Check if HLC ordering is enabled**:
   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc'
   ```

2. **Verify service is receiving jobs**:
   ```promql
   rate(scheduler_jobs_received_total[5m])
   ```

3. **Check for errors**:
   ```bash
   grep -i "hlc\|enqueue" /var/log/stellaops/scheduler.log | grep -i error
   ```

### Resolution

1. If HLC should be enabled:
   ```yaml
   Scheduler:
     Queue:
       Hlc:
         EnableHlcOrdering: true
   ```

2. If dual-write mode is needed:
   ```yaml
   Scheduler:
     Queue:
       Hlc:
         DualWriteMode: true
   ```

---

## 7. Batch Snapshot Failures

### Symptoms
- Alert: `HlcBatchSnapshotFailures`
- Missing DSSE-signed batch proofs

### Severity
**Warning** - Audit proofs may be incomplete.

### Investigation Steps

1. **Check signing key**:
   ```bash
   stella signer status
   ```

2. **Verify DSSE configuration**:
   ```bash
   curl -s http://localhost:5000/config | jq '.scheduler.queue.hlc.batchSigning'
   ```

3. **Check database connectivity**:
   ```sql
   SELECT 1; -- Simple connectivity test
   ```

### Resolution

1. **Refresh signing credentials**
2. **Check certificate expiry** (see the sketch below)
3. **Verify database permissions for the `batch_snapshots` table** (see the query below)
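
A sketch for the expiry check, assuming a PEM certificate on disk (the path is hypothetical — take the real one from the signer configuration):

```bash
# Sketch: print the expiry date of the batch-signing certificate.
openssl x509 -enddate -noout -in /etc/stellaops/signing/batch-signing.crt
```

And for the permissions check, assuming the table lives in the `scheduler` schema:

```sql
-- Sketch: list which roles can write to batch_snapshots.
SELECT grantee, privilege_type
FROM information_schema.table_privileges
WHERE table_schema = 'scheduler' AND table_name = 'batch_snapshots';
```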

---

## 8. Duplicate Node ID

### Symptoms
- Alert: `HlcDuplicateNodeId`
- Multiple instances with the same `node_id`

### Severity
**Critical** - Will cause chain corruption.

### Investigation Steps

1. **Identify affected instances**:
   ```promql
   group by (node_id, instance) (hlc_ticks_total)
   ```

2. **Check node ID configuration**:
   ```bash
   # On each instance
   grep -r "NodeId" /etc/stellaops/
   ```

### Resolution

**Immediate action required**:
1. Stop one of the duplicate instances
2. Reconfigure with a unique node ID (see the sketch below)
3. Restart and verify
4. Check chain integrity for the affected time period
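
A configuration sketch, following the YAML shape used elsewhere in this runbook (the exact key placement is assumed — confirm against your configuration schema):

```yaml
Scheduler:
  Queue:
    Hlc:
      NodeId: "scheduler-node-02"  # must be unique across all instances
```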

---

## Escalation Matrix

| Issue | First Responder | Escalation L2 | Escalation L3 |
|-------|-----------------|---------------|---------------|
| Chain verification failure | On-call SRE | Scheduler team | Security team |
| Clock skew | On-call SRE | Infrastructure | Architecture |
| Merge conflicts | On-call SRE | Scheduler team | - |
| Performance issues | On-call SRE | Database team | - |
| Duplicate node ID | On-call SRE | Scheduler team | - |

---

## Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-01-07 | Agent | Initial release |

@@ -13,19 +13,19 @@ Status: DRAFT — pending policy-registry overlay and production digests. Use fo
 - Prod: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json`
 - Confirm `.gitea/workflows/release-manifest-verify.yml` is green for the target manifest change.
 2) Render deployment plan (no apply yet)
-- Helm: `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml`
-- Compose (dev): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml`
+- Helm: `helm template stellaops ./devops/helm/stellaops -f devops/helm/stellaops/values-prod.yaml -f devops/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml`
+- Compose (dev): `USE_MOCK=1 devops/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f devops/compose/docker-compose.dev.yaml -f devops/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml`
 3) Backups
-- Run `deploy/compose/scripts/backup.sh` before production rollout; archive PostgreSQL/Redis/ObjectStore snapshots to the regulated vault.
+- Run `devops/compose/scripts/backup.sh` before production rollout; archive PostgreSQL/Redis/ObjectStore snapshots to the regulated vault.
 
 ## Canary publish → promote
 1) Prepare override (temporary)
-- Create `deploy/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
+- Create `devops/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
 - Keep `mock.enabled=false`; only use real digests when available.
 2) Dry-run render
-- `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml`
+- `helm template stellaops ./devops/helm/stellaops -f devops/helm/stellaops/values-prod.yaml -f devops/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml`
 3) Apply canary
-- `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m`
+- `helm upgrade --install stellaops ./devops/helm/stellaops -f devops/helm/stellaops/values-prod.yaml -f devops/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m`
 - Monitor: `kubectl logs deployment/policy-registry -n stellaops --tail=200 -f` and readiness probes; rollback on errors.
 4) Promote
 - Remove the canary override from the release branch; rerender with `values-prod.yaml` only and redeploy.

@@ -7,12 +7,12 @@ Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production rollouts
 - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
 - Prod: rerun against `deploy/releases/2025.09-stable.yaml` once VEX digests land.
 2) Render plan
-- Helm (mock overlay): `helm template vex-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vex-mock.yaml`
-- Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/vex-compose.yaml`
+- Helm (mock overlay): `helm template vex-mock ./devops/helm/stellaops -f devops/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vex-mock.yaml`
+- Compose (dev with overlay): `USE_MOCK=1 devops/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f devops/compose/docker-compose.dev.yaml -f devops/compose/docker-compose.mock.yaml config > /tmp/vex-compose.yaml`
 3) Backups (when touching prod data) — not required for mock, but in prod take PostgreSQL snapshots for issuer-directory and VEX state before rollout.
 
 ## Deploy (mock path)
-- Helm dry-run already covers structural checks. To apply in a dev cluster: `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
+- Helm dry-run already covers structural checks. To apply in a dev cluster: `helm upgrade --install stellaops ./devops/helm/stellaops -f devops/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
 - Observe VEX Lens pod logs: `kubectl logs deploy/vex-lens -n stellaops --tail=200 -f`.
 - Issuer Directory seed: ensure `issuer-directory-config` ConfigMap includes `csaf-publishers.json`; mock overlay already mounts default seed.
 

@@ -10,13 +10,13 @@ Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production steps ne
 - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
 - Prod: rerun against `deploy/releases/2025.09-stable.yaml` once ledger/api digests land.
 2) Render plan
-- Helm (mock overlay): `helm template vuln-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vuln-mock.yaml`
-- Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml config > /tmp/vuln-compose.yaml`
+- Helm (mock overlay): `helm template vuln-mock ./devops/helm/stellaops -f devops/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vuln-mock.yaml`
+- Compose (dev with overlay): `USE_MOCK=1 devops/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml config > /tmp/vuln-compose.yaml`
 3) Backups (prod only)
 - PostgreSQL dump for Findings Ledger DB; copy object-store buckets tied to projector anchors.
 
 ## Deploy (mock path)
-- Helm apply (dev): `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
+- Helm apply (dev): `helm upgrade --install stellaops ./devops/helm/stellaops -f devops/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
 - Compose: quickstart already starts ledger + vuln API with mock pins; validate health at `https://localhost:8443/swagger` (dev certs).
 
 ## Incident drills