save development progress

2025-12-25 23:09:58 +02:00
parent d71853ad7e
commit aa70af062e
351 changed files with 37683 additions and 150156 deletions
--- a/docs/modules/provcache/metrics-alerting.md
+++ b/docs/modules/provcache/metrics-alerting.md
@@ -0,0 +1,419 @@
+# Provcache Metrics and Alerting Guide
+
+This document describes the Prometheus metrics exposed by the Provcache layer and recommended alerting configurations.
+
+## Overview
+
+Provcache emits metrics for monitoring cache performance, hit rates, latency, and invalidation patterns. These metrics enable operators to:
+
+- Track cache effectiveness
+- Identify performance degradation
+- Detect anomalous invalidation patterns
+- Plan capacity for cache infrastructure
+
+## Prometheus Metrics
+
+### Request Counters
+
+#### `provcache_requests_total`
+
+Total number of cache requests.
+
+| Label | Values | Description |
+|-------|--------|-------------|
+| `source` | `valkey`, `postgres` | Cache tier that handled the request |
+| `result` | `hit`, `miss`, `expired` | Request outcome |
+
+```promql
+# Per-second request rate, averaged over the last minute
+rate(provcache_requests_total[1m])
+
+# Hit rate percentage
+sum(rate(provcache_requests_total{result="hit"}[5m])) /
+sum(rate(provcache_requests_total[5m])) * 100
+```
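+
+Because `provcache_requests_total` carries a `source` label, the same ratio can be broken out per tier. The query below is a sketch that assumes both tiers report every `result` value under this metric:
+
+```promql
+# Hit rate percentage per cache tier (valkey, postgres)
+sum by (source) (rate(provcache_requests_total{result="hit"}[5m])) /
+sum by (source) (rate(provcache_requests_total[5m])) * 100
+```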
+
+#### `provcache_hits_total`
+
+Total cache hits (subset of requests with `result="hit"`).
+
+| Label | Values | Description |
+|-------|--------|-------------|
+| `source` | `valkey`, `postgres` | Cache tier |
+
+```promql
+# Share of all cache hits served by Valkey (percent)
+sum(rate(provcache_hits_total{source="valkey"}[5m])) /
+sum(rate(provcache_hits_total[5m])) * 100
+```
+
+#### `provcache_misses_total`
+
+Total cache misses.
+
+| Label | Values | Description |
+|-------|--------|-------------|
+| `reason` | `not_found`, `expired`, `invalidated` | Miss reason |
+
+```promql
+# Miss rate by reason
+sum by (reason) (rate(provcache_misses_total[5m]))
+```
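+
+Absolute miss rates are hard to compare across load levels, so it may help to express each reason as a fraction of all misses. This sketch uses only the `reason` label documented above:
+
+```promql
+# Fraction of all misses attributed to each reason (fractions sum to 1)
+sum by (reason) (rate(provcache_misses_total[5m]))
+  / ignoring (reason) group_left
+sum(rate(provcache_misses_total[5m]))
+```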
+
+### Latency Histogram
+
+#### `provcache_latency_seconds`
+
+Latency distribution for cache operations.
+
+| Label | Values | Description |
+|-------|--------|-------------|
+| `operation` | `get`, `set`, `invalidate` | Operation type |
+| `source` | `valkey`, `postgres` | Cache tier |
+
+Buckets: `0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0`
+
+```promql
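+# P50 latency for cache gets (one result per label set, e.g. per source;
+# wrap the rate in sum by (le) (...) for a single overall value)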
+histogram_quantile(0.50, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
+
+# P95 latency
+histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
+
+# P99 latency
+histogram_quantile(0.99, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
+```
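+
+Quantiles from `histogram_quantile` are interpolated within a bucket, so two complementary views can be useful. The sketch below assumes the standard `_sum` and `_count` series that Prometheus histograms expose, and that the client library renders the bucket bound as `le="0.025"` (the exact label text can differ between libraries):
+
+```promql
+# Mean get latency in seconds
+sum(rate(provcache_latency_seconds_sum{operation="get"}[5m])) /
+sum(rate(provcache_latency_seconds_count{operation="get"}[5m]))
+
+# Fraction of gets completing within 25 ms (uses the 0.025 bucket boundary)
+sum(rate(provcache_latency_seconds_bucket{operation="get", le="0.025"}[5m])) /
+sum(rate(provcache_latency_seconds_count{operation="get"}[5m]))
+```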
+
+### Gauge Metrics
+
+#### `provcache_items_count`
+
+Current number of items in cache.
+
+| Label | Values | Description |
+|-------|--------|-------------|
+| `source` | `valkey`, `postgres` | Cache tier |
+
+```promql
+# Total cached items
+sum(provcache_items_count)
+
+# Items by tier
+sum by (source) (provcache_items_count)
+```
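+
+For capacity planning, the gauge can be extrapolated with standard PromQL functions. The windows below (6 hours of history, 24 hours ahead) are illustrative assumptions, not values taken from Provcache configuration:
+
+```promql
+# Projected item count per tier 24 hours from now, from a linear fit over the last 6 hours
+predict_linear(provcache_items_count[6h], 24 * 3600)
+
+# Net change in item count per tier over the last hour
+delta(provcache_items_count[1h])
+```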
+
+### Invalidation Metrics
+
+#### `provcache_invalidations_total`
+
+Total invalidation events.
+
+| Label | Values | Description |
+|-------|--------|-------------|
+| `reason` | `signer_revoked`, `epoch_advanced`, `ttl_expired`, `manual` | Invalidation trigger |
+
+```promql
+# Invalidation rate by reason
+sum by (reason) (rate(provcache_invalidations_total[5m]))
+
+# Security-related invalidations
+sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m]))
+```
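+
+One way to spot an anomalous invalidation pattern is to compare the current rate with the same time on the previous day. This is a sketch that assumes daily traffic is roughly periodic. The result is `+Inf` or `NaN` when there were no invalidations a day earlier:
+
+```promql
+# Current invalidation rate relative to the same time yesterday (1 = unchanged)
+sum(rate(provcache_invalidations_total[5m])) /
+sum(rate(provcache_invalidations_total[5m] offset 1d))
+```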
+
+### Trust Score Metrics
+
+#### `provcache_trust_score_average`
+
+Gauge showing average trust score across cached decisions.
+
+```promql
+# Current average trust score
+provcache_trust_score_average
+```
+
+#### `provcache_trust_score_bucket`
+
+Histogram of trust score distribution.
+
+Buckets: `20, 40, 60, 80, 100`
+
+```promql
+# Percentage of decisions with trust score above 80 (le="80" counts scores <= 80)
+(sum(rate(provcache_trust_score_bucket{le="100"}[5m])) -
+ sum(rate(provcache_trust_score_bucket{le="80"}[5m]))) /
+sum(rate(provcache_trust_score_bucket{le="100"}[5m])) * 100
+```
+
+---
+
+## Grafana Dashboard
+
+A pre-built dashboard is available at `deploy/grafana/dashboards/provcache-overview.json`.
+
+### Panels
+
+| Panel | Type | Description |
+|-------|------|-------------|
+| Cache Hit Rate | Gauge | Current hit rate percentage |
+| Hit Rate Over Time | Time series | Hit rate trend |
+| Latency Percentiles | Time series | P50, P95, P99 latency |
+| Invalidation Rate | Time series | Invalidations per minute |
+| Cache Size | Time series | Item count over time |
+| Hits by Source | Pie chart | Valkey vs Postgres distribution |
+| Entry Size Distribution | Histogram | Size of cached entries |
+| Trust Score Distribution | Histogram | Decision trust scores |
+
+### Importing the Dashboard
+
+```bash
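+# Via Grafana HTTP API
+# Note: this endpoint expects a {"dashboard": {...}, "overwrite": true} envelope;
+# wrap the file first if it contains a bare dashboard model.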
+curl -X POST http://grafana:3000/api/dashboards/db \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $GRAFANA_API_KEY" \
+  -d @deploy/grafana/dashboards/provcache-overview.json
+
+# Via Helm (auto-provisioned)
+# Dashboard is auto-imported when using StellaOps Helm chart
+helm upgrade stellaops ./deploy/helm/stellaops \
+  --set grafana.dashboards.provcache.enabled=true
+```
+
+---
+
+## Alerting Rules
+
+### Recommended Alerts
+
+#### Low Cache Hit Rate
+
+```yaml
+alert: ProvcacheLowHitRate
+expr: |
+  sum(rate(provcache_requests_total{result="hit"}[5m])) /
+  sum(rate(provcache_requests_total[5m])) < 0.7
+for: 10m
+labels:
+  severity: warning
+annotations:
+  summary: "Provcache hit rate below 70%"
+  description: "Cache hit rate is {{ $value | humanizePercentage }}. Check for invalidation storms or cold cache."
+```
+
+#### Critical Hit Rate Drop
+
+```yaml
+alert: ProvcacheCriticalHitRate
+expr: |
+  sum(rate(provcache_requests_total{result="hit"}[5m])) /
+  sum(rate(provcache_requests_total[5m])) < 0.5
+for: 5m
+labels:
+  severity: critical
+annotations:
+  summary: "Provcache hit rate critically low"
+  description: "Cache hit rate is {{ $value | humanizePercentage }}. Immediate investigation required."
+```
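+
+Both hit-rate alerts above evaluate a ratio, which can swing sharply when the cache is serving very little traffic. A possible guard is sketched below. The floor of 1 request per second is a hypothetical threshold and should be tuned to the deployment:
+
+```promql
+# Hit rate below 70%, but only while the cache serves more than 1 request/sec
+(
+  sum(rate(provcache_requests_total{result="hit"}[5m])) /
+  sum(rate(provcache_requests_total[5m])) < 0.7
+)
+and
+sum(rate(provcache_requests_total[5m])) > 1
+```
+
+The `and` operator keeps the left-hand sample, so `{{ $value }}` in the annotations would still be the hit rate.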
+
+#### High Latency
+
+```yaml
+alert: ProvcacheHighLatency
+expr: |
+  histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m])) > 0.1
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "Provcache P95 latency above 100ms"
+  description: "P95 get latency is {{ $value | humanizeDuration }}. Check Valkey/Postgres performance."
+```
+
+#### Excessive Invalidations
+
+```yaml
+alert: ProvcacheInvalidationStorm
+expr: |
+  sum(rate(provcache_invalidations_total[5m])) > 100
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "Provcache invalidation rate spike"
+  description: "Invalidations at {{ $value }} per second. Check for feed epoch changes or revocations."
+```
+
+#### Signer Revocation Spike
+
+```yaml
+alert: ProvcacheSignerRevocations
+expr: |
+  sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m])) > 10
+for: 2m
+labels:
+  severity: critical
+annotations:
+  summary: "Signer revocation causing mass invalidation"
+  description: "{{ $value }} invalidations/sec due to signer revocation. Security event investigation required."
+```
+
+#### Cache Size Approaching Limit
+
+```yaml
+alert: ProvcacheSizeHigh
+expr: |
+  sum(provcache_items_count) > 900000
+for: 15m
+labels:
+  severity: warning
+annotations:
+  summary: "Provcache size approaching limit"
+  description: "Cache has {{ $value }} items. Consider scaling or tuning TTL."
+```
+
+#### Low Trust Scores
+
+```yaml
+alert: ProvcacheLowTrustScores
+expr: |
+  provcache_trust_score_average < 60
+for: 30m
+labels:
+  severity: info
+annotations:
+  summary: "Average trust score below 60"
+  description: "Average trust score is {{ $value }}. Review SBOM completeness and VEX coverage."
+```
+
+### AlertManager Configuration
+
+```yaml
+# alertmanager.yml
+route:
+  group_by: ['alertname', 'severity']
+  group_wait: 30s
+  group_interval: 5m
+  repeat_interval: 4h
+  receiver: 'default-receiver'
+  routes:
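+    - match:
+        severity: critical
+      receiver: 'pagerduty-critical'
+      # Keep evaluating sibling routes so critical security alerts also reach 'security-team'
+      continue: true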
+    - match:
+        alertname: ProvcacheSignerRevocations
+      receiver: 'security-team'
+
+receivers:
+  - name: 'default-receiver'
+    slack_configs:
+      - channel: '#stellaops-alerts'
+        send_resolved: true
+  
+  - name: 'pagerduty-critical'
+    pagerduty_configs:
+      - service_key: '<pagerduty-key>'
+  
+  - name: 'security-team'
+    email_configs:
+      - to: 'security@example.com'
+        send_resolved: true
+```
+
+---
+
+## Recording Rules
+
+Pre-compute expensive queries for dashboard performance:
+
+```yaml
+# prometheus-rules.yml
+groups:
+  - name: provcache-recording
+    interval: 30s
+    rules:
+      # Hit rate pre-computed
+      - record: provcache:hit_rate:5m
+        expr: |
+          sum(rate(provcache_requests_total{result="hit"}[5m])) /
+          sum(rate(provcache_requests_total[5m]))
+
+      # P95 latency pre-computed
+      - record: provcache:latency_p95:5m
+        expr: |
+          histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
+
+      # Invalidation rate
+      - record: provcache:invalidation_rate:5m
+        expr: |
+          sum(rate(provcache_invalidations_total[5m]))
+
+      # Cache efficiency: hits / (hits + misses)
+      - record: provcache:efficiency:5m
+        expr: |
+          sum(rate(provcache_hits_total[5m])) /
+          (sum(rate(provcache_hits_total[5m])) + sum(rate(provcache_misses_total[5m])))
+```
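+
+Once these rules are loaded, dashboards and ad hoc queries can read the recorded series directly. The queries below are a sketch of typical usage, with the 1-hour smoothing window chosen only for illustration:
+
+```promql
+# Check the recorded hit rate against the 70% warning threshold
+provcache:hit_rate:5m < 0.7
+
+# Hit rate smoothed over the last hour
+avg_over_time(provcache:hit_rate:5m[1h])
+
+# Recorded P95 get latency in milliseconds
+provcache:latency_p95:5m * 1000
+```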
+
+---
+
+## Operational Runbook
+
+### Low Hit Rate Investigation
+
+1. **Check invalidation metrics** — Is there an invalidation storm?
+   ```promql
+   sum by (reason) (rate(provcache_invalidations_total[5m]))
+   ```
+
+2. **Check cache age** — Is the cache newly deployed (cold)?
+   ```promql
+   sum(provcache_items_count)
+   ```
+
+3. **Check request patterns** — Are there many unique VeriKeys?
+   ```promql
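+   # A high share of not_found misses suggests many unique VeriKeys (little cache sharing)
+   sum(rate(provcache_misses_total{reason="not_found"}[5m])) /
+   sum(rate(provcache_misses_total[5m]))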
+   ```
+
+4. **Check TTL configuration** — Is TTL too aggressive?
+   - Review `Provcache:DefaultTtl` setting
+   - Consider increasing for stable workloads
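+
+For the TTL check in step 4, the share of misses caused by expiry gives a rough signal. This sketch relies on the `reason="expired"` label value documented for `provcache_misses_total`. What counts as "too high" is a judgment call for the workload:
+
+```promql
+# Fraction of misses caused by expiry; a persistently high value suggests the TTL is too short
+sum(rate(provcache_misses_total{reason="expired"}[5m])) /
+sum(rate(provcache_misses_total[5m]))
+```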
+
+### High Latency Investigation
+
+1. **Check Valkey health**
+   ```bash
+   redis-cli -h valkey info stats
+   ```
+
+2. **Check Postgres connections**
+   ```sql
+   SELECT count(*) FROM pg_stat_activity WHERE datname = 'stellaops';
+   ```
+
+3. **Check entry sizes**
+   ```promql
+   histogram_quantile(0.95, rate(provcache_entry_size_bytes_bucket[5m]))
+   ```
+
+4. **Check network latency** between services
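+
+Before working through the steps above, it can help to see which tier is slow. The following sketch aggregates the latency histogram per `source`, assuming both tiers report `get` operations:
+
+```promql
+# P95 get latency per cache tier
+histogram_quantile(
+  0.95,
+  sum by (le, source) (rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
+)
+```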
+
+### Invalidation Storm Response
+
+1. **Identify cause**
+   ```promql
+   sum by (reason) (increase(provcache_invalidations_total[10m]))
+   ```
+
+2. **If epoch-related**: Expected during feed updates. Monitor duration.
+
+3. **If signer-related**: Security event — escalate to security team.
+
+4. **If manual**: Check audit logs for unauthorized invalidation.
+
+---
+
+## Related Documentation
+
+- [Provcache Module README](../provcache/README.md) — Core concepts
+- [Provcache Architecture](../provcache/architecture.md) — Technical details
+- [Telemetry Architecture](../telemetry/architecture.md) — Observability patterns
+- [Grafana Dashboard Guide](../../deploy/grafana/README.md) — Dashboard management