save development progress

StellaOps Bot
2025-12-25 23:09:58 +02:00
parent d71853ad7e
commit aa70af062e
351 changed files with 37683 additions and 150156 deletions


@@ -0,0 +1,419 @@
# Provcache Metrics and Alerting Guide
This document describes the Prometheus metrics exposed by the Provcache layer and recommended alerting configurations.
## Overview
Provcache emits metrics for monitoring cache performance, hit rates, latency, and invalidation patterns. These metrics enable operators to:
- Track cache effectiveness
- Identify performance degradation
- Detect anomalous invalidation patterns
- Plan capacity for cache infrastructure
## Prometheus Metrics
### Request Counters
#### `provcache_requests_total`
Total number of cache requests.
| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier that handled the request |
| `result` | `hit`, `miss`, `expired` | Request outcome |
```promql
# Requests per second, averaged over the last minute (multiply by 60 for a per-minute figure)
rate(provcache_requests_total[1m])
# Hit rate percentage
sum(rate(provcache_requests_total{result="hit"}[5m])) /
sum(rate(provcache_requests_total[5m])) * 100
```
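As a worked illustration of the hit rate query, the sketch below does the same arithmetic by hand from two samples of each counter. The sample values and the function name are hypothetical, and counter resets are ignored. Because `rate()` divides both counters by the same window length, the window cancels out of the ratio.

```python
def hit_rate_percent(hits_start, hits_end, total_start, total_end):
    """Hit rate over one window, from two samples of each counter."""
    hit_delta = hits_end - hits_start
    total_delta = total_end - total_start
    if total_delta <= 0:
        # No traffic in the window; the PromQL division would give NaN or no data.
        return None
    return hit_delta / total_delta * 100


# Hypothetical samples taken 5 minutes apart: 900 hits out of 1200 requests.
print(hit_rate_percent(hits_start=8_000, hits_end=8_900,
                       total_start=10_000, total_end=11_200))  # 75.0
```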
#### `provcache_hits_total`
Total cache hits (subset of requests with `result="hit"`).
| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier |
```promql
# Share of all cache hits served by Valkey, in percent (the remainder is Postgres)
sum(rate(provcache_hits_total{source="valkey"}[5m])) /
sum(rate(provcache_hits_total[5m])) * 100
```
#### `provcache_misses_total`
Total cache misses.
| Label | Values | Description |
|-------|--------|-------------|
| `reason` | `not_found`, `expired`, `invalidated` | Miss reason |
```promql
# Miss rate by reason
sum by (reason) (rate(provcache_misses_total[5m]))
```
### Latency Histogram
#### `provcache_latency_seconds`
Latency distribution for cache operations.
| Label | Values | Description |
|-------|--------|-------------|
| `operation` | `get`, `set`, `invalidate` | Operation type |
| `source` | `valkey`, `postgres` | Cache tier |
Buckets: `0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0`
```promql
# P50 latency for cache gets
histogram_quantile(0.50, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
# P95 latency
histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
# P99 latency
histogram_quantile(0.99, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
```
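The percentiles above are estimates: `histogram_quantile` finds the bucket containing the requested rank and interpolates linearly inside it, so accuracy depends on the bucket boundaries. The following is a minimal sketch of that interpolation over the bucket list given above. The function name and the sample counts are hypothetical, and this is a simplified model, not Prometheus's implementation.

```python
import math

# Upper bounds from the bucket list above, plus the implicit +Inf bucket.
BOUNDS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, math.inf]


def estimate_quantile(q, cumulative_counts, bounds=BOUNDS):
    """Estimate quantile q from cumulative bucket counts by linear interpolation."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, count in enumerate(cumulative_counts) if count >= rank)
    upper = bounds[index]
    if math.isinf(upper):
        # Cannot interpolate into +Inf; report the highest finite bound.
        return bounds[index - 1]
    lower = bounds[index - 1] if index > 0 else 0.0
    below = cumulative_counts[index - 1] if index > 0 else 0
    in_bucket = cumulative_counts[index] - below
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical cumulative counts, one per bound in BOUNDS.
counts = [100, 600, 850, 950, 980, 995, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
for q in (0.50, 0.95, 0.99):
    print(q, round(estimate_quantile(q, counts), 4))
```

With these counts the P95 rank falls at the very top of the 10 ms to 25 ms bucket, so the estimate is reported as roughly 25 ms even if most requests in that bucket were faster.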
### Gauge Metrics
#### `provcache_items_count`
Current number of items in cache.
| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier |
```promql
# Total cached items
sum(provcache_items_count)
# Items by tier
sum by (source) (provcache_items_count)
```
### Invalidation Metrics
#### `provcache_invalidations_total`
Total invalidation events.
| Label | Values | Description |
|-------|--------|-------------|
| `reason` | `signer_revoked`, `epoch_advanced`, `ttl_expired`, `manual` | Invalidation trigger |
```promql
# Invalidation rate by reason
sum by (reason) (rate(provcache_invalidations_total[5m]))
# Security-related invalidations
sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m]))
```
### Trust Score Metrics
#### `provcache_trust_score_average`
Gauge showing average trust score across cached decisions.
```promql
# Current average trust score
provcache_trust_score_average
```
#### `provcache_trust_score_bucket`
Histogram of trust score distribution.
Buckets: `20, 40, 60, 80, 100`
```promql
# Percentage of decisions with trust score > 80
(
  sum(rate(provcache_trust_score_bucket{le="100"}[5m])) -
  sum(rate(provcache_trust_score_bucket{le="80"}[5m]))
) / sum(rate(provcache_trust_score_bucket{le="100"}[5m])) * 100
```
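Histogram buckets are cumulative and their `le` bound is inclusive, so subtracting the `le="80"` bucket from the `le="100"` bucket leaves the observations strictly above 80. Dividing by the total turns that count into a percentage. A small sketch of the arithmetic, using hypothetical bucket counts and a hypothetical helper name:

```python
def share_above(threshold, cumulative_by_le, top="100"):
    """Percentage of observations above `threshold`, from cumulative bucket counts."""
    total = cumulative_by_le[top]
    if total == 0:
        return None  # no observations yet
    return (total - cumulative_by_le[str(threshold)]) / total * 100


# Hypothetical cumulative counts keyed by the `le` label.
counts = {"20": 5, "40": 20, "60": 80, "80": 150, "100": 200}
print(share_above(80, counts))  # 25.0
```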
---
## Grafana Dashboard
A pre-built dashboard is available at `deploy/grafana/dashboards/provcache-overview.json`.
### Panels
| Panel | Type | Description |
|-------|------|-------------|
| Cache Hit Rate | Gauge | Current hit rate percentage |
| Hit Rate Over Time | Time series | Hit rate trend |
| Latency Percentiles | Time series | P50, P95, P99 latency |
| Invalidation Rate | Time series | Invalidations per minute |
| Cache Size | Time series | Item count over time |
| Hits by Source | Pie chart | Valkey vs Postgres distribution |
| Entry Size Distribution | Histogram | Size of cached entries |
| Trust Score Distribution | Histogram | Decision trust scores |
### Importing the Dashboard
```bash
# Via Grafana HTTP API
# Note: this endpoint expects a body of the form {"dashboard": {...}, "overwrite": true};
# wrap the file first if it contains a bare dashboard model.
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -d @deploy/grafana/dashboards/provcache-overview.json
# Via Helm (auto-provisioned)
# Dashboard is auto-imported when using StellaOps Helm chart
helm upgrade stellaops ./deploy/helm/stellaops \
  --set grafana.dashboards.provcache.enabled=true
```
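Grafana's `POST /api/dashboards/db` endpoint takes the dashboard model nested under a `dashboard` key, alongside options such as `overwrite`. This guide does not say whether `provcache-overview.json` is stored in that wrapped form, so the sketch below handles both cases. The helper name and the sample model are hypothetical.

```python
import json


def build_import_payload(dashboard, overwrite=True):
    """Wrap a dashboard model in the body shape used by POST /api/dashboards/db."""
    if "dashboard" in dashboard:
        # Already wrapped; pass through unchanged.
        return dashboard
    model = dict(dashboard)
    # A null id lets Grafana assign its own internal id; the uid identifies the dashboard.
    model["id"] = None
    return {"dashboard": model, "overwrite": overwrite}


# Hypothetical minimal dashboard model, for illustration only.
model = {"id": 42, "uid": "provcache-overview", "title": "Provcache Overview", "panels": []}
print(json.dumps(build_import_payload(model), indent=2))
```

The resulting JSON can be written to a file and passed to `curl -d @file` as in the command above.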
---
## Alerting Rules
### Recommended Alerts
#### Low Cache Hit Rate
```yaml
alert: ProvcacheLowHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.7
for: 10m
labels:
  severity: warning
annotations:
  summary: "Provcache hit rate below 70%"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Check for invalidation storms or cold cache."
```
#### Critical Hit Rate Drop
```yaml
alert: ProvcacheCriticalHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.5
for: 5m
labels:
  severity: critical
annotations:
  summary: "Provcache hit rate critically low"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Immediate investigation required."
```
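The `for:` clause means an alert does not fire on the first bad evaluation: it enters a pending state and fires only once the condition has held continuously for the full duration. A simplified sketch of that state machine follows, applied to the critical hit rate thresholds above. The 60-second evaluation interval, the sample values, and the function name are assumptions for illustration.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return 'inactive', 'pending' or 'firing' for each evaluation of value < threshold."""
    states = []
    active_since = None  # evaluation index at which the condition became true
    for i, value in enumerate(values):
        if value < threshold:
            if active_since is None:
                active_since = i
            held = (i - active_since) * interval_seconds
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # Any recovery resets the pending timer.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical hit rates, evaluated once per minute against `< 0.5` with `for: 5m`.
hit_rates = [0.8, 0.45, 0.4, 0.42, 0.41, 0.44, 0.43, 0.9]
print(alert_states(hit_rates, threshold=0.5, for_seconds=300, interval_seconds=60))
```

In this run the alert stays pending for five evaluations and fires on the sixth consecutive bad one, so a brief dip that recovers within five minutes never pages anyone.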
#### High Latency
```yaml
alert: ProvcacheHighLatency
expr: |
  histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m])) > 0.1
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache P95 latency above 100ms"
  description: "P95 get latency is {{ $value | humanizeDuration }}. Check Valkey/Postgres performance."
```
#### Excessive Invalidations
```yaml
alert: ProvcacheInvalidationStorm
expr: |
  sum(rate(provcache_invalidations_total[5m])) > 100
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache invalidation rate spike"
  description: "Invalidations at {{ $value }} per second. Check for feed epoch changes or revocations."
```
#### Signer Revocation Spike
```yaml
alert: ProvcacheSignerRevocations
expr: |
  sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m])) > 10
for: 2m
labels:
  severity: critical
annotations:
  summary: "Signer revocation causing mass invalidation"
  description: "{{ $value }} invalidations/sec due to signer revocation. Security event investigation required."
```
#### Cache Size Approaching Limit
```yaml
alert: ProvcacheSizeHigh
expr: |
  sum(provcache_items_count) > 900000
for: 15m
labels:
  severity: warning
annotations:
  summary: "Provcache size approaching limit"
  description: "Cache has {{ $value }} items. Consider scaling or tuning TTL."
```
#### Low Trust Scores
```yaml
alert: ProvcacheLowTrustScores
expr: |
  provcache_trust_score_average < 60
for: 30m
labels:
  severity: info
annotations:
  summary: "Average trust score below 60"
  description: "Average trust score is {{ $value }}. Review SBOM completeness and VEX coverage."
```
### AlertManager Configuration
```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      # Keep evaluating sibling routes so that critical signer revocation
      # alerts also reach the security team route below.
      continue: true
    - match:
        alertname: ProvcacheSignerRevocations
      receiver: 'security-team'
receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#stellaops-alerts'
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'security-team'
    email_configs:
      - to: 'security@example.com'
        send_resolved: true
```
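Alertmanager walks sibling routes in order and stops at the first match unless that route sets `continue: true`. Route order therefore matters for `ProvcacheSignerRevocations`, which carries `severity: critical` and so matches both child routes. The sketch below is a simplified model of that matching (exact label equality only, one level of routes). The function name is hypothetical and it is not Alertmanager's implementation.

```python
def route_receivers(labels, routes, default_receiver):
    """Return the receivers notified for an alert, walking sibling routes in order."""
    receivers = []
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless the route asks to continue
    # An alert that matches no child route falls back to the parent's receiver.
    return receivers or [default_receiver]


def example_routes(continue_first):
    return [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical",
         "continue": continue_first},
        {"match": {"alertname": "ProvcacheSignerRevocations"}, "receiver": "security-team"},
    ]


alert = {"alertname": "ProvcacheSignerRevocations", "severity": "critical"}
print(route_receivers(alert, example_routes(False), "default-receiver"))
print(route_receivers(alert, example_routes(True), "default-receiver"))
```

Without `continue` on the first route, only `pagerduty-critical` is notified; with it, the alert reaches both `pagerduty-critical` and `security-team`.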
---
## Recording Rules
Pre-compute expensive queries for dashboard performance:
```yaml
# prometheus-rules.yml
groups:
  - name: provcache-recording
    interval: 30s
    rules:
      # Hit rate pre-computed
      - record: provcache:hit_rate:5m
        expr: |
          sum(rate(provcache_requests_total{result="hit"}[5m])) /
          sum(rate(provcache_requests_total[5m]))
      # P95 latency pre-computed
      - record: provcache:latency_p95:5m
        expr: |
          histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
      # Invalidation rate
      - record: provcache:invalidation_rate:5m
        expr: |
          sum(rate(provcache_invalidations_total[5m]))
      # Cache efficiency: hits as a fraction of hits plus misses
      - record: provcache:efficiency:5m
        expr: |
          sum(rate(provcache_hits_total[5m])) /
          (sum(rate(provcache_hits_total[5m])) + sum(rate(provcache_misses_total[5m])))
```
---
## Operational Runbook
### Low Hit Rate Investigation
1. **Check invalidation metrics** — Is there an invalidation storm?
```promql
sum by (reason) (rate(provcache_invalidations_total[5m]))
```
2. **Check cache age** — Is the cache newly deployed (cold)?
```promql
sum(provcache_items_count)
```
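3. **Check request patterns**: Are there many unique VeriKeys?
```promql
# High cardinality of unique requests suggests insufficient cache sharing.
# No metric in this guide counts unique keys directly; as a rough proxy,
# watch for a sustained high rate of not_found misses.
sum(rate(provcache_misses_total{reason="not_found"}[5m]))
```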
4. **Check TTL configuration** — Is TTL too aggressive?
- Review `Provcache:DefaultTtl` setting
- Consider increasing for stable workloads
### High Latency Investigation
1. **Check Valkey health**
```bash
redis-cli -h valkey info stats
```
2. **Check Postgres connections**
```sql
SELECT count(*) FROM pg_stat_activity WHERE datname = 'stellaops';
```
3. **Check entry sizes**
```promql
histogram_quantile(0.95, rate(provcache_entry_size_bytes_bucket[5m]))
```
4. **Check network latency** between services
### Invalidation Storm Response
1. **Identify cause**
```promql
sum by (reason) (increase(provcache_invalidations_total[10m]))
```
2. **If epoch-related**: Expected during feed updates. Monitor duration.
3. **If signer-related**: Security event — escalate to security team.
4. **If manual**: Check audit logs for unauthorized invalidation.
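The response steps above amount to a lookup from the dominant invalidation reason to an action. A minimal sketch of that triage, assuming the per-reason increases have already been read from the query in step 1; the function name, the fallback message, and the sample numbers are hypothetical.

```python
# Actions taken from the runbook steps above, keyed by the `reason` label.
RUNBOOK_ACTIONS = {
    "epoch_advanced": "Expected during feed updates. Monitor duration.",
    "signer_revoked": "Security event: escalate to security team.",
    "manual": "Check audit logs for unauthorized invalidation.",
}


def triage(increase_by_reason):
    """Pick the reason with the largest increase and return it with its runbook action."""
    if not increase_by_reason:
        return None
    reason = max(increase_by_reason, key=increase_by_reason.get)
    # Reasons without a runbook entry (for example ttl_expired) get a generic fallback.
    action = RUNBOOK_ACTIONS.get(reason, "No runbook entry; investigate manually.")
    return reason, action


# Hypothetical 10-minute increases per reason.
print(triage({"epoch_advanced": 120, "signer_revoked": 4500, "ttl_expired": 300, "manual": 0}))
```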
---
## Related Documentation
- [Provcache Module README](../provcache/README.md) — Core concepts
- [Provcache Architecture](../provcache/architecture.md) — Technical details
- [Telemetry Architecture](../telemetry/architecture.md) — Observability patterns
- [Grafana Dashboard Guide](../../deploy/grafana/README.md) — Dashboard management