# Provcache Metrics and Alerting Guide

This document describes the Prometheus metrics exposed by the Provcache layer and recommended alerting configurations.

## Overview

Provcache emits metrics for monitoring cache performance, hit rates, latency, and invalidation patterns. These metrics enable operators to:

- Track cache effectiveness
- Identify performance degradation
- Detect anomalous invalidation patterns
- Plan capacity for cache infrastructure

## Prometheus Metrics

### Request Counters

#### `provcache_requests_total`

Total number of cache requests.

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier that handled the request |
| `result` | `hit`, `miss`, `expired` | Request outcome |

```promql
# Per-second request rate, averaged over the last minute
rate(provcache_requests_total[1m])

# Hit rate percentage
sum(rate(provcache_requests_total{result="hit"}[5m])) /
sum(rate(provcache_requests_total[5m])) * 100
```

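To make the hit-rate query concrete, the arithmetic can be sketched in Python. This is a minimal illustration, assuming two scrapes of `provcache_requests_total` keyed by `(source, result)`; the sample values and the helper name `hit_rate_percent` are hypothetical, and counter resets are ignored.

```python
from typing import Dict, Tuple

# (source, result) -> counter value at one scrape
Sample = Dict[Tuple[str, str], float]


def hit_rate_percent(earlier: Sample, later: Sample) -> float:
    """Hit rate over the interval between two scrapes, as a percentage."""
    # Counters only increase, so the per-series increase is later - earlier.
    increases = {key: later[key] - earlier.get(key, 0.0) for key in later}
    total = sum(increases.values())
    if total == 0:
        # No traffic: the ratio is undefined (PromQL yields NaN for 0/0).
        return float("nan")
    hits = sum(v for (_, result), v in increases.items() if result == "hit")
    return 100 * hits / total


# Made-up counter values for illustration only.
earlier = {
    ("valkey", "hit"): 1000, ("valkey", "miss"): 200,
    ("postgres", "hit"): 100, ("postgres", "expired"): 20,
}
later = {
    ("valkey", "hit"): 1700, ("valkey", "miss"): 300,
    ("postgres", "hit"): 200, ("postgres", "expired"): 120,
}

print(f"{hit_rate_percent(earlier, later):.1f}%")  # prints 80.0%
```

Dividing both increases by the same interval length cancels out, which is why the ratio of two `rate()` results equals the ratio of the raw increases.
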
#### `provcache_hits_total`

Total cache hits (subset of requests with `result="hit"`).

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier |

```promql
# Share of hits served by Valkey, as a percentage of all hits
sum(rate(provcache_hits_total{source="valkey"}[5m])) /
sum(rate(provcache_hits_total[5m])) * 100
```

#### `provcache_misses_total`

Total cache misses.

| Label | Values | Description |
|-------|--------|-------------|
| `reason` | `not_found`, `expired`, `invalidated` | Miss reason |

```promql
# Miss rate by reason
sum by (reason) (rate(provcache_misses_total[5m]))
```

### Latency Histogram

#### `provcache_latency_seconds`

Latency distribution for cache operations.

| Label | Values | Description |
|-------|--------|-------------|
| `operation` | `get`, `set`, `invalidate` | Operation type |
| `source` | `valkey`, `postgres` | Cache tier |

Buckets (seconds): `0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0`

```promql
# P50 latency for cache gets
histogram_quantile(0.50, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

# P95 latency
histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

# P99 latency
histogram_quantile(0.99, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
```

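Percentiles from `histogram_quantile` are estimates: Prometheus finds the bucket containing the target rank and interpolates linearly inside it. The following Python sketch mirrors that calculation for classic histograms. The bucket counts are made up, only the first six bounds from the list above are shown for brevity, and the function is an illustration, not Prometheus code.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]  # target position among all observations
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for provcache_latency_seconds_bucket.
BUCKETS = [
    (0.001, 0), (0.005, 400), (0.01, 800),
    (0.025, 950), (0.05, 990), (0.1, 1000), (math.inf, 1000),
]

for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)} ~ {histogram_quantile(q, BUCKETS) * 1000:.2f} ms")
```

Because of the interpolation, the reported value can never be more precise than the bucket boundaries; a P95 close to an alert threshold is best read with the neighbouring bucket bounds in mind.
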
### Gauge Metrics

#### `provcache_items_count`

Current number of items in cache.

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier |

```promql
# Total cached items
sum(provcache_items_count)

# Items by tier
sum by (source) (provcache_items_count)
```

### Invalidation Metrics

#### `provcache_invalidations_total`

Total invalidation events.

| Label | Values | Description |
|-------|--------|-------------|
| `reason` | `signer_revoked`, `epoch_advanced`, `ttl_expired`, `manual` | Invalidation trigger |

```promql
# Invalidation rate by reason
sum by (reason) (rate(provcache_invalidations_total[5m]))

# Security-related invalidations
sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m]))
```

### Trust Score Metrics

#### `provcache_trust_score_average`

Gauge showing average trust score across cached decisions.

```promql
# Current average trust score
provcache_trust_score_average
```

#### `provcache_trust_score_bucket`

Histogram of trust score distribution.

Buckets: `20, 40, 60, 80, 100`

```promql
# Percentage of decisions with a trust score above 80.
# Buckets are cumulative, so the difference counts scores in (80, 100].
(
  sum(rate(provcache_trust_score_bucket{le="100"}[5m])) -
  sum(rate(provcache_trust_score_bucket{le="80"}[5m]))
) /
sum(rate(provcache_trust_score_bucket{le="100"}[5m])) * 100
```

---

## Grafana Dashboard

A pre-built dashboard is available at `deploy/grafana/dashboards/provcache-overview.json`.

### Panels

| Panel | Type | Description |
|-------|------|-------------|
| Cache Hit Rate | Gauge | Current hit rate percentage |
| Hit Rate Over Time | Time series | Hit rate trend |
| Latency Percentiles | Time series | P50, P95, P99 latency |
| Invalidation Rate | Time series | Invalidations per minute |
| Cache Size | Time series | Item count over time |
| Hits by Source | Pie chart | Valkey vs Postgres distribution |
| Entry Size Distribution | Histogram | Size of cached entries |
| Trust Score Distribution | Histogram | Decision trust scores |

### Importing the Dashboard

```bash
# Via Grafana HTTP API
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -d @deploy/grafana/dashboards/provcache-overview.json

# Via Helm (auto-provisioned)
# Dashboard is auto-imported when using StellaOps Helm chart
helm upgrade stellaops ./deploy/helm/stellaops \
  --set grafana.dashboards.provcache.enabled=true
```

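Grafana's `POST /api/dashboards/db` endpoint takes the dashboard model wrapped in a `dashboard` key. If the JSON file holds a bare dashboard model, it needs wrapping before upload. The sketch below shows one way to build that request body; the helper name `build_import_payload` is hypothetical, and whether `provcache-overview.json` is already wrapped is not stated here, so both cases are handled.

```python
import json
from typing import Any, Dict


def build_import_payload(file_contents: Dict[str, Any], overwrite: bool = True) -> str:
    """Return a JSON request body for Grafana's POST /api/dashboards/db."""
    if isinstance(file_contents.get("dashboard"), dict):
        # The file is already in API form: reuse the embedded model.
        model = dict(file_contents["dashboard"])
    else:
        # The file is a bare dashboard model (a shallow copy keeps the input intact).
        model = dict(file_contents)
    # Grafana's API examples use a null id when creating a dashboard; a numeric
    # id is only meaningful on the instance that exported it.
    model["id"] = None
    return json.dumps({"dashboard": model, "overwrite": overwrite})


# Example with a stand-in model; real input would come from json.load() on the file.
body = build_import_payload({"id": 7, "uid": "provcache", "title": "Provcache Overview"})
print(body)
```

The resulting string can be sent as the request body in place of the raw file.
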
---

## Alerting Rules

### Recommended Alerts

#### Low Cache Hit Rate

```yaml
alert: ProvcacheLowHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.7
for: 10m
labels:
  severity: warning
annotations:
  summary: "Provcache hit rate below 70%"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Check for invalidation storms or cold cache."
```

#### Critical Hit Rate Drop

```yaml
alert: ProvcacheCriticalHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.5
for: 5m
labels:
  severity: critical
annotations:
  summary: "Provcache hit rate critically low"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Immediate investigation required."
```

#### High Latency

```yaml
alert: ProvcacheHighLatency
expr: |
  histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m])) > 0.1
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache P95 latency above 100ms"
  description: "P95 get latency is {{ $value | humanizeDuration }}. Check Valkey/Postgres performance."
```

#### Excessive Invalidations

```yaml
alert: ProvcacheInvalidationStorm
expr: |
  sum(rate(provcache_invalidations_total[5m])) > 100
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache invalidation rate spike"
  description: "Invalidations at {{ $value }} per second. Check for feed epoch changes or revocations."
```

#### Signer Revocation Spike

```yaml
alert: ProvcacheSignerRevocations
expr: |
  sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m])) > 10
for: 2m
labels:
  severity: critical
annotations:
  summary: "Signer revocation causing mass invalidation"
  description: "{{ $value }} invalidations/sec due to signer revocation. Security event investigation required."
```

#### Cache Size Approaching Limit

```yaml
alert: ProvcacheSizeHigh
expr: |
  sum(provcache_items_count) > 900000
for: 15m
labels:
  severity: warning
annotations:
  summary: "Provcache size approaching limit"
  description: "Cache has {{ $value }} items. Consider scaling or tuning TTL."
```

#### Low Trust Scores

```yaml
alert: ProvcacheLowTrustScores
expr: |
  provcache_trust_score_average < 60
for: 30m
labels:
  severity: info
annotations:
  summary: "Average trust score below 60"
  description: "Average trust score is {{ $value }}. Review SBOM completeness and VEX coverage."
```

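Each rule's `for:` clause means the alert does not fire on the first bad evaluation: it stays pending until the expression has been true continuously for that duration, and any good evaluation resets the clock. The Python sketch below walks through that state machine for a "value below threshold" rule such as `ProvcacheCriticalHitRate`. The 60-second evaluation interval, the sample values, and the function name `alert_states` are assumptions for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(
    samples: Iterable[Tuple[int, float]], threshold: float, for_seconds: int
) -> List[str]:
    """Classify each (timestamp, value) evaluation as inactive, pending, or firing."""
    states: List[str] = []
    active_since: Optional[int] = None
    for ts, value in samples:
        if value < threshold:
            if active_since is None:
                active_since = ts  # first evaluation where the condition holds
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # condition cleared: the hold timer restarts
            states.append("inactive")
    return states


# Hypothetical hit-rate readings, one evaluation every 60 seconds.
readings = [0.60, 0.45, 0.40, 0.40, 0.40, 0.40, 0.40, 0.55, 0.45]
samples = [(i * 60, v) for i, v in enumerate(readings)]

for (ts, v), state in zip(samples, alert_states(samples, threshold=0.5, for_seconds=300)):
    print(f"t={ts:>3}s hit_rate={v:.2f} -> {state}")
```

In this run the alert fires only at t=360s, five minutes after the first bad reading at t=60s, and drops back to inactive as soon as one reading recovers. Shorter `for:` values page faster but are more sensitive to brief dips.
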
### AlertManager Configuration

```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      # Keep evaluating sibling routes so that critical signer-revocation
      # alerts also reach the security-team route below.
      continue: true
    - match:
        alertname: ProvcacheSignerRevocations
      receiver: 'security-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#stellaops-alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'

  - name: 'security-team'
    email_configs:
      - to: 'security@example.com'
        send_resolved: true
```

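Alertmanager walks sibling routes in order and stops at the first match unless that route sets `continue: true`. Since `ProvcacheSignerRevocations` carries `severity: critical`, route order and `continue` decide whether it reaches `security-team` as well as `pagerduty-critical`. The Python sketch below models only that first-match rule for a flat list of routes with exact-match labels; the `Route` class and `receivers_for` function are hypothetical and leave out grouping, nesting, and regex matchers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Route:
    match: Dict[str, str]      # labels that must all be equal on the alert
    receiver: str
    continue_: bool = False    # stands in for the `continue` key in alertmanager.yml


def receivers_for(labels: Dict[str, str], routes: List[Route], default: str) -> List[str]:
    """Return the receivers an alert with `labels` would be delivered to."""
    matched: List[str] = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route.match.items()):
            matched.append(route.receiver)
            if not route.continue_:
                break  # first match wins unless the route asks to continue
    return matched or [default]


signer_alert = {"alertname": "ProvcacheSignerRevocations", "severity": "critical"}

stop_at_first = [
    Route({"severity": "critical"}, "pagerduty-critical"),
    Route({"alertname": "ProvcacheSignerRevocations"}, "security-team"),
]
with_continue = [
    Route({"severity": "critical"}, "pagerduty-critical", continue_=True),
    Route({"alertname": "ProvcacheSignerRevocations"}, "security-team"),
]

print(receivers_for(signer_alert, stop_at_first, "default-receiver"))
print(receivers_for(signer_alert, with_continue, "default-receiver"))
```

Without `continue`, the signer alert stops at the critical route and the security team is never notified; with it, both receivers are. Placing the more specific `alertname` route first (with `continue: true`) would have the same effect.
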
---

## Recording Rules

Pre-compute expensive queries for dashboard performance:

```yaml
# prometheus-rules.yml
groups:
  - name: provcache-recording
    interval: 30s
    rules:
      # Hit rate pre-computed
      - record: provcache:hit_rate:5m
        expr: |
          sum(rate(provcache_requests_total{result="hit"}[5m])) /
          sum(rate(provcache_requests_total[5m]))

      # P95 latency pre-computed
      - record: provcache:latency_p95:5m
        expr: |
          histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

      # Invalidation rate
      - record: provcache:invalidation_rate:5m
        expr: |
          sum(rate(provcache_invalidations_total[5m]))

      # Cache efficiency (hits as a fraction of hits plus misses)
      - record: provcache:efficiency:5m
        expr: |
          sum(rate(provcache_hits_total[5m])) /
          (sum(rate(provcache_hits_total[5m])) + sum(rate(provcache_misses_total[5m])))
```

---

## Operational Runbook

### Low Hit Rate Investigation

1. **Check invalidation metrics**: is there an invalidation storm?
   ```promql
   sum by (reason) (rate(provcache_invalidations_total[5m]))
   ```

2. **Check cache age**: is the cache newly deployed (cold)? A low item count that is still climbing points to a cold cache.
   ```promql
   sum(provcache_items_count)
   ```

3. **Check request patterns**: are there many unique VeriKeys?
   ```promql
   # High cardinality of unique requests suggests insufficient cache sharing
   ```

4. **Check TTL configuration**: is the TTL too aggressive?
   - Review the `Provcache:DefaultTtl` setting
   - Consider increasing it for stable workloads

### High Latency Investigation

1. **Check Valkey health**
   ```bash
   redis-cli -h valkey info stats
   ```

2. **Check Postgres connections**
   ```sql
   SELECT count(*) FROM pg_stat_activity WHERE datname = 'stellaops';
   ```

3. **Check entry sizes**
   ```promql
   histogram_quantile(0.95, rate(provcache_entry_size_bytes_bucket[5m]))
   ```

4. **Check network latency** between services

### Invalidation Storm Response

1. **Identify cause**
   ```promql
   sum by (reason) (increase(provcache_invalidations_total[10m]))
   ```

2. **If epoch-related**: Expected during feed updates. Monitor duration.

3. **If signer-related**: Security event. Escalate to the security team.

4. **If manual**: Check audit logs for unauthorized invalidation.

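The triage steps above can be sketched as a small lookup: take the per-reason increases returned by the query in step 1, pick the dominant reason, and map it to the matching runbook action. This is only an illustration; the function name `triage`, the input shape, and the fallback message for reasons without a runbook step (such as `ttl_expired`) are assumptions.

```python
from typing import Dict, Optional, Tuple

# Actions taken from the runbook steps; keys are `reason` label values.
RUNBOOK_ACTIONS = {
    "epoch_advanced": "Expected during feed updates. Monitor duration.",
    "signer_revoked": "Security event. Escalate to the security team.",
    "manual": "Check audit logs for unauthorized invalidation.",
}


def triage(increase_by_reason: Dict[str, float]) -> Tuple[Optional[str], str]:
    """Return (dominant_reason, suggested_action) for a 10-minute window."""
    if not increase_by_reason or max(increase_by_reason.values()) <= 0:
        return None, "No invalidations in the window."
    # The reason with the largest increase is treated as the cause of the storm.
    reason = max(increase_by_reason, key=increase_by_reason.get)
    action = RUNBOOK_ACTIONS.get(reason, "No runbook step for this reason. Investigate further.")
    return reason, action


# Hypothetical query result: invalidations per reason over the last 10 minutes.
window = {"epoch_advanced": 1200, "signer_revoked": 3, "ttl_expired": 40, "manual": 0}
reason, action = triage(window)
print(f"{reason}: {action}")
```

Picking only the single largest contributor is a simplification; a smaller but non-zero `signer_revoked` count may still deserve attention during an epoch-driven storm.
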
---

## Related Documentation

- [Provcache Module README](../provcache/README.md): Core concepts
- [Provcache Architecture](../provcache/architecture.md): Technical details
- [Telemetry Architecture](../telemetry/architecture.md): Observability patterns
- [Grafana Dashboard Guide](../../deploy/grafana/README.md): Dashboard management