# Provcache Metrics and Alerting Guide

This document describes the Prometheus metrics exposed by the Provcache layer and recommended alerting configurations.

## Overview

Provcache emits metrics for monitoring cache performance, hit rates, latency, and invalidation patterns. These metrics enable operators to:

- Track cache effectiveness
- Identify performance degradation
- Detect anomalous invalidation patterns
- Plan capacity for cache infrastructure

## Prometheus Metrics

### Request Counters

#### `provcache_requests_total`

Total number of cache requests.

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier that handled the request |
| `result` | `hit`, `miss`, `expired` | Request outcome |

```promql
# Per-second request rate, averaged over the last minute
rate(provcache_requests_total[1m])

# Hit rate percentage
sum(rate(provcache_requests_total{result="hit"}[5m])) /
sum(rate(provcache_requests_total[5m])) * 100
```

#### `provcache_hits_total`

Total cache hits (subset of requests with `result="hit"`).

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier |

```promql
# Share of hits served by Valkey (percentage of all hits)
sum(rate(provcache_hits_total{source="valkey"}[5m])) /
sum(rate(provcache_hits_total[5m])) * 100
```

#### `provcache_misses_total`

Total cache misses.

| Label | Values | Description |
|-------|--------|-------------|
| `reason` | `not_found`, `expired`, `invalidated` | Miss reason |

```promql
# Miss rate by reason
sum by (reason) (rate(provcache_misses_total[5m]))
```

### Latency Histogram

#### `provcache_latency_seconds`

Latency distribution for cache operations.

| Label | Values | Description |
|-------|--------|-------------|
| `operation` | `get`, `set`, `invalidate` | Operation type |
| `source` | `valkey`, `postgres` | Cache tier |

Buckets: `0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0`

```promql
# P50 latency for cache gets
histogram_quantile(0.50, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

# P95 latency
histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

# P99 latency
histogram_quantile(0.99, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
```

### Gauge Metrics

#### `provcache_items_count`

Current number of items in cache.

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier |

```promql
# Total cached items
sum(provcache_items_count)

# Items by tier
sum by (source) (provcache_items_count)
```

### Invalidation Metrics

#### `provcache_invalidations_total`

Total invalidation events.

| Label | Values | Description |
|-------|--------|-------------|
| `reason` | `signer_revoked`, `epoch_advanced`, `ttl_expired`, `manual` | Invalidation trigger |

```promql
# Invalidation rate by reason
sum by (reason) (rate(provcache_invalidations_total[5m]))

# Security-related invalidations
sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m]))
```

### Trust Score Metrics

#### `provcache_trust_score_average`

Gauge showing average trust score across cached decisions.

```promql
# Current average trust score
provcache_trust_score_average
```

#### `provcache_trust_score_bucket`

Histogram of trust score distribution.

Buckets: `20, 40, 60, 80, 100`

```promql
# Percentage of decisions with trust score above 80 (the le="80" bucket includes 80)
(sum(rate(provcache_trust_score_bucket{le="100"}[5m])) -
 sum(rate(provcache_trust_score_bucket{le="80"}[5m])))
/ sum(rate(provcache_trust_score_bucket{le="+Inf"}[5m])) * 100
```

---

## Grafana Dashboard

A pre-built dashboard is available at `deploy/grafana/dashboards/provcache-overview.json`.

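To check which panels a dashboard file defines before importing it, the JSON can be inspected directly. The following is a minimal sketch, assuming the file follows the standard Grafana dashboard JSON model: a top-level `panels` array whose entries carry `title` and `type`, with collapsed row panels nesting their children under their own `panels` key. The helper names `list_panels` and `load_dashboard` and the sample data are hypothetical and not part of Provcache.

```python
import json
from pathlib import Path


def list_panels(dashboard: dict) -> list[tuple[str, str]]:
    """Return (title, type) for every panel, descending into row panels."""
    found = []
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            # Collapsed rows keep their child panels under "panels".
            found.extend(list_panels(panel))
        else:
            found.append((panel.get("title", ""), panel.get("type", "")))
    return found


def load_dashboard(path: str) -> dict:
    """Read a dashboard file, unwrapping a {"dashboard": ...} envelope if present."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data.get("dashboard", data)


# Hypothetical sample standing in for the real dashboard file.
sample = {
    "title": "Provcache Overview",
    "panels": [
        {"title": "Cache Hit Rate", "type": "gauge"},
        {
            "title": "Latency",
            "type": "row",
            "panels": [{"title": "Latency Percentiles", "type": "timeseries"}],
        },
    ],
}

for title, kind in list_panels(sample):
    print(f"{title}: {kind}")
```

Against the real file, the same listing would come from `list_panels(load_dashboard("deploy/grafana/dashboards/provcache-overview.json"))`, provided the file matches the assumed layout.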
### Panels

| Panel | Type | Description |
|-------|------|-------------|
| Cache Hit Rate | Gauge | Current hit rate percentage |
| Hit Rate Over Time | Time series | Hit rate trend |
| Latency Percentiles | Time series | P50, P95, P99 latency |
| Invalidation Rate | Time series | Invalidations per minute |
| Cache Size | Time series | Item count over time |
| Hits by Source | Pie chart | Valkey vs Postgres distribution |
| Entry Size Distribution | Histogram | Size of cached entries |
| Trust Score Distribution | Histogram | Decision trust scores |

### Importing the Dashboard

```bash
# Via Grafana HTTP API
# Note: this endpoint expects the dashboard wrapped as {"dashboard": {...}}; wrap a raw export first.
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -d @deploy/grafana/dashboards/provcache-overview.json

# Via Helm (auto-provisioned)
# Dashboard is auto-imported when using StellaOps Helm chart
helm upgrade stellaops ./deploy/helm/stellaops \
  --set grafana.dashboards.provcache.enabled=true
```

---

## Alerting Rules

### Recommended Alerts

#### Low Cache Hit Rate

```yaml
alert: ProvcacheLowHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.7
for: 10m
labels:
  severity: warning
annotations:
  summary: "Provcache hit rate below 70%"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Check for invalidation storms or cold cache."
```

#### Critical Hit Rate Drop

```yaml
alert: ProvcacheCriticalHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.5
for: 5m
labels:
  severity: critical
annotations:
  summary: "Provcache hit rate critically low"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Immediate investigation required."
```

#### High Latency

```yaml
alert: ProvcacheHighLatency
expr: |
  histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m])) > 0.1
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache P95 latency above 100ms"
  description: "P95 get latency is {{ $value | humanizeDuration }}. Check Valkey/Postgres performance."
```

#### Excessive Invalidations

```yaml
alert: ProvcacheInvalidationStorm
expr: |
  sum(rate(provcache_invalidations_total[5m])) > 100
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache invalidation rate spike"
  description: "Invalidations at {{ $value }} per second. Check for feed epoch changes or revocations."
```

#### Signer Revocation Spike

```yaml
alert: ProvcacheSignerRevocations
expr: |
  sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m])) > 10
for: 2m
labels:
  severity: critical
annotations:
  summary: "Signer revocation causing mass invalidation"
  description: "{{ $value }} invalidations/sec due to signer revocation. Security event investigation required."
```

#### Cache Size Approaching Limit

```yaml
alert: ProvcacheSizeHigh
expr: |
  sum(provcache_items_count) > 900000
for: 15m
labels:
  severity: warning
annotations:
  summary: "Provcache size approaching limit"
  description: "Cache has {{ $value }} items. Consider scaling or tuning TTL."
```

#### Low Trust Scores

```yaml
alert: ProvcacheLowTrustScores
expr: |
  provcache_trust_score_average < 60
for: 30m
labels:
  severity: info
annotations:
  summary: "Average trust score below 60"
  description: "Average trust score is {{ $value }}. Review SBOM completeness and VEX coverage."
```

### AlertManager Configuration

```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      # Keep evaluating sibling routes so that ProvcacheSignerRevocations
      # (severity: critical) also reaches the security team.
      continue: true
    - match:
        alertname: ProvcacheSignerRevocations
      receiver: 'security-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#stellaops-alerts'
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: ''
  - name: 'security-team'
    email_configs:
      - to: 'security@example.com'
        send_resolved: true
```

---

## Recording Rules

Pre-compute expensive queries for dashboard performance:

```yaml
# prometheus-rules.yml
groups:
  - name: provcache-recording
    interval: 30s
    rules:
      # Hit rate pre-computed
      - record: provcache:hit_rate:5m
        expr: |
          sum(rate(provcache_requests_total{result="hit"}[5m])) /
          sum(rate(provcache_requests_total[5m]))

      # P95 latency pre-computed
      - record: provcache:latency_p95:5m
        expr: |
          histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

      # Invalidation rate
      - record: provcache:invalidation_rate:5m
        expr: |
          sum(rate(provcache_invalidations_total[5m]))

      # Cache efficiency (hits as a fraction of hits plus misses)
      - record: provcache:efficiency:5m
        expr: |
          sum(rate(provcache_hits_total[5m])) /
          (sum(rate(provcache_hits_total[5m])) + sum(rate(provcache_misses_total[5m])))
```

---

## Operational Runbook

### Low Hit Rate Investigation

1. **Check invalidation metrics**: Is there an invalidation storm?
   ```promql
   sum by (reason) (rate(provcache_invalidations_total[5m]))
   ```
2. **Check cache age**: Is the cache newly deployed (cold)?
   ```promql
   sum(provcache_items_count)
   ```
3. **Check request patterns**: Are there many unique VeriKeys?
   ```promql
   # High cardinality of unique requests suggests insufficient cache sharing
   ```
4. **Check TTL configuration**: Is TTL too aggressive?
   - Review `Provcache:DefaultTtl` setting
   - Consider increasing for stable workloads

### High Latency Investigation

1. **Check Valkey health**
   ```bash
   redis-cli -h valkey info stats
   ```
2. **Check Postgres connections**
   ```sql
   SELECT count(*) FROM pg_stat_activity WHERE datname = 'stellaops';
   ```
3. **Check entry sizes**
   ```promql
   histogram_quantile(0.95, rate(provcache_entry_size_bytes_bucket[5m]))
   ```
4. **Check network latency** between services

### Invalidation Storm Response

1. **Identify cause**
   ```promql
   sum by (reason) (increase(provcache_invalidations_total[10m]))
   ```
2. **If epoch-related**: Expected during feed updates. Monitor duration.
3. **If signer-related**: Security event; escalate to security team.
4. **If manual**: Check audit logs for unauthorized invalidation.

---

## Related Documentation

- [Provcache Module README](../provcache/README.md): Core concepts
- [Provcache Architecture](../provcache/architecture.md): Technical details
- [Telemetry Architecture](../telemetry/architecture.md): Observability patterns
- [Grafana Dashboard Guide](../../deploy/grafana/README.md): Dashboard management