Provcache Metrics and Alerting Guide
This document describes the Prometheus metrics exposed by the Provcache layer and recommended alerting configurations.
Overview
Provcache emits metrics for monitoring cache performance, hit rates, latency, and invalidation patterns. These metrics enable operators to:
- Track cache effectiveness
- Identify performance degradation
- Detect anomalous invalidation patterns
- Plan capacity for cache infrastructure
Prometheus Metrics
Request Counters
provcache_requests_total
Total number of cache requests.
| Label | Values | Description |
|---|---|---|
| source | valkey, postgres | Cache tier that handled the request |
| result | hit, miss, expired | Request outcome |
# Requests per minute (rate() yields a per-second value, so multiply by 60)
sum(rate(provcache_requests_total[1m])) * 60
# Hit rate percentage
sum(rate(provcache_requests_total{result="hit"}[5m])) /
sum(rate(provcache_requests_total[5m])) * 100
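As a worked illustration of what the hit rate query computes, the following Python sketch derives per-second rates from two counter samples and turns them into a percentage. It is a simplification: the sample values are hypothetical, and the real `rate()` function also handles counter resets and window extrapolation, which this sketch ignores.

```python
def counter_rate(first, last, window_seconds):
    """Per-second increase of a monotonically increasing counter over a window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (last - first) / window_seconds


def hit_rate_percent(hit_rate, total_rate):
    """Hit rate as a percentage, or None when there was no traffic."""
    if total_rate == 0:
        return None
    return 100 * hit_rate / total_rate


# Hypothetical counter samples taken 300 seconds (5 minutes) apart.
hits_per_second = counter_rate(12_000, 14_400, 300)   # 8.0
total_per_second = counter_rate(15_000, 18_000, 300)  # 10.0

print(hit_rate_percent(hits_per_second, total_per_second))  # 80.0
```

Note the zero-traffic case: in PromQL the division yields NaN rather than a number, so a threshold comparison on the ratio will not match while the cache is idle.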
provcache_hits_total
Total cache hits (subset of requests with result="hit").
| Label | Values | Description |
|---|---|---|
| source | valkey, postgres | Cache tier |
# Share of cache hits served by Valkey (percent)
sum(rate(provcache_hits_total{source="valkey"}[5m])) /
sum(rate(provcache_hits_total[5m])) * 100
provcache_misses_total
Total cache misses.
| Label | Values | Description |
|---|---|---|
| reason | not_found, expired, invalidated | Miss reason |
# Miss rate by reason
sum by (reason) (rate(provcache_misses_total[5m]))
Latency Histogram
provcache_latency_seconds
Latency distribution for cache operations.
| Label | Values | Description |
|---|---|---|
| operation | get, set, invalidate | Operation type |
| source | valkey, postgres | Cache tier |
Buckets (seconds): 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0
# P50 latency for cache gets
histogram_quantile(0.50, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
# P95 latency
histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
# P99 latency
histogram_quantile(0.99, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
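The percentile queries are estimates: `histogram_quantile` only sees cumulative bucket counts and interpolates linearly inside the bucket that contains the requested rank. The Python sketch below mirrors that interpolation for classic histograms so the effect of bucket boundaries is easier to see. The observation counts are hypothetical; only the bucket bounds come from this guide.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for the provcache_latency_seconds buckets.
bounds = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, math.inf]
counts = [100, 400, 700, 900, 960, 990, 998, 1000, 1000, 1000, 1000, 1000, 1000]
sample = list(zip(bounds, counts))

p50 = histogram_quantile(0.50, sample)  # falls in the (0.005, 0.01] bucket
p95 = histogram_quantile(0.95, sample)  # falls in the (0.025, 0.05] bucket
print(f"P50 = {p50 * 1000:.2f} ms, P95 = {p95 * 1000:.2f} ms")
```

Because the result can never be more precise than the bucket it lands in, a reported P95 between 25 ms and 50 ms here only says that the true value lies somewhere in that bucket.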
Gauge Metrics
provcache_items_count
Current number of items in cache.
| Label | Values | Description |
|---|---|---|
| source | valkey, postgres | Cache tier |
# Total cached items
sum(provcache_items_count)
# Items by tier
sum by (source) (provcache_items_count)
Invalidation Metrics
provcache_invalidations_total
Total invalidation events.
| Label | Values | Description |
|---|---|---|
| reason | signer_revoked, epoch_advanced, ttl_expired, manual | Invalidation trigger |
# Invalidation rate by reason
sum by (reason) (rate(provcache_invalidations_total[5m]))
# Security-related invalidations
sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m]))
Trust Score Metrics
provcache_trust_score_average
Gauge showing average trust score across cached decisions.
# Current average trust score
provcache_trust_score_average
provcache_trust_score_bucket
Histogram of trust score distribution.
Buckets: 20, 40, 60, 80, 100
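# Percentage of decisions with trust score above 80
# (the le="80" bucket is inclusive, so a score of exactly 80 is not counted)
(
  sum(rate(provcache_trust_score_bucket{le="100"}[5m])) -
  sum(rate(provcache_trust_score_bucket{le="80"}[5m]))
) /
sum(rate(provcache_trust_score_bucket{le="+Inf"}[5m])) * 100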
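Histogram buckets are cumulative, so the share of decisions above a threshold is the total count minus the count at that threshold, divided by the total. The Python sketch below works through that arithmetic with hypothetical counts for the buckets listed above.

```python
def share_above(threshold, cumulative_buckets):
    """Percentage of observations strictly above `threshold`.

    `cumulative_buckets` maps each upper bound (including +Inf) to its
    cumulative observation count, as exposed by a Prometheus histogram.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return None
    return 100 * (total - cumulative_buckets[threshold]) / total


# Hypothetical cumulative counts for buckets 20, 40, 60, 80, 100 and +Inf.
trust_scores = {20: 5, 40: 15, 60: 40, 80: 70, 100: 100, float("inf"): 100}

print(share_above(80, trust_scores))  # 30.0
```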
Grafana Dashboard
A pre-built dashboard is available at deploy/grafana/dashboards/provcache-overview.json.
Panels
| Panel | Type | Description |
|---|---|---|
| Cache Hit Rate | Gauge | Current hit rate percentage |
| Hit Rate Over Time | Time series | Hit rate trend |
| Latency Percentiles | Time series | P50, P95, P99 latency |
| Invalidation Rate | Time series | Invalidations per minute |
| Cache Size | Time series | Item count over time |
| Hits by Source | Pie chart | Valkey vs Postgres distribution |
| Entry Size Distribution | Histogram | Size of cached entries |
| Trust Score Distribution | Histogram | Decision trust scores |
Importing the Dashboard
# Via Grafana HTTP API
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -d @deploy/grafana/dashboards/provcache-overview.json
# Via Helm (auto-provisioned)
# Dashboard is auto-imported when using StellaOps Helm chart
helm upgrade stellaops ./deploy/helm/stellaops \
  --set grafana.dashboards.provcache.enabled=true
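Grafana's `/api/dashboards/db` endpoint expects the dashboard model wrapped in a request body with a `dashboard` key, optionally alongside `overwrite` and `folderUid`. This guide does not state whether the shipped JSON file is already wrapped that way, so the Python sketch below assumes it is a bare dashboard model and shows one way to build the request body. The function name and the sample model are illustrative only.

```python
import json


def build_import_payload(dashboard, overwrite=True, folder_uid=None):
    """Wrap a bare dashboard model in the body expected by /api/dashboards/db."""
    model = dict(dashboard)
    # A null id asks Grafana to create the dashboard or match it by uid.
    model["id"] = None
    payload = {"dashboard": model, "overwrite": overwrite}
    if folder_uid is not None:
        payload["folderUid"] = folder_uid
    return payload


# Hypothetical stand-in for the contents of provcache-overview.json.
exported = {"id": 42, "uid": "provcache-overview", "title": "Provcache Overview"}

body = json.dumps(build_import_payload(exported), indent=2)
print(body)
```

The resulting JSON can be written to a file and passed to `curl` with `-d @<file>` in place of the raw dashboard file.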
Alerting Rules
Recommended Alerts
Low Cache Hit Rate
alert: ProvcacheLowHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.7
for: 10m
labels:
  severity: warning
annotations:
  summary: "Provcache hit rate below 70%"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Check for invalidation storms or cold cache."
Critical Hit Rate Drop
alert: ProvcacheCriticalHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.5
for: 5m
labels:
  severity: critical
annotations:
  summary: "Provcache hit rate critically low"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Immediate investigation required."
High Latency
alert: ProvcacheHighLatency
expr: |
  histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m])) > 0.1
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache P95 latency above 100ms"
  description: "P95 get latency is {{ $value | humanizeDuration }}. Check Valkey/Postgres performance."
Excessive Invalidations
alert: ProvcacheInvalidationStorm
expr: |
  sum(rate(provcache_invalidations_total[5m])) > 100
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache invalidation rate spike"
  description: "Invalidations at {{ $value }} per second. Check for feed epoch changes or revocations."
Signer Revocation Spike
alert: ProvcacheSignerRevocations
expr: |
  sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m])) > 10
for: 2m
labels:
  severity: critical
annotations:
  summary: "Signer revocation causing mass invalidation"
  description: "{{ $value }} invalidations/sec due to signer revocation. Security event investigation required."
Cache Size Approaching Limit
alert: ProvcacheSizeHigh
expr: |
  sum(provcache_items_count) > 900000
for: 15m
labels:
  severity: warning
annotations:
  summary: "Provcache size approaching limit"
  description: "Cache has {{ $value }} items. Consider scaling or tuning TTL."
Low Trust Scores
alert: ProvcacheLowTrustScores
expr: |
  provcache_trust_score_average < 60
for: 30m
labels:
  severity: info
annotations:
  summary: "Average trust score below 60"
  description: "Average trust score is {{ $value }}. Review SBOM completeness and VEX coverage."
AlertManager Configuration
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    # Routes are evaluated in order and the first match wins.
    # ProvcacheSignerRevocations carries severity: critical, so its route is
    # listed first; continue: true lets it also reach the critical route.
    - match:
        alertname: ProvcacheSignerRevocations
      receiver: 'security-team'
      continue: true
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#stellaops-alerts'
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'security-team'
    email_configs:
      - to: 'security@example.com'
        send_resolved: true
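Route order matters in this configuration because Alertmanager stops at the first matching sibling route unless that route sets `continue: true`. The Python sketch below models only that rule for a flat list of routes with exact-match labels. It is a simplified illustration, not Alertmanager's full routing tree, and the function name is hypothetical.

```python
def resolve_receivers(labels, routes, default_receiver):
    """Return the receivers an alert reaches under first-match routing."""
    receivers = []
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless the route continues
    return receivers or [default_receiver]


signer_alert = {"alertname": "ProvcacheSignerRevocations", "severity": "critical"}

critical_first = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    {"match": {"alertname": "ProvcacheSignerRevocations"}, "receiver": "security-team"},
]
specific_first = [
    {"match": {"alertname": "ProvcacheSignerRevocations"},
     "receiver": "security-team", "continue": True},
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
]

print(resolve_receivers(signer_alert, critical_first, "default-receiver"))
# ['pagerduty-critical']
print(resolve_receivers(signer_alert, specific_first, "default-receiver"))
# ['security-team', 'pagerduty-critical']
```

With the critical route listed first, the signer revocation alert never reaches the security team, since it already matched on severity. Listing the specific route first with `continue: true` delivers it to both receivers.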
Recording Rules
Pre-compute expensive queries for dashboard performance:
# prometheus-rules.yml
groups:
  - name: provcache-recording
    interval: 30s
    rules:
      # Hit rate pre-computed
      - record: provcache:hit_rate:5m
        expr: |
          sum(rate(provcache_requests_total{result="hit"}[5m])) /
          sum(rate(provcache_requests_total[5m]))
      # P95 latency pre-computed
      - record: provcache:latency_p95:5m
        expr: |
          histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
      # Invalidation rate
      - record: provcache:invalidation_rate:5m
        expr: |
          sum(rate(provcache_invalidations_total[5m]))
      # Cache efficiency (hits as a fraction of hits plus misses)
      - record: provcache:efficiency:5m
        expr: |
          sum(rate(provcache_hits_total[5m])) /
          (sum(rate(provcache_hits_total[5m])) + sum(rate(provcache_misses_total[5m])))
Operational Runbook
Low Hit Rate Investigation
- Check invalidation metrics: is there an invalidation storm?
  sum by (reason) (rate(provcache_invalidations_total[5m]))
- Check cache age: is the cache newly deployed (cold)?
  sum(provcache_items_count)
- Check request patterns: are there many unique VeriKeys?
  # High cardinality of unique requests suggests insufficient cache sharing
- Check TTL configuration: is the TTL too aggressive?
  - Review the Provcache:DefaultTtl setting
  - Consider increasing it for stable workloads
High Latency Investigation
- Check Valkey health
  redis-cli -h valkey info stats
- Check Postgres connections
  SELECT count(*) FROM pg_stat_activity WHERE datname = 'stellaops';
- Check entry sizes
  histogram_quantile(0.95, rate(provcache_entry_size_bytes_bucket[5m]))
- Check network latency between services
Invalidation Storm Response
- Identify the cause
  sum by (reason) (increase(provcache_invalidations_total[10m]))
- If epoch-related: expected during feed updates. Monitor duration.
- If signer-related: security event. Escalate to the security team.
- If manual: check audit logs for unauthorized invalidation.
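The triage steps above can be sketched as a small lookup keyed on the dominant invalidation reason. This is an illustration only: the function name and the sample counts are hypothetical, the action text restates the runbook entries, and `ttl_expired` falls through to a generic fallback because the runbook does not list a response for it.

```python
RUNBOOK_ACTIONS = {
    "epoch_advanced": "Expected during feed updates. Monitor duration.",
    "signer_revoked": "Security event. Escalate to the security team.",
    "manual": "Check audit logs for unauthorized invalidation.",
}


def triage_invalidation_storm(increase_by_reason):
    """Pick the dominant invalidation reason and its runbook action.

    `increase_by_reason` maps each reason label to its increase over the
    inspection window, e.g. the result of the increase(...[10m]) query.
    """
    if not increase_by_reason:
        return None
    reason = max(increase_by_reason, key=increase_by_reason.get)
    action = RUNBOOK_ACTIONS.get(reason, "No runbook entry. Investigate manually.")
    return reason, action


# Hypothetical 10-minute increases per reason.
observed = {"epoch_advanced": 120, "signer_revoked": 4200, "ttl_expired": 300, "manual": 0}

print(triage_invalidation_storm(observed))
# ('signer_revoked', 'Security event. Escalate to the security team.')
```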
Related Documentation
- Provcache Module README — Core concepts
- Provcache Architecture — Technical details
- Telemetry Architecture — Observability patterns
- Grafana Dashboard Guide — Dashboard management