git.stella-ops.org/docs/modules/provcache/metrics-alerting.md
2025-12-25 23:10:09 +02:00

Provcache Metrics and Alerting Guide

This document describes the Prometheus metrics exposed by the Provcache layer and recommended alerting configurations.

Overview

Provcache emits metrics for monitoring cache performance, hit rates, latency, and invalidation patterns. These metrics enable operators to:

  • Track cache effectiveness
  • Identify performance degradation
  • Detect anomalous invalidation patterns
  • Plan capacity for cache infrastructure

Prometheus Metrics

Request Counters

provcache_requests_total

Total number of cache requests.

| Label | Values | Description |
|-------|--------|-------------|
| source | valkey, postgres | Cache tier that handled the request |
| result | hit, miss, expired | Request outcome |

# Total requests per minute (rate() returns a per-second value)
sum(rate(provcache_requests_total[1m])) * 60

# Hit rate percentage
sum(rate(provcache_requests_total{result="hit"}[5m])) /
sum(rate(provcache_requests_total[5m])) * 100
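
The hit-rate query above can also be evaluated outside Grafana through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, not part of Provcache itself: the base URL `http://prometheus:9090` and all function names are assumptions chosen for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Same expression as the hit-rate query above.
HIT_RATE_QUERY = (
    'sum(rate(provcache_requests_total{result="hit"}[5m])) / '
    'sum(rate(provcache_requests_total[5m])) * 100'
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_scalar_result(payload: dict) -> Optional[float]:
    """Return the single value of an instant-vector response, or None if empty."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None  # no samples in the window, e.g. no traffic yet
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def fetch_hit_rate(base_url: str = "http://prometheus:9090") -> Optional[float]:
    """Query Prometheus for the hit rate in percent (hypothetical base URL)."""
    url = build_query_url(base_url, HIT_RATE_QUERY)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_scalar_result(json.load(resp))
```

Splitting URL construction and response parsing from the network call keeps both testable without a running Prometheus server.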

provcache_hits_total

Total cache hits (subset of requests with result="hit").

| Label | Values | Description |
|-------|--------|-------------|
| source | valkey, postgres | Cache tier |

# Share of cache hits served by Valkey (percent)
sum(rate(provcache_hits_total{source="valkey"}[5m])) /
sum(rate(provcache_hits_total[5m])) * 100

provcache_misses_total

Total cache misses.

| Label | Values | Description |
|-------|--------|-------------|
| reason | not_found, expired, invalidated | Miss reason |

# Miss rate by reason
sum by (reason) (rate(provcache_misses_total[5m]))

Latency Histogram

provcache_latency_seconds

Latency distribution for cache operations.

| Label | Values | Description |
|-------|--------|-------------|
| operation | get, set, invalidate | Operation type |
| source | valkey, postgres | Cache tier |

Buckets: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0

# P50 latency for cache gets
histogram_quantile(0.50, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

# P95 latency
histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

# P99 latency
histogram_quantile(0.99, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
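
Percentiles computed from a histogram are estimates: `histogram_quantile` finds the bucket that contains the requested rank and interpolates linearly inside it, so accuracy depends on the bucket boundaries listed above. The sketch below reimplements that interpolation for classic buckets in simplified form. It is an illustration, not Prometheus's own code, and the sample counts in the usage note are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (le, count) buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Quantile lands in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = buckets[i - 1] if i > 0 else (0.0, 0.0)
    if count == below:
        return upper  # empty bucket, nothing to interpolate
    return lower + (upper - lower) * (rank - below) / (count - below)
```

For example, with cumulative counts of 10, 50, 90 and 100 at the 0.001, 0.005, 0.01 and 0.025 bounds, the P95 estimate falls halfway through the 0.01 to 0.025 bucket. Any observation slower than the largest finite bound (10 s here) is reported as that bound.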

Gauge Metrics

provcache_items_count

Current number of items in cache.

| Label | Values | Description |
|-------|--------|-------------|
| source | valkey, postgres | Cache tier |

# Total cached items
sum(provcache_items_count)

# Items by tier
sum by (source) (provcache_items_count)

Invalidation Metrics

provcache_invalidations_total

Total invalidation events.

| Label | Values | Description |
|-------|--------|-------------|
| reason | signer_revoked, epoch_advanced, ttl_expired, manual | Invalidation trigger |

# Invalidation rate by reason
sum by (reason) (rate(provcache_invalidations_total[5m]))

# Security-related invalidations
sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m]))

Trust Score Metrics

provcache_trust_score_average

Gauge showing average trust score across cached decisions.

# Current average trust score
provcache_trust_score_average

provcache_trust_score_bucket

Histogram of trust score distribution.

Buckets: 20, 40, 60, 80, 100

# Percentage of decisions with trust score above 80
(
  sum(rate(provcache_trust_score_bucket{le="100"}[5m])) -
  sum(rate(provcache_trust_score_bucket{le="80"}[5m]))
) /
sum(rate(provcache_trust_score_bucket{le="100"}[5m])) * 100
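
Histogram buckets are cumulative: the `le="80"` bucket counts every decision scoring 80 or less, so subtracting it from the top bucket leaves the decisions scoring strictly above 80. Dividing by the total turns that difference into a share. A small worked sketch of the arithmetic, using invented counts and a hypothetical helper name:

```python
from typing import Dict


def share_above(cumulative: Dict[float, float], bound: float) -> float:
    """Percent of observations strictly above `bound`, from cumulative bucket counts."""
    total = cumulative[max(cumulative)]  # the highest bucket holds every observation
    if total == 0:
        return 0.0
    return 100 * (total - cumulative[bound]) / total


# Illustrative counts only: 70 of 100 decisions scored 80 or below.
example = {20: 5, 40: 15, 60: 40, 80: 70, 100: 100}
print(share_above(example, 80))
```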

Grafana Dashboard

A pre-built dashboard is available at deploy/grafana/dashboards/provcache-overview.json.

Panels

| Panel | Type | Description |
|-------|------|-------------|
| Cache Hit Rate | Gauge | Current hit rate percentage |
| Hit Rate Over Time | Time series | Hit rate trend |
| Latency Percentiles | Time series | P50, P95, P99 latency |
| Invalidation Rate | Time series | Invalidations per minute |
| Cache Size | Time series | Item count over time |
| Hits by Source | Pie chart | Valkey vs Postgres distribution |
| Entry Size Distribution | Histogram | Size of cached entries |
| Trust Score Distribution | Histogram | Decision trust scores |

Importing the Dashboard

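# Via Grafana HTTP API
# The endpoint expects the dashboard wrapped as {"dashboard": ..., "overwrite": ...}.
# If the file is a raw dashboard export, wrap it first (requires jq):
jq '{dashboard: (. + {id: null}), overwrite: true}' deploy/grafana/dashboards/provcache-overview.json |
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -d @-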

# Via Helm (auto-provisioned)
# Dashboard is auto-imported when using StellaOps Helm chart
helm upgrade stellaops ./deploy/helm/stellaops \
  --set grafana.dashboards.provcache.enabled=true

Alerting Rules

Low Cache Hit Rate

alert: ProvcacheLowHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.7
for: 10m
labels:
  severity: warning
annotations:
  summary: "Provcache hit rate below 70%"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Check for invalidation storms or cold cache."

Critical Hit Rate Drop

alert: ProvcacheCriticalHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.5
for: 5m
labels:
  severity: critical
annotations:
  summary: "Provcache hit rate critically low"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Immediate investigation required."
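
The two hit-rate alerts above differ in both threshold and `for` duration. An alert is pending from the first evaluation at which its expression is true, and fires only once the expression has stayed true for the whole `for` window; a single healthy evaluation resets it. The sketch below models that behavior for a "value below threshold" rule. The function name and sample data are illustrative assumptions.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    hold_seconds: float,
) -> Optional[float]:
    """Return the evaluation time at which a `value < threshold` alert fires.

    `samples` are (evaluation_time_seconds, value) pairs in time order.
    Returns None if the condition never holds for `hold_seconds` in a row.
    """
    pending_since: Optional[float] = None
    for ts, value in samples:
        if value < threshold:
            if pending_since is None:
                pending_since = ts  # alert enters the pending state
            if ts - pending_since >= hold_seconds:
                return ts  # held long enough: pending becomes firing
        else:
            pending_since = None  # condition cleared, start over
    return None
```

With a steady hit rate of 0.6 evaluated once a minute, the warning rule (`< 0.7` for 10m) fires ten minutes after the drop begins, while the critical rule (`< 0.5`) never enters the pending state.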

High Latency

alert: ProvcacheHighLatency
expr: |
  histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m])) > 0.1
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache P95 latency above 100ms"
  description: "P95 get latency is {{ $value | humanizeDuration }}. Check Valkey/Postgres performance."

Excessive Invalidations

alert: ProvcacheInvalidationStorm
expr: |
  sum(rate(provcache_invalidations_total[5m])) > 100
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache invalidation rate spike"
  description: "Invalidations at {{ $value }} per second. Check for feed epoch changes or revocations."

Signer Revocation Spike

alert: ProvcacheSignerRevocations
expr: |
  sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m])) > 10
for: 2m
labels:
  severity: critical
annotations:
  summary: "Signer revocation causing mass invalidation"
  description: "{{ $value }} invalidations/sec due to signer revocation. Security event investigation required."

Cache Size Approaching Limit

alert: ProvcacheSizeHigh
expr: |
  sum(provcache_items_count) > 900000
for: 15m
labels:
  severity: warning
annotations:
  summary: "Provcache size approaching limit"
  description: "Cache has {{ $value }} items. Consider scaling or tuning TTL."

Low Trust Scores

alert: ProvcacheLowTrustScores
expr: |
  provcache_trust_score_average < 60
for: 30m
labels:
  severity: info
annotations:
  summary: "Average trust score below 60"
  description: "Average trust score is {{ $value }}. Review SBOM completeness and VEX coverage."

AlertManager Configuration

# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
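  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      # ProvcacheSignerRevocations is also severity: critical. Without
      # continue, evaluation stops here and never reaches security-team.
      continue: true
    - match:
        alertname: ProvcacheSignerRevocations
      receiver: 'security-team'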

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#stellaops-alerts'
        send_resolved: true
  
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  
  - name: 'security-team'
    email_configs:
      - to: 'security@example.com'
        send_resolved: true
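
Alertmanager walks sibling routes from top to bottom and stops at the first match unless that route sets `continue: true`. Route order therefore decides who is notified when an alert matches more than one route, as `ProvcacheSignerRevocations` does because it also carries `severity: critical`. The sketch below models only that ordering rule for equality matchers; it is a simplification for illustration, not Alertmanager's implementation.

```python
from typing import Any, Dict, List


def matching_receivers(
    labels: Dict[str, str],
    routes: List[Dict[str, Any]],
    default: str,
) -> List[str]:
    """Return the receivers notified for an alert, walking routes in order."""
    receivers: List[str] = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless `continue: true`
    return receivers or [default]  # nothing matched: fall back to the parent receiver


alert = {"alertname": "ProvcacheSignerRevocations", "severity": "critical"}
critical = {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"}
security = {"match": {"alertname": "ProvcacheSignerRevocations"}, "receiver": "security-team"}

# Critical route first, no continue: only PagerDuty is notified.
print(matching_receivers(alert, [critical, security], "default-receiver"))
# Same order with continue on the first route: both receivers are notified.
print(matching_receivers(alert, [{**critical, "continue": True}, security], "default-receiver"))
```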

Recording Rules

Pre-compute expensive queries for dashboard performance:

# prometheus-rules.yml
groups:
  - name: provcache-recording
    interval: 30s
    rules:
      # Hit rate pre-computed
      - record: provcache:hit_rate:5m
        expr: |
          sum(rate(provcache_requests_total{result="hit"}[5m])) /
          sum(rate(provcache_requests_total[5m]))

      # P95 latency pre-computed
      - record: provcache:latency_p95:5m
        expr: |
          histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

      # Invalidation rate
      - record: provcache:invalidation_rate:5m
        expr: |
          sum(rate(provcache_invalidations_total[5m]))

      # Cache efficiency: hits as a fraction of hits plus misses
      - record: provcache:efficiency:5m
        expr: |
          sum(rate(provcache_hits_total[5m])) /
          (sum(rate(provcache_hits_total[5m])) + sum(rate(provcache_misses_total[5m])))

Operational Runbook

Low Hit Rate Investigation

  1. Check invalidation metrics — Is there an invalidation storm?

    sum by (reason) (rate(provcache_invalidations_total[5m]))
    
  2. Check cache age — Is the cache newly deployed (cold)?

    sum(provcache_items_count)
    
  3. Check request patterns — Are there many unique VeriKeys?

    # High cardinality of unique requests suggests insufficient cache sharing
    
  4. Check TTL configuration — Is TTL too aggressive?

    • Review Provcache:DefaultTtl setting
    • Consider increasing for stable workloads

High Latency Investigation

  1. Check Valkey health

    redis-cli -h valkey info stats
    
  2. Check Postgres connections

    SELECT count(*) FROM pg_stat_activity WHERE datname = 'stellaops';
    
  3. Check entry sizes

    histogram_quantile(0.95, rate(provcache_entry_size_bytes_bucket[5m]))
    
  4. Check network latency between services

Invalidation Storm Response

  1. Identify cause

    sum by (reason) (increase(provcache_invalidations_total[10m]))
    
  2. If epoch-related: Expected during feed updates. Monitor duration.

  3. If signer-related: Security event — escalate to security team.

  4. If manual: Check audit logs for unauthorized invalidation.
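
The triage steps above can be sketched as a lookup from the dominant invalidation reason to the matching runbook action. The input is assumed to be the per-reason result of the `increase(...)` query in step 1, already collected into a dictionary. The helper name and fallback text are illustrative, and `ttl_expired` has no dedicated step in this runbook.

```python
from typing import Dict, Tuple

# Actions paraphrased from the runbook steps above.
RUNBOOK_ACTIONS = {
    "epoch_advanced": "Expected during feed updates; monitor duration.",
    "signer_revoked": "Security event; escalate to the security team.",
    "manual": "Check audit logs for unauthorized invalidation.",
}


def triage(increase_by_reason: Dict[str, float]) -> Tuple[str, str]:
    """Return (dominant_reason, suggested_action) for an invalidation storm."""
    if not increase_by_reason:
        raise ValueError("no invalidation data")
    reason = max(increase_by_reason, key=increase_by_reason.get)
    action = RUNBOOK_ACTIONS.get(reason, "No runbook step; investigate manually.")
    return reason, action


# Illustrative counts only.
print(triage({"epoch_advanced": 1200.0, "signer_revoked": 3.0, "manual": 0.0}))
```

Picking only the largest contributor is a deliberate simplification: a small but non-zero `signer_revoked` count may still warrant escalation on its own.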