StellaOps Bot aa70af062e save development progress

2025-12-25 23:10:09 +02:00

Provcache Metrics and Alerting Guide

This document describes the Prometheus metrics exposed by the Provcache layer and recommended alerting configurations.

Overview

Provcache emits metrics for monitoring cache performance, hit rates, latency, and invalidation patterns. These metrics enable operators to:

- Track cache effectiveness
- Identify performance degradation
- Detect anomalous invalidation patterns
- Plan capacity for cache infrastructure

Prometheus Metrics

Request Counters

`provcache_requests_total`

Total number of cache requests.

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier that handled the request |
| `result` | `hit`, `miss`, `expired` | Request outcome |

# Total requests per minute
rate(provcache_requests_total[1m])

# Hit rate percentage
sum(rate(provcache_requests_total{result="hit"}[5m])) /
sum(rate(provcache_requests_total[5m])) * 100
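
To make the arithmetic behind the hit-rate query concrete, here is a minimal Python sketch that reproduces it from raw counter increases. The function name and the sample numbers are illustrative assumptions, not part of Provcache.

```python
from typing import Optional


def hit_rate_percent(hits_delta: float, requests_delta: float) -> Optional[float]:
    """Return the hit rate as a percentage over one window.

    Both arguments are counter increases over the same window, which is
    what rate() compares in the PromQL query. Returns None when there was
    no traffic, mirroring the NaN that PromQL yields for 0 / 0.
    """
    if requests_delta <= 0:
        return None
    return 100.0 * hits_delta / requests_delta


# Hypothetical increases over a 5-minute window.
print(hit_rate_percent(hits_delta=9000, requests_delta=12000))
print(hit_rate_percent(hits_delta=0, requests_delta=0))
```

The no-traffic case is worth handling explicitly: an idle cache has no defined hit rate, and treating it as 0% would produce false alarms.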

`provcache_hits_total`

Total cache hits (subset of requests with result="hit").

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier |

# Share of cache hits served by Valkey (percent)
sum(rate(provcache_hits_total{source="valkey"}[5m])) /
sum(rate(provcache_hits_total[5m])) * 100

`provcache_misses_total`

Total cache misses.

| Label | Values | Description |
|-------|--------|-------------|
| `reason` | `not_found`, `expired`, `invalidated` | Miss reason |

# Miss rate by reason
sum by (reason) (rate(provcache_misses_total[5m]))

Latency Histogram

`provcache_latency_seconds`

Latency distribution for cache operations.

| Label | Values | Description |
|-------|--------|-------------|
| `operation` | `get`, `set`, `invalidate` | Operation type |
| `source` | `valkey`, `postgres` | Cache tier |

Buckets: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0

# P50 latency for cache gets
histogram_quantile(0.50, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

# P95 latency
histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

# P99 latency
histogram_quantile(0.99, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
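
Percentiles from a bucketed histogram are estimates: `histogram_quantile` finds the bucket containing the target rank and interpolates linearly inside it. The following Python sketch is a simplified re-implementation of that idea for illustration only. It uses the bucket bounds listed above, and the cumulative counts are made-up sample data.

```python
import math
from typing import List, Tuple

BOUNDS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The list must be sorted by upper bound and end with a (+Inf, total) entry.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls past the last finite bucket: report its bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts, one per bound, plus the +Inf bucket.
counts = [100, 400, 700, 900, 960, 990, 1000, 1000, 1000, 1000, 1000, 1000]
sample = list(zip(BOUNDS, counts)) + [(math.inf, 1000)]

print(estimate_quantile(0.50, sample))  # falls in the (0.005, 0.01] bucket
print(estimate_quantile(0.95, sample))  # falls in the (0.025, 0.05] bucket
```

Because of the interpolation, the reported P95 can never be more precise than the width of the bucket it lands in, which is worth remembering when choosing alert thresholds close to a bucket boundary.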

Gauge Metrics

`provcache_items_count`

Current number of items in cache.

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier |

# Total cached items
sum(provcache_items_count)

# Items by tier
sum by (source) (provcache_items_count)

Invalidation Metrics

`provcache_invalidations_total`

Total invalidation events.

| Label | Values | Description |
|-------|--------|-------------|
| `reason` | `signer_revoked`, `epoch_advanced`, `ttl_expired`, `manual` | Invalidation trigger |

# Invalidation rate by reason
sum by (reason) (rate(provcache_invalidations_total[5m]))

# Security-related invalidations
sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m]))

Trust Score Metrics

`provcache_trust_score_average`

Gauge showing average trust score across cached decisions.

# Current average trust score
provcache_trust_score_average

`provcache_trust_score_bucket`

Histogram of trust score distribution.

Buckets: 20, 40, 60, 80, 100

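# Percentage of decisions with trust score above 80 (le="80" includes scores equal to 80)
(sum(rate(provcache_trust_score_bucket{le="100"}[5m])) -
 sum(rate(provcache_trust_score_bucket{le="80"}[5m]))) /
sum(rate(provcache_trust_score_bucket{le="+Inf"}[5m])) * 100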
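
Histogram buckets are cumulative, so the share of decisions in a score range is a difference of two buckets divided by the total. A small Python sketch with made-up counts shows the calculation. The function name and the sample data are assumptions for illustration.

```python
from typing import Dict


def percent_above(cumulative: Dict[float, float], threshold: float) -> float:
    """Percentage of observations strictly above `threshold`.

    `cumulative` maps each bucket upper bound to the number of observations
    less than or equal to that bound, as Prometheus `le` buckets do.
    """
    total = cumulative[max(cumulative)]
    if total == 0:
        return float("nan")
    return 100.0 * (total - cumulative[threshold]) / total


# Hypothetical cumulative counts for the 20/40/60/80/100 buckets.
trust_buckets = {20: 50, 40: 150, 60: 400, 80: 700, 100: 1000}

print(percent_above(trust_buckets, 80))
```

Dividing by the total is what turns the bucket difference into a percentage. The difference alone is only a count (or, in PromQL, a per-second rate) of decisions in that range.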

Grafana Dashboard

A pre-built dashboard is available at `deploy/grafana/dashboards/provcache-overview.json`.

Panels

| Panel | Type | Description |
|-------|------|-------------|
| Cache Hit Rate | Gauge | Current hit rate percentage |
| Hit Rate Over Time | Time series | Hit rate trend |
| Latency Percentiles | Time series | P50, P95, P99 latency |
| Invalidation Rate | Time series | Invalidations per minute |
| Cache Size | Time series | Item count over time |
| Hits by Source | Pie chart | Valkey vs Postgres distribution |
| Entry Size Distribution | Histogram | Size of cached entries |
| Trust Score Distribution | Histogram | Decision trust scores |

Importing the Dashboard

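# Via Grafana HTTP API
# Note: this endpoint expects a body of the form {"dashboard": {...}, "overwrite": true}.
# If the file is a raw dashboard export, wrap it in that envelope before posting.
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -d @deploy/grafana/dashboards/provcache-overview.json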

# Via Helm (auto-provisioned)
# Dashboard is auto-imported when using StellaOps Helm chart
helm upgrade stellaops ./deploy/helm/stellaops \
  --set grafana.dashboards.provcache.enabled=true
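
Grafana's `/api/dashboards/db` endpoint takes the dashboard model nested under a `dashboard` key alongside an `overwrite` flag. If the JSON file turns out to be a raw dashboard export (an assumption; the file's layout is not described here), a small Python helper along these lines could build the request body. The helper name is hypothetical.

```python
import json
from typing import Any, Dict


def build_import_payload(dashboard: Dict[str, Any], overwrite: bool = True) -> str:
    """Wrap a raw dashboard model in the envelope the import endpoint expects."""
    model = dict(dashboard)
    # A null id asks Grafana to create the dashboard or match it by uid,
    # instead of reusing a numeric id exported from another instance.
    model["id"] = None
    return json.dumps({"dashboard": model, "overwrite": overwrite})


# Stand-in for the parsed contents of provcache-overview.json.
raw_export = {"id": 7, "uid": "provcache-overview", "title": "Provcache Overview"}
print(build_import_payload(raw_export))
```

The resulting string can be sent as the body of the `curl` request in place of the file reference.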

Alerting Rules

Recommended Alerts

Low Cache Hit Rate

alert: ProvcacheLowHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.7
for: 10m
labels:
  severity: warning
annotations:
  summary: "Provcache hit rate below 70%"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Check for invalidation storms or cold cache."

Critical Hit Rate Drop

alert: ProvcacheCriticalHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.5
for: 5m
labels:
  severity: critical
annotations:
  summary: "Provcache hit rate critically low"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Immediate investigation required."

High Latency

alert: ProvcacheHighLatency
expr: |
  histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m])) > 0.1
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache P95 latency above 100ms"
  description: "P95 get latency is {{ $value | humanizeDuration }}. Check Valkey/Postgres performance."

Excessive Invalidations

alert: ProvcacheInvalidationStorm
expr: |
  sum(rate(provcache_invalidations_total[5m])) > 100
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache invalidation rate spike"
  description: "Invalidations at {{ $value }} per second. Check for feed epoch changes or revocations."

Signer Revocation Spike

alert: ProvcacheSignerRevocations
expr: |
  sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m])) > 10
for: 2m
labels:
  severity: critical
annotations:
  summary: "Signer revocation causing mass invalidation"
  description: "{{ $value }} invalidations/sec due to signer revocation. Security event investigation required."

Cache Size Approaching Limit

alert: ProvcacheSizeHigh
expr: |
  sum(provcache_items_count) > 900000
for: 15m
labels:
  severity: warning
annotations:
  summary: "Provcache size approaching limit"
  description: "Cache has {{ $value }} items. Consider scaling or tuning TTL."

Low Trust Scores

alert: ProvcacheLowTrustScores
expr: |
  provcache_trust_score_average < 60
for: 30m
labels:
  severity: info
annotations:
  summary: "Average trust score below 60"
  description: "Average trust score is {{ $value }}. Review SBOM completeness and VEX coverage."

AlertManager Configuration

# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
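    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      # Keep evaluating sibling routes so critical signer-revocation alerts also reach the security team
      continue: true
    - match:
        alertname: ProvcacheSignerRevocations
      receiver: 'security-team'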

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#stellaops-alerts'
        send_resolved: true
  
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  
  - name: 'security-team'
    email_configs:
      - to: 'security@example.com'
        send_resolved: true
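
Alertmanager evaluates sibling routes in order and stops at the first match unless that route sets `continue: true`. This matters for an alert such as `ProvcacheSignerRevocations`, which carries `severity: critical` and therefore matches both child routes. The Python sketch below is a simplified model of that rule, not Alertmanager's actual implementation.

```python
from typing import Dict, List


def matching_receivers(labels: Dict[str, str], routes: List[dict], default: str) -> List[str]:
    """Return the receivers an alert reaches under first-match routing."""
    receivers = []
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless the route asks to continue
    return receivers or [default]


alert = {"alertname": "ProvcacheSignerRevocations", "severity": "critical"}

without_continue = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    {"match": {"alertname": "ProvcacheSignerRevocations"}, "receiver": "security-team"},
]
with_continue = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match": {"alertname": "ProvcacheSignerRevocations"}, "receiver": "security-team"},
]

print(matching_receivers(alert, without_continue, "default-receiver"))
print(matching_receivers(alert, with_continue, "default-receiver"))
```

Without `continue`, the security team's route is never reached for critical alerts. An alternative is to list the more specific `alertname` route first.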

Recording Rules

Pre-compute expensive queries for dashboard performance:

# prometheus-rules.yml
groups:
  - name: provcache-recording
    interval: 30s
    rules:
      # Hit rate pre-computed
      - record: provcache:hit_rate:5m
        expr: |
          sum(rate(provcache_requests_total{result="hit"}[5m])) /
          sum(rate(provcache_requests_total[5m]))

      # P95 latency pre-computed
      - record: provcache:latency_p95:5m
        expr: |
          histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

      # Invalidation rate
      - record: provcache:invalidation_rate:5m
        expr: |
          sum(rate(provcache_invalidations_total[5m]))

      # Cache efficiency (hits per second vs misses)
      - record: provcache:efficiency:5m
        expr: |
          sum(rate(provcache_hits_total[5m])) /
          (sum(rate(provcache_hits_total[5m])) + sum(rate(provcache_misses_total[5m])))
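
Once recorded, these series can be read like any other metric. As a rough sketch, the Python snippet below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`). The base URL `http://prometheus:9090` and the helper name are assumptions about the deployment.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus instant-query URL for the given expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


# Hypothetical Prometheus address; the series name comes from the rules above.
print(instant_query_url("http://prometheus:9090", "provcache:hit_rate:5m"))
```

Querying the recorded series avoids re-evaluating the underlying `rate()` expressions on every dashboard refresh.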

Operational Runbook

Low Hit Rate Investigation

1. Check invalidation metrics: is there an invalidation storm?

sum by (reason) (rate(provcache_invalidations_total[5m]))

2. Check cache age: is the cache newly deployed (cold)? A low item count points to a cold cache.
```
sum(provcache_items_count)
```

3. Check request patterns: are there many unique VeriKeys?

# High cardinality of unique requests suggests insufficient cache sharing

4. Check TTL configuration: is the TTL too aggressive?
- Review the `Provcache:DefaultTtl` setting
- Consider increasing it for stable workloads

High Latency Investigation

1. Check Valkey health
```
redis-cli -h valkey info stats
```

2. Check Postgres connections

SELECT count(*) FROM pg_stat_activity WHERE datname = 'stellaops';

3. Check entry sizes

histogram_quantile(0.95, rate(provcache_entry_size_bytes_bucket[5m]))

4. Check network latency between services

Invalidation Storm Response

1. Identify cause

sum by (reason) (increase(provcache_invalidations_total[10m]))

2. If epoch-related: expected during feed updates. Monitor the duration.
3. If signer-related: security event. Escalate to the security team.
4. If manual: check audit logs for unauthorized invalidation.
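
The response steps above amount to finding the dominant invalidation reason and mapping it to an action. A minimal Python sketch of that triage follows. The function name, the wording of the actions, and the sample counts are illustrative assumptions.

```python
from typing import Dict, Optional, Tuple

# Actions paraphrased from the runbook, keyed by the `reason` label value.
ACTIONS = {
    "epoch_advanced": "Expected during feed updates; monitor duration.",
    "signer_revoked": "Security event; escalate to the security team.",
    "manual": "Check audit logs for unauthorized invalidation.",
}


def triage(increase_by_reason: Dict[str, float]) -> Optional[Tuple[str, str]]:
    """Return the dominant invalidation reason and the matching runbook action."""
    if not increase_by_reason:
        return None
    reason = max(increase_by_reason, key=increase_by_reason.get)
    # Reasons without a runbook entry (such as ttl_expired) fall through here.
    action = ACTIONS.get(reason, "No runbook entry for this reason; investigate manually.")
    return reason, action


# Hypothetical result of the increase(...[10m]) query, grouped by reason.
print(triage({"epoch_advanced": 120, "signer_revoked": 4800, "manual": 3}))
```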

Related Documentation

- Provcache Module README: core concepts
- Provcache Architecture: technical details
- Telemetry Architecture: observability patterns
- Grafana Dashboard Guide: dashboard management
