
# Provcache Metrics and Alerting Guide

This document describes the Prometheus metrics exposed by the Provcache layer and recommended alerting configurations.

## Overview

Provcache emits metrics for monitoring cache performance, hit rates, latency, and invalidation patterns. These metrics enable operators to:

- Track cache effectiveness
- Identify performance degradation
- Detect anomalous invalidation patterns
- Plan capacity for cache infrastructure

## Prometheus Metrics

### Request Counters

#### `provcache_requests_total`

Total number of cache requests.

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier that handled the request |
| `result` | `hit`, `miss`, `expired` | Request outcome |

```promql
# Total requests per minute (rate() returns a per-second value, so scale by 60)
sum(rate(provcache_requests_total[1m])) * 60

# Hit rate percentage
sum(rate(provcache_requests_total{result="hit"}[5m])) /
sum(rate(provcache_requests_total[5m])) * 100
```
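
The hit rate is computed from counter increases over the window, not from the absolute counter values. A minimal Python sketch of that arithmetic follows, using two hypothetical scrapes of `provcache_requests_total`; the sample values and the `hit_rate_percent` helper are illustrative only, and counter resets are ignored.

```python
# Two hypothetical scrapes of provcache_requests_total, five minutes apart,
# keyed by (source, result). The numbers are made up for illustration.
earlier = {
    ("valkey", "hit"): 10_000, ("valkey", "miss"): 2_000,
    ("postgres", "hit"): 1_500, ("postgres", "miss"): 500,
    ("postgres", "expired"): 100,
}
later = {
    ("valkey", "hit"): 10_800, ("valkey", "miss"): 2_100,
    ("postgres", "hit"): 1_560, ("postgres", "miss"): 530,
    ("postgres", "expired"): 110,
}


def hit_rate_percent(earlier, later):
    """Return hits as a percentage of all requests seen between two scrapes."""
    increases = {key: later[key] - earlier.get(key, 0) for key in later}
    total = sum(increases.values())
    if total == 0:
        return None  # no traffic in the window, so the ratio is undefined
    hits = sum(delta for (_, result), delta in increases.items() if result == "hit")
    return 100.0 * hits / total


print(hit_rate_percent(earlier, later))  # 86.0
```

Returning `None` for an idle window mirrors the fact that the PromQL ratio has no meaningful value when no requests arrived.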

#### `provcache_hits_total`

Total cache hits (subset of requests with `result="hit"`).

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier |

```promql
# Share of all hits served by Valkey (percent)
sum(rate(provcache_hits_total{source="valkey"}[5m])) /
sum(rate(provcache_hits_total[5m])) * 100
```

#### `provcache_misses_total`

Total cache misses.

| Label | Values | Description |
|-------|--------|-------------|
| `reason` | `not_found`, `expired`, `invalidated` | Miss reason |

```promql
# Miss rate by reason
sum by (reason) (rate(provcache_misses_total[5m]))
```

### Latency Histogram

#### `provcache_latency_seconds`

Latency distribution for cache operations.

| Label | Values | Description |
|-------|--------|-------------|
| `operation` | `get`, `set`, `invalidate` | Operation type |
| `source` | `valkey`, `postgres` | Cache tier |

Buckets: `0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0`

```promql
# P50 latency for cache gets
histogram_quantile(0.50, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

# P95 latency
histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

# P99 latency
histogram_quantile(0.99, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))
```
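
`histogram_quantile` estimates a percentile by interpolating linearly inside the bucket that contains it, so the result is only as precise as the bucket boundaries listed above. The Python sketch below reproduces that estimate for a hypothetical set of cumulative bucket counts. It is a simplified model rather than Prometheus's implementation, and the counts and the `estimate_quantile` helper are assumptions for illustration.

```python
from bisect import bisect_left

# Upper bounds (seconds) of the provcache_latency_seconds buckets listed above.
BUCKETS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def estimate_quantile(q, cumulative_counts, bounds=BUCKETS):
    """Estimate a quantile from cumulative bucket counts by linear interpolation.

    cumulative_counts[i] is the number of observations <= bounds[i]; the final
    entry is the +Inf bucket (all observations), so the list has one more
    element than bounds.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(cumulative_counts) != len(bounds) + 1:
        raise ValueError("expected one count per bound plus the +Inf bucket")
    total = cumulative_counts[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = bisect_left(cumulative_counts, rank)
    if idx >= len(bounds):
        return bounds[-1]  # quantile lies in +Inf: report the highest finite bound
    lower_bound = bounds[idx - 1] if idx > 0 else 0.0
    lower_count = cumulative_counts[idx - 1] if idx > 0 else 0
    in_bucket = cumulative_counts[idx] - lower_count
    return lower_bound + (bounds[idx] - lower_bound) * (rank - lower_count) / in_bucket


# Hypothetical cumulative counts for 1000 get operations (last entry is +Inf).
counts = [0, 50, 400, 800, 900, 960, 990, 1000, 1000, 1000, 1000, 1000, 1000]
print(estimate_quantile(0.50, counts))  # about 0.0138 s
print(estimate_quantile(0.95, counts))  # about 0.0917 s
```

In this example the P95 falls in the 0.05 to 0.1 second bucket, so any true value in that range would be reported as an interpolated point between the two bounds.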

### Gauge Metrics

#### `provcache_items_count`

Current number of items in cache.

| Label | Values | Description |
|-------|--------|-------------|
| `source` | `valkey`, `postgres` | Cache tier |

```promql
# Total cached items
sum(provcache_items_count)

# Items by tier
sum by (source) (provcache_items_count)
```

### Invalidation Metrics

#### `provcache_invalidations_total`

Total invalidation events.

| Label | Values | Description |
|-------|--------|-------------|
| `reason` | `signer_revoked`, `epoch_advanced`, `ttl_expired`, `manual` | Invalidation trigger |

```promql
# Invalidation rate by reason
sum by (reason) (rate(provcache_invalidations_total[5m]))

# Security-related invalidations
sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m]))
```

### Trust Score Metrics

#### `provcache_trust_score_average`

Gauge showing average trust score across cached decisions.

```promql
# Current average trust score
provcache_trust_score_average
```

#### `provcache_trust_score_bucket`

Histogram of trust score distribution.

Buckets: `20, 40, 60, 80, 100`

```promql
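# Percentage of decisions with a trust score above 80
# (the le="80" bucket already includes scores equal to 80)
(
  sum(rate(provcache_trust_score_bucket{le="100"}[5m])) -
  sum(rate(provcache_trust_score_bucket{le="80"}[5m]))
) /
sum(rate(provcache_trust_score_bucket{le="+Inf"}[5m])) * 100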
```
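
Prometheus histogram buckets are cumulative: each `le` bucket counts every observation at or below its bound. The share of scores above a bound is therefore the total minus that bucket, divided by the total. A small Python sketch of the arithmetic, with hypothetical counts and a hypothetical `share_above` helper:

```python
# Cumulative counts per `le` bound, in the shape a Prometheus histogram exposes.
# The counts are hypothetical.
cumulative = {"20": 5, "40": 20, "60": 60, "80": 150, "100": 400, "+Inf": 400}


def share_above(cumulative, bound):
    """Percentage of observations strictly greater than `bound`."""
    total = cumulative["+Inf"]
    if total == 0:
        return None  # no observations, so the share is undefined
    return 100.0 * (total - cumulative[bound]) / total


print(share_above(cumulative, "80"))  # 62.5
```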

---

## Grafana Dashboard

A pre-built dashboard is available at `deploy/grafana/dashboards/provcache-overview.json`.

### Panels

| Panel | Type | Description |
|-------|------|-------------|
| Cache Hit Rate | Gauge | Current hit rate percentage |
| Hit Rate Over Time | Time series | Hit rate trend |
| Latency Percentiles | Time series | P50, P95, P99 latency |
| Invalidation Rate | Time series | Invalidations per minute |
| Cache Size | Time series | Item count over time |
| Hits by Source | Pie chart | Valkey vs Postgres distribution |
| Entry Size Distribution | Histogram | Size of cached entries |
| Trust Score Distribution | Histogram | Decision trust scores |

### Importing the Dashboard

```bash
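# Via Grafana HTTP API
# The endpoint expects a body shaped like {"dashboard": {...}, "overwrite": true};
# if the file is a raw dashboard export, wrap it first (for example with jq).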
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -d @deploy/grafana/dashboards/provcache-overview.json

# Via Helm (auto-provisioned)
# Dashboard is auto-imported when using StellaOps Helm chart
helm upgrade stellaops ./deploy/helm/stellaops \
  --set grafana.dashboards.provcache.enabled=true
```

---

## Alerting Rules

### Recommended Alerts

#### Low Cache Hit Rate

```yaml
alert: ProvcacheLowHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.7
for: 10m
labels:
  severity: warning
annotations:
  summary: "Provcache hit rate below 70%"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Check for invalidation storms or cold cache."
```

#### Critical Hit Rate Drop

```yaml
alert: ProvcacheCriticalHitRate
expr: |
  sum(rate(provcache_requests_total{result="hit"}[5m])) /
  sum(rate(provcache_requests_total[5m])) < 0.5
for: 5m
labels:
  severity: critical
annotations:
  summary: "Provcache hit rate critically low"
  description: "Cache hit rate is {{ $value | humanizePercentage }}. Immediate investigation required."
```
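
Both hit-rate alerts rely on `for:` to avoid firing on short dips: the condition must hold at every evaluation for the full duration before the alert moves from pending to firing. The Python sketch below is a simplified model of that behaviour, not Prometheus's implementation. The `alert_states` helper, the one-minute evaluation interval, the shortened three-minute hold, and the sample ratios are all assumptions for illustration.

```python
def alert_states(values, threshold, for_seconds, interval_seconds=60):
    """Return 'inactive', 'pending', or 'firing' for each evaluation.

    The alert condition is `value < threshold`, evaluated once per interval.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value < threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # any healthy evaluation resets the timer
            states.append("inactive")
    return states


# Hypothetical hit ratios sampled once per minute.
ratios = [0.9, 0.65, 0.6, 0.72, 0.6, 0.6, 0.6, 0.6]
print(alert_states(ratios, threshold=0.7, for_seconds=180))
# ['inactive', 'pending', 'pending', 'inactive', 'pending', 'pending', 'pending', 'firing']
```

The brief recovery to 0.72 resets the timer, which is why the warning alert with `for: 10m` tolerates longer dips than the critical alert with `for: 5m`.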

#### High Latency

```yaml
alert: ProvcacheHighLatency
expr: |
  histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m])) > 0.1
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache P95 latency above 100ms"
  description: "P95 get latency is {{ $value | humanizeDuration }}. Check Valkey/Postgres performance."
```

#### Excessive Invalidations

```yaml
alert: ProvcacheInvalidationStorm
expr: |
  sum(rate(provcache_invalidations_total[5m])) > 100
for: 5m
labels:
  severity: warning
annotations:
  summary: "Provcache invalidation rate spike"
  description: "Invalidations at {{ $value }} per second. Check for feed epoch changes or revocations."
```

#### Signer Revocation Spike

```yaml
alert: ProvcacheSignerRevocations
expr: |
  sum(rate(provcache_invalidations_total{reason="signer_revoked"}[5m])) > 10
for: 2m
labels:
  severity: critical
annotations:
  summary: "Signer revocation causing mass invalidation"
  description: "{{ $value }} invalidations/sec due to signer revocation. Security event investigation required."
```

#### Cache Size Approaching Limit

```yaml
alert: ProvcacheSizeHigh
expr: |
  sum(provcache_items_count) > 900000
for: 15m
labels:
  severity: warning
annotations:
  summary: "Provcache size approaching limit"
  description: "Cache has {{ $value }} items. Consider scaling or tuning TTL."
```

#### Low Trust Scores

```yaml
alert: ProvcacheLowTrustScores
expr: |
  provcache_trust_score_average < 60
for: 30m
labels:
  severity: info
annotations:
  summary: "Average trust score below 60"
  description: "Average trust score is {{ $value }}. Review SBOM completeness and VEX coverage."
```

### Alertmanager Configuration

```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
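  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      # Keep evaluating sibling routes so that critical
      # ProvcacheSignerRevocations alerts also reach the security team.
      continue: true
    - match:
        alertname: ProvcacheSignerRevocations
      receiver: 'security-team'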

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#stellaops-alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'

  - name: 'security-team'
    email_configs:
      - to: 'security@example.com'
        send_resolved: true
```
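
Alertmanager evaluates child routes in order and stops at the first match unless that route sets `continue: true`. Because `ProvcacheSignerRevocations` carries `severity: critical`, the order of routes and the `continue` flag decide whether the security team is notified. The Python sketch below models only this first-match rule (no grouping, inhibition, or nested routes); `route_alert` and `build_routes` are hypothetical helpers.

```python
def route_alert(labels, routes, default_receiver="default-receiver"):
    """Return the receivers notified for an alert with the given labels."""
    matched = []
    for route in routes:
        if all(labels.get(name) == value for name, value in route["match"].items()):
            matched.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless the route opts in to continue
    return matched or [default_receiver]


def build_routes(continue_after_critical):
    """Child routes in the same order as the configuration above."""
    return [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical",
         "continue": continue_after_critical},
        {"match": {"alertname": "ProvcacheSignerRevocations"},
         "receiver": "security-team"},
    ]


revocation = {"alertname": "ProvcacheSignerRevocations", "severity": "critical"}
low_hit_rate = {"alertname": "ProvcacheLowHitRate", "severity": "warning"}

print(route_alert(revocation, build_routes(False)))   # ['pagerduty-critical']
print(route_alert(revocation, build_routes(True)))    # ['pagerduty-critical', 'security-team']
print(route_alert(low_hit_rate, build_routes(False)))  # ['default-receiver']
```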

---

## Recording Rules

Pre-compute expensive queries for dashboard performance:

```yaml
# prometheus-rules.yml
groups:
  - name: provcache-recording
    interval: 30s
    rules:
      # Hit rate pre-computed
      - record: provcache:hit_rate:5m
        expr: |
          sum(rate(provcache_requests_total{result="hit"}[5m])) /
          sum(rate(provcache_requests_total[5m]))

      # P95 latency pre-computed
      - record: provcache:latency_p95:5m
        expr: |
          histogram_quantile(0.95, rate(provcache_latency_seconds_bucket{operation="get"}[5m]))

      # Invalidation rate
      - record: provcache:invalidation_rate:5m
        expr: |
          sum(rate(provcache_invalidations_total[5m]))

      # Cache efficiency (hits as a fraction of hits plus misses)
      - record: provcache:efficiency:5m
        expr: |
          sum(rate(provcache_hits_total[5m])) /
          (sum(rate(provcache_hits_total[5m])) + sum(rate(provcache_misses_total[5m])))
```

---

## Operational Runbook

### Low Hit Rate Investigation

1. **Check invalidation metrics** — Is there an invalidation storm?
   ```promql
   sum by (reason) (rate(provcache_invalidations_total[5m]))
   ```

2. **Check cache age** — Is the cache newly deployed (cold)?
   ```promql
   # A low but growing item count suggests a cold cache that is still warming up
   sum(provcache_items_count)
   ```

3. **Check request patterns** — Are there many unique VeriKeys?
   ```promql
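   # None of the metrics above tracks unique VeriKeys directly. As a proxy, a high
   # share of not_found misses suggests many unique requests and little cache sharing.
   sum(rate(provcache_misses_total{reason="not_found"}[5m])) /
   sum(rate(provcache_requests_total[5m]))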
   ```

4. **Check TTL configuration** — Is TTL too aggressive?
   - Review `Provcache:DefaultTtl` setting
   - Consider increasing for stable workloads

### High Latency Investigation

1. **Check Valkey health**
   ```bash
   redis-cli -h valkey info stats
   ```

2. **Check Postgres connections**
   ```sql
   SELECT count(*) FROM pg_stat_activity WHERE datname = 'stellaops';
   ```

3. **Check entry sizes**
   ```promql
   histogram_quantile(0.95, rate(provcache_entry_size_bytes_bucket[5m]))
   ```

4. **Check network latency** between services

### Invalidation Storm Response

1. **Identify cause**
   ```promql
   sum by (reason) (increase(provcache_invalidations_total[10m]))
   ```

2. **If epoch-related**: Expected during feed updates. Monitor duration.

3. **If signer-related**: Security event — escalate to security team.

4. **If manual**: Check audit logs for unauthorized invalidation.
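
The triage steps above can be sketched as a lookup over the per-reason increases returned by the query in step 1. This is an illustrative helper, not part of Provcache: the `triage` function and the sample numbers are hypothetical, and `ttl_expired` has no entry because the runbook does not define a response for it.

```python
# Responses paraphrased from the runbook steps above.
ACTIONS = {
    "epoch_advanced": "Expected during feed updates; monitor duration.",
    "signer_revoked": "Security event; escalate to the security team.",
    "manual": "Check audit logs for unauthorized invalidation.",
}
NO_ENTRY = "No runbook entry for this reason; investigate manually."


def triage(increase_by_reason):
    """Return (reason, action) pairs for active reasons, largest increase first."""
    active = sorted(
        ((reason, count) for reason, count in increase_by_reason.items() if count > 0),
        key=lambda item: item[1],
        reverse=True,
    )
    return [(reason, ACTIONS.get(reason, NO_ENTRY)) for reason, _ in active]


# Hypothetical 10-minute increases per invalidation reason.
sample = {"epoch_advanced": 12000, "signer_revoked": 3, "manual": 0, "ttl_expired": 150}
for reason, action in triage(sample):
    print(f"{reason}: {action}")
```

Listing every active reason, rather than only the dominant one, keeps a small number of `signer_revoked` invalidations visible during a large epoch-driven storm.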

---

## Related Documentation

- [Provcache Module README](../provcache/README.md) — Core concepts
- [Provcache Architecture](../provcache/architecture.md) — Technical details
- [Telemetry Architecture](../telemetry/architecture.md) — Observability patterns
- [Grafana Dashboard Guide](../../deploy/grafana/README.md) — Dashboard management