5.1 KiB
5.1 KiB
Metrics & SLOs (DOCS-OBS-51-001)
Last updated: 2025-12-15
Core metrics (platform-wide)
- Requests:
http_requests_total{tenant,workload,route,status}(counter); latency histogramhttp_request_duration_seconds. - Jobs:
worker_jobs_total{tenant,queue,status};worker_job_duration_seconds. - DB:
db_query_duration_seconds{db,operation};db_pool_in_use,db_pool_available. - Cache:
cache_requests_total{result=hit|miss};cache_latency_seconds. - Queue depth:
queue_depth{tenant,queue}(gauge). - Errors:
errors_total{tenant,workload,code}. - Custom module metrics: keep namespaced (e.g.,
riskengine_score_duration_seconds,notify_delivery_attempts_total).
SLOs (suggested)
- API availability: 99.9% monthly per public service.
- P95 latency: <300 ms for read endpoints; <1 s for write endpoints.
- Worker job success: >99% over 30d; P95 job duration set per queue (document locally).
- Queue backlog: alert when
queue_depth> 1000 for 5 minutes per tenant/queue. - Error budget policy: 28-day rolling window; burn-rate alerts at 2× and 14× budget.
Alert examples
- High error rate:
rate(errors_total[5m]) / rate(http_requests_total[5m]) > 0.02. - Latency regression:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.3. - Queue backlog:
queue_depth > 1000for 5m. - Job failures:
rate(worker_jobs_total{status="failed"}[10m]) > 0.01.
UX KPIs (triage TTFS)
- Targets:
- TTFS first evidence p95: <= 1.5s
- TTFS skeleton p95: <= 0.2s
- Clicks-to-closure median: <= 6
- Evidence completeness avg: >= 90% (>= 3.6/4)
# TTFS first evidence p50/p95
histogram_quantile(0.50, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))
# Clicks-to-closure median
histogram_quantile(0.50, sum(rate(stellaops_clicks_to_closure_bucket[5m])) by (le))
# Evidence completeness average percent (0-4 mapped to 0-100)
100 * (sum(rate(stellaops_evidence_completeness_score_sum[5m])) / clamp_min(sum(rate(stellaops_evidence_completeness_score_count[5m])), 1)) / 4
# Budget violations by phase
sum(rate(stellaops_performance_budget_violations_total[5m])) by (phase)
- Dashboard:
ops/devops/observability/grafana/triage-ttfs.json - Alerts:
ops/devops/observability/triage-alerts.yaml
TTFS Metrics (time-to-first-signal)
-
Core metrics:
ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id}(histogram)ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id}(counter)ttfs_cache_hit_total{surface,cache_hit,signal_source,kind,phase,tenant_id}(counter)ttfs_cache_miss_total{surface,cache_hit,signal_source,kind,phase,tenant_id}(counter)ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id}(counter)ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code}(counter)
-
SLO targets:
- P50 < 2s, P95 < 5s (all surfaces)
- Warm path P50 < 700ms, P95 < 2.5s
- Cold path P95 < 4s
# TTFS latency p50/p95
histogram_quantile(0.50, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
# SLO breach rate (per minute)
60 * sum(rate(ttfs_slo_breach_total[5m]))
Offline Kit (air-gap) metrics
offlinekit_import_total{status,tenant_id}(counter)offlinekit_attestation_verify_latency_seconds{attestation_type,success}(histogram)attestor_rekor_success_total{mode}(counter)attestor_rekor_retry_total{reason}(counter)rekor_inclusion_latency{success}(histogram)
# Import rate by status
sum(rate(offlinekit_import_total[5m])) by (status)
# Import success rate
sum(rate(offlinekit_import_total{status="success"}[5m])) / clamp_min(sum(rate(offlinekit_import_total[5m])), 1)
# Attestation verify p95 by type (success only)
histogram_quantile(0.95, sum(rate(offlinekit_attestation_verify_latency_seconds_bucket{success="true"}[5m])) by (le, attestation_type))
# Rekor inclusion latency p95 (by success)
histogram_quantile(0.95, sum(rate(rekor_inclusion_latency_bucket[5m])) by (le, success))
Dashboard: docs/modules/telemetry/dashboards/offline-kit-operations.json
Observability hygiene
- Tag everything with
tenant,workload,env,region,version. - Keep metric names stable; prefer adding labels over renaming.
- No high-cardinality labels (avoid
user_id,path, raw errors); bucket or hash if needed. - Offline: scrape locally (Prometheus/OTLP); ship exports via bundle if required.
Dashboards
- Golden signals per service: traffic, errors, saturation, latency (P50/P95/P99).
- Queue dashboards: depth, age, throughput, success/fail rates.
- Tracing overlays: link span
statusto error metrics; use exemplars where supported.
Validation checklist
- Metrics emitted with required tags.
- Cardinality review completed (no unbounded labels).
- Alerts wired to error budget policy.
- Dashboards cover golden signals and queue health.