Files
git.stella-ops.org/docs/observability/metrics-and-slos.md
StellaOps Bot 9f6e6f7fb3
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Signals CI & Image / signals-ci (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled
Policy Simulation / policy-simulate (push) Has been cancelled
SDK Publish & Sign / sdk-publish (push) Has been cancelled
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Concelier Attestation Tests / attestation-tests (push) Has been cancelled
devportal-offline / build-offline (push) Has been cancelled
up
2025-11-25 22:09:44 +02:00

2.2 KiB
Raw Blame History

Metrics & SLOs (DOCS-OBS-51-001)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

Core metrics (platform-wide)

  • Requests: http_requests_total{tenant,workload,route,status} (counter); latency histogram http_request_duration_seconds.
  • Jobs: worker_jobs_total{tenant,queue,status}; worker_job_duration_seconds.
  • DB: db_query_duration_seconds{db,operation}; db_pool_in_use, db_pool_available.
  • Cache: cache_requests_total{result=hit|miss}; cache_latency_seconds.
  • Queue depth: queue_depth{tenant,queue} (gauge).
  • Errors: errors_total{tenant,workload,code}.
  • Custom module metrics: keep namespaced (e.g., riskengine_score_duration_seconds, notify_delivery_attempts_total).

SLOs (suggested)

  • API availability: 99.9% monthly per public service.
  • P95 latency: <300 ms for read endpoints; <1 s for write endpoints.
  • Worker job success: >99% over 30d; P95 job duration set per queue (document locally).
  • Queue backlog: alert when queue_depth > 1000 for 5 minutes per tenant/queue.
  • Error budget policy: 28-day rolling window; burn-rate alerts at 2× and 14× budget.

Alert examples

  • High error rate: rate(errors_total[5m]) / rate(http_requests_total[5m]) > 0.02.
  • Latency regression: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.3.
  • Queue backlog: queue_depth > 1000 for 5m.
  • Job failures: rate(worker_jobs_total{status="failed"}[10m]) > 0.01.

Observability hygiene

  • Tag everything with tenant, workload, env, region, version.
  • Keep metric names stable; prefer adding labels over renaming.
  • No high-cardinality labels (avoid user_id, path, raw errors); bucket or hash if needed.
  • Offline: scrape locally (Prometheus/OTLP); ship exports via bundle if required.

Dashboards

  • Golden signals per service: traffic, errors, saturation, latency (P50/P95/P99).
  • Queue dashboards: depth, age, throughput, success/fail rates.
  • Tracing overlays: link span status to error metrics; use exemplars where supported.

Validation checklist

  • Metrics emitted with required tags.
  • Cardinality review completed (no unbounded labels).
  • Alerts wired to error budget policy.
  • Dashboards cover golden signals and queue health.