# Metrics & SLOs (DOCS-OBS-51-001)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

## Core metrics (platform-wide)

- **Requests**: `http_requests_total{tenant,workload,route,status}` (counter); latency histogram `http_request_duration_seconds`.
- **Jobs**: `worker_jobs_total{tenant,queue,status}`; `worker_job_duration_seconds`.
- **DB**: `db_query_duration_seconds{db,operation}`; `db_pool_in_use`, `db_pool_available`.
- **Cache**: `cache_requests_total{result=hit|miss}`; `cache_latency_seconds`.
- **Queue depth**: `queue_depth{tenant,queue}` (gauge).
- **Errors**: `errors_total{tenant,workload,code}`.
- **Custom module metrics**: keep namespaced (e.g., `riskengine_score_duration_seconds`, `notify_delivery_attempts_total`).

## SLOs (suggested)

- API availability: 99.9% monthly per public service.
- P95 latency: <300 ms for read endpoints; <1 s for write endpoints.
- Worker job success: >99% over 30 days; P95 job duration target set per queue (document locally).
- Queue backlog: alert when `queue_depth` > 1000 for 5 minutes per tenant/queue.
- Error budget policy: 28-day rolling window; burn-rate alerts at 2× and 14× the budgeted burn rate.

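
The error budget arithmetic behind these targets can be sketched as follows. This is an illustrative calculation, not part of any platform tooling: the helper names `error_budget` and `time_to_exhaustion` are hypothetical, and applying the 99.9% availability target to the 28-day window is an assumption made for the example.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Time a service may be out of SLO within the window (slo is a fraction, e.g. 0.999)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return window * (1.0 - slo)


def time_to_exhaustion(window: timedelta, burn_rate: float) -> timedelta:
    """How long the full budget lasts when consumed at burn_rate times the sustainable pace."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window / burn_rate


# Assumption: the 99.9% availability target evaluated over the 28-day rolling window.
WINDOW = timedelta(days=28)

print(error_budget(0.999, WINDOW))     # roughly 40 minutes of allowed unavailability
print(time_to_exhaustion(WINDOW, 2))   # 14 days at a 2x burn rate
print(time_to_exhaustion(WINDOW, 14))  # 2 days at a 14x burn rate
```

Read this way, the 14× alert catches incidents that would spend the whole budget within two days, while the 2× alert catches slower degradation that would spend it in half the window.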
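## Alert examples

- High error rate: `sum by (tenant, workload) (rate(errors_total[5m])) / sum by (tenant, workload) (rate(http_requests_total[5m])) > 0.02`. Both sides are aggregated to their shared labels; without this the division matches no series, because the two metrics carry different label sets.
- Latency regression: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.3`.
- Queue backlog: `queue_depth > 1000` for 5m.
- Job failures: `rate(worker_jobs_total{status="failed"}[10m]) > 0.01` (more than 0.01 failed jobs per second, per tenant/queue series).
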
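
To make the latency alert easier to reason about, here is a minimal sketch of how a quantile is estimated from cumulative histogram buckets by linear interpolation, which is the general approach `histogram_quantile` takes. The function below is a simplified illustration and does not reproduce every edge case of the real PromQL function; the bucket bounds and counts are made-up sample data.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for buckets le=0.1, 0.3, 1.0, +Inf (seconds).
sample = [(0.1, 60.0), (0.3, 90.0), (1.0, 100.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, sample)
print(p95, p95 > 0.3)  # the estimate lands between 0.3 s and 1.0 s, so the alert would fire
```

Because the estimate is interpolated inside a bucket, its precision depends on bucket boundaries; placing a boundary at the SLO threshold (0.3 s here) keeps the alert from being driven by interpolation error.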
## Observability hygiene

- Tag everything with `tenant`, `workload`, `env`, `region`, `version`.
- Keep metric names stable; prefer adding labels over renaming.
- No high-cardinality labels (avoid `user_id`, `path`, raw error messages); bucket values or hash them into a bounded set if needed.
- Offline: collect locally (Prometheus scrape or OTLP); ship exports via bundle if required.

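
One way to keep label cardinality bounded is sketched below. Both helpers are hypothetical and not part of any platform SDK: `route_template` collapses numeric and UUID path segments into a placeholder so a raw path becomes a route, and `hash_bucket` maps an unbounded identifier onto a fixed number of buckets. The bucket count of 16 and the `:id` placeholder are arbitrary choices for illustration.

```python
import hashlib
import re

# Path segments that look like numeric IDs or UUIDs are treated as variable parts.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def route_template(path: str) -> str:
    """Turn a raw request path into a low-cardinality route label."""
    path = path.split("?", 1)[0]  # drop the query string
    return "/".join(":id" if _ID_SEGMENT.match(part) else part for part in path.split("/"))


def hash_bucket(value: str, buckets: int = 16) -> str:
    """Map an unbounded value (e.g. a user ID) onto one of `buckets` stable label values."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return f"{int.from_bytes(digest[:8], 'big') % buckets:02d}"


print(route_template("/users/12345/orders/77?expand=items"))  # /users/:id/orders/:id
print(hash_bucket("user-42"))  # one of "00" to "15", the same value on every run
```

Hashing alone does not reduce cardinality; it is the modulo step that bounds the label to a fixed set of values.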
## Dashboards

- Golden signals per service: traffic, errors, saturation, latency (P50/P95/P99).
- Queue dashboards: depth, age, throughput, success/fail rates.
- Tracing overlays: link span `status` to error metrics; use exemplars where supported.

## Validation checklist

- [ ] Metrics emitted with required tags.
- [ ] Cardinality review completed (no unbounded labels).
- [ ] Alerts wired to error budget policy.
- [ ] Dashboards cover golden signals and queue health.

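
The first checklist item could be partly automated with a check like the sketch below, which scans Prometheus text exposition output and reports samples missing required tags. The `missing_tags` helper and its regular expressions are hypothetical and deliberately simplified; the sample payload is invented. If tags such as `env` or `region` are attached at scrape time as target labels rather than emitted by the service, the check would need to run against the scraped series instead.

```python
import re

REQUIRED_TAGS = {"tenant", "workload", "env", "region", "version"}

# Simplified patterns for `name{label="value",...} value` sample lines.
_SAMPLE = re.compile(r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+\S+")
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="(?:[^"\\]|\\.)*"')


def missing_tags(exposition: str, required: set[str] = REQUIRED_TAGS) -> dict[str, set[str]]:
    """Return {metric_name: missing_tag_names} for samples lacking required tags."""
    problems: dict[str, set[str]] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match:
            continue  # ignore lines this simplified parser does not understand
        present = set(_LABEL.findall(match.group("labels") or ""))
        missing = set(required) - present
        if missing:
            problems.setdefault(match.group("name"), set()).update(missing)
    return problems


payload = """\
# TYPE http_requests_total counter
http_requests_total{tenant="t1",workload="api",env="prod",region="eu",version="1.2.3",route="/health",status="200"} 42
queue_depth{tenant="t1",queue="scan"} 7
"""
print(missing_tags(payload))  # reports queue_depth as missing workload, env, region, version
```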