Files
git.stella-ops.org/docs/observability/metrics-and-slos.md
master 5a480a3c2a
Some checks failed
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Docs CI / lint-and-preview (push) Has been cancelled
Export Center CI / export-ci (push) Has been cancelled
Findings Ledger CI / build-test (push) Has been cancelled
Findings Ledger CI / migration-validation (push) Has been cancelled
Findings Ledger CI / generate-manifest (push) Has been cancelled
Lighthouse CI / Lighthouse Audit (push) Has been cancelled
Lighthouse CI / Axe Accessibility Audit (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled
Reachability Corpus Validation / validate-corpus (push) Has been cancelled
Reachability Corpus Validation / validate-ground-truths (push) Has been cancelled
Scanner Analyzers / Discover Analyzers (push) Has been cancelled
Scanner Analyzers / Validate Test Fixtures (push) Has been cancelled
Signals CI & Image / signals-ci (push) Has been cancelled
Signals Reachability Scoring & Events / reachability-smoke (push) Has been cancelled
Reachability Corpus Validation / determinism-check (push) Has been cancelled
Scanner Analyzers / Build Analyzers (push) Has been cancelled
Scanner Analyzers / Test Language Analyzers (push) Has been cancelled
Scanner Analyzers / Verify Deterministic Output (push) Has been cancelled
Signals Reachability Scoring & Events / sign-and-upload (push) Has been cancelled
Add call graph fixtures for various languages and scenarios
- Introduced `all-edge-reasons.json` to test edge resolution reasons in .NET.
- Added `all-visibility-levels.json` to validate method visibility levels in .NET.
- Created `dotnet-aspnetcore-minimal.json` for a minimal ASP.NET Core application.
- Included `go-gin-api.json` for a Go Gin API application structure.
- Added `java-spring-boot.json` for the Spring PetClinic application in Java.
- Introduced `legacy-no-schema.json` for legacy application structure without schema.
- Created `node-express-api.json` for an Express.js API application structure.
2025-12-16 10:44:24 +02:00

114 lines
5.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Metrics & SLOs (DOCS-OBS-51-001)
Last updated: 2025-12-15
## Core metrics (platform-wide)
- **Requests**: `http_requests_total{tenant,workload,route,status}` (counter); latency histogram `http_request_duration_seconds`.
- **Jobs**: `worker_jobs_total{tenant,queue,status}`; `worker_job_duration_seconds`.
- **DB**: `db_query_duration_seconds{db,operation}`; `db_pool_in_use`, `db_pool_available`.
- **Cache**: `cache_requests_total{result=hit|miss}`; `cache_latency_seconds`.
- **Queue depth**: `queue_depth{tenant,queue}` (gauge).
- **Errors**: `errors_total{tenant,workload,code}`.
- **Custom module metrics**: keep namespaced (e.g., `riskengine_score_duration_seconds`, `notify_delivery_attempts_total`).
## SLOs (suggested)
- API availability: 99.9% monthly per public service.
- P95 latency: <300 ms for read endpoints; <1 s for write endpoints.
- Worker job success: >99% over 30d; P95 job duration set per queue (document locally).
- Queue backlog: alert when `queue_depth` > 1000 for 5 minutes per tenant/queue.
- Error budget policy: 28-day rolling window; burn-rate alerts at 2× and 14× budget.
## Alert examples
- High error rate: `rate(errors_total[5m]) / rate(http_requests_total[5m]) > 0.02`.
- Latency regression: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.3`.
- Queue backlog: `queue_depth > 1000` for 5m.
- Job failures: `rate(worker_jobs_total{status="failed"}[10m]) > 0.01`.
## UX KPIs (triage TTFS)
- Targets:
- TTFS first evidence p95: <= 1.5s
- TTFS skeleton p95: <= 0.2s
- Clicks-to-closure median: <= 6
- Evidence completeness avg: >= 90% (>= 3.6/4)
```promql
# TTFS first evidence p50/p95
histogram_quantile(0.50, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))
# Clicks-to-closure median
histogram_quantile(0.50, sum(rate(stellaops_clicks_to_closure_bucket[5m])) by (le))
# Evidence completeness average percent (0-4 mapped to 0-100)
100 * (sum(rate(stellaops_evidence_completeness_score_sum[5m])) / clamp_min(sum(rate(stellaops_evidence_completeness_score_count[5m])), 1)) / 4
# Budget violations by phase
sum(rate(stellaops_performance_budget_violations_total[5m])) by (phase)
```
- Dashboard: `ops/devops/observability/grafana/triage-ttfs.json`
- Alerts: `ops/devops/observability/triage-alerts.yaml`
## TTFS Metrics (time-to-first-signal)
- Core metrics:
- `ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id}` (histogram)
- `ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_cache_hit_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_cache_miss_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code}` (counter)
- SLO targets:
- P50 < 2s, P95 < 5s (all surfaces)
- Warm path P50 < 700ms, P95 < 2.5s
- Cold path P95 < 4s
```promql
# TTFS latency p50/p95
histogram_quantile(0.50, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
# SLO breach rate (per minute)
60 * sum(rate(ttfs_slo_breach_total[5m]))
```
## Offline Kit (air-gap) metrics
- `offlinekit_import_total{status,tenant_id}` (counter)
- `offlinekit_attestation_verify_latency_seconds{attestation_type,success}` (histogram)
- `attestor_rekor_success_total{mode}` (counter)
- `attestor_rekor_retry_total{reason}` (counter)
- `rekor_inclusion_latency{success}` (histogram)
```promql
# Import rate by status
sum(rate(offlinekit_import_total[5m])) by (status)
# Import success rate
sum(rate(offlinekit_import_total{status="success"}[5m])) / clamp_min(sum(rate(offlinekit_import_total[5m])), 1)
# Attestation verify p95 by type (success only)
histogram_quantile(0.95, sum(rate(offlinekit_attestation_verify_latency_seconds_bucket{success="true"}[5m])) by (le, attestation_type))
# Rekor inclusion latency p95 (by success)
histogram_quantile(0.95, sum(rate(rekor_inclusion_latency_bucket[5m])) by (le, success))
```
Dashboard: `docs/observability/dashboards/offline-kit-operations.json`
## Observability hygiene
- Tag everything with `tenant`, `workload`, `env`, `region`, `version`.
- Keep metric names stable; prefer adding labels over renaming.
- No high-cardinality labels (avoid `user_id`, `path`, raw errors); bucket or hash if needed.
- Offline: scrape locally (Prometheus/OTLP); ship exports via bundle if required.
## Dashboards
- Golden signals per service: traffic, errors, saturation, latency (P50/P95/P99).
- Queue dashboards: depth, age, throughput, success/fail rates.
- Tracing overlays: link span `status` to error metrics; use exemplars where supported.
## Validation checklist
- [ ] Metrics emitted with required tags.
- [ ] Cardinality review completed (no unbounded labels).
- [ ] Alerts wired to error budget policy.
- [ ] Dashboards cover golden signals and queue health.