Metrics & SLOs (DOCS-OBS-51-001)

Last updated: 2025-12-15

Core metrics (platform-wide)

  • Requests: http_requests_total{tenant,workload,route,status} (counter); latency histogram http_request_duration_seconds.
  • Jobs: worker_jobs_total{tenant,queue,status}; worker_job_duration_seconds.
  • DB: db_query_duration_seconds{db,operation}; db_pool_in_use, db_pool_available.
  • Cache: cache_requests_total{result=hit|miss}; cache_latency_seconds.
  • Queue depth: queue_depth{tenant,queue} (gauge).
  • Errors: errors_total{tenant,workload,code}.
  • Custom module metrics: keep namespaced (e.g., riskengine_score_duration_seconds, notify_delivery_attempts_total).
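
The core metrics above combine into a few derived signals. The queries below are a sketch, not part of the shipped dashboards. They assume that db_query_duration_seconds is a histogram, that db_pool_in_use and db_pool_available are gauges sharing the same labels, and that the cache result label takes the values hit and miss as listed.

```promql
# Cache hit ratio over the last 5 minutes
sum(rate(cache_requests_total{result="hit"}[5m]))
/
sum(rate(cache_requests_total[5m]))

# DB pool utilisation (assumes both gauges carry identical label sets)
db_pool_in_use / (db_pool_in_use + db_pool_available)

# P95 query latency by database and operation (assumes a histogram)
histogram_quantile(0.95, sum by (le, db, operation) (rate(db_query_duration_seconds_bucket[5m])))
```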

SLOs (suggested)

  • API availability: 99.9% monthly per public service.
  • P95 latency: <300 ms for read endpoints; <1 s for write endpoints.
  • Worker job success: >99% over 30d; P95 job duration set per queue (document locally).
  • Queue backlog: alert when queue_depth > 1000 for 5 minutes per tenant/queue.
  • Error budget policy: 28-day rolling window; burn-rate alerts fire when the budget is being consumed at 2× and 14× the sustainable rate.
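
For scale, a 99.9% target over a 28-day window leaves 0.1% of 40,320 minutes, or about 40 minutes, of error budget. One way to express the availability SLI and the two burn-rate thresholds is sketched below. It assumes that the status label on http_requests_total holds the HTTP status code and that 5xx responses count against availability. The 1h and 6h evaluation windows are illustrative choices, not values from this document.

```promql
# Availability SLI per workload over the 28-day window
1 - (
  sum by (workload) (rate(http_requests_total{status=~"5.."}[28d]))
  /
  sum by (workload) (rate(http_requests_total[28d]))
)

# Fast burn: error ratio over the last hour exceeds 14x the 0.1% budget
(
  sum by (workload) (rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum by (workload) (rate(http_requests_total[1h]))
) > 14 * 0.001

# Slow burn: error ratio over the last 6 hours exceeds 2x the 0.1% budget
(
  sum by (workload) (rate(http_requests_total{status=~"5.."}[6h]))
  /
  sum by (workload) (rate(http_requests_total[6h]))
) > 2 * 0.001
```

Long range selectors such as [28d] are costly to evaluate; a recording rule is the usual way to keep them cheap.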

Alert examples

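  • High error rate: sum by (tenant, workload) (rate(errors_total[5m])) / sum by (tenant, workload) (rate(http_requests_total[5m])) > 0.02. Both sides are aggregated to their shared labels because the two metrics carry different label sets and would not match otherwise.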
  • Latency regression: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.3.
  • Queue backlog: queue_depth > 1000 for 5m.
  • Job failures: rate(worker_jobs_total{status="failed"}[10m]) > 0.01 (an absolute rate: more than 0.01 failed jobs per second per series).
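
Two variations on the alerts above may be useful. The first expresses job failures as a ratio, which lines up with the >99% worker success SLO. The second expresses "for 5 minutes" inside the query itself, for contexts without an alert rule `for` clause. Both are sketches that reuse only the metric and label names listed in this document.

```promql
# Failed-job ratio per tenant and queue above 1% over 10 minutes
sum by (tenant, queue) (rate(worker_jobs_total{status="failed"}[10m]))
/
sum by (tenant, queue) (rate(worker_jobs_total[10m]))
> 0.01

# Queue depth stayed above 1000 for every sample in the last 5 minutes
min_over_time(queue_depth[5m]) > 1000
```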

UX KPIs (triage TTFS)

  • Targets:
    • TTFS first evidence p95: <= 1.5s
    • TTFS skeleton p95: <= 0.2s
    • Clicks-to-closure median: <= 6
    • Evidence completeness avg: >= 90% (>= 3.6/4)
# TTFS first evidence p50/p95
histogram_quantile(0.50, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))

# Clicks-to-closure median
histogram_quantile(0.50, sum(rate(stellaops_clicks_to_closure_bucket[5m])) by (le))

# Evidence completeness average percent (0-4 mapped to 0-100)
# rate() is per second, so the denominator is guarded with "> 0" instead of being clamped to 1
100 * (sum(rate(stellaops_evidence_completeness_score_sum[5m])) / (sum(rate(stellaops_evidence_completeness_score_count[5m])) > 0)) / 4

# Budget violations by phase
sum(rate(stellaops_performance_budget_violations_total[5m])) by (phase)
  • Dashboard: ops/devops/observability/grafana/triage-ttfs.json
  • Alerts: ops/devops/observability/triage-alerts.yaml
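
The targets can be turned into threshold expressions by comparing each query against its limit. The block below is an illustrative sketch, not the contents of triage-alerts.yaml. No expression is given for the skeleton target because its metric name is not listed in this document.

```promql
# First evidence p95 above the 1.5 s target
histogram_quantile(0.95, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le)) > 1.5

# Clicks-to-closure median above the target of 6
histogram_quantile(0.50, sum(rate(stellaops_clicks_to_closure_bucket[5m])) by (le)) > 6

# Evidence completeness average below 3.6 out of 4 (90%)
sum(rate(stellaops_evidence_completeness_score_sum[5m]))
/
sum(rate(stellaops_evidence_completeness_score_count[5m]))
< 3.6
```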

TTFS Metrics (time-to-first-signal)

  • Core metrics:

    • ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id} (histogram)
    • ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id} (counter)
    • ttfs_cache_hit_total{surface,cache_hit,signal_source,kind,phase,tenant_id} (counter)
    • ttfs_cache_miss_total{surface,cache_hit,signal_source,kind,phase,tenant_id} (counter)
    • ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id} (counter)
    • ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code} (counter)
  • SLO targets:

    • P50 < 2s, P95 < 5s (all surfaces)
    • Warm path P50 < 700ms, P95 < 2.5s
    • Cold path P95 < 4s
# TTFS latency p50/p95
histogram_quantile(0.50, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))

# SLO breach rate (per minute)
60 * sum(rate(ttfs_slo_breach_total[5m]))
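
The warm and cold path targets can be checked separately by filtering on cache_hit. This sketch assumes that the label takes the string values "true" (warm) and "false" (cold); the actual values are not stated in this document.

```promql
# Warm path p95 above the 2.5 s target
histogram_quantile(0.95, sum by (le) (rate(ttfs_latency_seconds_bucket{cache_hit="true"}[5m]))) > 2.5

# Cold path p95 above the 4 s target
histogram_quantile(0.95, sum by (le) (rate(ttfs_latency_seconds_bucket{cache_hit="false"}[5m]))) > 4

# Cache hit ratio from the dedicated hit and miss counters
sum(rate(ttfs_cache_hit_total[5m]))
/
(sum(rate(ttfs_cache_hit_total[5m])) + sum(rate(ttfs_cache_miss_total[5m])))
```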

Offline Kit (air-gap) metrics

  • offlinekit_import_total{status,tenant_id} (counter)
  • offlinekit_attestation_verify_latency_seconds{attestation_type,success} (histogram)
  • attestor_rekor_success_total{mode} (counter)
  • attestor_rekor_retry_total{reason} (counter)
  • rekor_inclusion_latency{success} (histogram)
# Import rate by status
sum(rate(offlinekit_import_total[5m])) by (status)

# Import success rate
# rate() is per second, so the denominator is guarded with "> 0" instead of being clamped to 1
sum(rate(offlinekit_import_total{status="success"}[5m])) / (sum(rate(offlinekit_import_total[5m])) > 0)

# Attestation verify p95 by type (success only)
histogram_quantile(0.95, sum(rate(offlinekit_attestation_verify_latency_seconds_bucket{success="true"}[5m])) by (le, attestation_type))

# Rekor inclusion latency p95 (by success)
histogram_quantile(0.95, sum(rate(rekor_inclusion_latency_bucket[5m])) by (le, success))

Dashboard: docs/observability/dashboards/offline-kit-operations.json
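
The Rekor and attestation metrics listed above have no queries yet, so the following sketch may help. It relies on the standard _count series that Prometheus exposes for every histogram. It also assumes that success="false" is the counterpart of the success="true" value used earlier.

```promql
# Rekor retries per second by reason
sum by (reason) (rate(attestor_rekor_retry_total[5m]))

# Rekor successes per second by mode
sum by (mode) (rate(attestor_rekor_success_total[5m]))

# Share of attestation verifications that failed, by attestation type
sum by (attestation_type) (rate(offlinekit_attestation_verify_latency_seconds_count{success="false"}[5m]))
/
sum by (attestation_type) (rate(offlinekit_attestation_verify_latency_seconds_count[5m]))
```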

Observability hygiene

  • Tag everything with tenant, workload, env, region, version.
  • Keep metric names stable; prefer adding labels over renaming.
  • No high-cardinality labels (avoid user_id, path, raw errors); bucket or hash if needed.
  • Offline: scrape locally (Prometheus/OTLP); ship exports via bundle if required.
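
A cardinality review can start from the series counts that Prometheus already holds. The queries below are a sketch of that approach. The first one scans every series and can be expensive on a large server.

```promql
# Ten metric names with the most active series
topk(10, count by (__name__) ({__name__=~".+"}))

# Number of distinct route values on http_requests_total
count(count by (route) (http_requests_total))

# Series count per tenant, to spot a tenant that multiplies cardinality
count by (tenant) (http_requests_total)
```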

Dashboards

  • Golden signals per service: traffic, errors, saturation, latency (P50/P95/P99).
  • Queue dashboards: depth, age, throughput, success/fail rates.
  • Tracing overlays: link span status to error metrics; use exemplars where supported.
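
As a starting point for these panels, the queries below sketch traffic, tail latency, and queue health from the platform-wide metrics. They assume that http_request_duration_seconds carries the route label, as the latency alert in this document implies. Queue age is omitted because no metric for it is listed here.

```promql
# Traffic: requests per second per workload
sum by (workload) (rate(http_requests_total[5m]))

# Latency: P99 per route (use 0.50 or 0.95 for the other panels)
histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))

# Queue throughput: jobs per second by queue and status
sum by (queue, status) (rate(worker_jobs_total[5m]))

# Queue trend: a positive value means the backlog is growing
deriv(queue_depth[15m])
```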

Validation checklist

  • Metrics emitted with required tags.
  • Cardinality review completed (no unbounded labels).
  • Alerts wired to error budget policy.
  • Dashboards cover golden signals and queue health.
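
The first checklist item can be partly automated. An empty-string matcher such as tenant="" selects series where the label is absent. The sketch below checks three of the core metrics; the same pattern applies to workload, env, region, and version.

```promql
# Core series that lack a tenant label, counted per metric name
count by (__name__) ({__name__=~"http_requests_total|worker_jobs_total|errors_total", tenant=""})

# Returns 1 when no http_requests_total series exist at all
absent(http_requests_total)
```

An empty result from the first query means that every matching series carries the tag.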