git.stella-ops.org/docs2/observability-metrics-slos.md

Metrics and SLOs

Core metrics (platform-wide)

  • http_requests_total{tenant,workload,route,status}
  • http_request_duration_seconds (histogram)
  • worker_jobs_total{tenant,queue,status}
  • worker_job_duration_seconds (histogram)
  • db_query_duration_seconds{db,operation}
  • db_pool_in_use, db_pool_available
  • cache_requests_total{result=hit|miss}
  • cache_latency_seconds (histogram)
  • queue_depth{tenant,queue}
  • errors_total{tenant,workload,code}
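Several of the metrics above are histograms, and percentile targets such as P95 are usually derived from their cumulative buckets. The following is a minimal sketch of that estimate, assuming Prometheus-style cumulative buckets and linear interpolation within a bucket; the function name and the sample bucket values are illustrative and not taken from the platform.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    Values are assumed to be spread evenly inside each bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket, so the best
                # available answer is the highest finite boundary.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for http_request_duration_seconds.
sample = [(0.1, 50), (0.3, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
```

Because the estimate interpolates inside a bucket, its accuracy depends on how close the bucket boundaries sit to the thresholds being checked.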

SLO targets (suggested)

  • API availability: 99.9% monthly per public service.
  • P95 latency: <300ms reads, <1s writes.
  • Worker job success: >99% over 30d.
  • Queue backlog: alert when queue_depth > 1000 for 5 minutes.
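An availability or success-rate target implies an error budget. The sketch below shows the arithmetic for the targets above, assuming a 30-day window stands in for "monthly"; the function names are illustrative only.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime in minutes for an availability target over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be within (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target, total_events, failed_events):
    """Fraction of a count-based error budget still unspent.

    A negative result means the budget is overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_events
    if allowed_failures == 0:
        return 0.0 if failed_events else 1.0
    return 1.0 - failed_events / allowed_failures


# 99.9% over 30 days leaves roughly 43.2 minutes of downtime.
api_budget = error_budget_minutes(0.999)

# 99% job success: 250 failures out of 100,000 jobs spends a quarter of the budget.
jobs_left = budget_remaining(0.99, 100_000, 250)
```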

Alert examples

  • Error rate: sum(rate(errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.02
  • Latency regression: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.3
  • Queue backlog: queue_depth > 1000 for 5 minutes
  • Job failures: sum(rate(worker_jobs_total{status="failed"}[10m])) / sum(rate(worker_jobs_total[10m])) > 0.01
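The "for 5 minutes" qualifier means the condition has to hold continuously before the alert fires, so a single spike does not page anyone. A minimal sketch of that evaluation, assuming samples arrive as (timestamp in seconds, value) pairs sorted by time; the helper names are hypothetical.

```python
def breached_for(samples, threshold, duration_s):
    """Return True if the value has stayed above `threshold` continuously
    for at least `duration_s` seconds, up to and including the latest sample."""
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample at or below the threshold resets the timer.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_s


def error_ratio(errors_per_s, requests_per_s):
    """Errors as a fraction of requests; 0.0 when there is no traffic."""
    return 0.0 if requests_per_s == 0 else errors_per_s / requests_per_s


# queue_depth scraped once a minute for five minutes, always above 1000.
steady = [(t, 1200) for t in range(0, 301, 60)]
# Same series with one dip at t=120, which resets the timer.
dipped = [(t, 900 if t == 120 else 1200) for t in range(0, 301, 60)]
```

With these inputs, `breached_for(steady, 1000, 300)` fires while `breached_for(dipped, 1000, 300)` does not, because only two minutes have passed since the dip.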

UX KPIs (triage TTFS)

  • P95 first evidence <= 1.5s; skeleton <= 0.2s.
  • Clicks-to-closure median <= 6.
  • Evidence completeness >= 90% (>= 3.6/4).
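These KPIs can be checked against raw measurements with a few lines of code. The sketch below assumes that the skeleton target is also a P95, that evidence completeness is a mean score out of 4, and that a nearest-rank percentile is acceptable; none of these choices are stated in the list above, and the function names are illustrative.

```python
import math
import statistics


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ux_kpi_report(first_evidence_s, skeleton_s, clicks_to_closure, completeness_scores):
    """Compare measured samples with the KPI targets listed above."""
    return {
        "first_evidence_p95_ok": percentile_nearest_rank(first_evidence_s, 95) <= 1.5,
        "skeleton_p95_ok": percentile_nearest_rank(skeleton_s, 95) <= 0.2,
        "clicks_median_ok": statistics.median(clicks_to_closure) <= 6,
        "completeness_ok": statistics.mean(completeness_scores) >= 3.6,
    }


# Hypothetical measurements from 20 triage sessions.
report = ux_kpi_report(
    first_evidence_s=[0.5] * 19 + [3.0],
    skeleton_s=[0.1] * 20,
    clicks_to_closure=[4, 5, 6, 7, 8],
    completeness_scores=[4, 4, 3, 4],
)
```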

TTFS metrics

  • ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id}
  • ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id}
  • ttfs_cache_hit_total, ttfs_cache_miss_total
  • ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id}
  • ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code}

Offline kit metrics

  • offlinekit_import_total{status,tenant_id}
  • offlinekit_attestation_verify_latency_seconds{attestation_type,success}
  • attestor_rekor_success_total{mode}
  • attestor_rekor_retry_total{reason}
  • rekor_inclusion_latency{success}

Scanner FN-Drift metrics

  • scanner.fn_drift.percent (30-day rolling percentage)
  • scanner.fn_drift.transitions_30d and scanner.fn_drift.evaluated_30d
  • scanner.fn_drift.cause.{feed_delta,rule_delta,lattice_delta,reachability_delta,engine}
  • scanner.classification_changes_total{cause}
  • scanner.fn_transitions_total{cause}
  • Alert thresholds: warning above 1.0%, critical above 2.5%; also alert on any engine drift (> 0%).
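The thresholds above can be sketched as a small classifier. This assumes the drift percentage is transitions_30d divided by evaluated_30d, which the metric names suggest but the list does not state, and it reports engine drift as a separate flag because its severity is not specified. The function names are hypothetical.

```python
def fn_drift_percent(transitions_30d, evaluated_30d):
    """Rolling drift as a percentage (assumed: transitions / evaluated)."""
    if evaluated_30d == 0:
        return 0.0
    return 100.0 * transitions_30d / evaluated_30d


def classify_drift(percent, engine_percent=0.0):
    """Map drift percentages to a severity and an engine-drift flag."""
    if percent > 2.5:
        severity = "critical"
    elif percent > 1.0:
        severity = "warning"
    else:
        severity = "ok"
    # Any drift attributed to the engine is flagged, however small.
    return {"severity": severity, "engine_drift": engine_percent > 0.0}


# 12 transitions across 1000 evaluated findings is 1.2%, which is a warning.
status = classify_drift(fn_drift_percent(12, 1000))
```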

Hygiene

  • Tag metrics with tenant, workload, env, region, version.
  • Keep metric names stable and namespace custom metrics per module.
  • Use deterministic bucket boundaries and consistent units.
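One way to keep bucket boundaries deterministic is to define them once as a constant, in seconds, and reuse that constant everywhere a latency histogram is recorded. The boundaries below are hypothetical; they were picked so that thresholds used elsewhere in this document (0.2 s, 0.3 s, 1 s, 1.5 s) fall exactly on a bucket edge.

```python
import bisect

# Hypothetical shared latency boundaries, in seconds, strictly increasing.
LATENCY_BUCKETS_S = (0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3,
                     0.5, 1.0, 1.5, 2.5, 5.0, 10.0)


def new_counts(bounds=LATENCY_BUCKETS_S):
    """One slot per boundary plus a final slot for the +Inf bucket."""
    return [0] * (len(bounds) + 1)


def observe(counts, value_s, bounds=LATENCY_BUCKETS_S):
    """Record one observation in the first bucket whose bound is >= value."""
    index = bisect.bisect_left(bounds, value_s)
    counts[index] += 1


def cumulative(counts):
    """Convert per-bucket counts to cumulative counts, as exposed by 'le' buckets."""
    totals, running = [], 0
    for count in counts:
        running += count
        totals.append(running)
    return totals


counts = new_counts()
for latency in (0.3, 0.31, 20.0):
    observe(counts, latency)
```

An observation equal to a boundary lands in that boundary's bucket, so a 0.3 s request counts as meeting a 0.3 s threshold.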