# Metrics and SLOs

## Core metrics (platform-wide)
- http_requests_total{tenant,workload,route,status}
- http_request_duration_seconds (histogram)
- worker_jobs_total{tenant,queue,status}
- worker_job_duration_seconds (histogram)
- db_query_duration_seconds{db,operation}
- db_pool_in_use, db_pool_available
- cache_requests_total{result=hit|miss}
- cache_latency_seconds (histogram)
- queue_depth{tenant,queue}
- errors_total{tenant,workload,code}

## SLO targets (suggested)
- API availability: 99.9% monthly per public service.
- P95 latency: <300ms reads, <1s writes.
- Worker job success: >99% over 30d.
- Queue backlog: alert when queue_depth > 1000 for 5 minutes.
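
The availability target above implies an error budget. The following is a minimal sketch of that arithmetic, assuming a 30-day month; the helper names `error_budget_minutes` and `budget_remaining` are hypothetical and not part of the platform.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; a negative value means the SLO is breached."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


# 99.9% over 30 days leaves roughly 43.2 minutes of allowed downtime.
monthly_budget = error_budget_minutes(0.999)
```

With these assumptions, 21.6 minutes of downtime in the window would consume about half of the budget for a 99.9% target.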

## Alert examples
- Error rate: rate(errors_total[5m]) / rate(http_requests_total[5m]) > 0.02
- Latency regression: p95 http_request_duration_seconds > 0.3s
- Queue backlog: queue_depth > 1000 for 5 minutes
- Job failures: rate(worker_jobs_total{status="failed"}[10m]) > 0.01
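
The "for 5 minutes" condition means the threshold must be exceeded continuously, not just at one sample. The sketch below illustrates that semantics and the error-ratio check in plain Python; the function names and the `(timestamp, depth)` sample shape are assumptions made for illustration, not the alerting engine's actual API.

```python
from typing import List, Tuple


def error_ratio(errors_per_s: float, requests_per_s: float) -> float:
    """Errors divided by requests; returns 0.0 when there is no traffic."""
    if requests_per_s <= 0:
        return 0.0
    return errors_per_s / requests_per_s


def backlog_alert(samples: List[Tuple[float, float]],
                  threshold: float = 1000,
                  hold_seconds: float = 300) -> bool:
    """True if depth stayed above threshold for at least hold_seconds up to the last sample.

    samples is a list of (timestamp_seconds, queue_depth) in ascending time order.
    """
    breach_start = None
    for ts, depth in samples:
        if depth > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current uninterrupted breach
        else:
            breach_start = None  # any dip below the threshold resets the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds
```

A single sample at or below the threshold resets the timer, which is why a brief dip suppresses the alert even when most samples are high.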

## UX KPIs (triage TTFS)
- P95 first evidence <= 1.5s; skeleton <= 0.2s.
- Clicks-to-closure median <= 6.
- Evidence completeness >= 90% (>= 3.6/4).
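
As a rough illustration of how these KPIs could be checked from raw samples, the sketch below computes a nearest-rank P95, a median, and a completeness ratio. It assumes evidence completeness is a mean per-finding score out of 4 (which is how 3.6/4 maps to 90%); the function names and input shapes are hypothetical.

```python
import statistics


def p95_nearest_rank(values):
    """95th percentile by the nearest-rank method."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]


def evidence_completeness(scores, max_score=4):
    """Mean score as a fraction of max_score; a mean of 3.6 out of 4 gives 0.90."""
    return statistics.mean(scores) / max_score


def kpis_met(first_evidence_s, clicks_to_closure, scores):
    """Compare each KPI with the targets listed above."""
    return {
        "first_evidence_p95": p95_nearest_rank(first_evidence_s) <= 1.5,
        "clicks_median": statistics.median(clicks_to_closure) <= 6,
        "evidence_completeness": evidence_completeness(scores) >= 0.90,
    }
```

The skeleton-render target (<= 0.2s) could be checked the same way with a second latency series.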

## TTFS metrics
- ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id}
- ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id}
- ttfs_cache_hit_total, ttfs_cache_miss_total
- ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id}
- ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code}

## Offline kit metrics
- offlinekit_import_total{status,tenant_id}
- offlinekit_attestation_verify_latency_seconds{attestation_type,success}
- attestor_rekor_success_total{mode}
- attestor_rekor_retry_total{reason}
- rekor_inclusion_latency{success}

## Scanner FN-Drift metrics
- scanner.fn_drift.percent (30-day rolling percentage)
- scanner.fn_drift.transitions_30d and scanner.fn_drift.evaluated_30d
- scanner.fn_drift.cause.feed_delta, rule_delta, lattice_delta, reachability_delta, engine
- scanner.classification_changes_total{cause}
- scanner.fn_transitions_total{cause}
- SLO targets: warning above 1.0%, critical above 2.5%, engine drift > 0%
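
The thresholds above can be applied mechanically once a drift percentage is available. The sketch below assumes the percentage is `transitions_30d / evaluated_30d * 100`, which the metric names suggest but this document does not state; the function names and the separate engine flag are likewise illustrative assumptions.

```python
def fn_drift_percent(transitions_30d: int, evaluated_30d: int) -> float:
    """Assumed definition: FN transitions as a percentage of evaluated findings."""
    if evaluated_30d == 0:
        return 0.0
    return 100.0 * transitions_30d / evaluated_30d


def drift_status(percent: float, engine_percent: float = 0.0):
    """Return (severity, engine_flag) using the thresholds listed above."""
    if percent > 2.5:
        severity = "critical"
    elif percent > 1.0:
        severity = "warning"
    else:
        severity = "ok"
    # Any engine-caused drift above 0% is reported separately from the overall level.
    return severity, engine_percent > 0.0
```

For example, 12 transitions out of 1000 evaluated findings gives 1.2% under this definition, which falls in the warning band.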

## Hygiene
- Tag metrics with tenant, workload, env, region, version.
- Keep metric names stable and namespace custom metrics per module.
- Use deterministic bucket boundaries and consistent units.
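
One way to keep bucket boundaries deterministic is to declare them once as a fixed, sorted constant in a single unit and reuse it for every latency histogram. The sketch below shows the idea with the standard library only; the boundary values and the `FixedHistogram` class are illustrative assumptions, not the platform's configuration.

```python
from bisect import bisect_left

# Hypothetical boundaries in seconds; chosen so the 0.3s and 1s SLO thresholds are bucket edges.
LATENCY_BUCKETS_SECONDS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 1.5, 2.5, 5.0)


class FixedHistogram:
    """Histogram with fixed upper bounds and a final overflow (+Inf) bucket."""

    def __init__(self, bounds=LATENCY_BUCKETS_SECONDS):
        if list(bounds) != sorted(set(bounds)):
            raise ValueError("bounds must be strictly increasing")
        self.bounds = tuple(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot counts values above every bound

    def observe(self, seconds: float) -> None:
        # bisect_left returns the first bound >= value, i.e. "less than or equal" buckets.
        self.counts[bisect_left(self.bounds, seconds)] += 1

    def cumulative(self):
        """Running totals per bucket, in bound order, ending with the overall count."""
        total, out = 0, []
        for count in self.counts:
            total += count
            out.append(total)
        return out
```

Because the boundaries are a shared constant, two services recording the same latency always place it in the same bucket, which keeps aggregated percentiles comparable.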