# Metrics and SLOs

## Core metrics (platform-wide)
- http_requests_total{tenant,workload,route,status}
- http_request_duration_seconds (histogram)
- worker_jobs_total{tenant,queue,status}
- worker_job_duration_seconds (histogram)
- db_query_duration_seconds{db,operation}
- db_pool_in_use, db_pool_available
- cache_requests_total{result=hit|miss}
- cache_latency_seconds (histogram)
- queue_depth{tenant,queue}
- errors_total{tenant,workload,code}

## SLO targets (suggested)
- API availability: 99.9% monthly per public service.
- P95 latency: <300ms reads, <1s writes.
- Worker job success: >99% over 30d.
- Queue backlog: alert when queue_depth > 1000 for 5 minutes.

## Alert examples
- Error rate: sum(rate(errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.02 (aggregate both sides first; the two metrics carry different label sets and would not match otherwise)
- Latency regression: p95 of http_request_duration_seconds > 0.3s
- Queue backlog: queue_depth > 1000 for 5 minutes
- Job failures: rate(worker_jobs_total{status="failed"}[10m]) > 0.01

## UX KPIs (triage TTFS)
- P95 first evidence <= 1.5s; skeleton <= 0.2s.
- Clicks-to-closure median <= 6.
- Evidence completeness >= 90% (>= 3.6/4).

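
The P95 targets above (300 ms reads, 1.5 s first evidence) are normally evaluated from cumulative histogram buckets rather than raw samples. The following is a minimal sketch of that estimate, assuming linear interpolation inside the matching bucket, which is the same general approach Prometheus takes in `histogram_quantile`. The function name `estimate_quantile` and the sample bucket counts are hypothetical and not part of the platform.

```python
import math
from typing import Sequence, Tuple

# Each bucket is (upper_bound_seconds, cumulative_count), sorted by upper bound.
# The last bucket is expected to be the +Inf bucket holding the total count.
Bucket = Tuple[float, float]


def estimate_quantile(q: float, buckets: Sequence[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets (illustrative)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into the open-ended bucket; fall back to
                # the highest finite boundary.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical read-latency histogram with 100 observations.
sample_buckets = [(0.1, 50), (0.2, 80), (0.3, 94), (0.5, 99), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample_buckets)
breaches_read_slo = p95 > 0.3  # compare against the 300 ms read target
```

With these sample counts the estimate is roughly 0.34 s, so the read-latency target would count as breached. The accuracy of the estimate depends on where the bucket boundaries sit, so it helps to place a boundary exactly at each SLO threshold (0.3 s and 1 s here).
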
## TTFS metrics
- ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id}
- ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id}
- ttfs_cache_hit_total, ttfs_cache_miss_total
- ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id}
- ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code}

## Offline kit metrics
- offlinekit_import_total{status,tenant_id}
- offlinekit_attestation_verify_latency_seconds{attestation_type,success}
- attestor_rekor_success_total{mode}
- attestor_rekor_retry_total{reason}
- rekor_inclusion_latency{success}

## Scanner FN-Drift metrics
- scanner.fn_drift.percent (30-day rolling percentage)
- scanner.fn_drift.transitions_30d and scanner.fn_drift.evaluated_30d
- scanner.fn_drift.cause.feed_delta, rule_delta, lattice_delta, reachability_delta, engine
- scanner.classification_changes_total{cause}
- scanner.fn_transitions_total{cause}
- SLO targets: warning above 1.0%, critical above 2.5%, engine drift > 0%

## Hygiene
- Tag metrics with tenant, workload, env, region, version.
- Keep metric names stable and namespace custom metrics per module.
- Use deterministic bucket boundaries and consistent units.
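
As an illustration of how the Scanner FN-Drift thresholds listed above might be applied, here is a small sketch. It assumes that scanner.fn_drift.percent is derived as transitions_30d divided by evaluated_30d; the source does not state the formula, so treat that as a guess. It also assumes that "engine drift > 0%" means any engine-caused transition should be flagged separately. The names `evaluate_fn_drift` and `DriftStatus` are hypothetical.

```python
from dataclasses import dataclass

# Thresholds taken from the FN-Drift SLO targets (percent values).
WARNING_PERCENT = 1.0
CRITICAL_PERCENT = 2.5


@dataclass(frozen=True)
class DriftStatus:
    percent: float       # 30-day rolling drift percentage
    severity: str        # "ok", "warning", or "critical"
    engine_drift: bool   # True when any engine-caused transition was seen


def evaluate_fn_drift(
    transitions_30d: int,
    evaluated_30d: int,
    engine_transitions: int = 0,
) -> DriftStatus:
    """Classify FN drift against the warning and critical thresholds."""
    if min(transitions_30d, evaluated_30d, engine_transitions) < 0:
        raise ValueError("counts must be non-negative")

    # Assumed formula: percent = transitions / evaluated * 100.
    # With nothing evaluated, report 0% rather than dividing by zero.
    percent = 100.0 * transitions_30d / evaluated_30d if evaluated_30d else 0.0

    if percent > CRITICAL_PERCENT:
        severity = "critical"
    elif percent > WARNING_PERCENT:
        severity = "warning"
    else:
        severity = "ok"

    return DriftStatus(percent, severity, engine_transitions > 0)


# Hypothetical counts: 12 transitions out of 1000 evaluated findings.
status = evaluate_fn_drift(12, 1000)
```

Both thresholds are strict ("above"), so a drift of exactly 1.0% is still reported as "ok" in this sketch.
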