master bc4318ef97 Add tests for SBOM generation determinism across multiple formats

- Created `StellaOps.TestKit.Tests` project for unit tests related to determinism.
- Implemented `DeterminismManifestTests` to validate deterministic output for canonical bytes and strings, file read/write operations, and error handling for invalid schema versions.
- Added `SbomDeterminismTests` to ensure identical inputs produce consistent SBOMs across SPDX 3.0.1 and CycloneDX 1.6/1.7 formats, including parallel execution tests.
- Updated project references in `StellaOps.Integration.Determinism` to include the new determinism testing library.

2025-12-23 18:56:12 +02:00


Metrics and SLOs

Core metrics (platform-wide)

http_requests_total{tenant,workload,route,status}
http_request_duration_seconds (histogram)
worker_jobs_total{tenant,queue,status}
worker_job_duration_seconds (histogram)
db_query_duration_seconds{db,operation}
db_pool_in_use, db_pool_available
cache_requests_total{result=hit|miss}
cache_latency_seconds (histogram)
queue_depth{tenant,queue}
errors_total{tenant,workload,code}

SLO targets (suggested)

API availability: 99.9% monthly per public service.
P95 latency: <300ms reads, <1s writes.
Worker job success: >99% over 30d.
Queue backlog: alert when queue_depth > 1000 for 5 minutes.
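
As a worked example of what the availability target implies, the sketch below converts an SLO into an allowed-downtime budget. It assumes a "month" is a 30-day window; the function name `error_budget` is illustrative and not part of the platform.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime allowed by an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    # The budget is the fraction of the window that may be unavailable.
    return window * (1.0 - slo)


# Assumption: "monthly" is treated as a 30-day window.
budget = error_budget(0.999, timedelta(days=30))
budget_minutes = budget.total_seconds() / 60
print(f"99.9% over 30 days allows about {budget_minutes:.1f} minutes of downtime")
```

Under that assumption, 99.9% leaves roughly 43 minutes of downtime per month per public service.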

Alert examples

Error rate: sum(rate(errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.02
Latency regression: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.3
Queue backlog: queue_depth > 1000 for 5 minutes
Job failures: rate(worker_jobs_total{status="failed"}[10m]) > 0.01
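
The "for 5 minutes" condition on the queue backlog alert means the threshold must hold continuously, not just on one sample. A minimal sketch of that check follows, assuming samples arrive as time-ordered `(timestamp_seconds, value)` pairs; `sustained_breach` is a hypothetical helper, not an API of any alerting system.

```python
from typing import Iterable, Optional, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 1000.0,
    hold_seconds: float = 300.0,
) -> bool:
    """True if the value has stayed above threshold for at least
    hold_seconds, measured up to the most recent sample."""
    streak_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            # Remember when the current unbroken breach began.
            if streak_start is None:
                streak_start = ts
        else:
            # Any sample at or below the threshold resets the streak.
            streak_start = None
        last_ts = ts
    if streak_start is None or last_ts is None:
        return False
    return last_ts - streak_start >= hold_seconds


# One sample per minute, queue_depth stuck at 1500 for five minutes.
steady = [(t, 1500.0) for t in range(0, 301, 60)]
# Same series, but the queue drains briefly at t=120.
dipped = [(t, 900.0 if t == 120 else 1500.0) for t in range(0, 301, 60)]
print(sustained_breach(steady), sustained_breach(dipped))
```

Resetting on any dip keeps short spikes from paging, at the cost of delaying the alert when the backlog flaps around the threshold.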

UX KPIs (triage TTFS)

P95 first evidence <= 1.5s; skeleton <= 0.2s.
Clicks-to-closure median <= 6.
Evidence completeness >= 90% (>= 3.6/4).
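
The KPI arithmetic can be sketched as below. This assumes evidence completeness is the mean per-item score out of 4 (so 3.6/4 equals 90%) and uses the nearest-rank method for P95; both choices, and the function names, are illustrative assumptions rather than the platform's definitions.

```python
from typing import Sequence


def p95_nearest_rank(values: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def evidence_completeness(scores: Sequence[int], max_score: int = 4) -> float:
    """Mean score as a fraction of the maximum (assumed definition)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / (len(scores) * max_score)


# Hypothetical first-evidence latencies in seconds and per-item scores.
latencies = [0.4, 0.6, 0.7, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.6]
scores = [4, 4, 3, 4, 3]

p95 = p95_nearest_rank(latencies)
completeness = evidence_completeness(scores)
print(f"P95 first evidence: {p95}s, within target: {p95 <= 1.5}")
print(f"Evidence completeness: {completeness:.0%}")
```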

TTFS metrics

ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id}
ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id}
ttfs_cache_hit_total, ttfs_cache_miss_total
ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id}
ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code}

Offline kit metrics

offlinekit_import_total{status,tenant_id}
offlinekit_attestation_verify_latency_seconds{attestation_type,success}
attestor_rekor_success_total{mode}
attestor_rekor_retry_total{reason}
rekor_inclusion_latency{success}

Scanner FN-Drift metrics

scanner.fn_drift.percent (30-day rolling percentage)
scanner.fn_drift.transitions_30d and scanner.fn_drift.evaluated_30d
scanner.fn_drift.cause.{feed_delta,rule_delta,lattice_delta,reachability_delta,engine}
scanner.classification_changes_total{cause}
scanner.fn_transitions_total{cause}
SLO targets: warning when scanner.fn_drift.percent exceeds 1.0%, critical when it exceeds 2.5%; for engine-caused drift, any value above 0%.
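
A minimal sketch of how the drift percentage and its severity could be derived. It assumes the percentage is `transitions_30d / evaluated_30d * 100`, which the metric names suggest but the text does not state; `fn_drift_percent` and `classify_drift` are hypothetical names.

```python
def fn_drift_percent(transitions_30d: int, evaluated_30d: int) -> float:
    """Rolling FN-drift percentage (assumed: transitions / evaluated)."""
    if evaluated_30d <= 0:
        # Nothing evaluated in the window, so no drift can be measured.
        return 0.0
    return 100.0 * transitions_30d / evaluated_30d


def classify_drift(percent: float, warning: float = 1.0, critical: float = 2.5) -> str:
    """Map a drift percentage onto the warning and critical thresholds."""
    if percent > critical:
        return "critical"
    if percent > warning:
        return "warning"
    return "ok"


for transitions, evaluated in [(5, 1000), (12, 1000), (30, 1000)]:
    pct = fn_drift_percent(transitions, evaluated)
    print(f"{pct:.1f}% -> {classify_drift(pct)}")
```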

Hygiene

Tag metrics with tenant, workload, env, region, version.
Keep metric names stable and namespace custom metrics per module.
Use deterministic bucket boundaries and consistent units.
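
To illustrate deterministic bucket boundaries, the sketch below pins a fixed tuple of upper bounds in seconds and counts observations cumulatively, in the style of a Prometheus histogram. The specific bounds and the `cumulative_bucket_counts` helper are illustrative assumptions, not the platform's configuration.

```python
from bisect import bisect_left
from typing import Iterable, List, Sequence

# Fixed, sorted upper bounds in seconds (illustrative values).
LATENCY_BUCKETS_SECONDS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)


def cumulative_bucket_counts(
    observations: Iterable[float],
    bounds: Sequence[float] = LATENCY_BUCKETS_SECONDS,
) -> List[int]:
    """Cumulative count per upper bound, with a final +Inf bucket."""
    counts = [0] * (len(bounds) + 1)
    for value in observations:
        # First bucket whose upper bound is >= value; past the end means +Inf.
        counts[bisect_left(bounds, value)] += 1
    cumulative = []
    running = 0
    for count in counts:
        running += count
        cumulative.append(running)
    return cumulative


result = cumulative_bucket_counts([0.003, 0.3, 0.3, 12.0])
print(result)
```

Because the bounds are a constant rather than derived from the data, two runs over the same observations produce identical bucket counts, and histograms from different instances can be aggregated.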
