# Policy observability
Purpose
- Capture Policy Engine metrics, logs, traces, and incident workflows.

Metrics
- policy_run_seconds{tenant,policy,mode}
- policy_run_queue_depth{tenant}
- policy_run_failures_total{tenant,policy,reason}
- policy_run_retries_total{tenant,policy}
- policy_run_inputs_pending_bytes{tenant}
- policy_rules_fired_total{tenant,policy,rule}
- policy_vex_overrides_total{tenant,policy,vendor,justification}
- policy_suppressions_total{tenant,policy,action}
- policy_selection_batch_duration_seconds{tenant,policy}
- policy_materialization_conflicts_total{tenant,policy}
- policy_api_requests_total{endpoint,method,status}
- policy_api_latency_seconds{endpoint,method}
- policy_api_rate_limited_total{endpoint}
- policy_queue_leases_active{tenant}
- policy_queue_lease_expirations_total{tenant}
- policy_delta_backlog_age_seconds{tenant,source}
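
To make the label sets above concrete, here is a minimal sketch that renders one sample in the Prometheus text exposition format and rejects label sets that do not match the list. The `METRIC_LABELS` table and the `render_sample` helper are hypothetical names introduced for illustration. Only a few metrics are copied in, and nothing here is taken from the Policy Engine's own code.

```python
# Hypothetical schema table: label names copied from the metric list above.
METRIC_LABELS = {
    "policy_run_seconds": ("tenant", "policy", "mode"),
    "policy_run_failures_total": ("tenant", "policy", "reason"),
    "policy_queue_leases_active": ("tenant",),
    "policy_api_requests_total": ("endpoint", "method", "status"),
}


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline for the text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value) -> str:
    """Render one sample line; raise if the labels differ from the schema."""
    expected = METRIC_LABELS[name]
    if set(labels) != set(expected):
        raise ValueError(f"{name} expects labels {expected}, got {sorted(labels)}")
    body = ",".join(
        f'{key}="{escape_label_value(str(labels[key]))}"' for key in expected
    )
    return f"{name}{{{body}}} {value}"


if __name__ == "__main__":
    print(render_sample(
        "policy_run_failures_total",
        {"tenant": "acme", "policy": "baseline", "reason": "timeout"},
        3,
    ))
```

Checking label names at emit time is one way to keep dashboards and alert rules from silently diverging from this list.
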

Logs
- Structured JSON with policyId, policyVersion, tenant, runId, rule, traceId, env.sealed.
- Categories: policy.run, policy.evaluate, policy.materialize, policy.simulate, policy.lifecycle.
- Rule-hit logs are sampled at 1% by default; incident mode raises the rate to 100%.
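
The log fields and sampling rates above could be sketched as follows. This is an illustration under stated assumptions: the hash-based sampling decision, the `category` key, the choice of `policy.evaluate` for rule hits, and the nesting of `env.sealed` are all guesses. The function names are hypothetical and do not describe the engine's actual implementation.

```python
import hashlib
import json

DEFAULT_SAMPLE_RATE = 0.01   # rule-hit logs: 1% by default
INCIDENT_SAMPLE_RATE = 1.0   # incident mode: 100%


def should_log_rule_hit(run_id, rule, incident_mode=False):
    """Decide whether to emit a rule-hit log line.

    Hashing runId and rule (an assumption, not the documented method)
    makes the decision repeatable for the same inputs.
    """
    rate = INCIDENT_SAMPLE_RATE if incident_mode else DEFAULT_SAMPLE_RATE
    if rate >= 1.0:
        return True
    digest = hashlib.sha256(f"{run_id}:{rule}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return bucket < rate


def rule_hit_record(policy_id, policy_version, tenant, run_id, rule, trace_id, sealed):
    """Build one structured JSON log line with the fields listed above."""
    return json.dumps(
        {
            "category": "policy.evaluate",  # assumed category for rule hits
            "policyId": policy_id,
            "policyVersion": policy_version,
            "tenant": tenant,
            "runId": run_id,
            "rule": rule,
            "traceId": trace_id,
            "env": {"sealed": sealed},  # assumed nesting for env.sealed
        },
        sort_keys=True,
    )
```

Sorting keys keeps the emitted line byte-stable for identical inputs, which helps when diffing logs between replays.
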

Traces
- policy.api, policy.select, policy.evaluate, policy.materialize, policy.simulate.
- Trace context is propagated to the CLI and UI.

Alerts
- PolicyRunSlaBreach: p95 of policy_run_seconds exceeds the SLA threshold.
- PolicyQueueStuck: policy_delta_backlog_age_seconds > 600 (10 minutes).
- DeterminismMismatch: ERR_POL_004 or a replay diff is detected.
- SimulationDrift: simulation runs ending with exit code 20 exceed the threshold.
- VexOverrideSpike and SuppressionSurge.
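
As a rough illustration of the first two alert conditions, the sketch below computes a nearest-rank p95 over raw run durations and compares the backlog age with the 600-second limit. The `run_sla_seconds` parameter is hypothetical because this document does not state the SLA value. A real deployment would more likely evaluate these conditions in its alerting system than in application code.

```python
QUEUE_STUCK_THRESHOLD_SECONDS = 600  # from the PolicyQueueStuck condition above


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def firing_alerts(run_seconds, backlog_age_seconds, run_sla_seconds):
    """Return the names of the alerts whose conditions currently hold.

    run_sla_seconds is a hypothetical parameter; the SLA is not given here.
    """
    alerts = []
    if run_seconds and p95(run_seconds) > run_sla_seconds:
        alerts.append("PolicyRunSlaBreach")
    if backlog_age_seconds > QUEUE_STUCK_THRESHOLD_SECONDS:
        alerts.append("PolicyQueueStuck")
    return alerts
```

Using p95 instead of the maximum means a single slow run among twenty does not trigger PolicyRunSlaBreach by itself.
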

Incident mode
- POST /api/policy/incidents/activate raises sampling to 100%.
- Retention extends to 30 days during an incident.
- A policy.incident.activated event is emitted.
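
The effects of activation listed above can be modelled as a small state change. This is a sketch only: `ObservabilityState`, `activate_incident`, and the event payload shape are hypothetical. Normal retention is left unset because this document does not state it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObservabilityState:
    rule_hit_sample_rate: float = 0.01     # 1% default for rule-hit logs
    retention_days: Optional[int] = None   # normal retention is not stated here
    incident_active: bool = False
    events: list = field(default_factory=list)


def activate_incident(state: ObservabilityState) -> ObservabilityState:
    """Apply the three documented effects of incident activation."""
    state.incident_active = True
    state.rule_hit_sample_rate = 1.0       # sampling raised to 100%
    state.retention_days = 30              # retention extended to 30 days
    state.events.append({"type": "policy.incident.activated"})  # assumed payload
    return state
```
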

Integration points
- Authority metrics for scope_denied events.
- Concelier and Excititor trace propagation via gRPC metadata.
- Offline kits export metric and log snapshots.
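
Trace propagation via gRPC metadata could look roughly like the sketch below, which treats metadata as a list of key/value pairs. The use of the W3C Trace Context `traceparent` entry is an assumption, since this document does not name the header or format. The helper names are hypothetical.

```python
import re
import secrets

TRACEPARENT_KEY = "traceparent"  # assumption: W3C Trace Context entry name
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject_trace_context(metadata, trace_id, sampled=True):
    """Return new metadata carrying a traceparent entry for an outgoing call."""
    span_id = secrets.token_hex(8)  # fresh 16-hex-digit span id for this hop
    flags = "01" if sampled else "00"
    kept = [(k, v) for k, v in metadata if k != TRACEPARENT_KEY]
    return kept + [(TRACEPARENT_KEY, f"00-{trace_id}-{span_id}-{flags}")]


def extract_trace_id(metadata):
    """Return the trace id from incoming metadata, or None if absent or malformed."""
    for key, value in metadata:
        if key == TRACEPARENT_KEY:
            match = _TRACEPARENT_RE.match(value)
            return match.group(1) if match else None
    return None
```

A receiving service would call `extract_trace_id` on incoming metadata and reuse the id for its own spans, so that one run can be followed across services.
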