# Policy observability

Purpose
- Capture Policy Engine metrics, logs, traces, and incident workflows.

Metrics
- policy_run_seconds{tenant,policy,mode}
- policy_run_queue_depth{tenant}
- policy_run_failures_total{tenant,policy,reason}
- policy_run_retries_total{tenant,policy}
- policy_run_inputs_pending_bytes{tenant}
- policy_rules_fired_total{tenant,policy,rule}
- policy_vex_overrides_total{tenant,policy,vendor,justification}
- policy_suppressions_total{tenant,policy,action}
- policy_selection_batch_duration_seconds{tenant,policy}
- policy_materialization_conflicts_total{tenant,policy}
- policy_api_requests_total{endpoint,method,status}
- policy_api_latency_seconds{endpoint,method}
- policy_api_rate_limited_total{endpoint}
- policy_queue_leases_active{tenant}
- policy_queue_lease_expirations_total{tenant}
- policy_delta_backlog_age_seconds{tenant,source}
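
As a minimal sketch of how the label sets above constrain instrumentation, the following Python snippet keeps a small catalogue of a few of the listed metrics and renders one sample line in Prometheus text exposition format. The `METRIC_LABELS` table, the `render_sample` helper, and the sample label values are hypothetical illustrations, not part of the Policy Engine; only the metric and label names come from the list above. That the metrics are exposed in Prometheus format is itself an assumption.

```python
# Hypothetical catalogue: metric name -> expected label names (copied from the list above).
METRIC_LABELS = {
    "policy_run_seconds": ("tenant", "policy", "mode"),
    "policy_run_queue_depth": ("tenant",),
    "policy_run_failures_total": ("tenant", "policy", "reason"),
    "policy_api_requests_total": ("endpoint", "method", "status"),
    "policy_delta_backlog_age_seconds": ("tenant", "source"),
}


def render_sample(name, labels, value):
    """Render one sample line, rejecting unknown metrics or mismatched label sets."""
    expected = METRIC_LABELS.get(name)
    if expected is None:
        raise KeyError(f"unknown metric: {name}")
    if set(labels) != set(expected):
        raise ValueError(f"{name} expects labels {expected}, got {tuple(sorted(labels))}")
    # Emit labels in catalogue order so the output is stable.
    # Note: label values are not escaped in this sketch.
    rendered = ",".join(f'{key}="{labels[key]}"' for key in expected)
    return f"{name}{{{rendered}}} {value}"
```

Calling `render_sample("policy_run_queue_depth", {"tenant": "acme"}, 3)` yields `policy_run_queue_depth{tenant="acme"} 3`, while a call that omits or adds a label raises `ValueError`, which keeps label cardinality from drifting away from the documented sets.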

Logs
- Structured JSON with the fields policyId, policyVersion, tenant, runId, rule, traceId, and env.sealed.
- Categories: policy.run, policy.evaluate, policy.materialize, policy.simulate, policy.lifecycle.
- Rule-hit logs are sampled at 1% by default; incident mode raises sampling to 100%.
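
The logging rules above can be sketched as follows. This is an illustration under stated assumptions, not the Policy Engine's implementation: the hash-based sampling keyed on the trace ID, the `category` field name, the flat `env.sealed` key, and both function names are hypothetical. Only the field names, the categories, and the 1% and 100% rates come from the text.

```python
import hashlib
import json

# Rates expressed as buckets out of 10,000 to avoid floating-point comparisons.
DEFAULT_RULE_HIT_BUCKETS = 100      # 1% by default, per the text above
INCIDENT_RULE_HIT_BUCKETS = 10_000  # 100% in incident mode


def should_log_rule_hit(trace_id, incident_mode):
    """Decide whether to emit a rule-hit log (hash-based sampling is an assumption)."""
    limit = INCIDENT_RULE_HIT_BUCKETS if incident_mode else DEFAULT_RULE_HIT_BUCKETS
    # Map the trace ID onto one of 10,000 buckets so the decision is stable per trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < limit


def build_log_record(category, policy_id, policy_version, tenant, run_id, rule, trace_id, sealed):
    """Serialize one JSON log line carrying the fields listed above."""
    record = {
        "category": category,  # hypothetical field name, e.g. "policy.evaluate"
        "policyId": policy_id,
        "policyVersion": policy_version,
        "tenant": tenant,
        "runId": run_id,
        "rule": rule,
        "traceId": trace_id,
        "env.sealed": sealed,  # flat key is an assumption; the text does not specify nesting
    }
    return json.dumps(record, sort_keys=True)
```

Keying the decision on the trace ID means all rule-hit logs for one trace are either kept or dropped together, which is one reasonable design choice; random per-event sampling would satisfy the same stated rates.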

Traces
- policy.api, policy.select, policy.evaluate, policy.materialize, policy.simulate.
- Trace context propagated to CLI and UI.
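
The text does not say which wire format carries the trace context. Assuming the W3C Trace Context `traceparent` header (an assumption, not confirmed by this document), a client such as the CLI could parse an incoming header and format an outgoing one roughly as follows. The helper names are hypothetical.

```python
import re

# W3C Trace Context, version "00": 32-hex trace-id, 16-hex parent-id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(trace_id, new_span_id, sampled):
    """Format the header to send downstream, keeping the same trace ID."""
    return f"00-{trace_id}-{new_span_id}-{'01' if sampled else '00'}"
```

Keeping the trace ID unchanged while replacing the span ID is what lets a backend span such as policy.evaluate be stitched to the CLI or UI action that triggered it.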

Alerts
- PolicyRunSlaBreach: p95 of policy_run_seconds exceeds the SLA threshold.
- PolicyQueueStuck: policy_delta_backlog_age_seconds > 600 (10 minutes).
- DeterminismMismatch: ERR_POL_004 or a replay diff.
- SimulationDrift: simulations exiting with code 20 exceed the threshold.
- VexOverrideSpike and SuppressionSurge.
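
To make the first two alert conditions concrete, here is a small sketch that evaluates them over in-memory samples. The 600-second threshold comes from the text; the `sla_seconds` parameter is hypothetical because no SLA figure is given here, and the nearest-rank percentile is an assumed method. In practice such rules would more likely live in the monitoring system than in application code.

```python
QUEUE_STUCK_THRESHOLD_SECONDS = 600  # from the PolicyQueueStuck rule above


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def firing_alerts(run_seconds, backlog_age_seconds, sla_seconds):
    """Return the names of alerts that would fire; sla_seconds is a hypothetical threshold."""
    alerts = []
    if run_seconds and p95(run_seconds) > sla_seconds:
        alerts.append("PolicyRunSlaBreach")
    if backlog_age_seconds > QUEUE_STUCK_THRESHOLD_SECONDS:
        alerts.append("PolicyQueueStuck")
    return alerts
```

Using p95 rather than the maximum means a single slow run among many fast ones does not page anyone, whereas a sustained slowdown does.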

Incident mode
- POST /api/policy/incidents/activate raises sampling to 100%.
- Retention extends to 30 days during an incident.
- A policy.incident.activated event is emitted.
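
As an illustration of the activation call, the sketch below builds the POST request with Python's standard library without sending it. Only the path comes from the text; the base URL, the bearer-token authentication, the empty JSON body, and the function name are assumptions, since the request schema is not documented here.

```python
import urllib.request

INCIDENT_ACTIVATE_PATH = "/api/policy/incidents/activate"  # from the text above


def build_activate_request(base_url, token):
    """Build (but do not send) the POST request that activates incident mode."""
    url = base_url.rstrip("/") + INCIDENT_ACTIVATE_PATH
    return urllib.request.Request(
        url,
        data=b"{}",  # assumption: the request body schema is not documented here
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumption: bearer-token auth
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it; separating construction from sending keeps the sketch testable without a running Policy Engine.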

Integration points
- Authority metrics for scope_denied events.
- Concelier and Excititor trace propagation via gRPC metadata.
- Offline kits export metric and log snapshots.