Policy observability

Purpose

  • Capture Policy Engine metrics, logs, traces, and incident workflows.

Metrics

  • policy_run_seconds{tenant,policy,mode}
  • policy_run_queue_depth{tenant}
  • policy_run_failures_total{tenant,policy,reason}
  • policy_run_retries_total{tenant,policy}
  • policy_run_inputs_pending_bytes{tenant}
  • policy_rules_fired_total{tenant,policy,rule}
  • policy_vex_overrides_total{tenant,policy,vendor,justification}
  • policy_suppressions_total{tenant,policy,action}
  • policy_selection_batch_duration_seconds{tenant,policy}
  • policy_materialization_conflicts_total{tenant,policy}
  • policy_api_requests_total{endpoint,method,status}
  • policy_api_latency_seconds{endpoint,method}
  • policy_api_rate_limited_total{endpoint}
  • policy_queue_leases_active{tenant}
  • policy_queue_lease_expirations_total{tenant}
  • policy_delta_backlog_age_seconds{tenant,source}

Logs

  • Structured JSON with policyId, policyVersion, tenant, runId, rule, traceId, env.sealed.
  • Categories: policy.run, policy.evaluate, policy.materialize, policy.simulate, policy.lifecycle.
  • Rule-hit logs sample at 1% by default; incident mode raises to 100%.
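
A minimal sketch of how a rule-hit log line and its sampling decision might look. The field names come from the list above. Everything else is an assumption: the `category` key, filing rule hits under `policy.evaluate`, the flat `"env.sealed"` key, and the hash-based sampling scheme. The source states the rates (1% and 100%) but not the mechanism.

```python
import hashlib
import json

DEFAULT_RATE = 0.01   # 1% by default (from the text above)
INCIDENT_RATE = 1.0   # 100% in incident mode (from the text above)


def is_sampled(run_id: str, rule: str, rate: float) -> bool:
    """Deterministic sampling (assumed scheme): hash the key into 10,000 buckets."""
    digest = hashlib.sha256(f"{run_id}:{rule}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(rate * 10_000)


def rule_hit_record(policy_id: str, policy_version: str, tenant: str,
                    run_id: str, rule: str, trace_id: str, sealed: bool) -> str:
    """Build one structured JSON log line carrying the documented fields."""
    record = {
        "category": "policy.evaluate",  # assumed key name and category
        "policyId": policy_id,
        "policyVersion": policy_version,
        "tenant": tenant,
        "runId": run_id,
        "rule": rule,
        "traceId": trace_id,
        "env.sealed": sealed,           # assumed to be a flat key
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical values, for illustration only.
incident_mode = False
rate = INCIDENT_RATE if incident_mode else DEFAULT_RATE
if is_sampled("run-42", "block-critical", rate):
    print(rule_hit_record("P-7", "3", "acme", "run-42", "block-critical",
                          "0af7651916cd43dd8448eb211c80319c", False))
```

Hashing the run and rule, rather than drawing a random number, keeps the decision reproducible: the same rule hit is either always logged or never logged at a given rate.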

Traces

  • policy.api, policy.select, policy.evaluate, policy.materialize, policy.simulate.
  • Trace context propagated to CLI and UI.
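
The source does not say which wire format carries the trace context. Assuming the W3C Trace Context `traceparent` header, purely for illustration, the sketch below shows how a child span can keep the trace ID while getting its own span ID. The function names are hypothetical.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent: str) -> str:
    """Derive the header for a downstream call: same trace ID, new span ID."""
    match = TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Hypothetical flow: an API span hands its context to an evaluation span.
api_span = new_traceparent()
evaluate_span = child_traceparent(api_span)
print(api_span)
print(evaluate_span)
```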

Alerts

  • PolicyRunSlaBreach: p95 of policy_run_seconds exceeds the run SLA.
  • PolicyQueueStuck: policy_delta_backlog_age_seconds > 600 (10 minutes).
  • DeterminismMismatch: ERR_POL_004 is raised or a replay produces a diff.
  • SimulationDrift: simulations exiting with code 20 exceed the threshold.
  • VexOverrideSpike and SuppressionSurge: unusual increases in VEX overrides and suppressions.
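
To make the first two alert conditions concrete, here is a small sketch that evaluates them over raw samples. The 600-second backlog threshold comes from the list above. The source gives no SLA value, so `sla_seconds` is a hypothetical parameter. A real deployment would more likely compute p95 from histogram buckets in the monitoring system than from raw samples as done here.

```python
QUEUE_STUCK_SECONDS = 600  # threshold stated for PolicyQueueStuck


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def firing_alerts(run_seconds: list[float], backlog_age_seconds: float,
                  sla_seconds: float) -> list[str]:
    """Return the names of the alerts whose conditions currently hold."""
    alerts = []
    if p95(run_seconds) > sla_seconds:
        alerts.append("PolicyRunSlaBreach")
    if backlog_age_seconds > QUEUE_STUCK_SECONDS:
        alerts.append("PolicyQueueStuck")
    return alerts


# Hypothetical data: 20 run durations of 1..20 seconds and a 15-second SLA.
durations = [float(n) for n in range(1, 21)]
print(p95(durations))
print(firing_alerts(durations, backlog_age_seconds=601, sla_seconds=15.0))
```

The backlog comparison is strict, matching the `> 600` in the alert definition, so a backlog age of exactly 600 seconds does not fire.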

Incident mode

  • POST /api/policy/incidents/activate raises rule-hit log sampling to 100%.
  • Retention extends to 30 days during an incident.
  • A policy.incident.activated event is emitted.
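
A hedged sketch of preparing the activation call with the Python standard library. Only the method and path come from the text above. The base URL, the bearer-token header, and the empty request body are assumptions, since the source does not describe authentication or the payload. The request is built but not sent.

```python
import urllib.request

ACTIVATE_PATH = "/api/policy/incidents/activate"  # path from the text above


def build_activate_request(base_url: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST that activates incident mode."""
    url = base_url.rstrip("/") + ACTIVATE_PATH
    return urllib.request.Request(
        url,
        data=b"",                                     # assumed: no body required
        method="POST",
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )


# Hypothetical host and token, for illustration only.
request = build_activate_request("https://policy.example.internal/", "example-token")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request, timeout=10)
```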

Integration points

  • Authority metrics for scope_denied events.
  • Concelier and Excititor trace propagation via gRPC metadata.
  • Offline kits export metrics and logs snapshots.