Policy Engine Observability

Audience: Observability Guild, SRE/Platform operators, Policy Guild.
Scope: Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
Prerequisites: Policy Engine v2 deployed with OpenTelemetry exporters enabled (observability:enabled=true in config).


1 · Instrumentation Overview

  • Telemetry stack: OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
  • Namespace conventions: policy.* for metrics/traces/log categories; labels use tenant, policy, mode, runId.
  • Sampling: Default 10% trace sampling, 1% rule-hit log sampling; incident mode overrides both to 100% (see §7).
  • Correlation IDs: Every API request gets traceId + requestId. CLI/UI display IDs to streamline support.
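
The sampling defaults above can be illustrated with a small deterministic sampler. This is only a sketch: the function name `keep`, the constants, and the hash-based bucketing are assumptions for illustration, not the Policy Engine's actual sampler (which would normally be configured through the OpenTelemetry SDK).

```python
import hashlib

# Ratios taken from the defaults stated above; names are hypothetical.
DEFAULT_TRACE_RATIO = 0.10     # 10% of traces
DEFAULT_RULE_HIT_RATIO = 0.01  # 1% of rule-hit logs


def keep(identifier: str, ratio: float, incident_mode: bool = False) -> bool:
    """Decide deterministically whether a trace or log record is kept.

    The same identifier always yields the same decision, so every
    component that sees a given traceId agrees on whether it is sampled.
    """
    if incident_mode:
        return True  # incident mode overrides sampling to 100%
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(ratio * 2**64)
```

Hashing the identifier rather than drawing a random number keeps the decision stable across retries and across services that share the same ID.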

2 · Metrics

2.1 Run Pipeline

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| policy_run_seconds | Histogram | tenant, policy, mode (full, incremental, simulate) | P95 target ≤ 5 min incremental, ≤ 30 min full. |
| policy_run_queue_depth | Gauge | tenant | Number of pending jobs per tenant (updated on each enqueue/dequeue). |
| policy_run_failures_total | Counter | tenant, policy, reason (err_pol_*, network, cancelled) | Aligns with error codes. |
| policy_run_retries_total | Counter | tenant, policy | Helps identify noisy sources. |
| policy_run_inputs_pending_bytes | Gauge | tenant | Size of buffered change batches awaiting a run. |
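
To make the P95 targets on policy_run_seconds concrete, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, similar in spirit to how Prometheus-style quantile estimation works. The bucket boundaries and counts are invented sample data, and `histogram_quantile` here is a local helper, not a call into any library.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    bound, with the last bound equal to +inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented sample data for incremental runs (bounds in seconds).
sample = [(60, 50), (120, 80), (300, 95), (600, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
within_target = p95 <= 300  # 5 min incremental target
```

The estimate is only as precise as the bucket layout, so bucket boundaries near the 300 s and 1800 s targets make the SLO check more meaningful.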

2.2 Evaluator Insights

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| policy_rules_fired_total | Counter | tenant, policy, rule | Incremented per rule match (sampled). |
| policy_vex_overrides_total | Counter | tenant, policy, vendor, justification | Tracks VEX precedence decisions. |
| policy_suppressions_total | Counter | tenant, policy, action (ignore, warn, quiet) | Audits suppression usage. |
| policy_selection_batch_duration_seconds | Histogram | tenant, policy | Measures joiner performance. |
| policy_materialization_conflicts_total | Counter | tenant, policy | Non-zero indicates optimistic concurrency retries. |

2.3 API Surface

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| policy_api_requests_total | Counter | endpoint, method, status | Exposed via Minimal API instrumentation. |
| policy_api_latency_seconds | Histogram | endpoint, method | Budget ≤ 250 ms for GETs, ≤ 1 s for POSTs. |
| policy_api_rate_limited_total | Counter | endpoint | Tied to throttles (429). |

2.4 Queue & Change Streams

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| policy_queue_leases_active | Gauge | tenant | Number of leased jobs. |
| policy_queue_lease_expirations_total | Counter | tenant | Alerts when workers fail to ack. |
| policy_delta_backlog_age_seconds | Gauge | tenant, source (concelier, excititor, sbom) | Age of oldest unprocessed change event. |
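
As an illustration of what policy_delta_backlog_age_seconds measures, the following sketch derives the gauge value per source from the enqueue timestamps of pending change events. The input shape (a dict of source name to timestamps) and the function name are assumptions made for the example; the 600 s comparison mirrors the PolicyQueueStuck condition in §6.

```python
from datetime import datetime, timedelta, timezone


def backlog_age_seconds(pending, now):
    """Return {source: age in seconds of the oldest unprocessed event}.

    pending: dict mapping a source name to a list of timezone-aware
    datetimes (hypothetical shape). An empty backlog reports 0.
    """
    ages = {}
    for source, timestamps in pending.items():
        if timestamps:
            ages[source] = max(0.0, (now - min(timestamps)).total_seconds())
        else:
            ages[source] = 0.0
    return ages


now = datetime(2025, 10, 26, 12, 0, tzinfo=timezone.utc)
pending = {
    "concelier": [now - timedelta(seconds=700), now - timedelta(seconds=30)],
    "excititor": [now - timedelta(seconds=45)],
    "sbom": [],
}
ages = backlog_age_seconds(pending, now)
stuck_sources = [source for source, age in ages.items() if age > 600]
```

Reporting the age of the oldest event, rather than the event count, means a single stalled event is visible even when the queue is otherwise short.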

3 · Logs

  • Format: JSON (Serilog). Core fields: timestamp, level, message, policyId, policyVersion, tenant, runId, rule, traceId, env.sealed, error.code.
  • Log categories:
    • policy.run (queue lifecycle, run begin/end, stats)
    • policy.evaluate (batch execution summaries; rule-hit sampling)
    • policy.materialize (Mongo operations, conflicts, retries)
    • policy.simulate (diff results, CLI invocation metadata)
    • policy.lifecycle (submit/review/approve events)
  • Sampling: Rule-hit logs are sampled at 1% by default; sampling rises to 100% in incident mode or when the CLI --trace flag is used.
  • PII: No user secrets recorded; user identities referenced as user:<id> or group:<id> only.
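
The core fields and the identity convention above might produce records like the one built below. This is a sketch only: the service itself logs through Serilog, so the helper names, the flat dotted keys for env.sealed and error.code, and the `actor` field are assumptions used to show the shape of a record.

```python
import json
from datetime import datetime, timezone


def format_identity(kind: str, identifier: str) -> str:
    """Reference an identity as user:<id> or group:<id>, never by name or secret."""
    if kind not in ("user", "group"):
        raise ValueError("kind must be 'user' or 'group'")
    return f"{kind}:{identifier}"


def build_log_record(level, message, *, tenant, policy_id, policy_version,
                     run_id, trace_id, rule=None, sealed=False,
                     error_code=None, actor=None):
    """Serialise one structured log line with the documented core fields."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "policyId": policy_id,
        "policyVersion": policy_version,
        "tenant": tenant,
        "runId": run_id,
        "rule": rule,
        "traceId": trace_id,
        "env.sealed": sealed,
        "error.code": error_code,
    }
    if actor is not None:
        record["actor"] = actor  # hypothetical field, not in the documented list
    return json.dumps(record)


line = build_log_record(
    "Information", "run completed",
    tenant="tenant-a", policy_id="P-7", policy_version=4,
    run_id="run-0001", trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    actor=format_identity("user", "1234"),
)
```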

4 · Traces

  • Spans are emitted via OpenTelemetry instrumentation.
  • Primary spans:
    • policy.api: wraps the HTTP request; records endpoint, status, scope.
    • policy.select: change stream ingestion and batch assembly (attributes: candidateCount, cursor).
    • policy.evaluate: evaluation batch (attributes: batchSize, ruleHits, severityChanges).
    • policy.materialize: Mongo writes (attributes: writes, historyWrites, retryCount).
    • policy.simulate: simulation diff generation (attributes: sbomCount, diffAdded, diffRemoved).
  • Trace context is propagated to the CLI via the traceparent response header; the UI surfaces it in the run detail view.
  • Incident mode forces span sampling to 100% and extends retention via Collector config override.
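
A client that wants to show the trace ID from the traceparent response header could parse it along these lines. The sketch follows the version 00 layout of the W3C Trace Context header (version, trace-id, parent-id, trace-flags); the function name and the returned dict are illustrative, and how the stella CLI actually parses the header is not specified here.

```python
import re

# version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-trace-flags(2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return trace details from a traceparent value, or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


info = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

The `sampled` flag tells support staff whether the full trace is likely to exist in the backend or whether only the ID was propagated.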

5 · Dashboards

5.1 Policy Runs Overview

Widgets:

  • Run duration histogram (per mode/tenant).
  • Queue depth + backlog age line charts.
  • Failure rate stacked by error code.
  • Incremental backlog heatmap (policy × age).
  • Active vs scheduled runs table.

5.2 Rule Impact & VEX

  • Top N rules by firings (bar chart).
  • VEX overrides by vendor/justification (stacked chart).
  • Suppression usage (pie + table with justifications).
  • Quieted findings trend (line).

5.3 Simulation & Approval Health

  • Simulation diff histogram (added vs removed).
  • Pending approvals by age (table with SLA colour coding).
  • Compliance checklist status (lint, determinism CI, simulation evidence).

Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (../assets/policy-observability/*.png).


6 · Alerting

| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| PolicyRunSlaBreach | policy_run_seconds{mode="incremental"} P95 > 300 s for 3 windows | Check queue depth, upstream services, scale worker pool. |
| PolicyQueueStuck | policy_delta_backlog_age_seconds > 600 | Investigate change stream connectivity. |
| DeterminismMismatch | Run status failed with ERR_POL_004 OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
| SimulationDrift | CLI/CI simulation exit 20 (blocking diff) over threshold | Review policy changes before approval. |
| VexOverrideSpike | policy_vex_overrides_total > configured baseline (per vendor) | Verify upstream VEX feed; ensure justification codes expected. |
| SuppressionSurge | policy_suppressions_total increase > 3σ vs baseline | Audit new suppress rules; check approvals. |

Alerts integrate with Notifier channels (policy.alerts) and Ops on-call rotations.
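
Two of the conditions above are easy to misread, so here is a sketch of how they could be evaluated: "for 3 windows" as three consecutive breaching windows, and "> 3σ vs baseline" as the current increase exceeding the baseline mean by three standard deviations. The helper names and the use of the population standard deviation are assumptions; real alert rules would live in the alerting backend rather than in application code.

```python
from statistics import mean, pstdev


def breached_for(values, threshold, windows=3):
    """True if the most recent `windows` values all exceed the threshold."""
    if len(values) < windows:
        return False
    return all(value > threshold for value in values[-windows:])


def is_surge(baseline_increases, current_increase, sigmas=3.0):
    """True if the current increase exceeds baseline mean + sigmas * stddev.

    baseline_increases must contain at least one observation.
    """
    mu = mean(baseline_increases)
    sigma = pstdev(baseline_increases)
    return current_increase > mu + sigmas * sigma


# Invented sample data.
p95_by_window = [250, 310, 320, 305]      # seconds, oldest first
sla_breach = breached_for(p95_by_window, 300)

baseline = [10, 12, 11, 9, 13, 10, 11]    # suppressions added per interval
surge = is_surge(baseline, 30)
```

Requiring consecutive windows avoids paging on a single slow run, while the 3σ rule adapts to each tenant's normal suppression volume.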


7 · Incident Mode & Forensics

  • Toggle via POST /api/policy/incidents/activate (requires policy:operate scope).
  • Effects:
    • Trace sampling → 100%.
    • Rule-hit log sampling → 100%.
    • Retention window extended to 30 days for the duration of the incident.
    • policy.incident.activated event emitted (Console + Notifier banners).
  • Post-incident tasks:
    • Run stella policy run replay for affected runs; attach bundles to the incident record.
    • Restore sampling defaults with .../deactivate.
    • Update incident checklist in /docs/policy/lifecycle.md (section 8) with findings.
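
The effects listed above amount to swapping one set of sampling settings for another and restoring the defaults afterwards. A minimal sketch of that toggle follows; the class and field names are hypothetical, the default retention is left unspecified because this document does not state it, and the real service applies these changes through its API and Collector configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingSettings:
    trace_ratio: float = 0.10            # default 10% trace sampling
    rule_hit_log_ratio: float = 0.01     # default 1% rule-hit log sampling
    retention_override_days: Optional[int] = None  # None = normal retention


DEFAULTS = SamplingSettings()
INCIDENT = SamplingSettings(
    trace_ratio=1.0,
    rule_hit_log_ratio=1.0,
    retention_override_days=30,
)


class IncidentMode:
    """Tracks whether incident mode is active and which settings apply."""

    def __init__(self) -> None:
        self._active = False

    @property
    def settings(self) -> SamplingSettings:
        return INCIDENT if self._active else DEFAULTS

    def activate(self) -> SamplingSettings:
        self._active = True
        return self.settings

    def deactivate(self) -> SamplingSettings:
        self._active = False
        return self.settings


mode = IncidentMode()
during = mode.activate()
after = mode.deactivate()
```

Keeping the defaults as an immutable value makes the post-incident restore a simple switch back, which is the step the compliance checklist asks teams to rehearse.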

8 · Integration Points

  • Authority: Exposes metric policy_scope_denied_total for failed authorisation; correlate with policy_api_requests_total.
  • Concelier/Excititor: Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
  • Scheduler: Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
  • Offline Kit: CLI exports logs + metrics snapshots (stella offline bundle metrics) for air-gapped audits.

9 · Compliance Checklist

  • Metrics registered: All metrics listed above exported and documented in Grafana dashboards.
  • Alert policies configured: Ops or Observability Guild created alerts matching table in §6.
  • Sampling overrides tested: Incident mode toggles verified in staging; retention roll-back rehearsed.
  • Trace propagation validated: CLI/UI display trace IDs and allow copy for support.
  • Log scrubbing enforced: Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
  • Offline capture rehearsed: Metrics/log snapshot commands executed in sealed environment.
  • Docs cross-links: Links to architecture, runs, lifecycle, CLI, API docs verified.

Last updated: 2025-10-26 (Sprint 20).