Policy Engine Observability
Audience: Observability Guild, SRE/Platform operators, Policy Guild.
Scope: Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
Prerequisites: Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config).
1 · Instrumentation Overview
- Telemetry stack: OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
- Namespace conventions: `policy.*` for metrics/traces/log categories; labels use `tenant`, `policy`, `mode`, `runId`.
- Sampling: Default 10 % trace sampling, 1 % rule-hit log sampling; incident mode overrides both to 100 % (see §7).
- Correlation IDs: Every API request gets `traceId` + `requestId`. The CLI and UI display these IDs to streamline support.
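The sampling defaults above can be illustrated with a deterministic ratio check. This is a minimal sketch, not the service's implementation: `SamplingConfig` and `should_sample` are hypothetical names, and hashing the trace or run identifier is an assumed strategy chosen so that the same ID always yields the same decision.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical container for the documented default ratios."""
    trace_ratio: float = 0.10     # 10 % trace sampling
    rule_hit_ratio: float = 0.01  # 1 % rule-hit log sampling
    incident_mode: bool = False   # incident mode overrides both to 100 %


def _bucket(key: str) -> float:
    """Map a key to a stable value in [0, 1) so the decision is repeatable."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and never rounds up to 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def should_sample(key: str, ratio: float, config: SamplingConfig) -> bool:
    """Return True when the item identified by `key` should be recorded."""
    if config.incident_mode:
        return True
    return _bucket(key) < ratio
```

With `SamplingConfig(incident_mode=True)` every key is sampled, which mirrors the 100 % override; otherwise roughly `ratio` of all keys pass.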
2 · Metrics
2.1 Run Pipeline
| Metric | Type | Labels | Notes |
|---|---|---|---|
| `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target ≤ 5 min incremental, ≤ 30 min full. |
| `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated each enqueue/dequeue). |
| `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. |
| `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. |
| `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting run. |
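To make the P95 targets for `policy_run_seconds` concrete, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation, which is the same general approach Prometheus documents for `histogram_quantile`. The function names and the sample bucket layout are illustrative assumptions, not part of the Policy Engine.

```python
from typing import Sequence, Tuple

# Documented P95 targets for policy_run_seconds, in seconds.
P95_TARGET_SECONDS = {"incremental": 5 * 60, "full": 30 * 60}


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and counts must be cumulative.
    Observations are assumed to be spread evenly inside each bucket.
    """
    if not buckets or buckets[-1][1] == 0:
        raise ValueError("no observations")
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


def breaches_target(mode: str, buckets: Sequence[Tuple[float, int]]) -> bool:
    """True when the estimated P95 exceeds the documented target for `mode`."""
    return estimate_quantile(0.95, buckets) > P95_TARGET_SECONDS[mode]
```

For example, with hypothetical buckets `[(60, 50), (120, 80), (300, 90), (600, 100)]` the 95th observation falls halfway through the 300 to 600 s bucket, so the estimate is about 450 s: over the incremental target, within the full-run target.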
2.2 Evaluator Insights
| Metric | Type | Labels | Notes |
|---|---|---|---|
| `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Increment per rule match (sampled). |
| `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. |
| `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. |
| `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. |
| `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero indicates optimistic concurrency retries. |
2.3 API Surface
| Metric | Type | Labels | Notes |
|---|---|---|---|
| `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. |
| `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget ≤ 250 ms for GETs, ≤ 1 s for POSTs. |
| `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (429). |
2.4 Queue & Change Streams
| Metric | Type | Labels | Notes |
|---|---|---|---|
| `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. |
| `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. |
| `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of oldest unprocessed change event. |
3 · Logs
- Format: JSON (`Serilog`). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`.
- Log categories:
  - `policy.run` (queue lifecycle, run begin/end, stats)
  - `policy.evaluate` (batch execution summaries; rule-hit sampling)
  - `policy.materialize` (Mongo operations, conflicts, retries)
  - `policy.simulate` (diff results, CLI invocation metadata)
  - `policy.lifecycle` (submit/review/approve events)
- Sampling: Rule-hit logs sample 1 % by default; toggled to 100 % in incident mode or when the `--trace` flag is used in the CLI.
- PII: No user secrets recorded; user identities referenced as `user:<id>` or `group:<id>` only.
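The field list and identity convention above can be pictured with a small sketch that assembles one JSON log line. The service itself logs through Serilog; this Python version only illustrates the shape. `principal_ref` and `build_log_event` are hypothetical helpers, and the value types (for example an integer `policyVersion`) are assumptions.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def principal_ref(kind: str, identifier: str) -> str:
    """Reference an identity as user:<id> or group:<id>, never by name or email."""
    if kind not in ("user", "group"):
        raise ValueError(f"unsupported principal kind: {kind}")
    return f"{kind}:{identifier}"


def build_log_event(
    level: str,
    message: str,
    *,
    tenant: str,
    policy_id: str,
    policy_version: int,
    run_id: str,
    trace_id: str,
    rule: Optional[str] = None,
    sealed: bool = False,
    error_code: Optional[str] = None,
) -> str:
    """Serialise one log event using the documented core field names."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "policyId": policy_id,
        "policyVersion": policy_version,
        "tenant": tenant,
        "runId": run_id,
        "rule": rule,
        "traceId": trace_id,
        "env.sealed": sealed,
        "error.code": error_code,
    }
    # Drop unset optional fields so each line carries only meaningful keys.
    return json.dumps({k: v for k, v in event.items() if v is not None})
```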
4 · Traces
- Spans emit via OpenTelemetry instrumentation.
- Primary spans:
  - `policy.api` – wraps HTTP request, records `endpoint`, `status`, `scope`.
  - `policy.select` – change stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`).
  - `policy.evaluate` – evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`).
  - `policy.materialize` – Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`).
  - `policy.simulate` – simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`).
- Trace context is propagated to the CLI via the `traceparent` response header; the UI surfaces it in the run detail view.
- Incident mode forces span sampling to 100 % and extends retention via Collector config override.
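A client that receives the `traceparent` header can extract the trace ID for support tickets. The sketch below parses the W3C Trace Context version `00` layout (`version-traceid-parentid-flags`); `parse_traceparent` and `TraceContext` are illustrative names, not part of the Stella CLI.

```python
import re
from typing import NamedTuple

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> TraceContext:
    """Parse a traceparent header value; raise ValueError when it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError(f"invalid traceparent: {header!r}")
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

This strict pattern rejects headers from future versions that append extra fields; a production parser would need to tolerate those.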
5 · Dashboards
5.1 Policy Runs Overview
Widgets:
- Run duration histogram (per mode/tenant).
- Queue depth + backlog age line charts.
- Failure rate stacked by error code.
- Incremental backlog heatmap (policy × age).
- Active vs scheduled runs table.
5.2 Rule Impact & VEX
- Top N rules by firings (bar chart).
- VEX overrides by vendor/justification (stacked chart).
- Suppression usage (pie + table with justifications).
- Quieted findings trend (line).
5.3 Simulation & Approval Health
- Simulation diff histogram (added vs removed).
- Pending approvals by age (table with SLA colour coding).
- Compliance checklist status (lint, determinism CI, simulation evidence).
Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (`../assets/policy-observability/*.png`).
6 · Alerting
| Alert | Condition | Suggested Action |
|---|---|---|
| PolicyRunSlaBreach | `policy_run_seconds{mode="incremental"}` P95 > 300 s for 3 windows | Check queue depth, upstream services, scale worker pool. |
| PolicyQueueStuck | `policy_delta_backlog_age_seconds` > 600 | Investigate change stream connectivity. |
| DeterminismMismatch | Run status failed with `ERR_POL_004` OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
| SimulationDrift | CLI/CI simulation exit 20 (blocking diff) over threshold | Review policy changes before approval. |
| VexOverrideSpike | `policy_vex_overrides_total` > configured baseline (per vendor) | Verify upstream VEX feed; ensure justification codes expected. |
| SuppressionSurge | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppress rules; check approvals. |
Alerts integrate with Notifier channels (policy.alerts) and Ops on-call rotations.
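Two of the conditions in the table can be expressed as small predicates. This is a sketch under stated assumptions: "for 3 windows" is read as three consecutive evaluation windows, the 3σ baseline uses the sample standard deviation of recent per-window increases, and both function names are hypothetical. Real alert rules would normally live in the alerting backend rather than application code.

```python
from statistics import mean, stdev
from typing import Sequence


def is_suppression_surge(
    baseline_increases: Sequence[float],
    current_increase: float,
    sigmas: float = 3.0,
) -> bool:
    """True when the current increase exceeds the baseline mean by > `sigmas` σ."""
    if len(baseline_increases) < 2:
        raise ValueError("need at least two baseline samples")
    threshold = mean(baseline_increases) + sigmas * stdev(baseline_increases)
    return current_increase > threshold


def sla_breached(
    p95_seconds: Sequence[float],
    limit: float = 300.0,
    windows: int = 3,
) -> bool:
    """True when the most recent `windows` P95 readings all exceed `limit`."""
    recent = list(p95_seconds)[-windows:]
    return len(recent) == windows and all(value > limit for value in recent)
```

Requiring every one of the last three windows to exceed the limit avoids paging on a single slow run.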
7 · Incident Mode & Forensics
- Toggle via `POST /api/policy/incidents/activate` (requires `policy:operate` scope).
- Effects:
  - Trace sampling → 100 %.
  - Rule-hit log sampling → 100 %.
  - Retention window extended to 30 days for incident duration.
  - `policy.incident.activated` event emitted (Console + Notifier banners).
- Post-incident tasks:
  - `stella policy run replay` for affected runs; attach bundles to incident record.
  - Restore sampling defaults with `.../deactivate`.
  - Update incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings.
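As an illustration of the activation call, the sketch below builds (but does not send) the `POST` request with the Python standard library. Only the path comes from this document; the base URL, the bearer-token header carrying the `policy:operate` scope, and the empty request body are assumptions, and `build_incident_activation_request` is a hypothetical helper.

```python
import urllib.request


def build_incident_activation_request(
    base_url: str, token: str
) -> urllib.request.Request:
    """Prepare the incident-mode activation request without sending it."""
    url = base_url.rstrip("/") + "/api/policy/incidents/activate"
    return urllib.request.Request(
        url,
        data=b"",  # assumption: no request body is required
        method="POST",
        headers={
            # Assumption: the token is presented as a bearer credential.
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it; the response format is not described here, so callers should not rely on any particular payload.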
8 · Integration Points
- Authority: Exposes metric `policy_scope_denied_total` for failed authorisation; correlate with `policy_api_requests_total`.
- Concelier/Excititor: Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
- Scheduler: Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
- Offline Kit: CLI exports logs + metrics snapshots (`stella offline bundle metrics`) for air-gapped audits.
9 · Compliance Checklist
- Metrics registered: All metrics listed above exported and documented in Grafana dashboards.
- Alert policies configured: Ops or Observability Guild created alerts matching table in §6.
- Sampling overrides tested: Incident mode toggles verified in staging; retention roll-back rehearsed.
- Trace propagation validated: CLI/UI display trace IDs and allow copy for support.
- Log scrubbing enforced: Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
- Offline capture rehearsed: Metrics/log snapshot commands executed in sealed environment.
- Docs cross-links: Links to architecture, runs, lifecycle, CLI, API docs verified.
Last updated: 2025-10-26 (Sprint 20).