Policy Engine Observability
Audience: Observability Guild, SRE/Platform operators, Policy Guild.
Scope: Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
Prerequisites: Policy Engine v2 deployed with OpenTelemetry exporters enabled (observability:enabled=true in config).
1 · Instrumentation Overview
- Telemetry stack: OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
- Namespace conventions: policy.* for metrics/traces/log categories; labels use tenant, policy, mode, runId.
- Sampling: Default 10 % trace sampling, 1 % rule-hit log sampling; incident mode overrides to 100 % (see §7).
- Correlation IDs: Every API request gets traceId + requestId. CLI/UI display IDs to streamline support.
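The sampling defaults above can be pictured as a deterministic ratio check keyed on the trace ID. The following is a minimal Python sketch under stated assumptions, not the service's implementation: the helper name `should_sample` and the choice of the low 64 bits of the trace ID as the sampling value are illustrative (similar in spirit to OpenTelemetry's TraceIdRatioBased sampler).

```python
TRACE_SAMPLE_RATIO = 0.10      # default trace sampling stated in this section
RULE_HIT_SAMPLE_RATIO = 0.01   # default rule-hit log sampling stated in this section


def should_sample(trace_id: str, ratio: float, incident_mode: bool = False) -> bool:
    """Decide whether to keep a trace or log line (hypothetical helper).

    trace_id is a 32-character lowercase hex string. The decision is
    deterministic, so every component that sees the same trace ID agrees.
    """
    if incident_mode:
        return True  # incident mode overrides sampling to 100 %
    if ratio <= 0.0:
        return False
    if ratio >= 1.0:
        return True
    # Treat the low 64 bits of the trace ID as a uniformly distributed value.
    value = int(trace_id[-16:], 16)
    return value < int(ratio * (1 << 64))


keep_low = should_sample("0" * 32, TRACE_SAMPLE_RATIO)
keep_high = should_sample("f" * 32, TRACE_SAMPLE_RATIO)
keep_incident = should_sample("f" * 32, TRACE_SAMPLE_RATIO, incident_mode=True)
```

Keying the decision on the trace ID rather than on a random draw keeps a request's spans and its sampled logs together.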
2 · Metrics
2.1 Run Pipeline
| Metric | Type | Labels | Notes |
|---|---|---|---|
| policy_run_seconds | Histogram | tenant, policy, mode (full, incremental, simulate) | P95 target ≤ 5 min incremental, ≤ 30 min full. |
| policy_run_queue_depth | Gauge | tenant | Number of pending jobs per tenant (updated each enqueue/dequeue). |
| policy_run_failures_total | Counter | tenant, policy, reason (err_pol_*, network, cancelled) | Aligns with error codes. |
| policy_run_retries_total | Counter | tenant, policy | Helps identify noisy sources. |
| policy_run_inputs_pending_bytes | Gauge | tenant | Size of buffered change batches awaiting run. |
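To check the P95 targets for policy_run_seconds, a quantile has to be estimated from cumulative histogram buckets. The sketch below shows one way to do that with linear interpolation inside the matching bucket, the same idea Prometheus uses for `histogram_quantile`. The bucket boundaries and counts are hypothetical; the service's real bucket layout is not given here.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets is a list of (upper_bound_seconds, cumulative_count) pairs sorted
    by upper bound; the last bound may be math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Cannot interpolate inside +Inf or an empty bucket.
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical incremental-run buckets (seconds, cumulative count).
incremental = [(60, 50), (120, 80), (300, 96), (600, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, incremental)
within_target = p95 <= 5 * 60  # P95 target for incremental runs
```

The estimate is only as fine as the bucket layout, so boundaries near 300 s and 1800 s make the two targets easier to evaluate.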
2.2 Evaluator Insights
| Metric | Type | Labels | Notes |
|---|---|---|---|
| policy_rules_fired_total | Counter | tenant, policy, rule | Increment per rule match (sampled). |
| policy_vex_overrides_total | Counter | tenant, policy, vendor, justification | Tracks VEX precedence decisions. |
| policy_suppressions_total | Counter | tenant, policy, action (ignore, warn, quiet) | Audits suppression usage. |
| policy_selection_batch_duration_seconds | Histogram | tenant, policy | Measures joiner performance. |
| policy_materialization_conflicts_total | Counter | tenant, policy | Non-zero indicates optimistic concurrency retries. |
2.3 API Surface
| Metric | Type | Labels | Notes |
|---|---|---|---|
| policy_api_requests_total | Counter | endpoint, method, status | Exposed via Minimal API instrumentation. |
| policy_api_latency_seconds | Histogram | endpoint, method | Budget ≤ 250 ms for GETs, ≤ 1 s for POSTs. |
| policy_api_rate_limited_total | Counter | endpoint | Tied to throttles (429). |
2.4 Queue & Change Streams
| Metric | Type | Labels | Notes |
|---|---|---|---|
| policy_queue_leases_active | Gauge | tenant | Number of leased jobs. |
| policy_queue_lease_expirations_total | Counter | tenant | Alerts when workers fail to ack. |
| policy_delta_backlog_age_seconds | Gauge | tenant, source (concelier, excititor, sbom) | Age of oldest unprocessed change event. |
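As an illustration of what policy_delta_backlog_age_seconds measures, the sketch below derives the gauge value per (tenant, source) from the timestamps of unprocessed change events. The data structure and the function name `backlog_age_seconds` are assumptions made for the example; the service's internal bookkeeping is not described here.

```python
from datetime import datetime, timezone


def backlog_age_seconds(pending, now):
    """Map each (tenant, source) key to the age of its oldest pending event.

    pending maps (tenant, source) to a list of event timestamps; an empty
    list means nothing is waiting, so the age is reported as 0.
    """
    ages = {}
    for key, timestamps in pending.items():
        if timestamps:
            ages[key] = max(0.0, (now - min(timestamps)).total_seconds())
        else:
            ages[key] = 0.0
    return ages


# Hypothetical snapshot of unprocessed change events.
now = datetime(2025, 10, 26, 12, 0, tzinfo=timezone.utc)
pending = {
    ("tenant-a", "concelier"): [
        datetime(2025, 10, 26, 11, 48, tzinfo=timezone.utc),
        datetime(2025, 10, 26, 11, 55, tzinfo=timezone.utc),
    ],
    ("tenant-a", "sbom"): [],
}
ages = backlog_age_seconds(pending, now)
```

In this snapshot the concelier backlog is 720 s old, which is above the 600 s used by the PolicyQueueStuck alert.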
3 · Logs
- Format: JSON (Serilog). Core fields: timestamp, level, message, policyId, policyVersion, tenant, runId, rule, traceId, env.sealed, error.code.
- Log categories:
  - policy.run (queue lifecycle, run begin/end, stats)
  - policy.evaluate (batch execution summaries; rule-hit sampling)
  - policy.materialize (Mongo operations, conflicts, retries)
  - policy.simulate (diff results, CLI invocation metadata)
  - policy.lifecycle (submit/review/approve events)
- Sampling: Rule-hit logs sample 1 % by default; toggled to 100 % in incident mode or when the --trace flag is used in the CLI.
- PII: No user secrets recorded; user identities referenced as user:<id> or group:<id> only.
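To make the log shape concrete, the sketch below builds one JSON log line carrying the core fields listed above and renders identities in the user:<id> / group:<id> form. It is written in Python purely for illustration (the service itself logs through Serilog). The helper names, the flat dotted keys for env.sealed and error.code, and all sample values are assumptions.

```python
import json
from datetime import datetime, timezone

# Core fields named in this section; representing env.sealed and error.code
# as flat dotted keys is an assumption of this sketch.
CORE_FIELDS = (
    "timestamp", "level", "message", "policyId", "policyVersion", "tenant",
    "runId", "rule", "traceId", "env.sealed", "error.code",
)


def principal_ref(kind: str, ident: str) -> str:
    """Render an identity reference as user:<id> or group:<id>."""
    if kind not in ("user", "group"):
        raise ValueError(f"unsupported principal kind: {kind}")
    return f"{kind}:{ident}"


def build_log_record(fields: dict) -> str:
    """Serialize one event, emitting every core field (null when absent)."""
    unknown = set(fields) - set(CORE_FIELDS)
    if unknown:
        raise ValueError(f"non-core fields rejected: {sorted(unknown)}")
    record = {name: fields.get(name) for name in CORE_FIELDS}
    if record["timestamp"] is None:
        record["timestamp"] = datetime.now(timezone.utc).isoformat()
    return json.dumps(record)


# Hypothetical lifecycle event.
line = build_log_record({
    "level": "Information",
    "message": f"run approved by {principal_ref('user', '42')}",
    "tenant": "tenant-a",
    "runId": "run-001",
    "env.sealed": False,
})
```

Rejecting fields outside the core set is one simple way to keep unexpected values, including secrets, out of the log stream.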
4 · Traces
- Spans emit via OpenTelemetry instrumentation.
- Primary spans:
  - policy.api – wraps HTTP request, records endpoint, status, scope.
  - policy.select – change stream ingestion and batch assembly (attributes: candidateCount, cursor).
  - policy.evaluate – evaluation batch (attributes: batchSize, ruleHits, severityChanges).
  - policy.materialize – Mongo writes (attributes: writes, historyWrites, retryCount).
  - policy.simulate – simulation diff generation (attributes: sbomCount, diffAdded, diffRemoved).
- Trace context is propagated to the CLI via the traceparent response header; the UI surfaces it in the run detail view.
- Incident mode forces span sampling to 100 % and extends retention via Collector config override.
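A client that wants to show the trace ID from the traceparent header needs to parse the W3C Trace Context format (`version-traceid-parentid-flags`). The sketch below is one possible parser, not the CLI's actual code; the `TraceContext` type and `parse_traceparent` name are made up for the example, and the sample header value is the one used in the W3C specification.

```python
import re
from typing import NamedTuple, Optional

# Strict layout for version 00: 2-32-16-2 lowercase hex digits.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None when the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The `trace_id` field is the value a support engineer would look up in Tempo.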
 
5 · Dashboards
5.1 Policy Runs Overview
Widgets:
- Run duration histogram (per mode/tenant).
 - Queue depth + backlog age line charts.
 - Failure rate stacked by error code.
 - Incremental backlog heatmap (policy × age).
 - Active vs scheduled runs table.
 
5.2 Rule Impact & VEX
- Top N rules by firings (bar chart).
 - VEX overrides by vendor/justification (stacked chart).
 - Suppression usage (pie + table with justifications).
 - Quieted findings trend (line).
 
5.3 Simulation & Approval Health
- Simulation diff histogram (added vs removed).
 - Pending approvals by age (table with SLA colour coding).
 - Compliance checklist status (lint, determinism CI, simulation evidence).
 
Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (../assets/policy-observability/*.png).
6 · Alerting
| Alert | Condition | Suggested Action |
|---|---|---|
| PolicyRunSlaBreach | policy_run_seconds{mode="incremental"} P95 > 300 s for 3 windows | Check queue depth, upstream services, scale worker pool. |
| PolicyQueueStuck | policy_delta_backlog_age_seconds > 600 | Investigate change stream connectivity. |
| DeterminismMismatch | Run status failed with ERR_POL_004 OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
| SimulationDrift | CLI/CI simulation exit 20 (blocking diff) over threshold | Review policy changes before approval. |
| VexOverrideSpike | policy_vex_overrides_total > configured baseline (per vendor) | Verify upstream VEX feed; ensure justification codes expected. |
| SuppressionSurge | policy_suppressions_total increase > 3σ vs baseline | Audit new suppress rules; check approvals. |
Alerts integrate with Notifier channels (policy.alerts) and Ops on-call rotations.
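Two of the conditions in the table are easy to misread, so the sketch below spells out one interpretation of each in Python. It assumes that "for 3 windows" means three consecutive evaluation windows and that the 3σ test compares the latest increase against the mean and population standard deviation of past increases. Function names and sample numbers are hypothetical, and the real alert rules live in the alerting backend.

```python
from statistics import mean, pstdev


def sla_breached(p95_by_window, threshold_s=300.0, windows=3):
    """True when the most recent `windows` P95 values all exceed the threshold."""
    recent = p95_by_window[-windows:]
    return len(recent) == windows and all(v > threshold_s for v in recent)


def is_surge(current_increase, baseline_increases, sigmas=3.0):
    """True when the current increase is more than `sigmas` std devs above the baseline mean."""
    mu = mean(baseline_increases)
    sd = pstdev(baseline_increases)
    return current_increase > mu + sigmas * sd


# PolicyRunSlaBreach: P95 of incremental runs per evaluation window (seconds).
breach = sla_breached([250.0, 310.0, 320.0, 330.0])
recovered = sla_breached([310.0, 320.0, 290.0])

# SuppressionSurge: per-window increases of policy_suppressions_total.
baseline = [10, 12, 11, 9, 13]
surge = is_surge(40, baseline)
normal = is_surge(14, baseline)
```

Requiring consecutive windows avoids paging on a single slow run, at the cost of a slower first alert.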
7 · Incident Mode & Forensics
- Toggle via POST /api/policy/incidents/activate (requires policy:operate scope).
- Effects:
  - Trace sampling → 100 %.
  - Rule-hit log sampling → 100 %.
  - Retention window extended to 30 days for incident duration.
  - policy.incident.activated event emitted (Console + Notifier banners).
- Post-incident tasks:
  - stella policy run replay for affected runs; attach bundles to incident record.
  - Restore sampling defaults with .../deactivate.
  - Update incident checklist in /docs/policy/lifecycle.md (section 8) with findings.
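The effects listed above amount to a small state change that must be fully reversible. The Python sketch below models that toggle to show what "restore sampling defaults" has to undo. It is an illustration only: the `SamplingConfig` and `IncidentToggle` names are invented, and the default retention of 7 days is a placeholder because the normal retention window is not stated here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SamplingConfig:
    trace_ratio: float = 0.10         # default trace sampling
    rule_hit_log_ratio: float = 0.01  # default rule-hit log sampling
    retention_days: int = 7           # placeholder; real default is not documented here


class IncidentToggle:
    """Tracks the active sampling config and the events emitted on activation."""

    def __init__(self, defaults: SamplingConfig):
        self._defaults = defaults
        self.current = defaults
        self.events = []

    def activate(self) -> None:
        # Sampling goes to 100 % and retention is extended to 30 days.
        self.current = replace(
            self._defaults,
            trace_ratio=1.0,
            rule_hit_log_ratio=1.0,
            retention_days=max(self._defaults.retention_days, 30),
        )
        self.events.append("policy.incident.activated")

    def deactivate(self) -> None:
        # Restore exactly the defaults captured before the incident.
        self.current = self._defaults


toggle = IncidentToggle(SamplingConfig())
toggle.activate()
during = toggle.current
toggle.deactivate()
after = toggle.current
```

Keeping the pre-incident defaults as an immutable snapshot makes the roll-back step a plain assignment rather than a field-by-field reset.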
 
8 · Integration Points
- Authority: Exposes metric policy_scope_denied_total for failed authorisation; correlate with policy_api_requests_total.
- Concelier/Excititor: Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
- Scheduler: Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
- Offline Kit: CLI exports logs + metrics snapshots (stella offline bundle metrics) for air-gapped audits.
9 · Compliance Checklist
- Metrics registered: All metrics listed above exported and documented in Grafana dashboards.
 - Alert policies configured: Ops or Observability Guild created alerts matching table in §6.
 - Sampling overrides tested: Incident mode toggles verified in staging; retention roll-back rehearsed.
 - Trace propagation validated: CLI/UI display trace IDs and allow copy for support.
 - Log scrubbing enforced: Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
 - Offline capture rehearsed: Metrics/log snapshot commands executed in sealed environment.
 - Docs cross-links: Links to architecture, runs, lifecycle, CLI, API docs verified.
 
Last updated: 2025-10-26 (Sprint 20).