Policy Engine Observability
Audience: Observability Guild, SRE/Platform operators, Policy Guild.
Scope: Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
Prerequisites: Policy Engine v2 deployed with OpenTelemetry exporters enabled (observability:enabled=true in config).
1 · Instrumentation Overview
- Telemetry stack: OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
- Namespace conventions: policy.* for metrics/traces/log categories; labels use tenant, policy, mode, runId.
- Sampling: Default 10 % trace sampling, 1 % rule-hit log sampling; incident mode overrides both to 100 % (see §7).
- Correlation IDs: Every API request gets a traceId and a requestId. The CLI and UI display both IDs to streamline support.
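
As a rough illustration of the sampling rules above, the sketch below makes a deterministic keep/drop decision by hashing an identifier into [0, 1) and comparing it with the configured rate. The function name `should_sample`, the constants, and the SHA-256 hashing scheme are assumptions for illustration; the source does not describe how the Policy Engine implements sampling.

```python
import hashlib

# Default rates quoted in this section.
TRACE_SAMPLE_RATE = 0.10      # 10 % of traces
RULE_HIT_SAMPLE_RATE = 0.01   # 1 % of rule-hit logs


def should_sample(key: str, rate: float) -> bool:
    """Deterministic decision: hash the key into [0, 1) and compare with the rate."""
    if rate >= 1.0:
        return True   # e.g. incident mode, where everything is kept
    if rate <= 0.0:
        return False
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical trace IDs: roughly one in ten is kept at the default rate.
kept = sum(should_sample(f"trace-{i}", TRACE_SAMPLE_RATE) for i in range(20000))
print(f"kept {kept} of 20000 traces")
```

Hashing the trace ID instead of drawing a random number means every component reaches the same decision for the same trace; passing a rate of 1.0 models the incident-mode override.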
2 · Metrics
2.1 Run Pipeline
| Metric | Type | Labels | Notes | 
|---|---|---|---|
| policy_run_seconds | Histogram | tenant, policy, mode (full, incremental, simulate) | P95 target ≤ 5 min incremental, ≤ 30 min full. |
| policy_run_queue_depth | Gauge | tenant | Number of pending jobs per tenant (updated each enqueue/dequeue). |
| policy_run_failures_total | Counter | tenant, policy, reason (err_pol_*, network, cancelled) | Aligns with error codes. |
| policy_run_retries_total | Counter | tenant, policy | Helps identify noisy sources. |
| policy_run_inputs_pending_bytes | Gauge | tenant | Size of buffered change batches awaiting run. | 
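
To make the P95 targets for policy_run_seconds concrete, the following sketch estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, similar in spirit to how Prometheus evaluates `histogram_quantile`. The bucket boundaries, the sample counts, and the helper names are hypothetical; only the two targets (5 min incremental, 30 min full) come from the table.

```python
import math

# P95 targets from the table above, in seconds.
P95_TARGET_SECONDS = {"incremental": 5 * 60, "full": 30 * 60}


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into the open-ended bucket.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def within_target(mode, p95_seconds):
    """True when the estimated P95 meets the documented target for the mode."""
    return p95_seconds <= P95_TARGET_SECONDS[mode]


# Hypothetical incremental-run buckets: 96 of 100 runs finished within 300 s.
buckets = [(60.0, 50), (120.0, 80), (300.0, 96), (600.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(p95, within_target("incremental", p95))
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets are spaced around the 300 s and 1800 s targets.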
2.2 Evaluator Insights
| Metric | Type | Labels | Notes | 
|---|---|---|---|
| policy_rules_fired_total | Counter | tenant, policy, rule | Increment per rule match (sampled). |
| policy_vex_overrides_total | Counter | tenant, policy, vendor, justification | Tracks VEX precedence decisions. |
| policy_suppressions_total | Counter | tenant, policy, action (ignore, warn, quiet) | Audits suppression usage. |
| policy_selection_batch_duration_seconds | Histogram | tenant, policy | Measures joiner performance. |
| policy_materialization_conflicts_total | Counter | tenant, policy | Non-zero indicates optimistic concurrency retries. |
2.3 API Surface
| Metric | Type | Labels | Notes | 
|---|---|---|---|
| policy_api_requests_total | Counter | endpoint, method, status | Exposed via Minimal API instrumentation. |
| policy_api_latency_seconds | Histogram | endpoint, method | Budget ≤ 250 ms for GETs, ≤ 1 s for POSTs. |
| policy_api_rate_limited_total | Counter | endpoint | Tied to throttles (HTTP 429). |
2.4 Queue & Change Streams
| Metric | Type | Labels | Notes | 
|---|---|---|---|
| policy_queue_leases_active | Gauge | tenant | Number of leased jobs. | 
| policy_queue_lease_expirations_total | Counter | tenant | Alerts when workers fail to ack. | 
| policy_delta_backlog_age_seconds | Gauge | tenant, source (concelier, excititor, sbom) | Age of oldest unprocessed change event. |
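
The backlog-age gauge can be read as "now minus the timestamp of the oldest unprocessed event". The sketch below shows that calculation per source; the function name, the in-memory list of timestamps, and the choice to report 0 for an empty backlog are assumptions, not details taken from the service.

```python
from datetime import datetime, timedelta, timezone


def backlog_age_seconds(pending_event_times, now):
    """Age in seconds of the oldest unprocessed change event (0 if none pending)."""
    if not pending_event_times:
        return 0.0
    oldest = min(pending_event_times)
    # Clamp at zero so clock skew never yields a negative age.
    return max(0.0, (now - oldest).total_seconds())


now = datetime(2025, 10, 26, 12, 0, 0, tzinfo=timezone.utc)
# Hypothetical pending events per source label.
pending = {
    "concelier": [now - timedelta(seconds=42), now - timedelta(seconds=900)],
    "excititor": [],
}
ages = {source: backlog_age_seconds(times, now) for source, times in pending.items()}
print(ages)
```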
3 · Logs
- Format: JSON (Serilog). Core fields: timestamp, level, message, policyId, policyVersion, tenant, runId, rule, traceId, env.sealed, error.code.
- Log categories:
  - policy.run (queue lifecycle, run begin/end, stats)
  - policy.evaluate (batch execution summaries; rule-hit sampling)
  - policy.materialize (Mongo operations, conflicts, retries)
  - policy.simulate (diff results, CLI invocation metadata)
  - policy.lifecycle (submit/review/approve events)
 
- Sampling: Rule-hit logs are sampled at 1 % by default, rising to 100 % in incident mode or when the --trace flag is used in the CLI.
- PII: No user secrets are recorded; user identities are referenced as user:<id> or group:<id> only.
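
For orientation, a log event carrying the core fields listed above might be assembled as in the sketch below. The service itself uses Serilog; this Python version only illustrates the field set and the user:<id> / group:<id> identity convention. The helper names, the flat dotted keys for env.sealed and error.code, and all sample values are assumptions.

```python
import json
from datetime import datetime, timezone


def actor_ref(kind, identifier):
    """Reference an identity as user:<id> or group:<id>, never by name or email."""
    if kind not in ("user", "group"):
        raise ValueError(f"unsupported actor kind: {kind}")
    return f"{kind}:{identifier}"


def build_log_record(level, message, *, policy_id, policy_version, tenant,
                     run_id, trace_id, rule=None, sealed=False, error_code=None):
    """Serialise one log event with the core fields from this section."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "policyId": policy_id,
        "policyVersion": policy_version,
        "tenant": tenant,
        "runId": run_id,
        "rule": rule,
        "traceId": trace_id,
        "env.sealed": sealed,
        "error.code": error_code,
    }
    return json.dumps(record)


# Hypothetical values throughout.
line = build_log_record(
    "Information",
    f"policy approved by {actor_ref('user', '42')}",
    policy_id="P-7",
    policy_version=3,
    tenant="tenant-a",
    run_id="run-001",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
)
print(line)
```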
4 · Traces
- Spans are emitted via OpenTelemetry instrumentation.
- Primary spans:
  - policy.api – wraps the HTTP request; records endpoint, status, scope.
  - policy.select – change stream ingestion and batch assembly (attributes: candidateCount, cursor).
  - policy.evaluate – evaluation batch (attributes: batchSize, ruleHits, severityChanges).
  - policy.materialize – Mongo writes (attributes: writes, historyWrites, retryCount).
  - policy.simulate – simulation diff generation (attributes: sbomCount, diffAdded, diffRemoved).
 
- Trace context is propagated to the CLI via the traceparent response header; the UI surfaces it in the run detail view.
- Incident mode forces span sampling to 100 % and extends retention via Collector config override.
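
A client that wants to show the trace ID from the traceparent response header could parse it along these lines. The sketch follows the W3C Trace Context layout for version 00 (version, trace ID, parent ID, flags, separated by hyphens); the `TraceContext` type and `parse_traceparent` function are hypothetical names, and the sample header value is the example used in the W3C specification, not output from the Policy Engine.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None when the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```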
5 · Dashboards
5.1 Policy Runs Overview
Widgets:
- Run duration histogram (per mode/tenant).
- Queue depth + backlog age line charts.
- Failure rate stacked by error code.
- Incremental backlog heatmap (policy × age).
- Active vs scheduled runs table.
5.2 Rule Impact & VEX
- Top N rules by firings (bar chart).
- VEX overrides by vendor/justification (stacked chart).
- Suppression usage (pie + table with justifications).
- Quieted findings trend (line).
5.3 Simulation & Approval Health
- Simulation diff histogram (added vs removed).
- Pending approvals by age (table with SLA colour coding).
- Compliance checklist status (lint, determinism CI, simulation evidence).
Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (../assets/policy-observability/*.png).
6 · Alerting
| Alert | Condition | Suggested Action | 
|---|---|---|
| PolicyRunSlaBreach | policy_run_seconds{mode="incremental"} P95 > 300 s for 3 windows | Check queue depth and upstream services; scale the worker pool. |
| PolicyQueueStuck | policy_delta_backlog_age_seconds > 600 | Investigate change stream connectivity. |
| DeterminismMismatch | Run status failed with ERR_POL_004 OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
| SimulationDrift | CLI/CI simulation exit 20 (blocking diff) over threshold | Review policy changes before approval. |
| VexOverrideSpike | policy_vex_overrides_total > configured baseline (per vendor) | Verify upstream VEX feed; ensure justification codes are as expected. |
| SuppressionSurge | policy_suppressions_total increase > 3σ vs baseline | Audit new suppress rules; check approvals. |
Alerts integrate with Notifier channels (policy.alerts) and Ops on-call rotations.
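
Two of the conditions in the table lend themselves to a small worked example: the "3 windows" rule for PolicyRunSlaBreach and the 3σ rule for SuppressionSurge. The sketch below is one possible reading, assuming the three windows must be consecutive and that σ is the population standard deviation of a baseline series; the function names and sample numbers are hypothetical, and real alert rules would normally live in the alerting backend.

```python
from statistics import mean, pstdev


def sla_breached(p95_per_window, threshold_s=300.0, windows=3):
    """True when the last `windows` P95 values all exceed the threshold."""
    recent = p95_per_window[-windows:]
    return len(recent) == windows and all(value > threshold_s for value in recent)


def is_surge(baseline, current, sigmas=3.0):
    """True when `current` exceeds the baseline mean by more than `sigmas` deviations."""
    mu = mean(baseline)
    sd = pstdev(baseline)
    return (current - mu) > sigmas * sd


# Hypothetical P95 values (seconds) for consecutive evaluation windows.
print(sla_breached([250.0, 310.0, 320.0, 305.0]))
# Hypothetical per-interval suppression increases: mean 10, sigma about 1.41.
print(is_surge([10, 12, 8, 10], 15))
```

With a perfectly flat baseline σ is zero, so any increase counts as a surge; a production rule would likely add a minimum absolute threshold to avoid that.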
7 · Incident Mode & Forensics
- Toggle via POST /api/policy/incidents/activate (requires policy:operate scope).
- Effects:
  - Trace sampling → 100 %.
  - Rule-hit log sampling → 100 %.
  - Retention window extended to 30 days for incident duration.
  - policy.incident.activated event emitted (Console + Notifier banners).
 
- Post-incident tasks:
  - Run stella policy run replay for affected runs; attach bundles to the incident record.
  - Restore sampling defaults with .../deactivate.
  - Update the incident checklist in /docs/policy/lifecycle.md (section 8) with findings.
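
The effects of activating and deactivating incident mode can be modelled as a small state change on the telemetry settings, as sketched below. The `TelemetrySettings` type and both functions are hypothetical, and the 7-day default retention is a placeholder because the source does not state the normal retention window; only the 100 % sampling rates and the 30-day incident retention come from this section.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TelemetrySettings:
    trace_sample_rate: float
    rule_hit_log_sample_rate: float
    retention_days: int
    incident_mode: bool = False


def activate_incident(settings: TelemetrySettings) -> TelemetrySettings:
    """Apply the incident-mode overrides listed above."""
    return replace(
        settings,
        trace_sample_rate=1.0,
        rule_hit_log_sample_rate=1.0,
        retention_days=30,
        incident_mode=True,
    )


def deactivate_incident(defaults: TelemetrySettings) -> TelemetrySettings:
    """Restore the stored defaults once the incident is closed."""
    return replace(defaults, incident_mode=False)


# retention_days=7 is a placeholder default, not a documented value.
defaults = TelemetrySettings(
    trace_sample_rate=0.10, rule_hit_log_sample_rate=0.01, retention_days=7
)
during = activate_incident(defaults)
after = deactivate_incident(defaults)
print(during)
print(after)
```

Keeping the defaults as an immutable value and deriving the incident settings from it makes the roll-back step a plain restore of the stored defaults.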
 
8 · Integration Points
- Authority: Exposes metric policy_scope_denied_total for failed authorisation; correlate with policy_api_requests_total.
- Concelier/Excititor: Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
- Scheduler: Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
- Offline Kit: CLI exports logs + metrics snapshots (stella offline bundle metrics) for air-gapped audits.
9 · Compliance Checklist
- Metrics registered: All metrics listed above exported and documented in Grafana dashboards.
- Alert policies configured: Ops or Observability Guild created alerts matching table in §6.
- Sampling overrides tested: Incident mode toggles verified in staging; retention roll-back rehearsed.
- Trace propagation validated: CLI/UI display trace IDs and allow copy for support.
- Log scrubbing enforced: Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
- Offline capture rehearsed: Metrics/log snapshot commands executed in sealed environment.
- Docs cross-links: Links to architecture, runs, lifecycle, CLI, API docs verified.
Last updated: 2025-10-26 (Sprint 20).