Restructure solution layout by module
		| @@ -1,166 +1,166 @@ | ||||
| # Policy Engine Observability | ||||
|  | ||||
| > **Audience:** Observability Guild, SRE/Platform operators, Policy Guild.   | ||||
| > **Scope:** Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).   | ||||
| > **Prerequisites:** Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1 · Instrumentation Overview | ||||
|  | ||||
| - **Telemetry stack:** OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo. | ||||
| - **Namespace conventions:** `policy.*` for metrics/traces/log categories; labels use `tenant`, `policy`, `mode`, `runId`. | ||||
| - **Sampling:** Default 10 % trace sampling, 1 % rule-hit log sampling; incident mode overrides both to 100 % (see §7). | ||||
| - **Correlation IDs:** Every API request gets a `traceId` and a `requestId`. The CLI and UI display both IDs to streamline support. | ||||
|  | ||||
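The sampling defaults above can be illustrated with a deterministic, trace-ID-based decision. This is a minimal sketch that assumes a ratio sampler keyed on the low 64 bits of the trace ID; the document does not say which sampler the service actually uses, and the names `should_sample`, `TRACE_SAMPLE_RATIO`, and `RULE_HIT_LOG_RATIO` are hypothetical.

```python
# Hypothetical constants mirroring the documented defaults.
TRACE_SAMPLE_RATIO = 0.10   # 10 % trace sampling
RULE_HIT_LOG_RATIO = 0.01   # 1 % rule-hit log sampling


def should_sample(trace_id_hex: str, ratio: float, incident_mode: bool = False) -> bool:
    """Return a deterministic sampling decision for a 32-hex-char trace ID.

    The same trace ID always yields the same answer, so every component
    that sees the ID agrees on whether the trace is kept.
    """
    if incident_mode:
        # Incident mode overrides sampling to 100 %.
        return True
    # Treat the low 64 bits of the trace ID as a uniformly distributed value.
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < int(ratio * (1 << 64))
```

Keying the decision on the trace ID rather than on a random draw keeps the spans of one request together: either all of them are sampled or none are.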
| --- | ||||
|  | ||||
| ## 2 · Metrics | ||||
|  | ||||
| ### 2.1 Run Pipeline | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target ≤ 5 min incremental, ≤ 30 min full. | | ||||
| | `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated each enqueue/dequeue). | | ||||
| | `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. | | ||||
| | `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. | | ||||
| | `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting run. | | ||||
|  | ||||
| ### 2.2 Evaluator Insights | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Increment per rule match (sampled). | | ||||
| | `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. | | ||||
| | `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. | | ||||
| | `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. | | ||||
| | `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero indicates optimistic concurrency retries. | | ||||
|  | ||||
| ### 2.3 API Surface | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. | | ||||
| | `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget ≤ 250 ms for GETs, ≤ 1 s for POSTs. | | ||||
| | `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (`429`). | | ||||
|  | ||||
| ### 2.4 Queue & Change Streams | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. | | ||||
| | `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. | | ||||
| | `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of oldest unprocessed change event. | | ||||
|  | ||||
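As a rough illustration of how a value such as `policy_delta_backlog_age_seconds` could be derived, the sketch below computes the age of the oldest unprocessed change event. It assumes the pending events carry timezone-aware timestamps; the function name and input shape are hypothetical and not taken from the service.

```python
from datetime import datetime, timezone
from typing import Iterable


def backlog_age_seconds(pending_event_times: Iterable[datetime], now: datetime) -> float:
    """Age in seconds of the oldest unprocessed event, or 0.0 when nothing is pending."""
    times = list(pending_event_times)
    if not times:
        return 0.0
    oldest = min(times)
    # Clamp at zero so clock skew never produces a negative gauge value.
    return max(0.0, (now - oldest).total_seconds())
```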
| --- | ||||
|  | ||||
| ## 3 · Logs | ||||
|  | ||||
| - **Format:** JSON (`Serilog`). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`. | ||||
| - **Log categories:** | ||||
|   - `policy.run` (queue lifecycle, run begin/end, stats) | ||||
|   - `policy.evaluate` (batch execution summaries; rule-hit sampling) | ||||
|   - `policy.materialize` (Mongo operations, conflicts, retries) | ||||
|   - `policy.simulate` (diff results, CLI invocation metadata) | ||||
|   - `policy.lifecycle` (submit/review/approve events) | ||||
| - **Sampling:** Rule-hit logs are sampled at 1 % by default; sampling rises to 100 % in incident mode or when the CLI is invoked with the `--trace` flag. | ||||
| - **PII:** No user secrets recorded; user identities referenced as `user:<id>` or `group:<id>` only. | ||||
|  | ||||
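To make the log shape concrete, the sketch below assembles one JSON record from the core fields listed above and shows the `user:<id>` / `group:<id>` identity convention. The service itself logs through Serilog; this Python version only illustrates the field layout, and the helper names and sample values are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def identity_ref(kind: str, identifier: str) -> str:
    """Format an identity as `user:<id>` or `group:<id>`, the only forms allowed in logs."""
    if kind not in ("user", "group"):
        raise ValueError("kind must be 'user' or 'group'")
    return f"{kind}:{identifier}"


def build_log_record(level: str, message: str, *, tenant: str, policy_id: str,
                     policy_version: int, run_id: str, trace_id: str,
                     rule: Optional[str] = None, sealed: bool = False,
                     error_code: Optional[str] = None,
                     timestamp: Optional[datetime] = None) -> str:
    """Serialize one structured log line using the documented core field names."""
    record = {
        "timestamp": (timestamp or datetime.now(timezone.utc)).isoformat(),
        "level": level,
        "message": message,
        "policyId": policy_id,
        "policyVersion": policy_version,
        "tenant": tenant,
        "runId": run_id,
        "traceId": trace_id,
        "env.sealed": sealed,
    }
    # Optional fields are emitted only when they carry a value.
    if rule is not None:
        record["rule"] = rule
    if error_code is not None:
        record["error.code"] = error_code
    return json.dumps(record, sort_keys=True)
```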
| --- | ||||
|  | ||||
| ## 4 · Traces | ||||
|  | ||||
| - Spans are emitted via OpenTelemetry instrumentation. | ||||
| - **Primary spans:** | ||||
|   - `policy.api` – wraps HTTP request, records `endpoint`, `status`, `scope`. | ||||
|   - `policy.select` – change stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`). | ||||
|   - `policy.evaluate` – evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`). | ||||
|   - `policy.materialize` – Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`). | ||||
|   - `policy.simulate` – simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`). | ||||
| - Trace context is propagated to the CLI via the `traceparent` response header; the UI surfaces it in the run detail view. | ||||
| - Incident mode forces span sampling to 100 % and extends retention via Collector config override. | ||||
|  | ||||
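A client that receives the `traceparent` response header can extract the trace ID for display or support tickets. The sketch below parses the version `00` layout defined by W3C Trace Context (`version-traceid-parentid-flags`); the function and type names are hypothetical, and nothing here reflects how the CLI actually implements it.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 `traceparent` value; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # This sketch only accepts version 00; all-zero IDs are invalid per the spec.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```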
| --- | ||||
|  | ||||
| ## 5 · Dashboards | ||||
|  | ||||
| ### 5.1 Policy Runs Overview | ||||
|  | ||||
| Widgets: | ||||
| - Run duration histogram (per mode/tenant). | ||||
| - Queue depth + backlog age line charts. | ||||
| - Failure rate stacked by error code. | ||||
| - Incremental backlog heatmap (policy × age). | ||||
| - Active vs scheduled runs table. | ||||
|  | ||||
| ### 5.2 Rule Impact & VEX | ||||
|  | ||||
| - Top N rules by firings (bar chart). | ||||
| - VEX overrides by vendor/justification (stacked chart). | ||||
| - Suppression usage (pie + table with justifications). | ||||
| - Quieted findings trend (line). | ||||
|  | ||||
| ### 5.3 Simulation & Approval Health | ||||
|  | ||||
| - Simulation diff histogram (added vs removed). | ||||
| - Pending approvals by age (table with SLA colour coding). | ||||
| - Compliance checklist status (lint, determinism CI, simulation evidence). | ||||
|  | ||||
| > Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (`../assets/policy-observability/*.png`). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6 · Alerting | ||||
|  | ||||
| | Alert | Condition | Suggested Action | | ||||
| |-------|-----------|------------------| | ||||
| | **PolicyRunSlaBreach** | `policy_run_seconds{mode="incremental"}` P95 > 300 s for 3 windows | Check queue depth, upstream services, scale worker pool. | | ||||
| | **PolicyQueueStuck** | `policy_delta_backlog_age_seconds` > 600 | Investigate change stream connectivity. | | ||||
| | **DeterminismMismatch** | Run status `failed` with `ERR_POL_004` OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. | | ||||
| | **SimulationDrift** | CLI/CI simulation exit `20` (blocking diff) over threshold | Review policy changes before approval. | | ||||
| | **VexOverrideSpike** | `policy_vex_overrides_total` > configured baseline (per vendor) | Verify the upstream VEX feed; confirm the justification codes are expected. | ||||
| | **SuppressionSurge** | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppression rules; check approvals. | ||||
|  | ||||
| Alerts integrate with Notifier channels (`policy.alerts`) and Ops on-call rotations. | ||||
|  | ||||
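Two of the conditions in the table reduce to simple arithmetic, sketched below. The sketch assumes that "for 3 windows" means three consecutive evaluation windows and that the 3σ baseline is the mean and population standard deviation of earlier per-window increases; both readings are assumptions, and the function names are hypothetical. Real alert rules would normally live in the alerting backend rather than in application code.

```python
from statistics import mean, pstdev
from typing import Sequence


def sla_breached(p95_by_window: Sequence[float], threshold_s: float = 300.0,
                 windows: int = 3) -> bool:
    """PolicyRunSlaBreach: the most recent `windows` P95 values all exceed the threshold."""
    recent = list(p95_by_window)[-windows:]
    return len(recent) == windows and all(value > threshold_s for value in recent)


def suppression_surge(baseline_increases: Sequence[float], current_increase: float,
                      sigmas: float = 3.0) -> bool:
    """SuppressionSurge: the current increase exceeds baseline mean + `sigmas` * stddev."""
    if len(baseline_increases) < 2:
        # Too little history to estimate a baseline.
        return False
    mu = mean(baseline_increases)
    sigma = pstdev(baseline_increases)
    return current_increase > mu + sigmas * sigma
```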
| --- | ||||
|  | ||||
| ## 7 · Incident Mode & Forensics | ||||
|  | ||||
| - Toggle via `POST /api/policy/incidents/activate` (requires `policy:operate` scope). | ||||
| - Effects: | ||||
|   - Trace sampling → 100 %. | ||||
|   - Rule-hit log sampling → 100 %. | ||||
|   - Retention window extended to 30 days for incident duration. | ||||
|   - `policy.incident.activated` event emitted (Console + Notifier banners). | ||||
| - Post-incident tasks: | ||||
|   - `stella policy run replay` for affected runs; attach bundles to incident record. | ||||
|   - Restore sampling defaults with `.../deactivate`. | ||||
|   - Update incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings. | ||||
|  | ||||
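The effects listed above amount to swapping one set of telemetry settings for another. The sketch below models that switch; the type and function names are hypothetical, and because the default retention window is not stated in this document it is represented as `None` (meaning "unchanged") rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TelemetrySettings:
    trace_sample_ratio: float
    rule_hit_log_ratio: float
    retention_days: Optional[int]  # None = default retention (not specified here)


# Documented defaults: 10 % traces, 1 % rule-hit logs.
DEFAULT_SETTINGS = TelemetrySettings(0.10, 0.01, None)
# Documented incident-mode effects: 100 % sampling, 30-day retention.
INCIDENT_SETTINGS = TelemetrySettings(1.0, 1.0, 30)


def effective_settings(incident_active: bool) -> TelemetrySettings:
    """Return the settings in force for the current incident state."""
    return INCIDENT_SETTINGS if incident_active else DEFAULT_SETTINGS
```

Keeping the two profiles as immutable values makes the post-incident restore a plain switch back to `DEFAULT_SETTINGS`, with no per-field rollback to forget.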
| --- | ||||
|  | ||||
| ## 8 · Integration Points | ||||
|  | ||||
| - **Authority:** Exposes metric `policy_scope_denied_total` for failed authorisation; correlate with `policy_api_requests_total`. | ||||
| - **Concelier/Excititor:** Shared trace IDs propagate via gRPC metadata to help debug upstream latency. | ||||
| - **Scheduler:** Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002). | ||||
| - **Offline Kit:** CLI exports logs + metrics snapshots (`stella offline bundle metrics`) for air-gapped audits. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 9 · Compliance Checklist | ||||
|  | ||||
| - [ ] **Metrics registered:** All metrics listed above exported and documented in Grafana dashboards. | ||||
| - [ ] **Alert policies configured:** Ops or Observability Guild created alerts matching table in §6. | ||||
| - [ ] **Sampling overrides tested:** Incident mode toggles verified in staging; retention roll-back rehearsed. | ||||
| - [ ] **Trace propagation validated:** CLI/UI display trace IDs and allow copy for support. | ||||
| - [ ] **Log scrubbing enforced:** Unit tests guarantee no secrets/PII in logs; sampling respects configuration. | ||||
| - [ ] **Offline capture rehearsed:** Metrics/log snapshot commands executed in sealed environment. | ||||
| - [ ] **Docs cross-links:** Links to architecture, runs, lifecycle, CLI, API docs verified. | ||||
|  | ||||
| --- | ||||
|  | ||||
| *Last updated: 2025-10-26 (Sprint 20).* | ||||
|  | ||||