Restructure solution layout by module

root
2025-10-28 15:10:40 +02:00
parent 4e3e575db5
commit 68da90a11a
4103 changed files with 192899 additions and 187024 deletions


@@ -1,166 +1,166 @@
# Policy Engine Observability
> **Audience:** Observability Guild, SRE/Platform operators, Policy Guild.
> **Scope:** Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
> **Prerequisites:** Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config).
---
## 1·Instrumentation Overview
- **Telemetry stack:** OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
- **Namespace conventions:** `policy.*` for metrics/traces/log categories; labels use `tenant`, `policy`, `mode`, `runId`.
- **Sampling:** Default 10% trace sampling, 1% rule-hit log sampling; incident mode overrides both to 100% (see §7).
- **Correlation IDs:** Every API request gets `traceId` + `requestId`. CLI/UI display IDs to streamline support.
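
The sampling defaults above can be illustrated with a minimal sketch. It assumes deterministic sampling keyed on the trace ID (so every component reaches the same keep/drop decision for a given trace); the hashing strategy, constant names, and function names are illustrative assumptions, not the Policy Engine's actual implementation.

```python
import hashlib

# Defaults quoted in the overview; the constant names are illustrative.
TRACE_SAMPLE_RATE = 0.10
RULE_HIT_LOG_SAMPLE_RATE = 0.01


def effective_rate(default_rate: float, incident_mode: bool) -> float:
    """Incident mode overrides any default rate to 100%."""
    return 1.0 if incident_mode else default_rate


def should_sample(key: str, rate: float) -> bool:
    """Map a key (for example a trace ID) to a repeatable keep/drop decision."""
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and scale to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the decision depends only on the key and the rate, calling `should_sample(trace_id, effective_rate(TRACE_SAMPLE_RATE, incident_mode))` twice with the same arguments always yields the same answer.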
---
## 2·Metrics
### 2.1 Run Pipeline
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target ≤5 min incremental, ≤30 min full. |
| `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated each enqueue/dequeue). |
| `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. |
| `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. |
| `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting run. |
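
To make the P95 targets for `policy_run_seconds` concrete, here is a hedged sketch of estimating a quantile from cumulative histogram buckets using linear interpolation inside the bucket that contains the target rank. The bucket boundaries and counts in the usage note are hypothetical, and the sketch assumes finite upper bounds (no `+Inf` bucket).

```python
from typing import Sequence, Tuple


def quantile_from_buckets(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound; the last count is the total.
    """
    if not buckets or not 0.0 <= q <= 1.0:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        raise ValueError("histogram is empty")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if count == lower_count:
                return upper_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]
```

With hypothetical incremental-run buckets `[(60, 50), (120, 80), (300, 96), (600, 100)]`, `quantile_from_buckets(0.95, ...)` returns `288.75`, which is inside the 5 min (300 s) incremental target.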
### 2.2 Evaluator Insights
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Increment per rule match (sampled). |
| `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. |
| `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. |
| `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. |
| `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero indicates optimistic concurrency retries. |
### 2.3 API Surface
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. |
| `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget ≤250 ms for GETs, ≤1 s for POSTs. |
| `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (`429`). |
### 2.4 Queue & Change Streams
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. |
| `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. |
| `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of oldest unprocessed change event. |
---
## 3·Logs
- **Format:** JSON (`Serilog`). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`.
- **Log categories:**
  - `policy.run` (queue lifecycle, run begin/end, stats)
  - `policy.evaluate` (batch execution summaries; rule-hit sampling)
  - `policy.materialize` (Mongo operations, conflicts, retries)
  - `policy.simulate` (diff results, CLI invocation metadata)
  - `policy.lifecycle` (submit/review/approve events)
- **Sampling:** Rule-hit logs are sampled at 1% by default and switch to 100% in incident mode or when the `--trace` flag is used in the CLI.
- **PII:** No user secrets recorded; user identities referenced as `user:<id>` or `group:<id>` only.
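
As an illustration of the log shape described above, the following sketch builds one JSON record containing the listed core fields and formats identities as `user:<id>` or `group:<id>`. The helper names and all sample values are hypothetical, and treating `env.sealed` and `error.code` as flat dotted keys (rather than nested objects) is an assumption.

```python
import json
from datetime import datetime, timezone


def principal_ref(kind: str, identifier: str) -> str:
    """Reference an identity without embedding names or secrets."""
    if kind not in ("user", "group"):
        raise ValueError("kind must be 'user' or 'group'")
    return f"{kind}:{identifier}"


def build_log_record(level, message, *, tenant, policy_id, policy_version,
                     run_id, trace_id, rule=None, sealed=False, error_code=None):
    """Serialize one structured log entry with the documented core fields."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "policyId": policy_id,
        "policyVersion": policy_version,
        "tenant": tenant,
        "runId": run_id,
        "rule": rule,
        "traceId": trace_id,
        "env.sealed": sealed,      # assumption: flat dotted key
        "error.code": error_code,  # assumption: flat dotted key
    }
    return json.dumps(record)
```

A call such as `build_log_record("Information", f"Run approved by {principal_ref('user', '42')}", tenant="acme", policy_id="P-1", policy_version=3, run_id="run-7", trace_id="abc")` yields a single-line JSON document suitable for a log pipeline.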
---
## 4·Traces
- Spans are emitted via OpenTelemetry instrumentation.
- **Primary spans:**
  - `policy.api`: wraps the HTTP request; records `endpoint`, `status`, `scope`.
  - `policy.select`: change stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`).
  - `policy.evaluate`: evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`).
  - `policy.materialize`: Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`).
  - `policy.simulate`: simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`).
- Trace context is propagated to the CLI via the `traceparent` response header; the UI surfaces it in the run detail view.
- Incident mode forces span sampling to 100% and extends retention via Collector config override.
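
For clients that want to read the `traceparent` response header mentioned above, here is a small sketch of a parser for the W3C Trace Context layout (`version-traceid-parentid-flags`, lowercase hex). It handles only the four-field form; the type and function names are illustrative and not part of the CLI.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None when the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

The `trace_id` field is the value to quote in support requests; `sampled` indicates whether the trace was recorded under the sampling rate in effect.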
---
## 5·Dashboards
### 5.1 Policy Runs Overview
Widgets:
- Run duration histogram (per mode/tenant).
- Queue depth + backlog age line charts.
- Failure rate stacked by error code.
- Incremental backlog heatmap (policy × age).
- Active vs scheduled runs table.
### 5.2 Rule Impact & VEX
- Top N rules by firings (bar chart).
- VEX overrides by vendor/justification (stacked chart).
- Suppression usage (pie + table with justifications).
- Quieted findings trend (line).
### 5.3 Simulation & Approval Health
- Simulation diff histogram (added vs removed).
- Pending approvals by age (table with SLA colour coding).
- Compliance checklist status (lint, determinism CI, simulation evidence).
> Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (`../assets/policy-observability/*.png`).
---
## 6·Alerting
| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| **PolicyRunSlaBreach** | `policy_run_seconds{mode="incremental"}` P95 > 300s for 3 windows | Check queue depth, upstream services, scale worker pool. |
| **PolicyQueueStuck** | `policy_delta_backlog_age_seconds` > 600 | Investigate change stream connectivity. |
| **DeterminismMismatch** | Run status `failed` with `ERR_POL_004` OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
| **SimulationDrift** | CLI/CI simulation exit `20` (blocking diff) over threshold | Review policy changes before approval. |
| **VexOverrideSpike** | `policy_vex_overrides_total` > configured baseline (per vendor) | Verify the upstream VEX feed; confirm the justification codes are expected. |
| **SuppressionSurge** | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppress rules; check approvals. |
Alerts integrate with Notifier channels (`policy.alerts`) and Ops on-call rotations.
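
Two of the conditions in the table can be sketched as plain functions. This is an illustration under stated assumptions: "for 3 windows" is read as three consecutive evaluation windows, the 3σ baseline uses the population standard deviation of recent per-window increases, and the function names are hypothetical. Real alert rules would live in the alerting backend rather than in application code.

```python
from statistics import mean, pstdev
from typing import Sequence


def sla_breached(p95_seconds: Sequence[float], threshold: float = 300.0,
                 windows: int = 3) -> bool:
    """True when the most recent `windows` P95 samples all exceed the threshold."""
    if len(p95_seconds) < windows:
        return False
    return all(value > threshold for value in p95_seconds[-windows:])


def is_surge(baseline_increases: Sequence[float], current_increase: float,
             sigmas: float = 3.0) -> bool:
    """True when the current increase exceeds baseline mean + sigmas * stddev."""
    if len(baseline_increases) < 2:
        return False  # not enough history to form a baseline
    mu = mean(baseline_increases)
    sigma = pstdev(baseline_increases)
    return current_increase > mu + sigmas * sigma
```

For example, `sla_breached([200, 310, 320, 305])` is true because the last three windows all exceed 300 s, while a single slow window does not trigger the alert.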
---
## 7·Incident Mode & Forensics
- Toggle via `POST /api/policy/incidents/activate` (requires `policy:operate` scope).
- Effects:
  - Trace sampling → 100%.
  - Rule-hit log sampling → 100%.
  - Retention window extended to 30 days for incident duration.
  - `policy.incident.activated` event emitted (Console + Notifier banners).
- Post-incident tasks:
  - Run `stella policy run replay` for affected runs; attach bundles to the incident record.
  - Restore sampling defaults with `.../deactivate`.
  - Update the incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings.
---
## 8·Integration Points
- **Authority:** Exposes metric `policy_scope_denied_total` for failed authorisation; correlate with `policy_api_requests_total`.
- **Concelier/Excititor:** Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
- **Scheduler:** Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
- **Offline Kit:** CLI exports logs + metrics snapshots (`stella offline bundle metrics`) for air-gapped audits.
---
## 9·Compliance Checklist
- [ ] **Metrics registered:** All metrics listed above exported and documented in Grafana dashboards.
- [ ] **Alert policies configured:** Ops or Observability Guild created alerts matching table in §6.
- [ ] **Sampling overrides tested:** Incident mode toggles verified in staging; retention roll-back rehearsed.
- [ ] **Trace propagation validated:** CLI/UI display trace IDs and allow copy for support.
- [ ] **Log scrubbing enforced:** Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
- [ ] **Offline capture rehearsed:** Metrics/log snapshot commands executed in sealed environment.
- [ ] **Docs cross-links:** Links to architecture, runs, lifecycle, CLI, API docs verified.
---
*Last updated: 2025-10-26 (Sprint 20).*