Policy Engine Observability

Audience: Observability Guild, SRE/Platform operators, Policy Guild.
Scope: Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
Prerequisites: Policy Engine v2 deployed with OpenTelemetry exporters enabled (observability:enabled=true in config).

1 · Instrumentation Overview

Telemetry stack: OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
Namespace conventions: policy.* for metrics/traces/log categories; labels use tenant, policy, mode, runId.
Sampling: Default 10 % trace sampling and 1 % rule-hit log sampling; incident mode raises both to 100 % (see §7).
Correlation IDs: Every API request is assigned a traceId and a requestId. The CLI and UI display both IDs so they can be quoted in support requests.
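
The sampling defaults above can be illustrated with a deterministic, trace-ID-based ratio decision. This is a minimal sketch in Python, not the Policy Engine's actual sampler (which would presumably be configured through the OpenTelemetry SDK). All function and constant names are hypothetical; only the 10 %, 1 %, and 100 % ratios come from this document.

```python
# Hypothetical names; only the ratios are taken from this document.
TRACE_RATIO_DEFAULT = 0.10      # default trace sampling
RULE_HIT_RATIO_DEFAULT = 0.01   # default rule-hit log sampling
INCIDENT_RATIO = 1.0            # incident mode override


def effective_ratio(default_ratio: float, incident_mode: bool) -> float:
    """Return the sampling ratio in force, honouring the incident-mode override."""
    return INCIDENT_RATIO if incident_mode else default_ratio


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide from the trace ID alone, so repeated decisions for one trace agree.

    The low 64 bits of the 128-bit trace ID are compared against a threshold
    proportional to the ratio (an illustrative scheme, assumed for this sketch).
    """
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < int(ratio * (1 << 64))
```

Deciding from the trace ID rather than a random draw keeps a trace either fully sampled or fully dropped across services that share the ID.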

2 · Metrics

2.1 Run Pipeline

Metric	Type	Labels	Notes
`policy_run_seconds`	Histogram	`tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`)	P95 target ≤ 5 min incremental, ≤ 30 min full.
`policy_run_queue_depth`	Gauge	`tenant`	Number of pending jobs per tenant (updated each enqueue/dequeue).
`policy_run_failures_total`	Counter	`tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`)	Aligns with error codes.
`policy_run_retries_total`	Counter	`tenant`, `policy`	Helps identify noisy sources.
`policy_run_inputs_pending_bytes`	Gauge	`tenant`	Size of buffered change batches awaiting run.
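
To show how a P95 target can be checked against a histogram such as `policy_run_seconds`, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the bucket, the same general approach Prometheus takes in `histogram_quantile`. The bucket bounds and counts are made-up sample data, and the function name is hypothetical.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound, ending with a math.inf bucket that counts every observation.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bound; report that bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up incremental-run durations, bucketed in seconds.
sample_buckets = [(60.0, 40), (120.0, 70), (300.0, 96), (600.0, 99), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample_buckets)
within_target = p95 <= 300.0  # 5 min incremental target from the table above
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket bounds bracket the target; a bound at exactly 300 s keeps the incremental check meaningful.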

2.2 Evaluator Insights

Metric	Type	Labels	Notes
`policy_rules_fired_total`	Counter	`tenant`, `policy`, `rule`	Increment per rule match (sampled).
`policy_vex_overrides_total`	Counter	`tenant`, `policy`, `vendor`, `justification`	Tracks VEX precedence decisions.
`policy_suppressions_total`	Counter	`tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`)	Audits suppression usage.
`policy_selection_batch_duration_seconds`	Histogram	`tenant`, `policy`	Measures joiner performance.
`policy_materialization_conflicts_total`	Counter	`tenant`, `policy`	Non-zero indicates optimistic concurrency retries.

2.3 API Surface

Metric	Type	Labels	Notes
`policy_api_requests_total`	Counter	`endpoint`, `method`, `status`	Exposed via Minimal API instrumentation.
`policy_api_latency_seconds`	Histogram	`endpoint`, `method`	Budget ≤ 250 ms for GETs, ≤ 1 s for POSTs.
`policy_api_rate_limited_total`	Counter	`endpoint`	Tied to throttles (`429`).

2.4 Queue & Change Streams

Metric	Type	Labels	Notes
`policy_queue_leases_active`	Gauge	`tenant`	Number of leased jobs.
`policy_queue_lease_expirations_total`	Counter	`tenant`	Alerts when workers fail to ack.
`policy_delta_backlog_age_seconds`	Gauge	`tenant`, `source` (`concelier`, `excititor`, `sbom`)	Age of oldest unprocessed change event.

3 · Logs

Format: JSON (Serilog). Core fields: timestamp, level, message, policyId, policyVersion, tenant, runId, rule, traceId, env.sealed, error.code.
Log categories:
- policy.run (queue lifecycle, run begin/end, stats)
- policy.evaluate (batch execution summaries; rule-hit sampling)
- policy.materialize (Mongo operations, conflicts, retries)
- policy.simulate (diff results, CLI invocation metadata)
- policy.lifecycle (submit/review/approve events)
Sampling: Rule-hit logs are sampled at 1 % by default; sampling switches to 100 % in incident mode or when the --trace flag is used in the CLI.
PII: No user secrets recorded; user identities referenced as user:<id> or group:<id> only.
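
The log conventions above can be sketched as follows. The service itself logs through Serilog, so this Python snippet only illustrates the shape of an event, the rule-hit sampling decision, and the identity format; every function name and sample value is hypothetical.

```python
import json
import random
from datetime import datetime, timezone


def actor_ref(kind: str, identifier: str) -> str:
    """Reference an identity as user:<id> or group:<id>, never by name or secret."""
    if kind not in ("user", "group"):
        raise ValueError(f"unsupported actor kind: {kind}")
    return f"{kind}:{identifier}"


def rule_hit_sampled(rng: random.Random, *, incident_mode: bool = False,
                     trace_flag: bool = False, ratio: float = 0.01) -> bool:
    """Sample rule-hit logs at 1 % unless incident mode or --trace forces 100 %."""
    if incident_mode or trace_flag:
        return True
    return rng.random() < ratio


def render_event(level: str, message: str, fields: dict) -> str:
    """Serialise one log event as a single JSON line."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    event.update(fields)
    return json.dumps(event, sort_keys=True)


line = render_event("Information", "rule matched", {
    "policyId": "P-7",            # sample values, not real identifiers
    "policyVersion": 4,
    "tenant": "tenant-a",
    "runId": "run-123",
    "rule": "block_critical",
    "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "env.sealed": False,
})
```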

4 · Traces

Spans are emitted via OpenTelemetry instrumentation.
Primary spans:
- policy.api – wraps HTTP request, records endpoint, status, scope.
- policy.select – change stream ingestion and batch assembly (attributes: candidateCount, cursor).
- policy.evaluate – evaluation batch (attributes: batchSize, ruleHits, severityChanges).
- policy.materialize – Mongo writes (attributes: writes, historyWrites, retryCount).
- policy.simulate – simulation diff generation (attributes: sbomCount, diffAdded, diffRemoved).
Trace context is propagated to the CLI via the traceparent response header; the UI surfaces it in the run detail view.
Incident mode forces span sampling to 100 % and extends retention via Collector config override.
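
A client that wants to display the trace ID could parse the `traceparent` header as sketched below. The layout (version, trace ID, parent ID, flags) follows the W3C Trace Context format; the function and type names are hypothetical, and the sketch handles only the four-field layout rather than possible future versions.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceContext]:
    """Parse a traceparent header value; return None if it is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C format.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit marks a sampled trace
    return TraceContext(trace_id, parent_id, sampled)


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The `sampled` flag gives support staff a quick hint as to whether a full trace is likely to exist in Tempo for the displayed ID.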

5 · Dashboards

5.1 Policy Runs Overview

Widgets:

Run duration histogram (per mode/tenant).
Queue depth + backlog age line charts.
Failure rate stacked by error code.
Incremental backlog heatmap (policy × age).
Active vs scheduled runs table.

5.2 Rule Impact & VEX

Top N rules by firings (bar chart).
VEX overrides by vendor/justification (stacked chart).
Suppression usage (pie + table with justifications).
Quieted findings trend (line).

5.3 Simulation & Approval Health

Simulation diff histogram (added vs removed).
Pending approvals by age (table with SLA colour coding).
Compliance checklist status (lint, determinism CI, simulation evidence).

Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (../assets/policy-observability/*.png).

6 · Alerting

Alert	Condition	Suggested Action
PolicyRunSlaBreach	`policy_run_seconds{mode="incremental"}` P95 > 300 s for 3 windows	Check queue depth, upstream services, scale worker pool.
PolicyQueueStuck	`policy_delta_backlog_age_seconds` > 600	Investigate change stream connectivity.
DeterminismMismatch	Run status `failed` with `ERR_POL_004` OR CI replay diff	Switch to incident sampling, gather replay bundle, notify Policy Guild.
SimulationDrift	CLI/CI simulation exit `20` (blocking diff) over threshold	Review policy changes before approval.
VexOverrideSpike	`policy_vex_overrides_total` > configured baseline (per vendor)	Verify upstream VEX feed; ensure justification codes expected.
SuppressionSurge	`policy_suppressions_total` increase > 3σ vs baseline	Audit new suppress rules; check approvals.

Alerts integrate with Notifier channels (policy.alerts) and Ops on-call rotations.
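
Two of the conditions in the table, "for 3 windows" and "> 3σ vs baseline", can be sketched in Python as below. In practice these would be expressed as alert rules in the monitoring stack; the function names, the choice of the most recent consecutive windows, and the use of a sample standard deviation are all assumptions made for illustration.

```python
import statistics
from typing import Sequence


def breached_for_windows(values: Sequence[float], threshold: float, windows: int = 3) -> bool:
    """True if the most recent `windows` evaluations all exceed the threshold."""
    if len(values) < windows:
        return False
    return all(value > threshold for value in values[-windows:])


def is_surge(current_increase: float, baseline_increases: Sequence[float],
             sigmas: float = 3.0) -> bool:
    """True if the current increase exceeds the baseline mean by more than `sigmas` deviations."""
    mean = statistics.mean(baseline_increases)
    deviation = statistics.stdev(baseline_increases) if len(baseline_increases) > 1 else 0.0
    return current_increase > mean + sigmas * deviation


# PolicyRunSlaBreach: P95 of incremental runs, one value per evaluation window (sample data).
sla_breach = breached_for_windows([250.0, 310.0, 320.0, 305.0], threshold=300.0)

# SuppressionSurge: per-window increases of policy_suppressions_total (sample data).
surge = is_surge(50.0, [10.0, 12.0, 11.0, 9.0, 10.0, 8.0])
```

Requiring several consecutive windows trades detection latency for fewer pages caused by a single slow run.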

7 · Incident Mode & Forensics

Toggle via POST /api/policy/incidents/activate (requires policy:operate scope).
Effects:
- Trace sampling → 100 %.
- Rule-hit log sampling → 100 %.
- Retention window extended to 30 days for incident duration.
- policy.incident.activated event emitted (Console + Notifier banners).
Post-incident tasks:
- stella policy run replay for affected runs; attach bundles to incident record.
- Restore sampling defaults with .../deactivate.
- Update incident checklist in /docs/policy/lifecycle.md (section 8) with findings.
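
The activation call can be prepared as in this minimal sketch. Only the path and HTTP method come from this document; the base URL, the bearer-token header, and the empty request body are assumptions, so check the API docs before relying on them. The request is built but deliberately not sent.

```python
import urllib.request

ACTIVATE_PATH = "/api/policy/incidents/activate"  # path taken from this document


def build_activate_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that switches incident mode on.

    Assumptions: the token carries the policy:operate scope and is passed as a
    bearer token; the endpoint accepts an empty body.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + ACTIVATE_PATH,
        data=b"",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


# Hypothetical host and token, for illustration only.
request = build_activate_request("https://policy.example.internal/", "example-token")
```

Passing `request` to `urllib.request.urlopen` would perform the call; a 4xx response would most likely indicate a missing `policy:operate` scope.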

8 · Integration Points

Authority: Exposes metric policy_scope_denied_total for failed authorisation; correlate with policy_api_requests_total.
Concelier/Excititor: Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
Scheduler: Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
Offline Kit: CLI exports logs + metrics snapshots (stella offline bundle metrics) for air-gapped audits.

9 · Compliance Checklist

Metrics registered: All metrics listed above exported and documented in Grafana dashboards.
Alert policies configured: Ops or Observability Guild created alerts matching table in §6.
Sampling overrides tested: Incident mode toggles verified in staging; retention roll-back rehearsed.
Trace propagation validated: CLI/UI display trace IDs and allow copy for support.
Log scrubbing enforced: Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
Offline capture rehearsed: Metrics/log snapshot commands executed in sealed environment.
Docs cross-links: Links to architecture, runs, lifecycle, CLI, API docs verified.

Last updated: 2025-10-26 (Sprint 20).
