Add Policy DSL Validator, Schema Exporter, and Simulation Smoke tools
- Implemented PolicyDslValidator with command-line options for strict mode and JSON output.
- Created PolicySchemaExporter to generate JSON schemas for policy-related models.
- Developed PolicySimulationSmoke tool to validate policy simulations against expected outcomes.
- Added project files and necessary dependencies for each tool.
- Ensured proper error handling and usage instructions across tools.
141
docs/observability/observability.md
Normal file
@@ -0,0 +1,141 @@
# AOC Observability Guide

> **Audience:** Observability Guild, Concelier/Excititor SREs, platform operators.
> **Scope:** Metrics, traces, logs, dashboards, and runbooks introduced as part of the Aggregation-Only Contract (AOC) rollout (Sprint 19).

This guide captures the canonical signals emitted by Concelier and Excititor once the AOC guards are active. It explains how to consume the metrics in dashboards, correlate traces and logs during incident triage, and operate in offline environments. Pair this guide with the [AOC reference](../ingestion/aggregation-only-contract.md) and the [architecture overview](../architecture/overview.md).

---

## 1 · Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ingestion_write_total` | Counter | `source`, `tenant`, `result` (`ok`, `reject`, `noop`) | Counts write attempts to `advisory_raw`/`vex_raw`. Rejects correspond to guard failures. |
| `ingestion_latency_seconds` | Histogram | `source`, `tenant`, `phase` (`fetch`, `transform`, `write`) | Measures end-to-end runtime for ingestion stages. Use the 0.95 quantile for alerting. |
| `aoc_violation_total` | Counter | `source`, `tenant`, `code` (`ERR_AOC_00x`) | Total guard violations bucketed by error code. Drives dashboard pills and alert thresholds. |
| `ingestion_signature_verified_total` | Counter | `source`, `tenant`, `result` (`ok`, `fail`, `skipped`) | Tracks signature/checksum verification outcomes. |
| `advisory_revision_count` | Gauge | `source`, `tenant` | Supersedes depth for raw documents; spikes indicate noisy upstream feeds. |
| `verify_runs_total` | Counter | `tenant`, `initiator` (`ui`, `cli`, `api`, `scheduled`) | How many `stella aoc verify` or `/aoc/verify` runs executed. |
| `verify_duration_seconds` | Histogram | `tenant`, `initiator` | Runtime of verification jobs; use the P95 to detect regressions. |

### 1.1 Alerts

- **Violation spike:** Alert when `increase(aoc_violation_total[15m]) > 0` for critical sources. Page SRE if `code="ERR_AOC_005"` (signature failure) or `ERR_AOC_001` persists for more than 30 minutes.
- **Stale ingestion:** Alert when `max_over_time((ingestion_latency_seconds_sum / ingestion_latency_seconds_count)[30m:])` exceeds 30 s, or if `ingestion_write_total` shows no growth for more than 60 minutes.
- **Signature drop:** Warn when `rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0`.
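
The violation-spike and signature-drop alerts above can be sketched as Prometheus alerting rules. This is only a starting point: the group name, `for` durations, and severity labels are illustrative and should be tuned per deployment.

```yaml
groups:
  - name: aoc-ingestion
    rules:
      - alert: AocViolationSpike
        # Persistent signature-failure guard violations page the on-call SRE.
        expr: increase(aoc_violation_total{code="ERR_AOC_005"}[15m]) > 0
        for: 30m
        labels:
          severity: page
        annotations:
          summary: "Persistent AOC signature failures on {{ $labels.source }}"
      - alert: AocSignatureDrop
        # Any verification failures in the last hour warrant a warning.
        expr: rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0
        labels:
          severity: warn
        annotations:
          summary: "Signature verification failures for {{ $labels.source }}"
```
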

---

## 2 · Traces

### 2.1 Span taxonomy

| Span name | Parent | Key attributes |
|-----------|--------|----------------|
| `ingest.fetch` | job root span | `source`, `tenant`, `uri`, `contentHash` |
| `ingest.transform` | `ingest.fetch` | `documentType` (`csaf`, `osv`, `vex`), `payloadBytes` |
| `ingest.write` | `ingest.transform` | `collection` (`advisory_raw`, `vex_raw`), `result` (`ok`, `reject`) |
| `aoc.guard` | `ingest.write` | `code` (on violation), `violationCount`, `supersedes` |
| `verify.run` | verification job root | `tenant`, `window.from`, `window.to`, `sources`, `violations` |

### 2.2 Trace usage

- Correlate UI dashboard entries with traces via the `traceId` surfaced in violation drawers (`docs/ui/console.md`).
- Use `aoc.guard` spans to inspect guard payload snapshots. Sensitive fields are redacted automatically; raw JSON lives in secure logs only.
- For scheduled verification, filter traces by `initiator="scheduled"` to compare runtimes before and after a change.

---

## 3 · Logs

Structured logs are emitted as JSON and include the following keys:

| Key | Description |
|-----|-------------|
| `traceId` | Matches OpenTelemetry trace/span IDs for cross-system correlation. |
| `tenant` | Tenant identifier enforced by Authority middleware. |
| `source.vendor` | Logical source (e.g. `redhat`, `ubuntu`, `osv`, `ghsa`). |
| `upstream.upstreamId` | Vendor-provided ID (CVE, GHSA, etc.). |
| `contentHash` | `sha256:` digest of the raw document. |
| `violation.code` | Present when the guard rejects (`ERR_AOC_00x`). |
| `verification.window` | Present on `/aoc/verify` job logs. |

Logs are shipped to the central Loki/Elasticsearch cluster. To spot active AOC violations, use the template query:

```logql
{app="concelier-web"} | json | violation_code != ""
```
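
For orientation, a guard-rejection entry carrying the keys above might look like the following; all values are illustrative, not taken from a real feed.

```json
{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "tenant": "acme",
  "source.vendor": "redhat",
  "upstream.upstreamId": "CVE-2025-12345",
  "contentHash": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a0d",
  "violation.code": "ERR_AOC_001"
}
```
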

---

## 4 · Dashboards

The primary Grafana dashboard is **“AOC Ingestion Health”** (`dashboards/aoc-ingestion.json`). Panels include:

1. **Sources overview:** table fed by `ingestion_write_total` and `ingestion_latency_seconds` (mirrors the Console tiles).
2. **Violation trend:** stacked bar chart of `aoc_violation_total` per code.
3. **Signature success rate:** timeseries derived from `ingestion_signature_verified_total`.
4. **Supersedes depth:** gauge showing the P95 of `advisory_revision_count`.
5. **Verification runs:** histogram and latency boxplot using `verify_runs_total` / `verify_duration_seconds`.

Secondary dashboards:

- **AOC Alerts (Ops view):** summarises active alerts and the last verify run, and links to the incident runbook.
- **Offline Mode Dashboard:** fed from Offline Kit imports; highlights snapshot age and queued verification jobs.

Update `docs/assets/dashboards/` with screenshots once the Grafana capture pipeline produces the latest renders.

---

## 5 · Operational workflows

1. **During an ingestion incident:**
   - Check the Console dashboard for offending sources.
   - Pivot to logs using the document `contentHash`.
   - Re-run `stella sources ingest --dry-run` with the problematic payloads to validate fixes.
   - After remediation, run `stella aoc verify --since 24h` and confirm exit code `0`.
2. **Scheduled verification:**
   - Configure a cron job to run `stella aoc verify --format json --export ...`.
   - Ship the JSON to the `aoc-verify` bucket and ingest it into metrics using the custom exporter.
   - Alert on missing exports (no file uploaded within 26 h).
3. **Offline kit validation:**
   - Use the Offline Dashboard to ensure snapshots contain the latest metrics.
   - Run verification reports locally and attach them to the bundle before distribution.
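
The scheduled-verification step can be sketched as a cron entry, assuming an S3-compatible bucket reachable via the AWS CLI; the schedule, paths, and bucket layout are illustrative.

```cron
# Run AOC verification daily at 02:15 and ship the JSON report to the aoc-verify bucket.
15 2 * * * stella aoc verify --since 24h --format json --export /var/lib/stella/aoc-verify.json \
  && aws s3 cp /var/lib/stella/aoc-verify.json s3://aoc-verify/$(date +\%F).json
```

Pair this with the missing-export alert above so a silent cron failure is caught within 26 h.
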

---

## 6 · Offline considerations

- Metrics exporters bundled with the Offline Kit write to local Prometheus snapshots; sync them with central Grafana once connectivity is restored.
- CLI verification reports should be hashed (`sha256sum`) and archived for audit trails.
- Dashboards include offline data sources (`prometheus-offline`) switchable via a dropdown.
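
The hash-and-archive step can be scripted directly with coreutils; a minimal sketch, with an illustrative file name standing in for a real CLI export:

```shell
# Create a digest alongside the verification report, then confirm integrity later.
report=aoc-verify-report.json
printf '{"status":"ok"}\n' > "$report"    # stand-in for a real CLI export
sha256sum "$report" > "$report.sha256"    # archive this digest with the report
sha256sum --check "$report.sha256"        # prints "aoc-verify-report.json: OK"
```

Store the `.sha256` file with the bundle so auditors can re-verify the report offline.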

---

## 7 · References

- [Aggregation-Only Contract reference](../ingestion/aggregation-only-contract.md)
- [Architecture overview](../architecture/overview.md)
- [Console AOC dashboard](../ui/console.md)
- [CLI AOC commands](../cli/cli-reference.md)
- [Concelier architecture](../ARCHITECTURE_CONCELIER.md)
- [Excititor architecture](../ARCHITECTURE_EXCITITOR.md)

---

## 8 · Compliance checklist

- [ ] Metrics documented with label sets and alert guidance.
- [ ] Tracing span taxonomy aligned with the Concelier/Excititor implementation.
- [ ] Log schema matches the structured logging contracts (`traceId`, `tenant`, `source`, `contentHash`).
- [ ] Grafana dashboard references verified and screenshots scheduled.
- [ ] Offline/air-gap workflow captured.
- [ ] Cross-links to AOC reference, Console, and CLI docs included.
- [ ] Observability Guild sign-off scheduled (OWNER: @obs-guild, due 2025-10-28).

---

*Last updated: 2025-10-26 (Sprint 19).*

166
docs/observability/policy.md
Normal file
@@ -0,0 +1,166 @@
# Policy Engine Observability

> **Audience:** Observability Guild, SRE/platform operators, Policy Guild.
> **Scope:** Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
> **Prerequisites:** Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config).

---

## 1 · Instrumentation Overview

- **Telemetry stack:** OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
- **Namespace conventions:** `policy.*` for metrics, traces, and log categories; labels use `tenant`, `policy`, `mode`, `runId`.
- **Sampling:** 10 % trace sampling and 1 % rule-hit log sampling by default; incident mode overrides both to 100 % (see §7).
- **Correlation IDs:** Every API request gets a `traceId` and `requestId`. The CLI and UI display both IDs to streamline support.

---

## 2 · Metrics

### 2.1 Run Pipeline

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target: ≤ 5 min incremental, ≤ 30 min full. |
| `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated on each enqueue/dequeue). |
| `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. |
| `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. |
| `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting a run. |

### 2.2 Evaluator Insights

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Incremented per rule match (sampled). |
| `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. |
| `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. |
| `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. |
| `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero values indicate optimistic-concurrency retries. |

### 2.3 API Surface

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. |
| `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget: ≤ 250 ms for GETs, ≤ 1 s for POSTs. |
| `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (`429`). |

### 2.4 Queue & Change Streams

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. |
| `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. |
| `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of the oldest unprocessed change event. |

---

## 3 · Logs

- **Format:** JSON (Serilog). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`.
- **Log categories:**
  - `policy.run` (queue lifecycle, run begin/end, stats)
  - `policy.evaluate` (batch execution summaries; rule-hit sampling)
  - `policy.materialize` (Mongo operations, conflicts, retries)
  - `policy.simulate` (diff results, CLI invocation metadata)
  - `policy.lifecycle` (submit/review/approve events)
- **Sampling:** Rule-hit logs are sampled at 1 % by default; toggled to 100 % in incident mode or when the CLI `--trace` flag is used.
- **PII:** No user secrets are recorded; user identities are referenced only as `user:<id>` or `group:<id>`.
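
Putting the core fields together, a `policy.run` completion entry might look like the following; all values are illustrative, and the `category` field name is an assumption about how the Serilog source context is surfaced.

```json
{
  "timestamp": "2025-10-26T08:00:00Z",
  "level": "Information",
  "category": "policy.run",
  "message": "Policy run completed",
  "policyId": "pol-123",
  "policyVersion": 4,
  "tenant": "acme",
  "runId": "run-20251026-0800",
  "traceId": "0af7651916cd43dd8448eb211c80319c",
  "env.sealed": false
}
```
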

---

## 4 · Traces

- Spans are emitted via OpenTelemetry instrumentation.
- **Primary spans:**
  - `policy.api` – wraps the HTTP request; records `endpoint`, `status`, `scope`.
  - `policy.select` – change-stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`).
  - `policy.evaluate` – evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`).
  - `policy.materialize` – Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`).
  - `policy.simulate` – simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`).
- Trace context propagates to the CLI via the `traceparent` response header; the UI surfaces it in the run detail view.
- Incident mode forces span sampling to 100 % and extends retention via a Collector config override.

---

## 5 · Dashboards

### 5.1 Policy Runs Overview

Widgets:

- Run duration histogram (per mode/tenant).
- Queue depth and backlog age line charts.
- Failure rate stacked by error code.
- Incremental backlog heatmap (policy × age).
- Active vs scheduled runs table.

### 5.2 Rule Impact & VEX

- Top N rules by firings (bar chart).
- VEX overrides by vendor/justification (stacked chart).
- Suppression usage (pie + table with justifications).
- Quieted findings trend (line).

### 5.3 Simulation & Approval Health

- Simulation diff histogram (added vs removed).
- Pending approvals by age (table with SLA colour coding).
- Compliance checklist status (lint, determinism CI, simulation evidence).

> Placeholders for Grafana panels should be replaced with actual screenshots once the dashboards land (`../assets/policy-observability/*.png`).

---

## 6 · Alerting

| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| **PolicyRunSlaBreach** | `policy_run_seconds{mode="incremental"}` P95 > 300 s for three consecutive windows | Check queue depth and upstream services; scale the worker pool. |
| **PolicyQueueStuck** | `policy_delta_backlog_age_seconds` > 600 | Investigate change-stream connectivity. |
| **DeterminismMismatch** | Run status `failed` with `ERR_POL_004`, or a CI replay diff | Switch to incident sampling, gather the replay bundle, notify the Policy Guild. |
| **SimulationDrift** | CLI/CI simulation exit code `20` (blocking diff) over threshold | Review policy changes before approval. |
| **VexOverrideSpike** | `policy_vex_overrides_total` above the configured per-vendor baseline | Verify the upstream VEX feed; confirm justification codes are expected. |
| **SuppressionSurge** | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppress rules; check approvals. |
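
The first two alerts can be sketched as Prometheus alerting rules; the group name, evaluation windows, and severity labels are illustrative and should be tuned per deployment.

```yaml
groups:
  - name: policy-engine
    rules:
      - alert: PolicyRunSlaBreach
        # P95 of incremental run duration, computed from the histogram buckets.
        expr: |
          histogram_quantile(0.95,
            sum by (le, tenant) (rate(policy_run_seconds_bucket{mode="incremental"}[15m]))) > 300
        for: 45m
        labels:
          severity: warn
        annotations:
          summary: "Incremental policy runs breaching the 5 min P95 target for {{ $labels.tenant }}"
      - alert: PolicyQueueStuck
        # Oldest unprocessed change event older than 10 minutes.
        expr: policy_delta_backlog_age_seconds > 600
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Change-stream backlog stuck for {{ $labels.tenant }}"
```
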

Alerts integrate with Notifier channels (`policy.alerts`) and Ops on-call rotations.

---

## 7 · Incident Mode & Forensics

- Toggle via `POST /api/policy/incidents/activate` (requires the `policy:operate` scope).
- Effects:
  - Trace sampling → 100 %.
  - Rule-hit log sampling → 100 %.
  - Retention window extended to 30 days for the duration of the incident.
  - A `policy.incident.activated` event is emitted (Console + Notifier banners).
- Post-incident tasks:
  - Run `stella policy run replay` for the affected runs; attach the bundles to the incident record.
  - Restore sampling defaults with `.../deactivate`.
  - Update the incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings.
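
The activation toggle can be sketched as a raw HTTP request. Only the endpoint path and required scope come from this guide; the host and header layout are illustrative, and the endpoint may accept an empty body.

```http
POST /api/policy/incidents/activate HTTP/1.1
Host: policy-engine.internal
Authorization: Bearer <token carrying the policy:operate scope>
```
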

---

## 8 · Integration Points

- **Authority:** Exposes the `policy_scope_denied_total` metric for failed authorisation; correlate it with `policy_api_requests_total`.
- **Concelier/Excititor:** Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
- **Scheduler:** A future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
- **Offline Kit:** The CLI exports log and metric snapshots (`stella offline bundle metrics`) for air-gapped audits.

---

## 9 · Compliance Checklist

- [ ] **Metrics registered:** All metrics listed above are exported and documented in Grafana dashboards.
- [ ] **Alert policies configured:** Ops or the Observability Guild has created alerts matching the table in §6.
- [ ] **Sampling overrides tested:** Incident-mode toggles verified in staging; retention roll-back rehearsed.
- [ ] **Trace propagation validated:** CLI/UI display trace IDs and allow copying them for support.
- [ ] **Log scrubbing enforced:** Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
- [ ] **Offline capture rehearsed:** Metrics/log snapshot commands executed in a sealed environment.
- [ ] **Docs cross-links:** Links to the architecture, runs, lifecycle, CLI, and API docs verified.

---

*Last updated: 2025-10-26 (Sprint 20).*