Add Policy DSL Validator, Schema Exporter, and Simulation Smoke tools

- Implemented PolicyDslValidator with command-line options for strict mode and JSON output.
- Created PolicySchemaExporter to generate JSON schemas for policy-related models.
- Developed PolicySimulationSmoke tool to validate policy simulations against expected outcomes.
- Added project files and necessary dependencies for each tool.
- Ensured proper error handling and usage instructions across tools.
commit 799f787de2 (parent 2b7b88ca77, branch master)
Date: 2025-10-27 08:00:11 +02:00
712 changed files with 49449 additions and 6124 deletions


@@ -0,0 +1,141 @@
# AOC Observability Guide
> **Audience:** Observability Guild, Concelier/Excititor SREs, platform operators.
> **Scope:** Metrics, traces, logs, dashboards, and runbooks introduced as part of the Aggregation-Only Contract (AOC) rollout (Sprint 19).
This guide captures the canonical signals emitted by Concelier and Excititor once AOC guards are active. It explains how to consume the metrics in dashboards, correlate traces/logs for incident triage, and operate in offline environments. Pair this guide with the [AOC reference](../ingestion/aggregation-only-contract.md) and [architecture overview](../architecture/overview.md).
---
## 1·Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ingestion_write_total` | Counter | `source`, `tenant`, `result` (`ok`, `reject`, `noop`) | Counts write attempts to `advisory_raw`/`vex_raw`. Rejects correspond to guard failures. |
| `ingestion_latency_seconds` | Histogram | `source`, `tenant`, `phase` (`fetch`, `transform`, `write`) | Measures end-to-end runtime for ingestion stages. Alert on the P95 via `histogram_quantile(0.95, ...)`. |
| `aoc_violation_total` | Counter | `source`, `tenant`, `code` (`ERR_AOC_00x`) | Total guard violations bucketed by error code. Drives dashboard pills and alert thresholds. |
| `ingestion_signature_verified_total` | Counter | `source`, `tenant`, `result` (`ok`, `fail`, `skipped`) | Tracks signature/checksum verification outcomes. |
| `advisory_revision_count` | Gauge | `source`, `tenant` | Depth of the supersedes chain for raw documents; spikes indicate noisy upstream feeds. |
| `verify_runs_total` | Counter | `tenant`, `initiator` (`ui`, `cli`, `api`, `scheduled`) | How many `stella aoc verify` or `/aoc/verify` runs executed. |
| `verify_duration_seconds` | Histogram | `tenant`, `initiator` | Runtime of verification jobs; use P95 to detect regressions. |
### 1.1 Alerts
- **Violation spike:** Alert when `increase(aoc_violation_total[15m]) > 0` for critical sources. Page SRE if `code="ERR_AOC_005"` (signature failure) fires or `ERR_AOC_001` persists for more than 30 minutes.
- **Stale ingestion:** Alert when the mean latency `max_over_time((ingestion_latency_seconds_sum / ingestion_latency_seconds_count)[30m:])` exceeds 30 s, or when `ingestion_write_total` shows no growth for more than 60 minutes.
- **Signature drop:** Warn when `rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0`.
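
A minimal Prometheus alerting-rules sketch covering the violation-spike and signature-drop conditions above; the group name and severity routing are illustrative, and thresholds should be tuned per source criticality:

```yaml
groups:
  - name: aoc-ingestion            # illustrative group name
    rules:
      - alert: AocViolationSpike
        expr: increase(aoc_violation_total[15m]) > 0
        labels:
          severity: page           # in practice, page only for critical sources
        annotations:
          summary: "AOC guard violations on {{ $labels.source }} ({{ $labels.code }})"
      - alert: AocSignatureDrop
        expr: rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0
        labels:
          severity: warn
        annotations:
          summary: "Signature verification failures on {{ $labels.source }}"
```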
---
## 2·Traces
### 2.1 Span taxonomy
| Span name | Parent | Key attributes |
|-----------|--------|----------------|
| `ingest.fetch` | job root span | `source`, `tenant`, `uri`, `contentHash` |
| `ingest.transform` | `ingest.fetch` | `documentType` (`csaf`, `osv`, `vex`), `payloadBytes` |
| `ingest.write` | `ingest.transform` | `collection` (`advisory_raw`, `vex_raw`), `result` (`ok`, `reject`) |
| `aoc.guard` | `ingest.write` | `code` (on violation), `violationCount`, `supersedes` |
| `verify.run` | verification job root | `tenant`, `window.from`, `window.to`, `sources`, `violations` |
### 2.2 Trace usage
- Correlate UI dashboard entries with traces via `traceId` surfaced in violation drawers (`docs/ui/console.md`).
- Use `aoc.guard` spans to inspect guard payload snapshots. Sensitive fields are redacted automatically; raw JSON lives in secure logs only.
- For scheduled verification, filter traces by `initiator="scheduled"` to compare runtimes pre/post change.
---
## 3·Logs
Structured logs include the following keys (JSON):
| Key | Description |
|-----|-------------|
| `traceId` | Matches OpenTelemetry trace/span IDs for cross-system correlation. |
| `tenant` | Tenant identifier enforced by Authority middleware. |
| `source.vendor` | Logical source (e.g., `redhat`, `ubuntu`, `osv`, `ghsa`). |
| `upstream.upstreamId` | Vendor-provided ID (CVE, GHSA, etc.). |
| `contentHash` | `sha256:` digest of the raw document. |
| `violation.code` | Present when the guard rejects a document (`ERR_AOC_00x`). |
| `verification.window` | Present on `/aoc/verify` job logs. |
Logs are shipped to the central Loki/Elasticsearch cluster. Use the template query:
```logql
{app="concelier-web"} | json | violation_code != ""
```
to spot active AOC violations.
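
During incident triage (§5) the same stream can be pivoted on a document digest; the digest value below is a placeholder:

```logql
{app="concelier-web"} | json | contentHash = "sha256:<digest>"
```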
---
## 4·Dashboards
Primary Grafana dashboard: **“AOC Ingestion Health”** (`dashboards/aoc-ingestion.json`). Panels include:
1. **Sources overview:** table fed by `ingestion_write_total` and `ingestion_latency_seconds` (mirrors Console tiles).
2. **Violation trend:** stacked bar chart of `aoc_violation_total` per code.
3. **Signature success rate:** timeseries derived from `ingestion_signature_verified_total`.
4. **Supersedes depth:** gauge showing `advisory_revision_count` P95.
5. **Verification runs:** histogram and latency boxplot using `verify_runs_total` / `verify_duration_seconds`.
Secondary dashboards:
- **AOC Alerts (Ops view):** summarises active alerts, last verify run, and links to incident runbook.
- **Offline Mode Dashboard:** fed from Offline Kit imports; highlights snapshot age and queued verification jobs.
Update `docs/assets/dashboards/` with screenshots once the Grafana capture pipeline produces the latest renders.
---
## 5·Operational workflows
1. **During ingestion incident:**
- Check Console dashboard for offending sources.
   - Pivot to logs using the document `contentHash` (see the LogQL example in §3).
- Re-run `stella sources ingest --dry-run` with problematic payloads to validate fixes.
- After remediation, run `stella aoc verify --since 24h` and confirm exit code `0`.
2. **Scheduled verification:**
   - Configure a cron job to run `stella aoc verify --format json --export ...` (see the cron sketch after this list).
   - Ship the JSON to the `aoc-verify` bucket and ingest it into metrics using a custom exporter.
- Alert on missing exports (no file uploaded within 26h).
3. **Offline kit validation:**
   - Use the Offline Mode Dashboard to ensure snapshots contain the latest metrics.
- Run verification reports locally and attach to bundle before distribution.
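
A wrapper-script sketch for the scheduled verification flow; the script path, export location, and `aws s3 cp` shipping step are illustrative (only the `stella aoc verify` flags and the `aoc-verify` bucket come from the steps above):

```bash
#!/usr/bin/env bash
# aoc-verify-cron.sh: run scheduled verification and ship the JSON export.
set -euo pipefail
out="/var/lib/aoc/verify-$(date +%F).json"       # illustrative export path
stella aoc verify --format json --export "$out"
aws s3 cp "$out" s3://aoc-verify/                # bucket named in step 2 above

# Illustrative crontab entry (daily at 02:15):
# 15 2 * * * /usr/local/bin/aoc-verify-cron.sh
```

Pair the job with the missing-export alert above so a silent cron failure still pages within 26 h.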
---
## 6·Offline considerations
- Metrics exporters bundled with Offline Kit write to local Prometheus snapshots; sync them with central Grafana once connectivity is restored.
- CLI verification reports should be hashed (`sha256sum`) and archived for audit trails (see the sketch after this list).
- Dashboards include offline data sources (`prometheus-offline`) switchable via dropdown.
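
A hashing/archiving sketch for the audit trail; file names are illustrative:

```bash
# Hash the verification report, then archive the report and its digest together.
sha256sum aoc-verify-2025-10-26.json > aoc-verify-2025-10-26.json.sha256
tar czf aoc-verify-2025-10-26.tar.gz aoc-verify-2025-10-26.json aoc-verify-2025-10-26.json.sha256
```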
---
## 7·References
- [Aggregation-Only Contract reference](../ingestion/aggregation-only-contract.md)
- [Architecture overview](../architecture/overview.md)
- [Console AOC dashboard](../ui/console.md)
- [CLI AOC commands](../cli/cli-reference.md)
- [Concelier architecture](../ARCHITECTURE_CONCELIER.md)
- [Excititor architecture](../ARCHITECTURE_EXCITITOR.md)
---
## 8·Compliance checklist
- [ ] Metrics documented with label sets and alert guidance.
- [ ] Tracing span taxonomy aligned with Concelier/Excititor implementation.
- [ ] Log schema matches structured logging contracts (traceId, tenant, source, contentHash).
- [ ] Grafana dashboard references verified and screenshots scheduled.
- [ ] Offline/air-gap workflow captured.
- [ ] Cross-links to AOC reference, console, and CLI docs included.
- [ ] Observability Guild sign-off scheduled (OWNER: @obs-guild, due 2025-10-28).
---
*Last updated: 2025-10-26 (Sprint 19).*


@@ -0,0 +1,166 @@
# Policy Engine Observability
> **Audience:** Observability Guild, SRE/Platform operators, Policy Guild.
> **Scope:** Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
> **Prerequisites:** Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config).
---
## 1·Instrumentation Overview
- **Telemetry stack:** OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
- **Namespace conventions:** `policy.*` for metrics/traces/log categories; labels use `tenant`, `policy`, `mode`, `runId`.
- **Sampling:** Default 10% trace sampling, 1% rule-hit log sampling; incident mode overrides both to 100% (see §7). A configuration sketch follows this list.
- **Correlation IDs:** Every API request gets `traceId` + `requestId`. CLI/UI display IDs to streamline support.
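
A minimal configuration sketch in YAML form; only `observability:enabled` is confirmed by the prerequisites above, and the sampling and exporter keys are hypothetical names mirroring the defaults in this list:

```yaml
observability:
  enabled: true              # confirmed key (see prerequisites)
  traceSampling: 0.10        # hypothetical key: 10% default trace sampling
  ruleHitLogSampling: 0.01   # hypothetical key: 1% default rule-hit log sampling
  otlp:
    endpoint: "http://otel-collector:4317"   # hypothetical OTLP exporter target
```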
---
## 2·Metrics
### 2.1 Run Pipeline
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target: ≤ 5 min incremental, ≤ 30 min full. |
| `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated each enqueue/dequeue). |
| `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. |
| `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. |
| `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting run. |
### 2.2 Evaluator Insights
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Increment per rule match (sampled). |
| `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. |
| `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. |
| `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. |
| `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero indicates optimistic concurrency retries. |
### 2.3 API Surface
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. |
| `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget ≤250ms for GETs, ≤1s for POSTs. |
| `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (`429`). |
### 2.4 Queue & Change Streams
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. |
| `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. |
| `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of oldest unprocessed change event. |
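
PromQL sketches for two headline signals, assuming standard Prometheus histogram conventions (`_bucket`/`_count` series):

```promql
# P95 incremental run duration per tenant (SLA threshold is 300 s, per §6)
histogram_quantile(0.95,
  sum by (le, tenant) (rate(policy_run_seconds_bucket{mode="incremental"}[5m])))

# Failure ratio per policy over the last hour
sum by (tenant, policy) (increase(policy_run_failures_total[1h]))
/
sum by (tenant, policy) (increase(policy_run_seconds_count[1h]))
```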
---
## 3·Logs
- **Format:** JSON (`Serilog`). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`.
- **Log categories:**
- `policy.run` (queue lifecycle, run begin/end, stats)
- `policy.evaluate` (batch execution summaries; rule-hit sampling)
- `policy.materialize` (Mongo operations, conflicts, retries)
- `policy.simulate` (diff results, CLI invocation metadata)
- `policy.lifecycle` (submit/review/approve events)
- **Sampling:** Rule-hit logs sample at 1% by default; toggled to 100% in incident mode or when the `--trace` flag is used in the CLI.
- **PII:** No user secrets recorded; user identities referenced as `user:<id>` or `group:<id>` only.
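
A LogQL sketch for surfacing failed runs; the stream selector is an assumption to adjust to your deployment labels, and LogQL's `json` parser flattens `error.code` to `error_code`:

```logql
{app="policy-engine"} | json | error_code =~ "ERR_POL_.+"
```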
---
## 4·Traces
- Spans are emitted via OpenTelemetry instrumentation.
- **Primary spans:**
- `policy.api` wraps HTTP request, records `endpoint`, `status`, `scope`.
- `policy.select` change stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`).
- `policy.evaluate` evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`).
- `policy.materialize` Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`).
- `policy.simulate` simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`).
- Trace context propagates to the CLI via the `traceparent` response header; the UI surfaces it in the run detail view (see the capture sketch after this list).
- Incident mode forces span sampling to 100% and extends retention via Collector config override.
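
A quick capture sketch for support tickets; the host and endpoint path are illustrative, and `$TOKEN` must carry an appropriate read scope:

```bash
# Print only the traceparent header from a policy API response.
curl -s -D - -o /dev/null -H "Authorization: Bearer $TOKEN" \
  "https://policy-engine.internal/api/policy/runs" | grep -i '^traceparent:'
```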
---
## 5·Dashboards
### 5.1 Policy Runs Overview
Widgets:
- Run duration histogram (per mode/tenant).
- Queue depth + backlog age line charts.
- Failure rate stacked by error code.
- Incremental backlog heatmap (policy × age).
- Active vs scheduled runs table.
### 5.2 Rule Impact & VEX
- Top N rules by firings (bar chart).
- VEX overrides by vendor/justification (stacked chart).
- Suppression usage (pie + table with justifications).
- Quieted findings trend (line).
### 5.3 Simulation & Approval Health
- Simulation diff histogram (added vs removed).
- Pending approvals by age (table with SLA colour coding).
- Compliance checklist status (lint, determinism CI, simulation evidence).
> Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (`../assets/policy-observability/*.png`).
---
## 6·Alerting
| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| **PolicyRunSlaBreach** | `policy_run_seconds{mode="incremental"}` P95 > 300s for 3 windows | Check queue depth, upstream services, scale worker pool. |
| **PolicyQueueStuck** | `policy_delta_backlog_age_seconds` > 600 | Investigate change stream connectivity. |
| **DeterminismMismatch** | Run status `failed` with `ERR_POL_004` OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
| **SimulationDrift** | CLI/CI simulation exit `20` (blocking diff) over threshold | Review policy changes before approval. |
| **VexOverrideSpike** | `policy_vex_overrides_total` > configured baseline (per vendor) | Verify upstream VEX feed; ensure justification codes expected. |
| **SuppressionSurge** | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppress rules; check approvals. |
Alerts integrate with Notifier channels (`policy.alerts`) and Ops on-call rotations.
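
A Prometheus rule sketch for **PolicyRunSlaBreach**; the group name and severity label are illustrative, and the "3 windows" condition is approximated with `for: 15m`:

```yaml
groups:
  - name: policy-engine            # illustrative group name
    rules:
      - alert: PolicyRunSlaBreach
        expr: |
          histogram_quantile(0.95,
            sum by (le, tenant) (rate(policy_run_seconds_bucket{mode="incremental"}[5m]))) > 300
        for: 15m
        labels:
          severity: warn
        annotations:
          summary: "Incremental policy runs above the 5 min P95 SLA for {{ $labels.tenant }}"
```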
---
## 7·Incident Mode & Forensics
- Toggle via `POST /api/policy/incidents/activate` (requires the `policy:operate` scope); see the activation sketch after this list.
- Effects:
- Trace sampling → 100%.
- Rule-hit log sampling → 100%.
  - Retention window extended to 30 days for the incident duration.
- `policy.incident.activated` event emitted (Console + Notifier banners).
- Post-incident tasks:
- `stella policy run replay` for affected runs; attach bundles to incident record.
- Restore sampling defaults with `.../deactivate`.
- Update incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings.
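
An activation/deactivation sketch; the host name and token handling are illustrative, the activate endpoint and scope come from the list above, and the deactivate route is assumed to mirror it:

```bash
# Activate incident mode (token must carry the policy:operate scope).
curl -s -X POST -H "Authorization: Bearer $TOKEN" \
  "https://policy-engine.internal/api/policy/incidents/activate"

# Restore sampling defaults once the incident is closed.
curl -s -X POST -H "Authorization: Bearer $TOKEN" \
  "https://policy-engine.internal/api/policy/incidents/deactivate"
```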
---
## 8·Integration Points
- **Authority:** Exposes metric `policy_scope_denied_total` for failed authorisation; correlate with `policy_api_requests_total`.
- **Concelier/Excititor:** Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
- **Scheduler:** Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
- **Offline Kit:** CLI exports logs + metrics snapshots (`stella offline bundle metrics`) for air-gapped audits.
---
## 9·Compliance Checklist
- [ ] **Metrics registered:** All metrics listed above exported and documented in Grafana dashboards.
- [ ] **Alert policies configured:** Ops or Observability Guild created alerts matching table in §6.
- [ ] **Sampling overrides tested:** Incident mode toggles verified in staging; retention roll-back rehearsed.
- [ ] **Trace propagation validated:** CLI/UI display trace IDs and allow copy for support.
- [ ] **Log scrubbing enforced:** Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
- [ ] **Offline capture rehearsed:** Metrics/log snapshot commands executed in sealed environment.
- [ ] **Docs cross-links:** Links to architecture, runs, lifecycle, CLI, API docs verified.
---
*Last updated: 2025-10-26 (Sprint 20).*