# AOC Observability Guide

> **Audience:** Observability Guild, Concelier/Excititor SREs, platform operators.  
> **Scope:** Metrics, traces, logs, dashboards, and runbooks introduced as part of the Aggregation-Only Contract (AOC) rollout (Sprint 19).

This guide captures the canonical signals emitted by Concelier and Excititor once AOC guards are active. It explains how to consume the metrics in dashboards, correlate traces/logs for incident triage, and operate in offline environments. Pair this guide with the [AOC reference](../ingestion/aggregation-only-contract.md) and [architecture overview](../architecture/overview.md).

---

## 1 · Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ingestion_write_total` | Counter | `source`, `tenant`, `result` (`ok`, `reject`, `noop`) | Counts write attempts to `advisory_raw`/`vex_raw`. Rejects correspond to guard failures. |
| `ingestion_latency_seconds` | Histogram | `source`, `tenant`, `phase` (`fetch`, `transform`, `write`) | Measures the runtime of each ingestion phase. Alert on the P95, for example `histogram_quantile(0.95, rate(ingestion_latency_seconds_bucket[5m]))`. |
| `aoc_violation_total` | Counter | `source`, `tenant`, `code` (`ERR_AOC_00x`) | Total guard violations bucketed by error code. Drives dashboard pills and alert thresholds. |
| `ingestion_signature_verified_total` | Counter | `source`, `tenant`, `result` (`ok`, `fail`, `skipped`) | Tracks signature/checksum verification outcomes. |
| `advisory_revision_count` | Gauge | `source`, `tenant` | Supersedes depth for raw documents; spikes indicate noisy upstream feeds. |
| `verify_runs_total` | Counter | `tenant`, `initiator` (`ui`, `cli`, `api`, `scheduled`) | Number of `stella aoc verify` or `/aoc/verify` runs executed. |
| `verify_duration_seconds` | Histogram | `tenant`, `initiator` | Runtime of verification jobs; use P95 to detect regressions. |
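
The P95 guidance for the two histograms relies on estimating a quantile from cumulative bucket counts. The sketch below illustrates that estimate with linear interpolation inside the matching bucket. It is a simplified illustration, not Prometheus's actual implementation, and the function name and bucket layout are assumptions made for this example. Calling it with `q=0.95` on the cumulative `le` buckets of `ingestion_latency_seconds` would approximate the value an alert compares against.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with an infinite bound,
    mirroring the `le` buckets of a classic Prometheus histogram.
    The first bucket is assumed to start at 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Open-ended or empty bucket: fall back to the last finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```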
### 1.1 Alerts

- **Violation spike:** Alert when `increase(aoc_violation_total[15m]) > 0` for critical sources. Page SRE if `code="ERR_AOC_005"` (signature failure) or `code="ERR_AOC_001"` persists for more than 30 min.
- **Stale ingestion:** Alert when `max_over_time((ingestion_latency_seconds_sum / ingestion_latency_seconds_count)[30m:])` exceeds 30 s or if `ingestion_write_total` has no growth for > 60 min.
- **Signature drop:** Warn when `rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0`.
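
The first two alert conditions above can be prototyped offline against exported samples. The following is a minimal sketch under stated assumptions: the function names, alert labels, and input shapes (ordered lists of raw counter samples and of mean latencies in seconds) are hypothetical, and `counter_increase` only approximates PromQL's `increase()`, which additionally extrapolates to the window edges.

```python
def counter_increase(samples):
    """Sum the growth across ordered counter samples, tolerating resets."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count the new value.
        increase += cur - prev if cur >= prev else cur
    return increase


def evaluate_alerts(violations_15m, writes_60m, mean_latency_30m):
    """Return the names of the alerts that would fire for the given windows.

    violations_15m   -- `aoc_violation_total` samples from the last 15 min
    writes_60m       -- `ingestion_write_total` samples from the last 60 min
    mean_latency_30m -- mean ingestion latency values (seconds), last 30 min
    """
    alerts = []
    if counter_increase(violations_15m) > 0:
        alerts.append("violation_spike")
    too_slow = max(mean_latency_30m, default=0.0) > 30.0
    no_growth = counter_increase(writes_60m) == 0
    if too_slow or no_growth:
        alerts.append("stale_ingestion")
    return alerts
```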

---

## 2 · Traces

### 2.1 Span taxonomy

| Span name | Parent | Key attributes |
|-----------|--------|----------------|
| `ingest.fetch` | job root span | `source`, `tenant`, `uri`, `contentHash` |
| `ingest.transform` | `ingest.fetch` | `documentType` (`csaf`, `osv`, `vex`), `payloadBytes` |
| `ingest.write` | `ingest.transform` | `collection` (`advisory_raw`, `vex_raw`), `result` (`ok`, `reject`) |
| `aoc.guard` | `ingest.write` | `code` (on violation), `violationCount`, `supersedes` |
| `verify.run` | verification job root | `tenant`, `window.from`, `window.to`, `sources`, `violations` |

### 2.2 Trace usage

- Correlate UI dashboard entries with traces via `traceId` surfaced in violation drawers (`docs/ui/console.md`).
- Use `aoc.guard` spans to inspect guard payload snapshots. Sensitive fields are redacted automatically; raw JSON lives in secure logs only.
- For scheduled verification, filter traces by `initiator="scheduled"` to compare runtimes pre/post change.

---

## 3 · Logs

Structured logs include the following keys (JSON):

| Key | Description |
|-----|-------------|
| `traceId` | Matches OpenTelemetry trace/span IDs for cross-system correlation. |
| `tenant` | Tenant identifier enforced by Authority middleware. |
| `source.vendor` | Logical source (e.g., `redhat`, `ubuntu`, `osv`, `ghsa`). |
| `upstream.upstreamId` | Vendor-provided ID (CVE, GHSA, etc.). |
| `contentHash` | `sha256:` digest of the raw document. |
| `violation.code` | Present when guard rejects `ERR_AOC_00x`. |
| `verification.window` | Present on `/aoc/verify` job logs. |

Logs are shipped to the central Loki/Elasticsearch cluster. To spot active AOC violations in Loki, use the template query below; the `json` stage exposes the `violation.code` key as the label `violation_code`.

```logql
{app="concelier-web"} | json | violation_code != ""
```
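
When the central cluster is unreachable, the same filter can be applied to a local log file. The sketch below is one way to do that, assuming one JSON object per line. Whether `violation.code` is emitted as a flat dotted key or as a nested object is not specified here, so the hypothetical helper `violation_code` accepts both.

```python
import json


def violation_code(record):
    """Return the violation code from a flat or nested log record, or ''."""
    flat = record.get("violation.code")
    if flat:
        return flat
    nested = record.get("violation")
    if isinstance(nested, dict):
        return nested.get("code") or ""
    return ""


def active_violations(lines):
    """Yield parsed log records that carry a non-empty violation code."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and violation_code(record):
            yield record
```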

---

## 4 · Dashboards

Primary Grafana dashboard: **“AOC Ingestion Health”** (`dashboards/aoc-ingestion.json`). Panels include:

1. **Sources overview:** table fed by `ingestion_write_total` and `ingestion_latency_seconds` (mirrors Console tiles).
2. **Violation trend:** stacked bar chart of `aoc_violation_total` per code.
3. **Signature success rate:** timeseries derived from `ingestion_signature_verified_total`.
4. **Supersedes depth:** gauge showing `advisory_revision_count` P95.
5. **Verification runs:** histogram and latency boxplot using `verify_runs_total` / `verify_duration_seconds`.

Secondary dashboards:

- **AOC Alerts (Ops view):** summarises active alerts, last verify run, and links to incident runbook.
- **Offline Mode Dashboard:** fed from Offline Kit imports; highlights snapshot age and queued verification jobs.

Update `docs/assets/dashboards/` with screenshots once the Grafana capture pipeline produces the latest renders.

---

## 5 · Operational workflows

1. **During ingestion incident:**
   - Check the Console dashboard for offending sources.
   - Pivot to logs using the document `contentHash`.
   - Re-run `stella sources ingest --dry-run` with problematic payloads to validate fixes.
   - After remediation, run `stella aoc verify --since 24h` and confirm exit code `0`.
2. **Scheduled verification:**
   - Configure a cron job to run `stella aoc verify --format json --export ...`.
   - Ship the JSON to the `aoc-verify` bucket and ingest it into metrics using a custom exporter.
   - Alert on missing exports (no file uploaded within 26 h).
3. **Offline kit validation:**
   - Use the Offline Dashboard to ensure snapshots contain the latest metrics.
   - Run verification reports locally and attach them to the bundle before distribution.
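
The missing-export alert in the scheduled verification workflow reduces to a freshness check on the most recent upload. A minimal sketch follows, assuming the caller has already obtained the timestamp of the newest file in the `aoc-verify` bucket as a timezone-aware `datetime` (or `None` when the bucket is empty). The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the workflow above: no file uploaded within 26 h.
MAX_EXPORT_AGE = timedelta(hours=26)


def export_is_missing(last_upload, now=None, max_age=MAX_EXPORT_AGE):
    """Return True when the newest export is absent or older than max_age."""
    now = now or datetime.now(timezone.utc)
    if last_upload is None:
        return True  # nothing has ever been uploaded
    return now - last_upload > max_age
```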

---

## 6 · Offline considerations

- Metrics exporters bundled with Offline Kit write to local Prometheus snapshots; sync them with central Grafana once connectivity is restored.
- CLI verification reports should be hashed (`sha256sum`) and archived for audit trails.
- Dashboards include offline data sources (`prometheus-offline`) switchable via dropdown.
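
Where `sha256sum` is not available on an air-gapped host, the report digest can be produced with Python's standard library instead. The sketch below is an illustrative helper, not part of the Offline Kit, and its name is an assumption. It returns a line in the `sha256sum` output layout (hex digest, two spaces, file name) so that `sha256sum -c` can verify it later.

```python
import hashlib
from pathlib import Path


def sha256sum_line(path, chunk_size=1 << 20):
    """Hash a file in chunks and format the result like `sha256sum` output."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large reports are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"{digest.hexdigest()}  {Path(path).name}"
```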

---

## 7 · References

- [Aggregation-Only Contract reference](../ingestion/aggregation-only-contract.md)
- [Architecture overview](../architecture/overview.md)
- [Console AOC dashboard](../ui/console.md)
- [CLI AOC commands](../cli/cli-reference.md)
- [Concelier architecture](../ARCHITECTURE_CONCELIER.md)
- [Excititor architecture](../ARCHITECTURE_EXCITITOR.md)
- [Scheduler Worker observability guide](../ops/scheduler-worker-operations.md)

---

## 8 · Compliance checklist

- [ ] Metrics documented with label sets and alert guidance.
- [ ] Tracing span taxonomy aligned with Concelier/Excititor implementation.
- [ ] Log schema matches structured logging contracts (traceId, tenant, source, contentHash).
- [ ] Grafana dashboard references verified and screenshots scheduled.
- [ ] Offline/air-gap workflow captured.
- [ ] Cross-links to AOC reference, console, and CLI docs included.
- [ ] Observability Guild sign-off scheduled (OWNER: @obs-guild, due 2025-10-28).

---

*Last updated: 2025-10-26 (Sprint 19).*