AOC Observability Guide
Audience: Observability Guild, Concelier/Excititor SREs, platform operators.
Scope: Metrics, traces, logs, dashboards, and runbooks introduced as part of the Aggregation-Only Contract (AOC) rollout (Sprint 19).
This guide captures the canonical signals emitted by Concelier and Excititor once AOC guards are active. It explains how to consume the metrics in dashboards, correlate traces/logs for incident triage, and operate in offline environments. Pair this guide with the AOC reference and architecture overview.
1 · Metrics
| Metric | Type | Labels | Description | 
|---|---|---|---|
| ingestion_write_total | Counter | source,tenant,result(ok,reject,noop) | Counts write attempts to advisory_raw/vex_raw. Rejects correspond to guard failures. | 
| ingestion_latency_seconds | Histogram | source,tenant,phase(fetch,transform,write) | Measures end-to-end runtime for ingestion stages. Use quantile=0.95 for alerting. | 
| aoc_violation_total | Counter | source,tenant,code(ERR_AOC_00x) | Total guard violations bucketed by error code. Drives dashboard pills and alert thresholds. | 
| ingestion_signature_verified_total | Counter | source,tenant,result(ok,fail,skipped) | Tracks signature/checksum verification outcomes. | 
| advisory_revision_count | Gauge | source,tenant | Supersedes depth for raw documents; spikes indicate noisy upstream feeds. | 
| verify_runs_total | Counter | tenant,initiator(ui,cli,api,scheduled) | Number of stella aoc verify or /aoc/verify runs executed. | 
| verify_duration_seconds | Histogram | tenant,initiator | Runtime of verification jobs; use P95 to detect regressions. | 
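To make the quantile=0.95 and P95 guidance concrete, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is broadly how Prometheus-style histograms are read. The bucket boundaries and counts in the usage note are hypothetical; the real bucket layout of ingestion_latency_seconds and verify_duration_seconds is not stated in this guide.

```python
import math
from typing import List, Tuple

# (upper_bound_seconds, cumulative_count); the last bucket must be +Inf.
Bucket = Tuple[float, float]


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate quantile q from cumulative buckets sorted by upper bound."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assume latencies start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile lies beyond the last finite bucket, so the
                # best available estimate is that bucket's upper bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With hypothetical buckets `[(1, 10), (5, 60), (10, 90), (30, 100), (math.inf, 100)]`, the 0.95 quantile falls halfway through the 10 s to 30 s bucket, so the estimate is about 20 s. The estimate is only as precise as the bucket boundaries allow.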
1.1 Alerts
- Violation spike: Alert when increase(aoc_violation_total[15m]) > 0 for critical sources. Page SRE if code="ERR_AOC_005" (signature failure) or ERR_AOC_001 persists > 30 min.
- Stale ingestion: Alert when max_over_time(ingestion_latency_seconds_sum / ingestion_latency_seconds_count)[30m] exceeds 30 s or if ingestion_write_total has no growth for > 60 min.
- Signature drop: Warn when rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0.
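The alert conditions above all reduce to "how much did a counter grow inside a window". The following is a minimal sketch of that logic over raw samples, assuming samples are sorted by time; the function names are hypothetical, and unlike the PromQL increase() function it does not extrapolate to the window edges.

```python
from typing import List, Tuple

# (unix_timestamp_seconds, counter_value), sorted by timestamp.
Sample = Tuple[float, float]


def counter_increase(samples: List[Sample], start: float, end: float) -> float:
    """Sum the growth between consecutive samples inside [start, end].

    A drop in value is treated as a counter reset, so the new value is
    counted as growth since the reset.
    """
    in_window = [s for s in samples if start <= s[0] <= end]
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            increase += cur
    return increase


def violation_spike(samples: List[Sample], now: float, window_s: float = 15 * 60) -> bool:
    """True when aoc_violation_total grew at all in the last 15 minutes."""
    return counter_increase(samples, now - window_s, now) > 0


def ingestion_stalled(samples: List[Sample], now: float, window_s: float = 60 * 60) -> bool:
    """True when ingestion_write_total showed no growth in the last 60 minutes."""
    return counter_increase(samples, now - window_s, now) == 0
```

Note that `ingestion_stalled` also returns True when the window holds fewer than two samples, which treats a missing series the same as a flat one.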
2 · Traces
2.1 Span taxonomy
| Span name | Parent | Key attributes | 
|---|---|---|
| ingest.fetch | job root span | source,tenant,uri,contentHash | 
| ingest.transform | ingest.fetch | documentType(csaf,osv,vex),payloadBytes | 
| ingest.write | ingest.transform | collection(advisory_raw,vex_raw),result(ok,reject) | 
| aoc.guard | ingest.write | code(on violation),violationCount,supersedes | 
| verify.run | verification job root | tenant,window.from,window.to,sources,violations | 
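The parent column of the table defines a small tree, which can be handy to encode when validating exported traces. The sketch below copies the parent links from the table and resolves the chain for a span name. The job root and verification job root spans are not named in the table, so they are represented here as None; that representation is an assumption for illustration only.

```python
from typing import Dict, List, Optional

# Parent links copied from the span taxonomy table.
# None stands in for the unnamed job root / verification job root span.
SPAN_PARENTS: Dict[str, Optional[str]] = {
    "ingest.fetch": None,
    "ingest.transform": "ingest.fetch",
    "ingest.write": "ingest.transform",
    "aoc.guard": "ingest.write",
    "verify.run": None,
}


def span_path(name: str) -> List[str]:
    """Return the chain of span names from the outermost named span to `name`."""
    if name not in SPAN_PARENTS:
        raise KeyError(f"unknown span name: {name}")
    path: List[str] = []
    current: Optional[str] = name
    while current is not None:
        path.append(current)
        current = SPAN_PARENTS[current]
    path.reverse()
    return path
```

For example, `span_path("aoc.guard")` walks up through ingest.write and ingest.transform to ingest.fetch, matching the nesting an operator should expect to see in a trace viewer.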
2.2 Trace usage
- Correlate UI dashboard entries with traces via traceId surfaced in violation drawers (docs/ui/console.md).
- Use aoc.guard spans to inspect guard payload snapshots. Sensitive fields are redacted automatically; raw JSON lives in secure logs only.
- For scheduled verification, filter traces by initiator="scheduled" to compare runtimes pre/post change.
3 · Logs
Structured logs include the following keys (JSON):
| Key | Description | 
|---|---|
| traceId | Matches OpenTelemetry trace/span IDs for cross-system correlation. | 
| tenant | Tenant identifier enforced by Authority middleware. | 
| source.vendor | Logical source (e.g., redhat,ubuntu,osv,ghsa). | 
| upstream.upstreamId | Vendor-provided ID (CVE, GHSA, etc.). | 
| contentHash | sha256: digest of the raw document. | 
| violation.code | Present when a guard rejects the document; carries the ERR_AOC_00x code. | 
| verification.window | Present on /aoc/verify job logs. | 
Logs are shipped to the central Loki/Elasticsearch cluster. Use the template query:
{app="concelier-web"} | json | violation_code != ""
to spot active AOC violations.
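The template query refers to violation_code while the schema lists violation.code, because the json parser stage exposes nested or dotted keys with underscores. For offline triage without Loki, the same filter can be approximated over raw log lines. This is a minimal sketch: whether the services emit nested objects or dotted keys is not stated here, so it accepts both, and the helper names are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested JSON objects, joining key segments with underscores."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        name = name.replace(".", "_")  # dotted keys become underscore keys
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def active_violations(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield flattened records whose violation_code is present and non-empty."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        flat = flatten(record)
        if flat.get("violation_code"):
            yield flat
```

Feeding it the lines of an exported log file yields only the records a guard rejected, each with flattened keys such as tenant, contentHash, and violation_code.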
4 · Dashboards
Primary Grafana dashboard: “AOC Ingestion Health” (dashboards/aoc-ingestion.json). Panels include:
- Sources overview: table fed by ingestion_write_total and ingestion_latency_seconds (mirrors Console tiles).
- Violation trend: stacked bar chart of aoc_violation_total per code.
- Signature success rate: timeseries derived from ingestion_signature_verified_total.
- Supersedes depth: gauge showing advisory_revision_count P95.
- Verification runs: histogram and latency boxplot using verify_runs_total/verify_duration_seconds.
Secondary dashboards:
- AOC Alerts (Ops view): summarises active alerts, last verify run, and links to incident runbook.
- Offline Mode Dashboard: fed from Offline Kit imports; highlights snapshot age and queued verification jobs.
Update docs/assets/dashboards/ with screenshots when Grafana capture pipeline produces the latest renders.
5 · Operational workflows
- During ingestion incident:
  - Check Console dashboard for offending sources.
  - Pivot to logs using document contentHash.
  - Re-run stella sources ingest --dry-run with problematic payloads to validate fixes.
  - After remediation, run stella aoc verify --since 24h and confirm exit code 0.
- Scheduled verification:
  - Configure cron job to run stella aoc verify --format json --export ....
  - Ship JSON to aoc-verify bucket and ingest into metrics using custom exporter.
  - Alert on missing exports (no file uploaded within 26 h).
- Offline kit validation:
  - Use Offline Dashboard to ensure snapshots contain latest metrics.
  - Run verification reports locally and attach to bundle before distribution.
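The missing-export alert in the scheduled verification workflow is a simple freshness check. A minimal sketch follows, assuming the timestamp of the newest object in the aoc-verify bucket has already been fetched by whatever storage client is in use; the function name and the way the timestamp is obtained are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold taken from the workflow above: no file uploaded within 26 h.
MAX_EXPORT_AGE = timedelta(hours=26)


def export_missing(
    last_upload: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_EXPORT_AGE,
) -> bool:
    """Return True when the newest export is absent or older than max_age.

    Both datetimes should be timezone-aware so the subtraction is well defined.
    """
    if last_upload is None:
        return True  # nothing has ever been uploaded
    return now - last_upload > max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(export_missing(now - timedelta(hours=27), now))
```

A 26 h threshold leaves two hours of slack over a daily schedule, so a slow run does not page anyone while a skipped run still does.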
 
6 · Offline considerations
- Metrics exporters bundled with Offline Kit write to local Prometheus snapshots; sync them with central Grafana once connectivity is restored.
- CLI verification reports should be hashed (sha256sum) and archived for audit trails.
- Dashboards include offline data sources (prometheus-offline) switchable via dropdown.
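Where sha256sum is unavailable on an air-gapped host, report hashing can be done with the Python standard library. The sketch below streams a file through SHA-256 and formats a line similar to sha256sum output; the function names are hypothetical, and using only the file name in the manifest line is a choice made for this example.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large reports are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_line(path: Path) -> str:
    """Format '<hex digest>  <file name>' for an audit manifest."""
    return f"{sha256_file(path)}  {path.name}"
```

Archiving the manifest line next to each report lets an auditor re-run the hash later and compare digests.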
7 · References
- Aggregation-Only Contract reference
- Architecture overview
- Console AOC dashboard
- CLI AOC commands
- Concelier architecture
- Excititor architecture
- Scheduler Worker observability guide
8 · Compliance checklist
- Metrics documented with label sets and alert guidance.
- Tracing span taxonomy aligned with Concelier/Excititor implementation.
- Log schema matches structured logging contracts (traceId, tenant, source, contentHash).
- Grafana dashboard references verified and screenshots scheduled.
- Offline/air-gap workflow captured.
- Cross-links to AOC reference, console, and CLI docs included.
- Observability Guild sign-off scheduled (OWNER: @obs-guild, due 2025-10-28).
Last updated: 2025-10-26 (Sprint 19).