# AOC Observability Guide

> **Audience:** Observability Guild, Concelier/Excititor SREs, platform operators.
>
> **Scope:** Metrics, traces, logs, dashboards, and runbooks introduced as part of the Aggregation-Only Contract (AOC) rollout (Sprint 19).

This guide captures the canonical signals emitted by Concelier and Excititor once AOC guards are active. It explains how to consume the metrics in dashboards, correlate traces/logs for incident triage, and operate in offline environments. Pair this guide with the [AOC reference](../ingestion/aggregation-only-contract.md) and [architecture overview](../modules/platform/architecture-overview.md).

---

## 1 · Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ingestion_write_total` | Counter | `source`, `tenant`, `result` (`ok`, `reject`, `noop`) | Counts write attempts to `advisory_raw`/`vex_raw`. Rejects correspond to guard failures. |
| `ingestion_latency_seconds` | Histogram | `source`, `tenant`, `phase` (`fetch`, `transform`, `write`) | Measures end-to-end runtime for ingestion stages. Use `quantile=0.95` for alerting. |
| `aoc_violation_total` | Counter | `source`, `tenant`, `code` (`ERR_AOC_00x`) | Total guard violations bucketed by error code. Drives dashboard pills and alert thresholds. |
| `ingestion_signature_verified_total` | Counter | `source`, `tenant`, `result` (`ok`, `fail`, `skipped`) | Tracks signature/checksum verification outcomes. |
| `advisory_revision_count` | Gauge | `source`, `tenant` | Supersedes depth for raw documents; spikes indicate noisy upstream feeds. |
| `verify_runs_total` | Counter | `tenant`, `initiator` (`ui`, `cli`, `api`, `scheduled`) | How many `stella aoc verify` or `/aoc/verify` runs executed. |
| `verify_duration_seconds` | Histogram | `tenant`, `initiator` | Runtime of verification jobs; use P95 to detect regressions. |

### 1.1 Alerts

- **Violation spike:** Alert when `increase(aoc_violation_total[15m]) > 0` for critical sources.
  Page SRE if `code="ERR_AOC_005"` (signature failure) or `ERR_AOC_001` persists > 30 min.
- **Stale ingestion:** Alert when `max_over_time((ingestion_latency_seconds_sum / ingestion_latency_seconds_count)[30m:])` exceeds 30 s or if `ingestion_write_total` has no growth for > 60 min.
- **Signature drop:** Warn when `rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0`.

### 1.2 · `/obs/excititor/health`

`GET /obs/excititor/health` (scope `vex.admin`) returns a compact snapshot for Grafana tiles and Console widgets:

- `ingest`: overall status, worst lag (seconds), and the top connectors (status, lagSeconds, failure count, last success).
- `link`: freshness of consensus/linkset processing plus document counts and the number currently carrying conflicts.
- `signature`: recent coverage window (evaluated, with signatures, verified, failures, unsigned, coverage ratio).
- `conflicts`: rolling totals grouped by status plus per-bucket trend data for charts.

```json
{
  "generatedAt": "2025-11-08T11:00:00Z",
  "ingest": {
    "status": "healthy",
    "connectors": [
      { "connectorId": "excititor:redhat", "lagSeconds": 45.3 }
    ]
  },
  "link": {
    "status": "warning",
    "lastConsensusAt": "2025-11-08T10:57:03Z"
  },
  "signature": {
    "status": "critical",
    "documentsEvaluated": 120,
    "verified": 30,
    "failures": 2
  },
  "conflicts": {
    "status": "warning",
    "conflictStatements": 325,
    "trend": [
      { "bucketStart": "2025-11-08T10:00:00Z", "conflicts": 130 }
    ]
  }
}
```

| Setting | Default | Purpose |
|---------|---------|---------|
| `Excititor:Observability:IngestWarningThreshold` | `06:00:00` | Connector lag before `ingest.status` becomes `warning`. |
| `Excititor:Observability:IngestCriticalThreshold` | `24:00:00` | Connector lag before `ingest.status` becomes `critical`. |
| `Excititor:Observability:LinkWarningThreshold` | `00:15:00` | Maximum acceptable delay between consensus recalculations. |
| `Excititor:Observability:LinkCriticalThreshold` | `01:00:00` | Delay that marks link status as `critical`. |
| `Excititor:Observability:SignatureWindow` | `12:00:00` | Lookback window for signature coverage. |
| `Excititor:Observability:SignatureHealthyCoverage` | `0.8` | Coverage ratio that still counts as healthy. |
| `Excititor:Observability:SignatureWarningCoverage` | `0.5` | Coverage ratio that flips the status to `warning`. |
| `Excititor:Observability:ConflictTrendWindow` | `24:00:00` | Rolling window used for conflict aggregation. |
| `Excititor:Observability:ConflictTrendBucketMinutes` | `60` | Resolution of conflict `trend` buckets. |
| `Excititor:Observability:ConflictWarningRatio` | `0.15` | Fraction of consensus docs with conflicts that triggers `warning`. |
| `Excititor:Observability:ConflictCriticalRatio` | `0.3` | Ratio that marks `conflicts.status` as `critical`. |
| `Excititor:Observability:MaxConnectorDetails` | `50` | Number of connector entries returned (keeps payloads small). |

### 1.3 · Regression & DI hygiene

1. **Keep storage/integration tests green when telemetry touches persistence.**
   - `./tools/mongodb/local-mongo.sh start` downloads MongoDB 6.0.16 (if needed), launches `rs0`, and prints `export EXCITITOR_TEST_MONGO_URI=mongodb://.../excititor-tests`. Copy that export into your shell.
   - `./tools/mongodb/local-mongo.sh restart` is a shortcut for “stop if running, then start” using the same dataset; use it after tweaking config or when tests need a bounce without wiping fixtures.
   - `./tools/mongodb/local-mongo.sh clean` stops the instance (if running) and deletes the managed data/log directories so storage tests begin from a pristine catalog.
   - Run `dotnet test src/Excititor/__Tests/StellaOps.Excititor.Storage.Mongo.Tests/StellaOps.Excititor.Storage.Mongo.Tests.csproj -nologo -v minimal` (add `--filter` if you only touched specific suites). These tests exercise the same write paths that feed the dashboards, so regressions show up immediately.
   - Run `./tools/mongodb/local-mongo.sh stop` when finished so CI/dev hosts stay clean; `status|logs|shell` are available for troubleshooting.
2. **Declare optional Minimal API dependencies with `[FromServices] ... = null`.** RequestDelegateFactory treats `[FromServices] IVexSigner? signer = null` (or similar) as optional, so host startup succeeds even when tests have not registered that service. This pattern keeps observability endpoints cancellable while avoiding brittle test overrides.

---

## 2 · Traces

### 2.1 Span taxonomy

| Span name | Parent | Key attributes |
|-----------|--------|----------------|
| `ingest.fetch` | job root span | `source`, `tenant`, `uri`, `contentHash` |
| `ingest.transform` | `ingest.fetch` | `documentType` (`csaf`, `osv`, `vex`), `payloadBytes` |
| `ingest.write` | `ingest.transform` | `collection` (`advisory_raw`, `vex_raw`), `result` (`ok`, `reject`) |
| `aoc.guard` | `ingest.write` | `code` (on violation), `violationCount`, `supersedes` |
| `verify.run` | verification job root | `tenant`, `window.from`, `window.to`, `sources`, `violations` |

### 2.2 Trace usage

- Correlate UI dashboard entries with traces via `traceId` surfaced in violation drawers (`docs/ui/console.md`).
- Use `aoc.guard` spans to inspect guard payload snapshots. Sensitive fields are redacted automatically; raw JSON lives in secure logs only.
- For scheduled verification, filter traces by `initiator="scheduled"` to compare runtimes pre/post change.
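
The pre/post comparison for scheduled verification can be sketched offline once spans have been exported from the trace backend. The snippet below is a minimal illustration in Python, not part of the Concelier/Excititor tooling. It assumes each exported span is a dictionary with `name`, `start` (ISO-8601 with offset), `durationSeconds`, and an `attributes` map; that export shape, the sample data, and the helper names are hypothetical. Only the `verify.run` span name and the `initiator` filter come from this guide.

```python
import math
from datetime import datetime, timezone


def nearest_rank_p95(durations):
    """Nearest-rank 95th percentile; returns None for an empty list."""
    if not durations:
        return None
    ordered = sorted(durations)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


def compare_scheduled_verify_runs(spans, change_time):
    """Split scheduled verify.run spans at change_time and report P95 per side."""
    before, after = [], []
    for span in spans:
        if span["name"] != "verify.run":
            continue  # ignore ingest.* and aoc.guard spans
        if span["attributes"].get("initiator") != "scheduled":
            continue  # ignore ui/cli/api-initiated runs
        started = datetime.fromisoformat(span["start"])
        bucket = before if started < change_time else after
        bucket.append(span["durationSeconds"])
    return {
        "before_p95": nearest_rank_p95(before),
        "after_p95": nearest_rank_p95(after),
    }


# Hypothetical export: three scheduled runs, one CLI run, one unrelated span.
SAMPLE_SPANS = [
    {"name": "verify.run", "start": "2025-10-20T02:00:00+00:00",
     "durationSeconds": 40.0, "attributes": {"initiator": "scheduled"}},
    {"name": "verify.run", "start": "2025-10-21T02:00:00+00:00",
     "durationSeconds": 44.0, "attributes": {"initiator": "scheduled"}},
    {"name": "verify.run", "start": "2025-10-23T02:00:00+00:00",
     "durationSeconds": 61.0, "attributes": {"initiator": "scheduled"}},
    {"name": "verify.run", "start": "2025-10-23T09:00:00+00:00",
     "durationSeconds": 5.0, "attributes": {"initiator": "cli"}},
    {"name": "ingest.write", "start": "2025-10-23T09:05:00+00:00",
     "durationSeconds": 0.3, "attributes": {}},
]

if __name__ == "__main__":
    change = datetime(2025, 10, 22, tzinfo=timezone.utc)
    print(compare_scheduled_verify_runs(SAMPLE_SPANS, change))
```

With the sample data, the scheduled runs before the change yield a P95 of 44.0 s and the single run after it yields 61.0 s. The nearest-rank percentile is used here only because it is easy to verify by hand; a production comparison would more likely rely on the `verify_duration_seconds` histogram.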

### 2.3 Telemetry configuration (Excititor)

- Configure the web service via `Excititor:Telemetry`:

  ```jsonc
  {
    "Excititor": {
      "Telemetry": {
        "Enabled": true,
        "EnableTracing": true,
        "EnableMetrics": true,
        "ServiceName": "stellaops-excititor-web",
        "OtlpEndpoint": "http://otel-collector:4317",
        "OtlpHeaders": {
          "Authorization": "Bearer ${OTEL_PUSH_TOKEN}"
        },
        "ResourceAttributes": {
          "env": "prod-us",
          "service.group": "ingestion"
        }
      }
    }
  }
  ```

- Point the OTLP endpoint at the shared collector profile from §1 so Excititor metrics land in the `ingestion_*` dashboards next to Concelier. Resource attributes drive Grafana filtering (e.g., `env`, `service.group`).
- For offline/air-gap bundles set `Enabled=false` and collect the file exporter artifacts from the Offline Kit; import them into Grafana after transfer to keep time-to-truth dashboards consistent.
- Local development templates: run `tools/mongodb/local-mongo.sh start` to spin up a single-node replica set plus the matching `mongosh` client. The script prints the `export EXCITITOR_TEST_MONGO_URI=...` command that integration tests (e.g., `StellaOps.Excititor.Storage.Mongo.Tests`) will honor. Use `restart` for a quick bounce, `clean` to wipe data between suites, and `stop` when finished.

---

## 3 · Logs

Structured logs include the following keys (JSON):

| Key | Description |
|-----|-------------|
| `traceId` | Matches OpenTelemetry trace/span IDs for cross-system correlation. |
| `tenant` | Tenant identifier enforced by Authority middleware. |
| `source.vendor` | Logical source (e.g., `redhat`, `ubuntu`, `osv`, `ghsa`). |
| `upstream.upstreamId` | Vendor-provided ID (CVE, GHSA, etc.). |
| `contentHash` | `sha256:` digest of the raw document. |
| `violation.code` | Present when guard rejects `ERR_AOC_00x`. |
| `verification.window` | Present on `/aoc/verify` job logs. |

Excititor APIs mirror these identifiers via response headers:

| Header | Purpose |
| --- | --- |
| `X-Stella-TraceId` | W3C trace/span identifier for deep-linking from Console → Grafana/Loki. |
| `X-Stella-CorrelationId` | Stable correlation identifier (respects inbound header or falls back to the request trace ID). |

Logs are shipped to the central Loki/Elasticsearch cluster. Use the template query:

```logql
{app="concelier-web"} | json | violation_code != ""
```

to spot active AOC violations.

### 3.1 · Advisory chunk API (Advisory AI feeds)

Advisory AI now leans on Concelier’s `/advisories/{key}/chunks` endpoint for deterministic evidence packs. The service exports dedicated metrics so dashboards can highlight latency spikes, cache noise, or aggressive guardrail filtering before they impact Advisory AI responses.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `advisory_ai_chunk_requests_total` | Counter | `tenant`, `result`, `truncated`, `cache` | Count of chunk API calls, tagged with cache hits/misses and truncation state. |
| `advisory_ai_chunk_latency_milliseconds` | Histogram | `tenant`, `result`, `truncated`, `cache` | End-to-end build latency (milliseconds) for each chunk request. |
| `advisory_ai_chunk_segments` | Histogram | `tenant`, `result`, `truncated` | Number of chunk segments returned to the caller; watch for sudden drops tied to guardrails. |
| `advisory_ai_chunk_sources` | Histogram | `tenant`, `result` | How many upstream observations/sources contributed to a response (after observation limits). |
| `advisory_ai_guardrail_blocks_total` | Counter | `tenant`, `reason`, `cache` | Per-reason count of segments suppressed by guardrails (length, normalization, character set). |

Dashboards should plot latency P95/P99 next to cache hit rates and guardrail block deltas to catch degradation early.
Advisory AI CLI/Console surfaces the same metadata so support engineers can correlate with Grafana/Loki entries using `traceId`/`correlationId` headers.

---

## 4 · Dashboards

Primary Grafana dashboard: **“AOC Ingestion Health”** (`dashboards/aoc-ingestion.json`). Panels include:

1. **Sources overview:** table fed by `ingestion_write_total` and `ingestion_latency_seconds` (mirrors Console tiles).
2. **Violation trend:** stacked bar chart of `aoc_violation_total` per code.
3. **Signature success rate:** timeseries derived from `ingestion_signature_verified_total`.
4. **Supersedes depth:** gauge showing `advisory_revision_count` P95.
5. **Verification runs:** histogram and latency boxplot using `verify_runs_total` / `verify_duration_seconds`.

Secondary dashboards:

- **AOC Alerts (Ops view):** summarises active alerts, last verify run, and links to incident runbook.
- **Offline Mode Dashboard:** fed from Offline Kit imports; highlights snapshot age and queued verification jobs.

Update `docs/assets/dashboards/` with screenshots when the Grafana capture pipeline produces the latest renders.

---

## 5 · Operational workflows

1. **During ingestion incident:**
   - Check Console dashboard for offending sources.
   - Pivot to logs using document `contentHash`.
   - Re-run `stella sources ingest --dry-run` with problematic payloads to validate fixes.
   - After remediation, run `stella aoc verify --since 24h` and confirm exit code `0`.
2. **Scheduled verification:**
   - Configure cron job to run `stella aoc verify --format json --export ...`.
   - Ship JSON to `aoc-verify` bucket and ingest into metrics using custom exporter.
   - Alert on missing exports (no file uploaded within 26 h).
3. **Offline kit validation:**
   - Use Offline Dashboard to ensure snapshots contain latest metrics.
   - Run verification reports locally and attach to bundle before distribution.
4. **Incident toggle audit:**
   - Authority requires `incident_reason` when issuing `obs:incident` tokens; plan your runbooks to capture business justification.
   - Auditors can call `/authority/audit/incident?limit=100` with the tenant header to list recent incident activations, including reason and issuer.

---

## 6 · Offline considerations

- Metrics exporters bundled with Offline Kit write to local Prometheus snapshots; sync them with central Grafana once connectivity is restored.
- CLI verification reports should be hashed (`sha256sum`) and archived for audit trails.
- Dashboards include offline data sources (`prometheus-offline`) switchable via dropdown.

---

## 7 · References

- [Aggregation-Only Contract reference](../ingestion/aggregation-only-contract.md)
- [Architecture overview](../modules/platform/architecture-overview.md)
- [Console AOC dashboard](../ui/console.md)
- [CLI AOC commands](../modules/cli/guides/cli-reference.md)
- [Concelier architecture](../modules/concelier/architecture.md)
- [Excititor architecture](../modules/excititor/architecture.md)
- [Scheduler Worker observability guide](../modules/scheduler/operations/worker.md)

---

## 8 · Compliance checklist

- [ ] Metrics documented with label sets and alert guidance.
- [ ] Tracing span taxonomy aligned with Concelier/Excititor implementation.
- [ ] Log schema matches structured logging contracts (traceId, tenant, source, contentHash).
- [ ] Grafana dashboard references verified and screenshots scheduled.
- [ ] Offline/air-gap workflow captured.
- [ ] Cross-links to AOC reference, console, and CLI docs included.
- [ ] Observability Guild sign-off scheduled (OWNER: @obs-guild, due 2025-10-28).

---

*Last updated: 2025-10-26 (Sprint 19).*