# AOC Observability Guide

> **Audience:** Observability Guild, Concelier/Excititor SREs, platform operators.
> **Scope:** Metrics, traces, logs, dashboards, and runbooks introduced as part of the Aggregation-Only Contract (AOC) rollout (Sprint 19).

This guide captures the canonical signals emitted by Concelier and Excititor once AOC guards are active. It explains how to consume the metrics in dashboards, correlate traces/logs for incident triage, and operate in offline environments. Pair this guide with the [AOC reference](../ingestion/aggregation-only-contract.md) and [architecture overview](../modules/platform/architecture-overview.md).

---

## 1 · Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ingestion_write_total` | Counter | `source`, `tenant`, `result` (`ok`, `reject`, `noop`) | Counts write attempts to `advisory_raw`/`vex_raw`. Rejects correspond to guard failures. |
| `ingestion_latency_seconds` | Histogram | `source`, `tenant`, `phase` (`fetch`, `transform`, `write`) | Measures runtime of the ingestion stages, broken down by `phase`. Alert on the 95th percentile (P95), for example computed with `histogram_quantile`. |
| `aoc_violation_total` | Counter | `source`, `tenant`, `code` (`ERR_AOC_00x`) | Total guard violations bucketed by error code. Drives dashboard pills and alert thresholds. |
| `ingestion_signature_verified_total` | Counter | `source`, `tenant`, `result` (`ok`, `fail`, `skipped`) | Tracks signature/checksum verification outcomes. |
| `advisory_revision_count` | Gauge | `source`, `tenant` | Supersedes depth for raw documents; spikes indicate noisy upstream feeds. |
| `verify_runs_total` | Counter | `tenant`, `initiator` (`ui`, `cli`, `api`, `scheduled`) | Number of `stella aoc verify` or `/aoc/verify` runs executed. |
| `verify_duration_seconds` | Histogram | `tenant`, `initiator` | Runtime of verification jobs; use P95 to detect regressions. |

### 1.1 Alerts

- **Violation spike:** Alert when `increase(aoc_violation_total[15m]) > 0` for critical sources. Page SRE if `code="ERR_AOC_005"` (signature failure) or `code="ERR_AOC_001"` persists for more than 30 min.
- **Stale ingestion:** Alert when `max_over_time((ingestion_latency_seconds_sum / ingestion_latency_seconds_count)[30m:])` exceeds 30 s, or when `ingestion_write_total` shows no growth for more than 60 min.
- **Signature drop:** Warn when `rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0`.

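The "no growth for more than 60 min" condition in the stale-ingestion alert can be reasoned about offline with a small sketch like the one below. It is only an illustration: the `is_stale` helper, the `(timestamp, value)` sample format, and the choice to treat an empty series as stale are assumptions, not part of the Concelier/Excititor codebase, and counter resets are ignored for brevity.

```python
from datetime import datetime, timedelta, timezone

MAX_IDLE = timedelta(minutes=60)  # threshold taken from the stale-ingestion alert


def is_stale(samples, now, max_idle=MAX_IDLE):
    """Return True when the counter has not increased within max_idle.

    samples: (timestamp, value) pairs sorted by timestamp, e.g. scraped
    values of ingestion_write_total for one source (hypothetical input).
    """
    if not samples:
        return True  # assumption: a series with no data counts as stale
    last_growth = samples[0][0]
    # Walk consecutive pairs and remember the last time the value went up.
    for (_, previous), (timestamp, value) in zip(samples, samples[1:]):
        if value > previous:
            last_growth = timestamp
    return now - last_growth > max_idle


start = datetime(2025, 11, 8, 10, 0, tzinfo=timezone.utc)
flat = [(start + timedelta(minutes=15 * i), 42.0) for i in range(6)]
print(is_stale(flat, now=start + timedelta(minutes=75)))
```

In production the same check would normally live in the alerting backend; the sketch is mainly useful for replaying exported samples from an air-gapped site.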
### 1.2 · `/obs/excititor/health`

`GET /obs/excititor/health` (scope `vex.admin`) returns a compact snapshot for Grafana tiles and Console widgets:

- `ingest` — overall status, worst lag (seconds), and the top connectors (status, lagSeconds, failure count, last success).
- `link` — freshness of consensus/linkset processing plus document counts and the number currently carrying conflicts.
- `signature` — recent coverage window (evaluated, with signatures, verified, failures, unsigned, coverage ratio).
- `conflicts` — rolling totals grouped by status plus per-bucket trend data for charts.

```json
{
  "generatedAt": "2025-11-08T11:00:00Z",
  "ingest": { "status": "healthy", "connectors": [ { "connectorId": "excititor:redhat", "lagSeconds": 45.3 } ] },
  "link": { "status": "warning", "lastConsensusAt": "2025-11-08T10:57:03Z" },
  "signature": { "status": "critical", "documentsEvaluated": 120, "verified": 30, "failures": 2 },
  "conflicts": { "status": "warning", "conflictStatements": 325, "trend": [ { "bucketStart": "2025-11-08T10:00:00Z", "conflicts": 130 } ] }
}
```

| Setting | Default | Purpose |
|---------|---------|---------|
| `Excititor:Observability:IngestWarningThreshold` | `06:00:00` | Connector lag before `ingest.status` becomes `warning`. |
| `Excititor:Observability:IngestCriticalThreshold` | `24:00:00` | Connector lag before `ingest.status` becomes `critical`. |
| `Excititor:Observability:LinkWarningThreshold` | `00:15:00` | Maximum acceptable delay between consensus recalculations. |
| `Excititor:Observability:LinkCriticalThreshold` | `01:00:00` | Delay that marks link status as `critical`. |
| `Excititor:Observability:SignatureWindow` | `12:00:00` | Lookback window for signature coverage. |
| `Excititor:Observability:SignatureHealthyCoverage` | `0.8` | Coverage ratio that still counts as healthy. |
| `Excititor:Observability:SignatureWarningCoverage` | `0.5` | Coverage ratio that flips the status to `warning`. |
| `Excititor:Observability:ConflictTrendWindow` | `24:00:00` | Rolling window used for conflict aggregation. |
| `Excititor:Observability:ConflictTrendBucketMinutes` | `60` | Resolution of conflict `trend` buckets. |
| `Excititor:Observability:ConflictWarningRatio` | `0.15` | Fraction of consensus docs with conflicts that triggers `warning`. |
| `Excititor:Observability:ConflictCriticalRatio` | `0.3` | Ratio that marks `conflicts.status` as `critical`. |
| `Excititor:Observability:MaxConnectorDetails` | `50` | Number of connector entries returned (keeps payloads small). |

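To make the threshold semantics concrete, here is a minimal sketch of how the default ingest and signature settings could map to a status. The function names, the use of `>=` at the boundaries, and the definition of coverage as `verified / documentsEvaluated` are assumptions for illustration; the service may compute these differently.

```python
from datetime import timedelta

# Defaults copied from the settings table.
INGEST_WARNING = timedelta(hours=6)
INGEST_CRITICAL = timedelta(hours=24)
SIGNATURE_HEALTHY_COVERAGE = 0.8
SIGNATURE_WARNING_COVERAGE = 0.5


def ingest_status(lag_seconds: float) -> str:
    """Classify connector lag against the warning/critical thresholds."""
    lag = timedelta(seconds=lag_seconds)
    if lag >= INGEST_CRITICAL:
        return "critical"
    if lag >= INGEST_WARNING:
        return "warning"
    return "healthy"


def signature_status(evaluated: int, verified: int) -> str:
    """Classify signature coverage (assumed to be verified / evaluated)."""
    if evaluated == 0:
        return "healthy"  # assumption: an empty window is not an error
    coverage = verified / evaluated
    if coverage >= SIGNATURE_HEALTHY_COVERAGE:
        return "healthy"
    if coverage >= SIGNATURE_WARNING_COVERAGE:
        return "warning"
    return "critical"


# Values from the sample payload: 45.3 s of lag, 30 of 120 documents verified.
print(ingest_status(45.3))
print(signature_status(120, 30))
```

Under these assumptions the sketch reproduces the `ingest` and `signature` statuses shown in the sample response, which is a quick sanity check when tuning the thresholds.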
### 1.3 · Regression & DI hygiene

1. **Keep storage/integration tests green when telemetry touches persistence.**
   - `./tools/mongodb/local-mongo.sh start` downloads MongoDB 6.0.16 (if needed), launches `rs0`, and prints `export EXCITITOR_TEST_MONGO_URI=mongodb://.../excititor-tests`. Copy that export into your shell.
   - `./tools/mongodb/local-mongo.sh restart` is a shortcut for “stop if running, then start” using the same dataset—use it after tweaking config or when tests need a bounce without wiping fixtures.
   - `./tools/mongodb/local-mongo.sh clean` stops the instance (if running) and deletes the managed data/log directories so storage tests begin from a pristine catalog.
   - Run `dotnet test src/Excititor/__Tests/StellaOps.Excititor.Storage.Mongo.Tests/StellaOps.Excititor.Storage.Mongo.Tests.csproj -nologo -v minimal` (add `--filter` if you only touched specific suites). These tests exercise the same write paths that feed the dashboards, so regressions show up immediately.
   - `./tools/mongodb/local-mongo.sh stop` when finished so CI/dev hosts stay clean; `status|logs|shell` are available for troubleshooting.
2. **Declare optional Minimal API dependencies with `[FromServices] ... = null`.** RequestDelegateFactory treats `[FromServices] IVexSigner? signer = null` (or similar) as optional, so host startup succeeds even when tests have not registered that service. This pattern keeps observability endpoints cancellable while avoiding brittle test overrides.

---

## 2 · Traces

### 2.1 Span taxonomy

| Span name | Parent | Key attributes |
|-----------|--------|----------------|
| `ingest.fetch` | job root span | `source`, `tenant`, `uri`, `contentHash` |
| `ingest.transform` | `ingest.fetch` | `documentType` (`csaf`, `osv`, `vex`), `payloadBytes` |
| `ingest.write` | `ingest.transform` | `collection` (`advisory_raw`, `vex_raw`), `result` (`ok`, `reject`) |
| `aoc.guard` | `ingest.write` | `code` (on violation), `violationCount`, `supersedes` |
| `verify.run` | verification job root | `tenant`, `window.from`, `window.to`, `sources`, `violations` |

### 2.2 Trace usage

- Correlate UI dashboard entries with traces via `traceId` surfaced in violation drawers (`docs/ui/console.md`).
- Use `aoc.guard` spans to inspect guard payload snapshots. Sensitive fields are redacted automatically; raw JSON lives in secure logs only.
- For scheduled verification, filter traces by `initiator="scheduled"` to compare runtimes pre/post change.

### 2.3 Telemetry configuration (Excititor)

- Configure the web service via `Excititor:Telemetry`:

  ```jsonc
  {
    "Excititor": {
      "Telemetry": {
        "Enabled": true,
        "EnableTracing": true,
        "EnableMetrics": true,
        "ServiceName": "stellaops-excititor-web",
        "OtlpEndpoint": "http://otel-collector:4317",
        "OtlpHeaders": {
          "Authorization": "Bearer ${OTEL_PUSH_TOKEN}"
        },
        "ResourceAttributes": {
          "env": "prod-us",
          "service.group": "ingestion"
        }
      }
    }
  }
  ```

- Point the OTLP endpoint at the shared collector profile from §1 so Excititor metrics land in the `ingestion_*` dashboards next to Concelier. Resource attributes drive Grafana filtering (e.g., `env`, `service.group`).
- For offline/air-gap bundles set `Enabled=false` and collect the file exporter artifacts from the Offline Kit; import them into Grafana after transfer to keep time-to-truth dashboards consistent.
- Local development templates: run `tools/mongodb/local-mongo.sh start` to spin up a single-node replica set plus the matching `mongosh` client. The script prints the `export EXCITITOR_TEST_MONGO_URI=...` command that integration tests (e.g., `StellaOps.Excititor.Storage.Mongo.Tests`) will honor. Use `restart` for a quick bounce, `clean` to wipe data between suites, and `stop` when finished.

---

## 3 · Logs

Structured logs include the following keys (JSON):

| Key | Description |
|-----|-------------|
| `traceId` | Matches OpenTelemetry trace/span IDs for cross-system correlation. |
| `tenant` | Tenant identifier enforced by Authority middleware. |
| `source.vendor` | Logical source (e.g., `redhat`, `ubuntu`, `osv`, `ghsa`). |
| `upstream.upstreamId` | Vendor-provided ID (CVE, GHSA, etc.). |
| `contentHash` | `sha256:` digest of the raw document. |
| `violation.code` | Present when guard rejects `ERR_AOC_00x`. |
| `verification.window` | Present on `/aoc/verify` job logs. |

Excititor APIs mirror these identifiers via response headers:

| Header | Purpose |
| --- | --- |
| `X-Stella-TraceId` | W3C trace/span identifier for deep-linking from Console → Grafana/Loki. |
| `X-Stella-CorrelationId` | Stable correlation identifier (respects inbound header or falls back to the request trace ID). |

Logs are shipped to the central Loki/Elasticsearch cluster. Use this template query to spot active AOC violations:

```logql
{app="concelier-web"} | json | violation_code != ""
```

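When the central log cluster is not reachable (for example while working from an exported log file), the same filter can be approximated locally. The sketch below assumes one JSON object per line and accepts the violation code either as a flattened `violation.code` key or as a nested `violation` object; both layouts, the helper names, and the sample values are assumptions for illustration.

```python
import json


def violation_code(record: dict) -> str:
    """Return the guard violation code from a log record, or '' if absent."""
    # Assumption: the key is either flattened ("violation.code") or nested.
    if "violation.code" in record:
        return record["violation.code"] or ""
    nested = record.get("violation")
    if isinstance(nested, dict):
        return nested.get("code") or ""
    return ""


def active_violations(lines):
    """Yield (traceId, contentHash, code) for each line carrying a violation."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if not isinstance(record, dict):
            continue
        code = violation_code(record)
        if code:
            yield record.get("traceId"), record.get("contentHash"), code


# Hypothetical sample lines; identifiers are made up for the example.
sample = [
    '{"traceId": "t-1", "tenant": "default", "contentHash": "sha256:aa"}',
    '{"traceId": "t-2", "contentHash": "sha256:bb", "violation.code": "ERR_AOC_001"}',
    '{"traceId": "t-3", "contentHash": "sha256:cc", "violation": {"code": "ERR_AOC_005"}}',
]
for trace_id, content_hash, code in active_violations(sample):
    print(trace_id, content_hash, code)
```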
---
## 4 · Dashboards

Primary Grafana dashboard: **“AOC Ingestion Health”** (`dashboards/aoc-ingestion.json`). Panels include:

1. **Sources overview:** table fed by `ingestion_write_total` and `ingestion_latency_seconds` (mirrors Console tiles).
2. **Violation trend:** stacked bar chart of `aoc_violation_total` per code.
3. **Signature success rate:** timeseries derived from `ingestion_signature_verified_total`.
4. **Supersedes depth:** gauge showing `advisory_revision_count` P95.
5. **Verification runs:** histogram and latency boxplot using `verify_runs_total` / `verify_duration_seconds`.

Secondary dashboards:

- **AOC Alerts (Ops view):** summarises active alerts, last verify run, and links to the incident runbook.
- **Offline Mode Dashboard:** fed from Offline Kit imports; highlights snapshot age and queued verification jobs.

Update `docs/assets/dashboards/` with screenshots when the Grafana capture pipeline produces the latest renders.

---

## 5 · Operational workflows

1. **During ingestion incident:**
   - Check the Console dashboard for offending sources.
   - Pivot to logs using the document `contentHash`.
   - Re-run `stella sources ingest --dry-run` with problematic payloads to validate fixes.
   - After remediation, run `stella aoc verify --since 24h` and confirm exit code `0`.
2. **Scheduled verification:**
   - Configure a cron job to run `stella aoc verify --format json --export ...`.
   - Ship the JSON to the `aoc-verify` bucket and ingest it into metrics using a custom exporter.
   - Alert on missing exports (no file uploaded within 26 h).
3. **Offline kit validation:**
   - Use the Offline Dashboard to ensure snapshots contain the latest metrics.
   - Run verification reports locally and attach them to the bundle before distribution.
4. **Incident toggle audit:**
   - Authority requires `incident_reason` when issuing `obs:incident` tokens; plan your runbooks to capture business justification.
   - Auditors can call `/authority/audit/incident?limit=100` with the tenant header to list recent incident activations, including reason and issuer.

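The "missing export" alert from the scheduled verification workflow boils down to a freshness check on the newest report. A minimal sketch follows, assuming the exports are mirrored to a local directory as `*.json` files and that file modification time stands in for upload time; the directory layout and helper names are hypothetical, and a real deployment would more likely query the bucket listing instead.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

MAX_EXPORT_AGE = timedelta(hours=26)  # threshold from the workflow above


def newest_export_age(export_dir: Path, now: datetime):
    """Return the age of the most recent *.json export, or None if none exist."""
    mtimes = [path.stat().st_mtime for path in export_dir.glob("*.json")]
    if not mtimes:
        return None
    newest = datetime.fromtimestamp(max(mtimes), tz=timezone.utc)
    return now - newest


def export_missing(export_dir: Path, now: datetime) -> bool:
    """True when no export exists or the newest one is older than 26 h."""
    age = newest_export_age(export_dir, now)
    return age is None or age > MAX_EXPORT_AGE
```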
---
## 6 · Offline considerations

- Metrics exporters bundled with Offline Kit write to local Prometheus snapshots; sync them with central Grafana once connectivity is restored.
- CLI verification reports should be hashed (`sha256sum`) and archived for audit trails.
- Dashboards include offline data sources (`prometheus-offline`) switchable via dropdown.

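Where `sha256sum` is not available on the offline host, the hashing step for verification reports can be reproduced with the standard library. This is a sketch under assumptions: the helper names and the `.sha256` sidecar naming are illustrative choices, while the `<hex digest>  <file name>` line matches the text-mode format that `sha256sum` prints.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large reports need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path: Path) -> Path:
    """Write '<hex digest>  <file name>' next to the report and return that path."""
    checksum_path = path.with_name(path.name + ".sha256")
    checksum_path.write_text(f"{sha256_file(path)}  {path.name}\n")
    return checksum_path
```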
---
## 7 · References

- [Aggregation-Only Contract reference](../ingestion/aggregation-only-contract.md)
- [Architecture overview](../modules/platform/architecture-overview.md)
- [Console AOC dashboard](../ui/console.md)
- [CLI AOC commands](../modules/cli/guides/cli-reference.md)
- [Concelier architecture](../modules/concelier/architecture.md)
- [Excititor architecture](../modules/excititor/architecture.md)
- [Scheduler Worker observability guide](../modules/scheduler/operations/worker.md)

---

## 8 · Compliance checklist

- [ ] Metrics documented with label sets and alert guidance.
- [ ] Tracing span taxonomy aligned with Concelier/Excititor implementation.
- [ ] Log schema matches structured logging contracts (traceId, tenant, source, contentHash).
- [ ] Grafana dashboard references verified and screenshots scheduled.
- [ ] Offline/air-gap workflow captured.
- [ ] Cross-links to AOC reference, console, and CLI docs included.
- [ ] Observability Guild sign-off scheduled (OWNER: @obs-guild, due 2025-10-28).

---

*Last updated: 2025-10-26 (Sprint 19).*