Files
StellaOps Bot 9f6e6f7fb3
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Signals CI & Image / signals-ci (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled
Policy Simulation / policy-simulate (push) Has been cancelled
SDK Publish & Sign / sdk-publish (push) Has been cancelled
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Concelier Attestation Tests / attestation-tests (push) Has been cancelled
devportal-offline / build-offline (push) Has been cancelled
up
2025-11-25 22:09:44 +02:00

38 lines
1.9 KiB
Markdown

# Aggregation Observability
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-LNM-22-007)
Covers metrics, traces, and logs for Link-Not-Merge (LNM) aggregation and evidence pipelines.
## Metrics
- `aggregation_ingest_latency_seconds` (histogram) — end-to-end ingest per statement; labels: `tenant`, `source`, `status`.
- `aggregation_conflict_total` (counter) — conflicts encountered; labels: `tenant`, `advisory`, `product`, `reason`.
- `aggregation_overlay_cache_hits_total` / `_misses_total` — overlay cache effectiveness; labels: `tenant`, `cache`.
- `aggregation_vex_gate_total` — VEX gating outcomes; labels: `tenant`, `status` (`affected`, `not_affected`, `unknown`).
- `aggregation_queue_depth` (gauge) — pending statements per tenant.
## Traces
- Span name `aggregation.process` with attributes:
- `tenant`, `advisory`, `product`, `vex_status`, `source_kind`
- `overlay_version`, `cache_hit` (bool)
- Link to upstream ingest span (`traceparent` forwarded by Excititor/Concelier).
- Export to OTLP; sampling default 10% outside prod, 100% for `status=error`.
## Logs
Structured JSON with fields: `tenant`, `advisory`, `product`, `vex_status`, `decision` (`merged|suppressed|dropped`), `reason`, `duration_ms`, `trace_id`.
## SLOs
- **Ingest latency**: p95 < 500ms per statement (steady state).
- **Cache hit rate**: >80% for overlays; alerts when below for 15 minutes.
- **Error rate**: <0.1% over 10 minute window.
## Alerts
- `HighConflictRate` `aggregation_conflict_total` delta > 100/minute per tenant.
- `QueueBacklog``aggregation_queue_depth` > 10k for 5 minutes.
- `LowCacheHit` — overlay cache hit rate < 60% for 10 minutes.
## Offline/air-gap considerations
- Export metrics to local Prometheus scrape; no external sinks.
- Trace sampling and log retention configured via environment without needing control-plane access.
- Deterministic ordering preserved; cache warmers seeded from bundled fixtures.