git.stella-ops.org/docs/modules/telemetry/guides/aggregation.md

# Aggregation Observability

Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-LNM-22-007)

Covers metrics, traces, and logs for Link-Not-Merge (LNM) aggregation and evidence pipelines.

## Metrics
- `aggregation_ingest_latency_seconds` (histogram) — end-to-end ingest per statement; labels: `tenant`, `source`, `status`.
- `aggregation_conflict_total` (counter) — conflicts encountered; labels: `tenant`, `advisory`, `product`, `reason`.
- `aggregation_overlay_cache_hits_total` / `_misses_total` — overlay cache effectiveness; labels: `tenant`, `cache`.
- `aggregation_vex_gate_total` — VEX gating outcomes; labels: `tenant`, `status` (`affected`, `not_affected`, `unknown`).
- `aggregation_queue_depth` (gauge) — pending statements per tenant.

## Traces
- Span name `aggregation.process` with attributes:
  - `tenant`, `advisory`, `product`, `vex_status`, `source_kind`
  - `overlay_version`, `cache_hit` (bool)
- Link to upstream ingest span (`traceparent` forwarded by Excititor/Concelier).
- Export to OTLP; sampling default 10% outside prod, 100% for `status=error`.

## Logs
Structured JSON with fields: `tenant`, `advisory`, `product`, `vex_status`, `decision` (`merged|suppressed|dropped`), `reason`, `duration_ms`, `trace_id`.

## SLOs
- **Ingest latency**: p95 < 500ms per statement (steady state).
- **Cache hit rate**: >80% for overlays; alerts when below for 15 minutes.
- **Error rate**: <0.1% over 10 minute window.

## Alerts
- `HighConflictRate` — `aggregation_conflict_total` delta > 100/minute per tenant.
- `QueueBacklog` — `aggregation_queue_depth` > 10k for 5 minutes.
- `LowCacheHit` — overlay cache hit rate < 60% for 10 minutes.

## Offline/air-gap considerations
- Export metrics to local Prometheus scrape; no external sinks.
- Trace sampling and log retention configured via environment without needing control-plane access.
- Deterministic ordering preserved; cache warmers seeded from bundled fixtures.