Aggregation Observability

Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-LNM-22-007)

Covers metrics, traces, and logs for Link-Not-Merge (LNM) aggregation and evidence pipelines.

Metrics

  • aggregation_ingest_latency_seconds (histogram) — end-to-end ingest per statement; labels: tenant, source, status.
  • aggregation_conflict_total (counter) — conflicts encountered; labels: tenant, advisory, product, reason.
  • aggregation_overlay_cache_hits_total / _misses_total — overlay cache effectiveness; labels: tenant, cache.
  • aggregation_vex_gate_total — VEX gating outcomes; labels: tenant, status (affected, not_affected, unknown).
  • aggregation_queue_depth (gauge) — pending statements per tenant.
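
The metric names and label sets above can be captured in a small catalog to catch label drift early. This is a minimal sketch and not the project's instrumentation code: the expansion of `_misses_total` to `aggregation_overlay_cache_misses_total`, the counter types marked "assumed", and the `tenant` label on the queue-depth gauge are readings of the list above, and `sample_line` is a hypothetical helper that renders one sample in the Prometheus text exposition format.

```python
# Catalog transcribed from the metrics list above. Types marked "assumed"
# are not stated in the source; the "_total" suffix suggests counters.
METRICS = {
    "aggregation_ingest_latency_seconds": ("histogram", ("tenant", "source", "status")),
    "aggregation_conflict_total": ("counter", ("tenant", "advisory", "product", "reason")),
    "aggregation_overlay_cache_hits_total": ("counter", ("tenant", "cache")),    # assumed
    "aggregation_overlay_cache_misses_total": ("counter", ("tenant", "cache")),  # assumed
    "aggregation_vex_gate_total": ("counter", ("tenant", "status")),             # assumed
    "aggregation_queue_depth": ("gauge", ("tenant",)),  # label implied by "per tenant"
}


def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def sample_line(name: str, labels: dict, value) -> str:
    """Render one counter or gauge sample after checking its label set."""
    kind, allowed = METRICS[name]
    if kind == "histogram":
        raise ValueError("histograms expose _bucket/_sum/_count series, not one sample")
    if set(labels) != set(allowed):
        raise ValueError(f"{name} expects labels {allowed}, got {sorted(labels)}")
    rendered = ",".join(f'{key}="{escape_label(str(labels[key]))}"' for key in allowed)
    return f"{name}{{{rendered}}} {value}"


if __name__ == "__main__":
    print(sample_line("aggregation_queue_depth", {"tenant": "acme"}, 42))
```

Keeping `advisory` and `product` as labels on the conflict counter can produce high series cardinality, so a check like this is also a convenient place to review which labels are really needed.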

Traces

  • Span name aggregation.process with attributes:
    • tenant, advisory, product, vex_status, source_kind
    • overlay_version, cache_hit (bool)
  • Link to upstream ingest span (traceparent forwarded by Excititor/Concelier).
  • Export via OTLP; default sampling is 10% outside prod and 100% for spans with status=error.
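
The span linking and sampling rules above can be sketched as two small functions. This is an illustration under stated assumptions, not the pipeline's code: `parse_traceparent` handles only the version-00 layout of the W3C `traceparent` header, and `should_sample` is a hypothetical helper that assumes the span status is already known when the decision is made. The source gives a 10% default outside prod and does not state the prod rate, so the rate is a parameter.

```python
import random
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero identifiers are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def should_sample(status: str, rate: float = 0.10, rng=random.random) -> bool:
    """Always keep error spans; keep other spans with probability `rate`."""
    if status == "error":
        return True
    return rng() < rate


if __name__ == "__main__":
    upstream = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
    print(upstream)
    print(should_sample("error", rate=0.0))
```

Because the error rule depends on the final status, a real deployment would likely apply it at span end or in a tail-sampling stage rather than at span start.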

Logs

Structured JSON with fields: tenant, advisory, product, vex_status, decision (merged|suppressed|dropped), reason, duration_ms, trace_id.
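
A log record with the fields above might be produced as follows. This is a minimal sketch: `log_line` is a hypothetical helper, and the sample values in the usage section (tenant, advisory ID, package URL, reason, trace ID) are purely illustrative.

```python
import json

# Field names and decision values are taken from the description above.
FIELDS = (
    "tenant", "advisory", "product", "vex_status",
    "decision", "reason", "duration_ms", "trace_id",
)
DECISIONS = {"merged", "suppressed", "dropped"}


def log_line(**fields) -> str:
    """Serialize one aggregation decision as a single-line JSON record."""
    missing = [name for name in FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing log fields: {missing}")
    if fields["decision"] not in DECISIONS:
        raise ValueError(f"unknown decision: {fields['decision']!r}")
    # Emit only the documented fields, with sorted keys for stable output.
    record = {name: fields[name] for name in FIELDS}
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


if __name__ == "__main__":
    print(log_line(
        tenant="acme",
        advisory="CVE-0000-0000",           # illustrative value
        product="pkg:npm/example@1.0.0",    # illustrative value
        vex_status="not_affected",
        decision="suppressed",
        reason="vex_gate",                  # illustrative value
        duration_ms=12,
        trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    ))
```

Sorting the keys keeps records byte-for-byte comparable across runs, which fits the deterministic-ordering goal for offline use.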

SLOs

  • Ingest latency: p95 < 500ms per statement (steady state).
  • Cache hit rate: >80% for overlays; alert when it stays below that for 15 minutes.
  • Error rate: <0.1% over a 10-minute window.
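
The three SLO targets above reduce to simple arithmetic over a window of observations. The sketch below is an illustration only: `p95` uses the nearest-rank method on raw latency samples, whereas a Prometheus-based setup would more likely estimate the quantile from histogram buckets, and `slo_report` is a hypothetical helper name.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def slo_report(latencies_s, hits, misses, errors, total):
    """Evaluate the documented targets for one observation window."""
    lookups = hits + misses
    return {
        # Ingest latency: p95 < 500 ms per statement.
        "ingest_latency_ok": p95(latencies_s) < 0.5,
        # Overlay cache hit rate: > 80%. No lookups counts as not violated.
        "cache_hit_ok": lookups == 0 or hits / lookups > 0.80,
        # Error rate: < 0.1% of processed statements.
        "error_rate_ok": total == 0 or errors / total < 0.001,
    }


if __name__ == "__main__":
    window = [0.1] * 95 + [0.9] * 5  # 95 fast statements, 5 slow ones
    print(slo_report(window, hits=90, misses=10, errors=0, total=1000))
```

With 100 samples the nearest-rank p95 is the 95th smallest value, so the five slow statements in the example do not breach the latency target.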

Alerts

  • HighConflictRate: aggregation_conflict_total delta > 100/minute per tenant.
  • QueueBacklog: aggregation_queue_depth > 10k for 5 minutes.
  • LowCacheHit: overlay cache hit rate < 60% for 10 minutes.
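
The "for N minutes" conditions above mean the threshold must hold continuously, not just at one scrape. The following sketch shows one way to express that over timestamped samples. `held_for` and `high_conflict_rate` are hypothetical helpers, the 60-second sample spacing is an assumption, and counter resets are ignored for brevity.

```python
def held_for(samples, predicate, duration_s):
    """True if `predicate` held on every trailing sample for >= duration_s.

    `samples` is a list of (timestamp_seconds, value) in ascending time order.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    run_start = None
    # Walk back from the newest sample until the condition stops holding.
    for ts, value in reversed(samples):
        if not predicate(value):
            break
        run_start = ts
    return run_start is not None and latest_ts - run_start >= duration_s


def high_conflict_rate(count_one_minute_ago, count_now):
    """HighConflictRate: more than 100 new conflicts in the last minute."""
    return count_now - count_one_minute_ago > 100


if __name__ == "__main__":
    depth = [(t, 12_000) for t in range(0, 301, 60)]   # six scrapes, 60 s apart
    print(held_for(depth, lambda d: d > 10_000, 300))  # QueueBacklog
    ratio = [(t, 0.55) for t in range(0, 601, 60)]
    print(held_for(ratio, lambda r: r < 0.60, 600))    # LowCacheHit
    print(high_conflict_rate(1_000, 1_150))            # HighConflictRate
```

A single healthy sample inside the window resets the run, which mirrors how a pending alert returns to inactive when its condition clears.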

Offline/air-gap considerations

  • Expose metrics for a local Prometheus scrape; no external sinks.
  • Trace sampling and log retention configured via environment without needing control-plane access.
  • Deterministic ordering preserved; cache warmers seeded from bundled fixtures.
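
Environment-driven configuration of sampling and retention, as described above, might look like the sketch below. The variable names `AGGREGATION_TRACE_SAMPLE_RATE` and `AGGREGATION_LOG_RETENTION_DAYS` are hypothetical, since the source does not name them. The 0.10 fallback mirrors the documented non-prod sampling default, and no retention default is assumed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservabilityConfig:
    trace_sample_rate: float
    log_retention_days: Optional[int]  # None means "not set in the environment"


def load_config(env: Mapping[str, str] = os.environ) -> ObservabilityConfig:
    """Read observability settings from the environment only (no control plane)."""
    # Hypothetical variable names; substitute the ones the deployment defines.
    rate = float(env.get("AGGREGATION_TRACE_SAMPLE_RATE", "0.10"))
    if not 0.0 <= rate <= 1.0:
        raise ValueError("trace sample rate must be between 0 and 1")
    raw_retention = env.get("AGGREGATION_LOG_RETENTION_DAYS")
    retention = int(raw_retention) if raw_retention is not None else None
    if retention is not None and retention < 1:
        raise ValueError("log retention must be at least one day")
    return ObservabilityConfig(trace_sample_rate=rate, log_retention_days=retention)


if __name__ == "__main__":
    print(load_config({}))
    print(load_config({
        "AGGREGATION_TRACE_SAMPLE_RATE": "1.0",
        "AGGREGATION_LOG_RETENTION_DAYS": "14",
    }))
```

Passing the environment as a mapping keeps the loader testable without mutating the process environment.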