Excititor Observability Guide

Added 2025-11-14 alongside Sprint 119 (EXCITITOR-AIAI-31-003). Complements the AirGap/mirror runbooks under the same folder.

Excititor's evidence APIs now emit first-class OpenTelemetry metrics so that Lens, Advisory AI, and Ops can detect misuse or missing provenance without paging through logs. This document lists the counters and histograms shipped by the WebService (src/Excititor/StellaOps.Excititor.WebService) and explains how to hook them into your exporters and dashboards.

Telemetry prerequisites

  • Enable Excititor:Telemetry in the service configuration (appsettings.*), ensuring metrics export is on. The WebService automatically adds the evidence meter (StellaOps.Excititor.WebService.Evidence) alongside the ingestion meter.
  • Deploy at least one OTLP or console exporter (see TelemetryExtensions.ConfigureExcititorTelemetry). If your region lacks OTLP transport, fall back to scraping the console exporter for smoke tests.
  • Coordinate with the Ops/Signals guild to provision the span/metric sinks referenced in docs/modules/platform/architecture-overview.md#observability.

Metrics reference

| Metric | Type | Description | Key dimensions |
| --- | --- | --- | --- |
| excititor.vex.observation.requests | Counter | Number of /v1/vex/observations/{vulnerabilityId}/{productKey} requests handled. | tenant, outcome (success, error, cancelled), truncated (true/false) |
| excititor.vex.observation.statement_count | Histogram | Distribution of statements returned per observation projection request. | tenant, outcome |
| excititor.vex.signature.status | Counter | Signature status per statement (missing vs. unverified). | tenant, status (missing, unverified) |
| excititor.vex.aoc.guard_violations | Counter | Aggregated count of Aggregation-Only Contract violations detected by the WebService (ingest + /v1/vex/aoc/verify). | tenant, surface (ingest, aoc_verify, etc.), code (AOC error code) |
| excititor.vex.chunks.requests | Counter | Requests to /v1/vex/evidence/chunks (NDJSON stream). | tenant, outcome (success, error, cancelled), truncated (true/false) |
| excititor.vex.chunks.bytes | Histogram | Size of NDJSON chunk streams served (bytes). | tenant, outcome |
| excititor.vex.chunks.records | Histogram | Count of evidence records emitted per chunk stream. | tenant, outcome |

All metrics originate from the EvidenceTelemetry helper (src/Excititor/StellaOps.Excititor.WebService/Telemetry/EvidenceTelemetry.cs). When telemetry is disabled, the helper is inert and emits nothing.
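The dimensions listed above are what make per-tenant ratios possible. The following is a minimal sketch, assuming metric samples have already been exported and flattened into `(name, labels, value)` tuples; that tuple layout, the tenant names, and the numbers are hypothetical, and only the metric and label names come from the table.

```python
from collections import defaultdict

# Hypothetical flattened samples: (metric name, labels, value).
# Metric and label names follow the reference table; values are made up.
SAMPLES = [
    ("excititor.vex.chunks.requests",
     {"tenant": "acme", "outcome": "success", "truncated": "false"}, 180),
    ("excititor.vex.chunks.requests",
     {"tenant": "acme", "outcome": "success", "truncated": "true"}, 20),
    ("excititor.vex.chunks.requests",
     {"tenant": "globex", "outcome": "error", "truncated": "false"}, 50),
    ("excititor.vex.signature.status",
     {"tenant": "acme", "status": "missing"}, 7),
]


def share_by_tenant(samples, metric, label, value):
    """Per tenant, return the fraction of `metric` whose `label` equals `value`."""
    matched = defaultdict(float)
    total = defaultdict(float)
    for name, labels, count in samples:
        if name != metric:
            continue  # ignore samples that belong to other metrics
        tenant = labels.get("tenant", "unknown")
        total[tenant] += count
        if labels.get(label) == value:
            matched[tenant] += count
    return {tenant: matched[tenant] / total[tenant]
            for tenant in total if total[tenant] > 0}


# Share of chunk requests that were truncated, per tenant.
truncation = share_by_tenant(
    SAMPLES, "excititor.vex.chunks.requests", "truncated", "true")
print(truncation)
```

The same helper works for any counter in the table that carries a categorical dimension, for example `outcome="error"` on excititor.vex.observation.requests.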

Dashboard hints

  • Advisory-AI readiness: alert when excititor.vex.signature.status{status="missing"} spikes for a tenant, indicating that connectors aren't supplying signatures.
  • Guardrail monitoring: graph excititor.vex.aoc.guard_violations per code to catch upstream feed regressions before they pollute Evidence Locker or Lens caches.
  • Capacity planning: histogram percentiles of excititor.vex.observation.statement_count feed API sizing (higher counts mean Advisory AI is requesting broad scopes).
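To illustrate the capacity-planning hint, here is a minimal sketch of how a percentile can be estimated from explicit-bucket histogram data by interpolating linearly inside the bucket that contains the target rank. The bucket bounds and counts are hypothetical; the boundaries actually configured for excititor.vex.observation.statement_count are not stated in this guide.

```python
def estimate_percentile(bounds, counts, q):
    """Estimate the q-quantile (0 < q <= 1) from explicit-bucket histogram data.

    bounds: ascending upper bounds of the finite buckets.
    counts: per-bucket (non-cumulative) counts; one more entry than bounds,
            the last entry being the overflow bucket.
    Returns None for an empty histogram. If the quantile falls in the
    overflow bucket, the highest finite bound is returned.
    """
    if len(counts) != len(bounds) + 1:
        raise ValueError("counts must have exactly one more entry than bounds")
    total = sum(counts)
    if total == 0:
        return None
    rank = q * total          # position of the quantile in the sorted data
    cumulative = 0
    lower = 0.0               # assume observations are non-negative
    for upper, count in zip(bounds, counts):
        if count > 0 and cumulative + count >= rank:
            # Interpolate linearly between the bucket's lower and upper bound.
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
        lower = upper
    return float(bounds[-1])


# Hypothetical statement-count buckets and per-bucket counts.
BOUNDS = [1, 5, 10, 25, 50, 100]
COUNTS = [10, 40, 30, 15, 4, 1, 0]

p50 = estimate_percentile(BOUNDS, COUNTS, 0.50)
p95 = estimate_percentile(BOUNDS, COUNTS, 0.95)
print(f"p50={p50:.1f} statements, p95={p95:.1f} statements")
```

The estimate is only as fine-grained as the buckets, so a bucket layout that is too coarse around the sizes you care about will blur the percentile.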

Operational steps

  1. Enable telemetry: set Excititor:Telemetry:EnableMetrics=true and configure OTLP endpoints/headers as described in TelemetryExtensions.
  2. Add dashboards: import panels referencing the metrics above (see Grafana JSON snippets in Ops repo once merged).
  3. Alerting: add rules for high guard violation rates, missing signatures, and abnormal chunk bytes/record counts. Tie alerts back to connectors via tenant metadata.
  4. Post-deploy checks: after each release, verify that metrics are emitted by curling /v1/vex/observations/... and /v1/vex/evidence/chunks while watching the console exporter (dev) or the OTLP sink (prod).
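The key in step 1 uses the colon-delimited notation of .NET configuration. As a sketch, assuming the service is wired to the default .NET configuration providers (not confirmed by this guide), the same key can be written as nested JSON in appsettings.* or as an environment variable in which each colon becomes a double underscore. The helper names below are hypothetical, and the OTLP endpoint keys are left out because this guide does not name them.

```python
import json


def to_env_var(key):
    """Map a colon-delimited configuration key to its environment-variable form."""
    return key.replace(":", "__")


def to_nested(settings):
    """Expand colon-delimited keys into the nested shape used by appsettings JSON."""
    root = {}
    for key, value in settings.items():
        node = root
        *parents, leaf = key.split(":")
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate sections
        node[leaf] = value
    return root


settings = {"Excititor:Telemetry:EnableMetrics": True}

print(to_env_var("Excititor:Telemetry:EnableMetrics"))
print(json.dumps(to_nested(settings), indent=2))
```

The environment-variable form is usually the more convenient one for container deployments, since it avoids editing files baked into the image.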

SLOs (Sprint 119 OBS-51-001)

The following SLOs apply to Excititor evidence read paths when telemetry is enabled. Record them in the shared SLO registry and alert via the platform alertmanager.

| Surface | SLI | Target | Window | Burn alert | Notes |
| --- | --- | --- | --- | --- | --- |
| /v1/vex/observations | p95 latency | ≤ 450ms | 7d | 2% over 1h | Measured on successful responses only; tenant scoped. |
| /v1/vex/observations | freshness | ≥ 99% within 5min of upstream ingest | 7d | 5% over 4h | Derived from arrival minus createdAt; requires ingest clocks in UTC. |
| /v1/vex/observations | signature presence | ≥ 98% statements with signature present | 7d | 3% over 24h | Use excititor.vex.signature.status{status="missing"}. |
| /v1/vex/evidence/chunks | p95 stream duration | ≤ 600ms | 7d | 2% over 1h | From request start to last NDJSON write; excludes client disconnects. |
| /v1/vex/evidence/chunks | truncation rate | ≤ 1% truncated streams | 7d | 1% over 1h | excititor.vex.chunks.requests with truncated=true. |
| AOC guardrail | zero hard violations | 0 | continuous | immediate | Any excititor.vex.aoc.guard_violations with severity error pages ops. |
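The freshness SLI in the table is the share of records whose arrival lags createdAt by no more than five minutes. The sketch below shows one way to compute it, assuming each record exposes both timestamps as timezone-aware UTC values; the `(created_at, arrived_at)` pair layout and the sample data are hypothetical, and the guide does not define exactly where "arrival" is stamped.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_LIMIT = timedelta(minutes=5)   # "within 5min" from the SLO table
FRESHNESS_TARGET = 0.99                  # ">= 99%" from the SLO table


def freshness_sli(events, limit=FRESHNESS_LIMIT):
    """Return the fraction of events with arrival - createdAt <= limit.

    events: iterable of (created_at, arrived_at) timezone-aware datetimes.
    Returns None when there are no events.
    """
    fresh = 0
    total = 0
    for created_at, arrived_at in events:
        if created_at.tzinfo is None or arrived_at.tzinfo is None:
            # Naive timestamps would silently mix local time and UTC.
            raise ValueError("timestamps must be timezone-aware (UTC)")
        total += 1
        if arrived_at - created_at <= limit:
            fresh += 1
    return fresh / total if total else None


# Hypothetical sample: three fresh records and one that arrived late.
base = datetime(2025, 11, 14, 12, 0, tzinfo=timezone.utc)
events = [
    (base, base + timedelta(seconds=30)),
    (base, base + timedelta(minutes=2)),
    (base, base + timedelta(minutes=5)),
    (base, base + timedelta(minutes=7)),
]

sli = freshness_sli(events)
print(f"freshness={sli:.2%}, meets target: {sli >= FRESHNESS_TARGET}")
```

Rejecting naive timestamps up front reflects the table's note that ingest clocks must be in UTC; a skewed or local-time clock would otherwise show up as a false freshness breach or mask a real one.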

Implementation notes:

  • Emit latency/freshness SLOs via OTEL views that pre-aggregate by tenant and route to the platform SLO backend; keep bucket boundaries aligned with 50/100/250/450/650/1000ms.
  • Freshness SLI derived from ingest timestamps; ensure clocks are synchronized (NTP) and stored in UTC.
  • For air-gapped deployments without OTEL sinks, scrape the console exporter and push to an offline Prometheus instance; the same thresholds apply.

Related documents

  • docs/modules/excititor/architecture.md: API contract, AOC guardrails, connector responsibilities.
  • docs/modules/excititor/mirrors.md: AirGap/mirror ingestion checklist (feeds into EXCITITOR-AIRGAP-56/57).
  • docs/modules/platform/architecture-overview.md#observability: platform-wide telemetry guidance.