Excititor Observability Guide
Added 2025-11-14 alongside Sprint 119 (EXCITITOR-AIAI-31-003). Complements the AirGap/mirror runbooks under the same folder.
Excititor’s evidence APIs now emit first-class OpenTelemetry metrics so Lens, Advisory AI, and Ops can detect misuse or missing provenance without paging through logs. This document lists the counters/histograms shipped by the WebService (src/Excititor/StellaOps.Excititor.WebService) and how to hook them into your exporters/dashboards.
Telemetry prerequisites
- Enable Excititor:Telemetry in the service configuration (appsettings.*), ensuring metrics export is on. The WebService automatically adds the evidence meter (StellaOps.Excititor.WebService.Evidence) alongside the ingestion meter.
- Deploy at least one OTLP or console exporter (see TelemetryExtensions.ConfigureExcititorTelemetry). If your region lacks OTLP transport, fall back to scraping the console exporter for smoke tests.
- Coordinate with the Ops/Signals guild to provision the span/metric sinks referenced in docs/modules/platform/architecture-overview.md#observability.
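As a pre-flight aid for the first prerequisite, the following minimal sketch checks an appsettings JSON document for the metrics switch. It assumes the colon-delimited key Excititor:Telemetry:EnableMetrics (named under Operational steps) maps to nested JSON objects in the usual .NET configuration style; the sample document and the helper names `lookup` and `metrics_enabled` are hypothetical and not part of Excititor.

```python
import json


def lookup(config, key):
    """Resolve a colon-delimited key such as 'A:B:C' against nested JSON."""
    node = config
    for part in key.split(":"):
        if not isinstance(node, dict):
            return None
        # .NET configuration keys are case-insensitive, so compare in lower case.
        matches = [v for k, v in node.items() if k.lower() == part.lower()]
        if not matches:
            return None
        node = matches[0]
    return node


def metrics_enabled(config):
    """True only when the metrics switch is present and set to JSON true."""
    return lookup(config, "Excititor:Telemetry:EnableMetrics") is True


# Hypothetical appsettings fragment; only the key path comes from this guide.
SAMPLE = json.loads('{"Excititor": {"Telemetry": {"EnableMetrics": true}}}')

if __name__ == "__main__":
    print(metrics_enabled(SAMPLE))
```

The check deliberately ignores environment-variable or command-line overrides, so treat a negative result as a prompt to inspect the effective configuration rather than as proof that telemetry is off.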
Metrics reference
| Metric | Type | Description | Key dimensions |
|---|---|---|---|
| excititor.vex.observation.requests | Counter | Number of /v1/vex/observations/{vulnerabilityId}/{productKey} requests handled. | tenant, outcome (success, error, cancelled), truncated (true/false) |
| excititor.vex.observation.statement_count | Histogram | Distribution of statements returned per observation projection request. | tenant, outcome |
| excititor.vex.signature.status | Counter | Signature status per statement (missing vs. unverified). | tenant, status (missing, unverified) |
| excititor.vex.aoc.guard_violations | Counter | Aggregated count of Aggregation-Only Contract violations detected by the WebService (ingest + /v1/vex/aoc/verify). | tenant, surface (ingest, aoc_verify, etc.), code (AOC error code) |
| excititor.vex.chunks.requests | Counter | Requests to /v1/vex/evidence/chunks (NDJSON stream). | tenant, outcome (success, error, cancelled), truncated (true/false) |
| excititor.vex.chunks.bytes | Histogram | Size of NDJSON chunk streams served (bytes). | tenant, outcome |
| excititor.vex.chunks.records | Histogram | Count of evidence records emitted per chunk stream. | tenant, outcome |
All metrics originate from the EvidenceTelemetry helper (src/Excititor/StellaOps.Excititor.WebService/Telemetry/EvidenceTelemetry.cs). When disabled (telemetry off), the helper is inert.
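To cross-check the two chunk histograms from the client side, a captured NDJSON response can be summarized as below. This is a rough sketch only: the function name `summarize_ndjson` and the sample fields are hypothetical, and the server may count bytes or records slightly differently (for example around trailing newlines), so expect approximate rather than exact agreement with excititor.vex.chunks.bytes and excititor.vex.chunks.records.

```python
import json


def summarize_ndjson(body):
    """Return the byte size and record count of an NDJSON payload (bytes)."""
    # Blank lines carry no record, so skip them before counting.
    lines = [line for line in body.split(b"\n") if line.strip()]
    for line in lines:
        # Raises ValueError if a line is not valid JSON, which flags a broken stream.
        json.loads(line)
    return {"bytes": len(body), "records": len(lines)}


if __name__ == "__main__":
    # Hypothetical two-record stream; real evidence records have other fields.
    sample = b'{"a":1}\n{"a":2}\n'
    print(summarize_ndjson(sample))
```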
Dashboard hints
- Advisory-AI readiness – alert when excititor.vex.signature.status{status="missing"} spikes for a tenant, indicating connectors aren’t supplying signatures.
- Guardrail monitoring – graph excititor.vex.aoc.guard_violations per code to catch upstream feed regressions before they pollute Evidence Locker or Lens caches.
- Capacity planning – histogram percentiles of excititor.vex.observation.statement_count feed API sizing (higher counts mean Advisory AI is requesting broad scopes).
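The "spike" in the Advisory-AI readiness hint can be approximated by comparing two scrapes of the missing-signature counter per tenant. The sketch below assumes you already have cumulative values of excititor.vex.signature.status{status="missing"} keyed by tenant; the function name, the input shape, and the threshold are illustrative choices, not values defined by Excititor.

```python
def missing_signature_spikes(previous, current, threshold):
    """Return tenants whose missing-signature count grew by more than threshold.

    previous/current map tenant -> cumulative counter value at two scrape times.
    """
    spikes = {}
    for tenant, now in current.items():
        before = previous.get(tenant, 0)
        # Counters are cumulative; a lower value suggests a process restart,
        # in which case the new value is the whole increase since the reset.
        delta = now - before if now >= before else now
        if delta > threshold:
            spikes[tenant] = delta
    return spikes


if __name__ == "__main__":
    earlier = {"tenant-a": 10, "tenant-b": 5}
    later = {"tenant-a": 12, "tenant-b": 80, "tenant-c": 3}
    print(missing_signature_spikes(earlier, later, threshold=20))
```

In a real dashboard the same comparison is normally expressed as a rate query in the metrics backend; the Python form is only meant to make the logic explicit.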
Operational steps
- Enable telemetry: set Excititor:Telemetry:EnableMetrics=true, configure OTLP endpoints/headers as described in TelemetryExtensions.
- Add dashboards: import panels referencing the metrics above (see Grafana JSON snippets in Ops repo once merged).
- Alerting: add rules for high guard violation rates, missing signatures, and abnormal chunk bytes/record counts. Tie alerts back to connectors via tenant metadata.
- Post-deploy checks: after each release, verify metrics emit by curling /v1/vex/observations/... and /v1/vex/evidence/chunks, watching the console exporter (dev) or OTLP (prod).
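For the post-deploy check, the request URLs can be assembled from the route template listed in the metrics reference. The sketch below is an assumption-laden helper: the base URL, vulnerability id, and product key are placeholders, `smoke_urls` is a hypothetical name, and it is not confirmed here whether the service accepts percent-encoded path segments or requires query parameters on the chunks endpoint.

```python
from urllib.parse import quote


def smoke_urls(base_url, vulnerability_id, product_key):
    """Build the two evidence URLs to request after a release."""
    base = base_url.rstrip("/")
    # Encode each path segment fully so characters like '/' or '@' in a
    # product key cannot be mistaken for extra path segments.
    vuln = quote(vulnerability_id, safe="")
    product = quote(product_key, safe="")
    return [
        f"{base}/v1/vex/observations/{vuln}/{product}",
        f"{base}/v1/vex/evidence/chunks",
    ]


if __name__ == "__main__":
    # Placeholder host and identifiers, chosen only for illustration.
    for url in smoke_urls("https://excititor.example.internal/",
                          "CVE-2024-0001", "pkg:npm/demo@1.0.0"):
        print(url)
```

Pass the resulting URLs to curl or any HTTP client, then confirm that the request counters move in the console exporter (dev) or the OTLP backend (prod).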
SLOs (Sprint 119 – OBS-51-001)
The following SLOs apply to Excititor evidence read paths when telemetry is enabled. Record them in the shared SLO registry and alert via the platform alertmanager.
| Surface | SLI | Target | Window | Burn alert | Notes |
|---|---|---|---|---|---|
| /v1/vex/observations | p95 latency | ≤ 450 ms | 7d | 2 % over 1h | Measured on successful responses only; tenant scoped. |
| /v1/vex/observations | freshness | ≥ 99 % within 5 min of upstream ingest | 7d | 5 % over 4h | Derived from arrival minus createdAt; requires ingest clocks in UTC. |
| /v1/vex/observations | signature presence | ≥ 98 % statements with signature present | 7d | 3 % over 24h | Use excititor.vex.signature.status{status="missing"}. |
| /v1/vex/evidence/chunks | p95 stream duration | ≤ 600 ms | 7d | 2 % over 1h | From request start to last NDJSON write; excludes client disconnects. |
| /v1/vex/evidence/chunks | truncation rate | ≤ 1 % truncated streams | 7d | 1 % over 1h | excititor.vex.chunks.records with truncated=true. |
| AOC guardrail | zero hard violations | 0 | continuous | immediate | Any excititor.vex.aoc.guard_violations with severity error pages ops. |
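Ratio-style rows in the table, such as the truncation rate, reduce to a simple proportion over the SLO window. The sketch below shows that arithmetic for the 1 % truncation target; the function names and the idea of passing pre-summed window counts are assumptions for illustration, and the counts should come from whichever metric carries the truncated dimension in your deployment.

```python
def truncation_rate(truncated, total):
    """Share of chunk streams that were truncated within the window."""
    # An empty window has no truncated streams, so report 0 rather than divide by zero.
    return truncated / total if total else 0.0


def meets_truncation_slo(truncated, total, target=0.01):
    """True when the truncation rate is at or below the target (1 % by default)."""
    return truncation_rate(truncated, total) <= target


if __name__ == "__main__":
    # Hypothetical 7-day totals for one tenant.
    print(meets_truncation_slo(truncated=5, total=1000))
    print(meets_truncation_slo(truncated=20, total=1000))
```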
Implementation notes:
- Emit latency/freshness SLOs via OTEL views that pre-aggregate by tenant and route to the platform SLO backend; keep bucket boundaries aligned with 50/100/250/450/650/1000 ms.
- The freshness SLI is derived from ingest timestamps; ensure clocks are synchronized (NTP) and timestamps are stored in UTC.
- For air-gapped deployments without OTEL sinks, scrape the console exporter and push to an offline Prometheus; the same thresholds apply.
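The bucket alignment in the first note matters because the 450 ms latency target coincides with a bucket edge, so the p95 check can be answered from bucket counts alone. The following sketch assumes explicit-bucket histogram data with inclusive upper bounds and one trailing overflow bucket; `p95_upper_bound` is a hypothetical helper and returns an upper bound on the percentile, not an interpolated value.

```python
import math

# Bucket boundaries from the implementation notes, in milliseconds.
BOUNDS_MS = [50, 100, 250, 450, 650, 1000]


def p95_upper_bound(bucket_counts, bounds=BOUNDS_MS, percentile=95):
    """Return the smallest bucket edge covering the requested percentile.

    bucket_counts has len(bounds) + 1 entries; the last one is the overflow
    bucket for observations above the highest boundary.
    """
    if len(bucket_counts) != len(bounds) + 1:
        raise ValueError("expected one count per boundary plus an overflow bucket")
    total = sum(bucket_counts)
    if total == 0:
        return None  # No observations, so no percentile to report.
    cumulative = 0
    for upper, count in zip(list(bounds) + [math.inf], bucket_counts):
        cumulative += count
        # Integer comparison avoids floating-point rounding at the threshold.
        if cumulative * 100 >= percentile * total:
            return upper
    return math.inf


if __name__ == "__main__":
    counts = [40, 30, 20, 6, 3, 1, 0]  # Hypothetical per-bucket request counts.
    bound = p95_upper_bound(counts)
    print(bound, bound <= 450)
```

If the returned edge is at or below 450, the latency SLO holds for that window; a result of 650 or higher means more than 5 % of requests exceeded the target.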
Related documents
- docs/modules/excititor/architecture.md – API contract, AOC guardrails, connector responsibilities.
- docs/modules/excititor/mirrors.md – AirGap/mirror ingestion checklist (feeds into EXCITITOR-AIRGAP-56/57).
- docs/modules/platform/architecture-overview.md#observability – platform-wide telemetry guidance.