AOC Observability Guide

Audience: Observability Guild, Concelier/Excititor SREs, platform operators.
Scope: Metrics, traces, logs, dashboards, and runbooks introduced as part of the Aggregation-Only Contract (AOC) rollout (Sprint 19).

This guide captures the canonical signals emitted by Concelier and Excititor once AOC guards are active. It explains how to consume the metrics in dashboards, correlate traces/logs for incident triage, and operate in offline environments. Pair this guide with the AOC reference and architecture overview.


1 · Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| ingestion_write_total | Counter | source, tenant, result (ok, reject, noop) | Counts write attempts to advisory_raw/vex_raw. Rejects correspond to guard failures. |
| ingestion_latency_seconds | Histogram | source, tenant, phase (fetch, transform, write) | Measures end-to-end runtime for ingestion stages. Use quantile=0.95 for alerting. |
| aoc_violation_total | Counter | source, tenant, code (ERR_AOC_00x) | Total guard violations bucketed by error code. Drives dashboard pills and alert thresholds. |
| ingestion_signature_verified_total | Counter | source, tenant, result (ok, fail, skipped) | Tracks signature/checksum verification outcomes. |
| advisory_revision_count | Gauge | source, tenant | Supersedes depth for raw documents; spikes indicate noisy upstream feeds. |
| verify_runs_total | Counter | tenant, initiator (ui, cli, api, scheduled) | How many stella aoc verify or /aoc/verify runs executed. |
| verify_duration_seconds | Histogram | tenant, initiator | Runtime of verification jobs; use P95 to detect regressions. |
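
The P95 guidance for the histogram metrics above can be illustrated with a minimal sketch of how a quantile is estimated from cumulative buckets. It uses linear interpolation inside a bucket, similar in spirit to Prometheus' histogram_quantile; the function name and the bucket layout are assumptions for illustration, not part of the Concelier/Excititor code base.

```python
from typing import Sequence, Tuple

INF = float("inf")


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    Values are assumed to be spread evenly inside each bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    if not buckets or buckets[-1][0] != INF:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == INF:
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical ingestion_latency_seconds buckets: (le, cumulative count).
example = [(1.0, 50), (5.0, 90), (10.0, 100), (INF, 100)]
p95 = histogram_quantile(0.95, example)
```

With these hypothetical buckets the 95th observation lands halfway through the 5s to 10s bucket, so the estimate sits between those two bounds rather than at a measured value. Bucket boundaries therefore limit how precise any P95 alert can be.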

1.1 · Alerts

  • Violation spike: Alert when increase(aoc_violation_total[15m]) > 0 for critical sources. Page SRE if code="ERR_AOC_005" (signature failure) or ERR_AOC_001 persists >30min.
  • Stale ingestion: Alert when max_over_time((ingestion_latency_seconds_sum / ingestion_latency_seconds_count)[30m:]) exceeds 30s or if ingestion_write_total has no growth for >60min.
  • Signature drop: Warn when rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0.
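
To make the alert conditions above concrete, here is a small sketch that evaluates the violation spike and the no-growth rule over raw counter samples. It is a simplification: unlike Prometheus' increase(), it does not extrapolate to the window edges, and the helper names are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix seconds, counter value), sorted by time


def increase(samples: List[Sample], now: float, window_s: float) -> float:
    """Counter growth inside [now - window_s, now].

    A drop in value is treated as a counter reset, so counting restarts from zero.
    """
    in_window = [value for ts, value in samples if now - window_s <= ts <= now]
    total = 0.0
    for prev, cur in zip(in_window, in_window[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def violation_spike(aoc_violation_total: List[Sample], now: float) -> bool:
    """True when aoc_violation_total grew at all in the last 15 minutes."""
    return increase(aoc_violation_total, now, 15 * 60) > 0


def ingestion_stalled(ingestion_write_total: List[Sample], now: float) -> bool:
    """True when ingestion_write_total shows no growth over the last 60 minutes.

    Fewer than two samples in the window also counts as "no growth".
    """
    return increase(ingestion_write_total, now, 60 * 60) == 0
```

Treating missing samples as a stall is a deliberate choice in this sketch: a silent exporter and a silent ingestion pipeline are both worth a look.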

1.2 · /obs/excititor/health

GET /obs/excititor/health (scope vex.admin) returns a compact snapshot for Grafana tiles and Console widgets:

  • ingest — overall status, worst lag (seconds), and the top connectors (status, lagSeconds, failure count, last success).
  • link — freshness of consensus/linkset processing plus document counts and the number currently carrying conflicts.
  • signature — recent coverage window (evaluated, with signatures, verified, failures, unsigned, coverage ratio).
  • conflicts — rolling totals grouped by status plus per-bucket trend data for charts.
{
  "generatedAt": "2025-11-08T11:00:00Z",
  "ingest": { "status": "healthy", "connectors": [ { "connectorId": "excititor:redhat", "lagSeconds": 45.3 } ] },
  "link": { "status": "warning", "lastConsensusAt": "2025-11-08T10:57:03Z" },
  "signature": { "status": "critical", "documentsEvaluated": 120, "verified": 30, "failures": 2 },
  "conflicts": { "status": "warning", "conflictStatements": 325, "trend": [ { "bucketStart": "2025-11-08T10:00:00Z", "conflicts": 130 } ] }
}

| Setting | Default | Purpose |
| --- | --- | --- |
| Excititor:Observability:IngestWarningThreshold | 06:00:00 | Connector lag before ingest.status becomes warning. |
| Excititor:Observability:IngestCriticalThreshold | 24:00:00 | Connector lag before ingest.status becomes critical. |
| Excititor:Observability:LinkWarningThreshold | 00:15:00 | Maximum acceptable delay between consensus recalculations. |
| Excititor:Observability:LinkCriticalThreshold | 01:00:00 | Delay that marks link status as critical. |
| Excititor:Observability:SignatureWindow | 12:00:00 | Lookback window for signature coverage. |
| Excititor:Observability:SignatureHealthyCoverage | 0.8 | Coverage ratio that still counts as healthy. |
| Excititor:Observability:SignatureWarningCoverage | 0.5 | Coverage ratio that flips the status to warning. |
| Excititor:Observability:ConflictTrendWindow | 24:00:00 | Rolling window used for conflict aggregation. |
| Excititor:Observability:ConflictTrendBucketMinutes | 60 | Resolution of conflict trend buckets. |
| Excititor:Observability:ConflictWarningRatio | 0.15 | Fraction of consensus docs with conflicts that triggers warning. |
| Excititor:Observability:ConflictCriticalRatio | 0.3 | Ratio that marks conflicts.status as critical. |
| Excititor:Observability:MaxConnectorDetails | 50 | Number of connector entries returned (keeps payloads small). |
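
The following sketch shows one plausible reading of how the default thresholds map to the status values in the snapshot. The boundary handling (inclusive comparisons), the coverage formula (verified divided by evaluated), and the handling of an empty window are assumptions; the actual Excititor implementation may differ.

```python
from datetime import timedelta

# Defaults copied from the settings table above.
INGEST_WARNING = timedelta(hours=6)
INGEST_CRITICAL = timedelta(hours=24)
SIGNATURE_HEALTHY_COVERAGE = 0.8
SIGNATURE_WARNING_COVERAGE = 0.5


def ingest_status(lag_seconds: float) -> str:
    """Classify connector lag against the ingest thresholds."""
    lag = timedelta(seconds=lag_seconds)
    if lag >= INGEST_CRITICAL:
        return "critical"
    if lag >= INGEST_WARNING:
        return "warning"
    return "healthy"


def signature_status(documents_evaluated: int, verified: int) -> str:
    """Classify signature coverage (assumed to be verified / evaluated)."""
    if documents_evaluated == 0:
        # Assumption: an empty window is not treated as a failure.
        return "healthy"
    coverage = verified / documents_evaluated
    if coverage >= SIGNATURE_HEALTHY_COVERAGE:
        return "healthy"
    if coverage >= SIGNATURE_WARNING_COVERAGE:
        return "warning"
    return "critical"
```

Under these assumptions the sample payload is self-consistent: a lag of 45.3 seconds is far below the six-hour warning threshold, and 30 verified out of 120 evaluated documents is a coverage of 0.25, below the 0.5 warning floor.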

1.3 · Regression & DI hygiene

  1. Keep storage/integration tests green when telemetry touches persistence.
    • ./tools/mongodb/local-mongo.sh start downloads MongoDB 6.0.16 (if needed), launches rs0, and prints export EXCITITOR_TEST_MONGO_URI=mongodb://.../excititor-tests. Copy that export into your shell.
    • ./tools/mongodb/local-mongo.sh restart is a shortcut for “stop if running, then start” using the same dataset—use it after tweaking config or when tests need a bounce without wiping fixtures.
    • ./tools/mongodb/local-mongo.sh clean stops the instance (if running) and deletes the managed data/log directories so storage tests begin from a pristine catalog.
    • Run dotnet test src/Excititor/__Tests/StellaOps.Excititor.Storage.Mongo.Tests/StellaOps.Excititor.Storage.Mongo.Tests.csproj -nologo -v minimal (add --filter if you only touched specific suites). These tests exercise the same write paths that feed the dashboards, so regressions show up immediately.
    • Run ./tools/mongodb/local-mongo.sh stop when finished so CI/dev hosts stay clean; the status, logs, and shell subcommands are available for troubleshooting.
  2. Declare optional Minimal API dependencies with [FromServices] ... = null. RequestDelegateFactory treats [FromServices] IVexSigner? signer = null (or similar) as optional, so host startup succeeds even when tests have not registered that service. This pattern keeps observability endpoints cancellable while avoiding brittle test overrides.

2 · Traces

2.1 · Span taxonomy

| Span name | Parent | Key attributes |
| --- | --- | --- |
| ingest.fetch | job root span | source, tenant, uri, contentHash |
| ingest.transform | ingest.fetch | documentType (csaf, osv, vex), payloadBytes |
| ingest.write | ingest.transform | collection (advisory_raw, vex_raw), result (ok, reject) |
| aoc.guard | ingest.write | code (on violation), violationCount, supersedes |
| verify.run | verification job root | tenant, window.from, window.to, sources, violations |
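
As a rough illustration of how the parent relationships above could be checked against exported trace data, the sketch below flags spans whose parent does not match the taxonomy. The span dictionary shape (id, name, parent_id) is an assumption for this example, not the format of any particular exporter.

```python
from typing import Dict, List, Optional

# Parent span expected for each child, taken from the taxonomy table.
# ingest.fetch and verify.run hang off job roots whose names the table
# does not specify, so they are not checked here.
EXPECTED_PARENT = {
    "ingest.transform": "ingest.fetch",
    "ingest.write": "ingest.transform",
    "aoc.guard": "ingest.write",
}


def misparented_spans(spans: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return the ids of spans whose parent name violates the taxonomy."""
    by_id = {span["id"]: span for span in spans}
    bad = []
    for span in spans:
        expected = EXPECTED_PARENT.get(span["name"])
        if expected is None:
            continue  # span type has no checked parent
        parent = by_id.get(span["parent_id"])
        if parent is None or parent["name"] != expected:
            bad.append(span["id"])
    return bad


trace = [
    {"id": "1", "name": "ingest.fetch", "parent_id": None},
    {"id": "2", "name": "ingest.transform", "parent_id": "1"},
    {"id": "3", "name": "ingest.write", "parent_id": "2"},
    {"id": "4", "name": "aoc.guard", "parent_id": "3"},
]
problems = misparented_spans(trace)  # empty list for a well-formed trace
```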

2.2 · Trace usage

  • Correlate UI dashboard entries with traces via traceId surfaced in violation drawers (docs/ui/console.md).
  • Use aoc.guard spans to inspect guard payload snapshots. Sensitive fields are redacted automatically; raw JSON lives in secure logs only.
  • For scheduled verification, filter traces by initiator="scheduled" to compare runtimes pre/post change.

2.3 · Telemetry configuration (Excititor)

  • Configure the web service via Excititor:Telemetry:

    {
      "Excititor": {
        "Telemetry": {
          "Enabled": true,
          "EnableTracing": true,
          "EnableMetrics": true,
          "ServiceName": "stellaops-excititor-web",
          "OtlpEndpoint": "http://otel-collector:4317",
          "OtlpHeaders": {
            "Authorization": "Bearer ${OTEL_PUSH_TOKEN}"
          },
          "ResourceAttributes": {
            "env": "prod-us",
            "service.group": "ingestion"
          }
        }
      }
    }
    
  • Point the OTLP endpoint at the shared collector profile from §1 so Excititor metrics land in the ingestion_* dashboards next to Concelier. Resource attributes drive Grafana filtering (e.g., env, service.group).

  • For offline/air-gap bundles set Enabled=false and collect the file exporter artifacts from the Offline Kit; import them into Grafana after transfer to keep time-to-truth dashboards consistent.

  • Local development templates: run tools/mongodb/local-mongo.sh start to spin up a single-node replica set plus the matching mongosh client. The script prints the export EXCITITOR_TEST_MONGO_URI=... command that integration tests (e.g., StellaOps.Excititor.Storage.Mongo.Tests) will honor. Use restart for a quick bounce, clean to wipe data between suites, and stop when finished.


3 · Logs

Structured logs include the following keys (JSON):

| Key | Description |
| --- | --- |
| traceId | Matches OpenTelemetry trace/span IDs for cross-system correlation. |
| tenant | Tenant identifier enforced by Authority middleware. |
| source.vendor | Logical source (e.g., redhat, ubuntu, osv, ghsa). |
| upstream.upstreamId | Vendor-provided ID (CVE, GHSA, etc.). |
| contentHash | sha256: digest of the raw document. |
| violation.code | Present when guard rejects ERR_AOC_00x. |
| verification.window | Present on /aoc/verify job logs. |

Excititor APIs mirror these identifiers via response headers:

| Header | Purpose |
| --- | --- |
| X-Stella-TraceId | W3C trace/span identifier for deep-linking from Console → Grafana/Loki. |
| X-Stella-CorrelationId | Stable correlation identifier (respects inbound header or falls back to the request trace ID). |
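
The fallback rule for the correlation identifier can be sketched as follows. This assumes the inbound header carries the same name as the response header (X-Stella-CorrelationId), which the table does not state explicitly; the function name is hypothetical.

```python
from typing import Dict, Mapping


def response_headers(request_headers: Mapping[str, str], trace_id: str) -> Dict[str, str]:
    """Build the identifier headers for a response.

    The correlation ID reuses a non-empty inbound value and otherwise
    falls back to the request trace ID.
    """
    # HTTP header names are case-insensitive, so normalise before the lookup.
    inbound = {name.lower(): value for name, value in request_headers.items()}
    correlation_id = inbound.get("x-stella-correlationid", "").strip() or trace_id
    return {
        "X-Stella-TraceId": trace_id,
        "X-Stella-CorrelationId": correlation_id,
    }
```

Reusing the inbound value keeps a single identifier stable across services, while the fallback guarantees that every response can still be tied to a trace.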

Logs are shipped to the central Loki/Elasticsearch cluster. Use the template query:

{app="concelier-web"} | json | violation_code != ""

to spot active AOC violations.
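
When Loki is not reachable (for example while inspecting an exported log file), the same filter can be approximated locally. This sketch assumes one JSON object per line and accepts violation.code either as a nested object or as a flat dotted key, since the exact on-disk shape is not specified here.

```python
import json
from typing import Iterable, Iterator


def violation_code(record: dict) -> str:
    """Return the violation code of a log record, or "" when absent."""
    nested = record.get("violation")
    if isinstance(nested, dict):
        return str(nested.get("code") or "")
    return str(record.get("violation.code") or "")


def active_violations(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed log records that carry a non-empty violation code."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and violation_code(record):
            yield record


sample_lines = [
    '{"traceId": "t1", "tenant": "default", "violation": {"code": "ERR_AOC_001"}}',
    '{"traceId": "t2", "tenant": "default"}',
    "not json",
]
hits = list(active_violations(sample_lines))
```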

3.1 · Advisory chunk API (AdvisoryAI feeds)

AdvisoryAI now leans on Concelier's /advisories/{key}/chunks endpoint for deterministic evidence packs. The service exports dedicated metrics so dashboards can highlight latency spikes, cache noise, or aggressive guardrail filtering before they impact AdvisoryAI responses.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| advisory_ai_chunk_requests_total | Counter | tenant, result, truncated, cache | Count of chunk API calls, tagged with cache hits/misses and truncation state. |
| advisory_ai_chunk_latency_milliseconds | Histogram | tenant, result, truncated, cache | End-to-end build latency (milliseconds) for each chunk request. |
| advisory_ai_chunk_segments | Histogram | tenant, result, truncated | Number of chunk segments returned to the caller; watch for sudden drops tied to guardrails. |
| advisory_ai_chunk_sources | Histogram | tenant, result | How many upstream observations/sources contributed to a response (after observation limits). |
| advisory_ai_guardrail_blocks_total | Counter | tenant, reason, cache | Per-reason count of segments suppressed by guardrails (length, normalization, character set). |

Dashboards should plot latency P95/P99 next to cache hit rates and guardrail block deltas to catch degradation early. AdvisoryAI CLI/Console surfaces the same metadata so support engineers can correlate with Grafana/Loki entries using traceId/correlationId headers.
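
A cache hit rate for such a panel can be derived from the labelled request counter, as in this minimal sketch. The label value "hit" for the cache label is an assumption; check the emitted series for the actual values before reusing it.

```python
from typing import Iterable, Mapping, Optional, Tuple

LabelledSample = Tuple[Mapping[str, str], float]  # (labels, counter value)


def cache_hit_rate(samples: Iterable[LabelledSample]) -> Optional[float]:
    """Fraction of advisory_ai_chunk_requests_total served from cache.

    Returns None when no requests were recorded, to avoid dividing by zero.
    """
    hits = 0.0
    total = 0.0
    for labels, value in samples:
        total += value
        if labels.get("cache") == "hit":  # assumed label value
            hits += value
    return hits / total if total else None


series = [
    ({"tenant": "default", "result": "ok", "truncated": "false", "cache": "hit"}, 75.0),
    ({"tenant": "default", "result": "ok", "truncated": "false", "cache": "miss"}, 25.0),
]
rate = cache_hit_rate(series)
```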


4 · Dashboards

Primary Grafana dashboard: “AOC Ingestion Health” (dashboards/aoc-ingestion.json). Panels include:

  1. Sources overview: table fed by ingestion_write_total and ingestion_latency_seconds (mirrors Console tiles).
  2. Violation trend: stacked bar chart of aoc_violation_total per code.
  3. Signature success rate: timeseries derived from ingestion_signature_verified_total.
  4. Supersedes depth: gauge showing advisory_revision_count P95.
  5. Verification runs: histogram and latency boxplot using verify_runs_total / verify_duration_seconds.

Secondary dashboards:

  • AOC Alerts (Ops view): summarises active alerts, last verify run, and links to incident runbook.
  • Offline Mode Dashboard: fed from Offline Kit imports; highlights snapshot age and queued verification jobs.

Update docs/assets/dashboards/ with screenshots when Grafana capture pipeline produces the latest renders.


5 · Operational workflows

  1. During ingestion incident:
    • Check Console dashboard for offending sources.
    • Pivot to logs using document contentHash.
    • Re-run stella sources ingest --dry-run with problematic payloads to validate fixes.
    • After remediation, run stella aoc verify --since 24h and confirm exit code 0.
  2. Scheduled verification:
    • Configure cron job to run stella aoc verify --format json --export ....
    • Ship JSON to aoc-verify bucket and ingest into metrics using custom exporter.
    • Alert on missing exports (no file uploaded within 26h).
  3. Offline kit validation:
    • Use Offline Dashboard to ensure snapshots contain latest metrics.
    • Run verification reports locally and attach to bundle before distribution.
  4. Incident toggle audit:
    • Authority requires incident_reason when issuing obs:incident tokens; plan your runbooks to capture business justification.
    • Auditors can call /authority/audit/incident?limit=100 with the tenant header to list recent incident activations, including reason and issuer.
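
The 26h freshness rule from the scheduled verification workflow can be expressed as a small check. How the timestamp of the last upload is obtained (bucket listing, exporter metric, and so on) is left open; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_EXPORT_AGE = timedelta(hours=26)  # daily job plus a two-hour grace period


def export_missing(last_upload: Optional[datetime], now: datetime) -> bool:
    """True when no verification export was uploaded within the last 26 hours."""
    if last_upload is None:
        return True  # nothing has ever been uploaded
    return now - last_upload > MAX_EXPORT_AGE


now = datetime(2025, 11, 8, 12, 0, tzinfo=timezone.utc)
stale = export_missing(now - timedelta(hours=27), now)
fresh = export_missing(now - timedelta(hours=25), now)
```

A 26h window rather than 24h leaves room for a daily cron job that starts late or runs long without paging anyone.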

6 · Offline considerations

  • Metrics exporters bundled with Offline Kit write to local Prometheus snapshots; sync them with central Grafana once connectivity is restored.
  • CLI verification reports should be hashed (sha256sum) and archived for audit trails.
  • Dashboards include offline data sources (prometheus-offline) switchable via dropdown.
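
For environments where sha256sum is not available, report hashing can be reproduced with the standard library. The sketch below emits lines in the same "digest, two spaces, file name" layout that sha256sum prints; the file name in the usage note is a placeholder.

```python
import hashlib
from pathlib import Path


def sha256_line(path: Path) -> str:
    """Return a sha256sum-style line for the given file."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large reports do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return f"{digest.hexdigest()}  {path.name}"
```

Calling sha256_line(Path("verify-report.json")) yields one line that can be appended to a checksum manifest and archived next to the report.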

7 · References


8 · Compliance checklist

  • Metrics documented with label sets and alert guidance.
  • Tracing span taxonomy aligned with Concelier/Excititor implementation.
  • Log schema matches structured logging contracts (traceId, tenant, source, contentHash).
  • Grafana dashboard references verified and screenshots scheduled.
  • Offline/air-gap workflow captured.
  • Cross-links to AOC reference, console, and CLI docs included.
  • Observability Guild sign-off scheduled (OWNER: @obs-guild, due 2025-10-28).

Last updated: 2025-10-26 (Sprint 19).