docs consolidation and others
docs/modules/telemetry/guides/aggregation.md
Normal file
@@ -0,0 +1,37 @@
# Aggregation Observability

Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-LNM-22-007)

Covers metrics, traces, and logs for Link-Not-Merge (LNM) aggregation and evidence pipelines.

## Metrics

- `aggregation_ingest_latency_seconds` (histogram) — end-to-end ingest latency per statement; labels: `tenant`, `source`, `status`.
- `aggregation_conflict_total` (counter) — conflicts encountered; labels: `tenant`, `advisory`, `product`, `reason`.
- `aggregation_overlay_cache_hits_total` / `_misses_total` (counters) — overlay cache effectiveness; labels: `tenant`, `cache`.
- `aggregation_vex_gate_total` (counter) — VEX gating outcomes; labels: `tenant`, `status` (`affected`, `not_affected`, `unknown`).
- `aggregation_queue_depth` (gauge) — pending statements per tenant.
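The overlay cache counters above feed the cache-hit SLO later in this guide; a hit rate can be derived directly from them. The query below is a sketch using the metric names listed above (the `clamp_min` guard against empty windows is an illustrative choice, not a prescribed convention):

```promql
# Overlay cache hit rate per tenant over 15 minutes
sum by (tenant) (rate(aggregation_overlay_cache_hits_total[15m]))
/
clamp_min(
  sum by (tenant) (
      rate(aggregation_overlay_cache_hits_total[15m])
    + rate(aggregation_overlay_cache_misses_total[15m])
  ),
  1e-9
)
```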
## Traces

- Span name `aggregation.process` with attributes:
  - `tenant`, `advisory`, `product`, `vex_status`, `source_kind`
  - `overlay_version`, `cache_hit` (bool)
- Link to the upstream ingest span (`traceparent` forwarded by Excititor/Concelier).
- Export to OTLP; sampling defaults to 10% outside prod and 100% for `status=error`.

## Logs

Structured JSON with fields: `tenant`, `advisory`, `product`, `vex_status`, `decision` (`merged|suppressed|dropped`), `reason`, `duration_ms`, `trace_id`.
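Putting those fields together, a single log record might look like the following (all values are illustrative only):

```json
{
  "tenant": "acme",
  "advisory": "CVE-2025-0001",
  "product": "pkg:npm/lodash@4.17.21",
  "vex_status": "not_affected",
  "decision": "suppressed",
  "reason": "vex_gate",
  "duration_ms": 12,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
```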
## SLOs

- **Ingest latency**: p95 < 500 ms per statement (steady state).
- **Cache hit rate**: > 80% for overlays; alert when below for 15 minutes.
- **Error rate**: < 0.1% over a 10-minute window.

## Alerts

- `HighConflictRate` — `aggregation_conflict_total` delta > 100/minute per tenant.
- `QueueBacklog` — `aggregation_queue_depth` > 10k for 5 minutes.
- `LowCacheHit` — overlay cache hit rate < 60% for 10 minutes.
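The alert conditions above can be expressed as Prometheus rules; the sketch below uses the metric names from the Metrics section, and the severities and summaries are illustrative, so adapt them to your alerting setup:

```yaml
groups:
  - name: aggregation
    rules:
      - alert: QueueBacklog
        expr: aggregation_queue_depth > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Aggregation queue backlog above 10k statements"

      - alert: HighConflictRate
        expr: sum by (tenant) (rate(aggregation_conflict_total[1m])) * 60 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Conflict rate above 100/minute for tenant {{ $labels.tenant }}"
```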
## Offline/air-gap considerations

- Export metrics to a local Prometheus scrape; no external sinks.
- Trace sampling and log retention are configured via environment, without needing control-plane access.
- Deterministic ordering preserved; cache warmers seeded from bundled fixtures.
docs/modules/telemetry/guides/cli-incident-toggle-12-001.md
Normal file
@@ -0,0 +1,29 @@
# CLI incident toggle contract (CLI-OBS-12-001)

**Goal**: define a deterministic CLI flag and config surface to enter/exit incident mode, as required by TELEMETRY-OBS-55-001/56-001.

## Flags and config

- CLI flag: `--incident-mode` (bool). Defaults to false.
- Config keys: `Telemetry:Incident:Enabled` (bool) and `Telemetry:Incident:TTL` (TimeSpan).
- When both the flag and config are specified, the flag wins (opt-in only; the flag cannot disable incident mode that config enables).
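In .NET configuration terms, the two keys map onto a JSON fragment like the following (section names come from the contract above; the TTL value is an illustrative TimeSpan literal matching the 30-minute default):

```json
{
  "Telemetry": {
    "Incident": {
      "Enabled": false,
      "TTL": "00:30:00"
    }
  }
}
```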
## Effects when enabled

- Raise the sampling-rate ceiling to 100% for telemetry within the process.
- Add the tag `incident=true` to logs/metrics/traces.
- Shorten the exporter/reporting flush interval to 5 s; disable external exporters when `Sealed=true`.
- Emit the activation audit event `telemetry.incident.activated` with fields `{tenant, actor, source, expires_at}`.
## Persistence

- The runtime incident-flag value is stored in the local state file `~/.stellaops/incident-mode.json` with fields `{enabled, set_at, expires_at, actor}` for offline continuity.
- The file is tenant-scoped; permissions 0600.
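A populated state file would look like this (field names from the contract above; values are illustrative):

```json
{
  "enabled": true,
  "set_at": "2025-11-20T14:05:00Z",
  "expires_at": "2025-11-20T14:35:00Z",
  "actor": "ops@example.com"
}
```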
## Expiry / TTL

- Default TTL: 30 minutes unless `Telemetry:Incident:TTL` is provided.
- On expiry, emit the `telemetry.incident.expired` audit event.

## Validation expectations

- The CLI should refuse `--incident-mode` if `--sealed` is set and external exporters are configured (the exporters must be dropped first).
- Unit tests should cover precedence (flag over config), TTL expiry, state-file permissions, and audit emissions.

## Provenance

- Authored 2025-11-20 to unblock PREP-CLI-OBS-12-001 and TELEMETRY-OBS-55-001.
docs/modules/telemetry/guides/fn-drift.md
Normal file
@@ -0,0 +1,177 @@
# FN-Drift Metrics Reference

> **Sprint:** SPRINT_3404_0001_0001
> **Module:** Scanner Storage / Telemetry

## Overview

False-Negative Drift (FN-Drift) measures how often vulnerability classifications change from "not affected" or "unknown" to "affected" during rescans. This metric is critical for:

- **Accuracy Assessment**: Tracking scanner reliability over time
- **SLO Compliance**: Meeting false-negative rate targets
- **Root Cause Analysis**: Stratified analysis by drift cause
- **Feed Quality**: Identifying problematic vulnerability feeds
## Metrics

### Gauges (30-day rolling window)

| Metric | Type | Description |
|--------|------|-------------|
| `scanner.fn_drift.percent` | Gauge | 30-day rolling FN-Drift percentage |
| `scanner.fn_drift.transitions_30d` | Gauge | Total FN transitions in the last 30 days |
| `scanner.fn_drift.evaluated_30d` | Gauge | Total findings evaluated in the last 30 days |
| `scanner.fn_drift.cause.feed_delta` | Gauge | FN transitions caused by feed updates |
| `scanner.fn_drift.cause.rule_delta` | Gauge | FN transitions caused by rule changes |
| `scanner.fn_drift.cause.lattice_delta` | Gauge | FN transitions caused by VEX lattice changes |
| `scanner.fn_drift.cause.reachability_delta` | Gauge | FN transitions caused by reachability changes |
| `scanner.fn_drift.cause.engine` | Gauge | FN transitions caused by engine changes (should be ~0) |

### Counters (all-time)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `scanner.classification_changes_total` | Counter | `cause` | Total classification status changes |
| `scanner.fn_transitions_total` | Counter | `cause` | Total false-negative transitions |
## Classification Statuses

| Status | Description |
|--------|-------------|
| `new` | First scan, no previous status |
| `unaffected` | Confirmed not affected |
| `unknown` | Status unknown/uncertain |
| `affected` | Confirmed affected |
| `fixed` | Previously affected, now fixed |

## Drift Causes

| Cause | Description | Expected Impact |
|-------|-------------|-----------------|
| `feed_delta` | Vulnerability feed updated (NVD, GHSA, OVAL) | High - most common cause |
| `rule_delta` | Policy rules changed | Medium - controlled by policy team |
| `lattice_delta` | VEX lattice state changed | Medium - VEX updates |
| `reachability_delta` | Reachability analysis changed | Low - improved analysis |
| `engine` | Scanner engine change | ~0 - determinism violation if > 0 |
| `other` | Unknown/unclassified cause | Low - investigate if high |
## FN-Drift Definition

A **False-Negative Transition** occurs when:

- The previous status was `unaffected` or `unknown`
- The new status is `affected`

This indicates the scanner previously classified a finding as "not vulnerable" but now classifies it as "vulnerable" - a false negative in the earlier scan.

### FN-Drift Rate Calculation

```
FN-Drift % = (FN Transitions / Total Reclassified) × 100
```

Where:

- **FN Transitions**: Count of `(unaffected|unknown) → affected` changes
- **Total Reclassified**: Count of all status changes (excluding `new`)
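The calculation above can be sketched in code. The helper below is purely illustrative (it is not part of the scanner codebase) and applies the definition to a list of `(previous_status, new_status)` pairs:

```python
# Illustrative FN-Drift calculation (hypothetical helper, not scanner code).

FN_PREVIOUS = {"unaffected", "unknown"}

def fn_drift_percent(transitions):
    """transitions: iterable of (previous_status, new_status) pairs.

    Returns the FN-Drift percentage over all reclassifications,
    excluding first-time classifications (previous_status == "new").
    """
    reclassified = [(prev, new) for prev, new in transitions if prev != "new"]
    if not reclassified:
        return 0.0
    fn = sum(
        1 for prev, new in reclassified
        if prev in FN_PREVIOUS and new == "affected"
    )
    return 100.0 * fn / len(reclassified)
```

For example, two FN transitions among four reclassifications yield 50%.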
## SLO Thresholds

| SLO Level | FN-Drift Threshold | Alert Severity |
|-----------|-------------------|----------------|
| Target | < 1.0% | None |
| Warning | 1.0% - 2.5% | Warning |
| Critical | > 2.5% | Critical |
| Engine Drift | > 0% | Page |
### Alerting Rules

```yaml
# Example Prometheus alerting rules
groups:
  - name: fn-drift
    rules:
      - alert: FnDriftWarning
        expr: scanner_fn_drift_percent > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "FN-Drift rate above warning threshold"

      - alert: FnDriftCritical
        expr: scanner_fn_drift_percent > 2.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "FN-Drift rate above critical threshold"

      - alert: EngineDriftDetected
        expr: scanner_fn_drift_cause_engine > 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Engine-caused FN drift detected - determinism violation"
```
## Dashboard Queries

### FN-Drift Trend (Grafana)

```promql
# 30-day rolling FN-Drift percentage
scanner_fn_drift_percent

# FN transitions by cause
sum by (cause) (rate(scanner_fn_transitions_total[1h]))

# Classification changes rate
sum by (cause) (rate(scanner_classification_changes_total[1h]))
```

### Drift Cause Breakdown

```promql
# Pie chart of drift causes
topk(5,
  sum by (cause) (
    increase(scanner_fn_transitions_total[24h])
  )
)
```
## Database Schema

### classification_history Table

```sql
CREATE TABLE scanner.classification_history (
    id BIGSERIAL PRIMARY KEY,
    artifact_digest TEXT NOT NULL,
    vuln_id TEXT NOT NULL,
    package_purl TEXT NOT NULL,
    tenant_id UUID NOT NULL,
    manifest_id UUID NOT NULL,
    execution_id UUID NOT NULL,
    previous_status TEXT NOT NULL,
    new_status TEXT NOT NULL,
    is_fn_transition BOOLEAN GENERATED ALWAYS AS (...) STORED,
    cause TEXT NOT NULL,
    cause_detail JSONB,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
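Given that table, a 30-day FN-Drift rate can be computed directly. The query below is a sketch against the schema above; it assumes the elided `is_fn_transition` generated-column expression encodes the FN definition from earlier in this reference:

```sql
-- Illustrative: 30-day FN-Drift percentage per tenant (not a shipped view).
SELECT
    tenant_id,
    100.0 * COUNT(*) FILTER (WHERE is_fn_transition)
          / NULLIF(COUNT(*), 0) AS fn_drift_percent
FROM scanner.classification_history
WHERE changed_at >= NOW() - INTERVAL '30 days'
  AND previous_status <> 'new'
GROUP BY tenant_id;
```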
### fn_drift_stats Materialized View

Aggregated daily statistics for efficient dashboard queries:

- Day bucket
- Tenant ID
- Cause breakdown
- FN count and percentage

## Related Documentation

- [Determinism Technical Reference](../product-advisories/14-Dec-2025%20-%20Determinism%20and%20Reproducibility%20Technical%20Reference.md) - Section 13.2
- [Scanner Architecture](../modules/scanner/architecture.md)
- [Telemetry Stack](../modules/telemetry/architecture.md)
docs/modules/telemetry/guides/logging.md
Normal file
@@ -0,0 +1,49 @@
# Logging Standards (DOCS-OBS-50-003)

Last updated: 2025-12-15

## Goals

- Deterministic, structured logs for all services.
- Keep tenant-safety and redaction guarantees while enabling search, correlation, and offline analysis.

## Log shape (JSON)

Required fields:

- `timestamp` (UTC ISO-8601)
- `tenant`, `workload` (service name), `env`, `region`, `version`
- `level` (`debug|info|warn|error|fatal`)
- `category` (logger/category name), `operation` (verb/action)
- `trace_id`, `span_id`, `correlation_id` (if external)
- `message` (concise, no secrets)
- `status` (`ok|error|fault|throttle`)
- `error.code`, `error.message` (redacted), `retryable` (bool) when `status != ok`

Optional but recommended:

- `resource` (subject id/purl/path when safe), `http.method`, `http.status_code`, `duration_ms`, `host`, `pid`, `thread`.
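Putting the required fields together, a single NDJSON record might look like this (every value, including the error code, is illustrative only):

```json
{"timestamp":"2025-12-15T08:30:00Z","tenant":"acme","workload":"scanner-web","env":"prod","region":"eu-west-1","version":"1.4.2","level":"error","category":"Scanner.Ingest","operation":"ingest.write","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7","message":"write rejected by guard","status":"error","error.code":"ERR_WRITE_REJECTED","error.message":"[redacted]","retryable":false,"duration_ms":41}
```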
## Offline Kit / air-gap import fields

When emitting logs for Offline Kit import/activation flows, keep field names stable:

- Required scope key: `tenant_id`
- Common keys: `bundle_type`, `bundle_digest`, `bundle_path`, `manifest_version`, `manifest_created_at`
- Force activation keys: `force_activate`, `force_activate_reason`
- Outcome keys: `result`, `reason_code`, `reason_message`
- Quarantine keys: `quarantine_id`, `quarantine_path`
## Redaction rules

- Never log Authorization headers, tokens, passwords, private keys, or full request/response bodies.
- Redact to `"[redacted]"` and add `redaction.reason` (`secret|pii|policy`).
- Hash low-cardinality identifiers when needed (`sha256` hex) and mark `hashed=true`.
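The hashing rule takes only a few lines; the sketch below is a hypothetical helper (not an existing StellaOps API) that emits the hex digest together with the `hashed=true` marker:

```python
import hashlib

def hash_identifier(value: str) -> dict:
    """Replace an identifier with its sha256 hex digest and mark it hashed."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {"value": digest, "hashed": True}
```

Log the returned fields in place of the raw identifier.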
## Determinism & offline posture

- Stable key ordering is not required, but the field set must be consistent per log type.
- No external enrichment; rely on bundled metadata (service map, tenant labels).
- All times UTC; newline-delimited JSON (NDJSON); LF line endings.

## Sampling & rate limits

- Info logs are rate-limited per component (default 100/s); warn/error/fatal are never sampled.
- Structured audit logs (`category=audit`) are never sampled and must include `actor`, `action`, `target`, `result`.

## Validation checklist

- [ ] Required fields present and non-empty.
- [ ] No secrets/PII; redaction markers recorded.
- [ ] Correlation fields (`trace_id`, `span_id`) set when spans exist.
- [ ] Log level matches outcome (errors use warn/error/fatal only).
docs/modules/telemetry/guides/metrics-and-slos.md
Normal file
@@ -0,0 +1,113 @@
# Metrics & SLOs (DOCS-OBS-51-001)

Last updated: 2025-12-15

## Core metrics (platform-wide)

- **Requests**: `http_requests_total{tenant,workload,route,status}` (counter); latency histogram `http_request_duration_seconds`.
- **Jobs**: `worker_jobs_total{tenant,queue,status}`; `worker_job_duration_seconds`.
- **DB**: `db_query_duration_seconds{db,operation}`; `db_pool_in_use`, `db_pool_available`.
- **Cache**: `cache_requests_total{result=hit|miss}`; `cache_latency_seconds`.
- **Queue depth**: `queue_depth{tenant,queue}` (gauge).
- **Errors**: `errors_total{tenant,workload,code}`.
- **Custom module metrics**: keep them namespaced (e.g., `riskengine_score_duration_seconds`, `notify_delivery_attempts_total`).
## SLOs (suggested)

- API availability: 99.9% monthly per public service.
- P95 latency: < 300 ms for read endpoints; < 1 s for write endpoints.
- Worker job success: > 99% over 30 d; P95 job duration set per queue (document locally).
- Queue backlog: alert when `queue_depth` > 1000 for 5 minutes per tenant/queue.
- Error budget policy: 28-day rolling window; burn-rate alerts at 2× and 14× budget.
## Alert examples

- High error rate: `rate(errors_total[5m]) / rate(http_requests_total[5m]) > 0.02`.
- Latency regression: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.3`.
- Queue backlog: `queue_depth > 1000` for 5m.
- Job failures: `rate(worker_jobs_total{status="failed"}[10m]) > 0.01`.
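As Prometheus rule configuration, the first two examples look like this (thresholds come from the list above; the rule names, `for` durations, and severities are illustrative):

```yaml
groups:
  - name: platform-slos
    rules:
      - alert: HighErrorRate
        expr: rate(errors_total[5m]) / rate(http_requests_total[5m]) > 0.02
        for: 5m
        labels:
          severity: warning

      - alert: LatencyRegression
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)) > 0.3
        for: 10m
        labels:
          severity: warning
```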
## UX KPIs (triage TTFS)

- Targets:
  - TTFS first evidence p95: <= 1.5 s
  - TTFS skeleton p95: <= 0.2 s
  - Clicks-to-closure median: <= 6
  - Evidence completeness avg: >= 90% (>= 3.6/4)

```promql
# TTFS first evidence p50/p95
histogram_quantile(0.50, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))

# Clicks-to-closure median
histogram_quantile(0.50, sum(rate(stellaops_clicks_to_closure_bucket[5m])) by (le))

# Evidence completeness average percent (0-4 mapped to 0-100)
100 * (sum(rate(stellaops_evidence_completeness_score_sum[5m])) / clamp_min(sum(rate(stellaops_evidence_completeness_score_count[5m])), 1)) / 4

# Budget violations by phase
sum(rate(stellaops_performance_budget_violations_total[5m])) by (phase)
```

- Dashboard: `ops/devops/observability/grafana/triage-ttfs.json`
- Alerts: `ops/devops/observability/triage-alerts.yaml`
## TTFS Metrics (time-to-first-signal)

- Core metrics:
  - `ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id}` (histogram)
  - `ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
  - `ttfs_cache_hit_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
  - `ttfs_cache_miss_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
  - `ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
  - `ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code}` (counter)

- SLO targets:
  - P50 < 2 s, P95 < 5 s (all surfaces)
  - Warm path P50 < 700 ms, P95 < 2.5 s
  - Cold path P95 < 4 s

```promql
# TTFS latency p50/p95
histogram_quantile(0.50, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))

# SLO breach rate (per minute)
60 * sum(rate(ttfs_slo_breach_total[5m]))
```
## Offline Kit (air-gap) metrics

- `offlinekit_import_total{status,tenant_id}` (counter)
- `offlinekit_attestation_verify_latency_seconds{attestation_type,success}` (histogram)
- `attestor_rekor_success_total{mode}` (counter)
- `attestor_rekor_retry_total{reason}` (counter)
- `rekor_inclusion_latency{success}` (histogram)

```promql
# Import rate by status
sum(rate(offlinekit_import_total[5m])) by (status)

# Import success rate
sum(rate(offlinekit_import_total{status="success"}[5m])) / clamp_min(sum(rate(offlinekit_import_total[5m])), 1)

# Attestation verify p95 by type (success only)
histogram_quantile(0.95, sum(rate(offlinekit_attestation_verify_latency_seconds_bucket{success="true"}[5m])) by (le, attestation_type))

# Rekor inclusion latency p95 (by success)
histogram_quantile(0.95, sum(rate(rekor_inclusion_latency_bucket[5m])) by (le, success))
```

Dashboard: `docs/modules/telemetry/dashboards/offline-kit-operations.json`
## Observability hygiene

- Tag everything with `tenant`, `workload`, `env`, `region`, `version`.
- Keep metric names stable; prefer adding labels over renaming.
- No high-cardinality labels (avoid `user_id`, `path`, raw errors); bucket or hash if needed.
- Offline: scrape locally (Prometheus/OTLP); ship exports via bundle if required.

## Dashboards

- Golden signals per service: traffic, errors, saturation, latency (P50/P95/P99).
- Queue dashboards: depth, age, throughput, success/fail rates.
- Tracing overlays: link span `status` to error metrics; use exemplars where supported.

## Validation checklist

- [ ] Metrics emitted with required tags.
- [ ] Cardinality review completed (no unbounded labels).
- [ ] Alerts wired to the error budget policy.
- [ ] Dashboards cover golden signals and queue health.
docs/modules/telemetry/guides/observability.md
Normal file
@@ -0,0 +1,240 @@
# AOC Observability Guide

> **Audience:** Observability Guild, Concelier/Excititor SREs, platform operators.
> **Scope:** Metrics, traces, logs, dashboards, and runbooks introduced as part of the Aggregation-Only Contract (AOC) rollout (Sprint 19).

This guide captures the canonical signals emitted by Concelier and Excititor once AOC guards are active. It explains how to consume the metrics in dashboards, correlate traces/logs for incident triage, and operate in offline environments. Pair this guide with the [AOC reference](../aoc/aggregation-only-contract.md) and the [architecture overview](../modules/platform/architecture-overview.md).

---

## 1 · Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ingestion_write_total` | Counter | `source`, `tenant`, `result` (`ok`, `reject`, `noop`) | Counts write attempts to `advisory_raw`/`vex_raw`. Rejects correspond to guard failures. |
| `ingestion_latency_seconds` | Histogram | `source`, `tenant`, `phase` (`fetch`, `transform`, `write`) | Measures end-to-end runtime for ingestion stages. Use `quantile=0.95` for alerting. |
| `aoc_violation_total` | Counter | `source`, `tenant`, `code` (`ERR_AOC_00x`) | Total guard violations bucketed by error code. Drives dashboard pills and alert thresholds. |
| `ingestion_signature_verified_total` | Counter | `source`, `tenant`, `result` (`ok`, `fail`, `skipped`) | Tracks signature/checksum verification outcomes. |
| `advisory_revision_count` | Gauge | `source`, `tenant` | Supersedes depth for raw documents; spikes indicate noisy upstream feeds. |
| `verify_runs_total` | Counter | `tenant`, `initiator` (`ui`, `cli`, `api`, `scheduled`) | How many `stella aoc verify` or `/aoc/verify` runs executed. |
| `verify_duration_seconds` | Histogram | `tenant`, `initiator` | Runtime of verification jobs; use P95 to detect regressions. |
### 1.1 · Alerts

- **Violation spike:** Alert when `increase(aoc_violation_total[15m]) > 0` for critical sources. Page SRE if `code="ERR_AOC_005"` (signature failure) or `ERR_AOC_001` persists > 30 min.
- **Stale ingestion:** Alert when `max_over_time(ingestion_latency_seconds_sum / ingestion_latency_seconds_count)[30m]` exceeds 30 s or if `ingestion_write_total` shows no growth for > 60 min.
- **Signature drop:** Warn when `rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0`.
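In Prometheus rule form, the signature-related alerts might look like this (the expressions follow the bullets above; rule names, `for` durations, and severities are illustrative and should match your paging policy):

```yaml
groups:
  - name: aoc-ingestion
    rules:
      - alert: AocSignatureViolationSpike
        expr: increase(aoc_violation_total{code="ERR_AOC_005"}[15m]) > 0
        labels:
          severity: page

      - alert: AocSignatureFailures
        expr: rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0
        for: 15m
        labels:
          severity: warning
```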
### 1.2 · `/obs/excititor/health`

`GET /obs/excititor/health` (scope `vex.admin`) returns a compact snapshot for Grafana tiles and Console widgets:

- `ingest` — overall status, worst lag (seconds), and the top connectors (status, lagSeconds, failure count, last success).
- `link` — freshness of consensus/linkset processing plus document counts and the number currently carrying conflicts.
- `signature` — recent coverage window (evaluated, with signatures, verified, failures, unsigned, coverage ratio).
- `conflicts` — rolling totals grouped by status plus per-bucket trend data for charts.

```json
{
  "generatedAt": "2025-11-08T11:00:00Z",
  "ingest": { "status": "healthy", "connectors": [ { "connectorId": "excititor:redhat", "lagSeconds": 45.3 } ] },
  "link": { "status": "warning", "lastConsensusAt": "2025-11-08T10:57:03Z" },
  "signature": { "status": "critical", "documentsEvaluated": 120, "verified": 30, "failures": 2 },
  "conflicts": { "status": "warning", "conflictStatements": 325, "trend": [ { "bucketStart": "2025-11-08T10:00:00Z", "conflicts": 130 } ] }
}
```
| Setting | Default | Purpose |
|---------|---------|---------|
| `Excititor:Observability:IngestWarningThreshold` | `06:00:00` | Connector lag before `ingest.status` becomes `warning`. |
| `Excititor:Observability:IngestCriticalThreshold` | `24:00:00` | Connector lag before `ingest.status` becomes `critical`. |
| `Excititor:Observability:LinkWarningThreshold` | `00:15:00` | Maximum acceptable delay between consensus recalculations. |
| `Excititor:Observability:LinkCriticalThreshold` | `01:00:00` | Delay that marks link status as `critical`. |
| `Excititor:Observability:SignatureWindow` | `12:00:00` | Lookback window for signature coverage. |
| `Excititor:Observability:SignatureHealthyCoverage` | `0.8` | Coverage ratio that still counts as healthy. |
| `Excititor:Observability:SignatureWarningCoverage` | `0.5` | Coverage ratio that flips the status to `warning`. |
| `Excititor:Observability:ConflictTrendWindow` | `24:00:00` | Rolling window used for conflict aggregation. |
| `Excititor:Observability:ConflictTrendBucketMinutes` | `60` | Resolution of conflict `trend` buckets. |
| `Excititor:Observability:ConflictWarningRatio` | `0.15` | Fraction of consensus docs with conflicts that triggers `warning`. |
| `Excititor:Observability:ConflictCriticalRatio` | `0.3` | Ratio that marks `conflicts.status` as `critical`. |
| `Excititor:Observability:MaxConnectorDetails` | `50` | Number of connector entries returned (keeps payloads small). |
### 1.3 · Regression & DI hygiene

1. **Keep storage/integration tests green when telemetry touches persistence.**
   - `./tools/postgres/local-postgres.sh start` downloads PostgreSQL 16.x (if needed), launches the instance, and prints `export EXCITITOR_TEST_POSTGRES_URI=postgresql://.../excititor-tests`. Copy that export into your shell.
   - `./tools/postgres/local-postgres.sh restart` is a shortcut for "stop if running, then start" using the same dataset—use it after tweaking config or when tests need a bounce without wiping fixtures.
   - `./tools/postgres/local-postgres.sh clean` stops the instance (if running) and deletes the managed data/log directories so storage tests begin from a pristine catalog.
   - Run `dotnet test src/Excititor/__Tests/StellaOps.Excititor.Storage.Postgres.Tests/StellaOps.Excititor.Storage.Postgres.Tests.csproj -nologo -v minimal` (add `--filter` if you only touched specific suites). These tests exercise the same write paths that feed the dashboards, so regressions show up immediately.
   - `./tools/postgres/local-postgres.sh stop` when finished so CI/dev hosts stay clean; `status|logs|shell` are available for troubleshooting.
2. **Declare optional Minimal API dependencies with `[FromServices] ... = null`.** RequestDelegateFactory treats `[FromServices] IVexSigner? signer = null` (or similar) as optional, so host startup succeeds even when tests have not registered that service. This pattern keeps observability endpoints cancellable while avoiding brittle test overrides.

---
## 2 · Traces

### 2.1 · Span taxonomy

| Span name | Parent | Key attributes |
|-----------|--------|----------------|
| `ingest.fetch` | job root span | `source`, `tenant`, `uri`, `contentHash` |
| `ingest.transform` | `ingest.fetch` | `documentType` (`csaf`, `osv`, `vex`), `payloadBytes` |
| `ingest.write` | `ingest.transform` | `collection` (`advisory_raw`, `vex_raw`), `result` (`ok`, `reject`) |
| `aoc.guard` | `ingest.write` | `code` (on violation), `violationCount`, `supersedes` |
| `verify.run` | verification job root | `tenant`, `window.from`, `window.to`, `sources`, `violations` |

### 2.2 · Trace usage

- Correlate UI dashboard entries with traces via the `traceId` surfaced in violation drawers (`docs/UI_GUIDE.md`).
- Use `aoc.guard` spans to inspect guard payload snapshots. Sensitive fields are redacted automatically; raw JSON lives in secure logs only.
- For scheduled verification, filter traces by `initiator="scheduled"` to compare runtimes before and after a change.

### 2.3 · Telemetry configuration (Excititor)

- Configure the web service via `Excititor:Telemetry`:
```jsonc
{
  "Excititor": {
    "Telemetry": {
      "Enabled": true,
      "EnableTracing": true,
      "EnableMetrics": true,
      "ServiceName": "stellaops-excititor-web",
      "OtlpEndpoint": "http://otel-collector:4317",
      "OtlpHeaders": {
        "Authorization": "Bearer ${OTEL_PUSH_TOKEN}"
      },
      "ResourceAttributes": {
        "env": "prod-us",
        "service.group": "ingestion"
      }
    }
  }
}
```

- Point the OTLP endpoint at the shared collector profile from §1 so Excititor metrics land in the `ingestion_*` dashboards next to Concelier. Resource attributes drive Grafana filtering (e.g., `env`, `service.group`).
- For offline/air-gap bundles, set `Enabled=false` and collect the file-exporter artifacts from the Offline Kit; import them into Grafana after transfer to keep time-to-truth dashboards consistent.
- Local development templates: run `tools/postgres/local-postgres.sh start` to spin up a PostgreSQL instance plus the matching `psql` client. The script prints the `export EXCITITOR_TEST_POSTGRES_URI=...` command that integration tests (e.g., `StellaOps.Excititor.Storage.Postgres.Tests`) honor. Use `restart` for a quick bounce, `clean` to wipe data between suites, and `stop` when finished.

---
## 3 · Logs

Structured logs include the following keys (JSON):

| Key | Description |
|-----|-------------|
| `traceId` | Matches OpenTelemetry trace/span IDs for cross-system correlation. |
| `tenant` | Tenant identifier enforced by Authority middleware. |
| `source.vendor` | Logical source (e.g., `redhat`, `ubuntu`, `osv`, `ghsa`). |
| `upstream.upstreamId` | Vendor-provided ID (CVE, GHSA, etc.). |
| `contentHash` | `sha256:` digest of the raw document. |
| `violation.code` | Present when the guard rejects with `ERR_AOC_00x`. |
| `verification.window` | Present on `/aoc/verify` job logs. |

Excititor APIs mirror these identifiers via response headers:

| Header | Purpose |
| --- | --- |
| `X-Stella-TraceId` | W3C trace/span identifier for deep-linking from Console → Grafana/Loki. |
| `X-Stella-CorrelationId` | Stable correlation identifier (respects the inbound header or falls back to the request trace ID). |

Logs are shipped to the central Loki/Elasticsearch cluster. Use the template query:

```logql
{app="concelier-web"} | json | violation_code != ""
```

to spot active AOC violations.
### 1.3 · Advisory chunk API (Advisory AI feeds)

Advisory AI now leans on Concelier’s `/advisories/{key}/chunks` endpoint for deterministic evidence packs. The service exports dedicated metrics so dashboards can highlight latency spikes, cache noise, or aggressive guardrail filtering before they impact Advisory AI responses.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `advisory_ai_chunk_requests_total` | Counter | `tenant`, `result`, `truncated`, `cache` | Count of chunk API calls, tagged with cache hits/misses and truncation state. |
| `advisory_ai_chunk_latency_milliseconds` | Histogram | `tenant`, `result`, `truncated`, `cache` | End-to-end build latency (milliseconds) for each chunk request. |
| `advisory_ai_chunk_segments` | Histogram | `tenant`, `result`, `truncated` | Number of chunk segments returned to the caller; watch for sudden drops tied to guardrails. |
| `advisory_ai_chunk_sources` | Histogram | `tenant`, `result` | How many upstream observations/sources contributed to a response (after observation limits). |
| `advisory_ai_guardrail_blocks_total` | Counter | `tenant`, `reason`, `cache` | Per-reason count of segments suppressed by guardrails (length, normalization, character set). |

Dashboards should plot latency P95/P99 next to cache hit rates and guardrail block deltas to catch degradation early. Advisory AI CLI/Console surfaces the same metadata so support engineers can correlate with Grafana/Loki entries using `traceId`/`correlationId` headers.
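As a starting point for those panels, the latency and cache queries can look like the following sketch — the `_bucket` suffix assumes standard Prometheus histogram export, and the `cache="hit"` label value is an assumption about how the `cache` label is populated:

```promql
# P95 chunk build latency per tenant
histogram_quantile(0.95,
  sum by (le, tenant) (rate(advisory_ai_chunk_latency_milliseconds_bucket[5m])))

# Cache hit ratio per tenant
sum by (tenant) (rate(advisory_ai_chunk_requests_total{cache="hit"}[5m]))
  /
sum by (tenant) (rate(advisory_ai_chunk_requests_total[5m]))
```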
---

## 4 · Dashboards

Primary Grafana dashboard: **“AOC Ingestion Health”** (`dashboards/aoc-ingestion.json`). Panels include:

1. **Sources overview:** table fed by `ingestion_write_total` and `ingestion_latency_seconds` (mirrors Console tiles).
2. **Violation trend:** stacked bar chart of `aoc_violation_total` per code.
3. **Signature success rate:** timeseries derived from `ingestion_signature_verified_total`.
4. **Supersedes depth:** gauge showing `advisory_revision_count` P95.
5. **Verification runs:** histogram and latency boxplot using `verify_runs_total` / `verify_duration_seconds`.

Secondary dashboards:

- **AOC Alerts (Ops view):** summarises active alerts, last verify run, and links to incident runbook.
- **Offline Mode Dashboard:** fed from Offline Kit imports; highlights snapshot age and queued verification jobs.

Update `docs/assets/dashboards/` with screenshots when the Grafana capture pipeline produces the latest renders.

---

## 5 · Operational workflows
1. **During ingestion incident:**
   - Check Console dashboard for offending sources.
   - Pivot to logs using document `contentHash`.
   - Re-run `stella sources ingest --dry-run` with problematic payloads to validate fixes.
   - After remediation, run `stella aoc verify --since 24h` and confirm exit code `0`.
2. **Scheduled verification:**
   - Configure a cron job to run `stella aoc verify --format json --export ...`.
   - Ship JSON to the `aoc-verify` bucket and ingest into metrics using a custom exporter.
   - Alert on missing exports (no file uploaded within 26 h).
3. **Offline kit validation:**
   - Use the Offline Dashboard to ensure snapshots contain the latest metrics.
   - Run verification reports locally and attach them to the bundle before distribution.
4. **Incident toggle audit:**
   - Authority requires `incident_reason` when issuing `obs:incident` tokens; plan your runbooks to capture business justification.
   - Auditors can call `/authority/audit/incident?limit=100` with the tenant header to list recent incident activations, including reason and issuer.
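The scheduled-verification, hashing, and shipping steps can be combined into one nightly script. This is a sketch only — the export path and the `aws s3 cp` transport are hypothetical (the doc leaves the export destination as `...`), so substitute your own paths and tooling:

```shell
#!/usr/bin/env sh
# Nightly AOC verification export (invoke from cron); paths and transport are hypothetical.
out="/var/lib/stella/aoc-verify/$(date -u +%Y%m%d).json"
stella aoc verify --format json --export "$out"
sha256sum "$out" >> /var/lib/stella/aoc-verify/SHA256SUMS   # audit-trail hash
aws s3 cp "$out" s3://aoc-verify/                           # ship to the aoc-verify bucket
```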
---

## 6 · Offline considerations

- Metrics exporters bundled with Offline Kit write to local Prometheus snapshots; sync them with central Grafana once connectivity is restored.
- CLI verification reports should be hashed (`sha256sum`) and archived for audit trails.
- Dashboards include offline data sources (`prometheus-offline`) switchable via dropdown.

---

## 7 · References

- [Aggregation-Only Contract reference](../aoc/aggregation-only-contract.md)
- [Architecture overview](../modules/platform/architecture-overview.md)
- [Console guide](../UI_GUIDE.md)
- [CLI AOC commands](../modules/cli/guides/cli-reference.md)
- [Concelier architecture](../modules/concelier/architecture.md)
- [Excititor architecture](../modules/excititor/architecture.md)
- [Scheduler Worker observability guide](../modules/scheduler/operations/worker.md)

---

## 8 · Compliance checklist

- [ ] Metrics documented with label sets and alert guidance.
- [ ] Tracing span taxonomy aligned with Concelier/Excititor implementation.
- [ ] Log schema matches structured logging contracts (traceId, tenant, source, contentHash).
- [ ] Grafana dashboard references verified and screenshots scheduled.
- [ ] Offline/air-gap workflow captured.
- [ ] Cross-links to AOC reference, console, and CLI docs included.
- [ ] Observability Guild sign-off scheduled (OWNER: @obs-guild, due 2025-10-28).

---

*Last updated: 2025-10-26 (Sprint 19).*
166
docs/modules/telemetry/guides/policy.md
Normal file
@@ -0,0 +1,166 @@
# Policy Engine Observability

> **Audience:** Observability Guild, SRE/Platform operators, Policy Guild.
> **Scope:** Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
> **Prerequisites:** Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config).

---

## 1 · Instrumentation Overview

- **Telemetry stack:** OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
- **Namespace conventions:** `policy.*` for metrics/traces/log categories; labels use `tenant`, `policy`, `mode`, `runId`.
- **Sampling:** Default 10 % trace sampling, 1 % rule-hit log sampling; incident mode overrides to 100 % (see §7).
- **Correlation IDs:** Every API request gets `traceId` + `requestId`. CLI/UI display IDs to streamline support.

---

## 2 · Metrics

### 2.1 Run Pipeline

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target ≤ 5 min incremental, ≤ 30 min full. |
| `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated each enqueue/dequeue). |
| `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. |
| `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. |
| `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting run. |
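The P95 target in the first row translates into a query along these lines — a sketch assuming standard Prometheus histogram `_bucket` export:

```promql
# P95 incremental run duration per tenant over the last 15 minutes
histogram_quantile(0.95,
  sum by (le, tenant) (rate(policy_run_seconds_bucket{mode="incremental"}[15m])))
```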
### 2.2 Evaluator Insights

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Increment per rule match (sampled). |
| `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. |
| `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. |
| `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. |
| `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero indicates optimistic concurrency retries. |

### 2.3 API Surface

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. |
| `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget ≤ 250 ms for GETs, ≤ 1 s for POSTs. |
| `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (`429`). |

### 2.4 Queue & Change Streams

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. |
| `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. |
| `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of oldest unprocessed change event. |

---

## 3 · Logs

- **Format:** JSON (`Serilog`). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`.
- **Log categories:**
  - `policy.run` (queue lifecycle, run begin/end, stats)
  - `policy.evaluate` (batch execution summaries; rule-hit sampling)
  - `policy.materialize` (Mongo operations, conflicts, retries)
  - `policy.simulate` (diff results, CLI invocation metadata)
  - `policy.lifecycle` (submit/review/approve events)
- **Sampling:** Rule-hit logs sample 1 % by default; toggled to 100 % in incident mode or when the `--trace` flag is used in the CLI.
- **PII:** No user secrets recorded; user identities referenced as `user:<id>` or `group:<id>` only.
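A representative `policy.evaluate` entry using the core fields above — the values (and the `category` field name) are illustrative, not a schema guarantee:

```json
{
  "timestamp": "2025-10-26T14:03:12.481Z",
  "level": "Information",
  "message": "Evaluation batch completed",
  "category": "policy.evaluate",
  "tenant": "acme",
  "policyId": "P-42",
  "policyVersion": "7",
  "runId": "run-20251026-0001",
  "rule": "block-critical-cves",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "env.sealed": false,
  "error.code": null
}
```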
---

## 4 · Traces

- Spans emit via OpenTelemetry instrumentation.
- **Primary spans:**
  - `policy.api` – wraps HTTP request, records `endpoint`, `status`, `scope`.
  - `policy.select` – change stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`).
  - `policy.evaluate` – evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`).
  - `policy.materialize` – Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`).
  - `policy.simulate` – simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`).
- Trace context propagated to the CLI via the `traceparent` response header; the UI surfaces it in the run detail view.
- Incident mode forces span sampling to 100 % and extends retention via a Collector config override.

---

## 5 · Dashboards

### 5.1 Policy Runs Overview

Widgets:
- Run duration histogram (per mode/tenant).
- Queue depth + backlog age line charts.
- Failure rate stacked by error code.
- Incremental backlog heatmap (policy × age).
- Active vs scheduled runs table.

### 5.2 Rule Impact & VEX

- Top N rules by firings (bar chart).
- VEX overrides by vendor/justification (stacked chart).
- Suppression usage (pie + table with justifications).
- Quieted findings trend (line).

### 5.3 Simulation & Approval Health

- Simulation diff histogram (added vs removed).
- Pending approvals by age (table with SLA colour coding).
- Compliance checklist status (lint, determinism CI, simulation evidence).

> Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (`../assets/policy-observability/*.png`).

---

## 6 · Alerting

| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| **PolicyRunSlaBreach** | `policy_run_seconds{mode="incremental"}` P95 > 300 s for 3 windows | Check queue depth, upstream services, scale worker pool. |
| **PolicyQueueStuck** | `policy_delta_backlog_age_seconds` > 600 | Investigate change stream connectivity. |
| **DeterminismMismatch** | Run status `failed` with `ERR_POL_004` OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
| **SimulationDrift** | CLI/CI simulation exit `20` (blocking diff) over threshold | Review policy changes before approval. |
| **VexOverrideSpike** | `policy_vex_overrides_total` > configured baseline (per vendor) | Verify upstream VEX feed; ensure justification codes expected. |
| **SuppressionSurge** | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppress rules; check approvals. |

Alerts integrate with Notifier channels (`policy.alerts`) and Ops on-call rotations.
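The first two table rows can be expressed as Prometheus alerting rules along these lines — a sketch only; thresholds, `for` durations, and severities should match your SLO windows:

```yaml
groups:
  - name: policy-engine
    rules:
      - alert: PolicyRunSlaBreach
        expr: |
          histogram_quantile(0.95,
            sum by (le, tenant) (rate(policy_run_seconds_bucket{mode="incremental"}[5m]))) > 300
        for: 15m
        labels: {severity: warning}
      - alert: PolicyQueueStuck
        expr: max by (tenant, source) (policy_delta_backlog_age_seconds) > 600
        for: 5m
        labels: {severity: critical}
```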
---

## 7 · Incident Mode & Forensics

- Toggle via `POST /api/policy/incidents/activate` (requires `policy:operate` scope).
- Effects:
  - Trace sampling → 100 %.
  - Rule-hit log sampling → 100 %.
  - Retention window extended to 30 days for incident duration.
  - `policy.incident.activated` event emitted (Console + Notifier banners).
- Post-incident tasks:
  - `stella policy run replay` for affected runs; attach bundles to incident record.
  - Restore sampling defaults with `.../deactivate`.
  - Update incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings.

---

## 8 · Integration Points

- **Authority:** Exposes metric `policy_scope_denied_total` for failed authorisation; correlate with `policy_api_requests_total`.
- **Concelier/Excititor:** Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
- **Scheduler:** Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
- **Offline Kit:** CLI exports logs + metrics snapshots (`stella offline bundle metrics`) for air-gapped audits.

---

## 9 · Compliance Checklist

- [ ] **Metrics registered:** All metrics listed above exported and documented in Grafana dashboards.
- [ ] **Alert policies configured:** Ops or Observability Guild created alerts matching table in §6.
- [ ] **Sampling overrides tested:** Incident mode toggles verified in staging; retention roll-back rehearsed.
- [ ] **Trace propagation validated:** CLI/UI display trace IDs and allow copy for support.
- [ ] **Log scrubbing enforced:** Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
- [ ] **Offline capture rehearsed:** Metrics/log snapshot commands executed in sealed environment.
- [ ] **Docs cross-links:** Links to architecture, runs, lifecycle, CLI, API docs verified.

---

*Last updated: 2025-10-26 (Sprint 20).*
48
docs/modules/telemetry/guides/telemetry-bootstrap.md
Normal file
@@ -0,0 +1,48 @@
# Telemetry Core Bootstrap (v1 · 2025-11-19)

## Goal
Show minimal host wiring for `StellaOps.Telemetry.Core` with deterministic defaults and sealed-mode friendliness.

## Sample (web/worker host)
```csharp
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddStellaOpsTelemetry(
    builder.Configuration,
    serviceName: "StellaOps.SampleService",
    serviceVersion: builder.Configuration["VERSION"],
    configureOptions: options =>
    {
        // Disable collector in sealed mode / air-gap
        options.Collector.Enabled = builder.Configuration.GetValue<bool>("Telemetry:Collector:Enabled", true);
        options.Collector.Endpoint = builder.Configuration["Telemetry:Collector:Endpoint"];
        options.Collector.Protocol = TelemetryCollectorProtocol.Grpc;
    },
    configureMetrics: m => m.AddAspNetCoreInstrumentation(),
    configureTracing: t => t.AddHttpClientInstrumentation());
```

## Configuration (appsettings.json)
```json
{
  "Telemetry": {
    "Collector": {
      "Enabled": true,
      "Endpoint": "https://otel-collector.example:4317",
      "Protocol": "Grpc",
      "Component": "sample-service",
      "Intent": "telemetry-export",
      "DisableOnViolation": true
    }
  }
}
```

## Determinism & safety
- UTC timestamps only; no random IDs introduced by the helper.
- Exporter is skipped when endpoint missing or egress policy denies.
- `VSTEST_DISABLE_APPDOMAIN=1` recommended for tests with `tools/linksets-ci.sh` pattern.

## Next
- Propagation adapters (50-002) will build on this bootstrap.
- Scrub/analyzer policies live under upcoming 51-001/51-002 tasks.
@@ -0,0 +1,45 @@
# Telemetry propagation contract (TELEMETRY-OBS-51-001)

**Goal**: standardise trace/metrics propagation across StellaOps services so golden-signal helpers remain deterministic, tenant-safe, and offline-friendly.

## Scope
- Applies to HTTP, gRPC, background jobs, and message handlers instrumented via `StellaOps.Telemetry.Core`.
- Complements bootstrap guide (`telemetry-bootstrap.md`) and precedes metrics helper implementation.

## Required context fields
- `trace_id` / `span_id`: W3C TraceContext headers only (no B3); generate if missing.
- `tenant`: lower-case string; required for all incoming requests; default to `unknown` only in sealed/offline diagnostics jobs.
- `actor`: optional user/service principal; redacted to a hash in logs when `Scrub.Sealed=true`.
- `imposed_rule`: optional string conveying enforcement context (e.g., `merge=false`).

## HTTP middleware
- Accept `traceparent`/`tracestate`; reject/strip vendor-specific headers.
- Propagate `tenant`, `actor`, `imposed-rule` via `x-stella-tenant`, `x-stella-actor`, `x-stella-imposed-rule` headers (defaults configurable via `Telemetry:Propagation`).
- Middleware entry point: `app.UseStellaOpsTelemetryContext()` plus the `TelemetryPropagationHandler` automatically added to all `HttpClient` instances when `AddStellaOpsTelemetry` is called.
- Emit exemplars: when sampling is off, attach exemplar ids to request duration and active request metrics.

## gRPC interceptors
- Use binary TraceContext; carry metadata keys `stella-tenant`, `stella-actor`, `stella-imposed-rule`.
- Enforce presence of `tenant`; abort with `Unauthenticated` if missing in non-sealed mode.

## Jobs & message handlers
- Wrap background job execution with Activity + baggage items (`tenant`, `actor`, `imposed_rule`).
- When publishing bus events, stamp `trace_id` and `tenant` into headers; avoid embedding PII in payloads.

## Metrics helper expectations
- Golden signals: `http.server.duration`, `http.client.duration`, `messaging.operation.duration`, `job.execution.duration`, `runtime.gc.pause`, `db.call.duration`.
- Mandatory tags: `tenant`, `service`, `endpoint`/`operation`, `result` (`ok|error|cancelled|throttled`), `sealed` (`true|false`).
- Cardinality guard: trim tag values to 64 chars (configurable) and replace values beyond the first 50 distinct entries per key with `other` (enforced by `MetricLabelGuard`).
- Helper API: `Histogram<double>.RecordRequestDuration(guard, durationMs, route, verb, status, result)` applies guard + tags consistently.
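The cardinality guard above can be sketched in a few lines. This is an illustrative Python model of the behaviour only — the real `MetricLabelGuard` is a .NET component, and the 64-char / 50-value limits are the configurable defaults named in this contract:

```python
class MetricLabelGuard:
    """Caps tag-value length and per-key cardinality before metric export."""

    def __init__(self, max_len: int = 64, max_distinct: int = 50):
        self.max_len = max_len
        self.max_distinct = max_distinct
        self._seen = {}  # distinct values observed per tag key

    def apply(self, key: str, value: str) -> str:
        value = value[: self.max_len]            # trim long values first
        seen = self._seen.setdefault(key, set())
        if value in seen:
            return value                         # already-admitted values stay stable
        if len(seen) >= self.max_distinct:
            return "other"                       # cardinality budget exhausted for this key
        seen.add(value)
        return value
```

First-seen-wins keeps admitted label values stable for the life of the process, so series identity stays deterministic even after the `other` bucket kicks in.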
## Determinism & offline posture
- All timestamps UTC RFC3339; sampling configs controlled via appsettings and mirrored in offline bundles.
- No external exporters when `Sealed=true`; use in-memory or file-based OTLP for air-gap.

## Tests to add with implementation
- Middleware unit tests asserting header/baggage mapping and tenant enforcement.
- Metrics helper tests ensuring required tags present and trimmed; exemplar id attached when enabled.
- Deterministic snapshot tests for serialized OTLP when sealed/offline.

## Provenance
- Authored 2025-11-20 to unblock TELEMETRY-OBS-51-001; to be refined as helpers are coded.
35
docs/modules/telemetry/guides/telemetry-scrub-51-002.md
Normal file
@@ -0,0 +1,35 @@
# Telemetry scrubbing contract (TELEMETRY-OBS-51-002)

**Purpose**: define redaction/scrubbing rules for logs/traces/metrics before implementing helpers in `StellaOps.Telemetry.Core`.

## Redaction rules
- Strip or hash PII/credentials: emails, tokens, passwords, secrets, bearer/mTLS cert blobs.
- Default hash algorithm: SHA-256 hex; include `scrubbed=true` tag.
- Allowlist fields that remain: `tenant`, `trace_id`, `span_id`, `endpoint`, `result`, `sealed`.

## Configuration knobs
- `Telemetry:Scrub:Enabled` (bool, default true).
- `Telemetry:Scrub:Sealed` (bool, default false) — when true, force scrubbing and disable external exporters.
- `Telemetry:Scrub:HashSalt` (string, optional) — per-tenant salt; omit to keep deterministic hashes across deployments.
- `Telemetry:Scrub:MaxValueLength` (int, default 256) — truncate values beyond this length before hashing.
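A minimal sketch of the hash-and-truncate behaviour these knobs configure — Python illustration only; the real helper lives in `StellaOps.Telemetry.Core`, and `scrub_value` is a hypothetical name:

```python
import hashlib

def scrub_value(value: str, salt: str = "", max_len: int = 256) -> str:
    """Truncate to max_len, then return the SHA-256 hex digest (optionally salted)."""
    value = value[:max_len]                 # MaxValueLength: truncate before hashing
    payload = salt + value                  # empty salt keeps hashes deployment-stable
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With the salt unset, the digest is identical across deployments, which is exactly the determinism property the test section below calls out.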
## Logger sink expectations
- Implement the scrubber as an `ILogPayloadFilter` injected before the sink.
- Ensure message templates remain intact; only values are scrubbed.
- Preserve structured shape so downstream parsing remains deterministic.

## Metrics & traces
- Never place raw user input into metric/tag values; pass through the scrubber before export.
- Span events must omit payload bodies; include keyed references only.

## Auditing
- When scrubbing occurs, add tags `scrubbed=true` and `scrub_reason` (`pii|secret|length|pattern`).
- Provide counter `telemetry.scrub.events{tenant,reason}` for observability.

## Tests to add with implementation
- Unit tests for regex-based scrubbing of tokens, emails, URLs with creds.
- Config-driven tests toggling `Enabled`/`Sealed` modes to ensure exporters are suppressed when sealed.
- Determinism test: same input yields identical hashed output when salt unset.

## Provenance
- Authored 2025-11-20 to unblock TELEMETRY-OBS-51-002 and downstream 55/56 tasks.
33
docs/modules/telemetry/guides/telemetry-sealed-56-001.md
Normal file
@@ -0,0 +1,33 @@
# Sealed-mode telemetry helpers (TELEMETRY-OBS-56-001 prep)

## Objective
Define behavior and configuration for telemetry when `Sealed=true`, ensuring no external egress while preserving deterministic local traces/metrics for audits.

## Requirements
- Disable external OTLP/exporters automatically when sealed; fall back to in-memory or file OTLP (`telemetry-sealed.otlp`) with bounded size (default 10 MB, ring buffer).
- Add tag `sealed=true` to all spans/metrics/logs; suppress exemplars.
- Force scrubbing: treat `Scrub.Sealed=true` regardless of default settings.
- Sampling: cap at 10% max in sealed mode unless the CLI incident toggle raises it (see CLI-OBS-12-001 contract); ceiling 100% with explicit override `Telemetry:Sealed:MaxSamplingPercent`.
- Clock source: require a monotonic clock for durations; emit a warning if system clock skew >500ms is detected.

## Configuration keys
- `Telemetry:Sealed:Enabled` (bool) — driven by host; when true, activate sealed behavior.
- `Telemetry:Sealed:Exporter` (enum `memory|file`) — default `file`.
- `Telemetry:Sealed:FilePath` (string) — default `./logs/telemetry-sealed.otlp`.
- `Telemetry:Sealed:MaxBytes` (int) — default 10_485_760 (10 MB).
- `Telemetry:Sealed:MaxSamplingPercent` (int) — default 10.
- Derived flag `Telemetry:Sealed:EffectiveIncidentMode` (read-only) exposes whether an incident-mode override lifted the sampling ceiling.

## File exporter format
- OTLP binary, append-only, deterministic ordering by enqueue time.
- Rotate when exceeding `MaxBytes` using suffixes `.1`, `.2`, capped to 3 files; oldest dropped.
- Permissions 0600 by default; fail start-up if the path is world-readable.
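The rotation rule above can be sketched like this — a Python illustration of the policy, not the .NET implementation; file names and the 3-file cap follow the bullet points:

```python
import os

def rotate_if_needed(path: str, max_bytes: int, max_files: int = 3) -> None:
    """Rotate path -> path.1 -> path.2, dropping the oldest beyond max_files."""
    if not os.path.exists(path) or os.path.getsize(path) <= max_bytes:
        return
    # Drop the oldest rotated file if the cap is reached (.2 when max_files=3).
    oldest = f"{path}.{max_files - 1}"
    if os.path.exists(oldest):
        os.remove(oldest)
    # Shift remaining suffixes up by one (.1 -> .2), then base -> .1.
    for i in range(max_files - 2, 0, -1):
        src = f"{path}.{i}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{i + 1}")
    os.replace(path, f"{path}.1")
```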
## Validation tests to implement with 56-001
- Unit: sealed mode forces exporter swap and tags `sealed=true`, `scrubbed=true`.
- Unit: sampling capped at max percent unless incident override set.
- Unit: file exporter rotates deterministically and enforces 0600 perms.
- Integration: sealed + incident mode together still block external exporters and honor scrub rules.

## Provenance
- Authored 2025-11-20 to satisfy PREP-TELEMETRY-OBS-56-001 and unblock implementation.
38
docs/modules/telemetry/guides/telemetry-standards.md
Normal file
@@ -0,0 +1,38 @@
# Telemetry Standards (DOCS-OBS-50-002)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

## Common envelope
- **Trace context**: `trace_id`, `span_id`, `trace_flags`; propagate W3C `traceparent` and `baggage` end to end.
- **Tenant & workload**: `tenant`, `workload` (service name), `region`, `env` (dev/stage/prod), `version` (git sha or semver).
- **Subject**: `component` (module), `operation` (verb/name), `resource` (purl/uri/subject id when safe).
- **Timing**: UTC ISO-8601 `timestamp`; durations in milliseconds as integers.
- **Outcome**: `status` (`ok|error|fault|throttle`), `error.code` (machine), `error.message` (human, redacted), `retryable` (bool).
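An illustrative event carrying the envelope above — field values are examples only, and the error/`retryable` fields would appear on non-`ok` outcomes:

```json
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "trace_flags": "01",
  "tenant": "acme",
  "workload": "concelier-web",
  "region": "eu-west-1",
  "env": "prod",
  "version": "2025.11.0",
  "component": "ingestion",
  "operation": "advisory.ingest",
  "resource": "pkg:npm/lodash@4.17.21",
  "timestamp": "2025-11-25T09:14:02Z",
  "duration_ms": 142,
  "status": "ok"
}
```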
## Scrubbing policy
- Denylist PII/secrets before emit: emails, tokens, Authorization headers, bearer fragments, private keys, passwords, session IDs.
- Redact fields to `"[redacted]"` and add `redaction.reason` (`secret|pii|tenant_policy`).
- Hash low-cardinality identifiers when needed (`sha256` lowercase hex) and mark `hashed=true`.
- Logs must not contain full request/response bodies; store hashes plus lengths. For NDJSON exports, allow hashes + selected headers only.

## Sampling defaults
- **Traces**: 10% head sampling non-prod; 100% for `status=error|fault` and for spans tagged `audit=true`. Prod default 5% with the same error/audit boost.
- **Logs**: info logs rate-limited per component (default 100/s); warn/error never sampled. Structured JSON only.
- **Metrics**: never sampled; counters/gauges/histograms use deterministic bucket boundaries documented in component specs.

## Redaction override procedure
- Overrides are rare and must be auditable.
- To allow a field temporarily, set `telemetry.redaction.overrides=<comma list>` in service config with a change-ticket id; emit the `redaction.override=true` tag on affected spans/logs.
- Overrides expire automatically after `telemetry.redaction.override_ttl` (default 24h); services refuse to start with expired overrides.
- All overrides are logged to the `telemetry.redaction.audit` channel with actor, ticket, fields, TTL.

## Determinism & offline posture
- No external enrichers; all enrichment data must come from preloaded bundles (e.g., service map, tenant metadata).
- Sorting for exports: by `timestamp`, then `workload`, then `operation`.
- Time always UTC; avoid locale-specific formats.

## Validation checklist
- [ ] `traceparent` propagated and present on inbound/outbound.
- [ ] Required fields present (`tenant`, `workload`, `operation`, `status`).
- [ ] Scrubbing tests cover auth headers and bodies.
- [ ] Sampling knobs configurable via env vars with documented defaults.
37
docs/modules/telemetry/guides/tracing.md
Normal file
@@ -0,0 +1,37 @@
# Tracing Standards (DOCS-OBS-50-004)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

## Goals
- Consistent distributed tracing across services (API, workers, CLI).
- Safe for offline/air-gapped deployments.
- Deterministic span data for replay/debug.

## Context propagation
- Use W3C headers: `traceparent` (required), `baggage` (optional key/value pairs).
- Preserve the incoming `trace_id` for all downstream calls; create child spans per operation.
- For async work (queues, cron), copy `traceparent` and `baggage` into the message envelope; the new span links to the stored context using **links**, not a new parent.

## Span conventions
- Names: `<component>.<operation>` (e.g., `riskengine.simulate`, `notify.deliver`).
- Required attributes: `tenant`, `workload` (service), `env`, `region`, `version`, `operation`, `status`.
- HTTP spans: add `http.method`, `http.route`, `http.status_code`, `net.peer.name`, `net.peer.port`.
- DB spans: `db.system`, `db.name`, `db.operation`, `db.statement` (omit literals).
- Message spans: `messaging.system`, `messaging.destination`, `messaging.operation` (`send|receive|process`), `messaging.message_id`.
- Errors: set `status=error`, include `error.code`, a redacted `error.message`, and `retryable` (bool).

## Sampling
- Default head sampling: 10% non-prod, 5% prod.
- Always sample spans with `status=error|fault` or `audit=true`.
- Allow override via env `Tracing__SampleRate` (0–1) per service; document in runbooks.
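A deterministic head-sampling decision can be derived from the `trace_id` itself, so every service in a call chain makes the same keep/drop choice without coordination. This is an illustrative Python sketch of the ratio rule plus the error/audit boost, not the shipped sampler:

```python
def should_sample(trace_id: str, rate: float, status: str = "ok", audit: bool = False) -> bool:
    """Head-sample at `rate`, but always keep error/fault and audit spans."""
    if status in ("error", "fault") or audit:
        return True
    # Map the 128-bit hex trace id onto [0, 1) deterministically.
    bucket = int(trace_id, 16) / float(1 << 128)
    return bucket < rate
```

Because the decision is a pure function of the trace id, replaying the same traffic yields the same sampled set — the determinism goal above.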
## Offline/air-gap posture
- No external exporters; emit OTLP to local collector or file.
- Disable remote enrichment; rely on bundled service map.
- All timestamps UTC; span ids deterministic only in scope of traceparent (no GUID reuse).

## Validation checklist
- [ ] `traceparent` forwarded on every inbound/outbound call.
- [ ] Required attributes present on spans.
- [ ] Error spans include codes and redacted messages.
- [ ] Sampling knobs documented in service config.
190
docs/modules/telemetry/guides/ui-telemetry.md
Normal file
@@ -0,0 +1,190 @@
# Console Observability

> **Audience:** Observability Guild, Console Guild, SRE/operators.
> **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).
> **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`).

---

## 1 · Instrumentation Overview

- **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.
- **Sampling:** Front-end spans sample at 5 % by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are un-sampled; log sampling is handled per category (§3).
- **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.
- **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams.

---

## 2 · Metrics

### 2.1 Experience & Navigation

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). |
| `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. |
| `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. |
| `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. |
| `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. |
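For the render-time target (P95 ≤ 1.5 s), a dashboard panel can use a query along these lines — a sketch assuming standard Prometheus histogram `_bucket` export:

```promql
# P95 route render time per route, 5-minute window
histogram_quantile(0.95,
  sum by (le, route) (rate(ui_route_render_seconds_bucket[5m])))
```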

### 2.2 Security & Session

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. |
| `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; backlog above baseline indicates workflow friction. |
| `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). |

### 2.3 Downloads & Offline Kit

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target < 3 s. |
| `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. |
| `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. |

### 2.4 Telemetry Emission & Errors

| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. |
| `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. |

> **Scraping tips:**
> - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.
> - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).
> - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`.

---

## 3 · Logs

- **Format:** JSON via Console log bridge; emitted to stdout and an optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.
- **Categories:**
  - `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag `telemetry.logVerbose`.
  - `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.
  - `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`.
  - `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).
  - `ui.telemetry.failure` – OTLP export errors; includes `httpStatus`, `batchSize`, `retryCount`.
- **PII handling:** Full emails are scrubbed; only hashed values (`user:<sha256>`) appear unless `ui.admin` + fresh-auth were granted for the action (still redacted in logs).
- **Retention:** Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`.
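
The `user:<sha256>` scrubbing rule might look like this (the trim/lowercase normalisation is an assumption, not a documented contract):

```typescript
import { createHash } from "node:crypto";

// Replace a subject email with its hashed form so raw PII never reaches logs.
export function scrubSubject(email: string): string {
  const digest = createHash("sha256")
    .update(email.trim().toLowerCase()) // assumed normalisation
    .digest("hex");
  return `user:${digest}`;
}
```

Because the hash is deterministic, operators can still correlate a known subject across log lines without the raw email ever being stored.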

---

## 4 · Traces

- **Span names & attributes:**
  - `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit`.
  - `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.
  - `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.
  - `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.
  - `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/UI_GUIDE.md`.
- **Propagation:** Spans use W3C `traceparent`; gateway echoes the header to backend APIs so traces stitch across UI → gateway → service.
- **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100 % for incident debugging).
- **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`.
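
For reference, a W3C `traceparent` value (version `00`) is four dash-separated fields; a minimal generator sketch:

```typescript
import { randomBytes } from "node:crypto";

// Build a W3C traceparent header: version-traceId-spanId-flags.
// flags 01 = sampled, 00 = not sampled.
export function makeTraceparent(sampled: boolean): string {
  const traceId = randomBytes(16).toString("hex"); // 32 hex chars
  const spanId = randomBytes(8).toString("hex");   // 16 hex chars
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}
```

In practice the OpenTelemetry SDK produces this header automatically; the sketch only shows the wire format the gateway echoes downstream.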

---

## 5 · Dashboards

### 5.1 Experience Overview

Panels:

- Route render histogram (P50/P90/P99) by route.
- Backend call latency stacked by service (`ui_request_duration_seconds`).
- Offline banner duration trend (`ui_offline_banner_seconds`).
- Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).
- Command palette usage (`ui_filter_apply_total` + `ui.action` log counts).

### 5.2 Downloads & Offline Kit

- Manifest refresh time chart (per channel).
- Export queue depth gauge with alert thresholds.
- CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).
- Offline parity banner occurrences (`downloads.offlineParity` flag from API → derived metric).
- Last Offline Kit import timestamp (join with Downloads API metadata).

### 5.3 Security & Session

- Fresh-auth prompt counts vs success/fail ratios.
- DPoP failure stacked by reason.
- Tenant mismatch warnings (from `ui.security.anomaly` logs).
- Scope usage heatmap (derived from Authority audit events + UI logs).
- CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs).
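
The CSP panel is fed by `ui.security.anomaly` records; a sketch of shaping a `securitypolicyviolation` event into that log category (field names assumed from §3, not a documented schema):

```typescript
// Minimal shape of a browser securitypolicyviolation event we care about.
interface CspViolation {
  blockedURI: string;
  violatedDirective: string;
}

// Map the event onto a ui.security.anomaly log record (level Warning, per §3).
export function cspAnomalyRecord(
  e: CspViolation,
  tenant: string,
  correlationId: string,
): Record<string, string> {
  return {
    level: "Warning",
    action: "ui.security.anomaly",
    reason: "csp_violation", // assumed reason value
    tenant,
    correlationId,
    blockedUri: e.blockedURI,
    directive: e.violatedDirective,
  };
}
```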

> Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc.

---

## 6 · Alerting

| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| **ConsoleLatencyHigh** | Fraction of renders in `ui_route_render_seconds_bucket{le="1.5"}` (relative to `_count`) drops below 0.95 for 3 consecutive intervals | Inspect route splits, check backend latencies, review CDN cache. |
| **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1 s for any service | Correlate with gateway/service dashboards; escalate to owning guild. |
| **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. |
| **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. |
| **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120 s | Investigate Authority/gateway availability; verify Offline Kit freshness. |
| **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10 min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. |
| **TelemetryExportErrors** | `ui_telemetry_batch_failures_total` increasing for ≥ 5 min | Check collector health, credentials, or TLS trust. |
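
The **ConsoleLatencyHigh** condition can be made concrete: compare the `le="1.5"` bucket against the total count per interval and require the breach to persist (an illustrative sketch — tune `intervals` against your scrape period):

```typescript
// Per-interval counts: withinSlo[i] = renders with le="1.5", total[i] = _count.
// Fires only when the in-SLO fraction stays below `target` for every one of
// the last `intervals` samples (arrays are assumed to be the same length).
export function consoleLatencyHigh(
  withinSlo: number[],
  total: number[],
  target = 0.95,
  intervals = 3,
): boolean {
  if (withinSlo.length < intervals || total.length !== withinSlo.length) {
    return false;
  }
  return withinSlo.slice(-intervals).every((n, i) => {
    const t = total[total.length - intervals + i];
    return t > 0 && n / t < target;
  });
}
```

Requiring consecutive breaches keeps a single slow scrape (cold cache, one heavy route) from paging anyone.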

Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation.

---

## 7 · Feature Flags & Configuration

| Flag / Env Var | Purpose | Default |
|----------------|---------|---------|
| `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` |
| `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` |
| `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` |
| `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` |
| `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset |
| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset |
| `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` |
| `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto |
| `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` |
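
`CONSOLE_FEATURE_FLAGS` is a comma-separated list; the parse-and-gate logic might look like this (a sketch — the default mirrors the table above, the helper names are illustrative):

```typescript
// Parse CONSOLE_FEATURE_FLAGS, tolerating whitespace around entries.
export function parseFeatureFlags(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "runs,downloads,policies")
      .split(",")
      .map((f) => f.trim())
      .filter((f) => f.length > 0),
  );
}

// The telemetry panel only renders when the `telemetry` flag is present.
export const telemetryEnabled = (raw?: string): boolean =>
  parseFeatureFlags(raw).has("telemetry");
```

Note that with the documented default, `telemetry` is off — the panel must be opted into explicitly.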

Feature flag changes should be tracked in release notes and mirrored in `docs/UI_GUIDE.md` (navigation and workflow expectations).

---

## 8 · Offline / Air-Gapped Workflow

- Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/operations/console-docker-install.md` §4).
- Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.
- Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.
- When collectors are unavailable, the console queues OTLP batches (for up to 5 minutes) and exposes the backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss.
- After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify that `ui_telemetry_batch_failures_total` stops increasing.
- Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages.

---

## 9 · Compliance Checklist

- [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.
- [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).
- [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.
- [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.
- [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.
- [ ] Offline capture workflow exercised; evidence stored in audit vault.
- [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).
- [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/UI_GUIDE.md`).

---

## 10 · References

- `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks.
- `/docs/security/console-security.md` – Security metrics & alert hints.
- `docs/UI_GUIDE.md` – Console workflows and offline posture.
- `/docs/observability/observability.md` – Platform-wide practices.
- `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment.
- `/docs/operations/console-docker-install.md` – Compose/Helm environment variables.

---

*Last updated: 2025-10-28 (Sprint 23).*