docs consolidation and others

This commit is contained in:
master
2026-01-06 19:02:21 +02:00
parent d7bdca6d97
commit 4789027317
849 changed files with 16551 additions and 66770 deletions

View File

@@ -0,0 +1,37 @@
# Aggregation Observability
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-LNM-22-007)
Covers metrics, traces, and logs for Link-Not-Merge (LNM) aggregation and evidence pipelines.
## Metrics
- `aggregation_ingest_latency_seconds` (histogram) — end-to-end ingest per statement; labels: `tenant`, `source`, `status`.
- `aggregation_conflict_total` (counter) — conflicts encountered; labels: `tenant`, `advisory`, `product`, `reason`.
- `aggregation_overlay_cache_hits_total` / `_misses_total` — overlay cache effectiveness; labels: `tenant`, `cache`.
- `aggregation_vex_gate_total` — VEX gating outcomes; labels: `tenant`, `status` (`affected`, `not_affected`, `unknown`).
- `aggregation_queue_depth` (gauge) — pending statements per tenant.
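The sketch below shows one way these instruments could be registered with `System.Diagnostics.Metrics`; only the metric names and label sets come from the list above, while the meter name and helper class are illustrative assumptions.
```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical registration of the aggregation instruments listed above.
// The meter name "StellaOps.Aggregation" and this helper class are assumptions.
internal static class AggregationMetrics
{
    private static readonly Meter Meter = new("StellaOps.Aggregation");

    private static readonly Histogram<double> IngestLatency =
        Meter.CreateHistogram<double>("aggregation_ingest_latency_seconds", unit: "s");

    private static readonly Counter<long> Conflicts =
        Meter.CreateCounter<long>("aggregation_conflict_total");

    public static void RecordIngest(string tenant, string source, string status, double seconds) =>
        IngestLatency.Record(seconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("source", source),
            new KeyValuePair<string, object?>("status", status));

    public static void RecordConflict(string tenant, string advisory, string product, string reason) =>
        Conflicts.Add(1,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("advisory", advisory),
            new KeyValuePair<string, object?>("product", product),
            new KeyValuePair<string, object?>("reason", reason));
}
```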
## Traces
- Span name `aggregation.process` with attributes:
- `tenant`, `advisory`, `product`, `vex_status`, `source_kind`
- `overlay_version`, `cache_hit` (bool)
- Link to upstream ingest span (`traceparent` forwarded by Excititor/Concelier).
- Export to OTLP; sampling default 10% outside prod, 100% for `status=error`.
## Logs
Structured JSON with fields: `tenant`, `advisory`, `product`, `vex_status`, `decision` (`merged|suppressed|dropped`), `reason`, `duration_ms`, `trace_id`.
## SLOs
- **Ingest latency**: p95 < 500ms per statement (steady state).
- **Cache hit rate**: >80% for overlays; alerts when below for 15 minutes.
- **Error rate**: <0.1% over 10 minute window.
## Alerts
- `HighConflictRate` — `aggregation_conflict_total` delta > 100/minute per tenant.
- `QueueBacklog` — `aggregation_queue_depth` > 10k for 5 minutes.
- `LowCacheHit` — overlay cache hit rate < 60% for 10 minutes.
## Offline/air-gap considerations
- Export metrics to local Prometheus scrape; no external sinks.
- Trace sampling and log retention configured via environment without needing control-plane access.
- Deterministic ordering preserved; cache warmers seeded from bundled fixtures.

View File

@@ -0,0 +1,29 @@
# CLI incident toggle contract (CLI-OBS-12-001)
**Goal**: define a deterministic CLI flag and config surface to enter/exit incident mode, required by TELEMETRY-OBS-55-001/56-001.
## Flags and config
- CLI flag: `--incident-mode` (bool). Defaults to false.
- Config key: `Telemetry:Incident:Enabled` (bool) and `Telemetry:Incident:TTL` (TimeSpan).
- When both the flag and the config key are specified, the flag wins, but only to opt in: the flag cannot disable incident mode that config has enabled (see the sketch below).
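A minimal sketch of that precedence rule; the resolver type is illustrative, not part of the CLI contract:
```csharp
// Hypothetical sketch of the documented precedence: the --incident-mode flag can
// only enable incident mode; it never turns it off when config enables it.
public static class IncidentModeResolver
{
    public static bool Resolve(bool cliFlagPresent, bool cliFlagValue, bool configEnabled)
    {
        // Flag wins when it opts in.
        if (cliFlagPresent && cliFlagValue)
        {
            return true;
        }

        // A flag that is absent or false cannot override a config-enabled state.
        return configEnabled;
    }
}
```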
## Effects when enabled
- Increase sampling rate ceiling to 100% for telemetry within the process.
- Add tag `incident=true` to logs/metrics/traces.
- Shorten exporter/reporting flush interval to 5s; disable external exporters when `Sealed=true`.
- Emit activation audit event `telemetry.incident.activated` with fields `{tenant, actor, source, expires_at}`.
## Persistence
- Incident flag runtime value stored in local state file `~/.stellaops/incident-mode.json` with fields `{enabled, set_at, expires_at, actor}` for offline continuity.
- File is tenant-scoped; permissions 0600.
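A sketch of persisting the state file with the documented field names and 0600 permissions; the record and store types are assumptions, not the actual CLI implementation:
```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization;

// Field names mirror the contract ({enabled, set_at, expires_at, actor}).
public sealed record IncidentModeState(
    [property: JsonPropertyName("enabled")] bool Enabled,
    [property: JsonPropertyName("set_at")] DateTimeOffset SetAt,
    [property: JsonPropertyName("expires_at")] DateTimeOffset ExpiresAt,
    [property: JsonPropertyName("actor")] string Actor);

public static class IncidentModeStore
{
    public static void Save(IncidentModeState state, string path)
    {
        var json = JsonSerializer.Serialize(state, new JsonSerializerOptions { WriteIndented = true });
        File.WriteAllText(path, json);

        // Restrict the state file to owner read/write (0600) on Unix-like hosts.
        if (!OperatingSystem.IsWindows())
        {
            File.SetUnixFileMode(path, UnixFileMode.UserRead | UnixFileMode.UserWrite);
        }
    }
}
```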
## Expiry / TTL
- Default TTL: 30 minutes unless `Telemetry:Incident:TTL` provided.
- On expiry, emit `telemetry.incident.expired` audit event.
## Validation expectations
- CLI should refuse `--incident-mode` if `--sealed` is set and external exporters are configured (must drop exporters first).
- Unit tests to cover precedence (flag over config), TTL expiry, state file perms, and audit emissions.
## Provenance
- Authored 2025-11-20 to unblock PREP-CLI-OBS-12-001 and TELEMETRY-OBS-55-001.

View File

@@ -0,0 +1,177 @@
# FN-Drift Metrics Reference
> **Sprint:** SPRINT_3404_0001_0001
> **Module:** Scanner Storage / Telemetry
## Overview
False-Negative Drift (FN-Drift) measures how often vulnerability classifications change from "not affected" or "unknown" to "affected" during rescans. This metric is critical for:
- **Accuracy Assessment**: Tracking scanner reliability over time
- **SLO Compliance**: Meeting false-negative rate targets
- **Root Cause Analysis**: Stratified analysis by drift cause
- **Feed Quality**: Identifying problematic vulnerability feeds
## Metrics
### Gauges (30-day rolling window)
| Metric | Type | Description |
|--------|------|-------------|
| `scanner.fn_drift.percent` | Gauge | 30-day rolling FN-Drift percentage |
| `scanner.fn_drift.transitions_30d` | Gauge | Total FN transitions in last 30 days |
| `scanner.fn_drift.evaluated_30d` | Gauge | Total findings evaluated in last 30 days |
| `scanner.fn_drift.cause.feed_delta` | Gauge | FN transitions caused by feed updates |
| `scanner.fn_drift.cause.rule_delta` | Gauge | FN transitions caused by rule changes |
| `scanner.fn_drift.cause.lattice_delta` | Gauge | FN transitions caused by VEX lattice changes |
| `scanner.fn_drift.cause.reachability_delta` | Gauge | FN transitions caused by reachability changes |
| `scanner.fn_drift.cause.engine` | Gauge | FN transitions caused by engine changes (should be ~0) |
### Counters (all-time)
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `scanner.classification_changes_total` | Counter | `cause` | Total classification status changes |
| `scanner.fn_transitions_total` | Counter | `cause` | Total false-negative transitions |
## Classification Statuses
| Status | Description |
|--------|-------------|
| `new` | First scan, no previous status |
| `unaffected` | Confirmed not affected |
| `unknown` | Status unknown/uncertain |
| `affected` | Confirmed affected |
| `fixed` | Previously affected, now fixed |
## Drift Causes
| Cause | Description | Expected Impact |
|-------|-------------|-----------------|
| `feed_delta` | Vulnerability feed updated (NVD, GHSA, OVAL) | High - most common cause |
| `rule_delta` | Policy rules changed | Medium - controlled by policy team |
| `lattice_delta` | VEX lattice state changed | Medium - VEX updates |
| `reachability_delta` | Reachability analysis changed | Low - improved analysis |
| `engine` | Scanner engine change | ~0 - determinism violation if >0 |
| `other` | Unknown/unclassified cause | Low - investigate if high |
## FN-Drift Definition
A **False-Negative Transition** occurs when:
- Previous status was `unaffected` or `unknown`
- New status is `affected`
This indicates the scanner previously classified a finding as "not vulnerable" but now classifies it as "vulnerable" - a false negative in the earlier scan.
### FN-Drift Rate Calculation
```
FN-Drift % = (FN Transitions / Total Reclassified) × 100
```
Where:
- **FN Transitions**: Count of `(unaffected|unknown) → affected` changes
- **Total Reclassified**: Count of all status changes (excluding `new`)
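The same calculation expressed as a small C# sketch; the `StatusChange` record is an assumption, while the formula and status names come from this document:
```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record StatusChange(string PreviousStatus, string NewStatus);

public static class FnDrift
{
    public static double Percent(IReadOnlyCollection<StatusChange> changes)
    {
        // Exclude first-time classifications; only reclassifications count.
        var reclassified = changes.Where(c => c.PreviousStatus != "new").ToList();
        if (reclassified.Count == 0)
        {
            return 0.0;
        }

        // FN transition: (unaffected|unknown) -> affected.
        var fnTransitions = reclassified.Count(c =>
            (c.PreviousStatus == "unaffected" || c.PreviousStatus == "unknown") &&
            c.NewStatus == "affected");

        return 100.0 * fnTransitions / reclassified.Count;
    }
}
```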
## SLO Thresholds
| SLO Level | FN-Drift Threshold | Alert Severity |
|-----------|-------------------|----------------|
| Target | < 1.0% | None |
| Warning | 1.0% - 2.5% | Warning |
| Critical | > 2.5% | Critical |
| Engine Drift | > 0% | Page |
### Alerting Rules
```yaml
# Example Prometheus alerting rules
groups:
  - name: fn-drift
    rules:
      - alert: FnDriftWarning
        expr: scanner_fn_drift_percent > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "FN-Drift rate above warning threshold"
      - alert: FnDriftCritical
        expr: scanner_fn_drift_percent > 2.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "FN-Drift rate above critical threshold"
      - alert: EngineDriftDetected
        expr: scanner_fn_drift_cause_engine > 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Engine-caused FN drift detected - determinism violation"
```
## Dashboard Queries
### FN-Drift Trend (Grafana)
```promql
# 30-day rolling FN-Drift percentage
scanner_fn_drift_percent
# FN transitions by cause
sum by (cause) (rate(scanner_fn_transitions_total[1h]))
# Classification changes rate
sum by (cause) (rate(scanner_classification_changes_total[1h]))
```
### Drift Cause Breakdown
```promql
# Pie chart of drift causes
topk(5,
sum by (cause) (
increase(scanner_fn_transitions_total[24h])
)
)
```
## Database Schema
### classification_history Table
```sql
CREATE TABLE scanner.classification_history (
    id BIGSERIAL PRIMARY KEY,
    artifact_digest TEXT NOT NULL,
    vuln_id TEXT NOT NULL,
    package_purl TEXT NOT NULL,
    tenant_id UUID NOT NULL,
    manifest_id UUID NOT NULL,
    execution_id UUID NOT NULL,
    previous_status TEXT NOT NULL,
    new_status TEXT NOT NULL,
    is_fn_transition BOOLEAN GENERATED ALWAYS AS (...) STORED,
    cause TEXT NOT NULL,
    cause_detail JSONB,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
### fn_drift_stats Materialized View
Aggregated daily statistics for efficient dashboard queries:
- Day bucket
- Tenant ID
- Cause breakdown
- FN count and percentage
## Related Documentation
- [Determinism Technical Reference](../product-advisories/14-Dec-2025%20-%20Determinism%20and%20Reproducibility%20Technical%20Reference.md) - Section 13.2
- [Scanner Architecture](../modules/scanner/architecture.md)
- [Telemetry Stack](../modules/telemetry/architecture.md)

View File

@@ -0,0 +1,49 @@
# Logging Standards (DOCS-OBS-50-003)
Last updated: 2025-12-15
## Goals
- Deterministic, structured logs for all services.
- Keep tenant safety and redaction guarantees while enabling search, correlation, and offline analysis.
## Log shape (JSON)
Required fields:
- `timestamp` (UTC ISO-8601)
- `tenant`, `workload` (service name), `env`, `region`, `version`
- `level` (`debug|info|warn|error|fatal`)
- `category` (logger/category name), `operation` (verb/action)
- `trace_id`, `span_id`, `correlation_id` (if external)
- `message` (concise, no secrets)
- `status` (`ok|error|fault|throttle`)
- `error.code`, `error.message` (redacted), `retryable` (bool) when status != ok
Optional but recommended:
- `resource` (subject id/purl/path when safe), `http.method`, `http.status_code`, `duration_ms`, `host`, `pid`, `thread`.
## Offline Kit / air-gap import fields
When emitting logs for Offline Kit import/activation flows, keep field names stable:
- Required scope key: `tenant_id`
- Common keys: `bundle_type`, `bundle_digest`, `bundle_path`, `manifest_version`, `manifest_created_at`
- Force activation keys: `force_activate`, `force_activate_reason`
- Outcome keys: `result`, `reason_code`, `reason_message`
- Quarantine keys: `quarantine_id`, `quarantine_path`
## Redaction rules
- Never log Authorization headers, tokens, passwords, private keys, full request/response bodies.
- Redact to `"[redacted]"` and add `redaction.reason` (`secret|pii|policy`).
- Hash low-cardinality identifiers when needed (`sha256` hex) and mark `hashed=true`.
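A minimal sketch of the hashing rule, assuming a small helper type (the name is illustrative):
```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Illustrative helper for the rule above: SHA-256, lowercase hex, paired with a
// hashed=true marker the log emitter can attach to the field.
public static class LogRedaction
{
    public static (string Value, bool Hashed) HashIdentifier(string value)
    {
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(value));
        return (Convert.ToHexString(digest).ToLowerInvariant(), true);
    }
}
```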
## Determinism & offline posture
- Stable key ordering not required, but field set must be consistent per log type.
- No external enrichment; rely on bundled metadata (service map, tenant labels).
- All times UTC; newline-delimited JSON (NDJSON); LF line endings.
## Sampling & rate limits
- Info logs rate-limited per component (default 100/s); warn/error/fatal never sampled.
- Structured audit logs (`category=audit`) are never sampled and must include `actor`, `action`, `target`, `result`.
## Validation checklist
- [ ] Required fields present and non-empty.
- [ ] No secrets/PII; redaction markers recorded.
- [ ] Correlation fields (`trace_id`, `span_id`) set when spans exist.
- [ ] Log level matches outcome (errors use warn/error/fatal only).

View File

@@ -0,0 +1,113 @@
# Metrics & SLOs (DOCS-OBS-51-001)
Last updated: 2025-12-15
## Core metrics (platform-wide)
- **Requests**: `http_requests_total{tenant,workload,route,status}` (counter); latency histogram `http_request_duration_seconds`.
- **Jobs**: `worker_jobs_total{tenant,queue,status}`; `worker_job_duration_seconds`.
- **DB**: `db_query_duration_seconds{db,operation}`; `db_pool_in_use`, `db_pool_available`.
- **Cache**: `cache_requests_total{result=hit|miss}`; `cache_latency_seconds`.
- **Queue depth**: `queue_depth{tenant,queue}` (gauge).
- **Errors**: `errors_total{tenant,workload,code}`.
- **Custom module metrics**: keep namespaced (e.g., `riskengine_score_duration_seconds`, `notify_delivery_attempts_total`).
## SLOs (suggested)
- API availability: 99.9% monthly per public service.
- P95 latency: <300 ms for read endpoints; <1 s for write endpoints.
- Worker job success: >99% over 30d; P95 job duration set per queue (document locally).
- Queue backlog: alert when `queue_depth` > 1000 for 5 minutes per tenant/queue.
- Error budget policy: 28-day rolling window; burn-rate alerts at 2× and 14× budget.
## Alert examples
- High error rate: `rate(errors_total[5m]) / rate(http_requests_total[5m]) > 0.02`.
- Latency regression: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.3`.
- Queue backlog: `queue_depth > 1000` for 5m.
- Job failures: `rate(worker_jobs_total{status="failed"}[10m]) > 0.01`.
## UX KPIs (triage TTFS)
- Targets:
- TTFS first evidence p95: <= 1.5s
- TTFS skeleton p95: <= 0.2s
- Clicks-to-closure median: <= 6
- Evidence completeness avg: >= 90% (>= 3.6/4)
```promql
# TTFS first evidence p50/p95
histogram_quantile(0.50, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))
# Clicks-to-closure median
histogram_quantile(0.50, sum(rate(stellaops_clicks_to_closure_bucket[5m])) by (le))
# Evidence completeness average percent (0-4 mapped to 0-100)
100 * (sum(rate(stellaops_evidence_completeness_score_sum[5m])) / clamp_min(sum(rate(stellaops_evidence_completeness_score_count[5m])), 1)) / 4
# Budget violations by phase
sum(rate(stellaops_performance_budget_violations_total[5m])) by (phase)
```
- Dashboard: `ops/devops/observability/grafana/triage-ttfs.json`
- Alerts: `ops/devops/observability/triage-alerts.yaml`
## TTFS Metrics (time-to-first-signal)
- Core metrics:
- `ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id}` (histogram)
- `ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_cache_hit_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_cache_miss_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code}` (counter)
- SLO targets:
- P50 < 2s, P95 < 5s (all surfaces)
- Warm path P50 < 700ms, P95 < 2.5s
- Cold path P95 < 4s
```promql
# TTFS latency p50/p95
histogram_quantile(0.50, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
# SLO breach rate (per minute)
60 * sum(rate(ttfs_slo_breach_total[5m]))
```
## Offline Kit (air-gap) metrics
- `offlinekit_import_total{status,tenant_id}` (counter)
- `offlinekit_attestation_verify_latency_seconds{attestation_type,success}` (histogram)
- `attestor_rekor_success_total{mode}` (counter)
- `attestor_rekor_retry_total{reason}` (counter)
- `rekor_inclusion_latency{success}` (histogram)
```promql
# Import rate by status
sum(rate(offlinekit_import_total[5m])) by (status)
# Import success rate
sum(rate(offlinekit_import_total{status="success"}[5m])) / clamp_min(sum(rate(offlinekit_import_total[5m])), 1)
# Attestation verify p95 by type (success only)
histogram_quantile(0.95, sum(rate(offlinekit_attestation_verify_latency_seconds_bucket{success="true"}[5m])) by (le, attestation_type))
# Rekor inclusion latency p95 (by success)
histogram_quantile(0.95, sum(rate(rekor_inclusion_latency_bucket[5m])) by (le, success))
```
Dashboard: `docs/modules/telemetry/dashboards/offline-kit-operations.json`
## Observability hygiene
- Tag everything with `tenant`, `workload`, `env`, `region`, `version`.
- Keep metric names stable; prefer adding labels over renaming.
- No high-cardinality labels (avoid `user_id`, `path`, raw errors); bucket or hash if needed.
- Offline: scrape locally (Prometheus/OTLP); ship exports via bundle if required.
## Dashboards
- Golden signals per service: traffic, errors, saturation, latency (P50/P95/P99).
- Queue dashboards: depth, age, throughput, success/fail rates.
- Tracing overlays: link span `status` to error metrics; use exemplars where supported.
## Validation checklist
- [ ] Metrics emitted with required tags.
- [ ] Cardinality review completed (no unbounded labels).
- [ ] Alerts wired to error budget policy.
- [ ] Dashboards cover golden signals and queue health.

View File

@@ -0,0 +1,240 @@
# AOC Observability Guide
> **Audience:** Observability Guild, Concelier/Excititor SREs, platform operators.
> **Scope:** Metrics, traces, logs, dashboards, and runbooks introduced as part of the Aggregation-Only Contract (AOC) rollout (Sprint 19).
This guide captures the canonical signals emitted by Concelier and Excititor once AOC guards are active. It explains how to consume the metrics in dashboards, correlate traces/logs for incident triage, and operate in offline environments. Pair this guide with the [AOC reference](../aoc/aggregation-only-contract.md) and [architecture overview](../modules/platform/architecture-overview.md).
---
## 1·Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ingestion_write_total` | Counter | `source`, `tenant`, `result` (`ok`, `reject`, `noop`) | Counts write attempts to `advisory_raw`/`vex_raw`. Rejects correspond to guard failures. |
| `ingestion_latency_seconds` | Histogram | `source`, `tenant`, `phase` (`fetch`, `transform`, `write`) | Measures end-to-end runtime for ingestion stages. Use `quantile=0.95` for alerting. |
| `aoc_violation_total` | Counter | `source`, `tenant`, `code` (`ERR_AOC_00x`) | Total guard violations bucketed by error code. Drives dashboard pills and alert thresholds. |
| `ingestion_signature_verified_total` | Counter | `source`, `tenant`, `result` (`ok`, `fail`, `skipped`) | Tracks signature/checksum verification outcomes. |
| `advisory_revision_count` | Gauge | `source`, `tenant` | Supersedes depth for raw documents; spikes indicate noisy upstream feeds. |
| `verify_runs_total` | Counter | `tenant`, `initiator` (`ui`, `cli`, `api`, `scheduled`) | How many `stella aoc verify` or `/aoc/verify` runs executed. |
| `verify_duration_seconds` | Histogram | `tenant`, `initiator` | Runtime of verification jobs; use P95 to detect regressions. |
### 1.1 · Alerts
- **Violation spike:** Alert when `increase(aoc_violation_total[15m]) > 0` for critical sources. Page SRE if `code="ERR_AOC_005"` (signature failure) or `ERR_AOC_001` persists >30min.
- **Stale ingestion:** Alert when `max_over_time((ingestion_latency_seconds_sum / ingestion_latency_seconds_count)[30m:])` exceeds 30s or if `ingestion_write_total` has no growth for >60min.
- **Signature drop:** Warn when `rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0`.
### 1.2 · `/obs/excititor/health`
`GET /obs/excititor/health` (scope `vex.admin`) returns a compact snapshot for Grafana tiles and Console widgets:
- `ingest` — overall status, worst lag (seconds), and the top connectors (status, lagSeconds, failure count, last success).
- `link` — freshness of consensus/linkset processing plus document counts and the number currently carrying conflicts.
- `signature` — recent coverage window (evaluated, with signatures, verified, failures, unsigned, coverage ratio).
- `conflicts` — rolling totals grouped by status plus per-bucket trend data for charts.
```json
{
"generatedAt": "2025-11-08T11:00:00Z",
"ingest": { "status": "healthy", "connectors": [ { "connectorId": "excititor:redhat", "lagSeconds": 45.3 } ] },
"link": { "status": "warning", "lastConsensusAt": "2025-11-08T10:57:03Z" },
"signature": { "status": "critical", "documentsEvaluated": 120, "verified": 30, "failures": 2 },
"conflicts": { "status": "warning", "conflictStatements": 325, "trend": [ { "bucketStart": "2025-11-08T10:00:00Z", "conflicts": 130 } ] }
}
```
| Setting | Default | Purpose |
|---------|---------|---------|
| `Excititor:Observability:IngestWarningThreshold` | `06:00:00` | Connector lag before `ingest.status` becomes `warning`. |
| `Excititor:Observability:IngestCriticalThreshold` | `24:00:00` | Connector lag before `ingest.status` becomes `critical`. |
| `Excititor:Observability:LinkWarningThreshold` | `00:15:00` | Maximum acceptable delay between consensus recalculations. |
| `Excititor:Observability:LinkCriticalThreshold` | `01:00:00` | Delay that marks link status as `critical`. |
| `Excititor:Observability:SignatureWindow` | `12:00:00` | Lookback window for signature coverage. |
| `Excititor:Observability:SignatureHealthyCoverage` | `0.8` | Coverage ratio that still counts as healthy. |
| `Excititor:Observability:SignatureWarningCoverage` | `0.5` | Coverage ratio that flips the status to `warning`. |
| `Excititor:Observability:ConflictTrendWindow` | `24:00:00` | Rolling window used for conflict aggregation. |
| `Excititor:Observability:ConflictTrendBucketMinutes` | `60` | Resolution of conflict `trend` buckets. |
| `Excititor:Observability:ConflictWarningRatio` | `0.15` | Fraction of consensus docs with conflicts that triggers `warning`. |
| `Excititor:Observability:ConflictCriticalRatio` | `0.3` | Ratio that marks `conflicts.status` as `critical`. |
| `Excititor:Observability:MaxConnectorDetails` | `50` | Number of connector entries returned (keeps payloads small). |
### 1.3 · Regression & DI hygiene
1. **Keep storage/integration tests green when telemetry touches persistence.**
- `./tools/postgres/local-postgres.sh start` downloads PostgreSQL 16.x (if needed), launches the instance, and prints `export EXCITITOR_TEST_POSTGRES_URI=postgresql://.../excititor-tests`. Copy that export into your shell.
- `./tools/postgres/local-postgres.sh restart` is a shortcut for "stop if running, then start" using the same dataset—use it after tweaking config or when tests need a bounce without wiping fixtures.
- `./tools/postgres/local-postgres.sh clean` stops the instance (if running) and deletes the managed data/log directories so storage tests begin from a pristine catalog.
- Run `dotnet test src/Excititor/__Tests/StellaOps.Excititor.Storage.Postgres.Tests/StellaOps.Excititor.Storage.Postgres.Tests.csproj -nologo -v minimal` (add `--filter` if you only touched specific suites). These tests exercise the same write paths that feed the dashboards, so regressions show up immediately.
- `./tools/postgres/local-postgres.sh stop` when finished so CI/dev hosts stay clean; `status|logs|shell` are available for troubleshooting.
2. **Declare optional Minimal API dependencies with `[FromServices] ... = null`.** RequestDelegateFactory treats `[FromServices] IVexSigner? signer = null` (or similar) as optional, so host startup succeeds even when tests have not registered that service. This pattern keeps observability endpoints cancellable while avoiding brittle test overrides.
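A minimal sketch of that pattern, using a placeholder `IVexSigner` stand-in and a placeholder route rather than the real endpoint contract:
```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.AspNetCore.Routing;

// Stand-in interface for illustration only.
public interface IVexSigner { }

public static class SignerHealthEndpoints
{
    public static void MapSignerHealth(this IEndpointRouteBuilder app) =>
        app.MapGet("/obs/sample/signer-health", GetSignerHealth);

    // RequestDelegateFactory treats the "= null" default as optional: mapping
    // succeeds and requests are served even when IVexSigner is not registered.
    private static IResult GetSignerHealth([FromServices] IVexSigner? signer = null)
    {
        var signature = signer is null ? "unavailable" : "enabled";
        return Results.Ok(new { signature });
    }
}
```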
---
## 2·Traces
### 2.1 · Span taxonomy
| Span name | Parent | Key attributes |
|-----------|--------|----------------|
| `ingest.fetch` | job root span | `source`, `tenant`, `uri`, `contentHash` |
| `ingest.transform` | `ingest.fetch` | `documentType` (`csaf`, `osv`, `vex`), `payloadBytes` |
| `ingest.write` | `ingest.transform` | `collection` (`advisory_raw`, `vex_raw`), `result` (`ok`, `reject`) |
| `aoc.guard` | `ingest.write` | `code` (on violation), `violationCount`, `supersedes` |
| `verify.run` | verification job root | `tenant`, `window.from`, `window.to`, `sources`, `violations` |
### 2.2 · Trace usage
- Correlate UI dashboard entries with traces via `traceId` surfaced in violation drawers (`docs/UI_GUIDE.md`).
- Use `aoc.guard` spans to inspect guard payload snapshots. Sensitive fields are redacted automatically; raw JSON lives in secure logs only.
- For scheduled verification, filter traces by `initiator="scheduled"` to compare runtimes pre/post change.
### 2.3 · Telemetry configuration (Excititor)
- Configure the web service via `Excititor:Telemetry`:
```jsonc
{
  "Excititor": {
    "Telemetry": {
      "Enabled": true,
      "EnableTracing": true,
      "EnableMetrics": true,
      "ServiceName": "stellaops-excititor-web",
      "OtlpEndpoint": "http://otel-collector:4317",
      "OtlpHeaders": {
        "Authorization": "Bearer ${OTEL_PUSH_TOKEN}"
      },
      "ResourceAttributes": {
        "env": "prod-us",
        "service.group": "ingestion"
      }
    }
  }
}
```
- Point the OTLP endpoint at the shared collector profile from §1 so Excititor metrics land in the `ingestion_*` dashboards next to Concelier. Resource attributes drive Grafana filtering (e.g., `env`, `service.group`).
- For offline/air-gap bundles set `Enabled=false` and collect the file exporter artifacts from the Offline Kit; import them into Grafana after transfer to keep time-to-truth dashboards consistent.
- Local development templates: run `tools/postgres/local-postgres.sh start` to spin up a PostgreSQL instance plus the matching `psql` client. The script prints the `export EXCITITOR_TEST_POSTGRES_URI=...` command that integration tests (e.g., `StellaOps.Excititor.Storage.Postgres.Tests`) will honor. Use `restart` for a quick bounce, `clean` to wipe data between suites, and `stop` when finished.
---
## 3·Logs
Structured logs include the following keys (JSON):
| Key | Description |
|-----|-------------|
| `traceId` | Matches OpenTelemetry trace/span IDs for cross-system correlation. |
| `tenant` | Tenant identifier enforced by Authority middleware. |
| `source.vendor` | Logical source (e.g., `redhat`, `ubuntu`, `osv`, `ghsa`). |
| `upstream.upstreamId` | Vendor-provided ID (CVE, GHSA, etc.). |
| `contentHash` | `sha256:` digest of the raw document. |
| `violation.code` | Present when guard rejects `ERR_AOC_00x`. |
| `verification.window` | Present on `/aoc/verify` job logs. |
Excititor APIs mirror these identifiers via response headers:
| Header | Purpose |
| --- | --- |
| `X-Stella-TraceId` | W3C trace/span identifier for deep-linking from Console → Grafana/Loki. |
| `X-Stella-CorrelationId` | Stable correlation identifier (respects inbound header or falls back to the request trace ID). |
Logs are shipped to the central Loki/Elasticsearch cluster. Use the template query:
```logql
{app="concelier-web"} | json | violation_code != ""
```
to spot active AOC violations.
### 3.1 · Advisory chunk API (AdvisoryAI feeds)
AdvisoryAI now leans on Concelier's `/advisories/{key}/chunks` endpoint for deterministic evidence packs. The service exports dedicated metrics so dashboards can highlight latency spikes, cache noise, or aggressive guardrail filtering before they impact AdvisoryAI responses.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `advisory_ai_chunk_requests_total` | Counter | `tenant`, `result`, `truncated`, `cache` | Count of chunk API calls, tagged with cache hits/misses and truncation state. |
| `advisory_ai_chunk_latency_milliseconds` | Histogram | `tenant`, `result`, `truncated`, `cache` | End-to-end build latency (milliseconds) for each chunk request. |
| `advisory_ai_chunk_segments` | Histogram | `tenant`, `result`, `truncated` | Number of chunk segments returned to the caller; watch for sudden drops tied to guardrails. |
| `advisory_ai_chunk_sources` | Histogram | `tenant`, `result` | How many upstream observations/sources contributed to a response (after observation limits). |
| `advisory_ai_guardrail_blocks_total` | Counter | `tenant`, `reason`, `cache` | Per-reason count of segments suppressed by guardrails (length, normalization, character set). |
Dashboards should plot latency P95/P99 next to cache hit rates and guardrail block deltas to catch degradation early. AdvisoryAI CLI/Console surfaces the same metadata so support engineers can correlate with Grafana/Loki entries using `traceId`/`correlationId` headers.
---
## 4·Dashboards
Primary Grafana dashboard: **“AOC Ingestion Health”** (`dashboards/aoc-ingestion.json`). Panels include:
1. **Sources overview:** table fed by `ingestion_write_total` and `ingestion_latency_seconds` (mirrors Console tiles).
2. **Violation trend:** stacked bar chart of `aoc_violation_total` per code.
3. **Signature success rate:** timeseries derived from `ingestion_signature_verified_total`.
4. **Supersedes depth:** gauge showing `advisory_revision_count` P95.
5. **Verification runs:** histogram and latency boxplot using `verify_runs_total` / `verify_duration_seconds`.
Secondary dashboards:
- **AOC Alerts (Ops view):** summarises active alerts, last verify run, and links to incident runbook.
- **Offline Mode Dashboard:** fed from Offline Kit imports; highlights snapshot age and queued verification jobs.
Update `docs/assets/dashboards/` with screenshots when the Grafana capture pipeline produces the latest renders.
---
## 5·Operational workflows
1. **During ingestion incident:**
- Check Console dashboard for offending sources.
- Pivot to logs using document `contentHash`.
- Re-run `stella sources ingest --dry-run` with problematic payloads to validate fixes.
- After remediation, run `stella aoc verify --since 24h` and confirm exit code `0`.
2. **Scheduled verification:**
- Configure cron job to run `stella aoc verify --format json --export ...`.
- Ship JSON to `aoc-verify` bucket and ingest into metrics using custom exporter.
- Alert on missing exports (no file uploaded within 26h).
3. **Offline kit validation:**
- Use the Offline Dashboard to ensure snapshots contain the latest metrics.
- Run verification reports locally and attach them to the bundle before distribution.
4. **Incident toggle audit:**
- Authority requires `incident_reason` when issuing `obs:incident` tokens; plan your runbooks to capture business justification.
- Auditors can call `/authority/audit/incident?limit=100` with the tenant header to list recent incident activations, including reason and issuer.
---
## 6·Offline considerations
- Metrics exporters bundled with Offline Kit write to local Prometheus snapshots; sync them with central Grafana once connectivity is restored.
- CLI verification reports should be hashed (`sha256sum`) and archived for audit trails.
- Dashboards include offline data sources (`prometheus-offline`) switchable via dropdown.
---
## 7·References
- [Aggregation-Only Contract reference](../aoc/aggregation-only-contract.md)
- [Architecture overview](../modules/platform/architecture-overview.md)
- [Console guide](../UI_GUIDE.md)
- [CLI AOC commands](../modules/cli/guides/cli-reference.md)
- [Concelier architecture](../modules/concelier/architecture.md)
- [Excititor architecture](../modules/excititor/architecture.md)
- [Scheduler Worker observability guide](../modules/scheduler/operations/worker.md)
---
## 8·Compliance checklist
- [ ] Metrics documented with label sets and alert guidance.
- [ ] Tracing span taxonomy aligned with Concelier/Excititor implementation.
- [ ] Log schema matches structured logging contracts (traceId, tenant, source, contentHash).
- [ ] Grafana dashboard references verified and screenshots scheduled.
- [ ] Offline/air-gap workflow captured.
- [ ] Cross-links to AOC reference, console, and CLI docs included.
- [ ] Observability Guild sign-off scheduled (OWNER: @obs-guild, due 2025-10-28).
---
*Last updated: 2025-10-26 (Sprint 19).*

View File

@@ -0,0 +1,166 @@
# Policy Engine Observability
> **Audience:** Observability Guild, SRE/Platform operators, Policy Guild.
> **Scope:** Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
> **Prerequisites:** Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config).
---
## 1·Instrumentation Overview
- **Telemetry stack:** OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
- **Namespace conventions:** `policy.*` for metrics/traces/log categories; labels use `tenant`, `policy`, `mode`, `runId`.
- **Sampling:** Default 10% trace sampling, 1% rule-hit log sampling; incident mode overrides to 100% (see §7).
- **Correlation IDs:** Every API request gets `traceId` + `requestId`. CLI/UI display IDs to streamline support.
---
## 2·Metrics
### 2.1 Run Pipeline
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target ≤5min incremental, ≤30min full. |
| `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated each enqueue/dequeue). |
| `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. |
| `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. |
| `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting run. |
### 2.2 Evaluator Insights
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Increment per rule match (sampled). |
| `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. |
| `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. |
| `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. |
| `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero indicates optimistic concurrency retries. |
### 2.3 API Surface
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. |
| `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget ≤250ms for GETs, ≤1s for POSTs. |
| `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (`429`). |
### 2.4 Queue & Change Streams
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. |
| `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. |
| `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of oldest unprocessed change event. |
---
## 3·Logs
- **Format:** JSON (`Serilog`). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`.
- **Log categories:**
- `policy.run` (queue lifecycle, run begin/end, stats)
- `policy.evaluate` (batch execution summaries; rule-hit sampling)
- `policy.materialize` (Mongo operations, conflicts, retries)
- `policy.simulate` (diff results, CLI invocation metadata)
- `policy.lifecycle` (submit/review/approve events)
- **Sampling:** Rule-hit logs sample 1% by default; toggled to 100% in incident mode or when `--trace` flag used in CLI.
- **PII:** No user secrets recorded; user identities referenced as `user:<id>` or `group:<id>` only.
---
## 4·Traces
- Spans emit via OpenTelemetry instrumentation.
- **Primary spans:**
- `policy.api` — wraps HTTP request, records `endpoint`, `status`, `scope`.
- `policy.select` — change stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`).
- `policy.evaluate` — evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`).
- `policy.materialize` — Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`).
- `policy.simulate` — simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`).
- Trace context propagated to CLI via response headers `traceparent`; UI surfaces in run detail view.
- Incident mode forces span sampling to 100% and extends retention via Collector config override.
---
## 5·Dashboards
### 5.1 Policy Runs Overview
Widgets:
- Run duration histogram (per mode/tenant).
- Queue depth + backlog age line charts.
- Failure rate stacked by error code.
- Incremental backlog heatmap (policy × age).
- Active vs scheduled runs table.
### 5.2 Rule Impact & VEX
- Top N rules by firings (bar chart).
- VEX overrides by vendor/justification (stacked chart).
- Suppression usage (pie + table with justifications).
- Quieted findings trend (line).
### 5.3 Simulation & Approval Health
- Simulation diff histogram (added vs removed).
- Pending approvals by age (table with SLA colour coding).
- Compliance checklist status (lint, determinism CI, simulation evidence).
> Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (`../assets/policy-observability/*.png`).
---
## 6·Alerting
| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| **PolicyRunSlaBreach** | `policy_run_seconds{mode="incremental"}` P95 > 300s for 3 windows | Check queue depth, upstream services, scale worker pool. |
| **PolicyQueueStuck** | `policy_delta_backlog_age_seconds` > 600 | Investigate change stream connectivity. |
| **DeterminismMismatch** | Run status `failed` with `ERR_POL_004` OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
| **SimulationDrift** | CLI/CI simulation exit `20` (blocking diff) over threshold | Review policy changes before approval. |
| **VexOverrideSpike** | `policy_vex_overrides_total` > configured baseline (per vendor) | Verify upstream VEX feed; ensure justification codes expected. |
| **SuppressionSurge** | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppress rules; check approvals. |
Alerts integrate with Notifier channels (`policy.alerts`) and Ops on-call rotations.
---
## 7·Incident Mode & Forensics
- Toggle via `POST /api/policy/incidents/activate` (requires `policy:operate` scope).
- Effects:
- Trace sampling → 100%.
- Rule-hit log sampling → 100%.
- Retention window extended to 30 days for incident duration.
- `policy.incident.activated` event emitted (Console + Notifier banners).
- Post-incident tasks:
- `stella policy run replay` for affected runs; attach bundles to incident record.
- Restore sampling defaults with `.../deactivate`.
- Update incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings.
---
## 8·Integration Points
- **Authority:** Exposes metric `policy_scope_denied_total` for failed authorisation; correlate with `policy_api_requests_total`.
- **Concelier/Excititor:** Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
- **Scheduler:** Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
- **Offline Kit:** CLI exports logs + metrics snapshots (`stella offline bundle metrics`) for air-gapped audits.
---
## 9·Compliance Checklist
- [ ] **Metrics registered:** All metrics listed above exported and documented in Grafana dashboards.
- [ ] **Alert policies configured:** Ops or Observability Guild created alerts matching table in §6.
- [ ] **Sampling overrides tested:** Incident mode toggles verified in staging; retention roll-back rehearsed.
- [ ] **Trace propagation validated:** CLI/UI display trace IDs and allow copy for support.
- [ ] **Log scrubbing enforced:** Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
- [ ] **Offline capture rehearsed:** Metrics/log snapshot commands executed in sealed environment.
- [ ] **Docs cross-links:** Links to architecture, runs, lifecycle, CLI, API docs verified.
---
*Last updated: 2025-10-26 (Sprint 20).*

View File

@@ -0,0 +1,48 @@
# Telemetry Core Bootstrap (v1 · 2025-11-19)
## Goal
Show minimal host wiring for `StellaOps.Telemetry.Core` with deterministic defaults and sealed-mode friendliness.
## Sample (web/worker host)
```csharp
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddStellaOpsTelemetry(
    builder.Configuration,
    serviceName: "StellaOps.SampleService",
    serviceVersion: builder.Configuration["VERSION"],
    configureOptions: options =>
    {
        // Disable collector in sealed mode / air-gap
        options.Collector.Enabled = builder.Configuration.GetValue<bool>("Telemetry:Collector:Enabled", true);
        options.Collector.Endpoint = builder.Configuration["Telemetry:Collector:Endpoint"];
        options.Collector.Protocol = TelemetryCollectorProtocol.Grpc;
    },
    configureMetrics: m => m.AddAspNetCoreInstrumentation(),
    configureTracing: t => t.AddHttpClientInstrumentation());
```
## Configuration (appsettings.json)
```json
{
  "Telemetry": {
    "Collector": {
      "Enabled": true,
      "Endpoint": "https://otel-collector.example:4317",
      "Protocol": "Grpc",
      "Component": "sample-service",
      "Intent": "telemetry-export",
      "DisableOnViolation": true
    }
  }
}
```
## Determinism & safety
- UTC timestamps only; no random IDs introduced by the helper.
- Exporter is skipped when endpoint missing or egress policy denies.
- `VSTEST_DISABLE_APPDOMAIN=1` recommended for tests with `tools/linksets-ci.sh` pattern.
## Next
- Propagation adapters (50-002) will build on this bootstrap.
- Scrub/analyzer policies live under upcoming 51-001/51-002 tasks.

View File

@@ -0,0 +1,45 @@
# Telemetry propagation contract (TELEMETRY-OBS-51-001)
**Goal**: standardise trace/metrics propagation across StellaOps services so golden-signal helpers remain deterministic, tenant-safe, and offline-friendly.
## Scope
- Applies to HTTP, gRPC, background jobs, and message handlers instrumented via `StellaOps.Telemetry.Core`.
- Complements bootstrap guide (`telemetry-bootstrap.md`) and precedes metrics helper implementation.
## Required context fields
- `trace_id` / `span_id`: W3C TraceContext headers only (no B3); generate if missing.
- `tenant`: lower-case string; required for all incoming requests; default to `unknown` only in sealed/offline diagnostics jobs.
- `actor`: optional user/service principal; redacted to hash in logs when `Scrub.Sealed=true`.
- `imposed_rule`: optional string conveying enforcement context (e.g., `merge=false`).
## HTTP middleware
- Accept `traceparent`/`tracestate`; reject/strip vendor-specific headers.
- Propagate `tenant`, `actor`, `imposed-rule` via `x-stella-tenant`, `x-stella-actor`, `x-stella-imposed-rule` headers (defaults configurable via `Telemetry:Propagation`).
- Middleware entry point: `app.UseStellaOpsTelemetryContext()` plus the `TelemetryPropagationHandler` automatically added to all `HttpClient` instances when `AddStellaOpsTelemetry` is called.
- Emit exemplars: when sampling is off, attach exemplar ids to request duration and active request metrics.
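An illustrative sketch of what such a propagation handler could do; the shipped `TelemetryPropagationHandler` may differ in detail:
```csharp
using System.Diagnostics;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: copies baggage set by inbound middleware onto outbound headers.
public sealed class StellaPropagationHandlerSketch : DelegatingHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var activity = Activity.Current;
        CopyBaggage(activity, request, "tenant", "x-stella-tenant");
        CopyBaggage(activity, request, "actor", "x-stella-actor");
        CopyBaggage(activity, request, "imposed_rule", "x-stella-imposed-rule");
        return base.SendAsync(request, cancellationToken);
    }

    private static void CopyBaggage(Activity? activity, HttpRequestMessage request, string baggageKey, string header)
    {
        var value = activity?.GetBaggageItem(baggageKey);
        if (!string.IsNullOrEmpty(value) && !request.Headers.Contains(header))
        {
            request.Headers.TryAddWithoutValidation(header, value);
        }
    }
}
```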
## gRPC interceptors
- Use binary TraceContext; carry metadata keys `stella-tenant`, `stella-actor`, `stella-imposed-rule`.
- Enforce presence of `tenant`; abort with `Unauthenticated` if missing in non-sealed mode.
## Jobs & message handlers
- Wrap background job execution with Activity + baggage items (`tenant`, `actor`, `imposed_rule`).
- When publishing bus events, stamp `trace_id` and `tenant` into headers; avoid embedding PII in payloads.
## Metrics helper expectations
- Golden signals: `http.server.duration`, `http.client.duration`, `messaging.operation.duration`, `job.execution.duration`, `runtime.gc.pause`, `db.call.duration`.
- Mandatory tags: `tenant`, `service`, `endpoint`/`operation`, `result` (`ok|error|cancelled|throttled`), `sealed` (`true|false`).
- Cardinality guard: trim tag values to 64 chars (configurable) and replace values beyond the first 50 distinct entries per key with `other` (enforced by `MetricLabelGuard`).
- Helper API: `Histogram<double>.RecordRequestDuration(guard, durationMs, route, verb, status, result)` applies guard + tags consistently.
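A rough illustration of the cardinality rules above (64-character trim, first 50 distinct values per key, `other` afterwards); the real `MetricLabelGuard` may differ:
```csharp
using System.Collections.Concurrent;

// Sketch of the documented cardinality guard behaviour; not the shipped type.
public sealed class MetricLabelGuardSketch
{
    private const int MaxValueLength = 64;
    private const int MaxDistinctPerKey = 50;
    private readonly ConcurrentDictionary<string, ConcurrentDictionary<string, byte>> _seen = new();

    public string Normalize(string tagKey, string tagValue)
    {
        var trimmed = tagValue.Length > MaxValueLength ? tagValue[..MaxValueLength] : tagValue;
        var known = _seen.GetOrAdd(tagKey, _ => new ConcurrentDictionary<string, byte>());

        if (known.ContainsKey(trimmed))
        {
            return trimmed;        // already-seen values always pass through
        }

        if (known.Count >= MaxDistinctPerKey)
        {
            return "other";        // cap reached: collapse new values
        }

        known.TryAdd(trimmed, 0);  // admit a new value (approximate under concurrency)
        return trimmed;
    }
}
```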
## Determinism & offline posture
- All timestamps UTC RFC3339; sampling configs controlled via appsettings and mirrored in offline bundles.
- No external exporters when `Sealed=true`; use in-memory or file-based OTLP for air-gap.
## Tests to add with implementation
- Middleware unit tests asserting header/baggage mapping and tenant enforcement.
- Metrics helper tests ensuring required tags present and trimmed; exemplar id attached when enabled.
- Deterministic snapshot tests for serialized OTLP when sealed/offline.
## Provenance
- Authored 2025-11-20 to unblock TELEMETRY-OBS-51-001; to be refined as helpers are coded.

View File

@@ -0,0 +1,35 @@
# Telemetry scrubbing contract (TELEMETRY-OBS-51-002)
**Purpose**: define redaction/scrubbing rules for logs/traces/metrics before implementing helpers in `StellaOps.Telemetry.Core`.
## Redaction rules
- Strip or hash PII/credentials: emails, tokens, passwords, secrets, bearer/mTLS cert blobs.
- Default hash algorithm: SHA-256 hex; include `scrubbed=true` tag.
- Allowlist fields that remain: `tenant`, `trace_id`, `span_id`, `endpoint`, `result`, `sealed`.
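An illustrative regex-plus-hash scrubber for the rules above; the patterns and class name are assumptions, not the shipped filter:
```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

// Illustrative scrubber: replaces emails and bearer tokens with SHA-256 hex.
public static class TelemetryScrubberSketch
{
    private static readonly Regex Email = new(@"[^\s@]+@[^\s@]+\.[^\s@]+", RegexOptions.Compiled);
    private static readonly Regex BearerToken = new(@"(?i)bearer\s+[A-Za-z0-9\-._~+/]+=*", RegexOptions.Compiled);

    public static string Scrub(string value, string? salt = null)
    {
        var scrubbed = Email.Replace(value, m => Hash(m.Value, salt));
        return BearerToken.Replace(scrubbed, m => Hash(m.Value, salt));
    }

    private static string Hash(string value, string? salt)
    {
        // SHA-256 lowercase hex; deterministic across deployments when no salt is set.
        var bytes = Encoding.UTF8.GetBytes(salt is null ? value : salt + value);
        return Convert.ToHexString(SHA256.HashData(bytes)).ToLowerInvariant();
    }
}
```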
## Configuration knobs
- `Telemetry:Scrub:Enabled` (bool, default true).
- `Telemetry:Scrub:Sealed` (bool, default false) — when true, force scrubbing and disable external exporters.
- `Telemetry:Scrub:HashSalt` (string, optional) — per-tenant salt; omit to keep deterministic hashes across deployments.
- `Telemetry:Scrub:MaxValueLength` (int, default 256) — truncate values beyond this length before hashing.
## Logger sink expectations
- Implement scrubber as `ILogPayloadFilter` injected before sink.
- Ensure message templates remain intact; only values scrubbed.
- Preserve structured shape so downstream parsing remains deterministic.
## Metrics & traces
- Never place raw user input into metric/tag values; pass through scrubber before export.
- Span events must omit payload bodies; include keyed references only.
## Auditing
- When scrubbing occurs, add tag `scrubbed=true` and `scrub_reason` (`pii|secret|length|pattern`).
- Provide counter `telemetry.scrub.events{tenant,reason}` for observability.
## Tests to add with implementation
- Unit tests for regex-based scrubbing of tokens, emails, URLs with creds.
- Config-driven tests toggling `Enabled`/`Sealed` modes to ensure exporters are suppressed when sealed.
- Determinism test: same input yields identical hashed output when salt unset.
## Provenance
- Authored 2025-11-20 to unblock TELEMETRY-OBS-51-002 and downstream 55/56 tasks.

View File

@@ -0,0 +1,33 @@
# Sealed-mode telemetry helpers (TELEMETRY-OBS-56-001 prep)
## Objective
Define behavior and configuration for telemetry when `Sealed=true`, ensuring no external egress while preserving deterministic local traces/metrics for audits.
## Requirements
- Disable external OTLP/exporters automatically when sealed; fallback to in-memory or file OTLP (`telemetry-sealed.otlp`) with bounded size (default 10 MB, ring buffer).
- Add tag `sealed=true` to all spans/metrics/logs; suppress exemplars.
- Force scrubbing: treat `Scrub.Sealed=true` regardless of default settings.
- Sampling: cap to 10% max in sealed mode unless CLI incident toggle raises it (see CLI-OBS-12-001 contract); ceiling 100% with explicit override `Telemetry:Sealed:MaxSamplingPercent`.
- Clock source: require monotonic clock for duration; emit warning if system clock skew detected >500ms.
## Configuration keys
- `Telemetry:Sealed:Enabled` (bool) — driven by host; when true activate sealed behavior.
- `Telemetry:Sealed:Exporter` (enum `memory|file`) — default `file`.
- `Telemetry:Sealed:FilePath` (string) — default `./logs/telemetry-sealed.otlp`.
- `Telemetry:Sealed:MaxBytes` (int) — default 10_485_760 (10 MB).
- `Telemetry:Sealed:MaxSamplingPercent` (int) — default 10.
- Derived flag `Telemetry:Sealed:EffectiveIncidentMode` (read-only) exposes if incident-mode override lifted sampling ceiling.
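A sketch of the sampling ceiling logic, with parameter names mirroring the configuration keys (the helper itself is an assumption):
```csharp
using System;

// Sketch only: sealed mode caps sampling at MaxSamplingPercent unless the
// CLI incident toggle lifts the ceiling to 100%.
public static class SealedSampling
{
    public static double EffectivePercent(
        double requestedPercent,
        bool sealedMode,
        bool incidentMode,
        int maxSamplingPercent = 10)
    {
        if (!sealedMode)
        {
            return requestedPercent;
        }

        var ceiling = incidentMode ? 100 : maxSamplingPercent;
        return Math.Min(requestedPercent, ceiling);
    }
}
```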
## File exporter format
- OTLP binary, append-only, deterministic ordering by enqueue time.
- Rotate when exceeding `MaxBytes` using suffix `.1`, `.2` capped to 3 files; oldest dropped.
- Permissions 0600 by default; fail-start if path is world-readable.
## Validation tests to implement with 56-001
- Unit: sealed mode forces exporter swap and tags `sealed=true`, `scrubbed=true`.
- Unit: sampling capped at max percent unless incident override set.
- Unit: file exporter rotates deterministically and enforces 0600 perms.
- Integration: sealed + incident mode together still block external exporters and honor scrub rules.
## Provenance
- Authored 2025-11-20 to satisfy PREP-TELEMETRY-OBS-56-001 and unblock implementation.

View File

@@ -0,0 +1,38 @@
# Telemetry Standards (DOCS-OBS-50-002)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Common envelope
- **Trace context**: `trace_id`, `span_id`, `trace_flags`; propagate W3C `traceparent` and `baggage` end to end.
- **Tenant & workload**: `tenant`, `workload` (service name), `region`, `env` (dev/stage/prod), `version` (git sha or semver).
- **Subject**: `component` (module), `operation` (verb/name), `resource` (purl/uri/subject id when safe).
- **Timing**: UTC ISO-8601 `timestamp`; durations in milliseconds as integers.
- **Outcome**: `status` (`ok|error|fault|throttle`), `error.code` (machine), `error.message` (human, redacted), `retryable` (bool).
## Scrubbing policy
- Denylist PII/secrets before emit: emails, tokens, Authorization headers, bearer fragments, private keys, passwords, session IDs.
- Redact fields to `"[redacted]"` and add `redaction.reason` (`secret|pii|tenant_policy`).
- Hash low-cardinality identifiers when needed (`sha256` lowercase hex) and mark `hashed=true`.
- Logs must not contain full request/response bodies; store hashes plus lengths. For NDJSON exports, allow hashes + selected headers only.
## Sampling defaults
- **Traces**: 10% head sampling non-prod; 100% for `status=error|fault` and for spans tagged `audit=true`. Prod default 5% with the same error/audit boost.
- **Logs**: info logs rate-limited per component (default 100/s); warn/error never sampled. Structured JSON only.
- **Metrics**: never sampled; counters/gauges/histograms use deterministic bucket boundaries documented in component specs.
## Redaction override procedure
- Overrides are rare and must be auditable.
- To allow a field temporarily, set `telemetry.redaction.overrides=<comma list>` in service config with change-ticket id; emit `redaction.override=true` tag on affected spans/logs.
- Overrides expire automatically after `telemetry.redaction.override_ttl` (default 24h); services refuse to start with expired overrides.
- All overrides are logged to `telemetry.redaction.audit` channel with actor, ticket, fields, TTL.
## Determinism & offline posture
- No external enrichers; all enrichment data must be preloaded bundles (e.g., service map, tenant metadata).
- Sorting for exports: by `timestamp`, then `workload`, then `operation`.
- Time always UTC; avoid locale-specific formats.
## Validation checklist
- [ ] `traceparent` propagated and present on inbound/outbound.
- [ ] Required fields present (`tenant`, `workload`, `operation`, `status`).
- [ ] Scrubbing tests cover auth headers and bodies.
- [ ] Sampling knobs configurable via env vars with documented defaults.

View File

@@ -0,0 +1,37 @@
# Tracing Standards (DOCS-OBS-50-004)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Goals
- Consistent distributed tracing across services (API, workers, CLI).
- Safe for offline/air-gapped deployments.
- Deterministic span data for replay/debug.
## Context propagation
- Use W3C headers: `traceparent` (required), `baggage` (optional key/value pairs).
- Preserve incoming `trace_id` for all downstream calls; create child spans per operation.
- For async work (queues, cron), copy `traceparent` and `baggage` into the message envelope; new span links to the stored context using **links**, not a new parent.
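A sketch of the link-not-parent rule for queued work, assuming an illustrative source name and a stored `traceparent` string:
```csharp
using System.Collections.Generic;
using System.Diagnostics;

// The consumer span links to the producer's context instead of adopting it as parent.
public static class QueueTracing
{
    private static readonly ActivitySource Source = new("StellaOps.Worker");

    public static Activity? StartProcessSpan(string? storedTraceparent)
    {
        var links = new List<ActivityLink>();
        if (storedTraceparent is not null &&
            ActivityContext.TryParse(storedTraceparent, traceState: null, out var upstream))
        {
            links.Add(new ActivityLink(upstream));
        }

        return Source.StartActivity(
            "worker.process",
            ActivityKind.Consumer,
            parentContext: default,
            links: links);
    }
}
```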
## Span conventions
- Names: `<component>.<operation>` (e.g., `riskengine.simulate`, `notify.deliver`).
- Required attributes: `tenant`, `workload` (service), `env`, `region`, `version`, `operation`, `status`.
- HTTP spans: add `http.method`, `http.route`, `http.status_code`, `net.peer.name`, `net.peer.port`.
- DB spans: `db.system`, `db.name`, `db.operation`, `db.statement` (omit literals).
- Message spans: `messaging.system`, `messaging.destination`, `messaging.operation` (`send|receive|process`), `messaging.message_id`.
- Errors: set `status=error`, include `error.code`, redacted `error.message`, `retryable` (bool).
## Sampling
- Default head sampling: 10% non-prod, 5% prod.
- Always sample spans with `status=error|fault` or `audit=true`.
- Allow override via env `Tracing__SampleRate` (0–1) per service; document in runbooks.
## Offline/air-gap posture
- No external exporters; emit OTLP to local collector or file.
- Disable remote enrichment; rely on bundled service map.
- All timestamps UTC; span ids deterministic only in scope of traceparent (no GUID reuse).
## Validation checklist
- [ ] `traceparent` forwarded on every inbound/outbound call.
- [ ] Required attributes present on spans.
- [ ] Error spans include codes and redacted messages.
- [ ] Sampling knobs documented in service config.

View File

@@ -0,0 +1,190 @@
# Console Observability
> **Audience:** Observability Guild, Console Guild, SRE/operators.
> **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).
> **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`).
---
## 1·Instrumentation Overview
- **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.
- **Sampling:** Front-end spans sample at 5% by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are un-sampled; log sampling is handled per category (§3).
- **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.
- **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams.
---
## 2·Metrics
### 2.1 Experience & Navigation
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤1.5s (cached). |
| `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. |
| `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. |
| `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. |
| `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. |
### 2.2 Security & Session
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. |
| `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; backlog above baseline indicates workflow friction. |
| `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). |
### 2.3 Downloads & Offline Kit
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target <3s. |
| `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. |
| `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. |
### 2.4 Telemetry Emission & Errors
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. |
| `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. |
> **Scraping tips:**
> - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.
> - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).
> - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`.
---
## 3·Logs
- **Format:** JSON via the Console log bridge; emitted to stdout and to an optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.
- **Categories:**
  - `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50% by default; override with feature flag `telemetry.logVerbose`.
  - `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.
  - `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`.
  - `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).
  - `ui.telemetry.failure` – OTLP export errors; include `httpStatus`, `batchSize`, `retryCount`.
- **PII handling:** Full email addresses are scrubbed; subjects appear only as hashed values (`user:<sha256>`). Even when `ui.admin` plus fresh-auth is granted for the action, the raw address remains redacted in logs (a hashing sketch follows this list).
- **Retention:** Recommended 14days for connected sites, 30days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`.
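A minimal sketch of the PII rule above, hashing the subject with the Web Crypto API before it reaches any sink. The `logUiAction` helper and its exact field set are illustrative; only the core fields listed in this section come from the contract itself.

```typescript
// Sketch: the subject is logged only as user:<sha256>, never as a raw address.
async function hashSubject(email: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(email));
  const hex = Array.from(new Uint8Array(digest))
    .map((byte) => byte.toString(16).padStart(2, '0'))
    .join('');
  return `user:${hex}`;
}

async function logUiAction(
  action: string,
  route: string,
  tenant: string,
  email: string,
  correlationId: string,
): Promise<void> {
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level: 'Information',
    action,
    route,
    tenant,
    subject: await hashSubject(email), // hashed, per the PII rule above
    correlationId,
  }));
}
```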
---
## 4·Traces
- **Span names & attributes:**
  - `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit` (a sketch follows this list).
  - `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.
  - `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.
  - `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.
  - `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/UI_GUIDE.md`.
- **Propagation:** Spans use the W3C `traceparent` header; the gateway echoes it to backend APIs so traces stitch across the UI, gateway, and backend services.
- **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100% for incident debugging).
- **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`.
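For orientation, here is a sketch of how the `ui.route.transition` span listed above could wrap a navigation using the OpenTelemetry API. `navigateTo` is a hypothetical router call, and the attribute names mirror the list; the console's actual instrumentation may differ.

```typescript
// Sketch: wrapping a navigation in the ui.route.transition span described above.
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('stellaops-console');

async function tracedNavigation(
  route: string,
  tenant: string,
  navigateTo: (route: string) => Promise<boolean>, // hypothetical router call; resolves prefetchHit
): Promise<void> {
  await tracer.startActiveSpan('ui.route.transition', async (span) => {
    const start = performance.now();
    try {
      const prefetchHit = await navigateTo(route);
      span.setAttributes({
        route,
        tenant,
        renderMillis: Math.round(performance.now() - start),
        prefetchHit,
      });
    } finally {
      span.end();
    }
  });
}
```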
---
## 5·Dashboards
### 5.1 Experience Overview
Panels:
- Route render histogram (P50/P90/P99) by route.
- Backend call latency stacked by service (`ui_request_duration_seconds`).
- Offline banner duration trend (`ui_offline_banner_seconds`).
- Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).
- Command palette usage (`ui_filter_apply_total` + `ui.action` log counts).
### 5.2 Downloads & Offline Kit
- Manifest refresh time chart (per channel).
- Export queue depth gauge with alert thresholds.
- CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).
- Offline parity banner occurrences (derived from the Downloads API `downloads.offlineParity` flag).
- Last Offline Kit import timestamp (join with Downloads API metadata).
### 5.3 Security & Session
- Fresh-auth prompt counts vs success/fail ratios.
- DPoP failure stacked by reason.
- Tenant mismatch warnings (from `ui.security.anomaly` logs).
- Scope usage heatmap (derived from Authority audit events + UI logs).
- CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs; see the sketch after this list).
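A minimal sketch of that listener, assuming a hypothetical `emitSecurityAnomaly` helper that writes `ui.security.anomaly` entries in the log shape from §3:

```typescript
// Sketch: forward CSP violations into the ui.security.anomaly log category.
function emitSecurityAnomaly(fields: Record<string, unknown>): void {
  // Stand-in for the console's log bridge; level Warning per §3.
  console.warn(JSON.stringify({ level: 'Warning', action: 'ui.security.anomaly', ...fields }));
}

document.addEventListener('securitypolicyviolation', (event) => {
  emitSecurityAnomaly({
    kind: 'csp',
    blockedURI: event.blockedURI,
    violatedDirective: event.violatedDirective,
    sourceFile: event.sourceFile,
    lineNumber: event.lineNumber,
  });
});
```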
> Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc.
---
## 6·Alerting
| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| **ConsoleLatencyHigh** | Share of renders completing within 1.5 s (`ui_route_render_seconds_bucket{le="1.5"}` divided by `ui_route_render_seconds_count`) drops below 0.95 for 3 consecutive intervals | Inspect route splits, check backend latencies, review CDN cache. |
| **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1s for any service | Correlate with gateway/service dashboards; escalate to owning guild. |
| **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. |
| **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. |
| **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120s | Investigate Authority/gateway availability; verify Offline Kit freshness. |
| **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. |
| **TelemetryExportErrors** | `ui_telemetry_batch_failures_total` > 0 for ≥5min | Check collector health, credentials, or TLS trust. |
Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation.
---
## 7·Feature Flags & Configuration
| Flag / Env Var | Purpose | Default |
|----------------|---------|---------|
| `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` |
| `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` |
| `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` |
| `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` |
| `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset |
| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset |
| `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` |
| `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto |
| `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` |
Feature flag changes should be tracked in release notes and mirrored in `docs/UI_GUIDE.md` (navigation and workflow expectations).
---
## 8·Offline / Air-Gapped Workflow
- Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/operations/console-docker-install.md` §4).
- Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.
- Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.
- When collectors are unavailable, the console queues OTLP batches (for up to 5 minutes) and exposes the backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss (a buffering sketch follows this list).
- After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify that `ui_telemetry_batch_failures_total` has stopped increasing.
- Retain telemetry bundles for 30days per compliance guidelines; include Grafana JSON exports in audit packages.
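The buffering behaviour described above can be pictured as a small bounded queue; the sketch below is illustrative only and does not reflect the bridge's actual retry logic or serialization format.

```typescript
// Illustrative sketch: hold OTLP batches for up to 5 minutes while the collector is unreachable.
interface QueuedBatch { payload: Uint8Array; enqueuedAt: number; }

const MAX_AGE_MS = 5 * 60 * 1000;
const queue: QueuedBatch[] = [];

function enqueue(payload: Uint8Array): void {
  queue.push({ payload, enqueuedAt: Date.now() });
}

async function flush(sendToCollector: (payload: Uint8Array) => Promise<void>): Promise<void> {
  const now = Date.now();
  while (queue.length > 0) {
    const batch = queue[0];
    if (now - batch.enqueuedAt > MAX_AGE_MS) {
      queue.shift(); // expired: drop the batch rather than keep an unbounded backlog
      continue;
    }
    try {
      await sendToCollector(batch.payload);
      queue.shift(); // delivered
    } catch {
      break; // collector still unreachable; keep the backlog and retry on the next flush
    }
  }
}
```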
---
## 9·Compliance Checklist
- [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.
- [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).
- [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.
- [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.
- [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.
- [ ] Offline capture workflow exercised; evidence stored in audit vault.
- [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).
- [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/UI_GUIDE.md`).
---
## 10·References
- `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks.
- `/docs/security/console-security.md` – Security metrics & alert hints.
- `docs/UI_GUIDE.md` – Console workflows and offline posture.
- `/docs/observability/observability.md` – Platform-wide practices.
- `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment.
- `/docs/operations/console-docker-install.md` – Compose/Helm environment variables.
---
*Last updated: 2025-10-28 (Sprint23).*