up
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Signals CI & Image / signals-ci (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled
Policy Simulation / policy-simulate (push) Has been cancelled
SDK Publish & Sign / sdk-publish (push) Has been cancelled
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Concelier Attestation Tests / attestation-tests (push) Has been cancelled
devportal-offline / build-offline (push) Has been cancelled

This commit is contained in:
StellaOps Bot
2025-11-25 22:09:44 +02:00
parent 6bee1fdcf5
commit 9f6e6f7fb3
116 changed files with 4495 additions and 730 deletions

View File

@@ -0,0 +1,37 @@
# Aggregation Observability
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-LNM-22-007)
Covers metrics, traces, and logs for Link-Not-Merge (LNM) aggregation and evidence pipelines.
## Metrics
- `aggregation_ingest_latency_seconds` (histogram) — end-to-end ingest per statement; labels: `tenant`, `source`, `status`.
- `aggregation_conflict_total` (counter) — conflicts encountered; labels: `tenant`, `advisory`, `product`, `reason`.
- `aggregation_overlay_cache_hits_total` / `_misses_total` — overlay cache effectiveness; labels: `tenant`, `cache`.
- `aggregation_vex_gate_total` — VEX gating outcomes; labels: `tenant`, `status` (`affected`, `not_affected`, `unknown`).
- `aggregation_queue_depth` (gauge) — pending statements per tenant.
## Traces
- Span name `aggregation.process` with attributes:
- `tenant`, `advisory`, `product`, `vex_status`, `source_kind`
- `overlay_version`, `cache_hit` (bool)
- Link to upstream ingest span (`traceparent` forwarded by Excititor/Concelier).
- Export to OTLP; sampling default 10% outside prod, 100% for `status=error`.
## Logs
Structured JSON with fields: `tenant`, `advisory`, `product`, `vex_status`, `decision` (`merged|suppressed|dropped`), `reason`, `duration_ms`, `trace_id`.
## SLOs
- **Ingest latency**: p95 < 500ms per statement (steady state).
- **Cache hit rate**: >80% for overlays; alerts when below for 15 minutes.
- **Error rate**: <0.1% over 10 minute window.
## Alerts
- `HighConflictRate` `aggregation_conflict_total` delta > 100/minute per tenant.
- `QueueBacklog``aggregation_queue_depth` > 10k for 5 minutes.
- `LowCacheHit` — overlay cache hit rate < 60% for 10 minutes.
## Offline/air-gap considerations
- Export metrics to local Prometheus scrape; no external sinks.
- Trace sampling and log retention configured via environment without needing control-plane access.
- Deterministic ordering preserved; cache warmers seeded from bundled fixtures.

View File

@@ -0,0 +1,41 @@
# Logging Standards (DOCS-OBS-50-003)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Goals
- Deterministic, structured logs for all services.
- Keep tenant safety and redaction guarantees while enabling search, correlation, and offline analysis.
## Log shape (JSON)
Required fields:
- `timestamp` (UTC ISO-8601)
- `tenant`, `workload` (service name), `env`, `region`, `version`
- `level` (`debug|info|warn|error|fatal`)
- `category` (logger/category name), `operation` (verb/action)
- `trace_id`, `span_id`, `correlation_id` (if external)
- `message` (concise, no secrets)
- `status` (`ok|error|fault|throttle`)
- `error.code`, `error.message` (redacted), `retryable` (bool) when status != ok
Optional but recommended:
- `resource` (subject id/purl/path when safe), `http.method`, `http.status_code`, `duration_ms`, `host`, `pid`, `thread`.
## Redaction rules
- Never log Authorization headers, tokens, passwords, private keys, full request/response bodies.
- Redact to `"[redacted]"` and add `redaction.reason` (`secret|pii|policy`).
- Hash low-cardinality identifiers when needed (`sha256` hex) and mark `hashed=true`.
## Determinism & offline posture
- Stable key ordering not required, but field set must be consistent per log type.
- No external enrichment; rely on bundled metadata (service map, tenant labels).
- All times UTC; newline-delimited JSON (NDJSON); LF line endings.
## Sampling & rate limits
- Info logs rate-limited per component (default 100/s); warn/error/fatal never sampled.
- Structured audit logs (`category=audit`) are never sampled and must include `actor`, `action`, `target`, `result`.
## Validation checklist
- [ ] Required fields present and non-empty.
- [ ] No secrets/PII; redaction markers recorded.
- [ ] Correlation fields (`trace_id`, `span_id`) set when spans exist.
- [ ] Log level matches outcome (errors use warn/error/fatal only).

View File

@@ -0,0 +1,42 @@
# Metrics & SLOs (DOCS-OBS-51-001)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Core metrics (platform-wide)
- **Requests**: `http_requests_total{tenant,workload,route,status}` (counter); latency histogram `http_request_duration_seconds`.
- **Jobs**: `worker_jobs_total{tenant,queue,status}`; `worker_job_duration_seconds`.
- **DB**: `db_query_duration_seconds{db,operation}`; `db_pool_in_use`, `db_pool_available`.
- **Cache**: `cache_requests_total{result=hit|miss}`; `cache_latency_seconds`.
- **Queue depth**: `queue_depth{tenant,queue}` (gauge).
- **Errors**: `errors_total{tenant,workload,code}`.
- **Custom module metrics**: keep namespaced (e.g., `riskengine_score_duration_seconds`, `notify_delivery_attempts_total`).
## SLOs (suggested)
- API availability: 99.9% monthly per public service.
- P95 latency: <300 ms for read endpoints; <1 s for write endpoints.
- Worker job success: >99% over 30d; P95 job duration set per queue (document locally).
- Queue backlog: alert when `queue_depth` > 1000 for 5 minutes per tenant/queue.
- Error budget policy: 28-day rolling window; burn-rate alerts at 2× and 14× budget.
## Alert examples
- High error rate: `rate(errors_total[5m]) / rate(http_requests_total[5m]) > 0.02`.
- Latency regression: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.3`.
- Queue backlog: `queue_depth > 1000` for 5m.
- Job failures: `rate(worker_jobs_total{status="failed"}[10m]) > 0.01`.
## Observability hygiene
- Tag everything with `tenant`, `workload`, `env`, `region`, `version`.
- Keep metric names stable; prefer adding labels over renaming.
- No high-cardinality labels (avoid `user_id`, `path`, raw errors); bucket or hash if needed.
- Offline: scrape locally (Prometheus/OTLP); ship exports via bundle if required.
## Dashboards
- Golden signals per service: traffic, errors, saturation, latency (P50/P95/P99).
- Queue dashboards: depth, age, throughput, success/fail rates.
- Tracing overlays: link span `status` to error metrics; use exemplars where supported.
## Validation checklist
- [ ] Metrics emitted with required tags.
- [ ] Cardinality review completed (no unbounded labels).
- [ ] Alerts wired to error budget policy.
- [ ] Dashboards cover golden signals and queue health.

View File

@@ -0,0 +1,38 @@
# Telemetry Standards (DOCS-OBS-50-002)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Common envelope
- **Trace context**: `trace_id`, `span_id`, `trace_flags`; propagate W3C `traceparent` and `baggage` end to end.
- **Tenant & workload**: `tenant`, `workload` (service name), `region`, `env` (dev/stage/prod), `version` (git sha or semver).
- **Subject**: `component` (module), `operation` (verb/name), `resource` (purl/uri/subject id when safe).
- **Timing**: UTC ISO-8601 `timestamp`; durations in milliseconds with integers.
- **Outcome**: `status` (`ok|error|fault|throttle`), `error.code` (machine), `error.message` (human, redacted), `retryable` (bool).
## Scrubbing policy
- Denylist PII/secrets before emit: emails, tokens, Authorization headers, bearer fragments, private keys, passwords, session IDs.
- Redact fields to `"[redacted]"` and add `redaction.reason` (`secret|pii|tenant_policy`).
- Hash low-cardinality identifiers when needed (`sha256` lowercase hex) and mark `hashed=true`.
- Logs must not contain full request/response bodies; store hashes plus lengths. For NDJSON exports, allow hashes + selected headers only.
## Sampling defaults
- **Traces**: 10% head sampling non-prod; 100% for `status=error|fault` and for spans tagged `audit=true`. Prod default 5% with the same error/audit boost.
- **Logs**: info logs rate-limited per component (default 100/s); warn/error never sampled. Structured JSON only.
- **Metrics**: never sampled; counters/gauges/histograms use deterministic bucket boundaries documented in component specs.
## Redaction override procedure
- Overrides are rare and must be auditable.
- To allow a field temporarily, set `telemetry.redaction.overrides=<comma list>` in service config with change-ticket id; emit `redaction.override=true` tag on affected spans/logs.
- Overrides expire automatically after `telemetry.redaction.override_ttl` (default 24h); services refuse to start with expired overrides.
- All overrides are logged to `telemetry.redaction.audit` channel with actor, ticket, fields, TTL.
## Determinism & offline posture
- No external enrichers; all enrichment data must be preloaded bundles (e.g., service map, tenant metadata).
- Sorting for exports: by `timestamp`, then `workload`, then `operation`.
- Time always UTC; avoid locale-specific formats.
## Validation checklist
- [ ] `traceparent` propagated and present on inbound/outbound.
- [ ] Required fields present (`tenant`, `workload`, `operation`, `status`).
- [ ] Scrubbing tests cover auth headers and bodies.
- [ ] Sampling knobs configurable via env vars with documented defaults.

View File

@@ -0,0 +1,37 @@
# Tracing Standards (DOCS-OBS-50-004)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Goals
- Consistent distributed tracing across services (API, workers, CLI).
- Safe for offline/air-gapped deployments.
- Deterministic span data for replay/debug.
## Context propagation
- Use W3C headers: `traceparent` (required), `baggage` (optional key/value pairs).
- Preserve incoming `trace_id` for all downstream calls; create child spans per operation.
- For async work (queues, cron), copy `traceparent` and `baggage` into the message envelope; new span links to the stored context using **links**, not a new parent.
## Span conventions
- Names: `<component>.<operation>` (e.g., `riskengine.simulate`, `notify.deliver`).
- Required attributes: `tenant`, `workload` (service), `env`, `region`, `version`, `operation`, `status`.
- HTTP spans: add `http.method`, `http.route`, `http.status_code`, `net.peer.name`, `net.peer.port`.
- DB spans: `db.system`, `db.name`, `db.operation`, `db.statement` (omit literals).
- Message spans: `messaging.system`, `messaging.destination`, `messaging.operation` (`send|receive|process`), `messaging.message_id`.
- Errors: set `status=error`, include `error.code`, redacted `error.message`, `retryable` (bool).
## Sampling
- Default head sampling: 10% non-prod, 5% prod.
- Always sample spans with `status=error|fault` or `audit=true`.
- Allow override via env `Tracing__SampleRate` (01) per service; document in runbooks.
## Offline/air-gap posture
- No external exporters; emit OTLP to local collector or file.
- Disable remote enrichment; rely on bundled service map.
- All timestamps UTC; span ids deterministic only in scope of traceparent (no GUID reuse).
## Validation checklist
- [ ] `traceparent` forwarded on every inbound/outbound call.
- [ ] Required attributes present on spans.
- [ ] Error spans include codes and redacted messages.
- [ ] Sampling knobs documented in service config.