up
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Signals CI & Image / signals-ci (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled
Policy Simulation / policy-simulate (push) Has been cancelled
SDK Publish & Sign / sdk-publish (push) Has been cancelled
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Concelier Attestation Tests / attestation-tests (push) Has been cancelled
devportal-offline / build-offline (push) Has been cancelled
37
docs/observability/aggregation.md
Normal file
@@ -0,0 +1,37 @@
# Aggregation Observability

Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-LNM-22-007)

Covers metrics, traces, and logs for Link-Not-Merge (LNM) aggregation and evidence pipelines.

## Metrics
- `aggregation_ingest_latency_seconds` (histogram): end-to-end ingest latency per statement; labels: `tenant`, `source`, `status`.
- `aggregation_conflict_total` (counter): conflicts encountered; labels: `tenant`, `advisory`, `product`, `reason`.
- `aggregation_overlay_cache_hits_total` / `_misses_total`: overlay cache effectiveness; labels: `tenant`, `cache`.
- `aggregation_vex_gate_total`: VEX gating outcomes; labels: `tenant`, `status` (`affected`, `not_affected`, `unknown`).
- `aggregation_queue_depth` (gauge): pending statements per tenant.

## Traces
- Span name `aggregation.process` with attributes:
  - `tenant`, `advisory`, `product`, `vex_status`, `source_kind`
  - `overlay_version`, `cache_hit` (bool)
- Link to the upstream ingest span (`traceparent` forwarded by Excititor/Concelier).
- Export via OTLP; default sampling is 10% outside prod, with 100% for `status=error`.

## Logs
Structured JSON with fields: `tenant`, `advisory`, `product`, `vex_status`, `decision` (`merged|suppressed|dropped`), `reason`, `duration_ms`, `trace_id`.

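As an illustration of the log shape above, here is a minimal Python sketch that serializes one aggregation decision as a single JSON line. The helper name `aggregation_log_line` and every sample value are hypothetical; only the field names and the three `decision` values come from this document.

```python
import json

# Decision values listed in the Logs section above.
ALLOWED_DECISIONS = {"merged", "suppressed", "dropped"}


def aggregation_log_line(tenant, advisory, product, vex_status,
                         decision, reason, duration_ms, trace_id):
    """Serialize one aggregation decision as a single JSON line (hypothetical helper)."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    record = {
        "tenant": tenant,
        "advisory": advisory,
        "product": product,
        "vex_status": vex_status,
        "decision": decision,
        "reason": reason,
        "duration_ms": int(duration_ms),
        "trace_id": trace_id,
    }
    # sort_keys keeps the output byte-stable for identical input.
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


# Placeholder values for illustration only.
line = aggregation_log_line(
    tenant="tenant-a",
    advisory="ADV-EXAMPLE-1",
    product="pkg:example/component@1.0.0",
    vex_status="not_affected",
    decision="suppressed",
    reason="vex_gate",
    duration_ms=12,
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
)
```

Rejecting unknown `decision` values at emit time keeps the field set consistent for downstream queries.
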
## SLOs
- **Ingest latency**: p95 < 500 ms per statement (steady state).
- **Cache hit rate**: >80% for overlays; alert when it stays below this for 15 minutes.
- **Error rate**: <0.1% over a 10-minute window.

## Alerts
- `HighConflictRate`: `aggregation_conflict_total` increases by more than 100 per minute for a tenant.
- `QueueBacklog`: `aggregation_queue_depth` > 10k for 5 minutes.
- `LowCacheHit`: overlay cache hit rate < 60% for 10 minutes.

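The cache hit rate behind `LowCacheHit` is not itself a metric; it has to be derived from the hit and miss counters. The sketch below shows one way to compute it and to evaluate the "below 60% for the whole window" condition. The function names and the idea of one rate sample per evaluation interval are assumptions, not part of the documented alert rule.

```python
def cache_hit_rate(hits_delta, misses_delta):
    """Hit ratio over one interval, from counter increases; None when there was no traffic."""
    total = hits_delta + misses_delta
    if total == 0:
        return None
    return hits_delta / total


def low_cache_hit(window_rates, threshold=0.60):
    """True only when every sample with traffic in the window is below the threshold."""
    samples = [rate for rate in window_rates if rate is not None]
    return bool(samples) and all(rate < threshold for rate in samples)
```

Intervals without traffic are skipped rather than counted as misses, so an idle tenant does not trigger the alert.
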
## Offline/air-gap considerations
- Expose metrics for a local Prometheus scrape; no external sinks.
- Trace sampling and log retention are configured via environment variables, with no control-plane access required.
- Deterministic ordering is preserved; cache warmers are seeded from bundled fixtures.
41
docs/observability/logging.md
Normal file
@@ -0,0 +1,41 @@
# Logging Standards (DOCS-OBS-50-003)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

## Goals
- Deterministic, structured logs for all services.
- Preserve tenant safety and redaction guarantees while enabling search, correlation, and offline analysis.

## Log shape (JSON)
Required fields:
- `timestamp` (UTC ISO-8601)
- `tenant`, `workload` (service name), `env`, `region`, `version`
- `level` (`debug|info|warn|error|fatal`)
- `category` (logger/category name), `operation` (verb/action)
- `trace_id`, `span_id`, `correlation_id` (if external)
- `message` (concise, no secrets)
- `status` (`ok|error|fault|throttle`)
- `error.code`, `error.message` (redacted), `retryable` (bool) when `status` is not `ok`

Optional but recommended:
- `resource` (subject id/purl/path when safe), `http.method`, `http.status_code`, `duration_ms`, `host`, `pid`, `thread`.

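One way to produce this shape with Python's standard `logging` module is a custom formatter. The sketch below is illustrative only: `NdjsonFormatter`, the `STATIC_FIELDS` values, and the logger name are hypothetical, and it covers a subset of the required fields. Per-call fields are read from record attributes (normally supplied through `extra=`).

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical static context; a real service would load this from its config.
STATIC_FIELDS = {"workload": "example-service", "env": "dev",
                 "region": "local", "version": "0.0.0"}

# Map stdlib level names onto the level vocabulary used in this document.
LEVELS = {"DEBUG": "debug", "INFO": "info", "WARNING": "warn",
          "ERROR": "error", "CRITICAL": "fatal"}


class NdjsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line with UTC timestamps."""

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVELS.get(record.levelname, "info"),
            "category": record.name,
            "message": record.getMessage(),
            **STATIC_FIELDS,
        }
        # Per-call fields; None marks a field the caller did not supply.
        for key in ("tenant", "operation", "status", "trace_id", "span_id"):
            entry[key] = getattr(record, key, None)
        return json.dumps(entry, sort_keys=True)


# Build a record directly so the example has no handler side effects.
record = logging.makeLogRecord({
    "name": "policy.engine", "levelname": "WARNING", "msg": "quota near limit",
    "tenant": "tenant-a", "operation": "evaluate", "status": "throttle",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7",
})
line = NdjsonFormatter().format(record)
```

In a service the formatter would be attached to a handler with `handler.setFormatter(NdjsonFormatter())`; a validation step can then flag any required field that is still `None`.
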
## Redaction rules
- Never log Authorization headers, tokens, passwords, private keys, or full request/response bodies.
- Redact to `"[redacted]"` and add `redaction.reason` (`secret|pii|policy`).
- Hash low-cardinality identifiers when needed (`sha256` hex) and mark `hashed=true`.

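The redaction and hashing rules above could be applied in a single pass over a log entry's fields before emit. This is a minimal sketch under stated assumptions: the `SENSITIVE_KEYS` denylist and the `scrub` helper are hypothetical, and the sketch only handles the `secret` reason.

```python
import hashlib

# Hypothetical denylist of field names; extend per service.
SENSITIVE_KEYS = {"authorization", "token", "password", "private_key"}


def scrub(fields, hash_keys=()):
    """Return a copy of `fields` with secrets redacted and selected identifiers hashed."""
    out = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            out[key] = "[redacted]"
            out["redaction.reason"] = "secret"
        elif key in hash_keys:
            # sha256 hex digest of the UTF-8 encoded value.
            out[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            out["hashed"] = True
        else:
            out[key] = value
    return out


cleaned = scrub(
    {"authorization": "Bearer abc", "subject": "alice", "route": "/health"},
    hash_keys={"subject"},
)
```

The input dictionary is never mutated, so the unredacted values do not leak back into the caller's log context.
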
## Determinism & offline posture
- Stable key ordering is not required, but the field set must be consistent per log type.
- No external enrichment; rely on bundled metadata (service map, tenant labels).
- All times in UTC; newline-delimited JSON (NDJSON); LF line endings.

## Sampling & rate limits
- Info logs are rate-limited per component (default 100/s); warn/error/fatal logs are never sampled.
- Structured audit logs (`category=audit`) are never sampled and must include `actor`, `action`, `target`, `result`.

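A per-component limiter matching the rule above might look like the following sketch. It uses a simple fixed one-second window, which is an assumption (the document does not specify the algorithm); the class name `InfoRateLimiter` is hypothetical, and `debug` is treated like `info` here since the document leaves it unspecified.

```python
import time

# Levels that must never be dropped, per the rule above.
NEVER_SAMPLED = {"warn", "error", "fatal"}


class InfoRateLimiter:
    """Fixed-window limiter: at most `limit_per_second` low-severity logs per second."""

    def __init__(self, limit_per_second=100, clock=time.monotonic):
        self.limit = limit_per_second
        self.clock = clock      # injectable so tests can control time
        self.window = None      # index of the current one-second window
        self.count = 0          # logs admitted or dropped in that window

    def allow(self, level, category=""):
        # Warn/error/fatal and audit logs bypass the limiter entirely.
        if level in NEVER_SAMPLED or category == "audit":
            return True
        window = int(self.clock())
        if window != self.window:
            self.window, self.count = window, 0
        self.count += 1
        return self.count <= self.limit
```

One limiter instance per component keeps a noisy module from consuming another module's budget.
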
## Validation checklist
- [ ] Required fields present and non-empty.
- [ ] No secrets/PII; redaction markers recorded.
- [ ] Correlation fields (`trace_id`, `span_id`) set when spans exist.
- [ ] Log level matches outcome (errors use warn/error/fatal only).
42
docs/observability/metrics-and-slos.md
Normal file
@@ -0,0 +1,42 @@
# Metrics & SLOs (DOCS-OBS-51-001)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

## Core metrics (platform-wide)
- **Requests**: `http_requests_total{tenant,workload,route,status}` (counter); latency histogram `http_request_duration_seconds`.
- **Jobs**: `worker_jobs_total{tenant,queue,status}`; `worker_job_duration_seconds`.
- **DB**: `db_query_duration_seconds{db,operation}`; `db_pool_in_use`, `db_pool_available`.
- **Cache**: `cache_requests_total{result=hit|miss}`; `cache_latency_seconds`.
- **Queue depth**: `queue_depth{tenant,queue}` (gauge).
- **Errors**: `errors_total{tenant,workload,code}`.
- **Custom module metrics**: keep namespaced (e.g., `riskengine_score_duration_seconds`, `notify_delivery_attempts_total`).

## SLOs (suggested)
- API availability: 99.9% monthly per public service.
- P95 latency: <300 ms for read endpoints; <1 s for write endpoints.
- Worker job success: >99% over 30d; P95 job duration set per queue (document locally).
- Queue backlog: alert when `queue_depth` > 1000 for 5 minutes per tenant/queue.
- Error budget policy: 28-day rolling window; burn-rate alerts at 2× and 14× budget.

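To make the error budget policy concrete, the arithmetic can be sketched as below. This assumes the 99.9% availability target is evaluated over the 28-day rolling window and reads "2× and 14×" as burn-rate multiples of the budget; the function names are hypothetical.

```python
def error_budget(slo_target):
    """Fraction of requests (or time) allowed to fail, e.g. 0.001 for a 99.9% target."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target, window_days=28):
    """Total budget expressed as minutes of full outage within the window."""
    return error_budget(slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio, slo_target):
    """How many times faster than sustainable the budget is being consumed."""
    return observed_error_ratio / error_budget(slo_target)


def burn_alerts(rate, thresholds=(2.0, 14.0)):
    """Return the burn-rate thresholds that the current rate meets or exceeds."""
    return [t for t in thresholds if rate >= t]
```

Under these assumptions a 99.9% target leaves roughly 40 minutes of budget per 28 days, and a sustained 1.5% error ratio burns it at about 15×, which would trip both thresholds.
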
## Alert examples
- High error rate: `sum(rate(errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.02` (aggregate both sides, since the two metrics carry different label sets and would not match otherwise).
- Latency regression: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.3`.
- Queue backlog: `queue_depth > 1000` for 5m.
- Job failures: `rate(worker_jobs_total{status="failed"}[10m]) > 0.01`.

## Observability hygiene
- Tag everything with `tenant`, `workload`, `env`, `region`, `version`.
- Keep metric names stable; prefer adding labels over renaming.
- No high-cardinality labels (avoid `user_id`, `path`, raw errors); bucket or hash if needed.
- Offline: scrape locally (Prometheus/OTLP); ship exports via bundle if required.

## Dashboards
- Golden signals per service: traffic, errors, saturation, latency (P50/P95/P99).
- Queue dashboards: depth, age, throughput, success/fail rates.
- Tracing overlays: link span `status` to error metrics; use exemplars where supported.

## Validation checklist
- [ ] Metrics emitted with required tags.
- [ ] Cardinality review completed (no unbounded labels).
- [ ] Alerts wired to error budget policy.
- [ ] Dashboards cover golden signals and queue health.
38
docs/observability/telemetry-standards.md
Normal file
@@ -0,0 +1,38 @@
# Telemetry Standards (DOCS-OBS-50-002)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

## Common envelope
- **Trace context**: `trace_id`, `span_id`, `trace_flags`; propagate W3C `traceparent` and `baggage` end to end.
- **Tenant & workload**: `tenant`, `workload` (service name), `region`, `env` (dev/stage/prod), `version` (git sha or semver).
- **Subject**: `component` (module), `operation` (verb/name), `resource` (purl/uri/subject id when safe).
- **Timing**: UTC ISO-8601 `timestamp`; durations as integer milliseconds.
- **Outcome**: `status` (`ok|error|fault|throttle`), `error.code` (machine), `error.message` (human, redacted), `retryable` (bool).

## Scrubbing policy
- Denylist PII/secrets before emit: emails, tokens, Authorization headers, bearer fragments, private keys, passwords, session IDs.
- Redact fields to `"[redacted]"` and add `redaction.reason` (`secret|pii|tenant_policy`).
- Hash low-cardinality identifiers when needed (`sha256` lowercase hex) and mark `hashed=true`.
- Logs must not contain full request/response bodies; store body hashes plus lengths instead. For NDJSON exports, allow hashes and selected headers only.

## Sampling defaults
- **Traces**: 10% head sampling non-prod; 100% for `status=error|fault` and for spans tagged `audit=true`. Prod default 5% with the same error/audit boost.
- **Logs**: info logs rate-limited per component (default 100/s); warn/error never sampled. Structured JSON only.
- **Metrics**: never sampled; counters/gauges/histograms use deterministic bucket boundaries documented in component specs.

## Redaction override procedure
- Overrides are rare and must be auditable.
- To allow a field temporarily, set `telemetry.redaction.overrides=<comma list>` in service config together with the change-ticket id; emit a `redaction.override=true` tag on affected spans/logs.
- Overrides expire automatically after `telemetry.redaction.override_ttl` (default 24h); services refuse to start with expired overrides.
- All overrides are logged to the `telemetry.redaction.audit` channel with actor, ticket, fields, TTL.

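The expiry rule above can be sketched as a startup check. How a service records when an override was created is not specified in this document, so the `created_at` and `ticket` keys below, along with both function names, are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Default from telemetry.redaction.override_ttl as documented above.
DEFAULT_TTL = timedelta(hours=24)


def override_expired(created_at, now, ttl=DEFAULT_TTL):
    """An override is expired once its TTL has fully elapsed."""
    return now >= created_at + ttl


def check_overrides_at_startup(overrides, now):
    """Refuse to start (raise) if any configured override has expired."""
    expired = [
        o["ticket"] for o in overrides
        if override_expired(o["created_at"], now, o.get("ttl", DEFAULT_TTL))
    ]
    if expired:
        raise RuntimeError("expired redaction overrides: " + ", ".join(expired))


# Hypothetical override record; timestamps are timezone-aware UTC.
created = datetime(2025, 11, 25, tzinfo=timezone.utc)
example_override = {"ticket": "CHG-EXAMPLE", "created_at": created, "fields": ["resource"]}
```

Failing at startup rather than silently ignoring the stale entry forces the override to be removed or renewed through a new change ticket.
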
## Determinism & offline posture
- No external enrichers; all enrichment data must come from preloaded bundles (e.g., service map, tenant metadata).
- Sorting for exports: by `timestamp`, then `workload`, then `operation`.
- Time always UTC; avoid locale-specific formats.

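The export ordering above maps directly onto a tuple sort key. The sketch assumes every `timestamp` is a UTC ISO-8601 string in one uniform format, so that lexicographic order matches chronological order; the function name and sample records are hypothetical.

```python
def sort_for_export(records):
    """Order records by timestamp, then workload, then operation."""
    return sorted(records, key=lambda r: (r["timestamp"], r["workload"], r["operation"]))


# Placeholder records for illustration.
records = [
    {"timestamp": "2025-11-25T10:00:01Z", "workload": "notify", "operation": "deliver"},
    {"timestamp": "2025-11-25T10:00:00Z", "workload": "riskengine", "operation": "simulate"},
    {"timestamp": "2025-11-25T10:00:00Z", "workload": "notify", "operation": "deliver"},
]
ordered = sort_for_export(records)
```

Because Python's sort is stable, records that tie on all three keys keep their input order, which helps keep repeated exports identical.
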
## Validation checklist
- [ ] `traceparent` propagated and present on inbound/outbound.
- [ ] Required fields present (`tenant`, `workload`, `operation`, `status`).
- [ ] Scrubbing tests cover auth headers and bodies.
- [ ] Sampling knobs configurable via env vars with documented defaults.
37
docs/observability/tracing.md
Normal file
@@ -0,0 +1,37 @@
# Tracing Standards (DOCS-OBS-50-004)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

## Goals
- Consistent distributed tracing across services (API, workers, CLI).
- Safe for offline/air-gapped deployments.
- Deterministic span data for replay/debug.

## Context propagation
- Use W3C headers: `traceparent` (required), `baggage` (optional key/value pairs).
- Preserve the incoming `trace_id` for all downstream calls; create child spans per operation.
- For async work (queues, cron), copy `traceparent` and `baggage` into the message envelope; the consumer's new span references the stored context through span **links** rather than adopting it as a parent.

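The propagation rules above can be illustrated without any tracing library. The sketch parses a version `00` `traceparent` header, derives a child header that keeps the same `trace_id`, and carries the context through a queue envelope. All function names and the envelope layout are hypothetical; a real service would normally rely on its tracing SDK for this.

```python
import re
import secrets

# version-trace_id-parent_id-flags, lowercase hex, as used by traceparent version 00.
_TRACEPARENT_V00 = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Split a traceparent header into its fields, rejecting malformed values."""
    match = _TRACEPARENT_V00.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero ids are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "trace_flags": flags}


def child_traceparent(header):
    """Header for a downstream call: same trace_id and flags, fresh span id."""
    ctx = parse_traceparent(header)
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    while span_id == "0" * 16 or span_id == ctx["span_id"]:
        span_id = secrets.token_hex(8)
    return "-".join([ctx["version"], ctx["trace_id"], span_id, ctx["trace_flags"]])


def to_envelope(payload, traceparent, baggage=""):
    """Producer side: copy the context into the message envelope (hypothetical layout)."""
    return {"traceparent": traceparent, "baggage": baggage, "payload": payload}


def link_from_envelope(envelope):
    """Consumer side: the stored context becomes a span link, not the parent."""
    ctx = parse_traceparent(envelope["traceparent"])
    return {"trace_id": ctx["trace_id"], "span_id": ctx["span_id"]}
```

For example, `child_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")` returns a header with the same trace id and flags but a new random span id.
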
## Span conventions
- Names: `<component>.<operation>` (e.g., `riskengine.simulate`, `notify.deliver`).
- Required attributes: `tenant`, `workload` (service), `env`, `region`, `version`, `operation`, `status`.
- HTTP spans: add `http.method`, `http.route`, `http.status_code`, `net.peer.name`, `net.peer.port`.
- DB spans: `db.system`, `db.name`, `db.operation`, `db.statement` (omit literals).
- Message spans: `messaging.system`, `messaging.destination`, `messaging.operation` (`send|receive|process`), `messaging.message_id`.
- Errors: set `status=error`, include `error.code`, redacted `error.message`, `retryable` (bool).

## Sampling
- Default head sampling: 10% non-prod, 5% prod.
- Always sample spans with `status=error|fault` or `audit=true`.
- Allow override via env `Tracing__SampleRate` (0–1) per service; document in runbooks.

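The sampling rules above can be sketched as a small decision function. This is a simplification: true head sampling decides before a span's final `status` is known, so the error/audit rule is modeled here as a keep-override applied when the outcome is available. The function names are hypothetical; only `Tracing__SampleRate` and the default rates come from this document.

```python
import os
import random


def default_sample_rate(env):
    """Documented defaults: 5% in prod, 10% everywhere else."""
    return 0.05 if env == "prod" else 0.10


def resolve_sample_rate(env, environ=os.environ):
    """Use Tracing__SampleRate when set, otherwise the default for the environment."""
    raw = environ.get("Tracing__SampleRate")
    if raw is None:
        return default_sample_rate(env)
    rate = float(raw)
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"Tracing__SampleRate out of range: {raw!r}")
    return rate


def should_sample(status, audit, rate, rng=random.random):
    """Always keep error/fault and audit spans; sample the rest at `rate`."""
    if status in ("error", "fault") or audit:
        return True
    return rng() < rate
```

Passing `environ` and `rng` as parameters keeps the decision testable without touching the process environment or relying on randomness.
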
## Offline/air-gap posture
- No external exporters; emit OTLP to a local collector or file.
- Disable remote enrichment; rely on the bundled service map.
- All timestamps UTC; span ids deterministic only in scope of traceparent (no GUID reuse).

## Validation checklist
- [ ] `traceparent` forwarded on every inbound/outbound call.
- [ ] Required attributes present on spans.
- [ ] Error spans include codes and redacted messages.
- [ ] Sampling knobs documented in service config.