# Telemetry Standards (DOCS-OBS-50-002)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Common envelope
- **Trace context**: `trace_id`, `span_id`, `trace_flags`; propagate W3C `traceparent` and `baggage` end to end.
- **Tenant & workload**: `tenant`, `workload` (service name), `region`, `env` (dev/stage/prod), `version` (git sha or semver).
- **Subject**: `component` (module), `operation` (verb/name), `resource` (purl/uri/subject id when safe).
- **Timing**: UTC ISO-8601 `timestamp`; durations as integer milliseconds.
- **Outcome**: `status` (`ok|error|fault|throttle`), `error.code` (machine), `error.message` (human, redacted), `retryable` (bool).
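
As an illustration of the envelope above, here is a minimal Python sketch that assembles one record. The helper name `make_envelope` and the `duration_ms` field name are assumptions made for this example; the standard only requires that durations be integer milliseconds.

```python
from datetime import datetime, timezone

ALLOWED_STATUS = {"ok", "error", "fault", "throttle"}


def make_envelope(*, trace_id, span_id, tenant, workload, operation, status,
                  duration_ms, trace_flags="01", **extra):
    """Build one telemetry record carrying the common envelope fields."""
    if status not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {status!r}")
    # UTC ISO-8601 timestamp with a trailing "Z" instead of "+00:00".
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    record = {
        "timestamp": now.replace("+00:00", "Z"),
        "trace_id": trace_id,
        "span_id": span_id,
        "trace_flags": trace_flags,
        "tenant": tenant,
        "workload": workload,
        "operation": operation,
        "status": status,
        # Hypothetical field name; the value is truncated to whole milliseconds.
        "duration_ms": int(duration_ms),
    }
    # Optional fields such as region, env, version, component, resource.
    record.update(extra)
    return record
```

Passing the optional fields (`region`, `env`, `version`, `component`, `resource`, `error.code`, and so on) through `**extra` keeps the required set explicit in the signature.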
## Scrubbing policy
- Denylist PII/secrets before emit: emails, tokens, Authorization headers, bearer fragments, private keys, passwords, session IDs.
- Redact fields to `"[redacted]"` and add `redaction.reason` (`secret|pii|tenant_policy`).
- Hash low-cardinality identifiers when needed (`sha256` lowercase hex) and mark `hashed=true`.
- Logs must not contain full request/response bodies; store hashes plus lengths. For NDJSON exports, allow hashes + selected headers only.
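
The scrubbing rules above could be sketched roughly as follows. The `DENYLIST` contents, the function names, and the choice to record `redaction.reason` as a per-field map are assumptions made for illustration, not a description of the actual StellaOps scrubber.

```python
import hashlib

REDACTED = "[redacted]"

# Hypothetical denylist keyed by lowercase field name; the value is the
# redaction.reason to record for that field.
DENYLIST = {
    "authorization": "secret",
    "password": "secret",
    "token": "secret",
    "session_id": "secret",
    "email": "pii",
}


def scrub(record):
    """Return a copy of record with denylisted fields redacted."""
    out = {}
    reasons = {}
    for key, value in record.items():
        reason = DENYLIST.get(key.lower())
        if reason is None:
            out[key] = value
        else:
            out[key] = REDACTED
            reasons[key] = reason
    if reasons:
        # Assumption: reasons are stored as a map from field name to reason.
        out["redaction.reason"] = reasons
    return out


def hash_identifier(value):
    """Hash an identifier as lowercase hex sha256 and mark it as hashed."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {"value": digest, "hashed": True}


def summarize_body(body):
    """Replace a request/response body (bytes) with its hash and length."""
    return {"sha256": hashlib.sha256(body).hexdigest(), "length": len(body)}
```

A field-name denylist only catches secrets stored under known keys; detecting bearer fragments or private keys embedded in free text would need value-level pattern matching, which this sketch leaves out.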
## Sampling defaults
- **Traces**: 10% head sampling in non-prod; 100% for `status=error|fault` and for spans tagged `audit=true`. Prod defaults to 5% with the same error/audit boost.
- **Logs**: info logs rate-limited per component (default 100/s); warn/error never sampled. Structured JSON only.
- **Metrics**: never sampled; counters/gauges/histograms use deterministic bucket boundaries documented in component specs.
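
The trace sampling defaults can be expressed as a small decision function. This is a sketch under stated assumptions: the function name and the injectable `rng` parameter are hypothetical, and it does not address when the decision is made. A span's error status is normally known only when the span ends, so how the error boost is combined with head sampling is left to the component.

```python
import random

PROD_RATE = 0.05      # prod default from this document
NON_PROD_RATE = 0.10  # dev/stage default from this document


def should_sample_trace(env, status, audit=False, rng=random.random):
    """Decide whether to keep a trace under the documented defaults."""
    # Errors, faults, and audit-tagged spans are always kept.
    if status in ("error", "fault") or audit:
        return True
    rate = PROD_RATE if env == "prod" else NON_PROD_RATE
    # rng() returns a float in [0.0, 1.0); keep the trace below the rate.
    return rng() < rate
```

Injecting `rng` makes the decision testable with fixed values instead of real randomness.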
## Redaction override procedure
- Overrides are rare and must be auditable.
- To allow a field temporarily, set `telemetry.redaction.overrides=<comma list>` in service config together with the change-ticket ID; emit a `redaction.override=true` tag on affected spans/logs.
- Overrides expire automatically after `telemetry.redaction.override_ttl` (default 24h); services refuse to start with expired overrides.
- All overrides are logged to `telemetry.redaction.audit` channel with actor, ticket, fields, TTL.
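
A startup check for expired overrides might look like the sketch below. The shape of an override record (`ticket`, `fields`, `created_at`) and the function name are assumptions made for this example; the document does not specify how overrides are stored.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(hours=24)  # telemetry.redaction.override_ttl default


def check_overrides(overrides, now=None, ttl=DEFAULT_TTL):
    """Return the overridden field names, or raise if any override expired.

    Each override is a hypothetical dict with "ticket", "fields", and a
    timezone-aware "created_at" datetime.
    """
    now = now or datetime.now(timezone.utc)
    expired = [o["ticket"] for o in overrides if now - o["created_at"] >= ttl]
    if expired:
        # Mirrors the rule that services refuse to start with expired overrides.
        raise RuntimeError("expired redaction overrides: " + ", ".join(expired))
    return [field for o in overrides for field in o["fields"]]
```

Calling this once during service startup, before any telemetry is emitted, would enforce the refuse-to-start rule.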
## Determinism & offline posture
- No external enrichers; all enrichment data must come from preloaded bundles (e.g., service map, tenant metadata).
- Sorting for exports: by `timestamp`, then `workload`, then `operation`.
- Time always UTC; avoid locale-specific formats.
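
The export ordering can be sketched as a plain tuple sort. The function names are hypothetical, and sorting JSON keys within each line is an extra assumption added here to make the output byte-stable; the document only specifies the record order.

```python
import json


def sort_for_export(records):
    """Order records by timestamp, then workload, then operation."""
    return sorted(
        records,
        key=lambda r: (r["timestamp"], r["workload"], r["operation"]),
    )


def to_ndjson(records):
    """Serialize records as NDJSON in the deterministic export order."""
    return "\n".join(
        json.dumps(r, sort_keys=True, separators=(",", ":"))
        for r in sort_for_export(records)
    )
```

Comparing timestamps as strings gives chronological order only if every record uses the same UTC ISO-8601 layout and precision, which is one reason to avoid locale-specific formats.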
## Validation checklist
- [ ] `traceparent` propagated and present on inbound and outbound requests.
- [ ] Required fields present (`tenant`, `workload`, `operation`, `status`).
- [ ] Scrubbing tests cover auth headers and bodies.
- [ ] Sampling knobs configurable via env vars with documented defaults.
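
The first two checklist items lend themselves to automated checks. The sketch below validates the version-00 `traceparent` layout from the W3C Trace Context specification and reports missing required fields; the function names are hypothetical.

```python
import re

# version-traceid-parentid-flags, all lowercase hex (version 00 layout).
TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)
REQUIRED_FIELDS = ("tenant", "workload", "operation", "status")


def valid_traceparent(header):
    """Check a traceparent header against the version-00 layout."""
    match = TRACEPARENT_RE.fullmatch(header)
    if match is None:
        return False
    version, trace_id, parent_id, _flags = match.groups()
    # Version ff and all-zero trace or parent IDs are invalid per the spec.
    return version != "ff" and trace_id != "0" * 32 and parent_id != "0" * 16


def missing_fields(record):
    """Return the required envelope fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]
```

A test suite could run `valid_traceparent` against captured inbound and outbound headers and assert that `missing_fields` returns an empty list for every emitted record.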