Telemetry Standards (DOCS-OBS-50-002)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

Common envelope

  • Trace context: trace_id, span_id, trace_flags; propagate W3C traceparent and baggage end to end.
  • Tenant & workload: tenant, workload (service name), region, env (dev/stage/prod), version (git sha or semver).
  • Subject: component (module), operation (verb/name), resource (purl/uri/subject id when safe).
  • Timing: UTC ISO-8601 timestamp; durations as integer milliseconds.
  • Outcome: status (ok|error|fault|throttle), error.code (machine), error.message (human, redacted), retryable (bool).
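
As an illustration of the envelope fields listed above, here is a minimal Python sketch that assembles one record. The flat key layout, the key names timestamp and duration_ms, and the helper names utc_timestamp and build_envelope are assumptions made for this example; only the field names in the bullets come from this standard.

```python
from datetime import datetime, timezone
from typing import Optional

ALLOWED_STATUS = {"ok", "error", "fault", "throttle"}
ALLOWED_ENV = {"dev", "stage", "prod"}


def utc_timestamp(moment: datetime) -> str:
    """Format an aware datetime as UTC ISO-8601 with a trailing Z."""
    if moment.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    text = moment.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return text.replace("+00:00", "Z")


def build_envelope(trace: dict, workload: dict, subject: dict,
                   moment: datetime, duration_ms: float, status: str,
                   error_code: Optional[str] = None,
                   error_message: Optional[str] = None,
                   retryable: bool = False) -> dict:
    """Assemble a flat envelope dict (hypothetical layout for this sketch)."""
    if status not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {status!r}")
    if workload["env"] not in ALLOWED_ENV:
        raise ValueError(f"unknown env: {workload['env']!r}")
    envelope = {
        "trace_id": trace["trace_id"],
        "span_id": trace["span_id"],
        "trace_flags": trace["trace_flags"],
        "tenant": workload["tenant"],
        "workload": workload["workload"],
        "region": workload["region"],
        "env": workload["env"],
        "version": workload["version"],
        "component": subject["component"],
        "operation": subject["operation"],
        "resource": subject.get("resource"),  # omit when not safe to emit
        "timestamp": utc_timestamp(moment),
        "duration_ms": int(round(duration_ms)),  # integer milliseconds
        "status": status,
        "retryable": bool(retryable),
    }
    if error_code is not None:
        envelope["error.code"] = error_code
        envelope["error.message"] = error_message  # must already be redacted
    return envelope
```

Rejecting naive datetimes up front keeps a local-time value from being silently emitted as if it were UTC.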

Scrubbing policy

  • Denylist PII/secrets before emit: emails, tokens, Authorization headers, bearer fragments, private keys, passwords, session IDs.
  • Redact fields to "[redacted]" and add redaction.reason (secret|pii|tenant_policy).
  • Hash low-cardinality identifiers when needed (sha256 lowercase hex) and mark hashed=true.
  • Logs must not contain full request/response bodies; store hashes plus lengths instead. NDJSON exports may carry only hashes and selected headers.
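
The scrubbing rules above can be sketched in Python as follows. The key denylist, the two regular expressions, the single redaction.reason mapping (field name to reason), and the body.sha256 / body.length key names are all assumptions for illustration; a real denylist would need to be broader.

```python
import hashlib
import re

REDACTED = "[redacted]"
# Hypothetical denylist of field names (compared case-insensitively).
SECRET_KEYS = {"authorization", "password", "token", "private_key", "session_id"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER_RE = re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/-]+=*")


def scrub(record: dict) -> dict:
    """Return a copy with denylisted fields replaced by "[redacted]"."""
    out = {}
    reasons = {}
    for key, value in record.items():
        reason = None
        if key.lower() in SECRET_KEYS:
            reason = "secret"
        elif isinstance(value, str) and BEARER_RE.search(value):
            reason = "secret"
        elif isinstance(value, str) and EMAIL_RE.search(value):
            reason = "pii"
        if reason is None:
            out[key] = value
        else:
            out[key] = REDACTED
            reasons[key] = reason
    if reasons:
        out["redaction.reason"] = reasons
    return out


def hash_identifier(value: str) -> dict:
    """Hash an identifier as lowercase SHA-256 hex and mark it hashed."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {"value": digest, "hashed": True}


def summarize_body(body: bytes) -> dict:
    """Replace a request/response body with its hash and length."""
    return {"body.sha256": hashlib.sha256(body).hexdigest(),
            "body.length": len(body)}
```

This sketch redacts a whole field as soon as any part of it matches, which loses more context than masking the matching fragment but avoids leaking partial secrets.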

Sampling defaults

  • Traces: head sampling at 10% in non-prod and 5% in prod by default; in every environment, 100% for status=error|fault and for spans tagged audit=true.
  • Logs: info logs rate-limited per component (default 100/s); warn/error never sampled. Structured JSON only.
  • Metrics: never sampled; counters/gauges/histograms use deterministic bucket boundaries documented in component specs.
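
A minimal Python sketch of the trace and log defaults above. It assumes the sampling decision is evaluated at a point where the span's status and audit tag are known, and it uses a fixed one-second window for the per-component info limit; the function name, class name, and windowing scheme are illustrative choices, not part of this standard.

```python
import random
import time

PROD_RATE = 0.05      # default head-sampling rate in prod
NON_PROD_RATE = 0.10  # default head-sampling rate elsewhere


def should_sample_trace(env: str, status: str, audit: bool,
                        rng=random.random) -> bool:
    """Keep every error/fault/audit span; otherwise sample by environment."""
    if status in ("error", "fault") or audit:
        return True
    rate = PROD_RATE if env == "prod" else NON_PROD_RATE
    return rng() < rate


class InfoLogLimiter:
    """Fixed-window limiter: at most `limit_per_second` info logs per component."""

    def __init__(self, limit_per_second: int = 100, clock=time.monotonic):
        self.limit = limit_per_second
        self.clock = clock
        self.windows = {}  # component -> (window second, count so far)

    def allow(self, component: str, level: str) -> bool:
        # Only info is rate-limited in this sketch; warn/error always pass.
        if level != "info":
            return True
        second = int(self.clock())
        start, count = self.windows.get(component, (second, 0))
        if start != second:
            start, count = second, 0  # new window, reset the counter
        if count >= self.limit:
            self.windows[component] = (start, count)
            return False
        self.windows[component] = (start, count + 1)
        return True
```

Passing rng and clock as parameters keeps the decision logic deterministic under test.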

Redaction override procedure

  • Overrides are rare and must be auditable.
  • To allow a field temporarily, set telemetry.redaction.overrides=<comma-separated field list> in the service config together with the change-ticket ID, and emit a redaction.override=true tag on affected spans/logs.
  • Overrides expire automatically after telemetry.redaction.override_ttl (default 24h); services refuse to start with expired overrides.
  • All overrides are logged to the telemetry.redaction.audit channel with actor, ticket, fields, and TTL.
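
The expiry and audit rules above could be enforced at startup along these lines. This is a sketch under assumptions: the standard does not say how an override's start time is recorded, so created_at is hypothetical, as are the RedactionOverride type, the check_startup helper, and the audit record layout.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RedactionOverride:
    fields: tuple
    ticket: str
    actor: str
    created_at: datetime                  # assumption: recorded with the override
    ttl: timedelta = timedelta(hours=24)  # telemetry.redaction.override_ttl default

    def expired(self, now: datetime) -> bool:
        return now >= self.created_at + self.ttl


def parse_fields(raw: str) -> tuple:
    """Parse the comma-separated telemetry.redaction.overrides value."""
    return tuple(part.strip() for part in raw.split(",") if part.strip())


def check_startup(override: RedactionOverride, now: datetime) -> dict:
    """Refuse to start on an expired override; else return an audit record."""
    if not override.ticket:
        raise RuntimeError("redaction override requires a change-ticket id")
    if override.expired(now):
        raise RuntimeError(
            f"redaction override for ticket {override.ticket} has expired")
    return {
        "actor": override.actor,
        "ticket": override.ticket,
        "fields": list(override.fields),
        "ttl_seconds": int(override.ttl.total_seconds()),
    }
```

The returned record is what a service would send to the telemetry.redaction.audit channel; how that channel is written to is left out here.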

Determinism & offline posture

  • No external enrichers; all enrichment data must come from preloaded bundles (e.g., service map, tenant metadata).
  • Sorting for exports: by timestamp, then workload, then operation.
  • Time always UTC; avoid locale-specific formats.
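
The export ordering above can be sketched in Python as a three-part sort key. The sketch assumes each record is a dict with timestamp, workload, and operation keys and that every timestamp uses the same UTC ISO-8601 format and precision, so that string order matches time order; sorting JSON keys is an extra assumption added here to keep the output byte-stable.

```python
import json


def sort_for_export(records: list) -> list:
    """Order records by timestamp, then workload, then operation."""
    return sorted(
        records,
        key=lambda r: (r["timestamp"], r["workload"], r["operation"]),
    )


def to_ndjson(records: list) -> str:
    """Serialize sorted records as NDJSON with stable key order."""
    lines = [
        json.dumps(record, sort_keys=True, separators=(",", ":"))
        for record in sort_for_export(records)
    ]
    return "\n".join(lines)
```

Because Python's sorted is stable, records that tie on all three keys keep their input order, so a fully deterministic export also needs a deterministic input order or a further tie-breaker.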

Validation checklist

  • traceparent is present and propagated on both inbound and outbound calls.
  • Required fields present (tenant, workload, operation, status).
  • Scrubbing tests cover auth headers and bodies.
  • Sampling knobs configurable via env vars with documented defaults.
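
Two of the checklist items lend themselves to automated checks. The Python sketch below validates the four-field traceparent form defined by W3C Trace Context (version, trace-id, parent-id, flags) and reports missing required fields; the helper names are illustrative, and future traceparent versions that carry extra fields are deliberately rejected by this simplified check.

```python
import re

# version "-" trace-id "-" parent-id "-" trace-flags, lowercase hex only.
TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")
REQUIRED_FIELDS = ("tenant", "workload", "operation", "status")


def valid_traceparent(header: str) -> bool:
    """Check the header shape plus the all-zero and version ff rules."""
    match = TRACEPARENT_RE.fullmatch(header)
    if match is None:
        return False
    version, trace_id, parent_id, _flags = match.groups()
    if version == "ff":
        return False  # ff is not a valid version
    # All-zero trace-id or parent-id values are invalid.
    return trace_id != "0" * 32 and parent_id != "0" * 16


def missing_fields(record: dict) -> list:
    """Return the required envelope fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS
            if record.get(name) in (None, "")]
```

A test suite could run valid_traceparent against captured inbound and outbound headers and assert that missing_fields returns an empty list for every emitted record.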