git.stella-ops.org/docs/modules/telemetry/guides/tracing.md
2026-01-06 19:07:48 +02:00

Tracing Standards (DOCS-OBS-50-004)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

Goals

  • Consistent distributed tracing across services (API, workers, CLI).
  • Safe for offline/air-gapped deployments.
  • Deterministic span data for replay/debug.

Context propagation

  • Use W3C headers: traceparent (required), baggage (optional key/value pairs).
  • Preserve the incoming trace_id for all downstream calls; create a child span per operation.
  • For async work (queues, cron), copy traceparent and baggage into the message envelope; the consumer's span references the stored context through span links rather than adopting it as its parent.
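
The propagation rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the services' actual implementation: the function names, the envelope layout (`trace`, `payload`), and the choice to start a new trace on the consumer side are all hypothetical. Only the four-field version 00 form of traceparent is handled.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or span ids.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, flags


def child_traceparent(incoming):
    """Keep the incoming trace_id and mint a fresh span id for a downstream call."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        raise ValueError("invalid traceparent")
    trace_id, _parent_span_id, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def wrap_message(payload, traceparent, baggage=""):
    """Copy the trace context into a (hypothetical) message envelope."""
    return {"trace": {"traceparent": traceparent, "baggage": baggage}, "payload": payload}


def start_consumer_span(envelope, name):
    """Start a span with no parent and link it to the producer's stored context.

    Assumption: the consumer starts a new trace; the source only says the
    stored context is attached as a link rather than as the parent.
    """
    parsed = parse_traceparent(envelope["trace"]["traceparent"])
    links = []
    if parsed is not None:
        links.append({"trace_id": parsed[0], "span_id": parsed[1]})
    return {
        "name": name,
        "trace_id": secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_span_id": None,
        "links": links,
    }
```

A producer would call `wrap_message(payload, child_traceparent(incoming))` before enqueueing, and the worker would call `start_consumer_span(envelope, "notify.deliver")` when it picks the message up.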

Span conventions

  • Names: <component>.<operation> (e.g., riskengine.simulate, notify.deliver).
  • Required attributes: tenant, workload (service), env, region, version, operation, status.
  • HTTP spans: add http.method, http.route, http.status_code, net.peer.name, net.peer.port.
  • DB spans: db.system, db.name, db.operation, db.statement (omit literals).
  • Message spans: messaging.system, messaging.destination, messaging.operation (send|receive|process), messaging.message_id.
  • Errors: set status=error, include error.code, redacted error.message, retryable (bool).
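
A small checker can make these conventions testable in CI. The sketch below is hypothetical: `validate_span` is not an existing helper, and the character set allowed in span names is an assumption, since the source only gives the `<component>.<operation>` shape. The attribute lists are taken from the bullets above.

```python
import re

# Required on every span, per the conventions above.
REQUIRED_ATTRIBUTES = (
    "tenant", "workload", "env", "region", "version", "operation", "status",
)
# Additionally required when status=error.
ERROR_ATTRIBUTES = ("error.code", "error.message", "retryable")
# <component>.<operation>; the exact allowed characters are an assumption.
SPAN_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$")


def validate_span(name, attributes):
    """Return a list of problems; an empty list means the span conforms."""
    problems = []
    if not SPAN_NAME_RE.match(name):
        problems.append(f"name {name!r} is not <component>.<operation>")
    for key in REQUIRED_ATTRIBUTES:
        if key not in attributes:
            problems.append(f"missing attribute {key}")
    if attributes.get("status") == "error":
        for key in ERROR_ATTRIBUTES:
            if key not in attributes:
                problems.append(f"missing error attribute {key}")
        if "retryable" in attributes and not isinstance(attributes["retryable"], bool):
            problems.append("retryable must be a bool")
    return problems
```

Returning a list of problems rather than raising on the first one lets a test report every violation for a span in a single run.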

Sampling

  • Default head sampling: 10% non-prod, 5% prod.
  • Always sample spans with status=error|fault or audit=true.
  • Allow override via env Tracing__SampleRate (0–1) per service; document in runbooks.
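
The sampling rules can be sketched as two small functions. Several details here are assumptions rather than part of the standard: the environment label `"prod"`, falling back to the default when the override is malformed or out of range, and deriving the decision from the trace id so that every service reaches the same verdict for one trace. Head sampling decides at span start, when status is usually not yet known, so the error override in this sketch presumes the decision can be revisited once status is set.

```python
import os

PROD_RATE = 0.05      # default head sampling rate in prod
NON_PROD_RATE = 0.10  # default head sampling rate everywhere else


def sample_rate(env, environ=None):
    """Resolve the sampling rate, honouring a valid Tracing__SampleRate override."""
    environ = os.environ if environ is None else environ
    raw = environ.get("Tracing__SampleRate")
    if raw is not None:
        try:
            rate = float(raw)
        except ValueError:
            rate = None
        # Values outside 0..1 (and NaN) are ignored in this sketch.
        if rate is not None and 0.0 <= rate <= 1.0:
            return rate
    return PROD_RATE if env == "prod" else NON_PROD_RATE


def should_sample(trace_id, rate, status="ok", audit=False):
    """Always keep error, fault, and audit spans; otherwise sample by trace id."""
    if status in ("error", "fault") or audit:
        return True
    # Map the low 56 bits of the 32-hex-char trace id onto [0, 1).
    bucket = int(trace_id[-14:], 16) / float(1 << 56)
    return bucket < rate
```

For example, `should_sample(trace_id, sample_rate("prod"))` keeps roughly 5% of traces when trace ids are uniformly random, while any span marked `audit=True` is kept regardless of the rate.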

Offline/air-gap posture

  • No external exporters; emit OTLP to local collector or file.
  • Disable remote enrichment; rely on bundled service map.
  • All timestamps are UTC; span IDs are deterministic only within the scope of a given traceparent (no GUID reuse).
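
One possible reading of the last point is to derive span ids from the traceparent plus per-span inputs, so that a replay reproduces the same ids while different traces never share them. The sketch below assumes that reading; the hashing scheme, the `ordinal` parameter, and both function names are hypothetical and not taken from the source.

```python
import hashlib
from datetime import datetime, timezone


def derive_span_id(traceparent, name, ordinal):
    """Derive a 16-hex-char span id that is stable for one traceparent.

    `ordinal` distinguishes repeated operations with the same name under the
    same parent (an assumption made for this sketch).
    """
    material = f"{traceparent}|{name}|{ordinal}".encode("utf-8")
    span_id = hashlib.sha256(material).hexdigest()[:16]
    # An all-zero span id is invalid in W3C Trace Context.
    return span_id if span_id != "0" * 16 else "0" * 15 + "1"


def utc_timestamp(moment=None):
    """Render a timezone-aware datetime as an ISO 8601 string in UTC."""
    moment = datetime.now(timezone.utc) if moment is None else moment
    if moment.tzinfo is None:
        raise ValueError("naive datetimes are ambiguous; attach a timezone")
    return moment.astimezone(timezone.utc).isoformat()
```

Because the traceparent is part of the hashed material, the same operation under a different trace yields a different id, which matches the "only in scope of traceparent" wording.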

Validation checklist

  • traceparent forwarded on every inbound/outbound call.
  • Required attributes present on spans.
  • Error spans include codes and redacted messages.
  • Sampling knobs documented in service config.