git.stella-ops.org/docs/observability/tracing.md

Tracing Standards (DOCS-OBS-50-004)

Last updated: 2025-11-25 (Docs Tasks Md.VI)

Goals

  • Consistent distributed tracing across services (API, workers, CLI).
  • Safe for offline/air-gapped deployments.
  • Deterministic span data for replay/debug.

Context propagation

  • Use W3C headers: traceparent (required), baggage (optional key/value pairs).
  • Preserve incoming trace_id for all downstream calls; create child spans per operation.
  • For async work (queues, cron), copy traceparent and baggage into the message envelope; the consuming span references the stored context through span links instead of adopting it as its parent.
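
The propagation rules above can be sketched with the Python standard library alone. The helper names (`parse_traceparent`, `child_traceparent`, `make_envelope`, `consumer_span`) and the envelope field names are hypothetical and not part of any StellaOps API; only the `traceparent` layout (`version-trace_id-parent_id-flags`) comes from the W3C Trace Context specification. The sketch also assumes that a consumer span starts a new trace of its own, since the rule only says the stored context is attached as a link rather than a parent.

```python
import re
import secrets

# W3C traceparent, version 00: 32-hex trace id, 16-hex parent (span) id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields; raise ValueError if malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace_id or parent_id is invalid")
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: str) -> str:
    """Header for a downstream call: same trace_id, fresh span id, same flags."""
    ctx = parse_traceparent(incoming)
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{ctx['trace_id']}-{span_id}-{ctx['flags']}"


def make_envelope(payload: dict, traceparent: str, baggage: str = "") -> dict:
    """Producer side: copy the trace context into the message envelope."""
    parse_traceparent(traceparent)  # refuse to enqueue a malformed context
    return {"payload": payload, "traceparent": traceparent, "baggage": baggage}


def consumer_span(envelope: dict, name: str) -> dict:
    """Consumer side: start a new span that links to the stored context."""
    stored = parse_traceparent(envelope["traceparent"])
    return {
        "name": name,
        "trace_id": secrets.token_hex(16),  # assumption: the consumer starts its own trace
        "span_id": secrets.token_hex(8),
        "parent_id": None,  # the stored context is a link, not a parent
        "links": [{"trace_id": stored["trace_id"], "span_id": stored["parent_id"]}],
    }
```

A synchronous caller would send `child_traceparent(incoming)` on each outbound request, while a queue producer would call `make_envelope` and leave span creation to the consumer.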

Span conventions

  • Names: <component>.<operation> (e.g., riskengine.simulate, notify.deliver).
  • Required attributes: tenant, workload (service), env, region, version, operation, status.
  • HTTP spans: add http.method, http.route, http.status_code, net.peer.name, net.peer.port.
  • DB spans: db.system, db.name, db.operation, db.statement (omit literals).
  • Message spans: messaging.system, messaging.destination, messaging.operation (send|receive|process), messaging.message_id.
  • Errors: set status=error, include error.code, redacted error.message, retryable (bool).
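
As an illustration of how these conventions might be checked in a unit test or CI lint, the sketch below validates a span name and its attributes and scrubs literals from a `db.statement`. The function names, the allowed character set for span names, and the regex-based literal scrubbing are assumptions made for this example; a real service would more likely rely on parameterized queries than on regex scrubbing.

```python
import re

# <component>.<operation>; the lowercase/underscore character set is an assumption.
SPAN_NAME = re.compile(r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$")

REQUIRED = ("tenant", "workload", "env", "region", "version", "operation", "status")
ERROR_REQUIRED = ("error.code", "error.message", "retryable")

_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")


def validate_span(name: str, attributes: dict) -> list:
    """Return a list of problems; an empty list means the span conforms."""
    problems = []
    if not SPAN_NAME.match(name):
        problems.append(f"name {name!r} is not <component>.<operation>")
    for key in REQUIRED:
        if key not in attributes:
            problems.append(f"missing required attribute {key!r}")
    if attributes.get("status") == "error":
        for key in ERROR_REQUIRED:
            if key not in attributes:
                problems.append(f"error span missing {key!r}")
        if "retryable" in attributes and not isinstance(attributes["retryable"], bool):
            problems.append("retryable must be a bool")
    return problems


def omit_literals(statement: str) -> str:
    """Replace quoted strings and bare numbers in a SQL-like statement with '?'."""
    without_strings = _STRING_LITERAL.sub("?", statement)
    return _NUMBER_LITERAL.sub("?", without_strings)
```

Returning a list of problems rather than raising on the first one lets a lint step report every violation for a span in a single pass.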

Sampling

  • Default head sampling: 10% non-prod, 5% prod.
  • Always sample spans with status=error|fault or audit=true.
  • Allow a per-service override via the env variable Tracing__SampleRate (0 to 1); document it in runbooks.
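
A minimal sketch of the sampling decision follows, assuming that `Tracing__SampleRate` holds a fraction between 0 and 1, that the production environment is labelled `prod`, and that the decision is derived deterministically from the trace id (a trace-id-ratio approach). None of these details are stated in this document, and the function names are hypothetical. Error status is usually known only when a span ends, so the sketch applies the always-sample rule wherever the attributes happen to be available.

```python
import os

PROD_RATE = 0.05      # default head sampling in prod
NON_PROD_RATE = 0.10  # default head sampling everywhere else


def sample_rate(env: str, environ=os.environ) -> float:
    """Resolve the sampling rate: a valid env override wins, else the default."""
    raw = environ.get("Tracing__SampleRate")
    if raw is not None:
        try:
            rate = float(raw)
        except ValueError:
            rate = None
        # Ignore out-of-range values (and NaN, which fails both comparisons).
        if rate is not None and 0.0 <= rate <= 1.0:
            return rate
    return PROD_RATE if env == "prod" else NON_PROD_RATE


def should_sample(trace_id: str, env: str, attributes: dict, environ=os.environ) -> bool:
    """Decide whether to keep a span, given its 32-hex trace id and attributes."""
    if attributes.get("status") in ("error", "fault"):
        return True
    if attributes.get("audit") in (True, "true"):
        return True
    # Map the low 64 bits of the trace id onto [0, 2**64) and compare with the
    # threshold as integers, so a rate of 1.0 keeps every trace and 0.0 keeps none.
    bucket = int(trace_id[-16:], 16)
    threshold = int(sample_rate(env, environ) * 2**64)
    return bucket < threshold
```

Because the decision depends only on the trace id and the configured rate, every service that sees the same trace reaches the same verdict without coordination.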

Offline/air-gap posture

  • No external exporters; emit OTLP to local collector or file.
  • Disable remote enrichment; rely on bundled service map.
  • All timestamps in UTC; span IDs are deterministic only within the scope of a traceparent (no GUID reuse).
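
One possible reading of the last rule is sketched below: a span id derived by hashing the parent `traceparent`, the span name, and a per-parent sequence number, so that replaying the same trace yields the same ids while different traces never share one. The hashing scheme, the `sequence` counter, and both function names are assumptions for illustration, not the documented StellaOps mechanism.

```python
import hashlib
from datetime import datetime, timezone


def deterministic_span_id(traceparent: str, name: str, sequence: int) -> str:
    """Derive a 16-hex span id from the parent context, span name, and a counter."""
    material = f"{traceparent}|{name}|{sequence}".encode("utf-8")
    span_id = hashlib.sha256(material).hexdigest()[:16]
    # W3C Trace Context forbids the all-zero span id; remap that rare case.
    return span_id if span_id != "0" * 16 else "0" * 15 + "1"


def utc_timestamp(moment: datetime) -> str:
    """Render a timezone-aware datetime as an ISO 8601 UTC string ending in Z."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before recording")
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="microseconds").replace("+00:00", "Z")
```

Rejecting naive datetimes keeps a local-time value from being recorded as if it were UTC.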

Validation checklist

  • traceparent forwarded on every inbound/outbound call.
  • Required attributes present on spans.
  • Error spans include codes and redacted messages.
  • Sampling knobs documented in service config.