# Tracing Standards (DOCS-OBS-50-004)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Goals
- Consistent distributed tracing across services (API, workers, CLI).
- Safe for offline/air-gapped deployments.
- Deterministic span data for replay/debug.
## Context propagation
- Use W3C headers: `traceparent` (required), `baggage` (optional key/value pairs).
- Preserve incoming `trace_id` for all downstream calls; create child spans per operation.
- For async work (queues, cron), copy `traceparent` and `baggage` into the message envelope; the consumer starts a new span that references the stored context through span **links** rather than adopting it as a parent.
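
The propagation rules above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the helper names (`parse_traceparent`, `child_traceparent`, `to_envelope`, `consumer_span`) and the envelope layout are hypothetical, and the sketch assumes that a linked consumer span starts its own trace.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags in lowercase hex.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: str


def parse_traceparent(header: str) -> TraceContext:
    """Parse a version-00 traceparent header, rejecting malformed values."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid per the W3C specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        raise ValueError("traceparent contains an all-zero id")
    return TraceContext(trace_id, span_id, flags)


def child_traceparent(incoming: str) -> str:
    """Build the header for a downstream call: same trace_id, new span id."""
    ctx = parse_traceparent(incoming)
    return f"00-{ctx.trace_id}-{secrets.token_hex(8)}-{ctx.flags}"


def to_envelope(payload: dict, traceparent: str, baggage: str = "") -> dict:
    """Copy the trace context into a (hypothetical) message envelope."""
    return {"payload": payload, "trace": {"traceparent": traceparent, "baggage": baggage}}


def consumer_span(envelope: dict) -> dict:
    """Start a span for async work that links to the stored context.

    Assumption: the span has no parent and begins a new trace; the producer's
    context is recorded only as a link.
    """
    stored = parse_traceparent(envelope["trace"]["traceparent"])
    parent: Optional[str] = None
    return {
        "trace_id": secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent,
        "links": [{"trace_id": stored.trace_id, "span_id": stored.span_id}],
    }
```

In a synchronous call path, `child_traceparent(request_header)` would produce the header to send downstream; a queue producer would call `to_envelope(...)` and the worker would call `consumer_span(...)` when it picks the message up.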
## Span conventions
- Names: `<component>.<operation>` (e.g., `riskengine.simulate`, `notify.deliver`).
- Required attributes: `tenant`, `workload` (service), `env`, `region`, `version`, `operation`, `status`.
- HTTP spans: add `http.method`, `http.route`, `http.status_code`, `net.peer.name`, `net.peer.port`.
- DB spans: `db.system`, `db.name`, `db.operation`, `db.statement` (omit literals).
- Message spans: `messaging.system`, `messaging.destination`, `messaging.operation` (`send|receive|process`), `messaging.message_id`.
- Errors: set `status=error`, include `error.code`, redacted `error.message`, `retryable` (bool).
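
A small checker can make these conventions testable in CI. The sketch below is illustrative only: `validate_span` is a hypothetical helper, spans are modelled as a plain name plus attribute dictionary, and the lowercase-with-underscores name pattern is an assumption inferred from the two examples above.

```python
import re
from typing import Dict, List

REQUIRED_ATTRIBUTES = ("tenant", "workload", "env", "region", "version", "operation", "status")
ERROR_ATTRIBUTES = ("error.code", "error.message", "retryable")

# Assumption: component and operation are lowercase identifiers joined by one dot.
_NAME_RE = re.compile(r"[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*")


def validate_span(name: str, attributes: Dict[str, object]) -> List[str]:
    """Return a list of convention violations; an empty list means the span passes."""
    problems: List[str] = []
    if _NAME_RE.fullmatch(name) is None:
        problems.append(f"name {name!r} is not <component>.<operation>")
    for key in REQUIRED_ATTRIBUTES:
        if key not in attributes:
            problems.append(f"missing required attribute {key!r}")
    if attributes.get("status") == "error":
        # Error spans carry a code, a redacted message, and a retryable flag.
        for key in ERROR_ATTRIBUTES:
            if key not in attributes:
                problems.append(f"error span missing {key!r}")
        if "retryable" in attributes and not isinstance(attributes["retryable"], bool):
            problems.append("'retryable' must be a bool")
    return problems
```

The function checks presence and shape only; whether `error.message` has actually been redacted cannot be verified mechanically and still needs review.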
## Sampling
- Default head sampling: 10% non-prod, 5% prod.
- Always sample spans with `status=error|fault` or `audit=true`.
- Allow override via env `Tracing__SampleRate` (0 to 1) per service; document in runbooks.
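
The sampling rules can be sketched as two small functions. This is an assumption-laden illustration: the function names are hypothetical, the environment is identified by the string `"prod"`, invalid override values fall back to the default, and the per-trace decision is derived from the trace id so that every service reaches the same verdict. Because `status` is only known once a span has ended, the always-sample check is assumed to run at that point.

```python
import os
from typing import Mapping, Optional

PROD_RATE = 0.05
NON_PROD_RATE = 0.10


def effective_sample_rate(env_name: str, environ: Optional[Mapping[str, str]] = None) -> float:
    """Resolve the sample rate: a valid Tracing__SampleRate override wins, else the default."""
    environ = os.environ if environ is None else environ
    raw = environ.get("Tracing__SampleRate")
    if raw is not None:
        try:
            rate = float(raw)
        except ValueError:
            rate = -1.0  # unparseable override: treated as invalid
        if 0.0 <= rate <= 1.0:
            return rate
    return PROD_RATE if env_name == "prod" else NON_PROD_RATE


def should_sample(trace_id: str, attributes: Mapping[str, object], rate: float) -> bool:
    """Keep error, fault, and audit spans unconditionally; otherwise sample by trace id."""
    if attributes.get("status") in ("error", "fault"):
        return True
    if attributes.get("audit") in (True, "true"):
        return True
    # Deterministic decision: compare the low 56 bits of the trace id to a threshold.
    threshold = int(rate * (1 << 56))
    return int(trace_id[-14:], 16) < threshold
```

Using the trace id rather than a random draw keeps the decision stable across replays of the same trace.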
## Offline/air-gap posture
- No external exporters; emit OTLP to local collector or file.
- Disable remote enrichment; rely on bundled service map.
- All timestamps are UTC; span IDs are deterministic only within the scope of a given `traceparent` (no GUID reuse).
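
One possible reading of the determinism rule is that a span id is derived from the incoming `traceparent` plus span-local inputs, so a replay of the same trace yields the same ids while different traces never share one. The sketch below follows that reading; the hashing scheme, the `sequence` counter, and both function names are assumptions, not something this standard prescribes.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def deterministic_span_id(traceparent: str, operation: str, sequence: int) -> str:
    """Derive a 16-hex-character span id from the parent context and span-local inputs.

    `sequence` is a hypothetical per-parent counter that separates repeated
    operations under the same traceparent.
    """
    material = f"{traceparent}|{operation}|{sequence}".encode("utf-8")
    span_id = hashlib.sha256(material).hexdigest()[:16]
    # An all-zero span id is invalid in W3C Trace Context; remap that unlikely case.
    return span_id if span_id != "0" * 16 else "0" * 15 + "1"


def utc_timestamp(dt: Optional[datetime] = None) -> str:
    """Format a timestamp in UTC with a trailing Z; defaults to the current time.

    Naive datetimes are interpreted as local time by `astimezone`.
    """
    dt = datetime.now(timezone.utc) if dt is None else dt.astimezone(timezone.utc)
    return dt.isoformat(timespec="microseconds").replace("+00:00", "Z")
```

Because the `traceparent` is part of the hash input, the same operation under a different parent produces a different id, which is what keeps ids from being reused across traces.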
## Validation checklist
- [ ] `traceparent` forwarded on every inbound/outbound call.
- [ ] Required attributes present on spans.
- [ ] Error spans include codes and redacted messages.
- [ ] Sampling knobs documented in service config.