# Tracing Standards (DOCS-OBS-50-004)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Goals
- Consistent distributed tracing across services (API, workers, CLI).
- Safe for offline/air-gapped deployments.
- Deterministic span data for replay/debug.
## Context propagation
- Use W3C headers: `traceparent` (required), `baggage` (optional key/value pairs).
- Preserve incoming `trace_id` for all downstream calls; create child spans per operation.
- For async work (queues, cron), copy `traceparent` and `baggage` into the message envelope; the consumer starts a new span that references the stored context through a span **link** rather than adopting it as its parent.
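
The propagation rules above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the envelope shape, the helper names (`child_traceparent`, `enqueue_envelope`, `start_consumer_span`), and the choice to give the consumer span a fresh trace id are assumptions made for the example.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent ids.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, flags


def child_traceparent(header):
    """Synchronous call: keep the incoming trace_id, mint a new span id."""
    parsed = parse_traceparent(header)
    if parsed is None:
        raise ValueError("invalid traceparent")
    trace_id, _parent_span_id, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def enqueue_envelope(payload, traceparent, baggage=None):
    """Hypothetical envelope shape: trace context travels beside the payload."""
    envelope = {"payload": payload, "traceparent": traceparent}
    if baggage:
        envelope["baggage"] = baggage
    return envelope


def start_consumer_span(envelope, name):
    """Async work: new span with a link to the stored context, no parent.

    Assumption: the consumer span starts its own trace, which is how a
    parentless span with links typically behaves.
    """
    parsed = parse_traceparent(envelope["traceparent"])
    if parsed is None:
        raise ValueError("invalid traceparent in envelope")
    linked_trace_id, linked_span_id, _flags = parsed
    return {
        "name": name,
        "trace_id": secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_span_id": None,
        "links": [{"trace_id": linked_trace_id, "span_id": linked_span_id}],
        "baggage": envelope.get("baggage"),
    }
```

A producer would call `enqueue_envelope(...)` with its current `traceparent`; the worker then calls `start_consumer_span(envelope, "notify.deliver")` when it picks the message up.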
## Span conventions
- Names: `<component>.<operation>` (e.g., `riskengine.simulate`, `notify.deliver`).
- Required attributes: `tenant`, `workload` (service), `env`, `region`, `version`, `operation`, `status`.
- HTTP spans: add `http.method`, `http.route`, `http.status_code`, `net.peer.name`, `net.peer.port`.
- DB spans: `db.system`, `db.name`, `db.operation`, `db.statement` (omit literals).
- Message spans: `messaging.system`, `messaging.destination`, `messaging.operation` (`send|receive|process`), `messaging.message_id`.
- Errors: set `status=error`, include `error.code`, redacted `error.message`, `retryable` (bool).
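
A small checker can make the naming and attribute rules above testable in CI. The sketch below is one possible shape: `validate_span` is a hypothetical helper, and the lowercase-identifier pattern for `<component>.<operation>` is an assumption inferred from the examples, not a stated rule.

```python
import re

REQUIRED_ATTRIBUTES = (
    "tenant", "workload", "env", "region", "version", "operation", "status",
)
ERROR_ATTRIBUTES = ("error.code", "error.message", "retryable")

# Assumption: component and operation are lowercase identifiers,
# as in `riskengine.simulate` and `notify.deliver`.
SPAN_NAME = re.compile(r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$")


def validate_span(name, attributes):
    """Return a list of human-readable problems; an empty list means the span passes."""
    problems = []
    if not SPAN_NAME.match(name):
        problems.append(f"name {name!r} is not <component>.<operation>")
    for key in REQUIRED_ATTRIBUTES:
        if key not in attributes:
            problems.append(f"missing required attribute {key!r}")
    if attributes.get("status") == "error":
        for key in ERROR_ATTRIBUTES:
            if key not in attributes:
                problems.append(f"error span is missing {key!r}")
        if "retryable" in attributes and not isinstance(attributes["retryable"], bool):
            problems.append("'retryable' must be a bool")
    return problems
```

For example, `validate_span("notify.deliver", {"tenant": "acme"})` reports the six missing required attributes, while a fully attributed span returns `[]`.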
## Sampling
- Default head sampling: 10% non-prod, 5% prod.
- Always sample spans with `status=error|fault` or `audit=true`.
- Allow override via env `Tracing__SampleRate` (0–1) per service; document in runbooks.
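
The sampling rules above could be combined roughly as follows. This is a sketch under assumptions: the function names are hypothetical, invalid override values fall back to the default rate, and the keep/drop decision is derived from the low 64 bits of the trace id so that every service reaches the same verdict for a given trace.

```python
import os

PROD_RATE = 0.05
NON_PROD_RATE = 0.10


def resolve_sample_rate(env_name, environ=None):
    """Default rate per environment, overridable via Tracing__SampleRate (0-1)."""
    environ = os.environ if environ is None else environ
    default = PROD_RATE if env_name == "prod" else NON_PROD_RATE
    raw = environ.get("Tracing__SampleRate")
    if raw is None:
        return default
    try:
        rate = float(raw)
    except ValueError:
        return default  # Assumption: ignore unparsable overrides.
    if not 0.0 <= rate <= 1.0:
        return default  # Also rejects NaN and infinities.
    return rate


def should_sample(trace_id, rate, attributes):
    """Always keep error/fault/audit spans; otherwise sample by trace id."""
    if attributes.get("status") in ("error", "fault"):
        return True
    if attributes.get("audit") in (True, "true"):
        return True
    # Deterministic decision: compare the low 64 bits of the trace id
    # against the rate scaled to the same range.
    threshold = int(rate * 2**64)
    return int(trace_id[-16:], 16) < threshold
```

Because the decision depends only on the trace id and the rate, replaying the same trace with the same configuration yields the same sampling outcome.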
## Offline/air-gap posture
- No external exporters; emit OTLP to local collector or file.
- Disable remote enrichment; rely on bundled service map.
- All timestamps are in UTC; span IDs are deterministic only within the scope of a given `traceparent` (no GUID reuse).
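
One way to enforce the "no external exporters" rule is a startup guard on the configured exporter endpoint. The sketch below is illustrative only: `is_local_endpoint` and its `allowed_hosts` parameter are hypothetical, and treating loopback and private-range addresses as "local" is an assumption that each deployment should confirm.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(endpoint, allowed_hosts=("localhost",)):
    """Return True if the exporter target is a file or a local/private collector."""
    parsed = urlparse(endpoint)
    if parsed.scheme == "file":
        return True  # File export never leaves the host.
    host = parsed.hostname
    if host is None:
        return False  # Endpoints without a scheme are rejected in this sketch.
    if host in allowed_hosts:
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # Unknown hostname: treat as external.
    return address.is_loopback or address.is_private
```

A service could call this on its OTLP endpoint at startup and refuse to start, or fall back to file export, when the check fails.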
## Validation checklist
- [ ] `traceparent` forwarded on every inbound/outbound call.
- [ ] Required attributes present on spans.
- [ ] Error spans include codes and redacted messages.
- [ ] Sampling knobs documented in service config.