Files
git.stella-ops.org/docs/modules/telemetry/operations/observability.md
StellaOps Bot 17d45a6d30
Some checks failed
Airgap Sealed CI Smoke / sealed-smoke (push) Has been cancelled
Docs CI / lint-and-preview (push) Has been cancelled
Export Center CI / export-ci (push) Has been cancelled
feat: Implement Filesystem and MongoDB provenance writers for PackRun execution context
- Added `FilesystemPackRunProvenanceWriter` to write provenance manifests to the filesystem.
- Introduced `MongoPackRunArtifactReader` to read artifacts from MongoDB.
- Created `MongoPackRunProvenanceWriter` to store provenance manifests in MongoDB.
- Developed unit tests for filesystem and MongoDB provenance writers.
- Established `ITimelineEventStore` and `ITimelineIngestionService` interfaces for timeline event handling.
- Implemented `TimelineIngestionService` to validate and persist timeline events with hashing.
- Created PostgreSQL schema and migration scripts for timeline indexing.
- Added dependency injection support for timeline indexer services.
- Developed tests for timeline ingestion and schema validation.
2025-11-30 15:38:14 +02:00

2.2 KiB

Telemetry observability runbook (stub · 2025-11-29 demo)

Dashboards (offline import)

  • Grafana JSON: docs/modules/telemetry/operations/dashboards/telemetry-observability.json (import locally; no external data sources assumed).
  • Planned panels: collector uptime, scrape errors, ingestion/backlog per tenant, storage retention headroom, query latency p95/p99, and OTLP export errors.

Key metrics

  • telemetry_collector_uptime_seconds — per-collector uptime.
  • telemetry_scrape_failures_total{job} — scrape failures per job.
  • telemetry_ingest_backlog — queued spans/logs/metrics awaiting storage.
  • telemetry_storage_retention_percent_used — storage utilization against retention budget.
  • telemetry_query_latency_seconds_bucket{route} — API/query latency.
  • telemetry_otlp_export_failures_total{signal} — OTLP export failures by signal.

Logs & traces

  • Correlate by trace_id and tenant; include collector_id, pipeline, exporter fields.
  • Traces disabled by default for air-gap; enable by setting OTLP endpoints to on-prem collectors.

Health/diagnostics

  • /health/liveness and /health/readiness (collector + storage gateway) check exporter reachability and disk headroom.
  • /status exposes build version, commit, feature flags; verify against offline bundle manifest.
  • Storage probe: GET /api/storage/usage (if available) to confirm retention headroom; otherwise rely on Prometheus metrics.

Alert hints

  • OTLP export failures > 0 over 5m.
  • Ingest backlog above threshold (configurable per tenant/workload).
  • Query latency p99 > 1s for /api/query routes.
  • Storage utilization > 85% of retention budget.

Offline verification steps

  1. Import Grafana JSON locally; point to Prometheus scrape labeled telemetry.
  2. Run collector smoke: push sample OTLP spans/logs/metrics to local collector and confirm metrics emit in Prometheus.
  3. Fetch /status and compare commit/version to offline bundle manifest.

Evidence locations

  • Sprint tracker: docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md.
  • Module docs: README.md, architecture.md, implementation_plan.md.
  • Dashboard stub: operations/dashboards/telemetry-observability.json.