- Added `FilesystemPackRunProvenanceWriter` to write provenance manifests to the filesystem. - Introduced `MongoPackRunArtifactReader` to read artifacts from MongoDB. - Created `MongoPackRunProvenanceWriter` to store provenance manifests in MongoDB. - Developed unit tests for filesystem and MongoDB provenance writers. - Established `ITimelineEventStore` and `ITimelineIngestionService` interfaces for timeline event handling. - Implemented `TimelineIngestionService` to validate and persist timeline events with hashing. - Created PostgreSQL schema and migration scripts for timeline indexing. - Added dependency injection support for timeline indexer services. - Developed tests for timeline ingestion and schema validation.
2.2 KiB
2.2 KiB
Telemetry observability runbook (stub · 2025-11-29 demo)
Dashboards (offline import)
- Grafana JSON:
docs/modules/telemetry/operations/dashboards/telemetry-observability.json(import locally; no external data sources assumed). - Planned panels: collector uptime, scrape errors, ingestion/backlog per tenant, storage retention headroom, query latency p95/p99, and OTLP export errors.
Key metrics
telemetry_collector_uptime_seconds— per-collector uptime.telemetry_scrape_failures_total{job}— scrape failures per job.telemetry_ingest_backlog— queued spans/logs/metrics awaiting storage.telemetry_storage_retention_percent_used— storage utilization against retention budget.telemetry_query_latency_seconds_bucket{route}— API/query latency.telemetry_otlp_export_failures_total{signal}— OTLP export failures by signal.
Logs & traces
- Correlate by
trace_idandtenant; includecollector_id,pipeline,exporterfields. - Traces disabled by default for air-gap; enable by setting OTLP endpoints to on-prem collectors.
Health/diagnostics
/health/livenessand/health/readiness(collector + storage gateway) check exporter reachability and disk headroom./statusexposes build version, commit, feature flags; verify against offline bundle manifest.- Storage probe:
GET /api/storage/usage(if available) to confirm retention headroom; otherwise rely on Prometheus metrics.
Alert hints
- OTLP export failures > 0 over 5m.
- Ingest backlog above threshold (configurable per tenant/workload).
- Query latency p99 > 1s for
/api/queryroutes. - Storage utilization > 85% of retention budget.
Offline verification steps
- Import Grafana JSON locally; point to Prometheus scrape labeled
telemetry. - Run collector smoke: push sample OTLP spans/logs/metrics to local collector and confirm metrics emit in Prometheus.
- Fetch
/statusand compare commit/version to offline bundle manifest.
Evidence locations
- Sprint tracker:
docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md. - Module docs:
README.md,architecture.md,implementation_plan.md. - Dashboard stub:
operations/dashboards/telemetry-observability.json.