feat: Implement Filesystem and MongoDB provenance writers for PackRun execution context
- Added `FilesystemPackRunProvenanceWriter` to write provenance manifests to the filesystem.
- Introduced `MongoPackRunArtifactReader` to read artifacts from MongoDB.
- Created `MongoPackRunProvenanceWriter` to store provenance manifests in MongoDB.
- Developed unit tests for filesystem and MongoDB provenance writers.
- Established `ITimelineEventStore` and `ITimelineIngestionService` interfaces for timeline event handling.
- Implemented `TimelineIngestionService` to validate and persist timeline events with hashing.
- Created PostgreSQL schema and migration scripts for timeline indexing.
- Added dependency injection support for timeline indexer services.
- Developed tests for timeline ingestion and schema validation.
@@ -0,0 +1,6 @@
{
  "_note": "Placeholder Grafana dashboard stub for Telemetry. Replace panels when metrics endpoints are available; keep offline-import friendly.",
  "schemaVersion": 39,
  "title": "Telemetry Observability (stub)",
  "panels": []
}
38 docs/modules/telemetry/operations/observability.md Normal file
@@ -0,0 +1,38 @@
# Telemetry observability runbook (stub · 2025-11-29 demo)

## Dashboards (offline import)
- Grafana JSON: `docs/modules/telemetry/operations/dashboards/telemetry-observability.json` (import locally; no external data sources assumed).
- Planned panels: collector uptime, scrape errors, ingestion/backlog per tenant, storage retention headroom, query latency p95/p99, and OTLP export errors.
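For repeatable offline setup, a small import helper along the following lines can push the stub into a local Grafana instance. This is a sketch only: the Grafana URL, the `GRAFANA_TOKEN` service-account token, and the helper itself are assumptions, not part of the module.

```python
# Hypothetical import helper (not part of the module); assumes a local Grafana
# instance and a service-account token exported as GRAFANA_TOKEN.
import json
import os

import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")
TOKEN = os.environ["GRAFANA_TOKEN"]
DASHBOARD_PATH = "docs/modules/telemetry/operations/dashboards/telemetry-observability.json"

with open(DASHBOARD_PATH, encoding="utf-8") as fh:
    dashboard = json.load(fh)

# Grafana's dashboard API accepts the raw dashboard JSON wrapped in a payload.
resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"dashboard": dashboard, "overwrite": True},
    timeout=10,
)
resp.raise_for_status()
print("imported:", resp.json().get("url"))
```

Manual import through the Grafana UI works just as well in an air-gapped setup; the helper only avoids the click-through.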

## Key metrics
- `telemetry_collector_uptime_seconds` — per-collector uptime.
- `telemetry_scrape_failures_total{job}` — scrape failures per job.
- `telemetry_ingest_backlog` — queued spans/logs/metrics awaiting storage.
- `telemetry_storage_retention_percent_used` — storage utilization against retention budget.
- `telemetry_query_latency_seconds_bucket{route}` — API/query latency.
- `telemetry_otlp_export_failures_total{signal}` — OTLP export failures by signal.
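A minimal sketch for confirming these series exist, assuming a local Prometheus at `http://localhost:9090` that scrapes the telemetry targets; the metric names are taken from the list above.

```python
# Presence check for the key metrics against the Prometheus instant-query API.
import requests

PROM_URL = "http://localhost:9090"  # assumption; adjust for your environment
METRICS = [
    "telemetry_collector_uptime_seconds",
    "telemetry_scrape_failures_total",
    "telemetry_ingest_backlog",
    "telemetry_storage_retention_percent_used",
    "telemetry_query_latency_seconds_bucket",
    "telemetry_otlp_export_failures_total",
]

for metric in METRICS:
    data = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": metric}, timeout=10
    ).json()
    series = data.get("data", {}).get("result", [])
    status = "present" if series else "MISSING"
    print(f"{metric}: {status} ({len(series)} series)")
```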

## Logs & traces
- Correlate by `trace_id` and `tenant`; include `collector_id`, `pipeline`, `exporter` fields (an example log shape is sketched below).
- Traces are disabled by default for air-gapped deployments; enable them by setting OTLP endpoints to on-prem collectors.
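For illustration only, a correlatable log line carrying the fields above might look like the following; every value is a placeholder, and the real emitters (collector, storage gateway) are not shown here.

```python
# Illustrative log shape only; field values are placeholders.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

log_line = {
    "message": "exported batch",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # placeholder W3C trace id
    "tenant": "tenant-a",
    "collector_id": "collector-01",
    "pipeline": "traces/default",
    "exporter": "otlp/onprem",
}
# One JSON object per line keeps grep-based correlation simple.
logging.info(json.dumps(log_line))
```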

## Health/diagnostics
- `/health/liveness` and `/health/readiness` (collector + storage gateway) check exporter reachability and disk headroom.
- `/status` exposes build version, commit, feature flags; verify against offline bundle manifest.
- Storage probe: `GET /api/storage/usage` (if available) to confirm retention headroom; otherwise rely on Prometheus metrics.
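A throwaway probe such as the sketch below can exercise these endpoints during offline verification. The collector and storage-gateway base URLs are assumptions; substitute the real hosts for your deployment.

```python
# Probe the health/diagnostics endpoints listed above. Base URLs are assumptions.
import requests

TARGETS = {
    "collector": "http://localhost:8080",
    "storage-gateway": "http://localhost:8081",
}

for name, base in TARGETS.items():
    for path in ("/health/liveness", "/health/readiness", "/status"):
        try:
            resp = requests.get(f"{base}{path}", timeout=5)
            print(f"{name}{path}: {resp.status_code}")
        except requests.RequestException as exc:
            print(f"{name}{path}: unreachable ({exc})")

# Optional storage probe; fall back to Prometheus metrics if it is unavailable.
try:
    usage = requests.get(f"{TARGETS['storage-gateway']}/api/storage/usage", timeout=5)
    print("storage usage:", usage.status_code, usage.text[:200])
except requests.RequestException:
    print("storage usage endpoint unreachable; rely on Prometheus metrics")
```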

## Alert hints
- OTLP export failures > 0 over 5m.
- Ingest backlog above threshold (configurable per tenant/workload).
- Query latency p99 > 1s for `/api/query` routes.
- Storage utilization > 85% of retention budget.
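Until real alert rules land, the hints above can be spot-checked as ad-hoc PromQL instant queries. The expressions below are illustrative translations of the bullets, not canonical rule definitions; they assume a local Prometheus at `http://localhost:9090`, and the backlog threshold is an arbitrary example.

```python
# Ad-hoc spot-check of the alert hints via PromQL instant queries.
import requests

PROM_URL = "http://localhost:9090"  # assumption

CHECKS = {
    "otlp_export_failures_5m": "sum(increase(telemetry_otlp_export_failures_total[5m])) > 0",
    "query_latency_p99_gt_1s": (
        "histogram_quantile(0.99, sum by (le) "
        '(rate(telemetry_query_latency_seconds_bucket{route=~"/api/query.*"}[5m]))) > 1'
    ),
    "retention_above_85_percent": "max(telemetry_storage_retention_percent_used) > 85",
    # Backlog thresholds are per tenant/workload; 1000 is only an example value.
    "ingest_backlog_above_1000": "max by (tenant) (telemetry_ingest_backlog) > 1000",
}

for name, expr in CHECKS.items():
    data = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10
    ).json()
    firing = bool(data.get("data", {}).get("result"))
    print(f"{name}: {'FIRING' if firing else 'ok'}")
```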

## Offline verification steps
1) Import the Grafana JSON locally; point it at the Prometheus scrape job labeled `telemetry`.
2) Run a collector smoke test: push sample OTLP spans/logs/metrics to the local collector and confirm the metrics appear in Prometheus (a sample span push is sketched below).
3) Fetch `/status` and compare the commit/version to the offline bundle manifest.
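One way to run the span half of step 2 is the OpenTelemetry Python SDK, assuming `opentelemetry-sdk` and `opentelemetry-exporter-otlp` are included in the offline bundle and the local collector listens on gRPC port 4317; logs and metrics follow the same pattern with their respective exporters.

```python
# Sample span push for the collector smoke test. Endpoint and attributes are assumptions.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Placeholder attributes; the tenant label is an assumption, not a requirement.
resource = Resource.create({"service.name": "telemetry-smoke", "tenant": "tenant-a"})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("telemetry.smoke")
with tracer.start_as_current_span("smoke-span"):
    pass  # the span itself is the payload

provider.shutdown()  # flush the batch before the process exits
print("pushed 1 span; check ingest/backlog metrics in Prometheus")
```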

## Evidence locations
- Sprint tracker: `docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md`.
- Module docs: `README.md`, `architecture.md`, `implementation_plan.md`.
- Dashboard stub: `operations/dashboards/telemetry-observability.json`.