feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
								docs/modules/telemetry/implementation_plan.md
@@ -0,0 +1,64 @@
# Implementation plan — Telemetry

## Delivery phases
- **Phase 1 – Collector & pipeline profiles**  
  Publish OpenTelemetry collector configs (`default`, `forensic`, `airgap`), establish ingest gateways, TLS/mTLS, and attribute redaction policies.
- **Phase 2 – Storage backends & retention**  
  Deploy Prometheus/Tempo/Loki (or equivalents) with retention tiers, bucket/object storage, deterministic manifest generation, and sealed-mode allowlists.
- **Phase 3 – Incident mode & forensic capture**  
  Implement incident toggles (CLI/API), tail sampling adjustments, forensic bundle generation (OTLP archives, manifest/signature), and Notify hooks.
- **Phase 4 – Observability dashboards & automation**  
  Deliver dashboards (service SLOs, queue depth, policy latency), alert rules, Grafana packages, and CLI automation for deployment and capture.
- **Phase 5 – Offline & compliance**  
  Ship Offline Kit artefacts (collectors, configs, dashboards, replay tooling), signed bundles, and documentation for air-gapped review workflows.
- **Phase 6 – Hardening & SOC handoff**  
  Complete RBAC integration, audit logging, incident response runbooks, performance tuning, and integration tests across services.

## Work breakdown
- **Collector configs**
  - Maintain config templates per profile with processors (redaction, batching, resource detection) and exporters.
  - CLI automation (`stella telemetry deploy`, `stella telemetry profile diff`), validation tests, and config signing.
- **Storage & retention**
  - Provision Prometheus/Tempo/Loki (or vendor equivalents) with retention tiers (default, forensic, airgap).
  - Ensure determinism (chunk manifests, content hashing), remote-write allowlists, sealed/offline modes.
  - Implement archivers for forensic bundles (metrics/traces/logs) with cosign signatures.
- **Incident mode**
  - API/CLI to toggle incident sampling, retention escalation, Notify signals, and auto bundle capture.
  - Hook into Orchestrator to respond to incidents and revert after cooldown.
- **Dashboards & alerts**
  - Dashboard packages for core services (ingestion, policy, export, attestation).
  - Alert rules for SLO burn, collector failure, exporter backlog, bundle generation errors.
  - Self-observability metrics (`collector_export_failures_total`, `telemetry_incident_mode{}`).
- **Offline support**
  - Offline Kit assets: collector binaries/configs, import scripts, dashboards, replay instructions, compliance checklists.
  - File-based exporters and manual transfer workflows with signed manifests.
- **Docs & runbooks**
  - Update observability overview, forensic capture guide, incident response checklist, sealed-mode instructions, RBAC matrix.
  - SOC handoff package with control objectives and audit evidence.

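The determinism requirement for chunk manifests and content hashing could be approached roughly as follows. This is a minimal sketch, not the module's actual manifest format: the function names `build_manifest` and `manifest_digest`, the `version` field, and the chunk naming are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import json


def build_manifest(chunks: dict[str, bytes]) -> str:
    """Return a canonical JSON manifest of chunk names, sizes, and SHA-256 digests.

    Hypothetical layout: the real manifest schema is not defined in this plan.
    """
    entries = [
        {"name": name, "sha256": hashlib.sha256(data).hexdigest(), "size": len(data)}
        for name, data in sorted(chunks.items())  # sort so input order never matters
    ]
    # Sorted keys and fixed separators keep the serialized bytes stable across runs.
    return json.dumps(
        {"version": 1, "chunks": entries}, sort_keys=True, separators=(",", ":")
    )


def manifest_digest(manifest: str) -> str:
    """Hash the canonical manifest; this is the value a signature would cover."""
    return hashlib.sha256(manifest.encode("utf-8")).hexdigest()
```

Because the manifest is serialized canonically, two captures of the same chunks produce the same digest regardless of iteration order, so a signature over that digest (for example with cosign) can be checked in an air-gapped environment without re-running the capture.
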
## Acceptance criteria
- Collectors ingest metrics/logs/traces across deployments, applying redaction rules and tenant isolation; profiles validate via CI.
- Storage backends retain data per default/forensic/airgap SLAs with deterministic chunk manifests and sealed-mode compliance.
- Incident mode toggles sampling to 100%, extends retention, triggers Notify, and captures forensic bundles signed with cosign.
- Dashboards and alerts cover service SLOs, queue depth, policy latency, ingestion violations, and telemetry stack health.
- CLI commands (`stella telemetry deploy/capture/status`) automate config rollout, forensic capture, and verification.
- Offline bundles replay telemetry in sealed environments using provided scripts and manifests.

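The incident-mode criterion (sampling raised to 100%, retention extended, revert after cooldown) can be modelled as a small state object. The sketch below is illustrative only: the class name `IncidentMode` and every numeric default (baseline sampling ratio, retention days, cooldown length) are assumptions, not values taken from this plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentMode:
    """Hypothetical incident-mode state; all defaults are placeholder values."""

    baseline_sampling: float = 0.10
    baseline_retention_days: int = 14
    incident_retention_days: int = 90
    cooldown_seconds: float = 3600.0
    activated_at: Optional[float] = None

    def activate(self, now: float) -> None:
        """Record the activation time (seconds on any monotonic scale)."""
        self.activated_at = now

    def is_active(self, now: float) -> bool:
        """Incident mode stays on until the cooldown has fully elapsed."""
        return (
            self.activated_at is not None
            and now - self.activated_at < self.cooldown_seconds
        )

    def sampling_ratio(self, now: float) -> float:
        """Sample everything during an incident, otherwise use the baseline."""
        return 1.0 if self.is_active(now) else self.baseline_sampling

    def retention_days(self, now: float) -> int:
        """Escalate retention during an incident, otherwise use the baseline."""
        if self.is_active(now):
            return self.incident_retention_days
        return self.baseline_retention_days
```

Passing `now` explicitly rather than reading the clock inside the class keeps the behaviour deterministic, which makes the toggle and its cooldown revert easy to cover in integration tests.
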
## Risks & mitigations
- **PII leakage:** strict redaction processors, policy-managed allowlists, audit tests.
- **Collector overload:** horizontal scaling, batching, circuit breakers, incident mode throttling.
- **Storage cost:** tiered retention, compression, pruning policies, offline archiving.
- **Air-gap drift:** offline kit refresh schedule, deterministic manifest verification.
- **Alert fatigue:** burn-rate alerts, deduping, SOC runbooks.

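As an illustration of the burn-rate mitigation for alert fatigue, the sketch below computes a burn rate as the observed error ratio divided by the error budget (1 minus the SLO target) and pages only when both a long and a short window exceed a threshold. The function names and the default threshold of 14.4 (a commonly cited value for a one-hour window against a 30-day SLO) are assumptions; the plan does not specify the alerting rules.

```python
from typing import Tuple


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed (1.0 = exactly on budget)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic means no budget consumed
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(
    long_window: Tuple[int, int],
    short_window: Tuple[int, int],
    slo_target: float,
    threshold: float = 14.4,  # assumed default, not taken from the plan
) -> bool:
    """Page only if both windows burn above the threshold.

    Each window is an (errors, total) pair. Requiring the short window as well
    stops the alert from firing long after the problem has already recovered.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

Requiring agreement between two windows is one way to reduce noisy pages: a brief spike that has already subsided fails the short-window check, while a slow sustained burn still trips both.
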
## Test strategy
- **Config lint/tests:** schema validation, unit tests for processors/exporters, golden configs.
- **Integration:** simulate service traces/logs/metrics, verify pipelines, incident toggles, bundle generation.
- **Performance:** load tests with peak ingestion, long retention windows, failover scenarios.
- **Security:** redaction verification, RBAC/tenant scoping, sealed-mode tests, signed config verification.
- **Offline:** capture bundles, transfer, replay, compliance attestation.

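A redaction verification test might look something like the following. It is a sketch under stated assumptions: `redact_attributes` stands in for whatever the collector's redaction processor does, `leaked_keys` is a hypothetical audit helper, and the attribute names and the `[REDACTED]` marker are invented for the example.

```python
from __future__ import annotations

REDACTED = "[REDACTED]"  # assumed marker; the real processor may mask differently


def redact_attributes(
    attributes: dict[str, str], allowlist: frozenset[str]
) -> dict[str, str]:
    """Keep values only for allowlisted keys; mask everything else."""
    return {
        key: (value if key in allowlist else REDACTED)
        for key, value in sorted(attributes.items())
    }


def leaked_keys(
    attributes: dict[str, str], allowlist: frozenset[str]
) -> list[str]:
    """Audit helper: list keys outside the allowlist whose values are not masked."""
    return sorted(
        key
        for key, value in attributes.items()
        if key not in allowlist and value != REDACTED
    )


# Golden-style check: the expected output is written out in full and compared.
ALLOWLIST = frozenset({"http.method", "service.name"})
SAMPLE = {
    "http.method": "GET",
    "service.name": "policy",
    "user.email": "alice@example.com",
}
GOLDEN = {
    "http.method": "GET",
    "service.name": "policy",
    "user.email": REDACTED,
}

assert redact_attributes(SAMPLE, ALLOWLIST) == GOLDEN
assert leaked_keys(redact_attributes(SAMPLE, ALLOWLIST), ALLOWLIST) == []
```

Masking by default and allowing by exception means a newly introduced attribute is redacted until someone adds it to the allowlist, which is the safer failure mode for the PII leakage risk.
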
## Definition of done
- Collector profiles, storage backends, incident mode, dashboards, CLI, and offline kit delivered with telemetry and documentation.
- Runbooks and SOC handoff packages published; compliance checklists appended.
- `./TASKS.md` and `../../TASKS.md` updated; imposed rule statements confirmed in documentation.