feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules

- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
This commit is contained in:
2025-10-30 00:09:39 +02:00
parent 3154c67978
commit 7b5bdcf4d3
503 changed files with 16136 additions and 54638 deletions

View File

@@ -0,0 +1,64 @@
# Implementation plan — Telemetry
## Delivery phases
- **Phase 1 Collector & pipeline profiles**
Publish OpenTelemetry collector configs (`default`, `forensic`, `airgap`), establish ingest gateways, TLS/mTLS, and attribute redaction policies.
- **Phase 2 Storage backends & retention**
Deploy Prometheus/Tempo/Loki (or equivalents) with retention tiers, bucket/object storage, deterministic manifest generation, and sealed-mode allowlists.
- **Phase 3 Incident mode & forensic capture**
Implement incident toggles (CLI/API), tail sampling adjustments, forensic bundle generation (OTLP archives, manifest/signature), and Notify hooks.
- **Phase 4 Observability dashboards & automation**
Deliver dashboards (service SLOs, queue depth, policy latency), alert rules, Grafana packages, and CLI automation for deployment and capture.
- **Phase 5 Offline & compliance**
Ship Offline Kit artefacts (collectors, configs, dashboards, replay tooling), signed bundles, and documentation for air-gapped review workflows.
- **Phase 6 Hardening & SOC handoff**
Complete RBAC integration, audit logging, incident response runbooks, performance tuning, and integration tests across services.
## Work breakdown
- **Collector configs**
- Maintain config templates per profile with processors (redaction, batching, resource detection) and exporters.
- CLI automation (`stella telemetry deploy`, `stella telemetry profile diff`), validation tests, and config signing.
- **Storage & retention**
- Provision Prometheus/Tempo/Loki (or vendor equivalents) with retention tiers (default, forensic, airgap).
- Ensure determinism (chunk manifests, content hashing), remote-write allowlists, sealed/offline modes.
- Implement archivers for forensic bundles (metrics/traces/logs) with cosign signatures.
- **Incident mode**
- API/CLI to toggle incident sampling, retention escalation, Notify signals, and auto bundle capture.
- Hook into Orchestrator to respond to incidents and revert after cooldown.
- **Dashboards & alerts**
- Dashboard packages for core services (ingestion, policy, export, attestation).
- Alert rules for SLO burn, collector failure, exporter backlog, bundle generation errors.
- Self-observability metrics (`collector_export_failures_total`, `telemetry_incident_mode{}`).
- **Offline support**
- Offline Kit assets: collector binaries/configs, import scripts, dashboards, replay instructions, compliance checklists.
- File-based exporters and manual transfer workflows with signed manifests.
- **Docs & runbooks**
- Update observability overview, forensic capture guide, incident response checklist, sealed-mode instructions, RBAC matrix.
- SOC handoff package with control objectives and audit evidence.
## Acceptance criteria
- Collectors ingest metrics/logs/traces across deployments, applying redaction rules and tenant isolation; profiles validate via CI.
- Storage backends retain data per default/forensic/airgap SLAs with deterministic chunk manifests and sealed-mode compliance.
- Incident mode toggles sampling to 100%, extends retention, triggers Notify, and captures forensic bundles signed with cosign.
- Dashboards and alerts cover service SLOs, queue depth, policy latency, ingestion violations, and telemetry stack health.
- CLI commands (`stella telemetry deploy/capture/status`) automate config rollout, forensic capture, and verification.
- Offline bundles replay telemetry in sealed environments using provided scripts and manifests.
## Risks & mitigations
- **PII leakage:** strict redaction processors, policy-managed allowlists, audit tests.
- **Collector overload:** horizontal scaling, batching, circuit breakers, incident mode throttling.
- **Storage cost:** tiered retention, compression, pruning policies, offline archiving.
- **Air-gap drift:** offline kit refresh schedule, deterministic manifest verification.
- **Alert fatigue:** burn-rate alerts, deduping, SOC runbooks.
## Test strategy
- **Config lint/tests:** schema validation, unit tests for processors/exporters, golden configs.
- **Integration:** simulate service traces/logs/metrics, verify pipelines, incident toggles, bundle generation.
- **Performance:** load tests with peak ingestion, long retention windows, failover scenarios.
- **Security:** redaction verification, RBAC/tenant scoping, sealed-mode tests, signed config verification.
- **Offline:** capture bundles, transfer, replay, compliance attestation.
## Definition of done
- Collector profiles, storage backends, incident mode, dashboards, CLI, and offline kit delivered with telemetry and documentation.
- Runbooks and SOC handoff packages published; compliance checklists appended.
- ./TASKS.md and ../../TASKS.md updated; imposed rule statements confirmed in documentation.