feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules

- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
2025-10-30 00:09:39 +02:00
parent 3154c67978
commit 7b5bdcf4d3
503 changed files with 16136 additions and 54638 deletions
--- a/docs/modules/orchestrator/AGENTS.md
+++ b/docs/modules/orchestrator/AGENTS.md
@@ -0,0 +1,22 @@
+# Source & Job Orchestrator agent guide
+
+## Mission
+The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
+
+## Key docs
+- [Module README](./README.md)
+- [Architecture](./architecture.md)
+- [Implementation plan](./implementation_plan.md)
+- [Task board](./TASKS.md)
+
+## How to get started
+1. Read the design summaries in ./architecture.md (quota governance, job lifecycle, dashboard feeds).
+2. Open ../../implplan/SPRINTS.md and locate stories for this component.
+3. Check ./TASKS.md and update status before/after work.
+4. Review ./README.md for responsibilities and ensure changes maintain determinism and offline parity.
+
+## Guardrails
+- Uphold Aggregation-Only Contract boundaries when consuming ingestion data.
+- Preserve determinism and provenance in all derived outputs.
+- Document offline/air-gap pathways for any new feature.
+- Update telemetry/observability assets alongside feature work.
--- a/docs/modules/orchestrator/README.md
+++ b/docs/modules/orchestrator/README.md
@@ -0,0 +1,29 @@
+# StellaOps Source & Job Orchestrator
+
+The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
+
+## Responsibilities
+- Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
+- Expose dashboards and APIs for throttling, replays, and failover.
+- Enforce rate limits, concurrency caps, and dependency chains across queues.
+- Stream structured events and audit logs for incident response.
+
+## Key components
+- Orchestrator WebService (control plane).
+- Queue adapters (Redis/NATS) and job ledger.
+- Console dashboard module and CLI integration for operators.
+
+## Integrations & dependencies
+- Authority for authN/Z on operational actions.
+- Telemetry stack for job metrics and alerts.
+- Scheduler/Concelier/Excititor workers for job lifecycle.
+- Offline Kit for state export/import during air-gap refreshes.
+
+## Operational notes
+- Job recovery runbooks and dashboard JSON as described in Epic 9.
+- Audit retention policies for job history.
+- Rate-limit reconfiguration guidelines.
+
+## Epic alignment
+- Epic 9: Source & Job Orchestrator Dashboard.
+- ORCH stories in ../../TASKS.md.
--- a/docs/modules/orchestrator/TASKS.md
+++ b/docs/modules/orchestrator/TASKS.md
@@ -0,0 +1,9 @@
+# Task board — Source & Job Orchestrator
+
+> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.
+
+| ID | Status | Owner(s) | Description | Notes |
+|----|--------|----------|-------------|-------|
+| SOURCE---JOB-ORCHESTRATOR-DOCS-0001 | DOING (2025-10-29) | Docs Guild | Ensure ./README.md reflects the latest epic deliverables. | Align with ./AGENTS.md |
+| SOURCE---JOB-ORCHESTRATOR-ENG-0001 | TODO | Module Team | Break down epic milestones into actionable stories. | Sync into ../../TASKS.md |
+| SOURCE---JOB-ORCHESTRATOR-OPS-0001 | TODO | Ops Guild | Prepare runbooks/observability assets once MVP lands. | Document outputs in ./README.md |
--- a/docs/modules/orchestrator/architecture.md
+++ b/docs/modules/orchestrator/architecture.md
@@ -0,0 +1,52 @@
+# Source & Job Orchestrator architecture
+
+> Based on Epic 9 – Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.
+
+## 1) Topology
+
+- **Orchestrator API (`StellaOps.Orchestrator`).** Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (`orchestrator:*`).
+- **Job ledger (Mongo).** Collections `jobs`, `job_history`, `sources`, `quotas`, `throttles`, `incidents`. Append-only history ensures auditability.
+- **Queue abstraction.** Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
+- **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
+
+## 2) Job lifecycle
+
+1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
+2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in a pending queue with their next eligible timestamp.
+3. **Leasing.** Workers poll the `LeaseJob` endpoint; Orchestrator returns a job with `leaseId`, `leaseUntil`, and instrumentation tokens. Long-running tasks must renew their lease.
+4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies the retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds are exceeded.
+5. **Replay.** Operators trigger `POST /api/jobs/{id}/replay`, which clones the job payload, sets a `replayOf` pointer, and requeues the job with high priority while preserving determinism metadata.
+
+## 3) Rate-limit & quota governance
+
+- Quotas are defined per tenant/profile (`maxActive`, `maxPerHour`, `burst`), stored in `quotas`, and enforced before leasing.
+- Dynamic throttles allow ops to pause specific sources (`pauseSource`, `resumeSource`) or reduce concurrency.
+- Circuit breakers automatically pause job types when the failure rate exceeds the configured threshold; incidents are raised via the Notify and Observability stack.
+
+## 4) APIs
+
+- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
+- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
+- `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason.
+- `POST /api/jobs/{id}/replay` — schedule replay.
+- `POST /api/limits/throttle` — apply throttle (requires elevated scope).
+- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
+
+All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
+
+## 5) Observability
+
+- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`.
+- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops.
+- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.
+
+## 6) Offline support
+
+- Orchestrator exports audit bundles: `jobs.jsonl`, `history.jsonl`, `throttles.jsonl`, `manifest.json`, `signatures/`. Used for offline investigations and compliance.
+- Replay manifests contain job digests and success/failure notes for deterministic proof.
+
+## 7) Operational considerations
+
+- HA deployment with multiple API instances; queue storage determines redundancy strategy.
+- Support for `maintenance` mode halting leases while allowing status inspection.
+- Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).
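The retry policy named in step 4 of the job lifecycle in architecture.md above (exponential backoff, max attempts) could be sketched roughly as follows. This is a minimal illustration only: `RetryPolicy`, `next_retry_delay`, and every default value are hypothetical names and numbers chosen for the example, not taken from the Orchestrator codebase. Jitter is left out so that the schedule stays deterministic for a given attempt count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:
    # All field names and defaults are illustrative assumptions.
    base_delay_seconds: float = 5.0
    multiplier: float = 2.0
    max_delay_seconds: float = 300.0
    max_attempts: int = 5


def next_retry_delay(policy: RetryPolicy, failed_attempts: int) -> Optional[float]:
    """Return the delay before the next attempt, or None when retries are exhausted.

    failed_attempts counts attempts that have already failed
    (1 after the first failure).
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be >= 1")
    if failed_attempts >= policy.max_attempts:
        # The caller would mark the job as failed here and, if thresholds
        # are exceeded, raise an incident.
        return None
    delay = policy.base_delay_seconds * policy.multiplier ** (failed_attempts - 1)
    return min(delay, policy.max_delay_seconds)
```

A scheduler following this sketch would add the returned delay to the failure time to obtain the job's next eligible timestamp, and stop requeueing once `None` is returned.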
--- a/docs/modules/orchestrator/implementation_plan.md
+++ b/docs/modules/orchestrator/implementation_plan.md
@@ -0,0 +1,62 @@
+# Implementation plan — Source & Job Orchestrator
+
+## Delivery phases
+- **Phase 1 – Core service & job ledger**  
+  Implement source registry, run/job tables, queue abstraction, lease management, token-bucket rate limiting, watchdogs, and API primitives (`/sources`, `/runs`, `/jobs`).
+- **Phase 2 – Worker SDK & artifact registry**  
+  Embed worker SDK in Concelier, Excititor, SBOM, Policy Engine; capture artifact metadata + hashes, enforce idempotency, publish progress/metrics.
+- **Phase 3 – Observability & dashboard**  
+  Ship metrics, traces, incident logging, SSE/WebSocket feeds, and Console dashboard (DAG/timeline, heatmaps, error clustering, SLO burn rate).
+- **Phase 4 – Controls & resilience**  
+  Deliver pause/resume/throttle/retry/backfill tooling, dead-letter review, circuit breakers, blackouts, backpressure handling, and automation hooks.
+- **Phase 5 – Offline & compliance**  
+  Generate deterministic audit bundles (`jobs.jsonl`, `history.jsonl`, `throttles.jsonl`), provenance manifests, and offline replay scripts.
+
+## Work breakdown
+- **Service & persistence**
+  - Postgres schema (`sources`, `runs`, `jobs`, `artifacts`, `dag_edges`, `quotas`, `schedules`, `incidents`).
+  - Lease manager with heartbeats, retries, and dead-letter queues.
+  - Token-bucket rate limiter per `{tenant, source.host}`; adaptive refill on upstream throttles.
+  - Watermark/backfill orchestration for event-time windows.
+- **Worker SDK**
+  - Claim/heartbeat/report contract, deterministic artifact hashing, idempotency enforcement.
+  - Library release for .NET workers plus language bindings for Rust/Go ingestion agents.
+- **Control plane APIs**
+  - CRUD for sources, runs, jobs, quotas, schedules; control actions (retry, cancel, prioritize, pause/resume, backfill).
+  - SSE/WebSocket stream for Console updates.
+- **Observability**
+  - Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate.
+  - Error clustering (HTTP 429/5xx, schema mismatch, parse errors), incident logging with reason codes.
+  - Gantt timeline and DAG JSON for Console visualisation.
+- **Console & CLI**
+  - Console app sections: overview, sources, runs, job detail, incidents, throttles.
+  - CLI commands: `stella orchestrator sources|runs|jobs|throttle|backfill`.
+- **Compliance & offline**
+  - Immutable audit bundles with signatures; exports via Export Center; Offline Kit instructions.
+  - Tenant isolation validation and secret redaction for logs/UI.
+
+## Acceptance criteria
+- Orchestrator schedules all advisory/VEX/SBOM/policy jobs with quotas, rate limits, and idempotency; retries and replay preserve provenance.
+- Console dashboard reflects real-time DAG status, queue depth, SLO burn rate, and allows pause/resume/throttle/backfill with audit trail.
+- Worker SDK integrated across producer services, emitting progress and artifact metadata.
+- Observability stack exposes metrics, logs, traces, incidents, and alerts for stuck jobs, throttling, and failure spikes.
+- Offline audit bundles reproduce job history deterministically and verify signatures.
+
+## Risks & mitigations
+- **Backpressure/queue overload:** adaptive token buckets, circuit breakers, dynamic concurrency; degrade gracefully.
+- **Upstream vendor throttles:** throttle management with user-visible state, automatic jitter and retry.
+- **Tenant leakage:** enforce tenant filters at API/queue/storage, fuzz tests, redaction.
+- **Complex DAG errors:** built-in diagnostics, error clustering, partial replay tooling.
+- **Operator error:** confirmation prompts, RBAC, runbook guidance, reason codes logged.
+
+## Test strategy
+- **Unit:** scheduling, quota enforcement, lease renewals, token bucket, watermark arithmetic.
+- **Integration:** worker SDK with Concelier/Excititor/SBOM pipelines, pause/resume/backfill flows, failure recovery.
+- **Performance:** high-volume job workloads, queue backpressure, concurrency caps, dashboard SSE load tests.
+- **Chaos:** simulate upstream outages, stuck workers, clock skew, Postgres failover.
+- **Compliance:** audit bundle generation, signature verification, offline replay.
+
+## Definition of done
+- All phases delivered with telemetry, dashboards, and runbooks published.
+- Console + CLI parity validated; Offline Kit instructions complete.
+- ./TASKS.md and ../../TASKS.md updated with status; documentation (README/architecture/this plan) reflects latest behaviour.
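The token-bucket rate limiter per `{tenant, source.host}` listed in the work breakdown of implementation_plan.md above could look roughly like the sketch below. It is a hedged illustration under stated assumptions: the class names, the single shared capacity and refill rate, and the explicit `now` argument are all hypothetical and not taken from the plan. The clock is passed in rather than read from the system so that behaviour is reproducible in tests. The adaptive refill on upstream throttles mentioned in the plan is not covered.

```python
from typing import Dict, Tuple


class TokenBucket:
    """One bucket: holds up to `capacity` tokens, refilled at a fixed rate."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full
        self.updated_at = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """Keeps an independent bucket per (tenant, host) key (hypothetical helper)."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self._capacity = capacity
        self._refill_per_second = refill_per_second
        self._buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def try_acquire(self, tenant: str, host: str, now: float) -> bool:
        key = (tenant, host)
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill_per_second, now)
            self._buckets[key] = bucket
        return bucket.try_acquire(now)
```

Keying buckets by tenant and host means one tenant exhausting its budget against an upstream source does not affect another tenant's budget for the same source.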