feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
docs/modules/orchestrator/AGENTS.md
							| @@ -0,0 +1,22 @@ | ||||
| # Source & Job Orchestrator agent guide | ||||
|  | ||||
| ## Mission | ||||
| The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform. | ||||
|  | ||||
| ## Key docs | ||||
| - [Module README](./README.md) | ||||
| - [Architecture](./architecture.md) | ||||
| - [Implementation plan](./implementation_plan.md) | ||||
| - [Task board](./TASKS.md) | ||||
|  | ||||
| ## How to get started | ||||
| 1. Read the design summaries in ./architecture.md (quota governance, job lifecycle, dashboard feeds). | ||||
| 2. Open ../../implplan/SPRINTS.md and locate stories for this component. | ||||
| 3. Check ./TASKS.md and update status before/after work. | ||||
| 4. Review ./README.md for responsibilities and ensure changes maintain determinism and offline parity. | ||||
|  | ||||
| ## Guardrails | ||||
| - Uphold Aggregation-Only Contract boundaries when consuming ingestion data. | ||||
| - Preserve determinism and provenance in all derived outputs. | ||||
| - Document offline/air-gap pathways for any new feature. | ||||
| - Update telemetry/observability assets alongside feature work. | ||||
							
								
								
									
docs/modules/orchestrator/README.md
							| @@ -0,0 +1,29 @@ | ||||
| # StellaOps Source & Job Orchestrator | ||||
|  | ||||
| The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform. | ||||
|  | ||||
| ## Responsibilities | ||||
| - Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines. | ||||
| - Expose dashboards and APIs for throttling, replays, and failover. | ||||
| - Enforce rate limits, concurrency caps, and dependency chains across queues. | ||||
| - Stream structured events and audit logs for incident response. | ||||
|  | ||||
| ## Key components | ||||
| - Orchestrator WebService (control plane). | ||||
| - Queue adapters (Redis/NATS) and job ledger. | ||||
| - Console dashboard module and CLI integration for operators. | ||||
|  | ||||
| ## Integrations & dependencies | ||||
| - Authority for authN/Z on operational actions. | ||||
| - Telemetry stack for job metrics and alerts. | ||||
| - Scheduler/Concelier/Excititor workers for job lifecycle. | ||||
| - Offline Kit for state export/import during air-gap refreshes. | ||||
|  | ||||
| ## Operational notes | ||||
| - Job recovery runbooks and dashboard JSON as described in Epic 9. | ||||
| - Audit retention policies for job history. | ||||
| - Rate-limit reconfiguration guidelines. | ||||
|  | ||||
| ## Epic alignment | ||||
| - Epic 9: Source & Job Orchestrator Dashboard. | ||||
| - ORCH stories in ../../TASKS.md. | ||||
							
								
								
									
docs/modules/orchestrator/TASKS.md
							| @@ -0,0 +1,9 @@ | ||||
| # Task board — Source & Job Orchestrator | ||||
|  | ||||
| > Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable. | ||||
|  | ||||
| | ID | Status | Owner(s) | Description | Notes | | ||||
| |----|--------|----------|-------------|-------| | ||||
| | SOURCE---JOB-ORCHESTRATOR-DOCS-0001 | DOING (2025-10-29) | Docs Guild | Ensure ./README.md reflects the latest epic deliverables. | Align with ./AGENTS.md | | ||||
| | SOURCE---JOB-ORCHESTRATOR-ENG-0001 | TODO | Module Team | Break down epic milestones into actionable stories. | Sync into ../../TASKS.md | | ||||
| | SOURCE---JOB-ORCHESTRATOR-OPS-0001 | TODO | Ops Guild | Prepare runbooks/observability assets once MVP lands. | Document outputs in ./README.md | | ||||
							
								
								
									
docs/modules/orchestrator/architecture.md
							| @@ -0,0 +1,52 @@ | ||||
| # Source & Job Orchestrator architecture | ||||
|  | ||||
| > Based on Epic 9 – Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability. | ||||
|  | ||||
| ## 1) Topology | ||||
|  | ||||
| - **Orchestrator API (`StellaOps.Orchestrator`).** Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (`orchestrator:*`). | ||||
| - **Job ledger (Mongo).** Collections `jobs`, `job_history`, `sources`, `quotas`, `throttles`, `incidents`. Append-only history ensures auditability. | ||||
| - **Queue abstraction.** Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy. | ||||
| - **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status. | ||||
|  | ||||
| ## 2) Job lifecycle | ||||
|  | ||||
| 1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`. | ||||
| 2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in a pending queue with their next eligible timestamp. | ||||
| 3. **Leasing.** Workers poll the `LeaseJob` endpoint; Orchestrator returns a job with `leaseId`, `leaseUntil`, and instrumentation tokens. Lease renewal is required for long-running tasks. | ||||
| 4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies the retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds are exceeded. | ||||
| 5. **Replay.** Operators trigger `POST /api/jobs/{id}/replay`, which clones the job payload, sets the `replayOf` pointer, and requeues the job with high priority while preserving determinism metadata. | ||||
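
The completion step above can be sketched as a small state transition with exponential backoff. This is an illustrative sketch only, not the Orchestrator's implementation: the `Job` class, the `dead_letter` state, the choice not to retry `canceled` jobs, and the backoff parameters (`base`, `cap`, `max_attempts`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Statuses a worker may report, taken from the Completion step.
WORKER_STATUSES = {"succeeded", "failed", "canceled", "timed_out"}


@dataclass
class Job:
    # Field names mirror the JobRequest description; the class is hypothetical.
    job_type: str
    tenant: str
    priority: int
    payload_digest: str
    dependencies: List[str] = field(default_factory=list)
    attempts: int = 0
    state: str = "pending"
    replay_of: Optional[str] = None


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff: base * 2^(attempt - 1), capped (assumed parameters)."""
    return min(cap, base * (2 ** (attempt - 1)))


def complete(job: Job, status: str, max_attempts: int = 3) -> Optional[float]:
    """Apply a worker-reported status; return the retry delay, or None if no retry."""
    if status not in WORKER_STATUSES:
        raise ValueError(f"unknown status: {status}")
    job.attempts += 1
    if status == "succeeded":
        job.state = "archived"
        return None
    if status == "canceled":
        # Assumption: an explicit cancel is never retried.
        job.state = "canceled"
        return None
    if job.attempts >= max_attempts:
        # Assumption: exhausted jobs are parked for operator review.
        job.state = "dead_letter"
        return None
    job.state = "pending"
    return backoff_seconds(job.attempts)
```

With the assumed defaults, a job that fails twice is rescheduled after 2 and then 4 seconds, and a third failure parks it instead of retrying.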
|  | ||||
| ## 3) Rate-limit & quota governance | ||||
|  | ||||
| - Quotas defined per tenant/profile (`maxActive`, `maxPerHour`, `burst`). Stored in `quotas` and enforced before leasing. | ||||
| - Dynamic throttles allow ops to pause specific sources (`pauseSource`, `resumeSource`) or reduce concurrency. | ||||
| - Circuit breakers automatically pause job types when the failure rate exceeds a configured threshold; incidents are generated via Notify and the Observability stack. | ||||
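
A failure-rate circuit breaker of the kind described in the last bullet might look like the following. It is a minimal sketch under stated assumptions: the class name, the sliding window of recent outcomes, and the `min_samples` guard are hypothetical and are not taken from the Orchestrator design.

```python
from collections import deque


class FailureRateBreaker:
    """Pauses a job type when the recent failure rate exceeds a threshold."""

    def __init__(self, threshold: float = 0.5, window: int = 20, min_samples: int = 5):
        self.threshold = threshold
        self.min_samples = min_samples
        # Sliding window of recent outcomes; True marks a failure.
        self.outcomes = deque(maxlen=window)
        self.paused = False

    def record(self, failed: bool) -> bool:
        """Record one job outcome and return whether the job type is paused."""
        self.outcomes.append(failed)
        if len(self.outcomes) >= self.min_samples:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate > self.threshold:
                self.paused = True
        return self.paused

    def resume(self) -> None:
        """Operator-triggered resume: clear history and reopen the job type."""
        self.outcomes.clear()
        self.paused = False
```

The `min_samples` guard keeps a single early failure from pausing a job type before enough outcomes have been observed.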
|  | ||||
| ## 4) APIs | ||||
|  | ||||
| - `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window). | ||||
| - `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics). | ||||
| - `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason. | ||||
| - `POST /api/jobs/{id}/replay` — schedule replay. | ||||
| - `POST /api/limits/throttle` — apply throttle (requires elevated scope). | ||||
| - `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards. | ||||
|  | ||||
| All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation. | ||||
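
As an illustration of the list endpoint's filters, a client could build the request URL as below. This is a sketch only: the host name and the query parameter names other than `status` (for example `tenant` and `jobType`) are assumptions, since the section names the filters but not their exact parameter spelling.

```python
from typing import Optional
from urllib.parse import urlencode


def list_jobs_url(base_url: str, **filters: Optional[str]) -> str:
    """Build a GET /api/jobs URL from optional filters (parameter names assumed)."""
    # Drop unset filters and sort keys so identical filters yield identical URLs.
    params = sorted((k, v) for k, v in filters.items() if v is not None)
    query = urlencode(params)
    path = f"{base_url.rstrip('/')}/api/jobs"
    return f"{path}?{query}" if query else path


# Example with a placeholder host:
url = list_jobs_url("https://orchestrator.example.internal/", status="failed", tenant="tenant-a")
```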
|  | ||||
| ## 5) Observability | ||||
|  | ||||
| - Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`. | ||||
| - Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops. | ||||
| - Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation. | ||||
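
A structured log record carrying the fields listed above could be emitted as one JSON line. The helper name and the JSON encoding are assumptions for illustration; the section specifies the field names but not the log format.

```python
import json

# Field names taken from the Logs bullet above.
LOG_FIELDS = ("jobId", "jobType", "tenant", "workerId", "leaseId", "status")


def job_log_line(**fields: str) -> str:
    """Render one job event as a compact JSON line with a stable key order."""
    missing = [name for name in LOG_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing log fields: {missing}")
    record = {name: fields[name] for name in LOG_FIELDS}
    return json.dumps(record, sort_keys=True, separators=(",", ":"))
```

Sorting keys keeps log lines byte-for-byte comparable across runs, which fits the determinism guardrail.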
|  | ||||
| ## 6) Offline support | ||||
|  | ||||
| - Orchestrator exports audit bundles: `jobs.jsonl`, `history.jsonl`, `throttles.jsonl`, `manifest.json`, `signatures/`. Used for offline investigations and compliance. | ||||
| - Replay manifests contain job digests and success/failure notes for deterministic proof. | ||||
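
One way to make such a bundle reproducible is to serialize records canonically and record a digest per file in the manifest. The sketch below assumes a manifest layout (`files` with `name` and `sha256`) that is not specified in this document, and it omits the `signatures/` step entirely.

```python
import hashlib
import json
from typing import Dict, List


def jsonl_bytes(records: List[dict]) -> bytes:
    """Serialize records as canonical JSON lines (sorted keys, compact separators)."""
    lines = [json.dumps(r, sort_keys=True, separators=(",", ":")) for r in records]
    return ("\n".join(lines) + "\n").encode("utf-8") if lines else b""


def build_manifest(files: Dict[str, bytes]) -> dict:
    """Map each bundle file name to its SHA-256 digest, in sorted name order."""
    return {
        "files": [
            {"name": name, "sha256": hashlib.sha256(files[name]).hexdigest()}
            for name in sorted(files)
        ]
    }
```

Because key order and file order are fixed, two exports of the same job history produce identical bytes and therefore identical digests.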
|  | ||||
| ## 7) Operational considerations | ||||
|  | ||||
| - HA deployment with multiple API instances; queue storage determines redundancy strategy. | ||||
| - Support for `maintenance` mode halting leases while allowing status inspection. | ||||
| - Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume). | ||||
							
								
								
									
docs/modules/orchestrator/implementation_plan.md
							| @@ -0,0 +1,62 @@ | ||||
| # Implementation plan — Source & Job Orchestrator | ||||
|  | ||||
| ## Delivery phases | ||||
| - **Phase 1 – Core service & job ledger**   | ||||
|   Implement source registry, run/job tables, queue abstraction, lease management, token-bucket rate limiting, watchdogs, and API primitives (`/sources`, `/runs`, `/jobs`). | ||||
| - **Phase 2 – Worker SDK & artifact registry**   | ||||
|   Embed worker SDK in Concelier, Excititor, SBOM, Policy Engine; capture artifact metadata + hashes, enforce idempotency, publish progress/metrics. | ||||
| - **Phase 3 – Observability & dashboard**   | ||||
|   Ship metrics, traces, incident logging, SSE/WebSocket feeds, and Console dashboard (DAG/timeline, heatmaps, error clustering, SLO burn rate). | ||||
| - **Phase 4 – Controls & resilience**   | ||||
|   Deliver pause/resume/throttle/retry/backfill tooling, dead-letter review, circuit breakers, blackouts, backpressure handling, and automation hooks. | ||||
| - **Phase 5 – Offline & compliance**   | ||||
|   Generate deterministic audit bundles (`jobs.jsonl`, `history.jsonl`, `throttles.jsonl`), provenance manifests, and offline replay scripts. | ||||
|  | ||||
| ## Work breakdown | ||||
| - **Service & persistence** | ||||
|   - Postgres schema (`sources`, `runs`, `jobs`, `artifacts`, `dag_edges`, `quotas`, `schedules`, `incidents`). | ||||
|   - Lease manager with heartbeats, retries, and dead-letter queues. | ||||
|   - Token-bucket rate limiter per `{tenant, source.host}`; adaptive refill on upstream throttles. | ||||
|   - Watermark/backfill orchestration for event-time windows. | ||||
| - **Worker SDK** | ||||
|   - Claim/heartbeat/report contract, deterministic artifact hashing, idempotency enforcement. | ||||
|   - Library release for .NET workers plus language bindings for Rust/Go ingestion agents. | ||||
| - **Control plane APIs** | ||||
|   - CRUD for sources, runs, jobs, quotas, schedules; control actions (retry, cancel, prioritize, pause/resume, backfill). | ||||
|   - SSE/WebSocket stream for Console updates. | ||||
| - **Observability** | ||||
|   - Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate. | ||||
|   - Error clustering (HTTP 429/5xx, schema mismatch, parse errors), incident logging with reason codes. | ||||
|   - Gantt timeline and DAG JSON for Console visualisation. | ||||
| - **Console & CLI** | ||||
|   - Console app sections: overview, sources, runs, job detail, incidents, throttles. | ||||
|   - CLI commands: `stella orchestrator sources|runs|jobs|throttle|backfill`. | ||||
| - **Compliance & offline** | ||||
|   - Immutable audit bundles with signatures; exports via Export Center; Offline Kit instructions. | ||||
|   - Tenant isolation validation and secret redaction for logs/UI. | ||||
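
The token-bucket limiter keyed by `{tenant, source.host}` from the work breakdown can be sketched as follows. This is a minimal illustration, not the planned implementation: class and method names are hypothetical, the clock is passed in explicitly to keep the example deterministic, and the adaptive refill on upstream throttles is not modelled.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TokenBucket:
    capacity: float
    refill_per_second: float
    tokens: float
    updated_at: float

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """One bucket per (tenant, host) pair, created on first use with a full bucket."""

    def __init__(self, capacity: float, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, tenant: str, host: str, now: float) -> bool:
        key = (tenant, host)
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_second, self.capacity, now)
            self.buckets[key] = bucket
        return bucket.try_acquire(now)
```

Keying buckets by tenant and host means one tenant exhausting its budget against an upstream source does not block another tenant using the same source.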
|  | ||||
| ## Acceptance criteria | ||||
| - Orchestrator schedules all advisory/VEX/SBOM/policy jobs with quotas, rate limits, and idempotency; retries and replay preserve provenance. | ||||
| - Console dashboard reflects real-time DAG status, queue depth, SLO burn rate, and allows pause/resume/throttle/backfill with audit trail. | ||||
| - Worker SDK integrated across producer services, emitting progress and artifact metadata. | ||||
| - Observability stack exposes metrics, logs, traces, incidents, and alerts for stuck jobs, throttling, and failure spikes. | ||||
| - Offline audit bundles reproduce job history deterministically and verify signatures. | ||||
|  | ||||
| ## Risks & mitigations | ||||
| - **Backpressure/queue overload:** adaptive token buckets, circuit breakers, dynamic concurrency; degrade gracefully. | ||||
| - **Upstream vendor throttles:** throttle management with user-visible state, automatic jitter and retry. | ||||
| - **Tenant leakage:** enforce tenant filters at API/queue/storage, fuzz tests, redaction. | ||||
| - **Complex DAG errors:** built-in diagnostics, error clustering, partial replay tooling. | ||||
| - **Operator error:** confirmation prompts, RBAC, runbook guidance, reason codes logged. | ||||
|  | ||||
| ## Test strategy | ||||
| - **Unit:** scheduling, quota enforcement, lease renewals, token bucket, watermark arithmetic. | ||||
| - **Integration:** worker SDK with Concelier/Excititor/SBOM pipelines, pause/resume/backfill flows, failure recovery. | ||||
| - **Performance:** high-volume job workloads, queue backpressure, concurrency caps, dashboard SSE load tests. | ||||
| - **Chaos:** simulate upstream outages, stuck workers, clock skew, Postgres failover. | ||||
| - **Compliance:** audit bundle generation, signature verification, offline replay. | ||||
|  | ||||
| ## Definition of done | ||||
| - All phases delivered with telemetry, dashboards, and runbooks published. | ||||
| - Console + CLI parity validated; Offline Kit instructions complete. | ||||
| - ./TASKS.md and ../../TASKS.md updated with status; documentation (README/architecture/this plan) reflects latest behaviour. | ||||