feat: Add new projects to solution and implement contract testing documentation

- Added "StellaOps.Policy.Engine", "StellaOps.Cartographer", and "StellaOps.SbomService" projects to the StellaOps solution.
- Created AGENTS.md to outline the Contract Testing Guild Charter, detailing mission, scope, and definition of done.
- Established TASKS.md for the Contract Testing Task Board, outlining tasks for Sprint 62 and Sprint 63 related to mock servers and replay testing.
commit 651b8e0fa3 (parent 1e41ba7ffa)
2025-10-27 07:57:55 +02:00
355 changed files with 17276 additions and 1160 deletions

EPIC_9.md (new file, 523 lines)

@@ -0,0 +1,523 @@
Below is the “maximum documentation” bundle for Epic 9. Paste it into your repo and pretend the ingestion chaos was always under control.
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
---
# Epic 9: Source & Job Orchestrator Dashboard
**Short name:** `Orchestrator Dashboard`
**Primary service:** `orchestrator` (scheduler, queues, rate limits, job state)
**Surfaces:** Console (Web UI), CLI, Web API
**Touches:** Conseiller (Feedser), Excitator (Vexer), VEX Consensus Lens, SBOM Service, Policy Engine, Findings Ledger, Authority (authN/Z), Telemetry/Analytics, Object Storage, Relational DB, Message Bus
**AOC ground rule:** Conseiller and Excitator aggregate but never merge. The orchestrator schedules, tracks and recovers jobs; it does not transform evidence beyond transport and storage. No “smart” merging in flight.
---
## 1) What it is
The Source & Job Orchestrator Dashboard is the control surface for every data source and pipeline run across StellaOps. It gives operators:
* Live health of all advisory/VEX/SBOM sources and derived jobs.
* End-to-end pipeline visibility as DAGs and timelines.
* Controls for pausing, backfilling, replaying, throttling and retrying.
* Error pattern analysis, rate-limit observability and backpressure insights.
* Provenance and audit trails from initial fetch through parse, normalize, index and policy evaluation.
The dashboard sits over the `orchestrator` service, which maintains job state, schedules runs, enforces quotas and rate limits, and collects metrics from worker pools embedded in Conseiller, Excitator, SBOM and related services.
---
## 2) Why (brief)
Ingestion breaks quietly and then loudly. Without a unified control plane, you learn about it from angry users or empty indexes. This dashboard shortens incident MTTR, enables safe backfills, and makes compliance reviewers stop sending emails with twelve attachments and one emoji.
---
## 3) How it should work (maximum detail)
### 3.1 Capabilities
* **Source registry**
* Register, tag and version connectors (OSV, GHSA, CSAF endpoints, vendor PDF scrapers, distro feeds, RSS, S3 drops, internal registries).
* Store connection details, secrets (via KMS), rate-limit policy, schedules, and ownership metadata.
* Validate and “test connection” safely.
* **Job orchestration**
* Create DAGs composed of job types: `fetch`, `parse`, `normalize`, `dedupe`, `index`, `consensus_compute`, `policy_eval`, `crosslink`, `sbom_ingest`, `sbom_index`.
* Priorities, queues, concurrency caps, exponential backoff, circuit breakers.
* Idempotency keys and output artifact hashing to avoid duplicate work (a key-derivation sketch follows this list).
* Event-time watermarks for backfills without double counting.
* **Observability & control**
* Gantt timeline and real-time DAG view with critical path highlighting.
* Backpressure and queue depth heatmaps.
* Error clustering by class (HTTP 429, TLS, schema mismatch, parse failure, upstream 5xx).
* Per-source SLOs and SLA budgets with burn-rate alerts.
* One-click actions: retry, replay range, pause/resume, throttle/unthrottle, reroute to canary workers.
* **Provenance & audit**
* Immutable run ledger linking input artifact → every job → output artifact.
* Schema version tracking and drift detection.
* Operator actions recorded with reason and ticket reference.
* **Safety**
* Secret redaction everywhere.
* Tenant isolation at API, queue and storage layers.
* AOC: no in-flight merges of advisory or VEX content.
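To make the idempotency guarantee concrete, here is a minimal sketch of how a key might be derived. The field names mirror the data model in §3.3, but the derivation itself is illustrative, not the shipped implementation:
```
import hashlib
import json

def idempotency_key(source_id: str, job_type: str, window_start: str,
                    window_end: str, schema_ver: str) -> str:
    # Identical inputs always yield the same key, so a re-planned job
    # maps to the same unit of work and duplicates can be skipped.
    payload = json.dumps(
        {"source": source_id, "type": job_type,
         "window": [window_start, window_end], "schema": schema_ver},
        sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```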
### 3.2 Core architecture
* **orchestrator (service)**
* Maintains job state in Postgres (`sources`, `runs`, `jobs`, `artifacts`, `dag_edges`, `quotas`, `schedules`).
* Publishes work to a message bus (e.g., `topic.jobs.ready.<queue>`).
* Distributed token-bucket rate limiter per source/tenant/host.
* Watchdog for stuck jobs and circuit breakers for flapping sources.
* Watermark manager for backfills (event-time windows).
* **worker SDK**
* Lightweight library embedded in Conseiller/Excitator/SBOM workers to:
* Claim work, heartbeat, update progress, report metrics.
* Emit artifact metadata and checksums.
* Enforce idempotency via an orchestrator-supplied key. (A claim/heartbeat usage sketch follows this architecture list.)
* **object store**
* Raw payloads and intermediate artifacts organized by schema and hash:
* `advisory/raw/<source_id>/<event_time>/<sha256>.json|pdf`
* `advisory/normalized/<schema_ver>/<hash>.json`
* `vex/raw|normalized/...`
* `sbom/raw|graph/...`
* **web API**
* CRUD for sources, runs, jobs, schedules, quotas.
* Control actions (retry, cancel, pause, backfill).
* Streaming updates via WebSocket/SSE for the Console.
* **console**
* React app consuming Orchestrator APIs, rendering DAGs, timelines, health charts and action panels with RBAC.
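The worker-SDK contract above reduces to a small claim/heartbeat/complete loop. The sketch below assumes a hypothetical client object exposing `claim`, `heartbeat`, `complete` and `fail`; the real SDK surface may differ:
```
import threading
import time

def _heartbeat(client, job_id, stop, interval=10.0):
    # Keep the orchestrator watchdog satisfied while the job runs.
    while not stop.wait(interval):
        client.heartbeat(job_id)

def worker_loop(client, queue, handler):
    # `client` is an assumed SDK object with claim/heartbeat/complete/fail.
    while True:
        job = client.claim(queue)
        if job is None:
            time.sleep(1.0)                       # nothing ready; poll again
            continue
        stop = threading.Event()
        threading.Thread(target=_heartbeat,
                         args=(client, job.id, stop), daemon=True).start()
        try:
            artifact_hash = handler(job)          # fetch/parse/normalize, etc.
            client.complete(job.id, artifact_hash)
        except Exception as exc:
            client.fail(job.id, type(exc).__name__, str(exc))
        finally:
            stop.set()                            # stop heartbeating
```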
### 3.3 Data model (selected tables)
* `sources`
* `id`, `kind` (`advisory|vex|sbom|internal`), `subtype` (e.g., `osv`, `ghsa`, `csaf`, `vendor_pdf`), `display_name`, `owner_team`, `schedule_cron`, `rate_policy`, `enabled`, `secrets_ref`, `tags`, `schema_hint`, `created_at`, `updated_at`.
* `runs`
* `id`, `source_id`, `trigger` (`schedule|manual|event|backfill`), `window_start`, `window_end`, `state`, `started_at`, `finished_at`, `stats_json`.
* `jobs`
`id`, `run_id`, `type`, `queue`, `priority`, `state` (`pending|running|succeeded|failed|canceled|deadletter`), `attempt`, `max_attempt`, `idempotency_key`, `input_artifact_id`, `output_artifact_id`, `worker_id`, `created_at`, `started_at`, `finished_at`, `error_class`, `error_message`, `metrics_json`. (Legal state transitions are sketched after this list.)
* `dag_edges`
* `from_job_id`, `to_job_id`, `edge_kind` (`success_only|always`).
* `artifacts`
* `id`, `kind` (`raw|normalized|index|consensus`), `schema_ver`, `hash`, `uri`, `bytes`, `meta_json`, `created_at`.
* `quotas`
* `tenant_id`, `resource` (`requests_per_min`, `concurrent_jobs`), `limit`, `window_sec`.
* `schedules`
Per-source cron plus jitter, timezone, blackout windows.
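As a sketch of the state machine implied by `jobs.state`, the transition map below is what the orchestrator would consult before accepting an update; the exact set of legal moves is an assumption drawn from the lifecycle in §3.4:
```
TRANSITIONS = {
    "pending":    {"running", "canceled"},
    "running":    {"succeeded", "failed", "canceled"},
    "failed":     {"pending", "deadletter"},   # retry or give up
    "succeeded":  set(),                       # terminal
    "canceled":   set(),                       # terminal
    "deadletter": {"pending"},                 # manual replay
}

def assert_transition(current, target):
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal job transition: {current} -> {target}")
```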
### 3.4 Job lifecycle
1. **Plan**
Scheduler creates a `run` for a source and plans a DAG: e.g., `fetch → parse → normalize → dedupe → index → policy_eval` (advisory) or `fetch → parse → normalize → consensus_compute` (VEX).
2. **Enqueue**
Ready nodes become `jobs` with queue, priority, idempotency key and optional rate-limit tokens reserved.
3. **Execute**
Worker claims job, heartbeats every N seconds. Output artifacts are stored and linked. Failures are classified and retried with exponential backoff and jitter (sketched after this list), up to `max_attempt`.
4. **Complete**
Downstream nodes unblock. On run completion, orchestrator computes SLO deltas and emits run summary.
5. **Dead-letter**
Jobs exceeding attempts move to a DLQ with structured context and suggested remediation.
### 3.5 Scheduling, backpressure, rate limits
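The retry delay in step 3 is standard exponential backoff with full jitter; a minimal version (the defaults are illustrative, not tuned values):
```
import random

def backoff_delay(attempt, base=2.0, cap=300.0):
    # attempt starts at 1; full jitter spreads retries across the window.
    return random.uniform(0.0, min(cap, base * 2 ** (attempt - 1)))
```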
* **Token bucket** per `{tenant, source.host}` with adaptive refill when upstream 429/503 responses are seen (sketched after this list).
* **Concurrency caps** per source and per job type to avoid thundering herd.
* **Backpressure signals** from queue depth, worker CPU, and upstream error rates; scheduler reduces in-flight issuance accordingly.
* **Backfills** use event-time windows with immutable watermarks to avoid reprocessing.
* **Blackout windows** for vendor maintenance periods.
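A single-process sketch of the adaptive token bucket described above; a production limiter would need locking and distributed state for the per-`{tenant, host}` buckets, both omitted here:
```
import time

class AdaptiveTokenBucket:
    def __init__(self, rate, capacity):
        self.base_rate = rate        # tokens/sec when upstream is healthy
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self, n=1.0):
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def on_upstream_throttle(self):
        # Saw a 429/503: halve the refill rate, floored at 10% of base.
        self.rate = max(self.base_rate * 0.1, self.rate * 0.5)

    def on_upstream_ok(self):
        # Healthy response: creep back toward the configured base rate.
        self.rate = min(self.base_rate, self.rate * 1.1)
```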
### 3.6 APIs
```
POST /orchestrator/sources
GET /orchestrator/sources?kind=&tag=&q=
GET /orchestrator/sources/{id}
PATCH /orchestrator/sources/{id}
POST /orchestrator/sources/{id}/actions:test|pause|resume|sync-now
POST /orchestrator/sources/{id}/backfill { "from":"2024-01-01", "to":"2024-03-01" }
GET /orchestrator/runs?source_id=&state=&from=&to=
GET /orchestrator/runs/{run_id}
GET /orchestrator/runs/{run_id}/dag
POST /orchestrator/runs/{run_id}/cancel
GET /orchestrator/jobs?state=&type=&queue=&source_id=
GET /orchestrator/jobs/{job_id}
POST /orchestrator/jobs/{job_id}/actions:retry|cancel|prioritize
GET /orchestrator/metrics/overview
GET /orchestrator/errors/top?window=1h
GET /orchestrator/quotas
PATCH /orchestrator/quotas/{tenant_id}
WS /orchestrator/streams/updates
```
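As a usage sketch, launching a backfill through the API above might look like this; the host, token, and source id are placeholders:
```
import requests

BASE = "https://stella.example.internal/orchestrator"  # placeholder host
HEADERS = {"Authorization": "Bearer <token>"}          # placeholder auth

# Backfill Q1 2024 for a source, mirroring the endpoint shape above.
resp = requests.post(f"{BASE}/sources/src-123/backfill",
                     json={"from": "2024-01-01", "to": "2024-03-01"},
                     headers=HEADERS, timeout=30)
resp.raise_for_status()
print(resp.json())
```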
### 3.7 Console (Web UI)
* **Overview**
* KPI tiles: sources healthy, runs in progress, queue depth, error rate, burn rate against SLO.
* Heatmap of source health by last 24h success ratio.
* **Sources**
* Grid with filters, inline status (active, paused, throttled), next-run ETA, last error class.
* Detail panel: config, secrets status (redacted), schedule, rate limits, ownership, run history, action buttons.
* **Runs**
* Timeline (Gantt) with critical path, duration distribution, and per-stage breakdown.
* Run detail: DAG view with node metrics, artifacts, logs, action menu (cancel).
* **Jobs**
* Live table with state filters and “tail” view.
* Job detail: payload preview (redacted), worker, attempts, stack traces, linked artifacts.
* **Errors**
* Clusters by class and signature, suggested remediations (pause source, lower concurrency, patch parser).
* **Queues & Backpressure**
* Per-queue depth, service rate, in-flight count, age percentiles.
* Rate-limit token graphs per source host.
* **Controls**
* Backfill wizard with event-time preview and safety checks.
* Canary routing: route 5% of next 100 runs to a new worker pool.
* **A11y**
* Keyboard nav, ARIA roles for DAG nodes, live regions for updates, color-blind-friendly graphs.
### 3.8 CLI
```
stella orch sources list --kind advisory --tag prod
stella orch sources add --file source.yaml
stella orch sources test <source-id>
stella orch sources pause <source-id> # or resume
stella orch sources sync-now <source-id>
stella orch sources backfill <source-id> --from 2024-01-01 --to 2024-03-01
stella orch runs list --source <id> --state running
stella orch runs show <run-id> --dag
stella orch runs cancel <run-id>
stella orch jobs list --state failed --type parse --limit 100
stella orch jobs retry <job-id>
stella orch jobs cancel <job-id>
stella orch jobs tail --queue normalize --follow
stella orch quotas get --tenant default
stella orch quotas set --tenant default --concurrent-jobs 50 --rpm 1200
```
Exit codes: `0` success, `2` invalid args, `4` not found, `5` denied, `7` precondition failed, `8` rate-limited.
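The exit codes make the CLI scriptable; for example, a wrapper might retry only on the rate-limited code (a sketch; the source id is a placeholder):
```
import subprocess
import sys
import time

RATE_LIMITED = 8   # from the exit-code table above

def run_with_retry(args, attempts=3):
    for attempt in range(1, attempts + 1):
        code = subprocess.run(["stella", "orch", *args]).returncode
        if code != RATE_LIMITED:
            return code
        time.sleep(2 ** attempt)   # back off before trying again
    return RATE_LIMITED

sys.exit(run_with_retry(["sources", "sync-now", "src-123"]))
```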
### 3.9 RBAC & security
* **Roles** (an enforcement sketch follows this list)
* `Orch.Viewer`: read-only sources/runs/jobs/metrics.
* `Orch.Operator`: perform actions on sources and jobs, launch backfills.
* `Orch.Admin`: manage quotas, schedules, connector versions, and delete sources.
* **Secrets**
* Stored only as references to your KMS; never persisted in cleartext.
* Console shows redact badges and last rotated timestamp.
* **Tenancy**
* Source, run, job rows scoped by tenant id.
* Queue names and token buckets namespaced per tenant.
* **Compliance**
* Full audit log for every operator action with “reason” and optional ticket link.
* Exportable run ledger for audits.
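A sketch of how the role check might guard a handler; the decorator and the `request.principal` shape are assumptions, and tenant scoping would be enforced separately at the data layer:
```
from functools import wraps

ROLE_RANK = {"Orch.Viewer": 0, "Orch.Operator": 1, "Orch.Admin": 2}

def requires(role):
    def decorator(fn):
        @wraps(fn)
        def wrapper(request, *args, **kwargs):
            # request.principal is an assumed auth context from Authority.
            if ROLE_RANK[request.principal.role] < ROLE_RANK[role]:
                raise PermissionError(f"{role} required")
            return fn(request, *args, **kwargs)
        return wrapper
    return decorator

@requires("Orch.Admin")
def patch_quota(request, tenant_id, body):
    ...  # only admins may change quotas, per the role table above
```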
### 3.10 Observability
* **Metrics (examples)** (an emission sketch follows this list)
* `orch_jobs_inflight{type,queue}`
* `orch_jobs_latency_ms{type,percentile}`
* `orch_rate_tokens_available{source}`
* `orch_error_rate{source,error_class}`
* `orch_slo_burn_rate{source,slo}`
* `orch_deadletter_total{source,type}`
* **Traces**
* Span per job with baggage: `run_id`, `source_id`, `artifact_id`.
* Links across services to Conseiller/Excitator/SBOM workers.
* **Logs**
* Structured JSON with correlation ids, attempt numbers and redacted payload previews.
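Emitting two of the example metric series with the standard `prometheus_client` library; the scrape port is an arbitrary choice:
```
from prometheus_client import Counter, Gauge, start_http_server

JOBS_INFLIGHT = Gauge("orch_jobs_inflight",
                      "Jobs currently executing", ["type", "queue"])
DEADLETTER_TOTAL = Counter("orch_deadletter_total",
                           "Jobs moved to the DLQ", ["source", "type"])

start_http_server(9108)   # expose /metrics for scraping

def on_job_start(job_type, queue):
    JOBS_INFLIGHT.labels(type=job_type, queue=queue).inc()

def on_job_end(job_type, queue):
    JOBS_INFLIGHT.labels(type=job_type, queue=queue).dec()

def on_deadletter(source, job_type):
    DEADLETTER_TOTAL.labels(source=source, type=job_type).inc()
```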
### 3.11 Performance targets
* Job dispatch P95 < 150 ms after dependency satisfied.
* Scheduler loop P95 < 500 ms for 10k pending jobs.
* Console live updates sub-second at 1k events/sec per tenant.
* Backfill throughput 200 jobs/sec per worker pool with zero dupes.
### 3.12 Edge cases & behaviors
* **Upstream 429 storms:** auto-throttle, pause optional, recommend extended jitter.
* **Schema drift:** parser moves job to DLQ with `error_class=schema_mismatch` and opens a change ticket via webhook.
* **Flapping source:** circuit breaker opens after N consecutive failures; requires a human “resume”.
* **Clock skew:** watermark logic uses upstream event time; large skews flagged.
* **Idempotency collisions:** a new attempt yields a no-op if the artifact hash already exists (see the windowing sketch below).
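The watermark invariant behind the last two behaviors fits in a few lines: never re-issue event-time windows at or below the committed watermark. A sketch of the invariant, not the full manager:
```
from datetime import datetime, timedelta

def plan_backfill_windows(frm, to, watermark, step=timedelta(days=1)):
    # Skip anything already covered by the watermark so processed
    # event-time ranges are never issued twice.
    cursor = max(frm, watermark)
    while cursor < to:
        window_end = min(cursor + step, to)
        yield (cursor, window_end)
        cursor = window_end
```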
---
## 4) Implementation plan
### 4.1 Modules (new and updated)
* New service: `src/StellaOps.Orchestrator`
* `api/` REST + WS handlers
* `scheduler/` run planner, DAG builder, watermark/backfill logic
* `queues/` publisher and consumer abstractions
* `ratelimit/` token bucket and adaptive controller
* `state/` Postgres repositories and migrations
* `audit/` action logging and export
* `metrics/` Prometheus exporters
* `security/` tenant scoping, KMS client, secret refs
* Worker SDKs:
* `src/StellaOps.Orchestrator.WorkerSdk.Go` and `src/StellaOps.Orchestrator.WorkerSdk.Python` with job claim, heartbeat, progress, artifact publish, and structured error reporting.
* Console:
* `console/apps/orch/` pages: Overview, Sources, Runs, Jobs, Errors, Queues.
* `components/dag-view/`, `components/gantt/`, `components/health-heatmap/`.
* Updates to existing services:
* Conseiller/Excitator/SBOM workers adopt SDK and emit artifacts with schema/version/fingerprint.
* VEX Lens exposes `consensus_compute` as a jobable operation.
* Policy Engine exposes `policy_eval` as a job type for scheduled recalcs.
### 4.2 Packaging & deployment
* Containers:
* `stella/orchestrator:<ver>`
* `stella/worker-sdk-examples:<ver>` for canary pools
* Helm values:
* Queues/topics, per-tenant concurrency, rate-limit defaults, WS replica count.
* KMS integration secrets.
* Migrations:
* Flyway/Goose migrations for new tables and indexes.
### 4.3 Rollout strategy
* Phase 1: Read-only dashboard fed by existing job tables; no controls.
* Phase 2: Control actions enabled for non-prod tenants.
* Phase 3: Backfills and quota management, then GA.
---
## 5) Documentation changes
Create/update the following, each ending with the imposed rule statement.
1. `/docs/orchestrator/overview.md`
Concepts, roles, responsibilities, AOC alignment.
2. `/docs/orchestrator/architecture.md`
Scheduler, DAGs, watermarks, queues, ratelimits, data model.
3. `/docs/orchestrator/api.md`
Endpoints, WebSocket events, error codes, examples.
4. `/docs/orchestrator/console.md`
Screens, actions, a11y, live updates.
5. `/docs/orchestrator/cli.md`
Commands, examples, exit codes, scripting patterns.
6. `/docs/orchestrator/run-ledger.md`
Provenance and audit export format.
7. `/docs/security/secrets-handling.md`
KMS references, redaction rules, operator hygiene.
8. `/docs/operations/orchestrator-runbook.md`
Common failures, backfill guide, circuit breakers, tuning.
9. `/docs/schemas/artifacts.md`
Artifact kinds, schema versions, hashing, storage layout.
10. `/docs/slo/orchestrator-slo.md`
SLO definitions, measurement, alerting.
---
## 6) Engineering tasks
### Backend (orchestrator)
* [ ] Stand up Postgres schemas and indices for sources, runs, jobs, dag_edges, artifacts, quotas, schedules.
* [ ] Implement scheduler: DAG planner, dependency resolver, critical path computation.
* [ ] Implement rate limiter with adaptive behavior on 429/503 and per-tenant tokens.
* [ ] Implement watermark/backfill manager with event-time windows and idempotency keys.
* [ ] Implement API endpoints + OpenAPI spec + request validation.
* [ ] Implement WebSocket/SSE event stream for live updates.
* [ ] Implement audit logging and export.
* [ ] Implement dead-letter store and replay.
### Worker SDKs and integrations
* [ ] Build Go/Python SDKs with claim/heartbeat/progress API.
* [ ] Integrate SDK into Conseiller, Excitator, SBOM workers; ensure artifact emission with schema ver.
* [ ] Add `consensus_compute` and `policy_eval` as job types with deterministic inputs/outputs.
### Console
* [ ] Overview tiles and health heatmap.
* [ ] Source list/detail with actions and config view.
* [ ] Runs timeline (Gantt) and DAG visualization with node inspector.
* [ ] Jobs tail with live updates and filters.
* [ ] Errors clustering and suggested remediations.
* [ ] Queues/backpressure dashboard.
* [ ] Backfill wizard with safety checks.
### Observability
* [ ] Emit metrics listed in §3.10 and wire traces across services.
* [ ] Dashboards: health, queue depth, error classes, burn rate, dispatch latency.
* [ ] Alerts for SLO burn and circuit breaker opens.
### Security & RBAC
* [ ] Enforce tenant scoping on all endpoints; test leakage.
* [ ] Wire KMS for secret refs and redact everywhere.
* [ ] Implement `Orch.Viewer|Operator|Admin` roles and check in Console and API.
### Docs
* [ ] Author all files in §5 with examples and screenshots.
* [ ] Cross-link from Conseiller/Excitator/SBOM pages to the dashboard docs.
* [ ] Append imposed rule to each page.
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
---
## 7) Acceptance criteria
* Operators can: pause/resume a source, run “sync-now”, initiate a backfill for a date range, and retry/cancel individual jobs from Console and CLI.
* DAG and timeline reflect reality within 1 second of job state changes at P95.
* Backfills do not create duplicate artifacts; idempotency proven by hash equality.
* Rate limiter reduces 429s by 80% under simulated throttle tests.
* Audit log includes who/when/why for every operator action.
* Provenance ledger exports a complete chain for any artifact.
* RBAC prevents non-admins from quota changes; tenancy isolation proven via automated tests.
* SLO dashboard shows burn rate and triggers alerts under injected failure.
---
## 8) Risks & mitigations
* **Orchestrator becomes a single bottleneck.**
Scale horizontally with stateless workers; tune DB indexes; batch job-state updates; cache hot paths.
* **Secret spillage.**
Only KMS references stored; aggressive redaction; log scrubbing in SDK.
* **Over-eager backfills overwhelm upstream.**
Enforce per-source quotas and sandbox previews; dry-run backfills first.
* **Schema drift silently corrupts normalization.**
Hard-fail on mismatch; DLQ with clear signatures; schema registry gating.
* **Flapping sources cause alert fatigue.**
Circuit breaker with cooldown and de-duplicated alerts; error budget policy.
---
## 9) Test plan
* **Unit**
Scheduler DAG building, topological sort, backoff math, token bucket, watermark math.
* **Integration**
Orchestrator and worker SDK integration, artifact store wiring, DLQ replay, audit pipeline.
* **Chaos**
Inject 429 storms, packet loss, worker crashes; verify throttling and recovery.
* **Backfill**
Simulate overlapping windows and verify idempotency and watermark correctness.
* **Perf**
10k concurrent jobs: dispatch latency, DB contention, WebSocket fan-out.
* **Security**
Multi-tenant isolation tests; KMS mock tests for secret access; RBAC matrix.
* **UX/A11y**
Screen reader labels on DAG, keyboard navigation, live region updates.
---
## 10) Philosophy
* **Make the invisible visible.** Pipelines should be legible at a glance.
* **Prefer reproducibility to heroics.** Idempotency and provenance over “we think it ran.”
* **Safeguards before speed.** Throttle first, retry thoughtfully, never melt upstreams.
* **No silent merges.** Evidence remains immutable; transformations are explicit, logged and reversible.
> Final reminder: **Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.**