feat: Add new projects to solution and implement contract testing documentation

- Added "StellaOps.Policy.Engine", "StellaOps.Cartographer", and "StellaOps.SbomService" projects to the StellaOps solution.
- Created AGENTS.md to outline the Contract Testing Guild Charter, detailing mission, scope, and definition of done.
- Established TASKS.md for the Contract Testing Task Board, outlining tasks for Sprint 62 and Sprint 63 related to mock servers and replay testing.
commit 651b8e0fa3 (parent 1e41ba7ffa)
2025-10-27 07:57:55 +02:00
355 changed files with 17276 additions and 1160 deletions

EPIC_9.md (new file, 523 lines)

@@ -0,0 +1,523 @@
Below is the “maximum documentation” bundle for Epic 9. Paste it into your repo and pretend the ingestion chaos was always under control.
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
---
# Epic 9: Source & Job Orchestrator Dashboard
**Short name:** `Orchestrator Dashboard`
**Primary service:** `orchestrator` (scheduler, queues, rate limits, job state)
**Surfaces:** Console (Web UI), CLI, Web API
**Touches:** Conseiller (Feedser), Excitator (Vexer), VEX Consensus Lens, SBOM Service, Policy Engine, Findings Ledger, Authority (authN/Z), Telemetry/Analytics, Object Storage, Relational DB, Message Bus
**AOC ground rule:** Conseiller and Excitator aggregate but never merge. The orchestrator schedules, tracks and recovers jobs; it does not transform evidence beyond transport and storage. No “smart” merging in flight.
---
## 1) What it is
The Source & Job Orchestrator Dashboard is the control surface for every data source and pipeline run across StellaOps. It gives operators:
* Live health of all advisory/VEX/SBOM sources and derived jobs.
* End-to-end pipeline visibility as DAGs and timelines.
* Controls for pausing, backfilling, replaying, throttling and retrying.
* Error pattern analysis, rate-limit observability and backpressure insights.
* Provenance and audit trails from initial fetch through parse, normalize, index and policy evaluation.
The dashboard sits over the `orchestrator` service, which maintains job state, schedules runs, enforces quotas and rate limits, and collects metrics from worker pools embedded in Conseiller, Excitator, SBOM and related services.
---
## 2) Why (brief)
Ingestion breaks quietly and then loudly. Without a unified control plane, you learn about it from angry users or empty indexes. This dashboard shortens incident MTTR, enables safe backfills, and makes compliance reviewers stop sending emails with twelve attachments and one emoji.
---
## 3) How it should work (maximum detail)
### 3.1 Capabilities
* **Source registry**
* Register, tag and version connectors (OSV, GHSA, CSAF endpoints, vendor PDF scrapers, distro feeds, RSS, S3 drops, internal registries).
* Store connection details, secrets (via KMS), rate-limit policy, schedules, and ownership metadata.
* Validate and “test connection” safely.
* **Job orchestration**
* Create DAGs composed of job types: `fetch`, `parse`, `normalize`, `dedupe`, `index`, `consensus_compute`, `policy_eval`, `crosslink`, `sbom_ingest`, `sbom_index`.
* Priorities, queues, concurrency caps, exponential backoff, circuit breakers.
* Idempotency keys and output artifact hashing to avoid duplicate work (a key-derivation sketch follows this list).
* Event-time watermarks for backfills without double counting.
* **Observability & control**
* Gantt timeline and real-time DAG view with critical path highlighting.
* Backpressure and queue depth heatmaps.
* Error clustering by class (HTTP 429, TLS, schema mismatch, parse failure, upstream 5xx).
* Per-source SLOs and SLA budgets with burn-rate alerts.
* One-click actions: retry, replay range, pause/resume, throttle/unthrottle, reroute to canary workers.
* **Provenance & audit**
* Immutable run ledger linking input artifact → every job → output artifact.
* Schema version tracking and drift detection.
* Operator actions recorded with reason and ticket reference.
* **Safety**
* Secret redaction everywhere.
* Tenant isolation at API, queue and storage layers.
* AOC: no in-flight merges of advisory or VEX content.
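To make the idempotency guarantee concrete, here is a minimal sketch of how a key might be derived. The field names mirror the data model in §3.3, but the derivation itself is illustrative, not the shipped implementation:
```
import hashlib
import json

def idempotency_key(source_id: str, job_type: str, window_start: str,
                    window_end: str, schema_ver: str) -> str:
    # Identical inputs always yield the same key, so a re-planned job
    # maps to the same unit of work and duplicates can be skipped.
    payload = json.dumps(
        {"source": source_id, "type": job_type,
         "window": [window_start, window_end], "schema": schema_ver},
        sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```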
### 3.2 Core architecture
* **orchestrator (service)**
* Maintains job state in Postgres (`sources`, `runs`, `jobs`, `artifacts`, `dag_edges`, `quotas`, `schedules`).
* Publishes work to a message bus (e.g., `topic.jobs.ready.<queue>`).
* Distributed token-bucket rate limiter per source/tenant/host.
* Watchdog for stuck jobs and circuit breakers for flapping sources.
* Watermark manager for backfills (event-time windows).
* **worker SDK**
* Lightweight library embedded in Conseiller/Excitator/SBOM workers to:
* Claim work, heartbeat, update progress, report metrics.
* Emit artifact metadata and checksums.
* Enforce idempotency via an orchestrator-supplied key. (A claim/heartbeat usage sketch follows this architecture list.)
* **object store**
* Raw payloads and intermediate artifacts organized by schema and hash:
* `advisory/raw/<source_id>/<event_time>/<sha256>.json|pdf`
* `advisory/normalized/<schema_ver>/<hash>.json`
* `vex/raw|normalized/...`
* `sbom/raw|graph/...`
* **web API**
* CRUD for sources, runs, jobs, schedules, quotas.
* Control actions (retry, cancel, pause, backfill).
* Streaming updates via WebSocket/SSE for the Console.
* **console**
* React app consuming Orchestrator APIs, rendering DAGs, timelines, health charts and action panels with RBAC.
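The worker-SDK contract above reduces to a small claim/heartbeat/complete loop. The sketch below assumes a hypothetical client object exposing `claim`, `heartbeat`, `complete` and `fail`; the real SDK surface may differ:
```
import threading
import time

def _heartbeat(client, job_id, stop, interval=10.0):
    # Keep the orchestrator watchdog satisfied while the job runs.
    while not stop.wait(interval):
        client.heartbeat(job_id)

def worker_loop(client, queue, handler):
    # `client` is an assumed SDK object with claim/heartbeat/complete/fail.
    while True:
        job = client.claim(queue)
        if job is None:
            time.sleep(1.0)                       # nothing ready; poll again
            continue
        stop = threading.Event()
        threading.Thread(target=_heartbeat,
                         args=(client, job.id, stop), daemon=True).start()
        try:
            artifact_hash = handler(job)          # fetch/parse/normalize, etc.
            client.complete(job.id, artifact_hash)
        except Exception as exc:
            client.fail(job.id, type(exc).__name__, str(exc))
        finally:
            stop.set()                            # stop heartbeating
```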
### 3.3 Data model (selected tables)
* `sources`
* `id`, `kind` (`advisory|vex|sbom|internal`), `subtype` (e.g., `osv`, `ghsa`, `csaf`, `vendor_pdf`), `display_name`, `owner_team`, `schedule_cron`, `rate_policy`, `enabled`, `secrets_ref`, `tags`, `schema_hint`, `created_at`, `updated_at`.
* `runs`
* `id`, `source_id`, `trigger` (`schedule|manual|event|backfill`), `window_start`, `window_end`, `state`, `started_at`, `finished_at`, `stats_json`.
* `jobs`
`id`, `run_id`, `type`, `queue`, `priority`, `state` (`pending|running|succeeded|failed|canceled|deadletter`), `attempt`, `max_attempt`, `idempotency_key`, `input_artifact_id`, `output_artifact_id`, `worker_id`, `created_at`, `started_at`, `finished_at`, `error_class`, `error_message`, `metrics_json`. (Legal state transitions are sketched after this list.)
* `dag_edges`
* `from_job_id`, `to_job_id`, `edge_kind` (`success_only|always`).
* `artifacts`
* `id`, `kind` (`raw|normalized|index|consensus`), `schema_ver`, `hash`, `uri`, `bytes`, `meta_json`, `created_at`.
* `quotas`
* `tenant_id`, `resource` (`requests_per_min`, `concurrent_jobs`), `limit`, `window_sec`.
* `schedules`
Per-source cron plus jitter, timezone, blackout windows.
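As a sketch of the state machine implied by `jobs.state`, the transition map below is what the orchestrator would consult before accepting an update; the exact set of legal moves is an assumption drawn from the lifecycle in §3.4:
```
TRANSITIONS = {
    "pending":    {"running", "canceled"},
    "running":    {"succeeded", "failed", "canceled"},
    "failed":     {"pending", "deadletter"},   # retry or give up
    "succeeded":  set(),                       # terminal
    "canceled":   set(),                       # terminal
    "deadletter": {"pending"},                 # manual replay
}

def assert_transition(current, target):
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal job transition: {current} -> {target}")
```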
### 3.4 Job lifecycle
1. **Plan**
Scheduler creates a `run` for a source and plans a DAG: e.g., `fetch → parse → normalize → dedupe → index → policy_eval` (advisory) or `fetch → parse → normalize → consensus_compute` (VEX).
2. **Enqueue**
Ready nodes become `jobs` with queue, priority, idempotency key and optional rate-limit tokens reserved.
3. **Execute**
Worker claims job, heartbeats every N seconds. Output artifacts are stored and linked. Failures are classified and retried with exponential backoff and jitter (sketched after this list), up to `max_attempt`.
4. **Complete**
Downstream nodes unblock. On run completion, orchestrator computes SLO deltas and emits run summary.
5. **Dead-letter**
Jobs exceeding attempts move to a DLQ with structured context and suggested remediation.
### 3.5 Scheduling, backpressure, rate limits
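The retry delay in step 3 is standard exponential backoff with full jitter; a minimal version (the defaults are illustrative, not tuned values):
```
import random

def backoff_delay(attempt, base=2.0, cap=300.0):
    # attempt starts at 1; full jitter spreads retries across the window.
    return random.uniform(0.0, min(cap, base * 2 ** (attempt - 1)))
```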
* **Token bucket** per `{tenant, source.host}` with adaptive refill when upstream 429/503 responses are seen (sketched after this list).
* **Concurrency caps** per source and per job type to avoid thundering herd.
* **Backpressure signals** from queue depth, worker CPU, and upstream error rates; scheduler reduces in-flight issuance accordingly.
* **Backfills** use event-time windows with immutable watermarks to avoid reprocessing.
* **Blackout windows** for vendor maintenance periods.
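A single-process sketch of the adaptive token bucket described above; a production limiter would need locking and distributed state for the per-`{tenant, host}` buckets, both omitted here:
```
import time

class AdaptiveTokenBucket:
    def __init__(self, rate, capacity):
        self.base_rate = rate        # tokens/sec when upstream is healthy
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self, n=1.0):
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def on_upstream_throttle(self):
        # Saw a 429/503: halve the refill rate, floored at 10% of base.
        self.rate = max(self.base_rate * 0.1, self.rate * 0.5)

    def on_upstream_ok(self):
        # Healthy response: creep back toward the configured base rate.
        self.rate = min(self.base_rate, self.rate * 1.1)
```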
### 3.6 APIs
```
POST /orchestrator/sources
GET /orchestrator/sources?kind=&tag=&q=
GET /orchestrator/sources/{id}
PATCH /orchestrator/sources/{id}
POST /orchestrator/sources/{id}/actions:test|pause|resume|sync-now
POST /orchestrator/sources/{id}/backfill { "from":"2024-01-01", "to":"2024-03-01" }
GET /orchestrator/runs?source_id=&state=&from=&to=
GET /orchestrator/runs/{run_id}
GET /orchestrator/runs/{run_id}/dag
POST /orchestrator/runs/{run_id}/cancel
GET /orchestrator/jobs?state=&type=&queue=&source_id=
GET /orchestrator/jobs/{job_id}
POST /orchestrator/jobs/{job_id}/actions:retry|cancel|prioritize
GET /orchestrator/metrics/overview
GET /orchestrator/errors/top?window=1h
GET /orchestrator/quotas
PATCH /orchestrator/quotas/{tenant_id}
WS /orchestrator/streams/updates
```
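As a usage sketch, launching a backfill through the API above might look like this; the host, token, and source id are placeholders:
```
import requests

BASE = "https://stella.example.internal/orchestrator"  # placeholder host
HEADERS = {"Authorization": "Bearer <token>"}          # placeholder auth

# Backfill Q1 2024 for a source, mirroring the endpoint shape above.
resp = requests.post(f"{BASE}/sources/src-123/backfill",
                     json={"from": "2024-01-01", "to": "2024-03-01"},
                     headers=HEADERS, timeout=30)
resp.raise_for_status()
print(resp.json())
```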
### 3.7 Console (Web UI)
* **Overview**
* KPI tiles: sources healthy, runs in progress, queue depth, error rate, burn rate against SLO.
* Heatmap of source health by last 24h success ratio.
* **Sources**
* Grid with filters, inline status (active, paused, throttled), next-run ETA, last error class.
* Detail panel: config, secrets status (redacted), schedule, rate limits, ownership, run history, action buttons.
* **Runs**
* Timeline (Gantt) with critical path, duration distribution, and per-stage breakdown.
* Run detail: DAG view with node metrics, artifacts, logs, action menu (cancel).
* **Jobs**
* Live table with state filters and “tail” view.
* Job detail: payload preview (redacted), worker, attempts, stack traces, linked artifacts.
* **Errors**
* Clusters by class and signature, suggested remediations (pause source, lower concurrency, patch parser).
* **Queues & Backpressure**
* Per-queue depth, service rate, in-flight count, age percentiles.
* Rate-limit token graphs per source host.
* **Controls**
* Backfill wizard with event-time preview and safety checks.
* Canary routing: route 5% of next 100 runs to a new worker pool.
* **A11y**
* Keyboard nav, ARIA roles for DAG nodes, live regions for updates, color-blind-friendly graphs.
### 3.8 CLI
```
stella orch sources list --kind advisory --tag prod
stella orch sources add --file source.yaml
stella orch sources test <source-id>
stella orch sources pause <source-id> # or resume
stella orch sources sync-now <source-id>
stella orch sources backfill <source-id> --from 2024-01-01 --to 2024-03-01
stella orch runs list --source <id> --state running
stella orch runs show <run-id> --dag
stella orch runs cancel <run-id>
stella orch jobs list --state failed --type parse --limit 100
stella orch jobs retry <job-id>
stella orch jobs cancel <job-id>
stella orch jobs tail --queue normalize --follow
stella orch quotas get --tenant default
stella orch quotas set --tenant default --concurrent-jobs 50 --rpm 1200
```
Exit codes: `0` success, `2` invalid args, `4` not found, `5` denied, `7` precondition failed, `8` rate-limited.
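The exit codes make the CLI scriptable; for example, a wrapper might retry only on the rate-limited code (a sketch; the source id is a placeholder):
```
import subprocess
import sys
import time

RATE_LIMITED = 8   # from the exit-code table above

def run_with_retry(args, attempts=3):
    for attempt in range(1, attempts + 1):
        code = subprocess.run(["stella", "orch", *args]).returncode
        if code != RATE_LIMITED:
            return code
        time.sleep(2 ** attempt)   # back off before trying again
    return RATE_LIMITED

sys.exit(run_with_retry(["sources", "sync-now", "src-123"]))
```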
### 3.9 RBAC & security
* **Roles** (an enforcement sketch follows this list)
* `Orch.Viewer`: read-only sources/runs/jobs/metrics.
* `Orch.Operator`: perform actions on sources and jobs, launch backfills.
* `Orch.Admin`: manage quotas, schedules, connector versions, and delete sources.
* **Secrets**
* Stored only as references to your KMS; never persisted in cleartext.
* Console shows redact badges and last rotated timestamp.
* **Tenancy**
* Source, run, job rows scoped by tenant id.
* Queue names and token buckets namespaced per tenant.
* **Compliance**
* Full audit log for every operator action with “reason” and optional ticket link.
* Exportable run ledger for audits.
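A sketch of how the role check might guard a handler; the decorator and the `request.principal` shape are assumptions, and tenant scoping would be enforced separately at the data layer:
```
from functools import wraps

ROLE_RANK = {"Orch.Viewer": 0, "Orch.Operator": 1, "Orch.Admin": 2}

def requires(role):
    def decorator(fn):
        @wraps(fn)
        def wrapper(request, *args, **kwargs):
            # request.principal is an assumed auth context from Authority.
            if ROLE_RANK[request.principal.role] < ROLE_RANK[role]:
                raise PermissionError(f"{role} required")
            return fn(request, *args, **kwargs)
        return wrapper
    return decorator

@requires("Orch.Admin")
def patch_quota(request, tenant_id, body):
    ...  # only admins may change quotas, per the role table above
```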
### 3.10 Observability
* **Metrics (examples)** (an emission sketch follows this list)
* `orch_jobs_inflight{type,queue}`
* `orch_jobs_latency_ms{type,percentile}`
* `orch_rate_tokens_available{source}`
* `orch_error_rate{source,error_class}`
* `orch_slo_burn_rate{source,slo}`
* `orch_deadletter_total{source,type}`
* **Traces**
* Span per job with baggage: `run_id`, `source_id`, `artifact_id`.
* Links across services to Conseiller/Excitator/SBOM workers.
* **Logs**
* Structured JSON with correlation ids, attempt numbers and redacted payload previews.
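Emitting two of the example metric series with the standard `prometheus_client` library; the scrape port is an arbitrary choice:
```
from prometheus_client import Counter, Gauge, start_http_server

JOBS_INFLIGHT = Gauge("orch_jobs_inflight",
                      "Jobs currently executing", ["type", "queue"])
DEADLETTER_TOTAL = Counter("orch_deadletter_total",
                           "Jobs moved to the DLQ", ["source", "type"])

start_http_server(9108)   # expose /metrics for scraping

def on_job_start(job_type, queue):
    JOBS_INFLIGHT.labels(type=job_type, queue=queue).inc()

def on_job_end(job_type, queue):
    JOBS_INFLIGHT.labels(type=job_type, queue=queue).dec()

def on_deadletter(source, job_type):
    DEADLETTER_TOTAL.labels(source=source, type=job_type).inc()
```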
### 3.11 Performance targets
* Job dispatch P95 < 150 ms after dependency satisfied.
* Scheduler loop P95 < 500 ms for 10k pending jobs.
* Console live updates sub-second at 1k events/sec per tenant.
* Backfill throughput 200 jobs/sec per worker pool with zero dupes.
### 3.12 Edge cases & behaviors
* **Upstream 429 storms:** auto-throttle, pause optional, recommend extended jitter.
* **Schema drift:** parser moves job to DLQ with `error_class=schema_mismatch` and opens a change ticket via webhook.
* **Flapping source:** circuit breaker opens after N consecutive failures; requires a human “resume”.
* **Clock skew:** watermark logic uses upstream event time; large skews flagged.
* **Idempotency collisions:** a new attempt yields a no-op if the artifact hash already exists (see the windowing sketch below).
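The watermark invariant behind the last two behaviors fits in a few lines: never re-issue event-time windows at or below the committed watermark. A sketch of the invariant, not the full manager:
```
from datetime import datetime, timedelta

def plan_backfill_windows(frm, to, watermark, step=timedelta(days=1)):
    # Skip anything already covered by the watermark so processed
    # event-time ranges are never issued twice.
    cursor = max(frm, watermark)
    while cursor < to:
        window_end = min(cursor + step, to)
        yield (cursor, window_end)
        cursor = window_end
```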
---
## 4) Implementation plan
### 4.1 Modules (new and updated)
* New service: `src/StellaOps.Orchestrator`
* `api/` REST + WS handlers
* `scheduler/` run planner, DAG builder, watermark/backfill logic
* `queues/` publisher and consumer abstractions
* `ratelimit/` token bucket and adaptive controller
* `state/` Postgres repositories and migrations
* `audit/` action logging and export
* `metrics/` Prometheus exporters
* `security/` tenant scoping, KMS client, secret refs
* Worker SDKs:
* `src/StellaOps.Orchestrator.WorkerSdk.Go` and `src/StellaOps.Orchestrator.WorkerSdk.Python` with job claim, heartbeat, progress, artifact publish, and structured error reporting.
* Console:
* `console/apps/orch/` pages: Overview, Sources, Runs, Jobs, Errors, Queues.
* `components/dag-view/`, `components/gantt/`, `components/health-heatmap/`.
* Updates to existing services:
* Conseiller/Excitator/SBOM workers adopt SDK and emit artifacts with schema/version/fingerprint.
* VEX Lens exposes `consensus_compute` as a jobable operation.
* Policy Engine exposes `policy_eval` as a job type for scheduled recalcs.
### 4.2 Packaging & deployment
* Containers:
* `stella/orchestrator:<ver>`
* `stella/worker-sdk-examples:<ver>` for canary pools
* Helm values:
* Queues/topics, per-tenant concurrency, rate-limit defaults, WS replica count.
* KMS integration secrets.
* Migrations:
* Flyway/Goose migrations for new tables and indexes.
### 4.3 Rollout strategy
* Phase 1: Read-only dashboard fed by existing job tables; no controls.
* Phase 2: Control actions enabled for non-prod tenants.
* Phase 3: Backfills and quota management, then GA.
---
## 5) Documentation changes
Create/update the following, each ending with the imposed rule statement.
1. `/docs/orchestrator/overview.md`
Concepts, roles, responsibilities, AOC alignment.
2. `/docs/orchestrator/architecture.md`
Scheduler, DAGs, watermarks, queues, ratelimits, data model.
3. `/docs/orchestrator/api.md`
Endpoints, WebSocket events, error codes, examples.
4. `/docs/orchestrator/console.md`
Screens, actions, a11y, live updates.
5. `/docs/orchestrator/cli.md`
Commands, examples, exit codes, scripting patterns.
6. `/docs/orchestrator/run-ledger.md`
Provenance and audit export format.
7. `/docs/security/secrets-handling.md`
KMS references, redaction rules, operator hygiene.
8. `/docs/operations/orchestrator-runbook.md`
Common failures, backfill guide, circuit breakers, tuning.
9. `/docs/schemas/artifacts.md`
Artifact kinds, schema versions, hashing, storage layout.
10. `/docs/slo/orchestrator-slo.md`
SLO definitions, measurement, alerting.
---
## 6) Engineering tasks
### Backend (orchestrator)
* [ ] Stand up Postgres schemas and indices for sources, runs, jobs, dag_edges, artifacts, quotas, schedules.
* [ ] Implement scheduler: DAG planner, dependency resolver, critical path computation.
* [ ] Implement rate limiter with adaptive behavior on 429/503 and per-tenant tokens.
* [ ] Implement watermark/backfill manager with event-time windows and idempotency keys.
* [ ] Implement API endpoints + OpenAPI spec + request validation.
* [ ] Implement WebSocket/SSE event stream for live updates.
* [ ] Implement audit logging and export.
* [ ] Implement dead-letter store and replay.
### Worker SDKs and integrations
* [ ] Build Go/Python SDKs with claim/heartbeat/progress API.
* [ ] Integrate SDK into Conseiller, Excitator, SBOM workers; ensure artifact emission with schema ver.
* [ ] Add `consensus_compute` and `policy_eval` as job types with deterministic inputs/outputs.
### Console
* [ ] Overview tiles and health heatmap.
* [ ] Source list/detail with actions and config view.
* [ ] Runs timeline (Gantt) and DAG visualization with node inspector.
* [ ] Jobs tail with live updates and filters.
* [ ] Errors clustering and suggested remediations.
* [ ] Queues/backpressure dashboard.
* [ ] Backfill wizard with safety checks.
### Observability
* [ ] Emit metrics listed in §3.10 and wire traces across services.
* [ ] Dashboards: health, queue depth, error classes, burn rate, dispatch latency.
* [ ] Alerts for SLO burn and circuit breaker opens.
### Security & RBAC
* [ ] Enforce tenant scoping on all endpoints; test leakage.
* [ ] Wire KMS for secret refs and redact everywhere.
* [ ] Implement `Orch.Viewer|Operator|Admin` roles and check in Console and API.
### Docs
* [ ] Author all files in §5 with examples and screenshots.
* [ ] Cross-link from Conseiller/Excitator/SBOM pages to the dashboard docs.
* [ ] Append imposed rule to each page.
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
---
## 7) Acceptance criteria
* Operators can: pause/resume a source, run “sync-now”, initiate a backfill for a date range, and retry/cancel individual jobs from Console and CLI.
* DAG and timeline reflect reality within 1 second of job state changes at P95.
* Backfills do not create duplicate artifacts; idempotency proven by hash equality.
* Rate limiter reduces 429s by 80% under simulated throttle tests.
* Audit log includes who/when/why for every operator action.
* Provenance ledger exports a complete chain for any artifact.
* RBAC prevents non-admins from quota changes; tenancy isolation proven via automated tests.
* SLO dashboard shows burn rate and triggers alerts under injected failure.
---
## 8) Risks & mitigations
* **Orchestrator becomes a single bottleneck.**
Scale horizontally with stateless workers; tune DB indexes; batch job-state updates; cache hot paths.
* **Secret spillage.**
Only KMS references stored; aggressive redaction; log scrubbing in SDK.
* **Over-eager backfills overwhelm upstream.**
Enforce per-source quotas and sandbox previews; dry-run backfills first.
* **Schema drift silently corrupts normalization.**
Hard-fail on mismatch; DLQ with clear signatures; schema registry gating.
* **Flapping sources cause alert fatigue.**
Circuit breaker with cooldown and de-duplicated alerts; error budget policy.
---
## 9) Test plan
* **Unit**
Scheduler DAG building, topological sort, backoff math, token bucket, watermark math.
* **Integration**
Orchestrator and worker SDK integration, artifact store wiring, DLQ replay, audit pipeline.
* **Chaos**
Inject 429 storms, packet loss, worker crashes; verify throttling and recovery.
* **Backfill**
Simulate overlapping windows and verify idempotency and watermark correctness.
* **Perf**
10k concurrent jobs: dispatch latency, DB contention, WebSocket fan-out.
* **Security**
Multi-tenant isolation tests; KMS mock tests for secret access; RBAC matrix.
* **UX/A11y**
Screen reader labels on DAG, keyboard navigation, live region updates.
---
## 10) Philosophy
* **Make the invisible visible.** Pipelines should be legible at a glance.
* **Prefer reproducibility to heroics.** Idempotency and provenance over “we think it ran.”
* **Safeguards before speed.** Throttle first, retry thoughtfully, never melt upstreams.
* **No silent merges.** Evidence remains immutable; transformations are explicit, logged and reversible.
> Final reminder: **Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.**