consolidation of some of the modules, localization fixes, product advisories work, qa work

This commit is contained in:
master
2026-03-05 03:54:22 +02:00
parent 7bafcc3eef
commit 8e1cb9448d
3878 changed files with 72600 additions and 46861 deletions

View File

@@ -0,0 +1,78 @@
# StellaOps Source & Job Orchestrator
The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
## Latest updates (2025-11-30)
- OpenAPI discovery published at `/.well-known/openapi` with `openapi/orchestrator.json`; includes pagination/idempotency/error-envelope examples and version headers.
- Legacy job detail/summary endpoints now emit `Deprecation` + `Link` headers pointing to the stable replacements.
- Job leasing flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
- Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
- Authority `orch:quota` / `orch:backfill` scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.
## Responsibilities
- Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
- Expose dashboards and APIs for throttling, replays, and failover.
- Enforce rate-limits, concurrency and dependency chains across queues.
- Stream structured events and audit logs for incident response.
- Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.
## Key components
- Orchestrator WebService (control plane).
- Queue adapters (Valkey/NATS) and job ledger.
- Console dashboard module and CLI integration for operators.
## Integrations & dependencies
- Authority for authN/Z on operational actions.
- Telemetry stack for job metrics and alerts.
- Scheduler/Concelier/Excititor workers for job lifecycle.
- Offline Kit for state export/import during air-gap refreshes.
## Operational notes
- Job recovery runbooks and dashboard JSON as described in Epic 9.
- Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
- Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
- When using `orch:quota` / `orch:backfill` scopes, capture reason/ticket fields in runbooks and audit checklists.
## Implementation Status
### Phase 1 Core service & job ledger (Complete)
- PostgreSQL schema with sources, runs, jobs, artifacts, DAG edges, quotas, schedules, incidents
- Lease manager with heartbeats, retries, dead-letter queues
- Token-bucket rate limiter per tenant/source.host with adaptive refill
- Watermark/backfill orchestration for event-time windows
### Phase 2 Worker SDK & artifact registry (Complete)
- Claim/heartbeat/report contract with deterministic artifact hashing
- Idempotency enforcement and worker SDKs for .NET/Rust/Go agents
- Integrated with Concelier, Excititor, SBOM Service, Policy Engine
### Phase 3 Observability & dashboard (In Progress)
- Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate
- Error clustering for HTTP 429/5xx, schema mismatches, parse errors
- SSE/WebSocket feeds for Console updates, Gantt timeline/DAG JSON
### Phase 4 Controls & resilience (Planned)
- Pause/resume/throttle/retry/backfill tooling
- Dead-letter review, circuit breakers, blackouts, backpressure handling
- Automation hooks and control plane APIs
### Phase 5 Offline & compliance (Planned)
- Deterministic audit bundles (jobs.jsonl, history.jsonl, throttles.jsonl)
- Provenance manifests and offline replay scripts
- Tenant isolation validation and secret redaction
### Key Acceptance Criteria
- Schedules all jobs with quotas, rate limits, idempotency; preserves provenance
- Console reflects real-time DAG status, queue depth, SLO burn rate
- Observability stack exposes metrics, logs, traces, incidents for stuck jobs and throttling
- Offline audit bundles reproduce job history deterministically with verified signatures
### Technical Decisions & Risks
- Backpressure/queue overload mitigated via adaptive token buckets, circuit breakers, dynamic concurrency
- Upstream vendor throttles managed with visible state, automatic jitter and retry
- Tenant leakage prevented through API/queue/storage filters, fuzz tests, redaction
- Complex DAG errors handled with diagnostics, error clustering, partial replay tooling
## Epic alignment
- Epic 9: Source & Job Orchestrator Dashboard.
- ORCH stories in ../../TASKS.md.