docs consolidation work

This commit is contained in:
StellaOps Bot
2025-12-25 18:48:11 +02:00
parent 2a06f780cf
commit 82a49f6743
102 changed files with 3550 additions and 1679 deletions

View File

@@ -33,6 +33,46 @@ The Orchestrator schedules, observes, and recovers ingestion and analysis jobs a
- Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
- When using `orch:quota` / `orch:backfill` scopes, capture reason/ticket fields in runbooks and audit checklists.
## Implementation Status
### Phase 1 Core service & job ledger (Complete)
- PostgreSQL schema with sources, runs, jobs, artifacts, DAG edges, quotas, schedules, incidents
- Lease manager with heartbeats, retries, dead-letter queues
- Token-bucket rate limiter per tenant/source.host with adaptive refill
- Watermark/backfill orchestration for event-time windows
### Phase 2 Worker SDK & artifact registry (Complete)
- Claim/heartbeat/report contract with deterministic artifact hashing
- Idempotency enforcement and worker SDKs for .NET/Rust/Go agents
- Integrated with Concelier, Excititor, SBOM Service, Policy Engine
### Phase 3 Observability & dashboard (In Progress)
- Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate
- Error clustering for HTTP 429/5xx, schema mismatches, parse errors
- SSE/WebSocket feeds for Console updates, Gantt timeline/DAG JSON
### Phase 4 Controls & resilience (Planned)
- Pause/resume/throttle/retry/backfill tooling
- Dead-letter review, circuit breakers, blackouts, backpressure handling
- Automation hooks and control plane APIs
### Phase 5 Offline & compliance (Planned)
- Deterministic audit bundles (jobs.jsonl, history.jsonl, throttles.jsonl)
- Provenance manifests and offline replay scripts
- Tenant isolation validation and secret redaction
### Key Acceptance Criteria
- Schedules all jobs with quotas, rate limits, idempotency; preserves provenance
- Console reflects real-time DAG status, queue depth, SLO burn rate
- Observability stack exposes metrics, logs, traces, incidents for stuck jobs and throttling
- Offline audit bundles reproduce job history deterministically with verified signatures
### Technical Decisions & Risks
- Backpressure/queue overload mitigated via adaptive token buckets, circuit breakers, dynamic concurrency
- Upstream vendor throttles managed with visible state, automatic jitter and retry
- Tenant leakage prevented through API/queue/storage filters, fuzz tests, redaction
- Complex DAG errors handled with diagnostics, error clustering, partial replay tooling
## Epic alignment
- Epic 9: Source & Job Orchestrator Dashboard.
- ORCH stories in ../../TASKS.md.

View File

@@ -1,62 +0,0 @@
# Implementation plan — Source & Job Orchestrator
## Delivery phases
- **Phase 1 Core service & job ledger**
Implement source registry, run/job tables, queue abstraction, lease management, token-bucket rate limiting, watchdogs, and API primitives (`/sources`, `/runs`, `/jobs`).
- **Phase 2 Worker SDK & artifact registry**
Embed worker SDK in Conseiller, Excititor, SBOM, Policy Engine; capture artifact metadata + hashes, enforce idempotency, publish progress/metrics.
- **Phase 3 Observability & dashboard**
Ship metrics, traces, incident logging, SSE/WebSocket feeds, and Console dashboard (DAG/timeline, heatmaps, error clustering, SLO burn rate).
- **Phase 4 Controls & resilience**
Deliver pause/resume/throttle/retry/backfill tooling, dead-letter review, circuit breakers, blackouts, backpressure handling, and automation hooks.
- **Phase 5 Offline & compliance**
Generate deterministic audit bundles (`jobs.jsonl`, `history.jsonl`, `throttles.jsonl`), provenance manifests, and offline replay scripts.
## Work breakdown
- **Service & persistence**
- Postgres schema (`sources`, `runs`, `jobs`, `artifacts`, `dag_edges`, `quotas`, `schedules`, `incidents`).
- Lease manager with heartbeats, retries, and dead-letter queues.
- Token-bucket rate limiter per `{tenant, source.host}`; adaptive refill on upstream throttles.
- Watermark/backfill orchestration for event-time windows.
- **Worker SDK**
- Claim/heartbeat/report contract, deterministic artifact hashing, idempotency enforcement.
- Library release for .NET workers plus language bindings for Rust/Go ingestion agents.
- **Control plane APIs**
- CRUD for sources, runs, jobs, quotas, schedules; control actions (retry, cancel, prioritize, pause/resume, backfill).
- SSE/WebSocket stream for Console updates.
- **Observability**
- Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate.
- Error clustering (HTTP 429/5xx, schema mismatch, parse errors), incident logging with reason codes.
- Gantt timeline and DAG JSON for Console visualisation.
- **Console & CLI**
- Console app sections: overview, sources, runs, job detail, incidents, throttles.
- CLI commands: `stella orchestrator sources|runs|jobs|throttle|backfill`.
- **Compliance & offline**
- Immutable audit bundles with signatures; exports via Export Center; Offline Kit instructions.
- Tenant isolation validation and secret redaction for logs/UI.
## Acceptance criteria
- Orchestrator schedules all advisory/VEX/SBOM/policy jobs with quotas, rate limits, and idempotency; retries and replay preserve provenance.
- Console dashboard reflects real-time DAG status, queue depth, SLO burn rate, and allows pause/resume/throttle/backfill with audit trail.
- Worker SDK integrated across producer services, emitting progress and artifact metadata.
- Observability stack exposes metrics, logs, traces, incidents, and alerts for stuck jobs, throttling, and failure spikes.
- Offline audit bundles reproduce job history deterministically and verify signatures.
## Risks & mitigations
- **Backpressure/queue overload:** adaptive token buckets, circuit breakers, dynamic concurrency; degrade gracefully.
- **Upstream vendor throttles:** throttle management with user-visible state, automatic jitter and retry.
- **Tenant leakage:** enforce tenant filters at API/queue/storage, fuzz tests, redaction.
- **Complex DAG errors:** built-in diagnostics, error clustering, partial replay tooling.
- **Operator error:** confirmation prompts, RBAC, runbook guidance, reason codes logged.
## Test strategy
- **Unit:** scheduling, quota enforcement, lease renewals, token bucket, watermark arithmetic.
- **Integration:** worker SDK with Conseiller/Excititor/SBOM pipelines, pause/resume/backfill flows, failure recovery.
- **Performance:** high-volume job workloads, queue backpressure, concurrency caps, dashboard SSE load tests.
- **Chaos:** simulate upstream outages, stuck workers, clock skew, Postgres failover.
- **Compliance:** audit bundle generation, signature verification, offline replay.
## Definition of done
- All phases delivered with telemetry, dashboards, and runbooks published.
- Console + CLI parity validated; Offline Kit instructions complete.
- ./TASKS.md and ../../TASKS.md updated with status; documentation (README/architecture/this plan) reflects latest behaviour.