4.4 KiB
		
	
	
	
	
	
	
	
			
		
		
	
	
			4.4 KiB
		
	
	
	
	
	
	
	
Implementation plan — Source & Job Orchestrator
Delivery phases
- Phase 1 – Core service & job ledger
Implement source registry, run/job tables, queue abstraction, lease management, token-bucket rate limiting, watchdogs, and API primitives (/sources,/runs,/jobs). - Phase 2 – Worker SDK & artifact registry
Embed worker SDK in Conseiller, Excititor, SBOM, Policy Engine; capture artifact metadata + hashes, enforce idempotency, publish progress/metrics. - Phase 3 – Observability & dashboard
Ship metrics, traces, incident logging, SSE/WebSocket feeds, and Console dashboard (DAG/timeline, heatmaps, error clustering, SLO burn rate). - Phase 4 – Controls & resilience
Deliver pause/resume/throttle/retry/backfill tooling, dead-letter review, circuit breakers, blackouts, backpressure handling, and automation hooks. - Phase 5 – Offline & compliance
Generate deterministic audit bundles (jobs.jsonl,history.jsonl,throttles.jsonl), provenance manifests, and offline replay scripts. 
Work breakdown
- Service & persistence
- Postgres schema (
sources,runs,jobs,artifacts,dag_edges,quotas,schedules,incidents). - Lease manager with heartbeats, retries, and dead-letter queues.
 - Token-bucket rate limiter per 
{tenant, source.host}; adaptive refill on upstream throttles. - Watermark/backfill orchestration for event-time windows.
 
 - Postgres schema (
 - Worker SDK
- Claim/heartbeat/report contract, deterministic artifact hashing, idempotency enforcement.
 - Library release for .NET workers plus language bindings for Rust/Go ingestion agents.
 
 - Control plane APIs
- CRUD for sources, runs, jobs, quotas, schedules; control actions (retry, cancel, prioritize, pause/resume, backfill).
 - SSE/WebSocket stream for Console updates.
 
 - Observability
- Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate.
 - Error clustering (HTTP 429/5xx, schema mismatch, parse errors), incident logging with reason codes.
 - Gantt timeline and DAG JSON for Console visualisation.
 
 - Console & CLI
- Console app sections: overview, sources, runs, job detail, incidents, throttles.
 - CLI commands: 
stella orchestrator sources|runs|jobs|throttle|backfill. 
 - Compliance & offline
- Immutable audit bundles with signatures; exports via Export Center; Offline Kit instructions.
 - Tenant isolation validation and secret redaction for logs/UI.
 
 
Acceptance criteria
- Orchestrator schedules all advisory/VEX/SBOM/policy jobs with quotas, rate limits, and idempotency; retries and replay preserve provenance.
 - Console dashboard reflects real-time DAG status, queue depth, SLO burn rate, and allows pause/resume/throttle/backfill with audit trail.
 - Worker SDK integrated across producer services, emitting progress and artifact metadata.
 - Observability stack exposes metrics, logs, traces, incidents, and alerts for stuck jobs, throttling, and failure spikes.
 - Offline audit bundles reproduce job history deterministically and verify signatures.
 
Risks & mitigations
- Backpressure/queue overload: adaptive token buckets, circuit breakers, dynamic concurrency; degrade gracefully.
 - Upstream vendor throttles: throttle management with user-visible state, automatic jitter and retry.
 - Tenant leakage: enforce tenant filters at API/queue/storage, fuzz tests, redaction.
 - Complex DAG errors: built-in diagnostics, error clustering, partial replay tooling.
 - Operator error: confirmation prompts, RBAC, runbook guidance, reason codes logged.
 
Test strategy
- Unit: scheduling, quota enforcement, lease renewals, token bucket, watermark arithmetic.
 - Integration: worker SDK with Conseiller/Excititor/SBOM pipelines, pause/resume/backfill flows, failure recovery.
 - Performance: high-volume job workloads, queue backpressure, concurrency caps, dashboard SSE load tests.
 - Chaos: simulate upstream outages, stuck workers, clock skew, Postgres failover.
 - Compliance: audit bundle generation, signature verification, offline replay.
 
Definition of done
- All phases delivered with telemetry, dashboards, and runbooks published.
 - Console + CLI parity validated; Offline Kit instructions complete.
 - ./TASKS.md and ../../TASKS.md updated with status; documentation (README/architecture/this plan) reflects latest behaviour.