Source & Job Orchestrator architecture
Based on Epic 9 – Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.
1) Topology
- Orchestrator API (StellaOps.Orchestrator). Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (orchestrator:*).
- Job ledger (Mongo). Collections jobs, job_history, sources, quotas, throttles, incidents. Append-only history ensures auditability.
- Queue abstraction. Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- Dashboard feeds. SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
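The pluggable queue abstraction can be illustrated with a minimal sketch. Everything below is hypothetical: the names QueuedJob, JobQueue, and InMemoryJobQueue, and the method set (enqueue, lease, ack), are assumptions chosen for illustration, not the actual StellaOps.Orchestrator interfaces. The in-memory class stands in for a Mongo, Redis Streams, or NATS JetStream backend.

```python
from __future__ import annotations

import time
import uuid
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional, Protocol


@dataclass
class QueuedJob:
    """A job as seen by the queue; field names are illustrative assumptions."""
    job_id: str
    job_type: str
    tenant: str
    max_attempts: int = 3            # retry policy carried with the job
    attempts: int = 0
    lease_id: Optional[str] = None   # lease metadata carried with the job
    lease_until: float = 0.0


class JobQueue(Protocol):
    """Hypothetical contract that each queue backend would implement."""

    def enqueue(self, job: QueuedJob) -> None: ...
    def lease(self, lease_seconds: float, now: Optional[float] = None) -> Optional[QueuedJob]: ...
    def ack(self, job_id: str, lease_id: str) -> bool: ...


class InMemoryJobQueue:
    """Toy backend used only to show the contract end to end."""

    def __init__(self) -> None:
        self._ready: Deque[QueuedJob] = deque()
        self._leased: Dict[str, QueuedJob] = {}

    def enqueue(self, job: QueuedJob) -> None:
        self._ready.append(job)

    def lease(self, lease_seconds: float, now: Optional[float] = None) -> Optional[QueuedJob]:
        now = time.time() if now is None else now
        if not self._ready:
            return None
        job = self._ready.popleft()
        job.attempts += 1
        job.lease_id = uuid.uuid4().hex
        job.lease_until = now + lease_seconds
        self._leased[job.job_id] = job
        return job

    def ack(self, job_id: str, lease_id: str) -> bool:
        # Only the current lease holder may complete the job.
        job = self._leased.get(job_id)
        if job is None or job.lease_id != lease_id:
            return False
        del self._leased[job_id]
        return True
```

Keeping the contract this small is a design choice in the sketch: a backend only has to provide ordered delivery and lease bookkeeping, while quotas and retries stay in the Orchestrator.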
2) Job lifecycle
- Enqueue. Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit JobRequest records containing jobType, tenant, priority, payloadDigest, dependencies.
- Scheduling. Orchestrator applies quotas and rate limits per {tenant, jobType}. Jobs exceeding limits are staged in pending queue with next eligible timestamp.
- Leasing. Workers poll LeaseJob endpoint; Orchestrator returns job with leaseId, leaseUntil, and instrumentation tokens. Lease renewal required for long-running tasks.
- Completion. Worker reports status (succeeded, failed, canceled, timed_out). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded.
- Replay. Operators trigger POST /jobs/{id}/replay which clones job payload, sets replayOf pointer, and requeues with high priority while preserving determinism metadata.
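The completion step (archive on success, exponential backoff with a maximum attempt count on failure) can be sketched as follows. The RetryPolicy fields, their default values, and the decision labels are assumptions for illustration. The sketch also assumes that canceled jobs are not retried and that timed_out is treated like failed; the text above does not specify either behaviour.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry policy; defaults are placeholders, not real settings."""
    base_delay_seconds: float = 5.0
    factor: float = 2.0
    max_delay_seconds: float = 300.0
    max_attempts: int = 5


def next_retry_delay(policy: RetryPolicy, attempt: int) -> Optional[float]:
    """Delay before the next try, given how many attempts were made (1-based).

    Returns None once the attempt budget is exhausted.
    """
    if attempt >= policy.max_attempts:
        return None
    delay = policy.base_delay_seconds * policy.factor ** (attempt - 1)
    return min(delay, policy.max_delay_seconds)


def on_completion(
    status: str, attempt: int, policy: RetryPolicy, now: float
) -> Tuple[str, Optional[float]]:
    """Map a worker-reported status to (decision, next eligible timestamp)."""
    if status == "succeeded":
        return ("archive", None)
    if status == "canceled":
        # Assumption: an explicit cancel is final and never retried.
        return ("archive", None)
    # Assumption: "failed" and "timed_out" share the same retry path.
    delay = next_retry_delay(policy, attempt)
    if delay is None:
        return ("incident", None)  # budget exhausted: escalate
    return ("retry", now + delay)
```

For example, with the placeholder defaults a job that fails on its first attempt at time 1000.0 becomes eligible again at 1005.0, and a fifth failure yields the "incident" decision instead of another retry.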
3) Rate-limit & quota governance
- Quotas defined per tenant/profile (maxActive, maxPerHour, burst). Stored in quotas and enforced before leasing.
- Dynamic throttles allow ops to pause specific sources (pauseSource, resumeSource) or reduce concurrency.
- Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
4) APIs
- GET /api/jobs?status= - list jobs with filters (tenant, jobType, status, time window).
- GET /api/jobs/{id} - job detail (payload digest, attempts, worker, lease history, metrics).
- POST /api/jobs/{id}/cancel - cancel running/pending job with audit reason.
- POST /api/jobs/{id}/replay - schedule replay.
- POST /api/limits/throttle - apply throttle (requires elevated scope).
- GET /api/dashboard/metrics - aggregated metrics for Console dashboards.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
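As a small illustration of how a client might address these endpoints, the sketch below builds request URLs with the standard library only. The host name is a placeholder, and the query parameter names tenant and jobType are assumptions inferred from the filter list; only status appears in the documented path. No request is sent, and authentication is omitted.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def jobs_list_url(
    base_url: str,
    *,
    status: Optional[str] = None,
    tenant: Optional[str] = None,
    job_type: Optional[str] = None,
) -> str:
    """Build the GET /api/jobs URL, skipping filters that are not set."""
    params = {"status": status, "tenant": tenant, "jobType": job_type}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url.rstrip('/')}/api/jobs"
    return f"{url}?{query}" if query else url


def replay_url(base_url: str, job_id: str) -> str:
    """Build the POST /api/jobs/{id}/replay URL with the id percent-encoded."""
    return f"{base_url.rstrip('/')}/api/jobs/{quote(job_id, safe='')}/replay"


if __name__ == "__main__":
    # Placeholder host; a real deployment would supply its own base URL.
    print(jobs_list_url("https://orchestrator.example", status="failed", tenant="acme"))
    print(replay_url("https://orchestrator.example", "job-123"))
```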
5) Observability
- Metrics: job_queue_depth{jobType,tenant}, job_latency_seconds, job_failures_total, job_retry_total, lease_extensions_total.
- Logs: structured with jobId, jobType, tenant, workerId, leaseId, status. Incident logs flagged for Ops.
- Traces: spans covering enqueue, schedule, lease, worker_execute, complete. Trace IDs propagate to worker spans for end-to-end correlation.
6) Offline support
- Orchestrator exports audit bundles: jobs.jsonl, history.jsonl, throttles.jsonl, manifest.json, signatures/. Used for offline investigations and compliance.
- Replay manifests contain job digests and success/failure notes for deterministic proof.
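An offline integrity check over such a bundle could look roughly like the sketch below. The manifest layout is an assumption (a "files" object mapping file name to a SHA-256 hex digest); the real manifest.json schema is not described here, and verification of the files under signatures/ is out of scope for this example.

```python
from __future__ import annotations

import hashlib
import json


def build_manifest(files: dict[str, bytes]) -> str:
    """Produce a manifest in the assumed layout: {"files": {name: sha256}}."""
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    return json.dumps({"files": digests}, sort_keys=True)


def verify_bundle(manifest_json: str, files: dict[str, bytes]) -> list[str]:
    """Return a list of problems; an empty list means every digest matched."""
    expected = json.loads(manifest_json)["files"]
    problems = []
    for name, digest in expected.items():
        data = files.get(name)
        if data is None:
            problems.append(f"missing: {name}")
        elif hashlib.sha256(data).hexdigest() != digest:
            problems.append(f"digest mismatch: {name}")
    return problems
```

Returning a list of problems rather than raising on the first mismatch lets an investigator see every damaged or missing file in one pass.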
7) Operational considerations
- HA deployment with multiple API instances; queue storage determines redundancy strategy.
- Support for maintenance mode halting leases while allowing status inspection.
- Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).