Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
This commit introduces the OpenSslLegacyShim class, which sets the LD_LIBRARY_PATH environment variable to include the directory containing OpenSSL 1.1 native libraries. This is necessary for Mongo2Go to function correctly on Linux platforms that do not ship these libraries by default. The shim checks if the current operating system is Linux and whether the required directory exists before modifying the environment variable.
4.3 KiB
4.3 KiB
Source & Job Orchestrator architecture
Based on Epic 9 – Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.
1) Topology
- Orchestrator API (
StellaOps.Orchestrator). Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (orchestrator:*). - Job ledger (Mongo). Collections
jobs,job_history,sources,quotas,throttles,incidents. Append-only history ensures auditability. - Queue abstraction. Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- Dashboard feeds. SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
2) Job lifecycle
- Enqueue. Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit
JobRequestrecords containingjobType,tenant,priority,payloadDigest,dependencies. - Scheduling. Orchestrator applies quotas and rate limits per
{tenant, jobType}. Jobs exceeding limits are staged in pending queue with next eligible timestamp. - Leasing. Workers poll
LeaseJobendpoint; Orchestrator returns job withleaseId,leaseUntil, and instrumentation tokens. Lease renewal required for long-running tasks. - Completion. Worker reports status (
succeeded,failed,canceled,timed_out). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded. - Replay. Operators trigger
POST /jobs/{id}/replaywhich clones job payload, setsreplayOfpointer, and requeues with high priority while preserving determinism metadata.
3) Rate-limit & quota governance
- Quotas defined per tenant/profile (
maxActive,maxPerHour,burst). Stored inquotasand enforced before leasing. - Dynamic throttles allow ops to pause specific sources (
pauseSource,resumeSource) or reduce concurrency. - Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
- Control plane quota updates require Authority scope
orch:quota(issued viaOrch.Adminrole). Historical rebuilds/backfills additionally requireorch:backfilland must supplybackfill_reasonandbackfill_ticketalongside the operator metadata. Authority persists all four fields (quota_reason,quota_ticket,backfill_reason,backfill_ticket) for audit replay.
4) APIs
GET /api/jobs?status=— list jobs with filters (tenant, jobType, status, time window).GET /api/jobs/{id}— job detail (payload digest, attempts, worker, lease history, metrics).POST /api/jobs/{id}/cancel— cancel running/pending job with audit reason.POST /api/jobs/{id}/replay— schedule replay.POST /api/limits/throttle— apply throttle (requires elevated scope).GET /api/dashboard/metrics— aggregated metrics for Console dashboards.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
5) Observability
- Metrics:
job_queue_depth{jobType,tenant},job_latency_seconds,job_failures_total,job_retry_total,lease_extensions_total. - Logs: structured with
jobId,jobType,tenant,workerId,leaseId,status. Incident logs flagged for Ops. - Traces: spans covering
enqueue,schedule,lease,worker_execute,complete. Trace IDs propagate to worker spans for end-to-end correlation.
6) Offline support
- Orchestrator exports audit bundles:
jobs.jsonl,history.jsonl,throttles.jsonl,manifest.json,signatures/. Used for offline investigations and compliance. - Replay manifests contain job digests and success/failure notes for deterministic proof.
7) Operational considerations
- HA deployment with multiple API instances; queue storage determines redundancy strategy.
- Support for
maintenancemode halting leases while allowing status inspection. - Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).