# Source & Job Orchestrator architecture > Based on Epic 9 – Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability. ## 1) Topology - **Orchestrator API (`StellaOps.Orchestrator`).** Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (`orchestrator:*`). - **Job ledger (PostgreSQL).** Tables `jobs`, `job_history`, `sources`, `quotas`, `throttles`, `incidents` (schema `orchestrator`). Append-only history ensures auditability. - **Queue abstraction.** Supports Valkey Streams or NATS JetStream (pluggable). Each job carries lease metadata and retry policy. - **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status. ## 2) Job lifecycle 1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`. 2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in pending queue with next eligible timestamp. 3. **Leasing (Task Runner bridge).** Workers poll `LeaseJob` endpoint; Orchestrator returns job with `leaseId`, `leaseUntil`, `idempotencyKey`, and instrumentation tokens. Lease renewal required for long-running tasks; leases carry retry hints and provenance (`tenant`, `project`, `correlationId`, `taskRunnerId`). 4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded. 5. **Replay.** Operators trigger `POST /jobs/{id}/replay` which clones job payload, sets `replayOf` pointer, and requeues with high priority while preserving determinism metadata. ### Pack-run lifecycle (phase III) - **Register** `pack-run` job type with task runner hints (artifacts, log channel, heartbeat cadence). - **Logs/Artifacts**: SSE/WS stream keyed by `packRunId` + `tenant/project`; artifacts published with content digests and URI metadata. - **Events**: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending ORCH-SVC-37-101 final spec. ## 3) Rate-limit & quota governance - Quotas defined per tenant/profile (`maxActive`, `maxPerHour`, `burst`). Stored in `quotas` and enforced before leasing. - Dynamic throttles allow ops to pause specific sources (`pauseSource`, `resumeSource`) or reduce concurrency. - Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack. - Control plane quota updates require Authority scope `orch:quota` (issued via `Orch.Admin` role). Historical rebuilds/backfills additionally require `orch:backfill` and must supply `backfill_reason` and `backfill_ticket` alongside the operator metadata. Authority persists all four fields (`quota_reason`, `quota_ticket`, `backfill_reason`, `backfill_ticket`) for audit replay. ### 3.1) Quota governance service The `QuotaGovernanceService` provides cross-tenant quota allocation with configurable policies: **Allocation strategies:** - `Equal` — Divide total capacity equally among all active tenants. - `Proportional` — Allocate based on tenant weight/priority tier. - `Priority` — Higher priority tenants get allocation first, with preemption. - `ReservedWithFairShare` — Reserved minimum per tenant, remainder distributed fairly. - `Fixed` — Static allocation per tenant regardless of demand. **Key operations:** - `CalculateAllocationAsync` — Compute quota for a tenant based on active policies. - `RequestQuotaAsync` — Request quota from shared pool; returns granted amount with burst usage. - `ReleaseQuotaAsync` — Return quota to shared pool after job completion. - `CanScheduleAsync` — Check scheduling eligibility combining quota and circuit breaker state. **Quota allocation policy properties:** - `TotalCapacity` — Pool size to allocate from (for proportional/fair strategies). - `MinimumPerTenant` / `MaximumPerTenant` — Allocation bounds. - `ReservedCapacity` — Guaranteed capacity for high-priority tenants. - `AllowBurst` / `BurstMultiplier` — Allow temporary overallocation when capacity exists. - `Priority` — Policy evaluation order (higher = first). - `JobType` — Optional job type filter (null = applies to all). ### 3.2) Circuit breaker service The `CircuitBreakerService` implements the circuit breaker pattern for downstream services: **States:** - `Closed` — Normal operation; requests pass through. Failures are tracked. - `Open` — Circuit tripped; requests are blocked for `OpenDuration`. Prevents cascade failures. - `HalfOpen` — After open duration, limited test requests allowed. Success → Closed; Failure → Open. **Thresholds:** - `FailureThreshold` (0.0–1.0) — Failure rate that triggers circuit open. - `WindowDuration` — Sliding window for failure rate calculation. - `MinimumSamples` — Minimum requests before circuit can trip. - `OpenDuration` — How long circuit stays open before half-open transition. - `HalfOpenTestCount` — Number of test requests allowed in half-open state. **Key operations:** - `CheckAsync` — Verify if request is allowed; returns `CircuitBreakerCheckResult`. - `RecordSuccessAsync` / `RecordFailureAsync` — Update circuit state after request. - `ForceOpenAsync` / `ForceCloseAsync` — Manual operator intervention (audited). - `ListAsync` — View all circuit breakers for a tenant with optional state filter. **Downstream services protected:** - Scanner - Attestor - Policy Engine - Registry clients - External integrations ## 4) APIs ### 4.1) Job management - `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window). - `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics). - `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason. - `POST /api/jobs/{id}/replay` — schedule replay. - `POST /api/limits/throttle` — apply throttle (requires elevated scope). - `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards. ### 4.2) Circuit breaker endpoints (`/api/v1/orchestrator/circuit-breakers`) - `GET /` — List all circuit breakers for tenant (optional `?state=` filter). - `GET /{serviceId}` — Get circuit breaker state for specific downstream service. - `GET /{serviceId}/check` — Check if requests are allowed; returns `IsAllowed`, `State`, `FailureRate`, `TimeUntilRetry`. - `POST /{serviceId}/success` — Record successful request to downstream service. - `POST /{serviceId}/failure` — Record failed request (body: `failureReason`). - `POST /{serviceId}/force-open` — Manually open circuit (body: `reason`; audited). - `POST /{serviceId}/force-close` — Manually close circuit (audited). ### 4.3) Quota governance endpoints (`/api/v1/orchestrator/quota-governance`) - `GET /policies` — List quota allocation policies (optional `?enabled=` filter). - `GET /policies/{policyId}` — Get specific policy. - `POST /policies` — Create new policy. - `PUT /policies/{policyId}` — Update policy. - `DELETE /policies/{policyId}` — Delete policy. - `GET /allocation` — Calculate allocation for current tenant (optional `?jobType=`). - `POST /request` — Request quota from pool (body: `jobType`, `requestedAmount`). - `POST /release` — Release quota back to pool (body: `jobType`, `releasedAmount`). - `GET /status` — Get tenant quota status (optional `?jobType=`). - `GET /summary` — Get quota governance summary across all tenants (optional `?policyId=`). - `GET /can-schedule` — Check if job can be scheduled (optional `?jobType=`). ### 4.4) Discovery and documentation - Event envelope draft (`docs/modules/orchestrator/event-envelope.md`) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events. - OpenAPI discovery: `/.well-known/openapi` exposes `/openapi/orchestrator.json` (OAS 3.1) with pagination/idempotency/error-envelope examples; legacy job detail/summary endpoints now ship `Deprecation` + `Link` headers that point to their replacements. All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation. ## 5) Observability - Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`. - Task Runner bridge adds `pack_run_logs_stream_lag_seconds`, `pack_run_heartbeats_total`, `pack_run_artifacts_total`. - Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops. - Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation. ## 6) Offline support - Orchestrator exports audit bundles: `jobs.jsonl`, `history.jsonl`, `throttles.jsonl`, `manifest.json`, `signatures/`. Used for offline investigations and compliance. - Replay manifests contain job digests and success/failure notes for deterministic proof. ## 7) Operational considerations - HA deployment with multiple API instances; queue storage determines redundancy strategy. - Support for `maintenance` mode halting leases while allowing status inspection. - Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).