Source & Job Orchestrator architecture
Based on Epic 9 – Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.
1) Topology
- Orchestrator API (`StellaOps.Orchestrator`). Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (`orchestrator:*`).
- Job ledger (PostgreSQL). Tables `jobs`, `job_history`, `sources`, `quotas`, `throttles`, `incidents` (schema `orchestrator`). Append-only history ensures auditability.
- Queue abstraction. Supports Valkey Streams or NATS JetStream (pluggable). Each job carries lease metadata and a retry policy.
- Dashboard feeds. SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
2) Job lifecycle
- Enqueue. Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
- Scheduling. Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in the pending queue with a next-eligible timestamp.
- Leasing (Task Runner bridge). Workers poll the `LeaseJob` endpoint; Orchestrator returns the job with `leaseId`, `leaseUntil`, `idempotencyKey`, and instrumentation tokens. Lease renewal is required for long-running tasks; leases carry retry hints and provenance (`tenant`, `project`, `correlationId`, `taskRunnerId`).
- Completion. Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies the retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds are exceeded.
- Replay. Operators trigger `POST /jobs/{id}/replay`, which clones the job payload, sets a `replayOf` pointer, and requeues with high priority while preserving determinism metadata.
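The completion/retry rules above can be sketched as follows. This is an illustrative sketch, not the Orchestrator's actual API: `RetryPolicy`, `next_action`, the default delays, and the canceled-job handling are assumptions; only the status names, exponential backoff, and max-attempts escalation come from the text.

```python
# Illustrative sketch of the completion/retry decision; names and defaults
# are assumptions, not the real service surface.
class RetryPolicy:
    def __init__(self, base_delay_s=30.0, factor=2.0,
                 max_delay_s=3600.0, max_attempts=5):
        self.base_delay_s = base_delay_s
        self.factor = factor
        self.max_delay_s = max_delay_s
        self.max_attempts = max_attempts

    def backoff_seconds(self, attempt):
        # attempt is 1-based: the first retry waits base_delay_s.
        return min(self.base_delay_s * self.factor ** (attempt - 1),
                   self.max_delay_s)


def next_action(status, attempt, policy):
    """Map a worker-reported status to the orchestrator's next step."""
    if status in ("succeeded", "canceled"):
        return "archive"
    if status in ("failed", "timed_out"):
        if attempt >= policy.max_attempts:
            return "incident"  # escalate to Ops once max attempts are exhausted
        return f"retry_in:{policy.backoff_seconds(attempt):.0f}s"
    raise ValueError(f"unknown status: {status}")
```

With the assumed defaults, the first failure yields a 30 s backoff and attempt five escalates to an incident.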
Pack-run lifecycle (phase III)
- Register the `pack-run` job type with task runner hints (artifacts, log channel, heartbeat cadence).
- Logs/Artifacts: SSE/WS stream keyed by `packRunId` + tenant/project; artifacts published with content digests and URI metadata.
- Events: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending the ORCH-SVC-37-101 final spec.
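One possible shape of such a pack-run event envelope, shown only to make the provenance fields concrete: the function name and the digest-based idempotency-key derivation are assumptions pending the ORCH-SVC-37-101 spec; only the field names come from the text.

```python
import hashlib
import json

# Hypothetical pack-run envelope builder; the idempotency key is derived
# deterministically so replays of the same payload produce the same key.
def pack_run_envelope(tenant, project, correlation_id, pack_run_id, payload):
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        "tenant": tenant,
        "project": project,
        "correlationId": correlation_id,
        "packRunId": pack_run_id,
        "idempotencyKey": hashlib.sha256(
            f"{tenant}/{pack_run_id}/{body}".encode()).hexdigest(),
        "payloadDigest": hashlib.sha256(body.encode()).hexdigest(),
    }
```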
3) Rate-limit & quota governance
- Quotas defined per tenant/profile (`maxActive`, `maxPerHour`, `burst`). Stored in `quotas` and enforced before leasing.
- Dynamic throttles allow ops to pause specific sources (`pauseSource`, `resumeSource`) or reduce concurrency.
- Circuit breakers automatically pause job types when the failure rate exceeds a configured threshold; incidents are generated via Notify and the Observability stack.
- Control-plane quota updates require Authority scope `orch:quota` (issued via the `Orch.Admin` role). Historical rebuilds/backfills additionally require `orch:backfill` and must supply `backfill_reason` and `backfill_ticket` alongside the operator metadata. Authority persists all four fields (`quota_reason`, `quota_ticket`, `backfill_reason`, `backfill_ticket`) for audit replay.
3.1) Quota governance service
The `QuotaGovernanceService` provides cross-tenant quota allocation with configurable policies:
Allocation strategies:
- `Equal` — Divide total capacity equally among all active tenants.
- `Proportional` — Allocate based on tenant weight/priority tier.
- `Priority` — Higher-priority tenants get allocation first, with preemption.
- `ReservedWithFairShare` — Reserved minimum per tenant; remainder distributed fairly.
- `Fixed` — Static allocation per tenant regardless of demand.
Key operations:
- `CalculateAllocationAsync` — Compute quota for a tenant based on active policies.
- `RequestQuotaAsync` — Request quota from the shared pool; returns the granted amount with burst usage.
- `ReleaseQuotaAsync` — Return quota to the shared pool after job completion.
- `CanScheduleAsync` — Check scheduling eligibility, combining quota and circuit breaker state.
Quota allocation policy properties:
- `TotalCapacity` — Pool size to allocate from (for proportional/fair strategies).
- `MinimumPerTenant` / `MaximumPerTenant` — Allocation bounds.
- `ReservedCapacity` — Guaranteed capacity for high-priority tenants.
- `AllowBurst` / `BurstMultiplier` — Allow temporary overallocation when capacity exists.
- `Priority` — Policy evaluation order (higher = first).
- `JobType` — Optional job type filter (null = applies to all).
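To make one strategy concrete, here is a simplified stand-in for the `Proportional` strategy with `MinimumPerTenant`/`MaximumPerTenant` bounds. The function is illustrative and ignores demand, reservations, and burst; it is not the `CalculateAllocationAsync` implementation.

```python
# Sketch of Proportional allocation with per-tenant bounds; illustrative only.
def proportional_allocation(total_capacity, weights, minimum=0, maximum=None):
    """weights maps tenant -> weight; returns tenant -> allocated units."""
    total_weight = sum(weights.values()) or 1
    alloc = {}
    for tenant, weight in sorted(weights.items()):
        share = int(total_capacity * weight / total_weight)
        share = max(share, minimum)          # MinimumPerTenant floor
        if maximum is not None:
            share = min(share, maximum)      # MaximumPerTenant cap
        alloc[tenant] = share
    return alloc
```

Note that applying the floor/cap after the proportional split means the bounded result may not sum exactly to `TotalCapacity`; a real allocator would redistribute the remainder.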
3.2) Circuit breaker service
The `CircuitBreakerService` implements the circuit breaker pattern for downstream services:
States:
- `Closed` — Normal operation; requests pass through. Failures are tracked.
- `Open` — Circuit tripped; requests are blocked for `OpenDuration`. Prevents cascade failures.
- `HalfOpen` — After the open duration, limited test requests are allowed. Success → `Closed`; failure → `Open`.
Thresholds:
- `FailureThreshold` (0.0–1.0) — Failure rate that triggers circuit open.
- `WindowDuration` — Sliding window for failure rate calculation.
- `MinimumSamples` — Minimum requests before the circuit can trip.
- `OpenDuration` — How long the circuit stays open before the half-open transition.
- `HalfOpenTestCount` — Number of test requests allowed in the half-open state.
Key operations:
- `CheckAsync` — Verify whether a request is allowed; returns `CircuitBreakerCheckResult`.
- `RecordSuccessAsync` / `RecordFailureAsync` — Update circuit state after a request.
- `ForceOpenAsync` / `ForceCloseAsync` — Manual operator intervention (audited).
- `ListAsync` — View all circuit breakers for a tenant, with an optional state filter.
Downstream services protected:
- Scanner
- Attestor
- Policy Engine
- Registry clients
- External integrations
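The state machine above can be sketched compactly. This is an illustrative single-process model, not the `CircuitBreakerService`: the threshold names mirror the list above, but the half-open bookkeeping is simplified (the circuit closes once the last allowed test request succeeds, and any test failure reopens it immediately).

```python
from collections import deque

# Illustrative Closed/Open/HalfOpen circuit breaker; simplified sketch.
class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_duration=60.0,
                 minimum_samples=5, open_duration=30.0, half_open_test_count=2):
        self.failure_threshold = failure_threshold
        self.window_duration = window_duration
        self.minimum_samples = minimum_samples
        self.open_duration = open_duration
        self.half_open_test_count = half_open_test_count
        self.state = "Closed"
        self.samples = deque()      # (timestamp, ok) within the sliding window
        self.opened_at = 0.0
        self.half_open_left = 0

    def allow(self, now):
        """CheckAsync analogue: may a request pass right now?"""
        if self.state == "Open":
            if now - self.opened_at >= self.open_duration:
                self.state = "HalfOpen"
                self.half_open_left = self.half_open_test_count
            else:
                return False
        if self.state == "HalfOpen":
            if self.half_open_left <= 0:
                return False
            self.half_open_left -= 1
        return True

    def record(self, ok, now):
        """RecordSuccessAsync/RecordFailureAsync analogue."""
        if self.state == "HalfOpen":
            if not ok:
                self.state = "Open"          # test failed: reopen
                self.opened_at = now
            elif self.half_open_left == 0:
                self.state = "Closed"        # last test passed: close
                self.samples.clear()
            return
        self.samples.append((now, ok))
        while self.samples and now - self.samples[0][0] >= self.window_duration:
            self.samples.popleft()
        failures = sum(1 for _, s in self.samples if not s)
        if (len(self.samples) >= self.minimum_samples
                and failures / len(self.samples) > self.failure_threshold):
            self.state = "Open"
            self.opened_at = now
```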
4) APIs
4.1) Job management
- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
- `POST /api/jobs/{id}/cancel` — cancel a running/pending job with an audit reason.
- `POST /api/jobs/{id}/replay` — schedule a replay.
- `POST /api/limits/throttle` — apply a throttle (requires elevated scope).
- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
4.2) Circuit breaker endpoints (`/api/v1/orchestrator/circuit-breakers`)
- `GET /` — List all circuit breakers for the tenant (optional `?state=` filter).
- `GET /{serviceId}` — Get circuit breaker state for a specific downstream service.
- `GET /{serviceId}/check` — Check whether requests are allowed; returns `IsAllowed`, `State`, `FailureRate`, `TimeUntilRetry`.
- `POST /{serviceId}/success` — Record a successful request to the downstream service.
- `POST /{serviceId}/failure` — Record a failed request (body: `failureReason`).
- `POST /{serviceId}/force-open` — Manually open the circuit (body: `reason`; audited).
- `POST /{serviceId}/force-close` — Manually close the circuit (audited).
4.3) Quota governance endpoints (`/api/v1/orchestrator/quota-governance`)
- `GET /policies` — List quota allocation policies (optional `?enabled=` filter).
- `GET /policies/{policyId}` — Get a specific policy.
- `POST /policies` — Create a new policy.
- `PUT /policies/{policyId}` — Update a policy.
- `DELETE /policies/{policyId}` — Delete a policy.
- `GET /allocation` — Calculate allocation for the current tenant (optional `?jobType=`).
- `POST /request` — Request quota from the pool (body: `jobType`, `requestedAmount`).
- `POST /release` — Release quota back to the pool (body: `jobType`, `releasedAmount`).
- `GET /status` — Get tenant quota status (optional `?jobType=`).
- `GET /summary` — Get the quota governance summary across all tenants (optional `?policyId=`).
- `GET /can-schedule` — Check whether a job can be scheduled (optional `?jobType=`).
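As a usage illustration, a thin client might compose the request/release calls like this. Only the endpoint paths and body field names come from the list above; the base path constant, the helper names, and the transport being left to the caller are assumptions.

```python
import json

# Hypothetical helpers composing quota-governance requests; illustrative only.
BASE = "/api/v1/orchestrator/quota-governance"


def quota_request(job_type, amount):
    """Build (path, JSON body) for POST /request."""
    body = json.dumps({"jobType": job_type, "requestedAmount": amount},
                      sort_keys=True)
    return f"{BASE}/request", body


def quota_release(job_type, amount):
    """Build (path, JSON body) for POST /release."""
    body = json.dumps({"jobType": job_type, "releasedAmount": amount},
                      sort_keys=True)
    return f"{BASE}/release", body
```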
4.4) Discovery and documentation
- Event envelope draft (`docs/modules/orchestrator/event-envelope.md`) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.
- OpenAPI discovery: `/.well-known/openapi` exposes `/openapi/orchestrator.json` (OAS 3.1) with pagination/idempotency/error-envelope examples; legacy job detail/summary endpoints now ship `Deprecation` + `Link` headers that point to their replacements.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
5) Observability
- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`. The Task Runner bridge adds `pack_run_logs_stream_lag_seconds`, `pack_run_heartbeats_total`, `pack_run_artifacts_total`.
- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs are flagged for Ops.
- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.
6) Offline support
- Orchestrator exports audit bundles: `jobs.jsonl`, `history.jsonl`, `throttles.jsonl`, `manifest.json`, `signatures/`. Used for offline investigations and compliance.
- Replay manifests contain job digests and success/failure notes for deterministic proof.
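Assembling such a bundle manifest deterministically might look like the following sketch; the manifest schema shown here (name/sha256/size entries, sorted by file name) is an assumption for illustration, not the actual export format.

```python
import hashlib

# Illustrative audit-bundle manifest builder; sorted input keeps the
# output byte-for-byte reproducible for offline reconciliation.
def build_manifest(files):
    """files maps bundle file name -> bytes content."""
    return {
        "files": [
            {
                "name": name,
                "sha256": hashlib.sha256(content).hexdigest(),
                "size": len(content),
            }
            for name, content in sorted(files.items())
        ]
    }
```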
7) Operational considerations
- HA deployment with multiple API instances; queue storage determines redundancy strategy.
- Support for a `maintenance` mode halting leases while allowing status inspection.
- Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).