# Source & Job Orchestrator architecture

> Based on Epic 9 – Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.

## 1) Topology

- **Orchestrator API (`StellaOps.JobEngine`).** Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (`orchestrator:*`).
- **Job ledger (PostgreSQL).** Tables `jobs`, `job_history`, `sources`, `quotas`, `throttles`, `incidents` (schema `orchestrator`). Startup migrations execute with the PostgreSQL `search_path` bound to `orchestrator, public` so unqualified DDL lands in the module schema during scratch installs and resets. Append-only history ensures auditability.
- **Queue abstraction.** Supports Valkey Streams or NATS JetStream (pluggable). Each job carries lease metadata and a retry policy.
- **Dashboard feeds.** SSE/GraphQL endpoints supply the Console UI with job timelines, throughput, error distributions, and rate-limit status.
## 2) Job lifecycle

1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
2. **Scheduling.** The Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in the pending queue with a next-eligible timestamp.
3. **Leasing (Task Runner bridge).** Workers poll the `LeaseJob` endpoint; the Orchestrator returns the job with `leaseId`, `leaseUntil`, `idempotencyKey`, and instrumentation tokens. Lease renewal is required for long-running tasks; leases carry retry hints and provenance (`tenant`, `project`, `correlationId`, `taskRunnerId`).
4. **Completion.** The worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure the Orchestrator applies the retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds are exceeded.
5. **Replay.** Operators trigger `POST /jobs/{id}/replay`, which clones the job payload, sets a `replayOf` pointer, and requeues with high priority while preserving determinism metadata.
### Pack-run lifecycle (phase III)

- **Register**: the `pack-run` job type is registered with task runner hints (artifacts, log channel, heartbeat cadence).
- **Logs/Artifacts**: SSE/WS streams keyed by `packRunId` + `tenant/project`; artifacts published with content digests and URI metadata.
- **Events**: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending the final ORCH-SVC-37-101 spec.

## 3) Rate-limit & quota governance

- Quotas are defined per tenant/profile (`maxActive`, `maxPerHour`, `burst`), stored in `quotas`, and enforced before leasing.
- Dynamic throttles allow ops to pause specific sources (`pauseSource`, `resumeSource`) or reduce concurrency.
- Circuit breakers automatically pause job types when the failure rate exceeds the configured threshold; incidents are generated via Notify and the Observability stack.
- Control-plane quota updates require the Authority scope `orch:quota` (issued via the `Orch.Admin` role). Historical rebuilds/backfills additionally require `orch:backfill` and must supply `backfill_reason` and `backfill_ticket` alongside the operator metadata. Authority persists all four fields (`quota_reason`, `quota_ticket`, `backfill_reason`, `backfill_ticket`) for audit replay.
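
The scope and audit-field gate described above can be sketched as a simple validator. Which fields are mandatory on non-backfill updates is an assumption here; the scope and field names come from the bullets above:

```python
def validate_quota_update(scopes: set[str], fields: dict, is_backfill: bool) -> list[str]:
    """Return the list of violations for a control-plane quota update; empty means it may proceed."""
    errors = []
    if "orch:quota" not in scopes:
        errors.append("missing scope orch:quota")
    # Operator metadata assumed required on every quota update.
    for key in ("quota_reason", "quota_ticket"):
        if not fields.get(key):
            errors.append(f"missing field {key}")
    if is_backfill:
        if "orch:backfill" not in scopes:
            errors.append("missing scope orch:backfill")
        for key in ("backfill_reason", "backfill_ticket"):
            if not fields.get(key):
                errors.append(f"missing field {key}")
    return errors
```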
### 3.1) Quota governance service

The `QuotaGovernanceService` provides cross-tenant quota allocation with configurable policies:

**Allocation strategies:**

- `Equal` — Divide total capacity equally among all active tenants.
- `Proportional` — Allocate based on tenant weight/priority tier.
- `Priority` — Higher-priority tenants get allocation first, with preemption.
- `ReservedWithFairShare` — Reserved minimum per tenant, remainder distributed fairly.
- `Fixed` — Static allocation per tenant regardless of demand.
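
Two of these strategies can be illustrated with a minimal sketch; the function names and the deterministic remainder rule (sorted tenant order) are assumptions, not the service's actual behavior:

```python
def allocate_equal(total: int, tenants: list[str]) -> dict[str, int]:
    """Equal strategy: divide capacity evenly; remainder goes to the first tenants in sorted order."""
    if not tenants:
        return {}
    share, rem = divmod(total, len(tenants))
    return {t: share + (1 if i < rem else 0) for i, t in enumerate(sorted(tenants))}

def allocate_reserved_fair_share(total: int, reserved: dict[str, int]) -> dict[str, int]:
    """ReservedWithFairShare: honor per-tenant reserved minimums, then split the remainder equally."""
    remainder = total - sum(reserved.values())
    fair = allocate_equal(max(remainder, 0), list(reserved))
    return {t: reserved[t] + fair[t] for t in reserved}
```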
**Key operations:**

- `CalculateAllocationAsync` — Compute quota for a tenant based on active policies.
- `RequestQuotaAsync` — Request quota from the shared pool; returns the granted amount with burst usage.
- `ReleaseQuotaAsync` — Return quota to the shared pool after job completion.
- `CanScheduleAsync` — Check scheduling eligibility, combining quota and circuit breaker state.

**Quota allocation policy properties:**

- `TotalCapacity` — Pool size to allocate from (for proportional/fair strategies).
- `MinimumPerTenant` / `MaximumPerTenant` — Allocation bounds.
- `ReservedCapacity` — Guaranteed capacity for high-priority tenants.
- `AllowBurst` / `BurstMultiplier` — Allow temporary overallocation when capacity exists.
- `Priority` — Policy evaluation order (higher = first).
- `JobType` — Optional job type filter (null = applies to all).
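
The bounds and burst properties compose roughly as follows. The field names mirror the list above, but the clamping order and the rule that burst is limited by actual spare capacity are assumptions:

```python
def effective_quota(base_allocation: int, minimum: int, maximum: int,
                    allow_burst: bool, burst_multiplier: float,
                    spare_capacity: int) -> int:
    """Clamp a computed allocation to policy bounds, then optionally burst into spare capacity."""
    quota = max(minimum, min(base_allocation, maximum))
    if allow_burst and spare_capacity > 0:
        # Temporary overallocation, bounded by both the multiplier and what is actually spare.
        quota = min(int(quota * burst_multiplier), quota + spare_capacity)
    return quota
```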
### 3.2) Circuit breaker service

The `CircuitBreakerService` implements the circuit breaker pattern for downstream services:

**States:**

- `Closed` — Normal operation; requests pass through. Failures are tracked.
- `Open` — Circuit tripped; requests are blocked for `OpenDuration`. Prevents cascade failures.
- `HalfOpen` — After the open duration, limited test requests are allowed. Success → Closed; failure → Open.

**Thresholds:**

- `FailureThreshold` (0.0–1.0) — Failure rate that triggers circuit open.
- `WindowDuration` — Sliding window for failure-rate calculation.
- `MinimumSamples` — Minimum requests before the circuit can trip.
- `OpenDuration` — How long the circuit stays open before the half-open transition.
- `HalfOpenTestCount` — Number of test requests allowed in the half-open state.
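
A minimal state machine over these thresholds might look like the sketch below. It slides the window by sample count rather than implementing a true time-based `WindowDuration`, and all names are illustrative rather than the service's actual API:

```python
class CircuitBreaker:
    """Minimal Closed/Open/HalfOpen state machine over a fixed-size sample window."""

    def __init__(self, failure_threshold=0.5, minimum_samples=10,
                 open_duration=30.0, half_open_test_count=3):
        self.failure_threshold = failure_threshold
        self.minimum_samples = minimum_samples
        self.open_duration = open_duration
        self.half_open_test_count = half_open_test_count
        self.state = "Closed"
        self.samples: list[bool] = []   # True = failure
        self.opened_at = 0.0
        self.half_open_successes = 0

    def allow(self, now: float) -> bool:
        """Check whether a request may pass; transitions Open -> HalfOpen after open_duration."""
        if self.state == "Open":
            if now - self.opened_at >= self.open_duration:
                self.state = "HalfOpen"
                self.half_open_successes = 0
                return True
            return False
        return True

    def record(self, failed: bool, now: float) -> None:
        """Record a request outcome and update the circuit state."""
        if self.state == "HalfOpen":
            if failed:
                self._trip(now)  # any failure during probing re-opens the circuit
            else:
                self.half_open_successes += 1
                if self.half_open_successes >= self.half_open_test_count:
                    self.state, self.samples = "Closed", []
            return
        self.samples = (self.samples + [failed])[-100:]  # window by count, not time
        if (len(self.samples) >= self.minimum_samples and
                sum(self.samples) / len(self.samples) >= self.failure_threshold):
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state, self.opened_at = "Open", now
```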
**Key operations:**

- `CheckAsync` — Verify whether a request is allowed; returns `CircuitBreakerCheckResult`.
- `RecordSuccessAsync` / `RecordFailureAsync` — Update circuit state after a request.
- `ForceOpenAsync` / `ForceCloseAsync` — Manual operator intervention (audited).
- `ListAsync` — View all circuit breakers for a tenant, with an optional state filter.

**Downstream services protected:**

- Scanner
- Attestor
- Policy Engine
- Registry clients
- External integrations
## 4) APIs

### 4.1) Job management

- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
- `POST /api/jobs/{id}/cancel` — cancel a running/pending job with an audit reason.
- `POST /api/jobs/{id}/replay` — schedule a replay.
- `POST /api/limits/throttle` — apply a throttle (requires an elevated scope).
- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
### 4.2) Circuit breaker endpoints (`/api/v1/jobengine/circuit-breakers`)

- `GET /` — List all circuit breakers for the tenant (optional `?state=` filter).
- `GET /{serviceId}` — Get circuit breaker state for a specific downstream service.
- `GET /{serviceId}/check` — Check if requests are allowed; returns `IsAllowed`, `State`, `FailureRate`, `TimeUntilRetry`.
- `POST /{serviceId}/success` — Record a successful request to the downstream service.
- `POST /{serviceId}/failure` — Record a failed request (body: `failureReason`).
- `POST /{serviceId}/force-open` — Manually open the circuit (body: `reason`; audited).
- `POST /{serviceId}/force-close` — Manually close the circuit (audited).
### 4.3) Quota governance endpoints (`/api/v1/jobengine/quota-governance`)

- `GET /policies` — List quota allocation policies (optional `?enabled=` filter).
- `GET /policies/{policyId}` — Get a specific policy.
- `POST /policies` — Create a new policy.
- `PUT /policies/{policyId}` — Update a policy.
- `DELETE /policies/{policyId}` — Delete a policy.
- `GET /allocation` — Calculate allocation for the current tenant (optional `?jobType=`).
- `POST /request` — Request quota from the pool (body: `jobType`, `requestedAmount`).
- `POST /release` — Release quota back to the pool (body: `jobType`, `releasedAmount`).
- `GET /status` — Get tenant quota status (optional `?jobType=`).
- `GET /summary` — Get a quota governance summary across all tenants (optional `?policyId=`).
- `GET /can-schedule` — Check if a job can be scheduled (optional `?jobType=`).
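
As a usage sketch for the request/release pair, the helpers below build the paths and JSON bodies named in the endpoint list; the payload field names come from the bullets above, while everything else (helper names, byte-stable serialization) is an assumption:

```python
import json

BASE = "/api/v1/jobengine/quota-governance"

def build_quota_request(job_type: str, amount: int) -> tuple[str, str]:
    """Build the path and JSON body for POST .../quota-governance/request."""
    body = json.dumps({"jobType": job_type, "requestedAmount": amount}, sort_keys=True)
    return f"{BASE}/request", body

def build_quota_release(job_type: str, amount: int) -> tuple[str, str]:
    """Build the path and JSON body for POST .../quota-governance/release."""
    body = json.dumps({"jobType": job_type, "releasedAmount": amount}, sort_keys=True)
    return f"{BASE}/release", body
```

Every granted `request` should eventually be matched by a `release` so the shared pool drains back correctly after job completion.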
### 4.4) Discovery and documentation

- The event envelope draft (`docs/modules/jobengine/event-envelope.md`) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.
- OpenAPI discovery: `/.well-known/openapi` exposes `/openapi/jobengine.json` (OAS 3.1) with pagination/idempotency/error-envelope examples; legacy job detail/summary endpoints now ship `Deprecation` + `Link` headers that point to their replacements.

### 4.5) Release control plane dashboard endpoints

- `GET /api/v1/release-jobengine/dashboard` — control-plane dashboard payload (pipeline, pending approvals, active deployments, recent releases).
- `POST /api/v1/release-jobengine/promotions/{id}/approve` — approve a pending promotion from dashboard context.
- `POST /api/v1/release-jobengine/promotions/{id}/reject` — reject a pending promotion from dashboard context.
- Compatibility aliases are exposed for legacy clients under `/api/release-jobengine/*`.

All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
## 5) Observability

- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`.
- The Task Runner bridge adds `pack_run_logs_stream_lag_seconds`, `pack_run_heartbeats_total`, `pack_run_artifacts_total`.
- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs are flagged for Ops.
- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.
## 6) Offline support

- The Orchestrator exports audit bundles: `jobs.jsonl`, `history.jsonl`, `throttles.jsonl`, `manifest.json`, `signatures/`. Used for offline investigations and compliance.
- Replay manifests contain job digests and success/failure notes for deterministic proof.
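
The bundle's determinism can be sketched as a byte-stable `manifest.json` carrying a SHA-256 digest per exported file. Only the file names come from the bullet above; the manifest's field layout is an assumption:

```python
import hashlib
import json
from pathlib import Path

def build_manifest(bundle_dir: Path,
                   files=("jobs.jsonl", "history.jsonl", "throttles.jsonl")) -> str:
    """Produce a deterministic manifest.json body with a sha256 digest per bundle file."""
    entries = {}
    for name in files:
        digest = hashlib.sha256((bundle_dir / name).read_bytes()).hexdigest()
        entries[name] = f"sha256:{digest}"
    # sort_keys plus fixed separators keeps the manifest byte-stable across runs.
    return json.dumps({"files": entries}, sort_keys=True, separators=(",", ":"))
```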
## 7) Operational considerations

- HA deployment with multiple API instances; queue storage determines the redundancy strategy.
- Support for a `maintenance` mode halting leases while allowing status inspection.
- The runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).

---

## 8) Orchestration domain subdomains (Sprint 208)

Sprint 208 consolidated the Scheduler, TaskRunner, and PacksRegistry source trees under `src/JobEngine/` as subdomains of the orchestration domain. Each subdomain retains its own project names, namespaces, and runtime identities. No namespace renames were performed.
### 8.1) Scheduler subdomain

**Source location:** `src/JobEngine/StellaOps.Scheduler.*`

The Scheduler service re-evaluates already-cataloged images when intelligence changes (Concelier/Excititor/policy). It orchestrates nightly and ad-hoc runs, targets only impacted images using the BOM-Index, and emits report-ready events for downstream Notify. The default mode is analysis-only (no image pull); optional content refresh can be enabled per schedule.

**Deployables:** `StellaOps.Scheduler.WebService` (stateless), `StellaOps.Scheduler.Worker.Host` (scale-out).

**Database:** `SchedulerDbContext` (schema `scheduler`, 11 entities). Owns the `schedules`, `runs`, `impact_cursors`, `locks`, and `audit` tables. See archived docs: `docs-archived/modules/scheduler/architecture.md`.
### 8.2) TaskRunner subdomain

**Source location:** `src/JobEngine/StellaOps.TaskRunner/`, `src/JobEngine/StellaOps.TaskRunner.__Libraries/`

The TaskRunner provides the execution substrate for Orchestrator jobs. Workers poll lease endpoints, execute tasks, report outcomes, and stream logs/artifacts for pack-runs.

**Deployables:** `StellaOps.TaskRunner.WebService`, `StellaOps.TaskRunner.Worker`.

**Database and storage contract (Sprint 312):**

- `Storage:Driver=postgres` is the production default for run state, logs, and approvals.
- Postgres-backed stores: `PostgresPackRunStateStore`, `PostgresPackRunLogStore`, `PostgresPackRunApprovalStore` via `TaskRunnerDataSource`.
- The artifact payload channel uses the object storage path (`seed-fs` driver) configured with `TaskRunner:Storage:ObjectStore:SeedFs:RootPath`.
- Startup fails fast when `Storage:ObjectStore:Driver` is set to `rustfs` (not implemented) or any unsupported driver value.
- Non-development startup fails fast when `Storage:Driver=postgres` and no connection string is configured.
- Explicit non-production overrides remain available (`filesystem`, `inmemory`) but are no longer implicit defaults.
### 8.3) PacksRegistry subdomain

**Source location:** `src/JobEngine/StellaOps.PacksRegistry/`, `src/JobEngine/StellaOps.PacksRegistry.__Libraries/`

The PacksRegistry manages compliance/automation pack definitions, versions, and distribution for the task execution pipeline.

**Deployables:** `StellaOps.PacksRegistry.WebService`, `StellaOps.PacksRegistry.Worker`.

**Database and storage contract (Sprint 312):**

- `Storage:Driver=postgres` is the production default for metadata/state repositories (`pack`, `parity`, `lifecycle`, `mirror`, `audit`, `attestation metadata`).
- Blob/object payloads (`pack content`, `provenance content`, `attestation content`) are persisted through the seed-fs object-store channel (`SeedFsPacksRegistryBlobStore`).
- Startup fails fast when `Storage:ObjectStore:Driver` is set to `rustfs` (not implemented) or any unsupported driver value.
- Non-development startup fails fast when `Storage:Driver=postgres` and no connection string is configured.
- PostgreSQL keeps metadata and compatibility placeholders; payload retrieval resolves from object storage first.
- Explicit non-production overrides remain available (`filesystem`, `inmemory`) but are no longer implicit defaults.

---
## 9) Architecture Decision Record: No DB merge (Sprint 208)

**Decision:** `OrchestratorDbContext` and `SchedulerDbContext` remain separate DbContexts with separate PostgreSQL schemas. No cross-schema DB merge.

**Context:** Sprint 208 evaluated merging the Orchestrator (39 entities) and Scheduler (11 entities) DbContexts into a single unified context. Both define `Jobs` and `JobHistory` entities.

**Problem:** The `Jobs` and `JobHistory` entities have fundamentally incompatible semantics:

- **OrchestratorDbContext.Jobs:** Represents pipeline orchestration runs (source ingestion, policy evaluation, release promotion). Fields include `payloadDigest`, `dependencies`, `leaseId`, `retryPolicy`.
- **SchedulerDbContext.Jobs:** Represents cron-scheduled rescan executions (image re-evaluation, impact-index-driven). Fields include `scheduleId`, `trigger` (cron/concelier/excititor/manual), `impactSet`, `runStats`.

Merging would require renaming one set of entities (e.g., `SchedulerJobs`, `SchedulerJobHistory`) and propagating the rename through repositories, query code, compiled models, migrations, and external contracts. The schemas already provide clean separation at no operational cost, since both live in the same `stellaops_platform` database.

**Decision rationale:**

1. The entity name collision with incompatible models makes a merge risky and disruptive.
2. Compiled models from Sprint 219 would need regeneration for both contexts.
3. Schemas provide clean separation at zero cost.
4. The future domain rename (Sprint 221) is a better venue for any schema consolidation.

**Consequences:** TaskRunner and PacksRegistry remain independent subdomains and now implement explicit storage contracts (Postgres state/metadata plus object-store payload channels) without a cross-schema DB merge.

---
## 10) Schema continuity remediation (Sprint 311)

Sprint 221 renamed the domain from Orchestrator to JobEngine but intentionally preserved the PostgreSQL schema name `orchestrator` for continuity. Sprint 311 closed the implementation drift so that runtime, design-time, and compiled-model paths now align on the same preserved schema default.

Implemented alignment:

- The runtime default schema is centralized in `JobEngineDbContext.DefaultSchemaName` (`orchestrator`), and schema normalization is centralized in `JobEngineDbContext.ResolveSchemaName(...)`.
- Repository runtime context creation (`JobEngineDbContextFactory`) uses the same shared default and normalization logic.
- Design-time context creation now passes `JobEngineDbContext.DefaultSchemaName` explicitly instead of relying on the implicit constructor fallback.
- EF compiled-model schema annotations were aligned to `orchestrator` so compiled-model and runtime model behavior match.

Out of scope for Sprint 311:

- No schema migration from `orchestrator` to `jobengine` was introduced.
- Any future physical schema rename requires a dedicated migration sprint with data/backfill and rollback planning.