
Source & Job Orchestrator architecture

Based on the Epic9 Source & Job Orchestrator Dashboard, this section outlines the components, job lifecycle, rate-limit governance, and observability of the orchestration domain.

1) Topology

  • Orchestrator API (StellaOps.JobEngine). Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (orchestrator:*).
  • Job ledger (PostgreSQL). Tables jobs, job_history, sources, quotas, throttles, incidents (schema orchestrator); pack registry tables packs, pack_versions (schema packs). Runtime sessions set search_path to orchestrator, packs, public so both schemas and their enum types (job_status, pack_status, pack_version_status) resolve correctly. The jobs/summary endpoint uses a single COUNT(*) FILTER (WHERE status::text = ...) aggregate query to avoid enum-vs-text mismatch and eliminate per-status round trips. Startup migrations execute idempotently; append-only history ensures auditability.
  • Queue abstraction. Supports Valkey Streams or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
  • Dashboard feeds. SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
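
The single-aggregate jobs/summary query described above can be sketched in miniature. This is an illustrative Python stand-in, not the service code: the in-memory rows and function name are hypothetical, but the logic mirrors what one `COUNT(*) FILTER (WHERE status::text = ...)` pass computes — every per-status count in a single scan, with no per-status round trips and no enum-vs-text comparison.

```python
from collections import Counter

# Hypothetical in-memory stand-in for rows of the orchestrator.jobs table.
JOBS = [
    {"id": 1, "status": "pending"},
    {"id": 2, "status": "running"},
    {"id": 3, "status": "succeeded"},
    {"id": 4, "status": "succeeded"},
    {"id": 5, "status": "failed"},
]

def jobs_summary(jobs):
    """One pass over the rows yields every per-status count, the same shape
    a single COUNT(*) FILTER aggregate returns from PostgreSQL."""
    counts = Counter(job["status"] for job in jobs)
    return {s: counts.get(s, 0)
            for s in ("pending", "running", "succeeded", "failed",
                      "canceled", "timed_out")}
```

In SQL the equivalent is one `SELECT` with a `COUNT(*) FILTER (WHERE status::text = '<status>')` column per status, casting the enum to text once per row.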

2) Job lifecycle

  1. Enqueue. Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit JobRequest records containing jobType, tenant, priority, payloadDigest, dependencies.
  2. Scheduling. Orchestrator applies quotas and rate limits per {tenant, jobType}. Jobs exceeding limits are staged in pending queue with next eligible timestamp.
  3. Leasing (Task Runner bridge). Workers poll LeaseJob endpoint; Orchestrator returns job with leaseId, leaseUntil, idempotencyKey, and instrumentation tokens. Lease renewal required for long-running tasks; leases carry retry hints and provenance (tenant, project, correlationId, taskRunnerId).
  4. Completion. Worker reports status (succeeded, failed, canceled, timed_out). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded.
  5. Replay. Operators trigger POST /jobs/{id}/replay which clones job payload, sets replayOf pointer, and requeues with high priority while preserving determinism metadata.
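
Steps 3 and 4 above (leasing plus completion with retry) can be sketched as follows. All names, field shapes, and numbers here are assumptions for illustration — not the JobEngine implementation — but the shape matches the text: a lease carries `leaseId`, `leaseUntil`, and an `idempotencyKey`, and failures back off exponentially up to a maximum attempt count.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Job:
    """Minimal illustrative job record (fields are hypothetical)."""
    job_id: str
    payload_digest: str
    attempts: int = 0
    max_attempts: int = 5
    status: str = "pending"

def lease(job, worker_id, now, lease_seconds=60):
    """Step 3: hand the job to a worker with a lease and an idempotency key
    derived from the job identity and attempt number."""
    job.status = "leased"
    key = hashlib.sha256(f"{job.job_id}:{job.attempts}".encode()).hexdigest()
    return {"leaseId": f"{worker_id}:{now}",
            "leaseUntil": now + lease_seconds,
            "idempotencyKey": key}

def complete(job, outcome, base_delay=2.0):
    """Step 4: archive on success; on failure apply exponential backoff until
    max attempts, then mark failed (incident escalation would follow).
    Returns the next-eligible delay in seconds, or None."""
    if outcome == "succeeded":
        job.status = "archived"
        return None
    job.attempts += 1
    if job.attempts >= job.max_attempts:
        job.status = "failed"
        return None
    job.status = "pending"
    return base_delay * (2 ** (job.attempts - 1))
```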

Pack-run lifecycle (phase III)

  • Register pack-run job type with task runner hints (artifacts, log channel, heartbeat cadence).
  • Logs/Artifacts: SSE/WS stream keyed by packRunId + tenant/project; artifacts published with content digests and URI metadata.
  • Events: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending ORCH-SVC-37-101 final spec.

3) Rate-limit & quota governance

  • Quotas defined per tenant/profile (maxActive, maxPerHour, burst). Stored in quotas and enforced before leasing.
  • Dynamic throttles allow ops to pause specific sources (pauseSource, resumeSource) or reduce concurrency.
  • Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
  • Control plane quota updates require Authority scope orch:quota (issued via Orch.Admin role). Historical rebuilds/backfills additionally require orch:backfill and must supply backfill_reason and backfill_ticket alongside the operator metadata. Authority persists all four fields (quota_reason, quota_ticket, backfill_reason, backfill_ticket) for audit replay.

3.1) Quota governance service

The QuotaGovernanceService provides cross-tenant quota allocation with configurable policies:

Allocation strategies:

  • Equal — Divide total capacity equally among all active tenants.
  • Proportional — Allocate based on tenant weight/priority tier.
  • Priority — Higher priority tenants get allocation first, with preemption.
  • ReservedWithFairShare — Reserved minimum per tenant, remainder distributed fairly.
  • Fixed — Static allocation per tenant regardless of demand.
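
Two of the strategies above can be sketched concretely; the function names and integer arithmetic are illustrative assumptions, not the QuotaGovernanceService internals.

```python
def allocate_equal(total_capacity, tenants):
    """Equal: divide total capacity evenly among active tenants
    (floor division; any remainder stays unallocated in this sketch)."""
    share = total_capacity // len(tenants)
    return {t: share for t in tenants}

def allocate_reserved_fair_share(total_capacity, tenants, reserved):
    """ReservedWithFairShare: guarantee each tenant its reserved minimum,
    then split the remaining capacity equally."""
    remainder = total_capacity - sum(reserved.get(t, 0) for t in tenants)
    fair = max(remainder, 0) // len(tenants)
    return {t: reserved.get(t, 0) + fair for t in tenants}
```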

Key operations:

  • CalculateAllocationAsync — Compute quota for a tenant based on active policies.
  • RequestQuotaAsync — Request quota from shared pool; returns granted amount with burst usage.
  • ReleaseQuotaAsync — Return quota to shared pool after job completion.
  • CanScheduleAsync — Check scheduling eligibility combining quota and circuit breaker state.

Quota allocation policy properties:

  • TotalCapacity — Pool size to allocate from (for proportional/fair strategies).
  • MinimumPerTenant / MaximumPerTenant — Allocation bounds.
  • ReservedCapacity — Guaranteed capacity for high-priority tenants.
  • AllowBurst / BurstMultiplier — Allow temporary overallocation when capacity exists.
  • Priority — Policy evaluation order (higher = first).
  • JobType — Optional job type filter (null = applies to all).
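
Policy selection per the two properties above — Priority ordering and the optional JobType filter — can be sketched like this. The record shape and function are hypothetical; only the selection semantics (enabled, matching or unfiltered job type, highest priority wins) come from the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuotaPolicy:
    """Illustrative policy record mirroring the properties listed above."""
    name: str
    priority: int
    job_type: Optional[str] = None   # None = applies to all job types
    enabled: bool = True

def select_policy(policies, job_type):
    """Pick the applicable policy: enabled, job type matches or is
    unfiltered, highest Priority evaluated first."""
    candidates = [p for p in policies
                  if p.enabled and p.job_type in (None, job_type)]
    return max(candidates, key=lambda p: p.priority, default=None)
```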

3.2) Circuit breaker service

The CircuitBreakerService implements the circuit breaker pattern for downstream services:

States:

  • Closed — Normal operation; requests pass through. Failures are tracked.
  • Open — Circuit tripped; requests are blocked for OpenDuration. Prevents cascade failures.
  • HalfOpen — After open duration, limited test requests allowed. Success → Closed; Failure → Open.

Thresholds:

  • FailureThreshold (0.0–1.0) — Failure rate that triggers circuit open.
  • WindowDuration — Sliding window for failure rate calculation.
  • MinimumSamples — Minimum requests before circuit can trip.
  • OpenDuration — How long circuit stays open before half-open transition.
  • HalfOpenTestCount — Number of test requests allowed in half-open state.
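
The three states and thresholds above compose into a small state machine. The sketch below is an assumption-laden miniature, not the CircuitBreakerService: parameter names follow the thresholds listed, while the sliding-window bookkeeping and defaults are invented for illustration.

```python
import time
from collections import deque

class CircuitBreaker:
    """Closed -> Open when the windowed failure rate crosses the threshold
    (given enough samples); Open -> HalfOpen after OpenDuration; HalfOpen
    admits a limited number of test requests, then closes on success or
    re-opens on failure."""

    def __init__(self, failure_threshold=0.5, window_seconds=60.0,
                 minimum_samples=10, open_seconds=30.0, half_open_tests=3):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.minimum_samples = minimum_samples
        self.open_seconds = open_seconds
        self.half_open_tests = half_open_tests
        self.state = "Closed"
        self.samples = deque()       # (timestamp, ok) pairs inside the window
        self.opened_at = None
        self.tests_left = 0

    def check(self, now=None):
        """Return True when a request may pass through."""
        now = time.monotonic() if now is None else now
        if self.state == "Open":
            if now - self.opened_at < self.open_seconds:
                return False
            self.state, self.tests_left = "HalfOpen", self.half_open_tests
        if self.state == "HalfOpen":
            if self.tests_left <= 0:
                return False
            self.tests_left -= 1
        return True

    def record(self, ok, now=None):
        """Update state from a request outcome."""
        now = time.monotonic() if now is None else now
        self.samples.append((now, ok))
        while self.samples and now - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()
        if self.state == "HalfOpen":
            if ok:
                self.state = "Closed"
            else:
                self.state, self.opened_at = "Open", now
            return
        failures = sum(1 for _, s in self.samples if not s)
        if (len(self.samples) >= self.minimum_samples
                and failures / len(self.samples) >= self.failure_threshold):
            self.state, self.opened_at = "Open", now
```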

Key operations:

  • CheckAsync — Verify if request is allowed; returns CircuitBreakerCheckResult.
  • RecordSuccessAsync / RecordFailureAsync — Update circuit state after request.
  • ForceOpenAsync / ForceCloseAsync — Manual operator intervention (audited).
  • ListAsync — View all circuit breakers for a tenant with optional state filter.

Downstream services protected:

  • Scanner
  • Attestor
  • Policy Engine
  • Registry clients
  • External integrations

4) APIs

4.1) Job management

  • GET /api/jobs?status= — list jobs with filters (tenant, jobType, status, time window).
  • GET /api/jobs/{id} — job detail (payload digest, attempts, worker, lease history, metrics).
  • POST /api/jobs/{id}/cancel — cancel running/pending job with audit reason.
  • POST /api/jobs/{id}/replay — schedule replay.
  • POST /api/limits/throttle — apply throttle (requires elevated scope).
  • GET /api/dashboard/metrics — aggregated metrics for Console dashboards.

4.2) Circuit breaker endpoints (/api/v1/jobengine/circuit-breakers)

  • GET / — List all circuit breakers for tenant (optional ?state= filter).
  • GET /{serviceId} — Get circuit breaker state for specific downstream service.
  • GET /{serviceId}/check — Check if requests are allowed; returns IsAllowed, State, FailureRate, TimeUntilRetry.
  • POST /{serviceId}/success — Record successful request to downstream service.
  • POST /{serviceId}/failure — Record failed request (body: failureReason).
  • POST /{serviceId}/force-open — Manually open circuit (body: reason; audited).
  • POST /{serviceId}/force-close — Manually close circuit (audited).

4.3) Quota governance endpoints (/api/v1/jobengine/quota-governance)

  • GET /policies — List quota allocation policies (optional ?enabled= filter).
  • GET /policies/{policyId} — Get specific policy.
  • POST /policies — Create new policy.
  • PUT /policies/{policyId} — Update policy.
  • DELETE /policies/{policyId} — Delete policy.
  • GET /allocation — Calculate allocation for current tenant (optional ?jobType=).
  • POST /request — Request quota from pool (body: jobType, requestedAmount).
  • POST /release — Release quota back to pool (body: jobType, releasedAmount).
  • GET /status — Get tenant quota status (optional ?jobType=).
  • GET /summary — Get quota governance summary across all tenants (optional ?policyId=).
  • GET /can-schedule — Check if job can be scheduled (optional ?jobType=).

4.4) Discovery and documentation

  • Event envelope draft (docs/modules/jobengine/event-envelope.md) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.
  • OpenAPI discovery: /.well-known/openapi exposes /openapi/jobengine.json (OAS 3.1) with pagination/idempotency/error-envelope examples; legacy job detail/summary endpoints now ship Deprecation + Link headers that point to their replacements.

4.5) Release control plane dashboard endpoints

  • GET /api/v1/release-jobengine/dashboard — control-plane dashboard payload (pipeline, pending approvals, active deployments, recent releases).
  • POST /api/v1/release-jobengine/promotions/{id}/approve — approve a pending promotion from dashboard context.
  • POST /api/v1/release-jobengine/promotions/{id}/reject — reject a pending promotion from dashboard context.
  • Compatibility aliases are exposed for legacy clients under /api/release-jobengine/*.

All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.

5) Observability

  • Metrics: job_queue_depth{jobType,tenant}, job_latency_seconds, job_failures_total, job_retry_total, lease_extensions_total.
  • Task Runner bridge adds pack_run_logs_stream_lag_seconds, pack_run_heartbeats_total, pack_run_artifacts_total.
  • Logs: structured with jobId, jobType, tenant, workerId, leaseId, status. Incident logs flagged for Ops.
  • Traces: spans covering enqueue, schedule, lease, worker_execute, complete. Trace IDs propagate to worker spans for end-to-end correlation.

6) Offline support

  • Orchestrator exports audit bundles: jobs.jsonl, history.jsonl, throttles.jsonl, manifest.json, signatures/. Used for offline investigations and compliance.
  • Replay manifests contain job digests and success/failure notes for deterministic proof.

7) Operational considerations

  • HA deployment with multiple API instances; queue storage determines redundancy strategy.
  • Support for maintenance mode halting leases while allowing status inspection.
  • Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).

8) Orchestration domain subdomains (Sprint 208)

Sprint 208 consolidated Scheduler, TaskRunner, and PacksRegistry source trees under src/JobEngine/ as subdomains of the orchestration domain. Each subdomain retains its own project names, namespaces, and runtime identities. No namespace renames were performed.

8.1) Scheduler subdomain

Source location: src/JobEngine/StellaOps.Scheduler.*

The Scheduler service re-evaluates already-cataloged images when intelligence changes (Concelier/Excititor/policy), orchestrates nightly and ad-hoc runs, targets only impacted images using the BOM-Index, and emits report-ready events for downstream Notify. Default mode is analysis-only (no image pull); optional content-refresh can be enabled per schedule.

Deployables: StellaOps.Scheduler.WebService (stateless), StellaOps.Scheduler.Worker.Host (scale-out).

Database: SchedulerDbContext (schema scheduler, 11 entities). Owns schedules, runs, impact_cursors, locks, audit tables. See archived docs: docs-archived/modules/scheduler/architecture.md.

8.2) TaskRunner subdomain

Source location: src/JobEngine/StellaOps.TaskRunner/, src/JobEngine/StellaOps.TaskRunner.__Libraries/

The TaskRunner provides the execution substrate for Orchestrator jobs. Workers poll lease endpoints, execute tasks, report outcomes, and stream logs/artifacts for pack-runs.

Deployables: StellaOps.TaskRunner.WebService, StellaOps.TaskRunner.Worker.

Database and storage contract (Sprint 312):

  • Storage:Driver=postgres is the production default for run state, logs, and approvals.
  • Postgres-backed stores: PostgresPackRunStateStore, PostgresPackRunLogStore, PostgresPackRunApprovalStore via TaskRunnerDataSource.
  • Artifact payload channel uses object storage path (seed-fs driver) configured with TaskRunner:Storage:ObjectStore:SeedFs:RootPath.
  • Startup fails fast when Storage:ObjectStore:Driver is set to rustfs (not implemented) or any unsupported driver value.
  • Non-development startup fails fast when Storage:Driver=postgres and no connection string is configured.
  • Explicit non-production overrides remain available (filesystem, inmemory) but are no longer implicit defaults.
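
The fail-fast startup checks described above reduce to a small validation routine. The sketch below is illustrative: configuration key paths mirror the text, but the flat key/value shape, the `ConnectionString` key, and the supported-driver set are assumptions.

```python
# Hypothetical set of implemented object-store drivers (per the text,
# seed-fs is the configured path and rustfs is explicitly unimplemented).
SUPPORTED_OBJECT_STORE_DRIVERS = {"seed-fs"}

def validate_storage_config(config, environment="Production"):
    """Raise at startup on unsupported drivers or a missing postgres
    connection string outside Development."""
    object_driver = config.get("Storage:ObjectStore:Driver", "seed-fs")
    if object_driver == "rustfs":
        raise ValueError("Storage:ObjectStore:Driver=rustfs is not implemented")
    if object_driver not in SUPPORTED_OBJECT_STORE_DRIVERS:
        raise ValueError(f"unsupported object-store driver: {object_driver}")
    if (environment != "Development"
            and config.get("Storage:Driver") == "postgres"
            and not config.get("ConnectionString")):
        raise ValueError("Storage:Driver=postgres requires a connection string")
```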

8.3) PacksRegistry subdomain

Source location: src/JobEngine/StellaOps.PacksRegistry/, src/JobEngine/StellaOps.PacksRegistry.__Libraries/

The PacksRegistry manages compliance/automation pack definitions, versions, and distribution for the task execution pipeline.

Deployables: StellaOps.PacksRegistry.WebService, StellaOps.PacksRegistry.Worker.

Database and storage contract (Sprint 312):

  • Storage:Driver=postgres is the production default for metadata/state repositories (pack, parity, lifecycle, mirror, audit, attestation metadata).
  • Blob/object payloads (pack content, provenance content, attestation content) are persisted through the seed-fs object-store channel (SeedFsPacksRegistryBlobStore).
  • Startup fails fast when Storage:ObjectStore:Driver is set to rustfs (not implemented) or any unsupported driver value.
  • Non-development startup fails fast when Storage:Driver=postgres and no connection string is configured.
  • PostgreSQL keeps metadata and compatibility placeholders; payload retrieval resolves from object storage first.
  • Explicit non-production overrides remain available (filesystem, inmemory) but are no longer implicit defaults.

9) Architecture Decision Record: No DB merge (Sprint 208)

Decision: OrchestratorDbContext and SchedulerDbContext remain as separate DbContexts with separate PostgreSQL schemas. No cross-schema DB merge.

Context: Sprint 208 evaluated merging the Orchestrator (39 entities) and Scheduler (11 entities) DbContexts into a single unified context. Both define Jobs and JobHistory entities.

Problem: The Jobs and JobHistory entities have fundamentally incompatible semantics:

  • OrchestratorDbContext.Jobs: Represents pipeline orchestration runs (source ingestion, policy evaluation, release promotion). Fields include payloadDigest, dependencies, leaseId, retryPolicy.
  • SchedulerDbContext.Jobs: Represents cron-scheduled rescan executions (image re-evaluation, impact-index-driven). Fields include scheduleId, trigger (cron/concelier/excititor/manual), impactSet, runStats.

Merging would require renaming one set of entities (e.g., SchedulerJobs, SchedulerJobHistory) and propagating that rename through repositories, query code, compiled models, migrations, and external contracts. The schemas already provide clean separation at no operational cost, since both live in the same stellaops_platform database.

Decision rationale:

  1. Entity name collision with incompatible models makes merge risky and disruptive.
  2. Compiled models from Sprint 219 would need regeneration for both contexts.
  3. Schemas provide clean separation at zero cost.
  4. Future domain rename (Sprint 221) is a better venue for any schema consolidation.

Consequences: TaskRunner and PacksRegistry remain independent subdomains and now implement explicit storage contracts (Postgres state/metadata plus object-store payload channels) without cross-schema DB merge.


10) Schema continuity remediation (Sprint 311)

Sprint 221 renamed the domain from Orchestrator to JobEngine but intentionally preserved the PostgreSQL schema name orchestrator for continuity. Sprint 311 closed the implementation drift so runtime, design-time, and compiled-model paths now align on the same preserved schema default.

Implemented alignment:

  • Runtime default schema is centralized in JobEngineDbContext.DefaultSchemaName (orchestrator) and schema normalization is centralized in JobEngineDbContext.ResolveSchemaName(...).
  • Repository runtime context creation (JobEngineDbContextFactory) uses that same shared default and normalization logic.
  • Design-time context creation now passes JobEngineDbContext.DefaultSchemaName explicitly instead of relying on implicit constructor fallback.
  • EF compiled model schema annotations were aligned to orchestrator so compiled-model and runtime model behavior match.
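
The centralized defaulting described above can be sketched as follows. The default constant comes from the text (JobEngineDbContext.DefaultSchemaName is "orchestrator"); the normalization rules shown (trim whitespace, fall back when blank) are assumptions about what ResolveSchemaName does, stated here only to illustrate why a single shared routine keeps runtime, design-time, and compiled-model paths in agreement.

```python
# Matches the preserved schema default named in the text.
DEFAULT_SCHEMA_NAME = "orchestrator"

def resolve_schema_name(configured=None):
    """Return the effective schema name: the configured value when
    non-blank, otherwise the preserved default. Every context-creation
    path calling this one routine is what eliminates the drift."""
    if configured is None:
        return DEFAULT_SCHEMA_NAME
    name = configured.strip()
    return name if name else DEFAULT_SCHEMA_NAME
```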

Out of scope for Sprint 311:

  • No schema migration from orchestrator to jobengine was introduced.
  • Any future physical schema rename requires a dedicated migration sprint with data/backfill and rollback planning.