consolidation of some of the modules, localization fixes, product advisories work, qa work
34
docs/modules/jobengine/AGENTS.md
Normal file
@@ -0,0 +1,34 @@
|
||||
# Source & Job Orchestrator agent guide
|
||||
|
||||
## Mission
|
||||
The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
|
||||
|
||||
## Key docs
|
||||
- [Module README](./README.md)
|
||||
- [Architecture](./architecture.md)
|
||||
- [Implementation plan](./implementation_plan.md)
|
||||
- [Task board](./TASKS.md)
|
||||
|
||||
## How to get started
|
||||
1. Read the design summaries in ./architecture.md (quota governance, job lifecycle, dashboard feeds).
|
||||
2. Open sprint file `/docs/implplan/SPRINT_*.md` and locate stories for this component.
|
||||
3. Check ./TASKS.md and update status before/after work.
|
||||
4. Review ./README.md for responsibilities and ensure changes maintain determinism and offline parity.
|
||||
|
||||
## Guardrails
|
||||
- Uphold Aggregation-Only Contract boundaries when consuming ingestion data.
|
||||
- Preserve determinism and provenance in all derived outputs.
|
||||
- Document offline/air-gap pathways for any new feature.
|
||||
- Update telemetry/observability assets alongside feature work.
|
||||
## Required Reading
|
||||
- `docs/modules/jobengine/README.md`
|
||||
- `docs/modules/jobengine/architecture.md`
|
||||
- `docs/modules/jobengine/implementation_plan.md`
|
||||
- `docs/modules/platform/architecture-overview.md`
|
||||
|
||||
## Working Agreement
|
||||
- 1. Update task status to `DOING`/`DONE` in both the corresponding sprint file `/docs/implplan/SPRINT_*.md` and the local `TASKS.md` when you start or finish work.
|
||||
- 2. Review this charter and the Required Reading documents before coding; confirm prerequisites are met.
|
||||
- 3. Keep changes deterministic (stable ordering, timestamps, hashes) and align with offline/air-gap expectations.
|
||||
- 4. Coordinate doc updates, tests, and cross-guild communication whenever contracts or workflows change.
|
||||
- 5. Revert to `TODO` if you pause the task without shipping changes; leave notes in commit/PR descriptions for context.
|
||||
78
docs/modules/jobengine/README.md
Normal file
@@ -0,0 +1,78 @@
|
||||
# StellaOps Source & Job Orchestrator
|
||||
|
||||
The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
|
||||
|
||||
## Latest updates (2025-11-30)
|
||||
- OpenAPI discovery published at `/.well-known/openapi` with `openapi/orchestrator.json`; includes pagination/idempotency/error-envelope examples and version headers.
|
||||
- Legacy job detail/summary endpoints now emit `Deprecation` + `Link` headers pointing to the stable replacements.
|
||||
- Job leasing flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
|
||||
- Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
|
||||
- Authority `orch:quota` / `orch:backfill` scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.
|
||||
|
||||
## Responsibilities
|
||||
- Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
|
||||
- Expose dashboards and APIs for throttling, replays, and failover.
|
||||
- Enforce rate-limits, concurrency and dependency chains across queues.
|
||||
- Stream structured events and audit logs for incident response.
|
||||
- Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.
|
||||
|
||||
## Key components
|
||||
- Orchestrator WebService (control plane).
|
||||
- Queue adapters (Valkey/NATS) and job ledger.
|
||||
- Console dashboard module and CLI integration for operators.
|
||||
|
||||
## Integrations & dependencies
|
||||
- Authority for authN/Z on operational actions.
|
||||
- Telemetry stack for job metrics and alerts.
|
||||
- Scheduler/Concelier/Excititor workers for job lifecycle.
|
||||
- Offline Kit for state export/import during air-gap refreshes.
|
||||
|
||||
## Operational notes
|
||||
- Job recovery runbooks and dashboard JSON as described in Epic 9.
|
||||
- Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
|
||||
- Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
|
||||
- When using `orch:quota` / `orch:backfill` scopes, capture reason/ticket fields in runbooks and audit checklists.
|
||||
|
||||
## Implementation Status
|
||||
|
||||
### Phase 1 – Core service & job ledger (Complete)
|
||||
- PostgreSQL schema with sources, runs, jobs, artifacts, DAG edges, quotas, schedules, incidents
|
||||
- Lease manager with heartbeats, retries, dead-letter queues
|
||||
- Token-bucket rate limiter per tenant/source.host with adaptive refill
|
||||
- Watermark/backfill orchestration for event-time windows
|
||||
|
||||
### Phase 2 – Worker SDK & artifact registry (Complete)
|
||||
- Claim/heartbeat/report contract with deterministic artifact hashing
|
||||
- Idempotency enforcement and worker SDKs for .NET/Rust/Go agents
|
||||
- Integrated with Concelier, Excititor, SBOM Service, Policy Engine
|
||||
|
||||
### Phase 3 – Observability & dashboard (In Progress)
|
||||
- Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate
|
||||
- Error clustering for HTTP 429/5xx, schema mismatches, parse errors
|
||||
- SSE/WebSocket feeds for Console updates, Gantt timeline/DAG JSON
|
||||
|
||||
### Phase 4 – Controls & resilience (Planned)
|
||||
- Pause/resume/throttle/retry/backfill tooling
|
||||
- Dead-letter review, circuit breakers, blackouts, backpressure handling
|
||||
- Automation hooks and control plane APIs
|
||||
|
||||
### Phase 5 – Offline & compliance (Planned)
|
||||
- Deterministic audit bundles (jobs.jsonl, history.jsonl, throttles.jsonl)
|
||||
- Provenance manifests and offline replay scripts
|
||||
- Tenant isolation validation and secret redaction
|
||||
|
||||
### Key Acceptance Criteria
|
||||
- Schedules all jobs with quotas, rate limits, idempotency; preserves provenance
|
||||
- Console reflects real-time DAG status, queue depth, SLO burn rate
|
||||
- Observability stack exposes metrics, logs, traces, incidents for stuck jobs and throttling
|
||||
- Offline audit bundles reproduce job history deterministically with verified signatures
|
||||
|
||||
### Technical Decisions & Risks
|
||||
- Backpressure/queue overload mitigated via adaptive token buckets, circuit breakers, dynamic concurrency
|
||||
- Upstream vendor throttles managed with visible state, automatic jitter and retry
|
||||
- Tenant leakage prevented through API/queue/storage filters, fuzz tests, redaction
|
||||
- Complex DAG errors handled with diagnostics, error clustering, partial replay tooling
|
||||
|
||||
## Epic alignment
|
||||
- Epic 9: Source & Job Orchestrator Dashboard.
|
||||
- ORCH stories in ../../TASKS.md.
|
||||
228
docs/modules/jobengine/architecture.md
Normal file
@@ -0,0 +1,228 @@
|
||||
# Source & Job Orchestrator architecture
|
||||
|
||||
> Based on Epic 9 – Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.
|
||||
|
||||
## 1) Topology
|
||||
|
||||
- **Orchestrator API (`StellaOps.JobEngine`).** Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (`orchestrator:*`).
|
||||
- **Job ledger (PostgreSQL).** Tables `jobs`, `job_history`, `sources`, `quotas`, `throttles`, `incidents` (schema `orchestrator`). Append-only history ensures auditability.
|
||||
- **Queue abstraction.** Supports Valkey Streams or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
|
||||
- **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
|
||||
|
||||
## 2) Job lifecycle
|
||||
|
||||
1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
|
||||
2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in a pending queue with their next eligible timestamp.
|
||||
3. **Leasing (Task Runner bridge).** Workers poll `LeaseJob` endpoint; Orchestrator returns job with `leaseId`, `leaseUntil`, `idempotencyKey`, and instrumentation tokens. Lease renewal required for long-running tasks; leases carry retry hints and provenance (`tenant`, `project`, `correlationId`, `taskRunnerId`).
|
||||
4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies the retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds are exceeded.
|
||||
5. **Replay.** Operators trigger `POST /jobs/{id}/replay` which clones job payload, sets `replayOf` pointer, and requeues with high priority while preserving determinism metadata.
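The retry step in the lifecycle above can be sketched as follows. This is a minimal illustration, assuming a policy with a base delay, a multiplier, a delay cap, and a maximum attempt count; the class name `RetryPolicy`, its field names, and the default values are hypothetical, since the source only states "exponential backoff, max attempts". The sketch deliberately omits jitter so that the computed delay is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:
    # Hypothetical fields and defaults; not taken from the Orchestrator code.
    base_delay_seconds: float = 5.0
    multiplier: float = 2.0
    max_delay_seconds: float = 300.0
    max_attempts: int = 5


def next_retry_delay(policy: RetryPolicy, attempt: int) -> Optional[float]:
    """Return the delay before the next attempt, or None when retries are exhausted.

    `attempt` is the 1-based number of the attempt that just failed.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if attempt >= policy.max_attempts:
        # No retries left; a caller would dead-letter the job or raise an incident.
        return None
    delay = policy.base_delay_seconds * policy.multiplier ** (attempt - 1)
    return min(delay, policy.max_delay_seconds)
```

With the default values, failed attempts 1 through 4 wait 5, 10, 20, and 40 seconds, and a failure on attempt 5 is not retried.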
|
||||
|
||||
### Pack-run lifecycle (phase III)
|
||||
- **Register** `pack-run` job type with task runner hints (artifacts, log channel, heartbeat cadence).
|
||||
- **Logs/Artifacts**: SSE/WS stream keyed by `packRunId` + `tenant/project`; artifacts published with content digests and URI metadata.
|
||||
- **Events**: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending ORCH-SVC-37-101 final spec.
|
||||
|
||||
## 3) Rate-limit & quota governance
|
||||
|
||||
- Quotas defined per tenant/profile (`maxActive`, `maxPerHour`, `burst`). Stored in `quotas` and enforced before leasing.
|
||||
- Dynamic throttles allow ops to pause specific sources (`pauseSource`, `resumeSource`) or reduce concurrency.
|
||||
- Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
|
||||
- Control plane quota updates require Authority scope `orch:quota` (issued via `Orch.Admin` role). Historical rebuilds/backfills additionally require `orch:backfill` and must supply `backfill_reason` and `backfill_ticket` alongside the operator metadata. Authority persists all four fields (`quota_reason`, `quota_ticket`, `backfill_reason`, `backfill_ticket`) for audit replay.
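As a rough sketch of how a quota check might gate leasing, the following assumes that `burst` is extra headroom on top of `maxActive` and that `maxPerHour` is a hard cap over a rolling hour. Both interpretations are assumptions; the source names the three fields but does not define their exact semantics. `Quota` and `can_lease` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    max_active: int    # maxActive: concurrently leased jobs
    max_per_hour: int  # maxPerHour: leases granted in the last hour
    burst: int         # burst: assumed to be extra headroom above max_active


def can_lease(quota: Quota, active_jobs: int, leased_last_hour: int) -> bool:
    """Return True when one more job may be leased under the quota."""
    if leased_last_hour >= quota.max_per_hour:
        return False  # hourly cap reached; the job stays pending
    return active_jobs < quota.max_active + quota.burst
```

A job that fails this check would remain staged in the pending queue until the next eligible timestamp.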
|
||||
|
||||
### 3.1) Quota governance service
|
||||
|
||||
The `QuotaGovernanceService` provides cross-tenant quota allocation with configurable policies:
|
||||
|
||||
**Allocation strategies:**
|
||||
- `Equal` — Divide total capacity equally among all active tenants.
|
||||
- `Proportional` — Allocate based on tenant weight/priority tier.
|
||||
- `Priority` — Higher priority tenants get allocation first, with preemption.
|
||||
- `ReservedWithFairShare` — Reserved minimum per tenant, remainder distributed fairly.
|
||||
- `Fixed` — Static allocation per tenant regardless of demand.
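Two of the strategies above can be illustrated with a short sketch. It assumes integer capacity units, gives any indivisible remainder to tenants in sorted order so the result is deterministic, and treats "distributed fairly" as an equal split of the remainder. The function names are hypothetical and the real `QuotaGovernanceService` may differ.

```python
from typing import Dict, List


def allocate_equal(total_capacity: int, tenants: List[str]) -> Dict[str, int]:
    """Equal strategy: split capacity evenly; leftovers go to tenants in sorted order."""
    if not tenants:
        return {}
    ordered = sorted(tenants)
    share, remainder = divmod(total_capacity, len(ordered))
    return {t: share + (1 if i < remainder else 0) for i, t in enumerate(ordered)}


def allocate_reserved_with_fair_share(
    total_capacity: int, reserved: Dict[str, int]
) -> Dict[str, int]:
    """ReservedWithFairShare: each tenant keeps its reserved minimum,
    and the remaining capacity is split equally (an assumed reading of "fairly")."""
    reserved_total = sum(reserved.values())
    if reserved_total > total_capacity:
        raise ValueError("reserved capacity exceeds total capacity")
    extra = allocate_equal(total_capacity - reserved_total, list(reserved))
    return {t: reserved[t] + extra[t] for t in sorted(reserved)}
```

Sorting tenants before distributing the remainder keeps repeated runs over the same input identical, in line with the determinism guardrail.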
|
||||
|
||||
**Key operations:**
|
||||
- `CalculateAllocationAsync` — Compute quota for a tenant based on active policies.
|
||||
- `RequestQuotaAsync` — Request quota from shared pool; returns granted amount with burst usage.
|
||||
- `ReleaseQuotaAsync` — Return quota to shared pool after job completion.
|
||||
- `CanScheduleAsync` — Check scheduling eligibility combining quota and circuit breaker state.
|
||||
|
||||
**Quota allocation policy properties:**
|
||||
- `TotalCapacity` — Pool size to allocate from (for proportional/fair strategies).
|
||||
- `MinimumPerTenant` / `MaximumPerTenant` — Allocation bounds.
|
||||
- `ReservedCapacity` — Guaranteed capacity for high-priority tenants.
|
||||
- `AllowBurst` / `BurstMultiplier` — Allow temporary overallocation when capacity exists.
|
||||
- `Priority` — Policy evaluation order (higher = first).
|
||||
- `JobType` — Optional job type filter (null = applies to all).
|
||||
|
||||
### 3.2) Circuit breaker service
|
||||
|
||||
The `CircuitBreakerService` implements the circuit breaker pattern for downstream services:
|
||||
|
||||
**States:**
|
||||
- `Closed` — Normal operation; requests pass through. Failures are tracked.
|
||||
- `Open` — Circuit tripped; requests are blocked for `OpenDuration`. Prevents cascade failures.
|
||||
- `HalfOpen` — After open duration, limited test requests allowed. Success → Closed; Failure → Open.
|
||||
|
||||
**Thresholds:**
|
||||
- `FailureThreshold` (0.0–1.0) — Failure rate that triggers circuit open.
|
||||
- `WindowDuration` — Sliding window for failure rate calculation.
|
||||
- `MinimumSamples` — Minimum requests before circuit can trip.
|
||||
- `OpenDuration` — How long circuit stays open before half-open transition.
|
||||
- `HalfOpenTestCount` — Number of test requests allowed in half-open state.
|
||||
|
||||
**Key operations:**
|
||||
- `CheckAsync` — Verify if request is allowed; returns `CircuitBreakerCheckResult`.
|
||||
- `RecordSuccessAsync` / `RecordFailureAsync` — Update circuit state after request.
|
||||
- `ForceOpenAsync` / `ForceCloseAsync` — Manual operator intervention (audited).
|
||||
- `ListAsync` — View all circuit breakers for a tenant with optional state filter.
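The states and thresholds above can be sketched as a small in-memory state machine. This is an illustration only: time is passed in explicitly as seconds, the class and method names are hypothetical rather than the `CircuitBreakerService` API, and a single successful test request is assumed to close the circuit.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class CircuitBreaker:
    failure_threshold: float = 0.5  # FailureThreshold
    window_seconds: float = 60.0    # WindowDuration
    minimum_samples: int = 5        # MinimumSamples
    open_seconds: float = 30.0      # OpenDuration
    half_open_test_count: int = 1   # HalfOpenTestCount
    state: str = "Closed"
    _samples: Deque[Tuple[float, bool]] = field(default_factory=deque)
    _opened_at: float = 0.0
    _half_open_in_flight: int = 0

    def check(self, now: float) -> bool:
        """Return True when a request may proceed at time `now`."""
        if self.state == "Open" and now - self._opened_at >= self.open_seconds:
            self.state = "HalfOpen"
            self._half_open_in_flight = 0
        if self.state == "Open":
            return False
        if self.state == "HalfOpen":
            if self._half_open_in_flight >= self.half_open_test_count:
                return False
            self._half_open_in_flight += 1
        return True

    def record(self, now: float, success: bool) -> None:
        """Record a request outcome and update the circuit state."""
        if self.state == "HalfOpen":
            if success:
                self.state = "Closed"
                self._samples.clear()
            else:
                self._trip(now)
            return
        if self.state == "Open":
            return
        self._samples.append((now, success))
        # Drop samples that fell out of the sliding window.
        while self._samples and now - self._samples[0][0] > self.window_seconds:
            self._samples.popleft()
        failures = sum(1 for _, ok in self._samples if not ok)
        if (len(self._samples) >= self.minimum_samples
                and failures / len(self._samples) > self.failure_threshold):
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = "Open"
        self._opened_at = now
        self._samples.clear()
```

A caller would invoke `check` before contacting the downstream service and `record` afterwards, mirroring the `CheckAsync` and `RecordSuccessAsync` / `RecordFailureAsync` pairing.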
|
||||
|
||||
**Downstream services protected:**
|
||||
- Scanner
|
||||
- Attestor
|
||||
- Policy Engine
|
||||
- Registry clients
|
||||
- External integrations
|
||||
|
||||
## 4) APIs
|
||||
|
||||
### 4.1) Job management
|
||||
- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
|
||||
- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
|
||||
- `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason.
|
||||
- `POST /api/jobs/{id}/replay` — schedule replay.
|
||||
- `POST /api/limits/throttle` — apply throttle (requires elevated scope).
|
||||
- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
|
||||
|
||||
### 4.2) Circuit breaker endpoints (`/api/v1/jobengine/circuit-breakers`)
|
||||
- `GET /` — List all circuit breakers for tenant (optional `?state=` filter).
|
||||
- `GET /{serviceId}` — Get circuit breaker state for specific downstream service.
|
||||
- `GET /{serviceId}/check` — Check if requests are allowed; returns `IsAllowed`, `State`, `FailureRate`, `TimeUntilRetry`.
|
||||
- `POST /{serviceId}/success` — Record successful request to downstream service.
|
||||
- `POST /{serviceId}/failure` — Record failed request (body: `failureReason`).
|
||||
- `POST /{serviceId}/force-open` — Manually open circuit (body: `reason`; audited).
|
||||
- `POST /{serviceId}/force-close` — Manually close circuit (audited).
|
||||
|
||||
### 4.3) Quota governance endpoints (`/api/v1/jobengine/quota-governance`)
|
||||
- `GET /policies` — List quota allocation policies (optional `?enabled=` filter).
|
||||
- `GET /policies/{policyId}` — Get specific policy.
|
||||
- `POST /policies` — Create new policy.
|
||||
- `PUT /policies/{policyId}` — Update policy.
|
||||
- `DELETE /policies/{policyId}` — Delete policy.
|
||||
- `GET /allocation` — Calculate allocation for current tenant (optional `?jobType=`).
|
||||
- `POST /request` — Request quota from pool (body: `jobType`, `requestedAmount`).
|
||||
- `POST /release` — Release quota back to pool (body: `jobType`, `releasedAmount`).
|
||||
- `GET /status` — Get tenant quota status (optional `?jobType=`).
|
||||
- `GET /summary` — Get quota governance summary across all tenants (optional `?policyId=`).
|
||||
- `GET /can-schedule` — Check if job can be scheduled (optional `?jobType=`).
|
||||
|
||||
### 4.4) Discovery and documentation
|
||||
- Event envelope draft (`docs/modules/jobengine/event-envelope.md`) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.
|
||||
- OpenAPI discovery: `/.well-known/openapi` exposes `/openapi/jobengine.json` (OAS 3.1) with pagination/idempotency/error-envelope examples; legacy job detail/summary endpoints now ship `Deprecation` + `Link` headers that point to their replacements.
|
||||
|
||||
### 4.5) Release control plane dashboard endpoints
|
||||
- `GET /api/v1/release-jobengine/dashboard` — control-plane dashboard payload (pipeline, pending approvals, active deployments, recent releases).
|
||||
- `POST /api/v1/release-jobengine/promotions/{id}/approve` — approve a pending promotion from dashboard context.
|
||||
- `POST /api/v1/release-jobengine/promotions/{id}/reject` — reject a pending promotion from dashboard context.
|
||||
- Compatibility aliases are exposed for legacy clients under `/api/release-jobengine/*`.
|
||||
|
||||
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
|
||||
|
||||
## 5) Observability
|
||||
|
||||
- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`.
|
||||
- Task Runner bridge adds `pack_run_logs_stream_lag_seconds`, `pack_run_heartbeats_total`, `pack_run_artifacts_total`.
|
||||
- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops.
|
||||
- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.
|
||||
|
||||
## 6) Offline support
|
||||
|
||||
- Orchestrator exports audit bundles: `jobs.jsonl`, `history.jsonl`, `throttles.jsonl`, `manifest.json`, `signatures/`. Used for offline investigations and compliance.
|
||||
- Replay manifests contain job digests and success/failure notes for deterministic proof.
|
||||
|
||||
## 7) Operational considerations
|
||||
|
||||
- HA deployment with multiple API instances; queue storage determines redundancy strategy.
|
||||
- Support for `maintenance` mode halting leases while allowing status inspection.
|
||||
- Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).
|
||||
|
||||
---
|
||||
|
||||
## 8) Orchestration domain subdomains (Sprint 208)
|
||||
|
||||
Sprint 208 consolidated Scheduler, TaskRunner, and PacksRegistry source trees under `src/JobEngine/` as subdomains of the orchestration domain. Each subdomain retains its own project names, namespaces, and runtime identities. No namespace renames were performed.
|
||||
|
||||
### 8.1) Scheduler subdomain
|
||||
|
||||
**Source location:** `src/JobEngine/StellaOps.Scheduler.*`
|
||||
|
||||
The Scheduler service re-evaluates already-cataloged images when intelligence changes (Concelier/Excititor/policy), orchestrates nightly and ad-hoc runs, targets only impacted images using the BOM-Index, and emits report-ready events for downstream Notify. Default mode is analysis-only (no image pull); optional content-refresh can be enabled per schedule.
|
||||
|
||||
**Deployables:** `StellaOps.Scheduler.WebService` (stateless), `StellaOps.Scheduler.Worker.Host` (scale-out).
|
||||
|
||||
**Database:** `SchedulerDbContext` (schema `scheduler`, 11 entities). Owns `schedules`, `runs`, `impact_cursors`, `locks`, `audit` tables. See archived docs: `docs-archived/modules/scheduler/architecture.md`.
|
||||
|
||||
### 8.2) TaskRunner subdomain
|
||||
|
||||
**Source location:** `src/JobEngine/StellaOps.TaskRunner/`, `src/JobEngine/StellaOps.TaskRunner.__Libraries/`
|
||||
|
||||
The TaskRunner provides the execution substrate for Orchestrator jobs. Workers poll lease endpoints, execute tasks, report outcomes, and stream logs/artifacts for pack-runs.
|
||||
|
||||
**Deployables:** `StellaOps.TaskRunner.WebService`, `StellaOps.TaskRunner.Worker`.
|
||||
|
||||
**Database and storage contract (Sprint 312):**
|
||||
- `Storage:Driver=postgres` is the production default for run state, logs, and approvals.
|
||||
- Postgres-backed stores: `PostgresPackRunStateStore`, `PostgresPackRunLogStore`, `PostgresPackRunApprovalStore` via `TaskRunnerDataSource`.
|
||||
- Artifact payload channel uses object storage path (`seed-fs` driver) configured with `TaskRunner:Storage:ObjectStore:SeedFs:RootPath`.
|
||||
- Explicit non-production overrides remain available (`filesystem`, `inmemory`) but are no longer implicit defaults.
|
||||
|
||||
### 8.3) PacksRegistry subdomain
|
||||
|
||||
**Source location:** `src/JobEngine/StellaOps.PacksRegistry/`, `src/JobEngine/StellaOps.PacksRegistry.__Libraries/`
|
||||
|
||||
The PacksRegistry manages compliance/automation pack definitions, versions, and distribution for the task execution pipeline.
|
||||
|
||||
**Deployables:** `StellaOps.PacksRegistry.WebService`, `StellaOps.PacksRegistry.Worker`.
|
||||
|
||||
**Database and storage contract (Sprint 312):**
|
||||
- `Storage:Driver=postgres` is the production default for metadata/state repositories (`pack`, `parity`, `lifecycle`, `mirror`, `audit`, `attestation metadata`).
|
||||
- Blob/object payloads (`pack content`, `provenance content`, `attestation content`) are persisted through the seed-fs object-store channel (`SeedFsPacksRegistryBlobStore`).
|
||||
- PostgreSQL keeps metadata and compatibility placeholders; payload retrieval resolves from object storage first.
|
||||
- Explicit non-production overrides remain available (`filesystem`, `inmemory`) but are no longer implicit defaults.
|
||||
|
||||
---
|
||||
|
||||
## 9) Architecture Decision Record: No DB merge (Sprint 208)
|
||||
|
||||
**Decision:** OrchestratorDbContext and SchedulerDbContext remain as separate DbContexts with separate PostgreSQL schemas. No cross-schema DB merge.
|
||||
|
||||
**Context:** Sprint 208 evaluated merging the Orchestrator (39 entities) and Scheduler (11 entities) DbContexts into a single unified context. Both define `Jobs` and `JobHistory` entities.
|
||||
|
||||
**Problem:** The `Jobs` and `JobHistory` entities have fundamentally incompatible semantics:
|
||||
- **OrchestratorDbContext.Jobs:** Represents pipeline orchestration runs (source ingestion, policy evaluation, release promotion). Fields include `payloadDigest`, `dependencies`, `leaseId`, `retryPolicy`.
|
||||
- **SchedulerDbContext.Jobs:** Represents cron-scheduled rescan executions (image re-evaluation, impact-index-driven). Fields include `scheduleId`, `trigger` (cron/concelier/excititor/manual), `impactSet`, `runStats`.
|
||||
|
||||
Merging would require renaming one set of entities (e.g., `SchedulerJobs`, `SchedulerJobHistory`), propagating through repositories, query code, compiled models, migrations, and external contracts. The schemas already provide clean separation at no operational cost since both live in the same `stellaops_platform` database.
|
||||
|
||||
**Decision rationale:**
|
||||
1. Entity name collision with incompatible models makes merge risky and disruptive.
|
||||
2. Compiled models from Sprint 219 would need regeneration for both contexts.
|
||||
3. Schemas provide clean separation at zero cost.
|
||||
4. Future domain rename (Sprint 221) is a better venue for any schema consolidation.
|
||||
|
||||
**Consequences:** TaskRunner and PacksRegistry remain independent subdomains and now implement explicit storage contracts (Postgres state/metadata plus object-store payload channels) without cross-schema DB merge.
|
||||
|
||||
---
|
||||
|
||||
## 10) Schema continuity remediation (Sprint 311)
|
||||
|
||||
Sprint 221 renamed the domain from Orchestrator to JobEngine but intentionally preserved the PostgreSQL schema name `orchestrator` for continuity. Sprint 311 closed the implementation drift so runtime, design-time, and compiled-model paths now align on the same preserved schema default.
|
||||
|
||||
Implemented alignment:
|
||||
- Runtime default schema is centralized in `JobEngineDbContext.DefaultSchemaName` (`orchestrator`) and schema normalization is centralized in `JobEngineDbContext.ResolveSchemaName(...)`.
|
||||
- Repository runtime context creation (`JobEngineDbContextFactory`) uses that same shared default and normalization logic.
|
||||
- Design-time context creation now passes `JobEngineDbContext.DefaultSchemaName` explicitly instead of relying on implicit constructor fallback.
|
||||
- EF compiled model schema annotations were aligned to `orchestrator` so compiled-model and runtime model behavior match.
|
||||
|
||||
Out of scope for Sprint 311:
|
||||
- No schema migration from `orchestrator` to `jobengine` was introduced.
|
||||
- Any future physical schema rename requires a dedicated migration sprint with data/backfill and rollback planning.
|
||||
69
docs/modules/jobengine/event-envelope.md
Normal file
@@ -0,0 +1,69 @@
|
||||
# Orchestrator Event Envelope (draft)
|
||||
|
||||
Status: draft for ORCH-SVC-38-101 (pending ORCH-SVC-37-101 approval)
|
||||
|
||||
## Goals
|
||||
- Single, provenance-rich envelope for policy/export/job lifecycle events.
|
||||
- Idempotent across retries and transports (Notifier bus, webhooks, SSE/WS streams).
|
||||
- Tenant/project isolation and offline-friendly replays.
|
||||
|
||||
## Envelope
|
||||
```jsonc
{
  "schemaVersion": "orch.event.v1",
  "eventId": "urn:orch:event:...", // UUIDv7 or ULID
  "eventType": "job.failed|job.completed|pack_run.log|pack_run.artifact|policy.updated|export.completed",
  "occurredAt": "2025-11-19T12:34:56Z",
  "idempotencyKey": "orch-{eventType}-{jobId}-{attempt}",
  "correlationId": "corr-...", // propagated from producer
  "tenantId": "...",
  "projectId": "...", // optional but preferred
  "actor": {
    "subject": "service/worker-sdk-go", // who emitted the event
    "scopes": ["orch:quota", "orch:backfill"]
  },
  "job": {
    "id": "job_018f...",
    "type": "pack-run|ingest|export|policy-simulate",
    "runId": "run_018f...", // for pack runs / sims
    "attempt": 3,
    "leaseId": "lease_018f...",
    "taskRunnerId": "tr_018f...",
    "status": "completed|failed|running|canceled",
    "reason": "user_cancelled|retry_backoff|quota_paused",
    "payloadDigest": "sha256:...",
    "artifacts": [
      {"uri": "s3://...", "digest": "sha256:...", "mime": "application/json"}
    ]
  },
  "metrics": {
    "durationSeconds": 12.345,
    "logStreamLagSeconds": 0.8,
    "backoffSeconds": 30
  },
  "notifier": {
    "channel": "orch.jobs",
    "delivery": "dsse",
    "replay": {"ordinal": 5, "total": 12}
  }
}
```
|
||||
|
||||
## Idempotency rules
|
||||
- `eventId` is globally unique; `idempotencyKey` deduplicates per channel.
|
||||
- Emit once per state transition; retries reuse the same `eventId`/`idempotencyKey`.
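A subscriber-side view of these rules might look like the sketch below. The key format follows the `orch-{eventType}-{jobId}-{attempt}` pattern from the envelope; the in-memory set and the names `build_idempotency_key` and `ChannelDeduplicator` are assumptions for illustration, and a real consumer would need durable storage.

```python
from typing import Set, Tuple


def build_idempotency_key(event_type: str, job_id: str, attempt: int) -> str:
    """Format the key using the pattern shown in the envelope."""
    return f"orch-{event_type}-{job_id}-{attempt}"


class ChannelDeduplicator:
    """Tracks (channel, idempotencyKey) pairs so that retries are dropped per channel."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def accept(self, channel: str, idempotency_key: str) -> bool:
        """Return True the first time a key is seen on a channel, False on repeats."""
        entry = (channel, idempotency_key)
        if entry in self._seen:
            return False
        self._seen.add(entry)
        return True
```

Because deduplication is scoped per channel, the same event delivered over both the Notifier bus and a webhook is accepted once on each.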
|
||||
|
||||
## Provenance
|
||||
- Always include `tenantId` and `projectId` (if available).
|
||||
- Carry `correlationId` from upstream producers and `taskRunnerId` from leasing bridge.
|
||||
- Include `actor.scopes` when events are triggered via elevated tokens (`orch:quota`, `orch:backfill`).
|
||||
|
||||
## Transport bindings
|
||||
- **Notifier bus**: DSSE-wrapped envelope; subject `orch.event` and `eventType`.
|
||||
- **Webhooks**: HMAC with `X-Orchestrator-Signature` (sha256), replay-safe via `idempotencyKey`.
|
||||
- **SSE/WS**: stream per `tenantId` filtered by `projectId`; client dedupe via `eventId`.
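For the webhook binding, a receiver could verify the signature roughly as follows. This sketch assumes the `X-Orchestrator-Signature` header carries a lowercase hex HMAC-SHA256 of the raw request body under a shared secret; the encoding and the function names are assumptions, since the draft only states "HMAC ... (sha256)".

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 of the raw body (assumed header encoding)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Compare the received header against the expected value in constant time."""
    expected = sign_body(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

After verification succeeds, the receiver would still deduplicate on `idempotencyKey` to stay replay-safe.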
|
||||
|
||||
## Backlog & follow-ups
|
||||
- Align field names with ORCH-SVC-37-101 once finalized.
|
||||
- Add examples for policy/export events and pack-run log/manifest payloads.
|
||||
- Document retry/backoff semantics in Notify/Console subscribers.
|
||||
30
docs/modules/jobengine/guides/orchestrator-slo.md
Normal file
@@ -0,0 +1,30 @@
|
||||
# Orchestrator SLOs (DOCS-ORCH-34-005)
|
||||
|
||||
Last updated: 2025-11-25
|
||||
|
||||
## Service level objectives
|
||||
- **Availability**: 99.9% monthly for WebService API per tenant.
|
||||
- **Run completion**: P95 run duration < 5m for standard DAGs; failure rate <1% over 30d.
|
||||
- **Queue health**: backlog < 1000 items per tenant for >95% of 5m windows.
|
||||
- **Event delivery**: WebSocket/stream delivery success > 99.5% (per day).
|
||||
|
||||
## Error budget policy
|
||||
- Window: 28 days. Burn alerts:
|
||||
- 2× burn: page on-call.
|
||||
- 14× burn: immediate mitigation (disable offending DAGs, scale workers).
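The burn multiples above can be read as the observed error rate divided by the error budget (1 minus the SLO target). The sketch below uses that common definition, which the source does not spell out; the function names and the action labels are hypothetical.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if total <= 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target
    if error_budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return (failed / total) / error_budget


def burn_action(rate: float) -> str:
    """Map a burn rate to the policy tiers listed above."""
    if rate >= 14.0:
        return "mitigate"  # immediate mitigation
    if rate >= 2.0:
        return "page"      # page on-call
    return "ok"
```

For example, with a 99.9% target, 3 failures in 1000 requests burn the budget at roughly 3×, which falls in the paging tier.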
|
||||
|
||||
## Alerts (examples)
|
||||
- Availability: `avg_over_time(probe_success{job="orchestrator-api"}[1h]) < 0.999`.
|
||||
- Latency: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.5`.
|
||||
- Run failures: `rate(orchestrator_runs_total{status="failed"}[30m]) / rate(orchestrator_runs_total[30m]) > 0.01`.
|
||||
- Queue backlog: `orchestrator_queue_depth > 1000` for 10m.
|
||||
|
||||
## Dashboards
|
||||
- Golden signals per service (traffic, errors, latency, saturation).
|
||||
- Run outcome panel: success/fail/cancel counts, retry counts.
|
||||
- Queue panel: depth, age, worker consumption rate.
|
||||
- Burn-rate panel tied to error budget.
|
||||
|
||||
## Ownership & review
|
||||
- SLOs owned by Orchestrator Guild; reviewed quarterly or when architecture changes.
|
||||
- Changes must be reflected in runbook and alert rules; update manifests for offline/air-gap monitoring kits.
|
||||
24
docs/modules/jobengine/implementation_plan.md
Normal file
@@ -0,0 +1,24 @@
|
||||
# Orchestrator Implementation Plan
|
||||
|
||||
## Purpose
|
||||
Provide a living plan for Orchestrator deliverables, dependencies, and evidence.
|
||||
|
||||
## Active work
|
||||
- Track current sprints under `docs/implplan/SPRINT_*.md` for this module.
|
||||
- Update this file when new scoped work is approved.
|
||||
|
||||
## Near-term deliverables
|
||||
- TBD (add when sprint is staffed).
|
||||
|
||||
## Dependencies
|
||||
- `docs/modules/jobengine/architecture.md`
|
||||
- `docs/modules/jobengine/README.md`
|
||||
- `docs/modules/platform/architecture-overview.md`
|
||||
|
||||
## Evidence of completion
|
||||
- Code changes under `src/JobEngine/**`.
|
||||
- Tests and fixtures under the module's `__Tests` / `__Libraries`.
|
||||
- Docs and runbooks under `docs/modules/jobengine/**`.
|
||||
|
||||
## Notes
|
||||
- Keep deterministic and offline-first expectations aligned with module AGENTS.
|
||||
51
docs/modules/jobengine/job-export-contract.md
Normal file
@@ -0,0 +1,51 @@
|
||||
# Orchestrator → Findings Ledger Export Contract
|
||||
|
||||
Status: Available (2025-12-03)
|
||||
Scope: defines the deterministic payload Orchestrator emits for job/run exports that Findings Ledger ingests for provenance (LEDGER-34-101).
|
||||
|
||||
## Payload shape
|
||||
```jsonc
{
  "runId": "uuid", // job/run correlation id
  "jobType": "string", // e.g., mirror-build, policy-sim, scan
  "artifactHash": "sha256:...", // CAS digest of primary artifact
  "policyHash": "sha256:...", // optional; policy bundle hash
  "startedAt": "2025-12-02T00:00:00Z",
  "completedAt": "2025-12-02T00:05:30Z",
  "status": "succeeded|failed|canceled",
  "manifestPath": "cas://.../manifest.json", // DSSE or CAS path
  "logsPath": "cas://.../logs.ndjson",
  "tenantId": "string",
  "environment": "prod|stage|dev",
  "idempotencyKey": "sha256:...", // sha256(runId + artifactHash + tenantId)
  "signatures": [ { "type": "dsse", "keyId": "...", "signature": "..." } ]
}
```
|
||||
|
||||
## Determinism & ordering
|
||||
- Entries sorted by `runId` when streamed; pagination stable via `runId, startedAt` ascending.
|
||||
- `idempotencyKey = sha256(runId + artifactHash + tenantId)`; duplicate submissions are rejected with 409.
|
||||
- Timestamps UTC ISO-8601; no clock-skew correction performed by Ledger.
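The key derivation and ordering rules above could be sketched like this. The contract gives `sha256(runId + artifactHash + tenantId)` but not the byte encoding or any separator, so plain UTF-8 concatenation is an assumption, as are the function names.

```python
import hashlib
from typing import Dict, List


def export_idempotency_key(run_id: str, artifact_hash: str, tenant_id: str) -> str:
    """sha256 over the concatenated fields (UTF-8, no separator: both assumed)."""
    material = (run_id + artifact_hash + tenant_id).encode("utf-8")
    return "sha256:" + hashlib.sha256(material).hexdigest()


def stable_order(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort by (runId, startedAt) ascending so pages are reproducible.

    UTC ISO-8601 timestamps in one fixed format sort correctly as strings.
    """
    return sorted(entries, key=lambda e: (e["runId"], e["startedAt"]))
```

Since the key depends only on the three input fields, resubmitting the same export yields the same key, which is what lets the Ledger answer duplicates with 409.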
|
||||
|
||||
## Transport
|
||||
- REST: `POST /internal/jobengine/exports` (Orchestrator) → Findings Ledger ingest queue.
|
||||
- Events: `orchestrator.export.created` carries the same payload; consumers must verify DSSE before persistence.
|
||||
|
||||
## Validation rules (Ledger side)
|
||||
- Require DSSE signature (Ed25519) when `signatures` present; fail closed if verification fails.
|
||||
- Enforce presence of `runId`, `artifactHash`, `startedAt`, `status`.
|
||||
- Hash fields must match `^sha256:[A-Fa-f0-9]{64}$`.
|
||||
- Allowed status transitions: pending→running→succeeded/failed/canceled; replays only allowed when `idempotencyKey` matches existing record.
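A minimal sketch of the field-level checks follows, using the required fields, hash pattern, and transitions listed above. It treats `artifactHash` and `policyHash` as the "hash fields" (an assumption), skips DSSE verification entirely, and uses hypothetical function names.

```python
import re
from typing import Any, Dict, List

HASH_RE = re.compile(r"sha256:[A-Fa-f0-9]{64}")
REQUIRED_FIELDS = ("runId", "artifactHash", "startedAt", "status")
HASH_FIELDS = ("artifactHash", "policyHash")  # assumed set of hash fields
ALLOWED_TRANSITIONS = {
    "pending": {"running"},
    "running": {"succeeded", "failed", "canceled"},
}


def validate_export(payload: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the payload passes."""
    errors = []
    for name in REQUIRED_FIELDS:
        if not payload.get(name):
            errors.append(f"missing {name}")
    for name in HASH_FIELDS:
        value = payload.get(name)
        if value is not None and not HASH_RE.fullmatch(str(value)):
            errors.append(f"{name} is not a valid sha256 digest")
    return errors


def transition_allowed(current: str, proposed: str) -> bool:
    """Check a status change against pending -> running -> terminal states."""
    return proposed in ALLOWED_TRANSITIONS.get(current, set())
```

Terminal states have no outgoing transitions in this sketch, so a change from `succeeded` back to `running` is refused.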
|
||||
|
||||
## Mapping to Findings Ledger
|
||||
- Stored in collection/table `orchestrator_exports` with index `(artifactHash, runId)` and an optional TTL on logs, if configured.
|
||||
- Timeline entry type `ledger_export` references `runId`, `artifactHash`, `policyHash`, `manifestPath`, `logsPath`, DSSE envelope digest, and `idempotencyKey`.
|
||||
- Cross-links: `bundleId` (if mirror build) or `scanId` (if scan job) may be added as optional fields; Ledger treats them as opaque strings.
|
||||
|
||||
## Security / scopes
|
||||
- Required scope: `orchestrator:exports:write` for producer; Ledger ingress validates tenant headers and scope.
|
||||
- Max payload: 1 MiB; logs must be CAS/DSSE referenced, not inline.
|
||||
|
||||
## Offline/air-gap considerations
|
||||
- CAS/DSSE paths must resolve within offline kit bundles; no external URIs permitted.
|
||||
- Deterministic ordering + idempotency allow replay without side effects; Ledger rejects writes when DSSE verification fails or a hash does not match.
|
||||