consolidation of some modules, localization fixes, product advisories work, QA work

This commit is contained in:
master
2026-03-05 03:54:22 +02:00
parent 7bafcc3eef
commit 8e1cb9448d
3878 changed files with 72600 additions and 46861 deletions

@@ -0,0 +1,34 @@
# Source & Job Orchestrator agent guide
## Mission
The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)
## How to get started
1. Read the design summaries in ./architecture.md (quota governance, job lifecycle, dashboard feeds).
2. Open sprint file `/docs/implplan/SPRINT_*.md` and locate stories for this component.
3. Check ./TASKS.md and update status before/after work.
4. Review ./README.md for responsibilities and ensure changes maintain determinism and offline parity.
## Guardrails
- Uphold Aggregation-Only Contract boundaries when consuming ingestion data.
- Preserve determinism and provenance in all derived outputs.
- Document offline/air-gap pathways for any new feature.
- Update telemetry/observability assets alongside feature work.
## Required Reading
- `docs/modules/jobengine/README.md`
- `docs/modules/jobengine/architecture.md`
- `docs/modules/jobengine/implementation_plan.md`
- `docs/modules/platform/architecture-overview.md`
## Working Agreement
1. Update task status to `DOING`/`DONE` in both the corresponding sprint file `/docs/implplan/SPRINT_*.md` and the local `TASKS.md` when you start or finish work.
2. Review this charter and the Required Reading documents before coding; confirm prerequisites are met.
3. Keep changes deterministic (stable ordering, timestamps, hashes) and align with offline/air-gap expectations.
4. Coordinate doc updates, tests, and cross-guild communication whenever contracts or workflows change.
5. Revert to `TODO` if you pause the task without shipping changes; leave notes in commit/PR descriptions for context.

@@ -0,0 +1,78 @@
# StellaOps Source & Job Orchestrator
The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
## Latest updates (2025-11-30)
- OpenAPI discovery published at `/.well-known/openapi` with `openapi/orchestrator.json`; includes pagination/idempotency/error-envelope examples and version headers.
- Legacy job detail/summary endpoints now emit `Deprecation` + `Link` headers pointing to the stable replacements.
- Job leasing flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
- Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
- Authority `orch:quota` / `orch:backfill` scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.
## Responsibilities
- Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
- Expose dashboards and APIs for throttling, replays, and failover.
- Enforce rate limits, concurrency, and dependency chains across queues.
- Stream structured events and audit logs for incident response.
- Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.
## Key components
- Orchestrator WebService (control plane).
- Queue adapters (Valkey/NATS) and job ledger.
- Console dashboard module and CLI integration for operators.
## Integrations & dependencies
- Authority for authN/Z on operational actions.
- Telemetry stack for job metrics and alerts.
- Scheduler/Concelier/Excititor workers for job lifecycle.
- Offline Kit for state export/import during air-gap refreshes.
## Operational notes
- Job recovery runbooks and dashboard JSON as described in Epic 9.
- Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
- Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
- When using `orch:quota` / `orch:backfill` scopes, capture reason/ticket fields in runbooks and audit checklists.
## Implementation Status
### Phase 1: Core service & job ledger (Complete)
- PostgreSQL schema with sources, runs, jobs, artifacts, DAG edges, quotas, schedules, incidents
- Lease manager with heartbeats, retries, dead-letter queues
- Token-bucket rate limiter per tenant/source.host with adaptive refill
- Watermark/backfill orchestration for event-time windows
### Phase 2: Worker SDK & artifact registry (Complete)
- Claim/heartbeat/report contract with deterministic artifact hashing
- Idempotency enforcement and worker SDKs for .NET/Rust/Go agents
- Integrated with Concelier, Excititor, SBOM Service, Policy Engine
### Phase 3: Observability & dashboard (In Progress)
- Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate
- Error clustering for HTTP 429/5xx, schema mismatches, parse errors
- SSE/WebSocket feeds for Console updates, Gantt timeline/DAG JSON
### Phase 4: Controls & resilience (Planned)
- Pause/resume/throttle/retry/backfill tooling
- Dead-letter review, circuit breakers, blackouts, backpressure handling
- Automation hooks and control plane APIs
### Phase 5: Offline & compliance (Planned)
- Deterministic audit bundles (jobs.jsonl, history.jsonl, throttles.jsonl)
- Provenance manifests and offline replay scripts
- Tenant isolation validation and secret redaction
### Key Acceptance Criteria
- Schedules all jobs with quotas, rate limits, idempotency; preserves provenance
- Console reflects real-time DAG status, queue depth, SLO burn rate
- Observability stack exposes metrics, logs, traces, incidents for stuck jobs and throttling
- Offline audit bundles reproduce job history deterministically with verified signatures
### Technical Decisions & Risks
- Backpressure/queue overload mitigated via adaptive token buckets, circuit breakers, dynamic concurrency
- Upstream vendor throttles managed with visible state, automatic jitter and retry
- Tenant leakage prevented through API/queue/storage filters, fuzz tests, redaction
- Complex DAG errors handled with diagnostics, error clustering, partial replay tooling
## Epic alignment
- Epic 9: Source & Job Orchestrator Dashboard.
- ORCH stories in ../../TASKS.md.

@@ -0,0 +1,228 @@
# Source & Job Orchestrator architecture
> Based on Epic 9 (Source & Job Orchestrator Dashboard); this document outlines components, job lifecycle, rate-limit governance, and observability.
## 1) Topology
- **Orchestrator API (`StellaOps.JobEngine`).** Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (`orchestrator:*`).
- **Job ledger (PostgreSQL).** Tables `jobs`, `job_history`, `sources`, `quotas`, `throttles`, `incidents` (schema `orchestrator`). Append-only history ensures auditability.
- **Queue abstraction.** Supports Valkey Streams or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
## 2) Job lifecycle
1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in the pending queue with the next eligible timestamp.
3. **Leasing (Task Runner bridge).** Workers poll `LeaseJob` endpoint; Orchestrator returns job with `leaseId`, `leaseUntil`, `idempotencyKey`, and instrumentation tokens. Lease renewal required for long-running tasks; leases carry retry hints and provenance (`tenant`, `project`, `correlationId`, `taskRunnerId`).
4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded.
5. **Replay.** Operators trigger `POST /jobs/{id}/replay` which clones job payload, sets `replayOf` pointer, and requeues with high priority while preserving determinism metadata.
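The lifecycle steps above can be sketched as a worker loop against the lease endpoint. This is a minimal, in-memory illustration: `OrchestratorClient`, `lease_job`, and `report` are hypothetical stand-ins for the real `LeaseJob`/completion contract, and retry policy stays server-side as described.

```python
import time
import uuid


class OrchestratorClient:
    """Illustrative in-memory stand-in for the lease/report endpoints."""

    def __init__(self, jobs):
        self.pending = list(jobs)
        self.reports = []

    def lease_job(self, worker_id):
        if not self.pending:
            return None
        job = self.pending.pop(0)
        # Orchestrator attaches lease metadata; the worker must renew before leaseUntil.
        job["leaseId"] = str(uuid.uuid4())
        job["leaseUntil"] = time.time() + 30.0
        job["idempotencyKey"] = f"orch-{job['jobType']}-{job['jobId']}-{job['attempt']}"
        return job

    def report(self, lease_id, status):
        self.reports.append((lease_id, status))


def run_worker(client, worker_id, execute):
    """Poll for leases, execute, and report a terminal status per job."""
    completed = []
    while True:
        job = client.lease_job(worker_id)
        if job is None:
            break
        try:
            execute(job)
            client.report(job["leaseId"], "succeeded")
        except Exception:
            # Server-side retry policy (exponential backoff, max attempts) takes over.
            client.report(job["leaseId"], "failed")
        completed.append(job["jobId"])
    return completed
```

Lease renewal and heartbeats are omitted for brevity; a long-running worker would renew before `leaseUntil` expires.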
### Pack-run lifecycle (phase III)
- **Register** `pack-run` job type with task runner hints (artifacts, log channel, heartbeat cadence).
- **Logs/Artifacts**: SSE/WS stream keyed by `packRunId` + `tenant/project`; artifacts published with content digests and URI metadata.
- **Events**: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending ORCH-SVC-37-101 final spec.
## 3) Rate-limit & quota governance
- Quotas defined per tenant/profile (`maxActive`, `maxPerHour`, `burst`). Stored in `quotas` and enforced before leasing.
- Dynamic throttles allow ops to pause specific sources (`pauseSource`, `resumeSource`) or reduce concurrency.
- Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
- Control plane quota updates require Authority scope `orch:quota` (issued via `Orch.Admin` role). Historical rebuilds/backfills additionally require `orch:backfill` and must supply `backfill_reason` and `backfill_ticket` alongside the operator metadata. Authority persists all four fields (`quota_reason`, `quota_ticket`, `backfill_reason`, `backfill_ticket`) for audit replay.
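The token-bucket enforcement described above (and in Phase 1) can be sketched as follows. The adaptive-refill logic is omitted, so this is a plain bucket as one might key it per `{tenant, jobType}`; parameter names are illustrative:

```python
import time


class TokenBucket:
    """Refill at `rate` tokens/second, capped at `burst` capacity."""

    def __init__(self, rate, burst, now=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)
        self.now = now
        self.last = now()

    def try_acquire(self, cost=1.0):
        t = self.now()
        # Refill proportionally to elapsed time, never exceeding burst capacity.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def next_eligible_in(self, cost=1.0):
        """Seconds until `cost` tokens accrue; usable as the pending-queue timestamp."""
        deficit = max(0.0, cost - self.tokens)
        return deficit / self.rate
```

A job rejected by `try_acquire` would be staged with `now + next_eligible_in()` as its next eligible timestamp.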
### 3.1) Quota governance service
The `QuotaGovernanceService` provides cross-tenant quota allocation with configurable policies:
**Allocation strategies:**
- `Equal` — Divide total capacity equally among all active tenants.
- `Proportional` — Allocate based on tenant weight/priority tier.
- `Priority` — Higher priority tenants get allocation first, with preemption.
- `ReservedWithFairShare` — Reserved minimum per tenant, remainder distributed fairly.
- `Fixed` — Static allocation per tenant regardless of demand.
**Key operations:**
- `CalculateAllocationAsync` — Compute quota for a tenant based on active policies.
- `RequestQuotaAsync` — Request quota from shared pool; returns granted amount with burst usage.
- `ReleaseQuotaAsync` — Return quota to shared pool after job completion.
- `CanScheduleAsync` — Check scheduling eligibility combining quota and circuit breaker state.
**Quota allocation policy properties:**
- `TotalCapacity` — Pool size to allocate from (for proportional/fair strategies).
- `MinimumPerTenant` / `MaximumPerTenant` — Allocation bounds.
- `ReservedCapacity` — Guaranteed capacity for high-priority tenants.
- `AllowBurst` / `BurstMultiplier` — Allow temporary overallocation when capacity exists.
- `Priority` — Policy evaluation order (higher = first).
- `JobType` — Optional job type filter (null = applies to all).
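As one example of the strategies above, a `Proportional` allocation with a guaranteed minimum per tenant might look like this sketch (integer quotas, deterministic tenant ordering; the real `QuotaGovernanceService` policy model is richer):

```python
def allocate_proportional(total_capacity, weights, minimum=0):
    """Split `total_capacity` by tenant weight after reserving `minimum` per tenant.

    `weights` maps tenant id -> weight/priority tier. Integer division keeps the
    result deterministic; rounding remainder handling is omitted for brevity.
    """
    tenants = sorted(weights)  # stable ordering, as the module requires
    remaining = total_capacity - minimum * len(tenants)
    total_weight = sum(weights.values())
    return {
        tenant: minimum + (remaining * weights[tenant]) // total_weight
        for tenant in tenants
    }
```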
### 3.2) Circuit breaker service
The `CircuitBreakerService` implements the circuit breaker pattern for downstream services:
**States:**
- `Closed` — Normal operation; requests pass through. Failures are tracked.
- `Open` — Circuit tripped; requests are blocked for `OpenDuration`. Prevents cascade failures.
- `HalfOpen` — After open duration, limited test requests allowed. Success → Closed; Failure → Open.
**Thresholds:**
- `FailureThreshold` (0.0–1.0) — Failure rate that triggers circuit open.
- `WindowDuration` — Sliding window for failure rate calculation.
- `MinimumSamples` — Minimum requests before circuit can trip.
- `OpenDuration` — How long circuit stays open before half-open transition.
- `HalfOpenTestCount` — Number of test requests allowed in half-open state.
**Key operations:**
- `CheckAsync` — Verify if request is allowed; returns `CircuitBreakerCheckResult`.
- `RecordSuccessAsync` / `RecordFailureAsync` — Update circuit state after request.
- `ForceOpenAsync` / `ForceCloseAsync` — Manual operator intervention (audited).
- `ListAsync` — View all circuit breakers for a tenant with optional state filter.
**Downstream services protected:**
- Scanner
- Attestor
- Policy Engine
- Registry clients
- External integrations
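The state machine above can be sketched as follows. This is an illustrative reduction of `CircuitBreakerService`: the sliding `WindowDuration` is simplified to cumulative counters that reset on transition, and operator force-open/close is omitted.

```python
import time

CLOSED, OPEN, HALF_OPEN = "Closed", "Open", "HalfOpen"


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, minimum_samples=5,
                 open_duration=30.0, half_open_tests=2, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.minimum_samples = minimum_samples
        self.open_duration = open_duration
        self.half_open_tests = half_open_tests
        self.now = now
        self.state = CLOSED
        self.successes = 0
        self.failures = 0
        self.opened_at = None
        self.test_budget = 0

    def check(self):
        """Is a request allowed right now? (Open -> HalfOpen after open_duration.)"""
        if self.state == OPEN and self.now() - self.opened_at >= self.open_duration:
            self.state, self.test_budget = HALF_OPEN, self.half_open_tests
        if self.state == OPEN:
            return False
        if self.state == HALF_OPEN:
            if self.test_budget <= 0:
                return False
            self.test_budget -= 1  # consume one of the limited test requests
        return True

    def record(self, ok):
        """Update state after a request; half-open outcome decides Closed vs Open."""
        if self.state == HALF_OPEN:
            self._transition(CLOSED if ok else OPEN)
            return
        self.successes += ok
        self.failures += not ok
        total = self.successes + self.failures
        if total >= self.minimum_samples and self.failures / total >= self.failure_threshold:
            self._transition(OPEN)

    def _transition(self, state):
        self.state = state
        self.successes = self.failures = 0
        self.opened_at = self.now() if state == OPEN else None
```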
## 4) APIs
### 4.1) Job management
- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
- `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason.
- `POST /api/jobs/{id}/replay` — schedule replay.
- `POST /api/limits/throttle` — apply throttle (requires elevated scope).
- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
### 4.2) Circuit breaker endpoints (`/api/v1/jobengine/circuit-breakers`)
- `GET /` — List all circuit breakers for tenant (optional `?state=` filter).
- `GET /{serviceId}` — Get circuit breaker state for specific downstream service.
- `GET /{serviceId}/check` — Check if requests are allowed; returns `IsAllowed`, `State`, `FailureRate`, `TimeUntilRetry`.
- `POST /{serviceId}/success` — Record successful request to downstream service.
- `POST /{serviceId}/failure` — Record failed request (body: `failureReason`).
- `POST /{serviceId}/force-open` — Manually open circuit (body: `reason`; audited).
- `POST /{serviceId}/force-close` — Manually close circuit (audited).
### 4.3) Quota governance endpoints (`/api/v1/jobengine/quota-governance`)
- `GET /policies` — List quota allocation policies (optional `?enabled=` filter).
- `GET /policies/{policyId}` — Get specific policy.
- `POST /policies` — Create new policy.
- `PUT /policies/{policyId}` — Update policy.
- `DELETE /policies/{policyId}` — Delete policy.
- `GET /allocation` — Calculate allocation for current tenant (optional `?jobType=`).
- `POST /request` — Request quota from pool (body: `jobType`, `requestedAmount`).
- `POST /release` — Release quota back to pool (body: `jobType`, `releasedAmount`).
- `GET /status` — Get tenant quota status (optional `?jobType=`).
- `GET /summary` — Get quota governance summary across all tenants (optional `?policyId=`).
- `GET /can-schedule` — Check if job can be scheduled (optional `?jobType=`).
### 4.4) Discovery and documentation
- Event envelope draft (`docs/modules/jobengine/event-envelope.md`) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.
- OpenAPI discovery: `/.well-known/openapi` exposes `/openapi/jobengine.json` (OAS 3.1) with pagination/idempotency/error-envelope examples; legacy job detail/summary endpoints now ship `Deprecation` + `Link` headers that point to their replacements.
### 4.5) Release control plane dashboard endpoints
- `GET /api/v1/release-jobengine/dashboard` — control-plane dashboard payload (pipeline, pending approvals, active deployments, recent releases).
- `POST /api/v1/release-jobengine/promotions/{id}/approve` — approve a pending promotion from dashboard context.
- `POST /api/v1/release-jobengine/promotions/{id}/reject` — reject a pending promotion from dashboard context.
- Compatibility aliases are exposed for legacy clients under `/api/release-jobengine/*`.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
## 5) Observability
- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`.
- Task Runner bridge adds `pack_run_logs_stream_lag_seconds`, `pack_run_heartbeats_total`, `pack_run_artifacts_total`.
- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops.
- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.
## 6) Offline support
- Orchestrator exports audit bundles: `jobs.jsonl`, `history.jsonl`, `throttles.jsonl`, `manifest.json`, `signatures/`. Used for offline investigations and compliance.
- Replay manifests contain job digests and success/failure notes for deterministic proof.
## 7) Operational considerations
- HA deployment with multiple API instances; queue storage determines redundancy strategy.
- Support for `maintenance` mode halting leases while allowing status inspection.
- Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).
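The stuck-job recovery step mentioned above (clearing expired leases so jobs return to the pending queue) can be sketched as follows; the job record fields are illustrative:

```python
def recover_stuck_jobs(jobs, now, lease_grace=0.0):
    """Return leased jobs whose lease expired (plus grace) to the pending queue.

    Mutates the job records in place and returns the recovered job ids, which a
    runbook would log for the audit trail.
    """
    recovered = []
    for job in jobs:
        if job["status"] == "leased" and now > job["leaseUntil"] + lease_grace:
            job["status"] = "pending"
            job["leaseId"] = None  # cleared lease; worker heartbeats will be rejected
            recovered.append(job["jobId"])
    return recovered
```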
---
## 8) Orchestration domain subdomains (Sprint 208)
Sprint 208 consolidated Scheduler, TaskRunner, and PacksRegistry source trees under `src/JobEngine/` as subdomains of the orchestration domain. Each subdomain retains its own project names, namespaces, and runtime identities. No namespace renames were performed.
### 8.1) Scheduler subdomain
**Source location:** `src/JobEngine/StellaOps.Scheduler.*`
The Scheduler service re-evaluates already-cataloged images when intelligence changes (Concelier/Excititor/policy), orchestrates nightly and ad-hoc runs, targets only impacted images using the BOM-Index, and emits report-ready events for downstream Notify. Default mode is analysis-only (no image pull); optional content-refresh can be enabled per schedule.
**Deployables:** `StellaOps.Scheduler.WebService` (stateless), `StellaOps.Scheduler.Worker.Host` (scale-out).
**Database:** `SchedulerDbContext` (schema `scheduler`, 11 entities). Owns `schedules`, `runs`, `impact_cursors`, `locks`, `audit` tables. See archived docs: `docs-archived/modules/scheduler/architecture.md`.
### 8.2) TaskRunner subdomain
**Source location:** `src/JobEngine/StellaOps.TaskRunner/`, `src/JobEngine/StellaOps.TaskRunner.__Libraries/`
The TaskRunner provides the execution substrate for Orchestrator jobs. Workers poll lease endpoints, execute tasks, report outcomes, and stream logs/artifacts for pack-runs.
**Deployables:** `StellaOps.TaskRunner.WebService`, `StellaOps.TaskRunner.Worker`.
**Database and storage contract (Sprint 312):**
- `Storage:Driver=postgres` is the production default for run state, logs, and approvals.
- Postgres-backed stores: `PostgresPackRunStateStore`, `PostgresPackRunLogStore`, `PostgresPackRunApprovalStore` via `TaskRunnerDataSource`.
- Artifact payload channel uses object storage path (`seed-fs` driver) configured with `TaskRunner:Storage:ObjectStore:SeedFs:RootPath`.
- Explicit non-production overrides remain available (`filesystem`, `inmemory`) but are no longer implicit defaults.
### 8.3) PacksRegistry subdomain
**Source location:** `src/JobEngine/StellaOps.PacksRegistry/`, `src/JobEngine/StellaOps.PacksRegistry.__Libraries/`
The PacksRegistry manages compliance/automation pack definitions, versions, and distribution for the task execution pipeline.
**Deployables:** `StellaOps.PacksRegistry.WebService`, `StellaOps.PacksRegistry.Worker`.
**Database and storage contract (Sprint 312):**
- `Storage:Driver=postgres` is the production default for metadata/state repositories (`pack`, `parity`, `lifecycle`, `mirror`, `audit`, `attestation metadata`).
- Blob/object payloads (`pack content`, `provenance content`, `attestation content`) are persisted through the seed-fs object-store channel (`SeedFsPacksRegistryBlobStore`).
- PostgreSQL keeps metadata and compatibility placeholders; payload retrieval resolves from object storage first.
- Explicit non-production overrides remain available (`filesystem`, `inmemory`) but are no longer implicit defaults.
---
## 9) Architecture Decision Record: No DB merge (Sprint 208)
**Decision:** OrchestratorDbContext and SchedulerDbContext remain as separate DbContexts with separate PostgreSQL schemas. No cross-schema DB merge.
**Context:** Sprint 208 evaluated merging the Orchestrator (39 entities) and Scheduler (11 entities) DbContexts into a single unified context. Both define `Jobs` and `JobHistory` entities.
**Problem:** The `Jobs` and `JobHistory` entities have fundamentally incompatible semantics:
- **OrchestratorDbContext.Jobs:** Represents pipeline orchestration runs (source ingestion, policy evaluation, release promotion). Fields include `payloadDigest`, `dependencies`, `leaseId`, `retryPolicy`.
- **SchedulerDbContext.Jobs:** Represents cron-scheduled rescan executions (image re-evaluation, impact-index-driven). Fields include `scheduleId`, `trigger` (cron/concelier/excititor/manual), `impactSet`, `runStats`.
Merging would require renaming one set of entities (e.g., `SchedulerJobs`, `SchedulerJobHistory`), propagating through repositories, query code, compiled models, migrations, and external contracts. The schemas already provide clean separation at no operational cost since both live in the same `stellaops_platform` database.
**Decision rationale:**
1. Entity name collision with incompatible models makes merge risky and disruptive.
2. Compiled models from Sprint 219 would need regeneration for both contexts.
3. Schemas provide clean separation at zero cost.
4. Future domain rename (Sprint 221) is a better venue for any schema consolidation.
**Consequences:** TaskRunner and PacksRegistry remain independent subdomains and now implement explicit storage contracts (Postgres state/metadata plus object-store payload channels) without cross-schema DB merge.
---
## 10) Schema continuity remediation (Sprint 311)
Sprint 221 renamed the domain from Orchestrator to JobEngine but intentionally preserved the PostgreSQL schema name `orchestrator` for continuity. Sprint 311 closed the implementation drift so runtime, design-time, and compiled-model paths now align on the same preserved schema default.
Implemented alignment:
- Runtime default schema is centralized in `JobEngineDbContext.DefaultSchemaName` (`orchestrator`) and schema normalization is centralized in `JobEngineDbContext.ResolveSchemaName(...)`.
- Repository runtime context creation (`JobEngineDbContextFactory`) uses that same shared default and normalization logic.
- Design-time context creation now passes `JobEngineDbContext.DefaultSchemaName` explicitly instead of relying on implicit constructor fallback.
- EF compiled model schema annotations were aligned to `orchestrator` so compiled-model and runtime model behavior match.
Out of scope for Sprint 311:
- No schema migration from `orchestrator` to `jobengine` was introduced.
- Any future physical schema rename requires a dedicated migration sprint with data/backfill and rollback planning.

@@ -0,0 +1,69 @@
# Orchestrator Event Envelope (draft)
Status: draft for ORCH-SVC-38-101 (pending ORCH-SVC-37-101 approval)
## Goals
- Single, provenance-rich envelope for policy/export/job lifecycle events.
- Idempotent across retries and transports (Notifier bus, webhooks, SSE/WS streams).
- Tenant/project isolation and offline-friendly replays.
## Envelope
```jsonc
{
  "schemaVersion": "orch.event.v1",
  "eventId": "urn:orch:event:...",          // UUIDv7 or ULID
  "eventType": "job.failed|job.completed|pack_run.log|pack_run.artifact|policy.updated|export.completed",
  "occurredAt": "2025-11-19T12:34:56Z",
  "idempotencyKey": "orch-{eventType}-{jobId}-{attempt}",
  "correlationId": "corr-...",              // propagated from producer
  "tenantId": "...",
  "projectId": "...",                       // optional but preferred
  "actor": {
    "subject": "service/worker-sdk-go",     // who emitted the event
    "scopes": ["orch:quota", "orch:backfill"]
  },
  "job": {
    "id": "job_018f...",
    "type": "pack-run|ingest|export|policy-simulate",
    "runId": "run_018f...",                 // for pack runs / sims
    "attempt": 3,
    "leaseId": "lease_018f...",
    "taskRunnerId": "tr_018f...",
    "status": "completed|failed|running|canceled",
    "reason": "user_cancelled|retry_backoff|quota_paused",
    "payloadDigest": "sha256:...",
    "artifacts": [
      {"uri": "s3://...", "digest": "sha256:...", "mime": "application/json"}
    ]
  },
  "metrics": {
    "durationSeconds": 12.345,
    "logStreamLagSeconds": 0.8,
    "backoffSeconds": 30
  },
  "notifier": {
    "channel": "orch.jobs",
    "delivery": "dsse",
    "replay": {"ordinal": 5, "total": 12}
  }
}
```
## Idempotency rules
- `eventId` globally unique; `idempotencyKey` dedupe per channel.
- Emit once per state transition; retries reuse the same `eventId`/`idempotencyKey`.
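Under these rules, a subscriber-side dedupe keyed per channel might look like this sketch (the key shape follows the envelope draft above; `ChannelDeduper` is a hypothetical helper):

```python
def build_idempotency_key(event_type, job_id, attempt):
    """Key shape from the envelope draft: orch-{eventType}-{jobId}-{attempt}."""
    return f"orch-{event_type}-{job_id}-{attempt}"


class ChannelDeduper:
    """Deliver an envelope at most once per idempotency key, per channel."""

    def __init__(self):
        self.seen = {}  # channel -> set of idempotency keys

    def should_deliver(self, channel, envelope):
        keys = self.seen.setdefault(channel, set())
        key = envelope["idempotencyKey"]
        if key in keys:
            return False  # retry of an already-delivered state transition
        keys.add(key)
        return True
```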
## Provenance
- Always include `tenantId` and `projectId` (if available).
- Carry `correlationId` from upstream producers and `taskRunnerId` from leasing bridge.
- Include `actor.scopes` when events are triggered via elevated tokens (`orch:quota`, `orch:backfill`).
## Transport bindings
- **Notifier bus**: DSSE-wrapped envelope; subject `orch.event` and `eventType`.
- **Webhooks**: HMAC with `X-Orchestrator-Signature` (sha256), replay-safe via `idempotencyKey`.
- **SSE/WS**: stream per `tenantId` filtered by `projectId`; client dedupe via `eventId`.
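A webhook receiver following the HMAC binding above might verify signatures like this; the `sha256=` prefix and hex encoding are assumptions, since the doc only names the header and algorithm:

```python
import hashlib
import hmac


def sign_webhook(secret: bytes, body: bytes) -> str:
    """Candidate value for the X-Orchestrator-Signature header (HMAC-SHA256, hex)."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, header: str) -> bool:
    """Constant-time comparison guards against timing side channels."""
    return hmac.compare_digest(sign_webhook(secret, body), header)
```

Replay safety still comes from `idempotencyKey` dedupe; the signature only authenticates the sender and payload.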
## Backlog & follow-ups
- Align field names with ORCH-SVC-37-101 once finalized.
- Add examples for policy/export events and pack-run log/manifest payloads.
- Document retry/backoff semantics in Notify/Console subscribers.

@@ -0,0 +1,30 @@
# Orchestrator SLOs (DOCS-ORCH-34-005)
Last updated: 2025-11-25
## Service level objectives
- **Availability**: 99.9% monthly for WebService API per tenant.
- **Run completion**: P95 run duration < 5m for standard DAGs; failure rate <1% over 30d.
- **Queue health**: backlog < 1000 items per tenant for >95% of 5m windows.
- **Event delivery**: WebSocket/stream delivery success > 99.5% (per day).
## Error budget policy
- Window: 28 days. Burn alerts:
- 2× burn: page on-call.
- 14× burn: immediate mitigation (disable offending DAGs, scale workers).
## Alerts (examples)
- Availability: `avg_over_time(probe_success{job="orchestrator-api"}[1h]) < 0.999`.
- Latency: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.5`.
- Run failures: `rate(orchestrator_runs_total{status="failed"}[30m]) / rate(orchestrator_runs_total[30m]) > 0.01`.
- Queue backlog: `orchestrator_queue_depth > 1000` for 10m.
## Dashboards
- Golden signals per service (traffic, errors, latency, saturation).
- Run outcome panel: success/fail/cancel counts, retry counts.
- Queue panel: depth, age, worker consumption rate.
- Burn-rate panel tied to error budget.
## Ownership & review
- SLOs owned by Orchestrator Guild; reviewed quarterly or when architecture changes.
- Changes must be reflected in runbook and alert rules; update manifests for offline/air-gap monitoring kits.

@@ -0,0 +1,24 @@
# Orchestrator Implementation Plan
## Purpose
Provide a living plan for Orchestrator deliverables, dependencies, and evidence.
## Active work
- Track current sprints under `docs/implplan/SPRINT_*.md` for this module.
- Update this file when new scoped work is approved.
## Near-term deliverables
- TBD (add when sprint is staffed).
## Dependencies
- `docs/modules/jobengine/architecture.md`
- `docs/modules/jobengine/README.md`
- `docs/modules/platform/architecture-overview.md`
## Evidence of completion
- Code changes under `src/JobEngine/**`.
- Tests and fixtures under the module's `__Tests` / `__Libraries`.
- Docs and runbooks under `docs/modules/jobengine/**`.
## Notes
- Keep deterministic and offline-first expectations aligned with module AGENTS.

@@ -0,0 +1,51 @@
# Orchestrator → Findings Ledger Export Contract
Status: Available (2025-12-03)
Scope: defines the deterministic payload Orchestrator emits for job/run exports that Findings Ledger ingests for provenance (LEDGER-34-101).
## Payload shape
```jsonc
{
  "runId": "uuid",                           // job/run correlation id
  "jobType": "string",                       // e.g., mirror-build, policy-sim, scan
  "artifactHash": "sha256:...",              // CAS digest of primary artifact
  "policyHash": "sha256:...",                // optional; policy bundle hash
  "startedAt": "2025-12-02T00:00:00Z",
  "completedAt": "2025-12-02T00:05:30Z",
  "status": "succeeded|failed|canceled",
  "manifestPath": "cas://.../manifest.json", // DSSE or CAS path
  "logsPath": "cas://.../logs.ndjson",
  "tenantId": "string",
  "environment": "prod|stage|dev",
  "idempotencyKey": "sha256:...",            // sha256(runId + artifactHash + tenantId)
  "signatures": [ { "type": "dsse", "keyId": "...", "signature": "..." } ]
}
```
## Determinism & ordering
- Entries sorted by `runId` when streamed; pagination stable via `runId, startedAt` ascending.
- `idempotencyKey = sha256(runId + artifactHash + tenantId)`; duplicate submissions are rejected with 409.
- Timestamps UTC ISO-8601; no clock-skew correction performed by Ledger.
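A producer-side sketch of these rules; plain string concatenation before hashing is an assumption, since the contract does not specify a separator or encoding:

```python
import hashlib


def export_idempotency_key(run_id, artifact_hash, tenant_id):
    """idempotencyKey = sha256(runId + artifactHash + tenantId), in the contract's
    sha256:<hex> notation. Concatenation/encoding details are assumed here."""
    digest = hashlib.sha256((run_id + artifact_hash + tenant_id).encode("utf-8"))
    return "sha256:" + digest.hexdigest()


def page_entries(entries):
    """Stable pagination order: (runId, startedAt) ascending."""
    return sorted(entries, key=lambda e: (e["runId"], e["startedAt"]))
```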
## Transport
- REST: `POST /internal/jobengine/exports` (Orchestrator) → Findings Ledger ingest queue.
- Events: `orchestrator.export.created` carries the same payload; consumers must verify DSSE before persistence.
## Validation rules (Ledger side)
- Require DSSE signature (Ed25519) when `signatures` present; fail closed if verification fails.
- Enforce presence of `runId`, `artifactHash`, `startedAt`, `status`.
- Hash fields must match `^sha256:[A-Fa-f0-9]{64}$`.
- Allowed status transitions: pending→running→succeeded/failed/canceled; replays only allowed when `idempotencyKey` matches existing record.
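A Ledger-side sketch of the field and transition checks above (the error-reporting style is illustrative; DSSE verification is out of scope here):

```python
import re

HASH_RE = re.compile(r"^sha256:[A-Fa-f0-9]{64}$")

# pending -> running -> succeeded/failed/canceled, per the contract.
ALLOWED_TRANSITIONS = {
    "pending": {"running"},
    "running": {"succeeded", "failed", "canceled"},
}


def validate_export(payload):
    """Check required fields and hash shapes; return a list of error strings."""
    errors = []
    for field in ("runId", "artifactHash", "startedAt", "status"):
        if not payload.get(field):
            errors.append(f"missing {field}")
    for field in ("artifactHash", "policyHash", "idempotencyKey"):
        value = payload.get(field)
        if value is not None and not HASH_RE.match(value):
            errors.append(f"bad hash in {field}")
    return errors


def transition_allowed(current, new):
    return new in ALLOWED_TRANSITIONS.get(current, set())
```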
## Mapping to Findings Ledger
- Stored in collection/table `orchestrator_exports` with index `(artifactHash, runId)` and TTL optional on logs if configured.
- Timeline entry type `ledger_export` references `runId`, `artifactHash`, `policyHash`, `manifestPath`, `logsPath`, DSSE envelope digest, and `idempotencyKey`.
- Cross-links: `bundleId` (if mirror build) or `scanId` (if scan job) may be added as optional fields; Ledger treats them as opaque strings.
## Security / scopes
- Required scope: `orchestrator:exports:write` for producer; Ledger ingress validates tenant headers and scope.
- Max payload: 1 MiB; logs must be CAS/DSSE referenced, not inline.
## Offline/air-gap considerations
- CAS/DSSE paths must resolve within offline kit bundles; no external URIs permitted.
- Deterministic ordering + idempotency allow replay without side effects; Ledger rejects writes when DSSE or hash mismatch.