consolidation of some modules, localization fixes, product advisories work, QA work

This commit is contained in:
master
2026-03-05 03:54:22 +02:00
parent 7bafcc3eef
commit 8e1cb9448d
3878 changed files with 72600 additions and 46861 deletions

@@ -0,0 +1,34 @@
# Source & Job Orchestrator agent guide
## Mission
The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)
## How to get started
1. Read the design summaries in ./architecture.md (quota governance, job lifecycle, dashboard feeds).
2. Open sprint file `/docs/implplan/SPRINT_*.md` and locate stories for this component.
3. Check ./TASKS.md and update status before/after work.
4. Review ./README.md for responsibilities and ensure changes maintain determinism and offline parity.
## Guardrails
- Uphold Aggregation-Only Contract boundaries when consuming ingestion data.
- Preserve determinism and provenance in all derived outputs.
- Document offline/air-gap pathways for any new feature.
- Update telemetry/observability assets alongside feature work.
## Required Reading
- `docs/modules/jobengine/README.md`
- `docs/modules/jobengine/architecture.md`
- `docs/modules/jobengine/implementation_plan.md`
- `docs/modules/platform/architecture-overview.md`
## Working Agreement
1. Update task status to `DOING`/`DONE` in both the corresponding sprint file `/docs/implplan/SPRINT_*.md` and the local `TASKS.md` when you start or finish work.
2. Review this charter and the Required Reading documents before coding; confirm prerequisites are met.
3. Keep changes deterministic (stable ordering, timestamps, hashes) and align with offline/air-gap expectations.
4. Coordinate doc updates, tests, and cross-guild communication whenever contracts or workflows change.
5. Revert to `TODO` if you pause the task without shipping changes; leave notes in commit/PR descriptions for context.

@@ -0,0 +1,78 @@
# StellaOps Source & Job Orchestrator
The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
## Latest updates (2025-11-30)
- OpenAPI discovery published at `/.well-known/openapi` with `openapi/orchestrator.json`; includes pagination/idempotency/error-envelope examples and version headers.
- Legacy job detail/summary endpoints now emit `Deprecation` + `Link` headers pointing to the stable replacements.
- Job leasing flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
- Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
- Authority `orch:quota` / `orch:backfill` scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.
## Responsibilities
- Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
- Expose dashboards and APIs for throttling, replays, and failover.
- Enforce rate limits, concurrency, and dependency chains across queues.
- Stream structured events and audit logs for incident response.
- Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.
## Key components
- Orchestrator WebService (control plane).
- Queue adapters (Valkey/NATS) and job ledger.
- Console dashboard module and CLI integration for operators.
## Integrations & dependencies
- Authority for authN/Z on operational actions.
- Telemetry stack for job metrics and alerts.
- Scheduler/Concelier/Excititor workers for job lifecycle.
- Offline Kit for state export/import during air-gap refreshes.
## Operational notes
- Job recovery runbooks and dashboard JSON as described in Epic 9.
- Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
- Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
- When using `orch:quota` / `orch:backfill` scopes, capture reason/ticket fields in runbooks and audit checklists.
## Implementation Status
### Phase 1: Core service & job ledger (Complete)
- PostgreSQL schema with sources, runs, jobs, artifacts, DAG edges, quotas, schedules, incidents
- Lease manager with heartbeats, retries, dead-letter queues
- Token-bucket rate limiter per tenant/source.host with adaptive refill
- Watermark/backfill orchestration for event-time windows
### Phase 2: Worker SDK & artifact registry (Complete)
- Claim/heartbeat/report contract with deterministic artifact hashing
- Idempotency enforcement and worker SDKs for .NET/Rust/Go agents
- Integrated with Concelier, Excititor, SBOM Service, Policy Engine
### Phase 3: Observability & dashboard (In Progress)
- Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate
- Error clustering for HTTP 429/5xx, schema mismatches, parse errors
- SSE/WebSocket feeds for Console updates, Gantt timeline/DAG JSON
### Phase 4: Controls & resilience (Planned)
- Pause/resume/throttle/retry/backfill tooling
- Dead-letter review, circuit breakers, blackouts, backpressure handling
- Automation hooks and control plane APIs
### Phase 5: Offline & compliance (Planned)
- Deterministic audit bundles (jobs.jsonl, history.jsonl, throttles.jsonl)
- Provenance manifests and offline replay scripts
- Tenant isolation validation and secret redaction
### Key Acceptance Criteria
- Schedules all jobs with quotas, rate limits, idempotency; preserves provenance
- Console reflects real-time DAG status, queue depth, SLO burn rate
- Observability stack exposes metrics, logs, traces, incidents for stuck jobs and throttling
- Offline audit bundles reproduce job history deterministically with verified signatures
### Technical Decisions & Risks
- Backpressure/queue overload mitigated via adaptive token buckets, circuit breakers, dynamic concurrency
- Upstream vendor throttles managed with visible state, automatic jitter and retry
- Tenant leakage prevented through API/queue/storage filters, fuzz tests, redaction
- Complex DAG errors handled with diagnostics, error clustering, partial replay tooling
## Epic alignment
- Epic 9: Source & Job Orchestrator Dashboard.
- ORCH stories in ../../TASKS.md.

@@ -0,0 +1,228 @@
# Source & Job Orchestrator architecture
> Based on Epic 9 (Source & Job Orchestrator Dashboard); this document outlines components, job lifecycle, rate-limit governance, and observability.
## 1) Topology
- **Orchestrator API (`StellaOps.JobEngine`).** Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (`orchestrator:*`).
- **Job ledger (PostgreSQL).** Tables `jobs`, `job_history`, `sources`, `quotas`, `throttles`, `incidents` (schema `orchestrator`). Append-only history ensures auditability.
- **Queue abstraction.** Supports Valkey Streams or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
## 2) Job lifecycle
1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in the pending queue with the next eligible timestamp.
3. **Leasing (Task Runner bridge).** Workers poll `LeaseJob` endpoint; Orchestrator returns job with `leaseId`, `leaseUntil`, `idempotencyKey`, and instrumentation tokens. Lease renewal required for long-running tasks; leases carry retry hints and provenance (`tenant`, `project`, `correlationId`, `taskRunnerId`).
4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded.
5. **Replay.** Operators trigger `POST /jobs/{id}/replay` which clones job payload, sets `replayOf` pointer, and requeues with high priority while preserving determinism metadata.
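The lifecycle steps above can be sketched as a worker loop against the lease endpoint. This is a minimal, in-memory illustration: `OrchestratorClient`, `lease_job`, and `report` are hypothetical stand-ins for the real `LeaseJob`/completion contract, and retry policy stays server-side as described.

```python
import time
import uuid


class OrchestratorClient:
    """Illustrative in-memory stand-in for the lease/report endpoints."""

    def __init__(self, jobs):
        self.pending = list(jobs)
        self.reports = []

    def lease_job(self, worker_id):
        if not self.pending:
            return None
        job = self.pending.pop(0)
        # Orchestrator attaches lease metadata; the worker must renew before leaseUntil.
        job["leaseId"] = str(uuid.uuid4())
        job["leaseUntil"] = time.time() + 30.0
        job["idempotencyKey"] = f"orch-{job['jobType']}-{job['jobId']}-{job['attempt']}"
        return job

    def report(self, lease_id, status):
        self.reports.append((lease_id, status))


def run_worker(client, worker_id, execute):
    """Poll for leases, execute, and report a terminal status per job."""
    completed = []
    while True:
        job = client.lease_job(worker_id)
        if job is None:
            break
        try:
            execute(job)
            client.report(job["leaseId"], "succeeded")
        except Exception:
            # Server-side retry policy (exponential backoff, max attempts) takes over.
            client.report(job["leaseId"], "failed")
        completed.append(job["jobId"])
    return completed
```

Lease renewal and heartbeats are omitted for brevity; a long-running worker would renew before `leaseUntil` expires.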
### Pack-run lifecycle (phase III)
- **Register** `pack-run` job type with task runner hints (artifacts, log channel, heartbeat cadence).
- **Logs/Artifacts**: SSE/WS stream keyed by `packRunId` + `tenant/project`; artifacts published with content digests and URI metadata.
- **Events**: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending ORCH-SVC-37-101 final spec.
## 3) Rate-limit & quota governance
- Quotas defined per tenant/profile (`maxActive`, `maxPerHour`, `burst`). Stored in `quotas` and enforced before leasing.
- Dynamic throttles allow ops to pause specific sources (`pauseSource`, `resumeSource`) or reduce concurrency.
- Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
- Control plane quota updates require Authority scope `orch:quota` (issued via `Orch.Admin` role). Historical rebuilds/backfills additionally require `orch:backfill` and must supply `backfill_reason` and `backfill_ticket` alongside the operator metadata. Authority persists all four fields (`quota_reason`, `quota_ticket`, `backfill_reason`, `backfill_ticket`) for audit replay.
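The token-bucket enforcement described above (and in Phase 1) can be sketched as follows. The adaptive-refill logic is omitted, so this is a plain bucket as one might key it per `{tenant, jobType}`; parameter names are illustrative:

```python
import time


class TokenBucket:
    """Refill at `rate` tokens/second, capped at `burst` capacity."""

    def __init__(self, rate, burst, now=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)
        self.now = now
        self.last = now()

    def try_acquire(self, cost=1.0):
        t = self.now()
        # Refill proportionally to elapsed time, never exceeding burst capacity.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def next_eligible_in(self, cost=1.0):
        """Seconds until `cost` tokens accrue; usable as the pending-queue timestamp."""
        deficit = max(0.0, cost - self.tokens)
        return deficit / self.rate
```

A job rejected by `try_acquire` would be staged with `now + next_eligible_in()` as its next eligible timestamp.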
### 3.1) Quota governance service
The `QuotaGovernanceService` provides cross-tenant quota allocation with configurable policies:
**Allocation strategies:**
- `Equal` — Divide total capacity equally among all active tenants.
- `Proportional` — Allocate based on tenant weight/priority tier.
- `Priority` — Higher priority tenants get allocation first, with preemption.
- `ReservedWithFairShare` — Reserved minimum per tenant, remainder distributed fairly.
- `Fixed` — Static allocation per tenant regardless of demand.
**Key operations:**
- `CalculateAllocationAsync` — Compute quota for a tenant based on active policies.
- `RequestQuotaAsync` — Request quota from shared pool; returns granted amount with burst usage.
- `ReleaseQuotaAsync` — Return quota to shared pool after job completion.
- `CanScheduleAsync` — Check scheduling eligibility combining quota and circuit breaker state.
**Quota allocation policy properties:**
- `TotalCapacity` — Pool size to allocate from (for proportional/fair strategies).
- `MinimumPerTenant` / `MaximumPerTenant` — Allocation bounds.
- `ReservedCapacity` — Guaranteed capacity for high-priority tenants.
- `AllowBurst` / `BurstMultiplier` — Allow temporary overallocation when capacity exists.
- `Priority` — Policy evaluation order (higher = first).
- `JobType` — Optional job type filter (null = applies to all).
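As one example of the strategies above, a `Proportional` allocation with a guaranteed minimum per tenant might look like this sketch (integer quotas, deterministic tenant ordering; the real `QuotaGovernanceService` policy model is richer):

```python
def allocate_proportional(total_capacity, weights, minimum=0):
    """Split `total_capacity` by tenant weight after reserving `minimum` per tenant.

    `weights` maps tenant id -> weight/priority tier. Integer division keeps the
    result deterministic; rounding remainder handling is omitted for brevity.
    """
    tenants = sorted(weights)  # stable ordering, as the module requires
    remaining = total_capacity - minimum * len(tenants)
    total_weight = sum(weights.values())
    return {
        tenant: minimum + (remaining * weights[tenant]) // total_weight
        for tenant in tenants
    }
```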
### 3.2) Circuit breaker service
The `CircuitBreakerService` implements the circuit breaker pattern for downstream services:
**States:**
- `Closed` — Normal operation; requests pass through. Failures are tracked.
- `Open` — Circuit tripped; requests are blocked for `OpenDuration`. Prevents cascade failures.
- `HalfOpen` — After open duration, limited test requests allowed. Success → Closed; Failure → Open.
**Thresholds:**
- `FailureThreshold` (0.0–1.0) — Failure rate that triggers circuit open.
- `WindowDuration` — Sliding window for failure rate calculation.
- `MinimumSamples` — Minimum requests before circuit can trip.
- `OpenDuration` — How long circuit stays open before half-open transition.
- `HalfOpenTestCount` — Number of test requests allowed in half-open state.
**Key operations:**
- `CheckAsync` — Verify if request is allowed; returns `CircuitBreakerCheckResult`.
- `RecordSuccessAsync` / `RecordFailureAsync` — Update circuit state after request.
- `ForceOpenAsync` / `ForceCloseAsync` — Manual operator intervention (audited).
- `ListAsync` — View all circuit breakers for a tenant with optional state filter.
**Downstream services protected:**
- Scanner
- Attestor
- Policy Engine
- Registry clients
- External integrations
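The state machine above can be sketched as follows. This is an illustrative reduction of `CircuitBreakerService`: the sliding `WindowDuration` is simplified to cumulative counters that reset on transition, and operator force-open/close is omitted.

```python
import time

CLOSED, OPEN, HALF_OPEN = "Closed", "Open", "HalfOpen"


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, minimum_samples=5,
                 open_duration=30.0, half_open_tests=2, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.minimum_samples = minimum_samples
        self.open_duration = open_duration
        self.half_open_tests = half_open_tests
        self.now = now
        self.state = CLOSED
        self.successes = 0
        self.failures = 0
        self.opened_at = None
        self.test_budget = 0

    def check(self):
        """Is a request allowed right now? (Open -> HalfOpen after open_duration.)"""
        if self.state == OPEN and self.now() - self.opened_at >= self.open_duration:
            self.state, self.test_budget = HALF_OPEN, self.half_open_tests
        if self.state == OPEN:
            return False
        if self.state == HALF_OPEN:
            if self.test_budget <= 0:
                return False
            self.test_budget -= 1  # consume one of the limited test requests
        return True

    def record(self, ok):
        """Update state after a request; half-open outcome decides Closed vs Open."""
        if self.state == HALF_OPEN:
            self._transition(CLOSED if ok else OPEN)
            return
        self.successes += ok
        self.failures += not ok
        total = self.successes + self.failures
        if total >= self.minimum_samples and self.failures / total >= self.failure_threshold:
            self._transition(OPEN)

    def _transition(self, state):
        self.state = state
        self.successes = self.failures = 0
        self.opened_at = self.now() if state == OPEN else None
```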
## 4) APIs
### 4.1) Job management
- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
- `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason.
- `POST /api/jobs/{id}/replay` — schedule replay.
- `POST /api/limits/throttle` — apply throttle (requires elevated scope).
- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
### 4.2) Circuit breaker endpoints (`/api/v1/jobengine/circuit-breakers`)
- `GET /` — List all circuit breakers for tenant (optional `?state=` filter).
- `GET /{serviceId}` — Get circuit breaker state for specific downstream service.
- `GET /{serviceId}/check` — Check if requests are allowed; returns `IsAllowed`, `State`, `FailureRate`, `TimeUntilRetry`.
- `POST /{serviceId}/success` — Record successful request to downstream service.
- `POST /{serviceId}/failure` — Record failed request (body: `failureReason`).
- `POST /{serviceId}/force-open` — Manually open circuit (body: `reason`; audited).
- `POST /{serviceId}/force-close` — Manually close circuit (audited).
### 4.3) Quota governance endpoints (`/api/v1/jobengine/quota-governance`)
- `GET /policies` — List quota allocation policies (optional `?enabled=` filter).
- `GET /policies/{policyId}` — Get specific policy.
- `POST /policies` — Create new policy.
- `PUT /policies/{policyId}` — Update policy.
- `DELETE /policies/{policyId}` — Delete policy.
- `GET /allocation` — Calculate allocation for current tenant (optional `?jobType=`).
- `POST /request` — Request quota from pool (body: `jobType`, `requestedAmount`).
- `POST /release` — Release quota back to pool (body: `jobType`, `releasedAmount`).
- `GET /status` — Get tenant quota status (optional `?jobType=`).
- `GET /summary` — Get quota governance summary across all tenants (optional `?policyId=`).
- `GET /can-schedule` — Check if job can be scheduled (optional `?jobType=`).
### 4.4) Discovery and documentation
- Event envelope draft (`docs/modules/jobengine/event-envelope.md`) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.
- OpenAPI discovery: `/.well-known/openapi` exposes `/openapi/jobengine.json` (OAS 3.1) with pagination/idempotency/error-envelope examples; legacy job detail/summary endpoints now ship `Deprecation` + `Link` headers that point to their replacements.
### 4.5) Release control plane dashboard endpoints
- `GET /api/v1/release-jobengine/dashboard` — control-plane dashboard payload (pipeline, pending approvals, active deployments, recent releases).
- `POST /api/v1/release-jobengine/promotions/{id}/approve` — approve a pending promotion from dashboard context.
- `POST /api/v1/release-jobengine/promotions/{id}/reject` — reject a pending promotion from dashboard context.
- Compatibility aliases are exposed for legacy clients under `/api/release-jobengine/*`.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
## 5) Observability
- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`.
- Task Runner bridge adds `pack_run_logs_stream_lag_seconds`, `pack_run_heartbeats_total`, `pack_run_artifacts_total`.
- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops.
- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.
## 6) Offline support
- Orchestrator exports audit bundles: `jobs.jsonl`, `history.jsonl`, `throttles.jsonl`, `manifest.json`, `signatures/`. Used for offline investigations and compliance.
- Replay manifests contain job digests and success/failure notes for deterministic proof.
## 7) Operational considerations
- HA deployment with multiple API instances; queue storage determines redundancy strategy.
- Support for `maintenance` mode halting leases while allowing status inspection.
- Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).
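The stuck-job recovery step mentioned above (clearing expired leases so jobs return to the pending queue) can be sketched as follows; the job record fields are illustrative:

```python
def recover_stuck_jobs(jobs, now, lease_grace=0.0):
    """Return leased jobs whose lease expired (plus grace) to the pending queue.

    Mutates the job records in place and returns the recovered job ids, which a
    runbook would log for the audit trail.
    """
    recovered = []
    for job in jobs:
        if job["status"] == "leased" and now > job["leaseUntil"] + lease_grace:
            job["status"] = "pending"
            job["leaseId"] = None  # cleared lease; worker heartbeats will be rejected
            recovered.append(job["jobId"])
    return recovered
```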
---
## 8) Orchestration domain subdomains (Sprint 208)
Sprint 208 consolidated Scheduler, TaskRunner, and PacksRegistry source trees under `src/JobEngine/` as subdomains of the orchestration domain. Each subdomain retains its own project names, namespaces, and runtime identities. No namespace renames were performed.
### 8.1) Scheduler subdomain
**Source location:** `src/JobEngine/StellaOps.Scheduler.*`
The Scheduler service re-evaluates already-cataloged images when intelligence changes (Concelier/Excititor/policy), orchestrates nightly and ad-hoc runs, targets only impacted images using the BOM-Index, and emits report-ready events for downstream Notify. Default mode is analysis-only (no image pull); optional content-refresh can be enabled per schedule.
**Deployables:** `StellaOps.Scheduler.WebService` (stateless), `StellaOps.Scheduler.Worker.Host` (scale-out).
**Database:** `SchedulerDbContext` (schema `scheduler`, 11 entities). Owns `schedules`, `runs`, `impact_cursors`, `locks`, `audit` tables. See archived docs: `docs-archived/modules/scheduler/architecture.md`.
### 8.2) TaskRunner subdomain
**Source location:** `src/JobEngine/StellaOps.TaskRunner/`, `src/JobEngine/StellaOps.TaskRunner.__Libraries/`
The TaskRunner provides the execution substrate for Orchestrator jobs. Workers poll lease endpoints, execute tasks, report outcomes, and stream logs/artifacts for pack-runs.
**Deployables:** `StellaOps.TaskRunner.WebService`, `StellaOps.TaskRunner.Worker`.
**Database and storage contract (Sprint 312):**
- `Storage:Driver=postgres` is the production default for run state, logs, and approvals.
- Postgres-backed stores: `PostgresPackRunStateStore`, `PostgresPackRunLogStore`, `PostgresPackRunApprovalStore` via `TaskRunnerDataSource`.
- Artifact payload channel uses object storage path (`seed-fs` driver) configured with `TaskRunner:Storage:ObjectStore:SeedFs:RootPath`.
- Explicit non-production overrides remain available (`filesystem`, `inmemory`) but are no longer implicit defaults.
### 8.3) PacksRegistry subdomain
**Source location:** `src/JobEngine/StellaOps.PacksRegistry/`, `src/JobEngine/StellaOps.PacksRegistry.__Libraries/`
The PacksRegistry manages compliance/automation pack definitions, versions, and distribution for the task execution pipeline.
**Deployables:** `StellaOps.PacksRegistry.WebService`, `StellaOps.PacksRegistry.Worker`.
**Database and storage contract (Sprint 312):**
- `Storage:Driver=postgres` is the production default for metadata/state repositories (`pack`, `parity`, `lifecycle`, `mirror`, `audit`, `attestation metadata`).
- Blob/object payloads (`pack content`, `provenance content`, `attestation content`) are persisted through the seed-fs object-store channel (`SeedFsPacksRegistryBlobStore`).
- PostgreSQL keeps metadata and compatibility placeholders; payload retrieval resolves from object storage first.
- Explicit non-production overrides remain available (`filesystem`, `inmemory`) but are no longer implicit defaults.
---
## 9) Architecture Decision Record: No DB merge (Sprint 208)
**Decision:** OrchestratorDbContext and SchedulerDbContext remain as separate DbContexts with separate PostgreSQL schemas. No cross-schema DB merge.
**Context:** Sprint 208 evaluated merging the Orchestrator (39 entities) and Scheduler (11 entities) DbContexts into a single unified context. Both define `Jobs` and `JobHistory` entities.
**Problem:** The `Jobs` and `JobHistory` entities have fundamentally incompatible semantics:
- **OrchestratorDbContext.Jobs:** Represents pipeline orchestration runs (source ingestion, policy evaluation, release promotion). Fields include `payloadDigest`, `dependencies`, `leaseId`, `retryPolicy`.
- **SchedulerDbContext.Jobs:** Represents cron-scheduled rescan executions (image re-evaluation, impact-index-driven). Fields include `scheduleId`, `trigger` (cron/concelier/excititor/manual), `impactSet`, `runStats`.
Merging would require renaming one set of entities (e.g., `SchedulerJobs`, `SchedulerJobHistory`), propagating through repositories, query code, compiled models, migrations, and external contracts. The schemas already provide clean separation at no operational cost since both live in the same `stellaops_platform` database.
**Decision rationale:**
1. Entity name collision with incompatible models makes merge risky and disruptive.
2. Compiled models from Sprint 219 would need regeneration for both contexts.
3. Schemas provide clean separation at zero cost.
4. Future domain rename (Sprint 221) is a better venue for any schema consolidation.
**Consequences:** TaskRunner and PacksRegistry remain independent subdomains and now implement explicit storage contracts (Postgres state/metadata plus object-store payload channels) without cross-schema DB merge.
---
## 10) Schema continuity remediation (Sprint 311)
Sprint 221 renamed the domain from Orchestrator to JobEngine but intentionally preserved the PostgreSQL schema name `orchestrator` for continuity. Sprint 311 closed the implementation drift so runtime, design-time, and compiled-model paths now align on the same preserved schema default.
Implemented alignment:
- Runtime default schema is centralized in `JobEngineDbContext.DefaultSchemaName` (`orchestrator`) and schema normalization is centralized in `JobEngineDbContext.ResolveSchemaName(...)`.
- Repository runtime context creation (`JobEngineDbContextFactory`) uses that same shared default and normalization logic.
- Design-time context creation now passes `JobEngineDbContext.DefaultSchemaName` explicitly instead of relying on implicit constructor fallback.
- EF compiled model schema annotations were aligned to `orchestrator` so compiled-model and runtime model behavior match.
Out of scope for Sprint 311:
- No schema migration from `orchestrator` to `jobengine` was introduced.
- Any future physical schema rename requires a dedicated migration sprint with data/backfill and rollback planning.

@@ -0,0 +1,69 @@
# Orchestrator Event Envelope (draft)
Status: draft for ORCH-SVC-38-101 (pending ORCH-SVC-37-101 approval)
## Goals
- Single, provenance-rich envelope for policy/export/job lifecycle events.
- Idempotent across retries and transports (Notifier bus, webhooks, SSE/WS streams).
- Tenant/project isolation and offline-friendly replays.
## Envelope
```jsonc
{
  "schemaVersion": "orch.event.v1",
  "eventId": "urn:orch:event:...",          // UUIDv7 or ULID
  "eventType": "job.failed|job.completed|pack_run.log|pack_run.artifact|policy.updated|export.completed",
  "occurredAt": "2025-11-19T12:34:56Z",
  "idempotencyKey": "orch-{eventType}-{jobId}-{attempt}",
  "correlationId": "corr-...",              // propagated from producer
  "tenantId": "...",
  "projectId": "...",                       // optional but preferred
  "actor": {
    "subject": "service/worker-sdk-go",     // who emitted the event
    "scopes": ["orch:quota", "orch:backfill"]
  },
  "job": {
    "id": "job_018f...",
    "type": "pack-run|ingest|export|policy-simulate",
    "runId": "run_018f...",                 // for pack runs / sims
    "attempt": 3,
    "leaseId": "lease_018f...",
    "taskRunnerId": "tr_018f...",
    "status": "completed|failed|running|canceled",
    "reason": "user_cancelled|retry_backoff|quota_paused",
    "payloadDigest": "sha256:...",
    "artifacts": [
      {"uri": "s3://...", "digest": "sha256:...", "mime": "application/json"}
    ]
  },
  "metrics": {
    "durationSeconds": 12.345,
    "logStreamLagSeconds": 0.8,
    "backoffSeconds": 30
  },
  "notifier": {
    "channel": "orch.jobs",
    "delivery": "dsse",
    "replay": {"ordinal": 5, "total": 12}
  }
}
```
## Idempotency rules
- `eventId` globally unique; `idempotencyKey` dedupe per channel.
- Emit once per state transition; retries reuse the same `eventId`/`idempotencyKey`.
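Under these rules, a subscriber-side dedupe keyed per channel might look like this sketch (the key shape follows the envelope draft above; `ChannelDeduper` is a hypothetical helper):

```python
def build_idempotency_key(event_type, job_id, attempt):
    """Key shape from the envelope draft: orch-{eventType}-{jobId}-{attempt}."""
    return f"orch-{event_type}-{job_id}-{attempt}"


class ChannelDeduper:
    """Deliver an envelope at most once per idempotency key, per channel."""

    def __init__(self):
        self.seen = {}  # channel -> set of idempotency keys

    def should_deliver(self, channel, envelope):
        keys = self.seen.setdefault(channel, set())
        key = envelope["idempotencyKey"]
        if key in keys:
            return False  # retry of an already-delivered state transition
        keys.add(key)
        return True
```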
## Provenance
- Always include `tenantId` and `projectId` (if available).
- Carry `correlationId` from upstream producers and `taskRunnerId` from leasing bridge.
- Include `actor.scopes` when events are triggered via elevated tokens (`orch:quota`, `orch:backfill`).
## Transport bindings
- **Notifier bus**: DSSE-wrapped envelope; subject `orch.event` and `eventType`.
- **Webhooks**: HMAC with `X-Orchestrator-Signature` (sha256), replay-safe via `idempotencyKey`.
- **SSE/WS**: stream per `tenantId` filtered by `projectId`; client dedupe via `eventId`.
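A webhook receiver following the HMAC binding above might verify signatures like this; the `sha256=` prefix and hex encoding are assumptions, since the doc only names the header and algorithm:

```python
import hashlib
import hmac


def sign_webhook(secret: bytes, body: bytes) -> str:
    """Candidate value for the X-Orchestrator-Signature header (HMAC-SHA256, hex)."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, header: str) -> bool:
    """Constant-time comparison guards against timing side channels."""
    return hmac.compare_digest(sign_webhook(secret, body), header)
```

Replay safety still comes from `idempotencyKey` dedupe; the signature only authenticates the sender and payload.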
## Backlog & follow-ups
- Align field names with ORCH-SVC-37-101 once finalized.
- Add examples for policy/export events and pack-run log/manifest payloads.
- Document retry/backoff semantics in Notify/Console subscribers.

@@ -0,0 +1,30 @@
# Orchestrator SLOs (DOCS-ORCH-34-005)
Last updated: 2025-11-25
## Service level objectives
- **Availability**: 99.9% monthly for WebService API per tenant.
- **Run completion**: P95 run duration < 5m for standard DAGs; failure rate <1% over 30d.
- **Queue health**: backlog < 1000 items per tenant for >95% of 5m windows.
- **Event delivery**: WebSocket/stream delivery success > 99.5% (per day).
## Error budget policy
- Window: 28 days. Burn alerts:
- 2× burn: page on-call.
- 14× burn: immediate mitigation (disable offending DAGs, scale workers).
## Alerts (examples)
- Availability: `avg_over_time(probe_success{job="orchestrator-api"}[1h]) < 0.999`.
- Latency: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.5`.
- Run failures: `rate(orchestrator_runs_total{status="failed"}[30m]) / rate(orchestrator_runs_total[30m]) > 0.01`.
- Queue backlog: `orchestrator_queue_depth > 1000` for 10m.
## Dashboards
- Golden signals per service (traffic, errors, latency, saturation).
- Run outcome panel: success/fail/cancel counts, retry counts.
- Queue panel: depth, age, worker consumption rate.
- Burn-rate panel tied to error budget.
## Ownership & review
- SLOs owned by Orchestrator Guild; reviewed quarterly or when architecture changes.
- Changes must be reflected in runbook and alert rules; update manifests for offline/air-gap monitoring kits.

@@ -0,0 +1,24 @@
# Orchestrator Implementation Plan
## Purpose
Provide a living plan for Orchestrator deliverables, dependencies, and evidence.
## Active work
- Track current sprints under `docs/implplan/SPRINT_*.md` for this module.
- Update this file when new scoped work is approved.
## Near-term deliverables
- TBD (add when sprint is staffed).
## Dependencies
- `docs/modules/jobengine/architecture.md`
- `docs/modules/jobengine/README.md`
- `docs/modules/platform/architecture-overview.md`
## Evidence of completion
- Code changes under `src/JobEngine/**`.
- Tests and fixtures under the module's `__Tests` / `__Libraries`.
- Docs and runbooks under `docs/modules/jobengine/**`.
## Notes
- Keep deterministic and offline-first expectations aligned with module AGENTS.

@@ -0,0 +1,51 @@
# Orchestrator → Findings Ledger Export Contract
Status: Available (2025-12-03)
Scope: defines the deterministic payload Orchestrator emits for job/run exports that Findings Ledger ingests for provenance (LEDGER-34-101).
## Payload shape
```jsonc
{
  "runId": "uuid",                           // job/run correlation id
  "jobType": "string",                       // e.g., mirror-build, policy-sim, scan
  "artifactHash": "sha256:...",              // CAS digest of primary artifact
  "policyHash": "sha256:...",                // optional; policy bundle hash
  "startedAt": "2025-12-02T00:00:00Z",
  "completedAt": "2025-12-02T00:05:30Z",
  "status": "succeeded|failed|canceled",
  "manifestPath": "cas://.../manifest.json", // DSSE or CAS path
  "logsPath": "cas://.../logs.ndjson",
  "tenantId": "string",
  "environment": "prod|stage|dev",
  "idempotencyKey": "sha256:...",            // sha256(runId + artifactHash + tenantId)
  "signatures": [ { "type": "dsse", "keyId": "...", "signature": "..." } ]
}
```
## Determinism & ordering
- Entries sorted by `runId` when streamed; pagination stable via `runId, startedAt` ascending.
- `idempotencyKey = sha256(runId + artifactHash + tenantId)`; duplicate submissions are rejected with 409.
- Timestamps UTC ISO-8601; no clock-skew correction performed by Ledger.
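A producer-side sketch of these rules; plain string concatenation before hashing is an assumption, since the contract does not specify a separator or encoding:

```python
import hashlib


def export_idempotency_key(run_id, artifact_hash, tenant_id):
    """idempotencyKey = sha256(runId + artifactHash + tenantId), in the contract's
    sha256:<hex> notation. Concatenation/encoding details are assumed here."""
    digest = hashlib.sha256((run_id + artifact_hash + tenant_id).encode("utf-8"))
    return "sha256:" + digest.hexdigest()


def page_entries(entries):
    """Stable pagination order: (runId, startedAt) ascending."""
    return sorted(entries, key=lambda e: (e["runId"], e["startedAt"]))
```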
## Transport
- REST: `POST /internal/jobengine/exports` (Orchestrator) → Findings Ledger ingest queue.
- Events: `orchestrator.export.created` carries the same payload; consumers must verify DSSE before persistence.
## Validation rules (Ledger side)
- Require DSSE signature (Ed25519) when `signatures` present; fail closed if verification fails.
- Enforce presence of `runId`, `artifactHash`, `startedAt`, `status`.
- Hash fields must match `^sha256:[A-Fa-f0-9]{64}$`.
- Allowed status transitions: pending→running→succeeded/failed/canceled; replays only allowed when `idempotencyKey` matches existing record.
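A Ledger-side sketch of the field and transition checks above (the error-reporting style is illustrative; DSSE verification is out of scope here):

```python
import re

HASH_RE = re.compile(r"^sha256:[A-Fa-f0-9]{64}$")

# pending -> running -> succeeded/failed/canceled, per the contract.
ALLOWED_TRANSITIONS = {
    "pending": {"running"},
    "running": {"succeeded", "failed", "canceled"},
}


def validate_export(payload):
    """Check required fields and hash shapes; return a list of error strings."""
    errors = []
    for field in ("runId", "artifactHash", "startedAt", "status"):
        if not payload.get(field):
            errors.append(f"missing {field}")
    for field in ("artifactHash", "policyHash", "idempotencyKey"):
        value = payload.get(field)
        if value is not None and not HASH_RE.match(value):
            errors.append(f"bad hash in {field}")
    return errors


def transition_allowed(current, new):
    return new in ALLOWED_TRANSITIONS.get(current, set())
```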
## Mapping to Findings Ledger
- Stored in collection/table `orchestrator_exports` with index `(artifactHash, runId)` and TTL optional on logs if configured.
- Timeline entry type `ledger_export` references `runId`, `artifactHash`, `policyHash`, `manifestPath`, `logsPath`, DSSE envelope digest, and `idempotencyKey`.
- Cross-links: `bundleId` (if mirror build) or `scanId` (if scan job) may be added as optional fields; Ledger treats them as opaque strings.
## Security / scopes
- Required scope: `orchestrator:exports:write` for producer; Ledger ingress validates tenant headers and scope.
- Max payload: 1 MiB; logs must be CAS/DSSE referenced, not inline.
## Offline/air-gap considerations
- CAS/DSSE paths must resolve within offline kit bundles; no external URIs permitted.
- Deterministic ordering + idempotency allow replay without side effects; Ledger rejects writes when DSSE or hash mismatch.