Align AOC tasks for Excititor and Concelier

This commit is contained in:
master
2025-10-31 18:50:15 +02:00
committed by root
parent 9e6d9fbae8
commit 8da4e12a90
334 changed files with 35528 additions and 34546 deletions

View File

@@ -1,52 +1,52 @@
# Source & Job Orchestrator architecture
> Based on Epic9 Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.
## 1) Topology
- **Orchestrator API (`StellaOps.Orchestrator`).** Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (`orchestrator:*`).
- **Job ledger (Mongo).** Collections `jobs`, `job_history`, `sources`, `quotas`, `throttles`, `incidents`. Append-only history ensures auditability.
- **Queue abstraction.** Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
## 2) Job lifecycle
1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in pending queue with next eligible timestamp.
3. **Leasing.** Workers poll `LeaseJob` endpoint; Orchestrator returns job with `leaseId`, `leaseUntil`, and instrumentation tokens. Lease renewal required for long-running tasks.
4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded.
5. **Replay.** Operators trigger `POST /jobs/{id}/replay` which clones job payload, sets `replayOf` pointer, and requeues with high priority while preserving determinism metadata.
## 3) Rate-limit & quota governance
- Quotas defined per tenant/profile (`maxActive`, `maxPerHour`, `burst`). Stored in `quotas` and enforced before leasing.
- Dynamic throttles allow ops to pause specific sources (`pauseSource`, `resumeSource`) or reduce concurrency.
- Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
## 4) APIs
- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
- `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason.
- `POST /api/jobs/{id}/replay` — schedule replay.
- `POST /api/limits/throttle` — apply throttle (requires elevated scope).
- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
## 5) Observability
- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`.
- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops.
- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.
## 6) Offline support
- Orchestrator exports audit bundles: `jobs.jsonl`, `history.jsonl`, `throttles.jsonl`, `manifest.json`, `signatures/`. Used for offline investigations and compliance.
- Replay manifests contain job digests and success/failure notes for deterministic proof.
## 7) Operational considerations
- HA deployment with multiple API instances; queue storage determines redundancy strategy.
- Support for `maintenance` mode halting leases while allowing status inspection.
- Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).
# Source & Job Orchestrator architecture
> Based on Epic9 Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.
## 1) Topology
- **Orchestrator API (`StellaOps.Orchestrator`).** Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (`orchestrator:*`).
- **Job ledger (Mongo).** Collections `jobs`, `job_history`, `sources`, `quotas`, `throttles`, `incidents`. Append-only history ensures auditability.
- **Queue abstraction.** Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
## 2) Job lifecycle
1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in pending queue with next eligible timestamp.
3. **Leasing.** Workers poll `LeaseJob` endpoint; Orchestrator returns job with `leaseId`, `leaseUntil`, and instrumentation tokens. Lease renewal required for long-running tasks.
4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded.
5. **Replay.** Operators trigger `POST /jobs/{id}/replay` which clones job payload, sets `replayOf` pointer, and requeues with high priority while preserving determinism metadata.
## 3) Rate-limit & quota governance
- Quotas defined per tenant/profile (`maxActive`, `maxPerHour`, `burst`). Stored in `quotas` and enforced before leasing.
- Dynamic throttles allow ops to pause specific sources (`pauseSource`, `resumeSource`) or reduce concurrency.
- Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
## 4) APIs
- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
- `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason.
- `POST /api/jobs/{id}/replay` — schedule replay.
- `POST /api/limits/throttle` — apply throttle (requires elevated scope).
- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
## 5) Observability
- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`.
- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops.
- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.
## 6) Offline support
- Orchestrator exports audit bundles: `jobs.jsonl`, `history.jsonl`, `throttles.jsonl`, `manifest.json`, `signatures/`. Used for offline investigations and compliance.
- Replay manifests contain job digests and success/failure notes for deterministic proof.
## 7) Operational considerations
- HA deployment with multiple API instances; queue storage determines redundancy strategy.
- Support for `maintenance` mode halting leases while allowing status inspection.
- Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).