# Orchestrator Event Model and Job Lifecycle

**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical

This advisory defines the product rationale, job lifecycle semantics, and implementation strategy for the Orchestrator module, covering event models, quota governance, replay semantics, and TaskRunner bridge.

---

## 1. Executive Summary

The Orchestrator is the **central job coordination layer** for all Stella Ops asynchronous operations. Key capabilities:

- **Unified Job Lifecycle** - Enqueue, schedule, lease, complete with audit trail
- **Quota Governance** - Per-tenant rate limits, burst controls, circuit breakers
- **Replay Semantics** - Deterministic job replay for audit and recovery
- **TaskRunner Bridge** - Pack-run integration with heartbeats and artifacts
- **Event Fan-Out** - SSE/GraphQL feeds for dashboards and notifications
- **Offline Export** - Audit bundles for compliance and investigations

---

## 2. Market Drivers

### 2.1 Target Segments

| Segment | Orchestration Requirements | Use Case |
|---------|---------------------------|----------|
| **Enterprise** | Rate limiting, quota management | Multi-team resource sharing |
| **MSP/MSSP** | Multi-tenant isolation | Managed security services |
| **Compliance Teams** | Audit trails, replay | SOC 2, FedRAMP evidence |
| **DevSecOps** | CI/CD integration, webhooks | Pipeline automation |

### 2.2 Competitive Positioning

Most vulnerability platforms lack sophisticated job orchestration. Stella Ops differentiates with:

- **Deterministic replay** for audit and debugging
- **Fine-grained quotas** per tenant/job-type
- **Circuit breakers** for automatic failure isolation
- **Native pack-run integration** for workflow automation
- **Offline-compatible** audit bundles

---

## 3. Job Lifecycle Model

### 3.1 State Machine

```
[Created] --> [Queued] --> [Leased] --> [Running] --> [Completed]
                 |            |             |             |
                 |            |             v             v
                 |            +-------> [Failed] <----[Canceled]
                 |                          |
                 v                          v
            [Throttled]                [Incident]
```

### 3.2 Lifecycle Phases

| Phase | Description | Transitions |
|-------|-------------|-------------|
| **Created** | Job request received | -> Queued |
| **Queued** | Awaiting scheduling | -> Leased, Throttled |
| **Throttled** | Rate limit applied | -> Queued (after delay) |
| **Leased** | Worker acquired job | -> Running, Expired |
| **Running** | Active execution | -> Completed, Failed, Canceled |
| **Completed** | Success, archived | Terminal |
| **Failed** | Error, may retry | -> Queued (retry), Incident |
| **Canceled** | Operator abort | Terminal |
| **Incident** | Escalated failure | Terminal (requires operator) |

### 3.3 Job Request Structure

```json
{
  "jobId": "uuid",
  "jobType": "scan|policy-run|export|pack-run|advisory-sync",
  "tenant": "tenant-id",
  "priority": "low|normal|high|emergency",
  "payloadDigest": "sha256:...",
  "payload": {
    "imageRef": "nginx:latest",
    "options": {}
  },
  "dependencies": ["job-id-1", "job-id-2"],
  "idempotencyKey": "unique-request-key",
  "correlationId": "trace-id",
  "requestedBy": "user-id|service-id",
  "requestedAt": "2025-11-29T12:00:00Z"
}
```

---

## 4. Quota Governance

### 4.1 Quota Model

```yaml
quotas:
  - tenant: "acme-corp"
    jobType: "*"
    maxActive: 50
    maxPerHour: 500
    burst: 10
    priority:
      emergency:
        maxActive: 5
        skipQueue: true

  - tenant: "acme-corp"
    jobType: "export"
    maxActive: 4
    maxPerHour: 100
```

### 4.2 Rate Limit Enforcement

1. **Quota Check** - Before leasing, verify tenant hasn't exceeded limits
2. **Burst Control** - Allow short bursts within configured window
3. **Staging** - Jobs exceeding limits staged with `nextEligibleAt` timestamp
4. **Priority Bypass** - Emergency jobs can skip queue (with separate limits)

### 4.3 Dynamic Controls

| Control | API | Purpose |
|---------|-----|---------|
| `pauseSource` | `POST /api/limits/pause` | Halt specific job sources |
| `resumeSource` | `POST /api/limits/resume` | Resume paused sources |
| `throttle` | `POST /api/limits/throttle` | Apply temporary throttle |
| `updateQuota` | `PATCH /api/quotas/{id}` | Modify quota limits |

### 4.4 Circuit Breakers

- Auto-pause job types when failure rate > threshold (default 50%)
- Incident events generated via Notify
- Half-open testing after cooldown period
- Manual reset via operator action

---

## 5. TaskRunner Bridge

### 5.1 Pack-Run Integration

The Orchestrator provides specialized support for TaskRunner pack executions:

```json
{
  "jobType": "pack-run",
  "payload": {
    "packId": "vuln-scan-and-report",
    "packVersion": "1.2.0",
    "planHash": "sha256:...",
    "inputs": {
      "imageRef": "nginx:latest"
    },
    "artifacts": [],
    "logChannel": "sse:/runs/{runId}/logs",
    "heartbeatCadence": 30
  }
}
```

### 5.2 Heartbeat Protocol

- Workers send heartbeats every `heartbeatCadence` seconds
- Missed heartbeats trigger lease expiration
- Lease can be extended for long-running tasks
- Dead workers detected within 2x heartbeat interval

### 5.3 Artifact & Log Streaming

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/runs/{runId}/logs` | SSE | Stream execution logs |
| `/runs/{runId}/artifacts` | GET | List produced artifacts |
| `/runs/{runId}/artifacts/{name}` | GET | Download artifact |
| `/runs/{runId}/heartbeat` | POST | Extend lease |

---

## 6. Event Model

### 6.1 Event Envelope

```json
{
  "eventId": "uuid",
  "eventType": "job.queued|job.leased|job.completed|job.failed",
  "timestamp": "2025-11-29T12:00:00Z",
  "tenant": "tenant-id",
  "jobId": "job-id",
  "jobType": "scan",
  "correlationId": "trace-id",
  "idempotencyKey": "unique-key",
  "payload": {
    "status": "completed",
    "duration": 45.2,
    "result": {
      "verdict": "pass"
    }
  },
  "provenance": {
    "workerId": "worker-1",
    "leaseId": "lease-id",
    "taskRunnerId": "runner-1"
  }
}
```

### 6.2 Event Types

| Event | Trigger | Consumers |
|-------|---------|-----------|
| `job.queued` | Job enqueued | Dashboard, Notify |
| `job.leased` | Worker acquired job | Dashboard |
| `job.started` | Execution began | Dashboard, Notify |
| `job.progress` | Progress update | Dashboard (SSE) |
| `job.completed` | Success | Dashboard, Notify, Export |
| `job.failed` | Error occurred | Dashboard, Notify, Incident |
| `job.canceled` | Operator abort | Dashboard, Notify |
| `job.replayed` | Replay initiated | Dashboard, Audit |

### 6.3 Fan-Out Channels

- **SSE** - Real-time dashboard feeds
- **GraphQL Subscriptions** - Console UI
- **Notify** - Alert routing based on rules
- **Webhooks** - External integrations
- **Audit Log** - Compliance storage

---

## 7. Replay Semantics

### 7.1 Deterministic Replay

Jobs can be replayed for audit, debugging, or recovery:

```bash
# Replay a completed job
stella job replay --id job-12345

# Replay with sealed mode (offline verification)
stella job replay --id job-12345 --sealed --bundle output.tar.gz
```

### 7.2 Replay Guarantees

| Property | Guarantee |
|----------|-----------|
| **Input preservation** | Same payloadDigest, cursors |
| **Ordering** | Same processing order |
| **Determinism** | Same outputs for same inputs |
| **Provenance** | `replayOf` pointer to original |

### 7.3 Replay Record

```json
{
  "jobId": "replay-job-id",
  "replayOf": "original-job-id",
  "priority": "high",
  "reason": "audit-verification",
  "requestedBy": "auditor@example.com",
  "cursors": {
    "advisory": "cursor-abc",
    "vex": "cursor-def"
  }
}
```

---

## 8. Implementation Strategy

### 8.1 Phase 1: Core Lifecycle (Complete)

- [x] Job state machine
- [x] MongoDB queue with leasing
- [x] Basic quota enforcement
- [x] Dashboard SSE feeds

### 8.2 Phase 2: Pack-Run Bridge (In Progress)

- [x] Pack-run job type registration
- [x] Log/artifact streaming
- [ ] Heartbeat protocol (ORCH-PACK-37-001)
- [ ] Event envelope finalization (ORCH-SVC-37-101)

### 8.3 Phase 3: Advanced Controls (Planned)

- [ ] Circuit breaker automation
- [ ] Quota analytics dashboard
- [ ] Replay verification tooling
- [ ] Incident mode integration

---

## 9. API Surface

### 9.1 Job Management

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/jobs` | GET | `orch:read` | List jobs with filters |
| `/api/jobs/{id}` | GET | `orch:read` | Job detail |
| `/api/jobs/{id}/cancel` | POST | `orch:operate` | Cancel job |
| `/api/jobs/{id}/replay` | POST | `orch:operate` | Schedule replay |

### 9.2 Quota Management

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/quotas` | GET | `orch:read` | List quotas |
| `/api/quotas/{id}` | PATCH | `orch:quota` | Update quota |
| `/api/limits/throttle` | POST | `orch:quota` | Apply throttle |
| `/api/limits/pause` | POST | `orch:quota` | Pause source |
| `/api/limits/resume` | POST | `orch:quota` | Resume source |

### 9.3 Dashboard

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/dashboard/metrics` | GET | `orch:read` | Aggregated metrics |
| `/api/dashboard/events` | SSE | `orch:read` | Real-time events |

---

## 10. Storage Model

### 10.1 Collections

| Collection | Purpose | Key Fields |
|------------|---------|------------|
| `jobs` | Current job state | `_id`, `tenant`, `jobType`, `status`, `priority` |
| `job_history` | Append-only audit | `jobId`, `event`, `timestamp`, `actor` |
| `sources` | Job sources registry | `sourceId`, `tenant`, `status` |
| `quotas` | Quota definitions | `tenant`, `jobType`, `limits` |
| `throttles` | Active throttles | `tenant`, `source`, `until` |
| `incidents` | Escalated failures | `jobId`, `reason`, `status` |

### 10.2 Indexes

- `{tenant, jobType, status}` on `jobs`
- `{tenant, status, startedAt}` on `jobs`
- `{jobId, timestamp}` on `job_history`
- TTL index on transient lease records

---

## 11. Observability

### 11.1 Metrics

- `job_queue_depth{jobType,tenant}`
- `job_latency_seconds{jobType,phase}`
- `job_failures_total{jobType,reason}`
- `job_retry_total{jobType}`
- `lease_extensions_total{jobType}`
- `quota_exceeded_total{tenant}`
- `circuit_breaker_state{jobType}`

### 11.2 Pack-Run Metrics

- `pack_run_logs_stream_lag_seconds`
- `pack_run_heartbeats_total`
- `pack_run_artifacts_total`
- `pack_run_duration_seconds`

---

## 12. Offline Support

### 12.1 Audit Bundle Export

```bash
stella orch export --tenant acme-corp --since 2025-11-01 --output audit-bundle.tar.gz
```

Bundle contents:

- `jobs.jsonl` - Job records
- `history.jsonl` - State transitions
- `throttles.jsonl` - Throttle events
- `manifest.json` - Bundle metadata
- `signatures/` - DSSE signatures

### 12.2 Replay Verification

```bash
# Verify job determinism
stella job verify --bundle audit-bundle.tar.gz --job-id job-12345
```

---

## 13. Related Documentation

| Resource | Location |
|----------|----------|
| Orchestrator architecture | `docs/modules/orchestrator/architecture.md` |
| Event envelope spec | `docs/modules/orchestrator/event-envelope.md` |
| TaskRunner integration | `docs/modules/taskrunner/orchestrator-bridge.md` |

---

## 14. Sprint Mapping

- **Primary Sprint:** SPRINT_0151_0001_0001_orchestrator_i.md
- **Related Sprints:**
  - SPRINT_0152_0001_0002_orchestrator_ii.md
  - SPRINT_0153_0001_0003_orchestrator_iii.md
  - SPRINT_0157_0001_0001_taskrunner_i.md

**Key Task IDs:**

- `ORCH-CORE-30-001` - Job lifecycle (DONE)
- `ORCH-QUOTA-31-001` - Quota governance (DONE)
- `ORCH-PACK-37-001` - Pack-run bridge (IN PROGRESS)
- `ORCH-SVC-37-101` - Event envelope (IN PROGRESS)
- `ORCH-REPLAY-38-001` - Replay verification (TODO)

---

## 15. Success Metrics

| Metric | Target |
|--------|--------|
| Job scheduling latency | < 100ms p99 |
| Lease acquisition time | < 50ms p99 |
| Event fan-out delay | < 500ms |
| Quota enforcement accuracy | 100% |
| Replay determinism | 100% match |

---

*Last updated: 2025-11-29*