Orchestrator Event Model and Job Lifecycle
Version: 1.0
Date: 2025-11-29
Status: Canonical
This advisory defines the product rationale, job lifecycle semantics, and implementation strategy for the Orchestrator module. It covers the event model, quota governance, replay semantics, and the TaskRunner bridge.
1. Executive Summary
The Orchestrator is the central job coordination layer for all Stella Ops asynchronous operations. Key capabilities:
- Unified Job Lifecycle - Enqueue, schedule, lease, complete with audit trail
- Quota Governance - Per-tenant rate limits, burst controls, circuit breakers
- Replay Semantics - Deterministic job replay for audit and recovery
- TaskRunner Bridge - Pack-run integration with heartbeats and artifacts
- Event Fan-Out - SSE/GraphQL feeds for dashboards and notifications
- Offline Export - Audit bundles for compliance and investigations
2. Market Drivers
2.1 Target Segments
| Segment | Orchestration Requirements | Use Case |
|---|---|---|
| Enterprise | Rate limiting, quota management | Multi-team resource sharing |
| MSP/MSSP | Multi-tenant isolation | Managed security services |
| Compliance Teams | Audit trails, replay | SOC 2, FedRAMP evidence |
| DevSecOps | CI/CD integration, webhooks | Pipeline automation |
2.2 Competitive Positioning
Most vulnerability platforms lack sophisticated job orchestration. Stella Ops differentiates with:
- Deterministic replay for audit and debugging
- Fine-grained quotas per tenant/job-type
- Circuit breakers for automatic failure isolation
- Native pack-run integration for workflow automation
- Offline-compatible audit bundles
3. Job Lifecycle Model
3.1 State Machine
3.2 Lifecycle Phases
| Phase | Description | Transitions |
|---|---|---|
| Created | Job request received | -> Queued |
| Queued | Awaiting scheduling | -> Leased, Throttled |
| Throttled | Rate limit applied | -> Queued (after delay) |
| Leased | Worker acquired job | -> Running, Expired |
| Running | Active execution | -> Completed, Failed, Canceled |
| Completed | Success, archived | Terminal |
| Failed | Error, may retry | -> Queued (retry), Incident |
| Canceled | Operator abort | Terminal |
| Incident | Escalated failure | Terminal (requires operator) |
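
The transitions in the lifecycle table can be encoded as a small lookup, which makes illegal state changes easy to reject. The following is a minimal sketch, not the Orchestrator's actual implementation: the names `JobState`, `TRANSITIONS`, `can_transition`, and `is_terminal` are hypothetical. The table lists Expired only as a target of Leased, so the sketch gives it no outgoing transitions as a placeholder.

```python
from enum import Enum


class JobState(Enum):
    CREATED = "Created"
    QUEUED = "Queued"
    THROTTLED = "Throttled"
    LEASED = "Leased"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    CANCELED = "Canceled"
    INCIDENT = "Incident"
    EXPIRED = "Expired"


# Allowed transitions, copied from the lifecycle table.
# Assumption: Expired has no row in the table, so it is left without
# outgoing transitions here; the real follow-up state is not specified.
TRANSITIONS = {
    JobState.CREATED: {JobState.QUEUED},
    JobState.QUEUED: {JobState.LEASED, JobState.THROTTLED},
    JobState.THROTTLED: {JobState.QUEUED},
    JobState.LEASED: {JobState.RUNNING, JobState.EXPIRED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED, JobState.CANCELED},
    JobState.COMPLETED: set(),
    JobState.FAILED: {JobState.QUEUED, JobState.INCIDENT},
    JobState.CANCELED: set(),
    JobState.INCIDENT: set(),
    JobState.EXPIRED: set(),
}


def can_transition(src, dst):
    """Return True if the table allows moving from src to dst."""
    return dst in TRANSITIONS[src]


def is_terminal(state):
    """Return True if the state has no outgoing transitions in this sketch."""
    return not TRANSITIONS[state]
```

For example, `can_transition(JobState.FAILED, JobState.QUEUED)` models a retry, while any transition out of `JobState.COMPLETED` is rejected.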
3.3 Job Request Structure
4. Quota Governance
4.1 Quota Model
4.2 Rate Limit Enforcement
- Quota Check - Before leasing, verify tenant hasn't exceeded limits
- Burst Control - Allow short bursts within configured window
- Staging - Jobs exceeding limits are staged with a nextEligibleAt timestamp
- Priority Bypass - Emergency jobs can skip queue (with separate limits)
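
One common way to combine a quota check, burst allowance, and staging is a token bucket. The source does not say which algorithm the Orchestrator uses, so the sketch below is an illustration only. `TenantBucket`, `rate_per_sec`, `burst`, and `try_lease` are hypothetical names; the second return value plays the role of nextEligibleAt.

```python
from dataclasses import dataclass


@dataclass
class TenantBucket:
    rate_per_sec: float  # sustained lease rate (hypothetical field)
    burst: int           # maximum tokens, i.e. the allowed burst size
    tokens: float        # tokens currently available
    updated_at: float    # time of the last refill, in seconds

    def try_lease(self, now):
        """Return (allowed, next_eligible_at) for one lease attempt."""
        # Refill tokens for the time elapsed since the last attempt.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, None
        # Not enough budget: stage the job until one full token is available.
        wait = (1.0 - self.tokens) / self.rate_per_sec
        return False, now + wait
```

A caller that receives `(False, t)` would keep the job staged and retry the lease at or after `t`. Priority bypass is not modeled; it would need a separate bucket with its own limits.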
4.3 Dynamic Controls
| Control | API | Purpose |
|---|---|---|
| pauseSource | POST /api/limits/pause | Halt specific job sources |
| resumeSource | POST /api/limits/resume | Resume paused sources |
| throttle | POST /api/limits/throttle | Apply temporary throttle |
| updateQuota | PATCH /api/quotas/{id} | Modify quota limits |
4.4 Circuit Breakers
- Auto-pause job types when failure rate > threshold (default 50%)
- Incident events generated via Notify
- Half-open testing after cooldown period
- Manual reset via operator action
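
The breaker behaviour listed above (open on high failure rate, half-open after cooldown, manual reset) can be sketched as a small state holder per job type. This is an assumption-laden illustration: `JobTypeBreaker`, the sliding `window` of recent results, and the `cooldown_s` default are hypothetical. Only the 50% default threshold comes from the text.

```python
from collections import deque


class JobTypeBreaker:
    def __init__(self, threshold=0.5, window=20, cooldown_s=60.0):
        self.threshold = threshold            # failure rate that trips the breaker
        self.cooldown_s = cooldown_s          # hypothetical cooldown before half-open
        self.results = deque(maxlen=window)   # recent outcomes, True = failure
        self.state = "closed"
        self.opened_at = None

    def allow(self, now):
        """Return True if jobs of this type may be leased at time now."""
        if self.state == "open" and now - self.opened_at >= self.cooldown_s:
            self.state = "half_open"  # let trial jobs through after cooldown
        return self.state != "open"

    def record(self, failed, now):
        """Record one job outcome and update the breaker state."""
        if self.state == "open":
            return  # ignore results from jobs that were already in flight
        if self.state == "half_open":
            # The first trial result decides: reopen on failure, close on success.
            if failed:
                self._open(now)
            else:
                self.reset()
            return
        self.results.append(failed)
        rate = sum(self.results) / len(self.results)
        if len(self.results) == self.results.maxlen and rate > self.threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now

    def reset(self):
        """Manual (or post-trial) reset back to the closed state."""
        self.state = "closed"
        self.opened_at = None
        self.results.clear()
```

The sketch only trips once the window is full, which avoids pausing a job type after a single early failure. Emitting the incident event through Notify is left out.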
5. TaskRunner Bridge
5.1 Pack-Run Integration
The Orchestrator provides specialized support for TaskRunner pack executions.
5.2 Heartbeat Protocol
- Workers send heartbeats every heartbeatCadence seconds
- Missed heartbeats trigger lease expiration
- Lease can be extended for long-running tasks
- Dead workers detected within 2x heartbeat interval
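
The expiry rule above can be expressed as a single comparison against the last heartbeat time. The following is a minimal sketch under stated assumptions: the `Lease` class, its fields, and the `grace_factor` parameter are hypothetical, and the factor of 2 mirrors the "2x heartbeat interval" detection bound.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    run_id: str
    heartbeat_cadence: float  # seconds between expected heartbeats
    last_heartbeat: float     # time of the most recent heartbeat, in seconds

    def heartbeat(self, now):
        """Record a heartbeat, which extends the lease."""
        self.last_heartbeat = now

    def is_expired(self, now, grace_factor=2.0):
        """Treat the worker as dead after grace_factor missed intervals."""
        return now - self.last_heartbeat > grace_factor * self.heartbeat_cadence
```

With a 10-second cadence, a lease last refreshed at t=0 is still valid at t=20 and is considered expired just after that.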
5.3 Artifact & Log Streaming
| Endpoint | Method | Purpose |
|---|---|---|
| /runs/{runId}/logs | SSE | Stream execution logs |
| /runs/{runId}/artifacts | GET | List produced artifacts |
| /runs/{runId}/artifacts/{name} | GET | Download artifact |
| /runs/{runId}/heartbeat | POST | Extend lease |
6. Event Model
6.1 Event Envelope
6.2 Event Types
| Event | Trigger | Consumers |
|---|---|---|
| job.queued | Job enqueued | Dashboard, Notify |
| job.leased | Worker acquired job | Dashboard |
| job.started | Execution began | Dashboard, Notify |
| job.progress | Progress update | Dashboard (SSE) |
| job.completed | Success | Dashboard, Notify, Export |
| job.failed | Error occurred | Dashboard, Notify, Incident |
| job.canceled | Operator abort | Dashboard, Notify |
| job.replayed | Replay initiated | Dashboard, Audit |
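
The event-to-consumer mapping in the table can be treated as a routing table. The sketch below is illustrative only: `EVENT_CONSUMERS` copies the table, while `fan_out`, the `handlers` registry, and the handler signature are hypothetical. The payload is passed through untouched because its shape is not defined here.

```python
# Consumers per event type, copied from the event table.
EVENT_CONSUMERS = {
    "job.queued": ("Dashboard", "Notify"),
    "job.leased": ("Dashboard",),
    "job.started": ("Dashboard", "Notify"),
    "job.progress": ("Dashboard",),
    "job.completed": ("Dashboard", "Notify", "Export"),
    "job.failed": ("Dashboard", "Notify", "Incident"),
    "job.canceled": ("Dashboard", "Notify"),
    "job.replayed": ("Dashboard", "Audit"),
}


def fan_out(event_type, payload, handlers):
    """Call every registered handler listed for event_type.

    handlers maps a consumer name to a callable(event_type, payload).
    Returns the names of the consumers that were actually invoked.
    """
    delivered = []
    for consumer in EVENT_CONSUMERS.get(event_type, ()):
        handler = handlers.get(consumer)
        if handler is not None:
            handler(event_type, payload)
            delivered.append(consumer)
    return delivered
```

Unknown event types and consumers without a registered handler are skipped rather than raising, so a partially configured deployment still delivers what it can.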
6.3 Fan-Out Channels
- SSE - Real-time dashboard feeds
- GraphQL Subscriptions - Console UI
- Notify - Alert routing based on rules
- Webhooks - External integrations
- Audit Log - Compliance storage
7. Replay Semantics
7.1 Deterministic Replay
Jobs can be replayed for audit, debugging, or recovery.
7.2 Replay Guarantees
| Property | Guarantee |
|---|---|
| Input preservation | Same payloadDigest, cursors |
| Ordering | Same processing order |
| Determinism | Same outputs for same inputs |
| Provenance | replayOf pointer to original |
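
The input-preservation and provenance guarantees above can be checked mechanically. The following is a minimal sketch, assuming a SHA-256 digest over canonical JSON; the source does not specify the digest algorithm or record layout. Only `payloadDigest`, `cursors`, and `replayOf` come from the table; `jobId`, `payload`, and all function names are hypothetical.

```python
import hashlib
import json


def payload_digest(payload):
    """Digest of the payload serialized as canonical JSON (assumed format)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_replay(original):
    """Build a replay record that reuses the original inputs."""
    return {
        "jobId": original["jobId"] + "-replay",  # hypothetical naming scheme
        "payload": original["payload"],
        "payloadDigest": original["payloadDigest"],
        "cursors": original.get("cursors"),
        "replayOf": original["jobId"],
    }


def verify_replay(original, replay):
    """Check provenance and input preservation between two records."""
    return (
        replay["replayOf"] == original["jobId"]
        and replay["payloadDigest"] == original["payloadDigest"]
        and payload_digest(replay["payload"]) == replay["payloadDigest"]
        and replay.get("cursors") == original.get("cursors")
    )
```

Sorting keys before hashing keeps the digest stable when the same payload is serialized with a different key order. The ordering and output-determinism guarantees depend on the job itself and are not covered by this check.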
7.3 Replay Record
8. Implementation Strategy
8.1 Phase 1: Core Lifecycle (Complete)
8.2 Phase 2: Pack-Run Bridge (In Progress)
8.3 Phase 3: Advanced Controls (Planned)
9. API Surface
9.1 Job Management
| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /api/jobs | GET | orch:read | List jobs with filters |
| /api/jobs/{id} | GET | orch:read | Job detail |
| /api/jobs/{id}/cancel | POST | orch:operate | Cancel job |
| /api/jobs/{id}/replay | POST | orch:operate | Schedule replay |
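
As an illustration of the two operator endpoints, the sketch below builds (but does not send) a cancel or replay request with the standard library. The host in `BASE_URL`, the bearer-token header, and the helper name `build_job_action` are assumptions; the source lists only the paths, methods, and scopes, and does not describe a request body, so none is sent.

```python
from urllib.request import Request

BASE_URL = "https://orchestrator.example.internal"  # hypothetical host


def build_job_action(job_id, action, token):
    """Build a POST request for /api/jobs/{id}/cancel or /api/jobs/{id}/replay.

    Assumption: the token carries the orch:operate scope and is sent as a
    bearer token; the actual authentication scheme is not stated in the source.
    """
    if action not in ("cancel", "replay"):
        raise ValueError(f"unsupported action: {action}")
    return Request(
        url=f"{BASE_URL}/api/jobs/{job_id}/{action}",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )
```

The returned object can be passed to `urllib.request.urlopen` once the real host and credentials are known.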
9.2 Quota Management
| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /api/quotas | GET | orch:read | List quotas |
| /api/quotas/{id} | PATCH | orch:quota | Update quota |
| /api/limits/throttle | POST | orch:quota | Apply throttle |
| /api/limits/pause | POST | orch:quota | Pause source |
| /api/limits/resume | POST | orch:quota | Resume source |
9.3 Dashboard
| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /api/dashboard/metrics | GET | orch:read | Aggregated metrics |
| /api/dashboard/events | SSE | orch:read | Real-time events |
10. Storage Model
10.1 Collections
| Collection | Purpose | Key Fields |
|---|---|---|
| jobs | Current job state | _id, tenant, jobType, status, priority |
| job_history | Append-only audit | jobId, event, timestamp, actor |
| sources | Job sources registry | sourceId, tenant, status |
| quotas | Quota definitions | tenant, jobType, limits |
| throttles | Active throttles | tenant, source, until |
| incidents | Escalated failures | jobId, reason, status |
10.2 Indexes
- {tenant, jobType, status} on jobs
- {tenant, status, startedAt} on jobs
- {jobId, timestamp} on job_history
- TTL index on transient lease records
11. Observability
11.1 Metrics
- job_queue_depth{jobType,tenant}
- job_latency_seconds{jobType,phase}
- job_failures_total{jobType,reason}
- job_retry_total{jobType}
- lease_extensions_total{jobType}
- quota_exceeded_total{tenant}
- circuit_breaker_state{jobType}
11.2 Pack-Run Metrics
- pack_run_logs_stream_lag_seconds
- pack_run_heartbeats_total
- pack_run_artifacts_total
- pack_run_duration_seconds
12. Offline Support
12.1 Audit Bundle Export
Bundle contents:
- jobs.jsonl - Job records
- history.jsonl - State transitions
- throttles.jsonl - Throttle events
- manifest.json - Bundle metadata
- signatures/ - DSSE signatures
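
To make the bundle layout concrete, the sketch below serializes records as JSON Lines and builds a manifest that lists each file with its SHA-256 digest. The manifest schema (`files`, `name`, `sha256`, `bytes`) and both function names are assumptions, since the source states only that manifest.json holds bundle metadata. DSSE signing is deliberately left out.

```python
import hashlib
import json


def to_jsonl(records):
    """Serialize records as JSON Lines bytes, one object per line."""
    lines = (json.dumps(r, sort_keys=True) + "\n" for r in records)
    return "".join(lines).encode("utf-8")


def build_manifest(files):
    """Build a manifest for a bundle.

    files maps bundle-relative file names to their byte content.
    Assumption: the real manifest.json schema is not specified in the
    source; this layout is purely illustrative.
    """
    entries = [
        {"name": name, "sha256": hashlib.sha256(data).hexdigest(), "bytes": len(data)}
        for name, data in sorted(files.items())
    ]
    return {"files": entries}
```

Sorting the entries by name keeps the manifest byte-stable across exports of the same data, which helps when the manifest itself is later signed.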
12.2 Replay Verification
13. Related Documentation
| Resource | Location |
|---|---|
| Orchestrator architecture | docs/modules/orchestrator/architecture.md |
| Event envelope spec | docs/modules/orchestrator/event-envelope.md |
| TaskRunner integration | docs/modules/taskrunner/orchestrator-bridge.md |
14. Sprint Mapping
- Primary Sprint: SPRINT_0151_0001_0001_orchestrator_i.md
- Related Sprints:
  - SPRINT_0152_0001_0002_orchestrator_ii.md
  - SPRINT_0153_0001_0003_orchestrator_iii.md
  - SPRINT_0157_0001_0001_taskrunner_i.md

Key Task IDs:
- ORCH-CORE-30-001 - Job lifecycle (DONE)
- ORCH-QUOTA-31-001 - Quota governance (DONE)
- ORCH-PACK-37-001 - Pack-run bridge (IN PROGRESS)
- ORCH-SVC-37-101 - Event envelope (IN PROGRESS)
- ORCH-REPLAY-38-001 - Replay verification (TODO)
15. Success Metrics
| Metric | Target |
|---|---|
| Job scheduling latency | < 100ms p99 |
| Lease acquisition time | < 50ms p99 |
| Event fan-out delay | < 500ms |
| Quota enforcement accuracy | 100% |
| Replay determinism | 100% match |
Last updated: 2025-11-29