This commit is contained in:
@@ -0,0 +1,432 @@
|
||||
# Orchestrator Event Model and Job Lifecycle
|
||||
|
||||
**Version:** 1.0
|
||||
**Date:** 2025-11-29
|
||||
**Status:** Canonical
|
||||
|
||||
This advisory defines the product rationale, job lifecycle semantics, and implementation strategy for the Orchestrator module, covering event models, quota governance, replay semantics, and TaskRunner bridge.
|
||||
|
||||
---
|
||||
|
||||
## 1. Executive Summary
|
||||
|
||||
The Orchestrator is the **central job coordination layer** for all Stella Ops asynchronous operations. Key capabilities:
|
||||
|
||||
- **Unified Job Lifecycle** - Enqueue, schedule, lease, complete with audit trail
|
||||
- **Quota Governance** - Per-tenant rate limits, burst controls, circuit breakers
|
||||
- **Replay Semantics** - Deterministic job replay for audit and recovery
|
||||
- **TaskRunner Bridge** - Pack-run integration with heartbeats and artifacts
|
||||
- **Event Fan-Out** - SSE/GraphQL feeds for dashboards and notifications
|
||||
- **Offline Export** - Audit bundles for compliance and investigations
|
||||
|
||||
---
|
||||
|
||||
## 2. Market Drivers
|
||||
|
||||
### 2.1 Target Segments
|
||||
|
||||
| Segment | Orchestration Requirements | Use Case |
|
||||
|---------|---------------------------|----------|
|
||||
| **Enterprise** | Rate limiting, quota management | Multi-team resource sharing |
|
||||
| **MSP/MSSP** | Multi-tenant isolation | Managed security services |
|
||||
| **Compliance Teams** | Audit trails, replay | SOC 2, FedRAMP evidence |
|
||||
| **DevSecOps** | CI/CD integration, webhooks | Pipeline automation |
|
||||
|
||||
### 2.2 Competitive Positioning
|
||||
|
||||
Most vulnerability platforms lack sophisticated job orchestration. Stella Ops differentiates with:
|
||||
- **Deterministic replay** for audit and debugging
|
||||
- **Fine-grained quotas** per tenant/job-type
|
||||
- **Circuit breakers** for automatic failure isolation
|
||||
- **Native pack-run integration** for workflow automation
|
||||
- **Offline-compatible** audit bundles
|
||||
|
||||
---
|
||||
|
||||
## 3. Job Lifecycle Model
|
||||
|
||||
### 3.1 State Machine
|
||||
|
||||
```
|
||||
[Created] --> [Queued] --> [Leased] --> [Running] --> [Completed]
|
||||
| | | |
|
||||
| | v v
|
||||
| +-------> [Failed] <----[Canceled]
|
||||
| |
|
||||
v v
|
||||
[Throttled] [Incident]
|
||||
```
|
||||
|
||||
### 3.2 Lifecycle Phases
|
||||
|
||||
| Phase | Description | Transitions |
|
||||
|-------|-------------|-------------|
|
||||
| **Created** | Job request received | -> Queued |
|
||||
| **Queued** | Awaiting scheduling | -> Leased, Throttled |
|
||||
| **Throttled** | Rate limit applied | -> Queued (after delay) |
|
||||
| **Leased** | Worker acquired job | -> Running, Expired |
|
||||
| **Running** | Active execution | -> Completed, Failed, Canceled |
|
||||
| **Completed** | Success, archived | Terminal |
|
||||
| **Failed** | Error, may retry | -> Queued (retry), Incident |
|
||||
| **Canceled** | Operator abort | Terminal |
|
||||
| **Incident** | Escalated failure | Terminal (requires operator) |
|
||||
|
||||
### 3.3 Job Request Structure
|
||||
|
||||
```json
|
||||
{
|
||||
"jobId": "uuid",
|
||||
"jobType": "scan|policy-run|export|pack-run|advisory-sync",
|
||||
"tenant": "tenant-id",
|
||||
"priority": "low|normal|high|emergency",
|
||||
"payloadDigest": "sha256:...",
|
||||
"payload": { "imageRef": "nginx:latest", "options": {} },
|
||||
"dependencies": ["job-id-1", "job-id-2"],
|
||||
"idempotencyKey": "unique-request-key",
|
||||
"correlationId": "trace-id",
|
||||
"requestedBy": "user-id|service-id",
|
||||
"requestedAt": "2025-11-29T12:00:00Z"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Quota Governance
|
||||
|
||||
### 4.1 Quota Model
|
||||
|
||||
```yaml
|
||||
quotas:
|
||||
- tenant: "acme-corp"
|
||||
jobType: "*"
|
||||
maxActive: 50
|
||||
maxPerHour: 500
|
||||
burst: 10
|
||||
priority:
|
||||
emergency:
|
||||
maxActive: 5
|
||||
skipQueue: true
|
||||
|
||||
- tenant: "acme-corp"
|
||||
jobType: "export"
|
||||
maxActive: 4
|
||||
maxPerHour: 100
|
||||
```
|
||||
|
||||
### 4.2 Rate Limit Enforcement
|
||||
|
||||
1. **Quota Check** - Before leasing, verify tenant hasn't exceeded limits
|
||||
2. **Burst Control** - Allow short bursts within configured window
|
||||
3. **Staging** - Jobs exceeding limits staged with `nextEligibleAt` timestamp
|
||||
4. **Priority Bypass** - Emergency jobs can skip queue (with separate limits)
|
||||
|
||||
### 4.3 Dynamic Controls
|
||||
|
||||
| Control | API | Purpose |
|
||||
|---------|-----|---------|
|
||||
| `pauseSource` | `POST /api/limits/pause` | Halt specific job sources |
|
||||
| `resumeSource` | `POST /api/limits/resume` | Resume paused sources |
|
||||
| `throttle` | `POST /api/limits/throttle` | Apply temporary throttle |
|
||||
| `updateQuota` | `PATCH /api/quotas/{id}` | Modify quota limits |
|
||||
|
||||
### 4.4 Circuit Breakers
|
||||
|
||||
- Auto-pause job types when failure rate > threshold (default 50%)
|
||||
- Incident events generated via Notify
|
||||
- Half-open testing after cooldown period
|
||||
- Manual reset via operator action
|
||||
|
||||
---
|
||||
|
||||
## 5. TaskRunner Bridge
|
||||
|
||||
### 5.1 Pack-Run Integration
|
||||
|
||||
The Orchestrator provides specialized support for TaskRunner pack executions:
|
||||
|
||||
```json
|
||||
{
|
||||
"jobType": "pack-run",
|
||||
"payload": {
|
||||
"packId": "vuln-scan-and-report",
|
||||
"packVersion": "1.2.0",
|
||||
"planHash": "sha256:...",
|
||||
"inputs": { "imageRef": "nginx:latest" },
|
||||
"artifacts": [],
|
||||
"logChannel": "sse:/runs/{runId}/logs",
|
||||
"heartbeatCadence": 30
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 5.2 Heartbeat Protocol
|
||||
|
||||
- Workers send heartbeats every `heartbeatCadence` seconds
|
||||
- Missed heartbeats trigger lease expiration
|
||||
- Lease can be extended for long-running tasks
|
||||
- Dead workers detected within 2x heartbeat interval
|
||||
|
||||
### 5.3 Artifact & Log Streaming
|
||||
|
||||
| Endpoint | Method | Purpose |
|
||||
|----------|--------|---------|
|
||||
| `/runs/{runId}/logs` | SSE | Stream execution logs |
|
||||
| `/runs/{runId}/artifacts` | GET | List produced artifacts |
|
||||
| `/runs/{runId}/artifacts/{name}` | GET | Download artifact |
|
||||
| `/runs/{runId}/heartbeat` | POST | Extend lease |
|
||||
|
||||
---
|
||||
|
||||
## 6. Event Model
|
||||
|
||||
### 6.1 Event Envelope
|
||||
|
||||
```json
|
||||
{
|
||||
"eventId": "uuid",
|
||||
"eventType": "job.queued|job.leased|job.completed|job.failed",
|
||||
"timestamp": "2025-11-29T12:00:00Z",
|
||||
"tenant": "tenant-id",
|
||||
"jobId": "job-id",
|
||||
"jobType": "scan",
|
||||
"correlationId": "trace-id",
|
||||
"idempotencyKey": "unique-key",
|
||||
"payload": {
|
||||
"status": "completed",
|
||||
"duration": 45.2,
|
||||
"result": { "verdict": "pass" }
|
||||
},
|
||||
"provenance": {
|
||||
"workerId": "worker-1",
|
||||
"leaseId": "lease-id",
|
||||
"taskRunnerId": "runner-1"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 6.2 Event Types
|
||||
|
||||
| Event | Trigger | Consumers |
|
||||
|-------|---------|-----------|
|
||||
| `job.queued` | Job enqueued | Dashboard, Notify |
|
||||
| `job.leased` | Worker acquired job | Dashboard |
|
||||
| `job.started` | Execution began | Dashboard, Notify |
|
||||
| `job.progress` | Progress update | Dashboard (SSE) |
|
||||
| `job.completed` | Success | Dashboard, Notify, Export |
|
||||
| `job.failed` | Error occurred | Dashboard, Notify, Incident |
|
||||
| `job.canceled` | Operator abort | Dashboard, Notify |
|
||||
| `job.replayed` | Replay initiated | Dashboard, Audit |
|
||||
|
||||
### 6.3 Fan-Out Channels
|
||||
|
||||
- **SSE** - Real-time dashboard feeds
|
||||
- **GraphQL Subscriptions** - Console UI
|
||||
- **Notify** - Alert routing based on rules
|
||||
- **Webhooks** - External integrations
|
||||
- **Audit Log** - Compliance storage
|
||||
|
||||
---
|
||||
|
||||
## 7. Replay Semantics
|
||||
|
||||
### 7.1 Deterministic Replay
|
||||
|
||||
Jobs can be replayed for audit, debugging, or recovery:
|
||||
|
||||
```bash
|
||||
# Replay a completed job
|
||||
stella job replay --id job-12345
|
||||
|
||||
# Replay with sealed mode (offline verification)
|
||||
stella job replay --id job-12345 --sealed --bundle output.tar.gz
|
||||
```
|
||||
|
||||
### 7.2 Replay Guarantees
|
||||
|
||||
| Property | Guarantee |
|
||||
|----------|-----------|
|
||||
| **Input preservation** | Same payloadDigest, cursors |
|
||||
| **Ordering** | Same processing order |
|
||||
| **Determinism** | Same outputs for same inputs |
|
||||
| **Provenance** | `replayOf` pointer to original |
|
||||
|
||||
### 7.3 Replay Record
|
||||
|
||||
```json
|
||||
{
|
||||
"jobId": "replay-job-id",
|
||||
"replayOf": "original-job-id",
|
||||
"priority": "high",
|
||||
"reason": "audit-verification",
|
||||
"requestedBy": "auditor@example.com",
|
||||
"cursors": {
|
||||
"advisory": "cursor-abc",
|
||||
"vex": "cursor-def"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Implementation Strategy
|
||||
|
||||
### 8.1 Phase 1: Core Lifecycle (Complete)
|
||||
|
||||
- [x] Job state machine
|
||||
- [x] MongoDB queue with leasing
|
||||
- [x] Basic quota enforcement
|
||||
- [x] Dashboard SSE feeds
|
||||
|
||||
### 8.2 Phase 2: Pack-Run Bridge (In Progress)
|
||||
|
||||
- [x] Pack-run job type registration
|
||||
- [x] Log/artifact streaming
|
||||
- [ ] Heartbeat protocol (ORCH-PACK-37-001)
|
||||
- [ ] Event envelope finalization (ORCH-SVC-37-101)
|
||||
|
||||
### 8.3 Phase 3: Advanced Controls (Planned)
|
||||
|
||||
- [ ] Circuit breaker automation
|
||||
- [ ] Quota analytics dashboard
|
||||
- [ ] Replay verification tooling
|
||||
- [ ] Incident mode integration
|
||||
|
||||
---
|
||||
|
||||
## 9. API Surface
|
||||
|
||||
### 9.1 Job Management
|
||||
|
||||
| Endpoint | Method | Scope | Description |
|
||||
|----------|--------|-------|-------------|
|
||||
| `/api/jobs` | GET | `orch:read` | List jobs with filters |
|
||||
| `/api/jobs/{id}` | GET | `orch:read` | Job detail |
|
||||
| `/api/jobs/{id}/cancel` | POST | `orch:operate` | Cancel job |
|
||||
| `/api/jobs/{id}/replay` | POST | `orch:operate` | Schedule replay |
|
||||
|
||||
### 9.2 Quota Management
|
||||
|
||||
| Endpoint | Method | Scope | Description |
|
||||
|----------|--------|-------|-------------|
|
||||
| `/api/quotas` | GET | `orch:read` | List quotas |
|
||||
| `/api/quotas/{id}` | PATCH | `orch:quota` | Update quota |
|
||||
| `/api/limits/throttle` | POST | `orch:quota` | Apply throttle |
|
||||
| `/api/limits/pause` | POST | `orch:quota` | Pause source |
|
||||
| `/api/limits/resume` | POST | `orch:quota` | Resume source |
|
||||
|
||||
### 9.3 Dashboard
|
||||
|
||||
| Endpoint | Method | Scope | Description |
|
||||
|----------|--------|-------|-------------|
|
||||
| `/api/dashboard/metrics` | GET | `orch:read` | Aggregated metrics |
|
||||
| `/api/dashboard/events` | SSE | `orch:read` | Real-time events |
|
||||
|
||||
---
|
||||
|
||||
## 10. Storage Model
|
||||
|
||||
### 10.1 Collections
|
||||
|
||||
| Collection | Purpose | Key Fields |
|
||||
|------------|---------|------------|
|
||||
| `jobs` | Current job state | `_id`, `tenant`, `jobType`, `status`, `priority` |
|
||||
| `job_history` | Append-only audit | `jobId`, `event`, `timestamp`, `actor` |
|
||||
| `sources` | Job sources registry | `sourceId`, `tenant`, `status` |
|
||||
| `quotas` | Quota definitions | `tenant`, `jobType`, `limits` |
|
||||
| `throttles` | Active throttles | `tenant`, `source`, `until` |
|
||||
| `incidents` | Escalated failures | `jobId`, `reason`, `status` |
|
||||
|
||||
### 10.2 Indexes
|
||||
|
||||
- `{tenant, jobType, status}` on `jobs`
|
||||
- `{tenant, status, startedAt}` on `jobs`
|
||||
- `{jobId, timestamp}` on `job_history`
|
||||
- TTL index on transient lease records
|
||||
|
||||
---
|
||||
|
||||
## 11. Observability
|
||||
|
||||
### 11.1 Metrics
|
||||
|
||||
- `job_queue_depth{jobType,tenant}`
|
||||
- `job_latency_seconds{jobType,phase}`
|
||||
- `job_failures_total{jobType,reason}`
|
||||
- `job_retry_total{jobType}`
|
||||
- `lease_extensions_total{jobType}`
|
||||
- `quota_exceeded_total{tenant}`
|
||||
- `circuit_breaker_state{jobType}`
|
||||
|
||||
### 11.2 Pack-Run Metrics
|
||||
|
||||
- `pack_run_logs_stream_lag_seconds`
|
||||
- `pack_run_heartbeats_total`
|
||||
- `pack_run_artifacts_total`
|
||||
- `pack_run_duration_seconds`
|
||||
|
||||
---
|
||||
|
||||
## 12. Offline Support
|
||||
|
||||
### 12.1 Audit Bundle Export
|
||||
|
||||
```bash
|
||||
stella orch export --tenant acme-corp --since 2025-11-01 --output audit-bundle.tar.gz
|
||||
```
|
||||
|
||||
Bundle contents:
|
||||
- `jobs.jsonl` - Job records
|
||||
- `history.jsonl` - State transitions
|
||||
- `throttles.jsonl` - Throttle events
|
||||
- `manifest.json` - Bundle metadata
|
||||
- `signatures/` - DSSE signatures
|
||||
|
||||
### 12.2 Replay Verification
|
||||
|
||||
```bash
|
||||
# Verify job determinism
|
||||
stella job verify --bundle audit-bundle.tar.gz --job-id job-12345
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 13. Related Documentation
|
||||
|
||||
| Resource | Location |
|
||||
|----------|----------|
|
||||
| Orchestrator architecture | `docs/modules/orchestrator/architecture.md` |
|
||||
| Event envelope spec | `docs/modules/orchestrator/event-envelope.md` |
|
||||
| TaskRunner integration | `docs/modules/taskrunner/orchestrator-bridge.md` |
|
||||
|
||||
---
|
||||
|
||||
## 14. Sprint Mapping
|
||||
|
||||
- **Primary Sprint:** SPRINT_0151_0001_0001_orchestrator_i.md
|
||||
- **Related Sprints:**
|
||||
- SPRINT_0152_0001_0002_orchestrator_ii.md
|
||||
- SPRINT_0153_0001_0003_orchestrator_iii.md
|
||||
- SPRINT_0157_0001_0001_taskrunner_i.md
|
||||
|
||||
**Key Task IDs:**
|
||||
- `ORCH-CORE-30-001` - Job lifecycle (DONE)
|
||||
- `ORCH-QUOTA-31-001` - Quota governance (DONE)
|
||||
- `ORCH-PACK-37-001` - Pack-run bridge (IN PROGRESS)
|
||||
- `ORCH-SVC-37-101` - Event envelope (IN PROGRESS)
|
||||
- `ORCH-REPLAY-38-001` - Replay verification (TODO)
|
||||
|
||||
---
|
||||
|
||||
## 15. Success Metrics
|
||||
|
||||
| Metric | Target |
|
||||
|--------|--------|
|
||||
| Job scheduling latency | < 100ms p99 |
|
||||
| Lease acquisition time | < 50ms p99 |
|
||||
| Event fan-out delay | < 500ms |
|
||||
| Quota enforcement accuracy | 100% |
|
||||
| Replay determinism | 100% match |
|
||||
|
||||
---
|
||||
|
||||
*Last updated: 2025-11-29*
|
||||
Reference in New Issue
Block a user