true the date
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled

This commit is contained in:
StellaOps Bot
2025-11-30 19:23:21 +02:00
parent 71e9a56cfd
commit 0bef705bcc
14 changed files with 0 additions and 0 deletions

View File

@@ -0,0 +1,432 @@
# Orchestrator Event Model and Job Lifecycle
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, job lifecycle semantics, and implementation strategy for the Orchestrator module, covering event models, quota governance, replay semantics, and TaskRunner bridge.
---
## 1. Executive Summary
The Orchestrator is the **central job coordination layer** for all Stella Ops asynchronous operations. Key capabilities:
- **Unified Job Lifecycle** - Enqueue, schedule, lease, complete with audit trail
- **Quota Governance** - Per-tenant rate limits, burst controls, circuit breakers
- **Replay Semantics** - Deterministic job replay for audit and recovery
- **TaskRunner Bridge** - Pack-run integration with heartbeats and artifacts
- **Event Fan-Out** - SSE/GraphQL feeds for dashboards and notifications
- **Offline Export** - Audit bundles for compliance and investigations
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Orchestration Requirements | Use Case |
|---------|---------------------------|----------|
| **Enterprise** | Rate limiting, quota management | Multi-team resource sharing |
| **MSP/MSSP** | Multi-tenant isolation | Managed security services |
| **Compliance Teams** | Audit trails, replay | SOC 2, FedRAMP evidence |
| **DevSecOps** | CI/CD integration, webhooks | Pipeline automation |
### 2.2 Competitive Positioning
Most vulnerability platforms lack sophisticated job orchestration. Stella Ops differentiates with:
- **Deterministic replay** for audit and debugging
- **Fine-grained quotas** per tenant/job-type
- **Circuit breakers** for automatic failure isolation
- **Native pack-run integration** for workflow automation
- **Offline-compatible** audit bundles
---
## 3. Job Lifecycle Model
### 3.1 State Machine
```
[Created] --> [Queued] --> [Leased] --> [Running] --> [Completed]
| | | |
| | v v
| +-------> [Failed] <----[Canceled]
| |
v v
[Throttled] [Incident]
```
### 3.2 Lifecycle Phases
| Phase | Description | Transitions |
|-------|-------------|-------------|
| **Created** | Job request received | -> Queued |
| **Queued** | Awaiting scheduling | -> Leased, Throttled |
| **Throttled** | Rate limit applied | -> Queued (after delay) |
| **Leased** | Worker acquired job | -> Running, Expired |
| **Running** | Active execution | -> Completed, Failed, Canceled |
| **Completed** | Success, archived | Terminal |
| **Failed** | Error, may retry | -> Queued (retry), Incident |
| **Canceled** | Operator abort | Terminal |
| **Incident** | Escalated failure | Terminal (requires operator) |
### 3.3 Job Request Structure
```json
{
"jobId": "uuid",
"jobType": "scan|policy-run|export|pack-run|advisory-sync",
"tenant": "tenant-id",
"priority": "low|normal|high|emergency",
"payloadDigest": "sha256:...",
"payload": { "imageRef": "nginx:latest", "options": {} },
"dependencies": ["job-id-1", "job-id-2"],
"idempotencyKey": "unique-request-key",
"correlationId": "trace-id",
"requestedBy": "user-id|service-id",
"requestedAt": "2025-11-29T12:00:00Z"
}
```
---
## 4. Quota Governance
### 4.1 Quota Model
```yaml
quotas:
- tenant: "acme-corp"
jobType: "*"
maxActive: 50
maxPerHour: 500
burst: 10
priority:
emergency:
maxActive: 5
skipQueue: true
- tenant: "acme-corp"
jobType: "export"
maxActive: 4
maxPerHour: 100
```
### 4.2 Rate Limit Enforcement
1. **Quota Check** - Before leasing, verify tenant hasn't exceeded limits
2. **Burst Control** - Allow short bursts within configured window
3. **Staging** - Jobs exceeding limits staged with `nextEligibleAt` timestamp
4. **Priority Bypass** - Emergency jobs can skip queue (with separate limits)
### 4.3 Dynamic Controls
| Control | API | Purpose |
|---------|-----|---------|
| `pauseSource` | `POST /api/limits/pause` | Halt specific job sources |
| `resumeSource` | `POST /api/limits/resume` | Resume paused sources |
| `throttle` | `POST /api/limits/throttle` | Apply temporary throttle |
| `updateQuota` | `PATCH /api/quotas/{id}` | Modify quota limits |
### 4.4 Circuit Breakers
- Auto-pause job types when failure rate > threshold (default 50%)
- Incident events generated via Notify
- Half-open testing after cooldown period
- Manual reset via operator action
---
## 5. TaskRunner Bridge
### 5.1 Pack-Run Integration
The Orchestrator provides specialized support for TaskRunner pack executions:
```json
{
"jobType": "pack-run",
"payload": {
"packId": "vuln-scan-and-report",
"packVersion": "1.2.0",
"planHash": "sha256:...",
"inputs": { "imageRef": "nginx:latest" },
"artifacts": [],
"logChannel": "sse:/runs/{runId}/logs",
"heartbeatCadence": 30
}
}
```
### 5.2 Heartbeat Protocol
- Workers send heartbeats every `heartbeatCadence` seconds
- Missed heartbeats trigger lease expiration
- Lease can be extended for long-running tasks
- Dead workers detected within 2x heartbeat interval
### 5.3 Artifact & Log Streaming
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/runs/{runId}/logs` | SSE | Stream execution logs |
| `/runs/{runId}/artifacts` | GET | List produced artifacts |
| `/runs/{runId}/artifacts/{name}` | GET | Download artifact |
| `/runs/{runId}/heartbeat` | POST | Extend lease |
---
## 6. Event Model
### 6.1 Event Envelope
```json
{
"eventId": "uuid",
"eventType": "job.queued|job.leased|job.completed|job.failed",
"timestamp": "2025-11-29T12:00:00Z",
"tenant": "tenant-id",
"jobId": "job-id",
"jobType": "scan",
"correlationId": "trace-id",
"idempotencyKey": "unique-key",
"payload": {
"status": "completed",
"duration": 45.2,
"result": { "verdict": "pass" }
},
"provenance": {
"workerId": "worker-1",
"leaseId": "lease-id",
"taskRunnerId": "runner-1"
}
}
```
### 6.2 Event Types
| Event | Trigger | Consumers |
|-------|---------|-----------|
| `job.queued` | Job enqueued | Dashboard, Notify |
| `job.leased` | Worker acquired job | Dashboard |
| `job.started` | Execution began | Dashboard, Notify |
| `job.progress` | Progress update | Dashboard (SSE) |
| `job.completed` | Success | Dashboard, Notify, Export |
| `job.failed` | Error occurred | Dashboard, Notify, Incident |
| `job.canceled` | Operator abort | Dashboard, Notify |
| `job.replayed` | Replay initiated | Dashboard, Audit |
### 6.3 Fan-Out Channels
- **SSE** - Real-time dashboard feeds
- **GraphQL Subscriptions** - Console UI
- **Notify** - Alert routing based on rules
- **Webhooks** - External integrations
- **Audit Log** - Compliance storage
---
## 7. Replay Semantics
### 7.1 Deterministic Replay
Jobs can be replayed for audit, debugging, or recovery:
```bash
# Replay a completed job
stella job replay --id job-12345
# Replay with sealed mode (offline verification)
stella job replay --id job-12345 --sealed --bundle output.tar.gz
```
### 7.2 Replay Guarantees
| Property | Guarantee |
|----------|-----------|
| **Input preservation** | Same payloadDigest, cursors |
| **Ordering** | Same processing order |
| **Determinism** | Same outputs for same inputs |
| **Provenance** | `replayOf` pointer to original |
### 7.3 Replay Record
```json
{
"jobId": "replay-job-id",
"replayOf": "original-job-id",
"priority": "high",
"reason": "audit-verification",
"requestedBy": "auditor@example.com",
"cursors": {
"advisory": "cursor-abc",
"vex": "cursor-def"
}
}
```
---
## 8. Implementation Strategy
### 8.1 Phase 1: Core Lifecycle (Complete)
- [x] Job state machine
- [x] MongoDB queue with leasing
- [x] Basic quota enforcement
- [x] Dashboard SSE feeds
### 8.2 Phase 2: Pack-Run Bridge (In Progress)
- [x] Pack-run job type registration
- [x] Log/artifact streaming
- [ ] Heartbeat protocol (ORCH-PACK-37-001)
- [ ] Event envelope finalization (ORCH-SVC-37-101)
### 8.3 Phase 3: Advanced Controls (Planned)
- [ ] Circuit breaker automation
- [ ] Quota analytics dashboard
- [ ] Replay verification tooling
- [ ] Incident mode integration
---
## 9. API Surface
### 9.1 Job Management
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/jobs` | GET | `orch:read` | List jobs with filters |
| `/api/jobs/{id}` | GET | `orch:read` | Job detail |
| `/api/jobs/{id}/cancel` | POST | `orch:operate` | Cancel job |
| `/api/jobs/{id}/replay` | POST | `orch:operate` | Schedule replay |
### 9.2 Quota Management
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/quotas` | GET | `orch:read` | List quotas |
| `/api/quotas/{id}` | PATCH | `orch:quota` | Update quota |
| `/api/limits/throttle` | POST | `orch:quota` | Apply throttle |
| `/api/limits/pause` | POST | `orch:quota` | Pause source |
| `/api/limits/resume` | POST | `orch:quota` | Resume source |
### 9.3 Dashboard
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/dashboard/metrics` | GET | `orch:read` | Aggregated metrics |
| `/api/dashboard/events` | SSE | `orch:read` | Real-time events |
---
## 10. Storage Model
### 10.1 Collections
| Collection | Purpose | Key Fields |
|------------|---------|------------|
| `jobs` | Current job state | `_id`, `tenant`, `jobType`, `status`, `priority` |
| `job_history` | Append-only audit | `jobId`, `event`, `timestamp`, `actor` |
| `sources` | Job sources registry | `sourceId`, `tenant`, `status` |
| `quotas` | Quota definitions | `tenant`, `jobType`, `limits` |
| `throttles` | Active throttles | `tenant`, `source`, `until` |
| `incidents` | Escalated failures | `jobId`, `reason`, `status` |
### 10.2 Indexes
- `{tenant, jobType, status}` on `jobs`
- `{tenant, status, startedAt}` on `jobs`
- `{jobId, timestamp}` on `job_history`
- TTL index on transient lease records
---
## 11. Observability
### 11.1 Metrics
- `job_queue_depth{jobType,tenant}`
- `job_latency_seconds{jobType,phase}`
- `job_failures_total{jobType,reason}`
- `job_retry_total{jobType}`
- `lease_extensions_total{jobType}`
- `quota_exceeded_total{tenant}`
- `circuit_breaker_state{jobType}`
### 11.2 Pack-Run Metrics
- `pack_run_logs_stream_lag_seconds`
- `pack_run_heartbeats_total`
- `pack_run_artifacts_total`
- `pack_run_duration_seconds`
---
## 12. Offline Support
### 12.1 Audit Bundle Export
```bash
stella orch export --tenant acme-corp --since 2025-11-01 --output audit-bundle.tar.gz
```
Bundle contents:
- `jobs.jsonl` - Job records
- `history.jsonl` - State transitions
- `throttles.jsonl` - Throttle events
- `manifest.json` - Bundle metadata
- `signatures/` - DSSE signatures
### 12.2 Replay Verification
```bash
# Verify job determinism
stella job verify --bundle audit-bundle.tar.gz --job-id job-12345
```
---
## 13. Related Documentation
| Resource | Location |
|----------|----------|
| Orchestrator architecture | `docs/modules/orchestrator/architecture.md` |
| Event envelope spec | `docs/modules/orchestrator/event-envelope.md` |
| TaskRunner integration | `docs/modules/taskrunner/orchestrator-bridge.md` |
---
## 14. Sprint Mapping
- **Primary Sprint:** SPRINT_0151_0001_0001_orchestrator_i.md
- **Related Sprints:**
- SPRINT_0152_0001_0002_orchestrator_ii.md
- SPRINT_0153_0001_0003_orchestrator_iii.md
- SPRINT_0157_0001_0001_taskrunner_i.md
**Key Task IDs:**
- `ORCH-CORE-30-001` - Job lifecycle (DONE)
- `ORCH-QUOTA-31-001` - Quota governance (DONE)
- `ORCH-PACK-37-001` - Pack-run bridge (IN PROGRESS)
- `ORCH-SVC-37-101` - Event envelope (IN PROGRESS)
- `ORCH-REPLAY-38-001` - Replay verification (TODO)
---
## 15. Success Metrics
| Metric | Target |
|--------|--------|
| Job scheduling latency | < 100ms p99 |
| Lease acquisition time | < 50ms p99 |
| Event fan-out delay | < 500ms |
| Quota enforcement accuracy | 100% |
| Replay determinism | 100% match |
---
*Last updated: 2025-11-29*