Orchestrator Event Model and Job Lifecycle

Version: 1.0
Date: 2025-11-29
Status: Canonical

This advisory defines the product rationale, job lifecycle semantics, and implementation strategy for the Orchestrator module. It covers the event model, quota governance, replay semantics, and the TaskRunner bridge.


1. Executive Summary

The Orchestrator is the central job coordination layer for all Stella Ops asynchronous operations. Key capabilities:

  • Unified Job Lifecycle - Enqueue, schedule, lease, complete with audit trail
  • Quota Governance - Per-tenant rate limits, burst controls, circuit breakers
  • Replay Semantics - Deterministic job replay for audit and recovery
  • TaskRunner Bridge - Pack-run integration with heartbeats and artifacts
  • Event Fan-Out - SSE/GraphQL feeds for dashboards and notifications
  • Offline Export - Audit bundles for compliance and investigations

2. Market Drivers

2.1 Target Segments

| Segment | Orchestration Requirements | Use Case |
| --- | --- | --- |
| Enterprise | Rate limiting, quota management | Multi-team resource sharing |
| MSP/MSSP | Multi-tenant isolation | Managed security services |
| Compliance Teams | Audit trails, replay | SOC 2, FedRAMP evidence |
| DevSecOps | CI/CD integration, webhooks | Pipeline automation |

2.2 Competitive Positioning

Most vulnerability platforms lack sophisticated job orchestration. Stella Ops differentiates with:

  • Deterministic replay for audit and debugging
  • Fine-grained quotas per tenant/job-type
  • Circuit breakers for automatic failure isolation
  • Native pack-run integration for workflow automation
  • Offline-compatible audit bundles

3. Job Lifecycle Model

3.1 State Machine

[Created] --> [Queued] --> [Leased] --> [Running] --> [Completed]
                 |            |             |              |
                 |            |             v              v
                 |            +-------> [Failed] <----[Canceled]
                 |                          |
                 v                          v
            [Throttled]              [Incident]

3.2 Lifecycle Phases

| Phase | Description | Transitions |
| --- | --- | --- |
| Created | Job request received | -> Queued |
| Queued | Awaiting scheduling | -> Leased, Throttled |
| Throttled | Rate limit applied | -> Queued (after delay) |
| Leased | Worker acquired job | -> Running, Expired |
| Running | Active execution | -> Completed, Failed, Canceled |
| Completed | Success, archived | Terminal |
| Failed | Error, may retry | -> Queued (retry), Incident |
| Canceled | Operator abort | Terminal |
| Incident | Escalated failure | Terminal (requires operator) |
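
The transitions listed in section 3.2 can be encoded as a lookup table. The following is a minimal sketch, not the Orchestrator's actual implementation: the names `TRANSITIONS`, `can_transition`, and `advance` are hypothetical, and `Expired` appears only as a transition target because section 3.2 does not describe it as a phase of its own.

```python
# Allowed transitions, transcribed from the lifecycle phases in section 3.2.
# "Expired" is listed only as a target of "Leased"; it has no outgoing
# transitions here because the advisory does not define any.
TRANSITIONS = {
    "Created": {"Queued"},
    "Queued": {"Leased", "Throttled"},
    "Throttled": {"Queued"},
    "Leased": {"Running", "Expired"},
    "Running": {"Completed", "Failed", "Canceled"},
    "Failed": {"Queued", "Incident"},
    "Completed": set(),   # terminal
    "Canceled": set(),    # terminal
    "Incident": set(),    # terminal (requires operator)
}


def can_transition(current, target):
    """Return True if the lifecycle table allows current -> target."""
    return target in TRANSITIONS.get(current, set())


def advance(current, target):
    """Return the new phase, or raise ValueError for an illegal move."""
    if not can_transition(current, target):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Keeping the table as data rather than branching logic makes it straightforward to reject illegal moves in one place and to record every accepted move in an audit trail.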

3.3 Job Request Structure

{
  "jobId": "uuid",
  "jobType": "scan|policy-run|export|pack-run|advisory-sync",
  "tenant": "tenant-id",
  "priority": "low|normal|high|emergency",
  "payloadDigest": "sha256:...",
  "payload": { "imageRef": "nginx:latest", "options": {} },
  "dependencies": ["job-id-1", "job-id-2"],
  "idempotencyKey": "unique-request-key",
  "correlationId": "trace-id",
  "requestedBy": "user-id|service-id",
  "requestedAt": "2025-11-29T12:00:00Z"
}
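
As an illustration of how `payloadDigest` might relate to `payload`, the sketch below hashes a canonical JSON encoding of the payload and checks the enumerated fields. The canonicalization rule (sorted keys, compact separators) is an assumption, since this advisory does not specify one, and the function names are hypothetical.

```python
import hashlib
import json

# Allowed values taken from the job request structure in section 3.3.
JOB_TYPES = {"scan", "policy-run", "export", "pack-run", "advisory-sync"}
PRIORITIES = {"low", "normal", "high", "emergency"}


def payload_digest(payload):
    """Hash a canonical JSON form of the payload.

    Assumption: sorted keys and compact separators serve as the canonical
    form; the real canonicalization rule is not stated in this advisory.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_job_request(request):
    """Return a list of problems found in a job request (empty if none)."""
    errors = []
    if request.get("jobType") not in JOB_TYPES:
        errors.append("unknown jobType")
    if request.get("priority") not in PRIORITIES:
        errors.append("unknown priority")
    if request.get("payloadDigest") != payload_digest(request.get("payload", {})):
        errors.append("payloadDigest mismatch")
    return errors
```

Hashing a canonical form means two requests with the same logical payload produce the same digest regardless of key order, which is what replay and idempotency checks would rely on.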

4. Quota Governance

4.1 Quota Model

quotas:
  - tenant: "acme-corp"
    jobType: "*"
    maxActive: 50
    maxPerHour: 500
    burst: 10
    priority:
      emergency:
        maxActive: 5
        skipQueue: true

  - tenant: "acme-corp"
    jobType: "export"
    maxActive: 4
    maxPerHour: 100

4.2 Rate Limit Enforcement

  1. Quota Check - Before leasing, verify that the tenant has not exceeded its limits
  2. Burst Control - Allow short bursts within the configured window
  3. Staging - Jobs exceeding limits are staged with a nextEligibleAt timestamp
  4. Priority Bypass - Emergency jobs can skip the queue (subject to separate limits)
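
The quota check and staging steps can be sketched as follows. This is a simplified illustration under stated assumptions: burst control and priority bypass are omitted, the hourly limit is treated as a sliding one-hour window, and the `recheck_after` delay used when `maxActive` is reached is a hypothetical parameter. `Quota` and `check_quota` are illustrative names, not Orchestrator APIs.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Quota:
    max_active: int      # corresponds to maxActive in the quota model
    max_per_hour: int    # corresponds to maxPerHour in the quota model


def check_quota(quota, active_count, recent_starts, now,
                recheck_after=timedelta(seconds=30)):
    """Return (allowed, next_eligible_at) for one lease attempt.

    recent_starts holds the start times of the tenant's recent jobs.
    Assumption: maxPerHour is enforced over a sliding one-hour window.
    """
    window_start = now - timedelta(hours=1)
    in_window = sorted(t for t in recent_starts if t > window_start)
    if len(in_window) >= quota.max_per_hour:
        # The job becomes eligible once enough old starts leave the window
        # for the count to drop below the hourly limit.
        freeing = in_window[len(in_window) - quota.max_per_hour]
        return False, freeing + timedelta(hours=1)
    if active_count >= quota.max_active:
        # No completion time is known, so stage the job for a later recheck
        # (recheck_after is a hypothetical delay).
        return False, now + recheck_after
    return True, None
```

A rejected job would be stored with the returned timestamp as its `nextEligibleAt`, so the scheduler can skip it cheaply until that time.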

4.3 Dynamic Controls

| Control | API | Purpose |
| --- | --- | --- |
| pauseSource | POST /api/limits/pause | Halt specific job sources |
| resumeSource | POST /api/limits/resume | Resume paused sources |
| throttle | POST /api/limits/throttle | Apply temporary throttle |
| updateQuota | PATCH /api/quotas/{id} | Modify quota limits |

4.4 Circuit Breakers

  • Auto-pause job types when failure rate > threshold (default 50%)
  • Incident events generated via Notify
  • Half-open testing after cooldown period
  • Manual reset via operator action
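
The breaker behaviour described in section 4.4 (trip above a failure-rate threshold, probe after a cooldown, reset manually) follows the usual closed / open / half-open pattern. The sketch below is one way to express it. The rolling window size, the cooldown value, and the class and method names are assumptions; only the 50% default threshold comes from this advisory.

```python
from collections import deque


class CircuitBreaker:
    """Per-job-type breaker: closed -> open -> half-open -> closed."""

    def __init__(self, threshold=0.5, window=20, cooldown=60.0):
        self.threshold = threshold            # failure rate that trips the breaker
        self.cooldown = cooldown              # seconds to wait before probing
        self.results = deque(maxlen=window)   # rolling window of outcomes
        self.state = "closed"
        self.opened_at = None

    def allow(self, now):
        """Return True if a job of this type may be leased at time now."""
        if self.state == "open":
            if now - self.opened_at >= self.cooldown:
                self.state = "half-open"      # admit probe traffic
                return True
            return False
        return True

    def record(self, success, now):
        """Record a job outcome and update the breaker state."""
        if self.state == "half-open":
            # The first result after the cooldown decides the next state.
            if success:
                self.reset()
            else:
                self._open(now)
            return
        self.results.append(success)
        full = len(self.results) == self.results.maxlen
        rate = self.results.count(False) / len(self.results)
        if full and rate > self.threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.results.clear()

    def reset(self):
        """Manual (operator) or automatic reset to the closed state."""
        self.state = "closed"
        self.opened_at = None
        self.results.clear()
```

In this simplified form the half-open state admits jobs until the first outcome is recorded; a production breaker would more likely limit probing to a single in-flight job.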

5. TaskRunner Bridge

5.1 Pack-Run Integration

The Orchestrator provides specialized support for TaskRunner pack executions:

{
  "jobType": "pack-run",
  "payload": {
    "packId": "vuln-scan-and-report",
    "packVersion": "1.2.0",
    "planHash": "sha256:...",
    "inputs": { "imageRef": "nginx:latest" },
    "artifacts": [],
    "logChannel": "sse:/runs/{runId}/logs",
    "heartbeatCadence": 30
  }
}

5.2 Heartbeat Protocol

  • Workers send heartbeats every heartbeatCadence seconds
  • Missed heartbeats trigger lease expiration
  • Lease can be extended for long-running tasks
  • Dead workers detected within 2x heartbeat interval
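
A lease tracker consistent with these rules might look like the sketch below. It reads "detected within 2x heartbeat interval" as "a lease is expired once more than twice the cadence has passed without a heartbeat", which is an assumption. The `Lease` class and `expired_leases` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    lease_id: str
    heartbeat_cadence: float   # seconds, as in payload.heartbeatCadence
    last_heartbeat: float      # timestamp of the most recent heartbeat

    def heartbeat(self, now):
        """Record a heartbeat, which extends the lease."""
        self.last_heartbeat = now

    def is_expired(self, now):
        # Assumption: expiry after 2x the cadence without a heartbeat.
        return now - self.last_heartbeat > 2 * self.heartbeat_cadence


def expired_leases(leases, now):
    """Return the ids of leases whose workers should be treated as dead."""
    return [lease.lease_id for lease in leases if lease.is_expired(now)]
```

With the `heartbeatCadence` of 30 from the pack-run example, a worker that stops sending heartbeats would be flagged once more than 60 seconds have elapsed since its last one.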

5.3 Artifact & Log Streaming

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /runs/{runId}/logs | SSE | Stream execution logs |
| /runs/{runId}/artifacts | GET | List produced artifacts |
| /runs/{runId}/artifacts/{name} | GET | Download artifact |
| /runs/{runId}/heartbeat | POST | Extend lease |

6. Event Model

6.1 Event Envelope

{
  "eventId": "uuid",
  "eventType": "job.queued|job.leased|job.started|job.progress|job.completed|job.failed|job.canceled|job.replayed",
  "timestamp": "2025-11-29T12:00:00Z",
  "tenant": "tenant-id",
  "jobId": "job-id",
  "jobType": "scan",
  "correlationId": "trace-id",
  "idempotencyKey": "unique-key",
  "payload": {
    "status": "completed",
    "duration": 45.2,
    "result": { "verdict": "pass" }
  },
  "provenance": {
    "workerId": "worker-1",
    "leaseId": "lease-id",
    "taskRunnerId": "runner-1"
  }
}

6.2 Event Types

| Event | Trigger | Consumers |
| --- | --- | --- |
| job.queued | Job enqueued | Dashboard, Notify |
| job.leased | Worker acquired job | Dashboard |
| job.started | Execution began | Dashboard, Notify |
| job.progress | Progress update | Dashboard (SSE) |
| job.completed | Success | Dashboard, Notify, Export |
| job.failed | Error occurred | Dashboard, Notify, Incident |
| job.canceled | Operator abort | Dashboard, Notify |
| job.replayed | Replay initiated | Dashboard, Audit |
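
The event-to-consumer mapping in section 6.2 can be expressed as a routing table. The sketch below is illustrative only: `fan_out`, the `handlers` registry, and de-duplication by `eventId` are assumptions rather than documented Orchestrator behaviour.

```python
# Consumers per event type, transcribed from section 6.2.
CONSUMERS = {
    "job.queued": ("Dashboard", "Notify"),
    "job.leased": ("Dashboard",),
    "job.started": ("Dashboard", "Notify"),
    "job.progress": ("Dashboard",),
    "job.completed": ("Dashboard", "Notify", "Export"),
    "job.failed": ("Dashboard", "Notify", "Incident"),
    "job.canceled": ("Dashboard", "Notify"),
    "job.replayed": ("Dashboard", "Audit"),
}


def fan_out(event, handlers, seen_event_ids):
    """Deliver an event envelope to each registered consumer once.

    handlers maps a consumer name to a callable; seen_event_ids is a set
    used to drop redelivered envelopes (an assumed de-duplication rule).
    Returns the names of the consumers that received the event.
    """
    if event["eventId"] in seen_event_ids:
        return []
    seen_event_ids.add(event["eventId"])
    delivered = []
    for name in CONSUMERS.get(event["eventType"], ()):
        handler = handlers.get(name)
        if handler is not None:
            handler(event)
            delivered.append(name)
    return delivered
```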

6.3 Fan-Out Channels

  • SSE - Real-time dashboard feeds
  • GraphQL Subscriptions - Console UI
  • Notify - Alert routing based on rules
  • Webhooks - External integrations
  • Audit Log - Compliance storage

7. Replay Semantics

7.1 Deterministic Replay

Jobs can be replayed for audit, debugging, or recovery:

# Replay a completed job
stella job replay --id job-12345

# Replay with sealed mode (offline verification)
stella job replay --id job-12345 --sealed --bundle output.tar.gz

7.2 Replay Guarantees

| Property | Guarantee |
| --- | --- |
| Input preservation | Same payloadDigest, cursors |
| Ordering | Same processing order |
| Determinism | Same outputs for same inputs |
| Provenance | replayOf pointer to original |

7.3 Replay Record

{
  "jobId": "replay-job-id",
  "replayOf": "original-job-id",
  "priority": "high",
  "reason": "audit-verification",
  "requestedBy": "auditor@example.com",
  "cursors": {
    "advisory": "cursor-abc",
    "vex": "cursor-def"
  }
}
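
To illustrate how a replay record relates to the original job, the sketch below builds one and compares outputs by digest. It assumes the original job record carries the `cursors` it ran with and that outputs are JSON-serializable. `build_replay_record` and `outputs_match` are hypothetical helpers, not part of the `stella` CLI or API.

```python
import hashlib
import json


def build_replay_record(original, new_job_id, reason, requested_by,
                        priority="high"):
    """Build a replay record that points back at the original job."""
    return {
        "jobId": new_job_id,
        "replayOf": original["jobId"],
        "priority": priority,
        "reason": reason,
        "requestedBy": requested_by,
        # Reuse the original cursors so the replay sees the same inputs
        # (assumes the original record stored them).
        "cursors": dict(original.get("cursors", {})),
    }


def outputs_match(original_output, replay_output):
    """Compare two outputs by the digest of their canonical JSON form."""
    def digest(value):
        canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest(original_output) == digest(replay_output)
```

Comparing digests rather than raw objects keeps the check independent of key ordering and suits offline verification, where only hashes may be available.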

8. Implementation Strategy

8.1 Phase 1: Core Lifecycle (Complete)

  • Job state machine
  • MongoDB queue with leasing
  • Basic quota enforcement
  • Dashboard SSE feeds

8.2 Phase 2: Pack-Run Bridge (In Progress)

  • Pack-run job type registration
  • Log/artifact streaming
  • Heartbeat protocol (ORCH-PACK-37-001)
  • Event envelope finalization (ORCH-SVC-37-101)

8.3 Phase 3: Advanced Controls (Planned)

  • Circuit breaker automation
  • Quota analytics dashboard
  • Replay verification tooling
  • Incident mode integration

9. API Surface

9.1 Job Management

| Endpoint | Method | Scope | Description |
| --- | --- | --- | --- |
| /api/jobs | GET | orch:read | List jobs with filters |
| /api/jobs/{id} | GET | orch:read | Job detail |
| /api/jobs/{id}/cancel | POST | orch:operate | Cancel job |
| /api/jobs/{id}/replay | POST | orch:operate | Schedule replay |

9.2 Quota Management

| Endpoint | Method | Scope | Description |
| --- | --- | --- | --- |
| /api/quotas | GET | orch:read | List quotas |
| /api/quotas/{id} | PATCH | orch:quota | Update quota |
| /api/limits/throttle | POST | orch:quota | Apply throttle |
| /api/limits/pause | POST | orch:quota | Pause source |
| /api/limits/resume | POST | orch:quota | Resume source |

9.3 Dashboard

| Endpoint | Method | Scope | Description |
| --- | --- | --- | --- |
| /api/dashboard/metrics | GET | orch:read | Aggregated metrics |
| /api/dashboard/events | SSE | orch:read | Real-time events |

10. Storage Model

10.1 Collections

| Collection | Purpose | Key Fields |
| --- | --- | --- |
| jobs | Current job state | _id, tenant, jobType, status, priority |
| job_history | Append-only audit | jobId, event, timestamp, actor |
| sources | Job sources registry | sourceId, tenant, status |
| quotas | Quota definitions | tenant, jobType, limits |
| throttles | Active throttles | tenant, source, until |
| incidents | Escalated failures | jobId, reason, status |

10.2 Indexes

  • {tenant, jobType, status} on jobs
  • {tenant, status, startedAt} on jobs
  • {jobId, timestamp} on job_history
  • TTL index on transient lease records

11. Observability

11.1 Metrics

  • job_queue_depth{jobType,tenant}
  • job_latency_seconds{jobType,phase}
  • job_failures_total{jobType,reason}
  • job_retry_total{jobType}
  • lease_extensions_total{jobType}
  • quota_exceeded_total{tenant}
  • circuit_breaker_state{jobType}

11.2 Pack-Run Metrics

  • pack_run_logs_stream_lag_seconds
  • pack_run_heartbeats_total
  • pack_run_artifacts_total
  • pack_run_duration_seconds

12. Offline Support

12.1 Audit Bundle Export

stella orch export --tenant acme-corp --since 2025-11-01 --output audit-bundle.tar.gz

Bundle contents:

  • jobs.jsonl - Job records
  • history.jsonl - State transitions
  • throttles.jsonl - Throttle events
  • manifest.json - Bundle metadata
  • signatures/ - DSSE signatures
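
A consumer of such a bundle could read the JSON Lines members directly with the standard library. The sketch below assumes the members sit at the archive root under the names listed above, and that `history.jsonl` records carry a `jobId` field like the `job_history` collection. Neither detail is confirmed by this advisory, and signature verification is not shown.

```python
import json
import tarfile


def read_jsonl(bundle_path, member):
    """Return the parsed records of one .jsonl member of the bundle."""
    with tarfile.open(bundle_path, "r:gz") as tar:
        handle = tar.extractfile(member)   # raises KeyError if missing
        if handle is None:
            raise ValueError(f"{member} is not a regular file")
        text = handle.read().decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def transitions_for(bundle_path, job_id):
    """Return the history records for one job, in file order."""
    return [record for record in read_jsonl(bundle_path, "history.jsonl")
            if record.get("jobId") == job_id]
```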

12.2 Replay Verification

# Verify job determinism
stella job verify --bundle audit-bundle.tar.gz --job-id job-12345

13. Related Documentation

| Resource | Location |
| --- | --- |
| Orchestrator architecture | docs/modules/orchestrator/architecture.md |
| Event envelope spec | docs/modules/orchestrator/event-envelope.md |
| TaskRunner integration | docs/modules/taskrunner/orchestrator-bridge.md |

14. Sprint Mapping

  • Primary Sprint: SPRINT_0151_0001_0001_orchestrator_i.md
  • Related Sprints:
    • SPRINT_0152_0001_0002_orchestrator_ii.md
    • SPRINT_0153_0001_0003_orchestrator_iii.md
    • SPRINT_0157_0001_0001_taskrunner_i.md

Key Task IDs:

  • ORCH-CORE-30-001 - Job lifecycle (DONE)
  • ORCH-QUOTA-31-001 - Quota governance (DONE)
  • ORCH-PACK-37-001 - Pack-run bridge (IN PROGRESS)
  • ORCH-SVC-37-101 - Event envelope (IN PROGRESS)
  • ORCH-REPLAY-38-001 - Replay verification (TODO)

15. Success Metrics

| Metric | Target |
| --- | --- |
| Job scheduling latency | < 100ms p99 |
| Lease acquisition time | < 50ms p99 |
| Event fan-out delay | < 500ms |
| Quota enforcement accuracy | 100% |
| Replay determinism | 100% match |

Last updated: 2025-11-29