semi implemented and features implemented save checkpoint

This commit is contained in:
master
2026-02-08 18:00:49 +02:00
parent 04360dff63
commit 1bf6bbf395
20895 changed files with 716795 additions and 64 deletions


@@ -0,0 +1,35 @@
# DAG Planner with Critical-Path Metadata
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
DAG-based job planner that computes critical-path metadata for orchestrator execution plans, enabling dependency-aware scheduling and parallel execution of independent job chains.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scheduling/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/`
- **Key Classes**:
- `DagPlanner` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scheduling/DagPlanner.cs`) - computes execution DAGs from job dependency graphs, identifies critical path, and enables parallel scheduling of independent chains
- `DagEdge` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/DagEdge.cs`) - edge model representing dependencies between jobs in the execution DAG
- `JobScheduler` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scheduling/JobScheduler.cs`) - schedules jobs based on DAG planner output, respecting dependency ordering
- `JobStateMachine` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scheduling/JobStateMachine.cs`) - state machine governing job lifecycle transitions within the DAG execution
- `Job` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Job.cs`) - job entity with status, dependencies, and scheduling metadata
- `JobStatus` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/JobStatus.cs`) - enum defining job lifecycle states
- `JobHistory` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/JobHistory.cs`) - historical record of job state transitions
- `DagEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/DagEndpoints.cs`) - REST API for querying DAG execution plans
- `DagContracts` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Contracts/DagContracts.cs`) - API contracts for DAG responses
- **Interfaces**: `IDagEdgeRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IDagEdgeRepository.cs`)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Create a DAG with 5 jobs (A->B->C, A->D->E) and verify `DagPlanner` identifies A as the root and C/E as leaves
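- [ ] Verify critical path computation: the longest dependency chain is marked as the critical path (A->B->C and A->D->E have equal hop counts, so assign distinct estimated durations to make the expected result unambiguous)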
- [ ] Schedule the DAG via `JobScheduler` and verify B and D execute in parallel after A completes
- [ ] Add a new dependency (D->C) creating a diamond DAG and verify the critical path updates
- [ ] Query the DAG via `DagEndpoints` and verify the response includes all edges, critical path markers, and parallel groups
- [ ] Create a cyclic dependency graph (A->B->A) and verify `DagPlanner` rejects it with a cycle detection error
- [ ] Verify DAG metadata: each job node in the `DagContracts` response includes estimated duration and dependency count
- [ ] Schedule a DAG with one failed job and verify `JobStateMachine` marks downstream dependencies as blocked
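
The planning scenario in the test plan above can be sketched in a few lines. This is a minimal illustration in Python, not the `DagPlanner` implementation: the function `plan`, the `CycleError` type, and the use of per-job estimated durations as path weights are all assumptions made for the example. It uses Kahn's algorithm for ordering and cycle detection, then a longest-path pass over the topological order.

```python
from collections import defaultdict, deque


class CycleError(ValueError):
    """Raised when the dependency graph contains a cycle (hypothetical type)."""


def plan(durations, edges):
    """Return (topological order, critical path, critical-path length).

    durations: dict of job id -> estimated duration (non-negative number).
    edges: iterable of (upstream, downstream) dependency pairs.
    """
    children = defaultdict(list)
    indegree = {job: 0 for job in durations}
    for up, down in edges:
        children[up].append(down)
        indegree[down] += 1

    # Kahn's algorithm: emit jobs whose dependencies have all been emitted.
    ready = deque(sorted(job for job, deg in indegree.items() if deg == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in children[job]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(durations):
        raise CycleError("dependency graph contains a cycle")
    if not order:
        return [], [], 0

    # Longest path: finish[j] is the largest total duration of a chain ending at j.
    finish = dict(durations)
    best_parent = {}
    for job in order:
        for child in children[job]:
            candidate = finish[job] + durations[child]
            if candidate > finish[child]:
                finish[child] = candidate
                best_parent[child] = job

    # Walk back from the job that finishes last to recover the critical path.
    end = max(order, key=lambda job: finish[job])
    path = [end]
    while path[-1] in best_parent:
        path.append(best_parent[path[-1]])
    path.reverse()
    return order, path, finish[end]
```

With durations A=1, B=2, C=3, D=1, E=1 and edges A->B, B->C, A->D, D->E, the sketch yields A->B->C as the critical path. B and D become ready at the same time once A is emitted, which is the point at which a scheduler could run them in parallel.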


@@ -0,0 +1,35 @@
# Event Fan-Out (SSE/Streaming)
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
Job and pack-run streaming coordinators with stream payload models for real-time SSE event delivery.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/`
- **Key Classes**:
- `JobStreamCoordinator` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/JobStreamCoordinator.cs`) - coordinates SSE streaming for job lifecycle events to connected clients
- `PackRunStreamCoordinator` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/PackRunStreamCoordinator.cs`) - coordinates streaming for pack-run execution events
- `RunStreamCoordinator` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/RunStreamCoordinator.cs`) - coordinates streaming for individual run events
- `SseWriter` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/SseWriter.cs`) - writes Server-Sent Events to HTTP response streams
- `StreamOptions` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/StreamOptions.cs`) - configuration for stream connections (heartbeat interval, buffer size, timeout)
- `StreamPayloads` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/StreamPayloads.cs`) - typed payload models for stream events (job progress, pack-run status, log lines)
- `StreamEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/StreamEndpoints.cs`) - REST endpoints for SSE stream subscription
- `EventEnvelope` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/EventEnvelope.cs`) - typed event envelope wrapping domain events for streaming
- `OrchestratorEventPublisher` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Events/OrchestratorEventPublisher.cs`) - concrete event publisher routing events to stream coordinators
- **Interfaces**: `IEventPublisher` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/IEventPublisher.cs`)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Subscribe to the job stream via `StreamEndpoints` and trigger a job; verify SSE events are received for each state transition
- [ ] Subscribe to the pack-run stream via `PackRunStreamCoordinator` and execute a pack; verify progress events include step index, status, and log lines
- [ ] Verify heartbeat: subscribe to a stream and wait without events; confirm heartbeat events arrive at the `StreamOptions` configured interval
- [ ] Subscribe with two clients to the same job stream and verify both receive identical events (fan-out via `JobStreamCoordinator`)
- [ ] Disconnect a client mid-stream and verify the stream coordinator cleans up the connection without affecting other subscribers
- [ ] Trigger a rapid sequence of events and verify `SseWriter` delivers them in order without drops
- [ ] Verify stream payloads: each event contains a typed payload matching the `StreamPayloads` model
- [ ] Test stream timeout: idle for longer than `StreamOptions.Timeout` and verify the connection closes gracefully
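
The fan-out and framing behaviour exercised by the test plan above can be illustrated with a short sketch. It is written in Python for brevity and is not the `SseWriter` or `JobStreamCoordinator` code: `format_sse`, `HEARTBEAT`, and the `FanOut` class are hypothetical names, and dropping a subscriber whose buffer is full is an assumed policy. The frame layout itself (`id:`, `event:`, `data:` fields, a blank line as terminator, lines starting with `:` as comments) follows the Server-Sent Events format.

```python
import queue


def format_sse(event, data, event_id=None):
    """Serialize one Server-Sent Events frame; a blank line terminates it."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # A multi-line payload needs one "data:" field per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


# Comment frame used as a keep-alive; SSE clients ignore lines starting with ":".
HEARTBEAT = ": heartbeat\n\n"


class FanOut:
    """Hypothetical coordinator holding one bounded queue per subscriber."""

    def __init__(self, buffer_size=64):
        self._buffer_size = buffer_size
        self._subscribers = {}
        self._next_id = 0

    def subscribe(self):
        self._next_id += 1
        self._subscribers[self._next_id] = queue.Queue(self._buffer_size)
        return self._next_id

    def unsubscribe(self, sub_id):
        # Removing one subscriber leaves every other queue untouched.
        self._subscribers.pop(sub_id, None)

    def publish(self, frame):
        for sub_id, q in list(self._subscribers.items()):
            try:
                q.put_nowait(frame)
            except queue.Full:
                # Assumed policy: drop a slow consumer rather than block the rest.
                self.unsubscribe(sub_id)

    def drain(self, sub_id):
        """Return all frames queued for a subscriber, in publish order."""
        q = self._subscribers[sub_id]
        frames = []
        while not q.empty():
            frames.append(q.get_nowait())
        return frames
```

Giving each subscriber its own queue keeps delivery order per client identical to publish order and makes disconnect cleanup a single dictionary removal.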


@@ -0,0 +1,33 @@
# Export Job Service
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
Export job management with service and domain model for orchestrated export operations.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Services/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Export/`
- **Key Classes**:
- `ExportJobService` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Services/ExportJobService.cs`) - manages export job lifecycle: creation, scheduling, execution tracking, and completion
- `ExportJob` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Export/ExportJob.cs`) - export job entity with status, target, format, and schedule
- `ExportJobPolicy` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Export/ExportJobPolicy.cs`) - policy controlling export permissions and constraints
- `ExportJobTypes` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Export/ExportJobTypes.cs`) - enumeration of supported export types (evidence pack, audit report, snapshot)
- `ExportSchedule` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Export/ExportSchedule.cs`) - scheduling configuration for recurring exports
- `LedgerExporter` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Ledger/LedgerExporter.cs`) - exports audit ledger data for compliance and audit
- `ExportJobEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/ExportJobEndpoints.cs`) - REST API for creating, querying, and managing export jobs
- **Interfaces**: `ILedgerExporter` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Ledger/ILedgerExporter.cs`)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Create an export job via `ExportJobEndpoints` with type=evidence_pack and verify it is persisted with status=Pending
- [ ] Execute the export job via `ExportJobService` and verify status transitions: Pending -> Running -> Completed
- [ ] Verify export policy enforcement: create an export job with a restricted type and verify `ExportJobPolicy` rejects it
- [ ] Schedule a recurring export via `ExportSchedule` and verify the next execution is computed correctly
- [ ] Export audit ledger data via `LedgerExporter` and verify the output contains all entries within the specified time range
- [ ] Create an export job with retention policy and verify completed exports are cleaned up after expiry
- [ ] Query export jobs via `ExportJobEndpoints` with status filter and verify pagination works correctly
- [ ] Test export failure: simulate an export error and verify the job transitions to Failed with error details


@@ -0,0 +1,37 @@
# Job Lifecycle State Machine
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
Job scheduling with Postgres-backed job repository, event envelope domain model, and air-gap compatible scheduling tests.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scheduling/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/`
- **Key Classes**:
- `JobStateMachine` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scheduling/JobStateMachine.cs`) - finite state machine governing job lifecycle transitions (Pending -> Scheduled -> Running -> Completed/Failed/Cancelled)
- `JobScheduler` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scheduling/JobScheduler.cs`) - schedules jobs based on state machine rules and DAG dependencies
- `RetryPolicy` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scheduling/RetryPolicy.cs`) - configurable retry policy for failed jobs (max retries, backoff strategy)
- `Job` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Job.cs`) - job entity with current status, attempts, and metadata
- `JobStatus` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/JobStatus.cs`) - enum defining all valid job states
- `JobHistory` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/JobHistory.cs`) - historical record of all state transitions with timestamps
- `EventEnvelope` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/EventEnvelope.cs`) - typed event envelope emitted on state transitions
- `TimelineEvent` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/TimelineEvent.cs`) - timeline event for job lifecycle tracking
- `TimelineEventEmitter` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/TimelineEventEmitter.cs`) - emits timeline events on state transitions
- `JobEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/JobEndpoints.cs`) - REST API for job management
- `JobContracts` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Contracts/JobContracts.cs`) - API contracts for job operations
- **Interfaces**: `IJobRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IJobRepository.cs`), `IJobHistoryRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IJobHistoryRepository.cs`)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Create a job via `JobEndpoints` and verify initial state is Pending
- [ ] Schedule the job via `JobScheduler` and verify state transition: Pending -> Scheduled, with `TimelineEvent` emitted
- [ ] Start the job and verify `JobStateMachine` transition: Scheduled -> Running
- [ ] Complete the job and verify transition: Running -> Completed with completion timestamp in `JobHistory`
- [ ] Fail the job and verify transition: Running -> Failed with retry attempt incremented
- [ ] Verify `RetryPolicy`: fail a job with max_retries=3 and verify it re-enters Scheduled up to 3 times before terminal failure
- [ ] Attempt an invalid transition (e.g., Completed -> Running) and verify `JobStateMachine` rejects it
- [ ] Verify air-gap scheduling: schedule a job in sealed mode and verify it does not attempt network egress
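
The lifecycle described above maps naturally onto a transition table. The following is a minimal Python sketch under stated assumptions, not the `JobStateMachine` source: the `ALLOWED` table encodes only the transitions named in this document (plus Failed -> Scheduled for retries, taken from the `RetryPolicy` test item), and `Job`, `transition`, `fail`, and `InvalidTransition` are hypothetical names.

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "Pending"
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Allowed transitions; Failed -> Scheduled models a retry (assumption).
ALLOWED = {
    JobStatus.PENDING: {JobStatus.SCHEDULED},
    JobStatus.SCHEDULED: {JobStatus.RUNNING},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED},
    JobStatus.FAILED: {JobStatus.SCHEDULED},
    JobStatus.COMPLETED: set(),
    JobStatus.CANCELLED: set(),
}


class InvalidTransition(Exception):
    """Raised when a transition is not in the ALLOWED table."""


class Job:
    def __init__(self):
        self.status = JobStatus.PENDING
        self.attempts = 0   # number of failures recorded so far
        self.history = []   # (from_status, to_status) pairs, oldest first


def transition(job, target):
    """Move the job to target, recording the change, or raise InvalidTransition."""
    if target not in ALLOWED[job.status]:
        raise InvalidTransition(f"{job.status.value} -> {target.value}")
    job.history.append((job.status, target))
    job.status = target


def fail(job, max_retries):
    """Mark a running job failed; reschedule it while retries remain.

    Returns True when the job re-entered Scheduled, False on terminal failure.
    """
    transition(job, JobStatus.FAILED)
    job.attempts += 1
    if job.attempts <= max_retries:
        transition(job, JobStatus.SCHEDULED)
        return True
    return False
```

With `max_retries=3`, a job that keeps failing re-enters Scheduled three times and stays Failed on the fourth failure. Keeping the retry decision in `fail` rather than in the table separates what the state machine permits from what the retry policy chooses.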


@@ -0,0 +1,36 @@
# Network Intent Validator (Air-Gap Orchestrator Controls)
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
`NetworkIntentValidator` enforces air-gap network policies on orchestrator jobs, preventing egress in sealed mode. Also includes `MirrorJobTypes` and `MirrorOperationRecorder` for offline mirror operations.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/AirGap/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/AirGap/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Mirror/`
- **Key Classes**:
- `NetworkIntentValidator` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/AirGap/NetworkIntentValidator.cs`) - validates job network intent against air-gap policy, blocking egress requests in sealed mode
- `StalenessValidator` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/AirGap/StalenessValidator.cs`) - validates data freshness in air-gapped environments, ensuring cached data is within acceptable staleness bounds
- `NetworkIntent` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/AirGap/NetworkIntent.cs`) - declares the network intent of a job (egress, ingress, local-only)
- `SealingStatus` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/AirGap/SealingStatus.cs`) - enum for air-gap sealing state (Sealed, Unsealed, Transitioning)
- `StalenessConfig` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/AirGap/StalenessConfig.cs`) - configuration for acceptable data staleness in air-gap mode
- `StalenessValidationResult` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/AirGap/StalenessValidationResult.cs`) - result of staleness validation
- `BundleProvenance` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/AirGap/BundleProvenance.cs`) - provenance tracking for air-gap bundles
- `MirrorBundle` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Mirror/MirrorBundle.cs`) - bundle model for offline mirror operations
- `MirrorJobTypes` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Mirror/MirrorJobTypes.cs`) - types of mirror jobs (sync, verify, prune)
- `MirrorOperationRecorder` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Mirror/MirrorOperationRecorder.cs`) - records mirror operations for audit trail
- **Interfaces**: None (uses concrete implementations)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Set `SealingStatus` to Sealed and submit a job with egress intent; verify `NetworkIntentValidator` rejects it
- [ ] Set `SealingStatus` to Unsealed and submit a job with egress intent; verify it is allowed
- [ ] Validate staleness: set `StalenessConfig` max staleness to 24 hours and verify data older than 24 hours is rejected by `StalenessValidator`
- [ ] Create a mirror job with type=sync and verify `MirrorOperationRecorder` records the operation
- [ ] Verify bundle provenance: create a `MirrorBundle` and verify `BundleProvenance` captures origin, sync timestamp, and hash
- [ ] Transition sealing status from Unsealed to Sealed and verify in-flight egress jobs are blocked
- [ ] Submit a local-only `NetworkIntent` job in sealed mode and verify it is allowed
- [ ] Verify staleness config: set different staleness thresholds per data type in `StalenessConfig` and verify per-type enforcement
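
The two checks exercised above (intent versus sealing state, and data age versus a staleness bound) can be sketched as pure functions. This Python example is illustrative only and does not reflect the actual `NetworkIntentValidator` or `StalenessValidator` code: `is_allowed`, `is_fresh`, and the per-type threshold dictionary are hypothetical, and treating `Transitioning` the same as `Sealed` is a conservative assumption, as is allowing ingress, which this document does not specify.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class SealingStatus(Enum):
    SEALED = "Sealed"
    UNSEALED = "Unsealed"
    TRANSITIONING = "Transitioning"


class NetworkIntent(Enum):
    EGRESS = "egress"
    INGRESS = "ingress"
    LOCAL_ONLY = "local-only"


def is_allowed(sealing, intent):
    """Block egress unless the environment is explicitly unsealed.

    Assumption: a Transitioning environment is treated like a sealed one.
    """
    if intent is NetworkIntent.EGRESS:
        return sealing is SealingStatus.UNSEALED
    return True


def is_fresh(last_synced, now, max_staleness):
    """Return True when the data age does not exceed the allowed staleness."""
    return now - last_synced <= max_staleness


# Hypothetical per-data-type thresholds, as a StalenessConfig might hold them.
THRESHOLDS = {
    "advisories": timedelta(hours=24),
    "policies": timedelta(days=7),
}


def is_fresh_for(data_type, last_synced, now):
    """Apply the threshold configured for one data type."""
    return is_fresh(last_synced, now, THRESHOLDS[data_type])
```

Passing `now` explicitly instead of reading the clock inside the function keeps the check deterministic, which matters for tests that must run without network or time services.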


@@ -0,0 +1,35 @@
# Orchestrator Admin Quota Controls (orch:quota, orch:backfill)
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
New `orch:quota` and `orch:backfill` scopes with a mandatory reason field and an optional ticket field. Token requests must include `quota_reason`/`backfill_reason` and may include `quota_ticket`/`backfill_ticket`. Authority persists these as claims and audit properties so that capacity-affecting operations are traceable.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Backfill/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/`
- **Key Classes**:
- `Quota` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Quota.cs`) - quota entity with limits, current usage, and allocation metadata
- `BackfillRequest` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/BackfillRequest.cs`) - backfill request model with reason, ticket, and scope
- `BackfillManager` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Backfill/BackfillManager.cs`) - manages backfill operations with duplicate suppression and event time window tracking
- `DuplicateSuppressor` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Backfill/DuplicateSuppressor.cs`) - prevents duplicate backfill requests within a time window
- `EventTimeWindow` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Backfill/EventTimeWindow.cs`) - time window for backfill event deduplication
- `QuotaEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/QuotaEndpoints.cs`) - REST API for quota management (view, adjust, allocate)
- `QuotaContracts` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Contracts/QuotaContracts.cs`) - API contracts for quota operations
- `AuditEntry` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/AuditEntry.cs`) - audit entry capturing quota/backfill actions with reason and ticket
- `TenantResolver` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Services/TenantResolver.cs`) - resolves tenant context for quota scoping
- **Interfaces**: `IQuotaRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IQuotaRepository.cs`), `IBackfillRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IBackfillRepository.cs`)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Request a quota adjustment via `QuotaEndpoints` with `quota_reason` and `quota_ticket`; verify the adjustment is applied and audited in `AuditEntry`
- [ ] Attempt a quota adjustment without `quota_reason` and verify it is rejected with a 400 error
- [ ] Request a backfill via `BackfillManager` with `backfill_reason` and verify the backfill is initiated
- [ ] Submit a duplicate backfill request within the `EventTimeWindow` and verify `DuplicateSuppressor` rejects it
- [ ] Verify audit trail: check the `AuditEntry` for the quota adjustment and confirm reason and ticket are captured
- [ ] Query current quota usage via `QuotaEndpoints` and verify limits and current usage are returned
- [ ] Adjust quota beyond the maximum limit and verify the operation is rejected by policy
- [ ] Verify tenant scoping via `TenantResolver`: adjust quota for tenant A and verify tenant B's quota is unchanged
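
Two of the behaviours tested above, required-reason validation and duplicate suppression inside a time window, are small enough to sketch directly. The Python below is an illustration under assumptions rather than the `DuplicateSuppressor` or Authority code: `missing_fields`, the request-key format, and the window measured in seconds are all hypothetical.

```python
def missing_fields(params, prefix):
    """Return the required fields absent from a token request.

    Only `<prefix>_reason` is required; `<prefix>_ticket` is optional.
    """
    reason_key = f"{prefix}_reason"
    value = params.get(reason_key)
    if not isinstance(value, str) or not value.strip():
        return [reason_key]
    return []


class DuplicateSuppressor:
    """Reject a request key seen again within `window_seconds` (sketch)."""

    def __init__(self, window_seconds):
        self._window = window_seconds
        self._last_accepted = {}

    def accept(self, key, now):
        """Return True and remember the key unless it is a recent duplicate."""
        last = self._last_accepted.get(key)
        if last is not None and now - last < self._window:
            return False
        self._last_accepted[key] = now
        return True
```

Including the tenant in the key (for example `"tenant-a:feed-1"`) keeps suppression tenant-scoped, so an identical request from another tenant is not treated as a duplicate.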


@@ -0,0 +1,39 @@
# Orchestrator Audit Ledger
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
Append-only audit ledger tracking all orchestrator job lifecycle state changes, rate-limit decisions, and dead-letter events with tenant-scoped isolation.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/DeadLetter/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Ledger/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/`
- **Key Classes**:
- `AuditEntry` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/AuditEntry.cs`) - audit entry model with action type, actor, tenant, timestamp, and metadata
- `RunLedger` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/RunLedger.cs`) - run-level ledger tracking execution history
- `SignedManifest` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/SignedManifest.cs`) - signed manifest for tamper-evident ledger export
- `LedgerExporter` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Ledger/LedgerExporter.cs`) - exports ledger data for compliance and audit
- `AuditEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/AuditEndpoints.cs`) - REST API for querying audit ledger entries
- `LedgerEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/LedgerEndpoints.cs`) - REST API for ledger export and querying
- `AuditLedgerContracts` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Contracts/AuditLedgerContracts.cs`) - API contracts for audit responses
- `DeadLetterEntry` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/DeadLetterEntry.cs`) - dead-letter entry in the audit trail
- `DeadLetterNotifier` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/DeadLetter/DeadLetterNotifier.cs`) - notifies on dead-letter events
- `ErrorClassification` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/DeadLetter/ErrorClassification.cs`) - classifies errors for dead-letter categorization
- `ReplayManager` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/DeadLetter/ReplayManager.cs`) - manages replay of dead-letter entries
- `DeadLetterEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/DeadLetterEndpoints.cs`) - REST API for dead-letter management
- `TenantResolver` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Services/TenantResolver.cs`) - ensures tenant-scoped audit isolation
- **Interfaces**: `ILedgerExporter` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Ledger/ILedgerExporter.cs`), `IAuditRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IAuditRepository.cs`), `IDeadLetterRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/DeadLetter/IDeadLetterRepository.cs`), `ILedgerRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/ILedgerRepository.cs`)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Trigger a job state transition and verify an `AuditEntry` is created in the ledger with action type, actor, and timestamp
- [ ] Query the audit ledger via `AuditEndpoints` with a time range filter and verify only matching entries are returned
- [ ] Verify tenant isolation via `TenantResolver`: create audit entries for two tenants and verify each tenant only sees their own entries
- [ ] Trigger a dead-letter event and verify it appears in both the `DeadLetterEntry` store and the audit ledger
- [ ] Export the audit ledger via `LedgerExporter` and verify the export contains all entries within the specified range
- [ ] Replay a dead-letter entry via `ReplayManager` and verify the replay action is also audited
- [ ] Verify `ErrorClassification` categorizes different error types correctly (transient, permanent, unknown)
- [ ] Query dead-letter entries via `DeadLetterEndpoints` and verify pagination and filtering work
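
As a rough picture of the append-only, tenant-scoped behaviour tested above, the sketch below keeps entries in memory and filters every query by tenant and time range. It is a Python illustration only: this `AuditEntry` dataclass is a simplified stand-in for the real model, and `AuditLedger` with its `append` and `query` methods is hypothetical (the actual storage sits behind `IAuditRepository`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEntry:
    """Simplified, immutable stand-in for the audit entry model."""
    tenant: str
    actor: str
    action: str
    timestamp: float  # seconds since epoch, for illustration


class AuditLedger:
    """In-memory append-only ledger: entries are added, never changed."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)

    def query(self, tenant, start, end):
        """Return one tenant's entries with start <= timestamp <= end."""
        return [
            e for e in self._entries
            if e.tenant == tenant and start <= e.timestamp <= end
        ]
```

Making the tenant a mandatory argument of `query` means there is no code path that returns entries across tenants by accident.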


@@ -0,0 +1,40 @@
# Orchestrator Event Envelopes with SSE/WebSocket Streaming
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
Typed event envelope system with SSE and WebSocket streaming for real-time orchestrator job progress, enabling live UI updates and CLI monitoring of pack-run execution.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Hashing/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/`
- **Key Classes**:
- `EventEnvelope` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/EventEnvelope.cs`) - typed event envelope with event type, payload, timestamp, and correlation ID
- `EventEnvelope` (legacy) (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/EventEnvelope.cs`) - legacy event envelope model
- `TimelineEvent` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/TimelineEvent.cs`) - timeline event for job lifecycle tracking
- `TimelineEventEmitter` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/TimelineEventEmitter.cs`) - emits timeline events on domain actions
- `OrchestratorEventPublisher` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Events/OrchestratorEventPublisher.cs`) - concrete publisher routing events to stream coordinators
- `EventEnvelopeHasher` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Hashing/EventEnvelopeHasher.cs`) - hashes event envelopes for integrity verification
- `CanonicalJsonHasher` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Hashing/CanonicalJsonHasher.cs`) - canonical JSON hashing for deterministic event hashes
- `SseWriter` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/SseWriter.cs`) - Server-Sent Events writer
- `JobStreamCoordinator` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/JobStreamCoordinator.cs`) - job event stream coordinator
- `PackRunStreamCoordinator` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/PackRunStreamCoordinator.cs`) - pack-run stream coordinator
- `RunStreamCoordinator` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/RunStreamCoordinator.cs`) - run-level stream coordinator
- `StreamEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/StreamEndpoints.cs`) - REST endpoints for SSE subscriptions
- `StreamOptions` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/StreamOptions.cs`) - stream configuration
- `StreamPayloads` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/StreamPayloads.cs`) - typed event payloads
- **Interfaces**: `IEventPublisher` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Events/IEventPublisher.cs`)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Create an `EventEnvelope` with type=job_completed and payload; verify it is hashed via `EventEnvelopeHasher` and the hash is deterministic
- [ ] Publish an event via `OrchestratorEventPublisher` and verify it reaches the `JobStreamCoordinator`
- [ ] Subscribe to SSE via `StreamEndpoints` and verify events arrive as correctly formatted SSE messages (`data:` lines terminated by a blank line)
- [ ] Verify canonical hashing: create two identical events and verify `CanonicalJsonHasher` produces identical hashes
- [ ] Subscribe to pack-run stream via `PackRunStreamCoordinator` and execute a pack; verify real-time progress events include step index and status
- [ ] Verify `StreamOptions`: configure heartbeat interval and verify heartbeats arrive at the configured cadence
- [ ] Publish 100 events rapidly and verify `SseWriter` delivers all of them in order
- [ ] Verify event envelope correlation: publish events with the same correlation ID and verify they can be filtered by correlation
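
Deterministic hashing depends on serializing the envelope the same way every time. The Python sketch below shows one common approach, sorted keys plus fixed separators followed by SHA-256. It is an assumption about the general technique, not the algorithm used by `CanonicalJsonHasher` or `EventEnvelopeHasher`, and it is not a full implementation of a canonicalization standard such as RFC 8785. The envelope field names are hypothetical.

```python
import hashlib
import json


def canonical_json(value):
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(
        value,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )


def envelope_hash(envelope):
    """Return the hex SHA-256 digest of the canonical UTF-8 serialization."""
    return hashlib.sha256(canonical_json(envelope).encode("utf-8")).hexdigest()


# Two envelopes with the same content but different key order (hypothetical fields).
first = {"type": "job_completed", "correlationId": "c-1", "payload": {"jobId": "j-1", "ok": True}}
second = {"payload": {"ok": True, "jobId": "j-1"}, "correlationId": "c-1", "type": "job_completed"}
```

Here `envelope_hash(first)` and `envelope_hash(second)` are equal because key order is normalized before hashing, while any change to a field value produces a different digest.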


@@ -0,0 +1,38 @@
# Orchestrator Golden Signals Observability
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
Built-in golden signal metrics (latency, traffic, errors, saturation) for orchestrator job execution, with timeline event emission and job capsule provenance tracking.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Observability/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Evidence/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scale/`
- **Key Classes**:
- `OrchestratorGoldenSignals` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Observability/OrchestratorGoldenSignals.cs`) - golden signal metrics: latency (p50/p95/p99), traffic (requests/sec), errors (error rate), saturation (queue depth, CPU, memory)
- `OrchestratorMetrics` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Observability/OrchestratorMetrics.cs`) - OpenTelemetry metrics registration for orchestrator operations
- `IncidentModeHooks` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Observability/IncidentModeHooks.cs`) - hooks triggered when golden signals breach thresholds, activating incident mode
- `JobAttestationService` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Evidence/JobAttestationService.cs`) - generates attestations for job execution with provenance data
- `JobAttestation` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Evidence/JobAttestation.cs`) - attestation model for a completed job
- `JobCapsule` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Evidence/JobCapsule.cs`) - capsule containing job execution evidence (inputs, outputs, metrics)
- `JobCapsuleGenerator` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Evidence/JobCapsuleGenerator.cs`) - generates job capsules from execution data
- `JobRedactionGuard` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Evidence/JobRedactionGuard.cs`) - redacts sensitive data from job capsules before attestation
- `SnapshotHook` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Evidence/SnapshotHook.cs`) - hook capturing execution state snapshots at key points
- `ScaleMetrics` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scale/ScaleMetrics.cs`) - metrics for auto-scaling decisions
- `KpiEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/KpiEndpoints.cs`) - REST endpoints for KPI/metrics queries
- `HealthEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/HealthEndpoints.cs`) - health check endpoints
- **Interfaces**: None (uses concrete implementations)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Execute a job and verify `OrchestratorGoldenSignals` records latency, traffic, and error metrics
- [ ] Verify golden signal latency: execute 10 jobs with varying durations and verify p50/p95/p99 percentiles are computed correctly
- [ ] Trigger an error threshold breach and verify `IncidentModeHooks` activates incident mode
- [ ] Generate a `JobCapsule` via `JobCapsuleGenerator` and verify it contains job inputs, outputs, and execution metrics
- [ ] Verify redaction: include sensitive data in job inputs and verify `JobRedactionGuard` removes it from the capsule
- [ ] Generate a `JobAttestation` via `JobAttestationService` and verify it contains the capsule hash and provenance data
- [ ] Query KPI metrics via `KpiEndpoints` and verify golden signal data is returned
- [ ] Verify `HealthEndpoints` report healthy when golden signals are within thresholds
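
The percentile check in the plan above needs an agreed definition of p50/p95/p99. The sketch below uses the nearest-rank method, which is only one of several common definitions; the source does not say which one `OrchestratorGoldenSignals` uses. The function name and the sample durations are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so integer inputs stay exact as long as possible.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical durations (milliseconds) for the 10 jobs in the test plan.
durations_ms = [120, 80, 95, 300, 110, 105, 90, 1000, 100, 115]
summary = {f"p{p}": percentile(durations_ms, p) for p in (50, 95, 99)}
```

With only 10 samples, p95 and p99 both resolve to the slowest job under this definition, so an E2E assertion on those values should state which percentile method it expects.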

@@ -0,0 +1,33 @@
# Orchestrator Operator Scope with Audit Metadata
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
Adds an `orch:operate` scope and an `Orch.Operator` role that require explicit `operator_reason` and `operator_ticket` parameters on token requests. Authority enforces these fields and captures them as audit properties, giving SecOps traceability for every orchestrator control action.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/`
- **Key Classes**:
- `AuditEntry` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/AuditEntry.cs`) - audit entry capturing operator actions with reason and ticket metadata
- `TenantResolver` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Services/TenantResolver.cs`) - resolves tenant and operator context from token claims
- `AuditEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/AuditEndpoints.cs`) - REST API for querying operator audit trail
- `AuditLedgerContracts` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Contracts/AuditLedgerContracts.cs`) - API contracts including operator metadata
- `Quota` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Quota.cs`) - quota model with operator attribution
- `Job` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Job.cs`) - job model with operator tracking
- `DeprecationHeaders` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Services/DeprecationHeaders.cs`) - deprecation header support for versioned operator APIs
- **Interfaces**: `IAuditRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IAuditRepository.cs`)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Request a token with `orch:operate` scope, `operator_reason="maintenance"`, and `operator_ticket="TICKET-123"`; verify the token is issued
- [ ] Perform an operator action (e.g., cancel a job) with the scoped token; verify an `AuditEntry` captures the operator_reason and operator_ticket
- [ ] Request a token with `orch:operate` scope but without `operator_reason` and verify the request is rejected with a 400 error
- [ ] Query the audit trail via `AuditEndpoints` and filter by operator_ticket; verify matching entries are returned
- [ ] Verify operator scope enforcement: use a token without `orch:operate` scope and verify operator actions are forbidden (403)
- [ ] Perform multiple operator actions and verify each generates a separate `AuditEntry` with correct metadata
- [ ] Verify tenant scoping via `TenantResolver`: operator actions for tenant A are not visible in tenant B's audit trail
- [ ] Verify audit entry immutability: attempt to modify an existing `AuditEntry` and verify it is rejected
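
The enforcement and immutability checks above can be illustrated with a minimal sketch. It is not the Authority or Orchestrator implementation (those are .NET services); every name here except the `orch:operate` scope and the `operator_reason` / `operator_ticket` fields is a hypothetical stand-in, and the mapping of the exception to HTTP 400 is an assumption taken from the test plan.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone

OPERATE_SCOPE = "orch:operate"
REQUIRED_FIELDS = ("operator_reason", "operator_ticket")


class InvalidTokenRequest(ValueError):
    """Hypothetical error type; a service would map it to HTTP 400."""


@dataclass(frozen=True)
class AuditEntry:
    """Illustrative immutable audit record (not the real AuditEntry class)."""
    actor: str
    action: str
    operator_reason: str
    operator_ticket: str
    occurred_at: datetime


def validate_token_request(scopes, params):
    """Return the operator claims to embed in the token, or raise if the
    operate scope is requested without the mandatory metadata."""
    if OPERATE_SCOPE not in scopes:
        return {}
    values = {name: (params.get(name) or "").strip() for name in REQUIRED_FIELDS}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise InvalidTokenRequest("missing required fields: " + ", ".join(missing))
    return values


def record_action(actor, action, claims):
    """Build one audit entry per operator action from the token claims."""
    return AuditEntry(
        actor=actor,
        action=action,
        operator_reason=claims["operator_reason"],
        operator_ticket=claims["operator_ticket"],
        occurred_at=datetime.now(timezone.utc),
    )
```

Using a frozen dataclass means any attempt to assign to a field of an existing entry raises `FrozenInstanceError`, which mirrors the immutability expectation in the last checklist item.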

@@ -0,0 +1,40 @@
# Orchestrator Worker SDKs (Go and Python)
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
Multi-language Worker SDKs enabling external workers to participate in orchestrator job execution via Go and Python clients, with examples and structured API packages.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Go/`, `src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Python/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/`
- **Key Classes**:
- `client.go` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Go/pkg/workersdk/client.go`) - Go SDK client for worker communication
- `config.go` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Go/pkg/workersdk/config.go`) - Go SDK configuration
- `artifact.go` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Go/pkg/workersdk/artifact.go`) - artifact handling in Go SDK
- `backfill.go` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Go/pkg/workersdk/backfill.go`) - backfill support in Go SDK
- `retry.go` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Go/pkg/workersdk/retry.go`) - retry logic in Go SDK
- `errors.go` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Go/pkg/workersdk/errors.go`) - error types in Go SDK
- `transport.go` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Go/internal/transport/transport.go`) - HTTP transport layer for Go SDK
- `main.go` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Go/examples/smoke/main.go`) - smoke test example worker
- `client.py` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Python/stellaops_orchestrator_worker/client.py`) - Python SDK client
- `config.py` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Python/stellaops_orchestrator_worker/config.py`) - Python SDK configuration
- `backfill.py` (`src/Orchestrator/StellaOps.Orchestrator.WorkerSdk.Python/stellaops_orchestrator_worker/backfill.py`) - Python backfill support
- `WorkerEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/WorkerEndpoints.cs`) - REST API for worker registration and job assignment
- `WorkerContracts` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Contracts/WorkerContracts.cs`) - API contracts for worker communication
- `Worker` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Worker/Worker.cs`) - .NET worker implementation
- **Interfaces**: None (SDK clients are standalone)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Register a Go worker via `WorkerEndpoints` and verify it receives a job assignment
- [ ] Execute a job with the Go worker SDK `client.go` and verify results are reported back via the API
- [ ] Register a Python worker via `client.py` and verify it receives a job assignment
- [ ] Verify Go SDK retry: configure `retry.go` policy and simulate a transient failure; verify the SDK retries and succeeds
- [ ] Verify artifact handling: upload an artifact via `artifact.go` and verify it is persisted
- [ ] Verify backfill: trigger a backfill via `backfill.py` and verify it processes historical events
- [ ] Verify Go SDK error types: trigger different error conditions and verify `errors.go` returns appropriate error types
- [ ] Run the Go smoke test example `main.go` and verify it completes successfully against the orchestrator API
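
The retry scenario above (a transient failure followed by success) can be sketched as exponential backoff with full jitter. This is a generic illustration in Python, not the policy implemented in `retry.go` or exposed by either SDK; `TransientError`, `call_with_retry`, and all parameter names are hypothetical.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1,
                    max_delay=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits a random time between 0 and an exponentially growing cap
    (full jitter) between attempts; non-transient errors propagate at once.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last failure
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

Passing `sleep` as a parameter lets a test substitute a recorder for `time.sleep`, so the retry count can be asserted without real delays.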

@@ -0,0 +1,37 @@
# Pack-Run Bridge (TaskRunner Integration)
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
Pack-run integration for TaskRunner, comprising a Postgres repository, API endpoints, a stream coordinator for log/artifact streaming, and the domain model.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/`
- **Key Classes**:
- `Pack` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Pack.cs`) - pack entity containing a set of jobs to execute as a unit
- `PackRun` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/PackRun.cs`) - pack-run entity tracking execution of a pack instance
- `PackRunLog` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/PackRunLog.cs`) - log entries for pack-run execution
- `PackRunStreamCoordinator` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Streaming/PackRunStreamCoordinator.cs`) - coordinates real-time streaming of pack-run logs and artifacts
- `PackRunEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/PackRunEndpoints.cs`) - REST API for creating, querying, and managing pack runs
- `PackRegistryEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/PackRegistryEndpoints.cs`) - REST API for pack registration and versioning
- `PackRunContracts` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Contracts/PackRunContracts.cs`) - API contracts for pack-run operations
- `PackRegistryContracts` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Contracts/PackRegistryContracts.cs`) - API contracts for pack registry
- `Run` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Run.cs`) - individual run within a pack execution
- `RunEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/RunEndpoints.cs`) - REST API for run management
- `RunContracts` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Contracts/RunContracts.cs`) - API contracts for run operations
- **Interfaces**: `IPackRunRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IPackRunRepository.cs`), `IPackRegistryRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IPackRegistryRepository.cs`), `IRunRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IRunRepository.cs`)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Register a pack via `PackRegistryEndpoints` with 3 jobs and verify it is persisted with version 1
- [ ] Create a pack run via `PackRunEndpoints` and verify it starts executing the pack's jobs
- [ ] Subscribe to the pack-run stream via `PackRunStreamCoordinator` and verify real-time log entries arrive as jobs execute
- [ ] Verify pack-run completion: all 3 jobs complete and the `PackRun` transitions to Completed
- [ ] Verify pack versioning: update a pack and verify `PackRegistryEndpoints` creates version 2 while preserving version 1
- [ ] Query `PackRunLog` entries via the API and verify all log entries are returned in chronological order
- [ ] Fail one job in a pack run and verify the pack run reports partial failure
- [ ] Create multiple pack runs concurrently and verify they execute independently
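
The completion and partial-failure checks above imply a rule for deriving a pack run's status from its jobs. The sketch below shows one plausible rule; the enum members and the `summarize` function are hypothetical, and the source only confirms that a `PackRun` can reach Completed and can report partial failure.

```python
from enum import Enum


class JobOutcome(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class PackRunStatus(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    PARTIAL_FAILURE = "partial_failure"
    FAILED = "failed"


def summarize(outcomes):
    """Derive a pack-run status from the outcomes of its jobs."""
    if not outcomes:
        raise ValueError("a pack run needs at least one job")
    # Any unfinished job keeps the whole pack run in progress.
    if any(o in (JobOutcome.PENDING, JobOutcome.RUNNING) for o in outcomes):
        return PackRunStatus.RUNNING
    failed = sum(1 for o in outcomes if o is JobOutcome.FAILED)
    if failed == 0:
        return PackRunStatus.COMPLETED
    if failed == len(outcomes):
        return PackRunStatus.FAILED
    return PackRunStatus.PARTIAL_FAILURE
```

For the 3-job pack in the plan, three successes yield `COMPLETED`, while two successes and one failure yield `PARTIAL_FAILURE`.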

@@ -0,0 +1,36 @@
# SKIP LOCKED Queue Pattern
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
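The PostgreSQL `SELECT ... FOR UPDATE SKIP LOCKED` queue pattern is used in the Scheduler and Orchestrator job repositories for reliable work distribution: concurrent workers dequeue jobs without blocking on rows that another worker has already locked.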
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scheduling/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/RateLimiting/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scale/`
- **Key Classes**:
- `JobScheduler` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scheduling/JobScheduler.cs`) - job scheduler using PostgreSQL `SELECT ... FOR UPDATE SKIP LOCKED` for concurrent job dequeuing without contention
- `Job` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Job.cs`) - job entity with status field used for queue filtering
- `JobStatus` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/JobStatus.cs`) - job states used in queue queries (Pending jobs are available for dequeuing)
- `Watermark` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Watermark.cs`) - watermark tracking for ordered processing
- `AdaptiveRateLimiter` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/RateLimiting/AdaptiveRateLimiter.cs`) - rate limiter that adjusts based on queue depth and processing speed
- `ConcurrencyLimiter` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/RateLimiting/ConcurrencyLimiter.cs`) - limits concurrent job processing
- `TokenBucket` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/RateLimiting/TokenBucket.cs`) - token bucket rate limiter for smooth job distribution
- `BackpressureHandler` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/RateLimiting/BackpressureHandler.cs`) - applies backpressure when queue depth exceeds thresholds
- `LoadShedder` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scale/LoadShedder.cs`) - sheds load when system is saturated
- `ScaleMetrics` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scale/ScaleMetrics.cs`) - metrics for monitoring queue depth and throughput
- **Interfaces**: `IJobRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IJobRepository.cs`), `IWatermarkRepository` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Repositories/IWatermarkRepository.cs`)
- **Source**: Feature matrix scan
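
As a rough illustration of the dequeue pattern named above, the sketch below claims pending jobs in a single statement. The `jobs` table, its `id`, `status`, `created_at`, and `leased_by` columns, and the `leased` status are all assumptions made for the example, not the real schema. The `dequeue` helper expects a DB-API cursor that accepts named `%(name)s` parameters, as psycopg does.

```python
# Claim up to batch_size pending jobs. Rows already locked by another
# transaction are skipped instead of waited on, so workers never block
# each other and no job is handed to two workers.
DEQUEUE_SQL = """
WITH next_jobs AS (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs AS j
SET status = 'leased', leased_by = %(worker_id)s
FROM next_jobs
WHERE j.id = next_jobs.id
RETURNING j.id
"""


def dequeue(cursor, worker_id, batch_size=1):
    """Run the claim statement and return the ids of the claimed jobs.

    The caller is responsible for committing the surrounding transaction.
    """
    cursor.execute(DEQUEUE_SQL, {"worker_id": worker_id, "batch_size": batch_size})
    return [row[0] for row in cursor.fetchall()]
```

Selecting and updating in one statement keeps the row locks and the status change in the same transaction, so a claimed job stops matching `status = 'pending'` as soon as that transaction commits.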
## E2E Test Plan
- [ ] Enqueue 10 jobs and dequeue from 3 concurrent workers using SKIP LOCKED via `JobScheduler`; verify each job is assigned to exactly one worker
- [ ] Verify no contention: dequeue rapidly from 5 workers and verify no blocking or deadlocks occur
- [ ] Verify job visibility: a job locked by worker A is not visible to worker B during dequeue
- [ ] Complete a locked job and verify it is no longer in the queue
- [ ] Verify `AdaptiveRateLimiter`: increase queue depth and verify the rate limiter increases throughput
- [ ] Verify `BackpressureHandler`: fill the queue beyond the threshold and verify backpressure is signaled to producers
- [ ] Verify `LoadShedder`: saturate the system and verify new jobs are rejected with a 503 response
- [ ] Test `TokenBucket`: configure a rate of 10 jobs/second and verify the bucket enforces the limit
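
The `TokenBucket` check above (10 jobs/second) can be reasoned about with a minimal token bucket. This is a generic sketch of the algorithm, not the Orchestrator's `TokenBucket` class; the constructor parameters, the `try_acquire` method, and the injectable clock are assumptions made so the behavior can be tested deterministically.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self, tokens=1):
        """Take `tokens` if available; return False without blocking otherwise."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

With `rate=10` and `capacity=10`, a burst of 10 jobs is admitted immediately, the 11th is refused, and half a second later 5 more are admitted. A test that asserts "10 jobs/second" should therefore account for the configured burst capacity as well as the refill rate.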

@@ -0,0 +1,32 @@
# SLO Burn-Rate Computation and Alert Budget Tracking
## Module
Orchestrator
## Status
IMPLEMENTED
## Description
SLO burn-rate computation for orchestrator operations with configurable alert budgets, enabling proactive capacity and reliability management.
## Implementation Details
- **Modules**: `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/SloManagement/`, `src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/`
- **Key Classes**:
- `BurnRateEngine` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/SloManagement/BurnRateEngine.cs`) - computes SLO burn rate from error budget consumption over rolling windows (1h, 6h, 24h, 30d)
- `Slo` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Slo.cs`) - SLO entity with target (e.g., 99.9%), error budget, and current burn rate
- `SloEndpoints` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/SloEndpoints.cs`) - REST API for SLO queries and burn rate dashboards
- `IncidentModeHooks` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Observability/IncidentModeHooks.cs`) - activates incident mode when burn rate exceeds alert thresholds
- `OrchestratorGoldenSignals` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Observability/OrchestratorGoldenSignals.cs`) - provides underlying error/latency data for SLO computation
- `ScaleMetrics` (`src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scale/ScaleMetrics.cs`) - metrics feeding SLO saturation signals
- **Interfaces**: None (uses concrete implementations)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Define an `Slo` with target=99.9% and error budget=43.2 minutes per 30-day month; verify the SLO is persisted
- [ ] Generate 10 successful requests and 1 failed request, and verify the `BurnRateEngine` computation reflects the failure (an observed error rate of 1/11)
- [ ] Verify rolling window: compute burn rates for 1h, 6h, and 24h windows via `BurnRateEngine` and verify each reflects the appropriate time range
- [ ] Exceed the alert threshold (e.g., 2x burn rate) and verify `IncidentModeHooks` triggers incident mode
- [ ] Query SLO via `SloEndpoints` and verify the response includes current burn rate, remaining budget, and alert status
- [ ] Verify budget depletion: consume the entire error budget and verify the `Slo` shows 0% remaining
- [ ] Reset the SLO period (monthly rollover) and verify the error budget resets to full
- [ ] Verify multi-SLO: define SLOs for latency and availability, verify `BurnRateEngine` computes each independently
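
The figures in the plan above follow from the usual definitions: the error budget is the fraction of the period that may fail, and the burn rate is the observed error rate divided by the allowed error rate. The sketch below works through that arithmetic; the function names are hypothetical, and the source does not confirm that `BurnRateEngine` uses exactly these formulas.

```python
def error_budget_minutes(target, period_days=30):
    """Minutes of allowed unavailability for a target such as 0.999."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    return (1 - target) * period_days * 24 * 60


def burn_rate(failed, total, target):
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget is being consumed exactly fast enough to run
    out at the end of the period; 2.0 means twice that fast.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / (1 - target)


def days_to_exhaustion(rate, period_days=30):
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return period_days / rate
```

For a 99.9% target over 30 days the budget is about 43.2 minutes. One failure in 11 requests is an error rate of roughly 9.1%, which is a burn rate of about 91 against the 0.1% allowance, far above the 2x alert threshold used as an example in the plan.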