Files
git.stella-ops.org/docs/features/checked/jobengine/jobengine-golden-signals-observability.md

45 lines
3.9 KiB
Markdown

# Orchestrator Golden Signals Observability
## Module
Orchestrator
## Status
VERIFIED
## Description
Built-in golden signal metrics (latency, traffic, errors, saturation) for orchestrator job execution, with timeline event emission and job capsule provenance tracking.
## Implementation Details
- **Modules**: `src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Infrastructure/Observability/`, `src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Evidence/`, `src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Scale/`
- **Key Classes**:
- `OrchestratorGoldenSignals` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Infrastructure/Observability/OrchestratorGoldenSignals.cs`) - golden signal metrics: latency (p50/p95/p99), traffic (requests/sec), errors (error rate), saturation (queue depth, CPU, memory)
- `OrchestratorMetrics` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Infrastructure/Observability/OrchestratorMetrics.cs`) - OpenTelemetry metrics registration for orchestrator operations
- `IncidentModeHooks` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Observability/IncidentModeHooks.cs`) - hooks triggered when golden signals breach thresholds, activating incident mode
- `JobAttestationService` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Evidence/JobAttestationService.cs`) - generates attestations for job execution with provenance data
- `JobAttestation` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Evidence/JobAttestation.cs`) - attestation model for a completed job
- `JobCapsule` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Evidence/JobCapsule.cs`) - capsule containing job execution evidence (inputs, outputs, metrics)
- `JobCapsuleGenerator` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Evidence/JobCapsuleGenerator.cs`) - generates job capsules from execution data
- `JobRedactionGuard` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Evidence/JobRedactionGuard.cs`) - redacts sensitive data from job capsules before attestation
- `SnapshotHook` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Evidence/SnapshotHook.cs`) - hook capturing execution state snapshots at key points
- `ScaleMetrics` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Scale/ScaleMetrics.cs`) - metrics for auto-scaling decisions
- `KpiEndpoints` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.WebService/Endpoints/KpiEndpoints.cs`) - REST endpoints for KPI/metrics queries
- `HealthEndpoints` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.WebService/Endpoints/HealthEndpoints.cs`) - health check endpoints
- **Interfaces**: None (uses concrete implementations)
- **Source**: Feature matrix scan
## E2E Test Plan
- [ ] Execute a job and verify `OrchestratorGoldenSignals` records latency, traffic, and error metrics
- [ ] Verify golden signal latency: execute 10 jobs with varying durations and verify p50/p95/p99 percentiles are computed correctly
- [ ] Trigger an error threshold breach and verify `IncidentModeHooks` activates incident mode
- [ ] Generate a `JobCapsule` via `JobCapsuleGenerator` and verify it contains job inputs, outputs, and execution metrics
- [ ] Verify redaction: include sensitive data in job inputs and verify `JobRedactionGuard` removes it from the capsule
- [ ] Generate a `JobAttestation` via `JobAttestationService` and verify it contains the capsule hash and provenance data
- [ ] Query KPI metrics via `KpiEndpoints` and verify golden signal data is returned
- [ ] Verify `HealthEndpoints` report healthy when golden signals are within thresholds
## Verification
- Verified on 2026-02-13 via `run-002`.
- Tier 0: Source files confirmed present on disk.
- Tier 1: `dotnet build` passed (0 errors); 1292/1292 tests passed.
- Tier 2d: `docs/qa/feature-checks/runs/jobengine/orchestrator-golden-signals-observability/run-002/tier2-integration-check.json`