git.stella-ops.org/src/Scheduler/StellaOps.Scheduler.WebService/docs/SCHED-WEB-27-002-POLICY-SIMULATION-WEBHOOKS.md

SCHED-CONSOLE-27-002 · Policy Simulation Telemetry & Webhooks

Owners: Scheduler WebService Guild, Observability Guild
Scope: Policy simulation metrics endpoint and completion webhooks feeding Registry/Console integrations.

1. Metrics endpoint refresher

  • GET /api/v1/scheduler/policies/simulations/metrics (scope: policy:simulate)
  • Returns queue depth grouped by status plus latency percentiles derived from the most recent sample window (default 200 terminal runs).
  • Surface area is unchanged from the implementation in Sprint 27 week 1; consumers should continue to rely on the contract in samples/api/scheduler/policy-simulation-metrics.json.
  • When the backing storage is not PostgreSQL, the endpoint responds with 501 Not Implemented.
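
A consumer of this endpoint has to treat 501 as "metrics unavailable" rather than as a hard failure. The sketch below shows one way to do that. The bearer-token Authorization header and the helper names are assumptions made for illustration, not part of the documented contract, and the response body is returned as-is because its shape is defined by the sample contract file.

```python
import json
import urllib.request
from typing import Optional

METRICS_PATH = "/api/v1/scheduler/policies/simulations/metrics"


def build_metrics_request(base_url: str, token: str) -> urllib.request.Request:
    """Build the GET request. The bearer-token header is an assumption."""
    return urllib.request.Request(
        base_url.rstrip("/") + METRICS_PATH,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
        method="GET",
    )


def interpret_metrics_response(status: int, body: bytes) -> Optional[dict]:
    """Return the parsed metrics document, or None when the endpoint answers 501."""
    if status == 501:
        # Backing storage is not PostgreSQL, so no metrics are available.
        return None
    if status != 200:
        raise RuntimeError(f"unexpected status {status} from metrics endpoint")
    return json.loads(body)
```

Note that urllib.request.urlopen raises urllib.error.HTTPError for a 501 response, so a caller would catch that exception and pass its code attribute to interpret_metrics_response.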

2. Completion webhooks

Scheduler Worker now emits policy simulation webhooks whenever a simulation reaches a terminal state (succeeded, failed, cancelled). Payloads are aligned with the SSE completed event shape and include idempotency headers so downstream systems can safely de-duplicate.

2.1 Configuration

// scheduler-worker.appsettings.json
{
  "Scheduler": {
    "Worker": {
      "Policy": {
        "Webhook": {
          "Enabled": true,
          "Endpoint": "https://registry.internal/hooks/policy-simulation",
          "ApiKeyHeader": "X-StellaOps-Webhook-Key",
          "ApiKey": "replace-me",
          "TimeoutSeconds": 10
        }
      }
    }
  }
}
  • Enabled: feature flag; disabled by default to preserve air-gap behaviour.
  • Endpoint: absolute HTTPS endpoint; requests use POST.
  • ApiKeyHeader/ApiKey: optional header name and shared-secret value that Registry can use to verify the sender.
  • TimeoutSeconds: per-request timeout (defaults to 10s).
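
The defaults and constraints listed above can be sketched as a small loader. This is an illustrative Python model of the settings, not the worker's actual options class. The WebhookOptions and load_webhook_options names are hypothetical, and rejecting a non-positive timeout is an added assumption.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class WebhookOptions:
    enabled: bool = False              # disabled by default (air-gap friendly)
    endpoint: Optional[str] = None
    api_key_header: Optional[str] = None
    api_key: Optional[str] = None
    timeout_seconds: int = 10          # documented default


def load_webhook_options(settings: dict) -> WebhookOptions:
    """Read Scheduler:Worker:Policy:Webhook and validate it when enabled."""
    section = (
        settings.get("Scheduler", {})
        .get("Worker", {})
        .get("Policy", {})
        .get("Webhook", {})
    )
    options = WebhookOptions(
        enabled=bool(section.get("Enabled", False)),
        endpoint=section.get("Endpoint"),
        api_key_header=section.get("ApiKeyHeader"),
        api_key=section.get("ApiKey"),
        timeout_seconds=int(section.get("TimeoutSeconds", 10)),
    )
    if options.enabled:
        parsed = urlparse(options.endpoint or "")
        if parsed.scheme != "https" or not parsed.netloc:
            raise ValueError("Webhook.Endpoint must be an absolute HTTPS URL")
        if options.timeout_seconds <= 0:
            # Assumption: a non-positive timeout is treated as a config error.
            raise ValueError("Webhook.TimeoutSeconds must be positive")
    return options
```

Validation runs only when Enabled is true, so a missing section leaves the feature off without raising an error.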

2.2 Headers

| Header | Purpose |
| --- | --- |
| X-StellaOps-Tenant | Tenant identifier for the simulation. |
| X-StellaOps-Run-Id | Stable run id (use as idempotency key). |
| X-StellaOps-Webhook-Key | Optional API key as configured. |
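
On the receiving side, these headers might be read as in the following sketch. The function name and the exception types are hypothetical, and the key header defaults to the name used in the configuration example. The constant-time comparison is a general precaution for secret checks rather than something this document mandates.

```python
import hmac
from typing import Mapping, Optional, Tuple


def read_webhook_headers(
    headers: Mapping[str, str],
    expected_key: Optional[str],
    key_header: str = "X-StellaOps-Webhook-Key",
) -> Tuple[str, str]:
    """Return (tenant, run_id) after checking the optional API key."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if expected_key is not None:
        supplied = lowered.get(key_header.lower(), "")
        # Constant-time comparison avoids leaking key contents through timing.
        if not hmac.compare_digest(supplied.encode(), expected_key.encode()):
            raise PermissionError("webhook API key missing or incorrect")
    try:
        return lowered["x-stellaops-tenant"], lowered["x-stellaops-run-id"]
    except KeyError as missing:
        raise ValueError(f"required webhook header absent: {missing}") from None
```

Passing expected_key=None models a deployment where no API key is configured, in which case only the tenant and run id headers are required.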

2.3 Payload

See samples/api/scheduler/policy-simulation-webhook.json for a canonical example.

{
  "tenantId": "tenant-alpha",
  "simulation": { /* PolicyRunStatus document */ },
  "result": "failed",
  "observedAt": "2025-11-03T20:05:12Z",
  "latencySeconds": 14.287,
  "reason": "policy engine timeout"
}
  • result: one of succeeded, failed, cancelled, running, or queued. Webhooks are emitted only for the three terminal values (succeeded, failed, cancelled).
  • latencySeconds: limited to four decimal places; computed as finishedAt - queuedAt when both timestamps exist, otherwise falls back to the observer timestamp.
  • reason: surfaced for failures (error) and cancellations (cancellationReason); omitted otherwise.
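
A receiver could parse the payload along these lines. This is a sketch under stated assumptions: the SimulationWebhook type and parse_webhook function are hypothetical, latencySeconds and reason are treated as optional, and the simulation document is kept as an opaque dict because its schema is defined elsewhere.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

TERMINAL_RESULTS = frozenset({"succeeded", "failed", "cancelled"})


@dataclass(frozen=True)
class SimulationWebhook:
    tenant_id: str
    result: str
    observed_at: datetime
    latency_seconds: Optional[float]
    reason: Optional[str]
    simulation: dict  # PolicyRunStatus document, left unparsed here


def parse_webhook(body: str) -> SimulationWebhook:
    """Parse a webhook body and reject non-terminal results."""
    doc = json.loads(body)
    result = doc["result"]
    if result not in TERMINAL_RESULTS:
        raise ValueError(f"webhook carried non-terminal result: {result!r}")
    return SimulationWebhook(
        tenant_id=doc["tenantId"],
        result=result,
        # Replace the trailing Z so older Python versions accept the timestamp.
        observed_at=datetime.fromisoformat(doc["observedAt"].replace("Z", "+00:00")),
        latency_seconds=doc.get("latencySeconds"),
        reason=doc.get("reason"),
        simulation=doc.get("simulation", {}),
    )
```

Rejecting running and queued here is a defensive choice. The worker is documented to send only terminal results, so such a payload would indicate a mismatch between sender and receiver.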

2.4 Delivery semantics

  • Best effort with no retry from the worker; Registry should use X-StellaOps-Run-Id for idempotency.
  • Failures emit WARN logs (prefix Policy run job {JobId}).
  • Disabled configuration short-circuits without network calls (debug log only).
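
The idempotency guidance above can be illustrated with a minimal receiver-side de-duplicator keyed on the run id. The RunDeduplicator class is hypothetical and keeps its state in memory only; a real Registry integration would presumably persist the seen ids.

```python
import threading
from typing import Callable, Set


class RunDeduplicator:
    """Remembers run ids that were already handled (in-memory sketch)."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self._lock = threading.Lock()

    def handle_once(self, run_id: str, handler: Callable[[], None]) -> bool:
        """Run handler unless run_id was seen before; return True if it ran."""
        with self._lock:
            if run_id in self._seen:
                return False
            self._seen.add(run_id)
        try:
            handler()
        except Exception:
            # Forget the id so a later delivery can be processed again.
            with self._lock:
                self._seen.discard(run_id)
            raise
        return True
```

The id is recorded before the handler runs so that two concurrent deliveries of the same run cannot both proceed. It is released again if the handler fails.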

3. SSE compatibility

No changes were required on the streaming endpoint (GET /api/v1/scheduler/policies/simulations/{id}/stream); Console continues to receive completed events containing the same PolicyRunStatus payload that the webhook publishes.
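
For consumers that read the stream directly, a minimal parser for the text/event-stream format might look like the sketch below. It assumes the event name is the literal string completed and that the data field carries the PolicyRunStatus document as JSON. Both function names are hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def completed_statuses(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each event named 'completed' (assumed name)."""
    for event, data in iter_sse_events(lines):
        if event == "completed":
            yield json.loads(data)
```

Because the webhook publishes the same PolicyRunStatus payload, the dict yielded here should correspond to the simulation field of the webhook body.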