feat: Add initial implementation of Vulnerability Resolver Jobs

- Created project for StellaOps.Scanner.Analyzers.Native.Tests with necessary dependencies.
- Documented roles and guidelines in AGENTS.md for Scheduler module.
- Implemented IResolverJobService interface and InMemoryResolverJobService for handling resolver jobs.
- Added ResolverBacklogNotifier and ResolverBacklogService for monitoring job metrics.
- Developed API endpoints for managing resolver jobs and retrieving metrics.
- Defined models for resolver job requests and responses.
- Integrated dependency injection for resolver job services.
- Implemented ImpactIndexSnapshot for persisting impact index data.
- Introduced SignalsScoringOptions for configurable scoring weights in reachability scoring.
- Added unit tests for ReachabilityScoringService and RuntimeFactsIngestionService.
- Created dotnet-filter.sh script to handle command-line arguments for dotnet.
- Established nuget-prime project for managing package downloads.
This commit is contained in:
master
2025-11-18 07:52:15 +02:00
parent e69b57d467
commit 8355e2ff75
299 changed files with 13293 additions and 2444 deletions

@@ -2,14 +2,17 @@
The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
## Latest updates (2025-11-01)
- Authority added `orch:quota` and `orch:backfill` scopes for quota/backfill operations, plus token reason/ticket auditing (`docs/updates/2025-11-01-orch-admin-scope.md`). Operators must supply `quota_reason` / `quota_ticket` (or `backfill_reason` / `backfill_ticket`) when requesting elevated tokens and surface those claims in change reviews.
## Latest updates (2025-11-18)
- Job leasing now flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
- Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
- Authority `orch:quota` / `orch:backfill` scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.
## Responsibilities
- Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
- Expose dashboards and APIs for throttling, replays, and failover.
- Enforce rate-limits, concurrency and dependency chains across queues.
- Stream structured events and audit logs for incident response.
- Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.
## Key components
- Orchestrator WebService (control plane).
@@ -24,9 +27,9 @@ The Orchestrator schedules, observes, and recovers ingestion and analysis jobs a
## Operational notes
- Job recovery runbooks and dashboard JSON as described in Epic 9.
- Audit retention policies for job history.
- Rate-limit reconfiguration guidelines.
- When using the new `orch:quota` / `orch:backfill` scopes, ensure reason/ticket fields are captured in runbooks and audit checklists per the 2025-11-01 Authority update.
- Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
- Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
- When using `orch:quota` / `orch:backfill` scopes, capture reason/ticket fields in runbooks and audit checklists.
## Epic alignment
- Epic 9: Source & Job Orchestrator Dashboard.

@@ -0,0 +1,9 @@
# Orchestrator docs task board
| Task ID | Status | Owner(s) | Notes |
| --- | --- | --- | --- |
| ORCH-DOCS-0001 | DONE | Docs Guild | README updated with leasing / task runner bridge notes and interim envelope guidance. |
| ORCH-ENG-0001 | DONE | Module Team | Sprint references normalized; notes synced to doc sprint. |
| ORCH-OPS-0001 | DONE | Ops Guild | Runbook impacts captured in README; follow-up to update ops docs. |
Status rules: mirror changes in `docs/implplan/SPRINT_0323_0001_0001_docs_modules_orchestrator.md`; use TODO → DOING → DONE/BLOCKED; add brief note if pausing.

@@ -9,13 +9,18 @@
- **Queue abstraction.** Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
## 2) Job lifecycle
1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs that exceed limits are staged in a pending queue with their next eligible timestamp.
3. **Leasing (Task Runner bridge).** Workers poll the `LeaseJob` endpoint; Orchestrator returns the job with `leaseId`, `leaseUntil`, `idempotencyKey`, and instrumentation tokens. Long-running tasks must renew their lease; leases carry retry hints and provenance (`tenant`, `project`, `correlationId`, `taskRunnerId`).
4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies the retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds are exceeded.
5. **Replay.** Operators trigger `POST /api/jobs/{id}/replay`, which clones the job payload, sets the `replayOf` pointer, and requeues with high priority while preserving determinism metadata.
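The leasing and retry steps above can be sketched in a few lines. This is a minimal illustration, not the Orchestrator implementation: the function names, the 60-second lease duration, the backoff constants, and the key format are all assumptions chosen for the example; only the field names `leaseId`, `leaseUntil`, and `idempotencyKey` come from the lifecycle description.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Lease:
    job_id: str
    lease_id: str          # mirrors `leaseId`
    lease_until: datetime  # mirrors `leaseUntil`
    idempotency_key: str   # mirrors `idempotencyKey`; format below is illustrative


def lease_job(job_id: str, attempt: int, now: datetime,
              duration: timedelta = timedelta(seconds=60)) -> Lease:
    """Grant a lease for one attempt of a job (hypothetical helper)."""
    return Lease(
        job_id=job_id,
        lease_id=f"lease_{uuid.uuid4().hex}",
        lease_until=now + duration,
        idempotency_key=f"{job_id}-{attempt}",
    )


def renew_lease(lease: Lease, now: datetime,
                duration: timedelta = timedelta(seconds=60)) -> bool:
    """Extend a live lease; an already expired lease cannot be renewed."""
    if now >= lease.lease_until:
        return False
    lease.lease_until = now + duration
    return True


def next_retry_delay(attempt: int, base: float = 5.0, factor: float = 2.0,
                     cap: float = 300.0, max_attempts: int = 5) -> Optional[float]:
    """Exponential backoff in seconds.

    `attempt` is the 1-based number of the attempt that just failed.
    Returns None once `max_attempts` is reached (no further retry).
    """
    if attempt >= max_attempts:
        return None
    return min(base * factor ** (attempt - 1), cap)
```

A worker that misses its renewal window loses the lease, at which point the job would typically fall back to the retry policy; whether the real service treats an expired lease as `timed_out` is not stated in this document.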
### Pack-run lifecycle (phase III)
- **Register** `pack-run` job type with task runner hints (artifacts, log channel, heartbeat cadence).
- **Logs/Artifacts**: SSE/WS stream keyed by `packRunId` + `tenant/project`; artifacts published with content digests and URI metadata.
- **Events**: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending ORCH-SVC-37-101 final spec.
## 3) Rate-limit & quota governance
@@ -24,22 +29,24 @@
- Circuit breakers automatically pause job types when the failure rate exceeds the configured threshold; incidents are generated via the Notify and Observability stack.
- Control plane quota updates require Authority scope `orch:quota` (issued via `Orch.Admin` role). Historical rebuilds/backfills additionally require `orch:backfill` and must supply `backfill_reason` and `backfill_ticket` alongside the operator metadata. Authority persists all four fields (`quota_reason`, `quota_ticket`, `backfill_reason`, `backfill_ticket`) for audit replay.
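A small check like the following could be used to confirm that an elevated token request carries the audit fields required by each scope. It is a sketch under stated assumptions: the function name and the flat `claims` dictionary are hypothetical, and only the scope and field names are taken from the text above.

```python
from typing import Dict, Iterable, List

# Scope -> audit fields that must accompany it (names from the Authority update).
REQUIRED_AUDIT_FIELDS = {
    "orch:quota": ("quota_reason", "quota_ticket"),
    "orch:backfill": ("backfill_reason", "backfill_ticket"),
}


def missing_audit_fields(scopes: Iterable[str], claims: Dict[str, object]) -> List[str]:
    """Return the audit fields that are absent or blank for the requested scopes."""
    missing = []
    for scope in scopes:
        for field in REQUIRED_AUDIT_FIELDS.get(scope, ()):
            value = claims.get(field)
            if not isinstance(value, str) or not value.strip():
                missing.append(field)
    return missing
```

An empty result means the request carries every field needed for audit replay; a non-empty result lists what a reviewer should ask for.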
## 4) APIs
- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
- `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason.
- `POST /api/jobs/{id}/replay` — schedule replay.
- `POST /api/limits/throttle` — apply throttle (requires elevated scope).
- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
- Event envelope draft (`docs/modules/orchestrator/event-envelope.md`) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
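For orientation, the sketch below builds (but does not send) requests for two of the endpoints listed above using only the standard library. The base URL, the query parameter names `tenant` and `jobType`, and the `reason` body field are assumptions for illustration; the document only names the paths and says that filters and an audit reason exist.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://orchestrator.example.internal"  # hypothetical host


def list_jobs_request(status: str, tenant: Optional[str] = None,
                      job_type: Optional[str] = None) -> Request:
    """Build GET /api/jobs with optional filters (parameter names assumed)."""
    params = {"status": status}
    if tenant:
        params["tenant"] = tenant
    if job_type:
        params["jobType"] = job_type
    return Request(f"{BASE_URL}/api/jobs?{urlencode(params)}", method="GET")


def cancel_job_request(job_id: str, reason: str) -> Request:
    """Build POST /api/jobs/{id}/cancel with an audit reason (body shape assumed)."""
    body = json.dumps({"reason": reason}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/jobs/{quote(job_id, safe='')}/cancel",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing either object to `urllib.request.urlopen` would issue the call; authentication headers are omitted because the token format is outside the scope of this section.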
## 5) Observability
- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`.
- Task Runner bridge adds `pack_run_logs_stream_lag_seconds`, `pack_run_heartbeats_total`, `pack_run_artifacts_total`.
- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops.
- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.
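As an illustration of the structured log shape described above, the helper below emits one JSON line containing the six listed fields and rejects records that omit any of them. The helper name and the choice of JSON as the serialization are assumptions; the document specifies only the field names.

```python
import json

# Field names as listed for structured logs.
LOG_FIELDS = ("jobId", "jobType", "tenant", "workerId", "leaseId", "status")


def job_log_line(**fields: str) -> str:
    """Serialize one job log record; raises ValueError if a field is missing."""
    missing = [name for name in LOG_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing log fields: {', '.join(missing)}")
    # Sorted keys keep the output stable across runs.
    return json.dumps({name: fields[name] for name in LOG_FIELDS}, sort_keys=True)
```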
## 6) Offline support

@@ -0,0 +1,69 @@
# Orchestrator Event Envelope (draft)
Status: draft for ORCH-SVC-38-101 (pending ORCH-SVC-37-101 approval)
## Goals
- Single, provenance-rich envelope for policy/export/job lifecycle events.
- Idempotent across retries and transports (Notifier bus, webhooks, SSE/WS streams).
- Tenant/project isolation and offline-friendly replays.
## Envelope
```jsonc
{
  "schemaVersion": "orch.event.v1",
  "eventId": "urn:orch:event:...", // UUIDv7 or ULID
  "eventType": "job.failed|job.completed|pack_run.log|pack_run.artifact|policy.updated|export.completed",
  "occurredAt": "2025-11-19T12:34:56Z",
  "idempotencyKey": "orch-{eventType}-{jobId}-{attempt}",
  "correlationId": "corr-...", // propagated from producer
  "tenantId": "...",
  "projectId": "...", // optional but preferred
  "actor": {
    "subject": "service/worker-sdk-go", // who emitted the event
    "scopes": ["orch:quota", "orch:backfill"]
  },
  "job": {
    "id": "job_018f...",
    "type": "pack-run|ingest|export|policy-simulate",
    "runId": "run_018f...", // for pack runs / sims
    "attempt": 3,
    "leaseId": "lease_018f...",
    "taskRunnerId": "tr_018f...",
    "status": "completed|failed|running|canceled",
    "reason": "user_cancelled|retry_backoff|quota_paused",
    "payloadDigest": "sha256:...",
    "artifacts": [
      {"uri": "s3://...", "digest": "sha256:...", "mime": "application/json"}
    ]
  },
  "metrics": {
    "durationSeconds": 12.345,
    "logStreamLagSeconds": 0.8,
    "backoffSeconds": 30
  },
  "notifier": {
    "channel": "orch.jobs",
    "delivery": "dsse",
    "replay": {"ordinal": 5, "total": 12}
  }
}
```
## Idempotency rules
- `eventId` is globally unique; `idempotencyKey` deduplicates per channel.
- Emit once per state transition; retries reuse the same `eventId`/`idempotencyKey`.
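The two rules above can be sketched as a key builder plus a per-channel deduplicator. The key template follows the `idempotencyKey` shown in the envelope; the class name, the in-memory set, and the `accept` method are assumptions for illustration, and a real subscriber would likely need a bounded or persistent store.

```python
from collections import defaultdict
from typing import Dict, Set


def idempotency_key(event_type: str, job_id: str, attempt: int) -> str:
    """Build a key following the envelope template orch-{eventType}-{jobId}-{attempt}."""
    return f"orch-{event_type}-{job_id}-{attempt}"


class ChannelDeduplicator:
    """Tracks idempotency keys already seen on each channel (in memory only)."""

    def __init__(self) -> None:
        self._seen: Dict[str, Set[str]] = defaultdict(set)

    def accept(self, channel: str, envelope: dict) -> bool:
        """Return True the first time a key appears on a channel, False on redelivery."""
        key = envelope["idempotencyKey"]
        if key in self._seen[channel]:
            return False
        self._seen[channel].add(key)
        return True
```

Because retries reuse the same key, a redelivered `job.failed` event for the same attempt is dropped, while the next attempt produces a new key and is accepted.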
## Provenance
- Always include `tenantId` and `projectId` (if available).
- Carry `correlationId` from upstream producers and `taskRunnerId` from leasing bridge.
- Include `actor.scopes` when events are triggered via elevated tokens (`orch:quota`, `orch:backfill`).
## Transport bindings
- **Notifier bus**: DSSE-wrapped envelope; subject `orch.event` and `eventType`.
- **Webhooks**: HMAC with `X-Orchestrator-Signature` (sha256), replay-safe via `idempotencyKey`.
- **SSE/WS**: stream per `tenantId` filtered by `projectId`; client dedupe via `eventId`.
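For the webhook binding, a receiver could verify the signature roughly as follows. This is a hedged sketch: the draft states only that webhooks use HMAC with SHA-256 in `X-Orchestrator-Signature`, so the hex encoding, the absence of a `sha256=` prefix, and signing the raw request body are assumptions here.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute a hex-encoded HMAC-SHA256 over the raw body (encoding assumed)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Compare the X-Orchestrator-Signature header value in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

Signature verification only proves authenticity; replay safety still depends on checking `idempotencyKey` after the signature passes.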
## Backlog & follow-ups
- Align field names with ORCH-SVC-37-101 once finalized.
- Add examples for policy/export events and pack-run log/manifest payloads.
- Document retry/backoff semantics in Notify/Console subscribers.