feat: Add initial implementation of Vulnerability Resolver Jobs
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
- Created project for StellaOps.Scanner.Analyzers.Native.Tests with necessary dependencies.
- Documented roles and guidelines in AGENTS.md for Scheduler module.
- Implemented IResolverJobService interface and InMemoryResolverJobService for handling resolver jobs.
- Added ResolverBacklogNotifier and ResolverBacklogService for monitoring job metrics.
- Developed API endpoints for managing resolver jobs and retrieving metrics.
- Defined models for resolver job requests and responses.
- Integrated dependency injection for resolver job services.
- Implemented ImpactIndexSnapshot for persisting impact index data.
- Introduced SignalsScoringOptions for configurable scoring weights in reachability scoring.
- Added unit tests for ReachabilityScoringService and RuntimeFactsIngestionService.
- Created dotnet-filter.sh script to handle command-line arguments for dotnet.
- Established nuget-prime project for managing package downloads.
@@ -2,14 +2,17 @@

The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.

## Latest updates (2025-11-01)
- Authority added `orch:quota` and `orch:backfill` scopes for quota/backfill operations, plus token reason/ticket auditing (`docs/updates/2025-11-01-orch-admin-scope.md`). Operators must supply `quota_reason` / `quota_ticket` (or `backfill_reason` / `backfill_ticket`) when requesting elevated tokens and surface those claims in change reviews.
## Latest updates (2025-11-18)
- Job leasing now flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
- Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
- Authority `orch:quota` / `orch:backfill` scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.

## Responsibilities
- Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
- Expose dashboards and APIs for throttling, replays, and failover.
- Enforce rate limits, concurrency, and dependency chains across queues.
- Stream structured events and audit logs for incident response.
- Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.

## Key components
- Orchestrator WebService (control plane).
@@ -24,9 +27,9 @@ The Orchestrator schedules, observes, and recovers ingestion and analysis jobs a

## Operational notes
- Job recovery runbooks and dashboard JSON as described in Epic 9.
- Audit retention policies for job history.
- Rate-limit reconfiguration guidelines.
- When using the new `orch:quota` / `orch:backfill` scopes, ensure reason/ticket fields are captured in runbooks and audit checklists per the 2025-11-01 Authority update.
- Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
- Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
- When using `orch:quota` / `orch:backfill` scopes, capture reason/ticket fields in runbooks and audit checklists.

## Epic alignment
- Epic 9: Source & Job Orchestrator Dashboard.

docs/modules/orchestrator/TASKS.md
@@ -0,0 +1,9 @@
# Orchestrator docs task board

| Task ID | Status | Owner(s) | Notes |
| --- | --- | --- | --- |
| ORCH-DOCS-0001 | DONE | Docs Guild | README updated with leasing / task runner bridge notes and interim envelope guidance. |
| ORCH-ENG-0001 | DONE | Module Team | Sprint references normalized; notes synced to doc sprint. |
| ORCH-OPS-0001 | DONE | Ops Guild | Runbook impacts captured in README; follow-up to update ops docs. |

Status rules: mirror changes in `docs/implplan/SPRINT_0323_0001_0001_docs_modules_orchestrator.md`; use TODO → DOING → DONE/BLOCKED; add brief note if pausing.
@@ -9,13 +9,18 @@
- **Queue abstraction.** Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.

## 2) Job lifecycle

1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in a pending queue with the next eligible timestamp.
3. **Leasing (Task Runner bridge).** Workers poll the `LeaseJob` endpoint; Orchestrator returns the job with `leaseId`, `leaseUntil`, `idempotencyKey`, and instrumentation tokens. Lease renewal is required for long-running tasks; leases carry retry hints and provenance (`tenant`, `project`, `correlationId`, `taskRunnerId`).
4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies the retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds are exceeded.
5. **Replay.** Operators trigger `POST /jobs/{id}/replay`, which clones the job payload, sets the `replayOf` pointer, and requeues with high priority while preserving determinism metadata.

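The retry policy named in step 4 (exponential backoff, max attempts) could be sketched as below. This is a minimal illustration under stated assumptions, not the Orchestrator's implementation: `RetryPolicy`, its field names, the default values, and the doubling factor are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry policy; names and defaults are assumptions."""

    base_seconds: float = 5.0    # delay after the first failed attempt
    cap_seconds: float = 300.0   # upper bound on any single delay
    max_attempts: int = 5        # attempts allowed before giving up

    def backoff_seconds(self, attempt: int) -> float:
        """Delay to wait after the given 1-based failed attempt."""
        if attempt < 1:
            raise ValueError("attempt must be >= 1")
        # Double the delay per attempt, never exceeding the cap.
        return min(self.cap_seconds, self.base_seconds * 2 ** (attempt - 1))

    def next_eligible_at(self, attempt: int, failed_at: datetime) -> Optional[datetime]:
        """Next eligible timestamp, or None once attempts are exhausted."""
        if attempt >= self.max_attempts:
            return None
        return failed_at + timedelta(seconds=self.backoff_seconds(attempt))
```

A scheduler could stage a failed job in the pending queue with the timestamp returned by `next_eligible_at`, and treat `None` as the point where the failure is escalated rather than retried.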
### Pack-run lifecycle (phase III)
- **Register** `pack-run` job type with task runner hints (artifacts, log channel, heartbeat cadence).
- **Logs/Artifacts**: SSE/WS stream keyed by `packRunId` + `tenant/project`; artifacts published with content digests and URI metadata.
- **Events**: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending ORCH-SVC-37-101 final spec.

## 3) Rate-limit & quota governance

@@ -24,22 +29,24 @@
- Circuit breakers automatically pause job types when the failure rate exceeds the configured threshold; incidents are generated via the Notify and Observability stack.
- Control plane quota updates require Authority scope `orch:quota` (issued via `Orch.Admin` role). Historical rebuilds/backfills additionally require `orch:backfill` and must supply `backfill_reason` and `backfill_ticket` alongside the operator metadata. Authority persists all four fields (`quota_reason`, `quota_ticket`, `backfill_reason`, `backfill_ticket`) for audit replay.

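The scope-to-audit-field pairing described above can be checked with a small helper. This is a sketch only: the function name `missing_audit_fields`, the shape of the `claims` mapping, and where such a check would run are assumptions; only the scope and field names come from the text.

```python
from typing import Iterable, List, Mapping

# Audit fields each elevated scope must carry, as listed in the text above.
REQUIRED_AUDIT_FIELDS = {
    "orch:quota": ("quota_reason", "quota_ticket"),
    "orch:backfill": ("backfill_reason", "backfill_ticket"),
}


def missing_audit_fields(scopes: Iterable[str], claims: Mapping[str, object]) -> List[str]:
    """Return the audit fields that are absent or blank for the given scopes.

    Hypothetical helper: `claims` stands in for whatever token or request
    metadata actually carries the reason/ticket values.
    """
    missing: List[str] = []
    for scope in scopes:
        for field in REQUIRED_AUDIT_FIELDS.get(scope, ()):
            value = claims.get(field)
            # Treat non-string and whitespace-only values as missing.
            if not isinstance(value, str) or not value.strip():
                missing.append(field)
    return missing
```

A caller would typically reject the request when the returned list is non-empty and include the field names in the error so operators can see which audit data is required.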
## 4) APIs
- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
- `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason.
- `POST /api/jobs/{id}/replay` — schedule replay.
- `POST /api/limits/throttle` — apply throttle (requires elevated scope).
- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
- Event envelope draft (`docs/modules/orchestrator/event-envelope.md`) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.

All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.

## 5) Observability

- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`.
- Task Runner bridge adds `pack_run_logs_stream_lag_seconds`, `pack_run_heartbeats_total`, `pack_run_artifacts_total`.
- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops.
- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.

## 6) Offline support

docs/modules/orchestrator/event-envelope.md
@@ -0,0 +1,69 @@
# Orchestrator Event Envelope (draft)

Status: draft for ORCH-SVC-38-101 (pending ORCH-SVC-37-101 approval)

## Goals
- Single, provenance-rich envelope for policy/export/job lifecycle events.
- Idempotent across retries and transports (Notifier bus, webhooks, SSE/WS streams).
- Tenant/project isolation and offline-friendly replays.

## Envelope
```jsonc
{
  "schemaVersion": "orch.event.v1",
  "eventId": "urn:orch:event:...", // UUIDv7 or ULID
  "eventType": "job.failed|job.completed|pack_run.log|pack_run.artifact|policy.updated|export.completed",
  "occurredAt": "2025-11-19T12:34:56Z",
  "idempotencyKey": "orch-{eventType}-{jobId}-{attempt}",
  "correlationId": "corr-...", // propagated from producer
  "tenantId": "...",
  "projectId": "...", // optional but preferred
  "actor": {
    "subject": "service/worker-sdk-go", // who emitted the event
    "scopes": ["orch:quota", "orch:backfill"]
  },
  "job": {
    "id": "job_018f...",
    "type": "pack-run|ingest|export|policy-simulate",
    "runId": "run_018f...", // for pack runs / sims
    "attempt": 3,
    "leaseId": "lease_018f...",
    "taskRunnerId": "tr_018f...",
    "status": "completed|failed|running|canceled",
    "reason": "user_cancelled|retry_backoff|quota_paused",
    "payloadDigest": "sha256:...",
    "artifacts": [
      {"uri": "s3://...", "digest": "sha256:...", "mime": "application/json"}
    ]
  },
  "metrics": {
    "durationSeconds": 12.345,
    "logStreamLagSeconds": 0.8,
    "backoffSeconds": 30
  },
  "notifier": {
    "channel": "orch.jobs",
    "delivery": "dsse",
    "replay": {"ordinal": 5, "total": 12}
  }
}
```

## Idempotency rules
- `eventId` is globally unique; `idempotencyKey` deduplicates per channel.
- Emit once per state transition; retries reuse the same `eventId`/`idempotencyKey`.

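The per-channel dedupe rule can be illustrated with a small in-memory consumer. This is a sketch under assumptions: `idempotency_key` follows the `orch-{eventType}-{jobId}-{attempt}` template shown in the envelope, while `ChannelDeduper` and its method name are hypothetical, and a real subscriber would need a bounded or persistent store rather than an unbounded set.

```python
from typing import Set, Tuple


def idempotency_key(event_type: str, job_id: str, attempt: int) -> str:
    """Build a key using the template from the draft envelope."""
    return f"orch-{event_type}-{job_id}-{attempt}"


class ChannelDeduper:
    """Hypothetical in-memory dedupe keyed by (channel, idempotencyKey)."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def should_process(self, channel: str, key: str) -> bool:
        """Return True the first time a key is seen on a channel."""
        marker = (channel, key)
        if marker in self._seen:
            return False  # redelivery of an event already handled
        self._seen.add(marker)
        return True
```

Keying on the pair rather than the key alone means the same event delivered over two channels is processed once on each, which matches the "dedupe per channel" wording.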
## Provenance
- Always include `tenantId` and `projectId` (if available).
- Carry `correlationId` from upstream producers and `taskRunnerId` from leasing bridge.
- Include `actor.scopes` when events are triggered via elevated tokens (`orch:quota`, `orch:backfill`).

## Transport bindings
- **Notifier bus**: DSSE-wrapped envelope; subject `orch.event` and `eventType`.
- **Webhooks**: HMAC with `X-Orchestrator-Signature` (sha256), replay-safe via `idempotencyKey`.
- **SSE/WS**: stream per `tenantId` filtered by `projectId`; client dedupe via `eventId`.

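For the webhook binding, a receiver might verify the `X-Orchestrator-Signature` header roughly as follows. The draft only names the header and the hash (sha256), so this sketch makes several assumptions: the signature covers the raw request body, it is hex-encoded, and the shared secret is provisioned out of band. The function names are hypothetical.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 of the raw body, hex-encoded (encoding is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Compare the received header value with the expected signature."""
    expected = sign_body(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

Verification should run against the exact bytes received, before any JSON parsing or re-serialization, since even whitespace changes would alter the digest.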
## Backlog & follow-ups
- Align field names with ORCH-SVC-37-101 once finalized.
- Add examples for policy/export events and pack-run log/manifest payloads.
- Document retry/backoff semantics in Notify/Console subscribers.