feat: Add initial implementation of Vulnerability Resolver Jobs

- Created project for StellaOps.Scanner.Analyzers.Native.Tests with necessary dependencies.
- Documented roles and guidelines in AGENTS.md for Scheduler module.
- Implemented IResolverJobService interface and InMemoryResolverJobService for handling resolver jobs.
- Added ResolverBacklogNotifier and ResolverBacklogService for monitoring job metrics.
- Developed API endpoints for managing resolver jobs and retrieving metrics.
- Defined models for resolver job requests and responses.
- Integrated dependency injection for resolver job services.
- Implemented ImpactIndexSnapshot for persisting impact index data.
- Introduced SignalsScoringOptions for configurable scoring weights in reachability scoring.
- Added unit tests for ReachabilityScoringService and RuntimeFactsIngestionService.
- Created dotnet-filter.sh script to handle command-line arguments for dotnet.
- Established nuget-prime project for managing package downloads.
This commit is contained in:
master
2025-11-18 07:52:15 +02:00
parent e69b57d467
commit 8355e2ff75
299 changed files with 13293 additions and 2444 deletions

@@ -28,9 +28,12 @@ The `stella` CLI is the operator-facing Swiss army knife for scans, exports, pol
- ./guides/cli-reference.md
- ./guides/policy.md
## Backlog references
- DOCS-CLI-OBS-52-001 / DOCS-CLI-FORENSICS-53-001 in ../../TASKS.md.
- CLI-CORE-41-001 epic in `src/Cli/StellaOps.Cli/TASKS.md`.
## Current workstreams (Q4 2025)
- Active docs sprint: `docs/implplan/SPRINT_0316_0001_0001_docs_modules_cli.md` — normalised sprint naming, doc sync, and upcoming ops/runbook refresh.
## Epic alignment
- **Epic 2 – Policy Engine & Editor:** deliver deterministic policy authoring, simulation, and explain verbs.

@@ -4,10 +4,11 @@
- Maintain deterministic behaviour and offline parity across releases.
- Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes.
## Workstreams
- Backlog grooming: reconcile open stories in ../../TASKS.md with this module's roadmap.
- Implementation: collaborate with service owners to land feature work defined in SPRINTS/EPIC docs.
- Validation: extend tests/fixtures to preserve determinism and provenance requirements.
- Documentation sync: keep module docs aligned with active sprint `docs/implplan/SPRINT_0316_0001_0001_docs_modules_cli.md`.
## Epic milestones
- **Epic 2 – Policy Engine & Editor:** deliver deterministic policy verbs, simulation, and explain outputs.

@@ -0,0 +1,47 @@
# Advisory AI API (structured chunks)
**Scope:** `/advisories/{advisoryKey}/chunks` (Concelier WebService) · aligned with Sprint 0112 canonical model.
## Response contract
```jsonc
{
"advisoryKey": "CVE-2025-0001",
"fingerprint": "<sha256 canonical advisory>",
"total": 3,
"truncated": false,
"entries": [
{
"type": "workaround", // ordered by (type, observationPath, documentId)
"chunkId": "c0ffee12", // sha256(documentId|observationPath) first 8 bytes
"content": { /* structured field payload */ },
"provenance": {
"documentId": "tenant-a:chunk:newest", // Observation _id
"observationPath": "/references/0", // JSON Pointer into observation
"source": "nvd",
"kind": "workaround",
"value": "tenant-a:chunk:newest",
"recordedAt": "2025-01-07T00:00:00Z",
"fieldMask": ["/references/0"]
}
}
]
}
```
### Determinism & provenance
- Sort entries by `(type, observationPath, documentId)` to keep cache keys stable across nodes.
- Cache keys include the advisory `fingerprint`, chunk/observation limits, filters, and observation hashes.
- Provenance anchors must always include both `documentId` and `observationPath` for Console/Attestor deep links and offline mirrors.
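The ordering and chunk-id rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `chunk_id` and `sort_entries` are hypothetical, strings are compared by code point, and the truncation to 8 hex characters is chosen to match the shape of the `c0ffee12` example rather than taken from the Concelier implementation.

```python
import hashlib


def chunk_id(document_id: str, observation_path: str, hex_chars: int = 8) -> str:
    """Hypothetical helper: SHA-256 over 'documentId|observationPath', truncated.

    The truncation length is an assumption made for illustration.
    """
    digest = hashlib.sha256(f"{document_id}|{observation_path}".encode("utf-8")).hexdigest()
    return digest[:hex_chars]


def sort_entries(entries: list[dict]) -> list[dict]:
    """Order entries by (type, observationPath, documentId), as the contract requires."""
    return sorted(
        entries,
        key=lambda e: (
            e["type"],
            e["provenance"]["observationPath"],
            e["provenance"]["documentId"],
        ),
    )
```

Because the sort key is a total order over the three fields, any two nodes that hold the same entries produce the same sequence, which is what keeps cache keys stable.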
### Query parameters
- `tenant` (required): tenant id; must match authorization context.
- `limit`, `observations`, `minLength`: bounded integers (see `ConcelierOptions.AdvisoryChunks`).
- `section`, `format`: comma-separated filters (case-insensitive).
### Compatibility notes
- Mirrors and offline kits rely on `fingerprint` + `chunkId` to verify chunks without re-merging observations.
- Field names mirror GHSA GraphQL and Cisco PSIRT openVuln payloads for downstream parity.

@@ -1,12 +1,15 @@
# Link-Not-Merge (LNM) Observation & Linkset Schema
_Draft for approval — authored 2025-11-16 to unblock CONCELIER-LNM tracks._
_Frozen v1 (add-only) — approved 2025-11-17 for CONCELIER-LNM-21-001/002/101._
## Goals
- Immutable storage of raw advisory observations per source/tenant.
- Deterministic linksets built from observations without merging or mutating originals.
- Stable across online/offline deployments; replayable from raw inputs.
## Status
- Frozen v1 as of 2025-11-17; further schema changes must go through ADR + sprint gating (CONCELIER-LNM-22x+).
## Observation document (Mongo JSON Schema excerpt)
```json
{
@@ -41,6 +44,17 @@ _Draft for approval — authored 2025-11-16 to unblock CONCELIER-LNM tracks._
}
},
"references": {"bsonType": "array", "items": {"bsonType":"string"}},
"scopes": {"bsonType":"array","items":{"bsonType":"string"}},
"relationships": {
"bsonType": "array",
"items": {"bsonType":"object","required":["type","source","target"],
"properties": {
"type":{"bsonType":"string"},
"source":{"bsonType":"string"},
"target":{"bsonType":"string"},
"provenance":{"bsonType":"string"}
}}
},
"weaknesses": {"bsonType":"array","items":{"bsonType":"string"}},
"published": {"bsonType": "date"},
"modified": {"bsonType": "date"},
@@ -84,6 +98,14 @@ _Draft for approval — authored 2025-11-16 to unblock CONCELIER-LNM tracks._
"severities": {"bsonType":"array","items":{"bsonType":"object"}}
}
},
"confidence": {"bsonType":"double", "description":"Optional correlation confidence (0–1)"},
"conflicts": {"bsonType":"array","items":{"bsonType":"object",
"required":["field","reason"],
"properties":{
"field":{"bsonType":"string"},
"reason":{"bsonType":"string"},
"values":{"bsonType":"array","items":{"bsonType":"string"}}
}}},
"createdAt":{"bsonType":"date"},
"builtByJobId":{"bsonType":"string"},
"provenance": {"bsonType":"object","properties":{

@@ -0,0 +1,89 @@
# Excititor Advisory-AI Evidence Contract (v1)
Updated: 2025-11-18 · Scope: EXCITITOR-AIAI-31-004 (Phase 119)
This note defines the deterministic, aggregation-only contract that Excititor exposes to Advisory AI and Lens consumers. It covers the `/v1/vex/evidence/chunks` NDJSON stream plus the projection rules for observation IDs, signatures, and provenance metadata.
## Goals
- **Deterministic & replayable**: stable ordering, no implicit clocks, fixed schemas.
- **Aggregation-only**: no consensus/inference; raw supplier statements plus signatures and AOC (Aggregation-Only Contract) guardrails.
- **Offline-friendly**: chunked NDJSON; no cross-tenant lookups; portable enough for mirror/air-gap bundles.
## Endpoint
- `GET /v1/vex/evidence/chunks`
- **Query**:
- `tenant` (required)
- `vulnerabilityId` (optional, repeatable) — CVE, GHSA, etc.
- `productKey` (optional, repeatable) — PURLish key used by Advisory AI.
- `cursor` (optional) — stable pagination token.
- `limit` (optional) — max records per stream chunk (default 500, max 2000).
- **Response**: `Content-Type: application/x-ndjson`
- Each line is a single evidence record (see schema below).
- Ordered by `(tenant, vulnerabilityId, productKey, observationId, statementId)` to stay deterministic.
## Evidence record schema (NDJSON)
```json
{
"tenant": "acme",
"vulnerabilityId": "CVE-2024-1234",
"productKey": "pkg:pypi/django@3.2.24",
"observationId": "obs-3cf9d6e4-…",
"statementId": "stmt-9c1d…",
"source": {
"supplier": "upstream:osv",
"documentId": "osv:GHSA-xxxx-yyyy",
"retrievedAt": "2025-11-10T12:34:56Z",
"signatureStatus": "missing|unverified|verified"
},
"aoc": {
"violations": [
{ "code": "EVIDENCE_SIGNATURE_MISSING", "surface": "ingest" }
]
},
"evidence": {
"type": "vex.statement",
"payload": { "...supplier-normalized-fields..." }
},
"provenance": {
"hash": "sha256:...",
"canonicalUri": "https://mirror.example/bundles/…",
"bundleId": "mirror-bundle-001"
}
}
```
### Field notes
- `observationId` is stable and maps 1:1 to internal storage; Advisory AI must cite it when emitting narratives.
- `statementId` remains unique within an observation.
- `signatureStatus` is pass-through from ingest; no interpretation beyond `missing|unverified|verified`.
- `aoc.violations` enumerates guardrail violations without blocking delivery.
- `evidence.payload` is supplier-shaped; we **do not** merge or rank.
- `provenance.hash` is the SHA-256 of the supplier document bytes; `canonicalUri` points to the mirror bundle when available.
## Determinism rules
- Ordering: fixed sort above; pagination cursor is derived from the last emitted `(tenant, vulnerabilityId, productKey, observationId, statementId)`.
- Clocks: All timestamps are UTC ISO-8601 with `Z`.
- No server-generated randomness; record content is idempotent for identical upstream inputs.
## AOC guardrails
- Enforced surfaces: ingest, `/v1/vex/aoc/verify`, and chunk emission.
- Violations are reported via `aoc.violations` and metric `excititor.vex.aoc.guard_violations`.
- No statements are dropped due to AOC; consumers decide how to act.
## Telemetry (counters/logs-only until span sink arrives)
- `excititor.vex.chunks.requests` — by `tenant`, `outcome`, `truncated`.
- `excititor.vex.chunks.bytes` — histogram of NDJSON stream sizes.
- `excititor.vex.chunks.records` — histogram of records per stream.
- Existing observation metrics (`excititor.vex.observation.*`) remain unchanged.
## Error handling
- 400 for invalid tenant or mutually exclusive filters.
- 429 with `Retry-After` when throttle budgets exceeded.
- 503 on upstream store/transient failures; responses remain NDJSON-free on error.
## Offline / mirror readiness
- When mirror bundles are configured, `provenance.canonicalUri` points to the local bundle path; otherwise it is omitted.
- All payloads are side-effect free; no remote fetches occur while streaming.
## Versioning
- Contract version: `v1` (this document). Changes must be additive; breaking changes require `v2` path and updated doc.

@@ -17,7 +17,10 @@ Excititor's evidence APIs now emit first-class OpenTelemetry metrics so Lens,
| `excititor.vex.observation.requests` | Counter | Number of `/v1/vex/observations/{vulnerabilityId}/{productKey}` requests handled. | `tenant`, `outcome` (`success`, `error`, `cancelled`), `truncated` (`true/false`) |
| `excititor.vex.observation.statement_count` | Histogram | Distribution of statements returned per observation projection request. | `tenant`, `outcome` |
| `excititor.vex.signature.status` | Counter | Signature status per statement (missing vs. unverified). | `tenant`, `status` (`missing`, `unverified`) |
| `excititor.vex.aoc.guard_violations` | Counter | Aggregated count of Aggregation-Only Contract violations detected by the WebService (ingest + `/vex/aoc/verify`). | `tenant`, `surface` (`ingest`, `aoc_verify`, etc.), `code` (AOC error code) |
| `excititor.vex.aoc.guard_violations` | Counter | Aggregated count of Aggregation-Only Contract violations detected by the WebService (ingest + `/v1/vex/aoc/verify`). | `tenant`, `surface` (`ingest`, `aoc_verify`, etc.), `code` (AOC error code) |
| `excititor.vex.chunks.requests` | Counter | Requests to `/v1/vex/evidence/chunks` (NDJSON stream). | `tenant`, `outcome` (`success`,`error`,`cancelled`), `truncated` (`true/false`) |
| `excititor.vex.chunks.bytes` | Histogram | Size of NDJSON chunk streams served (bytes). | `tenant`, `outcome` |
| `excititor.vex.chunks.records` | Histogram | Count of evidence records emitted per chunk stream. | `tenant`, `outcome` |
> All metrics originate from the `EvidenceTelemetry` helper (`src/Excititor/StellaOps.Excititor.WebService/Telemetry/EvidenceTelemetry.cs`). When disabled (telemetry off), the helper is inert.
@@ -31,8 +34,8 @@ Excititor's evidence APIs now emit first-class OpenTelemetry metrics so Lens,
1. **Enable telemetry**: set `Excititor:Telemetry:EnableMetrics=true`, configure OTLP endpoints/headers as described in `TelemetryExtensions`.
2. **Add dashboards**: import panels referencing the metrics above (see Grafana JSON snippets in Ops repo once merged).
3. **Alerting**: add rules for high guard violation rates and missing signatures. Tie alerts back to connectors via tenant metadata.
4. **Post-deploy checks**: after each release, verify metrics emit by curling `/v1/vex/observations/...`, watching the console exporter (dev) or OTLP (prod).
3. **Alerting**: add rules for high guard violation rates, missing signatures, and abnormal chunk bytes/record counts. Tie alerts back to connectors via tenant metadata.
4. **Post-deploy checks**: after each release, verify metrics emit by curling `/v1/vex/observations/...` and `/v1/vex/evidence/chunks`, watching the console exporter (dev) or OTLP (prod).
## Related documents

@@ -17,6 +17,8 @@
| `ledger_ingest_backlog_events` | Gauge | `tenant` | Number of events buffered in the writer queue. Alert when >5000 for 5 min. |
| `ledger_projection_lag_seconds` | Gauge | `tenant` | Wall-clock difference between latest ledger event and projection tail. Target <30s. |
| `ledger_projection_rebuild_seconds` | Histogram | `tenant` | Duration of replay/rebuild operations triggered by LEDGER-29-008 harness. |
| `ledger_projection_apply_seconds` | Histogram | `tenant`, `event_type`, `policy_version`, `evaluation_status` | Time to apply a single ledger event to projection. Target P95 <1s. |
| `ledger_projection_events_total` | Counter | `tenant`, `event_type`, `policy_version`, `evaluation_status` | Count of events applied to projections. |
| `ledger_merkle_anchor_duration_seconds` | Histogram | `tenant` | Time to batch + anchor events. Target <60s per 10k events. |
| `ledger_merkle_anchor_failures_total` | Counter | `tenant`, `reason` (`db`, `signing`, `network`) | Alerts at >0 within 15 min. |
| `ledger_attachments_encryption_failures_total` | Counter | `tenant`, `stage` (`encrypt`, `sign`, `upload`) | Ensures secure attachment pipeline stays healthy. |
@@ -25,22 +27,23 @@
### Derived dashboards
- **Writer health:** `ledger_write_latency_seconds` (P50/P95/P99), backlog gauge, event throughput.
- **Projection health:** `ledger_projection_lag_seconds`, rebuild durations, conflict counts (from logs).
- **Projection health:** `ledger_projection_lag_seconds`, `ledger_projection_apply_seconds`, projection throughput, conflict counts (from logs).
- **Anchoring:** Anchor duration histogram, failure counter, root hash timeline.
## 3. Logs & traces
- **Log structure:** Serilog JSON with fields `tenant`, `chainId`, `sequence`, `eventId`, `eventType`, `actorId`, `policyVersion`, `hash`, `merkleRoot`.
- **Log levels:** `Information` for success summaries (sampled), `Warning` for retried operations, `Error` for failed writes/anchors.
- **Correlation:** Each API request includes `requestId` + `traceId` logged with events. Projector logs capture `replayId` and `rebuildReason`.
- **Timeline events:** `ledger.event.appended` and `ledger.projection.updated` are emitted as structured logs carrying `tenant`, `chainId`, `sequence`, `eventId`, `policyVersion`, `traceId`, and placeholder `evidence_ref` fields for downstream timeline consumers.
- **Secrets:** Ensure `event_body` is never logged; log only metadata/hashes.
## 4. Alerts
| Alert | Condition | Response |
| --- | --- | --- |
| **LedgerWriteSLA** | `ledger_write_latency_seconds` P95 > 0.12s for 3 intervals | Check DB contention, review queue backlog, scale writer. |
| **LedgerWriteSLA** | `ledger_write_latency_seconds` P95 > 1s for 3 intervals | Check DB contention, review queue backlog, scale writer. |
| **LedgerBacklogGrowing** | `ledger_ingest_backlog_events` > 5000 for 5 min | Inspect upstream policy runs, ensure projector keeping up. |
| **ProjectionLag** | `ledger_projection_lag_seconds` > 60s | Trigger rebuild, verify change streams. |
| **ProjectionLag** | `ledger_projection_lag_seconds` > 30s | Trigger rebuild, verify change streams. |
| **AnchorFailure** | `ledger_merkle_anchor_failures_total` increase > 0 | Collect logs, rerun anchor, verify signing service. |
| **AttachmentSecurityError** | `ledger_attachments_encryption_failures_total` increase > 0 | Audit attachments pipeline; check key material and storage endpoints. |

@@ -38,6 +38,7 @@ Events are immutable append-only records representing every workflow change. Rec
| `event_hash` | `char(64)` | SHA-256 over canonical payload envelope. |
| `previous_hash` | `char(64)` | Hash of prior event in chain (all zeroes for first). |
| `merkle_leaf_hash` | `char(64)` | Leaf hash used for Merkle anchoring (hash over `event_hash || sequence_no`). |
| `evidence_bundle_ref` | `text` | Optional reference to evaluation/job evidence bundle (DSSE or capsule id). |
**Constraints & indexes**
@@ -49,6 +50,7 @@ CHECK (event_hash ~ '^[0-9a-f]{64}$');
CHECK (previous_hash ~ '^[0-9a-f]{64}$');
CREATE INDEX ix_ledger_events_finding ON ledger_events (tenant_id, finding_id, policy_version);
CREATE INDEX ix_ledger_events_type ON ledger_events (tenant_id, event_type, recorded_at DESC);
CREATE INDEX ix_ledger_events_finding_evidence_ref ON ledger_events (tenant_id, finding_id, recorded_at DESC) WHERE evidence_bundle_ref IS NOT NULL;
```
Partitions: top-level partitioned by `tenant_id` (list) with a default partition. Optional sub-partition by month on `recorded_at` for large tenants. PostgreSQL requires the partition key in unique constraints; global uniqueness for `event_id` is enforced as `(tenant_id, event_id)` with application-level guards maintaining cross-tenant uniqueness.
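The `previous_hash` chain and the `merkle_leaf_hash` column described above can be checked with a short sketch. Two things here are assumptions rather than part of the schema: the byte encoding of `event_hash || sequence_no` (raw hash bytes followed by an 8-byte big-endian integer) and the helper names.

```python
import hashlib

# The schema states that the first event in a chain uses an all-zero previous_hash.
GENESIS_PREVIOUS_HASH = "0" * 64


def merkle_leaf_hash(event_hash: str, sequence_no: int) -> str:
    """Hash over event_hash || sequence_no.

    Assumption: event_hash is decoded from hex and sequence_no is appended
    as an 8-byte big-endian integer; the real encoding may differ.
    """
    data = bytes.fromhex(event_hash) + sequence_no.to_bytes(8, "big")
    return hashlib.sha256(data).hexdigest()


def verify_chain(events: list[dict]) -> bool:
    """Check that each event's previous_hash equals the prior event's event_hash."""
    expected_previous = GENESIS_PREVIOUS_HASH
    for event in events:
        if event["previous_hash"] != expected_previous:
            return False
        expected_previous = event["event_hash"]
    return True
```

The output of `merkle_leaf_hash` is 64 lowercase hex characters, so it satisfies the same `^[0-9a-f]{64}$` shape that the `CHECK` constraints enforce for `event_hash` and `previous_hash`.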

@@ -16,7 +16,8 @@ Graph Indexer + Graph API build the tenant-scoped knowledge graph that powers bl
- **Storage abstraction** — supports document + adjacency (Mongo) or pluggable graph engine; both paths enforce deterministic ordering and export manifests.
## Current workstreams (Q4 2025)
- `GRAPH-SVC-30-00x` (in `src/Graph/StellaOps.Graph.Indexer/TASKS.md`) — stand up Graph Indexer pipeline, identity registry, snapshot exports.
- `GRAPH-SVC-30-00x` (see `src/Graph/StellaOps.Graph.Indexer/TASKS.md`) — stand up Graph Indexer pipeline, identity registry, snapshot exports.
- Active sprint: `docs/implplan/SPRINT_0141_0001_0001_graph_indexer.md` (Runtime & Signals 140.A) — clustering/centrality jobs, incremental/backfill pipeline, determinism tests, packaging.
- `GRAPH-API-30-00x` — draft API planner/cost guard, streaming responses, and Authority scope integration.
- `DOCS-GRAPH-24-003` & related backlog — author overview/API/query language docs; update this README again once those deliverables land.
- Deployment/DevOps follow-ups (`DEVOPS-VEX-30-001`, `DEPLOY-VEX-30-001`) coordinate dashboards, load tests, and Helm/Compose overlays for the graph stack.

@@ -1,6 +1,7 @@
# Implementation plan — Graph
## Delivery phases
> Current active execution sprint: `docs/implplan/SPRINT_0141_0001_0001_graph_indexer.md` (Runtime & Signals 140.A).
- **Phase 1 – Graph Indexer foundations**
Stand up Graph Indexer service, node/edge schemas, ingestion from SBOM/Concelier/Excititor events, identity stability, and snapshot materialisation.
- **Phase 2 – Graph API service**

@@ -2,14 +2,17 @@
The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
## Latest updates (2025-11-01)
- Authority added `orch:quota` and `orch:backfill` scopes for quota/backfill operations, plus token reason/ticket auditing (`docs/updates/2025-11-01-orch-admin-scope.md`). Operators must supply `quota_reason` / `quota_ticket` (or `backfill_reason` / `backfill_ticket`) when requesting elevated tokens and surface those claims in change reviews.
## Latest updates (2025-11-18)
- Job leasing now flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
- Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
- Authority `orch:quota` / `orch:backfill` scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.
## Responsibilities
- Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
- Expose dashboards and APIs for throttling, replays, and failover.
- Enforce rate-limits, concurrency and dependency chains across queues.
- Stream structured events and audit logs for incident response.
- Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.
## Key components
- Orchestrator WebService (control plane).
@@ -24,9 +27,9 @@ The Orchestrator schedules, observes, and recovers ingestion and analysis jobs a
## Operational notes
- Job recovery runbooks and dashboard JSON as described in Epic 9.
- Audit retention policies for job history.
- Rate-limit reconfiguration guidelines.
- When using the new `orch:quota` / `orch:backfill` scopes, ensure reason/ticket fields are captured in runbooks and audit checklists per the 2025-11-01 Authority update.
- Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
- Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
- When using `orch:quota` / `orch:backfill` scopes, capture reason/ticket fields in runbooks and audit checklists.
## Epic alignment
- Epic 9: Source & Job Orchestrator Dashboard.

@@ -0,0 +1,9 @@
# Orchestrator docs task board
| Task ID | Status | Owner(s) | Notes |
| --- | --- | --- | --- |
| ORCH-DOCS-0001 | DONE | Docs Guild | README updated with leasing / task runner bridge notes and interim envelope guidance. |
| ORCH-ENG-0001 | DONE | Module Team | Sprint references normalized; notes synced to doc sprint. |
| ORCH-OPS-0001 | DONE | Ops Guild | Runbook impacts captured in README; follow-up to update ops docs. |
Status rules: mirror changes in `docs/implplan/SPRINT_0323_0001_0001_docs_modules_orchestrator.md`; use TODO → DOING → DONE/BLOCKED; add brief note if pausing.

@@ -9,13 +9,18 @@
- **Queue abstraction.** Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- **Dashboard feeds.** SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
## 2) Job lifecycle
1. **Enqueue.** Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit `JobRequest` records containing `jobType`, `tenant`, `priority`, `payloadDigest`, `dependencies`.
2. **Scheduling.** Orchestrator applies quotas and rate limits per `{tenant, jobType}`. Jobs exceeding limits are staged in pending queue with next eligible timestamp.
3. **Leasing (Task Runner bridge).** Workers poll `LeaseJob` endpoint; Orchestrator returns job with `leaseId`, `leaseUntil`, `idempotencyKey`, and instrumentation tokens. Lease renewal required for long-running tasks; leases carry retry hints and provenance (`tenant`, `project`, `correlationId`, `taskRunnerId`).
4. **Completion.** Worker reports status (`succeeded`, `failed`, `canceled`, `timed_out`). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded.
5. **Replay.** Operators trigger `POST /jobs/{id}/replay` which clones job payload, sets `replayOf` pointer, and requeues with high priority while preserving determinism metadata.
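The completion step (retry with exponential backoff and a maximum attempt count) and the lease expiry check can be sketched as follows. The base delay, cap, maximum attempts, and the `pending` / `nextEligibleAt` names are illustrative assumptions; the document does not state the Orchestrator's actual defaults.

```python
from datetime import datetime, timedelta, timezone


def next_retry_delay(attempt: int, base_seconds: float = 5.0, cap_seconds: float = 300.0) -> float:
    """Exponential backoff: base * 2^(attempt - 1), capped. Values are assumed."""
    return min(cap_seconds, base_seconds * (2 ** (attempt - 1)))


def on_failure(job: dict, now: datetime, max_attempts: int = 5) -> dict:
    """Apply a retry policy to a failed job and return the updated record."""
    attempt = job["attempt"]
    if attempt >= max_attempts:
        # Attempts exhausted: leave the job failed so it can be escalated.
        return {**job, "status": "failed", "nextEligibleAt": None}
    delay = next_retry_delay(attempt)
    return {
        **job,
        "status": "pending",  # hypothetical name for the staged/pending state
        "attempt": attempt + 1,
        "nextEligibleAt": now + timedelta(seconds=delay),
    }


def lease_expired(lease_until: datetime, now: datetime) -> bool:
    """A lease that is not renewed before leaseUntil is treated as expired."""
    return now >= lease_until
```

For example, a job failing on its third attempt would be staged with a 20 second delay under these assumed defaults, while a job at the attempt limit stays `failed`.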
### Pack-run lifecycle (phase III)
- **Register** `pack-run` job type with task runner hints (artifacts, log channel, heartbeat cadence).
- **Logs/Artifacts**: SSE/WS stream keyed by `packRunId` + `tenant/project`; artifacts published with content digests and URI metadata.
- **Events**: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending ORCH-SVC-37-101 final spec.
## 3) Rate-limit & quota governance
@@ -24,22 +29,24 @@
- Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
- Control plane quota updates require Authority scope `orch:quota` (issued via `Orch.Admin` role). Historical rebuilds/backfills additionally require `orch:backfill` and must supply `backfill_reason` and `backfill_ticket` alongside the operator metadata. Authority persists all four fields (`quota_reason`, `quota_ticket`, `backfill_reason`, `backfill_ticket`) for audit replay.
## 4) APIs
- `GET /api/jobs?status=` — list jobs with filters (tenant, jobType, status, time window).
- `GET /api/jobs/{id}` — job detail (payload digest, attempts, worker, lease history, metrics).
- `POST /api/jobs/{id}/cancel` — cancel running/pending job with audit reason.
- `POST /api/jobs/{id}/replay` — schedule replay.
- `POST /api/limits/throttle` — apply throttle (requires elevated scope).
- `GET /api/dashboard/metrics` — aggregated metrics for Console dashboards.
- Event envelope draft (`docs/modules/orchestrator/event-envelope.md`) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
## 5) Observability
- Metrics: `job_queue_depth{jobType,tenant}`, `job_latency_seconds`, `job_failures_total`, `job_retry_total`, `lease_extensions_total`.
- Task Runner bridge adds `pack_run_logs_stream_lag_seconds`, `pack_run_heartbeats_total`, `pack_run_artifacts_total`.
- Logs: structured with `jobId`, `jobType`, `tenant`, `workerId`, `leaseId`, `status`. Incident logs flagged for Ops.
- Traces: spans covering `enqueue`, `schedule`, `lease`, `worker_execute`, `complete`. Trace IDs propagate to worker spans for end-to-end correlation.
## 6) Offline support

@@ -0,0 +1,69 @@
# Orchestrator Event Envelope (draft)
Status: draft for ORCH-SVC-38-101 (pending ORCH-SVC-37-101 approval)
## Goals
- Single, provenance-rich envelope for policy/export/job lifecycle events.
- Idempotent across retries and transports (Notifier bus, webhooks, SSE/WS streams).
- Tenant/project isolation and offline-friendly replays.
## Envelope
```jsonc
{
"schemaVersion": "orch.event.v1",
"eventId": "urn:orch:event:...", // UUIDv7 or ULID
"eventType": "job.failed|job.completed|pack_run.log|pack_run.artifact|policy.updated|export.completed",
"occurredAt": "2025-11-19T12:34:56Z",
"idempotencyKey": "orch-{eventType}-{jobId}-{attempt}",
"correlationId": "corr-...", // propagated from producer
"tenantId": "...",
"projectId": "...", // optional but preferred
"actor": {
"subject": "service/worker-sdk-go", // who emitted the event
"scopes": ["orch:quota", "orch:backfill"]
},
"job": {
"id": "job_018f...",
"type": "pack-run|ingest|export|policy-simulate",
"runId": "run_018f...", // for pack runs / sims
"attempt": 3,
"leaseId": "lease_018f...",
"taskRunnerId": "tr_018f...",
"status": "completed|failed|running|canceled",
"reason": "user_cancelled|retry_backoff|quota_paused",
"payloadDigest": "sha256:...",
"artifacts": [
{"uri": "s3://...", "digest": "sha256:...", "mime": "application/json"}
]
},
"metrics": {
"durationSeconds": 12.345,
"logStreamLagSeconds": 0.8,
"backoffSeconds": 30
},
"notifier": {
"channel": "orch.jobs",
"delivery": "dsse",
"replay": {"ordinal": 5, "total": 12}
}
}
```
## Idempotency rules
- `eventId` globally unique; `idempotencyKey` dedupe per channel.
- Emit once per state transition; retries reuse the same `eventId`/`idempotencyKey`.
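A consumer-side sketch of these rules: the key follows the `orch-{eventType}-{jobId}-{attempt}` template from the envelope, and deduplication is scoped per channel. The `EventDeduplicator` class is hypothetical, and the in-memory set stands in for whatever durable store a real subscriber would use.

```python
class EventDeduplicator:
    """Tracks (channel, idempotencyKey) pairs that were already processed."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def accept(self, channel: str, envelope: dict) -> bool:
        """Return True the first time a key is seen on a channel, False on redelivery."""
        key = (channel, envelope["idempotencyKey"])
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def idempotency_key(event_type: str, job_id: str, attempt: int) -> str:
    """Build the key using the template shown in the envelope example."""
    return f"orch-{event_type}-{job_id}-{attempt}"
```

Because retries reuse the same `idempotencyKey`, a redelivered envelope is rejected on the channel where it was already seen, but the same key is still accepted on a different channel.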
## Provenance
- Always include `tenantId` and `projectId` (if available).
- Carry `correlationId` from upstream producers and `taskRunnerId` from leasing bridge.
- Include `actor.scopes` when events are triggered via elevated tokens (`orch:quota`, `orch:backfill`).
## Transport bindings
- **Notifier bus**: DSSE-wrapped envelope; subject `orch.event` and `eventType`.
- **Webhooks**: HMAC with `X-Orchestrator-Signature` (sha256), replay-safe via `idempotencyKey`.
- **SSE/WS**: stream per `tenantId` filtered by `projectId`; client dedupe via `eventId`.
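For the webhook binding, signature verification might look like the sketch below. The header value format (`sha256=<hex digest>`) and signing over the raw request body are assumptions; the draft only names the `X-Orchestrator-Signature` header and the SHA-256 HMAC.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute an HMAC-SHA256 signature; the 'sha256=' prefix is an assumed format."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, header_value: str) -> bool:
    """Constant-time comparison against the X-Orchestrator-Signature header value."""
    return hmac.compare_digest(sign(secret, body), header_value)
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many leading characters of the signature matched.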
## Backlog & follow-ups
- Align field names with ORCH-SVC-37-101 once finalized.
- Add examples for policy/export events and pack-run log/manifest payloads.
- Document retry/backoff semantics in Notify/Console subscribers.

@@ -0,0 +1,57 @@
# SBOM Service architecture (2025Q4)
> Scope: canonical SBOM projections, lookup and timeline APIs, asset metadata overlays, and events feeding Advisory AI, Console, Graph, Policy, and Vuln Explorer.
## 1) Mission & boundaries
- Mission: serve deterministic, tenant-scoped SBOM projections (Link-Not-Merge v1) and related metadata for downstream reasoning and overlays.
- Boundaries:
- Does not perform scanning; consumes Scanner outputs or supplied SPDX/CycloneDX blobs.
- Does not author verdicts/policy; supplies evidence and projections to Policy/Concelier/Graph.
- Append-only SBOM versions; mutations happen via new versions, never in-place edits.
## 2) Project layout
- `src/SbomService/StellaOps.SbomService` — REST API + event emitters + orchestrator integration.
- Storage: MongoDB collections (proposed):
  - `sbom_snapshots` (immutable versions; tenant + artifact + digest + createdAt)
  - `sbom_projections` (materialised views keyed by snapshotId, entrypoint/service node flags)
  - `sbom_assets` (asset metadata, criticality/owner/env/exposure; append-only history)
  - `sbom_paths` (resolved dependency paths with runtime flags, blast-radius hints)
  - `sbom_events` (outbox for event delivery + watermark/backfill tracking)
## 3) APIs (first wave)
- `GET /sbom/paths?purl=...&artifact=...&scope=...&env=...` — returns ordered paths with runtime_flag/blast_radius and nearest-safe-version hint; supports `cursor` pagination.
- `GET /sbom/versions?artifact=...` — time-ordered SBOM version timeline for Advisory AI; include provenance and source bundle hash.
- `GET /console/sboms` — Console catalog with filters (artifact, license, scope, asset tags), cursor pagination, evaluation metadata, immutable JSON projection for drawer views.
- `GET /components/lookup?purl=...` — component neighborhood for global search/Graph overlays; returns cache hints and enforces tenant scoping.
- `POST /entrypoints` / `GET /entrypoints` — manage entrypoint/service node overrides feeding Cartographer relevance; deterministic defaults when unset.
## 4) Ingestion & orchestrator integration
- Ingest sources: Scanner pipeline (preferred) or uploaded SPDX 3.0.1/CycloneDX 1.6 bundles.
- Orchestrator: register SBOM ingest/index jobs; worker SDK emits artifact hash + job metadata; honor pause/throttle; report backpressure metrics; support watermark-based backfill for idempotent replays.
- Idempotency: combine `(tenant, artifactDigest, sbomVersion)` as primary key; duplicate ingests short-circuit.
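
The duplicate short-circuit on `(tenant, artifactDigest, sbomVersion)` might look like the sketch below. `SnapshotStore` is a hypothetical in-memory stand-in for the proposed `sbom_snapshots` collection, and the record field names are assumptions.

```python
class SnapshotStore:
    """Hypothetical in-memory stand-in for the proposed sbom_snapshots collection."""

    def __init__(self) -> None:
        self._snapshots: dict[tuple[str, str, str], dict] = {}

    def ingest(self, tenant: str, artifact_digest: str, sbom_version: str,
               document: dict) -> tuple[dict, bool]:
        """Return (snapshot, created); created is False for a duplicate ingest."""
        key = (tenant, artifact_digest, sbom_version)
        existing = self._snapshots.get(key)
        if existing is not None:
            # Duplicate ingest: short-circuit and leave the stored version untouched.
            return existing, False
        snapshot = {
            "tenant": tenant,
            "artifactDigest": artifact_digest,
            "sbomVersion": sbom_version,
            "document": document,
        }
        self._snapshots[key] = snapshot
        return snapshot, True
```

Because the first write wins, a replayed job with the same key cannot alter an existing snapshot, which matches the append-only boundary stated in section 1.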
## 5) Events & streaming
- `sbom.version.created` — emitted per new SBOM snapshot; payload: tenant, artifact digest, sbomVersion, projection hash, source bundle hash, import provenance; replay/backfill via outbox with watermark.
- `sbom.asset.updated` — emitted when asset metadata changes; idempotent payload keyed by `(tenant, assetId, version)`.
- Inventory/resolver feeds — queue/topic delivering `(artifact, purl, version, paths, runtime_flag, scope, nearest_safe_version)` for Vuln Explorer/Findings Ledger.
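
Replay and backfill via an outbox with a watermark can be illustrated as below. This is a sketch under stated assumptions: the `Outbox` class, its monotonically increasing sequence number, and the tuple layout are all hypothetical, since the source does not define the outbox schema.

```python
class Outbox:
    """Hypothetical append-only outbox; sequence numbers act as the watermark."""

    def __init__(self) -> None:
        self._entries: list[tuple[int, str, dict]] = []

    def append(self, event_type: str, payload: dict) -> int:
        """Record an event and return its sequence number."""
        seq = len(self._entries) + 1
        self._entries.append((seq, event_type, payload))
        return seq

    def replay(self, watermark: int = 0) -> list[tuple[int, str, dict]]:
        """Return every entry recorded after the given watermark, in order."""
        return [entry for entry in self._entries if entry[0] > watermark]
```

A consumer would store the highest sequence number it has processed and pass it back as `watermark` to resume without reprocessing earlier events.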
## 6) Determinism & offline posture
- Stable ordering for projections and paths; timestamps in UTC ISO-8601; hash inputs canonicalised.
- Add-only evolution for schemas; LNM v1 fixtures published alongside API docs and replayable tests.
- Offline-friendly: uses mirrored packages, avoids external calls during projection; exports NDJSON bundles for air-gapped replay.
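
One way to canonicalise hash inputs is to serialise with sorted keys and fixed separators before hashing. The source does not name a canonicalisation scheme, so the choices below (compact JSON, UTF-8, SHA-256) and the function names are assumptions for illustration only.

```python
import hashlib
import json


def canonical_bytes(value) -> bytes:
    """Serialise with sorted keys and fixed separators so equal data gives equal bytes."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def projection_hash(projection: dict) -> str:
    """Hex SHA-256 over the canonical form of a projection."""
    return hashlib.sha256(canonical_bytes(projection)).hexdigest()
```

Key order is handled by the serialiser, but list order is not, so collections such as paths still need a stable sort before hashing.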
## 7) Tenancy & security
- All APIs require tenant context (token claims or mTLS binding); collection filters must include tenant keys.
- Enforce least-privilege queries; avoid cross-tenant caches; log tenant IDs in structured logs.
- Input validation: schema-validate incoming SBOMs; reject oversized/unsupported media types early.
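
The rule that collection filters must include tenant keys can be enforced by building every query through one helper. This is a minimal sketch; `tenant_filter` and the `tenant` field name are hypothetical, as the source leaves collection schemas open.

```python
def tenant_filter(tenant: str, **criteria) -> dict:
    """Build a query filter that always carries the tenant key."""
    if not tenant:
        # Refuse to build an unscoped query rather than fall back to "all tenants".
        raise ValueError("tenant context is required")
    return {"tenant": tenant, **criteria}
```

For example, `tenant_filter("t1", artifactDigest="sha256:abc")` yields a filter scoped to tenant `t1`, and an empty tenant fails before any query is issued.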
## 8) Observability
- Metrics: `sbom_projection_seconds`, `sbom_projection_size_bytes`, `sbom_paths_latency_seconds`, `sbom_paths_cache_hit_ratio`, `sbom_events_backlog`.
- Traces: wrap ingest, projection build, and API handlers; propagate orchestrator job IDs.
- Logs: structured, include tenant + artifact digest + sbomVersion; classify ingest failures (schema, storage, orchestrator, validation).
- Alerts: backlog thresholds for outbox/event delivery; high latency on path/timeline endpoints.
## 9) Open questions / dependencies
- Confirm orchestrator pause/backfill contract (shared with Runtime & Signals 140-series).
- Finalise storage collection names and indexes (compound on tenant+artifactDigest+version, TTL for transient staging).
- Publish canonical LNM v1 fixtures and JSON schemas for projections and asset metadata.


@@ -4,6 +4,7 @@ This directory contains deep technical designs for current and upcoming analyzer
## Language analyzers
- `ruby-analyzer.md` — lockfile, runtime graph, capability signals for Ruby.
- `deno-runtime-signals.md` — runtime trace + policy signal contract for Deno analyzer.
## Surface & platform contracts
- `surface-fs.md`


@@ -0,0 +1,109 @@
# Deno Runtime Signals & Policy Contract (v0.1-DRAFT)
## Purpose
Define deterministic runtime evidence records and policy signals for Deno analyzer phase II (tasks DENO-26-009/010/011). The contract is offline-friendly, append-only, and compatible with Surface/Signals stores.
## Scope
- Harnessed execution hook (`stella deno trace`) capturing module loads and permission grants during analysis.
- Trace serialization for Worker/CLI/Offline Kit and AnalysisStore.
- Policy signal keys consumed by Surface/Signals and Policy Engine.
## Event model
- Encoding: NDJSON; each line is a UTF-8 JSON object sorted by key when written.
- Path handling: absolute paths are converted to analyzer-relative paths; each relative path is accompanied by `path_sha256` (lowercase hex) so it can be verified without leaking absolute paths.
- Timestamps: ISO-8601 UTC with millisecond precision; no local time.
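
The encoding rules above can be sketched as a few small helpers. This is an illustration, not the analyzer's implementation: the function names are hypothetical, compact separators are an assumption, and hashing the UTF-8 bytes of the relative path follows the "full relative path bytes" rule in this contract.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import PurePosixPath


def to_relative(absolute: str, root: str) -> str:
    """Convert an absolute path to an analyzer-relative path with forward slashes."""
    abs_path = PurePosixPath(absolute.replace("\\", "/"))
    root_path = PurePosixPath(root.replace("\\", "/"))
    return abs_path.relative_to(root_path).as_posix()


def path_sha256(relative: str) -> str:
    """Lowercase hex SHA-256 over the UTF-8 bytes of the relative path."""
    return hashlib.sha256(relative.encode("utf-8")).hexdigest()


def format_ts(moment: datetime) -> str:
    """ISO-8601 UTC with millisecond precision; pass a timezone-aware datetime."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def encode_event(event: dict) -> str:
    """One NDJSON line with keys sorted, including keys of nested objects."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"))
```

Each event becomes one line of the output file, so a writer would join the results of `encode_event` with newline characters.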
### Event types
```jsonc
{
  "type": "deno.module.load", // required
  "ts": "2025-11-17T12:00:00.123Z", // required
  "module": {
    "specifier": "file:///src/app/main.ts", // original
    "normalized": "app/main.ts",
    "path_sha256": "..."
  },
  "reason": "dynamic-import", // static-import | dynamic-import | npm | cache | bundle
  "permissions": ["fs", "net"], // granted at time of load
  "origin": "https://deno.land/x/std@0.208.0/http/server.ts" // optional for remote/npm
}
```
```jsonc
{
  "type": "deno.permission.use",
  "ts": "2025-11-17T12:00:01.234Z",
  "permission": "ffi", // fs|net|env|ffi|process|crypto|worker
  "module": {
    "normalized": "native/mod.ts",
    "path_sha256": "..."
  },
  "details": "Deno.dlopen" // short reason code
}
```
```jsonc
{
  "type": "deno.npm.resolution",
  "ts": "2025-11-17T12:00:02.100Z",
  "specifier": "npm:chalk@5",
  "package": "chalk",
  "version": "5.3.0",
  "resolved": "file:///cache/npm/registry.npmjs.org/chalk/5.3.0",
  "exists": true
}
```
```jsonc
{
  "type": "deno.wasm.load",
  "ts": "2025-11-17T12:00:03.000Z",
  "module": {
    "normalized": "pkg/module.wasm",
    "path_sha256": "..."
  },
  "importer": "app/main.ts",
  "reason": "dynamic-import"
}
```
## Observation envelope (AnalysisStore)
Key: `ScanAnalysisKeys.DenoObservationPayload`
Payload fields:
- `analyzerId`: `deno`
- `kind`: `deno.runtime.v1`
- `mediaType`: `application/x-ndjson`
- `metadata` (map):
  - `deno.runtime.event_count`
  - `deno.runtime.permission_uses`
  - `deno.runtime.module_loads`
  - `deno.runtime.remote_origins` (comma-separated, sorted)
  - `deno.runtime.permissions` (unique perms CSV)
  - `deno.runtime.npm_resolutions`
  - `deno.runtime.wasm_loads`
  - `deno.runtime.dynamic_imports`
- `content`: gz-safe byte stream of NDJSON lines.
## Policy signal keys
Emit into Surface/Signals (namespaced `surface.lang.deno.*`) derived from observation digest + static analyzer outputs:
- `surface.lang.deno.permissions`: CSV of unique permissions seen (fs, net, env, ffi, process, crypto, worker).
- `surface.lang.deno.remote_origins`: CSV of normalized remote origins from module loads/fetches.
- `surface.lang.deno.npm_modules`: integer count of npm resolutions observed.
- `surface.lang.deno.wasm_modules`: integer count of wasm loads.
- `surface.lang.deno.dynamic_imports`: integer count of `deno.module.load` events where `reason=dynamic-import`.
- `surface.lang.deno.capabilities`: CSV of capability reason codes from static analyzer (`builtin.*`) merged with runtime permissions.
- `surface.lang.deno.module_loads`: integer count of module load events.
- `surface.lang.deno.permission_uses`: integer count of permission use events.
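
The signal keys listed above could be derived from a runtime trace roughly as follows. This is a sketch under stated assumptions: `derive_signals` is a hypothetical name, permissions are taken from both load-time grants and permission-use events, and origins are passed through unchanged because the source does not define how they are normalized.

```python
import json


def derive_signals(ndjson: str, static_capabilities=()) -> dict:
    """Derive surface.lang.deno.* signals from NDJSON runtime events."""
    events = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
    loads = [e for e in events if e["type"] == "deno.module.load"]
    uses = [e for e in events if e["type"] == "deno.permission.use"]

    # Assumption: "permissions seen" covers grants at load time and explicit uses.
    permissions = set()
    for event in loads:
        permissions.update(event.get("permissions", []))
    for event in uses:
        permissions.add(event["permission"])

    # Origins are kept as-is; normalization is left unspecified in this sketch.
    origins = {e["origin"] for e in loads if e.get("origin")}

    def count(event_type: str) -> int:
        return sum(1 for e in events if e["type"] == event_type)

    return {
        "surface.lang.deno.permissions": ",".join(sorted(permissions)),
        "surface.lang.deno.remote_origins": ",".join(sorted(origins)),
        "surface.lang.deno.npm_modules": count("deno.npm.resolution"),
        "surface.lang.deno.wasm_modules": count("deno.wasm.load"),
        "surface.lang.deno.dynamic_imports": sum(
            1 for e in loads if e.get("reason") == "dynamic-import"
        ),
        "surface.lang.deno.capabilities": ",".join(
            sorted(set(static_capabilities) | permissions)
        ),
        "surface.lang.deno.module_loads": len(loads),
        "surface.lang.deno.permission_uses": len(uses),
    }
```

Sorting every CSV value keeps the output stable across runs, in line with the determinism rules of this contract.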
## CLI / Worker contracts
- CLI verb `stella deno trace --root <path>` writes `deno-runtime.ndjson` to output folder and prints observation hash.
- Worker: when `DenoRuntimeCapture:true`, analyzer writes observation to AnalysisStore and links hash in layer metadata `deno.observation.hash` (already produced by static analyzer) and new `deno.runtime.hash`.
## Determinism and safety
- No network fetches; trace operates on cached artifacts or harnessed execution with `--allow-all` disabled. Permissions recorded reflect requested grants; blanks treated as deny.
- Paths always normalized to forward slashes; hashing uses full relative path bytes.
- Redaction: no environment variable values or file contents persisted—only paths + hashes.
## Open follow-ups (to track in sprint)
- Map NDJSON to AOC writer once runtime ingestion lands (LANG-11-003 analogue for Deno).
- Add integration tests mirroring fixtures from DENO-26-008 with synthetic permission use and dynamic imports.


@@ -1,34 +1,39 @@
# Scheduler agent guide
## Mission
Scheduler detects advisory/VEX deltas, computes impact windows, and orchestrates re-evaluations across Scanner and Policy Engine.
Scheduler detects advisory/VEX deltas, computes impact windows, and orchestrates re-evaluations across Scanner and Policy Engine. Docs in this directory are the front-door contract for contributors.
## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)
## Working directory
- `docs/modules/scheduler` (docs-only); code changes live under `src/Scheduler/**` but must be coordinated via sprint plans.
## How to get started
1. Open sprint file `/docs/implplan/SPRINT_*.md` and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture and README for domain context before editing code or docs.
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Roles & owners
- **Docs author**: curates AGENTS/TASKS/runbooks; keeps determinism/offline guidance accurate.
- **Scheduler engineer (Worker/WebService)**: aligns implementation notes with architecture and ensures observability/runbook updates land with code.
- **Observability/Ops**: maintains dashboards/rules, documents operational SLOs and alert contracts.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.
## Required Reading
- `docs/modules/scheduler/README.md`
- `docs/modules/scheduler/architecture.md`
- `docs/modules/scheduler/implementation_plan.md`
- `docs/modules/platform/architecture-overview.md`
## Working Agreement
- 1. Update task status to `DOING`/`DONE` in both the corresponding sprint file `/docs/implplan/SPRINT_*.md` and the local `TASKS.md` when you start or finish work.
- 2. Review this charter and the Required Reading documents before coding; confirm prerequisites are met.
- 3. Keep changes deterministic (stable ordering, timestamps, hashes) and align with offline/air-gap expectations.
- 4. Coordinate doc updates, tests, and cross-guild communication whenever contracts or workflows change.
- 5. Revert to `TODO` if you pause the task without shipping changes; leave notes in commit/PR descriptions for context.
## How to work
1. Open the relevant sprint file in `docs/implplan/SPRINT_*.md` and set the task status to `DOING` there and in `docs/modules/scheduler/TASKS.md` before starting.
2. Confirm prerequisites above are read; note any missing contracts in sprint **Decisions & Risks**.
3. Keep outputs deterministic (stable ordering, UTC ISO-8601 timestamps, sorted lists) and offline-friendly (no external fetches without mirrors).
4. When changing behavior, update runbooks and observability assets in `./operations/`.
5. On completion, set status to `DONE` in both the sprint file and `TASKS.md`; if paused, revert to `TODO` and add a brief note.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see `../../ingestion/aggregation-only-contract.md`).
- No undocumented schema or API contract changes; document deltas in architecture or implementation_plan.
- Keep Offline Kit parity—document air-gapped workflows for any new feature.
- Prefer deterministic fixtures and avoid machine-specific artefacts in examples.
## Testing & determinism expectations
- Examples and snippets should be reproducible; pin sample timestamps to UTC and sort collections.
- Observability examples must align with published metric names and labels; update `operations/worker-prometheus-rules.yaml` if alert semantics change.
## Status mirrors
- Sprint tracker: `/docs/implplan/SPRINT_*.md` (source of record for Delivery Tracker).
- Local tracker: `docs/modules/scheduler/TASKS.md` (mirrors sprint status; keep in sync).


@@ -0,0 +1,14 @@
# Scheduler module task board
Keep this table in sync with sprint Delivery Trackers for the Scheduler docs/process stream.
| Task ID | Status | Owner(s) | Notes |
| --- | --- | --- | --- |
| SCHEDULER-DOCS-0001 | DONE | Docs Guild | AGENTS charter refreshed with roles/prereqs/determinism and cross-links. |
| SCHEDULER-ENG-0001 | DONE | Module Team | TASKS.md created; status mirror rules documented. |
| SCHEDULER-OPS-0001 | DONE | Ops Guild | Outcomes synced to sprint file and tasks-all tracker. |
## Status rules
- Update both this file and the relevant `docs/implplan/SPRINT_*.md` entry whenever you change a task state.
- Use TODO → DOING → DONE/BLOCKED. If you pause work, revert to TODO and leave a short note.
- Document contract or runbook changes in the appropriate module docs under this directory.