feat: Implement vulnerability token signing and verification utilities

- Added VulnTokenSigner for signing JWT tokens with specified algorithms and keys.
- Introduced VulnTokenUtilities for resolving tenant and subject claims, and sanitizing context dictionaries.
- Created VulnTokenVerificationUtilities for parsing tokens, verifying signatures, and deserializing payloads.
- Developed VulnWorkflowAntiForgeryTokenIssuer for issuing anti-forgery tokens with configurable options.
- Implemented VulnWorkflowAntiForgeryTokenVerifier for verifying anti-forgery tokens and validating payloads.
- Added AuthorityVulnerabilityExplorerOptions to manage configuration for vulnerability explorer features.
- Included tests for FilesystemPackRunDispatcher to ensure proper job handling under egress policy restrictions.
This commit is contained in:
master
2025-11-03 10:02:29 +02:00
parent bf2bf4b395
commit b1e78fe412
215 changed files with 19441 additions and 12185 deletions

View File

@@ -146,6 +146,48 @@ plan? = <plan name> // optional hint for UIs; not used for e
---
### 3.5 Vuln Explorer workflow safeguards
* **Anti-forgery flow** — Vuln Explorer's mutation verbs call:
* `POST /vuln/workflow/anti-forgery/issue`
* `POST /vuln/workflow/anti-forgery/verify`
Callers must hold `vuln:operate` scopes. Issued tokens embed the actor, tenant, whitelisted actions, ABAC selectors (environment/owner/business tier), and optional context key/value pairs. Tokens are EdDSA/ES256-signed via the primary Authority signing key and default to a 10-minute TTL (cap: 30 minutes). Verification enforces nonce reuse prevention, tenant match, and action membership before forwarding the request to Vuln Explorer. An illustrative issue/verify exchange is sketched after the configuration block below.
* **Attachment access** — Evidence bundles and attachments reference a ledger hash. Vuln Explorer obtains a scoped download token through:
* `POST /vuln/attachments/tokens/issue`
* `POST /vuln/attachments/tokens/verify`
These tokens bind the ledger event hash, attachment identifier, optional finding/content metadata, and the actor. They default to a 30-minute TTL (cap: 4 hours) and require `vuln:investigate`.
* **Audit trail** — Both flows emit `vuln.workflow.csrf.*` and `vuln.attachment.token.*` audit records with tenant, actor, ledger hash, nonce, and filtered context metadata so Offline Kit operators can reconcile actions against ledger entries.
* **Configuration**
```yaml
authority:
vulnerabilityExplorer:
workflow:
antiForgery:
enabled: true
audience: "stellaops:vuln-workflow"
defaultLifetime: "00:10:00"
maxLifetime: "00:30:00"
maxContextEntries: 16
maxContextValueLength: 256
attachments:
enabled: true
defaultLifetime: "00:30:00"
maxLifetime: "04:00:00"
payloadType: "application/vnd.stellaops.vuln-attachment-token+json"
maxMetadataEntries: 16
maxMetadataValueLength: 512
```
Air-gapped bundles include the signing key material and policy snapshots required to validate these tokens offline.
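For orientation, a minimal issue/verify exchange could look like the sketch below. Field names are illustrative, derived from the claim list above, and are not the normative contract.
```
POST /vuln/workflow/anti-forgery/issue          (scope: vuln:operate)
body: { tenant, actor, actions: ["comment", "assign"], selectors: { environment, owner, businessTier }, context?: { ticket: "INC-1234" }, lifetime?: "00:10:00" }
→ { token, nonce, expiresAt }

POST /vuln/workflow/anti-forgery/verify         (scope: vuln:operate)
body: { token, tenant, action: "comment", nonce }
→ { valid: true | false, reason? }
```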
---
## 4) Audiences, scopes & RBAC
### 4.1 Audiences

View File

@@ -1,7 +1,7 @@
# Vexer agent guide
# Excitor agent guide
## Mission
Vexer computes deterministic consensus across VEX claims, preserving conflicts and producing attestable evidence for policy suppression.
Excitor computes deterministic consensus across VEX claims, preserving conflicts and producing attestable evidence for policy suppression.
## Key docs
- [Module README](./README.md)
@@ -21,9 +21,9 @@ Vexer computes deterministic consensus across VEX claims, preserving conflicts a
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.
## Required Reading
- `docs/modules/vexer/README.md`
- `docs/modules/vexer/architecture.md`
- `docs/modules/vexer/implementation_plan.md`
- `docs/modules/excitor/README.md`
- `docs/modules/excitor/architecture.md`
- `docs/modules/excitor/implementation_plan.md`
- `docs/modules/platform/architecture-overview.md`
## Working Agreement

View File

@@ -1,34 +1,34 @@
# StellaOps Vexer
Vexer computes deterministic consensus across VEX claims, preserving conflicts and producing attestable evidence for policy suppression.
## Responsibilities
- Ingest Excititor observations and compute per-product consensus snapshots.
- Provide APIs for querying canonical VEX positions and conflict sets.
- Publish exports and DSSE-ready digests for downstream consumption.
- Keep provenance weights and disagreement metadata.
## Key components
- Consensus engine and API host in `StellaOps.Vexer.*` (to-be-implemented).
- Storage schema for consensus graphs.
- Integration hooks for Policy Engine suppression logic.
## Integrations & dependencies
- Excititor for raw observations.
- Policy Engine and UI for suppression stories.
- CLI for evidence inspection.
## Operational notes
- Deterministic consensus algorithms (see architecture).
- Planned telemetry for disagreement counts and freshness.
- Offline exports aligning with Concelier/Excititor timelines.
## Related resources
- ./scoring.md
## Backlog references
- DOCS-VEXER backlog referenced in architecture doc.
- CLI parity tracked in ../../TASKS.md (CLI-GRAPH/VEX stories).
## Epic alignment
- **Epic 7 VEX Consensus Lens:** deliver trust-weighted consensus snapshots, disagreement metadata, and explain APIs.
# StellaOps Excitor
Excitor computes deterministic consensus across VEX claims, preserving conflicts and producing attestable evidence for policy suppression.
## Responsibilities
- Ingest Excititor observations and compute per-product consensus snapshots.
- Provide APIs for querying canonical VEX positions and conflict sets.
- Publish exports and DSSE-ready digests for downstream consumption.
- Keep provenance weights and disagreement metadata.
## Key components
- Consensus engine and API host in `StellaOps.Excitor.*` (to-be-implemented).
- Storage schema for consensus graphs.
- Integration hooks for Policy Engine suppression logic.
## Integrations & dependencies
- Excititor for raw observations.
- Policy Engine and UI for suppression stories.
- CLI for evidence inspection.
## Operational notes
- Deterministic consensus algorithms (see architecture).
- Planned telemetry for disagreement counts and freshness.
- Offline exports aligning with Concelier/Excititor timelines.
## Related resources
- ./scoring.md
## Backlog references
- DOCS-EXCITOR backlog referenced in architecture doc.
- CLI parity tracked in ../../TASKS.md (CLI-GRAPH/VEX stories).
## Epic alignment
- **Epic 7 VEX Consensus Lens:** deliver trust-weighted consensus snapshots, disagreement metadata, and explain APIs.

View File

@@ -0,0 +1,9 @@
# Task board — Excitor
> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.
| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| EXCITOR-DOCS-0001 | DOING (2025-10-29) | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| EXCITOR-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| EXCITOR-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |

View File

@@ -1,465 +1,465 @@
# component_architecture_vexer.md — **StellaOps Vexer** (2025Q4)
# component_architecture_excitor.md — **StellaOps Excitor** (2025Q4)
> Built to satisfy Epic 7 VEX Consensus Lens requirements.
> **Scope.** This document specifies the **Vexer** service: its purpose, trust model, data structures, APIs, plugin contracts, storage schema, normalization/consensus algorithms, performance budgets, testing matrix, and how it integrates with Scanner, Policy, Feedser, and the attestation chain. It is implementation-ready.
---
## 0) Mission & role in the platform
**Mission.** Convert heterogeneous **VEX** statements (OpenVEX, CSAF VEX, CycloneDX VEX; vendor/distro/platform sources) into **canonical, queryable claims**; compute **deterministic consensus** per *(vuln, product)*; preserve **conflicts with provenance**; publish **stable, attestable exports** that the backend uses to suppress non-exploitable findings, prioritize remaining risk, and explain decisions.
**Boundaries.**
* Vexer **does not** decide PASS/FAIL. It supplies **evidence** (statuses + justifications + provenance weights).
* Vexer preserves **conflicting claims** unchanged; consensus encodes how we would pick, but the raw set is always exportable.
* VEX consumption is **backend-only**: Scanner never applies VEX. The backend's **Policy Engine** asks Vexer for status evidence and then decides what to show.
---
## 1) Inputs, outputs & canonical domain
### 1.1 Accepted input formats (ingest)
* **OpenVEX** JSON documents (attested or raw).
* **CSAF VEX** 2.x (vendor PSIRTs and distros commonly publish CSAF).
* **CycloneDX VEX** 1.4+ (standalone VEX or embedded VEX blocks).
* **OCI-attached attestations** (VEX statements shipped as OCI referrers) — optional connectors.
All connectors register **source metadata**: provider identity, trust tier, signature expectations (PGP/cosign/PKI), fetch windows, rate limits, and time anchors.
### 1.2 Canonical model (normalized)
Every incoming statement becomes a set of **VexClaim** records:
```
VexClaim
- providerId // 'redhat', 'suse', 'ubuntu', 'github', 'vendorX'
- vulnId // 'CVE-2025-12345', 'GHSA-xxxx', canonicalized
- productKey // canonical product identity (see §2.2)
- status // affected | not_affected | fixed | under_investigation
- justification? // for 'not_affected'/'affected' where provided
- introducedVersion? // semantics per provider (range or exact)
- fixedVersion? // where provided (range or exact)
- lastObserved // timestamp from source or fetch time
- provenance // doc digest, signature status, fetch URI, line/offset anchors
- evidence[] // raw source snippets for explainability
- supersedes? // optional cross-doc chain (docDigest → docDigest)
```
### 1.3 Exports (consumption)
* **VexConsensus** per `(vulnId, productKey)` with:
* `rollupStatus` (after policy weights/justification gates),
* `sources[]` (winning + losing claims with weights & reasons),
* `policyRevisionId` (identifier of the Vexer policy used),
* `consensusDigest` (stable SHA256 over canonical JSON).
* **Raw claims** export for auditing (unchanged, with provenance).
* **Provider snapshots** (per source, last N days) for operator debugging.
* **Index** optimized for backend joins: `(productKey, vulnId) → (status, confidence, sourceSet)`.
All exports are **deterministic**, and (optionally) **attested** via DSSE and logged to Rekor v2.
---
## 2) Identity model — products & joins
### 2.1 Vuln identity
* Accepts **CVE**, **GHSA**, vendor IDs (MSRC, RHSA…), distro IDs (DSA/USN/RHSA…) — normalized to `vulnId` with alias sets.
* **Alias graph** maintained (from Feedser) to map vendor/distro IDs → CVE (primary) and to **GHSA** where applicable.
### 2.2 Product identity (`productKey`)
* **Primary:** `purl` (Package URL).
* **Secondary links:** `cpe`, **OS package NVRA/EVR**, NuGet/Maven/Golang identity, and **OS package name** when purl unavailable.
* **Fallback:** `oci:<registry>/<repo>@<digest>` for image-level VEX.
* **Special cases:** kernel modules, firmware, platforms → provider-specific mapping helpers (the connector captures the provider's product taxonomy → canonical `productKey`).
> Vexer does not invent identities. If a provider cannot be mapped to purl/CPE/NVRA deterministically, we keep the native **product string** and mark the claim as **non-joinable**; the backend will ignore it unless a policy explicitly whitelists that provider mapping.
---
## 3) Storage schema (MongoDB)
Database: `vexer`
### 3.1 Collections
**`vex.providers`**
```
_id: providerId
name, homepage, contact
trustTier: enum {vendor, distro, platform, hub, attestation}
signaturePolicy: { type: pgp|cosign|x509|none, keys[], certs[], cosignKeylessRoots[] }
fetch: { baseUrl, kind: http|oci|file, rateLimit, etagSupport, windowDays }
enabled: bool
createdAt, modifiedAt
```
**`vex.raw`** (immutable raw documents)
```
_id: sha256(doc bytes)
providerId
uri
ingestedAt
contentType
sig: { verified: bool, method: pgp|cosign|x509|none, keyId|certSubject, bundle? }
payload: GridFS pointer (if large)
disposition: kept|replaced|superseded
correlation: { replaces?: sha256, replacedBy?: sha256 }
```
**`vex.claims`** (normalized rows; dedupe on providerId+vulnId+productKey+docDigest)
```
_id
providerId
vulnId
productKey
status
justification?
introducedVersion?
fixedVersion?
lastObserved
docDigest
provenance { uri, line?, pointer?, signatureState }
evidence[] { key, value, locator }
indices:
- {vulnId:1, productKey:1}
- {providerId:1, lastObserved:-1}
- {status:1}
- text index (optional) on evidence.value for debugging
```
**`vex.consensus`** (rollups)
```
_id: sha256(canonical(vulnId, productKey, policyRevision))
vulnId
productKey
rollupStatus
sources[]: [
{ providerId, status, justification?, weight, lastObserved, accepted:bool, reason }
]
policyRevisionId
evaluatedAt
consensusDigest // same as _id
indices:
- {vulnId:1, productKey:1}
- {policyRevisionId:1, evaluatedAt:-1}
```
**`vex.exports`** (manifest of emitted artifacts)
```
_id
querySignature
format: raw|consensus|index
artifactSha256
rekor { uuid, index, url }?
createdAt
policyRevisionId
cacheable: bool
```
**`vex.cache`**
```
querySignature -> exportId (for fast reuse)
ttl, hits
```
**`vex.migrations`**
* ordered migrations applied at bootstrap to ensure indexes.
### 3.2 Indexing strategy
* Hot path queries use exact `(vulnId, productKey)` and time-bounded windows; compound indexes cover both.
* Providers list view by `lastObserved` for monitoring staleness.
* `vex.consensus` keyed by `(vulnId, productKey, policyRevision)` for deterministic reuse.
---
## 4) Ingestion pipeline
### 4.1 Connector contract
```csharp
public interface IVexConnector
{
string ProviderId { get; }
Task FetchAsync(VexConnectorContext ctx, CancellationToken ct); // raw docs
Task NormalizeAsync(VexConnectorContext ctx, CancellationToken ct); // raw -> VexClaim[]
}
```
* **Fetch** must implement: window scheduling, conditional GET (ETag/If-Modified-Since), rate limiting, retry/backoff.
* **Normalize** parses the format, validates schema, maps product identities deterministically, emits `VexClaim` records with **provenance**.
### 4.2 Signature verification (per provider)
* **cosign (keyless or keyful)** for OCI referrers or HTTP-served JSON with Sigstore bundles.
* **PGP** (provider keyrings) for distro/vendor feeds that sign docs.
* **x509** (mutual TLS / provider-pinned certs) where applicable.
* Signature state is stored on **vex.raw.sig** and copied into **provenance.signatureState** on claims.
> Claims from sources failing signature policy are marked `"signatureState.verified=false"` and **policy** can downweight or ignore them.
### 4.3 Time discipline
* For each doc, prefer the **provider's document timestamp**; if absent, use fetch time.
* Claims carry `lastObserved` which drives **tiebreaking** within equal weight tiers.
---
## 5) Normalization: product & status semantics
### 5.1 Product mapping
* **purl** first; **cpe** second; OS package NVRA/EVR mapping helpers (distro connectors) produce purls via canonical tables (e.g., rpm→purl:rpm, deb→purl:deb).
* Where a provider publishes **platform-level** VEX (e.g., “RHEL 9 not affected”), connectors expand to known product inventory rules (e.g., map to sets of packages/components shipped in the platform). Expansion tables are versioned and kept per provider; every expansion emits **evidence** indicating the rule applied.
* If expansion would be speculative, the claim remains **platform-scoped** with `productKey="platform:redhat:rhel:9"` and is flagged **non-joinable**; the backend can decide to use platform VEX only when Scanner proves the platform runtime.
### 5.2 Status + justification mapping
* Canonical **status**: `affected | not_affected | fixed | under_investigation`.
* **Justifications** normalized to a controlled vocabulary (CISA-aligned), e.g.:
* `component_not_present`
* `vulnerable_code_not_in_execute_path`
* `vulnerable_configuration_unused`
* `inline_mitigation_applied`
* `fix_available` (with `fixedVersion`)
* `under_investigation`
* Providers with free-text justifications are mapped by deterministic tables; raw text preserved as `evidence`.
---
## 6) Consensus algorithm
**Goal:** produce a **stable**, explainable `rollupStatus` per `(vulnId, productKey)` given possibly conflicting claims.
### 6.1 Inputs
* Set **S** of `VexClaim` for the key.
* **Vexer policy snapshot**:
* **weights** per provider tier and per provider overrides.
* **justification gates** (e.g., require justification for `not_affected` to be acceptable).
* **minEvidence** rules (e.g., `not_affected` must come from ≥1 vendor or 2 distros).
* **signature requirements** (e.g., require verified signature for fixed to be considered).
### 6.2 Steps
1. **Filter invalid** claims by signature policy & justification gates → set `S'`.
2. **Score** each claim:
`score = weight(provider) * freshnessFactor(lastObserved)` where freshnessFactor ∈ [0.8, 1.0] for staleness decay (configurable; small effect).
3. **Aggregate** scores per status: `W(status) = Σ score(claims with that status)`.
4. **Pick** `rollupStatus = argmax_status W(status)`.
5. **Tiebreakers** (in order):
* Higher **max single** provider score wins (vendor > distro > platform > hub).
* More **recent** lastObserved wins.
* Deterministic lexicographic order of status (`fixed` > `not_affected` > `under_investigation` > `affected`) as final tiebreaker.
6. **Explain**: mark accepted sources (`accepted=true; reason="weight"`/`"freshness"`), mark rejected sources with explicit `reason` (`"insufficient_justification"`, `"signature_unverified"`, `"lower_weight"`).
> The algorithm is **pure** given S and policy snapshot; result is reproducible and hashed into `consensusDigest`.
---
## 7) Query & export APIs
All endpoints are versioned under `/api/v1/vex`.
### 7.1 Query (online)
```
POST /claims/search
body: { vulnIds?: string[], productKeys?: string[], providers?: string[], since?: timestamp, limit?: int, pageToken?: string }
→ { claims[], nextPageToken? }
POST /consensus/search
body: { vulnIds?: string[], productKeys?: string[], policyRevisionId?: string, since?: timestamp, limit?: int, pageToken?: string }
→ { entries[], nextPageToken? }
POST /excititor/resolve (scope: vex.read)
body: { productKeys?: string[], purls?: string[], vulnerabilityIds: string[], policyRevisionId?: string }
→ { policy, resolvedAt, results: [ { vulnerabilityId, productKey, status, sources[], conflicts[], decisions[], signals?, summary?, envelope: { artifact, contentSignature?, attestation?, attestationEnvelope?, attestationSignature? } } ] }
```
### 7.2 Exports (cacheable snapshots)
```
POST /exports
body: { signature: { vulnFilter?, productFilter?, providers?, since? }, format: raw|consensus|index, policyRevisionId?: string, force?: bool }
→ { exportId, artifactSha256, rekor? }
GET /exports/{exportId} → bytes (application/json or binary index)
GET /exports/{exportId}/meta → { signature, policyRevisionId, createdAt, artifactSha256, rekor? }
```
### 7.3 Provider operations
```
GET /providers → provider list & signature policy
POST /providers/{id}/refresh → trigger fetch/normalize window
GET /providers/{id}/status → last fetch, doc counts, signature stats
```
**Auth:** service-to-service via Authority tokens; operator operations via UI/CLI with RBAC.
---
## 8) Attestation integration
* Exports can be **DSSE-signed** via **Signer** and logged to **Rekor v2** via **Attestor** (optional but recommended for regulated pipelines).
* `vex.exports.rekor` stores `{uuid, index, url}` when present.
* **Predicate type**: `https://stella-ops.org/attestations/vex-export/1` with fields:
* `querySignature`, `policyRevisionId`, `artifactSha256`, `createdAt`.
---
## 9) Configuration (YAML)
```yaml
vexer:
mongo: { uri: "mongodb://mongo/vexer" }
s3:
endpoint: http://minio:9000
bucket: stellaops
policy:
weights:
vendor: 1.0
distro: 0.9
platform: 0.7
hub: 0.5
attestation: 0.6
providerOverrides:
redhat: 1.0
suse: 0.95
requireJustificationForNotAffected: true
signatureRequiredForFixed: true
minEvidence:
not_affected:
vendorOrTwoDistros: true
connectors:
- providerId: redhat
kind: csaf
baseUrl: https://access.redhat.com/security/data/csaf/v2/
signaturePolicy: { type: pgp, keys: [ "…redhat-pgp-key…" ] }
windowDays: 7
- providerId: suse
kind: csaf
baseUrl: https://ftp.suse.com/pub/projects/security/csaf/
signaturePolicy: { type: pgp, keys: [ "…suse-pgp-key…" ] }
- providerId: ubuntu
kind: openvex
baseUrl: https://…/vex/
signaturePolicy: { type: none }
- providerId: vendorX
kind: cyclonedx-vex
ociRef: ghcr.io/vendorx/vex@sha256:
signaturePolicy: { type: cosign, cosignKeylessRoots: [ "sigstore-root" ] }
```
---
## 10) Security model
* **Input signature verification** enforced per provider policy (PGP, cosign, x509).
* **Connector allowlists**: outbound fetch constrained to configured domains.
* **Tenant isolation**: per-tenant DB prefixes or separate DBs; per-tenant S3 prefixes; per-tenant policies.
* **AuthN/Z**: Authority-issued OpToks; RBAC roles (`vex.read`, `vex.admin`, `vex.export`).
* **No secrets in logs**; deterministic logging contexts include providerId, docDigest, claim keys.
---
## 11) Performance & scale
* **Targets:**
* Normalize 10k VEX claims/minute/core.
* Consensus compute ≤50ms for 1k unique `(vuln, product)` pairs in hot cache.
* Export (consensus) 1M rows in ≤60s on 8 cores with streaming writer.
* **Scaling:**
* WebService handles control APIs; **Worker** background services (same image) execute fetch/normalize in parallel with rate limits; Mongo writes batched; upserts by natural keys.
* Exports stream straight to S3 (MinIO) with rolling buffers.
* **Caching:**
* `vex.cache` maps query signatures → export; TTL to avoid stampedes; optimistic reuse unless `force`.
---
## 12) Observability
* **Metrics:**
* `vex.ingest.docs_total{provider}`
* `vex.normalize.claims_total{provider}`
* `vex.signature.failures_total{provider,method}`
* `vex.consensus.conflicts_total{vulnId}`
* `vex.exports.bytes{format}` / `vex.exports.latency_seconds`
* **Tracing:** spans for fetch, verify, parse, map, consensus, export.
* **Dashboards:** provider staleness, top conflicting vulns/components, signature posture, export cache hit rate.
---
## 13) Testing matrix
* **Connectors:** golden raw docs → deterministic claims (fixtures per provider/format).
* **Signature policies:** valid/invalid PGP/cosign/x509 samples; ensure rejects are recorded but not accepted.
* **Normalization edge cases:** platform-only claims, free-text justifications, non-purl products.
* **Consensus:** conflict scenarios across tiers; check tiebreakers; justification gates.
* **Performance:** 1M-row export timing; memory ceilings; stream correctness.
* **Determinism:** same inputs + policy → identical `consensusDigest` and export bytes.
* **API contract tests:** pagination, filters, RBAC, rate limits.
---
## 14) Integration points
* **Backend Policy Engine** (in Scanner.WebService): calls `POST /excititor/resolve` (scope `vex.read`) with batched `(purl, vulnId)` pairs to fetch `rollupStatus + sources`.
* **Feedser**: provides alias graph (CVE↔vendor IDs) and may supply VEX-adjacent metadata (e.g., KEV flag) for policy escalation.
* **UI**: VEX explorer screens use `/claims/search` and `/consensus/search`; show conflicts & provenance.
* **CLI**: `stellaops vex export --consensus --since 7d --out vex.json` for audits.
---
## 15) Failure modes & fallback
* **Provider unreachable:** stale thresholds trigger warnings; policy can downweight stale providers automatically (freshness factor).
* **Signature outage:** continue to ingest but mark `signatureState.verified=false`; consensus will likely exclude or downweight per policy.
* **Schema drift:** unknown fields are preserved as `evidence`; normalization rejects only on **invalid identity** or **status**.
---
## 16) Rollout plan (incremental)
1. **MVP**: OpenVEX + CSAF connectors for 3 major providers (e.g., Red Hat/SUSE/Ubuntu), normalization + consensus + `/excititor/resolve`.
2. **Signature policies**: PGP for distros; cosign for OCI.
3. **Exports + optional attestation**.
4. **CycloneDX VEX** connectors; platform claim expansion tables; UI explorer.
5. **Scale hardening**: export indexes; conflict analytics.
---
## 17) Appendix — canonical JSON (stable ordering)
All exports and consensus entries are serialized via `VexCanonicalJsonSerializer`:
* UTF-8 without BOM;
* keys sorted (ASCII);
* arrays sorted by `(providerId, vulnId, productKey, lastObserved)` unless semantic order mandated;
* timestamps in `YYYY-MM-DDThh:mm:ssZ`;
* no insignificant whitespace.
> **Scope.** This document specifies the **Excitor** service: its purpose, trust model, data structures, APIs, plugin contracts, storage schema, normalization/consensus algorithms, performance budgets, testing matrix, and how it integrates with Scanner, Policy, Concelier, and the attestation chain. It is implementation-ready.
---
## 0) Mission & role in the platform
**Mission.** Convert heterogeneous **VEX** statements (OpenVEX, CSAF VEX, CycloneDX VEX; vendor/distro/platform sources) into **canonical, queryable claims**; compute **deterministic consensus** per *(vuln, product)*; preserve **conflicts with provenance**; publish **stable, attestable exports** that the backend uses to suppress non-exploitable findings, prioritize remaining risk, and explain decisions.
**Boundaries.**
* Excitor **does not** decide PASS/FAIL. It supplies **evidence** (statuses + justifications + provenance weights).
* Excitor preserves **conflicting claims** unchanged; consensus encodes how we would pick, but the raw set is always exportable.
* VEX consumption is **backend-only**: Scanner never applies VEX. The backend's **Policy Engine** asks Excitor for status evidence and then decides what to show.
---
## 1) Inputs, outputs & canonical domain
### 1.1 Accepted input formats (ingest)
* **OpenVEX** JSON documents (attested or raw).
* **CSAF VEX** 2.x (vendor PSIRTs and distros commonly publish CSAF).
* **CycloneDX VEX** 1.4+ (standalone VEX or embedded VEX blocks).
* **OCI-attached attestations** (VEX statements shipped as OCI referrers) — optional connectors.
All connectors register **source metadata**: provider identity, trust tier, signature expectations (PGP/cosign/PKI), fetch windows, rate limits, and time anchors.
### 1.2 Canonical model (normalized)
Every incoming statement becomes a set of **VexClaim** records:
```
VexClaim
- providerId // 'redhat', 'suse', 'ubuntu', 'github', 'vendorX'
- vulnId // 'CVE-2025-12345', 'GHSA-xxxx', canonicalized
- productKey // canonical product identity (see §2.2)
- status // affected | not_affected | fixed | under_investigation
- justification? // for 'not_affected'/'affected' where provided
- introducedVersion? // semantics per provider (range or exact)
- fixedVersion? // where provided (range or exact)
- lastObserved // timestamp from source or fetch time
- provenance // doc digest, signature status, fetch URI, line/offset anchors
- evidence[] // raw source snippets for explainability
- supersedes? // optional cross-doc chain (docDigest → docDigest)
```
### 1.3 Exports (consumption)
* **VexConsensus** per `(vulnId, productKey)` with:
* `rollupStatus` (after policy weights/justification gates),
* `sources[]` (winning + losing claims with weights & reasons),
* `policyRevisionId` (identifier of the Excitor policy used),
* `consensusDigest` (stable SHA256 over canonical JSON).
* **Raw claims** export for auditing (unchanged, with provenance).
* **Provider snapshots** (per source, last N days) for operator debugging.
* **Index** optimized for backend joins: `(productKey, vulnId) → (status, confidence, sourceSet)`.
All exports are **deterministic**, and (optionally) **attested** via DSSE and logged to Rekor v2.
---
## 2) Identity model — products & joins
### 2.1 Vuln identity
* Accepts **CVE**, **GHSA**, vendor IDs (MSRC, RHSA…), distro IDs (DSA/USN/RHSA…) — normalized to `vulnId` with alias sets.
* **Alias graph** maintained (from Concelier) to map vendor/distro IDs → CVE (primary) and to **GHSA** where applicable.
### 2.2 Product identity (`productKey`)
* **Primary:** `purl` (Package URL).
* **Secondary links:** `cpe`, **OS package NVRA/EVR**, NuGet/Maven/Golang identity, and **OS package name** when purl unavailable.
* **Fallback:** `oci:<registry>/<repo>@<digest>` for image-level VEX.
* **Special cases:** kernel modules, firmware, platforms → provider-specific mapping helpers (the connector captures the provider's product taxonomy → canonical `productKey`).
> Excitor does not invent identities. If a provider cannot be mapped to purl/CPE/NVRA deterministically, we keep the native **product string** and mark the claim as **non-joinable**; the backend will ignore it unless a policy explicitly whitelists that provider mapping.
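For illustration, `productKey` values produced under these rules might look like the following (hypothetical examples, not an exhaustive registry):
```
pkg:rpm/redhat/openssl@3.0.7-16.el9?arch=x86_64   // primary purl identity
pkg:deb/ubuntu/libssl3@3.0.2-0ubuntu1.18          // distro package mapped to purl
oci:registry.example.com/acme/api@sha256:…        // image-level fallback
platform:redhat:rhel:9                            // platform-scoped, non-joinable
```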
---
## 3) Storage schema (MongoDB)
Database: `excitor`
### 3.1 Collections
**`vex.providers`**
```
_id: providerId
name, homepage, contact
trustTier: enum {vendor, distro, platform, hub, attestation}
signaturePolicy: { type: pgp|cosign|x509|none, keys[], certs[], cosignKeylessRoots[] }
fetch: { baseUrl, kind: http|oci|file, rateLimit, etagSupport, windowDays }
enabled: bool
createdAt, modifiedAt
```
**`vex.raw`** (immutable raw documents)
```
_id: sha256(doc bytes)
providerId
uri
ingestedAt
contentType
sig: { verified: bool, method: pgp|cosign|x509|none, keyId|certSubject, bundle? }
payload: GridFS pointer (if large)
disposition: kept|replaced|superseded
correlation: { replaces?: sha256, replacedBy?: sha256 }
```
**`vex.claims`** (normalized rows; dedupe on providerId+vulnId+productKey+docDigest)
```
_id
providerId
vulnId
productKey
status
justification?
introducedVersion?
fixedVersion?
lastObserved
docDigest
provenance { uri, line?, pointer?, signatureState }
evidence[] { key, value, locator }
indices:
- {vulnId:1, productKey:1}
- {providerId:1, lastObserved:-1}
- {status:1}
- text index (optional) on evidence.value for debugging
```
**`vex.consensus`** (rollups)
```
_id: sha256(canonical(vulnId, productKey, policyRevision))
vulnId
productKey
rollupStatus
sources[]: [
{ providerId, status, justification?, weight, lastObserved, accepted:bool, reason }
]
policyRevisionId
evaluatedAt
consensusDigest // same as _id
indices:
- {vulnId:1, productKey:1}
- {policyRevisionId:1, evaluatedAt:-1}
```
**`vex.exports`** (manifest of emitted artifacts)
```
_id
querySignature
format: raw|consensus|index
artifactSha256
rekor { uuid, index, url }?
createdAt
policyRevisionId
cacheable: bool
```
**`vex.cache`**
```
querySignature -> exportId (for fast reuse)
ttl, hits
```
**`vex.migrations`**
* ordered migrations applied at bootstrap to ensure indexes.
### 3.2 Indexing strategy
* Hot path queries use exact `(vulnId, productKey)` and time-bounded windows; compound indexes cover both.
* Providers list view by `lastObserved` for monitoring staleness.
* `vex.consensus` keyed by `(vulnId, productKey, policyRevision)` for deterministic reuse.
---
## 4) Ingestion pipeline
### 4.1 Connector contract
```csharp
public interface IVexConnector
{
string ProviderId { get; }
Task FetchAsync(VexConnectorContext ctx, CancellationToken ct); // raw docs
Task NormalizeAsync(VexConnectorContext ctx, CancellationToken ct); // raw -> VexClaim[]
}
```
* **Fetch** must implement: window scheduling, conditional GET (ETag/If-Modified-Since), rate limiting, retry/backoff.
* **Normalize** parses the format, validates schema, maps product identities deterministically, emits `VexClaim` records with **provenance**.
### 4.2 Signature verification (per provider)
* **cosign (keyless or keyful)** for OCI referrers or HTTP-served JSON with Sigstore bundles.
* **PGP** (provider keyrings) for distro/vendor feeds that sign docs.
* **x509** (mutual TLS / provider-pinned certs) where applicable.
* Signature state is stored on **vex.raw.sig** and copied into **provenance.signatureState** on claims.
> Claims from sources failing signature policy are marked `"signatureState.verified=false"` and **policy** can downweight or ignore them.
### 4.3 Time discipline
* For each doc, prefer the **provider's document timestamp**; if absent, use fetch time.
* Claims carry `lastObserved` which drives **tiebreaking** within equal weight tiers.
---
## 5) Normalization: product & status semantics
### 5.1 Product mapping
* **purl** first; **cpe** second; OS package NVRA/EVR mapping helpers (distro connectors) produce purls via canonical tables (e.g., rpm→purl:rpm, deb→purl:deb).
* Where a provider publishes **platform-level** VEX (e.g., “RHEL 9 not affected”), connectors expand to known product inventory rules (e.g., map to sets of packages/components shipped in the platform). Expansion tables are versioned and kept per provider; every expansion emits **evidence** indicating the rule applied.
* If expansion would be speculative, the claim remains **platform-scoped** with `productKey="platform:redhat:rhel:9"` and is flagged **non-joinable**; the backend can decide to use platform VEX only when Scanner proves the platform runtime.
### 5.2 Status + justification mapping
* Canonical **status**: `affected | not_affected | fixed | under_investigation`.
* **Justifications** normalized to a controlled vocabulary (CISA-aligned), e.g.:
* `component_not_present`
* `vulnerable_code_not_in_execute_path`
* `vulnerable_configuration_unused`
* `inline_mitigation_applied`
* `fix_available` (with `fixedVersion`)
* `under_investigation`
* Providers with free-text justifications are mapped by deterministic tables; raw text preserved as `evidence`.
---
## 6) Consensus algorithm
**Goal:** produce a **stable**, explainable `rollupStatus` per `(vulnId, productKey)` given possibly conflicting claims.
### 6.1 Inputs
* Set **S** of `VexClaim` for the key.
* **Excitor policy snapshot**:
* **weights** per provider tier and per provider overrides.
* **justification gates** (e.g., require justification for `not_affected` to be acceptable).
* **minEvidence** rules (e.g., `not_affected` must come from ≥1 vendor or 2 distros).
* **signature requirements** (e.g., require verified signature for fixed to be considered).
### 6.2 Steps
1. **Filter invalid** claims by signature policy & justification gates → set `S'`.
2. **Score** each claim:
`score = weight(provider) * freshnessFactor(lastObserved)` where freshnessFactor ∈ [0.8, 1.0] for staleness decay (configurable; small effect).
3. **Aggregate** scores per status: `W(status) = Σ score(claims with that status)`.
4. **Pick** `rollupStatus = argmax_status W(status)`.
5. **Tiebreakers** (in order):
* Higher **max single** provider score wins (vendor > distro > platform > hub).
* More **recent** lastObserved wins.
* Deterministic lexicographic order of status (`fixed` > `not_affected` > `under_investigation` > `affected`) as final tiebreaker.
6. **Explain**: mark accepted sources (`accepted=true; reason="weight"`/`"freshness"`), mark rejected sources with explicit `reason` (`"insufficient_justification"`, `"signature_unverified"`, `"lower_weight"`).
> The algorithm is **pure** given S and policy snapshot; result is reproducible and hashed into `consensusDigest`.
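To make the weighting and tiebreak order concrete, here is a minimal sketch in Python (illustrative only; the production engine also records the per-source accept/reject reasons described in step 6):
```python
# Sketch of the roll-up over pre-filtered claims S'.
# Each claim is (provider_id, status, weight, last_observed); last_observed must be comparable (e.g. datetime).
from collections import defaultdict

STATUS_ORDER = ["fixed", "not_affected", "under_investigation", "affected"]  # final deterministic tiebreak

def rollup_status(claims, freshness_factor=lambda last_observed: 1.0):
    scores = defaultdict(list)  # status -> [(score, last_observed), ...]
    for provider_id, status, weight, last_observed in claims:
        scores[status].append((weight * freshness_factor(last_observed), last_observed))
    # argmax of summed score; ties broken by max single provider score, recency, then STATUS_ORDER.
    return max(
        scores,
        key=lambda s: (
            sum(score for score, _ in scores[s]),
            max(score for score, _ in scores[s]),
            max(seen for _, seen in scores[s]),
            -STATUS_ORDER.index(s),
        ),
    )
```
With `freshness_factor` left at the constant 1.0, scores reduce to the configured provider weights.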
---
## 7) Query & export APIs
All endpoints are versioned under `/api/v1/vex`.
### 7.1 Query (online)
```
POST /claims/search
body: { vulnIds?: string[], productKeys?: string[], providers?: string[], since?: timestamp, limit?: int, pageToken?: string }
→ { claims[], nextPageToken? }
POST /consensus/search
body: { vulnIds?: string[], productKeys?: string[], policyRevisionId?: string, since?: timestamp, limit?: int, pageToken?: string }
→ { entries[], nextPageToken? }
POST /excititor/resolve (scope: vex.read)
body: { productKeys?: string[], purls?: string[], vulnerabilityIds: string[], policyRevisionId?: string }
→ { policy, resolvedAt, results: [ { vulnerabilityId, productKey, status, sources[], conflicts[], decisions[], signals?, summary?, envelope: { artifact, contentSignature?, attestation?, attestationEnvelope?, attestationSignature? } } ] }
```
### 7.2 Exports (cacheable snapshots)
```
POST /exports
body: { signature: { vulnFilter?, productFilter?, providers?, since? }, format: raw|consensus|index, policyRevisionId?: string, force?: bool }
→ { exportId, artifactSha256, rekor? }
GET /exports/{exportId} → bytes (application/json or binary index)
GET /exports/{exportId}/meta → { signature, policyRevisionId, createdAt, artifactSha256, rekor? }
```
### 7.3 Provider operations
```
GET /providers → provider list & signature policy
POST /providers/{id}/refresh → trigger fetch/normalize window
GET /providers/{id}/status → last fetch, doc counts, signature stats
```
**Auth:** service-to-service via Authority tokens; operator operations via UI/CLI with RBAC.
---
## 8) Attestation integration
* Exports can be **DSSE-signed** via **Signer** and logged to **Rekor v2** via **Attestor** (optional but recommended for regulated pipelines).
* `vex.exports.rekor` stores `{uuid, index, url}` when present.
* **Predicate type**: `https://stella-ops.org/attestations/vex-export/1` with fields:
* `querySignature`, `policyRevisionId`, `artifactSha256`, `createdAt`.
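A filled-in predicate might look like the sketch below. Values are illustrative, and the in-toto statement wrapper and subject layout are assumptions; only the predicate type and fields above are specified here.
```
{
  "_type": "https://in-toto.io/Statement/v1",
  "predicateType": "https://stella-ops.org/attestations/vex-export/1",
  "subject": [
    { "name": "consensus-export.json", "digest": { "sha256": "<artifactSha256>" } }
  ],
  "predicate": {
    "querySignature": "<canonical query signature>",
    "policyRevisionId": "rev-12",
    "artifactSha256": "<artifactSha256>",
    "createdAt": "2025-11-03T10:02:29Z"
  }
}
```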
---
## 9) Configuration (YAML)
```yaml
excitor:
mongo: { uri: "mongodb://mongo/excitor" }
s3:
endpoint: http://minio:9000
bucket: stellaops
policy:
weights:
vendor: 1.0
distro: 0.9
platform: 0.7
hub: 0.5
attestation: 0.6
providerOverrides:
redhat: 1.0
suse: 0.95
requireJustificationForNotAffected: true
signatureRequiredForFixed: true
minEvidence:
not_affected:
vendorOrTwoDistros: true
connectors:
- providerId: redhat
kind: csaf
baseUrl: https://access.redhat.com/security/data/csaf/v2/
signaturePolicy: { type: pgp, keys: [ "…redhat-pgp-key…" ] }
windowDays: 7
- providerId: suse
kind: csaf
baseUrl: https://ftp.suse.com/pub/projects/security/csaf/
signaturePolicy: { type: pgp, keys: [ "…suse-pgp-key…" ] }
- providerId: ubuntu
kind: openvex
baseUrl: https://…/vex/
signaturePolicy: { type: none }
- providerId: vendorX
kind: cyclonedx-vex
ociRef: ghcr.io/vendorx/vex@sha256:
signaturePolicy: { type: cosign, cosignKeylessRoots: [ "sigstore-root" ] }
```
---
## 10) Security model
* **Input signature verification** enforced per provider policy (PGP, cosign, x509).
* **Connector allowlists**: outbound fetch constrained to configured domains.
* **Tenant isolation**: per-tenant DB prefixes or separate DBs; per-tenant S3 prefixes; per-tenant policies.
* **AuthN/Z**: Authority-issued OpToks; RBAC roles (`vex.read`, `vex.admin`, `vex.export`).
* **No secrets in logs**; deterministic logging contexts include providerId, docDigest, claim keys.
---
## 11) Performance & scale
* **Targets:**
* Normalize 10k VEX claims/minute/core.
* Consensus compute ≤50ms for 1k unique `(vuln, product)` pairs in hot cache.
* Export (consensus) 1M rows in ≤60s on 8 cores with streaming writer.
* **Scaling:**
* WebService handles control APIs; **Worker** background services (same image) execute fetch/normalize in parallel with rate limits; Mongo writes batched; upserts by natural keys.
* Exports stream straight to S3 (MinIO) with rolling buffers.
* **Caching:**
* `vex.cache` maps query signatures → export; TTL to avoid stampedes; optimistic reuse unless `force`.
---
## 12) Observability
* **Metrics:**
* `vex.ingest.docs_total{provider}`
* `vex.normalize.claims_total{provider}`
* `vex.signature.failures_total{provider,method}`
* `vex.consensus.conflicts_total{vulnId}`
* `vex.exports.bytes{format}` / `vex.exports.latency_seconds`
* **Tracing:** spans for fetch, verify, parse, map, consensus, export.
* **Dashboards:** provider staleness, top conflicting vulns/components, signature posture, export cache hit rate.
---
## 13) Testing matrix
* **Connectors:** golden raw docs → deterministic claims (fixtures per provider/format).
* **Signature policies:** valid/invalid PGP/cosign/x509 samples; ensure rejects are recorded but not accepted.
* **Normalization edge cases:** platform-only claims, free-text justifications, non-purl products.
* **Consensus:** conflict scenarios across tiers; check tiebreakers; justification gates.
* **Performance:** 1M-row export timing; memory ceilings; stream correctness.
* **Determinism:** same inputs + policy → identical `consensusDigest` and export bytes.
* **API contract tests:** pagination, filters, RBAC, rate limits.
---
## 14) Integration points
* **Backend Policy Engine** (in Scanner.WebService): calls `POST /excititor/resolve` (scope `vex.read`) with batched `(purl, vulnId)` pairs to fetch `rollupStatus + sources`.
* **Concelier**: provides alias graph (CVE↔vendor IDs) and may supply VEX-adjacent metadata (e.g., KEV flag) for policy escalation.
* **UI**: VEX explorer screens use `/claims/search` and `/consensus/search`; show conflicts & provenance.
* **CLI**: `stellaops vex export --consensus --since 7d --out vex.json` for audits.
---
## 15) Failure modes & fallback
* **Provider unreachable:** stale thresholds trigger warnings; policy can downweight stale providers automatically (freshness factor).
* **Signature outage:** continue to ingest but mark `signatureState.verified=false`; consensus will likely exclude or downweight per policy.
* **Schema drift:** unknown fields are preserved as `evidence`; normalization rejects only on **invalid identity** or **status**.
---
## 16) Rollout plan (incremental)
1. **MVP**: OpenVEX + CSAF connectors for 3 major providers (e.g., Red Hat/SUSE/Ubuntu), normalization + consensus + `/excititor/resolve`.
2. **Signature policies**: PGP for distros; cosign for OCI.
3. **Exports + optional attestation**.
4. **CycloneDX VEX** connectors; platform claim expansion tables; UI explorer.
5. **Scale hardening**: export indexes; conflict analytics.
---
## 17) Appendix — canonical JSON (stable ordering)
All exports and consensus entries are serialized via `VexCanonicalJsonSerializer`:
* UTF-8 without BOM;
* keys sorted (ASCII);
* arrays sorted by `(providerId, vulnId, productKey, lastObserved)` unless semantic order mandated;
* timestamps in `YYYY-MM-DDThh:mm:ssZ`;
* no insignificant whitespace.
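A minimal sketch of these rules in Python (illustrative; the real `VexCanonicalJsonSerializer` also enforces the array ordering above):
```python
import hashlib
import json

def canonical_bytes(document: dict) -> bytes:
    # Sorted keys (code-point order, i.e. ASCII order for ASCII keys), no insignificant whitespace, UTF-8 without BOM.
    return json.dumps(document, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")

def stable_digest(document: dict) -> str:
    # SHA-256 over the canonical form, as used for consensusDigest and export hashes.
    return hashlib.sha256(canonical_bytes(document)).hexdigest()
```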

View File

@@ -1,65 +1,65 @@
# Implementation plan — Vexer
## Delivery phases
- **Phase 1 – Connectors & normalization**
Build connectors for OpenVEX, CSAF VEX, CycloneDX VEX, OCI attestations; capture provenance, signatures, and source metadata; normalise into `VexClaim`.
- **Phase 2 – Mapping & trust registry**
Implement product mapping (CPE → purl/version), issuer registry (trust tiers, signatures), scope scoring, and justification taxonomy.
- **Phase 3 – Consensus & projections**
Deliver consensus computation, conflict preservation, projections (`vex_consensus`, history, provider snapshots), and DSSE events.
- **Phase 4 – APIs & integrations**
Expose REST/CLI endpoints for claims, consensus, conflicts, exports; integrate Policy Engine, Vuln Explorer, Advisory AI, Export Center.
- **Phase 5 – Observability & offline**
Ship metrics, logs, traces, dashboards, incident runbooks, Offline Kit bundles, and performance tuning (10M claims/tenant).
## Work breakdown
- **Connectors**
- Fetchers for vendor feeds, CSAF repositories, OpenVEX docs, OCI referrers.
- Signature verification (PGP, cosign, PKI) per source; schema validation; rate limiting.
- Source configuration (trust tier, fetch cadence, blackout windows) stored in metadata registry.
- **Normalization**
- Canonical `VexClaim` schema with deterministic IDs, provenance, supersedes chains.
- Product tree parsing, mapping to canonical product keys and environments.
- Justification and scope scoring derived from source semantics.
- **Consensus & projections**
- Lattice join with precedence rules, conflict tracking, confidence scores, recency decay.
- Append-only history, conflict queue, DSSE events (`vex.consensus.updated`).
- Export-ready JSONL & DSSE bundles for Offline Kit and Export Center.
- **APIs & UX**
- REST endpoints (`/claims`, `/consensus`, `/conflicts`, `/providers`) with tenant RBAC.
- CLI commands `stella vex claims|consensus|conflicts|export`.
- Console modules (list/detail, conflict diagnostics, provider health, simulation hooks).
- **Integrations**
- Policy Engine trust knobs, Vuln Explorer consensus badges, Advisory AI narrative generation, Notify alerts for conflicts.
- Orchestrator jobs for recompute/backfill triggered by Excitator deltas.
- **Observability & Ops**
- Metrics (ingest latency, signature failure rate, conflict rate, consensus latency).
- Logs/traces with tenant/issuer/provenance context.
- Runbooks for mapping failures, signature errors, recompute storms, quota exhaustion.
## Acceptance criteria
- Connectors ingest validated VEX statements with signed provenance, deterministic mapping, and tenant isolation.
- Consensus outputs reproducible, include conflicts, and integrate with Policy Engine/Vuln Explorer/Export Center.
- CLI/Console provide evidence inspection, conflict analysis, and exports; Offline Kit bundles replay verification offline.
- Observability dashboards/alerts capture ingest health, trust anomalies, conflict spikes, and performance budgets.
- Recompute pipeline handles policy changes and new evidence without dropping deterministic outcomes.
## Risks & mitigations
- **Mapping ambiguity:** maintain scope scores, manual overrides, highlight warnings.
- **Signature trust gaps:** issuer registry with auditing, fallback trust policies, tenant overrides.
- **Evidence surges:** orchestrator backpressure, prioritised queues, shardable workers.
- **Performance regressions:** indexing, caching, load tests, budget enforcement.
- **Tenant leakage:** strict RBAC/filters, fuzz tests, compliance reviews.
## Test strategy
- **Unit:** connector parsers, normalization, mapping conversions, lattice operations.
- **Property:** randomised evidence ensuring commutative consensus and deterministic digests.
- **Integration:** end-to-end pipeline from Excitator to consensus export, policy simulation, conflict handling.
- **Performance:** large feed ingestion, recompute stress, CLI export throughput.
- **Security:** signature tampering, issuer revocation, RBAC.
- **Offline:** export/import verification, DSSE bundle validation.
## Definition of done
- Connectors, normalization, consensus, APIs, and integrations deployed with telemetry, runbooks, and Offline Kit parity.
- Documentation (overview, architecture, algorithm, issuer registry, API/CLI, runbooks) updated with imposed rule compliance.
- ./TASKS.md and ../../TASKS.md reflect active status and dependencies.
# Implementation plan — Excitor
## Delivery phases
- **Phase 1 – Connectors & normalization**
Build connectors for OpenVEX, CSAF VEX, CycloneDX VEX, OCI attestations; capture provenance, signatures, and source metadata; normalise into `VexClaim`.
- **Phase 2 – Mapping & trust registry**
Implement product mapping (CPE → purl/version), issuer registry (trust tiers, signatures), scope scoring, and justification taxonomy.
- **Phase 3 – Consensus & projections**
Deliver consensus computation, conflict preservation, projections (`vex_consensus`, history, provider snapshots), and DSSE events.
- **Phase 4 – APIs & integrations**
Expose REST/CLI endpoints for claims, consensus, conflicts, exports; integrate Policy Engine, Vuln Explorer, Advisory AI, Export Center.
- **Phase 5 – Observability & offline**
Ship metrics, logs, traces, dashboards, incident runbooks, Offline Kit bundles, and performance tuning (10M claims/tenant).
## Work breakdown
- **Connectors**
- Fetchers for vendor feeds, CSAF repositories, OpenVEX docs, OCI referrers.
- Signature verification (PGP, cosign, PKI) per source; schema validation; rate limiting.
- Source configuration (trust tier, fetch cadence, blackout windows) stored in metadata registry.
- **Normalization**
- Canonical `VexClaim` schema with deterministic IDs, provenance, supersedes chains.
- Product tree parsing, mapping to canonical product keys and environments.
- Justification and scope scoring derived from source semantics.
- **Consensus & projections**
- Lattice join with precedence rules, conflict tracking, confidence scores, recency decay.
- Append-only history, conflict queue, DSSE events (`vex.consensus.updated`).
- Export-ready JSONL & DSSE bundles for Offline Kit and Export Center.
- **APIs & UX**
- REST endpoints (`/claims`, `/consensus`, `/conflicts`, `/providers`) with tenant RBAC.
- CLI commands `stella vex claims|consensus|conflicts|export`.
- Console modules (list/detail, conflict diagnostics, provider health, simulation hooks).
- **Integrations**
- Policy Engine trust knobs, Vuln Explorer consensus badges, Advisory AI narrative generation, Notify alerts for conflicts.
- Orchestrator jobs for recompute/backfill triggered by Excitor deltas.
- **Observability & Ops**
- Metrics (ingest latency, signature failure rate, conflict rate, consensus latency).
- Logs/traces with tenant/issuer/provenance context.
- Runbooks for mapping failures, signature errors, recompute storms, quota exhaustion.
## Acceptance criteria
- Connectors ingest validated VEX statements with signed provenance, deterministic mapping, and tenant isolation.
- Consensus outputs reproducible, include conflicts, and integrate with Policy Engine/Vuln Explorer/Export Center.
- CLI/Console provide evidence inspection, conflict analysis, and exports; Offline Kit bundles replay verification offline.
- Observability dashboards/alerts capture ingest health, trust anomalies, conflict spikes, and performance budgets.
- Recompute pipeline handles policy changes and new evidence without dropping deterministic outcomes.
## Risks & mitigations
- **Mapping ambiguity:** maintain scope scores, manual overrides, highlight warnings.
- **Signature trust gaps:** issuer registry with auditing, fallback trust policies, tenant overrides.
- **Evidence surges:** orchestrator backpressure, prioritised queues, shardable workers.
- **Performance regressions:** indexing, caching, load tests, budget enforcement.
- **Tenant leakage:** strict RBAC/filters, fuzz tests, compliance reviews.
## Test strategy
- **Unit:** connector parsers, normalization, mapping conversions, lattice operations.
- **Property:** randomised evidence ensuring commutative consensus and deterministic digests.
- **Integration:** end-to-end pipeline from Excitor to consensus export, policy simulation, conflict handling.
- **Performance:** large feed ingestion, recompute stress, CLI export throughput.
- **Security:** signature tampering, issuer revocation, RBAC.
- **Offline:** export/import verification, DSSE bundle validation.
## Definition of done
- Connectors, normalization, consensus, APIs, and integrations deployed with telemetry, runbooks, and Offline Kit parity.
- Documentation (overview, architecture, algorithm, issuer registry, API/CLI, runbooks) updated with imposed rule compliance.
- ./TASKS.md and ../../TASKS.md reflect active status and dependencies.

View File

@@ -1,83 +1,83 @@
## Status
This document tracks the future-looking risk scoring model for Vexer. The calculation below is not active yet; Sprint 7 work will add the required schema fields, policy controls, and services. Until that ships, Vexer emits consensus statuses without numeric scores.
## Scoring model (target state)
**S = Gate(VEX_status) × W_trust(source) × [Severity_base × (1 + α·KEV + β·EPSS)]**
* **Gate(VEX_status)**: `affected`/`under_investigation` → 1, `not_affected`/`fixed` → 0. A trusted “not affected” or “fixed” still zeroes the score.
* **W_trust(source)**: normalized policy weight (baseline 0–1). Policies may opt into >1 boosts for signed vendor feeds once Phase 1 closes.
* **Severity_base**: canonical numeric severity from Feedser (CVSS or org-defined scale).
* **KEV flag**: 0/1 boost when CISA Known Exploited Vulnerabilities applies.
* **EPSS**: probability [0,1]; bounded multiplier.
* **α, β**: configurable coefficients (default α=0.25, β=0.5) stored in policy.
Safeguards: freeze boosts when product identity is unknown, clamp outputs ≥0, and log every factor in the audit trail.
## Implementation roadmap
| Phase | Scope | Artifacts |
| --- | --- | --- |
| **Phase 1 – Schema foundations** | Extend Vexer consensus/claims and Feedser canonical advisories with severity, KEV, EPSS, and expose α/β + weight ceilings in policy. | Sprint 7 tasks `VEXER-CORE-02-001`, `VEXER-POLICY-02-001`, `VEXER-STORAGE-02-001`, `FEEDCORE-ENGINE-07-001`. |
| **Phase 2 – Deterministic score engine** | Implement a scoring component that executes alongside consensus and persists score envelopes with hashes. | Planned task `VEXER-CORE-02-002` (backlog). |
| **Phase 3 – Surfacing & enforcement** | Expose scores via WebService/CLI, integrate with Feedser noise priors, and enforce policy-based suppressions. | To be scheduled after Phase 2. |
## Data model (after Phase 1)
```json
{
"vulnerabilityId": "CVE-2025-12345",
"product": "pkg:name@version",
"consensus": {
"status": "affected",
"policyRevisionId": "rev-12",
"policyDigest": "0D9AEC…"
},
"signals": {
"severity": {"scheme": "CVSS:3.1", "score": 7.5},
"kev": true,
"epss": 0.40
},
"policy": {
"weight": 1.15,
"alpha": 0.25,
"beta": 0.5
},
"score": {
"value": 10.8,
"generatedAt": "2025-11-05T14:12:30Z",
"audit": [
"gate:affected",
"weight:1.15",
"severity:7.5",
"kev:1",
"epss:0.40"
]
}
}
```
## Operational guidance
* **Inputs**: Feedser delivers severity/KEV/EPSS via the advisory event log; Vexer connectors load VEX statements. Policy owns trust tiers and coefficients.
* **Processing**: the scoring engine (Phase 2) runs next to consensus, storing results with deterministic hashes so exports and attestations can reference them.
* **Consumption**: WebService/CLI will return consensus plus score; scanners may suppress findings only when policy-authorized VEX gating and signed score envelopes agree.
## Pseudocode (Phase 2 preview)
```python
def risk_score(gate, weight, severity, kev, epss, alpha, beta, freeze_boosts=False):
if gate == 0:
return 0
if freeze_boosts:
kev, epss = 0, 0
boost = 1 + alpha * kev + beta * epss
return max(0, weight * severity * boost)
```
## FAQ
* **Can operators opt out?** Set α=β=0 or keep weights ≤1.0 via policy.
* **What about missing signals?** Treat them as zero and log the omission.
* **When will this ship?** Phase 1 is planned for Sprint 7; later phases depend on connector coverage and attestation delivery.
## Status
This document tracks the future-looking risk scoring model for Excitor. The calculation below is not active yet; Sprint 7 work will add the required schema fields, policy controls, and services. Until that ships, Excitor emits consensus statuses without numeric scores.
## Scoring model (target state)
**S = Gate(VEX_status) × W_trust(source) × [Severity_base × (1 + α·KEV + β·EPSS)]**
* **Gate(VEX_status)**: `affected`/`under_investigation` → 1, `not_affected`/`fixed` → 0. A trusted “not affected” or “fixed” still zeroes the score.
* **W_trust(source)**: normalized policy weight (baseline 0–1). Policies may opt into >1 boosts for signed vendor feeds once Phase 1 closes.
* **Severity_base**: canonical numeric severity from Concelier (CVSS or org-defined scale).
* **KEV flag**: 0/1 boost when CISA Known Exploited Vulnerabilities applies.
* **EPSS**: probability [0,1]; bounded multiplier.
* **α, β**: configurable coefficients (default α=0.25, β=0.5) stored in policy.
Safeguards: freeze boosts when product identity is unknown, clamp outputs ≥0, and log every factor in the audit trail.
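As a quick worked example (illustrative numbers): an `affected` claim (gate = 1) from a source weighted 1.0, with Severity_base 7.5, the KEV flag set, and EPSS 0.40 under the default coefficients gives S = 1 × 1.0 × 7.5 × (1 + 0.25·1 + 0.5·0.40) = 10.875.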
## Implementation roadmap
| Phase | Scope | Artifacts |
| --- | --- | --- |
| **Phase 1 – Schema foundations** | Extend Excitor consensus/claims and Concelier canonical advisories with severity, KEV, EPSS, and expose α/β + weight ceilings in policy. | Sprint 7 tasks `EXCITOR-CORE-02-001`, `EXCITOR-POLICY-02-001`, `EXCITOR-STORAGE-02-001`, `FEEDCORE-ENGINE-07-001`. |
| **Phase 2 – Deterministic score engine** | Implement a scoring component that executes alongside consensus and persists score envelopes with hashes. | Planned task `EXCITOR-CORE-02-002` (backlog). |
| **Phase 3 – Surfacing & enforcement** | Expose scores via WebService/CLI, integrate with Concelier noise priors, and enforce policy-based suppressions. | To be scheduled after Phase 2. |
## Data model (after Phase 1)
```json
{
"vulnerabilityId": "CVE-2025-12345",
"product": "pkg:name@version",
"consensus": {
"status": "affected",
"policyRevisionId": "rev-12",
"policyDigest": "0D9AEC…"
},
"signals": {
"severity": {"scheme": "CVSS:3.1", "score": 7.5},
"kev": true,
"epss": 0.40
},
"policy": {
"weight": 1.15,
"alpha": 0.25,
"beta": 0.5
},
"score": {
"value": 10.8,
"generatedAt": "2025-11-05T14:12:30Z",
"audit": [
"gate:affected",
"weight:1.15",
"severity:7.5",
"kev:1",
"epss:0.40"
]
}
}
```
## Operational guidance
* **Inputs**: Concelier delivers severity/KEV/EPSS via the advisory event log; Excitor connectors load VEX statements. Policy owns trust tiers and coefficients.
* **Processing**: the scoring engine (Phase 2) runs next to consensus, storing results with deterministic hashes so exports and attestations can reference them.
* **Consumption**: WebService/CLI will return consensus plus score; scanners may suppress findings only when policy-authorized VEX gating and signed score envelopes agree.
## Pseudocode (Phase 2 preview)
```python
def risk_score(gate, weight, severity, kev, epss, alpha, beta, freeze_boosts=False):
if gate == 0:
return 0
if freeze_boosts:
kev, epss = 0, 0
boost = 1 + alpha * kev + beta * epss
return max(0, weight * severity * boost)
```
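Plugging the data-model example above into this preview function reproduces the audited factors (gate 1, weight 1.15, severity 7.5, KEV 1, EPSS 0.40):
```python
# Values taken from the data model example above; kev is passed as 0/1.
score = risk_score(gate=1, weight=1.15, severity=7.5, kev=1, epss=0.40,
                   alpha=0.25, beta=0.5)
print(round(score, 2))  # 12.51
```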
## FAQ
* **Can operators opt out?** Set α=β=0 or keep weights ≤1.0 via policy.
* **What about missing signals?** Treat them as zero and log the omission.
* **When will this ship?** Phase 1 is planned for Sprint 7; later phases depend on connector coverage and attestation delivery.

View File

@@ -1,150 +1,150 @@
# Export Center Provenance & Signing
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Export Center runs emit deterministic manifests, provenance records, and signatures so operators can prove bundle integrity end-to-end—whether the artefact is downloaded over HTTPS, pulled as an OCI object, or staged through the Offline Kit. This guide captures the canonical artefacts, signing pipeline, verification workflows, and failure handling expectations that backlogs `EXPORT-SVC-35-005` and `EXPORT-SVC-37-002` implement.
---
## 1. Goals & scope
- **Authenticity.** Every export manifest and provenance document is signed using Authority-managed KMS keys (cosign-compatible) with optional SLSA Level 3 attestation.
- **Traceability.** Provenance links each bundle to the inputs that produced it: tenant, findings ledger queries, policy snapshots, SBOM identifiers, adapter versions, and encryption recipients.
- **Determinism.** Canonical JSON (sorted keys, RFC3339 UTC timestamps, normalized numbers) guarantees byte-for-byte stability across reruns with identical input.
- **Portability.** Signatures and attestations travel with filesystem bundles, OCI artefacts, and Offline Kit staging trees. Verification does not require online Authority access when the bundle includes the cosign public key.
---
## 2. Artefact inventory
| File | Location | Description | Notes |
|------|----------|-------------|-------|
| `export.json` | `manifests/export.json` or HTTP `GET /api/export/runs/{id}/manifest` | Canonical manifest describing profile, selectors, counts, SHA-256 digests, compression hints, distribution targets. | Hash of this file is included in provenance `subjects[]`. |
| `provenance.json` | `manifests/provenance.json` or `GET /api/export/runs/{id}/provenance` | In-toto provenance record listing subjects, materials, toolchain metadata, encryption recipients, and KMS key identifiers. | Mirrors SLSA Level 2 schema; optionally upgraded to Level 3 with builder attestations. |
| `export.json.sig` / `export.json.dsse` | `signatures/export.json.sig` | Cosign signature (and optional DSSE envelope) for manifest. | File naming matches cosign defaults; offline verification scripts expect `.sig`. |
| `provenance.json.sig` / `provenance.json.dsse` | `signatures/provenance.json.sig` | Cosign signature (and optional DSSE envelope) for provenance document. | `dsse` present when SLSA Level 3 is enabled. |
| `bundle.attestation` | `signatures/bundle.attestation` (optional) | SLSA Level 2/3 attestation binding bundle tarball/OCI digest to the run. | Only produced when `export.attestation.enabled=true`. |
| `manifest.yaml` | bundle root | Human-readable summary including digests, sizes, encryption metadata, and verification hints. | Unsigned but redundant; signatures cover the JSON manifests. |
All digests use lowercase hex SHA-256 (`sha256:<digest>`). When bundle encryption is enabled, `provenance.json` records wrapped data keys and recipient fingerprints under `encryption.recipients[]`.
---
## 3. Signing pipeline
1. **Canonicalisation.** Export worker serialises `export.json` and `provenance.json` using `NotifyCanonicalJsonSerializer` (identical canonical JSON helpers shared across services). Keys are sorted lexicographically, arrays ordered deterministically, timestamps normalised to UTC.
2. **Digest creation.** SHA-256 digests are computed and recorded:
- `manifest_hash` and `provenance_hash` stored in the run metadata (Mongo) and exported via `/api/export/runs/{id}`.
- Provenance `subjects[]` contains both manifest hash and bundle/archive hash.
3. **Key retrieval.** Worker obtains a short-lived signing token from Authoritys KMS client using tenant-scoped credentials (`export.sign` scope). Keys live in Authority or tenant-specific HSMs depending on deployment.
4. **Signature emission.** Cosign generates detached signatures (`*.sig`). If DSSE is enabled, cosign wraps payload bytes in a DSSE envelope (`*.dsse`). Attestations follow the SLSA Level 2 provenance template; Level 3 requires builder metadata (`EXPORT-SVC-37-002` optional feature flag).
5. **Storage & distribution.** Signatures and attestations are written alongside manifests in object storage, included in filesystem bundles, and attached as OCI artefact layers/annotations.
6. **Audit trail.** Run metadata captures signer identity (`signing_key_id`), cosign certificate serial, signature timestamps, and verification hints. Console/CLI surface these details for downstream automation.
> **Key management.** Secrets and key references are configured per tenant via `export.signing`, pointing to Authority clients or external HSM aliases. Offline deployments pre-load cosign public keys into the bundle (`signatures/pubkeys/{tenant}.pem`).
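For intuition, here is a minimal Python sketch of the canonicalise-then-digest step. It mirrors the sorted-keys and compact-separator rules described above; it is not the shared `NotifyCanonicalJsonSerializer`, and the sample manifest fields are illustrative only.

```python
import hashlib
import json

def canonical_bytes(document: dict) -> bytes:
    # Sorted keys, no insignificant whitespace; timestamps are assumed to
    # already be RFC 3339 UTC strings produced by the export worker.
    return json.dumps(document, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def digest(document: dict) -> str:
    # Lowercase hex SHA-256 with the "sha256:" prefix used throughout the bundle.
    return "sha256:" + hashlib.sha256(canonical_bytes(document)).hexdigest()

manifest = {"profile": "mirror:delta", "tenant": "tenant-01"}  # illustrative subset
print(digest(manifest))
```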
---
## 4. Provenance schema highlights
`provenance.json` follows the SLSA provenance (`https://slsa.dev/provenance/v1`) structure with StellaOps-specific extensions. Key fields:
| Path | Description |
|------|-------------|
| `subject[]` | Array of `{name,digest}` pairs. Includes bundle tarball/OCI digest and `export.json` digest. |
| `predicateType` | SLSA v1 (default). |
| `predicate.builder` | `{id:"stellaops/export-center@<region>"}` identifies the worker instance/cluster. |
| `predicate.buildType` | Profile identifier (`mirror:full`, `mirror:delta`, etc.). |
| `predicate.invocation.parameters` | Profile selectors, retention flags, encryption mode, base export references. |
| `predicate.materials[]` | Source artefacts with digests: findings ledger query snapshots, policy snapshot IDs + hashes, SBOM identifiers, adapter release digests. |
| `predicate.metadata.buildFinishedOn` | RFC3339 timestamp when signing completed. |
| `predicate.metadata.reproducible` | Always `true`—workers guarantee determinism. |
| `predicate.environment.encryption` | Records encryption recipients, wrapped keys, algorithm (`age` or `aes-gcm`). |
| `predicate.environment.kms` | Signing key identifier (`authority://tenant/export-signing-key`) and certificate chain fingerprints. |
Sample (abridged):
```json
{
"subject": [
{ "name": "bundle.tar.zst", "digest": { "sha256": "c1fe..." } },
{ "name": "manifests/export.json", "digest": { "sha256": "ad42..." } }
],
"predicate": {
"buildType": "mirror:delta",
"invocation": {
"parameters": {
"tenant": "tenant-01",
"baseExportId": "run-20251020-01",
"selectors": { "sources": ["concelier","excitor"], "profiles": ["mirror"] }
}
},
"materials": [
{ "uri": "ledger://tenant-01/findings?cursor=rev-42", "digest": { "sha256": "0f9a..." } },
{ "uri": "policy://tenant-01/snapshots/rev-17", "digest": { "sha256": "8c3d..." } }
],
"environment": {
"encryption": {
"mode": "age",
"recipients": [
{ "recipient": "age1qxyz...", "wrappedKey": "BASE64...", "keyId": "tenant-01/notify-age" }
]
},
"kms": {
"signingKeyId": "authority://tenant-01/export-signing",
"certificateChainSha256": "1f5e..."
}
}
}
}
```
---
## 5. Verification workflows
| Scenario | Steps |
|----------|-------|
| **CLI verification** | 1. `stella export manifest <runId> --output manifests/export.json --signature manifests/export.json.sig`<br>2. `stella export provenance <runId> --output manifests/provenance.json --signature manifests/provenance.json.sig`<br>3. `cosign verify-blob --key pubkeys/tenant.pem --signature manifests/export.json.sig manifests/export.json`<br>4. `cosign verify-blob --key pubkeys/tenant.pem --signature manifests/provenance.json.sig manifests/provenance.json` |
| **Bundle verification (offline)** | 1. Extract bundle (or mount OCI artefact).<br>2. Validate manifest/provenance signatures using bundled public key.<br>3. Recompute SHA-256 for `data/` files and compare with entries in `export.json`.<br>4. If encrypted, decrypt with Age/AES-GCM recipient key, then re-run digest comparisons on decrypted content. |
| **CI pipeline** | Use `stella export verify --manifest manifests/export.json --provenance manifests/provenance.json --signature manifests/export.json.sig --signature manifests/provenance.json.sig` (task `CLI-EXPORT-37-001`). Failure exits non-zero with reason codes (`ERR_EXPORT_SIG_INVALID`, `ERR_EXPORT_DIGEST_MISMATCH`). |
| **Console download** | Console automatically verifies signatures before exposing the bundle; failure surfaces an actionable error referencing the export run ID and required remediation. |
Verification guidance (docs/modules/cli/guides/cli-reference.md §export) cross-links here; keep both docs in sync when CLI behaviour changes.
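A rough sketch of the offline digest re-check (step 3 of the bundle verification flow above). The `files`/`path`/`sha256` field names are placeholders; adapt them to the actual `export.json` layout for your release.

```python
import hashlib
import json
from pathlib import Path

def verify_bundle(bundle_root: str, manifest_path: str) -> bool:
    manifest = json.loads(Path(manifest_path).read_text())
    ok = True
    # Field names ("files", "path", "sha256") are assumptions for illustration.
    for entry in manifest.get("files", []):
        data = (Path(bundle_root) / entry["path"]).read_bytes()
        actual = "sha256:" + hashlib.sha256(data).hexdigest()
        if actual != entry["sha256"]:
            print(f"digest mismatch: {entry['path']}")
            ok = False
    return ok
```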
---
## 6. Distribution considerations
- **HTTP headers.** `X-Export-Digest` includes bundle digest; `X-Export-Provenance` references `provenance.json` URL; `X-Export-Signature` references `.sig`. Clients use these hints to short-circuit re-downloads.
- **OCI annotations.** `org.opencontainers.image.ref.name`, `io.stellaops.export.manifest-digest`, and `io.stellaops.export.provenance-ref` allow registry tooling to locate manifests/signatures quickly.
- **Offline Kit staging.** Offline kit assembler copies `manifests/`, `signatures/`, and `pubkeys/` verbatim. Verification scripts (`offline-kits/bin/verify-export.sh`) wrap the cosign commands described above.
---
## 7. Failure handling & observability
- Runs surface signature status via `/api/export/runs/{id}` (`signing.status`, `signing.lastError`). Common errors include `ERR_EXPORT_KMS_UNAVAILABLE`, `ERR_EXPORT_ATTESTATION_FAILED`, `ERR_EXPORT_CANONICALIZE`.
- Metrics: `exporter_sign_duration_seconds`, `exporter_sign_failures_total{error_code}`, `exporter_provenance_verify_failures_total`.
- Logs: `phase=sign`, `error_code`, `signing_key_id`, `cosign_certificate_sn`.
- Alerts: DevOps dashboards (task `DEVOPS-EXPORT-37-001`) trigger on consecutive signing failures or verification failures >0.
When verification fails downstream, operators should:
1. Confirm signatures using the known-good key.
2. Inspect `provenance.json` materials; rerun the source queries to ensure matching digests.
3. Review run audit logs and retry export with `--resume` to regenerate manifests.
---
## 8. Compliance checklist
- [ ] Manifests and provenance documents generated with canonical JSON, deterministic digests, and signatures.
- [ ] Cosign public keys published per tenant, rotated through Authority, and distributed to Offline Kit consumers.
- [ ] SLSA attestations enabled where supply-chain requirements demand Level3 evidence.
- [ ] CLI/Console verification paths documented and tested (CI pipelines exercise `stella export verify`).
- [ ] Encryption metadata (recipients, wrapped keys) recorded in provenance and validated during verification.
- [ ] Run audit logs capture signature timestamps, signer identity, and failure reasons.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

View File

@@ -88,6 +88,7 @@
- `curl -fsSL https://localhost:8447/health/live`
- Issue an access token and list issuers to confirm results.
- Check Mongo counts match expectations (`db.issuers.countDocuments()`, etc.).
- Confirm Prometheus scrapes `issuer_directory_changes_total` and `issuer_directory_key_operations_total` for the tenants you restored.
## Disaster recovery notes
- **Retention:** Maintain 30 daily + 12 monthly archives. Store copies in geographically separate, access-controlled vaults.
@@ -98,6 +99,6 @@
## Verification checklist
- [ ] `/health/live` returns `200 OK`.
- [ ] Mongo collections (`issuers`, `issuer_keys`, `issuer_trust_overrides`) have expected counts.
- [ ] `issuer_directory_changes_total`, `issuer_directory_key_operations_total`, and `issuer_directory_key_validation_failures_total` metrics resume within 1 minute.
- [ ] Audit entries exist for post-restore CRUD activity.
- [ ] Client integrations (VEX Lens, Excititor) resolve issuers successfully.

View File

@@ -39,6 +39,13 @@
```
Compose automatically mounts `../../etc/issuer-directory.yaml` into the container at `/etc/issuer-directory.yaml`, seeds CSAF publishers, and exposes the API on `https://localhost:8447`.
### Compose environment variables
| Variable | Purpose | Default |
| --- | --- | --- |
| `ISSUER_DIRECTORY_PORT` | Host port that maps to container port `8080`. | `8447` |
| `ISSUER_DIRECTORY_MONGO_CONNECTION_STRING` | Injected into `ISSUERDIRECTORY__MONGO__CONNECTIONSTRING`; should contain credentials. | `mongodb://${MONGO_INITDB_ROOT_USERNAME}:${MONGO_INITDB_ROOT_PASSWORD}@mongo:27017` |
| `ISSUER_DIRECTORY_SEED_CSAF` | Toggles CSAF bootstrap on startup. Set to `false` after the first production import if you manage issuers manually. | `true` |
4. **Smoke test**
```bash
curl -k https://localhost:8447/health/live

View File

@@ -12,6 +12,7 @@ Include the following artefacts in your Offline Update Kit staging tree:
| `config/issuer-directory/issuer-directory.yaml` | `etc/issuer-directory.yaml` (customised) | Replace Authority issuer, tenant header, and log level as required. |
| `config/issuer-directory/csaf-publishers.json` | `src/IssuerDirectory/StellaOps.IssuerDirectory/data/csaf-publishers.json` or regional override | Operators can edit before import to add private publishers. |
| `secrets/issuer-directory/connection.env` | Secure secret store export (`ISSUER_DIRECTORY_MONGO_CONNECTION_STRING=`) | Encrypt at rest; Offline Kit importer places it in the Compose/Helm secret. |
| `env/issuer-directory.env` (optional) | Curated `.env` snippet (for example `ISSUER_DIRECTORY_SEED_CSAF=false`) | Helps operators disable reseeding after their first import without editing the main profile. |
| `docs/issuer-directory/deployment.md` | `docs/modules/issuer-directory/operations/deployment.md` | Ship alongside kit documentation for operators. |
> **Image digests:** Update `deploy/releases/2025.10-edge.yaml` (or the relevant manifest) with the exact digest before building the kit so `offline-manifest.json` can assert integrity.
@@ -69,3 +70,4 @@ Include the following artefacts in your Offline Update Kit staging tree:
- [ ] `/issuer-directory/issuers` returns global seed issuers (requires token with `issuer-directory:read` scope).
- [ ] Audit collection receives entries when you create/update issuers offline.
- [ ] Offline kit manifest (`offline-manifest.json`) lists `images/issuer-directory-web.tar` and `config/issuer-directory/issuer-directory.yaml` with SHA-256 values you recorded during packaging.
- [ ] Prometheus in the offline environment reports `issuer_directory_changes_total` for the tenants imported from the kit.

File diff suppressed because it is too large

View File

@@ -1,56 +1,56 @@
{
"$id": "https://stella-ops.org/schemas/notify/notify-event@1.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Notify Event Envelope",
"type": "object",
"required": ["eventId", "kind", "tenant", "ts", "payload"],
"properties": {
"eventId": {"type": "string", "format": "uuid"},
"kind": {
"type": "string",
"description": "Event kind identifier (e.g. scanner.report.ready).",
"enum": [
"scanner.report.ready",
"scanner.scan.completed",
"scheduler.rescan.delta",
"attestor.logged",
"zastava.admission",
"conselier.export.completed",
"excitor.export.completed"
]
},
"version": {"type": "string"},
"tenant": {"type": "string"},
"ts": {"type": "string", "format": "date-time"},
"actor": {"type": "string"},
"scope": {
"type": "object",
"properties": {
"namespace": {"type": "string"},
"repo": {"type": "string"},
"digest": {"type": "string"},
"component": {"type": "string"},
"image": {"type": "string"},
"labels": {"$ref": "#/$defs/stringMap"},
"attributes": {"$ref": "#/$defs/stringMap"}
},
"additionalProperties": false
},
"payload": {
"type": "object",
"description": "Event specific body; see individual schemas for shapes.",
"additionalProperties": true
},
"attributes": {"$ref": "#/$defs/stringMap"}
},
"additionalProperties": false,
"$defs": {
"stringMap": {
"type": "object",
"patternProperties": {
".*": {"type": "string"}
},
"additionalProperties": false
}
}
}
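A quick way to sanity-check an event against this envelope, assuming the Python `jsonschema` package and a local copy of the schema file (the path below is illustrative):

```python
import json
import uuid
from datetime import datetime, timezone

from jsonschema import Draft7Validator  # third-party: pip install jsonschema

# Point this at wherever the schema ships in your deployment.
with open("notify-event@1.json", "r", encoding="utf-8") as handle:
    schema = json.load(handle)

event = {
    "eventId": str(uuid.uuid4()),
    "kind": "scanner.report.ready",
    "tenant": "tenant-01",
    "ts": datetime.now(timezone.utc).isoformat(),
    "payload": {"reportId": "report-123"},
}

Draft7Validator(schema).validate(event)  # raises ValidationError if the envelope is malformed
print("event conforms to notify-event@1")
```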

View File

@@ -8,6 +8,8 @@ Policy Engine compiles and evaluates Stella DSL policies deterministically, prod
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)
- [Secret leak detection readiness](../policy/secret-leak-detection-readiness.md)
- [Windows package readiness](../policy/windows-package-readiness.md)
## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.

View File

@@ -1,31 +1,33 @@
# StellaOps Policy Engine
Policy Engine compiles and evaluates Stella DSL policies deterministically, producing explainable findings with full provenance.
## Responsibilities
- Compile `stella-dsl@1` packs into executable graphs.
- Join advisories, VEX evidence, and SBOM inventories to derive effective findings.
- Expose simulation and diff APIs for UI/CLI workflows.
- Emit change-stream driven events for Notify/Scheduler integrations.
## Key components
- `StellaOps.Policy.Engine` service host.
- Shared libraries under `StellaOps.Policy.*` for evaluation, storage, DSL tooling.
## Integrations & dependencies
- MongoDB findings collections, RustFS explain bundles.
- Scheduler for incremental re-evaluation triggers.
- CLI/UI for policy authoring and runs.
## Operational notes
- DSL grammar and lifecycle docs in ../../policy/.
- Observability guidance in ../../observability/policy.md.
- Governance and scope mapping in ../../security/policy-governance.md.
- Readiness briefs: ../scanner/design/macos-analyzer.md, ../scanner/design/windows-analyzer.md, ../policy/secret-leak-detection-readiness.md, ../policy/windows-package-readiness.md.
## Backlog references
- DOCS-POLICY-20-001 … DOCS-POLICY-20-012 (completed baseline).
- DOCS-POLICY-23-007 (upcoming command updates).
## Epic alignment
- **Epic 2 – Policy Engine & Editor:** deliver deterministic evaluation, DSL infrastructure, explain traces, and incremental runs.
- **Epic 4 – Policy Studio:** integrate registry workflows, simulation at scale, approvals, and promotion semantics.

View File

@@ -1,9 +1,11 @@
# Task board — Policy Engine
> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.
| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| POLICY ENGINE-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| POLICY ENGINE-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| POLICY ENGINE-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |
| POLICY-READINESS-0001 | DOING (2025-11-03) | Policy Guild, Security Guild | Resolve open questions in `../policy/secret-leak-detection-readiness.md` ahead of SCANNER-ENG-0007. | Decision workshop 2025-11-10 (Northwind demo); cover masking depth, telemetry retention, bundle defaults, tenant overrides. |
| POLICY-READINESS-0002 | DOING (2025-11-03) | Policy Guild, Security Guild, Offline Kit Guild | Review `../policy/windows-package-readiness.md`, set signature verification locus, feed mirroring scopes, and legacy installer posture. | FinSecure PCI blocker; deliver Authenticode/feed decision by 2025-11-07 before analyzer spike kickoff. |

View File

@@ -0,0 +1,80 @@
# Secret Leak Detection Readiness — Policy & Security Brief
> Audience: Policy Guild, Security Guild
> Related backlog: SCANNER-ENG-0007 (deterministic leak detection pipeline), DOCS-SCANNER-BENCH-62-007 (rule bundle documentation), SCANNER-SECRETS-01..03 (Surface.Secrets alignment)
## 1. Goal & scope
- Provide a shared understanding of how the planned `StellaOps.Scanner.Analyzers.Secrets` plug-in will operate so Policy/Security can prepare governance controls in parallel with engineering delivery.
- Document evidence flow, policy predicates, and offline distribution requirements to minimise lead time once implementation lands.
- Capture open questions requiring Policy/Security sign-off (masking rules, tenancy constraints, waiver workflows).
## 2. Proposed evidence pipeline
1. **Source resolution**
- Surface.Secrets providers (Kubernetes, file bundle, inline) continue to resolve operational credentials. Handles remain opaque and never enter analyzer outputs.
2. **Deterministic scanning**
- New plug-in executes signed rule bundles (regex + entropy signatures) stored under `scanner/rules/secrets/`.
- Execution context restricted to read-only layer mount; analyzers emit `secret.leak` evidence with: `{rule.id, rule.version, confidence, severity, mask, file, line}`.
3. **Analysis store persistence**
- Findings are written into `ScanAnalysisStore` (`ScanAnalysisKeys.secretFindings`) so Policy Engine can ingest them alongside component fragments.
4. **Policy overlay**
- Policy predicates (see §3) evaluate evidence, lattice scores, and tenant-scoped allow/deny lists.
- CLI/export surfaces show masked snippets and remediation hints.
5. **Offline parity**
- Rule bundles, signature manifests, and validator hash lists ship with Offline Kit; rule updates must be signed and versioned to preserve determinism.
## 3. Policy Engine considerations
- **New predicates**
- `secret.hasFinding(ruleId?, severity?, confidence?)`
- `secret.bundle.version(requiredVersion)`
- `secret.mask.applied` (bool) — verify masking for high severity hits.
- `secret.path.allowlist` — tenant-configured allow list keyed by digest/path.
- **Lattice weight suggestions**
- High severity & high confidence → escalate to `block` unless waived.
- Low confidence → default to `warn` with optional escalation when multiple matches occur (`secret.match.count >= N`).
- **Waiver workflow**
- Reuse VEX-first lattice approach: require attach of remediation note, ticket reference, and expiration date.
- Ensure waivers attach rule version so upgraded rules re-evaluate automatically.
- **Masking / privacy**
- Minimum masking: first and last 2 characters retained; remainder replaced with `*`.
- Persist masked payload only; full value never leaves scanner context.
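A minimal sketch of the masking rule above; the fallback for very short values is an assumption pending the masking-depth decision.

```python
def mask_secret(value: str) -> str:
    # Keep the first and last two characters, replace the rest with "*",
    # per the minimum masking rule above. Very short values are fully masked
    # (assumption; confirm with Security Guild).
    if len(value) <= 4:
        return "*" * len(value)
    return value[:2] + "*" * (len(value) - 4) + value[-2:]

print(mask_secret("AKIAIOSFODNN7EXAMPLE"))  # AK****************LE
```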
## 4. Security guardrails
- Rule bundle signing: Signer issues DSSE envelope for each ruleset; Policy must verify signature before enabling new bundle.
- Tenant isolation: bundle enablement scoped per tenant; defaults deny unknown bundles.
- Telemetry: emit `scanner.secret.finding_total{tenant, ruleId, severity}` with masking applied after count aggregation.
- Access control: restrict retrieval of raw finding payloads to roles with `scanner.secret.read` scope; audits log query + tenant + rule id.
## 5. Offline & update flow
1. Engineering publishes new bundle → Signer signs → Offline Kit includes bundle + manifest.
2. Operators import bundle via CLI (`stella secrets bundle install --path <bundle>`).
- CLI verifies signature, version monotonicity, and rule hash list.
3. Policy update: push config snippet enabling bundle, severity mapping, and waiver templates.
4. Run `stella secrets scan --dry-run` to verify deterministic output against golden fixtures before enabling in production.
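A rough sketch of the version-monotonicity and rule-hash checks the CLI is expected to perform before enabling a bundle (DSSE signature verification omitted; the manifest shape is an assumption):

```python
import hashlib
from pathlib import Path

def check_bundle(current_version: tuple, manifest: dict, bundle_dir: str) -> None:
    # Manifest shape ({"version": [1, 4, 0], "rules": {"<relative path>": "<sha256>"}})
    # is illustrative; DSSE signature verification happens before this step.
    if tuple(manifest["version"]) <= current_version:
        raise ValueError("bundle version must be strictly newer than the installed one")
    for rel_path, expected in manifest["rules"].items():
        actual = hashlib.sha256((Path(bundle_dir) / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            raise ValueError(f"hash mismatch for {rel_path}")
```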
## 6. Open questions / owners
| Topic | Question | Owner | Target decision |
| --- | --- | --- | --- |
| Masking depth | Do we mask file paths or only payloads? | Security Guild | Sprint 132 design review |
| Telemetry retention | Should secret finding metrics be sampled or full fidelity? | Policy + Observability Guild | Sprint 132 |
| Default bundles | Which rule families ship enabled-by-default (cloud creds, SSH keys, JWT)? | Security Guild | Sprint 133 |
| Tenant overrides | Format for per-tenant allow lists (path glob vs digest)? | Policy Guild | Sprint 133 |
### Decision tracker
| Decision | Owner(s) | Target date | Status |
| --- | --- | --- | --- |
| Masking depth (paths vs payloads) | Security Guild | 2025-11-10 | Pending — workshop aligned with Northwind demo |
| Telemetry retention granularity | Policy + Observability Guild | 2025-11-10 | Pending |
| Default rule bundles (cloud creds/SSH/JWT) | Security Guild | 2025-11-10 | Draft proposals under review |
| Tenant override format | Policy Guild | 2025-11-10 | Pending |
## 7. Next steps
1. Policy Guild drafts predicate specs + policy templates (map to DOCS-SCANNER-BENCH-62-007 exit criteria).
2. Security Guild reviews signing + masking requirements; align with Surface.Secrets roadmap.
3. Docs Guild (this task) continues maintaining `docs/benchmarks/scanner/deep-dives/secrets.md` with finalized rule taxonomy and references.
4. Engineering provides prototype fixture outputs for review once SCANNER-ENG-0007 spikes begin.
## Coordination
- Capture macOS customer requirements via `docs/benchmarks/scanner/windows-macos-demand.md` (Northwind Health Services).
- Update dashboard (`docs/api/scanner/windows-macos-summary.md`) after readiness decisions.
- Share outcomes from 2025-11-10 macOS demo before finalising POLICY-READINESS-0001.

View File

@@ -0,0 +1,92 @@
# Windows Package Coverage — Policy & Security Readiness Brief
> Audience: Policy Guild, Security Guild, Offline Kit Guild
> Related engineering backlog (proposed): SCANNER-ENG-0024..0027
> Docs linkage: DOCS-SCANNER-BENCH-62-016
## 1. Goal
- Prepare policy and security guidance ahead of Windows analyzer implementation (MSI, WinSxS, Chocolatey, registry).
- Define evidence handling, predicates, waiver expectations, and offline prerequisites so engineering can align during spike execution.
## 2. Evidence pipeline snapshot (from `design/windows-analyzer.md`)
1. **Collection**
- MSI database parsing → component records keyed by ProductCode/ComponentCode.
- WinSxS manifests → assembly identities, catalog signatures.
- Chocolatey packages → nuspec metadata, feed provenance, script hashes.
- Registry exports → uninstall/service entries, legacy installers.
- Driver/service mapper → capability overlays (kernel-mode, auto-start).
2. **Storage**
- Results persisted as `LayerComponentFragment`s plus capability overlays (`ScanAnalysisKeys.capability.windows`).
- Provenance metadata includes signature thumbprint, catalog hash, feed URL, install context.
3. **Downstream**
- Policy Engine consumes component + capability evidence; Export Center bundles MSI manifests, nuspec metadata, catalog hashes.
## 3. Policy predicate requirements
| Predicate | Description | Initial default |
| --- | --- | --- |
| `windows.package.signed(thumbprint?)` | True when Authenticode signature/cert matches allowlist. | Warn on missing signature, fail on mismatched thumbprint for kernel drivers. |
| `windows.package.sourceAllowed(sourceId)` | Validates Chocolatey/nuget feed against tenant allowlist. | Fail if feed not in tenant policy. |
| `windows.driver.kernelMode()` | Flags kernel-mode drivers for extra scrutiny. | Fail when unsigned; warn otherwise. |
| `windows.driver.signedBy(publisher)` | Checks driver publisher matches allowlist. | Warn on unknown publisher. |
| `windows.service.autoStart(name)` | Identifies auto-start services. | Warn if unsigned binary or service not in allowlist. |
| `windows.package.legacyInstaller()` | Legacy EXE-only installers detected via registry. | Warn by default; escalate if binary unsigned. |
Additional considerations:
- Map KB references (from WinSxS/MSP metadata) to vulnerability posture once Policy Engine supports patch layering.
- Provide predicates to waive specific ProductCodes or AssemblyIdentities with expiration.
## 4. Waiver & governance model
- Waiver key: `{productCode, version, signatureThumbprint}` or for drivers `{driverName, serviceName, signatureThumbprint}`.
- Required metadata: remediation ticket, justification, expiry date.
- Automated re-evaluation when version or signature changes.
- Tenants maintain allow lists for Chocolatey feeds and driver publishers via policy configuration.
## 5. Masking & privacy
- Findings should not include raw script contents; provide SHA256 hash and limited excerpt (first/last 8 chars).
- Registry values (install paths, command lines) must be truncated if they contain secrets; rely on Surface.Secrets to manage environment variables referenced during install scripts.
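A minimal sketch of the hash-plus-excerpt summarisation described above; the exact excerpt length and encoding handling remain open for the Security Guild to confirm.

```python
import hashlib

def summarise_script(contents: bytes) -> dict:
    # Emit a SHA-256 plus a bounded excerpt (first/last 8 characters) instead of
    # the raw script body. The 8-character bound mirrors the bullet above and is
    # still subject to the masking decision.
    text = contents.decode("utf-8", errors="replace")
    excerpt = text if len(text) <= 16 else f"{text[:8]}…{text[-8:]}"
    return {
        "sha256": hashlib.sha256(contents).hexdigest(),
        "excerpt": excerpt,
    }
```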
## 6. Offline kit guidance
- Bundle:
- MSI parser binary + schema definitions.
- Chocolatey feed snapshot(s) (nupkg files) with hash manifest.
- Microsoft root/intermediate certificate bundles; optional CRL/OCSP cache instructions.
- Operators must export registry hives (`SOFTWARE`, `SYSTEM`) during image extraction; document PowerShell script and required access.
- Provide checksum manifest to verify feed snapshot integrity.
## 7. Telemetry expectations
- Metrics:
- `scanner.windows.package_total{tenant,signed}` — count packages per signature state.
- `scanner.windows.driver_unsigned_total{tenant}`.
- `scanner.windows.choco_feed_total{tenant,feed}`.
- Logs:
- Include product code, version, signature thumbprint, feed ID (no file paths unless sanitized).
- Traces:
- Annotate collector spans (`collector.windows.msi`, `collector.windows.winsxs`, etc.) with component counts and parsing duration.
## 8. Open questions
| Topic | Question | Owner | Target decision |
| --- | --- | --- | --- |
| Signature verification locus | Scanner vs Policy: where to verify Authenticode signatures + revocation? | Security Guild | Sprint 133 |
| Feed mirroring scope | Default set of Chocolatey feeds to mirror (official/community). | Product + Security Guild | Sprint 133 |
| Legacy installers | Should we block unsigned EXE installers by default or allow warn-only posture? | Policy Guild | Sprint 134 |
| Driver taxonomy | Define high-risk driver categories (kernel-mode, filter drivers) for policy severity. | Policy Guild | Sprint 134 |
### Decision tracker
| Decision | Owner(s) | Target date | Status |
| --- | --- | --- | --- |
| Authenticode verification locus (Scanner vs Policy) | Security Guild | 2025-11-07 | Pending — blocker for FinSecure |
| Chocolatey feed mirroring scope | Product + Security Guild | 2025-11-07 | Draft proposal circulating |
| Legacy installer posture (warn vs fail) | Policy Guild | 2025-11-14 | Not started |
| Driver risk taxonomy | Policy Guild | 2025-11-14 | Not started |
## 9. Next steps
1. Policy Guild drafts predicate specs + policy templates; align with DOCS-SCANNER-BENCH-62-016.
2. Security Guild evaluates signature verification approach and revocation handling (online vs offline CRL cache).
3. Offline Kit Guild scopes snapshot size and update cadence for Chocolatey feeds and certificate bundles.
4. Docs Guild prepares policy/user guidance updates once predicates are finalised.
5. Security Guild to report decision for FinSecure Corp (POLICY-READINESS-0002) by 2025-11-07; feed outcome into dashboards.
## Coordination
- Sync demand signals via `docs/benchmarks/scanner/windows-macos-demand.md`.
- Log policy readiness status in `docs/api/scanner/windows-coverage.md`.
- Update Windows/macOS metrics dashboard when decisions change (`docs/api/scanner/windows-macos-summary.md`).

View File

@@ -8,6 +8,12 @@ Scanner analyses container images layer-by-layer, producing deterministic SBOM f
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)
- [macOS analyzer design](./design/macos-analyzer.md)
- [Windows analyzer design](./design/windows-analyzer.md)
- [Field engagement workflow](./operations/field-engagement.md)
- [Design dossiers](./design/README.md)
- [Benchmarks overview](../../benchmarks/scanner/README.md)
## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.

View File

@@ -1,38 +1,46 @@
# StellaOps Scanner
Scanner analyses container images layer-by-layer, producing deterministic SBOM fragments, diffs, and signed reports.
## Responsibilities
- Expose APIs (WebService) for scan orchestration, diffing, and artifact retrieval.
- Run Worker analyzers for OS, language, and native ecosystems with restart-only plug-ins.
- Store SBOM fragments and artifacts in RustFS/object storage.
- Publish DSSE-ready metadata for Signer/Attestor and downstream policy evaluation.
## Key components
- `StellaOps.Scanner.WebService` minimal API host.
- `StellaOps.Scanner.Worker` analyzer executor.
- Analyzer libraries under `StellaOps.Scanner.Analyzers.*`.
## Integrations & dependencies
- Scheduler for job intake and retries.
- Policy Engine for evidence handoff.
- Export Center / Offline Kit for artifact packaging.
## Operational notes
- CAS caches, bounded retries, DSSE integration.
- Monitoring dashboards (see ./operations/analyzers-grafana-dashboard.json).
- RustFS migration playbook.
## Related resources
- ./operations/analyzers.md
- ./operations/analyzers-grafana-dashboard.json
- ./operations/rustfs-migration.md
- ./operations/entrypoint.md
- ./design/macos-analyzer.md
- ./design/windows-analyzer.md
- ../benchmarks/scanner/deep-dives/macos.md
- ../benchmarks/scanner/deep-dives/windows.md
- ../benchmarks/scanner/windows-macos-demand.md
- ../benchmarks/scanner/windows-macos-interview-template.md
- ./operations/field-engagement.md
- ./design/README.md
## Backlog references
- DOCS-SCANNER updates tracked in ../../TASKS.md.
- Analyzer parity work in src/Scanner/**/TASKS.md.
## Epic alignment
- **Epic 6 – Vulnerability Explorer:** provide policy-aware scan outputs, explain traces, and findings ledger hooks for triage workflows.
- **Epic 10 – Export Center:** generate export-ready artefacts, manifests, and DSSE metadata for bundles.

View File

@@ -28,5 +28,13 @@
| SCANNER-ENG-0005 | TODO | Go Analyzer Guild | Enhance Go stripped-binary fallback inference per `docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md`. | Include inferred module metadata & policy integration |
| SCANNER-ENG-0006 | TODO | Rust Analyzer Guild | Expand Rust fingerprint coverage per `docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md`. | Ship enriched fingerprint catalogue + policy controls |
| SCANNER-ENG-0007 | TODO | Scanner Guild, Policy Guild | Design deterministic secret leak detection pipeline per `docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md`. | Include rule packaging, Policy Engine integration, CLI workflow |
| SCANNER-ENG-0020 | TODO | Scanner Guild (macOS Cellar Squad) | Implement Homebrew collector and fragment mapper per `design/macos-analyzer.md` §3.1. | Emit brew component fragments with tap provenance; integrate Surface.Validation/FS limits. |
| SCANNER-ENG-0021 | TODO | Scanner Guild (macOS Receipts Squad) | Implement pkgutil receipt collector per `design/macos-analyzer.md` §3.2. | Parse receipts/BOMs into deterministic component records with install metadata. |
| SCANNER-ENG-0022 | TODO | Scanner Guild, Policy Guild (macOS Bundles Squad) | Implement macOS bundle inspector & capability overlays per `design/macos-analyzer.md` §3.3. | Extract signing/entitlements, emit capability evidence, merge with receipts/Homebrew. |
| SCANNER-ENG-0023 | TODO | Scanner Guild, Offline Kit Guild, Policy Guild | Deliver macOS policy/offline integration per `design/macos-analyzer.md` §5–6. | Define policy predicates, CLI toggles, Offline Kit packaging, and documentation. |
| SCANNER-ENG-0024 | TODO | Scanner Guild (Windows MSI Squad) | Implement Windows MSI collector per `design/windows-analyzer.md` §3.1. | Parse MSI databases, emit component fragments with provenance metadata; blocked until POLICY-READINESS-0002 (decision due 2025-11-07). |
| SCANNER-ENG-0025 | TODO | Scanner Guild (Windows WinSxS Squad) | Implement WinSxS manifest collector per `design/windows-analyzer.md` §3.2. | Correlate assemblies with MSI components and catalog signatures; dependent on POLICY-READINESS-0002 outcome. |
| SCANNER-ENG-0026 | TODO | Scanner Guild (Windows Packages Squad) | Implement Chocolatey & registry collectors per `design/windows-analyzer.md` §3.3–3.4. | Harvest nuspec metadata and registry uninstall/service evidence; merge with filesystem artefacts; align with feed decisions from POLICY-READINESS-0002. |
| SCANNER-ENG-0027 | TODO | Scanner Guild, Policy Guild, Offline Kit Guild | Deliver Windows policy/offline integration per `design/windows-analyzer.md` §5–6. | Define predicates, CLI/Offline docs, and packaging for feeds/certs; start after POLICY-READINESS-0002 sign-off. |
| SCANNER-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| SCANNER-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |

View File

@@ -0,0 +1,35 @@
# Scanner Design Dossiers
This directory contains deep technical designs for current and upcoming analyzers and surface components.
## Language analyzers
- `ruby-analyzer.md` — lockfile, runtime graph, capability signals for Ruby.
## Surface & platform contracts
- `surface-fs.md`
- `surface-env.md`
- `surface-validation.md`
- `surface-secrets.md`
## OS ecosystem designs
- `macos-analyzer.md` — Homebrew, pkgutil, `.app` bundle plan.
- `windows-analyzer.md` — MSI, WinSxS, Chocolatey, registry collectors.
## Demand & dashboards
- `../../benchmarks/scanner/windows-macos-demand.md` — demand tracker.
- `../../benchmarks/scanner/windows-macos-interview-template.md` — interview template.
- `../../api/scanner/windows-coverage.md` — coverage summary dashboard.
- `../../api/scanner/windows-macos-summary.md` — metric snapshot.
## Utility & reference
- `../operations/field-engagement.md` — SE workflow guidance.
- `../operations/analyzers.md` — operational runbook.
- `../operations/rustfs-migration.md` — storage migration notes.
## Maintenance tips
- Keep demand tracker (`../../benchmarks/scanner/windows-macos-demand.md`) and API dashboards in sync when updating macOS/Windows designs.
- Cross-reference policy readiness briefs for associated predicates and waiver models.
## Policy readiness
- `../policy/secret-leak-detection-readiness.md` — secret leak pipeline decisions.
- `../policy/windows-package-readiness.md` — Windows analyzer policy decisions.

View File

@@ -0,0 +1,123 @@
# macOS Analyzer Design Brief (Draft)
> Owners: Scanner Guild, Policy Guild, Offline Kit Guild
> Related backlog (proposed): SCANNER-ENG-0020..0023, DOCS-SCANNER-BENCH-62-002
> Status: Draft pending demand validation (see `docs/benchmarks/scanner/windows-macos-demand.md`)
## 1. Scope & objectives
- Deliver deterministic inventory coverage for macOS container and VM images, focusing on Homebrew, Apple/pkgutil receipts, and `.app` bundles.
- Preserve StellaOps principles: per-layer provenance, offline parity, policy explainability, and signed evidence pipelines.
- Provide capability signals (entitlements, hardened runtime, launch agents) to enable Policy Engine gating.
Out of scope (Phase 1):
- Dynamic runtime tracing of macOS services (deferred to Zastava/EntryTrace).
- iOS/tvOS/visionOS package formats (will reuse bundle inspection where feasible).
- Direct notarization ticket validation (delegated to Policy Engine unless otherwise decided).
## 2. High-level architecture
```
Scanner.Worker (macOS profile)
├─ Surface.Validation (enforce allowlists, bundle size limits)
├─ Surface.FS (layer/materialized filesystem)
├─ HomebrewCollector (new) -> LayerComponentFragment (brew)
├─ PkgutilCollector (new) -> LayerComponentFragment (pkgutil)
├─ BundleInspector (new) -> Capability records + component fragments
├─ LaunchAgentMapper (optional) -> Usage hints
└─ MacOsAggregator (new) -> merges fragments, emits ComponentGraph & capability overlays
```
- Each collector runs deterministically against the mounted filesystem; results persist in `ScanAnalysisStore` under dedicated keys before being normalized by `MacOsComponentMapper`.
- Layer digests follow existing `LayerComponentFragment` semantics to maintain diff/replay parity.
## 3. Collectors & data sources
### 3.1 Homebrew collector
- Scan `/usr/local/Cellar/**` and `/opt/homebrew/Cellar/**` for `INSTALL_RECEIPT.json` and `*.rb` formula metadata.
- Output fields: `tap`, `formula`, `version`, `revision`, `poured_from_bottle`, `installed_with` (array).
- Store `bottle.sha256`, `source.url`, and `tapped_from` metadata for provenance.
- Map to PURL: `pkg:brew/<tap>/<formula>@<version>?revision=<revision>`.
- Guardrails: limit traversal depth, ignore caches (`/Caskroom` optional), enforce 200 MB per formula cap (configurable).
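
A minimal sketch of the receipt-to-PURL mapping described above, assuming a simplified receipt shape (the `HomebrewReceipt` record and its field names are illustrative, not the final collector contract):

```csharp
// Hypothetical shape of the fields pulled from INSTALL_RECEIPT.json; the real
// collector would deserialize the receipt and formula metadata deterministically.
public sealed record HomebrewReceipt(
    string Tap,             // e.g. "homebrew/core"
    string Formula,         // e.g. "openssl@3"
    string Version,         // e.g. "3.3.1"
    int Revision,
    bool PouredFromBottle,
    string? BottleSha256);

public static class HomebrewPurlMapper
{
    // Maps a receipt to the proposed scheme: pkg:brew/<tap>/<formula>@<version>?revision=<revision>
    // (full purl percent-encoding rules are omitted for brevity).
    public static string ToPurl(HomebrewReceipt receipt)
    {
        var purl = $"pkg:brew/{receipt.Tap}/{receipt.Formula}@{receipt.Version}";
        return receipt.Revision > 0 ? $"{purl}?revision={receipt.Revision}" : purl;
    }
}
```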
### 3.2 pkgutil receipt collector
- Parse `/var/db/receipts/*.plist` for pkg identifiers, version, install time, volume, installer domain.
- Use `.bom` files to enumerate installed files; capture path hashes via `osx.BomParser`.
- Emit `ComponentRecord`s keyed by `pkgutil:<identifier>` with metadata: `bundleIdentifier`, `installTimeUtc`, `volume`.
- Provide dependency mapping between pkg receipts and Homebrew formulae when overlapping file hashes exist (optional Phase 2).
### 3.3 Bundle inspector
- Traverse `/Applications`, `/System/Applications`, `/Users/*/Applications`, `/Library/Application Support`.
- For each `.app`:
- Read `Info.plist` (bundle id, version, short version, minimum system).
- Extract code signing metadata via `codesign --display --xml` style parsing (Team ID, certificate chain, hardened runtime flag).
- Parse entitlements (`archived-entitlements.xcent`) and map to capability taxonomy (network, camera, automation, etc.).
- Hash `CodeResources` manifest to support provenance comparisons.
- Link `.app` bundles to receipts and Homebrew formulae using bundle id or install path heuristics.
- Emit capability overlays (e.g., `macos.entitlement("com.apple.security.network.server") == true`).
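
A sketch of the capability evidence a bundle inspection could emit, with an assumed entitlement-to-bucket grouping (the entitlement taxonomy is still open per section 8, so keys and bucket names below are placeholders):

```csharp
using System.Collections.Generic;

// Placeholder capability record per .app bundle; not the final ComponentMetadata contract.
public sealed record MacOsBundleCapability(
    string BundleId,
    string? TeamId,
    bool HardenedRuntime,
    IReadOnlyCollection<string> Entitlements);

public static class EntitlementTaxonomy
{
    // Assumed grouping of raw entitlement keys into coarse capability buckets
    // that the Policy Engine could gate on.
    private static readonly Dictionary<string, string> Buckets = new()
    {
        ["com.apple.security.network.server"] = "network",
        ["com.apple.security.device.camera"] = "camera",
        ["com.apple.security.automation.apple-events"] = "automation",
    };

    public static string Classify(string entitlementKey) =>
        Buckets.TryGetValue(entitlementKey, out var bucket) ? bucket : "other";
}
```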
### 3.4 Launch agent mapper (optional Phase 1 extension)
- Inspect `/Library/Launch{Agents,Daemons}` and user equivalents.
- Parse `plist` for program arguments, run conditions, sockets, environment.
- Feed derived runtime hints into EntryTrace (`UsageHints`) to mark active binaries vs dormant installations.
## 4. Aggregation & output
- `MacOsComponentMapper` converts collector results into:
- `LayerComponentFragment` arrays keyed by synthetic digests (`sha256:stellaops-os-macbrew`, etc.).
- `ComponentMetadata` entries capturing tap origin, Team ID, entitlements, notarization flag (when available).
- Capability overlays stored under `ScanAnalysisKeys.capability.macOS`.
- Export Center enhancements:
- Include Homebrew metadata manifests and CodeResources hashes in attested bundles.
- Provide optional notarization proof attachments if Policy Engine later requires them.
## 5. Policy & governance integration
- Predicates to introduce:
- `macos.bundle.signed(teamId?, hardenedRuntime?)`
- `macos.entitlement(name)`
- `macos.pkg.receipt(identifier, version?)`
- `macos.notarized` (pending Security decision).
- Lattice guidance:
- Unsigned/unnotarized third-party apps default to `warn`.
- High-risk entitlements (camera, screen capture) escalate severity unless whitelisted.
- Waiver model similar to existing EntryTrace/Secrets: require bundle hash + Team ID to avoid broad exceptions.
## 6. Offline kit considerations
- Mirror required Homebrew taps (`homebrew/core`, `homebrew/cask`) and freeze metadata per release.
- Bundle notarization cache instructions (CRL/OCSP) so operators can prefetch without Apple connectivity.
- Package Apple certificate chain updates (WWDR) and provide verification script to ensure validity.
- Document disk space impacts (Homebrew cellar snapshots ~500 MB per tap snapshot).
## 7. Testing strategy
- Unit tests with fixture directories:
- Sample Cellar tree with multiple formulas (bottled vs source builds).
- Synthetic pkg receipts + BOM files.
- `.app` bundles with varying signing states (signed/notarized/unsigned).
- Integration tests:
- Build macOS rootfs tarballs via CI job (runner requiring macOS) that mimic layered container conversion.
- Verify deterministic output by re-running collectors and diffing results (CI gating).
- Offline compliance tests:
- Ensure rule bundles and caches load without network by running integration suite in sealed environment.
## 8. Dependencies & open items
| Item | Description | Owner | Status |
| --- | --- | --- | --- |
| Demand threshold | Need ≥3 qualified customer asks to justify engineering investment | Product Guild | Active (DOCS-SCANNER-BENCH-62-002) |
| macOS rootfs fixtures | Require automation to produce macOS layer tarballs for tests | Scanner Guild | Pending design |
| Notarization validation | Decide whether scanner performs `spctl --assess` or defers | Security Guild | TBD |
| Entitlement taxonomy | Finalize capability grouping for Policy Engine | Policy Guild | TBD |
## 9. Proposed backlog entries
| ID (proposed) | Title | Summary |
| --- | --- | --- |
| SCANNER-ENG-0020 | Implement Homebrew collector and fragment mapper | Parse cellar manifests, emit brew component records, integrate with Surface.FS/Validation. |
| SCANNER-ENG-0021 | Implement pkgutil receipt collector | Parse receipts/BOM, output installer component records with provenance. |
| SCANNER-ENG-0022 | Implement macOS bundle inspector & capability overlays | Extract plist/signing/entitlements, produce capability evidence and merge with receipts. |
| SCANNER-ENG-0023 | Policy & Offline integration for macOS | Define predicates, CLI toggles, Offline Kit packaging, and documentation. |
## 10. References
- `docs/benchmarks/scanner/deep-dives/macos.md` — competitor snapshot and roadmap summary.
- `docs/benchmarks/scanner/windows-macos-demand.md` — demand capture log.
- `docs/modules/scanner/design/surface-secrets.md`, `surface-fs.md`, `surface-validation.md` — surface contracts leveraged by collectors.
Further reading: `../../api/scanner/windows-coverage.md` (summary) and `../../api/scanner/windows-macos-summary.md` (metrics dashboard).
Policy readiness alignment: see `../policy/secret-leak-detection-readiness.md` (POLICY-READINESS-0001).
Upcoming milestone: Northwind Health Services demo on 2025-11-10 to validate notarization/entitlement outputs before final policy sign-off.

View File

@@ -0,0 +1,135 @@
# Windows Analyzer Design Brief (Draft)
> Owners: Scanner Guild, Policy Guild, Offline Kit Guild, Security Guild
> Related backlog (proposed): SCANNER-ENG-0024..0027, DOCS-SCANNER-BENCH-62-002
> Status: Draft — contingent on Windows demand threshold (see `docs/benchmarks/scanner/windows-macos-demand.md`)
## 1. Objectives & boundaries
- Provide deterministic inventory for Windows Server/container images covering MSI/WinSxS assemblies, Chocolatey packages, and registry-derived installers.
- Preserve replayability (layer fragments, provenance metadata) and align outputs with existing SBOM/policy pipelines.
- Respect sovereignty constraints: offline-friendly, signed rule bundles, no reliance on Windows APIs unavailable in containerized scans.
Out of scope (Phase 1):
- Live registry queries on running Windows hosts (requires runtime agent; defer to Zastava/Runtime roadmap).
- Windows Update patch baseline comparison (tracked separately under Runtime/Posture).
- UWP/MSIX packages (flagged for follow-up once MSI parity is complete).
## 2. Architecture overview
```
Scanner.Worker (Windows profile)
├─ Surface.Validation (enforce layer size, path allowlists)
├─ Surface.FS (materialized NTFS image via 7z/guestmount)
├─ MsiCollector -> LayerComponentFragment (windows-msi)
├─ WinSxSCollector -> LayerComponentFragment (windows-winsxs)
├─ ChocolateyCollector -> LayerComponentFragment (windows-choco)
├─ RegistryCollector -> Evidence overlays (uninstall/services)
├─ DriverCapabilityMapper -> Capability overlays (kernel/user drivers)
└─ WindowsComponentMapper -> ComponentGraph + capability metadata
```
- Collectors operate on extracted filesystem snapshots; registry access performed on exported hive files produced during image extraction (document in ops runbooks).
- `WindowsComponentMapper` normalizes component identities (ProductCode, AssemblyIdentity, Chocolatey package ID) and merges overlapping evidence into deterministic fragments.
## 3. Collectors
### 3.1 MSI collector
- Input: `Windows/Installer/*.msi` database files (Jet OLE DB), registry hive exports for product mapping.
- Implementation approach:
- Use open-source MSI parser (custom or MIT-compatible) to avoid COM dependencies.
- Extract Product, Component, File, Feature, Media tables.
- Compute SHA256 for installed files via Component table, linking to WinSxS manifests.
- Output metadata: `productCode`, `upgradeCode`, `productVersion`, `manufacturer`, `language`, `installContext`, `packageCode`, `sourceList`.
- Evidence: file paths with digests, component IDs, CAB/patch references.
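
An illustrative record for the per-product metadata listed above; property names mirror the bullet list, and the real DTO shape remains open pending the MSI parser decision in section 8:

```csharp
using System.Collections.Generic;

// Illustrative per-product DTO emitted by the MSI collector; not a finalized contract.
public sealed record MsiProductRecord(
    string ProductCode,
    string? UpgradeCode,
    string ProductVersion,
    string Manufacturer,
    string Language,
    string InstallContext,              // e.g. "perMachine" / "perUser"
    string PackageCode,
    IReadOnlyList<string> SourceList);  // resolved source list entries
```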
### 3.2 WinSxS collector
- Input: `Windows/WinSxS/Manifests/*.manifest`, `Windows/WinSxS/` payload directories, catalog (.cat) files.
- Parse XML assembly identities (name, version, processor architecture, public key token, language).
- Map to MSI components when file hashes match.
- Capture catalog signature thumbprint and optional patch KB references for policy gating.
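
A minimal sketch of extracting the assembly identity from a manifest with `System.Xml.Linq`, assuming manifests are read from the materialized filesystem; namespace handling is simplified by matching on local names (real manifests use `urn:schemas-microsoft-com:asm.v*` namespaces):

```csharp
using System.Linq;
using System.Xml.Linq;

public sealed record WinSxsAssemblyIdentity(
    string Name, string Version, string? ProcessorArchitecture,
    string? PublicKeyToken, string? Language);

public static class WinSxsManifestParser
{
    public static WinSxsAssemblyIdentity? Parse(string manifestPath)
    {
        var doc = XDocument.Load(manifestPath);
        // Match on LocalName so the sketch stays agnostic to the manifest namespace version.
        var identity = doc.Descendants()
            .FirstOrDefault(e => e.Name.LocalName == "assemblyIdentity");
        if (identity is null) return null;

        string? Attr(string name) => identity.Attribute(name)?.Value;
        return new WinSxsAssemblyIdentity(
            Attr("name") ?? string.Empty,
            Attr("version") ?? string.Empty,
            Attr("processorArchitecture"),
            Attr("publicKeyToken"),
            Attr("language"));
    }
}
```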
### 3.3 Chocolatey collector
- Input: `ProgramData/Chocolatey/lib/**`, `ProgramData/Chocolatey/package.backup`, `chocolateyinstall.ps1`, `.nuspec`.
- Extract package ID, version, checksum, source feed, installed files and scripts.
- Note whether install used cache or remote feed; record script hash for determinism.
### 3.4 Registry collector
- Input: Exported `SOFTWARE` hive covering:
- `Microsoft\Windows\CurrentVersion\Uninstall`
- `Microsoft\Windows\CurrentVersion\Installer\UserData`
- `Microsoft\Windows\CurrentVersion\Run` (startup apps)
- Service/driver configuration from `SYSTEM` hive under `Services`.
- Emit fallback evidence for installers not captured by MSI/Chocolatey (legacy EXE installers).
- Record uninstall strings, install dates, publisher, estimated size, install location.
### 3.5 Driver & service mapper
- Parse `SYSTEM` hive `Services` entries to detect drivers (type=1 or 2) and critical services (start mode auto/boot).
- Output capability overlays (e.g., `windows.driver.kernelMode(true)`, `windows.service.autoStart("Spooler")`) for Policy Engine.
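
An illustrative projection from a parsed `Services` entry to the overlay strings above; the hive parsing itself and the exact overlay syntax remain open items, so the record shape is an assumption:

```csharp
using System.Collections.Generic;

// Simplified view of one SYSTEM\Services entry after hive export and parsing.
public sealed record WindowsServiceEntry(string Name, int Type, int Start);

public static class WindowsCapabilityMapper
{
    public static IEnumerable<string> ToOverlays(WindowsServiceEntry entry)
    {
        // Type 1 = kernel driver, 2 = file system driver (per SERVICE_* constants).
        if (entry.Type is 1 or 2)
            yield return "windows.driver.kernelMode(true)";

        // Start 0 = boot, 1 = system, 2 = auto.
        if (entry.Start is 0 or 1 or 2)
            yield return $"windows.service.autoStart(\"{entry.Name}\")";
    }
}
```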
## 4. Component mapping & output
- `WindowsComponentMapper`:
- Generate `LayerComponentFragment`s with synthetic layer digests (e.g., `sha256:stellaops-windows-msi`).
- Build `ComponentIdentity` with PURL-like scheme: `pkg:msi/<productCode>` or `pkg:winsxs/<assemblyIdentity>`.
- Include metadata: signature thumbprint, catalog hash, KB references, install context, manufacturer.
- Capability overlays stored under `ScanAnalysisKeys.capability.windows` for policy consumption.
- Export Center bundling:
- Include MSI manifest extracts, WinSxS assembly manifests, Chocolatey nuspec snapshots, and service/driver capability CSV.
## 5. Policy integration
- Predicates to introduce:
- `windows.package.signed(expectedThumbprint?)`
- `windows.package.unsupportedInstallerType`
- `windows.driver.kernelMode`, `windows.driver.unsigned`
- `windows.service.autoStart(name)`
- `windows.choco.sourceAllowed(feed)`
- Lattice approach:
- Unsigned kernel drivers → default `fail`.
- Unknown installer sources → `warn` with escalation on critical services.
- Chocolatey packages from non-whitelisted feeds → configurable severity.
- Waiver semantics bind to product code + signature thumbprint; waivers expire when package version changes.
## 6. Offline kit & distribution
- Package:
- MSI schema definitions and parser binaries (signed).
- Chocolatey feed snapshot (nupkg archives + index) for allow-listed feeds.
- Windows catalog certificate chains + optional CRL/OCSP caches.
- Documentation:
- Provide instructions for exporting registry hives during image extraction (PowerShell script included).
- Note disk space expectations (Chocolatey snapshot size, WinSxS manifest volume).
## 7. Testing strategy
- Fixtures:
- Sample MSI packages (with/without transforms), WinSxS manifests, Chocolatey packages.
- Registry hive exports representing mixed installer types.
- Tests:
- Unit tests for each collector parsing edge cases (language-specific manifests, transforms, script hashing).
- Integration tests using synthetic Windows container image layers (generated via CI on Windows worker).
- Determinism checks ensuring repeated runs produce identical fragments.
- Security review:
- Validate script execution paths (collectors must never execute Chocolatey scripts; inspect only).
## 8. Dependencies & open questions
| Item | Description | Owner | Status |
| --- | --- | --- | --- |
| MSI parser choice | Select MIT/Apache-compatible parser or build internal reader | Scanner Guild | TBD |
| Registry export tooling | Determine standard script/utility for hive exports in container context | Ops Guild | TBD |
| Authenticode verification locus | Decide scanner vs policy responsibility for signature verification | Security Guild | TBD |
| Feed mirroring policy | Which Chocolatey feeds to mirror by default | Product + Security Guilds | TBD |
## 9. Proposed backlog entries
| ID (proposed) | Title | Summary |
| --- | --- | --- |
| SCANNER-ENG-0024 | Implement Windows MSI collector | Parse MSI databases, emit component fragments with provenance metadata. |
| SCANNER-ENG-0025 | Implement WinSxS manifest collector | Correlate assemblies with MSI components and catalog signatures. |
| SCANNER-ENG-0026 | Implement Chocolatey & registry collectors | Harvest nuspec metadata and uninstall/service registry data. |
| SCANNER-ENG-0027 | Policy & Offline integration for Windows | Define predicates, CLI toggles, Offline Kit packaging, documentation. |
## 10. References
- `docs/benchmarks/scanner/deep-dives/windows.md`
- `docs/benchmarks/scanner/windows-macos-demand.md`
- `docs/modules/scanner/design/macos-analyzer.md` (structure/composition parallels)
- Surface design docs (`surface-fs.md`, `surface-validation.md`, `surface-secrets.md`) for interfacing expectations.
Further reading: `../../api/scanner/windows-coverage.md` (summary) and `../../api/scanner/windows-macos-summary.md` (metrics dashboard).
Policy readiness alignment: see `../policy/windows-package-readiness.md` (POLICY-READINESS-0002).
Upcoming milestone: FinSecure Corp PCI review requires Authenticode/feed decision by 2025-11-07 before Windows analyzer spike kickoff.

View File

@@ -0,0 +1,30 @@
# Field Engagement Playbook — Windows & macOS Coverage
> Audience: Field SEs, Product Specialists • Status: Draft
## Purpose
Provide quick-reference guidance when prospects or customers ask about Windows/macOS coverage.
## Key talking points
- **Current scope**: Scanner supports deterministic Linux coverage; Windows/macOS analyzers are in design.
- **Roadmap**: macOS design (brew/pkgutil/.app) at `../design/macos-analyzer.md`; Windows design (MSI/WinSxS/Chocolatey) at `../design/windows-analyzer.md`.
- **Demand tracking**: All signals captured in `../../benchmarks/scanner/windows-macos-demand.md` using the interview template.
- **Policy readiness**: Secret leak detection briefing (`../../policy/secret-leak-detection-readiness.md`) and Windows package readiness (`../../policy/windows-package-readiness.md`).
- **Backlog IDs**: macOS (SCANNER-ENG-0020..0023), Windows (SCANNER-ENG-0024..0027), policy follow-ups (POLICY-READINESS-0001/0002).
## SE workflow
1. Use the interview template to capture customer needs.
2. Append structured summary to `windows-macos-demand.md` and update the API dashboards (`docs/api/scanner/windows-macos-summary.md`, `docs/api/scanner/windows-coverage.md`).
3. Notify Product/Scanner guild during weekly sync; flag blockers in Jira.
4. Add highlight to the “Recent updates” section in `docs/api/scanner/windows-macos-summary.md`.
5. Track upcoming milestones (FinSecure decision 2025-11-07, Northwind demo 2025-11-10) and ensure readiness tasks reflect outcomes.
## FAQ snippets
- *When will Windows/macOS analyzers be GA?* — Pending demand threshold; design complete, awaiting prioritisation.
- *Can we run scans offline?* — Offline parity is a requirement; Offline Kit packaging detailed in design briefs.
- *Do we cover Authenticode/notarization?* — Planned via Policy Engine predicates as part of readiness tasks.
## Contacts
- Product lead: TBD (record in demand log when assigned)
- Scanner guild rep: TBD
- Policy guild rep: TBD

View File

@@ -1,426 +1,426 @@
# component_architecture_scheduler.md — **StellaOps Scheduler** (2025-Q4)
> Synthesises the scheduling requirements documented across the Policy, Vulnerability Explorer, and Orchestrator module guides and implementation plans.
> **Scope.** Implementation-ready architecture for **Scheduler**: a service that (1) **re-evaluates** already-cataloged images when intel changes (Feedser/Vexer/policy), (2) orchestrates **nightly** and **ad-hoc** runs, (3) targets only the **impacted** images using the BOM-Index, and (4) emits **report-ready** events that downstream **Notify** fans out. Default mode is **analysis-only** (no image pull); optional **content-refresh** can be enabled per schedule.
---
## 0) Mission & boundaries
**Mission.** Keep scan results **current** without re-scanning the world. When new advisories or VEX claims land, **pinpoint** affected images and ask the backend to recompute **verdicts** against the **existing SBOMs**. Surface only **meaningful deltas** to humans and ticket queues.
**Boundaries.**
* Scheduler **does not** compute SBOMs and **does not** sign. It calls Scanner/WebService's **/reports (analysis-only)** endpoint and lets the backend (Policy + Vexer + Feedser) decide PASS/FAIL.
* Scheduler **may** ask Scanner to **content-refresh** selected targets (e.g., mutable tags) but the default is **no** image pull.
* Notifications are **not** sent directly; Scheduler emits events consumed by **Notify**.
---
## 1) Runtime shape & projects
```
src/
├─ StellaOps.Scheduler.WebService/ # REST (schedules CRUD, runs, admin)
├─ StellaOps.Scheduler.Worker/ # planners + runners (N replicas)
├─ StellaOps.Scheduler.ImpactIndex/ # purl→images inverted index (roaring bitmaps)
├─ StellaOps.Scheduler.Models/ # DTOs (Schedule, Run, ImpactSet, Deltas)
├─ StellaOps.Scheduler.Storage.Mongo/ # schedules, runs, cursors, locks
├─ StellaOps.Scheduler.Queue/ # Redis Streams / NATS abstraction
├─ StellaOps.Scheduler.Tests.* # unit/integration/e2e
```
**Deployables**:
* **Scheduler.WebService** (stateless)
* **Scheduler.Worker** (scale-out; planners + executors)
**Dependencies**: Authority (OpTok + DPoP/mTLS), Scanner.WebService, Feedser, Vexer, MongoDB, Redis/NATS, (optional) Notify.
---
## 2) Core responsibilities
1. **Time-based** runs: cron windows per tenant/timezone (e.g., “02:00 Europe/Sofia”).
2. **Event-driven** runs: react to **Feedser export** and **Vexer export** deltas (changed product keys / advisories / claims).
3. **Impact targeting**: map changes to **image sets** using a **global inverted index** built from Scanner's per-image **BOM-Index** sidecars.
4. **Run planning**: shard, pace, and rate-limit jobs to avoid thundering herds.
5. **Execution**: call Scanner **/reports (analysis-only)** or **/scans (content-refresh)**; aggregate **delta** results.
6. **Events**: publish `rescan.delta` and `report.ready` summaries for **Notify** & **UI**.
7. **Control plane**: CRUD schedules, **pause/resume**, dry-run previews, audit.
---
## 3) Data model (Mongo)
**Database**: `scheduler`
* `schedules`
```
{ _id, tenantId, name, enabled, whenCron, timezone,
mode: "analysis-only" | "content-refresh",
selection: { scope: "all-images" | "by-namespace" | "by-repo" | "by-digest" | "by-labels",
includeTags?: ["prod-*"], digests?: [sha256...], resolvesTags?: bool },
onlyIf: { lastReportOlderThanDays?: int, policyRevision?: string },
notify: { onNewFindings: bool, minSeverity: "low|medium|high|critical", includeKEV: bool },
limits: { maxJobs?: int, ratePerSecond?: int, parallelism?: int },
createdAt, updatedAt, createdBy, updatedBy }
```
* `runs`
```
{ _id, scheduleId?, tenantId, trigger: "cron|feedser|vexer|manual",
reason?: { feedserExportId?, vexerExportId?, cursor? },
state: "planning|queued|running|completed|error|cancelled",
stats: { candidates: int, deduped: int, queued: int, completed: int, deltas: int, newCriticals: int },
startedAt, finishedAt, error? }
```
* `impact_cursors`
```
{ _id: tenantId, feedserLastExportId, vexerLastExportId, updatedAt }
```
* `locks` (singleton schedulers, run leases)
* `audit` (CRUD actions, run outcomes)
**Indexes**:
* `schedules` on `{tenantId, enabled}`, `{whenCron}`.
* `runs` on `{tenantId, startedAt desc}`, `{state}`.
* TTL optional for completed runs (e.g., 180 days).
---
## 4) ImpactIndex (global inverted index)
Goal: translate **change keys** → **image sets** in **milliseconds**.
**Source**: Scanner produces per-image **BOM-Index** sidecars (purls, and `usedByEntrypoint` bitmaps). Scheduler ingests/refreshes them to build a **global** index.
**Representation**:
* Assign **image IDs** (dense ints) to catalog images.
* Keep **Roaring Bitmaps**:
* `Contains[purl] → bitmap(imageIds)`
* `UsedBy[purl] → bitmap(imageIds)` (subset of Contains)
* Optionally keep **Owner maps**: `{imageId → {tenantId, namespaces[], repos[]}}` for selection filters.
* Persist in RocksDB/LMDB or Redis modules; cache hot shards in memory; snapshot to Mongo for cold start.
**Update paths**:
* On new/updated image SBOM: **merge** per-image set into global maps.
* On image remove/expiry: **clear** id from bitmaps.
**API (internal)**:
```csharp
IImpactIndex {
ImpactSet ResolveByPurls(IEnumerable<string> purls, bool usageOnly, Selector sel);
ImpactSet ResolveByVulns(IEnumerable<string> vulnIds, bool usageOnly, Selector sel); // optional (vuln->purl precomputed by Feedser)
ImpactSet ResolveAll(Selector sel); // for nightly
}
```
**Selector filters**: tenant, namespaces, repos, labels, digest allowlists, `includeTags` patterns.
---
## 5) External interfaces (REST)
Base path: `/api/v1/scheduler` (Authority OpToks; scopes: `scheduler.read`, `scheduler.admin`).
### 5.1 Schedules CRUD
* `POST /schedules` → create
* `GET /schedules` → list (filter by tenant)
* `GET /schedules/{id}` → details + next run
* `PATCH /schedules/{id}` → pause/resume/update
* `DELETE /schedules/{id}` → delete (soft delete, optional)
### 5.2 Run control & introspection
* `POST /run` — ad-hoc run
```json
{ "mode": "analysis-only|content-refresh", "selection": {...}, "reason": "manual" }
```
* `GET /runs` — list with paging
* `GET /runs/{id}` — status, stats, links to deltas
* `POST /runs/{id}/cancel` — best-effort cancel
### 5.3 Previews (dry-run)
* `POST /preview/impact` — returns **candidate count** and a small sample of impacted digests for given change keys or selection.
### 5.4 Event webhooks (optional push from Feedser/Vexer)
* `POST /events/feedser-export`
```json
{ "exportId":"...", "changedProductKeys":["pkg:rpm/openssl", ...], "kev": ["CVE-..."], "window": { "from":"...","to":"..." } }
```
* `POST /events/vexer-export`
```json
{ "exportId":"...", "changedClaims":[ { "productKey":"pkg:deb/...", "vulnId":"CVE-...", "status":"not_affected→affected"} ], ... }
```
**Security**: webhook requires **mTLS** or an **HMAC** `X-Scheduler-Signature` (Ed25519 / SHA256) plus Authority token.
---
## 6) Planner → Runner pipeline
### 6.1 Planning algorithm (event-driven)
```
On Export Event (Feedser/Vexer):
keys = Normalize(change payload) # productKeys or vulnIds→productKeys
usageOnly = schedule/policy hint? # default true
sel = Selector for tenant/scope from schedules subscribed to events
impacted = ImpactIndex.ResolveByPurls(keys, usageOnly, sel)
impacted = ApplyOwnerFilters(impacted, sel) # namespaces/repos/labels
impacted = DeduplicateByDigest(impacted)
impacted = EnforceLimits(impacted, limits.maxJobs)
shards = Shard(impacted, byHashPrefix, n=limits.parallelism)
For each shard:
Enqueue RunSegment (runId, shard, rate=limits.ratePerSecond)
```
**Fairness & pacing**
* Use **leaky bucket** per tenant and per registry host.
* Prioritize **KEV-tagged** and **critical** first if oversubscribed.
### 6.2 Nightly planning
```
At cron tick:
sel = resolve selection
candidates = ImpactIndex.ResolveAll(sel)
if lastReportOlderThanDays present → filter by report age (via Scanner catalog)
shard & enqueue as above
```
### 6.3 Execution (Runner)
* Pop **RunSegment** job → for each image digest:
  * **analysis-only**: `POST scanner/reports { imageDigest, policyRevision? }`
  * **content-refresh**: resolve tag→digest if needed; `POST scanner/scans { imageRef, attest? false }` then `POST /reports`
* Collect **delta**: `newFindings`, `newCriticals`/`highs`, `links` (UI deep link, Rekor if present).
* Persist per-image outcome in `runs.{id}.stats` (incremental counters).
* Emit `scheduler.rescan.delta` events to **Notify** only when **delta > 0** and matches severity rule.
---
## 7) Event model (outbound)
**Topic**: `rescan.delta` (internal bus → Notify; UI subscribes via backend).
```json
{
"tenant": "tenant-01",
"runId": "324af…",
"imageDigest": "sha256:…",
"newCriticals": 1,
"newHigh": 2,
"kevHits": ["CVE-2025-..."],
"topFindings": [
{ "purl":"pkg:rpm/openssl@3.0.12-...","vulnId":"CVE-2025-...","severity":"critical","link":"https://ui/scans/..." }
],
"reportUrl": "https://ui/.../scans/sha256:.../report",
"attestation": { "uuid":"rekor-uuid", "verified": true },
"ts": "2025-10-18T03:12:45Z"
}
```
**Also**: `report.ready` for “no-change” summaries (digest + zero delta), which Notify can ignore by rule.
---
## 8) Security posture
* **AuthN/Z**: Authority OpToks with `aud=scheduler`; DPoP (preferred) or mTLS.
* **Multi-tenant**: every schedule, run, and event carries `tenantId`; ImpactIndex filters by tenant-visible images.
* **Webhook** callers (Feedser/Vexer) present **mTLS** or **HMAC** and Authority token.
* **Input hardening**: size caps on changed key lists; reject >100k keys per event; compress (zstd/gzip) allowed with limits.
* **No secrets** in logs; redact tokens and signatures.
---
## 9) Observability & SLOs
**Metrics (Prometheus)**
* `scheduler.events_total{source, result}`
* `scheduler.impact_resolve_seconds{quantile}`
* `scheduler.images_selected_total{mode}`
* `scheduler.jobs_enqueued_total{mode}`
* `scheduler.run_latency_seconds{quantile}` // event → first verdict
* `scheduler.delta_images_total{severity}`
* `scheduler.rate_limited_total{reason}`
**Targets**
* Resolve 10k changed keys → impacted set in **<300ms** (hot cache).
* Event → first rescan verdict in **≤60s** (p95).
* Nightly coverage 50k images in **≤10 min** with 10 workers (analysis-only).
**Tracing** (OTEL): spans `plan`, `resolve`, `enqueue`, `report_call`, `persist`, `emit`.
---
## 10) Configuration (YAML)
```yaml
scheduler:
authority:
issuer: "https://authority.internal"
require: "dpop" # or "mtls"
queue:
kind: "redis" # or "nats"
url: "redis://redis:6379/4"
mongo:
uri: "mongodb://mongo/scheduler"
impactIndex:
storage: "rocksdb" # "rocksdb" | "redis" | "memory"
warmOnStart: true
usageOnlyDefault: true
limits:
defaultRatePerSecond: 50
defaultParallelism: 8
maxJobsPerRun: 50000
integrates:
scannerUrl: "https://scanner-web.internal"
feedserWebhook: true
vexerWebhook: true
notifications:
emitBus: "internal" # deliver to Notify via internal bus
```
---
## 11) UI touchpoints
* **Schedules** page: CRUD, enable/pause, next run, last run stats, mode (analysis/content), selector preview.
* **Runs** page: timeline; heatmap of deltas; drill-down to affected images.
* **Dry-run preview** modal: “This Feedser export touches ~3,214 images; projected deltas: ~420 (34 KEV).”
---
## 12) Failure modes & degradations
| Condition | Behavior |
| ------------------------------------ | ---------------------------------------------------------------------------------------- |
| ImpactIndex cold / incomplete | Fall back to **All** selection for nightly; for events, cap to KEV+critical until warmed |
| Feedser/Vexer webhook storm           | Coalesce by exportId; debounce 30–60s; keep last                                          |
| Scanner under load (429)              | Backoff with jitter; respect per-tenant/leaky bucket                                      |
| Oversubscription (too many impacted)  | Prioritize KEV/critical first; spillover to next window; UI banner shows backlog          |
| Notify down                           | Buffer outbound events in queue (TTL 24h)                                                 |
| Mongo slow                            | Cut batch sizes; sample-log; alert ops; don't drop runs unless critical                   |
---
## 13) Testing matrix
* **ImpactIndex**: correctness (purl→image sets), performance, persistence after restart, memory pressure with 1M purls.
* **Planner**: dedupe, shard, fairness, limit enforcement, KEV prioritization.
* **Runner**: parallel report calls, error backoff, partial failures, idempotency.
* **End-to-end**: Feedser export → deltas visible in UI in ≤60s.
* **Security**: webhook auth (mTLS/HMAC), DPoP nonce dance, tenant isolation.
* **Chaos**: drop scanner availability; simulate registry throttles (content-refresh mode).
* **Nightly**: cron tick correctness across timezones and DST.
---
## 14) Implementation notes
* **Language**: .NET 10 minimal API; Channels-based pipeline; `System.Threading.RateLimiting`.
* **Bitmaps**: Roaring via `RoaringBitmap` bindings; memory-map large shards if RocksDB used.
* **Cron**: Quartz-style parser with timezone support; clock skew tolerated ±60s.
* **Dry-run**: use ImpactIndex only; never call scanner.
* **Idempotency**: run segments carry deterministic keys; retries safe.
* **Backpressure**: per-tenant buckets; per-host registry budgets respected when content-refresh enabled.
---
## 15) Sequences (representative)
**A) Event-driven rescan (Feedser delta)**
```mermaid
sequenceDiagram
autonumber
participant FE as Feedser
participant SCH as Scheduler.Worker
participant IDX as ImpactIndex
participant SC as Scanner.WebService
participant NO as Notify
FE->>SCH: POST /events/feedser-export {exportId, changedProductKeys}
SCH->>IDX: ResolveByPurls(keys, usageOnly=true, sel)
IDX-->>SCH: bitmap(imageIds) → digests list
SCH->>SC: POST /reports {imageDigest} (batch/sequenced)
SC-->>SCH: report deltas (new criticals/highs)
alt delta>0
SCH->>NO: rescan.delta {digest, newCriticals, links}
end
```
**B) Nightly rescan**
```mermaid
sequenceDiagram
autonumber
participant CRON as Cron
participant SCH as Scheduler.Worker
participant IDX as ImpactIndex
participant SC as Scanner.WebService
CRON->>SCH: tick (02:00 Europe/Sofia)
SCH->>IDX: ResolveAll(selector)
IDX-->>SCH: candidates
SCH->>SC: POST /reports {digest} (paced)
SC-->>SCH: results
SCH-->>SCH: aggregate, store run stats
```
**C) Content-refresh (tag followers)**
```mermaid
sequenceDiagram
autonumber
participant SCH as Scheduler
participant SC as Scanner
SCH->>SC: resolve tag→digest (if changed)
alt digest changed
SCH->>SC: POST /scans {imageRef} # new SBOM
SC-->>SCH: scan complete (artifacts)
SCH->>SC: POST /reports {imageDigest}
else unchanged
SCH->>SC: POST /reports {imageDigest} # analysis-only
end
```
---
## 16) Roadmap
* **Vuln-centric impact**: pre-join vuln→purl→images to rank by **KEV** and **exploited-in-the-wild** signals.
* **Policy diff preview**: when a staged policy changes, show projected breakage set before promotion.
* **Cross-cluster federation**: one Scheduler instance driving many Scanner clusters (tenant isolation).
* **Windows containers**: integrate Zastava runtime hints for Usage view tightening.
---
**End — component_architecture_scheduler.md**
> **Scope.** Implementation-ready architecture for **Scheduler**: a service that (1) **re-evaluates** already-cataloged images when intel changes (Conselier/Excitor/policy), (2) orchestrates **nightly** and **ad-hoc** runs, (3) targets only the **impacted** images using the BOM-Index, and (4) emits **report-ready** events that downstream **Notify** fans out. Default mode is **analysis-only** (no image pull); optional **content-refresh** can be enabled per schedule.
---
## 0) Mission & boundaries
**Mission.** Keep scan results **current** without re-scanning the world. When new advisories or VEX claims land, **pinpoint** affected images and ask the backend to recompute **verdicts** against the **existing SBOMs**. Surface only **meaningful deltas** to humans and ticket queues.
**Boundaries.**
* Scheduler **does not** compute SBOMs and **does not** sign. It calls Scanner/WebService's **/reports (analysis-only)** endpoint and lets the backend (Policy + Excitor + Conselier) decide PASS/FAIL.
* Scheduler **may** ask Scanner to **content-refresh** selected targets (e.g., mutable tags) but the default is **no** image pull.
* Notifications are **not** sent directly; Scheduler emits events consumed by **Notify**.
---
## 1) Runtime shape & projects
```
src/
├─ StellaOps.Scheduler.WebService/ # REST (schedules CRUD, runs, admin)
├─ StellaOps.Scheduler.Worker/ # planners + runners (N replicas)
├─ StellaOps.Scheduler.ImpactIndex/ # purl→images inverted index (roaring bitmaps)
├─ StellaOps.Scheduler.Models/ # DTOs (Schedule, Run, ImpactSet, Deltas)
├─ StellaOps.Scheduler.Storage.Mongo/ # schedules, runs, cursors, locks
├─ StellaOps.Scheduler.Queue/ # Redis Streams / NATS abstraction
├─ StellaOps.Scheduler.Tests.* # unit/integration/e2e
```
**Deployables**:
* **Scheduler.WebService** (stateless)
* **Scheduler.Worker** (scale-out; planners + executors)
**Dependencies**: Authority (OpTok + DPoP/mTLS), Scanner.WebService, Conselier, Excitor, MongoDB, Redis/NATS, (optional) Notify.
---
## 2) Core responsibilities
1. **Time-based** runs: cron windows per tenant/timezone (e.g., “02:00 Europe/Sofia”).
2. **Event-driven** runs: react to **Conselier export** and **Excitor export** deltas (changed product keys / advisories / claims).
3. **Impact targeting**: map changes to **image sets** using a **global inverted index** built from Scanner's per-image **BOM-Index** sidecars.
4. **Run planning**: shard, pace, and rate-limit jobs to avoid thundering herds.
5. **Execution**: call Scanner **/reports (analysis-only)** or **/scans (content-refresh)**; aggregate **delta** results.
6. **Events**: publish `rescan.delta` and `report.ready` summaries for **Notify** & **UI**.
7. **Control plane**: CRUD schedules, **pause/resume**, dry-run previews, audit.
---
## 3) Data model (Mongo)
**Database**: `scheduler`
* `schedules`
```
{ _id, tenantId, name, enabled, whenCron, timezone,
mode: "analysis-only" | "content-refresh",
selection: { scope: "all-images" | "by-namespace" | "by-repo" | "by-digest" | "by-labels",
includeTags?: ["prod-*"], digests?: [sha256...], resolvesTags?: bool },
onlyIf: { lastReportOlderThanDays?: int, policyRevision?: string },
notify: { onNewFindings: bool, minSeverity: "low|medium|high|critical", includeKEV: bool },
limits: { maxJobs?: int, ratePerSecond?: int, parallelism?: int },
createdAt, updatedAt, createdBy, updatedBy }
```
* `runs`
```
{ _id, scheduleId?, tenantId, trigger: "cron|conselier|excitor|manual",
reason?: { conselierExportId?, excitorExportId?, cursor? },
state: "planning|queued|running|completed|error|cancelled",
stats: { candidates: int, deduped: int, queued: int, completed: int, deltas: int, newCriticals: int },
startedAt, finishedAt, error? }
```
* `impact_cursors`
```
{ _id: tenantId, conselierLastExportId, excitorLastExportId, updatedAt }
```
* `locks` (singleton schedulers, run leases)
* `audit` (CRUD actions, run outcomes)
**Indexes**:
* `schedules` on `{tenantId, enabled}`, `{whenCron}`.
* `runs` on `{tenantId, startedAt desc}`, `{state}`.
* TTL optional for completed runs (e.g., 180 days).
---
## 4) ImpactIndex (global inverted index)
Goal: translate **change keys** → **image sets** in **milliseconds**.
**Source**: Scanner produces per-image **BOM-Index** sidecars (purls, and `usedByEntrypoint` bitmaps). Scheduler ingests/refreshes them to build a **global** index.
**Representation**:
* Assign **image IDs** (dense ints) to catalog images.
* Keep **Roaring Bitmaps**:
* `Contains[purl] → bitmap(imageIds)`
* `UsedBy[purl] → bitmap(imageIds)` (subset of Contains)
* Optionally keep **Owner maps**: `{imageId → {tenantId, namespaces[], repos[]}}` for selection filters.
* Persist in RocksDB/LMDB or Redis modules; cache hot shards in memory; snapshot to Mongo for cold start.
**Update paths**:
* On new/updated image SBOM: **merge** per-image set into global maps.
* On image remove/expiry: **clear** id from bitmaps.
**API (internal)**:
```csharp
IImpactIndex {
ImpactSet ResolveByPurls(IEnumerable<string> purls, bool usageOnly, Selector sel);
ImpactSet ResolveByVulns(IEnumerable<string> vulnIds, bool usageOnly, Selector sel); // optional (vuln->purl precomputed by Conselier)
ImpactSet ResolveAll(Selector sel); // for nightly
}
```
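
A minimal in-memory sketch of `ResolveByPurls`, with `HashSet<int>` standing in for the roaring bitmaps and a predicate standing in for `Selector`; the production index would back these maps with RocksDB/LMDB as described above:

```csharp
using System;
using System.Collections.Generic;

// HashSet<int> stands in for roaring bitmaps; the predicate stands in for
// tenant/namespace/repo selector filtering.
public sealed class InMemoryImpactIndex
{
    private readonly Dictionary<string, HashSet<int>> _contains = new();
    private readonly Dictionary<string, HashSet<int>> _usedBy = new();

    public IReadOnlyCollection<int> ResolveByPurls(
        IEnumerable<string> purls, bool usageOnly, Func<int, bool> isSelected)
    {
        var source = usageOnly ? _usedBy : _contains;
        var impacted = new HashSet<int>();

        foreach (var purl in purls)
            if (source.TryGetValue(purl, out var images))
                impacted.UnionWith(images);          // union across all changed keys

        impacted.RemoveWhere(id => !isSelected(id)); // apply owner/selector filters
        return impacted;
    }
}
```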
**Selector filters**: tenant, namespaces, repos, labels, digest allowlists, `includeTags` patterns.
---
## 5) External interfaces (REST)
Base path: `/api/v1/scheduler` (Authority OpToks; scopes: `scheduler.read`, `scheduler.admin`).
### 5.1 Schedules CRUD
* `POST /schedules` → create
* `GET /schedules` → list (filter by tenant)
* `GET /schedules/{id}` → details + next run
* `PATCH /schedules/{id}` → pause/resume/update
* `DELETE /schedules/{id}` → delete (soft delete, optional)
### 5.2 Run control & introspection
* `POST /run` — ad-hoc run
```json
{ "mode": "analysis-only|content-refresh", "selection": {...}, "reason": "manual" }
```
* `GET /runs` — list with paging
* `GET /runs/{id}` — status, stats, links to deltas
* `POST /runs/{id}/cancel` — best-effort cancel
### 5.3 Previews (dry-run)
* `POST /preview/impact` — returns **candidate count** and a small sample of impacted digests for given change keys or selection.
### 5.4 Event webhooks (optional push from Conselier/Excitor)
* `POST /events/conselier-export`
```json
{ "exportId":"...", "changedProductKeys":["pkg:rpm/openssl", ...], "kev": ["CVE-..."], "window": { "from":"...","to":"..." } }
```
* `POST /events/excitor-export`
```json
{ "exportId":"...", "changedClaims":[ { "productKey":"pkg:deb/...", "vulnId":"CVE-...", "status":"not_affected→affected"} ], ... }
```
**Security**: webhook requires **mTLS** or an **HMAC** `X-Scheduler-Signature` (Ed25519 / SHA256) plus Authority token.
---
## 6) Planner → Runner pipeline
### 6.1 Planning algorithm (event-driven)
```
On Export Event (Conselier/Excitor):
keys = Normalize(change payload) # productKeys or vulnIds→productKeys
usageOnly = schedule/policy hint? # default true
sel = Selector for tenant/scope from schedules subscribed to events
impacted = ImpactIndex.ResolveByPurls(keys, usageOnly, sel)
impacted = ApplyOwnerFilters(impacted, sel) # namespaces/repos/labels
impacted = DeduplicateByDigest(impacted)
impacted = EnforceLimits(impacted, limits.maxJobs)
shards = Shard(impacted, byHashPrefix, n=limits.parallelism)
For each shard:
Enqueue RunSegment (runId, shard, rate=limits.ratePerSecond)
```
**Fairness & pacing**
* Use **leaky bucket** per tenant and per registry host.
* Prioritize **KEV-tagged** and **critical** first if oversubscribed.
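
A sketch of the per-tenant pacing above using `System.Threading.RateLimiting` (already called out in §14); a token bucket stands in for the leaky bucket, and the rate would come from the schedule's `limits.ratePerSecond`:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

// One token-bucket limiter per tenant; a per-registry-host map would follow the same pattern.
public sealed class TenantPacer
{
    private readonly ConcurrentDictionary<string, TokenBucketRateLimiter> _limiters = new();
    private readonly int _ratePerSecond;

    public TenantPacer(int ratePerSecond) => _ratePerSecond = ratePerSecond;

    public async Task PaceAsync(string tenantId, CancellationToken ct)
    {
        var limiter = _limiters.GetOrAdd(tenantId, _ => new TokenBucketRateLimiter(
            new TokenBucketRateLimiterOptions
            {
                TokenLimit = _ratePerSecond,
                TokensPerPeriod = _ratePerSecond,
                ReplenishmentPeriod = TimeSpan.FromSeconds(1),
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                QueueLimit = 10_000,
                AutoReplenishment = true,
            }));

        // Waits for a token (or queues) so report calls stay within the tenant budget.
        using RateLimitLease lease = await limiter.AcquireAsync(1, ct);
        if (!lease.IsAcquired)
            throw new OperationCanceledException("Rate limiter queue exhausted.");
    }
}
```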
### 6.2 Nightly planning
```
At cron tick:
sel = resolve selection
candidates = ImpactIndex.ResolveAll(sel)
if lastReportOlderThanDays present → filter by report age (via Scanner catalog)
shard & enqueue as above
```
### 6.3 Execution (Runner)
* Pop **RunSegment** job → for each image digest:
  * **analysis-only**: `POST scanner/reports { imageDigest, policyRevision? }`
  * **content-refresh**: resolve tag→digest if needed; `POST scanner/scans { imageRef, attest? false }` then `POST /reports`
* Collect **delta**: `newFindings`, `newCriticals`/`highs`, `links` (UI deep link, Rekor if present).
* Persist per-image outcome in `runs.{id}.stats` (incremental counters).
* Emit `scheduler.rescan.delta` events to **Notify** only when **delta > 0** and matches severity rule.
---
## 7) Event model (outbound)
**Topic**: `rescan.delta` (internal bus → Notify; UI subscribes via backend).
```json
{
"tenant": "tenant-01",
"runId": "324af…",
"imageDigest": "sha256:…",
"newCriticals": 1,
"newHigh": 2,
"kevHits": ["CVE-2025-..."],
"topFindings": [
{ "purl":"pkg:rpm/openssl@3.0.12-...","vulnId":"CVE-2025-...","severity":"critical","link":"https://ui/scans/..." }
],
"reportUrl": "https://ui/.../scans/sha256:.../report",
"attestation": { "uuid":"rekor-uuid", "verified": true },
"ts": "2025-10-18T03:12:45Z"
}
```
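
A possible DTO shape mirroring the payload above; property names follow the JSON keys, while the authoritative contract would live in `StellaOps.Scheduler.Models`:

```csharp
using System;
using System.Collections.Generic;

// Illustrative rescan.delta event records; serialization casing handled by the JSON options.
public sealed record RescanDeltaEvent(
    string Tenant,
    string RunId,
    string ImageDigest,
    int NewCriticals,
    int NewHigh,
    IReadOnlyList<string> KevHits,
    IReadOnlyList<RescanDeltaFinding> TopFindings,
    string ReportUrl,
    RescanAttestation? Attestation,
    DateTimeOffset Ts);

public sealed record RescanDeltaFinding(string Purl, string VulnId, string Severity, string Link);

public sealed record RescanAttestation(string Uuid, bool Verified);
```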
**Also**: `report.ready` for “no-change” summaries (digest + zero delta), which Notify can ignore by rule.
---
## 8) Security posture
* **AuthN/Z**: Authority OpToks with `aud=scheduler`; DPoP (preferred) or mTLS.
* **Multi-tenant**: every schedule, run, and event carries `tenantId`; ImpactIndex filters by tenant-visible images.
* **Webhook** callers (Conselier/Excitor) present **mTLS** or **HMAC** and Authority token.
* **Input hardening**: size caps on changed key lists; reject >100k keys per event; compress (zstd/gzip) allowed with limits.
* **No secrets** in logs; redact tokens and signatures.
---
## 9) Observability & SLOs
**Metrics (Prometheus)**
* `scheduler.events_total{source, result}`
* `scheduler.impact_resolve_seconds{quantile}`
* `scheduler.images_selected_total{mode}`
* `scheduler.jobs_enqueued_total{mode}`
* `scheduler.run_latency_seconds{quantile}` // event → first verdict
* `scheduler.delta_images_total{severity}`
* `scheduler.rate_limited_total{reason}`
**Targets**
* Resolve 10k changed keys → impacted set in **<300ms** (hot cache).
* Event → first rescan verdict in **≤60s** (p95).
* Nightly coverage 50k images in **≤10 min** with 10 workers (analysis-only).
**Tracing** (OTEL): spans `plan`, `resolve`, `enqueue`, `report_call`, `persist`, `emit`.
---
## 10) Configuration (YAML)
```yaml
scheduler:
authority:
issuer: "https://authority.internal"
require: "dpop" # or "mtls"
queue:
kind: "redis" # or "nats"
url: "redis://redis:6379/4"
mongo:
uri: "mongodb://mongo/scheduler"
impactIndex:
storage: "rocksdb" # "rocksdb" | "redis" | "memory"
warmOnStart: true
usageOnlyDefault: true
limits:
defaultRatePerSecond: 50
defaultParallelism: 8
maxJobsPerRun: 50000
integrates:
scannerUrl: "https://scanner-web.internal"
conselierWebhook: true
excitorWebhook: true
notifications:
emitBus: "internal" # deliver to Notify via internal bus
```
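
Illustrative options classes for binding the `scheduler` section above via `Microsoft.Extensions.Configuration`; only a subset of keys is shown and the type names are assumptions, not shipped types:

```csharp
// Mirrors the YAML keys above so builder.Configuration.GetSection("scheduler") binds directly.
public sealed class SchedulerOptions
{
    public AuthorityOptions Authority { get; set; } = new();
    public LimitsOptions Limits { get; set; } = new();

    public sealed class AuthorityOptions
    {
        public string Issuer { get; set; } = string.Empty;
        public string Require { get; set; } = "dpop";   // or "mtls"
    }

    public sealed class LimitsOptions
    {
        public int DefaultRatePerSecond { get; set; } = 50;
        public int DefaultParallelism { get; set; } = 8;
        public int MaxJobsPerRun { get; set; } = 50_000;
    }
}

// Typical binding in Program.cs:
// builder.Services.Configure<SchedulerOptions>(builder.Configuration.GetSection("scheduler"));
```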
---
## 11) UI touchpoints
* **Schedules** page: CRUD, enable/pause, next run, last run stats, mode (analysis/content), selector preview.
* **Runs** page: timeline; heatmap of deltas; drill-down to affected images.
* **Dry-run preview** modal: “This Conselier export touches ~3,214 images; projected deltas: ~420 (34 KEV).”
---
## 12) Failure modes & degradations
| Condition | Behavior |
| ------------------------------------ | ---------------------------------------------------------------------------------------- |
| ImpactIndex cold / incomplete | Fall back to **All** selection for nightly; for events, cap to KEV+critical until warmed |
| Conselier/Excitor webhook storm       | Coalesce by exportId; debounce 30–60s; keep last                                          |
| Scanner under load (429)              | Backoff with jitter; respect per-tenant/leaky bucket                                      |
| Oversubscription (too many impacted)  | Prioritize KEV/critical first; spillover to next window; UI banner shows backlog          |
| Notify down                           | Buffer outbound events in queue (TTL 24h)                                                 |
| Mongo slow                            | Cut batch sizes; sample-log; alert ops; don't drop runs unless critical                   |
---
## 13) Testing matrix
* **ImpactIndex**: correctness (purl→image sets), performance, persistence after restart, memory pressure with 1M purls.
* **Planner**: dedupe, shard, fairness, limit enforcement, KEV prioritization.
* **Runner**: parallel report calls, error backoff, partial failures, idempotency.
* **End-to-end**: Conselier export → deltas visible in UI in ≤60s.
* **Security**: webhook auth (mTLS/HMAC), DPoP nonce dance, tenant isolation.
* **Chaos**: drop scanner availability; simulate registry throttles (content-refresh mode).
* **Nightly**: cron tick correctness across timezones and DST.
---
## 14) Implementation notes
* **Language**: .NET 10 minimal API; Channels-based pipeline; `System.Threading.RateLimiting`.
* **Bitmaps**: Roaring via `RoaringBitmap` bindings; memory-map large shards if RocksDB used.
* **Cron**: Quartz-style parser with timezone support; clock skew tolerated ±60s.
* **Dry-run**: use ImpactIndex only; never call scanner.
* **Idempotency**: run segments carry deterministic keys; retries safe.
* **Backpressure**: per-tenant buckets; per-host registry budgets respected when content-refresh enabled.
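
A sketch of the deterministic segment keys mentioned under **Idempotency**: hashing the run id plus the sorted digest set yields the same key on retry, so re-enqueued segments can be safely deduplicated:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class RunSegmentKey
{
    // Order the digests so the key is independent of enumeration order, then hash.
    public static string Compute(string runId, IEnumerable<string> imageDigests)
    {
        var canonical = runId + "\n" +
            string.Join("\n", imageDigests.OrderBy(d => d, StringComparer.Ordinal));
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```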
---
## 15) Sequences (representative)
**A) Event-driven rescan (Conselier delta)**
```mermaid
sequenceDiagram
autonumber
participant FE as Conselier
participant SCH as Scheduler.Worker
participant IDX as ImpactIndex
participant SC as Scanner.WebService
participant NO as Notify
FE->>SCH: POST /events/conselier-export {exportId, changedProductKeys}
SCH->>IDX: ResolveByPurls(keys, usageOnly=true, sel)
IDX-->>SCH: bitmap(imageIds) → digests list
SCH->>SC: POST /reports {imageDigest} (batch/sequenced)
SC-->>SCH: report deltas (new criticals/highs)
alt delta>0
SCH->>NO: rescan.delta {digest, newCriticals, links}
end
```
**B) Nightly rescan**
```mermaid
sequenceDiagram
autonumber
participant CRON as Cron
participant SCH as Scheduler.Worker
participant IDX as ImpactIndex
participant SC as Scanner.WebService
CRON->>SCH: tick (02:00 Europe/Sofia)
SCH->>IDX: ResolveAll(selector)
IDX-->>SCH: candidates
SCH->>SC: POST /reports {digest} (paced)
SC-->>SCH: results
SCH-->>SCH: aggregate, store run stats
```
**C) Content-refresh (tag followers)**
```mermaid
sequenceDiagram
autonumber
participant SCH as Scheduler
participant SC as Scanner
SCH->>SC: resolve tag→digest (if changed)
alt digest changed
SCH->>SC: POST /scans {imageRef} # new SBOM
SC-->>SCH: scan complete (artifacts)
SCH->>SC: POST /reports {imageDigest}
else unchanged
SCH->>SC: POST /reports {imageDigest} # analysis-only
end
```
---
## 16) Roadmap
* **Vuln-centric impact**: pre-join vuln→purl→images to rank by **KEV** and **exploited-in-the-wild** signals.
* **Policy diff preview**: when a staged policy changes, show projected breakage set before promotion.
* **Cross-cluster federation**: one Scheduler instance driving many Scanner clusters (tenant isolation).
* **Windows containers**: integrate Zastava runtime hints for Usage view tightening.
---
**End — component_architecture_scheduler.md**

View File

@@ -1,82 +1,82 @@
# Scheduler Worker Observability & Runbook
## Purpose
Monitor planner and runner health for the Scheduler Worker (Sprint 16 telemetry). The new .NET meters surface queue throughput, latency, backlog, and delta severities so operators can detect stalled runs before rescan SLAs slip.
> **Grafana note:** Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` into the Prometheus-backed Grafana stack that scrapes the OpenTelemetry Collector.
---
## Key metrics
| Metric | Use case | Suggested query |
| --- | --- | --- |
| `scheduler_planner_runs_total{status}` | Planner throughput & failure ratio | `sum by (status) (rate(scheduler_planner_runs_total[5m]))` |
| `scheduler_planner_latency_seconds_bucket` | Planning latency (p95 / p99) | `histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m])))` |
| `scheduler_runner_segments_total{status}` | Runner success vs retries | `sum by (status) (rate(scheduler_runner_segments_total[5m]))` |
| `scheduler_runner_delta_{critical,high,total}` | Newly-detected findings | `sum(rate(scheduler_runner_delta_critical_total[5m]))` |
| `scheduler_runner_backlog{scheduleId}` | Remaining digests awaiting runner | `max by (scheduleId) (scheduler_runner_backlog)` |
| `scheduler_runs_active{mode}` | Active runs in-flight | `sum(scheduler_runs_active)` |
Reference queries power the bundled Grafana dashboard panels. Use the `mode` template variable to focus on `analysisOnly` versus `contentRefresh` schedules.
---
## Grafana dashboard
1. Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` (UID `scheduler-worker-observability`).
2. Point the `datasource` variable to the Prometheus instance scraping the collector. Optional: pin the `mode` variable to a specific schedule mode.
3. Panels included:
   - **Planner Runs per Status** – visualises success vs failure ratio.
   - **Planner Latency P95** – highlights degradations in ImpactIndex or Mongo lookups.
   - **Runner Segments per Status** – shows retry pressure and queue health.
   - **New Findings per Severity** – rolls up delta counters (critical/high/total).
   - **Runner Backlog by Schedule** – tabulates outstanding digests per schedule.
   - **Active Runs** – stat panel showing the current number of in-flight runs.
Capture screenshots once Grafana provisioning completes and store them under `docs/assets/dashboards/` (pending automation ticket OBS-157).
---
## Prometheus alerts
Import `docs/modules/scheduler/operations/worker-prometheus-rules.yaml` into your Prometheus rule configuration. The bundle defines:
- **SchedulerPlannerFailuresHigh** – 5%+ of planner runs failed for 10 minutes. Page SRE.
- **SchedulerPlannerLatencyHigh** – planner p95 latency remains above 45s for 10 minutes. Investigate ImpactIndex, Mongo, and Feedser/Vexer event queues.
- **SchedulerRunnerBacklogGrowing** – backlog exceeded 500 images for 15 minutes. Inspect runner workers, Scanner availability, and rate limiting.
- **SchedulerRunStuck** – active run count stayed flat for 30 minutes while remaining non-zero. Check stuck segments, expired leases, and scanner retries.
Hook these alerts into the existing Observability notification pathway (`observability-pager` routing key) and ensure `service=scheduler-worker` is mapped to the on-call rotation.
---
## Runbook snapshot
1. **Planner failure/latency:**
- Check Planner logs for ImpactIndex or Mongo exceptions.
- Verify Feedser/Vexer webhook health; requeue events if necessary.
- If planner is overwhelmed, temporarily reduce schedule parallelism via `stella scheduler schedule update`.
2. **Runner backlog spike:**
- Confirm Scanner WebService health (`/healthz`).
- Inspect runner queue for stuck segments; consider increasing runner workers or scaling scanner capacity.
- Review rate limits (schedule limits, ImpactIndex throughput) before changing global throttles.
3. **Stuck runs:**
- Use `stella scheduler runs list --state running` to identify affected runs.
- Drill into Grafana panel “Runner Backlog by Schedule” to see offending schedule IDs.
- If a segment will not progress, use `stella scheduler segments release --segment <id>` to force retry after resolving root cause.
4. **Unexpected critical deltas:**
- Correlate `scheduler_runner_delta_critical_total` spikes with Notify events (`scheduler.rescan.delta`).
- Pivot to Scanner report links for impacted digests and confirm they match upstream advisories/policies.
Document incidents and mitigation in `ops/runbooks/INCIDENT_LOG.md` (per SRE policy) and attach Grafana screenshots for post-mortems.
---
## Checklist
- [ ] Grafana dashboard imported and wired to Prometheus datasource.
- [ ] Prometheus alert rules deployed (see above).
- [ ] Runbook linked from on-call rotation portal.
- [ ] Observability Guild sign-off captured for Sprint 16 telemetry (OWNER: @obs-guild).
# Scheduler Worker Observability & Runbook
## Purpose
Monitor planner and runner health for the Scheduler Worker (Sprint 16 telemetry). The new .NET meters surface queue throughput, latency, backlog, and delta severities so operators can detect stalled runs before rescan SLAs slip.
> **Grafana note:** Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` into the Prometheus-backed Grafana stack that scrapes the OpenTelemetry Collector.
---
## Key metrics
| Metric | Use case | Suggested query |
| --- | --- | --- |
| `scheduler_planner_runs_total{status}` | Planner throughput & failure ratio | `sum by (status) (rate(scheduler_planner_runs_total[5m]))` |
| `scheduler_planner_latency_seconds_bucket` | Planning latency (p95 / p99) | `histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m])))` |
| `scheduler_runner_segments_total{status}` | Runner success vs retries | `sum by (status) (rate(scheduler_runner_segments_total[5m]))` |
| `scheduler_runner_delta_{critical,high,total}` | Newly-detected findings | `sum(rate(scheduler_runner_delta_critical_total[5m]))` |
| `scheduler_runner_backlog{scheduleId}` | Remaining digests awaiting runner | `max by (scheduleId) (scheduler_runner_backlog)` |
| `scheduler_runs_active{mode}` | Active runs in-flight | `sum(scheduler_runs_active)` |
Reference queries power the bundled Grafana dashboard panels. Use the `mode` template variable to focus on `analysisOnly` versus `contentRefresh` schedules.
---
## Grafana dashboard
1. Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` (UID `scheduler-worker-observability`).
2. Point the `datasource` variable to the Prometheus instance scraping the collector. Optional: pin the `mode` variable to a specific schedule mode.
3. Panels included:
   - **Planner Runs per Status** – visualises success vs failure ratio.
   - **Planner Latency P95** – highlights degradations in ImpactIndex or Mongo lookups.
   - **Runner Segments per Status** – shows retry pressure and queue health.
   - **New Findings per Severity** – rolls up delta counters (critical/high/total).
   - **Runner Backlog by Schedule** – tabulates outstanding digests per schedule.
   - **Active Runs** – stat panel showing the current number of in-flight runs.
Capture screenshots once Grafana provisioning completes and store them under `docs/assets/dashboards/` (pending automation ticket OBS-157).
---
## Prometheus alerts
Import `docs/modules/scheduler/operations/worker-prometheus-rules.yaml` into your Prometheus rule configuration. The bundle defines:
- **SchedulerPlannerFailuresHigh** – 5%+ of planner runs failed for 10 minutes. Page SRE.
- **SchedulerPlannerLatencyHigh** – planner p95 latency remains above 45s for 10 minutes. Investigate ImpactIndex, Mongo, and Conselier/Excitor event queues.
- **SchedulerRunnerBacklogGrowing** – backlog exceeded 500 images for 15 minutes. Inspect runner workers, Scanner availability, and rate limiting.
- **SchedulerRunStuck** – active run count stayed flat for 30 minutes while remaining non-zero. Check stuck segments, expired leases, and scanner retries.
Hook these alerts into the existing Observability notification pathway (`observability-pager` routing key) and ensure `service=scheduler-worker` is mapped to the on-call rotation.
---
## Runbook snapshot
1. **Planner failure/latency:**
- Check Planner logs for ImpactIndex or Mongo exceptions.
- Verify Conselier/Excitor webhook health; requeue events if necessary.
- If planner is overwhelmed, temporarily reduce schedule parallelism via `stella scheduler schedule update`.
2. **Runner backlog spike:**
- Confirm Scanner WebService health (`/healthz`).
- Inspect runner queue for stuck segments; consider increasing runner workers or scaling scanner capacity.
- Review rate limits (schedule limits, ImpactIndex throughput) before changing global throttles.
3. **Stuck runs:**
- Use `stella scheduler runs list --state running` to identify affected runs.
- Drill into Grafana panel “Runner Backlog by Schedule” to see offending schedule IDs.
- If a segment will not progress, use `stella scheduler segments release --segment <id>` to force retry after resolving root cause.
4. **Unexpected critical deltas:**
- Correlate `scheduler_runner_delta_critical_total` spikes with Notify events (`scheduler.rescan.delta`).
- Pivot to Scanner report links for impacted digests and confirm they match upstream advisories/policies.
Document incidents and mitigation in `ops/runbooks/INCIDENT_LOG.md` (per SRE policy) and attach Grafana screenshots for post-mortems.
---
## Checklist
- [ ] Grafana dashboard imported and wired to Prometheus datasource.
- [ ] Prometheus alert rules deployed (see above).
- [ ] Runbook linked from on-call rotation portal.
- [ ] Observability Guild sign-off captured for Sprint 16 telemetry (OWNER: @obs-guild).

View File

@@ -1,63 +1,63 @@
# Implementation plan — VEX Consensus Lens
## Delivery phases
- **Phase 1 – Core lens service**
Build normalisation pipeline (CSAF/OpenVEX/CycloneDX), product mapping library, trust weighting functions, consensus algorithm, and persistence (`vex_consensus`, history, conflicts).
- **Phase 2 – API & integrations**
Expose `/vex/consensus` query/detail/simulate/export endpoints, integrate Policy Engine thresholds, Vuln Explorer UI chips, and VEX Lens change events.
- **Phase 3 – Issuer Directory & signatures**
Deliver issuer registry, key management, signature verification, RBAC, audit logs, and tenant overrides.
- **Phase 4 – Console & CLI experiences**
Ship Console module (lists, evidence table, quorum bar, conflicts, simulation drawer) and CLI commands (`stella vex consensus ...`) with export support.
- **Phase 5 – Recompute & performance**
Implement recompute scheduling (policy activation, Excitator deltas), caching, load tests (10M records/tenant), observability dashboards, and Offline Kit exports.
## Work breakdown
- **VEX Lens service**
- Normalise VEX payloads, maintain scope scores, compute consensus digest.
- Trust weighting functions (issuer tier, freshness decay, scope quality).
- Idempotent workers for consensus projection and history tracking.
- Conflict handling queue for manual review and notifications.
- **Integrations**
- Excitator: enrich VEX events with issuer hints, signatures, product trees.
- Policy Engine: trust knobs, simulation endpoints, policy-driven recompute.
- Vuln Explorer & Advisory AI: consensus badges, conflict surfacing.
- **Issuer Directory**
- CRUD for issuers/keys, audit logs, import CSAF publishers, tenant overrides.
- Signature verification endpoints consumed by Lens.
- **APIs & UX**
- REST endpoints for query/detail/conflict export, trust weight updates.
- Console module with filters, saved views, evidence table, simulation drawer.
- CLI commands for list/show/simulate/export with JSON/CSV output.
- **Observability & Ops**
- Metrics (consensus latency, conflict rate, signature failures, cache hit rate), logs, traces.
- Dashboards + runbooks for recompute storms, mapping failures, signature errors, quota breaches.
- Offline exports for Export Center/Offline Kit.
## Acceptance criteria
- Consensus results reproducible across supported VEX formats with deterministic digests and provenance.
- Signature verification influences trust weights; unverifiable evidence is down-weighted without pipeline failure.
- Policy simulations show quorum shifts without persisting state; Vuln Explorer consumes consensus signals.
- Issuer Directory enforces RBAC, audit logs, and key rotation; CLI & Console parity achieved.
- Recompute pipeline handles Excitator deltas and policy activations with backpressure and incident surfacing.
- Observability dashboards/alerts cover ingestion lag, conflict spikes, signature failures, performance budgets (P95 < 500ms for 100-row pages at 10M records/tenant).
## Risks & mitigations
- **Product mapping ambiguity:** conservative scope scoring, manual overrides, surfaced warnings, policy review hooks.
- **Issuer compromise:** signature verification, trust weighting, tenant overrides, revocation runbooks.
- **Evidence storms:** batching, worker sharding, orchestrator rate limiting, priority queues.
- **Performance degradation:** caching, indexing, load tests, quota enforcement.
- **Offline gaps:** deterministic exports, manifest hashes, Offline Kit tests.
## Test strategy
- **Unit:** normalisers, mapping, trust weights, consensus lattice, signature verification.
- **Property:** randomised evidence sets verifying lattice commutativity and determinism.
- **Integration:** Excitator → Lens → Policy/Vuln Explorer flow, issuer overrides, simulation.
- **Performance:** large tenant datasets, cache behaviour, concurrency tests.
- **Security:** RBAC, tenant scoping, signature tampering, issuer revocation.
- **Offline:** export/import verification, CLI parity.
## Definition of done
- Lens service, issuer directory, API/CLI/Console components deployed with telemetry and runbooks.
- Documentation set (overview, algorithm, issuer directory, API, console, policy trust) updated with imposed rule statements.
- ./TASKS.md and ../../TASKS.md reflect current status; Offline Kit parity confirmed.
# Implementation plan — VEX Consensus Lens
## Delivery phases
- **Phase 1 – Core lens service**
Build normalisation pipeline (CSAF/OpenVEX/CycloneDX), product mapping library, trust weighting functions, consensus algorithm, and persistence (`vex_consensus`, history, conflicts).
- **Phase 2 – API & integrations**
Expose `/vex/consensus` query/detail/simulate/export endpoints, integrate Policy Engine thresholds, Vuln Explorer UI chips, and VEX Lens change events.
- **Phase 3 – Issuer Directory & signatures**
Deliver issuer registry, key management, signature verification, RBAC, audit logs, and tenant overrides.
- **Phase 4 – Console & CLI experiences**
Ship Console module (lists, evidence table, quorum bar, conflicts, simulation drawer) and CLI commands (`stella vex consensus ...`) with export support.
- **Phase 5 – Recompute & performance**
Implement recompute scheduling (policy activation, Excitor deltas), caching, load tests (10M records/tenant), observability dashboards, and Offline Kit exports.
## Work breakdown
- **VEX Lens service**
- Normalise VEX payloads, maintain scope scores, compute consensus digest.
- Trust weighting functions (issuer tier, freshness decay, scope quality).
- Idempotent workers for consensus projection and history tracking.
- Conflict handling queue for manual review and notifications.
- **Integrations**
- Excitor: enrich VEX events with issuer hints, signatures, product trees.
- Policy Engine: trust knobs, simulation endpoints, policy-driven recompute.
- Vuln Explorer & Advisory AI: consensus badges, conflict surfacing.
- **Issuer Directory**
- CRUD for issuers/keys, audit logs, import CSAF publishers, tenant overrides.
- Signature verification endpoints consumed by Lens.
- **APIs & UX**
- REST endpoints for query/detail/conflict export, trust weight updates.
- Console module with filters, saved views, evidence table, simulation drawer.
- CLI commands for list/show/simulate/export with JSON/CSV output.
- **Observability & Ops**
- Metrics (consensus latency, conflict rate, signature failures, cache hit rate), logs, traces.
- Dashboards + runbooks for recompute storms, mapping failures, signature errors, quota breaches.
- Offline exports for Export Center/Offline Kit.
## Acceptance criteria
- Consensus results reproducible across supported VEX formats with deterministic digests and provenance.
- Signature verification influences trust weights; unverifiable evidence is down-weighted without pipeline failure.
- Policy simulations show quorum shifts without persisting state; Vuln Explorer consumes consensus signals.
- Issuer Directory enforces RBAC, audit logs, and key rotation; CLI & Console parity achieved.
- Recompute pipeline handles Excitor deltas and policy activations with backpressure and incident surfacing.
- Observability dashboards/alerts cover ingestion lag, conflict spikes, signature failures, performance budgets (P95 < 500ms for 100-row pages at 10M records/tenant).
## Risks & mitigations
- **Product mapping ambiguity:** conservative scope scoring, manual overrides, surfaced warnings, policy review hooks.
- **Issuer compromise:** signature verification, trust weighting, tenant overrides, revocation runbooks.
- **Evidence storms:** batching, worker sharding, orchestrator rate limiting, priority queues.
- **Performance degradation:** caching, indexing, load tests, quota enforcement.
- **Offline gaps:** deterministic exports, manifest hashes, Offline Kit tests.
## Test strategy
- **Unit:** normalisers, mapping, trust weights, consensus lattice, signature verification.
- **Property:** randomised evidence sets verifying lattice commutativity and determinism.
- **Integration:** Excitor → Lens → Policy/Vuln Explorer flow, issuer overrides, simulation.
- **Performance:** large tenant datasets, cache behaviour, concurrency tests.
- **Security:** RBAC, tenant scoping, signature tampering, issuer revocation.
- **Offline:** export/import verification, CLI parity.
## Definition of done
- Lens service, issuer directory, API/CLI/Console components deployed with telemetry and runbooks.
- Documentation set (overview, algorithm, issuer directory, API, console, policy trust) updated with imposed rule statements.
- ./TASKS.md and ../../TASKS.md reflect current status; Offline Kit parity confirmed.

View File

@@ -1,9 +0,0 @@
# Task board — Vexer
> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.
| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| VEXER-DOCS-0001 | DOING (2025-10-29) | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| VEXER-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| VEXER-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |

View File

@@ -47,18 +47,25 @@ CLI mirrors these endpoints (`stella findings list|view|update|export`). Console
- Scheduler integration triggers follow-up scans or policy re-evaluation when remediation plan reaches checkpoint.
- Zastava (Differential SBOM) feeds runtime exposure signals to reprioritise findings automatically.
## 5) Observability & compliance
- Metrics: `findings_open_total{severity,tenant}`, `findings_mttr_seconds`, `triage_actions_total{type}`, `report_generation_seconds`.
- Logs: structured with `findingId`, `artifactId`, `advisory`, `policyVersion`, `actor`, `actionType`.
- Audit exports: `audit_log.jsonl` appended whenever state changes; offline bundles include signed audit log and manifest.
- Compliance: accepted risk requires dual approval and stores justification plus expiry reminders (raised through Notify).
## 6) Offline bundle requirements
- Bundle structure:
- `manifest.json` (hashes, counts, policy version, generation timestamp).
- `findings.jsonl` (current open findings).
## 5) Observability & compliance
- Metrics: `findings_open_total{severity,tenant}`, `findings_mttr_seconds`, `triage_actions_total{type}`, `report_generation_seconds`.
- Logs: structured with `findingId`, `artifactId`, `advisory`, `policyVersion`, `actor`, `actionType`.
- Audit exports: `audit_log.jsonl` appended whenever state changes; offline bundles include signed audit log and manifest.
- Compliance: accepted risk requires dual approval and stores justification plus expiry reminders (raised through Notify).
## 6) Identity & access integration
- **Scopes** – `vuln:view`, `vuln:investigate`, `vuln:operate`, `vuln:audit` map to read-only, triage, workflow, and audit experiences respectively. The deprecated `vuln:read` scope is still honoured for legacy tokens but is no longer advertised.
- **Attribute filters (ABAC)** – Authority enforces per-service-account filters via the client-credential parameters `vuln_env`, `vuln_owner`, and `vuln_business_tier`. Service accounts define the allowed values in `authority.yaml` (`attributes` block); a configuration sketch follows this list. Tokens include the resolved filters as claims (`stellaops:vuln_env`, `stellaops:vuln_owner`, `stellaops:vuln_business_tier`), and tokens persisted to Mongo retain the same values for audit and revocation.
- **Audit trail** – Every token issuance emits `authority.vuln_attr.*` audit properties that mirror the resolved filter set, along with `delegation.service_account` and ordered `delegation.actor[n]` entries so Vuln Explorer can correlate access decisions.
- **Permalinks** – Signed permalinks inherit the caller's ABAC filters; consuming services must enforce the embedded claims in addition to scope checks when resolving permalinks.
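For illustration, a service-account entry in `authority.yaml` might resemble the sketch below; the surrounding key names (`serviceAccounts`, `name`) are assumptions for this example, with only the `attributes` block and the `vuln_*` filter names taken from the text above.

```yaml
# Hypothetical authority.yaml excerpt; structure is illustrative, not authoritative.
authority:
  serviceAccounts:
    - name: "vuln-explorer-ops"          # assumed account identifier
      attributes:
        vuln_env: ["prod", "staging"]    # allowed values for the vuln_env filter
        vuln_owner: ["platform-security"]
        vuln_business_tier: ["tier-1"]
```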
## 7) Offline bundle requirements
- Bundle structure:
- `manifest.json` (hashes, counts, policy version, generation timestamp).
- `findings.jsonl` (current open findings).
- `history.jsonl` (state changes).
- `actions.jsonl` (comments, assignments, tickets).
- `reports/` (generated PDFs/CSVs).

View File

@@ -1,70 +1,70 @@
# Implementation plan — Vulnerability Explorer
## Delivery phases
- **Phase 1 – Findings Ledger & resolver**
Create append-only ledger, projector, ecosystem resolvers (npm/Maven/PyPI/Go/RPM/DEB), canonical advisory keys, and provenance hashing.
- **Phase 2 – API & simulation**
Ship Vuln Explorer API (list/detail/grouping/simulation), batch evaluation with Policy Engine rationales, and export orchestrator.
- **Phase 3 – Console & CLI workflows**
Deliver triage UI (assignments, comments, remediation plans, simulation bar), keyboard accessibility, and CLI commands (`stella vuln ...`) with JSON/CSV output.
- **Phase 4 – Automation & integrations**
Integrate Advisory AI hints, Zastava runtime exposure, Notify rules, Scheduler follow-up scans, and Graph Explorer deep links.
- **Phase 5 – Exports & offline parity**
Generate deterministic bundles (JSON, CSV, PDF, Offline Kit manifests), audit logs, and signed reports.
- **Phase 6 – Observability & hardening**
Complete dashboards (projection lag, MTTR, accepted-risk cadence), alerts, runbooks, performance tuning (5M findings/tenant), and security/RBAC validation.
## Work breakdown
- **Findings Ledger**
- Define event schema, Merkle root anchoring, append-only storage, history tables.
- Projector to `finding_records` and `finding_history`, idempotent event processing, time travel snapshots.
- Resolver pipelines referencing SBOM inventory deltas, policy outputs, VEX consensus, runtime signals.
- **API & exports**
- REST endpoints (`/v1/findings`, `/v1/findings/{id}`, `/actions`, `/reports`, `/exports`) with ABAC filters.
- Simulation endpoint returning diffs, integration with Policy Engine batch evaluation.
- Export jobs for JSON/CSV/PDF plus Offline Kit bundle assembly and signing.
- **Console**
- Feature module `vuln-explorer` with grid, filters, saved views, deep links, detail tabs (policy, evidence, paths, remediation).
- Simulation drawer, delta chips, accepted-risk approvals, evidence bundle viewer.
- Accessibility (keyboard navigation, ARIA), virtualization for large result sets.
- **CLI**
- Commands `stella vuln list|show|simulate|assign|accept-risk|verify-fix|export`.
- Stable schemas for automation; piping support; tests for exit codes.
- **Integrations**
- Conseiller/Excitator: normalized advisory keys, linksets, evidence retrieval.
- SBOM Service: inventory deltas with scope/runtime flags, safe version hints.
- Notify: events for SLA breaches, accepted-risk expiries, remediation deadlines.
- Scheduler: trigger rescans when remediation plan milestones complete.
- **Observability & ops**
- Metrics (open findings, MTTR, projection lag, export duration, SLA burn), logs/traces with correlation IDs.
- Alerting on projector backlog, API 5xx spikes, export failures, accepted-risk nearing expiry.
- Runbooks covering recompute storms, mapping errors, report issues.
## Acceptance criteria
- Ledger/event sourcing reproduces historical states byte-for-byte; Merkle hashes verify integrity.
- Resolver respects ecosystem semantics, scope, and runtime context; path evidence presented in UI/CLI.
- Triage workflows (assignment, comments, accepted-risk) enforce justification and approval requirements with audit records.
- Simulation returns policy diffs without mutating state; CLI/UI parity achieved for simulation and exports.
- Exports and Offline Kit bundles reproducible with signed manifests and provenance; reports available in JSON/CSV/PDF.
- Observability dashboards show green SLOs, alerts fire for projection lag or SLA burns, and runbooks documented.
- RBAC/ABAC validated; attachments encrypted; tenant isolation guaranteed.
## Risks & mitigations
- **Advisory identity collisions:** strict canonicalization, linkset references, raw evidence access.
- **Resolver inaccuracies:** property-based tests, path verification, manual override workflows.
- **Projection lag/backlog:** autoscaling, queue backpressure, alerting, pause controls.
- **Export size/performance:** streaming NDJSON, size estimators, chunked downloads.
- **User confusion on suppression:** rationale tab, explicit badges, explain traces.
## Test strategy
- **Unit:** resolver algorithms, state machine transitions, policy mapping, export builders.
- **Integration:** ingestion → ledger → projector → API flow, simulation, Notify notifications.
- **E2E:** Console triage scenarios, CLI flows, accessibility tests.
- **Performance:** 5M findings/tenant, projection rebuild, export generation.
- **Security:** RBAC/ABAC matrix, CSRF, attachment encryption, signed URL expiry.
- **Determinism:** time-travel snapshots, export manifest hashing, Offline Kit replay.
## Definition of done
- Services, UI/CLI, integrations, exports, and observability deployed with runbooks and Offline Kit parity.
- Documentation suite (overview, using-console, API, CLI, findings ledger, policy mapping, VEX/SBOM integration, telemetry, security, runbooks, install) updated with imposed rule statement.
- ./TASKS.md and ../../TASKS.md reflect active progress; compliance checklists appended where required.
# Implementation plan — Vulnerability Explorer
## Delivery phases
- **Phase 1 – Findings Ledger & resolver**
Create append-only ledger, projector, ecosystem resolvers (npm/Maven/PyPI/Go/RPM/DEB), canonical advisory keys, and provenance hashing.
- **Phase 2 – API & simulation**
Ship Vuln Explorer API (list/detail/grouping/simulation), batch evaluation with Policy Engine rationales, and export orchestrator.
- **Phase 3 – Console & CLI workflows**
Deliver triage UI (assignments, comments, remediation plans, simulation bar), keyboard accessibility, and CLI commands (`stella vuln ...`) with JSON/CSV output.
- **Phase 4 – Automation & integrations**
Integrate Advisory AI hints, Zastava runtime exposure, Notify rules, Scheduler follow-up scans, and Graph Explorer deep links.
- **Phase 5 – Exports & offline parity**
Generate deterministic bundles (JSON, CSV, PDF, Offline Kit manifests), audit logs, and signed reports.
- **Phase 6 – Observability & hardening**
Complete dashboards (projection lag, MTTR, accepted-risk cadence), alerts, runbooks, performance tuning (5M findings/tenant), and security/RBAC validation.
## Work breakdown
- **Findings Ledger**
- Define event schema, Merkle root anchoring, append-only storage, history tables.
- Projector to `finding_records` and `finding_history`, idempotent event processing, time travel snapshots.
- Resolver pipelines referencing SBOM inventory deltas, policy outputs, VEX consensus, runtime signals.
- **API & exports**
- REST endpoints (`/v1/findings`, `/v1/findings/{id}`, `/actions`, `/reports`, `/exports`) with ABAC filters.
- Simulation endpoint returning diffs, integration with Policy Engine batch evaluation.
- Export jobs for JSON/CSV/PDF plus Offline Kit bundle assembly and signing.
- **Console**
- Feature module `vuln-explorer` with grid, filters, saved views, deep links, detail tabs (policy, evidence, paths, remediation).
- Simulation drawer, delta chips, accepted-risk approvals, evidence bundle viewer.
- Accessibility (keyboard navigation, ARIA), virtualization for large result sets.
- **CLI**
- Commands `stella vuln list|show|simulate|assign|accept-risk|verify-fix|export`.
- Stable schemas for automation; piping support; tests for exit codes.
- **Integrations**
- Conseiller/Excitor: normalized advisory keys, linksets, evidence retrieval.
- SBOM Service: inventory deltas with scope/runtime flags, safe version hints.
- Notify: events for SLA breaches, accepted-risk expiries, remediation deadlines.
- Scheduler: trigger rescans when remediation plan milestones complete.
- **Observability & ops**
- Metrics (open findings, MTTR, projection lag, export duration, SLA burn), logs/traces with correlation IDs.
- Alerting on projector backlog, API 5xx spikes, export failures, accepted-risk nearing expiry.
- Runbooks covering recompute storms, mapping errors, report issues.
## Acceptance criteria
- Ledger/event sourcing reproduces historical states byte-for-byte; Merkle hashes verify integrity.
- Resolver respects ecosystem semantics, scope, and runtime context; path evidence presented in UI/CLI.
- Triage workflows (assignment, comments, accepted-risk) enforce justification and approval requirements with audit records.
- Simulation returns policy diffs without mutating state; CLI/UI parity achieved for simulation and exports.
- Exports and Offline Kit bundles reproducible with signed manifests and provenance; reports available in JSON/CSV/PDF.
- Observability dashboards show green SLOs, alerts fire for projection lag or SLA burns, and runbooks documented.
- RBAC/ABAC validated; attachments encrypted; tenant isolation guaranteed.
## Risks & mitigations
- **Advisory identity collisions:** strict canonicalization, linkset references, raw evidence access.
- **Resolver inaccuracies:** property-based tests, path verification, manual override workflows.
- **Projection lag/backlog:** autoscaling, queue backpressure, alerting, pause controls.
- **Export size/performance:** streaming NDJSON, size estimators, chunked downloads.
- **User confusion on suppression:** rationale tab, explicit badges, explain traces.
## Test strategy
- **Unit:** resolver algorithms, state machine transitions, policy mapping, export builders.
- **Integration:** ingestion → ledger → projector → API flow, simulation, Notify notifications.
- **E2E:** Console triage scenarios, CLI flows, accessibility tests.
- **Performance:** 5M findings/tenant, projection rebuild, export generation.
- **Security:** RBAC/ABAC matrix, CSRF, attachment encryption, signed URL expiry.
- **Determinism:** time-travel snapshots, export manifest hashing, Offline Kit replay.
## Definition of done
- Services, UI/CLI, integrations, exports, and observability deployed with runbooks and Offline Kit parity.
- Documentation suite (overview, using-console, API, CLI, findings ledger, policy mapping, VEX/SBOM integration, telemetry, security, runbooks, install) updated with imposed rule statement.
- ./TASKS.md and ../../TASKS.md reflect active progress; compliance checklists appended where required.