Files
git.stella-ops.org/docs/ARCHITECTURE_CONCELIER.md
root 68da90a11a
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Restructure solution layout by module
2025-10-28 15:10:40 +02:00

519 lines
22 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# component_architecture_concelier.md — **StellaOps Concelier** (Sprint22)
> **Scope.** Implementation-ready architecture for **Concelier**: the advisory ingestion and Link-Not-Merge (LNM) observation pipeline that produces deterministic raw observations, correlation linksets, and evidence events consumed by Policy Engine, Console, CLI, and Export centers. Covers domain models, connectors, observation/linkset builders, storage schema, events, APIs, performance, security, and test matrices.
---
## 0) Mission & boundaries
**Mission.** Acquire authoritative **vulnerability advisories** (vendor PSIRTs, distros, OSS ecosystems, CERTs), persist them as immutable **observations** under the Aggregation-Only Contract (AOC), construct **linksets** that correlate observations without merging or precedence, and export deterministic evidence bundles (JSON, Trivy DB, Offline Kit) for downstream policy evaluation and operator tooling.
**Boundaries.**
* Concelier **does not** sign with private keys. When attestation is required, the export artifact is handed to the **Signer**/**Attestor** pipeline (outofprocess).
* Concelier **does not** decide PASS/FAIL; it provides data to the **Policy** engine.
* Online operation is **allowlistonly**; airgapped deployments use the **Offline Kit**.
---
## 1) Topology & processes
**Process shape:** single ASP.NET Core service `StellaOps.Concelier.WebService` hosting:
* **Scheduler** with distributed locks (Mongo backed).
* **Connectors** (fetch/parse/map) that emit immutable observation candidates.
* **Observation writer** enforcing AOC invariants via `AOCWriteGuard`.
* **Linkset builder** that correlates observations into `advisory_linksets` and annotates conflicts.
* **Event publisher** emitting `advisory.observation.updated` and `advisory.linkset.updated` messages.
* **Exporters** (JSON, Trivy DB, Offline Kit slices) fed from observation/linkset stores.
* **Minimal REST** for health/status/trigger/export and observation/linkset reads.
**Scale:** HA by running N replicas; **locks** prevent overlapping jobs per source/exporter.
---
## 2) Canonical domain model
> Stored in MongoDB (database `concelier`), serialized with a **canonical JSON** writer (stable order, camelCase, normalized timestamps).
### 2.1 Core entities
#### AdvisoryObservation
```jsonc
observationId // deterministic id: {tenant}:{source.vendor}:{upstreamId}:{revision}
tenant // issuing tenant (lower-case)
source{
vendor, stream, api, collectorVersion
}
upstream{
upstreamId, documentVersion, fetchedAt, receivedAt,
contentHash, signature{present, format?, keyId?, signature?}
}
content{
format, specVersion, raw, metadata?
}
identifiers{
cve?, ghsa?, vendorIds[], aliases[]
}
linkset{
purls[], cpes[], aliases[], references[{type,url}],
reconciledFrom[]
}
createdAt // when Concelier recorded the observation
attributes // optional provenance metadata (batch ids, ingest cursor)
```jsonc
#### AdvisoryLinkset
```jsonc
linksetId // sha256 over sorted (tenant, product/vuln tuple, observation ids)
tenant
key{
vulnerabilityId,
productKey,
confidence // low|medium|high
}
observations[] = [
{
observationId,
sourceVendor,
statement{
status?, severity?, references?, notes?
},
collectedAt
}
]
aliases{
primary,
others[]
}
purls[]
cpes[]
conflicts[]? // see AdvisoryLinksetConflict
createdAt
updatedAt
```jsonc
#### AdvisoryLinksetConflict
```jsonc
conflictId // deterministic hash
type // severity-mismatch | affected-range-divergence | reference-clash | alias-inconsistency | metadata-gap
field? // optional JSON pointer (e.g., /statement/severity/vector)
observations[] // per-source values contributing to the conflict
confidence // low|medium|high (heuristic weight)
detectedAt
```jsonc
#### ObservationEvent / LinksetEvent
```jsonc
eventId // ULID
tenant
type // advisory.observation.updated | advisory.linkset.updated
key{
observationId? // on observation event
linksetId? // on linkset event
vulnerabilityId?,
productKey?
}
delta{
added[], removed[], changed[] // normalized summary for consumers
}
hash // canonical hash of serialized delta payload
occurredAt
```jsonc
#### ExportState
```jsonc
exportKind // json | trivydb
baseExportId? // last full baseline
baseDigest? // digest of last full baseline
lastFullDigest? // digest of last full export
lastDeltaDigest? // digest of last delta export
cursor // per-kind incremental cursor
files[] // last manifest snapshot (path → sha256)
```jsonc
Legacy `Advisory`, `Affected`, and merge-centric entities remain in the repository for historical exports and replay but are being phased out as Link-Not-Merge takes over. New code paths must interact with `AdvisoryObservation` / `AdvisoryLinkset` exclusively and emit conflicts through the structured payloads described above.
### 2.2 Product identity (`productKey`)
* **Primary:** `purl` (Package URL).
* **OS packages:** RPM (NEVRA→purl:rpm), DEB (dpkg→purl:deb), APK (apk→purl:alpine), with **EVR/NVRA** preserved.
* **Secondary:** `cpe` retained for compatibility; advisory records may carry both.
* **Image/platform:** `oci:<registry>/<repo>@<digest>` for imagelevel advisories (rare).
* **Unmappable:** if a source is nondeterministic, keep native string under `productKey="native:<provider>:<id>"` and mark **nonjoinable**.
---
## 3) Source families & precedence
### 3.1 Families
* **Vendor PSIRTs**: Microsoft, Oracle, Cisco, Adobe, Apple, VMware, Chromium…
* **Linux distros**: Red Hat, SUSE, Ubuntu, Debian, Alpine…
* **OSS ecosystems**: OSV, GHSA (GitHub Security Advisories), PyPI, npm, Maven, NuGet, Go.
* **CERTs / national CSIRTs**: CISA (KEV, ICS), JVN, ACSC, CCCS, KISA, CERTFR/BUND, etc.
### 3.2 Precedence (when claims conflict)
1. **Vendor PSIRT** (authoritative for their product).
2. **Distro** (authoritative for packages they ship, including backports).
3. **Ecosystem** (OSV/GHSA) for library semantics.
4. **CERTs/aggregators** for enrichment (KEV/known exploited).
> Precedence affects **Affected** ranges and **fixed** info; **severity** is normalized to the **maximum** credible severity unless policy overrides. Conflicts are retained with **source provenance**.
---
## 4) Connectors & normalization
### 4.1 Connector contract
```csharp
public interface IFeedConnector {
string SourceName { get; }
Task FetchAsync(IServiceProvider sp, CancellationToken ct); // -> document collection
Task ParseAsync(IServiceProvider sp, CancellationToken ct); // -> dto collection (validated)
Task MapAsync(IServiceProvider sp, CancellationToken ct); // -> advisory/alias/affected/reference
}
```jsonc
* **Fetch**: windowed (cursor), conditional GET (ETag/LastModified), retry/backoff, rate limiting.
* **Parse**: schema validation (JSON Schema, XSD/CSAF), content type checks; write **DTO** with normalized casing.
* **Map**: build canonical records; all outputs carry **provenance** (doc digest, URI, anchors).
### 4.2 Version range normalization
* **SemVer** ecosystems (npm, pypi, maven, nuget, golang): normalize to `introduced`/`fixed` semver ranges (use `~`, `^`, `<`, `>=` canonicalized to intervals).
* **RPM EVR**: `epoch:version-release` with `rpmvercmp` semantics; store raw EVR strings and also **computed order keys** for query.
* **DEB**: dpkg version comparison semantics mirrored; store computed keys.
* **APK**: Alpine version semantics; compute order keys.
* **Generic**: if provider uses text, retain raw; do **not** invent ranges.
### 4.3 Severity & CVSS
* Normalize **CVSS v2/v3/v4** where available (vector, baseScore, severity).
* If multiple CVSS sources exist, track them all; **effective severity** defaults to **max** by policy (configurable).
* **ExploitKnown** toggled by KEV and equivalent sources; store **evidence** (source, date).
---
## 5) Observation & linkset pipeline
> **Goal:** deterministically ingest raw documents into immutable observations, correlate them into evidence-rich linksets, and broadcast changes without precedence or mutation.
### 5.1 Observation flow
1. **Connector fetch/parse/map** connectors download upstream payloads, validate signatures, and map to DTOs (identifiers, references, raw payload, provenance).
2. **AOC guard** `AOCWriteGuard` verifies forbidden keys, provenance completeness, tenant claims, timestamp normalization, and content hash idempotency. Violations raise `ERR_AOC_00x` mapped to structured logs and metrics.
3. **Append-only write** observations insert into `advisory_observations`; duplicates by `(tenant, source.vendor, upstream.upstreamId, upstream.contentHash)` become no-ops; new content for same upstream id creates a supersedes chain.
4. **Change feed + event** Mongo change streams trigger `advisory.observation.updated@1` events with deterministic payloads (IDs, hash, supersedes pointer, linkset summary). Policy Engine, Offline Kit builder, and guard dashboards subscribe.
### 5.2 Linkset correlation
1. **Queue** observation deltas enqueue correlation jobs keyed by `(tenant, vulnerabilityId, productKey)` candidates derived from identifiers + alias graph.
2. **Canonical grouping** builder resolves aliases using Conceliers alias store and deterministic heuristics (vendor > distro > cert), deriving normalized product keys (purl preferred) and confidence scores.
3. **Linkset materialization** `advisory_linksets` documents store sorted observation references, alias sets, product keys, range metadata, and conflict payloads. Writes are idempotent; unchanged hashes skip updates.
4. **Conflict detection** builder emits structured conflicts (`severity-mismatch`, `affected-range-divergence`, `reference-clash`, `alias-inconsistency`, `metadata-gap`). Conflicts carry per-observation values for explainability.
5. **Event emission** `advisory.linkset.updated@1` summarizes deltas (`added`, `removed`, `changed` observation IDs, conflict updates, confidence changes) and includes a canonical hash for replay validation.
### 5.3 Event contract
| Event | Schema | Notes |
|-------|--------|-------|
| `advisory.observation.updated@1` | `events/advisory.observation.updated@1.json` | Fired on new or superseded observations. Includes `observationId`, source metadata, `linksetSummary` (aliases/purls), supersedes pointer (if any), SHA-256 hash, and `traceId`. |
| `advisory.linkset.updated@1` | `events/advisory.linkset.updated@1.json` | Fired when correlation changes. Includes `linksetId`, `key{vulnerabilityId, productKey, confidence}`, observation deltas, conflicts, `updatedAt`, and canonical hash. |
Events are emitted via NATS (primary) and Redis Stream (fallback). Consumers acknowledge idempotently using the hash; duplicates are safe. Offline Kit captures both topics during bundle creation for air-gapped replay.
---
## 6) Storage schema (MongoDB)
### Collections & indexes (LNM path)
* `concelier.sources` `{_id, type, baseUrl, enabled, notes}` connector catalog.
* `concelier.source_state` `{sourceName(unique), enabled, cursor, lastSuccess, backoffUntil, paceOverrides}` run-state (TTL indexes on `backoffUntil`).
* `concelier.documents` `{_id, sourceName, uri, fetchedAt, sha256, contentType, status, metadata, gridFsId?, etag?, lastModified?}` raw payload registry.
* Indexes: `{sourceName:1, uri:1}` unique; `{fetchedAt:-1}` for recent fetches.
* `concelier.dto` `{_id, sourceName, documentId, schemaVer, payload, validatedAt}` normalized connector DTOs used for replay.
* Index: `{sourceName:1, documentId:1}`.
* `concelier.advisory_observations`
```
{
_id: "tenant:vendor:upstreamId:revision",
tenant,
source: { vendor, stream, api, collectorVersion },
upstream: { upstreamId, documentVersion, fetchedAt, receivedAt, contentHash, signature },
content: { format, specVersion, raw, metadata? },
identifiers: { cve?, ghsa?, vendorIds[], aliases[] },
linkset: { purls[], cpes[], aliases[], references[], reconciledFrom[] },
supersedes?: "prevObservationId",
createdAt,
attributes?: object
}
```
* Indexes: `{tenant:1, upstream.upstreamId:1}`, `{tenant:1, source.vendor:1, linkset.purls:1}`, `{tenant:1, linkset.aliases:1}`, `{tenant:1, createdAt:-1}`.
* `concelier.advisory_linksets`
```
{
_id: "sha256:...",
tenant,
key: { vulnerabilityId, productKey, confidence },
observations: [
{ observationId, sourceVendor, statement, collectedAt }
],
aliases: { primary, others: [] },
purls: [],
cpes: [],
conflicts: [],
createdAt,
updatedAt
}
```
* Indexes: `{tenant:1, key.vulnerabilityId:1, key.productKey:1}`, `{tenant:1, purls:1}`, `{tenant:1, aliases.primary:1}`, `{tenant:1, updatedAt:-1}`.
* `concelier.advisory_events`
```
{
_id: ObjectId,
tenant,
type: "advisory.observation.updated" | "advisory.linkset.updated",
key,
delta,
hash,
occurredAt
}
```
* TTL index on `occurredAt` (configurable retention), `{type:1, occurredAt:-1}` for replay.
* `concelier.export_state` `{_id(exportKind), baseExportId?, baseDigest?, lastFullDigest?, lastDeltaDigest?, cursor, files[]}`
* `locks` `{_id(jobKey), holder, acquiredAt, heartbeatAt, leaseMs, ttlAt}` (TTL cleans dead locks)
* `jobs` `{_id, type, args, state, startedAt, heartbeatAt, endedAt, error}`
**Legacy collections** (`advisory`, `alias`, `affected`, `reference`, `merge_event`) remain read-only during the migration window to support back-compat exports. New code must not write to them; scheduled cleanup removes them after Link-Not-Merge GA.
**GridFS buckets**: `fs.documents` for raw payloads (immutable); `fs.exports` for historical JSON/Trivy archives.
---
## 7) Exporters
### 7.1 Deterministic JSON (vulnlist style)
* Folder structure mirroring `/<scheme>/<first-two>/<rest>/…` with one JSON per advisory; deterministic ordering, stable timestamps, normalized whitespace.
* `manifest.json` lists all files with SHA256 and a toplevel **export digest**.
### 7.2 Trivy DB exporter
* Builds Bolt DB archives compatible with Trivy; supports **full** and **delta** modes.
* In delta, unchanged blobs are reused from the base; metadata captures:
```json
{
"mode": "delta|full",
"baseExportId": "...",
"baseManifestDigest": "sha256:...",
"changed": ["path1", "path2"],
"removed": ["path3"]
}
```
* Optional ORAS push (OCI layout) for registries.
* Offline kit bundles include Trivy DB + JSON tree + export manifest.
* Mirror-ready bundles: when `concelier.trivy.mirror` defines domains, the exporter emits `mirror/index.json` plus per-domain `manifest.json`, `metadata.json`, and `db.tar.gz` files with SHA-256 digests so Concelier mirrors can expose domain-scoped download endpoints.
* Concelier.WebService serves `/concelier/exports/index.json` and `/concelier/exports/mirror/{domain}/…` directly from the export tree with hour-long budgets (index: 60s, bundles: 300s, immutable) and per-domain rate limiting; the endpoints honour Stella Ops Authority or CIDR bypass lists depending on mirror topology.
### 7.3 Handoff to Signer/Attestor (optional)
* On export completion, if `attest: true` is set in job args, Concelier **posts** the artifact metadata to **Signer**/**Attestor**; Concelier itself **does not** hold signing keys.
* Export record stores returned `{ uuid, index, url }` from **Rekor v2**.
---
## 8) REST APIs
All under `/api/v1/concelier`.
**Health & status**
```
GET /healthz | /readyz
GET /status → sources, last runs, export cursors
```
**Sources & jobs**
```
GET /sources → list of configured sources
POST /sources/{name}/trigger → { jobId }
POST /sources/{name}/pause | /resume → toggle
GET /jobs/{id} → job status
```
**Exports**
```
POST /exports/json { full?:bool, force?:bool, attest?:bool } → { exportId, digest, rekor? }
POST /exports/trivy { full?:bool, force?:bool, publish?:bool, attest?:bool } → { exportId, digest, rekor? }
GET /exports/{id} → export metadata (kind, digest, createdAt, rekor?)
GET /concelier/exports/index.json → mirror index describing available domains/bundles
GET /concelier/exports/mirror/{domain}/manifest.json
GET /concelier/exports/mirror/{domain}/bundle.json
GET /concelier/exports/mirror/{domain}/bundle.json.jws
```
**Search (operator debugging)**
```
GET /advisories/{key}
GET /advisories?scheme=CVE&value=CVE-2025-12345
GET /affected?productKey=pkg:rpm/openssl&limit=100
```
**AuthN/Z:** Authority tokens (OpTok) with roles: `concelier.read`, `concelier.admin`, `concelier.export`.
---
## 9) Configuration (YAML)
```yaml
concelier:
mongo: { uri: "mongodb://mongo/concelier" }
s3:
endpoint: "http://minio:9000"
bucket: "stellaops-concelier"
scheduler:
windowSeconds: 30
maxParallelSources: 4
sources:
- name: redhat
kind: csaf
baseUrl: https://access.redhat.com/security/data/csaf/v2/
signature: { type: pgp, keys: [ "…redhat PGP…" ] }
enabled: true
windowDays: 7
- name: suse
kind: csaf
baseUrl: https://ftp.suse.com/pub/projects/security/csaf/
signature: { type: pgp, keys: [ "…suse PGP…" ] }
- name: ubuntu
kind: usn-json
baseUrl: https://ubuntu.com/security/notices.json
signature: { type: none }
- name: osv
kind: osv
baseUrl: https://api.osv.dev/v1/
signature: { type: none }
- name: ghsa
kind: ghsa
baseUrl: https://api.github.com/graphql
auth: { tokenRef: "env:GITHUB_TOKEN" }
exporters:
json:
enabled: true
output: s3://stellaops-concelier/json/
trivy:
enabled: true
mode: full
output: s3://stellaops-concelier/trivy/
oras:
enabled: false
repo: ghcr.io/org/concelier
precedence:
vendorWinsOverDistro: true
distroWinsOverOsv: true
severity:
policy: max # or 'vendorPreferred' / 'distroPreferred'
```
---
## 10) Security & compliance
* **Outbound allowlist** per connector (domains, protocols); proxy support; TLS pinning where possible.
* **Signature verification** for raw docs (PGP/cosign/x509) with results stored in `document.metadata.sig`. Docs failing verification may still be ingested but flagged; Policy Engine or downstream policy can down-weight them.
* **No secrets in logs**; auth material via `env:` or mounted files; HTTP redaction of `Authorization` headers.
* **Multitenant**: pertenant DBs or prefixes; pertenant S3 prefixes; tenantscoped API tokens.
* **Determinism**: canonical JSON writer; export digests stable across runs given same inputs.
---
## 11) Performance targets & scale
* **Ingest**: ≥ 5k documents/min on 4 cores (CSAF/OpenVEX/JSON).
* **Normalize/map**: ≥ 50k observation statements/min on 4 cores.
* **Observation write**: ≤ 5ms P95 per document (including guard + Mongo write).
* **Linkset build**: ≤ 15ms P95 per `(vulnerabilityId, productKey)` update, even with 20+ contributing observations.
* **Export**: 1M advisories JSON in ≤ 90s (streamed, zstd), Trivy DB in ≤ 60s on 8 cores.
* **Memory**: hard cap per job; chunked streaming writers; backpressure to avoid GC spikes.
**Scale pattern**: add Concelier replicas; Mongo scaling via indices and read/write concerns; GridFS only for oversized docs.
---
## 12) Observability
* **Metrics**
* `concelier.fetch.docs_total{source}`
* `concelier.fetch.bytes_total{source}`
* `concelier.parse.failures_total{source}`
* `concelier.map.statements_total{source}`
* `concelier.observations.write_total{result=ok|noop|error}`
* `concelier.linksets.updated_total{result=ok|skip|error}`
* `concelier.linksets.conflicts_total{type}`
* `concelier.export.bytes{kind}`
* `concelier.export.duration_seconds{kind}`
* **Tracing** around fetch/parse/map/observe/linkset/export.
* **Logs**: structured with `source`, `uri`, `docDigest`, `advisoryKey`, `exportId`.
---
## 13) Testing matrix
* **Connectors:** fixture suites for each provider/format (happy path; malformed; signature fail).
* **Version semantics:** EVR vs dpkg vs semver edge cases (epoch bumps, tilde versions, prereleases).
* **Linkset correlation:** multi-source conflicts (severity, range, alias) produce deterministic conflict payloads; ensure confidence scoring stable.
* **Export determinism:** byteforbyte stable outputs across runs; digest equality.
* **Performance:** soak tests with 1M advisories; cap memory; verify backpressure.
* **API:** pagination, filters, RBAC, error envelopes (RFC 7807).
* **Offline kit:** bundle build & import correctness.
---
## 14) Failure modes & recovery
* **Source outages:** scheduler backs off with exponential delay; `source_state.backoffUntil`; alerts on staleness.
* **Schema drifts:** parse stage marks DTO invalid; job fails with clear diagnostics; connector version flags track supported schema ranges.
* **Partial exports:** exporters write to temp prefix; **manifest commit** is atomic; only then move to final prefix and update `export_state`.
* **Resume:** all stages idempotent; `source_state.cursor` supports window resume.
---
## 15) Operator runbook (quick)
* **Trigger all sources:** `POST /api/v1/concelier/sources/*/trigger`
* **Force full export JSON:** `POST /api/v1/concelier/exports/json { "full": true, "force": true }`
* **Force Trivy DB delta publish:** `POST /api/v1/concelier/exports/trivy { "full": false, "publish": true }`
* **Inspect observation:** `GET /api/v1/concelier/observations/{observationId}`
* **Query linkset:** `GET /api/v1/concelier/linksets?vulnerabilityId=CVE-2025-12345&productKey=pkg:rpm/redhat/openssl`
* **Pause noisy source:** `POST /api/v1/concelier/sources/osv/pause`
---
## 16) Rollout plan
1. **MVP**: Red Hat (CSAF), SUSE (CSAF), Ubuntu (USN JSON), OSV; JSON export.
2. **Add**: GHSA GraphQL, Debian (DSA HTML/JSON), Alpine secdb; Trivy DB export.
3. **Attestation handoff**: integrate with **Signer/Attestor** (optional).
4. **Scale & diagnostics**: provider dashboards, staleness alerts, export cache reuse.
5. **Offline kit**: endtoend verified bundles for airgap.