Files
git.stella-ops.org/docs/ingestion/aggregation-only-contract.md
master 7600caea63
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Merge branch 'main' of https://git.stella-ops.org/stella-ops.org/git.stella-ops.org
2025-10-30 00:12:41 +02:00

182 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Aggregation-Only Contract Reference
> The Aggregation-Only Contract (AOC) is the governing rule set that keeps StellaOps ingestion services deterministic, policy-neutral, and auditable. It applies to Concelier, Excititor, and any future collectors that write raw advisory or VEX documents.
## 1. Purpose and Scope
- Defines the canonical behaviour for `advisory_raw` and `vex_raw` collections and the linkset hints they may emit.
- Applies to every ingestion runtime (`StellaOps.Concelier.*`, `StellaOps.Excititor.*`), the Authority scopes that guard them, and the DevOps/QA surfaces that verify compliance.
- Complements the high-level architecture in [Concelier](../modules/concelier/architecture.md) and Authority enforcement documented in [Authority Architecture](../modules/authority/architecture.md).
- Paired guidance: see the guard-rail checkpoints in [AOC Guardrails](../aoc/aoc-guardrails.md) and CLI usage that will land in `/docs/modules/cli/guides/` as part of Sprint 19 follow-up.
## 2. Philosophy and Goals
- Preserve upstream truth: ingestion only captures immutable raw facts plus provenance, never derived severity or policy decisions.
- Defer interpretation: Policy Engine and downstream overlays remain the sole writers of materialised findings, severity, consensus, or risk scores.
- Make every write explainable: provenance, signatures, and content hashes are required so operators can prove where each fact originated.
- Keep outputs reproducible: identical inputs must yield identical documents, hashes, and linksets across replays and air-gapped installs.
## 3. Contract Invariants
| # | Invariant | What it forbids or requires | Enforcement surfaces |
|---|-----------|-----------------------------|----------------------|
| 1 | No derived severity at ingest | Reject top-level keys such as `severity`, `cvss`, `effective_status`, `consensus_provider`, `risk_score`. Raw upstream CVSS remains inside `content.raw`. | Mongo schema validator, `AOCWriteGuard`, Roslyn analyzer, `stella aoc verify`. |
| 2 | No merges or opinionated dedupe | Each upstream document persists on its own; ingestion never collapses multiple vendors into one document. | Repository interceptors, unit/fixture suites. |
| 3 | Provenance is mandatory | `source.*`, `upstream.*`, and `signature` metadata must be present; missing provenance triggers `ERR_AOC_004`. | Schema validator, guard, CLI verifier. |
| 4 | Idempotent upserts | Writes keyed by `(vendor, upstream_id, content_hash)` either no-op or insert a new revision with `supersedes`. Duplicate hashes map to the same document. | Repository guard, storage unique index, CI smoke tests. |
| 5 | Append-only revisions | Updates create a new document with `supersedes` pointer; no in-place mutation of content. | Mongo schema (`supersedes` format), guard, data migration scripts. |
| 6 | Linkset only | Ingestion may compute link hints (`purls`, `cpes`, IDs) to accelerate joins, but must not transform or infer severity or policy. | Linkset builders reviewed via fixtures and analyzers. |
| 7 | Policy-only effective findings | Only Policy Engine identities can write `effective_finding_*`; ingestion callers receive `ERR_AOC_006` if they attempt it. | Authority scopes, Policy Engine guard. |
| 8 | Schema safety | Unknown top-level keys reject with `ERR_AOC_007`; timestamps use ISO 8601 UTC strings; tenant is required. | Mongo validator, JSON schema tests. |
| 9 | Clock discipline | Collectors stamp `fetched_at` and `received_at` monotonically per batch to support reproducibility windows. | Collector contracts, QA fixtures. |
## 4. Raw Schemas
### 4.1 `advisory_raw`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | string | `advisory_raw:{source}:{upstream_id}:{revision}`; deterministic and tenant-scoped. |
| `tenant` | string | Required; injected by Authority middleware and asserted by schema validator. |
| `source.vendor` | string | Provider identifier (e.g., `redhat`, `osv`, `ghsa`). |
| `source.stream` | string | Connector stream name (`csaf`, `osv`, etc.). |
| `source.api` | string | Absolute URI of upstream document; stored for traceability. |
| `source.collector_version` | string | Semantic version of the collector. |
| `upstream.upstream_id` | string | Vendor- or ecosystem-provided identifier (CVE, GHSA, vendor ID). |
| `upstream.document_version` | string | Upstream issued timestamp or revision string. |
| `upstream.fetched_at` / `received_at` | string | ISO 8601 UTC timestamps recorded by the collector. |
| `upstream.content_hash` | string | `sha256:` digest of the raw payload used for idempotency. |
| `upstream.signature` | object | Required structure storing `present`, `format`, `key_id`, `sig`; even unsigned payloads set `present: false`. |
| `content.format` | string | Source format (`CSAF`, `OSV`, etc.). |
| `content.spec_version` | string | Upstream spec version when known. |
| `content.raw` | object | Full upstream payload, untouched except for transport normalisation. |
| `identifiers` | object | Upstream identifiers (`cve`, `ghsa`, `aliases`, etc.) captured as provided (trimmed, order preserved, duplicates allowed). |
| `linkset` | object | Join hints (see section 4.3). |
| `supersedes` | string or null | Points to previous revision of same upstream doc when content hash changes. |
### 4.2 `vex_raw`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | string | `vex_raw:{source}:{upstream_id}:{revision}`. |
| `tenant` | string | Required; matches advisory collection requirements. |
| `source.*` | object | Same shape and requirements as `advisory_raw`. |
| `upstream.*` | object | Includes `document_version`, timestamps, `content_hash`, and `signature`. |
| `content.format` | string | Typically `CycloneDX-VEX` or `CSAF-VEX`. |
| `content.raw` | object | Entire upstream VEX payload. |
| `identifiers.statements` | array | Normalised statement summaries (IDs, PURLs, status, justification) to accelerate policy joins. |
| `linkset` | object | CVEs, GHSA IDs, and PURLs referenced in the document. |
| `supersedes` | string or null | Same convention as advisory documents. |
### 4.3 Linkset Fields
- `purls`: fully qualified Package URLs extracted from raw ranges or product nodes.
- `cpes`: Common Platform Enumerations when upstream docs provide them.
- `aliases`: Any alternate advisory identifiers present in the payload.
- `references`: Array of `{ type, url }` pairs pointing back to vendor advisories, patches, or exploits.
- `reconciled_from`: Provenance of linkset entries (JSON Pointer or field origin) to make automated checks auditable.
Canonicalisation rules:
- Package URLs are rendered in canonical form without qualifiers/subpaths (`pkg:type/namespace/name@version`).
- CPE values are normalised to the 2.3 binding (`cpe:2.3:part:vendor:product:version:*:*:*:*:*:*:*`).
- Connector mapping stages are responsible for the canonical form; ingestion trims whitespace but otherwise preserves the original order and duplicate entries so downstream policy can reason about upstream intent.
### 4.4 `advisory_observations`
`advisory_observations` is an immutable projection of the validated raw document used by LinkNotMerge overlays. Fields mirror the JSON contract surfaced by `StellaOps.Concelier.Models.Observations.AdvisoryObservation`.
| Field | Type | Notes |
|-------|------|-------|
| `_id` | string | Deterministic observation id — `{tenant}:{source.vendor}:{upstreamId}:{revision}`. |
| `tenant` | string | Lower-case tenant identifier. |
| `source.vendor` / `source.stream` | string | Connector identity (e.g., `vendor/redhat`, `ecosystem/osv`). |
| `source.api` | string | Absolute URI the connector fetched from. |
| `source.collectorVersion` | string | Optional semantic version of the connector build. |
| `upstream.upstream_id` | string | Advisory identifier as issued by the provider (CVE, vendor ID, etc.). |
| `upstream.document_version` | string | Upstream revision/version string. |
| `upstream.fetchedAt` / `upstream.receivedAt` | datetime | UTC timestamps recorded by the connector. |
| `upstream.contentHash` | string | `sha256:` digest used for idempotency. |
| `upstream.signature` | object | `{present, format?, keyId?, signature?}` describing upstream signature material. |
| `content.format` / `content.specVersion` | string | Raw payload format metadata (CSAF, OSV, JSON, etc.). |
| `content.raw` | object | Full upstream document stored losslessly (Relaxed Extended JSON). |
| `content.metadata` | object | Optional connector-specific metadata (batch ids, hints). |
| `linkset.aliases` | array | Connector-supplied aliases (trimmed, order preserved, duplicates allowed). |
| `linkset.purls` | array | Connector-supplied PURLs (ingestion preserves order and duplicates). |
| `linkset.cpes` | array | Connector-supplied CPE URIs (trimmed, order preserved). |
| `linkset.references` | array | `{ type, url }` pairs (trimmed; ingestion preserves order). |
| `createdAt` | datetime | Timestamp when Concelier persisted the observation. |
| `attributes` | object | Optional provenance attributes keyed by connector. |
## 5. Error Model
| Code | Description | HTTP status | Surfaces |
|------|-------------|-------------|----------|
| `ERR_AOC_001` | Forbidden field detected (severity, cvss, effective data). | 400 | Ingestion APIs, CLI verifier, CI guard. |
| `ERR_AOC_002` | Merge attempt detected (multiple upstream sources fused into one document). | 400 | Ingestion APIs, CLI verifier. |
| `ERR_AOC_003` | Idempotency violation (duplicate without supersedes pointer). | 409 | Repository guard, Mongo unique index, CLI verifier. |
| `ERR_AOC_004` | Missing provenance metadata (`source`, `upstream`, `signature`). | 422 | Schema validator, ingestion endpoints. |
| `ERR_AOC_005` | Signature or checksum mismatch. | 422 | Collector validation, CLI verifier. |
| `ERR_AOC_006` | Attempt to persist derived findings from ingestion context. | 403 | Policy engine guard, Authority scopes. |
| `ERR_AOC_007` | Unknown top-level fields (schema violation). | 400 | Mongo validator, CLI verifier. |
Consumers should map these codes to CLI exit codes and structured log events so automation can fail fast and produce actionable guidance.
## 6. API and Tooling Interfaces
- **Concelier ingestion** (`StellaOps.Concelier.WebService`)
- `POST /ingest/advisory`: accepts upstream payload metadata; server-side guard constructs and persists raw document.
- `GET /advisories/raw/{id}` and filterable list endpoints expose raw documents for debugging and offline analysis.
- `POST /aoc/verify`: runs guard checks over recent documents and returns summary totals plus first violations.
- **Excititor ingestion** (`StellaOps.Excititor.WebService`) mirrors the same surface for VEX documents.
- **CLI workflows** (`stella aoc verify`, `stella sources ingest --dry-run`) surface pre-flight verification; documentation will live in `/docs/modules/cli/guides/` alongside Sprint 19 CLI updates.
- **Authority scopes**: new `advisory:ingest`, `advisory:read`, `vex:ingest`, and `vex:read` scopes enforce least privilege; see [Authority Architecture](../modules/authority/architecture.md) for scope grammar.
## 7. Idempotency and Supersedes Rules
1. Compute `content_hash` before any transformation; use it with `(source.vendor, upstream.upstream_id)` to detect duplicates.
2. If a document with the same hash already exists, skip the write and log a no-op.
3. When a new hash arrives for an existing upstream document, insert a new record and set `supersedes` to the previous `_id`.
4. Keep supersedes chains acyclic; collectors must resolve conflicts by rewinding before they insert.
5. Expose idempotency counters via metrics (`ingestion_write_total{result=ok|noop}`) to catch regressions early.
## 8. Migration Playbook
1. Freeze ingestion writes except for raw pass-through paths while deploying schema validators.
2. Snapshot existing collections to `_backup_*` for rollback safety.
3. Strip forbidden fields from historical documents into a temporary `advisory_view_legacy` used only during transition.
4. Enable Mongo JSON schema validators for `advisory_raw` and `vex_raw`.
5. Run collectors in `--dry-run` to confirm only allowed keys appear; fix violations before lifting the freeze.
6. Point Policy Engine to consume exclusively from raw collections and compute derived outputs downstream.
7. Delete legacy normalisation paths from ingestion code and enable runtime guards plus CI linting.
8. Roll forward CLI, Console, and dashboards so operators can monitor AOC status end-to-end.
## 9. Observability and Diagnostics
- **Metrics**: `ingestion_write_total{result=ok|reject}`, `aoc_violation_total{code}`, `ingestion_signature_verified_total{result}`, `ingestion_latency_seconds`, `advisory_revision_count`.
- **Traces**: spans `ingest.fetch`, `ingest.transform`, `ingest.write`, and `aoc.guard` with correlation IDs shared across workers.
- **Logs**: structured entries must include `tenant`, `source.vendor`, `upstream.upstream_id`, `content_hash`, and `violation_code` when applicable.
- **Dashboards**: DevOps should add panels for violation counts, signature failures, supersedes growth, and CLI verifier outcomes for each tenant.
## 10. Security and Tenancy Checklist
- Enforce Authority scopes (`advisory:ingest`, `vex:ingest`, `advisory:read`, `vex:read`) and require tenant claims on every request.
- Maintain pinned trust stores for signature verification; capture verification result in metrics and logs.
- Ensure collectors never log secrets or raw authentication headers; redact tokens before persistence.
- Validate that Policy Engine remains the only identity with permission to write `effective_finding_*` documents.
- Verify offline bundles include the raw collections, guard configuration, and verifier binaries so air-gapped installs can audit parity.
- Document operator steps for recovering from violations, including rollback to superseded revisions and re-running policy evaluation.
## 11. Compliance Checklist
- [ ] Deterministic guard enabled in Concelier and Excititor repositories.
- [ ] Mongo validators deployed for `advisory_raw` and `vex_raw`.
- [ ] Authority scopes and tenant enforcement verified via integration tests.
- [ ] CLI and CI pipelines run `stella aoc verify` against seeded snapshots.
- [ ] Observability feeds (metrics, logs, traces) wired into dashboards with alerts.
- [ ] Offline kit instructions updated to bundle validators and verifier tooling.
- [ ] Security review recorded covering ingestion, tenancy, and rollback procedures.
---
*Last updated: 2025-10-27 (Sprint 19).*