# ARCHITECTURE.md — **StellaOps.Feedser**
> **Goal**: Build a sovereign-ready, self-hostable **feed-merge service** that ingests authoritative vulnerability sources, normalizes and de-duplicates them into **MongoDB**, and exports **JSON** and **Trivy-compatible DB** artifacts.
> **Form factor**: Long-running **Web Service** with **REST APIs** (health, status, control) and an embedded **internal cron scheduler**. Controllable via StellaOps.Cli (`# stella db ...`).
> **No signing inside Feedser** (signing is a separate pipeline step).
> **Runtime SDK baseline**: .NET 10 Preview 7 (SDK 10.0.100-preview.7.25380.108) targeting `net10.0`, aligned with the deployed api.stella-ops.org service.
> **Four explicit stages**:
>
> 1. **Source Download** → raw documents.
> 2. **Parse & Normalize** → schema-validated DTOs enriched with canonical identifiers.
> 3. **Merge & Deduplicate** → precedence-aware canonical records persisted to MongoDB.
> 4. **Export** → JSON or TrivyDB (full or delta), then (externally) sign/publish.
---
## 1) Naming & Solution Layout
**Source connectors** namespace prefix: `StellaOps.Feedser.Source.*`
**Exporters**:
* `StellaOps.Feedser.Exporter.Json`
* `StellaOps.Feedser.Exporter.TrivyDb`
**Projects** (`/src`):
```
StellaOps.Feedser.WebService/ # ASP.NET Core (Minimal API, net10.0 preview) WebService + embedded scheduler
StellaOps.Feedser.Core/ # Domain models, pipelines, merge/dedupe engine, jobs orchestration
StellaOps.Feedser.Models/ # Canonical POCOs, JSON Schemas, enums
StellaOps.Feedser.Storage.Mongo/ # Mongo repositories, GridFS access, indexes, resume "flags"
StellaOps.Feedser.Source.Common/ # HTTP clients, rate-limiters, schema validators, parsers utils
StellaOps.Feedser.Source.Cve/
StellaOps.Feedser.Source.Nvd/
StellaOps.Feedser.Source.Ghsa/
StellaOps.Feedser.Source.Osv/
StellaOps.Feedser.Source.Jvn/
StellaOps.Feedser.Source.CertCc/
StellaOps.Feedser.Source.Kev/
StellaOps.Feedser.Source.Kisa/
StellaOps.Feedser.Source.CertIn/
StellaOps.Feedser.Source.CertFr/
StellaOps.Feedser.Source.CertBund/
StellaOps.Feedser.Source.Acsc/
StellaOps.Feedser.Source.Cccs/
StellaOps.Feedser.Source.Ru.Bdu/ # HTML→schema with LLM fallback (gated)
StellaOps.Feedser.Source.Ru.Nkcki/ # PDF/HTML bulletins → structured
StellaOps.Feedser.Source.Vndr.Msrc/
StellaOps.Feedser.Source.Vndr.Cisco/
StellaOps.Feedser.Source.Vndr.Oracle/
StellaOps.Feedser.Source.Vndr.Adobe/ # APSB ingest; emits vendor RangePrimitives with adobe.track/platform/priority telemetry + fixed-status provenance.
StellaOps.Feedser.Source.Vndr.Apple/
StellaOps.Feedser.Source.Vndr.Chromium/
StellaOps.Feedser.Source.Vndr.Vmware/
StellaOps.Feedser.Source.Distro.RedHat/
StellaOps.Feedser.Source.Distro.Debian/ # Fetches DSA list + detail HTML, emits EVR RangePrimitives with per-release provenance and telemetry.
StellaOps.Feedser.Source.Distro.Ubuntu/ # Ubuntu Security Notices connector (JSON index → EVR ranges with ubuntu.pocket telemetry).
StellaOps.Feedser.Source.Distro.Suse/ # CSAF fetch pipeline emitting NEVRA RangePrimitives with suse.status vendor telemetry.
StellaOps.Feedser.Source.Ics.Cisa/
StellaOps.Feedser.Source.Ics.Kaspersky/
StellaOps.Feedser.Normalization/ # Canonical mappers, validators, version-range normalization
StellaOps.Feedser.Merge/ # Identity graph, precedence, deterministic merge
StellaOps.Feedser.Exporter.Json/
StellaOps.Feedser.Exporter.TrivyDb/
StellaOps.Feedser.<Component>.Tests/ # Component-scoped unit/integration suites (Core, Storage.Mongo, Source.*, Exporter.*, WebService, etc.)
```
---
## 2) Runtime Shape
**Process**: single service (`StellaOps.Feedser.WebService`)
* `Program.cs`: top-level entry using **Generic Host**, **DI**, **Options** binding from `appsettings.json` + environment + optional `feedser.yaml`.
* Built-in **scheduler** (cron-like) + **job manager** with **distributed locks** in Mongo to prevent overlaps, enforce timeouts, allow cancel/kill.
* **REST APIs** for health/readiness/progress/trigger/kill/status.
**Key NuGet concepts** (indicative): `MongoDB.Driver`, `Polly` (retry/backoff), `System.Threading.Channels`, `Microsoft.Extensions.Http`, `Microsoft.Extensions.Hosting`, `Serilog`, `OpenTelemetry`.
---
## 3) Data Storage — **MongoDB** (single source of truth)
**Database**: `feedser`
**Write concern**: `majority` for merge/export state, `acknowledged` for raw docs.
**Collections** (with “flags”/resume points):

* `source`
  * `_id`, `name`, `type`, `baseUrl`, `auth`, `notes`.
* `source_state`
  * Keys: `sourceName` (unique), `enabled`, `cursor`, `lastSuccess`, `failCount`, `backoffUntil`, `paceOverrides`, `paused`.
  * Drives incremental fetch/parse/map resume and operator pause/pace controls.
* `document`
  * `_id`, `sourceName`, `uri`, `fetchedAt`, `sha256`, `contentType`, `status`, `metadata`, `gridFsId`, `etag`, `lastModified`.
  * Index `{sourceName:1, uri:1}` unique; optional TTL for superseded versions.
* `dto`
  * `_id`, `sourceName`, `documentId`, `schemaVer`, `payload` (BSON), `validatedAt`.
  * Index `{sourceName:1, documentId:1}`.
* `advisory`
  * `_id`, `advisoryKey`, `title`, `summary`, `lang`, `published`, `modified`, `severity`, `exploitKnown`.
  * Unique `{advisoryKey:1}` plus indexes on `modified` and `published`.
* `alias`
  * `advisoryId`, `scheme`, `value` with index `{scheme:1, value:1}`.
* `affected`
  * `advisoryId`, `platform`, `name`, `versionRange`, `cpe`, `purl`, `fixedBy`, `introducedVersion`.
  * Index `{platform:1, name:1}`, `{advisoryId:1}`.
* `reference`
  * `advisoryId`, `url`, `kind`, `sourceTag` (e.g., advisory/patch/kb).
* Flags collections: `kev_flag`, `ru_flags`, `jp_flags`, `psirt_flags` keyed by `advisoryId`.
* `merge_event`
  * `_id`, `advisoryKey`, `beforeHash`, `afterHash`, `mergedAt`, `inputs` (document ids).
* `export_state`
  * `_id` (`json`/`trivydb`), `baseExportId`, `baseDigest`, `lastFullDigest`, `lastDeltaDigest`, `exportCursor`, `targetRepo`, `exporterVersion`.
* `locks`
  * `_id` (`jobKey`), `holder`, `acquiredAt`, `heartbeatAt`, `leaseMs`, `ttlAt` (TTL index cleans dead locks).
* `jobs`
  * `_id`, `type`, `args`, `state`, `startedAt`, `endedAt`, `error`, `owner`, `heartbeatAt`, `timeoutMs`.

**GridFS buckets**: `fs.documents` for raw large payloads; referenced by `document.gridFsId`.

---
## 4) Job & Scheduler Model
* Scheduler stores cron expressions per source/exporter in config; persists next-run pointers in Mongo.
* Jobs acquire locks (`locks` collection) to ensure singleton execution per source/exporter.
* Supports manual triggers via API endpoints (`POST /jobs/{type}`) and pause/resume toggles per source.
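
The lease semantics implied by the `locks` fields (`holder`, `heartbeatAt`, `leaseMs`) can be sketched in memory as follows. This is only an illustration under assumptions: `LockDoc` and `LeaseLockTable` are hypothetical names, the dictionary stands in for the Mongo collection, and a real implementation would need an atomic conditional upsert rather than a process-local `lock`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for one document of the `locks` collection.
public sealed record LockDoc(string JobKey, string Holder, DateTime AcquiredAt, DateTime HeartbeatAt, int LeaseMs);

// Hypothetical lease table; illustrates the rules only, not the storage layer.
public sealed class LeaseLockTable
{
    private readonly Dictionary<string, LockDoc> _locks = new();
    private readonly object _gate = new();

    // Acquire succeeds when the key is free, already held by the caller,
    // or the previous holder's lease has expired (heartbeatAt + leaseMs <= now).
    public bool TryAcquire(string jobKey, string holder, int leaseMs, DateTime nowUtc)
    {
        lock (_gate)
        {
            if (_locks.TryGetValue(jobKey, out var current)
                && current.Holder != holder
                && current.HeartbeatAt.AddMilliseconds(current.LeaseMs) > nowUtc)
            {
                return false; // another holder still has a live lease
            }

            _locks[jobKey] = new LockDoc(jobKey, holder, nowUtc, nowUtc, leaseMs);
            return true;
        }
    }

    // Heartbeat extends the lease, but only for the current holder.
    public bool Heartbeat(string jobKey, string holder, DateTime nowUtc)
    {
        lock (_gate)
        {
            if (!_locks.TryGetValue(jobKey, out var current) || current.Holder != holder)
            {
                return false;
            }

            _locks[jobKey] = current with { HeartbeatAt = nowUtc };
            return true;
        }
    }

    // Release removes the lock only when the caller is the holder.
    public bool Release(string jobKey, string holder)
    {
        lock (_gate)
        {
            return _locks.TryGetValue(jobKey, out var current)
                && current.Holder == holder
                && _locks.Remove(jobKey);
        }
    }
}
```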
---
## 5) Connector Contracts
Connectors implement:
```csharp
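using System;
using System.Threading;
using System.Threading.Tasks;

public interface IFeedConnector
{
    // Name identifying the source this connector ingests.
    string SourceName { get; }
    Task FetchAsync(IServiceProvider sp, CancellationToken ct);  // download raw documents
    Task ParseAsync(IServiceProvider sp, CancellationToken ct);  // validate and write DTOs
    Task MapAsync(IServiceProvider sp, CancellationToken ct);    // map DTOs to canonical advisories
}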
```
* Fetch populates `document` rows respecting rate limits, conditional GET, and `source_state.cursor`.
* Parse validates schema (JSON Schema, XSD) and writes sanitized DTO payloads.
* Map produces canonical advisory rows + provenance entries; must be idempotent.
* Base helpers in `StellaOps.Feedser.Source.Common` provide HTTP clients, retry policies, and watermark utilities.
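
As a rough illustration of how the three stages might be driven, the sketch below runs Fetch, Parse, and Map in order for one connector. `ConnectorRunner`, `RecordingConnector`, and `EmptyServices` are hypothetical names introduced here; the interface is repeated so the block compiles on its own. Running the stages back to back is an assumption of this sketch, since the scheduler may equally run them as separate jobs.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public interface IFeedConnector
{
    string SourceName { get; }
    Task FetchAsync(IServiceProvider sp, CancellationToken ct);
    Task ParseAsync(IServiceProvider sp, CancellationToken ct);
    Task MapAsync(IServiceProvider sp, CancellationToken ct);
}

// Hypothetical connector that only records which stage ran.
public sealed class RecordingConnector : IFeedConnector
{
    public List<string> Stages { get; } = new();
    public string SourceName => "example";

    public Task FetchAsync(IServiceProvider sp, CancellationToken ct) => Record("fetch");
    public Task ParseAsync(IServiceProvider sp, CancellationToken ct) => Record("parse");
    public Task MapAsync(IServiceProvider sp, CancellationToken ct) => Record("map");

    private Task Record(string stage)
    {
        Stages.Add(stage);
        return Task.CompletedTask;
    }
}

// Hypothetical service provider that resolves nothing; enough for the sketch.
public sealed class EmptyServices : IServiceProvider
{
    public object? GetService(Type serviceType) => null;
}

// Hypothetical driver: runs the stages in order and honours cancellation between them.
public static class ConnectorRunner
{
    public static async Task RunAsync(IFeedConnector connector, IServiceProvider sp, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        await connector.FetchAsync(sp, ct);
        ct.ThrowIfCancellationRequested();
        await connector.ParseAsync(sp, ct);
        ct.ThrowIfCancellationRequested();
        await connector.MapAsync(sp, ct);
    }
}
```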
---
## 6) Merge & Normalization
* Canonical model stored in `StellaOps.Feedser.Models` with serialization contracts used by storage/export layers.
* `StellaOps.Feedser.Normalization` handles NEVRA/EVR/PURL range parsing, CVSS normalization, localization.
* `StellaOps.Feedser.Merge` builds alias graphs keyed by CVE first, then falls back to vendor/regional IDs.
* Precedence rules: PSIRT/OVAL data overrides generic ranges; KEV only toggles exploitation status; regional feeds enrich severity but don't override vendor truth.
* Determinism enforced via canonical JSON hashing logged in `merge_event`.
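
One way to get a stable hash for `beforeHash`/`afterHash` is to sort object keys before serializing and then hash the bytes. The sketch below assumes SHA-256 and ordinal key ordering; `CanonicalJson` is a hypothetical helper, not a confirmed Feedser type. It does not normalize number formatting (for example `1.0` versus `1`), which a full scheme such as RFC 8785 would also cover.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json.Nodes;

// Hypothetical helper: canonicalizes a JSON tree and hashes the result.
public static class CanonicalJson
{
    // Rebuilds the tree with object properties sorted by ordinal key.
    // Array order is preserved because it is semantically significant.
    public static JsonNode? Canonicalize(JsonNode? node) => node switch
    {
        null => null,
        JsonObject obj => new JsonObject(
            obj.OrderBy(p => p.Key, StringComparer.Ordinal)
               .Select(p => KeyValuePair.Create(p.Key, Canonicalize(p.Value)))),
        JsonArray arr => new JsonArray(arr.Select(item => Canonicalize(item)).ToArray()),
        _ => node.DeepClone(), // leaf values are copied so they can join the new tree
    };

    // Lowercase hex SHA-256 over the compact UTF-8 canonical form.
    public static string Sha256Hex(JsonNode? node)
    {
        var canonical = Canonicalize(node)?.ToJsonString() ?? "null";
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```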
---
## 7) Exporters
* JSON exporter mirrors `aquasecurity/vuln-list` layout with deterministic ordering and reproducible timestamps.
* Trivy DB exporter shells out to `trivy-db build`, produces Bolt archives, and reuses unchanged blobs from the last full baseline when running in delta mode. The exporter annotates `metadata.json` with `mode`, `baseExportId`, `baseManifestDigest`, `resetBaseline`, and `delta.changedFiles[]`/`delta.removedPaths[]`, and honours `publishFull` / `publishDelta` (ORAS) plus `includeFull` / `includeDelta` (offline bundle) toggles.
* `StellaOps.Feedser.Storage.Mongo` provides cursors for delta exports based on `export_state.exportCursor` and the persisted per-file manifest (`export_state.files`).
* Export jobs produce OCI tarballs (layer media type `application/vnd.aquasec.trivy.db.layer.v1.tar+gzip`) and optionally push via ORAS; `metadata.json` accompanies each layout so mirrors can decide between full refreshes and deltas.
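
The `delta.changedFiles[]` and `delta.removedPaths[]` lists can be derived by comparing the baseline per-file manifest with the current one. The sketch below assumes each manifest is a simple path-to-digest map; `ManifestDelta` and `ExportManifest.Diff` are hypothetical names, and the real manifest shape may carry more fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical result type mirroring the two delta lists in metadata.json.
public sealed record ManifestDelta(IReadOnlyList<string> ChangedFiles, IReadOnlyList<string> RemovedPaths);

public static class ExportManifest
{
    // Compares two path -> digest maps. A file counts as changed when it is
    // new or its digest differs; a path counts as removed when it no longer exists.
    public static ManifestDelta Diff(
        IReadOnlyDictionary<string, string> baseline,
        IReadOnlyDictionary<string, string> current)
    {
        var changed = current
            .Where(kv => !baseline.TryGetValue(kv.Key, out var oldDigest) || oldDigest != kv.Value)
            .Select(kv => kv.Key)
            .OrderBy(path => path, StringComparer.Ordinal) // deterministic output order
            .ToList();

        var removed = baseline.Keys
            .Where(path => !current.ContainsKey(path))
            .OrderBy(path => path, StringComparer.Ordinal)
            .ToList();

        return new ManifestDelta(changed, removed);
    }
}
```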
---
## 8) Observability
* Serilog structured logging with enrichment fields (`source`, `uri`, `stage`, `durationMs`).
* OpenTelemetry traces around fetch/parse/map/export; metrics for rate limit hits, schema failures, dedupe ratios, package size. Connector HTTP metrics are emitted via the shared `feedser.source.http.*` instruments tagged with `feedser.source=<connector>` so per-source dashboards slice on that label instead of bespoke metric names.
* Prometheus scraping endpoint served by WebService.
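
A shared instrument tagged per connector might look like the sketch below, which uses `System.Diagnostics.Metrics` (the API that OpenTelemetry for .NET collects from). The meter name and the `feedser.source.http.requests` instrument name are illustrative assumptions; only the `feedser.source.http.*` prefix and the `feedser.source` tag come from the text above.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical shared metrics helper for connector HTTP calls.
public static class SourceHttpMetrics
{
    // Meter and instrument names are assumptions made for this sketch.
    private static readonly Meter SourceMeter = new("StellaOps.Feedser.Source.Common");

    private static readonly Counter<long> Requests =
        SourceMeter.CreateCounter<long>("feedser.source.http.requests");

    // One instrument for all connectors; the tag carries the connector name
    // so dashboards can slice by source instead of by metric name.
    public static void RecordRequest(string connector)
    {
        Requests.Add(1, new KeyValuePair<string, object?>("feedser.source", connector));
    }
}
```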
---
## 9) Security Considerations
* Offline-first: connectors only reach allowlisted hosts.
* BDU LLM fallback gated by config flag; logs audit trail with confidence score.
* No secrets written to logs; secrets loaded via environment or mounted files.
* Signing handled outside Feedser pipeline.
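
The host allowlist could be enforced with a check like the one below before any outbound request is made. This is a minimal sketch: `HostAllowlist` is a hypothetical type, and the choices to require HTTPS and to match hosts exactly (no wildcard or subdomain matching) are assumptions rather than documented behaviour.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical allowlist guard for outbound connector requests.
public sealed class HostAllowlist
{
    private readonly HashSet<string> _hosts;

    public HostAllowlist(IEnumerable<string> hosts) =>
        _hosts = new HashSet<string>(hosts, StringComparer.OrdinalIgnoreCase);

    // Allows only absolute https URLs whose host matches an entry exactly.
    public bool IsAllowed(string url) =>
        Uri.TryCreate(url, UriKind.Absolute, out var uri)
        && uri.Scheme == Uri.UriSchemeHttps
        && _hosts.Contains(uri.Host);
}
```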
---
## 10) Deployment Notes
* Default storage is MongoDB; for air-gapped deployments, bundle the Mongo image plus a seeded data backup.
* Horizontal scaling is achieved by running multiple web service instances that share Mongo locks.
* Provide `feedser.yaml` template describing sources, rate limits, and export settings.