documentation cleanse, sprints work and planning. remaining non EF DAL migration to EF

2026-02-25 01:24:07 +02:00
parent b07d27772e
commit 4db038123b
9090 changed files with 4836 additions and 2909 deletions
--- a/docs-archived/modules/analytics/architecture.md
+++ b/docs-archived/modules/analytics/architecture.md
@@ -0,0 +1,298 @@
+# Analytics Module Architecture
+
+> **Implementation Note:** Analytics is a cross-cutting feature integrated into the **Platform WebService** (`src/Platform/`). There is no standalone `src/Analytics/` module. Data ingestion pipelines span Scanner, Concelier, and Attestor modules. See [Platform Architecture](../platform/architecture-overview.md) for service-level integration details.
+
+## Design Philosophy
+
+The Analytics module implements a **star-schema data warehouse** pattern optimized for analytical queries rather than transactional workloads. Key design principles:
+
+1. **Separation of concerns**: Analytics schema is isolated from operational schemas (scanner, vex, proof_system)
+2. **Pre-computation**: Expensive aggregations are computed in advance via materialized views
+3. **Audit trail**: Raw payloads are preserved for reprocessing and compliance
+4. **Determinism**: Normalization functions are immutable and reproducible; array aggregates are ordered for stable outputs
+5. **Incremental updates**: Supports both full refresh and incremental ingestion
+
+## Data Flow
+
+```
+┌─────────────┐     ┌─────────────┐     ┌─────────────┐
+│   Scanner   │     │  Concelier  │     │   Attestor  │
+│   (SBOM)    │     │   (Vuln)    │     │   (DSSE)    │
+└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
+       │                   │                   │
+       │ SBOM Ingested     │ Vuln Updated      │ Attestation Created
+       ▼                   ▼                   ▼
+┌──────────────────────────────────────────────────────┐
+│               AnalyticsIngestionService              │
+│  - Normalize components (PURL, supplier, license)    │
+│  - Upsert to unified registry                        │
+│  - Correlate with vulnerabilities                    │
+│  - Store raw payloads                                │
+└──────────────────────────────────────────────────────┘
+       │
+       ▼
+┌──────────────────────────────────────────────────────┐
+│                 analytics schema                     │
+│  ┌──────────┐ ┌─────────┐ ┌─────────┐ ┌────────────┐ │
+│  │components│ │artifacts│ │comp_vuln│ │attestations│ │
+│  └──────────┘ └─────────┘ └─────────┘ └────────────┘ │
+└──────────────────────────────────────────────────────┘
+       │
+       │ Daily refresh
+       ▼
+┌──────────────────────────────────────────────────────┐
+│              Materialized Views                      │
+│  mv_supplier_concentration | mv_license_distribution │
+│  mv_vuln_exposure          | mv_attestation_coverage │
+└──────────────────────────────────────────────────────┘
+       │
+       ▼
+┌──────────────────────────────────────────────────────┐
+│              Platform API Endpoints                  │
+│              (with 5-minute caching)                 │
+└──────────────────────────────────────────────────────┘
+```
+
+## Normalization Rules
+
+### PURL Parsing
+
+Package URLs (PURLs) are the canonical identifier for components. The `parse_purl()` function extracts:
+
+| Field | Example | Notes |
+|-------|---------|-------|
+| `purl_type` | `maven`, `npm`, `pypi` | Ecosystem identifier |
+| `purl_namespace` | `org.apache.logging` | Group/org/scope (optional) |
+| `purl_name` | `log4j-core` | Package name |
+| `purl_version` | `2.17.1` | Version string |
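
As a rough illustration of the field split above, the following Python sketch breaks a PURL string into the four fields. It is not the actual `parse_purl()` implementation: the `ParsedPurl` type, the error handling, and the choice to drop qualifiers and subpaths are assumptions made for the example.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class ParsedPurl(NamedTuple):
    """Hypothetical container mirroring the four documented fields."""
    purl_type: str
    purl_namespace: Optional[str]
    purl_name: str
    purl_version: Optional[str]


def parse_purl(purl: str) -> ParsedPurl:
    """Split a Package URL into type, namespace, name, and version."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a PURL: {purl!r}")
    rest = purl[len("pkg:"):]
    # Qualifiers (?key=value) and subpath (#path) are ignored in this sketch.
    rest = rest.split("#", 1)[0].split("?", 1)[0]
    purl_type, _, path = rest.partition("/")
    if not purl_type or not path:
        raise ValueError(f"PURL needs a type and a name: {purl!r}")
    # The version follows the last "@"; an encoded scope such as %40angular
    # contains no literal "@" and is therefore left alone.
    path, sep, version = path.rpartition("@")
    if not sep:
        path, version = version, ""
    # Everything before the last "/" is the (optional) namespace.
    namespace, _, name = path.rpartition("/")
    return ParsedPurl(
        purl_type=purl_type.lower(),
        purl_namespace=unquote(namespace) or None,
        purl_name=unquote(name),
        purl_version=unquote(version) or None,
    )
```

For example, `parse_purl("pkg:maven/org.apache.logging/log4j-core@2.17.1")` yields the values shown in the table, while a PURL without a namespace or version returns `None` for those fields.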
+
+### Supplier Normalization
+
+The `normalize_supplier()` function standardizes supplier names for consistent grouping:
+
+1. Convert to lowercase
+2. Trim whitespace
+3. Remove legal suffixes, together with any separating comma: Inc., LLC, Ltd., Corp., GmbH, B.V., S.A., PLC, Co.
+4. Normalize internal whitespace
+
+**Examples:**
+- `"Apache Software Foundation, Inc."` → `"apache software foundation"`
+- `"Google LLC"` → `"google"`
+- `"  Microsoft Corp.  "` → `"microsoft"`
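
The steps above can be sketched in Python as follows. This is an illustrative approximation, not the real `normalize_supplier()` function: the regular expression, the repeated stripping of stacked suffixes, and the decision to collapse whitespace before removing suffixes are assumptions.

```python
import re

# Legal suffixes from the list above, lowercased and without the trailing dot.
_SUFFIXES = ("inc", "llc", "ltd", "corp", "gmbh", "b.v", "s.a", "plc", "co")

# A suffix only counts when it is the last token and is preceded by
# whitespace or a comma, so names like "costco" are left untouched.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(re.escape(s) for s in _SUFFIXES) + r")\.?$"
)


def normalize_supplier(name: str) -> str:
    """Lowercase, trim, strip trailing legal suffixes, collapse whitespace."""
    normalized = re.sub(r"\s+", " ", name.lower().strip())
    # Strip repeatedly so stacked suffixes ("Foo Co., Ltd.") are all removed.
    while True:
        stripped = _SUFFIX_RE.sub("", normalized)
        if stripped == normalized:
            break
        normalized = stripped
    return normalized.rstrip(" ,")
```

Applied to the three examples above, the sketch returns `"apache software foundation"`, `"google"`, and `"microsoft"` respectively.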
+
+### License Categorization
+
+The `categorize_license()` function maps SPDX expressions to risk categories:
+
+| Category | Examples | Risk Level |
+|----------|----------|------------|
+| `permissive` | MIT, Apache-2.0, BSD-3-Clause, ISC | Low |
+| `copyleft-weak` | LGPL-2.1, MPL-2.0, EPL-2.0 | Medium |
+| `copyleft-strong` | GPL-3.0, AGPL-3.0, SSPL | High |
+| `proprietary` | Proprietary, Commercial | Review Required |
+| `unknown` | Unrecognized expressions | Review Required |
+
+**Special handling:**
+- GPL with exceptions (e.g., `GPL-2.0 WITH Classpath-exception-2.0`) → `copyleft-weak`
+- Dual-licensed (e.g., `MIT OR Apache-2.0`) → uses first match
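
A minimal Python sketch of the mapping and the two special cases follows. It covers only the example identifiers from the table, and it reads "first match" as "the first operand of an `OR` expression that maps to a known category"; both points are assumptions rather than the documented behavior of `categorize_license()`. `AND` expressions are not handled and fall through to `unknown`.

```python
import re

# Only the example identifiers from the table; the real mapping is larger.
_CATEGORY_BY_LICENSE = {
    "MIT": "permissive",
    "Apache-2.0": "permissive",
    "BSD-3-Clause": "permissive",
    "ISC": "permissive",
    "LGPL-2.1": "copyleft-weak",
    "MPL-2.0": "copyleft-weak",
    "EPL-2.0": "copyleft-weak",
    "GPL-3.0": "copyleft-strong",
    "AGPL-3.0": "copyleft-strong",
    "SSPL": "copyleft-strong",
    "Proprietary": "proprietary",
    "Commercial": "proprietary",
}


def categorize_license(expression: str) -> str:
    """Map an SPDX-style expression to a risk category."""
    for operand in re.split(r"\s+OR\s+", expression.strip()):
        license_id, _, exception = operand.strip("() ").partition(" WITH ")
        license_id = license_id.strip()
        # GPL with an exception clause is treated as weak copyleft.
        if exception and license_id.startswith("GPL-"):
            return "copyleft-weak"
        category = _CATEGORY_BY_LICENSE.get(license_id)
        if category is not None:
            return category  # first operand that maps to a category wins
    return "unknown"
```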
+
+## Component Deduplication
+
+Components are deduplicated by `(purl, hash_sha256)`:
+
+1. Same PURL and same hash: the existing record is updated (`last_seen_at`, counts)
+2. Same PURL but different hash: a new record is created (version change)
+3. Same hash but different PURL: a new record is created (aliased package)
+
+**Upsert pattern:**
+```sql
+INSERT INTO analytics.components (...)
+VALUES (...)
+ON CONFLICT (purl, hash_sha256) DO UPDATE SET
+  last_seen_at = now(),
+  sbom_count = components.sbom_count + 1,
+  updated_at = now();
+```
+
+## Vulnerability Correlation
+
+When a component is upserted, the `VulnerabilityCorrelationService` queries Concelier for matching advisories:
+
+1. Query by PURL type + namespace + name
+2. Filter by version range matching
+3. Upsert to `component_vulns` with severity, EPSS, KEV flags
+
+**Version range matching** currently supports semver ranges and exact matches via
+`VersionRuleEvaluator`. Non-semver schemes fall back to exact string matches; wildcard
+and ecosystem-specific ranges require upstream normalization.
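
The fallback behavior described above can be illustrated with a small Python sketch. The internals of `VersionRuleEvaluator` are not shown in this document, so the rule shape used here (an inclusive `introduced` bound, an exclusive `fixed` bound, or an `exact` version) is hypothetical, and pre-release or build metadata is deliberately not handled.

```python
import re
from typing import Optional, Tuple

_SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def _parse_semver(version: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) or None when the string is not plain semver."""
    match = _SEMVER.match(version)
    return tuple(int(part) for part in match.groups()) if match else None


def is_affected(
    version: str,
    introduced: Optional[str] = None,  # hypothetical inclusive lower bound
    fixed: Optional[str] = None,       # hypothetical exclusive upper bound
    exact: Optional[str] = None,       # exact-match rule
) -> bool:
    if exact is not None:
        # Exact rules compare raw strings, so they also work for non-semver schemes.
        return version == exact
    parsed = _parse_semver(version)
    if parsed is None:
        return False  # non-semver versions can only match exact rules
    lower = _parse_semver(introduced) if introduced else None
    upper = _parse_semver(fixed) if fixed else None
    if (introduced and lower is None) or (fixed and upper is None):
        return False  # a bound that is not semver needs upstream normalization
    if lower is not None and parsed < lower:
        return False
    if upper is not None and parsed >= upper:
        return False
    return True
```

Comparing parsed integer tuples rather than strings matters here: `"2.10.0"` sorts before `"2.9.0"` as text but after it as a version.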
+
+## VEX Override Logic
+
+The `mv_vuln_exposure` view implements VEX-adjusted counts:
+
+```sql
+-- Effective count excludes artifacts with active VEX overrides
+COUNT(DISTINCT ac.artifact_id) FILTER (
+  WHERE NOT EXISTS (
+    SELECT 1 FROM analytics.vex_overrides vo
+    WHERE vo.artifact_id = ac.artifact_id
+      AND vo.vuln_id = cv.vuln_id
+      AND vo.status = 'not_affected'
+      AND (vo.valid_until IS NULL OR vo.valid_until > now())
+  )
+) AS effective_artifact_count
+```
+
+**Override validity:**
+- `valid_from`: When the override became effective
+- `valid_until`: Expiration (NULL = no expiration)
+- Only `status = 'not_affected'` reduces exposure counts, and only when the override is active in its validity window.
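
The same rule can be expressed outside SQL as a short Python sketch. The `VexOverride` type and function names are invented for illustration. Note one assumption: the SQL excerpt above checks only `valid_until`, whereas this sketch also checks `valid_from`, following the "active in its validity window" wording.

```python
from datetime import datetime, timezone
from typing import Iterable, NamedTuple, Optional


class VexOverride(NamedTuple):
    """Hypothetical in-memory stand-in for a vex_overrides row."""
    artifact_id: str
    vuln_id: str
    status: str
    valid_from: Optional[datetime]
    valid_until: Optional[datetime]


def is_active(override: VexOverride, now: datetime) -> bool:
    """True when the override is 'not_affected' and inside its validity window."""
    if override.status != "not_affected":
        return False
    if override.valid_from is not None and override.valid_from > now:
        return False  # not yet effective (assumed check, see lead-in)
    return override.valid_until is None or override.valid_until > now


def effective_artifact_count(
    artifact_ids: Iterable[str],
    vuln_id: str,
    overrides: Iterable[VexOverride],
    now: datetime,
) -> int:
    """Count distinct artifacts that have no active override for vuln_id."""
    suppressed = {
        o.artifact_id for o in overrides if o.vuln_id == vuln_id and is_active(o, now)
    }
    return len(set(artifact_ids) - suppressed)
```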
+
+## Attestation Ingestion
+
+Attestation ingestion consumes Attestor Rekor entry events and expects Sigstore bundles
+or raw DSSE envelopes. The ingestion service:
+- Resolves bundle URIs using `BundleUriTemplate`; `bundle:{digest}` maps to
+  `cas://<DefaultBucket>/{digest}` by default.
+- Decodes DSSE payloads, computes `dsse_payload_hash`, and records `predicate_uri` plus
+  Rekor log metadata (`rekor_log_id`, `rekor_log_index`).
+- Uses in-toto `subject` digests to link artifacts when reanalysis hints are absent.
+- Maps predicate URIs into `analytics_attestation_type` values
+  (`provenance`, `sbom`, `vex`, `build`, `scan`, `policy`).
+- Expands VEX statements into `vex_overrides` rows, one per product reference, and
+  captures optional validity timestamps when provided.
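
Two of the steps above, default bundle URI resolution and payload hashing, can be sketched in Python. This is an illustration under stated assumptions: only the default `bundle:{digest}` mapping is modeled (a custom `BundleUriTemplate` is not), the envelope is assumed to carry a standard base64 `payload` field as in the DSSE envelope format, and the use of SHA-256 with a `sha256:` prefix for `dsse_payload_hash` is a guess, not something this document states.

```python
import base64
import hashlib
import json


def resolve_bundle_uri(reference: str, default_bucket: str) -> str:
    """Map bundle:{digest} to cas://<DefaultBucket>/{digest}; pass other URIs through."""
    scheme, sep, digest = reference.partition(":")
    if scheme == "bundle" and sep and digest:
        return f"cas://{default_bucket}/{digest}"
    return reference


def dsse_payload_hash(envelope_json: str) -> str:
    """Decode the DSSE payload and hash the decoded bytes (algorithm assumed)."""
    envelope = json.loads(envelope_json)
    payload = base64.b64decode(envelope["payload"])
    return "sha256:" + hashlib.sha256(payload).hexdigest()
```

Hashing the decoded payload bytes rather than the base64 text keeps the hash independent of how the envelope happens to be serialized.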
+
+## Time-Series Rollups
+
+Daily rollups are computed by `compute_daily_rollups()`:
+
+**Vulnerability counts** (per environment/team/severity):
+- `total_vulns`: All affecting vulnerabilities
+- `fixable_vulns`: Vulns with `fix_available = TRUE`
+- `vex_mitigated`: Vulns with active `not_affected` override
+- `kev_vulns`: Vulns in CISA KEV
+- `unique_cves`: Distinct CVE IDs
+- `affected_artifacts`: Artifacts containing affected components
+- `affected_components`: Components with affecting vulns
+
+**Component counts** (per environment/team/license/type):
+- `total_components`: Distinct components
+- `unique_suppliers`: Distinct normalized suppliers
+
+**Retention policy:** 90 days in hot storage; `compute_daily_rollups()` prunes older rows and downstream jobs archive to cold storage.
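
To make the vulnerability metrics concrete, here is a small Python sketch that derives them for a single environment/team/severity bucket and applies the 90-day retention check. The `VulnRow` shape is hypothetical, and it assumes each input row is one (artifact, component, CVE) finding; the real `compute_daily_rollups()` runs in the database and may count differently.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, NamedTuple


class VulnRow(NamedTuple):
    """Hypothetical finding: one CVE affecting one component in one artifact."""
    cve_id: str
    artifact_id: str
    component_id: str
    fix_available: bool
    in_kev: bool
    vex_mitigated: bool


def rollup_vulns(rows: Iterable[VulnRow]) -> Dict[str, int]:
    """Compute the documented counters for one bucket of findings."""
    rows = list(rows)
    return {
        "total_vulns": len(rows),
        "fixable_vulns": sum(r.fix_available for r in rows),
        "vex_mitigated": sum(r.vex_mitigated for r in rows),
        "kev_vulns": sum(r.in_kev for r in rows),
        "unique_cves": len({r.cve_id for r in rows}),
        "affected_artifacts": len({r.artifact_id for r in rows}),
        "affected_components": len({r.component_id for r in rows}),
    }


def is_within_retention(snapshot: date, today: date, retention_days: int = 90) -> bool:
    """True while a daily snapshot is still inside the hot-storage window."""
    return snapshot >= today - timedelta(days=retention_days)
```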
+
+## Materialized View Refresh
+
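+All materialized views support `REFRESH MATERIALIZED VIEW CONCURRENTLY` for zero-downtime updates. The `analytics.refresh_all_views()` helper below instead performs a non-concurrent refresh of every view and should run off-peak: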
+
+```sql
+-- Refresh all views (non-concurrent; run off-peak)
+SELECT analytics.refresh_all_views();
+```
+
+**Refresh schedule (recommended):**
+- `mv_supplier_concentration`: 02:00 UTC daily
+- `mv_license_distribution`: 02:15 UTC daily
+- `mv_vuln_exposure`: 02:30 UTC daily
+- `mv_attestation_coverage`: 02:45 UTC daily
+- `compute_daily_rollups()`: 03:00 UTC daily
+
+Platform WebService can run the daily rollup + refresh loop via
+`PlatformAnalyticsMaintenanceService`. Configure the schedule with:
+- `Platform:AnalyticsMaintenance:Enabled` (default `true`)
+- `Platform:AnalyticsMaintenance:IntervalMinutes` (default `1440`)
+- `Platform:AnalyticsMaintenance:RunOnStartup` (default `true`)
+- `Platform:AnalyticsMaintenance:ComputeDailyRollups` (default `true`)
+- `Platform:AnalyticsMaintenance:RefreshMaterializedViews` (default `true`)
+- `Platform:AnalyticsMaintenance:BackfillDays` (default `0`, which disables backfill; a positive value N recomputes the most recent N days on the first maintenance run)
+
+The hosted service issues concurrent refresh statements directly for each view.
+Use a DB scheduler (pg_cron) or external orchestrator if you need the staggered
+per-view timing above.
+
+## Performance Considerations
+
+### Indexing Strategy
+
+| Table | Key Indexes | Query Pattern |
+|-------|-------------|---------------|
+| `components` | `purl`, `supplier_normalized`, `license_category` | Lookup, aggregation |
+| `artifacts` | `digest`, `environment`, `team` | Lookup, filtering |
+| `component_vulns` | `vuln_id`, `severity`, `fix_available` | Join, filtering |
+| `attestations` | `artifact_id`, `predicate_type` | Join, aggregation |
+| `vex_overrides` | `(artifact_id, vuln_id)`, `status` | `EXISTS` subquery |
+
+### Query Performance Targets
+
+| Query | Target | Notes |
+|-------|--------|-------|
+| `sp_top_suppliers(20, 'prod')` | < 100ms | Uses materialized view when env is null; env filter reads base tables |
+| `sp_license_heatmap('prod')` | < 100ms | Uses materialized view when env is null; env filter reads base tables |
+| `sp_vuln_exposure()` | < 200ms | Uses materialized view for global queries; environment filters read base tables |
+| `sp_fixable_backlog()` | < 500ms | Live query with indexes |
+| `sp_attestation_gaps()` | < 100ms | Uses materialized view |
+
+### Caching Strategy
+
+Platform API endpoints use a 5-minute TTL cache:
+- Cache key: endpoint + query parameters
+- Invalidation: Time-based only (no event-driven invalidation)
+- Storage: Valkey (in-memory)
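
The caching behavior can be approximated with the Python sketch below. It uses an in-memory dictionary as a stand-in for Valkey, and the key format (endpoint plus query parameters sorted by name) is an assumption; the document only states that the key combines the endpoint and its query parameters.

```python
import time
from typing import Any, Callable, Dict, Mapping, Tuple
from urllib.parse import urlencode


def cache_key(endpoint: str, params: Mapping[str, str]) -> str:
    """Build a key that is stable regardless of query parameter order."""
    return f"{endpoint}?{urlencode(sorted(params.items()))}"


class TtlCache:
    """Time-based cache: entries expire after ttl_seconds, nothing else evicts them."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: serve the cached value
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

Because invalidation is time-based only, a dashboard may show data up to five minutes older than the latest materialized view refresh.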
+
+## Security Considerations
+
+### Schema Permissions
+
+```sql
+-- Read-only role for dashboards
+GRANT USAGE ON SCHEMA analytics TO dashboard_reader;
+GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO dashboard_reader;
+GRANT SELECT ON ALL SEQUENCES IN SCHEMA analytics TO dashboard_reader;
+
+-- Write role for ingestion service
+GRANT USAGE ON SCHEMA analytics TO analytics_writer;
+GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA analytics TO analytics_writer;
+GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA analytics TO analytics_writer;
+```
+
+### Data Classification
+
+| Table | Classification | Notes |
+|-------|----------------|-------|
+| `components` | Internal | Contains package names, versions |
+| `artifacts` | Internal | Contains image names, team names |
+| `component_vulns` | Internal | Vulnerability data (public CVEs) |
+| `vex_overrides` | Confidential | Contains justifications, operator IDs |
+| `raw_sboms` | Confidential | Full SBOM payloads |
+| `raw_attestations` | Confidential | Signed attestation envelopes |
+
+### Audit Trail
+
+All tables include `created_at` and `updated_at` timestamps. Raw payload tables (`raw_sboms`, `raw_attestations`) are append-only with content hashes for integrity verification.
+
+## Integration Points
+
+### Upstream Dependencies
+
+| Service | Event | Contract | Action |
+|---------|-------|----------|--------|
+| Scanner | SBOM report ready | `scanner.event.report.ready@1` (`docs/modules/signals/events/orchestrator-scanner-events.md`) | Normalize and upsert components |
+| Concelier | Advisory observation/linkset updated | `advisory.observation.updated@1` (`docs/modules/concelier/events/advisory.observation.updated@1.schema.json`), `advisory.linkset.updated@1` (`docs/modules/concelier/events/advisory.linkset.updated@1.md`) | Re-correlate affected components |
+| Excititor | VEX statement changes | `vex.statement.*` (`docs/modules/excititor/architecture.md`) | Create/update vex_overrides |
+| Attestor | Rekor entry logged | `rekor.entry.logged` (`docs/modules/attestor/architecture.md`) | Upsert attestation record |
+
+### Downstream Consumers
+
+| Consumer | Data | Endpoint |
+|----------|------|----------|
+| Console UI | Dashboard data | `/api/analytics/*` |
+| Export Center | Compliance reports | Direct DB query |
+| AdvisoryAI | Risk context | `/api/analytics/vulnerabilities` |
+
+## Future Enhancements
+
+1. **Partitioning**: Partition `daily_*` tables by date for faster queries and archival
+2. **Incremental refresh**: Implement incremental materialized view refresh for large datasets
+3. **Custom dimensions**: Support user-defined component groupings (business units, cost centers)
+4. **Predictive analytics**: Add ML-based risk prediction using historical trends
+5. **BI tool integration**: Direct connectors for Tableau, Looker, Metabase