# Analytics Module Architecture

## Design Philosophy

The Analytics module implements a **star-schema data warehouse** pattern optimized for analytical queries rather than transactional workloads. Key design principles:

1. **Separation of concerns**: Analytics schema is isolated from operational schemas (scanner, vex, proof_system)
2. **Pre-computation**: Expensive aggregations computed in advance via materialized views
3. **Audit trail**: Raw payloads preserved for reprocessing and compliance
4. **Determinism**: Normalization functions are immutable and reproducible; array aggregates are ordered for stable outputs
5. **Incremental updates**: Supports both full refresh and incremental ingestion

## Data Flow

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Scanner   │     │  Concelier  │     │  Attestor   │
│   (SBOM)    │     │   (Vuln)    │     │   (DSSE)    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       │ SBOM Ingested     │ Vuln Updated      │ Attestation Created
       ▼                   ▼                   ▼
┌──────────────────────────────────────────────────────┐
│              AnalyticsIngestionService               │
│  - Normalize components (PURL, supplier, license)    │
│  - Upsert to unified registry                        │
│  - Correlate with vulnerabilities                    │
│  - Store raw payloads                                │
└──────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                   analytics schema                   │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌────────────┐  │
│  │components│ │artifacts│ │comp_vuln│ │attestations│ │
│  └─────────┘ └─────────┘ └─────────┘ └────────────┘  │
└──────────────────────────────────────────────────────┘
                           │
                           │ Daily refresh
                           ▼
┌──────────────────────────────────────────────────────┐
│                  Materialized Views                  │
│  mv_supplier_concentration | mv_license_distribution │
│  mv_vuln_exposure          | mv_attestation_coverage │
└──────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                Platform API Endpoints                │
│               (with 5-minute caching)                │
└──────────────────────────────────────────────────────┘
```

## Normalization Rules

### PURL Parsing

Package URLs (PURLs) are the canonical identifier for components. The `parse_purl()` function extracts:

| Field | Example | Notes |
|-------|---------|-------|
| `purl_type` | `maven`, `npm`, `pypi` | Ecosystem identifier |
| `purl_namespace` | `org.apache.logging` | Group/org/scope (optional) |
| `purl_name` | `log4j-core` | Package name |
| `purl_version` | `2.17.1` | Version string |

### Supplier Normalization

The `normalize_supplier()` function standardizes supplier names for consistent grouping:

1. Convert to lowercase
2. Trim whitespace
3. Remove legal suffixes: Inc., LLC, Ltd., Corp., GmbH, B.V., S.A., PLC, Co.
4. Normalize internal whitespace

**Examples:**

- `"Apache Software Foundation, Inc."` → `"apache software foundation"`
- `"Google LLC"` → `"google"`
- `" Microsoft Corp. "` → `"microsoft"`

### License Categorization

The `categorize_license()` function maps SPDX expressions to risk categories:

| Category | Examples | Risk Level |
|----------|----------|------------|
| `permissive` | MIT, Apache-2.0, BSD-3-Clause, ISC | Low |
| `copyleft-weak` | LGPL-2.1, MPL-2.0, EPL-2.0 | Medium |
| `copyleft-strong` | GPL-3.0, AGPL-3.0, SSPL | High |
| `proprietary` | Proprietary, Commercial | Review Required |
| `unknown` | Unrecognized expressions | Review Required |

**Special handling:**

- GPL with exceptions (e.g., `GPL-2.0 WITH Classpath-exception-2.0`) → `copyleft-weak`
- Dual-licensed (e.g., `MIT OR Apache-2.0`) → uses first match

## Component Deduplication

Components are deduplicated by `(purl, hash_sha256)`:

1. If same PURL and hash: existing record updated (last_seen_at, counts)
2. If same PURL but different hash: new record created (version change)
3. If same hash but different PURL: new record (aliased package)

**Upsert pattern:**

```sql
INSERT INTO analytics.components (...)
VALUES (...)
ON CONFLICT (purl, hash_sha256) DO UPDATE SET
    last_seen_at = now(),
    sbom_count = components.sbom_count + 1,
    updated_at = now();
```

## Vulnerability Correlation

When a component is upserted, the `VulnerabilityCorrelationService` queries Concelier for matching advisories:

1. Query by PURL type + namespace + name
2. Filter by version range matching
3. Upsert to `component_vulns` with severity, EPSS, KEV flags

**Version range matching** currently supports semver ranges and exact matches via `VersionRuleEvaluator`. Non-semver schemes fall back to exact string matches; wildcard and ecosystem-specific ranges require upstream normalization.

## VEX Override Logic

The `mv_vuln_exposure` view implements VEX-adjusted counts:

```sql
-- Effective count excludes artifacts with active VEX overrides
COUNT(DISTINCT ac.artifact_id) FILTER (
    WHERE NOT EXISTS (
        SELECT 1 FROM analytics.vex_overrides vo
        WHERE vo.artifact_id = ac.artifact_id
          AND vo.vuln_id = cv.vuln_id
          AND vo.status = 'not_affected'
          AND (vo.valid_until IS NULL OR vo.valid_until > now())
    )
) AS effective_artifact_count
```

**Override validity:**

- `valid_from`: When the override became effective
- `valid_until`: Expiration (NULL = no expiration)
- Only `status = 'not_affected'` reduces exposure counts, and only when the override is active in its validity window.

## Attestation Ingestion

Attestation ingestion consumes Attestor Rekor entry events and expects Sigstore bundles or raw DSSE envelopes. The ingestion service:

- Resolves bundle URIs using `BundleUriTemplate`; `bundle:{digest}` maps to `cas:///{digest}` by default.
- Decodes DSSE payloads, computes `dsse_payload_hash`, and records `predicate_uri` plus Rekor log metadata (`rekor_log_id`, `rekor_log_index`).
- Uses in-toto `subject` digests to link artifacts when reanalysis hints are absent.
- Maps predicate URIs into `analytics_attestation_type` values (`provenance`, `sbom`, `vex`, `build`, `scan`, `policy`).
- Expands VEX statements into `vex_overrides` rows, one per product reference, and captures optional validity timestamps when provided.

## Time-Series Rollups

Daily rollups computed by `compute_daily_rollups()`:

**Vulnerability counts** (per environment/team/severity):

- `total_vulns`: All affecting vulnerabilities
- `fixable_vulns`: Vulns with `fix_available = TRUE`
- `vex_mitigated`: Vulns with active `not_affected` override
- `kev_vulns`: Vulns in CISA KEV
- `unique_cves`: Distinct CVE IDs
- `affected_artifacts`: Artifacts containing affected components
- `affected_components`: Components with affecting vulns

**Component counts** (per environment/team/license/type):

- `total_components`: Distinct components
- `unique_suppliers`: Distinct normalized suppliers

**Retention policy:** 90 days in hot storage; `compute_daily_rollups()` prunes older rows and downstream jobs archive to cold storage.

## Materialized View Refresh

All materialized views support `REFRESH ... CONCURRENTLY` for zero-downtime updates:

```sql
-- Refresh all views (non-concurrent; run off-peak)
SELECT analytics.refresh_all_views();
```

**Refresh schedule (recommended):**

- `mv_supplier_concentration`: 02:00 UTC daily
- `mv_license_distribution`: 02:15 UTC daily
- `mv_vuln_exposure`: 02:30 UTC daily
- `mv_attestation_coverage`: 02:45 UTC daily
- `compute_daily_rollups()`: 03:00 UTC daily

Platform WebService can run the daily rollup + refresh loop via `PlatformAnalyticsMaintenanceService`.
Configure the schedule with:

- `Platform:AnalyticsMaintenance:Enabled` (default `true`)
- `Platform:AnalyticsMaintenance:IntervalMinutes` (default `1440`)
- `Platform:AnalyticsMaintenance:RunOnStartup` (default `true`)
- `Platform:AnalyticsMaintenance:ComputeDailyRollups` (default `true`)
- `Platform:AnalyticsMaintenance:RefreshMaterializedViews` (default `true`)
- `Platform:AnalyticsMaintenance:BackfillDays` (default `0`, which disables backfill; a value of N recomputes the most recent N days on the first maintenance run)

The hosted service issues concurrent refresh statements directly for each view. Use a DB scheduler (pg_cron) or external orchestrator if you need the staggered per-view timing above.

## Performance Considerations

### Indexing Strategy

| Table | Key Indexes | Query Pattern |
|-------|-------------|---------------|
| `components` | `purl`, `supplier_normalized`, `license_category` | Lookup, aggregation |
| `artifacts` | `digest`, `environment`, `team` | Lookup, filtering |
| `component_vulns` | `vuln_id`, `severity`, `fix_available` | Join, filtering |
| `attestations` | `artifact_id`, `predicate_type` | Join, aggregation |
| `vex_overrides` | `(artifact_id, vuln_id)`, `status` | `EXISTS` subquery |

### Query Performance Targets

| Query | Target | Notes |
|-------|--------|-------|
| `sp_top_suppliers(20, 'prod')` | < 100ms | Uses materialized view when env is null; env filter reads base tables |
| `sp_license_heatmap('prod')` | < 100ms | Uses materialized view when env is null; env filter reads base tables |
| `sp_vuln_exposure()` | < 200ms | Uses materialized view for global queries; environment filters read base tables |
| `sp_fixable_backlog()` | < 500ms | Live query with indexes |
| `sp_attestation_gaps()` | < 100ms | Uses materialized view |

### Caching Strategy

Platform API endpoints use a 5-minute TTL cache:

- Cache key: endpoint + query parameters
- Invalidation: Time-based only (no event-driven invalidation)
- Storage: Valkey (in-memory)

## Security Considerations

### Schema Permissions

```sql
-- Read-only role for dashboards
GRANT USAGE ON SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA analytics TO dashboard_reader;

-- Write role for ingestion service
GRANT USAGE ON SCHEMA analytics TO analytics_writer;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA analytics TO analytics_writer;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA analytics TO analytics_writer;
```

### Data Classification

| Table | Classification | Notes |
|-------|----------------|-------|
| `components` | Internal | Contains package names, versions |
| `artifacts` | Internal | Contains image names, team names |
| `component_vulns` | Internal | Vulnerability data (public CVEs) |
| `vex_overrides` | Confidential | Contains justifications, operator IDs |
| `raw_sboms` | Confidential | Full SBOM payloads |
| `raw_attestations` | Confidential | Signed attestation envelopes |

### Audit Trail

All tables include `created_at` and `updated_at` timestamps. Raw payload tables (`raw_sboms`, `raw_attestations`) are append-only with content hashes for integrity verification.
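
The append-only, content-hashed storage described for the raw payload tables can be illustrated with a minimal in-memory sketch. It assumes SHA-256 as the content hash (this document does not name the algorithm used for raw payloads), and `AppendOnlyPayloadStore` is a hypothetical stand-in for the database tables, not part of the module.

```python
import hashlib
import hmac
from dataclasses import dataclass, field
from typing import Dict


def content_hash(payload: bytes) -> str:
    """Hex digest of a raw payload. SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(payload).hexdigest()


@dataclass
class AppendOnlyPayloadStore:
    """Hypothetical in-memory stand-in for raw_sboms / raw_attestations."""

    _rows: Dict[str, bytes] = field(default_factory=dict)

    def append(self, payload: bytes) -> str:
        digest = content_hash(payload)
        # setdefault never overwrites: an existing row stays untouched,
        # which mimics append-only behaviour keyed by content hash.
        self._rows.setdefault(digest, payload)
        return digest

    def verify(self, digest: str) -> bool:
        """Recompute the hash of the stored payload and compare it."""
        payload = self._rows.get(digest)
        if payload is None:
            return False
        return hmac.compare_digest(content_hash(payload), digest)


store = AppendOnlyPayloadStore()
sbom_digest = store.append(b'{"bomFormat": "CycloneDX"}')
print(store.verify(sbom_digest))
```

Keying rows by their own hash means re-ingesting an identical payload is a no-op, and any later modification of the stored bytes makes `verify` return `False`.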
## Integration Points

### Upstream Dependencies

| Service | Event | Contract | Action |
|---------|-------|----------|--------|
| Scanner | SBOM report ready | `scanner.event.report.ready@1` (`docs/modules/signals/events/orchestrator-scanner-events.md`) | Normalize and upsert components |
| Concelier | Advisory observation/linkset updated | `advisory.observation.updated@1` (`docs/modules/concelier/events/advisory.observation.updated@1.schema.json`), `advisory.linkset.updated@1` (`docs/modules/concelier/events/advisory.linkset.updated@1.md`) | Re-correlate affected components |
| Excititor | VEX statement changes | `vex.statement.*` (`docs/modules/excititor/architecture.md`) | Create/update vex_overrides |
| Attestor | Rekor entry logged | `rekor.entry.logged` (`docs/modules/attestor/architecture.md`) | Upsert attestation record |

### Downstream Consumers

| Consumer | Data | Endpoint |
|----------|------|----------|
| Console UI | Dashboard data | `/api/analytics/*` |
| Export Center | Compliance reports | Direct DB query |
| AdvisoryAI | Risk context | `/api/analytics/vulnerabilities` |

## Future Enhancements

1. **Partitioning**: Partition `daily_*` tables by date for faster queries and archival
2. **Incremental refresh**: Implement incremental materialized view refresh for large datasets
3. **Custom dimensions**: Support user-defined component groupings (business units, cost centers)
4. **Predictive analytics**: Add ML-based risk prediction using historical trends
5. **BI tool integration**: Direct connectors for Tableau, Looker, Metabase
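
For reference, the upstream event contracts listed under Integration Points can be mapped to their actions with a small dispatch table. This is only a sketch: the action names (`normalize_and_upsert_components` and so on) are hypothetical labels for the table's "Action" column, the wildcard contract `vex.statement.*` is treated as a glob pattern, and `vex.statement.created` in the usage line is an invented event name used purely for illustration.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# (contract pattern, hypothetical action label), taken from the
# Upstream Dependencies table. Order matters only if patterns overlap.
ROUTES: List[Tuple[str, str]] = [
    ("scanner.event.report.ready@1", "normalize_and_upsert_components"),
    ("advisory.observation.updated@1", "recorrelate_affected_components"),
    ("advisory.linkset.updated@1", "recorrelate_affected_components"),
    ("vex.statement.*", "upsert_vex_overrides"),
    ("rekor.entry.logged", "upsert_attestation_record"),
]


def route(event_type: str) -> Optional[str]:
    """Return the action label for an event type, or None if unrecognized."""
    for pattern, action in ROUTES:
        # fnmatchcase is case-sensitive and lets '*' match any suffix.
        if fnmatchcase(event_type, pattern):
            return action
    return None


print(route("vex.statement.created"))
```

Returning `None` for unrecognized events leaves the decision to drop, log, or dead-letter them to the caller.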