# Analytics Module Architecture

## Design Philosophy

The Analytics module implements a **star-schema data warehouse** pattern optimized for analytical queries rather than transactional workloads. Key design principles:

1. **Separation of concerns**: Analytics schema is isolated from operational schemas (scanner, vex, proof_system)
2. **Pre-computation**: Expensive aggregations computed in advance via materialized views
3. **Audit trail**: Raw payloads preserved for reprocessing and compliance
4. **Determinism**: Normalization functions are immutable and reproducible; array aggregates are ordered for stable outputs
5. **Incremental updates**: Supports both full refresh and incremental ingestion

## Data Flow

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Scanner   │     │  Concelier  │     │  Attestor   │
│   (SBOM)    │     │   (Vuln)    │     │   (DSSE)    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       │ SBOM Ingested     │ Vuln Updated      │ Attestation Created
       ▼                   ▼                   ▼
┌──────────────────────────────────────────────────────┐
│              AnalyticsIngestionService               │
│  - Normalize components (PURL, supplier, license)    │
│  - Upsert to unified registry                        │
│  - Correlate with vulnerabilities                    │
│  - Store raw payloads                                │
└──────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                   analytics schema                   │
│  ┌──────────┐ ┌─────────┐ ┌─────────┐ ┌────────────┐ │
│  │components│ │artifacts│ │comp_vuln│ │attestations│ │
│  └──────────┘ └─────────┘ └─────────┘ └────────────┘ │
└──────────────────────────────────────────────────────┘
                           │
                           │ Daily refresh
                           ▼
┌──────────────────────────────────────────────────────┐
│                  Materialized Views                  │
│  mv_supplier_concentration | mv_license_distribution │
│  mv_vuln_exposure          | mv_attestation_coverage │
└──────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                Platform API Endpoints                │
│               (with 5-minute caching)                │
└──────────────────────────────────────────────────────┘
```

## Normalization Rules

### PURL Parsing

Package URLs (PURLs) are the canonical identifier for components. The `parse_purl()` function extracts:

| Field | Example | Notes |
|-------|---------|-------|
| `purl_type` | `maven`, `npm`, `pypi` | Ecosystem identifier |
| `purl_namespace` | `org.apache.logging` | Group/org/scope (optional) |
| `purl_name` | `log4j-core` | Package name |
| `purl_version` | `2.17.1` | Version string |

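The field split in the table above can be illustrated with a short Python sketch. This is not the module's `parse_purl()` implementation; it assumes the standard `pkg:type/namespace/name@version?qualifiers#subpath` layout, ignores qualifiers and subpaths, and the `ParsedPurl` type is a hypothetical name introduced here.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class ParsedPurl(NamedTuple):
    """Hypothetical container mirroring the four columns in the table."""
    purl_type: str
    purl_namespace: Optional[str]
    purl_name: str
    purl_version: Optional[str]


def parse_purl(purl: str) -> ParsedPurl:
    """Split a package URL into type, namespace, name, and version."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a package URL: {purl!r}")
    # Drop the subpath (#...) and qualifiers (?...); this sketch ignores them.
    body = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0]
    purl_type, _, rest = body.partition("/")
    if not purl_type or not rest:
        raise ValueError(f"missing type or name: {purl!r}")
    # The version follows the last '@'. A literal '@' inside a namespace
    # (npm scopes) is expected to be percent-encoded as %40.
    path, sep, version = rest.rpartition("@")
    if not sep:
        path, version = rest, ""
    namespace, _, name = path.rpartition("/")
    if not name:
        raise ValueError(f"missing package name: {purl!r}")
    return ParsedPurl(
        purl_type.lower(),
        unquote(namespace) or None,
        unquote(name),
        unquote(version) or None,
    )
```

For the table's example, `parse_purl("pkg:maven/org.apache.logging/log4j-core@2.17.1")` yields the type `maven`, namespace `org.apache.logging`, name `log4j-core`, and version `2.17.1`.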
### Supplier Normalization

The `normalize_supplier()` function standardizes supplier names for consistent grouping:

1. Convert to lowercase
2. Trim whitespace
3. Remove legal suffixes: Inc., LLC, Ltd., Corp., GmbH, B.V., S.A., PLC, Co.
4. Normalize internal whitespace

**Examples:**
- `"Apache Software Foundation, Inc."` → `"apache software foundation"`
- `"Google LLC"` → `"google"`
- `"  Microsoft Corp.  "` → `"microsoft"`

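The four steps above can be sketched in Python as follows. This is an illustration of the rules rather than the module's `normalize_supplier()` function; it assumes suffixes are only stripped from the end of the name and that a comma before the suffix is dropped with it (inferred from the first example).

```python
import re

# Legal suffixes from step 3, lowercased and without the trailing dot.
_LEGAL_SUFFIXES = ("inc", "llc", "ltd", "corp", "gmbh", "b.v", "s.a", "plc", "co")

# Assumption: a suffix counts only at the end of the name, preceded by
# whitespace and/or a comma, with an optional trailing dot.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(re.escape(s) for s in _LEGAL_SUFFIXES) + r")\.?$"
)


def normalize_supplier(raw: str) -> str:
    """Lowercase, trim, strip trailing legal suffixes, collapse whitespace."""
    name = raw.lower().strip()
    # Repeat so stacked suffixes such as "Foo Co., Ltd." are fully removed.
    while True:
        stripped = _SUFFIX_RE.sub("", name)
        if stripped == name:
            break
        name = stripped
    return re.sub(r"\s+", " ", name).strip(" ,")
```

Because the function depends only on its input, the same raw string always maps to the same normalized value, which is what the determinism principle requires for stable grouping.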
### License Categorization

The `categorize_license()` function maps SPDX expressions to risk categories:

| Category | Examples | Risk Level |
|----------|----------|------------|
| `permissive` | MIT, Apache-2.0, BSD-3-Clause, ISC | Low |
| `copyleft-weak` | LGPL-2.1, MPL-2.0, EPL-2.0 | Medium |
| `copyleft-strong` | GPL-3.0, AGPL-3.0, SSPL | High |
| `proprietary` | Proprietary, Commercial | Review Required |
| `unknown` | Unrecognized expressions | Review Required |

**Special handling:**
- GPL with exceptions (e.g., `GPL-2.0 WITH Classpath-exception-2.0`) → `copyleft-weak`
- Dual-licensed (e.g., `MIT OR Apache-2.0`) → uses first match

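A minimal Python sketch of the mapping and the two special cases is shown below. It is not the module's `categorize_license()` function: the lookup table contains only the identifiers listed above, and "first match" is read here as "the first `OR` operand that maps to a known category", which is an interpretation rather than documented behavior.

```python
import re

# Only the identifiers named in the table; a real mapping would be larger.
_CATEGORY_BY_ID = {
    "mit": "permissive",
    "apache-2.0": "permissive",
    "bsd-3-clause": "permissive",
    "isc": "permissive",
    "lgpl-2.1": "copyleft-weak",
    "mpl-2.0": "copyleft-weak",
    "epl-2.0": "copyleft-weak",
    "gpl-3.0": "copyleft-strong",
    "agpl-3.0": "copyleft-strong",
    "sspl": "copyleft-strong",
    "proprietary": "proprietary",
    "commercial": "proprietary",
}


def _categorize_single(license_id: str) -> str:
    """Categorize one operand, optionally carrying a WITH exception."""
    parts = re.split(r"\s+WITH\s+", license_id, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 2 and parts[0].upper().startswith("GPL-"):
        # GPL with an exception is treated as weak copyleft.
        return "copyleft-weak"
    return _CATEGORY_BY_ID.get(parts[0].lower(), "unknown")


def categorize_license(expression: str) -> str:
    """Return the category of the first OR operand that is recognized."""
    for operand in re.split(r"\s+OR\s+", expression.strip(), flags=re.IGNORECASE):
        category = _categorize_single(operand.strip("() "))
        if category != "unknown":
            return category
    return "unknown"
```

Under this reading, `MIT OR Apache-2.0` resolves to `permissive` and `GPL-3.0 OR MIT` resolves to `copyleft-strong`, because operand order decides the outcome. Expressions joined with `AND` are not covered by the rules above and fall through to `unknown` in this sketch.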
## Component Deduplication

Components are deduplicated by `(purl, hash_sha256)`:

1. Same PURL and same hash: the existing record is updated (`last_seen_at`, counts)
2. Same PURL but different hash: a new record is created (version change)
3. Same hash but different PURL: a new record is created (aliased package)

**Upsert pattern:**
```sql
INSERT INTO analytics.components (...)
VALUES (...)
ON CONFLICT (purl, hash_sha256) DO UPDATE SET
    last_seen_at = now(),
    sbom_count = components.sbom_count + 1,
    updated_at = now();
```

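The three deduplication cases can be modelled in memory to make the key semantics concrete. The sketch below is a Python stand-in for the SQL upsert, not the ingestion service's code; `ComponentRow` and `upsert_component` are hypothetical names, and starting `sbom_count` at 1 for a new row is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Tuple


@dataclass
class ComponentRow:
    """Hypothetical subset of the columns touched by the upsert."""
    purl: str
    hash_sha256: str
    sbom_count: int
    last_seen_at: datetime


Registry = Dict[Tuple[str, str], ComponentRow]


def upsert_component(
    registry: Registry, purl: str, hash_sha256: str, now: datetime
) -> ComponentRow:
    """Insert a row, or bump counters when (purl, hash_sha256) already exists."""
    key = (purl, hash_sha256)
    row = registry.get(key)
    if row is None:
        # Cases 2 and 3: either half of the key differs, so a new row is created.
        row = ComponentRow(purl, hash_sha256, sbom_count=1, last_seen_at=now)
        registry[key] = row
    else:
        # Case 1: same PURL and hash, mirroring the ON CONFLICT branch.
        row.sbom_count += 1
        row.last_seen_at = now
    return row
```

Keying on the pair rather than on either column alone is what lets an aliased package (same hash, different PURL) and a changed build (same PURL, different hash) coexist as separate rows.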
## Vulnerability Correlation

When a component is upserted, the `VulnerabilityCorrelationService` queries Concelier for matching advisories:

1. Query by PURL type + namespace + name
2. Filter by version range matching
3. Upsert to `component_vulns` with severity, EPSS, KEV flags

**Version range matching** currently supports semver ranges and exact matches via
`VersionRuleEvaluator`. Non-semver schemes fall back to exact string matches; wildcard
and ecosystem-specific ranges require upstream normalization.

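The matching behavior described above (semver ranges, exact matches, and an exact-string fallback for everything else) might look like the following Python sketch. It does not reproduce `VersionRuleEvaluator`; the `introduced`/`fixed`/`exact` parameters are hypothetical, ranges are treated as half-open `[introduced, fixed)`, and pre-release or build suffixes are deliberately treated as non-semver.

```python
import re
from typing import Optional, Tuple

# Plain MAJOR.MINOR.PATCH only; anything else is handled as non-semver.
_SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def _parse_semver(version: Optional[str]) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch), or None if the string is not plain semver."""
    match = _SEMVER.match(version) if version is not None else None
    return tuple(int(part) for part in match.groups()) if match else None


def is_affected(
    version: str,
    *,
    introduced: Optional[str] = None,
    fixed: Optional[str] = None,
    exact: Optional[str] = None,
) -> bool:
    """Evaluate one advisory rule against a component version."""
    if exact is not None:
        # Exact rules compare raw strings, so they work for any scheme.
        return version == exact
    current = _parse_semver(version)
    low, high = _parse_semver(introduced), _parse_semver(fixed)
    unparsable_bound = (introduced is not None and low is None) or (
        fixed is not None and high is None
    )
    if current is None or unparsable_bound:
        # Non-semver input: a range cannot be evaluated, so nothing matches.
        return False
    if low is not None and current < low:
        return False
    if high is not None and current >= high:
        return False
    return True
```

Comparing integer tuples rather than strings matters for versions such as `2.9.0` and `2.10.0`, where a lexical comparison would order them incorrectly.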
## VEX Override Logic

The `mv_vuln_exposure` view implements VEX-adjusted counts:

```sql
-- Effective count excludes artifacts with active VEX overrides
COUNT(DISTINCT ac.artifact_id) FILTER (
    WHERE NOT EXISTS (
        SELECT 1 FROM analytics.vex_overrides vo
        WHERE vo.artifact_id = ac.artifact_id
          AND vo.vuln_id = cv.vuln_id
          AND vo.status = 'not_affected'
          AND (vo.valid_until IS NULL OR vo.valid_until > now())
    )
) AS effective_artifact_count
```

**Override validity:**
- `valid_from`: When the override became effective
- `valid_until`: Expiration (NULL = no expiration)
- Only `status = 'not_affected'` reduces exposure counts, and only when the override is active in its validity window.

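The validity rules above can be expressed as a small predicate. The Python sketch below is illustrative only: `VexOverride`, `is_active`, and `effective_artifact_count` are hypothetical names, and the `valid_from` check reflects the prose description of the validity window, whereas the SQL fragment shown above tests only `valid_until`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class VexOverride:
    """Hypothetical subset of a vex_overrides row."""
    artifact_id: str
    vuln_id: str
    status: str
    valid_from: Optional[datetime] = None
    valid_until: Optional[datetime] = None


def is_active(override: VexOverride, now: datetime) -> bool:
    """True when the override is 'not_affected' and inside its validity window."""
    if override.status != "not_affected":
        return False
    if override.valid_from is not None and override.valid_from > now:
        return False
    return override.valid_until is None or override.valid_until > now


def effective_artifact_count(
    vuln_id: str,
    artifact_ids: Iterable[str],
    overrides: Iterable[VexOverride],
    now: datetime,
) -> int:
    """Count distinct exposed artifacts, excluding actively overridden ones."""
    suppressed = {
        o.artifact_id for o in overrides if o.vuln_id == vuln_id and is_active(o, now)
    }
    return len(set(artifact_ids) - suppressed)
```

An expired override or one with a different status leaves the artifact in the count, so exposure rises again automatically once `valid_until` passes.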
## Attestation Ingestion

Attestation ingestion consumes Attestor Rekor entry events and expects Sigstore bundles
or raw DSSE envelopes. The ingestion service:
- Resolves bundle URIs using `BundleUriTemplate`; `bundle:{digest}` maps to
  `cas://<DefaultBucket>/{digest}` by default.
- Decodes DSSE payloads, computes `dsse_payload_hash`, and records `predicate_uri` plus
  Rekor log metadata (`rekor_log_id`, `rekor_log_index`).
- Uses in-toto `subject` digests to link artifacts when reanalysis hints are absent.
- Maps predicate URIs into `analytics_attestation_type` values
  (`provenance`, `sbom`, `vex`, `build`, `scan`, `policy`).
- Expands VEX statements into `vex_overrides` rows, one per product reference, and
  captures optional validity timestamps when provided.

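Two of the steps above, bundle URI resolution and DSSE payload decoding, are sketched below in Python. This is not the ingestion service's code: the `{bucket}` placeholder in the template, the use of SHA-256 for `dsse_payload_hash`, and the function names are assumptions made for illustration. The envelope fields (`payload`) and statement fields (`predicateType`, `subject`, `digest`) follow the public DSSE and in-toto Statement layouts.

```python
import base64
import hashlib
import json
from typing import Any, Dict


def resolve_bundle_uri(
    ref: str, default_bucket: str, template: str = "cas://{bucket}/{digest}"
) -> str:
    """Map 'bundle:{digest}' onto a storage URI; other references pass through."""
    prefix = "bundle:"
    if not ref.startswith(prefix):
        return ref
    return template.format(bucket=default_bucket, digest=ref[len(prefix):])


def summarize_envelope(envelope_json: str) -> Dict[str, Any]:
    """Decode a DSSE envelope and pull out the fields recorded at ingestion."""
    envelope = json.loads(envelope_json)
    # DSSE stores the signed payload as base64 inside the envelope.
    payload = base64.b64decode(envelope["payload"])
    statement = json.loads(payload)
    return {
        # Assumption: the hash is SHA-256 over the decoded payload bytes.
        "dsse_payload_hash": hashlib.sha256(payload).hexdigest(),
        "predicate_uri": statement.get("predicateType"),
        # Subject digests are what link the attestation to artifacts.
        "subject_digests": [
            f"{algorithm}:{value}"
            for subject in statement.get("subject", [])
            for algorithm, value in sorted(subject.get("digest", {}).items())
        ],
    }
```

Hashing the decoded payload rather than the envelope keeps the value stable when only the signature set changes.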
## Time-Series Rollups

Daily rollups are computed by `compute_daily_rollups()`:

**Vulnerability counts** (per environment/team/severity):
- `total_vulns`: All affecting vulnerabilities
- `fixable_vulns`: Vulns with `fix_available = TRUE`
- `vex_mitigated`: Vulns with active `not_affected` override
- `kev_vulns`: Vulns in CISA KEV
- `unique_cves`: Distinct CVE IDs
- `affected_artifacts`: Artifacts containing affected components
- `affected_components`: Components with affecting vulns

**Component counts** (per environment/team/license/type):
- `total_components`: Distinct components
- `unique_suppliers`: Distinct normalized suppliers

**Retention policy:** 90 days in hot storage; `compute_daily_rollups()` prunes older rows and downstream jobs archive to cold storage.

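To make the vulnerability metrics concrete, the Python sketch below groups findings by environment, team, and severity and derives the counters listed above. It is not `compute_daily_rollups()`: the `Finding` shape is hypothetical, and it assumes one input row per (artifact, component, vulnerability) combination.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Finding:
    """Hypothetical flattened row: one vulnerability on one component in one artifact."""
    environment: str
    team: str
    severity: str
    cve_id: str
    artifact_id: str
    component_id: str
    fix_available: bool = False
    vex_mitigated: bool = False
    in_kev: bool = False


GroupKey = Tuple[str, str, str]


def rollup_vulns(findings: Iterable[Finding]) -> Dict[GroupKey, Dict[str, int]]:
    """Aggregate findings into per-(environment, team, severity) counters."""
    groups: Dict[GroupKey, List[Finding]] = defaultdict(list)
    for finding in findings:
        groups[(finding.environment, finding.team, finding.severity)].append(finding)
    return {
        key: {
            "total_vulns": len(rows),
            "fixable_vulns": sum(r.fix_available for r in rows),
            "vex_mitigated": sum(r.vex_mitigated for r in rows),
            "kev_vulns": sum(r.in_kev for r in rows),
            "unique_cves": len({r.cve_id for r in rows}),
            "affected_artifacts": len({r.artifact_id for r in rows}),
            "affected_components": len({r.component_id for r in rows}),
        }
        for key, rows in groups.items()
    }
```

The distinct-count metrics use sets, so the same CVE appearing in several artifacts contributes once to `unique_cves` but once per row to `total_vulns`.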
## Materialized View Refresh

All materialized views support `REFRESH ... CONCURRENTLY` for zero-downtime updates. The `analytics.refresh_all_views()` helper refreshes every view in a single call but does so non-concurrently, so run it off-peak:

```sql
-- Refresh all views (non-concurrent; run off-peak)
SELECT analytics.refresh_all_views();
```

**Refresh schedule (recommended):**
- `mv_supplier_concentration`: 02:00 UTC daily
- `mv_license_distribution`: 02:15 UTC daily
- `mv_vuln_exposure`: 02:30 UTC daily
- `mv_attestation_coverage`: 02:45 UTC daily
- `compute_daily_rollups()`: 03:00 UTC daily

Platform WebService can run the daily rollup + refresh loop via
`PlatformAnalyticsMaintenanceService`. Configure the schedule with:
- `Platform:AnalyticsMaintenance:Enabled` (default `true`)
- `Platform:AnalyticsMaintenance:IntervalMinutes` (default `1440`)
- `Platform:AnalyticsMaintenance:RunOnStartup` (default `true`)
- `Platform:AnalyticsMaintenance:ComputeDailyRollups` (default `true`)
- `Platform:AnalyticsMaintenance:RefreshMaterializedViews` (default `true`)
- `Platform:AnalyticsMaintenance:BackfillDays` (default `0`, which disables backfill; a value of N recomputes the most recent N days on the first maintenance run)

The hosted service issues concurrent refresh statements directly for each view.
Use a DB scheduler (pg_cron) or external orchestrator if you need the staggered
per-view timing above.

## Performance Considerations

### Indexing Strategy

| Table | Key Indexes | Query Pattern |
|-------|-------------|---------------|
| `components` | `purl`, `supplier_normalized`, `license_category` | Lookup, aggregation |
| `artifacts` | `digest`, `environment`, `team` | Lookup, filtering |
| `component_vulns` | `vuln_id`, `severity`, `fix_available` | Join, filtering |
| `attestations` | `artifact_id`, `predicate_type` | Join, aggregation |
| `vex_overrides` | `(artifact_id, vuln_id)`, `status` | Subquery exists |

### Query Performance Targets

| Query | Target | Notes |
|-------|--------|-------|
| `sp_top_suppliers(20, 'prod')` | < 100ms | Uses materialized view when env is null; env filter reads base tables |
| `sp_license_heatmap('prod')` | < 100ms | Uses materialized view when env is null; env filter reads base tables |
| `sp_vuln_exposure()` | < 200ms | Uses materialized view for global queries; environment filters read base tables |
| `sp_fixable_backlog()` | < 500ms | Live query with indexes |
| `sp_attestation_gaps()` | < 100ms | Uses materialized view |

### Caching Strategy

Platform API endpoints use a 5-minute TTL cache:
- Cache key: endpoint + query parameters
- Invalidation: Time-based only (no event-driven invalidation)
- Storage: Valkey (in-memory)

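The caching behavior above can be illustrated with a small time-based cache. The Python sketch below uses an in-process dictionary as a stand-in for Valkey and is not the Platform API's implementation; `TtlCache` and its methods are hypothetical names, and sorting the query parameters when building the key is an assumption made so that parameter order does not cause duplicate entries.

```python
import time
from typing import Any, Callable, Dict, Mapping, Tuple


class TtlCache:
    """Time-based cache keyed by endpoint plus query parameters."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,  # the 5-minute TTL described above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def make_key(endpoint: str, params: Mapping[str, str]) -> str:
        """Build a stable key; parameters are sorted so order does not matter."""
        query = "&".join(f"{name}={value}" for name, value in sorted(params.items()))
        return f"{endpoint}?{query}"

    def get_or_compute(
        self, endpoint: str, params: Mapping[str, str], compute: Callable[[], Any]
    ) -> Any:
        """Return the cached value if it has not expired, else recompute and store."""
        key = self.make_key(endpoint, params)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

Because invalidation is purely time-based, a response can lag the underlying materialized view by up to one TTL after a refresh.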
## Security Considerations

### Schema Permissions

```sql
-- Read-only role for dashboards
GRANT USAGE ON SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA analytics TO dashboard_reader;

-- Write role for ingestion service
GRANT USAGE ON SCHEMA analytics TO analytics_writer;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA analytics TO analytics_writer;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA analytics TO analytics_writer;
```

### Data Classification

| Table | Classification | Notes |
|-------|----------------|-------|
| `components` | Internal | Contains package names, versions |
| `artifacts` | Internal | Contains image names, team names |
| `component_vulns` | Internal | Vulnerability data (public CVEs) |
| `vex_overrides` | Confidential | Contains justifications, operator IDs |
| `raw_sboms` | Confidential | Full SBOM payloads |
| `raw_attestations` | Confidential | Signed attestation envelopes |

### Audit Trail

All tables include `created_at` and `updated_at` timestamps. Raw payload tables (`raw_sboms`, `raw_attestations`) are append-only with content hashes for integrity verification.

## Integration Points

### Upstream Dependencies

| Service | Event | Contract | Action |
|---------|-------|----------|--------|
| Scanner | SBOM report ready | `scanner.event.report.ready@1` (`docs/modules/signals/events/orchestrator-scanner-events.md`) | Normalize and upsert components |
| Concelier | Advisory observation/linkset updated | `advisory.observation.updated@1` (`docs/modules/concelier/events/advisory.observation.updated@1.schema.json`), `advisory.linkset.updated@1` (`docs/modules/concelier/events/advisory.linkset.updated@1.md`) | Re-correlate affected components |
| Excititor | VEX statement changes | `vex.statement.*` (`docs/modules/excititor/architecture.md`) | Create/update `vex_overrides` |
| Attestor | Rekor entry logged | `rekor.entry.logged` (`docs/modules/attestor/architecture.md`) | Upsert attestation record |

### Downstream Consumers

| Consumer | Data | Endpoint |
|----------|------|----------|
| Console UI | Dashboard data | `/api/analytics/*` |
| Export Center | Compliance reports | Direct DB query |
| AdvisoryAI | Risk context | `/api/analytics/vulnerabilities` |

## Future Enhancements

1. **Partitioning**: Partition `daily_*` tables by date for faster queries and archival
2. **Incremental refresh**: Implement incremental materialized view refresh for large datasets
3. **Custom dimensions**: Support user-defined component groupings (business units, cost centers)
4. **Predictive analytics**: Add ML-based risk prediction using historical trends
5. **BI tool integration**: Direct connectors for Tableau, Looker, Metabase