Analytics Module Architecture
Design Philosophy
The Analytics module implements a star-schema data warehouse pattern optimized for analytical queries rather than transactional workloads. Key design principles:
- Separation of concerns: Analytics schema is isolated from operational schemas (scanner, vex, proof_system)
- Pre-computation: Expensive aggregations computed in advance via materialized views
- Audit trail: Raw payloads preserved for reprocessing and compliance
- Determinism: All normalization functions are immutable and reproducible
- Incremental updates: Supports both full refresh and incremental ingestion
Data Flow
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Scanner   │     │  Concelier  │     │  Attestor   │
│   (SBOM)    │     │   (Vuln)    │     │   (DSSE)    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       │ SBOM Ingested     │ Vuln Updated      │ Attestation Created
       ▼                   ▼                   ▼
┌──────────────────────────────────────────────────────┐
│              AnalyticsIngestionService               │
│  - Normalize components (PURL, supplier, license)    │
│  - Upsert to unified registry                        │
│  - Correlate with vulnerabilities                    │
│  - Store raw payloads                                │
└──────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                   analytics schema                   │
│ ┌──────────┐ ┌─────────┐ ┌─────────┐ ┌────────────┐  │
│ │components│ │artifacts│ │comp_vuln│ │attestations│  │
│ └──────────┘ └─────────┘ └─────────┘ └────────────┘  │
└──────────────────────────────────────────────────────┘
                           │
                           │ Daily refresh
                           ▼
┌──────────────────────────────────────────────────────┐
│                  Materialized Views                  │
│  mv_supplier_concentration | mv_license_distribution │
│  mv_vuln_exposure          | mv_attestation_coverage │
└──────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                Platform API Endpoints                │
│               (with 5-minute caching)                │
└──────────────────────────────────────────────────────┘
Normalization Rules
PURL Parsing
Package URLs (PURLs) are the canonical identifier for components. The parse_purl() function extracts:
| Field | Example | Notes |
|---|---|---|
| purl_type | maven, npm, pypi | Ecosystem identifier |
| purl_namespace | org.apache.logging | Group/org/scope (optional) |
| purl_name | log4j-core | Package name |
| purl_version | 2.17.1 | Version string |
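The field split can be illustrated with a short Python sketch. This is not the module's parse_purl() implementation, which is not shown here; the ParsedPurl type and the choice to discard qualifiers and subpath are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import unquote


@dataclass(frozen=True)
class ParsedPurl:  # hypothetical container for the four fields in the table
    purl_type: str
    purl_namespace: Optional[str]
    purl_name: str
    purl_version: Optional[str]


def parse_purl(purl: str) -> ParsedPurl:
    """Split pkg:type/namespace/name@version into its four fields."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a package URL: {purl!r}")
    # Assumption: qualifiers (?k=v) and subpath (#path) are ignored here.
    body = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0].strip("/")
    # The last path segment is name[@version]; the rest is type[/namespace].
    head, _, tail = body.rpartition("/")
    purl_type, _, namespace = head.partition("/")
    name, _, version = tail.partition("@")
    if not purl_type or not name:
        raise ValueError(f"missing type or name: {purl!r}")
    return ParsedPurl(
        purl_type=purl_type.lower(),
        purl_namespace=unquote(namespace) or None,  # optional field
        purl_name=unquote(name),
        purl_version=unquote(version) or None,
    )


# Built from the example values in the table above.
example = parse_purl("pkg:maven/org.apache.logging/log4j-core@2.17.1")
```

Percent-decoding matters for scoped npm packages, where the scope is written as %40scope inside the PURL.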
Supplier Normalization
The normalize_supplier() function standardizes supplier names for consistent grouping:
- Convert to lowercase
- Trim whitespace
- Remove legal suffixes: Inc., LLC, Ltd., Corp., GmbH, B.V., S.A., PLC, Co.
- Normalize internal whitespace
Examples:
- "Apache Software Foundation, Inc." → "apache software foundation"
- "Google LLC" → "google"
- " Microsoft Corp. " → "microsoft"
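A minimal Python sketch of these rules follows. It mirrors the listed steps, not the actual normalize_supplier() source; stripping the comma in front of a suffix and repeating the strip for stacked suffixes are assumptions.

```python
import re

# Legal suffixes from the list above, lowercased and without the trailing dot.
_LEGAL_SUFFIXES = ("inc", "llc", "ltd", "corp", "gmbh", "b.v", "s.a", "plc", "co")
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(re.escape(s) for s in _LEGAL_SUFFIXES) + r")\.?$"
)


def normalize_supplier(raw: str) -> str:
    # Lowercase, trim, and collapse internal whitespace in one pass.
    name = " ".join(raw.lower().split())
    # Assumption: strip repeatedly so "Acme Co., Ltd." loses both suffixes.
    previous = None
    while previous != name:
        previous = name
        name = _SUFFIX_RE.sub("", name).rstrip(" ,")
    return name
```

The function has no external state, so the same input always yields the same output, in line with the determinism principle stated earlier.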
License Categorization
The categorize_license() function maps SPDX expressions to risk categories:
| Category | Examples | Risk Level |
|---|---|---|
| permissive | MIT, Apache-2.0, BSD-3-Clause, ISC | Low |
| copyleft-weak | LGPL-2.1, MPL-2.0, EPL-2.0 | Medium |
| copyleft-strong | GPL-3.0, AGPL-3.0, SSPL | High |
| proprietary | Proprietary, Commercial | Review Required |
| unknown | Unrecognized expressions | Review Required |
Special handling:
- GPL with exceptions (e.g., GPL-2.0 WITH Classpath-exception-2.0) → copyleft-weak
- Dual-licensed (e.g., MIT OR Apache-2.0) → uses first match
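The mapping and the two special cases can be sketched in Python as below. The lookup table holds only the identifiers named in the table above, and reading "first match" as "the first OR operand with a known category" is an assumption; the real categorize_license() may resolve expressions differently.

```python
import re

# Only the identifiers from the table above; a real mapping would be larger.
_CATEGORY_BY_ID = {
    "mit": "permissive", "apache-2.0": "permissive",
    "bsd-3-clause": "permissive", "isc": "permissive",
    "lgpl-2.1": "copyleft-weak", "mpl-2.0": "copyleft-weak",
    "epl-2.0": "copyleft-weak",
    "gpl-3.0": "copyleft-strong", "agpl-3.0": "copyleft-strong",
    "sspl": "copyleft-strong",
    "proprietary": "proprietary", "commercial": "proprietary",
}


def _categorize_term(term: str) -> str:
    base, _, exception = term.partition(" with ")
    # GPL plus any exception clause is downgraded to copyleft-weak.
    if exception and base.strip().startswith("gpl-"):
        return "copyleft-weak"
    return _CATEGORY_BY_ID.get(base.strip(), "unknown")


def categorize_license(expression: str) -> str:
    # Assumption: AND expressions are not decomposed and fall through to unknown.
    terms = re.split(r"\s+or\s+", expression.strip().lower())
    for term in terms:
        category = _categorize_term(term.strip(" ()"))
        if category != "unknown":
            return category  # first operand with a known category wins
    return "unknown"
```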
Component Deduplication
Components are deduplicated by (purl, hash_sha256):
- If same PURL and hash: existing record updated (last_seen_at, counts)
- If same PURL but different hash: new record created (version change)
- If same hash but different PURL: new record (aliased package)
Upsert pattern:
INSERT INTO analytics.components (...)
VALUES (...)
ON CONFLICT (purl, hash_sha256) DO UPDATE SET
    last_seen_at = now(),
    sbom_count = components.sbom_count + 1,
    updated_at = now();
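The three deduplication cases can be modelled with an in-memory registry keyed by (purl, hash_sha256). This Python sketch only illustrates the conflict semantics of the upsert; ComponentRegistry is hypothetical, and the initial sbom_count of 1 is an assumption because the INSERT column list is elided above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


@dataclass
class ComponentRecord:
    purl: str
    hash_sha256: str
    sbom_count: int
    last_seen_at: datetime


class ComponentRegistry:
    """Hypothetical stand-in for analytics.components."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], ComponentRecord] = {}

    def upsert(self, purl: str, hash_sha256: str,
               now: Optional[datetime] = None) -> ComponentRecord:
        now = now or datetime.now(timezone.utc)
        key = (purl, hash_sha256)  # same pair as the ON CONFLICT target
        row = self._rows.get(key)
        if row is None:
            # New PURL, new hash, or both: a new record is created.
            row = ComponentRecord(purl, hash_sha256, 1, now)
            self._rows[key] = row
        else:
            # Same PURL and hash: bump the counters on the existing record.
            row.sbom_count += 1
            row.last_seen_at = now
        return row

    def __len__(self) -> int:
        return len(self._rows)
```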
Vulnerability Correlation
When a component is upserted, the VulnerabilityCorrelationService queries Concelier for matching advisories:
- Query by PURL type + namespace + name
- Filter by version range matching
- Upsert to component_vulns with severity, EPSS, KEV flags
Version range matching uses Concelier's existing logic to handle:
- Semver ranges: >=1.0.0 <2.0.0
- Exact versions: 1.2.3
- Wildcards: 1.x
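Concelier's matching logic is not reproduced in this document. The Python sketch below is a simplified illustration of the three range forms and assumes purely numeric dotted versions with no pre-release or build tags; version_matches is a hypothetical name.

```python
import operator
import re

_OPERATORS = {
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
}
_CLAUSE = re.compile(r"(>=|<=|>|<|=)?(.+)")


def _parse(version: str, width: int = 3) -> tuple:
    # Assumption: numeric dotted versions only; "1.2" is padded to (1, 2, 0).
    nums = tuple(int(part) for part in version.split("."))
    return nums + (0,) * (width - len(nums))


def version_matches(version: str, spec: str) -> bool:
    """True when the version satisfies every space-separated clause in spec."""
    candidate = _parse(version)
    for clause in spec.split():
        op, target = _CLAUSE.fullmatch(clause).groups()
        parts = target.split(".")
        wildcard_at = next(
            (i for i, part in enumerate(parts) if part in ("x", "*")), None
        )
        if wildcard_at is not None:
            # Wildcard: only the segments before the wildcard must be equal.
            prefix = tuple(int(part) for part in parts[:wildcard_at])
            if candidate[:len(prefix)] != prefix:
                return False
            continue
        # A bare version with no operator is treated as an exact match.
        if not _OPERATORS[op or "="](candidate, _parse(target)):
            return False
    return True
```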
VEX Override Logic
The mv_vuln_exposure view implements VEX-adjusted counts:
-- Effective count excludes artifacts with active VEX overrides
COUNT(DISTINCT ac.artifact_id) FILTER (
    WHERE NOT EXISTS (
        SELECT 1 FROM analytics.vex_overrides vo
        WHERE vo.artifact_id = ac.artifact_id
          AND vo.vuln_id = cv.vuln_id
          AND vo.status = 'not_affected'
          AND (vo.valid_until IS NULL OR vo.valid_until > now())
    )
) AS effective_artifact_count
Override validity:
- valid_from: When the override became effective
- valid_until: Expiration (NULL = no expiration)
- Only status = 'not_affected' reduces exposure counts
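The same filter can be expressed in Python to make the exclusion rule easy to test. The sketch mirrors the SQL fragment above, which checks status and valid_until only; the VexOverride type and the function signature are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class VexOverride:  # hypothetical row shape for analytics.vex_overrides
    artifact_id: str
    vuln_id: str
    status: str
    valid_until: Optional[datetime] = None  # None means no expiration


def effective_artifact_count(artifact_ids: Iterable[str], vuln_id: str,
                             overrides: Iterable[VexOverride],
                             now: datetime) -> int:
    """Count distinct artifacts not covered by an active not_affected override."""
    suppressed = {
        o.artifact_id
        for o in overrides
        if o.vuln_id == vuln_id
        and o.status == "not_affected"
        and (o.valid_until is None or o.valid_until > now)
    }
    return len(set(artifact_ids) - suppressed)
```

An expired override, or one with any other status, leaves the artifact in the count.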
Time-Series Rollups
Daily rollups computed by compute_daily_rollups():
Vulnerability counts (per environment/team/severity):
- total_vulns: All affecting vulnerabilities
- fixable_vulns: Vulns with fix_available = TRUE
- vex_mitigated: Vulns with active not_affected override
- kev_vulns: Vulns in CISA KEV
- unique_cves: Distinct CVE IDs
- affected_artifacts: Artifacts containing affected components
- affected_components: Components with affecting vulns
Component counts (per environment/team/license/type):
- total_components: Distinct components
- unique_suppliers: Distinct normalized suppliers
Retention policy: 90 days in hot storage; older data archived to cold storage.
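As an illustration of how the vulnerability counters relate to each other, the Python sketch below groups hypothetical finding rows by (environment, team, severity). The Finding shape is an assumption, as is the reading that total_vulns counts one row per component and vulnerability pairing; compute_daily_rollups() itself is not shown in this document.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Finding:  # hypothetical: one vulnerability affecting one component in one artifact
    environment: str
    team: str
    severity: str
    cve_id: str
    artifact_id: str
    component_id: str
    fix_available: bool = False
    vex_mitigated: bool = False
    in_kev: bool = False


def rollup_vulns(findings: Iterable[Finding]) -> Dict[Tuple[str, str, str], dict]:
    groups = defaultdict(list)
    for f in findings:
        groups[(f.environment, f.team, f.severity)].append(f)
    return {
        key: {
            "total_vulns": len(rows),
            "fixable_vulns": sum(r.fix_available for r in rows),
            "vex_mitigated": sum(r.vex_mitigated for r in rows),
            "kev_vulns": sum(r.in_kev for r in rows),
            "unique_cves": len({r.cve_id for r in rows}),
            "affected_artifacts": len({r.artifact_id for r in rows}),
            "affected_components": len({r.component_id for r in rows}),
        }
        for key, rows in groups.items()
    }
```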
Materialized View Refresh
All materialized views support REFRESH ... CONCURRENTLY for zero-downtime updates:
-- Refresh all views (run daily via pg_cron or Scheduler)
SELECT analytics.refresh_all_views();
Refresh schedule (recommended):
- mv_supplier_concentration: 02:00 UTC daily
- mv_license_distribution: 02:15 UTC daily
- mv_vuln_exposure: 02:30 UTC daily
- mv_attestation_coverage: 02:45 UTC daily
- compute_daily_rollups(): 03:00 UTC daily
Performance Considerations
Indexing Strategy
| Table | Key Indexes | Query Pattern |
|---|---|---|
| components | purl, supplier_normalized, license_category | Lookup, aggregation |
| artifacts | digest, environment, team | Lookup, filtering |
| component_vulns | vuln_id, severity, fix_available | Join, filtering |
| attestations | artifact_id, predicate_type | Join, aggregation |
| vex_overrides | (artifact_id, vuln_id), status | Subquery exists |
Query Performance Targets
| Query | Target | Notes |
|---|---|---|
| sp_top_suppliers(20) | < 100ms | Uses materialized view |
| sp_license_heatmap() | < 100ms | Uses materialized view |
| sp_vuln_exposure() | < 200ms | Uses materialized view |
| sp_fixable_backlog() | < 500ms | Live query with indexes |
| sp_attestation_gaps() | < 100ms | Uses materialized view |
Caching Strategy
Platform API endpoints use a 5-minute TTL cache:
- Cache key: endpoint + query parameters
- Invalidation: Time-based only (no event-driven invalidation)
- Storage: Valkey (in-memory)
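The caching behaviour can be sketched in Python with an in-process dictionary standing in for Valkey. Sorting the query parameters so that equivalent requests share a key is an assumption, as are the names cache_key and TtlCache and the parameters in the test data.

```python
import time
from typing import Any, Callable, Dict, Mapping, Tuple
from urllib.parse import urlencode

TTL_SECONDS = 300  # 5-minute TTL from the list above


def cache_key(endpoint: str, params: Mapping[str, Any]) -> str:
    # Assumption: parameters are sorted so their order does not split the cache.
    return endpoint + "?" + urlencode(sorted(params.items()))


class TtlCache:
    """Time-based expiry only; nothing invalidates an entry early."""

    def __init__(self, ttl: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

Because invalidation is time-based only, a response can lag the daily materialized view refresh by up to five minutes.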
Security Considerations
Schema Permissions
-- Read-only role for dashboards
GRANT USAGE ON SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA analytics TO dashboard_reader;
-- Write role for ingestion service
GRANT USAGE ON SCHEMA analytics TO analytics_writer;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA analytics TO analytics_writer;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA analytics TO analytics_writer;
Data Classification
| Table | Classification | Notes |
|---|---|---|
| components | Internal | Contains package names, versions |
| artifacts | Internal | Contains image names, team names |
| component_vulns | Internal | Vulnerability data (public CVEs) |
| vex_overrides | Confidential | Contains justifications, operator IDs |
| raw_sboms | Confidential | Full SBOM payloads |
| raw_attestations | Confidential | Signed attestation envelopes |
Audit Trail
All tables include created_at and updated_at timestamps. Raw payload tables (raw_sboms, raw_attestations) are append-only with content hashes for integrity verification.
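The integrity check can be illustrated with a small Python sketch of an append-only store. The hash algorithm is not named above, so SHA-256 is an assumption, and RawPayloadStore is a hypothetical in-memory stand-in for the raw payload tables.

```python
import hashlib
from typing import List, Tuple


class RawPayloadStore:
    """Append-only: rows are added, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, bytes]] = []

    def append(self, payload: bytes) -> str:
        # Assumption: SHA-256 over the exact bytes received.
        digest = hashlib.sha256(payload).hexdigest()
        self._rows.append((digest, payload))
        return digest

    def verify(self) -> bool:
        """Recompute every hash and compare it with the stored value."""
        return all(
            hashlib.sha256(payload).hexdigest() == digest
            for digest, payload in self._rows
        )
```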
Integration Points
Upstream Dependencies
| Service | Event | Action |
|---|---|---|
| Scanner | SBOM ingested | Normalize and upsert components |
| Concelier | Advisory updated | Re-correlate affected components |
| Excititor | VEX observation | Create/update vex_overrides |
| Attestor | Attestation created | Upsert attestation record |
Downstream Consumers
| Consumer | Data | Endpoint |
|---|---|---|
| Console UI | Dashboard data | /api/analytics/* |
| Export Center | Compliance reports | Direct DB query |
| AdvisoryAI | Risk context | /api/analytics/vulnerabilities |
Future Enhancements
- Partitioning: Partition daily_* tables by date for faster queries and archival
- Incremental refresh: Implement incremental materialized view refresh for large datasets
- Custom dimensions: Support user-defined component groupings (business units, cost centers)
- Predictive analytics: Add ML-based risk prediction using historical trends
- BI tool integration: Direct connectors for Tableau, Looker, Metabase