Analytics Module Architecture
Design Philosophy
The Analytics module implements a star-schema data warehouse pattern optimized for analytical queries rather than transactional workloads. Key design principles:
- Separation of concerns: Analytics schema is isolated from operational schemas (scanner, vex, proof_system)
- Pre-computation: Expensive aggregations computed in advance via materialized views
- Audit trail: Raw payloads preserved for reprocessing and compliance
- Determinism: Normalization functions are immutable and reproducible; array aggregates are ordered for stable outputs
- Incremental updates: Supports both full refresh and incremental ingestion
Data Flow
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Scanner │ │ Concelier │ │ Attestor │
│ (SBOM) │ │ (Vuln) │ │ (DSSE) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
│ SBOM Ingested │ Vuln Updated │ Attestation Created
▼ ▼ ▼
┌──────────────────────────────────────────────────────┐
│ AnalyticsIngestionService │
│ - Normalize components (PURL, supplier, license) │
│ - Upsert to unified registry │
│ - Correlate with vulnerabilities │
│ - Store raw payloads │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ analytics schema │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌────────────┐ │
│ │components│ │artifacts│ │comp_vuln│ │attestations│ │
│ └─────────┘ └─────────┘ └─────────┘ └────────────┘ │
└──────────────────────────────────────────────────────┘
│
│ Daily refresh
▼
┌──────────────────────────────────────────────────────┐
│ Materialized Views │
│ mv_supplier_concentration | mv_license_distribution │
│ mv_vuln_exposure | mv_attestation_coverage │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Platform API Endpoints │
│ (with 5-minute caching) │
└──────────────────────────────────────────────────────┘
Normalization Rules
PURL Parsing
Package URLs (PURLs) are the canonical identifier for components. The parse_purl() function extracts:
| Field | Example | Notes |
|---|---|---|
| purl_type | maven, npm, pypi | Ecosystem identifier |
| purl_namespace | org.apache.logging | Group/org/scope (optional) |
| purl_name | log4j-core | Package name |
| purl_version | 2.17.1 | Version string |
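To make the field split concrete, here is a minimal Python sketch of the same extraction. It is not the module's parse_purl() SQL function; it only mirrors the four fields in the table, ignores qualifiers and subpaths, and assumes the namespace is every path segment between the type and the name.

```python
from urllib.parse import unquote


def parse_purl(purl: str) -> dict:
    """Split a Package URL into type, namespace, name and version (illustrative only)."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a PURL: {purl!r}")
    rest = purl[len("pkg:"):]
    # Assumption: subpath (#...) and qualifiers (?...) are not needed here, so drop them.
    rest = rest.split("#", 1)[0].split("?", 1)[0]
    # The version follows the last '@'; a '/' after it means the '@' was part of a scope.
    path, sep, version = rest.rpartition("@")
    if not sep or "/" in version:
        path, version = rest, None
    segments = [unquote(s) for s in path.strip("/").split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"PURL needs at least a type and a name: {purl!r}")
    return {
        "purl_type": segments[0].lower(),
        "purl_namespace": "/".join(segments[1:-1]) or None,  # optional
        "purl_name": segments[-1],
        "purl_version": unquote(version) if version else None,
    }


print(parse_purl("pkg:maven/org.apache.logging/log4j-core@2.17.1"))
```

Packages without a namespace, such as pkg:npm/lodash@4.17.21, come back with purl_namespace set to None.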
Supplier Normalization
The normalize_supplier() function standardizes supplier names for consistent grouping:
- Convert to lowercase
- Trim whitespace
- Remove legal suffixes: Inc., LLC, Ltd., Corp., GmbH, B.V., S.A., PLC, Co.
- Normalize internal whitespace
Examples:
- "Apache Software Foundation, Inc." → "apache software foundation"
- "Google LLC" → "google"
- " Microsoft Corp. " → "microsoft"
License Categorization
The categorize_license() function maps SPDX expressions to risk categories:
| Category | Examples | Risk Level |
|---|---|---|
| permissive | MIT, Apache-2.0, BSD-3-Clause, ISC | Low |
| copyleft-weak | LGPL-2.1, MPL-2.0, EPL-2.0 | Medium |
| copyleft-strong | GPL-3.0, AGPL-3.0, SSPL | High |
| proprietary | Proprietary, Commercial | Review Required |
| unknown | Unrecognized expressions | Review Required |
Special handling:
- GPL with exceptions (e.g., GPL-2.0 WITH Classpath-exception-2.0) → copyleft-weak
- Dual-licensed (e.g., MIT OR Apache-2.0) → uses first match
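A small Python sketch of the mapping and the two special cases. The lookup table holds only the example identifiers from the table above, and "first match" is interpreted here as the first OR operand that resolves to a known category; both are assumptions, and the real categorize_license() may cover more identifiers and handle AND expressions.

```python
import re

# Only the example identifiers from the table; the real mapping is likely larger.
_CATEGORY_BY_LICENSE = {
    "MIT": "permissive", "Apache-2.0": "permissive",
    "BSD-3-Clause": "permissive", "ISC": "permissive",
    "LGPL-2.1": "copyleft-weak", "MPL-2.0": "copyleft-weak", "EPL-2.0": "copyleft-weak",
    "GPL-3.0": "copyleft-strong", "AGPL-3.0": "copyleft-strong", "SSPL": "copyleft-strong",
    "Proprietary": "proprietary", "Commercial": "proprietary",
}


def _categorize_term(term: str) -> str:
    base, _, exception = term.strip("() ").partition(" WITH ")
    base = base.strip()
    if exception and base.startswith("GPL-"):
        return "copyleft-weak"  # GPL with an exception is downgraded
    return _CATEGORY_BY_LICENSE.get(base, "unknown")


def categorize_license(expression: str) -> str:
    """Return the category of the first OR operand that maps to a known category."""
    for term in re.split(r"\s+OR\s+", expression.strip()):
        category = _categorize_term(term)
        if category != "unknown":
            return category
    return "unknown"


print(categorize_license("MIT OR Apache-2.0"))
print(categorize_license("GPL-2.0 WITH Classpath-exception-2.0"))
```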
Component Deduplication
Components are deduplicated by (purl, hash_sha256):
- If same PURL and hash: existing record updated (last_seen_at, counts)
- If same PURL but different hash: new record created (version change)
- If same hash but different PURL: new record (aliased package)
Upsert pattern:
INSERT INTO analytics.components (...)
VALUES (...)
ON CONFLICT (purl, hash_sha256) DO UPDATE SET
last_seen_at = now(),
sbom_count = components.sbom_count + 1,
updated_at = now();
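The upsert behaviour can be tried locally with Python's built-in sqlite3 module. This sketch uses SQLite rather than PostgreSQL, a cut-down hypothetical table with only the columns involved, and datetime('now') in place of now(); it needs SQLite 3.24 or newer for ON CONFLICT ... DO UPDATE.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for analytics.components (hypothetical subset of columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE components (
            purl         TEXT NOT NULL,
            hash_sha256  TEXT NOT NULL,
            sbom_count   INTEGER NOT NULL,
            last_seen_at TEXT NOT NULL,
            UNIQUE (purl, hash_sha256)
        )
        """
    )
    return conn


def upsert_component(conn: sqlite3.Connection, purl: str, hash_sha256: str) -> None:
    """Insert a component, or bump its counters when (purl, hash_sha256) already exists."""
    conn.execute(
        """
        INSERT INTO components (purl, hash_sha256, sbom_count, last_seen_at)
        VALUES (?, ?, 1, datetime('now'))
        ON CONFLICT (purl, hash_sha256) DO UPDATE SET
            last_seen_at = datetime('now'),
            sbom_count = sbom_count + 1
        """,
        (purl, hash_sha256),
    )


conn = make_db()
upsert_component(conn, "pkg:npm/lodash@4.17.21", "aaa")
upsert_component(conn, "pkg:npm/lodash@4.17.21", "aaa")  # same key: counter bumped
upsert_component(conn, "pkg:npm/lodash@4.17.21", "bbb")  # same PURL, new hash: new row
print(conn.execute("SELECT hash_sha256, sbom_count FROM components").fetchall())
```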
Vulnerability Correlation
When a component is upserted, the VulnerabilityCorrelationService queries Concelier for matching advisories:
- Query by PURL type + namespace + name
- Filter by version range matching
- Upsert to component_vulns with severity, EPSS, KEV flags
Version range matching currently supports semver ranges and exact matches via
VersionRuleEvaluator. Non-semver schemes fall back to exact string matches; wildcard
and ecosystem-specific ranges require upstream normalization.
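The semver-or-exact behaviour described above might look roughly like the Python sketch below. It is not VersionRuleEvaluator: the rule syntax (space-separated comparators such as ">=2.0.0 <2.17.1") and the restriction to plain MAJOR.MINOR.PATCH versions are assumptions made for illustration.

```python
import operator
import re

_SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")
_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
        "<": operator.lt, "=": operator.eq}


def _parse(version: str):
    """Return (major, minor, patch) or None when the string is not plain semver."""
    m = _SEMVER.match(version)
    return tuple(int(part) for part in m.groups()) if m else None


def version_matches(version: str, rule: str) -> bool:
    """Evaluate a semver range; fall back to exact string equality otherwise."""
    parsed = _parse(version)
    constraints = rule.split()
    if parsed is None or not constraints:
        return version == rule  # non-semver scheme: exact match only
    for constraint in constraints:
        m = re.match(r"^(>=|<=|>|<|=)?(.+)$", constraint)
        op, bound = m.group(1) or "=", _parse(m.group(2))
        if bound is None:
            return version == rule  # rule is not semver either
        if not _OPS[op](parsed, bound):
            return False
    return True


print(version_matches("2.14.0", ">=2.0.0 <2.17.1"))
```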
VEX Override Logic
The mv_vuln_exposure view implements VEX-adjusted counts:
-- Effective count excludes artifacts with active VEX overrides
COUNT(DISTINCT ac.artifact_id) FILTER (
WHERE NOT EXISTS (
SELECT 1 FROM analytics.vex_overrides vo
WHERE vo.artifact_id = ac.artifact_id
AND vo.vuln_id = cv.vuln_id
AND vo.status = 'not_affected'
AND (vo.valid_until IS NULL OR vo.valid_until > now())
)
) AS effective_artifact_count
Override validity:
- valid_from: When the override became effective
- valid_until: Expiration (NULL = no expiration)
- Only status = 'not_affected' reduces exposure counts, and only when the override is active in its validity window.
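The validity rule can be expressed as a small predicate. The Python sketch below is illustrative: the SQL excerpt above checks only valid_until, so the valid_from comparison here is an assumption drawn from the phrase "active in its validity window".

```python
from datetime import datetime, timezone
from typing import Optional


def override_reduces_exposure(
    status: str,
    valid_from: Optional[datetime],
    valid_until: Optional[datetime],
    now: datetime,
) -> bool:
    """True when a VEX override should remove the artifact from the effective count."""
    if status != "not_affected":
        return False
    # Assumption: an override that has not started yet does not count.
    if valid_from is not None and valid_from > now:
        return False
    # NULL valid_until means the override never expires.
    return valid_until is None or valid_until > now


now = datetime(2025, 1, 15, tzinfo=timezone.utc)
print(override_reduces_exposure("not_affected", None, None, now))
```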
Attestation Ingestion
Attestation ingestion consumes Attestor Rekor entry events and expects Sigstore bundles or raw DSSE envelopes. The ingestion service:
- Resolves bundle URIs using BundleUriTemplate; bundle:{digest} maps to cas://<DefaultBucket>/{digest} by default.
- Decodes DSSE payloads, computes dsse_payload_hash, and records predicate_uri plus Rekor log metadata (rekor_log_id, rekor_log_index).
- Uses in-toto subject digests to link artifacts when reanalysis hints are absent.
- Maps predicate URIs into analytics_attestation_type values (provenance, sbom, vex, build, scan, policy).
- Expands VEX statements into vex_overrides rows, one per product reference, and captures optional validity timestamps when provided.
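As a rough illustration of the first three steps, the Python sketch below resolves a bundle URI and pulls the relevant fields out of a DSSE envelope. Several details are assumptions: the bucket name is hypothetical, dsse_payload_hash is taken to be the SHA-256 hex digest of the decoded payload, and the payload is assumed to be a standard-base64 in-toto statement with predicateType and subject fields.

```python
import base64
import hashlib
import json


def resolve_bundle_uri(uri: str, default_bucket: str) -> str:
    """Map bundle:{digest} to cas://<DefaultBucket>/{digest}; pass other URIs through."""
    if uri.startswith("bundle:"):
        return f"cas://{default_bucket}/{uri[len('bundle:'):]}"
    return uri


def summarize_dsse(envelope: dict) -> dict:
    """Decode the DSSE payload and extract hash, predicate URI and subject digests."""
    payload = base64.b64decode(envelope["payload"])
    statement = json.loads(payload)
    digests = [
        f"{alg}:{value}"
        for subject in statement.get("subject", [])
        for alg, value in subject.get("digest", {}).items()
    ]
    return {
        "dsse_payload_hash": hashlib.sha256(payload).hexdigest(),  # assumed hash scheme
        "predicate_uri": statement.get("predicateType"),
        "subject_digests": digests,
    }


statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [{"name": "example-image", "digest": {"sha256": "abc123"}}],
    "predicateType": "https://slsa.dev/provenance/v1",
    "predicate": {},
}
envelope = {
    "payloadType": "application/vnd.in-toto+json",
    "payload": base64.b64encode(json.dumps(statement).encode()).decode(),
    "signatures": [],
}
print(summarize_dsse(envelope)["subject_digests"])
```

The subject digests are what would be compared against artifact digests when no reanalysis hint is available.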
Time-Series Rollups
Daily rollups computed by compute_daily_rollups():
Vulnerability counts (per environment/team/severity):
- total_vulns: All affecting vulnerabilities
- fixable_vulns: Vulns with fix_available = TRUE
- vex_mitigated: Vulns with active not_affected override
- kev_vulns: Vulns in CISA KEV
- unique_cves: Distinct CVE IDs
- affected_artifacts: Artifacts containing affected components
- affected_components: Components with affecting vulns
Component counts (per environment/team/license/type):
- total_components: Distinct components
- unique_suppliers: Distinct normalized suppliers
Retention policy: 90 days in hot storage; compute_daily_rollups() prunes older rows and downstream jobs archive to cold storage.
Materialized View Refresh
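All materialized views support REFRESH ... CONCURRENTLY for zero-downtime updates. The analytics.refresh_all_views() helper refreshes every view non-concurrently, so run it off-peak: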
-- Refresh all views (non-concurrent; run off-peak)
SELECT analytics.refresh_all_views();
Refresh schedule (recommended):
- mv_supplier_concentration: 02:00 UTC daily
- mv_license_distribution: 02:15 UTC daily
- mv_vuln_exposure: 02:30 UTC daily
- mv_attestation_coverage: 02:45 UTC daily
- compute_daily_rollups(): 03:00 UTC daily
Platform WebService can run the daily rollup + refresh loop via
PlatformAnalyticsMaintenanceService. Configure the schedule with:
- Platform:AnalyticsMaintenance:Enabled (default true)
- Platform:AnalyticsMaintenance:IntervalMinutes (default 1440)
- Platform:AnalyticsMaintenance:RunOnStartup (default true)
- Platform:AnalyticsMaintenance:ComputeDailyRollups (default true)
- Platform:AnalyticsMaintenance:RefreshMaterializedViews (default true)
- Platform:AnalyticsMaintenance:BackfillDays (default 0, which disables backfill; a positive value N recomputes the most recent N days on the first maintenance run)
The hosted service issues concurrent refresh statements directly for each view. Use a DB scheduler (pg_cron) or external orchestrator if you need the staggered per-view timing above.
Performance Considerations
Indexing Strategy
| Table | Key Indexes | Query Pattern |
|---|---|---|
| components | purl, supplier_normalized, license_category | Lookup, aggregation |
| artifacts | digest, environment, team | Lookup, filtering |
| component_vulns | vuln_id, severity, fix_available | Join, filtering |
| attestations | artifact_id, predicate_type | Join, aggregation |
| vex_overrides | (artifact_id, vuln_id), status | Subquery exists |
Query Performance Targets
| Query | Target | Notes |
|---|---|---|
| sp_top_suppliers(20, 'prod') | < 100ms | Uses materialized view when env is null; env filter reads base tables |
| sp_license_heatmap('prod') | < 100ms | Uses materialized view when env is null; env filter reads base tables |
| sp_vuln_exposure() | < 200ms | Uses materialized view for global queries; environment filters read base tables |
| sp_fixable_backlog() | < 500ms | Live query with indexes |
| sp_attestation_gaps() | < 100ms | Uses materialized view |
Caching Strategy
Platform API endpoints use a 5-minute TTL cache:
- Cache key: endpoint + query parameters
- Invalidation: Time-based only (no event-driven invalidation)
- Storage: Valkey (in-memory)
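A minimal Python sketch of the caching behaviour described above: keys are built from the endpoint plus sorted query parameters, and entries expire purely by age. An in-process dictionary stands in for Valkey, and the class and method names are hypothetical.

```python
import time
from urllib.parse import urlencode


class TtlCache:
    """Time-based cache keyed by endpoint + query parameters (no event-driven invalidation)."""

    def __init__(self, ttl_seconds: float = 300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, value)

    @staticmethod
    def make_key(endpoint: str, params: dict) -> str:
        # Sorting makes the key independent of parameter order.
        return endpoint + "?" + urlencode(sorted(params.items()))

    def get_or_compute(self, endpoint: str, params: dict, compute):
        key = self.make_key(endpoint, params)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


cache = TtlCache(ttl_seconds=300)
result = cache.get_or_compute(
    "/api/analytics/vulnerabilities", {"env": "prod"}, lambda: {"rows": []}
)
print(result)
```

Because invalidation is time-based only, a dashboard can lag behind a materialized view refresh by up to the TTL.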
Security Considerations
Schema Permissions
-- Read-only role for dashboards
GRANT USAGE ON SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA analytics TO dashboard_reader;
-- Write role for ingestion service
GRANT USAGE ON SCHEMA analytics TO analytics_writer;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA analytics TO analytics_writer;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA analytics TO analytics_writer;
Data Classification
| Table | Classification | Notes |
|---|---|---|
| components | Internal | Contains package names, versions |
| artifacts | Internal | Contains image names, team names |
| component_vulns | Internal | Vulnerability data (public CVEs) |
| vex_overrides | Confidential | Contains justifications, operator IDs |
| raw_sboms | Confidential | Full SBOM payloads |
| raw_attestations | Confidential | Signed attestation envelopes |
Audit Trail
All tables include created_at and updated_at timestamps. Raw payload tables (raw_sboms, raw_attestations) are append-only with content hashes for integrity verification.
Integration Points
Upstream Dependencies
| Service | Event | Contract | Action |
|---|---|---|---|
| Scanner | SBOM report ready | scanner.event.report.ready@1 (docs/modules/signals/events/orchestrator-scanner-events.md) | Normalize and upsert components |
| Concelier | Advisory observation/linkset updated | advisory.observation.updated@1 (docs/modules/concelier/events/advisory.observation.updated@1.schema.json), advisory.linkset.updated@1 (docs/modules/concelier/events/advisory.linkset.updated@1.md) | Re-correlate affected components |
| Excititor | VEX statement changes | vex.statement.* (docs/modules/excititor/architecture.md) | Create/update vex_overrides |
| Attestor | Rekor entry logged | rekor.entry.logged (docs/modules/attestor/architecture.md) | Upsert attestation record |
Downstream Consumers
| Consumer | Data | Endpoint |
|---|---|---|
| Console UI | Dashboard data | /api/analytics/* |
| Export Center | Compliance reports | Direct DB query |
| AdvisoryAI | Risk context | /api/analytics/vulnerabilities |
Future Enhancements
- Partitioning: Partition daily_* tables by date for faster queries and archival
- Incremental refresh: Implement incremental materialized view refresh for large datasets
- Custom dimensions: Support user-defined component groupings (business units, cost centers)
- Predictive analytics: Add ML-based risk prediction using historical trends
- BI tool integration: Direct connectors for Tableau, Looker, Metabase