Analytics Module Architecture

Design Philosophy

The Analytics module implements a star-schema data warehouse pattern optimized for analytical queries rather than transactional workloads. Key design principles:

  1. Separation of concerns: Analytics schema is isolated from operational schemas (scanner, vex, proof_system)
  2. Pre-computation: Expensive aggregations computed in advance via materialized views
  3. Audit trail: Raw payloads preserved for reprocessing and compliance
  4. Determinism: All normalization functions are immutable and reproducible
  5. Incremental updates: Supports both full refresh and incremental ingestion

Data Flow

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Scanner   │     │  Concelier  │     │   Attestor  │
│   (SBOM)    │     │   (Vuln)    │     │   (DSSE)    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       │ SBOM Ingested     │ Vuln Updated      │ Attestation Created
       ▼                   ▼                   ▼
┌──────────────────────────────────────────────────────┐
│               AnalyticsIngestionService              │
│  - Normalize components (PURL, supplier, license)    │
│  - Upsert to unified registry                        │
│  - Correlate with vulnerabilities                    │
│  - Store raw payloads                                │
└──────────────────────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────────────────────┐
│                 analytics schema                     │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌────────────┐ │
│  │components│ │artifacts│ │comp_vuln│ │attestations│ │
│  └─────────┘ └─────────┘ └─────────┘ └────────────┘ │
└──────────────────────────────────────────────────────┘
       │
       │ Daily refresh
       ▼
┌──────────────────────────────────────────────────────┐
│              Materialized Views                      │
│  mv_supplier_concentration | mv_license_distribution │
│  mv_vuln_exposure          | mv_attestation_coverage │
└──────────────────────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────────────────────┐
│              Platform API Endpoints                  │
│              (with 5-minute caching)                 │
└──────────────────────────────────────────────────────┘

Normalization Rules

PURL Parsing

Package URLs (PURLs) are the canonical identifier for components. The parse_purl() function extracts:

| Field | Example | Notes |
|---|---|---|
| purl_type | maven, npm, pypi | Ecosystem identifier |
| purl_namespace | org.apache.logging | Group/org/scope (optional) |
| purl_name | log4j-core | Package name |
| purl_version | 2.17.1 | Version string |
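
As a rough illustration of the extraction, here is a minimal Python sketch. It is not the module's parse_purl() implementation. It assumes the common pkg:type/namespace/name@version shape, discards qualifiers and subpaths, and expects scopes such as @angular to be percent-encoded. ParsedPurl is a hypothetical type introduced for this example.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class ParsedPurl(NamedTuple):
    """Hypothetical result type mirroring the four fields in the table."""
    purl_type: str
    purl_namespace: Optional[str]
    purl_name: str
    purl_version: Optional[str]


def parse_purl(purl: str) -> ParsedPurl:
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a package URL: {purl!r}")
    # Drop the subpath (#...) and qualifiers (?...); this sketch ignores both.
    body = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0]
    # The version follows the last '@'. A percent-encoded scope ('%40angular')
    # contains no literal '@', so it is not mistaken for the separator.
    path, sep, version = body.rpartition("@")
    if not sep:
        path, version = body, ""
    segments = [unquote(s) for s in path.strip("/").split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"missing type or name: {purl!r}")
    purl_type = segments[0].lower()
    # Everything between the type and the name is treated as the namespace.
    namespace = "/".join(segments[1:-1]) or None
    return ParsedPurl(purl_type, namespace, segments[-1], unquote(version) or None)
```

For the table's example, parse_purl("pkg:maven/org.apache.logging/log4j-core@2.17.1") yields the type maven, namespace org.apache.logging, name log4j-core, and version 2.17.1.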

Supplier Normalization

The normalize_supplier() function standardizes supplier names for consistent grouping:

  1. Convert to lowercase
  2. Trim whitespace
  3. Remove legal suffixes: Inc., LLC, Ltd., Corp., GmbH, B.V., S.A., PLC, Co.
  4. Normalize internal whitespace

Examples:

  • "Apache Software Foundation, Inc." → "apache software foundation"
  • "Google LLC" → "google"
  • " Microsoft Corp. " → "microsoft"
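
The four steps can be sketched in Python as follows. This is an illustrative approximation, not the module's normalize_supplier(). It additionally assumes that a comma or whitespace separating the suffix from the name is removed along with it, which the first example implies but the steps do not state.

```python
import re

# Legal suffixes from the list above, matched at the end of the name,
# together with the comma/whitespace that separates them from it.
_SUFFIX_RE = re.compile(r"[\s,]+(?:inc|llc|ltd|corp|gmbh|b\.v|s\.a|plc|co)\.?$")


def normalize_supplier(name: str) -> str:
    s = name.lower().strip()          # steps 1 and 2
    while True:                       # step 3, repeated for "Co., Ltd."
        stripped = _SUFFIX_RE.sub("", s)
        if stripped == s:
            break
        s = stripped
    return re.sub(r"\s+", " ", s).strip()  # step 4
```

Because the function depends only on its input, the same supplier string always maps to the same normalized key, which is what makes grouping stable across ingestion runs.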

License Categorization

The categorize_license() function maps SPDX expressions to risk categories:

| Category | Examples | Risk Level |
|---|---|---|
| permissive | MIT, Apache-2.0, BSD-3-Clause, ISC | Low |
| copyleft-weak | LGPL-2.1, MPL-2.0, EPL-2.0 | Medium |
| copyleft-strong | GPL-3.0, AGPL-3.0, SSPL | High |
| proprietary | Proprietary, Commercial | Review Required |
| unknown | Unrecognized expressions | Review Required |

Special handling:

  • GPL with exceptions (e.g., GPL-2.0 WITH Classpath-exception-2.0) → copyleft-weak
  • Dual-licensed (e.g., MIT OR Apache-2.0) → uses first match
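
A minimal Python sketch of these rules is shown below. It is not the module's categorize_license(). The lookup table contains only the example identifiers listed above, AND expressions are not handled, and "first match" is interpreted here (as an assumption) to mean the first OR operand that maps to a known category.

```python
import re

# Limited to the examples in the table; a real mapping would cover
# many more SPDX identifiers.
_CATEGORIES = {
    "permissive": {"MIT", "Apache-2.0", "BSD-3-Clause", "ISC"},
    "copyleft-weak": {"LGPL-2.1", "MPL-2.0", "EPL-2.0"},
    "copyleft-strong": {"GPL-3.0", "AGPL-3.0", "SSPL"},
    "proprietary": {"Proprietary", "Commercial"},
}


def _categorize_single(term: str) -> str:
    # "GPL-x WITH <exception>" is downgraded to copyleft-weak.
    if re.match(r"^GPL-\S+\s+WITH\s+\S+", term):
        return "copyleft-weak"
    for category, ids in _CATEGORIES.items():
        if term in ids:
            return category
    return "unknown"


def categorize_license(expression: str) -> str:
    # Assumption: the first OR operand with a known category wins.
    for term in re.split(r"\s+OR\s+", expression.strip()):
        category = _categorize_single(term.strip("() "))
        if category != "unknown":
            return category
    return "unknown"
```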

Component Deduplication

Components are deduplicated by (purl, hash_sha256):

  1. If same PURL and hash: existing record updated (last_seen_at, counts)
  2. If same PURL but different hash: new record created (version change)
  3. If same hash but different PURL: new record (aliased package)

Upsert pattern:

INSERT INTO analytics.components (...)
VALUES (...)
ON CONFLICT (purl, hash_sha256) DO UPDATE SET
  last_seen_at = now(),
  sbom_count = components.sbom_count + 1,
  updated_at = now();
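
To make the three deduplication cases concrete, the following Python sketch models the registry as an in-memory dictionary keyed by (purl, hash_sha256). It only mirrors the conflict behavior of the SQL above. ComponentRecord, upsert_component, and the initial sbom_count of 1 are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


@dataclass
class ComponentRecord:
    purl: str
    hash_sha256: str
    sbom_count: int
    last_seen_at: datetime


Registry = Dict[Tuple[str, str], ComponentRecord]


def upsert_component(registry: Registry, purl: str, hash_sha256: str,
                     now: Optional[datetime] = None) -> ComponentRecord:
    now = now or datetime.now(timezone.utc)
    key = (purl, hash_sha256)
    record = registry.get(key)
    if record is None:
        # Cases 2 and 3: either half of the key differs, so a new row is created.
        record = ComponentRecord(purl, hash_sha256, 1, now)
        registry[key] = record
    else:
        # Case 1: same PURL and hash, so the existing row is updated in place.
        record.sbom_count += 1
        record.last_seen_at = now
    return record
```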

Vulnerability Correlation

When a component is upserted, the VulnerabilityCorrelationService queries Concelier for matching advisories:

  1. Query by PURL type + namespace + name
  2. Filter by version range matching
  3. Upsert to component_vulns with severity, EPSS, KEV flags

Version range matching uses Concelier's existing logic to handle:

  • Semver ranges: >=1.0.0 <2.0.0
  • Exact versions: 1.2.3
  • Wildcards: 1.x
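
The three range forms can be illustrated with a simplified matcher. The sketch below is not Concelier's logic. It assumes purely numeric dotted versions with the same number of parts (no pre-release or build tags) and space-separated comparator clauses that must all hold.

```python
import operator

_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
        "<": operator.lt, "=": operator.eq}


def _parse(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))


def _satisfies(v: tuple, clause: str) -> bool:
    # Two-character operators are tested first so ">=" is not read as ">".
    for symbol in (">=", "<=", ">", "<", "="):
        if clause.startswith(symbol):
            return _OPS[symbol](v, _parse(clause[len(symbol):]))
    parts = clause.split(".")
    if "x" in parts or "*" in parts:
        # Wildcard: compare only the parts before the first wildcard.
        prefix = []
        for part in parts:
            if part in ("x", "*"):
                break
            prefix.append(int(part))
        return v[:len(prefix)] == tuple(prefix)
    return v == _parse(clause)  # exact version


def version_in_range(version: str, spec: str) -> bool:
    v = _parse(version)
    return all(_satisfies(v, clause) for clause in spec.split())
```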

VEX Override Logic

The mv_vuln_exposure view implements VEX-adjusted counts:

-- Effective count excludes artifacts with active VEX overrides
COUNT(DISTINCT ac.artifact_id) FILTER (
  WHERE NOT EXISTS (
    SELECT 1 FROM analytics.vex_overrides vo
    WHERE vo.artifact_id = ac.artifact_id
      AND vo.vuln_id = cv.vuln_id
      AND vo.status = 'not_affected'
      AND (vo.valid_until IS NULL OR vo.valid_until > now())
  )
) AS effective_artifact_count

Override validity:

  • valid_from: When the override became effective
  • valid_until: Expiration (NULL = no expiration)
  • Only status = 'not_affected' reduces exposure counts
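
The same filtering can be expressed outside SQL. The Python sketch below mirrors the conditions in the view fragment (status and valid_until only, exactly as the SQL does). VexOverride and effective_artifact_count are hypothetical names for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class VexOverride:
    artifact_id: str
    vuln_id: str
    status: str
    valid_until: Optional[datetime] = None  # None means no expiration


def _is_active_not_affected(override: VexOverride, now: datetime) -> bool:
    return override.status == "not_affected" and (
        override.valid_until is None or override.valid_until > now
    )


def effective_artifact_count(artifact_ids: Iterable[str], vuln_id: str,
                             overrides: Iterable[VexOverride],
                             now: datetime) -> int:
    # Artifacts with an active not_affected override for this vulnerability
    # are excluded; every other artifact still counts as exposed.
    suppressed = {
        o.artifact_id for o in overrides
        if o.vuln_id == vuln_id and _is_active_not_affected(o, now)
    }
    return len(set(artifact_ids) - suppressed)
```

An expired override, an override with a different status, or an override for another vulnerability leaves the artifact in the effective count.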

Time-Series Rollups

Daily rollups are computed by compute_daily_rollups():

Vulnerability counts (per environment/team/severity):

  • total_vulns: All affecting vulnerabilities
  • fixable_vulns: Vulns with fix_available = TRUE
  • vex_mitigated: Vulns with active not_affected override
  • kev_vulns: Vulns in CISA KEV
  • unique_cves: Distinct CVE IDs
  • affected_artifacts: Artifacts containing affected components
  • affected_components: Components with affecting vulns

Component counts (per environment/team/license/type):

  • total_components: Distinct components
  • unique_suppliers: Distinct normalized suppliers

Retention policy: 90 days in hot storage; older data archived to cold storage.

Materialized View Refresh

All materialized views support REFRESH MATERIALIZED VIEW CONCURRENTLY, so readers are not blocked while a view is rebuilt (PostgreSQL requires a unique index on each view for this):

-- Refresh all views (run daily via pg_cron or Scheduler)
SELECT analytics.refresh_all_views();

Refresh schedule (recommended):

  • mv_supplier_concentration: 02:00 UTC daily
  • mv_license_distribution: 02:15 UTC daily
  • mv_vuln_exposure: 02:30 UTC daily
  • mv_attestation_coverage: 02:45 UTC daily
  • compute_daily_rollups(): 03:00 UTC daily
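
If the schedule is driven by a cron-style scheduler, the staggered times translate into five-field cron expressions. The helper below is a small sketch, assuming the scheduler evaluates expressions in UTC. SCHEDULE_UTC and to_cron are names introduced for this example, and how each job is actually registered is left open.

```python
# Recommended times from the list above, as HH:MM in UTC.
SCHEDULE_UTC = {
    "mv_supplier_concentration": "02:00",
    "mv_license_distribution": "02:15",
    "mv_vuln_exposure": "02:30",
    "mv_attestation_coverage": "02:45",
    "compute_daily_rollups()": "03:00",
}


def to_cron(hhmm: str) -> str:
    """Convert 'HH:MM' into a daily cron expression (minute hour * * *)."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError(f"invalid time of day: {hhmm!r}")
    return f"{minute} {hour} * * *"


CRON_EXPRESSIONS = {job: to_cron(t) for job, t in SCHEDULE_UTC.items()}
```

The 15-minute spacing keeps the refreshes from overlapping, and the rollups run last so they read freshly refreshed views.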

Performance Considerations

Indexing Strategy

| Table | Key Indexes | Query Pattern |
|---|---|---|
| components | purl, supplier_normalized, license_category | Lookup, aggregation |
| artifacts | digest, environment, team | Lookup, filtering |
| component_vulns | vuln_id, severity, fix_available | Join, filtering |
| attestations | artifact_id, predicate_type | Join, aggregation |
| vex_overrides | (artifact_id, vuln_id), status | Subquery exists |

Query Performance Targets

| Query | Target | Notes |
|---|---|---|
| sp_top_suppliers(20) | < 100ms | Uses materialized view |
| sp_license_heatmap() | < 100ms | Uses materialized view |
| sp_vuln_exposure() | < 200ms | Uses materialized view |
| sp_fixable_backlog() | < 500ms | Live query with indexes |
| sp_attestation_gaps() | < 100ms | Uses materialized view |

Caching Strategy

Platform API endpoints use a 5-minute TTL cache:

  • Cache key: endpoint + query parameters
  • Invalidation: Time-based only (no event-driven invalidation)
  • Storage: Valkey (in-memory)
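
The caching behavior can be sketched as follows, with a plain dictionary standing in for Valkey. This is an assumption-laden illustration rather than the platform's cache layer: cache_key and TtlCache are hypothetical names, and sorting the query parameters is one possible way to make equivalent requests share an entry.

```python
import time
from typing import Any, Callable, Dict, Mapping, Tuple
from urllib.parse import urlencode


def cache_key(endpoint: str, params: Mapping[str, str]) -> str:
    # Sort parameters so ?a=1&b=2 and ?b=2&a=1 map to the same entry.
    return f"{endpoint}?{urlencode(sorted(params.items()))}"


class TtlCache:
    """Time-based expiry only; nothing invalidates an entry early."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

With time-based expiry only, a response can be up to five minutes stale after an ingestion event. Given that the underlying views refresh daily, that staleness is small by comparison.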

Security Considerations

Schema Permissions

-- Read-only role for dashboards
GRANT USAGE ON SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA analytics TO dashboard_reader;

-- Write role for ingestion service
GRANT USAGE ON SCHEMA analytics TO analytics_writer;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA analytics TO analytics_writer;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA analytics TO analytics_writer;

Data Classification

| Table | Classification | Notes |
|---|---|---|
| components | Internal | Contains package names, versions |
| artifacts | Internal | Contains image names, team names |
| component_vulns | Internal | Vulnerability data (public CVEs) |
| vex_overrides | Confidential | Contains justifications, operator IDs |
| raw_sboms | Confidential | Full SBOM payloads |
| raw_attestations | Confidential | Signed attestation envelopes |

Audit Trail

All tables include created_at and updated_at timestamps. Raw payload tables (raw_sboms, raw_attestations) are append-only with content hashes for integrity verification.
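
As an illustration of hash-based integrity checking on an append-only table, the Python sketch below stores each payload with a digest and re-verifies it later. The choice of SHA-256 is an assumption (the text only says "content hashes"), and RawPayload and AppendOnlyStore are hypothetical names for this example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class RawPayload:
    content_hash: str   # hex SHA-256 of the payload (assumed algorithm)
    payload: bytes
    created_at: datetime


class AppendOnlyStore:
    """Exposes append and verify only; there is no update or delete."""

    def __init__(self) -> None:
        self._rows: List[RawPayload] = []

    def append(self, payload: bytes) -> RawPayload:
        row = RawPayload(hashlib.sha256(payload).hexdigest(), payload,
                         datetime.now(timezone.utc))
        self._rows.append(row)
        return row

    def verify_all(self) -> bool:
        # Recompute each digest and compare it with the stored hash.
        return all(hashlib.sha256(r.payload).hexdigest() == r.content_hash
                   for r in self._rows)
```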

Integration Points

Upstream Dependencies

| Service | Event | Action |
|---|---|---|
| Scanner | SBOM ingested | Normalize and upsert components |
| Concelier | Advisory updated | Re-correlate affected components |
| Excititor | VEX observation | Create/update vex_overrides |
| Attestor | Attestation created | Upsert attestation record |

Downstream Consumers

| Consumer | Data | Endpoint |
|---|---|---|
| Console UI | Dashboard data | /api/analytics/* |
| Export Center | Compliance reports | Direct DB query |
| AdvisoryAI | Risk context | /api/analytics/vulnerabilities |

Future Enhancements

  1. Partitioning: Partition daily_* tables by date for faster queries and archival
  2. Incremental refresh: Implement incremental materialized view refresh for large datasets
  3. Custom dimensions: Support user-defined component groupings (business units, cost centers)
  4. Predictive analytics: Add ML-based risk prediction using historical trends
  5. BI tool integration: Direct connectors for Tableau, Looker, Metabase