git.stella-ops.org/docs/modules/analytics/architecture.md
2026-01-22 19:08:46 +02:00

Analytics Module Architecture

Design Philosophy

The Analytics module implements a star-schema data warehouse pattern optimized for analytical queries rather than transactional workloads. Key design principles:

  1. Separation of concerns: Analytics schema is isolated from operational schemas (scanner, vex, proof_system)
  2. Pre-computation: Expensive aggregations computed in advance via materialized views
  3. Audit trail: Raw payloads preserved for reprocessing and compliance
  4. Determinism: Normalization functions are immutable and reproducible; array aggregates are ordered for stable outputs
  5. Incremental updates: Supports both full refresh and incremental ingestion

Data Flow

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Scanner   │     │  Concelier  │     │   Attestor  │
│   (SBOM)    │     │   (Vuln)    │     │   (DSSE)    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       │ SBOM Ingested     │ Vuln Updated      │ Attestation Created
       ▼                   ▼                   ▼
┌──────────────────────────────────────────────────────┐
│               AnalyticsIngestionService              │
│  - Normalize components (PURL, supplier, license)    │
│  - Upsert to unified registry                        │
│  - Correlate with vulnerabilities                    │
│  - Store raw payloads                                │
└──────────────────────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────────────────────┐
│                 analytics schema                     │
│  ┌──────────┐ ┌─────────┐ ┌─────────┐ ┌────────────┐ │
│  │components│ │artifacts│ │comp_vuln│ │attestations│ │
│  └──────────┘ └─────────┘ └─────────┘ └────────────┘ │
└──────────────────────────────────────────────────────┘
       │
       │ Daily refresh
       ▼
┌──────────────────────────────────────────────────────┐
│              Materialized Views                      │
│  mv_supplier_concentration | mv_license_distribution │
│  mv_vuln_exposure          | mv_attestation_coverage │
└──────────────────────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────────────────────┐
│              Platform API Endpoints                  │
│              (with 5-minute caching)                 │
└──────────────────────────────────────────────────────┘

Normalization Rules

PURL Parsing

Package URLs (PURLs) are the canonical identifier for components. The parse_purl() function extracts:

| Field | Example | Notes |
| --- | --- | --- |
| purl_type | maven, npm, pypi | Ecosystem identifier |
| purl_namespace | org.apache.logging | Group/org/scope (optional) |
| purl_name | log4j-core | Package name |
| purl_version | 2.17.1 | Version string |
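
The field split above can be illustrated with a minimal Python sketch. It is not the module's parse_purl() function: the ParsedPurl type, the error handling, and the decision to ignore qualifiers and subpaths are assumptions made for illustration.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class ParsedPurl(NamedTuple):
    """Hypothetical container for the four fields in the table above."""
    purl_type: str
    purl_namespace: Optional[str]
    purl_name: str
    purl_version: Optional[str]


def parse_purl(purl: str) -> ParsedPurl:
    """Split pkg:type/namespace/name@version into its parts (sketch only)."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a PURL: {purl!r}")
    # Drop the scheme, then the optional '#subpath' and '?qualifiers' parts.
    body = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0].strip("/")
    # The last path segment carries the name and the optional '@version'.
    head, _, last = body.rpartition("/")
    if not head or not last:
        raise ValueError(f"PURL needs a type and a name: {purl!r}")
    name, _, version = last.partition("@")
    type_and_namespace = head.split("/")
    purl_type = type_and_namespace[0].lower()
    # Everything between the type and the name is the namespace (may be empty).
    namespace = "/".join(unquote(s) for s in type_and_namespace[1:]) or None
    return ParsedPurl(purl_type, namespace, unquote(name), unquote(version) or None)
```

For example, parse_purl("pkg:pypi/requests") yields a record with no namespace and no version, which is why both fields are optional in this sketch.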

Supplier Normalization

The normalize_supplier() function standardizes supplier names for consistent grouping:

  1. Convert to lowercase
  2. Trim whitespace
  3. Remove legal suffixes: Inc., LLC, Ltd., Corp., GmbH, B.V., S.A., PLC, Co.
  4. Normalize internal whitespace

Examples:

  • "Apache Software Foundation, Inc." → "apache software foundation"
  • "Google LLC" → "google"
  • " Microsoft Corp. " → "microsoft"
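
The four steps can be sketched in Python as follows. This is an illustration of the listed rules, not the actual normalize_supplier() implementation; the regular expression, and the assumption that a suffix is removed only at the end of the name together with any comma in front of it, are mine.

```python
import re

# Legal suffixes from step 3, lowercased and without the trailing dot.
_LEGAL_SUFFIXES = ("inc", "llc", "ltd", "corp", "gmbh", "b.v", "s.a", "plc", "co")

# A suffix must follow whitespace or a comma and sit at the end of the name.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(re.escape(s) for s in _LEGAL_SUFFIXES) + r")\.?$"
)


def normalize_supplier(name: str) -> str:
    """Apply the four normalization steps in order (sketch only)."""
    result = name.lower()                  # 1. lowercase
    result = result.strip()                # 2. trim
    result = _SUFFIX_RE.sub("", result)    # 3. remove one trailing legal suffix
    result = re.sub(r"\s+", " ", result)   # 4. collapse internal whitespace
    return result.strip(" ,")
```

Anchoring the pattern at the end of the string keeps names such as "cisco" intact even though they end in "co".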

License Categorization

The categorize_license() function maps SPDX expressions to risk categories:

| Category | Examples | Risk Level |
| --- | --- | --- |
| permissive | MIT, Apache-2.0, BSD-3-Clause, ISC | Low |
| copyleft-weak | LGPL-2.1, MPL-2.0, EPL-2.0 | Medium |
| copyleft-strong | GPL-3.0, AGPL-3.0, SSPL | High |
| proprietary | Proprietary, Commercial | Review Required |
| unknown | Unrecognized expressions | Review Required |

Special handling:

  • GPL with exceptions (e.g., GPL-2.0 WITH Classpath-exception-2.0) → copyleft-weak
  • Dual-licensed (e.g., MIT OR Apache-2.0) → uses first match
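
A minimal Python sketch of the mapping and the two special cases is shown below. The lookup table contains only the identifiers listed above, and two points are assumptions: "first match" is read as "the first OR operand that maps to a known category", and AND expressions are not handled.

```python
import re

# Only the identifiers from the table above; the real mapping is presumably longer.
_CATEGORIES = {
    "permissive": ("MIT", "Apache-2.0", "BSD-3-Clause", "ISC"),
    "copyleft-weak": ("LGPL-2.1", "MPL-2.0", "EPL-2.0"),
    "copyleft-strong": ("GPL-3.0", "AGPL-3.0", "SSPL"),
    "proprietary": ("Proprietary", "Commercial"),
}
_LOOKUP = {
    license_id.lower(): category
    for category, ids in _CATEGORIES.items()
    for license_id in ids
}


def _categorize_single(term: str) -> str:
    """Categorize one license term, optionally carrying a WITH exception."""
    license_id, _, exception = term.partition(" WITH ")
    license_id = license_id.strip().strip("()")
    # GPL with an exception is downgraded to weak copyleft.
    if exception and license_id.upper().startswith("GPL-"):
        return "copyleft-weak"
    return _LOOKUP.get(license_id.lower(), "unknown")


def categorize_license(expression: str) -> str:
    """Return the category of the first OR operand that is recognized."""
    for term in re.split(r"\s+OR\s+", expression.strip()):
        category = _categorize_single(term)
        if category != "unknown":
            return category
    return "unknown"
```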

Component Deduplication

Components are deduplicated by (purl, hash_sha256):

  1. If same PURL and hash: existing record updated (last_seen_at, counts)
  2. If same PURL but different hash: new record created (version change)
  3. If same hash but different PURL: new record (aliased package)

Upsert pattern:

INSERT INTO analytics.components (...)
VALUES (...)
ON CONFLICT (purl, hash_sha256) DO UPDATE SET
  last_seen_at = now(),
  sbom_count = components.sbom_count + 1,
  updated_at = now();
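
The three deduplication cases can be modelled in memory with a dictionary keyed by (purl, hash_sha256). The Python sketch below mirrors the upsert semantics only; the ComponentRecord type and the initial sbom_count of 1 are assumptions, not the table definition.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


@dataclass
class ComponentRecord:
    """Hypothetical subset of the columns touched by the upsert."""
    purl: str
    hash_sha256: str
    sbom_count: int
    last_seen_at: datetime


Registry = Dict[Tuple[str, str], ComponentRecord]


def upsert_component(
    registry: Registry,
    purl: str,
    hash_sha256: str,
    now: Optional[datetime] = None,
) -> ComponentRecord:
    """Insert a new record or bump the counters of the existing one."""
    now = now or datetime.now(timezone.utc)
    key = (purl, hash_sha256)  # same role as ON CONFLICT (purl, hash_sha256)
    record = registry.get(key)
    if record is None:
        # Cases 2 and 3: a new PURL or a new hash always produces a new row.
        record = ComponentRecord(purl, hash_sha256, sbom_count=1, last_seen_at=now)
        registry[key] = record
    else:
        # Case 1: same PURL and hash updates the existing row in place.
        record.sbom_count += 1
        record.last_seen_at = now
    return record
```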

Vulnerability Correlation

When a component is upserted, the VulnerabilityCorrelationService queries Concelier for matching advisories:

  1. Query by PURL type + namespace + name
  2. Filter by version range matching
  3. Upsert to component_vulns with severity, EPSS, KEV flags

Version range matching currently supports semver ranges and exact matches via VersionRuleEvaluator. Non-semver schemes fall back to exact string matches; wildcard and ecosystem-specific ranges require upstream normalization.
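
The matching behaviour described above can be sketched in Python. This is not VersionRuleEvaluator: the function name, the introduced/fixed/exact parameters, the half-open range [introduced, fixed), and the restriction to plain MAJOR.MINOR.PATCH versions without pre-release tags are all assumptions for illustration.

```python
import re
from typing import Optional, Tuple

_SEMVER_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def _parse_semver(version: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) or None when the string is not plain semver."""
    match = _SEMVER_RE.match(version)
    return tuple(int(part) for part in match.groups()) if match else None


def is_affected(
    version: str,
    *,
    introduced: Optional[str] = None,
    fixed: Optional[str] = None,
    exact: Optional[str] = None,
) -> bool:
    """Evaluate one rule: an exact version, or a range introduced <= v < fixed."""
    if exact is not None:
        # Exact rules are plain string comparisons and work for any scheme.
        return version == exact
    parsed = _parse_semver(version)
    lower = _parse_semver(introduced) if introduced is not None else None
    upper = _parse_semver(fixed) if fixed is not None else None
    unparsable_bound = (introduced is not None and lower is None) or (
        fixed is not None and upper is None
    )
    if parsed is None or unparsable_bound:
        # Non-semver input: a range cannot be evaluated without normalization.
        return False
    if lower is not None and parsed < lower:
        return False
    if upper is not None and parsed >= upper:
        return False
    return True
```

Comparing integer tuples rather than strings is what makes 2.14.0 sort after 2.9.0.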

VEX Override Logic

The mv_vuln_exposure view implements VEX-adjusted counts:

-- Effective count excludes artifacts with active VEX overrides
COUNT(DISTINCT ac.artifact_id) FILTER (
  WHERE NOT EXISTS (
    SELECT 1 FROM analytics.vex_overrides vo
    WHERE vo.artifact_id = ac.artifact_id
      AND vo.vuln_id = cv.vuln_id
      AND vo.status = 'not_affected'
      -- Override must already be in effect and not yet expired
      AND (vo.valid_from IS NULL OR vo.valid_from <= now())
      AND (vo.valid_until IS NULL OR vo.valid_until > now())
  )
) AS effective_artifact_count

Override validity:

  • valid_from: When the override became effective
  • valid_until: Expiration (NULL = no expiration)
  • Only status = 'not_affected' reduces exposure counts, and only when the override is active in its validity window.
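
The validity rules can be restated as a small Python sketch. The VexOverride type and function names are hypothetical, and treating a missing valid_from as "already in effect" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class VexOverride:
    """Hypothetical in-memory view of a vex_overrides row."""
    artifact_id: str
    vuln_id: str
    status: str
    valid_from: Optional[datetime] = None
    valid_until: Optional[datetime] = None


def is_active(override: VexOverride, now: datetime) -> bool:
    """True when the override is not_affected and inside its validity window."""
    if override.status != "not_affected":
        return False
    if override.valid_from is not None and override.valid_from > now:
        return False  # not yet effective
    return override.valid_until is None or override.valid_until > now


def effective_artifact_count(
    artifact_ids: Iterable[str],
    vuln_id: str,
    overrides: Iterable[VexOverride],
    now: datetime,
) -> int:
    """Count distinct artifacts that have no active override for vuln_id."""
    suppressed = {
        o.artifact_id for o in overrides if o.vuln_id == vuln_id and is_active(o, now)
    }
    return len(set(artifact_ids) - suppressed)
```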

Attestation Ingestion

Attestation ingestion consumes Attestor Rekor entry events and expects Sigstore bundles or raw DSSE envelopes. The ingestion service:

  • Resolves bundle URIs using BundleUriTemplate; bundle:{digest} maps to cas://<DefaultBucket>/{digest} by default.
  • Decodes DSSE payloads, computes dsse_payload_hash, and records predicate_uri plus Rekor log metadata (rekor_log_id, rekor_log_index).
  • Uses in-toto subject digests to link artifacts when reanalysis hints are absent.
  • Maps predicate URIs into analytics_attestation_type values (provenance, sbom, vex, build, scan, policy).
  • Expands VEX statements into vex_overrides rows, one per product reference, and captures optional validity timestamps when provided.
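
The first three bullets can be sketched in Python using the public DSSE envelope and in-toto statement layouts. The function names, the {bucket}/{digest} placeholder syntax, and the hex encoding of dsse_payload_hash are assumptions; the real BundleUriTemplate format and hash encoding may differ.

```python
import base64
import hashlib
import json
from typing import Any, Dict


def resolve_bundle_uri(
    uri: str,
    default_bucket: str,
    template: str = "cas://{bucket}/{digest}",  # placeholder syntax is assumed
) -> str:
    """Map bundle:{digest} onto the content-addressed store; pass other URIs through."""
    prefix = "bundle:"
    if uri.startswith(prefix):
        return template.format(bucket=default_bucket, digest=uri[len(prefix):])
    return uri


def summarize_dsse(envelope: Dict[str, Any]) -> Dict[str, Any]:
    """Decode a DSSE envelope and pull out the fields recorded at ingestion."""
    # DSSE carries the payload base64-encoded; this sketch assumes the standard alphabet.
    payload = base64.b64decode(envelope["payload"])
    statement = json.loads(payload)
    return {
        "dsse_payload_hash": hashlib.sha256(payload).hexdigest(),
        "predicate_uri": statement.get("predicateType"),
        # in-toto subjects carry a digest map; only sha256 entries are collected here.
        "subject_digests": [
            subject["digest"]["sha256"]
            for subject in statement.get("subject", [])
            if "sha256" in subject.get("digest", {})
        ],
    }
```

Signature verification is deliberately left out of the sketch.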

Time-Series Rollups

Daily rollups computed by compute_daily_rollups():

Vulnerability counts (per environment/team/severity):

  • total_vulns: All affecting vulnerabilities
  • fixable_vulns: Vulns with fix_available = TRUE
  • vex_mitigated: Vulns with active not_affected override
  • kev_vulns: Vulns in CISA KEV
  • unique_cves: Distinct CVE IDs
  • affected_artifacts: Artifacts containing affected components
  • affected_components: Components with affecting vulns

Component counts (per environment/team/license/type):

  • total_components: Distinct components
  • unique_suppliers: Distinct normalized suppliers

Retention policy: 90 days in hot storage; compute_daily_rollups() prunes older rows and downstream jobs archive to cold storage.
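
To make the vulnerability metrics and the retention window concrete, here is a minimal Python sketch for one environment/team/severity group. It is not compute_daily_rollups(): the Finding type is hypothetical, and counting total_vulns per finding row (rather than per distinct vulnerability) is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Iterable


@dataclass(frozen=True)
class Finding:
    """Hypothetical row: one vulnerability affecting one component in one artifact."""
    artifact_id: str
    component_id: str
    cve_id: str
    fix_available: bool
    vex_mitigated: bool
    in_kev: bool


def rollup(findings: Iterable[Finding]) -> Dict[str, int]:
    """Compute the vulnerability counters listed above for a single group."""
    rows = list(findings)
    return {
        "total_vulns": len(rows),
        "fixable_vulns": sum(f.fix_available for f in rows),
        "vex_mitigated": sum(f.vex_mitigated for f in rows),
        "kev_vulns": sum(f.in_kev for f in rows),
        "unique_cves": len({f.cve_id for f in rows}),
        "affected_artifacts": len({f.artifact_id for f in rows}),
        "affected_components": len({f.component_id for f in rows}),
    }


def retention_cutoff(today: date, hot_days: int = 90) -> date:
    """Rows dated before this day fall outside the hot-storage window."""
    return today - timedelta(days=hot_days)
```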

Materialized View Refresh

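All materialized views support REFRESH MATERIALIZED VIEW CONCURRENTLY for zero-downtime updates. The refresh_all_views() helper below refreshes them non-concurrently, so run it off-peak: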

-- Refresh all views (non-concurrent; run off-peak)
SELECT analytics.refresh_all_views();

Refresh schedule (recommended):

  • mv_supplier_concentration: 02:00 UTC daily
  • mv_license_distribution: 02:15 UTC daily
  • mv_vuln_exposure: 02:30 UTC daily
  • mv_attestation_coverage: 02:45 UTC daily
  • compute_daily_rollups(): 03:00 UTC daily

Platform WebService can run the daily rollup + refresh loop via PlatformAnalyticsMaintenanceService. Configure the schedule with:

  • Platform:AnalyticsMaintenance:Enabled (default true)
  • Platform:AnalyticsMaintenance:IntervalMinutes (default 1440)
  • Platform:AnalyticsMaintenance:RunOnStartup (default true)
  • Platform:AnalyticsMaintenance:ComputeDailyRollups (default true)
  • Platform:AnalyticsMaintenance:RefreshMaterializedViews (default true)
  • Platform:AnalyticsMaintenance:BackfillDays (default 0, which disables backfill; when set to N, the first maintenance run recomputes the most recent N days)

The hosted service issues concurrent refresh statements directly for each view. Use a DB scheduler (pg_cron) or external orchestrator if you need the staggered per-view timing above.
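
If the staggered timing is needed, the per-view statements and their cron expressions can be generated from the recommended schedule. The Python sketch below only builds strings for a scheduler to consume; the helper name and the assumption that the views live in the analytics schema are mine.

```python
from typing import List, Tuple

# (view name, hour, minute) in UTC, taken from the recommended schedule above.
VIEW_SCHEDULE = (
    ("mv_supplier_concentration", 2, 0),
    ("mv_license_distribution", 2, 15),
    ("mv_vuln_exposure", 2, 30),
    ("mv_attestation_coverage", 2, 45),
)


def refresh_plan(schema: str = "analytics") -> List[Tuple[str, str]]:
    """Return (cron expression, SQL statement) pairs, one per materialized view."""
    plan = []
    for view, hour, minute in VIEW_SCHEDULE:
        cron = f"{minute} {hour} * * *"  # minute hour day-of-month month day-of-week
        sql = f"REFRESH MATERIALIZED VIEW CONCURRENTLY {schema}.{view};"
        plan.append((cron, sql))
    return plan
```

PostgreSQL accepts CONCURRENTLY only for materialized views that have at least one unique index, so each view in the plan needs one.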

Performance Considerations

Indexing Strategy

| Table | Key Indexes | Query Pattern |
| --- | --- | --- |
| components | purl, supplier_normalized, license_category | Lookup, aggregation |
| artifacts | digest, environment, team | Lookup, filtering |
| component_vulns | vuln_id, severity, fix_available | Join, filtering |
| attestations | artifact_id, predicate_type | Join, aggregation |
| vex_overrides | (artifact_id, vuln_id), status | Subquery exists |

Query Performance Targets

| Query | Target | Notes |
| --- | --- | --- |
| sp_top_suppliers(20, 'prod') | < 100ms | Uses materialized view when env is null; env filter reads base tables |
| sp_license_heatmap('prod') | < 100ms | Uses materialized view when env is null; env filter reads base tables |
| sp_vuln_exposure() | < 200ms | Uses materialized view for global queries; environment filters read base tables |
| sp_fixable_backlog() | < 500ms | Live query with indexes |
| sp_attestation_gaps() | < 100ms | Uses materialized view |

Caching Strategy

Platform API endpoints use a 5-minute TTL cache:

  • Cache key: endpoint + query parameters
  • Invalidation: Time-based only (no event-driven invalidation)
  • Storage: Valkey (in-memory)
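
The caching behaviour can be sketched in Python with an in-process dictionary standing in for Valkey. The TtlCache class is hypothetical, and sorting the query parameters so that their order does not change the key is an assumption.

```python
import time
from typing import Any, Callable, Dict, Mapping, Tuple
from urllib.parse import urlencode


class TtlCache:
    """Time-based cache: entries expire after ttl_seconds and are never invalidated early."""

    def __init__(self, ttl_seconds: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def make_key(endpoint: str, params: Mapping[str, str]) -> str:
        """Cache key is the endpoint plus its query parameters in sorted order."""
        return f"{endpoint}?{urlencode(sorted(params.items()))}"

    def get_or_compute(
        self, endpoint: str, params: Mapping[str, str], compute: Callable[[], Any]
    ) -> Any:
        key = self.make_key(endpoint, params)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh
        value = compute()  # miss or expired: recompute and store with a new deadline
        self._entries[key] = (now + self._ttl, value)
        return value
```

Because invalidation is time-based only, a response can lag the underlying views by up to the full TTL.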

Security Considerations

Schema Permissions

-- Read-only role for dashboards
GRANT USAGE ON SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA analytics TO dashboard_reader;

-- Write role for ingestion service
GRANT USAGE ON SCHEMA analytics TO analytics_writer;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA analytics TO analytics_writer;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA analytics TO analytics_writer;

Data Classification

| Table | Classification | Notes |
| --- | --- | --- |
| components | Internal | Contains package names, versions |
| artifacts | Internal | Contains image names, team names |
| component_vulns | Internal | Vulnerability data (public CVEs) |
| vex_overrides | Confidential | Contains justifications, operator IDs |
| raw_sboms | Confidential | Full SBOM payloads |
| raw_attestations | Confidential | Signed attestation envelopes |

Audit Trail

All tables include created_at and updated_at timestamps. Raw payload tables (raw_sboms, raw_attestations) are append-only with content hashes for integrity verification.

Integration Points

Upstream Dependencies

| Service | Event | Contract | Action |
| --- | --- | --- | --- |
| Scanner | SBOM report ready | scanner.event.report.ready@1 (docs/modules/signals/events/orchestrator-scanner-events.md) | Normalize and upsert components |
| Concelier | Advisory observation/linkset updated | advisory.observation.updated@1 (docs/modules/concelier/events/advisory.observation.updated@1.schema.json), advisory.linkset.updated@1 (docs/modules/concelier/events/advisory.linkset.updated@1.md) | Re-correlate affected components |
| Excititor | VEX statement changes | vex.statement.* (docs/modules/excititor/architecture.md) | Create/update vex_overrides |
| Attestor | Rekor entry logged | rekor.entry.logged (docs/modules/attestor/architecture.md) | Upsert attestation record |

Downstream Consumers

| Consumer | Data | Endpoint |
| --- | --- | --- |
| Console UI | Dashboard data | /api/analytics/* |
| Export Center | Compliance reports | Direct DB query |
| AdvisoryAI | Risk context | /api/analytics/vulnerabilities |

Future Enhancements

  1. Partitioning: Partition daily_* tables by date for faster queries and archival
  2. Incremental refresh: Implement incremental materialized view refresh for large datasets
  3. Custom dimensions: Support user-defined component groupings (business units, cost centers)
  4. Predictive analytics: Add ML-based risk prediction using historical trends
  5. BI tool integration: Direct connectors for Tableau, Looker, Metabase