# Analytics Module Architecture

## Design Philosophy

The Analytics module implements a **star-schema data warehouse** pattern optimized for analytical queries rather than transactional workloads. Key design principles:

1. **Separation of concerns**: Analytics schema is isolated from operational schemas (scanner, vex, proof_system)
2. **Pre-computation**: Expensive aggregations computed in advance via materialized views
3. **Audit trail**: Raw payloads preserved for reprocessing and compliance
4. **Determinism**: All normalization functions are immutable and reproducible
5. **Incremental updates**: Supports both full refresh and incremental ingestion

## Data Flow

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Scanner   │     │  Concelier  │     │   Attestor  │
│   (SBOM)    │     │   (Vuln)    │     │   (DSSE)    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       │ SBOM Ingested     │ Vuln Updated      │ Attestation Created
       ▼                   ▼                   ▼
┌──────────────────────────────────────────────────────┐
│               AnalyticsIngestionService              │
│  - Normalize components (PURL, supplier, license)    │
│  - Upsert to unified registry                        │
│  - Correlate with vulnerabilities                    │
│  - Store raw payloads                                │
└──────────────────────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────────────────────┐
│                 analytics schema                     │
│  ┌──────────┐ ┌─────────┐ ┌─────────┐ ┌────────────┐ │
│  │components│ │artifacts│ │comp_vuln│ │attestations│ │
│  └──────────┘ └─────────┘ └─────────┘ └────────────┘ │
└──────────────────────────────────────────────────────┘
       │
       │ Daily refresh
       ▼
┌──────────────────────────────────────────────────────┐
│              Materialized Views                      │
│  mv_supplier_concentration | mv_license_distribution │
│  mv_vuln_exposure          | mv_attestation_coverage │
└──────────────────────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────────────────────┐
│              Platform API Endpoints                  │
│              (with 5-minute caching)                 │
└──────────────────────────────────────────────────────┘
```

## Normalization Rules

### PURL Parsing

Package URLs (PURLs) are the canonical identifier for components. The `parse_purl()` function extracts:

| Field | Example | Notes |
|-------|---------|-------|
| `purl_type` | `maven`, `npm`, `pypi` | Ecosystem identifier |
| `purl_namespace` | `org.apache.logging` | Group/org/scope (optional) |
| `purl_name` | `log4j-core` | Package name |
| `purl_version` | `2.17.1` | Version string |
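
To make the field split concrete, here is a minimal Python sketch of the same extraction. It is an illustration only: the real `parse_purl()` lives in the analytics schema, and the `ParsedPurl` type, the error handling, and the decision to discard qualifiers and subpath are assumptions of this example.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class ParsedPurl(NamedTuple):
    # Hypothetical container mirroring the four fields in the table above.
    purl_type: str
    purl_namespace: Optional[str]
    purl_name: str
    purl_version: Optional[str]


def parse_purl(purl: str) -> ParsedPurl:
    """Split pkg:type/namespace/name@version into its parts."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a package URL: {purl!r}")
    remainder = purl[len("pkg:"):]
    # Qualifiers (?key=value) and subpath (#path) are ignored in this sketch.
    remainder = remainder.split("#", 1)[0].split("?", 1)[0]
    path, sep, version = remainder.rpartition("@")
    if not sep or "/" in version:
        # No version present (or the "@" belonged to an unencoded scope).
        path, version = remainder, None
    segments = [s for s in path.strip("/").split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"package URL needs a type and a name: {purl!r}")
    namespace = "/".join(unquote(s) for s in segments[1:-1]) or None
    return ParsedPurl(
        purl_type=segments[0].lower(),
        purl_namespace=namespace,
        purl_name=unquote(segments[-1]),
        purl_version=unquote(version) if version else None,
    )
```

For example, `parse_purl("pkg:maven/org.apache.logging/log4j-core@2.17.1")` yields the type `maven`, namespace `org.apache.logging`, name `log4j-core`, and version `2.17.1`.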

### Supplier Normalization

The `normalize_supplier()` function standardizes supplier names for consistent grouping:

1. Convert to lowercase
2. Trim whitespace
3. Remove legal suffixes, along with any comma that precedes them: Inc., LLC, Ltd., Corp., GmbH, B.V., S.A., PLC, Co.
4. Normalize internal whitespace

**Examples:**
- `"Apache Software Foundation, Inc."` → `"apache software foundation"`
- `"Google LLC"` → `"google"`
- `"  Microsoft Corp.  "` → `"microsoft"`
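
The four steps can be sketched in Python as follows. This is not the schema's implementation: the regular expression, the repeated stripping of stacked suffixes (such as "Co., Ltd."), and the removal of a comma before the suffix are assumptions chosen so that the examples above come out as listed.

```python
import re

# Legal suffixes from the list above, matched at the end of the name,
# optionally preceded by a comma and followed by a trailing period.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:inc|llc|ltd|corp|gmbh|b\.v|s\.a|plc|co)\.?$"
)


def normalize_supplier(raw: str) -> str:
    name = raw.lower().strip()              # steps 1 and 2
    while True:                             # step 3, repeated for stacked suffixes
        stripped = _SUFFIX_RE.sub("", name)
        if stripped == name:
            break
        name = stripped
    return re.sub(r"\s+", " ", name).strip()  # step 4
```

Because the suffix must be preceded by whitespace or a comma, a name that merely ends in the same letters (for example one ending in "co" as part of a word) is left alone.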

### License Categorization

The `categorize_license()` function maps SPDX expressions to risk categories:

| Category | Examples | Risk Level |
|----------|----------|------------|
| `permissive` | MIT, Apache-2.0, BSD-3-Clause, ISC | Low |
| `copyleft-weak` | LGPL-2.1, MPL-2.0, EPL-2.0 | Medium |
| `copyleft-strong` | GPL-3.0, AGPL-3.0, SSPL-1.0 | High |
| `proprietary` | Proprietary, Commercial | Review Required |
| `unknown` | Unrecognized expressions | Review Required |

**Special handling:**
- GPL with exceptions (e.g., `GPL-2.0 WITH Classpath-exception-2.0`) → `copyleft-weak`
- Dual-licensed (e.g., `MIT OR Apache-2.0`) → uses first match
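
A minimal Python sketch of the mapping and the two special cases is shown below. It covers only the example identifiers from the table, and it reads "first match" as "the first `OR` operand that maps to a known category"; both points are assumptions of this example, not a description of the real `categorize_license()`. Expressions joined with `AND` are not handled and fall through to `unknown`.

```python
import re

_CATEGORIES = {
    "permissive": ("MIT", "Apache-2.0", "BSD-3-Clause", "ISC"),
    "copyleft-weak": ("LGPL-2.1", "MPL-2.0", "EPL-2.0"),
    "copyleft-strong": ("GPL-3.0", "AGPL-3.0", "SSPL"),
    "proprietary": ("Proprietary", "Commercial"),
}
# Case-insensitive lookup from license identifier to category.
_LOOKUP = {lid.lower(): cat for cat, ids in _CATEGORIES.items() for lid in ids}


def _categorize_term(term: str) -> str:
    parts = re.split(r"\s+WITH\s+", term.strip().strip("()").strip(),
                     maxsplit=1, flags=re.IGNORECASE)
    base = parts[0].strip().lower()
    if len(parts) == 2 and base.startswith("gpl-"):
        return "copyleft-weak"  # GPL with an exception is downgraded
    return _LOOKUP.get(base, "unknown")


def categorize_license(expression: str) -> str:
    # Walk the OR operands left to right and return the first known category.
    for term in re.split(r"\s+OR\s+", expression.strip(), flags=re.IGNORECASE):
        category = _categorize_term(term)
        if category != "unknown":
            return category
    return "unknown"
```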

## Component Deduplication

Components are deduplicated by `(purl, hash_sha256)`:

1. If same PURL and hash: existing record updated (last_seen_at, counts)
2. If same PURL but different hash: new record created (different build or content for the same package version, since the PURL already encodes the version)
3. If same hash but different PURL: new record (aliased package)

**Upsert pattern:**
```sql
INSERT INTO analytics.components (...)
VALUES (...)
ON CONFLICT (purl, hash_sha256) DO UPDATE SET
  last_seen_at = now(),
  sbom_count = components.sbom_count + 1,
  updated_at = now();
```
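
The effect of keying on `(purl, hash_sha256)` can be modeled in memory with a dictionary. The sketch below is only an analogy for the SQL upsert: the `ComponentRow` type and the initial `sbom_count` of 1 are assumptions, since the `INSERT` values are elided above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


@dataclass
class ComponentRow:
    purl: str
    hash_sha256: str
    sbom_count: int
    last_seen_at: datetime


Registry = Dict[Tuple[str, str], ComponentRow]


def upsert_component(registry: Registry, purl: str, hash_sha256: str,
                     now: Optional[datetime] = None) -> ComponentRow:
    now = now or datetime.now(timezone.utc)
    key = (purl, hash_sha256)           # same key as the ON CONFLICT target
    row = registry.get(key)
    if row is None:
        # New PURL, new hash, or both: a new record is created.
        row = ComponentRow(purl, hash_sha256, sbom_count=1, last_seen_at=now)
        registry[key] = row
    else:
        # Same PURL and hash: bump the counter and the last-seen timestamp.
        row.sbom_count += 1
        row.last_seen_at = now
    return row
```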

## Vulnerability Correlation

When a component is upserted, the `VulnerabilityCorrelationService` queries Concelier for matching advisories:

1. Query by PURL type + namespace + name
2. Filter by version range matching
3. Upsert to `component_vulns` with severity, EPSS, KEV flags

**Version range matching** uses Concelier's existing logic to handle:
- Semver ranges: `>=1.0.0 <2.0.0`
- Exact versions: `1.2.3`
- Wildcards: `1.x`
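
For illustration, the three range styles can be matched with a few lines of Python. This is a simplified stand-in, not Concelier's logic: it handles only dotted numeric versions (a pre-release tag such as `1.0.0-beta` raises `ValueError`), and the function name `version_matches` is hypothetical.

```python
import operator
import re

_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
        "<": operator.lt, "=": operator.eq}
_COMPARATOR = re.compile(r"^(>=|<=|>|<|=)?(\d+(?:\.\d+)*)$")


def _parse(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))


def _pad(parts: tuple, width: int) -> tuple:
    # Treat missing trailing components as zero so that 1.2 equals 1.2.0.
    return parts + (0,) * (width - len(parts))


def version_matches(version: str, spec: str) -> bool:
    v = _parse(version)
    if spec.endswith((".x", ".*")):
        prefix = _parse(spec[:-2])          # wildcard: compare the prefix only
        return v[:len(prefix)] == prefix
    for token in spec.split():              # every comparator must hold
        m = _COMPARATOR.match(token)
        if m is None:
            raise ValueError(f"unsupported comparator: {token!r}")
        target = _parse(m.group(2))
        width = max(len(v), len(target))
        if not _OPS[m.group(1) or "="](_pad(v, width), _pad(target, width)):
            return False
    return True
```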

## VEX Override Logic

The `mv_vuln_exposure` view implements VEX-adjusted counts:

```sql
-- Effective count excludes artifacts with active VEX overrides
COUNT(DISTINCT ac.artifact_id) FILTER (
  WHERE NOT EXISTS (
    SELECT 1 FROM analytics.vex_overrides vo
    WHERE vo.artifact_id = ac.artifact_id
      AND vo.vuln_id = cv.vuln_id
      AND vo.status = 'not_affected'
      AND (vo.valid_until IS NULL OR vo.valid_until > now())
  )
) AS effective_artifact_count
```

**Override validity:**
- `valid_from`: When the override became effective (the excerpt above checks only `valid_until`)
- `valid_until`: Expiration (NULL = no expiration)
- Only `status = 'not_affected'` reduces exposure counts
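
The same filter can be expressed in Python to show which artifacts remain in the effective count. The `VexOverride` type and the function signature are hypothetical; the conditions mirror the SQL excerpt above, including the fact that only `valid_until` is tested.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class VexOverride:
    artifact_id: str
    vuln_id: str
    status: str
    valid_until: Optional[datetime]  # None means no expiration


def effective_artifact_count(affected_artifact_ids: Iterable[str],
                             vuln_id: str,
                             overrides: Iterable[VexOverride],
                             now: datetime) -> int:
    # Artifacts with an active not_affected override for this vulnerability.
    suppressed = {
        o.artifact_id
        for o in overrides
        if o.vuln_id == vuln_id
        and o.status == "not_affected"
        and (o.valid_until is None or o.valid_until > now)
    }
    return len(set(affected_artifact_ids) - suppressed)
```

An expired override, or one with any status other than `not_affected`, leaves the artifact in the count.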

## Time-Series Rollups

Daily rollups are computed by `compute_daily_rollups()`:

**Vulnerability counts** (per environment/team/severity):
- `total_vulns`: All affecting vulnerabilities
- `fixable_vulns`: Vulns with `fix_available = TRUE`
- `vex_mitigated`: Vulns with active `not_affected` override
- `kev_vulns`: Vulns in CISA KEV
- `unique_cves`: Distinct CVE IDs
- `affected_artifacts`: Artifacts containing affected components
- `affected_components`: Components with affecting vulns
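
As a rough illustration of how these seven counters relate, the sketch below derives them from a list of findings for one environment/team/severity group. The `Finding` record and its field names are hypothetical, and it assumes `total_vulns` counts one row per (artifact, component, vulnerability) finding; the actual `compute_daily_rollups()` may define the counts differently.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Finding:
    # One (artifact, component, vulnerability) row; field names are assumed.
    cve_id: str
    artifact_id: str
    component_id: str
    fix_available: bool
    in_kev: bool
    vex_mitigated: bool


def rollup_vuln_counts(findings: Iterable[Finding]) -> Dict[str, int]:
    rows = list(findings)
    return {
        "total_vulns": len(rows),
        "fixable_vulns": sum(f.fix_available for f in rows),
        "vex_mitigated": sum(f.vex_mitigated for f in rows),
        "kev_vulns": sum(f.in_kev for f in rows),
        "unique_cves": len({f.cve_id for f in rows}),
        "affected_artifacts": len({f.artifact_id for f in rows}),
        "affected_components": len({f.component_id for f in rows}),
    }
```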

**Component counts** (per environment/team/license/type):
- `total_components`: Distinct components
- `unique_suppliers`: Distinct normalized suppliers

**Retention policy:** 90 days in hot storage; older data archived to cold storage.

## Materialized View Refresh

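All materialized views support `REFRESH MATERIALIZED VIEW CONCURRENTLY`, which rebuilds a view without blocking concurrent reads (PostgreSQL requires a unique index on the view for this):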

```sql
-- Refresh all views (run daily via pg_cron or Scheduler)
SELECT analytics.refresh_all_views();
```

**Refresh schedule (recommended):**
- `mv_supplier_concentration`: 02:00 UTC daily
- `mv_license_distribution`: 02:15 UTC daily
- `mv_vuln_exposure`: 02:30 UTC daily
- `mv_attestation_coverage`: 02:45 UTC daily
- `compute_daily_rollups()`: 03:00 UTC daily
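
If the staggered schedule is run through pg_cron, the job definitions could be generated as in the sketch below. It only builds SQL strings for pg_cron's `cron.schedule(job_name, schedule, command)`. The job names, the `analytics.` qualification of the views and of `compute_daily_rollups()`, and the assumption that pg_cron is configured for UTC are all illustrative.

```python
# (view name, hour, minute) in UTC, taken from the recommended schedule above.
VIEW_SCHEDULE = [
    ("mv_supplier_concentration", 2, 0),
    ("mv_license_distribution", 2, 15),
    ("mv_vuln_exposure", 2, 30),
    ("mv_attestation_coverage", 2, 45),
]


def cron_expression(hour: int, minute: int) -> str:
    # Standard five-field cron: minute hour day-of-month month day-of-week.
    return f"{minute} {hour} * * *"


def schedule_statements() -> list:
    statements = []
    for view, hour, minute in VIEW_SCHEDULE:
        command = f"REFRESH MATERIALIZED VIEW CONCURRENTLY analytics.{view}"
        statements.append(
            f"SELECT cron.schedule('refresh_{view}', "
            f"'{cron_expression(hour, minute)}', $${command}$$);"
        )
    # Rollups run last, after every view has been refreshed.
    statements.append(
        "SELECT cron.schedule('compute_daily_rollups', "
        f"'{cron_expression(3, 0)}', "
        "$$SELECT analytics.compute_daily_rollups()$$);"
    )
    return statements
```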

## Performance Considerations

### Indexing Strategy

| Table | Key Indexes | Query Pattern |
|-------|-------------|---------------|
| `components` | `purl`, `supplier_normalized`, `license_category` | Lookup, aggregation |
| `artifacts` | `digest`, `environment`, `team` | Lookup, filtering |
| `component_vulns` | `vuln_id`, `severity`, `fix_available` | Join, filtering |
| `attestations` | `artifact_id`, `predicate_type` | Join, aggregation |
| `vex_overrides` | `(artifact_id, vuln_id)`, `status` | Subquery exists |

### Query Performance Targets

| Query | Target | Notes |
|-------|--------|-------|
| `sp_top_suppliers(20)` | < 100ms | Uses materialized view |
| `sp_license_heatmap()` | < 100ms | Uses materialized view |
| `sp_vuln_exposure()` | < 200ms | Uses materialized view |
| `sp_fixable_backlog()` | < 500ms | Live query with indexes |
| `sp_attestation_gaps()` | < 100ms | Uses materialized view |

### Caching Strategy

Platform API endpoints use a 5-minute TTL cache:
- Cache key: endpoint + query parameters
- Invalidation: Time-based only (no event-driven invalidation)
- Storage: Valkey (in-memory)
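
The caching behavior can be sketched with an in-memory dictionary standing in for Valkey. The class and function names are hypothetical, and sorting the query parameters so that equivalent requests share a key is an assumption of this example.

```python
import time
from typing import Any, Callable, Dict, Mapping, Tuple
from urllib.parse import urlencode

TTL_SECONDS = 300  # 5-minute TTL


def cache_key(endpoint: str, params: Mapping[str, str]) -> str:
    # Sort parameters so ?a=1&b=2 and ?b=2&a=1 map to the same entry.
    query = urlencode(sorted(params.items()))
    return f"{endpoint}?{query}" if query else endpoint


class TtlCache:
    def __init__(self, ttl_seconds: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                  # still fresh: serve cached value
        value = compute()                    # expired or missing: recompute
        self._entries[key] = (now + self._ttl, value)
        return value
```

Because invalidation is time-based only, a response can lag the underlying materialized view by up to five minutes after a refresh.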

## Security Considerations

### Schema Permissions

```sql
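-- Read-only role for dashboards
-- Note: ALL TABLES / ALL SEQUENCES grants cover only objects that already exist;
-- use ALTER DEFAULT PRIVILEGES to cover objects created later.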
GRANT USAGE ON SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO dashboard_reader;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA analytics TO dashboard_reader;

-- Write role for ingestion service
GRANT USAGE ON SCHEMA analytics TO analytics_writer;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA analytics TO analytics_writer;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA analytics TO analytics_writer;
```

### Data Classification

| Table | Classification | Notes |
|-------|----------------|-------|
| `components` | Internal | Contains package names, versions |
| `artifacts` | Internal | Contains image names, team names |
| `component_vulns` | Internal | Vulnerability data (public CVEs) |
| `vex_overrides` | Confidential | Contains justifications, operator IDs |
| `raw_sboms` | Confidential | Full SBOM payloads |
| `raw_attestations` | Confidential | Signed attestation envelopes |

### Audit Trail

All tables include `created_at` and `updated_at` timestamps. Raw payload tables (`raw_sboms`, `raw_attestations`) are append-only with content hashes for integrity verification.
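
A minimal sketch of the append-only pattern with content hashes follows. The source does not name the hash algorithm for raw payloads; SHA-256 is assumed here, and `RawPayloadStore` is a hypothetical in-memory stand-in for the `raw_sboms` and `raw_attestations` tables.

```python
import hashlib
from typing import List, Tuple


def content_hash(payload: bytes) -> str:
    # Assumed algorithm: SHA-256 over the raw payload bytes.
    return hashlib.sha256(payload).hexdigest()


class RawPayloadStore:
    """Append-only store: rows are added and read, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, bytes]] = []

    def append(self, payload: bytes) -> str:
        digest = content_hash(payload)
        self._rows.append((digest, payload))
        return digest

    def verify_all(self) -> bool:
        # Integrity check: recompute each hash and compare with the stored one.
        return all(content_hash(p) == d for d, p in self._rows)
```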

## Integration Points

### Upstream Dependencies

| Service | Event | Action |
|---------|-------|--------|
| Scanner | SBOM ingested | Normalize and upsert components |
| Concelier | Advisory updated | Re-correlate affected components |
| Excititor | VEX observation | Create/update vex_overrides |
| Attestor | Attestation created | Upsert attestation record |

### Downstream Consumers

| Consumer | Data | Endpoint |
|----------|------|----------|
| Console UI | Dashboard data | `/api/analytics/*` |
| Export Center | Compliance reports | Direct DB query |
| AdvisoryAI | Risk context | `/api/analytics/vulnerabilities` |

## Future Enhancements

1. **Partitioning**: Partition `daily_*` tables by date for faster queries and archival
2. **Incremental refresh**: Implement incremental materialized view refresh for large datasets
3. **Custom dimensions**: Support user-defined component groupings (business units, cost centers)
4. **Predictive analytics**: Add ML-based risk prediction using historical trends
5. **BI tool integration**: Direct connectors for Tableau, Looker, Metabase