# EPSS Integration Architecture
> **Advisory Source**: `docs/product-advisories/16-Dec-2025 - Merging EPSS v4 with CVSS v4 Frameworks.md`
> **Last Updated**: 2025-12-17
> **Status**: Approved for Implementation
---
## Executive Summary
EPSS (Exploit Prediction Scoring System) is a **probabilistic model** that estimates the likelihood a given CVE will be exploited in the wild over the next ~30 days. This document defines how StellaOps integrates EPSS as a first-class risk signal.
**Key Distinction**:
- **CVSS v4**: Deterministic measurement of *severity* (0-10)
- **EPSS**: Dynamic, data-driven *probability of exploitation* (0-1)

EPSS does **not** replace CVSS or VEX; it provides complementary probabilistic threat intelligence.

---
## 1. Design Principles
### 1.1 EPSS as Probabilistic Signal
| Signal Type | Nature | Source |
|-------------|--------|--------|
| CVSS v4 | Deterministic impact | NVD, vendor |
| EPSS | Probabilistic threat | FIRST daily feeds |
| VEX | Vendor intent | Vendor statements |
| Runtime context | Actual exposure | StellaOps scanner |

**Rule**: EPSS *modulates confidence*, never asserts truth.

### 1.2 Architectural Constraints
1. **Append-only time-series**: Never overwrite historical EPSS data
2. **Deterministic replay**: Every scan stores the EPSS snapshot reference used
3. **Idempotent ingestion**: Safe to re-run for same date
4. **Postgres as source of truth**: Valkey is optional cache only
5. **Air-gap compatible**: Manual import via signed bundles
---
## 2. Data Model
### 2.1 Core Tables
#### Import Provenance
```sql
CREATE TABLE epss_import_runs (
    import_run_id       UUID PRIMARY KEY,
    model_date          DATE NOT NULL,
    source_uri          TEXT NOT NULL,
    retrieved_at        TIMESTAMPTZ NOT NULL,
    file_sha256         TEXT NOT NULL,
    decompressed_sha256 TEXT NULL,
    row_count           INT NOT NULL,
    model_version_tag   TEXT NULL,
    published_date      DATE NULL,
    status              TEXT NOT NULL, -- SUCCEEDED / FAILED
    error               TEXT NULL,
    UNIQUE (model_date)
);
```
#### Time-Series Scores (Partitioned)
```sql
CREATE TABLE epss_scores (
    model_date    DATE NOT NULL,
    cve_id        TEXT NOT NULL,
    epss_score    DOUBLE PRECISION NOT NULL,
    percentile    DOUBLE PRECISION NOT NULL,
    import_run_id UUID NOT NULL REFERENCES epss_import_runs(import_run_id),
    PRIMARY KEY (model_date, cve_id)
) PARTITION BY RANGE (model_date);
```
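
Because `epss_scores` is declared `PARTITION BY RANGE (model_date)`, a partition covering the model date has to exist before rows for that date can be inserted. The following is a minimal sketch of generating that DDL, written in Python for brevity. The monthly granularity and the `<parent>_YYYY_MM` naming are assumptions made for illustration; this document does not specify either.

```python
from datetime import date


def monthly_partition_ddl(parent: str, day: date) -> str:
    """Return DDL for the monthly partition of `parent` that covers `day`.

    Assumptions (not from the design): one partition per calendar month,
    named <parent>_YYYY_MM.
    """
    start = day.replace(day=1)
    # First day of the following month; the upper bound is exclusive in Postgres.
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


print(monthly_partition_ddl("epss_scores", date(2025, 12, 17)))
```

The same helper would apply to `epss_changes`, which is partitioned on the same column.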
#### Current Projection (Fast Lookup)
```sql
CREATE TABLE epss_current (
    cve_id        TEXT PRIMARY KEY,
    epss_score    DOUBLE PRECISION NOT NULL,
    percentile    DOUBLE PRECISION NOT NULL,
    model_date    DATE NOT NULL,
    import_run_id UUID NOT NULL
);
CREATE INDEX idx_epss_current_score_desc ON epss_current (epss_score DESC);
CREATE INDEX idx_epss_current_percentile_desc ON epss_current (percentile DESC);
```
#### Change Detection
```sql
CREATE TABLE epss_changes (
    model_date     DATE NOT NULL,
    cve_id         TEXT NOT NULL,
    old_score      DOUBLE PRECISION NULL,
    new_score      DOUBLE PRECISION NOT NULL,
    delta_score    DOUBLE PRECISION NULL,
    old_percentile DOUBLE PRECISION NULL,
    new_percentile DOUBLE PRECISION NOT NULL,
    flags          INT NOT NULL, -- bitmask, see 2.2 Flags Bitmask
    PRIMARY KEY (model_date, cve_id)
) PARTITION BY RANGE (model_date);
```
### 2.2 Flags Bitmask
| Flag | Value | Meaning |
|------|-------|---------|
| NEW_SCORED | 0x01 | CVE newly scored (not in previous day) |
| CROSSED_HIGH | 0x02 | Score crossed above high threshold |
| CROSSED_LOW | 0x04 | Score crossed below high threshold |
| BIG_JUMP_UP | 0x08 | Delta > 0.10 upward |
| BIG_JUMP_DOWN | 0x10 | Delta > 0.10 downward |
| TOP_PERCENTILE | 0x20 | Entered top 5% |
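
The bitmask above maps naturally onto a flag enum. The sketch below, in Python for brevity, shows one way the flags could be derived from the old and new values of a CVE. Several details are assumptions rather than statements of the actual implementation: the crossing flags are evaluated on percentile (following the `HighPercentile` comment in the runbook configuration), the default of 0.99 is taken from that configuration, and a newly scored CVE is treated as arriving from percentile 0 so that it can also raise the crossing flags.

```python
from enum import IntFlag
from typing import Optional


class EpssChangeFlags(IntFlag):
    NEW_SCORED = 0x01
    CROSSED_HIGH = 0x02
    CROSSED_LOW = 0x04
    BIG_JUMP_UP = 0x08
    BIG_JUMP_DOWN = 0x10
    TOP_PERCENTILE = 0x20


def compute_flags(
    old_score: Optional[float],
    new_score: float,
    old_percentile: Optional[float],
    new_percentile: float,
    high_percentile: float = 0.99,  # assumed default, taken from the runbook config
    big_jump_delta: float = 0.10,
    top_percentile: float = 0.95,   # "top 5%" from the table
) -> EpssChangeFlags:
    flags = EpssChangeFlags(0)
    if old_score is None or old_percentile is None:
        # No row on the previous day: the CVE is newly scored.
        flags |= EpssChangeFlags.NEW_SCORED
        old_percentile = 0.0  # assumption: a new CVE "arrives" from the bottom
    else:
        delta = new_score - old_score
        if delta > big_jump_delta:
            flags |= EpssChangeFlags.BIG_JUMP_UP
        elif -delta > big_jump_delta:
            flags |= EpssChangeFlags.BIG_JUMP_DOWN
    if old_percentile < high_percentile <= new_percentile:
        flags |= EpssChangeFlags.CROSSED_HIGH
    if new_percentile < high_percentile <= old_percentile:
        flags |= EpssChangeFlags.CROSSED_LOW
    if old_percentile < top_percentile <= new_percentile:
        flags |= EpssChangeFlags.TOP_PERCENTILE
    return flags


# A score rising from 0.50 to 0.62 while the percentile moves from 0.985 to 0.992.
print(hex(int(compute_flags(0.50, 0.62, 0.985, 0.992))))
```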
---
## 3. Service Architecture
### 3.1 Component Responsibilities
```
┌─────────────────────────────────────────────────────────────────┐
│                         EPSS DATA FLOW                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │
│  │  Scheduler   │────►│  Concelier   │────►│   Scanner    │     │
│  │  (triggers)  │     │   (ingest)   │     │  (evidence)  │     │
│  └──────────────┘     └──────────────┘     └──────────────┘     │
│         │                    │                    │             │
│         │                    ▼                    │             │
│         │             ┌──────────────┐            │             │
│         │             │   Postgres   │◄───────────┘             │
│         │             │   (truth)    │                          │
│         │             └──────────────┘                          │
│         │                    │                                  │
│         ▼                    ▼                                  │
│  ┌──────────────┐     ┌──────────────┐                          │
│  │    Notify    │◄────│  Excititor   │                          │
│  │   (alerts)   │     │ (VEX tasks)  │                          │
│  └──────────────┘     └──────────────┘                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
| Component | Responsibility |
|-----------|----------------|
| **Scheduler** | Triggers daily EPSS import job |
| **Concelier** | Downloads/imports EPSS, stores facts, computes delta, emits events |
| **Scanner** | Attaches EPSS-at-scan as immutable evidence, uses for scoring |
| **Excititor** | Creates VEX tasks when EPSS is high and VEX missing |
| **Notify** | Sends alerts on priority changes |
### 3.2 Event Flow
```
Scheduler
  → epss.ingest(date)
    → Concelier (ingest)
      → epss.updated
        → Notify (optional daily summary)
        → Concelier (enrichment)
          → vuln.priority.changed
            → Notify (targeted alerts)
            → Excititor (VEX task creation)
```
---
## 4. Ingestion Pipeline
### 4.1 Data Source
FIRST publishes daily CSV snapshots at:
```
https://epss.empiricalsecurity.com/epss_scores-YYYY-MM-DD.csv.gz
```
Each file contains ~300k CVE records with:
- `cve` - CVE ID
- `epss` - Score (0.00000-1.00000)
- `percentile` - Rank vs all CVEs
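
As an illustration of the file layout, here is a minimal parsing sketch in Python. It assumes the first line is a comment of the form `#model_version:<tag>,score_date:<timestamp>`, followed by a `cve,epss,percentile` header row; that comment layout is an assumption about the feed and should be checked against a real file. The sample data is made up, and the sketch reads the whole file into memory rather than streaming it.

```python
import csv
import gzip
from typing import Dict, List, Tuple

Row = Tuple[str, float, float]  # (cve_id, epss_score, percentile)


def parse_epss_csv(gz_bytes: bytes) -> Tuple[Dict[str, str], List[Row]]:
    """Decompress an EPSS snapshot and return (header metadata, rows)."""
    lines = gzip.decompress(gz_bytes).decode("utf-8").splitlines()
    meta: Dict[str, str] = {}
    if lines and lines[0].startswith("#"):
        # Assumed layout: "#key:value,key:value"; split on the first colon only,
        # because the timestamp value itself contains colons.
        for part in lines[0][1:].split(","):
            key, _, value = part.partition(":")
            meta[key.strip()] = value.strip()
        lines = lines[1:]
    rows: List[Row] = []
    for record in csv.DictReader(lines):
        score = float(record["epss"])
        percentile = float(record["percentile"])
        if not (0.0 <= score <= 1.0 and 0.0 <= percentile <= 1.0):
            raise ValueError(f"out-of-range value for {record['cve']}")
        rows.append((record["cve"], score, percentile))
    return meta, rows


# Made-up sample, not real EPSS data.
sample = gzip.compress(
    b"#model_version:vTEST,score_date:2025-12-17T00:00:00+0000\n"
    b"cve,epss,percentile\n"
    b"CVE-0000-0001,0.50000,0.99000\n"
    b"CVE-0000-0002,0.00042,0.05000\n"
)
meta, rows = parse_epss_csv(sample)
print(meta["model_version"], len(rows))
```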
### 4.2 Ingestion Steps
1. **Scheduler** triggers daily job for date D
2. **Download** `epss_scores-D.csv.gz`
3. **Decompress** stream
4. **Parse** header comment for model version/date
5. **Validate** that scores lie in [0,1] and that percentile never decreases as score increases
6. **Bulk load** into TEMP staging table
7. **Transaction**:
   - Insert `epss_import_runs`
   - Insert into `epss_scores` partition
   - Compute `epss_changes` by comparing staging vs `epss_current`
   - Upsert `epss_current`
   - Enqueue `epss.updated` event
8. **Commit**
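
The transactional part of these steps (step 7) can be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` as a stand-in for Postgres, trimmed-down versions of the tables from section 2.1, and compares the freshly inserted `epss_scores` rows against `epss_current` instead of using a separate staging table. Event enqueueing is omitted.

```python
import sqlite3
import uuid
from datetime import date

# Simplified stand-ins for the tables in section 2.1 (SQLite types, fewer columns).
SCHEMA = """
CREATE TABLE epss_import_runs (
    import_run_id TEXT PRIMARY KEY, model_date TEXT NOT NULL UNIQUE,
    row_count INT NOT NULL, status TEXT NOT NULL);
CREATE TABLE epss_scores (
    model_date TEXT, cve_id TEXT, epss_score REAL, percentile REAL,
    import_run_id TEXT, PRIMARY KEY (model_date, cve_id));
CREATE TABLE epss_current (
    cve_id TEXT PRIMARY KEY, epss_score REAL, percentile REAL,
    model_date TEXT, import_run_id TEXT);
CREATE TABLE epss_changes (
    model_date TEXT, cve_id TEXT, old_score REAL, new_score REAL,
    delta_score REAL, PRIMARY KEY (model_date, cve_id));
"""


def ingest(conn, model_date, rows):
    """Ingest one day's (cve_id, score, percentile) rows; return False if already done."""
    day = model_date.isoformat()
    run_id = str(uuid.uuid4())
    with conn:  # one transaction: commit on success, roll back on any exception
        already = conn.execute(
            "SELECT 1 FROM epss_import_runs WHERE model_date = ?", (day,)
        ).fetchone()
        if already:
            return False  # idempotent re-run: history is left untouched
        conn.execute(
            "INSERT INTO epss_import_runs VALUES (?, ?, ?, 'SUCCEEDED')",
            (run_id, day, len(rows)),
        )
        conn.executemany(
            "INSERT INTO epss_scores VALUES (?, ?, ?, ?, ?)",
            [(day, cve, score, pct, run_id) for cve, score, pct in rows],
        )
        # Compute changes BEFORE epss_current is overwritten.
        conn.execute(
            """
            INSERT INTO epss_changes
            SELECT s.model_date, s.cve_id, c.epss_score, s.epss_score,
                   s.epss_score - c.epss_score
            FROM epss_scores AS s
            LEFT JOIN epss_current AS c ON c.cve_id = s.cve_id
            WHERE s.model_date = ?
              AND (c.cve_id IS NULL OR c.epss_score <> s.epss_score)
            """,
            (day,),
        )
        conn.execute(
            """
            INSERT INTO epss_current
            SELECT cve_id, epss_score, percentile, model_date, import_run_id
            FROM epss_scores WHERE model_date = ?
            ON CONFLICT(cve_id) DO UPDATE SET
                epss_score = excluded.epss_score,
                percentile = excluded.percentile,
                model_date = excluded.model_date,
                import_run_id = excluded.import_run_id
            """,
            (day,),
        )
    return True


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
day_one = [("CVE-0000-0001", 0.10, 0.50), ("CVE-0000-0002", 0.20, 0.60)]
print(ingest(conn, date(2025, 12, 16), day_one))  # first run ingests
print(ingest(conn, date(2025, 12, 16), day_one))  # re-run is a no-op
```

Keeping the `UNIQUE (model_date)` constraint on the import-run table means that even if two workers race past the existence check, the second insert fails and its transaction rolls back.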
### 4.3 Air-Gap Import
Accept local bundle containing:
- `epss_scores-YYYY-MM-DD.csv.gz`
- `manifest.json` with sha256, source attribution, DSSE signature

The bundle goes through the same pipeline, with `source_uri = bundle://...`.

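
Before a bundle enters the pipeline, its CSV can be checked against the hash recorded in the manifest. The sketch below, in Python for brevity, assumes a manifest with a top-level `sha256` field holding a hex digest; the real `manifest.json` layout is not defined in this document, and DSSE signature verification is deliberately left out.

```python
import hashlib
import hmac
import json


def verify_bundle_hash(csv_gz: bytes, manifest_json: str) -> bool:
    """Return True if the CSV's SHA-256 matches the manifest's `sha256` field.

    The `sha256` field name is a hypothetical manifest layout.
    """
    manifest = json.loads(manifest_json)
    expected = manifest["sha256"].lower()
    actual = hashlib.sha256(csv_gz).hexdigest()
    # Constant-time comparison; both values are ASCII hex strings.
    return hmac.compare_digest(actual, expected)


payload = b"cve,epss,percentile\n"
manifest = json.dumps({"sha256": hashlib.sha256(payload).hexdigest()})
print(verify_bundle_hash(payload, manifest))
```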
---
## 5. Enrichment Rules
### 5.1 New Scan Findings (Immutable)
Store EPSS "as-of" scan time:
```csharp
public record ScanEpssEvidence
{
    public double EpssScoreAtScan { get; init; }
    public double EpssPercentileAtScan { get; init; }
    public DateOnly EpssModelDateAtScan { get; init; }
    public Guid EpssImportRunIdAtScan { get; init; }
}
```
This supports deterministic replay even if EPSS changes later.
### 5.2 Existing Findings (Live Triage)
Maintain mutable "current EPSS" on vulnerability instances:
- **scan_finding_evidence**: Immutable EPSS-at-scan
- **vuln_instance_triage**: Current EPSS + band (for live triage)
### 5.3 Efficient Delta Targeting
On `epss.updated(D)`:
1. Read `epss_changes` where flags indicate material change
2. Find impacted vulnerability instances by CVE
3. Update only those instances
4. Emit `vuln.priority.changed` only if band crossed
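
The four steps above can be sketched as a filter over the day's change rows. This Python sketch is an illustration under stated assumptions: band boundaries are taken from the `CriticalPercentile`, `HighPercentile` and `MediumPercentile` defaults in the runbook configuration, the `LOW` band name is invented here, and the default mask mirrors the `FlagsToProcess` default (`NewScored`, `CrossedHigh`, `BigJumpUp`, `BigJumpDown`).

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

# NEW_SCORED | CROSSED_HIGH | BIG_JUMP_UP | BIG_JUMP_DOWN
DEFAULT_MASK = 0x01 | 0x02 | 0x08 | 0x10

Change = Tuple[str, Optional[float], float, int]  # (cve, old_pct, new_pct, flags)


def band_for(percentile: float, critical=0.995, high=0.99, medium=0.90) -> str:
    """Map a percentile to a triage band ("LOW" is a hypothetical name)."""
    if percentile >= critical:
        return "CRITICAL"
    if percentile >= high:
        return "HIGH"
    if percentile >= medium:
        return "MEDIUM"
    return "LOW"


def priority_events(
    changes: Iterable[Change],
    inventory_cves: Set[str],
    flags_mask: int = DEFAULT_MASK,
) -> Iterator[Tuple[str, Optional[str], str]]:
    """Yield (cve, old_band, new_band) only where the band actually changed."""
    for cve, old_pct, new_pct, flags in changes:
        if not flags & flags_mask:
            continue  # step 1: not a material change
        if cve not in inventory_cves:
            continue  # step 2: no impacted vulnerability instance
        old_band = band_for(old_pct) if old_pct is not None else None
        new_band = band_for(new_pct)
        if old_band != new_band:
            yield cve, old_band, new_band  # step 4: band crossed


changes = [
    ("CVE-0000-0001", 0.95, 0.992, 0x02),  # MEDIUM -> HIGH, in inventory
    ("CVE-0000-0002", 0.50, 0.60, 0x08),   # big jump, but stays LOW
    ("CVE-0000-0003", 0.91, 0.996, 0x20),  # flag not in the mask
]
print(list(priority_events(changes, {"CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"})))
```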
---
## 6. Notification Policy
### 6.1 Default Thresholds
| Threshold | Default | Description |
|-----------|---------|-------------|
| HighPercentile | 0.95 | Top 5% of all CVEs |
| HighScore | 0.50 | 50% exploitation probability |
| BigJumpDelta | 0.10 | Meaningful daily change |
### 6.2 Trigger Conditions
1. **Newly scored** CVE in inventory AND `percentile >= HighPercentile`
2. Existing CVE **crosses above** HighPercentile or HighScore
3. Delta > BigJumpDelta AND CVE in runtime-exposed assets

All thresholds are org-configurable.

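
Read literally, the three trigger conditions combine as in the Python sketch below. It uses the defaults from section 6.1. Two points are interpretations rather than facts from this document: "Delta" is taken to mean an upward score change, and a CVE with no previous score is evaluated only against condition 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    high_percentile: float = 0.95
    high_score: float = 0.50
    big_jump_delta: float = 0.10


def should_notify(
    old_score: Optional[float],
    new_score: float,
    old_percentile: Optional[float],
    new_percentile: float,
    in_inventory: bool,
    runtime_exposed: bool,
    t: Thresholds = Thresholds(),
) -> bool:
    if old_score is None or old_percentile is None:
        # Condition 1: newly scored CVE in inventory, already in the high percentile.
        return in_inventory and new_percentile >= t.high_percentile
    # Condition 2: existing CVE crosses above HighPercentile or HighScore.
    crossed = (
        old_percentile < t.high_percentile <= new_percentile
        or old_score < t.high_score <= new_score
    )
    # Condition 3: big upward jump on a runtime-exposed asset (upward is an assumption).
    big_jump = (new_score - old_score) > t.big_jump_delta and runtime_exposed
    return crossed or big_jump


print(should_notify(0.10, 0.25, 0.60, 0.80, in_inventory=True, runtime_exposed=True))
```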
---
## 7. Trust Lattice Integration
### 7.1 Scoring Rule Example
```
IF cvss_base >= 8.0
AND epss_score >= 0.35
AND runtime_exposed = true
→ priority = IMMEDIATE_ATTENTION
```
### 7.2 Score Weights
| Factor | Default Weight | Range |
|--------|---------------|-------|
| CVSS | 0.25 | 0.0-1.0 |
| EPSS | 0.25 | 0.0-1.0 |
| Reachability | 0.25 | 0.0-1.0 |
| Freshness | 0.15 | 0.0-1.0 |
| Frequency | 0.10 | 0.0-1.0 |
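
The document does not state how these weights are combined. One plausible reading, sketched here in Python purely as an assumption, is a weighted linear sum over factors that have each been normalized to the 0-1 range (for example CVSS base score divided by 10). The default weights sum to 1.0, so the composite also stays within 0-1. The rule from section 7.1 is included as a plain predicate.

```python
from typing import Dict, Mapping

DEFAULT_WEIGHTS: Dict[str, float] = {
    "cvss": 0.25,
    "epss": 0.25,
    "reachability": 0.25,
    "freshness": 0.15,
    "frequency": 0.10,
}


def composite_score(
    factors: Mapping[str, float],
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted linear sum of normalized factors (an assumed combination rule)."""
    missing = set(weights) - set(factors)
    if missing:
        raise KeyError(f"missing factors: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= factors[name] <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
    return sum(weights[name] * factors[name] for name in weights)


def needs_immediate_attention(cvss_base: float, epss_score: float, runtime_exposed: bool) -> bool:
    """The example rule from section 7.1."""
    return cvss_base >= 8.0 and epss_score >= 0.35 and runtime_exposed


factors = {
    "cvss": 9.0 / 10.0,  # CVSS base 9.0 normalized to 0-1 (assumed normalization)
    "epss": 0.40,
    "reachability": 1.0,
    "freshness": 0.5,
    "frequency": 0.2,
}
print(round(composite_score(factors), 4), needs_immediate_attention(9.0, 0.40, True))
```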
---
## 8. API Surface
### 8.1 Internal API Endpoints
| Endpoint | Description |
|----------|-------------|
| `GET /epss/current?cve=...` | Bulk lookup current EPSS |
| `GET /epss/history?cve=...&days=180` | Historical time-series |
| `GET /epss/top?order=epss&limit=100` | Top CVEs by score |
| `GET /epss/changes?date=...` | Daily change report |
### 8.2 UI Requirements
For each vulnerability instance:
- EPSS score + percentile
- Model date
- Trend delta vs previous scan date
- Filter chips: "High EPSS", "Rising EPSS", "High CVSS + High EPSS"
- Evidence panel showing EPSS-at-scan vs current EPSS
---
## 9. Implementation Checklist
### Phase 1: Data Foundation
- [ ] DB migrations: tables + partitions + indexes
- [ ] Concelier ingestion job: online download + bundle import
### Phase 2: Integration
- [x] epss_current + epss_changes projection
- [x] Scanner.WebService: attach EPSS-at-scan evidence
- [x] Bulk lookup API (`/api/v1/epss/*`)
### Phase 3: Enrichment
- [x] Scanner Worker `EpssEnrichmentJob`: update `vuln_instance_triage` for CVEs with material changes
- [x] Scanner Worker `EpssSignalJob`: generate tenant-scoped EPSS signals (stored in `epss_signal`; published via `IEpssSignalPublisher` when configured)
### Phase 4: UI/UX
- [ ] EPSS fields in vulnerability detail
- [ ] Filters and sort by exploit likelihood
- [ ] Trend visualization
### Phase 5: Operations
- [x] Backfill tool (last 180 days)
- [x] Ops runbook: schedules, manual re-run, air-gap import
---
## 10. Operations Runbook
### 10.1 Configuration
EPSS jobs are configured via the `Epss:*` sections in Scanner Worker configuration:
```yaml
Epss:
  Ingest:
    Enabled: true                     # Enable/disable the job
    Schedule: "0 5 0 * * *"           # Cron expression (default: 00:05 UTC daily)
    SourceType: "online"              # "online" or "bundle"
    BundlePath: null                  # Path for air-gapped bundle import
    InitialDelay: "00:00:30"          # Wait before first run (30s)
    RetryDelay: "00:05:00"            # Delay between retries (5m)
    MaxRetries: 3                     # Maximum retry attempts
  Enrichment:
    Enabled: true                     # Enable/disable live triage enrichment
    PostIngestDelay: "00:01:00"       # Wait after ingest before enriching
    BatchSize: 1000                   # CVEs per batch
    HighPercentile: 0.99              # ≥ threshold => HIGH (and CrossedHigh flag)
    HighScore: 0.50                   # ≥ threshold => high score threshold
    BigJumpDelta: 0.10                # ≥ threshold => BIG_JUMP flag
    CriticalPercentile: 0.995         # ≥ threshold => CRITICAL
    MediumPercentile: 0.90            # ≥ threshold => MEDIUM
    FlagsToProcess: "NewScored,CrossedHigh,BigJumpUp,BigJumpDown"  # Empty => process all
  Signal:
    Enabled: true                     # Enable/disable tenant-scoped signal generation
    PostEnrichmentDelay: "00:00:30"   # Wait after enrichment before emitting signals
    BatchSize: 500                    # Signals per batch
    RetentionDays: 90                 # Retention for epss_signal layer
    SuppressSignalsOnModelChange: true  # Suppress per-CVE signals on model version changes
```
### 10.2 Online Mode (Connected)
The job automatically fetches EPSS data from FIRST.org at the scheduled time:
1. Downloads `https://epss.empiricalsecurity.com/epss_scores-YYYY-MM-DD.csv.gz`
2. Validates SHA256 hash
3. Parses CSV and bulk inserts to `epss_scores`
4. Computes delta against `epss_current`
5. Updates `epss_current` projection
6. Publishes `epss.updated` event
### 10.3 Air-Gap Mode (Bundle)
For offline deployments:
1. Download EPSS CSV from FIRST.org on an internet-connected system
2. Copy to the configured `BundlePath` location
3. Set `SourceType: "bundle"` in configuration
4. The job will read from the local file instead of fetching online
### 10.4 Manual Ingestion
There is currently no HTTP endpoint for one-shot ingestion. To force a run:
1. Temporarily set `Epss:Ingest:Schedule` to `0 * * * * *` and `Epss:Ingest:InitialDelay` to `00:00:00`
2. Restart Scanner Worker and wait for one ingest cycle
3. Restore the normal schedule

Note: a successful ingest triggers `EpssEnrichmentJob`, which then triggers `EpssSignalJob`.

### 10.5 Troubleshooting
| Symptom | Likely Cause | Resolution |
|---------|--------------|------------|
| Job not running | `Enabled: false` | Set `Enabled: true` |
| Download fails | Network/firewall | Check HTTPS egress to `epss.empiricalsecurity.com` |
| Parse errors | Corrupted file | Re-download, check SHA256 |
| Enrichment/signals not running | Storage disabled or job disabled | Ensure `ScannerStorage:Postgres:ConnectionString` is set and `Epss:Enrichment:Enabled` / `Epss:Signal:Enabled` are `true` |
| Slow ingestion | Large dataset / constrained IO | Expect <120s for ~310k rows; confirm via the perf harness and compare against CI baseline |
| Duplicate runs | Idempotent | Safe - existing data preserved |
### 10.6 Monitoring
Key metrics and traces:
- **Activities**
  - `StellaOps.Scanner.EpssIngest` (`epss.ingest`): `epss.model_date`, `epss.row_count`, `epss.cve_count`, `epss.duration_ms`
  - `StellaOps.Scanner.EpssEnrichment` (`epss.enrich`): `epss.model_date`, `epss.changed_cve_count`, `epss.updated_count`, `epss.band_change_count`, `epss.duration_ms`
  - `StellaOps.Scanner.EpssSignal` (`epss.signal.generate`): `epss.model_date`, `epss.change_count`, `epss.signal_count`, `epss.filtered_count`, `epss.tenant_count`, `epss.duration_ms`
- **Metrics**
  - `epss_enrichment_runs_total{result}` / `epss_enrichment_duration_ms` / `epss_enrichment_updated_total` / `epss_enrichment_band_changes_total`
  - `epss_signal_runs_total{result}` / `epss_signal_duration_ms` / `epss_signals_emitted_total{event_type, tenant_id}`
- **Logs** (structured)
  - `EPSS ingest/enrichment/signal job started`
  - `EPSS ingestion completed: modelDate={ModelDate}, rows={RowCount}, ...`
  - `EPSS enrichment completed: updated={Updated}, bandChanges={BandChanges}, ...`
  - `EPSS model version changed: {OldVersion} -> {NewVersion}`
  - `EPSS signal generation completed: signals={SignalCount}, changes={ChangeCount}, ...`
### 10.7 Performance
- Local harness: `src/Scanner/__Benchmarks/StellaOps.Scanner.Storage.Epss.Perf/README.md`
- CI workflow: `.gitea/workflows/epss-ingest-perf.yml` (nightly + manual, artifacts retained 90 days)
---
## 11. Anti-Patterns to Avoid
| Anti-Pattern | Why It's Wrong |
|--------------|----------------|
| Storing only latest EPSS | Breaks auditability and replay |
| Mixing EPSS into CVE table | EPSS is signal, not vulnerability data |
| Treating EPSS as severity | EPSS is probability, not impact |
| Alerting on every daily fluctuation | Creates alert fatigue |
| Recomputing EPSS internally | Use FIRST's authoritative data |
---
## Related Documents
- [Unknowns API Documentation](../api/unknowns-api.md)
- [Score Replay API](../api/score-replay-api.md)
- [Trust Lattice Architecture](../modules/scanner/architecture.md)