git.stella-ops.org/docs/modules/telemetry/guides/fn-drift.md
2026-01-06 19:07:48 +02:00

# FN-Drift Metrics Reference
> **Sprint:** SPRINT_3404_0001_0001
> **Module:** Scanner Storage / Telemetry
## Overview
False-Negative Drift (FN-Drift) measures how often vulnerability classifications change from "not affected" or "unknown" to "affected" during rescans. This metric is critical for:
- **Accuracy Assessment**: Tracking scanner reliability over time
- **SLO Compliance**: Meeting false-negative rate targets
- **Root Cause Analysis**: Stratified analysis by drift cause
- **Feed Quality**: Identifying problematic vulnerability feeds
## Metrics
### Gauges (30-day rolling window)
| Metric | Type | Description |
|--------|------|-------------|
| `scanner.fn_drift.percent` | Gauge | 30-day rolling FN-Drift percentage |
| `scanner.fn_drift.transitions_30d` | Gauge | Total FN transitions in last 30 days |
| `scanner.fn_drift.evaluated_30d` | Gauge | Total findings evaluated in last 30 days |
| `scanner.fn_drift.cause.feed_delta` | Gauge | FN transitions caused by feed updates |
| `scanner.fn_drift.cause.rule_delta` | Gauge | FN transitions caused by rule changes |
| `scanner.fn_drift.cause.lattice_delta` | Gauge | FN transitions caused by VEX lattice changes |
| `scanner.fn_drift.cause.reachability_delta` | Gauge | FN transitions caused by reachability changes |
| `scanner.fn_drift.cause.engine` | Gauge | FN transitions caused by engine changes (should be ~0) |
### Counters (all-time)
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `scanner.classification_changes_total` | Counter | `cause` | Total classification status changes |
| `scanner.fn_transitions_total` | Counter | `cause` | Total false-negative transitions |
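The tables above use dotted metric names, while the alerting rules and dashboard queries in this guide use underscores (for example `scanner_fn_drift_percent`). A minimal sketch of that mapping follows. It assumes the exporter only replaces dots with underscores; the helper name `prometheus_name` is hypothetical, and a real exporter may also append unit or type suffixes.

```python
def prometheus_name(metric_name: str) -> str:
    """Map a dotted metric name to the underscore form used in the PromQL examples.

    Assumption: only dots are replaced. Real exporters may also add suffixes.
    """
    return metric_name.replace(".", "_")


if __name__ == "__main__":
    for name in ("scanner.fn_drift.percent", "scanner.fn_transitions_total"):
        print(f"{name} -> {prometheus_name(name)}")
```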
## Classification Statuses
| Status | Description |
|--------|-------------|
| `new` | First scan, no previous status |
| `unaffected` | Confirmed not affected |
| `unknown` | Status unknown/uncertain |
| `affected` | Confirmed affected |
| `fixed` | Previously affected, now fixed |
## Drift Causes
| Cause | Description | Expected Impact |
|-------|-------------|-----------------|
| `feed_delta` | Vulnerability feed updated (NVD, GHSA, OVAL) | High - most common cause |
| `rule_delta` | Policy rules changed | Medium - controlled by policy team |
| `lattice_delta` | VEX lattice state changed | Medium - VEX updates |
| `reachability_delta` | Reachability analysis changed | Low - improved analysis |
| `engine` | Scanner engine change | ~0 - determinism violation if >0 |
| `other` | Unknown/unclassified cause | Low - investigate if high |
## FN-Drift Definition
A **False-Negative Transition** occurs when:
- Previous status was `unaffected` or `unknown`
- New status is `affected`
This indicates that an earlier scan did not report the finding as vulnerable (it was `unaffected` or `unknown`), but the current scan classifies it as `affected`. The earlier result was therefore a false negative.
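The definition above can be sketched as a small predicate. This is an illustration only, not the scanner's implementation; the function and constant names are hypothetical, and the status strings come from the Classification Statuses table.

```python
# Statuses that count as "not reported as vulnerable" in the earlier scan.
FN_SOURCE_STATUSES = frozenset({"unaffected", "unknown"})


def is_fn_transition(previous_status: str, new_status: str) -> bool:
    """Return True for an (unaffected|unknown) -> affected change."""
    return previous_status in FN_SOURCE_STATUSES and new_status == "affected"


if __name__ == "__main__":
    print(is_fn_transition("unknown", "affected"))   # True
    print(is_fn_transition("new", "affected"))       # False: first scan, no previous status
    print(is_fn_transition("affected", "fixed"))     # False
```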
### FN-Drift Rate Calculation
```
FN-Drift % = (FN Transitions / Total Reclassified) × 100
```
Where:
- **FN Transitions**: Count of `(unaffected|unknown) → affected` changes
- **Total Reclassified**: Count of all status changes (excluding `new`)
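A worked sketch of the formula, under two stated assumptions: "excluding `new`" is read as dropping changes whose previous status is `new`, and an empty denominator yields 0.0. Neither choice is confirmed by this guide, and the function name is hypothetical.

```python
from typing import Iterable, Tuple

FN_SOURCE_STATUSES = frozenset({"unaffected", "unknown"})


def fn_drift_percent(changes: Iterable[Tuple[str, str]]) -> float:
    """Compute FN-Drift % from (previous_status, new_status) pairs."""
    # Assumption: first-scan records (previous status "new") are not reclassifications.
    reclassified = [(prev, new) for prev, new in changes if prev != "new" and prev != new]
    if not reclassified:
        return 0.0  # Assumption: report 0% when nothing was reclassified.
    fn_transitions = sum(
        1 for prev, new in reclassified
        if prev in FN_SOURCE_STATUSES and new == "affected"
    )
    return fn_transitions / len(reclassified) * 100.0


if __name__ == "__main__":
    sample = [
        ("unknown", "affected"),     # FN transition
        ("unaffected", "affected"),  # FN transition
        ("affected", "fixed"),       # reclassified, not FN
        ("unaffected", "unknown"),   # reclassified, not FN
        ("new", "affected"),         # excluded from the denominator
    ]
    print(fn_drift_percent(sample))  # 2 of 4 reclassifications -> 50.0
```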
## SLO Thresholds
| SLO Level | FN-Drift Threshold | Alert Severity |
|-----------|-------------------|----------------|
| Target | < 1.0% | None |
| Warning | 1.0% - 2.5% | Warning |
| Critical | > 2.5% | Critical |
| Engine Drift | > 0% | Page |
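The thresholds in the table can be sketched as a classifier. The function names are hypothetical. Following the table, a value of exactly 1.0% is treated as Warning here; note that the example alerting rule in this guide uses a strict `> 1.0` comparison, so the two differ at that boundary.

```python
def slo_level(fn_drift_percent: float) -> str:
    """Map an FN-Drift percentage to the SLO level from the table."""
    if fn_drift_percent > 2.5:
        return "critical"
    if fn_drift_percent >= 1.0:
        return "warning"
    return "target"


def engine_drift_pages(engine_transitions: int) -> bool:
    """Any engine-caused FN transition (> 0) should page."""
    return engine_transitions > 0


if __name__ == "__main__":
    for value in (0.4, 1.0, 2.5, 3.1):
        print(value, slo_level(value))
    print(engine_drift_pages(0), engine_drift_pages(1))
```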
### Alerting Rules
```yaml
# Example Prometheus alerting rules
groups:
  - name: fn-drift
    rules:
      - alert: FnDriftWarning
        expr: scanner_fn_drift_percent > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "FN-Drift rate above warning threshold"
      - alert: FnDriftCritical
        expr: scanner_fn_drift_percent > 2.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "FN-Drift rate above critical threshold"
      - alert: EngineDriftDetected
        expr: scanner_fn_drift_cause_engine > 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Engine-caused FN drift detected - determinism violation"
```
## Dashboard Queries
### FN-Drift Trend (Grafana)
```promql
# 30-day rolling FN-Drift percentage
scanner_fn_drift_percent
# FN transitions by cause
sum by (cause) (rate(scanner_fn_transitions_total[1h]))
# Classification changes rate
sum by (cause) (rate(scanner_classification_changes_total[1h]))
```
### Drift Cause Breakdown
```promql
# Pie chart of drift causes
topk(5,
  sum by (cause) (
    increase(scanner_fn_transitions_total[24h])
  )
)
```
## Database Schema
### classification_history Table
```sql
CREATE TABLE scanner.classification_history (
    id               BIGSERIAL PRIMARY KEY,
    artifact_digest  TEXT NOT NULL,
    vuln_id          TEXT NOT NULL,
    package_purl     TEXT NOT NULL,
    tenant_id        UUID NOT NULL,
    manifest_id      UUID NOT NULL,
    execution_id     UUID NOT NULL,
    previous_status  TEXT NOT NULL,
    new_status       TEXT NOT NULL,
    is_fn_transition BOOLEAN GENERATED ALWAYS AS (...) STORED,
    cause            TEXT NOT NULL,
    cause_detail     JSONB,
    changed_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
### fn_drift_stats Materialized View
Aggregated daily statistics for efficient dashboard queries:
- Day bucket
- Tenant ID
- Cause breakdown
- FN count and percentage
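To illustrate the shape of this aggregation, the sketch below groups `classification_history`-style rows by day, tenant, and cause. It is a hypothetical in-memory equivalent, not the view's actual SQL definition, which this guide does not show. The function name, the UTC day bucketing, and the output keys are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_fn_stats(rows):
    """Aggregate rows into per-(day, tenant_id, cause) totals and FN percentage.

    Each row is a dict with keys: changed_at (timezone-aware datetime),
    tenant_id, cause, is_fn_transition (bool).
    """
    buckets = defaultdict(lambda: {"total": 0, "fn_count": 0})
    for row in rows:
        # Assumption: day buckets are computed in UTC.
        day = row["changed_at"].astimezone(timezone.utc).date()
        bucket = buckets[(day, row["tenant_id"], row["cause"])]
        bucket["total"] += 1
        bucket["fn_count"] += int(row["is_fn_transition"])
    return {
        key: {**b, "fn_percent": b["fn_count"] / b["total"] * 100.0}
        for key, b in buckets.items()
    }


if __name__ == "__main__":
    ts = datetime(2026, 1, 6, 12, 0, tzinfo=timezone.utc)
    rows = [
        {"changed_at": ts, "tenant_id": "t1", "cause": "feed_delta", "is_fn_transition": True},
        {"changed_at": ts, "tenant_id": "t1", "cause": "feed_delta", "is_fn_transition": False},
    ]
    for key, stats in daily_fn_stats(rows).items():
        print(key, stats)
```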
## Related Documentation
- [Determinism Technical Reference](../product-advisories/14-Dec-2025%20-%20Determinism%20and%20Reproducibility%20Technical%20Reference.md) - Section 13.2
- [Scanner Architecture](../modules/scanner/architecture.md)
- [Telemetry Stack](../modules/telemetry/architecture.md)