# Tiered Precision Curves for Scanner Accuracy

**Advisory:** 16-Dec-2025 - Measuring Progress with Tiered Precision Curves
**Status:** Processing
**Related Sprints:** SPRINT_3500_0003_0001 (Ground-Truth Corpus)

## Executive Summary

This advisory introduces a tiered approach to measuring scanner accuracy that prevents metric gaming. By tracking precision/recall separately for three evidence tiers (Imported, Executed, Tainted→Sink), we ensure improvements in one tier don't hide regressions in another.

## Key Concepts

### Evidence Tiers

| Tier | Description | Risk Level | Typical Volume |
|------|-------------|------------|----------------|
| **Imported** | Vuln exists in dependency | Lowest | High |
| **Executed** | Code/deps actually run | Medium | Medium |
| **Tainted→Sink** | User data reaches sink | Highest | Low |

### Tier Precedence

Highest tier wins when a finding has multiple evidence types:
1. `tainted_sink` (highest)
2. `executed`
3. `imported`
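
The precedence rule can be sketched as a small lookup. This is a minimal illustration, not the scanner's implementation: the numeric ranks and the name `resolve_tier` are assumptions; only the tier names and their ordering come from the list above.

```python
# Ranks encode the precedence order listed above; the numbers themselves
# are an illustrative assumption (only their relative order matters).
TIER_RANK = {"imported": 1, "executed": 2, "tainted_sink": 3}


def resolve_tier(evidence_tiers):
    """Return the highest-precedence tier among those attached to a finding."""
    tiers = list(evidence_tiers)
    if not tiers:
        raise ValueError("finding has no evidence tiers")
    unknown = [t for t in tiers if t not in TIER_RANK]
    if unknown:
        raise ValueError(f"unknown tier(s): {unknown}")
    return max(tiers, key=TIER_RANK.__getitem__)
```

For example, a finding with both `imported` and `tainted_sink` evidence resolves to `tainted_sink`, regardless of the order in which the evidence arrived.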

## Implementation Components

### 1. Evidence Schema (`eval` schema)

```sql
-- Shorthand notation: table(column, ...); column types and constraints are omitted.
-- Ground truth samples
eval.sample(sample_id, name, repo_path, commit_sha, language, scenario, entrypoints)

-- Expected findings
eval.expected_finding(expected_id, sample_id, vuln_key, tier, rule_key, sink_class)

-- Evaluation runs
eval.run(eval_run_id, scanner_version, rules_hash, concelier_snapshot_hash)

-- Observed results
eval.observed_finding(observed_id, eval_run_id, sample_id, vuln_key, tier, score, rule_key, evidence)

-- Computed metrics
eval.metrics(eval_run_id, tier, op_point, precision, recall, f1, pr_auc, latency_p50_ms)
```

### 2. Scanner Worker Changes

Workers emit evidence primitives:
- `DependencyEvidence { purl, version, lockfile_path }`
- `ReachabilityEvidence { entrypoint, call_path[], confidence }`
- `TaintEvidence { source, sink, sanitizers[], dataflow_path[], confidence }`

### 3. Scanner WebService Changes

WebService performs tiering:
- Merge evidence for same `vuln_key`
- Run reachability/taint algorithms
- Assign `evidence_tier` deterministically
- Persist normalized findings
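
A rough sketch of the merge-and-tier step follows, under loud assumptions. The field names mirror the evidence primitives listed in the Scanner Worker section, but the field types, the direct mapping from evidence type to tier, and the function name `assign_tiers` are hypothetical. The sketch also skips the reachability/taint algorithms and ignores `confidence`; it only shows how grouping by `vuln_key` plus a fixed precedence gives a deterministic result.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyEvidence:
    purl: str
    version: str
    lockfile_path: str


@dataclass(frozen=True)
class ReachabilityEvidence:
    entrypoint: str
    call_path: tuple
    confidence: float


@dataclass(frozen=True)
class TaintEvidence:
    source: str
    sink: str
    sanitizers: tuple
    dataflow_path: tuple
    confidence: float


# Assumption: each evidence type maps directly to one tier.
TIER_OF = {
    DependencyEvidence: "imported",
    ReachabilityEvidence: "executed",
    TaintEvidence: "tainted_sink",
}
TIER_RANK = {"imported": 1, "executed": 2, "tainted_sink": 3}


def assign_tiers(pairs):
    """Merge (vuln_key, evidence) pairs and pick one tier per vuln_key.

    The result does not depend on input order: tiers are collected per key
    and the highest-precedence one is chosen.
    """
    merged = defaultdict(list)
    for vuln_key, evidence in pairs:
        merged[vuln_key].append(evidence)

    result = {}
    for vuln_key in sorted(merged):  # stable key order for reproducible output
        tiers = {TIER_OF[type(ev)] for ev in merged[vuln_key]}
        result[vuln_key] = max(tiers, key=TIER_RANK.__getitem__)
    return result
```

Because the tier is derived only from the set of evidence types, replaying the same evidence in a different order yields the same `evidence_tier`.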

### 4. Evaluator CLI

New tool `StellaOps.Scanner.Evaluation.Cli`:
- `import-corpus` - Load samples and expected findings
- `run` - Trigger scans using replay manifest
- `compute` - Calculate per-tier PR curves
- `report` - Generate markdown artifacts
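
To make the `compute` step concrete, here is a hedged sketch of building a PR curve for one tier and summarising it. It is not the CLI's actual code. The input shape is an assumption: each observed finding has already been matched against the expected findings, so it arrives as `(score, is_true_positive)`, and each true positive corresponds to a distinct expected finding. The area is estimated with average precision, which is one common PR-AUC estimator; the advisory does not say which estimator is used.

```python
def pr_curve(scored, n_expected):
    """Sweep score thresholds from high to low for a single tier.

    scored:     list of (score, is_true_positive) for observed findings
    n_expected: number of expected (ground-truth) findings in the tier
    Returns a list of (threshold, precision, recall) tuples.
    """
    if n_expected <= 0:
        raise ValueError("tier has no expected findings")
    ordered = sorted(scored, key=lambda item: item[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Findings sharing a score are admitted together at this threshold.
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((threshold, tp / (tp + fp), tp / n_expected))
    return points


def average_precision(points):
    """Step-wise area under the PR curve (average precision)."""
    area = 0.0
    previous_recall = 0.0
    for _, precision, recall in points:
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area
```

Running this once per tier, on only that tier's expected and observed findings, gives the separate curves the advisory calls for. Expected findings that the scanner never reports lower recall through `n_expected` even though they never appear in `scored`.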

### 5. CI Gates

Fail builds when:
- PR-AUC(imported) drops > 2%
- PR-AUC(executed) or PR-AUC(tainted_sink) drops > 1%
- FP rate in `tainted_sink` > 5% at Recall ≥ 0.7
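
The gate conditions above could be checked along these lines. This is a sketch under stated assumptions: drops are read as absolute PR-AUC differences against a baseline run (a relative reading is equally possible, since the advisory does not specify), and the `tainted_sink` FP rate at Recall ≥ 0.7 is taken as a precomputed input. The function and parameter names are hypothetical.

```python
# Limits taken from the gate list above, read as absolute PR-AUC differences
# (assumption: the advisory does not say absolute vs. relative).
MAX_AUC_DROP = {"imported": 0.02, "executed": 0.01, "tainted_sink": 0.01}
MAX_TAINTED_FP_RATE = 0.05
EPSILON = 1e-9  # tolerance so float noise does not trip a gate at the limit


def gate_failures(baseline_auc, current_auc, tainted_fp_rate):
    """Return human-readable gate violations; an empty list means pass.

    baseline_auc / current_auc: dicts mapping tier name to PR-AUC
    tainted_fp_rate: FP rate for tainted_sink measured at Recall >= 0.7
    """
    failures = []
    for tier, limit in MAX_AUC_DROP.items():
        drop = baseline_auc[tier] - current_auc[tier]
        if drop > limit + EPSILON:
            failures.append(f"PR-AUC({tier}) dropped {drop:.3f} (limit {limit:.2f})")
    if tainted_fp_rate > MAX_TAINTED_FP_RATE + EPSILON:
        failures.append(
            f"tainted_sink FP rate {tainted_fp_rate:.3f} exceeds {MAX_TAINTED_FP_RATE:.2f}"
        )
    return failures
```

A CI job would exit non-zero whenever the returned list is non-empty, and could print the messages as the build failure reason.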

## Operating Points

| Tier | Target Recall | Purpose |
|------|--------------|---------|
| `imported` | ≥ 0.60 | Broad coverage |
| `executed` | ≥ 0.70 | Material risk |
| `tainted_sink` | ≥ 0.80 | Actionable findings |
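
One way to turn these targets into a concrete operating point is shown below. The selection rule is an assumption, since the advisory only states the recall targets: among curve points that meet the tier's target recall, take the one with the highest precision. The input is assumed to be a list of `(threshold, precision, recall)` tuples, and `pick_operating_point` is a hypothetical name.

```python
# Target recall per tier, from the table above.
TARGET_RECALL = {"imported": 0.60, "executed": 0.70, "tainted_sink": 0.80}


def pick_operating_point(tier, points):
    """Choose the (threshold, precision, recall) point to operate at.

    Returns None when no point on the curve reaches the tier's target recall.
    """
    target = TARGET_RECALL[tier]
    eligible = [p for p in points if p[2] >= target]
    if not eligible:
        return None
    # Highest precision wins; ties go to the stricter (higher) threshold.
    return max(eligible, key=lambda p: (p[1], p[0]))
```

Returning `None` rather than the closest point makes a missed recall target explicit, so a report or gate can flag it instead of silently operating below target.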

## Integration with Existing Systems

### Concelier
- Stores advisory data; does not assign tiers
- Tag advisories with sink classes when available

### Excititor (VEX)
- Include `tier` in VEX statements
- Allow per-tier policy thresholds
- Preserve pruning provenance

### Notify
- Gate alerts on tiered thresholds
- Page only on `tainted_sink` at operating point
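
The two Notify rules might combine as follows. This is a sketch only: the routing labels (`page`, `alert`, `suppress`), the shape of `thresholds`, and the reading of "at operating point" as "score at or above the tier's operating threshold" are all assumptions.

```python
def route_notification(tier, score, thresholds):
    """Decide how a finding is surfaced.

    thresholds: dict mapping tier name to its operating-point score threshold
    Returns "page", "alert", or "suppress" (hypothetical labels).
    """
    if score < thresholds[tier]:
        return "suppress"  # below the tier's threshold: no notification
    if tier == "tainted_sink":
        return "page"  # only the highest tier can page on-call
    return "alert"
```

With this shape, an `executed` finding never pages on-call no matter how high its score is.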

### UI
- Show tier badge on findings
- Default sort: tainted_sink > executed > imported
- Display evidence summary (entrypoint, path length, sink class)

## Success Criteria

1. Can demonstrate a release where overall precision stayed flat but tainted→sink PR-AUC improved
2. On-call noise is reduced through tier-gated paging
3. TTFS p95 for tainted→sink within budget

## Related Documentation

- [Ground-Truth Corpus Sprint](../implplan/SPRINT_3500_0003_0001_ground_truth_corpus_ci_gates.md)
- [Scanner Architecture](../modules/scanner/architecture.md)
- [Reachability Analysis](./14-Dec-2025%20-%20Reachability%20Analysis%20Technical%20Reference.md)

## Overlap Analysis

This advisory **extends** the ground-truth corpus work (SPRINT_3500_0003_0001) with:
- Tiered precision tracking (new)
- Per-tier operating points (new)
- CI gates based on tier-specific AUC (enhancement)
- Integration with Notify for tier-gated alerts (new)

No contradictions with existing implementations found.