# Tiered Precision Curves for Scanner Accuracy

**Advisory:** 16-Dec-2025 - Measuring Progress with Tiered Precision Curves
**Status:** Processing
**Related Sprints:** SPRINT_3500_0003_0001 (Ground-Truth Corpus)

## Executive Summary

This advisory introduces a tiered approach to measuring scanner accuracy that prevents metric gaming. By tracking precision/recall separately for three evidence tiers (Imported, Executed, Tainted→Sink), we ensure improvements in one tier don't hide regressions in another.

## Key Concepts

### Evidence Tiers

| Tier | Description | Risk Level | Typical Volume |
|------|-------------|------------|----------------|
| **Imported** | Vuln exists in dependency | Lowest | High |
| **Executed** | Code/deps actually run | Medium | Medium |
| **Tainted→Sink** | User data reaches sink | Highest | Low |

### Tier Precedence

Highest tier wins when a finding has multiple evidence types:

1. `tainted_sink` (highest)
2. `executed`
3. `imported`

## Implementation Components

### 1. Evidence Schema (`eval` schema)

```sql
-- Ground truth samples
eval.sample(sample_id, name, repo_path, commit_sha, language, scenario, entrypoints)

-- Expected findings
eval.expected_finding(expected_id, sample_id, vuln_key, tier, rule_key, sink_class)

-- Evaluation runs
eval.run(eval_run_id, scanner_version, rules_hash, concelier_snapshot_hash)

-- Observed results
eval.observed_finding(observed_id, eval_run_id, sample_id, vuln_key, tier, score, rule_key, evidence)

-- Computed metrics
eval.metrics(eval_run_id, tier, op_point, precision, recall, f1, pr_auc, latency_p50_ms)
```

### 2. Scanner Worker Changes

Workers emit evidence primitives:

- `DependencyEvidence { purl, version, lockfile_path }`
- `ReachabilityEvidence { entrypoint, call_path[], confidence }`
- `TaintEvidence { source, sink, sanitizers[], dataflow_path[], confidence }`

### 3. Scanner WebService Changes

WebService performs tiering:

- Merge evidence for same `vuln_key`
- Run reachability/taint algorithms
- Assign `evidence_tier` deterministically
- Persist normalized findings

### 4. Evaluator CLI

New tool `StellaOps.Scanner.Evaluation.Cli`:

- `import-corpus` - Load samples and expected findings
- `run` - Trigger scans using replay manifest
- `compute` - Calculate per-tier PR curves
- `report` - Generate markdown artifacts

### 5. CI Gates

Fail builds when:

- PR-AUC(imported) drops > 2%
- PR-AUC(executed/tainted_sink) drops > 1%
- FP rate in `tainted_sink` > 5% at Recall ≥ 0.7

## Operating Points

| Tier | Target Recall | Purpose |
|------|--------------|---------|
| `imported` | ≥ 0.60 | Broad coverage |
| `executed` | ≥ 0.70 | Material risk |
| `tainted_sink` | ≥ 0.80 | Actionable findings |

## Integration with Existing Systems

### Concelier

- Stores advisory data, does not tier
- Tag advisories with sink classes when available

### Excititor (VEX)

- Include `tier` in VEX statements
- Allow policy per-tier thresholds
- Preserve pruning provenance

### Notify

- Gate alerts on tiered thresholds
- Page only on `tainted_sink` at operating point

### UI

- Show tier badge on findings
- Default sort: tainted_sink > executed > imported
- Display evidence summary (entrypoint, path length, sink class)

## Success Criteria

1. Can demonstrate release where overall precision stayed flat but tainted→sink PR-AUC improved
2. On-call noise reduced via tier-gated paging
3. TTFS p95 for tainted→sink within budget

## Related Documentation

- [Ground-Truth Corpus Sprint](../implplan/SPRINT_3500_0003_0001_ground_truth_corpus_ci_gates.md)
- [Scanner Architecture](../modules/scanner/architecture.md)
- [Reachability Analysis](./14-Dec-2025%20-%20Reachability%20Analysis%20Technical%20Reference.md)

## Overlap Analysis

This advisory **extends** the ground-truth corpus work (SPRINT_3500_0003_0001) with:

- Tiered precision tracking (new)
- Per-tier operating points (new)
- CI gates based on tier-specific AUC (enhancement)
- Integration with Notify for tier-gated alerts (new)

No contradictions with existing implementations found.
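
## Illustrative Sketch

To make the tier precedence rule, the per-tier precision/recall computation, and the PR-AUC gates from this advisory concrete, here is a minimal Python sketch. It is not the `StellaOps.Scanner.Evaluation.Cli` implementation. The function names, the tuple layouts (loosely modelled on `eval.expected_finding` and `eval.observed_finding`), the reading of "drops > 2%" as an absolute PR-AUC drop of 0.02, and the convention of reporting 1.0 for an empty denominator are all assumptions made for illustration.

```python
# Tier precedence from the advisory: tainted_sink > executed > imported.
TIER_RANK = {"imported": 0, "executed": 1, "tainted_sink": 2}

# Maximum allowed PR-AUC drop per tier, taken from the CI Gates section.
# ASSUMPTION: "drops > 2%" is read as an absolute drop of 0.02.
MAX_PR_AUC_DROP = {"imported": 0.02, "executed": 0.01, "tainted_sink": 0.01}


def assign_tier(evidence_tiers):
    """Return the highest-precedence tier among a finding's evidence types."""
    return max(evidence_tiers, key=TIER_RANK.__getitem__)


def precision_recall(expected, observed, tier, threshold):
    """Compute (precision, recall) for one tier at one score threshold.

    expected: iterable of (sample_id, vuln_key, tier)         -- hypothetical layout
    observed: iterable of (sample_id, vuln_key, tier, score)  -- hypothetical layout
    """
    truth = {(s, v) for s, v, t in expected if t == tier}
    predicted = {
        (s, v) for s, v, t, score in observed
        if t == tier and score >= threshold
    }
    true_positives = len(truth & predicted)
    # ASSUMPTION: an empty denominator is reported as 1.0 rather than raising.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


def gate_failures(baseline_auc, current_auc):
    """List tiers whose PR-AUC dropped by more than the allowed amount."""
    return [
        tier for tier, max_drop in MAX_PR_AUC_DROP.items()
        if baseline_auc[tier] - current_auc[tier] > max_drop
    ]
```

Sweeping `threshold` over the observed scores and collecting the resulting `(precision, recall)` pairs would give the points of one tier's PR curve. A CI job could then call `gate_failures` with the previous and current PR-AUC per tier and fail the build when the returned list is non-empty. Because each tier is evaluated on its own, a gain in `imported` cannot mask a regression in `tainted_sink`.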