Here’s a clean way to **measure and report scanner accuracy without letting one metric hide weaknesses**: track precision/recall (and AUC) separately for three evidence tiers: **Imported**, **Executed**, and **Tainted→Sink**. This mirrors how risk actually escalates in Python/JS-style ecosystems.

### Why tiers?

* **Imported**: vuln in a dependency that is merely present (high volume, lots of noise).
* **Executed**: code/deps actually run on typical paths (fewer FPs).
* **Tainted→Sink**: user-controlled data reaches a sensitive sink (highest signal).

### Minimal spec to implement now

**Ground-truth corpus design**

* Label each emitted finding with `tier ∈ {imported, executed, tainted_sink}` and `truth ∈ {TP, FP}`; expected findings the scanner misses count as FNs. Store model confidence `p ∈ [0,1]`.
* Keep language tags (py, js, ts), package manager, and scenario (web API, CLI, job).

**DB schema (add to the test-analytics DB)**

* `gt_sample(id, repo, commit, lang, scenario)`
* `gt_finding(id, sample_id, vuln_id, tier, truth, score, rule, scanner_version, created_at)`
* `gt_split(sample_id, split ∈ {train, dev, test})`

**Metrics to publish (all stratified by tier)**

* Precision@K (e.g., top-100), Recall@K
* PR-AUC; ROC-AUC only as a secondary metric (PR-AUC is more informative under the heavy class imbalance typical of scanner findings)
* Latency p50/p95 from “scan start → first evidence”
* Coverage: % of samples with any signal in that tier

**Reporting layout (one chart per tier)**

* PR curve + table: `Precision, Recall, F1, PR-AUC, N(findings), N(samples)`
* Error buckets: top 5 false-positive rules, top 5 false-negative patterns

**Evaluation protocol**

1. Freeze a **toy but diverse corpus** (50–200 repos) with deterministic fixture data and replay scripts.
2. For each release candidate:
   * Run the scanner with fixed flags and feeds.
   * Emit per-finding scores; map each to a tier with your reachability engine.
   * Join to ground truth; compute metrics **per tier** and **overall**.
3. Fail the build if any of:
   * PR-AUC(imported) drops >2%, or PR-AUC(executed/tainted_sink) drops >1%.
   * FP rate in `tainted_sink` > 5% at the operating point Recall ≥ 0.7.

**How to classify tiers (deterministic rules)**

* `imported`: package appears in the lockfile/SBOM and resolves in the dependency graph (no reachability required).
* `executed`: function/module reached by dynamic trace, coverage, or a proven path in the static call graph from declared entrypoints.
* `tainted_sink`: taint source → sink path proven (any sanitizers on the path are recorded and judged ineffective), with a sink taxonomy (eval, exec, SQL, SSRF, deserialization, XXE, command, path traversal).

**Developer checklist (Stella Ops naming)**

* Scanner.Worker: attach `score` and the evidence needed to derive `evidence_tier` on each finding.
* Excititor (VEX): include `tier` in statements; allow per-tier policy thresholds.
* Concelier (feeds): tag advisories with sink classes when available to help tier mapping.
* Scheduler/Notify: gate alerts on **tiered** thresholds (e.g., page only on `tainted_sink` at the Recall-target operating point).
* Router dashboards: three small PR curves + trend sparklines; hover shows the last 5 FP causes.

**Quick JSON result shape**

```json
{
  "finding_id": "…",
  "vuln_id": "CVE-2024-12345",
  "rule": "py.sql.injection.param_concat",
  "evidence_tier": "tainted_sink",
  "score": 0.87,
  "reachability": {
    "entrypoint": "app.py:main",
    "path_len": 5,
    "sanitizers": ["escape_sql"]
  }
}
```

**Operating-point selection**

* Choose operating points per tier by maximizing F1 or fixing Recall targets:
  * imported: Recall 0.60
  * executed: Recall 0.70
  * tainted_sink: Recall 0.80

Then record **per-tier precision at those recalls** each release.
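To make the operating-point step concrete, here is a minimal Python sketch; the function name `precision_at_recall_target` and the `(score, is_tp)` tuple shape are illustrative, not part of any StellaOps API. It picks, per tier, the highest score threshold that meets the recall target and reports the precision achieved there:

```python
# Hypothetical sketch: per-tier operating-point selection at a fixed recall target.
def precision_at_recall_target(scored, n_expected, recall_target):
    """scored: list of (score, is_tp) for one tier; n_expected: total
    ground-truth positives in that tier (missed expectations count as FNs).
    Returns (threshold, precision, recall) or None if the target is unreachable."""
    scored = sorted(scored, key=lambda pair: -pair[0])  # sweep threshold downwards
    tp = fp = 0
    for score, is_tp in scored:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall = tp / n_expected if n_expected else 0.0
        precision = tp / (tp + fp)
        if recall >= recall_target:
            # first (i.e., highest) threshold at which the recall target is reached
            return (score, precision, recall)
    return None

# Per-tier recall targets from the spec above.
TARGETS = {"imported": 0.60, "executed": 0.70, "tainted_sink": 0.80}
```

Recording the returned threshold alongside each release is what keeps precision-at-recall comparisons apples-to-apples between versions.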
**Why this prevents metric gaming**

* A model can’t inflate “overall precision” by over-penalizing noisy imported findings: you still have to show gains on the **executed** and **tainted_sink** curves, where it matters.

If you want, I can draft a tiny sample corpus template (folders + labels) and a one-file evaluator that outputs the three PR curves and a markdown summary ready for your CI artifact.

---

What you are trying to solve is this: if you measure “scanner accuracy” as one overall precision/recall number, you can *accidentally* optimize the wrong thing. A scanner can look “better” by getting quieter on the easy/noisy tier (dependencies merely present) while getting worse on the tier that actually matters (user data reaching a dangerous sink). Tiered accuracy prevents that failure mode and gives you a clean product contract:

* **Imported** = “it exists in the artifact” (high volume, high noise)
* **Executed** = “it actually runs on real entrypoints” (materially more useful)
* **Tainted→Sink** = “user-controlled input reaches a sensitive sink” (highest signal, most actionable)

This is not just analytics. It drives:

* alerting (page only on tainted→sink),
* UX (show the *reason* a vuln matters),
* policy/lattice merges (VEX decisions should not collapse tiers),
* engineering priorities (don’t let “imported” improvements hide “tainted→sink” regressions).

Below is a concrete StellaOps implementation plan (aligned to your architecture rules: **lattice algorithms run in `scanner.webservice`**, Concelier/Excititor **preserve prune source**, Postgres is the system of record, Valkey holds only ephemeral state).

---

## 1) Product contract: what “tier” means in StellaOps

### 1.1 Tier assignment rule (single source of truth)

**Owner:** `StellaOps.Scanner.WebService`
**Input:** raw findings + evidence objects from workers (deps, call graph, trace, taint paths)
**Output:** `evidence_tier` on each normalized finding (plus an evidence summary)

**Tier precedence (highest wins):**

1. `tainted_sink`
2. `executed`
3. `imported`

**Deterministic mapping rule:**

* `imported` if the SBOM/lockfile indicates the package/component is present AND the vuln applies to that component.
* `executed` if the reachability engine proves reachability from declared entrypoints (static) OR a runtime trace/coverage proves execution.
* `tainted_sink` if the taint engine proves a source → sink path (recording any sanitizers encountered), classified against the sink taxonomy.

### 1.2 Evidence objects (the “why”)

Workers emit *evidence primitives*; the webservice merges and tiers them:

* `DependencyEvidence { purl, version, lockfile_path }`
* `ReachabilityEvidence { entrypoint, call_path[], confidence }`
* `TaintEvidence { source, sink, sanitizers[], dataflow_path[], confidence }`
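As a rough illustration of the precedence rule, here is a Python-style sketch; the class and function names simply mirror the DTOs above and are not the real C# types in `StellaOps.Scanner.WebService`:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the evidence primitives described in 1.2.
@dataclass
class DependencyEvidence:
    purl: str
    version: str
    lockfile_path: str

@dataclass
class ReachabilityEvidence:
    entrypoint: str
    call_path: list
    confidence: float

@dataclass
class TaintEvidence:
    source: str
    sink: str
    sanitizers: list
    dataflow_path: list
    confidence: float

def assign_tier(evidence: list) -> str | None:
    """Deterministic precedence: tainted_sink > executed > imported."""
    if any(isinstance(e, TaintEvidence) for e in evidence):
        return "tainted_sink"
    if any(isinstance(e, ReachabilityEvidence) for e in evidence):
        return "executed"
    if any(isinstance(e, DependencyEvidence) for e in evidence):
        return "imported"
    return None  # no evidence at all: the finding should not be emitted
```

Because the tier is a pure function of the merged evidence set, the same scan input always yields the same tier, which keeps the eval join in section 5 deterministic.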
---

## 2) Data model in Postgres (system of record)

Create a dedicated schema `eval` for ground truth + computed metrics (keeps it separate from production scans but queryable by the UI).

### 2.1 Tables (minimal but complete)

```sql
create schema if not exists eval;

-- A “sample” = one repo/fixture scenario you scan deterministically
create table eval.sample (
  sample_id   uuid primary key,
  name        text not null,
  repo_path   text not null,     -- local path in your corpus checkout
  commit_sha  text null,
  language    text not null,     -- py/js/ts/java/dotnet/mixed
  scenario    text not null,     -- webapi/cli/job/lib
  entrypoints jsonb not null,    -- array of entrypoint descriptors
  created_at  timestamptz not null default now()
);

-- Expected truth for a sample
create table eval.expected_finding (
  expected_id   uuid primary key,
  sample_id     uuid not null references eval.sample(sample_id) on delete cascade,
  vuln_key      text not null,   -- your canonical vuln key (see 2.2)
  tier          text not null check (tier in ('imported','executed','tainted_sink')),
  rule_key      text null,       -- optional: expected rule family
  location_hint text null,       -- e.g. file:line or package
  sink_class    text null,       -- sql/command/ssrf/deser/eval/path/etc
  notes         text null
);

-- One evaluation run (tied to exact versions + snapshots)
create table eval.run (
  eval_run_id             uuid primary key,
  scanner_version         text not null,
  rules_hash              text not null,
  concelier_snapshot_hash text not null,  -- feed snapshot / advisory set hash
  replay_manifest_hash    text not null,
  started_at              timestamptz not null default now(),
  finished_at             timestamptz null
);

-- Observed results captured from a scan run over the corpus
create table eval.observed_finding (
  observed_id     uuid primary key,
  eval_run_id     uuid not null references eval.run(eval_run_id) on delete cascade,
  sample_id       uuid not null references eval.sample(sample_id) on delete cascade,
  vuln_key        text not null,
  tier            text not null check (tier in ('imported','executed','tainted_sink')),
  score           double precision not null,  -- 0..1
  rule_key        text not null,
  evidence        jsonb not null,             -- summarized evidence blob
  first_signal_ms int not null                -- TTFS-like metric for this finding
);

-- Computed metrics, per tier and operating point
create table eval.metrics (
  eval_run_id    uuid not null references eval.run(eval_run_id) on delete cascade,
  tier           text not null check (tier in ('imported','executed','tainted_sink')),
  op_point       text not null,   -- e.g. "recall>=0.80" or "threshold=0.72"
  precision      double precision not null,
  recall         double precision not null,
  f1             double precision not null,
  pr_auc         double precision not null,
  latency_p50_ms int not null,
  latency_p95_ms int not null,
  n_expected     int not null,
  n_observed     int not null,
  primary key (eval_run_id, tier, op_point)
);
```

### 2.2 Canonical vuln key (avoid mismatches)

Define a single canonical key for matching expected↔observed:

* For dependency vulns: `purl + advisory_id` (or `purl + cve` if available).
* For code-pattern vulns: `rule_family + stable fingerprint` (e.g., `sink_class + file + normalized AST span`).

You need this to stop “matching hell” from destroying the usefulness of the metrics.

---

## 3) Corpus format (how developers add truth samples)

Create a `/corpus/` repo (or folder) with a strict structure:

```
/corpus/
  /samples/
    /py_sql_injection_001/
      sample.yml
      app.py
      requirements.txt
      expected.json
    /js_ssrf_002/
      sample.yml
      index.js
      package-lock.json
      expected.json
  replay-manifest.yml   # pins concelier snapshot, rules hash, analyzers
  tools/
    run-scan.ps1
    run-scan.sh
```

**`sample.yml`** includes:

* language, scenario, entrypoints,
* how to run/build (if needed),
* the “golden” command line for deterministic scanning.

**`expected.json`** is a list of expected findings with `vuln_key`, `tier`, and optional `sink_class`.
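For illustration, a small Python sketch of what the corpus-loading step of the evaluator CLI might do. The file names follow `sample.yml`/`expected.json` as described above; the `::` key separator and the PyYAML dependency are assumptions of this sketch, not fixed conventions:

```python
import json
from pathlib import Path

import yaml  # PyYAML; assumed available in the eval tooling environment

VALID_TIERS = {"imported", "executed", "tainted_sink"}

def load_sample(sample_dir: Path) -> dict:
    """Read one corpus sample (sample.yml + expected.json) into plain dicts,
    ready to upsert into eval.sample / eval.expected_finding."""
    meta = yaml.safe_load((sample_dir / "sample.yml").read_text())
    expected = json.loads((sample_dir / "expected.json").read_text())
    for exp in expected:
        if exp["tier"] not in VALID_TIERS:
            raise ValueError(f"{sample_dir.name}: bad tier {exp['tier']!r}")
    return {"name": sample_dir.name, "meta": meta, "expected": expected}

def canonical_vuln_key(purl=None, advisory_id=None, rule_family=None, fingerprint=None) -> str:
    """Section 2.2: dependency vulns key on purl + advisory; code-pattern vulns
    key on rule family + stable fingerprint. '::' is an illustrative separator."""
    if purl and advisory_id:
        return f"{purl}::{advisory_id}"
    return f"{rule_family}::{fingerprint}"
```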
---

## 4) Pipeline changes in StellaOps (where code changes go)

### 4.1 Scanner workers: emit evidence primitives (no tiering here)

**Modules:**

* `StellaOps.Scanner.Worker.DotNet`
* `StellaOps.Scanner.Worker.Python`
* `StellaOps.Scanner.Worker.Node`
* `StellaOps.Scanner.Worker.Java`

**Change:** every raw finding must include:

* `vuln_key`
* `rule_key`
* `score` (even if coarse at first)
* `evidence[]` primitives (dependency / reachability / taint, as available)
* `first_signal_ms` (time from scan start to the first evidence emitted for that finding)

Workers do **not** decide tiers. They only report what they saw.

### 4.2 Scanner webservice: tiering + lattice merge (this is the policy brain)

**Module:** `StellaOps.Scanner.WebService`

Responsibilities:

* Merge evidence for the same `vuln_key` across analyzers.
* Run reachability/taint algorithms (your lattice policy engine sits here).
* Assign `evidence_tier` deterministically.
* Persist normalized findings (production tables) + export to eval capture.

### 4.3 Concelier + Excititor (preserve prune source)

* Concelier stores advisory data; it does not “tier” anything.
* Excititor stores VEX statements; when it references a finding, it may *annotate* tier context, but it must preserve pruning provenance and not recompute tiers.

---

## 5) Evaluator implementation (the thing that computes tiered precision/recall)

### 5.1 New service/tooling

Create:

* `StellaOps.Scanner.Evaluation.Core` (library)
* `StellaOps.Scanner.Evaluation.Cli` (dotnet tool)

CLI responsibilities:

1. Load corpus samples + expected findings into `eval.sample` / `eval.expected_finding`.
2. Trigger scans (via Scheduler or the direct Scanner API) using `replay-manifest.yml`.
3. Capture observed findings into `eval.observed_finding`.
4. Compute the per-tier PR curve + PR-AUC + operating-point precision/recall.
5. Write `eval.metrics` + produce Markdown/JSON artifacts for CI.

### 5.2 Matching algorithm (practical and robust)

For each `sample_id`:

* Group expected by `(vuln_key, tier)`.
* Group observed by `(vuln_key, tier)`.
* A match is “same vuln_key, same tier”.
* (Later enhancement: allow a “higher tier” observed finding to satisfy a lower-tier expectation only if you explicitly want that; default: **exact tier match** so you catch tier regressions.)

Compute:

* TP/FP/FN per tier.
* PR curve by sweeping a threshold over observed scores.
* `first_signal_ms` percentiles per tier.

### 5.3 Operating points (so it’s not academic)

Pick tier-specific gates:

* `tainted_sink`: require Recall ≥ 0.80, minimize FPs
* `executed`: require Recall ≥ 0.70
* `imported`: require Recall ≥ 0.60

Store the chosen threshold per tier per version (so you can compare apples to apples across regressions).

---

## 6) CI gating (how this becomes “real” engineering pressure)

In the GitLab/Gitea pipeline:

1. Build scanner + webservice.
2. Pull the pinned concelier snapshot bundle (or a local snapshot).
3. Run the evaluator CLI against the corpus.
4. Fail the build if:
   * `PR-AUC(tainted_sink)` drops > 1% vs baseline,
   * or precision at `Recall>=0.80` drops below a floor (e.g. 0.95),
   * or `latency_p95_ms(tainted_sink)` regresses beyond a budget.

Store baselines in the repo (`/corpus/baselines/*.json`) to make diffs explicit.
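A minimal sketch of such a gate script in Python, assuming the evaluator writes per-tier metrics to a JSON artifact with fields like `pr_auc`, `precision_at_recall_0_80`, and `latency_p95_ms` (the exact artifact shape and thresholds here are assumptions, not a fixed contract):

```python
import json
import sys
from pathlib import Path

# Budgets mirroring section 6; tune per your own baselines.
MAX_PR_AUC_DROP = 0.01          # tainted_sink PR-AUC may not drop more than 1 point
PRECISION_FLOOR = 0.95          # at the Recall >= 0.80 operating point
LATENCY_P95_BUDGET_MS = 30_000  # example first-signal budget for tainted_sink

def gate(current_path: str, baseline_path: str) -> int:
    """Compare the current eval metrics artifact against the stored baseline
    and return a non-zero exit code on any regression."""
    cur = json.loads(Path(current_path).read_text())["tainted_sink"]
    base = json.loads(Path(baseline_path).read_text())["tainted_sink"]
    failures = []
    if cur["pr_auc"] < base["pr_auc"] - MAX_PR_AUC_DROP:
        failures.append(f"PR-AUC dropped: {base['pr_auc']:.3f} -> {cur['pr_auc']:.3f}")
    if cur["precision_at_recall_0_80"] < PRECISION_FLOOR:
        failures.append("precision at Recall>=0.80 below floor")
    if cur["latency_p95_ms"] > LATENCY_P95_BUDGET_MS:
        failures.append("tainted_sink latency p95 over budget")
    for failure in failures:
        print("GATE FAIL:", failure)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], sys.argv[2]))
```

In CI this would run right after the evaluator CLI, e.g. `python gate.py eval/current_metrics.json corpus/baselines/latest.json`, with the non-zero exit code failing the build.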
---

## 7) UI and alerting (so tiering changes behavior)

### 7.1 UI

Add three KPI cards:

* Imported PR-AUC trend
* Executed PR-AUC trend
* Tainted→Sink PR-AUC trend

In the findings list:

* show a tier badge
* default sort: `tainted_sink`, then `executed`, then `imported`
* clicking a finding shows the evidence summary (entrypoint, path length, sink class)

### 7.2 Notify policy

Default policy:

* Page/urgent only on `tainted_sink` above a confidence threshold.
* Create a ticket on `executed`.
* Batch report on `imported`.

This is the main “why”: the system stops screaming about irrelevant imports.

---

## 8) Rollout plan (phased, developer-friendly)

### Phase 0: Contracts (1–2 days)

* Define `vuln_key`, `rule_key`, evidence DTOs, tier enum.
* Add schema `eval.*`.

**Done when:** scanner output can carry evidence + score; eval tables exist.

### Phase 1: Evidence emission + tiering (1–2 sprints)

* Workers emit evidence primitives.
* Webservice assigns tiers using the deterministic precedence.

**Done when:** every finding has a tier + evidence summary.

### Phase 2: Corpus + evaluator (1 sprint)

* Build 30–50 samples (10 per tier minimum).
* Implement the evaluator CLI + metrics persistence.

**Done when:** CI can compute tiered metrics and output a markdown report.

### Phase 3: Gates + UX (1 sprint)

* Add CI regression gates.
* Add the UI tier badge + dashboards.
* Add Notify tier-based routing.

**Done when:** a regression in tainted→sink breaks CI even if imported improves.

### Phase 4: Scale corpus + harden matching (ongoing)

* Expand to 200+ samples, multi-language.
* Add fingerprinting for code vulns to avoid brittle file/line matching.

---

## Definition of “success” (so nobody bikesheds)

* You can point to one release where **overall precision stayed flat** but **tainted→sink PR-AUC improved**, and CI proves you didn’t “cheat” by just silencing imported findings.
* On-call noise drops because paging is tier-gated.
* TTFS p95 for tainted→sink stays within a budget you set (e.g., <30s on the corpus).