Here’s a clean way to **measure and report scanner accuracy without letting one metric hide weaknesses**: track precision/recall (and AUC) separately for three evidence tiers: **Imported**, **Executed**, and **Tainted→Sink**. This mirrors how risk actually escalates in Python/JS-style ecosystems.

### Why tiers?

* **Imported**: vuln in a dependency that is merely present (high volume, lots of noise).
* **Executed**: code/deps actually run on typical paths (fewer FPs).
* **Tainted→Sink**: user-controlled data reaches a sensitive sink (highest signal).

### Minimal spec to implement now

**Ground-truth corpus design**

* Label each emitted finding with `tier ∈ {imported, executed, tainted_sink}` and `truth ∈ {TP, FP}`; expected findings the scanner misses count as FNs. Store model confidence `p ∈ [0,1]`.
* Keep language tags (py, js, ts), package manager, and scenario (web API, CLI, job).

**DB schema (add to the test-analytics DB)**

* `gt_sample(id, repo, commit, lang, scenario)`
* `gt_finding(id, sample_id, vuln_id, tier, truth, score, rule, scanner_version, created_at)`
* `gt_split(sample_id, split ∈ {train, dev, test})`

**Metrics to publish (all stratified by tier)**

* Precision@K (e.g., top-100), Recall@K
* PR-AUC; ROC-AUC only as a secondary metric (PR-AUC is more informative under the heavy class imbalance typical of scanner findings)
* Latency p50/p95 from “scan start → first evidence”
* Coverage: % of samples with any signal in that tier

**Reporting layout (one chart per tier)**

* PR curve + table: `Precision, Recall, F1, PR-AUC, N(findings), N(samples)`
* Error buckets: top 5 false-positive rules, top 5 false-negative patterns

**Evaluation protocol**

1. Freeze a **toy but diverse corpus** (50–200 repos) with deterministic fixture data and replay scripts.
2. For each release candidate:
   * Run the scanner with fixed flags and feeds.
   * Emit per-finding scores; map each to a tier with your reachability engine.
   * Join to ground truth; compute metrics **per tier** and **overall**.
3. Fail the build if any of:
   * PR-AUC(imported) drops >2%, or PR-AUC(executed/tainted_sink) drops >1%.
   * FP rate in `tainted_sink` > 5% at the operating point Recall ≥ 0.7.

**How to classify tiers (deterministic rules)**

* `imported`: package appears in the lockfile/SBOM and resolves in the dependency graph (no reachability required).
* `executed`: function/module reached by dynamic trace, coverage, or a proven path in the static call graph from declared entrypoints.
* `tainted_sink`: taint source → sink path proven (any sanitizers on the path are recorded and judged ineffective), with a sink taxonomy (eval, exec, SQL, SSRF, deserialization, XXE, command, path traversal).

**Developer checklist (Stella Ops naming)**

* Scanner.Worker: attach `score` and the evidence needed to derive `evidence_tier` on each finding.
* Excititor (VEX): include `tier` in statements; allow per-tier policy thresholds.
* Concelier (feeds): tag advisories with sink classes when available to help tier mapping.
* Scheduler/Notify: gate alerts on **tiered** thresholds (e.g., page only on `tainted_sink` at the Recall-target operating point).
* Router dashboards: three small PR curves + trend sparklines; hover shows the last 5 FP causes.

**Quick JSON result shape**

```json
{
  "finding_id": "…",
  "vuln_id": "CVE-2024-12345",
  "rule": "py.sql.injection.param_concat",
  "evidence_tier": "tainted_sink",
  "score": 0.87,
  "reachability": {
    "entrypoint": "app.py:main",
    "path_len": 5,
    "sanitizers": ["escape_sql"]
  }
}
```

**Operating-point selection**

* Choose operating points per tier by maximizing F1 or fixing Recall targets:
  * imported: Recall 0.60
  * executed: Recall 0.70
  * tainted_sink: Recall 0.80

Then record **per-tier precision at those recalls** each release.
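To make the operating-point step concrete, here is a minimal Python sketch; the function name `precision_at_recall_target` and the `(score, is_tp)` tuple shape are illustrative, not part of any StellaOps API. It picks, per tier, the highest score threshold that meets the recall target and reports the precision achieved there:

```python
# Hypothetical sketch: per-tier operating-point selection at a fixed recall target.
def precision_at_recall_target(scored, n_expected, recall_target):
    """scored: list of (score, is_tp) for one tier; n_expected: total
    ground-truth positives in that tier (missed expectations count as FNs).
    Returns (threshold, precision, recall) or None if the target is unreachable."""
    scored = sorted(scored, key=lambda pair: -pair[0])  # sweep threshold downwards
    tp = fp = 0
    for score, is_tp in scored:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall = tp / n_expected if n_expected else 0.0
        precision = tp / (tp + fp)
        if recall >= recall_target:
            # first (i.e., highest) threshold at which the recall target is reached
            return (score, precision, recall)
    return None

# Per-tier recall targets from the spec above.
TARGETS = {"imported": 0.60, "executed": 0.70, "tainted_sink": 0.80}
```

Recording the returned threshold alongside each release is what keeps precision-at-recall comparisons apples-to-apples between versions.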
**Why this prevents metric gaming**

* A model can’t inflate “overall precision” by over-penalizing noisy imported findings: you still have to show gains on the **executed** and **tainted_sink** curves, where it matters.

If you want, I can draft a tiny sample corpus template (folders + labels) and a one-file evaluator that outputs the three PR curves and a markdown summary ready for your CI artifact.

---

What you are trying to solve is this: if you measure “scanner accuracy” as one overall precision/recall number, you can *accidentally* optimize the wrong thing. A scanner can look “better” by getting quieter on the easy/noisy tier (dependencies merely present) while getting worse on the tier that actually matters (user data reaching a dangerous sink). Tiered accuracy prevents that failure mode and gives you a clean product contract:

* **Imported** = “it exists in the artifact” (high volume, high noise)
* **Executed** = “it actually runs on real entrypoints” (materially more useful)
* **Tainted→Sink** = “user-controlled input reaches a sensitive sink” (highest signal, most actionable)

This is not just analytics. It drives:

* alerting (page only on tainted→sink),
* UX (show the *reason* a vuln matters),
* policy/lattice merges (VEX decisions should not collapse tiers),
* engineering priorities (don’t let “imported” improvements hide “tainted→sink” regressions).

Below is a concrete StellaOps implementation plan (aligned to your architecture rules: **lattice algorithms run in `scanner.webservice`**, Concelier/Excititor **preserve prune source**, Postgres is the system of record, Valkey holds only ephemeral state).

---

## 1) Product contract: what “tier” means in StellaOps

### 1.1 Tier assignment rule (single source of truth)

**Owner:** `StellaOps.Scanner.WebService`
**Input:** raw findings + evidence objects from workers (deps, call graph, trace, taint paths)
**Output:** `evidence_tier` on each normalized finding (plus an evidence summary)

**Tier precedence (highest wins):**

1. `tainted_sink`
2. `executed`
3. `imported`

**Deterministic mapping rule:**

* `imported` if the SBOM/lockfile indicates the package/component is present AND the vuln applies to that component.
* `executed` if the reachability engine proves reachability from declared entrypoints (static) OR a runtime trace/coverage proves execution.
* `tainted_sink` if the taint engine proves a source → sink path (recording any sanitizers encountered), classified against the sink taxonomy.

### 1.2 Evidence objects (the “why”)

Workers emit *evidence primitives*; the webservice merges and tiers them:

* `DependencyEvidence { purl, version, lockfile_path }`
* `ReachabilityEvidence { entrypoint, call_path[], confidence }`
* `TaintEvidence { source, sink, sanitizers[], dataflow_path[], confidence }`
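As a rough illustration of the precedence rule, here is a Python-style sketch; the class and function names simply mirror the DTOs above and are not the real C# types in `StellaOps.Scanner.WebService`:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the evidence primitives described in 1.2.
@dataclass
class DependencyEvidence:
    purl: str
    version: str
    lockfile_path: str

@dataclass
class ReachabilityEvidence:
    entrypoint: str
    call_path: list
    confidence: float

@dataclass
class TaintEvidence:
    source: str
    sink: str
    sanitizers: list
    dataflow_path: list
    confidence: float

def assign_tier(evidence: list) -> str | None:
    """Deterministic precedence: tainted_sink > executed > imported."""
    if any(isinstance(e, TaintEvidence) for e in evidence):
        return "tainted_sink"
    if any(isinstance(e, ReachabilityEvidence) for e in evidence):
        return "executed"
    if any(isinstance(e, DependencyEvidence) for e in evidence):
        return "imported"
    return None  # no evidence at all: the finding should not be emitted
```

Because the tier is a pure function of the merged evidence set, the same scan input always yields the same tier, which keeps the eval join in section 5 deterministic.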
---

## 2) Data model in Postgres (system of record)

Create a dedicated schema `eval` for ground truth + computed metrics (keeps it separate from production scans but queryable by the UI).

### 2.1 Tables (minimal but complete)

```sql
create schema if not exists eval;

-- A “sample” = one repo/fixture scenario you scan deterministically
create table eval.sample (
  sample_id   uuid primary key,
  name        text not null,
  repo_path   text not null,     -- local path in your corpus checkout
  commit_sha  text null,
  language    text not null,     -- py/js/ts/java/dotnet/mixed
  scenario    text not null,     -- webapi/cli/job/lib
  entrypoints jsonb not null,    -- array of entrypoint descriptors
  created_at  timestamptz not null default now()
);

-- Expected truth for a sample
create table eval.expected_finding (
  expected_id   uuid primary key,
  sample_id     uuid not null references eval.sample(sample_id) on delete cascade,
  vuln_key      text not null,   -- your canonical vuln key (see 2.2)
  tier          text not null check (tier in ('imported','executed','tainted_sink')),
  rule_key      text null,       -- optional: expected rule family
  location_hint text null,       -- e.g. file:line or package
  sink_class    text null,       -- sql/command/ssrf/deser/eval/path/etc
  notes         text null
);

-- One evaluation run (tied to exact versions + snapshots)
create table eval.run (
  eval_run_id             uuid primary key,
  scanner_version         text not null,
  rules_hash              text not null,
  concelier_snapshot_hash text not null,  -- feed snapshot / advisory set hash
  replay_manifest_hash    text not null,
  started_at              timestamptz not null default now(),
  finished_at             timestamptz null
);

-- Observed results captured from a scan run over the corpus
create table eval.observed_finding (
  observed_id     uuid primary key,
  eval_run_id     uuid not null references eval.run(eval_run_id) on delete cascade,
  sample_id       uuid not null references eval.sample(sample_id) on delete cascade,
  vuln_key        text not null,
  tier            text not null check (tier in ('imported','executed','tainted_sink')),
  score           double precision not null,  -- 0..1
  rule_key        text not null,
  evidence        jsonb not null,             -- summarized evidence blob
  first_signal_ms int not null                -- TTFS-like metric for this finding
);

-- Computed metrics, per tier and operating point
create table eval.metrics (
  eval_run_id    uuid not null references eval.run(eval_run_id) on delete cascade,
  tier           text not null check (tier in ('imported','executed','tainted_sink')),
  op_point       text not null,   -- e.g. "recall>=0.80" or "threshold=0.72"
  precision      double precision not null,
  recall         double precision not null,
  f1             double precision not null,
  pr_auc         double precision not null,
  latency_p50_ms int not null,
  latency_p95_ms int not null,
  n_expected     int not null,
  n_observed     int not null,
  primary key (eval_run_id, tier, op_point)
);
```

### 2.2 Canonical vuln key (avoid mismatches)

Define a single canonical key for matching expected↔observed:

* For dependency vulns: `purl + advisory_id` (or `purl + cve` if available).
* For code-pattern vulns: `rule_family + stable fingerprint` (e.g., `sink_class + file + normalized AST span`).

You need this to stop “matching hell” from destroying the usefulness of the metrics.

---

## 3) Corpus format (how developers add truth samples)

Create a `/corpus/` repo (or folder) with a strict structure:

```
/corpus/
  /samples/
    /py_sql_injection_001/
      sample.yml
      app.py
      requirements.txt
      expected.json
    /js_ssrf_002/
      sample.yml
      index.js
      package-lock.json
      expected.json
  replay-manifest.yml   # pins concelier snapshot, rules hash, analyzers
  tools/
    run-scan.ps1
    run-scan.sh
```

**`sample.yml`** includes:

* language, scenario, entrypoints,
* how to run/build (if needed),
* the “golden” command line for deterministic scanning.

**`expected.json`** is a list of expected findings with `vuln_key`, `tier`, and optional `sink_class`.
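For illustration, a small Python sketch of what the corpus-loading step of the evaluator CLI might do. The file names follow `sample.yml`/`expected.json` as described above; the `::` key separator and the PyYAML dependency are assumptions of this sketch, not fixed conventions:

```python
import json
from pathlib import Path

import yaml  # PyYAML; assumed available in the eval tooling environment

VALID_TIERS = {"imported", "executed", "tainted_sink"}

def load_sample(sample_dir: Path) -> dict:
    """Read one corpus sample (sample.yml + expected.json) into plain dicts,
    ready to upsert into eval.sample / eval.expected_finding."""
    meta = yaml.safe_load((sample_dir / "sample.yml").read_text())
    expected = json.loads((sample_dir / "expected.json").read_text())
    for exp in expected:
        if exp["tier"] not in VALID_TIERS:
            raise ValueError(f"{sample_dir.name}: bad tier {exp['tier']!r}")
    return {"name": sample_dir.name, "meta": meta, "expected": expected}

def canonical_vuln_key(purl=None, advisory_id=None, rule_family=None, fingerprint=None) -> str:
    """Section 2.2: dependency vulns key on purl + advisory; code-pattern vulns
    key on rule family + stable fingerprint. '::' is an illustrative separator."""
    if purl and advisory_id:
        return f"{purl}::{advisory_id}"
    return f"{rule_family}::{fingerprint}"
```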
---

## 4) Pipeline changes in StellaOps (where code changes go)

### 4.1 Scanner workers: emit evidence primitives (no tiering here)

**Modules:**

* `StellaOps.Scanner.Worker.DotNet`
* `StellaOps.Scanner.Worker.Python`
* `StellaOps.Scanner.Worker.Node`
* `StellaOps.Scanner.Worker.Java`

**Change:** every raw finding must include:

* `vuln_key`
* `rule_key`
* `score` (even if coarse at first)
* `evidence[]` primitives (dependency / reachability / taint, as available)
* `first_signal_ms` (time from scan start to the first evidence emitted for that finding)

Workers do **not** decide tiers. They only report what they saw.

### 4.2 Scanner webservice: tiering + lattice merge (this is the policy brain)

**Module:** `StellaOps.Scanner.WebService`

Responsibilities:

* Merge evidence for the same `vuln_key` across analyzers.
* Run reachability/taint algorithms (your lattice policy engine sits here).
* Assign `evidence_tier` deterministically.
* Persist normalized findings (production tables) + export to eval capture.

### 4.3 Concelier + Excititor (preserve prune source)

* Concelier stores advisory data; it does not “tier” anything.
* Excititor stores VEX statements; when it references a finding, it may *annotate* tier context, but it must preserve pruning provenance and not recompute tiers.

---

## 5) Evaluator implementation (the thing that computes tiered precision/recall)

### 5.1 New service/tooling

Create:

* `StellaOps.Scanner.Evaluation.Core` (library)
* `StellaOps.Scanner.Evaluation.Cli` (dotnet tool)

CLI responsibilities:

1. Load corpus samples + expected findings into `eval.sample` / `eval.expected_finding`.
2. Trigger scans (via Scheduler or the direct Scanner API) using `replay-manifest.yml`.
3. Capture observed findings into `eval.observed_finding`.
4. Compute the per-tier PR curve + PR-AUC + operating-point precision/recall.
5. Write `eval.metrics` + produce Markdown/JSON artifacts for CI.

### 5.2 Matching algorithm (practical and robust)

For each `sample_id`:

* Group expected by `(vuln_key, tier)`.
* Group observed by `(vuln_key, tier)`.
* A match is “same vuln_key, same tier”.
* (Later enhancement: allow a “higher tier” observed finding to satisfy a lower-tier expectation only if you explicitly want that; default: **exact tier match** so you catch tier regressions.)

Compute:

* TP/FP/FN per tier.
* PR curve by sweeping a threshold over observed scores.
* `first_signal_ms` percentiles per tier.

### 5.3 Operating points (so it’s not academic)

Pick tier-specific gates:

* `tainted_sink`: require Recall ≥ 0.80, minimize FPs
* `executed`: require Recall ≥ 0.70
* `imported`: require Recall ≥ 0.60

Store the chosen threshold per tier per version (so you can compare apples to apples across regressions).

---

## 6) CI gating (how this becomes “real” engineering pressure)

In the GitLab/Gitea pipeline:

1. Build scanner + webservice.
2. Pull the pinned concelier snapshot bundle (or a local snapshot).
3. Run the evaluator CLI against the corpus.
4. Fail the build if:
   * `PR-AUC(tainted_sink)` drops > 1% vs baseline,
   * or precision at `Recall>=0.80` drops below a floor (e.g. 0.95),
   * or `latency_p95_ms(tainted_sink)` regresses beyond a budget.

Store baselines in the repo (`/corpus/baselines/*.json`) to make diffs explicit.
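A minimal sketch of such a gate script in Python, assuming the evaluator writes per-tier metrics to a JSON artifact with fields like `pr_auc`, `precision_at_recall_0_80`, and `latency_p95_ms` (the exact artifact shape and thresholds here are assumptions, not a fixed contract):

```python
import json
import sys
from pathlib import Path

# Budgets mirroring section 6; tune per your own baselines.
MAX_PR_AUC_DROP = 0.01          # tainted_sink PR-AUC may not drop more than 1 point
PRECISION_FLOOR = 0.95          # at the Recall >= 0.80 operating point
LATENCY_P95_BUDGET_MS = 30_000  # example first-signal budget for tainted_sink

def gate(current_path: str, baseline_path: str) -> int:
    """Compare the current eval metrics artifact against the stored baseline
    and return a non-zero exit code on any regression."""
    cur = json.loads(Path(current_path).read_text())["tainted_sink"]
    base = json.loads(Path(baseline_path).read_text())["tainted_sink"]
    failures = []
    if cur["pr_auc"] < base["pr_auc"] - MAX_PR_AUC_DROP:
        failures.append(f"PR-AUC dropped: {base['pr_auc']:.3f} -> {cur['pr_auc']:.3f}")
    if cur["precision_at_recall_0_80"] < PRECISION_FLOOR:
        failures.append("precision at Recall>=0.80 below floor")
    if cur["latency_p95_ms"] > LATENCY_P95_BUDGET_MS:
        failures.append("tainted_sink latency p95 over budget")
    for failure in failures:
        print("GATE FAIL:", failure)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], sys.argv[2]))
```

In CI this would run right after the evaluator CLI, e.g. `python gate.py eval/current_metrics.json corpus/baselines/latest.json`, with the non-zero exit code failing the build.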
---

## 7) UI and alerting (so tiering changes behavior)

### 7.1 UI

Add three KPI cards:

* Imported PR-AUC trend
* Executed PR-AUC trend
* Tainted→Sink PR-AUC trend

In the findings list:

* show a tier badge
* default sort: `tainted_sink`, then `executed`, then `imported`
* clicking a finding shows the evidence summary (entrypoint, path length, sink class)

### 7.2 Notify policy

Default policy:

* Page/urgent only on `tainted_sink` above a confidence threshold.
* Create a ticket on `executed`.
* Batch report on `imported`.

This is the main “why”: the system stops screaming about irrelevant imports.

---

## 8) Rollout plan (phased, developer-friendly)

### Phase 0: Contracts (1–2 days)

* Define `vuln_key`, `rule_key`, evidence DTOs, tier enum.
* Add schema `eval.*`.

**Done when:** scanner output can carry evidence + score; eval tables exist.

### Phase 1: Evidence emission + tiering (1–2 sprints)

* Workers emit evidence primitives.
* Webservice assigns tiers using the deterministic precedence.

**Done when:** every finding has a tier + evidence summary.

### Phase 2: Corpus + evaluator (1 sprint)

* Build 30–50 samples (10 per tier minimum).
* Implement the evaluator CLI + metrics persistence.

**Done when:** CI can compute tiered metrics and output a markdown report.

### Phase 3: Gates + UX (1 sprint)

* Add CI regression gates.
* Add the UI tier badge + dashboards.
* Add Notify tier-based routing.

**Done when:** a regression in tainted→sink breaks CI even if imported improves.

### Phase 4: Scale corpus + harden matching (ongoing)

* Expand to 200+ samples, multi-language.
* Add fingerprinting for code vulns to avoid brittle file/line matching.

---

## Definition of “success” (so nobody bikesheds)

* You can point to one release where **overall precision stayed flat** but **tainted→sink PR-AUC improved**, and CI proves you didn’t “cheat” by just silencing imported findings.
* On-call noise drops because paging is tier-gated.
* TTFS p95 for tainted→sink stays within a budget you set (e.g., <30s on the corpus).