Measuring Progress with Tiered Precision Curves (16 Dec 2025)

Here's a clean way to measure and report scanner accuracy without letting one metric hide weaknesses: track precision/recall (and AUC) separately for three evidence tiers: Imported, Executed, and Tainted→Sink. This mirrors how risk truly escalates in Python/JS-style ecosystems.

Why tiers?

  • Imported: vuln in a dep that's present (lots of noise).
  • Executed: code/deps actually run on typical paths (fewer FPs).
  • Tainted→Sink: user-controlled data reaches a sensitive sink (highest signal).

Minimal spec to implement now

Ground-truth corpus design

  • Label each ground-truth finding with tier ∈ {imported, executed, tainted_sink} and a truth verdict (real, applicable vulnerability vs. known-benign); store the scanner's confidence p∈[0,1] for each matched finding.
  • Keep language tags (py, js, ts), package manager, and scenario (web API, cli, job).

DB schema (add to test analytics db)

  • gt_sample(id, repo, commit, lang, scenario)
  • gt_finding(id, sample_id, vuln_id, tier, truth, score, rule, scanner_version, created_at)
  • gt_split(sample_id, split ∈ {train,dev,test})

Metrics to publish (all stratified by tier)

  • Precision@K (e.g., top 100), Recall@K
  • PR-AUC, ROC-AUC (only if calibrated)
  • Latency p50/p95 from “scan start → first evidence”
  • Coverage: % of samples with any signal in that tier

Reporting layout (one chart per tier)

  • PR curve + table: Precision, Recall, F1, PR-AUC, N(findings), N(samples)
  • Error buckets: top 5 false-positive rules, top 5 false-negative patterns

Evaluation protocol

  1. Freeze a toy but diverse corpus (50-200 repos) with deterministic fixture data and replay scripts.

  2. For each release candidate:

    • Run scanner with fixed flags and feeds.
    • Emit per-finding scores; map each to a tier with your reachability engine.
    • Join to ground truth; compute metrics per tier and overall.
  3. Fail the build if any of:

    • PR-AUC(imported) drops >2%, or PR-AUC(executed/tainted_sink) drops >1%.
    • FP rate in tainted_sink > 5% at operating point Recall ≥ 0.7.

How to classify tiers (deterministic rules)

  • imported: package appears in the lockfile/SBOM and is present in the resolved dependency graph.
  • executed: function/module reached by dynamic trace, coverage, or proven path in static call graph used by entrypoints.
  • tainted_sink: taint source → sanitizers → sink path proven, with sink taxonomy (eval, exec, SQL, SSRF, deserialization, XXE, command, path traversal).
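
A minimal sketch of this precedence as one pure function; the evidence flags are hypothetical inputs produced by the reachability and taint engines, not existing StellaOps APIs.

// Illustrative only: type and parameter names are assumptions, not StellaOps contracts.
public enum EvidenceTier { Imported = 0, Executed = 1, TaintedSink = 2 }

public static class TierClassifier
{
    // Deterministic precedence: the highest tier whose evidence predicate holds wins.
    public static EvidenceTier? Classify(
        bool componentPresent,   // lockfile/SBOM shows the vulnerable component
        bool provenReachable,    // static call graph from entrypoints or runtime trace reaches it
        bool taintPathProven)    // source → sink dataflow proven by the taint engine
    {
        if (taintPathProven) return EvidenceTier.TaintedSink;
        if (provenReachable) return EvidenceTier.Executed;
        if (componentPresent) return EvidenceTier.Imported;
        return null;             // no evidence at all: nothing to report in any tier
    }
}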

Developer checklist (StellaOps naming)

  • Scanner.Worker: emit evidence primitives and a score on each finding (evidence_tier is assigned centrally in Scanner.WebService; see the implementation plan below).
  • Excititor (VEX): include tier in statements; allow per-tier policy thresholds.
  • Concelier (feeds): tag advisories with sink classes when available to help tier mapping.
  • Scheduler/Notify: gate alerts on tiered thresholds (e.g., page only on tainted_sink at the Recall-target operating point).
  • Router dashboards: three small PR curves + trend sparklines; hover shows last 5 FP causes.

Quick JSON result shape

{
  "finding_id": "…",
  "vuln_id": "CVE-2024-12345",
  "rule": "py.sql.injection.param_concat",
  "evidence_tier": "tainted_sink",
  "score": 0.87,
  "reachability": { "entrypoint": "app.py:main", "path_len": 5, "sanitizers": ["escape_sql"] }
}

Operational point selection

  • Choose operating points per tier by maximizing F1 or fixing Recall targets:

    • imported: Recall 0.60
    • executed: Recall 0.70
    • tainted_sink: Recall 0.80

  Then record per-tier precision at those recall targets for each release.

Why this prevents metric gaming

  • A model can't inflate “overall precision” by over-penalizing the noisy imported findings: it still has to show gains on the executed and tainted_sink curves, where it matters.

If you want, I can draft a tiny sample corpus template (folders + labels) and a one-file evaluator that outputs the three PR curves and a markdown summary ready for your CI artifact.

What you are trying to solve is this:

If you measure “scanner accuracy” as one overall precision/recall number, you can accidentally optimize the wrong thing. A scanner can look “better” by getting quieter on the easy/noisy tier (dependencies merely present) while getting worse on the tier that actually matters (user-data reaching a dangerous sink). Tiered accuracy prevents that failure mode and gives you a clean product contract:

  • Imported = “it exists in the artifact” (high volume, high noise)
  • Executed = “it actually runs on real entrypoints” (materially more useful)
  • Tainted→Sink = “user-controlled input reaches a sensitive sink” (highest signal, most actionable)

This is not just analytics. It drives:

  • alerting (page only on tainted→sink),
  • UX (show the reason a vuln matters),
  • policy/lattice merges (VEX decisions should not collapse tiers),
  • engineering priorities (don't let “imported” improvements hide “tainted→sink” regressions).

Below is a concrete StellaOps implementation plan (aligned to your architecture rules: lattice algorithms run in scanner.webservice, Concelier/Excititor preserve pruning provenance, Postgres is the system of record, Valkey holds only ephemeral state).


1) Product contract: what “tier” means in StellaOps

1.1 Tier assignment rule (single source of truth)

Owner: StellaOps.Scanner.WebService
Input: raw findings + evidence objects from workers (deps, call graph, trace, taint paths)
Output: evidence_tier on each normalized finding (plus an evidence summary)

Tier precedence (highest wins):

  1. tainted_sink
  2. executed
  3. imported

Deterministic mapping rule:

  • imported if SBOM/lockfile indicates package/component present AND vuln applies to that component.
  • executed if reachability engine can prove reachable from declared entrypoints (static) OR runtime trace/coverage proves execution.
  • tainted_sink if taint engine proves source→(optional sanitizer)→sink path with sink taxonomy.

1.2 Evidence objects (the “why”)

Workers emit evidence primitives; webservice merges + tiers them:

  • DependencyEvidence { purl, version, lockfile_path }
  • ReachabilityEvidence { entrypoint, call_path[], confidence }
  • TaintEvidence { source, sink, sanitizers[], dataflow_path[], confidence }
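
As a sketch, assuming the field names above, these primitives could be plain records (the exact DTO shapes are not fixed here):

using System.Collections.Generic;

// Sketch only: mirrors the evidence list above, not existing StellaOps DTOs.
public sealed record DependencyEvidence(string Purl, string Version, string LockfilePath);

public sealed record ReachabilityEvidence(
    string Entrypoint,
    IReadOnlyList<string> CallPath,
    double Confidence);

public sealed record TaintEvidence(
    string Source,
    string Sink,
    IReadOnlyList<string> Sanitizers,
    IReadOnlyList<string> DataflowPath,
    double Confidence);
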

2) Data model in Postgres (system of record)

Create a dedicated schema eval for ground truth + computed metrics (keeps it separate from production scans but queryable by the UI).

2.1 Tables (minimal but complete)

create schema if not exists eval;

-- A “sample” = one repo/fixture scenario you scan deterministically
create table eval.sample (
  sample_id uuid primary key,
  name text not null,
  repo_path text not null,              -- local path in your corpus checkout
  commit_sha text null,
  language text not null,               -- py/js/ts/java/dotnet/mixed
  scenario text not null,               -- webapi/cli/job/lib
  entrypoints jsonb not null,           -- array of entrypoint descriptors
  created_at timestamptz not null default now()
);

-- Expected truth for a sample
create table eval.expected_finding (
  expected_id uuid primary key,
  sample_id uuid not null references eval.sample(sample_id) on delete cascade,
  vuln_key text not null,               -- your canonical vuln key (see 2.2)
  tier text not null check (tier in ('imported','executed','tainted_sink')),
  rule_key text null,                   -- optional: expected rule family
  location_hint text null,              -- e.g. file:line or package
  sink_class text null,                 -- sql/command/ssrf/deser/eval/path/etc
  notes text null
);

-- One evaluation run (tied to exact versions + snapshots)
create table eval.run (
  eval_run_id uuid primary key,
  scanner_version text not null,
  rules_hash text not null,
  concelier_snapshot_hash text not null,   -- feed snapshot / advisory set hash
  replay_manifest_hash text not null,
  started_at timestamptz not null default now(),
  finished_at timestamptz null
);

-- Observed results captured from a scan run over the corpus
create table eval.observed_finding (
  observed_id uuid primary key,
  eval_run_id uuid not null references eval.run(eval_run_id) on delete cascade,
  sample_id uuid not null references eval.sample(sample_id) on delete cascade,
  vuln_key text not null,
  tier text not null check (tier in ('imported','executed','tainted_sink')),
  score double precision not null,      -- 0..1
  rule_key text not null,
  evidence jsonb not null,              -- summarized evidence blob
  first_signal_ms int not null          -- TTFS-like metric for this finding
);

-- Computed metrics, per tier and operating point
create table eval.metrics (
  eval_run_id uuid not null references eval.run(eval_run_id) on delete cascade,
  tier text not null check (tier in ('imported','executed','tainted_sink')),
  op_point text not null,               -- e.g. "recall>=0.80" or "threshold=0.72"
  precision double precision not null,
  recall double precision not null,
  f1 double precision not null,
  pr_auc double precision not null,
  latency_p50_ms int not null,
  latency_p95_ms int not null,
  n_expected int not null,
  n_observed int not null,
  primary key (eval_run_id, tier, op_point)
);

2.2 Canonical vuln key (avoid mismatches)

Define a single canonical key for matching expected↔observed:

  • For dependency vulns: purl + advisory_id (or purl + cve if available).
  • For code-pattern vulns: rule_family + stable fingerprint (e.g., sink_class + file + normalized AST span).

You need this to stop “matching hell” from destroying the usefulness of metrics.
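
A possible helper for the two key forms; the SHA-256 fingerprint over sink class, file, and normalized span is an assumption standing in for whatever stable AST-span hashing you adopt:

using System;
using System.Security.Cryptography;
using System.Text;

// Sketch: canonical vuln_key construction for the two cases above. Names are illustrative.
public static class VulnKey
{
    // Dependency vulns: purl + advisory id (or CVE) identifies the finding.
    public static string ForDependency(string purl, string advisoryId)
        => $"{purl}::{advisoryId}";

    // Code-pattern vulns: rule family + a stable fingerprint of the location.
    public static string ForCodePattern(string ruleFamily, string sinkClass, string file, string normalizedSpan)
        => $"{ruleFamily}::{Sha256Hex($"{sinkClass}|{file}|{normalizedSpan}")}";

    private static string Sha256Hex(string input)
    {
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(input));
        return Convert.ToHexString(bytes).ToLowerInvariant();
    }
}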


3) Corpus format (how developers add truth samples)

Create /corpus/ repo (or folder) with strict structure:

/corpus/
  /samples/
    /py_sql_injection_001/
      sample.yml
      app.py
      requirements.txt
      expected.json
    /js_ssrf_002/
      sample.yml
      index.js
      package-lock.json
      expected.json
  replay-manifest.yml        # pins concelier snapshot, rules hash, analyzers
  tools/
    run-scan.ps1
    run-scan.sh

sample.yml includes:

  • language, scenario, entrypoints,
  • how to run/build (if needed),
  • “golden” command line for deterministic scanning.

expected.json is a list of expected findings with vuln_key, tier, optional sink_class.
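
Purely as an illustration, one expected.json could look like this (field names follow eval.expected_finding; the purl is made up, the CVE and rule reuse the earlier example, and the fingerprint is a placeholder):

[
  {
    "vuln_key": "pkg:pypi/example-lib@1.2.3::CVE-2024-12345",
    "tier": "imported",
    "rule_key": null,
    "location_hint": "requirements.txt",
    "sink_class": null,
    "notes": "vulnerable dependency pinned in the lockfile"
  },
  {
    "vuln_key": "py.sql.injection::<fingerprint>",
    "tier": "tainted_sink",
    "rule_key": "py.sql.injection.param_concat",
    "location_hint": "app.py",
    "sink_class": "sql",
    "notes": "request parameter concatenated into a SQL query"
  }
]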


4) Pipeline changes in StellaOps (where code changes go)

4.1 Scanner workers: emit evidence primitives (no tiering here)

Modules:

  • StellaOps.Scanner.Worker.DotNet
  • StellaOps.Scanner.Worker.Python
  • StellaOps.Scanner.Worker.Node
  • StellaOps.Scanner.Worker.Java

Change:

  • Every raw finding must include:

    • vuln_key
    • rule_key
    • score (even if coarse at first)
    • evidence[] primitives (dependency / reachability / taint as available)
    • first_signal_ms (time from scan start to first evidence emitted for that finding)

Workers do not decide tiers. They only report what they saw.
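
For reference, a sketch of that payload as a single envelope record (IEvidence is a hypothetical marker for the section 1.2 primitives; the concrete DTO set mentioned at the end of this note would supersede this):

using System.Collections.Generic;

// Sketch of the raw per-finding payload emitted by workers; field names follow the list above.
public interface IEvidence { }   // hypothetical marker for dependency/reachability/taint primitives

public sealed record FindingEnvelope(
    string VulnKey,                     // canonical key (see section 2.2)
    string RuleKey,                     // rule/detector family that fired
    double Score,                       // 0..1 confidence; coarse is fine at first
    IReadOnlyList<IEvidence> Evidence,  // dependency / reachability / taint primitives
    int FirstSignalMs);                 // ms from scan start to first evidence for this finding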

4.2 Scanner webservice: tiering + lattice merge (this is the policy brain)

Module: StellaOps.Scanner.WebService

Responsibilities:

  • Merge evidence for the same vuln_key across analyzers.
  • Run reachability/taint algorithms (your lattice policy engine sits here).
  • Assign evidence_tier deterministically.
  • Persist normalized findings (production tables) + export to eval capture.

4.3 Concelier + Excititor (preserve prune source)

  • Concelier stores advisory data; does not “tier” anything.
  • Excititor stores VEX statements; when it references a finding, it may annotate tier context, but it must preserve pruning provenance and not recompute tiers.

5) Evaluator implementation (the thing that computes tiered precision/recall)

5.1 New service/tooling

Create:

  • StellaOps.Scanner.Evaluation.Core (library)
  • StellaOps.Scanner.Evaluation.Cli (dotnet tool)

CLI responsibilities:

  1. Load corpus samples + expected findings into eval.sample / eval.expected_finding.
  2. Trigger scans (via Scheduler or direct Scanner API) using replay-manifest.yml.
  3. Capture observed findings into eval.observed_finding.
  4. Compute per-tier PR curve + PR-AUC + operating-point precision/recall.
  5. Write eval.metrics + produce Markdown/JSON artifacts for CI.

5.2 Matching algorithm (practical and robust)

For each sample_id:

  • Group expected by (vuln_key, tier).

  • Group observed by (vuln_key, tier).

  • A match is “same vuln_key, same tier”.

    • (Later enhancement: allow “higher tier” observed to satisfy a lower-tier expected only if you explicitly want that; default: exact tier match so you catch tier regressions.)

Compute:

  • TP/FP/FN per tier.
  • PR curve by sweeping threshold over observed scores.
  • first_signal_ms percentiles per tier.
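
A compact sketch of that matching plus the threshold sweep, using exact-tier matching on vuln_key as described above (types and names are simplified for illustration, not the evaluator's real API):

using System;
using System.Collections.Generic;
using System.Linq;

// Simplified expected/observed shapes: one row per (sample, vuln_key) in a tier.
public sealed record ExpectedRow(Guid SampleId, string VulnKey, string Tier);
public sealed record ObservedRow(Guid SampleId, string VulnKey, string Tier, double Score);

public static class TierMetrics
{
    // Sweep the threshold over observed scores and emit (threshold, precision, recall) points.
    public static IEnumerable<(double Threshold, double Precision, double Recall)> PrCurve(
        IReadOnlyList<ExpectedRow> expected, IReadOnlyList<ObservedRow> observed, string tier)
    {
        var truth = expected.Where(e => e.Tier == tier)
                            .Select(e => (e.SampleId, e.VulnKey))
                            .ToHashSet();
        var obs = observed.Where(o => o.Tier == tier).ToList();

        foreach (var threshold in obs.Select(o => o.Score).Distinct().OrderByDescending(s => s))
        {
            // Keep findings at or above the threshold, deduplicated on the canonical key.
            var kept = obs.Where(o => o.Score >= threshold)
                          .Select(o => (o.SampleId, o.VulnKey))
                          .Distinct()
                          .ToList();

            int tp = kept.Count(k => truth.Contains(k));
            int fp = kept.Count - tp;
            int fn = truth.Count - tp;

            double precision = (tp + fp) == 0 ? 1.0 : (double)tp / (tp + fp);
            double recall = truth.Count == 0 ? 0.0 : (double)tp / (tp + fn);
            yield return (threshold, precision, recall);
        }
    }
}

PR-AUC then follows from step-wise integration over these points, and the first_signal_ms percentiles come straight from the observed rows.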

5.3 Operating points (so it's not academic)

Pick tier-specific gates:

  • tainted_sink: require Recall ≥ 0.80, minimize FP
  • executed: require Recall ≥ 0.70
  • imported: require Recall ≥ 0.60

Store the chosen threshold per tier per version (so you can compare apples-to-apples in regressions).
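
Given curve points like those from the sketch in 5.2, choosing the per-tier operating point could look like this (an assumption-laden helper, not a prescribed API):

using System.Collections.Generic;
using System.Linq;

public static class OperatingPointSelector
{
    // Among thresholds that meet the recall target, keep the one with the best precision.
    // Returns null when the tier never reaches the target; treat that as a gate failure.
    public static (double Threshold, double Precision, double Recall)? AtRecallTarget(
        IEnumerable<(double Threshold, double Precision, double Recall)> curve,
        double recallTarget)
    {
        var candidates = curve.Where(p => p.Recall >= recallTarget)
                              .OrderByDescending(p => p.Precision)
                              .ToList();
        if (candidates.Count == 0) return null;
        return candidates[0];
    }
}

The chosen threshold per tier (at the 0.80/0.70/0.60 recall targets above) is what gets recorded in eval.metrics (e.g., op_point "threshold=0.72") and compared against the stored baseline on the next release.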


6) CI gating (how this becomes “real” engineering pressure)

In GitLab/Gitea pipeline:

  1. Build scanner + webservice.

  2. Pull pinned concelier snapshot bundle (or local snapshot).

  3. Run evaluator CLI against corpus.

  4. Fail build if:

    • PR-AUC(tainted_sink) drops > 1% vs baseline
    • or precision at Recall>=0.80 drops below a floor (e.g. 0.95)
    • or latency_p95_ms(tainted_sink) regresses beyond a budget

Store baselines in repo (/corpus/baselines/<scanner_version>.json) to make diffs explicit.
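
One possible shape for such a baseline file; every number is illustrative, the precision floor and latency budget echo the gates above, and the field names are an assumption to pin down with the evaluator:

{
  "scanner_version": "<scanner_version>",
  "tiers": {
    "imported":     { "recall_target": 0.60, "threshold": 0.35, "precision_at_target": 0.71, "pr_auc": 0.68 },
    "executed":     { "recall_target": 0.70, "threshold": 0.48, "precision_at_target": 0.83, "pr_auc": 0.79 },
    "tainted_sink": { "recall_target": 0.80, "threshold": 0.62, "precision_at_target": 0.95, "pr_auc": 0.91 }
  },
  "latency_budget_ms": { "tainted_sink_p95": 30000 }
}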


7) UI and alerting (so tiering changes behavior)

7.1 UI

Add three KPI cards:

  • Imported PR-AUC trend
  • Executed PR-AUC trend
  • Tainted→Sink PR-AUC trend

In the findings list:

  • show tier badge
  • default sort: tainted_sink then executed then imported
  • clicking a finding shows evidence summary (entrypoint, path length, sink class)

7.2 Notify policy

Default policy:

  • Page/urgent only on tainted_sink above a confidence threshold.
  • Create ticket on executed.
  • Batch report on imported.

This is the main “why”: the system stops screaming about irrelevant imports.


8) Rollout plan (phased, developer-friendly)

Phase 0: Contracts (1-2 days)

  • Define vuln_key, rule_key, evidence DTOs, tier enum.
  • Add schema eval.*.

Done when: scanner output can carry evidence + score; eval tables exist.

Phase 1: Evidence emission + tiering (1-2 sprints)

  • Workers emit evidence primitives.
  • Webservice assigns tier using deterministic precedence.

Done when: every finding has a tier + evidence summary.

Phase 2: Corpus + evaluator (1 sprint)

  • Build 30-50 samples (10 per tier minimum).
  • Implement evaluator CLI + metrics persistence.

Done when: CI can compute tiered metrics and output markdown report.

Phase 3: Gates + UX (1 sprint)

  • Add CI regression gates.
  • Add UI tier badge + dashboards.
  • Add Notify tier-based routing.

Done when: a regression in tainted→sink breaks CI even if imported improves.

Phase 4: Scale corpus + harden matching (ongoing)

  • Expand to 200+ samples, multi-language.
  • Add fingerprinting for code vulns to avoid brittle file/line matching.

Definition of “success” (so nobody bikesheds)

  • You can point to one release where overall precision stayed flat but tainted→sink PR-AUC improved, and CI proves you didn't “cheat” by just silencing imported findings.
  • On-call noise drops because paging is tier-gated.
  • TTFS p95 for tainted→sink stays within a budget you set (e.g., <30s on corpus and <N seconds on real images).

If you want, I can also give you:

  • a concrete DTO set (FindingEnvelope, EvidenceUnion, etc.) in C#/.NET 10,
  • and a skeleton StellaOps.Scanner.Evaluation.Cli command layout (import-corpus, run, compute, report) that your agents can start coding immediately.