Here’s a clean way to measure and report scanner accuracy without letting one metric hide weaknesses: track precision/recall (and AUC) separately for three evidence tiers: Imported, Executed, and Tainted→Sink. This mirrors how risk truly escalates in Python/JS‑style ecosystems.
Why tiers?
- Imported: vuln in a dep that’s present (lots of noise).
- Executed: code/deps actually run on typical paths (fewer FPs).
- Tainted→Sink: user‑controlled data reaches a sensitive sink (highest signal).
Minimal spec to implement now
Ground‑truth corpus design
- Label each finding as: tier ∈ {imported, executed, tainted_sink}, true_label ∈ {TP, FN}; store model confidence p ∈ [0, 1].
- Keep language tags (py, js, ts), package manager, and scenario (web API, CLI, job).
DB schema (add to test analytics db)
- gt_sample(id, repo, commit, lang, scenario)
- gt_finding(id, sample_id, vuln_id, tier, truth, score, rule, scanner_version, created_at)
- gt_split(sample_id, split ∈ {train, dev, test})
Metrics to publish (all stratified by tier)
- Precision@K (e.g., top‑100), Recall@K
- PR‑AUC, ROC‑AUC (only if calibrated)
- Latency p50/p95 from “scan start → first evidence”
- Coverage: % of samples with any signal in that tier
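As a concrete reference for the @K metrics, here is a minimal sketch (C#; the ScoredFinding shape and the assumption that findings are already joined to ground truth are illustrative, not an existing API):

```csharp
// Minimal sketch: Precision@K and Recall@K for one tier, given findings already
// matched against ground truth. Names are illustrative placeholders.
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ScoredFinding(string FindingId, double Score, bool IsTruePositive);

public static class TierMetrics
{
    public static (double Precision, double Recall) AtK(
        IReadOnlyList<ScoredFinding> findings, int k, int totalExpected)
    {
        // Rank by model confidence and keep the top K findings for this tier.
        var topK = findings.OrderByDescending(f => f.Score).Take(k).ToList();
        if (topK.Count == 0 || totalExpected == 0) return (0.0, 0.0);

        int truePositives = topK.Count(f => f.IsTruePositive);
        double precision = (double)truePositives / topK.Count;
        double recall = (double)truePositives / totalExpected;
        return (precision, recall);
    }
}
```

PR-AUC per tier then comes from sweeping the score threshold instead of fixing K (see the evaluator sketch under 5.2 below).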
Reporting layout (one chart per tier)
- PR curve + table: Precision, Recall, F1, PR‑AUC, N(findings), N(samples)
- Error buckets: top 5 false‑positive rules, top 5 false‑negative patterns
Evaluation protocol
- Freeze a toy but diverse corpus (50–200 repos) with deterministic fixture data and replay scripts.
- For each release candidate:
  - Run the scanner with fixed flags and feeds.
  - Emit per‑finding scores; map each to a tier with your reachability engine.
  - Join to ground truth; compute metrics per tier and overall.
- Fail the build if any of:
  - PR‑AUC(imported) drops > 2%, or PR‑AUC(executed/tainted_sink) drops > 1%.
  - FP rate in tainted_sink > 5% at the operating point Recall ≥ 0.7.
How to classify tiers (deterministic rules)
- imported: package appears in the lockfile/SBOM and is reachable in the dependency graph.
- executed: function/module reached by dynamic trace, coverage, or a proven path in the static call graph used by the entrypoints.
- tainted_sink: taint source → sanitizers → sink path proven, with a sink taxonomy (eval, exec, SQL, SSRF, deserialization, XXE, command, path traversal).
Developer checklist (StellaOps naming)
- Scanner.Worker: emit evidence_tier and score on each finding.
- Excititor (VEX): include tier in statements; allow per‑tier policy thresholds.
- Concelier (feeds): tag advisories with sink classes when available to help tier mapping.
- Scheduler/Notify: gate alerts on tiered thresholds (e.g., page only on tainted_sink at the Recall‑target operating point).
- Router dashboards: three small PR curves + trend sparklines; hover shows the last 5 FP causes.
Quick JSON result shape
```json
{
  "finding_id": "…",
  "vuln_id": "CVE-2024-12345",
  "rule": "py.sql.injection.param_concat",
  "evidence_tier": "tainted_sink",
  "score": 0.87,
  "reachability": { "entrypoint": "app.py:main", "path_len": 5, "sanitizers": ["escape_sql"] }
}
```
Operational point selection
- Choose op‑points per tier by maximizing F1 or by fixing Recall targets:
  - imported: Recall 0.60
  - executed: Recall 0.70
  - tainted_sink: Recall 0.80
- Then record per‑tier precision at those recalls each release (a threshold‑selection sketch follows below).
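The threshold‑selection sketch referenced above (C#; illustrative shapes, not a fixed API): walk thresholds from strict to lenient and stop at the first one that meets the Recall target, recording the precision observed there.

```csharp
// Minimal sketch: pick the per-tier score threshold as the highest threshold that
// still meets the Recall target, and report the precision at that operating point.
using System;
using System.Collections.Generic;
using System.Linq;

public static class OperatingPoint
{
    public static (double Threshold, double Precision, double Recall) ForRecallTarget(
        IReadOnlyList<(double Score, bool IsTruePositive)> findings,
        int totalExpected,
        double recallTarget)
    {
        // Sweep candidate thresholds from strict to lenient (descending distinct scores).
        foreach (double threshold in findings.Select(f => f.Score).Distinct().OrderByDescending(s => s))
        {
            var kept = findings.Where(f => f.Score >= threshold).ToList();
            int tp = kept.Count(f => f.IsTruePositive);
            double recall = totalExpected == 0 ? 0.0 : (double)tp / totalExpected;
            if (recall >= recallTarget)
            {
                double precision = kept.Count == 0 ? 0.0 : (double)tp / kept.Count;
                return (threshold, precision, recall);
            }
        }
        return (0.0, 0.0, 0.0); // Recall target not reachable with the current scores.
    }
}
```

For example, `OperatingPoint.ForRecallTarget(taintedSinkFindings, expectedCount, 0.80)` gives the threshold and precision you would log for the tainted_sink tier each release.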
Why this prevents metric gaming
- A model can’t inflate “overall precision” by simply getting quieter on noisy imported findings: it still has to show gains on the executed and tainted_sink curves, where it matters.
If you want, I can draft a tiny sample corpus template (folders + labels) and a one‑file evaluator that outputs the three PR curves and a markdown summary ready for your CI artifact.

What you are trying to solve is this:
If you measure “scanner accuracy” as one overall precision/recall number, you can accidentally optimize the wrong thing. A scanner can look “better” by getting quieter on the easy/noisy tier (dependencies merely present) while getting worse on the tier that actually matters (user-data reaching a dangerous sink). Tiered accuracy prevents that failure mode and gives you a clean product contract:
- Imported = “it exists in the artifact” (high volume, high noise)
- Executed = “it actually runs on real entrypoints” (materially more useful)
- Tainted→Sink = “user-controlled input reaches a sensitive sink” (highest signal, most actionable)
This is not just analytics. It drives:
- alerting (page only on tainted→sink),
- UX (show the reason a vuln matters),
- policy/lattice merges (VEX decisions should not collapse tiers),
- engineering priorities (don’t let “imported” improvements hide “tainted→sink” regressions).
Below is a concrete StellaOps implementation plan (aligned to your architecture rules: lattice algorithms run in scanner.webservice, Concelier/Excititor preserve prune source, Postgres is SoR, Valkey only ephemeral).
1) Product contract: what “tier” means in StellaOps
1.1 Tier assignment rule (single source of truth)
Owner: StellaOps.Scanner.WebService
Input: raw findings + evidence objects from workers (deps, callgraph, trace, taint paths)
Output: evidence_tier on each normalized finding (plus an evidence summary)
Tier precedence (highest wins):
tainted_sink > executed > imported
Deterministic mapping rule:
- imported if the SBOM/lockfile indicates the package/component is present AND the vuln applies to that component.
- executed if the reachability engine can prove it reachable from declared entrypoints (static) OR a runtime trace/coverage proves execution.
- tainted_sink if the taint engine proves a source → (optional sanitizer) → sink path with sink taxonomy.
1.2 Evidence objects (the “why”)
Workers emit evidence primitives; webservice merges + tiers them:
- DependencyEvidence { purl, version, lockfile_path }
- ReachabilityEvidence { entrypoint, call_path[], confidence }
- TaintEvidence { source, sink, sanitizers[], dataflow_path[], confidence }
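A minimal C# sketch of these evidence primitives, the merged finding envelope, and the precedence rule from 1.1; the names mirror the text but are placeholders rather than final StellaOps contracts:

```csharp
// Sketch only: evidence DTOs, the merged finding envelope, and deterministic tiering.
using System.Collections.Generic;
using System.Linq;

public enum EvidenceTier { Imported, Executed, TaintedSink }

public abstract record Evidence;
public sealed record DependencyEvidence(string Purl, string Version, string LockfilePath) : Evidence;
public sealed record ReachabilityEvidence(string Entrypoint, IReadOnlyList<string> CallPath, double Confidence) : Evidence;
public sealed record TaintEvidence(string Source, string Sink, IReadOnlyList<string> Sanitizers,
    IReadOnlyList<string> DataflowPath, double Confidence) : Evidence;

// Envelope the webservice builds after merging worker output for one vuln_key.
public sealed record FindingEnvelope(
    string VulnKey, string RuleKey, double Score,
    IReadOnlyList<Evidence> Evidence, int FirstSignalMs);

public static class Tiering
{
    // Highest tier wins: tainted_sink > executed > imported.
    public static EvidenceTier Assign(FindingEnvelope finding)
    {
        if (finding.Evidence.OfType<TaintEvidence>().Any()) return EvidenceTier.TaintedSink;
        if (finding.Evidence.OfType<ReachabilityEvidence>().Any()) return EvidenceTier.Executed;
        return EvidenceTier.Imported; // only dependency evidence (or nothing stronger)
    }
}
```

The precedence check is intentionally evidence‑driven: the strongest evidence type present decides the tier, so workers never need to agree on a tier themselves.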
2) Data model in Postgres (system of record)
Create a dedicated schema eval for ground truth + computed metrics (keeps it separate from production scans but queryable by the UI).
2.1 Tables (minimal but complete)
```sql
create schema if not exists eval;
-- A “sample” = one repo/fixture scenario you scan deterministically
create table eval.sample (
sample_id uuid primary key,
name text not null,
repo_path text not null, -- local path in your corpus checkout
commit_sha text null,
language text not null, -- py/js/ts/java/dotnet/mixed
scenario text not null, -- webapi/cli/job/lib
entrypoints jsonb not null, -- array of entrypoint descriptors
created_at timestamptz not null default now()
);
-- Expected truth for a sample
create table eval.expected_finding (
expected_id uuid primary key,
sample_id uuid not null references eval.sample(sample_id) on delete cascade,
vuln_key text not null, -- your canonical vuln key (see 2.2)
tier text not null check (tier in ('imported','executed','tainted_sink')),
rule_key text null, -- optional: expected rule family
location_hint text null, -- e.g. file:line or package
sink_class text null, -- sql/command/ssrf/deser/eval/path/etc
notes text null
);
-- One evaluation run (tied to exact versions + snapshots)
create table eval.run (
eval_run_id uuid primary key,
scanner_version text not null,
rules_hash text not null,
concelier_snapshot_hash text not null, -- feed snapshot / advisory set hash
replay_manifest_hash text not null,
started_at timestamptz not null default now(),
finished_at timestamptz null
);
-- Observed results captured from a scan run over the corpus
create table eval.observed_finding (
observed_id uuid primary key,
eval_run_id uuid not null references eval.run(eval_run_id) on delete cascade,
sample_id uuid not null references eval.sample(sample_id) on delete cascade,
vuln_key text not null,
tier text not null check (tier in ('imported','executed','tainted_sink')),
score double precision not null, -- 0..1
rule_key text not null,
evidence jsonb not null, -- summarized evidence blob
first_signal_ms int not null -- TTFS-like metric for this finding
);
-- Computed metrics, per tier and operating point
create table eval.metrics (
eval_run_id uuid not null references eval.run(eval_run_id) on delete cascade,
tier text not null check (tier in ('imported','executed','tainted_sink')),
op_point text not null, -- e.g. "recall>=0.80" or "threshold=0.72"
precision double precision not null,
recall double precision not null,
f1 double precision not null,
pr_auc double precision not null,
latency_p50_ms int not null,
latency_p95_ms int not null,
n_expected int not null,
n_observed int not null,
primary key (eval_run_id, tier, op_point)
);
```
2.2 Canonical vuln key (avoid mismatches)
Define a single canonical key for matching expected↔observed:
- For dependency vulns: purl + advisory_id (or purl + cve if available).
- For code-pattern vulns: rule_family + stable fingerprint (e.g., sink_class + file + normalized AST span).
You need this to stop “matching hell” from destroying the usefulness of metrics.
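A minimal sketch of a key builder under these rules; the lower‑casing, separator, and fingerprint hashing are assumptions to be nailed down, not a fixed spec:

```csharp
// Sketch only: canonical vuln_key construction for the two matching cases.
using System;
using System.Security.Cryptography;
using System.Text;

public static class VulnKey
{
    // Dependency vuln: purl + advisory id (CVE, GHSA, vendor id).
    public static string ForDependency(string purl, string advisoryId) =>
        $"{purl.Trim().ToLowerInvariant()}|{advisoryId.Trim().ToUpperInvariant()}";

    // Code-pattern vuln: rule family + a stable fingerprint of where it occurs.
    public static string ForCodePattern(string ruleFamily, string sinkClass, string file, string normalizedAstSpan)
    {
        // Hash the location so unrelated edits to the file do not break matching.
        using var sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes($"{sinkClass}|{file}|{normalizedAstSpan}"));
        return $"{ruleFamily}|{Convert.ToHexString(digest)[..16]}";
    }
}
```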
3) Corpus format (how developers add truth samples)
Create /corpus/ repo (or folder) with strict structure:
```
/corpus/
  /samples/
    /py_sql_injection_001/
      sample.yml
      app.py
      requirements.txt
      expected.json
    /js_ssrf_002/
      sample.yml
      index.js
      package-lock.json
      expected.json
  replay-manifest.yml   # pins concelier snapshot, rules hash, analyzers
  tools/
    run-scan.ps1
    run-scan.sh
```
sample.yml includes:
- language, scenario, entrypoints,
- how to run/build (if needed),
- “golden” command line for deterministic scanning.
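For example, a sample.yml for the hypothetical py_sql_injection_001 sample could look like this (field names are illustrative; the list above only fixes what the file must convey):

```yaml
# Hypothetical sample.yml; keys are placeholders, not a finalized schema.
name: py_sql_injection_001
language: py
scenario: webapi
entrypoints:
  - module: app
    callable: main
build: pip install -r requirements.txt
scan:
  # "golden" deterministic command line, here via the corpus helper script
  command: ../../tools/run-scan.sh py_sql_injection_001
```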
expected.json is a list of expected findings with vuln_key, tier, optional sink_class.
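And a matching expected.json (values illustrative; the vuln_key format follows 2.2, with the advisory id and fingerprint as placeholders):

```json
[
  {
    "vuln_key": "py.sql.injection|a1b2c3d4e5f60718",
    "tier": "tainted_sink",
    "rule_key": "py.sql.injection.param_concat",
    "sink_class": "sql",
    "location_hint": "app.py:42"
  },
  {
    "vuln_key": "pkg:pypi/somepkg@1.2.3|CVE-2024-12345",
    "tier": "imported",
    "sink_class": null
  }
]
```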
4) Pipeline changes in StellaOps (where code changes go)
4.1 Scanner workers: emit evidence primitives (no tiering here)
Modules:
- StellaOps.Scanner.Worker.DotNet
- StellaOps.Scanner.Worker.Python
- StellaOps.Scanner.Worker.Node
- StellaOps.Scanner.Worker.Java
Change:
- Every raw finding must include:
  - vuln_key
  - rule_key
  - score (even if coarse at first)
  - evidence[] primitives (dependency / reachability / taint, as available)
  - first_signal_ms (time from scan start to the first evidence emitted for that finding)
Workers do not decide tiers. They only report what they saw.
4.2 Scanner webservice: tiering + lattice merge (this is the policy brain)
Module: StellaOps.Scanner.WebService
Responsibilities:
- Merge evidence for the same vuln_key across analyzers.
- Run reachability/taint algorithms (your lattice policy engine sits here).
- Assign evidence_tier deterministically.
- Persist normalized findings (production tables) + export to eval capture.
4.3 Concelier + Excititor (preserve prune source)
- Concelier stores advisory data; does not “tier” anything.
- Excititor stores VEX statements; when it references a finding, it may annotate tier context, but it must preserve pruning provenance and not recompute tiers.
5) Evaluator implementation (the thing that computes tiered precision/recall)
5.1 New service/tooling
Create:
- StellaOps.Scanner.Evaluation.Core (library)
- StellaOps.Scanner.Evaluation.Cli (dotnet tool)
CLI responsibilities:
- Load corpus samples + expected findings into eval.sample / eval.expected_finding.
- Trigger scans (via Scheduler or the direct Scanner API) using replay-manifest.yml.
- Capture observed findings into eval.observed_finding.
- Compute the per-tier PR curve + PR-AUC + operating-point precision/recall.
- Write eval.metrics + produce Markdown/JSON artifacts for CI (a command-layout sketch follows below).
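A minimal sketch of that command surface (import-corpus, run, compute, report); plain argument dispatch keeps it dependency‑free, the tool name and each command body are stubs:

```csharp
// Sketch only: evaluator CLI dispatch. "stella-eval" is a hypothetical tool name.
using System;

public static class Program
{
    public static int Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.Error.WriteLine("usage: stella-eval <import-corpus|run|compute|report> [options]");
            return 2;
        }

        switch (args[0])
        {
            case "import-corpus": // load /corpus samples + expected.json into eval.sample / eval.expected_finding
            case "run":           // trigger scans per replay-manifest.yml, capture eval.observed_finding
            case "compute":       // join expected vs observed, write per-tier eval.metrics
            case "report":        // emit Markdown/JSON artifacts for the CI job
                Console.WriteLine($"'{args[0]}': not implemented in this sketch.");
                return 0;
            default:
                Console.Error.WriteLine($"unknown command: {args[0]}");
                return 2;
        }
    }
}
```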
5.2 Matching algorithm (practical and robust)
For each sample_id:
- Group expected by (vuln_key, tier).
- Group observed by (vuln_key, tier).
- A match is "same vuln_key, same tier".
  - (Later enhancement: allow a "higher tier" observation to satisfy a lower-tier expectation only if you explicitly want that; default is an exact tier match so you catch tier regressions.)
Compute:
- TP/FP/FN per tier.
- PR curve by sweeping a threshold over observed scores (a sketch of the matching and sweep follows below).
- first_signal_ms percentiles per tier.
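The sketch referenced above (C#): exact-tier matching per (sample, vuln_key) plus the threshold sweep that yields the PR curve. The row shapes are simplified stand-ins for eval.expected_finding / eval.observed_finding, and the real evaluator would read them from Postgres and also track first_signal_ms.

```csharp
// Sketch only: per-tier matching and PR-curve sweep over observed scores.
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record Expected(Guid SampleId, string VulnKey, string Tier);
public sealed record Observed(Guid SampleId, string VulnKey, string Tier, double Score);
public sealed record PrPoint(double Threshold, double Precision, double Recall);

public static class TierEvaluator
{
    public static IReadOnlyList<PrPoint> PrCurve(
        IReadOnlyList<Expected> expected, IReadOnlyList<Observed> observed, string tier)
    {
        // Exact tier match: same sample, same vuln_key, same tier.
        var expectedKeys = expected
            .Where(e => e.Tier == tier)
            .Select(e => (e.SampleId, e.VulnKey))
            .ToHashSet();

        var tierObserved = observed.Where(o => o.Tier == tier).ToList();
        var points = new List<PrPoint>();

        // Sweep thresholds over the distinct observed scores, strict to lenient.
        foreach (double threshold in tierObserved.Select(o => o.Score).Distinct().OrderByDescending(s => s))
        {
            // De-duplicate: multiple observations of the same (sample, vuln_key) count once.
            var keptKeys = tierObserved
                .Where(o => o.Score >= threshold)
                .Select(o => (o.SampleId, o.VulnKey))
                .ToHashSet();

            int tp = keptKeys.Count(k => expectedKeys.Contains(k));
            double precision = keptKeys.Count == 0 ? 0.0 : (double)tp / keptKeys.Count;
            double recall = expectedKeys.Count == 0 ? 0.0 : (double)tp / expectedKeys.Count;
            points.Add(new PrPoint(threshold, precision, recall));
        }
        return points;
    }
}
```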
5.3 Operating points (so it’s not academic)
Pick tier-specific gates:
- tainted_sink: require Recall ≥ 0.80, minimize FP
- executed: require Recall ≥ 0.70
- imported: require Recall ≥ 0.60
Store the chosen threshold per tier per version (so you can compare apples-to-apples in regressions).
6) CI gating (how this becomes “real” engineering pressure)
In GitLab/Gitea pipeline:
- Build the scanner + webservice.
- Pull the pinned concelier snapshot bundle (or a local snapshot).
- Run the evaluator CLI against the corpus.
- Fail the build if:
  - PR-AUC(tainted_sink) drops > 1% vs the baseline,
  - or precision at Recall >= 0.80 drops below a floor (e.g. 0.95),
  - or latency_p95_ms(tainted_sink) regresses beyond a budget.
Store baselines in repo (/corpus/baselines/<scanner_version>.json) to make diffs explicit.
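A minimal sketch of that gate for one tier (C#): compare the current run's metrics row against the stored baseline and fail on the three conditions. Field names mirror eval.metrics; the 20% latency budget is an arbitrary placeholder.

```csharp
// Sketch only: per-tier regression gate against a stored baseline.
public sealed record TierMetricsRow(double PrAuc, double PrecisionAtRecallTarget, int LatencyP95Ms);

public static class RegressionGate
{
    public static bool Passes(TierMetricsRow baseline, TierMetricsRow current,
        double maxPrAucDropPct = 1.0, double precisionFloor = 0.95, double maxLatencyGrowthPct = 20.0)
    {
        // Relative PR-AUC drop (assumes a positive baseline PR-AUC).
        double prAucDropPct = (baseline.PrAuc - current.PrAuc) / baseline.PrAuc * 100.0;
        if (prAucDropPct > maxPrAucDropPct) return false;

        // Precision floor at the fixed Recall >= 0.80 operating point.
        if (current.PrecisionAtRecallTarget < precisionFloor) return false;

        // p95 latency budget relative to baseline.
        double latencyGrowthPct = (current.LatencyP95Ms - baseline.LatencyP95Ms) * 100.0 / baseline.LatencyP95Ms;
        return latencyGrowthPct <= maxLatencyGrowthPct;
    }
}
```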
7) UI and alerting (so tiering changes behavior)
7.1 UI
Add three KPI cards:
- Imported PR-AUC trend
- Executed PR-AUC trend
- Tainted→Sink PR-AUC trend
In the findings list:
- show tier badge
- default sort: tainted_sink, then executed, then imported
- clicking a finding shows an evidence summary (entrypoint, path length, sink class)
7.2 Notify policy
Default policy:
- Page/urgent only on tainted_sink above a confidence threshold.
- Create a ticket on executed.
- Batch report on imported.
This is the main “why”: the system stops screaming about irrelevant imports.
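A minimal sketch of that default routing (C#); the action names and the 0.8 paging threshold are placeholders, not the shipped Notify policy:

```csharp
// Sketch only: tier-gated notification routing.
public enum NotifyAction { Page, Ticket, BatchReport }

public static class NotifyPolicy
{
    public static NotifyAction Route(string evidenceTier, double score, double pageThreshold = 0.8) =>
        (evidenceTier, score >= pageThreshold) switch
        {
            ("tainted_sink", true)  => NotifyAction.Page,
            ("tainted_sink", false) => NotifyAction.Ticket, // below threshold: still actionable, don't page
            ("executed", _)         => NotifyAction.Ticket,
            _                       => NotifyAction.BatchReport // imported and anything unknown
        };
}
```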
8) Rollout plan (phased, developer-friendly)
Phase 0: Contracts (1–2 days)
- Define vuln_key, rule_key, the evidence DTOs, and the tier enum.
- Add the eval.* schema.
Done when: scanner output can carry evidence + score; eval tables exist.
Phase 1: Evidence emission + tiering (1–2 sprints)
- Workers emit evidence primitives.
- Webservice assigns tier using deterministic precedence.
Done when: every finding has a tier + evidence summary.
Phase 2: Corpus + evaluator (1 sprint)
- Build 30–50 samples (10 per tier minimum).
- Implement evaluator CLI + metrics persistence.
Done when: CI can compute tiered metrics and output markdown report.
Phase 3: Gates + UX (1 sprint)
- Add CI regression gates.
- Add UI tier badge + dashboards.
- Add Notify tier-based routing.
Done when: a regression in tainted→sink breaks CI even if imported improves.
Phase 4: Scale corpus + harden matching (ongoing)
- Expand to 200+ samples, multi-language.
- Add fingerprinting for code vulns to avoid brittle file/line matching.
Definition of “success” (so nobody bikesheds)
- You can point to one release where overall precision stayed flat but tainted→sink PR-AUC improved, and CI proves you didn’t “cheat” by just silencing imported findings.
- On-call noise drops because paging is tier-gated.
- TTFS p95 for tainted→sink stays within a budget you set (e.g., <30s on corpus and <N seconds on real images).
If you want, I can also give you:
- a concrete DTO set (FindingEnvelope, EvidenceUnion, etc.) in C#/.NET 10,
- and a skeleton StellaOps.Scanner.Evaluation.Cli command layout (import-corpus, run, compute, report) that your agents can start coding immediately.