feat(rate-limiting): Implement core rate limiting functionality with configuration, decision-making, metrics, middleware, and service registration

- Add RateLimitConfig for configuration management with YAML binding support.
- Introduce RateLimitDecision to encapsulate the result of rate limit checks.
- Implement RateLimitMetrics for OpenTelemetry metrics tracking.
- Create RateLimitMiddleware for enforcing rate limits on incoming requests.
- Develop RateLimitService to orchestrate instance and environment rate limit checks.
- Add RateLimitServiceCollectionExtensions for dependency injection registration.
This commit is contained in:
master
2025-12-17 18:02:37 +02:00
parent 394b57f6bf
commit 8bbfe4d2d2
211 changed files with 47179 additions and 1590 deletions

Here's a clean way to **measure and report scanner accuracy without letting one metric hide weaknesses**: track precision/recall (and AUC) separately for three evidence tiers: **Imported**, **Executed**, and **Tainted→Sink**. This mirrors how risk truly escalates in Python/JS-style ecosystems.
### Why tiers?
* **Imported**: vuln in a dep that's present (lots of noise).
* **Executed**: code/deps actually run on typical paths (fewer FPs).
* **Tainted→Sink**: user-controlled data reaches a sensitive sink (highest signal).
### Minimal spec to implement now
**Ground-truth corpus design**
* Label each finding as: `tier ∈ {imported, executed, tainted_sink}`, `true_label ∈ {TP, FP}`; store model confidence `p ∈ [0,1]`.
* Keep language tags (py, js, ts), package manager, and scenario (web API, cli, job).
**DB schema (add to test analytics db)**
* `gt_sample(id, repo, commit, lang, scenario)`
* `gt_finding(id, sample_id, vuln_id, tier, truth, score, rule, scanner_version, created_at)`
* `gt_split(sample_id, split ∈ {train,dev,test})`
**Metrics to publish (all stratified by tier)**
* Precision@K (e.g., top-100), Recall@K
* PR-AUC, ROC-AUC (only if calibrated)
* Latency p50/p95 from “scan start → first evidence”
* Coverage: % of samples with any signal in that tier
**Reporting layout (one chart per tier)**
* PR curve + table: `Precision, Recall, F1, PR-AUC, N(findings), N(samples)`
* Error buckets: top 5 false-positive rules, top 5 false-negative patterns
**Evaluation protocol**
1. Freeze a **toy but diverse corpus** (50–200 repos) with deterministic fixture data and replay scripts.
2. For each release candidate:
* Run scanner with fixed flags and feeds.
* Emit per-finding scores; map each to a tier with your reachability engine.
* Join to ground truth; compute metrics **per tier** and **overall**.
3. Fail the build if any of:
* PR-AUC(imported) drops >2%, or PR-AUC(executed/tainted_sink) drops >1%.
* FP rate in `tainted_sink` > 5% at operating point Recall ≥ 0.7.
**How to classify tiers (deterministic rules)**
* `imported`: package appears in lockfile/SBOM and is reachable in graph.
* `executed`: function/module reached by dynamic trace, coverage, or proven path in static call graph used by entrypoints.
* `tainted_sink`: a taint source → sink path is proven (recording any sanitizers on the path), classified against a sink taxonomy (eval, exec, SQL, SSRF, deserialization, XXE, command, path traversal); a minimal sketch of this vocabulary follows.
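The tier names and sink taxonomy above are worth pinning down as one shared vocabulary. A minimal C# sketch (names are illustrative, not the shipped StellaOps contract):
```csharp
// Illustrative sketch of the shared vocabulary above; real StellaOps types may differ.
public enum EvidenceTier
{
    Imported,     // component present in lockfile/SBOM and the vuln applies
    Executed,     // proven reachable from entrypoints (static) or observed at runtime
    TaintedSink   // user-controlled source reaches a sensitive sink
}

public enum SinkClass
{
    Eval,
    Exec,
    Sql,
    Ssrf,
    Deserialization,
    Xxe,
    Command,
    PathTraversal
}
```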
**Developer checklist (StellaOps naming)**
* Scanner.Worker: emit `evidence_tier` and `score` on each finding.
* Excititor (VEX): include `tier` in statements; allow per-tier policy thresholds.
* Concelier (feeds): tag advisories with sink classes when available to help tier mapping.
* Scheduler/Notify: gate alerts on **tiered** thresholds (e.g., page only on `tainted_sink` at its recall-target operating point).
* Router dashboards: three small PR curves + trend sparklines; hover shows last 5 FP causes.
**Quick JSON result shape**
```json
{
"finding_id": "…",
"vuln_id": "CVE-2024-12345",
"rule": "py.sql.injection.param_concat",
"evidence_tier": "tainted_sink",
"score": 0.87,
"reachability": { "entrypoint": "app.py:main", "path_len": 5, "sanitizers": ["escape_sql"] }
}
```
**Operational point selection**
* Choose operating points per tier by maximizing F1 or by fixing recall targets:
* imported: Recall 0.60
* executed: Recall 0.70
* tainted_sink: Recall 0.80
Then record **per-tier precision at those recalls** each release; a minimal selection sketch follows.
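A minimal sketch of the recall-target selection, assuming per-tier lists of scored findings already joined to ground truth (names and helper types are hypothetical):
```csharp
using System.Collections.Generic;
using System.Linq;

public static class OperatingPointSelector
{
    public sealed record OpPoint(double Threshold, double Precision, double Recall);

    // Returns the highest score threshold whose recall meets the target,
    // i.e. the precision you would report "at Recall >= target".
    public static OpPoint? ChooseForRecallTarget(
        IReadOnlyList<(double Score, bool IsTruePositive)> scoredFindings,
        int totalExpectedPositives,   // TP + FN for this tier in the corpus
        double recallTarget)
    {
        int tp = 0, fp = 0;
        foreach (var f in scoredFindings.OrderByDescending(x => x.Score))
        {
            if (f.IsTruePositive) tp++; else fp++;
            double recall = totalExpectedPositives == 0 ? 0 : (double)tp / totalExpectedPositives;
            if (recall >= recallTarget)
            {
                double precision = (double)tp / (tp + fp);
                return new OpPoint(f.Score, precision, recall);
            }
        }
        return null; // recall target is not reachable at any threshold
    }
}
```
This would run once per tier (e.g. `recallTarget: 0.80` for `tainted_sink`), and the returned precision is what gets recorded per release.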
**Why this prevents metric gaming**
* A model can't inflate “overall precision” by over-penalizing noisy imported findings: you still have to show gains in the **executed** and **tainted_sink** curves, where it matters.
If you want, I can draft a tiny sample corpus template (folders + labels) and a one-file evaluator that outputs the three PR curves and a markdown summary ready for your CI artifact.
What you are trying to solve is this:
If you measure “scanner accuracy” as one overall precision/recall number, you can *accidentally* optimize the wrong thing. A scanner can look “better” by getting quieter on the easy/noisy tier (dependencies merely present) while getting worse on the tier that actually matters (user-data reaching a dangerous sink). Tiered accuracy prevents that failure mode and gives you a clean product contract:
* **Imported** = “it exists in the artifact” (high volume, high noise)
* **Executed** = “it actually runs on real entrypoints” (materially more useful)
* **Tainted→Sink** = “user-controlled input reaches a sensitive sink” (highest signal, most actionable)
This is not just analytics. It drives:
* alerting (page only on tainted→sink),
* UX (show the *reason* a vuln matters),
* policy/lattice merges (VEX decisions should not collapse tiers),
* engineering priorities (don't let “imported” improvements hide “tainted→sink” regressions).
Below is a concrete StellaOps implementation plan (aligned to your architecture rules: **lattice algorithms run in `scanner.webservice`**, Concelier/Excititor **preserve prune source**, Postgres is the system of record, Valkey holds only ephemeral state).
---
## 1) Product contract: what “tier” means in StellaOps
### 1.1 Tier assignment rule (single source of truth)
**Owner:** `StellaOps.Scanner.WebService`
**Input:** raw findings + evidence objects from workers (deps, call graph, trace, taint paths)
**Output:** `evidence_tier` on each normalized finding (plus an evidence summary)
**Tier precedence (highest wins):**
1. `tainted_sink`
2. `executed`
3. `imported`
**Deterministic mapping rule:**
* `imported` if SBOM/lockfile indicates package/component present AND vuln applies to that component.
* `executed` if reachability engine can prove reachable from declared entrypoints (static) OR runtime trace/coverage proves execution.
* `tainted_sink` if the taint engine proves a source → (optional sanitizer) → sink path, classified against the sink taxonomy. The precedence rule is sketched below.
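The precedence rule is simple enough to make executable almost verbatim. A minimal sketch, reusing the `EvidenceTier` enum sketched earlier (the boolean flags are assumptions; the real inputs would be the merged evidence objects):
```csharp
using System;

public static class TierAssigner
{
    // Highest tier wins; a finding with no evidence at all is rejected.
    public static EvidenceTier Assign(
        bool presentInSbom,      // SBOM/lockfile shows the component and the vuln applies to it
        bool provenReachable,    // static call graph from declared entrypoints, or runtime trace/coverage
        bool taintPathProven)    // taint engine proved a source -> sink path
    {
        if (taintPathProven) return EvidenceTier.TaintedSink;
        if (provenReachable) return EvidenceTier.Executed;
        if (presentInSbom)   return EvidenceTier.Imported;
        throw new ArgumentException("A finding must carry at least dependency-presence evidence.");
    }
}
```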
### 1.2 Evidence objects (the “why”)
Workers emit *evidence primitives*; the webservice merges and tiers them (record sketches follow the list):
* `DependencyEvidence { purl, version, lockfile_path }`
* `ReachabilityEvidence { entrypoint, call_path[], confidence }`
* `TaintEvidence { source, sink, sanitizers[], dataflow_path[], confidence }`
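These primitives map naturally onto small immutable records. An illustrative C# sketch (field names follow the bullets above; the real DTOs may differ):
```csharp
using System.Collections.Generic;

public sealed record DependencyEvidence(
    string Purl,
    string Version,
    string LockfilePath);

public sealed record ReachabilityEvidence(
    string Entrypoint,
    IReadOnlyList<string> CallPath,
    double Confidence);

public sealed record TaintEvidence(
    string Source,
    string Sink,
    IReadOnlyList<string> Sanitizers,
    IReadOnlyList<string> DataflowPath,
    double Confidence);
```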
---
## 2) Data model in Postgres (system of record)
Create a dedicated schema `eval` for ground truth + computed metrics (keeps it separate from production scans but queryable by the UI).
### 2.1 Tables (minimal but complete)
```sql
create schema if not exists eval;
-- A “sample” = one repo/fixture scenario you scan deterministically
create table eval.sample (
sample_id uuid primary key,
name text not null,
repo_path text not null, -- local path in your corpus checkout
commit_sha text null,
language text not null, -- py/js/ts/java/dotnet/mixed
scenario text not null, -- webapi/cli/job/lib
entrypoints jsonb not null, -- array of entrypoint descriptors
created_at timestamptz not null default now()
);
-- Expected truth for a sample
create table eval.expected_finding (
expected_id uuid primary key,
sample_id uuid not null references eval.sample(sample_id) on delete cascade,
vuln_key text not null, -- your canonical vuln key (see 2.2)
tier text not null check (tier in ('imported','executed','tainted_sink')),
rule_key text null, -- optional: expected rule family
location_hint text null, -- e.g. file:line or package
sink_class text null, -- sql/command/ssrf/deser/eval/path/etc
notes text null
);
-- One evaluation run (tied to exact versions + snapshots)
create table eval.run (
eval_run_id uuid primary key,
scanner_version text not null,
rules_hash text not null,
concelier_snapshot_hash text not null, -- feed snapshot / advisory set hash
replay_manifest_hash text not null,
started_at timestamptz not null default now(),
finished_at timestamptz null
);
-- Observed results captured from a scan run over the corpus
create table eval.observed_finding (
observed_id uuid primary key,
eval_run_id uuid not null references eval.run(eval_run_id) on delete cascade,
sample_id uuid not null references eval.sample(sample_id) on delete cascade,
vuln_key text not null,
tier text not null check (tier in ('imported','executed','tainted_sink')),
score double precision not null, -- 0..1
rule_key text not null,
evidence jsonb not null, -- summarized evidence blob
first_signal_ms int not null -- TTFS-like metric for this finding
);
-- Computed metrics, per tier and operating point
create table eval.metrics (
eval_run_id uuid not null references eval.run(eval_run_id) on delete cascade,
tier text not null check (tier in ('imported','executed','tainted_sink')),
op_point text not null, -- e.g. "recall>=0.80" or "threshold=0.72"
precision double precision not null,
recall double precision not null,
f1 double precision not null,
pr_auc double precision not null,
latency_p50_ms int not null,
latency_p95_ms int not null,
n_expected int not null,
n_observed int not null,
primary key (eval_run_id, tier, op_point)
);
```
### 2.2 Canonical vuln key (avoid mismatches)
Define a single canonical key for matching expected↔observed:
* For dependency vulns: `purl + advisory_id` (or `purl + cve` if available).
* For code-pattern vulns: `rule_family + stable fingerprint` (e.g., `sink_class + file + normalized AST span`).
You need this to stop “matching hell” from destroying the usefulness of metrics.
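A minimal sketch of the key construction (the separator and the fingerprinting scheme are assumptions; what matters is that expected and observed findings go through the same function):
```csharp
public static class CanonicalVulnKey
{
    // Dependency vulnerability: purl plus the advisory identifier (CVE when available).
    public static string ForDependency(string purl, string advisoryId)
        => $"{purl}|{advisoryId}";

    // Code-pattern vulnerability: rule family plus a stable fingerprint,
    // e.g. a hash over sink class + file + normalized AST span computed upstream.
    public static string ForCodePattern(string ruleFamily, string stableFingerprint)
        => $"{ruleFamily}|{stableFingerprint}";
}
```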
---
## 3) Corpus format (how developers add truth samples)
Create `/corpus/` repo (or folder) with strict structure:
```
/corpus/
/samples/
/py_sql_injection_001/
sample.yml
app.py
requirements.txt
expected.json
/js_ssrf_002/
sample.yml
index.js
package-lock.json
expected.json
replay-manifest.yml # pins concelier snapshot, rules hash, analyzers
tools/
run-scan.ps1
run-scan.sh
```
**`sample.yml`** includes:
* language, scenario, entrypoints,
* how to run/build (if needed),
* “golden” command line for deterministic scanning.
**`expected.json`** is a list of expected findings with `vuln_key`, `tier`, optional `sink_class`.
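On the evaluator side, each `expected.json` entry can be deserialized into a small record. A sketch assuming `System.Text.Json` and the snake_case property names used elsewhere in this plan:
```csharp
using System.Text.Json.Serialization;

public sealed record ExpectedFindingEntry(
    [property: JsonPropertyName("vuln_key")] string VulnKey,
    [property: JsonPropertyName("tier")] string Tier,               // imported | executed | tainted_sink
    [property: JsonPropertyName("sink_class")] string? SinkClass,
    [property: JsonPropertyName("rule_key")] string? RuleKey,
    [property: JsonPropertyName("location_hint")] string? LocationHint);
```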
---
## 4) Pipeline changes in StellaOps (where code changes go)
### 4.1 Scanner workers: emit evidence primitives (no tiering here)
**Modules:**
* `StellaOps.Scanner.Worker.DotNet`
* `StellaOps.Scanner.Worker.Python`
* `StellaOps.Scanner.Worker.Node`
* `StellaOps.Scanner.Worker.Java`
**Change:**
* Every raw finding must include:
* `vuln_key`
* `rule_key`
* `score` (even if coarse at first)
* `evidence[]` primitives (dependency / reachability / taint as available)
* `first_signal_ms` (time from scan start to first evidence emitted for that finding)
Workers do **not** decide tiers. They only report what they saw.
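The worker output can be captured in one envelope type. An illustrative sketch (the evidence marker interface is an assumption to keep the union open; the records from §1.2 would implement it):
```csharp
using System.Collections.Generic;

// Marker interface for the evidence primitives sketched in section 1.2.
public interface IEvidence { }

public sealed record RawFinding(
    string VulnKey,
    string RuleKey,
    double Score,                        // 0..1, may be coarse at first
    IReadOnlyList<IEvidence> Evidence,   // dependency / reachability / taint, as available
    int FirstSignalMs);                  // time from scan start to first evidence for this finding
```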
### 4.2 Scanner webservice: tiering + lattice merge (this is the policy brain)
**Module:** `StellaOps.Scanner.WebService`
Responsibilities:
* Merge evidence for the same `vuln_key` across analyzers.
* Run reachability/taint algorithms (your lattice policy engine sits here).
* Assign `evidence_tier` deterministically.
* Persist normalized findings (production tables) + export to eval capture.
### 4.3 Concelier + Excititor (preserve prune source)
* Concelier stores advisory data; does not “tier” anything.
* Excititor stores VEX statements; when it references a finding, it may *annotate* tier context, but it must preserve pruning provenance and not recompute tiers.
---
## 5) Evaluator implementation (the thing that computes tiered precision/recall)
### 5.1 New service/tooling
Create:
* `StellaOps.Scanner.Evaluation.Core` (library)
* `StellaOps.Scanner.Evaluation.Cli` (dotnet tool)
CLI responsibilities:
1. Load corpus samples + expected findings into `eval.sample` / `eval.expected_finding`.
2. Trigger scans (via Scheduler or direct Scanner API) using `replay-manifest.yml`.
3. Capture observed findings into `eval.observed_finding`.
4. Compute per-tier PR curve + PR-AUC + operating-point precision/recall.
5. Write `eval.metrics` + produce Markdown/JSON artifacts for CI.
### 5.2 Matching algorithm (practical and robust)
For each `sample_id`:
* Group expected by `(vuln_key, tier)`.
* Group observed by `(vuln_key, tier)`.
* A match is “same vuln_key, same tier”.
* (Later enhancement: allow “higher tier” observed to satisfy a lower-tier expected only if you explicitly want that; default: **exact tier match** so you catch tier regressions.)
Compute (sketched after this list):
* TP/FP/FN per tier.
* PR curve by sweeping threshold over observed scores.
* `first_signal_ms` percentiles per tier.
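A minimal sketch of the exact-tier matching and threshold sweep (illustrative types; duplicate observations collapse to their `(sample, vuln_key)` pair before counting):
```csharp
using System.Collections.Generic;
using System.Linq;

public static class TierEvaluator
{
    public sealed record Expected(string SampleId, string VulnKey, string Tier);
    public sealed record Observed(string SampleId, string VulnKey, string Tier, double Score);
    public sealed record PrPoint(double Threshold, double Precision, double Recall);

    public static IEnumerable<PrPoint> PrCurve(
        IReadOnlyCollection<Expected> expected,
        IReadOnlyCollection<Observed> observed,
        string tier)
    {
        var truth = expected.Where(e => e.Tier == tier)
                            .Select(e => (e.SampleId, e.VulnKey))
                            .ToHashSet();
        var candidates = observed.Where(o => o.Tier == tier)
                                 .OrderByDescending(o => o.Score)
                                 .ToList();

        foreach (var threshold in candidates.Select(c => c.Score).Distinct())
        {
            var keptKeys = candidates.Where(c => c.Score >= threshold)
                                     .Select(c => (c.SampleId, c.VulnKey))
                                     .ToHashSet();
            int tp = keptKeys.Count(k => truth.Contains(k));
            yield return new PrPoint(
                threshold,
                keptKeys.Count == 0 ? 1.0 : (double)tp / keptKeys.Count,   // precision
                truth.Count == 0 ? 0.0 : (double)tp / truth.Count);        // recall (FN = truth - TP)
        }
    }
}
```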
### 5.3 Operating points (so it's not academic)
Pick tier-specific gates:
* `tainted_sink`: require Recall ≥ 0.80, minimize FP
* `executed`: require Recall ≥ 0.70
* `imported`: require Recall ≥ 0.60
Store the chosen threshold per tier per version (so you can compare apples-to-apples in regressions).
---
## 6) CI gating (how this becomes “real” engineering pressure)
In GitLab/Gitea pipeline:
1. Build scanner + webservice.
2. Pull pinned concelier snapshot bundle (or local snapshot).
3. Run evaluator CLI against corpus.
4. Fail build if:
* `PR-AUC(tainted_sink)` drops > 1% vs baseline
* or precision at `Recall>=0.80` drops below a floor (e.g. 0.95)
* or `latency_p95_ms(tainted_sink)` regresses beyond a budget
Store baselines in repo (`/corpus/baselines/<scanner_version>.json`) to make diffs explicit.
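A sketch of the gate check itself; the thresholds mirror the list above, and the baseline/current metrics are assumed to be loaded from the baseline JSON and the fresh `eval.metrics` rows:
```csharp
using System.Collections.Generic;

public static class RegressionGate
{
    public sealed record TierMetrics(double PrAuc, double PrecisionAtTargetRecall, int LatencyP95Ms);

    // Returns the list of violated gates; an empty list means the build may pass.
    public static IReadOnlyList<string> Check(
        string tier,
        TierMetrics baseline,
        TierMetrics current,
        double maxPrAucDrop,        // e.g. 0.01 for tainted_sink
        double precisionFloor,      // e.g. 0.95 at Recall >= 0.80
        int latencyBudgetMs)
    {
        var failures = new List<string>();
        if (baseline.PrAuc - current.PrAuc > maxPrAucDrop)
            failures.Add($"{tier}: PR-AUC dropped by more than {maxPrAucDrop}");
        if (current.PrecisionAtTargetRecall < precisionFloor)
            failures.Add($"{tier}: precision at the target recall fell below {precisionFloor}");
        if (current.LatencyP95Ms > latencyBudgetMs)
            failures.Add($"{tier}: latency p95 {current.LatencyP95Ms} ms exceeds budget {latencyBudgetMs} ms");
        return failures;
    }
}
```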
---
## 7) UI and alerting (so tiering changes behavior)
### 7.1 UI
Add three KPI cards:
* Imported PR-AUC trend
* Executed PR-AUC trend
* Tainted→Sink PR-AUC trend
In the findings list:
* show tier badge
* default sort: `tainted_sink` then `executed` then `imported`
* clicking a finding shows evidence summary (entrypoint, path length, sink class)
### 7.2 Notify policy
Default policy:
* Page/urgent only on `tainted_sink` above a confidence threshold.
* Create ticket on `executed`.
* Batch report on `imported`.
This is the main “why”: the system stops screaming about irrelevant imports.
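The default routing is a one-line decision per finding. A sketch reusing the `EvidenceTier` enum from earlier (the page threshold and the fallback for low-confidence `tainted_sink` findings are assumptions):
```csharp
public enum NotifyAction { Page, Ticket, BatchReport }

public static class NotifyPolicy
{
    public static NotifyAction Route(EvidenceTier tier, double score, double pageThreshold = 0.8)
        => tier switch
        {
            EvidenceTier.TaintedSink when score >= pageThreshold => NotifyAction.Page,
            EvidenceTier.TaintedSink => NotifyAction.Ticket,   // low-confidence: still tracked, not paged
            EvidenceTier.Executed => NotifyAction.Ticket,
            _ => NotifyAction.BatchReport                      // imported: batched report
        };
}
```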
---
## 8) Rollout plan (phased, developer-friendly)
### Phase 0: Contracts (1–2 days)
* Define `vuln_key`, `rule_key`, evidence DTOs, tier enum.
* Add schema `eval.*`.
**Done when:** scanner output can carry evidence + score; eval tables exist.
### Phase 1: Evidence emission + tiering (1–2 sprints)
* Workers emit evidence primitives.
* Webservice assigns tier using deterministic precedence.
**Done when:** every finding has a tier + evidence summary.
### Phase 2: Corpus + evaluator (1 sprint)
* Build 30–50 samples (10 per tier minimum).
* Implement evaluator CLI + metrics persistence.
**Done when:** CI can compute tiered metrics and output markdown report.
### Phase 3: Gates + UX (1 sprint)
* Add CI regression gates.
* Add UI tier badge + dashboards.
* Add Notify tier-based routing.
**Done when:** a regression in tainted→sink breaks CI even if imported improves.
### Phase 4: Scale corpus + harden matching (ongoing)
* Expand to 200+ samples, multi-language.
* Add fingerprinting for code vulns to avoid brittle file/line matching.
---
## Definition of “success” (so nobody bikesheds)
* You can point to one release where **overall precision stayed flat** but **tainted→sink PR-AUC improved**, and CI proves you didn't “cheat” by just silencing imported findings.
* On-call noise drops because paging is tier-gated.
* TTFS p95 for tainted→sink stays within a budget you set (e.g., <30s on corpus and <N seconds on real images).
If you want, I can also give you:
* a concrete DTO set (`FindingEnvelope`, `EvidenceUnion`, etc.) in C#/.NET 10,
* and a skeleton `StellaOps.Scanner.Evaluation.Cli` command layout (`import-corpus`, `run`, `compute`, `report`) that your agents can start coding immediately.