feat(rate-limiting): Implement core rate limiting functionality with configuration, decision-making, metrics, middleware, and service registration

- Add RateLimitConfig for configuration management with YAML binding support.
- Introduce RateLimitDecision to encapsulate the result of rate limit checks.
- Implement RateLimitMetrics for OpenTelemetry metrics tracking.
- Create RateLimitMiddleware for enforcing rate limits on incoming requests.
- Develop RateLimitService to orchestrate instance and environment rate limit checks.
- Add RateLimitServiceCollectionExtensions for dependency injection registration.
2025-12-17 18:02:37 +02:00
parent 394b57f6bf
commit 8bbfe4d2d2
211 changed files with 47179 additions and 1590 deletions
--- a/docs/contributing/corpus-contribution-guide.md
+++ b/docs/contributing/corpus-contribution-guide.md
@@ -0,0 +1,301 @@
+# Corpus Contribution Guide
+
+**Sprint:** SPRINT_3500_0003_0001  
+**Task:** CORPUS-014 - Document corpus contribution guide
+
+## Overview
+
+The Ground-Truth Corpus is a collection of validated test samples used to measure scanner accuracy. Each sample has known reachability status and expected findings, enabling deterministic quality metrics.
+
+## Corpus Structure
+
+```
+datasets/reachability/
+├── corpus.json                # Index of all samples
+├── schemas/
+│   └── corpus-sample.v1.json  # JSON schema for samples
+├── samples/
+│   ├── gt-0001/               # Sample directory
+│   │   ├── sample.json        # Sample metadata
+│   │   ├── expected.json      # Expected findings
+│   │   ├── sbom.json          # Input SBOM
+│   │   └── source/            # Optional source files
+│   └── ...
+└── baselines/
+    └── v1.0.0.json            # Baseline metrics
+```
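+
+The layout above can be checked mechanically before opening a PR. The following is a minimal sketch and not part of the StellaOps tooling: the helper name `missing_sample_files` is hypothetical, and it assumes that `sample.json`, `expected.json`, and `sbom.json` are mandatory while `source/` stays optional.
+
+```python
+from pathlib import Path
+
+# Assumption: these three files are mandatory; source/ is optional.
+REQUIRED_FILES = ("sample.json", "expected.json", "sbom.json")
+
+
+def missing_sample_files(sample_dir: Path) -> list[str]:
+    """Return the required file names that are absent from sample_dir."""
+    return [name for name in REQUIRED_FILES if not (sample_dir / name).is_file()]
+```
+
+An empty result means the directory has every file the sketch treats as required.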
+
+## Sample Format
+
+### sample.json
+
+```json
+{
+  "id": "gt-0001",
+  "name": "Python SQL Injection - Reachable",
+  "description": "Flask app with reachable SQL injection via user input",
+  "language": "python",
+  "ecosystem": "pypi",
+  "scenario": "webapi",
+  "entrypoints": ["app.py:main"],
+  "reachability_tier": "tainted_sink",
+  "created_at": "2025-01-15T00:00:00Z",
+  "author": "security-team",
+  "tags": ["sql-injection", "flask", "reachable"]
+}
+```
+
+### expected.json
+
+```json
+{
+  "findings": [
+    {
+      "vuln_key": "CVE-2024-1234:pkg:pypi/sqlalchemy@1.4.0",
+      "tier": "tainted_sink",
+      "rule_key": "py.sql.injection.param_concat",
+      "sink_class": "sql",
+      "location_hint": "app.py:42"
+    }
+  ]
+}
+```
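+
+The `vuln_key` in the example reads as a vulnerability ID followed by a package URL. A minimal sketch of splitting it, assuming that `<vuln-id>:<purl>` shape holds for every key (the guide shows it only by example; `split_vuln_key` is a hypothetical helper):
+
+```python
+def split_vuln_key(vuln_key: str) -> tuple[str, str]:
+    """Split '<vuln-id>:<purl>' into (vuln_id, purl).
+
+    Assumption: the purl always starts with 'pkg:', so the key is cut
+    at the first ':pkg:' separator.
+    """
+    vuln_id, sep, rest = vuln_key.partition(":pkg:")
+    if not sep or not vuln_id or not rest:
+        raise ValueError(f"unexpected vuln_key format: {vuln_key!r}")
+    return vuln_id, "pkg:" + rest
+```
+
+For the key above this yields `CVE-2024-1234` and `pkg:pypi/sqlalchemy@1.4.0`.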
+
+## Contributing a Sample
+
+### Step 1: Choose a Scenario
+
+Select a scenario that is not yet well covered in the corpus:
+
+| Scenario | Description | Example |
+|----------|-------------|---------|
+| `webapi` | Web application endpoint | Flask, FastAPI, Express |
+| `cli` | Command-line tool | argparse, click, commander |
+| `job` | Background/scheduled job | Celery, cron script |
+| `lib` | Library code | Reusable package |
+
+### Step 2: Create Sample Directory
+
+```bash
+cd datasets/reachability/samples
+mkdir gt-NNNN
+cd gt-NNNN
+```
+
+Replace `NNNN` with the next available sample ID (check `corpus.json` for the highest ID in use).
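+
+Picking the next ID can be sketched as follows. This assumes IDs follow the `gt-` plus four digits pattern seen in the examples; `next_sample_id` is a hypothetical helper, and collecting the existing IDs from `corpus.json` is left to the caller because the index layout is not shown here.
+
+```python
+import re
+
+# Assumption: sample IDs look like "gt-0001" (prefix plus four digits).
+ID_PATTERN = re.compile(r"^gt-(\d{4})$")
+
+
+def next_sample_id(existing_ids: list[str]) -> str:
+    """Return the ID one above the highest well-formed existing ID."""
+    numbers = [int(m.group(1)) for m in map(ID_PATTERN.match, existing_ids) if m]
+    return f"gt-{max(numbers, default=0) + 1:04d}"
+```
+
+IDs that do not match the pattern are ignored rather than rejected, so a malformed entry does not block the lookup.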
+
+### Step 3: Create Minimal Reproducible Case
+
+**Requirements:**
+- Smallest possible code to demonstrate the vulnerability
+- Real or realistic vulnerability (use CVE when possible)
+- Clear entrypoint definition
+- Deterministic behavior (no network, no randomness)
+
+**Example Python Sample:**
+
+```python
+# app.py - gt-0001
+from flask import Flask, request
+import sqlite3
+
+app = Flask(__name__)
+
+@app.route("/user")
+def get_user():
+    user_id = request.args.get("id")  # Taint source
+    conn = sqlite3.connect(":memory:")
+    # SQL injection: user_id flows to query without sanitization
+    result = conn.execute(f"SELECT * FROM users WHERE id = {user_id}")  # Taint sink
+    return str(result.fetchall())
+
+if __name__ == "__main__":
+    app.run()
+```
+
+### Step 4: Define Expected Findings
+
+Create `expected.json` with all expected findings:
+
+```json
+{
+  "findings": [
+    {
+      "vuln_key": "CWE-89:pkg:pypi/flask@2.0.0",
+      "tier": "tainted_sink",
+      "rule_key": "py.sql.injection",
+      "sink_class": "sql",
+      "location_hint": "app.py:12",
+      "notes": "User input from request.args flows to sqlite3.execute"
+    }
+  ]
+}
+```
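+
+Off-by-one errors in `location_hint` are easy to make. A minimal sketch for looking up the line a hint points at, assuming hints use the `file:line` form with 1-based line numbers (`line_for_hint` is a hypothetical helper, not part of the tooling):
+
+```python
+def line_for_hint(location_hint: str, sources: dict[str, str]) -> str:
+    """Return the stripped source line that a 'file:line' hint points at.
+
+    Assumption: line numbers are 1-based; sources maps file name to file text.
+    """
+    file_name, _, line_no = location_hint.rpartition(":")
+    lines = sources[file_name].splitlines()
+    index = int(line_no)
+    if not 1 <= index <= len(lines):
+        raise ValueError(f"{location_hint!r} is outside {file_name} ({len(lines)} lines)")
+    return lines[index - 1].strip()
+```
+
+Printing the resolved line next to the finding makes it obvious whether the hint lands on the sink or on a neighboring line.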
+
+### Step 5: Create SBOM
+
+Generate an SBOM for the sample, or write one by hand. List only third-party packages: `sqlite3` ships with the Python standard library, so it is not a PyPI component and gets no `pkg:pypi` entry.
+
+```json
+{
+  "bomFormat": "CycloneDX",
+  "specVersion": "1.6",
+  "version": 1,
+  "components": [
+    {
+      "type": "library",
+      "name": "flask",
+      "version": "2.0.0",
+      "purl": "pkg:pypi/flask@2.0.0"
+    }
+  ]
+}
+```
+
+### Step 6: Update Corpus Index
+
+Add an entry for the sample to `corpus.json`:
+
+```json
+{
+  "id": "gt-0001",
+  "path": "samples/gt-0001",
+  "language": "python",
+  "tier": "tainted_sink",
+  "scenario": "webapi",
+  "expected_count": 1
+}
+```
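+
+The index entry and `expected.json` can drift apart. A minimal sketch of a consistency check, assuming `expected_count` is meant to equal the number of entries under `findings` (the guide implies this but does not state it; `count_mismatch` is a hypothetical helper):
+
+```python
+import json
+from typing import Optional
+
+
+def count_mismatch(index_entry: dict, expected_json: str) -> Optional[str]:
+    """Return an error message if expected_count disagrees with expected.json."""
+    findings = json.loads(expected_json).get("findings", [])
+    declared = index_entry["expected_count"]
+    if declared != len(findings):
+        return (
+            f"{index_entry['id']}: expected_count={declared}, "
+            f"but expected.json lists {len(findings)} finding(s)"
+        )
+    return None
+```
+
+Returning a message instead of raising lets a caller collect problems across many samples in one pass.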
+
+### Step 7: Validate Locally
+
+```bash
+# Run corpus validation
+dotnet test tests/reachability/StellaOps.Reachability.FixtureTests \
+  --filter "FullyQualifiedName~CorpusFixtureTests"
+
+# Run benchmark
+stellaops bench corpus run --sample gt-0001 --verbose
+```
+
+## Tier Guidelines
+
+### Imported Tier Samples
+
+For `imported` tier samples:
+- Vulnerability in a dependency
+- No execution path to vulnerable code
+- Package is in lockfile but not called
+
+**Example:** Unused dependency with known CVE.
+
+### Executed Tier Samples
+
+For `executed` tier samples:
+- Vulnerable code is called from entrypoint
+- No user-controlled data reaches the vulnerability
+- Static or coverage analysis proves execution
+
+**Example:** Hardcoded SQL query (no injection).
+
+### Tainted→Sink Tier Samples
+
+For `tainted_sink` tier samples:
+- User-controlled input reaches vulnerable code
+- Clear source → sink data flow
+- Include sink class taxonomy
+
+**Example:** User input to SQL query, command execution, etc.
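+
+The three tiers above reduce to two questions: is the vulnerable code called from an entrypoint, and does user-controlled input reach it? A minimal sketch of that decision, with hypothetical names (`Tier`, `classify_tier`) and no claim to mirror the scanner's actual logic:
+
+```python
+from enum import Enum
+
+
+class Tier(str, Enum):
+    IMPORTED = "imported"
+    EXECUTED = "executed"
+    TAINTED_SINK = "tainted_sink"
+
+
+def classify_tier(called_from_entrypoint: bool, tainted_input_reaches_sink: bool) -> Tier:
+    """Map the two guideline questions onto a tier."""
+    if not called_from_entrypoint:
+        # Dependency is present but no execution path reaches it.
+        return Tier.IMPORTED
+    if tainted_input_reaches_sink:
+        # Executed and fed by user-controlled data.
+        return Tier.TAINTED_SINK
+    return Tier.EXECUTED
+```
+
+Use the outcome as a sanity check on the `tier` you write into `sample.json` and `expected.json`.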
+
+## Sink Classes
+
+When contributing `tainted_sink` samples, specify the sink class:
+
+| Sink Class | Description | Examples |
+|------------|-------------|----------|
+| `sql` | SQL injection | sqlite3.execute, cursor.execute |
+| `command` | Command injection | os.system, subprocess.run |
+| `ssrf` | Server-side request forgery | requests.get, urllib.request.urlopen |
+| `path` | Path traversal | open(), os.path.join |
+| `deser` | Deserialization | pickle.loads, yaml.load |
+| `eval` | Code evaluation | eval(), exec() |
+| `xxe` | XML external entity | lxml.etree.parse, ET.parse |
+| `xss` | Cross-site scripting | innerHTML, document.write |
+
+## Quality Criteria
+
+Samples must meet these criteria:
+
+- [ ] **Deterministic**: Same input → same output
+- [ ] **Minimal**: Smallest code that demonstrates the vulnerability
+- [ ] **Documented**: Clear description and notes
+- [ ] **Validated**: Passes local tests
+- [ ] **Realistic**: Based on real vulnerability patterns
+- [ ] **Self-contained**: No external network calls
+
+## Negative Samples
+
+Include "negative" samples where the scanner should NOT report vulnerabilities:
+
+```json
+{
+  "id": "gt-0050",
+  "name": "Python SQL - Properly Sanitized",
+  "tier": "imported",
+  "expected_count": 0,
+  "notes": "Uses parameterized queries, no injection possible"
+}
+```
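+
+The source for such a negative sample could look like the sketch below. It is an illustration rather than an existing corpus entry: it uses plain `sqlite3` instead of Flask to stay self-contained, and the `users` table and `get_user` function are made up for the example.
+
+```python
+import sqlite3
+
+
+def get_user(conn: sqlite3.Connection, user_id: str) -> list[tuple]:
+    # The ? placeholder binds user_id as data, so it is never parsed as SQL.
+    query = "SELECT id, name FROM users WHERE id = ?"
+    return conn.execute(query, (user_id,)).fetchall()
+
+
+# In-memory database keeps the sample deterministic and self-contained.
+conn = sqlite3.connect(":memory:")
+conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
+conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
+```
+
+Calling `get_user(conn, "1 OR 1=1")` returns no rows, because the whole string is compared against `id` as a single value instead of being interpreted as SQL.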
+
+## Review Process
+
+1. Create PR with new sample(s)
+2. CI runs validation tests
+3. Security team reviews expected findings
+4. QA team verifies determinism
+5. Merge and update baseline
+
+## Updating Baselines
+
+After adding samples, update baseline metrics:
+
+```bash
+# Generate new baseline
+stellaops bench corpus run --all --output baselines/v1.1.0.json
+
+# Compare to previous
+stellaops bench corpus compare baselines/v1.0.0.json baselines/v1.1.0.json
+```
+
+## FAQ
+
+### How many samples should I contribute?
+
+Start with 2-3 high-quality samples covering different aspects of the same vulnerability class.
+
+### Can I use synthetic vulnerabilities?
+
+Yes, but prefer real CVE patterns when possible. Synthetic samples should document the vulnerability pattern clearly.
+
+### What if my sample has multiple findings?
+
+Include all expected findings in `expected.json`. Multi-finding samples are valuable for testing.
+
+### How do I test tier classification?
+
+Run with verbose output:
+```bash
+stellaops bench corpus run --sample gt-NNNN --verbose --show-evidence
+```
+
+## Related Documentation
+
+- [Tiered Precision Curves](../benchmarks/tiered-precision-curves.md)
+- [Reachability Analysis](../product-advisories/14-Dec-2025%20-%20Reachability%20Analysis%20Technical%20Reference.md)
+- [Corpus Index Schema](../../datasets/reachability/schemas/corpus-sample.v1.json)